Probability

Graduate Texts in Mathematics 95

Editorial Board
P. R. Halmos (Managing Editor)
F. W. Gehring
C. C. Moore

"Order out of chaos" (Courtesy of Professor A. T. Fomenko of the Moscow State University)

A. N. Shiryayev

Probability

Translated by R. P. Boas

With 54 Illustrations

Springer Science+Business Media, LLC

A. N. Shiryayev
Steklov Mathematical Institute
Vavilova 42, GSP-1
117333 Moscow
U.S.S.R.

Editorial Board

P. R. Halmos (Managing Editor)
Department of Mathematics
Indiana University
Bloomington, IN 47405
U.S.A.

R. P. Boas (Translator)
Department of Mathematics
Northwestern University
Evanston, IL 60201
U.S.A.

F. W. Gehring
Department of Mathematics
University of Michigan
Ann Arbor, MI 48109
U.S.A.

C. C. Moore
Department of Mathematics
University of California at Berkeley
Berkeley, CA 94720
U.S.A.

AMS Classification: 60-01

Library of Congress Cataloging in Publication Data
Shiriaev, Al'bert Nikolaevich.
Probability.
(Graduate texts in mathematics; 95)
Translation of: Veroyatnost'.
Bibliography: p.
Includes index.
1. Probabilities. I. Title. II. Series.
QA273.S54413 1984  519  83-14813

Original Russian edition: Veroyatnost'. Moscow: Nauka, 1979. This book is part of the Springer Series in Soviet Mathematics.

© 1984 by Springer Science+Business Media New York Originally published by Springer-Verlag New York, Inc. in 1984 Softcover reprint of the hardcover 1st edition 1984 All rights reserved. No part of this book may be translated or reproduced in any form without written permission from Springer Science+Business Media, LLC.

Typeset by Composition House Ltd., Salisbury, England.

9 8 7 6 5 4 3 2 1

ISBN 978-1-4899-0020-3
ISBN 978-1-4899-0018-0 (eBook)
DOI 10.1007/978-1-4899-0018-0

Preface

This textbook is based on a three-semester course of lectures given by the author in recent years in the Mechanics-Mathematics Faculty of Moscow State University and issued, in part, in mimeographed form under the title Probability, Statistics, Stochastic Processes, I, II by the Moscow State University Press.

We follow tradition by devoting the first part of the course (roughly one semester) to the elementary theory of probability (Chapter I). This begins with the construction of probabilistic models with finitely many outcomes and introduces such fundamental probabilistic concepts as sample spaces, events, probability, independence, random variables, expectation, correlation, conditional probabilities, and so on. Many probabilistic and statistical regularities are effectively illustrated even by the simplest random walk generated by Bernoulli trials. In this connection we study both classical results (law of large numbers, local and integral De Moivre and Laplace theorems) and more modern results (for example, the arc sine law). The first chapter concludes with a discussion of dependent random variables generated by martingales and by Markov chains.

Chapters II-IV form an expanded version of the second part of the course (second semester). Here we present (Chapter II) Kolmogorov's generally accepted axiomatization of probability theory and the mathematical methods that constitute the tools of modern probability theory (σ-algebras, measures and their representations, the Lebesgue integral, random variables and random elements, characteristic functions, conditional expectation with respect to a σ-algebra, Gaussian systems, and so on). Note that two measure-theoretical results, Carathéodory's theorem on the extension of measures and the Radon-Nikodym theorem, are quoted without proof.


The third chapter is devoted to problems about weak convergence of probability distributions and the method of characteristic functions for proving limit theorems. We introduce the concepts of relative compactness and tightness of families of probability distributions, and prove (for the real line) Prohorov's theorem on the equivalence of these concepts.

The same part of the course discusses properties "with probability 1" for sequences and sums of independent random variables (Chapter IV). We give proofs of the "zero or one laws" of Kolmogorov and of Hewitt and Savage, tests for the convergence of series, and conditions for the strong law of large numbers. The law of the iterated logarithm is stated for arbitrary sequences of independent identically distributed random variables with finite second moments, and proved under the assumption that the variables have Gaussian distributions.

Finally, the third part of the book (Chapters V-VIII) is devoted to random processes with discrete parameters (random sequences). Chapters V and VI are devoted to the theory of stationary random sequences, where "stationary" is interpreted either in the strict or the wide sense. The theory of random sequences that are stationary in the strict sense is based on the ideas of ergodic theory: measure-preserving transformations, ergodicity, mixing, etc. We reproduce a simple proof (by A. Garsia) of the maximal ergodic theorem; this also lets us give a simple proof of the Birkhoff-Khinchin ergodic theorem.

The discussion of sequences of random variables that are stationary in the wide sense begins with a proof of the spectral representation of the covariance function. Then we introduce orthogonal stochastic measures, and integrals with respect to these, and establish the spectral representation of the sequences themselves. We also discuss a number of statistical problems: estimating the covariance function and the spectral density, extrapolation, interpolation and filtering.
The chapter includes material on the Kalman-Bucy filter and its generalizations.

The seventh chapter discusses the basic results of the theory of martingales and related ideas. This material has only rarely been included in traditional courses in probability theory. In the last chapter, which is devoted to Markov chains, the greatest attention is given to problems on the asymptotic behavior of Markov chains with countably many states.

Each section ends with problems of various kinds: some of them ask for proofs of statements made but not proved in the text, some consist of propositions that will be used later, some are intended to give additional information about the circle of ideas that is under discussion, and finally, some are simple exercises.

In designing the course and preparing this text, the author has used a variety of sources on probability theory. The Historical and Bibliographical Notes indicate both the historical sources of the results and supplementary references for the material under consideration.

The numbering system and form of references is the following. Each section has its own enumeration of theorems, lemmas and formulas (with


no indication of chapter or section). For a reference to a result from a different section of the same chapter, we use double numbering, with the first number indicating the number of the section (thus (2.10) means formula (10) of §2). For references to a different chapter we use triple numbering (thus formula (II.4.3) means formula (3) of §4 of Chapter II). Works listed in the References at the end of the book have the form [L n], where L is a letter and n is a numeral.

The author takes this opportunity to thank his teacher A. N. Kolmogorov, and B. V. Gnedenko and Yu. V. Prohorov, from whom he learned probability theory and under whose direction he had the opportunity of using it. For discussions and advice, the author also thanks his colleagues in the Departments of Probability Theory and Mathematical Statistics at the Moscow State University, and his colleagues in the Section on probability theory of the Steklov Mathematical Institute of the Academy of Sciences of the U.S.S.R.

Moscow
Steklov Mathematical Institute

A. N. Shiryayev

Translator's acknowledgement. I am grateful both to the author and to my colleague C. T. Ionescu Tulcea for advice about terminology. R. P. B.

Contents

Introduction 1

CHAPTER I
Elementary Probability Theory 5

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 5
§2. Some Classical Models and Distributions 17
§3. Conditional Probability. Independence 23
§4. Random Variables and Their Properties 32
§5. The Bernoulli Scheme. I. The Law of Large Numbers 45
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 55
§7. Estimating the Probability of Success in the Bernoulli Scheme 68
§8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions 74
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 81
§10. Random Walk. II. Reflection Principle. Arcsine Law 92
§11. Martingales. Some Applications to the Random Walk 101
§12. Markov Chains. Ergodic Theorem. Strong Markov Property 108

CHAPTER II

Mathematical Foundations of Probability Theory 129

§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms 129
§2. Algebras and σ-algebras. Measurable Spaces 137
§3. Methods of Introducing Probability Measures on Measurable Spaces 149
§4. Random Variables. I. 164
§5. Random Elements 174
§6. Lebesgue Integral. Expectation 178
§7. Conditional Probabilities and Conditional Expectations with Respect to a σ-Algebra 210
§8. Random Variables. II. 232
§9. Construction of a Process with Given Finite-Dimensional Distribution 243
§10. Various Kinds of Convergence of Sequences of Random Variables 250
§11. The Hilbert Space of Random Variables with Finite Second Moment 260
§12. Characteristic Functions 272
§13. Gaussian Systems 295

CHAPTER III

Convergence of Probability Measures. Central Limit Theorem 306

§1. Weak Convergence of Probability Measures and Distributions 306
§2. Relative Compactness and Tightness of Families of Probability Distributions 314
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 318
§4. Central Limit Theorem for Sums of Independent Random Variables 326
§5. Infinitely Divisible and Stable Distributions 335
§6. Rapidity of Convergence in the Central Limit Theorem 342
§7. Rapidity of Convergence in Poisson's Theorem 345

CHAPTER IV

Sequences and Sums of Independent Random Variables 354

§1. Zero-or-One Laws 354
§2. Convergence of Series 359
§3. Strong Law of Large Numbers 363
§4. Law of the Iterated Logarithm 370

CHAPTER V

Stationary (Strict Sense) Random Sequences and Ergodic Theory 376

§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations 376
§2. Ergodicity and Mixing 379
§3. Ergodic Theorems 381

CHAPTER VI

Stationary (Wide Sense) Random Sequences. L² Theory 387

§1. Spectral Representation of the Covariance Function 387
§2. Orthogonal Stochastic Measures and Stochastic Integrals 395
§3. Spectral Representation of Stationary (Wide Sense) Sequences 401
§4. Statistical Estimation of the Covariance Function and the Spectral Density 412
§5. Wold's Expansion 418
§6. Extrapolation, Interpolation and Filtering 425
§7. The Kalman-Bucy Filter and Its Generalizations 436

CHAPTER VII

Sequences of Random Variables that Form Martingales 446

§1. Definitions of Martingales and Related Concepts 446
§2. Preservation of the Martingale Property Under Time Change at a Random Time 456
§3. Fundamental Inequalities 464
§4. General Theorems on the Convergence of Submartingales and Martingales 476
§5. Sets of Convergence of Submartingales and Martingales 483
§6. Absolute Continuity and Singularity of Probability Distributions 492
§7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary 504
§8. Central Limit Theorem for Sums of Dependent Random Variables 509

CHAPTER VIII

Sequences of Random Variables that Form Markov Chains 523

§1. Definitions and Basic Properties 523
§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties of the Transition Probabilities p_ij^(n) 528
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties of the Probabilities p_ii^(n) 532
§4. On the Existence of Limits and of Stationary Distributions 541
§5. Examples 546

Historical and Bibliographical Notes 555

References 561

Index of Symbols 565

Index 569

Introduction

The subject matter of probability theory is the mathematical analysis of random events, i.e. of those empirical phenomena which, under certain circumstances, can be described by saying that:

they do not have deterministic regularity (observations of them do not yield the same outcome), whereas at the same time

they possess some statistical regularity (indicated by the statistical stability of their frequency).

We illustrate with the classical example of a "fair" toss of an "unbiased" coin. It is clearly impossible to predict with certainty the outcome of each toss. The results of successive experiments are very irregular (now "head," now "tail") and we seem to have no possibility of discovering any regularity in such experiments. However, if we carry out a large number of "independent" experiments with an "unbiased" coin we can observe a very definite statistical regularity, namely that "head" appears with a frequency that is "close" to ½.

Statistical stability of a frequency is very likely to suggest a hypothesis about a possible quantitative estimate of the "randomness" of some event A connected with the results of the experiments. With this starting point, probability theory postulates that corresponding to an event A there is a definite number P(A), called the probability of the event, whose intrinsic property is that as the number of "independent" trials (experiments) increases, the frequency of event A is approximated by P(A).

Applied to our example, this means that it is natural to assign the probability ½ to the event A that consists of obtaining "head" in a toss of an "unbiased" coin.
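The statistical stability described above is easy to observe numerically. The following sketch (the helper name heads_frequency is ours, not the book's; standard library only) simulates repeated tosses of a fair coin and prints the observed frequency of heads, which settles near ½ as the number of tosses grows.

```python
import random

def heads_frequency(n_tosses: int, seed: int = 1) -> float:
    """Toss a fair coin n_tosses times; return the observed frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# As the number of tosses grows, the frequency stabilizes near 1/2.
for n in (100, 10_000, 1_000_000):
    print(n, heads_frequency(n))
```

The irregularity of individual outcomes and the stability of the aggregate frequency are both visible here: the value for 100 tosses wanders noticeably, while the value for a million tosses differs from ½ only in the third decimal place.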


There is no difficulty in multiplying examples in which it is very easy to obtain numerical values intuitively for the probabilities of one or another event. However, these examples are all of a similar nature and involve (so far) undefined concepts such as "fair" toss, "unbiased" coin, "independence," etc. Having been invented to investigate the quantitative aspects of "randomness," probability theory, like every exact science, became such a science only at the point when the concept of a probabilistic model had been clearly formulated and axiomatized. In this connection it is natural for us to discuss, although only briefly, the fundamental steps in the development of probability theory.

Probability theory, as a science, originated in the middle of the seventeenth century with Pascal (1623-1662), Fermat (1601-1665) and Huygens (1629-1695). Although special calculations of probabilities in games of chance had been made earlier, in the fifteenth and sixteenth centuries, by Italian mathematicians (Cardano, Pacioli, Tartaglia, etc.), the first general methods for solving such problems were apparently given in the famous correspondence between Pascal and Fermat, begun in 1654, and in the first book on probability theory, De Ratiociniis in Ludo Aleae (On Calculations in Games of Chance), published by Huygens in 1657. It was at this time that the fundamental concept of "mathematical expectation" was developed and theorems on the addition and multiplication of probabilities were established.
The real history of probability theory begins with the work of James Bernoulli (1654-1705), Ars Conjectandi (The Art of Guessing), published in 1713, in which he proved (quite rigorously) the first limit theorem of probability theory, the law of large numbers; and of De Moivre (1667-1754), Miscellanea Analytica Supplementum (a rough translation might be The Analytic Method or Analytic Miscellany, 1730), in which the central limit theorem was stated and proved for the first time (for symmetric Bernoulli trials).

Bernoulli was probably the first to realize the importance of considering infinite sequences of random trials and to make a clear distinction between the probability of an event and the frequency of its realization. De Moivre deserves the credit for defining such concepts as independence, mathematical expectation, and conditional probability.

In 1812 there appeared Laplace's (1749-1827) great treatise Théorie Analytique des Probabilités (Analytic Theory of Probability), in which he presented his own results in probability theory as well as those of his predecessors. In particular, he generalized De Moivre's theorem to the general (unsymmetric) case of Bernoulli trials, and at the same time presented De Moivre's results in a more complete form. Laplace's most important contribution was the application of probabilistic methods to errors of observation. He formulated the idea of considering errors of observation as the cumulative results of adding a large number of independent elementary errors. From this it followed that under rather


general conditions the distribution of errors of observation must be at least approximately normal.

The work of Poisson (1781-1840) and Gauss (1777-1855) belongs to the same epoch in the development of probability theory, when the center of the stage was held by limit theorems. In contemporary probability theory we think of Poisson in connection with the distribution and the process that bear his name. Gauss is credited with originating the theory of errors and, in particular, with creating the fundamental method of least squares.

The next important period in the development of probability theory is connected with the names of P. L. Chebyshev (1821-1894), A. A. Markov (1856-1922), and A. M. Lyapunov (1857-1918), who developed effective methods for proving limit theorems for sums of independent but arbitrarily distributed random variables. The number of Chebyshev's publications in probability theory is not large, four in all, but it would be hard to overestimate their role in probability theory and in the development of the classical Russian school of that subject.

"On the methodological side, the revolution brought about by Chebyshev was not only his insistence for the first time on complete rigor in the proofs of limit theorems, ... but also, and principally, that Chebyshev always tried to obtain precise estimates for the deviations from the limiting regularities that are available for large but finite numbers of trials, in the form of inequalities that are valid unconditionally for any number of trials." (A. N. Kolmogorov [30])

Before Chebyshev the main interest in probability theory had been in the calculation of the probabilities of random events. He, however, was the first to realize clearly and exploit the full strength of the concepts of random variables and their mathematical expectations.

The leading exponent of Chebyshev's ideas was his devoted student Markov, to whom there belongs the indisputable credit of presenting his teacher's results with complete clarity. Among Markov's own significant contributions to probability theory were his pioneering investigations of limit theorems for sums of independent random variables and the creation of a new branch of probability theory, the theory of dependent random variables that form what we now call a Markov chain.

"... Markov's classical course in the calculus of probability and his original papers, which are models of precision and clarity, contributed to the greatest extent to the transformation of probability theory into one of the most significant branches of mathematics and to a wide extension of the ideas and methods of Chebyshev." (S. N. Bernstein [3])

To prove the central limit theorem of probability theory (the theorem on convergence to the normal distribution), Chebyshev and Markov used


what is known as the method of moments. With more general hypotheses and a simpler method, the method of characteristic functions, the theorem was obtained by Lyapunov. The subsequent development of the theory has shown that the method of characteristic functions is a powerful analytic tool for establishing the most diverse limit theorems.

The modern period in the development of probability theory begins with its axiomatization. The first work in this direction was done by S. N. Bernstein (1880-1968), R. von Mises (1883-1953), and E. Borel (1871-1956). A. N. Kolmogorov's book Foundations of the Theory of Probability appeared in 1933. Here he presented the axiomatic theory that has become generally accepted and is not only applicable to all the classical branches of probability theory, but also provides a firm foundation for the development of new branches that have arisen from questions in the sciences and involve infinite-dimensional distributions. The treatment in the present book is based on Kolmogorov's axiomatic approach. However, to prevent formalities and logical subtleties from obscuring the intuitive ideas, our exposition begins with the elementary theory of probability, whose elementariness is merely that in the corresponding probabilistic models we consider only experiments with finitely many outcomes. Thereafter we present the foundations of probability theory in their most general form.

The 1920s and '30s saw a rapid development of one of the new branches of probability theory, the theory of stochastic processes, which studies families of random variables that evolve with time. We have seen the creation of theories of Markov processes, stationary processes, martingales, and limit theorems for stochastic processes. Information theory is a recent addition.

The present book is principally concerned with stochastic processes with discrete parameters: random sequences.
However, the material presented in the second chapter provides a solid foundation (particularly of a logical nature) for the study of the general theory of stochastic processes.

It was also in the 1920s and '30s that mathematical statistics became a separate mathematical discipline. In a certain sense mathematical statistics deals with inverses of the problems of probability: if the basic aim of probability theory is to calculate the probabilities of complicated events under a given probabilistic model, mathematical statistics sets itself the inverse problem: to clarify the structure of probabilistic-statistical models by means of observations of various complicated events.

Some of the problems and methods of mathematical statistics are also discussed in this book. However, all that is presented in detail here is probability theory and the theory of stochastic processes with discrete parameters.

CHAPTER I

Elementary Probability Theory

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes

1. Let us consider an experiment of which all possible results are included in a finite number of outcomes ω_1, ..., ω_N. We do not need to know the nature of these outcomes, only that there are a finite number N of them.

We call ω_1, ..., ω_N elementary events, or sample points, and the finite set

Ω = {ω_1, ..., ω_N}

the space of elementary events or the sample space. The choice of the space of elementary events is the first step in formulating a probabilistic model for an experiment. Let us consider some examples of sample spaces.

Example 1. For a single toss of a coin the sample space Ω consists of two points:

Ω = {H, T},

where H = "head" and T = "tail". (We exclude possibilities like "the coin stands on edge," "the coin disappears," etc.)

Example 2. For n tosses of a coin the sample space is

Ω = {ω: ω = (a_1, ..., a_n), a_i = H or T}

and the general number N(Ω) of outcomes is 2^n.


Example 3. First toss a coin. If it falls "head" then toss a die (with six faces numbered 1, 2, 3, 4, 5, 6); if it falls "tail", toss the coin again. The sample space for this experiment is

Ω = {H1, H2, H3, H4, H5, H6, TH, TT}.

We now consider some more complicated examples involving the selection of n balls from an urn containing M distinguishable balls.

2. Example 4 (Sampling with replacement). This is an experiment in which after each step the selected ball is returned again. In this case each sample of n balls can be presented in the form (a_1, ..., a_n), where a_i is the label of the ball selected at the i-th step. It is clear that in sampling with replacement each a_i can have any of the M values 1, 2, ..., M. The description of the sample space depends in an essential way on whether we consider samples like, for example, (4, 1, 2, 1) and (1, 4, 2, 1) as different or the same. It is customary to distinguish two cases: ordered samples and unordered samples. In the first case samples containing the same elements, but arranged differently, are considered to be different. In the second case the order of the elements is disregarded and the two samples are considered to be the same. To emphasize which kind of sample we are considering, we use the notation (a_1, ..., a_n) for ordered samples and [a_1, ..., a_n] for unordered samples. Thus for ordered samples the sample space has the form

Ω = {ω: ω = (a_1, ..., a_n), a_i = 1, ..., M}

and the number of (different) outcomes is

N(Ω) = M^n.   (1)

If, however, we consider unordered samples, then

Ω = {ω: ω = [a_1, ..., a_n], a_i = 1, ..., M}.

Clearly the number N(Ω) of (different) unordered samples is smaller than the number of ordered samples. Let us show that in the present case

N(Ω) = C^n_{M+n-1},   (2)

where C^k_l = l!/[k!(l - k)!] is the number of combinations of l elements, taken k at a time. We prove this by induction. Let N(M, n) be the number of outcomes of interest. It is clear that when k ≤ M we have

N(k, 1) = k = C^1_k.


Now suppose that N(k, n) = C^n_{k+n-1} for k ≤ M; we show that this formula continues to hold when n is replaced by n + 1. For the unordered samples [a_1, ..., a_{n+1}] that we are considering, we may suppose that the elements are arranged in nondecreasing order: a_1 ≤ a_2 ≤ ... ≤ a_{n+1}. It is clear that the number of unordered samples with a_1 = 1 is N(M, n), the number with a_1 = 2 is N(M - 1, n), etc. Consequently

N(M, n + 1) = N(M, n) + N(M - 1, n) + ... + N(1, n)
            = C^n_{M+n-1} + C^n_{M-1+n-1} + ... + C^n_n
            = (C^{n+1}_{M+n} - C^{n+1}_{M+n-1}) + (C^{n+1}_{M-1+n} - C^{n+1}_{M-1+n-1}) + ... + (C^{n+1}_{n+1} - C^{n+1}_n)
            = C^{n+1}_{M+n};

here we have used the easily verified property

C^{k-1}_l + C^k_l = C^k_{l+1}

of the binomial coefficients.

Example 5 (Sampling without replacement). Suppose that n ≤ M and that the selected balls are not returned. In this case we again consider two possibilities, namely ordered and unordered samples.

For ordered samples without replacement the sample space is

Ω = {ω: ω = (a_1, ..., a_n), a_k ≠ a_l for k ≠ l; a_i = 1, ..., M},

and the number of elements of this set (called permutations) is M(M - 1)...(M - n + 1). We denote this by (M)_n or A^n_M and call it "the number of permutations of M things, n at a time."

For unordered samples (called combinations) the sample space

Ω = {ω: ω = [a_1, ..., a_n], a_k ≠ a_l for k ≠ l; a_i = 1, ..., M}

consists of

N(Ω) = C^n_M   (3)

elements. In fact, from each unordered sample [a_1, ..., a_n] consisting of distinct elements we can obtain n! ordered samples. Consequently N(Ω) · n! = (M)_n and therefore

N(Ω) = (M)_n / n! = C^n_M.

The results on the numbers of samples of n from an urn with M balls are presented in Table 1.
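For small M and n the four counts in Table 1 can be checked by direct enumeration. A minimal sketch in Python (the variable names are ours; the four combinatoric generators of itertools produce exactly the four kinds of samples):

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, perm

M, n = 3, 2  # the case displayed in Table 2

balls = range(1, M + 1)
counts = {
    "ordered, with replacement":      len(list(product(balls, repeat=n))),
    "unordered, with replacement":    len(list(combinations_with_replacement(balls, n))),
    "ordered, without replacement":   len(list(permutations(balls, n))),
    "unordered, without replacement": len(list(combinations(balls, n))),
}

# The closed-form counts from Table 1:
assert counts["ordered, with replacement"] == M ** n                # M^n
assert counts["unordered, with replacement"] == comb(M + n - 1, n)  # C^n_{M+n-1}
assert counts["ordered, without replacement"] == perm(M, n)         # (M)_n
assert counts["unordered, without replacement"] == comb(M, n)       # C^n_M
```

For M = 3, n = 2 the four counts are 9, 6, 6 and 3, matching the sample spaces listed in Table 2.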


Table 1

                       Ordered    Unordered
With replacement       M^n        C^n_{M+n-1}
Without replacement    (M)_n      C^n_M

For the case M = 3 and n = 2, the corresponding sample spaces are displayed in Table 2.

Table 2

With replacement:
  Ordered:    (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3)
  Unordered:  [1,1] [2,2] [3,3] [1,2] [1,3] [2,3]
Without replacement:
  Ordered:    (1,2) (1,3) (2,1) (2,3) (3,1) (3,2)
  Unordered:  [1,2] [1,3] [2,3]

Example 6 (Distribution of objects in cells). We consider the structure of the sample space in the problem of placing n objects (balls, etc.) in M cells (boxes, etc.). For example, such problems arise in statistical physics in studying the distribution of n particles (which might be protons, electrons, ...) among M states (which might be energy levels).

Let the cells be numbered 1, 2, ..., M, and suppose first that the objects are distinguishable (numbered 1, 2, ..., n). Then a distribution of the n objects among the M cells is completely described by an ordered set (a_1, ..., a_n), where a_i is the index of the cell containing object i. However, if the objects are indistinguishable their distribution among the M cells is completely determined by the unordered set [a_1, ..., a_n], where a_i is the index of the cell into which an object is put at the i-th step. Comparing this situation with Examples 4 and 5, we have the following correspondences:

(ordered samples) ↔ (distinguishable objects),
(unordered samples) ↔ (indistinguishable objects),


by which we mean that to an instance of an ordered (unordered) sample of n balls from an urn containing M balls there corresponds (one and only one) instance of distributing n distinguishable (indistinguishable) objects among M cells. In a similar sense we have the following correspondences:

(sampling with replacement) ↔ (a cell may receive any number of objects),
(sampling without replacement) ↔ (a cell may receive at most one object).

These correspondences generate others of the same kind:

(an unordered sample in sampling without replacement) ↔ (indistinguishable objects in the problem of distribution among cells when each cell may receive at most one object),

etc.; so that we can use Examples 4 and 5 to describe the sample space for the problem of distributing distinguishable or indistinguishable objects among cells either with exclusion (a cell may receive at most one object) or without exclusion (a cell may receive any number of objects).

Table 3 displays the distributions of two objects among three cells. For distinguishable objects, we denote them by W (white) and B (black). For indistinguishable objects, the presence of an object in a cell is indicated by a +.

Table 3

Distinguishable objects, without exclusion (nine placements):
[WB|  |  ]  [  |WB|  ]  [  |  |WB]
[W |B |  ]  [B |W |  ]  [W |  |B ]
[B |  |W ]  [  |W |B ]  [  |B |W ]

Distinguishable objects, with exclusion (six placements):
[W |B |  ]  [B |W |  ]  [W |  |B ]
[B |  |W ]  [  |W |B ]  [  |B |W ]

Indistinguishable objects, without exclusion (six placements):
[++|  |  ]  [  |++|  ]  [  |  |++]
[+ |+ |  ]  [+ |  |+ ]  [  |+ |+ ]

Indistinguishable objects, with exclusion (three placements):
[+ |+ |  ]  [+ |  |+ ]  [  |+ |+ ]


Table 4. N(Ω) in the problem of placing n objects in M cells and, equivalently, N(Ω) in the problem of choosing n balls from an urn containing M balls.

                         Distinguishable objects    Indistinguishable objects
                         (ordered samples)          (unordered samples)

Without exclusion        M^n                        C^n_{M+n-1}
(with replacement)       (Maxwell-Boltzmann         (Bose-Einstein
                          statistics)                statistics)

With exclusion           (M)_n                      C^n_M
(without replacement)                               (Fermi-Dirac
                                                     statistics)

The duality that we have observed between the two problems gives us an obvious way of finding the number of outcomes in the problem of placing n objects in cells. The results, which include the results in Table 1, are given in Table 4.

In statistical physics one says that distinguishable (or indistinguishable, respectively) particles that are not subject to the Pauli exclusion principle† obey Maxwell-Boltzmann statistics (or, respectively, Bose-Einstein statistics). If, however, the particles are indistinguishable and are subject to the exclusion principle, they obey Fermi-Dirac statistics (see Table 4). For example, electrons, protons and neutrons obey Fermi-Dirac statistics. Photons and pions obey Bose-Einstein statistics. Distinguishable particles that are subject to the exclusion principle do not occur in physics.
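The counts in Table 4 can likewise be checked by enumerating placements directly. A sketch for the case M = 3, n = 2 of Tables 3 and 4 (the variable names are ours):

```python
from itertools import product
from math import comb, perm

M, n = 3, 2  # two objects in three cells, as in Tables 3 and 4

# A placement of n distinguishable objects is a tuple (a_1, ..., a_n),
# where a_i is the cell containing object i.
placements = list(product(range(1, M + 1), repeat=n))

# Maxwell-Boltzmann: distinguishable objects, no exclusion: M^n placements.
assert len(placements) == M ** n

# Distinguishable objects with exclusion: (M)_n placements.
assert sum(len(set(p)) == n for p in placements) == perm(M, n)

# Bose-Einstein: indistinguishable, no exclusion; a placement is the
# multiset of occupied cells (encoded here as a sorted tuple).
bose = {tuple(sorted(p)) for p in placements}
assert len(bose) == comb(M + n - 1, n)

# Fermi-Dirac: indistinguishable, at most one object per cell;
# a placement is just the set of occupied cells.
fermi = {frozenset(p) for p in placements if len(set(p)) == n}
assert len(fermi) == comb(M, n)
```

Identifying placements under permutation of the objects (the sorted tuples and frozensets above) is exactly the passage from ordered to unordered samples in the duality of Examples 4 and 5.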

3. In addition to the concept of sample space we now need the fundamental concept of event. Experimenters are ordinarily interested, not in what particular outcome occurs as the result of a trial, but in whether the outcome belongs to some subset of the set of all possible outcomes. We shall describe as events all subsets A ⊆ Ω for which, under the conditions of the experiment, it is possible to say either "the outcome ω ∈ A" or "the outcome ω ∉ A."

† At most one particle in each cell. (Translator)

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes


For example, let a coin be tossed three times. The sample space Ω consists of the eight points
Ω = {HHH, HHT, ..., TTT}

and if we are able to observe (determine, measure, etc.) the results of all three tosses, we say that the set A = {HHH, HHT, HTH, THH} is the event consisting of the appearance of at least two heads. If, however, we can determine only the result of the first toss, this set A cannot be considered to be an event, since there is no way to give either a positive or negative answer to the question of whether a specific outcome ω belongs to A.
Starting from a given collection of sets that are events, we can form new events by means of statements containing the logical connectives "or," "and," and "not," which correspond in the language of set theory to the operations "union," "intersection," and "complement."
If A and B are sets, their union, denoted by A ∪ B, is the set of points that belong either to A or to B:
A ∪ B = {ω ∈ Ω: ω ∈ A or ω ∈ B}.
In the language of probability theory, A ∪ B is the event consisting of the realization of either A or of B.
The intersection of A and B, denoted by A ∩ B, or by AB, is the set of points that belong to both A and B:
A ∩ B = {ω ∈ Ω: ω ∈ A and ω ∈ B}.

The event A ∩ B consists of the simultaneous realization of both A and B. For example, if A = {HH, HT, TH} and B = {TT, TH, HT} then
A ∪ B = {HH, HT, TH, TT} (= Ω),
A ∩ B = {TH, HT}.
If A is a subset of Ω, its complement, denoted by Ā, is the set of points of Ω that do not belong to A. If B\A denotes the difference of B and A (i.e. the set of points that belong to B but not to A) then Ā = Ω\A. In the language of probability, Ā is the event consisting of the nonrealization of A. For example, if A = {HH, HT, TH} then Ā = {TT}, the event in which two successive tails occur.
The sets A and Ā have no points in common, and consequently A ∩ Ā is empty. We denote the empty set by ∅. In probability theory, ∅ is called an impossible event. The set Ω is naturally called the certain event. When A and B are disjoint (AB = ∅), the union A ∪ B is called the sum of A and B and written A + B.
If we consider a collection 𝒜₀ of sets A ⊆ Ω, we may use the set-theoretic operations ∪, ∩ and \ to form a new collection of sets from the elements of


𝒜₀; these sets are again events. If we adjoin the certain and impossible events Ω and ∅, we obtain a collection 𝒜 of sets which is an algebra, i.e. a collection of subsets of Ω for which
(1) Ω ∈ 𝒜,
(2) if A ∈ 𝒜 and B ∈ 𝒜, the sets A ∪ B, A ∩ B, A\B also belong to 𝒜.
It follows from what we have said that it will be advisable to consider collections of events that form algebras. In the future we shall consider only such collections.
Here are some examples of algebras of events:

(a) {Ω, ∅}, the collection consisting of Ω and the empty set (we call this the trivial algebra);
(b) {A, Ā, Ω, ∅}, the collection generated by A;
(c) 𝒜 = {A: A ⊆ Ω}, the collection consisting of all the subsets of Ω (including the empty set ∅).

It is easy to check that all these algebras of events can be obtained from the following principle. We say that a collection
𝒟 = {D₁, ..., Dₙ}
of sets is a decomposition of Ω, and call the Dᵢ the atoms of the decomposition, if the Dᵢ are not empty, are pairwise disjoint, and their sum is Ω:
D₁ + ··· + Dₙ = Ω.

For example, if Ω consists of three points, Ω = {1, 2, 3}, there are five different decompositions:
𝒟₁ = {D₁} with D₁ = {1, 2, 3};
𝒟₂ = {D₁, D₂} with D₁ = {1, 2}, D₂ = {3};
𝒟₃ = {D₁, D₂} with D₁ = {1, 3}, D₂ = {2};
𝒟₄ = {D₁, D₂} with D₁ = {2, 3}, D₂ = {1};
𝒟₅ = {D₁, D₂, D₃} with D₁ = {1}, D₂ = {2}, D₃ = {3}.

(For the general number of decompositions of a finite set, see Problem 2.)
If we consider all unions of the sets in 𝒟, the resulting collection of sets, together with the empty set, forms an algebra, called the algebra induced by 𝒟 and denoted by α(𝒟). Thus the elements of α(𝒟) consist of the empty set together with the sums of sets which are atoms of 𝒟. Thus if 𝒟 is a decomposition, there is associated with it a specific algebra ℬ = α(𝒟).
The converse is also true. Let ℬ be an algebra of subsets of a finite space Ω. Then there is a unique decomposition 𝒟 whose atoms are the elements of


ℬ, with ℬ = α(𝒟). In fact, let D ∈ ℬ have the property that for every B ∈ ℬ the set D ∩ B either coincides with D or is empty. Then the collection of such sets D forms a decomposition 𝒟 with the required property α(𝒟) = ℬ. In Example (a), 𝒟 is the trivial decomposition consisting of the single set D₁ = Ω; in (b), 𝒟 = {A, Ā}. The most fine-grained decomposition 𝒟, which consists of the singletons {ωᵢ}, ωᵢ ∈ Ω, induces the algebra in Example (c), i.e. the algebra of all subsets of Ω.
Let 𝒟₁ and 𝒟₂ be two decompositions. We say that 𝒟₂ is finer than 𝒟₁, and write 𝒟₁ ≼ 𝒟₂, if α(𝒟₁) ⊆ α(𝒟₂).
Let us show that if Ω consists, as we assumed above, of a finite number of points ω₁, ..., ω_N, then the number N(𝒜) of sets in the collection 𝒜 of all subsets of Ω is equal to 2^N. In fact, every nonempty set A ∈ 𝒜 can be represented as A = {ω_{i₁}, ..., ω_{iₖ}}, where ω_{iⱼ} ∈ Ω and 1 ≤ k ≤ N. With this set we associate the sequence of zeros and ones

(0, ..., 0, 1, 0, ..., 0, 1, ...),
where there are ones in the positions i₁, ..., iₖ and zeros elsewhere. Then for a given k the number of different sets A of the form {ω_{i₁}, ..., ω_{iₖ}} is the same as the number of ways in which k ones (k indistinguishable objects) can be placed in N positions (N cells). According to Table 4 (see the lower right-hand square) this number is C^k_N. Hence (counting the empty set) we find that
N(𝒜) = 1 + C^1_N + ··· + C^N_N = (1 + 1)^N = 2^N.
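The correspondence between decompositions and algebras is easy to explore by brute force (a sketch with our own helper names): form all sums of the atoms of a decomposition to obtain α(𝒟), and check the 2^N count for the finest decomposition.

```python
from itertools import combinations

def induced_algebra(atoms):
    """alpha(D): the empty set together with all unions (sums) of atoms of D."""
    algebra = set()
    for r in range(len(atoms) + 1):
        for chosen in combinations(atoms, r):
            algebra.add(frozenset().union(*chosen))
    return algebra

omega = {1, 2, 3}
finest = [frozenset({x}) for x in omega]        # atoms {1}, {2}, {3}
print(len(induced_algebra(finest)))             # 2^3 = 8: all subsets of Omega

coarse = [frozenset({1, 2}), frozenset({3})]    # the decomposition D_2 above
print(len(induced_algebra(coarse)))             # 4 sets: {}, {1,2}, {3}, {1,2,3}
```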

4. We have now taken the first two steps in defining a probabilistic model of an experiment with a finite number of outcomes: we have selected a sample space and a collection 𝒜 of subsets, which form an algebra and are called events. We now take the next step, to assign to each sample point (outcome) ωᵢ ∈ Ω, i = 1, ..., N, a weight. This is denoted by p(ωᵢ) and called the probability of the outcome ωᵢ; we assume that it has the following properties:
(a) 0 ≤ p(ωᵢ) ≤ 1 (nonnegativity),
(b) p(ω₁) + ··· + p(ω_N) = 1 (normalization).
Starting from the given probabilities p(ωᵢ) of the outcomes ωᵢ, we define the probability P(A) of any event A ∈ 𝒜 by
P(A) = Σ_{{i: ωᵢ ∈ A}} p(ωᵢ).  (4)

Finally, we say that a triple
(Ω, 𝒜, P),
where Ω = {ω₁, ..., ω_N}, 𝒜 is an algebra of subsets of Ω, and P = {P(A); A ∈ 𝒜},


defines (or assigns) a probabilistic model, or a probability space, of experiments with a (finite) space Ω of outcomes and algebra 𝒜 of events.
The following properties of probability follow from (4):
P(∅) = 0,  (5)
P(Ω) = 1,  (6)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).  (7)
In particular, if A ∩ B = ∅, then
P(A + B) = P(A) + P(B)  (8)
and
P(Ā) = 1 − P(A).  (9)
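Properties (5)–(9) are easy to confirm numerically on a toy space (our own example, using exact rational weights):

```python
from fractions import Fraction

omega = {0, 1, 2, 3}
p = {w: Fraction(1, 4) for w in omega}     # equal weights, summing to 1

def P(A):
    """Formula (4): P(A) is the sum of the weights of the points of A."""
    return sum(p[w] for w in A)

A, B = {0, 1}, {1, 2}
print(P(set()) == 0 and P(omega) == 1)                 # (5), (6): True
print(P(A | B) == P(A) + P(B) - P(A & B))              # (7): True
print(P(omega - A) == 1 - P(A))                        # (9): True
print(P(A | B))                                        # 3/4
```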

5. In constructing a probabilistic model for a specific situation, the construction of the sample space Ω and the algebra 𝒜 of events is ordinarily not difficult. In elementary probability theory one usually takes the algebra 𝒜 to be the algebra of all subsets of Ω. Any difficulty that may arise is in assigning probabilities to the sample points. In principle, the solution to this problem lies outside the domain of probability theory, and we shall not consider it in detail. We consider that our fundamental problem is not the question of how to assign probabilities, but how to calculate the probabilities of complicated events (elements of 𝒜) from the probabilities of the sample points.
It is clear from a mathematical point of view that for finite sample spaces we can obtain all conceivable (finite) probability spaces by assigning nonnegative numbers p₁, ..., p_N, satisfying the condition p₁ + ··· + p_N = 1, to the outcomes ω₁, ..., ω_N.
The validity of the assignment of the numbers p₁, ..., p_N can, in specific cases, be checked to a certain extent by using the law of large numbers (which will be discussed later on). It states that in a long series of "independent" experiments, carried out under identical conditions, the frequencies with which the elementary events appear are "close" to their probabilities.
In connection with the difficulty of assigning probabilities to outcomes, we note that there are many actual situations in which for reasons of symmetry it seems reasonable to consider all conceivable outcomes as equally probable. In such cases, if the sample space consists of points ω₁, ..., ω_N, with N < ∞, we put
p(ω₁) = ··· = p(ω_N) = 1/N,
and consequently
P(A) = N(A)/N  (10)


for every event A ∈ 𝒜, where N(A) is the number of sample points in A. This is called the classical method of assigning probabilities. It is clear that in this case the calculation of P(A) reduces to calculating the number of outcomes belonging to A. This is usually done by combinatorial methods, so that combinatorics, applied to finite sets, plays a significant role in the calculus of probabilities.

Example 7 (Coincidence problem). Let an urn contain M balls numbered 1, 2, ..., M. We draw an ordered sample of size n with replacement. It is clear that then
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 1, ..., M}
and N(Ω) = M^n. Using the classical assignment of probabilities, we consider the M^n outcomes equally probable and ask for the probability of the event
A = {ω: ω = (a₁, ..., aₙ), aᵢ ≠ aⱼ for i ≠ j},
i.e., the event in which there is no repetition. Clearly N(A) = M(M − 1)···(M − n + 1), and therefore
P(A) = (M)ₙ/M^n = (1 − 1/M)(1 − 2/M)···(1 − (n − 1)/M).  (11)

This problem has the following striking interpretation. Suppose that there are n students in a class. Let us suppose that each student's birthday is on one of 365 days and that all days are equally probable. The question is, what is the probability Pₙ that there are at least two students in the class whose birthdays coincide? If we interpret the selection of birthdays as the selection of balls from an urn containing 365 balls, then by (11)
Pₙ = 1 − (365)ₙ/365^n.

The following table lists the values of Pₙ for some values of n:

  n    4      16     22     23     40     64
  Pₙ   0.016  0.284  0.476  0.507  0.891  0.997
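The table is easy to reproduce (a verification sketch using exact arithmetic; not part of the original text):

```python
from fractions import Fraction

def birthday_coincidence(n, days=365):
    """P_n = 1 - (days)_n / days^n: probability of at least one shared birthday."""
    no_repeat = Fraction(1)
    for k in range(n):
        no_repeat *= Fraction(days - k, days)
    return 1 - no_repeat

for n in (4, 16, 22, 23, 40, 64):
    print(n, round(float(birthday_coincidence(n)), 3))
# n = 23 is the smallest class size with P_n > 1/2
```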

It is interesting to note that (unexpectedly!) the size of class in which there is probability ½ of finding at least two students with the same birthday is not very large: only 23.

Example 8 (Prizes in a lottery). Consider a lottery that is run in the following way. There are M tickets numbered 1, 2, ..., M, of which n, numbered 1, ..., n, win prizes (M ≥ 2n). You buy n tickets, and ask for the probability (P, say) of winning at least one prize.


Since the order in which the tickets are drawn plays no role in the presence or absence of winners in your purchase, we may suppose that the sample space has the form
Ω = {ω: ω = [a₁, ..., aₙ], aₖ ≠ a_l for k ≠ l, aᵢ = 1, ..., M}.
By Table 1, N(Ω) = C^n_M. Now let
A₀ = {ω: ω = [a₁, ..., aₙ], aₖ ≠ a_l for k ≠ l, aᵢ = n + 1, ..., M}
be the event that there is no winner in the set of tickets you bought. Again by Table 1, N(A₀) = C^n_{M−n}. Therefore
P(A₀) = C^n_{M−n}/C^n_M = (M − n)ₙ/(M)ₙ,
and consequently
P = 1 − P(A₀) = 1 − (1 − n/M)(1 − n/(M − 1))···(1 − n/(M − n + 1)).
If M = n² and n → ∞, then P(A₀) → e⁻¹ and
P → 1 − e⁻¹ ≈ 0.632.
The convergence is quite fast: for n = 10 the probability is already P = 0.670.
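The limit 1 − e⁻¹ and the value for n = 10 can be verified directly (a sketch; `win_prob` is our own name):

```python
from math import comb, exp

def win_prob(n):
    """P = 1 - C^n_{M-n} / C^n_M for the lottery with M = n^2 tickets."""
    M = n * n
    return 1 - comb(M - n, n) / comb(M, n)

print(round(win_prob(10), 3))     # 0.670, as stated in the text
print(round(1 - exp(-1), 3))      # 0.632, the limiting value
```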

6. PROBLEMS

1. Establish the following properties of the operations ∩ and ∪:
A ∪ B = B ∪ A, AB = BA (commutativity),
A ∪ (B ∪ C) = (A ∪ B) ∪ C, A(BC) = (AB)C (associativity),
A(B ∪ C) = AB ∪ AC, A ∪ (BC) = (A ∪ B)(A ∪ C) (distributivity),
A ∪ A = A, AA = A (idempotency).
Show also that
(A ∪ B)‾ = Ā ∩ B̄, (AB)‾ = Ā ∪ B̄.

2. Let Ω contain N elements. Show that the number d(N) of different decompositions of Ω is given by the formula
d(N) = e⁻¹ Σ_{k=0}^{∞} k^N/k!.  (12)
(Hint: Show that
d(N) = Σ_{k=0}^{N−1} C^k_{N−1} d(k), where d(0) = 1,
and then verify that the series in (12) satisfies the same recurrence relation.)


3. For any finite collection of sets A₁, ..., Aₙ,
P(A₁ ∪ ··· ∪ Aₙ) ≤ P(A₁) + ··· + P(Aₙ).

4. Let A and B be events. Show that AB̄ ∪ ĀB is the event in which exactly one of A and B occurs. Moreover,
P(AB̄ ∪ ĀB) = P(A) + P(B) − 2P(AB).

5. Let A₁, ..., Aₙ be events, and define S₀, S₁, ..., Sₙ as follows: S₀ = 1,
Sᵣ = Σ P(A_{k₁} ∩ ··· ∩ A_{kᵣ}), 1 ≤ r ≤ n,
where the sum is over the unordered subsets Jᵣ = [k₁, ..., kᵣ] of {1, ..., n}. Let Bₘ be the event in which exactly m of the events A₁, ..., Aₙ occur simultaneously. Show that
P(Bₘ) = Σ_{r=m}^{n} (−1)^{r−m} C^m_r Sᵣ.
In particular, for m = 0,
P(B₀) = 1 − S₁ + S₂ − ··· ± Sₙ.
Show also that the probability that at least m of the events A₁, ..., Aₙ occur simultaneously is
P(Bₘ) + ··· + P(Bₙ) = Σ_{r=m}^{n} (−1)^{r−m} C^{m−1}_{r−1} Sᵣ.
In particular, the probability that at least one of the events A₁, ..., Aₙ occurs is
P(B₁) + ··· + P(Bₙ) = S₁ − S₂ + ··· ∓ Sₙ.

§2. Some Classical Models and Distributions

1. Binomial distribution. Let a coin be tossed n times, and record the results as an ordered set (a₁, ..., aₙ), where aᵢ = 1 for a head ("success") and aᵢ = 0 for a tail ("failure"). The sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1}.

To each sample point ω = (a₁, ..., aₙ) we assign the probability
p(ω) = p^{Σaᵢ} q^{n−Σaᵢ},
where the nonnegative numbers p and q satisfy p + q = 1.
In the first place, we verify that this assignment of the weights p(ω) is consistent. It is enough to show that Σ_{ω∈Ω} p(ω) = 1. We consider all outcomes ω = (a₁, ..., aₙ) for which Σᵢ aᵢ = k, where k = 0, 1, ..., n. According to Table 4 (distribution of k indistinguishable


ones in n places) the number of these outcomes is C^k_n. Therefore
Σ_{ω∈Ω} p(ω) = Σ_{k=0}^{n} C^k_n p^k q^{n−k} = (p + q)^n = 1.

Thus the space Ω together with the collection 𝒜 of all its subsets and the probabilities P(A) = Σ_{ω∈A} p(ω), A ∈ 𝒜, defines a probabilistic model. It is natural to call this the probabilistic model for n tosses of a coin.
In the case n = 1, when the sample space contains just the two points ω = 1 ("success") and ω = 0 ("failure"), it is natural to call p(1) = p the probability of success. We shall see later that this model for n tosses of a coin can be thought of as the result of n "independent" experiments with probability p of success at each trial.
Let us consider the events
Aₖ = {ω: ω = (a₁, ..., aₙ), a₁ + ··· + aₙ = k}, k = 0, 1, ..., n,
consisting of exactly k successes. It follows from what we said above that
P(Aₖ) = C^k_n p^k q^{n−k}  (1)
and Σ_{k=0}^{n} P(Aₖ) = 1.
The set of probabilities (P(A₀), ..., P(Aₙ)) is called the binomial distribution (of the number of successes in a sample of size n). This distribution plays an extremely important role in probability theory since it arises in the most diverse probabilistic models. We write Pₙ(k) = P(Aₖ), k = 0, 1, ..., n. Figure 1 shows the binomial distribution in the case p = ½ (symmetric coin) for n = 5, 10, 20.
We now present a different model (in essence, equivalent to the preceding one) which describes the random walk of a "particle." Let the particle start at the origin, and after unit time let it take a unit step upward or downward (Figure 2). Consequently after n steps the particle can have moved at most n units up or n units down. It is clear that each path ω of the particle is completely specified by a set (a₁, ..., aₙ), where aᵢ = +1 if the particle moves up at the ith step, and aᵢ = −1 if it moves down. Let us assign to each path ω the weight p(ω) = p^{ν(ω)} q^{n−ν(ω)}, where ν(ω) is the number of +1's in the sequence ω = (a₁, ..., aₙ), i.e. ν(ω) = [(a₁ + ··· + aₙ) + n]/2, and the nonnegative numbers p and q satisfy p + q = 1.
Since Σ_{ω∈Ω} p(ω) = 1, the set of probabilities p(ω) together with the space Ω of paths ω = (a₁, ..., aₙ) and its subsets defines an acceptable probabilistic model of the motion of the particle for n steps.
Let us ask the following question: What is the probability of the event Aₖ that after n steps the particle is at a point with ordinate k? This condition is satisfied by those paths ω for which ν(ω) − (n − ν(ω)) = k, i.e.
ν(ω) = (n + k)/2.
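The coin-tossing model above can be checked by brute force for a small n (a verification sketch of formula (1) and of the consistency of the weights; our own check, not from the text):

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 5, Fraction(1, 3)
q = 1 - p

def weight(omega):
    """p(w) = p^(sum a_i) * q^(n - sum a_i)."""
    k = sum(omega)
    return p ** k * q ** (n - k)

outcomes = list(product((0, 1), repeat=n))            # all 2^n outcomes
assert sum(weight(w) for w in outcomes) == 1          # the weights sum to 1

for k in range(n + 1):
    P_Ak = sum(weight(w) for w in outcomes if sum(w) == k)
    assert P_Ak == comb(n, k) * p ** k * q ** (n - k) # formula (1)
print("formula (1) checked for n =", n)
```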

Figure 1. Graph of the binomial probabilities Pₙ(k) for n = 5, 10, 20.

The number of such paths (see Table 4) is C^{(n+k)/2}_n, and therefore
P(Aₖ) = C^{(n+k)/2}_n p^{(n+k)/2} q^{(n−k)/2}.
Consequently the binomial distribution (P(A₋ₙ), ..., P(A₀), ..., P(Aₙ)) can be said to describe the probability distribution of the position of the particle after n steps. Note that in the symmetric case (p = q = ½), when the probabilities of the individual paths are equal to 2⁻ⁿ,
P(Aₖ) = C^{(n+k)/2}_n · 2⁻ⁿ.
Let us investigate the asymptotic behavior of these probabilities for large n. If the number of steps is 2n, it follows from the properties of the binomial coefficients that the largest of the probabilities P(Aₖ), |k| ≤ 2n, is
P(A₀) = C^n_{2n} · 2⁻²ⁿ.

Figure 2


Figure 3. Beginning of the binomial distribution.

From Stirling's formula (see formula (6) in Section 4),
n! ∼ √(2πn) e⁻ⁿ nⁿ.†
Consequently
C^n_{2n} = (2n)!/(n!)² ∼ 2²ⁿ/√(πn),
and therefore for large n
P(A₀) ∼ 1/√(πn).
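The quality of this approximation is easy to observe numerically (a sketch, not from the text):

```python
from math import comb, pi, sqrt

# P(A_0) = C^n_{2n} 2^(-2n) compared with its approximation 1/sqrt(pi n)
for n in (10, 100, 1000):
    exact = comb(2 * n, n) / 4 ** n
    approx = 1 / sqrt(pi * n)
    print(n, round(exact / approx, 5))
# the ratio approaches 1 as n grows (roughly like 1 - 1/(8n))
```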

Figure 3 represents the beginning of the binomial distribution for 2n steps of a random walk (in contrast to Figure 2, the time axis is now directed upward).

2. Multinomial distribution. Generalizing the preceding model, we now suppose that the sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = b₁, ..., bᵣ},

where b₁, ..., bᵣ are given numbers. Let νᵢ(ω) be the number of elements of ω = (a₁, ..., aₙ) that are equal to bᵢ, i = 1, ..., r, and define the probability of ω by
p(ω) = p₁^{ν₁(ω)} ··· pᵣ^{νᵣ(ω)},
where pᵢ ≥ 0 and p₁ + ··· + pᵣ = 1. Note that
Σ_{ω∈Ω} p(ω) = Σ Cₙ(n₁, ..., nᵣ) p₁^{n₁} ··· pᵣ^{nᵣ},
where the sum on the right is over all nonnegative n₁, ..., nᵣ with n₁ + ··· + nᵣ = n, and Cₙ(n₁, ..., nᵣ) is the number of (ordered) sequences (a₁, ..., aₙ) in which b₁ occurs n₁ times, ..., bᵣ occurs nᵣ times. Since n₁ elements b₁ can

† The notation f(n) ∼ g(n) means that f(n)/g(n) → 1 as n → ∞.


be distributed into n positions in C^{n₁}_n ways, n₂ elements b₂ into the remaining n − n₁ positions in C^{n₂}_{n−n₁} ways, etc., we have
Cₙ(n₁, ..., nᵣ) = C^{n₁}_n · C^{n₂}_{n−n₁} ··· = [n!/(n₁!(n − n₁)!)]·[(n − n₁)!/(n₂!(n − n₁ − n₂)!)] ··· = n!/(n₁! ··· nᵣ!).
Therefore
Σ_{ω∈Ω} p(ω) = Σ_{{n₁≥0,...,nᵣ≥0, n₁+···+nᵣ=n}} [n!/(n₁! ··· nᵣ!)] p₁^{n₁} ··· pᵣ^{nᵣ} = (p₁ + ··· + pᵣ)^n = 1,
and consequently we have defined an acceptable method of assigning probabilities. Let
A_{n₁,...,nᵣ} = {ω: ν₁(ω) = n₁, ..., νᵣ(ω) = nᵣ}.
Then
P(A_{n₁,...,nᵣ}) = [n!/(n₁! ··· nᵣ!)] p₁^{n₁} ··· pᵣ^{nᵣ}.  (2)
The set of probabilities
{P(A_{n₁,...,nᵣ})}
is called the multinomial (or polynomial) distribution.
We emphasize that both this distribution and its special case, the binomial distribution, originate from problems about sampling with replacement.

3. The multidimensional hypergeometric distribution occurs in problems that

involve sampling without replacement. Consider, for example, an urn containing M balls numbered 1, 2, ..., M, where M₁ balls have the color b₁, ..., Mᵣ balls have the color bᵣ, and M₁ + ··· + Mᵣ = M. Suppose that we draw a sample of size n < M without replacement. The sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aₖ ≠ a_l for k ≠ l, aᵢ = 1, ..., M}
and N(Ω) = (M)ₙ. Let us suppose that the sample points are equiprobable, and find the probability of the event B_{n₁,...,nᵣ} in which n₁ balls have color b₁, ..., nᵣ balls have color bᵣ, where n₁ + ··· + nᵣ = n. It is easy to show that
N(B_{n₁,...,nᵣ}) = Cₙ(n₁, ..., nᵣ)(M₁)_{n₁} ··· (Mᵣ)_{nᵣ},


and therefore
P(B_{n₁,...,nᵣ}) = N(B_{n₁,...,nᵣ})/N(Ω) = [C^{n₁}_{M₁} ··· C^{nᵣ}_{Mᵣ}]/C^n_M.  (3)
The set of probabilities {P(B_{n₁,...,nᵣ})} is called the multidimensional hypergeometric distribution. When r = 2 it is simply called the hypergeometric distribution because its "generating function" is a hypergeometric function.
The structure of the multidimensional hypergeometric distribution is rather complicated. For example, the probability
P(B_{n₁,n₂}) = C^{n₁}_{M₁} C^{n₂}_{M₂}/C^n_M, n₁ + n₂ = n, M₁ + M₂ = M,  (4)
contains nine factorials. However, it is easily established that if M → ∞ and M₁ → ∞ in such a way that M₁/M → p (and therefore M₂/M → 1 − p) then
P(B_{n₁,n₂}) → C^{n₁}_n p^{n₁}(1 − p)^{n₂}.  (5)
In other words, under the present hypotheses the hypergeometric distribution is approximated by the binomial; this is intuitively clear since when M and M₁ are large (but finite), sampling without replacement ought to give almost the same result as sampling with replacement.

Example. Let us use (4) to find the probability of picking six "lucky" numbers in a lottery of the following kind (this is an abstract formulation of the "sportloto," which is well known in Russia): There are 49 balls numbered from 1 to 49; six of them are lucky (colored red, say, whereas the rest are white). We draw a sample of six balls, without replacement. The question is, What is the probability that all six of these balls are lucky? Taking M = 49, M₁ = 6, n₁ = 6, n₂ = 0, we see that the event of interest, namely
B₆,₀ = {6 balls, all lucky},
has, by (4), probability
P(B₆,₀) = 1/C^6_{49} ≈ 7.2 × 10⁻⁸.
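The sportloto computation is easy to reproduce from formula (4) (a verification sketch; helper name ours):

```python
from math import comb

M, M1, n = 49, 6, 6

def hypergeom(n1, n2):
    """Formula (4): P(B_{n1,n2}) = C^{n1}_{M1} C^{n2}_{M2} / C^n_M."""
    return comb(M1, n1) * comb(M - M1, n2) / comb(M, n1 + n2)

print(comb(M, n))        # C^6_49 = 13,983,816 possible tickets
print(hypergeom(6, 0))   # about 7.2e-08, as computed in the text
```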

4. The numbers n! increase extremely rapidly with n. For example,
10! = 3,628,800,
15! = 1,307,674,368,000,
and 100! has 158 digits. Hence, from either the theoretical or the computational point of view, it is important to know Stirling's formula,
n! = √(2πn) (n/e)^n exp(θₙ/12n), 0 < θₙ < 1,  (6)


whose proof can be found in most textbooks on mathematical analysis (see also [69]).

5. PROBLEMS

1. Prove formula (5).

2. Show that for the multinomial distribution {P(A_{n₁,...,nᵣ})} the maximum probability is attained at a point (k₁, ..., kᵣ) that satisfies the inequalities npᵢ − 1 < kᵢ ≤ (n + r − 1)pᵢ, i = 1, ..., r.

3. One-dimensional Ising model. Consider n particles located at the points 1, 2, ..., n. Suppose that each particle is of one of two types, and that there are n₁ particles of the first type and n₂ of the second (n₁ + n₂ = n). We suppose that all n! arrangements of the particles are equally probable. Construct a corresponding probabilistic model and find the probability of the event Aₙ(m₁₁, m₁₂, m₂₁, m₂₂) = {ν₁₁ = m₁₁, ..., ν₂₂ = m₂₂}, where νᵢⱼ is the number of particles of type i following particles of type j (i, j = 1, 2).

4. Prove the following identities by probabilistic reasoning:
Σ_{k=0}^{n} (C^k_n)² = C^n_{2n},
Σ_{k=0}^{n} (−1)^{n−k} C^k_m = C^n_{m−1}, m ≥ n + 1,
Σ_{k=0}^{m} k(k − 1) C^k_m = m(m − 1) 2^{m−2}, m ≥ 2.

§3. Conditional Probability. Independence

1. The concept of the probability of an event lets us answer questions of the following kind: if an urn contains M balls, M₁ white and M₂ black, what is the probability P(A) of the event A that a selected ball is white? With the classical approach, P(A) = M₁/M.
The concept of conditional probability, which will be introduced below, lets us answer questions of the following kind: What is the probability that the second ball is white (event B) under the condition that the first ball was also white (event A)? (We are thinking of sampling without replacement.)
It is natural to reason as follows: if the first ball is white, then at the second step we have an urn containing M − 1 balls, of which M₁ − 1 are white and M₂ black; hence it seems reasonable to suppose that the (conditional) probability in question is (M₁ − 1)/(M − 1).


We now give a definition of conditional probability that is consistent with our intuitive ideas. Let (Ω, 𝒜, P) be a (finite) probability space and A an event (i.e. A ∈ 𝒜).

Definition 1. The conditional probability of event B assuming event A, with P(A) > 0 (denoted by P(B|A)), is
P(B|A) = P(AB)/P(A).  (1)

In the classical approach we have P(A) = N(A)/N(Ω), P(AB) = N(AB)/N(Ω), and therefore
P(B|A) = N(AB)/N(A).  (2)

From Definition 1 we immediately get the following properties of conditional probability:
P(A|A) = 1,
P(∅|A) = 0,
P(B|A) = 1 for B ⊇ A,
P(B₁ + B₂|A) = P(B₁|A) + P(B₂|A).
It follows from these properties that for a given set A the conditional probability P(·|A) has the same properties on the space (Ω ∩ A, 𝒜 ∩ A), where 𝒜 ∩ A = {B ∩ A: B ∈ 𝒜}, that the original probability P(·) has on (Ω, 𝒜).
Note that
P(B|A) + P(B̄|A) = 1;
however, in general
P(B|A) + P(B|Ā) ≠ 1,
P(B|A) + P(B̄|Ā) ≠ 1.

Example 1. Consider a family with two children. We ask for the probability that both children are boys, assuming
(a) that the older child is a boy;
(b) that at least one of the children is a boy.


The sample space is
Ω = {BB, BG, GB, GG},
where BG means that the older child is a boy and the younger is a girl, etc.
Let us suppose that all sample points are equally probable:
P(BB) = P(BG) = P(GB) = P(GG) = ¼.
Let A be the event that the older child is a boy, and B, that the younger child is a boy. Then A ∪ B is the event that at least one child is a boy, and AB is the event that both children are boys. In question (a) we want the conditional probability P(AB|A), and in (b), the conditional probability P(AB|A ∪ B).
It is easy to see that
P(AB|A) = P(AB)/P(A) = (¼)/(½) = ½,
P(AB|A ∪ B) = P(AB)/P(A ∪ B) = (¼)/(¾) = ⅓.
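Example 1 can be reproduced by direct enumeration (a sketch with our own helper names):

```python
from fractions import Fraction

omega = ["BB", "BG", "GB", "GG"]   # first letter: older child; B = boy, G = girl

def P(event):
    """Equiprobable sample points, as in the text."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "B"}   # the older child is a boy
B = {w for w in omega if w[1] == "B"}   # the younger child is a boy

print(P(A & B) / P(A))      # P(AB|A)     = 1/2
print(P(A & B) / P(A | B))  # P(AB|A u B) = 1/3
```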

2. The simple but important formula (3), below, is called the formula for total probability. It provides the basic means for calculating the probabilities of complicated events by using conditional probabilities.
Consider a decomposition 𝒟 = {A₁, ..., Aₙ} with P(Aᵢ) > 0, i = 1, ..., n (such a decomposition is often called a complete set of disjoint events). It is clear that
B = BA₁ + ··· + BAₙ
and therefore
P(B) = Σ_{i=1}^{n} P(BAᵢ).
But
P(BAᵢ) = P(B|Aᵢ)P(Aᵢ).
Hence we have the formula for total probability:
P(B) = Σ_{i=1}^{n} P(B|Aᵢ)P(Aᵢ).  (3)
In particular, if 0 < P(A) < 1, then
P(B) = P(B|A)P(A) + P(B|Ā)P(Ā).  (4)


Example 2. An urn contains M balls, m of which are "lucky." We ask for the probability that the second ball drawn is lucky (assuming that the result of the first draw is unknown, that a sample of size 2 is drawn without replacement, and that all outcomes are equally probable). Let A be the event that the first ball is lucky, and B the event that the second is lucky. Then
P(B|A) = P(BA)/P(A) = [m(m − 1)/(M(M − 1))]/(m/M) = (m − 1)/(M − 1),
P(B|Ā) = P(BĀ)/P(Ā) = [m(M − m)/(M(M − 1))]/((M − m)/M) = m/(M − 1)
and
P(B) = P(B|A)P(A) + P(B|Ā)P(Ā) = [(m − 1)/(M − 1)]·(m/M) + [m/(M − 1)]·[(M − m)/M] = m/M.
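Example 2 can be confirmed by brute force for particular values of M and m (our own check; M = 5, m = 2 are arbitrary):

```python
from fractions import Fraction
from itertools import permutations

M, m = 5, 2
draws = list(permutations(range(M), 2))   # equiprobable ordered samples of size 2
                                          # balls 0, ..., m-1 are the lucky ones

B = [d for d in draws if d[1] < m]        # second ball lucky
A = [d for d in draws if d[0] < m]        # first ball lucky
AB = [d for d in A if d[1] < m]

print(Fraction(len(B), len(draws)))       # P(B)   = m/M         = 2/5
print(Fraction(len(AB), len(A)))          # P(B|A) = (m-1)/(M-1) = 1/4
```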

It is interesting to observe that P(A) is precisely m/M. Hence, when the nature of the first ball is unknown, it does not affect the probability that the second ball is lucky.
By the definition of conditional probability (with P(A) > 0),
P(AB) = P(B|A)P(A).  (5)

This formula, the multiplication formula for probabilities, can be generalized (by induction) as follows: If A₁, ..., A_{n−1} are events with P(A₁ ··· A_{n−1}) > 0, then
P(A₁ ··· Aₙ) = P(A₁)P(A₂|A₁) ··· P(Aₙ|A₁ ··· A_{n−1})  (6)
(here A₁ ··· Aₙ = A₁ ∩ A₂ ∩ ··· ∩ Aₙ).

3. Suppose that A and B are events with P(A) > 0 and P(B) > 0. Then along with (5) we have the parallel formula
P(AB) = P(A|B)P(B).  (7)
From (5) and (7) we obtain Bayes's formula
P(A|B) = P(A)P(B|A)/P(B).  (8)


If the events A₁, ..., Aₙ form a decomposition of Ω, then (3) and (8) imply Bayes's theorem:
P(Aᵢ|B) = P(Aᵢ)P(B|Aᵢ) / Σ_{j=1}^{n} P(Aⱼ)P(B|Aⱼ).  (9)

P(A;IB) = LJ=l P(Aj)P(BIA).

In statistical applications, A 1, ••• , An (A 1 + · · · +An = Q) are often called hypotheses, and P(Ai) is called the a priorit probability of Ai. The conditional probability P(A; IB) is considered as the a posteriori probability of A; after the occurrence of event B. ExAMPLE

3. Let an urn contain two coins: A 1, a fair coin with probability

! of falling H; and A 2 , a biased coin with probability-! of falling H. A coin is drawn at random and tossed. Suppose that it falls head. We ask for the probability that the fair coin was selected. Let us construct the corresponding probabilistic model. Here it is natural to take the sample space to be the set n = {A 1 H, A 1 T, A 2 H, A 2 T}, which describes all possible outcomes of a selection and a toss (A 1 H means that coin A 1 was selected and fell heads, etc.) The probabilities p(w) of the various outcomes have to be assigned so that, according to the statement of the problem, P(Ad

= P(A 2 ) =!

and P(HIA 2 )

=!.

With these assignments, the probabilities of the sample points are uniquely determined : P(A 2 T)

=!.

Then by Bayes's formula the probability in question is P(A 1 )P(HIA 1 ) P(AliH) = P(Al)P(HIAl) + P(A2)P(HIA2)

3

S'

and therefore P(A 2 1H)

=

i
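A quick numerical check of Example 3, using exact fractions (a sketch; variable names ours):

```python
from fractions import Fraction

P_A1 = P_A2 = Fraction(1, 2)                      # each coin equally likely to be drawn
P_H_A1, P_H_A2 = Fraction(1, 2), Fraction(1, 3)   # fair and biased coin

P_H = P_H_A1 * P_A1 + P_H_A2 * P_A2               # total probability, formula (3)
P_A1_H = P_A1 * P_H_A1 / P_H                      # Bayes's formula (8)

print(P_A1_H)       # 3/5
print(1 - P_A1_H)   # 2/5
```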

4. In a certain sense, the concept of independence, which we are now going to introduce, plays a central role in probability theory: it is precisely this concept that distinguishes probability theory from the general theory of measure spaces.

† A priori: before the experiment; a posteriori: after the experiment.


If A and B are two events, it is natural to say that B is independent of A if knowing that A has occurred has no effect on the probability of B. In other words, "B is independent of A" if
P(B|A) = P(B)  (10)
(we are supposing that P(A) > 0). Since
P(B|A) = P(AB)/P(A),
it follows from (10) that
P(AB) = P(A)P(B).  (11)
In exactly the same way, if P(B) > 0 it is natural to say that "A is independent of B" if
P(A|B) = P(A).
Hence we again obtain (11), which is symmetric in A and B and still makes sense when the probabilities of these events are zero. After these preliminaries, we introduce the following definition.

Definition 2. Events A and B are called independent or statistically independent (with respect to the probability P) if
P(AB) = P(A)P(B).

In probability theory it is often convenient to consider not only independence of events (or sets) but also independence of collections of events (or sets). Accordingly, we introduce the following definition.

Definition 3. Two algebras 𝒜₁ and 𝒜₂ of events (or sets) are called independent or statistically independent (with respect to the probability P) if all pairs of sets A₁ and A₂, belonging respectively to 𝒜₁ and 𝒜₂, are independent.
For example, let us consider the two algebras
𝒜₁ = {A₁, Ā₁, ∅, Ω} and 𝒜₂ = {A₂, Ā₂, ∅, Ω},
where A₁ and A₂ are subsets of Ω. It is easy to verify that 𝒜₁ and 𝒜₂ are independent if and only if A₁ and A₂ are independent. In fact, the independence of 𝒜₁ and 𝒜₂ means the independence of the 16 pairs of events A₁ and A₂, A₁ and Ā₂, ..., Ω and Ω. Consequently A₁ and A₂ are independent. Conversely, if A₁ and A₂ are independent, we have to show that the other 15


pairs of events are independent. Let us verify, for example, the independence of A₁ and Ā₂. We have
P(A₁Ā₂) = P(A₁) − P(A₁A₂) = P(A₁) − P(A₁)P(A₂) = P(A₁)(1 − P(A₂)) = P(A₁)P(Ā₂).
The independence of the other pairs is verified similarly.

5. The concept of independence of two sets or two algebras of sets can be extended to any finite number of sets or algebras of sets.
Thus we say that the sets A₁, ..., Aₙ are collectively independent or statistically independent (with respect to the probability P) if for k = 1, ..., n and 1 ≤ i₁ < i₂ < ··· < iₖ ≤ n,
P(A_{i₁} ··· A_{iₖ}) = P(A_{i₁}) ··· P(A_{iₖ}).  (12)

The algebras 𝒜₁, ..., 𝒜ₙ of sets are called independent or statistically independent (with respect to the probability P) if all sets A₁, ..., Aₙ belonging respectively to 𝒜₁, ..., 𝒜ₙ are independent.

Note that pairwise independence of events does not imply their independence. In fact if, for example, Ω = {ω₁, ω₂, ω₃, ω₄} and all outcomes are equiprobable, it is easily verified that the events

A = {ω₁, ω₂},  B = {ω₁, ω₃},  C = {ω₁, ω₄}

are pairwise independent, whereas

P(ABC) = 1/4 ≠ (1/2)³ = P(A)P(B)P(C).

Also note that if

P(ABC) = P(A)P(B)P(C)

for events A, B and C, it by no means follows that these events are pairwise independent. In fact, let Ω consist of the 36 ordered pairs (i, j), where i, j = 1, 2, ..., 6 and all the pairs are equiprobable. Then if A = {(i, j): j = 1, 2 or 5}, B = {(i, j): j = 4, 5 or 6}, C = {(i, j): i + j = 9}, we have

P(AB) = 1/6 ≠ 1/4 = P(A)P(B),
P(AC) = 1/36 ≠ 1/18 = P(A)P(C),
P(BC) = 1/12 ≠ 1/18 = P(B)P(C),

but also

P(ABC) = 1/36 = P(A)P(B)P(C).
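Both counterexamples above are small enough to verify by direct enumeration. The following sketch (helper names are ours, not the book's) checks the four-point example, where all pairs are independent but the triple is not, and the dice example, where the triple product relation holds but no pair is independent.

```python
from fractions import Fraction

# Four equiprobable points omega_1..omega_4, encoded as 1..4.
P4 = lambda e: Fraction(len(e), 4)
A4, B4, C4 = {1, 2}, {1, 3}, {1, 4}

assert P4(A4 & B4) == P4(A4) * P4(B4)        # each pair is independent ...
assert P4(A4 & C4) == P4(A4) * P4(C4)
assert P4(B4 & C4) == P4(B4) * P4(C4)
assert P4(A4 & B4 & C4) != P4(A4) * P4(B4) * P4(C4)   # ... but the triple is not

# The dice example: 36 equiprobable ordered pairs (i, j).
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda e: Fraction(len(e), 36)
A = {w for w in omega if w[1] in (1, 2, 5)}
B = {w for w in omega if w[1] in (4, 5, 6)}
C = {w for w in omega if w[0] + w[1] == 9}

assert P(A & B & C) == P(A) * P(B) * P(C)    # the triple relation holds ...
assert P(A & B) != P(A) * P(B)               # ... yet no pair is independent
assert P(A & C) != P(A) * P(C)
assert P(B & C) != P(B) * P(C)
```

Exact rational arithmetic (`Fraction`) avoids any floating-point ambiguity in the equality tests.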

6. Let us consider in more detail, from the point of view of independence, the classical model (Ω, 𝒜, P) that was introduced in §2 and used as a basis for the binomial distribution.


I. Elementary Probability Theory

In this model

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  𝒜 = {A: A ⊆ Ω}

and

p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}.  (13)

Consider an event A ⊆ Ω. We say that this event depends on a trial at time k if it is determined by the value aₖ alone. Examples of such events are

Aₖ = {ω: aₖ = 1},  Āₖ = {ω: aₖ = 0}.

Let us consider the sequence of algebras 𝒜₁, 𝒜₂, ..., 𝒜ₙ, where 𝒜ₖ = {Aₖ, Āₖ, ∅, Ω}, and show that under (13) these algebras are independent. It is clear that

P(Aₖ) = ∑_{(a₁,...,aₖ₋₁,aₖ₊₁,...,aₙ)} p · p^{a₁+⋯+aₖ₋₁+aₖ₊₁+⋯+aₙ} q^{(n−1)−(a₁+⋯+aₖ₋₁+aₖ₊₁+⋯+aₙ)}
      = p ∑_{l=0}^{n−1} C_{n−1}^l p^l q^{(n−1)−l} = p,

and a similar calculation shows that P(Āₖ) = q and that, for k ≠ l,

P(AₖA_l) = p²,  P(AₖĀ_l) = pq,  P(ĀₖĀ_l) = q².

It is easy to deduce from this that 𝒜ₖ and 𝒜_l are independent for k ≠ l. It can be shown in the same way that 𝒜₁, 𝒜₂, ..., 𝒜ₙ are independent. This is the basis for saying that our model (Ω, 𝒜, P) corresponds to "n independent trials with two outcomes and probability p of success." James Bernoulli was the first to study this model systematically, and established the law of large numbers (§5) for it. Accordingly, this model is also called the Bernoulli scheme with two outcomes (success and failure) and probability p of success.

A detailed study of the probability space for the Bernoulli scheme shows that it has the structure of a direct product of probability spaces, defined as follows. Suppose that we are given a collection (Ω₁, ℬ₁, P₁), ..., (Ωₙ, ℬₙ, Pₙ) of finite probability spaces. Form the space Ω = Ω₁ × Ω₂ × ⋯ × Ωₙ of points ω = (a₁, ..., aₙ), where aᵢ ∈ Ωᵢ. Let 𝒜 = ℬ₁ ⊗ ⋯ ⊗ ℬₙ be the algebra of the subsets of Ω that consists of sums of sets of the form

A = B₁ × B₂ × ⋯ × Bₙ

with Bᵢ ∈ ℬᵢ. Finally, for ω = (a₁, ..., aₙ) take p(ω) = p₁(a₁) ⋯ pₙ(aₙ) and define P(A) for the set A = B₁ × B₂ × ⋯ × Bₙ by

P(A) = P₁(B₁)P₂(B₂) ⋯ Pₙ(Bₙ).


It is easy to verify that P(Ω) = 1 and therefore the triple (Ω, 𝒜, P) defines a probability space. This space is called the direct product of the probability spaces (Ω₁, ℬ₁, P₁), ..., (Ωₙ, ℬₙ, Pₙ).

We note an easily verified property of the direct product of probability spaces: with respect to P, the events

A₁ = {ω: a₁ ∈ B₁}, ..., Aₙ = {ω: aₙ ∈ Bₙ},

where Bᵢ ∈ ℬᵢ, are independent. In the same way, the algebras of subsets of Ω

𝒜₁ = {A₁: A₁ = {ω: a₁ ∈ B₁}, B₁ ∈ ℬ₁}, ..., 𝒜ₙ = {Aₙ: Aₙ = {ω: aₙ ∈ Bₙ}, Bₙ ∈ ℬₙ}

are independent.

It is clear from our construction that the Bernoulli scheme (Ω, 𝒜, P) with

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0 or 1},  𝒜 = {A: A ⊆ Ω}

and

p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}

can be thought of as the direct product of the probability spaces (Ωᵢ, ℬᵢ, Pᵢ), i = 1, 2, ..., n, where

Ωᵢ = {0, 1},  ℬᵢ = {{0}, {1}, ∅, Ωᵢ},
Pᵢ({1}) = p,  Pᵢ({0}) = q.
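The direct-product construction can be checked by brute force on a small case. In the sketch below (helper names are ours), we build Ω = {0, 1}ⁿ with the weights p(ω) = p^{∑aᵢ}q^{n−∑aᵢ} and confirm that the coordinate events Aₖ behave as computed above: P(Aₖ) = p and P(AₖA_l) = p² for k ≠ l.

```python
from itertools import product
from fractions import Fraction

n, p = 4, Fraction(1, 3)
q = 1 - p

omega = list(product((0, 1), repeat=n))                 # Omega = {0,1}^n
weight = lambda w: p ** sum(w) * q ** (n - sum(w))      # p(omega), as in (13)

def P(event):
    # P(A) = sum of the weights p(omega) over omega in A
    return sum(weight(w) for w in event)

assert P(omega) == 1                                    # the weights form a probability

# coordinate events A_k = {omega: a_k = 1}
A = [[w for w in omega if w[k] == 1] for k in range(n)]

assert all(P(A[k]) == p for k in range(n))              # P(A_k) = p
assert P([w for w in omega if w[0] == 1 and w[1] == 1]) == p * p   # P(A_k A_l) = p^2
```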

7. PROBLEMS

1. Give examples to show that in general the equations

P(B|A) + P(B|Ā) = 1,
P(B|A) + P(B̄|Ā) = 1

are false.

2. An urn contains M balls, of which M₁ are white. Consider a sample of size n. Let Bⱼ be the event that the ball selected at the jth step is white, and Aₖ the event that a sample of size n contains exactly k white balls. Show that

P(Bⱼ|Aₖ) = k/n

both for sampling with replacement and for sampling without replacement.

3. Let A₁, ..., Aₙ be independent events. Then

P(⋃_{i=1}^n Aᵢ) = 1 − ∏_{i=1}^n P(Āᵢ).

4. Let A₁, ..., Aₙ be independent events with P(Aᵢ) = pᵢ. Then the probability P₀ that none of the events occurs is

P₀ = ∏_{i=1}^n (1 − pᵢ).

5. Let A and B be independent events. In terms of P(A) and P(B), find the probabilities of the events that exactly k, at least k, and at most k of A and B occur (k = 0, 1, 2).

6. Let event A be independent of itself, i.e. let A and A be independent. Show that P(A) is either 0 or 1.

7. Let event A have P(A) = 0 or 1. Show that A and an arbitrary event B are independent.

8. Consider the electric circuit shown in Figure 4.

[Figure 4: a network of five switches A, B, C, D, E between "input" and "output".]

Each of the switches A, B, C, D, and E is independently open or closed with probabilities p and q, respectively. Find the probability that a signal fed in at "input" will be received at "output". If the signal is received, what is the conditional probability that E is open?

§4. Random Variables and Their Properties

1. Let (Ω, 𝒜, P) be a probabilistic model of an experiment with a finite number of outcomes, N(Ω) < ∞, where 𝒜 is the algebra of all subsets of Ω. We observe that in the examples above, where we calculated the probabilities of various events A ∈ 𝒜, the specific nature of the sample space Ω was of no interest. We were interested only in numerical properties depending on the sample points. For example, we were interested in the probability of some number of successes in a series of n trials, in the probability distribution for the number of objects in cells, etc.

The concept "random variable," which we now introduce (later it will be given a more general form), serves to define quantities that are subject to "measurement" in random experiments.

Definition 1. Any numerical function ξ = ξ(ω) defined on a (finite) sample space Ω is called a (simple) random variable. (The reason for the term "simple" random variable will become clear after the introduction of the general concept of random variable in §4 of Chapter II.)


EXAMPLE 1. In the model of two tosses of a coin with sample space Ω = {HH, HT, TH, TT}, define a random variable ξ = ξ(ω) by the table

ω:     HH  HT  TH  TT
ξ(ω):   2   1   1   0

Here, from its very definition, ξ(ω) is nothing but the number of heads in the outcome ω.

Another extremely simple example of a random variable is the indicator (or characteristic function) of a set A ∈ 𝒜:

ξ = I_A(ω),

where†

I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A.

When experimenters are concerned with random variables that describe observations, their main interest is in the probabilities with which the random variables take various values. From this point of view they are interested, not in the distribution of the probability P over (Ω, 𝒜), but in its distribution over the range of a random variable. Since we are considering the case when Ω contains only a finite number of points, the range X of the random variable ξ is also finite. Let X = {x₁, ..., xₘ}, where the (different) numbers x₁, ..., xₘ exhaust the values of ξ.

Let 𝒳 be the collection of all subsets of X, and let B ∈ 𝒳. We can also interpret B as an event if the sample space is taken to be X, the set of values of ξ. On (X, 𝒳), consider the probability P_ξ(·) induced by ξ according to the formula

P_ξ(B) = P{ω: ξ(ω) ∈ B},  B ∈ 𝒳.

It is clear that the values of this probability are completely determined by the probabilities

P_ξ(xᵢ) = P{ω: ξ(ω) = xᵢ},  xᵢ ∈ X.

The set of numbers {P_ξ(x₁), ..., P_ξ(xₘ)} is called the probability distribution of the random variable ξ.

† The notation I(A) is also used. For frequently used properties of indicators see Problem 1.
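The passage from P on (Ω, 𝒜) to the induced distribution P_ξ on the range of ξ can be illustrated with Example 1 (two fair coin tosses, ξ = number of heads); the helper names below are ours.

```python
from collections import Counter
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]              # two tosses of a coin, equiprobable
prob = Fraction(1, 4)
xi = {w: w.count("H") for w in omega}         # xi(omega) = number of heads

P_xi = Counter()                              # induced distribution on the range of xi
for w in omega:
    P_xi[xi[w]] += prob                       # P_xi(x) = P{omega: xi(omega) = x}

assert dict(P_xi) == {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
```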


EXAMPLE 2. A random variable ξ that takes the two values 1 and 0 with probabilities p ("success") and q ("failure") is called a Bernoulli† random variable. Clearly

P_ξ(x) = pˣ q^{1−x},  x = 0, 1.  (1)

A binomial (or binomially distributed) random variable is a random variable ξ that takes the n + 1 values 0, 1, ..., n with probabilities

P_ξ(x) = C_n^x pˣ q^{n−x},  x = 0, 1, ..., n.  (2)

Note that here and in many subsequent examples we do not specify the sample spaces (Ω, 𝒜, P), but are interested only in the values of the random variables and their probability distributions.

The probabilistic structure of the random variable ξ is completely specified by the probability distribution {P_ξ(xᵢ), i = 1, ..., m}. The concept of distribution function, which we now introduce, yields an equivalent description of the probabilistic structure of the random variables.

Definition 2. Let x ∈ R¹. The function

F_ξ(x) = P{ω: ξ(ω) ≤ x}

is called the distribution function of the random variable ξ.

Clearly

F_ξ(x) = ∑_{i: xᵢ ≤ x} P_ξ(xᵢ)

and

P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ−),

where F_ξ(x−) = lim_{y↑x} F_ξ(y). If we suppose that x₁ < x₂ < ⋯ < xₘ and put F_ξ(x₀) = 0, then

P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ₋₁),  i = 1, ..., m.

The following diagrams (Figure 5) exhibit P_ξ(x) and F_ξ(x) for a binomial random variable.

It follows immediately from Definition 2 that the distribution function F_ξ = F_ξ(x) has the following properties:

(1) F_ξ(−∞) = 0, F_ξ(+∞) = 1;
(2) F_ξ(x) is continuous on the right (F_ξ(x+) = F_ξ(x)) and piecewise constant.

† We use the terms "Bernoulli, binomial, Poisson, Gaussian, ... random variables" for what are more usually called random variables with Bernoulli, binomial, Poisson, Gaussian, ... distributions.


[Figure 5. The probabilities P_ξ(x) (top) and the distribution function F_ξ(x) (bottom) of a binomial random variable.]
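The two diagrams of Figure 5 can be reproduced numerically (names below are ours): the binomial probabilities P_ξ(x) of (2), the step function F_ξ(x) built from them, the boundary properties (1)–(2), and the jump relation P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ−).

```python
from math import comb
from fractions import Fraction

n, p = 5, Fraction(1, 2)
q = 1 - p
pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]   # P_xi(x), formula (2)

def F(x):
    # F_xi(x) = sum of P_xi(x_i) over x_i <= x  (right-continuous step function)
    return sum(pmf[i] for i in range(n + 1) if i <= x)

assert sum(pmf) == 1
assert F(-1) == 0 and F(n) == 1          # F(-oo) = 0, F(+oo) = 1
assert F(2) - F(1) == pmf[2]             # the jump at x = 2 equals P_xi(2)
```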

Along with random variables it is often necessary to consider random vectors ξ = (ξ₁, ..., ξᵣ) whose components are random variables. For example, when we considered the multinomial distribution we were dealing with a random vector ν = (ν₁, ..., νᵣ), where νᵢ = νᵢ(ω) is the number of elements equal to bᵢ, i = 1, ..., r, in the sequence ω = (a₁, ..., aₙ).

The set of probabilities

P_ξ(x₁, ..., xᵣ) = P{ω: ξ₁(ω) = x₁, ..., ξᵣ(ω) = xᵣ},

where xᵢ ∈ Xᵢ, the range of ξᵢ, is called the probability distribution of the random vector ξ, and the function

F_ξ(x₁, ..., xᵣ) = P{ω: ξ₁(ω) ≤ x₁, ..., ξᵣ(ω) ≤ xᵣ},

where xᵢ ∈ R¹, is called the distribution function of the random vector ξ = (ξ₁, ..., ξᵣ).

For example, for the random vector ν = (ν₁, ..., νᵣ) mentioned above,

P_ν(n₁, ..., nᵣ) = Cₙ(n₁, ..., nᵣ) p₁^{n₁} ⋯ pᵣ^{nᵣ}

(see (2.2)).

2. Let ξ₁, ..., ξᵣ be a set of random variables with values in a (finite) set X ⊆ R¹. Let 𝒳 be the algebra of subsets of X.


Definition 3. The random variables ξ₁, ..., ξᵣ are said to be independent (collectively independent) if

P{ξ₁ = x₁, ..., ξᵣ = xᵣ} = P{ξ₁ = x₁} ⋯ P{ξᵣ = xᵣ}

for all x₁, ..., xᵣ ∈ X; or, equivalently, if

P{ξ₁ ∈ B₁, ..., ξᵣ ∈ Bᵣ} = P{ξ₁ ∈ B₁} ⋯ P{ξᵣ ∈ Bᵣ}

for all B₁, ..., Bᵣ ∈ 𝒳.

We can get a very simple example of independent random variables from the Bernoulli scheme. Let

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  p(ω) = p^{∑aᵢ} q^{n−∑aᵢ},

and ξᵢ(ω) = aᵢ for ω = (a₁, ..., aₙ), i = 1, ..., n. Then the random variables ξ₁, ξ₂, ..., ξₙ are independent, as follows from the independence of the events

A₁ = {ω: a₁ = 1}, ..., Aₙ = {ω: aₙ = 1},

which was established in §3.

3. We shall frequently encounter the problem of finding the probability distributions of random variables that are functions f(ξ₁, ..., ξᵣ) of random variables ξ₁, ..., ξᵣ. For the present we consider only the determination of the distribution of a sum ζ = ξ + η of random variables.

If ξ and η take values in the respective sets X = {x₁, ..., xₖ} and Y = {y₁, ..., y_l}, the random variable ζ = ξ + η takes values in the set Z = {z: z = xᵢ + yⱼ, i = 1, ..., k; j = 1, ..., l}. Then it is clear that

P_ζ(z) = P{ζ = z} = P{ξ + η = z} = ∑_{(i,j): xᵢ+yⱼ=z} P{ξ = xᵢ, η = yⱼ}.

The case of independent random variables ξ and η is particularly important. In this case

P{ξ = xᵢ, η = yⱼ} = P{ξ = xᵢ}P{η = yⱼ},

and therefore

P_ζ(z) = ∑_{(i,j): xᵢ+yⱼ=z} P_ξ(xᵢ)P_η(yⱼ) = ∑_{i=1}^k P_ξ(xᵢ)P_η(z − xᵢ)  (3)

for all z ∈ Z, where in the last sum P_η(z − xᵢ) is taken to be zero if z − xᵢ ∉ Y.

For example, if ξ and η are independent Bernoulli random variables, taking the values 1 and 0 with respective probabilities p and q, then Z = {0, 1, 2} and

P_ζ(0) = P_ξ(0)P_η(0) = q²,
P_ζ(1) = P_ξ(0)P_η(1) + P_ξ(1)P_η(0) = 2pq,
P_ζ(2) = P_ξ(1)P_η(1) = p².

It is easy to show by induction that if ξ₁, ξ₂, ..., ξₙ are independent Bernoulli random variables with P{ξᵢ = 1} = p, P{ξᵢ = 0} = q, then the random variable ζ = ξ₁ + ⋯ + ξₙ has the binomial distribution

P_ζ(k) = C_n^k p^k q^{n−k},  k = 0, 1, ..., n.  (4)
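Formula (3) is an ordinary convolution of two distributions. The sketch below (function and variable names are ours) convolves n independent Bernoulli distributions step by step and checks that the result is the binomial distribution (4).

```python
from math import comb
from fractions import Fraction

def convolve(P_xi, P_eta):
    # formula (3): P_zeta(z) = sum over x of P_xi(x) * P_eta(z - x)
    P_zeta = {}
    for x, px in P_xi.items():
        for y, py in P_eta.items():
            P_zeta[x + y] = P_zeta.get(x + y, 0) + px * py
    return P_zeta

n, p = 5, Fraction(1, 4)
q = 1 - p
bern = {1: p, 0: q}                      # a Bernoulli distribution

dist = bern
for _ in range(n - 1):                   # zeta = xi_1 + ... + xi_n
    dist = convolve(dist, bern)

# the result is the binomial distribution (4)
assert dist == {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}
```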

4. We now turn to the important concept of the expectation, or mean value, of a random variable.

Let (Ω, 𝒜, P) be a (finite) probability space and ξ = ξ(ω) a random variable with values in the set X = {x₁, ..., xₖ}. If we put Aᵢ = {ω: ξ = xᵢ}, i = 1, ..., k, then ξ can evidently be represented as

ξ(ω) = ∑_{i=1}^k xᵢ I(Aᵢ),  (5)

where the sets A₁, ..., Aₖ form a decomposition of Ω (i.e., they are pairwise disjoint and their sum is Ω; see Subsection 3 of §1).

Let pᵢ = P{ξ = xᵢ}. It is intuitively plausible that if we observe the values of the random variable ξ in "n repetitions of identical experiments", the value xᵢ ought to be encountered about pᵢn times, i = 1, ..., k. Hence the mean value calculated from the results of n experiments is roughly

(1/n)[np₁x₁ + ⋯ + npₖxₖ] = ∑_{i=1}^k pᵢxᵢ.

This discussion provides the motivation for the following definition.

Definition 4. The expectation† or mean value of the random variable ξ = ∑_{i=1}^k xᵢI(Aᵢ) is the number

Eξ = ∑_{i=1}^k xᵢ P(Aᵢ).  (6)

Since Aᵢ = {ω: ξ(ω) = xᵢ} and P_ξ(xᵢ) = P(Aᵢ), we have

Eξ = ∑_{i=1}^k xᵢ P_ξ(xᵢ).  (7)

Recalling the definition of F_ξ = F_ξ(x) and writing

ΔF_ξ(x) = F_ξ(x) − F_ξ(x−),

we obtain P_ξ(xᵢ) = ΔF_ξ(xᵢ) and consequently

Eξ = ∑_{i=1}^k xᵢ ΔF_ξ(xᵢ).  (8)

† Also known as mathematical expectation, or expected value, or (especially in physics) expectation value. (Translator)


Before discussing the properties of the expectation, we remark that it is often convenient to use another representation of the random variable ξ, namely

ξ(ω) = ∑_{j=1}^l xⱼ I(Bⱼ),

where B₁ + ⋯ + B_l = Ω, but some of the xⱼ may be repeated. In this case Eξ can be calculated from the formula ∑_{j=1}^l xⱼ P(Bⱼ), which differs formally from (5) because in (5) the xᵢ are all different. In fact,

∑_{j: xⱼ = xᵢ} xⱼ P(Bⱼ) = xᵢ ∑_{j: xⱼ = xᵢ} P(Bⱼ) = xᵢ P(Aᵢ)

and therefore

∑_{j=1}^l xⱼ P(Bⱼ) = ∑_{i=1}^k xᵢ P(Aᵢ).

5. We list the basic properties of the expectation:

(1) If ξ ≥ 0 then Eξ ≥ 0.
(2) E(aξ + bη) = aEξ + bEη, where a and b are constants.
(3) If ξ ≥ η then Eξ ≥ Eη.
(4) |Eξ| ≤ E|ξ|.
(5) If ξ and η are independent, then Eξη = Eξ · Eη.
(6) (E|ξη|)² ≤ Eξ² · Eη² (Cauchy–Bunyakovskii inequality).†
(7) If ξ = I(A) then Eξ = P(A).

Properties (1) and (7) are evident. To prove (2), let

ξ = ∑ᵢ xᵢ I(Aᵢ),  η = ∑ⱼ yⱼ I(Bⱼ).

Then

aξ + bη = ∑_{i,j} (axᵢ + byⱼ) I(Aᵢ ∩ Bⱼ)

and

E(aξ + bη) = ∑_{i,j} (axᵢ + byⱼ)P(Aᵢ ∩ Bⱼ) = ∑ᵢ axᵢ P(Aᵢ) + ∑ⱼ byⱼ P(Bⱼ) = aEξ + bEη.

† Also known as the Cauchy–Schwarz or Schwarz inequality. (Translator)


Property (3) follows from (1) and (2). Property (4) is evident, since

|Eξ| = |∑ᵢ xᵢP(Aᵢ)| ≤ ∑ᵢ |xᵢ|P(Aᵢ) = E|ξ|.

To prove (5) we note that

Eξη = E(∑ᵢ xᵢ I(Aᵢ))(∑ⱼ yⱼ I(Bⱼ)) = E ∑_{i,j} xᵢyⱼ I(Aᵢ ∩ Bⱼ) = ∑_{i,j} xᵢyⱼ P(Aᵢ ∩ Bⱼ)
    = ∑_{i,j} xᵢyⱼ P(Aᵢ)P(Bⱼ) = Eξ · Eη,

where we have used the property that for independent random variables the events

Aᵢ = {ω: ξ(ω) = xᵢ}  and  Bⱼ = {ω: η(ω) = yⱼ}

are independent: P(Aᵢ ∩ Bⱼ) = P(Aᵢ)P(Bⱼ).

To prove property (6) we observe that

ξ² = ∑ᵢ xᵢ² I(Aᵢ),  η² = ∑ⱼ yⱼ² I(Bⱼ)

and

Eξ² = ∑ᵢ xᵢ² P(Aᵢ),  Eη² = ∑ⱼ yⱼ² P(Bⱼ).

Let Eξ² > 0, Eη² > 0. Put

ξ′ = ξ/√(Eξ²),  η′ = η/√(Eη²).

Since 2|ξ′η′| ≤ ξ′² + η′², we have 2E|ξ′η′| ≤ Eξ′² + Eη′² = 2. Therefore E|ξ′η′| ≤ 1 and (E|ξη|)² ≤ Eξ² · Eη².

However, if, say, Eξ² = 0, this means that ∑ᵢ xᵢ²P(Aᵢ) = 0 and consequently P{ω: ξ(ω) = 0} = 1. Therefore if at least one of Eξ² or Eη² is zero, it is evident that E|ξη| = 0 and consequently the Cauchy–Bunyakovskii inequality still holds.

Remark. Property (5) generalizes in an obvious way to any finite number of random variables: if ξ₁, ..., ξᵣ are independent, then

E(ξ₁ ⋯ ξᵣ) = Eξ₁ ⋯ Eξᵣ.

The proof can be given in the same way as for the case r = 2, or by induction.


EXAMPLE 3. Let ξ be a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q. Then

Eξ = 1 · P{ξ = 1} + 0 · P{ξ = 0} = p.

EXAMPLE 4. Let ξ₁, ..., ξₙ be n Bernoulli random variables with P{ξᵢ = 1} = p, P{ξᵢ = 0} = q, p + q = 1. Then if

Sₙ = ξ₁ + ⋯ + ξₙ,

we find that

ESₙ = np.

This result can be obtained in a different way. It is easy to see that ESₙ is not changed if we assume that the Bernoulli random variables ξ₁, ..., ξₙ are independent. With this assumption, we have according to (4)

P(Sₙ = k) = C_n^k p^k q^{n−k},  k = 0, 1, ..., n.

Therefore

ESₙ = ∑_{k=0}^n k P(Sₙ = k) = ∑_{k=0}^n k C_n^k p^k q^{n−k}
    = ∑_{k=0}^n k · (n!/(k!(n−k)!)) p^k q^{n−k}
    = np ∑_{k=1}^n ((n−1)!/((k−1)!((n−1)−(k−1))!)) p^{k−1} q^{(n−1)−(k−1)}
    = np ∑_{l=0}^{n−1} ((n−1)!/(l!((n−1)−l)!)) p^l q^{(n−1)−l} = np.

However, the first method is more direct.
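Both computations of ESₙ in Example 4 can be reproduced mechanically (names below are ours): linearity gives np at once, while summing k · P(Sₙ = k) over the binomial distribution gives the same value.

```python
from math import comb
from fractions import Fraction

n, p = 7, Fraction(2, 5)
q = 1 - p

by_linearity = n * p                     # E S_n = E xi_1 + ... + E xi_n = np
by_distribution = sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

assert by_linearity == by_distribution
```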

6. Let ξ = ∑ᵢ xᵢ I(Aᵢ), where Aᵢ = {ω: ξ(ω) = xᵢ}, and let φ = φ(ξ(ω)) be a function of ξ(ω). If Bⱼ = {ω: φ(ξ(ω)) = yⱼ}, then

φ(ξ(ω)) = ∑ⱼ yⱼ I(Bⱼ),

and consequently

Eφ = ∑ⱼ yⱼ P(Bⱼ) = ∑ⱼ yⱼ P_φ(yⱼ).

But it is also clear that

φ(ξ(ω)) = ∑ᵢ φ(xᵢ) I(Aᵢ).  (9)


Hence, by (9) and the definition (6) of expectation, the expectation of the random variable φ = φ(ξ) can be calculated as

Eφ(ξ) = ∑ᵢ φ(xᵢ) P_ξ(xᵢ).
7. The important notion of the variance of a random variable ξ indicates the amount of scatter of the values of ξ around Eξ.

Definition 5. The variance (also called the dispersion) of the random variable ξ (denoted by Vξ) is

Vξ = E(ξ − Eξ)².

The number σ = +√(Vξ) is called the standard deviation.

Since

E(ξ − Eξ)² = E(ξ² − 2ξ · Eξ + (Eξ)²) = Eξ² − 2(Eξ)² + (Eξ)²,

we have

Vξ = Eξ² − (Eξ)².

Clearly Vξ ≥ 0. It follows from the definition that

V(a + bξ) = b²Vξ,  where a and b are constants.

In particular, Va = 0, V(bξ) = b²Vξ.

Let ξ and η be random variables. Then

V(ξ + η) = E((ξ − Eξ) + (η − Eη))² = Vξ + Vη + 2E(ξ − Eξ)(η − Eη).

Write

cov(ξ, η) = E(ξ − Eξ)(η − Eη).

This number is called the covariance of ξ and η. If Vξ > 0 and Vη > 0, then

ρ(ξ, η) = cov(ξ, η) / √(Vξ · Vη)

is called the correlation coefficient of ξ and η. It is easy to show (see Problem 7 below) that if ρ(ξ, η) = ±1, then ξ and η are linearly dependent:

η = aξ + b,

with a > 0 if ρ(ξ, η) = 1 and a < 0 if ρ(ξ, η) = −1.

We observe immediately that if ξ and η are independent, so are ξ − Eξ and η − Eη. Consequently by Property (5) of expectations,

cov(ξ, η) = E(ξ − Eξ) · E(η − Eη) = 0.


Using the notation that we introduced for covariance, we have

V(ξ + η) = Vξ + Vη + 2 cov(ξ, η);  (10)

if ξ and η are independent, the variance of the sum ξ + η is equal to the sum of the variances,

V(ξ + η) = Vξ + Vη.  (11)

It follows from (10) that (11) is still valid under weaker hypotheses than the independence of ξ and η. In fact, it is enough to suppose that ξ and η are uncorrelated, i.e. cov(ξ, η) = 0.

Remark. If ξ and η are uncorrelated, it does not follow in general that they are independent. Here is a simple example. Let the random variable α take the values 0, π/2 and π, each with probability 1/3. Then ξ = sin α and η = cos α are uncorrelated; however, they are not only stochastically dependent (i.e., not independent with respect to the probability P):

P{ξ = 1, η = 1} = 0 ≠ 1/9 = P{ξ = 1}P{η = 1},

but even functionally dependent: ξ² + η² = 1.

Properties (10) and (11) can be extended in the obvious way to any number of random variables:

V(ξ₁ + ⋯ + ξₙ) = ∑_{i=1}^n Vξᵢ + 2 ∑_{i<j} cov(ξᵢ, ξⱼ).  (12)

In particular, if ξ₁, ..., ξₙ are pairwise independent (pairwise uncorrelated is sufficient), then

V(ξ₁ + ⋯ + ξₙ) = Vξ₁ + ⋯ + Vξₙ.  (13)

EXAMPLE 5. If ξ is a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q, then

Vξ = E(ξ − Eξ)² = E(ξ − p)² = (1 − p)²p + p²q = pq.

It follows that if ξ₁, ..., ξₙ are independent identically distributed Bernoulli random variables, and Sₙ = ξ₁ + ⋯ + ξₙ, then

VSₙ = npq.  (14)

npq.

(14)

8. Consider two random variables ~ and Yf. Suppose that only ~ can be observed. If~ and Yf are correlated, we may expect that knowing the value of~ allows us to make some inference about the values of the unobserved variable Yf· Any function}= f(~) of~ is called an estimator for Yf· We say that an estimator f* = f*(~) is best in the mean-square sense if E(Yf - /*(0) 2 = inf E(Yf - /(~)) 2 • f

43

§4. Random Variables and Their Properties

Let us show how to find a best estimator in the class of linear estimators a + b~. We consider the function g(a, b) = E(17 - (a + b~)) 2 . Differentiating g(a, b) with respect to a and b, we obtain A.(~) =

og(a, b)= - 2E['7 -(a+ oa og(a, b) ob

= - 2E[('7-

b~)],

(a+ be))e],

whence, setting the derivatives equal to zero, we find that the best meansquare linear estimator is ..t*(~) = a* + b*~. where

a*

= E17 - b*E~,

b*

= cov(~, 17) v~

.

(15)

In other words, (16) The number E(17 - ..1.*(~)) 2 is called the mean-square error of observation. An easy calculation shows that it is equal to

Consequently, the larger (in absolute value) the correlation coefficient 17) between ~ and '1· the smaller the mean-square error of observation tl*. In particular, if IP(e. '7)1 = 1 then tl* = 0 (cf. Problem 7). On the other hand, if~ and '1 are uncorrelated (p(~, '7) = 0}, then A.*(~) = E'7, i.e. in the absence of correlation between ~ and '1 the best estimate of '1 in terms of~ is simply E17 (cf. Problem 4). p(~.

9. PROBLEMS

1. Verify the following properties of indicators I_A = I_A(ω):

I_∅ = 0,  I_Ω = 1,  I_Ā = 1 − I_A,
I_{AB} = I_A · I_B,  I_{A∪B} = I_A + I_B − I_{AB}.

The indicator of ⋃_{i=1}^n Aᵢ is 1 − ∏_{i=1}^n (1 − I_{Aᵢ}), the indicator of ⋂_{i=1}^n Āᵢ is ∏_{i=1}^n (1 − I_{Aᵢ}), and the indicator of ⋂_{i=1}^n Aᵢ is ∏_{i=1}^n I_{Aᵢ}. Moreover,

I_{A △ B} = (I_A − I_B)²,

where A △ B is the symmetric difference of A and B, i.e. the set (A\B) ∪ (B\A).

2. Let ξ₁, ..., ξₙ be independent random variables and

ξ_min = min(ξ₁, ..., ξₙ),  ξ_max = max(ξ₁, ..., ξₙ).

Show that

P{ξ_min ≥ x} = ∏_{i=1}^n P{ξᵢ ≥ x},  P{ξ_max < x} = ∏_{i=1}^n P{ξᵢ < x}.

3. Let ξ₁, ..., ξₙ be independent Bernoulli random variables such that

P{ξᵢ = 0} = 1 − λᵢΔ,  P{ξᵢ = 1} = λᵢΔ,

where Δ is a small number, Δ > 0, λᵢ > 0. Show that

P{ξ₁ + ⋯ + ξₙ = 1} = (∑_{i=1}^n λᵢ)Δ + O(Δ²),
P{ξ₁ + ⋯ + ξₙ > 1} = O(Δ²).

4. Show that

inf_{−∞ < a < ∞} E(ξ − a)² = Vξ.

5. Let ξ be a random variable with distribution function F_ξ(x) and let m_ξ be a median of F_ξ(x), i.e. a point such that

F_ξ(m_ξ−) ≤ 1/2 ≤ F_ξ(m_ξ).

Show that

inf_{−∞ < a < ∞} E|ξ − a| = E|ξ − m_ξ|.

6. Let P_ξ(x) = P{ξ = x} and F_ξ(x) = P{ξ ≤ x}. Show that, for a > 0 and −∞ < b < ∞,

P_{aξ+b}(x) = P_ξ((x − b)/a),  F_{aξ+b}(x) = F_ξ((x − b)/a).

If y ≥ 0, then

F_{ξ²}(y) = F_ξ(+√y) − F_ξ(−√y) + P_ξ(−√y).

Let ξ⁺ = max(ξ, 0). Then

F_{ξ⁺}(x) = 0 for x < 0,  F_{ξ⁺}(0) = F_ξ(0),  F_{ξ⁺}(x) = F_ξ(x) for x > 0.

7. Let ξ and η be random variables with Vξ > 0, Vη > 0, and let ρ = ρ(ξ, η) be their correlation coefficient. Show that |ρ| ≤ 1. If |ρ| = 1, find constants a and b such that η = aξ + b. Moreover, if ρ = 1, then

(η − Eη)/√(Vη) = (ξ − Eξ)/√(Vξ)

(and therefore a > 0), whereas if ρ = −1, then

(η − Eη)/√(Vη) = −(ξ − Eξ)/√(Vξ)

(and therefore a < 0).

8. Let ξ and η be random variables with Eξ = Eη = 0, Vξ = Vη = 1 and correlation coefficient ρ = ρ(ξ, η). Show that

E max(ξ², η²) ≤ 1 + √(1 − ρ²).

9. Use the equation

(indicator of ⋂_{i=1}^n Āᵢ) = ∏_{i=1}^n (1 − I_{Aᵢ})

to deduce the formula P(B₀) = 1 − S₁ + S₂ − ⋯ ± Sₙ from Problem 4 of §1.

10. Let ξ₁, ..., ξₙ be independent random variables, φ₁ = φ₁(ξ₁, ..., ξₖ) and φ₂ = φ₂(ξₖ₊₁, ..., ξₙ) functions respectively of ξ₁, ..., ξₖ and ξₖ₊₁, ..., ξₙ. Show that the random variables φ₁ and φ₂ are independent.

11. Show that the random variables ξ₁, ..., ξₙ are independent if and only if

F_{ξ₁,...,ξₙ}(x₁, ..., xₙ) = F_{ξ₁}(x₁) ⋯ F_{ξₙ}(xₙ)

for all x₁, ..., xₙ, where F_{ξ₁,...,ξₙ}(x₁, ..., xₙ) = P{ξ₁ ≤ x₁, ..., ξₙ ≤ xₙ}.

12. Show that the random variable ξ is independent of itself (i.e., ξ and ξ are independent) if and only if ξ = const.

13. Under what hypotheses on ξ are the random variables ξ and sin ξ independent?

14. Let ξ and η be independent random variables and η ≠ 0. Express the probabilities of the events P{ξη ≤ z} and P{ξ/η ≤ z} in terms of the probabilities P_ξ(x) and P_η(y).

§5. The Bernoulli Scheme. I. The Law of Large Numbers

1. In accordance with the definitions given above, a triple (Ω, 𝒜, P) with

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  𝒜 = {A: A ⊆ Ω},
p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}

is called a probabilistic model of n independent experiments with two outcomes, or a Bernoulli scheme.


In this and the next section we study some limiting properties (in a sense described below) for Bernoulli schemes. These are best expressed in terms of random variables and of the probabilities of events connected with them.

We introduce random variables ξ₁, ..., ξₙ by taking ξᵢ(ω) = aᵢ, i = 1, ..., n, where ω = (a₁, ..., aₙ). As we saw above, the Bernoulli variables ξᵢ(ω) are independent and identically distributed:

P{ξᵢ = 1} = p,  P{ξᵢ = 0} = q,  i = 1, ..., n.

It is natural to think of ξᵢ as describing the result of an experiment at the ith stage (or at time i).

Let us put S₀(ω) = 0 and

Sₖ(ω) = ξ₁(ω) + ⋯ + ξₖ(ω),  k = 1, ..., n.

As we found above, ESₙ = np and consequently

E(Sₙ/n) = p.  (1)

In other words, the mean value of the frequency of "success", i.e. Sₙ/n, coincides with the probability p of success. Hence we are led to ask how much the frequency Sₙ/n of success differs from its probability p.

We first note that we cannot expect that, for a sufficiently small ε > 0 and for sufficiently large n, the deviation of Sₙ/n from p is less than ε for all ω, i.e. that

|Sₙ(ω)/n − p| ≤ ε,  ω ∈ Ω.  (2)

In fact, when 0 < p < 1,

P{Sₙ/n = 1} = P{ξ₁ = 1, ..., ξₙ = 1} = pⁿ,
P{Sₙ/n = 0} = P{ξ₁ = 0, ..., ξₙ = 0} = qⁿ,

whence it follows that (2) is not satisfied for sufficiently small ε > 0.

We observe, however, that when n is large the probabilities of the events {Sₙ/n = 1} and {Sₙ/n = 0} are small. It is therefore natural to expect that the total probability of the events for which |Sₙ(ω)/n − p| > ε will also be small when n is sufficiently large. We shall accordingly try to estimate the probability of the event {ω: |Sₙ(ω)/n − p| > ε}. For this purpose we need the following inequality, which was discovered by Chebyshev.


Chebyshev's inequality. Let (Ω, 𝒜, P) be a probability space and ξ = ξ(ω) a nonnegative random variable. Then

P{ξ ≥ ε} ≤ Eξ/ε  (3)

for all ε > 0.

PROOF. We notice that

ξ = ξI(ξ ≥ ε) + ξI(ξ < ε) ≥ ξI(ξ ≥ ε) ≥ εI(ξ ≥ ε),

where I(A) is the indicator of A. Then, by the properties of the expectation,

Eξ ≥ εEI(ξ ≥ ε) = εP{ξ ≥ ε},

which establishes (3).

Corollary. If ξ is any random variable, we have for ε > 0,

P{|ξ| ≥ ε} ≤ E|ξ|/ε,
P{|ξ| ≥ ε} = P{ξ² ≥ ε²} ≤ Eξ²/ε²,
P{|ξ − Eξ| ≥ ε} ≤ Vξ/ε².  (4)

In the last of these inequalities, take ξ = Sₙ/n. Then using (4.14), we obtain

P{|Sₙ/n − p| ≥ ε} ≤ V(Sₙ/n)/ε² = pq/(nε²) ≤ 1/(4nε²),  (5)

from which we see that for large n there is rather small probability that the frequency Sₙ/n of success deviates from the probability p by more than ε.

For n ≥ 1 and 0 ≤ k ≤ n, write

Pₙ(k) = C_n^k p^k q^{n−k}.

Then

P{|Sₙ/n − p| ≥ ε} = ∑_{k: |(k/n)−p| ≥ ε} Pₙ(k),

and we have actually shown that

∑_{k: |(k/n)−p| ≥ ε} Pₙ(k) ≤ pq/(nε²) ≤ 1/(4nε²),  (6)


[Figure 6. The binomial probabilities Pₙ(k), 0 ≤ k ≤ n, with the interval from np − nε to np + nε marked on the k-axis.]

i.e. we have proved an inequality that could also have been obtained analytically, without using the probabilistic interpretation. It is clear from (6) that

∑_{k: |(k/n)−p| ≥ ε} Pₙ(k) → 0,  n → ∞.  (7)
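The bound (6) and the convergence (7) can be examined numerically. In the sketch below (function names ours), the exact tail sum is compared with 1/(4nε²) for several n, and the sample-size estimate n ≥ 1/(4ε²α) used later in this section is evaluated for α = 0.05, ε = 0.02, reproducing n = 12 500.

```python
from math import comb
from fractions import Fraction

def tail(n, p, eps):
    # the left side of (6): sum of P_n(k) over k with |k/n - p| >= eps
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k)
               for k in range(n + 1) if abs(Fraction(k, n) - p) >= eps)

p, eps = Fraction(1, 2), Fraction(1, 10)
for n in (10, 50, 200):
    assert tail(n, p, eps) <= 1 / (4 * n * eps**2)      # inequality (6)

# the Chebyshev sample-size bound: alpha = 0.05, eps = 0.02 requires n >= 12 500
alpha, eps2 = Fraction(1, 20), Fraction(1, 50)
assert 1 / (4 * eps2**2 * alpha) == 12500
```

The exact tails shrink much faster than the Chebyshev bound, which is why the bound, though valid, is crude.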

We can clarify this graphically in the following way. Let us represent the binomial distribution {Pₙ(k), 0 ≤ k ≤ n} as in Figure 6. Then as n increases the graph spreads out and becomes flatter. At the same time the sum of Pₙ(k), over k for which np − nε ≤ k < np + nε, tends to 1.

Let us think of the sequence of random variables S₀, S₁, ..., Sₙ as the path of a wandering particle. Then (7) has the following interpretation. Let us draw lines from the origin of slopes kp, k(p + ε), and k(p − ε). Then on the average the path follows the kp line, and for every ε > 0 we can say that when n is sufficiently large there is a large probability that the point Sₙ specifying the position of the particle at time n lies in the interval [n(p − ε), n(p + ε)]; see Figure 7.

[Figure 7. A path Sₖ, k = 1, 2, ..., n, between the lines k(p − ε) and k(p + ε), following the line kp on the average.]

We would like to write (7) in the following form:

P{|Sₙ/n − p| ≥ ε} → 0,  n → ∞.  (8)

However, we must keep in mind that there is a delicate point involved here. Indeed, the form (8) is really justified only if P is a probability on a space (0, d) on which infinitely many sequences of independent Bernoulli random variables ~ 1 , ~ 2 , .•. , are defined. Such spaces can actually be constructed and (8) can be justified in a completely rigorous probabilistic sense (see Corollary 1 below, the end of §4, Chapter II, and Theorem 1, §9, Chapter II). For the time being, if we want to attach a meaning to the analytic statement (7), using the language of probability theory, we have proved only the following. Let (n, .Jil<">, p), n ;;;:: 1, be a sequence of Bernoulli schemes such that

n, •.. ' a~">), al") = 0, 1}, .Jil ={A: A£ n}, p<">( w<">) =

pr.a~

q"- !:.a\"'

and

Sk">(w<"l) = ~\">(w<">)

+ · · · + ek">(w<">),

where, for n :s; 1, ~~nl, •.. , ~~nl are sequences of independent identically distributed Bernoulli random variables. Then

n --t oo. (9) Statements like (7)-(9) go by the name of James Bernoulli's law of large numbers. We may remark that to be precise, Bernoulli's proof consisted in establishing (7), which he did quite rigorously by using estimates for the "tails" of the binomial probabilities P"(k) (for the values of k for which l(k/n)- pi ;;;:: e). A direct calculation of the sum of the tail probabilities of the binomial distribution L{k:l
and

50

I. Elementary Probability Theory

2. The next section will be devoted to precise statements and proofs of these results. For the present we consider the question of the real meaning of the law of large numbers, and of its empirical interpretation. Let us carry out a large number, say N, of series of experiments, each of which consists of "n independent trials with probability p of the event C of interest." Let S~/n be the frequency of event C in the ith series and N, the number of series in which the frequency deviates from p by less than e: N, is the number of i's for which i(S~n)- pi ~e. Then N,/N "' P,

(10)

where P, = P{i(S!/n)- Pi~ e}. It is important to emphasize that an attempt to make (10) precise inevitably involves the introduction of some probability measure, just as an estimate for the deviation of Sn/n from p becomes possible only after the introduction of a probability measure P. 3. Let us consider the estimate obtained above,

    P{|S_n/n − p| ≥ ε} = Σ_{k: |(k/n)−p| ≥ ε} P_n(k) ≤ 1/(4nε²),    (11)

as an answer to the following question, typical of mathematical statistics: what is the least number n of observations that guarantees (for arbitrary 0 < p < 1)

    P{|S_n/n − p| ≤ ε} ≥ 1 − α,    (12)

where α is a given (usually small) number? It follows from (11) that this number is the smallest integer n for which

    n ≥ 1/(4ε²α).    (13)

For example, if α = 0.05 and ε = 0.02, then 12 500 observations guarantee that (12) will hold independently of the value of the unknown parameter p. Later (Subsection 5, §6) we shall see that this number is much overstated; this came about because Chebyshev's inequality provides only a very crude upper bound for P{|S_n/n − p| ≥ ε}.

4. Let us write

    C(n, ε) = {ω: |S_n(ω)/n − p| ≤ ε}.

From the law of large numbers that we proved, it follows that for every ε > 0 and for sufficiently large n, the probability of the set C(n, ε) is close to 1. In this sense it is natural to call paths (realizations) ω that are in C(n, ε) typical (or (n, ε)-typical).
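Formula (13) and the tail sum in (11) can be evaluated directly. The following Python sketch is an illustration only (the helper functions are named here for convenience; they are not part of the text):

```python
import math

def chebyshev_sample_size(eps, alpha):
    """Smallest n with 1/(4*n*eps**2) <= alpha, i.e. n >= 1/(4*eps**2*alpha) -- formula (13)."""
    return math.ceil(1 / (4 * eps ** 2 * alpha))

def binomial_tail(n, p, eps):
    """Exact P{|S_n/n - p| >= eps}: the sum of P_n(k) over k with |k/n - p| >= eps, as in (11)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k / n - p) >= eps)

print(chebyshev_sample_size(0.02, 0.05))   # 12500, the figure quoted in the text

# Chebyshev's bound 1/(4*n*eps^2) indeed dominates the exact tail probability:
n, eps = 500, 0.05
for p in (0.1, 0.3, 0.5):
    assert binomial_tail(n, p, eps) <= 1 / (4 * n * eps ** 2)
```

The exact tails are far smaller than the Chebyshev bound, which is why the sample size 12 500 is, as the text notes, much overstated.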

§5. The Bernoulli Scheme. I. The Law of Large Numbers

We ask the following question: How many typical realizations are there, and what is the weight p(ω) of a typical realization?
For this purpose we first notice that the total number N(Ω) of points is 2^n, and that if p = 0 or 1, the set of typical paths C(n, ε) contains only the single path (0, 0, …, 0) or (1, 1, …, 1). However, if p = 1/2, it is intuitively clear that "almost all" paths (all except those of the form (0, 0, …, 0) or (1, 1, …, 1)) are typical and that consequently there should be about 2^n of them. It turns out that we can give a definitive answer to the question whenever 0 < p < 1; it will then appear that both the number of typical realizations and the weights p(ω) are determined by a function of p called the entropy.
In order to present the corresponding results in more depth, it will be helpful to consider the somewhat more general scheme of Subsection 2 of §2 instead of the Bernoulli scheme itself.
Let (p₁, p₂, …, p_r) be a finite probability distribution, i.e. a set of nonnegative numbers satisfying p₁ + ⋯ + p_r = 1. The entropy of this distribution is

    H = −Σ_{i=1}^r p_i ln p_i,    (14)

with 0 · ln 0 = 0. It is clear that H ≥ 0, and H = 0 if and only if every p_i, with one exception, is zero. The function f(x) = −x ln x, 0 ≤ x ≤ 1, is convex upward, so that, as we know from the theory of convex functions,

    [f(x₁) + ⋯ + f(x_r)]/r ≤ f((x₁ + ⋯ + x_r)/r).

Consequently

    H = −Σ_{i=1}^r p_i ln p_i ≤ −r · [(p₁ + ⋯ + p_r)/r] · ln[(p₁ + ⋯ + p_r)/r] = ln r.

In other words, the entropy attains its largest value for p₁ = ⋯ = p_r = 1/r (see Figure 8 for H = H(p) in the case r = 2).
If we consider the probability distribution (p₁, p₂, …, p_r) as giving the probabilities for the occurrence of events A₁, A₂, …, A_r, say, then it is quite clear that the "degree of indeterminacy" of an event will be different for

Figure 8. The function H(p) = −p ln p − (1 − p) ln(1 − p).
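The properties of the entropy (14) just established are easy to check numerically. A small sketch for illustration (the function name is ours):

```python
import math

def entropy(ps):
    """H = -sum p_i ln p_i, with the convention 0 * ln 0 = 0 -- formula (14)."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# H = 0 exactly when one p_i equals 1 and the rest are 0:
assert entropy([1.0, 0.0, 0.0]) == 0.0

# H <= ln r, with equality for the uniform distribution p_i = 1/r:
r = 4
assert abs(entropy([1 / r] * r) - math.log(r)) < 1e-12
assert entropy([0.7, 0.1, 0.1, 0.1]) < math.log(r)

# For r = 2 this is the curve of Figure 8: H(p) = -p ln p - (1-p) ln(1-p)
H = lambda p: entropy([p, 1 - p])
print(H(0.5), math.log(2))   # the maximum, attained at p = 1/2
```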


different distributions. If, for example, p₁ = 1, p₂ = ⋯ = p_r = 0, it is clear that this distribution does not admit any indeterminacy: we can say with complete certainty that the result of the experiment will be A₁. On the other hand, if p₁ = ⋯ = p_r = 1/r, the distribution has maximal indeterminacy, in the sense that it is impossible to discover any preference for the occurrence of one event rather than another.
Consequently it is important to have a quantitative measure of the indeterminacy of different probability distributions, so that we may compare them in this respect. The entropy successfully provides such a measure of indeterminacy; it plays an important role in statistical mechanics and in many significant problems of coding and communication theory.
Suppose now that the sample space is

    Ω = {ω: ω = (a₁, …, a_n), a_i = 1, …, r}

and that p(ω) = p₁^{ν₁(ω)} ⋯ p_r^{ν_r(ω)}, where ν_i(ω) is the number of occurrences of i in the sequence ω. For ε > 0 and n = 1, 2, …, let us put

    C(n, ε) = {ω: |ν_i(ω)/n − p_i| < ε, i = 1, …, r}.

It is clear that

    P(C(n, ε)) ≥ 1 − Σ_{i=1}^r P{|ν_i(ω)/n − p_i| ≥ ε},

and for sufficiently large n the probabilities P{|ν_i(ω)/n − p_i| ≥ ε} are arbitrarily small, by the law of large numbers applied to the random variables

    ξ_k(ω) = 1 if a_k = i,  ξ_k(ω) = 0 if a_k ≠ i,  k = 1, …, n.

Hence for large n the probability of the event C(n, ε) is close to 1. Thus, as in the case r = 2, a path in C(n, ε) can be said to be typical.
If all p_i > 0, then for every ω ∈ Ω

    p(ω) = exp{−n Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k)}.

Consequently, if ω is a typical path, we have

    | Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k) − H | ≤ Σ_{k=1}^r |ν_k(ω)/n − p_k| · |ln p_k| ≤ −ε Σ_{k=1}^r ln p_k.

It follows that for typical paths the probability p(ω) is close to e^{−nH} and (since, by the law of large numbers, the typical paths "almost" exhaust Ω when n is large) the number of such paths must be of order e^{nH}. These considerations lead up to the following proposition.


Theorem (Macmillan). Let p_i > 0, i = 1, …, r, and 0 < ε < 1. Then there is an n₀ = n₀(ε; p₁, …, p_r) such that for all n > n₀

(a) e^{n(H−ε)} ≤ N(C(n, ε₁)) ≤ e^{n(H+ε)};
(b) e^{−n(H+ε)} ≤ p(ω) ≤ e^{−n(H−ε)},  ω ∈ C(n, ε₁);
(c) P(C(n, ε₁)) = Σ_{ω ∈ C(n, ε₁)} p(ω) → 1,  n → ∞;

where ε₁ is the smaller of ε and ε/(−2 Σ_{k=1}^r ln p_k).

PROOF. Conclusion (c) follows from the law of large numbers. To establish the other conclusions, we notice that if ω ∈ C(n, ε₁) then

    |ν_k(ω)/n − p_k| < ε₁,  k = 1, …, r,

and therefore

    p(ω) = exp{−n Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k)} ≤ exp{−n(−Σ_{k=1}^r p_k ln p_k + ε₁ Σ_{k=1}^r ln p_k)} ≤ exp{−n(H − ½ε)}.

Similarly

    p(ω) ≥ exp{−n(H + ½ε)}.

Consequently (b) is now established. Furthermore, since

    P(C(n, ε₁)) ≥ N(C(n, ε₁)) · min_{ω ∈ C(n, ε₁)} p(ω),

we have

    N(C(n, ε₁)) ≤ P(C(n, ε₁)) / min_{ω ∈ C(n, ε₁)} p(ω) ≤ 1/e^{−n(H+½ε)} = e^{n(H+½ε)} ≤ e^{n(H+ε)},

and similarly

    N(C(n, ε₁)) ≥ P(C(n, ε₁)) / max_{ω ∈ C(n, ε₁)} p(ω) > P(C(n, ε₁)) e^{n(H−½ε)}.

Since P(C(n, ε₁)) → 1, n → ∞, there is an n₁ such that P(C(n, ε₁)) > 1 − ε for n > n₁, and therefore

    N(C(n, ε₁)) ≥ (1 − ε) exp{n(H − ½ε)} = exp{n(H − ε) + (½nε + ln(1 − ε))}.

Let n₂ be such that ½nε + ln(1 − ε) > 0 for n > n₂. Then when n ≥ n₀ = max(n₁, n₂) we have

    N(C(n, ε₁)) ≥ e^{n(H−ε)}.

This completes the proof of the theorem.

5. The law of large numbers for Bernoulli schemes lets us give a simple and elegant proof of Weierstrass's theorem on the approximation of continuous functions by polynomials.
Let f = f(p) be a continuous function on the interval [0, 1]. We introduce the polynomials

    B_n(p) = Σ_{k=0}^n f(k/n) C_n^k p^k q^{n−k},  q = 1 − p,

which are called Bernstein polynomials after the inventor of this proof of Weierstrass's theorem. If ξ₁, …, ξ_n is a sequence of independent Bernoulli random variables with P{ξ_i = 1} = p, P{ξ_i = 0} = q and S_n = ξ₁ + ⋯ + ξ_n, then

    E f(S_n/n) = B_n(p).

Since the function f = f(p), being continuous on [0, 1], is uniformly continuous, for every ε > 0 we can find δ > 0 such that |f(x) − f(y)| ≤ ε whenever |x − y| ≤ δ. It is also clear that the function is bounded: |f(x)| ≤ M < ∞. Using this and (5), we obtain

    |f(p) − B_n(p)| = | Σ_{k=0}^n [f(p) − f(k/n)] C_n^k p^k q^{n−k} |
        ≤ Σ_{k: |(k/n)−p| ≤ δ} |f(p) − f(k/n)| C_n^k p^k q^{n−k} + Σ_{k: |(k/n)−p| > δ} |f(p) − f(k/n)| C_n^k p^k q^{n−k}
        ≤ ε + 2M Σ_{k: |(k/n)−p| > δ} C_n^k p^k q^{n−k} ≤ ε + 2M/(4nδ²) = ε + M/(2nδ²).

Hence

    lim_{n→∞} max_{0 ≤ p ≤ 1} |f(p) − B_n(p)| = 0,

which is the conclusion of Weierstrass's theorem.
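The uniform convergence of the Bernstein polynomials can be observed numerically. A minimal sketch (the test function and helper name are our choices, for illustration only):

```python
import math

def bernstein(f, n, p):
    """B_n(p) = sum_{k=0}^n f(k/n) C(n,k) p^k (1-p)^(n-k), i.e. E f(S_n/n)."""
    return sum(f(k / n) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)   # a continuous (non-smooth) test function on [0, 1]

for n in (10, 100, 1000):
    err = max(abs(f(i / 50) - bernstein(f, n, i / 50)) for i in range(51))
    print(n, err)   # the maximal deviation over a grid shrinks as n grows
```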


6. PROBLEMS

1. Let ξ and η be random variables with correlation coefficient ρ. Establish the following two-dimensional analog of Chebyshev's inequality:

    P{|ξ − Eξ| ≥ ε√(Vξ) or |η − Eη| ≥ ε√(Vη)} ≤ (1/ε²)(1 + √(1 − ρ²)).

(Hint: Use the result of Problem 8 of §4.)

2. Let f = f(x) be a nonnegative even function that is nondecreasing for positive x. Then for a random variable ξ with |ξ(ω)| ≤ C,

    [E f(ξ) − f(ε)]/f(C) ≤ P{|ξ − Eξ| ≥ ε} ≤ E f(ξ − Eξ)/f(ε).

In particular, if f(x) = x²,

    (Eξ² − ε²)/C² ≤ P{|ξ − Eξ| ≥ ε} ≤ Vξ/ε².

3. Let ξ₁, …, ξ_n be a sequence of independent random variables with Vξ_i ≤ C. Then

    P{ |(ξ₁ + ⋯ + ξ_n)/n − E(ξ₁ + ⋯ + ξ_n)/n| ≥ ε } ≤ C/(nε²).    (15)

(With the same reservations as in (8), inequality (15) implies the validity of the law of large numbers in more general contexts than Bernoulli schemes.)

4. Let ξ₁, …, ξ_n be independent Bernoulli random variables with P{ξ_i = 1} = p > 0, P{ξ_i = −1} = 1 − p. Derive the following inequality of Bernstein: there is a number a > 0 such that

    P{ |S_n/n − (2p − 1)| ≥ ε } ≤ 2e^{−aε²n},

where S_n = ξ₁ + ⋯ + ξ_n and ε > 0.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre–Laplace, Poisson)

1. As in the preceding section, let S_n = ξ₁ + ⋯ + ξ_n. Then

    E(S_n/n) = p,    (1)

and by (4.14)

    E(S_n/n − p)² = pq/n.    (2)


It follows from (1) that S_n/n ≈ p, where the equivalence symbol ≈ has been given a precise meaning in the law of large numbers in terms of an inequality for P{|S_n/n − p| ≥ ε}. It is natural to suppose that, in a similar way, the relation

    [E(S_n/n − p)²]^{1/2} = √(pq/n),    (3)

which follows from (2), can also be given a precise probabilistic meaning involving, for example, probabilities of the form

    P{|S_n/n − p| ≤ x√(pq/n)},  x ∈ R¹,

or equivalently

    P{|(S_n − ES_n)/√(VS_n)| ≤ x}

(since ES_n = np and VS_n = npq).
If, as before, we write

    P_n(k) = C_n^k p^k q^{n−k},  0 ≤ k ≤ n,

for n ≥ 1, then

    P{|(S_n − ES_n)/√(VS_n)| ≤ x} = Σ_{k: |(k−np)/√(npq)| ≤ x} P_n(k).    (4)

We set the problem of finding convenient asymptotic formulas, as n → ∞, for P_n(k) and for their sum over the values of k that satisfy the condition on the right-hand side of (4). The following result provides an answer not only for these values of k (that is, for those satisfying |k − np| = O(√(npq))) but also for those satisfying |k − np| = o(npq)^{2/3}.

Local Limit Theorem. Let 0 < p < 1; then

    P_n(k) ~ (1/√(2πnpq)) e^{−(k−np)²/(2npq)}    (5)

uniformly for k such that |k − np| = o(npq)^{2/3}, i.e. as n → ∞,

    sup_{k: |k−np| ≤ φ(n)} | P_n(k) / [(1/√(2πnpq)) e^{−(k−np)²/(2npq)}] − 1 | → 0,

where φ(n) = o(npq)^{2/3}.


The proof depends on Stirling's formula (2.6):

    n! = √(2πn) e^{−n} n^n (1 + R(n)),

where R(n) → 0 as n → ∞. Then if n → ∞, k → ∞, n − k → ∞, we have

    C_n^k = n!/(k!(n − k)!)
          = [√(2πn) e^{−n} n^n / (√(2πk) e^{−k} k^k · √(2π(n−k)) e^{−(n−k)} (n−k)^{n−k})] · (1 + R(n))/[(1 + R(k))(1 + R(n−k))]
          = [1/√(2πn (k/n)(1 − k/n))] · (n/k)^k · (n/(n−k))^{n−k} · (1 + ε(n, k, n−k)),

where ε = ε(n, k, n−k) is defined in an evident way and ε → 0 as n → ∞, k → ∞, n − k → ∞. Therefore

    P_n(k) = C_n^k p^k q^{n−k} = [1/√(2πn (k/n)(1 − k/n))] (np/k)^k (nq/(n−k))^{n−k} (1 + ε).

Write p̂ = k/n. Then

    P_n(k) = [1/√(2πn p̂(1 − p̂))] (p/p̂)^{np̂} (q/(1 − p̂))^{n(1−p̂)} (1 + ε)
           = [1/√(2πn p̂(1 − p̂))] exp{−n [p̂ ln(p̂/p) + (1 − p̂) ln((1 − p̂)/(1 − p))]} (1 + ε)
           = [1/√(2πn p̂(1 − p̂))] exp{−n H(p̂)} (1 + ε),

where

    H(x) = x ln(x/p) + (1 − x) ln((1 − x)/(1 − p)).

We are considering values of k such that |k − np| = o(npq)^{2/3}, and consequently p̂ − p → 0, n → ∞.
Since, for 0 < x < 1,

    H′(x) = ln(x/p) − ln((1 − x)/(1 − p)),
    H″(x) = 1/x + 1/(1 − x),
    H‴(x) = −1/x² + 1/(1 − x)²,

if we write H(p̂) in the form H(p + (p̂ − p)) and use Taylor's formula, we find that for sufficiently large n

    H(p̂) = H(p) + H′(p)(p̂ − p) + ½H″(p)(p̂ − p)² + O(|p̂ − p|³)
          = ½(1/p + 1/q)(p̂ − p)² + O(|p̂ − p|³),

since H(p) = 0 and H′(p) = 0. Consequently

    P_n(k) = [1/√(2πn p̂(1 − p̂))] exp{−(n/(2pq))(p̂ − p)² + n·O(|p̂ − p|³)} (1 + ε).

Notice that

    (n/(2pq))(p̂ − p)² = (n/(2pq))(k/n − p)² = (k − np)²/(2npq).

Therefore

    P_n(k) = [1/√(2πnpq)] e^{−(k−np)²/(2npq)} (1 + ε′(n, k, n−k)),

where

    1 + ε′(n, k, n−k) = (1 + ε(n, k, n−k)) · √(p(1 − p)/(p̂(1 − p̂))) · exp{n·O(|p̂ − p|³)}

and, as is easily seen,

    sup |ε′(n, k, n−k)| → 0,  n → ∞,

if the sup is taken over the values of k for which |k − np| ≤ φ(n), φ(n) = o(npq)^{2/3}.
This completes the proof.

Corollary. The conclusion of the local limit theorem can be put in the following equivalent form: for all x ∈ R¹ such that x = o(npq)^{1/6}, and for np + x√(npq) an integer from the set {0, 1, …, n},

    P_n(np + x√(npq)) ~ (1/√(2πnpq)) e^{−x²/2},    (7)

i.e. as n → ∞,

    sup_{x: |x| ≤ ψ(n)} | P_n(np + x√(npq)) / [(1/√(2πnpq)) e^{−x²/2}] − 1 | → 0,    (8)

where ψ(n) = o(npq)^{1/6}.

With the reservations made in connection with formula (5.8), we can reformulate these results in probabilistic language in the following way:

    P{S_n = k} ~ (1/√(2πnpq)) e^{−(k−np)²/(2npq)},  |k − np| = o(npq)^{2/3},    (9)

    P{(S_n − np)/√(npq) = x} ~ (1/√(2πnpq)) e^{−x²/2},  x = o(npq)^{1/6}.    (10)

(In the last formula np + x√(npq) is assumed to have one of the values 0, 1, …, n.)
If we put t_k = (k − np)/√(npq) and Δt_k = t_{k+1} − t_k = 1/√(npq), the preceding formula assumes the form

    P{(S_n − np)/√(npq) = t_k} ~ (Δt_k/√(2π)) e^{−t_k²/2}.    (11)

It is clear that Δt_k = 1/√(npq) → 0 and the set of points {t_k} as it were "fills" the real line. It is natural to expect that (11) can be used to obtain the integral formula

    P{a < (S_n − np)/√(npq) ≤ b} ~ (1/√(2π)) ∫_a^b e^{−x²/2} dx,  −∞ < a ≤ b < ∞.

Let us now give a precise statement.

2. For −∞ < a ≤ b < ∞ let

    P_n(a, b] = Σ_{a < x ≤ b} P_n(np + x√(npq)),

where the summation is over those x for which np + x√(npq) is an integer.


It follows from the local theorem (see also (11)) that for all t_k defined by k = np + t_k√(npq) and satisfying |t_k| ≤ T < ∞,

    P_n(np + t_k√(npq)) = (Δt_k/√(2π)) e^{−t_k²/2} [1 + ε(t_k, n)],    (12)

where

    sup_{|t_k| ≤ T} |ε(t_k, n)| → 0,  n → ∞.    (13)

Consequently, if a and b are given so that −T ≤ a ≤ b ≤ T, then

    P_n(a, b] = Σ_{a < t_k ≤ b} (Δt_k/√(2π)) e^{−t_k²/2} + Σ_{a < t_k ≤ b} ε(t_k, n)(Δt_k/√(2π)) e^{−t_k²/2}
              = (1/√(2π)) ∫_a^b e^{−x²/2} dx + R_n^{(1)}(a, b) + R_n^{(2)}(a, b),    (14)

where

    R_n^{(1)}(a, b) = Σ_{a < t_k ≤ b} (Δt_k/√(2π)) e^{−t_k²/2} − (1/√(2π)) ∫_a^b e^{−x²/2} dx,
    R_n^{(2)}(a, b) = Σ_{a < t_k ≤ b} ε(t_k, n)(Δt_k/√(2π)) e^{−t_k²/2}.

From the standard properties of Riemann sums,

    sup_{−T ≤ a ≤ b ≤ T} |R_n^{(1)}(a, b)| → 0,  n → ∞.    (15)

It is also clear that

    sup_{−T ≤ a ≤ b ≤ T} |R_n^{(2)}(a, b)| ≤ sup_{|t_k| ≤ T} |ε(t_k, n)| · [(1/√(2π)) ∫_{−T}^T e^{−x²/2} dx + sup_{−T ≤ a ≤ b ≤ T} |R_n^{(1)}(a, b)|] → 0,    (16)

where the convergence of the right-hand side to zero follows from (15) and from

    (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx ≤ (1/√(2π)) ∫_{−∞}^∞ e^{−x²/2} dx = 1,    (17)

the value of the last integral being well known.


We write

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Then it follows from (14)–(16) that

    sup_{−T ≤ a ≤ b ≤ T} |P_n(a, b] − (Φ(b) − Φ(a))| → 0,  n → ∞.    (18)

We now show that this result holds for T = ∞ as well as for finite T. By (17), corresponding to a given ε > 0 we can find a finite T = T(ε) such that

    (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx > 1 − ¼ε.    (19)

According to (18), we can find an N such that for all n > N and T = T(ε) we have

    sup_{−T ≤ a ≤ b ≤ T} |P_n(a, b] − (Φ(b) − Φ(a))| < ¼ε.    (20)

It follows from this and (19) that

    P_n(−T, T] > 1 − ½ε,

and consequently

    P_n(−∞, −T] + P_n(T, ∞) ≤ ½ε,

where P_n(−∞, T] = lim_{S↓−∞} P_n(S, T] and P_n(T, ∞) = lim_{S↑∞} P_n(T, S]. Therefore for −∞ ≤ a ≤ −T < T ≤ b ≤ ∞,

    |P_n(a, b] − (1/√(2π)) ∫_a^b e^{−x²/2} dx|
        ≤ |P_n(−T, T] − (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx|
          + |P_n(a, −T] − (1/√(2π)) ∫_a^{−T} e^{−x²/2} dx| + |P_n(T, b] − (1/√(2π)) ∫_T^b e^{−x²/2} dx|
        ≤ ¼ε + P_n(−∞, −T] + (1/√(2π)) ∫_{−∞}^{−T} e^{−x²/2} dx + P_n(T, ∞) + (1/√(2π)) ∫_T^∞ e^{−x²/2} dx
        ≤ ¼ε + ½ε + ⅛ε + ⅛ε = ε.

By using (18) it is now easy to see that P_n(a, b] tends uniformly to Φ(b) − Φ(a) for −∞ ≤ a < b ≤ ∞. Thus we have established the following theorem.

De Moivre–Laplace Integral Theorem. Let 0 < p < 1,

    P_n(a, b] = Σ_{a < x ≤ b} P_n(np + x√(npq)),  Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Then

    sup_{−∞ ≤ a < b ≤ ∞} | P_n(a, b] − (1/√(2π)) ∫_a^b e^{−x²/2} dx | → 0,  n → ∞.    (21)

With the same reservations as in (5.8), (21) can be stated in probabilistic language in the following way:

    sup_{−∞ ≤ a < b ≤ ∞} | P{a < (S_n − ES_n)/√(VS_n) ≤ b} − (1/√(2π)) ∫_a^b e^{−x²/2} dx | → 0,  n → ∞.

It follows at once from this formula that

    P{A < S_n ≤ B} − [Φ((B − np)/√(npq)) − Φ((A − np)/√(npq))] → 0,    (22)

as n → ∞, whenever −∞ ≤ A < B ≤ ∞.

EXAMPLE. A true die is tossed 12 000 times. We ask for the probability P that the number of 6's lies in the interval (1800, 2100]. The required probability is

    P = Σ_{k=1801}^{2100} C_{12000}^k (1/6)^k (5/6)^{12000−k}.

An exact calculation of this sum would obviously be rather difficult. However, if we use the integral theorem we find that the probability P in question is approximately (n = 12 000, p = 1/6, A = 1800, B = 2100)

    Φ((2100 − 2000)/√(npq)) − Φ((1800 − 2000)/√(npq)) ≈ Φ(2.449) − Φ(−4.898) ≈ 0.992,

where the values of Φ(2.449) and Φ(−4.898) are taken from tables of Φ(x). (The integral theorem says that P_n(a, b] = P{a√(npq) < S_n − np ≤ b√(npq)} = P{np + a√(npq) < S_n ≤ np + b√(npq)} is closely approximated by the integral (1/√(2π)) ∫_a^b e^{−x²/2} dx.)
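The die example can be reproduced numerically; the following sketch (an illustration, with our own helper names) computes both the normal approximation and the exact binomial sum, the latter with logarithms to avoid underflow:

```python
import math

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 12000, 1 / 6
mu, sigma = n * p, math.sqrt(n * p * (1 - p))     # mu = 2000, sigma ~ 40.8

# Normal (integral-theorem) approximation to P{1800 < S_n <= 2100}:
approx = Phi((2100 - mu) / sigma) - Phi((1800 - mu) / sigma)

# Exact binomial sum over k = 1801, ..., 2100:
def P_n(k):
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))

exact = sum(P_n(k) for k in range(1801, 2101))
print(approx, exact)   # both close to the text's value 0.992
```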

Figure 9. P_n(np + x√(npq)) as a function of x.

We write

    F_n(x) = P_n(−∞, x].

Then it follows from (21) that

    sup_{−∞ ≤ x ≤ ∞} |F_n(x) − Φ(x)| → 0,  n → ∞.    (23)

It is natural to ask how rapid the approach to zero is in (21) and (23), as n → ∞. We quote a result in this direction (a special case of the Berry–Esseen theorem: see §6 in Chapter III):

    sup_{−∞ ≤ x ≤ ∞} |F_n(x) − Φ(x)| ≤ (p² + q²)/√(npq).    (24)

It is important to recognize that the order of the estimate (1/√(npq)) cannot be improved; this means that the approximation of F_n(x) by Φ(x) can be poor for values of p that are close to 0 or 1, even when n is large. This suggests the question of whether there is a better method of approximation for the probabilities of interest when p or q is small, something better than the normal approximation given by the local and integral theorems. In this connection we note that for p = 1/2, say, the binomial distribution {P_n(k)} is symmetric (Figure 10). However, for small p the binomial distribution is asymmetric (Figure 10), and hence it is not reasonable to expect that the normal approximation will be satisfactory.

Figure 10. The binomial probabilities P_n(k) for n = 10: symmetric for p = 1/2, asymmetric for small p.


4. It turns out that for small values of p the distribution known as the Poisson distribution provides a good approximation to {P_n(k)}. Let

    P_n(k) = C_n^k p^k q^{n−k},  k = 0, 1, …, n;  P_n(k) = 0,  k = n + 1, n + 2, …,

and suppose that p is a function p(n) of n.

Poisson's Theorem. Let p(n) → 0, n → ∞, in such a way that np(n) → λ, where λ > 0. Then for k = 0, 1, 2, …,

    P_n(k) → π_k,  n → ∞,    (25)

where

    π_k = λ^k e^{−λ}/k!,  k = 0, 1, ….    (26)

The proof is extremely simple. Since p(n) = λ/n + o(1/n) by hypothesis, for a given k = 0, 1, … and sufficiently large n,

    P_n(k) = C_n^k p^k q^{n−k} = [n(n − 1)⋯(n − k + 1)/k!] [λ/n + o(1/n)]^k [1 − λ/n + o(1/n)]^{n−k}.

But

    n(n − 1)⋯(n − k + 1) [λ/n + o(1/n)]^k → λ^k,  n → ∞,

and

    [1 − λ/n + o(1/n)]^{n−k} → e^{−λ},  n → ∞,

which establishes (25).
The set of numbers {π_k, k = 0, 1, …} defines the Poisson probability distribution (π_k ≥ 0, Σ_{k=0}^∞ π_k = 1). Notice that all the (discrete) distributions considered previously were concentrated at only a finite number of points. The Poisson distribution is the first example that we have encountered of a (discrete) distribution concentrated at a countable number of points.
The following result of Prohorov exhibits the rapidity with which P_n(k) converges to π_k as n → ∞: if np(n) = λ > 0, then

    Σ_{k=0}^∞ |P_n(k) − π_k| ≤ (2λ/n) · min(2, λ).    (27)

(For the proof of (27), see §7 of Chapter III.)


5. Let us return to the De Moivre–Laplace limit theorem, and show how it implies the law of large numbers (with the same reservation that was made in connection with (5.8)). Since

    {|S_n/n − p| ≤ ε} = {|(S_n − np)/√(npq)| ≤ ε√(n/(pq))},

it is clear from (21) that when ε > 0

    P{|S_n/n − p| ≤ ε} − (1/√(2π)) ∫_{−ε√(n/(pq))}^{ε√(n/(pq))} e^{−x²/2} dx → 0,  n → ∞,    (28)

whence

    P{|S_n/n − p| ≤ ε} → 1,  n → ∞,

which is the conclusion of the law of large numbers. From (28),

    P{|S_n/n − p| ≤ ε} ≈ (1/√(2π)) ∫_{−ε√(n/(pq))}^{ε√(n/(pq))} e^{−x²/2} dx,  n → ∞,    (29)

whereas Chebyshev's inequality yielded only

    P{|S_n/n − p| ≤ ε} ≥ 1 − 1/(4nε²).

It was shown at the end of §5 that Chebyshev's inequality yielded the estimate

    n ≥ 1/(4ε²α)

for the number of observations needed for the validity of the inequality

    P{|S_n/n − p| ≤ ε} ≥ 1 − α.

Thus with ε = 0.02 and α = 0.05, 12 500 observations were needed. We can now solve the same problem by using the approximation (29). We define the number k(α) by

    (1/√(2π)) ∫_{−k(α)}^{k(α)} e^{−x²/2} dx = 1 − α.

Since ε√(n/(pq)) ≥ 2ε√n, if we define n as the smallest integer satisfying

    2ε√n ≥ k(α),    (30)

we find that

    P{|S_n/n − p| ≤ ε} ≳ 1 − α.    (31)

We find from (30) that the smallest integer n satisfying

    n ≥ k²(α)/(4ε²)

guarantees that (31) is satisfied, and the accuracy of the approximation can easily be established by using (24). Taking ε = 0.02, α = 0.05, we find that in fact 2500 observations suffice, rather than the 12 500 found by using Chebyshev's inequality. The values of k(α) have been tabulated. We quote a number of values of k(α) for various values of α:

    α:     0.50    0.3173  0.10    0.05    0.0454  0.01    0.0027
    k(α):  0.675   1.000   1.645   1.960   2.000   2.576   3.000
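The quantities in the table and the two competing sample sizes can be computed directly. A sketch for illustration (our own helper names; k(α) is found by bisection on 2Φ(k) − 1 = 1 − α):

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def k_alpha(alpha, lo=0.0, hi=10.0):
    """Solve (1/sqrt(2 pi)) * integral_{-k}^{k} e^(-x^2/2) dx = 2*Phi(k) - 1 = 1 - alpha."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if 2 * Phi(mid) - 1 < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eps, alpha = 0.02, 0.05
n_chebyshev = math.ceil(1 / (4 * eps ** 2 * alpha))          # formula (13): 12500
n_normal = math.ceil(k_alpha(alpha) ** 2 / (4 * eps ** 2))   # n >= k^2(alpha)/(4 eps^2)
print(n_chebyshev, n_normal)
print(round(k_alpha(0.05), 3), round(k_alpha(0.0027), 3))    # ~1.96 and ~3.0, as in the table
```

With the exact k(0.05) ≈ 1.96 this gives about 2401 observations; the text's round figure of 2500 corresponds to taking k(α) = 2.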

6. The function

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt,    (32)

which was introduced above and occurs in the De Moivre–Laplace integral theorem, plays an exceptionally important role in probability theory. It is known as the normal or Gaussian distribution on the real line, with the (normal or Gaussian) density

    φ(x) = (1/√(2π)) e^{−x²/2}.


Figure 11. Graph of the normal probability density φ(x).


Figure 12. Graph of the normal distribution Φ(x).

We have already encountered (discrete) distributions concentrated on a finite or countable set of points. The normal distribution belongs to another important class of distributions that arise in probability theory. We have mentioned its exceptional role; this comes about, first of all, because under rather general hypotheses, sums of a large number of independent random variables (not necessarily Bernoulli variables) are closely approximated by the normal distribution (§4 of Chapter III). For the present we mention only some of the simplest properties of φ(x) and Φ(x), whose graphs are shown in Figures 11 and 12.
The function φ(x) is a symmetric bell-shaped curve, decreasing very rapidly with increasing |x|: thus φ(1) = 0.24197, φ(2) = 0.053991, φ(3) = 0.004432, φ(4) = 0.000134, φ(5) = 0.000016. Its maximum is attained at x = 0 and is equal to (2π)^{−1/2} ≈ 0.399.
The curve Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt approaches 1 very rapidly as x increases: Φ(1) = 0.841345, Φ(2) = 0.977250, Φ(3) = 0.998650, Φ(4) = 0.999968, Φ(4.5) = 0.999997.
For tables of φ(x) and Φ(x), as well as of other important functions that are used in probability theory and mathematical statistics, see [A1].

7. PROBLEMS

1. Let n = 100 and p = 1/10, 1/20, 1/30, 1/40, 1/50. Using tables (for example, those in [A1]) of the binomial and Poisson distributions, compare the values of the probabilities

    P{10 < S_100 ≤ 12},  P{20 < S_100 ≤ 22},  P{33 < S_100 ≤ 35},  P{40 < S_100 ≤ 42},  P{50 < S_100 ≤ 52}

with the corresponding values given by the normal and Poisson approximations.

2. Let p = 1/2 and Z_n = 2S_n − n (the excess of 1's over 0's in n trials). Show that

    sup_j | √(πn) P{Z_{2n} = j} − e^{−j²/(4n)} | → 0,  n → ∞.

3. Show that the rate of convergence in Poisson's theorem is given by

§7. Estimating the Probability of Success in the Bernoulli Scheme

1. In the Bernoulli scheme (Ω, 𝒜, P) with

    Ω = {ω: ω = (x₁, …, x_n), x_i = 0, 1},  𝒜 = {A: A ⊆ Ω},  p(ω) = p^{Σx_i} q^{n−Σx_i},

we supposed that p (the probability of success) was known. Let us now suppose that p is not known in advance and that we want to determine it by observing the outcomes of experiments; or, what amounts to the same thing, by observations of the random variables ξ₁, …, ξ_n, where ξ_i(ω) = x_i. This is a typical problem of mathematical statistics, and can be formulated in various ways. We shall consider two of the possible formulations: the problem of estimation and the problem of constructing confidence intervals.
In the notation used in mathematical statistics, the unknown parameter is denoted by θ, assuming a priori that θ belongs to the set Θ = [0, 1]. We say that the set (Ω, 𝒜, P_θ; θ ∈ Θ) with p_θ(ω) = θ^{Σx_i}(1 − θ)^{n−Σx_i} is a probabilistic-statistical model (corresponding to "n independent trials" with probability of "success" θ ∈ Θ), and any function T_n = T_n(ω) with values in Θ is called an estimator.
If S_n = ξ₁ + ⋯ + ξ_n and T_n* = S_n/n, it follows from the law of large numbers that T_n* is consistent, in the sense that (ε > 0)

    P_θ{|T_n* − θ| ≥ ε} → 0,  n → ∞.    (1)

Moreover, this estimator is unbiased: for every θ,

    E_θ T_n* = θ,    (2)

where E_θ is the expectation corresponding to the probability P_θ.
The property of being unbiased is quite natural: it expresses the fact that any reasonable estimate ought, at least "on the average," to lead to the desired result. However, it is easy to see that T_n* is not the only unbiased estimator. For example, the same property is possessed by every estimator

    T_n = (b₁x₁ + ⋯ + b_n x_n)/n,

where b₁ + ⋯ + b_n = n. Moreover, the law of large numbers (1) is also satisfied by such estimators (at least if |b_i| ≤ K < ∞; see Problem 2(b), §3, Chapter III) and so these estimators T_n are just as "good" as T_n*.
In this connection there arises the question of how to compare different unbiased estimators, and which of them to describe as best, or optimal. With the same meaning of "estimator," it is natural to suppose that an estimator is better, the smaller its deviation from the parameter that is being estimated. On this basis, we call an estimator T̃_n efficient (in the class of unbiased estimators T_n) if

    V_θ T̃_n = inf_{T_n} V_θ T_n,  θ ∈ Θ,    (3)

where V_θ T_n is the dispersion of T_n, i.e. E_θ(T_n − θ)².
Let us show that the estimator T_n* considered above is efficient. We have

    V_θ T_n* = V_θ(S_n/n) = V_θ S_n/n² = nθ(1 − θ)/n² = θ(1 − θ)/n.    (4)

Hence to establish that T_n* is efficient, we have only to show that

    inf_{T_n} V_θ T_n ≥ θ(1 − θ)/n.    (5)

This is obvious for θ = 0 or 1. Let θ ∈ (0, 1) and

    p_θ(x_i) = θ^{x_i}(1 − θ)^{1−x_i}.

It is clear that

    p_θ(ω) = Π_{i=1}^n p_θ(x_i).

Let us write

    L_θ(ω) = ln p_θ(ω).

Then

    L_θ(ω) = ln θ · Σx_i + ln(1 − θ) · Σ(1 − x_i)

and

    ∂L_θ(ω)/∂θ = Σ(x_i − θ)/[θ(1 − θ)].

Since

    1 = Σ_ω p_θ(ω)

and since T_n is unbiased,

    θ = E_θ T_n = Σ_ω T_n(ω) p_θ(ω).

After differentiating with respect to θ, we find that

    0 = Σ_ω ∂p_θ(ω)/∂θ = Σ_ω (∂L_θ(ω)/∂θ) p_θ(ω) = E_θ[∂L_θ(ω)/∂θ],
    1 = Σ_ω T_n(ω)(∂L_θ(ω)/∂θ) p_θ(ω) = E_θ[T_n · ∂L_θ(ω)/∂θ].

Therefore

    1 = E_θ[(T_n − θ) · ∂L_θ(ω)/∂θ],

and by the Cauchy–Bunyakovskii inequality,

    1 ≤ E_θ[T_n − θ]² · E_θ[∂L_θ(ω)/∂θ]²,

whence

    E_θ[T_n − θ]² ≥ 1/I_n(θ),    (6)

where

    I_n(θ) = E_θ[∂L_θ(ω)/∂θ]²

is known as Fisher's information. From (6) we can obtain a special case of the Rao–Cramér inequality for unbiased estimators T_n:

    inf_{T_n} V_θ T_n ≥ 1/I_n(θ).    (7)

In the present case

    I_n(θ) = E_θ[∂L_θ(ω)/∂θ]² = E_θ[Σ(ξ_i − θ)/(θ(1 − θ))]² = nθ(1 − θ)/[θ(1 − θ)]² = n/[θ(1 − θ)],

which also establishes (5), from which, as we already noticed, there follows the efficiency of the unbiased estimator T_n* = S_n/n for the unknown parameter θ.

2. It is evident that, in considering T_n* as a pointwise estimator for θ, we have introduced a certain amount of inaccuracy. It can even happen that the numerical value of T_n* calculated from observations of x₁, …, x_n differs rather severely from the true value θ. Hence it would be advisable to determine the size of the error.
It would be too much to hope that T_n*(ω) differs little from the true value θ for all sample points ω. However, we know from the law of large numbers that for every δ > 0 and for sufficiently large n, the probability of the event {|θ − T_n*(ω)| > δ} will be arbitrarily small. By Chebyshev's inequality

    P_θ{|θ − T_n*| > δ} ≤ V_θ T_n*/δ² = θ(1 − θ)/(nδ²),

and therefore, for every λ > 0,

    P_θ{|θ − T_n*| ≤ λ√(θ(1 − θ)/n)} ≥ 1 − 1/λ².

If we take, for example, λ = 3, then with P_θ-probability greater than 0.8888 (1 − 1/3² = 8/9 ≈ 0.8889) the event

    |θ − T_n*| ≤ 3√(θ(1 − θ)/n)

will be realized, and a fortiori the event

    |θ − T_n*| ≤ 3/(2√n),

since θ(1 − θ) ≤ 1/4. Therefore

    P_θ{|θ − T_n*| ≤ 3/(2√n)} = P_θ{T_n* − 3/(2√n) ≤ θ ≤ T_n* + 3/(2√n)} ≥ 0.8888.

In other words, we can say with probability greater than 0.8888 that the exact value of θ is in the interval [T_n* − 3/(2√n), T_n* + 3/(2√n)]. This statement is sometimes written in the symbolic form

    θ ≈ T_n* ± 3/(2√n)  (≳ 88%),

where "≳ 88%" means "in more than 88% of all cases."
The interval [T_n* − 3/(2√n), T_n* + 3/(2√n)] is an example of what are called confidence intervals for the unknown parameter.
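The coverage guarantee of this interval is easy to test by simulation. A sketch for illustration (the simulation parameters N, n, θ are our own choices):

```python
import random

random.seed(1)

def covers(theta, n):
    """One series of n Bernoulli(theta) trials; does [T* - 3/(2 sqrt n), T* + 3/(2 sqrt n)] contain theta?"""
    s = sum(random.random() < theta for _ in range(n))
    t, half = s / n, 3 / (2 * n ** 0.5)
    return t - half <= theta <= t + half

N, n, theta = 2000, 100, 0.3
coverage = sum(covers(theta, n) for _ in range(N)) / N
print(coverage)   # well above the guaranteed 0.8888 (in fact close to 0.9973, as shown below)
```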

Definition. An interval of the form

    [ψ₁(ω), ψ₂(ω)],

where ψ₁(ω) and ψ₂(ω) are functions of sample points, is called a confidence interval of reliability 1 − δ (or of significance level δ) if

    P_θ{ψ₁(ω) ≤ θ ≤ ψ₂(ω)} ≥ 1 − δ

for all θ ∈ Θ.

I. Elementary Probability Theory

The preceding discussion shows that the interval

[ Tn* -

A Tn* + Jn A J Jn' 2

2

has reliability 1 - (1/A 2 ). In point of fact, the reliability of this confidence interval is considerably higher, since Chebyshev's inequality gives only crude estimates of the probabilities of events. To obtain more precise results we notice that

{w: /8- r:1 s ;.,)8(1: 8)} = {w: t/1 (T:, n):::; 8 s 1/Jz(T:, n)}, 1

where t/1 1 = t/1 1(T:,n) and t/1 2 = t/1 2 (T:,n) are the roots of the quadratic equation (8 - T!) 2

= ;.,z 8(1 - 8),

n which describes an ellipse situated as shown in Figure 13. Now let

Then by (6.24) sup /F(J(x)-
s

1

1

v n8(1 - 8)

Therefore if we know a priori that 0 < Ll ::::;; 8 ::::;; 1 - Ll < 1,

where Ll is a constant, then sup /F 0(x)-
()

Figure 13

s

1

r:

Llv n

73

§7. Estimating the Probability of Success in the Bernoulli Scheme

2:: (2<1>(A) - 1) -

2

r:..

~vn

Let A* be the smallest A for which (2<1>(A) - 1) -

2

r:.

~vn

2:: 1 - <5*,

where <5* is a given significance level. Putting <5 = <5* - (2/~Jn), we find that A* satisfies the equation

For large n we may neglect the term 2/~Jn and assume that A* satisfies ( A*) = 1 -

~ <5*.

In particular, if A.* = 3 then c5* = 0.9973 .... Then with probability approximately 0.9973

r:: -

3J8(1

~ 8) ~ () ~ r:: + 3J8(1 ~ 8)

(8)

or, after iterating and then suppressing terms of order O(n- 3 14 ), we obtain

T,.* - 3

T.*(1 - T.*) n

n

n

~ () ~

T,.*

+3

T.n*(1 - T.n*)

n

(9)

Hence it follows that the confidence interval

[r:- 2~, r: + 2~]

(10)

has (for large n) reliability 0.9973 (whereas Chebyshev's inequality only provided reliability approximately 0.8889). Thus we can make the following practical application. Let us carry out a large number N of series of experiments, in each of which we estimate the parameter () after n observations. Then in about 99.73% of the N cases, in each series the estimate will differ from the true value of the parameter by at most 3/2Jn. (On this topic see also the end of §5.)

74

I. Elementary Probability Theory

3. PROBLEMS 1. Let it be known a priori that 8 has a value in the set 8 estimator fore, taking values only in E>o.

0

c;; [0, 1]. Construct an unbiased

2. Under the hypotheses of the preceding problem, find an analog of the Rao-Cramer inequality and discuss the problem of efficient estimators. 3. Under the hypotheses of the first problem, discuss the construction of confidence intervals for e.

§8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions 1. Let (0, d, P) be a finite probability space and

a decomposition ofO (D; Ed, P(D;) > 0, i = 1, ... , k, and D 1 + · · · + Dk = 0). Also let A be an event from d and P(AID;) the conditional probability of A with respect to D;. With a set of conditional probabilities {P(A ID;), i = 1, ... , k} we may associate the random variable n(w)

=

k

L

P(A ID;)lv,(w)

(1)

i= 1

(cf. (4.5)), that takes the values P(AID;) on the atoms of D;. To emphasize that this random variable is associated specifically with the decomposition fl2, we denote it by P(Aifl2)

or

P(Aifl2)(w)

and call it the conditional probability of the event A with respect to the de-

composition fl2.

This concept, as well as the more general concept of conditional probabilities with respect to au-algebra, which will be introduced later, plays an important role in probability theory, a role that will be developed progressively as we proceed. We mention some of the simplest properties of conditional probabilities: P(A

+ B I fl2)

= P(A I fl2)

+ P(B Ifl2);

(2)

if fl2 is the trivial decomposition consisting of the single set 0 then P(A I0)

=

P(A).

(3)


The definition of P(A|𝒟) as a random variable lets us speak of its expectation; by using this, we can write the formula (3.3) for total probability in the following compact form:

$$\mathsf{E}\,P(A \mid \mathscr{D}) = P(A). \tag{4}$$

In fact, since

$$P(A \mid \mathscr{D}) = \sum_{i=1}^{k} P(A \mid D_i)\, I_{D_i}(\omega),$$

then by the definition of expectation (see (4.5) and (4.6))

$$\mathsf{E}\,P(A \mid \mathscr{D}) = \sum_{i=1}^{k} P(A \mid D_i)\,P(D_i) = \sum_{i=1}^{k} P(A D_i) = P(A).$$
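The total probability formula (4) is easy to check numerically. The following sketch (not from the book; the six-point space, the decomposition and the event are arbitrary choices) computes E P(A|𝒟) in exact rational arithmetic:

```python
from fractions import Fraction

# Hypothetical finite probability space: outcomes 0..5, equiprobable.
P = {w: Fraction(1, 6) for w in range(6)}

# A decomposition of Omega into three atoms, and an arbitrary event A.
D = [{0, 1}, {2, 3}, {4, 5}]
A = {1, 2, 4}

def prob(S):
    return sum(P[w] for w in S)

def cond_prob(A, Di):
    return prob(A & Di) / prob(Di)

# P(A | D) takes the value P(A | D_i) on the atom D_i; its expectation
# is sum_i P(A | D_i) P(D_i), which must equal P(A) by formula (4).
expectation = sum(cond_prob(A, Di) * prob(Di) for Di in D)
assert expectation == prob(A)
print(expectation)
```

The same computation works for any finite space and decomposition, since it is just the total probability formula (3.3) rewritten.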

Now let η = η(ω) be a random variable that takes the values y₁, …, y_k with positive probabilities:

$$\eta(\omega) = \sum_{j=1}^{k} y_j I_{D_j}(\omega),$$

where D_j = {ω: η(ω) = y_j}. The decomposition 𝒟_η = {D₁, …, D_k} is called the decomposition induced by η. The conditional probability P(A|𝒟_η) will be denoted by P(A|η) or P(A|η)(ω), and called the conditional probability of A with respect to the random variable η. We also denote by P(A|η = y_j) the conditional probability P(A|D_j), where D_j = {ω: η(ω) = y_j}.

Similarly, if η₁, η₂, …, η_m are random variables and 𝒟_{η₁, η₂, …, η_m} is the decomposition induced by η₁, η₂, …, η_m, with atoms

$$D_{y_1, y_2, \ldots, y_m} = \{\omega\colon \eta_1(\omega) = y_1, \ldots, \eta_m(\omega) = y_m\},$$

then P(A|𝒟_{η₁, η₂, …, η_m}) will be denoted by P(A|η₁, η₂, …, η_m) and called the conditional probability of A with respect to η₁, η₂, …, η_m.

Example 1. Let ξ and η be independent identically distributed random variables, each taking the values 1 and 0 with probabilities p and q. For k = 0, 1, 2, let us find the conditional probability P(ξ + η = k | η) of the event A = {ω: ξ + η = k} with respect to η.

To do this, we first notice the following useful general fact: if ξ and η are independent random variables with respective values x and y, then

$$P(\xi + \eta = z \mid \eta = y) = P(\xi + y = z). \tag{5}$$

In fact,

$$P(\xi + \eta = z \mid \eta = y) = \frac{P(\xi + \eta = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z)\,P(\eta = y)}{P(\eta = y)} = P(\xi + y = z).$$


Using this formula for the case at hand, we find that

$$P(\xi + \eta = k \mid \eta) = P(\xi + \eta = k \mid \eta = 0)\,I_{\{\eta = 0\}}(\omega) + P(\xi + \eta = k \mid \eta = 1)\,I_{\{\eta = 1\}}(\omega) = P(\xi = k)\,I_{\{\eta = 0\}}(\omega) + P(\xi = k - 1)\,I_{\{\eta = 1\}}(\omega).$$

Thus

$$P(\xi + \eta = k \mid \eta) = P(\xi = k)\,I_{\{\eta = 0\}}(\omega) + P(\xi = k - 1)\,I_{\{\eta = 1\}}(\omega), \tag{6}$$

or equivalently

$$P(\xi + \eta = k \mid \eta) = \begin{cases} q(1 - \eta), & k = 0, \\ p(1 - \eta) + q\eta, & k = 1, \\ p\eta, & k = 2. \end{cases} \tag{7}$$

2. Let ξ = ξ(ω) be a random variable with values in the set X = {x₁, …, x_l}:

$$\xi = \sum_{j=1}^{l} x_j I_{A_j}(\omega),$$
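Formula (7) can be confirmed by direct computation on the joint distribution of ξ and η. In the following sketch the value p = 3/5 is an arbitrary choice, not from the book:

```python
from fractions import Fraction
from itertools import product

p = Fraction(3, 5)
q = 1 - p

# Joint distribution of two independent Bernoulli variables xi, eta in {0, 1}.
dist = {(x, e): (p if x == 1 else q) * (p if e == 1 else q)
        for x, e in product((0, 1), repeat=2)}

def cond(k, e):
    """P(xi + eta = k | eta = e), computed from the joint distribution."""
    num = sum(pr for (x, ee), pr in dist.items() if x + ee == k and ee == e)
    den = sum(pr for (x, ee), pr in dist.items() if ee == e)
    return num / den

# Check the three cases of formula (7) for both values of eta.
for e in (0, 1):
    assert cond(0, e) == q * (1 - e)
    assert cond(1, e) == p * (1 - e) + q * e
    assert cond(2, e) == p * e
```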

and let 𝒟 = {D₁, …, D_k} be a decomposition. Just as we defined the expectation of ξ in terms of the probabilities P(A_j), j = 1, …, l,

$$\mathsf{E}\xi = \sum_{j=1}^{l} x_j P(A_j), \tag{8}$$

it is now natural to define the conditional expectation of ξ with respect to 𝒟 by using the conditional probabilities P(A_j|𝒟), j = 1, …, l. We denote this expectation by E(ξ|𝒟) or E(ξ|𝒟)(ω), and define it by the formula

$$\mathsf{E}(\xi \mid \mathscr{D}) = \sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}). \tag{9}$$

According to this definition the conditional expectation E(ξ|𝒟)(ω) is a random variable which, at all sample points ω belonging to the same atom Dᵢ, takes the same value $\sum_{j=1}^{l} x_j P(A_j \mid D_i)$. This observation shows that the definition of E(ξ|𝒟) could have been expressed differently. In fact, we could first define E(ξ|Dᵢ), the conditional expectation of ξ with respect to Dᵢ, by

$$\mathsf{E}(\xi \mid D_i) = \sum_{j=1}^{l} x_j P(A_j \mid D_i), \tag{10}$$

and then define

$$\mathsf{E}(\xi \mid \mathscr{D})(\omega) = \sum_{i=1}^{k} \mathsf{E}(\xi \mid D_i)\,I_{D_i}(\omega) \tag{11}$$

(see the diagram in Figure 14).

Figure 14. Diagram of the relations, via formulas (3.1), (8), (1), (9), (10) and (11), among P(·), Eξ, P(·|D), E(ξ|D), P(·|𝒟) and E(ξ|𝒟).

It is also useful to notice that E(ξ|D) and E(ξ|𝒟) are independent of the representation of ξ.

The following properties of conditional expectations follow immediately from the definitions:

$$\mathsf{E}(a\xi + b\eta \mid \mathscr{D}) = a\,\mathsf{E}(\xi \mid \mathscr{D}) + b\,\mathsf{E}(\eta \mid \mathscr{D}), \quad a \text{ and } b \text{ constants}; \tag{12}$$

$$\mathsf{E}(\xi \mid \Omega) = \mathsf{E}\xi; \tag{13}$$

$$\mathsf{E}(C \mid \mathscr{D}) = C, \quad C \text{ constant}; \tag{14}$$

$$\text{if } \xi = I_A, \text{ then } \mathsf{E}(\xi \mid \mathscr{D}) = P(A \mid \mathscr{D}). \tag{15}$$

The last equation shows, in particular, that properties of conditional probabilities can be deduced directly from properties of conditional expectations.

The following important property generalizes the formula (4) for total probability:

$$\mathsf{E}\,\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\xi. \tag{16}$$

For the proof, it is enough to notice that, by (4),

$$\mathsf{E}\,\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}) = \sum_{j=1}^{l} x_j\,\mathsf{E}\,P(A_j \mid \mathscr{D}) = \sum_{j=1}^{l} x_j P(A_j) = \mathsf{E}\xi.$$

Let 𝒟 = {D₁, …, D_k} be a decomposition and η = η(ω) a random variable. We say that η is measurable with respect to this decomposition, or 𝒟-measurable, if 𝒟_η ≼ 𝒟, i.e. η = η(ω) can be represented in the form

$$\eta(\omega) = \sum_{i=1}^{k} y_i I_{D_i}(\omega),$$

where some of the yᵢ might be equal. In other words, a random variable is 𝒟-measurable if and only if it takes constant values on the atoms of 𝒟.

Example 2. If 𝒟 is the trivial decomposition, 𝒟 = {Ω}, then η is 𝒟-measurable if and only if η ≡ C, where C is a constant. Every random variable η is measurable with respect to 𝒟_η.


Suppose that the random variable η is 𝒟-measurable. Then

$$\mathsf{E}(\xi\eta \mid \mathscr{D}) = \eta\,\mathsf{E}(\xi \mid \mathscr{D}), \tag{17}$$

and in particular

$$\mathsf{E}(\eta \mid \mathscr{D}_\eta) = \eta. \tag{18}$$

To establish (17) we observe that if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, then

$$\xi\eta = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i I_{A_j D_i}$$

and therefore

$$\mathsf{E}(\xi\eta \mid \mathscr{D}) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i P(A_j D_i \mid \mathscr{D}) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i \sum_{m=1}^{k} P(A_j D_i \mid D_m)\,I_{D_m}(\omega) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i P(A_j \mid D_i)\,I_{D_i}(\omega). \tag{19}$$

On the other hand, since $I_{D_i}^2 = I_{D_i}$ and $I_{D_i} \cdot I_{D_m} = 0$ for $i \neq m$, we obtain

$$\eta\,\mathsf{E}(\xi \mid \mathscr{D}) = \Bigl[\sum_{i=1}^{k} y_i I_{D_i}(\omega)\Bigr] \cdot \Bigl[\sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D})\Bigr] = \Bigl[\sum_{i=1}^{k} y_i I_{D_i}(\omega)\Bigr] \cdot \Bigl[\sum_{m=1}^{k} \Bigl(\sum_{j=1}^{l} x_j P(A_j \mid D_m)\Bigr) I_{D_m}(\omega)\Bigr] = \sum_{i=1}^{k} \sum_{j=1}^{l} y_i x_j P(A_j \mid D_i)\,I_{D_i}(\omega),$$
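Property (17) can be illustrated numerically. The sketch below (the six-point space, ξ, η and the decomposition are arbitrary choices, not from the book) checks E(ξη|𝒟) = η E(ξ|𝒟) at every sample point; η is constant on atoms and hence 𝒟-measurable:

```python
from fractions import Fraction

# Hypothetical example: six equiprobable outcomes, three atoms.
P = {w: Fraction(1, 6) for w in range(6)}
atoms = [{0, 1}, {2, 3}, {4, 5}]
xi  = {0: 3, 1: -1, 2: 0, 3: 2, 4: 5, 5: 1}    # arbitrary random variable
eta = {0: 2, 1: 2, 2: -1, 3: -1, 4: 0, 5: 0}   # constant on atoms => D-measurable

def cond_exp(f, w):
    """E(f | D)(w): the weighted average of f over the atom containing w."""
    Di = next(a for a in atoms if w in a)
    return sum(f[v] * P[v] for v in Di) / sum(P[v] for v in Di)

for w in range(6):
    lhs = cond_exp({v: xi[v] * eta[v] for v in range(6)}, w)  # E(xi*eta | D)(w)
    rhs = eta[w] * cond_exp(xi, w)                            # eta(w) E(xi | D)(w)
    assert lhs == rhs
```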

which, with (19), establishes (17).

We shall establish another important property of conditional expectations. Let 𝒟₁ and 𝒟₂ be two decompositions, with 𝒟₁ ≼ 𝒟₂ (𝒟₂ is "finer" than 𝒟₁). Then

$$\mathsf{E}[\mathsf{E}(\xi \mid \mathscr{D}_2) \mid \mathscr{D}_1] = \mathsf{E}(\xi \mid \mathscr{D}_1). \tag{20}$$

For the proof, suppose that 𝒟₁ = {D₁₁, …, D_{1m}} and 𝒟₂ = {D₂₁, …, D_{2n}}. Then if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, we have

$$\mathsf{E}(\xi \mid \mathscr{D}_2) = \sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}_2),$$


and it is sufficient to establish that

$$\mathsf{E}[P(A_j \mid \mathscr{D}_2) \mid \mathscr{D}_1] = P(A_j \mid \mathscr{D}_1). \tag{21}$$

Since

$$P(A_j \mid \mathscr{D}_2) = \sum_{q=1}^{n} P(A_j \mid D_{2q})\,I_{D_{2q}},$$

we have

$$\mathsf{E}[P(A_j \mid \mathscr{D}_2) \mid \mathscr{D}_1] = \sum_{q=1}^{n} P(A_j \mid D_{2q})\,P(D_{2q} \mid \mathscr{D}_1) = \sum_{p=1}^{m} I_{D_{1p}} \sum_{q=1}^{n} P(A_j \mid D_{2q})\,P(D_{2q} \mid D_{1p}) = \sum_{p=1}^{m} I_{D_{1p}} \sum_{\{q\colon D_{2q} \subseteq D_{1p}\}} P(A_j \mid D_{2q})\,P(D_{2q} \mid D_{1p}) = \sum_{p=1}^{m} I_{D_{1p}}\,P(A_j \mid D_{1p}) = P(A_j \mid \mathscr{D}_1),$$

which establishes (21).

When 𝒟 is induced by the random variables η₁, …, η_k (𝒟 = 𝒟_{η₁, …, η_k}), the conditional expectation E(ξ|𝒟_{η₁, …, η_k}) is denoted by E(ξ|η₁, …, η_k), or E(ξ|η₁, …, η_k)(ω), and is called the conditional expectation of ξ with respect to η₁, …, η_k.

It follows immediately from the definition of E(ξ|η) that if ξ and η are independent, then

$$\mathsf{E}(\xi \mid \eta) = \mathsf{E}\xi. \tag{22}$$

From (18) it also follows that

$$\mathsf{E}(\eta \mid \eta) = \eta. \tag{23}$$

Property (22) admits the following generalization. Let ξ be independent of 𝒟 (i.e. for each Dᵢ ∈ 𝒟 the random variables ξ and $I_{D_i}$ are independent). Then

$$\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\xi. \tag{24}$$

As a special case of (20) we obtain the following useful formula:

$$\mathsf{E}[\mathsf{E}(\xi \mid \eta_1, \eta_2) \mid \eta_1] = \mathsf{E}(\xi \mid \eta_1). \tag{25}$$
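The telescoping property (20) can be checked by exact computation; in the following sketch the eight-point space, the variable ξ and the two nested decompositions are arbitrary choices, not from the book:

```python
from fractions import Fraction

P = {w: Fraction(1, 8) for w in range(8)}
xi = {w: w * w - 3 for w in range(8)}       # arbitrary random variable
D2 = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]       # finer decomposition
D1 = [{0, 1, 2, 3}, {4, 5, 6, 7}]           # coarser: each atom is a union of D2-atoms

def cond_exp(f, decomp):
    """Return E(f | decomposition) as a dict w -> value (atom averages)."""
    out = {}
    for atom in decomp:
        avg = sum(f[v] * P[v] for v in atom) / sum(P[v] for v in atom)
        for v in atom:
            out[v] = avg
    return out

inner = cond_exp(xi, D2)          # E(xi | D2)
lhs = cond_exp(inner, D1)         # E[ E(xi | D2) | D1 ]
rhs = cond_exp(xi, D1)            # E(xi | D1)
assert lhs == rhs                 # property (20)
```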


Example 3. Let us find E(ξ + η | η) for the random variables ξ and η considered in Example 1. By (22) and (23),

$$\mathsf{E}(\xi + \eta \mid \eta) = \mathsf{E}(\xi \mid \eta) + \mathsf{E}(\eta \mid \eta) = \mathsf{E}\xi + \eta = p + \eta.$$

This result can also be obtained by starting from (8):

$$\mathsf{E}(\xi + \eta \mid \eta) = \sum_{k=0}^{2} k\,P(\xi + \eta = k \mid \eta) = p(1 - \eta) + q\eta + 2p\eta = p + \eta.$$

Example 4. Let ξ and η be independent and identically distributed random variables. Then

$$\mathsf{E}(\xi \mid \xi + \eta) = \mathsf{E}(\eta \mid \xi + \eta) = \frac{\xi + \eta}{2}. \tag{26}$$

In fact, if we assume for simplicity that ξ and η take the values 1, 2, …, m, then for 1 ≤ k ≤ m and 2 ≤ l ≤ 2m,

$$P(\xi = k \mid \xi + \eta = l) = \frac{P(\xi = k,\ \eta = l - k)}{P(\xi + \eta = l)} = \frac{P(\xi = k)\,P(\eta = l - k)}{P(\xi + \eta = l)} = \frac{P(\eta = k)\,P(\xi = l - k)}{P(\xi + \eta = l)} = P(\eta = k \mid \xi + \eta = l).$$

This establishes the first equation in (26). To prove the second, it is enough to notice that

$$\mathsf{E}(\xi \mid \xi + \eta) + \mathsf{E}(\eta \mid \xi + \eta) = \mathsf{E}(\xi + \eta \mid \xi + \eta) = \xi + \eta.$$
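Equation (26) can be verified directly for a concrete distribution; here ξ and η are taken (arbitrarily, for illustration only) to be i.i.d. uniform on {1, 2, 3}:

```python
from fractions import Fraction
from itertools import product

# Hypothetical i.i.d. pair xi, eta, uniform on {1, 2, 3}.
vals = (1, 2, 3)
joint = {(x, y): Fraction(1, 9) for x, y in product(vals, repeat=2)}

def cond_exp_given_sum(l):
    """E(xi | xi + eta = l), computed from the joint distribution."""
    num = sum(x * pr for (x, y), pr in joint.items() if x + y == l)
    den = sum(pr for (x, y), pr in joint.items() if x + y == l)
    return num / den

# Equation (26): E(xi | xi + eta = l) = l / 2 for every attainable l.
for l in range(2, 7):
    assert cond_exp_given_sum(l) == Fraction(l, 2)
```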

3. We have already noticed in §1 that to each decomposition 𝒟 = {D₁, …, D_k} of the finite set Ω there corresponds an algebra α(𝒟) of subsets of Ω. The converse is also true: every algebra ℬ of subsets of the finite space Ω generates a decomposition 𝒟 (ℬ = α(𝒟)). Consequently there is a one-to-one correspondence between algebras and decompositions of a finite space Ω. This should be kept in mind in connection with the concept, which will be introduced later, of conditional expectation with respect to the special systems of sets called σ-algebras. For finite spaces, the concepts of algebra and σ-algebra coincide. It will turn out that if ℬ is an algebra, the conditional expectation E(ξ|ℬ) of a random variable ξ with respect to ℬ (to be introduced in §7 of Chapter II) simply coincides with E(ξ|𝒟), the expectation of ξ with respect to the decomposition 𝒟 such that ℬ = α(𝒟). In this sense we can, in dealing with finite spaces in the future, not distinguish between E(ξ|ℬ) and E(ξ|𝒟), understanding in each case that E(ξ|ℬ) is simply defined to be E(ξ|𝒟).


4. PROBLEMS

1. Give an example of random variables ξ and η which are not independent but for which

$$\mathsf{E}(\xi \mid \eta) = \mathsf{E}\xi.$$

(Cf. (22).)

2. The conditional variance of ξ with respect to 𝒟 is the random variable

$$\mathsf{V}(\xi \mid \mathscr{D}) = \mathsf{E}[(\xi - \mathsf{E}(\xi \mid \mathscr{D}))^2 \mid \mathscr{D}].$$

Show that

$$\mathsf{V}\xi = \mathsf{E}\,\mathsf{V}(\xi \mid \mathscr{D}) + \mathsf{V}\,\mathsf{E}(\xi \mid \mathscr{D}).$$

3. Starting from (17), show that for every function f = f(η) the conditional expectation E(ξ|η) has the property

$$\mathsf{E}[f(\eta)\,\mathsf{E}(\xi \mid \eta)] = \mathsf{E}[\xi f(\eta)].$$

4. Let ξ and η be random variables. Show that inf_f E(η − f(ξ))² is attained for f*(ξ) = E(η|ξ). (Consequently, the best estimator for η in terms of ξ, in the mean-square sense, is the conditional expectation E(η|ξ).)

5. Let ξ₁, …, ξ_n, τ be independent random variables, where ξ₁, …, ξ_n are identically distributed and τ takes the values 1, 2, …, n. Show that if S_τ = ξ₁ + ⋯ + ξ_τ is the sum of a random number of the random variables, then

$$\mathsf{E}(S_\tau \mid \tau) = \tau\,\mathsf{E}\xi_1 \quad \text{and} \quad \mathsf{E} S_\tau = \mathsf{E}\tau \cdot \mathsf{E}\xi_1.$$

6. Establish equation (24).

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 1. The value of the limit theorems of §6 for Bernoulli schemes is not just that they provide convenient formulas for calculating probabilities P(Sn = k) and P(A < Sn ::;; B). They have the additional significance of being of a universal nature, i.e. they remain useful not only for independent Bernoulli random variables that have only two values, but also for variables of much more general character. In this sense the Bernoulli scheme appears as the simplest model, on the basis of which we can recognize many probabilistic regularities which are inherent also in much more general models. In this and the next section we shall discuss a number of new probabilistic regularities, some of which are quite surprising. The ones that we discuss are


again based on the Bernoulli scheme, although many results on the nature of random oscillations remain valid for random walks of a more general kind.

2. Consider the Bernoulli scheme (Ω, 𝒜, P), where Ω = {ω: ω = (x₁, …, x_n), xᵢ = ±1}, 𝒜 consists of all subsets of Ω, and p(ω) = p^{ν(ω)} q^{n−ν(ω)}, with ν(ω) = (Σ xᵢ + n)/2. Let ξᵢ(ω) = xᵢ, i = 1, …, n. Then, as we know, the sequence ξ₁, …, ξ_n is a sequence of independent Bernoulli random variables with

$$P(\xi_i = 1) = p, \qquad P(\xi_i = -1) = q, \qquad p + q = 1.$$

Let us put S₀ = 0, S_k = ξ₁ + ⋯ + ξ_k, 1 ≤ k ≤ n. The sequence S₀, S₁, …, S_n can be considered as the path of the random motion of a particle starting at zero. Here S_{k+1} = S_k + ξ_{k+1}, i.e. if the particle has reached the point S_k at time k, then at time k + 1 it is displaced either one unit up (with probability p) or one unit down (with probability q).

Let A and B be integers, A ≤ 0 ≤ B. An interesting problem about this random walk is to find the probability that after n steps the moving particle has left the interval (A, B). It is also of interest to ask with what probability the particle leaves (A, B) at A or at B.

That these are natural questions becomes particularly clear if we interpret them in terms of a gambling game. Consider two players (first and second) who start with respective bankrolls (−A) and B. If ξᵢ = +1, we suppose that the second player pays one unit to the first; if ξᵢ = −1, the first pays the second. Then S_k = ξ₁ + ⋯ + ξ_k can be interpreted as the amount won by the first player from the second (if S_k < 0, this is actually the amount lost by the first player to the second) after k turns.

At the instant k ≤ n at which for the first time S_k = B (S_k = A), the bankroll of the second (first) player is reduced to zero; in other words, that player is ruined. (If k < n, we suppose that the game ends at time k, although the random walk itself is well defined up to time n, inclusive.)

Before we turn to a precise formulation, let us introduce some notation. Let x be an integer in the interval [A, B], and for 0 ≤ k ≤ n let $S_k^x = x + S_k$ and

$$\tau_k^x = \min\{0 \le l \le k\colon S_l^x = A \text{ or } B\}, \tag{1}$$

where we agree to take τ_k^x = k if A < S_l^x < B for all 0 ≤ l ≤ k. For each k in 0 ≤ k ≤ n and x ∈ [A, B], the instant τ_k^x, called a stopping time (see §11), is an integer-valued random variable defined on the sample space Ω (the dependence of τ_k^x on n is not explicitly indicated).

It is clear that for all l < k the set {ω: τ_k^x = l} is the event that the random walk {S_i^x, 0 ≤ i ≤ k}, starting at time zero at the point x, leaves the interval (A, B) at time l. It is also clear that when l ≤ k the sets {ω: τ_k^x = l, S_l^x = A} and {ω: τ_k^x = l, S_l^x = B} represent the events that the wandering particle leaves the interval (A, B) at time l through A or B, respectively.

For 0 ≤ k ≤ n, we write

$$\mathscr{A}_k^x = \sum_{0 \le l \le k} \{\omega\colon \tau_k^x = l,\ S_l^x = A\}, \qquad \mathscr{B}_k^x = \sum_{0 \le l \le k} \{\omega\colon \tau_k^x = l,\ S_l^x = B\}, \tag{2}$$

and let

$$\alpha_k(x) = P(\mathscr{A}_k^x), \qquad \beta_k(x) = P(\mathscr{B}_k^x)$$

be the probabilities that the particle leaves (A, B) through A or B, respectively, during the time interval [0, k]. For these probabilities we can find recurrent relations from which we can successively determine α₁(x), …, α_n(x) and β₁(x), …, β_n(x).

Let, then, A < x < B. By the formula for total probability,

$$\beta_k(x) = P(\mathscr{B}_k^x) = P(\mathscr{B}_k^x \mid S_1^x = x + 1)\,P(\xi_1 = 1) + P(\mathscr{B}_k^x \mid S_1^x = x - 1)\,P(\xi_1 = -1) = p\,P(\mathscr{B}_k^x \mid S_1^x = x + 1) + q\,P(\mathscr{B}_k^x \mid S_1^x = x - 1). \tag{3}$$

We now show that

$$P(\mathscr{B}_k^x \mid S_1^x = x + 1) = P(\mathscr{B}_{k-1}^{x+1}), \qquad P(\mathscr{B}_k^x \mid S_1^x = x - 1) = P(\mathscr{B}_{k-1}^{x-1}).$$

To do this, we notice that ℬ_k^x can be represented in the form

$$\mathscr{B}_k^x = \{\omega\colon (x,\ x + \xi_1,\ \ldots,\ x + \xi_1 + \cdots + \xi_k) \in B_k^x\},$$

where B_k^x is the set of paths of the form (x, x + x₁, …, x + x₁ + ⋯ + x_k) with xᵢ = ±1 which during the time [0, k] first leave (A, B) at B (Figure 15).

Figure 15. Example of a path from the set B_k^x.
84

I. Elementary Probability Theory

We represent B~ in the form B~,x+ 1 + B~,x- 1 , where B~·x+ 1 and BV- 1 are the paths in B~ for which x 1 = + 1 or x 1 = -1, respectively. Notice that the paths (x, x + 1, x + 1 + x 2 , ••• , x + 1 + x 2 + · · · + xk) in B~·x+ 1 are in one-to-one correspondence with the paths (x

+ 1, x + 1 + x 2 , ••• , x + 1 + x 2, ••• , x + 1 + X 2 + · · · + xk)

in B~~ ~. The same is true for the paths in B~·x- 1• Using these facts, together with independence, the identical distribution of ~ 1 , ... , ~b and (8.6), we obtain P(Bi~ISf = X

+ 1)

= P(Bi~l~ 1 = 1)

= P{(x,x + ~t>····x + ~ 1 + ··· + ~k)eB~I~t = 1} + 1, x + 1 + ~ 2 , ••• , x + 1 + ~2 + · · · + ~k)eB~~D P{(x + 1,x + 1 + ~1, .•. ,x + 1 + ~1 + ··· + ~k-t)eB~~n

= P{(x =

= P(Bi~~D-

In the same way, P(Bi~ISf

=X-

1) = P(Bi~=D·

Consequently, by (3) with x e (A, B) and k Pk(x) = PPk-t(x

~

n,

+ 1) + qflk-t(x-

1),

(4)

1:::;;, n.

(5)

where

{J1(B)

=

1,

{J1(A)

= 0,

0

~

Similarly (6)

with

1X1(A) = 1,

1X1(B) =

0,

O~l~n.

Since tX 0 (x) = {J 0 (x) = 0, x e (A, B), these recurrent relations can (at least in principle) be solved for the probabilities

1X1(x), ... , IX"(x)

and

{J 1(x), ... , fln(x).

Putting aside any explicit calculation of the probabilities, we ask for their values for large n. For this purpose we notice that since Bi~ _ 1 c: Bi~, k ~ n, we have Pk- 1(x) ~ {Jk(x) ~ 1. It is therefore natural to expect (and this is actually the case; see Subsection 3) that for sufficiently large n the probability fln(x) will be close to the solution {J(x) of the equation {J(x) = p{J(x

+

1)

+ q{J(x -

1)

(7)

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

85

with the boundary conditions

f3(B) = 1,

f3(A) = 0,

(8)

that result from a formal approach to the limit in (4) and (5). To solve the problem in (7) and (8), we first suppose that p =1= q. We see easily that the equation has the two particular solutions a and b(qjpy, where a and b are constants. Hence we look for a solution of the form f3(x)

=

a

+ b(qjpy.

(9)

Taking account of (8), we find that for A ::;; x ::;; B f3(x)

= (q/pY - (q/p)A.

(10)

(qjp)B _ (qjp)A

Let us show that this is the only solution of our problem. It is enough to show that all solutions of the problem in (7) and (8) admit the representation (9). Let P(x) be a solution with P(A) = 0, P(B) = 1. We can always find constants aand b such that

a + b(qjp)A =

b(A),

a+

b(q/p)A+l = P(A

a+

b(q/p)A+ 2

+ 1).

Then it follows from (7) that P(A

+ 2)

=

and generally P(x) =

a + b(qjpy.

Consequently the solution (10) is the only solution of our problem. A similar discussion shows that the only solution of a(x)

= pa(x + 1) + qa(x - 1),

XE(A, B)

(11)

with the boundary conditions a(A) = 1,

a(B)

=0

(12)

A::;;x::;;B.

(13)

is given by the formula (p/q)B - (qjpy a(x) = (p/q)B _ (pjp)A'

If p = q = respectively

t, the only solutions f3(x) and a(x) of (7), (8) and (11 ), ( 12) are f3(x)

x-A

= B- A

(14)

and

B-x

a(x) = B- A.

(15)

86

I. Elementary Probability Theory

We note that

a(x)

+ fJ(x) =

1

(16)

for 0 s; p s; 1. We call a(x) and {J(x) the probabilities of ruin for the first and second players, respectively (when the first player's bankroll is x - A, and the second player's is B - x) under the assumption of infinitely many turns, which of course presupposes an infinite sequence of independent Bernoulli random variables ~b ~ 2 , ••. , where~;= + 1 is treated as a gain for the first player, and ~; = -1 as a loss. The probability space (Q, d, P) considered at the beginning of this section turns out to be too small to allow such an infinite sequence of independent variables. We shall see later that such a sequence can actually be constructed and that {J(x) and a(x) are in fact the probabilities of ruin in an unbounded number of steps. We now take up some corollaries of the preceding formulas. If we take A = 0, 0 s; x s; B, then the definition of {J(x) implies that this is the probability that a particle starting at x arrives at B before it reaches 0. It follows from (10) and (14) (Figure 16) that

fJ(x) =

xjB, p = q { (qjpy - 1

= !,

(qjp)B - 1' p =f- q.

(17)

Now let q > p, which means that the game is unfavorable for the first player, whose limiting probability of being ruined, namely a = a(O), is given by

(qjp)B - 1 (q/pl - (qjp)A .

G(=-~----c

Next suppose that the rules of the game are changed: the original bankrolls of the players are still (-A) and B, but the payoff for each player is now !, {J(x)

X

Figure 16. Graph of fl(x), the probability that a particle starting from x reaches B before reaching 0.

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

87

rather than 1 as before. In other words, now let P(ei = !) = p, P(ei = -!) = q. In this case let us denote the limiting probability of ruin for the first player by a. 112 • Then (qjp)2B _ 1 r:J.l/2 = (qjp)2B _ (qjp)2A' and therefore =

r:J.l/2

r:J.'

(qjp)B + 1 (qjp)B + (qjp)A > r:t.,

if q > p. Hence we can draw the following conclusion:

if the game is unfavorable to the .first player (i.e., q > p) then doubling the stake decreases the probability of ruin. 3. We now turn to the question of how fast r:t.n(x) and Pn(x) approach their limiting values a.(x) and P(x). Let us suppose for simplicity that x = 0 and put

r:l.n = r:t.n(O), It is clear that

Yn = P{A < Sk < B, 0 where {A < Sk < B, 0

k

~

~

~

k

~

n},

n} denotes the event

n

{A< Sk < B}.

O!>k!>n

Let n = rm, where r and m are integers and

+ ... + em, em+ 1 + ... + e2m•

el = el

e2

=

e, =

em(r-1)+ 1

+ ... + erm•

Then if C = IA I + B, it is easy to see that {A

< Sk < B, 1 ~ k

and therefore, since ( 1 ,

••• ,

~

rm} £ {1( 1 1 < C, ... , I(, I < C},

(,are independent and identically distributed,

Yn ~ P{ let I < C, ... , I(, I < C} = We notice that ciently large m,

Ve

1

r

0 P{l(d < C} =

= m[1 - (p- q) 2 ]. Hence, for 0

P{l( 1 1 < c}:::; where 8 1 < 1, since

(P{I(tl < C})'.

ve 1

(18)

i=l

81,

~ C 2 if P{IC 1 1 ~ C}

= 1.

< p < 1 and suffi(19)

88

I. Elementary Probability Theory

If p = 0 or p = 1, then P{ i(d < C} = 0 for sufficiently large m, and consequently (19) is satisfied for 0 ::::;; p ::::;; 1. It follows from (18) and (19) that for sufficiently large n

Yn::::;; e", where e = e~fm < 1. According to (16), oc

+ f3

(20)

= 1. Therefore

(oc - OCn)

+ ({3 - f3n)

=

Yn,

and since oc ~ ocn, f3 ~ f3n, we have 0 ::::;; oc - ocn ::::;; Yn ::::;; e",

0 ::::;; f3 - /3n ::::;; Yn ::::;; e",

e < 1.

There are similar inequalities for the differences oc(x) - ocn(x) and f3(x)- /3n(x). 4. We now consider the question of the mean duration of the random walk. Let mk(x) = E-r: be the expectation of the stopping time -r:,k::::;; n. Proceeding as in the derivation of the recurrent relations for f3x(x), we find that, for x E (A, B), mk(x)

= E-r: = = =

I

1 :s;l:s;k

I

1 :s;l:s;k

lP(-r: = l)

l· [pP(-rk = ~~~1 = 1)

I [. [pP(-r:~I

1 :s;l :s;k

I

(l

O:s;l:s;k-1

= pmk- 1(x

+

I

+

+

1)[pP(-r:~f = l)

1)

11~1

= -1)]

= l - 1) + qP(-r:=I = l - 1)]

+ qmk- 1(x-

[pP(-r:~f = l)

O:s;l:s;k-1

= pmk-1(x

+ qP(-rk =

+ qP(-r:=:f

1)

+ qP(-r:=:f

+ 1) + qmk- 1(x-

= l)]

1)

= l)]

+ 1.

Thus,forx E (A, B)andO::::;; k::::;; n, thefunctionsmk(x)satisfytherecurrent relations (21)

with m0 (x) = 0. From these equations together with the boundary conditions mk(A)

=

mk(B)

= 0,

we can successively find m 1(x), ... , mn(x). Since mk(x) ::::;; mk+ 1(x), the limit m(x) = lim mix) n-+oo

(22)

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

89

exists, and by (21) it satisfies the equation m(x)

+ 1 + pm(x + 1) + qm(x

- 1)

(23)

with the boundary conditions m(A) = m(B) = 0.

(24)

To solve this equation, we first suppose that m(x) < oo,

x

E

(A, B).

(25)

Then if p =I= q there is a particular solution of the form a.j(q - p) and the general solution (see (9)) can be written in the form m(x) = -Xq-p

+ a + b (q)x - . p

Then by using the boundary conditions m(A) = m(B) = 0 we find that 1 m(x) = - - (B{J(x) p-q

+ Aa.(x) -

x],

where fJ(x) and a.(x) are defined by (10) and (13). If p = q = solution of (23) has the form m(x)

=

a

+ bx -

(26)

!, the general

x 2,

and since m(A) = m(B) = 0 we have m(x) = (B - x)(x - A).

(27)

It follows, in particular, that if the players start with equal bankrolls (B = -A), then m(O) = B 2 •

If we take B = 10, and suppose that each turn takes a second, then the (limiting) time to the ruin of one player is rather long: 100 seconds. We obtained (26) and (27) under the assumption that m(x) < oo, x E (A, B). Let us now show that in fact m(x) is finite for all x E (A, B). We consider only the case x = 0; the general case can be analyzed similarly. Let p = q =!.We introduce the random variableS<" defined in terms of the sequence S0 , Sl> ... , Sn and the stopping time •n = .-~by the equation n

s,n =

L Skl{tn=k)(w). k=O

(28)

The descriptive meaning of S
•n·

•n

90

I. Elementary Probability Theory

Let us show that when p = q =

t.

ES,n = 0,

(29)

Es;n =Ern.

(30)

To establish the first equation we notice that n

ES,n =

L E[SkJ{tn=k}(w)]

k=O n

=

n

L E[SnJ{tn=k}(w)] + L E[(Sk- Sn)Jitn=k}(w)]

k=O

= ESn

k=O

+

n

L E[(Sk- Sn)II
k=O

(31)

where we evidently have ESn = 0. Let us show that n

L E[(Sk- Sn)Jitn=k}(w)] = 0.

k=O

To do this, we notice that {tn > k} = {A < S 1 < B, ... , A < Sk < B} when 0:::;; k < n. The event {A< S 1 < B, ... , A< Sk < B} can evidently be written in the form (32) where Ak is a subset of { -1, + 1}k. In other words, this set is determined by just the values of e1, ... , ek and does not depend on ek+t• ... , en. Since

{tn = k} = {tn > k- 1}\{tn > k}, this is also a set of the form (32). It then follows from the independence of 1 , •.. , en and from Problem 9 of §4 that the random variables Sn - Sk and I l
e

E[(Sn- Sk)II
ES;" =

n

n

k=O

k=O

L ESfJ{tn=k} = L E([Sn + (Sk- Sn)] 2I{tn=k}) n

=

L [ES;II
k=O

n

= n-

L (n -

k=O

n

k)P(tn = k) =

L kP(tn = k) = Et".

k=O

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

(p

Thus we have (29) and (30) when p = q = + q = 1) it can be shown similarly that

= (p

ES,n

=

E[S,"- r. · E~ 1 ] 2

t.

91

For general p and q (33)

- q) ·Ern,

V~ 1 ·Ern,

(34)

where E~ 1 = p- q, V~ 1 = 1 - (p- q) 2 • With the aid of the results obtained so far we can now show that limn-+ 00 mn(O) = m(O) < 00. If p = q = !, then by (30) (35)

If p =F q, then by (33), E

r.:::;;

max( lA I, B)

(36)

Ip-q I '

from which it is clear that m(O) < oo. We also notice that when p = q =!

Ern=

es;n = A 2 • (X.+ B 2 • fJ. + E[S;I{A<Sn
and therefore

It follows from this and (20) that as n _... oo, Er. converges with exponential rapidity to

m(O) = A 2 lX

B

A

B-A

B-A

+ B 2 {J = A 2 · - - - B 2 · - - =

IABI.

There is a similar result when p =F q: Er. _... m(0) -_ ocA

+ fJB ,

p-q

exponentially fast.

5. PROBLEMS 1. Establish the following generalizations of (33) and (34):

Es:. = x

+ (p

- q)Et~,

E[Stlf- •: · E~tJ 2 = V~ 1 . Et:. 2. Investigate the limits of oc(x), P(x), and m(x) when the level A ! 3. Let p = q =

- oo. tin the Bernoulli scheme. What is the order of EIS.l for large n?

92

I. Elementary Probability Theory

4. Two players each toss their own symmetric coins, independently. Show that the probability that each has the same number of heads after n tosses is r 2" D=o (C=) 2 • Hence deduce the equation D=o (C=>2 = Cin· Let u. be the first time when the number of heads for the first player coincides with the number of heads for the second player (if this happens within n tosses; un = n + 1 if there is no such time). Find Emin(u., n).

§10. Random Walk. II. Reflection Principle. Arcsine Law 1. As in the preceding section, we suppose that ~ 1 , ~ 2 , ••• , ~ 2 " is a sequence of independent identically distributed Bernoulli random variables with

= 1) = p, + · · · + ~k•

P(~j

sk =

~1

1~ k

~

2n;

We define a2n

= min{1

:::; k :::; 2n:

sk = 0},

putting G'2n = 00 if sk =1: 0 for 1 ~ k :::; 2n. The descriptive meaning of a 2 n is clear: it is the time of first return to zero. Properties of this time are studied in the present section, where we assume that the random walk is symmetric, i.e. p = q = !. For 0 :::; k :::; n we write u2k

= P(S 2 k = 0),

f2k

= P(a2n = 2k).

(1)

It is clear that u0 = 1 and Our immediate aim is to show that for 1 :::; k :::; n the probability f 2 k is given by (2)

It is clear that

{a 2 n = 2k} = {S 1 =1: 0, S 2 =1: 0, ... , S 2 k- 1 =1: 0, S 2 k = 0} for 1 :::; k :::; n, and by symmetry !2k

o, ... , s2k-1 =1: o, s2k = O} 2P{s1 > o, ... , s2k-1 > o, s2k = O}.

= P{S1 =

=1:

(3)

93

§10. Random Walk. II. Reflection Principle. Arcsine Law

A sequence (S 0 , ••• , Sk) is called a path of length k; we denote by Lk(A) the number of paths of length k having some specified property A. Then f2k

=

L2n<s1 >

2

o, ... , s2k-1 > o, s2k = o,

and S2k+1 = a2k+1• ... ,S2n = a2k+1

= 2L2k(s1 >

+ ··· + a2n)·2- 2"

o, ... , s2k-1 > o, s2k = O) · 2- 2\

(4)

where the summation is over all sets (a 2k+ 1, ... , a2n) with a; = ± 1. Consequently the determination of the probability f 2 k reduces to calculating the number of paths L 2k(S 1 > 0, ... , S 2k- 1 > 0, S 2k = 0).

Lemma 1. Let a and b be nonnegative integers, a - b > 0 and k = a Then a-b Lk(S1>0, ... ,Sk-1>0,Sk=a-b)=-k-Ck·

+ b. (5)

PRooF. In fact,

Lk(S 1 > 0, ... , Sk- 1 > 0, Sk =a- b) = Lk(S 1 = 1, S2 > 0, ... , Sk- 1 > 0, Sk =a- b) = Lk(S 1 = 1, Sk =a- b) - Lk(S 1 = 1, Sk =a- b; and 3 i, 2 :::;; i :::;; k - 1, such that S; :::;; 0).

(6)

In other words, the number of positive paths (S 1, S2, ... , Sk) that originate at (1, 1) and terminate at (k, a- b) is the same as the total number of paths from (1, 1) to (k, a - b) after excluding the paths that touch or intersect the time axis.* We now notice that

Lk(S 1 = 1, Sk =a- b; 3 i, 2 :::;; i:::;; k- 1, such that S;:::;; 0) =

Lk(S 1 = -1, Sk =a- b),

(7)

i.e. the number of paths from IX = (1, 1) to f3 = (k, a - b), neither touching nor intersecting the time axis, is equal to the total number of paths that connect IX* = (1, -1) with {3. The proof of this statement, known as the reflection principle, follows from the easily established one-to-one correspondence between the paths A= (S 1, ... , Sa, Sa+ 1, ... , Sk) joining IX and {3, and paths B = ( -S 1, ... , -Sa, Sa+~> ... , Sk)joining IX* and f3 (Figure 17); a is the first point where A and B reach zero.

* A path (S 1 , ••• , Sk) is called positive (or nonnegative) if all S1 > 0 (S1 ~ 0); a path is said to touch the time axis if S1 ~ 0 or else S1 :::;; 0, for I s j s k, and there is an i, I :::;; i :::;; k, such

that S 1 = 0; and a path is said to intersect the time axis if there are two times i and j such that 0 and sj < 0.

s, >

94

I. Elementary Probability Theory

fJ

Figure 17. The reflection principle.

From (6) and (7) we find

Lk(S 1 > 0, ... , Sk- 1 > 0, Sk =a- b)

= Lk(S 1 = 1,Sk =a- b)- Lk(S 1 = a-1

a

= ck-1- ck-1 =

-1,Sk =a- b)

a-bca -k- k,

which establishes (5). Turning to the calculation ofj2 k, we find that by (4) and (5) (with a = k, b = k- 1), f2k

o, ... , s2k-1 > o,

= o) · 2- 2k

=

2L2k(sl >

=

2Lzk-1(S1 > O, ... ,Szk- 1 = 1)·2- 2k

s2k

t 1 ck 1 2k-1 = 2k Uz(k-1)· = 2 . rzk . 2k-

Hence (2) is established. We present an alternative proof of this formula, based on the following observation. A straightforward verification shows that

1 2k Uz(k-1) = u2(k-1) -

(8)

Uzk>

At the same time, it is clear that

= 2k} = {a2n > 2(k- 1)}\{a2n > {a2n > 2!} = {S1 =I= 0, ... , S 21 =!= 0}

{azn

2k},

and therefore {a2n = 2k} = {S 1 =I= 0, ... , S2(k- 1) =I= 0}\{S 1 =I= 0, ... , S 2k =I= 0}.

Hence fzk

=

P{S1 =I= 0, ... , Sz(k-1) =I= 0} - P{S1 =I= 0, ... , S2k =I= 0},

95

§10. Random Walk. II. Reflection Principle. Arcsine Law

a

Figure 18

and consequently, because of (8), in order to show that it is enough to show only that

L2kCS1 ¥ 0, ... , S2k ¥ 0)

f 2k = (1/2k)u 2
= L2kCS2k = 0).

(9)

For this purpose we notice that evidently

L2k(S1 ¥ 0, ... , S2k ¥ 0) = 2L2k(St > 0, ... , S2k > 0). Hence to verify (9) we need only establish that

2L2k(St > 0, ... , S2k > 0) = L2k(S1

~

0, ... , S2k ~ 0)

(10)

and

(11) Now (10) will be established if we show that we can establish a one-to-one correspondence between the paths A = (S~> ... , S 2 k) for which at least one S; = 0, and the positive paths B = (S ~> ... , S 2k). Let A = (S ~> ... , S2k) be a nonnegative path for which the first zero occurs at the point a (i.e., Sa = 0). Let us construct the path, starting at (a, 2), (Sa + 2, Sa+ 1 + 2, ... , S 2 k + 2) (indicated by the broken lines in Figure 18). Then the path B = (S 1, ..• , Sa_ 1 , Sa+ 2, ... , S 2 k + 2) is positive. Conversely, let B = (S 1, ..• , S 2k) be a positive path and b the last instant at which Sb = 1 (Figure 19). Then the path

A = (S 1' . . . , Sb, Sb+ 1

-

b

Figure 19

2, ... , Sk - 2)

96

I. Elementary Probability Theory

2k

-m Figure 20

is nonnegative. It follows from these constructions that there is a one-to-one correspondence between the positive paths and the nonnegative paths with at least one Si = 0. Therefore formula (10) is established. We now establish (11). From symmetry and (10) it is enough to show that L2k(S1 > 0, ... , S 2k > 0) + L 2k(S 1 :2:: 0, ... , S 2k :2:: 0 and 3 i, 1 ~ i ~ 2k, such that S; = 0) = L 2 k(S 2 k = 0).

The set of paths (S 2 k = 0) can be represented as the sum of the two sets "t' 1 and "t'2 , where "t' 1 contains the paths (S 0 , ••• , S 2k) that have just one minimum, and "t' 2 contains those for which the minimum is attained at at least two points. Let C 1 E "t' 1 (Figure 20) and let y be the minimum point. We put the path C 1 = (S 0 , St. ... , S 2 k) in correspondence with the path Ct obtained in the following way (Figure 21). We reflect (S 0 , S 1, ••• , S1 ) around the vertical line through the point l, and displace the resulting path to the right and upward, thus releasing it from the point (2k, 0). Then we move the origin to the point (l, -m). The resulting path Ct will be positive. In the same way, if C 2 e "t' 2 we can use the same device to put it into correspondence with a nonnegative path C~.

Figure 21


§10. Random Walk. II. Reflection Principle. Arcsine Law

Conversely, let C_1* = (S_1 > 0, ..., S_{2k} > 0) be a positive path with S_{2k} = 2m (see Figure 21). We make it correspond to the path C_1 that is obtained in the following way. Let p be the last point at which S_p = m. Reflect (S_p, ..., S_{2k}) with respect to the vertical line x = p and displace the resulting path downward and to the left until its right-hand end coincides with the point (0, 0). Then we move the origin to the left-hand end of the resulting path (this is just the path drawn in Figure 20). The resulting path C_1 = (S_0, ..., S_{2k}) has just one minimum, and S_{2k} = 0. A similar construction applied to paths (S_1 ≥ 0, ..., S_{2k} ≥ 0 and ∃ i, 1 ≤ i ≤ 2k, with S_i = 0) leads to paths for which there are at least two minima and S_{2k} = 0. Hence we have established a one-to-one correspondence, which establishes (11).

Therefore we have established (9) and consequently also the formula

    f_{2k} = u_{2(k-1)} - u_{2k} = (1/2k) u_{2(k-1)}.

By Stirling's formula,

    u_{2k} = 2^{-2k} C_{2k}^k ∼ 1/√(πk),  k → ∞.

Therefore

    f_{2k} ∼ 1/(2k √(πk)),  k → ∞.

Hence it follows that the expectation of the first time when zero is reached, namely

    E min(σ_{2n}, 2n) = Σ_{k=1}^n 2k P(σ_{2n} = 2k) + 2n u_{2n} = Σ_{k=1}^n u_{2(k-1)} + 2n u_{2n},

can be arbitrarily large. In addition, Σ_{k=1}^∞ u_{2(k-1)} = ∞, and consequently the limiting value of the mean time for the walk to reach zero (in an unbounded number of steps) is ∞.

This property accounts for many of the unexpected properties of the symmetric random walk that we have been discussing. For example, it would be natural to suppose that after time 2n the number of zero net scores in a game between two equally matched players (p = q = 1/2), i.e. the number of instants i at which S_i = 0, would be proportional to 2n. However, in fact the number of zeros has order √(2n) (see [F1]). Hence it follows, in particular, that, contrary to intuition, the "typical" walk (S_0, S_1, ..., S_n) does not have a sinusoidal character (so that roughly half the time the particle would be on the positive side and half the time on the negative side), but instead must resemble a stretched-out wave. The precise formulation of this statement is given by the arcsine law, which we proceed to investigate.
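The relations f_{2k} = u_{2(k-1)} - u_{2k} = (1/2k) u_{2(k-1)} and the Stirling asymptotics u_{2k} ∼ 1/√(πk) are easy to check numerically. The following Python sketch (the function names are ours, chosen for illustration) computes both expressions exactly:

```python
from math import comb, pi, sqrt

def u(two_k):
    """u_{2k} = P(S_{2k} = 0) = C(2k, k) * 2^(-2k)."""
    k = two_k // 2
    return comb(2 * k, k) / 4 ** k

def f(two_k):
    """f_{2k} = P(first return to zero occurs at time 2k)."""
    k = two_k // 2
    return u(2 * (k - 1)) - u(2 * k)

for k in (1, 5, 50):
    # the two expressions for f_{2k} agree
    assert abs(f(2 * k) - u(2 * (k - 1)) / (2 * k)) < 1e-12

# Stirling: u_{2k} is already close to 1/sqrt(pi*k) for moderate k
print(u(100), 1 / sqrt(pi * 50))
```

The ratio of the two printed numbers tends to 1 as k grows, in agreement with the Stirling estimate.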


2. Let P_{2k,2n} be the probability that during the interval [0, 2n] the particle spends 2k units of time on the positive side.*

Lemma 2. Let u_0 = 1 and 0 ≤ k ≤ n. Then

    P_{2k,2n} = u_{2k} · u_{2(n-k)}.  (12)

PROOF. It was shown above that f_{2k} = u_{2(k-1)} - u_{2k}. Let us show that

    u_{2k} = Σ_{r=1}^k f_{2r} · u_{2(k-r)}.  (13)

Since {S_{2k} = 0} ⊆ {σ_{2n} ≤ 2k}, we have

    {S_{2k} = 0} = {S_{2k} = 0} ∩ {σ_{2n} ≤ 2k} = Σ_{1≤l≤k} {S_{2k} = 0} ∩ {σ_{2n} = 2l}.

Consequently

    u_{2k} = P(S_{2k} = 0) = Σ_{1≤l≤k} P(S_{2k} = 0, σ_{2n} = 2l)
           = Σ_{1≤l≤k} P(S_{2k} = 0 | σ_{2n} = 2l) P(σ_{2n} = 2l).

But

    P(S_{2k} = 0 | σ_{2n} = 2l)
      = P(S_{2k} = 0 | S_1 ≠ 0, ..., S_{2l-1} ≠ 0, S_{2l} = 0)
      = P(S_{2l} + (ξ_{2l+1} + ... + ξ_{2k}) = 0 | S_1 ≠ 0, ..., S_{2l-1} ≠ 0, S_{2l} = 0)
      = P(S_{2l} + (ξ_{2l+1} + ... + ξ_{2k}) = 0 | S_{2l} = 0)
      = P(ξ_{2l+1} + ... + ξ_{2k} = 0) = P(S_{2(k-l)} = 0).

Therefore

    u_{2k} = Σ_{1≤l≤k} P(S_{2(k-l)} = 0) P(σ_{2n} = 2l),

which establishes (13).

We turn now to the proof of (12). It is obviously true for k = 0 and k = n. Now let 1 ≤ k ≤ n - 1. If the particle is on the positive side for exactly 2k instants, it must pass through zero. Let 2r be the time of first passage through zero. There are two possibilities: either S_i ≥ 0 for all i ≤ 2r, or S_i ≤ 0 for all i ≤ 2r.

The number of paths of the first kind is easily seen to be

    (1/2) · 2^{2r} f_{2r} · 2^{2(n-r)} P_{2(k-r),2(n-r)} = (1/2) · 2^{2n} f_{2r} P_{2(k-r),2(n-r)}.

* We say that the particle is on the positive side in the interval [m - 1, m] if one, at least, of the values S_{m-1} and S_m is positive.


The corresponding number of paths of the second kind is

    (1/2) · 2^{2n} f_{2r} P_{2k,2(n-r)}.

Consequently, for 1 ≤ k ≤ n - 1,

    P_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} · P_{2(k-r),2(n-r)} + (1/2) Σ_{r=1}^{n-k} f_{2r} · P_{2k,2(n-r)}.  (14)

Let us suppose that P_{2k,2m} = u_{2k} · u_{2m-2k} holds for m = 1, ..., n - 1. Then we find from (13) and (14) that

    P_{2k,2n} = (1/2) u_{2n-2k} Σ_{r=1}^k f_{2r} · u_{2k-2r} + (1/2) u_{2k} Σ_{r=1}^{n-k} f_{2r} · u_{2n-2r-2k}
              = (1/2) u_{2n-2k} · u_{2k} + (1/2) u_{2k} · u_{2n-2k} = u_{2k} · u_{2(n-k)}.

This completes the proof of the lemma.

Now let γ(2n) be the number of time units that the particle spends on the positive axis in the interval [0, 2n]. Then, when x < 1,

    P{1/2 < γ(2n)/2n ≤ x} = Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n}.
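Before passing to the limit, the identity of Lemma 2 can be confirmed by brute-force enumeration of all 2^{2n} paths for a small n. The sketch below (our own sanity check, not part of the proof) uses the footnote's convention for "time on the positive side":

```python
from itertools import product
from math import comb

def u(two_k):
    """u_{2k} = C(2k, k) * 2^(-2k)."""
    k = two_k // 2
    return comb(2 * k, k) / 4 ** k

n = 4  # paths of length 2n = 8
count = {k: 0 for k in range(n + 1)}
for steps in product((-1, 1), repeat=2 * n):
    s, pos = 0, 0
    for x in steps:
        prev, s = s, s + x
        # the segment [m-1, m] is "positive" if at least one endpoint is > 0
        if prev > 0 or s > 0:
            pos += 1
    assert pos % 2 == 0  # time on the positive side is always even
    count[pos // 2] += 1

for k in range(n + 1):
    # empirical P_{2k,2n} equals u_{2k} * u_{2(n-k)}
    assert abs(count[k] / 4 ** n - u(2 * k) * u(2 * (n - k))) < 1e-12
print([count[k] / 4 ** n for k in range(n + 1)])
```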

Since

    u_{2k} ∼ 1/√(πk)

as k → ∞, we have

    P_{2k,2n} = u_{2k} · u_{2(n-k)} ∼ 1/(π √(k(n - k)))

as k → ∞ and n - k → ∞. Therefore

    Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n} - Σ_{{k: 1/2 < 2k/2n ≤ x}} (1/πn) [ (k/n)(1 - k/n) ]^{-1/2} → 0,

whence

    Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n} - (1/π) ∫_{1/2}^x dt/√(t(1 - t)) → 0,  n → ∞.

But, by symmetry,

    Σ_{{k: k/n ≤ 1/2}} P_{2k,2n} → 1/2,  n → ∞,


and

    (1/π) ∫_{1/2}^x dt/√(t(1 - t)) = (2/π) arcsin √x - 1/2.

Consequently we have proved the following theorem.

Theorem (Arcsine Law). The probability that the fraction of the time spent by the particle on the positive side is at most x tends to 2π^{-1} arcsin √x:

    Σ_{{k: k/n ≤ x}} P_{2k,2n} → 2π^{-1} arcsin √x.  (15)

We remark that the integrand p(t) in the integral

    (1/π) ∫_0^x dt/√(t(1 - t))

represents a U-shaped curve that tends to infinity as t → 0 or 1. Hence it follows that, for large n,

    P{0 < γ(2n)/2n < Δ} > P{1/2 < γ(2n)/2n < 1/2 + Δ},

i.e., it is more likely that the fraction of the time spent by the particle on the positive side is close to zero or one than to the intuitive value 1/2.

Using a table of arcsines and noting that the convergence in (15) is indeed quite rapid, we find that

    P{γ(2n)/2n ≤ 0.024} ≈ 0.1,
    P{γ(2n)/2n ≤ 0.1}  ≈ 0.2,
    P{γ(2n)/2n ≤ 0.2}  ≈ 0.3,
    P{γ(2n)/2n ≤ 0.65} ≈ 0.6.

Hence if, say, n = 1000, then in about one case in ten the particle spends only 24 units of time on the positive axis, and therefore spends the greatest amount of time, 976 units, on the negative axis.

3. PROBLEMS

1. How fast does E min(σ_{2n}, 2n) → ∞ as n → ∞?

2. Let τ_n = min{1 ≤ k ≤ n: S_k = 1}, where we take τ_n = ∞ if S_k < 1 for 1 ≤ k ≤ n. What is the limit of E min(τ_n, n) as n → ∞ for symmetric (p = q = 1/2) and for unsymmetric (p ≠ q) walks?
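The limit (15) is easy to observe in simulation. The following Python sketch (walk length, trial count, and x are arbitrary choices made for illustration) estimates P{γ(2n)/2n ≤ x} by Monte Carlo and compares it with 2π^{-1} arcsin √x:

```python
import random
from math import asin, pi, sqrt

random.seed(1)
n, trials = 200, 5000
x = 0.2
hits = 0
for _ in range(trials):
    s, pos = 0, 0
    for _ in range(2 * n):
        prev = s
        s += random.choice((-1, 1))
        # count the segment as "positive" if either endpoint is > 0
        if prev > 0 or s > 0:
            pos += 1
    if pos / (2 * n) <= x:
        hits += 1
print(hits / trials, 2 / pi * asin(sqrt(x)))
```

Both printed values should be close to 0.3, in agreement with the small table above.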


§11. Martingales. Some Applications to the Random Walk

1. The Bernoulli random walk discussed above was generated by a sequence ξ_1, ..., ξ_n of independent random variables. In this and the next section we introduce two important classes of dependent random variables, those that constitute martingales and Markov chains.

The theory of martingales will be developed in detail in Chapter VII. Here we shall present only the essential definitions, prove a theorem on the preservation of the martingale property for stopping times, and apply this to deduce the "ballot theorem." In turn, the latter theorem will be used for another proof of proposition (10.5), which was obtained above by applying the reflection principle.

2. Let (Ω, 𝒜, P) be a finite probability space and 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n a sequence of decompositions.

Definition 1. A sequence of random variables ξ_1, ..., ξ_n is called a martingale (with respect to the decompositions 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n) if

(1) ξ_k is 𝒟_k-measurable,
(2) E(ξ_{k+1} | 𝒟_k) = ξ_k, 1 ≤ k ≤ n - 1.

In order to emphasize the system of decompositions with respect to which the random variables form a martingale, we shall use the notation

    ξ = (ξ_k, 𝒟_k),  (1)

where for the sake of simplicity we often do not mention explicitly that 1 ≤ k ≤ n. When 𝒟_k is induced by ξ_1, ..., ξ_k, i.e.

    𝒟_k = 𝒟_{ξ_1,...,ξ_k},

instead of saying that ξ = (ξ_k, 𝒟_k) is a martingale, we simply say that the sequence ξ = (ξ_k) is a martingale.

Here are some examples of martingales.

Example 1. Let η_1, ..., η_n be independent Bernoulli random variables with

    P(η_k = 1) = P(η_k = -1) = 1/2,
    S_k = η_1 + ... + η_k  and  𝒟_k = 𝒟_{η_1,...,η_k}.

We observe that the decompositions 𝒟_k have a simple structure:

    𝒟_1 = {D^+, D^-},

where

    D^+ = {ω: η_1 = +1},  D^- = {ω: η_1 = -1};

    𝒟_2 = {D^{++}, D^{+-}, D^{-+}, D^{--}},

where

    D^{++} = {ω: η_1 = +1, η_2 = +1}, ..., D^{--} = {ω: η_1 = -1, η_2 = -1},

etc. It is also easy to see that 𝒟_{η_1,...,η_k} = 𝒟_{S_1,...,S_k}.

Let us show that (S_k, 𝒟_k) forms a martingale. In fact, S_k is 𝒟_k-measurable, and by (8.12), (8.18) and (8.24),

    E(S_{k+1} | 𝒟_k) = E(S_k + η_{k+1} | 𝒟_k) = E(S_k | 𝒟_k) + E(η_{k+1} | 𝒟_k) = S_k + Eη_{k+1} = S_k.

If we put S_0 = 0 and take 𝒟_0 = {Ω}, the trivial decomposition, then the sequence (S_k, 𝒟_k)_{0≤k≤n} also forms a martingale.
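The martingale property of (S_k, 𝒟_k) can be verified mechanically: each atom of 𝒟_k is a prefix (η_1, ..., η_k), and the average of S_{k+1} over the two equally likely continuations must equal S_k. A small Python check (a restatement of the computation above, not a proof):

```python
from itertools import product

n = 5
# atoms of D_k are the prefixes (eta_1, ..., eta_k) of signs
for k in range(n - 1):
    for prefix in product((-1, 1), repeat=k):
        s_k = sum(prefix)
        # E(S_{k+1} | D_k): average over the two equally likely continuations
        cond_exp = sum(s_k + e for e in (-1, 1)) / 2
        assert cond_exp == s_k
print("martingale property verified on all atoms")
```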

Example 2. Let η_1, ..., η_n be independent Bernoulli random variables with P(η_i = 1) = p, P(η_i = -1) = q. If p ≠ q, each of the sequences ξ = (ξ_k) with

    ξ_k = (q/p)^{S_k}  and  ξ_k = S_k - k(p - q),

where S_k = η_1 + ... + η_k, is a martingale.

Example 3. Let η be a random variable, 𝒟_1 ≼ ... ≼ 𝒟_n, and

    ξ_k = E(η | 𝒟_k).  (2)

Then the sequence ξ = (ξ_k, 𝒟_k) is a martingale. In fact, it is evident that E(η | 𝒟_k) is 𝒟_k-measurable, and by (8.20)

    E(ξ_{k+1} | 𝒟_k) = E[E(η | 𝒟_{k+1}) | 𝒟_k] = E(η | 𝒟_k) = ξ_k.

In this connection we notice that if ξ = (ξ_k, 𝒟_k) is any martingale, then by (8.20)

    ξ_k = E(ξ_{k+1} | 𝒟_k) = E[E(ξ_{k+2} | 𝒟_{k+1}) | 𝒟_k] = E(ξ_{k+2} | 𝒟_k) = ... = E(ξ_n | 𝒟_k).  (3)

Consequently the set of martingales ξ = (ξ_k, 𝒟_k) is exhausted by the martingales of the form (2). (We note that for infinite sequences ξ = (ξ_k, 𝒟_k)_{k≥1} this is, in general, no longer the case; see Problem 7 in §1 of Chapter VII.)


Example 4. Let η_1, ..., η_n be a sequence of independent identically distributed random variables, S_k = η_1 + ... + η_k, and

    𝒟_1 = 𝒟_{S_n},  𝒟_2 = 𝒟_{S_n,S_{n-1}},  ...,  𝒟_n = 𝒟_{S_n,S_{n-1},...,S_1}.

Let us show that the sequence ξ = (ξ_k, 𝒟_k) with

    ξ_1 = S_n/n,  ξ_2 = S_{n-1}/(n - 1),  ...,  ξ_k = S_{n+1-k}/(n + 1 - k),  ...,  ξ_n = S_1

is a martingale. In the first place, it is clear that 𝒟_k ≼ 𝒟_{k+1} and ξ_k is 𝒟_k-measurable. Moreover, we have by symmetry, for j ≤ n - k + 1,

    E(η_j | 𝒟_k) = E(η_1 | 𝒟_k)  (4)

(compare (8.26)). Therefore

    (n - k + 1) E(η_1 | 𝒟_k) = Σ_{j=1}^{n-k+1} E(η_j | 𝒟_k) = E(S_{n-k+1} | 𝒟_k) = S_{n-k+1},

and consequently

    ξ_k = S_{n-k+1}/(n - k + 1) = E(η_1 | 𝒟_k),

and it follows from Example 3 that ξ = (ξ_k, 𝒟_k) is a martingale.

Example 5. Let η_1, ..., η_n be independent Bernoulli random variables with P(η_i = +1) = P(η_i = -1) = 1/2, and S_k = η_1 + ... + η_k. Let A and B be integers with A < 0 < B. Then for 0 < λ < π/2, the sequence ξ = (ξ_k) with

    ξ_k = (cos λ)^{-k} exp{iλ(S_k - (B + A)/2)}  (5)

is a complex martingale (i.e., the real and imaginary parts of ξ = (ξ_k) form martingales).

3. It follows from the definition of a martingale that the expectation Eξ_k is the same for every k:

    Eξ_k = Eξ_1.

It turns out that this property persists if the time k is replaced by a random time. In order to formulate this property we introduce the following definition.

Definition 2. A random variable τ = τ(ω) that takes the values 1, 2, ..., n is called a stopping time (with respect to a decomposition (𝒟_k)_{1≤k≤n}, 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n) if, for k = 1, ..., n, the random variable I_{{τ=k}}(ω) is 𝒟_k-measurable.

If we consider 𝒟_k as the decomposition induced by observations for k steps (for example, 𝒟_k = 𝒟_{η_1,...,η_k}, the decomposition induced by the variables η_1, ..., η_k), then the 𝒟_k-measurability of I_{{τ=k}}(ω) means that the realization or nonrealization of the event {τ = k} is determined only by observations for k steps (and is independent of the "future"). If ℬ_k = α(𝒟_k), then the 𝒟_k-measurability of I_{{τ=k}}(ω) is equivalent to the assumption that

    {τ = k} ∈ ℬ_k.  (6)

We have already introduced specific examples of stopping times: the times

τ_k and σ_{2n} introduced in §§9 and 10. Those times are special cases of stopping times of the form

    τ_A = min{0 < k ≤ n: ξ_k ∈ A},
    σ_A = min{0 ≤ k ≤ n: ξ_k ∈ A},  (7)

which are the times (respectively the first time after zero and the first time) at which a sequence ξ_0, ξ_1, ..., ξ_n attains a point of the set A.

4. Theorem 1. Let ξ = (ξ_k, 𝒟_k)_{1≤k≤n} be a martingale and τ a stopping time with respect to the decomposition (𝒟_k)_{1≤k≤n}. Then

    E(ξ_τ | 𝒟_1) = ξ_1,  (8)

where

    ξ_τ = Σ_{k=1}^n ξ_k I_{{τ=k}}(ω)  (9)

and

    Eξ_τ = Eξ_1.  (10)

PROOF (compare the proof of (9.29)). Let D ∈ 𝒟_1. Using (3) and the properties of conditional expectations, we find that

    E(ξ_τ | D) = E(ξ_τ I_D)/P(D) = (1/P(D)) Σ_{l=1}^n E[ξ_l I_{{τ=l}} I_D]
               = (1/P(D)) Σ_{l=1}^n E[ξ_n I_{{τ=l}} I_D]
               = (1/P(D)) E(ξ_n I_D) = E(ξ_n | D),

where the second equality holds because, by (3), ξ_l = E(ξ_n | 𝒟_l) and I_{{τ=l}} I_D is 𝒟_l-measurable,


and consequently

    E(ξ_τ | 𝒟_1) = E(ξ_n | 𝒟_1) = ξ_1.

The equation Eξ_τ = Eξ_1 then follows in an obvious way. This completes the proof of the theorem.

Corollary. For the martingale (S_k, 𝒟_k)_{1≤k≤n} of Example 1, and any stopping time τ (with respect to (𝒟_k)), we have the formulas

    ES_τ = 0,  ES_τ² = Eτ,  (11)

known as Wald's identities (cf. (9.29) and (9.30); see also Problem 1 and Theorem 3 in §2 of Chapter VII).
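Wald's identities hold for any stopping time with values in 1, ..., n. The exhaustive check below evaluates both sides exactly, with rational arithmetic, for one such stopping time (the first exit from (-2, 2), truncated at n — our own choice, made only for illustration):

```python
from itertools import product
from fractions import Fraction

n = 8
e_s = e_s2 = e_tau = Fraction(0)
p = Fraction(1, 2 ** n)  # probability of each sign sequence
for etas in product((-1, 1), repeat=n):
    s, tau, s_tau = 0, None, None
    for k, e in enumerate(etas, start=1):
        s += e
        # stop the first time |S_k| = 2, or at time n at the latest
        if tau is None and (abs(s) == 2 or k == n):
            tau, s_tau = k, s
    e_s += p * s_tau
    e_s2 += p * s_tau ** 2
    e_tau += p * tau

assert e_s == 0 and e_s2 == e_tau  # Wald's identities, exactly
print(e_tau)
```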

5. Let us use Theorem 1 to establish the following proposition.

Theorem 2 (Ballot Theorem). Let η_1, ..., η_n be a sequence of independent identically distributed random variables whose values are nonnegative integers, and let S_k = η_1 + ... + η_k, 1 ≤ k ≤ n. Then

    P{S_k < k for all k, 1 ≤ k ≤ n | S_n} = (1 - S_n/n)^+,  (12)

where a^+ = max(a, 0).

PROOF. On the set {ω: S_n ≥ n} the formula is evident. We therefore prove (12) for the sample points at which S_n < n. Let us consider the martingale ξ = (ξ_k, 𝒟_k)_{1≤k≤n} introduced in Example 4, with ξ_k = S_{n+1-k}/(n + 1 - k) and 𝒟_k = 𝒟_{S_{n+1-k},...,S_n}. We define

    τ = min{1 ≤ k ≤ n: ξ_k ≥ 1},

taking τ = n on the set {ξ_k < 1 for all k such that 1 ≤ k ≤ n} = {max_{1≤l≤n}(S_l/l) < 1}. It is clear that S_1 = 0 on this set, and therefore ξ_τ = ξ_n = S_1 = 0, so that

    {max_{1≤l≤n} S_l/l < 1} = {max_{1≤l≤n} S_l/l < 1, S_n < n} ⊆ {ξ_τ = 0}.  (13)

Now let us consider those outcomes for which simultaneously max_{1≤l≤n}(S_l/l) ≥ 1 and S_n < n. Write σ = n + 1 - τ. It is easy to see that

    σ = max{1 ≤ k ≤ n: S_k ≥ k},

and therefore (since S_n < n) we have σ < n, S_σ ≥ σ, and S_{σ+1} < σ + 1. Consequently η_{σ+1} = S_{σ+1} - S_σ < (σ + 1) - σ = 1, i.e. η_{σ+1} = 0. Therefore σ ≤ S_σ = S_{σ+1} < σ + 1, and consequently S_σ = σ and

    ξ_τ = S_{n+1-τ}/(n + 1 - τ) = S_σ/σ = 1.


Therefore

    {max_{1≤l≤n} S_l/l ≥ 1, S_n < n} ⊆ {ξ_τ = 1}.  (14)

From (13) and (14) we find that

    {max_{1≤l≤n} S_l/l ≥ 1, S_n < n} = {ξ_τ = 1} ∩ {S_n < n}.

Therefore, on the set {S_n < n}, we have

    P{S_k < k for all k, 1 ≤ k ≤ n | S_n} = P{ξ_τ = 0 | S_n} = 1 - P{ξ_τ = 1 | S_n} = 1 - E(ξ_τ | S_n),

where the last equation follows because ξ_τ takes only the two values 0 and 1. Let us notice now that E(ξ_τ | S_n) = E(ξ_τ | 𝒟_1), and (by Theorem 1) E(ξ_τ | 𝒟_1) = ξ_1 = S_n/n. Consequently, on the set {S_n < n}, we have

    P{S_k < k for all k such that 1 ≤ k ≤ n | S_n} = 1 - S_n/n.

This completes the proof of the theorem.

We now apply this theorem to obtain a different proof of Lemma 1 of §10, and explain why it is called the ballot theorem.

Let ξ_1, ..., ξ_n be independent Bernoulli random variables with P(ξ_i = 1) = P(ξ_i = -1) = 1/2, S_k = ξ_1 + ... + ξ_k, and let a and b be nonnegative integers such that a - b > 0, a + b = n. We are going to show that

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b} = (a - b)/(a + b).  (15)

In fact, by symmetry,

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b}
      = P{S_1 < 0, ..., S_n < 0 | S_n = -(a - b)}
      = P{S_1 + 1 < 1, ..., S_n + n < n | S_n + n = n - (a - b)}
      = P{η_1 < 1, ..., η_1 + ... + η_n < n | η_1 + ... + η_n = n - (a - b)}
      = (a - b)/n = (a - b)/(a + b),

where we have put η_k = ξ_k + 1 and applied (12). Now formula (10.5) follows from (15) in an evident way; the formula was also established in Lemma 1 of §10 by using the reflection principle.


Let us interpret ξ_i = +1 as a vote for candidate A and ξ_i = -1 as a vote for B. Then S_k is the difference between the numbers of votes cast for A and B at the time when k votes have been recorded, and

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b}

is the probability that A was always ahead of B, with the understanding that A received a votes in all, B received b votes, and a - b > 0, a + b = n. According to (15) this probability is (a - b)/n.
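Formula (15) can be checked by enumerating the orderings of a votes for A and b votes for B; the fraction of orderings in which A stays strictly ahead is exactly (a - b)/(a + b). A small Python sketch (the values of a and b are chosen arbitrarily for illustration):

```python
from itertools import permutations
from fractions import Fraction

a, b = 5, 3
votes = [1] * a + [-1] * b
favorable = total = 0
# distinct orderings of a (+1)-votes and b (-1)-votes
for order in set(permutations(votes)):
    total += 1
    s, ahead = 0, True
    for v in order:
        s += v
        if s <= 0:          # A not strictly ahead at some count
            ahead = False
            break
    favorable += ahead

assert Fraction(favorable, total) == Fraction(a - b, a + b)
print(favorable, total)
```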

6. PROBLEMS

1. Let 𝒟_0 ≼ 𝒟_1 ≼ ... ≼ 𝒟_n be a sequence of decompositions with 𝒟_0 = {Ω}, and let η_k be 𝒟_k-measurable variables, 1 ≤ k ≤ n. Show that the sequence ξ = (ξ_k, 𝒟_k) with

    ξ_k = Σ_{l=1}^k [η_l - E(η_l | 𝒟_{l-1})]

is a martingale.

2. Let the random variables η_1, ..., η_k satisfy E(η_k | η_1, ..., η_{k-1}) = 0. Show that the sequence ξ = (ξ_k)_{1≤k≤n} with ξ_1 = η_1 and

    ξ_{k+1} = Σ_{i=1}^k η_{i+1} f_i(η_1, ..., η_i),

where the f_i are given functions, is a martingale.

3. Show that every martingale ξ = (ξ_k, 𝒟_k) has uncorrelated increments: if a < b < c < d then

    E(ξ_d - ξ_c)(ξ_b - ξ_a) = 0.

4. Let ξ = (ξ_1, ..., ξ_n) be a random sequence such that ξ_k is 𝒟_k-measurable (𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n). Show that a necessary and sufficient condition for this sequence to be a martingale (with respect to the system (𝒟_k)) is that Eξ_τ = Eξ_1 for every stopping time τ (with respect to (𝒟_k)). (The phrase "for every stopping time" can be replaced by "for every stopping time that assumes two values.")

5. Show that if ξ = (ξ_k, 𝒟_k)_{1≤k≤n} is a martingale and τ is a stopping time, then

    E[ξ_n I_{{τ=k}}] = E[ξ_k I_{{τ=k}}]

for every k.

6. Let ξ = (ξ_k, 𝒟_k) and η = (η_k, 𝒟_k) be two martingales with ξ_1 = η_1 = 0. Show that

    Eξ_n η_n = Σ_{k=2}^n E(ξ_k - ξ_{k-1})(η_k - η_{k-1}),

and in particular that

    Eξ_n² = Σ_{k=2}^n E(ξ_k - ξ_{k-1})².


7. Let η_1, ..., η_n be a sequence of independent identically distributed random variables with Eη_i = 0. Show that each of the sequences ξ = (ξ_k) with

    ξ_k = (Σ_{i=1}^k η_i)² - k Eη_1²,

    ξ_k = exp{λ(η_1 + ... + η_k)} / (E exp λη_1)^k

is a martingale.

8. Let η_1, ..., η_n be a sequence of independent identically distributed random variables taking values in a finite set Y. Let f_0(y) = P(η_1 = y), y ∈ Y, and let f_1(y) be a nonnegative function with Σ_{y∈Y} f_1(y) = 1. Show that the sequence ξ = (ξ_k, 𝒟_k) with 𝒟_k = 𝒟_{η_1,...,η_k} and

    ξ_k = [f_1(η_1) ··· f_1(η_k)] / [f_0(η_1) ··· f_0(η_k)]

is a martingale. (The variables ξ_k, known as likelihood ratios, are extremely important in mathematical statistics.)

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

1. We have discussed the Bernoulli scheme with

    Ω = {ω: ω = (x_1, ..., x_n), x_i = 0, 1},

where the probability p(ω) of each outcome is given by

    p(ω) = p(x_1) ··· p(x_n),  (1)

with p(x) = p^x q^{1-x}. With these hypotheses, the variables ξ_1, ..., ξ_n with ξ_i(ω) = x_i are independent and identically distributed with

    P(ξ_1 = x) = ... = P(ξ_n = x) = p(x),  x = 0, 1.

If we replace (1) by

    p(ω) = p_1(x_1) ··· p_n(x_n),

where p_i(x) = p_i^x (1 - p_i)^{1-x}, 0 ≤ p_i ≤ 1, the random variables ξ_1, ..., ξ_n are still independent, but in general are differently distributed:

    P(ξ_1 = x) = p_1(x), ..., P(ξ_n = x) = p_n(x).

We now consider a generalization that leads to dependent random variables that form what is known as a Markov chain. Let us suppose that

    Ω = {ω: ω = (x_0, x_1, ..., x_n), x_i ∈ X},


where X is a finite set. Let there be given nonnegative functions p_0(x), p_1(x, y), ..., p_n(x, y) such that

    Σ_{x∈X} p_0(x) = 1,   Σ_{y∈X} p_k(x, y) = 1,  k = 1, ..., n,  (2)

and put

    p(ω) = p_0(x_0) p_1(x_0, x_1) ··· p_n(x_{n-1}, x_n).  (3)

It is easily verified that Σ_{ω∈Ω} p(ω) = 1, and consequently the set of numbers p(ω) together with the space Ω and the collection of its subsets defines a probabilistic model, which it is usual to call a model of experiments that form a Markov chain.

Let us introduce the random variables ξ_0, ξ_1, ..., ξ_n with ξ_i(ω) = x_i. A simple calculation shows that

    P(ξ_0 = a_0) = p_0(a_0),
    P(ξ_0 = a_0, ..., ξ_k = a_k) = p_0(a_0) p_1(a_0, a_1) ··· p_k(a_{k-1}, a_k).  (4)

We now establish the validity of the following fundamental property of conditional probabilities:

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k, ..., ξ_0 = a_0} = P{ξ_{k+1} = a_{k+1} | ξ_k = a_k}  (5)

(under the assumption that P(ξ_k = a_k, ..., ξ_0 = a_0) > 0). By (4),

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k, ..., ξ_0 = a_0}
      = P{ξ_{k+1} = a_{k+1}, ..., ξ_0 = a_0} / P{ξ_k = a_k, ..., ξ_0 = a_0}
      = [p_0(a_0) p_1(a_0, a_1) ··· p_{k+1}(a_k, a_{k+1})] / [p_0(a_0) ··· p_k(a_{k-1}, a_k)]
      = p_{k+1}(a_k, a_{k+1}).

In a similar way we verify

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k} = p_{k+1}(a_k, a_{k+1}),  (6)

which establishes (5).

Let 𝒟_k = 𝒟_{ξ_0,...,ξ_k} be the decomposition induced by ξ_0, ..., ξ_k, and ℬ_k = α(𝒟_k). Then, in the notation introduced in §8, it follows from (5) that

    P{ξ_{k+1} = a_{k+1} | ξ_0, ..., ξ_k} = P{ξ_{k+1} = a_{k+1} | ξ_k}  (7)

or

    E(I_{{ξ_{k+1} = a_{k+1}}} | ξ_0, ..., ξ_k) = E(I_{{ξ_{k+1} = a_{k+1}}} | ξ_k).

If we use the evident equation

    P(AB | C) = P(A | BC) P(B | C),

we find from (7) that

    P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ℬ_k} = P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_k}  (8)

or

    P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_0, ..., ξ_k} = P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_k}.  (9)

This equation admits the following intuitive interpretation. Let us think of ξ_k as the position of a particle "at present," (ξ_0, ..., ξ_{k-1}) as the "past," and (ξ_{k+1}, ..., ξ_n) as the "future." Then (9) says that if the past and the present are given, the future depends only on the present and is independent of how the particle arrived at ξ_k, i.e. is independent of the past (ξ_0, ..., ξ_{k-1}).

Let

    F = {ξ_n = a_n, ..., ξ_{k+1} = a_{k+1}},  N = {ξ_k = a_k},  B = {ξ_{k-1} = a_{k-1}, ..., ξ_0 = a_0}.

Then it follows from (9) that

    P(F | NB) = P(F | N),

from which we easily find that

    P(FB | N) = P(F | N) P(B | N).  (10)

In other words, it follows from (7) that for a given present N, the future F and the past B are independent. It is easily shown that the converse also holds: if (10) holds for all k = 0, 1, ..., n - 1, then (7) holds for every k = 0, 1, ..., n - 1.

The property of the independence of future and past, or, what is the same thing, the lack of dependence of the future on the past when the present is given, is called the Markov property, and the corresponding sequence of random variables ξ_0, ..., ξ_n is a Markov chain. Consequently if the probabilities p(ω) of the sample points are given by (3), the sequence (ξ_0, ..., ξ_n) with ξ_i(ω) = x_i forms a Markov chain.

We give the following formal definition.

Definition. Let (Ω, 𝒜, P) be a (finite) probability space and let ξ = (ξ_0, ..., ξ_n) be a sequence of random variables with values in a (finite) set X. If (7) is satisfied, the sequence ξ = (ξ_0, ..., ξ_n) is called a (finite) Markov chain. The set X is called the phase space or state space of the chain. The set of probabilities (p_0(x)), x ∈ X, with p_0(x) = P(ξ_0 = x) is the initial distribution, and the matrix ‖p_k(x, y)‖, x, y ∈ X, with p_k(x, y) = P{ξ_k = y | ξ_{k-1} = x} is the matrix of transition probabilities (from state x to state y) at time k = 1, ..., n.


When the transition probabilities p_k(x, y) are independent of k, that is, p_k(x, y) = p(x, y), the sequence ξ = (ξ_0, ..., ξ_n) is called a homogeneous Markov chain with transition matrix ‖p(x, y)‖.

Notice that the matrix ‖p(x, y)‖ is stochastic: its elements are nonnegative and the sum of the elements in each row is 1: Σ_y p(x, y) = 1, x ∈ X. We shall suppose that the phase space X is a finite set of integers (X = {0, 1, ..., N}, X = {0, ±1, ..., ±N}, etc.), and use the traditional notation p_i = p_0(i) and p_ij = p(i, j). It is clear that the properties of a homogeneous Markov chain are completely determined by the initial distribution (p_i) and the transition probabilities (p_ij). In specific cases we describe the evolution of the chain, not by writing out the matrix ‖p_ij‖ explicitly, but by a (directed) graph whose vertices are the states in X, in which an arrow from state i to state j, labeled with the number p_ij, indicates that it is possible to pass from point i to point j with probability p_ij. When p_ij = 0, the corresponding arrow is omitted.

Example

1. Let X = {0, 1, 2} and

    ‖p_ij‖ = ( 1    0    0  )
             ( 1/2  0   1/2 )
             ( 2/3  0   1/3 ).

The following graph corresponds to this matrix:

[graph of transitions]

Here state 0 is said to be absorbing: if the particle gets into this state it remains there, since p_00 = 1. From state 1 the particle goes to the adjacent states 0 or 2 with equal probabilities; state 2 has the property that the particle remains there with probability 1/3 and goes to state 0 with probability 2/3.

Example

for

2. Let X= {0, ± 1, ... , ±N}, Po= 1, PNN = P(-N)(-N) = 1, and,

Iii< N,

p, j = i + 1, { Pii = q, j = i - 1, 0 otherwise.

(11)


The transitions corresponding to this chain can be presented graphically in the following way (N = 3):

[graph of transitions]

This chain corresponds to the two-player game discussed earlier, when each player has a bankroll N and at each turn the first player wins +1 from the second with probability p, and loses (wins -1) with probability q. If we think of state i as the amount won by the first player from the second, then reaching state N or -N means the ruin of the second or first player, respectively.

In fact, if η_1, η_2, ..., η_n are independent Bernoulli random variables with P(η_i = +1) = p, P(η_i = -1) = q, S_0 = 0 and S_k = η_1 + ... + η_k the amounts won by the first player from the second, then the sequence S_0, S_1, ..., S_n is a Markov chain with p_0 = 1 and transition matrix (11), since

    P{S_{k+1} = j | S_k = i_k, S_{k-1} = i_{k-1}, ...}
      = P{S_k + η_{k+1} = j | S_k = i_k, S_{k-1} = i_{k-1}, ...}
      = P{S_k + η_{k+1} = j | S_k = i_k} = P{η_{k+1} = j - i_k}.

This Markov chain has a very simple structure:

    S_{k+1} = S_k + η_{k+1},  0 ≤ k ≤ n - 1,

where η_1, η_2, ..., η_n is a sequence of independent random variables.

The same considerations show that if ξ_0, η_1, ..., η_n are independent random variables, then the sequence ξ_0, ξ_1, ..., ξ_n with

    ξ_{k+1} = f_k(ξ_k, η_{k+1}),  0 ≤ k ≤ n - 1,  (12)

is also a Markov chain. It is worth noting in this connection that a Markov chain constructed in this way can be considered as a natural probabilistic analog of a (deterministic) sequence x = (x_0, ..., x_n) generated by the recurrent equations x_{k+1} = f_k(x_k).

113

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

waiting line at time k, ~ 0 = 0. Then if ~k = i, at the next time k ~k + 1 of the waiting line is equal to j.=

{1Jk+ 1 i- 1

+ 1Jk+ 1

+ 1 the length

if i = 0, ifizl.

In other words, 0

s

k

s

n- 1,

where a+ = max(a, 0), and therefore the sequence ~ = (~ 0 , Markov chain.

..• ,

~")

is a

4. This example comes from the theory of branching processes. A branching process with discrete times is a sequence of random variables ~ 0 , ~ 1 , .•. , ~"'where ~k is interpreted as the number of particles in existence at time k, and the process of creation and annihilation of particles is as follows: each particle, independently of the other particles and of the "prehistory" of the process, is transformed into j particles with probability pi, j = 0, 1, ... , M. We suppose that at the initial time there is just one particle, ~ 0 = 1. If at time k there are ~k particles (numbered 1, 2, ... , ~k), then by assumption ~k+ 1 is given as a random sum of random variables, EXAMPLE

~k+ 1

=

IJlk)

+ ... + 1]~~,

where 1Jlkl is the number of particles produced by particle number i. It is clear that if ~k = 0 then ~k+ 1 = 0. If we suppose that all the random variables IJ~k>, k 2 0, are independent of each other, we obtain Pgk+1 = ik+1i~k = ik, ~k-1 = ik-t• · · .} = Pgk+t = ik+ti~k = ik} = P{IJ\kl + · · · + IJ!:> = ik+ d. It is evident from this that the sequence ~ 0 , ~ 1 ,

... , ~"is a Markov chain. A particularly interesting case is that in which each particle either vanishes with probability q or divides in two with probability p, p + q = 1. In this case it is easy to calculate that

is given by the formula Pii

=

{c 0

1,."12pi12qi- j/2, ). = 0' ... , 2'I, in all other cases.

2. Let ~ = ( ~k, Ill, IP') be a homogeneous Markov chain with st~rting vectors (rows) Ill = (p;) and transition matrix Ill = IIPiill. It is clear that Pii = P{~1 =j/~o = i} = ... = P{~n =j/~n-1 = i}.

114

I. Elementary Probability Theory

We shall use the notation

for the probability of a transition from state i to state j in k steps, and

for the probability of finding the particle at point j at time k. Also let

Let us show that the transition probabilities pj'> satisfy the Kolmogorov-

Chapman equation

\~ I) = " p\k)p(l! P'1 l.J ra ~J'

(13)

~

or, in matrix form,

(14) The proof is extremely simple: using the formula for total probability and the Markov property, we obtain P!~+IJ

= P(~k+l = j l~o =

i)

=

L P(~k+l = j, ~k = IXI~o = i) ~

=

L P(~k+l = j I~k = 1X)P(~k = IX I~0 = i) = L p~}P!~>. ~

~

The following two cases of (13) are particularly important: the backward equation 1J = "P· P\~+ IJ £...., I~ p
(15)

v

and the forward equation \~+ 1) P IJ

= "p\k)p" . £...., I~ ~}

(16)

~

(see Figures 22 and 23). The forward and backward equations can be written in the following matrix forms

= IP. IP,

(17)

IP.

(18)

IP
§12. Markov Chains. Ergodic Theorem. Strong Markov Property

0

115

I+ I

Figure 22. For the backward equation.

Similarly, we find for the (unconditional) probabilities p)k) that

'\' p(k)p(l) PJ(_k +I) = ~ a: '2.]'

(19)

a

or in matrix form

fl (k +I) = fl (k) • I]J>(I). In particular,

rn
rn
·IP

(forward equation) and

fl (k+ 1) = fl (1).

I]J>(k)

(backward equation). Since IP 0 ) = IP, fl (1) = fl, it follows from these equations that JP(k) =

!Pk,

Consequently for homogeneous Markov chains the k-step transitiOn probabilities p~j) are the elements of the kth powers of the matrix IP, so that many properties of such chains can be investigated by the methods of matrix analysis.

k k

+1

Figure 23. For the forward equation.

116

I. Elementary Probability Theory

5. Consider a homogeneous Markov chain with the two states 0 and 1 and the matrix

EXAMPLE

IFD

= (Poo Pot)· Pu

Pto

It is easy to calculate that p2

Po;(Poo + Pu)) Ptt + PotPto

= ( P~o + Po1P1o Pto(Poo

and (by induction)

+ Pu)

1 (1 - 1-

IFD" _ - 2 - Poo - Pu

Pu 1 - Pu

Poo) 1 - Poo

+ (Poo + Pu

- 1)" ( 1 - Poo 2 - Poo - Pu -(1 - Pu)

-(1 - Poo)) 1- Pu

(under the hypothesis that IPoo + p 11 - 11 < 1). Hence it is clear that if the elements of IFD satisfy Ip00 + p11 - 11 < 1 (in particular, if all the transition probabilities Pii are positive), then as n-+ oo IFD" -+

1 (1 - 1-

2 - Poo - Pu

Pu 1 - Pu

Poo) 1 - Poo '

(20)

and therefore l - p 11 lim (nl = P,o 2 ' n - Poo- Pu

. Pil(n) 1tm "

1 - Poo

= -,-------'--'---

2- Poo- Pu

Consequently if IPoo + p 11 - 11 < 1, such a Markov chain exhibits regular behavior of the following kind: the influence of the initial state on the probability of finding the particle in one state or another eventually becomes negligible (plj> approach limits ni, independent of i and forming a probability distribution: n0 ~ 0, n 1 ~ 0, n0 + n 1 = 1); if also all Pii > 0 then n0 > 0 and n 1 > 0. 3. The following theorem describes a wide class of Markov chains that have the property called ergodicity: the limits ni = limn Pii not only exist, are independent of i, and form a probability distribution (ni ~ 0, Li ni = 1}, but also ni > 0 for allj (such a distribution ni is said to be ergodic).

Theorem 1 (Ergodic Theorem). Let 1P

= IIPiill be the transition matrix of a chain with a .finite state space X = {1, 2, ... , N}.

(a) If there is an n0 such that min p\~ol > 0' lJ i,j

(21)

117

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

then there are numbers n 1,

••• ,

nN

such that (22)

and

n- oo

(23)

for every i e X. if there are numbers n 1, ••. , nN satisfying (22) and (23), there is an n0 such that (21) holds. (c) 1he numbers (n 1 , ... , nN) satisfy the equations

(b) Conversely,

= 1, ... ,N.

j PROOF.

(24)

(a) Let

m(n> = min p!~> ) I] '

M)"> = max Pl~P.

i

i

Since 1) = "P· p
(25)

~

we have 1 > =min p!'!+ 1 > =min" P· pmin" P· min p<") = m(n> m("+ ) I) ' L, I~ ~) L, I~ ~) ) '

i

i

a

i

eX

a

whence m}"> :::;; m)"+ >and similarly M}"> ~ M}"+ 1l. Consequently, to establish (23) it will be enough to prove that 1

M("> - m("> .,... 0 1

Let e

=

1

n .,... oo, j = 1, ... , N.

'

min;, i Pljo> > 0. 'fhen = "p(no>p]p<") P!'!o+n) I) L, I~ ~) I~ )~ ~) ~

+ s "p(">p
~

~)

~

(p(no) - sp(">]p
= "

L,

+ sp(~n) 11 •

~

But p!no) - sp("> > 0 '· therefore la Ja. -

> m(n)." [p!no) - sp(">J P!'!o+n) I) ) L, I~ )~

+ sp(~n) = ))

m("l(1 - s) )

~

and consequently m!no+n) > m("l(1 - s) 1 1

+ sp(~n) )) •

M(no+n> < M(">(1 - s) 1 1

+ sp!~n> )) •

In a similar way Combining these inequalities, we obtain M(no+n> _ m(no+n> < (M("> _ m(">). (1 _ s) 1

1

-

1

1

+ sp(~n) Jl '

118

I. Elementary Probability Theory

and consequently M<J<no+n> _ m<.kno+n> < (M<.n> _ m<.">)(1 _ e)k! 0 J

J

-

J

J

'

k--+ 00.

Thus M~"tll - m~npl --+ 0 for some subsequence np, np --+ oo. But the difference M~nl - m~n) is monotonic in n, and therefore M~nl - m~nl --+ 0, n--+ oo. If we put ni = limn m~">, it follows from the preceding inequalities that lp \~)IJ

7C·I < M(n)m<.n> < (1 - e)[n/no)-1 J J J -

for n ~ n0 , that is, p!j> converges to its limit ni geometrically (i.e., as fast as a geometric progression). It is also clear that m~n> ~ m~no) ~ e > 0 for n ~ n0 , and therefore ni > 0. (b) Inequality (21) follows from (23) and (25). (c) Equation (24) follows from (23) and (25). This completes the proof of the theorem. 4. Equations (24) play a major role in the theory of Markov chains. A nonnegative solution (n 1 , ••• , nN) satisfying I,. n~ = 1 is said to be a stationary or invariant probability distribution for the Markov chain with transition matrix IIPiill. The reason for this terminology is as follows. Let us select an initial distribution (n 1, ... , nN) and take Pi= ni. Then PJ(l)

= "" 1C

p . = 1C.J

£...,~~} ~

and in general p)"> = ni. In other words, if we take (n 1, ... , nN) as the initial distribution, this distribution is unchanged as time goes on, i.e. for any k j = 1, ... ,N.

Moreover, with this initial distribution the Markov chain ~ =(~,Ill, IP) is really stationary: the joint distribution of the vector (~k• ~k+ ~> ... , ~k+ 1) is independent of k for alll (assuming that k + 1 :::;; n). Property (21) guarantees both the existence of limits ni = lim plj>, which are independent of i, and the existence of an ergodic distribution, i.e. one with ni > 0. The distribution (n 1 , •.• , nN) is also a stationary distribution. Let us now show that the set (n 1 , .•. , nN) is the only stationary distribution. In fact, let (ft 1, •.• , ftN) be another stationary distribution. Then

and since P~1--+ ni we have iii

= L (ft~ · n) = ni. ~

These problems will be investigated in detail in Chapter VIII for Markov chains with countably many states as well as with finitely many states.

119

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

We note that a stationary probability distribution (even unique) may exist for a nonergodic chain. In fact, if

then [p>2n

=

(0 1)

1 0 '

[p>2n+ 1

=

(1 0)

0 1 '

and consequently the limits lim pjj> do not exist. At the same time, the system j

= 1, 2,

reduces to

of which the unique solution satisfying n 1 + n 2 = 1 is(!,!). We also notice that for this example the system (24) has the form no

=

noPoo

+ n1P1o·

from which, by the condition n 0 = n 1 = 1, we find that the unique stationary distribution (n 0 , n 1) coincides with the one obtained above: no=

1- Pu ' 2 - Poo - Pu

nl

=

1- Poo

2- Poo-

P11

.

We now consider some corollaries of the ergodic theorem. Let A be a set of states, A ~ X and IA(x)

1, XEA, = { 0, x ¢:A.

Consider

which is the fraction of the time spent by the particle in the set A. Since E[JA((k)i(o = i] = P((kEAI(o = i) = LPI,l(=p\kl(A)), jeA

120

I. Elementary Probability Theory

we have

1

E[vA(n)l~o = i] = - 1

n

+

and in particular

E[vw(n)l~o = i] = -

n

1-

L pjkl(A) k=o n

±

+ 1 k=o

pjj>.

It is known from analysis (see also Lemma 1 in §3 of Chapter IV) that if an-+ a then (a 0 + · · · + an)/(n + 1)-+ a, n-+ oo. Hence if pj'l-+ ni, k-+ oo, then where

nA =

L ni.

jeA

For ergodic chains one can in fact prove more, namely that the following result holds for I A(~ 0 ), ••• , I A( ~n), ....

Law of Large Numbers. If ~ 0 , ~ 1 , ... form a .finite ergodic Markov chain, then

n-+ oo,

(26)

for every e > 0 and every initial distribution. Before we undertake the proof, let us notice that we cannot apply the results of §5 directly to IA(~ 0 ), ... , JA(~n), ... , since these variables are, in general, dependent. However, the proof can be carried through along the same lines as for independent variables if we again use Chebyshev's inequality, and apply the fact that for an ergodic chain with finitely many states there is a number p, 0 < p < 1, such that (27)

Let us consider states i and j (which might be the same) and show that, fore> 0,

n-+ oo. By Chebyshev's inequality,

Hence we have only to show that

n-+ oo.

(28)

121

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

A simple calculation shows that

=

1 (n

+

n

n

'\'

'\' m!~· I) L., l} ' k=o l=o L.,

1)2

where

s = min(k, l) and t = lk- 11. By (27),

P!'!> l}

= n:.J + e!'!>

l}'

Therefore lml~· 1 >1:::; C1[p•

+ P + Pk + /], 1

where C 1 is a constant. Consequently

1

n

n

--,-----____,.,...,.. '\' '\' m
<

- (n

C

n

1

'\'

n

'\' [

s

+ 1)2 k~o 1~0 p 4C 1 2(n + 1) 2 + 1) 1 - p

+

t

p

+

k+

p

1]

p

8C 1

(n

+ 1)(1

- p)

--+

O

,

n --+ oo.

Then (28) follows from this, and we obtain (26) in an obvious way. 5. In §9 we gave, for a random walk S 0 , S 1, ... generated by a Bernoulli scheme, recurrent equations for the probability and the expectation of the exit time at either boundary. We now derive similar equations for Markov chains. Let~= (~ 0 , ••• , ~")be a Markov chain with transition matrix IIPiill and phase space X = {0, ± 1, ... , ± N}. Let A and B be two integers, - N :::; A :::; 0 :::; B :::; N, and x EX. Let f!lk+ 1 be the set of paths {x 0 , x 1 , ..• , xk), xi E X, that leave the interval (A, B) for the first time at the upper end, i.e. leave (A, B) by going into the set (B, B + 1, ... , N). , For A :::; x :::; B, put f3k(x) = P{(~o .... , ~k) E fflk+ll~o = x}.

In order to find these probabilities (for the first exit of the Markov chain from (A, B) through the upper boundary) we use the method that was applied in the deduction of the backward equations.

122

I. Elementary Probability Theory

We have {3k(x) = P{(eo, ... ' ek) E .?Jk+ 11 eo = X} =

L Pxy. P{(eo, ... ' ek) E .?Jk+lleo =X, el = y

y},

where, as is easily seen by using the Markov property and the homogeneity of the chain,

P{(eo, ... ' ek)

.?Jk+lleo =X, el = y} = P{(x, y, e2, ... ' ek) E .?Jk+lleo =X, el = P{(y, e2, ... , ek) E BBkle1 = y}

E

= y}

= P{(y, eb ... ' ek- dE .?lkl eo = y} = Pk-l(y). Therefore y

for A < x < B and 1 :::;; k :::;; n. Moreover, it is clear that Pix)= 1,

X

= B, B + 1, ... ' N,

and X= -N, ... ,A.

In a similar way we can find equations for ocix), the probabilities for first exit from (A, B) through the lower boundary. Let rk = min{O :::;; l :::;; k: ¢(A, B)}, where rk = k if the set { ·} = 0. Then the same method, applied to mix)= E(rkl eo = x), leads to the following recurrent equations:

e,

mk(x) = 1 +

L mk-l (Y)Pxy y

(here 1 :::;; k :::;; n, A < x < B). We define

x ¢(A, B). It is clear that if the transition matrix is given by (11) the equations for ock(x), {3k(x) and mk(x) become the corresponding equations from §9, where they were obtained by essentially the same method that was used here. These equations have the most interesting applications in the limiting case when the walk continues for an unbounded length of time. Just as in §9, the corresponding equations can be obtained by a formal limiting process (k --+

CX) ).

By way of example, we consider the Markov chain with states {0, 1, ... , B} and transition probabilities

Poo

= 1,

PBB =

1,

123

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

and

Pi > 0, ~ = ~ + 1, pij = { ri, J = z, qi > 0, j = i - 1, for 1 :-:;; i :-:;; B - 1, where Pi + qi + ri = 1. For this chain, the corresponding graph is

a~---·~e 0

q1

1

2

P1

qB-1 B- 1 PB-1

B

It is clear that states 0 and B are absorbing, whereas for every other state i the particle stays there with probability ri, moves one step to the right with probability Pi• and to the left with probability qi. Let us find a(x) = limk-oo ak(x), the limit of the probability that a particle starting at the point x arrives at state zero before reaching state B. Taking limits ask--+ oo in the equations for aix), we find that

when 0 < j < B, with the boundary conditions

a(O)

=

1,

a(B) = 0.

Since rj- 1 - qj- pj, we have pj(a(j + 1) - a(j)) = qj(a(j) - a(j - 1)) and consequently a(j

+ 1) -

a(j) = pj(a(1) - 1),

where Po= 1.

But a(j

+ 1) -

j

1=

L (a(i + 1) -

a(i)).

i=O

Therefore a(j

+ 1) -

j

1 = (a(1) - 1) ·

L Pi·

i=l

124

I. Elementary Probability Theory

Ifj = B - 1, we have r:~.(j

+ 1) =

0, and therefore

r:~.(B) =

1

r:J.(l)=1=-'\'B-1

L..i= 1 Pi

'

whence an d

'\'B-1

L..i=i Pi ' L..i=1 Pi

( r:I.J

0)

='\'B-1

j = 1,

0

0

0'

B.

(This should be compared with the results of §9.) Now let m(x) = limk mk(x), the limiting value of the average time taken to arrive at one of the states 0 or B. Then m(O) = m(B) = 0, m(x)

= 1 + L m(y)Pxy y

and consequently for the example that we are considering,

m(j) = 1 + qimU - 1)

+ rimU) + pimU + 1)

for j = 1, 2, ... , B - 1. To find mU) we put

M(j) = m(j) - m(j- 1),

j = 0, 1,. 0. 'B.

Then

piM(j + 1)

= qiM(j)

- 1,

j = 1, ... , B- 1,

and consequently we find that

M(j

+ 1) =

piM(1)- Ri,

where q1 .. qj 0

P1 · · Pi ' 0

Therefore j-1

m(i) = mU)- m(O) =

L M(i + 1)

i=O

j-1

=

L

i=O

(pim(l)- Ri) = m(l)

j-1

j-1

i=O

i=O

L Pi- L Ri.

It remains only to determine m(1). But m(B) = 0, and therefore '\'B-l R m(1) = Ik=o1 i, i=O Pi

and for 1 < j ::;; B,

( 0)

mJ =

j-1

'\'~-1R.

j-1

'\' R L..• = 0 I '\' i~o Pi. Lf;-o1 Pi - i:--o io

125

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

(This should be compared with the results in §9 for the case ri = 0, Pi = p,

qi

= q.)

6. In this subsection we consider a stronger version of the Markov property (8), namely that it remains valid if time k is replaced by a random time (see also Theorem 2). The significance of this, the strong Markov property, can be illustrated in particular by the example of the derivation of the recurrent relations (38), which play an important role in the classification of the states of Markov chains (Chapter VIII). Let ~ = (~ 1 , . . . , ~n) be a homogeneous Markov chain with transition matrix IIPijll; let ~~ = (~f)oo;;k,;;n be a system of decompositions, ~f = ~ ~o •...• ~k. Let ~f denote the algebra ri(~i) generated by the decomposition ~f.

B

We first put the Markov property (8) into a somewhat different form. Let Let us show that then

E ~f.

= ak+llB n (~k = ak)} = P{~n =an, ... , ~k+l =

Pgn =an,···, ~k+l

ak+ll~k

= ak}

(29)

(assuming that P{B n (~k = ak)} > 0). In fact, B can be represented in the form

B=

where

I* {~~ .... , ~k = at},

I* extends over some set (a~, ... ' an Consequently P{~n =an, .. ·, ~k+l P{(~n

= ak+llB n

(~k

= ak)}

=an, ... , ~k = ak) n B} P{(~k = ak) n B}

I* P{(~n

= aw .. , ~k = ak) n (~ 0 = a6, ... , ~k =a%)} P{(~k = ak) n B}

(30)

But, by the Markov property, P{(~n

=an, ... , ~k = ak) n (~o = a~, ... , ~k =at)}

=

{

P{~n =an, ... , ~k+l = ak+ll~o =a~,···, ~k =at} x P{~ 0 =a~, ... , ~k =at} ifak =at, 0 if ak i= at,

=

{

P{~n =an, ... , ~k+l = ak+ll~k = ak}P{~o =a~, ... , ~k =at} if ak =at, 0 if ak i= at,

={

pgn =an,··., ~k+1 0

if ak =at, if ak i= at.

= ak+ll~k = ak}P{(~k = ak) n B}

126

I. Elementary Probability Theory

Therefore the sum I* in (30) is equal to P{ ~n = an, ... , ~k+ 1 = ak+ d ~k = adP{(~k = ak) n B},

This establishes (29). Let r be a stopping time (with respect to the system D~ = (DDo 5 k 5 n; see Definition 2 in §11).

Definition. We say that a set Bin the algebra~~ belongs to the system of sets ~; if, for each k, 0 ~ k ~ n, B n {r

= k}

(31)

E ~f.

It is easily verified that the collection of such sets B forms an algebra (called the algebra of events observed at timer).

Theorem 2. Let ~ = (~ 0 , ••• , ~n) be a homogeneous Markov chain with transition matrix IIPull, r a stopping time (with respect to 2fi~), BE~~ and A = {w: r + l ~ n}. Then if P{A n B n (~, = a0 )} > 0, we have P{~r+t

and if P{A n

(~,

= =

a1,

••• ,

P{~r+t

= a 1 IA n B n (~, = a 0 )} = a1, ••• , ~r+l = a 1 IA n (~, = a 0 )}, ~r+l

(32)

= a0 )} > 0 then

For the sake of simplicity, we give the proof only for the case l = 1. Since B n (r = k) E ~L we have, according to (29), P{~r+l =

a 1, An B n (~, = a0 )}

L

k5n-l

P{~k+l

= a1, ~k =

a0 , r

= k, B}

k5n-l k5n-l

=

Paoal.

L

k5n-l

P{~k = ao, r = k, B} =

Paoal.

P{A n B n (~,

= ao)},

which simultaneously establishes (32) and (33) (for (33) we have to take B= Q).

Remark. When l

= 1, the strong Markov property (32), (33) is evidently equivalent to the property that

(34)

127

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

for every C s;; X, where

Pao(C) =

L Paoa1·

a1EC

In turn, (34) can be restated as follows: on the set A = {T

~

n- 1}, (35)

which is a form of the strong Markov property that is commonly used in the general theory of homogeneous Markov processes. 7. Let ~ = (~ 0 , .••• , matrix IIPiill, and let

~n)

be a homogeneous Markov chain with transition

!!7>

= P{ ~k = i, ~~ =F i, 1 ~ l ~ k - 11 ~ 0 = i}

(36)

f!J>

= P{~k =j, ~~ #j,

1 ~ l ~ k- 11~ 0 = i}

(37)

and

for i =F j be respectively the probability of first return to state i at time k and the probability of first arrival at state j at time k. Let us show that n

P I}\~)

= i..J "'

j\~>p<.~-k)

k=1

I}

}} ·

'

where

p)~> =

1.

(38)

The intuitive meaning of the formula is clear: to go from state i to state j inn steps, it is necessary to reach statej for the first time ink steps (1 ~ k ~ n) and then to go from state j to state j in n - k steps. We now give a rigorous derivation. Let j be given and ' = min{l ~ k ~ n: ~k = j}, assuming that'! = n

+ 1 if {·}

=

0. Then f!J> = P{T =

kl~o = i} and

P!~j> = P{~n =jl~o = i}

L

P{~n = j, '! = kl~o = i}

L

P{~r+n-k=j,T=kl~o=i},

15k5n

15k5n

(39)

where the last equation follows because ~t+n-k = ~n on the set {'! = k}. Moreover, the set {' = k} = {' = k, ~r = j} for every k, 1 ~ k ~ n. Therefore if P{~ 0 = i,' = k} > 0, it follows from Theorem 2 that P{~r+n-k =jl~o = i, '! = k} = P{~r+n-k =jl~o = i, '! = k, ~t =j}

= P{~t+n-k = j I~t = j} = ptf-k>

128

I. Elementary Probability Theory

and by (37)

P!j> = =

n

L P{~t+n-k = jl~o = i,

7:

k= 1

= k}P{r = kl~o = i}

n

~ p(~-k>:r(~)

L,

k=l

JJ

I] '

which establishes (38). 8.PROBLEMS

= (~ 0 , ... , ~.)be a Markov chain with values in X and f = f(x) (x EX) a function. Will the sequence (f(~ 0 ), ... ,f(~.)) form a Markov chain? Will the "reversed" sequence

1. Let ~

(~., ~.- 1> • • · ' ~o)

form a Markov chain? 2. Let IJll = IIP;)I, 1 :::; i,j:::; r, be a stochastic matrix and A. an eigenvalue of the matrix, i.e. a root of the characteristic equation det IIIJll - A.E II = 0. Show that A.0 = 1 is an eigenvalue and that all the other eigenvalues have moduli not exceeding 1. If all the eigenvalues ..1. 1, ••• , A., are distinct, then p~~l admits the representation p~~l = ni

+ a;;{1)A.~ + · ·· + a;k)A.~,

where ni, a;;(!), ... , a;;{r) can be expressed in terms of the elements of IJll. (It follows from this algebraic approach to the study of Markov chains that, in particular, when I..1. 1 1< 1, ... , IA., I < 1, the limit lim Pl~l exists for every j and is independent of i.) 3.

Let~ = (~ 0 , ••. , ~.)be

tion matrix

IJll

=

a homogeneous Markov chain with state space X and transi-

IIPxyll. Let

T
E[
(=

~
Let the nonnegative function


x EX.

Show that the sequence of random variables

is a martingale. 4. Let ~=(,.,Ill, IJll) and ~=(,.,Ill, IJll) be two Markov chains with different initial distributions Ill = (p 1, ••• , p,) and li = (ph ... , p,). Show that if min;.i Pii ::::: e > 0 then

L lftl"l- p~"ll :::; 2(1 -e)".

i=l

CHAPTER II

Mathematical Foundations of Probability Theory

§I. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms 1. The models introduced in the preceding chapter enabled us to give a probabilistic-statistical description of experiments with a finite number of outcomes. For example, the triple (Q, d, P) with

and p(w) = pEa•qn-r.a, is a model for the experiment in which a coin is tossed n times "independently" with probability p of falling head. In this model the number N(Q) of outcomes, i.e. the number of points in n, is the finite number 2n. We now consider the problem of constructing a probabilistic model for the experiment consisting of an infinite number of independent tosses of a coin when at each step the probability offalling head is p. It is natural to take the set of outcomes to be the set

n=

{co: w = (al, a2, .. .), ai = 0, 1},

i.e. the space of sequences w = (a 1, a 2 , .• •) whose elements are 0 or 1. What is the cardinality N(Q) of Q? It is well known that every number a e [0, 1) has a unique binary expansion (containing an infinite number of zeros) (ai = 0, 1).

130

II. Mathematical Foundations of Probability Theory

Hence it is clear that there is a one-to-one correspondence between the points OJ ofQ and the points a of the set [0, 1), and therefore Q has the cardinality of the continuum. Consequently if we wish to construct a probabilistic model to describe experiments like tossing a coin infinitely often, we must consider spaces Q of a rather complicated nature. We shall now try to see what probabilities ought reasonably to be assigned (or assumed) in a model of infinitely many independent tosses of a fair coin {p + q = !). Since we may take Q to be the set [0, 1), our problem can be considered as the problem of choosing points at random from this set. For reasons of symmetry, it is clear that all outcomes ought to be equiprobable. But the set [0, 1) is uncountable, and if we suppose that its probability is 1, then it follows that the probability p(OJ) of each outcome certainly must equal zero. However, this assignment of probabilities {p{OJ) = 0, OJ e [0, 1)) does not lead very far. The fact is that we are ordinarily not interested in the probability of one outcome or another, but in the probability that the result of the experiment is in one or another specified set A of outcomes (an event). In elementary probability theory we use the probabilities p(OJ) to find the probability P(A) of the event A: P(A) = p(OJ). In the present case, with p(OJ) = 0, OJ E [0, 1), we cannot define, for example, the probability that a point chosen at random from [0, 1) belongs to the set [0, !). At the same time, it is intuitively clear that this probability should be!. These remarks should suggest that in constructing probabilistic models for uncountable spaces Q we must assign probabilities, not to individual outcomes but to subsets of Q. The same reasoning as in the first chapter shows that the collection of sets to which probabilities are assigned must be closed with respect to unions, intersections and complements. Here the following definition is useful.

LroeA

Definition 1. Let Q be a set of points OJ. A system d of subsets of Q is called an algebra if (a) Qed, (b) A, BEd=> Au BEd, (c) Aed=>Aed

An Bed,

(Notice that in condition (b) it is sufficient to require only that either A u B E d or that A n B E d, since A u B = A n Band A n B = A u B.) The next definition is needed in formulating the concept of a probabilistic model.

Definition 2. Let d be an algebra of subsets of Q. A set function J.l = J.l(A), A e d, taking values in [0, oo ], is called a finitely additive measure defined

131

§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes

on .91 if Jl(A

+ B) =

Jl(A)

+ Jl(B).

(1)

for every pair of disjoint sets A and B in .91. A finitely additive measure Jl with Jl(Q) < oo is called finite, and when Jl(Q) = 1 it is called a finitely additive probability measure, or a finitely additive probability.

2. We now define a probabilistic model (in the extended sense).

Definition 3. An ordered triple (Q, .91, P), where (a) Q is a set of points w; (b) .91 is an algebra of subsets of Q; (c) Pis a finitely additive probability on A, is a probabilistic model in the extended sense. It turns out, however, that this model is too broad to lead to a fruitful mathematical theory. Consequently we must restrict both the class of subsets of Q that we consider, and the class of admissible probability measures.

Definition 4. A system ff of subsets of Q is a (1-algebra if it is an algebra and satisfies the following additional condition (stronger than (b) of Definition 1): (b*) if An E ff, n = 1, 2, ... , then (it is sufficient to require either that

u An

E

$' or that

n An

E

$').

Definition 5. The space Q together with a (1-algebra ff of its subsets is a measurable space, and is denoted by (Q, $'). Definition 6. A finitely additive measure Jl defined on an algebra .91 of subsets of Q is countably additive (or (1-additive), or simply a measure, if, for all pairwise disjoint subsets A 1, A 2 , ••• of A,

Jl(~l A") =

Jl

Jl(An>·

A finitely additive measure Jl is said to be (1-finite ifQ can be represented in the form n=l

with Jl(Qn) <

00,

n = 1, 2, ....

132

II. Mathematical Foundations of Probability Theory

If a countably additive measure P on the algebra A satisfies P(Q) = 1, it is called a probability measure or a probability (defined on the sets that belong to the algebra d). Probability measures have the following properties.

If 0 is the empty set then

P(0) = 0.

If A, B E d then P(A u B)= P(A)

If A, B E d and B

!:;;;;

+ P(B)-

P(A n B).

A then P(B) ::; P(A).

If A. E d, n = 1, 2, ... , and

U A. E d, then

The first three properties are evident. To establish the last one it is enough to observe that 1 B., where B 1 =At> B.= A1 n · · · n 1 A.= A._ 1 n A., n ~ 2, B; n Bj = 0, i # j, and therefore

L:'=

L:'=

The next theorem, which has many applications, provides conditions under which a finitely additive set function is actually countably additive.

Theorem. Let P be a finitely additive set function defined over the algebra d, with P(Q) = 1. The following four conditions are equivalent: (1) P is a-additive (Pis a probability); (2) P is continuous from below, i.e. for any sets A 1 , A 2 , •• • Ed such that A. !:;;;; An+ 1 and 1 A. Ed,

U:'=

(3) Pis continuous from above, i.e. for any sets A 1 , A 2 ,

and

n:'=

1

A. E d,

•••

such that A. 2 A.+ 1

133

§I. Probabilistic Model for an Experiment with Infinitely Many Outcomes

(4) Pis continuous at 0, i.e. for any sets A1, A 2 , ••• Ed such that An+ 1 and 1 An= 0,

n:=

lim P(An)

=

~

An

0.

n

Since

PROOF. (1) => (2).

00

UAn =

A1 + (Az \A1) + (A3 \Az) + · · ·,

n=l we have

PC91An)

=

P(A 1) + P(A 2 \A 1) + P(A 3\A 2 ) + ...

=

P(A 1) + P(A 2 )

-

P(A 1) + P(A 3)- P(A 2 ) + · · ·

=lim P(An). n

(2) => (3). Let n

~

1; then

The sequence {A 1\An}n> 1 of sets is nondecreasing (see the table in Subsection 3 below) and 00

00

U
n=l

=

A1\ nAn. n=l

Then, by (2)

and therefore n

=

P(Al)- PC91(A1\An))

=

P(A1)- P(A1)

=

+ P(Q1An)

P(A1)- P(Al\0/n) =

P(Q1An)

(3) = rel="nofollow"> (4). Obvious. (4) =>(1). Let Ab A 2 , •.. Ed be pairwise disjoint and let Then

L:-'=

1

An Ed.

n=l

U A.

00

Af:,.B

A+B A\B

0 AnB=0

An B (or AB)

AuB

A=!l\A

ff A Eff

n

(J)

Notation

Table

union of the sets A 1 , A 2 , •••

sum of sets, i.e. union of disjoint sets difference of A and B, i.e. the set of points that belong to A but not to B symmetric difference of sets, i.e. (A \B) u (B\A)

outcome, sample point, elementary event sample space; certain event a-algebra of events event (if wE A, we say that event A occurs) event that A does not occur event that either A or B occurs

element or point set of points a-algebra of subsets set of points complement of A, i.e. the set of points w that are not in A union of A and B, i.e. the set of points w belonging either to A or to B intersection of A and B, i.e. the set of points w belonging to both A and B empty set A and B are disjoint

event that at least one of A 1 , A 2 , ••• occurs

event that A or B occurs, but not both

impossible event events A and B are mutually exclusive, i.e. cannot occur simultaneously event that one of two mutually exclusive events occurs event that A occurs and B does not

event that both A and B occur

Interpretation in probability theory

Set-theoretic interpretation

;. "'

0

...,

'<

(\)

-< -l =-0

~

"'

0

....., ..., 0 g"

~



~

P-

"'=

.,e:..

r;·

~

3

(\)

-a::

.j;>.

v.>

-

A.

li~ 1 A.)

* i.o.

=

infinitely often.

(or lim inf A.)

limA.

lim A. (or lim sup A. or* {A. i.o.})

A

=

=

A li~ 1' A")

1A

(or

A.

(or

A. 1' A

n= 1

00

n A.

n=l

I

00

00

00

n= 1 k= n

00

un

the set

Ak

UAk

n= 1 k=n

00

n

the set

A 1 2 A 2 2 ···and A

= n=1

00

n A"

n=l

U A.

the decreasing sequence of sets A" converges to A, i.e.

A 1 c;:; A 2 c;:; ···and A=

00

the increasing sequence of sets A. converges to A, I.e.

intersection of A~> A 2 , •••

sum, i.e. union of pairwise disjoint sets A 1 , A 2 , .•• A~>

A2, •••

occur

•••

event that all the events A 1 , A 2 , ••• occur with the possible exception of a finite number of them

event that infinitely many of events A 1 , A 2 •..• occur

the decreasing sequence of events converges to event A

the increasing sequence of events converges to event A

event that all the events

event that one of the mutually exclusive events A 1 , A 2 , occurs

,.,

0

Vo

! j.)

-

~

a

0 c: ;:;

'<

Ol

a:: =

.:;

::n

= g.

:§. &

~

§"

(1)

$....

Ol

=

0' ....

g.

0.

a:: 0

r;·



g

0 cr"

::,0

'e:

136

II. Mathematical Foundations of Probability Theory

and since L~n+ 1 A; !

0, n --+

oo, we have

3. We can now formulate Kolmogorov's generally accepted axiom system, which forms the basis for the concept of a probability space.

Fundamental Definition. An ordered triple (Q, fi', P) where

n

(a) is a set of points w, (b) fi' is a u-algebra of subsets of n, (c) Pis a probability on fi',

is called a probabilistic model or a probability space. Here n is the sample space or space of elementary events, the sets A in ff are events, and P(A) is the probability of the event A. It is clear from the definition that the axiomatic formulation of probability theory is based on set theory and measure theory. Accordingly, it is useful to have a table (pp. 134-135) displaying the ways in which various concepts are interpreted in the two theories. In the next two sections we shall give examples of the measurable spaces that are most important for probability theory and of how probabilities are assigned on them.

4.

PROBLEMS

n = {r: r E [0, 1]} be the set of rational points of [0, 1], d the algebra of sets each of which is a finite sum of disjoint sets A of one of the forms {r: a < r < b}, {r: a::;; r < b}, {r: a< r::;; b}, {r: a::;; r::;; b}, and P(A) = b- a. Show that P(A), A Ed, is finitely additive set function but not countably additive.

1. Let

2. Let Q be a countable set and :!F the collection of all its subsets. Put JJ(A) = 0 if A is finite and JJ(A) = oo if A is infinite. Show that the set function JJ is finitely additive but not countably additive. 3. Let JJ be a finite measure on a u-algebra :!F, A. E :!F, n = 1, 2, ... , and A (i.e., A = lim. A. = lim. A.). Show that JJ(A) = lim. JJ(A.). 4. Prove that P(A 6 B)

= P(A) + P(B) -

2P(A n B).

= lim. A.

137

§2. Algebras and a-Algebras. Measurable Spaces

5. Show that the "distances" p 1(A, B) and p2(A, B) defined by p 1(A, B)= P(A

~B),

P(A ~B) { piA, B) = :(A v B)

ifP(A v B)# 0, if P(A v B)= 0

satisfy the triangle inequality. 6. Let f.1. be a finitely additive measure on an algebra d, and let the sets A 1 , A 2 , ••• Ed be pairwise disjoint and satisfy A = 1 f..I.(A;). 1 Ai Ed. Then f.J.(A) ~

Ir;

Ir;

7. Prove that lim sup An= lim inf A., lim inf A. ~ lim sup A.,

lim inf An = lim sup An,

lim sup(A. v B.) = lim sup A. v lim sup B.,

lim sup A. n lim inf B. ~ lim sup( A. n B.) ~ lim sup A. n lim sup B•.

If A. j A or A.

l

A, then lim inf An= lim sup A•.

8. Let {x.} be a sequence of numbers and A.= (- oo, xn). Show that x =lim sup x. and A = lim sup A. are related in the following way: (- oo, x) ~ A ~ (- oo, x]. In other words, A is equal to either (- oo, x) or to (- oo, x]. 9. Give an example to show that if a measure takes the value general that countable additivity implies continuity at 0.

+ oo, it does not follow in

§2. Algebras and a-Algebras. Measurable Spaces 1. Algebras and a-algebras are the components out of which probabilistic models are constructed. We shall present some examples and a number of results for these systems. Let n be a sample space. Evidently each of the collections of sets ~* =

{0, Q},

~*

= {A: A£:; Q}

is both an algebra and a a-algebra. In fact, ~* is trivial, the "poorest" a-algebra, whereas ~* is the "richest" a-algebra, consisting of all subsets

ofn.

When n is a finite space, the a-algebra ~* is fully surveyable, and commonly serves as the system of events in the elementary theory. However, when the space is uncountable the class ~* is much too large, since it is impossible to define "probability" on such a system of sets in any consistent way. If A £:; Q, the system ~A = {A,

A, 0, Q}

138

II. Mathematical Foundations of Probability Theory

is another example of an algebra (and a a-algebra), the algebra (or a-algebra) generated by A. This system of sets is a special case of the systems generated by decompositions. In fact, let ~

= {D 1 , D 2 , ••. }

be a countable decomposition of Q into nonempty sets: Q = D1

+ D2 + ··· ;

D; n Di =

0,

i =F j.

Then the system d = a(~), formed by the sets that are unions of finite numbers of elements of the decomposition, is an algebra. The following lemma is particularly useful since it establishes the important principle that there is a smallest algebra, or a-algebra, containing a given collection of sets.

Lemma 1. Let tff be a collection of subsets of n. Then there are a smallest algebra a(&) and a smallest a-algebra a(&) containing all the sets that are inS. The class !F* of all subsets of Q is a a-algebra. Therefore there are at least one algebra and one a-algebra containing S. We now define a(S) (or a(S)) to consist of all sets that belong to every algebra (or a-algebra) containing&. It is easy to verify that this system is an algebra (or a-algebra) and indeed the smallest. PROOF.

Remark. The algebra r:x(E) (or a(E), respectively) is often referred to as the smallest algebra (or a-algebra) generated by S. We often need to know what additional conditions will make an algebra, or some other system of sets, into a a-algebra. We shall present several results of this kind. Definition 1. A collection .H of subsets of Q is a monotonic class if An E .A, n = 1, 2, ... , together with An j A or An ! A, implies that A E .H. Let tff be a system of sets. Let f.l(S) be the smallest monotonic class containing&. (The proof of the existence of this class is like the proof of Lemma 1.)

Lemma 2. A necessary and sufficient condition for an algebra d to be a a-algebra is that it is a monotonic class. PRooF. A a-algebra is evidently a monotonic class. Now let d be a monotonic class and An Ed, n = 1, 2, .... It is clear that Bn = 1 A; Ed and Bn £ Bn+ 1 . Consequently, by the definition of a monotonic class, Bn j U~ 1 A; Ed. Similarly we could show that n~ 1 A; Ed.

Ui'=

By using this lemma, we can prove that, starting with an algebra d, we can construct the a-algebra a(d) by means of monotonic limiting processes.

139

§2. Algebras and a-Algebras. Measurable Spaces

Theorem 1. Let .91 be an algebra. Then f.l( .91) = a(.91).

(1)

PROOF. By Lemma 2, μ(𝒜) ⊆ σ(𝒜). Hence it is enough to show that μ(𝒜) is a σ-algebra. But 𝓜 = μ(𝒜) is a monotonic class, and therefore, by Lemma 2 again, it is enough to show that μ(𝒜) is an algebra.

Let A ∈ 𝓜; we show that Ā ∈ 𝓜. For this purpose, we shall apply a principle that will often be used in the future, the principle of appropriate sets, which we now illustrate. Let

𝓜̃ = {B : B ∈ 𝓜, B̄ ∈ 𝓜}

be the sets that have the property that concerns us. It is evident that 𝒜 ⊆ 𝓜̃ ⊆ 𝓜. Let us show that 𝓜̃ is a monotonic class. Let B_n ∈ 𝓜̃; then B_n ∈ 𝓜, B̄_n ∈ 𝓜, and therefore

lim↑ B_n ∈ 𝓜,  lim↓ B_n ∈ 𝓜,  lim↑ B̄_n = (lim↓ B_n)̄ ∈ 𝓜,  lim↓ B̄_n = (lim↑ B_n)̄ ∈ 𝓜,

and therefore 𝓜̃ is a monotonic class. But 𝓜̃ ⊆ 𝓜 and 𝓜 is the smallest monotonic class. Therefore 𝓜̃ = 𝓜, and if A ∈ 𝓜 = μ(𝒜), then we also have Ā ∈ 𝓜, i.e. 𝓜 is closed under the operation of taking complements.

Let us now show that 𝓜 is closed under intersections. Let A ∈ 𝓜 and

𝓜_A = {B : B ∈ 𝓜, A ∩ B ∈ 𝓜}.

From the equations

lim↓ (A ∩ B_n) = A ∩ lim↓ B_n,  lim↑ (A ∩ B_n) = A ∩ lim↑ B_n

it follows that 𝓜_A is a monotonic class. Moreover, it is easily verified that

(A ∈ 𝓜_B) ⇔ (B ∈ 𝓜_A).   (2)

Now let A ∈ 𝒜; then since 𝒜 is an algebra, for every B ∈ 𝒜 the set A ∩ B ∈ 𝒜, and therefore

𝒜 ⊆ 𝓜_A ⊆ 𝓜.

But 𝓜_A is a monotonic class, and 𝓜 is the smallest monotonic class. Therefore 𝓜_A = 𝓜 for all A ∈ 𝒜. But then it follows from (2) that

(A ∈ 𝓜_B) ⇔ (B ∈ 𝓜_A = 𝓜)


II. Mathematical Foundations of Probability Theory

whenever A ∈ 𝒜 and B ∈ 𝓜. Consequently, if A ∈ 𝒜 then A ∈ 𝓜_B for every B ∈ 𝓜. Since A is any set in 𝒜, it follows that 𝒜 ⊆ 𝓜_B. But 𝓜_B is a monotonic class and 𝓜 is the smallest monotonic class, so for every B ∈ 𝓜

𝓜_B = 𝓜,

i.e. if B ∈ 𝓜 and C ∈ 𝓜 then C ∩ B ∈ 𝓜. Thus 𝓜 is closed under complementation and intersection (and therefore under unions). Consequently 𝓜 is an algebra, and the theorem is established.

Definition 2. Let Ω be a space. A class 𝒟 of subsets of Ω is a d-system if

(a) Ω ∈ 𝒟;
(b) A, B ∈ 𝒟 and A ⊆ B imply B\A ∈ 𝒟;
(c) A_n ∈ 𝒟 and A_n ⊆ A_{n+1} imply ∪A_n ∈ 𝒟.

If ℰ is a collection of sets, then d(ℰ) denotes the smallest d-system containing ℰ.

Theorem 2. If the collection ℰ of sets is closed under intersections, then

d(ℰ) = σ(ℰ).   (3)

PROOF. Every σ-algebra is a d-system, and consequently d(ℰ) ⊆ σ(ℰ). Hence if we prove that d(ℰ) is closed under intersections, then d(ℰ) is a σ-algebra and the opposite inclusion σ(ℰ) ⊆ d(ℰ) follows.

The proof once again uses the principle of appropriate sets. Let

ℰ_1 = {B ∈ d(ℰ) : B ∩ A ∈ d(ℰ) for all A ∈ ℰ}.

If B ∈ ℰ then B ∩ A ∈ ℰ for all A ∈ ℰ, and therefore ℰ ⊆ ℰ_1. But ℰ_1 is a d-system. Hence d(ℰ) ⊆ ℰ_1. On the other hand, ℰ_1 ⊆ d(ℰ) by definition. Consequently ℰ_1 = d(ℰ).

Now let

ℰ_2 = {B ∈ d(ℰ) : B ∩ A ∈ d(ℰ) for all A ∈ d(ℰ)}.

Again it is easily verified that ℰ_2 is a d-system. If B ∈ ℰ, then by the definition of ℰ_1 we obtain that B ∩ A ∈ d(ℰ) for all A ∈ ℰ_1 = d(ℰ). Consequently ℰ ⊆ ℰ_2 and d(ℰ) ⊆ ℰ_2. But ℰ_2 ⊆ d(ℰ); hence d(ℰ) = ℰ_2, and therefore


whenever A and B are in d(ℰ), the set A ∩ B also belongs to d(ℰ), i.e. d(ℰ) is closed under intersections. This completes the proof of the theorem.

We next consider some measurable spaces (Ω, 𝓕) which are extremely important for probability theory.
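Theorem 2 can be probed by brute force on a finite Ω. On a finite space the increasing-union condition (c) adds nothing new (an increasing chain of finitely many sets ends in its largest member), so the d-system closure amounts to adding Ω and closing under proper differences. The space, generators, and helper name below are illustrative assumptions, not from the text.

```python
def d_system(omega, gens):
    """Smallest d-system on a *finite* omega containing the generators:
    it contains omega and is closed under proper differences
    (A subset of B implies B \\ A belongs to the system)."""
    omega = frozenset(omega)
    d = {omega} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for a in list(d):
            for b in list(d):
                if a <= b and (b - a) not in d:
                    d.add(b - a)
                    changed = True
    return d

# The generators {1,2} and {1} are closed under intersection
# ({1,2} ∩ {1} = {1}), so by Theorem 2 the d-system they generate
# should already be the sigma-algebra with atoms {1}, {2}, {3,4}.
d = d_system({1, 2, 3, 4}, [{1, 2}, {1}])
```

Running the closure indeed produces all 8 sets of the σ-algebra, in line with Theorem 2; with intersection-free generators the d-system can be strictly smaller.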

2. The measurable space (R, ℬ(R)). Let R = (−∞, ∞) be the real line and let

(a, b] = {x ∈ R : a < x ≤ b}

for all a and b, −∞ ≤ a < b < ∞. The interval (a, ∞] is taken to be (a, ∞). (This convention is required if the complement of an interval (−∞, b] is to be an interval of the same form, i.e. open on the left and closed on the right.)

Let 𝒜 be the system of subsets of R which are finite sums of disjoint intervals of the form (a, b]:

A ∈ 𝒜 if A = Σ_{i=1}^{n} (a_i, b_i],  n < ∞.

It is easily verified that this system of sets, in which we also include the empty set ∅, is an algebra. However, it is not a σ-algebra, since if A_n = (0, 1 − 1/n] ∈ 𝒜, we have ∪A_n = (0, 1) ∉ 𝒜.

Let ℬ(R) be the smallest σ-algebra σ(𝒜) containing 𝒜. This σ-algebra, which plays an important role in analysis, is called the Borel algebra of subsets of the real line, and its sets are called Borel sets.

If 𝓘 is the system of intervals I of the form (a, b], and σ(𝓘) is the smallest σ-algebra containing 𝓘, it is easily verified that σ(𝓘) is again the Borel algebra. In other words, we can obtain the Borel algebra from 𝓘 without going through the algebra 𝒜, since σ(𝓘) = σ(α(𝓘)). We observe that

(a, b) = ∪_{n=1}^{∞} (a, b − 1/n],  a < b,

[a, b] = ∩_{n=1}^{∞} (a − 1/n, b],  a < b,

{a} = ∩_{n=1}^{∞} (a − 1/n, a].

Thus the Borel algebra contains not only the intervals (a, b] but also the singletons {a} and all sets of the six forms

(a, b),  [a, b],  [a, b),  (−∞, b),  (−∞, b],  (a, ∞).   (4)
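The countable-operation identities above can be probed numerically with membership predicates. A computer can only search the union (or check the intersection) up to a finite n, so the cutoff n_max below is an illustrative truncation, not part of the mathematics; the function names are likewise ad hoc.

```python
def in_half_open(a, b, x):
    """Membership in a generating interval (a, b]."""
    return a < x <= b

def in_open(a, b, x, n_max=10**4):
    """(a, b) = union over n of (a, b - 1/n]; search the union up to n_max."""
    return any(in_half_open(a, b - 1.0 / n, x) for n in range(1, n_max + 1))

def in_closed(a, b, x, n_max=10**4):
    """[a, b] = intersection over n of (a - 1/n, b]; check up to n_max."""
    return all(in_half_open(a - 1.0 / n, b, x) for n in range(1, n_max + 1))

def in_singleton(a, x, n_max=10**4):
    """{a} = intersection over n of (a - 1/n, a]."""
    return in_closed(a, a, x, n_max)
```

For points safely away from the truncation error the predicates agree with the set identities: 0.5 lies in (0, 1) but the right endpoint 1.0 does not, while both endpoints lie in [0, 1].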


Let us also notice that the construction of ℬ(R) could have been based on any of the six kinds of intervals instead of on (a, b], since all the minimal σ-algebras generated by systems of intervals of any of the forms (4) are the same as ℬ(R).

Sometimes it is useful to deal with the σ-algebra ℬ(R̄) of subsets of the extended real line R̄ = [−∞, ∞]. This is the smallest σ-algebra generated by intervals of the form

(a, b] = {x ∈ R̄ : a < x ≤ b},  −∞ ≤ a < b ≤ ∞,

where (−∞, b] is to stand for the set {x ∈ R̄ : −∞ ≤ x ≤ b}.

Remark 1. The measurable space (R, ℬ(R)) is often denoted by (R, ℬ) or (R^1, ℬ^1).

Remark 2. Let us introduce the metric

ρ_1(x, y) = |x − y| / (1 + |x − y|)

on the real line R (this is equivalent to the usual metric |x − y|), and let ℬ_0(R) be the smallest σ-algebra generated by the open sets S_ρ(x^0) = {x ∈ R : ρ_1(x, x^0) < ρ}, ρ > 0, x^0 ∈ R. Then ℬ_0(R) = ℬ(R) (see Problem 7).

3. The measurable space (R^n, ℬ(R^n)). Let R^n = R × ⋯ × R be the direct, or Cartesian, product of n copies of the real line, i.e. the set of ordered n-tuples x = (x_1, ..., x_n), where −∞ < x_k < ∞, k = 1, ..., n. The set

I = I_1 × ⋯ × I_n,

where I_k = (a_k, b_k], i.e. the set {x ∈ R^n : x_k ∈ I_k, k = 1, ..., n}, is called a rectangle, and I_k is a side of the rectangle. Let 𝓘 be the set of all rectangles I. The smallest σ-algebra σ(𝓘) generated by the system 𝓘 is the Borel algebra of subsets of R^n and is denoted by ℬ(R^n).

Let us show that we can arrive at this Borel algebra by starting in a different way. Instead of the rectangles I = I_1 × ⋯ × I_n, let us consider the rectangles B = B_1 × ⋯ × B_n with Borel sides (B_k is the Borel subset of the real line that appears in the kth place in the direct product R × ⋯ × R). The smallest σ-algebra containing all rectangles with Borel sides is denoted by

ℬ(R) ⊗ ⋯ ⊗ ℬ(R)

and called the direct product of the σ-algebras ℬ(R). Let us show that in fact

ℬ(R^n) = ℬ(R) ⊗ ⋯ ⊗ ℬ(R).


In other words, the smallest σ-algebra generated by the rectangles I = I_1 × ⋯ × I_n and the one generated by the (broader) class of rectangles B = B_1 × ⋯ × B_n with Borel sides are actually the same. The proof depends on the following proposition.

Lemma 3. Let ℰ be a class of subsets of Ω, let B ⊆ Ω, and define

ℰ ∩ B = {A ∩ B : A ∈ ℰ}.   (5)

Then

σ(ℰ ∩ B) = σ(ℰ) ∩ B.   (6)

PROOF. Since ℰ ⊆ σ(ℰ), we have

ℰ ∩ B ⊆ σ(ℰ) ∩ B.   (7)

But σ(ℰ) ∩ B is a σ-algebra; hence it follows from (7) that

σ(ℰ ∩ B) ⊆ σ(ℰ) ∩ B.

To prove the conclusion in the opposite direction, we again use the principle of appropriate sets. Define

ℰ_B = {A ∈ σ(ℰ) : A ∩ B ∈ σ(ℰ ∩ B)}.

Since σ(ℰ) and σ(ℰ ∩ B) are σ-algebras, ℰ_B is also a σ-algebra, and evidently

ℰ ⊆ ℰ_B ⊆ σ(ℰ),

whence σ(ℰ) ⊆ σ(ℰ_B) = ℰ_B ⊆ σ(ℰ), and therefore σ(ℰ) = ℰ_B. Therefore

A ∩ B ∈ σ(ℰ ∩ B)

for every A ∈ σ(ℰ), and consequently σ(ℰ) ∩ B ⊆ σ(ℰ ∩ B). This completes the proof of the lemma.

Proof that ℬ(R^n) and ℬ(R) ⊗ ⋯ ⊗ ℬ(R) are the same. This is obvious for n = 1. We now show that it is true for n = 2. Since ℬ(R^2) ⊆ ℬ ⊗ ℬ, it is enough to show that the Borel rectangle B_1 × B_2 belongs to ℬ(R^2).

Let R^2 = R_1 × R_2, where R_1 and R_2 are the "first" and "second" real lines, and let ℬ̃_1 = ℬ_1 × R_2 and ℬ̃_2 = R_1 × ℬ_2, where ℬ_1 × R_2 (or R_1 × ℬ_2) is the collection of sets of the form B_1 × R_2 (or R_1 × B_2), with B_1 ∈ ℬ_1 (or B_2 ∈ ℬ_2). Also let 𝓘_1 and 𝓘_2 be the sets of intervals in R_1 and R_2, and 𝓙_1 = 𝓘_1 × R_2, 𝓙_2 = R_1 × 𝓘_2. Then, writing B̃_1 = B_1 × R_2 and B̃_2 = R_1 × B_2, we have by (6)

B_1 × B_2 = B̃_1 ∩ B̃_2 ∈ ℬ̃_1 ∩ B̃_2 = σ(𝓙_1) ∩ B̃_2 = σ(𝓙_1 ∩ B̃_2) ⊆ σ(𝓙_1 ∩ 𝓙_2) = σ(𝓘_1 × 𝓘_2) ⊆ ℬ(R^2),

as was to be proved.


The case of any n, n > 2, can be discussed in the same way.

Remark. Let ℬ_0(R^n) be the smallest σ-algebra generated by the open sets

S_ρ(x^0) = {x ∈ R^n : ρ_n(x, x^0) < ρ},  ρ > 0,  x^0 ∈ R^n,

in the metric

ρ_n(x, x^0) = Σ_{k=1}^{n} 2^{−k} ρ_1(x_k, x_k^0),

where x = (x_1, ..., x_n), x^0 = (x_1^0, ..., x_n^0). Then ℬ_0(R^n) = ℬ(R^n) (Problem 7).

4. The measurable space (R^∞, ℬ(R^∞)) plays a significant role in probability theory, since it is used as the basis for constructing probabilistic models of experiments with infinitely many steps. The space R^∞ is the space of ordered sequences of numbers

x = (x_1, x_2, ...),  −∞ < x_k < ∞,  k = 1, 2, ....

Let I_k and B_k denote, respectively, the intervals (a_k, b_k] and the Borel subsets of the kth line (with coordinate x_k). We consider the cylinder sets

𝓙(I_1 × ⋯ × I_n) = {x : x = (x_1, x_2, ...), x_1 ∈ I_1, ..., x_n ∈ I_n},   (8)

𝓙(B_1 × ⋯ × B_n) = {x : x = (x_1, x_2, ...), x_1 ∈ B_1, ..., x_n ∈ B_n},   (9)

𝓙(B^n) = {x : (x_1, ..., x_n) ∈ B^n},   (10)

where B^n is a Borel set in ℬ(R^n).

Each cylinder 𝓙(B_1 × ⋯ × B_n), or 𝓙(B^n), can also be thought of as a cylinder with base in R^{n+1}, R^{n+2}, ..., since

𝓙(B_1 × ⋯ × B_n) = 𝓙(B_1 × ⋯ × B_n × R),
𝓙(B^n) = 𝓙(B^{n+1}),

where B^{n+1} = B^n × R.

It follows that both systems of cylinders 𝓙(B_1 × ⋯ × B_n) and 𝓙(B^n) are algebras. It is easy to verify that the unions of disjoint cylinders 𝓙(I_1 × ⋯ × I_n) also form an algebra.

Let ℬ(R^∞), ℬ_1(R^∞) and ℬ_2(R^∞) be the smallest σ-algebras containing all the sets (8), (9) or (10), respectively. (The σ-algebra ℬ_1(R^∞) is often denoted by ℬ(R) ⊗ ℬ(R) ⊗ ⋯.) It is clear that

ℬ(R^∞) ⊆ ℬ_1(R^∞) ⊆ ℬ_2(R^∞).

As a matter of fact, all three σ-algebras are the same. To prove this, we put

𝒞_n = {A ∈ ℬ(R^n) : {x : (x_1, ..., x_n) ∈ A} ∈ ℬ(R^∞)},  n = 1, 2, ....

The system 𝒞_n contains all the rectangles I_1 × ⋯ × I_n, since the corresponding cylinders are of the form (8); and 𝒞_n is a σ-algebra. Therefore ℬ(R^n) ⊆ 𝒞_n, i.e. {x : (x_1, ..., x_n) ∈ B^n} ∈ ℬ(R^∞) for every B^n ∈ ℬ(R^n), and consequently ℬ_2(R^∞) ⊆ ℬ(R^∞).

Thus ℬ(R^∞) = ℬ_1(R^∞) = ℬ_2(R^∞). From now on we shall describe sets in ℬ(R^∞) as Borel sets (in R^∞).

Remark. Let ℬ_0(R^∞) be the smallest σ-algebra generated by the open sets

S_ρ(x^0) = {x ∈ R^∞ : ρ_∞(x, x^0) < ρ},  ρ > 0,  x^0 ∈ R^∞,

in the metric

ρ_∞(x, x^0) = Σ_{k=1}^{∞} 2^{−k} ρ_1(x_k, x_k^0),

where x = (x_1, x_2, ...), x^0 = (x_1^0, x_2^0, ...). Then ℬ(R^∞) = ℬ_0(R^∞) (Problem 7).

Here are some examples of Borel sets in R^∞:

(a) {x ∈ R^∞ : sup x_n > a},  {x ∈ R^∞ : inf x_n < a};

(b) {x ∈ R^∞ : lim sup x_n ≤ a},  {x ∈ R^∞ : lim inf x_n > a}, where, as usual,

lim sup x_n = inf_n sup_{m≥n} x_m,  lim inf x_n = sup_n inf_{m≥n} x_m;

(c) {x ∈ R^∞ : x_n →}, the set of x ∈ R^∞ for which lim x_n exists and is finite;

(d) {x ∈ R^∞ : lim x_n > a};

(e) {x ∈ R^∞ : Σ_{n=1}^{∞} |x_n| > a};

(f) {x ∈ R^∞ : Σ_{k=1}^{n} x_k = 0 for at least one n ≥ 1}.

To be convinced, for example, that the sets in (a) belong to the system ℬ(R^∞), it is enough to observe that

{x : sup x_n > a} = ∪_n {x : x_n > a} ∈ ℬ(R^∞),

{x : inf x_n < a} = ∪_n {x : x_n < a} ∈ ℬ(R^∞).
5. The measurable space (Rr, PJ(RT)), where Tis an arbitrary set. The space RT is the collection of real functions x = (x 1) defined for t E Tt. IIi general we shall be interested in the case when Tis an uncountable subset of the real t We shall also use the notations x = (x,),.Rr and x = (x,), t ERr, for elements of Rr.

146

II. Mathematical Foundations of Probability Theory

line. For simplicity and definiteness we shall suppose for the present that T = [0, oo). We shall consider three types of cylinder sets X ••. X

In)= {x: x,,

E

J 1,, ... ,tn(B 1

X··· X

Bn) =

EB 1, ... ,X1nEBn},

{x:X11

Ib

x,n E I 1},

J,,, ... ,tn(J 1

••. '

J,,, ... ,r"(Bn) = {x: (x,,, ... , x,J E Bn},

(11) (12) (13)

where Ik is a set of the form (ak, bk], Bk is a Borel set on the line, and Bn is a Borel set in Rn. The set J,,, ... ,r" (I 1 x · · · x In) is just the set of functions that, at times t 1 , ••• ,tn, "get through the windows" I~ rel="nofollow">···•In and at other times have arbitrary values (Figure 24). Let BI(RT), 91 1(RT) and 91 2 (RT) be the smallest a-algebras corresponding respectively to the cylinder sets (11), (12) and (13). It is clear that (14)

As a matter of fact, all three of these a-algebras are the same. Moreover, we can give a complete description of the structure of their sets.

Theorem 3. LetT be any uncountable set. Then BI(RT) = 91 1(RT) = 91 2 (RT), and every set A E P.l(RT) has the following structure: there are a countable set of points tt> t 2 , ••• ofT and a Borel set Bin fJI(R such that 00 )

(15) PRooF. Let t! denote the collection of sets of the form (15) (for various aggregates (t 1 , t 2 , ..•) and Borel sets Bin BI(R 00 )). If A1 , A 2 , ... Et! and the corresponding aggregates are r< 1> = (t~ll, t~1 >, •• •), r< 2 >= (t~2 >, t~2 >, •• •), ••. ,

Of-----+-,,-+-,, -------Figure 24

r•

147

§2. Algebras and a-Algebras. Measurable Spaces

then the set r = representation

Uk r can be taken as a basis, so that every A
Xr 2 ,

•••)

E

B;},

where B; is a set in one and the same a-algebra :?4(R 00 ), and r; E r. Hence it follows that the system C is a a-algebra. Clearly this a-algebra contains all cylinder sets of the form (1) and, since PA 2 (RT) is the smallest a-algebra containing these sets, and since we have (14), we obtain (16) Let us consider a set A from C, represented in the form (15). For a given aggregate (t 1 , t 2 , •• •), the same reasoning as for the space (R 00 , :?I(R 00 )) shows that A is an element of the a-algebra generated by the cylinder sets ( 11 ). But this a-algebra evidently belongs to the a-algebra :?I(RT); together with (16), this established both conclusions of the theorem. Thus every Borel set A in the a-algebra :?I(RT) is determined by restrictions imposed on the functions x = (x1), t E T, on an at most countable set of points t 1, t 2 , •••• Hence it follows, in particular, that the sets A 1 = {x: sup x 1 < C for all t

E

[0, 1]},

A 2 = {x: x 1 = 0 for at least one t

A 3 = {x:

X1

E

[0, 1]},

is continuous at a given point t0

E

[0, 1]},

which depend on the behavior of the function on an uncountable set of points, cannot be Borel sets. And indeed none of' these three sets belongs to 84(RI 0 • 1l). Let us establish this for A 1 . If A 1 E PA(R 10 · 1l), then by our theorem there are a point (t?, t~, . .. ) and a set B 0 E :?I(R 00 ) such that

=

It is clear that the function y1 C - 1 belongs to (Yr?• ... ) E B 0 . Now form the function 21

{c- 1,

A~>

and consequently

t E (t?, t~, .. .),

= C + 1, t ¢ (t?, t~, .. .).

It is clear that

and consequently the function z = (z1) belongs to the set {x: (x 1?, ... } E B0 }. But at the same time itisclearthatitdoesnot belong to the set {x: sup X 1 < C}. This contradiction shows that A 1 ¢ 84(R 10 • 1l).

148

II. Mathematical Foundations of Probability Theory

Since the sets A 1 , A 2 and A 3 are nonmeasurable with respect to the 0'-algebra dl[R[o, 11) in the space of all functions x = (x,), t e [0, 1], it is natural to consider a smaller class of functions for which these sets are measurable. It is intuitively clear that this will be the case if we take the intial space to be, for example, the space of continuous functions.

6. The measurable space (C, dl(C)). LetT = [0, 1] and let C be the space of continuous functions x = (x,), 0 s:; t s:; 1. This is a metric space with the metric p(x, y) = SUPteT lx,- y,l. We introduce two 0'-algebras in C: dl(C) is the 0'-algebra generated by the cylinder sets, and d6' 0 (C) is generated by the open sets (open with respect to the metric p(x, y)). Let us show that in fact these 0'-algebras are the same: dl(C) = d6' 0 (C). Let B = {x: x 10 < b} be a cylinder set. It is easy to see that this set is open. Hence it follows that {x: x 11 < b1 , ••• , x 1" < bn} E d6' 0(C), and therefore dl(C) s;; d6'0 (C). Conversely, consider a set B P = {y: y e S p(x 0 )} where x 0 is an element of C and Sp(x 0 ) = {x e C: SUPreTix1 - x?i < p} is an open ball with center at x 0 • Since the functions in C are continuous,

BP

=

{y E C:

ye SP(x0)} = {y E C: m~x IYr- x?i < P} =

n{y

E

C: IYtk- x~l < p}

E

dl(C),

(17)

lk

where tk are the rational points of [0, 1]. Therefore d6' 0 (C) s;; dl(C). The following example is fundamental.

7. The measurable space (D, dl(D)), where Dis the space offunctions x = (x1), t e [0, 1], that are continuous on the right (x 1 = x,+ for all t < 1) and have limits from the left (at every t > 0). Just as for C, we can introduce a metric d(x, y) on D such that the 0'-algebra .16'0 (D) generated by the open sets will coincide with the 0'-algebra dl(D) generated by the cylinder sets. This metric d(x, y), which was introduced by Skorohod, is defined as follows: d(x, y) = inf{e

> 0:3 A. e A: sup lx,- YA +sup It- A.(t)i s:; e}, t

(18)

t

where A is the set of strictly increasing functions A. = A.(t) that are continuous on [0, 1] and have A.(O) = 0, A.(1) = 1.

8. The measurable space CflreT Or, I®lreT ~). Along with the space (RT, dl(RT)), which is the direct product ofT copies of the real line together with the system of Borel sets, probability theory also uses the measurable space COteT n" RteT :F,), which is defined in the following way.

149

§3. Methods of Introducing Probability Measures on Measurable Spaces

Let T be any set of indices and (Q0 ~1) a measurable space, t E T. Let il10 the set of functions w = (w 1), t E T, such that w 1 E il1 for each

OreT

Q = tE T.

The collection of cylinder sets

.F1,, ... , 1.(B 1

X ••• X

B.)= {w: W 11

E

B 1, ••• , w 1•

E

B.},

where B1; E ~r;• is easily shown to be an algebra. The smallest a-algebra T ~ 1 , and the measurable containing all these cylinder sets is denoted by space Qi, ~1) is called the direct product of the measurable spaces (Q0 ~ 1 ), t E T.

<0

9.

Plre

PI

PROBLEMS

1. Let ffl 1 and f!l 2 be a-algebras of subsets of n. Are the following systems of sets aalgebras? ffl 1 n ffl 2

= {A: A e ffl 1 and A e ffl 2 },

ffl 1 u ffl 2 ={A: A ef!l 1 or A ef!l 2 }.

2. Let C!fi = { D 1 , D 2 , ••• } be a countable decomposition of n and f!l =

a(~).

Are there

also only countably many sets in f!l?

3. Show that f!l(R") ® f!l(R) = f!l(R"+ 1).

4. Prove that the sets (b)-(f) (see Subsection 4) belong to f!l(R 00 ).

5. Prove that the sets A 2 and A 3 (see Subsection 5) do not belong to f!l(RIO, 11). 6. Prove that the function (15) actually defines a metric. 7. Prove that 96 0 (R")

=

96(R"), n ~ 1, and fli 0 (R"') = f.!I(R"').

8. Let C = C[O, oo) be the space of continuous functions x = (x,) defined for t Show that with the metric p(x, y) =

I z-• min[ sup lx,- y,l, 1],

n= 1

~

0.

x,yeC,

Ostsn

this is a complete separable metric space and that the a-algebra f!l 0 (C) generated by the open sets coincides with the a-algebra f!l(C) generated by the cylinder sets.

§3. Methods of Introducing Probability Measures on Measurable Spaces 1. The measurable space (R, r!4(R)). Let P = P(A) be a probability measure defined on the Borel subsets A of the real line. Take A = (- oo, x] and put F(x)

= P(- oo, x],

X

ER.

(1)

150

II. Mathematical Foundations of Probability Theory

This function has the following properties: (1) F(x) is nondecreasing; (2) F(- oo) = 0, F( + oo) = 1, where F(- oo) = lim F(x),

F( + oo) = lim F(x); xf oo

X~- 00

(3) F(x) is continuous on the right and has a limit on the left at each x

E

R.

The first property is evident, and the other two follow from the continuity properties of probability measures.

Definition 1. Every function F = F(x) satisfying conditions (1)-(3) is called a distribution function (on the real line R). Thus to every probability measure P on (R, &b(R)) there corresponds (by (1)) a distribution function. It turns out that the converse is also true.

Theorem 1. Let F = F(x) be a distribution function on the real lineR. There exists a unique probability measure P on (R, &b(R)) such that P(a, b] = F(b)- F(a)

(2)

for all a, b, - oo :::;; a < b < oo. PRooF. Let d be the algebra of the subsets A of R that are finite sums of disjoint intervals of the form (a, b]:

A =

n

L (ak, bk].

k= 1

On these sets we define a set function P0 by putting n

P 0 (A) =

L [F(bk)- F(ak)],

A Ed.

(3)

k=l

This formula defines, evidently uniquely, a finitely additive set function on d. Therefore if we show that this function is also countably additive on this algebra, the existence and uniqueness of the required measure P on &b(R) will follow immediately from a general result of measure theory (which we quote without proof).

CarathOOdory's Theorem. Let n be a space, d an algebra of its subsets, and FA = a{d) the smallest a-algebra containing d. Let J.l.o be a a-finite measure on (0, A). Then there is a unique measure J.l. on (Q, a(d)) which is an extension of J.l.o, i.e. satisfies J.l.(A) = J.l.o(A),

A Ed.

§3. Methods of Introducing Probability Measures on Measurable Spaces

151

We are now to show that P 0 is countably additive on .91. By a theorem from §1 it is enough to show that P 0 is continuous at 0. i.e. to verify that

LetA 1 , A 2 , .•. be a sequence of sets from .91 with the property An! 0. Let us suppose first that the sets An belong to a closed interval [- N, N], N < oo. Since A is the sum of finitely many intervals of the form (a, b] and since

P0 (a', b)= F(b)- F(a') -t F(b)- F(a)

=

P0 (a, b]

as a' ! a, because F(x) is continuous on the right, we can find, for every An, a set Bn E .91 such that its closure [BnJ £ An and Po(An)- P 0 (Bn)

~

e · 2-n,

where e is a preassigned positive number. [BnJ = 0. But the sets [Bn] By hypothesis, nAn= 0 and therefore are closed, and therefore there is a finite n0 = n0 (e) such that

n

(4) n= 1

(In fact, [ -N, N] is compact, and the collection of sets {[ -N, N]\[BnJ}n~l is an open covering of this compact set. By the Heine-Bore! theorem there is a finite subcovering: no

U([- N, N]\[Bn]) =

[ - N,

N]

n=1

and therefore n~~ 1 [BnJ = 0). Using (4) and the inclusions Ano £ Ano- 1 £ · · · £ A1o we obtain Po(Ano)

=

Po(Ano\01 Bk)

= Po(Ano\01

~

Bk)

+ Po(01 Bk)

~ Po(9 (Ak\Bk)) 1

no

no

k=1

k=1

L Po(Ak\Bk) ~ L e · 2-k ~e.

Therefore P 0 (An)! 0, n -t oo. We now abandon the assumption that An £ [ - N, N] for some N. Take an e > 0 and choose N so that P 0 [ -N, N] > 1 - e/2. Then, since

An = An n [- N, N]

+ An n

[- N, N],

~e have

P 0 (A")

= ~

P 0 (An[ -N, N] + P 0 (An n [ -N, N]) P 0 (An n [ -N, N]) + e/2

152

II. Mathematical Foundations of Probability Theory

and, applying the preceding reasoning (replacing An by An n [- N, N]), we find that P0 (An n [- N, N]) s:; e/2 for sufficiently large n. Hence once again P 0 (An)! 0, n--+ oo. This completes the proof ofthe theorem. Thus there is a one-to-one correspondence between probability measures P on (R, fJI(R)) and distribution functions F on the real line R. The measure P constructed from the function F is usually called the Lebesgue-Stieltjes

probability measure corresponding to the distribution function F. The case when F(x) =

0, X< 0, { x, 0 s:; x s:; 1, 1, X> 1.

is particularly important. In this case the corresponding probability measure (denoted by A.) is Lebesgue measure on [0, 1]. Clearly A.(a, b] = b- a. In other words, the Lebesgue measure of (a, b] (as well as of any ofthe intervals (a, b), [a, b] or [a, b)) is simply its length b -a. Let

fJI([O, 1]) = {A n [0, 1]: A E fJI(R)} be the collection of Borel subsets of [0, 1]. It is often necessary to consider, besides these sets, the Lebesgue measurable subsets of [0, 1]. We say that a set A £ [0, 1] belongs to ~([0, 1)] if there are Borel sets A and B such that A £ A £ Band A.(B\A) = O.lt is easily verified that .si([O, 1]) is au-algebra. It is known as the system of Lebesgue measurable subsets of [0, 1]. Clearly fJI([O, 1]) £ ,sj([O, 1]). The measure A., defined so far only for sets in .?1([0, 1]), extends in a natural way to the system ,sj([O, 1]) of Lebesgue measurable sets. Specifically, if Ae,sj([O, 1]) and A£ A£ B, where A and Be~([O, 1]) and A.(B\A) = 0, we define A:(A) = A.(A). The set function A: = A:(A), A e ~([0, 1]), is easily seen to be a probability measure on ([0, 1], ~([0, 1])). It is usually called Lebesgue measure (on the system of Lebesgue-measurable sets).

Remark. This process of completing (or extending) a measure can be applied,

and is useful, in other situations. For example, let (Q, fF, P) be a probability space. Let §P be the collection of all the subsets A of Q for which there are sets B 1 and B 2 of fF such that B 1 £ A £ B 2 and P(B 2 \B 1) = 0. The probability measure can be defined for sets A e #'Pin a natural way (by P(A) = P(B 1)). The resulting probability space is the completion of (Q, fF, P) with respect to P. A probability measure such that §P = fF is called complete, and the corresponding space (Q, fF, P) is a complete probability space. The correspondence between probability measures P and distribution functions F established by the equation P(a, b] = F(b)- F(a) makes it

153

§3. Methods of Introducing Probability Measures on Measurable Spaces F(x)

I I ~F(x3) I 1

~F(xz)

I

I~F(x,)

x,

x2

X

XJ

Figure 25

possible to construct various probability measures by obtaining the corresponding distribution functions.

Discrete measures are measures P for which the corresponding distributions F = F(x) are piecewise constant (Figure 25), changing their values at the points x 1 , x 2 , ••• (L'1F(xJ > 0, where L'1F(x) = F(x)- F(x- ). In this case the measure is concentrated at the points x 1 , x 2 , .•. :

The set of numbers (p 1 , p2, .. .), where Pk = P( {xk}), is called a discrete probability distribution and the corresponding distribution function F = F(x) is called discrete. We present a table of the commonest types of discrete probability distribution, with their names. Table 1 Distribution

Probabilities Pk

Parameters

Discrete uniform Bernoulli Binomial

1/N, k = 1,2, ... ,N Pt = p, Po= q C~pkq•-\ k = 0, 1, ... , n

N = 1,2, ... 0 :-:; p :-:; 1, q = 1 - p 0 :-:; p :-:; 1, q = 1 - p, n = 1,2, ...

Poisson Geometric Negative binomial

l- 1p, k = q=lP'l-',

e-kfk!, k

=

0, 1, .. . 0, 1, .. . k = r,r + 1, ...

1>0 0 :-:; p :-:; 1, q = 1 - p 0 :-:; p :-:; 1, q = 1 - p,

r = 1, 2, ...

Absolutely continuous measures. These are measures for which the corresponding distribution functions are such that F(x) = roof(t) dt,

(5)

154

II. Mathematical Foundations of Probability Theory

where f = f(t) are nonnegative functions and the integral is at first taken in the Riemann sense, but later (see §6) in that of Lebesgue. The function f = f(x), x E R, is the density of the distribution function F = F(x) (or the density of the probability distribution, or simply the density) and F = F(x) is called absolutely continuous. It is clear that every nonnegative f = f(x) that is Riemann integrable and such that J~ cxJ(x) dx = 1 defines a distribution function by (5). Table 2 presents some important examples of various kinds of densities f = f(x) with their names and parameters (a density f(x) is taken to be zero for values of x not listed in the table). Table 2

Uniform on [a, b] Normal or Gaussian

Parameters

Density

Distribution

lj(b - a),

a,beR; a
a ::::; x ::::; b

(27ta)-li2e-<x-m)2;(2a2)'

X E

R

meR,a>O

x•-le-xiP

Gamma

----, r(IX}P"

IX> 0, f3 > 0

x~O

r>O,s>O

Beta Exponential (gamma with IX = 1, f3 = 1/A.)

Bilateral exponential Chi-squared, x2 (gamma with a IX = n/2, f3 = 2) Student, t

F

Cauchy

x2) -en+ 1)/2 r(!{n + 1)) ( , x (nn)112r(nj2) 1 + --;;-

(m/n)mf2

xm/2-1

B(m/2, n/2) (1

0 n(x

2

+ mxjn)<m+n)/Z

+ 02)

,

E

R

n

=

1, 2, ...

n

=

1, 2, ...

m, n

=

1, 2, ...

xER

Singular measures. These are measures whose distribution functions are continuous but have all their points of increases on sets of zero Lebesgue measure. We do not discuss this case in detail; we merely give an example of such a function.

§3. Methods of Introducing Probability Measures on Measurable Spaces

155

I

2

X

Figure 26

We consider the interval [0, 1] and construct F(x) by the following procedure originated by Cantor. We divide [0, 1] into thirds and put (Figure 26)

F2(x) =

(t, 1),

z,1

X E

1

xe(i,~),

3

(~, !), X= 0,

4' 4'

0,

X E

1, x=l defining it in the intermediate intervals by linear interpolation. Then we divide each of the intervals [0, and [i, 1] into three parts and define the function (Figure 27) with its values at other points determined by linear interpolation.

tJ

X

Figure 27

156

II. Mathematical Foundations of Probability Theory

Continuing this process, we construct a sequence of functions Fn(x),

n = 1, 2, ... , which converges to a nondecreasing continuous function F(x)

(the Cantor function), whose points of increase (xis a point of increase of F(x) + e) - F(x - e) > 0 for every e > 0) form a set of Lebesgue measure zero. In fact, it is clear from the construction of F(x) that the total length of the intervals (j, ~), (!, ~), (~, !), ... on which the function is constant is

if F(x

! + ~ + _i_ + ... = ! 3

9

27

f (~)n = 1.

3 n=O 3

(6)

Let % be the set of points of increase of the Cantor function F(x). It follows from (6) that A.(%) = 0. At the same time, if Jl. is the measure corresponding to the Cantor function F(x), we have JJ.(%) = 1. (We then say that the measure is singular with respect to Lebesgue measure A..) Without any further discussion of possible types of distribution functions, we merely observe that in fact the three types that have been mentioned cover all possibilities. More precisely, every distribution function can be represented in the form p 1 F 1 + p2 F 2 + p 3 F 3 , where F 1 is discrete, F 2 is absolutely continuous, and F 3 is singular, and P; are nonnegative numbers, p1 + p2 + P3 = 1. 2. Theorem 1 establishes a one-to-one correspondence between probability measures on (R, Ei(R)) and distribution functions on R. An analysis of the proof of the theorem shows that in fact a stronger theorem is true, one that in particular lets us introduce Lebesgue measure on the real line. Let Jl. be a u-finite measure on (Q, d), where d is an algebra of subsets of n. It turns out that the conclusion of Caratheodory's theorem on the extension of a measure and an algebra d to a minimal u-algebra u(d) remains valid with au-finite measure; this makes it possible to generalize Theorem 1. A Lebesgue-Stieltjes measure on (R, Ei(R)) is a (countably additive) measure Jl. such that the measure JJ.(I) of every bounded interval I is finite. A generalized distribution junction on the real line R is a nondecreasing function G = G(x), with values on (- oo, oo), that is continuous on the right. Theorem 1 can be generalized to the statement that the formula

JJ.(a, b] = G(b)- G(a),

a< b,

again establishes a one-to-one correspondence between Lebesgue-Stieltjes measures Jl. and generalized distribution functions G. In fact, if G( + oo) - G(- oo) < oo, the proof of Theorem 1 can be taken over without any change, since this case reduces to the case when G( + oo) G(- oo) = 1 and G(- oo) = 0. Now let G( + oo)- G(- oo) = oo. Put G(x), Gn(x) = { G(n) G( -n),

lxl ~ n,

n, x = -n.

X =

157

§3. Methods of Introducing Probability Measures on Measurable Spaces

On the algebra 𝒜 let us define a finitely additive measure μ₀ such that μ₀(a, b] = G(b) − G(a), and let μₙ be the countably additive measures previously constructed (by Theorem 1) from the functions Gₙ(x). Evidently μₙ ↑ μ₀ on 𝒜.

Now let A₁, A₂, … be disjoint sets in 𝒜 with A = Σ Aₙ ∈ 𝒜. Then (Problem 6 of §1)

    μ₀(A) ≥ Σ_{n=1}^∞ μ₀(Aₙ).

If Σ_{n=1}^∞ μ₀(Aₙ) = ∞, then μ₀(A) = Σ_{n=1}^∞ μ₀(Aₙ). Let us suppose that Σ μ₀(Aₙ) < ∞. Then

    μ₀(A) = limₙ μₙ(A) = limₙ Σ_{k=1}^∞ μₙ(Aₖ).

By hypothesis, Σ μ₀(Aₙ) < ∞. Therefore

    0 ≤ μ₀(A) − Σ_{k=1}^∞ μ₀(Aₖ) = limₙ [ Σ_{k=1}^∞ (μₙ(Aₖ) − μ₀(Aₖ)) ] ≤ 0,

since μₙ ≤ μ₀. Thus the σ-finite finitely additive measure μ₀ is countably additive on 𝒜, and therefore (by Carathéodory's theorem) it can be extended to a countably additive measure μ on σ(𝒜).

The case G(x) = x is particularly important. The measure λ corresponding to this generalized distribution function is Lebesgue measure on (R, ℬ(R)). As for the interval [0, 1] of the real line, we can define the system ℬ̄(R) of Lebesgue-measurable sets by writing Ā ∈ ℬ̄(R) if there are Borel sets A and B such that A ⊆ Ā ⊆ B and λ(B\A) = 0. Then Lebesgue measure λ̄ on ℬ̄(R) is defined by λ̄(Ā) = λ(A) whenever A ⊆ Ā ⊆ B, Ā ∈ ℬ̄(R) and λ(B\A) = 0.

3. The measurable space (Rⁿ, ℬ(Rⁿ)). Let us suppose, as for the real line, that P is a probability measure on (Rⁿ, ℬ(Rⁿ)). Let us write

    Fₙ(x₁, …, xₙ) = P((−∞, x₁] × ⋯ × (−∞, xₙ]),

or, in a more compact form,

    Fₙ(x) = P(−∞, x],

where x = (x₁, …, xₙ) and (−∞, x] = (−∞, x₁] × ⋯ × (−∞, xₙ]. Let us introduce the difference operator Δ_{aᵢbᵢ} : Rⁿ → R, defined by the formula

    Δ_{aᵢbᵢ} Fₙ(x₁, …, xₙ) = Fₙ(x₁, …, x_{i−1}, bᵢ, x_{i+1}, …) − Fₙ(x₁, …, x_{i−1}, aᵢ, x_{i+1}, …),


II. Mathematical Foundations of Probability Theory

where aᵢ ≤ bᵢ. A simple calculation shows that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) = P(a, b],    (7)

where (a, b] = (a₁, b₁] × ⋯ × (aₙ, bₙ]. Hence it is clear, in particular, that (in contrast to the one-dimensional case) P(a, b] is in general not equal to Fₙ(b) − Fₙ(a).

Since P(a, b] ≥ 0, it follows from (7) that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) ≥ 0    (8)

for arbitrary a = (a₁, …, aₙ), b = (b₁, …, bₙ). It also follows from the continuity of P that Fₙ(x₁, …, xₙ) is continuous on the right with respect to the variables collectively, i.e. if x⁽ᵏ⁾ ↓ x, x⁽ᵏ⁾ = (x₁⁽ᵏ⁾, …, xₙ⁽ᵏ⁾), then

    Fₙ(x⁽ᵏ⁾) ↓ Fₙ(x),    k → ∞.    (9)

It is also clear that

    lim Fₙ(x₁, …, xₙ) = 1  as all the coordinates xᵢ ↑ +∞,    (10)

and

    lim_{x↓y} Fₙ(x₁, …, xₙ) = 0,    (11)

if at least one coordinate of y is −∞.

Definition 2. An n-dimensional distribution function (on Rn) is a function F = F(x 1, ••• , Xn) with properties (8)-(11). The following result can be established by the same reasoning as in Theorem 1.

Theorem 2. Let F = Fₙ(x₁, …, xₙ) be a distribution function on Rⁿ. Then there is a unique probability measure P on (Rⁿ, ℬ(Rⁿ)) such that

    P(a, b] = Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ).    (12)

Here are some examples of n-dimensional distribution functions.

Let F¹, …, Fⁿ be one-dimensional distribution functions (on R) and

    Fₙ(x₁, …, xₙ) = F¹(x₁) ⋯ Fⁿ(xₙ).

It is clear that this function is continuous on the right and satisfies (10) and (11). It is also easy to verify that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) = ∏_{k=1}ⁿ [Fᵏ(bₖ) − Fᵏ(aₖ)] ≥ 0.

Consequently Fₙ(x₁, …, xₙ) is a distribution function.
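As a small numerical illustration of (7) (the particular numbers below are illustrative choices, not taken from the text), the following Python sketch takes n = 2 with F² built as a product of two uniform distribution functions, and checks that the rectangle probability is the mixed difference, not Fₙ(b) − Fₙ(a):

```python
# Sketch: verify numerically that the mixed difference
# D_{a1 b1} D_{a2 b2} F2(x1, x2) equals the rectangle probability P(a, b]
# for the product distribution function F2(x1, x2) = F(x1) F(x2),
# with F the uniform-[0, 1] distribution function.

def F(x):
    """Uniform-[0, 1] distribution function."""
    return min(max(x, 0.0), 1.0)

def F2(x1, x2):
    return F(x1) * F(x2)

def rect_prob(a1, b1, a2, b2):
    """Apply the mixed difference operator to F2 (inclusion-exclusion
    over the four corners of the rectangle (a, b])."""
    return F2(b1, b2) - F2(a1, b2) - F2(b1, a2) + F2(a1, a2)

# For the uniform product measure, P(a, b] is the area of the rectangle:
p = rect_prob(0.1, 0.4, 0.2, 0.7)
assert abs(p - 0.3 * 0.5) < 1e-12

# ...and, as the text notes, it is NOT F2(b) - F2(a) in general:
assert abs((F2(0.4, 0.7) - F2(0.1, 0.2)) - p) > 1e-3
```

The same inclusion-exclusion pattern (2ⁿ signed corner values) evaluates the n-fold mixed difference in any dimension.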



The case when

    Fᵏ(xₖ) = { 0,   xₖ < 0,
             { xₖ,  0 ≤ xₖ ≤ 1,
             { 1,   xₖ > 1,

is particularly important. In this case

    Fₙ(x₁, …, xₙ) = x₁ ⋯ xₙ,    0 ≤ xᵢ ≤ 1.

The probability measure corresponding to this n-dimensional distribution function is n-dimensional Lebesgue measure on [0, 1]ⁿ. Many n-dimensional distribution functions appear in the form

    Fₙ(x₁, …, xₙ) = ∫_{−∞}^{x₁} ⋯ ∫_{−∞}^{xₙ} fₙ(t₁, …, tₙ) dt₁ ⋯ dtₙ,

where fₙ(t₁, …, tₙ) is a nonnegative function such that

    ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} fₙ(t₁, …, tₙ) dt₁ ⋯ dtₙ = 1,

and the integrals are Riemann (more generally, Lebesgue) integrals. The function f = fₙ(t₁, …, tₙ) is called the density of the n-dimensional distribution function, the density of the n-dimensional probability distribution, or simply an n-dimensional density.

When n = 1, the function

    f(x) = (1/(σ√(2π))) e^{−(x−m)²/(2σ²)},    x ∈ R,

with σ > 0 is the density of the (nondegenerate) Gaussian or normal distribution. There are natural analogs of this density when n > 1.

Let 𝐑 = ‖rᵢⱼ‖ be a nonnegative definite symmetric n × n matrix:

    Σ_{i,j=1}ⁿ rᵢⱼ λᵢ λⱼ ≥ 0,    λᵢ ∈ R, i = 1, …, n.

When 𝐑 is a positive definite matrix, |𝐑| = det 𝐑 > 0 and consequently there is an inverse matrix A = ‖aᵢⱼ‖. The function

    fₙ(x₁, …, xₙ) = (|A|^{1/2}/(2π)^{n/2}) exp{ −½ Σ_{i,j=1}ⁿ aᵢⱼ (xᵢ − mᵢ)(xⱼ − mⱼ) },    (13)

where mᵢ ∈ R, i = 1, …, n, has the property that its (Riemann) integral over the whole space equals 1 (this will be proved in §13) and therefore, since it is also positive, it is a density. This function is the density of the n-dimensional (nondegenerate) Gaussian or normal distribution (with vector mean m = (m₁, …, mₙ) and covariance matrix 𝐑 = A⁻¹).



Figure 28. Density of the two-dimensional Gaussian distribution.

When n = 2 the density f₂(x₁, x₂) can be put in the form

    f₂(x₁, x₂) = (1/(2πσ₁σ₂√(1−ρ²)))
                 × exp{ −(1/(2(1−ρ²))) [ (x₁−m₁)²/σ₁² − 2ρ(x₁−m₁)(x₂−m₂)/(σ₁σ₂) + (x₂−m₂)²/σ₂² ] },    (14)

where σᵢ > 0, |ρ| < 1. (The meanings of the parameters mᵢ, σᵢ and ρ will be explained in §8.)

Figure 28 indicates the form of the two-dimensional Gaussian density.
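The agreement between (13) and (14) for n = 2 is easy to check numerically. In the Python sketch below, the parameters m₁, m₂, σ₁, σ₂, ρ are illustrative choices (not from the text); the 2 × 2 covariance matrix is inverted by hand and the two expressions for the density are compared at a point:

```python
import math

# Sketch: for n = 2, evaluate the Gaussian density (13) through the
# inverse covariance matrix A and check it against the classical
# two-parameter form (14) in terms of sigma1, sigma2 and rho.

m1, m2 = 1.0, -2.0
s1, s2, rho = 0.8, 1.5, 0.6     # illustrative parameters

# covariance matrix R = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]],
# det R = s1^2 s2^2 (1 - rho^2); inverse A computed by hand for 2x2:
det = (s1 * s2) ** 2 * (1 - rho ** 2)
a11 = s2 ** 2 / det
a22 = s1 ** 2 / det
a12 = -rho * s1 * s2 / det

def density_13(x1, x2):
    """Formula (13): |A|^(1/2) / (2 pi) * exp(-q/2) with q the quadratic form."""
    q = a11 * (x1 - m1) ** 2 + 2 * a12 * (x1 - m1) * (x2 - m2) + a22 * (x2 - m2) ** 2
    return math.sqrt(a11 * a22 - a12 ** 2) / (2 * math.pi) * math.exp(-0.5 * q)

def density_14(x1, x2):
    """Formula (14) in standardized coordinates u, v."""
    u, v = (x1 - m1) / s1, (x2 - m2) / s2
    norm = 1.0 / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))
    return norm * math.exp(-(u * u - 2 * rho * u * v + v * v) / (2 * (1 - rho ** 2)))

assert abs(density_13(0.3, -1.0) - density_14(0.3, -1.0)) < 1e-12
```

The check works because a₁₁(x₁−m₁)² + 2a₁₂(x₁−m₁)(x₂−m₂) + a₂₂(x₂−m₂)² reduces exactly to (u² − 2ρuv + v²)/(1−ρ²) in the standardized variables u, v.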

Remark. As in the case n = 1, Theorem 2 can be generalized to (similarly defined) Lebesgue-Stieltjes measures on (Rⁿ, ℬ(Rⁿ)) and generalized distribution functions on Rⁿ. When the generalized distribution function Gₙ(x₁, …, xₙ) is x₁ ⋯ xₙ, the corresponding measure is Lebesgue measure on the Borel sets of Rⁿ. It clearly satisfies

    λ(a, b] = ∏_{i=1}ⁿ (bᵢ − aᵢ),

i.e. the Lebesgue measure of the "rectangle" is its "content."

4. The measurable space (R^∞, ℬ(R^∞)). For the spaces Rⁿ, n ≥ 1, the probability measures were constructed in the following way: first for the elementary sets (rectangles (a, b]), then, in a natural way, for the sets A = Σ (aᵢ, bᵢ], and finally, by using Carathéodory's theorem, for the sets in ℬ(Rⁿ). A similar construction of probability measures also works for the space (R^∞, ℬ(R^∞)).



Let BE 14(R"), denote a cylinder set in Roo with base BE &I(R"). We see at once that it is natural to take the cylinder sets as elementary sets in R 00 , with their probabilities defined by the probability measure on the sets of &I(R 00 ). Let P be a probability measure on (R 00 , &I(R 00 )). For n = 1, 2, ... , we take BE &I(R").

(15)

The sequence of probability measures P t> P 2 , ••• defined respectively on (R, &I(R)), (R 2 , &I(R 2 )), ••• , has the following evident consistency property: for n = 1, 2, ... and B E &I(R"), (16)

It is noteworthy that the converse also holds.

Theorem 3 (Kolmogorov's Theorem on the Extension of Measures in (R 00 , &B(R 00 ))). Let P 1 , P 2 , ••• be a sequence of probability measures on (R, 14(R)), (R 2 , &B(R 2 )), •• • , possessing the consistency property (16). Then there is a unique probability measure P on (R 00 , &I(R 00 )) such that

BE &I(R").

(17)

for n = 1, 2, .... E &B(R") and let Jn(B") be the cylinder with base B". We assign the measure P(J.(B")) to this cylinder by taking P(J.(B")) = P.(B").

PRooF. Let B"

Let us show that, in virtue of the consistency condition, this definition is consistent, i.e. the value of P(Jn(B")) is independent of the representation of the set Jn(B"). In fact, let the same cylinder be represented in two way: J.(B") = Jn+k(B"+k). It follows that, if (x 1, ... , xn+k)

E

R"+k, we have

(18) and therefore, by (16) and (18), Pn(B") = pn+ l((xb ... 'Xn+ l):(xl, ... 'xn) E B") = ... = pn+k((xl, ... , Xn+k):(xl, ... 'x.) E B") = pn+k(B"+k).

Let 𝒜(R^∞) denote the collection of all cylinder sets B̂ⁿ = 𝒥ₙ(Bⁿ), Bⁿ ∈ ℬ(Rⁿ), n = 1, 2, ….



Now let B̂₁, …, B̂ₖ be disjoint sets in 𝒜(R^∞). We may suppose without loss of generality that B̂ᵢ = 𝒥ₙ(Bᵢⁿ), i = 1, …, k, for some n, where B₁ⁿ, …, Bₖⁿ are disjoint sets in ℬ(Rⁿ). Then

    P(Σᵢ₌₁ᵏ B̂ᵢ) = P(𝒥ₙ(Σᵢ₌₁ᵏ Bᵢⁿ)) = Pₙ(Σᵢ₌₁ᵏ Bᵢⁿ) = Σᵢ₌₁ᵏ Pₙ(Bᵢⁿ) = Σᵢ₌₁ᵏ P(B̂ᵢ),

i.e. the set function P is finitely additive on the algebra 𝒜(R^∞).

Let us show that P is "continuous at zero," i.e. if the sequence of sets B̂ₙ ↓ ∅, n → ∞, then P(B̂ₙ) → 0, n → ∞. Suppose the contrary, i.e. let limₙ P(B̂ₙ) = δ > 0. We may suppose without loss of generality that {B̂ₙ} has the form

    B̂ₙ = {x : (x₁, …, xₙ) ∈ Bⁿ},    Bⁿ ∈ ℬ(Rⁿ).

We use the following property of the probability measures Pₙ on (Rⁿ, ℬ(Rⁿ)) (see Problem 9): if Bⁿ ∈ ℬ(Rⁿ), then for a given δ > 0 we can find a compact set Aⁿ ∈ ℬ(Rⁿ) such that Aⁿ ⊆ Bⁿ and

    Pₙ(Bⁿ \ Aⁿ) ≤ δ/2ⁿ⁺¹.

Therefore if

    Âₙ = {x : (x₁, …, xₙ) ∈ Aⁿ},

we have

    P(B̂ₙ \ Âₙ) = Pₙ(Bⁿ \ Aⁿ) ≤ δ/2ⁿ⁺¹.

Form the set Ĉₙ = ⋂ₖ₌₁ⁿ Âₖ and let Cₙ be such that Ĉₙ = {x : (x₁, …, xₙ) ∈ Cₙ}. Then, since the sets B̂ₙ decrease, we obtain

    P(B̂ₙ \ Ĉₙ) ≤ Σₖ₌₁ⁿ P(B̂ₙ \ Âₖ) ≤ Σₖ₌₁ⁿ P(B̂ₖ \ Âₖ) ≤ δ/2.

But by assumption limₙ P(B̂ₙ) = δ > 0, and therefore limₙ P(Ĉₙ) ≥ δ/2 > 0. Let us show that this contradicts the condition Ĉₙ ↓ ∅.

Let us choose a point x̂⁽ⁿ⁾ = (x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …) in Ĉₙ. Then (x₁⁽ⁿ⁾, …, xₙ⁽ⁿ⁾) ∈ Cₙ for n ≥ 1. Let (n₁) be a subsequence of (n) such that x₁^{(n₁)} → x₁⁰, where x₁⁰ is a point in C₁. (Such a subsequence exists since x₁⁽ⁿ⁾ ∈ C₁ and C₁ is compact.) Then select a subsequence (n₂) of (n₁) such that (x₁^{(n₂)}, x₂^{(n₂)}) → (x₁⁰, x₂⁰) ∈ C₂. Similarly, (x₁^{(nₖ)}, …, xₖ^{(nₖ)}) → (x₁⁰, …, xₖ⁰) ∈ Cₖ. Finally form the diagonal sequence (mₖ), where mₖ is the kth term of (nₖ). Then xᵢ^{(mₖ)} → xᵢ⁰ as mₖ → ∞ for i = 1, 2, …, and (x₁⁰, x₂⁰, …) ∈ Ĉₙ for n = 1, 2, …, which evidently contradicts the assumption that Ĉₙ ↓ ∅, n → ∞. This completes the proof of the theorem.



Remark. In the present case, the space R^∞ is a countable product of lines, R^∞ = R × R × ⋯. It is natural to ask whether Theorem 3 remains true if (R^∞, ℬ(R^∞)) is replaced by a direct product of measurable spaces (Ωᵢ, ℱᵢ), i = 1, 2, ….

We may notice that in the preceding proof the only topological property of the real line that was used was that every set in ℬ(Rⁿ) contains a compact subset whose probability measure is arbitrarily close to the probability measure of the whole set. It is known, however, that this is a property not only of the spaces (Rⁿ, ℬ(Rⁿ)), but also of arbitrary complete separable metric spaces with σ-algebras generated by the open sets. Consequently Theorem 3 remains valid if we suppose that P₁, P₂, … is a sequence of consistent probability measures on

    (Ω₁, ℱ₁), (Ω₁ × Ω₂, ℱ₁ ⊗ ℱ₂), …,

where (Ωᵢ, ℱᵢ) are complete separable metric spaces with σ-algebras generated by the open sets, and (R^∞, ℬ(R^∞)) is replaced by

    (Ω₁ × Ω₂ × ⋯, ℱ₁ ⊗ ℱ₂ ⊗ ⋯).

In §9 (Theorem 2) it will be shown that the result of Theorem 3 remains valid for arbitrary measurable spaces (Ωᵢ, ℱᵢ) if the measures Pₙ are concentrated in a particular way. However, Theorem 3 may fail in the general case (without any hypotheses on the topological nature of the measurable spaces or on the structure of the family of measures {Pₙ}). This is shown by the following example.

Let us consider the space Ω = (0, 1], which is evidently not complete, and construct a sequence ℱ₁ ⊆ ℱ₂ ⊆ ⋯ of σ-algebras in the following way. For n = 1, 2, …, let

    φₙ(ω) = { 1,  0 < ω < 1/n,
            { 0,  1/n ≤ ω ≤ 1,

    𝒞ₙ = {A ⊆ Ω : A = {ω : φₙ(ω) ∈ B}, B ∈ ℬ(R)},

and let ℱₙ = σ{𝒞₁, …, 𝒞ₙ} be the smallest σ-algebra containing the sets 𝒞₁, …, 𝒞ₙ. Clearly ℱ₁ ⊆ ℱ₂ ⊆ ⋯. Let ℱ = σ(⋃ ℱₙ) be the smallest σ-algebra containing all the ℱₙ.

Consider the measurable space (Ω, ℱₙ) and define a probability measure Pₙ on it as follows:

    Pₙ{ω : (φ₁(ω), …, φₙ(ω)) ∈ Bⁿ} = { 1  if (1, …, 1) ∈ Bⁿ,
                                      { 0  otherwise,

where Bⁿ ∈ ℬ(Rⁿ). It is easy to see that the family {Pₙ} is consistent: if A ∈ ℱₙ then Pₙ₊₁(A) = Pₙ(A). However, we claim that there is no probability measure P on (Ω, ℱ) such that its restriction P|ℱₙ (i.e., the measure P



considered only on sets in ℱₙ) coincides with Pₙ for n = 1, 2, …. In fact, let us suppose that such a probability measure P exists. Then

    P{ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = Pₙ{ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = 1    (19)

for n = 1, 2, …. But

    {ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = (0, 1/n) ↓ ∅,

which contradicts (19) and the hypothesis of countable additivity (and therefore continuity at the "zero" ∅) of the set function P.

We now give an example of a probability measure on (R^∞, ℬ(R^∞)). Let F₁(x), F₂(x), … be a sequence of one-dimensional distribution functions. Define the functions G₁(x₁) = F₁(x₁), G₂(x₁, x₂) = F₁(x₁)F₂(x₂), …, and denote the corresponding probability measures on (R, ℬ(R)), (R², ℬ(R²)), … by P₁, P₂, …. Then it follows from Theorem 3 that there is a measure P on (R^∞, ℬ(R^∞)) such that

    P{x ∈ R^∞ : (x₁, …, xₙ) ∈ B} = Pₙ(B),    B ∈ ℬ(Rⁿ),

and, in particular,

    P{x ∈ R^∞ : x₁ ≤ a₁, …, xₙ ≤ aₙ} = F₁(a₁) ⋯ Fₙ(aₙ).

Let us take Fᵢ(x) to be a Bernoulli distribution,

    Fᵢ(x) = { 0,  x < 0,
            { q,  0 ≤ x < 1,
            { 1,  x ≥ 1.

Then we can say that there is a probability measure P on the space Ω of sequences of numbers x = (x₁, x₂, …), xᵢ = 0 or 1, together with the σ-algebra of its Borel subsets, such that

    P{x : x₁ = a₁, …, xₙ = aₙ} = p^{Σ aᵢ} q^{n − Σ aᵢ}.
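A brief numerical sketch of this Bernoulli example (the value of p below is an illustrative choice): the cylinder probabilities p^{Σaᵢ}q^{n−Σaᵢ} form a consistent family in the sense of (16), and each Pₙ has total mass 1.

```python
from itertools import product

# Sketch: cylinder probabilities for i.i.d. Bernoulli(p) coordinates,
# and a check of the Kolmogorov consistency condition (16).

def cylinder_prob(a, p):
    """P{x1 = a[0], ..., xn = a[n-1]} = p^(sum a_i) * q^(n - sum a_i)."""
    q = 1.0 - p
    s = sum(a)
    return p ** s * q ** (len(a) - s)

p = 0.3
a = (1, 0, 1)

# consistency: P_3{a} = P_4{(a, 0)} + P_4{(a, 1)}
assert abs(cylinder_prob(a, p)
           - (cylinder_prob(a + (0,), p) + cylinder_prob(a + (1,), p))) < 1e-15

# total mass of P_n is 1 (sum over all 2^n cylinders of length n)
total = sum(cylinder_prob(b, p) for b in product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12
```

Summing out the last coordinate of Pₙ₊₁ recovers Pₙ, which is exactly the consistency that Theorem 3 requires before extending to a measure on (R^∞, ℬ(R^∞)).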



This is precisely the result that was not available in the first chapter for stating the law of large numbers in the form (1.5.8).

5. The measurable space (R^T, ℬ(R^T)). Let T be a set of indices t ∈ T and Rₜ a real line corresponding to the index t. We consider a finite unordered set τ = [t₁, …, tₙ] of distinct indices tᵢ, tᵢ ∈ T, n ≥ 1, and let P_τ be a probability measure on (R^τ, ℬ(R^τ)), where R^τ = R_{t₁} × ⋯ × R_{tₙ}.

We say that the family {P_τ} of probability measures, where τ runs through all finite unordered sets, is consistent if, for all sets τ = [t₁, …, tₙ] and σ = [s₁, …, sₖ] such that σ ⊆ τ, we have

    P_σ{(x_{s₁}, …, x_{sₖ}) : (x_{s₁}, …, x_{sₖ}) ∈ B} = P_τ{(x_{t₁}, …, x_{tₙ}) : (x_{s₁}, …, x_{sₖ}) ∈ B}    (20)

for every B ∈ ℬ(R^σ).



Theorem 4 (Kolmogorov's Theorem on the Extension of Measures in (R^T, ℬ(R^T))). Let {P_τ} be a consistent family of probability measures on (R^τ, ℬ(R^τ)). Then there is a unique probability measure P on (R^T, ℬ(R^T)) such that

    P(𝒥_τ(B)) = P_τ(B)    (21)

for all unordered sets τ = [t₁, …, tₙ] of distinct indices tᵢ ∈ T, B ∈ ℬ(R^τ), and 𝒥_τ(B) = {x ∈ R^T : (x_{t₁}, …, x_{tₙ}) ∈ B}.

PROOF. Let the set B̂ ∈ ℬ(R^T). By the theorem of §2 there is an at most countable set S = {s₁, s₂, …} ⊆ T such that B̂ = {x : (x_{s₁}, x_{s₂}, …) ∈ B}, where B ∈ ℬ(R^S), R^S = R_{s₁} × R_{s₂} × ⋯. In other words, B̂ = 𝒥_S(B) is a cylinder set with base B ∈ ℬ(R^S).

We can define a set function P on such cylinder sets by putting

    P(𝒥_S(B)) = P_S(B),    (22)

where P_S is the probability measure whose existence is guaranteed by Theorem 3. We claim that P is in fact the measure whose existence is asserted in the theorem. To establish this we first verify that the definition (22) is consistent, i.e. that it leads to a unique value of P(B̂) for all possible representations of B̂; and second, that this set function is countably additive.

Let B̂ = 𝒥_{S₁}(B₁) and B̂ = 𝒥_{S₂}(B₂). It is clear that then B̂ = 𝒥_{S₁∪S₂}(B₃) with some B₃ ∈ ℬ(R^{S₁∪S₂}); therefore it is enough to show that if S ⊆ S′ and B ∈ ℬ(R^S), then P_{S′}(B′) = P_S(B), where

    B′ = {(x_{s′₁}, x_{s′₂}, …) : (x_{s₁}, x_{s₂}, …) ∈ B}

with S′ = {s′₁, s′₂, …}, S = {s₁, s₂, …}. But by the assumed consistency (20) this equation follows immediately from Theorem 3. This establishes that the value of P(B̂) is independent of the representation of B̂.

To verify the countable additivity of P, let us suppose that {B̂ₙ} is a sequence of pairwise disjoint sets in ℬ(R^T). Then there is an at most countable set S ⊆ T such that B̂ₙ = 𝒥_S(Bₙ) for all n ≥ 1, where Bₙ ∈ ℬ(R^S). Since P_S is a probability measure, we have

    P(Σ B̂ₙ) = P(𝒥_S(Σ Bₙ)) = P_S(Σ Bₙ) = Σ P_S(Bₙ) = Σ P(𝒥_S(Bₙ)) = Σ P(B̂ₙ).

Finally, property (21) follows immediately from the way in which P was constructed. This completes the proof.

Remark 1. We emphasize that T is any set of indices. Hence, by the remark after Theorem 3, the present theorem remains valid if we replace the real lines Rₜ by arbitrary complete separable metric spaces Ωₜ (with σ-algebras generated by the open sets).



Remark 2. The original probability measures {P_τ} were assumed defined on unordered sets τ = [t₁, …, tₙ] of distinct indices. It is also possible to start from a family of probability measures {P_τ} where τ runs through all ordered sets τ = (t₁, …, tₙ) of distinct indices. In this case, in order to have Theorem 4 hold we have to adjoin to (20) a further consistency condition:

    P_{(t₁, …, tₙ)}(A_{t₁} × ⋯ × A_{tₙ}) = P_{(t_{i₁}, …, t_{iₙ})}(A_{t_{i₁}} × ⋯ × A_{t_{iₙ}}),    (23)

where (i₁, …, iₙ) is an arbitrary permutation of (1, …, n) and A_{tᵢ} ∈ ℬ(R_{tᵢ}). As a necessary condition for the existence of P this follows from (21) (with P_{t₁,…,tₙ}(B) replaced by P_{(t₁,…,tₙ)}(B)).

As an example, consider the Gaussian density

    φₜ(y|x) = (1/√(2πt)) e^{−(y−x)²/(2t)}

with mean x and variance t, and for each τ = [t₁, …, tₙ], t₁ < t₂ < ⋯ < tₙ, and each set B = I₁ × ⋯ × Iₙ, construct the measure P_τ(B) according to the formula

    P_τ(I₁ × ⋯ × Iₙ) = ∫_{I₁} ⋯ ∫_{Iₙ} φ_{t₁}(a₁|0) φ_{t₂−t₁}(a₂|a₁) ⋯ φ_{tₙ−tₙ₋₁}(aₙ|aₙ₋₁) da₁ ⋯ daₙ    (24)

(integration in the Riemann sense). Now we define the set function P for each cylinder set

    𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) = {x ∈ R^T : x_{t₁} ∈ I₁, …, x_{tₙ} ∈ Iₙ}

by taking

    P(𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ)) = P_τ(I₁ × ⋯ × Iₙ).

The intuitive meaning of this method of assigning a measure to the cylinder set 𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) is as follows. The set 𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) is the set of functions that at times t₁, …, tₙ pass through the "windows" I₁, …, Iₙ (see Figure 24 in §2). We shall interpret

§3. Methods of Introducing Probability Measures on Measurable Spaces

167


φ_{tₖ−tₖ₋₁}(aₖ|aₖ₋₁) as the probability density that a particle, starting at aₖ₋₁ at time tₖ₋₁, arrives at time tₖ in a neighborhood of aₖ. Then the product of densities that appears in (24) describes a certain independence of the increments of the displacements of the moving "particle" in the time intervals [0, t₁], [t₁, t₂], …, [tₙ₋₁, tₙ].

The family of measures {P_τ} constructed in this way is easily seen to be consistent, and therefore it can be extended to a measure on (R^{[0,∞)}, ℬ(R^{[0,∞)})). The measure so obtained plays an important role in probability theory. It was introduced by N. Wiener and is known as Wiener measure.
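The finite-dimensional formula (24) can be illustrated numerically. In the Python sketch below, the times t₁, t₂ and the windows I₁, I₂ are illustrative choices of mine; P_τ(I₁ × I₂) is approximated both by a Riemann sum of the product of Gaussian transition densities and by a Monte Carlo simulation of independent Gaussian increments, and the two estimates agree.

```python
import math, random

# Sketch: approximate the Wiener-measure probability (24) that a path
# is in window I1 at time t1 and in window I2 at time t2.

def phi(t, y, x):
    """Gaussian transition density with mean x and variance t."""
    return math.exp(-(y - x) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

t1, t2 = 1.0, 2.0
I1, I2 = (-1.0, 1.0), (0.0, 2.0)      # illustrative windows

# Midpoint Riemann sum for the double integral in (24)
N = 400
h1 = (I1[1] - I1[0]) / N
h2 = (I2[1] - I2[0]) / N
p_int = 0.0
for i in range(N):
    a1 = I1[0] + (i + 0.5) * h1
    w1 = phi(t1, a1, 0.0) * h1        # density of the first displacement
    for j in range(N):
        a2 = I2[0] + (j + 0.5) * h2
        p_int += w1 * phi(t2 - t1, a2, a1) * h2

# Monte Carlo: x_{t1} and the increment x_{t2} - x_{t1} are independent Gaussians
rng = random.Random(0)
M = 200_000
hits = 0
for _ in range(M):
    x1 = rng.gauss(0.0, math.sqrt(t1))
    x2 = x1 + rng.gauss(0.0, math.sqrt(t2 - t1))
    if I1[0] < x1 <= I1[1] and I2[0] < x2 <= I2[1]:
        hits += 1
p_mc = hits / M

assert abs(p_int - p_mc) < 0.01
```

The agreement of the two estimates reflects exactly the independence of increments that the product of densities in (24) encodes.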

6. PROBLEMS

1. Let F(x) = P(−∞, x]. Verify the following formulas:

    P(a, b] = F(b) − F(a),
    P(a, b) = F(b−) − F(a),
    P[a, b] = F(b) − F(a−),
    P[a, b) = F(b−) − F(a−),
    P{x}   = F(x) − F(x−),

where F(x−) = lim_{y↑x} F(y).

2. Verify formula (7).

3. Prove Theorem 2.

4. Show that a distribution function F = F(x) on R has at most a countable set of points of discontinuity. Does a corresponding result hold for distribution functions on Rⁿ?

5. Show that each of the functions

    G(x, y) = { 1,  x + y ≥ 0,
              { 0,  x + y < 0,

    G(x, y) = [x + y], the integral part of x + y,

is continuous on the right and continuous in each argument, but is not a (generalized) distribution function on R².

6. Let μ be the Lebesgue-Stieltjes measure generated by a continuous distribution function. Show that if the set A is at most countable, then μ(A) = 0.

7. Let c be the cardinal number of the continuum. Show that the cardinal number of the collection of Borel sets in Rⁿ is c, whereas that of the collection of Lebesgue-measurable sets is 2^c.

8. Let (Ω, ℱ, P) be a probability space and 𝒜 an algebra of subsets of Ω such that σ(𝒜) = ℱ. Using the principle of appropriate sets, prove that for every ε > 0 and B ∈ ℱ there is a set A ∈ 𝒜 such that P(A △ B) ≤ ε.



9. Let P be a probability measure on (Rⁿ, ℬ(Rⁿ)). Using Problem 8, show that, for every ε > 0 and B ∈ ℬ(Rⁿ), there is a compact set A ∈ ℬ(Rⁿ) such that A ⊆ B and P(B\A) ≤ ε. (This was used in the proof of Theorem 1.)

10. Verify the consistency of the measure defined by (21).

§4. Random Variables. I

1. Let (Ω, ℱ) be a measurable space and let (R, ℬ(R)) be the real line with the system ℬ(R) of Borel sets.

Definition 1. A real function ξ = ξ(ω) defined on (Ω, ℱ) is an ℱ-measurable function, or a random variable, if

    {ω : ξ(ω) ∈ B} ∈ ℱ    (1)

for every B ∈ ℬ(R); or, equivalently, if the inverse image

    ξ⁻¹(B) ≡ {ω : ξ(ω) ∈ B}

is a measurable set in Ω.

When (Ω, ℱ) = (Rⁿ, ℬ(Rⁿ)), the ℬ(Rⁿ)-measurable functions are called Borel functions.

Borel functions.

The simplest example of a random variable is the indicator IA(w) of an arbitrary (measurable) set A E ff'. A random variable ~ that has a representation ~(w)

=

00

L x;IA;(w),

i= 1

(2)

where L Ai = 0, Ai E ~. is called discrete. If the sum in (2) is finite, the random variable is called simple. With the same interpretation as in §4 of Chapter I, we may say that a random variable is a numerical property of an experiment, with a value depending on "chance." Here the requirement (1) of measurability is fundamental, for the following reason. If a probability measure P is defined on (0, ~), it then makes sense to speak of the probability of the event {~(w) E B} that the value of the random variable belongs to a Borel set B. We introduce the following definitions. Definition 2. A probability measure P~ on (R, BI(R)) with P~(B)

= P{w: ~(w) E B},

BE BI(R),

is called the probability distribution of~ on (R, BI(R)).

169

§4. Random Variables. I

Definition 3. The function F~(x)

= P(w:

::::; x},

~(w)

XER,

is called the distribution function of ~For a discrete random variable the measure P~ is concentrated on an at most countable set and can be represented in the form P ~(B)

I

=

p(xk),

(3)

{k: XkEB)

wherep(xk) = P{~ = xd = M~(xk). The converse is evidently true: If P ~ is represented in the form (3) then ~ is a discrete random variable. A random variable~ is called continuous if its distribution function F~(x) is continuous for x E R. A random variable~ is called absolutely continuous if there is a nonnegative function f = Hx), called its density, such that

F~(x) = fJ~(y) dy,

XER,

(4)

(the integral can be taken in the Riemann sense, or more generally in that of Lebesgue; see §6 below). 2. To establish that a function ~ = ~(w) is a random variable, we have to verify property (1) for all sets BE~- The following lemma shows that the class of such "test" sets can be considerably narrowed. Lemma 1. Let g be a system of sets such that a( g) = §l(R). A necessary and sufficient condition that a function~ = ~(w) is ~-measurable is that

(5)

for all E

E

g_

PRooF. The necessity is evident. To prove the sufficiency we again use the principle of appropriate sets. Let~ be the system of those Borel sets D in f!.6(R) for which C 1(D) E ~­ The operation "form the inverse image" is easily shown to preserve the settheoretic operations of union, intersection and complement:

yB") = yC cl( 0B") 0Cl(B"), C 1(

1

(B"),

=

~ l(B") =

C l(B").

(6)

170

II. Mathematical Foundations of Probability Theory

It follows that f0 is a a-algebra. Therefore

and

a(S)

~

a(f0) = f0

~

PJ(R).

But a{E) = PJ(R) and consequently f0 = PJ(R).

Corollary. A necessary and sufficient condition for variable is that for every x

for every x

E

E

< x}

E

IF

x}

E

IF

{w:

~(w)

{w:

~(w) ~

~

= ~(w) to be a random

R, or that

R.

The proof is immediate, since each of the systems

S 1 = {x: x < c, c E R}, S2 =

{X:

X

~

c, c E R}

generates the a-algebra PJ(R): a{£ 1 ) = a(E 2 ) = f!J(R) (see §2). The following lemma makes it possible to construct random variables as functions of other random variables.

Lemma 2. Let cp = cp(x) be a Bore/function and~= ~(w) a random variable. Then the composition 17 = cp o (, i.e. the function 17(w) = cp(((w)), is also a random variable. The proof follows from the equations {w:1](W)EB} = {w:cp(((w))EB} = {w:~(w)Ecp- 1 (B)}EIF

(7)

for BE PJ(R), since cp- 1(B) E PJ(R). Therefore if~ is a random variable, so are, for examples,~·,~+ = max((, 0), = -min(~' 0), and In since the functions x", X+' X- and IX I are Borel functions (Problem 4).

c

3. Starting from a given collection of random variables {~.},we can construct new functions, for example, 1 I~k I, lim ~., lim ~., etc. Notice that in general such functions take values on the extended real line R = [- oo, oo]. Hence it is advisable to extend the class of IF-measurable functions somewhat by allowing them to take the values ± oo.

Lk'=

171

§4. Random Variables. I

Definition 4. A function ~ = ~(w) defined on (0, §) with values in R = [ -oo, oo] will be called an extended random variable if condition (1) is satisfied for every Borel set BE PJ(R). The following theorem, despite its simplicity, is the key to the construction of the Lebesgue integral (§6).

Theorem 1. ~ = ~(w) (extended ones included) there is a sequence of simple random variables ~ 1 , ~ 2 , .•• , such that I~" I : : ; I~ I and ~n(w)--+ ~(w), n--+ oo,for all wE 0. (b) If also ~(w) ~ 0, there is a sequence of simple random variables ~t> ~ 2 , •.• , such that ~n(w) i ~(w), n--+ oo, for all wE 0.

(a) For every random variable

PRooF. We begin by proving the second statement. For n = 1, 2, ... , put n2" k- 1 ~n(w) = k~1 ~Ik,n(w)

+ nlww>~n)(w),

where Ik,n is the indicator of the set {(k - 1)/2" ::::;; ~(w) < k/2"}. It is easy to verify that the sequence ~"(w) so constructed is such that ~"(w) i ~(w) for all w E 0. The first statement follows from this if we merely observe that ~can be represented in the form~ = ~+ - ~-.This completes the proof of the theorem. We next show that the class of extended random variables is closed under pointwise convergence. For this purpose, we note fir~t that if ~ 1 , ~ 2 , .•. is a sequence of extended random variables, then sup~"' inf ~"'lim~" and lim~" are also random variables (possibly extended). This follows immediately from {w:sup ~n > x} = {w: inf ~n

< x} =

U{w: ~n > x} E§, n U {w: ~n < x} E §, n

and lim ~n = inf sup ~m• n m2:.n

Theorem 2. Let ~(w)

~t> ~ 2 ,

•••

lim ~n

= sup sup ~m. n

m2:.n

be a sequence of extended random variables and

= lim ~n(w). Then ~(w) is also an extended random variable.

The prooffollows immediately from the remark above and the fact that

{w:

~(w)

< x}

= {w: lim ~n(w) < x} = {w: lim ~n(w) =lim ~n(w)} 11 {lim ~n(w) < x} <X} = {lim ~nCw) <X} E §.

= 011 {lim ~n(w)

172

II. Mathematical Foundations of Probability Theory

4. We mention a few more properties of the simplest functions of random variables considered on the measurable space (Q, /F) and possibly taking values on the extended real lineR = [ - oo, oo].t If e and '1 are random variables, e + '1· e - '1· e'1. and e/'1 are also random variables (assuming that they are defined, i.e. that no indeterminate forms like oo - oo, oojoo, a/0 occur. In fact, let {en} and {1'/n} be sequences of random variables converging to e and '1 (see Theorem 1). Then en

± '1n --+ e ± '1.

en '1n --+ en ------~------+ 1 1'/n + -J{.,n=O}(w)

~1'/. e -· '1

n

The functions on the left-hand sides of these relations are simple random variables. Therefore, by Theorem 2, the limit functions e ± 17, e11 and e/rt are also random variables.

e

5. Let be a random variable. Let us consider sets from :F of the form {w: e(w) E B}, BE PA(R). It is easily verified that they form a a-algebra, called the a-algebra generated by and denoted by ~If qJ is a Borel function, it follows from Lemma 2 that the function 17 = qJ o e is also a random variable, and in fact ~-measurable, i.e. such that

e.

{w: 17(w) E B} E ~.

BE PA(R)

(see (7)). It turns out that the converse is also true.

Theorem 3. Let '1 be a :F~-measurable random variable. Then there is a Borel jimction qJ such that '1 = qJ i.e. 1J(W) = qJ(e(w)) for every WE Q. 0

e,

PROOF. Let Cl> be the class of ~-measurable functions 1J = 1J(w) and &~ the class of ~-measurable functions representable in the form qJ 0 where qJ is a Borel function. It is clear that & s; Cl>~. The conclusion of the theorem is that in fact&~ = Cl>~. Let A E ~and 17(w) = IA(w). Let us show that 1J E &~.In fact, if A E ~ there is aBE PA(R) such that A = {w: e(w) E B}. Let

e.

( )_{1,0,

XBX

-

X EB, X¢: B.

Then/A(w) = XB(e(w)) E
L7=

† We shall assume the usual conventions about arithmetic operations in R̄: if a ∈ R then a ± ∞ = ±∞, a/±∞ = 0; a·∞ = ∞ if a > 0, and a·∞ = −∞ if a < 0; 0·(±∞) = 0, ∞ + ∞ = ∞, −∞ − ∞ = −∞.



Now let η be an arbitrary ℱ_ξ-measurable function. By Theorem 1 there is a sequence of simple ℱ_ξ-measurable functions {ηₙ} such that ηₙ(ω) → η(ω), n → ∞, ω ∈ Ω. As we have just shown, there are Borel functions φₙ such that ηₙ(ω) = φₙ(ξ(ω)), and φₙ(ξ(ω)) → η(ω), n → ∞, ω ∈ Ω. Let B = {x ∈ R : limₙ φₙ(x) exists}; this is a Borel set. The function

    φ(x) = { limₙ φₙ(x),  x ∈ B,
           { 0,           x ∉ B,

is also a Borel function (see Problem 7). But then it is evident that η(ω) = limₙ φₙ(ξ(ω)) = φ(ξ(ω)) for all ω ∈ Ω. Consequently Φ̃_ξ = Φ_ξ.
6. Let us consider a probability space (Ω, ℱ, P), where the σ-algebra ℱ is generated by a finite or countably infinite decomposition 𝒟 = {D₁, D₂, …}, Σ Dᵢ = Ω, P(Dᵢ) > 0. Here we shall suppose that the Dᵢ are atoms with respect to P, i.e. if A ⊆ Dᵢ, A ∈ ℱ, then either P(A) = 0 or P(Dᵢ\A) = 0.

Lemma 3. Let ξ be an ℱ-measurable function, where ℱ = σ(𝒟). Then ξ is constant on the atoms of the decomposition, i.e. ξ has the representation

    ξ(ω) = Σ_{k=1}^∞ xₖ I_{Dₖ}(ω)    (P-a.s.).†    (8)

(The notation "ξ = η (P-a.s.)" means that P(ξ ≠ η) = 0.)

PROOF. Let D be an atom of the decomposition with respect to P. Let us show that ξ is constant on D (P-a.s.), i.e. P{D ∩ (ξ ≠ const)} = 0. Let K = sup{x ∈ R : P{D ∩ (ξ < x)} = 0}. Then

    P{D ∩ (ξ < K)} = P[ ⋃_{r<K, r rational} {ω ∈ D : ξ(ω) < r} ] = 0,

since if P{D ∩ (ξ < x)} = 0, then also P{D ∩ (ξ < y)} = 0 for all y ≤ x. Let x > K; then P{D ∩ (ξ < x)} > 0 and therefore P{D ∩ (ξ ≥ x)} = 0, since D is an atom. Therefore

    P{D ∩ (ξ > K)} = P[ ⋃_{r>K, r rational} {ω ∈ D : ξ(ω) ≥ r} ] = 0.

Thus P{D ∩ (ξ > K)} = P{D ∩ (ξ < K)} = 0 and therefore P{D ∩ (ξ ≠ K)} = 0. Then (8) follows in general since Σ Dᵢ = Ω. This completes the proof of the lemma.

† a.s. = almost surely.

7. PROBLEMS

1. Show that the random variable ξ is continuous if and only if P(ξ = x) = 0 for all x ∈ R.

2. If |ξ| is ℱ-measurable, is it true that ξ is also ℱ-measurable?

3. Show that ξ = ξ(ω) is an extended random variable if and only if {ω : ξ(ω) ∈ B} ∈ ℱ for all B ∈ ℬ(R̄).

4. Prove that xⁿ, x⁺ = max(x, 0), x⁻ = −min(x, 0), and |x| = x⁺ + x⁻ are Borel functions.

5. If ξ and η are ℱ-measurable, then {ω : ξ(ω) = η(ω)} ∈ ℱ.

6. Let ξ and η be random variables on (Ω, ℱ), and A ∈ ℱ. Then the function

    ζ(ω) = ξ(ω)·I_A(ω) + η(ω)·I_Ā(ω)

is also a random variable.

7. Let ξ₁, …, ξₙ be random variables and φ(x₁, …, xₙ) a Borel function. Show that φ(ξ₁(ω), …, ξₙ(ω)) is also a random variable.

8. Let ξ and η be random variables, both taking the values 1, 2, …, N. Suppose that ℱ_ξ = ℱ_η. Show that there is a permutation (i₁, i₂, …, i_N) of (1, 2, …, N) such that {ω : ξ = j} = {ω : η = iⱼ} for j = 1, 2, …, N.

§5. Random Elements

1. In addition to random variables, probability theory and its applications involve random objects of more general kinds, for example random points, vectors, functions, processes, fields, sets, measures, etc. In this connection it is desirable to have the concept of a random object of any kind.

Definition 1. Let (Ω, ℱ) and (E, ℰ) be measurable spaces. We say that a function X = X(ω), defined on Ω and taking values in E, is ℱ/ℰ-measurable, or is a random element (with values in E), if

    {ω : X(ω) ∈ B} ∈ ℱ    (1)

for every B ∈ ℰ. Random elements (with values in E) are sometimes called E-valued random variables.

Let us consider some special cases.

If (E, ℰ) = (R, ℬ(R)), the definition of a random element is the same as the definition of a random variable (§4).

Let (E, ℰ) = (Rⁿ, ℬ(Rⁿ)). Then a random element X(ω) is a "random point" in Rⁿ. If πₖ is the projection of Rⁿ on the kth coordinate axis, X(ω) can



be represented in the form

    X(ω) = (ξ₁(ω), …, ξₙ(ω)),    (2)

where ξₖ = πₖ ∘ X. It follows from (1) that ξₖ is an ordinary random variable. In fact, for B ∈ ℬ(R) we have

    {ω : ξₖ(ω) ∈ B} = {ω : ξ₁(ω) ∈ R, …, ξₖ₋₁(ω) ∈ R, ξₖ(ω) ∈ B, ξₖ₊₁(ω) ∈ R, …}
                    = {ω : X(ω) ∈ (R × ⋯ × R × B × R × ⋯ × R)} ∈ ℱ,

since R × ⋯ × R × B × R × ⋯ × R ∈ ℬ(Rⁿ).

Definition 2. An ordered set (η₁(ω), ..., ηₙ(ω)) of random variables is called an n-dimensional random vector.

According to this definition, every random element X(ω) with values in Rⁿ is an n-dimensional random vector. The converse is also true: every random vector X(ω) = (ξ₁(ω), ..., ξₙ(ω)) is a random element in Rⁿ. In fact, if Bₖ ∈ 𝓑(R), k = 1, ..., n, then

{ω: X(ω) ∈ (B₁ × ⋯ × Bₙ)} = ⋂ₖ₌₁ⁿ {ω: ξₖ(ω) ∈ Bₖ} ∈ 𝓕.

But 𝓑(Rⁿ) is the smallest σ-algebra containing the sets B₁ × ⋯ × Bₙ. Consequently we find immediately, by an evident generalization of Lemma 1 of §4, that whenever B ∈ 𝓑(Rⁿ), the set {ω: X(ω) ∈ B} belongs to 𝓕.

Let (E, 𝓔) = (Z, 𝓑(Z)), where Z is the set of complex numbers z = x + iy, x, y ∈ R, and 𝓑(Z) is the smallest σ-algebra containing the sets {z: z = x + iy, a₁ < x ≤ b₁, a₂ < y ≤ b₂}. It follows from the discussion above that a complex-valued random variable Z(ω) can be represented as Z(ω) = X(ω) + iY(ω), where X(ω) and Y(ω) are random variables. Hence we may also call Z(ω) a complex random variable.

Let (E, 𝓔) = (R^T, 𝓑(R^T)), where T is a subset of the real line. In this case every random element X = X(ω) can evidently be represented as X = (ξₜ)ₜ∈T with ξₜ = πₜ ∘ X, and is called a random function with time domain T.
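The correspondence between a random element with values in Rⁿ and its coordinate random variables can be sketched on a small finite sample space. This is a purely illustrative example; the four-point space and the map X below are hypothetical, not taken from the text.

```python
from fractions import Fraction

Omega = [0, 1, 2, 3]                         # outcomes of a toy sample space
P = {w: Fraction(1, 4) for w in Omega}       # uniform probability measure

# A random element X with values in R^2, specified pointwise.
X = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

# Coordinate projections recover ordinary random variables xi_k = pi_k ∘ X.
xi1 = {w: X[w][0] for w in Omega}
xi2 = {w: X[w][1] for w in Omega}

# {w: xi1(w) ∈ B} coincides with {w: X(w) ∈ B × R}, as in the text.
B = {0}
event_via_xi = {w for w in Omega if xi1[w] in B}
event_via_X = {w for w in Omega if X[w][0] in B}
assert event_via_xi == event_via_X
print(sorted(event_via_xi))
```

The same identification of events is what makes measurability of X equivalent to joint measurability of its coordinates.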

Definition 3. Let T be a subset of the real line. A set of random variables X = (ξₜ)ₜ∈T is called a random process† with time domain T.

If T = {1, 2, ...}, we call X = (ξ₁, ξ₂, ...) a random process with discrete time, or a random sequence.

If T = [0, 1], (−∞, ∞), [0, ∞), ..., we call X = (ξₜ)ₜ∈T a random process with continuous time.

† Or stochastic process (Translator).


II. Mathematical Foundations of Probability Theory

It is easy to show, by using the structure of the σ-algebra 𝓑(R^T) (§2), that every random process X = (ξₜ)ₜ∈T (in the sense of Definition 3) is also a random function on the space (R^T, 𝓑(R^T)).

Definition 4. Let X = (ξₜ)ₜ∈T be a random process. For each given ω ∈ Ω the function (ξₜ(ω))ₜ∈T is said to be a realization or a trajectory of the process, corresponding to the outcome ω.

The following definition is a natural generalization of Definition 2 of §4.

Definition 5. Let X = (ξₜ)ₜ∈T be a random process. The probability measure P_X on (R^T, 𝓑(R^T)) defined by

P_X(B) = P{ω: X(ω) ∈ B},  B ∈ 𝓑(R^T),

is called the probability distribution of X. The probabilities

P_{t1,...,tn}(B) = P{ω: (ξ_{t1}, ..., ξ_{tn}) ∈ B}

with t₁ < t₂ < ⋯ < tₙ, tᵢ ∈ T, are called finite-dimensional probabilities (or probability distributions). The functions

F_{t1,...,tn}(x₁, ..., xₙ) = P{ω: ξ_{t1} ≤ x₁, ..., ξ_{tn} ≤ xₙ}

with t₁ < t₂ < ⋯ < tₙ, tᵢ ∈ T, are called finite-dimensional distribution functions.

Let (E, 𝓔) = (C, 𝓑₀(C)), where C is the space of continuous functions x = (xₜ)ₜ∈T on T = [0, 1] and 𝓑₀(C) is the σ-algebra generated by the open sets (§2). We show that every random element X on (C, 𝓑₀(C)) is also a random process with continuous trajectories in the sense of Definition 3. In fact, according to §2 the set A = {x ∈ C: xₜ < a} is open, and hence

{ω: ξₜ(ω) < a} = {ω: X(ω) ∈ A} ∈ 𝓕.

On the other hand, let X = (ξₜ(ω))ₜ∈T be a random process (in the sense of Definition 3) whose trajectories are continuous functions for every ω ∈ Ω. According to (2.14),

{x ∈ C: x ∈ S_ρ(x⁰)} = ⋂_{tₖ} {x ∈ C: |x_{tₖ} − x⁰_{tₖ}| < ρ},

where the tₖ are the rational points of [0, 1]. Therefore

{ω: X(ω) ∈ S_ρ(x⁰)} = ⋂_{tₖ} {ω: |ξ_{tₖ}(ω) − x⁰_{tₖ}| < ρ} ∈ 𝓕,

and therefore we also have {ω: X(ω) ∈ B} ∈ 𝓕 for every B ∈ 𝓑₀(C).

Similar reasoning will show that every random element of the space (D, 𝓑₀(D)) can be considered as a random process with trajectories in the space of functions with no discontinuities of the second kind; and conversely.


2. Let (Ω, 𝓕, P) be a probability space and (E_α, 𝓔_α) measurable spaces, where α belongs to an (arbitrary) set 𝔄.

Definition 6. We say that the 𝓕/𝓔_α-measurable functions (X_α(ω)), α ∈ 𝔄, are independent (or collectively independent) if, for every finite set of indices α₁, ..., αₙ, the random elements X_{α1}, ..., X_{αn} are independent, i.e.

P(X_{α1} ∈ B_{α1}, ..., X_{αn} ∈ B_{αn}) = P(X_{α1} ∈ B_{α1}) ⋯ P(X_{αn} ∈ B_{αn}),  (3)

where B_α ∈ 𝓔_α.

Let 𝔄 = {1, 2, ..., n}, let ξ_α, α ∈ 𝔄, be random variables, and let

F_ξ(x₁, ..., xₙ) = P(ξ₁ ≤ x₁, ..., ξₙ ≤ xₙ)

be the n-dimensional distribution function of the random vector ξ = (ξ₁, ..., ξₙ). Let F_{ξᵢ}(xᵢ) be the distribution functions of the random variables ξᵢ, i = 1, ..., n.

Theorem. A necessary and sufficient condition for the random variables ξ₁, ..., ξₙ to be independent is that

F_ξ(x₁, ..., xₙ) = F_{ξ1}(x₁) ⋯ F_{ξn}(xₙ)  (4)

for all (x₁, ..., xₙ) ∈ Rⁿ.

Proof. The necessity is evident. To prove the sufficiency we put a = (a₁, ..., aₙ), b = (b₁, ..., bₙ),

P_ξ(a, b] = P{ω: a₁ < ξ₁ ≤ b₁, ..., aₙ < ξₙ ≤ bₙ},  P_{ξᵢ}(aᵢ, bᵢ] = P{aᵢ < ξᵢ ≤ bᵢ}.

Then

P_ξ(a, b] = ∏ᵢ₌₁ⁿ [F_{ξᵢ}(bᵢ) − F_{ξᵢ}(aᵢ)] = ∏ᵢ₌₁ⁿ P_{ξᵢ}(aᵢ, bᵢ]

by (4) and (3.7), and therefore

P{ξ₁ ∈ I₁, ..., ξₙ ∈ Iₙ} = ∏ᵢ₌₁ⁿ P{ξᵢ ∈ Iᵢ},  (5)

where Iᵢ = (aᵢ, bᵢ]. We fix I₂, ..., Iₙ and show that

P{ξ₁ ∈ B₁, ξ₂ ∈ I₂, ..., ξₙ ∈ Iₙ} = P{ξ₁ ∈ B₁} ∏ᵢ₌₂ⁿ P{ξᵢ ∈ Iᵢ}  (6)

for all B₁ ∈ 𝓑(R). Let 𝓜 be the collection of sets in 𝓑(R) for which (6) holds. Then 𝓜 evidently contains the algebra 𝓐 of sets consisting of sums of disjoint intervals of the form I₁ = (a₁, b₁]. Hence 𝓐 ⊆ 𝓜 ⊆ 𝓑(R). From


the countable additivity (and therefore continuity) of probability measures it also follows that 𝓜 is a monotonic class. Therefore (see Subsection 1 of §2)

μ(𝓐) ⊆ 𝓜 ⊆ 𝓑(R).

But μ(𝓐) = σ(𝓐) = 𝓑(R) by Theorem 1 of §2. Therefore 𝓜 = 𝓑(R).

Thus (6) is established. Now fix B₁, I₃, ..., Iₙ; by the same method we can establish (6) with I₂ replaced by the Borel set B₂. Continuing in this way, we can evidently arrive at the required equation,

P{ξ₁ ∈ B₁, ..., ξₙ ∈ Bₙ} = ∏ᵢ₌₁ⁿ P{ξᵢ ∈ Bᵢ},

where Bᵢ ∈ 𝓑(R). This completes the proof of the theorem.
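The factorization (4) of the joint distribution function can be checked exactly on a small product space. The following sketch is purely illustrative; the two coordinate distributions below are hypothetical choices, not data from the text.

```python
from fractions import Fraction
from itertools import product

# Two independent coordinates built on a product space.
vals1, p1 = [0, 1], {0: Fraction(1, 2), 1: Fraction(1, 2)}
vals2, p2 = [1, 2, 3], {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Product measure P(w1, w2) = p1(w1) p2(w2) makes the coordinates independent.
P = {(w1, w2): p1[w1] * p2[w2] for w1, w2 in product(vals1, vals2)}

def F_joint(x1, x2):
    return sum(p for (w1, w2), p in P.items() if w1 <= x1 and w2 <= x2)

def F1(x): return sum(p1[w] for w in vals1 if w <= x)
def F2(x): return sum(p2[w] for w in vals2 if w <= x)

# (4): the joint distribution function factorizes at every grid point.
for x1 in [-1, 0, 1]:
    for x2 in [0, 1, 2, 3]:
        assert F_joint(x1, x2) == F1(x1) * F2(x2)
print("F factorizes on the grid")
```

Conversely, the theorem says that this pointwise factorization already forces the factorization over all Borel rectangles.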

3. Problems

1. Let ξ₁, ..., ξₙ be discrete random variables. Show that they are independent if and only if

P(ξ₁ = x₁, ..., ξₙ = xₙ) = ∏ᵢ₌₁ⁿ P(ξᵢ = xᵢ)

for all real x₁, ..., xₙ.

2. Carry out the proof that every random function (in the sense of Definition 1) is a random process (in the sense of Definition 3), and conversely.

3. Let X₁, ..., Xₙ be random elements with values in (E₁, 𝓔₁), ..., (Eₙ, 𝓔ₙ), respectively. In addition let (E′₁, 𝓔′₁), ..., (E′ₙ, 𝓔′ₙ) be measurable spaces and let g₁, ..., gₙ be 𝓔₁/𝓔′₁-, ..., 𝓔ₙ/𝓔′ₙ-measurable functions, respectively. Show that if X₁, ..., Xₙ are independent, the random elements g₁ ∘ X₁, ..., gₙ ∘ Xₙ are also independent.

§6. Lebesgue Integral. Expectation

1. When (Ω, 𝓕, P) is a finite probability space and ξ = ξ(ω) is a simple random variable,

ξ(ω) = Σₖ₌₁ⁿ xₖ I_{Aₖ}(ω),  (1)

the expectation Eξ was defined in §4 of Chapter I. The same definition of the expectation of a simple random variable ξ can be used for any probability space (Ω, 𝓕, P). That is, we define

Eξ = Σₖ₌₁ⁿ xₖ P(Aₖ).  (2)
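Formula (2) is a finite sum and is easy to evaluate exactly. The following sketch does so for a hypothetical simple random variable on a six-point space (the partition and values are illustrative only) and also checks that evaluating ξ pointwise gives the same number, so that (2) does not depend on the particular representation (1).

```python
from fractions import Fraction

# A six-point space with uniform measure (hypothetical example).
Omega = range(6)
P = {w: Fraction(1, 6) for w in Omega}

# Simple random variable: partition A_k -> value x_k.
partition = {(0, 1, 2): 10, (3, 4): -2, (5,): 7}

# (2): E xi = sum_k x_k P(A_k)
E_by_formula = sum(x * sum(P[w] for w in A) for A, x in partition.items())

# Pointwise evaluation gives the same expectation.
xi = {w: x for A, x in partition.items() for w in A}
E_pointwise = sum(xi[w] * P[w] for w in Omega)
assert E_by_formula == E_pointwise
print(E_by_formula)   # 11/2
```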


This definition is consistent (in the sense that Eξ is independent of the particular representation of ξ in the form (1)), as can be shown just as for finite probability spaces. The simplest properties of the expectation can be established similarly (see Subsection 5 of §4 of Chapter I).

In the present section we shall define and study the properties of the expectation Eξ of an arbitrary random variable. In the language of analysis, Eξ is merely the Lebesgue integral of the 𝓕-measurable function ξ = ξ(ω) with respect to the measure P. In addition to Eξ we shall use the notation ∫_Ω ξ(ω)P(dω) or ∫_Ω ξ dP.

Let ξ = ξ(ω) be a nonnegative random variable. We construct a sequence of simple nonnegative random variables {ξₙ}ₙ≥₁ such that ξₙ(ω) ↑ ξ(ω), n → ∞, for each ω ∈ Ω (see Theorem 1 in §4). Since Eξₙ ≤ Eξₙ₊₁ (cf. Property 3) of Subsection 5, §4, Chapter I), the limit limₙ Eξₙ exists, possibly with the value +∞.

Definition 1. The Lebesgue integral of the nonnegative random variable ξ = ξ(ω), or its expectation, is

Eξ = limₙ Eξₙ.  (3)

To see that this definition is consistent, we need to show that the limit is independent of the choice of the approximating sequence {ξₙ}. In other words, we need to show that if ξₙ ↑ ξ and ηₘ ↑ ξ, where {ηₘ} is a sequence of simple functions, then

limₙ Eξₙ = limₘ Eηₘ.  (4)

Lemma 1. Let η and the ξₙ be simple random variables, n ≥ 1, and ξₙ ↑ ξ ≥ η. Then

limₙ Eξₙ ≥ Eη.  (5)

Proof. Let ε > 0 and

Aₙ = {ω: ξₙ ≥ η − ε}.

It is clear that Aₙ ↑ Ω and

ξₙ = ξₙ I_{Aₙ} + ξₙ I_{Āₙ} ≥ ξₙ I_{Aₙ} ≥ (η − ε) I_{Aₙ}.

Hence by the properties of the expectations of simple random variables we find that

Eξₙ ≥ E(η − ε)I_{Aₙ} = E η I_{Aₙ} − εP(Aₙ) = Eη − E η I_{Āₙ} − εP(Aₙ) ≥ Eη − C·P(Āₙ) − ε,


where C = max_ω η(ω). Since ε is arbitrary, the required inequality (5) follows.

It follows from this lemma that limₙ Eξₙ ≥ limₘ Eηₘ, and by symmetry limₘ Eηₘ ≥ limₙ Eξₙ, which proves (4).

The following remark is often useful.

Remark 1. The expectation Eξ of the nonnegative random variable ξ satisfies

Eξ = sup {Es: s ∈ S, s ≤ ξ},  (6)

where S = {s} is the set of simple random variables (Problem 1).

Thus the expectation is well defined for nonnegative random variables. We now consider the general case.

Let ξ be a random variable and ξ⁺ = max(ξ, 0), ξ⁻ = −min(ξ, 0).

Definition 2. We say that the expectation Eξ of the random variable ξ exists, or is defined, if at least one of Eξ⁺ and Eξ⁻ is finite:

min(Eξ⁺, Eξ⁻) < ∞.

In this case we define

Eξ = Eξ⁺ − Eξ⁻.

The expectation Eξ is also called the Lebesgue integral (of the function ξ with respect to the probability measure P).

Definition 3. We say that the expectation of ξ is finite if Eξ⁺ < ∞ and Eξ⁻ < ∞.

Since |ξ| = ξ⁺ + ξ⁻, the finiteness of Eξ, i.e. |Eξ| < ∞, is equivalent to E|ξ| < ∞. (In this sense one says that the Lebesgue integral is absolutely convergent.)

Remark 2. In addition to the expectation Eξ, significant numerical characteristics of a random variable ξ are the numbers Eξʳ (if defined) and E|ξ|ʳ, r > 0, which are known as the moment of order r (or rth moment) and the absolute moment of order r (or absolute rth moment) of ξ.

Remark 3. In the definition of the Lebesgue integral ∫_Ω ξ(ω)P(dω) given above, we supposed that P was a probability measure (P(Ω) = 1) and that the 𝓕-measurable functions (random variables) ξ had values in R = (−∞, ∞). Suppose now that μ is any measure defined on a measurable space (Ω, 𝓕), possibly taking the value +∞, and that ξ = ξ(ω) is an 𝓕-measurable function with values in R̄ = [−∞, ∞] (an extended random variable). In this case the Lebesgue integral ∫_Ω ξ(ω)μ(dω) is defined in the


same way: first, for nonnegative simple ξ (by (2) with P replaced by μ), then for arbitrary nonnegative ξ, and in general by the formula

∫_Ω ξ(ω)μ(dω) = ∫_Ω ξ⁺(ω)μ(dω) − ∫_Ω ξ⁻(ω)μ(dω),

provided that no indeterminacy of the form ∞ − ∞ arises.

A case that is particularly important for mathematical analysis is that in which (Ω, 𝓕) = (R, 𝓑(R)) and μ is Lebesgue measure. In this case the integral ∫_R ξ(x)μ(dx) is written ∫_R ξ(x) dx, or ∫_{−∞}^{∞} ξ(x) dx, or (L) ∫_{−∞}^{∞} ξ(x) dx to emphasize its difference from the Riemann integral (R) ∫_{−∞}^{∞} ξ(x) dx. If the measure μ (Lebesgue–Stieltjes) corresponds to a generalized distribution function G = G(x), the integral ∫_R ξ(x)μ(dx) is also called a Lebesgue–Stieltjes integral and is denoted by (L–S) ∫_R ξ(x)G(dx), a notation that distinguishes it from the corresponding Riemann–Stieltjes integral

(R–S) ∫_R ξ(x)G(dx)

(see Subsection 10 below).

It will be clear from what follows (Property D) that if Eξ is defined then so is the expectation E(ξ I_A) for every A ∈ 𝓕. The notations E(ξ; A) or ∫_A ξ dP are often used for E(ξ I_A) or its equivalent, ∫_Ω ξ I_A dP. The integral ∫_A ξ dP is called the Lebesgue integral of ξ with respect to P over the set A.

Similarly, we write ∫_A ξ dμ instead of ∫_Ω ξ·I_A dμ for an arbitrary measure μ. In particular, if μ is an n-dimensional Lebesgue–Stieltjes measure and A = (a₁, b₁] × ⋯ × (aₙ, bₙ], we use the notation

∫_{a₁}^{b₁} ⋯ ∫_{aₙ}^{bₙ} ξ(x₁, ..., xₙ) μ(dx₁, ..., dxₙ)

instead of ∫_A ξ dμ. If μ is Lebesgue measure, we write simply dx₁ ⋯ dxₙ instead of μ(dx₁, ..., dxₙ).

2. Properties of the expectation Eξ of the random variable ξ.

A. Let c be a constant and let Eξ exist. Then E(cξ) exists and

E(cξ) = cEξ.

B. Let ξ ≤ η; then

Eξ ≤ Eη,

with the understanding that

if −∞ < Eξ, then −∞ < Eη and Eξ ≤ Eη,

or

if Eη < ∞, then Eξ < ∞ and Eξ ≤ Eη.

C. If Eξ exists, then |Eξ| ≤ E|ξ|.

D. If Eξ exists, then E(ξ I_A) exists for each A ∈ 𝓕; if Eξ is finite, E(ξ I_A) is finite.

E. If ξ and η are nonnegative random variables, or such that E|ξ| < ∞ and E|η| < ∞, then

E(ξ + η) = Eξ + Eη.

(See Problem 2 for a generalization.)

Let us establish A–E.

A. This is obvious for simple random variables. Let ξ ≥ 0, ξₙ ↑ ξ, where the ξₙ are simple random variables and c ≥ 0. Then cξₙ ↑ cξ, and therefore

E(cξ) = lim E(cξₙ) = lim cEξₙ = cEξ.

In the general case we need to use the representation ξ = ξ⁺ − ξ⁻ and notice that (cξ)⁺ = cξ⁺ and (cξ)⁻ = cξ⁻ when c ≥ 0, whereas when c < 0, (cξ)⁺ = −cξ⁻ and (cξ)⁻ = −cξ⁺.

B. If 0 ≤ ξ ≤ η, then Eξ and Eη are defined, and the inequality Eξ ≤ Eη follows directly from (6). Now let Eξ > −∞; then Eξ⁻ < ∞. If ξ ≤ η, we have ξ⁺ ≤ η⁺ and ξ⁻ ≥ η⁻. Therefore Eη⁻ ≤ Eξ⁻ < ∞; consequently Eη is defined and

Eξ = Eξ⁺ − Eξ⁻ ≤ Eη⁺ − Eη⁻ = Eη.

The case when Eη < ∞ can be discussed similarly.

C. Since −|ξ| ≤ ξ ≤ |ξ|, Properties A and B imply −E|ξ| ≤ Eξ ≤ E|ξ|, i.e. |Eξ| ≤ E|ξ|.

D. This follows from B and the relations (ξ I_A)⁺ = ξ⁺ I_A ≤ ξ⁺, (ξ I_A)⁻ = ξ⁻ I_A ≤ ξ⁻.

E. Let ξ ≥ 0, η ≥ 0, and let {ξₙ} and {ηₙ} be sequences of simple functions such that ξₙ ↑ ξ and ηₙ ↑ η. Then E(ξₙ + ηₙ) = Eξₙ + Eηₙ and

E(ξₙ + ηₙ) ↑ E(ξ + η),  Eξₙ ↑ Eξ,  Eηₙ ↑ Eη,

and therefore E(ξ + η) = Eξ + Eη. The case when E|ξ| < ∞ and E|η| < ∞ reduces to this if we use the facts that ξ = ξ⁺ − ξ⁻, η = η⁺ − η⁻ and (ξ + η)⁺ + ξ⁻ + η⁻ = (ξ + η)⁻ + ξ⁺ + η⁺.


The following group of statements about expectations involves the notion of "P-almost surely." We say that a property holds "P-almost surely" if there is a set 𝒩 ∈ 𝓕 with P(𝒩) = 0 such that the property holds for every point ω of Ω∖𝒩. Instead of "P-almost surely" we often say "P-almost everywhere" or simply "almost surely" (a.s.) or "almost everywhere" (a.e.).

F. If ξ = 0 (a.s.), then Eξ = 0.

In fact, if ξ is a simple random variable, ξ = Σ xₖ I_{Aₖ}(ω), and xₖ ≠ 0, we have P(Aₖ) = 0 by hypothesis, and therefore Eξ = 0. If ξ ≥ 0 and 0 ≤ s ≤ ξ, where s is a simple random variable, then s = 0 (a.s.), and consequently Es = 0 and Eξ = sup{Es: s ∈ S, s ≤ ξ} = 0. The general case follows from this by means of the representation ξ = ξ⁺ − ξ⁻ and the facts that ξ⁺ ≤ |ξ|, ξ⁻ ≤ |ξ| and |ξ| = 0 (a.s.).

G. If ξ = η (a.s.) and E|ξ| < ∞, then E|η| < ∞ and Eξ = Eη (see also Problem 3).

In fact, let 𝒩 = {ω: ξ ≠ η}. Then P(𝒩) = 0 and ξ = ξ I_𝒩 + ξ I_𝒩̄, η = η I_𝒩 + η I_𝒩̄ = η I_𝒩 + ξ I_𝒩̄. By Properties E and F, we have Eξ = E ξ I_𝒩 + E ξ I_𝒩̄ = E ξ I_𝒩̄. But E η I_𝒩 = 0, and therefore Eη = E η I_𝒩 + E ξ I_𝒩̄ = Eξ, by Property E.

H. Let ξ ≥ 0 and Eξ = 0. Then ξ = 0 (a.s.).

For the proof, let A = {ω: ξ(ω) > 0}, Aₙ = {ω: ξ(ω) ≥ 1/n}. It is clear that Aₙ ↑ A and 0 ≤ ξ·I_{Aₙ} ≤ ξ·I_A. Hence, by Property B,

0 ≤ E ξ I_{Aₙ} ≤ Eξ = 0.

Consequently

0 = E ξ I_{Aₙ} ≥ (1/n) P(Aₙ),

and therefore P(Aₙ) = 0 for all n ≥ 1. But P(A) = lim P(Aₙ), and therefore P(A) = 0.

I. Let ξ and η be such that E|ξ| < ∞, E|η| < ∞ and E(ξ I_A) ≤ E(η I_A) for all A ∈ 𝓕. Then ξ ≤ η (a.s.).

In fact, let B = {ω: ξ(ω) > η(ω)}. Then E(η I_B) ≤ E(ξ I_B) ≤ E(η I_B), and therefore E(ξ I_B) = E(η I_B). By Property E, we have E((ξ − η) I_B) = 0, and by Property H we have (ξ − η) I_B = 0 (a.s.), whence P(B) = 0.

J. Let ξ be an extended random variable and E|ξ| < ∞. Then |ξ| < ∞ (a.s.).

In fact, let A = {ω: |ξ(ω)| = ∞} and suppose P(A) > 0. Then E|ξ| ≥ E(|ξ| I_A) = ∞·P(A) = ∞, which contradicts the hypothesis E|ξ| < ∞. (See also Problem 4.)

3. Here we consider the fundamental theorems on taking limits under the expectation sign (or the Lebesgue integral sign).


Theorem 1 (On Monotone Convergence). Let η, ξ, ξ₁, ξ₂, ... be random variables.

(a) If ξₙ ≥ η for all n ≥ 1, Eη > −∞, and ξₙ ↑ ξ, then Eξₙ ↑ Eξ.
(b) If ξₙ ≤ η for all n ≥ 1, Eη < ∞, and ξₙ ↓ ξ, then Eξₙ ↓ Eξ.

Proof. (a) First suppose that η ≥ 0. For each k ≥ 1 let {ξₖ⁽ⁿ⁾}ₙ≥₁ be a sequence of simple functions such that ξₖ⁽ⁿ⁾ ↑ ξₖ, n → ∞. Put ζ⁽ⁿ⁾ = max_{1≤k≤n} ξₖ⁽ⁿ⁾. Then

ζ⁽ⁿ⁻¹⁾ ≤ ζ⁽ⁿ⁾ = max_{1≤k≤n} ξₖ⁽ⁿ⁾ ≤ max_{1≤k≤n} ξₖ = ξₙ.

Let ζ = limₙ ζ⁽ⁿ⁾. Since

ξₖ⁽ⁿ⁾ ≤ ζ⁽ⁿ⁾ ≤ ξₙ

for 1 ≤ k ≤ n, we find by taking limits as n → ∞ that

ξₖ ≤ ζ ≤ ξ

for every k ≥ 1, and therefore ξ = ζ. The random variables ζ⁽ⁿ⁾ are simple and ζ⁽ⁿ⁾ ↑ ζ. Therefore

Eξ = Eζ = lim E ζ⁽ⁿ⁾ ≤ lim Eξₙ.

On the other hand, it is obvious, since ξₙ ≤ ξₙ₊₁ ≤ ξ, that

lim Eξₙ ≤ Eξ.

Consequently lim Eξₙ = Eξ.

Now let η be any random variable with Eη > −∞. If Eη = ∞, then Eξₙ = Eξ = ∞ by Property B, and our proposition is proved. Let Eη < ∞. Then together with Eη > −∞ we find E|η| < ∞. It is clear that 0 ≤ ξₙ − η ↑ ξ − η for all ω ∈ Ω. Therefore, by what has been established, E(ξₙ − η) ↑ E(ξ − η), and therefore (by Property E and Problem 2)

Eξₙ − Eη ↑ Eξ − Eη.

But E|η| < ∞, and therefore Eξₙ ↑ Eξ, n → ∞.

The proof of (b) follows from (a) if we replace the original variables by their negatives.
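Monotone convergence can be watched numerically. The sketch below is illustrative only: it builds a hypothetical geometric-type distribution on a finite set of atoms, truncates the variable at level n, and checks that the expectations increase to the full expectation.

```python
from fractions import Fraction

# P(w = k) = 2^{-k}, k = 1, ..., 60, with the remaining tail folded into the last atom.
N = 60
P = {k: Fraction(1, 2**k) for k in range(1, N + 1)}
P[N] += 1 - sum(P.values())          # make P a probability measure

xi = {k: k for k in P}               # nonnegative random variable xi(w) = w

def E(f):                            # exact expectation on this finite space
    return sum(f[k] * P[k] for k in P)

# xi_n = min(xi, n) is simple, nondecreasing in n, and xi_n ↑ xi.
prev = Fraction(0)
for n in [1, 2, 5, 10, N]:
    xin = {k: min(v, n) for k, v in xi.items()}
    assert E(xin) >= prev            # E xi_n is nondecreasing
    prev = E(xin)
assert prev == E(xi)                 # the limit of E xi_n is E xi
print(E(xi))
```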

Corollary. Let {ηₙ}ₙ≥₁ be a sequence of nonnegative random variables. Then

E Σₙ₌₁^∞ ηₙ = Σₙ₌₁^∞ Eηₙ.


The proof follows from Property E (see also Problem 2), the monotone convergence theorem, and the remark that

Σₙ₌₁^k ηₙ ↑ Σₙ₌₁^∞ ηₙ,  k → ∞.

Theorem 2 (Fatou's Lemma). Let η, ξ₁, ξ₂, ... be random variables.

(a) If ξₙ ≥ η for all n ≥ 1 and Eη > −∞, then

E lim inf ξₙ ≤ lim inf Eξₙ.

(b) If ξₙ ≤ η for all n ≥ 1 and Eη < ∞, then

lim sup Eξₙ ≤ E lim sup ξₙ.

(c) If |ξₙ| ≤ η for all n ≥ 1 and Eη < ∞, then

E lim inf ξₙ ≤ lim inf Eξₙ ≤ lim sup Eξₙ ≤ E lim sup ξₙ.  (7)

Proof. (a) Let ζₙ = inf_{m≥n} ξₘ; then

lim inf ξₙ = limₙ inf_{m≥n} ξₘ = limₙ ζₙ.

It is clear that ζₙ ↑ lim inf ξₙ and ζₙ ≥ η for all n ≥ 1. Then by Theorem 1

E lim inf ξₙ = E limₙ ζₙ = limₙ Eζₙ = lim infₙ Eζₙ ≤ lim inf Eξₙ,

which establishes (a). The second conclusion follows from the first. The third is a corollary of the first two.

and (9) as n--.. oo. PROOF.

llm ~~~ =

Formula (7) is valid by Fatou's lemma. By hypothesis, lim~~~= ~ (a.s.). Therefore by Property G, E lim ~~~ = limE~ .. = limE~ .. = E lim ~~~ = E~,

which establishes (8). It is also clear that I~ I ~ tl· Hence EI~ I < oo. Conclusion (9) can be proved in the same way if we observe that

1~..

-" ~ 21].

186

II. Mathematical Foundations of Probability Theory

Corollary Let fl, e. e1• ••• be random variables such that Ien I :=:;; f/, en -+ e (a.s.) and E17P < oo for some p > 0. Then E 1e1P < oo and E1e- eniP-+ 0, n-+ oo. 0

For the proof, it is sufficient to observe that 1e1 :=:;; fl, I~- ~niP:=:;;
The condition "I en I :=:;; f/, Ef/ < 00" that appears in Fatou's lemma and the dominated convergence theorem, and ensures the validity of formulas (7)-(9), can be somewhat weakened. In order to be able to state the corresponding result (Theorem 4), we introduce the following definition. Definition 4. A family integrable if sup 11

{en} 11 ~ 1

f

J{i~nl>c)

of random variables is said to be uniformly c-+

le,IP(dw)-+0,

00,

(10)

or, in a different notation, sup E[ Ie~~l Iu~"' >c}J -+ 0,

c-+

00.

(11)

II

It is clear that if ell' n ;;::: 1, satisfy Iell I :=:; f/, Ef/ < oo, then the family {ell} 11 ~ 1 is uniformly integrable. Theorem 4. Let

Then

{~ 11 }n~ 1

be a uniformly integrable family ofrandom variables.

(a) E lim e11 :=:;; lim Ee11 :=:;; ITiii Ee, :=:; E ITiii eli. (b) If in addition 11 -+ (a.s.) then~ is integrable and

e e

Ee,-+ Ee, Ele~~-el-+0,

n-+ oo, n-+oo.

PRooF. (a) For every c > 0 (12)

By uniform integrability, for every e > 0 we can take c so large that sup IE[e~~I 1~"<-cj]l <e.

(13)

II

By Fatou's lemma, lim E[e,I 1~"~ -c}J ;;::: E[lim ~nl{~n~ -c}J.

But

~,If~"~ -cl

;;:::

e, and therefore

lim E[e,J!~"~-clJ;;::: E[lim e,].

(14)

187

§6. Lebesgue Integral. Expectation

From (12)-(14) we obtain

Since e > 0 is arbitrary, it follows that lim E~n 2: E lim ~n· The inequality with upper limits, 1lm E~n :::;; E lim~"' is proved similarly. Conclusion (b) can be deduced from (a) as in Theorem 3. The deeper significance of the concept of uniform integrability is revealed by the following theorem, which gives a necessary and sufficient condition for taking limits under the expectation sign. ~n-> ~and E~n < oo. Then E~n-> E~ < oo the family {~n}n> 1 is uniformly integrable.

Theorem 5. Let 0 :::;;

if and only if

PROOF. The sufficiency follows from conclusion (b) of Theorem 4. For the proof of the necessity we consider the (at most countable) set

A

= {a:

P(~

= a) > 0}.

Then we have ~n/{~" Or~
is uniformly integrable. Hence, by the sufficiency part of the theorem, we have E~nlr~" E~/~~"
Take an e rel="nofollow"> 0 and choose a 0 N 0 so large that

E

A,

n-> oo.

(15)

A so large that E ~/ (~ 2: ao) < ej2; then choose

E~nlr~"2:aoJ :::;; E~lr~~aoJ

+ e/2

for all n 2: N 0 , and consequently E~n/{~"~ao}:::;; e. Then choose a 1 2: a0 so large that E~r~"2:a!}::::;; e for all n:::;; N 0 • Then we have supE~nlr~"~ad :::;; n

e,

which establishes the uniform integrability of the family {~n} n 2: 1 of random variables.

4. Let us notice some tests for uniform integrability. We first observe that if {~n} is a family of uniformly integrable random variables, then sup EI~n I < oo. n

(16)

188

II. Mathematical Foundations of Probability Theory

In fact, for a given ~> > 0 and sufficiently large c > 0 sup E l~nl =sup [E(I~n IIu~nl~cl) n

n

s

supE(I~n IImnl~c}} n

+ E(l~n IIu~nl
which establishes (16). It turns out that (16) together with a condition of uniform continuity is necessary and sufficient for uniform integrability.

Lemma 2. A necessary and sufficient condition for a family {~n}n~ 1 of random variables to be uniformly integrable is that E I~n I, n ;;::: 1, are uniformly bounded (i.e., (16) holds) and that E{ I~n IIA},n ;;::: 1, are uniformly absolutely continuous (i.e. sup E{I ~n II A} --+ 0 when P(A) --+ 0). PROOF.

Necessity. Condition (16) was verified above. Moreover, E{l~niiA} = E{l~niiArl{l~nl~c}} + E{l~niiAn{i~nl
Take c so large that supn E{I~II 11 ~"I"cJ}

s

~>/2.

sup E{I ~n II A} S

Then if P(A)

s

~>/2c,

(17) we have

I>

by (17). This establishes the uniform absolute continuity. Sufficiency. Let 1> > 0 and b > 0 be chosen so that P(A) < b implies that E (I ~n II A) $ e, uniformly in n. Since El~nl ~ EJ~nliu~nl~cl ~ cP{l~nl ~ c}

for every c > 0 (cf. Chebyshev's inequality), we have sup

P{l~nl;;:::

n

c}

1 c

s- sup E l~nl-+ 0,

c--+ 00,

and therefore, when c is sufficiently large, any set {I ~n I ;;::: c }, n ;;::: 1, can be taken as A. Therefore sup E(l ~nl I 0 ~" 1 ~, 1 ) s ~>,which establishes the uniform integrability. This completes the proof of the lemma. The following proposition provides a simple sufficient condition for uniform integrability.

Lemma 3. Let ξ₁, ξ₂, ... be a sequence of integrable random variables and G = G(t) a nonnegative increasing function, defined for t ≥ 0, such that

lim_{t→∞} G(t)/t = ∞,  (18)

sup_n E[G(|ξₙ|)] < ∞.  (19)

Then the family {ξₙ}ₙ≥₁ is uniformly integrable.


Proof. Let ε > 0, M = sup_n E[G(|ξₙ|)], a = M/ε. Take c so large that G(t)/t ≥ a for t ≥ c. Then

E[|ξₙ| I{|ξₙ| ≥ c}] ≤ (1/a) E[G(|ξₙ|)·I{|ξₙ| ≥ c}] ≤ M/a = ε

uniformly for n ≥ 1.

5. If ξ and η are independent simple random variables, we can show, as in Subsection 5 of §4 of Chapter I, that Eξη = Eξ·Eη. Let us now establish a similar proposition in the general case (see also Problem 5).

Theorem 6. Let ξ and η be independent random variables with E|ξ| < ∞, E|η| < ∞. Then E|ξη| < ∞ and

Eξη = Eξ·Eη.  (20)

Proof. First let ξ ≥ 0, η ≥ 0. Put

ξₙ = Σ_{k≥0} (k/n) I{k/n ≤ ξ(ω) < (k+1)/n},  ηₙ = Σ_{k≥0} (k/n) I{k/n ≤ η(ω) < (k+1)/n}.

Then ξₙ ≤ ξ, |ξₙ − ξ| ≤ 1/n and ηₙ ≤ η, |ηₙ − η| ≤ 1/n. Since Eξ < ∞ and Eη < ∞, it follows from Lebesgue's dominated convergence theorem that

lim Eξₙ = Eξ,  lim Eηₙ = Eη.

Moreover, since ξ and η are independent,

Eξₙηₙ = Σ_{k,l≥0} (kl/n²) E[I{k/n ≤ ξ < (k+1)/n} I{l/n ≤ η < (l+1)/n}]
= Σ_{k,l≥0} (kl/n²) E I{k/n ≤ ξ < (k+1)/n} · E I{l/n ≤ η < (l+1)/n} = Eξₙ · Eηₙ.

Now notice that

|Eξη − Eξₙηₙ| ≤ E[|ξ|·|η − ηₙ|] + E[|ηₙ|·|ξ − ξₙ|] ≤ (1/n)Eξ + (1/n)E(η + 1/n) → 0,  n → ∞.

Therefore Eξη = limₙ Eξₙηₙ = lim Eξₙ · lim Eηₙ = Eξ·Eη, and Eξη < ∞.

The general case reduces to this one if we use the representations ξ = ξ⁺ − ξ⁻, η = η⁺ − η⁻, ξη = ξ⁺η⁺ − ξ⁻η⁺ − ξ⁺η⁻ + ξ⁻η⁻. This completes the proof.

6. The inequalities for expectations that we develop in this subsection are regularly used both in probability theory and in analysis.
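For discrete random variables, (20) is a finite identity that can be verified exactly. The example below is hypothetical (the two distributions are arbitrary illustrative choices); independence is encoded by taking the product measure.

```python
from fractions import Fraction
from itertools import product

vals1, p1 = [1, 2], {1: Fraction(1, 3), 2: Fraction(2, 3)}
vals2, p2 = [0, 5], {0: Fraction(1, 4), 5: Fraction(3, 4)}

# Product measure: xi and eta are independent by construction.
P = {(a, b): p1[a] * p2[b] for a, b in product(vals1, vals2)}

E_xi = sum(a * p1[a] for a in vals1)
E_eta = sum(b * p2[b] for b in vals2)
E_prod = sum(a * b * p for (a, b), p in P.items())

assert E_prod == E_xi * E_eta        # (20): E xi eta = E xi · E eta
print(E_prod)                        # 25/4
```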


Chebyshev's Inequality. Let ξ be a nonnegative random variable. Then for every ε > 0

P(ξ ≥ ε) ≤ Eξ/ε.  (21)

The proof follows immediately from

Eξ ≥ E[ξ·I{ξ ≥ ε}] ≥ ε E I{ξ ≥ ε} = εP(ξ ≥ ε).

From (21) we can obtain the following variants of Chebyshev's inequality: if ξ is any random variable, then

P(ξ ≥ ε) ≤ Eξ²/ε²  (22)

and

P(|ξ − Eξ| ≥ ε) ≤ Vξ/ε²,  (23)

where Vξ = E(ξ − Eξ)² is the variance of ξ.
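Inequality (23) can be checked exactly for any finite distribution. The three-point distribution below is a hypothetical example chosen only to make the arithmetic explicit.

```python
from fractions import Fraction

# A small distribution: value -> probability (illustrative numbers).
dist = {-2: Fraction(1, 8), 0: Fraction(5, 8), 4: Fraction(2, 8)}

E = sum(x * p for x, p in dist.items())              # mean: 3/4
V = sum((x - E) ** 2 * p for x, p in dist.items())   # variance: 63/16

# (23): P(|xi - E xi| >= eps) <= V xi / eps^2 for each eps.
for eps in [1, 2, 3, 4]:
    lhs = sum(p for x, p in dist.items() if abs(x - E) >= eps)
    assert lhs <= V / eps**2
print(E, V)
```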

The Cauchy–Bunyakovskii Inequality. Let ξ and η satisfy Eξ² < ∞, Eη² < ∞. Then E|ξη| < ∞ and

(E|ξη|)² ≤ Eξ² · Eη².  (24)

Proof. Suppose that Eξ² > 0, Eη² > 0. Then, with ξ̃ = ξ/√(Eξ²), η̃ = η/√(Eη²), we find, since 2|ξ̃η̃| ≤ ξ̃² + η̃², that

2E|ξ̃η̃| ≤ Eξ̃² + Eη̃² = 2,

i.e. E|ξ̃η̃| ≤ 1, which establishes (24).

On the other hand, if, say, Eξ² = 0, then ξ = 0 (a.s.) by Property H, and then Eξη = 0 by Property F, i.e. (24) is still satisfied.

Jensen's Inequality. Let the Borel function g = g(x) be convex downward and E|ξ| < ∞. Then

g(Eξ) ≤ Eg(ξ).  (25)

Proof. If g = g(x) is convex downward, for each x₀ ∈ R there is a number λ(x₀) such that

g(x) ≥ g(x₀) + (x − x₀)·λ(x₀)  (26)

for all x ∈ R. Putting x = ξ and x₀ = Eξ, we find from (26) that

g(ξ) ≥ g(Eξ) + (ξ − Eξ)·λ(Eξ),

and consequently Eg(ξ) ≥ g(Eξ).


A whole series of useful inequalities can be derived from Jensen's inequality. We obtain the following one as an example.

Lyapunov's Inequality. If 0 < s < t, then

(E|ξ|ˢ)^{1/s} ≤ (E|ξ|ᵗ)^{1/t}.  (27)

To prove this, let r = t/s. Then, putting η = |ξ|ˢ and applying Jensen's inequality to g(x) = |x|ʳ, we obtain |Eη|ʳ ≤ E|η|ʳ, i.e.

(E|ξ|ˢ)^{t/s} ≤ E|ξ|ᵗ,

which establishes (27).

The following chain of inequalities among absolute moments is a consequence of Lyapunov's inequality:

E|ξ| ≤ (E|ξ|²)^{1/2} ≤ ⋯ ≤ (E|ξ|ⁿ)^{1/n}.  (28)

Hölder's Inequality. Let 1 < p < ∞, 1 < q < ∞, and (1/p) + (1/q) = 1. If E|ξ|ᵖ < ∞ and E|η|^q < ∞, then E|ξη| < ∞ and

E|ξη| ≤ (E|ξ|ᵖ)^{1/p} (E|η|^q)^{1/q}.  (29)

If E|ξ|ᵖ = 0 or E|η|^q = 0, (29) follows immediately, as for the Cauchy–Bunyakovskii inequality (which is the special case p = q = 2 of Hölder's inequality). Now let E|ξ|ᵖ > 0, E|η|^q > 0, and put

ξ̃ = ξ/(E|ξ|ᵖ)^{1/p},  η̃ = η/(E|η|^q)^{1/q}.

We apply the inequality

x^a y^b ≤ ax + by,  (30)

which holds for positive x, y, a, b with a + b = 1, and follows immediately from the concavity of the logarithm:

ln[ax + by] ≥ a ln x + b ln y = ln x^a y^b.

Then, putting x = |ξ̃|ᵖ, y = |η̃|^q, a = 1/p, b = 1/q, we find that

|ξ̃η̃| ≤ (1/p)|ξ̃|ᵖ + (1/q)|η̃|^q,

whence

E|ξ̃η̃| ≤ (1/p)E|ξ̃|ᵖ + (1/q)E|η̃|^q = 1/p + 1/q = 1.

This establishes (29).
192

II. Mathematical Foundations of Probability Theory

Minkowski's Inequality.JfE I~IP < oo, E1'71P < oo, 1 $ p < oo, then we have E I~ + 'liP < oo and (E I~ + '71P)l/p

$

(E I~ IP)lfp + (E I'71P)lip.

(31)

We begin by establishing the following inequality: if a, b > 0 and p then

~

1,

(32) In fact, consider the function F(x) = (a + x)P - 2p- 1(aP + xP). Then F'(x) = p(a + x)p-l- 2P- 1pxp-l,

and since p

~

1, we have F'(a) = 0, F'(x) > 0 for x < a and F'(x) < 0 for

x > a. Therefore

F(b) $ max F(x) = F(a) = 0, from which (32) follows. According to this inequality,

and therefore if EI~ IP < oo and EI'71P < oo it follows that EI~ + 'liP < oo. If p = 1, inequality (31) follows from (33). Now suppose that p > 1. Take q > 1 so that (1/p)

+

(1/q)

=

1. Then

I~+ 'lip= I~+ '71·1~ + '71p-l Sl~l·l~ + '71p-l + 1'711~ + '71p-l.

(34)

Notice that (p - 1)q = p. Consequently E(l~

+ '7ip-l)q = Ei~ +'liP< oo,

and therefore by Holder's inequality E(l~ll~

+ '7ip-1)

+ '7i(p-l>q)liq = (E I~ IP) 11P(E I~ + '7ip)lfq < 00. ~ (Ei~IP)liP(Ei~

In the same way, E( I'1 II~ + '71p-l) $ (E I'71P) 11P(E I~ + '7ip)lfq, Consequently, by (34),

EI~ + 'lip

$

(E I~ + '71P)lfq((E I~ IP)l/p + (E I'71P) 11P).

IfE I~ + 'liP = 0, the desired inequality (31) is evident. Now let EI~ Then we obtain

(E I~ + '71P)l-(l/q)

~

(E I~ IP)lfp + (E I'71P)lfp

from (35), and (31) follows since 1 - (1/q)

= 1/p.

(35)

+ 'liP > 0.


7. Let ξ be a random variable for which Eξ is defined. Then, by Property D, the set function

Q(A) = ∫_A ξ dP,  A ∈ 𝓕,  (36)

is well defined. Let us show that this function is countably additive. First suppose that ξ is nonnegative. If A₁, A₂, ... are pairwise disjoint sets from 𝓕 and A = ΣAₙ, the corollary to Theorem 1 implies that

Q(A) = E(ξ·I_A) = E(ξ·I_{ΣAₙ}) = E(Σ ξ·I_{Aₙ}) = Σ E(ξ·I_{Aₙ}) = Σ Q(Aₙ).

If ξ is an arbitrary random variable for which Eξ is defined, the countable additivity of Q(A) follows from the representation

Q(A) = Q⁺(A) − Q⁻(A),  (37)

where

Q⁺(A) = ∫_A ξ⁺ dP,  Q⁻(A) = ∫_A ξ⁻ dP,

together with the countable additivity for nonnegative random variables and the fact that min(Q⁺(Ω), Q⁻(Ω)) < ∞.

Thus if Eξ is defined, the set function Q = Q(A) is a signed measure, that is, a countably additive set function representable as Q = Q₁ − Q₂, where at least one of the measures Q₁ and Q₂ is finite.

We now show that Q = Q(A) has the following important property of absolute continuity with respect to P:

if P(A) = 0 then Q(A) = 0  (A ∈ 𝓕)

(this property is denoted by the abbreviation Q ≪ P).

To prove this property we first consider nonnegative random variables. If ξ = Σₖ₌₁ⁿ xₖ I_{Aₖ} is a simple nonnegative random variable and P(A) = 0, then

Q(A) = E(ξ·I_A) = Σₖ₌₁ⁿ xₖ P(Aₖ ∩ A) = 0.

If {ξₙ}ₙ≥₁ is a sequence of nonnegative simple functions such that ξₙ ↑ ξ ≥ 0, then the theorem on monotone convergence shows that

Q(A) = E(ξ·I_A) = lim E(ξₙ·I_A) = 0,

since E(ξₙ·I_A) = 0 for all n ≥ 1 and every A with P(A) = 0.

Thus the Lebesgue integral Q(A) = ∫_A ξ dP, considered as a function of sets A ∈ 𝓕, is a signed measure that is absolutely continuous with respect to P (Q ≪ P). It is quite remarkable that the converse is also valid.


Radon–Nikodym Theorem. Let (Ω, 𝓕) be a measurable space, μ a σ-finite measure, and λ a signed measure (i.e., λ = λ₁ − λ₂, where at least one of the measures λ₁ and λ₂ is finite) which is absolutely continuous with respect to μ. Then there is an 𝓕-measurable function f = f(ω) with values in R̄ = [−∞, ∞] such that

λ(A) = ∫_A f(ω)μ(dω),  A ∈ 𝓕.  (38)

The function f(ω) is unique up to sets of μ-measure zero: if h = h(ω) is another 𝓕-measurable function such that λ(A) = ∫_A h(ω)μ(dω), A ∈ 𝓕, then μ{ω: f(ω) ≠ h(ω)} = 0. If λ is a measure, then f = f(ω) has its values in R̄₊ = [0, ∞].

Remark. The function f = f(ω) in the representation (38) is called the Radon–Nikodym derivative, or the density of the measure λ with respect to μ, and is denoted by dλ/dμ or (dλ/dμ)(ω).

The Radon–Nikodym theorem, which we quote without proof, will play a key role in the construction of conditional expectations (§7).

8. If ξ = Σᵢ₌₁ⁿ xᵢ I_{Aᵢ} is a simple random variable, then

E g(ξ) = Σᵢ g(xᵢ)P(Aᵢ) = Σᵢ g(xᵢ) ΔF_ξ(xᵢ).  (39)

In other words, in order to calculate the expectation of a function of the (simple) random variable ξ it is unnecessary to know the probability measure P completely; it is enough to know the probability distribution P_ξ or, equivalently, the distribution function F_ξ of ξ.

The following important theorem generalizes this property.
**Theorem 7** (Change of Variables in a Lebesgue Integral). Let $(\Omega, \mathcal{F})$ and $(E, \mathcal{E})$ be measurable spaces and $X = X(\omega)$ an $\mathcal{F}/\mathcal{E}$-measurable function with values in $E$. Let $P$ be a probability measure on $(\Omega, \mathcal{F})$ and $P_X$ the probability measure on $(E, \mathcal{E})$ induced by $X = X(\omega)$:

$$P_X(A) = P\{\omega : X(\omega) \in A\}, \qquad A \in \mathcal{E}. \qquad (40)$$

Then

$$\int_A g(x)\,P_X(dx) = \int_{X^{-1}(A)} g(X(\omega))\,P(d\omega), \qquad A \in \mathcal{E}, \qquad (41)$$

for every $\mathcal{E}$-measurable function $g = g(x)$, $x \in E$ (in the sense that if one integral exists, the other is well defined, and the two are equal).

Proof. Let $A \in \mathcal{E}$ and $g(x) = I_B(x)$, where $B \in \mathcal{E}$. Then (41) becomes

$$P_X(A \cap B) = P(X^{-1}(A) \cap X^{-1}(B)), \qquad (42)$$


§6. Lebesgue Integral. Expectation

which follows from (40) and the observation that $X^{-1}(A) \cap X^{-1}(B) = X^{-1}(A \cap B)$. It follows from (42) that (41) is valid for nonnegative simple functions $g = g(x)$, and therefore, by the monotone convergence theorem, also for all nonnegative $\mathcal{E}$-measurable functions. In the general case we need only represent $g$ as $g^+ - g^-$. Then, since (41) is valid for $g^+$ and $g^-$, if (for example) $\int_A g^+(x)\,P_X(dx) < \infty$, we have

$$\int_{X^{-1}(A)} g^+(X(\omega))\,P(d\omega) < \infty$$

also, and therefore the existence of $\int_A g(x)\,P_X(dx)$ implies the existence of $\int_{X^{-1}(A)} g(X(\omega))\,P(d\omega)$.

**Corollary.** Let $(E, \mathcal{E}) = (R, \mathcal{B}(R))$ and let $\xi = \xi(\omega)$ be a random variable with probability distribution $P_\xi$. Then if $g = g(x)$ is a Borel function and either of the integrals $\int_A g(x)\,P_\xi(dx)$ or $\int_{\xi^{-1}(A)} g(\xi(\omega))\,P(d\omega)$ exists, we have

$$\int_A g(x)\,P_\xi(dx) = \int_{\xi^{-1}(A)} g(\xi(\omega))\,P(d\omega).$$

In particular, for $A = R$ we obtain

$$E g(\xi(\omega)) = \int_\Omega g(\xi(\omega))\,P(d\omega) = \int_R g(x)\,P_\xi(dx). \qquad (43)$$

The measure $P_\xi$ can be uniquely reconstructed from the distribution function $F_\xi$ (Theorem 1 of §3). Hence the Lebesgue integral $\int_R g(x)\,P_\xi(dx)$ is often denoted by $\int_R g(x)\,F_\xi(dx)$ and called a Lebesgue–Stieltjes integral (with respect to the measure corresponding to the distribution function $F_\xi(x)$).

Let us consider the case when $F_\xi(x)$ has a density $f_\xi(x)$, i.e. let

$$F_\xi(x) = \int_{-\infty}^x f_\xi(y)\,dy, \qquad (44)$$

where $f_\xi = f_\xi(x)$ is a nonnegative Borel function and the integral is a Lebesgue integral with respect to Lebesgue measure on the set $(-\infty, x]$ (see Remark 2 in Subsection 1). With the assumption of (44), formula (43) takes the form

$$E g(\xi(\omega)) = \int_{-\infty}^{\infty} g(x) f_\xi(x)\,dx, \qquad (45)$$

where the integral is the Lebesgue integral of the function $g(x) f_\xi(x)$ with respect to Lebesgue measure. In fact, if $g(x) = I_B(x)$, $B \in \mathcal{B}(R)$, the formula becomes

$$P_\xi(B) = \int_B f_\xi(x)\,dx, \qquad B \in \mathcal{B}(R); \qquad (46)$$
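Formula (45) can be checked numerically; a sketch with a hypothetical choice of distribution and $g$ — here $\xi \sim \mathrm{Exp}(1)$ and $g(x) = x^2$, so both sides should be near $\Gamma(3) = 2$:

```python
# Sketch of (45): the left side E g(xi) is estimated by Monte Carlo
# over the underlying space, the right side by a Riemann sum of
# g(x) f_xi(x).  Exp(1) and g(x) = x^2 are arbitrary choices.
import math
import random

def f_xi(x):                  # density of Exp(1)
    return math.exp(-x) if x >= 0 else 0.0

def g(x):
    return x * x

# Right side of (45): midpoint rule on [0, 40] (the tail is negligible).
steps, b = 100_000, 40.0
h = b / steps
integral = sum(g((k + 0.5) * h) * f_xi((k + 0.5) * h) for k in range(steps)) * h

# Left side of (45): sample xi directly and average g(xi).
random.seed(0)
mc = sum(g(random.expovariate(1.0)) for _ in range(100_000)) / 100_000
```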


its correctness follows from Theorem 1 of §3 and the formula

$$F_\xi(b) - F_\xi(a) = \int_a^b f_\xi(x)\,dx.$$

In the general case, the proof is the same as for Theorem 7.

9. Let us consider the special case of measurable spaces $(\Omega, \mathcal{F})$ with a measure $\mu$, where $\Omega = \Omega_1 \times \Omega_2$, $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, and $\mu = \mu_1 \times \mu_2$ is the direct product of the measures $\mu_1$ and $\mu_2$ (i.e., the measure on $\mathcal{F}$ such that

$$\mu(A \times B) = \mu_1(A)\,\mu_2(B), \qquad A \in \mathcal{F}_1,\ B \in \mathcal{F}_2;$$

the existence of this measure follows from the proof of Theorem 8). The following theorem plays the same role as the theorem on the reduction of a double Riemann integral to an iterated integral.

**Theorem 8** (Fubini's Theorem). Let $\xi = \xi(\omega_1, \omega_2)$ be an $\mathcal{F}_1 \otimes \mathcal{F}_2$-measurable function, integrable with respect to the measure $\mu_1 \times \mu_2$:

$$\int_{\Omega_1 \times \Omega_2} |\xi(\omega_1, \omega_2)|\,d(\mu_1 \times \mu_2) < \infty. \qquad (47)$$

Then the integrals $\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)$ and $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$

(1) are defined for all $\omega_1$ and $\omega_2$;
(2) are respectively $\mathcal{F}_2$- and $\mathcal{F}_1$-measurable functions with

$$\mu_2\Bigl\{\omega_2 : \int_{\Omega_1} |\xi(\omega_1, \omega_2)|\,\mu_1(d\omega_1) = \infty\Bigr\} = 0, \qquad \mu_1\Bigl\{\omega_1 : \int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2) = \infty\Bigr\} = 0 \qquad (48)$$

and (3)

$$\int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_2}\Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)\Bigr]\mu_2(d\omega_2). \qquad (49)$$

Proof. We first show that $\xi_{\omega_1}(\omega_2) = \xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable with respect to $\omega_2$, for each $\omega_1 \in \Omega_1$. Let $F \in \mathcal{F}_1 \otimes \mathcal{F}_2$ and $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$. Let


$F_{\omega_1} = \{\omega_2 \in \Omega_2 : (\omega_1, \omega_2) \in F\}$ be the cross-section of $F$ at $\omega_1$, and let $\mathcal{C}_{\omega_1} = \{F \in \mathcal{F} : F_{\omega_1} \in \mathcal{F}_2\}$. We must show that $\mathcal{C}_{\omega_1} = \mathcal{F}$ for every $\omega_1$. If $F = A \times B$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

$$(A \times B)_{\omega_1} = \begin{cases} B & \text{if } \omega_1 \in A, \\ \varnothing & \text{if } \omega_1 \notin A. \end{cases}$$

Hence rectangles with measurable sides belong to $\mathcal{C}_{\omega_1}$. In addition, if $F \in \mathcal{F}$, then $(\bar F)_{\omega_1} = \overline{F_{\omega_1}}$, and if $\{F^n\}$ are sets in $\mathcal{F}$, then $\bigl(\bigcup_n F^n\bigr)_{\omega_1} = \bigcup_n F^n_{\omega_1}$. It follows that $\mathcal{C}_{\omega_1} = \mathcal{F}$.

Now let $\xi(\omega_1, \omega_2) \ge 0$. Then, since the function $\xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable for each $\omega_1$, the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is defined. Let us show that this integral is an $\mathcal{F}_1$-measurable function and

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2). \qquad (50)$$

Let us suppose that $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$. Then since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

$$\int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2)\,\mu_2(d\omega_2) = I_A(\omega_1)\,\mu_2(B), \qquad (51)$$

and consequently the integral on the left of (51) is an $\mathcal{F}_1$-measurable function. Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$. Let us show that the integral $f(\omega_1) = \int_{\Omega_2} I_F(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is $\mathcal{F}_1$-measurable. For this purpose we put $\mathcal{C} = \{F \in \mathcal{F} : f(\omega_1) \text{ is } \mathcal{F}_1\text{-measurable}\}$. According to what has been proved, the set $A \times B$ belongs to $\mathcal{C}$ ($A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$) and therefore the algebra $\mathcal{A}$ consisting of finite sums of disjoint sets of this form also belongs to $\mathcal{C}$. It follows from the monotone convergence theorem that $\mathcal{C}$ is a monotonic class, $\mathcal{C} = \mu(\mathcal{C})$. Therefore, because of the inclusions $\mathcal{A} \subseteq \mathcal{C} \subseteq \mathcal{F}$ and Theorem 1 of §2, we have $\mathcal{F} = \sigma(\mathcal{A}) = \mu(\mathcal{A}) \subseteq \mu(\mathcal{C}) = \mathcal{C} \subseteq \mathcal{F}$, i.e. $\mathcal{C} = \mathcal{F}$.

Finally, if $\xi(\omega_1, \omega_2)$ is an arbitrary nonnegative $\mathcal{F}$-measurable function, the $\mathcal{F}_1$-measurability of the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ follows from the monotone convergence theorem and Theorem 2 of §4.

Let us now show that the measure $\mu = \mu_1 \times \mu_2$ defined on $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, with the property $(\mu_1 \times \mu_2)(A \times B) = \mu_1(A) \cdot \mu_2(B)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, actually exists and is unique. For $F \in \mathcal{F}$ we put

$$\mu(F) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{F_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1).$$

As we have shown, the inner integral is an $\mathcal{F}_1$-measurable function of $\omega_1$, and consequently the set function $\mu(F)$ is actually defined for $F \in \mathcal{F}$. It is clear


that if $F = A \times B$, then $\mu(A \times B) = \mu_1(A)\,\mu_2(B)$. Now let $\{F^n\}$ be disjoint sets from $\mathcal{F}$. Then

$$\mu\Bigl(\sum_n F^n\Bigr) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{(\sum_n F^n)_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \sum_n I_{F^n_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \sum_n \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{F^n_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \sum_n \mu(F^n),$$

i.e. $\mu$ is a ($\sigma$-finite) measure on $\mathcal{F}$. It follows from Carathéodory's theorem that this measure $\mu$ is the unique measure with the property that $\mu(A \times B) = \mu_1(A)\,\mu_2(B)$.

We can now establish (50). If $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

$$\int_{\Omega_1 \times \Omega_2} I_{A \times B}(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = (\mu_1 \times \mu_2)(A \times B), \qquad (52)$$

and since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[I_A(\omega_1)\int_{\Omega_2} I_B(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \mu_1(A)\,\mu_2(B). \qquad (53)$$

But, by the definition of $\mu_1 \times \mu_2$,

$$(\mu_1 \times \mu_2)(A \times B) = \mu_1(A)\,\mu_2(B).$$

Hence it follows from (52) and (53) that (50) is valid for $I_{A \times B}(\omega_1, \omega_2)$.

Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F}$. The set function

$$\lambda(F) = \int_{\Omega_1 \times \Omega_2} I_F(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2)$$

is evidently a $\sigma$-finite measure. It is also easily verified that the set function

$$\nu(F) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_F(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1)$$

is a $\sigma$-finite measure. As was shown above, $\lambda$ and $\nu$ coincide on sets of the form $F = A \times B$, and therefore on the algebra $\mathcal{A}$. Hence it follows by Carathéodory's theorem that $\lambda$ and $\nu$ coincide for all $F \in \mathcal{F}$.


We turn now to the proof of the full conclusion of Fubini's theorem. By (47),

$$\int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty, \qquad \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty.$$

By what has already been proved, the integral $\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is an $\mathcal{F}_1$-measurable function of $\omega_1$ and

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty.$$

Consequently, by Problem 4 (see also Property J in Subsection 2),

$$\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}).$$

In the same way

$$\int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}),$$

and therefore

$$\int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}).$$

It is clear that, except on a set $\mathcal{N}$ of $\mu_1$-measure zero,

$$\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2) = \int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2) - \int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2). \qquad (54)$$

Taking the integrals to be zero for $\omega_1 \in \mathcal{N}$, we may suppose that (54) holds for all $\omega_1 \in \Omega_1$. Then, integrating (54) with respect to $\mu_1$ and using (50), we obtain

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) - \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1)$$
$$= \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) - \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2).$$


Similarly we can establish the first equation in (48) and the equation

$$\int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_2}\Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)\Bigr]\mu_2(d\omega_2).$$

This completes the proof of the theorem.
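When $\mu_1$ and $\mu_2$ are counting measures on the positive integers, Fubini's theorem becomes a statement about double series (cf. Problem 13 below): if $\sum_{i,j} |a_{ij}| < \infty$, the two iterated sums agree. A sketch with the hypothetical array $a_{ij} = (-1)^{i+j}/2^{i+j}$, whose double sum is $(\sum_{i \ge 1}(-1/2)^i)^2 = 1/9$:

```python
# Absolutely summable double array: iterated sums in either order agree.
# Truncation at N = 60 leaves a geometrically small tail.
N = 60
a = [[(-1) ** (i + j) / 2 ** (i + j) for j in range(1, N)]
     for i in range(1, N)]

rows_then_cols = sum(sum(row) for row in a)
cols_then_rows = sum(sum(a[i][j] for i in range(N - 1)) for j in range(N - 1))
# Both are close to 1/9 = 0.111...
```

Problem 14 asks for an array where absolute summability fails and the two iterated sums differ, which is why condition (47) cannot be dropped.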

**Corollary.** If $\int_{\Omega_1}\bigl[\int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2)\bigr]\mu_1(d\omega_1) < \infty$, the conclusion of Fubini's theorem is still valid.

In fact, under this hypothesis (47) follows from (50), and consequently the conclusions of Fubini's theorem hold.

Example. Let $(\xi, \eta)$ be a pair of random variables whose distribution has a two-dimensional density $f_{\xi\eta}(x, y)$, i.e.

$$P((\xi, \eta) \in B) = \int_B f_{\xi\eta}(x, y)\,dx\,dy, \qquad B \in \mathcal{B}(R^2),$$

where $f_{\xi\eta}(x, y)$ is a nonnegative $\mathcal{B}(R^2)$-measurable function, and the integral is a Lebesgue integral with respect to two-dimensional Lebesgue measure. Let us show that the one-dimensional distributions for $\xi$ and $\eta$ have densities $f_\xi(x)$ and $f_\eta(y)$, and furthermore

$$f_\xi(x) = \int_{-\infty}^\infty f_{\xi\eta}(x, y)\,dy \qquad \text{and} \qquad f_\eta(y) = \int_{-\infty}^\infty f_{\xi\eta}(x, y)\,dx. \qquad (55)$$

In fact, if $A \in \mathcal{B}(R)$, then by Fubini's theorem

$$P(\xi \in A) = P((\xi, \eta) \in A \times R) = \int_{A \times R} f_{\xi\eta}(x, y)\,dx\,dy = \int_A \Bigl[\int_R f_{\xi\eta}(x, y)\,dy\Bigr] dx.$$

This establishes both the existence of a density for the probability distribution of $\xi$ and the first formula in (55). The second formula is established similarly.

According to the theorem in §5, a necessary and sufficient condition that $\xi$ and $\eta$ are independent is that

$$F_{\xi\eta}(x, y) = F_\xi(x) F_\eta(y), \qquad (x, y) \in R^2.$$

Let us show that when there is a two-dimensional density $f_{\xi\eta}(x, y)$, the variables $\xi$ and $\eta$ are independent if and only if

$$f_{\xi\eta}(x, y) = f_\xi(x) f_\eta(y) \qquad (56)$$

(where the equation is to be understood in the sense of holding almost surely with respect to two-dimensional Lebesgue measure).
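The marginal formula (55) is easy to check numerically; a sketch with the hypothetical joint density $f_{\xi\eta}(x, y) = x + y$ on $[0, 1]^2$, whose $\xi$-marginal is $f_\xi(x) = x + \tfrac12$:

```python
# Sketch of (55): integrate the joint density over y (midpoint rule)
# and compare with the known marginal x + 1/2.
def f_joint(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def marginal(x, n=10_000):
    """f_xi(x) = integral over y of f_joint(x, y)."""
    h = 1.0 / n
    return sum(f_joint(x, (k + 0.5) * h) for k in range(n)) * h

for x in (0.1, 0.5, 0.9):
    assert abs(marginal(x) - (x + 0.5)) < 1e-9

# This joint density does not factor as f_xi(x) * f_eta(y), so by the
# criterion (56) xi and eta are not independent in this example.
```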


In fact, if (56) holds, then by Fubini's theorem

$$F_{\xi\eta}(x, y) = \int_{(-\infty,x] \times (-\infty,y]} f_{\xi\eta}(u, v)\,du\,dv = \int_{(-\infty,x] \times (-\infty,y]} f_\xi(u) f_\eta(v)\,du\,dv = \int_{(-\infty,x]} f_\xi(u)\,du \cdot \int_{(-\infty,y]} f_\eta(v)\,dv = F_\xi(x) F_\eta(y),$$

and consequently $\xi$ and $\eta$ are independent.

Conversely, if they are independent and have a density $f_{\xi\eta}(x, y)$, then again by Fubini's theorem

$$\int_{(-\infty,x] \times (-\infty,y]} f_{\xi\eta}(u, v)\,du\,dv = \Bigl(\int_{(-\infty,x]} f_\xi(u)\,du\Bigr)\Bigl(\int_{(-\infty,y]} f_\eta(v)\,dv\Bigr) = \int_{(-\infty,x] \times (-\infty,y]} f_\xi(u) f_\eta(v)\,du\,dv.$$

It follows that

$$\int_B f_{\xi\eta}(x, y)\,dx\,dy = \int_B f_\xi(x) f_\eta(y)\,dx\,dy$$

for every $B \in \mathcal{B}(R^2)$, and it is easily deduced from Property I that (56) holds.

10. In this subsection we discuss the relation between the Lebesgue and Riemann integrals.

We first observe that the construction of the Lebesgue integral is independent of the measurable space $(\Omega, \mathcal{F})$ on which the integrands are given. On the other hand, the Riemann integral is not defined on abstract spaces in general, and for $\Omega = R^n$ it is defined sequentially: first for $R^1$, and then extended, with corresponding changes, to the case $n > 1$.

We emphasize that the constructions of the Riemann and Lebesgue integrals are based on different ideas. The first step in the construction of the Riemann integral is to group the points $x \in R^1$ according to their distances along the $x$ axis. On the other hand, in Lebesgue's construction (for $\Omega = R^1$) the points $x \in R^1$ are grouped according to a different principle: by the distances between the values of the integrand. It is a consequence of these different approaches that the Riemann approximating sums have limits only for "mildly" discontinuous functions, whereas the Lebesgue sums converge to limits for a much wider class of functions.

Let us recall the definition of the Riemann–Stieltjes integral. Let $G = G(x)$ be a generalized distribution function on $R$ (see Subsection 2 of §3) and $\mu$ its corresponding Lebesgue–Stieltjes measure, and let $g = g(x)$ be a bounded function that vanishes outside $[a, b]$.


Consider a decomposition $\mathcal{P} = \{x_0, \ldots, x_n\}$, $a = x_0 < x_1 < \cdots < x_n = b$, of $[a, b]$, and form the upper and lower sums

$$\overline{\Sigma} = \sum_{i=1}^n \bar g_i\,[G(x_i) - G(x_{i-1})], \qquad \underline{\Sigma} = \sum_{i=1}^n \underline{g}_i\,[G(x_i) - G(x_{i-1})],$$

where

$$\bar g_i = \sup_{x_{i-1} < y \le x_i} g(y), \qquad \underline{g}_i = \inf_{x_{i-1} < y \le x_i} g(y).$$

Define simple functions $\bar g_{\mathcal{P}}(x)$ and $\underline{g}_{\mathcal{P}}(x)$ by taking

$$\bar g_{\mathcal{P}}(x) = \bar g_i, \qquad \underline{g}_{\mathcal{P}}(x) = \underline{g}_i \qquad \text{on } x_{i-1} < x \le x_i,$$

and define $\bar g_{\mathcal{P}}(a) = \underline{g}_{\mathcal{P}}(a) = g(a)$. It is clear that then

$$\overline{\Sigma} = \text{(L-S)}\int_a^b \bar g_{\mathcal{P}}(x)\,G(dx) \qquad \text{and} \qquad \underline{\Sigma} = \text{(L-S)}\int_a^b \underline{g}_{\mathcal{P}}(x)\,G(dx).$$

Now let $\{\mathcal{P}_k\}$ be a sequence of decompositions such that $\mathcal{P}_k \subseteq \mathcal{P}_{k+1}$. Then

$$\bar g_{\mathcal{P}_1}(x) \ge \bar g_{\mathcal{P}_2}(x) \ge \cdots \ge g(x) \ge \cdots \ge \underline{g}_{\mathcal{P}_2}(x) \ge \underline{g}_{\mathcal{P}_1}(x),$$

and if $|g(x)| \le C$ we have, by the dominated convergence theorem,

$$\lim_{k \to \infty} \overline{\Sigma}_{\mathcal{P}_k} = \text{(L-S)}\int_a^b \bar g(x)\,G(dx), \qquad \lim_{k \to \infty} \underline{\Sigma}_{\mathcal{P}_k} = \text{(L-S)}\int_a^b \underline{g}(x)\,G(dx), \qquad (57)$$

where $\bar g(x) = \lim_k \bar g_{\mathcal{P}_k}(x)$, $\underline{g}(x) = \lim_k \underline{g}_{\mathcal{P}_k}(x)$.

If the limits $\lim_k \overline{\Sigma}_{\mathcal{P}_k}$ and $\lim_k \underline{\Sigma}_{\mathcal{P}_k}$ are finite and equal, and their common value is independent of the sequence of decompositions $\{\mathcal{P}_k\}$, we say that $g = g(x)$ is Riemann–Stieltjes integrable, and the common value of the limits is denoted by

$$\text{(R-S)}\int_a^b g(x)\,G(dx). \qquad (58)$$

When $G(x) = x$, the integral is called a Riemann integral and denoted by $\text{(R)}\int_a^b g(x)\,dx$.
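The squeezing of (57)–(58) can be observed numerically; a sketch with the hypothetical choices $G(x) = x^2$ and $g(x) = x$ on $[0, 1]$, where the Riemann–Stieltjes integral equals $\int_0^1 x \cdot 2x\,dx = 2/3$:

```python
# Upper and lower Darboux-Stieltjes sums for monotone g under uniform
# partitions; refining the partition squeezes them toward 2/3.
def G(x):
    return x * x

def g(x):
    return x

def upper_lower(n):
    xs = [k / n for k in range(n + 1)]
    up = lo = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        w = G(x1) - G(x0)            # G-mass of (x0, x1]
        up += max(g(x0), g(x1)) * w  # sup of monotone g on the interval
        lo += min(g(x0), g(x1)) * w  # inf of monotone g on the interval
    return up, lo

u1, l1 = upper_lower(10)
u2, l2 = upper_lower(1000)
# The refinement n=1000 contains n=10, so the sums are nested.
```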


Now let $\text{(L-S)}\int_a^b g(x)\,G(dx)$ be the corresponding Lebesgue–Stieltjes integral (see Remark 2 in Subsection 2).

**Theorem 9.** If $g = g(x)$ is continuous on $[a, b]$, it is Riemann–Stieltjes integrable and

$$\text{(R-S)}\int_a^b g(x)\,G(dx) = \text{(L-S)}\int_a^b g(x)\,G(dx). \qquad (59)$$

Proof. Since $g(x)$ is continuous, we have $\bar g(x) = g(x) = \underline{g}(x)$. Hence by (57), $\lim_{k \to \infty} \overline{\Sigma}_{\mathcal{P}_k} = \lim_{k \to \infty} \underline{\Sigma}_{\mathcal{P}_k}$. Consequently $g = g(x)$ is Riemann–Stieltjes integrable, and its Riemann–Stieltjes integral coincides with the Lebesgue–Stieltjes integral (again by (57)).

Let us consider in more detail the question of the correspondence between the Riemann and Lebesgue integrals for the case of Lebesgue measure on the line $R$.

**Theorem 10.** Let $g(x)$ be a bounded function on $[a, b]$.

(a) The function $g = g(x)$ is Riemann integrable on $[a, b]$ if and only if it is continuous almost everywhere (with respect to Lebesgue measure $\lambda$ on $\mathcal{B}([a, b])$).
(b) If $g = g(x)$ is Riemann integrable, it is Lebesgue integrable and

$$\text{(R)}\int_a^b g(x)\,dx = \text{(L)}\int_a^b g(x)\,\lambda(dx). \qquad (60)$$

Proof. (a) Let $g = g(x)$ be Riemann integrable. Then, by (57),

$$\text{(L)}\int_a^b \bar g(x)\,\lambda(dx) = \text{(L)}\int_a^b \underline{g}(x)\,\lambda(dx).$$

But $\underline{g}(x) \le g(x) \le \bar g(x)$, and hence by Property H

$$\underline{g}(x) = g(x) = \bar g(x) \qquad (\lambda\text{-a.s.}), \qquad (61)$$

from which it is easy to see that $g(x)$ is continuous almost everywhere (with respect to $\lambda$).

Conversely, let $g = g(x)$ be continuous almost everywhere (with respect to $\lambda$). Then (61) is satisfied and consequently $g(x)$ differs from the (Borel) measurable function $\bar g(x)$ only on a set $\mathcal{N}$ with $\lambda(\mathcal{N}) = 0$. But then

$$\{x : g(x) \le c\} = \{x : g(x) \le c\} \cap \mathcal{N} + \{x : g(x) \le c\} \cap \overline{\mathcal{N}} = \{x : g(x) \le c\} \cap \mathcal{N} + \{x : \bar g(x) \le c\} \cap \overline{\mathcal{N}}.$$

It is clear that the set $\{x : \bar g(x) \le c\} \cap \overline{\mathcal{N}} \in \mathcal{B}([a, b])$, and that $\{x : g(x) \le c\} \cap \mathcal{N}$


is a subset of $\mathcal{N}$ having Lebesgue measure zero and therefore also belonging to $\overline{\mathcal{B}}([a, b])$, the completion of $\mathcal{B}([a, b])$ with respect to $\lambda$. Therefore $g(x)$ is $\overline{\mathcal{B}}([a, b])$-measurable and, as a bounded function, is Lebesgue integrable. Therefore by Property G,

$$\text{(L)}\int_a^b g(x)\,\lambda(dx) = \text{(L)}\int_a^b \bar g(x)\,\lambda(dx) = \text{(L)}\int_a^b \underline{g}(x)\,\lambda(dx),$$

which completes the proof of (a).

(b) If $g = g(x)$ is Riemann integrable, then according to (a) it is continuous ($\lambda$-a.s.). It was shown above that then $g(x)$ is Lebesgue integrable and its Riemann and Lebesgue integrals are equal. This completes the proof of the theorem.

Remark. Let $\mu$ be a Lebesgue–Stieltjes measure on $\mathcal{B}([a, b])$. Let $\mathcal{B}_\mu([a, b])$ be the system consisting of those subsets $A \subseteq [a, b]$ for which there are sets $A_1$ and $A_2$ in $\mathcal{B}([a, b])$ such that $A_1 \subseteq A \subseteq A_2$ and $\mu(A_2 \setminus A_1) = 0$. Let $\bar\mu$ be the extension of $\mu$ to $\mathcal{B}_\mu([a, b])$ ($\bar\mu(A) = \mu(A_1)$ for $A$ such that $A_1 \subseteq A \subseteq A_2$ and $\mu(A_2 \setminus A_1) = 0$). Then the conclusion of the theorem remains valid if we consider $\bar\mu$ instead of Lebesgue measure $\lambda$, and the Riemann–Stieltjes and Lebesgue–Stieltjes integrals with respect to $\bar\mu$ instead of the Riemann and Lebesgue integrals.
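Dirichlet's function (Problem 11 below) shows how part (a) of Theorem 10 can fail: every subinterval of $[0, 1]$ contains both rational and irrational points, so tagged Riemann sums can be forced to either extreme value. A sketch using a toy symbolic encoding — a point is stored as a pair $(q, m)$ standing for $q + m\sqrt{2}$ with $q, m$ rational, so rationality is exactly decidable (this encoding is an illustration device, not real arithmetic):

```python
# d = 1 at irrationals, 0 at rationals; tagged Riemann sums depend on
# the choice of tags, so no common limit exists.
from fractions import Fraction

def d(point):
    q, m = point
    return 1 if m != 0 else 0

def tagged_riemann_sum(n, use_irrational_tags):
    h = Fraction(1, n)
    # Tag in [k/n, (k+1)/n): either k/n itself (rational) or the
    # nearby irrational point k/n + sqrt(2)/10^9.
    shift = Fraction(1, 10**9) if use_irrational_tags else Fraction(0)
    return sum(d((k * h, shift)) * h for k in range(n))

assert tagged_riemann_sum(1000, False) == 0   # all-rational tags
assert tagged_riemann_sum(1000, True) == 1    # all-irrational tags
```

No refinement reconciles the two values, so $d$ is not Riemann integrable; its Lebesgue integral, by contrast, is 1, since the rationals form a $\lambda$-null set.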

11. In this subsection we present a useful theorem on integration by parts for the Lebesgue–Stieltjes integral. Let two generalized distribution functions $F = F(x)$ and $G = G(x)$ be given on $(R, \mathcal{B}(R))$.

**Theorem 11.** The following formulas are valid for all real $a$ and $b$, $a < b$:

$$F(b)G(b) - F(a)G(a) = \int_a^b F(s-)\,dG(s) + \int_a^b G(s)\,dF(s), \qquad (62)$$

or equivalently

$$F(b)G(b) - F(a)G(a) = \int_a^b F(s-)\,dG(s) + \int_a^b G(s-)\,dF(s) + \sum_{a < s \le b} \Delta F(s) \cdot \Delta G(s), \qquad (63)$$

where $F(s-) = \lim_{t \uparrow s} F(t)$, $\Delta F(s) = F(s) - F(s-)$.

Remark 1. Formula (62) can be written symbolically in "differential" form

$$d(FG) = F_-\,dG + G\,dF. \qquad (64)$$


Remark 2. The conclusion of the theorem remains valid for functions $F$ and $G$ of bounded variation on $[a, b]$. (Every such function that is continuous on the right and has limits on the left can be represented as the difference of two monotone nondecreasing functions.)

Proof. We first recall that in accordance with Subsection 1 an integral $\int_a^b (\cdot)$ means $\int_{(a, b]} (\cdot)$. Then (see formula (2) in §3)

$$(F(b) - F(a))(G(b) - G(a)) = \int_a^b dF(s) \cdot \int_a^b dG(t).$$

Let $F \times G$ denote the direct product of the measures corresponding to $F$ and $G$. Then by Fubini's theorem

$$(F(b) - F(a))(G(b) - G(a)) = \int_{(a,b] \times (a,b]} d(F \times G)(s, t)$$
$$= \int_{(a,b] \times (a,b]} I_{\{s \ge t\}}(s, t)\,d(F \times G)(s, t) + \int_{(a,b] \times (a,b]} I_{\{s < t\}}(s, t)\,d(F \times G)(s, t)$$
$$= \int_a^b (G(s) - G(a))\,dF(s) + \int_a^b (F(t-) - F(a))\,dG(t)$$
$$= \int_a^b G(s)\,dF(s) + \int_a^b F(s-)\,dG(s) - G(a)(F(b) - F(a)) - F(a)(G(b) - G(a)), \qquad (65)$$

where $I_A$ is the indicator of the set $A$.

Formula (62) follows immediately from (65). In turn, (63) follows from (62) if we observe that

$$\int_a^b (G(s) - G(s-))\,dF(s) = \sum_{a < s \le b} \Delta G(s) \cdot \Delta F(s). \qquad (66)$$

**Corollary 1.** If $F(x)$ and $G(x)$ are distribution functions, then

$$F(x)G(x) = \int_{-\infty}^x F(s-)\,dG(s) + \int_{-\infty}^x G(s)\,dF(s). \qquad (67)$$

If also $F(x) = \int_{-\infty}^x f(s)\,ds$, then

$$F(x)G(x) = \int_{-\infty}^x F(s)\,dG(s) + \int_{-\infty}^x G(s) f(s)\,ds. \qquad (68)$$
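For purely discrete $F$ and $G$, each Lebesgue–Stieltjes integral in (63) is a finite sum over jump points, so Theorem 11 can be verified exactly; a sketch with hypothetical jump data:

```python
# Verify (63) for step functions F, G on (0, 3]: jump times -> sizes.
F_jumps = {1.0: 0.5, 2.0: 0.5}
G_jumps = {1.0: 0.2, 1.5: 0.3, 2.5: 0.5}

def cdf(jumps, t):
    return sum(h for s, h in jumps.items() if s <= t)

def cdf_minus(jumps, t):       # left limit F(t-)
    return sum(h for s, h in jumps.items() if s < t)

a, b = 0.0, 3.0
int_F_dG = sum(cdf_minus(F_jumps, s) * h
               for s, h in G_jumps.items() if a < s <= b)
int_G_dF = sum(cdf_minus(G_jumps, s) * h
               for s, h in F_jumps.items() if a < s <= b)
cross = sum(F_jumps.get(s, 0.0) * G_jumps.get(s, 0.0)
            for s in set(F_jumps) | set(G_jumps) if a < s <= b)

lhs = cdf(F_jumps, b) * cdf(G_jumps, b) - cdf(F_jumps, a) * cdf(G_jumps, a)
rhs = int_F_dG + int_G_dF + cross
assert abs(lhs - rhs) < 1e-12
```

The `cross` term is exactly the sum $\sum \Delta F(s)\,\Delta G(s)$ in (63); it is nonzero only at the common jump point $s = 1$ here.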


**Corollary 2.** Let $\xi$ be a random variable with distribution function $F(x)$ and $E|\xi|^n < \infty$. Then

$$\int_0^\infty x^n\,dF(x) = n \int_0^\infty x^{n-1}[1 - F(x)]\,dx, \qquad (69)$$

$$\int_{-\infty}^0 |x|^n\,dF(x) = -\int_0^\infty x^n\,dF(-x) = n \int_0^\infty x^{n-1} F(-x)\,dx, \qquad (70)$$

and

$$E|\xi|^n = \int_{-\infty}^\infty |x|^n\,dF(x) = n \int_0^\infty x^{n-1}[1 - F(x) + F(-x)]\,dx. \qquad (71)$$

To prove (69) we observe that

$$\int_0^b x^n\,dF(x) = -\int_0^b x^n\,d(1 - F(x)) = -b^n(1 - F(b)) + n \int_0^b x^{n-1}(1 - F(x))\,dx. \qquad (72)$$

Let us show that since $E|\xi|^n < \infty$,

$$b^n(1 - F(b) + F(-b)) \le b^n P(|\xi| \ge b) \to 0, \qquad b \to \infty. \qquad (73)$$

In fact,

$$E|\xi|^n = \sum_{k=1}^\infty \int_{k-1}^k |x|^n\,dF(x) < \infty$$

and therefore

$$\sum_{k \ge b+1} \int_{k-1}^k |x|^n\,dF(x) \to 0, \qquad b \to \infty.$$

But

$$\sum_{k \ge b+1} \int_{k-1}^k |x|^n\,dF(x) \ge b^n P(|\xi| \ge b),$$

which establishes (73). Taking the limit as $b \to \infty$ in (72), we obtain (69). Formula (70) is proved similarly, and (71) follows from (69) and (70).
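The tail formula (69) can be checked numerically; a sketch with the hypothetical choice $\xi \sim \mathrm{Exp}(1)$ and $n = 2$, where $1 - F(x) = e^{-x}$ and $E\xi^2 = \Gamma(3) = 2$:

```python
# E xi^2 = 2 * int_0^inf x (1 - F(x)) dx for xi ~ Exp(1), via (69).
import math

n, b, steps = 2, 40.0, 400_000
h = b / steps
# Midpoint rule for int_0^b x^(n-1) * (1 - F(x)) dx with 1-F(x) = e^-x;
# the truncated tail beyond b = 40 is negligible.
tail_integral = sum(((k + 0.5) * h) ** (n - 1) * math.exp(-(k + 0.5) * h)
                    for k in range(steps)) * h
moment = n * tail_integral        # should be close to E xi^2 = 2
```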

12. Let $A(t)$, $t \ge 0$, be a function of locally bounded variation (i.e., of bounded variation on each finite interval $[a, b]$), which is continuous on the right and has limits on the left. Consider the equation

$$Z_t = 1 + \int_0^t Z_{s-}\,dA(s), \qquad (74)$$


which can be written in differential form as

$$dZ = Z_-\,dA, \qquad Z_0 = 1. \qquad (75)$$

The formula that we have proved for integration by parts lets us solve (74) explicitly in the class of functions of locally bounded variation. We introduce the function

$$\mathscr{E}_t(A) = e^{A(t) - A(0)} \prod_{0 \le s \le t} (1 + \Delta A(s))\,e^{-\Delta A(s)}, \qquad (76)$$

where $\Delta A(s) = A(s) - A(s-)$ for $s > 0$, and $\Delta A(0) = 0$. The function $A(s)$, $0 \le s \le t$, has bounded variation and therefore has at most a countable number of discontinuities, and the series $\sum_{0 \le s \le t} |\Delta A(s)|$ converges. It follows that

$$\prod_{0 \le s \le t} (1 + \Delta A(s))\,e^{-\Delta A(s)}$$

is a function of locally bounded variation. If $A^c(t) = A(t) - \sum_{0 \le s \le t} \Delta A(s)$ is the continuous component of $A(t)$, we can rewrite (76) in the form

$$\mathscr{E}_t(A) = e^{A^c(t) - A^c(0)} \prod_{0 \le s \le t} (1 + \Delta A(s)). \qquad (77)$$

Let us write

$$F(t) = e^{A^c(t) - A^c(0)}, \qquad G(t) = \prod_{0 \le s \le t} (1 + \Delta A(s)).$$

Then by (62)

$$\mathscr{E}_t(A) = F(t)G(t) = 1 + \int_0^t F(s)\,dG(s) + \int_0^t G(s-)\,dF(s)$$
$$= 1 + \sum_{0 < s \le t} F(s) G(s-)\,\Delta A(s) + \int_0^t G(s-) F(s)\,dA^c(s)$$
$$= 1 + \int_0^t \mathscr{E}_{s-}(A)\,dA(s).$$

Therefore $\mathscr{E}_t(A)$, $t \ge 0$, is a (locally bounded) solution of (74).

Let us show that this is the only locally bounded solution. Suppose that there are two such solutions and let $Y = Y(t)$, $t \ge 0$, be their difference. Then

$$Y(t) = \int_0^t Y(s-)\,dA(s).$$

Put

$$T = \inf\{t \ge 0 : Y(t) \ne 0\},$$

where we take $T = \infty$ if $Y(t) = 0$ for $t \ge 0$.


Since $A(t)$ is a function of locally bounded variation, there are two generalized distribution functions $A_1(t)$ and $A_2(t)$ such that $A(t) = A_1(t) - A_2(t)$. If we suppose that $T < \infty$, we can find a finite $T' > T$ such that

$$[A_1(T') - A_1(T)] + [A_2(T') - A_2(T)] \le \tfrac12.$$

Then it follows from the equation

$$Y(t) = \int_T^t Y(s-)\,dA(s), \qquad t \ge T,$$

that

$$\sup_{t \le T'} |Y(t)| \le \tfrac12 \sup_{t \le T'} |Y(t)|,$$

and since $\sup_{t \le T'} |Y(t)| < \infty$, we have $Y(t) = 0$ for $T < t \le T'$, contradicting the assumption that $T < \infty$.

Thus we have proved the following theorem.

**Theorem 12.** There is a unique locally bounded solution of (74), and it is given by (76).
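For a pure-jump $A$ (hypothetical jump data below), the recursion (74) reduces to $Z_{t_k} = Z_{t_{k-1}}(1 + \Delta A(t_k))$ and the closed form (77) becomes a plain product, since $A^c \equiv 0$; a sketch:

```python
# Step-by-step solution of (74) versus the closed form (77) when A
# consists only of jumps.
jumps = [(0.5, 0.3), (1.0, -0.2), (2.0, 0.1)]   # (time, jump size dA)

Z = 1.0
for _, dA in jumps:
    Z = Z + Z * dA           # Z_{t_k} = Z_{t_{k-1}} + Z_{t_k-} * dA(t_k)

closed_form = 1.0
for _, dA in jumps:
    closed_form *= (1.0 + dA)   # E_t(A) = prod (1 + dA(s)) when A^c = 0

assert abs(Z - closed_form) < 1e-12
# Here Z = 1.3 * 0.8 * 1.1 = 1.144
```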

13. PROBLEMS

1. Establish the representation (6).

2. Prove the following extension of Property E. Let $\xi$ and $\eta$ be random variables for which $E\xi$ and $E\eta$ are defined and the sum $E\xi + E\eta$ is meaningful (does not have the form $\infty - \infty$ or $-\infty + \infty$). Then

$$E(\xi + \eta) = E\xi + E\eta.$$

3. Generalize Property G by showing that if $\xi = \eta$ (a.s.) and $E\xi$ exists, then $E\eta$ exists and $E\xi = E\eta$.

4. Let $\xi$ be an extended random variable, $\mu$ a $\sigma$-finite measure, and $\int_\Omega |\xi|\,d\mu < \infty$. Show that $|\xi| < \infty$ ($\mu$-a.s.) (cf. Property J).

5. Let $\mu$ be a $\sigma$-finite measure, $\xi$ and $\eta$ extended random variables for which $E\xi$ and $E\eta$ are defined. If $\int_A \xi\,d\mu \le \int_A \eta\,d\mu$ for all $A \in \mathcal{F}$, then $\xi \le \eta$ ($\mu$-a.s.). (Cf. Property I.)

6. Let $\xi$ and $\eta$ be independent nonnegative random variables. Show that $E\xi\eta = E\xi \cdot E\eta$.

7. Using Fatou's lemma, show that

$$P(\varliminf A_n) \le \varliminf P(A_n), \qquad \varlimsup P(A_n) \le P(\varlimsup A_n).$$

8. Find an example to show that in general it is impossible to weaken the hypothesis "$|\xi_n| \le \eta$, $E\eta < \infty$" in the dominated convergence theorem.

9. Find an example to show that in general the hypothesis "$\xi_n \le \eta$, $E\eta > -\infty$" in Fatou's lemma cannot be omitted.

10. Prove the following variants of Fatou's lemma. Let the family $\{\xi_n^+\}_{n \ge 1}$ of random variables be uniformly integrable and let $E \varlimsup \xi_n$ exist. Then

$$\varlimsup E\xi_n \le E \varlimsup \xi_n.$$

Let $\xi_n \le \eta_n$, $n \ge 1$, where the family $\{\xi_n^+\}_{n \ge 1}$ is uniformly integrable and $\eta_n$ converges a.s. (or only in probability — see §10 below) to a random variable $\eta$. Then $\varlimsup E\xi_n \le E \varlimsup \xi_n$.

11. Dirichlet's function

$$d(x) = \begin{cases} 1, & x \text{ irrational}, \\ 0, & x \text{ rational}, \end{cases}$$

is defined on $[0, 1]$, Lebesgue integrable, but not Riemann integrable. Why?

12. Find an example of a sequence of Riemann integrable functions $\{f_n\}_{n \ge 1}$, defined on $[0, 1]$, such that $|f_n| \le 1$, $f_n \to f$ almost everywhere (with Lebesgue measure), but $f$ is not Riemann integrable.

13. Let $(a_{ij};\ i, j \ge 1)$ be a sequence of real numbers such that $\sum_{i,j} |a_{ij}| < \infty$. Deduce from Fubini's theorem that

$$\sum_i \Bigl(\sum_j a_{ij}\Bigr) = \sum_j \Bigl(\sum_i a_{ij}\Bigr). \qquad (78)$$

14. Find an example of a sequence $(a_{ij};\ i, j \ge 1)$ for which $\sum_{i,j} |a_{ij}| = \infty$ and the equation in (78) does not hold.

15. Starting from simple functions and using the theorem on taking limits under the Lebesgue integral sign, prove the following result on integration by substitution. Let $h = h(y)$ be a nondecreasing continuously differentiable function on $[a, b]$, and let $f(x)$ be (Lebesgue) integrable on $[h(a), h(b)]$. Then the function $f(h(y)) h'(y)$ is integrable on $[a, b]$ and

$$\int_{h(a)}^{h(b)} f(x)\,dx = \int_a^b f(h(y)) h'(y)\,dy.$$

16. Prove formula (70).

17. Let $\xi, \xi_1, \xi_2, \ldots$ be nonnegative integrable random variables such that $E\xi_n \to E\xi$ and $P(\xi - \xi_n > \varepsilon) \to 0$ for every $\varepsilon > 0$. Show that then $E|\xi_n - \xi| \to 0$, $n \to \infty$.

18. Let $\xi, \eta, \zeta$ and $\xi_n, \eta_n, \zeta_n$, $n \ge 1$, be random variables such that

$$\xi_n \xrightarrow{P} \xi, \qquad \eta_n \xrightarrow{P} \eta, \qquad \zeta_n \xrightarrow{P} \zeta, \qquad \eta_n \le \xi_n \le \zeta_n,$$

$E\eta_n \to E\eta$, $E\zeta_n \to E\zeta$, and the expectations $E\xi, E\eta, E\zeta$ are finite. Show that then $E\xi_n \to E\xi$ (Pratt's lemma). If also $\eta_n \le 0 \le \zeta_n$, then $E|\xi_n - \xi| \to 0$. Deduce that if $\xi_n \xrightarrow{P} \xi$, $E|\xi_n| \to E|\xi|$ and $E|\xi| < \infty$, then $E|\xi_n - \xi| \to 0$.


§7. Conditional Probabilities and Conditional Expectations with Respect to a σ-Algebra

1. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $A \in \mathcal{F}$ be an event such that $P(A) > 0$. As for finite probability spaces, the conditional probability of $B$ with respect to $A$ (denoted by $P(B|A)$) means $P(BA)/P(A)$, and the conditional probability of $B$ with respect to the finite or countable decomposition $\mathcal{D} = \{D_1, D_2, \ldots\}$ with $P(D_i) > 0$, $i \ge 1$ (denoted by $P(B|\mathcal{D})$), is the random variable equal to $P(B|D_i)$ for $\omega \in D_i$, $i \ge 1$:

$$P(B|\mathcal{D}) = \sum_{i \ge 1} P(B|D_i)\,I_{D_i}(\omega).$$

In a similar way, if $\xi$ is a random variable for which $E\xi$ is defined, the conditional expectation of $\xi$ with respect to the event $A$ with $P(A) > 0$ (denoted by $E(\xi|A)$) is $E(\xi I_A)/P(A)$ (cf. (I.8.10)).

The random variable $P(B|\mathcal{D})$ is evidently measurable with respect to the σ-algebra $\mathcal{G} = \sigma(\mathcal{D})$, and is consequently also denoted by $P(B|\mathcal{G})$ (see §8 of Chapter I).

However, in probability theory we may have to consider conditional probabilities with respect to events whose probabilities are zero. Consider, for example, the following experiment. Let $\xi$ be a random variable that is uniformly distributed on $[0, 1]$. If $\xi = x$, toss a coin for which the probability of head is $x$, and the probability of tail is $1 - x$. Let $\nu$ be the number of heads in $n$ independent tosses of this coin. What is the "conditional probability $P(\nu = k \,|\, \xi = x)$"? Since $P(\xi = x) = 0$, the conditional probability $P(\nu = k \,|\, \xi = x)$ is undefined, although it is intuitively plausible that "it ought to be $C_n^k x^k (1 - x)^{n-k}$."

Let us now give a general definition of conditional expectation (and, in particular, of conditional probability) with respect to a σ-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$, and compare it with the definition given in §8 of Chapter I for finite probability spaces.

2. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $\mathcal{G}$ a σ-algebra, $\mathcal{G} \subseteq \mathcal{F}$ ($\mathcal{G}$ is a σ-subalgebra of $\mathcal{F}$), and $\xi = \xi(\omega)$ a random variable. Recall that, according to §6, the expectation $E\xi$ was defined in two stages: first for a nonnegative random variable $\xi$, then in the general case by

$$E\xi = E\xi^+ - E\xi^-,$$

and only under the assumption that

$$\min(E\xi^+, E\xi^-) < \infty.$$

A similar two-stage construction is also used to define conditional expectations $E(\xi|\mathcal{G})$.
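The coin experiment described in Subsection 1 can be probed by simulation: conditioning on $\xi$ falling in a small bin around $x$ stands in for the impossible event $\{\xi = x\}$, and the empirical frequency of $\{\nu = k\}$ is then close to the binomial guess. A sketch (bin width, sample size and the values of $n$, $k$, $x$ are arbitrary choices):

```python
# xi uniform on [0,1]; given xi = x, toss n coins with head prob. x.
import math
import random

random.seed(1)
n, k, x0, eps, trials = 5, 2, 0.3, 0.02, 400_000
hits = total = 0
for _ in range(trials):
    x = random.random()
    if abs(x - x0) < eps:          # condition on xi being near x0
        total += 1
        nu = sum(random.random() < x for _ in range(n))
        hits += (nu == k)

empirical = hits / total
binom = math.comb(n, k) * x0 ** k * (1 - x0) ** (n - k)   # = 0.3087
```

The conditional-expectation machinery below makes this intuition rigorous without any limiting bins.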


**Definition 1.** (1) The conditional expectation of a nonnegative random variable $\xi$ with respect to the σ-algebra $\mathcal{G}$ is a nonnegative extended random variable, denoted by $E(\xi|\mathcal{G})$ or $E(\xi|\mathcal{G})(\omega)$, such that

(a) $E(\xi|\mathcal{G})$ is $\mathcal{G}$-measurable;
(b) for every $A \in \mathcal{G}$

$$\int_A \xi\,dP = \int_A E(\xi|\mathcal{G})\,dP. \qquad (1)$$

(2) The conditional expectation $E(\xi|\mathcal{G})$, or $E(\xi|\mathcal{G})(\omega)$, of any random variable $\xi$ with respect to the σ-algebra $\mathcal{G}$, is considered to be defined if

$$\min(E(\xi^+|\mathcal{G}),\ E(\xi^-|\mathcal{G})) < \infty, \qquad P\text{-a.s.},$$

and it is given by the formula

$$E(\xi|\mathcal{G}) = E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G}),$$

where, on the set (of probability zero) of sample points for which $E(\xi^+|\mathcal{G}) = E(\xi^-|\mathcal{G}) = \infty$, the difference $E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G})$ is given an arbitrary value, for example zero.

We begin by showing that, for nonnegative random variables, $E(\xi|\mathcal{G})$ actually exists. By (6.36) the set function

$$Q(A) = \int_A \xi\,dP, \qquad A \in \mathcal{G}, \qquad (2)$$

is a measure on $(\Omega, \mathcal{G})$, and is absolutely continuous with respect to $P$ (considered on $(\Omega, \mathcal{G})$, $\mathcal{G} \subseteq \mathcal{F}$). Therefore (by the Radon–Nikodym theorem) there is a nonnegative $\mathcal{G}$-measurable extended random variable $E(\xi|\mathcal{G})$ such that

$$Q(A) = \int_A E(\xi|\mathcal{G})\,dP. \qquad (3)$$

Then (1) follows from (2) and (3).

Remark 1. In accordance with the Radon–Nikodym theorem, the conditional expectation $E(\xi|\mathcal{G})$ is defined only up to sets of $P$-measure zero. In other words, $E(\xi|\mathcal{G})$ can be taken to be any $\mathcal{G}$-measurable function $f(\omega)$ for which $Q(A) = \int_A f(\omega)\,dP$, $A \in \mathcal{G}$ (a "variant" of the conditional expectation).

Let us observe that, in accordance with the remark on the Radon–Nikodym theorem,

$$E(\xi|\mathcal{G}) = \frac{dQ}{dP}(\omega), \qquad (4)$$

i.e. the conditional expectation is just the Radon–Nikodym derivative of the measure $Q$ with respect to $P$ (considered on $(\Omega, \mathcal{G})$).

Remark 2. In connection with (1), we observe that we cannot in general put $E(\xi|\mathcal{G}) = \xi$, since $\xi$ is not necessarily $\mathcal{G}$-measurable.

Remark 3. Suppose that $\xi$ is a random variable for which $E\xi$ does not exist. Then $E(\xi|\mathcal{G})$ may be definable as a $\mathcal{G}$-measurable function for which (1) holds. This is usually just what happens. Our definition $E(\xi|\mathcal{G}) = E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G})$ has the advantage that for the trivial σ-algebra $\mathcal{G} = \{\varnothing, \Omega\}$ it reduces to the definition of $E\xi$ but does not presuppose the existence of $E\xi$. (For example, if $\xi$ is a random variable with $E\xi^+ = \infty$, $E\xi^- = \infty$, and $\mathcal{G} = \mathcal{F}$, then $E\xi$ is not defined, but in terms of Definition 1, $E(\xi|\mathcal{G})$ exists and is simply $\xi = \xi^+ - \xi^-$.)

Remark 4. Let the random variable $\xi$ have a conditional expectation $E(\xi|\mathcal{G})$ with respect to the σ-algebra $\mathcal{G}$. The conditional variance (denoted by $V(\xi|\mathcal{G})$ or $V(\xi|\mathcal{G})(\omega)$) of $\xi$ is the random variable

$$V(\xi|\mathcal{G}) = E[(\xi - E(\xi|\mathcal{G}))^2 \,|\, \mathcal{G}].$$

(Cf. the definition of the conditional variance $V(\xi|\mathcal{D})$ of $\xi$ with respect to a decomposition $\mathcal{D}$, as given in Problem 2, §8, Chapter I.)

**Definition 2.** Let $B \in \mathcal{F}$. The conditional expectation $E(I_B|\mathcal{G})$ is denoted by $P(B|\mathcal{G})$, or $P(B|\mathcal{G})(\omega)$, and is called the conditional probability of the event $B$ with respect to the σ-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$.

It follows from Definitions 1 and 2 that, for a given $B \in \mathcal{F}$, $P(B|\mathcal{G})$ is a random variable such that

(a) $P(B|\mathcal{G})$ is $\mathcal{G}$-measurable,
(b) for every $A \in \mathcal{G}$

$$P(A \cap B) = \int_A P(B|\mathcal{G})\,dP. \qquad (5)$$

**Definition 3.** Let $\xi$ be a random variable and $\mathcal{G}_\eta$ the σ-algebra generated by a random element $\eta$. Then $E(\xi|\mathcal{G}_\eta)$, if defined, means $E(\xi|\eta)$ or $E(\xi|\eta)(\omega)$, and is called the conditional expectation of $\xi$ with respect to $\eta$.

The conditional probability $P(B|\mathcal{G}_\eta)$ is denoted by $P(B|\eta)$ or $P(B|\eta)(\omega)$, and is called the conditional probability of $B$ with respect to $\eta$.

3. Let us show that the definition of $E(\xi|\mathcal{G})$ given here agrees with the definition of conditional expectation in §8 of Chapter I.


Let $\mathcal{D} = \{D_1, D_2, \ldots\}$ be a finite or countable decomposition with atoms $D_i$ with respect to the probability $P$ (i.e. $P(D_i) > 0$, and if $A \subseteq D_i$, then either $P(A) = 0$ or $P(D_i \setminus A) = 0$).

**Theorem 1.** If $\mathcal{G} = \sigma(\mathcal{D})$ and $\xi$ is a random variable for which $E\xi$ is defined, then

$$E(\xi|\mathcal{G}) = E(\xi|D_i) \qquad (P\text{-a.s. on } D_i),$$

or equivalently

$$E(\xi|\mathcal{G}) = \frac{E(\xi I_{D_i})}{P(D_i)} \qquad (P\text{-a.s. on } D_i). \qquad (6)$$

(The notation "$\xi = \eta$ ($P$-a.s. on $A$)," or "$\xi = \eta$ ($A$; $P$-a.s.)" means that $P(A \cap \{\xi \ne \eta\}) = 0$.)

Proof. According to Lemma 3 of §4, $E(\xi|\mathcal{G}) = K_i$ on $D_i$, where $K_i$ are constants. But

$$\int_{D_i} \xi\,dP = \int_{D_i} E(\xi|\mathcal{G})\,dP = K_i\,P(D_i),$$

whence

$$K_i = \frac{1}{P(D_i)} \int_{D_i} \xi\,dP = \frac{E(\xi I_{D_i})}{P(D_i)} = E(\xi|D_i).$$

This completes the proof of the theorem.

Consequently the concept of the conditional expectation $E(\xi|\mathcal{D})$ with respect to a finite decomposition $\mathcal{D} = \{D_1, \ldots, D_n\}$, as introduced in Chapter I, is a special case of the concept of conditional expectation with respect to the σ-algebra $\mathcal{G} = \sigma(\mathcal{D})$.

4. Properties of conditional expectations. (We shall suppose that the expectations are defined for all the random variables that we consider and that $\mathcal{G} \subseteq \mathcal{F}$.)

A*. If $C$ is a constant and $\xi = C$ (a.s.), then $E(\xi|\mathcal{G}) = C$ (a.s.).

B*. If $\xi \le \eta$ (a.s.), then $E(\xi|\mathcal{G}) \le E(\eta|\mathcal{G})$ (a.s.).

C*. $|E(\xi|\mathcal{G})| \le E(|\xi|\,|\,\mathcal{G})$ (a.s.).

D*. If $a$, $b$ are constants and $aE\xi + bE\eta$ is defined, then

$$E(a\xi + b\eta \,|\, \mathcal{G}) = aE(\xi|\mathcal{G}) + bE(\eta|\mathcal{G}) \qquad \text{(a.s.)}.$$

E*. Let $\mathcal{F}^* = \{\varnothing, \Omega\}$ be the trivial σ-algebra. Then

$$E(\xi|\mathcal{F}^*) = E\xi \qquad \text{(a.s.)}.$$

F*. $E(\xi|\mathcal{F}) = \xi$ (a.s.).

G*. $E(E(\xi|\mathcal{G})) = E\xi$.

H*. If $\mathcal{G}_1 \subseteq \mathcal{G}_2$, then

$$E[E(\xi|\mathcal{G}_2)\,|\,\mathcal{G}_1] = E(\xi|\mathcal{G}_1) \qquad \text{(a.s.)}.$$

I*. If $\mathcal{G}_1 \supseteq \mathcal{G}_2$, then

$$E[E(\xi|\mathcal{G}_2)\,|\,\mathcal{G}_1] = E(\xi|\mathcal{G}_2) \qquad \text{(a.s.)}.$$

J*. Let a random variable $\xi$ for which $E\xi$ is defined be independent of the σ-algebra $\mathcal{G}$ (i.e., independent of $I_B$, $B \in \mathcal{G}$). Then

$$E(\xi|\mathcal{G}) = E\xi \qquad \text{(a.s.)}.$$

K*. Let $\eta$ be a $\mathcal{G}$-measurable random variable, $E|\eta| < \infty$ and $E|\xi\eta| < \infty$. Then

$$E(\xi\eta|\mathcal{G}) = \eta\,E(\xi|\mathcal{G}) \qquad \text{(a.s.)}.$$
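On a finite probability space both formula (6) and the tower property H* can be verified directly; a sketch with hypothetical numbers, where $\mathcal{G}_2 = \sigma(\text{fine partition})$ refines $\mathcal{G}_1 = \sigma(\text{coarse partition})$:

```python
# E(xi | sigma(partition)) computed atom by atom via (6), then H*
# checked: smoothing E(xi|G2) by G1 gives E(xi|G1).
P  = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}
xi = {'a': 4.0, 'b': 1.0, 'c': 3.0, 'd': 2.0}
fine   = [{'a'}, {'b'}, {'c', 'd'}]      # generates G2
coarse = [{'a', 'b'}, {'c', 'd'}]        # generates G1, with G1 in G2

def cond_exp(f, partition, w):
    """E(f | sigma(partition)) at the point w, via formula (6)."""
    D = next(D for D in partition if w in D)
    return sum(f[v] * P[v] for v in D) / sum(P[v] for v in D)

# Defining relation (1): integrating E(xi|G2) over an atom of G2
# recovers the integral of xi itself.
for D in fine:
    lhs = sum(cond_exp(xi, fine, w) * P[w] for w in D)
    rhs = sum(xi[w] * P[w] for w in D)
    assert abs(lhs - rhs) < 1e-12

# Tower property H*: E(E(xi|G2) | G1) = E(xi|G1) at every point.
inner = {w: cond_exp(xi, fine, w) for w in P}
for w in P:
    assert abs(cond_exp(inner, coarse, w) - cond_exp(xi, coarse, w)) < 1e-12
```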

Let us establish these properties.

A*. A constant function is measurable with respect to $\mathcal{G}$. Therefore we need only verify that

$$\int_A \xi\,dP = \int_A C\,dP, \qquad A \in \mathcal{G}.$$

But, by the hypothesis $\xi = C$ (a.s.) and Property G of §6, this equation is obviously satisfied.

B*. If $\xi \le \eta$ (a.s.), then by Property B of §6

$$\int_A \xi\,dP \le \int_A \eta\,dP, \qquad A \in \mathcal{G},$$

and therefore

$$\int_A E(\xi|\mathcal{G})\,dP \le \int_A E(\eta|\mathcal{G})\,dP, \qquad A \in \mathcal{G}.$$

The required inequality now follows from Property I (§6).

C*. This follows from the preceding property if we observe that $-|\xi| \le \xi \le |\xi|$.

D*. If $A \in \mathcal{G}$, then by Problem 2 of §6,

$$\int_A (a\xi + b\eta)\,dP = a\int_A \xi\,dP + b\int_A \eta\,dP = a\int_A E(\xi|\mathcal{G})\,dP + b\int_A E(\eta|\mathcal{G})\,dP = \int_A [aE(\xi|\mathcal{G}) + bE(\eta|\mathcal{G})]\,dP,$$

which establishes D*.

§7. Conditional Probabilities and Expectations with Respect to a σ-Algebra

E*. This property follows from the remark that Eξ is an ℱ*-measurable function and the evident fact that if A = ∅ or A = Ω, then

∫_A ξ dP = ∫_A Eξ dP.

F*. Since ξ is ℱ-measurable and

∫_A ξ dP = ∫_A ξ dP,  A ∈ ℱ,

we have E(ξ|ℱ) = ξ (a.s.).

G*. This follows from E* and H* by taking 𝒢_1 = {∅, Ω} and 𝒢_2 = 𝒢.

H*. Let A ∈ 𝒢_1; then

∫_A E(ξ|𝒢_1) dP = ∫_A ξ dP.

Since 𝒢_1 ⊆ 𝒢_2, we have A ∈ 𝒢_2 and therefore

∫_A ξ dP = ∫_A E(ξ|𝒢_2) dP = ∫_A E[E(ξ|𝒢_2)|𝒢_1] dP.

Consequently, when A ∈ 𝒢_1,

∫_A E(ξ|𝒢_1) dP = ∫_A E[E(ξ|𝒢_2)|𝒢_1] dP,

and by Property I (§6) and Problem 5 (§6)

E(ξ|𝒢_1) = E[E(ξ|𝒢_2)|𝒢_1]  (a.s.).

I*. If A ∈ 𝒢_1, then by the definition of E[E(ξ|𝒢_2)|𝒢_1]

∫_A E[E(ξ|𝒢_2)|𝒢_1] dP = ∫_A E(ξ|𝒢_2) dP.

The function E(ξ|𝒢_2) is 𝒢_2-measurable and, since 𝒢_2 ⊆ 𝒢_1, also 𝒢_1-measurable. It follows that E(ξ|𝒢_2) is a variant of the expectation E[E(ξ|𝒢_2)|𝒢_1], which proves Property I*.

J*. Since Eξ is a 𝒢-measurable function, we have only to verify that

∫_B ξ dP = ∫_B Eξ dP,  B ∈ 𝒢,

i.e. that E[ξ · I_B] = Eξ · E I_B. If E|ξ| < ∞, this follows immediately from Theorem 6 of §6. The general case can be reduced to this by applying Problem 6 of §6.

The proof of Property K* will be given a little later; it depends on conclusion (a) of the following theorem.


Theorem 2 (On Taking Limits Under the Expectation Sign). Let {ξ_n}_{n≥1} be a sequence of extended random variables.

(a) If |ξ_n| ≤ η, Eη < ∞ and ξ_n → ξ (a.s.), then

E(ξ_n|𝒢) → E(ξ|𝒢)  (a.s.)

and

E(|ξ_n − ξ| |𝒢) → 0  (a.s.).

(b) If ξ_n ≥ η, Eη > −∞ and ξ_n ↑ ξ (a.s.), then

E(ξ_n|𝒢) ↑ E(ξ|𝒢)  (a.s.).

(c) If ξ_n ≤ η, Eη < ∞, and ξ_n ↓ ξ (a.s.), then

E(ξ_n|𝒢) ↓ E(ξ|𝒢)  (a.s.).

(d) If ξ_n ≥ η, Eη > −∞, then

E(lim inf ξ_n|𝒢) ≤ lim inf E(ξ_n|𝒢)  (a.s.).

(e) If ξ_n ≤ η, Eη < ∞, then

lim sup E(ξ_n|𝒢) ≤ E(lim sup ξ_n|𝒢)  (a.s.).

(f) If ξ_n ≥ 0, then

E(Σ_n ξ_n|𝒢) = Σ_n E(ξ_n|𝒢)  (a.s.).

Proof. (a) Let ζ_n = sup_{m≥n} |ξ_m − ξ|. Since ξ_n → ξ (a.s.), we have ζ_n ↓ 0 (a.s.). The expectations Eξ_n and Eξ are finite; therefore by Properties D* and C* (a.s.)

|E(ξ_n|𝒢) − E(ξ|𝒢)| = |E(ξ_n − ξ|𝒢)| ≤ E(|ξ_n − ξ| |𝒢) ≤ E(ζ_n|𝒢).

Since E(ζ_{n+1}|𝒢) ≤ E(ζ_n|𝒢) (a.s.), the limit h = lim_n E(ζ_n|𝒢) exists (a.s.). Then

0 ≤ ∫_Ω h dP ≤ ∫_Ω E(ζ_n|𝒢) dP = ∫_Ω ζ_n dP → 0,  n → ∞,

where the last statement follows from the dominated convergence theorem, since 0 ≤ ζ_n ≤ 2η, Eη < ∞. Consequently ∫_Ω h dP = 0 and then h = 0 (a.s.) by Property H.

(b) First let η = 0. Since E(ξ_n|𝒢) ≤ E(ξ_{n+1}|𝒢) (a.s.), the limit ζ(ω) = lim_n E(ξ_n|𝒢) exists (a.s.). Then by the equation

∫_A ξ_n dP = ∫_A E(ξ_n|𝒢) dP,  A ∈ 𝒢,

and the theorem on monotone convergence,

∫_A ξ dP = ∫_A ζ dP,  A ∈ 𝒢.

Consequently E(ξ|𝒢) = ζ (a.s.) by Property I and Problem 5 of §6.

§7. Conditional Probabilities and Expectations with Respect to a σ-Algebra

For the proof in the general case, we observe that 0 ≤ ξ_n⁺ ↑ ξ⁺, and by what has been proved,

E(ξ_n⁺|𝒢) ↑ E(ξ⁺|𝒢)  (a.s.).    (7)

But 0 ≤ ξ_n⁻ ≤ η⁻, Eη⁻ < ∞, and ξ_n⁻ → ξ⁻; therefore by (a)

E(ξ_n⁻|𝒢) → E(ξ⁻|𝒢),

which, with (7), proves (b).

Conclusion (c) follows from (b).

(d) Let ζ_n = inf_{m≥n} ξ_m; then ζ_n ↑ ζ, where ζ = lim inf ξ_n. According to (b), E(ζ_n|𝒢) ↑ E(ζ|𝒢) (a.s.). Therefore (a.s.)

E(lim inf ξ_n|𝒢) = E(ζ|𝒢) = lim_n E(ζ_n|𝒢) = lim inf E(ζ_n|𝒢) ≤ lim inf E(ξ_n|𝒢).

Conclusion (e) follows from (d).

(f) If ξ_n ≥ 0, by Property D* we have

E(Σ_{k=1}^n ξ_k|𝒢) = Σ_{k=1}^n E(ξ_k|𝒢)  (a.s.),

which, with (b), establishes the required result. This completes the proof of the theorem.

We can now establish Property K*. Let η = I_B, B ∈ 𝒢. Then, for every A ∈ 𝒢,

∫_A ξη dP = ∫_{A∩B} ξ dP = ∫_{A∩B} E(ξ|𝒢) dP = ∫_A I_B E(ξ|𝒢) dP = ∫_A ηE(ξ|𝒢) dP.

By the additivity of the Lebesgue integral, the equation

∫_A ξη dP = ∫_A ηE(ξ|𝒢) dP    (8)

remains valid for the simple random variables η = Σ_{k=1}^n y_k I_{B_k}, B_k ∈ 𝒢. Therefore, by Property I (§6), we have

E(ξη|𝒢) = ηE(ξ|𝒢)  (a.s.)    (9)

for these random variables. Now let η be any 𝒢-measurable random variable with E|η| < ∞, and let {η_n}_{n≥1} be a sequence of simple 𝒢-measurable random variables such that |η_n| ≤ |η| and η_n → η. Then by (9)

E(ξη_n|𝒢) = η_n E(ξ|𝒢)  (a.s.).

It is clear that |ξη_n| ≤ |ξη|, where E|ξη| < ∞. Therefore E(ξη_n|𝒢) → E(ξη|𝒢) (a.s.) by Property (a). In addition, since E|ξ| < ∞, we have E(ξ|𝒢) finite (a.s.) (see Property C* and Property J of §6). Therefore η_n E(ξ|𝒢) → ηE(ξ|𝒢) (a.s.). (The hypothesis that E(ξ|𝒢) is finite, almost surely, is essential, since, according to the footnote on p. 172, 0·∞ = 0, but if η_n = 1/n, η = 0, we have 1/n · ∞ ↛ 0·∞ = 0.)

218

II. Mathematical Foundations of Probability Theory

5. Here we consider the more detailed structure of conditional expectations E(el~,). which we also denote, as usual, by E(el'l). Since E(el'l) is a ~.,-measurable function, then by Theorem 3 of §4 (more precisely, by its obvious modification for extended random variables) there is a Borel function m = m(y) from R toR such that m('l(ro)) for all

OJ E

n.

= E(el'l)(ro)

(10)

we denote this function m(y) by E(

e

eI, = y) and call it the

conditional expectation of with respect to the event {, = y}' or the conditional expectation of under the condition that , = y.

e

Correspondingly we define (11)

Ae~.,.

Therefore by Theorem 7 of §6 (on change of variable under the Lebesgue integral sign)

f

m(17) dP =

J{ro:.,eB}

f m(y)P,(dy), JB

BeBI(R),

where P., is the probability distribution of 'I· Consequently m Borel function such that

f

(12)

=

m(y) is a

edP= fm(y)dP,.

J{ro: 11eB}

(13)

JB

for every B e BI(R). This remark shows that we can give a different definition of the conditional expectation E(el '1 = y).

e

Definition 4. Let and '1 be random variables (possible, extended) and let Ee be defined. The conditional expectation of the random variable under the condition that 'I = y is any BI(R)-measurable function m = m(y) for which

f

J{ro:.,eB}

e

edP = JBf m(y)P.,(dy),

BeBI(R).

(14)

That such a function exists follows again from the Radon-Nikodym theorem if we observe that the set function Q(B) =

f

J{ro: 11eB}

edP

is a signed measure absolutely continuous with respect to the measure P,.

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

219

Now suppose that m(y) is a conditional expectation in the sense of Definition 4. Then if we again apply the theorem on change of variable under the Lebesgue integral sign, we obtain

r

J{ro:~eB}

~ dP =

rm(y)P~(dy) = J{ro:~eB} r m(11)P~(dy),

JB

BE &H(R).

The function m(17) is ~~-measurable, and the sets {w: 11 E B}, BE &H(R), exhaust the subsets of ~~. Hence it follows that m(IJ) is the expectation E(~ IIJ). Consequently if we know E(~l11 = y) we can reconstruct E(~l17), and conversely from E(~l11) we can find E(~ I11 = y). From an intuitive point of view, the conditional expectation E(~ 1'1 = y) is simpler and more natural than E(~ I17). However, E(~ I17), considered as a ~~-measurable random variable, is more convenient to work with. Observe that Properties A*-K* above and the conclusions of Theorem 2 can easily be transferred to E(~l11 = y) (replacing "almost surely" by "P~-almost surely"). Thus, for example, Property K* transforms as follows: if E 1~1 < oo and E llf(IJ)I < oo, where f = f(y) is a &H(R) measurable function, then E(lf(11)111 = y) = f(y)E(~IIJ = y) In addition (cf. Property J*), E(~l11

if~

(P~-a.s.).

(15)

and 11 are independent, then

= y) = E~

(P~-a.s.).

We also observe that if BE &H(R 2 ) and ~and 11 are independent, then (16) and if cp = cp(x, y) is a .c?6'(R 2 )-measurable function such that E Icp(~, 1J) I < oo, then

To prove (16) we make the following observation. If B = B 1 x B 2 , the validity of (16) will follow from

r

J{ro:~eA}

IB,xB2(~. 11)P(dw) =

r

J(yeA)

EIB,xBi~. y)P~(dy).

But the left-hand side is P{~ E Bt. 11 E An B 2 }, and the right-hand side is P( ~ E B 1)P(17 E A n B 2); their equality follows from the independence of ~ and 11· In the general case the proof depends on an application of Theorem 1, §2, on monotone classes (cf. the corresponding part of the proof of Fubini's theorem).

Definition 5. The conditional probability of the event A E ~under the condition that 11 = y (notation: P(A I11 = y)) is E(J A I1J = y).

220

II. Mathematical Foundations of Probability Theory

It is clear that P(A 111 = y) can be defined as the .14(R)-measurable function such that P(A n {IJEB}) =

ft(AIIJ

=

y)P~(dy),

BE .14(R).

(17)

6. Let us calculate some examples of conditional probabilities and conditional expectations. EXAMPLE

Lk'=

1

P(IJ

1. Let 11 be a discrete random variable with P(IJ = Yk) > 0, = Yk) = 1. Then P(AI

11

=

Yk

{17 = yk}) P( _ ) , 11- Yk

) = P(A n

For y¢{y 1 ,y 2 , ... } the conditional probability P(AIIJ = y) can be defined in any way, for example as zero. If~ is a random variable for which E~ exists, then

When y ¢ {y 1, y 2 , .•• } the conditional expectation E( ~ 111 in any way (for example, as zero).

= y) can be defined

2. Let ( ~, 11) be a pair of random variables whose distribution has a density ~~~(x, y):

EXAMPLE

P{(~,IJ)EB} = f!~~(x,y)dxdy, Let Hx) and UY) be the densities of the probability distribution of~ and 11 (see (6.46), (6.55) and (6.56). Let us put

r ( I ) - h~(x, y) X y j,(y) '

(18)

J~l~

taking J~ 1 ~(x Iy) = 0 if f~(y) = 0. Then

P(~ E CIIJ = y) = J/~ 1 ~(xly) dx,

C E .14(R),

i.e. !~ 1 ~(x ly) is the density of a conditional probability distribution.

(19)

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

A

221

In fact, in order to prove (19) it is enough to verify (17) for BE PJ(R), {~ E C}. By (6.43), (6.45} and Fubini's theorem,

=

L[J/~~~(xly)

dx

L

JP~(dy) = [L.h 1 ~(xly) dx Jf~(y) dy =

J J~ 1 ~(xiy)~(y) J f~~(x,

dx dy

CxB

=

y) dx dy

CxB

=

P{(~,

IJ) E C x B}

=

P{(~ E

C) n (IJ E B)},

which proves (17). In a similar way we can show that if E~ exists, then

(20) EXAMPLE

3. Let the length of time that a piece of apparatus will continue to

operate be described by a nonnegative random variable 11 = IJ(w) whose distribution Fq(y) has a density j~(y) (naturally, F ~(y) = j~(y) = 0 for y < 0). Find the conditional expectation E(IJ - a 111 2 a), i.e. the average time for which the apparatus will continue to operate on the hypothesis that it has already been operating for time a. Let P(IJ 2 a) > 0. Then according to the definition (see Subsection 1) and (6.45), E(11 - a I'1 > a) -

=

E[(l'/ - a)I 1paJ P(l'/ 2 a)

=

Jn (1'/ -

a)I 1 ~>a)P(dw)

"-=--'-'---=--c-'----"'-'=-='___:_____:__

P(l'/ 2 a)

_ J:' (y -

- s:

a)f~(y) dy f~(y) dy

It is interesting to observe that if '1 is exponentially distributed, i.e.

A. -.0 f,(y) = { e ' y- ' ~ 0 y < 0,

(21)

then El] = E(IJIIJ 2 0) = 1/A. and E(IJ -aiiJ 2 a)= 1/A. for every a> 0. In other words, in this case the average time for which the apparatus continues to operate, assuming that it has already operated for time a, is independent of a and simply equals the average time El]. Under the assumption (21) we can find the conditional distribution P(l'/- a :S:: xl11 2 a).


We have

P(η − a ≤ x|η ≥ a) = P(a ≤ η ≤ a + x)/P(η ≥ a)
  = [F_η(a + x) − F_η(a) + P(η = a)]/[1 − F_η(a) + P(η = a)]
  = ([1 − e^{−λ(a+x)}] − [1 − e^{−λa}])/(1 − [1 − e^{−λa}])
  = e^{−λa}[1 − e^{−λx}]/e^{−λa}
  = 1 − e^{−λx}.

e

e

B = {(a, x): Ia I ~

1[

2'

x E [0, teas a] u [1 - tcos a, 1]},

then A = {w: (fJ, e) E B}, and therefore the probability in question is P(A)

= EIA(w) = Ela(fJ(w), e(w)).

Figure 29

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

223

By Property G* and formula (16), EJa(tl(w),

~(w)) =

E(E[Ja(tl(w),

~(w))ltl(w)])

= LE[I 8 (tl(w), ~(w))ltl(w)]P(dw) =

f

tt/2

=1

n

_,12 E[I8 (tl(w), ~(w))ltl(w) = tx]P8(da)

J"

12

-tt/2

Ela(a,

~(w))da = 1

n

J"

12

-n/2

cos ada=-, 2 n

where we have used the fact that

Ela(a, ~(w)) = P{~ E [0,

t cos a] u [1 - t cos a]}= cos a.

Thus the probability that a "random" toss of the needle intersects one of the lines is 2/n. This result could be used as the basis for an experimental evaluation of n. In fact, let the needle be tossed N times independently. Define ~ito be 1 if the needle intersects a line on the ith toss, and 0 otherwise. Then by the law of large numbers (see, for example, (1.5.6))

P{l~ 1 + ·; + ~N- P(A)I > e}-+ 0,

N-+

oo.

for every e > 0. In this sense the frequency satisfies ~1

+ ... + ~N N

~

P(A) = ~2

and therefore 2N

-----~1!:.

~1

+ ... +

~N

This formula has actually been used for a statistical evaluation of n. In 1850, R. Wolf (an astronomer in Zurich) threw a needle 5000 times and obtained the value 3.1596 for n. Apparently this problem was one of the first applications (now known as Monte Carlo methods) of probabilisticstatistical regularities to numerical analysis. 7. If { ~n}n> 1 is a sequence of nonnegative random variables, then according to conclusion (f) of Theorem 2,

In particular, if B 1 , B 2 , ••• is a sequence of pairwise disjoint sets, (22)

224

II. Mathematical Foundations of Probability Theory

It must be emphasized that this equation is satisfied only almost surely and that consequently the conditional probability P(B \~)(w) cannot be considered as a measure on B for given w. One might suppose that, except for a set ..;V of measure zero, P( ·I~) (w) would still be a measure for w E Y. However, in general this is not the case, for the following reason. Let .K(B 1 , B 2 , •• •) be the set of sample points w such that the countable additivity property (22) fails for these B 1, B 2 , ...• Then the excluded set ..;Vis

(23) where the union is taken over all B 1, B 2 , ••. in fJ'. Although the P-measure of each set .K(B 1 , B 2 , .• . ) is zero, the P-measure of ..;V can be different from zero (because of an uncountable union in (23)). (Recall that the Lebesgue measure of a single point is zero, but the measure of the set ..;V = [0, 1], which is an uncountable sum of the individual points {x}, is 1). However, it would be convenient if the conditional probability P(·\ ~)(w) were a measure for each wEn, since then, for example, the calculation of conditional probabilities E(e \~)could be carried out (see Theorem 3 below) in a simple way by averaging with respect to the measure P( ·I~)(w):

E(e\~) =

fa ~(w)P(dw\~)

(a.s.)

(cf. (1.8.10)). We introduce the following definition. Definition 6. A function P(w; B), defined for all wEn and BE ff, is a regular conditional probability with respect to ~ if

(a) P(w; ·)is a probability measure on$' for every wEn; (b) For each B E $' the function P( w; B), as a function of w, is a variant of the conditional probability P(B\~)(w), i.e. P(w: B)= P(B\~)(w) (a.s.). Theorem3. Let P(w; B) be a regular conditional probability with respect to ~ and let be an integrable random variable. Then

e

Ece\~)(w) = [ ~(w)P(w; dw)

Jn

PROOF.

If

e= I

B'

(a.s.).

(24)

BE ff, the required formula (24) becomes P(B\~)(w) =

P(w; B) (a.s.),

which holds by Definition 6(b). Consequently (24) holds for simple functions.

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

225

Now let ¢ ~ 0 and ¢n i ¢, where ¢n are simple functions. Then by (b) of Theorem 2 we have E(¢/~)(w) =limn E(¢nl~)(w) (a.s.). But since P(w; ·) is a measure for every wE Q, we have lim n

E(¢n/~)(w) =lim n

( ¢kiJ)P(w; dw) = ( ¢(w)P(w; dw)

Jn

Jn

by the monotone convergence theorem. The general case reduces to this one if we use the representation ¢ =

e+- c.

This completes the proof.

Corollary. Let ~ = ~~' where rt is a random variable, and let the pair (¢, rt) have a probability distribution with density f~lx, y). Let E /g(¢) / < oo. Then

where f~ 1 ~(x/y) is the density of the conditional distribution (see (18)).

In order to be able to state the basic result on the existence of regular conditional probabilities, we need the following definitions.

Definition 7. Let (E, r!) be a measurable space, X= X(w) a random element with values in E, and ~ a a-subalgebra of g;, A function Q(w; B), defined for wEn and BE S is a regular conditional distribution of X with respect to ~if

(a) for each wEn the function Q(w; B) is a probability measure on (E, rff); (b) for each B E C the function Q( w; B), as a function of w, is a variant of the conditional probability P(X E B/ ~)(w), i.e. Q(w; B)= P(X EB/~)(w)

(a.s.).

Definition 8. Let ¢ be a random variable. A function F = F(w; x), wE Q, XE

R, is a regular distribution function for ¢ with respect to ~

if :

(a) F(w; x) is, for each wE Q, a distribution function on R; (b) F(w; x) = P(¢ :.:::; x /~)(w)(a.s.), for each x E R.

Theorem 4. A regular distribution function and a regular conditional distribution function always exist for the random variable

ewith respect to ~.

226

II. Mathematical Foundations of Probability Theory

PROOF. For each rational number r E R, define F,(OJ) = P(~ :::; ri~)(OJ), where P(~:::; ri~)(OJ) = E(J 1 ~ 9li~)(OJ) is any variant of the conditional probability, with respect to~. of the event {~:::; r}. Let {r;} be the set of rational numbers in R. If r; < ri, Property B* implies that P(~ :::; r;i ~) :::; P(~:::; rii~) (a.s.), and therefore if Aii = {OJ: F,/OJ) < F,,(OJ)}, A= UAii• we have P(A) = 0. In other words, the set of points OJ at which the distribution function F,(OJ), r E {r;}, fails to be monotonic has measure zero. Now let

B; = {OJ: lim Fr,+( 1 /n)(OJ) ¥= F,,(OJ)l,

J

n-+ oo

00

B=

U B;.

i= 1

It is clear that /{~:Sr,+( 1 /n)}! /{~~ril• n-+ oo. Therefore, by (a) of Theorem 2, F,,+( 1 /nl(OJ)-+ F,,(OJ) (a.s.), and therefore the set Bon which continuity on the right fails (with respect to the rational numbers) also has measure zero, P(B) = 0.

In addition, let

c= Then, since P(C) = 0. Now put

{OJ: lim FnCOJ) ¥= 1} u {OJ: lim F nCOJ) > n-+ oo

n-+- oo

g :::; n} i Q,

n-+ oo, and {~ :::; n}!

F(W,. X )

0,

o}.

n-+ - oo, we have

= {lim F,(OJ), OJ¢ A u B u C, r!x G(x),

wE

Au B u C,

where G(x) is any distribution function on R; we show that F(OJ; x) satisfies the conditions of Definition 8. Let ¢Au B u C. Then it is clear that F(OJ; x) is a nondecreasing function ofx. Ifx < x' :::; r, then F(OJ; x) :::; F(OJ; x'):::; F(w; r) = F,(OJ)! F(w, x) when r! x. Consequently F(OJ; x) is continuous on the right. Similarly limx-. 00 F(OJ; x) = 1, limx-.-oo F(OJ; x) = 0. Since F(OJ; x) = G(x) when OJ E Au B u C, it follows that F(OJ; x) is a distribution function on R for every OJ E Q, i.e. condition (a) of Definition 8 is satisfied. By construction, P(~:::; r)I~)(OJ) = F,(OJ) = F(OJ; r). If r! x, we have F(OJ; r)! F(OJ; x) for all OJ E Q by the continuity on the right that we just established. But by conclusion (a) of Theorem 2, we have P(~ :::; ri ~)(OJ)-+ P(~:::; xi~)(OJ) (a.s.). Therefore F(OJ; x) = P(~:::; xiG)(OJ) (a.s.), which establishes condition (b) of Definition 8. We now turn to the proof of the existence of a regular conditional distribution of~ with respect to ~Let F(OJ; x) be the function constructed above. Put

w

Q(OJ; B) = LF(OJ; dx),

227

§7. Conditional Probabilities and Expectations with Respect to au-Algebra

where the integral is a Lebesgue-Stieltjes integral. From the properties of the integral (see §6, Subsection 7), it follows that Q(ro; B) is a measure on B for each given wEn. To establish that Q(ro; B) is a variant of the conditional probability P(~ E Bl~)(ro), we use the principle of appropriate sets. Let ~ be the collection of sets B in gj(R) for which Q(ro; B)= P(~EBI~)(ro) (a.s.). Since F(w;'x) = P(~ ~ xl~)(w) (a.s.), the system~ contains the sets B of the form B = (- oo, x], x E R. Therefore ~ also contains the intervals of the form (a, b], and the algebra d consisting of finite sums of disjoint sets of the form (a, b]. Then it follows from the continuity properties of Q(w; B) (w fixed) and from conclusion (b) of Theorem 2 that~ is a monotone class, and since d £ ~ £ gj(R), we have, from Theorem 1 of§2, gj(R) = a(d) £

a(~)= Jl(~)

=

~ £

gj(R),

whence ~ = gj(R). This completes the proof of the theorem. By using topological considerations we can extend the conclusion of Theorem 4 on the existence of a regular conditional distribution to random elements with values in what are known as Borel spaces. We need the following definition.

Definition 9. A measurable space (E, tf) is a Borel space if it is Borel equivalent to a Borel subset of the real line, i.e. there is a one-to-one mapping cp = cp(e): (E, tf)

--+

(R, gj(R)) such that

=

(1) cp(E) {cp(e): eeE} is a set in gj(R); (2) cp is tS'-measurable (cp- 1(A) E I, A E cp(E) n gj(R)), (3) cp- 1 is BI(R)/tS'-measurable (cp(B) E cp(E) n BI(R), BE Iff).

Theorem 5. Let X = X(w) be a random element with values in the Borel space (E, S). Then there is a regular conditional distribution of X with respect to ~Let cp = cp(e) be the function in Definition 9. By (2), cp(X(ro)) is a random variable. Hence, by Theorem 4, we can define the conditional distribution Q(ro; A) of cp(X(ro)) with respect to r§, A E cp(E) n BI(R). We introduce the function Q(w; B) = Q(w; cp(B)), BE tf. By (3) of Definition 9, qJ(B) E qJ(E) n BI(R) and consequently Q(w; B) is defined. Evidently Q(w; B) is a measure on BE S for every ro. Now fix BE G. By the one-to-one character of the mapping cp = cp(e), PROOF.

Q(w; B)= Q(w; cp(B)) = P{cp(X) E cp(B)I~} = P{X E Bl~}

(a.s.).

Therefore Q(w; B) is a regular conditional distribution of X with respect to~.

This completes the proof of the theorem.

228

II. Mathematical Foundations of Probability Theory

Corollary .. Let X = X( w) be a random element with values in a complete separable metric space (E, 6"). Then there is a regular conditional distribution of X with respect to t§. In particular, such a distribution exists for the spaces (R", ~(R")) and (R 00 , ~(R 00 )). The proof follows from Theorem 5 and the well known topological result that such spaces are Borel spaces.

8. The theory of conditional expectations developed above makes it possible

to give a generalization of Bayes's theorem; this has applications in statistics. with Recall that if f:» = {A 1 , . . . , An} is a partition of the space that P(A;) > 0, Bayes's theorem (1.3.9) states

n

P(A; IB)

=

n

P(A;)P(B IA;)

Li=l P(A)P(BIA) for every B with P(B) > 0. Therefore if e = 2:? = a ;I 1

. A,

(25) is a discrete random

variable then, according to (1.8.10),

E[g(O)IBJ

L?=: g(a;)P(A;)P(BIA;)' Li=l P(A)P(BIA)

(26)

J~oo g(a)P(BIO = a)P9(da) J~oo P(BIO = a)P 9(da)

(27)

=

or

E[g(O)IBJ

=

On the basis of the definition ofE[g(B)JB] given at the beginning of this section, it is easy to establish that (27) holds for all events B with P(B) > 0, random variables and functions g = g(a) with E Ig(O) I < oo. We now consider an analog of (27) for conditional expectations E[g( 8) It§] with respect to a a-algebra t§, t§ ~ ff. Let

e

Q(B) = Lg(O)P(dw),

BEt§.

(28)

Then by (4)

(29) We also consider the a-algebra t§ 9 . Then, by (5),

P(B) =

Lt
t§9)

dP

(30)

or, by the formula for change of variable in Lebesgue integrals, (31)

§7. Conditional Probabilities and Expectations with Respect to a (j-Algebra

229

Since

we have

Q(B)

=

f_'Xl}(a)P(BIO = a)P 0(da).

Now suppose that the conditional probability P(B I0 admits the representation P(BIO =a)= Lp(w; a)A(dw),

(32)

= a) is regular and (33)

where p = p(w; a) is nonnegative and measurable in the two variables jointly, and A is a a-finite measure on (Q, <;§). Let E lg(O)I < oo. Let us show that (P-a.s.) E[g(O)I<;§] = J~oo g(a)p(w; a)Po(da) J~ oo p(w; a)P 0(da)

(34)

(generalized Bayes theorem). In proving (34) we shall need the following lemma.

Lemma. Let (Q, F) be a measurable space. (a) Let 11- and A be a-finite measures, and f

=

f(w) an ff-measurablefunction.

Then

(35)

(in the sense that !feither integral exists, the other exists and they are equal).

(b) If v is a signed measure and fl, A are a-finite measures v ~ fl, 11- ~ A, then

dv dA

=

dv dfl dfl. dA.

(A.-a.s.)

(36)

(11--a.s.)

(37)

and dv dfl PROOF. (a) Since

I!( A)

=

=

dvldfl dA. dA.

L(~~)

dA.,

I/;

I A,. The general case (35) is evidently satisfied for simple functions f = monotone converthe and f+ f = f follows from the representation gence theorem.

230

II. Mathematical Foundations of Probability Theory

(b) From (a) with f = dv/d/1 we obtain

Then v

~

A. and therefore

J dA.dv dA.,

=

v(A)

A

whence (36) follows since A is arbitrary, by Property I (§6). Property (37) follows from (36) and the remark that

i

dJ.l = 0} = 11 { w:dA.

-dJ.l

{ro: d!lfd).

= 0) dA.

dA. = 0

(on the set {w: dJ1/dA. = 0} the right-hand side of (37) can be defined arbitrarily, for example as zero). This completes the proof of the lemma. To prove (34) we observe that by Fubini's theorem and (33), Q(B)

=

L[J: L[f_

P(B) =

00

(38)

g(a)p(w; a)P0(da)}(dw),

00 00

(39)

p(w; a)Po(da)}(dw).

Then by the lemma dQ dQjd). dP = dPjd).

(P-a.s.).

Taking account of (38), (39) and (29), we have (34).

Remark. Formula (34) remains valid if we replace() by a random element with values in some measurable space (E, C) (and replace integration over R by integration over E). Let us consider some special cases of (34). Let the a-algebra r§ be generated by the random variable Suppose that

P(~ E A I() = a) = {

q(x; a)A.(dx),

~' r§

A E fJIJ(R),

=

r§ ~.

(40)

where q = q(x; a) is a nonnegative function, measurable with respect to both variables jointly, and ). is a a-finite measure on (R, fJIJ(R)). Then we obtain

E[g(())l~ = x] = J:?oo g(a)q(x; a)Po(da)

J:?oo q(x; a)Po(da)

(41)

§7. Conditional Probabilities and Expectations with Respect to a u-Algebra

231

In particular, let((},~) be a pair of discrete random variables,(} = La;! A,, xiB j" Then, taking A. to be the counting measure (A.( {x;}) = 1, i = 1, 2, ...) we find from (40) that ~='I

(Compare (26).) Now let (0, ~) be a pair of absolutely continuous measures with density fo.~(a, x). Then by (19) the representation (40) applies with q(x; a)= !~ 16 (xla) and Lebesgue measure A.. Therefore

E[g(O) I~ = x] = s~ <Xl g(a)f~lfJ(x la)fo(a) da J~ oo J~ 1 o(x Ia)fo(a) da 9.

(43)

PROBLEMS

1. Let eand '1 be independent identically distributed random variables with Ee defined.

Show that

E(ele + '1) = E(IJie + '1) = -e+IJ 2- (a.s.).

ee

2. Let 1, 2 , ••• be independent identically distributed random variables with E1e;1 < oo. Show that

s.

E(e11 s., s.+ 1• ...) = - (a.s.), n where s.

=

e + ... + e•. 1

3. Suppose that the random elements (X, Y) are such that there is a regular distribution Px(B) = P(YeBJX = x). Show that ifE Jg(X, Y)l < oo then

E[g(X, Y)IX = x] = 4. Let

f

g(x, y)Px(dy)

(Px-a.s.).

ebe a random variable with distribution function Fix). Show that E<eJIX <

(assuming that F~(b)-

F~(a)

J!x dFix)

e~b)= F~(b)- Fia)

> 0).

5. Let g = g(x) be a convex Borel function with E Jg<e)J < oo. Show that Jensen's inequality

holds for the conditional expectations.

232

II. Mathematical Foundations of Probability Theory

e

6. Show that a necessary and sufficient condition for the random variable and the u-algebra f§ to be independent (i.e., the random variables.; and Ia(w) are independent for every Bet§) is that E(g(e)lf§) = Eg(e) for every Borel function g(x) with Elg<e)l < oo.

e JA edP, is u-finite.

be a nonnegative random variable and <§ a u-algebra, <§ s;; !F. Show that E(el<§) < oo (a.s.) if and only if the measure Q, defined on sets A e <§by Q(A) =

7. Let

§8. Random Variables. II 1. In the first chapter we introduced characteristics of simple random variables, such as the variance, covariance, and correlation coefficient. These extend similarly to the general case. Let (Q, !F, P) be a probability space and ~ = ~(w) a random variable for which E~ is defined. The variance of ~ is

fo

The number q = + is the standard deviation. If~ is a random variable with a Gaussian (normal) density fi( ) _ ~

x -

1 foq e

the parameters m and

q

-[(x-m)2]/2a2

,

q

> 0,

- oo < m < oo,

(1)

in (1) are very simple: m

=

E~,

Hence the probability distribution ofthis random variable~. which we call Gaussian, or normally distributed, is completely determined by its mean value m and variance q 2 , (It is often convenient to write~ "' .;V (m, q 2 ).) Now let(~, 17) be a pair of random variables. Their covariance is (2)

(assuming that the expectations are defined). lfcov(~, 17) = 0 we say that~ and '1 are uncorrelated. If V ~ > 0 and V17 > 0, the number (3)

is the correlation coefficient of ~ and '7· The properties of variance, covariance, and correlation coefficient were investigated in §4 of Chapter I for simple random variables. In the general case these properties can be stated in a completely analogous way.

233

§8. Random Variables. II

Let e = <el, ... ' en) be a random vector whose components have finite second moments. The covariance matrix of e is then x n matrix~= IIRiJII, where Ril = cov(e;, ei). It is clear that ~ is symmetric. Moreover, it is nonnegative definite, i.e. n

L1 Rw~·)·i ~ o

i,j=

for all A.i e R, i = 1, ... , n, since

The following lemma shows that the converse is also true.

Lemma. A necessary and sufficient condition that an n x n matrix

~ is the covariance matrix of a vector e = <el, ... ' en) is that the matrix is symmetric and nonnegative definite, or, equivalently, that there is an n x k matrix A (1 ~ k ~ n) such that

where T denotes the transpose. PROOF. We showed above that every covariance matrix is symmetric and nonnegative definite. Conversely, let~ be a matrix with these properties. We know from matrix theory that corresponding to every symmetric nonnegative definite matrix ~ there is an orthogonal matrix (!) (i.e., (!)(!)T = E, the unit matrix) such that (!)T~(!) =

where D =

(d0

1

••

D,

'

0)

dn

is a diagonal matrix with nonnegative elements di, i = 1, ... , n. It follows that ~ = (!)D(!)T = ((!)B)(BT(!)T),

where B is the diagonal matrix with elements bi = + .jd;, i = 1, ... , n. Consequently if we put A = (!)B we have the required representation ~ = AATfor ~. It is clear that every matrix AAT is symmetric and nonnegative definite. Consequently we have only to show that ~ is the covariance matrix of some random vector. Let tit• ti 2 , ••• , tin be a sequence of independent normally distributed random variables, ..¥(0, 1). (The existence of such a sequence follows, for example, from Corollary 1 of Theorem 1, §9, and in principle could easily

234

II. Mathematical Foundations of Probability Theory

be derived from Theorem 2 of §3.) Then the random vector~ = A17 (vectors are thought of as column vectors) has the required properties. In fact, E~~T

= E(A1'/)(A1'/)T =A. E,,T. AT= AEAT = AAT.

(If ( = li(iill is a matrix whose elements are random variables, E( means the matrix IIE~iill). This completes the proof of the lemma.

We now turn our attention to the two-dimensional Gaussian (normal) density

(4)

characterized by the five parameters m1 , m2 , a 1 , a 2 and p (cf. (3.14)), where Im1 I < oo, Im2 1 < oo, a 1 > 0, a 2 > 0, IpI < 1. An easy calculation identifies these parameters :

m1 =

E~,

m2 = E1],

ai

= V~,

a~ = V1],

p = p(~, 1]).

In§4ofChapter I we explained that if~ and 17 areuncorrelated(p(~, 17) = 0), it does not follow that they are independent. However, if the pair (~, 17) is Gaussian, it does follow that if ~ and 17 are uncorrelated then they are independent. In fact, if p = 0 in (4), then

But by (6.55) and (4),

frf._x) =

J

oo

~~~(x,

-oo

y) dy =

1

foal

e-[(x-md21J2at,

Consequently ~~~(x,

y) = f~(x) · fiy),

from which it follows that~ and 17 are independent (see the end of Subsection 8 of §6).

235

§8. Random Variables. II

2. A striking example of the utility of the concept of conditional expectation (introduced in §7) is its application to the solution of the following problem which is connected with estimation theory (cf. Subsection 8 of §4 of Chapter 1). Let(~, '1) be a pair of random variables such that~ is observable but '1 is not. We ask how the unobservable component '1 can be "estimated" from the knowledge of observations of~To state the problem more precisely, we need to define the concept of an estimator. Let qJ = qJ(x) be a Borel function. We call the random variable qJ(~) an estimator of11 in terms of~. and E['1 - ({J(~)] 2 the (mean square) error of this estimator. An estimator qJ*(~) is called optimal (in the mean-square sense) if

L\

=E['1- qJ*(~)Y = infE['1- ({J(~)Y, qJ

(5)

where inf is taken over all Borel functions qJ = qJ(x). Theorem 1. Let E11 2 < oo. Then there is an optimal estimator qJ* = qJ*(~) and qJ*(x) can be taken to be thejimction

qJ*(x) = E('11 ~ = x).

(6)

PRooF. Without loss of generality we may consider only estimators tp(~) for which EqJ 2 (~) < oo. Then if ({J(~) is such an estimator, and qJ*(~) = E('11 ~), we have E['1 - ({J(~)] 2 = E[('1 - qJ*(~)) + (({J*(~) _ ({J(~))]2 = E['7 - qJ*(~)]2 + E[({J*(~) - ({J(~)]2 + 2E[('1 - qJ*(~))(({J*(~) - ({J(~))] ~ E[17 - qJ*(~)] 2 , since E[qJ*(~) - qJ(~)] 2 ~ 0 and, by the properties of conditional expectations, E[('1-

qJ*(~))(({J*(~)- ({J(~))]

= E{E[('1-

qJ*(~))(({J*(~)- ({J(~)])I~]}

= E{(({J*(~)- ({J(~))E('1- qJ*(~)I~)} = 0.

This completes the proof of the theorem. Remark. It is clear from the proof that the conclusion of the theorem is still valid when ~ is not merely a random variable but any random element with values in a measurable space (E, tf). We would then assume that (/) = qJ(x) is an tS'/81(R)-measurable function. Let us consider the form of qJ*(x) on the hypothesis that(~, '1) is a Gaussian pair with density given by (4).

236

II. Mathematical Foundations of Probability Theory

From (1), (4) and (7.10) we find that the density J, 1 ~(yJx) of the conditional probability distribution is given by

f~ I~{y IX)

=

1 J2rc(1 - p2 )u 2

e[(y-

m(x))2]/[2a~( 1 - p2)]'

(7)

where

(8) Then by the Corollary of Theorem 3, §7, (9)

and V('71~

=

x)

=E[('7- E('71~ = x)) 1~ = x] 2

=

J_'XJoo (y-

= u~{1 -

m(x)) 2f~ 1 ~(yJx) dy

p 2 ).

Notice that the conditional variance V('71 ~ therefore

(10)

=

x) is independent of x and

(11) Formulas (9) and (11) were obtained under the assumption that V~ > 0 and V17 > 0. However, if V~ > 0 and V17 = 0 they are still evidently valid. Hence we have the following result (cf. (1.4.16) and (1.4.17)).

Theorem 2. Let (~, 17) be a Gaussian vector with estimator of '7 in terms of~ is

V~

> 0. Then the optimal

(12) and its error is

(13)

Remark. The curve y(x) = E('71 ~ = x) is the curve of regression of '7 on ~ or of '7 with respect to ~. In the Gaussian case E('71 ~ = x) = a + bx and consequently the regression of '1 and ~ is linear. Hence it is not surprising that the right-hand sides of (12) and (13) agree with the corresponding parts of (1.4.6) and (1.4.17) for the optimal linear estimator and its error.

237

§8. Random Variables. II

Corollary. Let e 1 and e2 be independent Gaussian random variables with mean zero and unit variance, and

Then E~ and if af

= E11 = 0, V~ = af + a~ > 0, then E(

11

+a~, V11

= bf + bLcov(~, 1'/) = a,b, + a2 b 2 ,

I~)= a 1 b1 + a 2 b2 ~ af +a~

(14)

'

Ll = (a 1 b2 - azbd 2 ai +a~

(15)

3. Let us consider the problem of determining the distribution functions of random variables that are functions of other random variables. Let~ be a random variable with distribution function F~(x) (and density Nx), if it exists), let q> = q>(x) be a Borel function and 11 = q>(~). Letting I Y = (- oo, y), we obtain

F~(y) =

P(ry::::;; y)

= P(q>(~)Eiy) = P(~Eq>- 1 (/y)) =

f

F~(dx),

(16)

"'- l(ly)

· which expresses the distribution function F~(y) in terms of F~(x) and q>. For example, if 11 = a~ + b, a > 0, we have

( y- b) (y- b)

F ~(y) = p ~ ::::;; -a- = F ~ -a- .

(17)

If 11 = ~ 2 , it is evident that F ~(y) = 0 for y < 0, while for y 2 0

F /Y) = P(

e ::;

y) = P(- Jy ::::;; ~ ::::;; Jy)

= F~(Jy)- F~( -Jy) + P(~ = -Jy).

(18)

We now turn to the problem of determining f~(y). Let us suppose that the range of~ is a (finite or infinite) open interval I = (a, b), and that the function q> = q>(x), with domain (a, b), is continuously differentiable and either strictly increasing or strictly decreasing. We also suppose that q>'(x) =f. 0, xEI. Let us write h(y) = q>- 1(y) and suppose for definiteness that cp(x) is strictly increasing. Then when y E q>(/),

= P(e ::::;; h(y)) =

f

h(y)

_ 00Nx) dx.

(19)

238

II. Mathematical Foundations of Probability Theory

By Problem 15 of §6,

h(y) fy f - OONx) dx = - oof;(h(z))h'(z) dz

(20)

and therefore

fq(y)

= f~;(h(y))h'(y).

(21)

Similarly, if cp(x) is strictly decreasing,

fq(y)

=

Nh(y))(( -h'(y)).

Hence in either case J~(y)

For example, if 1J = a~

= Nh(y)) Ih'(y) 1.

+ b, a #

0, we have

y-b h(y) =-aand If~ ~

JV (m, u 2) and 11 fq(y)

=

= el:,

(22)

fq(y)

1 (y-b) = j;f !~;-a- ·

we find from (22) that

1 exp[{ $uy

ln(y/~)2], 2u

0

y > 0,

(23)

y :::;; 0,

with M =em. A probability distribution with the density (23) is said to be lognormal (logarithmically normal). If cp = cp(x) is neither strictly increasing nor strictly decreasing, formula (22) is inapplicable. However, the following generalization suffices for many applications. Let cp = cp(x) be defined on the set 1 [ak, bk], continuously differentiable and either strictly increasing or strictly decreasing on each open interval Ik = (ab bk), and with cp'(x) # 0 for x E Ik. Let hk = hk(y) be the inverse of cp(x) for x Elk. Then we have the following generalization of(22):

Lk=

n

fq(y) =

L f~;(hk(y))lh~(y)l· Ivk(y),

(24)

k=1

where Dk is the domain of hk(y). For example, if 11 = ~ 2 we can take I 1 = ( - oo, 0), I 2 find that h1 (y) = h 2(y) = and therefcre

-.JY,

.JY,

f~(y) = {2Jy [f~;(.jY) + fl-JY)], 0,

y > 0, y :::;; 0.

= (0,

oo ), and

(25)

239

§8. Random Variables. II

Wecanobservethatthisresultalsofollowsfrom(18),sinceP(~ = In particular, if~ ,..., .;V (0, 1),

h2(y) =

{k

-JY) = 0.

y > 0,

e-y/2,

y

0,

~

(26)

0.

A straightforward calculation shows that

fi~l(y)

f +v'l~l (y) --

=

{f~(y) + f~(- y), y > 0, 0,

y::;; 0.

{2y(f~(y2) 0,

+ f~(- y2)),

y > 0, y ~ 0.

(27) (28)

4. We now consider functions of several random variables. If ~ and '7 are random variables with joint distribution F ~,(x, y), and


F~:,(z) =

J

{x, y: q>(x, y) :S z}

dF~,(x, y).

(29)

For example, if cp(x, y) = x + y, and~ and '7 are independent (and therefore F~,(x, y) = F~(x) · F,(y)) then Fubini's theorem shows that

F,(z) =

f

{x,y:x+y:Sz)

=

=

(

JR2

dF~(x) · dF,(y)

I{x+y:s;zJ(x, y) dF~(x) dF,(y)

J:oo dF~(x){J:,/{x+y:s;z)(x, y) dF,(y)} =

J:ooF,(z- x)

dF~(x) (30)

and similarly

F,(z) =

J:ooF~(z- y) dF,(y).

(31)

IfF and G are distribution functions, the function

H(z) = J:00 F(z- x) dG(x) is denoted by F * G and called the convolution ofF and G. Thus the distribution function F 1 of the sum of two independent random variables ~ and '7 is the convolution of their distribution functions F ~ and F,: F, = F~*F,.

It is clear that F~ * F, = F, * F~.

II. Mathematical Foundations of Probability Theory

240

Now suppose that the independent random variables ~ and '1 have densities f~ and f~. Then we find from (31 ), with another application ofFubini's theorem, that

F~(z) = J:oo [f~Yf~(u) du ]f~(y) dy = s:oo

[f

J

=

ooh(u - y) du J.,(y) dy

whence J{(z)

=

f_

00

/iz -

00

f

00

[f_0000j~(u -

J

y)J,(y) dy du,

(32)

y)J.,(y) dy,

and similarly (33) Let us see some examples of the use of these formulas. Let ~to ~ 2 , ••• , ~n be a sequence of independent identically distributed random variables with the uniform density on [ -1, 1]:

f(x) =

{!. 0,

lxl ~ 1, lxl > 1.

Then by (32) we have f~, +~ 2 (x)

={

2 _...,...:.-....:.,

I 2 IX~'

0,

lxl > 2,

-lxl 4

(3 - lxl) 2 , 16 !~, +~2+~3(x)

= 3 - x2

1~

lxl

~

3,

0 ~ lxl ~ 1,

8

lxl > 3,

0, and by induction

1

[(n+x)/21

_ { 2"( _ 1) 1 · n /~, + ... +~n(x)-

0, Now let~ ,..., .% (m 1,

L

(-1)kC~(n

o-n and '1 ,..., .% (m cp(x)

+ x- 2k)"-t, lxl

~ n,

k=O

lxl > n. 2,

o-~). If we write

= _1_ e-x2J2,

fo

241

§8. Random Variables. II

then

and the formula

follows easily from (32). Therefore the sum of two independent Gaussian random variables is again a

Gaussian random variable with mean m1

e

en

+ m2 and variance uf + u~.

Let 1, •• ,, be independent random variables each of which is normally distributed with mean 0 and variance 1. Then it follows easily from (26) (by induction)

t::+ . . ~ +"(x)

{-=-2n:-;;;t2=:,.,..(n....,./2,.,..) x
ef

x > 0, X~

x;,

e;

(34)

0.

The variable + · · · + is usually denoted by and its distribution (with density (30)) is the x2 -distribution ("chi-square distribution") with n degrees of freedom (cf. Table 2 in §3).

If we write Xn =

+·.JX!, it follows from (28) and (34) that fx . (x) =

2x"-le-x2f2 { 2"12r(n/2) , X ~ 0,

(35)

X< 0.

0,

The probability distribution with this density is the x-distribution (chidistribution) with n degrees of freedom. Again let ~ and 11 be independent random variables with densities f~ and J,. Then

F~~(z) =

JJ

f~(x)f~(y) dx dy,

{x,y:xy:Sz)

F~1 ~(z) =

JJ

f~x)j,(y) dx dy.

{x, y: x/y:Sz)

Hence we easily obtain

~~~(z)

=

oo (z) dy dx f-oof~ Y J.,(y) IYT = foo-oof~ (z) X J~(x) TXT

(36)

and (37)

242

II. Mathematical Foundations of Probability Theory

e

e

e;)!n,

en

Putting =eo and '1 = j(e~ + · · · + in (37), where eo. 1 , •.. , are independent Gaussian random variables with mean 0 and variance rJ 2 > 0, and using (35), we find that

f~o/[.j(l/nH~t+ ... +~~)](x)

I r(n; 1)

=

C y'Ttn

( ) r~

1 (

(38)

x2)(n+ 1)/2

1+-

2

n

The variable eo/[)(1/n)(ei + ... + e:)J is denoted by t, and its distribution is the t-distribution, or Student's distribution, with n degrees of freedom (cf. Table 2 in §3). Observe that this distribution is independent of rJ.

5.

PROBLEMS

1. Verify formulas (9), (10), (24), (27), (28), and (34)-(38). 2. Let ~t> ••• , ~ •• n ~ 2, be independent identically distributed random variables with distribution function F(x) (and density f(x), if it exists), and let~= max(~t• ... , ~.), ~ = min(~t• ... , ~.), p = ~ - ~-Show that

F- (y x) = {(F(y))"- (F(y)- F(x))", ~.~ ' (F(y))",

y > x, y ~ x,

n(n- 1)[F(y)- F(x)]"- 2 f(x)f(y),

f~.~(y, x) = { 0,

y > x, <X,

y

- {n s~oo [F(y)- F(y- x)]"-tf(y) dy, X~ 0,

~00-0 '

0

X< '

n(n- 1) s~oo [F(y)- F(y- x)]"- 2 f(y- x)f(y) dy,

fP(x) = { 0,

X>

0,

X< 0•

3. Let ~t and ~ 2 be independent Poisson random variables with respective parameters At and A2 • Show that ~t + ~ 2 has a Poisson distribution with parameter At + A2 • 4. Let mt = m2 = 0 in (4). Show that

5. The maximal correlation coefficient of ~ and Yf is p*(~, rt) = sup"·" p(u<e), v<e)), where the supremum is taken over the Borel functions u = u(x) and v = v(x) for which the correlation coefficient p(u(~). v(e)) is defined. Show that' and Yf are independent if and only if p*(~. rt) = 0. 6. Let "t 1 , "t 2 , ••• , "t• be independent nonnegative identically distributed random variables with the exponential density

f(t) = Ae-At,

t

~

0.

§9. Construction of a Process with Given Finite-Dimensional Distribution

Show that the distribution of r 1

243

+ · · · + rk has the density

A_ktk-1e-A'

t

(k-1)!'

~

0, 1 :o;; k :o;; n,

and that P(r 1

+ · · · + rk >

t) =

k-1

(A.tY

i=O

l.

L e-A•-.-1 .

7. Let ~ ~ %(0, a 2 ). Show that, for every p ~ 1,

EI~ IP =

CPaP,

where

and f(s)

=

So e-xxs-

1

dx is the gamma function. In particular, for each integer n ~ 1, E~ 2 " = (2n - 1)!! a 2 ".

§9. Construction of a Process with Given Finite-Dimensional Distribution 1. Let

~ = ~(w)

(Q, $', P), and let

be a random variable defined on the probability space F~(x) =

P{w:

~(w)::;

x}

be its distribution function. It is clear that F ~(x) is a distribution function on the real line in the sense of Definition 1 of §3. We now ask the following question. Let F = F(x) be a distribution function on R. Does there exist a random variable whose distribution function is

F(x)?

One reason for asking this question is as follows. Many statements in probability theory begin, "Let~ be a random variable with the distribution function F(x); then ... ". Consequently if a statement of this kind is to be meaningful we need to be certain that the object under consideration actually exists. Since to know a random variable we first have to know its domain (Q, g;), and in order to speak of its distribution we need to have a probability measure P on (Q, g;), a correct way of phrasing the question of the existence of a random variable with a given distribution function F(x) is this: Do there exist a probability space (Q, $', P) and a random variable~ on it, such that

P{w:

~(w)::;

x}

=

F(x)?

= ~(w)

244

II. Mathematical Foundations of Probability Theory

Let us show that the answer is positive, and essentially contained in Theorem 1 of§ 1. In fact, let us put :F = PJ(R).

!l=R,

It follows from Theorem 1 of §1 that there is a probability measure P (and only one) on (R, PJ(R)) for which P(a, b)] = F(b) - F(a), a < b. Put c;(w) = w. Then

P{w: c;(w)

~ x}

= P{w: w

~ x}

= P(- oo, x] = F(x).

Consequently we have constructed the required probability space and the random variable on it. 2. Let us now ask a similar question for random processes. Let X= (c;,),er be a random process (in the sense of Definition 3, §5) defined on the probability space (Q, !F, P), with t E T £ R. From a physical point of view, the most fundamental characteristic of a random process is the set {F,,, ... ,1"(x 1 , ... , xn)} of its finite-dimensional distribution functions

defined for all sets tt. ... , tn with t 1 < t 2 < · · · < tn. We see from (1) that, for each set t 1 , .•. , tn with t 1 < t 2 < · · · < tn the functions F 1,, ... ,1Jx 1, ... , xn) are n-dimensional distribution functions (in the sense of Definition 2, §3) and that the collection {F,,, ... ,1"(x 1, ... , Xn)} has the following consistency property: lim F,,, ... ,tn(xl, ... ' Xn)

=

F,,, ... ,tk, ... ,tn(xl, ... '

xk> ... ' Xn)

(2)

Xkf 00

where ~ indicates an omitted coordinate. Now it is natural to ask the following question: under what conditions can a given family {F 11 , ... , 1"(xt. ... , xn)} of distribution functions F,,, ... ,1"(x 1 , •.• , xn) (in the sense of Definition 2, §3) be the family of finitedimensional distribution functions of a random process? It is quite remarkable that all such conditions are covered by the consistency condition (2).

Theorem 1 (Kolmogorov's Theorem on the Existence of a Process). Let {F1,,. .. , 1"(x 1, ... , Xn)}, with t; E T £ R, t 1 < t 2 < · · · < tn, n ~ 1, be a given

family of finite-dimensional distribution functions, satisfying the consistency condition (2). Then there are a probability space (Q, !F, P) and a random process X = (c;,),er such that (3)

§9. Construction of a Process with Given Finite-Dimensional Distribution

245

PRooF. Put

i.e. taken to be the space of real functions w = (w,),eT with the a-algebra generated by the cylindrical sets. LetT= [t 1 , ... , tn], t 1 < t 2 < · · · < tn. Then by Theorem 2 of §3 we can construct on the space (R", PA(R")) a unique probability measure Pr such that

It follows from the consistency condition (2) that the family {P r} is also consistent (see (3.20)). According to Theorem 4 of §3 there is a probability measure P on (RT, PA(RT)) such that P{w: (w11 ,

••• ,

w,J E B} = Pr(B)

for every set r = [t 1 , .. , tn], t 1 < · · · < tn. From this, it also follows that (4) is satisfied. Therefore the required random process X= (~,(w)),eT can be taken to be the process defined by tE T.

(5)

This completes the proof of the theorem.

Remark 1. The probability space (RT, PA(RT), P) that we have constructed is called canonical, and the construction given by (5) is called the coordinate method of constructing the process. Remark 2. Let (Ea, ga) be complete separable metric spaces, where oc belongs

to some set mof indices. Let {Pr} be a set of consistent finite-dimensional distribution functions P" T = [IX 1, ••. , ocnJ on (£1% 1

X ··• X

EIXn' rff/% 1 ® • • • ® rff/XJ

Then there are a probability space (Q, !F, P) and a family of !F /rff a-measurable functions (Xa(w))ae!ll such that P{(Xa 1,

••• ,

XaJ E B} = P,(B)

for all r = [oc 1 , .•• , ocnJ and BE ~fa ®···®Can· This result, which generalizes Theorem 1, follows from Theorem 4 of §3 if we put n = Ea., !F = I®Ia tel% and XIX(w) = WI% for each Q) = w(wl%), 0( em.

Tia

Corollary 1. Let F 1(x), F 2 (x), ... be a sequence of one-dimensional distribution junctions. Then there exist a probability space (Q, ff, P) and a sequence of independent random variables ~ 1 , ~ 2 , ••• such that P{w: ~;(w) ::; x} = F;(x).

(6)

246

II. Mathematical Foundations of Probability Theory

In particular, there is a probability space (Q, ~ P) on which an infinite sequence of Bernoulli random variables is defined (in this connection see Subsection 2 of §5 of Chapter 1). Notice that Q can be taken to be the space Q = {ro: w = (a 1, a2 ,

•• •),

ai = 0, 1}

(cf. also Theorem 2). To establish the corollary it is enough to put F 1 , ... ,ix1 , F 1(x 1 ) • • • Fn(Xn) and apply Theorem 1.

••• ,

Xn) =

Corollary 2. Let T = [0, oo) and let {p(s, x; t, B} be a family of nonnegative functions defined for s, t E T, t > s, x E R, BE Bl(R), and satisfying the following conditions: (a) p(s, x; t, B) is a propability measure on B for given s, x and t; (b) for given s, t and B, the function p(s, x; t, B) is a Borel function of x; (c) for 0 ::S; s < t < rand BE Bl(R), the Kolmogorov-Chapman equation

p(s, x; r, B)

=

{p(s, x; t, dy)p(t, y; r, B)

(7)

is satisfied. Also let 1t = n(B) be a probability measure on (R, Bl(R)). Then there are a probability space (Q, ~ P) and a random process X = (~ 1)1 2: 0 defined on it, such that

(8)

for 0 = t 0 < t 1 < · · · < tn. The process X so constructed is a Markov process with initial distribution nand transition probabilities {p(s, x; t, B}.

Corollary 3. Let T = {0, 1, 2, ... } and let {Pk(x; B)} be a family of nonnegative functions defined for k ;;::: 1, x E R, BE PJ(R), and such that pk(x; B) is a probability measure on B (for given k and x) and measurable in x (for given k and B). In addition, let n = n(B) be a probability measure on (R, Bl(R)). Then there is a probability space (Q, ~ P) with a family of random variables X= {~ 0 , ~ 1 , ••. } defined on it, such that

247

§9. Construction of a Process with Given Finite-Dimensional Distribution

3. In the situation of Corollary 1, there is a sequence of independent random variables ~ 1 , ~ 2 , . . . whose one-dimensional distribution functions are F 1 , F 2 , ••• , respectively. Now let (£ 1 , £8' 1 ), (£ 2 , £8' 2 ), •.. be complete separable metric spaces and let P 1 , P 2 , ... be probability measures on them. Then it follows from Remark 2 that there are a probability space (Q, ff', P) and a sequence of independent elements X1o X 2 , ..• such that Xn is ~/c8'n-measurable and P(XnEB) = PiB), BE @"n• It turns out that this result remains valid when the spaces (En,@" n) are

arbitrary measurable spaces.

Theorem 2 (lonescu Tulcea's Theorem on Extending a Measure and the Existence of a Random Sequence). Let (Qn, ff,), n = 1, 2, ... , be arbitrary

n

PI

measurable spaces and n = nn, ~ = ff,. Suppose that a probability measure P1 is given on (!1 1, ~1 ) and that, for every set (wt> ... , wn) E !1 1 x ... X nn,n ~ l,probabilitymeasuresP(wl, ... ' wn;· )aregivenon(Qn+ I•~+ 1). Suppose that for every BE ff,+ 1 the functions P(wl> ... , wn; B) are Borel fimctions on (w 1 , ••• , wn) and let

n

~

1.

(9)

Then there is a unique probability measure P on (Q, ~) such that

for every n such that

~

1, and there is a random sequence X= (X 1 (w), X 2 (w), ...)

where A; E c8';. PROOF. The first step is to establish that for each n > 1 the set function Pn defined by (9) on the rectangle A 1 x · · · x An can be extended to the a-algebra ~1 ® · · · ® ff,. For each n ~ 2 and BE ~1 ® · · · ® ff, we put

X

l

nn

I B(fl!1, ... , Wn)P(w1, ... , Wn- 1; dwn).

(12)

It is easily seen that when B = A 1 x · · · x An the right-hand side of (12) is the same as the right-hand side of (9). Moreover, when n = 2 it can be

248

II. Mathematical Foundations of Probability Theory

shown, just as in Theorem 8 of §6, that P 2 is a measure. Consequently it is easily established by induction that Pn is a measure for all n ~ 2. The next step is the same as in Kolmogorov's theorem on the extension of a measure in (R 00 , 14(R 00 ) ) (Theorem 3, §3). Thus for every cylindrical set Jn(B) = {roe 0: (ro 1, ... , ron) e B}, Be §'1 ® · · · ® !F,, we define the set function P by (13)

If we use (12) and the fact that P(ro 1, ... , rok; ·)are measures, it is easy to

establish that the definition (13) is consistent, in the sense that the value of P(Jn(B)) is independent of the representation of the cylindrical set. It follows that the set function P defined in (13) for cylindrical sets, and in an obvious way on the algebra that contains all the cylindrical sets, is a finitely additive measure on this algebra. It remains to verify its countable additivity and apply Caratheodory's theorem. In Theorem 3 of §3 the corresponding verification was based on the property of (R", 14(Rn)) that for every Borel set B there is a compact set A £ B whose probability measure is arbitrarily close to the measure of B. In the present case this part of the proof needs to be modified in the following way. As in Theorem 3 of §3, let {Bn} n~ 1 be a sequence of cylindrical sets

Bn = {ro: (rol> ... ' ron) E Bn}, that decrease to the empty set

0. but have lim P(Bn) > 0.

(14)

n-+oo

For n > 1, we have from (12)

where

Since .Bn+1 £ Bn, we have Bn+1 £ Bn

X

on+1 and therefore

JBn+l(ro1, ... ' ron+1) ~ [Bn(ro,, ... ' ron)Inn+l(ron+1>· Hence the sequence {f~1 >(ro 1 )}n~ 1 decreases. Let J(l>(ro 1) = limn f~1 >(ro 1 ). By the dominated convergence theorem lim P(Bn) =lim n

n

i f~ >(ro 1 )P 1(dro 1 ) i n,

1

=

n,

J< 1>(ro 1)P 1(dro 1).

By hypothesis, limn P(Bn) > 0. It follows that there is an ro~ e B such that > 0, since if ro 1 rl B 1 then f~1 >(ro 1 ) = 0 for n ~ 1.

j< 1 >(ro~)

§9. Construction of a Process with Given Finite-Dimensional Distribution

249

Moreover, for n > 2, (15)

where

f~2 >(w 2 ) =

LP(w~, ···l IB"(w~,w2,

w 2 ; dw 3 )

On

... ,wn)P(w~,w 2 ,

•••

,wn-bdwn).

We can establish, as for {f~l)(w 1 )}, that {f~2 >(w 2 )} is decreasing. Let

j< 2 >(w 2 ) = limn .... oo Jf>(w 2 ). Then it follows from (15) that

0<

J(l>(w~) = j j< 2 >(w 2 )P(w~; dw 2 ),

Jn2

and there is a point w~ E Q 2 such that f(2)(w~) > 0. Then (w~, w~) E B 2. Continuing this process, we find a point (w~, ... , w~) E Bn for each n. 0 Consequently (w 01, ... , wn, .. .) E n~ Bn, but by hypothesis we have n~ Bn = 0. This contradiction shows that limn P(Bn) = 0. Thus we have proved the part of the theorem about the existence of the probability measure P. The other part follows from this by putting XnCw) = Wn, n;;::: 1.

.

Corollary 1. Let (En, Cn)n<: 1 be any measurable spaces and (Pn)n<: 1 , measures on them. Then there are a probability space (Q, §', P) and a family of independent random elements X 1,X 2, ... with values in (E 1,C 1), (E 2 ,C 2 ), •.. , respectively, such that P{w: Xiw)

E

B}

= Pn(B),

Corollary 2. Let E = {1, 2, ... }, and let {pk(x, y)} be a family of nonnegative functions, k;;::: 1, x,yEE, such that LyeEPk(x;y) = 1, xEE, k;;::: 1. Also letn = n(x)beaprobabilitydistributiononE(thatis,n(x);;::: O,LxeE n(x) = 1). Then there are a probability space (Q, §', P) and a family X = { ~ 0 , ~ 1 , of random variables on it, such that

Pgo = Xo, ~1 = X1, · · ·, ~n = Xn} = n(xo)P1(xo, X1) · · · Pn(Xn-1• Xn)

••• }

(16)

(cf. (1.12.4)) for all x; E E and n ;;::: 1. We may take Q to be tl".e space Q

= {w: w = (x 0 , x 1,

••. ), X; E

E}.

A sequence X= g 0 , ~ 1 , ... } of random variables satisfying (16) is a Markov chain with a countable set E of states, transition matrix {pk(x, y)} and initial probability distribution n. (Cf. the definition in §12 of Chapter 1.)

250 4.

II. Mathematical Foundations of Probability Theory

PROBLEMS

1. Let 0 = [0, 1], let :F be the class of Borel subsets of [0, 1], and let P be Lebesgue measure on [0, 1]. Show that the space (0, !F, P) is universal in the following sense. For every distribution function F(x) on (0, !F, P) there is a random variable = ro) such that its distribution function F ~x) = P(e ~ x) coincides with F(x). (Hint. e(ro) = r 1(ro), 0 < ro < 1, where r 1(ro) = sup{x: F(x) < ro}, when 0 < ro < 1, and e(O), W) can be chosen arbitrarily.)

e e(

2. Verify the consistency of the families of distributions in the corollaries to Theorems 1 and 2. 3. Deduce Corollary 2, Theorem 2, from Theorem 1.

§10. Various Kinds of Convergence of Sequences of Random Variables 1. Just as in analysis, in probability theory we need to use various kinds of convergence of random variables. Four of these are particularly important:

in probability, with probability one, in mean of order p, in distribution.

First some definitions. Let~. ~ 1 , ~ 2 , ••• be random variables defined on a probability space (Q, !F, P). Definition 1. The sequence

e

1,

~ 2 , ••.

of random variables converges in

probability to the random variable~ (notation: ~n E.~) if for every s > 0 P{l~n- ~~

> s}-+ 0,

n-+

oo.

(1)

We have already encountered this convergence in connection with the law of large numbers for a Bernoulli scheme, which stated that n-+

oo

(see §5 of Chapter 1). In analysis this is known as convergence in measure.

Definition 2. The sequence ~ 1 , ~ 2 , ••• of random variables converges with

probability one (almost surely, almost everywhere) to the random variable ~if

P{w: ~n

+ ~} =

0,

(2)

i.e. if the set of sample points w for which ~n(w) does not converge to ~has probability zero. This convergence is denoted by ~" -+ ~ (P-a.s.), or ~n ~· ~ or ~n ~· ~.

§10. Various Kinds of Convergence of Sequences of Random Variables

251

Definition 3. The sequence ~ 1 , ~ 2 , ••• of random variables converges in mean of order p, 0 < p < oo, to the random variable~ if n--+ oo.

(3)

e

~. In analysis this is known as convergence in LP, and denoted by ~n In the special case p = 2 it is called mean square convergence and denoted by ~ = l.i.m. ~n (for "limit in the mean").

Definition 4. The sequence ~ 1 , ~ 2 , . . . of random variables converges in distribution to the random variable ~ (notation: ~n ~ ~) if n--+ oo,

(4)

for every bounded continuous function f = f(x). The reason for the terminology is that, according to what will be proved in Chapter III, §1, condition (4) is equivalent to the convergence of the distribution F~"(x) to F~(x) at each point x of continuity ofF ~(x). This convergence is denoted by F~" => F~.

We emphasize that the convergence of random variables in distribution is defined only in terms of the convergence of their distribution functions. Therefore it makes sense to discuss this mode of convergence even when the random variables are defined on different probability spaces. This convergence will be studied in detail in Chapter III, where, in particular, we shall explain why in the definition ofF~" => F ~ we require only convergence at points of continuity of F~(x) and not at all x. 2. In solving problems of analysis on the convergence (in one sense or another) of a given sequence of functions, it is useful to have the concept of a fundamental sequence (or Cauchy sequence). We can introduce a similar concept for each of the first three kinds of convergence of a sequence of random variables. Let us say that a sequence {~n}n~ 1 of random variables is fundamental in probability, or with probability 1, or in mean of order, p, 0 < p < oo, if the corresponding one of the following properties is satisfied: P{ I~n - ~I > e} --+ 0, as m, n--+ oo for every e > 0; the sequence gn(w)}n~ 1 is fundamental for almost all WE Q; the sequence {~n( W)} n ~ 1 is fundamental in U, i.e. EI~n - ~m IP --+ 0 as n, m--+ 00.

3. Theorem 1.

(a) A necessary and sufficient condition that

~n

P{~~~~~k- ~I~ e}--+ 0, for every e > 0.

--+

~

(P-a.s.) is that n--+ oo.

(5)

252

II. Mathematical Foundations of Probability Theory

(b) The sequence {~"} "~ 1 is fundamental with probability 1 if and only

n--+ oo,

if (6)

for every e > 0; or equivalently

~n+k - ~n I 2

P{sup I k~O

PROOF.

n __.. w.

e}--+ 0,

=n:;,

(a) Let A~= {w: l~n- ~~ 2 e}, A'= IlniA~

+ 0 = U A' = U A

1

(7) Uk~n A;;. Then

00

{w: ~n

11 m.

m=1

e~O

But P(A') = lim n

P( U A;;), k~n

Hence (a) follows from the following chain of implications:

~" + ~} =

0 = P{w:

P(U

e>O

A')¢> P( UA1m)= 0

¢> P(A 1 1m) = 0,

1

m=1

m 2 1 ¢> P(A') = 0, e > 0,

¢> PCY" A;;)--+ 0,

n--+

00

¢>

P(~~~~~k- ~I 2 e)--+ 0, n --+ oo.

(b) Let

n U Bk,. 00

B' =

n= 1

k~n l~n

Then {w: g"{w)}"~ 1 is notfundamental} = U,~ 0 B', and it can be shown as in (a) that P{w: {~n(w)}n~ 1 is not fundamental}= 0¢>(6). The equivalence of (6) and (7) follows from the obvious inequalities supJ~n+k- ~nl:::; supl~n+k- ~n+ll:::; 2 supl~n+k- ~nl·

This completes the proof of the theorem.

Corollary. Since

253

§10. Various Kinds of Convergence of Sequences of Random Variables

a sufficient condition for

~n ~ ~

is that

00

L P{i~k- ~I~ e} <

k= 1

(8)

00

is satisfied for every e > 0. It is appropriate to observe at this point that the reasoning used in obtaining (8) lets us establish the following simple but important result which is essential in studying properties that are satisfied with probability 1. Let A 1, A 2 , ••• be a sequence of events in F. Let (see the table in §1) {An i.o.} denote the event ITiiiAn that consists in the realization of infinitely many of A 1 , Az, .... Borei-Cantelli Lemma.

(a) If'[. P(An) < 00 then P{An i.o.} = 0. (b) If'[. P(An) = oo and A1, A 2 , ••• are independent, then P{An i.o.} = 1. PROOF.

(a) By definition

{An i.o.} = ITiii An=

rl

n= 1

U Ak.

k?!n

Consequently

P{An i.o.} = Pt01 kyn Ak} =lim P(Yn Ak) and (a) follows. (b) If A 1, A 2 ,

•••

are independent, so are

we have

p

Con Ak)

=

and it is then easy to deduce that

PCQ Ak) Since log(1 - x)

~

-x, 0

00

log

fl

k=n

~

=

li lt

A1, A2 , ••••

Hence for N ~ n

P(Ak),

(9)

P(Ak).

x < 1, 00

[1 - P(Ak)] =

~lim k~nP(Ak),

I

k=n

00

log[1 - P(Ak)] ~ -

Consequently

for all n, and therefore P(An i.o.) = 1. This completes the proof of the lemma.

L

k=n

P(Ak) = - oo.

254

II. Mathematical Foundations of Probability Theory

Corollary 1. If $A_n^{\varepsilon} = \{\omega \colon |\xi_n - \xi| \ge \varepsilon\}$ then (8) shows that $\sum_{n=1}^{\infty} P(A_n^{\varepsilon}) < \infty$, $\varepsilon > 0$, and then by the Borel-Cantelli lemma we have $P(A^{\varepsilon}) = 0$, $\varepsilon > 0$, where $A^{\varepsilon} = \overline{\lim}\, A_n^{\varepsilon}$. Therefore

$$\sum_{k=1}^{\infty} P\{|\xi_k - \xi| \ge \varepsilon\} < \infty,\ \varepsilon > 0 \ \Rightarrow\ P(A^{\varepsilon}) = 0,\ \varepsilon > 0 \ \Rightarrow\ P\{\omega \colon \xi_n \nrightarrow \xi\} = 0,$$

as we already observed above.

Corollary 2. Let $(\varepsilon_n)_{n \ge 1}$ be a sequence of positive numbers such that $\varepsilon_n \downarrow 0$, $n \to \infty$. If

$$\sum_{n=1}^{\infty} P\{|\xi_n - \xi| \ge \varepsilon_n\} < \infty, \tag{10}$$

then $\xi_n \xrightarrow{\text{a.s.}} \xi$.

In fact, let $A_n = \{|\xi_n - \xi| \ge \varepsilon_n\}$. Then $P(A_n \text{ i.o.}) = 0$ by the Borel-Cantelli lemma. This means that, for almost every $\omega \in \Omega$, there is an $N = N(\omega)$ such that $|\xi_n(\omega) - \xi(\omega)| < \varepsilon_n$ for $n \ge N(\omega)$. But $\varepsilon_n \downarrow 0$, and therefore $\xi_n(\omega) \to \xi(\omega)$ for almost every $\omega \in \Omega$.

4. Theorem 2. We have the following implications:

$$\xi_n \xrightarrow{\text{a.s.}} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \tag{11}$$

$$\xi_n \xrightarrow{L^p} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \quad p > 0, \tag{12}$$

$$\xi_n \xrightarrow{P} \xi \ \Rightarrow\ \xi_n \xrightarrow{d} \xi. \tag{13}$$

PROOF. Statement (11) follows from comparing the definition of convergence in probability with (5), and (12) follows from Chebyshev's inequality.

To prove (13), let $f(x)$ be a continuous function with $|f(x)| \le c$, let $\varepsilon > 0$, and let $N$ be such that $P(|\xi| > N) \le \varepsilon/4c$. Take $\delta$ so that $|f(x) - f(y)| \le \varepsilon/2c$ for $|x| < N$ and $|x - y| \le \delta$. Then (cf. the proof of Weierstrass's theorem in Subsection 5, §5, Chapter I)

$$
\begin{aligned}
E|f(\xi_n) - f(\xi)| &= E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| \le N) \\
&\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| > N) \\
&\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| > \delta) \\
&\le \varepsilon/2 + \varepsilon/2 + 2cP\{|\xi_n - \xi| > \delta\} = \varepsilon + 2cP\{|\xi_n - \xi| > \delta\}.
\end{aligned}
$$

But $P\{|\xi_n - \xi| > \delta\} \to 0$, and hence $E|f(\xi_n) - f(\xi)| \le 2\varepsilon$ for sufficiently large $n$; since $\varepsilon > 0$ is arbitrary, this establishes (13). This completes the proof of the theorem.

We now present a number of examples which show, in particular, that the converses of (11) and (12) are false in general.


EXAMPLE 1 ($\xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$; $\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure. Put

$$A_n^i = \Big[\frac{i-1}{n}, \frac{i}{n}\Big], \qquad \xi_n^i = I_{A_n^i}, \qquad i = 1, 2, \ldots, n;\ n \ge 1.$$

Then the sequence

$$\{\xi_1^1;\ \xi_2^1, \xi_2^2;\ \xi_3^1, \xi_3^2, \xi_3^3;\ \ldots\}$$

of random variables converges both in probability and in mean of order $p > 0$, but does not converge at any point $\omega \in [0, 1]$.

EXAMPLE 2 ($\xi_n \xrightarrow{\text{a.s.}} \xi \Rightarrow \xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{L^p} \xi$, $p > 0$). Again let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure, and let

$$\xi_n(\omega) = \begin{cases} e^n, & 0 \le \omega \le 1/n, \\ 0, & \omega > 1/n. \end{cases}$$

Then $\{\xi_n\}$ converges with probability 1 (and therefore in probability) to zero, but

$$E|\xi_n|^p = \frac{e^{np}}{n} \to \infty, \qquad n \to \infty,$$

for every $p > 0$.

EXAMPLE 3 ($\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\{\xi_n\}$ be a sequence of independent random variables with

$$P(\xi_n = 1) = p_n, \qquad P(\xi_n = 0) = 1 - p_n.$$

Then it is easy to show that

$$\xi_n \xrightarrow{P} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{14}$$

$$\xi_n \xrightarrow{L^p} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{15}$$

$$\xi_n \xrightarrow{\text{a.s.}} 0 \ \Rightarrow\ \sum_{n=1}^{\infty} p_n < \infty. \tag{16}$$

In particular, if $p_n = 1/n$ then $\xi_n \xrightarrow{L^p} 0$ for every $p > 0$, but $\xi_n \overset{\text{a.s.}}{\nrightarrow} 0$.
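Example 3 with $p_n = 1/n$ can be simulated directly. The sketch below (mine, not from the book; the helper name `last_one` is illustrative) shows that on every simulated path the value 1 keeps appearing arbitrarily late, so the path cannot converge to 0 even though $E|\xi_n|^p = 1/n \to 0$:

```python
import random

# For independent xi_n with P(xi_n = 1) = 1/n, sum p_n diverges, so by the
# Borel-Cantelli lemma (b) each path takes the value 1 infinitely often.
rng = random.Random(1)

def last_one(n_max):
    """Largest n <= n_max with xi_n = 1 on one simulated path."""
    last = 0
    for n in range(1, n_max + 1):
        if rng.random() < 1.0 / n:
            last = n
    return last

hits = [last_one(100_000) for _ in range(5)]
print(hits)   # ones keep appearing very late on every path
```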

The following theorem singles out an interesting case when almost sure convergence implies convergence in L 1 •

Theorem 3. Let $(\xi_n)$ be a sequence of nonnegative random variables such that $\xi_n \xrightarrow{\text{a.s.}} \xi$ and $E\xi_n \to E\xi < \infty$. Then

$$E|\xi_n - \xi| \to 0, \qquad n \to \infty. \tag{17}$$


PROOF. We have $E\xi_n < \infty$ for sufficiently large $n$, and therefore for such $n$

$$E|\xi - \xi_n| = E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi)I_{\{\xi_n > \xi\}} = 2E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi).$$

But $0 \le (\xi - \xi_n)I_{\{\xi \ge \xi_n\}} \le \xi$. Therefore, by the dominated convergence theorem, $\lim_n E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} = 0$, which together with $E\xi_n \to E\xi$ proves (17).

Remark. The dominated convergence theorem also holds when almost sure convergence is replaced by convergence in probability (see Problem 1). Hence in Theorem 3 we may replace "$\xi_n \xrightarrow{\text{a.s.}} \xi$" by "$\xi_n \xrightarrow{P} \xi$".

5. It is shown in analysis that every fundamental sequence (xn), xn E R, is convergent (Cauchy criterion). Let us give a similar result for the convergence of a sequence of random variables.

Theorem 4 (Cauchy Criterion for Almost Sure Convergence). A necessary and sufficient condition for the sequence $(\xi_n)_{n \ge 1}$ of random variables to converge with probability 1 (to a random variable $\xi$) is that it is fundamental with probability 1.

PROOF. If $\xi_n \xrightarrow{\text{a.s.}} \xi$ then

$$\sup_{k \ge n,\, l \ge n} |\xi_k - \xi_l| \le 2 \sup_{k \ge n} |\xi_k - \xi|,$$

whence the necessity follows.

Now let $(\xi_n)_{n \ge 1}$ be fundamental with probability 1. Let $\mathscr{N} = \{\omega \colon (\xi_n(\omega)) \text{ is not fundamental}\}$. Then whenever $\omega \in \Omega \setminus \mathscr{N}$ the sequence of numbers $(\xi_n(\omega))_{n \ge 1}$ is fundamental and, by Cauchy's criterion for sequences of numbers, $\lim \xi_n(\omega)$ exists. Let

$$\xi(\omega) = \begin{cases} \lim \xi_n(\omega), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}. \end{cases} \tag{18}$$

The function so defined is a random variable, and evidently $\xi_n \xrightarrow{\text{a.s.}} \xi$. This completes the proof.

Before considering the case of convergence in probability, let us establish the following useful result.

Theorem 5. If the sequence $(\xi_n)$ is fundamental (or convergent) in probability, it contains a subsequence $(\xi_{n_k})$ that is fundamental (or convergent) with probability 1.


PROOF. Let $(\xi_n)$ be fundamental in probability. By Theorem 4, it is enough to show that it contains a subsequence that converges almost surely.

Take $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which

$$P\{|\xi_t - \xi_s| > 2^{-k}\} < 2^{-k}$$

for all $s \ge n$, $t \ge n$. Then

$$\sum_k P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k}\} \le \sum_k 2^{-k} < \infty,$$

and by the Borel-Cantelli lemma

$$P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k} \text{ i.o.}\} = 0.$$

Hence

$$\sum_{k=1}^{\infty} |\xi_{n_{k+1}} - \xi_{n_k}| < \infty$$

with probability 1. Let $\mathscr{N} = \{\omega \colon \sum |\xi_{n_{k+1}}(\omega) - \xi_{n_k}(\omega)| = \infty\}$. Then if we put

$$\xi(\omega) = \begin{cases} \xi_{n_1}(\omega) + \displaystyle\sum_{k=1}^{\infty} (\xi_{n_{k+1}}(\omega) - \xi_{n_k}(\omega)), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}, \end{cases}$$

we obtain $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$. If the original sequence converges in probability, then it is fundamental in probability (see also (19)), and consequently this case reduces to the one already considered. This completes the proof of the theorem.
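Theorem 5 can be watched in action on the "typewriter" sequence of Example 1 above, which converges in probability but at no point; the subsequence formed by the first indicator of each block converges at every $\omega > 0$. A small sketch (mine, not from the book; the function name `typewriter` is illustrative):

```python
def typewriter(m):
    """m-th element (0-based) of the sequence I_{[(i-1)/n, i/n]},
    listed block by block for n = 1, 2, ..."""
    n, i = 1, m
    while i >= n:
        i -= n
        n += 1
    lo, hi = i / n, (i + 1) / n
    return lambda w: 1.0 if lo <= w <= hi else 0.0

w = 0.3
path = [typewriter(m)(w) for m in range(200)]
print(max(path[100:]))   # the full sequence still hits 1: no pointwise limit

# Subsequence: first element of each block, i.e. the indicator of [0, 1/n].
sub = [typewriter(n * (n - 1) // 2)(w) for n in range(1, 30)]
print(sub[-1])           # the subsequence has settled at 0 at this omega
```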

Theorem 6 (Cauchy Criterion for Convergence in Probability). A necessary and sufficient condition for a sequence $(\xi_n)_{n \ge 1}$ of random variables to converge in probability is that it is fundamental in probability.

PROOF. If $\xi_n \xrightarrow{P} \xi$ then

$$P\{|\xi_n - \xi_m| \ge \varepsilon\} \le P\{|\xi_n - \xi| \ge \varepsilon/2\} + P\{|\xi_m - \xi| \ge \varepsilon/2\}, \tag{19}$$

and consequently $(\xi_n)$ is fundamental in probability.

Conversely, if $(\xi_n)$ is fundamental in probability, by Theorem 5 there are a subsequence $(\xi_{n_k})$ and a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$. But then

$$P\{|\xi_n - \xi| \ge \varepsilon\} \le P\{|\xi_n - \xi_{n_k}| \ge \varepsilon/2\} + P\{|\xi_{n_k} - \xi| \ge \varepsilon/2\},$$

from which it is clear that $\xi_n \xrightarrow{P} \xi$. This completes the proof.

Before discussing convergence in mean of order $p$, we make some observations about $L^p$ spaces.


We denote by $L^p = L^p(\Omega, \mathscr{F}, P)$ the space of random variables $\xi = \xi(\omega)$ with $E|\xi|^p = \int_{\Omega} |\xi|^p\,dP < \infty$. Suppose that $p \ge 1$ and put

$$\|\xi\|_p = (E|\xi|^p)^{1/p}.$$

It is clear that

$$\|\xi\|_p \ge 0, \tag{20}$$

$$\|c\xi\|_p = |c|\,\|\xi\|_p, \qquad c \text{ constant}, \tag{21}$$

and by Minkowski's inequality (6.31)

$$\|\xi + \eta\|_p \le \|\xi\|_p + \|\eta\|_p. \tag{22}$$

Hence, in accordance with the usual terminology of functional analysis, the function $\|\cdot\|_p$, defined on $L^p$ and satisfying (20)-(22), is (for $p \ge 1$) a seminorm.

For it to be a norm, it must also satisfy

$$\|\xi\|_p = 0 \Rightarrow \xi = 0. \tag{23}$$

This property is, of course, not satisfied, since according to Property H (§6) we can only say that $\xi = 0$ almost surely.

However, if $L^p$ means the space whose elements are not random variables $\xi$ with $E|\xi|^p < \infty$, but equivalence classes of random variables ($\xi$ is equivalent to $\eta$ if $\xi = \eta$ almost surely), then $\|\cdot\|_p$ becomes a norm, so that $L^p$ is a normed linear space. If we select from each equivalence class of random variables a single element, taking the identically zero function as the representative of the class equivalent to it, we obtain a space (also denoted by $L^p$) which is actually a normed linear space of functions (rather than of equivalence classes).

It is a basic result of functional analysis that the spaces $L^p$, $p \ge 1$, are complete, i.e. that every fundamental sequence has a limit. Let us state and prove this in probabilistic language.

Theorem 7 (Cauchy Criterion for Convergence in Mean of Order $p$). A necessary and sufficient condition for a sequence $(\xi_n)_{n \ge 1}$ of random variables in $L^p$ to converge in mean of order $p$ to a random variable in $L^p$ is that the sequence is fundamental in mean of order $p$.

PROOF. The necessity follows from Minkowski's inequality. Now let $(\xi_n)$ be fundamental ($\|\xi_n - \xi_m\|_p \to 0$, $n, m \to \infty$). As in the proof of Theorem 5, we select a subsequence $(\xi_{n_k})$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$, where $\xi$ is a random variable with $\|\xi\|_p < \infty$.

Let $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which

$$\|\xi_t - \xi_s\|_p < 2^{-2k}$$

for all $s \ge n$, $t \ge n$. Let

$$A_k = \{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k}\}.$$

Then by Chebyshev's inequality

$$P(A_k) \le \frac{E|\xi_{n_{k+1}} - \xi_{n_k}|^p}{2^{-kp}} \le \frac{2^{-2kp}}{2^{-kp}} = 2^{-kp} \le 2^{-k}.$$

As in Theorem 5, we deduce that there is a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$.

We now deduce that $\|\xi_n - \xi\|_p \to 0$ as $n \to \infty$. To do this, we fix $\varepsilon > 0$ and choose $N = N(\varepsilon)$ so that $\|\xi_n - \xi_m\|_p^p < \varepsilon$ for all $n \ge N$, $m \ge N$. Then for any fixed $n \ge N$, by Fatou's lemma,

$$E|\xi_n - \xi|^p = E\Big\{\lim_{k \to \infty} |\xi_n - \xi_{n_k}|^p\Big\} = E\Big\{\underline{\lim}_{k \to \infty} |\xi_n - \xi_{n_k}|^p\Big\} \le \underline{\lim}_{k \to \infty} E|\xi_n - \xi_{n_k}|^p \le \varepsilon.$$

Consequently $E|\xi_n - \xi|^p \to 0$, $n \to \infty$. It is also clear that since $\xi = (\xi - \xi_n) + \xi_n$ we have $E|\xi|^p < \infty$ by Minkowski's inequality. This completes the proof of the theorem.

Remark 1. In the terminology of functional analysis a complete normed linear space is called a Banach space. Thus $L^p$, $p \ge 1$, is a Banach space.

Remark 2. If $0 < p < 1$, the function $\|\xi\|_p = (E|\xi|^p)^{1/p}$ does not satisfy the triangle inequality (22) and consequently is not a norm. Nevertheless the space (of equivalence classes) $L^p$, $0 < p < 1$, is complete in the metric $d(\xi, \eta) = E|\xi - \eta|^p$.

Remark 3. Let $L^\infty = L^\infty(\Omega, \mathscr{F}, P)$ be the space (of equivalence classes of) random variables $\xi = \xi(\omega)$ for which $\|\xi\|_\infty < \infty$, where $\|\xi\|_\infty$, the essential supremum of $\xi$, is defined by

$$\|\xi\|_\infty = \operatorname{ess\,sup} |\xi| = \inf\{0 \le c \le \infty \colon P(|\xi| > c) = 0\}.$$

The function $\|\cdot\|_\infty$ is a norm, and $L^\infty$ is complete in this norm.
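On a finite sample space with the uniform measure, $\|\xi\|_p$ and the essential supremum reduce to elementary computations, and Minkowski's inequality (22) can be checked directly. A small sketch (mine, not from the book; the values of `xi` and `eta` are arbitrary):

```python
def norm_p(x, p):
    """||xi||_p = (E|xi|^p)^(1/p) under the uniform measure on a finite space."""
    return (sum(abs(a) ** p for a in x) / len(x)) ** (1 / p)

def ess_sup(x):
    # On a finite space with positive point masses this is just max |xi|.
    return max(abs(a) for a in x)

xi = [1.0, -2.0, 0.5, 3.0]
eta = [0.0, 1.0, -1.0, 2.0]
s = [a + b for a, b in zip(xi, eta)]

print(norm_p(s, 2) <= norm_p(xi, 2) + norm_p(eta, 2))  # True: (22)
print(ess_sup(xi))                                      # 3.0
```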

PROBLEMS

1. Use Theorem 5 to show that almost sure convergence can be replaced by convergence in probability in Theorems 3 and 4 of §6.

2. Prove that $L^\infty$ is complete.

3. Show that if $\xi_n \xrightarrow{P} \xi$ and also $\xi_n \xrightarrow{P} \eta$ then $\xi$ and $\eta$ are equivalent ($P(\xi \ne \eta) = 0$).

4. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$, and let $\xi$ and $\eta$ be equivalent. Show that

$$P\{|\xi_n - \eta_n| \ge \varepsilon\} \to 0, \qquad n \to \infty,$$

for every $\varepsilon > 0$.

5. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$. Show that $a\xi_n + b\eta_n \xrightarrow{P} a\xi + b\eta$ ($a$, $b$ constants), $|\xi_n| \xrightarrow{P} |\xi|$, and $\xi_n\eta_n \xrightarrow{P} \xi\eta$.

6. Let $E(\xi_n - \xi)^2 \to 0$. Show that $E\xi_n^2 \to E\xi^2$.

7. Show that if $\xi_n \xrightarrow{d} C$, where $C$ is a constant, then this sequence converges in probability: $\xi_n \xrightarrow{d} C \Rightarrow \xi_n \xrightarrow{P} C$.

8. Let $(\xi_n)_{n \ge 1}$ have the property that $\sum_{n=1}^{\infty} E|\xi_n|^p < \infty$ for some $p > 0$. Show that $\xi_n \to 0$ (P-a.s.).

9. Let $(\xi_n)_{n \ge 1}$ be a sequence of independent identically distributed random variables. Show that

$$E|\xi_1| < \infty \ \Leftrightarrow\ \sum_{n=1}^{\infty} P\{|\xi_1| > \varepsilon n\} < \infty \ \text{for every } \varepsilon > 0.$$

10. Let $(\xi_n)_{n \ge 1}$ be a sequence of random variables. Suppose that there are a random variable $\xi$ and a sequence $\{n_k\}$ such that $\xi_{n_k} \to \xi$ (P-a.s.) and $\max_{n_k < l \le n_{k+1}} |\xi_l - \xi_{n_k}| \to 0$ (P-a.s.) as $k \to \infty$. Show that $\xi_n \to \xi$ (P-a.s.).
§11. The Hilbert Space of Random Variables with Finite Second Moment

1. An important role among the Banach spaces $L^p$, $p \ge 1$, is played by the space $L^2 = L^2(\Omega, \mathscr{F}, P)$, the space of (equivalence classes of) random variables with finite second moments. If $\xi$ and $\eta \in L^2$, we put

$$(\xi, \eta) = E\xi\eta. \tag{1}$$

It is clear that if $\xi, \eta, \zeta \in L^2$ then

$$(a\xi + b\eta, \zeta) = a(\xi, \zeta) + b(\eta, \zeta), \qquad a, b \in R,$$

$$(\xi, \xi) \ge 0,$$

and $(\xi, \xi) = 0 \Leftrightarrow \xi = 0$.

Consequently $(\xi, \eta)$ is a scalar product. The space $L^2$ is complete with respect to the norm

$$\|\xi\| = (\xi, \xi)^{1/2} \tag{2}$$

induced by this scalar product (as was shown in §10). In accordance with the terminology of functional analysis, a space with the scalar product (1) is a Hilbert space.

Hilbert space methods are extensively used in probability theory to study properties that depend only on the first two moments of random variables ("$L^2$-theory"). Here we shall introduce the basic concepts and facts that will be needed for an exposition of $L^2$-theory (Chapter VI).

2. Two random variables $\xi$ and $\eta$ in $L^2$ are said to be orthogonal ($\xi \perp \eta$) if $(\xi, \eta) = E\xi\eta = 0$. According to §8, $\xi$ and $\eta$ are uncorrelated if $\operatorname{cov}(\xi, \eta) = 0$, i.e. if

$$E\xi\eta = E\xi \cdot E\eta.$$

It follows that the properties of being orthogonal and of being uncorrelated coincide for random variables with zero mean values.

A set $M \subseteq L^2$ is a system of orthogonal random variables if $\xi \perp \eta$ for every $\xi, \eta \in M$ ($\xi \ne \eta$). If also $\|\xi\| = 1$ for every $\xi \in M$, then $M$ is an orthonormal system.

El~- ;t1 ~;'1;1 2 = 11~- it1 a;'1f = (~- ;t1 a;'1;, ~- ;t1 a;'1;)

it a;(~, + (t J1

= Wl2-

2

= 11~11 2 -

2

n

L

i= 1

-

Wl 2

-

~

where we used the equation

L

i= 1 n

I

i= 1

a;'1;)

n

a;(~,

'1;) + L af

n

= Wl 2

a;'1;,

'1;)

i= 1 n

1(~, '1;W

1<~. 11;)1 2 ,

+ L

i= 1

Ia;- (~, '1;W (3)

262

II. Mathematical Foundations of Probability Theory

Ii=

It is now clear that the infimum of E I~ 1 a;1Jd 2 over all real at> ... , an is attained for a; = (~, IJ;), i = 1, ... , n. Consequently the best (in the mean-square sense) estimator for~ in terms of 1] 1, •.• , IJn is n

~=

I

i= 1

c~, IJ;)'li·

Here

~ = infEI~- it1 a;IJ;'2 = El~- ~12 = Wlz-

(4)

J1 1(~,1];)12

(5)

(compare (1.4.17) and (8.13)). Inequality (3) also implies Bessel's inequality: if M = {1] 1, 1] 2, ... } an orthonormal system and ~ E L 2 , then 00

I

i= 1

1(~,

'1;)1 2

IS

Wl 2 ;

(6)

(~, IJ;)}I;.

(7)

~

and equality is attained if and only if n

~ = l.i.m. n

I

i= 1

The best linear estimator of~ is often denoted byE(~ I1] 1, ••. , IJn) and called the conditional expectation (of~ with respect to 1] 1, ..• , IJn) in the wide sense. The reason for the terminology is as follows. If we consider all estimators cp = cp(1J 1, ..• , 1'/n) of ( in terms of 17 t> ... , 1'/n (where cp is a Borel function), the best estimator will be cp* = E(( Ii'J 1, ••• , 1'/n), i.e. the conditional expectation of ( with respect to 1] 1 , ••• , 1'/n (cf. Theorem 1, §8). Hence the best linear estimator is, by analogy, denoted by E(~l1'f 1 , . . . , 1'/n) and called the conditional expectation in the wide sense. We note that if 'lt> ... , 1'/n form a Gaussian system (see §13 below), then E((I1J 1 , ... ,1Jn) and E((I1J 1 , ... ,1Jn) are the same. Let us discuss the geometric meaning of~ = E(~ I'lll ... , IJn). Let !E = !E{1] 1, •.• , IJn} denote the linear manifold spanned by the orthonormal system of random variables 1] 1 , ... , 'ln (i.e., the set of random variables of the form Lf= 1 a;IJ;, a; E R). Then it follows from the preceding discussion that ~ admits the "orthogonal decomposition" (8)

where ~ E !E and ~ - ~ .l !E in the sense that ~ - ~ .ill for every ll E !E. It is natural to call ~ the projection of ( on !E (the element of !t' "closest" to ~),and to say that~ - ~is perpendicular to !E. 4. The concept of orthonormality of the random variables '1~> ... , 'ln makes it easy to find the best linear estimator (the projection) ~ of ~ in terms of

§11. The Hilbert Space of Random Variables with Finite Second Moment

263

'7 1, .•. , 'ln· The situation becomes complicated if we give up the hypothesis of orthonormality. However, the case of arbitrary '7 1, ... , 'ln can in a certain sense be reduced to the case of orthonormal random variables, as will be shown below. We shall suppose for the sake of simplicity that all our random variables have zero mean values. We shall say that the random variables 'It> ... , 'In are linearly independent if the equation n

L a;'l; =

0 (P-a.s.)

i= 1

is satisfied only when all a; are zero. Consider the covariance matrix ~ =E'7'7T

of the vector '7 = ('7 1, ... , 'In). It is symmetric and nonnegative definite, and as noticed in §8, can be diagonalized by an orthogonal matrix l!J: l!JT~l!J =

D,

where

0)

D=(d1 .. 0 · dn

has nonnegative elements d;, the eigenvalues of ~' i.e. the zeros A. of the characteristic equation det(~ - A.E) = 0. If '1 b ... , 'ln are linearly independent, the Gram determinant (det ~) is not zero and therefore d; > 0. Let

and

(Ft·.·Jd,.

B

=

f3

= B-ll!JT'l·

0

0 ) (9)

Then the covariance matrix of f3 is Ef3f3T = B-1l!.JTE'7'7Tl!.JB-1 =

and therefore f3 = (/3 1 , It is also clear that

••• ,

B-1l!JT~l!JB-1

= E,

f3n) consists of uncorrelated random variables.

'I= (l!.JB)/3.

(10)

Consequently if '7 1, •.. , 'ln are linearly independent there is an orthonormal system such that (9) and (10) hold. Here

.!l'{'lb · .. ' '7n} = .!l'{/31, ·"' /3n}. This method of constructing an orthonormal system f3 to ••• , Pn is frequently inconvenient. The reason is that if we think of '7; as the value of the random sequence ('7 1 , ..• , 'ln) at the instant i, the value /3; constructed above

264

II. Mathematical Foundations of Probability Theory

depends not only on the "past," (Yf 1 , ••• , Y/;), but also on the "future," (Y/;+ t> ••• , Yfn). The Gram-Schmidt orthogonalization process, described below, does not have this defect, and moreover has the advantage that it can be applied to an infinite sequence of linearly independent random variables (i.e. to a sequence in which every finite set of the variables are linearly independent). Let Yft> Yf 2, .•. be a sequence of linearly independent random variables in L 2 • We construct a sequence e1,e 2 , •.. as follows. Let e1 = Yft!IIY/ 1 11. If e1 , •.• , e"_ 1 have been selected so that they are orthonormal, then e

n

=

Yfn- ~n

(11)

IIYfn - ~nil'

where ~" is the projection of Yfn on the linear manifold !l'(e 1 , generated by

••• ,

en_ 1 )

n-1

~" =

L (Yfn, ek)ek.

(12)

k=1

Since Yft>···•Yfn are linearly independent and !l'{Yf 1 , .•. ,Yfn-d = !l'{e 1 , •.• , en- 1 }, we have IIYfn- ~nil > 0 and consequently en is well defined. By construction, lien II = 1 for n ~ 1, and it is clear that (en, ek) = 0 for k < n. Hence the sequence e1, e2 , ••• is orthonormal. Moreover, by (11), where b" = II Yin - ~nil and ~n is defined by (12). Now let Yf 1, ••• , Yfn be any set of random variables (not necessarily linearly llriill is the covariance matrix of independent). Let det ~ = 0, where ~ (17 1, ... , Yfn), and let

=

rank~=r
Then, from linear algebra, the quadratic form n

Q(a) =

L riiaiai,

i,j= 1

has the property that there are n - r linearly independent vectors a(l>, ... , a(n-r) such that Q(dil) = 0, i = 1, ... , n- r. But

Consequently n

L a~>'1k =

k= 1

with probability 1.

0,

i = 1, ... , n- r,

§II. The Hilbert Space of Random Variables with Finite Second Moment

265

In other words, there are n - r linear relations among the variables '11> ... , '7n· Therefore if, for example, 17 1, ... , '7, are linearly independent, the other variables '7r+ 1, ... , '7n can be expressed linearly in terms of them, and consequently 2'{171> ... , '7n} = !l'{e 1, ... , e,}. Hence it is clear that we can find r orthonormal random variables et> ... , e, such that 17 1, ... , '7n can be expressed linearly in terms ofthem and 2'{17 1, ... , '7n} = !l'{et> ... , e,}. 5. Let '7 1, 17 2 , ••• be a sequence of random variables in L 2 • Let .!l = .!l{'7t> 17 2 , •• •} be the linear manifold spanned by 17 1, 17 2 , •• . ,i.e. the set of random variables of the form 1 a; I'/;. n ;:::: 1, a; e R. Then .!l = .!l{'7t> 17 2 , •• •} denotes the closed linear manifold spanned by 17 1,17 2, ... , i.e. the set of random variables in !l' together with their mean-square limits. We say that a set 17 1 , 17 2 , ••• is a countable orthonormal basis (or a complete orthonormal system) if:

Li'=

(a) '11> 17 2 , ••• is an orthonormal system, (b) .!l{'71• '12, .. .} = L 2. A Hilbert space with a countable orthonormal basis is said to be separable. By (b), for every~ eL 2 and a given e > 0 there are numbers a 1 , ... , an such that

Then by (3)

Consequently every element of a separable Hilbert space L 2 can be represented as 00

~ =

L <~. '7;). '7;,

(13)

i= 1

or more precisely as n

~ = l.i.m. n

L (~, '7)'7;·

i= 1

We infer from this and (3) that Parseval's equation holds:

Wl 2

L 1(~. '7;)1 00

=

i=1

2,

(14)

It is easy to show that the converse is also valid: if '7 1 , 17 2 , ••• is an orthonormal system and either (13) or (14) is satisfied, then the system is a basis. We now give some examples of separable Hilbert spaces and their bases.

266

II. Mathematical Foundations of Probability Theory

EXAMPLE 1. Let $\Omega = R$, $\mathscr{F} = \mathscr{B}(R)$, and let $P$ be the Gaussian measure,

$$P(-\infty, a] = \int_{-\infty}^{a} \varphi(x)\,dx, \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}.$$

Let $D = d/dx$ and

$$H_n(x) = \frac{(-1)^n D^n\varphi(x)}{\varphi(x)}, \qquad n \ge 0. \tag{15}$$

We find easily that

$$D\varphi(x) = -x\varphi(x), \qquad D^2\varphi(x) = (x^2 - 1)\varphi(x), \qquad D^3\varphi(x) = (3x - x^3)\varphi(x), \ldots \tag{16}$$

It follows that the $H_n(x)$ are polynomials (the Hermite polynomials). From (15) and (16) we find that

$$H_0(x) = 1, \qquad H_1(x) = x, \qquad H_2(x) = x^2 - 1, \qquad H_3(x) = x^3 - 3x, \ldots$$

A simple calculation shows that

$$(H_m, H_n) = \int_{-\infty}^{\infty} H_m(x)H_n(x)\,dP = \int_{-\infty}^{\infty} H_m(x)H_n(x)\varphi(x)\,dx = n!\,\delta_{mn},$$

where $\delta_{mn}$ is the Kronecker delta (0 if $m \ne n$, and 1 if $m = n$). Hence if we put

$$h_n(x) = \frac{H_n(x)}{\sqrt{n!}},$$

the system of normalized Hermite polynomials $\{h_n(x)\}_{n \ge 0}$ will be an orthonormal system.

We know from functional analysis that if

$$\lim_{c \downarrow 0} \int_{-\infty}^{\infty} e^{c|x|}\,P(dx) < \infty, \tag{17}$$

the system $\{1, x, x^2, \ldots\}$ is complete in $L^2$, i.e. every function $\xi = \xi(x)$ in $L^2$ can be represented either as $\sum_i a_i\eta_i(x)$, where $\eta_i(x) = x^i$, or as a limit of these functions (in the mean-square sense). If we apply the Gram-Schmidt orthogonalization process to the sequence $\eta_1(x), \eta_2(x), \ldots$, with $\eta_i(x) = x^i$, the resulting orthonormal system will be precisely the system of normalized Hermite polynomials.

In the present case, (17) is satisfied. Hence $\{h_n(x)\}_{n \ge 0}$ is a basis and therefore every random variable $\xi = \xi(x)$ on this probability space can be represented in the form

$$\xi(x) = \operatorname{l.i.m.}_n \sum_{i=0}^n (\xi, h_i)h_i(x). \tag{18}$$

EXAMPLE 2. Let $\Omega = \{0, 1, 2, \ldots\}$ and let $P = \{P_0, P_1, P_2, \ldots\}$ be the Poisson distribution

$$P_x = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, \ldots;\ \lambda > 0.$$

Put $\Delta f(x) = f(x) - f(x-1)$ ($f(x) = 0$, $x < 0$), and by analogy with (15) define the Poisson-Charlier polynomials

$$\Pi_n(x) = \frac{(-1)^n\Delta^n P_x}{P_x}, \qquad n \ge 1;\ \Pi_0 = 1. \tag{19}$$

Since

$$(\Pi_m, \Pi_n) = \sum_{x=0}^{\infty} \Pi_m(x)\Pi_n(x)P_x = c_n\delta_{mn},$$

where the $c_n$ are positive constants, the system of normalized Poisson-Charlier polynomials $\{\pi_n(x)\}_{n \ge 0}$, $\pi_n(x) = \Pi_n(x)/\sqrt{c_n}$, is an orthonormal system, which is a basis since it satisfies (17).

EXAMPLE 3. In this example we describe the Rademacher and Haar systems, which are of interest in function theory as well as in probability theory.

Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, and let $P$ be Lebesgue measure. As we mentioned in §1, every $x \in [0, 1]$ has a unique binary expansion

$$x = \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots,$$

where $x_i = 0$ or 1. To ensure uniqueness of the expansion, we agree to consider only expansions containing an infinite number of zeros. Thus we choose the first of the two expansions

-=-+-+-+ . 2 22 23 2 2 2 2 2 3 .. ·=-+-+-+"· We define random variables ~ 1 (x), ~ 2 (x), ... by putting

268

II. Mathematical Foundations of Probability Theory R 2 (x)

R 1(x) I I I I I I I

1.!.

0

I

X

2

0

I

1.!. 4

1.!. IJ. 2

I I

I

I

-1

~

X

4

I I

I I

I I

I I I I

I I I I I I I

I I I I I I I

I I

I I I

-1

~

I I I I I I I

I I

I I I

I I ..._...I ..._..

Figure 30

Then for any numbers $a_i$, equal to 0 or 1,

$$P\{x \colon \xi_1 = a_1, \ldots, \xi_n = a_n\} = P\Big\{x \colon x \in \Big[\frac{a_1}{2} + \cdots + \frac{a_n}{2^n},\ \frac{a_1}{2} + \cdots + \frac{a_n}{2^n} + \frac{1}{2^n}\Big]\Big\} = \frac{1}{2^n}.$$

It follows immediately that $\xi_1, \xi_2, \ldots$ form a sequence of independent Bernoulli random variables (Figure 30 shows the construction of $\xi_1 = \xi_1(x)$ and $\xi_2 = \xi_2(x)$).

If we now set $R_n(x) = 1 - 2\xi_n(x)$, $n \ge 1$, it is easily verified that $\{R_n\}$ (the Rademacher functions, Figure 31) are orthonormal:

$$ER_nR_m = \int_0^1 R_n(x)R_m(x)\,dx = \delta_{nm}.$$

Notice that $(1, R_n) = ER_n = 0$. It follows that this system is not complete. However, the Rademacher system can be used to construct the Haar system, which also has a simple structure and is both orthonormal and complete.

Notice that (1, R.) ER. = 0. It follows that this system is not complete. However, the Rademacher system can be used to construct the Haar system, which also has a simple structure and is both orthonormal and

complete.

~(x)

~(x)

I~

~

I I I I I

I I I I

I

I

I

I I

0

t

I I I

I I I

I I

I I I I

I

I

I

I I

I

I

I

X

0

*

I

I

r-; I I I

I I I

I

I I

t i

Figure 31. Rademacher functions.

I I I I I I I

I

I

X

269

§11. The Hilbert Space of Random Variables with Finite Second Moment

Again let $\Omega = [0, 1)$ and $\mathscr{F} = \mathscr{B}([0, 1))$. Put

$$H_1(x) = 1, \qquad H_2(x) = R_1(x),$$

and, for $n = 2^j + k$, $1 \le k \le 2^j$, $j \ge 1$,

$$H_n(x) = \begin{cases} 2^{j/2}R_{j+1}(x), & \dfrac{k-1}{2^j} \le x < \dfrac{k}{2^j}, \\[4pt] 0, & \text{otherwise.} \end{cases}$$

It is easy to see that $H_n(x)$ can also be written in the form

$$H_{2^m+1}(x) = \begin{cases} 2^{m/2}, & 0 \le x < 2^{-(m+1)}, \\ -2^{m/2}, & 2^{-(m+1)} \le x < 2^{-m}, \\ 0, & \text{otherwise,} \end{cases} \qquad m = 1, 2, \ldots,$$

$$H_{2^m+j}(x) = H_{2^m+1}\Big(x - \frac{j-1}{2^m}\Big), \qquad j = 1, \ldots, 2^m.$$

Figure 32 shows graphs of the first eight functions, to give an idea of the structure of the Haar functions.

Figure 32. The Haar functions $H_1(x), \ldots, H_8(x)$.

It is easy to see that the Haar system is orthonormal. Moreover, it is complete both in $L^1$ and in $L^2$, i.e. if $f = f(x) \in L^p$ for $p = 1$ or 2, then

$$\int_0^1 \Big|f(x) - \sum_{k=1}^n (f, H_k)H_k(x)\Big|^p\,dx \to 0, \qquad n \to \infty.$$

The system also has the property that

$$\sum_{k=1}^n (f, H_k)H_k(x) \to f(x), \qquad n \to \infty,$$

with probability 1 (with respect to Lebesgue measure).

In §4, Chapter VII, we shall prove these facts by deriving them from general theorems on the convergence of martingales. This will, in particular, provide a good illustration of the application of martingale methods to the theory of functions.
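The partial Haar sums above can be computed explicitly. This is my own sketch, not from the book: it implements the closed form of $H_n$ and approximates $f(x) = x$ by its first eight Haar coefficients (the partial sum then averages $f$ over dyadic intervals of length $1/8$):

```python
def haar(n, x):
    """Haar function H_n(x) on [0, 1), following the closed form above."""
    if n == 1:
        return 1.0
    j = (n - 1).bit_length() - 1       # n = 2**j + k with 1 <= k <= 2**j
    k = n - 2 ** j
    a, b = (k - 1) / 2 ** j, k / 2 ** j
    if not (a <= x < b):
        return 0.0
    return 2 ** (j / 2) if x < (a + b) / 2 else -2 ** (j / 2)

N = 1024
mid = [(2 * i + 1) / (2 * N) for i in range(N)]     # dyadic midpoints
f = lambda x: x
coef = [sum(f(x) * haar(n, x) for x in mid) / N for n in range(1, 9)]
S = lambda x: sum(c * haar(n, x) for n, c in zip(range(1, 9), coef))

err = max(abs(f(x) - S(x)) for x in mid)
print(err < 1 / 8)    # True: the partial sum is already within 1/8 of f
```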

6. If $\eta_1, \ldots, \eta_n$ is a finite orthonormal system then, as was shown above, for every random variable $\xi \in L^2$ there is a random variable $\hat\xi$ in the linear manifold $\mathscr{L} = \mathscr{L}\{\eta_1, \ldots, \eta_n\}$, namely the projection of $\xi$ on $\mathscr{L}$, such that

$$\|\xi - \hat\xi\| = \inf\{\|\xi - \zeta\| \colon \zeta \in \mathscr{L}\{\eta_1, \ldots, \eta_n\}\}.$$

Here $\hat\xi = \sum_{i=1}^n (\xi, \eta_i)\eta_i$. This result has a natural generalization to the case when $\eta_1, \eta_2, \ldots$ is a countable orthonormal system (not necessarily a basis). In fact, we have the following result.

Theorem. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system of random variables, and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by the system. Then there is a unique element $\hat\xi \in \overline{\mathscr{L}}$ such that

$$\|\xi - \hat\xi\| = \inf\{\|\xi - \zeta\| \colon \zeta \in \overline{\mathscr{L}}\}. \tag{20}$$

Moreover,

$$\hat\xi = \operatorname{l.i.m.}_n \sum_{i=1}^n (\xi, \eta_i)\eta_i \tag{21}$$

and $\xi - \hat\xi \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$.

PROOF. Let $d = \inf\{\|\xi - \zeta\| \colon \zeta \in \overline{\mathscr{L}}\}$ and choose a sequence $\zeta_1, \zeta_2, \ldots$ such that $\|\xi - \zeta_n\| \to d$. Let us show that this sequence is fundamental. A simple calculation shows that

$$\|\zeta_n - \zeta_m\|^2 = 2\|\zeta_n - \xi\|^2 + 2\|\zeta_m - \xi\|^2 - 4\Big\|\frac{\zeta_n + \zeta_m}{2} - \xi\Big\|^2.$$

It is clear that $(\zeta_n + \zeta_m)/2 \in \overline{\mathscr{L}}$; consequently $\|[(\zeta_n + \zeta_m)/2] - \xi\|^2 \ge d^2$ and therefore $\|\zeta_n - \zeta_m\|^2 \to 0$, $n, m \to \infty$.

The space $L^2$ is complete (Theorem 7, §10). Hence there is an element $\hat\xi$ such that $\|\zeta_n - \hat\xi\| \to 0$. But $\overline{\mathscr{L}}$ is closed, so $\hat\xi \in \overline{\mathscr{L}}$. Moreover, $\|\xi - \zeta_n\| \to d$, and consequently $\|\xi - \hat\xi\| = d$, which establishes the existence of the required element.

Let us show that $\hat\xi$ is the only element of $\overline{\mathscr{L}}$ with the required property. Let $\tilde\xi \in \overline{\mathscr{L}}$ and let

$$\|\xi - \tilde\xi\| = \|\xi - \hat\xi\| = d.$$

Then (by Problem 3)

$$\|\hat\xi + \tilde\xi - 2\xi\|^2 + \|\hat\xi - \tilde\xi\|^2 = 2\|\hat\xi - \xi\|^2 + 2\|\tilde\xi - \xi\|^2 = 4d^2.$$

But

$$\|\hat\xi + \tilde\xi - 2\xi\|^2 = 4\|\tfrac{1}{2}(\hat\xi + \tilde\xi) - \xi\|^2 \ge 4d^2.$$

Consequently $\|\hat\xi - \tilde\xi\|^2 = 0$. This establishes the uniqueness of the element of $\overline{\mathscr{L}}$ that is closest to $\xi$.

Now let us show that $\xi - \hat\xi \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$. By (20),

$$\|\xi - \hat\xi - c\zeta\| \ge \|\xi - \hat\xi\|$$

for every $c \in R$. But

$$\|\xi - \hat\xi - c\zeta\|^2 = \|\xi - \hat\xi\|^2 + c^2\|\zeta\|^2 - 2(\xi - \hat\xi, c\zeta).$$

Therefore

$$c^2\|\zeta\|^2 \ge 2(\xi - \hat\xi, c\zeta). \tag{22}$$

Take $c = \lambda(\xi - \hat\xi, \zeta)$, $\lambda \in R$. Then we find from (22) that

$$(\xi - \hat\xi, \zeta)^2\big[\lambda^2\|\zeta\|^2 - 2\lambda\big] \ge 0.$$

We have $\lambda^2\|\zeta\|^2 - 2\lambda < 0$ if $\lambda$ is a sufficiently small positive number. Consequently $(\xi - \hat\xi, \zeta) = 0$, $\zeta \in \overline{\mathscr{L}}$.

It remains only to prove (21). The set $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ is a closed subspace of $L^2$ and therefore a Hilbert space (with the same scalar product). Now the system $\eta_1, \eta_2, \ldots$ is a basis for $\overline{\mathscr{L}}$ (Problem 4), and consequently

$$\hat\xi = \operatorname{l.i.m.}_n \sum_{k=1}^n (\hat\xi, \eta_k)\eta_k. \tag{23}$$

But $\xi - \hat\xi \perp \eta_k$, $k \ge 1$, and therefore $(\hat\xi, \eta_k) = (\xi, \eta_k)$, $k \ge 1$. This, with (23), establishes (21). This completes the proof of the theorem.

~

=

is the orthogonal decomposition

~

+ (~-

of~.

~)

272

II. Mathematical Foundations of Probability Theory

We also denote~ by E(el'7 1 , 17 2 , •••) and call it the conditional expectation in the wide sense (of with respect to '11> 17 2 , •••). From the point of view of estimating in terms of '71> '72' the variable eis the optimal linear estimator, with error

e

e

0

ll

0

0

'

= e1e- ~1 2 = 11e- ~11 2 = 11e11 2

00

L l<e. '7i)l

-

2•

i=l

which follows from (5) and (23).

7. PROBLEMS

1. Show that if $\xi = \operatorname{l.i.m.} \xi_n$ then $\|\xi_n\| \to \|\xi\|$.

2. Show that if $\xi = \operatorname{l.i.m.} \xi_n$ and $\eta = \operatorname{l.i.m.} \eta_n$ then $(\xi_n, \eta_n) \to (\xi, \eta)$.

3. Show that the norm $\|\cdot\|$ has the parallelogram property

$$\|\xi + \eta\|^2 + \|\xi - \eta\|^2 = 2(\|\xi\|^2 + \|\eta\|^2).$$

4. Let $(\xi_1, \ldots, \xi_n)$ be a family of orthogonal random variables. Show that they have the Pythagorean property,

$$\Big\|\sum_{i=1}^n \xi_i\Big\|^2 = \sum_{i=1}^n \|\xi_i\|^2.$$

5. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by $\eta_1, \eta_2, \ldots$. Show that the system is a basis for the (Hilbert) space $\overline{\mathscr{L}}$.

6. Let $\xi_1, \xi_2, \ldots$ be a sequence of orthogonal random variables and $S_n = \xi_1 + \cdots + \xi_n$. Show that if $\sum_{n=1}^{\infty} E\xi_n^2 < \infty$ there is a random variable $S$ with $ES^2 < \infty$ such that $\operatorname{l.i.m.} S_n = S$, i.e. $\|S_n - S\|^2 = E|S_n - S|^2 \to 0$, $n \to \infty$.

7. Show that in the space $L^2 = L^2([-\pi, \pi], \mathscr{B}([-\pi, \pi]))$ with Lebesgue measure $\mu$ the system $\{(1/\sqrt{2\pi})e^{i\lambda n},\ n = 0, \pm 1, \ldots\}$ is an orthonormal basis.

§12. Characteristic Functions 1. The method of characteristic functions is one of the main tools of the analytic theory of probability. This will appear very clearly in Chapter III in the proofs of limit theorems and, in particular, in the proof of the central limit theorem, which generalizes the De Moivre-Laplace theorem. In the present section we merely define characteristic functions and present their basic properties. First we make some general remarks. Besides random variables which take real values, the theory of characteristic functions requires random variables that take complex values (see Subsection 1 of §5).

273

§12. Characteristic Functions

Many definitions and properties involving random variables can easily be carried over to the complex case. For example, the expectation E( of a complex random variable C= ~ + i17 will exist if the expectations E~ and E17 exist. In this case we define E( = E~ + iE17. It is easy to deduce from the definition of the independence of random elements (Definition 6, §5) that the complex random variables ( 1 = ~ 1 + i17 1 and ( 2 = ~ 2 + i17 2 are independent if and only if the pairs (~ 1 , 17 1 ) and (~ 2 , 17 2 ) are independent; or, equivalently, the a--algebras !l' ~ .. ~~ and !l' ~M 2 are independent. Besides the space L 2 of real random variables with finite second moment, we shall consider the Hilbert space of complex random variables C= ~ + i17 with EICI 2 < oo, where 1(1 2 = ~ 2 + 17 2 and the scalar product (( 1, ( 2 ) is defined by EC 1 C2 , where C2 is the complex conjugate of(. The term "random variable" will now be used for both real and complex random variables, with a comment (when necessary) on which is intended. Let us introduce some notation. We consider a vector a E R" to be a column vector,

$a = (a_1, \ldots, a_n)^T$, and $a^T$ to be a row vector, $a^T = (a_1, \ldots, a_n)$. If $a$ and $b \in R^n$, their scalar product $(a, b)$ is $\sum_{i=1}^n a_i b_i$. Clearly $(a, b) = a^T b$. If $a \in R^n$ and $\mathbb{R} = \|r_{ij}\|$ is an $n$ by $n$ matrix,
$$(\mathbb{R}a, a) = a^T \mathbb{R} a = \sum_{i,j=1}^n r_{ij} a_i a_j. \qquad (1)$$

2. Definition 1. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function in $(R^n, \mathscr{B}(R^n))$. Its characteristic function is
$$\varphi(t) = \int_{R^n} e^{i(t,x)}\, dF(x), \qquad t \in R^n. \qquad (2)$$

Definition 2. If $\xi = (\xi_1, \ldots, \xi_n)$ is a random vector defined on the probability space $(\Omega, \mathscr{F}, P)$ with values in $R^n$, its characteristic function is
$$\varphi_\xi(t) = \int_{R^n} e^{i(t,x)}\, dF_\xi(x), \qquad t \in R^n,$$
where $F_\xi = F_\xi(x_1, \ldots, x_n)$ is the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$.

If $F(x)$ has a density $f = f(x)$ then
$$\varphi(t) = \int_{R^n} e^{i(t,x)} f(x)\, dx. \qquad (3)$$


II. Mathematical Foundations of Probability Theory

In other words, in this case the characteristic function is just the Fourier transform of $f(x)$.
It follows from (3) and Theorem 6.7 (on change of variable in a Lebesgue integral) that the characteristic function $\varphi_\xi(t)$ of a random vector can also be defined by
$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t \in R^n. \qquad (4)$$
We now present some basic properties of characteristic functions, stated and proved for $n = 1$. Further important results for the general case will be given as problems.
Let $\xi = \xi(\omega)$ be a random variable, $F_\xi = F_\xi(x)$ its distribution function, and $\varphi_\xi(t) = Ee^{it\xi}$ its characteristic function. We see at once that if $\eta = a\xi + b$ then
$$\varphi_\eta(t) = Ee^{it\eta} = Ee^{it(a\xi+b)} = e^{itb}Ee^{iat\xi}.$$
Therefore
$$\varphi_\eta(t) = e^{itb}\varphi_\xi(at). \qquad (5)$$
Moreover, if $\xi_1, \xi_2, \ldots, \xi_n$ are independent random variables and $S_n = \xi_1 + \cdots + \xi_n$, then
$$\varphi_{S_n}(t) = \prod_{i=1}^n \varphi_{\xi_i}(t). \qquad (6)$$
In fact,
$$\varphi_{S_n}(t) = Ee^{it(\xi_1 + \cdots + \xi_n)} = Ee^{it\xi_1}\cdots Ee^{it\xi_n} = \prod_{j=1}^n \varphi_{\xi_j}(t),$$

where we have used the property that the expectation of a product of independent (bounded) random variables (either real or complex; see Theorem 6 of §6, and Problem 1) is equal to the product of their expectations.
Property (6) is the key to the proofs of limit theorems for sums of independent random variables by the method of characteristic functions (see §3, Chapter III). In this connection we note that the distribution function $F_{S_n}$ is expressed in terms of the distribution functions of the individual terms in a rather complicated way, namely $F_{S_n} = F_{\xi_1} * \cdots * F_{\xi_n}$, where $*$ denotes convolution (see §8, Subsection 4).
Here are some examples of characteristic functions.

Example 1. Let $\xi$ be a Bernoulli random variable with $P(\xi = 1) = p$, $P(\xi = 0) = q$, $p + q = 1$, $1 > p > 0$; then
$$\varphi_\xi(t) = pe^{it} + q.$$


If $\xi_1, \ldots, \xi_n$ are independent identically distributed random variables like $\xi$, then, writing $T_n = (S_n - np)/\sqrt{npq}$, we have
$$\varphi_{T_n}(t) = Ee^{iT_n t} = e^{-itnp/\sqrt{npq}}\left[pe^{it/\sqrt{npq}} + q\right]^n = \left[pe^{it\sqrt{q/(np)}} + qe^{-it\sqrt{p/(nq)}}\right]^n. \qquad (7)$$
Notice that it follows that as $n \to \infty$
$$\varphi_{T_n}(t) \to e^{-t^2/2}, \qquad T_n = \frac{S_n - np}{\sqrt{npq}}. \qquad (8)$$
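The convergence in (8) is easy to observe numerically. The sketch below (ours, not the book's; function name and parameter values are arbitrary choices) evaluates the exact characteristic function (7) of $T_n$ for a large $n$ and compares it with the limit $e^{-t^2/2}$.

```python
import cmath
import math

def phi_Tn(t, n, p):
    # exact characteristic function (7) of T_n = (S_n - np)/sqrt(npq)
    q = 1 - p
    return (p * cmath.exp(1j * t * math.sqrt(q / (n * p)))
            + q * cmath.exp(-1j * t * math.sqrt(p / (n * q)))) ** n

for t in (0.5, 1.0, 2.0):
    limit = math.exp(-t * t / 2)          # right-hand side of (8)
    approx = phi_Tn(t, 100_000, 0.3)
    print(t, abs(approx - limit))         # small for large n
```

The discrepancy decreases like $n^{-1/2}$, in line with the De Moivre-Laplace theorem that (8) generalizes.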

Example 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$, $|m| < \infty$, $\sigma^2 > 0$. Let us show that
$$\varphi_\xi(t) = e^{itm - t^2\sigma^2/2}. \qquad (9)$$
Let $\eta = (\xi - m)/\sigma$. Then $\eta \sim \mathscr{N}(0, 1)$ and, since
$$\varphi_\xi(t) = e^{itm}\varphi_\eta(t\sigma)$$
by (5), it is enough to show that
$$\varphi_\eta(t) = e^{-t^2/2}. \qquad (10)$$
We have
$$\varphi_\eta(t) = Ee^{it\eta} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx}e^{-x^2/2}\, dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\sum_{n=0}^{\infty}\frac{(itx)^n}{n!}\,e^{-x^2/2}\, dx = \sum_{n=0}^{\infty}\frac{(it)^n}{n!}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n e^{-x^2/2}\, dx = \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}\,(2n-1)!! = \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}\cdot\frac{(2n)!}{2^n n!} = \sum_{n=0}^{\infty}\frac{(-t^2/2)^n}{n!} = e^{-t^2/2},$$
where we have used the formula (see Problem 7 in §8)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{2n} e^{-x^2/2}\, dx = (2n-1)!! = \frac{(2n)!}{2^n n!}$$
(the integrals of the odd powers vanish).
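As a sanity check on (9), one can approximate $Ee^{it\xi}$ for $\xi \sim \mathscr{N}(m, \sigma^2)$ by integrating $e^{itx}$ against the density numerically. The helper below is a sketch of ours (a midpoint Riemann sum over $m \pm 10\sigma$, which is ample for the Gaussian tail); the names and step counts are arbitrary.

```python
import cmath
import math

def phi_normal_numeric(t, m, sigma, half_width=10.0, steps=200_000):
    # approximate E e^{it xi} = \int e^{itx} f(x) dx for xi ~ N(m, sigma^2)
    a, b = m - half_width * sigma, m + half_width * sigma
    h = (b - a) / steps
    total = 0j
    for k in range(steps):
        x = a + (k + 0.5) * h
        f = math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += cmath.exp(1j * t * x) * f * h
    return total

t, m, sigma = 0.7, 1.0, 2.0
exact = cmath.exp(1j * t * m - t * t * sigma * sigma / 2)   # formula (9)
print(abs(phi_normal_numeric(t, m, sigma) - exact))          # essentially zero
```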

Example 3. Let $\xi$ be a Poisson random variable,
$$P(\xi = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, \ldots.$$
Then
$$\varphi_\xi(t) = Ee^{it\xi} = \sum_{k=0}^{\infty} e^{itk}\,\frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{(\lambda e^{it})^k}{k!} = \exp\{\lambda(e^{it} - 1)\}. \qquad (11)$$


3. As we observed in §9, Subsection 1, with every distribution function in $(R, \mathscr{B}(R))$ we can associate a random variable of which it is the distribution function. Hence in discussing the properties of characteristic functions (in the sense either of Definition 1 or Definition 2), we may consider only characteristic functions $\varphi(t) = \varphi_\xi(t)$ of random variables $\xi = \xi(\omega)$.

Theorem 1. Let $\xi$ be a random variable with distribution function $F = F(x)$ and $\varphi(t) = Ee^{it\xi}$ its characteristic function. Then $\varphi$ has the following properties:
(1) $|\varphi(t)| \le \varphi(0) = 1$;
(2) $\varphi(t)$ is uniformly continuous for $t \in R$;
(3) $\varphi(t) = \overline{\varphi(-t)}$;
(4) $\varphi(t)$ is real-valued if and only if $F$ is symmetric;
(5) if $E|\xi|^n < \infty$ for some $n \ge 1$, then $\varphi^{(r)}(t)$ exists for every $r \le n$, and
$$\varphi^{(r)}(t) = \int_R (ix)^r e^{itx}\, dF(x), \qquad (12)$$
$$E\xi^r = \frac{\varphi^{(r)}(0)}{i^r}, \qquad (13)$$
$$\varphi(t) = \sum_{r=0}^{n}\frac{(it)^r}{r!}\,E\xi^r + \frac{(it)^n}{n!}\,\varepsilon_n(t), \qquad (14)$$
where $|\varepsilon_n(t)| \le 3E|\xi|^n$ and $\varepsilon_n(t) \to 0$, $t \to 0$;
(6) if $\varphi^{(2n)}(0)$ exists and is finite then $E\xi^{2n} < \infty$;
(7) if $E|\xi|^n < \infty$ for all $n \ge 1$ and
$$\overline{\lim_n}\,\frac{(E|\xi|^n)^{1/n}}{n} = \frac{1}{R} < \infty,$$
then
$$\varphi(t) = \sum_{n=0}^{\infty}\frac{(it)^n}{n!}\,E\xi^n \qquad (15)$$

for all $|t| < R$.
PROOF. Properties (1) and (3) are evident. Property (2) follows from the inequality
$$|\varphi(t + h) - \varphi(t)| = |Ee^{it\xi}(e^{ih\xi} - 1)| \le E|e^{ih\xi} - 1|$$
and the dominated convergence theorem, according to which $E|e^{ih\xi} - 1| \to 0$, $h \to 0$.
Property (4). Let $F$ be symmetric. Then if $g(x)$ is a bounded odd Borel function, we have $\int_R g(x)\, dF(x) = 0$ (observe that for simple odd functions


this follows directly from the definition of the symmetry of $F$). Consequently $\int_R \sin tx\, dF(x) = 0$ and therefore
$$\varphi(t) = E\cos t\xi.$$
Conversely, let $\varphi_\xi(t)$ be a real function. Then by (3)
$$\varphi_{-\xi}(t) = \varphi_\xi(-t) = \overline{\varphi_\xi(t)} = \varphi_\xi(t), \qquad t \in R.$$
Hence (as will be shown below in Theorem 2) the distribution functions $F_{-\xi}$ and $F_\xi$ of the random variables $-\xi$ and $\xi$ are the same, and therefore (by Theorem 3.1)
$$P(\xi \in B) = P(-\xi \in B) = P(\xi \in -B)$$

for every $B \in \mathscr{B}(R)$, i.e. $F$ is symmetric.
Property (5). If $E|\xi|^n < \infty$, we have $E|\xi|^r < \infty$ for $r \le n$, by Lyapunov's inequality (6.28). Consider the difference quotient
$$\frac{\varphi(t + h) - \varphi(t)}{h} = Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right).$$
Since
$$\left|\frac{e^{ihx} - 1}{h}\right| \le |x|$$
and $E|\xi| < \infty$, it follows from the dominated convergence theorem that the limit
$$\lim_{h\to 0} Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right)$$
exists and equals
$$iE(\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\, dF(x). \qquad (16)$$
Hence $\varphi'(t)$ exists and
$$\varphi'(t) = i(E\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\, dF(x).$$

The existence of the derivatives $\varphi^{(r)}(t)$, $1 < r \le n$, and the validity of (12), follow by induction. Formula (13) follows immediately from (12). Let us now establish (14). Since
$$e^{iy} = \cos y + i\sin y = \sum_{k=0}^{n-1}\frac{(iy)^k}{k!} + \frac{(iy)^n}{n!}\left[\cos\theta_1 y + i\sin\theta_2 y\right]$$

278

II. Mathematical Foundations of Probability Theory

for real $y$, with $|\theta_1| \le 1$ and $|\theta_2| \le 1$, we have
$$e^{it\xi} = \sum_{k=0}^{n-1}\frac{(it\xi)^k}{k!} + \frac{(it\xi)^n}{n!}\left[\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi\right] \qquad (17)$$
and
$$\varphi(t) = \sum_{k=0}^{n}\frac{(it)^k}{k!}\,E\xi^k + \frac{(it)^n}{n!}\,\varepsilon_n(t), \qquad (18)$$
where
$$\varepsilon_n(t) = E\left[\xi^n(\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi - 1)\right].$$

It is clear that $|\varepsilon_n(t)| \le 3E|\xi|^n$. The theorem on dominated convergence shows that $\varepsilon_n(t) \to 0$, $t \to 0$.
Property (6). We give a proof by induction. Suppose first that $\varphi''(0)$ exists and is finite. Let us show that in that case $E\xi^2 < \infty$. By L'Hôpital's rule and Fatou's lemma,
$$\varphi''(0) = \lim_{h\to 0}\frac{1}{2}\left[\frac{\varphi'(2h) - \varphi'(0)}{2h} + \frac{\varphi'(0) - \varphi'(-2h)}{2h}\right] = \lim_{h\to 0}\frac{\varphi'(2h) - \varphi'(-2h)}{4h} = \lim_{h\to 0}\frac{1}{4h^2}\left[\varphi(2h) - 2\varphi(0) + \varphi(-2h)\right] = \lim_{h\to 0}\int_{-\infty}^{\infty}\left(\frac{e^{ihx} - e^{-ihx}}{2h}\right)^2 dF(x) = -\lim_{h\to 0}\int_{-\infty}^{\infty}\left(\frac{\sin hx}{h}\right)^2 dF(x) \le -\int_{-\infty}^{\infty}\lim_{h\to 0}\left(\frac{\sin hx}{hx}\right)^2 x^2\, dF(x) = -\int_{-\infty}^{\infty} x^2\, dF(x).$$
Therefore
$$\int_{-\infty}^{\infty} x^2\, dF(x) \le -\varphi''(0) < \infty.$$

Now let $\varphi^{(2k+2)}(0)$ exist and be finite, and let $\int_{-\infty}^{\infty} x^{2k}\, dF(x) < \infty$. If $\int_{-\infty}^{\infty} x^{2k}\, dF(x) = 0$, then $\int_{-\infty}^{\infty} x^{2k+2}\, dF(x) = 0$ also. Hence we may suppose that $\int_{-\infty}^{\infty} x^{2k}\, dF(x) > 0$. Then, by Property (5),
$$\varphi^{(2k)}(t) = \int_{-\infty}^{\infty} (ix)^{2k} e^{itx}\, dF(x)$$
and therefore
$$(-1)^k\varphi^{(2k)}(t) = \int_{-\infty}^{\infty} e^{itx}\, dG(x),$$
where $G(x) = \int_{-\infty}^{x} u^{2k}\, dF(u)$.


Consequently the function $(-1)^k\varphi^{(2k)}(t)\,G^{-1}(\infty)$ is the characteristic function of the probability distribution $G(x)\cdot G^{-1}(\infty)$, and by what we have proved,
$$G^{-1}(\infty)\int_{-\infty}^{\infty} x^2\, dG(x) < \infty.$$
But $G^{-1}(\infty) > 0$, and therefore
$$\int_{-\infty}^{\infty} x^{2k+2}\, dF(x) = \int_{-\infty}^{\infty} x^2\, dG(x) < \infty.$$

Property (7). Let $0 < t_0 < R$. Then, by Stirling's formula, we find that the series $\sum_n [E|\xi|^n\, t_0^n/n!]$ converges by Cauchy's test, and consequently the series $\sum_{r=0}^{\infty}[(it)^r/r!]\,E\xi^r$ converges for $|t| \le t_0$. But by (14), for $n \ge 1$,
$$\varphi(t) = \sum_{r=0}^{n}\frac{(it)^r}{r!}\,E\xi^r + R_n(t), \qquad R_n(t) = \frac{(it)^n}{n!}\,\varepsilon_n(t),$$
and $|R_n(t)| \le 3|t|^n E|\xi|^n/n! \to 0$, $n \to \infty$, for $|t| \le t_0$. Therefore
$$\varphi(t) = \sum_{r=0}^{\infty}\frac{(it)^r}{r!}\,E\xi^r$$
for all $|t| < R$. This completes the proof of the theorem.
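Property (5) is easy to test numerically: by (13), $E\xi^r = \varphi^{(r)}(0)/i^r$. The sketch below (ours, not the book's) approximates the first two derivatives of the Poisson characteristic function (11) by central differences and recovers $E\xi = \lambda$ and $E\xi^2 = \lambda + \lambda^2$.

```python
import cmath

lam = 2.0

def phi(t):
    # Poisson characteristic function (11): exp{lambda(e^{it} - 1)}
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)             # ~ phi'(0)  = i * E(xi)
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h**2   # ~ phi''(0) = i^2 * E(xi^2)
print(d1 / 1j)        # ~ E(xi)   = lambda
print(-d2)            # ~ E(xi^2) = lambda + lambda^2
```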

Remark 1. By a method similar to that used for (14), we can establish that if $E|\xi|^n < \infty$ for some $n \ge 1$, then
$$\varphi(t) = \sum_{k=0}^{n}\frac{i^k(t-s)^k}{k!}\int_{-\infty}^{\infty} x^k e^{isx}\, dF(x) + \frac{i^n(t-s)^n}{n!}\,\varepsilon_n(t-s), \qquad (19)$$
where $|\varepsilon_n(t-s)| \le 3E|\xi|^n$ and $\varepsilon_n(t-s) \to 0$ as $t - s \to 0$.

Remark 2. With reference to the condition that appears in Property (7), see also Subsection 9, below, on the "uniqueness of the solution of the moment problem."
4. The following theorem shows that a distribution function is uniquely determined by its characteristic function.


Figure 33. The function $f^\varepsilon = f^\varepsilon(x)$: a continuous, piecewise-linear approximation to the indicator of $(a, b]$, equal to 1 between $a$ and $b$, equal to 0 outside the $\varepsilon$-neighborhood $[a - \varepsilon, b + \varepsilon]$, and linear in between.

Theorem 2 (Uniqueness). Let $F$ and $G$ be distribution functions with the same characteristic function, i.e.
$$\int_{-\infty}^{\infty} e^{itx}\, dF(x) = \int_{-\infty}^{\infty} e^{itx}\, dG(x) \qquad (20)$$
for all $t \in R$. Then $F(x) = G(x)$.

PROOF. Choose $a$ and $b \in R$, and $\varepsilon > 0$, and consider the function $f^\varepsilon = f^\varepsilon(x)$ shown in Figure 33. We show that
$$\int_{-\infty}^{\infty} f^\varepsilon(x)\, dF(x) = \int_{-\infty}^{\infty} f^\varepsilon(x)\, dG(x). \qquad (21)$$
Let $n \ge 0$ be large enough so that $[a - \varepsilon, b + \varepsilon] \subseteq [-n, n]$, and let the sequence $\{\delta_n\}$ be such that $1 \ge \delta_n \downarrow 0$, $n \to \infty$. Like every continuous function on $[-n, n]$ that has equal values at the endpoints, $f^\varepsilon = f^\varepsilon(x)$ can be uniformly approximated by trigonometric polynomials (Weierstrass's theorem), i.e. there is a finite sum
$$f_n^\varepsilon(x) = \sum_k a_k e^{i\pi kx/n} \qquad (22)$$
such that
$$\sup_{-n \le x \le n}|f^\varepsilon(x) - f_n^\varepsilon(x)| \le \delta_n.$$
Let us extend the periodic function $f_n^\varepsilon(x)$ to all of $R$, and observe that
$$\sup_x |f_n^\varepsilon(x)| \le 2.$$
Then, since by (20)
$$\int_{-\infty}^{\infty} f_n^\varepsilon(x)\, dF(x) = \int_{-\infty}^{\infty} f_n^\varepsilon(x)\, dG(x), \qquad (23)$$


we have
$$\left|\int_{-\infty}^{\infty} f^\varepsilon\, dF - \int_{-\infty}^{\infty} f^\varepsilon\, dG\right| = \left|\int_{[-n,n]} f^\varepsilon\, dF - \int_{[-n,n]} f^\varepsilon\, dG\right| \le \left|\int_{[-n,n]} f_n^\varepsilon\, dF - \int_{[-n,n]} f_n^\varepsilon\, dG\right| + 2\delta_n \le \left|\int_{-\infty}^{\infty} f_n^\varepsilon\, dF - \int_{-\infty}^{\infty} f_n^\varepsilon\, dG\right| + 2\delta_n + 2F(R\setminus[-n,n]) + 2G(R\setminus[-n,n]), \qquad (24)$$
where $F(A) = \int_A dF(x)$, $G(A) = \int_A dG(x)$. As $n \to \infty$, the right-hand side of (24) tends to zero, and this establishes (21).
As $\varepsilon \to 0$, we have $f^\varepsilon(x) \to I_{(a,b]}(x)$. It follows from (21) by the dominated convergence theorem that
$$\int_{-\infty}^{\infty} I_{(a,b]}(x)\, dF(x) = \int_{-\infty}^{\infty} I_{(a,b]}(x)\, dG(x),$$
i.e. $F(b) - F(a) = G(b) - G(a)$. Since $a$ and $b$ are arbitrary, it follows that $F(x) = G(x)$ for all $x \in R$. This completes the proof of the theorem.

5. The preceding theorem says that a distribution function $F = F(x)$ is uniquely determined by its characteristic function $\varphi = \varphi(t)$. The following theorem gives an explicit representation of $F$ in terms of $\varphi$.

Theorem 3 (Inversion Formula). Let $F = F(x)$ be a distribution function and
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$$
its characteristic function.
(a) For any pair of points $a$ and $b$ ($a < b$) at which $F = F(x)$ is continuous,
$$F(b) - F(a) = \lim_{c\to\infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt. \qquad (25)$$
(b) If $\int_{-\infty}^{\infty}|\varphi(t)|\, dt < \infty$, the distribution function $F(x)$ has a density $f(x)$,
$$F(x) = \int_{-\infty}^{x} f(y)\, dy \qquad (26)$$
and
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt. \qquad (27)$$

PROOF.

We first observe that if $F(x)$ has density $f(x)$ then
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx, \qquad (28)$$
and (27) is just the Fourier transform of the (integrable) function $\varphi(t)$. Integrating both sides of (27) and applying Fubini's theorem, we obtain
$$F(b) - F(a) = \int_a^b f(x)\, dx = \frac{1}{2\pi}\int_a^b\left[\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt\right] dx = \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\left[\int_a^b e^{-itx}\, dx\right] dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\,\frac{e^{-ita} - e^{-itb}}{it}\, dt.$$

After these remarks, which to some extent clarify (25), we turn to the proof.
(a) We have
$$\Phi_c \equiv \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt = \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\left[\int_{-\infty}^{\infty} e^{itx}\, dF(x)\right] dt = \int_{-\infty}^{\infty}\Psi_c(x)\, dF(x), \qquad (29)$$
where we have put
$$\Psi_c(x) = \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\, dt$$
and applied Fubini's theorem, which is applicable in this case because
$$\left|\frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\right| = \left|\frac{e^{-ita} - e^{-itb}}{it}\right| = \left|\int_a^b e^{-itx}\, dx\right| \le b - a$$
and
$$\int_{-c}^{c}\int_{-\infty}^{\infty}(b - a)\, dF(x)\, dt \le 2c(b - a) < \infty.$$


In addition,
$$\Psi_c(x) = \frac{1}{2\pi}\int_{-c}^{c}\frac{\sin t(x-a) - \sin t(x-b)}{t}\, dt = \frac{1}{2\pi}\int_{-c(x-a)}^{c(x-a)}\frac{\sin v}{v}\, dv - \frac{1}{2\pi}\int_{-c(x-b)}^{c(x-b)}\frac{\sin v}{v}\, dv. \qquad (30)$$
The function
$$g(s, t) = \int_s^t \frac{\sin v}{v}\, dv$$
is uniformly continuous in $s$ and $t$, and
$$g(s, t) \to \pi \qquad (31)$$
as $s \downarrow -\infty$ and $t \uparrow \infty$. Hence there is a constant $C$ such that $|\Psi_c(x)| < C < \infty$ for all $c$ and $x$. Moreover, it follows from (30) and (31) that
$$\Psi_c(x) \to \Psi(x), \qquad c \to \infty,$$
where
$$\Psi(x) = \begin{cases} 0, & x < a,\ x > b, \\ \tfrac{1}{2}, & x = a,\ x = b, \\ 1, & a < x < b. \end{cases}$$
Let $\mu$ be a measure on $(R, \mathscr{B}(R))$ such that $\mu(a, b] = F(b) - F(a)$. Then if we apply the dominated convergence theorem and use the formulas of Problem 1 of §3, we find that, as $c \to \infty$,
$$\Phi_c = \int_{-\infty}^{\infty}\Psi_c(x)\, dF(x) \to \int_{-\infty}^{\infty}\Psi(x)\, dF(x) = \mu(a, b) + \tfrac{1}{2}\mu\{a\} + \tfrac{1}{2}\mu\{b\} = F(b-) - F(a) + \tfrac{1}{2}\left[F(a) - F(a-) + F(b) - F(b-)\right] = \frac{F(b) + F(b-)}{2} - \frac{F(a) + F(a-)}{2} = F(b) - F(a),$$
where the last equation holds for all points $a$ and $b$ of continuity of $F(x)$. Hence (25) is established.
(b) Let $\int_{-\infty}^{\infty}|\varphi(t)|\, dt < \infty$ and put
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt.$$


It follows from the dominated convergence theorem that this is a continuous function of $x$ and therefore is integrable on $[a, b]$. Consequently we find, applying Fubini's theorem again, that
$$\int_a^b f(x)\, dx = \frac{1}{2\pi}\int_a^b\left[\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt\right] dx = \lim_{c\to\infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt = F(b) - F(a)$$
for all points $a$ and $b$ of continuity of $F(x)$. Hence it follows that
$$F(x) = \int_{-\infty}^{x} f(y)\, dy, \qquad x \in R,$$
and since $f(x)$ is continuous and $F(x)$ is nondecreasing, $f(x)$ is the density of $F(x)$. This completes the proof of the theorem.
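The inversion formula (27) lends itself to direct numerical experiment. The sketch below (ours; the truncation bound and step count are arbitrary choices) recovers the Cauchy density $1/(\pi(1+x^2))$ from its characteristic function $\varphi(t) = e^{-|t|}$, which also appears below as Pólya's example $\varphi_1$.

```python
import math

def density_from_cf(x, cf, T=40.0, steps=300_000):
    # inversion formula (27): f(x) = (1/2pi) \int e^{-itx} phi(t) dt,
    # truncated to [-T, T]; for a real even phi this reduces to a cosine integral
    h = T / steps
    s = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        s += cf(t) * math.cos(t * x) * h
    return s / math.pi

# phi(t) = e^{-|t|} is the characteristic function of the Cauchy density 1/(pi(1+x^2))
for x in (0.0, 1.0, 2.5):
    print(x, density_from_cf(x, lambda t: math.exp(-abs(t))), 1 / (math.pi * (1 + x * x)))
```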

Corollary. The inversion formula (25) provides a second proof of Theorem 2.

Theorem 4. A necessary and sufficient condition for the components of the random vector $\xi = (\xi_1, \ldots, \xi_n)$ to be independent is that its characteristic function is the product of the characteristic functions of the components:
$$Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \prod_{k=1}^{n} Ee^{it_k\xi_k}, \qquad (t_1, \ldots, t_n) \in R^n.$$
PROOF. The necessity follows from Problem 1. To prove the sufficiency we let $F(x_1, \ldots, x_n)$ be the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$ and $F_k(x)$ the distribution functions of the $\xi_k$, $1 \le k \le n$. Put $G = G(x_1, \ldots, x_n) = F_1(x_1)\cdots F_n(x_n)$. Then, by Fubini's theorem, for all $(t_1, \ldots, t_n) \in R^n$,
$$\int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\, dG(x_1, \ldots, x_n) = \prod_{k=1}^{n} Ee^{it_k\xi_k} = Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\, dF(x_1, \ldots, x_n).$$
Therefore by Theorem 2 (or rather, by its multidimensional analog; see Problem 3) we have $F = G$, and consequently, by the theorem of §5, the random variables $\xi_1, \ldots, \xi_n$ are independent.
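Theorem 4 can be illustrated on a small discrete example: for a vector with independent components, the joint characteristic function computed from the joint distribution factorizes exactly. The two marginal distributions below are arbitrary choices of ours.

```python
import cmath

# joint pmf of (xi1, xi2) with independent components:
# xi1 uniform on {0, 1}, xi2 uniform on {0, 1, 2}
p1 = {0: 0.5, 1: 0.5}
p2 = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
joint = {(x, y): p1[x] * p2[y] for x in p1 for y in p2}

def phi_joint(t1, t2):
    return sum(p * cmath.exp(1j * (t1 * x + t2 * y)) for (x, y), p in joint.items())

def phi1(t):
    return sum(p * cmath.exp(1j * t * x) for x, p in p1.items())

def phi2(t):
    return sum(p * cmath.exp(1j * t * y) for y, p in p2.items())

t1, t2 = 0.7, -1.3
print(abs(phi_joint(t1, t2) - phi1(t1) * phi2(t2)))   # ~ 0: the product property holds
```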


6. Theorem 1 gives us necessary conditions for a function to be a characteristic function. Hence if $\varphi = \varphi(t)$ fails to satisfy, for example, one of the first three conclusions of the theorem, that function cannot be a characteristic function. We quote without proof some results in the same direction.

Bochner-Khinchin Theorem. Let $\varphi(t)$ be continuous, $t \in R$, with $\varphi(0) = 1$. A necessary and sufficient condition that $\varphi(t)$ is a characteristic function is that it is positive semi-definite, i.e. that for all real $t_1, \ldots, t_n$ and all complex $\lambda_1, \ldots, \lambda_n$, $n = 1, 2, \ldots$,
$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar{\lambda}_j \ge 0. \qquad (32)$$
The necessity of (32) is evident since if $\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$ then
$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar{\lambda}_j = \int_{-\infty}^{\infty}\left|\sum_{k=1}^{n}\lambda_k e^{it_kx}\right|^2 dF(x) \ge 0.$$
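The necessity argument can be replayed numerically: for a genuine characteristic function the Hermitian form in (32) is nonnegative for any choice of points and complex weights. A minimal sketch (the function $\varphi$, points, and weights below are our arbitrary choices):

```python
import math

def phi(t):
    # the N(0, 1) characteristic function e^{-t^2/2}, formula (10)
    return math.exp(-t * t / 2)

ts = [-1.5, -0.3, 0.0, 0.4, 2.0]
lams = [1 - 2j, 0.5 + 0j, -1.0 + 0j, 2j, -0.7 + 0.1j]

# quadratic form of (32); it must be real and nonnegative
q = sum(phi(ts[i] - ts[j]) * lams[i] * lams[j].conjugate()
        for i in range(len(ts)) for j in range(len(ts)))
print(q.real, abs(q.imag))
```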

The proof of the sufficiency of (32) is more difficult.

Pólya's Theorem. Let a continuous even function $\varphi(t)$ satisfy $\varphi(t) \ge 0$, $\varphi(0) = 1$, $\varphi(t) \to 0$ as $t \to \infty$, and let $\varphi(t)$ be convex on $0 \le t < \infty$. Then $\varphi(t)$ is a characteristic function.

This theorem provides a very convenient method of constructing characteristic functions. Examples are
$$\varphi_1(t) = e^{-|t|}, \qquad \varphi_2(t) = \begin{cases} 1 - |t|, & |t| \le 1, \\ 0, & |t| > 1. \end{cases}$$
Another is the function $\varphi_3(t)$ drawn in Figure 34. On $[-a, a]$, the function $\varphi_3(t)$ coincides with $\varphi_2(t)$. However, the corresponding distribution functions $F_2$ and $F_3$ are evidently different. This example shows that in general two characteristic functions can be the same on a finite interval without their distribution functions' being the same.

Figure 34. The functions $\varphi_2(t)$ (vanishing for $|t| \ge 1$) and $\varphi_3(t)$, which coincides with $\varphi_2(t)$ on $[-a, a]$.


Marcinkiewicz's Theorem. If a characteristic function $\varphi(t)$ is of the form $\exp\mathscr{P}(t)$, where $\mathscr{P}(t)$ is a polynomial, then this polynomial is of degree at most 2.

It follows, for example, that $e^{-t^4}$ is not a characteristic function.

7. The following theorem shows that a property of the characteristic function of a random variable can lead to a nontrivial conclusion about the nature of the random variable.

Theorem 5. Let $\varphi_\xi(t)$ be the characteristic function of the random variable $\xi$.
(a) If $|\varphi_\xi(t_0)| = 1$ for some $t_0 \ne 0$, then $\xi$ is concentrated at the points $a + nh$, $h = 2\pi/t_0$, for some $a$, that is,
$$\sum_{n=-\infty}^{\infty} P\{\xi = a + nh\} = 1, \qquad (33)$$
where $a$ is a constant.
(b) If $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ for two different points $t$ and $\alpha t$, where $\alpha$ is irrational, then $\xi$ is degenerate:
$$P\{\xi = a\} = 1,$$
where $a$ is some number.
(c) If $|\varphi_\xi(t)| \equiv 1$, then $\xi$ is degenerate.

PROOF.

(a) If $|\varphi_\xi(t_0)| = 1$, $t_0 \ne 0$, there is a number $a$ such that $\varphi(t_0) = e^{it_0a}$. Then
$$1 = \int_{-\infty}^{\infty}\cos t_0(x - a)\, dF(x) \Rightarrow \int_{-\infty}^{\infty}\left[1 - \cos t_0(x - a)\right] dF(x) = 0.$$
Since $1 - \cos t_0(x - a) \ge 0$, it follows from Property H (Subsection 2 of §6) that
$$1 = \cos t_0(\xi - a) \qquad \text{(P-a.s.)},$$
which is equivalent to (33).
(b) It follows from $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ and from (33) that, for some $a$ and $b$,
$$\sum_{n=-\infty}^{\infty} P\left\{\xi = a + \frac{2\pi}{t}\,n\right\} = \sum_{m=-\infty}^{\infty} P\left\{\xi = b + \frac{2\pi}{\alpha t}\,m\right\} = 1.$$
If $\xi$ is not degenerate, there must be at least two pairs of common points,
$$a + \frac{2\pi}{t}\,n_1 = b + \frac{2\pi}{\alpha t}\,m_1, \qquad a + \frac{2\pi}{t}\,n_2 = b + \frac{2\pi}{\alpha t}\,m_2,$$


in the sets
$$\left\{a + \frac{2\pi}{t}\,n,\ n = 0, \pm 1, \ldots\right\} \quad \text{and} \quad \left\{b + \frac{2\pi}{\alpha t}\,m,\ m = 0, \pm 1, \ldots\right\},$$
whence
$$\frac{2\pi}{t}(n_1 - n_2) = \frac{2\pi}{\alpha t}(m_1 - m_2),$$
and this contradicts the assumption that $\alpha$ is irrational. Conclusion (c) follows from (b). This completes the proof of the theorem.
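Conclusion (a) is easy to observe for an integer-valued (lattice, $a = 0$, $h = 1$) random variable: $|\varphi_\xi(2\pi/h)| = |\varphi_\xi(2\pi)| = 1$, while at a generic point the modulus is strictly less than 1. The pmf below is an arbitrary example of ours.

```python
import cmath
import math

pmf = {0: 0.2, 1: 0.5, 3: 0.3}           # an integer-valued (lattice) distribution

def phi(t):
    return sum(p * cmath.exp(1j * t * k) for k, p in pmf.items())

print(abs(phi(2 * math.pi)))              # = 1, as in (33) with a = 0, h = 1
print(abs(phi(1.0)))                      # strictly less than 1
```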

8. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector,
$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t = (t_1, \ldots, t_k),$$
its characteristic function. Let us suppose that $E|\xi_i|^n < \infty$ for some $n \ge 1$, $i = 1, \ldots, k$. From the inequalities of Hölder (6.29) and Lyapunov (6.27) it follows that the (mixed) moments $E(\xi_1^{\nu_1}\cdots\xi_k^{\nu_k})$ exist for all nonnegative $\nu_1, \ldots, \nu_k$ such that $\nu_1 + \cdots + \nu_k \le n$.
As in Theorem 1, this implies the existence and continuity of the partial derivatives
$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1}\cdots\partial t_k^{\nu_k}}\,\varphi_\xi(t_1, \ldots, t_k)$$
for $\nu_1 + \cdots + \nu_k \le n$. Then if we expand $\varphi_\xi(t_1, \ldots, t_k)$ in a Taylor series, we see that
$$\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1!\cdots\nu_k!}\,m_\xi^{(\nu_1, \ldots, \nu_k)}\,t_1^{\nu_1}\cdots t_k^{\nu_k} + o(|t|^n), \qquad (34)$$
where
$$m_\xi^{(\nu_1, \ldots, \nu_k)} = E\xi_1^{\nu_1}\cdots\xi_k^{\nu_k}$$
is the mixed moment of order $\nu = (\nu_1, \ldots, \nu_k)$.
Now $\varphi_\xi(t_1, \ldots, t_k)$ is continuous, $\varphi_\xi(0, \ldots, 0) = 1$, and consequently this function is different from zero in some neighborhood $|t| < \delta$ of zero. In this neighborhood the partial derivative
$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1}\cdots\partial t_k^{\nu_k}}\,\ln\varphi_\xi(t_1, \ldots, t_k)$$
exists and is continuous, where $\ln z$ denotes the principal value of the logarithm (if $z = re^{i\theta}$, we take $\ln z$ to be $\ln r + i\theta$). Hence we can expand $\ln\varphi_\xi(t_1, \ldots, t_k)$ by Taylor's formula,
$$\ln\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1!\cdots\nu_k!}\,s_\xi^{(\nu_1, \ldots, \nu_k)}\,t_1^{\nu_1}\cdots t_k^{\nu_k} + o(|t|^n), \qquad (35)$$


where the coefficients $s_\xi^{(\nu_1, \ldots, \nu_k)}$ are the (mixed) semi-invariants, or cumulants, of order $\nu = (\nu_1, \ldots, \nu_k)$ of $\xi = (\xi_1, \ldots, \xi_k)$.
Observe that if $\xi$ and $\eta$ are independent, then
$$\ln\varphi_{\xi+\eta}(t) = \ln\varphi_\xi(t) + \ln\varphi_\eta(t), \qquad (36)$$
and therefore
$$s_{\xi+\eta}^{(\nu_1, \ldots, \nu_k)} = s_\xi^{(\nu_1, \ldots, \nu_k)} + s_\eta^{(\nu_1, \ldots, \nu_k)}. \qquad (37)$$
(It is this property that gives rise to the term "semi-invariant" for $s_\xi^{(\nu_1, \ldots, \nu_k)}$.)
To simplify the formulas and make (34) and (35) look "one-dimensional," we introduce the following notation. If $\nu = (\nu_1, \ldots, \nu_k)$ is a vector whose components are nonnegative integers, we put
$$\nu! = \nu_1!\cdots\nu_k!, \qquad |\nu| = \nu_1 + \cdots + \nu_k, \qquad t^\nu = t_1^{\nu_1}\cdots t_k^{\nu_k}.$$
We also put $s_\xi^{(\nu)} = s_\xi^{(\nu_1, \ldots, \nu_k)}$, $m_\xi^{(\nu)} = m_\xi^{(\nu_1, \ldots, \nu_k)}$. Then (34) and (35) can be written
$$\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}\,m_\xi^{(\nu)} t^\nu + o(|t|^n), \qquad (38)$$
$$\ln\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}\,s_\xi^{(\nu)} t^\nu + o(|t|^n). \qquad (39)$$

The following theorem and its corollaries give formulas that connect moments and semi-invariants.

Theorem 6. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector with $E|\xi_i|^n < \infty$, $i = 1, \ldots, k$, $n \ge 1$. Then for $\nu = (\nu_1, \ldots, \nu_k)$ such that $|\nu| \le n$,
$$m_\xi^{(\nu)} = \sum_{q=1}^{|\nu|}\frac{1}{q!}\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{\nu!}{\lambda^{(1)}!\cdots\lambda^{(q)}!}\prod_{p=1}^{q} s_\xi^{(\lambda^{(p)})}, \qquad (40)$$
$$s_\xi^{(\nu)} = \sum_{q=1}^{|\nu|}\frac{(-1)^{q-1}}{q}\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{\nu!}{\lambda^{(1)}!\cdots\lambda^{(q)}!}\prod_{p=1}^{q} m_\xi^{(\lambda^{(p)})}, \qquad (41)$$
where $\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}$ indicates summation over all ordered sets of nonnegative integral vectors $\lambda^{(p)}$, $|\lambda^{(p)}| > 0$, whose sum is $\nu$.

PROOF.
Since

cpr,(t) = exp(ln cpr,(t)), if we expand the function exp by Taylor's formula and use (39), we obtain

cpr,(t) = 1 +

1( iiAI )q L I L s~A>tA + o(ltl"). 11 q=l q., lSIAisn n

11..

(42)



Comparing terms in e· on the right-hand sides of (38) and (42), and using 1..1<1)1 + · · · + IA.(qJI = 1..1< 1> + · · · + A_
L ~~~ m~"lt" + o( It 1")]. cp~(t) = ln[1 + 1:<>1.<1:-;n ·

(43)

For small z we have the expansion ln(1

+ z)

=

L n

q= 1

(

1)q-1 q

zq

+ o(zry.

Using this in (43) and then comparing the coefficients oft;., with the corresponding coefficients on the right-hand side of (38), we obtain (41).

Corollary 1. The following formulas connect moments and semi-invariants:

L

1

I

n[ X

(44) (}.Ul)J'j v. ) m~ - {rt)..(l)+ .. ·+rx).(X)=v} r1! • • • rX! (A_(l)!)'' • • • (A_(X)!yxj=l s~ (v) -

s~'l-

L {rt),(IJ + ... + rx;.,(x) = v}

Il [

<.
where 'L1,,;.(1J+ .. ·+rx. 0, and over all ordered sets of positive integral numbers ri such that r 1A.< 1l + · · · + rxA_(x) = v. To establish (44) we suppose that among all the vectors A_(ll, ... , A_(q) that occur in (40), there are r 1 equal to A_(itl, ... , rx equal to A_(ixl (ri > 0, r1 + · · · + rx = q), where all the A_(i.J are different. There are q !f(r 1! ... rx !) different sets of vectors, corresponding (except for order) with the set {A_(ll, ... il(ql}). But if two sets, say, {il< 1>, •.• , il
Corollary 2. Let us consider the special case when v = (1, ... , 1). In this case the moments m~v) E~ 1 · · · ~k' and the corresponding semi-invariants, are called simple.

=

Formulas connecting simple moments and simple semi-invariants can be read off from the formulas given above. However, it is useful to have them written in a different way. For this purpose, we introduce the following notation. Let~= (~I> ... , ~k) be a vector, and I~= {1, 2, ... , k} its set of indices. If I c:;: I~, let ~~denote the vector consisting of the components of~ whose



indices belong to $I$. Let $\chi(I)$ be the vector $(\chi_1, \ldots, \chi_k)$ for which $\chi_i = 1$ if $i \in I$, and $\chi_i = 0$ if $i \notin I$. These vectors are in one-to-one correspondence with the sets $I \subseteq I_\xi$. Hence we can write
$$m_\xi(I) = m_\xi^{(\chi(I))}, \qquad s_\xi(I) = s_\xi^{(\chi(I))}.$$
In other words, $m_\xi(I)$ and $s_\xi(I)$ are simple moments and semi-invariants of the subvector $\xi_I$ of $\xi$.
In accordance with the definition given on p. 12, a decomposition of a set $I$ is an unordered collection of disjoint nonempty sets $I_p$ such that $\sum_p I_p = I$. In terms of these definitions, we have the formulas
$$m_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}\ \prod_{p=1}^{q} s_\xi(I_p), \qquad (46)$$
$$s_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}(-1)^{q-1}(q-1)!\prod_{p=1}^{q} m_\xi(I_p), \qquad (47)$$
where $\sum_{\sum_{p=1}^{q} I_p = I}$ denotes summation over all decompositions of $I$, $1 \le q \le N(I)$.
We shall derive (46) from (44). If $\nu = \chi(I)$ and $\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu$, then $\lambda^{(p)} = \chi(I_p)$, $I_p \subseteq I$, where the $\lambda^{(p)}$ are all different, $\lambda^{(p)}! = \nu! = 1$, and every unordered set $\{\chi(I_1), \ldots, \chi(I_q)\}$ is in one-to-one correspondence with the decomposition $I = \sum_{p=1}^{q} I_p$. Consequently (46) follows from (44). In a similar way, (47) follows from (45).

e be. a

random variable (k = 1) and mn = m~n> = Ee",

sn =st. Then (40) and (41) imply the following formulas: ml = sl,

+ si, s 3 + 3s 1s 2 + s~, s4 + 3s~ + 4sls3 + 6sis2 + st,

m2 = s2 m3 = m4 =

(48)

and s1

= m 1 = Ee,

s2 = m2- mi =

ve,

s 3 = m3 - 3m 1 m2 + 2mt s4 = m4 - 3m~- 4m 1 m3 + 12mim2 - 6mt,

(49)



Example 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$. Since, by (9),
$$\ln\varphi_\xi(t) = itm - \frac{t^2\sigma^2}{2},$$
we have $s_1 = m$, $s_2 = \sigma^2$ by (39), and all the semi-invariants, from the third on, are zero: $s_n = 0$, $n \ge 3$.
We may observe that by Marcinkiewicz's theorem a function $\exp\mathscr{P}(t)$, where $\mathscr{P}$ is a polynomial, can be a characteristic function only when the degree of that polynomial is at most 2. It follows, in particular, that the Gaussian distribution is the only distribution with the property that all its semi-invariants $s_n$ are zero from a certain index onward.

Example 3. If $\xi$ is a Poisson random variable with parameter $\lambda > 0$, then by (11)
$$\ln\varphi_\xi(t) = \lambda(e^{it} - 1).$$
It follows that
$$s_n = \lambda \qquad (50)$$
for all $n \ge 1$.

4.

Let~

=

(~ 1 , ••. , ~n)

m~(l)

=

be a random vector. Then

s~(l),

+ s~(l)s~(2), mp, 2, 3) = s~(1, 2, 3) + s~(1, 2)s~(3) + + s~(l, 3)sp) + + s~(2, 3)sll) + s~(l)se(2)sl3) m~(l,

2) =

s~(l,

2)

(51)

These formulas show that the simple moments can be expressed in terms of the simple semi-invariants in a very symmetric way. If we put ~ 1 = ~ 2 = ~k, we then, of course, obtain (48). The group-theoretical origin of the coefficients in (48) becomes clear from (51). It also follows from (51) that

··· =

se(l, 2) = me(l, 2)- me(l)mp) =

E~ 1 ~ 2 - E~ 1 E~ 2 ,

(52)

i.e., $s_\xi(1, 2)$ is just the covariance of $\xi_1$ and $\xi_2$.

9. Let $\xi$ be a random variable with distribution function $F = F(x)$ and characteristic function $\varphi(t)$. Let us suppose that all the moments $m_n = E\xi^n$, $n \ge 1$, exist. It follows from Theorem 2 that a characteristic function uniquely determines a probability distribution. Let us now ask the following question



(uniqueness for the moment problem): Do the moments $\{m_n\}_{n\ge 1}$ determine the probability distribution? More precisely, let $F$ and $G$ be distribution functions with the same moments, i.e.
$$\int_{-\infty}^{\infty} x^n\, dF(x) = \int_{-\infty}^{\infty} x^n\, dG(x) \qquad (53)$$
for all integers $n \ge 0$. The question is whether $F$ and $G$ must be the same.
In general, the answer is "no." To see this, consider the distribution $F$ with density
$$f(x) = \begin{cases} ke^{-\alpha x^\lambda}, & x > 0, \\ 0, & x \le 0, \end{cases}$$
where $\alpha > 0$, $0 < \lambda < \tfrac{1}{2}$, and $k$ is determined by the condition $\int_0^\infty f(x)\, dx = 1$. Write $\beta = \alpha\tan\lambda\pi$ and let $g(x) = 0$ for $x \le 0$ and
$$g(x) = ke^{-\alpha x^\lambda}\left[1 + \varepsilon\sin(\beta x^\lambda)\right], \qquad |\varepsilon| < 1, \quad x > 0.$$
It is evident that $g(x) \ge 0$. Let us show that
$$\int_0^\infty x^n e^{-\alpha x^\lambda}\sin(\beta x^\lambda)\, dx = 0 \qquad (54)$$
for all integers $n \ge 0$.
For $p > 0$ and complex $q$ with $\operatorname{Re} q > 0$, we have
$$\int_0^\infty t^{p-1} e^{-qt}\, dt = \frac{\Gamma(p)}{q^p}.$$
Take $p = (n+1)/\lambda$, $q = \alpha + i\beta$, $t = x^\lambda$. Then
$$\int_0^\infty x^n e^{-\alpha x^\lambda}\left[\cos(\beta x^\lambda) - i\sin(\beta x^\lambda)\right] dx = \frac{\Gamma\left(\frac{n+1}{\lambda}\right)}{\lambda\,(\alpha + i\beta)^{(n+1)/\lambda}}. \qquad (55)$$
But
$$(\alpha + i\beta)^{(n+1)/\lambda} = \alpha^{(n+1)/\lambda}(1 + i\tan\lambda\pi)^{(n+1)/\lambda} = \alpha^{(n+1)/\lambda}(\cos\lambda\pi)^{-(n+1)/\lambda}\,e^{i\pi(n+1)},$$
which is real.


Hence the right-hand side of (55) is real, and therefore (54) is valid for all integers $n \ge 0$. Now let $G(x)$ be the distribution function with density $g(x)$. It follows from (54) that the distribution functions $F$ and $G$ have equal moments, i.e. (53) holds for all integers $n \ge 0$. We now give some conditions that guarantee the uniqueness of the solution of the moment problem.
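The equality (54) can be confirmed numerically for a concrete choice of parameters. Below (a sketch of ours) we take $\alpha = 1$, $\lambda = \tfrac{1}{4}$, so $\beta = \tan(\pi/4) = 1$; after the substitution $u = x^\lambda$ (i.e. $x = u^4$, $dx = 4u^3\,du$) the integral becomes $4\int_0^\infty u^{4n+3}e^{-u}\sin u\, du$, which the code approximates by a midpoint sum.

```python
import math

def I(n, U=80.0, steps=100_000):
    # \int_0^inf x^n e^{-x^{1/4}} sin(x^{1/4}) dx  (alpha = 1, lambda = 1/4, beta = 1)
    # after u = x^{1/4}:  4 \int_0^inf u^{4n+3} e^{-u} sin(u) du
    h = U / steps
    s = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        s += u ** (4 * n + 3) * math.exp(-u) * math.sin(u) * h
    return 4 * s

for n in range(4):
    print(n, I(n))   # all ~ 0: f and g share every moment
```

The values are negligible relative to the natural scale $\Gamma(4n+4)$ of the integrand, confirming (54).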

Theorem 7. Let $F = F(x)$ be a distribution function and $\mu_n = \int_{-\infty}^{\infty}|x|^n\, dF(x)$. If
$$\overline{\lim_n}\,\frac{\mu_n^{1/n}}{n} < \infty, \qquad (56)$$
the moments $\{m_n\}_{n\ge 1}$, where $m_n = \int_{-\infty}^{\infty} x^n\, dF(x)$, determine the distribution $F = F(x)$ uniquely.

PROOF. It follows from (56) and conclusion (7) of Theorem 1 that there is a $t_0 > 0$ such that, for all $|t| \le t_0$, the characteristic function
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$$
can be represented in the form
$$\varphi(t) = \sum_{k=0}^{\infty}\frac{(it)^k}{k!}\,m_k,$$
and consequently the moments $\{m_n\}_{n\ge 1}$ uniquely determine the characteristic function $\varphi(t)$ for $|t| \le t_0$.
Take a point $s$ with $|s| \le t_0/2$. Then, as in the proof of (15), we deduce from (56) that, for $|t - s| \le t_0$,
$$\varphi(t) = \sum_{k=0}^{\infty}\frac{i^k(t-s)^k}{k!}\,\varphi^{(k)}(s),$$
where
$$\varphi^{(k)}(s) = i^k\int_{-\infty}^{\infty} x^k e^{isx}\, dF(x)$$
is uniquely determined by the moments $\{m_n\}_{n\ge 1}$. Consequently the moments determine $\varphi(t)$ uniquely for $|t| \le \tfrac{3}{2}t_0$. Continuing this process, we see that $\{m_n\}_{n\ge 1}$ determines $\varphi(t)$ uniquely for all $t$, and therefore also determines $F(x)$. This completes the proof of the theorem.

Corollary 1. The moments completely determine the probability distribution if it is concentrated on a finite interval.



Corollary 2. A sufficient condition for the moment problem to have a unique solution is that
$$\overline{\lim_{n\to\infty}}\,\frac{(m_{2n})^{1/2n}}{2n} < \infty. \qquad (57)$$

For the proof it is enough to observe that the odd moments can be estimated in terms of the even ones, and then use (56).

Example. Let $F(x)$ be the normal distribution function,
$$F(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{x} e^{-t^2/2\sigma^2}\, dt.$$

Then m 2 n+ 1 = 0, m 2 n = [(2n) !/2"n !]u 2 ", and it follows from (57) that these are the moments only of the normal distribution. Finally we state, without proof: Carleman's test for the uniqueness of the moment problem. (a) Let

{mn}n~ 1 be

the moments of a probability distribution, and let 00

1

L (mzn)1f2n =

n=O

00 ·

Then they determine the probability distribution uniquely. (b) If {mn}n~l are the moments of a distribution that is concentrated on [0, oo ), then the solution will be unique if we require only that 00

1

L (mn)lf2n

n=O

=

00 ·

10. Let $F = F(x)$ and $G = G(x)$ be distribution functions with characteristic functions $f = f(t)$ and $g = g(t)$, respectively. The following theorem, which we give without proof, makes it possible to estimate how close $F$ and $G$ are to each other (in the uniform metric) in terms of the closeness of $f$ and $g$.

Theorem (Esseen's Inequality). Let $G(x)$ have derivative $G'(x)$ with $\sup_x|G'(x)| \le C$. Then for every $T > 0$
$$\sup_x|F(x) - G(x)| \le \frac{2}{\pi}\int_0^T\frac{|f(t) - g(t)|}{t}\, dt + \frac{24}{\pi}\cdot\frac{\sup_x|G'(x)|}{T}. \qquad (58)$$

(This will be used in §6 of Chapter III to prove a theorem on the rapidity of convergence in the central limit theorem.)



11.

PROBLEMS

e

1. Let $\xi$ and $\eta$ be independent random variables, $f(x) = f_1(x) + if_2(x)$, $g(x) = g_1(x) + ig_2(x)$, where $f_k(x)$ and $g_k(x)$ are Borel functions, $k = 1, 2$. Show that if $E|f(\xi)| < \infty$ and $E|g(\eta)| < \infty$, then
$$E|f(\xi)g(\eta)| < \infty \quad \text{and} \quad Ef(\xi)g(\eta) = Ef(\xi)\cdot Eg(\eta).$$

2. Let $\xi = (\xi_1, \ldots, \xi_n)$ and $E\|\xi\|^n < \infty$, where $\|\xi\| = +\sqrt{(\xi, \xi)}$. Show that
$$\varphi_\xi(t) = \sum_{k=0}^{n}\frac{i^k}{k!}\,E(t, \xi)^k + \varepsilon_n(t)\|t\|^n,$$
where $t = (t_1, \ldots, t_n)$ and $\varepsilon_n(t) \to 0$, $t \to 0$.

3. Prove Theorem 2 for $n$-dimensional distribution functions $F = F_n(x_1, \ldots, x_n)$ and $G_n(x_1, \ldots, x_n)$.

4. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function and $\varphi = \varphi(t_1, \ldots, t_n)$ its characteristic function. Using the notation of (3.12), establish the inversion formula
$$P(a, b] = \lim_{c\to\infty}\frac{1}{(2\pi)^n}\int_{-c}^{c}\cdots\int_{-c}^{c}\ \prod_{k=1}^{n}\frac{e^{-it_ka_k} - e^{-it_kb_k}}{it_k}\,\varphi(t_1, \ldots, t_n)\, dt_1\cdots dt_n.$$
(We are to suppose that $(a, b]$ is an interval of continuity of $P(a, b]$, i.e. for $k = 1, \ldots, n$ the points $a_k$, $b_k$ are points of continuity of the marginal distribution functions $F_k(x_k)$ which are obtained from $F(x_1, \ldots, x_n)$ by taking all the variables except $x_k$ equal to $+\infty$.)

5. Let $\varphi_k(t)$, $k \ge 1$, be characteristic functions, and let the nonnegative numbers $\lambda_k$, $k \ge 1$, satisfy $\sum\lambda_k = 1$. Show that $\sum\lambda_k\varphi_k(t)$ is a characteristic function.

6. If $\varphi(t)$ is a characteristic function, are $\operatorname{Re}\varphi(t)$ and $\operatorname{Im}\varphi(t)$ characteristic functions?

7. Let $\varphi_1$, $\varphi_2$ and $\varphi_3$ be characteristic functions, and $\varphi_1\varphi_2 = \varphi_1\varphi_3$. Does it follow that $\varphi_2 = \varphi_3$?

e

9. Let be an integral-valued random variable and Show that P(e = k) = 1 -

2n

f" .

e-u"cp~(t) dt,

-x

cp~

(t) its characteristic function.

k = 0,± 1,± 2 ....

§13. Gaussian Systems

1. Gaussian, or normal, distributions, random variables, processes, and systems play an extremely important role in probability theory and in mathematical statistics. This is explained in the first instance by the central



limit theorem (§4 of Chapter III and §8 of Chapter VII), of which the De Moivre-Laplace limit theorem is a special case (§6, Chapter I). According to this theorem, the normal distribution is universal in the sense that the distribution of the sum of a large number of random variables or random vectors, subject to some not very restrictive conditions, is closely approximated by this distribution. This is what provides a theoretical explanation of the "law of errors" of applied statistics, which says that errors of measurement that result from large numbers of independent "elementary" errors obey the normal distribution.
A multidimensional Gaussian distribution is specified by a small number of parameters; this is a definite advantage in using it in the construction of simple probabilistic models. Gaussian random variables have finite second moments, and consequently they can be studied by Hilbert space methods. Here it is important that in the Gaussian case "uncorrelated" is equivalent to "independent," so that the results of $L^2$-theory can be significantly strengthened.

2. Let us recall that (see §8) a random variable $\xi = \xi(\omega)$ is Gaussian, or normally distributed, with parameters $m$ and $\sigma^2$ ($\xi \sim \mathscr{N}(m, \sigma^2)$), $|m| < \infty$, $\sigma^2 > 0$, if its density $f_\xi(x)$ has the form

1"( ) _ _1_ -(x-m)2f2a2 x - ;;;= e ,

J~

....; 2na

(1)

+P.

where a = As a! 0, the density f~(x) "converges to the a-function supported at x = m." It is natural to say that ~ is normally distributed with mean m and a 2 = 0 (~ "' .K(m, 0)) if~ has the property that P(~ = m) = 1. We can, however, give a definition that applies both to the nondegenerate (a 2 > 0) and the degenerate (a 2 = 0) cases. Let us consider the characteristic function cp~(t) Eei 1 ~, t E R. If P(~ = m) = 1, then evidently

=

eitm,

(2)

eitm-(1/2)t2 a 2 •

(3)

cp~(t) =

whereas if~ "' .K(m, a 2 ), a 2 > 0, cp~(t)

=

It is obvious that when a 2 = 0 the right-hand sides of (2) and (3) are the same. It follows, by Theorem 1 of §12, that the Gaussian random variable with parameters m and a 2 (I m I < oo, a 2 ~ 0) must be the same as the random variable whose characteristic function is given by (3). This is an illustration of the "attraction of characteristic functions," a very useful technique in the multidimensional case.

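Formula (3) can be verified numerically for any particular pair (m, σ) by integrating e^{itx} against the density (1). The sketch below (Python/NumPy, values chosen arbitrarily for illustration) does this on a wide grid:

```python
import numpy as np

# Check (3): integrate e^{itx} f(x) dx for the density (1) and compare with
# exp(itm - t^2 sigma^2 / 2).  The grid covers +-12 sigma, so truncation of
# the Gaussian tails is negligible.
m, sigma = 1.5, 0.7
x = np.linspace(m - 12 * sigma, m + 12 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

for t in (0.0, 0.5, 1.0, 2.0):
    numeric = (np.exp(1j * t * x) * f).sum() * dx           # approx E e^{it xi}
    exact = np.exp(1j * t * m - 0.5 * t ** 2 * sigma ** 2)  # formula (3)
    assert abs(numeric - exact) < 1e-6
```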
Let ξ = (ξ₁, …, ξ_n) be a random vector and

    φ_ξ(t) = E e^{i(t, ξ)},   t = (t₁, …, t_n) ∈ Rⁿ,                         (4)

its characteristic function (see Definition 2, §12).

Definition 1. A random vector ξ = (ξ₁, …, ξ_n) is Gaussian, or normally distributed, if its characteristic function has the form

    φ_ξ(t) = e^{i(t, m) − (1/2)(ℝt, t)},                                     (5)

where m = (m₁, …, m_n), |m_k| < ∞, and ℝ = ‖r_kl‖ is a symmetric nonnegative definite n × n matrix; we use the abbreviation ξ ~ 𝒩(m, ℝ).

This definition immediately makes us ask whether (5) is in fact a characteristic function. Let us show that it is.

First suppose that ℝ is nonsingular. Then we can define the inverse A = ℝ⁻¹ and the function

    f(x) = (|A|^{1/2}/(2π)^{n/2}) exp{−(1/2)(A(x − m), (x − m))},            (6)

where x = (x₁, …, x_n) and |A| = det A. This function is nonnegative. Let us show that

    ∫_{Rⁿ} e^{i(t, x)} f(x) dx = e^{i(t, m) − (1/2)(ℝt, t)},

or equivalently that

    I_n ≡ ∫_{Rⁿ} e^{i(t, x−m)} (|A|^{1/2}/(2π)^{n/2}) e^{−(1/2)(A(x−m), (x−m))} dx = e^{−(1/2)(ℝt, t)}.   (7)

Let us make the change of variable

    x − m = 𝒪u,   t = 𝒪v,

where 𝒪 is an orthogonal matrix such that

    𝒪ᵀℝ𝒪 = D,

and D = diag(d₁, …, d_n) is a diagonal matrix with d_i ≥ 0 (see the proof of the lemma in §8). Since |ℝ| = det ℝ ≠ 0, we have d_i > 0, i = 1, …, n. Therefore

    |A| = |ℝ|⁻¹ = (d₁ ⋯ d_n)⁻¹.                                              (8)


Moreover (for notation, see Subsection 1, §12),

    i(t, x − m) − (1/2)(A(x − m), x − m)
        = i(𝒪v, 𝒪u) − (1/2)(A𝒪u, 𝒪u)
        = i(𝒪v)ᵀ(𝒪u) − (1/2)(𝒪u)ᵀA(𝒪u)
        = ivᵀu − (1/2)uᵀ𝒪ᵀA𝒪u = ivᵀu − (1/2)uᵀD⁻¹u.

Together with (8) and (12.9), this yields

    I_n = (2π)^{−n/2}(d₁ ⋯ d_n)^{−1/2} ∫_{Rⁿ} exp(ivᵀu − (1/2)uᵀD⁻¹u) du
        = ∏_{k=1}^{n} (2πd_k)^{−1/2} ∫_{−∞}^{∞} exp(iv_k u_k − u_k²/(2d_k)) du_k
        = ∏_{k=1}^{n} exp(−(1/2)v_k² d_k)
        = exp(−(1/2)vᵀDv) = exp(−(1/2)vᵀ𝒪ᵀℝ𝒪v) = exp(−(1/2)tᵀℝt) = exp(−(1/2)(ℝt, t)).

It also follows from (6) that

    ∫_{Rⁿ} f(x) dx = 1.                                                      (9)

Therefore (5) is the characteristic function of a nondegenerate n-dimensional Gaussian distribution (see Subsection 3, §3).

Now let ℝ be singular. Take ε > 0 and consider the positive definite symmetric matrix ℝᵉ = ℝ + εE. Then by what has been proved,

    φᵉ(t) = e^{i(t, m) − (1/2)(ℝᵉt, t)}

is a characteristic function:

    φᵉ(t) = ∫_{Rⁿ} e^{i(t, x)} dFᵉ(x),

where Fᵉ(x) = Fᵉ(x₁, …, x_n) is an n-dimensional distribution function. As ε → 0,

we find from (12.35) and the formulas that connect the moments and the semi-invariants that

    m₁ = s_ξ^{(1,0,…,0)} = Eξ₁, …, m_k = s_ξ^{(0,…,0,1)} = Eξ_k.

Similarly

    r₁₁ = s_ξ^{(2,0,…,0)} = Vξ₁,

and generally

    r_kl = cov(ξ_k, ξ_l).
Consequently m is the mean-value vector of ξ and ℝ is its covariance matrix.

If ℝ is nonsingular, we can obtain this result in a different way. In fact, in this case ξ has a density f(x) given by (6). A direct calculation shows that

    Eξ_k = ∫ x_k f(x) dx = m_k,                                              (11)
    cov(ξ_k, ξ_l) = ∫ (x_k − m_k)(x_l − m_l) f(x) dx = r_kl.

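Formulas (9) and (11) can be checked by direct numerical integration of the density (6). A small Python/NumPy sketch for n = 2, with an arbitrary positive definite matrix chosen purely for illustration:

```python
import numpy as np

# Numerical check of (9) and (11) in n = 2 dimensions for one concrete
# example of m and R (names follow the text; the values are arbitrary).
m = np.array([1.0, -2.0])
R = np.array([[2.0, 0.6],
              [0.6, 1.0]])
A = np.linalg.inv(R)

# Density (6) evaluated on a grid covering many standard deviations.
g = np.linspace(-10, 10, 801)
dx = g[1] - g[0]
X, Y = np.meshgrid(m[0] + g, m[1] + g, indexing="ij")
Z = np.stack([X - m[0], Y - m[1]], axis=-1)
quad = np.einsum("...i,ij,...j->...", Z, A, Z)
f = np.sqrt(np.linalg.det(A)) / (2 * np.pi) * np.exp(-0.5 * quad)

assert abs(f.sum() * dx ** 2 - 1) < 1e-6                   # (9)
assert abs((X * f).sum() * dx ** 2 - m[0]) < 1e-6          # E xi_1 = m_1
cov12 = ((X - m[0]) * (Y - m[1]) * f).sum() * dx ** 2
assert abs(cov12 - R[0, 1]) < 1e-6                         # cov(xi_1, xi_2) = r_12
```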
4. Let us discuss some properties of Gaussian vectors.

Theorem 1.
(a) The components of a Gaussian vector are uncorrelated if and only if they are independent.
(b) A vector ξ = (ξ₁, …, ξ_n) is Gaussian if and only if, for every vector λ = (λ₁, …, λ_n), λ_k ∈ R, the random variable (ξ, λ) = λ₁ξ₁ + ⋯ + λ_nξ_n has a Gaussian distribution.

PROOF. (a) If the components of ξ = (ξ₁, …, ξ_n) are uncorrelated, it follows from the form of the characteristic function φ_ξ(t) that it is a product of characteristic functions. Therefore, by Theorem 4 of §12, the components are independent. The converse is evident, since independence always implies lack of correlation.

(b) If ξ is a Gaussian vector, it follows from (5) that

    E exp{it(ξ₁λ₁ + ⋯ + ξ_nλ_n)} = exp{it(Σ_k λ_k m_k) − (t²/2)(Σ_{k,l} r_kl λ_k λ_l)},   t ∈ R,

and consequently the random variable (ξ, λ) is Gaussian.

Conversely, to say that the random variable (ξ, λ) = ξ₁λ₁ + ⋯ + ξ_nλ_n is Gaussian for every λ means, in particular, that

    E e^{i(ξ, λ)} = exp{i Σ_k λ_k m_k − (1/2) Σ_{k,l} r_kl λ_k λ_l},

where m_k = Eξ_k and r_kl = cov(ξ_k, ξ_l). Since λ₁, …, λ_n are arbitrary, it follows from Definition 1 that the vector ξ = (ξ₁, …, ξ_n) is Gaussian. This completes the proof of the theorem.

Remark. Let (θ, ξ) be a Gaussian vector with θ = (θ₁, …, θ_k) and ξ = (ξ₁, …, ξ_l). If θ and ξ are uncorrelated, i.e. cov(θ_i, ξ_j) = 0, i = 1, …, k; j = 1, …, l, they are independent.

The proof is the same as for conclusion (a) of the theorem.

Let ξ = (ξ₁, …, ξ_n) be a Gaussian vector; let us suppose, for simplicity, that its mean-value vector is zero. If rank ℝ = r < n, then (as was shown in §11) there are n − r linear relations connecting ξ₁, …, ξ_n. We may then suppose that, say, ξ₁, …, ξ_r are linearly independent, and the others can be expressed linearly in terms of them. Hence all the basic properties of the vector ξ = (ξ₁, …, ξ_n) are determined by the first r components (ξ₁, …, ξ_r), for which the corresponding covariance matrix is already known to be nonsingular. Thus we may suppose that the original vector ξ = (ξ₁, …, ξ_n) had linearly independent components and therefore that |ℝ| > 0.

Let 𝒪 be an orthogonal matrix that diagonalizes ℝ,

    𝒪ᵀℝ𝒪 = D.

The diagonal elements of D are positive and therefore determine the inverse matrix. Put B² = D and

    β = B⁻¹𝒪ᵀξ.

Then it is easily verified that

    E e^{i(t, β)} = e^{−(1/2)(t, t)},

i.e. the vector β = (β₁, …, β_n) is a Gaussian vector with components that are uncorrelated and therefore (Theorem 1) independent. Then if we write A = 𝒪B we find that the original Gaussian vector ξ = (ξ₁, …, ξ_n) can be represented as

    ξ = Aβ,                                                                  (12)

where β = (β₁, …, β_n) is a Gaussian vector with independent components, β_k ~ 𝒩(0, 1). Hence we have the following result. Let ξ = (ξ₁, …, ξ_n) be a vector with linearly independent components such that Eξ_k = 0, k = 1, …, n. This vector is Gaussian if and only if there are independent Gaussian variables β₁, …, β_n, β_k ~ 𝒩(0, 1), and a nonsingular matrix A of order n such that ξ = Aβ. Here ℝ = AAᵀ is the covariance matrix of ξ.

If |ℝ| ≠ 0, then by the Gram–Schmidt method (see §11)

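The representation ξ = Aβ with A = 𝒪B is also the standard recipe for sampling a Gaussian vector with prescribed covariance matrix. A hedged Python/NumPy sketch (the matrix below is an arbitrary example, and the Monte Carlo tolerance is generous):

```python
import numpy as np

# Build A = O B from the eigendecomposition O^T R O = D, B^2 = D,
# and sample xi = A beta with beta ~ N(0, E).
R = np.array([[2.0, 0.6],
              [0.6, 1.0]])
d, O = np.linalg.eigh(R)           # O^T R O = diag(d), columns orthonormal
B = np.diag(np.sqrt(d))
A = O @ B

assert np.allclose(A @ A.T, R)     # R = A A^T, as stated in the text

rng = np.random.default_rng(0)
beta = rng.standard_normal((200_000, 2))
xi = beta @ A.T                    # each row is one realization of xi = A beta

emp = np.cov(xi, rowvar=False)     # empirical covariance matrix
assert np.max(np.abs(emp - R)) < 0.05
```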
    ξ_k = ξ̂_k + b_k ε_k,   k = 1, …, n,                                     (13)

where ε = (ε₁, …, ε_k) ~ 𝒩(0, E) is a Gaussian vector,

    ξ̂_k = Σ_{l=1}^{k−1} (ξ_k, ε_l) ε_l,                                     (14)

    b_k = ‖ξ_k − ξ̂_k‖,                                                      (15)

and

    L̄{ξ₁, …, ξ_k} = L̄{ε₁, …, ε_k}.                                         (16)

We see immediately from the orthogonal decomposition (13) that

    ξ̂_k = E(ξ_k | ξ_{k−1}, …, ξ₁).                                           (17)

From this, with (16) and (14), it follows that in the Gaussian case the conditional expectation E(ξ_k | ξ_{k−1}, …, ξ₁) is a linear function of (ξ₁, …, ξ_{k−1}):

    E(ξ_k | ξ_{k−1}, …, ξ₁) = Σ_{i=1}^{k−1} a_i ξ_i.                         (18)

(This was proved in §8 for the case k = 2.) Since, according to a remark made on Theorem 1 of §8, E(ξ_k | ξ_{k−1}, …, ξ₁) is an optimal estimator (in the mean-square sense) for ξ_k in terms of ξ₁, …, ξ_{k−1}, it follows from (18) that in the Gaussian case the optimal estimator is linear.

We shall use these results in looking for optimal estimators of θ = (θ₁, …, θ_k) in terms of ξ = (ξ₁, …, ξ_l) under the hypothesis that (θ, ξ) is Gaussian. Let

    m_θ = Eθ,   m_ξ = Eξ

be the column-vectors of mean values and

    V_θθ = cov(θ, θ) = ‖cov(θ_i, θ_j)‖,   1 ≤ i, j ≤ k,
    V_θξ = cov(θ, ξ) = ‖cov(θ_i, ξ_j)‖,   1 ≤ i ≤ k, 1 ≤ j ≤ l,
    V_ξξ = cov(ξ, ξ) = ‖cov(ξ_i, ξ_j)‖,   1 ≤ i, j ≤ l,

the covariance matrices. Let us suppose that V_ξξ has an inverse. Then we have the following theorem.

Theorem 2 (Theorem on Normal Correlation). For a Gaussian vector (θ, ξ), the optimal estimator E(θ | ξ) of θ in terms of ξ, and its error matrix

    Δ = E[θ − E(θ | ξ)][θ − E(θ | ξ)]ᵀ,

are given by the formulas

    E(θ | ξ) = m_θ + V_θξ V_ξξ⁻¹ (ξ − m_ξ),                                  (19)

    Δ = V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ.                                          (20)

PROOF. Form the vector

    η = (θ − m_θ) − V_θξ V_ξξ⁻¹ (ξ − m_ξ).                                   (21)

We can verify at once that Eη(ξ − m_ξ)ᵀ = 0, i.e. η is not correlated with (ξ − m_ξ). But since (θ, ξ) is Gaussian, the vector (η, ξ) is also Gaussian. Hence, by the remark on Theorem 1, η and ξ − m_ξ are independent. Therefore η and ξ are independent, and consequently E(η | ξ) = Eη = 0. Therefore

    E[θ − m_θ | ξ] − V_θξ V_ξξ⁻¹ (ξ − m_ξ) = 0,

which establishes (19).

To establish (20) we consider the conditional covariance

    cov(θ, θ | ξ) = E[(θ − E(θ | ξ))(θ − E(θ | ξ))ᵀ | ξ].                     (22)

Since θ − E(θ | ξ) = η, and η and ξ are independent, we find that

    cov(θ, θ | ξ) = E(ηηᵀ | ξ) = Eηηᵀ
        = V_θθ + V_θξ V_ξξ⁻¹ V_ξξ V_ξξ⁻¹ (V_θξ)ᵀ − 2V_θξ V_ξξ⁻¹ (V_θξ)ᵀ
        = V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ.

Since cov(θ, θ | ξ) does not depend on "chance," we have

    Δ = E cov(θ, θ | ξ) = cov(θ, θ | ξ),

and this establishes (20).

Corollary. Let (θ, ξ₁, …, ξ_n) be an (n + 1)-dimensional Gaussian vector, with ξ₁, …, ξ_n independent. Then

    E(θ | ξ₁, …, ξ_n) = Eθ + Σ_{i=1}^{n} (cov(θ, ξ_i)/Vξ_i)(ξ_i − Eξ_i)

(cf. (8.12) and (8.13)).

5. Let ξ₁, ξ₂, … be a sequence of Gaussian random vectors that converge in probability to ξ. Let us show that ξ is also Gaussian. In accordance with (b) of Theorem 1, it is enough to establish this only for random variables.

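Formulas (19) and (20) lend themselves to a direct Monte Carlo check. The sketch below (Python/NumPy; the joint covariance is an arbitrary positive definite example, and the ordering of coordinates as (θ, ξ₁, ξ₂) is my own convention) verifies that the residual η is uncorrelated with ξ and that the mean-square error matches Δ:

```python
import numpy as np

# Numerical illustration of (19)-(20) for a scalar theta and a 2-dim xi,
# all means zero.  Coordinate order: (theta, xi_1, xi_2).
V = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.0]])
Vtt, Vtx, Vxx = V[0, 0], V[0, 1:], V[1:, 1:]

rng = np.random.default_rng(1)
sample = rng.multivariate_normal(np.zeros(3), V, size=400_000)
theta, xi = sample[:, 0], sample[:, 1:]

w = Vtx @ np.linalg.inv(Vxx)                   # coefficients of (19)
est = xi @ w                                   # E(theta | xi)
delta = Vtt - Vtx @ np.linalg.inv(Vxx) @ Vtx   # (20), a scalar here

eta = theta - est
assert abs(np.mean(eta * xi[:, 0])) < 0.01     # eta uncorrelated with xi
assert abs(np.mean(eta ** 2) - delta) < 0.01   # error matches (20)
```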
Let m_n = Eξ_n, σ_n² = Vξ_n. Then by Lebesgue's dominated convergence theorem,

    lim_{n→∞} E e^{itξ_n} = E e^{itξ},

i.e.

    lim_{n→∞} e^{itm_n − (1/2)t²σ_n²} = E e^{itξ}.

It follows from the existence of the limit on the left-hand side that there are numbers m and σ² such that

    m = lim_{n→∞} m_n,   σ² = lim_{n→∞} σ_n².

Consequently

    E e^{itξ} = e^{itm − (1/2)t²σ²},

i.e. ξ ~ 𝒩(m, σ²).

It follows, in particular, that the closed linear manifold L̄(ξ₁, ξ₂, …) generated by the Gaussian variables ξ₁, ξ₂, … (see §11, Subsection 5) consists of Gaussian variables.

6. We now turn to the concept of Gaussian systems in general.

Definition 2. A collection of random variables ξ = (ξ_α), where α belongs to some index set 𝔄, is a Gaussian system if the random vector (ξ_{α₁}, …, ξ_{α_n}) is Gaussian for every n ≥ 1 and all indices α₁, …, α_n chosen from 𝔄.

Let us notice some properties of Gaussian systems.

(a) If ξ = (ξ_α), α ∈ 𝔄, is a Gaussian system, then every subsystem ξ′ = (ξ_{α′}), α′ ∈ 𝔄′ ⊆ 𝔄, is also Gaussian.
(b) If ξ_α, α ∈ 𝔄, are independent Gaussian variables, then the system ξ = (ξ_α), α ∈ 𝔄, is Gaussian.
(c) If ξ = (ξ_α), α ∈ 𝔄, is a Gaussian system, the closed linear manifold L̄(ξ), consisting of all variables of the form Σ_{i=1}^{n} c_{α_i} ξ_{α_i}, together with their mean-square limits, forms a Gaussian system.

Let us observe that the converse of (a) is false in general. For example, let ξ₁ and η₁ be independent and ξ₁ ~ 𝒩(0, 1), η₁ ~ 𝒩(0, 1). Define the system

    (ξ, η) = { (ξ₁, |η₁|)    if ξ₁ ≥ 0,
               (ξ₁, −|η₁|)   if ξ₁ < 0.                                      (23)

Then it is easily verified that ξ and η are both Gaussian, but (ξ, η) is not.

Let ξ = (ξ_α)_{α∈𝔄} be a Gaussian system with mean-value vector m = (m_α), α ∈ 𝔄, and covariance matrix ℝ = (r_{αβ})_{α,β∈𝔄}, where m_α = Eξ_α. Then ℝ is evidently symmetric (r_{αβ} = r_{βα}) and nonnegative definite in the sense that for every vector c = (c_α)_{α∈𝔄} with values in R^𝔄 and only a finite number of nonzero coordinates c_α,

    (ℝc, c) = Σ_{α,β} r_{αβ} c_α c_β ≥ 0.                                    (24)

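The system (23) is easy to simulate. The Python/NumPy sketch below (illustration only, with an arbitrary seed) checks that η is marginally standard Gaussian while the sum ξ + η, which would be Gaussian if (ξ, η) were jointly Gaussian, puts almost no mass near zero:

```python
import numpy as np
from math import erf, sqrt

# The system (23): eta carries |eta_1| with the sign of xi_1.
rng = np.random.default_rng(2)
n = 500_000
xi1, eta1 = rng.standard_normal((2, n))
xi = xi1
eta = np.where(xi1 >= 0, np.abs(eta1), -np.abs(eta1))

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))   # standard normal d.f.

# eta is marginally N(0, 1): its empirical d.f. matches Phi.
for a in (-1.0, 0.0, 0.5, 1.5):
    assert abs(np.mean(eta <= a) - Phi(a)) < 0.005

# But xi + eta = sign(xi_1)(|xi_1| + |eta_1|) is thin near 0, far thinner
# than a Gaussian of the same variance would be: (xi, eta) is not Gaussian.
p_emp = np.mean(np.abs(xi + eta) < 0.2)
p_gauss = 2 * Phi(0.2 / sqrt(np.var(xi + eta))) - 1
assert p_emp < 0.03 < 0.08 < p_gauss
```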

We now ask the converse question. Suppose that we are given a parameter set 𝔄 = {α}, a vector m = (m_α)_{α∈𝔄} and a symmetric nonnegative definite matrix ℝ = (r_{αβ})_{α,β∈𝔄}. Do there exist a probability space (Ω, 𝓕, P) and a Gaussian system of random variables ξ = (ξ_α)_{α∈𝔄} on it, such that

    Eξ_α = m_α,   cov(ξ_α, ξ_β) = r_{αβ},   α, β ∈ 𝔄?

If we take a finite set α₁, …, α_n, then for the vector m̃ = (m_{α₁}, …, m_{α_n}) and the matrix ℝ̃ = (r_{αβ}), α, β = α₁, …, α_n, we can construct in Rⁿ the Gaussian distribution F_{α₁,…,α_n}(x₁, …, x_n) with characteristic function

    φ(t) = exp{i(t, m̃) − (1/2)(ℝ̃t, t)},   t = (t_{α₁}, …, t_{α_n}).

It is easily verified that the family

    {F_{α₁,…,α_n}(x₁, …, x_n); α_i ∈ 𝔄}

is consistent. Consequently, by Kolmogorov's theorem (Theorem 1, §9, and Remark 2 on it) the answer to our question is positive.

7. If 𝔄 = {1, 2, …}, then in accordance with the terminology of §5 the system of random variables ξ = (ξ_α)_{α∈𝔄} is a random sequence and is denoted by ξ = (ξ₁, ξ₂, …). A Gaussian sequence is completely described by its mean-value vector m = (m₁, m₂, …) and covariance matrix ℝ = ‖r_ij‖, r_ij = cov(ξ_i, ξ_j). In particular, if r_ij = σ_i² δ_ij, then ξ = (ξ₁, ξ₂, …) is a Gaussian sequence of independent random variables with ξ_i ~ 𝒩(m_i, σ_i²), i ≥ 1.

When 𝔄 = [0, 1], [0, ∞), (−∞, ∞), …, the system ξ = (ξ_t), t ∈ 𝔄, is a random process with continuous time. Let us mention some examples of Gaussian random processes. If we take their mean values to be zero, their probabilistic properties are completely described by the covariance matrices ‖r_st‖. We write r(s, t) instead of r_st and call it the covariance function.

Example 1. If T = [0, ∞) and

    r(s, t) = min(s, t),                                                     (25)

the Gaussian process ξ = (ξ_t)_{t≥0} with this covariance function (see Problem 2) and ξ₀ = 0 is a Brownian motion or Wiener process.

Observe that this process has independent increments; that is, for arbitrary t₁ < t₂ < ⋯ < t_n the random variables

    ξ_{t₂} − ξ_{t₁}, …, ξ_{t_n} − ξ_{t_{n−1}}

are independent. In fact, because the process is Gaussian it is enough to verify only that the increments are uncorrelated. But if s < t ≤ u < v, then

    E[ξ_t − ξ_s][ξ_v − ξ_u] = [r(t, v) − r(t, u)] − [r(s, v) − r(s, u)]
                            = (t − t) − (s − s) = 0.

Example 2. The process ξ = (ξ_t), 0 ≤ t ≤ 1, with ξ₀ = 0 and

    r(s, t) = min(s, t) − st                                                 (26)

is a conditional Wiener process (observe that since r(1, 1) = 0 we have P(ξ₁ = 0) = 1).

Example 3. The process ξ = (ξ_t), −∞ < t < ∞, with

    r(s, t) = e^{−|t−s|}                                                     (27)

is a Gauss–Markov process.

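A quick numerical sanity check of the nonnegative definiteness of (25)–(27), which Problem 2 below asks to prove: the matrices ‖r(t_i, t_j)‖ built on a finite grid of times have no negative eigenvalues (up to roundoff). A Python/NumPy sketch — this is evidence, not a proof:

```python
import numpy as np

# Gram matrices of the three covariance functions on a grid of times.
t = np.linspace(0.05, 0.95, 40)
S, T = np.meshgrid(t, t, indexing="ij")

kernels = [
    np.minimum(S, T),            # (25) Wiener process
    np.minimum(S, T) - S * T,    # (26) conditional Wiener process
    np.exp(-np.abs(S - T)),      # (27) Gauss-Markov process
]
for K in kernels:
    assert np.linalg.eigvalsh(K).min() > -1e-10
```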
8. PROBLEMS

1. Let ξ₁, ξ₂, ξ₃ be independent Gaussian random variables, ξ_i ~ 𝒩(0, 1). Show that

    (ξ₁ + ξ₂ξ₃)/√(1 + ξ₃²) ~ 𝒩(0, 1).

(In this case we encounter the interesting problem of describing the nonlinear transformations of independent Gaussian variables ξ₁, …, ξ_n whose distributions are still Gaussian.)

2. Show that (25), (26) and (27) are nonnegative definite (and consequently are actually covariance functions).

3. Let A be an m × n matrix. An n × m matrix A⊕ is a pseudoinverse of A if there are matrices U and V such that

    A A⊕ A = A,   A⊕ = U Aᵀ = Aᵀ V.

Show that A⊕ exists and is unique.

4. Show that (19) and (20) in the theorem on normal correlation remain valid when V_ξξ is singular, provided that V_ξξ⁻¹ is replaced by the pseudoinverse V_ξξ⊕.

5. Let (θ, ξ) = (θ₁, …, θ_k; ξ₁, …, ξ_l) be a Gaussian vector with nonsingular matrix Δ ≡ V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ. Show that the distribution function

    P(θ ≤ a | ξ) = P(θ₁ ≤ a₁, …, θ_k ≤ a_k | ξ)

has (P-a.s.) the density p(a₁, …, a_k | ξ) defined by

    (|Δ|^{−1/2}/(2π)^{k/2}) exp{−(1/2)(a − E(θ|ξ))ᵀ Δ⁻¹ (a − E(θ|ξ))}.

6. (S. N. Bernstein). Let ξ and η be independent identically distributed random variables with finite variances. Show that if ξ + η and ξ − η are independent, then ξ and η are Gaussian.

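Problem 1 can be checked by simulation: conditionally on ξ₃ the ratio is a standard Gaussian, hence so is it unconditionally. A Python/NumPy sketch (Monte Carlo with a fixed seed; illustration, not a proof):

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check that (xi1 + xi2*xi3)/sqrt(1 + xi3^2) ~ N(0, 1).
rng = np.random.default_rng(3)
x1, x2, x3 = rng.standard_normal((3, 400_000))
z = (x1 + x2 * x3) / np.sqrt(1 + x3 ** 2)

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))   # standard normal d.f.
for a in (-1.5, -0.5, 0.0, 1.0, 2.0):
    assert abs(np.mean(z <= a) - Phi(a)) < 0.005
```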
CHAPTER III

Convergence of Probability Measures. Central Limit Theorem

§1. Weak Convergence of Probability Measures and Distributions

1. Many important results of probability theory are formulated as limit theorems. So, indeed, were James Bernoulli's law of large numbers, as well as the De Moivre-Laplace limit theorem, the theorems with which the true theory of probability began. In the present chapter we discuss two central aspects of limit theorems: one is the concept of weak convergence; the other is the method of characteristic functions, one of the most powerful methods for proving and refining limit theorems.

We begin by recalling the statement of the law of large numbers (Chapter I, §5) for the Bernoulli scheme. Let ξ₁, ξ₂, … be a sequence of independent identically distributed random variables with P(ξ_i = 1) = p, P(ξ_i = 0) = q, p + q = 1. In terms of the concept of convergence in probability (Chapter II, §10), Bernoulli's law of large numbers can be stated as follows:

    S_n/n →ᴾ p,   n → ∞,                                                     (1)

where S_n = ξ₁ + ⋯ + ξ_n. (It will be shown in Chapter IV that in fact we have convergence with probability 1.) We put

    F_n(x) = P{S_n/n ≤ x},   F(x) = { 1, x ≥ p; 0, x < p },                  (2)

where F(x) is the distribution function of the degenerate random variable ξ ≡ p. Also let P_n and P be the probability measures on (R, 𝓑(R)) corresponding to the distributions F_n and F.

In accordance with Theorem 2 of §10, Chapter II, convergence in probability, S_n/n →ᴾ p, implies convergence in distribution, S_n/n →ᵈ p, which means that
    Ef(S_n/n) → Ef(p),   n → ∞,                                              (3)

for every function f = f(x) belonging to the class C(R) of bounded continuous functions on R. Since

    Ef(S_n/n) = ∫_R f(x)P_n(dx),   Ef(p) = ∫_R f(x)P(dx),

(3) can be written in the form

    ∫_R f(x)P_n(dx) → ∫_R f(x)P(dx),   f ∈ C(R),                             (4)

or (in accordance with §6 of Chapter II) in the form

    ∫_R f(x) dF_n(x) → ∫_R f(x) dF(x),   f ∈ C(R).                           (5)

In analysis, (4) is called weak convergence (of P_n to P, n → ∞) and written P_n →ʷ P. It is also natural to call (5) weak convergence of F_n to F and denote it by F_n →ʷ F. Thus we may say that in a Bernoulli scheme

    S_n/n →ᴾ p  ⇒  F_n →ʷ F.                                                 (6)

It is also easy to see from (1) that, for the distribution functions defined in (2), F_n(x) → F(x), n → ∞, for all points x ∈ R except for the single point x = p, where F(x) has a discontinuity. This shows that weak convergence F_n →ʷ F does not imply pointwise convergence of F_n(x) to F(x), n → ∞, for all points x ∈ R. However, it turns out that, both for Bernoulli schemes and for arbitrary distribution functions, weak convergence is equivalent (see Theorem 2 below) to "convergence in general" in the sense of the following definition.

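The behavior just described is visible numerically. The Python sketch below computes F_n(x) = P(S_n/n ≤ x) exactly from the binomial distribution (in log space, to avoid overflow; the values of p and n are arbitrary): F_n converges to the degenerate d.f. at every x ≠ p, but F_n(p) → 1/2 rather than F(p) = 1.

```python
from math import lgamma, log, exp

# Exact binomial computation of F_n(x) = P(S_n/n <= x) for the Bernoulli scheme.
p, n = 0.5, 10_000

def F_n(x):
    k = int(n * x)        # P(S_n <= floor(n x))
    lpmf = lambda j: (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                      + j * log(p) + (n - j) * log(1 - p))
    return sum(exp(lpmf(j)) for j in range(k + 1))

assert F_n(0.45) < 1e-6          # -> F(0.45) = 0
assert F_n(0.55) > 1 - 1e-6      # -> F(0.55) = 1
assert abs(F_n(p) - 0.5) < 0.01  # F_n(p) -> 1/2, although F(p) = 1
```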
Definition 1. A sequence of distribution functions {F_n}, defined on the real line, converges in general to the distribution function F (notation: F_n ⇒ F) if, as n → ∞,

    F_n(x) → F(x),   x ∈ P_C(F),

where P_C(F) is the set of points of continuity of F = F(x).

For Bernoulli schemes, F = F(x) is degenerate, and it is easy to see (see Problem 7 of §10, Chapter II) that

    (F_n ⇒ F) ⇔ (S_n/n →ᴾ p).

Therefore, taking account of Theorem 2 below,

    (S_n/n →ᴾ p) ⇔ (F_n →ʷ F) ⇔ (F_n ⇒ F),                                   (7)

and consequently the law of large numbers can be considered as a theorem on the weak convergence of the distribution functions defined in (2).

Let us write

    F_n(x) = P{(S_n − np)/√(npq) ≤ x},   F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du.   (8)

Definition 2. A sequence of probability measures {P n} converges weakly to the probability measure P (notation: P n ~ P) if

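The convergence in (8) can also be observed numerically. A Python sketch (illustrative parameters; the binomial d.f. is computed exactly in log space) compares the standardized binomial d.f. with the normal d.f.:

```python
from math import lgamma, log, exp, erf, sqrt

# De Moivre-Laplace: the d.f. of (S_n - np)/sqrt(npq) is close to the
# standard normal d.f. for large n.
n, p = 2000, 0.3
q = 1 - p

def binom_cdf(k):
    lp = [lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
          + j * log(p) + (n - j) * log(q) for j in range(k + 1)]
    return sum(map(exp, lp))

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))

for x in (-1.0, 0.0, 1.0):
    k = int(n * p + x * sqrt(n * p * q))    # largest k with (k - np)/sqrt(npq) <= x
    assert abs(binom_cdf(k) - Phi(x)) < 0.02
```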
L

f(x)P n(dx)

--+

L

J(x)P(dx)

(9)

for every function f = f(x) in the class IC(E) of continuous bounded functions on E.

§I. Weak Convergence of Probability Measures and Distributions

309

Definition 3. A sequence of probability measures {P n} converges in general to the probability measure P (notation: Pn => P) if (10) for every set A of G for which P(oA)

= 0.

(11)

(Here oA denotes the boundary of A: oA

=

[A] n [A],

where [A] is the closure of A.) The following fundamental theorem shows the eqmvalence of the concepts of weak convergence and convergence in general for probability measures, and contains still another equivalent statement.

Theorem 1. The following statements are equivalent. (I) Pn ~ P. (II) lim Pn(A) ::::;; P(A), A closed. (Ill) lim Pn(A) 2': P(A), A open. (IV) Pn=>P. PRooF. (I)=> (II). Let A be closed, f(x)

fix)=

=

IA(x) and

g(~ p(x, A)),

s > 0,

where p(x, A)

= inf{p(x, y): yEA},

1, g(t) = { 1 - t, 0,

t::::;; 0, 0 ::::;; t ::::;; 1, t 2': 1.

Let us also put A,= {x: p(x, A) < s} and observe that A, t A as s t 0. Since.fe(x) is bounded, continuous, and satisfies

we have

which establishes the required implication.

310

III. Convergence of Probability Measures. Central Limit Theorem

The implications (II) => (Ill) and (Ill) => (II) become obvious if we take the complements of the sets concerned. (III)=> (IV). Let A 0 = A\oA be the interior, and [A] the closure, of A. Then from (II), (III), and the hypothesis P(ilA) = 0, we have lim Pn(A)

~lim n

n

Pn([A])

~

P([A]) = P(A),

n

n

and therefore Pn(A) ~ P(A) for every A such that P(ilA) = 0. (IV)~ (1). Letf = f(x) be a bounded continuous function with lf(x)l M. We put

~

= {tER: P{x:f(x) = t} =I= 0} and consider a decomposition 7k = (t 0 , t 1 , •.. , tk) of [- M, M]: D

- M =

t0

<

t1

< ··· <

tk =

k

M,

~ 1,

with ti ¢: D, i = 0, 1, ... , k. (Observe that D is at most countable since the sets f- 1 { t} are disjoint and P is finite.) Let Bi = {x: t; ~ f(x) < t;+ d. Since f(x) is continuous and therefore the set f- 1(t;, ti+ 1 ) is open, we have oB; £ f- 1 {t;} u f- 1 {ti+ 1 }. The points t;, t;+ 1 ¢ D; therefore P(ilB;) = 0 and, by (IV), k-1

k-1

L

t;Pn(B;) ~

i=O

But

I

Lf(x)Pn(dx)- Lf(x)P(dx)

I~

L

:t~ t;Pn(B;)

+ I:t~ t; Pn(B;) -

:t~ t; P(B;) I

Lf(x)Pn(dx)-

t;P(B;)- Lf(x)P(dx)

2 max

(t;+ 1

O:s;i:s;k-1

whence, by (12), since the

7k (k

~

1) are arbitrary,

lim ( f(x)Pn(dx) = ( f(x)P(dx). n

JE

I

I

+I :t~ ~

(12)

t;P(B;).

t=O

JE

This completes the proof of the theorem.

-

t;)

I


Remark 1. The functions f(x) = I_A(x) and f_ε(x) that appear in the proof that (I) ⇒ (II) are respectively upper semicontinuous and uniformly continuous. Hence it is easy to show that each of the conditions of the theorem is equivalent to one of the following:

(V) ∫_E f(x)P_n(dx) → ∫_E f(x)P(dx) for all bounded uniformly continuous f(x);
(VI) lim sup_n ∫_E f(x)P_n(dx) ≤ ∫_E f(x)P(dx) for all bounded f(x) that are upper semicontinuous (lim sup f(x_n) ≤ f(x), x_n → x);
(VII) lim inf_n ∫_E f(x)P_n(dx) ≥ ∫_E f(x)P(dx) for all bounded f(x) that are lower semicontinuous (lim inf f(x_n) ≥ f(x), x_n → x).

Remark 2. Theorem 1 admits a natural generalization to the case when the probability measures P and P_n defined on (E, ℰ, ρ) are replaced by arbitrary (not necessarily probability) finite measures μ and μ_n. For such measures we can introduce weak convergence μ_n →ʷ μ and convergence in general μ_n ⇒ μ and, just as in Theorem 1, we can establish the equivalence of the following conditions:

(I*) μ_n →ʷ μ;
(II*) lim sup_n μ_n(A) ≤ μ(A), where A is closed, and μ_n(E) → μ(E);
(III*) lim inf_n μ_n(A) ≥ μ(A), where A is open, and μ_n(E) → μ(E);
(IV*) μ_n ⇒ μ.

Each of these is equivalent to any of (V*), (VI*), and (VII*), which are (V), (VI), and (VII) with P_n and P replaced by μ_n and μ.

3. Let (R, 𝓑(R)) be the real line with the system 𝓑(R) of sets generated by the Euclidean metric ρ(x, y) = |x − y| (compare Remark 2 of Subsection 2 of §2 of Chapter II). Let P and P_n, n ≥ 1, be probability measures on (R, 𝓑(R)) and let F and F_n, n ≥ 1, be the corresponding distribution functions.

Theorem 2. The following conditions are equivalent:
(1) P_n →ʷ P,
(2) P_n ⇒ P,
(3) F_n →ʷ F,
(4) F_n ⇒ F.

PROOF. Since (2) ⇔ (1) ⇔ (3), it is enough to show that (2) ⇔ (4).

If P_n ⇒ P, then in particular

    P_n(−∞, x] → P(−∞, x]

for all x ∈ R such that P{x} = 0. But this means that F_n ⇒ F.

Now let F_n ⇒ F. To prove that P_n ⇒ P it is enough (by Theorem 1) to show that lim inf_n P_n(A) ≥ P(A) for every open set A. If A is open, there is a countable collection of disjoint open intervals I₁, I₂, … (of the form (a, b)) such that A = Σ_{k=1}^{∞} I_k. Choose ε > 0 and in


each interval I_k = (a_k, b_k) select a subinterval Ĩ_k = (ã_k, b̃_k] such that ã_k, b̃_k ∈ P_C(F) and P(I_k) ≤ P(Ĩ_k) + ε·2⁻ᵏ. (Since F(x) has at most countably many discontinuities, such intervals Ĩ_k, k ≥ 1, certainly exist.) By Fatou's lemma,

    lim inf_n P_n(A) = lim inf_n Σ_{k=1}^{∞} P_n(I_k) ≥ Σ_{k=1}^{∞} lim inf_n P_n(I_k) ≥ Σ_{k=1}^{∞} lim inf_n P_n(Ĩ_k).

But

    P_n(Ĩ_k) = F_n(b̃_k) − F_n(ã_k) → F(b̃_k) − F(ã_k) = P(Ĩ_k).

Therefore

    lim inf_n P_n(A) ≥ Σ_{k=1}^{∞} P(Ĩ_k) ≥ Σ_{k=1}^{∞} (P(I_k) − ε·2⁻ᵏ) = P(A) − ε.

Since ε > 0 is arbitrary, this shows that lim inf_n P_n(A) ≥ P(A) if A is open. This completes the proof of the theorem.

4. Let (E, ℰ) be a measurable space. A collection 𝒦₀(E) ⊆ ℰ of subsets is a determining class if whenever two probability measures P and Q on (E, ℰ) satisfy

    P(A) = Q(A)   for all   A ∈ 𝒦₀(E),

it follows that the measures are identical, i.e.

    P(A) = Q(A)   for all   A ∈ ℰ.

If (E, ℰ, ρ) is a metric space, a collection 𝒦₁(E) ⊆ ℰ is a convergence-determining class if whenever probability measures P, P₁, P₂, … satisfy

    P_n(A) → P(A)   for all   A ∈ 𝒦₁(E)   with   P(∂A) = 0,

it follows that

    P_n(A) → P(A)   for all   A ∈ ℰ   with   P(∂A) = 0.

When (E, ℰ) = (R, 𝓑(R)), we can take a determining class 𝒦₀(R) to be the class of "elementary" sets 𝒦 = {(−∞, x], x ∈ R} (Theorem 1, §3, Chapter II). It follows from the equivalence of (2) and (4) of Theorem 2 that this class 𝒦 is also a convergence-determining class. It is natural to ask about such determining classes in more general spaces.

For Rⁿ, n ≥ 2, the class 𝒦 of "elementary" sets of the form (−∞, x] = (−∞, x₁] × ⋯ × (−∞, x_n], where x = (x₁, …, x_n) ∈ Rⁿ, is both a determining class (Theorem 2, §3, Chapter II) and a convergence-determining class (Problem 2).

Figure 35

For R^∞ the cylindrical sets 𝒦₀(R^∞) are the "elementary" sets whose probabilities uniquely determine the probabilities of the Borel sets (Theorem 3, §3, Chapter II). It turns out that in this case the class of cylindrical sets is also the class of convergence-determining sets (Problem 3). Therefore 𝒦₁(R^∞) = 𝒦₀(R^∞).

We might expect that the cylindrical sets would still constitute determining classes in more general spaces. However, this is, in general, not the case. For example, consider the space (C, 𝓑₀(C), ρ) with the uniform metric ρ (see Subsection 6, §2, Chapter II). Let P be the probability measure concentrated on the element x = x(t) ≡ 0, 0 ≤ t ≤ 1, and let P_n, n ≥ 1, be the probability measures each of which is concentrated on the element x = x_n(t) shown in Figure 35. It is easy to see that P_n(A) → P(A) for all cylindrical sets A with P(∂A) = 0. But if we consider, for example, the set

    A = {α ∈ C: |α(t)| ≤ 1/2, 0 ≤ t ≤ 1} ∈ 𝓑₀(C),

then P(∂A) = 0, P_n(A) = 0, P(A) = 1, and consequently P_n ⇏ P. Therefore 𝒦₀(C) = 𝓑₀(C) but 𝒦₀(C) ⊂ 𝒦₁(C) (with strict inclusion).

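A numerical sketch of this counterexample (Python/NumPy). The exact shape of the functions x_n(t) in Figure 35 is not reproduced here, so the code assumes the standard choice — a "tent" of height 1 with support [0, 2/n] — which is consistent with the axis labels 1/n and 2/n: every fixed coordinate x_n(t) tends to 0 (so cylindrical-set probabilities converge), while the uniform norm of x_n does not tend to 0.

```python
import numpy as np

# Tent functions: x_n(t) = max(0, 1 - |n t - 1|), peak 1 at t = 1/n.
def x_n(t, n):
    return np.maximum(0.0, 1.0 - np.abs(n * t - 1.0))

t = np.linspace(0.0, 1.0, 10_001)
for n in (10, 100, 1000):
    assert x_n(0.37, n) == 0.0          # pointwise: x_n(t0) -> 0 for fixed t0 > 0
    assert np.max(x_n(t, n)) > 0.999    # but sup_t |x_n(t)| stays equal to 1
```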
5. PROBLEMS

1. Let us say that a function F = F(x), defined on Rⁿ, is continuous at x ∈ Rⁿ provided that, for every ε > 0, there is a δ > 0 such that |F(x) − F(y)| < ε for all y ∈ Rⁿ that satisfy

    x − δe < y < x + δe,

where e = (1, …, 1) ∈ Rⁿ. Let us say that a sequence of distribution functions {F_n} converges in general to the distribution function F (F_n ⇒ F) if F_n(x) → F(x) for all points x ∈ Rⁿ where F = F(x) is continuous. Show that the conclusion of Theorem 2 remains valid for Rⁿ, n > 1. (See the remark on Theorem 2.)


2. Show that the class 𝒦 of "elementary" sets in Rⁿ is a convergence-determining class.

3. Let E be one of the spaces R^∞, C or D. Let us say that a sequence {P_n} of probability measures (defined on the σ-algebra ℰ of Borel sets generated by the open sets) converges in general in the sense of finite-dimensional distributions to the probability measure P (notation: P_n ⇒^f P) if P_n(A) → P(A), n → ∞, for all cylindrical sets A with P(∂A) = 0. For R^∞, show that

    (P_n ⇒^f P) ⇔ (P_n ⇒ P).

4. Let F and G be distribution functions on the real line and let

    L(F, G) = inf{h > 0: F(x − h) − h ≤ G(x) ≤ F(x + h) + h}

be the Lévy distance (between F and G). Show that convergence in general is equivalent to convergence in the Lévy metric:

    (F_n ⇒ F) ⇔ L(F_n, F) → 0.

5. Let F_n ⇒ F and let F be continuous. Show that in this case F_n(x) converges uniformly to F(x):

    sup_x |F_n(x) − F(x)| → 0,   n → ∞.

6. Prove the statement in Remark 1 on Theorem 1.

7. Establish the equivalence of (I*)-(IV*) as stated in Remark 2 on Theorem 1.

8. Show that P_n →ʷ P if and only if every subsequence {P_{n′}} of {P_n} contains a subsequence {P_{n″}} such that P_{n″} →ʷ P.

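The Lévy distance of Problem 4 can be approximated on a grid. A Python/NumPy sketch (accuracy is limited by the grid and the step in h; the normal distribution function and the shifts are arbitrary illustrations):

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda a: 0.5 * (1 + erf(a / sqrt(2))))  # normal d.f.
x = np.linspace(-8, 8, 4001)

def levy(F, G):
    # Scan h upward until both defining inequalities hold on the grid.
    for h in np.arange(0.0, 1.0, 0.001):
        if (np.all(F(x - h) - h <= G(x) + 1e-12)
                and np.all(G(x) <= F(x + h) + h + 1e-12)):
            return h
    return 1.0

assert levy(Phi, Phi) == 0.0
# For the normal d.f., shifting by s gives a Levy distance below s/2,
# and the distance shrinks with the shift.
prev = 1.0
for s in (0.5, 0.1, 0.02):
    d = levy(Phi, lambda a, s=s: Phi(a - s))
    assert d <= s / 2 + 0.002 and d <= prev
    prev = d
```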
§2. Relative Compactness and Tightness of Families of Probability Distributions

1. If we are given a sequence of probability measures, then before we can consider the question of its (weak) convergence to some probability measure, we have of course to establish whether the sequence converges in general to some measure, or has at least one convergent subsequence.

For example, the sequence {P_n}, where P_{2n} = P, P_{2n+1} = Q, and P and Q are different probability measures, is evidently not convergent, but has the two convergent subsequences {P_{2n}} and {P_{2n+1}}.

It is easy to construct a sequence {P_n} of probability measures P_n, n ≥ 1, that not only fails to converge, but contains no convergent subsequences at all. All that we have to do is to take P_n, n ≥ 1, to be concentrated at {n} (that is, P_n{n} = 1). In fact, since lim_n P_n(a, b] = 0 whenever a < b, a limit measure would have to be identically zero, contradicting the fact that 1 = P_n(R) ↛ 0,

n → ∞. It is interesting to observe that in this example the corresponding sequence {F_n} of distribution functions,

    F_n(x) = { 1,  x ≥ n,
               0,  x < n,

is evidently convergent: for every x ∈ R,

    F_n(x) → G(x) ≡ 0.

However, the limit function G = G(x) is not a distribution function (in the sense of Definition 1 of §3, Chapter II). This instructive example shows that the space of distribution functions is not compact. It also shows that if a sequence of distribution functions is to converge to a limit that is also a distribution function, we must have some hypothesis that will prevent mass from "escaping to infinity."

After these introductory remarks, which illustrate the kinds of difficulty that can arise, we turn to the basic definitions.

2. Let us suppose that all measures are defined on the metric space (E, ℰ, ρ).

Definition 1. A family of probability measures 𝒫 = {P_α; α ∈ 𝔄} is relatively compact if every sequence of measures from 𝒫 contains a subsequence which converges weakly to a probability measure.

We emphasize that in this definition the limit measure is to be a probability measure, although it need not belong to the original class 𝒫. (This is why the word "relatively" appears in the definition.)

It is often far from simple to verify that a given family of probability measures is relatively compact. Consequently it is desirable to have simple and useable tests for this property. We need the following definitions.

Definition 2. A family of probability measures 𝒫 = {P_α; α ∈ 𝔄} is tight if, for every ε > 0, there is a compact set K ⊆ E such that

    sup_{α∈𝔄} P_α(E∖K) ≤ ε.                                                  (1)

Definition 3. A family of distribution functions F = {F_α; α ∈ 𝔄} defined on Rⁿ, n ≥ 1, is relatively compact (or tight) if the same property is possessed by the family 𝒫 = {P_α; α ∈ 𝔄} of probability measures, where P_α is the measure constructed from F_α.

3. The following result is fundamental for the study of weak convergence of probability measures.

Theorem 1 (Prohorov's Theorem). Let ~ = {P"'; rx Em} be a family of probability measures defined on a complete separable metric space (E, iff, p). Then~ is relatively compact if and only if it is tight.
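Tightness in the sense of Definition 2 is often easy to verify directly. As a small numerical illustration (our own sketch, anticipating Problem 2 below: the particular parameter ranges, the tolerance, and the use of math.erf for the Gaussian distribution function are all our choices, not from the text), the following finds a single compact interval [−K, K] that carries almost all the mass of every member of a Gaussian family with uniformly bounded means and variances.

```python
import math

def norm_cdf(x, m=0.0, s=1.0):
    # distribution function of the Gaussian measure with mean m and std s
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def mass_outside(K, m, s):
    # P_alpha(R \ [-K, K]) for the Gaussian measure with parameters (m, s)
    return 1.0 - (norm_cdf(K, m, s) - norm_cdf(-K, m, s))

# A family with |m_alpha| <= 2 and sigma_alpha <= 2:
params = [(m, s) for m in (-2.0, 0.0, 2.0) for s in (0.5, 1.0, 2.0)]

eps = 1e-6
K = 2.0
while max(mass_outside(K, m, s) for m, s in params) > eps:
    K += 1.0          # enlarge the compact set [-K, K]
print(K)              # one compact set works for the whole family
```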


III. Convergence of Probability Measures. Central Limit Theorem

We shall give the proof only when the space is the real line. (The proof can be carried over, almost unchanged, to arbitrary Euclidean spaces Rⁿ, n ≥ 2. Then the theorem can be extended successively to R^∞, to σ-compact spaces, and finally to general complete separable metric spaces, by reducing each case to the preceding one.)

Necessity. Let the family 𝒫 = {P_α; α ∈ 𝔄} of probability measures defined on (R, ℬ(R)) be relatively compact but not tight. Then there is an ε > 0 such that for every compact K ⊆ R

    sup_α P_α(R\K) > ε,

and therefore, for each interval I = (a, b),

    sup_α P_α(R\I) > ε.

It follows that for every interval I_n = (−n, n), n ≥ 1, there is a measure P_{α_n} such that

    P_{α_n}(R\I_n) > ε.

Since the original family 𝒫 is relatively compact, we can select from {P_{α_n}}_{n ≥ 1} a subsequence {P_{α_{n_k}}} such that P_{α_{n_k}} ⇒ Q, where Q is a probability measure. Then, by the equivalence of conditions (I) and (II) in Theorem 1 of §1, we have

    lim sup_k P_{α_{n_k}}(R\I_n) ≤ Q(R\I_n)        (2)

for every n ≥ 1. But Q(R\I_n) ↓ 0, n → ∞, and the left side of (2) exceeds ε > 0. This contradiction shows that relatively compact sets are tight.

To prove the sufficiency we need a general result (Helly's theorem) on the sequential compactness of families of generalized distribution functions (Subsection 2 of §3 of Chapter II).

Let 𝒥 = {G} be the collection of generalized distribution functions G = G(x) that satisfy:

(1) G(x) is nondecreasing;
(2) 0 ≤ G(−∞), G(+∞) ≤ 1;
(3) G(x) is continuous on the right.

Then 𝒥 clearly contains the class of distribution functions ℱ = {F} for which F(−∞) = 0 and F(+∞) = 1.

Theorem 2 (Helly's Theorem). The class 𝒥 = {G} of generalized distribution functions is sequentially compact, i.e. for every sequence {G_n} of functions from 𝒥 we can find a function G ∈ 𝒥 and a sequence {n_k} ⊆ {n} such that

    G_{n_k}(x) → G(x),   k → ∞,

for every point x of the set P_C(G) of points of continuity of G = G(x).


PROOF. Let T = {x₁, x₂, ...} be a countable dense subset of R. Since the sequence of numbers {G_n(x₁)} is bounded, there is a subsequence N₁ = {n₁⁽¹⁾, n₂⁽¹⁾, ...} such that G_{n_i⁽¹⁾}(x₁) approaches a limit g₁ as i → ∞. Then we extract from N₁ a subsequence N₂ = {n₁⁽²⁾, n₂⁽²⁾, ...} such that G_{n_i⁽²⁾}(x₂) approaches a limit g₂ as i → ∞; and so on.

On the set T ⊆ R we can define a function G_T(x) by

    G_T(x_i) = g_i,   x_i ∈ T,

and consider the "Cantor" diagonal sequence N = {n₁⁽¹⁾, n₂⁽²⁾, ...}. Then, for each x_i ∈ T, as m → ∞, we have

    G_{n_m⁽ᵐ⁾}(x_i) → G_T(x_i).

Finally, let us define G = G(x) for all x ∈ R by putting

    G(x) = inf{G_T(y): y ∈ T, y > x}.        (3)

We claim that G = G(x) is the required function and G_{n_m⁽ᵐ⁾}(x) → G(x) at all points x of continuity of G.

Since all the functions G_n under consideration are nondecreasing, we have G_{n_m⁽ᵐ⁾}(x) ≤ G_{n_m⁽ᵐ⁾}(y) for all x and y that belong to T and satisfy the inequality x ≤ y. Hence for such x and y,

    G_T(x) ≤ G_T(y).

It follows from this and (3) that G = G(x) is nondecreasing.

Now let us show that it is continuous on the right. Let x_k ↓ x and d = lim_k G(x_k). Clearly G(x) ≤ d, and we have to show that actually G(x) = d. Suppose the contrary, that is, let G(x) < d. It follows from (3) that there is a y ∈ T, x < y, such that G_T(y) < d. But x < x_k < y for sufficiently large k, and therefore G(x_k) ≤ G_T(y) < d and lim_k G(x_k) < d, which contradicts d = lim_k G(x_k). Thus we have constructed a function G that belongs to 𝒥.

We now establish that G_{n_m⁽ᵐ⁾}(x₀) → G(x₀) for every x₀ ∈ P_C(G). If x₀ < y ∈ T, then

    lim sup_m G_{n_m⁽ᵐ⁾}(x₀) ≤ lim_m G_{n_m⁽ᵐ⁾}(y) = G_T(y),

whence

    lim sup_m G_{n_m⁽ᵐ⁾}(x₀) ≤ inf{G_T(y): y > x₀, y ∈ T} = G(x₀).        (4)

On the other hand, let x₁ < y < x₀, y ∈ T. Then

    G(x₁) ≤ G_T(y) = lim_m G_{n_m⁽ᵐ⁾}(y) ≤ lim inf_m G_{n_m⁽ᵐ⁾}(x₀).

Hence if we let x₁ ↑ x₀ we find that

    G(x₀−) ≤ lim inf_m G_{n_m⁽ᵐ⁾}(x₀).        (5)

But if G(x₀−) = G(x₀) we can infer from (4) and (5) that G_{n_m⁽ᵐ⁾}(x₀) → G(x₀), m → ∞. This completes the proof of the theorem.

We can now complete the proof of Theorem 1.

Sufficiency. Let the family 𝒫 be tight and let {P_n} be a sequence of probability measures from 𝒫. Let {F_n} be the corresponding sequence of distribution functions. By Helly's theorem, there are a subsequence {F_{n_k}} ⊆ {F_n} and a generalized distribution function G ∈ 𝒥 such that F_{n_k}(x) → G(x) for x ∈ P_C(G). Let us show that because 𝒫 was assumed tight, the function G = G(x) is in fact a genuine distribution function (G(−∞) = 0, G(+∞) = 1).

Take ε > 0, and let I = (a, b] be the interval for which

    sup_n P_n(R\I) < ε,

or, equivalently,

    P_n(a, b] ≥ 1 − ε,   n ≥ 1.

Choose points a', b' ∈ P_C(G) such that a' < a, b' > b. Then

    1 − ε ≤ P_{n_k}(a, b] ≤ P_{n_k}(a', b'] = F_{n_k}(b') − F_{n_k}(a') → G(b') − G(a').

It follows that G(+∞) − G(−∞) = 1, and since 0 ≤ G(−∞) ≤ G(+∞) ≤ 1, we have G(−∞) = 0 and G(+∞) = 1.

Therefore the limit function G = G(x) is a distribution function and F_{n_k} ⇒ G. Together with Theorem 2 of §1 this shows that P_{n_k} ⇒ Q, where Q is the probability measure corresponding to the distribution function G. This completes the proof of Theorem 1.

4. PROBLEMS

1. Carry out the proofs of Theorems 1 and 2 for Rⁿ, n ≥ 2.

2. Let P_α be a Gaussian measure on the real line, with parameters m_α and σ_α², α ∈ 𝔄. Show that the family 𝒫 = {P_α; α ∈ 𝔄} is tight if and only if there are constants a and b such that

    |m_α| ≤ a,   σ_α² ≤ b,   α ∈ 𝔄.

3. Construct examples of tight and nontight families 𝒫 = {P_α; α ∈ 𝔄} of probability measures defined on (R^∞, ℬ(R^∞)).

§3. Proofs of Limit Theorems by the Method of Characteristic Functions

1. The proofs of the first limit theorems of probability theory, the law of large numbers and the De Moivre-Laplace and Poisson theorems for Bernoulli schemes, were based on direct analysis of the limit functions of the distributions F_n, which are expressed rather simply in terms of binomial probabilities. (In the Bernoulli scheme, we are adding random variables that take only two values, so that in principle we can find F_n explicitly.) However, it is practically impossible to apply a similar direct method to the study of more complicated random variables.

The first step in proving limit theorems for sums of arbitrarily distributed random variables was taken by Chebyshev. The inequality that he discovered, and which is now known as Chebyshev's inequality, not only makes it possible to give an elementary proof of James Bernoulli's law of large numbers, but also lets us establish very general conditions for this law to hold, when stated in the form

    P{ |(S_n − ES_n)/n| ≥ ε } → 0,   n → ∞,  for every ε > 0,        (1)

for sums S_n = ξ₁ + ⋯ + ξ_n, n ≥ 1, of independent random variables. (See Problem 2.)

Furthermore, Chebyshev created (and Markov perfected) the "moment method," which made it possible to show that the conclusion of the De Moivre-Laplace theorem, written in the form

    P{ (S_n − ES_n)/√(VS_n) ≤ x } → (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du,        (2)

is "universal," in the sense that it is valid under very general hypotheses concerning the nature of the random variables. For this reason it is known as the central limit theorem of probability theory. Somewhat later Lyapunov proposed a different method for proving the central limit theorem, based on the idea (which goes back to Laplace) of the characteristic function of a probability distribution. Subsequent developments have shown that Lyapunov's method of characteristic functions is extremely effective for proving the most diverse limit theorems. Consequently it has been extensively developed and widely applied. In essence, the method is as follows. 2. We already know (Chapter II, §12) that there is a one-to-one correspondence between distribution functions and characteristic functions. Hence we can study the properties of distribution functions by using the corresponding characteristic functions. It is a fortunate circumstance that weak convergence Fn ~ F of distributions is equivalent to pointwise convergence q>n-+ q> of the corresponding characteristic functions. Moreover, we have the following result, which provides the basic method of proving theorems on weak convergence for distributions on the real line.

Theorem 1 (Continuity Theorem). Let {F_n} be a sequence of distribution functions F_n = F_n(x), x ∈ R, and let {φ_n} be the corresponding sequence of characteristic functions,

    φ_n(t) = ∫_{−∞}^{∞} e^{itx} dF_n(x),   t ∈ R.

(1) If F_n ⇒ F, where F = F(x) is a distribution function, then φ_n(t) → φ(t), t ∈ R, where φ(t) is the characteristic function of F = F(x).

(2) If lim_n φ_n(t) exists for each t ∈ R and φ(t) = lim_n φ_n(t) is continuous at t = 0, then φ(t) is the characteristic function of a probability distribution F = F(x), and

    F_n ⇒ F.

The proof of conclusion (1) is an immediate consequence of the definition of weak convergence, applied to the functions Re e^{itx} and Im e^{itx}. The proof of (2) requires some preliminary propositions.

Lemma 1. Let {P_n} be a tight family of probability measures. Suppose that every weakly convergent subsequence {P_{n'}} of {P_n} converges to the same probability measure P. Then the whole sequence {P_n} converges to P.

PROOF. Suppose that P_n ⇏ P. Then there is a bounded continuous function f = f(x) such that

    ∫_R f(x) P_n(dx) ↛ ∫_R f(x) P(dx).

It follows that there exist ε > 0 and an infinite sequence {n'} ⊆ {n} such that

    | ∫_R f(x) P_{n'}(dx) − ∫_R f(x) P(dx) | ≥ ε > 0.        (3)

By Prohorov's theorem (§2) we can select a subsequence {P_{n''}} of {P_{n'}} such that P_{n''} ⇒ Q, where Q is a probability measure. By the hypotheses of the lemma, Q = P, and therefore

    ∫_R f(x) P_{n''}(dx) → ∫_R f(x) P(dx),

which leads to a contradiction with (3). This completes the proof of the lemma.

Lemma 2. Let {P_n} be a tight family of probability measures on (R, ℬ(R)). A necessary and sufficient condition for the sequence {P_n} to converge weakly to a probability measure is that for each t ∈ R the limit lim_n φ_n(t) exists, where φ_n(t) is the characteristic function of P_n:

    φ_n(t) = ∫_R e^{itx} P_n(dx).

PROOF. If {P_n} is tight, by Prohorov's theorem there are a subsequence {P_{n'}} and a probability measure P such that P_{n'} ⇒ P. Suppose that the whole sequence {P_n} does not converge to P (P_n ⇏ P). Then, by Lemma 1, there are a subsequence {P_{n''}} and a probability measure Q such that P_{n''} ⇒ Q, and P ≠ Q.

Now we use the existence of lim_n φ_n(t) for each t ∈ R. Then

    lim_{n'} ∫_R e^{itx} P_{n'}(dx) = lim_{n''} ∫_R e^{itx} P_{n''}(dx),

and therefore

    ∫_R e^{itx} P(dx) = ∫_R e^{itx} Q(dx),   t ∈ R.

But the characteristic function determines the distribution uniquely (Theorem 2, §12, Chapter II). Hence P = Q, which contradicts P ≠ Q; therefore the whole sequence {P_n} converges to P.

The converse part of the lemma follows immediately from the definition of weak convergence.

The following lemma estimates the "tails" of a distribution function in terms of the behavior of its characteristic function in a neighborhood of zero.

Lemma 3. Let F = F(x) be a distribution function on the real line and let φ = φ(t) be its characteristic function. Then there is a constant K > 0 such that for every a > 0

    ∫_{|x| ≥ 1/a} dF(x) ≤ (K/a) ∫_0^a [1 − Re φ(t)] dt.        (4)

PROOF. Since Re φ(t) = ∫_{−∞}^{∞} cos tx dF(x), we find by Fubini's theorem that

    (1/a) ∫_0^a [1 − Re φ(t)] dt
        = (1/a) ∫_0^a [ ∫_{−∞}^{∞} (1 − cos tx) dF(x) ] dt
        = ∫_{−∞}^{∞} [ (1/a) ∫_0^a (1 − cos tx) dt ] dF(x)
        = ∫_{−∞}^{∞} ( 1 − (sin ax)/(ax) ) dF(x)
        ≥ ∫_{|x| ≥ 1/a} ( 1 − (sin ax)/(ax) ) dF(x)
        ≥ inf_{|y| ≥ 1} ( 1 − (sin y)/y ) · ∫_{|x| ≥ 1/a} dF(x)
        = (1/K) ∫_{|x| ≥ 1/a} dF(x),

where

    1/K = inf_{|y| ≥ 1} ( 1 − (sin y)/y ) = 1 − sin 1 ≥ 1/7,

so that (4) holds with K = 7. This establishes the lemma.

Proof of conclusion (2) of Theorem 1. Let φ_n(t) → φ(t), n → ∞, where φ(t) is continuous at 0. Let us show that it follows that the family of probability measures {P_n} is tight, where P_n is the measure corresponding to F_n. By (4) and the dominated convergence theorem,

    P_n{ R \ (−1/a, 1/a) } = ∫_{|x| ≥ 1/a} dF_n(x) ≤ (K/a) ∫_0^a [1 − Re φ_n(t)] dt
        → (K/a) ∫_0^a [1 − Re φ(t)] dt

as n → ∞. Since, by hypothesis, φ(t) is continuous at 0 and φ(0) = 1, for every ε > 0 there is an a > 0 such that

    (K/a) ∫_0^a [1 − Re φ(t)] dt ≤ ε/2,

and hence P_n{R \ (−1/a, 1/a)} ≤ ε for all sufficiently large n; since each of the finitely many remaining measures is tight by itself, a can be decreased, if necessary, so that P_n{R \ (−1/a, 1/a)} ≤ ε for all n ≥ 1. Consequently {P_n} is tight, and by Lemma 2 there is a probability measure P such that P_n ⇒ P. Hence

    φ_n(t) → ∫_R e^{itx} P(dx),

but also φ_n(t) → φ(t). Therefore φ(t) is the characteristic function of P. This completes the proof of the theorem.
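Inequality (4) can be sanity-checked on a distribution whose characteristic function is known in closed form. The sketch below uses the standard Cauchy law, for which F(x) = 1/2 + (1/π) arctan x and φ(t) = e^{−|t|} (standard facts, not derived in the text); both sides of (4) with K = 7 then reduce to elementary expressions.

```python
import math

def lhs(a):
    # F-mass of {|x| >= 1/a} for the standard Cauchy distribution
    return 1.0 - (2.0 / math.pi) * math.atan(1.0 / a)

def rhs(a):
    # (7/a) * integral_0^a [1 - Re phi(t)] dt with phi(t) = exp(-|t|);
    # the integral evaluates to a - 1 + exp(-a)
    return (7.0 / a) * (a - 1.0 + math.exp(-a))

for a in (0.1, 0.5, 1.0, 2.0, 10.0):
    assert lhs(a) <= rhs(a)
print("inequality (4) holds at the sampled values of a")
```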

Corollary. Let {F_n} be a sequence of distribution functions and {φ_n} the corresponding sequence of characteristic functions. Also let F be a distribution function and φ its characteristic function. Then F_n ⇒ F if and only if φ_n(t) → φ(t) for all t ∈ R.

Remark. Let η, η₁, η₂, ... be random variables and F_{η_n} ⇒ F_η. In accordance with the definition of §10 of Chapter II, we then say that the random variables η₁, η₂, ... converge to η in distribution, and write η_n →ᵈ η. Since this notation is self-explanatory, we shall frequently use it instead of F_{η_n} ⇒ F_η when stating limit theorems.


3. In the next section, Theorem 1 will be applied to prove the central limit theorem for independent but not identically distributed random variables. In the present section we shall merely apply the method of characteristic functions to prove some simple limit theorems.

Theorem 2 (Law of Large Numbers). Let ξ₁, ξ₂, ... be a sequence of independent identically distributed random variables with E|ξ₁| < ∞, S_n = ξ₁ + ⋯ + ξ_n and Eξ₁ = m. Then S_n/n →ᴾ m, that is, for every ε > 0,

    P{ |S_n/n − m| ≥ ε } → 0,   n → ∞.

PROOF. Let φ(t) = Ee^{itξ₁} and φ_{S_n/n}(t) = Ee^{itS_n/n}. Then

    φ_{S_n/n}(t) = [φ(t/n)]ⁿ

by (II.12.6). But according to (II.12.14),

    φ(t) = 1 + itm + o(t),   t → 0.

Therefore for each given t ∈ R,

    φ(t/n) = 1 + i(t/n)m + o(1/n),   n → ∞,

and therefore

    φ_{S_n/n}(t) = [1 + i(t/n)m + o(1/n)]ⁿ → e^{itm},   n → ∞.

The function e^{itm} is the characteristic function of the random variable that is identically equal to m. Hence, by the corollary to Theorem 1,

    S_n/n →ᵈ m,

and consequently (see Problem 7, §10, Chapter II)

    S_n/n →ᴾ m.
This completes the proof of the theorem.

324

III. Convergence of Probability Measures. Central Limit Theorem

Theorem 3 (Central Limit Theorem for Independent Identically Distributed Random Variables). Let 1 , 2 , ••• be a sequence of independent identically distributed (nondegenerate) random variables with Ee~ < oo and Sn = + · · · + Then as n- oo

ee

en·

e1

xeR,

(5)

where

1 W(x) = ;;,::. y 2n PROOF.

Let

Ee

1

=

m,

Ve

1

=

fx

e-" 212 du.

-oo

a 2 and cp(t) =

Eeit(~,

-m>.

Then if we put

ES}

S cp"(t) = E exp{ it ~" , we find that

But by (11.12.14)

t-o. Therefore

cpn(t)

=

[1 - ;;~: + o(~)

r-

e-1 2 /2.

as n - oo for fixed t. The function e- 1212 is the characteristic function of a random variable (denoted by %(0, 1)) with mean zero and unit variance. This, by Theorem 1, also establishes (5). In accordance with the remark in Theorem 1, this can also be written in the form

S~n J4 .ff(O, 1). vsn

(6)

This completes the proof of the theorem. The preceding two theorems have dealt with the behavior of the probabilities of (normalized and symmetrized) sums of independent and identically distributed random variables. However, in order to state Poisson's theorem (§6, Chapter I) we have to use a more general model.

325

§3. Proofs of Limit Theorems by the Method of Characteristic Functions

Let us suppose that for each n ~ 1 we are given a sequence of independent random variables ~" 1 , ••• , ~nn· In other words, let there be given a triangular array

(~11 ~21> ~22

)

~31• ~32• ~33

of random variables, those in each row being independent. Put

sn = ~n1 + ... + ~nn·

Theorem 4 (Poisson's Theorem). For each n ~ 1 let the independent random variables ~" 1 , ••• , ~nn be such that P(~nk = 1) =

Pnk•

+ qnk = 1. Suppose that

Pnk

max Pnk-+ 0,

and

Ll:=

1

Pnk -+

,( >

n-+ oo,

0, n -+ oo. Then,Jor each m = 0, 1, ... , e-).A.m P(Sn = m) -+ - -1- , n -+ oo.

(7)

m.

PRooF.

Since

for 1

k

~

~

n, by our assumptions we have

CfJsn(t) = EeitSn =

rr (pnkeit + qnk) n

k=

1

n

=

f1 (1 + Pnieit -

1))-+ exp{A.(eit - 1)},

k=1

n-+ oo.

The function cp(t) = exp{A.(eit - 1)} is the characteristic function of the Poisson distribution (II.l2.11), so that (7) is established. If n(A.) denotes a Poisson random variable with parameter A., then (7) can be written like (6), in the form

sn ~ n(A.).

This completes the proof of the theorem.

4.

PROBLEMS

1. Prove Theorem 1 for R", n

~

2.

2. Let eto ~ 2 , ••• be a sequence of independent random variables with finite means E 1.;.1 and variances V such that V ~ K < oo, where K is a constant. Use Chebyshev's inequality to prove the law of large numbers (1).

e.

e.

326

III. Convergence of Probability Measures. Central Limit Theorem

3. Show, as a corollary to Theorem 1, that the family {cp.} is uniformly continuous and that
4.

-+

q> uniformly on every finite interval.

Let~., n ;;:::: 1, be random variables with characteristic functions q>~"(t), n :2:: 1. Show that ~ • .!4 0 if and only if q>~"(t) -+ 1, n -+ oo, in some neighborhood oft = 0.

5. Let X~> X 2 , ••• be a sequence of independent random vectors (with values in Rk) with mean zero and (finite) covariance matrix r. Show that

xl + · · · + x • .!!.. JV(O, r).

Jn

(Compare Theorem 3.)

§4. Central Limit Theorem for Sums of Independent Random Variables

1. Let us suppose that for every n ≥ 1 we have a sequence

    ξ_{n1}, ..., ξ_{nn}

of independent random variables with

    Eξ_{nk} = 0,   σ²_{nk} = Vξ_{nk},   Σ_{k=1}^{n} σ²_{nk} = 1.

Let

    S_n = ξ_{n1} + ⋯ + ξ_{nn},   F_{nk}(x) = P(ξ_{nk} ≤ x),
    Φ(x) = (2π)^{−1/2} ∫_{−∞}^{x} e^{−y²/2} dy,   Φ_{nk}(x) = Φ(x/σ_{nk}).

Theorem 1. A sufficient (and necessary) condition that S_n →ᵈ 𝒩(0, 1) is that

(A)    Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0,   n → ∞,

for every ε > 0.

This theorem implies, in particular, the traditional statement of the central limit theorem under the Lindeberg condition.

Theorem 2. Suppose that the Lindeberg condition is satisfied, that is, that for every ε > 0

(L)    Σ_{k=1}^{n} ∫_{|x|>ε} x² dF_{nk}(x) → 0,   n → ∞;

then S_n →ᵈ 𝒩(0, 1).


Before proving these theorems (notice that Theorem 2 is an easy corollary of Theorem 1) we discuss the significance of conditions (A) and (L).

Since

    max_{1≤k≤n} σ²_{nk} ≤ ε² + Σ_{k=1}^{n} E( ξ²_{nk} I(|ξ_{nk}| > ε) ),

it follows from the Lindeberg condition (L) that

    max_{1≤k≤n} σ²_{nk} → 0,   n → ∞.        (1)

From this it follows (by Chebyshev's inequality) that the random variables are asymptotically infinitesimal (negligible in the limit), that is, that, for every ε > 0,

    max_{1≤k≤n} P{ |ξ_{nk}| > ε } → 0,   n → ∞.        (2)

Consequently we may say that Theorem 2 provides a condition for the validity of the central limit theorem when the random variables that are summed are asymptotically infinitesimal. Limit theorems which depend on a condition of this type are known as theorems with a classical formulation.

It is easy to give examples which satisfy neither the Lindeberg condition nor the condition of being asymptotically infinitesimal, but where the central limit theorem nevertheless holds. Here is a particularly simple example. Let ξ₁, ξ₂, ... be a sequence of independent normally distributed random variables with Eξ_n = 0, Vξ₁ = 1, Vξ_k = 2^{k−2}, k ≥ 2. Put S_n = ξ_{n1} + ⋯ + ξ_{nn} with

    ξ_{nk} = ξ_k / ( Σ_{i=1}^{n} Vξ_i )^{1/2}.

It is easily verified (Problem 1) that neither the Lindeberg condition nor the condition of being asymptotically infinitesimal is satisfied, although the central limit theorem is evident since S_n is normally distributed with ES_n = 0, VS_n = 1.

Later (Theorem 3) we shall show that the Lindeberg condition (L) implies condition (A). Nevertheless we may say that Theorem 1 also covers "nonclassical" situations (in which no condition of being asymptotically infinitesimal is used). In this sense we say that (A) is an example of a nonclassical condition for the validity of the central limit theorem.

2. PROOF OF THEOREM 1. We shall not take the time to verify the necessity of (A), but we prove its sufficiency. Let

    f_{nk}(t) = ∫_{−∞}^{∞} e^{itx} dF_{nk}(x),   φ_{nk}(t) = ∫_{−∞}^{∞} e^{itx} dΦ_{nk}(x),
    f_n(t) = Π_{k=1}^{n} f_{nk}(t),   φ(t) = e^{−t²/2}.

It follows from §12, Chapter II, that φ_{nk}(t) = e^{−σ²_{nk}t²/2}, and therefore, since Σ_{k=1}^{n} σ²_{nk} = 1,

    φ(t) = Π_{k=1}^{n} φ_{nk}(t).

By the corollary to Theorem 1 of §3, we have S_n →ᵈ 𝒩(0, 1) if and only if f_n(t) → φ(t), n → ∞, for all real t. We have

    f_n(t) − φ(t) = Π_{k=1}^{n} f_{nk}(t) − Π_{k=1}^{n} φ_{nk}(t).

Since |f_{nk}(t)| ≤ 1 and |φ_{nk}(t)| ≤ 1, we obtain

    |f_n(t) − φ(t)| = | Π_{k=1}^{n} f_{nk}(t) − Π_{k=1}^{n} φ_{nk}(t) |
        ≤ Σ_{k=1}^{n} |f_{nk}(t) − φ_{nk}(t)|
        = Σ_{k=1}^{n} | ∫_{−∞}^{∞} e^{itx} d(F_{nk} − Φ_{nk}) |
        = Σ_{k=1}^{n} | ∫_{−∞}^{∞} ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk}) |,        (3)

where we have used the equations

    ∫_{−∞}^{∞} xⁱ d(F_{nk} − Φ_{nk}) = 0,   i = 1, 2.

Let us integrate by parts (Theorem 11 of Chapter II, §6, and Remark 2 following it) in the integral

    ∫_a^b ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk}),

and then let a → −∞ and b → ∞. Since

    x² |F_{nk}(x) − Φ_{nk}(x)| → 0,   x → ±∞,

we obtain

    ∫_{−∞}^{∞} ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk})
        = −it ∫_{−∞}^{∞} ( e^{itx} − 1 − itx )( F_{nk}(x) − Φ_{nk}(x) ) dx.        (4)

From (3) and (4),

    |f_n(t) − φ(t)| ≤ Σ_{k=1}^{n} |t| ∫_{−∞}^{∞} |e^{itx} − 1 − itx| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ ½|t|³ε Σ_{k=1}^{n} ∫_{|x|≤ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx
            + 2t² Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ |t|³ε + 2t² Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx,        (5)

where we have used the estimates |e^{itx} − 1 − itx| ≤ ½t²x² ≤ ½t²ε|x| for |x| ≤ ε and |e^{itx} − 1 − itx| ≤ 2|t||x| for |x| > ε, together with the inequality

    ∫_{|x|≤ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx ≤ 2σ²_{nk},        (6)

which is easily deduced from (71), §6, Chapter II. Since ε > 0 is arbitrary, it follows from (5) and (A) that f_n(t) → φ(t) as n → ∞, for all t ∈ R. This completes the proof of the theorem.

3. The following theorem gives the connections between (A) and (L).

Theorem 3. (1) (L) ⇒ (A). (2) If max_{1≤k≤n} Eξ²_{nk} → 0, n → ∞, then (A) ⇒ (L).

PROOF. (1) We noticed above that (L) implies that max_{1≤k≤n} σ²_{nk} → 0. Consequently, since Σ_{k=1}^{n} σ²_{nk} = 1, we obtain

    Σ_{k=1}^{n} ∫_{|x|>ε} x² dΦ_{nk}(x) ≤ ∫ x² dΦ(x) → 0,   n → ∞,        (7)

where the integration on the right is over |x| > ε/√(max_{1≤k≤n} σ²_{nk}). This, together with (L), shows that

    Σ_{k=1}^{n} ∫_{|x|>ε} x² d[ F_{nk}(x) + Φ_{nk}(x) ] → 0,   n → ∞,        (8)

for every ε > 0. Now fix ε > 0 and let h = h(x) be a continuously differentiable even function such that |h(x)| ≤ x², h'(x) sgn x ≥ 0, h(x) = x² for |x| > 2ε, h(x) = 0 for |x| ≤ ε, and |h'(x)| ≤ 4|x| for ε < |x| ≤ 2ε. Then, by (8),

    Σ_{k=1}^{n} ∫_{|x|>ε} h(x) d[ F_{nk}(x) + Φ_{nk}(x) ] → 0,   n → ∞.

By integration by parts we then find that, as n → ∞,

    Σ_{k=1}^{n} ∫_{x>ε} h'(x)[ (1 − F_{nk}(x)) + (1 − Φ_{nk}(x)) ] dx → 0,
    Σ_{k=1}^{n} ∫_{x<−ε} |h'(x)| [ F_{nk}(x) + Φ_{nk}(x) ] dx → 0.

Since h'(x) = 2x for |x| ≥ 2ε, and since |F_{nk}(x) − Φ_{nk}(x)| ≤ (1 − F_{nk}(x)) + (1 − Φ_{nk}(x)) for x > 0 while |F_{nk}(x) − Φ_{nk}(x)| ≤ F_{nk}(x) + Φ_{nk}(x) for x < 0, we therefore have

    Σ_{k=1}^{n} ∫_{|x|≥2ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0,   n → ∞.

Consequently, since ε > 0 is arbitrary, we obtain (L) ⇒ (A).

(2) For the function h = h(x) introduced above, we obtain by (7) and the condition max_{1≤k≤n} σ²_{nk} → 0 that

    Σ_{k=1}^{n} ∫_{|x|>ε} h(x) dΦ_{nk}(x) → 0,   n → ∞.        (9)

Again integrating by parts, we find that when (A) is satisfied,

    | Σ_{k=1}^{n} ∫_{|x|≥ε} h(x) d[F_{nk} − Φ_{nk}] |
        ≤ | Σ_{k=1}^{n} ∫_{x≥ε} h(x) d[ (1 − F_{nk}) − (1 − Φ_{nk}) ] |
            + | Σ_{k=1}^{n} ∫_{x≤−ε} h(x) d[F_{nk} − Φ_{nk}] |
        ≤ Σ_{k=1}^{n} ∫_{|x|≥ε} |h'(x)| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ 4 Σ_{k=1}^{n} ∫_{|x|≥ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0.        (10)

It follows from (9) and (10) that

    Σ_{k=1}^{n} ∫_{|x|≥2ε} x² dF_{nk}(x) ≤ Σ_{k=1}^{n} ∫_{|x|≥ε} h(x) dF_{nk}(x) → 0,   n → ∞;

that is, (L) is satisfied.

4. PROOF OF THEOREM 2. According to Theorem 3, condition (L) implies (A); hence Theorem 2 follows at once from Theorem 1.
§4. Central Limit Theorem for Sums of Independent Random Variables

331

5. We mention some corollaries in which ~ 1 , ~ 2 , .•• is a sequence of independent random variables with finite second moments. Let mk = E~k> af = v~k > o,sn = ~1 + ... + ~n• = Lk=1 af,andletFk = Fk(x)bethe distribution function of ~k.

v;

Corollary 1. Let the Lindeberg condition be satisfied: for every 8 > 0,

n --+ oo.

(11)

Then (12)

Corollary 2. Let the Lyapunov condition be satisfied: 1

~

v;+a k';;1E lsk- mki f:

2H

--+

0,

n-+ oo.

(13)

Then the central limit theorem (12) holds.

It is enough to prove that the Lyapunov condition implies the Lindeberg condition. Let 8 > 0; then

El~k- mki 2 H = f_

00

2: 2:

00

lx- mki 2 H dFk(x)

I

J{x:lx-mki~
8°V~ I

lx- mki 2 H dFk(x)

J{x:lx-mki~
(x - mk) 2 dFix)

and therefore

Corollary 3. Let

~ 1 , ~ 2 , ... be independent identically distributed random variables with m = E~ 1 and 0 < a 2 = V~ 1 < oo. Then

332

III. Convergence of Probability Measures. Central Limit Theorem

Therefore the Lindeberg condition is satisfied and consequently Theorem 3 of §3 follows from Theorem 2.

Corollary 4. Let n 2 1,

ee 1,

be independent random variables such that,for all

2 , .•.

lenl:::; K, where K is a constant and V,.-+ oo as n-+ oo. Then by Chebyshev's inequality

f

{x: lx-mkl ~eVn

lx- mkl 2 dFk(x)

= E[(ek-

mk)2 I!ek- mk! 2 sV,.)]

:::; (2K) 2 P{Iek- mkl 2 sV,.}:::; (2k) 2

(]2 2

S

Vk 2 -+ 0, n

n-+ oo.

Consequently the Lindeberg condition is again applicable, and the central limit theorem holds. 6. We remarked (without proof) in Theorem 1 that condition (A) is also necessary. The following (Lindeberg-Feller) theorem shows that, with the supplementary condition max 1 s ks n Ee;k -+ 0, the Lindeberg condition (L) is also necessary. Theorem 4. Let max 1 sksn Ee;k-+ 0, n -+ oo. Then (L) is necessary and suf .ficientfor the validity of the central limit theorem: Sn ~ %(0, 1). The proof is based on the following proposition, which estimates the "tails~· of the variance in terms of the behavior of the characteristic function at the origin (compare Lemma 3, §3, Chapter III).

Lemma. Let e be a random variable with distribution function F = F(x), Ee = 0, ve = a 2 < oo. Thenforeacha > 0

(

Jlxl~1/a

x 2 dF(x) :::; _.;. [Re J(jUa) - 1 + 3a 2a2 ]. a

PRooF. We have Re f(t) - 1

+ !a2 t 2

=

!a 2 t 2

-

=

!a 2 t 2

-

-

Taking t

=

(

Jlxl:51/a

(

Jlxl>1/a

2 !a 2 t 2 =

J:oo [1 -

(!t 2

-

-

(14)

cos tx] dF(x)

[1 - cos tx] dF(x)

[1 - cos tx] dF(x)

!t 2 2a 2 )

(

Jlxl:->1/a (

Jlxl>1/a

x 2 dF(x) - 2a 2 x 2 dF(x).

aJU, we obtain (14), as required.

(

Jlxl>1/a

x 2 dF(x)

333

§4. Central Limit Theorem for Sums of Independent Random Variables

PROOF OF THEOREM 4. The sufficiency was established in Theorem 2. We now prove the necessity. Let

0,

E~nk =

max a;k

--+

n--+ oo.

0,

(15)

1 o<;ko<;n

Since

n fnk(t)--+ n

(16)

e-(1/2)t2,

k=1

we can find, for a given t, a number no = no(t) such that ~ n0 (t) and consequently

n

nfnk(t) n

In

nk=

1

J,k(t) > 0 for

n

L In J,k(t),

=

k= 1

k= 1

where the logarithms are well defined. Then since

Ifnk( t) - 11 S a;k t 2 , we have, by (15),

Ikt1{In [1 + Unk(t) -

1)] - Unk(t) - 1]}

I

n

S Llfnit)-11 2 k=1

t4

s - · 4

max

n

L u~k k=1

u;k ·

1o<;kos;n

t4

2 0, -- -4 max ank--+

n--+ oo.

1o<;ko<;n

Consequently, by using (16), we have n

Re

L Unk(t) -

k=1

n

1]

+ !t 2 = L [Re fnk(t) k=1

- 1

+ !t 2 a;k] --+ 0.

In particular, if we take t = a~ we find that n

I

k= 1

[Re fnk(a~) - 1

+ 3a 2a;k]

and therefore, by (14), and for every

1::

--+

0,

n--+ oo,

= 1/a > 0,

kt1~xi;,;ex 2 dFnk(x) s B2kt1[Re f,k(a~) -

2

1 + 3a 2a ]--+

which shows that the Lindeberg condition is satisfied.

00,

n--+ oo,

334

III. Convergence of Probability Measures. Central Limit Theorem

7. The method that we used for Theorem 1 can be used to obtain a corresponding condition for the central limit theorem without assuming that the second moments are finite. For each n ~ 1 let

be independent random variables with

E~nk =

0,

Let g = g(x) be a bounded nonnegative even function with the following properties: g(x) = x 2 for lxl ~ 1, minlxl<:l g(x) > 0, lg'(x)l ~canst. Define f:.nk(g) by the equation

J oo g(x) dFnk(x) = Joo g(x) d
-

00

00

Theorem 5. Let

and for each

£

> 0 let

Then

The proof is left to the reader (Problem 4); it can be carried out along the same lines as the proof of Theorem 1.

8. PROBLEMS

1. Let ξ₁, ξ₂, ... be a sequence of independent normally distributed random variables with Eξ_k = 0, k ≥ 1, and Vξ₁ = 1, Vξ_k = 2^{k−2}, k ≥ 2. Show that {ξ_{nk}} does not satisfy the Lindeberg condition and also is not asymptotically infinitesimal.

2. Prove (4).

3. Let ξ₁, ξ₂, ... be a sequence of independent identically distributed random variables with Eξ₁ = 0, Eξ₁² = 1. Show that ...

4. Prove Theorem 5. (Hint: use the method of the proof of Theorem 1, applying integration by parts twice in the integral ∫ (e^{itx} − itx + ½t²x²) d(F_{nk} − Φ_{nk}).)
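Problem 1 can be made concrete with a few lines of arithmetic: for the array ξ_{nk} = ξ_k/(Σ_{i=1}^{n} Vξ_i)^{1/2} with Vξ₁ = 1 and Vξ_k = 2^{k−2}, k ≥ 2, the normalized variances σ²_{nk} sum to 1 while the largest of them stays at 1/2 for every n ≥ 2, so max_k σ²_{nk} ↛ 0 and (the ξ_k being Gaussian) the summands are not asymptotically infinitesimal. A quick check, as our own sketch:

```python
# Variances of the example in Problem 1: V xi_1 = 1, V xi_k = 2^(k-2), k >= 2.
def variances(n):
    return [1.0] + [2.0 ** (k - 2) for k in range(2, n + 1)]

def normalized_variances(n):
    # sigma_{nk}^2 = V xi_k / sum_i V xi_i; the total equals 2^(n-1)
    v = variances(n)
    total = sum(v)
    return [vk / total for vk in v]

for n in (5, 10, 30):
    s = normalized_variances(n)
    assert abs(sum(s) - 1.0) < 1e-12      # row variances sum to 1
    assert abs(max(s) - 0.5) < 1e-12      # the last summand keeps half the variance
print("max_k sigma_{nk}^2 = 1/2 for every n >= 2: no asymptotic negligibility")
```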


§5. Infinitely Divisible and Stable Distributions

1. In stating Poisson's theorem in §3 we found it necessary to use a triangular array, supposing that for each n ≥ 1 there was a sequence of independent random variables {ξ_{n,k}}, 1 ≤ k ≤ n. Put

    T_n = ξ_{n,1} + ⋯ + ξ_{n,n},   n ≥ 1.        (1)

The idea of an infinitely divisible distribution arises in the following problem: how can we determine all the distributions that can be expressed as limits of sequences of distributions of random variables T_n, n ≥ 1?

Generally speaking, the problem of limit distributions is indeterminate in such great generality. Indeed, if ξ is a random variable and ξ_{n,1} = ξ, ξ_{n,k} = 0, 1 < k ≤ n, then T_n = ξ, and consequently the limit distribution is the distribution of ξ, which can be arbitrary. In order to have a more meaningful problem, we shall suppose in the present section that the variables ξ_{n,1}, ..., ξ_{n,n} are, for each n ≥ 1, not only independent, but also identically distributed.

Recall that this was the situation in Poisson's theorem (Theorem 4 of §3). The same framework also includes the central limit theorem (Theorem 3 of §3) for sums S_n = ξ₁ + ⋯ + ξ_n, n ≥ 1, of independent identically distributed random variables ξ₁, ξ₂, .... In fact, if we put

    ξ_{n,k} = (ξ_k − Eξ_k)/√(VS_n),

then

    T_n = Σ_{k=1}^{n} ξ_{n,k} = (S_n − ES_n)/√(VS_n).

Consequently both the normal and the Poisson distributions can be presented as limits in a triangular array. If T_n → T, it is intuitively clear that since T_n is a sum of independent identically distributed random variables, the limit variable T must also be a sum of independent identically distributed random variables. With this in mind, we introduce the following definition.

Definition 1. A random variable T, its distribution F_T, and its characteristic function φ_T are said to be infinitely divisible if, for each n ≥ 1, there are independent identically distributed random variables η₁, ..., η_n such that† T ᵈ= η₁ + ⋯ + η_n (or, equivalently, F_T = F_{η₁} * ⋯ * F_{η_n}, or φ_T = (φ_{η₁})ⁿ).

Theorem 1. A random variable T can be a limit of sums T_n = Σ_{k=1}^{n} ξ_{n,k} if and only if T is infinitely divisible.

† The notation ξ ᵈ= η means that the random variables ξ and η agree in distribution, i.e. F_ξ(x) = F_η(x), x ∈ R.

336

Ill. Convergence of Probability Measures. Central Limit Theorem

PROOF. If T is infinitely divisible, for each n 2:: 1 there are independent identically distributed random variables ~n.t• ..• , ~n.k such that T 4. ~n. 1 + · · · + ~n.k• and this means that T 4. T,, n 2:: 1. Conversely, let T, ~ T. Let us show that T is infinitely divisible, i.e. for each k there are independent identically distributed random variables 1/t, ••• , 1'/k such that T 4. 17 1 + · · · + 1'/k· Choose a k 2:: 1 and represent T,k in the form C~l) + · · · + C~k>, where Y(l) _

'>n

-

J! ':.nk,1

+ ... + '>nk,n•·"•'>n J! y(k)

_

-

J! '>nk,n(k-1)+1

+ ... + '>nk,nk· ;;

Since T_{nk} →d T, n → ∞, the sequence of distribution functions corresponding to the random variables T_{nk}, n ≥ 1, is relatively compact and therefore, by Prohorov's theorem, is tight. Moreover,

[P(ζ_n^(1) > z)]^k = P(ζ_n^(1) > z, ..., ζ_n^(k) > z) ≤ P(T_{nk} > kz)

and

[P(ζ_n^(1) < −z)]^k = P(ζ_n^(1) < −z, ..., ζ_n^(k) < −z) ≤ P(T_{nk} < −kz).

The family of distributions of ζ_n^(1), n ≥ 1, is tight because of the preceding two inequalities and because the family of distributions of T_{nk}, n ≥ 1, is tight. Therefore there are a subsequence {n_i} ⊆ {n} and a random variable η_1 such that ζ_{n_i}^(1) →d η_1 as n_i → ∞. Since the variables ζ_n^(1), ..., ζ_n^(k) are identically distributed, we have ζ_{n_i}^(2) →d η_2, ..., ζ_{n_i}^(k) →d η_k, where η_1 =d η_2 =d ··· =d η_k. Since ζ_n^(1), ..., ζ_n^(k) are independent, it follows from the corollary to Theorem 1 of §3 that η_1, ..., η_k are independent and

T_{n_i k} = ζ_{n_i}^(1) + ··· + ζ_{n_i}^(k) →d η_1 + ··· + η_k.

But T_{n_i k} →d T; therefore (Problem 1) T =d η_1 + ··· + η_k. This completes the proof of the theorem.
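Definition 1 can be illustrated numerically (this sketch is not part of the original text): the Poisson distribution with parameter λ is, for every n, the n-fold convolution of Poisson(λ/n) laws. The parameter values below are arbitrary choices.

```python
import math

def poisson_pmf(lam, kmax):
    # P(X = k) = e^{-lam} lam^k / k!  for k = 0..kmax
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)]

def convolve(p, q):
    # law of the sum of two independent nonnegative integer-valued variables
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

lam, n, kmax = 3.0, 6, 20
summand = poisson_pmf(lam / n, kmax)      # law of one eta_i
total = summand
for _ in range(n - 1):                    # n-fold convolution
    total = convolve(total, summand)

target = poisson_pmf(lam, kmax)
err = max(abs(total[k] - target[k]) for k in range(kmax + 1))
print(err)  # should be at the level of floating-point rounding
```

For k ≤ kmax the truncation is exact, so the two probability vectors agree up to rounding error, confirming T =d η_1 + ··· + η_n for the Poisson case.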

Remark. The conclusion of the theorem remains valid if we replace the hypothesis that ξ_{n,1}, ..., ξ_{n,n} are identically distributed for each n ≥ 1 by the hypothesis that they are uniformly asymptotically infinitesimal (§4.2).

2. To test whether a given random variable T is infinitely divisible, it is simplest to begin with its characteristic function φ(t). If we can find characteristic functions φ_n(t) such that φ(t) = [φ_n(t)]^n for every n ≥ 1, then T is infinitely divisible. In the Gaussian case,

φ(t) = exp{itβ − (t²σ²)/2},

and if we put

φ_n(t) = exp{itβ/n − (t²σ²)/(2n)},

we see at once that φ(t) = [φ_n(t)]^n.


§5. Infinitely Divisible and Stable Distributions

In the Poisson case,

φ(t) = exp{λ(e^{it} − 1)},   φ_n(t) = exp{(λ/n)(e^{it} − 1)},

and again φ(t) = [φ_n(t)]^n. If T has a gamma-distribution with density

f(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α) for x > 0, and f(x) = 0 for x < 0,

it is easy to show that its characteristic function is

φ(t) = 1/(1 − iβt)^α = [1/(1 − iβt)^{α/n}]^n,

and therefore T is infinitely divisible. We quote without proof the following result on the general form of the characteristic functions of infinitely divisible distributions.

Theorem 2 (Lévy–Khinchin Theorem). A random variable T is infinitely divisible if and only if its characteristic function has the form φ(t) = exp ψ(t), where

ψ(t) = itβ − (t²σ²)/2 + ∫_{−∞}^{∞} (e^{itx} − 1 − itx/(1 + x²)) · ((1 + x²)/x²) dλ(x),   (2)

β ∈ R, σ² ≥ 0, and λ is a finite measure on (R, B(R)) with λ{0} = 0.
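The gamma factorization used above can also be checked numerically; a minimal sketch (the parameter values are arbitrary choices, not from the text), using the fact that the principal branch is well defined since Re(1 − iβt) = 1 > 0:

```python
def gamma_cf(t, alpha, beta):
    # characteristic function of the gamma distribution with density
    # x^{alpha-1} e^{-x/beta} / (Gamma(alpha) beta^alpha), x > 0
    return (1 - 1j * beta * t) ** (-alpha)

alpha, beta, n = 2.7, 1.3, 8
for t in (-3.0, -0.5, 0.0, 1.0, 4.2):
    lhs = gamma_cf(t, alpha, beta)
    rhs = gamma_cf(t, alpha / n, beta) ** n   # n-th power of the "n-th root" cf
    assert abs(lhs - rhs) < 1e-12, (t, lhs, rhs)
print("phi(t) = [phi_n(t)]^n verified at the sample points")
```

So the gamma law passes the divisibility test φ = (φ_n)^n for every n, as claimed.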

3. Let ξ_1, ξ_2, ... be a sequence of independent identically distributed random variables and S_n = ξ_1 + ··· + ξ_n. Suppose that there are constants b_n and a_n > 0, and a random variable T, such that

(S_n − b_n)/a_n →d T.   (3)

We ask for a description of the distributions (of the random variables T) that can be obtained as limit distributions in (3).

If the independent identically distributed random variables ξ_1, ξ_2, ... satisfy 0 < σ² = Vξ_1 < ∞, then if we put b_n = nEξ_1 and a_n = σ√n, we find by §4 that T has the normal distribution N(0, 1). If f(x) = θ/(π(x² + θ²)) is the Cauchy density (with parameter θ > 0) and ξ_1, ξ_2, ... are independent random variables with density f(x), then the characteristic function of ξ_1 is e^{−θ|t|}, so that the characteristic function of S_n/n is [e^{−θ|t|/n}]^n = e^{−θ|t|}; thus with a_n = n and b_n = 0 the limit T in (3) again has a Cauchy distribution.

Consequently there are other limit distributions besides the normal: the Cauchy distribution, for example. If we put ξ_{n,k} = (ξ_k/a_n) − (b_n/(n a_n)), 1 ≤ k ≤ n, we find that

Σ_{k=1}^{n} ξ_{n,k} = (S_n − b_n)/a_n (= T_n).

Therefore all the distributions that can appear as limits in (3) are necessarily (in agreement with Theorem 1) infinitely divisible. However, the specific structure of the variables T_n = (S_n − b_n)/a_n may make it possible to obtain further information on the limit distributions that arise. For this reason we introduce the following definition.

Definition 2. A random variable T, its distribution function F(x), and its characteristic function φ(t) are stable if, for every n ≥ 1, there are constants a_n > 0, b_n, and independent random variables ξ_1, ..., ξ_n, each distributed like T, such that

a_n T + b_n =d ξ_1 + ··· + ξ_n   (4)

or, equivalently, F((x − b_n)/a_n) = (F * ··· * F)(x) (n-fold convolution), or

φ(a_n t) e^{itb_n} = [φ(t)]^n.   (5)
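The Cauchy case of Definition 2 (with a_n = n, b_n = 0) can be seen in a Monte-Carlo sketch, not taken from the text: the mean of n standard Cauchy variables is again standard Cauchy, so its quartiles stay at ±1 instead of concentrating. Sample sizes and tolerances here are ad hoc choices.

```python
import math
import random

random.seed(1)

def std_cauchy():
    # inverse-transform sampling: tan(pi (U - 1/2)) is standard Cauchy
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 50, 20000
means = sorted(sum(std_cauchy() for _ in range(n)) / n for _ in range(trials))

# For a standard Cauchy the quartiles are -1 and +1; the sample mean of n
# Cauchy variables reproduces them (there is no law of large numbers here).
q1 = means[trials // 4]
q3 = means[3 * trials // 4]
print(round(q1, 2), round(q3, 2))
```

The empirical quartiles of the mean stay near −1 and +1 for every n, as stability requires.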

Theorem 3. A necessary and sufficient condition for the random variable T to be a limit in distribution of random variables (S_n − b_n)/a_n, a_n > 0, is that T is stable.

PROOF. If T is stable, then by (4)

T =d (S_n − b_n)/a_n,

where S_n = ξ_1 + ··· + ξ_n, and consequently (S_n − b_n)/a_n →d T.

Conversely, let ξ_1, ξ_2, ... be a sequence of independent identically distributed random variables, S_n = ξ_1 + ··· + ξ_n and (S_n − b_n)/a_n →d T, a_n > 0. Let us show that T is a stable random variable. If T is degenerate, it is evidently stable. Let us suppose that T is nondegenerate. Choose k ≥ 1 and write

S_n^(1) = ξ_1 + ··· + ξ_n, ..., S_n^(k) = ξ_{(k−1)n+1} + ··· + ξ_{kn};
T_n^(1) = (S_n^(1) − b_n)/a_n, ..., T_n^(k) = (S_n^(k) − b_n)/a_n.

It is clear that all the variables T_n^(1), ..., T_n^(k) have the same distribution and

T_n^(i) →d T, n → ∞, i = 1, ..., k.


Write U_n^(k) = T_n^(1) + ··· + T_n^(k). Then

U_n^(k) →d T^(1) + ··· + T^(k),

where T^(1), ..., T^(k) are independent and T^(1) =d ··· =d T^(k) =d T. On the other hand,

U_n^(k) = (ξ_1 + ··· + ξ_{kn} − k b_n)/a_n = α_n^(k) V_{kn} + β_n^(k),   (6)

where

V_{kn} = (S_{kn} − b_{kn})/a_{kn}

and

α_n^(k) = a_{kn}/a_n, β_n^(k) = (b_{kn} − k b_n)/a_n.

It is clear from (6) that

V_{kn} = (U_n^(k) − β_n^(k))/α_n^(k),

where V_{kn} →d T, U_n^(k) →d T^(1) + ··· + T^(k), n → ∞. It follows from the lemma established below that there are constants α > 0 and β such that α_n^(k) → α and β_n^(k) → β as n → ∞. Therefore

T =d (T^(1) + ··· + T^(k) − β)/α,

which shows that T is a stable random variable. This completes the proof of the theorem. We now state and prove the lemma that we used above.

Lemma. Let ξ_n →d ξ and let there be constants a_n > 0 and b_n such that

a_n ξ_n + b_n →d ξ̃,

where the random variables ξ and ξ̃ are not degenerate. Then there are constants a > 0 and b such that lim a_n = a, lim b_n = b, and ξ̃ =d aξ + b.


PROOF. Let φ_n, φ and φ̃ be the characteristic functions of ξ_n, ξ and ξ̃, respectively. Then the characteristic function of a_n ξ_n + b_n is equal to e^{itb_n} φ_n(a_n t) and, by Theorem 1 and Problem 3 of §3,

e^{itb_n} φ_n(a_n t) → φ̃(t),   (7)

φ_n(t) → φ(t),   (8)

uniformly in t on every finite interval.

Let {n_i} be a subsequence of {n} such that a_{n_i} → a. Let us first show that a < ∞. Suppose that a = ∞. By (7),

sup_{|t|≤c} | |φ_n(a_n t)| − |φ̃(t)| | → 0, n → ∞,

for every c > 0. We replace t by t_0/a_{n_i}. Then, since a_{n_i} → ∞, we have

| |φ_{n_i}(t_0)| − |φ̃(t_0/a_{n_i})| | → 0,

and therefore

|φ_{n_i}(t_0)| → |φ̃(0)| = 1.

But |φ_{n_i}(t_0)| → |φ(t_0)|. Therefore |φ(t_0)| = 1 for every t_0 ∈ R, and consequently, by Theorem 5, §12, Chapter II, the random variable ξ must be degenerate, which contradicts the hypotheses of the lemma. Thus a < ∞.

Now suppose that there are two subsequences {n_i} and {n_i′} such that a_{n_i} → a, a_{n_i′} → a′, where a ≠ a′; suppose for definiteness that 0 ≤ a′ < a. Then by (7) and (8),

|φ_{n_i}(a_{n_i} t)| → |φ(at)|, |φ_{n_i}(a_{n_i} t)| → |φ̃(t)|,

and

|φ_{n_i′}(a_{n_i′} t)| → |φ(a′t)|, |φ_{n_i′}(a_{n_i′} t)| → |φ̃(t)|.

Consequently

|φ(at)| = |φ(a′t)|,

and therefore, for all t ∈ R,

|φ(t)| = |φ((a′/a)t)| = ··· = |φ((a′/a)^n t)| → |φ(0)| = 1, n → ∞.

Therefore |φ(t)| = 1 and, by Theorem 5 of §12, Chapter II, it follows that ξ is a degenerate random variable. This contradiction shows that a = a′ and therefore that there is a finite limit lim a_n = a, with a ≥ 0.

Let us now show that there is a limit lim b_n = b, and that a > 0. Since (8) is satisfied uniformly on each finite interval, we have

φ_n(a_n t) → φ(at),


and therefore, by (7), the limit lim_{n→∞} e^{itb_n} exists for all t such that φ(at) ≠ 0. Let δ > 0 be such that φ(at) ≠ 0 for all |t| < δ. For such t, lim e^{itb_n} exists. Hence we can deduce (Problem 9) that lim |b_n| < ∞. Let there be two sequences {n_i} and {n_i′} such that lim b_{n_i} = b and lim b_{n_i′} = b′. Then e^{itb} = e^{itb′} for |t| < δ, and consequently b = b′. Thus there is a finite limit b = lim b_n and, by (7),

φ̃(t) = e^{itb} φ(at),

which means that ξ̃ =d aξ + b. Since ξ̃ is not degenerate, we have a > 0. This completes the proof of the lemma.

4. We quote without proof a theorem on the general form of the characteristic functions of stable distributions.

Theorem 4 (Lévy–Khinchin Representation). A random variable T is stable if and only if its characteristic function φ(t) has the form φ(t) = exp ψ(t),

ψ(t) = itβ − d|t|^α (1 + iθ (t/|t|) G(t, α)),   (11)

where 0 < α ≤ 2, |θ| ≤ 1, β ∈ R, d ≥ 0; here G(t, α) = tan(πα/2) if α ≠ 1, and G(t, α) = −(2/π) ln|t| if α = 1.

5. PROBLEMS

1. Show that ξ =d η if ξ_n →d ξ and ξ_n →d η.

2. Show that if φ_1 and φ_2 are infinitely divisible characteristic functions, so is φ_1 · φ_2.

3. Let φ_n be infinitely divisible characteristic functions and let φ_n(t) → φ(t) for every t ∈ R, where φ(t) is a characteristic function. Show that φ(t) is infinitely divisible.

4. Show that the characteristic function of an infinitely divisible distribution cannot take the value 0.

5. Give an example of a random variable that is infinitely divisible but not stable.

6. Show that a stable random variable ξ always satisfies the inequality E|ξ|^r < ∞ for all r ∈ (0, α).


7. Show that if ξ is a stable random variable with parameter 0 < α ≤ 1, then φ(t) is not differentiable at t = 0.

8. Prove that e^{−d|t|^α} is a characteristic function provided that d ≥ 0, 0 < α ≤ 2.

9. Let (b_n)_{n≥1} be a sequence of numbers such that lim_n e^{itb_n} exists for all |t| < δ, δ > 0. Show that lim |b_n| < ∞.

§6. Rapidity of Convergence in the Central Limit Theorem

1. Let ξ_{n1}, ..., ξ_{nn} be a sequence of independent random variables, S_n = ξ_{n1} + ··· + ξ_{nn}, F_n(x) = P(S_n ≤ x). If S_n →d N(0, 1), then F_n(x) → Φ(x) for every x ∈ R. Since Φ(x) is continuous, the convergence here is actually uniform (Problem 5 in §1):

sup_x |F_n(x) − Φ(x)| → 0, n → ∞.   (1)

In particular, it follows that

P(S_n ≤ x) − Φ((x − ES_n)/√(VS_n)) → 0, n → ∞

(under the assumption that ES_n and VS_n exist and are finite).

It is natural to ask how rapid the convergence in (1) is. We shall establish a result for the case when

S_n = (ξ_1 + ··· + ξ_n)/(σ√n), n ≥ 1,

where ξ_1, ξ_2, ... is a sequence of independent identically distributed random variables with Eξ_k = 0, Vξ_k = σ² and E|ξ_1|³ < ∞.

Theorem (Berry and Esseen). We have the bound

sup_x |F_n(x) − Φ(x)| ≤ C E|ξ_1|³/(σ³√n),   (2)

where C is an absolute constant ((2π)^{−1/2} ≤ C < 0.8).

PROOF. For simplicity, let σ² = 1 and β_3 = E|ξ_1|³. By Esseen's inequality (Subsection 10, §12, Chapter II)

sup_x |F_n(x) − Φ(x)| ≤ (2/π) ∫_0^T |(f_n(t) − φ(t))/t| dt + (24/(πT)) · (1/√(2π)),   (3)


where qJ(t) = e- 1212 and

fit) = [f(tjjn)]", with f(t) = Ee; 1 ~'. In (3) we may take T arbitrarily. Let us choose

jn/(5/33).

T =

We are going to show that for this T,

I fit)- ({J(t)l::; ~

}n ltl

3 e- 1214 ,

ltl ::; T.

(4)

The required estimate (2), with C an absolute constant, will follow immediately from (3) by means of (4). (A more detailed analysis shows that c < 0.8.) We now turn to the proof of (4). By Taylor's formula with integral remainder, t f(t) = 1 + 1 ! f'(O)

t3 Jl (1 -

t2

+ 2 ! f"(O) + 2

0

v) 2f"'(vt) dv.

(5)

Hence

where 181 ::; 1. If It I ::; Jn/(5/3 3 ), we have, since t2 2n

/3 3 z

a 3 = 1.

lt 3 1/33

+

6n 312

:$

1 25 ·

Consequently

f(t/Jn) for

It I ::;

T

z ~~

= Jn/(5/3 3 ), and hence we may write [f(t/jn)]" =

(6)

enlnnf(t/,fti).

By (5) (with f(t) replaced by In f(tjjll)) we find that t2

In f(t/Jn) = - 2n here

8t 3

+ 6n 312

18 1 1::; 1 and IOn f)/Ill = 1[f'" · ! 2

-

3f" ·

r ·J + 2U'n · f-

::; (/33 + 3PzPt + 2PD<~~)- 3 where

Pk =

E I~ 1 1\ k = 1, 2, 3.

(7)

(In f) 111(8 1 t/jn);

::;

7/3 3 ,

3

1 (8)

344

III. Convergence of Probability Measures. Central Limit Theorem

Using the inequality le•- 11 :s; lzlel•l, we can now show, for I tl :s; T = Jn/(5P 3 ), that

l[f(t/Jn)J" _ e-t2 /21 = le"In/(t/Jiil _

<

e-t2f2l

2P31tl3 exp{- t2 + 21tl3 ~}

-6

Jn

2

6

Jn

< 2P31 t 13e-t2f4.

-6

Jn

This completes the proof of the theorem.

Remark. We observe that unless we make some supplementary hypothesis about the behavior of the random variables that are added, (2) cannot be improved. In fact, let ~ 1 , ~ 2 , ••• be independent identically distributed random variables with

It is evident by symmetry that

and hence by Stirling's formula

= l.Cn . 2-2n""' _1_ = 2

2fo

2"

1

j(2n) · (2n).

It follows, in particular, that the constant C in (2) cannot be less than (2n) - 1 12 •

2.

PROBLEMS

1. Prove (8).

2.

be independent identically distributed random variables with E~k = 0, Ele 1 l3 < oo. It is known [53] that the following nonuniform inequality holds: for all x E R,

Let~~> ~ 2 , •••

vek

= u 2and

IFn(x)- ~x)l :::;;

CEI~ 1 1 3

1

u\fo . (1 + lxl)3.

Prove this, at least for Bernoulli random variables.

345

§7. Rapidity of Convergence in Poisson's Theorem

§7. Rapidity of Convergence in Poisson's Theorem 1. Let 1] 1, 1] 2 , .•• , 1Jn be independent Bernoulli random variables, taking the values 1 and 0 with probabilities

P(1Jk = 1) = Pk>

P(1Jk

= 0) =

1~ k

1 - Pk>

~

n.

Write sn

= 111 + ... + 1Jn,

Pk = P(S = k),

= 0, 1, ... ; A. > 0.

k

In §6 of Chapter I we observe that when p 1 = · · · = P. = p and A. = np we have Prohorov's inequality,

L IPk- nkl ~ C1(A.)p, 00

k=O

where C 1 (A.) = 2 min(2, A.). When the Pk are not necessarily equal, but that

Lk=

1

Pk

= A., LeCam showed

L IPk- nkl ~ Cz(A.) max Pk> 00

k=O

1:5k:5n

where Cz(A.) = 2 min(9, A.). The object of the present section is to prove the following theorem, which provides an estimate of the approximation of the P k by the nk, specifically not including the assumption that 1 Pk = A.. Although the proof does not produce very good constants (see C(A.) below), it may be of interest because of its simplicity.

Li:=

Theorem. (1) We have the following inequality: (1)

where min is taken over all permutations i = (i 1, i2 , ••• , in) of(l, 2, ... , n), P;0 = 0, and [ns] is the integral part of ns. (2) IfLi:= 1 Pk = A., then

k~0 1Pk -nkl ~ C(A.)~in 0~~~~~k~/;k- A.sl ~

C(A.) max pk,

where

C(A.) = (2

+ 4A.)eu.

(2)

346

III. Convergence of Probability Measures. Central Limit Theorem

2. The key to the proof is the following lemma.

Lemma 1. Let S(t)

= Ll"2o 1'fk, where 1'fo = 0, 0 ~ t

Pk(t) = P(S(t) = k),

Then for every tin 0

~

t

~

(A.t)ke-.
nk(t) =

k!

~ 1,

k = 0, 1, ....

,

1,

k~o IPk(t)- nk(t)l ~ eur(2 + 4~~!k) ~~~srI ~~k- A.s



(3)

PRooF. Let us introduce the variables Xk(t) = J(S(t) = k),

where I(A) is the indicator of A. For each sample point, the function S(t), 0 ~ t ~ 1, is a generalized distribution function, for which, consequently, the Lebesgue-Stieltjes integral

J~xk(s-) dS(s) is defined; it is in fact just the sum

(j _ 1)

[nt]

LXk - - 'lj·

n

i= 1

A simple argument shows that Xk(t), k ;;::: 0, satisfy X 0 (t)

=

1-

Xk(t) = -

{x

i

dS(s),

f~[Xk(s-)- Xk_ 1(s-)] dS(s),

where X 0 (0) = 1 and Xk(O) Now EXk(t) = Pk(t) and

E

0 (s-)

k

~

1,

= 0 fork~ 1. [nt]

t

Xk(s-) dS(s) = E Lxk o i= 1

= =

[nt]

1) n (j _1)E17j = ~ 17;

('

LEXk - n

i= 1

{Pk(s-) dA(s),

where [nt]

A(t) =

L Pk

k=O

( =ES(t)).

[nt]

(j _ 1)

Lpk --Pi

i= 1

n

(4)

347

§7. Rapidity of Convergence in Poisson's Theorem

Hence if we average the left-hand and right-hand sides of(4) we find that P 0 (t)

= 1 - J:P 0 (s-) dA(s), (5)

k

Pk(t) = - s:[Pk(s-)- Pk_ 1(s-)] dA(s),

In turn, it is easily verified that nk(t), k system n 0 (t)

~

1.

~

0, 0 :::;:; t :::;:; 1, satisfy the similar

= 1 - J:n 0 (s-) d(A.s),

nk(t) = -

1

k

[nis-) - nk-l (s- )] d(A.s),

~ 1.

Therefore n 0 (t)- P 0 (t)

= - {[n0 (s-)- P 0 (s-)] d(A.s)

+ {P0 (s-) d(A(s)-

(6)

A.s)

and nk(t)- Pk(t)

= - {[nk(s-)-

Pk(s- )] d(A.s)

+ {[nk- 1(s-)- Pk_ 1(s-)] d(A.s) + {[Pk(s-)-

Pk_ 1 (s- )] d[A(s)- A.s].

By the formula for integration by parts (namely "dUV = U dV see Theorem 11, §6, Chapter II) and by (5),

J~P 0 (s-) d(A(s) -

A.s) = (A(t)- A.t)P 0 (t)

+ J~(A(s)-

(7)

+ VdU";

A.s)P 0 (s-) dA(s),

(8) { [Pk(s-) - Pk-t (s-)] d(A(s) - A.s)

=

(A(t) - A.t)(Pk(t) - Pk- 1(t))

+ {[Pk(s-)-

2Pk_ 1(s-)

+ Pk_ 2 (s-)](A(s)-

where it is convenient to suppose that P _ 1 (s)

=0.

A.s)dA(s), (9)

348

III. Convergence of Probability Measures. Central Limit Theorem

From (6)-(9) we obtain

L

k~0 lnk(t)- Pk(t)l ~ 2 k~o lnk(s-)- Pk(s- )ld(A.s) + 21A(t)- A.tl + 4A(t)max IA(s)- A.sl O:;;;s:;;;t

L

~ 2 k~o lnk(s-)- Pk(s- )ld(A.s) +(2 + 4A(t)) max IA(s)- A.sl. O:;;;s:;;;t Hence, by Lemma 2, which will be proved in Subsection 3, max IA(s)- ..lsi, L IPk(t)- nk(t)l ~ e M(2 + 4A(t)) o:;;;s:;;;t k=O where we recall that A(t) = 21"2 Pk· 00

2

(10)

0

This completes the proof of Lemma 1.

3. PROOF OF THE THEOREM. Inequality (1) follows immediately from Lemma 1 if we observe that Pk = Pk(l), nk = nk(1), and that the probability Pk = P{1] 1 + · · · + 1Jn = k} is the same as P{1]; 1 + · · · + 1];" = k}, where (i 1 , i2 , ••• , in) is an arbitrary permutation of(1, 2, ... , n). Moreover, the first inequality in (2) and the estimate (3) follow from ( 1). We have only to show that

I L Pik O:;;;s:;;; I k=O

min sup i

[ns]

I

(11)

A.s ~ max Pk> 1 :;;;k:;;;n

where we may evidently suppose that A. = 1. We write F;(s) = Lk~o P;k, G(s) = s, 0 ~ s ~ 1. These are distribution functions: Fi(s) is a discrete distribution function with masses Pi 1 , Pi 2 , • • • , Pi" at the points 1/n, 2/n, ... , 1; and G(s) is a uniform distribution on [0, 1]. Our object is to find a permutation i* = (i!, ... , i:) such that sup IFi(s)- G(s)l ~ max Pk· Since sup IF;"(s)- G(s)l = max

!:;;;k:;;;n

O:;;;s:;;;l

I (~ Fi

n

-)

-~I· n

it is sufficient to study the deviations F;•(s-) - G(s) only at the points s = kjn, k = 1, ... , n. We observe that if all the Pk are equal (p 1 = · · · = Pn = 1/n), then

1 n

sup IF;(s)- G(s)l = - = max Pk· O:;;;s:;;;l

!:;;;k:;;;n

349

§7. Rapidity of Convergence in Poisson's Theorem

We shall accordingly suppose that at least one of p 1, ... , Pn is not equal to 1/n. With this assumption, we separate the set of numbers P1> . .. , Pn into the union of two (nonempty) sets A = {pi: Pi> 1/n}

B = {pi:

and

Pi~

To simplify the notation, we write F*(s) = Fi.(s), Pit It is clear that F*(1) = 1, F*(1 - (1/n)) = 1 - p:,

F*(1-

~) =

1-

(p: + P:- 1), ••• ,F*(~)

1-

=

1/n}.

= Pt.

(p~ + ··· + P2*).

Hence we see that the distribution F*(s), 0 ~ s ~ 1, can be generated by successively choosing p~, then p~_ 1, etc. The following picture indicates the inductive construction of p:, P!-1• · · ·, pf:

I

(0, 1 ) - -

I

-~--- ~--

I

I

I I i-i-- -:--- ](1, I

I

1

I

1

I

1

I

- - -: - - - J - - - L - - ~- - I

I

I

I

I

I 1 1 - - _j - - - ...!- - - t - - : I I

I

--,--I

n

I

~----~~

I

I/

1

/I

---r7{_ -~-- _

*

p,

v

/1 / I

2 n

/I

~:-~ I

I

}fL/~--~ I

I

I

/1

I

--~-~~~---~---~

I

n

P:

--~J

I

-- -~---~----~- --

I)

n

I I

I

I

I I

I

I

I

I I

_j_

I

I

I I ---1- __ ,

I

I

I

2 n

1--

I

1-n

On the right-hand side of the square [0, 1] x [0, 1] we start from (1, 1) and descend an amount p~, where p~ is the largest number (or one of the largest numbers) in A, and from the point (1, 1 - p~) we draw a (dotted) line parallel to the diagonal of the square that extends from (0, 0) to {1, 1).

350

III. Convergence of Probability Measures. Central Limit Theorem

Now we draw a horizontal line segment of length 1/n from the point (1, 1 - p:>. Its left-hand end will be at the point (1 - (1/n), 1 - p:), from which we descend an amount 1 is the largest number (or one 1 , where of the largest numbers) in B. Since 1 :::;; 1/n, it is easily seen that this line segment does not intersect the dotted line and therefore G(1 - (1/n)) F*((1 - (1/n))-) :::;; From the point (1 - (1/n), 1 1 ), we again draw a horizontal either the left-hand end possibilities: two are There line segment oflength 1/n. the diagonal or on it; below falls ) 1 of this interval (1 - (2/n), 1 diagonal. In the first the above is ) 1 or the point (1 - (2/n), 1 length of segment 2 , where case, we descend from this point by a line {p: _ 1 }. B\ set the in numbers) largest the of p:_ 2 is the largest number (or one is clear it Again, 2)/n.) 2 > (n(This set is not empty, since Pi+···+ descend we case second the In :::;; that G(1 - (2/n)) - F*((1 - (2/n)-) 2 is the largest number (or one of by a line segment oflength 2 , where the largest numbers) in the set A\{p:}. (This set is not empty since Pi+···+ P:- 2 > (n- 2)/2).) Since P:- 2 :::;; p:, it is clear that in this case

p:_

p:.

p:_

p:_

p: - p:_ p: - p:_ p: - p:_

P:-

p:_ p:.

P:-

p:_

Continuing in this way, we construct the sequence P:- 3 , .•• , Pi· It is clear from the construction that

for 1 :::;; k:::;; n. Since min sup i

05s51

II

k=O

p;k- s

I : :;

sup IF*(s)- G(s)l :::;; p:, 05s51

we have also established the second inequality in (2).

Remark. Let us observe that the lower bound min sup i

05s51

I L P;k - s I~ !P: [ns]

k=O

is evident. 4. Let A = A(t), t ~ 0, be a nondecreasing function, continuous on the right, with limits on the left, and with A(O) = 0. In accordance with Theorem 12 of §6, Chapter II, the equation

Zr = K +

J~z._ dA(s)

(12)

351

§7. Rapidity of Convergence in Poisson's Theorem

has (in the class of locally bounded functions that are continuous on the right and have limits on the left) a unique solution, given by the formula

Z, = KtS',(A), where

tS',(A) = eA
(13)

n (1 + L\A(s))e-&<•>.

(14)

oss:!>f

Let us now suppose that the function V(t), t ~ 0, which is locally bounded, continuous on the right, and with limits on the left, satisfies

Jt;

~ K + LV.- dA(s),

where K is a constant, for every t

Lemma 2. For every t

~

(15)

0.

~

0, (16)

PRooF. Let T = inf{t ~ 0: Jt; > KtS',(A)}, where inf{0} = oo. If T = oo, then (16) holds. Suppose T < oo; we show that this leads to a contradiction. By the definition of T, From this and (12),

Vr

~ KtS'r(A) =

K

~K

+K

LTs._(A) dA.

+

V.-

s:

dA(s)

~ Vr.

(17)

If Vr > KtS' r(A), inequality (17) yields Vr > Vr, which is impossible, since

IVrl <

00.

Now let Vr = Kef r(A). Then, by (17),

Vr

=K +

s:

V.-

dA(s).

From the definition ofT, and the continuity on the right of Jt;, Kcf,(A) and A(t), it follows that there is an h > 0 such that when T < t ~ T + h we have

Jt; > KtS',(A) and Ar+h - Ar ~

t.

Let us write 1/11 = Jt; - Kcf,(A). Then 0 < 1/1,

~ J:t/1.- dA.,

T < t

~ T + h,

and therefore 0:::;;; sup 1/11

:::;;;

t

sup

1/J,.

T:s;r:s;T+.!

352

III. Convergence of Probability Measures. Central Limit Theorem

Hence it follows that ljl, = 0 for T assumption that T < oo.

~

t

T

~

+ h,

which contradicts the

Corollary (Gronwall-Bellman Lemma). In (15) let A(t) = J~ a(s) ds and K;;;::: 0. Then (18)

5.

PROBLEMS

1. Prove formula (4). ~ 0, be functions of locally bounded variation, continuous on the right and with limits on the left. Let A(O) = B(O) = 0 and L\A(t) > -1, t ~ 0. Show that the equation

2. Let A = A(t), B = B(t), t

z,

=

J~z._ dA(s) + B(t)

has a unique solution $.(A, B) of locally bounded variation, given by

3. Let ~ and 11 be random variables taking the values 0, 1, .... Let p(~,

1'/) = supj P(~

E

A)- P(l'/

E

A)j,

where sup is taken over all subsets of {0, 1, ... }. Prove that (1) p(~, 1'/) = t Ik'=ol P(~ = k)- P('l = k)l; (2) p(~. 1'/) s p(~. ~) + p(~. 1'/); (3) if Cis independent of(~, 1'/), then p(~

(4) If the vectors (~ 1 , ••• ,

~.)and

+ ,, '1 + 0 s

it 1'/i)

s

J/(~;. 1'/;).

Let~= ~(p) be a Bernoulli random variable with P(~ = 1) = p, P(~ = 0) = 1 - p, 0 < p < 1; and n = n(p) a Poisson random variable with En= p. Show that

p(~(p),

5.

1'/);

('7 1 , ••• , 11.) are independent, then

p(t1 ~i• 4.

p(~.

n(p)) = p(1 - e-P)

s

p2 •

Let~ = ~(p) be a Bernoulli random variable with P(~ = 1) = p, P(~ = 0) = 1 - p, 0 < p < 1, and let n = n(A.) be a Poisson random variable such that P(~ = 0) = P(n = 0). Show that A.= -ln(1 - p) and p(~(p),

n(A.)) = 1 - e--<- A_e-.t

s tA. 2 •

§7. Rapidity of Convergence in Poisson's Theorem

353

6. Show, using Property (4) of Problem 3, and the conclusions of Problems 4 and 5, that if ~ 1 = ~ 1(pd, ... , ~. = ~.(p.) are independent Bernoulli random variables, 0 < P; < 1, and A; = -ln(1 - p;), 1 :::;; i :::;; n, then

and

CHAPTER IV

Sequences and Sums of Independent Random Variables

§1. Zero-or-One Laws 1. The series 2::': 1 (1/n) diverges and the series 2::': 1 ( -1)"(1/n) converges. We ask the following question. What can we say about the convergence or divergence of a series L.."'= 1 (~n/n), where ~ 1 , ~ 2 , ••• isasequenceofindependent identically distributed Bernoulli random variables with P(~ 1 = + 1) = P(~ 1 = --'-1) = In other words, what can be said about the convergence of a series whose general term is ± 1/n, where the signs are chosen in a random manner, according to the sequence ~ 1 , ~ 2 , .•• ? Let

t?

A1 =

{w: I

n= 1

~"converges} n

2::':

be the set of sample points for which 1 (~n/n) converges (to a finite number) and consider the probability P(A 1 ) of this set. It is far from clear, to begin with, what values this probability might have. However, it is a remarkable fact that we are able to say that the probability can have only two values, 0 or 1. This is a corollary of Kolmogorov's "zero-one law," whose statement and proof form the main content of the present section. 2. Let (Q, §, P) be a probability space, and let ~I> ~ 2 , ••• be a sequence of random variables. Let !F': = a(~"' ~n+ 1 , •••) be the a-algebra generated by ~"' ~n+ 1 , ••• , and write !!'

=

n ~F:. 00

n=1

355

§I. Zero-or-One Laws

Since an intersection of a-algebras is again a a-algebra, X is a a-algebra. It is called a tail algebra (or terminal or asymptotic algebra), because every event A EX is independent of the values of ~ 1 , ..• , ~n for every finite number n, and is determined, so to speak, only by the behavior of the infinitely remote values of~ 1 , ~ 2 , •••• Since, for every k 2::: 1, A1 we have A1

E

={I ~nn converges} = {I ~nn converges} n=k

n=1

nk ff'k =X. In the same way, if A2 = {

E

ff'k,

~1• ~2• ... is any sequence,

~ ~n converges} EX.

The following events are also tail events:

A3 = where In

E

{~n E

In for infinitely many n},

fJ6(R), n 2::: 1; A4

=

{Ji~ ~n < 00 };

A5 =

+ " ' + ~n {I~ 1m ~ 1

A6 =

{I

A7

{~n converges};

=

As =

n

n

1m ~ 1

+ ' ' ' + ~n n

n

{rrm: n

Sn

J2n log n

=

}
}
1}.

On the other hand,

= 0 for all n

B1 =

{~n

B2 =

{li~(~ 1 + · · · + ~n) exists and is less than c}

2::: 1},

are examples of events that do not belong to X. Let us now suppose that our random variables are independent. Then by the Borel-Cantelli lemma it follows that P(A3)

=

P(A3) =

L P(~n E Jn) < 1 ~ L P(~n E Jn) =

0~

00, 00.

Therefore the probability of A 3 can take only the values 0 or 1 according to the convergence or divergence of L P(~n E Jn). This is Borel's zero-one law.

356

IV. Sequences and Sums of Independent Random Variables

Theorem 1 (Kolmogorov's Zero-One Law). Let ~ 1 , ~ 2 , ••. be a sequence of independent random variables and let A E !!f. The P(A) can only have one of the values zero or one. PROOF. The idea of the proof is to show that every tail event A is independent of itself and therefore P(A n A) = P(A) · P(A), i.e. P(A) = P 2 (A), so that P(A) = 0 or 1. If A EX then A E ffr" = crg 1 , ~ 2 , ... } = cr(Un ffD, where ff~ = a{~ 1, ... , ~n}, and we can find (Problem 8, §3, Chapter II) sets An E ff~, m :2: 1, such that P(A !::,. An) --+ 0, n --+ oo. Hence

P(An)--+ P(A),

P(An n A)--+ P(A).

(1)

But if A E X, the events An and A are independent for every n :2: 1. Hence it follows from (1) that P(A) = P 2 (A) and therefore P(A) = 0 or 1. This completes the proof of the theorem.

Corollary. Let 1J be a random variable that is measurable with respect to the tail a-algebra X, i.e. {IJ E B} E !!f, BE £?4(R). Then 1J is degenerate, i.e. there is a constant c such that P(IJ = c) = 1. 3. Theorem 2 below provides an example of a nontrivial application of Kolmogorov's zero-one law. Let ~ 1 , ~ 2 , ••. be a sequence of independent Bernoulli random variables with P(~n = 1) = p, P(~n = -1) = q, p + q = 1, n :2: 1, and let Sn = ~ 1 + · · · + ~n· It seems intuitively clear that in the symmetric case (p = !) a "typical" path of the random walk Sn, n :2: 1, will cross zero infinitely often, whereas when p # 1it will go off to infinity. Let us give a precise formulation.

Theorem 2. (a)

(b) lfp #

~fp =

1 then P(Sn =

1, then P(Sn =

0 i.o.) = 1.

0 i.o.) = 0.

PRooF. We first observe that the event B = (Sn = 0 i.o.) is not a tail event, i.e. B¢!!f=nff;:o, ff:O=a{~n,~n+t•· .. }· Consequently it is, in principle, not clear that B should have only the values 0 or 1. Statement (b) is easily proved by applying (the first part of) the BorelCantelli lemma. In fact, if B 2 n = {S 2 n = 0}, then by Stirling's formula P(B ) = 2n

en n n 2nP q

"'(4pqt !:: -vnn

and therefore L P(B 2 n) < oo. Consequently P(Sn = 0 i.o.) = 0. To prove (a), it is enough to prove that the event A =

~ Sn {hm Jn =

has probability 1 (since A

5;:;

B).

Jn

. Sn } oo, hm = - oo

357

§1. Zero-or-One Laws

Let

Then Ac! A, c--+ oo, and all the events A, A 0 A~, A~ are tail events. Let us show that P(A~) = P(A~) = 1 for each c > 0. Since A~ E ?£and A~ E ?£, it is sufficient to show only that P(A~) > 0, P(A~) > 0. But by Problem 5

P(lim }n <-c)= P(rrm }n >c)~ umP(}n >c)> O, where the last inequality follows from the De Moivre-Laplace theorem. Thus P(Ac) = 1 for all c > 0 and therefore P(A) = limc .... oo P(Ac) = 1. This completes the proof of the theorem. 4. Let us observe again that B = { Sn = 0 i.o.} is not a tail event. Nevertheless, it follows from Theorem 2 that, for a Bernoulli scheme, the probability of this event, just as for tail events, takes only the values 0 and 1. This phenomenon is not accidental: it is a corollary of the Hewitt-Savage zero-one law, which for independent identically distributed random variables extends the result of Theorem 1 to the class of" symmetric" events (which includes the class of tail events). Let us give the essential definitions. A one-to-one mapping n = (n 1, n 2 , •• •) of the set (1, 2, ... ) on itself is said to be a finite permutation if rt 11 = n for every n with a finite number of exceptions. If~= ~ 1 , ~ 2 , ••• is a sequence of random variables, n(~) denotes the sequence (~,11 , ~ 112 , ••• ). If A is the event {~ E B}, BE PJ(R 00 ), then n(A) denotes the event {n(~) E B}, BE .94(R We call an event A = {~ E B}, BE PJ(R 00 ), symmetric if n(A) coincides with A for every finite permutation n. An example of a symmetric event is A = {Sn = 0 i.o.}, where Sn = ~ 1 + · · · + ~n· Moreover, we may suppose (Problem 4) that every event in ff;:'(S) = a{w: Sn, Sn+ 1 , ..• } generated by the tail a-algebra ?I(S) = s1 = ~1• s2 = ~1 + ~2• ... is symmetric. 00

).

n

Theorem 3 (Hewitt-Savage Zero-One Law). Let ~ 1 , ~ 2 , ••• be a sequence of independent identically distributed random variables, and A=

a symmetric event. Then P(A)

{w:(~ 1 ,~ 2 , ... )EB}

= 0

or 1.

PRooF. Let A = { ~ E B} be a symmetric event. Choose sets Bn E PJ(R") such that, for An = {w: (~ 1• ... , ~n) E Bn}, n--+ oo.

(2)

358

IV. Sequences and Sums of Independent Random Variables

Since the random variables ~ 1 , ~ 2 , •.• are independent and identically distributed, the probability distributions P~(B) = P(~ E B) and P,..w(B) = P(n.(O E B) coincide. Therefore P(A lJ. A.)

= P~(B lJ. B.) = P,..w(B lJ. B.).

Since A is symmetric, we have A

Therefore P,.<~>(B

=g

E

B} = n.(A)

=

{nnC~) E

(3)

B}.

lJ. B.)= P{n.(~) E B) lJ. (n.(~) E B.)} = P{(~ E B) lJ. (n.(~) E B.)} = P{A i:J. n.(A.)}.

(4)

Hence, by (3) and (4), P(A lJ. A.) = P(A lJ. n.(A.)).

(5)

It then follows from (2) that P(A lJ. (A. n n.(A.)))

--+

n--+ oo.

0,

(6)

Hence, by (2), (5) and (6), we obtain P(A.)

--+

P(A),

P(n.(A))

P(A. n n.(A.))

--+

--+

P(A),

P(A).

(7)

Moreover, since ~ 1 and ~ 2 are independent, P(A.

f1

n.(A.)) = P{(~ 1 , ... , ~.) E B., (~n+ 1• ... , ~2n) E B;,} = P{(~I• ... , ~.) E B.}· P{(~n+I• ... , ~2n) E B.} = P(A.)P(n:.(A.)),

whence by (7) P(A)

= P 2 (A)

and therefore P(A) = 0 or 1. This completes the proof of the theorem.

5.

PROBLEMS

1. Prove the corollary to Theorem 1.

2. Show that if (e.) is a sequence of independent random variables, the random variables Urn and lim are degenerate.

e.

e.

e

e.,

3. Let (e.) be a sequence of independent random variables, s. = 1 + · · · + and let the constants b. satisfy 0 < b. j oo. Show that the random variables Iiiii(S./b.) and lim(S.Jb.) are degenerate.

e.,

4. Lets.= el + ... + n ~ 1, and ~(S) = Show that every event in ~(S) is symmetric.

n$';:'(S), $';:'(S) = a{ro: s., s.+ I• ...}.

S. Let (e.) be a sequence of random variables. Show that {Iiiii for each c > 0.

e.> c} 2ilm{e. > c}

359

§2. Convergence of Series

§2. Convergence of Series 1. Let us suppose that ~ 1 , ~ 2 , ••. is a sequence of independent random variables, Sn = ~ 1 + · · · + ~"' and let A be the set of sample points w for which ~nCw) converges to a finite limit. It follows from Kolmogorov's zero-one law that P(A) = 0 or 1, i.e. the series ~n converges or diverges with probability 1. The object of the present section is to give criteria that will determine whether a sum of independent random variables converges or diverges.

L:

L:

Theorem 1 (Kolmogorov and Khinchin). (a) Let E~n = 0, n ~ 1. Then if (1)

L

the series ~n converges with probability 1. (b) !{the random variables~"' n ~ 1, are unfformly bounded (i.e., P(l~nl s c) = 1, c < oo ), the converse is true: the convergence of ~n with probability 1 implies (1).

L

The proof depends on

Kolmogorov's Inequality (a) Let~ 1, ~ 2 , •.. , ~n be independent random variables with E~i = 0, E~t < oo, i S n. Then for every c; > 0 (2)

(b) If also

P(\~d

S c)= 1, iS n, then (3)

PRooF. (a) Put

A= {max\Ski ~ e}, Ak = {\Sd < Then A= LAk and

t:,

i = 1, ... ,k- 1, \Ski~ e},

1 S k S n.

360

IV. Sequences and Sums of Independent Random Variables

But

ES;IAk = E(Sk + (ek+t + · · · + en)) 2 1Ak = ESf/Ak + 2ESiek+l + ... + en)/Ak ~ ESf/Ak'

+ E(ek+l + ... + en) 2 1Ak

since

+ .. · + en)/Ak = ESk/Ak · E(ek+t + .. ·+en)= 0 because of independence and the conditions Eei = 0, i :::;; n. Hence ESk(ek+t

Es; ~

L ESf/ Ak ~ 8 2 L P(Ak) =

8 2 P(A),

which proves the first inequality. (b) To prove (3), we observe that

ES;IA

= Es;

= ES;

- ES;Ix ~ Es; - 82 P(A) On the other hand, on the set Ak

1sk-tl:::;; and therefore

Es;IA

=

1sk1:::;; 1sk-tl

8,

:::;; (8

82

+ 82 P(A).

(4)

+ 1ek1:::;; 8 + c

L ESf/Ak + L E(/Ak(Sn- Sk) k

-

2)

k

n

n

k=l

j=k+l

+ c) 2 L P(Ak) + L P(Ak) L Ee] k

:::;; P(A{(8

+ c) 2 +

J 1

J

+ c) 2 + ES;),

Ee] = P(A)[(8

(5)

From (4) and (5) we obtain P(A) > -

Es; -

(8

= 1-

82

+ c) 2 + Es;

-

82

(8

(8

+

> 1-

c)2

+ c) 2 + Es;

-

(8

+

c)2

Es;

82 -

·

This completes the proof of (3). PROOF OF THEOREM 1. (a) By Theorem 4 of §10, Chapter II, the sequence (Sn), n ~ 1, converges with probability 1, if and only if it is fundamental with probability 1. By Theorem 1 of §10, Chapter II, the sequence (Sn), n ~ 1, is fundamental (P-a.s.) if and only if

n-... oo.

(6)

By (2), P{sup ISn+k- Snl k~l

~ 8} =

limP{ max ISn+k- Snl

N-+oo

1Sk:SN

Therefore (6) is satisfied if:Lk"'=. 1 Ee~ < oo, and consequently with probability 1.

~ 8} L ek converges

361

§2. Convergence of Series

(b) Let

L ~k converge. Then, by (6), for sufficiently large n, P{sup ISn+k- Snl ~ e} < 1-.

(7)

k~l

By (3), P{ supiSn+k- Snl ~ e} ~ 1k~

1

Therefore if we suppose that

Lf=

1

(c + e) E1'2" "oo 2

L...k=n

'ok

E~f = oo, we obtain

P{sup ISn+k- S.l k~1

~ e} = 1,

which contradicts (7). This completes the proof of the theorem. If ~ 1 , ~ 2 , ... is a sequence of independent Bernoulli random variables with P(~n = + 1) = P(~n = -1) = !, then the series ~nan, with Ian I ~ c, converges with probability 1, if and only if La; < oo.

ExAMPLE.

L

2. Theorem 2 (Two-Series Theorem). A sufficient condition for the convergence of the series L ~" of independent random variables, with probability 1, is that both series L E~n and L V~"converge. If P( I~n I ~ c) = 1, the condition is also necessary. PROOF. Ifl: V~" < oo, then by Theorem 1 the series L (~n- E~n) converges (P-a.s.). But by hypothesis the series E~n converges; hence ~"converges (P-a.s.) To prove the necessity we use the following symmetrization method. In addition to the sequence ~ 1 , ~ 2 , ... we consider a different sequence ~ 1 ,

L

L

~2 ,

•••

of independent random variables such that ~" has the same distribu-

tion as ~"' n ~ 1. (When the original sample space is sufficiently rich, the existence of such a sequence follows from Theorem 1 of §9, Chapter II. We can also show that this assumption involves no loss of generality.) Then if ~"converges (P-a.s.), the series ~"also converges, and hence so does L (~. - ~.). But E(~. - ~.) = 0 and P( I~" - ~"I ~ 2c) = 1. Therefore LV(~. ~.) < oo by Theorem 1. In addition,

L -

L

L: v~. = 1- L: V(~. - ~.) < oo. Consequently, by Theorem 1, L (~. - E~.) converges with probability 1, and therefore L E~" converges. Thus if L ~"converges (P-a.s.)(and P( I~" I ~ c) = 1, n ~ 1) it follows that both L E~. and LV~. converge. This completes the proof of the theorem.

3. The following theorem provides a necessary and sufficient condition for the convergence of L ~" without any boundedness condition on the random variables.

362

IV. Sequences and Sums of Independent Random Variables

Let

c be a constant and

~c = {~' ~~~ ~ C, 0,

>c.

1~1

Theorem 3 (Kolmogorov's Three-Series Theorem). Let ~ 1 , ~ 2 , ••• be a sequence of independent random variables. A necessary condition for the convergence of"[. ~ .. with probability 1 is that the series

L

"[. E~~. v~~. "[. P(l~.. l;?: c) converge for every c > 0; a sufficient condition is that these series converge for some c > 0.

PRooF. Sufficiency. By the two-series theorem,"[. ~~converges with probability 1. But if "[. P( I~.. I ;?: c) < oo, then by the Borel-Cantelli lemma we have I(l ~ .. I ;?: c) < oo with probability 1. Therefore ~ .. = ~~ for all n with at

L

L.

most finitely many exceptions. Therefore ~ also converges (P-a.s.). Necessity. If ~ converges (P-a.s.) then ~ .. -+ 0 (P-a.s.), and therefore, for every c > 0, at most a finite number of the events {I~ .. I ;?: c} can occur (P-a.s.). Therefore I( I~ .. I ;?: c) < oo (P-a.s.), and, by the second part of the Borel-Cantelli lemma, P( I~ .. I > c) < oo. Moreover, the convergence of ~ implies the convergence of ~~. Therefore, by the two-series theorem, both of the series E~~ and LV~~ converge. This completes the proof of the theorem.

L. L

L .

L

L

L

Corollary. Let ~ 1 , ~ 2 , ••• be independent variables withE~, = 0. Then if ~2

L E 1 + ~~~~~~~ <

00,

the series "[. ~ .. converges with probability 1. For the proof we observe that ~2

L E 1 + ~~~~~~~ < 00 <=> L E[~;I(I~nl ~ 1) + l~nii(I~nl > 1)] < 00. Therefore if~! = ~ .. I( I~ .. I ~ 1), we have

L E(~!)2

<

00.

Since E~, = 0, we have

LIE~! I=

L IE~,I(I~.. I ~

1)1 =

L IE~,I(I~.. I > 1)1

~ l:EI~ .. II(I~ .. I > 1) < oo. Therefore both inequality, P{l~ .. l

L

L E~!

> 1} =

and LV~! converge. Moreover, by Chebyshev's P{I~.. II(I~ .. I > 1)

Therefore P( I~ .. I > 1) the three-series theorem.

> 1}

~ E(I~ .. II(I~ .. I > 1).

< oo. Hence the convergence of L ~.. follows from

363

§3. Strong Law of Large Numbers

4.

PROBLEMS

1. Let ~ 1 , ~ 2 , ..• be a sequence of independent random variables, S" = ~ 1 ,

••. , ~n·

Show, using the three-series theorem, that (a) if I ~;; < oo (P-a.s.) then I ~" converges with probability 1, if and only if

IE

U(l~d ~!)converges;

(b) if I~" converges (P-a.s.) then I~;; < oo (P-a.s.) if and only if

I

(E l~nll(lenl ~ 1))2 < oo.

2. Let eI• ~2' ... be a sequence of independent random variables. Show that I (P-a.s.) if and only if ~;;

IE - - 2 < 1 + ~"

e;;

<

00

00.

3. Let ~ 1 , ~ 2 , ••• be a sequence of independent random variables. Show that I~" converges (P-a.s.) if and only if it converges in probability.

§3. Strong Law of Large Numbers 1. Let ~ 1 , ~ 2 , .•• be a sequence of independent random variables with finite second moments; Sn = ~ 1 + · · · + ~n· By Problem 2, §3, Chapter III, if the numbers V~i are uniformly bounded, we have the law of large numbers: Sn- ESn

P

---"'-+

0

n

n-+ oo.

'

(1)

A strong law of large numbers is a proposition in which convergence in probability is replaced by convergence with probability 1. One of the earliest results in this direction is the following theorem. Theorem 1 (Cantelli). Let ~ 1 , finite fourth moments and let

~ 2 , .••

be independent random variables with

El~n-E~ni 4 ~C,

n~l,

for some constant C. Then as n-+ oo

Sn- ES"

- - - -+

n

0

(P-a.s.).

(2)

PRooF. Without loss of generality, we may assume that E~" = 0 for n ~ 1. By the corollary to Theorem 1, §10, Chapter II, we will have Sn/n-+ 0 (P-a.s.) provided that

for every e > 0. In turn, by Chebyshev's inequality, this will follow from

364

IV. Sequences and Sums oflndependent Random Variables

Let us show that this condition is actually satisfied under our hypotheses. We have 4

s.

n

4

4

= (~1 + ... + ~.) = i~1~i-

b

4! 2 2 2!2! ~i~j

i<j

"

4!

+ loFJ .L.. 3'1' ~i~j· • • Remembering that

E~k

n

ES! =

s

= 0, k ::::;; n, we then obtain n

L E~i + 6 L: i=1

i,j=1

nC

+

3

E~?E~J s nC + 6

6n(n - 1) C = (3n 2 2

L

i,j=1 i<j

JE~i · E~J

2n)C < 3n 2 C.

-

Consequently

L E(s: )

4

::::;; 3C

L n1

2

< oo.

This completes the proof of the theorem. 2. The hypotheses of Theorem 1 can be considerably weakened by the use of more precise methods. In this way we obtain a stronger law of large numbers.

Theorem 2 (Kolmogorov). Let~ 1 , ~ 2 , ..• be a sequence of independent random variables with finite second moments, and let there be positive numbers b. such that b. i oo and

"v~. < oo.

(3)

L.f? n

Then

s. -b ES. __,. O n

In particular,

(

)

P-a.s..

(4)

if (5)

then

S" - ES. --.. 0 (P- a.s. ) ---'--n For the proof ofthis, and of Theorem 2 below, we need two lemmas.

(6)

365

§3. Strong Law of Large Numbers

Lemma 1 (Toeplitz). Let {a.} be a sequence of nonnegative numbers, b. = > 0 for n ~ 1, and b. i oo, n--+ oo. Let {x.} be a sequence of

L~= 1 a;, b.

numbers converging to x. Then

(7) In particular,

if a.

=

1 then

x1

+ · · · + x.

(8)

------+X.

n

PROOF. Let e > 0 and let n0 Choose n1 > n0 so that

n0 (e) be such that Ix. -

=

1

b

no

.L

" ' j=

Ixi

-

xI ::;

~:/2

for all n

~

n0 .

xI < e/2.

1

Then, for n > n 1,

This completes the proof of the lemma.

Lemma 2 (Kronecker). Let {b.} be a sequence of positive increasing numbers, b. j oo, n--+ oo, and let {x.} be a sequence of numbers such that L x. converges. Then 1

n

b.

j= 1

- L bixi--+ 0, In particular, if b. = n, x. = Y./n and Y1

Let b 0

n

n --+ oo.

(10)

= 0, S 0 = 0, s. = L}= 1 xi. Then (by summation by parts) n

L bjxj = L b/Sj- sj-1) =

j=1

(9)

L (y./n) converg_es, then

+ · · · + Yn --+, 0 n

PROOF.

n --+ oo.

j=1

bnSn - boSo -

n

L sj-1(bj- b;-1)

j=l

366

IV. Sequences and Sums of Independent Random Variables

and therefore

since, if sn

~X,

then by Toeplitz' lemma,

1 -b

n

L Si _ 1ai ~ x.

n j= 1

This establishes the lemma. PROOF OF THEOREM 1. Since

Sn - ES" bn

=

2_

I bk(~k -bkE~k),

bn k=1

a sufficient condition for (4) is, by Kronecker's lemma, that the series [(~k - E~k)/bk] converges (P-a.s.). But this series does converge by (3) of Theorem 1, §2. This completes the proof of the theorem.

L

EXAMPLE 1. Let ~ 1 , ~ 2 , •.• be a sequence of independent Bernoulli random variables with P(~n= l)=P(~n= -1)=1. Then, since L [l/(n log 2 n)] < oo, we have

r::. sn

v n log n

) ~ ~ 0 (P-a.s ..

(11)

3. In the case when the variables ~ 1 , ~ 2 , •.• are not only independent but also identically distributed, we can obtain a strong Jaw of large numbers without requiring (as in Theorem 2) the existence of the second moment, provided that the first absolute moment exists.

Theorem 3 (Kolmogorov). Let ~ 1 , ~ 2 , .•• be a sequence of independent identically distributed random variables withE I~ 1 1< oo. Then

sn ~ m

n where m

(P-a.s.)

(12)

= E~ 1 .

We need the following lemma.

Lemma 3. Let~ be a nonnegative random variable. Then 00

00

n=l

n=l

L P( ~ :2: n) ~ E~ ~ 1 + L P( ~ ;;:: n).

(13)

367

§3. Strong Law of Large Numbers

The proof consists of the following chain of inequalities:

L P(e ~ n) = L L P(k ~ e< 00

00

n= 1

k + 1)

n= 1 k;;,n 00

= I

kP(k ~ e < k

k=1

+ 1) =

00

I

~

k=O

E[ei
r E[kl(k ~ e < k + 1)J 00

k=O

+ 1)J

r E[(k + 1)I(k ~ e< k + 1n 00

= Ee ~ 00

I

=

k=O

(k

k=O

+ 1)P(k ~ e< k + 1)

00

00

L P(e ~ n) + L P(k ~ e<

=

n=1

k=O

00

k + 1) =

L P(e ~ n) +

n=1

1.

PRooF OF THEOREM 3. By Lemma 3 and the Borel-Cantelli lemma, E le11 <

00

¢> ¢>

L P{le11 ~ n} < 00 L P{lenl ~ n} < oo

¢>

P{lenl ~ n i.o.} = 0.

Hence Ien I < n, except for a finite number of n, with probability 1. Let us put

and suppose that Een and only if ( 1 + · · · +

e

Een

= 0, n

~

1. Then (e 1 + · · · + en)/n--. 0 (P-a.s.), if

en )/n --. 0 (P-a.s. ). Note that in general Een # 0 but

= Een !(len I<

n)

= Eeti(Ietl

< n) ___. Eet

= 0.

Hence by Toeplitz' lemma

n--. oo, and consequently (e 1 + · · · + en)!n-+ 0 (P-a.s.), if and only if

<e~ - E~~) +···+<en- Ee") --., 0 n Write ~n = ~n

-

n --.

oo

(P-a.s.),

n --.

oo. (14)

E~n· By Kronecker's lemma, (14) will be established if

I (~jn) converges (P-a.s.). In turn, by Theorem 1 of §2, this will follow if we show that, when E Ie1 I < 00, the series L (V ~n/n 2 ) converges.

368

IV. Sequences and Sums of Independent Random Variables

We have

"V~n < ~ E~; L.,

n

2-L.

n=l

n

2

00

=

L E[~iJ(k-

1 ~ l~tl < k)].

k=1

00

1

00

L

2

n

n=k

1

~ 2 k~1 kE[~if(k- 1 ~ 1~ 1 1 < k)]

L E[1~11J(k -1 ~ l~tl < k)] = 00

~ 2

2E1~11

<

00.

k= 1

This completes the proof of the theorem.

Remark 1. The theorem admits a converse in the following sense. Let be a sequence of independent identically distributed random variables such that ~ 1 , ~ 2 , •.•

~1

+ "' + ~n

-=-----"-+

n

C,

with probability 1, where Cis a (finite) constant. Then E I~ 1 I < oo and C In fact, if Sn/n -+ C (P-a.s.) then

~n = n

Sn _ (n n n

Sn1) n-1

1 -+

=

E~ 1 •

O (P-a.s.)

and therefore P( I~n I > n i.o.) = 0. By the Borel-Cantelli lemma,

L P(l~1l > n) < oo, and by Lemma 3 we have E I~ 1 I < oo. Then it follows from the theorem that C = E~ 1 .

Consequently for independent identically distributed random variables the conditionE I ~ 1 1 < oo is necessary and sufficient for the convergence (with probability 1) of the ratio Sn/n to a finite limit.

Remark 2. If the expectation m = E~ 1 exists but is not necessarily finite, the conclusion (10) of the theorem remains valid. In fact, let, for example, E~1 < oo and E~i = oo. With C > 0, put n

s; = L ~J(~; ~ C). i= 1

369

§3. Strong Law of Large Numbers

Then (P-a.s.).

But as C -+ oo, Ee 1 I(e 1 ~C)-+ Ee 1 = oo; therefore Sn/n -+

+ oo (P-a.s.)

4. Let us give some applications of the strong law oflarge numbers. 1 (Application to number theory). Let Q = [0, 1), let 11 be the algebra of Borel subsets of Q and let P be Lebesgue measure on [0, 1). Consider the binary expansions OJ=O. OJ 1 0J 2 •.• of numbers OJ e Q (with infinitely many O's) and define random variables 1(OJ), 2(OJ), ... by putting en(OJ) = OJn. Since, for all no~ 1 and all x 1 , ..• , xn taking the values 0 or 1, EXAMPLE

e e

the P-measure of this set is 1/2n. It follows that e 1 , en, ... is a sequence of independent identically distributed random variables with P<el =

o> =

P<el = 1) = !.

Hence, by the strong law of large numbers, we have the following result of Borel: almost every number in [0, 1) is normal, in the sense that with probability 1 the proportion of zeros and ones in its binary expansion tends to ! , i.e.

1

n

- L I(ek = n k=t

1)-+! (P-a.s.).

EXAMPLE 2 (The Monte Carlo method). Let f(x) be a continuous function defined on [0, 1], with values on [0, 1]. The following idea is the foundation of the statistical method of calculating JAf(x) dx (the "Monte Carlo method"). Let e 1 , '7 1 , e 2 , '7 2 , .•• be a sequence of independent random variables, uniformly distributed on [0, 1]. Put

P· =



It is clear that

{1

o

if f(O > '7;. if < 'li·

f(ei>

370

IV. Sequences and Sums of Independent Random Variables

By the strong law of large numbers (Theorem 3)

1 n

(1

n;~/; -+ Jo f(x) dx

(P-a.s.).

Consequently we can approximate an integral f~ f(x) dx by taking a simulation consisting of a pair of random variables(~;, '7;), i ;;:: 1, and then 1 P;· calculating P; and (1/n)

L7=

5. PROBLEMS 1. Show that

Ee

2

<

00

eI > n) <

if and only if L:'=t nP( I

00.

ee

2. Supposing that 1, 2 , ••• are independent and identically distributed, show that if E Ietl" < 00 for some IX, 0 < IX < 1, then S./n 11"--> 0 (P-a.s.), and if EI~tiP < 00 for some {J, 1 ~ fJ < 2, then (S. - nE~ 1 )/n 11 P-+ 0 (P-a.s.).

ee

3. Let 1, 2 , ••• be a sequence of independent identically distributed random variables and let E I ~ 1 1 = oo. Show that

~I~

-I a.

=

oo

(P-a.s.)

for every sequence of constants {a.}. 4. Show that a rational number on [0, 1) is never normal (in the sense of Example 1, Subsection 4).

§4. Law of the Iterated Logarithm 1. Let ~ 1 , ~ 2 , ••• be a sequence of independent Bernoulli random variables with P(~n = 1) = P(~n = -1) = t; let Sn = ~ 1 + · · · + ~n· It follows from the proof of Theorem 2, §1, that (1)

with probability 1. On the other hand, by (3.11),

JnnSnlog n -+ 0

(P-a.s.).

(2)

Let us compare these results. It follows from (1) that with probability 1 the paths of (Sn)n~ 1 intersect the "curves" infinitely often for any given but at the same time (2)

±eJn

e;

~4.

371

Law of the Iterated Logarithm

shows that they only finitely often leave the region bounded by the curves ± eJn log n. These two results yield useful information on the amplitude of the oscillations of the symmetric random walk (Sn)n ~ 1• The law of the iterated logarithm, which we present below, improves this picture of the amplitude of the oscillations of (Sn)n> 1 . Let us introduce the following definition. We call a function cp* = cp*(n), n ;;::: 1, upper (for (Sn)n~ 1) if, with probability 1, Sn ~ cp*(n) for all n from n = n 0 (w) on. Wecallafunctioncp* = cp*(n),n;;::: 1,lower(for(Sn)n~ 1 )if,withprobability 1, > cp*(n) for infinitely many n. Using these definitions, and appealing to (1) and (2), we can say that every function cp* = eJn log n, e > 0, is upper, whereas cp* = eJn is lower, e > 0. Let cp = cp(n) be a function and cp: = (1 + e)cp, cp*" = (1 - e)cp, where e > 0. Then it is easily seen that

sn

{rrm (/)~~> ~ 1} = {1i~ [~~~ (/)~:>] ~ t} ¢:> { sup

m~n,(<)

¢:.

S(m) cp m

{Sm ~ (1

~ 1 + e for every e > 0, from some n (e) on} 1

+ e)cp(m) for every e >

0, from some n 1(e) on}. (3)

In the same way,

{11m(/)~~) ; : : 1} = {li~ [~~~ (/)~:)] ;;::: 1} ¢:>{sup S(m) m2:n2(<) cp m

~ 1 + e.foreverye > O,.fromsomen (e)on} 1

¢:> {Sm ;;::: (1 - e)cp(m) for every e > 0 and for infinitely many m larger than some n 3 (e) ;;::: n 2 (e)}.

(1

(4)

It follows from (3) and (4) that in order to verify that each function cp: = + e)cp, e > 0, is upper, we have to show that (5)

But to show that cp*. = (1 - e)cp, e > 0, is lower, we have to show that (6)

372

IV. Sequences and Sums of Independent Random Variables

2. Theorem 1 (Law of the Iterated Logarithm). Let~ 1 , ~ 2 , ••. be a sequence of independent identically distributed random variables with E~; = 0 and E~t = CJ 2 > 0. Then (7)

where tf;(n) = J2CJ 2 n log log n.

(8)

For uniformly bounded random variables, the law of the iterated logarithm was established by Khinchin (1924). In 1929 Kolmogorov generalized this result to a wide class of independent variables. Under the conditions of Theorem 1, the law of the iterated logarithm was established by Hartman and Wintner (1941). Since the proof of Theorem 1 is rather complicated, we shall confine ourselves to the special case when the random variables ~n are normal, ~n " ' %(0, 1), n ~ 1. We begin by proving two auxiliary results.

Lemma 1. Let~ 1, •.• , ~n be independent random variables that are symmetrically distributed (P( ~k E B) = P(- ~k E B) for every B E 84 (R), k ::;; n). Then for every real number a

P( max

Sk >

1'5k:5,n

a) : ; 2P(S" > a).

(9)

PROOF. Let A= {max1:5k:5n sk >a}, Ak = {S;::;; a, i::;; k- 1; sk >a} and B = {Sn > a}. Since S" > a on Ak (because Sk ::;; S"), we have

P(B n Ak) ~ P(Ak n {S" ~ Sd) = P(Ak)P(S" ~ Sk) = P(Ak)P(~k+ 1 + · · · + ~n ~ 0). By the symmetry of the distributions of the random variables ~ 1, ... , ~n, we have P(~k+ 1 +

Hence P(~k+ 1 +

··· +

· · · + ~n >

~n > 0) = P(~k+ 1

+ · · · + ~n <

0).

0) ~ -!,and therefore

n

P(B) ~ k~1 P(Ak n B) ~

1

n

2k~1 P(Ak) =

1

2 P(A),

which establishes (9).

Lemma 2. Let Sn ,...., JV (0, CJ 2 (n)), CJ 2 (n) j oo, and let a(n), n ~ 1, satisfy a(n)/CJ(n)--.. oo, n--.. oo. Then P(Sn > a(n)) ,....,

CJ(n)

11::..

-y

2na(n)

exp{ -ta 2 (n)/CJ 2 (n)}.

(10)

373

§4. Law of the Iterated Logarithm

The proof follows from the asymptotic formula

-1-

faa e - y1/2 dy - -1- e -x2/2

fox

fox

.

,

X-+ 00,

since S,Ju(n) ,.... .Af(O, 1). PROOF OF THEOREM 1 (for~; "' .Af(O, 1)). Let us first establish (5). Let e > 0, A.= l + e, nk =A.\ where k ~ k0 , and k0 is chosen so that In In k0 is defined. We also define

Ak = {Sn > A.t/l(n) for some n E (nk, nk+ tJ},

(11)

and put

A = {Ak i.o.}

= {Sn > A.t/J(n) for infinitely many n}.

In accordance with (3), we can establish (5) by showing that P(A) = 0. Let us show that P(Ak) < oo. Then P(A) = 0 by the Borel-Cantelli lemma. From (11), (9) and (10) we find that

L

P(Ak)

~

P{Sn > A.t/J(nk) for some n E (nk, nk+ 1)}

~ P{S"

> A.t/J(nk) for some n

~

nk+ tl

~ 2P{Snk+1 > A.t/J(nk)} "'fo~nk) exp{ -tA.2[t/J(nk)/AJl} ~

C 1 exp( -A. In In A.k)

~

Ce-J.Ink = C2 k-;.,

where C 1 and C2 are constants. But L::;. 1 k-;. < oo, and therefore

L P(Ak) <

oo.

Consequently (5) is established. We turn now to the proof of(6). In accordance with (4) we must show that, with A. = 1 - e, e > 0, we have with probability 1 that sn ~ A.t/J(n) for infinitely many n. Let us apply (5), which we just proved, to the sequence (- Sn)n<::.t· Then we find that for all n, with finitely many exceptions, -Sn ~ 21/J(n) (P-a.s.). Consequently if nk = N\ N > 1, then for sufficiently large k, either snk-1 ~ -21/J(nk-l) or (12)

where Yk = snk - snk-1. Hence if we show that for infinitely many k

lk >

A.t/J(nk)

+ 21/J(nk-l),

(13)

374

IV. Sequences and Sums of Independent Random Variables

this and (12) show that (P-a.s.) s.k > A.t/J(nk) for infinitely many k. Take some ).' E (A., 1). Then there is anN > 1 such that for all k

A.'[2(Nk- Nk- 1) In In Nk] 112 > A.(2Nk In In Nk) 112 + 2(2Nk- 1 In In Nk- 1) 112

=A.t/J(Nk) + 21/J(Nk- 1).

It is now enough to show that ~

> A.'[2(Nk - Nk- 1) In In Nk] 112

(14)

for infinitely many k. Evidently ~ ""' %(0, Nk - Nk- 1). Therefore, by Lemma2, P{Y. > A.'[2(Nk- Nk-1) In In Nkr/2}"'

k

>

1 foX(2 In In Nk) 112

c1

k-(A')2

- (In k) 112

e-0.'>2JntnNk

c2>-

k In k ·

L

Since (1/k Ink) = oo, it follows from the second part of the Borel-Cantelli lemma that, with probability 1, inequality (14) is satisfied for infinitely many k, so that (6) is established. This completes the proof of the theorem.

Remark 1. Applying (7) to the random variables (- s.).;;,; 1, we find that

s.

r

(15)

tm cp(n) = -1.

It follows from (7) and (15) that the law of the iterated logarithm can be put in the form (16)

Remark 2. The law of the iterated logarithm says that for every 8 > 0 each function t/Ji = (1 + 8)1/1 is upper, and t/1*, = (1 - 8)1/1 is lower. The conclusion (7) is also equivalent to the statement that, for each 8 > 0,

3.

P{IS.I

~

(1 - 8)1/J(n) i.o.}

=

1,

P{IS.I

~

(1

+ 8)1/J(n) i.o.}

=

o.

PROBLEMS

1. Let ~ 1 , ~ 2 , ... be a sequence of independent random variables Show that = 1} = 1. P{IIm v~ 2ln n

with~.~

JV(O, 1).

375

§4. Law of the Iterated Logarithm

2. Let ~ 1 , ~ 2 , .•. be a sequence of independent random variables, distributed according to Poisson's law with parameter A. > 0. Show that (independently of A.)

{ Pnm

=~.In Inn =1 } =1. Inn

3. Let ~ 1 , ~ 2 , .•. be a sequence of independent identically distributed random variables with

0
{ I

} S lll(lnlnn) = ell• = 1. p Ilrri -" nil•

4. Establish the following generalization of (9). Let ~ 1 , .•• , ~.be independent random variables. Levy's inequality

PL~::. [Sk + Jl(S. -

Sk)] >

a}

:$

2P(S. a), >

S0

=

0,

holds for every real a, where /1(~) is the median of~, i.e. a constant such that

CHAPTER V

Stationary (Strict Sense) Random Sequences and Ergodic Theory

§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations 1. Let (Q, :F, P) be a probability space and~= (~t. ~ 2 , ..• ) a sequence of random variables or, as we say, a random sequence. Let (}k ~denote the sequence (~k+1• ~k+2• ... ).

Definition l. A random sequence ~ is stationary (in the strict sense) if the probability distributions of (}k~ and~ are the same for every k 2: 1:

BE PA(R 00 ). The simplest example is a sequence ~ = (~ 1 , ~ 2 , ..•) of independent identically distributed random variables. Starting from such a sequence, we can construct a broad class of stationary sequences 11 = (17 t. 17 2 , ••• ) by choosing any Borel function g(x 1, ... , x.) and setting '1k = g(~k• ~k+ 1, ... , ~k+ 1). If ~ = (~ 1 , ~ 2 , .•. ) is a sequence of independent identically distributed random variables with E I ~ 1 1 < oo and E~ 1 = m, the law of large numbers tells us that, with probability 1, ~1 + ... + ~. __.m, _____

n

n ~ oo.

In 1931 Birkhoff obtained a remarkable generalization of this fact for the case of stationary sequences. The present chapter consists mainly of a proof of Birkhoff's theorem. The following presentation is based on the idea of measure-preserving transformations, something that brings us in contact with an interesting

§I. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations

377

branch of analysis (ergodic theory), and at the same time shows the connection between this theory and stationary random proceesses. Let (Q, fF, P) be a probability space.

Definition 2. A transformation T ofQ into n is measurable if, for every A E fF, T- 1 A = {w: TwEA}EfF.

Definition 3. A measurable transformation T is a measure-preserving transformation (or morphism) if, for every A E F, P(T- 1A) = P(A). LetT be a measure-preserving transformation, T" its nth iterate, and ~ 1 = n ~ 2, and consider the sequence~= (~ 1 , ~ 2 , ... ). We claim that this sequence is stationary. In fact, let A= {w: ~EB} and A 1 = {w: 0 1 ~EB}, where BE~(R 00 ). Since A = {w: (~ 1 (w), ~ 1 (Tw), .. .) E B}, and A 1 = {w: (~ 1 (Tw), ~ 1 (T 2 w), ...) E B}, we have wE A 1 if and only if either Tw E A or A 1 = r- 1A. But P(T- 1A) = P(A), and therefore P(Ad = P(A). Similarly P(Ak) = P(A) for every Ak = {w: Ok~ EB}, k ~ 2. Thus we can use measure-preserving transformations to construct stationary (strict sense) random variables. In a certain sense, there is a converse result: for every stationary sequence ~considered on (Q, fF, P) we can construct a new probability space (0, #, P), a random variable ~ 1 (w) and a measure-preserving transformation f, such that the distribution of~= {~ 1 (&), ~ 1 (fw), ... } coincides with the distribution of~. In fact, take 0 to be the coordinate space Roo and put :fft = BI(R 00 ), P = P~, where P~(B) = P{w: ~ E B}, BE BH(R 00 ). The action of f on Q is given by ~ 1 (w) a random variable. Put ~k(w) = ~ 1 (Tn- 1 w),

If w = (xt. x 2 ,

••• ),

put

~1(&)

Now let A = {w: (x 1,

=

x1,

... ,

t-

1

Uw) = ~ 1 (fn- 1 w),

n ~ 2.

xk) E B}, BE ~(Rk), and

A = {w:(x 2 ,

...

,xk+ 1 )EB}.

Then the property of being stationary means that P(A) = P{w:(~ 1 ,

...

,~dEB} = P{w:(~ 2 ,

...

,~k+ 1 )EB} =

P(f- 1A),

i.e. Tis a measure-preserving transformation. Since P{w: (~ 1 , ••• , ~k) E B} = P {w: (~ 1 , ... , ~k)EB} for every k, it follows that~ and~ have the same distribution. Here are some examples of measure-preserving transformations.

378

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

EXAMPLE 1.

Let Q = {w 1 , ... , ron} consist ofn points (a finite number), n 2:: 2, letffbethecollectionofitssubsets,andlet Tw; = W;+ 1, 1::;; i::;; n- 1,and Tw" = w 1 • If P (w;) = 1/n, the transformation T is measure-preserving.

2.1f n = [0, 1), §' = ~([0, 1)), Pis Lebesgue measure, A. E [0, 1), then Tx = (x + A.) mod 1 and T = 2x mod 1 are both measure-preserving transformations. ExAMPLE

2. Let us pause to consider the physical hypotheses that led to the consideration of measure-preserving transformations. Let us suppose that Q is the phase space of a system that evolves (in discrete time) according to a given law of motion. If w is the state at instant n = 1, then Tnw, where Tis the translation operator induced by the given law of motion, is the state attained by the system after n steps. Moreover, if A is some set of states w then T- 1A = { w: Tw E A} is, by definition, the set of states w that lead to A in one step. Therefore if we interpret Q as an incompressible fluid, the condition P(T- 1 A) = P(A) can be thought of as the rather natural condition of conservation of volume. (For the classical conservative Hamiltonian systems, Liouville's theorem asserts that the corresponding transformation T preserves Lebesgue measure.)

3. One of the earliest results on measure-preserving transformations was Poincare's recurrence theorem (1912). Theorem 1. Let (Q, ff, P) be a probability space, let T be a measure-preserving transformation, and let A E F. Then, for almost every point wE A, we have Tnw E A for infinitely many n 2:: 1.

PROOF. Let C = {wEA: Tnw~A, for all n 2:: 1}. Since C n T-nc = 0 for all n 2:: 1, we have T-mc n T-<m+nlc = T-m(C n T-nC) = 0. Therefore the sequence {T-nC} consists of disjoint sets of equal measure. Therefore L:':o P(C) = L:':o P(T-nC)::;; P(O) = 1 and consequently P(C) = 0. Therefore, for almost every point wE A, for at least one n 2:: 1, we have Tnw EA. It follows that Tnw E A for infinitely many n. Let us apply the preceding result to T\ k 2:: 1. Then for every wE A\N, where N is a set of probability zero, the union of the corresponding sets corresponding to the various values of k, there is an nk such that (Tkrw EA. It is then clear that rnw E A for infinitely many n. This completes the proof of the theorem.

Corollary. Let ~(w) 2:: 0. Then

L ~(Tkw) = 00

k=O

on the set {w: ~(w)

> 0}.

oo

(P-a.s.)

379

§2. Ergodicity and Mixing

In fact, let An= {w: ~(w) ~ 1/n}. Then, according to the theorem, = oo (P-a.s.) on An, and the required result follows by letting n--+ oo.

Lk'=o ~(Tkw)

Remark. The theorem remains valid if we replace the probability measure P by any finite measure Jl with Jl(Q) < oo. 4. PROBLEMS 1. Let Tbeameasure-preservingtransformationand~ = ~(w)arandom variable whose expectation E~(w) exists. Show that E~(w) = E~(Tw). 2. Show that the transformations in Examples 1 and 2 are measure-preserving. 3. Let n = [0, 1), F = af([O, 1)) and let P be a measure whose distribution function is continuous. Show that the transformations Tx = ..l.x, 0 < A. < 1, and Tx = x 2 are not measure-preserving.

§2. Ergodicity and Mixing 1. In the present section T denotes a measure-preserving transformation on the probability space (Q, fl', P).

Definition 1. A set A E fJ' is invariant if T- 1A = A. A set A E fJ' is almost invariant if A and r- 1A differ only by a set of measure zero, i.e. P(A 6. r- 1 A) =Q . It is easily verified that the classes J and J* of invariant or almost invariant sets, respectively, are u-algebras.

Definition 2. A measure-preserving transformation Tis ergodic (or metrically transitive) if every invariant set A has measure either zero or one. Definition 3. A random variable~ = ~(w) is invariant (or almost invariant) if ~(w) = ~(Tw) for all wEn (or for almost all wE Q). The following lemma establishes a connection between invariant and almost invariant sets.

Lemma 1. If A is almost invariant, there is an invariant set B such that P(A!::. B) = 0. PRooF. Let B = Ilm r-nA. Then T- 1B = Ilm r-A = B, i.e. B EJ.It is easily seen that A!::. B s;;; Uk'=o (T-kA!::. r-A). But P(T-kA!::. r-A)

Hence P(A !'::!,. B) = 0.

= P(A 6. T- 1A) = 0.

380

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

Lemma 2. A transformation T is ergodic if and only if every almost invariant set has measure zero or one.

PROOF. Let A ∈ 𝒥*; then according to Lemma 1 there is an invariant set B such that P(A △ B) = 0. But T is ergodic and therefore P(B) = 0 or 1, whence P(A) = 0 or 1. The converse is evident, since 𝒥 ⊆ 𝒥*. This completes the proof of the lemma.

Theorem 1. Let T be a measure-preserving transformation. Then the following conditions are equivalent:
(1) T is ergodic;
(2) every almost invariant random variable is (P-a.s.) constant;
(3) every invariant random variable is (P-a.s.) constant.

PROOF. (1) ⇒ (2). Let T be ergodic and ξ almost invariant, i.e. (P-a.s.) ξ(ω) = ξ(Tω). Then for every c ∈ R we have A_c = {ω: ξ(ω) ≤ c} ∈ 𝒥*, and then P(A_c) = 0 or 1 by Lemma 2. Let C = sup{c: P(A_c) = 0}. Since A_c ↑ Ω as c ↑ ∞ and A_c ↓ ∅ as c ↓ −∞, we have |C| < ∞. Then

    P{ω: ξ(ω) < C} = P{ ∪_{n≥1} {ξ(ω) ≤ C − 1/n} } = 0,

and similarly P{ω: ξ(ω) > C} = 0. Consequently P{ω: ξ(ω) = C} = 1.

(2) ⇒ (3). Evident.

(3) ⇒ (1). Let A ∈ 𝒥; then I_A is an invariant random variable and therefore, (P-a.s.), I_A = 0 or I_A = 1, whence P(A) = 0 or 1.

Remark. The conclusion of the theorem remains valid when "random variable" is replaced by "bounded random variable".

We illustrate the theorem with the following example.

EXAMPLE. Let Ω = [0, 1), ℱ = ℬ([0, 1)), let P be Lebesgue measure, and let Tω = (ω + λ) mod 1. Let us show that T is ergodic if and only if λ is irrational.

Let ξ = ξ(ω) be an invariant random variable with Eξ²(ω) < ∞. Then we know that the Fourier series Σ_{n=−∞}^∞ c_n e^{2πinω} of ξ(ω) converges in the mean-square sense, Σ|c_n|² < ∞, and, because T is a measure-preserving transformation (Example 2, §1) and ξ(ω) = ξ(Tω), we have (Problem 1, §1)

    c_n = E ξ(ω)e^{−2πinω} = E ξ(Tω)e^{−2πinTω} = e^{−2πinλ} E ξ(Tω)e^{−2πinω} = e^{−2πinλ} E ξ(ω)e^{−2πinω} = c_n e^{−2πinλ}.

So c_n(1 − e^{−2πinλ}) = 0. By hypothesis, λ is irrational, and therefore e^{−2πinλ} ≠ 1 for all n ≠ 0. Therefore c_n = 0, n ≠ 0, so ξ(ω) = c₀ (P-a.s.), and T is ergodic by Theorem 1.

On the other hand, let λ be rational, i.e. λ = k/m, where k and m are integers. Consider the set

    A = ∪ {ω: j/(2m) ≤ ω < (j + 1)/(2m)},  j = 0, 2, ..., 2m − 2.

It is clear that this set is invariant, but P(A) = 1/2. Consequently T is not ergodic.
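The dichotomy in this example can be seen numerically. Anticipating the ergodic theorem of §3, the following sketch (ours, not part of the text) computes time averages of f(ω) = I_{[0, 1/3)}(ω) along an orbit of the rotation: for irrational λ they approach the space average Ef = 1/3, while for the rational shift λ = 1/4 the orbit visits only four points and the average stays at 1/4.

```python
import numpy as np

# Time averages along an orbit of T(w) = (w + lam) mod 1, with
# f(w) = 1_{[0, 1/3)}(w): for irrational lam the average approaches
# E f = 1/3; for lam = 1/4 the orbit is periodic and the average is not 1/3.

def time_average(w0, lam, n):
    """Average of f(T^k w0), k = 0..n-1, with f the indicator of [0, 1/3)."""
    w, s = w0, 0
    for _ in range(n):
        s += (w < 1 / 3)
        w = (w + lam) % 1.0
    return s / n

n = 100_000
print("irrational lam:", time_average(0.1, np.sqrt(2) - 1, n))   # close to 1/3
print("rational lam:  ", time_average(0.1, 0.25, n))             # stays at 0.25
```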

2.

Definition 4. A measure-preserving transformation T is mixing (or has the mixing property) if, for all A and B ∈ ℱ,

    lim_{n→∞} P(A ∩ T^{-n}B) = P(A)P(B).    (1)

The following theorem establishes a connection between ergodicity and mixing.

Theorem 2. Every mixing transformation T is ergodic.

PROOF. Let A ∈ ℱ and B ∈ 𝒥. Then B = T^{-n}B, n ≥ 1, and therefore

    P(A ∩ T^{-n}B) = P(A ∩ B)

for all n ≥ 1. Because of (1), P(A ∩ B) = P(A)P(B). Taking A = B, we find that P(B) = P²(B), and consequently P(B) = 0 or 1. This completes the proof.

3. PROBLEMS

1. Show that a random variable ξ is invariant if and only if it is 𝒥-measurable.

2. Show that a set A is almost invariant if and only if either P(T^{-1}A \ A) = 0 or P(A \ T^{-1}A) = 0.

3. Show that the transformation considered in the example of Subsection 1 of the present section is not mixing.

4. Show that a transformation T is mixing if and only if, for all random variables ξ and η with Eξ² < ∞ and Eη² < ∞,

    E ξ(T^n ω)η(ω) → Eξ · Eη,  n → ∞.

§3. Ergodic Theorems

1.

Theorem 1 (Birkhoff and Khinchin). Let T be a measure-preserving transformation and ξ = ξ(ω) a random variable with E|ξ| < ∞. Then (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) = E(ξ | 𝒥).    (1)

If also T is ergodic, then (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) = Eξ.    (2)

The proof given below is based on the following proposition, whose simple proof was given by A. Garsia (1965).

Lemma (Maximal Ergodic Theorem). Let T be a measure-preserving transformation, let ξ be a random variable with E|ξ| < ∞, and let

    S_k(ω) = ξ(ω) + ξ(Tω) + ··· + ξ(T^{k−1}ω),
    M_k(ω) = max{0, S₁(ω), ..., S_k(ω)}.

Then, for every n ≥ 1,

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ 0.

PROOF. If n ≥ k, we have M_n(Tω) ≥ S_k(Tω) and therefore ξ(ω) + M_n(Tω) ≥ ξ(ω) + S_k(Tω) = S_{k+1}(ω). Since it is evident that ξ(ω) ≥ S₁(ω) − M_n(Tω), we have

    ξ(ω) ≥ max(S₁(ω), ..., S_n(ω)) − M_n(Tω).

Therefore

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ E[(max(S₁(ω), ..., S_n(ω)) − M_n(Tω)) I_{{M_n > 0}}(ω)].

But max(S₁, ..., S_n) = M_n on the set {M_n > 0}. Consequently,

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ E[(M_n(ω) − M_n(Tω)) I_{{M_n > 0}}(ω)] ≥ E[M_n(ω) − M_n(Tω)] = 0,

since if T is a measure-preserving transformation we have EM_n(ω) = EM_n(Tω) (Problem 1, §1). This completes the proof of the lemma.

PROOF OF THE THEOREM. Let us suppose that E(ξ | 𝒥) = 0 (otherwise replace ξ by ξ − E(ξ | 𝒥)). Let η̄ = lim sup (S_n/n) and η̲ = lim inf (S_n/n). It will be enough to establish that (P-a.s.)

    0 ≤ η̲ ≤ η̄ ≤ 0.

Consider the random variable η̄ = η̄(ω). Since η̄(ω) = η̄(Tω), the variable η̄ is invariant, and consequently, for every ε > 0, the set A_ε = {ω: η̄(ω) > ε} is also invariant. Let us introduce the new random variable

    ξ*(ω) = (ξ(ω) − ε) I_{A_ε}(ω),

and put

    S_k*(ω) = ξ*(ω) + ··· + ξ*(T^{k−1}ω),    M_k*(ω) = max(0, S₁*, ..., S_k*).

Then, by the lemma,

    E[ξ* I_{{M_n* > 0}}] ≥ 0

for every n ≥ 1. But as n → ∞,

    {M_n* > 0} = { max_{1≤k≤n} S_k* > 0 } ↑ { sup_{k≥1} S_k* > 0 } = { sup_{k≥1} S_k*/k > 0 } = { sup_{k≥1} S_k/k > ε } ∩ A_ε = A_ε,

where the last equation follows because sup_{k≥1} (S_k/k) ≥ η̄ and A_ε = {ω: η̄ > ε}. Moreover, E|ξ*| ≤ E|ξ| + ε. Hence, by the dominated convergence theorem,

    E[ξ* I_{{M_n* > 0}}] → E[ξ* I_{A_ε}],  n → ∞.

Thus

    0 ≤ E[ξ* I_{A_ε}] = E[(ξ − ε) I_{A_ε}] = E[ξ I_{A_ε}] − εP(A_ε) = E[E(ξ | 𝒥) I_{A_ε}] − εP(A_ε) = −εP(A_ε),

so that P(A_ε) = 0 and therefore P(η̄ ≤ 0) = 1. Similarly, if we consider −ξ(ω) instead of ξ(ω), we find that

    lim sup (−S_n/n) = −lim inf (S_n/n) = −η̲,

1 n-1 lim- L P(A n r-kB) = P(A)P(B). n

n k=O

if and only if, for (3)

To prove the ergodicity of Twe use A = BE J in (3). Then A n T- kB = B and therefore P(B) = P 2 (B), i.e. P(B) = 0 or 1. Conversely, let T be ergodic. Then if we apply (2) to the random variable~ = Ja(w), where BE f7, we find that (P-a.s.) 1 n-1 lim- L IT-kiw) = P(B). n

n k=O

384

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

If we now integrate both sides over A theorem, we obtain (3) as required.

E

fF and use the dominated convergence

2. We now show that, under the hypotheses of Theorem 1, there is not only almost sure convergence in (1) and (2), but also convergence in mean. (This result will be used below in the proof of Theorem 3.)

Theoreml. LetT be a measure-preserving transformation and let random variable with EI I < 00. Then

e

E

~~ :t>(Tkw)- E(el~ I--. 0,

e=

e(w) be a

n--. oo.

(4)

If also T is ergodic, then n--. oo.

(5)

PRooF. For every e > 0 there is a bounded random variable such that E1e- '71:::;; e. Then

'7( I'7(w) I :::;; M)

(6)

Since I'71 :::;; M, then by the dominated convergence theorem and by using (1) we find that the second term on the right of (6) tends to zero as n--. oo. The first and third terms are each at most e. Hence for sufficiently large n the lefthand side of(6) is less than 2e, so that (4) is proved. Finally, if Tis ergodic, then (5) follows from (4) and the remark that E(eii) = Ee (P-a.s.). This completes the proof of the theorem. 3. We now turn to the question of the validity of the ergodic theorem for )defined on a probstationary (strict sense) random sequences = <el> ability space (Q, !F, P). In general, (Q, !F, P) need not carry any measurepreserving transformations, so that it is not possible to apply Theorem 1 directly. However, as we observed in §1, we can construct a coordinate probability space (fi, §;, P), random variables ~ = (~ 1 , ~ 2 , .•• ), and a measurepreserving transformation T such that ~n(w) = ~ 1 (T"- 1 w) and the distributions of and ~ are the same. Since such properties as almost sure convergence and convergence in the mean are defined only for proba1w) (P-a.s. bility distributions, from the convergence of (1/n) 1 ~1 and in mean) to a random variable ij it follows that (1/n) also 1 converges (P-a.s. and in mean) to a random variable '1 such that '1 :4: ij. It

e

ez, .. .

e

Li=

(Tk-

:D= ek<w)

385

§3. Ergodic Theorems

follows from Theorem 1 that if EI~ 1 I < oo then = E( ~ 1 If), where .J is a collection of invariant sets (E is the average with respect to the measure P). We now describe the structure of '1·

Definition 1. A set A E §i' is invariant with respect to the sequence~ if there is a set BE 86(R oo) such that for n ~ 1 A = {w:

(~., ~n+ 1, ...) E

B}.

The collection of all such invariant sets is a a-algebra, denoted by

J~.

Definition 2. A stationary sequence ~ is ergodic if the measure of every invariant set is either 0 or 1. Let us now show that the random variable 1J can be taken equal toE(~ 1 IJ ~). In fact, let A E f ~. Then since

L

1 n-1

E1~k n k~ 1 we have

f ~k ~ f

!n I

k~ 1

A

{w;

dP

A

A

(7)

'1 dP.

= {w: (~k• ~k+ 1, ... ) E B}

Let B Eei(R 00 ) be such that A since ~ is stationary,

f ~k dP = Jl

I

'1 ~ 0,

-

(~k. ~k+ 1 ....)eB}

~k dP =

Hence it follows from (7) that for all A

that '1 = E(¢ 1 1...¢';). Here E(¢ 1 1...¢';) =

E

f{ro;(~t. ~2 J

~,

... ·)eB}

for all k ~ 1. Then

~1 dP =

f ~1 dP. A

which implies (see §7, Chapter II)

E~ 1 if~

is ergodic.

Therefore we have proved the following theorem.

Theorem 3 (Ergodic Theorem). Let ξ = (ξ₁, ξ₂, ...) be a stationary (strict sense) random sequence with E|ξ₁| < ∞. Then (P-a.s., and in the mean)

    lim_n (1/n) Σ_{k=1}^n ξ_k(ω) = E(ξ₁ | 𝒥_ξ).

If ξ is also an ergodic sequence, then (P-a.s., and in the mean)

    lim_n (1/n) Σ_{k=1}^n ξ_k(ω) = Eξ₁.
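Theorem 3 can be illustrated numerically. The sketch below (ours, not part of the text) builds a simple stationary ergodic sequence that is not i.i.d., namely the one-dependent sequence ξ_k = ε_k + ½ε_{k+1} with (ε_k) i.i.d. uniform on [0, 1); its time averages should approach Eξ₁ = 3/2 · Eε₁ = 0.75.

```python
import numpy as np

# Illustration of Theorem 3: for the stationary ergodic (one-dependent)
# sequence xi_k = eps_k + 0.5 * eps_{k+1}, with eps_k i.i.d. uniform on
# [0, 1), the time averages (1/n) sum xi_k approach E xi_1 = 0.75.

rng = np.random.default_rng(0)
n = 200_000
eps = rng.random(n + 1)
xi = eps[:-1] + 0.5 * eps[1:]          # a stationary, non-i.i.d. sequence
avg = xi.mean()
print("time average:", avg, " expected:", 0.75)
```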

4. PROBLEMS

1. Let ξ = (ξ₁, ξ₂, ...) be a Gaussian stationary sequence with Eξ_n = 0 and covariance function R(n) = Eξ_{k+n}ξ_k. Show that R(n) → 0 is a sufficient condition for ξ to be ergodic.

2. Show that every sequence ξ = (ξ₁, ξ₂, ...) of independent identically distributed random variables is ergodic.

3. Show that a stationary sequence ξ is ergodic if and only if, for every B ∈ ℬ(R),

    (1/n) Σ_{l=1}^n I_B(ξ_l) → P{ξ_k ∈ B}  (P-a.s.),  k = 1, 2, ....

CHAPTER VI

Stationary (Wide Sense) Random Sequences. L²-Theory

§1. Spectral Representation of the Covariance Function

1. According to the definition given in the preceding chapter, a random sequence ξ = (ξ₁, ξ₂, ...) is stationary in the strict sense if, for every set B ∈ ℬ(R^∞) and every n ≥ 1,

    P{(ξ₁, ξ₂, ...) ∈ B} = P{(ξ_{n+1}, ξ_{n+2}, ...) ∈ B}.    (1)

It follows, in particular, that if Eξ₁² < ∞ then Eξ_n is independent of n:

    Eξ_n = Eξ₁,    (2)

and the covariance cov(ξ_{n+m}, ξ_n) = E(ξ_{n+m} − Eξ_{n+m})(ξ_n − Eξ_n) depends only on m:

    cov(ξ_{n+m}, ξ_n) = cov(ξ_{1+m}, ξ₁).    (3)

In the present chapter we study sequences that are stationary in the wide sense (and have finite second moments), namely those for which (1) is replaced by the (weaker) conditions (2) and (3). The random variables ξ_n are understood to be defined for n ∈ ℤ = {0, ±1, ...} and to be complex-valued. The latter assumption not only does not complicate the theory, but makes it more elegant. It is also clear that results for real random variables can easily be obtained as special cases of the corresponding results for complex random variables.

Let H² = H²(Ω, ℱ, P) be the space of (complex) random variables ξ = α + iβ, α, β ∈ R, with E|ξ|² < ∞, where |ξ|² = α² + β². If ξ and η ∈ H², we put

    (ξ, η) = E ξη̄,    (4)

where η̄ = α − iβ is the complex conjugate of η = α + iβ, and

    ‖ξ‖ = (ξ, ξ)^{1/2}.    (5)

As for real random variables, the space H² (more precisely, the space of equivalence classes of random variables; compare §§10 and 11 of Chapter II) is complete under the scalar product (ξ, η) and norm ‖ξ‖. In accordance with the terminology of functional analysis, H² is called the complex (or unitary) Hilbert space (of random variables considered on the probability space (Ω, ℱ, P)). If ξ, η ∈ H², their covariance is

    cov(ξ, η) = E(ξ − Eξ)(η̄ − Eη̄).    (6)

It follows from (4) and (6) that if Eξ = Eη = 0 then

    cov(ξ, η) = (ξ, η).    (7)

Definition. A sequence of complex random variables ξ = (ξ_n)_{n∈ℤ} with E|ξ_n|² < ∞, n ∈ ℤ, is stationary (in the wide sense) if, for all n ∈ ℤ,

    Eξ_n = Eξ₀,    cov(ξ_{k+n}, ξ_k) = cov(ξ_n, ξ₀),  k ∈ ℤ.    (8)

As a matter of convenience, we shall always suppose that Eξ₀ = 0. This involves no loss of generality, but does make it possible (by (7)) to identify the covariance with the scalar product and hence to apply the methods and results of the theory of Hilbert spaces.

Let us write

    R(n) = cov(ξ_n, ξ₀),  n ∈ ℤ,    (9)

and (assuming R(0) = E|ξ₀|² ≠ 0)

    ρ(n) = R(n)/R(0),  n ∈ ℤ.    (10)

We call R(n) the covariance function, and ρ(n) the correlation function, of the sequence ξ (assumed stationary in the wide sense). It follows immediately from (9) that R(n) is nonnegative-definite, i.e. for all complex numbers a₁, ..., a_m and t₁, ..., t_m ∈ ℤ, m ≥ 1, we have

    Σ_{i,j=1}^m a_i ā_j R(t_i − t_j) ≥ 0.    (11)

It is then easy to deduce (either from (11) or directly from (9)) the following properties of the covariance function (see Problem 1):

    R(0) ≥ 0,    R(−n) = R̄(n),    |R(n)| ≤ R(0),
    |R(n) − R(m)|² ≤ 2R(0)[R(0) − Re R(n − m)].    (12)
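Properties (11) and (12) can be checked numerically for a concrete covariance function. The sketch below (ours; the choice R(n) = a^{|n|} anticipates Example 5, where this form arises for a first-order autoregression up to a constant factor) verifies that the matrix (R(t_i − t_j)) is nonnegative-definite and that the inequality in (12) holds.

```python
import numpy as np

# Numerical check (a sketch) of (11) and (12) for R(n) = a^{|n|}.

a = 0.6
R = lambda n: a ** abs(n)

# (11): the matrix (R(t_i - t_j)) must be nonnegative-definite.
t = np.arange(20)
M = np.array([[R(i - j) for j in t] for i in t])
print("smallest eigenvalue:", np.linalg.eigvalsh(M).min())   # nonnegative

# (12): |R(n) - R(m)|^2 <= 2 R(0) [R(0) - Re R(n - m)]
for n in range(-5, 6):
    for m in range(-5, 6):
        assert abs(R(n) - R(m)) ** 2 <= 2 * R(0) * (R(0) - R(n - m).real) + 1e-12
print("inequality (12) verified for n, m in -5..5")
```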


2. Let us give some examples of stationary sequences ξ = (ξ_n)_{n∈ℤ}. (From now on, the words "in the wide sense" and the statement n ∈ ℤ will both be omitted.)

EXAMPLE 1. Let ξ_n = ξ₀ · g(n), where Eξ₀ = 0, E|ξ₀|² = 1, and g = g(n) is a function. The sequence ξ = (ξ_n) will be stationary if and only if g(k + n)ḡ(k) depends only on n. Hence it is easy to see that there is a λ such that

    g(n) = g(0)e^{iλn}.

Consequently the sequence of random variables ξ_n = ξ₀ · g(0)e^{iλn} is stationary with

    R(n) = |g(0)|² e^{iλn}.

In particular, the random "constant" ξ_n ≡ ξ₀ is a stationary sequence.

where z 1, •.• , zN are orthogonal (Ez;zi = 0, i =F j) random variables with zero means and E lzkl 2 =a~ > 0; -n ~ Ak < n, k = 1, ... , N; A; =F Ai, i =F j. The sequence~ = (~n) is stationary with N

L a~eiJ.•n.

R(n) =

(14)

k=l

As a generalization of (13) we now suppose that ~n

L 00

=

k=-

zkei).•n,

(15)

00

where zk, k E Z, have the same properties as in (13). If we suppose that oo a~ < oo, the series on the right of (15) converges in mean-square and

Lr'= _

R(n)

L 00

=

k=-

a~e;;.•n.

(16)

00

Let us introduce the function

L

F(A) =

af.

(17)

{k: Ak:5 ).)

Then the covariance function (16) can be written as a Lebesgue-Stieltjes integral,

R(n)

=

r"

e;;.n dF(A.).

(18)

390

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

The stationary sequence (15) is represented as a sum of "harmonics" ei).kn with "frequencies" Ak and random "amplitudes" zk of "intensities" (Jl = E jzkj 2. Consequently the values of F(A) provide complete information on the "spectrum" of the sequence ~' i.e. on the intensity with which each frequency appears in (15). By (18), the values of F(A) also completely determine the structure of the covariance function R(n). Up to a constant multiple, a (nondegenerate) F(A) is evidently a distribution function, which in the examples considered so far has been piecewise constant. It is quite remarkable that the covariance function of every stationary (wide sense) random sequence can be represented (see the theorem in Subsection 3) in the form (18), where F(A) is a distribution function (up to normalization), whose support is concentrated on [- n, n), i.e. F(A) = 0 for A < -nand F(A) = F(n) for A > n. The result on the integral representation of the covariance function, if compared with (15) and (16), suggests that every stationary sequence also admits an "integral" representation. This is in fact the case, as will be shown in §3 by using what we shall learn to call stochastic integrals with respect to orthogonal stochastic measures (§2). EXAMPLE 3 (White noise). Let e = (en) be an orthonormal sequence of random variables, Een = 0, Eeiei = bii' where bii is the Kronecker delta. Such a sequence is evidently stationary, and R(n) =

{10,

0,

n= n # 0.

Observe that R(n) can be represented in the form R(n) =

f/""

(19)

dF(A),

where

F(A)

=

f/(v) dv;

1

j(A) = 2n'

-n::;; A< n.

(20)

Comparison of the spectral functions (17) and (20) shows that whereas the spectrum in Example 2 is discrete, in the present example it is absolutely continuous with constant "spectral density" j(A) tn. In this sense we can say that the sequence e = (en) "consists of harmonics of equal intensities." It is just this property that has led to calling such a sequence e = (en) "white noise" by analogy with white light, which consists of different frequencies with the same intensities.

=

EXAMPLE 4 (Moving averages). Starting from the white noise ε = (ε_n) introduced in Example 3, let us form the new sequence

    ξ_n = Σ_{k=−∞}^∞ a_k ε_{n−k},    (21)


where the a_k are complex numbers such that Σ_{k=−∞}^∞ |a_k|² < ∞. By Parseval's equation,

    cov(ξ_{n+m}, ξ_m) = cov(ξ_n, ξ₀) = Σ_{k=−∞}^∞ a_{n+k} ā_k,

so that ξ = (ξ_k) is a stationary sequence, which we call the sequence obtained from ε = (ε_k) by a (two-sided) moving average. In the special case when the a_k with negative index are zero, i.e.

    ξ_n = Σ_{k=0}^∞ a_k ε_{n−k},

the sequence ξ = (ξ_n) is a one-sided moving average. If, in addition, a_k = 0 for k > p, i.e. if

    ξ_n = a₀ε_n + a₁ε_{n−1} + ··· + a_p ε_{n−p},    (22)

then ξ = (ξ_n) is a moving average of order p. We can show (Problem 5) that (22) has a covariance function of the form R(n) = ∫_{−π}^{π} e^{iλn} f(λ) dλ, where the spectral density is

    f(λ) = (1/2π) |P(e^{−iλ})|²,    (23)

with

    P(z) = a₀ + a₁z + ··· + a_p z^p.

+ bl~n-1 + ''' + bq~n-q =

Sn.

(24)

Under what conditions on b 1 , ••• , bn can we say that (24) has a stationary solution? To find an answer, let us begin with the case q = 1: (25) where a.= -b 1 . If la.l < 1, it is easy to verify that the stationary sequence ~ = (~n) with

L a.ien00

~n =

j

(26)

j=O

is a solution of (25). (The series on the right of (26) converges in mean-square.) Let us now show that, in the class of stationary sequences ~ = (~") (with finite second moments) this is the only solution. In fact, we find from (25), by successive iteration, that

392

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Hence it follows that

k-+

00.

Therefore when Ia I < 1 a stationary solution of (25) exists and is representable as the one-sided moving average (26). There is a similar result for every q > 1 : if all the zeros of the polynomial (27) lie outside the unit disk, then the autoregression equation (24) has a unique stationary solution, which is representable as a one-sided moving average (Problem 2). Here the covariance function R(n) can be represented (Problem 5) in the form R(n) =

f/An dF().),

F().) =

f/(v)

dv,

(28)

where

(29) In the special case q = 1, we find easily from (25) that Eeo = 0, 2

Eeo

= t -

1

lo::l2'

and R(n)

a"

= 1-lo::l2'

n~O

(when n < 0 we have R(n) = R(- n)), Here

1 1 J().) = 2n '11 - ae-;;.1 2 ' EXAMPLE 6.

This example illustrates how autoregression arises in the construction of probabilistic models in hydrology. Consider a body of water; we try to construct a probabilistic model of the deviations of the level ofthe water from its average value because of variations in the inflow and evaporation from the surface. If we take a year as the unit of time and let H n denote the water level in yearn, we obtain the following balance equation:

Hn+ 1 = Hn - KS(H")

+ ~n+ 1•

(30)

where ~n+ 1 is the inflow in year (n + 1), S(H) is the area of the surface of the water at level H, and K is the coefficient of evaporation.

393

§I. Spectral Representation of the Covariance Function

Let ~n = Hn - H be the deviation from the mean level (which is obtained from observations over many years) and suppose that S(H) = S(H) + c(H - H). Then it follows from the balance equation that ~n satisfies (31)

with a = 1 - cK, en = :En - KS(H). It is natural to assume that the random variables ~>n have zero means and are identically distributed. Then, as we showed in Example 5, equation (31) has (for Ia I < 1) a unique stationary solution, which we think of as the steady-state solution (with respect to time in years) of the oscillations of the level in the body of water. As an example of practical conclusions that can be drawn from a (theoretical) model (31 ), we call attention to the possibility of predicting the level for the following year from the results of the observations of the present and preceding years. It turns out (see also Example 2 in §6) that (in the meansquare sense) the optimal linear estimator of ~n+ 1 in terms of the values of •.• , ~n-l• ~n is simply O!~n· 7 (Autoregression and moving average (mixed model)). If we suppose that the right-hand side of (24) contains a0 Bn + a 1Bn- 1 + · · · + apsn- P instead of en, we obtain a mixed model with autoregression and moving average of order (p, q):

ExAMPLE

~n

+ b1~n-1 + · ·· + bq~n-q = aoBn + a1Bn-1 + ··· + apBn-p·

(32)

Under the same hypotheses as in Example 5 on the zeros it will be shown later (Corollary 2 to Theorem 3 of§3) that (32) has the stationary solution~ = (~") e;;.n dF(A.) with F(A.) = for which the covariance function is R(n) = J~,f(v) dv, where

J':,

3. Theorem (Herglotz). Let R(n) be the covariance function of a stationary

(wide sense) random sequence with zero mean. Then there is, on

([ -n, n), g6([ -n, n))), a finite measure F = F(B),

BE g6([ -n,

R(n) = PRooF. For N

~

n)), such that for every n E 7L

f/;." F(dA.).

(33)

1 and A. E [ -n, n], put

1

fN(A.) = 2 N 1t

N

N

I I

k=1 1=1

R(k - T)e- ik;.ew'.

(34)

394

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Since R(n) is nonnegative definite, fN(A.) is nonnegative. Since there are N - Im I pairs (k, I) for which k - l = m, we have

L

fN(A.) = __!_ 2n lmi
(1 - ~)R(m)e-im'-. N

(35)

Let BE~([ -n,

n)).

Then

lnl < lnl

N,} (36)

~N.

The measures F N• N ~ 1, are supported on the interval [ -n, n] and F N([ -n, n]) = R(O) < oo for all N ~ 1. Consequently the family of measures {FN}, N ~ 1, is tight, and by Prohorov's theorem {Theorem 1 of §2 of Chapter III) there are a sequence {Nk}

£;

{N} and a measure F such that

F Nk ~ F. {The concepts of tightness, relative compactness, and weak

convergence, together with Prohorov's theorem, can be extended in an obvious way from probability measures to any finite measures.) It then follows from (36) that f/'"nF(dA.)

=

N~~oo f~/'-npNk(dA.) =

R(n).

The measure F so constructed is supported on [ -n, n]. Without changing the integral J~oo ei'-nF(dA.), we can redefine F by transferring the "mass" F( { n} ), which is concentrated at n, to -n. The resulting new measure (which we again denote by F) will be supported on [ -n, n). This completes the proof of the theorem.

Remark 1. The measure F = F(B) involved in (33) is known as the spectral measure, and F(A.) = F([ -n, A.]) as the spectral function, of the stationary sequence with covariance function R(n). In Example 2 above the spectral measure was discrete (concentrated at A.k, k = 0, ± 1, ...). In Examples 3-6 the spectral measures were absolutely continuous.

Remark 2. The spectral measure F is uniquely defined by the covariance function. In fact, let F 1 and F 2 be two spectral measures and let f/'-nFl(dA.)

= f/'"nFz(dA.),

n E Z.

§2. Orthogonal Stochastic Measures and Stochastic Integrals

395

Since every bounded continuous function g(A.) can be uniformly approximated on [ -n, n) by trigonometric polynomials, we have {,.g(A.)F 1(dA.) = {,.g(A.)FidA.). It follows (compare the proof in Theorem 2, §12, Chapter II) that F 1(B) =FiB) for all Be&l([ -n, n)).

Remark 3. If e=(en) is a stationary sequence of real random variables en, then R(n) = {,.cos A.n F(dA.).

4.

PROBLEMS

1. Derive (12) from (11). 2. Show that the autoregression equation (24) has a stationary solution if all the zeros ofthe polynomial Q(z) defined by (27) lie outside the unit disk.

3. Prove that the covariance function (28) admits the representation (29) with spectral density given by (30). 4. Show that the sequence

e= (e.) of random variables, where e. = I <Xl

k=l

and cxk and f3k are real random variables, can be represented in the form

e. = I

<Xl

zkeu••

k=-oo

with zk = !(f3k- icxk) fork~ 0 and zk = z_k, A.k = -A.-k fork < 0. 5. Show that the spectral functions of the sequences (22) and (24) have densities given respectively by (23) and (29). 6. Show that if I IR(n) I < oo, the spectral function F(A.) has density f(A.) given by 1

f(A.} = - I e-i.
§2. Orthogonal Stochastic Measures and Stochastic Integrals 1. As we observed in §1, the integral representation of the covariance function and the example of a stationary sequence

en=

I

00

k=-oo

zke;;. ••

(1)

396

VI. Stationary (Wide Sense) Random Sequences. £ 2 -Theory

with pairwise orthogonal random variables zk, k e 7L, suggest the possibility of representing an arbitrary stationary sequence as a corresponding integral generalization of (1). If we put (2) Z(A.) = zk,

L

{k:.l.kS.l.}

we can rewrite (1) in the form co

e. = L

k= -co

=

eilk".1Z(A.k),

(3)

where .1Z(A.k) Z(A.k) - Z(A.k -) = zk. The right-hand side of (3) reminds us of an approximating sum for an integral j'':., ei-.n dZ(A.) of Riemann-Stieltjes type. However, in the present case Z(A.) is a random function (it also depends on w). Hence it is clear that for an integral representation of a general stationary sequence we need to use functions Z(A.) that do not have bounded variation for each ro. Consequently the simple interpretation of eiAn dZ(A.) as a Riemann-Stieltjes integral for each w is inapplicable.

J':.,

2. By analogy with the general ideas of the Lebesgue, Lebesgue-Stieltjes and Riemann-Stieltjes integrals (§6, Chapter II), we begin by defining stochastic measure. Let (Q, JF, P) be a probability space, and let E be a subset, with an algebra S 0 of subsets and the u-algebra S generated by S 0 •

Definition 1. A complex-valued function Z(.1) = Z(w; .1), defined for wen and .1 e S 0 , is a finitely additive stochastic measure if (1) E I Z(.1) 12 < oo for every .1 E S 0 ; (2) for every pair A 1 and .12 of disjoint sets in tff 0 ,

Z(.11

+ .12) =

Z(.1 1)

+ Z(.12)

(P-a.s.)

(4)

Definition 2. A finitely additive stochastic measure Z(.1) is an elementary stochastic measure if, for all disjoint sets .1 1, .1 2, ... of tff 0 such that .1 = Lk'= 1 .1k E tff 0• n--+ oo.

(5)

Remark 1. In this definition of an elementary stochastic measure on subsets of S 0 , it is assumed that its values are in the Hilbert space H 2 = H 2 (il, JF, P), and that countable additivity is understood in the mean-square sense (5). There are other definitions of stochastic measures, without the requirement of the existence of second moments, where countable additivity is defined (for example) in terms of convergence in probability or with probability one.

397

§2. Orthogonal Stochastic Measures and Stochastic Integrals

Remark 2. In analogy with nonstochastic measures, one can show that for finitely additive stochastic measures the condition (5) of countable additivity (in the mean-square sense) is equivalent to continuity (in the mean-square sense) at "zero": (6)

A particularly important class of elementary stochastic measures consists of those that are orthogonal according to the following definition.

Definition 3. An elementary stochastic measure (or.a measure with orthogonal values) if

Z(~), ~ E

fff 0 , is orthogonal

(7)

for every pair of disjoint sets

~1

and

EZ(~ 1 )Z(~ 2 )

for all

~1

and

~2

~2

in fff 0 ; or, equivalently, if

= E IZ(~1

n ~2W

(8)

in fff 0 .

We write m(~)

= E IZ(~W,

(9)

For elementary orthogonal stochastic measures, the set function m = m(~), ~ E fff 0 , is, as is easily verified, a finite measure, and consequently by Caratheodory's theorem (§3, Chapter II) it can be extended to (E, f%). The resulting measure will again be denoted by m = m(~) and called the structure function (of the elementary orthogonal stochastic measure Z = Z(~). ~ E

fff 0 ).

The following question now arises naturally: since the set function m = m(~) defined on (E, fff 0 ) admits an extension to (E, Iff), where Iff = a(fff 0 ), cannot an elementary orthogonalstochastic measure Z = Z(~), ~ E fff 0 , be extended to sets ~in E in such a way that EI Z(~) 12 = m(~), ~ E fff? The answer is affirmative, as follows from the construction given below. This construction, at the same time, leads to the stochastic integral which we need for the integral representation of stationary sequences. 3. Let Z = Z(~) be an elementary orthogonal stochastic measure, with structure function m. = m(~), ~ E f%. For every function

~ E fff 0 ,

(10)

with only a finite number of different (complex) values, we define the random variable

398

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Let L 2 = L 2 (E, 8, m) be the Hilbert space of complex-valued functions with the scalar product

(f, g)

=

L

f(A.)g(A.)m(dA.)

and the norm 11!11 = (f,J) 112 , and let H 2 = H 2 (0., $', P) be the Hilbert space of complex-valued random variables with the scalar product

=

(~, t~)

E~ii

and the norm 11~11 = (~, ~) 1 ' 2 . Then it is clear that, for every pair of functions

(J(f), J(g))

f

and g of the form (10),

= (f, g)

and IIJ(f)ll 2 = 11!11 2 = 11f
n,m ..... oo.

Therefore the sequence {J(f,.)} is fundamental in the mean-square sense and by Theorem 7, §10, Chapter II, there is a random variable (denoted by J(f)) such that J(f) E H 2 and IIJ(J,.) - f(f)ll ..... 0, n ..... oo. The random variable f(f) constructed in this way is uniquely defined (up to stochastic equivalence) and is independent of the choice of the approximating sequence {J,.}. We call it the stochastic integral off E L 2 with respect to the elementary orthogonal stochastic measure Z and denote it by f( f) =

f

fCA.)Z(dA.).

We note the following basic properties of the stochastic integral J(f); these are direct consequences of its construction (Problem 1). Let g, f, and fnEL 2 . Then (J(f), J(g)) = (f, g); IIJ(f)ll

J(af + bg)

= 11!11;

= aJ(f) + bJ(g) (P-a.s.)

(11) (12) (13)

where a and bare constants; IIJ(J,.)- J(f)ll ..... o, if II fn - f II ..... o, n ..... oo.

(14)

399

§2. Orthogonal Stochastic Measures and Stochastic Integrals

4. Let us use the preceding definition of the stochastic integral to extend the elementary stochastic measure Z(A), A E C 0 , to sets inC = u(C 0 ). Since m is assumed to be finite, we have I d = I d(A.) E L 2 for all A E C. Write Z(A) = J(I d). It is clear that Z(A) = Z(A) for A E C 0 . It follows from (13) that if At n A2 = 0 for At and A2 E C, then Z(At

+ A2 )

= Z(At)

+ Z(A 2 )

(P-a.s.)

and it follows from (12) that EIZ(A)I 2 = m(A),

AEC.

Let us show that the random set function Z(A), A E C, is countably additive in the mean-square sense. In fact, let Ak E C and A = Lk"= t Ak. Then n

Z(A) -

L Z(Ak) = J(gn),

k= t

where n

9n(A) = Id(A.) -

L Idk(A) =

Il:n(A.),

k= t

But n-+ oo,

i.e. n

E I Z(A) -

L Z(Ak) 2 -+ 0, 1

n-+ oo.

k=1

It also follows from (11) that

when At n A2 = 0, At, A2 EC. Thus our function Z(A), defined on A E C, is countably additive in the mean-square sense and coincides with Z(A) on the sets A E C 0 • We shall call Z(A), A E C, an orthogonal stochastic measure (since it is an extension of the elementary orthogonal stochastic measure Z(A)) with respect to the structure function m(A), A E C; and we call the integral J(f) = .fE f(A.)Z(dA.), defined above, a stochastic integral with respect to this measure. 5. We now consider the case (E, C) = (R, ~(R)), which is the most important for our purposes. As we know (§3, Chapter II), there is a one-to-one correspondence between finite measures m = m(A) on (R, ~(R)) and certain (generalized) distribution functions G = G(x), with m(a, b] = G(b)- G(a). It turns out that there is something similar for orthogonal stochastic measures. We introduce the following definition.


VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Definition 4. A set of (complex-valued) random variables {Z_λ}, λ ∈ R, defined on (Ω, 𝓕, P), is a random process with orthogonal increments if:

(1) E|Z_λ|² < ∞, λ ∈ R;
(2) E|Z_λ − Z_{λ_n}|² → 0 whenever λ_n ↓ λ, for every λ ∈ R;
(3) whenever λ₁ < λ₂ < λ₃ < λ₄,

E(Z_{λ₄} − Z_{λ₃})(Z_{λ₂} − Z_{λ₁})* = 0.

Condition (3) is the condition of orthogonal increments. Condition (1) means that Z_λ ∈ H². Finally, condition (2) is included for technical reasons; it is a requirement of continuity on the right (in the mean-square sense) at each λ ∈ R.

Let Z = Z(Δ) be an orthogonal stochastic measure with respect to the structure function m = m(Δ), of finite measure, with the (generalized) distribution function G(λ). Let us put

Z_λ = Z(−∞, λ].

Then E|Z_λ|² = m(−∞, λ] = G(λ) < ∞, and conditions (2) and (3) are evidently satisfied as well. Thus this process {Z_λ} is a process with orthogonal increments.

On the other hand, if {Z_λ} is such a process with E|Z_λ|² = G(λ), G(−∞) = 0, G(+∞) < ∞, we put

Z(Δ) = Z_b − Z_a

when Δ = (a, b]. Let ℰ₀ be the algebra of sets

Δ = Σ_{k=1}^n (a_k, b_k]  and  Z(Δ) = Σ_{k=1}^n Z(a_k, b_k].

It is clear that

E|Z(Δ)|² = m(Δ),

where m(Δ) = Σ_{k=1}^n [G(b_k) − G(a_k)], and

E Z(Δ₁) Z*(Δ₂) = 0

for disjoint intervals Δ₁ = (a₁, b₁] and Δ₂ = (a₂, b₂]. Therefore Z = Z(Δ), Δ ∈ ℰ₀, is an elementary stochastic measure with orthogonal values. The set function m = m(Δ), Δ ∈ ℰ₀, has a unique extension to a measure on ℰ = 𝓑(R), and it follows from the preceding constructions that Z = Z(Δ), Δ ∈ ℰ₀, can also be extended to the sets Δ ∈ ℰ = 𝓑(R), with E|Z(Δ)|² = m(Δ), Δ ∈ 𝓑(R).


Therefore there is a one-to-one correspondence between processes {Z_λ}, λ ∈ R, with orthogonal increments and E|Z_λ|² = G(λ), G(−∞) = 0, G(+∞) < ∞, and orthogonal stochastic measures Z = Z(Δ), Δ ∈ 𝓑(R), with structure functions m = m(Δ). The correspondence is given by

Z_λ = Z(−∞, λ],  G(λ) = m(−∞, λ]

and

m(a, b] = G(b) − G(a).

By analogy with the usual notation of the theory of Riemann–Stieltjes integration, the stochastic integral ∫_R f(λ) dZ_λ, where {Z_λ} is a process with orthogonal increments, means the stochastic integral ∫_R f(λ) Z(dλ) with respect to the corresponding orthogonal stochastic measure.
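A concrete instance of this correspondence is the Brownian-motion-style case G(λ) = λ: simulating Z_λ on a grid with independent Gaussian increments, the interval values Z(a, b] = Z_b − Z_a form an orthogonal stochastic measure with structure function m(a, b] = b − a. The following Monte Carlo sketch (grid step and interval choices are illustrative only) checks E|Z(Δ)|² = m(Δ) and the orthogonality of values on disjoint intervals.

```python
import math
import random

random.seed(1)

STEP = 0.1   # grid spacing on which Z_lambda is simulated
N = 100      # number of grid cells; here G(lambda) = lambda (Lebesgue case)

def sample_path():
    """One path of Z_lambda on the grid: independent N(0, STEP) increments."""
    z, path = 0.0, [0.0]
    for _ in range(N):
        z += random.gauss(0.0, math.sqrt(STEP))
        path.append(z)
    return path

def z_interval(path, i, j):
    """Z(a, b] = Z_b - Z_a for the grid-cell range (i, j]."""
    return path[j] - path[i]

trials = 20_000
e_sq = 0.0      # accumulates |Z(D1)|^2 for D1 = (0, 2]   (cells 0..20)
e_cross = 0.0   # accumulates Z(D1) Z(D2) for D2 = (3, 5] (cells 30..50)
for _ in range(trials):
    p = sample_path()
    z1 = z_interval(p, 0, 20)
    z2 = z_interval(p, 30, 50)
    e_sq += z1 * z1
    e_cross += z1 * z2

print(e_sq / trials)     # near m(D1) = 2.0
print(e_cross / trials)  # near 0: orthogonal values on disjoint intervals
```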

6. PROBLEMS

1. Prove the equivalence of (5) and (6).

2. Let f ∈ L². Using the results of Chapter II (Theorem 1 of §4, the Corollary to Theorem 3 of §6, and Problem 9 of §3), prove that there is a sequence of functions f_n of the form (10) such that ‖f − f_n‖ → 0, n → ∞.

3. Establish the following properties of an orthogonal stochastic measure Z(Δ) with structure function m(Δ):

E|Z(Δ₁) − Z(Δ₂)|² = m(Δ₁ △ Δ₂),
Z(Δ₁ \ Δ₂) = Z(Δ₁) − Z(Δ₁ ∩ Δ₂)  (P-a.s.),
Z(Δ₁ △ Δ₂) = Z(Δ₁) + Z(Δ₂) − 2Z(Δ₁ ∩ Δ₂)  (P-a.s.).

§3. Spectral Representation of Stationary (Wide Sense) Sequences

1. If ξ = (ξ_n) is a stationary sequence with Eξ_n = 0, n ∈ Z, then, by the theorem of §1, there is a finite measure F = F(Δ) on ([−π, π), 𝓑([−π, π))) such that the covariance function R(n) = cov(ξ_{k+n}, ξ_k) admits the spectral representation

R(n) = ∫_{−π}^{π} e^{iλn} F(dλ).  (1)

The following result provides the corresponding spectral representation of the sequence ξ = (ξ_n), n ∈ Z, itself.


Theorem 1. There is an orthogonal stochastic measure Z = Z(Δ), Δ ∈ 𝓑([−π, π)), such that for every n ∈ Z (P-a.s.)

ξ_n = ∫_{−π}^{π} e^{iλn} Z(dλ).  (2)

Moreover, E|Z(Δ)|² = F(Δ).

The simplest proof is based on properties of Hilbert spaces. Let L²(F) = L²(E, ℰ, F) be the Hilbert space of complex functions on E = [−π, π), ℰ = 𝓑([−π, π)), with the scalar product

(f, g) = ∫_{−π}^{π} f(λ) g*(λ) F(dλ),  (3)

and let L₀²(F) be the linear manifold (L₀²(F) ⊆ L²(F)) spanned by the functions e_n = e_n(λ), n ∈ Z, where e_n(λ) = e^{iλn}. Observe that, since E = [−π, π) and F is finite, the closure of L₀²(F) coincides (Problem 1) with L²(F).

Also let L₀²(ξ) be the linear manifold spanned by the random variables ξ_n, n ∈ Z, and let L²(ξ) be its closure in the mean-square sense (with respect to P). We establish a one-to-one correspondence between the elements of L₀²(F) and L₀²(ξ), denoted by "↔", by setting

e_n ↔ ξ_n,  n ∈ Z,  (4)

and defining it for elements in general (more precisely, for equivalence classes of elements) by linearity:

Σ α_n e_n ↔ Σ α_n ξ_n  (5)

(here we suppose that only finitely many of the complex numbers α_n are different from zero). Observe that (5) is a consistent definition, in the sense that Σ α_n e_n = 0 almost everywhere with respect to F if and only if Σ α_n ξ_n = 0 (P-a.s.). The correspondence "↔" is an isometry, i.e. it preserves scalar products. In fact, by (3),

(e_n, e_m) = ∫_{−π}^{π} e_n(λ) e_m*(λ) F(dλ) = ∫_{−π}^{π} e^{iλ(n−m)} F(dλ) = R(n − m) = E ξ_n ξ_m* = (ξ_n, ξ_m),

and similarly

(Σ α_n e_n, Σ β_m e_m) = (Σ α_n ξ_n, Σ β_m ξ_m).  (6)


Now let η ∈ L²(ξ). Since L²(ξ) is the closure of L₀²(ξ), there is a sequence {η_n} such that η_n ∈ L₀²(ξ) and ‖η_n − η‖ → 0, n → ∞. Consequently {η_n} is a fundamental sequence, and therefore so is the sequence {f_n}, where f_n ∈ L₀²(F) and f_n ↔ η_n. The space L²(F) is complete, and consequently there is an f ∈ L²(F) such that ‖f_n − f‖ → 0. There is an evident converse: if f ∈ L²(F) and ‖f − f_n‖ → 0, f_n ∈ L₀²(F), there is an element η of L²(ξ) such that ‖η − η_n‖ → 0, η_n ∈ L₀²(ξ), η_n ↔ f_n.

Up to now the isometry "↔" has been defined only between elements of L₀²(ξ) and L₀²(F). We extend it by continuity, taking f ↔ η when f and η are the elements considered above. It is easily verified that the correspondence obtained in this way is one-to-one (between classes of equivalent random variables and of functions), is linear, and preserves scalar products.

Consider the function f(λ) = I_Δ(λ), where Δ ∈ 𝓑([−π, π)), and let Z(Δ) be the element of L²(ξ) such that I_Δ(λ) ↔ Z(Δ). It is clear that ‖I_Δ(λ)‖² = F(Δ) and therefore E|Z(Δ)|² = F(Δ). Moreover, if Δ₁ ∩ Δ₂ = ∅, we have E Z(Δ₁) Z*(Δ₂) = 0 and E|Z(Δ) − Σ_{k=1}^n Z(Δ_k)|² → 0, n → ∞, where Δ = Σ_{k=1}^∞ Δ_k. Hence the family of elements Z(Δ), Δ ∈ 𝓑([−π, π)), forms an orthogonal stochastic measure, with respect to which (according to §2) we can define the stochastic integral

J(f) = ∫_{−π}^{π} f(λ) Z(dλ).

Let f ∈ L²(F) and η ↔ f. Denote the element η by Φ(f) (more precisely, select single representatives from the corresponding equivalence classes of random variables or functions). Let us show that (P-a.s.)

J(f) = Φ(f).  (7)

In fact, if

f(λ) = Σ α_k I_{Δ_k}(λ)  (8)

is a finite linear combination of functions I_{Δ_k}(λ), Δ_k = (a_k, b_k], then, by the very definition of the stochastic integral, J(f) = Σ α_k Z(Δ_k), which is evidently equal to Φ(f). Therefore (7) is valid for functions of the form (8). But if f ∈ L²(F) and ‖f_n − f‖ → 0, where the f_n are functions of the form (8), then ‖Φ(f_n) − Φ(f)‖ → 0 and ‖J(f_n) − J(f)‖ → 0 (by (2.14)). Therefore Φ(f) = J(f) (P-a.s.).

Consider the function f(λ) = e^{iλn}. Then Φ(e^{iλn}) = ξ_n by (4), but on the other hand J(e^{iλn}) = ∫_{−π}^{π} e^{iλn} Z(dλ). Therefore

ξ_n = ∫_{−π}^{π} e^{iλn} Z(dλ),  n ∈ Z  (P-a.s.)

by (7). This completes the proof of the theorem.


Corollary 1. Let ξ = (ξ_n) be a stationary sequence of real random variables ξ_n, n ∈ Z. Then the stochastic measure Z = Z(Δ) involved in the spectral representation (2) has the property that

Z*(Δ) = Z(−Δ)  (9)

for every Δ ∈ 𝓑([−π, π)), where −Δ = {λ: −λ ∈ Δ}.

In fact, let f(λ) = Σ α_k e^{iλk} and η = Σ α_k ξ_k (finite sums). Then f ↔ η and, since the ξ_k are real, therefore

f*(−λ) ↔ η*.  (10)

Since I_Δ(λ) ↔ Z(Δ), it follows from (10) that I_Δ*(−λ) = I_{−Δ}(λ) ↔ Z*(Δ). On the other hand, I_{−Δ}(λ) ↔ Z(−Δ). Therefore Z*(Δ) = Z(−Δ) (P-a.s.).

Corollary 2. Again let ξ = (ξ_n) be a stationary sequence of real random variables, and let Z(Δ) = Z₁(Δ) + iZ₂(Δ). Then

E Z₁(Δ₁) Z₂(Δ₂) = 0  (11)

for every Δ₁ and Δ₂; and if Δ₁ ∩ Δ₂ = ∅, then

E Z₁(Δ₁) Z₁(Δ₂) = 0,  E Z₂(Δ₁) Z₂(Δ₂) = 0.  (12)

In fact, since Z*(Δ) = Z(−Δ), we have

Z₁(−Δ) = Z₁(Δ),  Z₂(−Δ) = −Z₂(Δ).  (13)

Moreover, since E Z(Δ₁) Z*(Δ₂) = E|Z(Δ₁ ∩ Δ₂)|², we have Im E Z(Δ₁) Z*(Δ₂) = 0, i.e.

E Z₂(Δ₁) Z₁(Δ₂) − E Z₁(Δ₁) Z₂(Δ₂) = 0.  (14)

If we take the interval −Δ₁ instead of Δ₁, we therefore obtain

E Z₂(−Δ₁) Z₁(Δ₂) − E Z₁(−Δ₁) Z₂(Δ₂) = 0,

which, by (13), can be transformed into

E Z₂(Δ₁) Z₁(Δ₂) + E Z₁(Δ₁) Z₂(Δ₂) = 0.  (15)

Then (11) follows from (14) and (15). On the other hand, if Δ₁ ∩ Δ₂ = ∅, then E Z(Δ₁) Z*(Δ₂) = 0, whence Re E Z(Δ₁) Z*(Δ₂) = 0 and Re E Z(−Δ₁) Z*(Δ₂) = 0, which, with (13), provides an evident proof of (12).

Corollary 3. Let ξ = (ξ_n) be a Gaussian sequence. Then, for every family Δ₁, …, Δ_k, the vector (Z₁(Δ₁), …, Z₁(Δ_k), Z₂(Δ₁), …, Z₂(Δ_k)) is normally distributed.


In fact, the linear manifold L₀²(ξ) consists of (complex-valued) Gaussian random variables η, i.e. the vector (Re η, Im η) has a Gaussian distribution. Then, according to Subsection 5, §13, Chapter II, the closure of L₀²(ξ) also consists of Gaussian variables.

It follows from Corollary 2 that, when ξ = (ξ_n) is a Gaussian sequence, the real and imaginary parts Z₁ and Z₂ are independent, in the sense that the families of random variables (Z₁(Δ₁), …, Z₁(Δ_k)) and (Z₂(Δ₁), …, Z₂(Δ_k)) are independent. It also follows from (12) that when the sets Δ₁, …, Δ_k are disjoint, the random variables Z_i(Δ₁), …, Z_i(Δ_k) are collectively independent, i = 1, 2.

Corollary 4. If ξ = (ξ_n) is a stationary sequence of real random variables, then (P-a.s.)

ξ_n = ∫_{−π}^{π} cos λn Z₁(dλ) − ∫_{−π}^{π} sin λn Z₂(dλ).  (16)

Remark. If {Z_λ}, λ ∈ [−π, π), is a process with orthogonal increments, corresponding to an orthogonal stochastic measure Z = Z(Δ), then, in accordance with §2, the spectral representation (2) can also be written in the following form:

ξ_n = ∫_{−π}^{π} e^{iλn} dZ_λ,  n ∈ Z.  (17)

2. Let ξ = (ξ_n) be a stationary sequence with the spectral representation (2), and let η ∈ L²(ξ). The following theorem describes the structure of such random variables.

Theorem 2. If η ∈ L²(ξ), there is a function φ ∈ L²(F) such that (P-a.s.)

η = ∫_{−π}^{π} φ(λ) Z(dλ).  (18)

PROOF. If

η_n = Σ_{|k|≤n} α_k ξ_k,  (19)

then by (2)

η_n = ∫_{−π}^{π} φ_n(λ) Z(dλ),  (20)

i.e. (18) is satisfied with

φ_n(λ) = Σ_{|k|≤n} α_k e^{iλk}.  (21)

In the general case, when η ∈ L²(ξ), there are variables η_n of type (19) such that ‖η − η_n‖ → 0, n → ∞. But then ‖φ_n − φ_m‖ = ‖η_n − η_m‖ → 0, n, m → ∞.


Consequently {φ_n} is fundamental in L²(F), and therefore there is a function φ ∈ L²(F) such that ‖φ − φ_n‖ → 0, n → ∞. By property (2.14) we have ‖J(φ_n) − J(φ)‖ → 0, and since η_n = J(φ_n) we also have η = J(φ) (P-a.s.). This completes the proof of the theorem.

Remark. Let H₀(ξ) and H₀(F) be the respective closed linear manifolds spanned by the variables ξ_n and by the functions e_n for n ≤ 0. Then if η ∈ H₀(ξ) there is a function φ ∈ H₀(F) such that (P-a.s.) η = ∫_{−π}^{π} φ(λ) Z(dλ).

3. Formula (18) describes the structure of the random variables that are obtained from ξ_n, n ∈ Z, by linear transformations, i.e. in the form of finite sums (19) and their mean-square limits.

A special but important class of such linear transformations is defined by means of what are known as (linear) filters. Let us suppose that, at instant m, a system (filter) receives as input a signal x_m, and that the output of the system is, at instant n, the signal h(n − m)x_m, where h = h(s), s ∈ Z, is a complex-valued function called the impulse response (of the filter). Therefore the total signal obtained from the input can be represented in the form

y_n = Σ_{m=−∞}^{∞} h(n − m) x_m.  (22)

For physically realizable systems, the values of the output at instant n are determined only by the "past" values of the input, i.e. the values x_m for m ≤ n. It is therefore natural to call a filter with the impulse response h(s) physically realizable if h(s) = 0 for all s < 0, in other words if

y_n = Σ_{m=−∞}^{n} h(n − m) x_m = Σ_{m=0}^{∞} h(m) x_{n−m}.  (23)

An important spectral characteristic of a filter with the impulse response h is its Fourier transform

φ(λ) = Σ_{m=−∞}^{∞} e^{−iλm} h(m),  (24)

known as the frequency characteristic or transfer function of the filter.

Let us now take up conditions, about which nothing has been said so far, for the convergence of the series in (22) and (24). Let us suppose that the input is a stationary random sequence ξ = (ξ_n), n ∈ Z, with covariance function R(n) and spectral representation (2). Then if

Σ_{k,l=−∞}^{∞} h(k) R(k − l) h*(l) < ∞,  (25)


the series Σ_{m=−∞}^{∞} h(n − m) ξ_m converges in mean-square, and therefore there is a stationary sequence η = (η_n) with

η_n = Σ_{m=−∞}^{∞} h(n − m) ξ_m = Σ_{m=−∞}^{∞} h(m) ξ_{n−m}.  (26)

In terms of the spectral measure, (25) is evidently equivalent to saying that φ(λ) ∈ L²(F), i.e.

∫_{−π}^{π} |φ(λ)|² F(dλ) < ∞.  (27)

Under (25) or (27), we obtain from (26) and (2) the spectral representation of η,

η_n = ∫_{−π}^{π} e^{iλn} φ(λ) Z(dλ).  (28)

Consequently the covariance function R_η(n) of η is given by the formula

R_η(n) = ∫_{−π}^{π} e^{iλn} |φ(λ)|² F(dλ).  (29)

In particular, if the input to a filter with frequency characteristic φ(λ) is white noise ε = (ε_n), the output will be a stationary sequence (moving average)

η_n = Σ_{m=−∞}^{∞} h(m) ε_{n−m}  (30)

with spectral density

f_η(λ) = (1/2π) |φ(λ)|².

The following theorem shows that there is a sense in which every stationary sequence with a spectral density is obtainable by means of a moving average.
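The relation f_η(λ) = (1/2π)|φ(λ)|² can be verified directly for a finite impulse response applied to unit white noise: the covariance of (30) is then R_η(n) = Σ_m h(m)h(m+n), and its Fourier series, as in (1), must reproduce |φ(λ)|²/2π. A small sketch with an illustrative (assumed) filter h:

```python
import cmath
import math

h = [1.0, 0.5, 0.25]   # illustrative impulse response h(0), h(1), h(2)

def phi(lam):
    """Frequency characteristic phi(lambda) = sum_m e^{-i lam m} h(m), cf. (24)."""
    return sum(hm * cmath.exp(-1j * lam * m) for m, hm in enumerate(h))

def cov(n):
    """R_eta(n) = sum_m h(m) h(m + n) for a moving average of unit white noise."""
    n = abs(n)
    return sum(h[m] * h[m + n] for m in range(len(h) - n))

def density_from_cov(lam):
    """f_eta(lambda) = (1/2pi) sum_n R_eta(n) e^{-i lam n}; R_eta is real and even."""
    s = cov(0) + 2 * sum(cov(n) * math.cos(lam * n) for n in range(1, len(h)))
    return s / (2 * math.pi)

lam = 0.7
f1 = abs(phi(lam)) ** 2 / (2 * math.pi)   # (1/2pi)|phi(lambda)|^2
f2 = density_from_cov(lam)
print(f1, f2)   # the two expressions agree
```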

Theorem 3. Let η = (η_n) be a stationary sequence with spectral density f_η(λ). Then (possibly at the expense of enlarging the original probability space) we can find a sequence ε = (ε_n) representing white noise, and a filter, such that the representation (30) holds.

PROOF. For the given (nonnegative) function f_η(λ) we can find a function φ(λ) such that f_η(λ) = (1/2π)|φ(λ)|². Since ∫_{−π}^{π} f_η(λ) dλ < ∞, we have φ(λ) ∈ L²(μ), where μ is Lebesgue measure on [−π, π). Hence φ can be approximated by trigonometric polynomials:

∫_{−π}^{π} |φ(λ) − Σ_{|m|≤n} e^{−iλm} h(m)|² dλ → 0,  n → ∞.


Let

η_n = ∫_{−π}^{π} e^{iλn} Z(dλ),  n ∈ Z.

Besides the measure Z = Z(Δ) we introduce another, independent, orthogonal stochastic measure Z̃ = Z̃(Δ) with E|Z̃(a, b]|² = (b − a)/2π. (The possibility of constructing such a measure depends, in general, on having a sufficiently "rich" original probability space.) Let us put

Z̄(Δ) = ∫_Δ φ⊕(λ) Z(dλ) + ∫_Δ [1 − φ⊕(λ)φ(λ)] Z̃(dλ),

where

a⊕ = a⁻¹ if a ≠ 0, and a⊕ = 0 if a = 0.

The stochastic measure Z̄ = Z̄(Δ) is a measure with orthogonal values, and for every Δ = (a, b] we have

E|Z̄(Δ)|² = (1/2π) ∫_Δ |φ⊕(λ)|² |φ(λ)|² dλ + (1/2π) ∫_Δ |1 − φ⊕(λ)φ(λ)|² dλ = |Δ|/2π,

where |Δ| = b − a. Therefore the stationary sequence ε = (ε_n), n ∈ Z, with

ε_n = ∫_{−π}^{π} e^{iλn} Z̄(dλ),

is a white noise.

We now observe that

η_n = ∫_{−π}^{π} e^{iλn} φ(λ) Z̄(dλ)  (P-a.s.)  (31)

and, on the other hand, by property (2.14), (P-a.s.)

∫_{−π}^{π} e^{iλn} φ(λ) Z̄(dλ) = ∫_{−π}^{π} e^{iλn} (Σ_{m=−∞}^{∞} e^{−iλm} h(m)) Z̄(dλ) = Σ_{m=−∞}^{∞} h(m) ε_{n−m},

which, together with (31), establishes the representation (30). This completes the proof of the theorem.

Remark. If f_η(λ) > 0 (almost everywhere with respect to Lebesgue measure), the introduction of the auxiliary measure Z̃ = Z̃(Δ) becomes unnecessary (since then 1 − φ⊕(λ)φ(λ) = 0 almost everywhere with respect to Lebesgue measure), and the reservation concerning the necessity of extending the original probability space can be omitted.

Corollary 1. Let the spectral density f_η(λ) > 0 (almost everywhere with respect to Lebesgue measure) and

f_η(λ) = (1/2π) |φ(λ)|²,

where

φ(λ) = Σ_{k=0}^{∞} e^{−iλk} h(k),  Σ_{k=0}^{∞} |h(k)|² < ∞.

Then the sequence η admits a representation as a one-sided moving average,

η_n = Σ_{m=0}^{∞} h(m) ε_{n−m}.

In particular, let P(z) = a₀ + a₁z + ⋯ + a_p z^p be a polynomial that has no zeros on {z: |z| = 1}. Then the sequence η = (η_n) with spectral density

f_η(λ) = (1/2π) |P(e^{−iλ})|²

can be represented in the form

η_n = a₀ε_n + a₁ε_{n−1} + ⋯ + a_p ε_{n−p}.

Corollary 2. Let ξ = (ξ_n) be a sequence with rational spectral density

f_ξ(λ) = (1/2π) |P(e^{−iλ})|² / |Q(e^{−iλ})|².  (32)

Let us show that if P(z) and Q(z) have no zeros on {z: |z| = 1}, there is a white noise ε = (ε_n) such that (P-a.s.)

ξ_n + b₁ξ_{n−1} + ⋯ + b_q ξ_{n−q} = a₀ε_n + a₁ε_{n−1} + ⋯ + a_p ε_{n−p}.  (33)

Conversely, every stationary sequence ξ = (ξ_n) that satisfies this equation with some white noise ε = (ε_n) and some polynomial Q(z) with no zeros on {z: |z| = 1} has a spectral density (32).

In fact, let η_n = ξ_n + b₁ξ_{n−1} + ⋯ + b_q ξ_{n−q}. Then f_η(λ) = (1/2π)|P(e^{−iλ})|², and the required representation follows from Corollary 1. On the other hand, if (33) holds and F_η(λ) and F_ξ(λ) are the spectral functions of η and ξ, then

F_η(λ) = ∫_{−π}^{λ} |Q(e^{−iν})|² dF_ξ(ν) = (1/2π) ∫_{−π}^{λ} |P(e^{−iν})|² dν.

Since |Q(e^{−iν})|² > 0, it follows that F_ξ(λ) has a density defined by (32).
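A quick numerical check of (32)-(33), with illustrative (assumed) coefficients: for an ARMA(1,1) sequence ξ_n + bξ_{n−1} = ε_n + aε_{n−1}, the density f(λ) = (1/2π)|1 + a e^{−iλ}|²/|1 + b e^{−iλ}|² must produce covariances R(n) = ∫ e^{iλn} f(λ) dλ satisfying R(k) + bR(k−1) = 0 for k ≥ 2, since ε_n + aε_{n−1} is uncorrelated with ξ_{n−k} for k ≥ 2 (the zero of Q lies outside the unit circle, so ξ is a causal function of the noise).

```python
import cmath
import math

a, b = 0.5, -0.4   # illustrative ARMA(1,1) coefficients; the zero of Q(z) = 1 + b z
                   # is at -1/b, outside the unit circle

def f(lam):
    """Rational spectral density (32) with P(z) = 1 + a z, Q(z) = 1 + b z."""
    z = cmath.exp(-1j * lam)
    return abs(1 + a * z) ** 2 / abs(1 + b * z) ** 2 / (2 * math.pi)

def R(n):
    """Covariance R(n) = int_{-pi}^{pi} e^{i lam n} f(lam) d lam (periodic rule)."""
    m = 20000
    s = 0.0
    for j in range(m):
        lam = -math.pi + 2 * math.pi * j / m
        s += math.cos(lam * n) * f(lam)   # f is even, so the cosine part suffices
    return s * 2 * math.pi / m

# For k >= 2, cov(xi_n + b xi_{n-1}, xi_{n-k}) = 0, i.e. R(k) + b R(k-1) = 0.
lhs2 = R(2) + b * R(1)
lhs3 = R(3) + b * R(2)
print(lhs2, lhs3)   # both near zero
```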


4. The following mean-square ergodic theorem can be thought of as an analog of the law of large numbers for stationary (wide sense) random sequences.

Theorem 4. Let ξ = (ξ_n), n ∈ Z, be a stationary sequence with Eξ_n = 0, covariance function (1), and spectral representation (2). Then

(1/n) Σ_{k=0}^{n−1} ξ_k → Z({0})  in mean square,  (34)

and

(1/n) Σ_{k=0}^{n−1} R(k) → F({0}).  (35)

PROOF. By (2),

(1/n) Σ_{k=0}^{n−1} ξ_k = ∫_{−π}^{π} φ_n(λ) Z(dλ),

where

φ_n(λ) = (1/n) Σ_{k=0}^{n−1} e^{iλk}.  (36)

Since |sin λ| ≥ (2/π)|λ| for |λ| ≤ π/2, we have, for λ ≠ 0,

|φ_n(λ)| = |sin(nλ/2)| / (n |sin(λ/2)|) ≤ π / (n|λ|).

Moreover, φ_n → I_{{0}}(λ) in L²(F), and therefore by (2.14)

∫_{−π}^{π} φ_n(λ) Z(dλ) → ∫_{−π}^{π} I_{{0}}(λ) Z(dλ) = Z({0})  in mean square,

which establishes (34). Relation (35) can be proved in a similar way. This completes the proof of the theorem.

Corollary. If the spectral function is continuous at zero, i.e. F({0}) = 0, then Z({0}) = 0 (P-a.s.), and by (34) and (35),

(1/n) Σ_{k=0}^{n−1} ξ_k → 0  in mean square  and  (1/n) Σ_{k=0}^{n−1} R(k) → 0.

Since (1/n) Σ_{k=0}^{n−1} R(k) = E[((1/n) Σ_{k=0}^{n−1} ξ_k) ξ_0*], the converse implication also holds:

(1/n) Σ_{k=0}^{n−1} ξ_k → 0 in mean square  ⟹  (1/n) Σ_{k=0}^{n−1} R(k) → 0.

Therefore the condition (1/n) Σ_{k=0}^{n−1} R(k) → 0 is necessary and sufficient for the convergence (in the mean-square sense) of the arithmetic means (1/n) Σ_{k=0}^{n−1} ξ_k to zero. It follows that if the original sequence ξ = (ξ_n) has expectation m (that is, Eξ_0 = m), then

(1/n) Σ_{k=0}^{n−1} R(k) → 0  ⟺  (1/n) Σ_{k=0}^{n−1} ξ_k → m  in mean square,  (37)

where R(n) = E(ξ_n − Eξ_n)(ξ_0 − Eξ_0)*.

Let us also observe that if Z({0}) ≠ 0 (P-a.s.) and m = 0, then ξ_n "contains a random constant α":

ξ_n = α + η_n,

where α = Z({0}); and in the spectral representation η_n = ∫_{−π}^{π} e^{iλn} Z_η(dλ) the measure Z_η = Z_η(Δ) is such that Z_η({0}) = 0 (P-a.s.). Conclusion (34) means that the arithmetic mean converges in mean-square to precisely this random constant α.
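The corollary can be watched numerically. In this illustrative sketch (MA(1) data chosen purely for concreteness), the sequence ξ_k = ε_k + ε_{k−1} has an absolutely continuous spectral function, so F({0}) = 0 and the arithmetic means converge to 0 in mean square; here E|(1/n) Σ ξ_k|² ≈ 4/n, which the simulation reproduces.

```python
import random

random.seed(2)

def mean_of_ma1(n):
    """Arithmetic mean (1/n) sum_{k<n} xi_k for xi_k = eps_k + eps_{k-1}."""
    prev = random.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n):
        cur = random.gauss(0.0, 1.0)
        total += cur + prev
        prev = cur
    return total / n

def mean_square(n, trials=4000):
    """Monte Carlo estimate of E|(1/n) sum xi_k|^2."""
    return sum(mean_of_ma1(n) ** 2 for _ in range(trials)) / trials

e100 = mean_square(100)
e400 = mean_square(400)
print(e100, e400)   # roughly 4/100 and 4/400: the means go to 0 in mean square
```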

5. PROBLEMS

1. Show that the closure of L₀²(F) equals L²(F) (for the notation see the proof of Theorem 1).

2. Let ξ = (ξ_n) be a stationary sequence with the property that ξ_{n+N} = ξ_n for some N and all n. Show that the spectral representation of such a sequence reduces to (1.13).

3. Let ξ = (ξ_n) be a stationary sequence such that Eξ_n = 0 and

(1/N²) Σ_{k=0}^{N} Σ_{l=0}^{N} R(k − l) = (1/N) Σ_{|k|≤N−1} R(k) [1 − |k|/N] ≤ C N^{−α}

for some C > 0, α > 0. Use the Borel–Cantelli lemma to show that then

(1/N) Σ_{k=0}^{N} ξ_k → 0  (P-a.s.)

4. Let the spectral density f_ξ(λ) of the sequence ξ = (ξ_n) be rational,

f_ξ(λ) = (1/2π) |P_{n−1}(e^{−iλ})|² / |Q_n(e^{−iλ})|²,  (38)

where P_{n−1}(z) = a₀ + a₁z + ⋯ + a_{n−1}z^{n−1} and Q_n(z) = 1 + b₁z + ⋯ + b_n z^n, and all the zeros of these polynomials lie outside the unit disk.

Show that there is a white noise ε = (ε_m), m ∈ Z, such that the sequence (ξ_m) is a component of an n-dimensional sequence (ξ_m¹, ξ_m², …, ξ_mⁿ), ξ_m¹ = ξ_m, that satisfies the system of equations

ξ_{m+1}^i = ξ_m^{i+1} + β_i ε_{m+1},  i = 1, …, n − 1,
ξ_{m+1}^n = −Σ_{j=0}^{n−1} b_{n−j} ξ_m^{j+1} + β_n ε_{m+1}.  (39)

§4. Statistical Estimation of the Covariance Function and the Spectral Density

1. Problems of the statistical estimation of various characteristics of the probability distributions of random sequences arise in the most diverse branches of science (geophysics, medicine, economics, etc.). The material presented in this section will give the reader an idea of the concepts and methods of estimation, and of the difficulties that are encountered.

To begin with, let ξ = (ξ_n), n ∈ Z, be a sequence, stationary in the wide sense (for simplicity, real), with expectation Eξ_n = m and covariance R(n) = ∫_{−π}^{π} e^{iλn} F(dλ). Let x₀, x₁, …, x_{N−1} be the results of observing the random variables ξ₀, ξ₁, …, ξ_{N−1}. How are we then to construct a "good" estimator of the (unknown) mean value m?

Let us put

m̂_N(x) = (1/N) Σ_{k=0}^{N−1} x_k.  (1)

Then it follows from the elementary properties of the expectation that this is a "good" estimator of m in the sense that it is unbiased "in the mean over all kinds of data x₀, …, x_{N−1}", i.e.

E m̂_N(ξ) = E((1/N) Σ_{k=0}^{N−1} ξ_k) = m.

In addition, it follows from Theorem 4 of §3 that our estimator is consistent (in mean-square),

E|m̂_N(ξ) − m|² → 0,  N → ∞,  (2)

whenever

(1/N) Σ_{k=0}^{N} R(k) → 0,  N → ∞.  (3)

§4. Statistical Estimation of the Covariance Function and the Spectral Density

413

Since R(n) = E~n+k~k• it is natural to estimate this function on the basis of N observations x 0 , x 1, ••• , xN-l (when 0 ~ n < N) by

It is clear that this estimator is unbiased in the sense that

0 ~ n < N. Let us now consider the question of its consistency. If we replace ~k in (3.37) by ~n+k~k and suppose that the sequence~ = C~n) under consideration has a fourth moment (E ~ti < oo ), we find that the condition N-+ oo,

(4)

is necessary and sufficient for N-+ oo. Let us suppose that the original sequence~ = mean and covariance R(n)). Then by (11.12.51) E[~n+k~k - R(n)] [~n~O - R(n)]

= = =

C~n)

E~n+k~k~n~o

(5)

is Gaussian (with zero

-

E~n+k~k · E~n~o

R 2 (n)

+ E~n+k~n • E~k~o

+ E~n+k~o · E~k~n - R 2(n) R 2 (k) + R(n + k)R(n - k).

Therefore in the Gaussian case condition (4) is equivalent to 1

N-1

N

k=O

- L [R Since IR(n

2 (k)

+ k)R(n-

+ R(n + k)R(n

k)l ~ IR(n

N-+ oo.

(6)

+ kW + IR(n- k)l 2 , the condition

_!_Nil R2(k)-+ 0,

N

- k)] -+ 0,

N-+ oo,

(7)

k=O

implies (6). Conversely, if (6) holds for n = 0, then (7) is satisfied. We have now established the following theorem.

Theorem. Let ξ = (ξ_n) be a Gaussian stationary sequence with Eξ_n = 0 and covariance function R(n). Then (7) is a necessary and sufficient condition that, for every n ≥ 0, the estimator R̂_N(n; x) is mean-square consistent (i.e. that (5) is satisfied).
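A numerical illustration of the estimator R̂_N(n; x) = (1/(N − n)) Σ_{k=0}^{N−n−1} x_{n+k}x_k (a sketch, with Gaussian white noise as the simplest data satisfying (7)): the averages of R̂_N(0; ·) and R̂_N(1; ·) over many independent samples are close to R(0) = 1 and R(1) = 0, confirming unbiasedness.

```python
import random

random.seed(4)

def r_hat(x, n):
    """R_hat_N(n; x) = (1/(N - n)) sum_{k=0}^{N-n-1} x_{n+k} x_k."""
    big_n = len(x)
    return sum(x[n + k] * x[k] for k in range(big_n - n)) / (big_n - n)

def gaussian_white_noise(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

trials = 2000
r0 = [r_hat(gaussian_white_noise(200), 0) for _ in range(trials)]
r1 = [r_hat(gaussian_white_noise(200), 1) for _ in range(trials)]

mean_r0 = sum(r0) / trials   # estimates R(0) = 1 without bias
mean_r1 = sum(r1) / trials   # estimates R(1) = 0 without bias
print(mean_r0, mean_r1)
```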


Remark. If we use the spectral representation of the covariance function, we obtain

(1/N) Σ_{k=0}^{N−1} R²(k) = ∫_{−π}^{π} ∫_{−π}^{π} (1/N) Σ_{k=0}^{N−1} e^{i(λ−ν)k} F(dλ) F(dν) = ∫_{−π}^{π} ∫_{−π}^{π} f_N(λ, ν) F(dλ) F(dν),

where (compare (3.36))

f_N(λ, ν) = (1/N) Σ_{k=0}^{N−1} e^{i(λ−ν)k},

which equals 1 when λ = ν. But as N → ∞,

f_N(λ, ν) → f(λ, ν) = { 1, λ = ν; 0, λ ≠ ν }.

Therefore

(1/N) Σ_{k=0}^{N−1} R²(k) → ∫_{−π}^{π} ∫_{−π}^{π} f(λ, ν) F(dλ) F(dν) = ∫_{−π}^{π} F({λ}) F(dλ) = Σ_λ F²({λ}),

where the sum over λ contains at most a countable number of terms, since the measure F is finite. Hence (7) is equivalent to

Σ_λ F²({λ}) = 0,  (8)

which means that the spectral function F(λ) = F([−π, λ]) is continuous.

2. We now turn to the problem of finding estimators for the spectral function F(λ) and the spectral density f(λ) (under the assumption that they exist).

A method that naturally suggests itself for estimating the spectral density follows from the proof of Herglotz' theorem that we gave earlier. Recall that the function

f_N(λ) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R(n) e^{−iλn},

introduced in §1, has the property that the function

F_N(λ) = ∫_{−π}^{λ} f_N(ν) dν  (9)

converges on the whole (Chapter III, §1) to the spectral function F(λ). Therefore if F(λ) has a density f(λ), we have

∫_{−π}^{λ} f_N(ν) dν → ∫_{−π}^{λ} f(ν) dν  (10)

for each λ ∈ [−π, π). Starting from these facts and recalling that an estimator for R(n) (on the basis of the observations x₀, x₁, …, x_{N−1}) is R̂_N(n; x), we take as an estimator for f(λ) the function

f̂_N(λ; x) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R̂_N(n; x) e^{−iλn},  (11)

putting R̂_N(n; x) = R̂_N(|n|; x) for |n| < N. The function f̂_N(λ; x) is known as a periodogram. It is easily verified that it can also be represented in the following more convenient form:

f̂_N(λ; x) = (1/2πN) |Σ_{n=0}^{N−1} x_n e^{−iλn}|².  (12)

Since E R̂_N(n; ξ) = R(n), |n| < N, we have

E f̂_N(λ; ξ) = f_N(λ).

If the spectral function F(λ) has density f(λ), then, since f_N(λ) can also be written in the form (1.34), we find that

f_N(λ) = (1/2πN) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} ∫_{−π}^{π} e^{iν(k−l)} e^{iλ(l−k)} f(ν) dν = ∫_{−π}^{π} Φ_N(λ − ν) f(ν) dν.

The function

Φ_N(λ) = (1/2πN) |Σ_{k=0}^{N−1} e^{iλk}|² = (1/2πN) · sin²(Nλ/2) / sin²(λ/2)

is the Fejér kernel. It is known, from the properties of this function, that for almost every λ (with respect to Lebesgue measure)

∫_{−π}^{π} Φ_N(λ − ν) f(ν) dν → f(λ).  (13)

Therefore for almost every λ ∈ [−π, π),

E f̂_N(λ; ξ) → f(λ);  (14)

in other words, the estimator f̂_N(λ; x) of f(λ) on the basis of x₀, x₁, …, x_{N−1} is asymptotically unbiased.


In this sense the estimator f̂_N(λ; x) can be considered to be "good". However, at the individual observed values x₀, …, x_{N−1} the values of the periodogram f̂_N(λ; x) usually turn out to be far from the actual values f(λ). In fact, let ξ = (ξ_n) be a stationary sequence of independent Gaussian random variables, ξ_n ~ 𝒩(0, 1). Then f(λ) ≡ 1/2π and

f̂_N(λ; ξ) = (1/2π) |(1/√N) Σ_{k=0}^{N−1} ξ_k e^{−iλk}|².

Then at the point λ = 0 we have f̂_N(0; ξ) coinciding in distribution with (1/2π) times the square of the Gaussian random variable η ~ 𝒩(0, 1). Hence, for every N,

E|f̂_N(0; ξ) − f(0)|² = (1/4π²) E|η² − 1|² > 0.

Moreover, an easy calculation shows that if f(λ) is the spectral density of a stationary sequence ξ = (ξ_n) that is constructed as a moving average

ξ_n = Σ_{k=0}^{∞} a_k ε_{n−k}  (15)

with Σ_{k=0}^{∞} |a_k| < ∞, where ε = (ε_n) is white noise with Eε_n⁴ < ∞, then

lim_{N→∞} E|f̂_N(λ; ξ) − f(λ)|² = { 2f²(0), λ = 0, ±π; f²(λ), λ ≠ 0, ±π }.  (16)

f Z'
r1t WN(A.- v)]N(v;

x) dv,

(17)

which is obtained from the periodogram fN(A.; x) and a smoothing function WN(A.), and which we call a spectral window. Natural requirements on WN(A.) are: (a) WN(A.) has a sharp maximum at A. = 0; (b) J':" WN(A.) d)..= 1; (c) PI]~·(A.; ~)- f(A.W-+ 0, N-+ oo, A. E [ -n, n). By (14) and (b) the estimators ]';(A.;~) are asymptotically unbiased. Condition (c) is the condition of asymptotic consistency in mean-square, which, as we showed above, is violated for the periodogram. Finally, condition (a) ensures that the required frequency A. is "picked out" from the periodogram. Let us give some examples of estimators of the form (17). Bartlett's estimator is based on the spectral window

WN(A.)

= aNB(aNA.),

417

§4. Statistical Estimation of the Covariance Function and the Spectral Density

where aN j oo, aN/N -+ 0, N -+ oo, and 2 ( ,) = __!__ I sin(A./2) 1

B II.

2n

A./2

·

Parzen's estimator takes the spectral window to be

WN(A.) = aNP(aNA.), where aN are the same as before and p

4 ( ,) = ]_ I sin(A./4) 1 II.

8n

A./4

·

Zhurbenko's estimator is constructed from a spectral window of the form

with

Z(A.) = {

-(X+ 1 1..1.1" +(X+ 1 ' lA. I~ 1, 2cx

2cx

1..1.1 rel="nofollow"> 1,

0,

where 0 < ex ~ 2 and the aN are selected in a particular way. We shall not spend any more time on problems of estimating spectral densities; we merely note that there is an extensive statistical literature dealing with the construction of spectral windows and the comparison of the corresponding estimators ]';(A.; x).

3. We now consider the problem of estimating the spectral function F(A.) = F([ -n, A.]). We begin by defining

FN(A.) = f!N(v) dv,

FN(A.; x) =

fJN(v; x) dv,

where ]N(v; x) is the periodogram constructed with (x 0 , x 1, It follows from the proof of Herglotz' theorem (§1) that

•.. ,

xN_ 1 ).

f/).n dFN(A.)-+ f/).n dF(A.) for every n E 7L. Hence it follows (compare the corollary to Theorem 1, §3, Chapter III) that F N => F, i.e. FN(A.) converges to F(A.) at each point of continuity of F(A.). Observe that

418

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

for all lnl < N. Therefore if we suppose that RN(n; ~) converges to R(n) with probability one as N -+ oo, we have

f/An dFN().; 0-+ f/An dF(A)

(P-a.s.)

and therefore FN().; ~) ~ F().) (P-a.s.). It is then easy to deduce (if necessary, passing from a sequence to a subsequence) that if RN(n; ~)-+ R(n) in probability, then FN().; ~) ~ F().) in probability. 4.

PROBLEMS

1. In (15) let e.

~

.%(0, 1). Show that

(N- n)VRN(n,

~)-+ 2n

fy +

e 2i"l)f 2(A.) dA.

for every n, as N -+ oo. 2. Establish (16) and the following generalization: lim COV(]N(A_; ~), ]N(v; N~oo

m

2/ 2 (0),

={

/ 2 (A), ~

A.= v = 0, ±n, A.= v # 0, ±n, A.# ±v.

§5. Wold's Expansion

1. In contrast to the representation (3.2), which gives an expansion of a stationary sequence in the frequency domain, Wold's expansion operates in the time domain. The main point of this expansion is that a stationary sequence ξ = (ξ_n), n ∈ Z, can be represented as the sum of two stationary sequences, one of which is completely predictable (in the sense that its values are completely determined by its "past"), whereas the second does not have this property.

We begin with some definitions. Let H_n(ξ) = L²(ξⁿ) and H(ξ) = L²(ξ) be the closed linear manifolds spanned respectively by ξⁿ = (…, ξ_{n−1}, ξ_n) and ξ = (…, ξ_{n−1}, ξ_n, …). Let

S(ξ) = ⋂_n H_n(ξ).

For every η ∈ H(ξ), denote by

π̂_n(η) = Ê(η | H_n(ξ))

the projection of η on the subspace H_n(ξ) (see §11, Chapter II). We also write

π̂_{−∞}(η) = Ê(η | S(ξ)).

Every element η ∈ H(ξ) can be represented as

η = π̂_{−∞}(η) + (η − π̂_{−∞}(η)),

where η − π̂_{−∞}(η) ⊥ π̂_{−∞}(η). Therefore H(ξ) is represented as the orthogonal sum

H(ξ) = S(ξ) ⊕ R(ξ),

where S(ξ) consists of the elements π̂_{−∞}(η) with η ∈ H(ξ), and R(ξ) consists of the elements of the form η − π̂_{−∞}(η).

We shall now assume that Eξ_n = 0 and Vξ_n > 0. Then H(ξ) is automatically nontrivial (contains elements different from zero).

~

= (~n) is regular if

H(~) = R(~)

and singular if H(~)

=

S(~).

Remark. Singular sequences are also called deterministic and regular sequences are called purely or completely nondeterministic. If S(~) is a proper subspace of H(O we just say that~ is nondeterministic. Theorem 1. Every stationary (wide sense) random sequence decomposition

~

has a unique (1)

where

regular and ~s = ~~for all nand m).

~r = (~~)is

onal(~~ _l

PROOF.

(~~)is

singular. Here

~r

amd

~s

are orthog-

We define

for every n, we have S(~') _l S(~). On the other hand, S(~') s; S(~) and therefore S(n is trivial (contains only random sequences that coincide almost surely with zero). Consequently ~r is regular. Moreover, Hn(~) S:; Hi~ 5 ) Ef> HnW) and Hi~s) S:; Hn(~), HnW) S:; Hi~). Therefore Hn(~) = Hn(~ 5) Ef) HnW) and hence

Since

~~ _l S(~),

(2)

for every n. Since

~~ _l

S(¢) it follows from (2) that S( ~) s; H n
420

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

and therefore S(~) s; S(~") s; H(~"). But~~ s; S(~); hence H(~") £ S(~) and consequently S( ~) = S( ~") = Hce'), which means that ~· is singular. The orthogonality of~· and ~·follows in an obvious way from ~~ E S(~) and ~~ _i S( ~). Let us now show that (1) is unique. Let ~n = 11~ + 17~, where 17' and '7 8 are regular and singular orthogonal sequences. Then since Hn('1") = H(17•), we have Hn(~) =

Hn('1') E9 HnC17•) = Hn('1') E9 H(17•),

and therefore S(~) = S(17') S(~)

= H(17

Ee H(17

8 ).

But S(17') is trivial, and therefore

8 ).

Since '7~ E H(17 8 ) = S( ~) and '7~ l_ H(17") = S( ~), we have E( ~n IS(~)) = E(17~ + 11~ IS(~)) = 17~,i.e. 11~ coincides with~~; this establishes the uniqueness of (1). This completes the proof of the theorem.

2. Definition 2. Let $\xi = (\xi_n)$ be a nondegenerate stationary sequence. A random sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi$) if

(a) $\varepsilon = (\varepsilon_n)$ consists of pairwise orthogonal random variables with $\mathsf{E}\varepsilon_n = 0$, $\mathsf{E}|\varepsilon_n|^2 = 1$;
(b) $H_n(\xi) = H_n(\varepsilon)$ for all $n \in \mathbb{Z}$.

Remark. The reason for the term "innovation" is that $\varepsilon_{n+1}$ provides, so to speak, new "information" not contained in $H_n(\xi)$ (in other words, it "innovates" in $H_n(\xi)$ the information that is needed for forming $H_{n+1}(\xi)$).

The following fundamental theorem establishes a connection between one-sided moving averages (Example 4, §1) and regular sequences.

Theorem 2. A necessary and sufficient condition for a nondegenerate sequence $\xi$ to be regular is that there exist an innovation sequence $\varepsilon = (\varepsilon_n)$ and a sequence $(a_n)$, $n \geq 0$, of complex numbers with $\sum_{n=0}^{\infty} |a_n|^2 < \infty$, such that
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \qquad (\text{P-a.s.}). \qquad (3)$$

Proof. Necessity. We represent $H_n(\xi)$ in the form
$$H_n(\xi) = H_{n-1}(\xi) \oplus B_n.$$
Since $H_n(\xi)$ is spanned by the elements of $H_{n-1}(\xi)$ and elements of the form $\beta \cdot \xi_n$, where $\beta$ is a complex number, the dimension of $B_n$ is either zero or one. But the space $H_n(\xi)$ cannot coincide with $H_{n-1}(\xi)$ for any value of $n$.


In fact, if $B_n$ is trivial for some $n$, then by stationarity $B_k$ is trivial for all $k$, and therefore $H(\xi) = S(\xi)$, contradicting the assumption that $\xi$ is regular. Thus $\dim B_n = 1$.

Let $\eta_n$ be a nonzero element of $B_n$. Put
$$\varepsilon_n = \frac{\eta_n}{\|\eta_n\|},$$
where $\|\eta_n\|^2 = \mathsf{E}|\eta_n|^2 > 0$. For given $n$ and $k \geq 0$, consider the decomposition
$$H_n(\xi) = H_{n-k}(\xi) \oplus B_{n-k+1} \oplus \cdots \oplus B_n.$$
Then $\varepsilon_{n-k+1}, \ldots, \varepsilon_n$ is an orthogonal basis in $B_{n-k+1} \oplus \cdots \oplus B_n$ and
$$\xi_n = \sum_{j=0}^{k-1} a_j \varepsilon_{n-j} + \hat\pi_{n-k}(\xi_n), \qquad (4)$$
where $a_j = \mathsf{E}\xi_n\bar\varepsilon_{n-j}$. By Bessel's inequality (II.11.16),
$$\sum_{j=0}^{\infty} |a_j|^2 \leq \|\xi_n\|^2 < \infty.$$
It follows that $\sum_{j=0}^{\infty} a_j\varepsilon_{n-j}$ converges in mean square, and then, by (4), equation (3) will be established as soon as we show that
$$\hat\pi_{n-k}(\xi_n) \xrightarrow{L^2} 0, \qquad k \to \infty.$$
It is enough to consider the case $n = 0$. Since
$$\xi_0 = \hat\pi_{-k}(\xi_0) + \sum_{j=0}^{k-1} a_j \varepsilon_{-j}$$
and the terms that appear in this sum are orthogonal, we have for every $k \geq 0$
$$\|\hat\pi_{-k}(\xi_0)\|^2 = \|\xi_0\|^2 - \sum_{j=0}^{k-1} |a_j|^2.$$
Therefore the limit $\lim_{k\to\infty} \hat\pi_{-k}(\xi_0)$ exists (in mean square). Now $\hat\pi_{-k}(\xi_0) \in H_{-k}(\xi)$ for each $k$, and therefore the limit in question must belong to $\bigcap_{k > 0} H_{-k}(\xi) = S(\xi)$. But, by assumption, $S(\xi)$ is trivial, and therefore $\hat\pi_{-k}(\xi_0) \xrightarrow{L^2} 0$, $k \to \infty$.

Sufficiency. Let the nondegenerate sequence $\xi$ have the representation (3), where $\varepsilon = (\varepsilon_n)$ is an orthonormal system (not necessarily satisfying the condition $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$). Then $H_n(\xi) \subseteq H_n(\varepsilon)$, and therefore $S(\xi) = \bigcap_k H_k(\xi) \subseteq H_n(\varepsilon)$ for every $n$. But $\varepsilon_{n+1} \perp H_n(\varepsilon)$, and therefore $\varepsilon_{n+1} \perp S(\xi)$ for every $n$; at the same time $\varepsilon = (\varepsilon_n)$ is a basis in $H(\varepsilon) \supseteq S(\xi)$. It follows that $S(\xi)$ is trivial, and consequently $\xi$ is regular. This completes the proof of the theorem.


Remark. It follows from the proof that a nondegenerate sequence $\xi$ is regular if and only if it admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} \tilde a_k \tilde\varepsilon_{n-k}, \qquad (5)$$
where $\tilde\varepsilon = (\tilde\varepsilon_n)$ is an orthonormal system which (it is important to emphasize this!) does not necessarily satisfy the condition $H_n(\xi) = H_n(\tilde\varepsilon)$, $n \in \mathbb{Z}$. In this sense the conclusion of Theorem 2 says more, specifically that for a regular sequence $\xi$ there exist $a = (a_n)$ and an orthonormal system $\varepsilon = (\varepsilon_n)$ such that not only (5) but also (3) is satisfied, with $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$.

The following theorem is an immediate corollary of Theorems 1 and 2.

Theorem 3 (Wold's Expansion). If $\xi = (\xi_n)$ is a nondegenerate stationary sequence, then
$$\xi_n = \xi_n^s + \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \qquad (6)$$
where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi^r$).
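The moving-average part of (6) can be illustrated numerically. The following sketch is an added illustration, not part of the original text; it assumes Python with numpy, and the MA(1) coefficients $a_0 = 2$, $a_1 = 1$ are a hypothetical example. It builds $\xi_n = \sum_k a_k\varepsilon_{n-k}$ from simulated orthonormal white noise and checks empirically that, for real coefficients, the covariance function is $R(n) = \sum_k a_k a_{k+n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 1.0])           # coefficients a_0, a_1 (hypothetical example)
N = 400_000
eps = rng.standard_normal(N)       # orthonormal noise: E eps_n = 0, E eps_n^2 = 1

# One-sided (here finite) moving average  xi_n = sum_k a_k eps_{n-k}
xi = np.convolve(eps, a, mode="valid")

# For real coefficients the covariance is R(n) = sum_k a_k a_{k+n}:
# R(0) = 4 + 1 = 5, R(1) = 2*1 = 2, R(n) = 0 for n >= 2.
R_theory = {0: 5.0, 1: 2.0, 2: 0.0}
for n, r in R_theory.items():
    r_hat = np.mean(xi[: len(xi) - n] * xi[n:])
    assert abs(r_hat - r) < 0.1, (n, r_hat, r)
print("empirical covariances match R(n) = sum_k a_k a_{k+n}")
```

The same computation applies to any square-summable sequence of real coefficients; only the finite truncation is specific to this sketch.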

3. The significance of the concepts introduced here (regular and singular sequences) becomes particularly clear if we consider the following (linear) extrapolation problem, for whose solution the Wold expansion (6) is especially useful.

Let $H_0(\xi) = \overline{L}^2(\xi^0)$ be the closed linear manifold spanned by the variables $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consider the problem of constructing an optimal (least-squares) linear estimator $\hat\xi_n$ of $\xi_n$ in terms of the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. It follows from §11, Chapter II, that
$$\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi)). \qquad (7)$$
(In the notation of Subsection 1, $\hat\xi_n = \hat\pi_0(\xi_n)$.) Since $\xi^r$ and $\xi^s$ are orthogonal and $H_0(\xi) = H_0(\xi^r) \oplus H_0(\xi^s)$, we obtain, by using (6),
$$\begin{aligned}
\hat\xi_n &= \hat{\mathsf{E}}(\xi_n^r + \xi_n^s \mid H_0(\xi)) = \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi)) + \hat{\mathsf{E}}(\xi_n^s \mid H_0(\xi))\\
&= \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi^r) \oplus H_0(\xi^s)) + \hat{\mathsf{E}}(\xi_n^s \mid H_0(\xi^r) \oplus H_0(\xi^s))\\
&= \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi^r)) + \xi_n^s = \xi_n^s + \hat{\mathsf{E}}\Bigl(\sum_{k=0}^{\infty} a_k\varepsilon_{n-k} \Bigm| H_0(\xi^r)\Bigr).
\end{aligned}$$
In (6), the sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence for $\xi^r = (\xi_n^r)$, and therefore $H_0(\xi^r) = H_0(\varepsilon)$. Therefore
$$\hat\xi_n = \xi_n^s + \hat{\mathsf{E}}\Bigl(\sum_{k=0}^{\infty} a_k\varepsilon_{n-k} \Bigm| H_0(\varepsilon)\Bigr) = \xi_n^s + \sum_{k=n}^{\infty} a_k \varepsilon_{n-k}, \qquad (8)$$


and the mean-square error of predicting $\xi_n$ by $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is
$$\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \qquad (9)$$

We can draw two important conclusions.

(a) If $\xi$ is singular, then for every $n \geq 1$ the (extrapolation) error $\sigma_n^2$ is zero; in other words, we can predict $\xi_n$ without error from its "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$.

(b) If $\xi$ is regular, then $\sigma_n^2 \leq \sigma_{n+1}^2$ and
$$\lim_{n\to\infty} \sigma_n^2 = \sum_{k=0}^{\infty} |a_k|^2. \qquad (10)$$
Since
$$\sum_{k=0}^{\infty} |a_k|^2 = \mathsf{E}|\xi_n|^2,$$
it follows from (10) and (9) that
$$\sigma_n^2 \uparrow \mathsf{E}|\xi_n|^2, \qquad n \to \infty;$$
i.e. as $n$ increases, the prediction of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ becomes trivial (reducing simply to $\mathsf{E}\xi_n = 0$).

4. Let us suppose that $\xi$ is a nondegenerate regular stationary sequence. According to Theorem 2, every such sequence admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \qquad (11)$$
where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and the orthonormal sequence $\varepsilon = (\varepsilon_n)$ has the important property that
$$H_n(\xi) = H_n(\varepsilon), \qquad n \in \mathbb{Z}. \qquad (12)$$
The representation (11) means (see Subsection 3, §3) that $\xi_n$ can be interpreted as the output signal of a physically realizable filter with impulse response $a = (a_k)$, $k \geq 0$, when the input is $\varepsilon = (\varepsilon_n)$.

Like any sequence of two-sided moving averages, a regular sequence has a spectral density $f(\lambda)$. But since a regular sequence admits a representation as a one-sided moving average, it is possible to obtain additional information about the properties of its spectral density. In the first place, it is clear that
$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$


where
$$\varphi(\lambda) = \sum_{k=0}^{\infty} a_k e^{-i\lambda k}. \qquad (13)$$
Put
$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \qquad (14)$$
This function is analytic in the open domain $|z| < 1$, and since $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ it belongs to the Hardy class $H_2$, the class of functions $g = g(z)$, analytic in $|z| < 1$, satisfying
$$\sup_{0 \leq r < 1} \frac{1}{2\pi} \int_{-\pi}^{\pi} |g(re^{i\theta})|^2\, d\theta < \infty. \qquad (15)$$
In fact,
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |\Phi(re^{i\theta})|^2\, d\theta = \sum_{k=0}^{\infty} |a_k|^2 r^{2k}$$
and
$$\sup_{0 \leq r < 1} \sum_{k=0}^{\infty} |a_k|^2 r^{2k} = \sum_{k=0}^{\infty} |a_k|^2 < \infty.$$
It is shown in the theory of functions of a complex variable that the boundary function $\Phi(e^{-i\lambda})$, $-\pi \leq \lambda < \pi$, of a function $\Phi \in H_2$ that is not identically zero satisfies
$$\int_{-\pi}^{\pi} \ln|\Phi(e^{-i\lambda})|\, d\lambda > -\infty. \qquad (16)$$
In our case
$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2,$$
whence
$$\ln f(\lambda) = \ln\frac{1}{2\pi} + 2\ln|\Phi(e^{-i\lambda})|,$$
and consequently the spectral density $f(\lambda)$ of a regular process satisfies
$$\int_{-\pi}^{\pi} \ln f(\lambda)\, d\lambda > -\infty. \qquad (17)$$
On the other hand, let the spectral density $f(\lambda)$ satisfy (17). It again follows from the theory of functions of a complex variable that there is then a function $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$ in the class $H_2$ such that, almost everywhere with respect to Lebesgue measure,
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2.$$

Therefore, if we put $\varphi(\lambda) = \Phi(e^{-i\lambda})$, we obtain
$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$
where $\varphi(\lambda)$ is of the form (13). Then it follows from the corollary to Theorem 3, §3, that $\xi$ admits a representation as a one-sided moving average (11), where $\varepsilon = (\varepsilon_n)$ is an orthonormal sequence. From this and from the Remark on Theorem 2, it follows that $\xi$ is regular. Thus we have the following theorem.

Theorem 4 (Kolmogorov). Let $\xi$ be a nondegenerate regular stationary sequence. Then there is a spectral density $f(\lambda)$ such that
$$\int_{-\pi}^{\pi} \ln f(\lambda)\, d\lambda > -\infty. \qquad (18)$$
In particular, $f(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure).

Conversely, if $\xi$ is a stationary sequence with a spectral density satisfying (18), then the sequence is regular.
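As a numerical aside (added here, not part of the original text; Python with numpy is assumed), condition (18) can be checked directly for the density $f(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2$ of the moving average $\xi_n = 2\varepsilon_n + \varepsilon_{n-1}$. The same computation illustrates the Szego-Kolmogorov formula $\sigma_1^2 = 2\pi\exp\{\frac{1}{2\pi}\int\ln f\,d\lambda\}$ (stated as Problem 3 in §6), which here recovers the one-step prediction error $|a_0|^2 = 4$.

```python
import numpy as np

# Spectral density of the MA(1) sequence xi_n = 2*eps_n + eps_{n-1}
lam = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
f = np.abs(2.0 + np.exp(-1j * lam)) ** 2 / (2 * np.pi)

mean_log = np.log(f).mean()        # = (1/2pi) * integral of ln f over [-pi, pi)
assert np.isfinite(mean_log)       # condition (18): the integral of ln f is finite

# Szego-Kolmogorov: 2*pi*exp{(1/2pi) int ln f dlam} = |a_0|^2 = 4, the error in (9)
assert abs(2 * np.pi * np.exp(mean_log) - 4.0) < 1e-6
print("2*pi*exp(mean ln f) =", 2 * np.pi * np.exp(mean_log))
```

The mean over a uniform periodic grid is used as the normalized integral; for this smooth density the quadrature error is negligible.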

PROBLEMS

1. Show that a stationary sequence with discrete spectrum (piecewise-constant spectral function $F(\lambda)$) is singular.

2. Let $\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2$, $\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi))$. Show that if $\sigma_n^2 = 0$ for some $n \geq 1$, the sequence is singular; if $\sigma_n^2 \to R(0)$ as $n \to \infty$, the sequence is regular.

3. Show that the stationary sequence $\xi = (\xi_n)$, $\xi_n = e^{i\varphi n}$, where $\varphi$ is a uniform random variable on $[0, 2\pi]$, is regular. Find the estimator $\hat\xi_n$ and the number $\sigma_n^2$, and show that the nonlinear estimator
$$\tilde\xi_n = \left(\frac{\xi_0}{\xi_{-1}}\right)^{\!n}\xi_0$$
provides a correct estimate of $\xi_n$ from the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, i.e.
$$\mathsf{E}|\xi_n - \tilde\xi_n|^2 = 0, \qquad n \geq 1.$$

§6. Extrapolation, Interpolation and Filtering

1. Extrapolation. According to the preceding section, a singular sequence admits error-free prediction (extrapolation) of $\xi_n$, $n \geq 1$, in terms of the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consequently it is reasonable, in considering the extrapolation problem for arbitrary stationary sequences, to begin with the case of regular sequences.

According to Theorem 2 of §5, every regular sequence $\xi = (\xi_n)$ admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \qquad (1)$$
with $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and some innovation sequence $\varepsilon = (\varepsilon_n)$. It follows from §5 that the representation (1) solves the problem of finding the optimal (linear) estimator $\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi))$, since, by (5.8),
$$\hat\xi_n = \sum_{k=n}^{\infty} a_k \varepsilon_{n-k} \qquad (2)$$
and
$$\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \qquad (3)$$
However, this can be considered only as a theoretical solution, for the following reasons. The sequences that we consider are ordinarily given not by means of their representations (1), but by their covariance functions $R(n)$ or their spectral densities $f(\lambda)$ (which exist for regular sequences). Hence the solution (2) can be regarded as satisfactory only if the coefficients $a_k$ are expressed in terms of $R(n)$ or of $f(\lambda)$, and the $\varepsilon_k$ in terms of the values $\ldots, \xi_{k-1}, \xi_k$. Without discussing the problem in general, we consider only the special case (of interest in applications) in which the spectral density has the form
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2, \qquad (4)$$
where $\Phi(z) = \sum_{k=0}^{\infty} b_k z^k$ has radius of convergence $r > 1$ and has no zeros in $|z| \leq 1$. Let
$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z(d\lambda) \qquad (5)$$
be the spectral representation of $\xi = (\xi_n)$, $n \in \mathbb{Z}$.

Theorem 1. If the spectral density of $\xi$ has the form (4), then the optimal (linear) estimator $\hat\xi_n$ of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is given by
$$\hat\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda), \qquad (6)$$
where
$$\hat\varphi_n(\lambda) = e^{i\lambda n}\,\frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})} \qquad (7)$$
and
$$\Phi_n(z) = \sum_{k=n}^{\infty} b_k z^k.$$

Proof. According to the Remark on Theorem 2 of §3, every variable $\hat\eta \in H_0(\xi)$ admits a representation in the form
$$\hat\eta = \int_{-\pi}^{\pi} \varphi(\lambda) Z(d\lambda), \qquad (8)$$
where $\varphi$ belongs to $H_0(F)$, the closed linear manifold spanned by the functions $e_n = e^{i\lambda n}$ for $n \leq 0$ (here $F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\, d\nu$). Since
$$\mathsf{E}|\xi_n - \hat\eta|^2 = \mathsf{E}\left|\int_{-\pi}^{\pi} (e^{i\lambda n} - \varphi(\lambda))Z(d\lambda)\right|^2 = \int_{-\pi}^{\pi} |e^{i\lambda n} - \varphi(\lambda)|^2 f(\lambda)\, d\lambda,$$
the proof that (6) is optimal reduces to proving that
$$\inf_{\varphi \in H_0(F)} \int_{-\pi}^{\pi} |e^{i\lambda n} - \varphi(\lambda)|^2 f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 f(\lambda)\, d\lambda. \qquad (9)$$
It follows from Hilbert-space theory (§11, Chapter II) that the optimal function $\hat\varphi_n(\lambda)$ (in the sense of (9)) is determined by the two conditions
$$\text{(1) } \hat\varphi_n(\lambda) \in H_0(F), \qquad \text{(2) } e^{i\lambda n} - \hat\varphi_n(\lambda) \perp H_0(F). \qquad (10)$$
Since
$$e^{i\lambda n}\Phi_n(e^{-i\lambda}) = e^{i\lambda n}\bigl[b_n e^{-i\lambda n} + b_{n+1}e^{-i\lambda(n+1)} + \cdots\bigr] = b_n + b_{n+1}e^{-i\lambda} + \cdots$$
and, in a similar way, $1/\Phi(e^{-i\lambda}) \in H_0(F)$, the function $\hat\varphi_n(\lambda)$ defined in (7) belongs to $H_0(F)$. Therefore, in proving that $\hat\varphi_n(\lambda)$ is optimal, it is sufficient to verify that, for every $m \geq 0$,
$$e^{i\lambda n} - \hat\varphi_n(\lambda) \perp e^{-i\lambda m},$$
i.e.
$$I_{n,m} \equiv \int_{-\pi}^{\pi} e^{i\lambda(n+m)}\left[1 - \frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})}\right] f(\lambda)\, d\lambda = 0, \qquad m \geq 0.$$
The following chain of equations shows that this is actually the case:
$$\begin{aligned}
I_{n,m} &= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\left[1 - \frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})}\right]|\Phi(e^{-i\lambda})|^2\, d\lambda\\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\bigl[\Phi(e^{-i\lambda}) - \Phi_n(e^{-i\lambda})\bigr]\,\overline{\Phi(e^{-i\lambda})}\, d\lambda\\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\Bigl[\sum_{k=0}^{n-1} b_k e^{-i\lambda k}\Bigr]\Bigl[\sum_{l=0}^{\infty} \bar b_l e^{i\lambda l}\Bigr] d\lambda = 0,
\end{aligned}$$
where the last equation follows because each term is of the form $e^{i\lambda(n+m-k+l)}$ with $n - k \geq 1$, $m \geq 0$, $l \geq 0$, so that
$$\int_{-\pi}^{\pi} e^{i\lambda(n+m-k+l)}\, d\lambda = 0,$$
and the term-by-term integration is justified by the hypothesis $r > 1$. This completes the proof of the theorem.

Remark 1. Expanding $\hat\varphi_n(\lambda)$ in a Fourier series,
$$\hat\varphi_n(\lambda) = c_0 + c_{-1}e^{-i\lambda} + c_{-2}e^{-2i\lambda} + \cdots,$$
we find that the predicted value $\hat\xi_n$ of $\xi_n$, $n \geq 1$, in terms of the past $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is given by the formula
$$\hat\xi_n = c_0\xi_0 + c_{-1}\xi_{-1} + c_{-2}\xi_{-2} + \cdots.$$

Remark 2. A typical example of a spectral density that can be represented in the form (4) is the rational function
$$f(\lambda) = \frac{1}{2\pi}\,\frac{|P(e^{-i\lambda})|^2}{|Q(e^{-i\lambda})|^2},$$
where the polynomials $P(z) = a_0 + a_1 z + \cdots + a_p z^p$ and $Q(z) = 1 + b_1 z + \cdots + b_q z^q$ have no zeros in $\{z\colon |z| \leq 1\}$. In fact, in this case it is enough to put $\Phi(z) = P(z)/Q(z)$. Then $\Phi(z) = \sum_{k=0}^{\infty} c_k z^k$, and the radius of convergence of this series is greater than one.

Let us illustrate Theorem 1 with two examples.

Example 1. Let the spectral density be
$$f(\lambda) = \frac{1}{2\pi}(5 + 4\cos\lambda).$$
The corresponding covariance function $R(n)$ has the shape of a triangle with
$$R(0) = 5, \qquad R(\pm 1) = 2, \qquad R(n) = 0 \quad\text{for } |n| \geq 2. \qquad (11)$$
Since this spectral density can be represented in the form
$$f(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2,$$
we may apply Theorem 1 with $\Phi(z) = 2 + z$. We find easily that
$$\hat\varphi_1(\lambda) = e^{i\lambda}\,\frac{e^{-i\lambda}}{2 + e^{-i\lambda}} = \frac{1}{2 + e^{-i\lambda}}, \qquad \hat\varphi_n(\lambda) = 0 \quad\text{for } n \geq 2. \qquad (12)$$
Therefore $\hat\xi_n = 0$ for all $n \geq 2$; i.e. the (linear) prediction of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is trivial, which is not at all surprising if we observe that, by (11), the correlation between $\xi_n$ and any of $\xi_0, \xi_{-1}, \ldots$ is zero for $n \geq 2$.

For $n = 1$ we find from (6) and (12) that
$$\hat\xi_1 = \int_{-\pi}^{\pi}\frac{Z(d\lambda)}{2 + e^{-i\lambda}} = \frac{1}{2}\int_{-\pi}^{\pi}\Bigl(1 + \frac{e^{-i\lambda}}{2}\Bigr)^{-1} Z(d\lambda) = \sum_{k=0}^{\infty}\frac{(-1)^k}{2^{k+1}}\int_{-\pi}^{\pi} e^{-i\lambda k}Z(d\lambda) = \sum_{k=0}^{\infty}\frac{(-1)^k}{2^{k+1}}\,\xi_{-k}.$$
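The geometric expansion behind the formula for $\hat\xi_1$ can be verified numerically. The sketch below is an added illustration, not part of the original text (Python with numpy assumed): it recovers the Fourier coefficients of $1/(2 + e^{-i\lambda})$ and checks that they equal $(-1)^k/2^{k+1}$.

```python
import numpy as np

# Fourier coefficients of 1/(2 + e^{-i*lam}) on a uniform periodic grid
lam = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
g = 1.0 / (2.0 + np.exp(-1j * lam))

for k in range(8):
    # coefficient of e^{-i*lam*k}: (1/2pi) * integral of g(lam) * e^{+i*lam*k}
    c_k = (g * np.exp(1j * lam * k)).mean()
    assert abs(c_k - (-1) ** k / 2 ** (k + 1)) < 1e-10, (k, c_k)
print("predictor coefficients match (-1)^k / 2^(k+1)")
```

Because the integrand is smooth and periodic, the grid mean reproduces the normalized integral essentially to machine precision.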

Example 2. Let the covariance function be
$$R(n) = a^{|n|}, \qquad |a| < 1.$$
Then (see Example 5 in §1)
$$f(\lambda) = \frac{1}{2\pi}\cdot\frac{1 - |a|^2}{|1 - ae^{-i\lambda}|^2},$$
i.e.
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2,$$
where
$$\Phi(z) = \frac{(1 - |a|^2)^{1/2}}{1 - az} = (1 - |a|^2)^{1/2}\sum_{k=0}^{\infty}(az)^k,$$
from which
$$\hat\varphi_n(\lambda) = e^{i\lambda n}\,\frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})} = e^{i\lambda n}\cdot a^n e^{-i\lambda n} = a^n$$
and
$$\hat\xi_n = \int_{-\pi}^{\pi} a^n Z(d\lambda) = a^n\xi_0.$$
In other words, in order to predict the value of $\xi_n$ from the observations $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, it is sufficient to know only the last observation $\xi_0$.
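A small added check, not in the original (Python with numpy; $a = 0.6$ and $n = 3$ are hypothetical values): the estimator $a^n\xi_0$ indeed satisfies the orthogonality conditions $\mathsf{E}[(\xi_n - a^n\xi_0)\bar\xi_{-k}] = 0$, and the corresponding prediction error agrees with formula (3).

```python
import numpy as np

a, n = 0.6, 3                     # hypothetical parameter values
R = lambda m: a ** abs(m)         # covariance function R(m) = a^{|m|}

# Orthogonality of the error: E[(xi_n - a^n xi_0) xi_{-k}] = R(n+k) - a^n R(k) = 0
for k in range(10):
    assert abs(R(n + k) - a ** n * R(k)) < 1e-12

# Prediction error, consistent with (3): Phi(z) = sqrt(1-a^2)/(1-az) gives
# a_k = sqrt(1-a^2)*a^k, hence sigma_n^2 = sum_{k<n} a_k^2 = 1 - a^(2n)
sigma2 = sum((np.sqrt(1 - a ** 2) * a ** k) ** 2 for k in range(n))
assert abs(sigma2 - (1 - a ** (2 * n))) < 1e-12
print("orthogonality holds; sigma_n^2 = 1 - a^(2n) =", round(sigma2, 6))
```

The computation is purely algebraic, so the assertions hold to rounding error for any real $|a| < 1$.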

Remark 3. It follows from the Wold expansion of a regular sequence $\xi = (\xi_n)$,
$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k}, \qquad (13)$$
that the spectral density $f(\lambda)$ admits the representation
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2, \qquad (14)$$
where
$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \qquad (15)$$
It is evident that the converse also holds: if $f(\lambda)$ admits the representation (14) with a function $\Phi(z)$ of the form (15), then the Wold expansion of $\xi_n$ has the form (13). Therefore the problem of representing the spectral density in the form (14) and the problem of determining the coefficients $a_k$ in the Wold expansion are equivalent.

The assumptions in Theorem 1 that $\Phi(z)$ has no zeros for $|z| \leq 1$ and that $r > 1$ are in fact not essential. In other words, if the spectral density of a regular sequence is represented in the form (14), then the optimal estimator (in the mean-square sense) for $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is determined by formulas (6) and (7).

Remark 4. Theorem 1 (with the preceding remark) solves the prediction problem for regular sequences. Let us show that in fact the same answer remains valid for arbitrary stationary sequences. More precisely, let
$$\tilde\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda),$$
and let $f^r(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2$ be the spectral density of the regular sequence $\xi^r = (\xi_n^r)$. Then $\hat\xi_n$ is determined by (6) and (7). In fact, let (see Subsection 3, §5)
$$\hat\xi_n^r = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z^r(d\lambda),$$
where $Z^r(\Delta)$ is the orthogonal stochastic measure in the representation of the regular sequence $\xi^r$. Then
$$\mathsf{E}|\xi_n - \tilde\xi_n|^2 = \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 F(d\lambda) \geq \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 F^r(d\lambda) = \mathsf{E}|\xi_n^r - \hat\xi_n^r|^2. \qquad (16)$$
But $\xi_n - \tilde\xi_n = \xi_n^r - \hat\xi_n^r$. Hence $\mathsf{E}|\xi_n - \tilde\xi_n|^2 = \mathsf{E}|\xi_n^r - \hat\xi_n^r|^2$, and it follows from (16) that we may take
$$\hat\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda).$$

2. Interpolation. Suppose that $\xi = (\xi_n)$ is a regular sequence with spectral density $f(\lambda)$. The simplest interpolation problem is that of constructing the optimal (mean-square) linear estimator of $\xi_0$ from the results of the measurements $\{\xi_n,\ n = \pm 1, \pm 2, \ldots\}$, i.e. omitting $\xi_0$.

Let $H^0(\xi)$ be the closed linear manifold spanned by the $\xi_n$, $n \neq 0$. Then, according to Theorem 2 of §3, every random variable $\eta \in H^0(\xi)$ can be represented in the form
$$\eta = \int_{-\pi}^{\pi} \varphi(\lambda) Z(d\lambda),$$
where $\varphi$ belongs to $H^0(F)$, the closed linear manifold spanned by the functions $e^{i\lambda n}$, $n \neq 0$. The estimator
$$\hat\xi_0 = \int_{-\pi}^{\pi} \hat\varphi(\lambda) Z(d\lambda) \qquad (17)$$
will be optimal if and only if
$$\inf_{\eta \in H^0(\xi)} \mathsf{E}|\xi_0 - \eta|^2 = \inf_{\varphi \in H^0(F)} \int_{-\pi}^{\pi} |1 - \varphi(\lambda)|^2 F(d\lambda) = \int_{-\pi}^{\pi} |1 - \hat\varphi(\lambda)|^2 F(d\lambda) = \mathsf{E}|\xi_0 - \hat\xi_0|^2.$$
It follows from the perpendicularity properties of the Hilbert space $H^0(F)$ that $\hat\varphi(\lambda)$ is completely determined (compare (10)) by the two conditions
$$\text{(1) } \hat\varphi(\lambda) \in H^0(F), \qquad \text{(2) } 1 - \hat\varphi(\lambda) \perp H^0(F). \qquad (18)$$

Theorem 2 (Kolmogorov). Let $\xi = (\xi_n)$ be a regular sequence such that
$$\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)} < \infty. \qquad (19)$$
Then
$$\hat\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}, \qquad (20)$$
where
$$\alpha = 2\pi\left[\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)}\right]^{-1}, \qquad (21)$$
and the interpolation error $\delta^2 = \mathsf{E}|\xi_0 - \hat\xi_0|^2$ is given by $\delta^2 = 2\pi\alpha$.

Proof. We shall give the proof only under the very stringent hypothesis on the spectral density that
$$0 < c \leq f(\lambda) \leq C < \infty. \qquad (22)$$
It follows from condition (2) of (18) that
$$\int_{-\pi}^{\pi} [1 - \hat\varphi(\lambda)] e^{in\lambda} f(\lambda)\, d\lambda = 0 \qquad (23)$$
for every $n \neq 0$. By (22), the function $[1 - \hat\varphi(\lambda)]f(\lambda)$ belongs to the Hilbert space $L^2([-\pi, \pi], \mathscr{B}[-\pi, \pi], \mu)$ with Lebesgue measure $\mu$. In this space the functions $\{e^{in\lambda}/\sqrt{2\pi},\ n = 0, \pm 1, \ldots\}$ form an orthonormal basis (Problem 7, §11, Chapter II). Hence it follows from (23) that $[1 - \hat\varphi(\lambda)]f(\lambda)$ is a constant, which we denote by $\alpha$. Thus the second condition in (18) leads to the conclusion that
$$\hat\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}. \qquad (24)$$
Starting from the first condition of (18), we now determine $\alpha$. By (22), $\hat\varphi \in L^2$, and the condition $\hat\varphi \in H^0(F)$ is equivalent to the condition that $\hat\varphi$ belong to the closed (in the $L^2$ norm) linear manifold spanned by the functions $e^{i\lambda n}$, $n \neq 0$. Hence it is clear that the zeroth coefficient in the Fourier expansion of $\hat\varphi(\lambda)$ must be zero. Therefore
$$0 = \int_{-\pi}^{\pi}\hat\varphi(\lambda)\, d\lambda = 2\pi - \alpha\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)},$$
and hence $\alpha$ is determined by (21). Finally,
$$\delta^2 = \mathsf{E}|\xi_0 - \hat\xi_0|^2 = \int_{-\pi}^{\pi}|1 - \hat\varphi(\lambda)|^2 f(\lambda)\, d\lambda = |\alpha|^2\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)} = 4\pi^2\left[\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)}\right]^{-1} = 2\pi\alpha.$$
This completes the proof (under condition (22)).

Corollary. If
$$\hat\varphi(\lambda) = \sum_{0 < |k| < \infty} c_k e^{i\lambda k},$$
then
$$\hat\xi_0 = \sum_{0 < |k| < \infty} c_k\xi_k.$$

Example 3. Let $f(\lambda)$ be the spectral density in Example 2 above. Then an easy calculation shows that
$$\hat\varphi(\lambda) = \frac{ae^{-i\lambda} + \bar a e^{i\lambda}}{1 + |a|^2}, \qquad \hat\xi_0 = \frac{a\xi_{-1} + \bar a\xi_1}{1 + |a|^2},$$
and the interpolation error is
$$\delta^2 = 2\pi\alpha = \frac{1 - |a|^2}{1 + |a|^2}.$$
3. Filtering. Let $(\theta, \xi) = ((\theta_n), (\xi_n))$, $n \in \mathbb{Z}$, be a partially observed sequence, where $\theta = (\theta_n)$ and $\xi = (\xi_n)$ are respectively the unobserved and the observed components.

Each of the sequences $\theta$ and $\xi$ will be supposed stationary (wide sense) with zero mean; let the spectral representations be
$$\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\theta(d\lambda), \qquad \xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\xi(d\lambda).$$
We write
$$F_\theta(\Delta) = \mathsf{E}|Z_\theta(\Delta)|^2, \qquad F_\xi(\Delta) = \mathsf{E}|Z_\xi(\Delta)|^2$$
and
$$F_{\theta\xi}(\Delta) = \mathsf{E} Z_\theta(\Delta)\overline{Z_\xi(\Delta)}.$$
In addition, we suppose that $\theta$ and $\xi$ are connected in a stationary way, i.e. that their covariance functions $\operatorname{cov}(\theta_n, \xi_m) = \mathsf{E}\theta_n\bar\xi_m$ depend only on the difference $n - m$. Let $R_{\theta\xi}(n) = \mathsf{E}\theta_n\bar\xi_0$; then
$$R_{\theta\xi}(n) = \int_{-\pi}^{\pi} e^{i\lambda n} F_{\theta\xi}(d\lambda).$$

The filtering problem that we shall consider is the construction of the optimal (mean-square) linear estimator $\hat\theta_n$ of $\theta_n$ in terms of some observation of the sequence $\xi$.

The problem is easily solved under the assumption that $\hat\theta_n$ is to be constructed from all the values $\xi_m$, $m \in \mathbb{Z}$. In fact, since $\hat\theta_n = \hat{\mathsf{E}}(\theta_n \mid H(\xi))$, there is a function $\hat\varphi(\lambda)$ such that
$$\hat\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\hat\varphi(\lambda) Z_\xi(d\lambda). \qquad (25)$$
As in Subsections 1 and 2, the conditions to impose on the optimal $\hat\varphi(\lambda)$ are
$$e^{i\lambda n}\hat\varphi(\lambda) \in H(F_\xi), \qquad \theta_n - \hat\theta_n \perp H(\xi).$$
From the latter condition we find $\mathsf{E}(\theta_n - \hat\theta_n)\bar\xi_m = 0$ for every $m \in \mathbb{Z}$, i.e.
$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)} F_{\theta\xi}(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda(n-m)}\hat\varphi(\lambda) F_\xi(d\lambda). \qquad (26)$$
Therefore, if we suppose that $F_{\theta\xi}(\lambda)$ and $F_\xi(\lambda)$ have densities $f_{\theta\xi}(\lambda)$ and $f_\xi(\lambda)$, we find from (26) that
$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)}\bigl[f_{\theta\xi}(\lambda) - \hat\varphi(\lambda)f_\xi(\lambda)\bigr]\, d\lambda = 0.$$
If $f_\xi(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure), we find immediately that
$$\hat\varphi(\lambda) = \frac{f_{\theta\xi}(\lambda)}{f_\xi(\lambda)}. \qquad (27)$$
In the general case we put
$$\hat\varphi(\lambda) = f_{\theta\xi}(\lambda) f_\xi^{\oplus}(\lambda),$$
where $f_\xi^{\oplus}(\lambda)$ is the "pseudotransform" of $f_\xi(\lambda)$, i.e.
$$f_\xi^{\oplus}(\lambda) = \begin{cases} 1/f_\xi(\lambda), & f_\xi(\lambda) \neq 0,\\ 0, & f_\xi(\lambda) = 0. \end{cases}$$
Then the filtering error is
$$\Delta^2 \equiv \mathsf{E}|\theta_n - \hat\theta_n|^2 = \int_{-\pi}^{\pi}\bigl[f_\theta(\lambda) - |f_{\theta\xi}(\lambda)|^2 f_\xi^{\oplus}(\lambda)\bigr]\, d\lambda. \qquad (28)$$
As is easily verified, $e^{i\lambda n}\hat\varphi \in H(F_\xi)$, and consequently the estimator (25), with the function (27), is optimal.

Example 4 (Detection of a signal in the presence of noise). Let $\xi_n = \theta_n + \eta_n$, where the signal $\theta = (\theta_n)$ and the noise $\eta = (\eta_n)$ are uncorrelated sequences with spectral densities $f_\theta(\lambda)$ and $f_\eta(\lambda)$. Then
$$\hat\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\hat\varphi(\lambda) Z_\xi(d\lambda),$$
where
$$\hat\varphi(\lambda) = f_\theta(\lambda)\bigl[f_\theta(\lambda) + f_\eta(\lambda)\bigr]^{\oplus},$$
and the filtering error is
$$\Delta^2 = \int_{-\pi}^{\pi} f_\theta(\lambda) f_\eta(\lambda)\bigl[f_\theta(\lambda) + f_\eta(\lambda)\bigr]^{\oplus}\, d\lambda.$$
The solution (25) obtained above can now be used to construct an optimal estimator $\hat{\hat\theta}_{n+m}$ of $\theta_{n+m}$ from the observations $\xi_k$, $k \leq n$, where $m$ is a given element of $\mathbb{Z}$.

Let us suppose that $\xi = (\xi_n)$ is regular, with spectral density
$$f_\xi(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2,$$
where $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$. By the Wold expansion,
$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k},$$
where $\varepsilon = (\varepsilon_k)$ is white noise with the spectral resolution
$$\varepsilon_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\varepsilon(d\lambda).$$
Since
$$Z_\xi(d\lambda) = \Phi(e^{-i\lambda}) Z_\varepsilon(d\lambda)$$
and
$$\hat\theta_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda(n+m)}\hat\varphi(\lambda)\Phi(e^{-i\lambda}) Z_\varepsilon(d\lambda) = \sum_{k \leq n+m} \hat a_{n+m-k}\,\varepsilon_k,$$
where
$$\hat a_l = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda l}\hat\varphi(\lambda)\Phi(e^{-i\lambda})\, d\lambda, \qquad (29)$$
then
$$\hat{\hat\theta}_{n+m} = \hat{\mathsf{E}}(\hat\theta_{n+m} \mid H_n(\xi)).$$
But $H_n(\xi) = H_n(\varepsilon)$, and therefore
$$\hat{\hat\theta}_{n+m} = \sum_{k \leq n}\hat a_{n+m-k}\,\varepsilon_k = \int_{-\pi}^{\pi}\Bigl[\sum_{k \leq n}\hat a_{n+m-k} e^{i\lambda k}\Bigr] Z_\varepsilon(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n}\Bigl[\sum_{l=0}^{\infty}\hat a_{l+m} e^{-i\lambda l}\Bigr]\Phi^{\oplus}(e^{-i\lambda}) Z_\xi(d\lambda),$$
where $\Phi^{\oplus}$ is the pseudotransform of $\Phi$. We have therefore established the following theorem.

Theorem 3. If the sequence $\xi = (\xi_n)$ under observation is regular, then the optimal (mean-square) linear estimator $\hat{\hat\theta}_{n+m}$ of $\theta_{n+m}$ in terms of $\xi_k$, $k \leq n$, is given by
$$\hat{\hat\theta}_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda n}\hat H_m(e^{-i\lambda}) Z_\xi(d\lambda), \qquad (30)$$
where
$$\hat H_m(e^{-i\lambda}) = \Bigl[\sum_{l=0}^{\infty}\hat a_{l+m} e^{-i\lambda l}\Bigr]\Phi^{\oplus}(e^{-i\lambda}) \qquad (31)$$
and the coefficients $\hat a_k$ are defined by (29).

PROBLEMS

1. Let $\xi$ be a nondegenerate regular sequence with spectral density (4). Show that $\Phi(z)$ has no zeros for $|z| \leq 1$.

2. Show that the conclusion of Theorem 1 remains valid even without the hypotheses that $\Phi(z)$ has radius of convergence $r > 1$ and that the zeros of $\Phi(z)$ all lie in $|z| > 1$.

3. Show that, for a regular process, the function $\Phi(z)$ introduced in (4) can be represented in the form
$$\Phi(z) = \sqrt{2\pi}\,\exp\Bigl\{\tfrac{1}{2}c_0 + \sum_{k=1}^{\infty} c_k z^k\Bigr\}, \qquad |z| < 1,$$
where
$$c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ik\lambda}\ln f(\lambda)\, d\lambda.$$
Deduce from this formula and (5.9) that the one-step prediction error $\sigma_1^2 = \mathsf{E}|\xi_1 - \hat\xi_1|^2$ is given by the Szego-Kolmogorov formula
$$\sigma_1^2 = 2\pi\exp\Bigl\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln f(\lambda)\, d\lambda\Bigr\}.$$

4. Prove Theorem 2 without using (22).

5. Let a signal $\theta$ and a noise $\eta$, not correlated with each other, have spectral densities
$$f_\theta(\lambda) = \frac{1}{2\pi}\cdot\frac{1}{|1 + b_1 e^{-i\lambda}|^2} \quad\text{and}\quad f_\eta(\lambda) = \frac{1}{2\pi}\cdot\frac{1}{|1 + b_2 e^{-i\lambda}|^2}.$$
Using Theorem 3, find an estimator $\hat{\hat\theta}_{n+m}$ for $\theta_{n+m}$ in terms of $\xi_k$, $k \leq n$, where $\xi_k = \theta_k + \eta_k$. Consider the same problem for the spectral densities
$$f_\theta(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2 \quad\text{and}\quad f_\eta(\lambda) = \frac{1}{2\pi}.$$

§7. The Kalman-Bucy Filter and Its Generalizations

1. From a computational point of view, the solution presented above for the problem of filtering out an unobservable component $\theta$ by means of observations of $\xi$ is not practical, since, being expressed in terms of the spectrum, it has to be carried out by spectral methods. In the method proposed by Kalman and Bucy, the synthesis of the optimal filter is carried out recursively; this makes it possible to implement it on a digital computer. There are also other reasons for the wide use of the Kalman-Bucy filter, one being that it still "works" even without the assumption that the sequence $(\theta, \xi)$ is stationary.

We shall present not only the usual Kalman-Bucy method, but also a generalization in which the recurrent equations determined by $(\theta, \xi)$ have coefficients that depend on all the data observed in the past.

Thus, let us suppose that $(\theta, \xi) = ((\theta_n), (\xi_n))$ is a partially observed sequence, and let
$$\theta_n = (\theta_1(n), \ldots, \theta_k(n)) \quad\text{and}\quad \xi_n = (\xi_1(n), \ldots, \xi_l(n))$$
be governed by the recurrent equations
$$\begin{aligned}
\theta_{n+1} &= a_0(n, \xi) + a_1(n, \xi)\theta_n + b_1(n, \xi)\varepsilon_1(n+1) + b_2(n, \xi)\varepsilon_2(n+1),\\
\xi_{n+1} &= A_0(n, \xi) + A_1(n, \xi)\theta_n + B_1(n, \xi)\varepsilon_1(n+1) + B_2(n, \xi)\varepsilon_2(n+1).
\end{aligned} \qquad (1)$$


Here e1(n) = (e 11 (n), ... , elk(n))

and

e2(n) = (e 21 (n), ... , e21(n))

are independent Gaussian vectors with independent components, each of which is normally distributed with parameters 0 and 1; a0(n, ~) = (a 01 (n, ~), ... , rxOk(n, ~))and A 0(n, ~) = (A 01 (n, ~), ... , A 0ln, ~))are vector functions, where the dependence on ~ = {~ 0 , ... , ~n) is determined without looking ahead, i.e. for a given n the functions a 0 (n, ~), ... , A 01 (n, ~) depend only on ~ 0 , ••• , ~n; the matrix functions b1(n, ~)

= llb!Jl(n, ~)II,

B1(n, ~) = IIBIJ>(n, ~)II, a1(n, ~)

= lla!J>(n, ~)II,

bz(n, ~) = llbljl(n, ~)II, Bz(n, ~)

= IIBljl(n, ~)II,

A1(n, ~) = IIA\jl(n, ~)II

have orders k x k, k x l, l x k, l x I, k x k, l x k, respectively, and also depend on ~ without looking ahead. We also suppose that the initial vector (0 0, ~ 0 ) is independent of the sequences e1 = (e 1(n)) and e2 = (ez(n)). To simplify the presentation, we shall frequently not indicate the dependence of the coefficients on~So that the system (1) will have a solution with finite second moments, we assume that E(11Boll 2 + ll~oll 2 ) < oo

and if g(n, ~) is any of the functions a0 i, A0 i, blJ>, b\J>, BIJ> or BIJ> then E lg(n, ~W < oo, n = 0, 1,... . With these assumptions, (0, ~) has E(110nll 2 + ll~nll 2 ) < oo, n ~ 0. Now let ff~ = tr{ro: ~ 0 , ••• , ~"} be the smallest u-algebra generated by ~0 ,

.•. ,

~nand

According to Theorem 1, §8, Chapter II, mn = (m 1(n), ... , mk(n))is an optimal estimator (in the mean square sense) for the vector (}n = (0 1(n), ... , (}k(n)), and Eyn = E[(On- mn)((}n- mn)*] is the matrix of errors of observation. To determine these matrices for arbitrary sequences ((}, ~) governed by equations (1) is a very difficult problem. However, there is a further supplementary condition on (0 0 , ~ 0 ) that leads to a system of recurrent equations for mn and Ym that still contains the Kalman-Bucy filter. This is the condition that the conditional distribution P(0 0 ::::; a I~ 0 I) is Gaussian, P(00

::::;

1 fa exp{- (x - m0 ) 2 } dx, a! ~ 0 ) = ;;:c::2 2 y 2ny - oo Yo 0

with parameters m0 = mo(~o), Yo = YoC~o). To begin with, let us establish an important auxiliary result.

(2)

438

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Lemma 1. Under the assumptions made above about the coefficients of (1), together with (2), the sequence (fJ, 0 is conditionally Gaussian, i.e. the conditional distribution function P{fJ 0

::::;;

ao, ... , 1'/n::::;; anl~n

is (P-a.s.) the distribution.fimction of ann-dimensional Gaussian vector whose mean and covariance matrix depend on (~ 0 , •.. , ~n). PROOF. We prove only the Gaussian character of P(fJn::::;; al~~); this is enough to let us obtain equations for mn and Yn· First we observe that (1) implies that the conditional distribution

P(fJn+l ::::;; a1, ~n+l ::::;; xl~~,f}n =b) is Gaussian with mean-value vector

A

o

+

A b = ( ao t Ao

+ a 1b ) + Atb

and covariance matrix IEB =

(

bob boB) (b a B)* B a B '

where bob= b 1 bf + b2 bi, b a B = b 1 Bf + b2 Bi, B a B = B 1 Bf Let Cn = ((}n, ~n) and t = (t 1, ... , tk+l). Then E[exp(it*Cn+ 1 )1~~' fJnJ

= exp{it*(A 0 (n,

~)

+ B2 Bi.

+ A 1(n, ~)f}n)- !t*IEB(n, ~)t}. (3)

Suppose now that the conclusion of the lemma holds for some n

~

0. Then

E[exp(it*A 1(n, ~)f}n)I~~J = exp(it*A 1(n, Omn- !t*(A 1(n, ~)ynAfCn, ~))t. (4) Let us show that (4) is also valid when n is replaced by n From (3) and (4), we have E[exp(it*Cn+ 1 )1~~]

+

1.

+ A 1(n, ~)mn) -±t*IEB(n, ~)t- ±t*(Al(n, OynAf(n, ~))t}.

= exp{it*(A 0 (n, 0

Hence the conditional distribution P(fJn+l::::;; a, ~n+l::::;; xl~~)

(5)

is Gaussian. As in the proof of the theorem on normal correlation (Theorem 2, §13, Chapter II) we can verify that there is a matrix C such that the vector 1J = [fJn+l- E(fJn+li~~)J- C[~n+l- E(~n+l~~~)J

has the property that (P-a.s.) E[17(~n+ 1

-

E(~n+ tl ~~))*I~~] = 0.

439

§7. The Kalman-Bucy Filter and Its Generalizations

It follows that the conditionally-Gaussian vectors 11 and under the condition g;~, are independent, i.e.

~n+ 1,

considered

P(11 E A, ~n+1 E Big;~)= P(11 E A lg;~)' P(~n+ 1 E Big;~) for all A E BB(Rk), BE &B(R 1). Therefore if s = (s 1, ... , sn) then E[exp(is*en+1)1g;~, ~n+1J

= E{exp(is*[E(en+11g;~) + 11 + C[~n+1- E(~n+11g;~)]J)Ig;~, ~n+tl = exp{is*[E(en+tlg;~) + C[~n+t- E(~n+tlg;~)]} x E[exp(is*11)lg;~, ~n+1] = exp{is*[E(en+11g;m + C[~n+1- E(~n+11g;~)]} (6) x E(exp(is*11)1g;n

By (5), the conditional distribution P(11 :::::; ylg;~) is Gaussian. With (6), this shows that the conditional distribution P(en+t:::::; alg;~+ 1 ) is also Gaussian. This completes the proof of the lemma.

Theorem 1. Let (e, ~)be a partial observation of a sequence that satisfies the system (1) and condition (2). Then (mn, Yn) obey the following recursion relations:

mn+ 1 = [a 0 X

+ a 1mnJ +[boB+ a 1ynATJ[B oB + A 1ynAiJE!l

[~n+1-

Ao- A1mn],

(7)

Yn+ 1 = [a 1ynai +bob] - [boB+ a1ynAi] [BoB+ A 1ynAiJE!l (8) X [b B + a1ynAiJ*. 0

PROOF.

From (1), (9)

and

en+ 1

-

+ b 1et(n + 1) + b2ein + 1), A1[en- mnJ + B1e1(n + 1) + B2e2(n + 1).

E(en+ 1 lg;~) = at[en- mnJ

~n+ 1 - E(~n+11g;~) =

(10) Let us write

d11 = cov(en+ l• en+ 11 g;~) = E{[en+1- E(en+tlg;~)][en+t- E(en+11g;m*;g;H,

d12 = cov(en+ 1• ~n+ 1l g;~) = E{[en+1- E[en+11g;m[~n+1- E(~n+11g;m*;g;n,

d22 =COV(~n+l• ~n+tlg;~) = E{[~n+1- E(~n+11g;~)J[~n+1- E(~n+11g;~)J*/g;~}.

440

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Then, by (10),

d22

=

AlynAT

+ B a B. (11)

By the theorem on normal correlation (see Theorem 2 and Problem 4, §13, Chapter II), mn+l = E(I'Jn+li~~, en+l) = E(On+li~~)

+ d12d?2(c;n+1-

E(c;n+li~~))

and

Yn+l = cov(On-1• On+ll~~. en+l)= du - d12d?2dT2· If we then use the expressions from (9) for E(On+ll~~) and E(c;n+ll~~) and those for d 11 , d 12 , d22 from (11), we obtain the required recursion formulas (7) and (8). This completes the proof of the theorem.

Corollary 1. If the coefficients a0 (n, c;), ... , B2(n, c;) in (1) are independent of c; the corresponding method is known as the Kalman-Bucy method, and equations (7) and (8)for mn and Yn describe the Kalman-Bucy filter. It is important to observe that in this case the conditional and unconditional error matrices Yn agree, i.e. Corollary 2. Suppose that a partially observed sequence (On, c;n) has the property that (Jn satisfies the first equation (1), and that t;n satisfies the equation i;n = A 0 (n- 1, i;) + A 1(n- 1, i;)On + B 1(n - 1, i;)e 1(n) + B2(n- 1, i;)ein).

(12)

Then evidently

+ A1(n, i;)[ao(n, i;) + a1(n, i;)(Jn + b1(n, c;)e 1(n + 1) +bin, c;)e2(n+ 1)] + B 1(n, i;)e 1(n + 1) + Bin, i;)ein + 1),

en+l = Ao(n, i;)

and with the notation Ao

=

Bl

=

+ Alao, Albl + Bl, Ao

Al

=

Alal,

B2 = A 1b2 + B 2,

we find that the case under consideration also depends on the model (1), and that mn and Yn satisfy (7) and (8). 2. We now consider a linear model (compare (1))

+ alen + a2en + blel(n + 1) + b2e2(n + 1), Ao + A 10n + A 2i;n + B 1e1(n + 1) + B2e2(n + 1),

en+l = ao i;n+l =

(13)

441

§7. The Kalman-Bucy Filter and Its Generalizations

where the coefficients a_0, ..., B_2 may depend on n (but not on ξ), and ε_1(n), ε_2(n) are independent Gaussian random variables with Eε_i(n) = 0 and Eε_i²(n) = 1. Let (13) be solved with initial values (θ_0, ξ_0) such that the conditional distribution P(θ_0 ≤ a | ξ_0) is Gaussian with parameters m_0 = E(θ_0 | ξ_0) and γ_0 = cov(θ_0, θ_0 | ξ_0). Then, by the theorem on normal correlation and by (7) and (8), the optimal estimator m_n = E(θ_n | F_n^ξ) is a linear function of ξ_0, ξ_1, ..., ξ_n.

This remark makes it possible to prove the following important statement about the structure of the optimal linear filter without the assumption that the sequence is Gaussian.

Theorem 2. Let (θ, ξ) = (θ_n, ξ_n)_{n≥0} be a partially observed sequence that satisfies (13), where ε_1(n), ε_2(n) are uncorrelated random variables with Eε_i(n) = 0, Eε_i²(n) = 1, and the components of the initial vector (θ_0, ξ_0) have finite second moments. Then the optimal linear estimator m_n = Ê(θ_n | ξ_0, ..., ξ_n) satisfies (7) with a_0(n, ξ) = a_0(n) + a_2(n)ξ_n and A_0(n, ξ) = A_0(n) + A_2(n)ξ_n, and the error matrix γ_n = E[(θ_n − m_n)(θ_n − m_n)*] satisfies (8), with initial values

m_0 = cov(θ_0, ξ_0) cov⁺(ξ_0, ξ_0) · ξ_0,
γ_0 = cov(θ_0, θ_0) − cov(θ_0, ξ_0) cov⁺(ξ_0, ξ_0) cov*(θ_0, ξ_0),   (14)

where cov⁺ denotes the pseudoinverse of the covariance matrix.

For the proof of this theorem we need the following lemma, which reveals the role of the Gaussian case in determining optimal linear estimators.

Lemma 2. Let (α, β) be a two-dimensional random vector with E(α² + β²) < ∞, and let (ᾱ, β̄) be a two-dimensional Gaussian vector with the same first and second moments as (α, β), i.e.

Eᾱ^i = Eα^i,  Eβ̄^i = Eβ^i,  i = 1, 2;  Eᾱβ̄ = Eαβ.

Let λ(b) be a linear function of b such that λ(b) = E(ᾱ | β̄ = b). Then λ(β) is the optimal (in the mean-square sense) linear estimator of α in terms of β, i.e.

Ê(α | β) = λ(β).

Here Eλ(β) = Eα.

PROOF. We first observe that the existence of a linear function λ(b) coinciding with E(ᾱ | β̄ = b) follows from the theorem on normal correlation. Moreover, let λ̄(b) be any other linear estimator. Then

E[ᾱ − λ̄(β̄)]² ≥ E[ᾱ − λ(β̄)]²,


VI. Stationary (Wide Sense) Random Sequences. L 2 - Theory

and since λ̄(b) and λ(b) are linear and the hypotheses of the lemma are satisfied, we have

E[α − λ̄(β)]² = E[ᾱ − λ̄(β̄)]² ≥ E[ᾱ − λ(β̄)]² = E[α − λ(β)]²,

which shows that λ(β) is optimal in the class of linear estimators. Finally,

Eλ(β) = Eλ(β̄) = E[E(ᾱ | β̄)] = Eᾱ = Eα.

This completes the proof of the lemma.

PROOF OF THEOREM 2. We consider, besides (13), the system

θ̄_{n+1} = a_0 + a_1 θ̄_n + a_2 ξ̄_n + b_1 ε̄_{11}(n + 1) + b_2 ε̄_{12}(n + 1),
ξ̄_{n+1} = A_0 + A_1 θ̄_n + A_2 ξ̄_n + B_1 ε̄_{21}(n + 1) + B_2 ε̄_{22}(n + 1),   (15)

where the ε̄_{ij}(n) are independent Gaussian random variables with Eε̄_{ij}(n) = 0 and Eε̄_{ij}²(n) = 1. Let (θ̄_0, ξ̄_0) also be a Gaussian vector which has the same first moments and covariance as (θ_0, ξ_0) and is independent of the ε̄_{ij}(n). Then, since (15) is linear, the vector (θ̄_0, ..., θ̄_n, ξ̄_0, ..., ξ̄_n) is Gaussian, and therefore the conclusion of the theorem follows from Lemma 2 (more precisely, from its multidimensional analog) and the theorem on normal correlation. This completes the proof of the theorem.

3. Let us consider some illustrations of Theorems 1 and 2.

EXAMPLE 1. Let θ = (θ_n) and η = (η_n) be two stationary (wide sense) uncorrelated random sequences with Eθ_n = Eη_n = 0 and spectral densities

f_θ(λ) = (1/2π) · 1/|1 + b_1 e^{−iλ}|²  and  f_η(λ) = (1/2π) · 1/|1 + b_2 e^{−iλ}|²,

where |b_1| < 1, |b_2| < 1. We are going to interpret θ as a useful signal and η as noise, and suppose that observation produces the sequence ξ = (ξ_n) with ξ_n = θ_n + η_n.

According to Corollary 2 to Theorem 3 of §3 there are (mutually uncorrelated) white noises ε_1 = (ε_1(n)) and ε_2 = (ε_2(n)) such that

θ_{n+1} = −b_1 θ_n + ε_1(n + 1),  η_{n+1} = −b_2 η_n + ε_2(n + 1).

Then

ξ_{n+1} = θ_{n+1} + η_{n+1} = −b_1 θ_n − b_2 η_n + ε_1(n + 1) + ε_2(n + 1)
 = −b_2(θ_n + η_n) − θ_n(b_1 − b_2) + ε_1(n + 1) + ε_2(n + 1)
 = −b_2 ξ_n − (b_1 − b_2)θ_n + ε_1(n + 1) + ε_2(n + 1).


Hence θ and ξ satisfy the recursion relations

θ_{n+1} = −b_1 θ_n + ε_1(n + 1),
ξ_{n+1} = −(b_1 − b_2)θ_n − b_2 ξ_n + ε_1(n + 1) + ε_2(n + 1),   (16)

and, according to Theorem 2, m_n = Ê(θ_n | ξ_0, ..., ξ_n) and γ_n = E(θ_n − m_n)² satisfy the following system of recursion equations for optimal linear filtering:

m_{n+1} = −b_1 m_n + [1 + b_1(b_1 − b_2)γ_n]/[2 + (b_1 − b_2)²γ_n] · [ξ_{n+1} + (b_1 − b_2)m_n + b_2 ξ_n],
γ_{n+1} = (1 + b_1²γ_n) − [1 + b_1(b_1 − b_2)γ_n]²/[2 + (b_1 − b_2)²γ_n].   (17)

Let us find the initial conditions under which we should solve this system. Write d_11 = Eθ_n², d_12 = Eθ_n ξ_n, d_22 = Eξ_n². Then we find from (16) that

d_12 = b_1(b_1 − b_2)d_11 + b_1 b_2 d_12 + 1,
d_22 = (b_1 − b_2)²d_11 + b_2²d_22 + 2b_2(b_1 − b_2)d_12 + 2,

from which

d_11 = d_12 = 1/(1 − b_1²),  d_22 = (2 − b_1² − b_2²)/[(1 − b_1²)(1 − b_2²)],

which, by (14), leads to the following initial values:

m_0 = (d_12/d_22)ξ_0 = [(1 − b_2²)/(2 − b_1² − b_2²)]ξ_0,
γ_0 = d_11 − d_12²/d_22 = 1/(1 − b_1²) − (1 − b_2²)/[(1 − b_1²)(2 − b_1² − b_2²)] = 1/(2 − b_1² − b_2²).   (18)

Thus the optimal (in the least-squares sense) linear estimators m_n for the signal θ_n in terms of ξ_0, ..., ξ_n and the mean-square error are determined by the system of recurrent equations (17), solved under the initial conditions (18). Observe that the equation for γ_n does not contain any random components, and consequently the number γ_n, which is needed for finding m_n, can be calculated in advance, before the filtering problem has been solved.
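The recursions of this example are easy to run numerically. The following is a minimal sketch, not an implementation from the text: the coefficients b_1 = 0.5, b_2 = −0.3, the run length, and the random seed are arbitrary choices, and the signal and observation are started at their zero means rather than drawn from the stationary law.

```python
import random

# Sketch of the filtering recursions for the signal-plus-noise model:
# theta and xi evolve by (16); m and gamma by the Kalman-type updates (17),
# started from the initial values (18) (with xi_0 = 0, so m_0 = 0).
b1, b2, N = 0.5, -0.3, 200
random.seed(1)

gamma = 1.0 / (2.0 - b1 ** 2 - b2 ** 2)        # gamma_0 from (18)
theta, xi, m = 0.0, 0.0, 0.0                   # simplification: start at the means

for n in range(N):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    theta_new = -b1 * theta + e1                                # (16), signal
    xi_new = -(b1 - b2) * theta - b2 * xi + e1 + e2             # (16), observation
    gain = (1.0 + b1 * (b1 - b2) * gamma) / (2.0 + (b1 - b2) ** 2 * gamma)
    m = -b1 * m + gain * (xi_new + b2 * xi + (b1 - b2) * m)     # estimator update
    gamma = (1.0 + b1 ** 2 * gamma) - gain * (1.0 + b1 * (b1 - b2) * gamma)
    theta, xi = theta_new, xi_new

print(round(gamma, 4))   # gamma_n settles to a limit independent of the data
```

As the text observes, the γ_n-recursion involves no randomness, so the printed limiting error could equally be computed before any observations arrive.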

EXAMPLE 2. This example is instructive because it shows that the result of Theorem 2 can be applied to find the optimal linear filter in a case where the sequence (θ, ξ) is described by a (nonlinear) system which is different from (13).


Let ε_1 = (ε_1(n)) and ε_2 = (ε_2(n)) be two independent Gaussian sequences of independent random variables with Eε_i(n) = 0 and Eε_i²(n) = 1, n ≥ 1. Consider the pair of sequences (θ, ξ) = (θ_n, ξ_n), n ≥ 0, with

θ_{n+1} = aθ_n + (1 + θ_n)ε_1(n + 1),
ξ_{n+1} = Aθ_n + ε_2(n + 1).   (19)

We shall suppose that θ_0 is independent of (ε_1, ε_2) and that θ_0 ~ N(m_0, γ_0). The system (19) is nonlinear, and Theorem 2 is not immediately applicable. However, if we put

ε̄_1(n + 1) = (1 + θ_n)ε_1(n + 1)/√(E(1 + θ_n)²),

we can observe that Eε̄_1(n) = 0, Eε̄_1(n)ε̄_1(m) = 0 for n ≠ m, and Eε̄_1²(n) = 1. Hence we have reduced (19) to the linear system

θ_{n+1} = aθ_n + b_1 ε̄_1(n + 1),
ξ_{n+1} = Aθ_n + ε_2(n + 1),   (20)

where b_1 = √(E(1 + θ_n)²) and {ε̄_1(n)} is a sequence of uncorrelated random variables. Now (20) is a linear system of the same type as (13), and consequently the optimal linear estimator m_n = Ê(θ_n | ξ_0, ..., ξ_n) and its error γ_n can be determined from (7) and (8) via Theorem 2, applied in the following form in the present case:

m_{n+1} = am_n + [aγ_n A/(1 + A²γ_n)] · [ξ_{n+1} − Am_n],
γ_{n+1} = (a²γ_n + b_1²) − (aγ_n A)²/(1 + A²γ_n),

where b_1 = √(E(1 + θ_n)²) must be found from the first equation in (19).

~0 =

0,

(21)

where e 1 is as in (1). Then from (7) and (8), with m. = E((J I$'~) andy., we find that mn+l =

m.

+ y.A!(n,

X [~n+l-

~)[(B 1 B!)(n, ~)

+ A 1(n,

~)y.A!(n, ~)]~

A 0 (n, ~)- A 1(n, ~)m.],

Yn+l = Yn- y.A!(n, ~)[(B 1 B!)(n, ~)

+ A 1(n, ~)y.A!(n, ~)]~ A 1(n, ~)y•. (22)

445

§7. The Kalman-Bucy Filter and Its Generalizations

If the matrices B 1Bt are nonsingular, the solution of (22) is given by

+ Ymto Af(m, ~)(B 1 Bt)- 1 (m, ~)Af(m, ~)r 1

mn+1 = [ E X [

Yn+l =

m

[E

+ ymtOAt(m, ~)(B1Bt)- 1 (m, ~)(~m+l-

Ao(m,

+ Ymt0 Af(m, ~)(B1Bt)- 1 (m, ~)A1(m, ~)r 1 y,

mJ (23)

where E is a unit matrix.

4.

PROBLEMS

1. Show that the vectors m. and

e.- m. in (1) are uncorrelated: E[m:(o - m.)] = 0.

2. In (1), let y and the coefficients other than a0 (n, e) and A 0 (n, e) be independent of "chance" (i.e. of e). Show that then the conditional covariance 'l'n is independent of "chance": 'l'n = Ey•. 3. Show that the solution of (22) is given by (23). 4. Let (0,

0

= (0.,

e.) be a Gaussian sequence satisfying the following special case of(1):

Show that if A =I 0, b =I 0, B =I 0, the limiting error of filtering, y = exists and is determined as the positive root of the equation

2+ [B2(1A2- a2) - b2] y -

y

b2 B2

Al =

0.

lim.~oo

y.,

CHAPTER VII

Sequences of Random Variables That Form Martingales

§1. Definitions of Martingales and Related Concepts 1. The study of the dependence of random variables arises in various ways in probability theory. In the theory of stationary (wide sense) random sequences, the basic indicator of dependence is the covariance function, and the inferences made in this theory are determined by the properties of that function. In the theory of Markov chains (§12 of Chapter I; Chapter VIII) the basic dependence is supplied by the transition function, which completely determines the development of the random variables involved in Markov dependence. In the present chapter (see also §11, Chapter I), we single out a rather wide class of sequences of random variables (martingales and their generalizations) for which dependence can be studied by methods based on a discussion of the properties of conditional expectations.

2. Let (0, §', P) be a given probability space, and let (§,;) be a family of a-algebras §,;, n ~ 0, such that $'o s; ~ s; · · · s; §'. Let X 0 , X 1 , .•• be a sequence of random variables defined on (0, §', P). If, for each n ~ 0, the variable X,. is §,;-measurable, we say that the set X = (X,.,§,;), n ~ 0, or simply X = (X,.,§,;), is a stochastic sequence. If a stochastic sequence X = (X,.,§,;) has the property that, for each n ~ 1, the variable X,. is §.;_cmeasurable, we write X= (X,., §,;_ 1 ), taking F _ 1 = F 0 , and call X a predictable sequence. We call such a sequence increasing if X 0 = 0 and X,. ::s;; X,.+ 1 (P-a.s.).

Definition 1. A stochastic sequence X= (X,.,§',.) is a martingale, or a submartingale, if, for all n ~ 0, (1)

447

§1. Definitions of Martingales and Related Concepts

and, respectively, E(X.+ 1 1~)

or

E(X.+ll~) ~

=

x.

x.

(P-a.s.)(martingale) (2)

(P-a.s.)(submartingale).

A stochastic sequence X = (X.~) is a supermartingale if the sequence -X= (-X.,~) is a submartingale. where = u{w: X 0 , ... , X.}, and In the special case when~= the stochastic sequence X= (X.,~) is a martingale (or submartingale), we say that the sequence (X.).~ 0 itself is a martingale (or submartingale). It is easy to deduce from the properties of conditional expectations that (2) is equivalent to the property that, for every n ~ 0 and A E ~.

ff:,

L L

ff:

L ~L

Xn+l dP =

or

xn+l dP

x.dP (3)

x.dP.

1. If (~.).~ 0 is a sequence of independent random variables with = 0 and x. = ~ 0 + · · · + ~., ~ = u{w: ~ 0 , ••. , ~.}, the stochastic sequence X= {X.,~) is a martingale.

ExAMPLE

E~.

2. If (~.).~ 0 is a sequence of independent random variables with E~. = 1, the stochastic sequence (X.,~) with X.= n~=o ~k• ~ = u{w: ~ 0 , •.• , ~.}is also a martingale. ExAMPLE

ExAMPLE

3. Let

~

be a random variable with EI~ I < oo and ~s;;;~s;;;

Then the sequence X

=(X.,~)

with

... s;;;ff.

x. =

E(~ I~)

is a martingale.

ExAMPLE 4. If (~.).~ 0 is a sequence of nonnegative integrable random variables, the sequence (X.) with x. = ~ 0 + · · · + ~.is a submartingale. ExAMPLE 5. If X= (X.,~) is a martingale and g(x) is convex downward with E lg(X.)I < oo, n ~ 0, then the stochastic sequence (g(X.), ~) is a submartingale (as follows from Jensen's inequality). If X= (X.,~) is a submartingale and g(x) is convex downward and nondecreasing, with Elg(X.)I < oo for all n ~ 0, then (g(X.), ~) is also a su bmartingale.

Assumption (1) in Definition 1 ensures the existence of the conditional expectations E(X•+ 1 I~), n ~ 0. However, these expectations can also exist without the assumption that E IX.+ 1 1 < oo. Recall that by §7 of Chapter

448

VII. Sequences of Random Variables That Form Martingales

II, E(X ;+ 1I~) and E(X;+ 1I~) are always defined. Let us write A = B (P-a.s.) when P(A!::,. B)= 0. Then if {w:E(X;+ 1 1~)

< oo} u

{w:E(X,;-1~)

< oo}

= Q

(P-a.s.)

we say that E(Xn + 1 I~) is also defined and is given by E(Xn+11~) = E(X:+1I~)- E(X,;-+11~).

After this, the following definition is natural.

Definition 2. A stochastic sequence X = (Xn,

~)is a generalized martingale E(Xn+ll~) are defined expectations conditional the if (or submartingale) satisfied. is (2) and 0 ~ n for every

Notice that it follows from this definition that E(X;+ 1 1~) < oo for a generalized submartingale, and the E(IXn+tll~) < oo (P-a.s.) for a generalized martingale. 3. In the following definition we introduce the concept of a Markov time, which plays a very important role in the subsequent theory.

Definition 3. A random variable τ = τ(ω) with values in the set {0, 1, ..., +∞} is a Markov time (with respect to (F_n)) (or a random variable "independent of the future") if, for each n ≥ 0,

{τ = n} ∈ F_n.   (4)

When P(τ < ∞) = 1, a Markov time τ is called a stopping time.

Let X = (X_n, F_n) be a stochastic sequence and let τ be a Markov time (with respect to (F_n)). We write

X_τ = Σ_{n=0}^{∞} X_n I_{{τ=n}}(ω)

(hence X_τ = 0 on the set {ω: τ = ∞}). Then for every B ∈ 𝓑(R),

{ω: X_τ ∈ B} = ⋃_{n=0}^{∞} {X_n ∈ B, τ = n} ∈ F,

and consequently X_τ is a random variable.

EXAMPLE 6. Let X = (X_n, F_n) be a stochastic sequence and let B ∈ 𝓑(R). Then the time of first hitting the set B, that is,

τ_B = inf{n ≥ 0: X_n ∈ B}

(with τ_B = +∞ if {·} = ∅), is a Markov time, since

{τ_B = n} = {X_0 ∉ B, ..., X_{n−1} ∉ B, X_n ∈ B} ∈ F_n

for every n ≥ 0.

EXAMPLE 7. Let X = (X_n, F_n) be a martingale (or submartingale) and τ a Markov time (with respect to (F_n)). Then the "stopped" process X^τ = (X_{τ∧n}, F_n) is also a martingale (or submartingale). In fact, the equation

X_{n∧τ} = Σ_{m=0}^{n−1} X_m I_{{τ=m}} + X_n I_{{τ≥n}}

implies that the variables X_{n∧τ} are F_n-measurable, are integrable, and satisfy

X_{(n+1)∧τ} − X_{n∧τ} = I_{{τ>n}}(X_{n+1} − X_n),

whence

E[X_{(n+1)∧τ} − X_{n∧τ} | F_n] = I_{{τ>n}} E[X_{n+1} − X_n | F_n] = 0  (or ≥ 0).

Every system (F_n) and Markov time τ corresponding to it generate a collection of sets

F_τ = {A ∈ F: A ∩ {τ = n} ∈ F_n for all n ≥ 0}.

It is clear that Ω ∈ F_τ and that F_τ is closed under countable unions. Moreover, if A ∈ F_τ, then Ā ∩ {τ = n} = {τ = n}\(A ∩ {τ = n}) ∈ F_n, and therefore Ā ∈ F_τ. Hence it follows that F_τ is a σ-algebra. If we think of F_n as a collection of events observed up to time n (inclusive), then F_τ can be thought of as a collection of events observed at the "random" time τ. It is easy to show (Problem 3) that the random variables τ and X_τ are F_τ-measurable.
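Examples 6 and 7 can be illustrated with a short sketch: τ is the first time a simulated fair walk hits the set B, and the stopped path freezes at the hitting level. The set B = {3}, the walk length, and the seed are arbitrary illustrative choices.

```python
import random

# Illustration of Example 6 (first hitting time) and Example 7 (stopped
# process): tau_B is a Markov time, and X_{n ^ tau} is constant after tau.
random.seed(7)
S, path = 0, [0]
for _ in range(100):
    S += random.choice((-1, 1))
    path.append(S)

B = 3
hits = [n for n, x in enumerate(path) if x == B]
if hits:
    tau = hits[0]                                        # tau_B
    stopped = [path[min(n, tau)] for n in range(len(path))]
    assert all(x == B for x in stopped[tau:])            # frozen after tau
else:
    tau = float("inf")                                   # tau_B = +infinity if never hit
print(tau)
```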

4. Definition 4. A stochastic sequence X = (X_n, F_n) is a local martingale (or submartingale) if there is a (localizing) sequence (τ_k)_{k≥1} of Markov times such that τ_k ≤ τ_{k+1} (P-a.s.), τ_k ↑ ∞ (P-a.s.) as k → ∞, and every "stopped" sequence X^{τ_k} = (X_{τ_k∧n}·I_{{τ_k>0}}, F_n) is a martingale (or submartingale).

In Theorem 1 below, we show that in fact the class of local martingales coincides with the class of generalized martingales. Moreover, every local martingale can be obtained by a "martingale transformation" from a martingale and a predictable sequence.


Definition 5. Let Y = (Y_n, F_n) be a stochastic sequence and let V = (V_n, F_{n−1}) be a predictable sequence (F_{−1} = F_0). The stochastic sequence V·Y = ((V·Y)_n, F_n) with

(V·Y)_n = V_0 Y_0 + Σ_{i=1}^{n} V_i ΔY_i,   (5)

where ΔY_i = Y_i − Y_{i−1}, is called the transform of Y by V. If, in addition, Y is a martingale, we say that V·Y is a martingale transform.

Theorem 1. Let X = (X_n, F_n)_{n≥0} be a stochastic sequence with X_0 = 0 (P-a.s.). The following conditions are equivalent:

(a) X is a local martingale;
(b) X is a generalized martingale;
(c) X is a martingale transform, i.e. there are a predictable sequence V = (V_n, F_{n−1}) with V_0 = 0 and a martingale Y = (Y_n, F_n) with Y_0 = 0 such that X = V·Y.

PROOF. (a) ⇒ (b). Let X be a local martingale and let (τ_k) be a localizing sequence of Markov times for X. Then for every m ≥ 0

E|X_{m∧τ_k} I_{{τ_k>0}}| < ∞,   (6)

and therefore

E|X_{n+1}| I_{{τ_k>n}} < ∞.   (7)

The random variable I_{{τ_k>n}} is F_n-measurable. Hence it follows from (7) that

E[|X_{n+1}| I_{{τ_k>n}} | F_n] = I_{{τ_k>n}} E[|X_{n+1}| | F_n] < ∞  (P-a.s.).

Here I_{{τ_k>n}} → 1 (P-a.s.) as k → ∞, and therefore

E[|X_{n+1}| | F_n] < ∞  (P-a.s.).   (8)

Under this condition E[X_{n+1} | F_n] is defined, and it remains only to show that E[X_{n+1} | F_n] = X_n (P-a.s.). To do this, we need to show that

∫_A X_{n+1} dP = ∫_A X_n dP

for A ∈ F_n. By Problem 7, §7, Chapter II, we have E[|X_{n+1}| | F_n] < ∞ (P-a.s.) if and only if the measure ∫_A |X_{n+1}| dP, A ∈ F_n, is σ-finite. Therefore if we show that the measure ∫_A |X_n| dP, A ∈ F_n, is also σ-finite, then in order to establish the equation in which we are interested it will be sufficient to establish it only for those sets B ∈ F_n for which ∫_B |X_{n+1}| dP < ∞.


Since X^{τ_k} is a martingale, |X^{τ_k}| = (|X_{τ_k∧n}|·I_{{τ_k>0}}, F_n) is a submartingale and hence, if we recall that {τ_k > n} ∈ F_n, we obtain

∫_{B∩{τ_k>n}} |X_n| dP = ∫_{B∩{τ_k>n}} |X_{n∧τ_k} I_{{τ_k>0}}| dP ≤ ∫_{B∩{τ_k>n}} |X_{(n+1)∧τ_k} I_{{τ_k>0}}| dP = ∫_{B∩{τ_k>n}} |X_{n+1}| dP.

Letting k → ∞, we find that

∫_B |X_n| dP ≤ ∫_B |X_{n+1}| dP.

It follows that if B ∈ F_n is such that ∫_B |X_{n+1}| dP < ∞, then (by Lebesgue's dominated convergence theorem) we can take limits as k → ∞ in the martingale equation

∫_{B∩{τ_k>n}} X_n dP = ∫_{B∩{τ_k>n}} X_{n+1} dP.

Thus

∫_B X_n dP = ∫_B X_{n+1} dP

for B ∈ F_n such that ∫_B |X_{n+1}| dP < ∞. Hence it follows that this equation is valid for all B ∈ F_n, and implies E(X_{n+1} | F_n) = X_n (P-a.s.).

(b) ⇒ (c). Let

V_0 = 0 and

V_n = E[|ΔX_n| | F_{n−1}],  n ≥ 1.

Put W_n = V_n^⊕ (the generalized inverse: W_n = V_n^{−1} if V_n > 0, and W_n = 0 otherwise), Y_0 = 0, and

Y_n = Σ_{i=1}^{n} W_i ΔX_i,  n ≥ 1.

It is clear that E[|ΔY_n| | F_{n−1}] ≤ 1 and

E[ΔY_n | F_{n−1}] = 0.

Consequently Y = (Y_n, F_n) is a martingale. Moreover, X_0 = V_0·Y_0 = 0 and Δ(V·Y)_n = ΔX_n. Therefore

X = V·Y.

(c) ⇒ (a). Let X = V·Y, where V is a predictable sequence, Y is a martingale, and V_0 = Y_0 = 0. Put

τ_k = inf{n ≥ 0: |V_{n+1}| > k},

and suppose that τ_k = ∞ if the set {·} = ∅. Since V_{n+1} is F_n-measurable, the variables τ_k are Markov times for every k ≥ 1.

Consider the "stopped" sequence X^{τ_k} = ((V·Y)_{n∧τ_k} I_{{τ_k>0}}, F_n). On the set {τ_k > 0}, the inequality |V_{n∧τ_k}| ≤ k is in effect. Hence it follows that E|(V·Y)_{n∧τ_k} I_{{τ_k>0}}| < ∞ for every n ≥ 1. In addition, for n ≥ 1,

E{[(V·Y)_{(n+1)∧τ_k} − (V·Y)_{n∧τ_k}] I_{{τ_k>0}} | F_n} = I_{{τ_k>0}} V_{(n+1)∧τ_k} E{Y_{(n+1)∧τ_k} − Y_{n∧τ_k} | F_n} = 0,

since (see Example 7) E{Y_{(n+1)∧τ_k} − Y_{n∧τ_k} | F_n} = 0. Thus for every k ≥ 1 the stochastic sequences X^{τ_k} are martingales, τ_k ↑ ∞ (P-a.s.), and consequently X is a local martingale. This completes the proof of the theorem.

5. EXAMPLE 8. Let (η_n)_{n≥1} be a sequence of independent identically distributed Bernoulli random variables with P(η_n = 1) = p, P(η_n = −1) = q, p + q = 1. We interpret the event {η_n = 1} as success (gain) and {η_n = −1} as failure (loss) of a player at the nth turn. Let us suppose that the player's stake at the nth turn is V_n. Then the player's total gain through the nth turn is

X_n = Σ_{i=1}^{n} V_i η_i = X_{n−1} + V_n η_n,  X_0 = 0.

It is quite natural to suppose that the amount V_n at the nth turn may depend on the results of the preceding turns, i.e. on V_1, ..., V_{n−1} and on η_1, ..., η_{n−1}. In other words, if we put F_0 = {∅, Ω} and F_n = σ{ω: η_1, ..., η_n}, then V_n is an F_{n−1}-measurable random variable, i.e. the sequence V = (V_n, F_{n−1}) that determines the player's "strategy" is predictable. Putting Y_n = η_1 + ··· + η_n, we find that

X_n = Σ_{i=1}^{n} V_i ΔY_i,

i.e. the sequence X = (X_n, F_n) with X_0 = 0 is the transform of Y by V. From the player's point of view, the game in question is fair (or favorable, or unfavorable) if, at every stage, the conditional expectation

E(X_{n+1} − X_n | F_n) = 0  (or ≥ 0, or ≤ 0).

Moreover, it is clear that the game is

fair if p = q = 1/2,
favorable if p > q,
unfavorable if p < q.

Since X = (X_n, F_n) is a

martingale if p = q = 1/2,
submartingale if p > q,
supermartingale if p < q,


we can say that the assumption that the game is fair (or favorable, or unfavorable) corresponds to the assumption that the sequence X is a martingale (or submartingale, or supermartingale).

Let us now consider the special class of strategies V = (V_n, F_{n−1})_{n≥1} with V_1 = 1 and, for n > 1,

V_n = 2^{n−1} if η_1 = −1, ..., η_{n−1} = −1;  V_n = 0 otherwise.   (9)

In such a strategy, a player, having started with a stake V_1 = 1, doubles the stake after a loss and drops out of the game immediately after a win. If η_1 = −1, ..., η_n = −1, the total loss to the player after n turns will be

Σ_{i=1}^{n} 2^{i−1} = 2^n − 1.

Therefore if also η_{n+1} = 1, we have

X_{n+1} = X_n + V_{n+1} = −(2^n − 1) + 2^n = 1.

Let τ = inf{n ≥ 1: X_n = 1}. If p = q = 1/2, i.e. the game in question is fair, then P(τ = n) = (1/2)^n, P(τ < ∞) = 1, P(X_τ = 1) = 1, and EX_τ = 1. Therefore even for a fair game, by applying the strategy (9), a player can in a finite time (with probability unity) complete the game "successfully," increasing his capital by one unit (EX_τ = 1 > X_0 = 0). In gambling practice, this system (doubling the stakes after a loss and dropping out of the game after a win) is called a martingale. This is the origin of the mathematical term "martingale."
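The doubling strategy is easily simulated. In the sketch below (the trial count and seed are arbitrary choices), every simulated game ends with a gain of exactly one unit, while the stopping time τ is geometric with mean Eτ = Σ n(1/2)^n = 2.

```python
import random

# Simulation of the doubling ("martingale") strategy (9) in the fair case
# p = q = 1/2: the gain at the stopping time tau is +1 on every path.
random.seed(4)
taus = []
for _ in range(10_000):
    n, gain = 0, 0
    while True:
        n += 1
        stake = 2 ** (n - 1)           # the stake V_n of strategy (9)
        if random.random() < 0.5:      # win: eta_n = +1, drop out
            gain += stake
            break
        gain -= stake                  # loss: eta_n = -1, double and continue
    assert gain == 1                   # X_tau = 1 on every path
    taus.append(n)

avg_tau = sum(taus) / len(taus)
print(round(avg_tau, 2))   # close to E tau = 2
```

Note how the "success" is paid for: the stakes 2^{n−1} are unbounded, which is exactly the physical unrealizability discussed in the Remark that follows.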

Remark. When p = q = 1/2, the sequence X = (X_n, F_n) with X_0 = 0 is a martingale, and therefore

EX_n = EX_0 = 0  for every n ≥ 1.

We may therefore expect that this equation is preserved if the instant n is replaced by a random instant τ. It will appear later (Theorem 1, §2) that EX_τ = EX_0 in "typical" situations. Violations of this equation (as in the game discussed above) arise in what we may describe as physically unrealizable situations, when either τ or |X_n| takes values that are much too large. (Note that the game discussed above would be physically unrealizable, since it supposes an unbounded time for playing and an unbounded initial capital for the player.)

6. Definition 6. A stochastic sequence ξ = (ξ_n, F_n) is a martingale-difference if E|ξ_n| < ∞ for all n ≥ 0 and

E(ξ_{n+1} | F_n) = 0  (P-a.s.).   (10)


The connection between martingales and martingale-differences is clear from Definitions 1 and 6. Thus if X = (X_n, F_n) is a martingale, then ξ = (ξ_n, F_n) with ξ_0 = X_0 and ξ_n = ΔX_n, n ≥ 1, is a martingale-difference. In turn, if ξ = (ξ_n, F_n) is a martingale-difference, then X = (X_n, F_n) with X_n = ξ_0 + ··· + ξ_n is a martingale. In agreement with this terminology, every sequence ξ = (ξ_n)_{n≥0} of independent integrable random variables with Eξ_n = 0 is a martingale-difference (with F_n = σ{ω: ξ_0, ξ_1, ..., ξ_n}).

7. The following theorem elucidates the structure of submartingales (or supermartingales).

Theorem 2 (Doob). Let X = (X_n, F_n) be a submartingale. Then there are a martingale m = (m_n, F_n) and a predictable increasing sequence A = (A_n, F_{n−1}) such that, for every n ≥ 0, Doob's decomposition

X_n = m_n + A_n  (P-a.s.)   (11)

holds. A decomposition of this kind is unique.

PROOF. Let us put m_0 = X_0, A_0 = 0 and

m_n = m_0 + Σ_{j=0}^{n−1} [X_{j+1} − E(X_{j+1} | F_j)],   (12)

A_n = Σ_{j=0}^{n−1} [E(X_{j+1} | F_j) − X_j].   (13)

It is evident that m and A, defined in this way, have the required properties. In addition, let X_n = m'_n + A'_n, where m' = (m'_n, F_n) is a martingale and A' = (A'_n, F_{n−1}) is a predictable increasing sequence. Then

A'_{n+1} − A'_n = (A_{n+1} − A_n) + (m_{n+1} − m_n) − (m'_{n+1} − m'_n),

and if we take conditional expectations on both sides, we find that (P-a.s.) A'_{n+1} − A'_n = A_{n+1} − A_n. But A_0 = A'_0 = 0, and therefore A_n = A'_n and m_n = m'_n (P-a.s.) for all n ≥ 0. This completes the proof of the theorem.

It follows from (11) that the sequence A = (A_n, F_{n−1}) compensates X = (X_n, F_n) so that it becomes a martingale. This observation is justified by the following definition.

Definition 7. The predictable increasing sequence A = (A_n, F_{n−1}) appearing in the Doob decomposition (11) is called a compensator (of the submartingale X).

The Doob decomposition plays a key role in the study of square integrable martingales M = (M_n, F_n), i.e. martingales for which EM_n² < ∞, n ≥ 0; this


depends on the observation that the stochastic sequence M² = (M_n², F_n) is a submartingale. According to Theorem 2 there are a martingale m = (m_n, F_n) and a predictable increasing sequence ⟨M⟩ = (⟨M⟩_n, F_{n−1}) such that

M_n² = m_n + ⟨M⟩_n.   (14)

The sequence ⟨M⟩ is called the quadratic variation of M and, in many respects, determines its structure and properties. It follows from (13) that

⟨M⟩_n = Σ_{j=1}^{n} E[(ΔM_j)² | F_{j−1}]   (15)

and, for all l ≤ k,

E[(M_k − M_l)² | F_l] = E[M_k² − M_l² | F_l] = E[⟨M⟩_k − ⟨M⟩_l | F_l].   (16)

In particular, if M_0 = 0 (P-a.s.), then

EM_k² = E⟨M⟩_k.   (17)

It is useful to observe that if M_0 = 0 and M_n = ξ_1 + ··· + ξ_n, where (ξ_n) is a sequence of independent random variables with Eξ_i = 0 and Eξ_i² < ∞, then the quadratic variation

⟨M⟩_n = Σ_{i=1}^{n} Eξ_i²   (18)

is not random, and indeed coincides with the variance VM_n.

If X = (X_n, F_n) and Y = (Y_n, F_n) are square integrable martingales, we put

⟨X, Y⟩_n = (1/4)[⟨X + Y⟩_n − ⟨X − Y⟩_n].   (19)

It is easily verified that (X_n Y_n − ⟨X, Y⟩_n, F_n) is a martingale and therefore, for l ≤ k,

E[(X_k − X_l)(Y_k − Y_l) | F_l] = E[⟨X, Y⟩_k − ⟨X, Y⟩_l | F_l].   (20)

In the case when X_n = ξ_1 + ··· + ξ_n and Y_n = η_1 + ··· + η_n, where (ξ_n) and (η_n) are sequences of independent random variables with Eξ_i = Eη_i = 0 and Eξ_i² < ∞, Eη_i² < ∞, the variable ⟨X, Y⟩_n is given by

⟨X, Y⟩_n = Σ_{i=1}^{n} cov(ξ_i, η_i).

The sequence ⟨X, Y⟩ = (⟨X, Y⟩_n, F_{n−1}) is often called the mutual variation of the (square integrable) martingales X and Y.
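Both the Doob decomposition and the quadratic variation can be seen concretely on a fair ±1 random walk M_n = η_1 + ··· + η_n: here E((ΔM_j)² | F_{j−1}) = 1, so by (15) ⟨M⟩_n = n, the martingale part of (14) is m_n = M_n² − n, and (17) gives EM_n² = n. A minimal numerical sketch (walk length, trial count, and seed are arbitrary choices):

```python
import random

# Doob decomposition (14) and quadratic variation (15)-(18) for a fair
# +-1 random walk: <M>_n = n and m_n = M_n^2 - n.
random.seed(8)
M, bracket, mart = 0, 0, 0
for _ in range(100):
    M += random.choice((-1, 1))
    bracket += 1                       # increment of <M>_n by (15)
    mart = M * M - bracket             # martingale part of (14)
    assert M * M == mart + bracket     # the decomposition holds on the path

# check (17), E M_n^2 = E <M>_n = n, by simulation at n = 100
trials, acc = 50_000, 0
for _ in range(trials):
    S = sum(random.choice((-1, 1)) for _ in range(100))
    acc += S * S
print(round(acc / trials))   # close to 100
```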

PROBLEMS

1. Show that (2) and (3) are equivalent. 2. Let u and 1: be Markov times. Show that 1: times; and if P(u:;;; 1:) = 1, then s;,. s; s;,.

+ u, 1:

1\

u, and

1:

v u are also Markov

456

VII. Sequences of Random Variables That Form Martingales

3. Show that rand X, are ff,-measurable. 4. Let Y = (Y,, ff.) be a martingale (or submartingale), let V = (V,., §,;_ 1 ) be a predictable sequence, and let (V. Y). be integrable random variables, n ;::: 0. Show that V. Yis a martingale (or submartingale). 5. Let 3"1 ~ 3"2 ~ · • · be a nondecreasing family of a-algebras and ~ an integrable random variable. Show that (X.).;, 1 with X. = E(~ I§,;) is a martingale. 6. Let!'§ 1 2 !'§ 2 2 · · · be a nonincreasing family of a-algebras and let ~be an integrable random variable. Show that (X.).;, 1 with x. = E(~ I"§.) is a bounded martingale, i.e. E(X.IX.+ 1 , x.+ 2 ,

••• )

= x.+l

(P-a.s.)

for every n ;::: 1. 7. Let ~ 1 , ~ 2 , ~ 3 , .•• be independent random variables, P(~; = 0) = P(~; = 2) =!and x. = TI7= 1 ~;·Show that there do not exist an integrable random variable~ and a nondecreasing family(§,;) of a-algebras such that X" = E( ~I§,;). This example shows that not every martingale (X.).;, 1 can be represented in the form (E(~I§.;)).;, 1 ; compare Example 3, §11, Chapter 1.) 8. Let X= (X.,§,;), n;::: 0, be a square integrable martingale with EX.= 0. Show that it has orthogonal increments, i.e. m# n, where L1Xk = Xk- Xk-! fork :2: 1 and £1X 0 = X 0 . (Consequently the square integrable martingales occupy a position in the class of stochastic sequences with zero mean and finite second moments, intermediate between sequences with independent increments and sequences with orthogonal increments.)

§2. Preservation of the Martingale Property Under Time Change at a Random Time 1. If X = (Xn, ~)n<:O is a martingale, we have

EXn = EXo

(1)

for every n ;:::: 1. Is this property preserved if the time n is replaced by a Markov timer? Example 8 of the preceding section shows that, in general, the answer is "no": there exist a martingale X and a Markov time r (finite with probability 1) such that (2)

The following basic theorem describes the "typical" situation, in which, in particular, EX,= EX 0 .

§2. Preservation of the Martingale Property Under Time Change at a Random Time

457

Theorem 1 (Doob). Let X= (Xn, ~) be a martingale (or submartingale), and r 1 and r 2 , stopping timesfor which i = 1, 2, lim n- oo

Then

r

J > n}

IXnldP

= 0,

(3)

i = 1, 2.

(4)

{ti

(5)

(6) (Here and in the formulas below, read the upper symbol for martingales and the lower symbol for submartingales.) PROOF.

It is sufficient to show that, for every A E ~., (7)

For this, in turn, it is sufficient to show that, for every n 2:: 0,

or, what amounts to the same thing, (8)

where B =An {r1 We have (

JBn{<2~n)

= (,;;)

n}

E ~.

X n dP = (

( JBn{n,;;<2Sm}

whence

=

X n dP

X" 2 dP

+ ( JBn{<2>n}

JBn{<2=n}

+ (

JBn{<2>m}

Xm dP,

X n dP = (,;)

( JBn{<2=n}

X ndP

458

VII. Sequences of Random Variables That Form Martingales

and since Xm = 2X!.- IXml, we have, by (4), ( JBI"I{t2 ;;:n}

X,2 dP

= lim[ (

(;;,)

m-->oo

XndP- (

JBI"\{nSt2}

XmdP]

JB1"1{m<<2l

which establishes (8), and hence (5). Finally, (6) follows from (5). This completes the proof of the theorem.

Corollary 1. If there is a constant N such that P(• 1 :::;; N) = 1and P(• 2 :s; N) = 1, then (3) and (4) are satisfied. Hence if, in addition, P(< 1 :::;; <2 ) = 1 and X is a martingale, then (9)

Corollary 2. If the random variables {Xn} are uniformly integrable (in particular, ifiXnl :::;; C < oo, n ~ 0), then (3) and (4) are satisfied. In fact, P(<; > n)- 0, n- oo, and hence (4) follows from Lemma 2, §6, Chapter II. In addition, since the family {Xn} is uniformly integrable, we have (see Il.6.(16)) (10) If • is a stopping time and X is a submartingale, then by Corollary 1, applied

to the bounded time

•N = •

1\

N,

EX 0

:::;;

EX,N.

Therefore

EIX,NI = 2EX~- EX,N :s; 2EX~- EX 0 •

(11)

The sequence x+ = (X:,~) is a submartingale (Example 5, §1) and therefore EX~ =

Ni L

j=O

+

i

{
Xf dP +

f. x: dP :::;; LNi {t>N}

Xt dP =EXt

j=O

{tN=J1

:s; EIXNI:::;; supEIXNI· N

~>m

From this and (11) we have E IX,NI:::;; 3 supE IXNI, N

and hence by Fatou's lemma E IX,I :::;; 3 sup E IXNI· N

x: dP

§2. Preservation of the Martingale Property Under Time Change at a Random Time

Therefore if we take i = 1, 2.

T

=

Ti,

i

459

= 1, 2, and use (10), we obtain E I Xr, I < oo,

Remark. In Example 8 of the preceding section,

i

IXnl dP

(2n- 1)P{T > n}

=

(2n- 1). rn--+ 1,

=

n --+ oo,

{t>n}

and consequently (4) is violated (for

T 2 = T).

2. The following proposition, which we shall deduce from Theorem 1, is often useful in applications.

Theorem 2. Let X= (Xn) be a martingale (or submartingale) and T a stopping = O"{w: X 0 , ... , Xn). Suppose that time (with respect to (ff;), where

g;;

ET < 00, and that for some n

~

0 and some constant C

E{IXn+1- Xnll ff;}:::; C ({T

~ n}; P-a.s.).

Then EIXrl
and (12)

We first verify that hypotheses (3) and (4) of Theorem 1 are satisfied with Tz

=

T.

Let j ~ 1.

Then IXrl:::;

LJ=o lj and

L Ln 00

n=O j=O

The set {T ~ j}

f

{t<>:j}

=

i

{t=n}

lj dP =

0\ {T < j}

ljdP =

f

{t<>:j}

E

L L 00

00

j=O n=j

i

{t=n}

lj dP =

L 00

j=O

i

{t<'=J1

ffJ_ 1, j ~ 1. Therefore

E[l]IX0 ,

... ,

Xi_ 1] dP:::; CP{T

~ j}

lj dP.

460 for j

VII. Sequences of Random Variables That Form Martingales

~

1; and

EIX,I

~ E(t }j) ~ EIXol + Ci~t P{r ~j} = EIXol + CEr
Moreover, if r > n, then n

t

j=O

j=O

I lJ ~ I lJ,

and therefore

r

J{t>n}

Hence since (by (13)) E I.i= 0 convergence theorem yields

lim (

n-+oo J{t>n}

IXnldP

}j

~

r

±

ljdP.

J{t>n} j=O

< oo and {r > n}

I X" I dP

~

lim (

(

! 0, n --+ oo, the dominated

I }j) dP = 0.

n-+oo J{t>n} j=O

Hence the hypotheses of Theorem 1 are satisfied, and (12) follows as required. This completes the proof of the theorem. 3. Here we present some applications of the preceding theorems.

Theorem 3 (Wald's Identities). Let ~b ~ 2 , .•• be independent identically distributed random variables with EI~d < oo and r a stopping time (with respect to§"~), where§"~= a{ro: ~b •.. , ~n}, r ~ 1), and Er < oo. Then If also

(14)

Ea < oo then E{(~ 1

+ · · · + ~.)-

rE~d 2 = V~ 1 • Er.

PROOF. It is clear that X= (Xn, F~)n>t with Xn = (~ 1 is a martingale with

+ ··· + ~n)-

(15) nE~1

E[IXn+t- XniiXt, ... , X"]= E[l~n+t- E~tll~t• ... , ~nJ = El~n+t- E~tl ~ 2EI~tl
Therefore EX,= EX 0 = 0, by Theorem 2, and (14) is established. Similar considerations applied to the martingale Y = (Y,, §"~) with Y, = nV~ 1 lead to a proof of (15).

x; -

CoroUary. Let~ b ~ 2 •.•• be independent identically distributed random variables with

§2. Preservation of the Martingale Property Under Time Change at a Random Time

461

and r = inf{n;;::: 1:Sn = 1}. Then P{r
Y,

e'"s"( cp(to))- ".

=

Then Y = ( Y,, $'~)" > 1 is a martingale with EY, = 1 and, on the set {r ;;::: n},

E{l Y,+ 1

-

Y,ll Y1, ... , Y,}

Y,E{Ie;·;;:;-

=

111~1• .. ·' ~n}

= Y,·E{je 10 ~ 1
11}

:$;

B < oo,

where B is a constant. Therefore Theorem 2 is applicable, and (16) follows since EY1 = 1. This completes the proof.

Example 1. This example illustrates the use of the preceding theorems to find the probabilities of ruin and the mean duration of play in games (see §9, Chapter I). Let ξ₁, ξ₂, ... be a sequence of independent Bernoulli random variables with P(ξᵢ = 1) = p, P(ξᵢ = −1) = q, p + q = 1, S_n = ξ₁ + ··· + ξ_n, and

τ = inf{n ≥ 1: S_n = B or A},   (17)

where (−A) and B are positive integers. It follows from (1.9.20) that P(τ < ∞) = 1 and Eτ < ∞. Then if α = P(S_τ = A) and β = P(S_τ = B), we have α + β = 1. If p = q = ½ we obtain, from (14),

0 = ES_τ = αA + βB,

whence

α = B/(B + |A|),  β = |A|/(B + |A|).

Applying (15), we obtain

Eτ = ES_τ² = αA² + βB² = |AB|.

VII. Sequences of Random Variables That Form Martingales

However, if p ≠ q we find, by considering the martingale ((q/p)^{S_n})_{n≥1}, that

E(q/p)^{S_τ} = E(q/p)^{S₁} = 1,

and therefore

α(q/p)^A + β(q/p)^B = 1.

Together with the equation α + β = 1 this yields

α = [(q/p)^B − 1]/[(q/p)^B − (q/p)^A],  β = [1 − (q/p)^A]/[(q/p)^B − (q/p)^A].   (18)

Finally, since ES_τ = (p − q)Eτ, we find

Eτ = ES_τ/(p − q) = (αA + βB)/(p − q),
where α and β are defined by (18).

Example 2. In the example just considered, let p = q = ½. Let us show that for every λ with 0 < λ < π/(B + |A|) and the time τ defined in (17),

E(cos λ)^{−τ} = cos(λ(B + A)/2) / cos(λ(B + |A|)/2).   (19)

For this purpose we consider the martingale X = (X_n, F_n^ξ)_{n≥0} with

X_n = (cos λ)^{−n} cos λ(S_n − (B + A)/2)   (20)

and S₀ = 0. It is clear that

EX_n = EX₀ = cos(λ(B + A)/2).   (21)

Let us show that the family {X_{n∧τ}} is uniformly integrable. For this purpose we observe that, by Corollary 1 to Theorem 1, for 0 < λ < π/(B + |A|),

cos(λ(B + A)/2) = EX_{n∧τ} = E[(cos λ)^{−(n∧τ)} cos λ(S_{n∧τ} − (B + A)/2)] ≥ E(cos λ)^{−(n∧τ)} · cos(λ(B + |A|)/2),

since |S_{n∧τ} − (B + A)/2| ≤ (B + |A|)/2 and λ(B + |A|)/2 < π/2. Therefore, by (21),

E(cos λ)^{−(n∧τ)} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2),

and consequently, by Fatou's lemma,

E(cos λ)^{−τ} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2).   (22)

Consequently, by (20),

|X_{n∧τ}| ≤ (cos λ)^{−τ}.

With (22), this establishes the uniform integrability of the family {X_{n∧τ}}. Then, by Corollary 2 to Theorem 1,

cos(λ(B + A)/2) = EX₀ = EX_τ = E[(cos λ)^{−τ} cos λ(S_τ − (B + A)/2)] = E(cos λ)^{−τ} · cos(λ(B + |A|)/2),

from which the required equality (19) follows.
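The closed forms of Example 1 can be compared with a direct simulation of the game. The helper names below are ours, written for this illustration; the chosen parameters (p = 0.55, A = −4, B = 3) are arbitrary.

```python
import random

def ruin_formulas(p, A, B):
    """alpha = P(S_tau = A), beta = P(S_tau = B) and E tau via (18),
    for the asymmetric case p != q."""
    q = 1.0 - p
    rho = q / p
    alpha = (rho ** B - 1) / (rho ** B - rho ** A)
    beta = (1 - rho ** A) / (rho ** B - rho ** A)
    e_tau = (alpha * A + beta * B) / (p - q)
    return alpha, beta, e_tau

def ruin_simulation(p, A, B, n_sims=20000, seed=3):
    """Empirical P(S_tau = A) and E tau for the stopped +/-1 walk."""
    rng = random.Random(seed)
    hits_A, tau_total = 0, 0
    for _ in range(n_sims):
        s, n = 0, 0
        while A < s < B:
            s += 1 if rng.random() < p else -1
            n += 1
        tau_total += n
        if s == A:
            hits_A += 1
    return hits_A / n_sims, tau_total / n_sims

alpha, beta, e_tau = ruin_formulas(0.55, -4, 3)
alpha_hat, e_tau_hat = ruin_simulation(0.55, -4, 3)
```

For these parameters (18) gives α ≈ 0.2686, β ≈ 0.7314 and Eτ ≈ 11.2.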

4. PROBLEMS

1. Show that Theorem 1 remains valid for submartingales if (4) is replaced by

lim_{n→∞} ∫_{{τᵢ > n}} X_n⁺ dP = 0,  i = 1, 2.

2. Let X = (X_n, F_n)_{n≥0} be a square-integrable martingale, τ a stopping time, and

lim_{n→∞} ∫_{{τ > n}} X_n² dP = 0,  lim_{n→∞} ∫_{{τ > n}} |X_n| dP = 0.

Show that then

EX_τ² = E Σ_{j=0}^{τ} (ΔX_j)²,

where ΔX₀ = X₀, ΔX_j = X_j − X_{j−1}, j ≥ 1.

3. Show that

E|X_τ| ≤ lim E|X_n|

for every martingale or nonnegative submartingale X = (X_n, F_n)_{n≥0} and every stopping time τ.

4. Let X = (X_n, F_n)_{n≥0} be a submartingale such that X_n ≥ E(ξ|F_n) (P-a.s.), n ≥ 0, where E|ξ| < ∞. Show that if τ₁ and τ₂ are stopping times with P(τ₁ ≤ τ₂) = 1, then

X_{τ₁} ≤ E(X_{τ₂} | F_{τ₁})  (P-a.s.).

5. Let ξ₁, ξ₂, ... be a sequence of independent random variables with P(ξᵢ = 1) = P(ξᵢ = −1) = ½, let a and b be positive numbers with b > a,

X_n = a Σ_{k=1}^{n} I(ξ_k = +1) − b Σ_{k=1}^{n} I(ξ_k = −1),

and τ = inf{n ≥ 1: X_n ≤ −r}, r > 0. Show that Ee^{λτ} < ∞ for λ ≤ α₀ and Ee^{λτ} = ∞ for λ > α₀, where

α₀ = (a/(a+b)) ln(2a/(a+b)) + (b/(a+b)) ln(2b/(a+b)).

6. Let ξ₁, ξ₂, ... be a sequence of independent random variables with Eξᵢ = 0, Vξᵢ = σᵢ², S_n = ξ₁ + ··· + ξ_n, F_n^ξ = σ{ω: ξ₁, ..., ξ_n}. Prove the following generalizations of Wald's identities (14) and (15): if E Σ_{j=1}^{τ} E|ξ_j| < ∞, then ES_τ = 0; if E Σ_{j=1}^{τ} Eξ_j² < ∞, then

ES_τ² = E Σ_{j=1}^{τ} ξ_j² = E Σ_{j=1}^{τ} σ_j².   (23)

§3. Fundamental Inequalities

1. Let X = (X_n, F_n)_{n≥0} be a stochastic sequence, p > 0,

X_n* = max_{0≤j≤n} |X_j|,  ‖X_n‖_p = (E|X_n|^p)^{1/p}.

Theorem 1 (Doob). Let X = (X_n, F_n) be a nonnegative submartingale. Then, for every ε > 0 and all n ≥ 0,

P{X_n* ≥ ε} ≤ (1/ε) ∫_{{X_n* ≥ ε}} X_n dP ≤ EX_n/ε,   (1)

‖X_n‖_p ≤ ‖X_n*‖_p ≤ (p/(p − 1)) ‖X_n‖_p  if p > 1,   (2)

‖X_n‖₁ ≤ ‖X_n*‖₁ ≤ (e/(e − 1)){1 + ‖X_n ln⁺ X_n‖₁}  if p = 1.   (3)

PROOF. Put

τ_n = min{j ≤ n: X_j ≥ ε},

taking τ_n = n if max_{0≤j≤n} X_j < ε. Then, by (2.6),

EX_n ≥ EX_{τ_n} = ∫_{{X_n* ≥ ε}} X_{τ_n} dP + ∫_{{X_n* < ε}} X_{τ_n} dP ≥ εP{X_n* ≥ ε} + ∫_{{X_n* < ε}} X_n dP.

Therefore

εP{X_n* ≥ ε} ≤ EX_n − ∫_{{X_n* < ε}} X_n dP = ∫_{{X_n* ≥ ε}} X_n dP ≤ EX_n,

which establishes (1).

The first inequalities in (2) and (3) are evident. In proving the second inequality in (2), we first suppose that

‖X_n*‖_p < ∞,   (4)

and use the equation

Eη^r = r ∫₀^∞ t^{r−1} P(η ≥ t) dt,   (5)

which is satisfied for every nonnegative random variable η and every r > 0. Then we find from (1) and Fubini's theorem that, when p > 1,

E(X_n*)^p = p ∫₀^∞ t^{p−1} P{X_n* ≥ t} dt ≤ p ∫₀^∞ t^{p−2} (∫_Ω X_n I{X_n* ≥ t} dP) dt
= p ∫_Ω X_n [∫₀^{X_n*} t^{p−2} dt] dP = (p/(p − 1)) E[X_n (X_n*)^{p−1}].   (6)

Hence, by Hölder's inequality,

E(X_n*)^p ≤ q‖X_n‖_p · ‖(X_n*)^{p−1}‖_q = q‖X_n‖_p [E(X_n*)^p]^{1/q},   (7)

where q = p/(p − 1). If (4) is satisfied, the second inequality in (2) follows at once. If (4) is not satisfied, we may proceed as follows. In (6), we consider (X_n* ∧ L) instead of X_n*, where L is a constant. We obtain

E(X_n* ∧ L)^p ≤ qE[X_n(X_n* ∧ L)^{p−1}] ≤ q‖X_n‖_p [E(X_n* ∧ L)^p]^{1/q},

and it follows from the inequality E(X_n* ∧ L)^p ≤ L^p < ∞ that

E(X_n* ∧ L)^p ≤ q^p EX_n^p = q^p ‖X_n‖_p^p,

and therefore

E(X_n*)^p = lim_{L→∞} E(X_n* ∧ L)^p ≤ q^p ‖X_n‖_p^p.

We can now prove the second inequality in (3). Applying (1) again, we find that

EX_n* − 1 ≤ E(X_n* − 1)⁺ = ∫₀^∞ P{X_n* − 1 ≥ t} dt ≤ ∫₀^∞ [1/(1 + t)] [∫_{{X_n* ≥ 1+t}} X_n dP] dt
= E[X_n ∫₀^{X_n* − 1} dt/(1 + t)] = EX_n ln X_n*.

Since

a ln b ≤ a ln⁺ a + b e^{−1}   (8)

for all a ≥ 0 and b > 0, we have

EX_n* − 1 ≤ EX_n ln X_n* ≤ EX_n ln⁺ X_n + e^{−1} EX_n*.

If EX_n* < ∞ we then obtain (3) immediately. If EX_n* = ∞, we again introduce X_n* ∧ L in place of X_n* and proceed as above.

This completes the proof of the theorem.

Corollary 1. Let X = (X_n, F_n) be a square-integrable martingale. Then X² = (X_n², F_n) is a submartingale, and it follows from (1) that

P{max_{j≤n} |X_j| ≥ ε} ≤ EX_n²/ε².   (9)

In particular, if X_j = ξ₀ + ··· + ξ_j, where (ξ_j) is a sequence of independent random variables with Eξ_j = 0 and Eξ_j² < ∞, inequality (9) becomes Kolmogorov's inequality (§2, Chapter IV).

Corollary 2. If X = (X_n, F_n) is a square-integrable martingale, we find from (2) that

E[max_{j≤n} X_j²] ≤ 4EX_n².   (10)
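Inequalities (9) and (10) can be seen at work on a symmetric ±1 walk, for which ES_n² = n exactly. The simulation below is our illustrative sketch (names and parameters are assumptions, not the book's).

```python
import random

def doob_check(n=50, eps=10.0, n_sims=20000, seed=4):
    """For the martingale S (symmetric +/-1 walk) estimate
    P(max_j |S_j| >= eps), E max_j S_j^2 and E S_n^2, to compare with
    Kolmogorov's inequality (9) and Doob's L^2 inequality (10)."""
    rng = random.Random(seed)
    exceed = 0
    max_sq_sum = 0.0
    last_sq_sum = 0.0
    for _ in range(n_sims):
        s, max_abs = 0, 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
            max_abs = max(max_abs, abs(s))
        if max_abs >= eps:
            exceed += 1
        max_sq_sum += max_abs ** 2
        last_sq_sum += s * s
    return exceed / n_sims, max_sq_sum / n_sims, last_sq_sum / n_sims

p_exceed, e_max_sq, e_last_sq = doob_check()
# (9):  p_exceed <= e_last_sq / eps^2;  (10): e_max_sq <= 4 * e_last_sq
```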

2. Let X= (X",§,;) be a nonnegative submartingale and let

Xn = Mn +An be its Doob decomposition. Then since EM"= 0 it follows from (1) that P{Xn* ~

} ::5;-. EAn

B

B

Theorem 2 (below) will show that this inequality is valid not only for submartingales but also for the wider class of sequences that satisfy a dominance relation, in the following sense.

467

§3. Fundamental Inequalities

Definition. Let (Xn, ff.) be a nonnegative stochastic sequence and let A = (An, ff._ 1) be an increasing predictable sequence. We say that X is dominated by A if (11)

for every stopping time τ.

Theorem 2. If X = (X_n, F_n) is a nonnegative stochastic sequence dominated by the increasing predictable sequence A = (A_n, F_{n−1}), then, for every ε > 0, a > 0, and stopping time τ,

P{X_τ* ≥ ε} ≤ EA_τ/ε,   (12)

P{X_τ* ≥ ε} ≤ (1/ε) E(A_τ ∧ a) + P(A_τ ≥ a),   (13)

‖X_τ*‖_p ≤ ((2 − p)/(1 − p))^{1/p} ‖A_τ‖_p,  0 < p < 1.   (14)

PROOF. Put

σ = min{j ≤ τ ∧ n: X_j ≥ ε},

taking σ = τ ∧ n if {·} = ∅. Then

EA_τ ≥ EA_σ ≥ EX_σ ≥ ∫_{{X*_{τ∧n} ≥ ε}} X_σ dP ≥ εP{X*_{τ∧n} ≥ ε},

whence

P{X*_{τ∧n} ≥ ε} ≤ EA_τ/ε.

Inequality (12) now follows from Fatou's lemma. To prove (13) we introduce the time

γ = inf{j: A_{j+1} ≥ a},

taking γ = ∞ if {·} = ∅. Since A is predictable, γ is a stopping time, and A_{τ∧γ} ≤ A_τ ∧ a; moreover X_τ* = X*_{τ∧γ} on {A_τ < a}. Then

P{X_τ* ≥ ε} = P{X_τ* ≥ ε, A_τ < a} + P{X_τ* ≥ ε, A_τ ≥ a}
≤ P{X*_{τ∧γ} ≥ ε} + P{A_τ ≥ a} ≤ (1/ε) EA_{τ∧γ} + P{A_τ ≥ a} ≤ (1/ε) E(A_τ ∧ a) + P{A_τ ≥ a},

where we have used (12) and the inequality A_{τ∧γ} ≤ A_τ ∧ a. Finally, by (13),

‖X_τ*‖_p^p = E(X_τ*)^p = ∫₀^∞ P{(X_τ*)^p ≥ t} dt = ∫₀^∞ P{X_τ* ≥ t^{1/p}} dt
≤ ∫₀^∞ t^{−1/p} E(A_τ ∧ t^{1/p}) dt + ∫₀^∞ P{A_τ^p ≥ t} dt
= E ∫₀^{A_τ^p} dt + E ∫_{A_τ^p}^∞ A_τ t^{−1/p} dt + EA_τ^p
= EA_τ^p + (p/(1 − p)) EA_τ^p + EA_τ^p = ((2 − p)/(1 − p)) EA_τ^p.

This completes the proof of the theorem.

Remark. Suppose that, under the hypotheses of the theorem, the increasing sequence A = (A_n, F_n) is not predictable, but for some c > 0

P(sup_k |ΔA_k| ≤ c) = 1,

where ΔA_k = A_k − A_{k−1} for k ≥ 1. Then (compare (13)) we have the inequality

P(X_τ* ≥ ε) ≤ (1/ε) E(A_τ ∧ (a + c)) + P(A_τ ≥ a).

(The proof is the same as for (13), with γ = inf{j: A_{j+1} ≥ a} replaced by γ = inf{j: A_j ≥ a}, taking account of the inequality A_{τ∧γ} ≤ a + c.)

Corollary. Let the sequences Xⁿ and Aⁿ satisfy, for each n ≥ 1, the hypotheses of Theorem 2 or of the preceding Remark (with P(sup_k |ΔA_kⁿ| ≤ c) = 1), and for some sequence of stopping times {τ_n} let

A^n_{τ_n} → 0 in probability,  n → ∞.

Then

(Xⁿ)*_{τ_n} → 0 in probability,  n → ∞.
3. In this subsection we present (without proofs, but with applications) a number of significant inequalities for martingales. These generalize the inequalities of Khinchin and of Marcinkiewicz and Zygmund for sums of independent random variables. Khinchin's Inequalities. Let ~ 1 , ~ 2 , . . . be independent identically distributed Bernoulli random variables with P(~; = 1) = P(~; = -1) = 1 and let (cn)n2! 1 be a sequence of numbers.

469

§3. Fundamental Inequalities

Then for every p, 0 < p < oo, there are universal constants AP and BP (independent of(cn)) such that (15)

for every n

~

1.

The following result generalizes these inequalities (for p

~

1):

Marcinkiewicz and Zygmund's Inequalities. If ~ 1 , ~ 2 , ••• is a sequence of independent integrable random variables with E~; = 0, then for p ~ 1 there are universal constants AP and BP (independent of(~n)) such that

for every n

~

1.

In (15) and (16) the sequences X= (Xn) with Xn = 'f,j= 1 ci~i and Xn =

Li= 1 ~i are martingales. It is natural to ask whether the inequalities can be

extended to arbitrary martingales. The first result in this direction was obtained by Burkholder.

Burkholder's Inequalities. If X= (Xn, ~) is a martingale, then for every p > 1 there are universal constants AP and BP (independent of X) such that (17) for every n ~ 1, where [X]" is the quadratic variation of X",

[XJn =

n

L (£\X) 2 ,

X 0 = 0.

(18)

j= 1

The constants AP and BP can be taken to have the values

It follows from (17), by using (2), that

APIIJEX111p

~ IIX:IIp ~ B;IIJ[Xr.;IIP'

(19)

where

Burkholder's inequalities (17) hold for p > 1, whereas the MarcinkiewiczZygmund inequalities (16) also hold when p = 1. What can we say about the validity of (17) for p = 1? It turns out that a direct generalization to p = 1 is impossible, as the following example shows.

EXAMPLE. Let ξ₁, ξ₂, ... be independent Bernoulli random variables with P(ξᵢ = 1) = P(ξᵢ = −1) = ½, and let

X_n = Σ_{j=1}^{n∧τ} ξ_j,

where

τ = inf{n ≥ 1: ξ₁ + ··· + ξ_n = 1}.

The sequence X = (X_n, F_n^ξ) is a martingale with sup_n E|X_n| ≤ 2 < ∞. But

‖[X]_n^{1/2}‖₁ = E[X]_n^{1/2} = E(Σ_{j=1}^{n∧τ} 1)^{1/2} = E(n ∧ τ)^{1/2} → Eτ^{1/2} = ∞,  n → ∞.

Consequently the first inequality in (17) fails for p = 1. It turns out that when p = 1 we must generalize not (17), but (19) (which is equivalent to it when p > 1).
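For p = 2, on the other hand, the two sides of (17) actually coincide for martingales starting from 0: EX_n² = E[X]_n, because martingale increments are uncorrelated. This can be checked exactly by enumerating all paths of a small example; the code below is our sketch, using the martingale X_j = S_j² − j built from a symmetric ±1 walk.

```python
from itertools import product

def exact_second_moments(n=6):
    """Enumerate all 2^n equally likely paths of the symmetric +/-1 walk S
    and the martingale X_j = S_j^2 - j (X_0 = 0).  Return the exact values
    of E X_n^2 and E [X]_n, which must be equal."""
    e_x2 = 0.0
    e_qv = 0.0
    paths = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        x_prev = 0.0
        qv = 0.0
        for j, e in enumerate(signs, start=1):
            s += e
            x = s * s - j
            qv += (x - x_prev) ** 2   # (Delta X_j)^2
            x_prev = x
        e_x2 += x_prev ** 2
        e_qv += qv
        paths += 1
    return e_x2 / paths, e_qv / paths

e_x2, e_qv = exact_second_moments()
```

For this martingale one can also compute EX_n² = 2n(n − 1) by hand (= 60 at n = 6), which the enumeration reproduces.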

Davis's Inequality. If X = (X_n, F_n) is a martingale, there are universal constants A and B, 0 < A < B < ∞, such that

A ‖[X]_n^{1/2}‖₁ ≤ ‖X_n*‖₁ ≤ B ‖[X]_n^{1/2}‖₁,   (20)

where [X]_n = Σ_{j=1}^n (ΔX_j)².

Corollary 1. Let ξ₁, ξ₂, ... be independent identically distributed random variables, S_n = ξ₁ + ··· + ξ_n. If E|ξ₁| < ∞ and Eξ₁ = 0, then

ES_τ = 0   (21)

for every stopping time τ (with respect to (F_n^ξ)) for which Eτ < ∞.

It turns out that (21) is still valid under a weaker hypothesis than Eτ < ∞ if we impose stronger conditions on the random variables. In fact, if

E|ξ₁|^r < ∞,

where 1 < r ≤ 2, the condition Eτ^{1/r} < ∞ is a sufficient condition for ES_τ = 0. For the proof, we put τ_n = τ ∧ n, Y = sup_n |S_{τ_n}|, and let m = [t^r] (integral part of t^r) for t > 0. By Corollary 1 to Theorem 1, §2, we have ES_{τ_n} = 0. Therefore a sufficient condition for ES_τ = 0 is (by the dominated convergence theorem) that E sup_n |S_{τ_n}| < ∞, i.e. EY < ∞.

Using (1) and (17), we obtain

P(Y ≥ t) = P(τ ≥ t^r, Y ≥ t) + P(τ < t^r, Y ≥ t)
≤ P(τ ≥ t^r) + P{max_{1≤j≤m} |S_{τ_j}| ≥ t}
≤ P(τ ≥ t^r) + t^{−r} E|S_{τ_m}|^r
≤ P(τ ≥ t^r) + t^{−r} B_r E Σ_{j=1}^{τ_m} |ξ_j|^r.

Notice that (with F₀^ξ = {∅, Ω})

E Σ_{j=1}^{τ_m} |ξ_j|^r = E Σ_{j=1}^∞ I(j ≤ τ_m)|ξ_j|^r = Σ_{j=1}^∞ E E[I(j ≤ τ_m)|ξ_j|^r | F_{j−1}^ξ]
= Σ_{j=1}^∞ E I(j ≤ τ_m) E[|ξ_j|^r | F_{j−1}^ξ] = E Σ_{j=1}^{τ_m} E|ξ_j|^r = μ_r Eτ_m,

where μ_r = E|ξ₁|^r. Consequently,

P(Y ≥ t) ≤ P(τ ≥ t^r) + B_r μ_r t^{−r} Eτ_m
≤ P(τ ≥ t^r) + B_r μ_r t^{−r} [mP(τ ≥ t^r) + ∫_{{τ < t^r}} τ dP]
≤ (1 + B_r μ_r) P(τ ≥ t^r) + B_r μ_r t^{−r} ∫_{{τ < t^r}} τ dP

(since m ≤ t^r), and therefore

EY = ∫₀^∞ P(Y ≥ t) dt ≤ (1 + B_r μ_r) Eτ^{1/r} + B_r μ_r ∫₀^∞ t^{−r} [∫_{{τ < t^r}} τ dP] dt
= (1 + B_r μ_r) Eτ^{1/r} + B_r μ_r E[τ ∫_{τ^{1/r}}^∞ t^{−r} dt]
= (1 + B_r μ_r + B_r μ_r/(r − 1)) Eτ^{1/r} < ∞.

Corollary 2. Let M = (M_n) be a martingale with E|M_n|^{2r} < ∞ for some r ≥ 1 and such that (with M₀ = 0)

Σ_{n=1}^∞ E|ΔM_n|^{2r} / n^{1+r} < ∞.   (22)

Then (compare Theorem 2 of §3, Chapter IV) we have the strong law of large numbers:

M_n/n → 0  (P-a.s.),  n → ∞.   (23)

When r = 1 the proof follows the same lines as the proof of Theorem 2, §3, Chapter IV. In fact, let

m_n = Σ_{k=1}^n ΔM_k/k.

Then

M_n = Σ_{k=1}^n k Δm_k,

and, by Kronecker's lemma (§3, Chapter IV), a sufficient condition for the limit relation (P-a.s.)

(1/n) Σ_{k=1}^n k Δm_k → 0,  n → ∞,

is that the limit lim_n m_n exists and is finite (P-a.s.), which in turn (Theorems 1 and 4, §10, Chapter II) is true if and only if

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} → 0,  n → ∞.   (24)

By (1),

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} ≤ ε^{−2} Σ_{k=n}^∞ E(ΔM_k)²/k².

Hence the required result follows from (22) and (24).

Now let r > 1. Then the statement (23) is equivalent (Theorem 1, §10, Chapter II) to the statement that

P{sup_{j≥n} |M_j|/j ≥ ε} → 0,  n → ∞,   (25)

for every ε > 0. By inequality (29) of Problem 1 (applied to the submartingale |M_j|^{2r} with V_j = j^{−2r}),

ε^{2r} P{sup_{n≤j≤N} |M_j|/j ≥ ε} ≤ E|M_n|^{2r}/n^{2r} + Σ_{j=n+1}^N j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}).

It follows from Kronecker's lemma and (22) that

lim_{n→∞} E|M_n|^{2r}/n^{2r} = 0.

Hence to prove (25) we need only prove that

Σ_{j≥2} j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}) < ∞.   (26)

We have, by summation by parts,

Σ_{j=2}^N j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}) ≤ Σ_{j=3}^N [(j − 1)^{−2r} − j^{−2r}] E|M_{j−1}|^{2r} + E|M_N|^{2r}/N^{2r}.

By Burkholder's inequality (17) and Hölder's inequality,

E|M_j|^{2r} ≤ C E[Σ_{i=1}^j (ΔM_i)²]^r ≤ C j^{r−1} Σ_{i=1}^j E|ΔM_i|^{2r}.

Hence

Σ_{j=2}^{N−1} [j^{−2r} − (j + 1)^{−2r}] j^{r−1} Σ_{i=1}^j E|ΔM_i|^{2r} ≤ C₁ Σ_{j=2}^{N−1} j^{−r−2} Σ_{i=1}^j E|ΔM_i|^{2r} ≤ C₂ Σ_{j=2}^N E|ΔM_j|^{2r}/j^{r+1} + C₃

(C, C₁, C₂, C₃ are constants). By (22), this establishes (26).
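The strong law (23) can be watched numerically on a martingale with growing but (22)-compatible increment variances. The sketch below is our illustration (names and the variance profile k^{0.4} are assumptions chosen so that (22) holds with r = 1: Σ k^{0.4}/k² < ∞).

```python
import random

def avg_abs_mn_over_n(n=20000, n_paths=50, seed=7):
    """Average of |M_n / n| over independent paths, where M is a martingale
    with independent centered Gaussian increments of variance k**0.4
    (standard deviation k**0.2 at step k)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_paths):
        m = 0.0
        for k in range(1, n + 1):
            m += rng.gauss(0.0, k ** 0.2)
        acc += abs(m) / n
    return acc / n_paths

avg = avg_abs_mn_over_n()
# Var(M_n / n) ~ n**(-0.6) / 1.4, so avg should already be small at n = 20000.
```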

The sequence of random variables {X_n}_{n≥1} has a limit lim X_n (finite or infinite) with probability 1 if and only if the number of its "oscillations between two arbitrary rational numbers a and b, a < b" is finite with probability 1. Theorem 3, below, provides an upper bound for the number of "oscillations" for submartingales. In the next section, this will be applied to prove the fundamental result on their convergence.

Let us choose two numbers a and b, a < b, and define the following times in terms of the stochastic sequence X = (X_n, F_n):

τ₀ = 0,
τ₁ = min{n > 0: X_n ≤ a},
τ₂ = min{n > τ₁: X_n ≥ b},
. . . . . . . . . . . . . . . . . . . .
τ_{2m−1} = min{n > τ_{2m−2}: X_n ≤ a},
τ_{2m} = min{n > τ_{2m−1}: X_n ≥ b},

taking τ_k = ∞ if the corresponding set {·} is empty. In addition, for each n ≥ 1 we define the random variables

β_n(a, b) = 0 if τ₂ > n;  β_n(a, b) = max{m: τ_{2m} ≤ n} if τ₂ ≤ n.

In words, β_n(a, b) is the number of upcrossings of [a, b] by the sequence X₁, ..., X_n.

Theorem 3 (Doob). Let X = (X_n, F_n)_{n≥1} be a submartingale. Then, for every n ≥ 1,

Eβ_n(a, b) ≤ E[X_n − a]⁺/(b − a).   (27)

PROOF. The number of upcrossings of [a, b] by X = (X_n, F_n) is equal to the number of upcrossings of [0, b − a] by the nonnegative submartingale X⁺ = ((X_n − a)⁺, F_n). Hence it is sufficient to suppose that X is nonnegative, take a = 0, and show that

Eβ_n(0, b) ≤ EX_n/b.   (28)

Put X₀ = 0, F₀ = {∅, Ω}, and for i = 1, 2, ..., let

φᵢ = 1 if τ_m < i ≤ τ_{m+1} for some odd m;  φᵢ = 0 if τ_m < i ≤ τ_{m+1} for some even m.

It is easily seen that

b β_n(0, b) ≤ Σ_{i=1}^n φᵢ(Xᵢ − X_{i−1})

and

{φᵢ = 1} = ∪_{odd m} [{τ_m < i}\{τ_{m+1} < i}] ∈ F_{i−1}.

Therefore

b Eβ_n(0, b) ≤ E Σ_{i=1}^n φᵢ(Xᵢ − X_{i−1}) = Σ_{i=1}^n ∫_{{φᵢ=1}} (Xᵢ − X_{i−1}) dP
= Σ_{i=1}^n ∫_{{φᵢ=1}} E(Xᵢ − X_{i−1} | F_{i−1}) dP = Σ_{i=1}^n ∫_{{φᵢ=1}} [E(Xᵢ | F_{i−1}) − X_{i−1}] dP
≤ Σ_{i=1}^n ∫_Ω [E(Xᵢ | F_{i−1}) − X_{i−1}] dP = EX_n,

which establishes (28).
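The definition of β_n(a, b) translates directly into a small state machine (wait to go ≤ a, then wait to go ≥ b, count, repeat), and Doob's bound (27) can be compared with an empirical upcrossing count. The code below is our sketch; the drifting-walk submartingale and the parameter values are illustrative assumptions.

```python
import random

def upcrossings(xs, a, b):
    """Number of upcrossings of [a, b] by the sequence xs, i.e. beta_n(a, b)
    defined through the alternating times tau_1 <= tau_2 <= ..."""
    count = 0
    below = False              # True once the path has reached level <= a
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count

def doob_upcrossing_check(p=0.55, a=0.0, b=2.0, n=100,
                          n_sims=20000, seed=8):
    """Compare the Monte Carlo estimate of E beta_n(a, b) with the bound
    E (X_n - a)^+ / (b - a) of (27), for the submartingale X_j = S_j
    (a +/-1 walk with positive drift)."""
    rng = random.Random(seed)
    beta_total = 0
    plus_total = 0.0
    for _ in range(n_sims):
        s, xs = 0, []
        for _ in range(n):
            s += 1 if rng.random() < p else -1
            xs.append(s)
        beta_total += upcrossings(xs, a, b)
        plus_total += max(s - a, 0.0)
    return beta_total / n_sims, (plus_total / n_sims) / (b - a)

e_beta, bound = doob_upcrossing_check()
```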

4. PROBLEMS

1. Let X = (X_n, F_n) be a nonnegative submartingale and let V = (V_n, F_{n−1}) be a predictable sequence such that 0 ≤ V_{n+1} ≤ V_n ≤ C (P-a.s.), where C is a constant. Establish the following generalization of (1):

εP{max_{1≤j≤n} V_j X_j ≥ ε} + ∫_{{max_{1≤j≤n} V_j X_j < ε}} V_n X_n dP ≤ E Σ_{j=1}^n V_j ΔX_j,   (29)

where ΔX_j = X_j − X_{j−1}, X₀ = 0.

2. Let X = (X_n, F_n) be a supermartingale. Show that

P{max_{1≤j≤n} |X_j| ≥ ε} ≤ (C/ε) · max_{1≤j≤n} E|X_j|,

where C ≤ 3 (the constant C can be taken to be 1 if X is a martingale or if X does not change sign).

3. Establish Krickeberg's decomposition: every martingale X = (X_n, F_n) with sup_n E|X_n| < ∞ can be represented as the difference of two nonnegative martingales.

4. Let X = (X_n, F_n) be a submartingale. Show that, for every ε > 0 and n ≥ 1,

εP{min_{1≤j≤n} X_j ≤ −ε} ≤ E(X_n − X₁) − ∫_{{min_{1≤j≤n} X_j ≤ −ε}} X_n dP ≤ EX_n⁺ − EX₁.

5. Let ξ₁, ξ₂, ... be a sequence of independent random variables, S_n = ξ₁ + ··· + ξ_n and S_{m,n} = Σ_{j=m+1}^n ξ_j. Establish Ottaviani's inequality:

P{max_{1≤j≤n} |S_j| > 2ε} ≤ P{|S_n| > ε} / min_{1≤j≤n} P{|S_{j,n}| ≤ ε},

and deduce that

∫₀^∞ P{max_{1≤j≤n} |S_j| > 2t} dt ≤ 2E|S_n| + 2 ∫_{2E|S_n|}^∞ P{|S_n| > t} dt.   (30)

6. Let ξ₁, ξ₂, ... be a sequence of independent random variables with Eξᵢ = 0. Use (30) to show that in this case we can strengthen inequality (3) to

ES_n* ≤ 8E|S_n|,

where S_n* = max_{1≤j≤n} |S_j|.

7. Verify formula (5).

8. Establish inequality (8).

9. Let the σ-algebras F₀, ..., F_n be such that F₀ ⊆ F₁ ⊆ ··· ⊆ F_n, and let the events A_k ∈ F_k, k = 1, ..., n. Use (13) to establish Dvoretzky's inequality: for each ε > 0,

P(∪_{k=1}^n A_k) ≤ ε + P(Σ_{k=1}^n P(A_k | F_{k−1}) > ε).

10. Let (X_n)_{n≥1} be a sequence of random variables with zero means, finite second moments, and orthogonal differences, E ΔX_m ΔX_n = 0, m ≠ n, where ΔX_k = X_k − X_{k−1}, k > 1, and ΔX₁ = X₁ (compare Problem 8, §1). Prove (compare Doob's inequality (10)) the Rademacher–Menshov inequality

E[max_{j≤n} X_j²] ≤ (ln 4n / ln 2)² EX_n².

476

VII. Sequences of Random Variables That Form Martingales

§4. General Theorems on the Convergence of Submartingales and Martingales 1. The following result, which is fundamental for all problems about the convergence of submartingales, can be thought of as an analog of the fact that in real analysis a bounded monotonic sequence of numbers has a (finite) limit.

Theorem 1 (Doob). Let X= (Xn,

~)be

a submartingale with (1)

supEIXnl
Then with probability 1, the limit lim X" = X co exists and E IX co I < oo. PROOF.

Suppose that (2)

Then since {liiTIXn > limXn} =

U {limXn > b >a> limXn}

a
(here a and bare rational numbers), there are values a and b such that P{limXn > b >a> limXn} > 0.

(3)

Let f3n(a, b) be the number of upcrossings of (a, b) by the sequence X 1 , . . . , Xn, and let f3co(a, b) = limnf3nCa, b). By (3.27), E/3 ( b)< E[Xn- a]+ <Ex: + lal b -a b -a n a, and therefore E/3 co (a, b)

+ lal < 00, . E/3 n(a, b) :::;; sup"Ex: = 11m b -a

n

which follows from (1) and the remark that supEIXnl
n

for submartingales (since Ex::::;; EIXnl = 2EX:- EX":::;; 2Ex:- EX 1 ). But the condition Ef3co(a, b) < oo contradicts assumption (3). Hence lim Xn = X co exists with probability 1, and then by Fatou's lemma E IX co I:::;; sup E IXnl <

00.

n

This completes the proof of the theorem. Corollary 1. If X is a nonpositive submartingale, then with probability 1 the limit lim X n exists and is finite.

§4. General Theorems on the Convergence of Submartingales and Martingales

477

Corollary 2. If X= (Xn, ~)n~ 1 is a nonpositive submartingale, the sequence X= (Xn,~) with 1 S n soo, X 00 = limXn and §'oo = a{U~} is a (nonpositive) submartingale. In fact, by Fatou's lemma EX 00 = E lim Xn;::: lim EXn;::: EX 1 > -oo and (P-a.s.) E(Xool§'m) = E(limXnl§'m);:::

Corollary 3. If X = (Xn, with probability 1.

~)is

lim E(Xnl§'m);::: Xm.

a nonnegative martingale, then lim Xn exists

In fact, in that case supEIXnl =sup EX"= EX 1
2. Let ~ b ~ 2 , ... be a sequence of independent random variables with P(~; = 0) = P(~; = 2) = !. Then X= (Xn, §'~), with Xn = 1 ~i and 0 §'~ = a{w: ~ 1 , . . . , ~n} is a martingale with EXn = 1 and Xn---* X 00 (P-a.s.). At the same time, it is clear that E IX n - X oo I = 1 and therefore Xn ~ X 00 • Therefore condition (1) does not in general guarantee the convergence of X n to X oo in the L 1 sense. Theorem 2 below shows that if hypothesis (1) is strengthened to uniform integrability of the family {X"} (from which (1) follows by Subsection 4, §6, Chapter II), then besides almost sure convergence we also have convergence in L 1 .

0i'=

=

Theorem 2. Let X = {Xn, ~} be a uniformly integrable submartingale (that is, the family {Xn} is uniformly integrable). Then there is a random variable X 00 withE IXoo I < oo, such that as n---* oo (4)

(5)

Moreover, the sequence X= (Xn, ~), 1 s n also a submartingale.

s

oo, with $'oo = a(U~), is

Statement (4) follows from Theorem 1, and (5) follows from (4) and Theorem 4, §6, Chapter II. Moreover, if A E ~and m ;::: n, then

PROOF.

m---* oo, and therefore lim

m-+oo

f

A

X m dP =

f

A

X oo dP.

478

VII. Sequences of Random Variables That Form Martingales

The sequence {JAXm dP}m~n is nondecreasing and therefore

~{

{ XndP

~{

XmdP

X 00 dP,

whence Xn ~ E(X oo Iff.) (P-a.s.) for n z 1. This completes the proof of the theorem.

Corollary. If X= (Xn, ff.) is a submartingale and,for some p > 1, (6)

supEIXniP
then there is an integrable random variable X 00 for which (4) and (5) are satisfied. For the proof, it is enough to observe that, by Lemma 3 of §6 of Chapter II, condition (6) guarantees the uniform integrability of the family{X"}.

3. We now present a theorem on the continuity properties of conditional expectations; this was one of the very first results about the convergence of martingales.

Theorem 3 (P. Levy). Let (Q, §', P) be a probability space, and let

(ff.)n~ 1

be a nondecreasing family of a-algebras, §'1 s;;: §'2 s;;: • • • s;;: §'. Let ~ be a ~). Then, both P-a.s. and random variable with E I~I < oo and §'00 = a( in the L 1 sense, n ...... oo. (7)

U"

PRooF. Let Xn =

f

Joxd~a)

E(~lff.),

IX; I dP

n

z

~ f

1. Then, with a> 0 and b > 0,

J{IXd~a)

E(l ~II F;) dP =

::; f

Joxd~a)n{l~l,;;b)

I~ I dP +

~ bP{IX;I z a}+

f

{1~1 >b)

~~EIX;I+f

l~ldP

f

l~ldP.

a

~ ~EI~I + a

ll~l>bJ

ll~l>bJ

f

J{IXd~a)

f

J{IXd~a)n{i~l>b}

1~1 dP

Letting a ...... oo and then b ...... oo, we obtain lim sup a-+oo

i

r

J{IX;I~a)

IX;! dP = 0,

i.e. the family {Xn} is uniformly integrable.

I~ I dP I~ I dP

479

§4. General Theorems on the Convergence of Submartingales and Martingales

Therefore, by Theorem 2, there is a random variable X oo such that Xn = E(~IFn) ~ Xoo ((P-a.s.) and in the L 1 sense). Hence we have only to show that Xoo = Let m

~

E(~l$'oo)

(P-a.s.).

n and A E ~. Then LxmdP= LxndP=

LE(~IFn)dP= L~dP.

Since the family {X"} is uniformly integrable and since, by Theorem 5, §6, Chapter II, we have EI A IX m - X oo I ~ 0 as m ~ oo, it follows that (8)

U:'=

This equation is satisfied for all A E %,. and therefore for all A E 1 %,. . Since E IX oo I < oo and E I~ I < oo, the left-hand and right-hand sides of (8) are a-additive measures; possibly taking negative as well as positive values, but finite and agreeing on the algebra U~; 1 ~·Because of the uniqueness of the extension of a a-additive measure to an algebra over the smallest aalgebra containing it (Caratheodory's theorem, §3, Chapter II, equation (8) remains valid for sets A E F oo = a(UFn). Thus, L XoodP = L

~dP =

L

E(~l$'00 )dP,

AE$'00 •

(9)

Since X oo and E(~ I$'00 ) are $'00 -measurable, it follows from Property I of Subsection 2, §6, Chapter II, and from (9), that X 00 = E(~l$'00 ) (P-a.s.). This completes the proof of the theorem. Corollary. A stochastic sequence X = (Xn, ~) is a uniformly integrable martingale if and only if there is a random variable ~ with EI~ I < oo such that X"= E(~IFn)for all n ~ 1. Here Xn ~ E(~IFoo) (both P-a.s. and in the U sense) as n ~ oo.

In fact, if X = (X",~) is a uniformly integrable martingale, then by Theorem 2 there is an integrable random variable X oo such that X" ~ X oo (P-a.s. and in the L 1 sense) and Xn = E(X oo IFn). As the random variable~ we may take the$"' 00 -measurable variable X oo. The converse follows from Theorem 3.

4. We now turn to some applications of these theorems. 1. The" zero or one" law. Let ~ 1 , ~ 2 , ... be a sequence of independent random variables,$"'~= a{w: ~ 1 , ... , ~"}and let X be the a-algebra of the "tail" events. By Theorem 3, we have E(IAIF~) ~ E(IAIF~) = IA (P-a.s.). But ]A and en) are independent. Since E(JA IF~)= EIA and therefore IA = EIA (P-a.s.), we find that either P(A) = 0 or P(A) = 1. ExAMPLE

eel, ... '

480

VII. Sequences of Random Variables That Form Martingales

2. Let ~ 1 , ~ 2 , ••. be a sequence of random variables. According to Theorem 2 of §10, Chapter II, the convergence of the series L~n implies its convergence in probability and in distribution. It turns out that if we also assume that ~ 1 , ~ 2 , •.• are independent, there is also a converse: the con-

ExAMPLE

vergence ofL~n in distribution implies its convergence in probability and with probability 1.

Thus for a series L~n whose terms are independent random variables, all three kinds of convergence (with probability 1, in probability, or in distribution) turn out to be equivalent (compare Problem 3 of §2, Chapter IV). For the proof we must evidently show that if Sn = ~ 1 + · · · + ~n, n ~ 1, then the convergence of in distribution, as n--? oo, implies its convergence with probability 1. A simple and elegant proof of this can be based on Theorem 1. In fact, let Sn !!... S. Then, for every real t, we have Eeits. --? Eeits. There is clearly a b > 0 such that IEeirs I > 0 for all It I < i5. Take It 0 I < b; then there is also an n0 = n0 (t0 ) such that jEeitoS"I ~ c > 0 for all n > n0 , where c is a constant. For n ~ n0 , form the sequence X= (Xn, ~)with

sn

Since the variables ~ 1 , ~ 2 , .•• are assumed independent, the stochastic sequence X= (Xn, ~)is a martingale and supEIXnl::::;; c- 1
Hence it follows from Theorem 1 that, with probability 1, the limit limnXn exists and is finite, and therefore the limit lim eitos. exists with probability 1. Consequently we may say that there is a b > 0 such that for every t in the set T = {t: It I < b} the limit limn e; 1s" exists P-a.s. Write T X Q = {(t, w): t E T, wE Q}, let aJ(T) be the a-algebra of Lebesgue sets on T, let A. denote Lebesgue measure on (T, Rl(T)), and let

c = {(t, w) E T X n: lim eitS.(ro) exists}. It is clear that C E ~(T) ® :F. According to what was established above, we have P(C1) = 1 for each t E T, where C1 = {wEn: (t, w) E C} is the cross-section of C at t. Then, by Fubini's theorem (Theorem 8, §6, Chapter II)

ixo lc(t, w)d(A.

X

P) =

i( 1

/c(t, w)dP) dA.

= { P(C1) dA. = A.(T) = 2b > 0.

§4. General Theorems on the Convergence of Submartingales and Martingales

481

On the other hand, again by Fubini's theorem,

Jc(T)=

Lx

0

Ic(t,w)d(Jc

X

P)= IndP(f/c(t,w)d,l)= L,l(Cw)dP,

where Cw = {t: (t, w) E C}. Hence it follows that there is a set 0 with P(0) = 1 such that

Jc(Cw) = Jc(T) = 2c5 > 0 for all w EO. Hence we can say that, for each w EO, the limit limneitSn(w) exists for all t E C"'; moreover, the Lebesgue measure of Cw is positive. From this and Problem 8 it follows that limnSnCw) exists and is finite for w EO; since P(O) = 1, this also holds P-a.s. The next two examples illustrate possible applications of the preceding results to convergence theorems in analysis. 3. Iff= f(x) satisfies a Lipschitz condition on [0, 1), it is absolutely continuous and, as is shown in courses in analysis, there is a (Lebesgue) integrable function g = g(x) such that

EXAMPLE

f(x) - f(O) =

f

g(y) dy.

(10)

(In this sense, g(x) is a" derivative" of f(x).) Let us show how this result can be deduced from Theorem 1. Let Q = [0, 1), !#'" = .16([0, 1)), and let P denote Lebesgue measure. Put 2

"k-1 {k-1 -----y- : : ; < 2k}

.;'n(x) = k~1 -----y- I

:Fn = u{x:

~1,

... ,

~n} =

u{x:

~n},

Xn =J(~n

X

n

,

and

+ 2-")- f(~.). 2-n

Since for a given ~" the random variable ~n+ 1 takes only the values ~.and ~. + 2-
= E[Xn+11~nJ = 2"+ 1 E[f(~n+1 + r<•+ 1l)- f(~n+1)1~.]

= 2n+l{t[f(~n

+ 2-(n+l))- J(~n)J + t[f(~n + 2-")- J(~n + 2-n+1))]}

= 2"{f(~n + 2-")

- /(~.)}

= X •.

It follows that X = (X.,~) is a martingale, and it is uniformly integrable since IXnl::::;; L, where Lis the Lipschitz constant: lf(x)- f(y)l::::;; Llx- yj. Observe that !#'" = .16([0, 1)) = u(U ff,). Therefore, by the corollary to

482

VII. Sequences of Random Variables That Form Martingales

Theorem 3, there is an (P-a.s.) and

~-measurable

function g = g(x) such that Xn---. g (11)

Xn = E[gl~]. Consider the set B = [0, kf2n]. Then by (11)

(k)

f 2n

f(O) =

-

t/2"

Jo

L

k/2"

Xndx =

g(x) dx,

and since nand k are arbitrary, we obtain the required equation (10). EXAMPLE 4. Let n = [0, 1), ~ = gjJ([O, 1)) and let p denote Lebesgue measure. Consider the Haar system {Hn(x)}n:
E[f(x)l~nJ =

L akHk(x)

(P-a.s.),

(12)

k=l

for every Borel function! E L, where ak

= (J, Hk) =

f

f(x)Hk(x) dx.

In other words, the conditional expectation E[f (x) Iff,.] is a partial sum of the Fourier series of f(x) in the Haar system. Then if we apply Theorem 3 to the martingale we find that, as n ---. CJJ, n

I

k=l

(J, H")H"(x)---. f(x)

(P-a.s.)

and

5. PROBLEMS 1. Let{~.} be a nonincreasing family of a-algebras, ~ 1 2 ~ 2 2 .. ·~ 00 = n(§'., and let '7 be an integrable random variable. Establish the following analog of Theorem 3: asn-+oo, E(I'/IG.)-+ E(I'/IG 00 ) (P-a.s. and in the L 1 sense).

el>

2. Let ~2• .•• be a sequence of independent identically distributed random variables withE 1~ 1 1
e

E(~d S"'

s.+ 1, ...) =

s.

E(~ 1 IS.) = -

n

(P-a.s.),

deduce from Problem 1 a stronger form of the law of large numbers: as n -+ oo,

s.

1

- -+ m (P-a.s. and in the L sense). n

483

§5. Sets of Convergence of Submartingales and Martingales

3. Establish the following result, which combines Lebesgue's dominated convergence theorem and P. Levy's theorem. Let {~.}., 1 be a sequence of random variables such that~.-+~ (P-a.s.), 1~.1::;; I'J, E11 < oo and {~ mlm 21 is a nondecreasing family of ualgebras, with ~oo = u(U ~).Then lim E(~.l~m) = E(~l~oo) (P-a.s.). 4. Establish formula (12). 5. LeHl = [0, 1], ~ = aJ([O, 1)), let P denote Lebesgue measure, and letf = f(x) E L 1. Put

f

(k+ 1)2

f.(x)

=

2"

-n

f(y) dy,

kl-•

Show that f.(x) -+ f(x) (P-a.s.). 6. Let Q = [0, 1), ~ = at([O, 1)), let P denote Lebesgue measure and letf Continue this function periodically on [0, 2) and put

= f(x) EL 1•

2" i=l

Show that j~(x)-+ f(x) (P-a.s.). 7. Prove that Theorem 1 remains valid for generalized submartingales X= ifinfm sup. 2m E(X:I~m) < 00 (P-a.s.).

(X.,~),

8. Let a., n ;;::: 1, be a sequence of real numbers such that for all real numbers t with It I < b, b > 0, the limit lim. eira. exists. Prove that then the limit lim a. exists and is finite.

§5. Sets of Convergence of Submartingales and Martingales 1. Let X= (Xn, ~)be a stochastic sequence. Let us denote by {Xn-.. }, or {- oo
{X"-..} =

n

(P-a.s.).

Let us consider the structure of sets {X" -..} of convergence for submartingales when the hypothesis sup EIX" I < oo is not satisfied. Let a > 0, and Ta = inf{n ~ 1: Xn > a} with Ta = oo if { ·} = 0.

Definition. A stochastic sequence X= (Xn, ~)belongs to class c+ (X E c+) if E(L\X,J+I{ra < oo}
for every a> 0, where L\Xn = Xn- Xn-l• X 0 = 0.

(1)

484

VII. Sequences of Random Variables That Form Martingales

It is evident that X

E

c+ if (2) n

or, all the more so, if lilXnl ~ C
~

(P-a.s.),

(3)

1.

Theorem 1. If the submartingale X

E

c+ then

{supXn < oo} = {Xn-+}

(P-a.s.).

(4)

PROOF. The inclusion {Xn--+} ~ {sup Xn < oo} is evident. To establish the inclusion in the opposite direction, we consider the stopped submartingale X'• = (X'•""' ~).Then, by (1),

sup EX~""~ n

~

a+ E[X~ ·I{ra
(5)

and therefore by Theorem 1 of §4, {ra = oo} ~ {Xn --+}

(P-a.s.).

But Ua>o{ra = oo} = {sup Xn < oo}; hence {sup Xn < oo} ~ {Xn--+} (P-a.s.}. This completes the proof of the theorem.

Corollary. Let $X$ be a martingale with $E \sup_n |\Delta X_n| < \infty$. Then (P-a.s.)

$$\{X_n \to\} \cup \{\varliminf X_n = -\infty,\ \varlimsup X_n = +\infty\} = \Omega. \tag{6}$$

In fact, if we apply Theorem 1 to $X$ and to $-X$, we find that (P-a.s.)

$$\{\varlimsup X_n < \infty\} = \{\sup X_n < \infty\} = \{X_n \to\}, \qquad \{\varliminf X_n > -\infty\} = \{\inf X_n > -\infty\} = \{X_n \to\}.$$

Therefore (P-a.s.) $\{\varlimsup X_n < \infty\} \cup \{\varliminf X_n > -\infty\} = \{X_n \to\}$, which establishes (6).

Statement (6) means that, provided $E \sup |\Delta X_n| < \infty$, either almost all trajectories of the martingale $X$ have finite limits, or else all behave very badly, in the sense that $\varlimsup X_n = +\infty$ and $\varliminf X_n = -\infty$.

2. If $\xi_1, \xi_2, \ldots$ is a sequence of independent random variables with $E\xi_i = 0$ and $|\xi_i| \le c < \infty$, then by Theorem 1 of §2, Chapter IV, the series $\sum \xi_i$ converges (P-a.s.) if and only if $\sum E\xi_i^2 < \infty$. The sequence $X = (X_n, \mathscr{F}_n)$ with

$$X_n = \xi_1 + \cdots + \xi_n, \qquad \mathscr{F}_n = \sigma\{\omega\colon \xi_1, \ldots, \xi_n\},$$

is a square-integrable martingale with $\langle X\rangle_n = \sum_{i=1}^n E\xi_i^2$, and the proposition just stated can be interpreted as follows:

$$\{\langle X\rangle_\infty < \infty\} = \{X_n \to\} = \Omega \quad (P\text{-a.s.}),$$

where $\langle X\rangle_\infty = \lim_n \langle X\rangle_n$. The following proposition generalizes this result to more general martingales and submartingales.

Theorem 2. Let $X = (X_n, \mathscr{F}_n)$ be a submartingale and

$$X_n = m_n + A_n$$

its Doob decomposition.

(a) If $X$ is a nonnegative submartingale, then (P-a.s.)

$$\{A_\infty < \infty\} \subseteq \{X_n \to\} \subseteq \{\sup X_n < \infty\}. \tag{7}$$

(b) If $X \in C^+$, then (P-a.s.)

$$\{X_n \to\} = \{\sup X_n < \infty\} \subseteq \{A_\infty < \infty\}. \tag{8}$$

(c) If $X$ is a nonnegative submartingale and $X \in C^+$, then (P-a.s.)

$$\{X_n \to\} = \{\sup X_n < \infty\} = \{A_\infty < \infty\}. \tag{9}$$

PROOF. (a) The second inclusion in (7) is obvious. To establish the first inclusion we introduce the times

$$\sigma_a = \inf\{n \ge 1\colon A_{n+1} > a\}, \quad a > 0,$$

taking $\sigma_a = +\infty$ if $\{\cdot\} = \varnothing$. Then $A_{\sigma_a} \le a$, and by Corollary 1 to Theorem 1 of §2 the stopped sequence $Y^a = (Y_n^a, \mathscr{F}_n)$ with $Y_n^a = X_{\sigma_a \wedge n}$ is a submartingale, with

$$\sup_n EY_n^a = \sup_n [Em_{\sigma_a \wedge n} + EA_{\sigma_a \wedge n}] \le EX_1 + a < \infty.$$

Since this submartingale is nonnegative, it follows from Theorem 1 of §4 that $Y_n^a$ converges (P-a.s.); and on the set $\{\sigma_a = \infty\} = \{A_\infty \le a\}$ we have $Y_n^a = X_n$ for all $n$. Therefore (P-a.s.)

$$\{A_\infty < \infty\} = \bigcup_{a>0} \{A_\infty \le a\} \subseteq \{X_n \to\}.$$

(b) The first equation follows from Theorem 1. To prove the second, we notice that, since $Em_{\tau_a \wedge n} = Em_1 = EX_1$, in accordance with (5)

$$EA_{\tau_a \wedge n} = EX_{\tau_a \wedge n} - EX_1 \le EX^+_{\tau_a \wedge n} - EX_1 \le a + E[(\Delta X_{\tau_a})^+ I\{\tau_a < \infty\}] - EX_1 < \infty,$$

and therefore $EA_{\tau_a} = E\lim_n A_{\tau_a \wedge n} < \infty$. Hence $\{\tau_a = \infty\} \subseteq \{A_\infty < \infty\}$ (P-a.s.), and therefore

$$\{\sup X_n < \infty\} = \bigcup_{a>0}\{\tau_a = \infty\} \subseteq \{A_\infty < \infty\}.$$

Conclusion (c) follows immediately from (a) and (b). This completes the proof of the theorem.
Remark. The hypothesis that $X$ is nonnegative can be replaced by the hypothesis $\sup_n EX_n^- < \infty$.

Corollary 1. Let $X_n = \xi_1 + \cdots + \xi_n$, where $\xi_i \ge 0$, $E\xi_i < \infty$, the $\xi_i$ are $\mathscr{F}_i$-measurable, and $\mathscr{F}_0 = \{\varnothing, \Omega\}$. Then (P-a.s.)

$$\Bigl\{\sum_{n=1}^\infty E(\xi_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} \subseteq \Bigl\{\sum_{n=1}^\infty \xi_n < \infty\Bigr\}, \tag{10}$$

and if, in addition, $E\sup_n \xi_n < \infty$, then (P-a.s.)

$$\Bigl\{\sum_{n=1}^\infty E(\xi_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \Bigl\{\sum_{n=1}^\infty \xi_n < \infty\Bigr\}. \tag{11}$$

Corollary 2 (Borel–Cantelli–Lévy Lemma). If the events $B_n \in \mathscr{F}_n$, then, putting $\xi_n = I_{B_n}$ in (11), we find that

$$\Bigl\{\sum_{n=1}^\infty P(B_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \Bigl\{\sum_{n=1}^\infty I_{B_n} < \infty\Bigr\}. \tag{12}$$

3.

Theorem 3. Let $M = (M_n, \mathscr{F}_n)_{n \ge 1}$ be a square-integrable martingale. Then (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} \subseteq \{M_n \to\}. \tag{13}$$

If also $E\sup_n |\Delta M_n|^2 < \infty$, then (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} = \{M_n \to\}, \tag{14}$$

where

$$\langle M\rangle_\infty = \sum_{n=1}^\infty E((\Delta M_n)^2 \mid \mathscr{F}_{n-1}) \tag{15}$$

with $M_0 = 0$, $\mathscr{F}_0 = \{\varnothing, \Omega\}$.

PROOF. Consider the two submartingales $M^2 = (M_n^2, \mathscr{F}_n)$ and $(M+1)^2 = ((M_n+1)^2, \mathscr{F}_n)$. Let their Doob decompositions be

$$M_n^2 = m_n' + A_n', \qquad (M_n + 1)^2 = m_n'' + A_n''.$$

Then $A_n'$ and $A_n''$ are the same, since

$$A_n' = \sum_{k=1}^n E(\Delta M_k^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E((\Delta M_k)^2 \mid \mathscr{F}_{k-1})$$

and

$$A_n'' = \sum_{k=1}^n E(\Delta(M_k+1)^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E(\Delta M_k^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E((\Delta M_k)^2 \mid \mathscr{F}_{k-1}).$$

Hence (7) implies that (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} \subseteq \{M_n^2 \to\} \cap \{(M_n+1)^2 \to\} = \{M_n \to\}.$$

Because of (9), equation (14) will be established if we show that the condition $E\sup_n |\Delta M_n|^2 < \infty$ guarantees that $M^2$ belongs to $C^+$. Let $\tau_a = \inf\{n \ge 1\colon M_n^2 > a\}$, $a > 0$. Then, on the set $\{\tau_a < \infty\}$,

$$|\Delta M_{\tau_a}^2| = |M_{\tau_a}^2 - M_{\tau_a-1}^2| \le |M_{\tau_a} - M_{\tau_a-1}|^2 + 2|M_{\tau_a-1}|\cdot|M_{\tau_a} - M_{\tau_a-1}| \le (\Delta M_{\tau_a})^2 + 2a^{1/2}|\Delta M_{\tau_a}|,$$

whence

$$E|\Delta M_{\tau_a}^2|\, I\{\tau_a < \infty\} \le E\sup_n (\Delta M_n)^2 + 2a^{1/2} E\sup_n |\Delta M_n| < \infty,$$

i.e. $M^2 \in C^+$. This completes the proof.
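The content of (13) and (14) is easy to observe numerically. The following sketch (not from the book; the coefficients $a_k = 1/k$ are an illustrative assumption) simulates a square-integrable martingale $M_n = \sum_{k \le n} a_k \varepsilon_k$ with independent standard Gaussian $\varepsilon_k$, for which $\langle M\rangle_\infty = \sum_k a_k^2 < \infty$, and checks that the trajectory settles down to a finite limit:

```python
import numpy as np

# Illustrative sketch (not from the book): a square-integrable martingale
# M_n = sum_{k<=n} a_k * eps_k with independent eps_k ~ N(0,1).  Its quadratic
# variation is <M>_n = sum_{k<=n} a_k^2, so with a_k = 1/k we have
# <M>_inf = pi^2/6 < infinity, and by (13)-(14) M_n converges a.s.
rng = np.random.default_rng(1)
N = 100_000
a = 1.0 / np.arange(1, N + 1)
M = np.cumsum(a * rng.standard_normal(N))

# oscillation of the trajectory over its second half: small if M_n converges
tail_osc = M[N // 2:].max() - M[N // 2:].min()
print(tail_osc)
```

Replacing `a` by a sequence with divergent $\sum a_k^2$ (e.g. $a_k = 1/\sqrt{k}$) destroys the convergence, in agreement with (14).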
Theorem 4. Let $M = (M_n, \mathscr{F}_n)$ be a square-integrable martingale and let $A = (A_n, \mathscr{F}_{n-1})$ be a predictable increasing sequence with $A_1 \ge 1$, $A_\infty = \infty$ (P-a.s.). If (P-a.s.)

$$\sum_{n=1}^\infty \frac{E((\Delta M_n)^2 \mid \mathscr{F}_{n-1})}{A_n^2} < \infty, \tag{16}$$

then

$$\frac{M_n}{A_n} \to 0, \quad n \to \infty, \tag{17}$$

with probability 1. In particular, if $\langle M\rangle = (\langle M\rangle_n, \mathscr{F}_{n-1})$ is the quadratic variation of the square-integrable martingale $M = (M_n, \mathscr{F}_n)$ and $\langle M\rangle_\infty = \infty$ (P-a.s.), then with probability 1

$$\frac{M_n}{\langle M\rangle_n} \to 0, \quad n \to \infty. \tag{18}$$

PROOF. Consider the square-integrable martingale $m = (m_n, \mathscr{F}_n)$ with

$$m_n = \sum_{i=1}^n \frac{\Delta M_i}{A_i}.$$

Then

$$\langle m\rangle_\infty = \sum_{i=1}^\infty \frac{E((\Delta M_i)^2 \mid \mathscr{F}_{i-1})}{A_i^2}. \tag{19}$$

Since

$$\frac{M_n}{A_n} = \frac{\sum_{k=1}^n A_k \Delta m_k}{A_n},$$

we have, by Kronecker's lemma (§3, Chapter IV), $M_n/A_n \to 0$ (P-a.s.) if the limit $\lim_n m_n$ exists (finite) with probability 1. By (13),

$$\{\langle m\rangle_\infty < \infty\} \subseteq \{m_n \to\}. \tag{20}$$

Therefore it follows from (19) that (16) is a sufficient condition for (17). If now $A_n = \langle M\rangle_n$, then (16) is automatically satisfied (see Problem 6), and consequently we have $M_n/\langle M\rangle_n \to 0$ (P-a.s.). This completes the proof of the theorem.

EXAMPLE. Consider a sequence $\xi_1, \xi_2, \ldots$ of independent random variables with $E\xi_i = 0$, $V\xi_i = V_i > 0$, and let the sequence $X = (X_n)_{n\ge 0}$ be defined recursively by

$$X_{n+1} = \theta X_n + \xi_{n+1}, \tag{21}$$

where $X_0$ is independent of $\xi_1, \xi_2, \ldots$ and $\theta$ is an unknown parameter, $-\infty < \theta < \infty$.

We interpret $X_n$ as the result of an observation made at time $n$ and ask for an estimator of the unknown parameter $\theta$. As an estimator of $\theta$ in terms of $X_0, X_1, \ldots, X_n$, we take

$$\hat\theta_n = \frac{\sum_{k=1}^n X_{k-1}X_k/V_k}{\sum_{k=1}^n X_{k-1}^2/V_k}, \tag{22}$$

taking this to be 0 if the denominator is 0. (The number $\hat\theta_n$ is the least-squares estimator.) It is clear from (21) and (22) that

$$\hat\theta_n = \theta + \frac{M_n}{\langle M\rangle_n},$$

where

$$M_n = \sum_{k=1}^n \frac{X_{k-1}\xi_k}{V_k}, \qquad \langle M\rangle_n = \sum_{k=1}^n \frac{X_{k-1}^2}{V_k}.$$

Therefore if the true value of the unknown parameter is $\theta$, then

$$P(\hat\theta_n \to \theta) = 1 \tag{23}$$

when (P-a.s.)

$$\frac{M_n}{\langle M\rangle_n} \to 0, \quad n \to \infty. \tag{24}$$

(An estimator with property (23) is said to be strongly consistent; compare the notion of consistency in §7, Chapter I.)

Let us show that the conditions

$$\sup_n \frac{V_{n+1}}{V_n} < \infty, \qquad \sum_{n=1}^\infty E\Bigl(\frac{\xi_n^2}{V_n} \wedge 1\Bigr) = \infty \tag{25}$$

are sufficient for (24), and therefore sufficient for (23). We have $\xi_k = X_k - \theta X_{k-1}$, so that $\xi_k^2 \le 2X_k^2 + 2\theta^2 X_{k-1}^2$; together with the first condition in (25) this shows that the divergence of $\sum \xi_n^2/V_n$ entails $\langle M\rangle_\infty = \infty$. Therefore

$$\Bigl\{\sum_{n=1}^\infty \Bigl(\frac{\xi_n^2}{V_n} \wedge 1\Bigr) = \infty\Bigr\} \subseteq \{\langle M\rangle_\infty = \infty\}.$$

By the three-series theorem (Theorem 3 of §2, Chapter IV) the divergence of $\sum_{n=1}^\infty E((\xi_n^2/V_n) \wedge 1)$ guarantees the divergence (P-a.s.) of $\sum_{n=1}^\infty ((\xi_n^2/V_n) \wedge 1)$. Therefore $P\{\langle M\rangle_\infty = \infty\} = 1$. Moreover, if

$$m_n = \sum_{i=1}^n \frac{\Delta M_i}{\langle M\rangle_i},$$

then

$$\langle m\rangle_n = \sum_{i=1}^n \frac{\Delta\langle M\rangle_i}{\langle M\rangle_i^2},$$

and (see Problem 6) $P(\langle m\rangle_\infty < \infty) = 1$. Hence (24) follows directly from Theorem 4.

(In Subsection 5 of the next section we continue the discussion of this example for Gaussian variables $\xi_1, \xi_2, \ldots$.)
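The strong consistency (23) is easy to observe numerically. The following sketch (not from the book; the values $\theta = 0.5$ and $V_i \equiv 1$ are illustrative assumptions under which conditions (25) obviously hold) simulates the recursion (21) and the least-squares estimator (22):

```python
import numpy as np

# Illustrative sketch (not from the book): strong consistency of the
# least-squares estimator (22) for X_{n+1} = theta*X_n + xi_{n+1}.
# Assumed for illustration: theta = 0.5 and V_i = 1, so (25) holds.
rng = np.random.default_rng(0)
theta, N = 0.5, 200_000
xi = rng.standard_normal(N)

X = np.empty(N + 1)
X[0] = rng.standard_normal()            # X_0, independent of xi_1, xi_2, ...
for n in range(N):
    X[n + 1] = theta * X[n] + xi[n]     # recursion (21)

# estimator (22) with V_k = 1 (set to 0 if the denominator vanishes)
den = np.sum(X[:-1] ** 2)
theta_hat = np.sum(X[:-1] * X[1:]) / den if den > 0 else 0.0
print(abs(theta_hat - theta))           # small for large N
```

Here the error is $M_N/\langle M\rangle_N$ in the notation above, and it shrinks as $N$ grows, in accordance with Theorem 4.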


Theorem 5. Let $X = (X_n, \mathscr{F}_n)$ be a submartingale, and let

$$X_n = m_n + A_n$$

be its Doob decomposition. If $|\Delta X_n| \le C$, then (P-a.s.)

$$\{X_n \to\} = \{A_\infty + \langle m\rangle_\infty < \infty\}, \tag{26}$$

or, equivalently,

$$\{X_n \to\} = \Bigl\{\sum_{k=1}^\infty \bigl[E(\Delta X_k \mid \mathscr{F}_{k-1}) + E((\Delta X_k)^2 \mid \mathscr{F}_{k-1})\bigr] < \infty\Bigr\}. \tag{27}$$

PROOF. Since

$$A_n = \sum_{k=1}^n E(\Delta X_k \mid \mathscr{F}_{k-1}) \tag{28}$$

and

$$m_n = \sum_{k=1}^n [\Delta X_k - E(\Delta X_k \mid \mathscr{F}_{k-1})], \tag{29}$$

it follows from the assumption $|\Delta X_k| \le C$ that the martingale $m = (m_n, \mathscr{F}_n)$ is square-integrable with $|\Delta m_n| \le 2C$. Then by (13)

$$\{\langle m\rangle_\infty < \infty\} \subseteq \{m_n \to\}, \tag{30}$$

and according to (8), $\{X_n \to\} \subseteq \{A_\infty < \infty\}$. Therefore, by (14) and (20),

$$\{X_n \to\} = \{X_n \to\} \cap \{A_\infty < \infty\} = \{X_n \to\} \cap \{A_\infty < \infty\} \cap \{m_n \to\} = \{X_n \to\} \cap \{A_\infty + \langle m\rangle_\infty < \infty\} = \{A_\infty + \langle m\rangle_\infty < \infty\}.$$

Finally, the equivalence of (26) and (27) follows because, by (29),

$$\langle m\rangle_n = \sum_{k=1}^n \bigl\{E[(\Delta X_k)^2 \mid \mathscr{F}_{k-1}] - [E(\Delta X_k \mid \mathscr{F}_{k-1})]^2\bigr\},$$

and the convergence of the series $\sum_{k=1}^\infty E(\Delta X_k \mid \mathscr{F}_{k-1})$ of nonnegative terms implies the convergence of $\sum_{k=1}^\infty [E(\Delta X_k \mid \mathscr{F}_{k-1})]^2$. This completes the proof.

4. Kolmogorov's three-series theorem (Theorem 3 of §2, Chapter IV) gives a necessary and sufficient condition for the convergence, with probability 1, of a series $\sum \xi_n$ of independent random variables. The following theorem, whose proof is based on Theorems 2 and 3, describes sets of convergence of $\sum \xi_n$ without the assumption that the random variables $\xi_1, \xi_2, \ldots$ are independent.

Theorem 6. Let $\xi = (\xi_n, \mathscr{F}_n)$, $n \ge 1$, be a stochastic sequence, let $\mathscr{F}_0 = \{\varnothing, \Omega\}$, and let $c$ be a positive constant. Then the series $\sum \xi_n$ converges on the set $A$ of sample points for which the three series

$$\sum P(|\xi_n| \ge c \mid \mathscr{F}_{n-1}), \qquad \sum E(\xi_n^c \mid \mathscr{F}_{n-1}), \qquad \sum V(\xi_n^c \mid \mathscr{F}_{n-1})$$

converge, where $\xi_n^c = \xi_n I(|\xi_n| \le c)$.

PROOF. Let $X_n = \sum_{k=1}^n \xi_k$. Since the series $\sum P(|\xi_n| \ge c \mid \mathscr{F}_{n-1})$ converges on $A$, by Corollary 2 of Theorem 2, and by the convergence of the series $\sum E(\xi_n^c \mid \mathscr{F}_{n-1})$, we have

$$A \cap \{X_n \to\} = A \cap \Bigl\{\sum_{k=1}^n \xi_k I(|\xi_k| \le c) \to\Bigr\} = A \cap \Bigl\{\sum_{k=1}^n [\xi_k I(|\xi_k| \le c) - E(\xi_k I(|\xi_k| \le c) \mid \mathscr{F}_{k-1})] \to\Bigr\}. \tag{31}$$

Let $\eta_k = \xi_k I(|\xi_k| \le c) - E(\xi_k I(|\xi_k| \le c) \mid \mathscr{F}_{k-1})$ and let $Y_n = \sum_{k=1}^n \eta_k$. Then $Y = (Y_n, \mathscr{F}_n)$ is a square-integrable martingale with $|\eta_k| \le 2c$. By Theorem 3, we have

$$A \subseteq \Bigl\{\sum V(\xi_n^c \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \{\langle Y\rangle_\infty < \infty\} \subseteq \{Y_n \to\}. \tag{32}$$

Then it follows from (31) that $A \cap \{X_n \to\} = A$, and therefore $A \subseteq \{X_n \to\}$; this completes the proof.

PROBLEMS

1. Show that if a submartingale $X = (X_n, \mathscr{F}_n)$ satisfies $E\sup_n |X_n| < \infty$, then it belongs to class $C^+$.

2. Show that Theorems 1 and 2 remain valid for generalized submartingales.

3. Show that generalized submartingales satisfy (P-a.s.) the inclusion

4. Show that the corollary to Theorem 1 remains valid for generalized martingales.

5. Show that every generalized submartingale of class $C^+$ is a local submartingale.

6. Let $a_n > 0$, $n \ge 1$, and let $b_n = \sum_{k=1}^n a_k$. Show that

$$\sum_{n=1}^\infty \frac{a_n}{b_n^2} < \infty.$$

§6. Absolute Continuity and Singularity of Probability Distributions

1. Let $(\Omega, \mathscr{F})$ be a measurable space on which there is defined a family $(\mathscr{F}_n)_{n\ge 1}$ of $\sigma$-algebras such that $\mathscr{F}_1 \subseteq \mathscr{F}_2 \subseteq \cdots \subseteq \mathscr{F}$ and

$$\mathscr{F} = \sigma\Bigl(\bigcup_{n=1}^\infty \mathscr{F}_n\Bigr). \tag{1}$$

Let us suppose that two probability measures $P$ and $\tilde P$ are given on $(\Omega, \mathscr{F})$. Let us write

$$P_n = P \mid \mathscr{F}_n, \qquad \tilde P_n = \tilde P \mid \mathscr{F}_n$$

for the restrictions of these measures to $\mathscr{F}_n$; i.e., let $P_n$ and $\tilde P_n$ be measures on $(\Omega, \mathscr{F}_n)$, and for $B \in \mathscr{F}_n$ let $P_n(B) = P(B)$, $\tilde P_n(B) = \tilde P(B)$.

Definition 1. The probability measure $\tilde P$ is absolutely continuous with respect to $P$ (notation, $\tilde P \ll P$) if $\tilde P(A) = 0$ whenever $P(A) = 0$, $A \in \mathscr{F}$. When $\tilde P \ll P$ and $P \ll \tilde P$, the measures $P$ and $\tilde P$ are equivalent (notation, $\tilde P \sim P$). The measures $P$ and $\tilde P$ are singular (or orthogonal) if there is a set $A \in \mathscr{F}$ such that $\tilde P(A) = 1$ and $P(\bar A) = 1$ (notation, $\tilde P \perp P$).

Definition 2. We say that $\tilde P$ is locally absolutely continuous with respect to $P$ (notation, $\tilde P \overset{\mathrm{loc}}{\ll} P$) if

$$\tilde P_n \ll P_n \tag{2}$$

for every $n \ge 1$.

The fundamental question that we shall consider in this section is the determination of conditions under which local absolute continuity $\tilde P \overset{\mathrm{loc}}{\ll} P$ implies one of the properties $\tilde P \ll P$, $\tilde P \sim P$, $\tilde P \perp P$. It will become clear that martingale theory is the mathematical apparatus that lets us give definitive answers to these questions.

Let us then suppose that $\tilde P \overset{\mathrm{loc}}{\ll} P$. We write

$$z_n = \frac{d\tilde P_n}{dP_n},$$

the Radon–Nikodým derivative of $\tilde P_n$ with respect to $P_n$. It is clear that $z_n$ is $\mathscr{F}_n$-measurable; and if $A \in \mathscr{F}_n$, then

$$\int_A z_{n+1}\,dP = \int_A \frac{d\tilde P_{n+1}}{dP_{n+1}}\,dP_{n+1} = \tilde P_{n+1}(A) = \tilde P_n(A) = \int_A \frac{d\tilde P_n}{dP_n}\,dP_n = \int_A z_n\,dP.$$

It follows that, with respect to $P$, the stochastic sequence $Z = (z_n, \mathscr{F}_n)_{n\ge 1}$ is a martingale.

Write

$$z_\infty = \varlimsup_n z_n.$$

Since $Ez_n = 1$, it follows from Theorem 1 of §4 that $\lim z_n$ exists P-a.s., and therefore $P(z_\infty = \lim z_n) = 1$. (In the course of the proof of Theorem 1 it will be established that $\lim z_n$ exists also for $\tilde P$, so that $\tilde P(z_\infty = \lim z_n) = 1$.)

The key to problems on absolute continuity and singularity is Lebesgue's decomposition:

Theorem 1. Let $\tilde P \overset{\mathrm{loc}}{\ll} P$. Then for every $A \in \mathscr{F}$,

$$\tilde P(A) = \int_A z_\infty\,dP + \tilde P\{A \cap (z_\infty = \infty)\}, \tag{3}$$

and the measures $\mu(A) = \tilde P\{A \cap (z_\infty = \infty)\}$ and $P(A)$, $A \in \mathscr{F}$, are singular.

PRooF. Let us notice first that the classical Lebesgue decomposition shows that if P and P are two measures, there are unique measures A. and ,u such that P = A. + ,u, where A. « P and ,u l. P. Conclusion (3) can be thought of as a specialization of this decomposition under the assumption that P, « P,,n;;::l. Let us introduce the probability measures

0 = !(P + P),

n;;:: 1,

and the notation

dP

3 = dO '

dP,

dP 3 = dO '

dP,

3, = dO, '

3n

= dO, •

Since 1!'(3 = 0) = P(3 = 0) = 0, we have 0(3 = 0, 3 = 0) = 0. Consequently the product 3 · 3- 1 can be defined consistently on the set !l\ {3 = 0, 3 = 0}; we define it to be zero on the set {3 = 0, 3 = 0}. Since P, « P, « 0,, we have (see (11.7.36))

dP,

dP, dP, (O

dO = dP · dO n

n

n

-a.s.

)

(4)

i.e.

3, =

z,3,

(0-a.s.)

(5)

whence

3, · 3,;- 1 (0-a.s.) where, as before, we take 3, · 3,;- 1 = 0 on the set {3, = 0, 3, = 0}, which is of z,

0-measure zero.

=

494

VII. Sequences of Random Variables That Form Martingales

Each of the sequences (3n, .?,;) and (3", .?,;) is (with respect to 0) a uniformly integrable martingale and consequently the limits lim 3n and lim 3n exist. Moreover (0-a.s.) lim 3n

= §,

= 3·

lim 3n

(6)

From this and the equations z" = 3n3; 1 (0-a.s.) and 0(3 = 0, 3 = 0) = 0, it follows that (0-a.s.) the limit lim z" = Z 00 exists and is equal to 3 · 3- 1 . It is clear that P « 0 and P « 0. Therefore lim z" exists both with respect to P and with respect to P. Now let

A( A)

L

=

Z 00

,u(A)

dP,

= P{A n (zoo = oo)}.

To establish (3), we must show that

P(A)

=

A(A)

+ .u(A),

We have

P(A)

L3 L33~ +L = L3~ dO

=

=

dP

dO

+

A « P,

L

3[1 -

[1 - 3!J dP

=

.u l. P.

3~] dO

Lz

00

dP

+ P{A n

(3

= 0)}, (7)

where the last equation follows from

Furthermore,

P{A n (3

= 0)} = P{A n (3 = 0) n (3 > 0)} = P{A n (3 ·r 1 = oo)} = P{A n (z = oo)}, 00

which, together with (7), establishes (3). It is clear from the construction of A that A « P and that P(z oo < oo) But we also have

,u(zoo < oo)

=

1.

= P{(Z < oo) n (z = oo)} = 0. 00

00

Consequently the theorem is proved. The Lebesgue decomposition (3) implies the following useful tests for absolute continuity or singularity for locally absolutely continuous probability measures.

495

§6. Absolute Continuity and Singularity of Probability Distributions -

lac

Theorem 2. Let P « P, i.e.

..._

~""n

« Pn, n;;::: 1. Then

P « P¢>EZ 00 = 1¢>P(z 00 EZ 00

= 0¢>P(Z = 00

oo)

=

(8)

1,

(9)

where E denotes averaging with respect to P. PRooF. Putting A = Q in (3), we find that

Ezoo Ezoo

= 1 ¢>p(z = = 0¢>P(z = 00

oo)

00

oo)

= 0, = 1.

If P(z oo = oo) = 0, it again follows from (3) that P « P. Conversely, let P « P. Then since P(z 00 = oo) = 0, we have P(z 00

= 0.

(10) (11)

=

oo)

In addition, if P j_ P there is a set BE:#' with P(B) = 1 and P(B) = 0. Then P(B n (zoo = oo )) = 1 by (3), and therefore P(z 00 = oo) = 1. If, on the other hand, P(z oo = oo) = 1 the property P j_ P is evident, since P(Z 00 = oo) = 0. This completes the proof of the theorem. 2. It is clear from Theorem 2 that the tests for absolute continuity or singularity can be expressed either in terms of P (verify the equation Ezoo = 1 or Ez 00 = 0), or in terms ofP (verify that P(z 00 < oo) = 1 or that P(z 00 = oo) = 1. By Theorem 5 of §6, Chapter II, the condition Ez 00 = 1 is equivalent to the uniform integrability (with respect to P) of the family {zn}.;, 1 • This allows us to give simple sufficient conditions for the absolute continuity P « P. For example, if (12) n

or if sup Ez~ +e < oo,

6

> 0,

(13)

then, by Lemma 3 of §6, Chapter II, the family of random variables {zn}n;;, 1 is uniformly integrable and therefore P « P. In many cases it is preferable to verify the property of absolute continuity or of singularity by using a test in terms of P, since then the question is reduced to the investigation of the probability of the "tail" event {z 00 < oo }, where one can use propositions like the "zero-one" law. Let us show, by way of illustration, that the "Kakutani dichotomy" can be deduced from Theorem 2. Let (Q, :#', P) be a probability space, let (Roo, f?J 00 ) be a measurable space of sequences x = (x 1, x 2 , ..• ) of numbers with f?J 00 = PJ(R 00 ), and let f?Jn = u{x: {x 1 , . . . , xn)}. Let ~ = (~ 1 , ~ 2 , •.. ) and ~ = (~ 1 , ~ 2 , ... ) be sequences of independent random variables.

496

VII. Sequences of Random Variables That Form Martingales

Let P and P be the probability distributions on (R 00 , f!4 00 ) for ~ and ~. respectively, i.e. P(B) = P{~eB},

F(B)

=

P{~eB},

Also let be the restrictions of P and

P to f!l" and let

Theorem 3 (Kakutani Dichotomy). Let~ = (~ 1 , ~ 2 , ••• ) and~= (~ 1 , ~ 2 , •.• ) be sequences of independent random variables for which p~n

Then either



P or

«

p~n'

n ;;:: 1.

(14)

P j_ P.

PROOF. Condition (14) is evidently equivalent toP"« P", n;;:: 1, i.e. P ~< P. It is clear that

where (15)

Consequently {x:z 00
J 1

lnqi(xi)
The event {x: L~ 1 In qi(xi) < oo} is a tail event. Therefore, by the Kolmogorov zero-one law {Theorem 1 of §1, Chapter IV) the probability P{x:zoo
Theorem 4. Let ,.... « P and let n;;:: 1, with z 0 = 1. Then (with

j5 «

p <;:;>

ffo

= {0,!1})

j5t~1 [1 -

p j_ p <;:;> PL~1 [1 -

E{A I

!/'n-1)] < 00} =

1,

{16)

E(AI~-1)] = 00} =

1.

(17)

§6. Absolute Continuity and Singularity of Probability Distributions

497

PROOF. Since

F\{z. =

0} =

I

z"dP = 0,

J(zn=O}

we have (P-a.s.)

zn =

rl

k=1

= exp{

r:t.k

±

k=1

(18)

In r:x.k}.

Putting A = {z 00 = 0} in (3), we find that P{z 00 = 0} = 0. Therefore, by (18), we have (P-a.s.) {Z 00
= {0 < Z
= { - oo < lim

f

k=1

In r:x.k < oo} .

(19)

Let us introduce the function u(x)

=

{

lxl ::::; 1,

X,

.

Sign

X,

lxl > 1.

Then

{- oo < lim kt In r:x.k < oo} = {- oo < lim

J

1 u(ln r:x.k) < oo}.

(20)

Let E denote averaging with respect to P and let 11 be an ~-measurable integrable random variable. It follows from the properties of conditional expectations (Problem 4) that zn_ 1E(1Jiff,- 1 ) = E(IJznlff,- 1 ) E(IJI~-1)

(P- and P-a.s.),

= z~- 1 E(1Jzn1~- 1 ) (P-a.s.).

(21) (22)

Recalling that r:x.n = z~_ 1 z", we obtain the following useful formula from (22):

(23) From this it follows, in particular, that E(r:x.nl~- 1 ) =

1 (P-a.s.).

By (23),

Since xu(ln x)

~

x - 1 for x ~ 0, we have, by (24),

E[u(ln r:x.n)l ~- 1 ] ~ 0

(P-a.s.).

(24)

498

VII. Sequences of Random Variables That Form Martingales

= (Xn,

It follows that the stochastic sequence X

~)with

n

Xn =

L u(ln o:k) k=l

is a submartingale with respect to P; and I-1Xn I = Iu(ln o:n)l :::;; 1. Then, by Theorem 5 of §5, we have (P-a.s.) {- oo < lim

J 1

u(ln o:k) < oo} = {J1 E[u(ln o:k)

+ u2 (ln o:k)l g;-k_ 1] < oo}. (25)

Hence we find, by combining (19), (20), (22) and (25), that (P-a.s.) {zoo
t~1 E[u(lno:k) + u (1no:k)l${_ 2

1 ]
t~1 E[o:ku(ln e>:k) + o:ku (ln o:k)lg;-k_ 2

1]

< oo}

and consequently, by Theorem 2, P«

P~P{J1 E[e>:ku(lno:k) + o:ku (ln e>:k)l${2

1

P ..L P ~ Pt~ E[e>:ku(ln o:k)

1 ]
+ o:ku 2 (1n o:k)l${- 1 ]= oo}

= 1,

(26)

= 1.

(27)

We now observe that by (24),

and for x

~

A(1 -

0 there are constants A and B (0 < A < B < oo) such that

jx) 2 :::;; xu (ln x) + xu 2 (ln x) + 1 -

x :::;; B(1 -

jX) 2 .

(28)

Hence (16) and (17) follow from (26), (27) and (24), (28). This completes the proof of the theorem.

Corollary 1. If, for all n ~ 1, the a-algebras a(o:J and ~- 1 are independent - loc with respect to P(or P), and P « P, then we have the dichotomy: either P « P or P ..L P. Correspondingly,



00

p~

p j_ p~

L

n=l

[1 - EJa:J < oo,

00

L

n=l

[1- EJa:J

=00.

499

§6. Absolute Continuity and Singularity of Probability Distributions

In particular, in the Kakutani situation (see Theorem 3) 1Xn = qn and

L [1- EjqTxJ]
P « P~

n=1

L [1 00

j5 .l p ~

EjqTxJ] = 00.

n=1

Corollary 2. Let

loc

P « P.

Then

Pt~1 E(1Xn In 1Xnl ~-1) < 00} =

1 =>



P.

for the proof, it is enough to notice that x In x

for all x

~

+ !(1 -

(29)

x) ~ 1 - x 1' 2 ,

0, and apply (16) and (24).

Corollary 3. Since the series L:': 1 [1 - E(A I~- 1 )], which has nonnegative

L

lin (P-a.s.) terms, converges or diverges with the series conclusions ( 16) and (17) of Theorem 4 can be put in the form



E(Aiffn_ 1 )1,

P~Pt~1 1lnE(AI~-1)1
(30)

P .l P~Pt~

1 1lnE(AI~-1)1 =oo} = 1.

Corollary 4. Let there exist constants A and B such that 0 :s; A < 1, B P{ 1 - A :s; 1Xn :s; 1 + B} Then

=

1,

n

~

(31) ~

0 and

1.

!{P !:'< P we have· P « P ~ Pt~ E[(1 - 1Xn) 21~-1] < 00}

1

P .l P ~ Pt~1 E[(1

= 1,

~-1] = 00} =

- 1Xn) 21

1.

For the proof it is enough to notice that when x e [1 - A, 1 + B], where 0 :s; A < 1, B ~ 0, there are constants c and C (0 < c < C < oo) such that c(l - x) 2 :s; (1 -

Jx)

2

:s; C(1 - x) 2 •

(32)

4. With the notation of Subsection 2, let us suppose that ~ = (~ 1 , ~ 2 , ••• ) and ~ = (~ 1 , ~ 2 , ••• ) are Gaussian sequences, Pn"' Pn, n ~ 1. Let us show that, for such sequences, the "Hajek-Feldman dichotomy," either P - P or P .l P, follows from the "predictable" test given above. By the theorem on normal correlation (Theorem 2 of §13, Chapter II) the conditional expectations M(xn Iffln_ 1 ) and M(xn Iflin-t), where M and M

500

VII. Sequences of Random Variables That Form Martingales

are averages with respect to P and P, respectively, are linear functions of X 1' Xn- 1 We denote these linear functions by an- 1(x) and an- 1(x) and put 0

0

0

'

0

bn- 1 = (M[xn- an- 1(x)] 2) 112 , bn-1 = (M[xn- an-l(x)] 2) 112 ·

Again by the theorem on normal correlation, there are sequences and 8 = (8 1 , 82 , .•. ) of independent Gaussian random variables with zero means and unit variances, such that

8 = (8 1 , 8 2 , .• •)

+ bn-l8n, an-1(x) + 5n-1iln

Xn = an-1(x) Xn =

(P-a.s.),

(33)

(P-a.s.).

Notice that if bn-l = 0, or 5n_ 1 = 0, it is generally necessary to extend the probability space in order to construct 8n or iln. However, if bn- 1 = 0 the extended vector (x 1, ... , xn) will be contained (P-a.s.) in the linear manifold Xn = an_ 1(x), and since by hypothesis Pn "" P", we have bn_ 1 = 0, an- 1 = an-l (x), and 1Xn(x) = 1 (P- or P-a.s.). Hence we may suppose without loss of generality that > 0, > 0 for all n ;;::: 1, since otherwise the contribution of the corresponding terms of the sum l::= 1 [1 - M.fi:IBn_ 1 ] (see (16) and (17)) is zero. Using the Gaussian hypothesis, we find from (33) that, for n ;;::: 1,

b;

5;

(34) where dn = Ibn · 5;; 1 1 and a0 (x) = E~t>

ii 0 (x) = E~1'

b~ = v~1.

5~= v~l·

From (34),

d;_l

1 + d;_1 Since ln [2dn- d(l

P « P<:!>P{

+ d;_ 1)]

an-l (x) - an-1 (x)) 2 bn-1

:::; 0, statement (30) can be written in the form

f [tln 1+2dn-1 d;_l + d;_~ . (an-t(x)- an-t(x))2]
n=l

= 1.

The series

(35)

501

§6. Absolute Continuity and Singularity of Probability Distributions

converge or diverge together; hence it follows from (35) that

p« p~p{Jo [(~-

1r

+ i\i~x)J


(36)

where i\n(x) = aix) - tln(x). Since an(x) and an(x) are linear, the sequence of random variables {i\n(x)/bn}n;o,o is a Gaussian system (with respect to both P and P). As follows from a lemma that will be proved below, such sequences satisfy an analog of the zero-one law:

p{z:(i\~~x)r
(37)

Hence it follows from (36) that



p~ Jo [ M(i\~~x)r + (~J- 1rJ
and in a similar way

(i\ (x))2 oo [ + M p j_ p ~ Jo

---t-

(62bi - 1)2] = oo.

Then it is clear that if P and Pare not singular measures, we have P « P. But by hypothesis, Pn - Pn, n ~ 1; hence by symmetry we have P « P. Therefore we have the following theorem. . . . ) and ~ = be Gaussian sequences whose finite-dimensional distributions are equivalent: Pn "' Pn, n ~ 1. Then either P"' P or P j_ P. Moreover,

Theorem 5 (Hajek-Feldman Dichotomy). Let ~ = (~ 1 , ~ 2 ,

(~ 1 , ~ 2 ,

... )

P"'

P~

Loo

[

n=O

p j_ p~ n~O oo

[

M

(A (x))2 + ([)2---i- 1)2]
bn

bn

(x))2 (62 )2] M ---t- + bi- 1 (i\

(38) =OO.

Let us now prove the zero-one law for Gaussian sequences that we need for the proof of Theorem 5.

Lemma. Let f3 = (f3n)n;o, 1 be a Gaussian sequence defined on (Q, $'; P). Then P{J1 p; <

oo} = 1 ~ J

1

E/3; < oo.

(39)

The implication <= follows from Fubini's theorem. To establish the opposite proposition, we first suppose that Ef3n = 0, n ~ 1. Here it is enough to show that

PROOF.

(40)

502

VII. Sequences of Random Variables That Form Martingales

since then the condition P{'f.fJ~ < oo} = 1 will imply that the right-hand side of (40) is finite. Select an n ~ 1. Then it follows from §§11 and 13, Chapter II, that there are independent Gaussian random variables Pk,n• k = 1, ... , r :::; n, with EfJk,n = 0, such that r

I. P~.n· I. P~ = k=1 k=1 If we write EP~.n =

A.k,n, we see easily that r

r

E

L

k=1

L Ak,n

fJ~.n =

(41)

k=1

and E exp(-

kt1 fJ~.n)

lJY + 2A.k,n)-112.

=

(42)

Comparing the right-hand sides of (41) and (42), we obtain E J1

p~ = E kt1 P~.n:::; [E exp(- kt1 P~.n)rz =

[E exp(-

kt1 p~

)rz'

from which, by letting n ..... oo, we obtain the required inequality (40). Now suppose that E/Jn ;f= 0. Let us consider again the sequence P= CPn)n?. 1 with the same distribution as f3 = (f3n)n?. 1 but independent of it (if necessary, extending the original probability space).lf P{'f.:'= 1/J; < oo} = 1, then P{L:'= 1(/Jn- P.) 2 < oo} = 1, and by what we have proved, 00

00

2

L E(f3n -

E/3n) 2 =

n=1

L E(/Jn - Pn)

n=1

2

< 00.

Since (E/Jn) 2

:::;

2/J~

+ 2(/Jn

- EfJ.) 2 ,

we have 'f.:'= 1(E/Jn) 2 < oo and therefore n=1

00

00

00

L

EfJ; =

L

n=1

(E/Jn) 2

+ L

n=1

E(/Jn - M /Jn) 2 < 00.

This completes the proof of the lemma. 5. We continue the discussion of the example in Subsection 3 of the preceding section, assuming that ~ 0 , ~ 1 , ... are independent Gaussian random variables withE~;= 0, V~; = ~ > 0. Again we let

§6. Absolute Continuity and Singularity of Probability Distributions

503

for n ~ 1, where Xo = ~O• and the unknown parameter e that is to be estimated has values in R. Let{)" be the least-squares estimator (see (5.22)).

Theorem 6. A necessary and sufficient condition for the estimator{)"' n ~ 1, to be strongly consistent is that

f l=

(43)

00.

n=OV,+1

PROOF. Sufficiency. Let P 0 denote the probability distribution on (R 00 , f16 00 ) corresponding to the sequence (X 0 , X 1 , •• •) when the true value of the unknown parameter is e. Let E0 denote an average with respect to P 0 • We have already seen that

Mn

A

e"

e + <M).'

=

where

X2

n-1

<M)n=

L ~· k=O k+1

According to the lemma from the preceding subsection,

Po(<M)oo = oo) = 1-Eo<M)oo = oo, i.e. <M) 00 = oo (P 0-a.s.) if and only if

f

k=O

Eoxr ~+1

= oo

(44) .

But

EoXr

=

k

L

ezi~-i

i=O

and E xz ~ k=O ~+1

Ioo

=

Ioo - 1

k=O

~+1

(

Ik

i=O

ezi~-i

)

(45)

Hence (44) follows from (43) and therefore, by Theorem 4, the estimator {)"' n ~ 1, is strongly consistent for every e. Necessity. For all e E R, let P 0(0" ~ e) = 1. It follows that if e 1 =f. e 2 , the measures P 0 , and P 02 are singular (P 0 , _l P 02 ). In fact, since the sequence (X 0 , X 1 , . . . ) is Gaussian, by Theorem 5 of §5 the measures P 0 , and P02 are either singular or equivalent. But they cannot be equivalent, since if P0 , "' P02 ,

504

VII. Sequences of Random Variables That Form Martingales

but P8 ,(0.-+ 0 1 ) = 1, then also P02(0.-+ 0 1 ) = 1. However, by hypothesis, P82(0n-+ 02 ) = 1 and 02 "# 0 1 . Therefore P0 , _l P02 for 0 1 "# 02 . According to (5.38),

Po,

_l

Po 2 ~ (01 - 02 ) 2

f

k=O

E6 ,[v.Xf

k+1

J

=

oo

for 0 1 "# 02 • Taking 0 1 = 0 and 02 "# 0, we obtain from (45) that

P0

_l

P62 ~

v.

L~ = 00

oo,

i=O ~'i+ 1

which establishes the necessity of (43). This completes the proof of the theorem.

6.

PROBLEMS

1. Prove (6).

2. Let

f\-

P., n;;:: 1. Show that

P- P=P{z
P ..L P = P{z 00

= P{z 00

= 00} = 1

> 0} = 1,

or

P{z 00 = 0} = 1.

3. Let P. ~ P., n ;;:: 1, let" be a stopping time (with respect to (§,;)),and let P, = PI~ and P, = PI~ be the restrictions of P and P to the a-algebra ~. Show that P, ~ P, if and only if {'t" = oo} = {z""
P,
4. Prove (21) and (22). 5. Verify (28), (29) and (32). 6. Prove (34). 7. In Subsection 2 let the sequences ~ = (~ 1 , ~ 2 , ••. ) and ~ = (~ 1 , ~ 2 , ••• ) consist of independent identically distributed random variables. Show that if P~, ~ P~,, then P ~ P if and only if the measures P~, and P~, coincide. If, however, P~, ~ P~, and P~, of. P~, then P ..L P.

§7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary I. Let ~ 1 , ~ 2 , ••• be a sequence of independent identically distributed random variables. Lets. = ~ 1 + · · · + ~ •. let g = g(n) be a "boundary," n;;;:: 1, and let -r = inf{n ~ 1: Sn
505

§7. Asymptotics of the Probability of the Outcome of a Random Walk

It is difficult to discover the exact form of the distribution of the time r. In the present section we find the asymptotic form of the probability P(r > n) as n-+ 00, for a wide class of boundaries g = g(n) and assuming that the are normally distributed. The method of proof is based on the idea of an absolutely continuous change of measure together with a number of the properties of martingales and Markov times that were presented earlier.

ei

Theorem 1. Let ~ 1 , ~ 2 , .•• be independent identically distributed random variables, with ~; ""%(0, 1). Suppose that g = g(n) is such that g(1) < 0 and, for n ~ 2, 0 ::::;; Ag(n + 1) ::::;; Ag(n), (1) where Ag(n)

= g(n) - g(n ln n =

Then P( r > n) = exp{

1) and

o(t

2

n-+ oo.

[Ag(k)Jl) ,

-tt

[Ag(k)] 2 (1

+ o(l))},

(2)

n-+ oo.

(3)

Before starting the proof, let us observe that (1) and (2) are satisfied if, for example,

g(n)

= an• + b,

! < v ::::;; 1,

a + b < 0,

or (for sufficiently large n)

! : : ; v::::;; 1,

g(n) = n•L(n),

where L(n) is a slowly varying function (for example, L(n) = C(ln n)fl with arbitrary f3 for! < v < 1 or with f3 > 0 for v = !). 2. We shall need the following two auxiliary propositions for the proof of Theorem 1. Let us suppose that ~ 1 , ~ 2 , .•• is a sequence of independent identically distributed random variables, ~; ""%(0, 1). Let ff0 = {0, Q}, ~ = u{w: ~ 1 , ... , ~n}, and let oc =(an, ~- 1 ) be a predictable sequence with P(l ocn I : : ; C) = 1, n ~ 1, where C is a constant. Form the sequence z = (zn, ~)with Zn

= exp{

n L ock ek k=1

1 -2

n } L ocf ' k=1

n;:::l.

(4)

It is easily verified that (with respect to P) the sequence z = (zn, ~) is a martingale with Ezn = 1, n ~ 1. Choose a value n ~ 1 and introduce a probability measure Pn on the measurable space (Q, ~) by putting

(5)

506

VII. Sequences of Random Variables That Form Martingales

Lemma 1. With respect to Pn, the random variables ~k = ~k - IJ.k, 1 :s;; k :s;; n, are independent and normally distributed, ~k "'%(0, 1). PROOF. Let En denote averaging with respect to Pn. Then for A.k E R, 1 :s;; k :s;; n, En

exp{ikt/k~k} = E exp{ikt/k~k}zn =

E[exp{i :t:

A.k~k}zn-1 ·E{exp(iA.n(~n- IY.n) + IJ.n~n- ~) ~~-1}]

= E[exp{
Lt A.~}· 1

Now the desired conclusion follows from Theorem 4 of §12, Chapter II.

Lemma 2. Let X zero and

(Xn,

=

~)n"' 1

(J

=

be a square-integrable martingale with mean

inf{n

~

1:

xn :s;;

-b},

where b is a constant, b > 0. Suppose that P(X 1 <-b)> 0.

Then there is a constant C > 0 such that,for all n

~

1,

c

(6)

P((J > n) 2': EX2 . n

PROOF. By Corollary 1 to Theorem VII.2.1 we have EX""n - EJ((J :s;; n)X"

=

0, whence

= EJ((J > n)Xn.

(7)

On the set {(J :s;; n} Therefore, for n

~

1,

-EJ((J :s;; n)X" ~ bP((J :s;; n) ~ bP((J

= 1) = bP(X 1 < -b)> 0.

(8)

On the other hand, by the Cauchy-Schwarz inequality, EJ((J > n)Xn :s;; [P((J > n) · Ex;r 12 ,

(9)

which, with (7) and (8), leads to the required inequality with C

= (bP(X 1 < - b)) 2 .

PROOF OF THEOREM 1. It is enough to show that

!~~In P(r > n)

Ikt

[Ag(k)] 2

~

-

i

(10)

507

§7. Asymptotics of the Probability of the Outcome of a Random Walk

and

kt

!~ ln P(r > n) I

[Ag(k)] 2

t.

::;; -

(11)

For this purpose we consider the (nonrandom) sequence (O!n)n;;, 1 with 0! 1 =

0,

O!n =

n

Ag(n),

~

2,

and the probability measure (P n)n;;, 1 defined by (5). Then by Holder's inequality

P nCr > n)

=

E/( r > n) zn ::;; (P( r > n)) 11 q(Ez:) 11P,

(12)

where p > 1 and q = p/(p - 1). The last factor is easily calculated explicitly:

(Ez,;) 11 P = exp{p - 1

2

I

(13)

[Ag(k)] 2 }.

k=2

Now let us estimate the probability Pn(r > n) that appears on the lefthand side of (12). We have

Pir > n)

=

Pn(Sk ~ g(k), 1 ::;; k::;; n)

=

PiSk

~ g(1), 1 ::;; k::;; n),

where sk = L~= 1 ~i• ~i = ~i - 0!;. By Lemma 1, the variables are independent and normally distributed, ~i ""%(0, 1), with respect to the measure Pn. Then by Lemma 2 (applied to b = -g(l), p = Pn, xn = Sn) we find that c

P(r > n) ~ -, n

(14)

where cis a constant. Then it follows from (12)-(14) that, for every p > 1,

p P(r > n) ~ CPexp { - 2

Ln

[Ag(k)] 2

-

k=2

p

}

-------=-lnn , 1 p

(15)

where CP is a constant. Then (15) implies the lower bound (10) by the hypotheses of the theorem, since p > 1 is arbitrary. To obtain the upper bound (11), we first observe that since zn > 0 (Pand P-a.s.), we have by (5)

P(r > n) = En/(r > n)z; 1 ,

(16)

where E_n denotes expectation with respect to P_n. In the case under consideration, α_1 = 0 and α_n = Δg(n), n ≥ 2, and therefore for n ≥ 2

z_n^{−1} = exp{ −Σ_{k=2}^n Δg(k)·ξ_k + ½ Σ_{k=2}^n [Δg(k)]² }.


VII. Sequences of Random Variables That Form Martingales

By the formula for summation by parts (see the proof of Lemma 2 of §3, Chapter IV),

Σ_{k=2}^n Δg(k)·ξ_k = Δg(n)·S_n − Σ_{k=2}^n S_{k−1} Δ(Δg(k)),

where Δ(Δg(k)) = Δg(k) − Δg(k−1), with the convention Δg(1) = 0.
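The summation-by-parts identity can be checked numerically. The following sketch uses arbitrary illustrative data for g and the increments ξ_k (the variable names are ours), together with the convention Δg(1) = 0:

```python
import random

random.seed(0)
n = 12
g = [0.0] + [random.uniform(-1, 1) for _ in range(n)]    # g(1), ..., g(n); g[0] unused
xi = [0.0] + [random.gauss(0, 1) for _ in range(n)]       # xi_1, ..., xi_n
S = [0.0] * (n + 1)
for k in range(1, n + 1):
    S[k] = S[k - 1] + xi[k]                               # partial sums S_k

dg = [0.0] * (n + 1)                                      # dg[k] = g(k) - g(k-1), dg[1] := 0
for k in range(2, n + 1):
    dg[k] = g[k] - g[k - 1]

lhs = sum(dg[k] * xi[k] for k in range(2, n + 1))
# summation by parts: sum dg(k)*xi_k = dg(n)*S_n - sum S_{k-1}*(dg(k) - dg(k-1))
rhs = dg[n] * S[n] - sum(S[k - 1] * (dg[k] - dg[k - 1]) for k in range(2, n + 1))
assert abs(lhs - rhs) < 1e-10
print("summation-by-parts identity verified")
```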

Hence, if we recall that by hypothesis Δg(k) ≥ 0 and Δ(Δg(k)) ≤ 0, we find that, on the set {τ > n} = {S_k ≥ g(k), 1 ≤ k ≤ n},

Σ_{k=2}^n Δg(k)·ξ_k ≥ Δg(n)·g(n) − Σ_{k=3}^n g(k−1)Δ(Δg(k)) − ξ_1Δg(2)
   = Σ_{k=2}^n [Δg(k)]² + g(1)Δg(2) − ξ_1Δg(2).

Thus, by (16),

P(τ > n) ≤ exp{ −½ Σ_{k=2}^n [Δg(k)]² − g(1)Δg(2) } · E_n I(τ > n) e^{ξ_1Δg(2)},

where, with respect to P_n, ξ_1 = ξ̃_1 ∼ N(0, 1), so that E_n I(τ > n)e^{ξ_1Δg(2)} ≤ E_n e^{ξ̃_1Δg(2)} < ∞. Therefore

P(τ > n) ≤ C exp{ −½ Σ_{k=2}^n [Δg(k)]² },

where C is a positive constant; this establishes the upper bound (11). This completes the proof of the theorem.

3. The idea of an absolutely continuous change of measure can be used to study similar problems, including the case of a two-sided boundary. We present (without proof) a result in this direction.

Theorem 2. Let ξ_1, ξ_2, … be independent identically distributed random variables with ξ_i ∼ N(0, 1). Suppose that f = f(n) is a positive function such that f(n) → ∞, n → ∞, and …, n → ∞. Then, if

σ = inf{n ≥ 1: |S_n| ≥ f(n)},

we have

P(σ > n) = exp{ −(π²/8) Σ_{k=1}^n f^{−2}(k)·(1 + o(1)) },  n → ∞.   (17)
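A crude Monte Carlo illustration of the exponent in (17). Note the hedge: a constant boundary f(n) ≡ c does *not* satisfy the theorem's hypothesis f(n) → ∞; it is used below only because then (π²/8)Σ_{k≤n} f^{−2}(k) = (π²/(8c²))n, the classical escape rate of Brownian motion from (−c, c), which the Gaussian walk mimics for large c. All parameter choices (c, horizon, number of paths) are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
c, n_max, n_paths = 10.0, 300, 20_000

steps = rng.standard_normal((n_paths, n_max), dtype=np.float32)
S = np.cumsum(steps, axis=1)                         # S_1, ..., S_300 for each path
running_max = np.maximum.accumulate(np.abs(S), axis=1)
surv = (running_max < c).mean(axis=0)                # surv[k-1] estimates P(sigma > k)

# empirical decay rate of ln P(sigma > n) between n = 150 and n = 300
rate = (np.log(surv[149]) - np.log(surv[299])) / 150.0
theory = np.pi ** 2 / (8 * c ** 2)                   # about 0.0123 per step
print(f"empirical rate {rate:.4f}, predicted {theory:.4f}")
assert 0.5 * theory < rate < 1.5 * theory            # loose, since the walk overshoots the barrier
```

The discrete walk slightly overshoots the barrier at the exit time, so the empirical rate sits somewhat below π²/(8c²); the agreement improves as c grows.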

4. PROBLEMS

1. Show that the sequence defined in (4) is a martingale.
2. Establish (13).
3. Prove (17).

§8. Central Limit Theorem for Sums of Dependent Random Variables

1. Let us suppose that, on the probability space (Ω, ℱ, P), we are given stochastic sequences

ξⁿ = (ξ_{nk}, ℱ_k^n),  0 ≤ k ≤ n,  n ≥ 1,

with ξ_{n0} = 0, ℱ_0^n = {∅, Ω}, ℱ_k^n ⊆ ℱ_{k+1}^n ⊆ ℱ. Let

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk},  0 ≤ t ≤ 1,

and make the convention ℱ_{−1}^n = {∅, Ω}.

Theorem 1. For a given t, 0 < t ≤ 1, let the following conditions be satisfied: for each ε ∈ (0, 1], as n → ∞,

(A) Σ_{k=1}^{[nt]} P(|ξ_{nk}| > ε | ℱ_{k−1}^n) →^P 0,

(B) Σ_{k=1}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n] →^P 0,

(C) Σ_{k=1}^{[nt]} V[ξ_{nk} I(|ξ_{nk}| ≤ ε) | ℱ_{k−1}^n] →^P σ_t², where the constant σ_t² ≥ 0.

Then X_t^n →^d N(0, σ_t²).

Let us begin by discussing the hypotheses of the theorem. The difference from the hypotheses in §4 of Chapter III is that there we considered independent random variables, whereas here we consider arbitrary dependent variables, and moreover without assuming that E|ξ_{nk}| is finite. It will be shown in Theorem 2 that hypothesis (A) is equivalent to

(A*) max_{1≤k≤[nt]} |ξ_{nk}| →^P 0.

Therefore (compare Theorem 1 of §4, Chapter III) our Theorem 1 is a statement about the validity of the central limit theorem under the assumption that the summands are uniformly asymptotically infinitesimal (see, however, Theorem 5 below).

Hypotheses (A) and (B) guarantee that X_t^n can be represented in the form X_t^n = Y_t^n + Z_t^n with Z_t^n →^P 0 and Y_t^n = Σ_{k=0}^{[nt]} η_{nk}, where the sequence ηⁿ = (η_{nk}, ℱ_k^n) is a martingale-difference, i.e. E(η_{nk} | ℱ_{k−1}^n) = 0, with |η_{nk}| ≤ c uniformly for 1 ≤ k ≤ n and n ≥ 1. Consequently, in the cases under consideration, the proof reduces to proving the central limit theorem for martingale-differences.

In the case when the variables ξ_{n1}, …, ξ_{nn} are independent, conditions (A), (B) and (C), with t = 1 and σ² = σ_1², become

(a) Σ_{k=1}^n P(|ξ_{nk}| > ε) → 0,

(b) Σ_{k=1}^n E[ξ_{nk} I(|ξ_{nk}| ≤ 1)] → 0,

(c) Σ_{k=1}^n V[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → σ².

These are well known; see the book by Gnedenko and Kolmogorov [G5]. Hence we have the following corollary to Theorem 1.

Corollary. If ξ_{n1}, …, ξ_{nn} are independent random variables, n ≥ 1, then

(a), (b), (c) ⇒ X_1^n →^d N(0, σ²).
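As a minimal numerical illustration of the corollary, take the (hypothetical) triangular array ξ_{nk} = (B_k − p)/√(np(1−p)) with B_k ∼ Bernoulli(p) independent; conditions (a)–(c) then hold with σ² = 1, and X_1^n is a standardized binomial, which should be approximately N(0, 1) for large n. All parameters below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, n_rep = 400, 0.3, 20_000

# X^n = sum_k xi_nk is a standardized Binomial(n, p)
B = rng.binomial(n, p, size=n_rep)
X = (B - n * p) / np.sqrt(n * p * (1 - p))

print(f"mean {X.mean():+.3f}, variance {X.var():.3f}")
assert abs(X.mean()) < 0.05
assert abs(X.var() - 1.0) < 0.05
# compare one value of the distribution function with Phi(0.5) = 0.6915 (tables)
assert abs((X <= 0.5).mean() - 0.6915) < 0.05
```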

Remark 1. In hypothesis (C), the case σ_t² = 0 is not excluded. Hence, in particular, Theorem 1 also yields a condition for convergence to a degenerate distribution (X_t^n →^d 0).

Remark 2. The method used to prove Theorem 1 lets us state and prove the following more general proposition. Let 0 < t_1 < t_2 < ··· < t_j ≤ 1, σ_{t_1}² ≤ σ_{t_2}² ≤ ··· ≤ σ_{t_j}², σ_{t_0}² = 0, and let ε_1, …, ε_j be independent Gaussian random variables with zero means and Eε_k² = σ_{t_k}² − σ_{t_{k−1}}². Form the (Gaussian) vector (W_{t_1}, …, W_{t_j}) with W_{t_k} = ε_1 + ··· + ε_k.

Let conditions (A), (B) and (C) be satisfied for t = t_1, …, t_j. Then the joint distribution P^n_{t_1,…,t_j} of the random variables (X^n_{t_1}, …, X^n_{t_j}) converges weakly to the Gaussian distribution P_{t_1,…,t_j} of the variables (W_{t_1}, …, W_{t_j}):

P^n_{t_1,…,t_j} ⇒ P_{t_1,…,t_j}.

2. Theorem 2. (1) Condition (A) is equivalent to (A*). (2) Assuming (A) or (A*), condition (C) is equivalent to

(C*) Σ_{k=0}^{[nt]} [ξ_{nk} − E(ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n)]² →^P σ_t².

Theorem 3. For each n ≥ 1 let the sequence ξⁿ = (ξ_{nk}, ℱ_k^n), 1 ≤ k ≤ n, be a square-integrable martingale-difference:

E(ξ_{nk} | ℱ_{k−1}^n) = 0,  Eξ_{nk}² < ∞.

Suppose that the Lindeberg condition is satisfied: for every ε > 0,

(L) Σ_{k=0}^{[nt]} E[ξ_{nk}² I(|ξ_{nk}| > ε) | ℱ_{k−1}^n] →^P 0.

Then (C) is equivalent to

⟨Xⁿ⟩_t →^P σ_t²,   (1)

where the quadratic variation

⟨Xⁿ⟩_t = Σ_{k=0}^{[nt]} E(ξ_{nk}² | ℱ_{k−1}^n),   (2)

and (C*) is equivalent to

[Xⁿ]_t →^P σ_t²,   (3)

where the quadratic variation

[Xⁿ]_t = Σ_{k=0}^{[nt]} ξ_{nk}².   (4)

The next theorem is a corollary of Theorems 1–3.

Theorem 4. Let the square-integrable martingale-differences ξⁿ = (ξ_{nk}, ℱ_k^n), n ≥ 1, satisfy (for a given t, 0 < t ≤ 1) the Lindeberg condition (L). Then

Σ_{k=0}^{[nt]} E(ξ_{nk}² | ℱ_{k−1}^n) →^P σ_t²  ⇒  X_t^n →^d N(0, σ_t²),   (5)

Σ_{k=0}^{[nt]} ξ_{nk}² →^P σ_t²  ⇒  X_t^n →^d N(0, σ_t²).   (6)
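A simulation sketch of statement (5): the martingale-difference array below, with a coin-flip-modulated conditional variance v_k, is our own toy construction (not the book's). Since v_k is determined by the past, E(ξ_{nk} | ℱ_{k−1}) = 0, the Lindeberg condition holds, and (1/n)Σ_k v_k →^P 1.25, so X_1^n should be approximately N(0, 1.25):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 500, 10_000

eps = rng.standard_normal((n_rep, n))
v = np.ones((n_rep, n))
v[:, 1:] += 0.5 * (eps[:, :-1] > 0)      # v_k = 1 + 0.5*I(eps_{k-1} > 0): predictable
xi = eps * np.sqrt(v / n)                # martingale differences: E(xi_k | past) = 0
X = xi.sum(axis=1)

# sum_k E(xi_k^2 | F_{k-1}) = (1/n) sum_k v_k -> 1.25, so the limit is N(0, 1.25)
print(f"mean {X.mean():+.3f}, variance {X.var():.3f}")
assert abs(X.mean()) < 0.06
assert abs(X.var() - 1.25) < 0.10
kurt = (X ** 4).mean() / X.var() ** 2    # should be near 3 for a Gaussian limit
assert abs(kurt - 3.0) < 0.3
```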


3. PROOF OF THEOREM 1. Let us represent X_t^n in the form

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| ≤ 1) + Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1).   (7)

We define

B_t^n = Σ_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n],   (8)

μ_k^n(Γ) = I(ξ_{nk} ∈ Γ),  ν_k^n(Γ) = P(ξ_{nk} ∈ Γ | ℱ_{k−1}^n),

where Γ is a set from the smallest σ-algebra σ(R \ {0}) and P(ξ_{nk} ∈ Γ | ℱ_{k−1}^n) is the regular conditional distribution of ξ_{nk} with respect to ℱ_{k−1}^n. Then (7) can be rewritten in the following form:

X_t^n = B_t^n + Σ_{k=0}^{[nt]} ∫_{|x|>1} x dμ_k^n + Σ_{k=0}^{[nt]} ∫_{|x|≤1} x d(μ_k^n − ν_k^n),   (9)

which is known as the canonical decomposition of (X_t^n, ℱ_t^n). (The integrals are to be understood as Lebesgue–Stieltjes integrals, defined for every sample point.)

According to (B) we have B_t^n →^P 0. Let us show that (A) implies

Σ_{k=0}^{[nt]} ∫_{|x|>1} |x| dμ_k^n →^P 0.   (10)

We have

…   (11)

For every δ ∈ (0, 1),

…   (12)

It is clear that

Σ_{k=0}^{[nt]} I(|ξ_{nk}| > 1) = Σ_{k=0}^{[nt]} ∫_{|x|>1} dμ_k^n (= U_{[nt]}).

By (A),

V_{[nt]} = Σ_{k=0}^{[nt]} ∫_{|x|>1} dν_k^n →^P 0,   (13)

and ν_k^n is ℱ_{k−1}^n-measurable. Then by the corollary to Theorem 2 of §3, Chapter VII,

V_{[nt]} →^P 0 ⇒ U_{[nt]} →^P 0.   (14)

(By the same corollary and the inequality ΔU_{[nt]} ≤ 1, we also have the converse implication

U_{[nt]} →^P 0 ⇒ V_{[nt]} →^P 0,   (15)

which will be needed in the proof of Theorem 2.) The required proposition (10) now follows from (11)–(14). Thus

X_t^n = Y_t^n + Z_t^n,   (16)

where

Y_t^n = Σ_{k=0}^{[nt]} ∫_{|x|≤1} x d(μ_k^n − ν_k^n)   (17)

and

Z_t^n = B_t^n + Σ_{k=0}^{[nt]} ∫_{|x|>1} x dμ_k^n →^P 0.   (18)

It then follows by Problem 1 that, to establish X_t^n →^d N(0, σ_t²), we need only show that

Y_t^n →^d N(0, σ_t²).   (19)

Let us represent Y_t^n in the form

Y_t^n = Y_{[nt]}(ε) + Δ_{[nt]}(ε),   (20)

where, for ε ∈ (0, 1],

Y_{[nt]}(ε) = Σ_{k=0}^{[nt]} ∫_{ε<|x|≤1} x d(μ_k^n − ν_k^n),  Δ_{[nt]}(ε) = Σ_{k=0}^{[nt]} ∫_{|x|≤ε} x d(μ_k^n − ν_k^n).   (21)

As in the proof of (14), it is easily verified that, because of (A), we have Y_{[nt]}(ε) →^P 0, n → ∞.
514

VII. Sequences of Random Variables That Form Martingales

The sequence L\"(e) = (L\~(e), ~;:), 1 martingale with quadratic variation (L\"(e))k =

.± [ (

•=0

Jlxl:s;e

X2

~

k

~

dv'; - (

n, is a square-integrable

r Jlxl:s:e

X

dv';)

2

]

Because of (C),

Hence, for every e E (0, 1], max{y[m1(e), I(L\n(e)) 1nrJ - a} I} ~ 0. By Problem 2 there is then a sequence of numbers en! 0 such that Ylnr](en) ~ 0,

(L\"(en))[nt) ~ at.

Therefore, again by Problem 1, it is enough to prove only that M(nr] -4 JV(O, at),

(22)

±f

(23)

where

Mi: For

r

E

=L\i: (en)

=

i=O

Jlxl :S:en

x(p,i - v';).

a(R\ {0}), let ,U~(r) =

l(L\Mi: E r),

be a regular conditional probability, L\M~ = MI. - M~_ 1 , k ~ 1, M 0 = 0. Then the square-integrable martingale M" = (M~, ~k), 1 ~ k ~ n, can evidently be written in the form

M'k =

L: L\M'i = L k

k

i=1

i=l

f

lxl:s;2en

X

d,U;:.

(Notice that IL\M'il ~ 2e. by (23).) To establish (22) we have to show that, for every real Jc, E exp{iJcM[nrJ}--+ exp(- !Jc 2 at). Put

and

8i:(Gn)

=

k

TI (1 + L\G~).

j= 1

(24)

515

§8. Central Limit Theorem for Sums of Dependent Random Variables

Observe that 1+

~Gi:

= 1

+ ( (e;;.x J!x!~2En

- 1) dvi: = (

J!x!~2Ea

e;;.x dvi:

and consequently $i:(G")

=

k

0

j= 1

(1

+ ~GD.

On the basis of a lemma that will be proved in Subsection 4, (24) will follow if for every real A. 18[nr1(G")I =

IX\(nt] E[exp(iA.~Mj)I§j_ I~ C(A.) > 0 1]

(25)

S[ntl(G") ~ exp(- !A. 2 o}).

(26)

and

To see this we represent Si:(G") in the form k

8i:(G") = exp(Gi:) ·

0

(1

+ ~Gj) exp(-

~Gj).

j= 1

(Compare the function E,(A) defined by (76) in §6, Chapter II.) Since

(

J!x!,;; 2En

we have Gi: =

xdvj =

I ( ~2En

j=

1 J!x!

E(~Mjl§j_ 1 ) =

0,

(eux - 1 - iA.x) dvj.

(27)

Therefore

I~Gi:l

:::; (

J!x!~2En

< -

lei.l.x- 1 - iA.xl dvi::::; !A. 2

2 (2e ) 2 -+ l.A_ 2 n

0

(

J!x!~2£n

x 2 dvf: (28)

and (29)

By (C), (30)

516

VII. Sequences of Random Variables That Form Martingales

Suppose first that (M")k (28), (29) and Problem 3,

_:5;

[nt)

0

(1

a (P-a.s.), k

_:5;

[nt], where a ~ a}

+ dGi:) exp(- dGk) ~

+

1. Then by

n-+ oo,

1,

k=1

and therefore to establish (26) we have only to show that n G[nt]-+

-

2

1 12

(31)

211. (jt'

i.e., after (27), (29) and (30), that (32) But

Therefore if (M")£ntl _:5; a (P-a.s.), (31) is established and consequently (26) is established also. Let us now verify (25). Since Ieih - 1 - iA.x I _:5; !(A.x) 2 , we find from (28) that, for sufficiently large n, k

k

j= 1

j= 1

l$i:(G")I = I

0 (1 + dG?)I ~ fl (1

- 1Jc 2 d(M"))

= exp{t1 ln(1- 1A. 2d(M")i)} But !Jc 2d(M")i ln(1 - .lJc 2d(M") .) > 1 - 1Jc2d(M")i 1 2 and d(M")i:::; (2s.) 2 ! 0, n-+ oo. Therefore there is an n0 = n0 (Jc) such that for all n ~ n0 (Jc),

and therefore

lt'[nr1(G")I

~ exp{ -Jc 2 (M")£ntJ ~

e-J.la.

Hence the theorem is proved under the assumption that (M")[ntJ (P-a.s.). To remove this assumption, we proceed as follows.

_:5;

a

517

§8. Central Limit Theorem for Sums of Dependent Random Variables

Let r" = min{k ~ [nt]: (M")k ~ uf

u;

+ 1},

taking r" = oo if (M")[ntl ~ + 1. Then for Mi: = Mi: ·"" we have (M") 1ntl = (M") 1ntl A'" ~ 1 + u;

+ 2e;

~ 1 + u;

+ 2ef

(=a),

and by what has been proved, E exp{iA.M[ntJ}--+ exp(- !A. 2 u;).

But limiE{exp(iA.M[ntJ)- exp(iA.M[ntJ)},I

~

2lim P(r" < oo) = 0.

n

n

Consequently limE exp(iA.M[ntJ) =lim E{exp(iA.M[ntJ)- exp(iA.M[ntJ)} n

n

+ limE exp(iA.M[n11)

= exp(- !A. 2 u;).

n

This completes the proof of Theorem 1.

Remark. To prove the statement made in Remark 2 to Theorem 1, we need (using the Cramer-Wold method [B3]) to show that for all real numbers

At, ... ' A.i

j

--+ exp( -iA.iu;,)-

L iA.~(u;k- uLJ.

k=2

The proof of this is similar to the proof of (24), replacing (Mi:, $'J:) by the square-integrable martingales (Mi:, $'i:), k

MJ: = L

i= 1

vitlM'i,

4. In this subsection we prove a simple lemma which lets us reduce the verification of (24) to the verification of (25) and (26). Let 17" = ('7nk' $'i:), 1 ~ k ~ n, n ~ 1, be stochastic sequences, let n

yn =

L Yfnk'

k=l

let n

C"(A.) =

fl

k=l

E[exp(iA.'7nk)I$'J:_l],

AE R,

518

VII. Sequences of Random Variables That Form Martingales

and let Y be a random variable with

Lemma. If (for a given /c)l 6""(/c) I 2:: c(/c) > 0, n 2:: 1, a sufficient condition for the limit relation

(33) is that

(34) PROOF. Let ei.
m"(Jc) = @""(Jc). Then Im"(Jc)l ~ c - 1 (Jc) < oo, and it is easily verified that

Em"(Jc) = 1. Hence by (34) and the Lebesgue dominated convergence theorem,

IEei.
= IE(m"(Jc)[&'"(Jc)- &'(Jc)])l

~

c- 1 (Jc)E I&'"(Jc)- &'(Jc)l--+ 0,

n --+oo.

Remark. It follows, from (33) and the hypothesis that &""(Jc) 2:: c(Jc) > 0, that tff(Jc) =f. 0. In fact, the conclusion of the lemma remains valid without the assumption that lt&'"(Jc)l 2:: c(Jc) > 0, if restated in the form: if @""(Jc) .E. g(Jc) and 8(Jc) =f. 0, then (33) holds (Problem 5). 5. PROOF OF THEOREM 2. (1) Let e > 0, (j Since max I~nk I ~ e lsksn

+

E

(0, e), and for simplicity lett = 1.

n

L

I~nk I/(I ~nk I > e)

k~l

and

we have

If (A) is satisfied, i.e.

I I dvZ > (j} --+ o P{k~lJixi>E

519

§8. Central Limit Theorem for Sums of Dependent Random Variables

then (compare (14)) we also have

p{ ± f k=1

Jlxl>e

dJ1~ > (j} -+ 0.

Therefore (A) => (A*). Conversely, let (Jn

= min{k:::; n: l~nkl

~ e/2},

supposingthat(Jn = ooifmax 1 sksnl~nkl < e/2.By(A*),limnPC(Jn
[~1" f(l~nkl ~ e/2) > (j}

= 0.

t,~~~"" l~nkl ~ ie}

and

coincide, and by (A*)

"I"

f(l

k= 1

Therefore by (15)

~nkl ~ B/2)

""""L i k= 1

lxl<>:e

dv~ :::;

="I" f

d11Z

""""L i

dv~ ~ 0,

Jlxl<>:e/2

k= 1

k= 1

lxl<>:t/2

which, together with the property limn P( (J n < oo)

=>(A).

(2) Again suppose that t integrable martingales d"((j) with

(j E

=

~ 0.

= 0, prove that (A*)

1. Choose an e E (0, 1] and consider the square-

= (d;;(b), ~k)

(1 :::; k :::; n),

(0, e]. For the given e E (0, 1], we have, according to (C),
(Jf.

It is then easily deduced from (A) that for every

(j E

(0, e]

(35) Let us show that it follows from (C*), (A) and (A*) that, for every (j [d"(c))]n ~

E

(0, e], (36)

(Jf,

where [d"((j)]n

=

I

k=l

[~nkf(l ~nk I :::; b) - f

Jlxlsd

X

dvk] 2•

In fact, it is easily verified that by (A) [A"(b)]n - [d"(l)Jn ~ 0.

(37)

520

VII. Sequences of Random Variables That Form Martingales

But

n

::; 5

L J(l~nkl rel="nofollow">

k=l

::; 5 max l:Sk:Sn

I

~nk 2 1

l)l~nkl 2

f (

k=l

Jlxl>l

df.l'k

--+

o.

(38)

Hence (36) follows from (37) and (38). Consequently to establish the equivalence of (C) and (C*) it is enough to establish that when (C) is satisfied (for a given e E (0, 1]), then (C*) is also satisfied for every a > 0: lim lim b-0

P{l[~"(a)Jn- <~"(J))nl

n

>a}= 0.

(39)

Let 1 ::; k ::; n.

The sequence m"(J) = (m'k(J), ~'k) is a square-integrable martingale, and (m"(J)) 2 is dominated (in the sense of the definition of §3 on p. 467) by the sequences [m"(J)] and <m"(J)) ). It is clear that n

[m"(J)Jn

=

L

(~m'k(J)f ::; max l~m'k(J)I{[~"(~)Jn

k=l

+

<~"(~))n}

l:Sk:Sn

(40)

Since [~"(J)] and <~"(J)) dominate each other, it follows from (40) that (m"(J)) 2 is dominated by the sequences 6J 2 [~"(J)] and M 2 <~"(~)). Hence if (C) is satisfied, then for sufficiently small J (for example, for

J <

ib(af + 1))

n

and hence, by the corollary to Theorem 2 of §3, we have (39). If also (C*) is satisfied, then for the same values of J, (41) n

Since I~[~"( J)]k I ::; (215) 2 , the validity of (39) follows from (41) and another appeal to Theorem 2 of §3. This completes the proof of Theorem 2.

521

§8. Central Limit Theorem for Sums of Dependent Random Variables

6. PRooF OF THEOREM 3. On account of the Lindeberg condition (L), the equivalence of (C) and (1), and of (C*) and (3), can be established by direct calculation (Problem 6). 7. PRooF OF THEOREM 4. Condition (A) follows from the Lindeberg condition (L). As for condition (B), it is sufficient to observe that when ~n is a martingale-difference, the variables B~ that appear in the canonical decomposition (9) can be represented in the form B~ = -

L

[nt]

k=O

l

lxl>1

xdv!.

Therefore B~ .f. 0 by the Lindeberg condition (L). 8. The fundamental theorem of the present section, namely Theorem 1, was proved under the hypothesis that the terms that are summed are uniformly asymptotically infinitesimal. It is natural to ask for conditions for the central limit theorem without such a hypothesis. For independent random variables, examples of such theorems are given by Theorem 1 (assuming finite second moments) or Theorem 5 (assuming finite first moments) from §4, Chapter III. We quote (without proof) an analog of the first of these theorems, applicable only to sequences ~n = (~nk• ~~)that are square-integrable martingale differences. Let ~nk(x) = P(~nk ~ xl~~- 1 ) be a regular distribution function of ~nk with respect to ~i:- 1 , and let L\nk = E(~;kl~~- 1 ).

Theorem 5. If a square-integrable martingale-difference ~n 0 :::;; k :::;; n, n ;;::: 1, ~no = 0, satisfies the condition [nt]

L

k=O

L\nk

J.

= (~nk•

~i:),

at,

and for every e > 0

L

[nt]l

k=O

lxl>•

I

X ) dx .P. 0, lxll~k(x)- W( ~

yilnk

then

9. PROBLEMS 1. Let ~n = 'In

+ ~n• n ~

1, where 'In.!!.. '1 and ~n .!4 0. Prove that ~n .!4 'I·

2. Let (~n(e)), n ~ 1, e > 0, be a family ofrandom variables such that ~n(e) f. 0 for each e > 0 as n -+oo. Using, for example, Problem 11 of §10, Chapter II, prove that there is a sequence e. ! 0 such that ~.(e.) f. 0.

522 3. Let

VII. Sequences of Random Variables That Form Martingales

(~).

1 s; k s; n, n

~

1, be a complex-valued random variable such that (P-a.s.) n

L

IIX~I ::5

c,

n•

+ IX~) exp( -IXZ) =

k=l

Show that then (P-a.s.)

lim •

(1

k=l

4. Prove the statement made in Remark 2 to Theorem 1. 5. Prove the statement made in Remark 1 to the lemma. 6. Prove Theorem 3. 7. Prove Theorem 5.

1.

CHAPTER VIII

Sequences of Random Variables That Form Markov Chains

§1. Definitions and Basic Properties 1. In Chapter I (§12), for finite probability spaces, we took the basic idea

to be that of Markov dependence between random variables. We also presented a variety of examples and considered the simplest regularities that are possessed by random variables that are connected by a Markov chain. In the present chapter we give a general definition of a stochastic sequence of random variables that are connected by Markov dependence, and devote our main attention to the asymptotic properties of Markov chains with countable state spaces.

2. Let (Q, fF, P) be a probability space with a distinguished nondecreasing family(~) of u-algebras, ~ s;;;: $i'1 s;;;: • • • s;;;: !F. Definidon. A stochastic sequence X = (Xn, $i'n) is called a Markov chain (with respect to the measure P) if (1)

for all n 2::: m 2::: 0 and all BE fJB(R). Property (1), the Markov property, can be stated in a number of ways. For example, it is equivalent to saying that E[g(Xn)l$i'mJ = E[g(Xn)IXm]

for every bounded Borel function g = g(x).

(P-a.s.)

(2)

524

VIII. Sequences of Random Variables That Form Markov Chains

Property (1) is also equivalent to the statement that, for a given "present"

Xm, the "future" F and the "past" Pare independent, i.e. (3) where FE a{w: X;, i ~ m}, and BEg;"' n.:::; m. In the special case when

g;"

=

g;;

= a{w:

X0 ,

... ,

X.}

and the stochastic sequence X = (X., g;;) is a Markov chain, we say that the sequence {X.} itself is a Markov chain. It is useful to notice that if X= {X., g;n} is a Markov chain, then (X.) is also a Markov chain.

Remark. It was assumed in the definition that the variables X m are realvalued. In a similar way, we can also define Markov chains for the case when X n takes values in some measurable space (E. 8). In this case, if all singletons are measurable, the space is called a phase space, and we say that X = (X., g;.) is a Markov chain with values in the phase space (E, 8). When E is finite or countably infinite (and 8 is the a-algebra of all its subsets) we say that the Markov chain is discrete. In turn, a discrete chain with a finite phase space is called a finite chain. The theory of finite Markov chains, as presented in §12, Chapter I, shows that a fundamental role is played by the one-step transition probabilities P(Xn+ 1 E Bl X.). By Theorem 3, §7, Chapter II, there arefunctions Pn+ 1 (x; B), the regular conditional probabilities, which (for given x) are measures on (R, gJ(R)), and (for given B) are measurable functions of x, such that

(4) The functions P. = P.(x, B), n ~ 0, are called transition functions, and in the case when they coincide (P 1 = P 2 = · · ·), the corresponding Markov chain is said to be homogeneous (in time). From now on we shall consider only homogeneous Markov chains, and the transition function P 1 = P 1(x, B) will be denoted simply by P = P(x, B). Besides the transition function, an important probabilistic property of a Markov chain is the initial distribution n = n(B), that is, the probability distribution defined by n(B) = P(X 0 E B). The set of pairs (n, P), where n is an initial distribution and Pis a transition function, completely determines the probabilistic properties of X, since every finite-dimensional distribution can be expressed (Problem 2) in terms of n and P: for every n ~ 0 and A E gJ(R"+ 1) P{(X 0 ,

=

... ,

X.) E A}

Ln(dx 0 ) LP(x 0 ;dx 1 ) · · · f/A(x 0 ,

...

,x.)P(x._ 1 ;dxn).

(5)

525

§1. Definitions and Basic Properties

We deduce, by a standard limiting process, that for any f?I(W+ 1 )-measurable function g(x 0 , •.. , xn), either of constant sign or bounded, Eg(Xo, ... , Xn)

= {n(dxo) LP(x 0 ;dx 1 )

...

Lg(x 0 ,

...

,xn)P(xn-t;dx.).

(6)

3. Let p<•> = p<•>(x; B) denote a regular variant of the n-step transition probability: (7)

It follows at once from the Markov property that for all k and l, (k, l ;;:::: 1), p(X 0 ; B)= Lp(X 0 ; dy)P< 1>(y; B)

It does not follow, of course, that for all x

E

(P-a.s.).

(8)

R

p(x; B)= Lp(x; dy)P< 1>(y; B).

(9)

It turns out, however, that regular variants of the transition probabilities can be chosen so that (9) will be satisfied for all x E R (see the discussion in the historical and bibliographical notes, p. 559). Equation (9) is the Kolmogorov-Chapman equation (compare (112.13)) and is the starting point for the study oft he probabilistic properties of Markov chains. 4. It follows from our discussion that with every Markov chain X = (X n, JF.), defined on (Q, ff, P) there is associated a set (n, P). It is natural to ask what properties a set (n, P) must have in order for n = n(B) to be a probability distribution on (R, .cJJ(R)) and for P = P(x; B) to be a function that is measurable in x for given B, and a probability measure on B for every x, so that n will be the initial distribution, and P the transition function, for some Markov chain. As we shall now show, no additional hypotheses are required. In fact, let us take (Q, !F) to be the measurable space (R 00 , !?J(R 00 )). On the sets A E !?J(W+ 1) we define a probability measure by the right-hand side of formula (5). It follows from §9, Chapter II, that a probability measure P exists on (R 00 , !?J(R 00 ) for which P{w: (x 0 ,

... ,

x.) E A}

= {n(dxo)

f/(x 0 ;dx 1)

..

·f/A(x 0 , ... ,x.)P(x._ 1 ;dx.). (10)

Let us show that if we put x.(w) = x. for w = (xo, Xt, ... ), the sequence X = (X.)n~o will constitute a Markov chain (with respect to the measure P just constructed).

526

VIII. Sequences of Random Variables That Form Markov Chains

In fact, if Be BI(R) and C e BI(R"+ 1), then P{Xn+l eB, (X 0 ,

••• ,

Xn)e C}

= { n(dx 0 ) LP(x0 ; dx 1)

•••

f/a<xn+l)Ic(x 0 , ••• , Xn)P(xn; dxn+l)

= { n(dx 0 ) LP(x0 ; dx 1 )

• ..

LP(xn; B)Ic(x 0 ,

=

f

... ,

Xn)P(xn-l; dxn)

P(Xn; B)dP,

{co: (Xo, ... , Xn)eC}

whence (P-a.s.) (11)

Similarly we can verify that (P-a.s.) (12)

Equation (1) now follows from (11) and (12). It can be shown in the same way that for every k ;;::: 1 and n ;;::: 0,

This implies the homogeneity of Markov chains. The Markov chain X= (Xn) that we have constructed is known as the Markov chain generated by (n, P). To emphasize that the measure P on (R 00 , 91(R 00 )) has precisely the initial distribution n, it is often denoted by P,. If n is concentrated at the single point x, we write P x instead of P "' and the corresponding Markov chain is called the chain generated by the point x (since Px{X 0 = x} = 1). Consequently, each transition function P = P(x, B) is in fact connected with the whole family of probability measures {P x• x e R}, and therefore with the whole family of Markov chains that arise when the sequence (Xn)n2:o is considered with respect to the measures Px• x e R. From now on, we shall use the phrase "Markov chain with given transition function" to mean the family of Markov chains in the sense just described. We observe that the measures P" and P x constructed from the transition function P = P(x, B) are consistent in the sense that, when A e 91(R 00 ),

and

527

§1. Definitions and Basic Properties

5. Let us suppose that (Q, !F)= (R 00 , PJ(R 00 )) and that we are considering a sequence X= (X.) that is defined coordinate-wise, that is, X.(w) = x. for w = (x 0 , x 1 , ... ). Also let !F, = a{w: X 0 , •.• , X.), n 2: 0. Let us define the shifting operators e., n 2: 0, on Q by the equation

= (x., Xn+l• ... ), and let us define, for every random variable 11 = IJ(w), the random variables 8.(xo,

Xb ... )

e.IJ by putting

In this notation, the Markov property of homogeneous chains can (Problem 1) be given the following form: For every ff-measurable 11 = IJ(w), every n 2: 0, and BE PJ(R), P{8.1JEB/!F,}

=

PxJIJEB}

(P-a.s.).

(15)

This form of the Markov property allows us to give the following important generalization: (15) remains valid if we replace n by stopping times r.

Theorem. Let X= (X.) be a homogeneous Markov chain defined on (R 00 , PJ(R 00 ), P) and let r be a stopping time. Then the following strong Markov property is valid: (16) P{8r1JEB/~} = PxJIJEB} (P-a.s.). PROOF. If A E ~ then

L P{8r1JEB, A, r = n} 00

P{8r1JEB, A}=

n=O

00

= L P{8.1JEB, A, r = n}.

(17)

n=O

The events An {r = n} E :F., and therefore P{8.1JEB, An {r = n}} = =

f f

An{r=n}

An~=~

P{8.1JEB/!F,} dP Px)IJ EB} dP =

f

PxJIJ E B} dP,

An~=~

which, with (17), establishes (16).

Corollary. If a is a stopping time such that P(a 2: r) = 1 and a is ~-measurable, then ({a< oo}; P-a.s.). (18) 6. As we said above, we are going to consider only discrete Markov chains (with phase space E = {... , i,j, k, .. .}). To simplify the notation, we shall now denote the transition functions P(i; U}) by Pij and call them transition

528

VIII. Sequences of Random Variables That Form Markov Chains

probabilities; an n-step transition probability from i to j will be denoted by p~';>. Let E = { 1, 2, ... }. The principal questions that we study in §§2-4 are intended to clarify the conditions under which: (A) The limits ni = limp~~} exist and are independent of i; (B) The limits (n 1 , n 2 , •• •) form a probability distribution, that is, ni 2 0,

ni = 1; (C) The chain is ergodic, that is, the limits (n 1, n 2 , •• •) have the properties ni > 0, L~ 1 ni = 1; (D) There is one and only one stationary probability distribution iQI = (q 1 , q2 , •• •), that is, one such that qi 2 0, L~ 1 qi = 1, and qi = Li qiPii• jEE. In the course of answering these questions we shall develop a classification of the states of a Markov chain as they depend on the arithmetic and asymptotic properties of p\'!l and p\~l l) u 0

7. PROBLEMS 1. Prove the equivalence of definitions (1), (2), (3) and (15) of the Markov property. 2. Prove formula (5). 3. Prove equation (18). 4. Let (X.)• ., 0 be a Markov chain. Show that the reversed sequence( . .. ,X.,X n-l•· . . ,X 0 )

is also a Markov chain.

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties of the Transition Probabilities p~'P 1. We say that a state i E E = { 1, 2, ... } is inessential if, with positive probability, it is possible to escape from it after a finite number of steps, without ever returning to it; that is, there exist m andj such that p~jl > 0, but p)il = 0 for all n and j. Let us delete all the inessential states from E. Then the remaining set of essential states has the property that a wandering particle that encounters it can never leave it (Figure 36). As will become clear later, it is essential states that are the most interesting.

Let us now consider the set of essential states. We say that state j is accessible from the point i (i---. j) if there is an m 2 0 such that p~jl > 0

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties

I I

529

t~@: 1

I

I

I

l1t

I I

I I

I

(

2

1

I

I

2

t1 2

I.

------------1

I ___________ _

Inessential state

Essential state Figure 36

(p!J> =

1 if i = j, and 0 if i =F j). States i and j communicate (i- j) if j is accessible from i and i is accessible from j. By the definition, the relation "-" is symmetric and reflexive. It is easy to verify that it is also transitive (i- j, j - k => i - k). Consequently the set of essential states separates into a finite or countable number of disjoint sets E 1, E 2 , •.• , each of which consists of communicating sets but with the property that passage between different sets is impossible. By way of abbreviation, we call the sets £ 1, E 2 , •.• classes or indecomposable classes (of essential communicating sets), and we call a Markov chain indecomposable if its states form a single indecomposable class. As an illustration we consider the chain with matrix

1 l:o i:o 1

'3

3 .

4

IP=

•••

0

0 0

0

0 0 0 0

•••••••••••

o:o o:.l o:o • 2

1 0 =

0

(~1

1

~J

2

1 0

The graph of this chain, with set of states E = {1, 2, 3, 4, 5} has the form I

1

2

~ t I

It is clear that this chain has two indecomposable classes E 1 = {1, 2}, E 2 = {3, 4, 5}, and the investigation of their properties reduces to the investigation of the two separate chains whose states are the sets E 1 and E 2 , and whose transition matrices are IP 1 and IP 2 • Now let us consider any indecomposable class E, for example the one sketched in Figure 37.

530

VIII. Sequences of Random Variables That Form Markov Chains

2

Figure 37. Example of a Markov chain with period d = 2.

Observe that in this case a return to each state is possible only after an even number of steps; a transition to an adjacent state, after an odd number; the transition matrix has block structure,

o o:t t o o:t t

IP=

... ···.··· ... 1_

2

I

'0 0

2:

! t:o o Therefore it is clear that the class E = {1, 2, 3, 4} separates into two subclasses C 0 = {1, 2} and C 1 = {3, 4} with the following cyclic property: after one step from C 0 the particle necessarily enters C 1 , and from C 1 it returns to C 0 . This example suggests a classification of indecomposable classes into cyclic subclasses.

·

2. Let us say that state j has period d = d(j) if the following two conditions are satisfied:

p}'p > 0 only for values of n of the form dm; (2) dis the largest number satisfying (1).

(1)

In other words, d is the greatest common divisor of the numbers n for which

p}'j> > 0. (If p)jl = 0 for all n ~ 1, we put dU) = 0.)

Let us show that all states of a single indecomposable class E have the same period d, which is therefore naturally called the period of the class, d = d(E). Let i and j E E. Then there are numbers k and l such that PW > 0 and ll -> p\~>p\!l > 0' and therefore k + I is divisible PJl(!l > 0• Consequently p\~+ u lj Jl by d(i). Suppose that n > 0 and n is not divisible by d(i). Then n + k + I is also not divisible by d(i) and consequently P!7+k+l) = 0. But P(~+k+l) II

> -

p\~)p\'!lp\!) IJ

JJ

II

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties

531

/~~

c,_,o\\

/c,

c3•

•c2

Figure 38. Motion among cyclic subclasses.

and therefore PY? = 0. It follows that if p}'jl > 0 we have n divisible by d(i), and therefore d(i) :::; d(j). By symmetry, d(j) :::; d(i). Consequently d(i) = d(j). If d(j) = 1 (d(E) = 1), the state j (or class E) is said to be aperiodic. Let d = d(E) be the period of an indecomposable class E. The transitions within such a class may be quite freakish, but (as in the preceding example) there is a cyclic character to the transitions from one group of states to another. To show this, let us select a state i 0 and introduce (for d 2 1) the following subclasses:

$$C_0 = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv 0 \ (\mathrm{mod}\ d)\},$$
$$C_1 = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv 1 \ (\mathrm{mod}\ d)\},$$
$$\vdots$$
$$C_{d-1} = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv d - 1 \ (\mathrm{mod}\ d)\}.$$

Clearly $E = C_0 + C_1 + \cdots + C_{d-1}$. Let us show that the motion from subclass to subclass is as indicated in Figure 38. In fact, let the state $i \in C_p$ and $p_{ij} > 0$. Let us show that necessarily $j \in C_{p+1(\mathrm{mod}\ d)}$. Let $n$ be such that $p_{i_0 i}^{(n)} > 0$. Then $n = ad + p$, and therefore $n \equiv p \ (\mathrm{mod}\ d)$ and $n + 1 \equiv p + 1 \ (\mathrm{mod}\ d)$. Hence $p_{i_0 j}^{(n+1)} > 0$ and $j \in C_{p+1(\mathrm{mod}\ d)}$.

Let us observe that it now follows that the transition matrix $\mathbb{P}$ of an indecomposable chain has a block structure in which the only nonzero blocks are those describing the transitions from $C_p$ to $C_{p+1(\mathrm{mod}\ d)}$, $p = 0, 1, \ldots, d - 1$.


Figure 39. Classification of the states of a Markov chain in terms of arithmetic properties of the probabilities $p_{ij}^{(n)}$.

Consider a subclass $C_p$. If we suppose that a particle is in the set $C_0$ at the initial time, then at the times $n = p + dt$, $t = 0, 1, \ldots$, it will be in the subclass $C_p$. Consequently, with each subclass $C_p$ we can connect a new Markov chain, with transition matrix $(p_{ij}^{(d)})_{i, j \in C_p}$, which is indecomposable and aperiodic. Hence, if we take account of the classification that we have outlined (see the summary in Figure 39), we infer that in studying problems on the limits of the probabilities $p_{ij}^{(n)}$ we can restrict our attention to aperiodic indecomposable chains.

3. PROBLEMS

1. Show that the relation "$\leftrightarrow$" is transitive.

2. For Example 1, §5, show that when $0 < p < 1$, all states belong to a single class with period $d = 2$.

3. Show that the Markov chains discussed in Examples 4 and 5 of §5 are aperiodic.

§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties of the Probabilities $p_{ij}^{(n)}$

1. Let $\mathbb{P} = \|p_{ij}\|$ be the transition matrix of a Markov chain,

$$f_{ii}^{(k)} = P_i\{X_k = i,\ X_l \ne i,\ 1 \le l \le k - 1\} \qquad (1)$$

and, for $i \ne j$,

$$f_{ij}^{(k)} = P_i\{X_k = j,\ X_l \ne j,\ 1 \le l \le k - 1\}. \qquad (2)$$


For $X_0 = i$, these are respectively the probability of first return to state $i$ at time $k$, and the probability of first arrival at state $j$ at time $k$. Using the strong Markov property (1.16), we can show, as in (1.12.38), that

$$p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)}\, p_{jj}^{(n-k)}. \qquad (3)$$

For each $i \in E$ we introduce

$$f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)}, \qquad (4)$$

which is the probability that a particle that leaves state $i$ will sooner or later return to that state. In other words, $f_{ii} = P_i\{\sigma_i < \infty\}$, where $\sigma_i = \inf\{n \ge 1 : X_n = i\}$, with $\sigma_i = \infty$ when $\{\cdot\} = \varnothing$.

We say that a state $i$ is recurrent if $f_{ii} = 1$, and nonrecurrent if $f_{ii} < 1$.

Every recurrent state can, in turn, be classified according to whether the average time of return is finite or infinite. Let us say that a recurrent state $i$ is positive if

$$\mu_i^{-1} = \left(\sum_{n=1}^{\infty} n f_{ii}^{(n)}\right)^{-1} > 0$$

and null if

$$\mu_i^{-1} = \left(\sum_{n=1}^{\infty} n f_{ii}^{(n)}\right)^{-1} = 0.$$

Thus we obtain the classification of the states of the chain displayed in Figure 40: the set of all states divides into the nonrecurrent and the recurrent states, and the recurrent states in turn into the positive and the null states.

Figure 40. Classification of the states of a Markov chain in terms of the asymptotic properties of the probabilities $p_{ij}^{(n)}$.
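The quantities $f_{ii}^{(n)}$, $f_{ii}$ and $\mu_i$ can be computed numerically from the return probabilities $p_{ii}^{(n)}$ by inverting the renewal identity (3). A hedged Python sketch (the two-state example and the helper names are our own illustration, not part of the text):

```python
def first_return_probs(p_ret, N):
    # invert p^(n) = sum_{k=1}^n f^(k) p^(n-k), identity (3) with i = j
    f = [0.0] * (N + 1)
    for n in range(1, N + 1):
        f[n] = p_ret[n] - sum(f[k] * p_ret[n - k] for k in range(1, n))
    return f

# two-state chain: from 0 jump to 1 with prob a, from 1 jump to 0 with prob b
a, b = 0.3, 0.6
P = [[1 - a, a], [b, 1 - b]]
M = [[1.0, 0.0], [0.0, 1.0]]
p00 = [1.0]                                   # p_00^(0) = 1
for _ in range(40):
    M = [[sum(M[i][k] * P[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    p00.append(M[0][0])

f = first_return_probs(p00, 40)
f00 = sum(f[1:])                              # ≈ 1: state 0 is recurrent
mu0 = sum(n * f[n] for n in range(1, 41))     # ≈ 1.5, the mean return time
print(f00, mu0)
```

The value $\mu_0 = 1.5$ agrees with $1/\pi_0$ for the stationary probability $\pi_0 = b/(a + b) = 2/3$ of this two-state chain.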

534

VIII. Sequences of Random Variables That Form Markov Chains

2. Since the calculation of the functions $f_{ij}^{(n)}$ can be quite complicated, it is useful to have the following tests for whether a state $i$ is recurrent or not.

Lemma 1. (a) The state $i$ is recurrent if and only if

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty. \qquad (5)$$

(b) If state $j$ is recurrent and $i \leftrightarrow j$, then state $i$ is also recurrent.

PROOF. (a) By (3),

$$p_{ii}^{(n)} = \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)},$$

and therefore (with $p_{ii}^{(0)} = 1$)

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)} = \sum_{k=1}^{\infty} f_{ii}^{(k)} \sum_{n=k}^{\infty} p_{ii}^{(n-k)} = f_{ii}\left(1 + \sum_{n=1}^{\infty} p_{ii}^{(n)}\right).$$

Therefore if $\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$, we have $f_{ii} < 1$, and therefore state $i$ is nonrecurrent. Furthermore, let $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$. Then

$$\sum_{n=1}^{N} p_{ii}^{(n)} = \sum_{n=1}^{N} \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)} = \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{n=k}^{N} p_{ii}^{(n-k)} \le \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{l=0}^{N} p_{ii}^{(l)},$$

and therefore

$$f_{ii} \ge \sum_{k=1}^{N} f_{ii}^{(k)} \ge \frac{\sum_{n=1}^{N} p_{ii}^{(n)}}{\sum_{l=0}^{N} p_{ii}^{(l)}} \to 1, \qquad N \to \infty.$$

Thus if $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ then $f_{ii} = 1$, that is, the state $i$ is recurrent.

(b) Let $p_{ij}^{(s)} > 0$ and $p_{ji}^{(t)} > 0$. Then

$$p_{ii}^{(n+s+t)} \ge p_{ij}^{(s)}\, p_{jj}^{(n)}\, p_{ji}^{(t)},$$

and if $\sum_{n=1}^{\infty} p_{jj}^{(n)} = \infty$, then also $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$, that is, the state $i$ is recurrent.

3. From (5) it is easy to deduce a first result on the asymptotic behavior of $p_{ij}^{(n)}$.

Lemma 2. If state $j$ is nonrecurrent, then, for every $i$,

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty \qquad (6)$$

and therefore

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty. \qquad (7)$$

PROOF. By (3) and Lemma 1,

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ij}^{(k)}\, p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \sum_{n=0}^{\infty} p_{jj}^{(n)} = f_{ij} \sum_{n=0}^{\infty} p_{jj}^{(n)} \le \sum_{n=0}^{\infty} p_{jj}^{(n)} < \infty.$$

Here we used the inequality $f_{ij} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \le 1$, which holds because this series represents the probability that a particle starting at $i$ eventually arrives at $j$. This establishes (6) and therefore (7).

Let us now consider recurrent states.

Lemma 3. Let $j$ be a recurrent state with $d(j) = 1$.

(a) If $i$ communicates with $j$, then

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty. \qquad (8)$$

If in addition $j$ is a positive state, then

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad n \to \infty. \qquad (9)$$

If, however, $j$ is a null state, then

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty. \qquad (10)$$

(b) If $i$ and $j$ belong to different classes of communicating states, then

$$p_{ij}^{(n)} \to \frac{f_{ij}}{\mu_j}, \qquad n \to \infty. \qquad (11)$$

The proof of the lemma depends on the following theorem from analysis. Let $f_1, f_2, \ldots$ be a sequence of nonnegative numbers with $\sum_{j=1}^{\infty} f_j = 1$, such that the greatest common divisor of the indices $j$ for which $f_j > 0$ is 1. Let $u_0 = 1$, $u_n = \sum_{k=1}^{n} f_k u_{n-k}$, $n = 1, 2, \ldots$, and let $\mu = \sum_{n=1}^{\infty} n f_n$. Then $u_n \to 1/\mu$ as $n \to \infty$. (For a proof, see [F1], §10 of Chapter XIII.)

Taking account of (3), we apply this to $u_n = p_{jj}^{(n)}$, $f_k = f_{jj}^{(k)}$. Then we immediately find that

$$p_{jj}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty,$$

where $\mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)}$.

Taking $p_{jj}^{(s)} = 0$ for $s < 0$, we can rewrite (3) in the form

$$p_{ij}^{(n)} = \sum_{k=1}^{\infty} f_{ij}^{(k)}\, p_{jj}^{(n-k)}. \qquad (12)$$

By what has been proved, $p_{jj}^{(n-k)} \to \mu_j^{-1}$, $n \to \infty$, for each given $k$. Therefore if we suppose that

$$\lim_n \sum_{k=1}^{\infty} f_{ij}^{(k)}\, p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \lim_n p_{jj}^{(n-k)}, \qquad (13)$$

we immediately obtain

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}\left(\sum_{k=1}^{\infty} f_{ij}^{(k)}\right) = \frac{f_{ij}}{\mu_j}, \qquad (14)$$

which establishes (11).

Recall that $f_{ij}$ is the probability that a particle starting from state $i$ arrives, sooner or later, at state $j$. State $j$ is recurrent, and if $i$ communicates with $j$, it is natural to suppose that $f_{ij} = 1$. Let us show that this is indeed the case.

Let $\tilde f_{ij}$ denote the probability that a particle, starting from state $i$, visits state $j$ infinitely often. Clearly $f_{ij} \ge \tilde f_{ij}$. Therefore if we show that, for a recurrent state $j$ and a state $i$ that communicates with it, the probability $\tilde f_{ij} = 1$, we will have established that $f_{ij} = 1$. According to part (b) of Lemma 1, the state $i$ is also recurrent, and therefore

$$f_{ii} = \sum_{k=1}^{\infty} f_{ii}^{(k)} = 1. \qquad (15)$$

Let

$$\sigma_i = \inf\{n \ge 1 : X_n = i\}$$

be the first time (among the times $n \ge 1$) at which the particle reaches state $i$; take $\sigma_i = \infty$ if no such time exists. Then

$$1 = f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)} = \sum_{n=1}^{\infty} P_i(\sigma_i = n) = P_i(\sigma_i < \infty), \qquad (16)$$

and consequently to say that state $i$ is recurrent means that a particle starting at $i$ will eventually return to that state (at the random time $\sigma_i$). But after returning to this state the "life" of the particle starts over, so to speak (because of the strong Markov property). Hence it appears that if state $i$ is recurrent, the particle must return to it infinitely often:

$$P_i\{X_n = i \text{ for infinitely many } n\} = 1. \qquad (17)$$

Let us now give a formal proof. Let $i$ be a state (recurrent or nonrecurrent). Let us show that the probability of return to that state at least $r$ times is $(f_{ii})^r$.

For $r = 1$ this follows from the definition of $f_{ii}$. Suppose that the proposition has been proved for $r = m - 1$. Then, using the strong Markov property and (16), we have

$P_i$(number of returns to $i$ is at least $m$)

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k, \text{ and the number of returns to } i \text{ after time } k \text{ is at least } m - 1)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k)\, P_i(\text{at least } m - 1 \text{ of the values } X_{\sigma_i + 1}, X_{\sigma_i + 2}, \ldots \text{ equal } i \mid \sigma_i = k)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k)\, P_i(\text{at least } m - 1 \text{ of the values } X_1, X_2, \ldots \text{ equal } i)$$

$$= \sum_{k=1}^{\infty} f_{ii}^{(k)} (f_{ii})^{m-1} = (f_{ii})^m.$$

Hence it follows, in particular, that formula (17) holds for a recurrent state $i$. If the state is nonrecurrent, then

$$P_i\{X_n = i \text{ for infinitely many } n\} = 0. \qquad (18)$$
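The geometric law just proved, that the probability of at least $r$ returns equals $(f_{ii})^r$, is easy to test by simulation. For the walk on the integers that moves right with probability $p$ and left with probability $q = 1 - p$, $p > q$, the classical walk computations give $f_{00} = 2q$; we use that value here as an assumption in the following sketch (our own illustration, not part of the text):

```python
import random

def num_returns(p, steps, rng):
    """Number of returns to 0 of the biased walk within `steps` steps."""
    x, r = 0, 0
    for _ in range(steps):
        x += 1 if rng.random() < p else -1
        if x == 0:
            r += 1
    return r

rng = random.Random(1)
p = 0.7
f00 = 2 * (1 - p)                  # = 0.6 for this transient walk (assumed)
trials = 20000
est = sum(num_returns(p, 300, rng) >= 2 for _ in range(trials)) / trials
print(est, f00 ** 2)               # both close to 0.36
```

Truncating at 300 steps is harmless here, since for a transient walk late returns are overwhelmingly unlikely.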

We now turn to the proof that $\tilde f_{ij} = 1$. Since the state $i$ is recurrent, we have, by (17) and the strong Markov property,

$$1 = P_i(X_n = i \text{ for infinitely many } n)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_j = k, \text{ infinitely many values of } X_{\sigma_j + 1}, X_{\sigma_j + 2}, \ldots \text{ equal } i) + P_i(\sigma_j = \infty, X_n = i \text{ for infinitely many } n)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_j = k)\, P_j(\text{infinitely many values of } X_1, X_2, \ldots \text{ equal } i) + P_i(\sigma_j = \infty)$$

$$= \sum_{k=1}^{\infty} f_{ij}^{(k)}\, \tilde f_{ji} + (1 - f_{ij}) = f_{ij}\, \tilde f_{ji} + (1 - f_{ij}).$$

Thus

$$1 = f_{ij}\, \tilde f_{ji} + (1 - f_{ij}),$$

that is, $f_{ij}(1 - \tilde f_{ji}) = 0$.


Since $i \leftrightarrow j$, we have $f_{ij} > 0$, and consequently $\tilde f_{ji} = 1$ and $f_{ji} = 1$; interchanging the roles of $i$ and $j$ (both states are recurrent and communicate), we obtain in the same way $\tilde f_{ij} = 1$ and $f_{ij} = 1$. Therefore if we assume (13), it follows from (14) and the equation $f_{ij} = 1$ that, for communicating states $i$ and $j$,

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty.$$

As for (13), its validity follows from the theorem on dominated convergence, together with the remarks that

$$p_{jj}^{(n-k)} \to \frac{1}{\mu_j}, \qquad n \to \infty,$$

and $\sum_{k=1}^{\infty} f_{ij}^{(k)} = f_{ij} \le 1$.

This completes the proof of the lemma.

Next we consider periodic states.

Lemma 4. Let $j$ be a recurrent state and let $d(j) > 1$.

(a) If $i$ and $j$ belong to the same class (of states), with $i$ in the cyclic subclass $C_r$ and $j$ in $C_{r+a}$, then

$$p_{ij}^{(nd+a)} \to \frac{d}{\mu_j}. \qquad (19)$$

(b) For an arbitrary $i$,

$$p_{ij}^{(nd+a)} \to \left[\sum_{r=0}^{\infty} f_{ij}^{(rd+a)}\right] \frac{d}{\mu_j}, \qquad a = 0, 1, \ldots, d - 1. \qquad (20)$$

PROOF. (a) First let $a = 0$. With respect to the transition matrix $\mathbb{P}^d$, the state $j$ is recurrent and aperiodic. Consequently, by (8),

$$p_{ij}^{(nd)} \to \frac{1}{\sum_{k=1}^{\infty} k f_{jj}^{(kd)}} = \frac{d}{\sum_{k=1}^{\infty} kd\, f_{jj}^{(kd)}} = \frac{d}{\mu_j}.$$

Suppose that (19) has been proved for $a = r$. Then, for a state $j$ lying $r + 1$ subclasses ahead of the subclass of $i$,

$$p_{ij}^{(nd+r+1)} = \sum_{\alpha} p_{i\alpha}\, p_{\alpha j}^{(nd+r)} \to \frac{d}{\mu_j},$$

since $p_{i\alpha} > 0$ only for $\alpha$ in the subclass following that of $i$, for each such $\alpha$ we have $p_{\alpha j}^{(nd+r)} \to d/\mu_j$ by the induction hypothesis, and $\sum_{\alpha} p_{i\alpha} = 1$ (dominated convergence).

(b) Clearly

$$p_{ij}^{(nd+a)} = \sum_{k=1}^{nd+a} f_{ij}^{(k)}\, p_{jj}^{(nd+a-k)}.$$

State $j$ has period $d$, and therefore $p_{jj}^{(nd+a-k)} = 0$ except when $k - a$ has the form $rd$. Therefore

$$p_{ij}^{(nd+a)} = \sum_{r=0}^{n} f_{ij}^{(rd+a)}\, p_{jj}^{((n-r)d)},$$

and the required result (20) follows from (19). This completes the proof of the lemma.


Lemmas 2–4 imply, in particular, the following result about the limits of $p_{ij}^{(n)}$.

Theorem 1. Let a Markov chain be indecomposable (that is, let its states form a single class of essential communicating states) and aperiodic. Then:

(a) if all states are either null or nonrecurrent, then, for all $i$ and $j$,

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty; \qquad (21)$$

(b) if all states $j$ are positive, then, for all $i$,

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad n \to \infty. \qquad (22)$$

4. Let us discuss the conclusion of this theorem in the case of a Markov chain with a finite number of states, $E = \{1, 2, \ldots, r\}$. Let us suppose that the chain is indecomposable and aperiodic. It turns out that it is then automatically recurrent and positive:

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Rightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \qquad (23)$$

For the proof, suppose that all states are nonrecurrent. Then, by (21) and the finiteness of the set of states of the chain,

$$1 = \lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \lim_n p_{ij}^{(n)} = 0. \qquad (24)$$

The resulting contradiction shows that not all states can be nonrecurrent. Let $i_0$ be a recurrent state and $j$ an arbitrary state. Since $i_0 \leftrightarrow j$, Lemma 1 shows that $j$ is also recurrent. Thus all states of an aperiodic indecomposable chain are recurrent.

Let us now show that all recurrent states are positive. If we suppose that they are all null states, we again obtain a contradiction with (24). Consequently there is at least one positive state, say $i_0$. Let $i$ be any other state. Since $i \leftrightarrow i_0$, there are $s$ and $t$ such that $p_{i i_0}^{(s)} > 0$ and $p_{i_0 i}^{(t)} > 0$, and therefore

$$p_{ii}^{(n+s+t)} \ge p_{i i_0}^{(s)}\, p_{i_0 i_0}^{(n)}\, p_{i_0 i}^{(t)} \to p_{i i_0}^{(s)} \cdot \frac{1}{\mu_{i_0}} \cdot p_{i_0 i}^{(t)} > 0. \qquad (25)$$

Hence there is a positive $\varepsilon$ such that $p_{ii}^{(n)} \ge \varepsilon > 0$ for all sufficiently large $n$. But $p_{ii}^{(n)} \to 1/\mu_i$, and therefore $1/\mu_i > 0$, that is, the state $i$ is positive. Consequently (23) is established.

Let $\pi_j = 1/\mu_j$. Then $\pi_j > 0$ by (22), and since

$$1 = \lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \pi_j,$$


the (aperiodic indecomposable) chain is ergodic. Clearly, for every ergodic finite chain,

there is an $n_0$ such that $\min_{i,j} p_{ij}^{(n)} > 0$ for all $n \ge n_0$. $\qquad$ (26)

It was shown in §12 of Chapter I that the converse is also valid: (26) implies ergodicity. Consequently we have the following implications:

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Leftrightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \Rightarrow \text{ergodicity} \Leftrightarrow (26).$$

However, we can prove more.

Theorem 2. For a finite Markov chain,

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Leftrightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \Leftrightarrow \text{ergodicity} \Leftrightarrow (26).$$

PROOF. We have only to establish

$$\text{ergodicity} \Rightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix}.$$

Indecomposability follows from (26). As for aperiodicity, recurrence, and positivity, they are valid in more general situations (the existence of a limiting distribution is sufficient), as will be shown in Theorem 2, §4.
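Condition (26) can be checked mechanically by examining powers of the transition matrix until every entry is positive. A small sketch (the three-state chain below is our own example of an indecomposable aperiodic chain, not from the text):

```python
def min_entry_power(P, n):
    """Minimum entry of P^n; (26) asks this to be > 0 for all large n."""
    m = len(P)
    M = [row[:] for row in P]
    for _ in range(n - 1):
        M = [[sum(M[i][k] * P[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return min(min(row) for row in M)

# cycles of lengths 2 and 3 coexist, so the chain is aperiodic
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
print(min_entry_power(P, 1), min_entry_power(P, 8))  # 0.0, then positive
```

For such a primitive matrix, some finite power already has all entries positive, which is exactly condition (26).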

5. PROBLEMS

1. Consider an indecomposable chain with states $0, 1, 2, \ldots$. A necessary and sufficient condition for it to be nonrecurrent is that the system of equations $u_i = \sum_j p_{ij} u_j$, $i \ne 0$, has a bounded solution with $u_i \not\equiv c$, $i = 0, 1, \ldots$.

2. A sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent is that there is a sequence $(u_0, u_1, \ldots)$ with $u_i \to \infty$, $i \to \infty$, such that $\sum_j p_{ij} u_j \le u_i$ for all $i \ne 0$.

3. A necessary and sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent and positive is that the system of equations $u_j = \sum_i u_i p_{ij}$, $j = 0, 1, \ldots$, has a solution, not identically zero, such that $\sum_i |u_i| < \infty$.


4. Consider a Markov chain with states $0, 1, \ldots$ and transition probabilities $p_{01} = p_0 > 0$,

$$p_{ij} = \begin{cases} p_i > 0, & j = i + 1, \\ r_i \ge 0, & j = i, \\ q_i > 0, & j = i - 1, \\ 0 & \text{otherwise}. \end{cases}$$

Let $\rho_0 = 1$, $\rho_m = (q_1 \cdots q_m)/(p_1 \cdots p_m)$. Prove the following propositions:

the chain is recurrent $\Leftrightarrow \sum \rho_m = \infty$;
the chain is nonrecurrent $\Leftrightarrow \sum \rho_m < \infty$;
the chain is positive $\Leftrightarrow \sum \dfrac{1}{\rho_m p_m} < \infty$;
the chain is null $\Leftrightarrow \sum \rho_m = \infty$ and $\sum \dfrac{1}{\rho_m p_m} = \infty$.

5. Show that $f_{ik} \ge f_{ij} f_{jk}$ and $\sup_n p_{ik}^{(n)} \le f_{ik} \le \sum_n p_{ik}^{(n)}$.

6. Show that for every Markov chain with countably many states the limit of $p_{ij}^{(n)}$ always exists in the Cesàro sense:

$$\lim_n \frac{1}{n} \sum_{k=1}^{n} p_{ij}^{(k)} = \frac{f_{ij}}{\mu_j}.$$

7. Consider a Markov chain $\xi_0, \xi_1, \ldots$ with $\xi_{k+1} = (\xi_k - 1)^+ + \eta_{k+1}$, where $\eta_1, \eta_2, \ldots$ is a sequence of independent identically distributed random variables with $P(\eta_k = j) = p_j$, $j = 0, 1, \ldots$. Write out the transition matrix and show that if $p_0 > 0$, $p_0 + p_1 < 1$, the chain is recurrent if and only if $\sum_k k p_k \le 1$.
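Problem 6 can be illustrated numerically: even when $p_{jj}^{(n)}$ oscillates, as for a periodic chain, the Cesàro averages converge to $f_{jj}/\mu_j$. A sketch on the deterministic two-state flip chain (our own example, for which $f_{00} = 1$ and $\mu_0 = 2$):

```python
def cesaro_average(P, j, n=10000):
    """(1/n) * sum_{k=1}^n p_jj^(k), computed by propagating a row vector."""
    m = len(P)
    row = [1.0 if i == j else 0.0 for i in range(m)]
    acc = 0.0
    for _ in range(n):
        row = [sum(row[i] * P[i][r] for i in range(m)) for r in range(m)]
        acc += row[j]
    return acc / n

P = [[0.0, 1.0], [1.0, 0.0]]   # flip chain: p_00^(n) is 0, 1, 0, 1, ...
print(cesaro_average(P, 0))    # ≈ 1/2 = f_00 / mu_0
```

The ordinary limit of $p_{00}^{(n)}$ does not exist here, but the Cesàro limit does, in accordance with Problem 6.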

§4. On the Existence of Limits and of Stationary Distributions

1. We begin with some necessary conditions for the existence of stationary distributions.

Theorem 1. Let a Markov chain with countably many states $E = \{1, 2, \ldots\}$ and transition matrix $\mathbb{P} = \|p_{ij}\|$ be such that the limits

$$\lim_n p_{ij}^{(n)} = \pi_j$$

exist for all $i$ and $j$ and do not depend on $i$.


Then

(a) $\sum_j \pi_j \le 1$, $\sum_i \pi_i p_{ij} = \pi_j$;
(b) either all $\pi_j = 0$, or $\sum_j \pi_j = 1$;
(c) if all $\pi_j = 0$, there is no stationary distribution; if $\sum_j \pi_j = 1$, then $\Pi = (\pi_1, \pi_2, \ldots)$ is the unique stationary distribution.

PROOF. By Fatou's lemma,

$$\sum_j \pi_j = \sum_j \lim_n p_{ij}^{(n)} \le \liminf_n \sum_j p_{ij}^{(n)} = 1.$$

Moreover,

$$\sum_i \pi_i p_{ij} = \sum_i \left(\lim_n p_{ki}^{(n)}\right) p_{ij} \le \liminf_n \sum_i p_{ki}^{(n)} p_{ij} = \liminf_n p_{kj}^{(n+1)} = \pi_j,$$

that is, for each $j$,

$$\sum_i \pi_i p_{ij} \le \pi_j.$$

Suppose that $\sum_i \pi_i p_{ij_0} < \pi_{j_0}$ for some $j_0$. Then

$$\sum_j \pi_j > \sum_j \left(\sum_i \pi_i p_{ij}\right) = \sum_i \pi_i \sum_j p_{ij} = \sum_i \pi_i.$$

This contradiction shows that

$$\sum_i \pi_i p_{ij} = \pi_j \qquad (1)$$

for all $j$.

It follows from (1), by iteration, that

$$\pi_j = \sum_i \pi_i p_{ij}^{(n)}, \qquad n \ge 1.$$

Therefore, letting $n \to \infty$ and using the theorem on dominated convergence, we obtain

$$\pi_j = \sum_i \pi_i \lim_n p_{ij}^{(n)} = \left(\sum_i \pi_i\right)\pi_j,$$

that is, for all $j$,

$$\pi_j = \pi_j \sum_i \pi_i,$$

from which (b) follows.


Now let $Q = (q_1, q_2, \ldots)$ be a stationary distribution. Since $\sum_i q_i p_{ij}^{(n)} = q_j$, letting $n \to \infty$ we obtain $\sum_i q_i \pi_j = q_j$, that is, $\pi_j = q_j$ for all $j$, and therefore this stationary distribution must coincide with $\Pi = (\pi_1, \pi_2, \ldots)$. Therefore if all $\pi_j = 0$, there is no stationary distribution; if, however, $\sum_j \pi_j = 1$, then $\Pi = (\pi_1, \pi_2, \ldots)$ is the unique stationary distribution. This completes the proof of the theorem.

Let us state and prove a fundamental result on the existence of a unique stationary distribution.

Theorem 2. For Markov chains with countably many states, there is a unique stationary distribution if and only if the set of states contains precisely one positive recurrent class (of essential communicating states).

PROOF. Let $N$ be the number of positive recurrent classes. Suppose $N = 0$. Then all states are either nonrecurrent or recurrent null states, and by (3.10) and (3.20), $\lim_n p_{ij}^{(n)} = 0$ for all $i$ and $j$. Consequently, by Theorem 1, there is no stationary distribution.

Let $N = 1$ and let $C$ be the unique positive recurrent class. If $d(C) = 1$ we have, by (3.8),

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad i, j \in C.$$

If $j \notin C$, then $j$ is nonrecurrent, and $p_{ij}^{(n)} \to 0$ for all $i$ as $n \to \infty$, by (3.7). Put

$$q_j = \begin{cases} \dfrac{1}{\mu_j} > 0, & j \in C, \\ 0, & j \notin C. \end{cases}$$

Then, by Theorem 1, the set $Q = (q_1, q_2, \ldots)$ is the unique stationary distribution.

Now let $d = d(C) > 1$, and let $C_0, \ldots, C_{d-1}$ be the cyclic subclasses. With respect to $\mathbb{P}^d$, each subclass $C_k$ is a recurrent aperiodic class. Then if $i$ and $j \in C_k$ we have

$$p_{ij}^{(nd)} \to \frac{d}{\mu_j} > 0$$

by (3.19). Therefore on each set $C_k$ the numbers $d/\mu_j$, $j \in C_k$, form (with respect to $\mathbb{P}^d$) the unique stationary distribution. Hence it follows, in particular, that $\sum_{j \in C_k} (d/\mu_j) = 1$, that is, $\sum_{j \in C_k} (1/\mu_j) = 1/d$. Let us put

$$q_j = \begin{cases} \dfrac{1}{\mu_j}, & j \in C = C_0 + \cdots + C_{d-1}, \\ 0, & j \notin C, \end{cases}$$


and show that, for the original chain, the set $Q = (q_1, q_2, \ldots)$ is the unique stationary distribution. In fact, for $i \in C$,

$$p_{ii}^{(nd)} = \sum_{j \in C} p_{ij}^{(nd-1)}\, p_{ji}.$$

Then, by Fatou's lemma,

$$\frac{d}{\mu_i} = \lim_n p_{ii}^{(nd)} \ge \sum_{j \in C} \lim_n p_{ij}^{(nd-1)}\, p_{ji} = \sum_{j \in C} \frac{d}{\mu_j}\, p_{ji},$$

and therefore

$$\frac{1}{\mu_i} \ge \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}.$$

But

$$\sum_{i \in C} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \left(\sum_{i \in C_k} \frac{1}{\mu_i}\right) = \sum_{k=0}^{d-1} \frac{1}{d} = 1.$$

As in Theorem 1, it can now be shown that in fact

$$\frac{1}{\mu_i} = \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}.$$

This shows that the set $Q = (q_1, q_2, \ldots)$ is a stationary distribution, which is unique by Theorem 1.

Now let there be $N \ge 2$ positive recurrent classes. Denote them by $C^1, \ldots, C^N$, and let $Q^i = (q_1^i, q_2^i, \ldots)$ be the stationary distribution corresponding to the class $C^i$, constructed according to the formula

$$q_j^i = \begin{cases} \dfrac{1}{\mu_j} > 0, & j \in C^i, \\ 0, & j \notin C^i. \end{cases}$$

Then, for all nonnegative numbers $a_1, \ldots, a_N$ such that $a_1 + \cdots + a_N = 1$, the set $a_1 Q^1 + \cdots + a_N Q^N$ also forms a stationary distribution, since

$$(a_1 Q^1 + \cdots + a_N Q^N)\mathbb{P} = a_1 Q^1 \mathbb{P} + \cdots + a_N Q^N \mathbb{P} = a_1 Q^1 + \cdots + a_N Q^N.$$

Hence it follows that when $N \ge 2$ there is a continuum of stationary distributions. Therefore there is a unique stationary distribution only in the case $N = 1$. This completes the proof of the theorem.

2. The following theorem answers the question of when there is a limit distribution for a Markov chain with a countable set of states $E$.


Theorem 3. A necessary and sufficient condition for the existence of a limit distribution is that there is, in the set $E$ of states of the chain, exactly one aperiodic positive recurrent class $C$ such that $f_{ij} = 1$ for all $j \in C$ and $i \in E$.

PROOF. Necessity. Let $q_j = \lim_n p_{ij}^{(n)}$ and let $Q = (q_1, q_2, \ldots)$ be a distribution ($q_i \ge 0$, $\sum_i q_i = 1$). Then by Theorem 1 this limit distribution is the unique stationary distribution, and therefore by Theorem 2 there is one and only one positive recurrent class $C$. Let us show that this class has period $d = 1$. Suppose the contrary, that is, let $d > 1$, and let $C_0, C_1, \ldots, C_{d-1}$ be the cyclic subclasses. If $i \in C_0$ and $j \in C_1$, then by (19), $p_{ij}^{(nd+1)} \to d/\mu_j$ and $p_{ij}^{(nd)} = 0$ for all $n$. But $d/\mu_j > 0$, and therefore $p_{ij}^{(n)}$ does not have a limit as $n \to \infty$; this contradicts the hypothesis that $\lim_n p_{ij}^{(n)}$ exists.

Now let $j \in C$ and $i \in E$. Then, by (3.11), $p_{ij}^{(n)} \to f_{ij}/\mu_j$. Consequently $\pi_j = f_{ij}/\mu_j$. But $\pi_j$ is independent of $i$. Therefore $f_{ij} = f_{jj} = 1$.

Sufficiency. By (3.11), (3.10) and (3.7),

$$p_{ij}^{(n)} \to \begin{cases} \dfrac{f_{ij}}{\mu_j}, & j \in C,\ i \in E, \\ 0, & j \notin C,\ i \in E. \end{cases}$$

Therefore if $f_{ij} = 1$ for all $j \in C$ and $i \in E$, then $q_j = \lim_n p_{ij}^{(n)}$ is independent of $i$. The class $C$ is positive, and therefore $q_j > 0$ for $j \in C$. Then, by Theorem 1, we have $\sum_j q_j = 1$, and the set $Q = (q_1, q_2, \ldots)$ is a limit distribution.

3. Let us summarize the results obtained above on the existence of a limit distribution, the uniqueness of a stationary distribution, and ergodicity, for the case of finite chains.

Theorem 4. We have the following implications for finite Markov chains:

(ergodicity) $\overset{\{1\}}{\Longleftrightarrow}$ (the chain is indecomposable, recurrent, positive, with $d = 1$)

$\Downarrow$

(a limit distribution exists) $\overset{\{2\}}{\Longleftrightarrow}$ (there exists exactly one recurrent positive class, with $d = 1$)

$\Downarrow$

(a unique stationary distribution exists) $\overset{\{3\}}{\Longleftrightarrow}$ (there exists exactly one recurrent positive class)

PROOF. The "vertical" implications are evident. {1} is established in Theorem 2, §3; {2} in Theorem 3; {3} in Theorem 2.

4. PROBLEMS

1. Show that, in Example 1 of §5, neither a stationary nor a limit distribution exists.

2. Discuss the question of stationary and limit distributions for the Markov chain with transition matrix

$$\mathbb{P} = \begin{pmatrix} \tfrac12 & 0 & \tfrac12 & 0 \\ 0 & 0 & 0 & 1 \\ \tfrac14 & \tfrac14 & \tfrac12 & 0 \\ 0 & \tfrac12 & \tfrac12 & 0 \end{pmatrix}.$$

3. Let $\mathbb{P} = \|p_{ij}\|$ be a finite doubly stochastic matrix, that is, $\sum_{i=1}^{m} p_{ij} = 1$, $j = 1, \ldots, m$. Show that the stationary distribution of the corresponding Markov chain is the vector $Q = (1/m, \ldots, 1/m)$.
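The claim of Problem 3 is quick to confirm numerically for any particular doubly stochastic matrix; a sketch (the matrix below is our own example, with both rows and columns summing to 1):

```python
def is_stationary(pi, P, tol=1e-10):
    """Check the stationarity equations: sum_i pi_i p_ij = pi_j for all j."""
    n = len(P)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) < tol
               for j in range(n))

# doubly stochastic: every row and every column sums to 1
P = [[0.2, 0.3, 0.5],
     [0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2]]
m = len(P)
print(is_stationary([1.0 / m] * m, P))   # True
```

Since each column sums to 1, the uniform vector satisfies $\sum_i (1/m) p_{ij} = 1/m$ exactly, which is what the check verifies.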

§5. Examples

1. We present a number of examples to illustrate the concepts introduced above and the results on the classification and limit behavior of transition probabilities.

EXAMPLE 1. A simple random walk is a Markov chain in which, at each step, a particle passes to one of its two neighboring states with certain probabilities. The simple random walk considered here describes the motion of a particle among the states $E = \{0, \pm 1, \ldots\}$, with transitions one unit to the right with probability $p$ and one unit to the left with probability $q$. It is clear that the transition probabilities are

$$p_{ij} = \begin{cases} p, & j = i + 1, \\ q, & j = i - 1, \\ 0 & \text{otherwise}, \end{cases} \qquad p + q = 1.$$

If $p = 0$, the particle moves deterministically to the left; if $p = 1$, to the right. These cases are of little interest, since all states are then inessential. We therefore assume that $0 < p < 1$. Under this assumption the states of the chain form a single class (of essential communicating states). A particle can return to each state only after $2, 4, 6, \ldots$ steps; hence the chain has period $d = 2$.

Since, for each $i \in E$,

$$p_{ii}^{(2n)} = C_{2n}^n (pq)^n = \frac{(2n)!}{(n!)^2}(pq)^n,$$

then by Stirling's formula (which says $n! \sim \sqrt{2\pi n}\, n^n e^{-n}$) we have

$$p_{ii}^{(2n)} \sim \frac{(4pq)^n}{\sqrt{\pi n}}.$$

Therefore $\sum_n p_{ii}^{(2n)} = \infty$ if $p = q$, and $\sum_n p_{ii}^{(2n)} < \infty$ if $p \ne q$. In other words, the chain is recurrent if $p = q$, but if $p \ne q$ it is nonrecurrent.

It was shown in §10, Chapter I, that $f_{ii}^{(2n)} \sim 1/(2\sqrt{\pi}\, n^{3/2})$, $n \to \infty$, if $p = q = \frac12$. Therefore $\mu_i = \sum_n (2n) f_{ii}^{(2n)} = \infty$, that is, all recurrent states are null states. Hence, by Theorem 1 of §3, $p_{ij}^{(n)} \to 0$ as $n \to \infty$ for all $i$ and $j$. There are no stationary, limit, or ergodic distributions.
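The dichotomy between $p = q$ and $p \ne q$ is visible in the partial sums of $p_{ii}^{(2n)}$. A sketch using log-gamma to avoid overflow (our own numerical illustration, not part of the text):

```python
from math import lgamma, exp, log

def p00_2n(n, p):
    """p_00^(2n) = C(2n, n) (pq)^n, evaluated in logarithms."""
    log_binom = lgamma(2 * n + 1) - 2 * lgamma(n + 1)
    return exp(log_binom + n * log(p * (1 - p)))

def partial_sum(p, N):
    return sum(p00_2n(n, p) for n in range(1, N + 1))

print(partial_sum(0.5, 10000))   # ≈ 2*sqrt(N/pi): the series diverges
print(partial_sum(0.4, 10000))   # stays bounded: the series converges
```

For $p = q = \tfrac12$ the terms behave like $1/\sqrt{\pi n}$ and the sums grow like $2\sqrt{N/\pi}$, while for $p \ne q$ the factor $(4pq)^n < 1$ forces convergence.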

EXAMPLE 2. Consider a simple random walk with $E = \{0, 1, 2, \ldots\}$, where 0 is an absorbing barrier.

State 0 forms a unique positive recurrent class with $d = 1$. All other states are nonrecurrent. Therefore, by Theorem 2 of §4, there is a unique stationary distribution

$$\Pi = (\pi_0, \pi_1, \pi_2, \ldots)$$

with $\pi_0 = 1$ and $\pi_i = 0$, $i \ge 1$.

Let us now consider the question of limit distributions. Clearly $p_{00}^{(n)} = 1$ and $p_{ij}^{(n)} \to 0$, $j \ge 1$, $i \ge 0$. Let us now show that, for $i \ge 1$, the numbers $\alpha(i) = \lim_n p_{i0}^{(n)}$ are given by the formulas

$$\alpha(i) = \begin{cases} \left(\dfrac{q}{p}\right)^i, & p > q, \\ 1, & p \le q. \end{cases} \qquad (1)$$

We begin by observing that, since state 0 is absorbing, we have $p_{i0}^{(n)} = f_{i0}^{(1)} + \cdots + f_{i0}^{(n)}$, and consequently $\alpha(i) = f_{i0}$, that is, $\alpha(i)$ is the probability that a particle starting from state $i$ sooner or later reaches the null


state. By the method of §12, Chapter I (see also §2 of Chapter VII) we can obtain the recursion relation

$$\alpha(i) = p\,\alpha(i + 1) + q\,\alpha(i - 1), \qquad (2)$$

with $\alpha(0) = 1$. The general solution of this equation has the form

$$\alpha(i) = a + b\,(q/p)^i, \qquad (3)$$

and the condition $\alpha(0) = 1$ imposes the condition $a + b = 1$. If we suppose that $q > p$, then since $\alpha(i)$ is bounded we see at once that $b = 0$, and therefore $\alpha(i) = 1$. This is quite natural, since when $q > p$ the particle tends to move toward the null state. If, on the other hand, $p > q$, the opposite is true: the particle tends to move to the right, and so it is natural to expect that

$$\alpha(i) \to 0, \qquad i \to \infty, \qquad (4)$$

and consequently $a = 0$ and

$$\alpha(i) = (q/p)^i. \qquad (5)$$

To establish this equation we shall not start from (4), but proceed differently. In addition to the absorbing barrier at 0 we introduce an absorbing barrier at the integer point $N$. Let us denote by $\alpha_N(i)$ the probability that a particle that starts at $i$ reaches the zero state before reaching $N$. Then $\alpha_N(i)$ satisfies (2) with the boundary conditions $\alpha_N(0) = 1$, $\alpha_N(N) = 0$, and, as we have already shown in §9, Chapter I,

$$\alpha_N(i) = \frac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, \qquad 0 \le i \le N. \qquad (6)$$

Hence

$$\lim_N \alpha_N(i) = \left(\frac{q}{p}\right)^i,$$

and consequently to prove (5) we have only to show that

$$\alpha(i) = \lim_N \alpha_N(i). \qquad (7)$$
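The convergence in (7) is already visible numerically for moderate $N$; a sketch evaluating formula (6) (our own illustration, for the case $p > q$):

```python
def alpha_N(i, N, p):
    """Formula (6): probability of hitting 0 before N, for p != q."""
    r = (1 - p) / p                 # = q/p
    return (r ** i - r ** N) / (1 - r ** N)

p = 0.6                             # p > q, so absorption at 0 is uncertain
for N in (5, 10, 20, 40):
    print(N, alpha_N(3, N, p))      # increases towards (q/p)^3 ≈ 0.2963
```

As the second barrier $N$ recedes, $\alpha_N(i)$ increases monotonically to the limit $(q/p)^i$ of formula (5).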

This is intuitively clear; a formal proof can be given as follows. Let us suppose that the particle starts from a given state $i$. Then

$$\alpha(i) = P_i(A), \qquad (8)$$

where $A$ is the event that there is an $N$ such that a particle starting from $i$ reaches the zero state before reaching state $N$. If

$$A_N = \{\text{the particle reaches 0 before } N\},$$

then $A = \bigcup_{N=i+1}^{\infty} A_N$. It is clear that $A_N \subseteq A_{N+1}$ and

$$P_i\left(\bigcup_{N=i+1}^{\infty} A_N\right) = \lim_{N \to \infty} P_i(A_N). \qquad (9)$$

But $\alpha_N(i) = P_i(A_N)$, so that (7) follows directly from (8) and (9).

Thus if $p > q$ the limit $\lim_n p_{i0}^{(n)}$ depends on $i$, and consequently there is no limit distribution in this case. If, however, $p \le q$, then in all cases $\lim_n p_{i0}^{(n)} = 1$ and $\lim_n p_{ij}^{(n)} = 0$, $j \ge 1$. Therefore in this case the limit distribution has the form $\Pi = (1, 0, 0, \ldots)$.

EXAMPLE 3.

Consider a simple random walk with absorbing barriers at 0 and $N$.

Here there are two positive recurrent classes, $\{0\}$ and $\{N\}$. All other states $\{1, \ldots, N - 1\}$ are nonrecurrent. It follows from Theorem 1, §3, that there are infinitely many stationary distributions $\Pi = (\pi_0, \pi_1, \ldots, \pi_N)$ with $\pi_0 = a$, $\pi_N = b$, $\pi_1 = \cdots = \pi_{N-1} = 0$, where $a \ge 0$, $b \ge 0$, $a + b = 1$. From Theorem 4, §4, it also follows that there is no limit distribution. This is also a consequence of the equations (Subsection 2, §9, Chapter I)

$$\lim_{n \to \infty} p_{i0}^{(n)} = \begin{cases} \dfrac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, & p \ne q, \\[2mm] 1 - \dfrac{i}{N}, & p = q, \end{cases} \qquad (10)$$

$$\lim_n p_{iN}^{(n)} = 1 - \lim_n p_{i0}^{(n)}, \qquad \text{and} \qquad \lim_n p_{ij}^{(n)} = 0, \quad 1 \le j \le N - 1.$$

EXAMPLE

4. Consider a simple random walk with $E = \{0, 1, \ldots\}$ and a reflecting barrier at 0.

It is easy to see that the chain is periodic with period $d = 2$. Suppose that $p > q$ (the moving particle tends to move to the right). Let $i > 1$; to determine the probability $f_{i1}$ we may use formula (1), from which it follows that

$$f_{i1} = \left(\frac{q}{p}\right)^{i-1} < 1, \qquad i > 1.$$

All states of this chain communicate with each other. Therefore if state $i$ is recurrent, state 1 will also be recurrent. But (see the proof of Lemma 3 in §3) in that case $f_{i1}$ must be 1. Consequently when $p > q$ all the states of the chain are nonrecurrent. Therefore $p_{ij}^{(n)} \to 0$, $n \to \infty$, for all $i$ and $j \in E$, and there is neither a limit distribution nor a stationary distribution.

Now let $p \le q$. Then, by (1), $f_{i1} = 1$ for $i > 1$, and $f_{11} = q + p f_{21} = 1$. Hence the chain is recurrent. Consider the system of equations determining the stationary distribution $\Pi = (\pi_0, \pi_1, \ldots)$:

$$\pi_0 = \pi_1 q, \qquad \pi_1 = \pi_0 + \pi_2 q, \qquad \pi_2 = \pi_1 p + \pi_3 q, \qquad \ldots,$$

that is,

$$\pi_1 = \pi_1 q + \pi_2 q, \qquad \pi_2 = \pi_2 q + \pi_3 q, \qquad \ldots,$$

whence

$$\pi_j = \left(\frac{p}{q}\right)\pi_{j-1}, \qquad j = 2, 3, \ldots.$$

If $p = q$ we have $\pi_1 = \pi_2 = \cdots$, and consequently the condition $\sum_j \pi_j = 1$ cannot be satisfied. In other words, if $p = q$ there is no stationary distribution, and therefore no limit distribution. From this and Theorem 3, §4, it follows, in particular, that in this case all states of the chain are null states.

It remains to consider the case $p < q$. From the condition $\sum_{j=0}^{\infty} \pi_j = 1$ we find that

$$\pi_1 = \frac{q - p}{2q}$$

and

$$\pi_j = \frac{q - p}{2q}\left(\frac{p}{q}\right)^{j-1}, \qquad j \ge 2.$$

Therefore the distribution $\Pi$ is the unique stationary distribution. Hence when $p < q$ the chain is recurrent and positive (Theorem 2, §4). The distribution $\Pi$ is also a limit distribution and is ergodic.

EXAMPLE 5. Again consider a simple random walk, now with reflecting barriers at 0 and $N$ ($0 < p < 1$).
All the states of the chain are periodic with period $d = 2$, recurrent, and positive. According to Theorem 4 of §4, the chain is ergodic. Solving the system $\pi_j = \sum_{i=0}^{N} \pi_i p_{ij}$ subject to $\sum_{i=0}^{N} \pi_i = 1$, we obtain the ergodic distribution $\Pi = (\pi_0, \pi_1, \ldots, \pi_N)$, in which $\pi_j$ is proportional to $(p/q)^{j-1}$ for $2 \le j \le N - 1$.
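The ergodic distribution of Example 5 can also be found numerically. The sketch below assumes the simplest reading of the reflecting barriers (from 0 the particle moves to 1 with probability 1, and from $N$ to $N - 1$ with probability 1; an assumption on our part, since the diagram is not reproduced here), and computes the stationary vector as a Cesàro average, which converges despite the period $d = 2$:

```python
def reflecting_P(N, p):
    """Walk on {0, ..., N}, reflected with probability 1 at both barriers
    (our assumed boundary behavior)."""
    q = 1 - p
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    P[0][1] = 1.0
    P[N][N - 1] = 1.0
    for i in range(1, N):
        P[i][i + 1] = p
        P[i][i - 1] = q
    return P

def stationary(P, iters=4000):
    """Cesaro average of the distributions at times 1..iters."""
    n = len(P)
    pi = [1.0 / n] * n
    acc = [0.0] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        acc = [a + x for a, x in zip(acc, pi)]
    return [a / iters for a in acc]

P = reflecting_P(6, 0.4)
pi = stationary(P)
print(round(sum(pi), 6))    # 1.0: a probability vector solving pi P = pi
```

The Cesàro averaging step is exactly the device of Problem 6, §3: the ordinary iterates oscillate between the two cyclic subclasses, but their averages converge.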

2. EXAMPLE 6. It follows from Example 1 that the simple random walk considered there, on the integral points of the line, is recurrent if $p = q$ but nonrecurrent if $p \ne q$. Now let us consider simple random walks in the plane and in space from the point of view of recurrence or nonrecurrence. For the plane, we suppose that a particle in any state $(i, j)$ moves up, down, to the right, or to the left, with probability $\frac14$ for each (Figure 41).

Figure 41. A walk in the plane.


For definiteness, consider the state (0, 0). Then the probability Pk = p~~. o).(O, o> of going from (0, 0) to (0, 0) ink steps is given by p2n+l

p2n

= 0, =

n = 0, 1, 2, ... ,

._,

L,

{(i,j):i+j=n,OS:iS:n}

(2n)! 1 2n ~(4) ' l.l.J.J.

n = 1, 2, ....

Multiplying numerators and denominators by (n!) 2 , we obtain n

p 2n = (1.)2" C"2n ._, C;n C"i = (l)Z"(C" )z 4 f..., n 4 2n ' i=O

smce n

._, C; c•-i L.. n n

i=O

= C"2n •

Applying Stirling's formula, we find that

Pzn

~

1 nn

-,

and therefore I P 2 • = oo. Consequently the state (0, 0) (likewise any other) is recurrent. It turns out, however, that in three or more dimensions the symmetric random walk is nonrecurrent. Let us prove this for walks on the integral points (i,j, k) in space. Let us suppose that a particle moves from (i, j, k) by one unit along a coordinate direction, with probability i for each.


Then if Pk is the probability of going from (0, 0, 0) to (0, 0, 0) ink steps, we have

Pzn+ 1 2n -

2

-

'1)2 ( l.

L..

{(i,j):Oo>i+ j,;n, O,;i,;n, O,;j,;n)

-- -12n c 2n <

n

0,

"

_

p

=

=

0, 1, ... ,

1 2n (2n)! (6) (} '1)2(( . n- l· - }') 1)2 .

I

[

_1_ cn _1_cn 2n 3" 22n

·1 '1

!..} ·

{(i,j):Oo>i+ j,;n, O,;i,;n, O,;j,;n)

"





(n -

l -

. 1 '1 (

L..

{(i,j):Oo;i+ j,;n, O,;i,;n, O,;j,;n) l.J ·

n

]2 C1_)2n 3

·1 J) · n!

n-

1_ n

') 1 (3)

.

l-} .

(11)

1, 2, ... ,

=

where

en=

max

{(i,j):Oo>i+jo>n, O,;i,;n, O,;j,;n)

[

J

n! . i!j! (n- i- j)!

(12)

Let us show that when n is large, the max in (12) is attained for i ~ n/3, j ~ n/3. Let i0 and j 0 be the values at which the max is attained. Then the following inequalities are evident:

n! n! ---<----------------~ ~----~~-jo! (i 0 - 1)! (n- jo- io + 1)!- jo! io! (n- jo- io)!' n!

n!

1)! -jo- io +-~~ Uo --l)!io!(n -jo- io- 1)!+ l)!(n ---------~--------~<-----------------

j 0 !(io

n! - j 0 - i 0 -1)!' Uo + 1)!i0 !(n -<---------~ ---------whence

n - i0

-

1 ::;; 2j 0

::;;

n - i0

n - j0

-

1 ::;; 2i 0

::;;

n - j0

+ +

1, 1,

and therefore we have, for large n, i 0 ~ n/3,j 0 ~ n/3, and

cn ~ By .Stirling's formula,

n!

554

VIII. Sequences of Random Variables That Form Markov Chains

and since 00

n~l

3)3

2n312n312 < oo,

we have Ln P zn < oo. Consequently the state (0, 0, 0), and likewise any other state, is nonrecurrent. A similar result holds for dimensions greater than 3. Thus we have the following result (P6lya): Theorem. For R 1 and R 2 , the symmetric random walk is recurrent; for R", n ?: 3, it is nonrecurrent.
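As a numerical companion to the bound just derived (an added sketch, not part of the original text; Python standard library only), one can compute the three-dimensional return probabilities exactly and observe the $n^{-3/2}$ decay that makes $\sum p_{2n}$ converge:

```python
from math import comb, factorial, pi, sqrt

def p2n_space(n):
    # Exact return probability after 2n steps for the symmetric walk on Z^3:
    # p_{2n} = C(2n,n)/2^(2n) * sum over i+j<=n of [n!/(i! j! (n-i-j)!)]^2 / 9^n.
    s = sum((factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))) ** 2
            for i in range(n + 1) for j in range(n - i + 1))
    return comb(2 * n, n) * s / (2 ** (2 * n) * 9 ** n)

# p_2 = 1/6 (six equally likely two-step returns); for each n shown, p_{2n}
# lies below the bound 3*sqrt(3)/(2*(pi*n)^(3/2)), and successive terms
# shrink roughly by the factor 2^(-3/2).
for n in (1, 5, 20):
    print(n, p2n_space(n), 3 * sqrt(3) / (2 * (pi * n) ** 1.5))
```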

3. PROBLEMS

1. Derive the recursion relation (1).

2. Establish (4).

3. Show that in Example 5 all states are aperiodic, recurrent, and positive.

4. Classify the states of a Markov chain with transition matrix

$$\begin{pmatrix} p & q & 0 \\ 0 & 0 & 1 \\ 0 & p & q \end{pmatrix},$$

where $p + q = 1$, $p \ge 0$, $q \ge 0$.

Historical and Bibliographical Notes

Introduction

The history of probability theory up to the time of Laplace is described by Todhunter [T1]. The period from Laplace to the end of the nineteenth century is covered by Gnedenko and Sheinin in [K10]. Maistrov [M1] discusses the history of probability theory from the beginning to the thirties of the present century. There is a brief survey in Gnedenko [G4]. For the origin of much of the terminology of the subject see Aleksandrova [A3].

For the basic concepts see Kolmogorov [K8], Gnedenko [G4], Borovkov [B4], Gnedenko and Khinchin [G6], A. M. and I. M. Yaglom [Y1], Prohorov and Rozanov [P5], Feller [F1, F2], Neyman [N3], Loève [L7], and Doob [D3]. We also mention [M3], which contains a large number of problems on probability theory.

In putting this text together, the author has consulted a wide range of sources. We mention particularly the books by Breiman [B5], Ash [A4, A5], and Ash and Gardner [A6], which (in the author's opinion) contain an excellent selection and presentation of material.

For current work in the field see, for example, Annals of Probability (formerly Annals of Mathematical Statistics) and Theory of Probability and its Applications (translation of Teoriya Veroyatnostei i ee Primeneniya). Mathematical Reviews and Zentralblatt für Mathematik contain abstracts of current papers on probability and mathematical statistics from all over the world.

For tables for use in computations, see [A1].


Chapter I

§1. Concerning the construction of probabilistic models see Kolmogorov [K7] and Gnedenko [G4]. For further material on problems of distributing objects among boxes see, e.g., Kolchin, Sevastyanov and Chistyakov [K3].

§2. For other probabilistic models (in particular, the one-dimensional Ising model) that are used in statistical physics, see Isihara [I3].

§3. Bayes's formula and theorem form the basis for the "Bayesian approach" to mathematical statistics. See, for example, De Groot [D1] and Zacks [Z1].

§4. A variety of problems about random variables and their probabilistic description can be found in Meshalkin [M3].

§5. A combinatorial proof of the law of large numbers (originating with James Bernoulli) is given in, for example, Feller [F1]. For the empirical meaning of the law of large numbers see Kolmogorov [K7].

§6. For sharper forms of the local and integral theorems, and of Poisson's theorem, see Borovkov [B4] and Prohorov [P3].

§7. The examples of Bernoulli schemes illustrate some of the basic concepts and methods of mathematical statistics. For more detailed discussions see, for example, Cramér [C5] and van der Waerden [W1].

§8. Conditional probability and conditional expectation with respect to a partition will help the reader understand the concepts of conditional probability and conditional expectation with respect to σ-algebras, which will be introduced later.

§9. The ruin problem was considered in essentially the present form by Laplace. See Gnedenko and Sheinin [K10]. Feller [F1] contains extensive material from the same circle of ideas.

§10. Our presentation essentially follows Feller [F1]. The method for proving (10) and (11) is taken from Doherty [D2].

§11. Martingale theory is thoroughly covered in Doob [D3]. A different proof of the ballot theorem is given, for instance, in Feller [F1].

§12. There is extensive material on Markov chains in the books by Feller [F1], Dynkin [D4], Kemeny and Snell [K2], Sarymsakov [S1], and Sirazhdinov [S8]. The theory of branching processes is discussed by Sevastyanov [S3].

Chapter II

§1. Kolmogorov's axioms are presented in his book [K8].

§2. Further material on algebras and σ-algebras can be found in, for example, Kolmogorov and Fomin [K9], Neveu [N1], Breiman [B5], and Ash [A5].

§3. For a proof of Carathéodory's theorem see Loève [L7] or Halmos [H1].


§§4-5. More material on measurable functions is available in Halmos [H1].

§6. See also Kolmogorov and Fomin [K9], Halmos [H1], and Ash and Gardner [A6]. The Radon-Nikodym theorem is proved in these books. The inequality

$$P(|\xi| \ge \varepsilon) \le \frac{E\xi^2}{\varepsilon^2}$$

is sometimes called Chebyshev's inequality, and the inequality

$$P(|\xi| \ge \varepsilon) \le \frac{E|\xi|^r}{\varepsilon^r}, \qquad r > 0,$$

is called Markov's inequality. For Pratt's lemma see [P2].

§7. The definitions of conditional probability and conditional expectation with respect to a σ-algebra were given by Kolmogorov [K8]. For additional material see Breiman [B5] and Ash [A5]. The result quoted in the Corollary to Theorem 5 can be found in [M5].

§8. See also Borovkov [B4], Ash [A5], Cramér [C5], and Gnedenko [G4].

§9. Kolmogorov's theorem on the existence of a process with given finite-dimensional distributions is in his book [K8]. For Ionescu-Tulcea's theorem see also Neveu [N1] and Ash [A5]. The proof in the text follows [A5].

§§10-11. See also Kolmogorov and Fomin [K9], Ash [A5], Doob [D3], and Loève [L7].

§12. The theory of characteristic functions is presented in many books. See, for example, Gnedenko [G4], Gnedenko and Kolmogorov [G5], Ramachandran [R1], Lukacs [L8], and Lukacs and Laha [L9]. Our presentation of the connection between moments and semi-invariants follows Leonov and Shiryayev [L4].

§13. See also Ibragimov and Rozanov [I2], Breiman [B5], and Liptser and Shiryayev [L5].

Chapter III

§1. Detailed investigations of problems on weak convergence of probability measures are given in Gnedenko and Kolmogorov [G5] and Billingsley [B3].

§2. Prohorov's theorem appears in his paper [P4].

§3. The monograph [G5] by Gnedenko and Kolmogorov studies the limit theorems of probability theory by the method of characteristic functions. See also Billingsley [B3]. Problem 2 includes both Bernoulli's law of large numbers and Poisson's law of large numbers (which assumes that $\xi_1, \xi_2, \ldots$ are independent and take only two values (1 and 0), but in general are differently distributed: $P(\xi_i = 1) = p_i$, $P(\xi_i = 0) = 1 - p_i$, $i \ge 1$).


§4. Concerning "nonclassical" conditions for the validity of the central limit theorem, see Zolotarev [Z2, Z3] and Rotar [R3]. The statement and proof of Theorem 1 were given by Rotar. The lemma in Subsection 6 is also due to him.

§5. Here the presentation follows Gnedenko and Kolmogorov [G5] and Ash [A5].

§6. The proof follows Sazonov [S2].

§7. See also Valkeila [V1].

Chapter IV

§1. Kolmogorov's zero-or-one law appears in his book [K8]. For the Hewitt-Savage zero-or-one law see also Borovkov [B4], Breiman [B5], and Ash [A5].

§§2-4. Here the fundamental results were obtained by Kolmogorov and Khinchin (see [K8] and the references given there). See also Petrov [P1] and Stout [S9]. For probabilistic methods in number theory see Kubilius [K11].

Chapter V

§§1-3. Our exposition of the theory of (strict sense) stationary random processes is based on Breiman [B5], Sinai [S7], and Lamperti [L2]. The simple proof of the maximal ergodic theorem was given by Garsia [G1].

Chapter VI

§1. The books by Rozanov [R4] and Gihman and Skorohod [G2, G3] are devoted to the theory of (wide sense) stationary random processes. Example 6 was frequently presented in Kolmogorov's lectures.

§2. For orthogonal stochastic measures and stochastic integrals see also Doob [D3], Gihman and Skorohod [G3], Rozanov [R4], and Ash and Gardner [A6].

§3. The spectral representation (2) was obtained by Cramér and Loève (see, for example, [L7]). The same representation (in different language) is contained in Kolmogorov [K5]. Also see Doob [D3], Rozanov [R4], and Ash and Gardner [A6].

§4. There is a detailed exposition of problems of statistical estimation of the covariance function and spectral density in Hannan [H2, H3].

§§5-6. See also Rozanov [R4], Lamperti [L2], and Gihman and Skorohod [G2, G3].

§7. The presentation follows Liptser and Shiryayev [L5].


Chapter VII

§1. Most of the fundamental results of the theory of martingales were obtained by Doob [D3]. Theorem 1 is taken from Meyer [M4]. Also see Meyer and Dellacherie [M5], Liptser and Shiryayev [L5], and Gihman and Skorohod [G3].

§2. Theorem 1 is often called the theorem "on transformation under a system of optional stopping" (Doob [D3]). For the identities (14) and (15) and Wald's fundamental identity see Wald [W2].

§3. Chow and Teicher [C3] contains an illuminating study of the results presented here, including proofs of the inequalities of Khinchin, Marcinkiewicz and Zygmund, Burkholder, and Davis. Theorem 2 was given by Lenglart [L3].

§4. See Doob [D3].

§5. Here we follow Kabanov, Liptser and Shiryayev [K1], Engelbert and Shiryayev [E1], and Neveu [N2]. Theorem 4 and the example were given by Liptser.

§6. This approach to problems of absolute continuity and singularity, and the results given here, can be found in Kabanov, Liptser and Shiryayev [K1]. Theorem 6 was obtained by Kabanov.

§7. Theorems 1 and 2 were given by Novikov [N4]. Lemma 1 is a discrete analog of Girsanov's lemma (see [K1]).

§8. The exposition follows Liptser and Shiryayev [L6].

Chapter VIII

§1. For the basic definitions see Dynkin [D4], Ventzel [V2], Doob [D3], and Gihman and Skorohod [G3]. The existence of regular transition probabilities such that the Kolmogorov-Chapman equation (9) is satisfied for all x ∈ R is proved in [N1] (corollary to Proposition V.2.1) and in [G3] (Volume I, Chapter II, §4). Kuznetsov (see Abstracts of the Twelfth European Meeting of Statisticians, Varna, 1979) has established the validity (which is far from trivial) of a similar result for Markov processes with continuous time and values in universal measurable spaces.

§§2-5. Here the presentation follows Kolmogorov [K4], Borovkov [B4], and Ash [A4].

References†

[A1] M. Abramovitz and I. A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. National Bureau of Standards, Washington, D.C., 1964.
[A2] P. S. Aleksandrov. Einführung in die Mengenlehre und die Theorie der reellen Funktionen. DVW, Berlin, 1956.
[A3] N. V. Aleksandrova. Mathematical Terms [Matematicheskie terminy]. Vysshaia Shkola, Moscow, 1978.
[A4] R. B. Ash. Basic Probability Theory. Wiley, New York, 1970.
[A5] R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972.
[A6] R. B. Ash and M. F. Gardner. Topics in Stochastic Processes. Academic Press, New York, 1975.
[B1] S. N. Bernshtein. Chebyshev's work on the theory of probability (in Russian), in The Scientific Legacy of P. L. Chebyshev [Nauchnoe nasledie P. L. Chebysheva], pp. 43-68, Akademiya Nauk SSSR, Moscow-Leningrad, 1945.
[B2] S. N. Bernshtein. Theory of Probability [Teoriya veroyatnostei], 4th ed. Gostehizdat, Moscow, 1946.
[B3] P. Billingsley. Convergence of Probability Measures. Wiley, New York, 1968.
[B4] A. A. Borovkov. Wahrscheinlichkeitstheorie: eine Einführung. Birkhäuser, Basel-Stuttgart, 1976.
[B5] L. Breiman. Probability. Addison-Wesley, Reading, MA, 1968.
[C1] P. L. Chebyshev. Theory of Probability: Lectures Given in 1879 and 1880 [Teoriya veroyatnostei: Lektsii akad. P. L. Chebysheva chitannye v 1879, 1880 gg.]. Edited by A. N. Krylov from notes by A. N. Lyapunov. Moscow-Leningrad, 1936.
[C2] Y. S. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. Houghton-Mifflin, Boston, 1971.

† Translator's note: References to translations into Russian have been replaced by references to the original works. Russian references have been replaced by their translations whenever I could locate translations; otherwise they are reproduced (with translated titles). Names of journals are abbreviated according to the forms used in Mathematical Reviews (1982 version).
[C3] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer-Verlag, New York, 1978.
[C4] Kai-Lai Chung. Markov Chains with Stationary Transition Probabilities. Springer-Verlag, New York, 1967.
[C5] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ, 1957.
[D1] M. H. De Groot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
[D2] M. Doherty. An amusing proof in fluctuation theory. Lecture Notes in Mathematics, no. 452, 101-104. Springer-Verlag, Berlin, 1975.
[D3] J. L. Doob. Stochastic Processes. Wiley, New York, 1953.
[D4] E. B. Dynkin. Markov Processes. Plenum, New York, 1963.
[E1] H. J. Engelbert and A. N. Shiryayev. On the sets of convergence and generalized submartingales. Stochastics 2 (1979), 155-166.
[F1] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 1, 3rd ed. Wiley, New York, 1968.
[F2] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 2, 2nd ed. Wiley, New York, 1966.
[G1] A. Garsia. A simple proof of E. Hopf's maximal ergodic theorem. J. Math. Mech. 14 (1965), 381-382.
[G2] I. I. Gihman [Gikhman] and A. V. Skorohod [Skorokhod]. Introduction to the Theory of Random Processes. Saunders, Philadelphia, 1969.
[G3] I. I. Gihman and A. V. Skorohod. Theory of Stochastic Processes, 3 vols. Springer-Verlag, New York-Berlin, 1974-1979.
[G4] B. V. Gnedenko. Lehrbuch der Wahrscheinlichkeitstheorie. Akademische Verlagsgesellschaft, Berlin, 1967.
[G5] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge, MA, 1954.
[G6] B. V. Gnedenko and A. Ya. Khinchin. Elementary Introduction to the Theory of Probability. Freeman, San Francisco, 1961.
[H1] P. R. Halmos. Measure Theory. Van Nostrand, New York, 1950.
[H2] E. J. Hannan. Time Series Analysis. Methuen, London, 1960.
[H3] E. J. Hannan. Multiple Time Series. Wiley, New York, 1970.
[I1] I. A. Ibragimov and Yu. V. Linnik. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen, 1971.
[I2] I. A. Ibragimov and Yu. A. Rozanov. Gaussian Random Processes. Springer-Verlag, New York, 1978.
[I3] A. Isihara. Statistical Physics. Academic Press, New York, 1971.
[K1] Yu. M. Kabanov, R. Sh. Liptser, and A. N. Shiryayev. On the question of the absolute continuity and singularity of probability measures. Math. USSR-Sb. 33 (1977), 203-221.
[K2] J. Kemeny and L. J. Snell. Finite Markov Chains. Van Nostrand, Princeton, 1960.
[K3] V. F. Kolchin, B. A. Sevastyanov, and V. P. Chistyakov. Random Allocations. Halsted, New York, 1978.
[K4] A. N. Kolmogorov. Markov chains with countably many states. Byull. Moskov. Univ. 1 (1937), 1-16 (in Russian).
[K5] A. N. Kolmogorov. Stationary sequences in Hilbert space. Byull. Moskov. Univ. Mat. 2 (1941), 1-40 (in Russian).
[K6] A. N. Kolmogorov. The contribution of Russian science to the development of probability theory. Uchen. Zap. Moskov. Univ. 1947, no. 91, 56ff. (in Russian).

[K7] A. N. Kolmogorov. Probability theory (in Russian), in Mathematics: Its Contents, Methods, and Value [Matematika, ee soderzhanie, metody i znachenie]. Akad. Nauk SSSR, vol. 2, 1956.
[K8] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New York, 1956.
[K9] A. N. Kolmogorov and S. V. Fomin. Elements of the Theory of Functions and Functional Analysis. Graylock, Rochester, NY, 1957.
[K10] A. N. Kolmogorov and A. P. Yushkevich, editors. Mathematics of the Nineteenth Century [Matematika XIX veka]. Nauka, Moscow, 1978.
[K11] J. Kubilius. Probabilistic Methods in the Theory of Numbers. American Mathematical Society, Providence, 1964.
[L1] J. Lamperti. Probability. Benjamin, New York, 1966.
[L2] J. Lamperti. Stochastic Processes. Springer-Verlag, New York, 1977.
[L3] E. Lenglart. Relation de domination entre deux processus. Ann. Inst. H. Poincaré Sect. B (N.S.) 13 (1977), 171-179.
[L4] V. P. Leonov and A. N. Shiryayev. On a method of calculation of semi-invariants. Theory Probab. Appl. 4 (1959), 319-329.
[L5] R. S. Liptser and A. N. Shiryayev. Statistics of Random Processes. Springer-Verlag, New York, 1977.
[L6] R. Sh. Liptser and A. N. Shiryayev. A functional central limit theorem for semimartingales. Theory Probab. Appl. 25 (1980), 667-688.
[L7] M. Loève. Probability Theory. Springer-Verlag, New York, 1977-78.
[L8] E. Lukacs. Characteristic Functions. Hafner, New York, 1960.
[L9] E. Lukacs and R. G. Laha. Applications of Characteristic Functions. Hafner, New York, 1964.
[M1] D. E. Maistrov. Probability Theory: A Historical Sketch. Academic Press, New York, 1974.
[M2] A. A. Markov. Calculus of Probabilities [Ischislenie veroyatnostei], 3rd ed. St. Petersburg, 1913.
[M3] L. D. Meshalkin. Collection of Problems on Probability Theory [Sbornik zadach po teorii veroyatnostei]. Moscow University Press, 1963.
[M4] P.-A. Meyer. Martingales and stochastic integrals, I. Lecture Notes in Mathematics, no. 284. Springer-Verlag, Berlin-New York, 1972.
[M5] P.-A. Meyer and C. Dellacherie. Probabilities and Potential. North-Holland Mathematical Studies, no. 29. Hermann, Paris; North-Holland, Amsterdam, 1978.
[N1] J. Neveu. Mathematical Foundations of the Calculus of Probability. Holden-Day, San Francisco, 1965.
[N2] J. Neveu. Discrete Parameter Martingales. North-Holland, Amsterdam, 1975.
[N3] J. Neyman. First Course in Probability and Statistics. Holt, New York, 1950.
[N4] A. A. Novikov. On estimates and the asymptotic behavior of the probability of nonintersection of moving boundaries by sums of independent random variables. Math. USSR-Izv. 17 (1980), 129-145.
[P1] V. V. Petrov. Sums of Independent Random Variables. Springer-Verlag, Berlin-New York, 1975.
[P2] J. W. Pratt. On interchanging limits and integrals. Ann. Math. Stat. 31 (1960), 74-77.
[P3] Yu. V. Prohorov [Prokhorov]. Asymptotic behavior of the binomial distribution. Uspekhi Mat. Nauk 8, no. 3(55) (1953), 135-142 (in Russian).
[P4] Yu. V. Prohorov. Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl. 1 (1956), 157-214.
[P5] Yu. V. Prohorov and Yu. A. Rozanov. Probability Theory. Springer-Verlag, Berlin-New York, 1969.
[R1] B. Ramachandran. Advanced Theory of Characteristic Functions. Statistical Publishing Society, Calcutta, 1967.
[R2] A. Rényi. Probability Theory. North-Holland, Amsterdam, 1970.
[R3] V. I. Rotar. An extension of the Lindeberg-Feller theorem. Math. Notes 18 (1975), 660-663.
[R4] Yu. A. Rozanov. Stationary Random Processes [Statsionarnye sluchainye protsessy]. Fizmatgiz, Moscow, 1963.
[S1] T. A. Sarymsakov. Foundations of the Theory of Markov Processes [Osnovy teorii protsessov Markova]. GITTL, Moscow, 1954.
[S2] V. V. Sazonov. Normal Approximation: Some Recent Advances. Lecture Notes in Mathematics, no. 879. Springer-Verlag, Berlin-New York, 1981.
[S3] B. A. Sevastyanov [Sewastjanow]. Verzweigungsprozesse. Oldenbourg, Munich-Vienna, 1975.
[S4] A. N. Shiryayev. Random Processes [Sluchainye protsessy]. Moscow State University Press, 1972.
[S5] A. N. Shiryayev. Probability, Statistics, Random Processes [Veroyatnost, statistika, sluchainye protsessy], vols. I and II. Moscow State University Press, 1973, 1974.
[S6] A. N. Shiryayev. Statistical Sequential Analysis [Statisticheskii posledovatelnyi analiz]. Nauka, Moscow, 1976.
[S7] Ya. G. Sinai. Introduction to Ergodic Theory [Vvedenie v ergodicheskuyu teoriyu]. Erevan University Press, 1973.
[S8] S. H. Sirazhdinov. Limit Theorems for Stationary Markov Chains [Predelnye teoremy dlya odnorodnyh tsepei Markova]. Akad. Nauk Uzbek. SSR, Tashkent, 1955.
[S9] W. F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
[T1] I. Todhunter. A History of the Mathematical Theory of Probability from the Time of Pascal to that of Laplace. Macmillan, London, 1865.
[V1] E. Valkeila. A general Poisson approximation theorem. Stochastics 7 (1982), 159-171.
[V2] A. D. Ventzel. Course in the Theory of Random Processes [Kurs teorii sluchainyh protsessov]. Nauka, Moscow, 1975.
[W1] B. L. van der Waerden. Mathematical Statistics. Springer-Verlag, Berlin-New York, 1969.
[W2] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[Y1] A. M. and I. M. Yaglom. Wahrscheinlichkeit und Information. DVW, Berlin, 1960; Probabilité et Information. Dunod, Paris, 1969.
[Z1] S. Zacks. The Theory of Statistical Inference. Wiley, New York, 1971.
[Z2] V. M. Zolotarev. A generalization of the Lindeberg-Feller theorem. Theory Probab. Appl. 12 (1967), 606-618.
[Z3] V. M. Zolotarev. Théorèmes limites pour les sommes de variables aléatoires indépendantes qui ne sont pas infinitésimales. C. R. Acad. Sci. Paris Sér. A-B 264 (1967), A799-A800.

Index of Symbols

u,

u. n, n

134, 135 a' boundary 309 0, empty set 11, 134 EF> 419 ® 30 =,identity, or definition 149 "'• asymptotic equality 20; or equivalence 296 ff. =>,<=>,implications 139, 140 =>,also used for "convergence in general" 307, 309 4 335 b 314 ~, finer than 13 j,! 135 j_ 262,492 ~.~ • .!4, -4,24 250; ~ 306 loc «, « 492 (X, Y) 455 {X. --+} 483 [A], closure of A 151, 309 A, complement of A 11, 134 [t 1 , . . . , t.] combination, unordered set 7 (t 1 , . . . , t.) permutation, ordered set 7 A + B union of disjoint sets 11, 134 A\B 11, 134

A 6. B 43, 134 {A. i.o.} = lim sup A. 135 a- = min(a, 0); a+ = max(a, 0) 105 a$= a- 1 ,a # O;O,a = 0 434 a A b = min(a, b); a v b = max(a, b) 455 A$ 305 d 13 ~. index set 315 PJ(R) = PJ(R 1) = ffd = P4 1 141, 142 f!4(R") 142, 157 P4 0 (R") 144 PJ(R 00 ) 144, 160; PJ 0 (R 00 ) 145 PJ(R 1) 146, 164 PJ(C), Pl0 (C) 149 ffd(D) 148 PJ[O, 1] 152 ffd(R) ® · · · ® PJ(R) = PJ(R") 142 c 148 C = C[O, oo) 149 (C, PJ(C)) 148 c~ 6 c+ 483 cov (~. '1) 41, 232 D, (D, PJ(D)) 148 ~ 169 E 37 (E, C) 175


E<eJD),

76 211 E(~, '/) 79, 218 E(~J'It• o o o, 'In) 79 EC~I'It·ooo,'!k) 272 ess sup 259 (f, g), <J, g) 398 F* G 239 g; 131, 134 ~p 152 $'/lr 174 $'*,$'*'$'A 137 H 2 424 hn(x), Hn(x) 266 inf 44 § 161, 394 IA,I(A) 33 i - j 529 JA~dP 181 ~ dP 179 (L-S) (R-S) (L) (R) L 2 260 258 IE 262 !l 265

IP'k

E(~j.@)

E(~J~)

Jn

J,

J,

J,

0

Rn

251

7

A

138

N(A) 15 N(d) 13 N(Q) 5 JV(m, a 2 ) 232 0

233,263 p(w) 13 !!}> 315 p 18, 131 P(A) 18, 132 (!)

P(A I D), P(A I£&)

212 P(A I~) 74, 212 Px, P" 526 Pc(F) 307 IP' 113 P(AI~)

74

142, 157 159, 233, 263 (R, 81) = (R 1 ; PJ) = (R, &I(R)) 142 Rn, PA(Rn) 142, 157 R"', R 00 , EJI(R 00 ) 144, 160 RT, (RT, &I(RT)) 145, 164; (RT, &I(RT), P) 245 tt(x) 497 v 41 V(~l~) 212 Xn* = maxisniXil 464 IIXnllp 464 z 175 7l. 387 408 Z(A.), Z(A) 396 a(£&) 12 bij 266 A 59, 157 (Jk~ 376, 527 Jl, p(A) 131 /1 1 x Jlz 196 ~ 32 c:+ 44 ~ 270 n (permutation) 568 Ill= IIPijll 113 l0I 148 Cf1r s T np I®It E T g;;) 148 p(~. '/) 41 (p, t$ 427 (p 431 (x) 61 x. x2 154, 241 XB 172 n 5 (0, d, P) 18, 29 ffo

IR

J

lim, lim, lim sup, lim inf, lim liml 135

l.iomo

0

R 141 R. 142, 172

181

z

u

(M)n

114

IIPiill 113 llp(x, y)ll 110, 111 !!J> = {Pa;etE2l} 315 I[) = (qt. q2' o) 543

i,

141,

Index

Also see the Historical and Bibliographical Notes (pp. 555-559), and the Index of Symbols. Absolute continuity with respect toP 193, 492 Absolute moment 180 Absolutely continuous distributions 154 functions 154 measures 153 probability measures 193, 492 random variables 169 Absorbing Ill, 547 Accessible state 528 a.e. 183 Algebra correspondence with decomposition 80 induced by a decomposition 12 of events 12, 126 of sets 130, 137 If. (J131, 137ff. smallest 138 tail 355 If. Almost everywhere 183 Almost invariant 379 Almost periodic 389 Almost surely I 73, 183 Amplitude 390 Aperiodic state 531 a posteriori probability 27 Appropriate sets, principle of 139 a priori probability 27 Arcsine law 100 Arithmetic properties 528

a.s. 173, 183 Asymptotic algebra 355 Asymptotic properties 532 If. Asymptotically infinitesimal 327 uniformly 510 unbiased 415 Asymptotics 505 If. Atoms 12, 173 Attraction of characteristic functions 296 Autoregression, autoregressive 391, 393 Average time of return 533 Averages, moving 391, 393, 409 Axioms 13, 136

Backward equation I 14 Balance equation 392 Ballot theorem 105 If. Banach space 259 Barriers Ill, 54 7, 549 Bartlett's estimator 416 Basis, orthonormal 265 Bayes's formula 26 Bayes's theorem 27, 228 If. Bellman, R. 352 Bernoulli, James 2 distribution 153 law of large numbers 49 random variable 34, 46 scheme 30, 45 If., 55 If., 68 If.

568 Bernstein, S. N. 3, 4, 5, 305 inequality 55 polynomials 54 proof of Weierstrass' theorem 54 Berry-Esseen theorem 63, 342 Bessel's inequality 262 Best estimator 42, 69. Also see Optimal estimator Beta distribution !54 Bilateral exponential distribution !54 Binary expansion 129, 369 Binomial distribution 17 tf., !52 negative 153 Binomial random variable 34 Birkhotf, G. D. 376 Birthday problem 15 Bochner-Khinchin theorem 285 Borel, E. 4 algebra 141 ff. function 168 rectangle 143 sets 141, 145 space 227 zero-or-one law 355 Borel-Cantelli lemma 253 Borel-Cantelli-Levy theorem 486 Bose-Einstein 10 Boundary 504 Bounded variation 205 Branching process 113 Brownian motion 304 Butfon's needle 222 Bunyakovskii, V. Ya. 38, 190 Burkholder's inequality 469

Canonical decomposition 512 probability space 245 Cantelli, F. P. 253, 363 Cantor, G. diagonal process 317 function 155 Caratheodory's theorem 150 Carleman's test 294 Cauchy distribution 154, 337 inequality 38 sequence 251 Cauchy-Bunyakovskii inequality 38, 190 Cauchy criterion for almost sure convergence 256 convergence in probability 257

Index convergence in mean-p 258 Central limit theorem 3, 319, 324, 326 ff. for dependent variables 509 ff. nonclassical condition for 327 Certain event II, 134 Cesaro limit 541 Change of variable in integral 194 Chapman, D. G. 114, 246, 525 Characteristic function of distribution 4, 272 tf. random vector 273 set 33 Charlier, C. V. L. 267 Chebyshev, P. L. 3, 319 inequality 47, 55, 190 Chi, chi-squared distributions 154, 241 Class c+ 483 convergence-determining 312 determining 312 monotonic 138 of states 529 Classical method 15 models 17 ff. Classification of states 528 ff. Closed linear manifold 265 Coin tossing I, 5, 17, 33, 81 ff., 129 ff. Coincidence problem 15 Collectively independent 36 Combinations 6 Communicating states 529 Compact relatively 315 sequentially 316 Compensator 454 Complement II, 134 Complete function space 258 ff. probability measure 152 probabiiity space !52 Completely nondeterministic 419 Conditional distribution 225 probability 23 tf., 74, 219 regular 224 with respect to a decomposition 74, 210 u-algebra 210, 212 random variable 75, 212 variance 81, 212 Wiener process 305 Conditional expectation in the wide sense 262, 272

Index

with respect to decomposition 76 event 218 set of variables 79 u-algebra 211 Conditionally Gaussian 438 Confidence interval 71 Consistency property 161, 244 Consistent estimator 68 Construction of a process 243 Continuity theorem 320 Continuous at 0 !51, 162 Continuous from above or below 132 Continuous time 175 Convergence-determining class 312 Convergence of martingales and submartingales 476 ff., 483 ff. probability measures Chap. III, 306 ff. random variables: equivalences 480 sequences. See Convergence of sequences series 359 Convergence of sequences almost everywhere 183, 250, 322 almost sure 173, 183, 250, 322 at points of continuity 251 dominated 185 in distribution 250, 322 in general 307, 313 in mean 250 in mean of order p 250 in mean square 250 in measure 250, 322 in probability 250 monotone 184 weak 306 with probability I 250 Convolution 239 Coordinate method 245 Correlation coefficient 41, 232 function 388 maximal 242 Counting measure 231 Covariance 41, 232, 291 function 304, 388, 395 matrix 233 Cramer-Wold method 517 Cumulant 288 Curve of regression 236 Curvilinear boundary 504 ff. Cyclic property 530 Cyclic subclasses 530 Cylinder set 144

569 Davis's inequality 470 Decomposition canonical 512 countable 138 Doob 454 Krickeberg 475 Lebesgue 493 of martingale 475 ofQ 12, 138 of probability measure 493 of random sequence 419 of set 12, 290 of submartingale 454 trivial 77 Degenerate distribution 510 distribution function 286 random variable 296 Delta function 296 Delta, Kronecker 266 De Moivre, A. 2, 49 De Moivre-Laplace limit theorem 62 Density Gaussian 66, 154, 160, 234 n-dimensional 159 of distribution !54 of measure with respect to a measure 194 of random variable 169 Dependent random variables 101 central limit theorem for 509 ff. Derivative, Radon-Nikodym 194 Detection of signal 434 Determining class 312 Deterministic 419 Dichotomy Hajek-Feldman 501 Kakutani 496 Difference of sets II, 134 Direct product 30, 31, 142, 147, 149 Dirichlet's function 209 Discrete measure 153 random variable 169 time 175 uniform density !53 Disjoint 134 Dispersion 41 Distribution. Also see Distribution function, Probability distribution Bernoulli 153 beta 154 binomial 17, 18, 153 Cauchy 241, 338

570 chi, chi-squared 241 conditional 225 degenerate 510 discrete (list) 153 discrete uniform !53 entropy of 51 ergodic 116 exponential 154 Fisher 241 gamma 154 Gaussian 66, 154, 159, 291 geometric 153 hypergeometric 21 infinitely divisible 335 initial II 0, 524 invariant 118 limit 545 lognormal 238 multidimensional 35, 158 multinomial 20 negative binomial !53 normal 66, 154 of process 176 ofsum 36 Poisson 64, 153 polynomial 20 probability 33 stable 338 stationary 118 Student's 154, 242 154, 242 1two-sided exponential 154 uniform 154 with density (list) 154 Distribution function 34, 35, 150, 169 absolutely continuous 154 degenerate 286 discrete 153 finite-dimensional 244 generalized 156 n-dimensional 158 of functions of random variables 36, 237 ff. of sum 36, 239 Distribution of objects in cells 8 .@-measurable 77 Dominated convergence 185 Dominated sequence 467 Doob, J. L. 454, 457, 464, 474, 476 Doubling stakes 87, 453 Doubly stochastic 546 d-system 140 Duration of random walk 88 Dvoretzky's inequality 475

Index

Efficient estimator 69 Eigenvalue 128 Electric circuit 32 Elementary events 5, 134 probability theory Chap. I stochastic measure 396 Empty set II, 134 Entropy 51 Equivalent measures 492 Ergodic distribution 46 sequence 385 theorems 116, 381 ff., 385 maximal 382 mean-square 410 theory 377 ff. transformation 379 Ergodicity 116, 379 ff., 545 Errors law of 296 mean-square 43 of observation 2, 3 Esseen's inequality 294 Essential state 528 Essential supremum 259 Estimation 68, 235, 412 ff., 426 Estimator 42, 68 ff., 235 Bartlett's 416 best 42, 69 consistent 68 efficient 69 for parameters 444, 488, 503 linear 43 of spectral quantities 414 optimal 69, 235, 301, 426, 433, 435, 441 Parzen's 417 unbiased 68, 412 Zhurbenko's 417 Events 5, 10, 134 certain II, 134 impossible II, 134 independent 28, 29 mutually exclusive 134 symmetric 357 Existence of limits and stationary distributions 541 ff. Expectation 37, 179 inequalities for 190 ff. of function 55 of maximum 45 of random variable with respect to decomposition 76 set of random variables 79



    σ-algebra 211
  of sum 38
Expected value 37
Exponential distribution 154
Exponential random variable 154, 242, 243
Extended random variable 171
Extension of a measure 150, 247, 399
Extrapolation 422, 425 ff.

Fair game 452
Fatou's lemma 185, 209
Favorable game 86, 452
F-distribution 154
Feldman, J. 501
Feller, W. 332
Fermat, P. de 2
Fermi-Dirac 10
Filter 406 ff., 436 ff.
  physically realizable 423
Filtering 432 ff., 436 ff.
Finer decomposition 78
Finite second moment 260 ff.
Finite-dimensional distribution function 244
Finitely additive 131
First
  arrival 127, 533
  exit 121
  return 92, 127, 533
Fisher's information 70
ℱ-measurable 168
Forward equation 114
Foundations Chap. II, 129 ff.
Fourier transform 274
Frequencies 390
Frequency 46
Frequency characteristic 406
Fubini's theorem 196 ff.
Fundamental inequalities (for martingales) 464 ff.

Fundamental sequence 251 ff., 256 ff.

Gamma distribution 154, 337
Garsia, A. 382
Gauss, C. F. 3
Gaussian
  density 66, 154, 160, 234
  distribution 66, 154, 159, 291
  measure 266
  random variables 232, 241, 296
  random vector 297
  sequence 304, 385, 404, 413, 438
    zero-one law for 501

  systems 295 ff., 303
Gauss-Markov process 305
Generalized
  Bayes theorem 229
  distribution function 156
  martingale 448, 491
  submartingale 448, 491
Geometric distribution 153
Gnedenko, B. V. vii, 510
Gram determinant 263
Gram-Schmidt process 264
Gronwall-Bellman lemma 352
Haar functions 268, 482
Hajek-Feldman dichotomy 501
Hardy class 424
Harmonics 390
Hartman, P. 372
Helly's theorem 316
Herglotz, G. 393
Hermite polynomials 266
Hewitt, E. 357
Hilbert space 261
  complex 273, 388
  separable 265
  unitary 388
Hinchin. See Khinchin
History 556-559
Hölder inequality 191
Huygens, C. 2
Hydrology 392, 393
Hypergeometric distribution 21
Hypotheses 27

Impossible event 11, 134
Impulse response 406
Increasing sequence 134
Increments
  independent 304
  uncorrelated 107, 304
Indecomposable 529
Independence 27 ff.
  linear 263. Also see Linear dependence
  of components of a random vector 284

Independent
  algebras 28, 29
  events 28, 29
  functions 177
  increments 304
  random variables 36, 75, 79, 177, 480
Independent of the future 448

Indeterminacy 50 ff.
Indicator 33, 43
Inequalities
  Bernstein 55
  Bessel 262
  Burkholder 469
  Cauchy-Bunyakovskii 38, 70, 190
  Cauchy-Schwarz 38
  Chebyshev 47, 55, 190
  Davis 470
  Dvoretzky 475
  Hölder 191
  Jensen 190
  Khinchin 468, 469
  Kolmogorov 466
  Lévy 375
  Lyapunov 191
  Marcinkiewicz-Zygmund 469
  Markov 557
  martingale 464 ff.
  Minkowski 192
  Ottaviani 475
  Rademacher-Menshov 475
  Rao-Cramér 70
  Schwarz 38
  two-dimensional Chebyshev 55
Inessential state 528
Infinitely divisible 335
Infinitely many outcomes 129 ff.
Information 70
Initial distribution 110, 246
Innovation sequence 420
Integral
  Lebesgue 179, 181
  Lebesgue-Stieltjes 181
  Riemann 181, 201
  Riemann-Stieltjes 201, 202
  stochastic 390, 398
Integral equation 206-208, 350 ff.
Integral theorem 62
Integration by
  parts 204
  substitution 209
Intensity 390
Interpolation 430 ff.
Intersection 11, 134
Introducing probability measures 149 ff.
Invariant set 379, 385
Inversion formulas 281, 295
i.o. 135
Ionescu Tulcea, C. T. vii, 247
Ising model 23
Isometry 402
Iterated logarithm 372


Jensen's inequality 190, 231

Kakutani dichotomy 496
Kalman-Bucy filter 436 ff.
Khinchin, A. Ya. 285, 359, 372, 468
Kolmogorov, A. N. vii, 3, 4, 510
  axioms 136
  inequality 359
Kolmogorov-Chapman equation 114, 246, 525
Kolmogorov's theorems
  convergence of series 359
  existence of process 244
  extension of measures 161, 165
  iterated logarithm 372
  stationary sequences 425, 431
  strong law of large numbers 364, 366
  three-series theorem 362
  two-series theorem 361
  zero-or-one law 356
Krickeberg's decomposition 475
Kronecker, L. 365
  delta 266

Λ (condition) 326
Laplace, P. S. 2, 35, 62
Law of large numbers 49, 65, 323
  for Markov chains 120
  for square-integrable martingales 487
  Poisson's 557
  strong 363 ff.
Law of the iterated logarithm 370 ff.
Least squares 3
Lebesgue, H.
  decomposition 493
  dominated convergence theorem 185
  integral 178 ff.
    change of variable in 194
  measure 152, 157
Lebesgue-Stieltjes integral 195
Lebesgue-Stieltjes measure 152, 156, 204
LeCam, L. 345
Lévy, P.
  convergence theorem 478
  distance 314
  inequality 375
Lévy-Khinchin representation 341
Lévy-Khinchin theorem 337
Likelihood ratio 108
lim inf, lim sup 135, 137
Limit theorems 49 ff., 55 ff.



Limits under
  expectation signs 216
  integral signs 183, 187
Lindeberg condition 326
Lindeberg-Feller theorem 332
Linear manifold 262, 265
Linearly independent 263
Liouville's theorem 378
Lipschitz condition 481
Local limit theorem 55, 56
Local martingale, submartingale 449
Locally absolutely continuous 492
Locally bounded variation 206
Lognormal 238
Lottery 15, 22
Lower function 371
L²-theory Chap. VI, 387 ff.
Lyapunov, A. M. 3, 319
  condition 331
  inequality 191

McMillan's theorem 53
Marcinkiewicz's theorem 286
Marcinkiewicz-Zygmund inequality 469
Markov, A. A. 3, 319
  dependence 523
  process 246
  property 110, 125, 523
  time 448
Markov chains 108 ff., 249, Chap. VIII, 523 ff.
  classification of 528-541
  discrete 524
  examples 111 ff., 541 ff.
  finite 524
  homogeneous 111, 524
Martingale 100 ff., Chap. VII (446 ff.), 453
  bounded 456
  convergence of 476 ff.
  generalized 448
  inequalities for 464 ff.
  in gambling 453
  local 449
  oscillations of 473
  sets of convergence for 483 ff.
  square-integrable 454 ff., 466, 506
  uniformly integrable 479
Martingale-difference 453, 511, 521
Martingale transform 450
Mathematical expectation 37, 74. Also see Expectation

Mathematical foundations Chap. II, 129 ff.
Matrix
  covariance 233
  doubly stochastic 546
  of transition probabilities 110
  orthogonal 233, 263
  stochastic 111
  transition 111
Maximal
  correlation coefficient 242
  ergodic theorem 382
Maxwell-Boltzmann 10
Mean
  duration 81, 88, 461
  -square 42
    ergodic theorem 410
  value 37
  vector 299
Measurable
  function 168
  random variable 77
  set 152
  spaces 131
    (C, ℬ(C)) 148
    (D, ℬ(D)) 148
    (R, ℬ(R)) 141, 150
    (R^∞, ℬ(R^∞)) 144, 160
    (R^n, ℬ(R^n)) 142, 157
    (R^T, ℬ(R^T)) 145, 164
  transformation 377
Measure 131
  absolutely continuous 153
  complete 152
  consistent 526
  countably additive 131
  counting 231
  discrete 153
  elementary 396
  extending a 150, 247, 399
  finite 131
  finitely additive 130, 131
  Gaussian 266
  Lebesgue 152
  Lebesgue-Stieltjes 156
  orthogonal 397
  probability 131, 132
  restriction of 163
  σ-additive 131
  σ-finite 131
  signed 194
  singular 154 ff.
  stochastic 390
  Wiener 166, 167


Measure-preserving transformations 376 ff.
Median 44
Menshov, D. E. 475
Method
  of characteristic functions 318 ff.
  of moments 4, 319
Metrically transitive 379
Minkowski inequality 192
Mises, R. von 4
Mixed model 393
Mixed moment 287
Mixing 381
Moivre. See De Moivre
Moment 180
  absolute 180
  and semi-invariant 288
  method 4, 319
  mixed 287
  problem 292 ff.
Monotone convergence theorem 184
Monotonic class 138
Monte Carlo method 223, 369
Moving averages 390, 393, 409
Multinomial distribution 20, 21
Multiplication formula 26
Mutual variation 455
Needle (Buffon) 222
Negative binomial 153
Noise 390, 407, 412
Nondeterministic 419
Nonlinear estimator 425
Nonnegative definite 233
Nonrecurrent state 533
Norm 258
Normal
  correlation 301, 305
  density 66, 159, 264
  distribution function 62, 66, 154, 159
  number 369
Normally distributed 297
Null state 533
Occurrence of event 134
Optimal estimator 69, 235, 301, 426, 433, 435, 441
Optional stopping 559
Ordered sample 6
Orthogonal
  increments 400
  matrix 233, 263
  random variables 261

  stochastic measures 390, 397, 492
  system 261
  values 397
Orthogonalization 264
Orthonormal 265
Oscillations of
  martingales 473
  water level 393
Ottaviani's inequality 475
Outcome 5, 134
Pairwise independence 29
P-almost surely, almost everywhere 183
Parallelogram property 272
Parseval's equation 265
Partially observed sequences 432 ff.
Parzen's estimator 417
Pascal, B. 2
Path 48, 83, 93
Pauli exclusion principle 10
Period of Markov chain 530
Periodogram 415
Permutation 7, 357
Perpendicular 262
Phase space 110, 378, 524
Physically realizable 406, 423
π 223
Poincaré recurrence principle 378
Poisson, D. 3
  distribution 64, 153
  law of large numbers 557
  limit theorem 64, 325
Poisson-Charlier polynomials 267
Pólya's theorems
  characteristic functions 285
  random walk 554
Polynomials
  Bernstein 54
  Hermite 266
  Poisson-Charlier 267
Positive semi-definite 285
Positive state 533
Pratt's lemma 209, 557
Predictable sequence 418, 446
Preservation of martingale property 456 ff.
Principle of appropriate sets 139
Probabilistic model 5, 14, 129, 136
  in the extended sense 131
Probability 13, 132
  a posteriori, a priori 27
  classical 15
  conditional 23, 74, 212
  finitely additive 131



  measure 131, 149-152
  multiplication 26
  of first arrival or return 533
  of mean duration 88
  of ruin 86
  of success 68 ff.
  total 25, 75, 77
  transition 525
Probability distribution 33, 168, 176
  discrete 153
  lognormal 238
  stationary 528
  table 153, 154
Probability measure 131, 132, 152, 492
  absolutely continuous 492
  complete 152
Probability space 14, 136
  canonical 245
  complete 152
  universal 250
Problems on
  arrangements 8
  coincidence 15
  ruin 82, 86
Process
  branching 113
  Brownian motion 304
  construction of 243 ff.
  Gaussian 305
  Gauss-Markov 305
  Markov 246
  stochastic 4, 175
  Wiener 304, 305
  with independent increments 305
Prohorov, Yu. V. vii, 64, 315
Projection 262, 271
Pseudoinverse 305
Pseudotransform 434
Purely nondeterministic 419
Pythagorean property 272
Quadratic variation 455
Queueing theory 112
Rademacher-Menshov inequality 475
Rademacher system 268
Radon-Nikodym
  derivative 194
  theorem 194, 557
Random elements 174 ff.

Quadratic variation 455 Queueing theory 112

Rademacher-Menshov inequality 475 Rademacher system 268 Radon-Nikodym derivative 194 theorem 194, 557 Random elements 174 If.

  function 175
  process 175, 304
    with orthogonal increments 400
  sequences 4, Chap. V, 366 ff.
    existence of 247, 304
    orthogonal 419
Random variables 32 ff., 168 ff., 232 ff.
  absolutely continuous 169
  almost invariant 379
  complex 175
  continuous 169
  degenerate 296
  discrete 168, 169
  exponential 242, 243
  E-valued 174
  extended 171
  Gaussian 232, 296
  invariant 379
  normally distributed 232
  simple 168
  uncorrelated 232
Random vectors 35, 175
  Gaussian 297, 299
Random walk 18, 82 ff., 356
  in two and three dimensions 552
  simple 546
  symmetric 92, 356, 552
  with curvilinear boundary 504 ff.
Rao-Cramér inequality 70
Rapidity of convergence 342 ff., 345 ff.
Realization of a process 176
Recurrent state 533
Reflecting barrier 549 ff.
Reflection principle 93
Regression 236
Regular
  conditional distribution 225
  conditional probability 224
  stationary sequence 419
Relatively compact 315
Reliability 71
Restriction of a measure 163
Riemann integral 201
Riemann-Stieltjes integral 201 ff.
Ruin 82, 86, 461

Sample points, space 5
Sampling
  with replacement 6
  without replacement 7, 21, 23
Savage, L. J. 357

Scalar product 261
Schwarz inequality 38. Also see Bunyakovskii, Cauchy
Semicontinuous 311
Semi-definite 285
Semi-invariant 288
Semi-norm 258
Separable 265
Sequences
  almost periodic 389
  moving average 390
  of independent random variables 354 ff.
  partially observed 432
  predictable 418, 446
  random 175
  regular 419
  singular 419
  stationary (strict sense) 376 ff.
  stationary (wide sense) 387 ff.
  stochastic 446, 483
Sequential compactness 316
Series of random variables 359
Sets of convergence 483 ff.
Shifting operators 527
σ-additive 132
Sigma algebra 131, 134
  asymptotic 355
  generated by ξ 172
  tail, terminal 355
Signal, detection of 434
Significance level 71
Simple
  moments 289
  random variable 32
  random walk 546
  semi-invariants 289
Singular measure 154 ff., 492
Singular sequence 419
Singularity of distributions 493 ff.
Skorohod, A. V. 148
Slowly varying 505
Spectral
  characteristic 406
  density 390
    rational 409, 428
  function 394
  measure 394
  representation of
    covariance function 387 ff., 393
    sequences 401, 402
  window 416
Spectrum 390
Square-integrable martingale 454 ff., 466, 487, 506

Stable 338
Standard deviation 41, 232
State space 110
States, classification of 528 ff., 533 ff.
Stationary
  distribution 118, 528, 542
  Markov chain 118
  sequence Chap. V, 376 ff.; Chap. VI, 387 ff.
Statistical estimation 412 ff.
Statistically independent 28
Statistics 4, 50
Stieltjes, T. J. 181, 201
Stirling's formula 20, 22
Stochastic
  integral 390
  matrix 111, 546
  measure 390, 396 ff.
    extension of 399
    orthogonal 397
    with orthogonal values 397, 398
  process 4, 175
  sequence 446, 483
Stochastically independent 42
Stopped process 449
Stopping time 82, 103, 448
Strong law of large numbers 363 ff., 472, 482
Strong Markov property 125
Structure function 397
Student distribution 154, 242
Submartingales 446 ff.
  convergence of 476 ff.
  generalized 448, 491
  local 449
  nonnegative 477
  nonpositive 476
  sets of convergence of 483 ff.
  uniformly integrable 477
Substitution, integration by 209
Sum of
  dependent random variables 509 ff.
  events 11
  exponential random variables 243
  Gaussian random variables 241
  independent random variables 326 ff., Chap. IV, 354 ff.
  Poisson random variables 242
  sets 11, 134
Summation by parts 365
Supermartingale 447
Symmetric difference Δ 43, 134
Symmetric events 357
Szegő-Kolmogorov formula 436



Tables
  continuous densities 154
  discrete densities 153
  terms in set theory and probability 134, 135
Tail 49, 321, 332
  algebra 355 ff.
Taxi stand 112
t-distribution 154, 242
Terminal algebra 355
Three-series theorem 362
Tight 315
Time
  change (in martingale) 456 ff.
  continuous 175
  discrete 175
  domain 175
Toeplitz, O. 365
Total probability 25, 75, 77
Trajectory 176
Transfer function 406
Transform 450
Transformation, measure-preserving 376 ff.
Transition
  function 524
  matrix 111
  probabilities 111, 246, 524
Trial 30
Trivial algebra 12
Tulcea. See Ionescu Tulcea
Two-dimensional Gaussian density 160
Two-series theorem 361
Typical
  path 48, 52
  realization 48

Unbiased estimator 68, 412
Uncorrelated 42, 232
  increments 107
Unfavorable game 86, 87, 452
Uniform distribution 153, 154
Uniformly
  asymptotically infinitesimal 510
  continuous 326
  integrable 186 ff.

Union 11, 134
Uniqueness of
  distribution function 280
  solution of moment problem 293
Universal probability space 250
Unordered
  samples 6
  sets 166
Upper function 371

Variance 41
  conditional 81
  of sum 42
Variation, mutual, quadratic 455
Vector
  Gaussian 236
  random 35, 175, 236, 297, 299

Wald's identities 105, 460 ff., 464
Water level 392, 393
Weak convergence 306 ff.
Weierstrass approximation theorem
  for polynomials 54
  for trigonometric polynomials 280
White noise 390, 407, 412
Wiener, N.
  measure 166, 167
  process 304, 305
Window, spectral 416
Wintner, A. 372
Wold's
  expansion 418 ff., 422
  method 517
Wolf, R. 223

Zero-or-one laws 354 ff., 479
  Borel 355
  for Gaussian sequences 501
  Hewitt-Savage 357
  Kolmogorov 354, 356
Zhurbenko's estimator 417
Zygmund, A. 469

Graduate Texts in Mathematics

1 TAKEUTI/ZARING. Introduction to Axiomatic Set Theory. 2nd ed.
2 OXTOBY. Measure and Category. 2nd ed.
3 SCHAEFFER. Topological Vector Spaces.
4 HILTON/STAMMBACH. A Course in Homological Algebra.
5 MAC LANE. Categories for the Working Mathematician.
6 HUGHES/PIPER. Projective Planes.
7 SERRE. A Course in Arithmetic.
8 TAKEUTI/ZARING. Axiomatic Set Theory.
9 HUMPHREYS. Introduction to Lie Algebras and Representation Theory.
10 COHEN. A Course in Simple Homotopy Theory.
11 CONWAY. Functions of One Complex Variable. 2nd ed.
12 BEALS. Advanced Mathematical Analysis.
13 ANDERSON/FULLER. Rings and Categories of Modules.
14 GOLUBITSKY/GUILLEMIN. Stable Mappings and Their Singularities.
15 BERBERIAN. Lectures in Functional Analysis and Operator Theory.
16 WINTER. The Structure of Fields.
17 ROSENBLATT. Random Processes. 2nd ed.
18 HALMOS. Measure Theory.
19 HALMOS. A Hilbert Space Problem Book. 2nd ed., revised.
20 HUSEMOLLER. Fibre Bundles. 2nd ed.
21 HUMPHREYS. Linear Algebraic Groups.
22 BARNES/MACK. An Algebraic Introduction to Mathematical Logic.
23 GREUB. Linear Algebra. 4th ed.
24 HOLMES. Geometric Functional Analysis and its Applications.
25 HEWITT/STROMBERG. Real and Abstract Analysis.
26 MANES. Algebraic Theories.
27 KELLEY. General Topology.
28 ZARISKI/SAMUEL. Commutative Algebra. Vol. I.
29 ZARISKI/SAMUEL. Commutative Algebra. Vol. II.
30 JACOBSON. Lectures in Abstract Algebra I: Basic Concepts.
31 JACOBSON. Lectures in Abstract Algebra II: Linear Algebra.
32 JACOBSON. Lectures in Abstract Algebra III: Theory of Fields and Galois Theory.
33 HIRSCH. Differential Topology.
34 SPITZER. Principles of Random Walk. 2nd ed.
35 WERMER. Banach Algebras and Several Complex Variables. 2nd ed.
36 KELLEY/NAMIOKA et al. Linear Topological Spaces.
37 MONK. Mathematical Logic.
38 GRAUERT/FRITZSCHE. Several Complex Variables.
39 ARVESON. An Invitation to C*-Algebras.
40 KEMENY/SNELL/KNAPP. Denumerable Markov Chains. 2nd ed.
41 APOSTOL. Modular Functions and Dirichlet Series in Number Theory.
42 SERRE. Linear Representations of Finite Groups.
43 GILLMAN/JERISON. Rings of Continuous Functions.
44 KENDIG. Elementary Algebraic Geometry.
45 LOÈVE. Probability Theory I. 4th ed.
46 LOÈVE. Probability Theory II. 4th ed.
47 MOISE. Geometric Topology in Dimensions 2 and 3.
48 SACHS/WU. General Relativity for Mathematicians.
49 GRUENBERG/WEIR. Linear Geometry. 2nd ed.
50 EDWARDS. Fermat's Last Theorem.

Graduate Texts in Mathematics

51 KLINGENBERG. A Course in Differential Geometry.
52 HARTSHORNE. Algebraic Geometry.
53 MANIN. A Course in Mathematical Logic.
54 GRAVER/WATKINS. Combinatorics with Emphasis on the Theory of Graphs.
55 BROWN/PEARCY. Introduction to Operator Theory I: Elements of Functional Analysis.
56 MASSEY. Algebraic Topology: An Introduction.
57 CROWELL/FOX. Introduction to Knot Theory.
58 KOBLITZ. p-adic Numbers, p-adic Analysis, and Zeta-Functions.
59 LANG. Cyclotomic Fields.
60 ARNOLD. Mathematical Methods in Classical Mechanics.
61 WHITEHEAD. Elements of Homotopy Theory.
62 KARGAPOLOV/MERZLJAKOV. Fundamentals of the Theory of Groups.
63 BOLLOBÁS. Graph Theory.
64 EDWARDS. Fourier Series. Vol. I. 2nd ed.
65 WELLS. Differential Analysis on Complex Manifolds. 2nd ed.
66 WATERHOUSE. Introduction to Affine Group Schemes.
67 SERRE. Local Fields.
68 WEIDMANN. Linear Operators in Hilbert Spaces.
69 LANG. Cyclotomic Fields II.
70 MASSEY. Singular Homology Theory.
71 FARKAS/KRA. Riemann Surfaces.
72 STILLWELL. Classical Topology and Combinatorial Group Theory.
73 HUNGERFORD. Algebra.
74 DAVENPORT. Multiplicative Number Theory. 2nd ed.
75 HOCHSCHILD. Basic Theory of Algebraic Groups and Lie Algebras.
76 IITAKA. Algebraic Geometry.
77 HECKE. Lectures on the Theory of Algebraic Numbers.
78 BURRIS/SANKAPPANAVAR. A Course in Universal Algebra.
79 WALTERS. An Introduction to Ergodic Theory.
80 ROBINSON. A Course in the Theory of Groups.
81 FORSTER. Lectures on Riemann Surfaces.
82 BOTT/TU. Differential Forms in Algebraic Topology.
83 WASHINGTON. Introduction to Cyclotomic Fields.
84 IRELAND/ROSEN. A Classical Introduction to Modern Number Theory.
85 EDWARDS. Fourier Series. Vol. II. 2nd ed.
86 VAN LINT. Introduction to Coding Theory.
87 BROWN. Cohomology of Groups.
88 PIERCE. Associative Algebras.
89 LANG. Introduction to Algebraic and Abelian Functions. 2nd ed.
90 BRØNDSTED. An Introduction to Convex Polytopes.
91 BEARDON. On the Geometry of Discrete Groups.
92 DIESTEL. Sequences and Series in Banach Spaces.
93 DUBROVIN/FOMENKO/NOVIKOV. Modern Geometry: Methods and Applications. Vol. I.
94 WARNER. Foundations of Differentiable Manifolds and Lie Groups.
95 SHIRYAYEV. Probability, Statistics, and Random Processes.
96 ZEIDLER. Nonlinear Functional Analysis and Its Applications I: Fixed-Point Theorems.

