Probability

Graduate Texts in Mathematics 95

Editorial Board
P. R. Halmos (Managing Editor)
F. W. Gehring
C. C. Moore

"Order out of chaos" (Courtesy of Professor A. T. Fomenko of the Moscow State University)

A. N. Shiryayev

Probability

Translated by R. P. Boas

With 54 Illustrations

Springer Science+Business Media, LLC

A. N. Shiryayev
Steklov Mathematical Institute
Vavilova 42, GSP-1
117333 Moscow
U.S.S.R.

Editorial Board

P. R. Halmos (Managing Editor)
Department of Mathematics
Indiana University
Bloomington, IN 47405
U.S.A.

R. P. Boas (Translator)
Department of Mathematics
Northwestern University
Evanston, IL 60201
U.S.A.

F. W. Gehring
Department of Mathematics
University of Michigan
Ann Arbor, MI 48109
U.S.A.

C. C. Moore
Department of Mathematics
University of California at Berkeley
Berkeley, CA 94720
U.S.A.

AMS Classification: 60-01

Library of Congress Cataloging in Publication Data
Shiriaev, Al'bert Nikolaevich.
Probability.
(Graduate texts in mathematics; 95)
Translation of: Veroyatnost'.
Bibliography: p.
Includes index.
1. Probabilities. I. Title. II. Series.
QA273.S54413 1984  519  83-14813

Original Russian edition: Veroyatnost'. Moscow: Nauka, 1979. This book is part of the Springer Series in Soviet Mathematics.

© 1984 by Springer Science+Business Media New York Originally published by Springer-Verlag New York, Inc. in 1984 Softcover reprint of the hardcover 1st edition 1984 All rights reserved. No part of this book may be translated or reproduced in any form without written permission from Springer Science+Business Media, LLC.

Typeset by Composition House Ltd., Salisbury, England.

9 8 7 6 5 4 3 2 1

ISBN 978-1-4899-0020-3
ISBN 978-1-4899-0018-0 (eBook)
DOI 10.1007/978-1-4899-0018-0

Preface

This textbook is based on a three-semester course of lectures given by the author in recent years in the Mechanics-Mathematics Faculty of Moscow State University and issued, in part, in mimeographed form under the title Probability, Statistics, Stochastic Processes, I, II by the Moscow State University Press.

We follow tradition by devoting the first part of the course (roughly one semester) to the elementary theory of probability (Chapter I). This begins with the construction of probabilistic models with finitely many outcomes and introduces such fundamental probabilistic concepts as sample spaces, events, probability, independence, random variables, expectation, correlation, conditional probabilities, and so on. Many probabilistic and statistical regularities are effectively illustrated even by the simplest random walk generated by Bernoulli trials. In this connection we study both classical results (law of large numbers, local and integral De Moivre and Laplace theorems) and more modern results (for example, the arc sine law). The first chapter concludes with a discussion of dependent random variables generated by martingales and by Markov chains.

Chapters II-IV form an expanded version of the second part of the course (second semester). Here we present (Chapter II) Kolmogorov's generally accepted axiomatization of probability theory and the mathematical methods that constitute the tools of modern probability theory (σ-algebras, measures and their representations, the Lebesgue integral, random variables and random elements, characteristic functions, conditional expectation with respect to a σ-algebra, Gaussian systems, and so on). Note that two measure-theoretical results, Carathéodory's theorem on the extension of measures and the Radon-Nikodym theorem, are quoted without proof.


The third chapter is devoted to problems about weak convergence of probability distributions and the method of characteristic functions for proving limit theorems. We introduce the concepts of relative compactness and tightness of families of probability distributions, and prove (for the real line) Prohorov's theorem on the equivalence of these concepts.

The same part of the course discusses properties "with probability 1" for sequences and sums of independent random variables (Chapter IV). We give proofs of the "zero or one laws" of Kolmogorov and of Hewitt and Savage, tests for the convergence of series, and conditions for the strong law of large numbers. The law of the iterated logarithm is stated for arbitrary sequences of independent identically distributed random variables with finite second moments, and proved under the assumption that the variables have Gaussian distributions.

Finally, the third part of the book (Chapters V-VIII) is devoted to random processes with discrete parameters (random sequences). Chapters V and VI are devoted to the theory of stationary random sequences, where "stationary" is interpreted either in the strict or the wide sense. The theory of random sequences that are stationary in the strict sense is based on the ideas of ergodic theory: measure-preserving transformations, ergodicity, mixing, etc. We reproduce a simple proof (by A. Garsia) of the maximal ergodic theorem; this also lets us give a simple proof of the Birkhoff-Khinchin ergodic theorem.

The discussion of sequences of random variables that are stationary in the wide sense begins with a proof of the spectral representation of the covariance function. Then we introduce orthogonal stochastic measures, and integrals with respect to these, and establish the spectral representation of the sequences themselves. We also discuss a number of statistical problems: estimating the covariance function and the spectral density, extrapolation, interpolation and filtering.
The chapter includes material on the Kalman-Bucy filter and its generalizations.

The seventh chapter discusses the basic results of the theory of martingales and related ideas. This material has only rarely been included in traditional courses in probability theory. In the last chapter, which is devoted to Markov chains, the greatest attention is given to problems on the asymptotic behavior of Markov chains with countably many states.

Each section ends with problems of various kinds: some of them ask for proofs of statements made but not proved in the text, some consist of propositions that will be used later, some are intended to give additional information about the circle of ideas that is under discussion, and finally, some are simple exercises.

In designing the course and preparing this text, the author has used a variety of sources on probability theory. The Historical and Bibliographical Notes indicate both the historical sources of the results and supplementary references for the material under consideration.

The numbering system and form of references is the following. Each section has its own enumeration of theorems, lemmas and formulas (with


no indication of chapter or section). For a reference to a result from a different section of the same chapter, we use double numbering, with the first number indicating the number of the section (thus (2.10) means formula (10) of §2). For references to a different chapter we use triple numbering (thus formula (II.4.3) means formula (3) of §4 of Chapter II). Works listed in the References at the end of the book have the form [L n], where L is a letter and n is a numeral.

The author takes this opportunity to thank his teacher A. N. Kolmogorov, and B. V. Gnedenko and Yu. V. Prohorov, from whom he learned probability theory and under whose direction he had the opportunity of using it. For discussions and advice, the author also thanks his colleagues in the Departments of Probability Theory and Mathematical Statistics at the Moscow State University, and his colleagues in the Section on probability theory of the Steklov Mathematical Institute of the Academy of Sciences of the U.S.S.R.

Moscow
Steklov Mathematical Institute

A. N. Shiryayev

Translator's acknowledgement. I am grateful both to the author and to my colleague C. T. Ionescu Tulcea for advice about terminology. R. P. B.

Contents

Introduction 1

CHAPTER I
Elementary Probability Theory 5

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 5
§2. Some Classical Models and Distributions 17
§3. Conditional Probability. Independence 23
§4. Random Variables and Their Properties 32
§5. The Bernoulli Scheme. I. The Law of Large Numbers 45
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 55
§7. Estimating the Probability of Success in the Bernoulli Scheme 68
§8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions 74
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 81
§10. Random Walk. II. Reflection Principle. Arcsine Law 92
§11. Martingales. Some Applications to the Random Walk 101
§12. Markov Chains. Ergodic Theorem. Strong Markov Property 108

CHAPTER II

Mathematical Foundations of Probability Theory 129

§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms 129
§2. Algebras and σ-algebras. Measurable Spaces 137
§3. Methods of Introducing Probability Measures on Measurable Spaces 149
§4. Random Variables. I. 164
§5. Random Elements 174
§6. Lebesgue Integral. Expectation 178
§7. Conditional Probabilities and Conditional Expectations with Respect to a σ-Algebra 210
§8. Random Variables. II. 232
§9. Construction of a Process with Given Finite-Dimensional Distribution 243
§10. Various Kinds of Convergence of Sequences of Random Variables 250
§11. The Hilbert Space of Random Variables with Finite Second Moment 260
§12. Characteristic Functions 272
§13. Gaussian Systems 295

CHAPTER III

Convergence of Probability Measures. Central Limit Theorem 306

§1. Weak Convergence of Probability Measures and Distributions 306
§2. Relative Compactness and Tightness of Families of Probability Distributions 314
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 318
§4. Central Limit Theorem for Sums of Independent Random Variables 326
§5. Infinitely Divisible and Stable Distributions 335
§6. Rapidity of Convergence in the Central Limit Theorem 342
§7. Rapidity of Convergence in Poisson's Theorem 345

CHAPTER IV

Sequences and Sums of Independent Random Variables 354

§1. Zero-or-One Laws 354
§2. Convergence of Series 359
§3. Strong Law of Large Numbers 363
§4. Law of the Iterated Logarithm 370

CHAPTER V

Stationary (Strict Sense) Random Sequences and Ergodic Theory 376

§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations 376
§2. Ergodicity and Mixing 379
§3. Ergodic Theorems 381

CHAPTER VI

Stationary (Wide Sense) Random Sequences. L² Theory 387

§1. Spectral Representation of the Covariance Function 387
§2. Orthogonal Stochastic Measures and Stochastic Integrals 395
§3. Spectral Representation of Stationary (Wide Sense) Sequences 401
§4. Statistical Estimation of the Covariance Function and the Spectral Density 412
§5. Wold's Expansion 418
§6. Extrapolation, Interpolation and Filtering 425
§7. The Kalman-Bucy Filter and Its Generalizations 436

CHAPTER VII

Sequences of Random Variables that Form Martingales 446

§1. Definitions of Martingales and Related Concepts 446
§2. Preservation of the Martingale Property Under Time Change at a Random Time 456
§3. Fundamental Inequalities 464
§4. General Theorems on the Convergence of Submartingales and Martingales 476
§5. Sets of Convergence of Submartingales and Martingales 483
§6. Absolute Continuity and Singularity of Probability Distributions 492
§7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary 504
§8. Central Limit Theorem for Sums of Dependent Random Variables 509

CHAPTER VIII

Sequences of Random Variables that Form Markov Chains 523

§1. Definitions and Basic Properties 523
§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties of the Transition Probabilities p_ij^(n) 528
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties of the Probabilities p_ii^(n) 532
§4. On the Existence of Limits and of Stationary Distributions 541
§5. Examples 546

Historical and Bibliographical Notes 555

References 561

Index of Symbols 565

Index 569

Introduction

The subject matter of probability theory is the mathematical analysis of random events, i.e. of those empirical phenomena which, under certain circumstances, can be described by saying that:

they do not have deterministic regularity (observations of them do not yield the same outcome), whereas at the same time

they possess some statistical regularity (indicated by the statistical stability of their frequency).

We illustrate with the classical example of a "fair" toss of an "unbiased" coin. It is clearly impossible to predict with certainty the outcome of each toss. The results of successive experiments are very irregular (now "head," now "tail") and we seem to have no possibility of discovering any regularity in such experiments. However, if we carry out a large number of "independent" experiments with an "unbiased" coin we can observe a very definite statistical regularity, namely that "head" appears with a frequency that is "close" to ½.

Statistical stability of a frequency is very likely to suggest a hypothesis about a possible quantitative estimate of the "randomness" of some event A connected with the results of the experiments. With this starting point, probability theory postulates that corresponding to an event A there is a definite number P(A), called the probability of the event, whose intrinsic property is that as the number of "independent" trials (experiments) increases, the frequency of event A is approximated by P(A).

Applied to our example, this means that it is natural to assign the probability ½ to the event A that consists of obtaining "head" in a toss of an "unbiased" coin.
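The statistical stability described above is easy to observe numerically. The following sketch (the helper name heads_frequency is ours, not the book's; standard library only) simulates repeated tosses of a fair coin and prints the observed frequency of heads, which settles near ½ as the number of tosses grows.

```python
import random

def heads_frequency(n_tosses: int, seed: int = 1) -> float:
    """Toss a fair coin n_tosses times; return the observed frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# As the number of tosses grows, the frequency stabilizes near 1/2.
for n in (100, 10_000, 1_000_000):
    print(n, heads_frequency(n))
```

The irregularity of individual outcomes and the stability of the aggregate frequency are both visible here: the value for 100 tosses wanders noticeably, while the value for a million tosses differs from ½ only in the third decimal place.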


There is no difficulty in multiplying examples in which it is very easy to obtain numerical values intuitively for the probabilities of one or another event. However, these examples are all of a similar nature and involve (so far) undefined concepts such as "fair" toss, "unbiased" coin, "independence," etc. Having been invented to investigate the quantitative aspects of "randomness," probability theory, like every exact science, became such a science only at the point when the concept of a probabilistic model had been clearly formulated and axiomatized. In this connection it is natural for us to discuss, although only briefly, the fundamental steps in the development of probability theory.

Probability theory, as a science, originated in the middle of the seventeenth century with Pascal (1623-1662), Fermat (1601-1665) and Huygens (1629-1695). Although special calculations of probabilities in games of chance had been made earlier, in the fifteenth and sixteenth centuries, by Italian mathematicians (Cardano, Pacioli, Tartaglia, etc.), the first general methods for solving such problems were apparently given in the famous correspondence between Pascal and Fermat, begun in 1654, and in the first book on probability theory, De Ratiociniis in Ludo Aleae (On Calculations in Games of Chance), published by Huygens in 1657. It was at this time that the fundamental concept of "mathematical expectation" was developed and theorems on the addition and multiplication of probabilities were established.
The real history of probability theory begins with the work of James Bernoulli (1654-1705), Ars Conjectandi (The Art of Guessing), published in 1713, in which he proved (quite rigorously) the first limit theorem of probability theory, the law of large numbers; and of De Moivre (1667-1754), Miscellanea Analytica Supplementum (a rough translation might be The Analytic Method or Analytic Miscellany, 1730), in which the central limit theorem was stated and proved for the first time (for symmetric Bernoulli trials).

Bernoulli was probably the first to realize the importance of considering infinite sequences of random trials and to make a clear distinction between the probability of an event and the frequency of its realization. De Moivre deserves the credit for defining such concepts as independence, mathematical expectation, and conditional probability.

In 1812 there appeared Laplace's (1749-1827) great treatise Théorie Analytique des Probabilités (Analytic Theory of Probability), in which he presented his own results in probability theory as well as those of his predecessors. In particular, he generalized De Moivre's theorem to the general (unsymmetric) case of Bernoulli trials, and at the same time presented De Moivre's results in a more complete form. Laplace's most important contribution was the application of probabilistic methods to errors of observation. He formulated the idea of considering errors of observation as the cumulative results of adding a large number of independent elementary errors. From this it followed that under rather


general conditions the distribution of errors of observation must be at least approximately normal.

The work of Poisson (1781-1840) and Gauss (1777-1855) belongs to the same epoch in the development of probability theory, when the center of the stage was held by limit theorems. In contemporary probability theory we think of Poisson in connection with the distribution and the process that bear his name. Gauss is credited with originating the theory of errors and, in particular, with creating the fundamental method of least squares.

The next important period in the development of probability theory is connected with the names of P. L. Chebyshev (1821-1894), A. A. Markov (1856-1922), and A. M. Lyapunov (1857-1918), who developed effective methods for proving limit theorems for sums of independent but arbitrarily distributed random variables. The number of Chebyshev's publications in probability theory is not large, four in all, but it would be hard to overestimate their role in probability theory and in the development of the classical Russian school of that subject.

"On the methodological side, the revolution brought about by Chebyshev was not only his insistence for the first time on complete rigor in the proofs of limit theorems, ... but also, and principally, that Chebyshev always tried to obtain precise estimates for the deviations from the limiting regularities that are available for large but finite numbers of trials, in the form of inequalities that are valid unconditionally for any number of trials." (A. N. Kolmogorov [30])

Before Chebyshev the main interest in probability theory had been in the calculation of the probabilities of random events. He, however, was the first to realize clearly and exploit the full strength of the concepts of random variables and their mathematical expectations.

The leading exponent of Chebyshev's ideas was his devoted student Markov, to whom there belongs the indisputable credit of presenting his teacher's results with complete clarity. Among Markov's own significant contributions to probability theory were his pioneering investigations of limit theorems for sums of independent random variables and the creation of a new branch of probability theory, the theory of dependent random variables that form what we now call a Markov chain.

"... Markov's classical course in the calculus of probability and his original papers, which are models of precision and clarity, contributed to the greatest extent to the transformation of probability theory into one of the most significant branches of mathematics and to a wide extension of the ideas and methods of Chebyshev." (S. N. Bernstein [3])

To prove the central limit theorem of probability theory (the theorem on convergence to the normal distribution), Chebyshev and Markov used


what is known as the method of moments. With more general hypotheses and a simpler method, the method of characteristic functions, the theorem was obtained by Lyapunov. The subsequent development of the theory has shown that the method of characteristic functions is a powerful analytic tool for establishing the most diverse limit theorems.

The modern period in the development of probability theory begins with its axiomatization. The first work in this direction was done by S. N. Bernstein (1880-1968), R. von Mises (1883-1953), and E. Borel (1871-1956). A. N. Kolmogorov's book Foundations of the Theory of Probability appeared in 1933. Here he presented the axiomatic theory that has become generally accepted and is not only applicable to all the classical branches of probability theory, but also provides a firm foundation for the development of new branches that have arisen from questions in the sciences and involve infinite-dimensional distributions. The treatment in the present book is based on Kolmogorov's axiomatic approach. However, to prevent formalities and logical subtleties from obscuring the intuitive ideas, our exposition begins with the elementary theory of probability, whose elementariness is merely that in the corresponding probabilistic models we consider only experiments with finitely many outcomes. Thereafter we present the foundations of probability theory in their most general form.

The 1920s and '30s saw a rapid development of one of the new branches of probability theory, the theory of stochastic processes, which studies families of random variables that evolve with time. We have seen the creation of theories of Markov processes, stationary processes, martingales, and limit theorems for stochastic processes. Information theory is a recent addition.

The present book is principally concerned with stochastic processes with discrete parameters: random sequences.
However, the material presented in the second chapter provides a solid foundation (particularly of a logical nature) for the study of the general theory of stochastic processes.

It was also in the 1920s and '30s that mathematical statistics became a separate mathematical discipline. In a certain sense mathematical statistics deals with inverses of the problems of probability: if the basic aim of probability theory is to calculate the probabilities of complicated events under a given probabilistic model, mathematical statistics sets itself the inverse problem: to clarify the structure of probabilistic-statistical models by means of observations of various complicated events.

Some of the problems and methods of mathematical statistics are also discussed in this book. However, all that is presented in detail here is probability theory and the theory of stochastic processes with discrete parameters.

CHAPTER I

Elementary Probability Theory

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes

1. Let us consider an experiment of which all possible results are included in a finite number of outcomes ω_1, ..., ω_N. We do not need to know the nature of these outcomes, only that there are a finite number N of them.

We call ω_1, ..., ω_N elementary events, or sample points, and the finite set

Ω = {ω_1, ..., ω_N}

the space of elementary events or the sample space. The choice of the space of elementary events is the first step in formulating a probabilistic model for an experiment. Let us consider some examples of sample spaces.

Example 1. For a single toss of a coin the sample space Ω consists of two points:

Ω = {H, T},

where H = "head" and T = "tail". (We exclude possibilities like "the coin stands on edge," "the coin disappears," etc.)

Example 2. For n tosses of a coin the sample space is

Ω = {ω: ω = (a_1, ..., a_n), a_i = H or T}

and the general number N(Ω) of outcomes is 2^n.


Example 3. First toss a coin. If it falls "head" then toss a die (with six faces numbered 1, 2, 3, 4, 5, 6); if it falls "tail", toss the coin again. The sample space for this experiment is

Ω = {H1, H2, H3, H4, H5, H6, TH, TT}.

We now consider some more complicated examples involving the selection of n balls from an urn containing M distinguishable balls.

2. Example 4 (Sampling with replacement). This is an experiment in which after each step the selected ball is returned again. In this case each sample of n balls can be presented in the form (a_1, ..., a_n), where a_i is the label of the ball selected at the i-th step. It is clear that in sampling with replacement each a_i can have any of the M values 1, 2, ..., M. The description of the sample space depends in an essential way on whether we consider samples like, for example, (4, 1, 2, 1) and (1, 4, 2, 1) as different or the same. It is customary to distinguish two cases: ordered samples and unordered samples. In the first case samples containing the same elements, but arranged differently, are considered to be different. In the second case the order of the elements is disregarded and the two samples are considered to be the same. To emphasize which kind of sample we are considering, we use the notation (a_1, ..., a_n) for ordered samples and [a_1, ..., a_n] for unordered samples. Thus for ordered samples the sample space has the form

Ω = {ω: ω = (a_1, ..., a_n), a_i = 1, ..., M}

and the number of (different) outcomes is

N(Ω) = M^n.   (1)

If, however, we consider unordered samples, then

Ω = {ω: ω = [a_1, ..., a_n], a_i = 1, ..., M}.

Clearly the number N(Ω) of (different) unordered samples is smaller than the number of ordered samples. Let us show that in the present case

N(Ω) = C^n_{M+n-1},   (2)

where C^k_l = l!/[k!(l - k)!] is the number of combinations of l elements, taken k at a time. We prove this by induction. Let N(M, n) be the number of outcomes of interest. It is clear that when k ≤ M we have

N(k, 1) = k = C^1_k.


Now suppose that N(k, n) = C^n_{k+n-1} for k ≤ M; we show that this formula continues to hold when n is replaced by n + 1. For the unordered samples [a_1, ..., a_{n+1}] that we are considering, we may suppose that the elements are arranged in nondecreasing order: a_1 ≤ a_2 ≤ ... ≤ a_{n+1}. It is clear that the number of unordered samples with a_1 = 1 is N(M, n), the number with a_1 = 2 is N(M - 1, n), etc. Consequently

N(M, n + 1) = N(M, n) + N(M - 1, n) + ... + N(1, n)
            = C^n_{M+n-1} + C^n_{M-1+n-1} + ... + C^n_n
            = (C^{n+1}_{M+n} - C^{n+1}_{M+n-1}) + (C^{n+1}_{M-1+n} - C^{n+1}_{M-1+n-1}) + ... + (C^{n+1}_{n+1} - C^{n+1}_n)
            = C^{n+1}_{M+n};

here we have used the easily verified property

C^{k-1}_l + C^k_l = C^k_{l+1}

of the binomial coefficients.

Example 5 (Sampling without replacement). Suppose that n ≤ M and that the selected balls are not returned. In this case we again consider two possibilities, namely ordered and unordered samples.

For ordered samples without replacement the sample space is

Ω = {ω: ω = (a_1, ..., a_n), a_k ≠ a_l for k ≠ l; a_i = 1, ..., M},

and the number of elements of this set (called permutations) is M(M - 1)...(M - n + 1). We denote this by (M)_n or A^n_M and call it "the number of permutations of M things, n at a time."

For unordered samples (called combinations) the sample space

Ω = {ω: ω = [a_1, ..., a_n], a_k ≠ a_l for k ≠ l; a_i = 1, ..., M}

consists of

N(Ω) = C^n_M   (3)

elements. In fact, from each unordered sample [a_1, ..., a_n] consisting of distinct elements we can obtain n! ordered samples. Consequently N(Ω) · n! = (M)_n and therefore

N(Ω) = (M)_n / n! = C^n_M.

The results on the numbers of samples of n from an urn with M balls are presented in Table 1.
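For small M and n the four counts in Table 1 can be checked by direct enumeration. A minimal sketch in Python (the variable names are ours; the four combinatoric generators of itertools produce exactly the four kinds of samples):

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, perm

M, n = 3, 2  # the case displayed in Table 2

balls = range(1, M + 1)
counts = {
    "ordered, with replacement":      len(list(product(balls, repeat=n))),
    "unordered, with replacement":    len(list(combinations_with_replacement(balls, n))),
    "ordered, without replacement":   len(list(permutations(balls, n))),
    "unordered, without replacement": len(list(combinations(balls, n))),
}

# The closed-form counts from Table 1:
assert counts["ordered, with replacement"] == M ** n                # M^n
assert counts["unordered, with replacement"] == comb(M + n - 1, n)  # C^n_{M+n-1}
assert counts["ordered, without replacement"] == perm(M, n)         # (M)_n
assert counts["unordered, without replacement"] == comb(M, n)       # C^n_M
```

For M = 3, n = 2 the four counts are 9, 6, 6 and 3, matching the sample spaces listed in Table 2.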


Table 1

                       Ordered    Unordered
With replacement       M^n        C^n_{M+n-1}
Without replacement    (M)_n      C^n_M

For the case M = 3 and n = 2, the corresponding sample spaces are displayed in Table 2.

Table 2

With replacement:
  Ordered:    (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3)
  Unordered:  [1,1] [2,2] [3,3] [1,2] [1,3] [2,3]
Without replacement:
  Ordered:    (1,2) (1,3) (2,1) (2,3) (3,1) (3,2)
  Unordered:  [1,2] [1,3] [2,3]

Example 6 (Distribution of objects in cells). We consider the structure of the sample space in the problem of placing n objects (balls, etc.) in M cells (boxes, etc.). For example, such problems arise in statistical physics in studying the distribution of n particles (which might be protons, electrons, ...) among M states (which might be energy levels).

Let the cells be numbered 1, 2, ..., M, and suppose first that the objects are distinguishable (numbered 1, 2, ..., n). Then a distribution of the n objects among the M cells is completely described by an ordered set (a_1, ..., a_n), where a_i is the index of the cell containing object i. However, if the objects are indistinguishable their distribution among the M cells is completely determined by the unordered set [a_1, ..., a_n], where a_i is the index of the cell into which an object is put at the i-th step. Comparing this situation with Examples 4 and 5, we have the following correspondences:

(ordered samples) ↔ (distinguishable objects),
(unordered samples) ↔ (indistinguishable objects),


by which we mean that to an instance of an ordered (unordered) sample of n balls from an urn containing M balls there corresponds (one and only one) instance of distributing n distinguishable (indistinguishable) objects among M cells. In a similar sense we have the following correspondences:

(sampling with replacement) ↔ (a cell may receive any number of objects),
(sampling without replacement) ↔ (a cell may receive at most one object).

These correspondences generate others of the same kind:

(an unordered sample in sampling without replacement) ↔ (indistinguishable objects in the problem of distribution among cells when each cell may receive at most one object),

etc.; so that we can use Examples 4 and 5 to describe the sample space for the problem of distributing distinguishable or indistinguishable objects among cells either with exclusion (a cell may receive at most one object) or without exclusion (a cell may receive any number of objects).

Table 3 displays the distributions of two objects among three cells. For distinguishable objects, we denote them by W (white) and B (black). For indistinguishable objects, the presence of an object in a cell is indicated by a +.

Table 3

Distinguishable objects, without exclusion (nine placements):
[WB|  |  ]  [  |WB|  ]  [  |  |WB]
[W |B |  ]  [B |W |  ]  [W |  |B ]
[B |  |W ]  [  |W |B ]  [  |B |W ]

Distinguishable objects, with exclusion (six placements):
[W |B |  ]  [B |W |  ]  [W |  |B ]
[B |  |W ]  [  |W |B ]  [  |B |W ]

Indistinguishable objects, without exclusion (six placements):
[++|  |  ]  [  |++|  ]  [  |  |++]
[+ |+ |  ]  [+ |  |+ ]  [  |+ |+ ]

Indistinguishable objects, with exclusion (three placements):
[+ |+ |  ]  [+ |  |+ ]  [  |+ |+ ]


Table 4. N(Ω) in the problem of placing n objects in M cells and, equivalently, N(Ω) in the problem of choosing n balls from an urn containing M balls.

                         Distinguishable objects    Indistinguishable objects
                         (ordered samples)          (unordered samples)

Without exclusion        M^n                        C^n_{M+n-1}
(with replacement)       (Maxwell-Boltzmann         (Bose-Einstein
                          statistics)                statistics)

With exclusion           (M)_n                      C^n_M
(without replacement)                               (Fermi-Dirac
                                                     statistics)

The duality that we have observed between the two problems gives us an obvious way of finding the number of outcomes in the problem of placing n objects in cells. The results, which include the results in Table 1, are given in Table 4.

In statistical physics one says that distinguishable (or indistinguishable, respectively) particles that are not subject to the Pauli exclusion principle† obey Maxwell-Boltzmann statistics (or, respectively, Bose-Einstein statistics). If, however, the particles are indistinguishable and are subject to the exclusion principle, they obey Fermi-Dirac statistics (see Table 4). For example, electrons, protons and neutrons obey Fermi-Dirac statistics. Photons and pions obey Bose-Einstein statistics. Distinguishable particles that are subject to the exclusion principle do not occur in physics.
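The counts in Table 4 can likewise be checked by enumerating placements directly. A sketch for the case M = 3, n = 2 of Tables 3 and 4 (the variable names are ours):

```python
from itertools import product
from math import comb, perm

M, n = 3, 2  # two objects in three cells, as in Tables 3 and 4

# A placement of n distinguishable objects is a tuple (a_1, ..., a_n),
# where a_i is the cell containing object i.
placements = list(product(range(1, M + 1), repeat=n))

# Maxwell-Boltzmann: distinguishable objects, no exclusion: M^n placements.
assert len(placements) == M ** n

# Distinguishable objects with exclusion: (M)_n placements.
assert sum(len(set(p)) == n for p in placements) == perm(M, n)

# Bose-Einstein: indistinguishable, no exclusion; a placement is the
# multiset of occupied cells (encoded here as a sorted tuple).
bose = {tuple(sorted(p)) for p in placements}
assert len(bose) == comb(M + n - 1, n)

# Fermi-Dirac: indistinguishable, at most one object per cell;
# a placement is just the set of occupied cells.
fermi = {frozenset(p) for p in placements if len(set(p)) == n}
assert len(fermi) == comb(M, n)
```

Identifying placements under permutation of the objects (the sorted tuples and frozensets above) is exactly the passage from ordered to unordered samples in the duality of Examples 4 and 5.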

3. In addition to the concept of sample space we now need the fundamental concept of event. Experimenters are ordinarily interested, not in what particular outcome occurs as the result of a trial, but in whether the outcome belongs to some subset of the set of all possible outcomes. We shall describe as events all subsets A ⊆ Ω for which, under the conditions of the experiment, it is possible to say either "the outcome ω ∈ A" or "the outcome ω ∉ A."

† At most one particle in each cell. (Translator)

§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes


For example, let a coin be tossed three times. The sample space Ω consists of the eight points
Ω = {HHH, HHT, ..., TTT}

and if we are able to observe (determine, measure, etc.) the results of all three tosses, we say that the set A = {HHH, HHT, HTH, THH} is the event consisting of the appearance of at least two heads. If, however, we can determine only the result of the first toss, this set A cannot be considered to be an event, since there is no way to give either a positive or negative answer to the question of whether a specific outcome ω belongs to A.
Starting from a given collection of sets that are events, we can form new events by means of statements containing the logical connectives "or," "and," and "not," which correspond in the language of set theory to the operations "union," "intersection," and "complement."
If A and B are sets, their union, denoted by A ∪ B, is the set of points that belong either to A or to B:
A ∪ B = {ω ∈ Ω: ω ∈ A or ω ∈ B}.
In the language of probability theory, A ∪ B is the event consisting of the realization of either A or of B.
The intersection of A and B, denoted by A ∩ B, or by AB, is the set of points that belong to both A and B:
A ∩ B = {ω ∈ Ω: ω ∈ A and ω ∈ B}.

The event A ∩ B consists of the simultaneous realization of both A and B. For example, if A = {HH, HT, TH} and B = {TT, TH, HT} then
A ∪ B = {HH, HT, TH, TT} (= Ω),
A ∩ B = {TH, HT}.
If A is a subset of Ω, its complement, denoted by Ā, is the set of points of Ω that do not belong to A. If B\A denotes the difference of B and A (i.e. the set of points that belong to B but not to A) then Ā = Ω\A. In the language of probability, Ā is the event consisting of the nonrealization of A. For example, if A = {HH, HT, TH} then Ā = {TT}, the event in which two successive tails occur.
The sets A and Ā have no points in common, and consequently A ∩ Ā is empty. We denote the empty set by ∅. In probability theory, ∅ is called an impossible event. The set Ω is naturally called the certain event. When A and B are disjoint (AB = ∅), the union A ∪ B is called the sum of A and B and written A + B.
If we consider a collection 𝒜₀ of sets A ⊆ Ω, we may use the set-theoretic operations ∪, ∩ and \ to form a new collection of sets from the elements of


𝒜₀; these sets are again events. If we adjoin the certain and impossible events Ω and ∅, we obtain a collection 𝒜 of sets which is an algebra, i.e. a collection of subsets of Ω for which
(1) Ω ∈ 𝒜,
(2) if A ∈ 𝒜 and B ∈ 𝒜, the sets A ∪ B, A ∩ B, A\B also belong to 𝒜.
It follows from what we have said that it will be advisable to consider collections of events that form algebras. In the future we shall consider only such collections.
Here are some examples of algebras of events:

(a) {Ω, ∅}, the collection consisting of Ω and the empty set (we call this the trivial algebra);
(b) {A, Ā, Ω, ∅}, the collection generated by A;
(c) 𝒜 = {A: A ⊆ Ω}, the collection consisting of all the subsets of Ω (including the empty set ∅).

It is easy to check that all these algebras of events can be obtained from the following principle. We say that a collection
𝒟 = {D₁, ..., Dₙ}
of sets is a decomposition of Ω, and call the Dᵢ the atoms of the decomposition, if the Dᵢ are not empty, are pairwise disjoint, and their sum is Ω:
D₁ + ··· + Dₙ = Ω.

For example, if Ω consists of three points, Ω = {1, 2, 3}, there are five different decompositions:
𝒟₁ = {D₁} with D₁ = {1, 2, 3};
𝒟₂ = {D₁, D₂} with D₁ = {1, 2}, D₂ = {3};
𝒟₃ = {D₁, D₂} with D₁ = {1, 3}, D₂ = {2};
𝒟₄ = {D₁, D₂} with D₁ = {2, 3}, D₂ = {1};
𝒟₅ = {D₁, D₂, D₃} with D₁ = {1}, D₂ = {2}, D₃ = {3}.

(For the general number of decompositions of a finite set, see Problem 2.)
If we consider all unions of the sets in 𝒟, the resulting collection of sets, together with the empty set, forms an algebra, called the algebra induced by 𝒟 and denoted by α(𝒟). Thus the elements of α(𝒟) consist of the empty set together with the sums of sets which are atoms of 𝒟. Thus if 𝒟 is a decomposition, there is associated with it a specific algebra ℬ = α(𝒟).
The converse is also true. Let ℬ be an algebra of subsets of a finite space Ω. Then there is a unique decomposition 𝒟 whose atoms are the elements of


ℬ, with ℬ = α(𝒟). In fact, let D ∈ ℬ have the property that for every B ∈ ℬ the set D ∩ B either coincides with D or is empty. Then the collection of such sets D forms a decomposition 𝒟 with the required property α(𝒟) = ℬ. In Example (a), 𝒟 is the trivial decomposition consisting of the single set D₁ = Ω; in (b), 𝒟 = {A, Ā}. The most fine-grained decomposition 𝒟, which consists of the singletons {ωᵢ}, ωᵢ ∈ Ω, induces the algebra in Example (c), i.e. the algebra of all subsets of Ω.
Let 𝒟₁ and 𝒟₂ be two decompositions. We say that 𝒟₂ is finer than 𝒟₁, and write 𝒟₁ ≼ 𝒟₂, if α(𝒟₁) ⊆ α(𝒟₂).
Let us show that if Ω consists, as we assumed above, of a finite number of points ω₁, ..., ω_N, then the number N(𝒜) of sets in the collection 𝒜 of all subsets of Ω is equal to 2^N. In fact, every nonempty set A ∈ 𝒜 can be represented as A = {ω_{i₁}, ..., ω_{iₖ}}, where ω_{iⱼ} ∈ Ω and 1 ≤ k ≤ N. With this set we associate the sequence of zeros and ones

(0, ..., 0, 1, 0, ..., 0, 1, ...),
where there are ones in the positions i₁, ..., iₖ and zeros elsewhere. Then for a given k the number of different sets A of the form {ω_{i₁}, ..., ω_{iₖ}} is the same as the number of ways in which k ones (k indistinguishable objects) can be placed in N positions (N cells). According to Table 4 (see the lower right-hand square) this number is C^k_N. Hence (counting the empty set) we find that
N(𝒜) = 1 + C^1_N + ··· + C^N_N = (1 + 1)^N = 2^N.
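The correspondence between decompositions and algebras is easy to explore by brute force (a sketch with our own helper names): form all sums of the atoms of a decomposition to obtain α(𝒟), and check the 2^N count for the finest decomposition.

```python
from itertools import combinations

def induced_algebra(atoms):
    """alpha(D): the empty set together with all unions (sums) of atoms of D."""
    algebra = set()
    for r in range(len(atoms) + 1):
        for chosen in combinations(atoms, r):
            algebra.add(frozenset().union(*chosen))
    return algebra

omega = {1, 2, 3}
finest = [frozenset({x}) for x in omega]        # atoms {1}, {2}, {3}
print(len(induced_algebra(finest)))             # 2^3 = 8: all subsets of Omega

coarse = [frozenset({1, 2}), frozenset({3})]    # the decomposition D_2 above
print(len(induced_algebra(coarse)))             # 4 sets: {}, {1,2}, {3}, {1,2,3}
```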

4. We have now taken the first two steps in defining a probabilistic model of an experiment with a finite number of outcomes: we have selected a sample space and a collection 𝒜 of subsets, which form an algebra and are called events. We now take the next step, to assign to each sample point (outcome) ωᵢ ∈ Ω, i = 1, ..., N, a weight. This is denoted by p(ωᵢ) and called the probability of the outcome ωᵢ; we assume that it has the following properties:
(a) 0 ≤ p(ωᵢ) ≤ 1 (nonnegativity),
(b) p(ω₁) + ··· + p(ω_N) = 1 (normalization).
Starting from the given probabilities p(ωᵢ) of the outcomes ωᵢ, we define the probability P(A) of any event A ∈ 𝒜 by
P(A) = Σ_{{i: ωᵢ ∈ A}} p(ωᵢ).  (4)

Finally, we say that a triple
(Ω, 𝒜, P),
where Ω = {ω₁, ..., ω_N}, 𝒜 is an algebra of subsets of Ω, and P = {P(A); A ∈ 𝒜},


defines (or assigns) a probabilistic model, or a probability space, of experiments with a (finite) space Ω of outcomes and algebra 𝒜 of events.
The following properties of probability follow from (4):
P(∅) = 0,  (5)
P(Ω) = 1,  (6)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).  (7)
In particular, if A ∩ B = ∅, then
P(A + B) = P(A) + P(B)  (8)
and
P(Ā) = 1 − P(A).  (9)
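Properties (5)–(9) are easy to confirm numerically on a toy space (our own example, using exact rational weights):

```python
from fractions import Fraction

omega = {0, 1, 2, 3}
p = {w: Fraction(1, 4) for w in omega}     # equal weights, summing to 1

def P(A):
    """Formula (4): P(A) is the sum of the weights of the points of A."""
    return sum(p[w] for w in A)

A, B = {0, 1}, {1, 2}
print(P(set()) == 0 and P(omega) == 1)                 # (5), (6): True
print(P(A | B) == P(A) + P(B) - P(A & B))              # (7): True
print(P(omega - A) == 1 - P(A))                        # (9): True
print(P(A | B))                                        # 3/4
```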

5. In constructing a probabilistic model for a specific situation, the construction of the sample space Ω and the algebra 𝒜 of events is ordinarily not difficult. In elementary probability theory one usually takes the algebra 𝒜 to be the algebra of all subsets of Ω. Any difficulty that may arise is in assigning probabilities to the sample points. In principle, the solution to this problem lies outside the domain of probability theory, and we shall not consider it in detail. We consider that our fundamental problem is not the question of how to assign probabilities, but how to calculate the probabilities of complicated events (elements of 𝒜) from the probabilities of the sample points.
It is clear from a mathematical point of view that for finite sample spaces we can obtain all conceivable (finite) probability spaces by assigning nonnegative numbers p₁, ..., p_N, satisfying the condition p₁ + ··· + p_N = 1, to the outcomes ω₁, ..., ω_N.
The validity of the assignment of the numbers p₁, ..., p_N can, in specific cases, be checked to a certain extent by using the law of large numbers (which will be discussed later on). It states that in a long series of "independent" experiments, carried out under identical conditions, the frequencies with which the elementary events appear are "close" to their probabilities.
In connection with the difficulty of assigning probabilities to outcomes, we note that there are many actual situations in which for reasons of symmetry it seems reasonable to consider all conceivable outcomes as equally probable. In such cases, if the sample space consists of points ω₁, ..., ω_N, with N < ∞, we put
p(ω₁) = ··· = p(ω_N) = 1/N,
and consequently
P(A) = N(A)/N  (10)


for every event A ∈ 𝒜, where N(A) is the number of sample points in A. This is called the classical method of assigning probabilities. It is clear that in this case the calculation of P(A) reduces to calculating the number of outcomes belonging to A. This is usually done by combinatorial methods, so that combinatorics, applied to finite sets, plays a significant role in the calculus of probabilities.

Example 7 (Coincidence problem). Let an urn contain M balls numbered 1, 2, ..., M. We draw an ordered sample of size n with replacement. It is clear that then
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 1, ..., M}
and N(Ω) = M^n. Using the classical assignment of probabilities, we consider the M^n outcomes equally probable and ask for the probability of the event
A = {ω: ω = (a₁, ..., aₙ), aᵢ ≠ aⱼ for i ≠ j},
i.e., the event in which there is no repetition. Clearly N(A) = M(M − 1)···(M − n + 1), and therefore
P(A) = (M)ₙ/M^n = (1 − 1/M)(1 − 2/M)···(1 − (n − 1)/M).  (11)

This problem has the following striking interpretation. Suppose that there are n students in a class. Let us suppose that each student's birthday is on one of 365 days and that all days are equally probable. The question is, what is the probability Pₙ that there are at least two students in the class whose birthdays coincide? If we interpret the selection of birthdays as the selection of balls from an urn containing 365 balls, then by (11)
Pₙ = 1 − (365)ₙ/365^n.

The following table lists the values of Pₙ for some values of n:

  n    4      16     22     23     40     64
  Pₙ   0.016  0.284  0.476  0.507  0.891  0.997
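The table is easy to reproduce (a verification sketch using exact arithmetic; not part of the original text):

```python
from fractions import Fraction

def birthday_coincidence(n, days=365):
    """P_n = 1 - (days)_n / days^n: probability of at least one shared birthday."""
    no_repeat = Fraction(1)
    for k in range(n):
        no_repeat *= Fraction(days - k, days)
    return 1 - no_repeat

for n in (4, 16, 22, 23, 40, 64):
    print(n, round(float(birthday_coincidence(n)), 3))
# n = 23 is the smallest class size with P_n > 1/2
```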

It is interesting to note that (unexpectedly!) the size of class in which there is probability ½ of finding at least two students with the same birthday is not very large: only 23.

Example 8 (Prizes in a lottery). Consider a lottery that is run in the following way. There are M tickets numbered 1, 2, ..., M, of which n, numbered 1, ..., n, win prizes (M ≥ 2n). You buy n tickets, and ask for the probability (P, say) of winning at least one prize.


Since the order in which the tickets are drawn plays no role in the presence or absence of winners in your purchase, we may suppose that the sample space has the form
Ω = {ω: ω = [a₁, ..., aₙ], aₖ ≠ a_l for k ≠ l, aᵢ = 1, ..., M}.
By Table 1, N(Ω) = C^n_M. Now let
A₀ = {ω: ω = [a₁, ..., aₙ], aₖ ≠ a_l for k ≠ l, aᵢ = n + 1, ..., M}
be the event that there is no winner in the set of tickets you bought. Again by Table 1, N(A₀) = C^n_{M−n}. Therefore
P(A₀) = C^n_{M−n}/C^n_M = (M − n)ₙ/(M)ₙ,
and consequently
P = 1 − P(A₀) = 1 − (1 − n/M)(1 − n/(M − 1))···(1 − n/(M − n + 1)).
If M = n² and n → ∞, then P(A₀) → e⁻¹ and
P → 1 − e⁻¹ ≈ 0.632.
The convergence is quite fast: for n = 10 the probability is already P = 0.670.
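The limit 1 − e⁻¹ and the value for n = 10 can be verified directly (a sketch; `win_prob` is our own name):

```python
from math import comb, exp

def win_prob(n):
    """P = 1 - C^n_{M-n} / C^n_M for the lottery with M = n^2 tickets."""
    M = n * n
    return 1 - comb(M - n, n) / comb(M, n)

print(round(win_prob(10), 3))     # 0.670, as stated in the text
print(round(1 - exp(-1), 3))      # 0.632, the limiting value
```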

6. PROBLEMS

1. Establish the following properties of the operations ∩ and ∪:
A ∪ B = B ∪ A, AB = BA (commutativity),
A ∪ (B ∪ C) = (A ∪ B) ∪ C, A(BC) = (AB)C (associativity),
A(B ∪ C) = AB ∪ AC, A ∪ (BC) = (A ∪ B)(A ∪ C) (distributivity),
A ∪ A = A, AA = A (idempotency).
Show also that
(A ∪ B)‾ = Ā ∩ B̄, (AB)‾ = Ā ∪ B̄.

2. Let Ω contain N elements. Show that the number d(N) of different decompositions of Ω is given by the formula
d(N) = e⁻¹ Σ_{k=0}^{∞} k^N/k!.  (12)
(Hint: Show that
d(N) = Σ_{k=0}^{N−1} C^k_{N−1} d(k), where d(0) = 1,
and then verify that the series in (12) satisfies the same recurrence relation.)


3. For any finite collection of sets A₁, ..., Aₙ,
P(A₁ ∪ ··· ∪ Aₙ) ≤ P(A₁) + ··· + P(Aₙ).

4. Let A and B be events. Show that AB̄ ∪ ĀB is the event in which exactly one of A and B occurs. Moreover,
P(AB̄ ∪ ĀB) = P(A) + P(B) − 2P(AB).

5. Let A₁, ..., Aₙ be events, and define S₀, S₁, ..., Sₙ as follows: S₀ = 1,
Sᵣ = Σ P(A_{k₁} ∩ ··· ∩ A_{kᵣ}), 1 ≤ r ≤ n,
where the sum is over the unordered subsets Jᵣ = [k₁, ..., kᵣ] of {1, ..., n}. Let Bₘ be the event in which exactly m of the events A₁, ..., Aₙ occur simultaneously. Show that
P(Bₘ) = Σ_{r=m}^{n} (−1)^{r−m} C^m_r Sᵣ.
In particular, for m = 0,
P(B₀) = 1 − S₁ + S₂ − ··· ± Sₙ.
Show also that the probability that at least m of the events A₁, ..., Aₙ occur simultaneously is
P(Bₘ) + ··· + P(Bₙ) = Σ_{r=m}^{n} (−1)^{r−m} C^{m−1}_{r−1} Sᵣ.
In particular, the probability that at least one of the events A₁, ..., Aₙ occurs is
P(B₁) + ··· + P(Bₙ) = S₁ − S₂ + ··· ∓ Sₙ.

§2. Some Classical Models and Distributions

1. Binomial distribution. Let a coin be tossed n times, and record the results as an ordered set (a₁, ..., aₙ), where aᵢ = 1 for a head ("success") and aᵢ = 0 for a tail ("failure"). The sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1}.

To each sample point ω = (a₁, ..., aₙ) we assign the probability
p(ω) = p^{Σaᵢ} q^{n−Σaᵢ},
where the nonnegative numbers p and q satisfy p + q = 1.
In the first place, we verify that this assignment of the weights p(ω) is consistent. It is enough to show that Σ_{ω∈Ω} p(ω) = 1. We consider all outcomes ω = (a₁, ..., aₙ) for which Σᵢ aᵢ = k, where k = 0, 1, ..., n. According to Table 4 (distribution of k indistinguishable


ones in n places) the number of these outcomes is C^k_n. Therefore
Σ_{ω∈Ω} p(ω) = Σ_{k=0}^{n} C^k_n p^k q^{n−k} = (p + q)^n = 1.

Thus the space Ω together with the collection 𝒜 of all its subsets and the probabilities P(A) = Σ_{ω∈A} p(ω), A ∈ 𝒜, defines a probabilistic model. It is natural to call this the probabilistic model for n tosses of a coin.
In the case n = 1, when the sample space contains just the two points ω = 1 ("success") and ω = 0 ("failure"), it is natural to call p(1) = p the probability of success. We shall see later that this model for n tosses of a coin can be thought of as the result of n "independent" experiments with probability p of success at each trial.
Let us consider the events
Aₖ = {ω: ω = (a₁, ..., aₙ), a₁ + ··· + aₙ = k}, k = 0, 1, ..., n,
consisting of exactly k successes. It follows from what we said above that
P(Aₖ) = C^k_n p^k q^{n−k}  (1)
and Σ_{k=0}^{n} P(Aₖ) = 1.
The set of probabilities (P(A₀), ..., P(Aₙ)) is called the binomial distribution (of the number of successes in a sample of size n). This distribution plays an extremely important role in probability theory since it arises in the most diverse probabilistic models. We write Pₙ(k) = P(Aₖ), k = 0, 1, ..., n. Figure 1 shows the binomial distribution in the case p = ½ (symmetric coin) for n = 5, 10, 20.
We now present a different model (in essence, equivalent to the preceding one) which describes the random walk of a "particle." Let the particle start at the origin, and after unit time let it take a unit step upward or downward (Figure 2). Consequently after n steps the particle can have moved at most n units up or n units down. It is clear that each path ω of the particle is completely specified by a set (a₁, ..., aₙ), where aᵢ = +1 if the particle moves up at the ith step, and aᵢ = −1 if it moves down. Let us assign to each path ω the weight p(ω) = p^{ν(ω)} q^{n−ν(ω)}, where ν(ω) is the number of +1's in the sequence ω = (a₁, ..., aₙ), i.e. ν(ω) = [(a₁ + ··· + aₙ) + n]/2, and the nonnegative numbers p and q satisfy p + q = 1.
Since Σ_{ω∈Ω} p(ω) = 1, the set of probabilities p(ω) together with the space Ω of paths ω = (a₁, ..., aₙ) and its subsets defines an acceptable probabilistic model of the motion of the particle for n steps.
Let us ask the following question: What is the probability of the event Aₖ that after n steps the particle is at a point with ordinate k? This condition is satisfied by those paths ω for which ν(ω) − (n − ν(ω)) = k, i.e.
ν(ω) = (n + k)/2.
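The coin-tossing model above can be checked by brute force for a small n (a verification sketch of formula (1) and of the consistency of the weights; our own check, not from the text):

```python
from fractions import Fraction
from itertools import product
from math import comb

n, p = 5, Fraction(1, 3)
q = 1 - p

def weight(omega):
    """p(w) = p^(sum a_i) * q^(n - sum a_i)."""
    k = sum(omega)
    return p ** k * q ** (n - k)

outcomes = list(product((0, 1), repeat=n))            # all 2^n outcomes
assert sum(weight(w) for w in outcomes) == 1          # the weights sum to 1

for k in range(n + 1):
    P_Ak = sum(weight(w) for w in outcomes if sum(w) == k)
    assert P_Ak == comb(n, k) * p ** k * q ** (n - k) # formula (1)
print("formula (1) checked for n =", n)
```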

Figure 1. Graph of the binomial probabilities Pₙ(k) for n = 5, 10, 20.

The number of such paths (see Table 4) is C^{(n+k)/2}_n, and therefore
P(Aₖ) = C^{(n+k)/2}_n p^{(n+k)/2} q^{(n−k)/2}.
Consequently the binomial distribution (P(A₋ₙ), ..., P(A₀), ..., P(Aₙ)) can be said to describe the probability distribution of the position of the particle after n steps. Note that in the symmetric case (p = q = ½), when the probabilities of the individual paths are equal to 2⁻ⁿ,
P(Aₖ) = C^{(n+k)/2}_n · 2⁻ⁿ.
Let us investigate the asymptotic behavior of these probabilities for large n. If the number of steps is 2n, it follows from the properties of the binomial coefficients that the largest of the probabilities P(Aₖ), |k| ≤ 2n, is
P(A₀) = C^n_{2n} · 2⁻²ⁿ.

Figure 2


Figure 3. Beginning of the binomial distribution.

From Stirling's formula (see formula (6) in Section 4),
n! ∼ √(2πn) e⁻ⁿ nⁿ.†
Consequently
C^n_{2n} = (2n)!/(n!)² ∼ 2²ⁿ/√(πn),
and therefore for large n
P(A₀) ∼ 1/√(πn).
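The quality of this approximation is easy to observe numerically (a sketch, not from the text):

```python
from math import comb, pi, sqrt

# P(A_0) = C^n_{2n} 2^(-2n) compared with its approximation 1/sqrt(pi n)
for n in (10, 100, 1000):
    exact = comb(2 * n, n) / 4 ** n
    approx = 1 / sqrt(pi * n)
    print(n, round(exact / approx, 5))
# the ratio approaches 1 as n grows (roughly like 1 - 1/(8n))
```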

Figure 3 represents the beginning of the binomial distribution for 2n steps of a random walk (in contrast to Figure 2, the time axis is now directed upward).

2. Multinomial distribution. Generalizing the preceding model, we now suppose that the sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = b₁, ..., bᵣ},

where b₁, ..., bᵣ are given numbers. Let νᵢ(ω) be the number of elements of ω = (a₁, ..., aₙ) that are equal to bᵢ, i = 1, ..., r, and define the probability of ω by
p(ω) = p₁^{ν₁(ω)} ··· pᵣ^{νᵣ(ω)},
where pᵢ ≥ 0 and p₁ + ··· + pᵣ = 1. Note that
Σ_{ω∈Ω} p(ω) = Σ Cₙ(n₁, ..., nᵣ) p₁^{n₁} ··· pᵣ^{nᵣ},
where the sum on the right is over all nonnegative n₁, ..., nᵣ with n₁ + ··· + nᵣ = n, and Cₙ(n₁, ..., nᵣ) is the number of (ordered) sequences (a₁, ..., aₙ) in which b₁ occurs n₁ times, ..., bᵣ occurs nᵣ times. Since n₁ elements b₁ can

† The notation f(n) ∼ g(n) means that f(n)/g(n) → 1 as n → ∞.


be distributed into n positions in C^{n₁}_n ways, n₂ elements b₂ into the remaining n − n₁ positions in C^{n₂}_{n−n₁} ways, etc., we have
Cₙ(n₁, ..., nᵣ) = C^{n₁}_n · C^{n₂}_{n−n₁} ··· = [n!/(n₁!(n − n₁)!)]·[(n − n₁)!/(n₂!(n − n₁ − n₂)!)] ··· = n!/(n₁! ··· nᵣ!).
Therefore
Σ_{ω∈Ω} p(ω) = Σ_{{n₁≥0,...,nᵣ≥0, n₁+···+nᵣ=n}} [n!/(n₁! ··· nᵣ!)] p₁^{n₁} ··· pᵣ^{nᵣ} = (p₁ + ··· + pᵣ)^n = 1,
and consequently we have defined an acceptable method of assigning probabilities. Let
A_{n₁,...,nᵣ} = {ω: ν₁(ω) = n₁, ..., νᵣ(ω) = nᵣ}.
Then
P(A_{n₁,...,nᵣ}) = [n!/(n₁! ··· nᵣ!)] p₁^{n₁} ··· pᵣ^{nᵣ}.  (2)
The set of probabilities
{P(A_{n₁,...,nᵣ})}
is called the multinomial (or polynomial) distribution.
We emphasize that both this distribution and its special case, the binomial distribution, originate from problems about sampling with replacement.

3. The multidimensional hypergeometric distribution occurs in problems that

involve sampling without replacement. Consider, for example, an urn containing M balls numbered 1, 2, ..., M, where M₁ balls have the color b₁, ..., Mᵣ balls have the color bᵣ, and M₁ + ··· + Mᵣ = M. Suppose that we draw a sample of size n < M without replacement. The sample space is
Ω = {ω: ω = (a₁, ..., aₙ), aₖ ≠ a_l for k ≠ l, aᵢ = 1, ..., M}
and N(Ω) = (M)ₙ. Let us suppose that the sample points are equiprobable, and find the probability of the event B_{n₁,...,nᵣ} in which n₁ balls have color b₁, ..., nᵣ balls have color bᵣ, where n₁ + ··· + nᵣ = n. It is easy to show that
N(B_{n₁,...,nᵣ}) = Cₙ(n₁, ..., nᵣ)(M₁)_{n₁} ··· (Mᵣ)_{nᵣ},


and therefore
P(B_{n₁,...,nᵣ}) = N(B_{n₁,...,nᵣ})/N(Ω) = [C^{n₁}_{M₁} ··· C^{nᵣ}_{Mᵣ}]/C^n_M.  (3)
The set of probabilities {P(B_{n₁,...,nᵣ})} is called the multidimensional hypergeometric distribution. When r = 2 it is simply called the hypergeometric distribution because its "generating function" is a hypergeometric function.
The structure of the multidimensional hypergeometric distribution is rather complicated. For example, the probability
P(B_{n₁,n₂}) = C^{n₁}_{M₁} C^{n₂}_{M₂}/C^n_M, n₁ + n₂ = n, M₁ + M₂ = M,  (4)
contains nine factorials. However, it is easily established that if M → ∞ and M₁ → ∞ in such a way that M₁/M → p (and therefore M₂/M → 1 − p) then
P(B_{n₁,n₂}) → C^{n₁}_n p^{n₁}(1 − p)^{n₂}.  (5)
In other words, under the present hypotheses the hypergeometric distribution is approximated by the binomial; this is intuitively clear since when M and M₁ are large (but finite), sampling without replacement ought to give almost the same result as sampling with replacement.

Example. Let us use (4) to find the probability of picking six "lucky" numbers in a lottery of the following kind (this is an abstract formulation of the "sportloto," which is well known in Russia): There are 49 balls numbered from 1 to 49; six of them are lucky (colored red, say, whereas the rest are white). We draw a sample of six balls, without replacement. The question is, What is the probability that all six of these balls are lucky? Taking M = 49, M₁ = 6, n₁ = 6, n₂ = 0, we see that the event of interest, namely
B₆,₀ = {6 balls, all lucky},
has, by (4), probability
P(B₆,₀) = 1/C^6_{49} ≈ 7.2 × 10⁻⁸.
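The sportloto computation is easy to reproduce from formula (4) (a verification sketch; helper name ours):

```python
from math import comb

M, M1, n = 49, 6, 6

def hypergeom(n1, n2):
    """Formula (4): P(B_{n1,n2}) = C^{n1}_{M1} C^{n2}_{M2} / C^n_M."""
    return comb(M1, n1) * comb(M - M1, n2) / comb(M, n1 + n2)

print(comb(M, n))        # C^6_49 = 13,983,816 possible tickets
print(hypergeom(6, 0))   # about 7.2e-08, as computed in the text
```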

4. The numbers n! increase extremely rapidly with n. For example,
10! = 3,628,800,
15! = 1,307,674,368,000,
and 100! has 158 digits. Hence, from either the theoretical or the computational point of view, it is important to know Stirling's formula,
n! = √(2πn) (n/e)^n exp(θₙ/12n), 0 < θₙ < 1,  (6)


whose proof can be found in most textbooks on mathematical analysis (see also [69]).

5. PROBLEMS

1. Prove formula (5).

2. Show that for the multinomial distribution {P(A_{n₁,...,nᵣ})} the maximum probability is attained at a point (k₁, ..., kᵣ) that satisfies the inequalities npᵢ − 1 < kᵢ ≤ (n + r − 1)pᵢ, i = 1, ..., r.

3. One-dimensional Ising model. Consider n particles located at the points 1, 2, ..., n. Suppose that each particle is of one of two types, and that there are n₁ particles of the first type and n₂ of the second (n₁ + n₂ = n). We suppose that all n! arrangements of the particles are equally probable. Construct a corresponding probabilistic model and find the probability of the event Aₙ(m₁₁, m₁₂, m₂₁, m₂₂) = {ν₁₁ = m₁₁, ..., ν₂₂ = m₂₂}, where νᵢⱼ is the number of particles of type i following particles of type j (i, j = 1, 2).

4. Prove the following identities by probabilistic reasoning:
Σ_{k=0}^{n} (C^k_n)² = C^n_{2n},
Σ_{k=0}^{n} (−1)^{n−k} C^k_m = C^n_{m−1}, m ≥ n + 1,
Σ_{k=0}^{m} k(k − 1) C^k_m = m(m − 1) 2^{m−2}, m ≥ 2.

§3. Conditional Probability. Independence

1. The concept of the probability of an event lets us answer questions of the following kind: if an urn contains M balls, M₁ white and M₂ black, what is the probability P(A) of the event A that a selected ball is white? With the classical approach, P(A) = M₁/M.
The concept of conditional probability, which will be introduced below, lets us answer questions of the following kind: What is the probability that the second ball is white (event B) under the condition that the first ball was also white (event A)? (We are thinking of sampling without replacement.)
It is natural to reason as follows: if the first ball is white, then at the second step we have an urn containing M − 1 balls, of which M₁ − 1 are white and M₂ black; hence it seems reasonable to suppose that the (conditional) probability in question is (M₁ − 1)/(M − 1).


We now give a definition of conditional probability that is consistent with our intuitive ideas. Let (Ω, 𝒜, P) be a (finite) probability space and A an event (i.e. A ∈ 𝒜).

Definition 1. The conditional probability of event B assuming event A, with P(A) > 0 (denoted by P(B|A)), is
P(B|A) = P(AB)/P(A).  (1)

In the classical approach we have P(A) = N(A)/N(Ω), P(AB) = N(AB)/N(Ω), and therefore
P(B|A) = N(AB)/N(A).  (2)

From Definition 1 we immediately get the following properties of conditional probability:
P(A|A) = 1,
P(∅|A) = 0,
P(B|A) = 1 for B ⊇ A,
P(B₁ + B₂|A) = P(B₁|A) + P(B₂|A).
It follows from these properties that for a given set A the conditional probability P(·|A) has the same properties on the space (Ω ∩ A, 𝒜 ∩ A), where 𝒜 ∩ A = {B ∩ A: B ∈ 𝒜}, that the original probability P(·) has on (Ω, 𝒜).
Note that
P(B|A) + P(B̄|A) = 1;
however, in general
P(B|A) + P(B|Ā) ≠ 1,
P(B|A) + P(B̄|Ā) ≠ 1.

Example 1. Consider a family with two children. We ask for the probability that both children are boys, assuming
(a) that the older child is a boy;
(b) that at least one of the children is a boy.


The sample space is
Ω = {BB, BG, GB, GG},
where BG means that the older child is a boy and the younger is a girl, etc.
Let us suppose that all sample points are equally probable:
P(BB) = P(BG) = P(GB) = P(GG) = ¼.
Let A be the event that the older child is a boy, and B, that the younger child is a boy. Then A ∪ B is the event that at least one child is a boy, and AB is the event that both children are boys. In question (a) we want the conditional probability P(AB|A), and in (b), the conditional probability P(AB|A ∪ B).
It is easy to see that
P(AB|A) = P(AB)/P(A) = (¼)/(½) = ½,
P(AB|A ∪ B) = P(AB)/P(A ∪ B) = (¼)/(¾) = ⅓.
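Example 1 can be reproduced by direct enumeration (a sketch with our own helper names):

```python
from fractions import Fraction

omega = ["BB", "BG", "GB", "GG"]   # first letter: older child; B = boy, G = girl

def P(event):
    """Equiprobable sample points, as in the text."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "B"}   # the older child is a boy
B = {w for w in omega if w[1] == "B"}   # the younger child is a boy

print(P(A & B) / P(A))      # P(AB|A)     = 1/2
print(P(A & B) / P(A | B))  # P(AB|A u B) = 1/3
```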

2. The simple but important formula (3), below, is called the formula for total probability. It provides the basic means for calculating the probabilities of complicated events by using conditional probabilities.
Consider a decomposition 𝒟 = {A₁, ..., Aₙ} with P(Aᵢ) > 0, i = 1, ..., n (such a decomposition is often called a complete set of disjoint events). It is clear that
B = BA₁ + ··· + BAₙ
and therefore
P(B) = Σ_{i=1}^{n} P(BAᵢ).
But
P(BAᵢ) = P(B|Aᵢ)P(Aᵢ).
Hence we have the formula for total probability:
P(B) = Σ_{i=1}^{n} P(B|Aᵢ)P(Aᵢ).  (3)
In particular, if 0 < P(A) < 1, then
P(B) = P(B|A)P(A) + P(B|Ā)P(Ā).  (4)


Example 2. An urn contains M balls, m of which are "lucky." We ask for the probability that the second ball drawn is lucky (assuming that the result of the first draw is unknown, that a sample of size 2 is drawn without replacement, and that all outcomes are equally probable). Let A be the event that the first ball is lucky, and B the event that the second is lucky. Then
P(B|A) = P(BA)/P(A) = [m(m − 1)/(M(M − 1))]/(m/M) = (m − 1)/(M − 1),
P(B|Ā) = P(BĀ)/P(Ā) = [m(M − m)/(M(M − 1))]/((M − m)/M) = m/(M − 1)
and
P(B) = P(B|A)P(A) + P(B|Ā)P(Ā) = [(m − 1)/(M − 1)]·(m/M) + [m/(M − 1)]·[(M − m)/M] = m/M.
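Example 2 can be confirmed by brute force for particular values of M and m (our own check; M = 5, m = 2 are arbitrary):

```python
from fractions import Fraction
from itertools import permutations

M, m = 5, 2
draws = list(permutations(range(M), 2))   # equiprobable ordered samples of size 2
                                          # balls 0, ..., m-1 are the lucky ones

B = [d for d in draws if d[1] < m]        # second ball lucky
A = [d for d in draws if d[0] < m]        # first ball lucky
AB = [d for d in A if d[1] < m]

print(Fraction(len(B), len(draws)))       # P(B)   = m/M         = 2/5
print(Fraction(len(AB), len(A)))          # P(B|A) = (m-1)/(M-1) = 1/4
```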

It is interesting to observe that P(A) is precisely m/M. Hence, when the nature of the first ball is unknown, it does not affect the probability that the second ball is lucky.
By the definition of conditional probability (with P(A) > 0),
P(AB) = P(B|A)P(A).  (5)

This formula, the multiplication formula for probabilities, can be generalized (by induction) as follows: If A₁, ..., A_{n−1} are events with P(A₁ ··· A_{n−1}) > 0, then
P(A₁ ··· Aₙ) = P(A₁)P(A₂|A₁) ··· P(Aₙ|A₁ ··· A_{n−1})  (6)
(here A₁ ··· Aₙ = A₁ ∩ A₂ ∩ ··· ∩ Aₙ).

3. Suppose that A and B are events with P(A) > 0 and P(B) > 0. Then along with (5) we have the parallel formula
P(AB) = P(A|B)P(B).  (7)
From (5) and (7) we obtain Bayes's formula
P(A|B) = P(A)P(B|A)/P(B).  (8)


If the events A₁, ..., Aₙ form a decomposition of Ω, then (3) and (8) imply Bayes's theorem:
P(Aᵢ|B) = P(Aᵢ)P(B|Aᵢ) / Σ_{j=1}^{n} P(Aⱼ)P(B|Aⱼ).  (9)

P(A;IB) = LJ=l P(Aj)P(BIA).

In statistical applications, A 1, ••• , An (A 1 + · · · +An = Q) are often called hypotheses, and P(Ai) is called the a priorit probability of Ai. The conditional probability P(A; IB) is considered as the a posteriori probability of A; after the occurrence of event B. ExAMPLE

3. Let an urn contain two coins: A 1, a fair coin with probability

! of falling H; and A 2 , a biased coin with probability-! of falling H. A coin is drawn at random and tossed. Suppose that it falls head. We ask for the probability that the fair coin was selected. Let us construct the corresponding probabilistic model. Here it is natural to take the sample space to be the set n = {A 1 H, A 1 T, A 2 H, A 2 T}, which describes all possible outcomes of a selection and a toss (A 1 H means that coin A 1 was selected and fell heads, etc.) The probabilities p(w) of the various outcomes have to be assigned so that, according to the statement of the problem, P(Ad

= P(A 2 ) =!

and P(HIA 2 )

=!.

With these assignments, the probabilities of the sample points are uniquely determined : P(A 2 T)

=!.

Then by Bayes's formula the probability in question is P(A 1 )P(HIA 1 ) P(AliH) = P(Al)P(HIAl) + P(A2)P(HIA2)

3

S'

and therefore P(A 2 1H)

=

i
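A quick numerical check of Example 3, using exact fractions (a sketch; variable names ours):

```python
from fractions import Fraction

P_A1 = P_A2 = Fraction(1, 2)                      # each coin equally likely to be drawn
P_H_A1, P_H_A2 = Fraction(1, 2), Fraction(1, 3)   # fair and biased coin

P_H = P_H_A1 * P_A1 + P_H_A2 * P_A2               # total probability, formula (3)
P_A1_H = P_A1 * P_H_A1 / P_H                      # Bayes's formula (8)

print(P_A1_H)       # 3/5
print(1 - P_A1_H)   # 2/5
```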

4. In a certain sense, the concept of independence, which we are now going to introduce, plays a central role in probability theory: it is precisely this concept that distinguishes probability theory from the general theory of measure spaces.

† A priori: before the experiment; a posteriori: after the experiment.


If A and B are two events, it is natural to say that B is independent of A if knowing that A has occurred has no effect on the probability of B. In other words, "B is independent of A" if
P(B|A) = P(B)  (10)
(we are supposing that P(A) > 0). Since
P(B|A) = P(AB)/P(A),
it follows from (10) that
P(AB) = P(A)P(B).  (11)
In exactly the same way, if P(B) > 0 it is natural to say that "A is independent of B" if
P(A|B) = P(A).
Hence we again obtain (11), which is symmetric in A and B and still makes sense when the probabilities of these events are zero. After these preliminaries, we introduce the following definition.

Definition 2. Events A and B are called independent or statistically independent (with respect to the probability P) if
P(AB) = P(A)P(B).

In probability theory it is often convenient to consider not only independence of events (or sets) but also independence of collections of events (or sets). Accordingly, we introduce the following definition.

Definition 3. Two algebras 𝒜₁ and 𝒜₂ of events (or sets) are called independent or statistically independent (with respect to the probability P) if all pairs of sets A₁ and A₂, belonging respectively to 𝒜₁ and 𝒜₂, are independent.
For example, let us consider the two algebras
𝒜₁ = {A₁, Ā₁, ∅, Ω} and 𝒜₂ = {A₂, Ā₂, ∅, Ω},
where A₁ and A₂ are subsets of Ω. It is easy to verify that 𝒜₁ and 𝒜₂ are independent if and only if A₁ and A₂ are independent. In fact, the independence of 𝒜₁ and 𝒜₂ means the independence of the 16 pairs of events A₁ and A₂, A₁ and Ā₂, ..., Ω and Ω. Consequently A₁ and A₂ are independent. Conversely, if A₁ and A₂ are independent, we have to show that the other 15


pairs of events are independent. Let us verify, for example, the independence of A₁ and Ā₂. We have
P(A₁Ā₂) = P(A₁) − P(A₁A₂) = P(A₁) − P(A₁)P(A₂) = P(A₁)(1 − P(A₂)) = P(A₁)P(Ā₂).
The independence of the other pairs is verified similarly.

5. The concept of independence of two sets or two algebras of sets can be extended to any finite number of sets or algebras of sets.
Thus we say that the sets A₁, ..., Aₙ are collectively independent or statistically independent (with respect to the probability P) if for k = 1, ..., n and 1 ≤ i₁ < i₂ < ··· < iₖ ≤ n,
P(A_{i₁} ··· A_{iₖ}) = P(A_{i₁}) ··· P(A_{iₖ}).  (12)

The algebras 𝒜₁, ..., 𝒜ₙ of sets are called independent or statistically independent (with respect to the probability P) if all sets A₁, ..., Aₙ belonging respectively to 𝒜₁, ..., 𝒜ₙ are independent.

Note that pairwise independence of events does not imply their independence. In fact if, for example, Ω = {ω₁, ω₂, ω₃, ω₄} and all outcomes are equiprobable, it is easily verified that the events

A = {ω₁, ω₂},  B = {ω₁, ω₃},  C = {ω₁, ω₄}

are pairwise independent, whereas

P(ABC) = 1/4 ≠ (1/2)³ = P(A)P(B)P(C).

Also note that if

P(ABC) = P(A)P(B)P(C)

for events A, B and C, it by no means follows that these events are pairwise independent. In fact, let Ω consist of the 36 ordered pairs (i, j), where i, j = 1, 2, ..., 6 and all the pairs are equiprobable. Then if A = {(i, j): j = 1, 2 or 5}, B = {(i, j): j = 4, 5 or 6}, C = {(i, j): i + j = 9}, we have

P(AB) = 1/6 ≠ 1/4 = P(A)P(B),
P(AC) = 1/36 ≠ 1/18 = P(A)P(C),
P(BC) = 1/12 ≠ 1/18 = P(B)P(C),

but also

P(ABC) = 1/36 = P(A)P(B)P(C).
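Both counterexamples above are small enough to verify by direct enumeration. The following sketch (helper names are ours, not the book's) checks the four-point example, where all pairs are independent but the triple is not, and the dice example, where the triple product relation holds but no pair is independent.

```python
from fractions import Fraction

# Four equiprobable points omega_1..omega_4, encoded as 1..4.
P4 = lambda e: Fraction(len(e), 4)
A4, B4, C4 = {1, 2}, {1, 3}, {1, 4}

assert P4(A4 & B4) == P4(A4) * P4(B4)        # each pair is independent ...
assert P4(A4 & C4) == P4(A4) * P4(C4)
assert P4(B4 & C4) == P4(B4) * P4(C4)
assert P4(A4 & B4 & C4) != P4(A4) * P4(B4) * P4(C4)   # ... but the triple is not

# The dice example: 36 equiprobable ordered pairs (i, j).
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda e: Fraction(len(e), 36)
A = {w for w in omega if w[1] in (1, 2, 5)}
B = {w for w in omega if w[1] in (4, 5, 6)}
C = {w for w in omega if w[0] + w[1] == 9}

assert P(A & B & C) == P(A) * P(B) * P(C)    # the triple relation holds ...
assert P(A & B) != P(A) * P(B)               # ... yet no pair is independent
assert P(A & C) != P(A) * P(C)
assert P(B & C) != P(B) * P(C)
```

Exact rational arithmetic (`Fraction`) avoids any floating-point ambiguity in the equality tests.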

6. Let us consider in more detail, from the point of view of independence, the classical model (Ω, 𝒜, P) that was introduced in §2 and used as a basis for the binomial distribution.


I. Elementary Probability Theory

In this model

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  𝒜 = {A: A ⊆ Ω}

and

p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}.  (13)

Consider an event A ⊆ Ω. We say that this event depends on a trial at time k if it is determined by the value aₖ alone. Examples of such events are

Aₖ = {ω: aₖ = 1},  Āₖ = {ω: aₖ = 0}.

Let us consider the sequence of algebras 𝒜₁, 𝒜₂, ..., 𝒜ₙ, where 𝒜ₖ = {Aₖ, Āₖ, ∅, Ω}, and show that under (13) these algebras are independent. It is clear that

P(Aₖ) = ∑_{(a₁,...,aₖ₋₁,aₖ₊₁,...,aₙ)} p · p^{a₁+⋯+aₖ₋₁+aₖ₊₁+⋯+aₙ} q^{(n−1)−(a₁+⋯+aₖ₋₁+aₖ₊₁+⋯+aₙ)}
      = p ∑_{l=0}^{n−1} C_{n−1}^l p^l q^{(n−1)−l} = p,

and a similar calculation shows that P(Āₖ) = q and that, for k ≠ l,

P(AₖA_l) = p²,  P(AₖĀ_l) = pq,  P(ĀₖĀ_l) = q².

It is easy to deduce from this that 𝒜ₖ and 𝒜_l are independent for k ≠ l. It can be shown in the same way that 𝒜₁, 𝒜₂, ..., 𝒜ₙ are independent. This is the basis for saying that our model (Ω, 𝒜, P) corresponds to "n independent trials with two outcomes and probability p of success." James Bernoulli was the first to study this model systematically, and established the law of large numbers (§5) for it. Accordingly, this model is also called the Bernoulli scheme with two outcomes (success and failure) and probability p of success.

A detailed study of the probability space for the Bernoulli scheme shows that it has the structure of a direct product of probability spaces, defined as follows. Suppose that we are given a collection (Ω₁, ℬ₁, P₁), ..., (Ωₙ, ℬₙ, Pₙ) of finite probability spaces. Form the space Ω = Ω₁ × Ω₂ × ⋯ × Ωₙ of points ω = (a₁, ..., aₙ), where aᵢ ∈ Ωᵢ. Let 𝒜 = ℬ₁ ⊗ ⋯ ⊗ ℬₙ be the algebra of the subsets of Ω that consists of sums of sets of the form

A = B₁ × B₂ × ⋯ × Bₙ

with Bᵢ ∈ ℬᵢ. Finally, for ω = (a₁, ..., aₙ) take p(ω) = p₁(a₁) ⋯ pₙ(aₙ) and define P(A) for the set A = B₁ × B₂ × ⋯ × Bₙ by

P(A) = P₁(B₁)P₂(B₂) ⋯ Pₙ(Bₙ).


It is easy to verify that P(Ω) = 1 and therefore the triple (Ω, 𝒜, P) defines a probability space. This space is called the direct product of the probability spaces (Ω₁, ℬ₁, P₁), ..., (Ωₙ, ℬₙ, Pₙ).

We note an easily verified property of the direct product of probability spaces: with respect to P, the events

A₁ = {ω: a₁ ∈ B₁}, ..., Aₙ = {ω: aₙ ∈ Bₙ},

where Bᵢ ∈ ℬᵢ, are independent. In the same way, the algebras of subsets of Ω

𝒜₁ = {A₁: A₁ = {ω: a₁ ∈ B₁}, B₁ ∈ ℬ₁}, ..., 𝒜ₙ = {Aₙ: Aₙ = {ω: aₙ ∈ Bₙ}, Bₙ ∈ ℬₙ}

are independent.

It is clear from our construction that the Bernoulli scheme (Ω, 𝒜, P) with

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0 or 1},  𝒜 = {A: A ⊆ Ω}

and

p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}

can be thought of as the direct product of the probability spaces (Ωᵢ, ℬᵢ, Pᵢ), i = 1, 2, ..., n, where

Ωᵢ = {0, 1},  ℬᵢ = {{0}, {1}, ∅, Ωᵢ},
Pᵢ({1}) = p,  Pᵢ({0}) = q.
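The direct-product construction can be checked by brute force on a small case. In the sketch below (helper names are ours), we build Ω = {0, 1}ⁿ with the weights p(ω) = p^{∑aᵢ}q^{n−∑aᵢ} and confirm that the coordinate events Aₖ behave as computed above: P(Aₖ) = p and P(AₖA_l) = p² for k ≠ l.

```python
from itertools import product
from fractions import Fraction

n, p = 4, Fraction(1, 3)
q = 1 - p

omega = list(product((0, 1), repeat=n))                 # Omega = {0,1}^n
weight = lambda w: p ** sum(w) * q ** (n - sum(w))      # p(omega), as in (13)

def P(event):
    # P(A) = sum of the weights p(omega) over omega in A
    return sum(weight(w) for w in event)

assert P(omega) == 1                                    # the weights form a probability

# coordinate events A_k = {omega: a_k = 1}
A = [[w for w in omega if w[k] == 1] for k in range(n)]

assert all(P(A[k]) == p for k in range(n))              # P(A_k) = p
assert P([w for w in omega if w[0] == 1 and w[1] == 1]) == p * p   # P(A_k A_l) = p^2
```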

7. PROBLEMS

1. Give examples to show that in general the equations

P(B|A) + P(B|Ā) = 1,
P(B|A) + P(B̄|Ā) = 1

are false.

2. An urn contains M balls, of which M₁ are white. Consider a sample of size n. Let Bⱼ be the event that the ball selected at the jth step is white, and Aₖ the event that a sample of size n contains exactly k white balls. Show that

P(Bⱼ|Aₖ) = k/n

both for sampling with replacement and for sampling without replacement.

3. Let A₁, ..., Aₙ be independent events. Then

P(⋃_{i=1}^n Aᵢ) = 1 − ∏_{i=1}^n P(Āᵢ).

4. Let A₁, ..., Aₙ be independent events with P(Aᵢ) = pᵢ. Then the probability P₀ that none of the events occurs is

P₀ = ∏_{i=1}^n (1 − pᵢ).

5. Let A and B be independent events. In terms of P(A) and P(B), find the probabilities of the events that exactly k, at least k, and at most k of A and B occur (k = 0, 1, 2).

6. Let event A be independent of itself, i.e. let A and A be independent. Show that P(A) is either 0 or 1.

7. Let event A have P(A) = 0 or 1. Show that A and an arbitrary event B are independent.

8. Consider the electric circuit shown in Figure 4.

[Figure 4: a network of five switches A, B, C, D, E between "input" and "output".]

Each of the switches A, B, C, D, and E is independently open or closed with probabilities p and q, respectively. Find the probability that a signal fed in at "input" will be received at "output". If the signal is received, what is the conditional probability that E is open?

§4. Random Variables and Their Properties

1. Let (Ω, 𝒜, P) be a probabilistic model of an experiment with a finite number of outcomes, N(Ω) < ∞, where 𝒜 is the algebra of all subsets of Ω. We observe that in the examples above, where we calculated the probabilities of various events A ∈ 𝒜, the specific nature of the sample space Ω was of no interest. We were interested only in numerical properties depending on the sample points. For example, we were interested in the probability of some number of successes in a series of n trials, in the probability distribution for the number of objects in cells, etc.

The concept "random variable," which we now introduce (later it will be given a more general form), serves to define quantities that are subject to "measurement" in random experiments.

Definition 1. Any numerical function ξ = ξ(ω) defined on a (finite) sample space Ω is called a (simple) random variable. (The reason for the term "simple" random variable will become clear after the introduction of the general concept of random variable in §4 of Chapter II.)


EXAMPLE 1. In the model of two tosses of a coin with sample space Ω = {HH, HT, TH, TT}, define a random variable ξ = ξ(ω) by the table

ω:     HH  HT  TH  TT
ξ(ω):   2   1   1   0

Here, from its very definition, ξ(ω) is nothing but the number of heads in the outcome ω.

Another extremely simple example of a random variable is the indicator (or characteristic function) of a set A ∈ 𝒜:

ξ = I_A(ω),

where†

I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A.

When experimenters are concerned with random variables that describe observations, their main interest is in the probabilities with which the random variables take various values. From this point of view they are interested, not in the distribution of the probability P over (Ω, 𝒜), but in its distribution over the range of a random variable. Since we are considering the case when Ω contains only a finite number of points, the range X of the random variable ξ is also finite. Let X = {x₁, ..., xₘ}, where the (different) numbers x₁, ..., xₘ exhaust the values of ξ.

Let 𝒳 be the collection of all subsets of X, and let B ∈ 𝒳. We can also interpret B as an event if the sample space is taken to be X, the set of values of ξ. On (X, 𝒳), consider the probability P_ξ(·) induced by ξ according to the formula

P_ξ(B) = P{ω: ξ(ω) ∈ B},  B ∈ 𝒳.

It is clear that the values of this probability are completely determined by the probabilities

P_ξ(xᵢ) = P{ω: ξ(ω) = xᵢ},  xᵢ ∈ X.

The set of numbers {P_ξ(x₁), ..., P_ξ(xₘ)} is called the probability distribution of the random variable ξ.

† The notation I(A) is also used. For frequently used properties of indicators see Problem 1.
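The passage from P on (Ω, 𝒜) to the induced distribution P_ξ on the range of ξ can be illustrated with Example 1 (two fair coin tosses, ξ = number of heads); the helper names below are ours.

```python
from collections import Counter
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]              # two tosses of a coin, equiprobable
prob = Fraction(1, 4)
xi = {w: w.count("H") for w in omega}         # xi(omega) = number of heads

P_xi = Counter()                              # induced distribution on the range of xi
for w in omega:
    P_xi[xi[w]] += prob                       # P_xi(x) = P{omega: xi(omega) = x}

assert dict(P_xi) == {2: Fraction(1, 4), 1: Fraction(1, 2), 0: Fraction(1, 4)}
```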


EXAMPLE 2. A random variable ξ that takes the two values 1 and 0 with probabilities p ("success") and q ("failure") is called a Bernoulli† random variable. Clearly

P_ξ(x) = pˣ q^{1−x},  x = 0, 1.  (1)

A binomial (or binomially distributed) random variable is a random variable ξ that takes the n + 1 values 0, 1, ..., n with probabilities

P_ξ(x) = C_n^x pˣ q^{n−x},  x = 0, 1, ..., n.  (2)

Note that here and in many subsequent examples we do not specify the sample spaces (Ω, 𝒜, P), but are interested only in the values of the random variables and their probability distributions.

The probabilistic structure of the random variable ξ is completely specified by the probability distribution {P_ξ(xᵢ), i = 1, ..., m}. The concept of distribution function, which we now introduce, yields an equivalent description of the probabilistic structure of the random variables.

Definition 2. Let x ∈ R¹. The function

F_ξ(x) = P{ω: ξ(ω) ≤ x}

is called the distribution function of the random variable ξ.

Clearly

F_ξ(x) = ∑_{i: xᵢ ≤ x} P_ξ(xᵢ)

and

P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ−),

where F_ξ(x−) = lim_{y↑x} F_ξ(y). If we suppose that x₁ < x₂ < ⋯ < xₘ and put F_ξ(x₀) = 0, then

P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ₋₁),  i = 1, ..., m.

The following diagrams (Figure 5) exhibit P_ξ(x) and F_ξ(x) for a binomial random variable.

It follows immediately from Definition 2 that the distribution function F_ξ = F_ξ(x) has the following properties:

(1) F_ξ(−∞) = 0, F_ξ(+∞) = 1;
(2) F_ξ(x) is continuous on the right (F_ξ(x+) = F_ξ(x)) and piecewise constant.

† We use the terms "Bernoulli, binomial, Poisson, Gaussian, ... random variables" for what are more usually called random variables with Bernoulli, binomial, Poisson, Gaussian, ... distributions.


[Figure 5. The probabilities P_ξ(x) (top) and the distribution function F_ξ(x) (bottom) of a binomial random variable.]
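The two diagrams of Figure 5 can be reproduced numerically (names below are ours): the binomial probabilities P_ξ(x) of (2), the step function F_ξ(x) built from them, the boundary properties (1)–(2), and the jump relation P_ξ(xᵢ) = F_ξ(xᵢ) − F_ξ(xᵢ−).

```python
from math import comb
from fractions import Fraction

n, p = 5, Fraction(1, 2)
q = 1 - p
pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]   # P_xi(x), formula (2)

def F(x):
    # F_xi(x) = sum of P_xi(x_i) over x_i <= x  (right-continuous step function)
    return sum(pmf[i] for i in range(n + 1) if i <= x)

assert sum(pmf) == 1
assert F(-1) == 0 and F(n) == 1          # F(-oo) = 0, F(+oo) = 1
assert F(2) - F(1) == pmf[2]             # the jump at x = 2 equals P_xi(2)
```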

Along with random variables it is often necessary to consider random vectors ξ = (ξ₁, ..., ξᵣ) whose components are random variables. For example, when we considered the multinomial distribution we were dealing with a random vector ν = (ν₁, ..., νᵣ), where νᵢ = νᵢ(ω) is the number of elements equal to bᵢ, i = 1, ..., r, in the sequence ω = (a₁, ..., aₙ).

The set of probabilities

P_ξ(x₁, ..., xᵣ) = P{ω: ξ₁(ω) = x₁, ..., ξᵣ(ω) = xᵣ},

where xᵢ ∈ Xᵢ, the range of ξᵢ, is called the probability distribution of the random vector ξ, and the function

F_ξ(x₁, ..., xᵣ) = P{ω: ξ₁(ω) ≤ x₁, ..., ξᵣ(ω) ≤ xᵣ},

where xᵢ ∈ R¹, is called the distribution function of the random vector ξ = (ξ₁, ..., ξᵣ).

For example, for the random vector ν = (ν₁, ..., νᵣ) mentioned above,

P_ν(n₁, ..., nᵣ) = Cₙ(n₁, ..., nᵣ) p₁^{n₁} ⋯ pᵣ^{nᵣ}

(see (2.2)).

2. Let ξ₁, ..., ξᵣ be a set of random variables with values in a (finite) set X ⊆ R¹. Let 𝒳 be the algebra of subsets of X.


Definition 3. The random variables ξ₁, ..., ξᵣ are said to be independent (collectively independent) if

P{ξ₁ = x₁, ..., ξᵣ = xᵣ} = P{ξ₁ = x₁} ⋯ P{ξᵣ = xᵣ}

for all x₁, ..., xᵣ ∈ X; or, equivalently, if

P{ξ₁ ∈ B₁, ..., ξᵣ ∈ Bᵣ} = P{ξ₁ ∈ B₁} ⋯ P{ξᵣ ∈ Bᵣ}

for all B₁, ..., Bᵣ ∈ 𝒳.

We can get a very simple example of independent random variables from the Bernoulli scheme. Let

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  p(ω) = p^{∑aᵢ} q^{n−∑aᵢ},

and ξᵢ(ω) = aᵢ for ω = (a₁, ..., aₙ), i = 1, ..., n. Then the random variables ξ₁, ξ₂, ..., ξₙ are independent, as follows from the independence of the events

A₁ = {ω: a₁ = 1}, ..., Aₙ = {ω: aₙ = 1},

which was established in §3.

3. We shall frequently encounter the problem of finding the probability distributions of random variables that are functions f(ξ₁, ..., ξᵣ) of random variables ξ₁, ..., ξᵣ. For the present we consider only the determination of the distribution of a sum ζ = ξ + η of random variables.

If ξ and η take values in the respective sets X = {x₁, ..., xₖ} and Y = {y₁, ..., y_l}, the random variable ζ = ξ + η takes values in the set Z = {z: z = xᵢ + yⱼ, i = 1, ..., k; j = 1, ..., l}. Then it is clear that

P_ζ(z) = P{ζ = z} = P{ξ + η = z} = ∑_{(i,j): xᵢ+yⱼ=z} P{ξ = xᵢ, η = yⱼ}.

The case of independent random variables ξ and η is particularly important. In this case

P{ξ = xᵢ, η = yⱼ} = P{ξ = xᵢ}P{η = yⱼ},

and therefore

P_ζ(z) = ∑_{(i,j): xᵢ+yⱼ=z} P_ξ(xᵢ)P_η(yⱼ) = ∑_{i=1}^k P_ξ(xᵢ)P_η(z − xᵢ)  (3)

for all z ∈ Z, where in the last sum P_η(z − xᵢ) is taken to be zero if z − xᵢ ∉ Y.

For example, if ξ and η are independent Bernoulli random variables, taking the values 1 and 0 with respective probabilities p and q, then Z = {0, 1, 2} and

P_ζ(0) = P_ξ(0)P_η(0) = q²,
P_ζ(1) = P_ξ(0)P_η(1) + P_ξ(1)P_η(0) = 2pq,
P_ζ(2) = P_ξ(1)P_η(1) = p².

It is easy to show by induction that if ξ₁, ξ₂, ..., ξₙ are independent Bernoulli random variables with P{ξᵢ = 1} = p, P{ξᵢ = 0} = q, then the random variable ζ = ξ₁ + ⋯ + ξₙ has the binomial distribution

P_ζ(k) = C_n^k p^k q^{n−k},  k = 0, 1, ..., n.  (4)
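Formula (3) is an ordinary convolution of two distributions. The sketch below (function and variable names are ours) convolves n independent Bernoulli distributions step by step and checks that the result is the binomial distribution (4).

```python
from math import comb
from fractions import Fraction

def convolve(P_xi, P_eta):
    # formula (3): P_zeta(z) = sum over x of P_xi(x) * P_eta(z - x)
    P_zeta = {}
    for x, px in P_xi.items():
        for y, py in P_eta.items():
            P_zeta[x + y] = P_zeta.get(x + y, 0) + px * py
    return P_zeta

n, p = 5, Fraction(1, 4)
q = 1 - p
bern = {1: p, 0: q}                      # a Bernoulli distribution

dist = bern
for _ in range(n - 1):                   # zeta = xi_1 + ... + xi_n
    dist = convolve(dist, bern)

# the result is the binomial distribution (4)
assert dist == {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}
```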

4. We now turn to the important concept of the expectation, or mean value, of a random variable.

Let (Ω, 𝒜, P) be a (finite) probability space and ξ = ξ(ω) a random variable with values in the set X = {x₁, ..., xₖ}. If we put Aᵢ = {ω: ξ = xᵢ}, i = 1, ..., k, then ξ can evidently be represented as

ξ(ω) = ∑_{i=1}^k xᵢ I(Aᵢ),  (5)

where the sets A₁, ..., Aₖ form a decomposition of Ω (i.e., they are pairwise disjoint and their sum is Ω; see Subsection 3 of §1).

Let pᵢ = P{ξ = xᵢ}. It is intuitively plausible that if we observe the values of the random variable ξ in "n repetitions of identical experiments", the value xᵢ ought to be encountered about pᵢn times, i = 1, ..., k. Hence the mean value calculated from the results of n experiments is roughly

(1/n)[np₁x₁ + ⋯ + npₖxₖ] = ∑_{i=1}^k pᵢxᵢ.

This discussion provides the motivation for the following definition.

Definition 4. The expectation† or mean value of the random variable ξ = ∑_{i=1}^k xᵢI(Aᵢ) is the number

Eξ = ∑_{i=1}^k xᵢ P(Aᵢ).  (6)

Since Aᵢ = {ω: ξ(ω) = xᵢ} and P_ξ(xᵢ) = P(Aᵢ), we have

Eξ = ∑_{i=1}^k xᵢ P_ξ(xᵢ).  (7)

Recalling the definition of F_ξ = F_ξ(x) and writing

ΔF_ξ(x) = F_ξ(x) − F_ξ(x−),

we obtain P_ξ(xᵢ) = ΔF_ξ(xᵢ) and consequently

Eξ = ∑_{i=1}^k xᵢ ΔF_ξ(xᵢ).  (8)

† Also known as mathematical expectation, or expected value, or (especially in physics) expectation value. (Translator)


Before discussing the properties of the expectation, we remark that it is often convenient to use another representation of the random variable ξ, namely

ξ(ω) = ∑_{j=1}^l xⱼ I(Bⱼ),

where B₁ + ⋯ + B_l = Ω, but some of the xⱼ may be repeated. In this case Eξ can be calculated from the formula ∑_{j=1}^l xⱼ P(Bⱼ), which differs formally from (5) because in (5) the xᵢ are all different. In fact,

∑_{j: xⱼ = xᵢ} xⱼ P(Bⱼ) = xᵢ ∑_{j: xⱼ = xᵢ} P(Bⱼ) = xᵢ P(Aᵢ)

and therefore

∑_{j=1}^l xⱼ P(Bⱼ) = ∑_{i=1}^k xᵢ P(Aᵢ).

5. We list the basic properties of the expectation:

(1) If ξ ≥ 0 then Eξ ≥ 0.
(2) E(aξ + bη) = aEξ + bEη, where a and b are constants.
(3) If ξ ≥ η then Eξ ≥ Eη.
(4) |Eξ| ≤ E|ξ|.
(5) If ξ and η are independent, then Eξη = Eξ · Eη.
(6) (E|ξη|)² ≤ Eξ² · Eη² (Cauchy–Bunyakovskii inequality).†
(7) If ξ = I(A) then Eξ = P(A).

Properties (1) and (7) are evident. To prove (2), let

ξ = ∑ᵢ xᵢ I(Aᵢ),  η = ∑ⱼ yⱼ I(Bⱼ).

Then

aξ + bη = ∑_{i,j} (axᵢ + byⱼ) I(Aᵢ ∩ Bⱼ)

and

E(aξ + bη) = ∑_{i,j} (axᵢ + byⱼ)P(Aᵢ ∩ Bⱼ) = ∑ᵢ axᵢ P(Aᵢ) + ∑ⱼ byⱼ P(Bⱼ) = aEξ + bEη.

† Also known as the Cauchy–Schwarz or Schwarz inequality. (Translator)


Property (3) follows from (1) and (2). Property (4) is evident, since

|Eξ| = |∑ᵢ xᵢP(Aᵢ)| ≤ ∑ᵢ |xᵢ|P(Aᵢ) = E|ξ|.

To prove (5) we note that

Eξη = E(∑ᵢ xᵢ I(Aᵢ))(∑ⱼ yⱼ I(Bⱼ)) = E ∑_{i,j} xᵢyⱼ I(Aᵢ ∩ Bⱼ) = ∑_{i,j} xᵢyⱼ P(Aᵢ ∩ Bⱼ)
    = ∑_{i,j} xᵢyⱼ P(Aᵢ)P(Bⱼ) = Eξ · Eη,

where we have used the property that for independent random variables the events

Aᵢ = {ω: ξ(ω) = xᵢ}  and  Bⱼ = {ω: η(ω) = yⱼ}

are independent: P(Aᵢ ∩ Bⱼ) = P(Aᵢ)P(Bⱼ).

To prove property (6) we observe that

ξ² = ∑ᵢ xᵢ² I(Aᵢ),  η² = ∑ⱼ yⱼ² I(Bⱼ)

and

Eξ² = ∑ᵢ xᵢ² P(Aᵢ),  Eη² = ∑ⱼ yⱼ² P(Bⱼ).

Let Eξ² > 0, Eη² > 0. Put

ξ′ = ξ/√(Eξ²),  η′ = η/√(Eη²).

Since 2|ξ′η′| ≤ ξ′² + η′², we have 2E|ξ′η′| ≤ Eξ′² + Eη′² = 2. Therefore E|ξ′η′| ≤ 1 and (E|ξη|)² ≤ Eξ² · Eη².

However, if, say, Eξ² = 0, this means that ∑ᵢ xᵢ²P(Aᵢ) = 0 and consequently P{ω: ξ(ω) = 0} = 1. Therefore if at least one of Eξ² or Eη² is zero, it is evident that E|ξη| = 0 and consequently the Cauchy–Bunyakovskii inequality still holds.

Remark. Property (5) generalizes in an obvious way to any finite number of random variables: if ξ₁, ..., ξᵣ are independent, then

E(ξ₁ ⋯ ξᵣ) = Eξ₁ ⋯ Eξᵣ.

The proof can be given in the same way as for the case r = 2, or by induction.


EXAMPLE 3. Let ξ be a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q. Then

Eξ = 1 · P{ξ = 1} + 0 · P{ξ = 0} = p.

EXAMPLE 4. Let ξ₁, ..., ξₙ be n Bernoulli random variables with P{ξᵢ = 1} = p, P{ξᵢ = 0} = q, p + q = 1. Then if

Sₙ = ξ₁ + ⋯ + ξₙ,

we find that

ESₙ = np.

This result can be obtained in a different way. It is easy to see that ESₙ is not changed if we assume that the Bernoulli random variables ξ₁, ..., ξₙ are independent. With this assumption, we have according to (4)

P(Sₙ = k) = C_n^k p^k q^{n−k},  k = 0, 1, ..., n.

Therefore

ESₙ = ∑_{k=0}^n k P(Sₙ = k) = ∑_{k=0}^n k C_n^k p^k q^{n−k}
    = ∑_{k=0}^n k · (n!/(k!(n−k)!)) p^k q^{n−k}
    = np ∑_{k=1}^n ((n−1)!/((k−1)!((n−1)−(k−1))!)) p^{k−1} q^{(n−1)−(k−1)}
    = np ∑_{l=0}^{n−1} ((n−1)!/(l!((n−1)−l)!)) p^l q^{(n−1)−l} = np.

However, the first method is more direct.
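Both computations of ESₙ in Example 4 can be reproduced mechanically (names below are ours): linearity gives np at once, while summing k · P(Sₙ = k) over the binomial distribution gives the same value.

```python
from math import comb
from fractions import Fraction

n, p = 7, Fraction(2, 5)
q = 1 - p

by_linearity = n * p                     # E S_n = E xi_1 + ... + E xi_n = np
by_distribution = sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

assert by_linearity == by_distribution
```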

6. Let ξ = ∑ᵢ xᵢ I(Aᵢ), where Aᵢ = {ω: ξ(ω) = xᵢ}, and let φ = φ(ξ(ω)) be a function of ξ(ω). If Bⱼ = {ω: φ(ξ(ω)) = yⱼ}, then

φ(ξ(ω)) = ∑ⱼ yⱼ I(Bⱼ),

and consequently

Eφ = ∑ⱼ yⱼ P(Bⱼ) = ∑ⱼ yⱼ P_φ(yⱼ).

But it is also clear that

φ(ξ(ω)) = ∑ᵢ φ(xᵢ) I(Aᵢ).  (9)


Hence, by (9) and the definition (6) of expectation, the expectation of the random variable φ = φ(ξ) can be calculated as

Eφ(ξ) = ∑ᵢ φ(xᵢ) P_ξ(xᵢ).
7. The important notion of the variance of a random variable ξ indicates the amount of scatter of the values of ξ around Eξ.

Definition 5. The variance (also called the dispersion) of the random variable ξ (denoted by Vξ) is

Vξ = E(ξ − Eξ)².

The number σ = +√(Vξ) is called the standard deviation.

Since

E(ξ − Eξ)² = E(ξ² − 2ξ · Eξ + (Eξ)²) = Eξ² − 2(Eξ)² + (Eξ)²,

we have

Vξ = Eξ² − (Eξ)².

Clearly Vξ ≥ 0. It follows from the definition that

V(a + bξ) = b²Vξ,  where a and b are constants.

In particular, Va = 0, V(bξ) = b²Vξ.

Let ξ and η be random variables. Then

V(ξ + η) = E((ξ − Eξ) + (η − Eη))² = Vξ + Vη + 2E(ξ − Eξ)(η − Eη).

Write

cov(ξ, η) = E(ξ − Eξ)(η − Eη).

This number is called the covariance of ξ and η. If Vξ > 0 and Vη > 0, then

ρ(ξ, η) = cov(ξ, η) / √(Vξ · Vη)

is called the correlation coefficient of ξ and η. It is easy to show (see Problem 7 below) that if ρ(ξ, η) = ±1, then ξ and η are linearly dependent:

η = aξ + b,

with a > 0 if ρ(ξ, η) = 1 and a < 0 if ρ(ξ, η) = −1.

We observe immediately that if ξ and η are independent, so are ξ − Eξ and η − Eη. Consequently by Property (5) of expectations,

cov(ξ, η) = E(ξ − Eξ) · E(η − Eη) = 0.


Using the notation that we introduced for covariance, we have

V(ξ + η) = Vξ + Vη + 2 cov(ξ, η);  (10)

if ξ and η are independent, the variance of the sum ξ + η is equal to the sum of the variances,

V(ξ + η) = Vξ + Vη.  (11)

It follows from (10) that (11) is still valid under weaker hypotheses than the independence of ξ and η. In fact, it is enough to suppose that ξ and η are uncorrelated, i.e. cov(ξ, η) = 0.

Remark. If ξ and η are uncorrelated, it does not follow in general that they are independent. Here is a simple example. Let the random variable α take the values 0, π/2 and π, each with probability 1/3. Then ξ = sin α and η = cos α are uncorrelated; however, they are not only stochastically dependent (i.e., not independent with respect to the probability P):

P{ξ = 1, η = 1} = 0 ≠ 1/9 = P{ξ = 1}P{η = 1},

but even functionally dependent: ξ² + η² = 1.

Properties (10) and (11) can be extended in the obvious way to any number of random variables:

V(ξ₁ + ⋯ + ξₙ) = ∑_{i=1}^n Vξᵢ + 2 ∑_{i<j} cov(ξᵢ, ξⱼ).  (12)

In particular, if ξ₁, ..., ξₙ are pairwise independent (pairwise uncorrelated is sufficient), then

V(ξ₁ + ⋯ + ξₙ) = Vξ₁ + ⋯ + Vξₙ.  (13)

EXAMPLE 5. If ξ is a Bernoulli random variable, taking the values 1 and 0 with probabilities p and q, then

Vξ = E(ξ − Eξ)² = E(ξ − p)² = (1 − p)²p + p²q = pq.

It follows that if ξ₁, ..., ξₙ are independent identically distributed Bernoulli random variables, and Sₙ = ξ₁ + ⋯ + ξₙ, then

VSₙ = npq.  (14)

npq.

(14)

8. Consider two random variables ~ and Yf. Suppose that only ~ can be observed. If~ and Yf are correlated, we may expect that knowing the value of~ allows us to make some inference about the values of the unobserved variable Yf· Any function}= f(~) of~ is called an estimator for Yf· We say that an estimator f* = f*(~) is best in the mean-square sense if E(Yf - /*(0) 2 = inf E(Yf - /(~)) 2 • f

43

§4. Random Variables and Their Properties

Let us show how to find a best estimator in the class of linear estimators a + b~. We consider the function g(a, b) = E(17 - (a + b~)) 2 . Differentiating g(a, b) with respect to a and b, we obtain A.(~) =

og(a, b)= - 2E['7 -(a+ oa og(a, b) ob

= - 2E[('7-

b~)],

(a+ be))e],

whence, setting the derivatives equal to zero, we find that the best meansquare linear estimator is ..t*(~) = a* + b*~. where

a*

= E17 - b*E~,

b*

= cov(~, 17) v~

.

(15)

In other words, (16) The number E(17 - ..1.*(~)) 2 is called the mean-square error of observation. An easy calculation shows that it is equal to

Consequently, the larger (in absolute value) the correlation coefficient 17) between ~ and '1· the smaller the mean-square error of observation tl*. In particular, if IP(e. '7)1 = 1 then tl* = 0 (cf. Problem 7). On the other hand, if~ and '1 are uncorrelated (p(~, '7) = 0}, then A.*(~) = E'7, i.e. in the absence of correlation between ~ and '1 the best estimate of '1 in terms of~ is simply E17 (cf. Problem 4). p(~.

9. PROBLEMS

1. Verify the following properties of indicators I_A = I_A(ω):

I_∅ = 0,  I_Ω = 1,  I_Ā = 1 − I_A,
I_{AB} = I_A · I_B,  I_{A∪B} = I_A + I_B − I_{AB}.

The indicator of ⋃_{i=1}^n Aᵢ is 1 − ∏_{i=1}^n (1 − I_{Aᵢ}), the indicator of ⋂_{i=1}^n Āᵢ is ∏_{i=1}^n (1 − I_{Aᵢ}), and the indicator of ⋂_{i=1}^n Aᵢ is ∏_{i=1}^n I_{Aᵢ}. Moreover,

I_{A △ B} = (I_A − I_B)²,

where A △ B is the symmetric difference of A and B, i.e. the set (A\B) ∪ (B\A).

2. Let ξ₁, ..., ξₙ be independent random variables and

ξ_min = min(ξ₁, ..., ξₙ),  ξ_max = max(ξ₁, ..., ξₙ).

Show that

P{ξ_min ≥ x} = ∏_{i=1}^n P{ξᵢ ≥ x},  P{ξ_max < x} = ∏_{i=1}^n P{ξᵢ < x}.

3. Let ξ₁, ..., ξₙ be independent Bernoulli random variables such that

P{ξᵢ = 0} = 1 − λᵢΔ,  P{ξᵢ = 1} = λᵢΔ,

where Δ is a small number, Δ > 0, λᵢ > 0. Show that

P{ξ₁ + ⋯ + ξₙ = 1} = (∑_{i=1}^n λᵢ)Δ + O(Δ²),
P{ξ₁ + ⋯ + ξₙ > 1} = O(Δ²).

4. Show that

inf_{−∞ < a < ∞} E(ξ − a)² = Vξ.

5. Let ξ be a random variable with distribution function F_ξ(x) and let m_ξ be a median of F_ξ(x), i.e. a point such that

F_ξ(m_ξ−) ≤ 1/2 ≤ F_ξ(m_ξ).

Show that

inf_{−∞ < a < ∞} E|ξ − a| = E|ξ − m_ξ|.

6. Let P_ξ(x) = P{ξ = x} and F_ξ(x) = P{ξ ≤ x}. Show that, for a > 0 and −∞ < b < ∞,

P_{aξ+b}(x) = P_ξ((x − b)/a),  F_{aξ+b}(x) = F_ξ((x − b)/a).

If y ≥ 0, then

F_{ξ²}(y) = F_ξ(+√y) − F_ξ(−√y) + P_ξ(−√y).

Let ξ⁺ = max(ξ, 0). Then

F_{ξ⁺}(x) = 0 for x < 0,  F_{ξ⁺}(0) = F_ξ(0),  F_{ξ⁺}(x) = F_ξ(x) for x > 0.

7. Let ξ and η be random variables with Vξ > 0, Vη > 0, and let ρ = ρ(ξ, η) be their correlation coefficient. Show that |ρ| ≤ 1. If |ρ| = 1, find constants a and b such that η = aξ + b. Moreover, if ρ = 1, then

(η − Eη)/√(Vη) = (ξ − Eξ)/√(Vξ)

(and therefore a > 0), whereas if ρ = −1, then

(η − Eη)/√(Vη) = −(ξ − Eξ)/√(Vξ)

(and therefore a < 0).

8. Let ξ and η be random variables with Eξ = Eη = 0, Vξ = Vη = 1 and correlation coefficient ρ = ρ(ξ, η). Show that

E max(ξ², η²) ≤ 1 + √(1 − ρ²).

9. Use the equation

(indicator of ⋂_{i=1}^n Āᵢ) = ∏_{i=1}^n (1 − I_{Aᵢ})

to deduce the formula P(B₀) = 1 − S₁ + S₂ − ⋯ ± Sₙ from Problem 4 of §1.

10. Let ξ₁, ..., ξₙ be independent random variables, φ₁ = φ₁(ξ₁, ..., ξₖ) and φ₂ = φ₂(ξₖ₊₁, ..., ξₙ) functions respectively of ξ₁, ..., ξₖ and ξₖ₊₁, ..., ξₙ. Show that the random variables φ₁ and φ₂ are independent.

11. Show that the random variables ξ₁, ..., ξₙ are independent if and only if

F_{ξ₁,...,ξₙ}(x₁, ..., xₙ) = F_{ξ₁}(x₁) ⋯ F_{ξₙ}(xₙ)

for all x₁, ..., xₙ, where F_{ξ₁,...,ξₙ}(x₁, ..., xₙ) = P{ξ₁ ≤ x₁, ..., ξₙ ≤ xₙ}.

12. Show that the random variable ξ is independent of itself (i.e., ξ and ξ are independent) if and only if ξ = const.

13. Under what hypotheses on ξ are the random variables ξ and sin ξ independent?

14. Let ξ and η be independent random variables and η ≠ 0. Express the probabilities of the events P{ξη ≤ z} and P{ξ/η ≤ z} in terms of the probabilities P_ξ(x) and P_η(y).

§5. The Bernoulli Scheme. I. The Law of Large Numbers

1. In accordance with the definitions given above, a triple (Ω, 𝒜, P) with

Ω = {ω: ω = (a₁, ..., aₙ), aᵢ = 0, 1},  𝒜 = {A: A ⊆ Ω},
p(ω) = p^{∑aᵢ} q^{n−∑aᵢ}

is called a probabilistic model of n independent experiments with two outcomes, or a Bernoulli scheme.


In this and the next section we study some limiting properties (in a sense described below) for Bernoulli schemes. These are best expressed in terms of random variables and of the probabilities of events connected with them.

We introduce random variables ξ₁, ..., ξₙ by taking ξᵢ(ω) = aᵢ, i = 1, ..., n, where ω = (a₁, ..., aₙ). As we saw above, the Bernoulli variables ξᵢ(ω) are independent and identically distributed:

P{ξᵢ = 1} = p,  P{ξᵢ = 0} = q,  i = 1, ..., n.

It is natural to think of ξᵢ as describing the result of an experiment at the ith stage (or at time i).

Let us put S₀(ω) = 0 and

Sₖ(ω) = ξ₁(ω) + ⋯ + ξₖ(ω),  k = 1, ..., n.

As we found above, ESₙ = np and consequently

E(Sₙ/n) = p.  (1)

In other words, the mean value of the frequency of "success", i.e. Sₙ/n, coincides with the probability p of success. Hence we are led to ask how much the frequency Sₙ/n of success differs from its probability p.

We first note that we cannot expect that, for a sufficiently small ε > 0 and for sufficiently large n, the deviation of Sₙ/n from p is less than ε for all ω, i.e. that

|Sₙ(ω)/n − p| ≤ ε,  ω ∈ Ω.  (2)

In fact, when 0 < p < 1,

P{Sₙ/n = 1} = P{ξ₁ = 1, ..., ξₙ = 1} = pⁿ,
P{Sₙ/n = 0} = P{ξ₁ = 0, ..., ξₙ = 0} = qⁿ,

whence it follows that (2) is not satisfied for sufficiently small ε > 0.

We observe, however, that when n is large the probabilities of the events {Sₙ/n = 1} and {Sₙ/n = 0} are small. It is therefore natural to expect that the total probability of the events for which |Sₙ(ω)/n − p| > ε will also be small when n is sufficiently large. We shall accordingly try to estimate the probability of the event {ω: |Sₙ(ω)/n − p| > ε}. For this purpose we need the following inequality, which was discovered by Chebyshev.


Chebyshev's inequality. Let (Ω, 𝒜, P) be a probability space and ξ = ξ(ω) a nonnegative random variable. Then

P{ξ ≥ ε} ≤ Eξ/ε  (3)

for all ε > 0.

PROOF. We notice that

ξ = ξI(ξ ≥ ε) + ξI(ξ < ε) ≥ ξI(ξ ≥ ε) ≥ εI(ξ ≥ ε),

where I(A) is the indicator of A. Then, by the properties of the expectation,

Eξ ≥ εEI(ξ ≥ ε) = εP{ξ ≥ ε},

which establishes (3).

Corollary. If ξ is any random variable, we have for ε > 0,

P{|ξ| ≥ ε} ≤ E|ξ|/ε,
P{|ξ| ≥ ε} = P{ξ² ≥ ε²} ≤ Eξ²/ε²,
P{|ξ − Eξ| ≥ ε} ≤ Vξ/ε².  (4)

In the last of these inequalities, take ξ = Sₙ/n. Then using (4.14), we obtain

P{|Sₙ/n − p| ≥ ε} ≤ V(Sₙ/n)/ε² = pq/(nε²) ≤ 1/(4nε²),  (5)

from which we see that for large n there is rather small probability that the frequency Sₙ/n of success deviates from the probability p by more than ε.

For n ≥ 1 and 0 ≤ k ≤ n, write

Pₙ(k) = C_n^k p^k q^{n−k}.

Then

P{|Sₙ/n − p| ≥ ε} = ∑_{k: |(k/n)−p| ≥ ε} Pₙ(k),

and we have actually shown that

∑_{k: |(k/n)−p| ≥ ε} Pₙ(k) ≤ pq/(nε²) ≤ 1/(4nε²),  (6)


[Figure 6. The binomial probabilities Pₙ(k), 0 ≤ k ≤ n, with the interval from np − nε to np + nε marked on the k-axis.]

i.e. we have proved an inequality that could also have been obtained analytically, without using the probabilistic interpretation. It is clear from (6) that

∑_{k: |(k/n)−p| ≥ ε} Pₙ(k) → 0,  n → ∞.  (7)
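The bound (6) and the convergence (7) can be examined numerically. In the sketch below (function names ours), the exact tail sum is compared with 1/(4nε²) for several n, and the sample-size estimate n ≥ 1/(4ε²α) used later in this section is evaluated for α = 0.05, ε = 0.02, reproducing n = 12 500.

```python
from math import comb
from fractions import Fraction

def tail(n, p, eps):
    # the left side of (6): sum of P_n(k) over k with |k/n - p| >= eps
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k)
               for k in range(n + 1) if abs(Fraction(k, n) - p) >= eps)

p, eps = Fraction(1, 2), Fraction(1, 10)
for n in (10, 50, 200):
    assert tail(n, p, eps) <= 1 / (4 * n * eps**2)      # inequality (6)

# the Chebyshev sample-size bound: alpha = 0.05, eps = 0.02 requires n >= 12 500
alpha, eps2 = Fraction(1, 20), Fraction(1, 50)
assert 1 / (4 * eps2**2 * alpha) == 12500
```

The exact tails shrink much faster than the Chebyshev bound, which is why the bound, though valid, is crude.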

We can clarify this graphically in the following way. Let us represent the binomial distribution {Pₙ(k), 0 ≤ k ≤ n} as in Figure 6. Then as n increases the graph spreads out and becomes flatter. At the same time the sum of Pₙ(k), over k for which np − nε ≤ k < np + nε, tends to 1.

Let us think of the sequence of random variables S₀, S₁, ..., Sₙ as the path of a wandering particle. Then (7) has the following interpretation. Let us draw lines from the origin of slopes kp, k(p + ε), and k(p − ε). Then on the average the path follows the kp line, and for every ε > 0 we can say that when n is sufficiently large there is a large probability that the point Sₙ specifying the position of the particle at time n lies in the interval [n(p − ε), n(p + ε)]; see Figure 7.

[Figure 7. A path Sₖ, k = 1, 2, ..., n, between the lines k(p − ε) and k(p + ε), following the line kp on the average.]

We would like to write (7) in the following form:

P{|Sₙ/n − p| ≥ ε} → 0,  n → ∞.  (8)

However, we must keep in mind that there is a delicate point involved here. Indeed, the form (8) is really justified only if P is a probability on a space (0, d) on which infinitely many sequences of independent Bernoulli random variables ~ 1 , ~ 2 , .•. , are defined. Such spaces can actually be constructed and (8) can be justified in a completely rigorous probabilistic sense (see Corollary 1 below, the end of §4, Chapter II, and Theorem 1, §9, Chapter II). For the time being, if we want to attach a meaning to the analytic statement (7), using the language of probability theory, we have proved only the following. Let (n, .Jil<">, p), n ;;;:: 1, be a sequence of Bernoulli schemes such that

n, •.. ' a~">), al") = 0, 1}, .Jil ={A: A£ n}, p<">( w<">) =

pr.a~

q"- !:.a\"'

and

Sk">(w<"l) = ~\">(w<">)

+ · · · + ek">(w<">),

where, for n :s; 1, ~~nl, •.. , ~~nl are sequences of independent identically distributed Bernoulli random variables. Then

n --t oo. (9) Statements like (7)-(9) go by the name of James Bernoulli's law of large numbers. We may remark that to be precise, Bernoulli's proof consisted in establishing (7), which he did quite rigorously by using estimates for the "tails" of the binomial probabilities P"(k) (for the values of k for which l(k/n)- pi ;;;:: e). A direct calculation of the sum of the tail probabilities of the binomial distribution L{k:l
and

50

I. Elementary Probability Theory

2. The next section will be devoted to precise statements and proofs of these results. For the present we consider the question of the real meaning of the law of large numbers, and of its empirical interpretation. Let us carry out a large number, say N, of series of experiments, each of which consists of "n independent trials with probability p of the event C of interest." Let S~/n be the frequency of event C in the ith series and N, the number of series in which the frequency deviates from p by less than e: N, is the number of i's for which i(S~n)- pi ~e. Then N,/N "' P,

(10)

where P, = P{i(S!/n)- Pi~ e}. It is important to emphasize that an attempt to make (10) precise inevitably involves the introduction of some probability measure, just as an estimate for the deviation of Sn/n from p becomes possible only after the introduction of a probability measure P. 3. Let us consider the estimate obtained above,

    P{|S_n/n − p| ≥ ε} = Σ_{k: |(k/n)−p| ≥ ε} P_n(k) ≤ 1/(4nε²),    (11)

as an answer to the following question, typical of mathematical statistics: what is the least number n of observations that guarantees (for arbitrary 0 < p < 1)

    P{|S_n/n − p| ≤ ε} ≥ 1 − α,    (12)

where α is a given (usually small) number? It follows from (11) that this number is the smallest integer n for which

    n ≥ 1/(4ε²α).    (13)

For example, if α = 0.05 and ε = 0.02, then 12 500 observations guarantee that (12) will hold independently of the value of the unknown parameter p. Later (Subsection 5, §6) we shall see that this number is much overstated; this came about because Chebyshev's inequality provides only a very crude upper bound for P{|S_n/n − p| ≥ ε}.

4. Let us write

    C(n, ε) = {ω: |S_n(ω)/n − p| ≤ ε}.

From the law of large numbers that we proved, it follows that for every ε > 0 and for sufficiently large n, the probability of the set C(n, ε) is close to 1. In this sense it is natural to call paths (realizations) ω that are in C(n, ε) typical (or (n, ε)-typical).
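Formula (13) and the tail sum in (11) can be evaluated directly. The following Python sketch is an illustration only (the helper functions are named here for convenience; they are not part of the text):

```python
import math

def chebyshev_sample_size(eps, alpha):
    """Smallest n with 1/(4*n*eps**2) <= alpha, i.e. n >= 1/(4*eps**2*alpha) -- formula (13)."""
    return math.ceil(1 / (4 * eps ** 2 * alpha))

def binomial_tail(n, p, eps):
    """Exact P{|S_n/n - p| >= eps}: the sum of P_n(k) over k with |k/n - p| >= eps, as in (11)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k / n - p) >= eps)

print(chebyshev_sample_size(0.02, 0.05))   # 12500, the figure quoted in the text

# Chebyshev's bound 1/(4*n*eps^2) indeed dominates the exact tail probability:
n, eps = 500, 0.05
for p in (0.1, 0.3, 0.5):
    assert binomial_tail(n, p, eps) <= 1 / (4 * n * eps ** 2)
```

The exact tails are far smaller than the Chebyshev bound, which is why the sample size 12 500 is, as the text notes, much overstated.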

§5. The Bernoulli Scheme. I. The Law of Large Numbers

We ask the following question: How many typical realizations are there, and what is the weight p(ω) of a typical realization?
For this purpose we first notice that the total number N(Ω) of points is 2^n, and that if p = 0 or 1, the set of typical paths C(n, ε) contains only the single path (0, 0, …, 0) or (1, 1, …, 1). However, if p = 1/2, it is intuitively clear that "almost all" paths (all except those of the form (0, 0, …, 0) or (1, 1, …, 1)) are typical and that consequently there should be about 2^n of them. It turns out that we can give a definitive answer to the question whenever 0 < p < 1; it will then appear that both the number of typical realizations and the weights p(ω) are determined by a function of p called the entropy.
In order to present the corresponding results in more depth, it will be helpful to consider the somewhat more general scheme of Subsection 2 of §2 instead of the Bernoulli scheme itself.
Let (p₁, p₂, …, p_r) be a finite probability distribution, i.e. a set of nonnegative numbers satisfying p₁ + ⋯ + p_r = 1. The entropy of this distribution is

    H = −Σ_{i=1}^r p_i ln p_i,    (14)

with 0 · ln 0 = 0. It is clear that H ≥ 0, and H = 0 if and only if every p_i, with one exception, is zero. The function f(x) = −x ln x, 0 ≤ x ≤ 1, is convex upward, so that, as we know from the theory of convex functions,

    [f(x₁) + ⋯ + f(x_r)]/r ≤ f((x₁ + ⋯ + x_r)/r).

Consequently

    H = −Σ_{i=1}^r p_i ln p_i ≤ −r · [(p₁ + ⋯ + p_r)/r] · ln[(p₁ + ⋯ + p_r)/r] = ln r.

In other words, the entropy attains its largest value for p₁ = ⋯ = p_r = 1/r (see Figure 8 for H = H(p) in the case r = 2).
If we consider the probability distribution (p₁, p₂, …, p_r) as giving the probabilities for the occurrence of events A₁, A₂, …, A_r, say, then it is quite clear that the "degree of indeterminacy" of an event will be different for

Figure 8. The function H(p) = −p ln p − (1 − p) ln(1 − p).
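The properties of the entropy (14) just established are easy to check numerically. A small sketch for illustration (the function name is ours):

```python
import math

def entropy(ps):
    """H = -sum p_i ln p_i, with the convention 0 * ln 0 = 0 -- formula (14)."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# H = 0 exactly when one p_i equals 1 and the rest are 0:
assert entropy([1.0, 0.0, 0.0]) == 0.0

# H <= ln r, with equality for the uniform distribution p_i = 1/r:
r = 4
assert abs(entropy([1 / r] * r) - math.log(r)) < 1e-12
assert entropy([0.7, 0.1, 0.1, 0.1]) < math.log(r)

# For r = 2 this is the curve of Figure 8: H(p) = -p ln p - (1-p) ln(1-p)
H = lambda p: entropy([p, 1 - p])
print(H(0.5), math.log(2))   # the maximum, attained at p = 1/2
```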


different distributions. If, for example, p₁ = 1, p₂ = ⋯ = p_r = 0, it is clear that this distribution does not admit any indeterminacy: we can say with complete certainty that the result of the experiment will be A₁. On the other hand, if p₁ = ⋯ = p_r = 1/r, the distribution has maximal indeterminacy, in the sense that it is impossible to discover any preference for the occurrence of one event rather than another.
Consequently it is important to have a quantitative measure of the indeterminacy of different probability distributions, so that we may compare them in this respect. The entropy successfully provides such a measure of indeterminacy; it plays an important role in statistical mechanics and in many significant problems of coding and communication theory.
Suppose now that the sample space is

    Ω = {ω: ω = (a₁, …, a_n), a_i = 1, …, r}

and that p(ω) = p₁^{ν₁(ω)} ⋯ p_r^{ν_r(ω)}, where ν_i(ω) is the number of occurrences of i in the sequence ω. For ε > 0 and n = 1, 2, …, let us put

    C(n, ε) = {ω: |ν_i(ω)/n − p_i| < ε, i = 1, …, r}.

It is clear that

    P(C(n, ε)) ≥ 1 − Σ_{i=1}^r P{|ν_i(ω)/n − p_i| ≥ ε},

and for sufficiently large n the probabilities P{|ν_i(ω)/n − p_i| ≥ ε} are arbitrarily small, by the law of large numbers applied to the random variables

    ξ_k(ω) = 1 if a_k = i,  ξ_k(ω) = 0 if a_k ≠ i,  k = 1, …, n.

Hence for large n the probability of the event C(n, ε) is close to 1. Thus, as in the case r = 2, a path in C(n, ε) can be said to be typical.
If all p_i > 0, then for every ω ∈ Ω

    p(ω) = exp{−n Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k)}.

Consequently, if ω is a typical path, we have

    | Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k) − H | ≤ Σ_{k=1}^r |ν_k(ω)/n − p_k| · |ln p_k| ≤ −ε Σ_{k=1}^r ln p_k.

It follows that for typical paths the probability p(ω) is close to e^{−nH} and (since, by the law of large numbers, the typical paths "almost" exhaust Ω when n is large) the number of such paths must be of order e^{nH}. These considerations lead up to the following proposition.


Theorem (Macmillan). Let p_i > 0, i = 1, …, r, and 0 < ε < 1. Then there is an n₀ = n₀(ε; p₁, …, p_r) such that for all n > n₀

(a) e^{n(H−ε)} ≤ N(C(n, ε₁)) ≤ e^{n(H+ε)};
(b) e^{−n(H+ε)} ≤ p(ω) ≤ e^{−n(H−ε)},  ω ∈ C(n, ε₁);
(c) P(C(n, ε₁)) = Σ_{ω ∈ C(n, ε₁)} p(ω) → 1,  n → ∞;

where ε₁ is the smaller of ε and ε/(−2 Σ_{k=1}^r ln p_k).

PROOF. Conclusion (c) follows from the law of large numbers. To establish the other conclusions, we notice that if ω ∈ C(n, ε₁) then

    |ν_k(ω)/n − p_k| < ε₁,  k = 1, …, r,

and therefore

    p(ω) = exp{−n Σ_{k=1}^r (−(ν_k(ω)/n) ln p_k)} ≤ exp{−n(−Σ_{k=1}^r p_k ln p_k + ε₁ Σ_{k=1}^r ln p_k)} ≤ exp{−n(H − ½ε)}.

Similarly

    p(ω) ≥ exp{−n(H + ½ε)}.

Consequently (b) is now established. Furthermore, since

    P(C(n, ε₁)) ≥ N(C(n, ε₁)) · min_{ω ∈ C(n, ε₁)} p(ω),

we have

    N(C(n, ε₁)) ≤ P(C(n, ε₁)) / min_{ω ∈ C(n, ε₁)} p(ω) ≤ 1/e^{−n(H+½ε)} = e^{n(H+½ε)} ≤ e^{n(H+ε)},

and similarly

    N(C(n, ε₁)) ≥ P(C(n, ε₁)) / max_{ω ∈ C(n, ε₁)} p(ω) > P(C(n, ε₁)) e^{n(H−½ε)}.

Since P(C(n, ε₁)) → 1, n → ∞, there is an n₁ such that P(C(n, ε₁)) > 1 − ε for n > n₁, and therefore

    N(C(n, ε₁)) ≥ (1 − ε) exp{n(H − ½ε)} = exp{n(H − ε) + (½nε + ln(1 − ε))}.

Let n₂ be such that ½nε + ln(1 − ε) > 0 for n > n₂. Then when n ≥ n₀ = max(n₁, n₂) we have

    N(C(n, ε₁)) ≥ e^{n(H−ε)}.

This completes the proof of the theorem.

5. The law of large numbers for Bernoulli schemes lets us give a simple and elegant proof of Weierstrass's theorem on the approximation of continuous functions by polynomials.
Let f = f(p) be a continuous function on the interval [0, 1]. We introduce the polynomials

    B_n(p) = Σ_{k=0}^n f(k/n) C_n^k p^k q^{n−k},  q = 1 − p,

which are called Bernstein polynomials after the inventor of this proof of Weierstrass's theorem. If ξ₁, …, ξ_n is a sequence of independent Bernoulli random variables with P{ξ_i = 1} = p, P{ξ_i = 0} = q and S_n = ξ₁ + ⋯ + ξ_n, then

    E f(S_n/n) = B_n(p).

Since the function f = f(p), being continuous on [0, 1], is uniformly continuous, for every ε > 0 we can find δ > 0 such that |f(x) − f(y)| ≤ ε whenever |x − y| ≤ δ. It is also clear that the function is bounded: |f(x)| ≤ M < ∞. Using this and (5), we obtain

    |f(p) − B_n(p)| = | Σ_{k=0}^n [f(p) − f(k/n)] C_n^k p^k q^{n−k} |
        ≤ Σ_{k: |(k/n)−p| ≤ δ} |f(p) − f(k/n)| C_n^k p^k q^{n−k} + Σ_{k: |(k/n)−p| > δ} |f(p) − f(k/n)| C_n^k p^k q^{n−k}
        ≤ ε + 2M Σ_{k: |(k/n)−p| > δ} C_n^k p^k q^{n−k} ≤ ε + 2M/(4nδ²) = ε + M/(2nδ²).

Hence

    lim_{n→∞} max_{0 ≤ p ≤ 1} |f(p) − B_n(p)| = 0,

which is the conclusion of Weierstrass's theorem.
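The uniform convergence of the Bernstein polynomials can be observed numerically. A minimal sketch (the test function and helper name are our choices, for illustration only):

```python
import math

def bernstein(f, n, p):
    """B_n(p) = sum_{k=0}^n f(k/n) C(n,k) p^k (1-p)^(n-k), i.e. E f(S_n/n)."""
    return sum(f(k / n) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)   # a continuous (non-smooth) test function on [0, 1]

for n in (10, 100, 1000):
    err = max(abs(f(i / 50) - bernstein(f, n, i / 50)) for i in range(51))
    print(n, err)   # the maximal deviation over a grid shrinks as n grows
```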


6. PROBLEMS

1. Let ξ and η be random variables with correlation coefficient ρ. Establish the following two-dimensional analog of Chebyshev's inequality:

    P{|ξ − Eξ| ≥ ε√(Vξ) or |η − Eη| ≥ ε√(Vη)} ≤ (1/ε²)(1 + √(1 − ρ²)).

(Hint: Use the result of Problem 8 of §4.)

2. Let f = f(x) be a nonnegative even function that is nondecreasing for positive x. Then for a random variable ξ with |ξ(ω)| ≤ C,

    [E f(ξ) − f(ε)]/f(C) ≤ P{|ξ − Eξ| ≥ ε} ≤ E f(ξ − Eξ)/f(ε).

In particular, if f(x) = x²,

    (Eξ² − ε²)/C² ≤ P{|ξ − Eξ| ≥ ε} ≤ Vξ/ε².

3. Let ξ₁, …, ξ_n be a sequence of independent random variables with Vξ_i ≤ C. Then

    P{ |(ξ₁ + ⋯ + ξ_n)/n − E(ξ₁ + ⋯ + ξ_n)/n| ≥ ε } ≤ C/(nε²).    (15)

(With the same reservations as in (8), inequality (15) implies the validity of the law of large numbers in more general contexts than Bernoulli schemes.)

4. Let ξ₁, …, ξ_n be independent Bernoulli random variables with P{ξ_i = 1} = p > 0, P{ξ_i = −1} = 1 − p. Derive the following inequality of Bernstein: there is a number a > 0 such that

    P{ |S_n/n − (2p − 1)| ≥ ε } ≤ 2e^{−aε²n},

where S_n = ξ₁ + ⋯ + ξ_n and ε > 0.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre–Laplace, Poisson)

1. As in the preceding section, let S_n = ξ₁ + ⋯ + ξ_n. Then

    E(S_n/n) = p,    (1)

and by (4.14)

    E(S_n/n − p)² = pq/n.    (2)


It follows from (1) that S_n/n ≈ p, where the equivalence symbol ≈ has been given a precise meaning in the law of large numbers in terms of an inequality for P{|S_n/n − p| ≥ ε}. It is natural to suppose that, in a similar way, the relation

    [E(S_n/n − p)²]^{1/2} = √(pq/n),    (3)

which follows from (2), can also be given a precise probabilistic meaning involving, for example, probabilities of the form

    P{|S_n/n − p| ≤ x√(pq/n)},  x ∈ R¹,

or equivalently

    P{|(S_n − ES_n)/√(VS_n)| ≤ x}

(since ES_n = np and VS_n = npq).
If, as before, we write

    P_n(k) = C_n^k p^k q^{n−k},  0 ≤ k ≤ n,

for n ≥ 1, then

    P{|(S_n − ES_n)/√(VS_n)| ≤ x} = Σ_{k: |(k−np)/√(npq)| ≤ x} P_n(k).    (4)

We set the problem of finding convenient asymptotic formulas, as n → ∞, for P_n(k) and for their sum over the values of k that satisfy the condition on the right-hand side of (4). The following result provides an answer not only for these values of k (that is, for those satisfying |k − np| = O(√(npq))) but also for those satisfying |k − np| = o(npq)^{2/3}.

Local Limit Theorem. Let 0 < p < 1; then

    P_n(k) ~ (1/√(2πnpq)) e^{−(k−np)²/(2npq)}    (5)

uniformly for k such that |k − np| = o(npq)^{2/3}, i.e. as n → ∞,

    sup_{k: |k−np| ≤ φ(n)} | P_n(k) / [(1/√(2πnpq)) e^{−(k−np)²/(2npq)}] − 1 | → 0,

where φ(n) = o(npq)^{2/3}.


The proof depends on Stirling's formula (2.6):

    n! = √(2πn) e^{−n} n^n (1 + R(n)),

where R(n) → 0 as n → ∞. Then if n → ∞, k → ∞, n − k → ∞, we have

    C_n^k = n!/(k!(n − k)!)
          = [√(2πn) e^{−n} n^n / (√(2πk) e^{−k} k^k · √(2π(n−k)) e^{−(n−k)} (n−k)^{n−k})] · (1 + R(n))/[(1 + R(k))(1 + R(n−k))]
          = [1/√(2πn (k/n)(1 − k/n))] · (n/k)^k · (n/(n−k))^{n−k} · (1 + ε(n, k, n−k)),

where ε = ε(n, k, n−k) is defined in an evident way and ε → 0 as n → ∞, k → ∞, n − k → ∞. Therefore

    P_n(k) = C_n^k p^k q^{n−k} = [1/√(2πn (k/n)(1 − k/n))] (np/k)^k (nq/(n−k))^{n−k} (1 + ε).

Write p̂ = k/n. Then

    P_n(k) = [1/√(2πn p̂(1 − p̂))] (p/p̂)^{np̂} (q/(1 − p̂))^{n(1−p̂)} (1 + ε)
           = [1/√(2πn p̂(1 − p̂))] exp{−n [p̂ ln(p̂/p) + (1 − p̂) ln((1 − p̂)/(1 − p))]} (1 + ε)
           = [1/√(2πn p̂(1 − p̂))] exp{−n H(p̂)} (1 + ε),

where

    H(x) = x ln(x/p) + (1 − x) ln((1 − x)/(1 − p)).

We are considering values of k such that |k − np| = o(npq)^{2/3}, and consequently p̂ − p → 0, n → ∞.
Since, for 0 < x < 1,

    H′(x) = ln(x/p) − ln((1 − x)/(1 − p)),
    H″(x) = 1/x + 1/(1 − x),
    H‴(x) = −1/x² + 1/(1 − x)²,

if we write H(p̂) in the form H(p + (p̂ − p)) and use Taylor's formula, we find that for sufficiently large n

    H(p̂) = H(p) + H′(p)(p̂ − p) + ½H″(p)(p̂ − p)² + O(|p̂ − p|³)
          = ½(1/p + 1/q)(p̂ − p)² + O(|p̂ − p|³),

since H(p) = 0 and H′(p) = 0. Consequently

    P_n(k) = [1/√(2πn p̂(1 − p̂))] exp{−(n/(2pq))(p̂ − p)² + n·O(|p̂ − p|³)} (1 + ε).

Notice that

    (n/(2pq))(p̂ − p)² = (n/(2pq))(k/n − p)² = (k − np)²/(2npq).

Therefore

    P_n(k) = [1/√(2πnpq)] e^{−(k−np)²/(2npq)} (1 + ε′(n, k, n−k)),

where

    1 + ε′(n, k, n−k) = (1 + ε(n, k, n−k)) · √(p(1 − p)/(p̂(1 − p̂))) · exp{n·O(|p̂ − p|³)}

and, as is easily seen,

    sup |ε′(n, k, n−k)| → 0,  n → ∞,

if the sup is taken over the values of k for which |k − np| ≤ φ(n), φ(n) = o(npq)^{2/3}.
This completes the proof.

Corollary. The conclusion of the local limit theorem can be put in the following equivalent form: for all x ∈ R¹ such that x = o(npq)^{1/6}, and for np + x√(npq) an integer from the set {0, 1, …, n},

    P_n(np + x√(npq)) ~ (1/√(2πnpq)) e^{−x²/2},    (7)

i.e. as n → ∞,

    sup_{x: |x| ≤ ψ(n)} | P_n(np + x√(npq)) / [(1/√(2πnpq)) e^{−x²/2}] − 1 | → 0,    (8)

where ψ(n) = o(npq)^{1/6}.

With the reservations made in connection with formula (5.8), we can reformulate these results in probabilistic language in the following way:

    P{S_n = k} ~ (1/√(2πnpq)) e^{−(k−np)²/(2npq)},  |k − np| = o(npq)^{2/3},    (9)

    P{(S_n − np)/√(npq) = x} ~ (1/√(2πnpq)) e^{−x²/2},  x = o(npq)^{1/6}.    (10)

(In the last formula np + x√(npq) is assumed to have one of the values 0, 1, …, n.)
If we put t_k = (k − np)/√(npq) and Δt_k = t_{k+1} − t_k = 1/√(npq), the preceding formula assumes the form

    P{(S_n − np)/√(npq) = t_k} ~ (Δt_k/√(2π)) e^{−t_k²/2}.    (11)

It is clear that Δt_k = 1/√(npq) → 0 and the set of points {t_k} as it were "fills" the real line. It is natural to expect that (11) can be used to obtain the integral formula

    P{a < (S_n − np)/√(npq) ≤ b} ~ (1/√(2π)) ∫_a^b e^{−x²/2} dx,  −∞ < a ≤ b < ∞.

Let us now give a precise statement.

2. For −∞ < a ≤ b < ∞ let

    P_n(a, b] = Σ_{a < x ≤ b} P_n(np + x√(npq)),

where the summation is over those x for which np + x√(npq) is an integer.


It follows from the local theorem (see also (11)) that for all t_k defined by k = np + t_k√(npq) and satisfying |t_k| ≤ T < ∞,

    P_n(np + t_k√(npq)) = (Δt_k/√(2π)) e^{−t_k²/2} [1 + ε(t_k, n)],    (12)

where

    sup_{|t_k| ≤ T} |ε(t_k, n)| → 0,  n → ∞.    (13)

Consequently, if a and b are given so that −T ≤ a ≤ b ≤ T, then

    P_n(a, b] = Σ_{a < t_k ≤ b} (Δt_k/√(2π)) e^{−t_k²/2} + Σ_{a < t_k ≤ b} ε(t_k, n)(Δt_k/√(2π)) e^{−t_k²/2}
              = (1/√(2π)) ∫_a^b e^{−x²/2} dx + R_n^{(1)}(a, b) + R_n^{(2)}(a, b),    (14)

where

    R_n^{(1)}(a, b) = Σ_{a < t_k ≤ b} (Δt_k/√(2π)) e^{−t_k²/2} − (1/√(2π)) ∫_a^b e^{−x²/2} dx,
    R_n^{(2)}(a, b) = Σ_{a < t_k ≤ b} ε(t_k, n)(Δt_k/√(2π)) e^{−t_k²/2}.

From the standard properties of Riemann sums,

    sup_{−T ≤ a ≤ b ≤ T} |R_n^{(1)}(a, b)| → 0,  n → ∞.    (15)

It is also clear that

    sup_{−T ≤ a ≤ b ≤ T} |R_n^{(2)}(a, b)| ≤ sup_{|t_k| ≤ T} |ε(t_k, n)| · [(1/√(2π)) ∫_{−T}^T e^{−x²/2} dx + sup_{−T ≤ a ≤ b ≤ T} |R_n^{(1)}(a, b)|] → 0,    (16)

where the convergence of the right-hand side to zero follows from (15) and from

    (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx ≤ (1/√(2π)) ∫_{−∞}^∞ e^{−x²/2} dx = 1,    (17)

the value of the last integral being well known.


We write

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Then it follows from (14)–(16) that

    sup_{−T ≤ a ≤ b ≤ T} |P_n(a, b] − (Φ(b) − Φ(a))| → 0,  n → ∞.    (18)

We now show that this result holds for T = ∞ as well as for finite T. By (17), corresponding to a given ε > 0 we can find a finite T = T(ε) such that

    (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx > 1 − ¼ε.    (19)

According to (18), we can find an N such that for all n > N and T = T(ε) we have

    sup_{−T ≤ a ≤ b ≤ T} |P_n(a, b] − (Φ(b) − Φ(a))| < ¼ε.    (20)

It follows from this and (19) that

    P_n(−T, T] > 1 − ½ε,

and consequently

    P_n(−∞, −T] + P_n(T, ∞) ≤ ½ε,

where P_n(−∞, T] = lim_{S↓−∞} P_n(S, T] and P_n(T, ∞) = lim_{S↑∞} P_n(T, S]. Therefore for −∞ ≤ a ≤ −T < T ≤ b ≤ ∞,

    |P_n(a, b] − (1/√(2π)) ∫_a^b e^{−x²/2} dx|
        ≤ |P_n(−T, T] − (1/√(2π)) ∫_{−T}^T e^{−x²/2} dx|
          + |P_n(a, −T] − (1/√(2π)) ∫_a^{−T} e^{−x²/2} dx| + |P_n(T, b] − (1/√(2π)) ∫_T^b e^{−x²/2} dx|
        ≤ ¼ε + P_n(−∞, −T] + (1/√(2π)) ∫_{−∞}^{−T} e^{−x²/2} dx + P_n(T, ∞) + (1/√(2π)) ∫_T^∞ e^{−x²/2} dx
        ≤ ¼ε + ½ε + ⅛ε + ⅛ε = ε.

By using (18) it is now easy to see that P_n(a, b] tends uniformly to Φ(b) − Φ(a) for −∞ ≤ a < b ≤ ∞. Thus we have established the following theorem.

De Moivre–Laplace Integral Theorem. Let 0 < p < 1,

    P_n(a, b] = Σ_{a < x ≤ b} P_n(np + x√(npq)),  Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.

Then

    sup_{−∞ ≤ a < b ≤ ∞} | P_n(a, b] − (1/√(2π)) ∫_a^b e^{−x²/2} dx | → 0,  n → ∞.    (21)

With the same reservations as in (5.8), (21) can be stated in probabilistic language in the following way:

    sup_{−∞ ≤ a < b ≤ ∞} | P{a < (S_n − ES_n)/√(VS_n) ≤ b} − (1/√(2π)) ∫_a^b e^{−x²/2} dx | → 0,  n → ∞.

It follows at once from this formula that

    P{A < S_n ≤ B} − [Φ((B − np)/√(npq)) − Φ((A − np)/√(npq))] → 0,    (22)

as n → ∞, whenever −∞ ≤ A < B ≤ ∞.

EXAMPLE. A true die is tossed 12 000 times. We ask for the probability P that the number of 6's lies in the interval (1800, 2100]. The required probability is

    P = Σ_{k=1801}^{2100} C_{12000}^k (1/6)^k (5/6)^{12000−k}.

An exact calculation of this sum would obviously be rather difficult. However, if we use the integral theorem we find that the probability P in question is approximately (n = 12 000, p = 1/6, A = 1800, B = 2100)

    Φ((2100 − 2000)/√(npq)) − Φ((1800 − 2000)/√(npq)) ≈ Φ(2.449) − Φ(−4.898) ≈ 0.992,

where the values of Φ(2.449) and Φ(−4.898) are taken from tables of Φ(x). (The integral theorem says that P_n(a, b] = P{a√(npq) < S_n − np ≤ b√(npq)} = P{np + a√(npq) < S_n ≤ np + b√(npq)} is closely approximated by the integral (1/√(2π)) ∫_a^b e^{−x²/2} dx.)
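The die example can be reproduced numerically; the following sketch (an illustration, with our own helper names) computes both the normal approximation and the exact binomial sum, the latter with logarithms to avoid underflow:

```python
import math

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 12000, 1 / 6
mu, sigma = n * p, math.sqrt(n * p * (1 - p))     # mu = 2000, sigma ~ 40.8

# Normal (integral-theorem) approximation to P{1800 < S_n <= 2100}:
approx = Phi((2100 - mu) / sigma) - Phi((1800 - mu) / sigma)

# Exact binomial sum over k = 1801, ..., 2100:
def P_n(k):
    return math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))

exact = sum(P_n(k) for k in range(1801, 2101))
print(approx, exact)   # both close to the text's value 0.992
```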

Figure 9. P_n(np + x√(npq)) as a function of x.

We write

    F_n(x) = P_n(−∞, x].

Then it follows from (21) that

    sup_{−∞ ≤ x ≤ ∞} |F_n(x) − Φ(x)| → 0,  n → ∞.    (23)

It is natural to ask how rapid the approach to zero is in (21) and (23), as n → ∞. We quote a result in this direction (a special case of the Berry–Esseen theorem: see §6 in Chapter III):

    sup_{−∞ ≤ x ≤ ∞} |F_n(x) − Φ(x)| ≤ (p² + q²)/√(npq).    (24)

It is important to recognize that the order of the estimate (1/√(npq)) cannot be improved; this means that the approximation of F_n(x) by Φ(x) can be poor for values of p that are close to 0 or 1, even when n is large. This suggests the question of whether there is a better method of approximation for the probabilities of interest when p or q is small, something better than the normal approximation given by the local and integral theorems. In this connection we note that for p = 1/2, say, the binomial distribution {P_n(k)} is symmetric (Figure 10). However, for small p the binomial distribution is asymmetric (Figure 10), and hence it is not reasonable to expect that the normal approximation will be satisfactory.

Figure 10. The binomial probabilities P_n(k) for n = 10: symmetric for p = 1/2, asymmetric for small p.


4. It turns out that for small values of p the distribution known as the Poisson distribution provides a good approximation to {P_n(k)}. Let

    P_n(k) = C_n^k p^k q^{n−k},  k = 0, 1, …, n;  P_n(k) = 0,  k = n + 1, n + 2, …,

and suppose that p is a function p(n) of n.

Poisson's Theorem. Let p(n) → 0, n → ∞, in such a way that np(n) → λ, where λ > 0. Then for k = 0, 1, 2, …,

    P_n(k) → π_k,  n → ∞,    (25)

where

    π_k = λ^k e^{−λ}/k!,  k = 0, 1, ….    (26)

The proof is extremely simple. Since p(n) = λ/n + o(1/n) by hypothesis, for a given k = 0, 1, … and sufficiently large n,

    P_n(k) = C_n^k p^k q^{n−k} = [n(n − 1)⋯(n − k + 1)/k!] [λ/n + o(1/n)]^k [1 − λ/n + o(1/n)]^{n−k}.

But

    n(n − 1)⋯(n − k + 1) [λ/n + o(1/n)]^k → λ^k,  n → ∞,

and

    [1 − λ/n + o(1/n)]^{n−k} → e^{−λ},  n → ∞,

which establishes (25).
The set of numbers {π_k, k = 0, 1, …} defines the Poisson probability distribution (π_k ≥ 0, Σ_{k=0}^∞ π_k = 1). Notice that all the (discrete) distributions considered previously were concentrated at only a finite number of points. The Poisson distribution is the first example that we have encountered of a (discrete) distribution concentrated at a countable number of points.
The following result of Prohorov exhibits the rapidity with which P_n(k) converges to π_k as n → ∞: if np(n) = λ > 0, then

    Σ_{k=0}^∞ |P_n(k) − π_k| ≤ (2λ/n) · min(2, λ).    (27)

(For the proof of (27), see §7 of Chapter III.)


5. Let us return to the De Moivre–Laplace limit theorem, and show how it implies the law of large numbers (with the same reservation that was made in connection with (5.8)). Since

    {|S_n/n − p| ≤ ε} = {|(S_n − np)/√(npq)| ≤ ε√(n/(pq))},

it is clear from (21) that when ε > 0

    P{|S_n/n − p| ≤ ε} − (1/√(2π)) ∫_{−ε√(n/(pq))}^{ε√(n/(pq))} e^{−x²/2} dx → 0,  n → ∞,    (28)

whence

    P{|S_n/n − p| ≤ ε} → 1,  n → ∞,

which is the conclusion of the law of large numbers. From (28),

    P{|S_n/n − p| ≤ ε} ≈ (1/√(2π)) ∫_{−ε√(n/(pq))}^{ε√(n/(pq))} e^{−x²/2} dx,  n → ∞,    (29)

whereas Chebyshev's inequality yielded only

    P{|S_n/n − p| ≤ ε} ≥ 1 − 1/(4nε²).

It was shown at the end of §5 that Chebyshev's inequality yielded the estimate

    n ≥ 1/(4ε²α)

for the number of observations needed for the validity of the inequality

    P{|S_n/n − p| ≤ ε} ≥ 1 − α.

Thus with ε = 0.02 and α = 0.05, 12 500 observations were needed. We can now solve the same problem by using the approximation (29). We define the number k(α) by

    (1/√(2π)) ∫_{−k(α)}^{k(α)} e^{−x²/2} dx = 1 − α.

Since ε√(n/(pq)) ≥ 2ε√n, if we define n as the smallest integer satisfying

    2ε√n ≥ k(α),    (30)

we find that

    P{|S_n/n − p| ≤ ε} ≳ 1 − α.    (31)

We find from (30) that the smallest integer n satisfying

    n ≥ k²(α)/(4ε²)

guarantees that (31) is satisfied, and the accuracy of the approximation can easily be established by using (24). Taking ε = 0.02, α = 0.05, we find that in fact 2500 observations suffice, rather than the 12 500 found by using Chebyshev's inequality. The values of k(α) have been tabulated. We quote a number of values of k(α) for various values of α:

    α:     0.50    0.3173  0.10    0.05    0.0454  0.01    0.0027
    k(α):  0.675   1.000   1.645   1.960   2.000   2.576   3.000
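The quantities in the table and the two competing sample sizes can be computed directly. A sketch for illustration (our own helper names; k(α) is found by bisection on 2Φ(k) − 1 = 1 − α):

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def k_alpha(alpha, lo=0.0, hi=10.0):
    """Solve (1/sqrt(2 pi)) * integral_{-k}^{k} e^(-x^2/2) dx = 2*Phi(k) - 1 = 1 - alpha."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if 2 * Phi(mid) - 1 < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eps, alpha = 0.02, 0.05
n_chebyshev = math.ceil(1 / (4 * eps ** 2 * alpha))          # formula (13): 12500
n_normal = math.ceil(k_alpha(alpha) ** 2 / (4 * eps ** 2))   # n >= k^2(alpha)/(4 eps^2)
print(n_chebyshev, n_normal)
print(round(k_alpha(0.05), 3), round(k_alpha(0.0027), 3))    # ~1.96 and ~3.0, as in the table
```

With the exact k(0.05) ≈ 1.96 this gives about 2401 observations; the text's round figure of 2500 corresponds to taking k(α) = 2.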

6. The function

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt,    (32)

which was introduced above and occurs in the De Moivre–Laplace integral theorem, plays an exceptionally important role in probability theory. It is known as the normal or Gaussian distribution on the real line, with the (normal or Gaussian) density

    φ(x) = (1/√(2π)) e^{−x²/2}.


Figure 11. Graph of the normal probability density φ(x).


Figure 12. Graph of the normal distribution Φ(x).

We have already encountered (discrete) distributions concentrated on a finite or countable set of points. The normal distribution belongs to another important class of distributions that arise in probability theory. We have mentioned its exceptional role; this comes about, first of all, because under rather general hypotheses, sums of a large number of independent random variables (not necessarily Bernoulli variables) are closely approximated by the normal distribution (§4 of Chapter III). For the present we mention only some of the simplest properties of φ(x) and Φ(x), whose graphs are shown in Figures 11 and 12.
The function φ(x) is a symmetric bell-shaped curve, decreasing very rapidly with increasing |x|: thus φ(1) = 0.24197, φ(2) = 0.053991, φ(3) = 0.004432, φ(4) = 0.000134, φ(5) = 0.000016. Its maximum is attained at x = 0 and is equal to (2π)^{−1/2} ≈ 0.399.
The curve Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt approaches 1 very rapidly as x increases: Φ(1) = 0.841345, Φ(2) = 0.977250, Φ(3) = 0.998650, Φ(4) = 0.999968, Φ(4.5) = 0.999997.
For tables of φ(x) and Φ(x), as well as of other important functions that are used in probability theory and mathematical statistics, see [A1].

7. PROBLEMS

1. Let n = 100 and p = 1/10, 1/20, 1/30, 1/40, 1/50. Using tables (for example, those in [A1]) of the binomial and Poisson distributions, compare the values of the probabilities

    P{10 < S_100 ≤ 12},  P{20 < S_100 ≤ 22},  P{33 < S_100 ≤ 35},  P{40 < S_100 ≤ 42},  P{50 < S_100 ≤ 52}

with the corresponding values given by the normal and Poisson approximations.

2. Let p = 1/2 and Z_n = 2S_n − n (the excess of 1's over 0's in n trials). Show that

    sup_j | √(πn) P{Z_{2n} = j} − e^{−j²/(4n)} | → 0,  n → ∞.

3. Show that the rate of convergence in Poisson's theorem is given by

§7. Estimating the Probability of Success in the Bernoulli Scheme

1. In the Bernoulli scheme (Ω, 𝒜, P) with

    Ω = {ω: ω = (x₁, …, x_n), x_i = 0, 1},  𝒜 = {A: A ⊆ Ω},  p(ω) = p^{Σx_i} q^{n−Σx_i},

we supposed that p (the probability of success) was known. Let us now suppose that p is not known in advance and that we want to determine it by observing the outcomes of experiments; or, what amounts to the same thing, by observations of the random variables ξ₁, …, ξ_n, where ξ_i(ω) = x_i. This is a typical problem of mathematical statistics, and can be formulated in various ways. We shall consider two of the possible formulations: the problem of estimation and the problem of constructing confidence intervals.
In the notation used in mathematical statistics, the unknown parameter is denoted by θ, assuming a priori that θ belongs to the set Θ = [0, 1]. We say that the set (Ω, 𝒜, P_θ; θ ∈ Θ) with p_θ(ω) = θ^{Σx_i}(1 − θ)^{n−Σx_i} is a probabilistic-statistical model (corresponding to "n independent trials" with probability of "success" θ ∈ Θ), and any function T_n = T_n(ω) with values in Θ is called an estimator.
If S_n = ξ₁ + ⋯ + ξ_n and T_n* = S_n/n, it follows from the law of large numbers that T_n* is consistent, in the sense that (ε > 0)

    P_θ{|T_n* − θ| ≥ ε} → 0,  n → ∞.    (1)

Moreover, this estimator is unbiased: for every θ,

    E_θ T_n* = θ,    (2)

where E_θ is the expectation corresponding to the probability P_θ.
The property of being unbiased is quite natural: it expresses the fact that any reasonable estimate ought, at least "on the average," to lead to the desired result. However, it is easy to see that T_n* is not the only unbiased estimator. For example, the same property is possessed by every estimator

    T_n = (b₁x₁ + ⋯ + b_n x_n)/n,

where b₁ + ⋯ + b_n = n. Moreover, the law of large numbers (1) is also satisfied by such estimators (at least if |b_i| ≤ K < ∞; see Problem 2(b), §3, Chapter III) and so these estimators T_n are just as "good" as T_n*.
In this connection there arises the question of how to compare different unbiased estimators, and which of them to describe as best, or optimal. With the same meaning of "estimator," it is natural to suppose that an estimator is better, the smaller its deviation from the parameter that is being estimated. On this basis, we call an estimator T̃_n efficient (in the class of unbiased estimators T_n) if

    V_θ T̃_n = inf_{T_n} V_θ T_n,  θ ∈ Θ,    (3)

where V_θ T_n is the dispersion of T_n, i.e. E_θ(T_n − θ)².
Let us show that the estimator T_n* considered above is efficient. We have

    V_θ T_n* = V_θ(S_n/n) = V_θ S_n/n² = nθ(1 − θ)/n² = θ(1 − θ)/n.    (4)

Hence to establish that T_n* is efficient, we have only to show that

    inf_{T_n} V_θ T_n ≥ θ(1 − θ)/n.    (5)

This is obvious for θ = 0 or 1. Let θ ∈ (0, 1) and

    p_θ(x_i) = θ^{x_i}(1 − θ)^{1−x_i}.

It is clear that

    p_θ(ω) = Π_{i=1}^n p_θ(x_i).

Let us write

    L_θ(ω) = ln p_θ(ω).

Then

    L_θ(ω) = ln θ · Σx_i + ln(1 − θ) · Σ(1 − x_i)

and

    ∂L_θ(ω)/∂θ = Σ(x_i − θ)/[θ(1 − θ)].

Since

    1 = Σ_ω p_θ(ω)

and since T_n is unbiased,

    θ = E_θ T_n = Σ_ω T_n(ω) p_θ(ω).

After differentiating with respect to θ, we find that

    0 = Σ_ω ∂p_θ(ω)/∂θ = Σ_ω (∂L_θ(ω)/∂θ) p_θ(ω) = E_θ[∂L_θ(ω)/∂θ],
    1 = Σ_ω T_n(ω)(∂L_θ(ω)/∂θ) p_θ(ω) = E_θ[T_n · ∂L_θ(ω)/∂θ].

Therefore

    1 = E_θ[(T_n − θ) · ∂L_θ(ω)/∂θ],

and by the Cauchy–Bunyakovskii inequality,

    1 ≤ E_θ[T_n − θ]² · E_θ[∂L_θ(ω)/∂θ]²,

whence

    E_θ[T_n − θ]² ≥ 1/I_n(θ),    (6)

where

    I_n(θ) = E_θ[∂L_θ(ω)/∂θ]²

is known as Fisher's information. From (6) we can obtain a special case of the Rao–Cramér inequality for unbiased estimators T_n:

    inf_{T_n} V_θ T_n ≥ 1/I_n(θ).    (7)

In the present case

    I_n(θ) = E_θ[∂L_θ(ω)/∂θ]² = E_θ[Σ(ξ_i − θ)/(θ(1 − θ))]² = nθ(1 − θ)/[θ(1 − θ)]² = n/[θ(1 − θ)],

which also establishes (5), from which, as we already noticed, there follows the efficiency of the unbiased estimator T_n* = S_n/n for the unknown parameter θ.

2. It is evident that, in considering T_n* as a pointwise estimator for θ, we have introduced a certain amount of inaccuracy. It can even happen that the numerical value of T_n* calculated from observations of x₁, …, x_n differs rather severely from the true value θ. Hence it would be advisable to determine the size of the error.
It would be too much to hope that T_n*(ω) differs little from the true value θ for all sample points ω. However, we know from the law of large numbers that for every δ > 0 and for sufficiently large n, the probability of the event {|θ − T_n*(ω)| > δ} will be arbitrarily small. By Chebyshev's inequality

    P_θ{|θ − T_n*| > δ} ≤ V_θ T_n*/δ² = θ(1 − θ)/(nδ²),

and therefore, for every λ > 0,

    P_θ{|θ − T_n*| ≤ λ√(θ(1 − θ)/n)} ≥ 1 − 1/λ².

If we take, for example, λ = 3, then with P_θ-probability greater than 0.8888 (1 − 1/3² = 8/9 ≈ 0.8889) the event

    |θ − T_n*| ≤ 3√(θ(1 − θ)/n)

will be realized, and a fortiori the event

    |θ − T_n*| ≤ 3/(2√n),

since θ(1 − θ) ≤ 1/4. Therefore

    P_θ{|θ − T_n*| ≤ 3/(2√n)} = P_θ{T_n* − 3/(2√n) ≤ θ ≤ T_n* + 3/(2√n)} ≥ 0.8888.

In other words, we can say with probability greater than 0.8888 that the exact value of θ is in the interval [T_n* − 3/(2√n), T_n* + 3/(2√n)]. This statement is sometimes written in the symbolic form

    θ ≈ T_n* ± 3/(2√n)  (≳ 88%),

where "≳ 88%" means "in more than 88% of all cases."
The interval [T_n* − 3/(2√n), T_n* + 3/(2√n)] is an example of what are called confidence intervals for the unknown parameter.
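The coverage guarantee of this interval is easy to test by simulation. A sketch for illustration (the simulation parameters N, n, θ are our own choices):

```python
import random

random.seed(1)

def covers(theta, n):
    """One series of n Bernoulli(theta) trials; does [T* - 3/(2 sqrt n), T* + 3/(2 sqrt n)] contain theta?"""
    s = sum(random.random() < theta for _ in range(n))
    t, half = s / n, 3 / (2 * n ** 0.5)
    return t - half <= theta <= t + half

N, n, theta = 2000, 100, 0.3
coverage = sum(covers(theta, n) for _ in range(N)) / N
print(coverage)   # well above the guaranteed 0.8888 (in fact close to 0.9973, as shown below)
```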

Definition. An interval of the form

    [ψ₁(ω), ψ₂(ω)],

where ψ₁(ω) and ψ₂(ω) are functions of sample points, is called a confidence interval of reliability 1 − δ (or of significance level δ) if

    P_θ{ψ₁(ω) ≤ θ ≤ ψ₂(ω)} ≥ 1 − δ

for all θ ∈ Θ.

I. Elementary Probability Theory

The preceding discussion shows that the interval

[ Tn* -

A Tn* + Jn A J Jn' 2

2

has reliability 1 - (1/A 2 ). In point of fact, the reliability of this confidence interval is considerably higher, since Chebyshev's inequality gives only crude estimates of the probabilities of events. To obtain more precise results we notice that

{w: /8- r:1 s ;.,)8(1: 8)} = {w: t/1 (T:, n):::; 8 s 1/Jz(T:, n)}, 1

where t/1 1 = t/1 1(T:,n) and t/1 2 = t/1 2 (T:,n) are the roots of the quadratic equation (8 - T!) 2

= ;.,z 8(1 - 8),

n which describes an ellipse situated as shown in Figure 13. Now let

Then by (6.24) sup /F(J(x)-
s

1

1

v n8(1 - 8)

Therefore if we know a priori that 0 < Ll ::::;; 8 ::::;; 1 - Ll < 1,

where Ll is a constant, then sup /F 0(x)-
()

Figure 13

s

1

r:

Llv n

73

§7. Estimating the Probability of Success in the Bernoulli Scheme

2:: (2<1>(A) - 1) -

2

r:..

~vn

Let A* be the smallest A for which (2<1>(A) - 1) -

2

r:.

~vn

2:: 1 - <5*,

where <5* is a given significance level. Putting <5 = <5* - (2/~Jn), we find that A* satisfies the equation

For large n we may neglect the term 2/~Jn and assume that A* satisfies ( A*) = 1 -

~ <5*.

In particular, if A.* = 3 then c5* = 0.9973 .... Then with probability approximately 0.9973

r:: -

3J8(1

~ 8) ~ () ~ r:: + 3J8(1 ~ 8)

(8)

or, after iterating and then suppressing terms of order O(n- 3 14 ), we obtain

T,.* - 3

T.*(1 - T.*) n

n

n

~ () ~

T,.*

+3

T.n*(1 - T.n*)

n

(9)

Hence it follows that the confidence interval

[r:- 2~, r: + 2~]

(10)

has (for large n) reliability 0.9973 (whereas Chebyshev's inequality only provided reliability approximately 0.8889). Thus we can make the following practical application. Let us carry out a large number N of series of experiments, in each of which we estimate the parameter () after n observations. Then in about 99.73% of the N cases, in each series the estimate will differ from the true value of the parameter by at most 3/2Jn. (On this topic see also the end of §5.)

74

I. Elementary Probability Theory

3. PROBLEMS 1. Let it be known a priori that 8 has a value in the set 8 estimator fore, taking values only in E>o.

0

c;; [0, 1]. Construct an unbiased

2. Under the hypotheses of the preceding problem, find an analog of the Rao-Cramer inequality and discuss the problem of efficient estimators. 3. Under the hypotheses of the first problem, discuss the construction of confidence intervals for e.

§8. Conditional Probabilities and Mathematical Expectations with Respect to Decompositions 1. Let (0, d, P) be a finite probability space and

a decomposition ofO (D; Ed, P(D;) > 0, i = 1, ... , k, and D 1 + · · · + Dk = 0). Also let A be an event from d and P(AID;) the conditional probability of A with respect to D;. With a set of conditional probabilities {P(A ID;), i = 1, ... , k} we may associate the random variable n(w)

=

k

L

P(A ID;)lv,(w)

(1)

i= 1

(cf. (4.5)), that takes the values P(AID;) on the atoms of D;. To emphasize that this random variable is associated specifically with the decomposition fl2, we denote it by P(Aifl2)

or

P(Aifl2)(w)

and call it the conditional probability of the event A with respect to the de-

composition fl2.

This concept, as well as the more general concept of conditional probabilities with respect to au-algebra, which will be introduced later, plays an important role in probability theory, a role that will be developed progressively as we proceed. We mention some of the simplest properties of conditional probabilities: P(A

+ B I fl2)

= P(A I fl2)

+ P(B Ifl2);

(2)

if fl2 is the trivial decomposition consisting of the single set 0 then P(A I0)

=

P(A).

(3)


The definition of P(A|𝒟) as a random variable lets us speak of its expectation; by using this, we can write the formula (3.3) for total probability in the following compact form:

$$\mathsf{E}\,P(A \mid \mathscr{D}) = P(A). \tag{4}$$

In fact, since

$$P(A \mid \mathscr{D}) = \sum_{i=1}^{k} P(A \mid D_i)\, I_{D_i}(\omega),$$

then by the definition of expectation (see (4.5) and (4.6))

$$\mathsf{E}\,P(A \mid \mathscr{D}) = \sum_{i=1}^{k} P(A \mid D_i)\,P(D_i) = \sum_{i=1}^{k} P(A D_i) = P(A).$$
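The total probability formula (4) is easy to check numerically. The following sketch (not from the book; the six-point space, the decomposition and the event are arbitrary choices) computes E P(A|𝒟) in exact rational arithmetic:

```python
from fractions import Fraction

# Hypothetical finite probability space: outcomes 0..5, equiprobable.
P = {w: Fraction(1, 6) for w in range(6)}

# A decomposition of Omega into three atoms, and an arbitrary event A.
D = [{0, 1}, {2, 3}, {4, 5}]
A = {1, 2, 4}

def prob(S):
    return sum(P[w] for w in S)

def cond_prob(A, Di):
    return prob(A & Di) / prob(Di)

# P(A | D) takes the value P(A | D_i) on the atom D_i; its expectation
# is sum_i P(A | D_i) P(D_i), which must equal P(A) by formula (4).
expectation = sum(cond_prob(A, Di) * prob(Di) for Di in D)
assert expectation == prob(A)
print(expectation)
```

The same computation works for any finite space and decomposition, since it is just the total probability formula (3.3) rewritten.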

Now let η = η(ω) be a random variable that takes the values y₁, …, y_k with positive probabilities:

$$\eta(\omega) = \sum_{j=1}^{k} y_j I_{D_j}(\omega),$$

where D_j = {ω: η(ω) = y_j}. The decomposition 𝒟_η = {D₁, …, D_k} is called the decomposition induced by η. The conditional probability P(A|𝒟_η) will be denoted by P(A|η) or P(A|η)(ω), and called the conditional probability of A with respect to the random variable η. We also denote by P(A|η = y_j) the conditional probability P(A|D_j), where D_j = {ω: η(ω) = y_j}.

Similarly, if η₁, η₂, …, η_m are random variables and 𝒟_{η₁, η₂, …, η_m} is the decomposition induced by η₁, η₂, …, η_m, with atoms

$$D_{y_1, y_2, \ldots, y_m} = \{\omega\colon \eta_1(\omega) = y_1, \ldots, \eta_m(\omega) = y_m\},$$

then P(A|𝒟_{η₁, η₂, …, η_m}) will be denoted by P(A|η₁, η₂, …, η_m) and called the conditional probability of A with respect to η₁, η₂, …, η_m.

Example 1. Let ξ and η be independent identically distributed random variables, each taking the values 1 and 0 with probabilities p and q. For k = 0, 1, 2, let us find the conditional probability P(ξ + η = k | η) of the event A = {ω: ξ + η = k} with respect to η.

To do this, we first notice the following useful general fact: if ξ and η are independent random variables with respective values x and y, then

$$P(\xi + \eta = z \mid \eta = y) = P(\xi + y = z). \tag{5}$$

In fact,

$$P(\xi + \eta = z \mid \eta = y) = \frac{P(\xi + \eta = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z,\ \eta = y)}{P(\eta = y)} = \frac{P(\xi + y = z)\,P(\eta = y)}{P(\eta = y)} = P(\xi + y = z).$$


Using this formula for the case at hand, we find that

$$P(\xi + \eta = k \mid \eta) = P(\xi + \eta = k \mid \eta = 0)\,I_{\{\eta = 0\}}(\omega) + P(\xi + \eta = k \mid \eta = 1)\,I_{\{\eta = 1\}}(\omega) = P(\xi = k)\,I_{\{\eta = 0\}}(\omega) + P(\xi = k - 1)\,I_{\{\eta = 1\}}(\omega).$$

Thus

$$P(\xi + \eta = k \mid \eta) = P(\xi = k)\,I_{\{\eta = 0\}}(\omega) + P(\xi = k - 1)\,I_{\{\eta = 1\}}(\omega), \tag{6}$$

or equivalently

$$P(\xi + \eta = k \mid \eta) = \begin{cases} q(1 - \eta), & k = 0, \\ p(1 - \eta) + q\eta, & k = 1, \\ p\eta, & k = 2. \end{cases} \tag{7}$$

2. Let ξ = ξ(ω) be a random variable with values in the set X = {x₁, …, x_l}:

$$\xi = \sum_{j=1}^{l} x_j I_{A_j}(\omega),$$
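Formula (7) can be confirmed by direct computation on the joint distribution of ξ and η. In the following sketch the value p = 3/5 is an arbitrary choice, not from the book:

```python
from fractions import Fraction
from itertools import product

p = Fraction(3, 5)
q = 1 - p

# Joint distribution of two independent Bernoulli variables xi, eta in {0, 1}.
dist = {(x, e): (p if x == 1 else q) * (p if e == 1 else q)
        for x, e in product((0, 1), repeat=2)}

def cond(k, e):
    """P(xi + eta = k | eta = e), computed from the joint distribution."""
    num = sum(pr for (x, ee), pr in dist.items() if x + ee == k and ee == e)
    den = sum(pr for (x, ee), pr in dist.items() if ee == e)
    return num / den

# Check the three cases of formula (7) for both values of eta.
for e in (0, 1):
    assert cond(0, e) == q * (1 - e)
    assert cond(1, e) == p * (1 - e) + q * e
    assert cond(2, e) == p * e
```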

and let 𝒟 = {D₁, …, D_k} be a decomposition. Just as we defined the expectation of ξ in terms of the probabilities P(A_j), j = 1, …, l,

$$\mathsf{E}\xi = \sum_{j=1}^{l} x_j P(A_j), \tag{8}$$

it is now natural to define the conditional expectation of ξ with respect to 𝒟 by using the conditional probabilities P(A_j|𝒟), j = 1, …, l. We denote this expectation by E(ξ|𝒟) or E(ξ|𝒟)(ω), and define it by the formula

$$\mathsf{E}(\xi \mid \mathscr{D}) = \sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}). \tag{9}$$

According to this definition the conditional expectation E(ξ|𝒟)(ω) is a random variable which, at all sample points ω belonging to the same atom Dᵢ, takes the same value $\sum_{j=1}^{l} x_j P(A_j \mid D_i)$. This observation shows that the definition of E(ξ|𝒟) could have been expressed differently. In fact, we could first define E(ξ|Dᵢ), the conditional expectation of ξ with respect to Dᵢ, by

$$\mathsf{E}(\xi \mid D_i) = \sum_{j=1}^{l} x_j P(A_j \mid D_i), \tag{10}$$

and then define

$$\mathsf{E}(\xi \mid \mathscr{D})(\omega) = \sum_{i=1}^{k} \mathsf{E}(\xi \mid D_i)\,I_{D_i}(\omega) \tag{11}$$

(see the diagram in Figure 14).

Figure 14. Diagram of the relations, via formulas (3.1), (8), (1), (9), (10) and (11), among P(·), Eξ, P(·|D), E(ξ|D), P(·|𝒟) and E(ξ|𝒟).

It is also useful to notice that E(ξ|D) and E(ξ|𝒟) are independent of the representation of ξ.

The following properties of conditional expectations follow immediately from the definitions:

$$\mathsf{E}(a\xi + b\eta \mid \mathscr{D}) = a\,\mathsf{E}(\xi \mid \mathscr{D}) + b\,\mathsf{E}(\eta \mid \mathscr{D}), \quad a \text{ and } b \text{ constants}; \tag{12}$$

$$\mathsf{E}(\xi \mid \Omega) = \mathsf{E}\xi; \tag{13}$$

$$\mathsf{E}(C \mid \mathscr{D}) = C, \quad C \text{ constant}; \tag{14}$$

$$\text{if } \xi = I_A, \text{ then } \mathsf{E}(\xi \mid \mathscr{D}) = P(A \mid \mathscr{D}). \tag{15}$$

The last equation shows, in particular, that properties of conditional probabilities can be deduced directly from properties of conditional expectations.

The following important property generalizes the formula (4) for total probability:

$$\mathsf{E}\,\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\xi. \tag{16}$$

For the proof, it is enough to notice that, by (4),

$$\mathsf{E}\,\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}) = \sum_{j=1}^{l} x_j\,\mathsf{E}\,P(A_j \mid \mathscr{D}) = \sum_{j=1}^{l} x_j P(A_j) = \mathsf{E}\xi.$$

Let 𝒟 = {D₁, …, D_k} be a decomposition and η = η(ω) a random variable. We say that η is measurable with respect to this decomposition, or 𝒟-measurable, if 𝒟_η ≼ 𝒟, i.e. η = η(ω) can be represented in the form

$$\eta(\omega) = \sum_{i=1}^{k} y_i I_{D_i}(\omega),$$

where some of the yᵢ might be equal. In other words, a random variable is 𝒟-measurable if and only if it takes constant values on the atoms of 𝒟.

Example 2. If 𝒟 is the trivial decomposition, 𝒟 = {Ω}, then η is 𝒟-measurable if and only if η ≡ C, where C is a constant. Every random variable η is measurable with respect to 𝒟_η.


Suppose that the random variable η is 𝒟-measurable. Then

$$\mathsf{E}(\xi\eta \mid \mathscr{D}) = \eta\,\mathsf{E}(\xi \mid \mathscr{D}), \tag{17}$$

and in particular

$$\mathsf{E}(\eta \mid \mathscr{D}_\eta) = \eta. \tag{18}$$

To establish (17) we observe that if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, then

$$\xi\eta = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i I_{A_j D_i}$$

and therefore

$$\mathsf{E}(\xi\eta \mid \mathscr{D}) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i P(A_j D_i \mid \mathscr{D}) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i \sum_{m=1}^{k} P(A_j D_i \mid D_m)\,I_{D_m}(\omega) = \sum_{j=1}^{l} \sum_{i=1}^{k} x_j y_i P(A_j \mid D_i)\,I_{D_i}(\omega). \tag{19}$$

On the other hand, since $I_{D_i}^2 = I_{D_i}$ and $I_{D_i} \cdot I_{D_m} = 0$ for $i \neq m$, we obtain

$$\eta\,\mathsf{E}(\xi \mid \mathscr{D}) = \Bigl[\sum_{i=1}^{k} y_i I_{D_i}(\omega)\Bigr] \cdot \Bigl[\sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D})\Bigr] = \Bigl[\sum_{i=1}^{k} y_i I_{D_i}(\omega)\Bigr] \cdot \Bigl[\sum_{m=1}^{k} \Bigl(\sum_{j=1}^{l} x_j P(A_j \mid D_m)\Bigr) I_{D_m}(\omega)\Bigr] = \sum_{i=1}^{k} \sum_{j=1}^{l} y_i x_j P(A_j \mid D_i)\,I_{D_i}(\omega),$$
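Property (17) can be illustrated numerically. The sketch below (the six-point space, ξ, η and the decomposition are arbitrary choices, not from the book) checks E(ξη|𝒟) = η E(ξ|𝒟) at every sample point; η is constant on atoms and hence 𝒟-measurable:

```python
from fractions import Fraction

# Hypothetical example: six equiprobable outcomes, three atoms.
P = {w: Fraction(1, 6) for w in range(6)}
atoms = [{0, 1}, {2, 3}, {4, 5}]
xi  = {0: 3, 1: -1, 2: 0, 3: 2, 4: 5, 5: 1}    # arbitrary random variable
eta = {0: 2, 1: 2, 2: -1, 3: -1, 4: 0, 5: 0}   # constant on atoms => D-measurable

def cond_exp(f, w):
    """E(f | D)(w): the weighted average of f over the atom containing w."""
    Di = next(a for a in atoms if w in a)
    return sum(f[v] * P[v] for v in Di) / sum(P[v] for v in Di)

for w in range(6):
    lhs = cond_exp({v: xi[v] * eta[v] for v in range(6)}, w)  # E(xi*eta | D)(w)
    rhs = eta[w] * cond_exp(xi, w)                            # eta(w) E(xi | D)(w)
    assert lhs == rhs
```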

which, with (19), establishes (17).

We shall establish another important property of conditional expectations. Let 𝒟₁ and 𝒟₂ be two decompositions, with 𝒟₁ ≼ 𝒟₂ (𝒟₂ is "finer" than 𝒟₁). Then

$$\mathsf{E}[\mathsf{E}(\xi \mid \mathscr{D}_2) \mid \mathscr{D}_1] = \mathsf{E}(\xi \mid \mathscr{D}_1). \tag{20}$$

For the proof, suppose that 𝒟₁ = {D₁₁, …, D_{1m}} and 𝒟₂ = {D₂₁, …, D_{2n}}. Then if $\xi = \sum_{j=1}^{l} x_j I_{A_j}$, we have

$$\mathsf{E}(\xi \mid \mathscr{D}_2) = \sum_{j=1}^{l} x_j P(A_j \mid \mathscr{D}_2),$$


and it is sufficient to establish that

$$\mathsf{E}[P(A_j \mid \mathscr{D}_2) \mid \mathscr{D}_1] = P(A_j \mid \mathscr{D}_1). \tag{21}$$

Since

$$P(A_j \mid \mathscr{D}_2) = \sum_{q=1}^{n} P(A_j \mid D_{2q})\,I_{D_{2q}},$$

we have

$$\mathsf{E}[P(A_j \mid \mathscr{D}_2) \mid \mathscr{D}_1] = \sum_{q=1}^{n} P(A_j \mid D_{2q})\,P(D_{2q} \mid \mathscr{D}_1) = \sum_{p=1}^{m} I_{D_{1p}} \sum_{q=1}^{n} P(A_j \mid D_{2q})\,P(D_{2q} \mid D_{1p}) = \sum_{p=1}^{m} I_{D_{1p}} \sum_{\{q\colon D_{2q} \subseteq D_{1p}\}} P(A_j \mid D_{2q})\,P(D_{2q} \mid D_{1p}) = \sum_{p=1}^{m} I_{D_{1p}}\,P(A_j \mid D_{1p}) = P(A_j \mid \mathscr{D}_1),$$

which establishes (21).

When 𝒟 is induced by the random variables η₁, …, η_k (𝒟 = 𝒟_{η₁, …, η_k}), the conditional expectation E(ξ|𝒟_{η₁, …, η_k}) is denoted by E(ξ|η₁, …, η_k), or E(ξ|η₁, …, η_k)(ω), and is called the conditional expectation of ξ with respect to η₁, …, η_k.

It follows immediately from the definition of E(ξ|η) that if ξ and η are independent, then

$$\mathsf{E}(\xi \mid \eta) = \mathsf{E}\xi. \tag{22}$$

From (18) it also follows that

$$\mathsf{E}(\eta \mid \eta) = \eta. \tag{23}$$

Property (22) admits the following generalization. Let ξ be independent of 𝒟 (i.e. for each Dᵢ ∈ 𝒟 the random variables ξ and $I_{D_i}$ are independent). Then

$$\mathsf{E}(\xi \mid \mathscr{D}) = \mathsf{E}\xi. \tag{24}$$

As a special case of (20) we obtain the following useful formula:

$$\mathsf{E}[\mathsf{E}(\xi \mid \eta_1, \eta_2) \mid \eta_1] = \mathsf{E}(\xi \mid \eta_1). \tag{25}$$
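The telescoping property (20) can be checked by exact computation; in the following sketch the eight-point space, the variable ξ and the two nested decompositions are arbitrary choices, not from the book:

```python
from fractions import Fraction

P = {w: Fraction(1, 8) for w in range(8)}
xi = {w: w * w - 3 for w in range(8)}       # arbitrary random variable
D2 = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]       # finer decomposition
D1 = [{0, 1, 2, 3}, {4, 5, 6, 7}]           # coarser: each atom is a union of D2-atoms

def cond_exp(f, decomp):
    """Return E(f | decomposition) as a dict w -> value (atom averages)."""
    out = {}
    for atom in decomp:
        avg = sum(f[v] * P[v] for v in atom) / sum(P[v] for v in atom)
        for v in atom:
            out[v] = avg
    return out

inner = cond_exp(xi, D2)          # E(xi | D2)
lhs = cond_exp(inner, D1)         # E[ E(xi | D2) | D1 ]
rhs = cond_exp(xi, D1)            # E(xi | D1)
assert lhs == rhs                 # property (20)
```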


Example 3. Let us find E(ξ + η | η) for the random variables ξ and η considered in Example 1. By (22) and (23),

$$\mathsf{E}(\xi + \eta \mid \eta) = \mathsf{E}(\xi \mid \eta) + \mathsf{E}(\eta \mid \eta) = \mathsf{E}\xi + \eta = p + \eta.$$

This result can also be obtained by starting from (8):

$$\mathsf{E}(\xi + \eta \mid \eta) = \sum_{k=0}^{2} k\,P(\xi + \eta = k \mid \eta) = p(1 - \eta) + q\eta + 2p\eta = p + \eta.$$

Example 4. Let ξ and η be independent and identically distributed random variables. Then

$$\mathsf{E}(\xi \mid \xi + \eta) = \mathsf{E}(\eta \mid \xi + \eta) = \frac{\xi + \eta}{2}. \tag{26}$$

In fact, if we assume for simplicity that ξ and η take the values 1, 2, …, m, then for 1 ≤ k ≤ m and 2 ≤ l ≤ 2m,

$$P(\xi = k \mid \xi + \eta = l) = \frac{P(\xi = k,\ \eta = l - k)}{P(\xi + \eta = l)} = \frac{P(\xi = k)\,P(\eta = l - k)}{P(\xi + \eta = l)} = \frac{P(\eta = k)\,P(\xi = l - k)}{P(\xi + \eta = l)} = P(\eta = k \mid \xi + \eta = l).$$

This establishes the first equation in (26). To prove the second, it is enough to notice that

$$\mathsf{E}(\xi \mid \xi + \eta) + \mathsf{E}(\eta \mid \xi + \eta) = \mathsf{E}(\xi + \eta \mid \xi + \eta) = \xi + \eta.$$
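Equation (26) can be verified directly for a concrete distribution; here ξ and η are taken (arbitrarily, for illustration only) to be i.i.d. uniform on {1, 2, 3}:

```python
from fractions import Fraction
from itertools import product

# Hypothetical i.i.d. pair xi, eta, uniform on {1, 2, 3}.
vals = (1, 2, 3)
joint = {(x, y): Fraction(1, 9) for x, y in product(vals, repeat=2)}

def cond_exp_given_sum(l):
    """E(xi | xi + eta = l), computed from the joint distribution."""
    num = sum(x * pr for (x, y), pr in joint.items() if x + y == l)
    den = sum(pr for (x, y), pr in joint.items() if x + y == l)
    return num / den

# Equation (26): E(xi | xi + eta = l) = l / 2 for every attainable l.
for l in range(2, 7):
    assert cond_exp_given_sum(l) == Fraction(l, 2)
```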

3. We have already noticed in §1 that to each decomposition 𝒟 = {D₁, …, D_k} of the finite set Ω there corresponds an algebra α(𝒟) of subsets of Ω. The converse is also true: every algebra ℬ of subsets of the finite space Ω generates a decomposition 𝒟 (ℬ = α(𝒟)). Consequently there is a one-to-one correspondence between algebras and decompositions of a finite space Ω. This should be kept in mind in connection with the concept, which will be introduced later, of conditional expectation with respect to the special systems of sets called σ-algebras. For finite spaces, the concepts of algebra and σ-algebra coincide. It will turn out that if ℬ is an algebra, the conditional expectation E(ξ|ℬ) of a random variable ξ with respect to ℬ (to be introduced in §7 of Chapter II) simply coincides with E(ξ|𝒟), the expectation of ξ with respect to the decomposition 𝒟 such that ℬ = α(𝒟). In this sense we can, in dealing with finite spaces in the future, not distinguish between E(ξ|ℬ) and E(ξ|𝒟), understanding in each case that E(ξ|ℬ) is simply defined to be E(ξ|𝒟).


4. PROBLEMS

1. Give an example of random variables ξ and η which are not independent but for which

$$\mathsf{E}(\xi \mid \eta) = \mathsf{E}\xi.$$

(Cf. (22).)

2. The conditional variance of ξ with respect to 𝒟 is the random variable

$$\mathsf{V}(\xi \mid \mathscr{D}) = \mathsf{E}[(\xi - \mathsf{E}(\xi \mid \mathscr{D}))^2 \mid \mathscr{D}].$$

Show that

$$\mathsf{V}\xi = \mathsf{E}\,\mathsf{V}(\xi \mid \mathscr{D}) + \mathsf{V}\,\mathsf{E}(\xi \mid \mathscr{D}).$$

3. Starting from (17), show that for every function f = f(η) the conditional expectation E(ξ|η) has the property

$$\mathsf{E}[f(\eta)\,\mathsf{E}(\xi \mid \eta)] = \mathsf{E}[\xi f(\eta)].$$

4. Let ξ and η be random variables. Show that inf_f E(η − f(ξ))² is attained for f*(ξ) = E(η|ξ). (Consequently, the best estimator for η in terms of ξ, in the mean-square sense, is the conditional expectation E(η|ξ).)

5. Let ξ₁, …, ξ_n, τ be independent random variables, where ξ₁, …, ξ_n are identically distributed and τ takes the values 1, 2, …, n. Show that if S_τ = ξ₁ + ⋯ + ξ_τ is the sum of a random number of the random variables, then

$$\mathsf{E}(S_\tau \mid \tau) = \tau\,\mathsf{E}\xi_1 \quad \text{and} \quad \mathsf{E} S_\tau = \mathsf{E}\tau \cdot \mathsf{E}\xi_1.$$

6. Establish equation (24).

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 1. The value of the limit theorems of §6 for Bernoulli schemes is not just that they provide convenient formulas for calculating probabilities P(Sn = k) and P(A < Sn ::;; B). They have the additional significance of being of a universal nature, i.e. they remain useful not only for independent Bernoulli random variables that have only two values, but also for variables of much more general character. In this sense the Bernoulli scheme appears as the simplest model, on the basis of which we can recognize many probabilistic regularities which are inherent also in much more general models. In this and the next section we shall discuss a number of new probabilistic regularities, some of which are quite surprising. The ones that we discuss are


again based on the Bernoulli scheme, although many results on the nature of random oscillations remain valid for random walks of a more general kind.

2. Consider the Bernoulli scheme (Ω, 𝒜, P), where Ω = {ω: ω = (x₁, …, x_n), xᵢ = ±1}, 𝒜 consists of all subsets of Ω, and p(ω) = p^{ν(ω)} q^{n−ν(ω)}, with ν(ω) = (Σ xᵢ + n)/2. Let ξᵢ(ω) = xᵢ, i = 1, …, n. Then, as we know, the sequence ξ₁, …, ξ_n is a sequence of independent Bernoulli random variables with

$$P(\xi_i = 1) = p, \qquad P(\xi_i = -1) = q, \qquad p + q = 1.$$

Let us put S₀ = 0, S_k = ξ₁ + ⋯ + ξ_k, 1 ≤ k ≤ n. The sequence S₀, S₁, …, S_n can be considered as the path of the random motion of a particle starting at zero. Here S_{k+1} = S_k + ξ_{k+1}, i.e. if the particle has reached the point S_k at time k, then at time k + 1 it is displaced either one unit up (with probability p) or one unit down (with probability q).

Let A and B be integers, A ≤ 0 ≤ B. An interesting problem about this random walk is to find the probability that after n steps the moving particle has left the interval (A, B). It is also of interest to ask with what probability the particle leaves (A, B) at A or at B.

That these are natural questions becomes particularly clear if we interpret them in terms of a gambling game. Consider two players (first and second) who start with respective bankrolls (−A) and B. If ξᵢ = +1, we suppose that the second player pays one unit to the first; if ξᵢ = −1, the first pays the second. Then S_k = ξ₁ + ⋯ + ξ_k can be interpreted as the amount won by the first player from the second (if S_k < 0, this is actually the amount lost by the first player to the second) after k turns.

At the instant k ≤ n at which for the first time S_k = B (S_k = A), the bankroll of the second (first) player is reduced to zero; in other words, that player is ruined. (If k < n, we suppose that the game ends at time k, although the random walk itself is well defined up to time n, inclusive.)

Before we turn to a precise formulation, let us introduce some notation. Let x be an integer in the interval [A, B], and for 0 ≤ k ≤ n let $S_k^x = x + S_k$ and

$$\tau_k^x = \min\{0 \le l \le k\colon S_l^x = A \text{ or } B\}, \tag{1}$$

where we agree to take τ_k^x = k if A < S_l^x < B for all 0 ≤ l ≤ k. For each k in 0 ≤ k ≤ n and x ∈ [A, B], the instant τ_k^x, called a stopping time (see §11), is an integer-valued random variable defined on the sample space Ω (the dependence of τ_k^x on n is not explicitly indicated).

It is clear that for all l < k the set {ω: τ_k^x = l} is the event that the random walk {S_i^x, 0 ≤ i ≤ k}, starting at time zero at the point x, leaves the interval (A, B) at time l. It is also clear that when l ≤ k the sets {ω: τ_k^x = l, S_l^x = A} and {ω: τ_k^x = l, S_l^x = B} represent the events that the wandering particle leaves the interval (A, B) at time l through A or B, respectively.

For 0 ≤ k ≤ n, we write

$$\mathscr{A}_k^x = \sum_{0 \le l \le k} \{\omega\colon \tau_k^x = l,\ S_l^x = A\}, \qquad \mathscr{B}_k^x = \sum_{0 \le l \le k} \{\omega\colon \tau_k^x = l,\ S_l^x = B\}, \tag{2}$$

and let

$$\alpha_k(x) = P(\mathscr{A}_k^x), \qquad \beta_k(x) = P(\mathscr{B}_k^x)$$

be the probabilities that the particle leaves (A, B) through A or B, respectively, during the time interval [0, k]. For these probabilities we can find recurrent relations from which we can successively determine α₁(x), …, α_n(x) and β₁(x), …, β_n(x).

Let, then, A < x < B. By the formula for total probability,

$$\beta_k(x) = P(\mathscr{B}_k^x) = P(\mathscr{B}_k^x \mid S_1^x = x + 1)\,P(\xi_1 = 1) + P(\mathscr{B}_k^x \mid S_1^x = x - 1)\,P(\xi_1 = -1) = p\,P(\mathscr{B}_k^x \mid S_1^x = x + 1) + q\,P(\mathscr{B}_k^x \mid S_1^x = x - 1). \tag{3}$$

We now show that

$$P(\mathscr{B}_k^x \mid S_1^x = x + 1) = P(\mathscr{B}_{k-1}^{x+1}), \qquad P(\mathscr{B}_k^x \mid S_1^x = x - 1) = P(\mathscr{B}_{k-1}^{x-1}).$$

To do this, we notice that ℬ_k^x can be represented in the form

$$\mathscr{B}_k^x = \{\omega\colon (x,\ x + \xi_1,\ \ldots,\ x + \xi_1 + \cdots + \xi_k) \in B_k^x\},$$

where B_k^x is the set of paths of the form (x, x + x₁, …, x + x₁ + ⋯ + x_k) with xᵢ = ±1 which during the time [0, k] first leave (A, B) at B (Figure 15).

Figure 15. Example of a path from the set B_k^x.
84

I. Elementary Probability Theory

We represent B~ in the form B~,x+ 1 + B~,x- 1 , where B~·x+ 1 and BV- 1 are the paths in B~ for which x 1 = + 1 or x 1 = -1, respectively. Notice that the paths (x, x + 1, x + 1 + x 2 , ••• , x + 1 + x 2 + · · · + xk) in B~·x+ 1 are in one-to-one correspondence with the paths (x

+ 1, x + 1 + x 2 , ••• , x + 1 + x 2, ••• , x + 1 + X 2 + · · · + xk)

in B~~ ~. The same is true for the paths in B~·x- 1• Using these facts, together with independence, the identical distribution of ~ 1 , ... , ~b and (8.6), we obtain P(Bi~ISf = X

+ 1)

= P(Bi~l~ 1 = 1)

= P{(x,x + ~t>····x + ~ 1 + ··· + ~k)eB~I~t = 1} + 1, x + 1 + ~ 2 , ••• , x + 1 + ~2 + · · · + ~k)eB~~D P{(x + 1,x + 1 + ~1, .•. ,x + 1 + ~1 + ··· + ~k-t)eB~~n

= P{(x =

= P(Bi~~D-

In the same way, P(Bi~ISf

=X-

1) = P(Bi~=D·

Consequently, by (3) with x e (A, B) and k Pk(x) = PPk-t(x

~

n,

+ 1) + qflk-t(x-

1),

(4)

1:::;;, n.

(5)

where

{J1(B)

=

1,

{J1(A)

= 0,

0

~

Similarly (6)

with

1X1(A) = 1,

1X1(B) =

0,

O~l~n.

Since tX 0 (x) = {J 0 (x) = 0, x e (A, B), these recurrent relations can (at least in principle) be solved for the probabilities

1X1(x), ... , IX"(x)

and

{J 1(x), ... , fln(x).

Putting aside any explicit calculation of the probabilities, we ask for their values for large n. For this purpose we notice that since Bi~ _ 1 c: Bi~, k ~ n, we have Pk- 1(x) ~ {Jk(x) ~ 1. It is therefore natural to expect (and this is actually the case; see Subsection 3) that for sufficiently large n the probability fln(x) will be close to the solution {J(x) of the equation {J(x) = p{J(x

+

1)

+ q{J(x -

1)

(7)

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

85

with the boundary conditions

f3(B) = 1,

f3(A) = 0,

(8)

that result from a formal approach to the limit in (4) and (5). To solve the problem in (7) and (8), we first suppose that p =1= q. We see easily that the equation has the two particular solutions a and b(qjpy, where a and b are constants. Hence we look for a solution of the form f3(x)

=

a

+ b(qjpy.

(9)

Taking account of (8), we find that for A ::;; x ::;; B f3(x)

= (q/pY - (q/p)A.

(10)

(qjp)B _ (qjp)A

Let us show that this is the only solution of our problem. It is enough to show that all solutions of the problem in (7) and (8) admit the representation (9). Let P(x) be a solution with P(A) = 0, P(B) = 1. We can always find constants aand b such that

a + b(qjp)A =

b(A),

a+

b(q/p)A+l = P(A

a+

b(q/p)A+ 2

+ 1).

Then it follows from (7) that P(A

+ 2)

=

and generally P(x) =

a + b(qjpy.

Consequently the solution (10) is the only solution of our problem. A similar discussion shows that the only solution of a(x)

= pa(x + 1) + qa(x - 1),

XE(A, B)

(11)

with the boundary conditions a(A) = 1,

a(B)

=0

(12)

A::;;x::;;B.

(13)

is given by the formula (p/q)B - (qjpy a(x) = (p/q)B _ (pjp)A'

If p = q = respectively

t, the only solutions f3(x) and a(x) of (7), (8) and (11 ), ( 12) are f3(x)

x-A

= B- A

(14)

and

B-x

a(x) = B- A.

(15)

86

I. Elementary Probability Theory

We note that

a(x)

+ fJ(x) =

1

(16)

for 0 s; p s; 1. We call a(x) and {J(x) the probabilities of ruin for the first and second players, respectively (when the first player's bankroll is x - A, and the second player's is B - x) under the assumption of infinitely many turns, which of course presupposes an infinite sequence of independent Bernoulli random variables ~b ~ 2 , ••. , where~;= + 1 is treated as a gain for the first player, and ~; = -1 as a loss. The probability space (Q, d, P) considered at the beginning of this section turns out to be too small to allow such an infinite sequence of independent variables. We shall see later that such a sequence can actually be constructed and that {J(x) and a(x) are in fact the probabilities of ruin in an unbounded number of steps. We now take up some corollaries of the preceding formulas. If we take A = 0, 0 s; x s; B, then the definition of {J(x) implies that this is the probability that a particle starting at x arrives at B before it reaches 0. It follows from (10) and (14) (Figure 16) that

fJ(x) =

xjB, p = q { (qjpy - 1

= !,

(qjp)B - 1' p =f- q.

(17)

Now let q > p, which means that the game is unfavorable for the first player, whose limiting probability of being ruined, namely a = a(O), is given by

(qjp)B - 1 (q/pl - (qjp)A .

G(=-~----c

Next suppose that the rules of the game are changed: the original bankrolls of the players are still (-A) and B, but the payoff for each player is now !, {J(x)

X

Figure 16. Graph of fl(x), the probability that a particle starting from x reaches B before reaching 0.

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

87

rather than 1 as before. In other words, now let P(ei = !) = p, P(ei = -!) = q. In this case let us denote the limiting probability of ruin for the first player by a. 112 • Then (qjp)2B _ 1 r:J.l/2 = (qjp)2B _ (qjp)2A' and therefore =

r:J.l/2

r:J.'

(qjp)B + 1 (qjp)B + (qjp)A > r:t.,

if q > p. Hence we can draw the following conclusion:

if the game is unfavorable to the .first player (i.e., q > p) then doubling the stake decreases the probability of ruin. 3. We now turn to the question of how fast r:t.n(x) and Pn(x) approach their limiting values a.(x) and P(x). Let us suppose for simplicity that x = 0 and put

r:l.n = r:t.n(O), It is clear that

Yn = P{A < Sk < B, 0 where {A < Sk < B, 0

k

~

~

~

k

~

n},

n} denotes the event

n

{A< Sk < B}.

O!>k!>n

Let n = rm, where r and m are integers and

+ ... + em, em+ 1 + ... + e2m•

el = el

e2

=

e, =

em(r-1)+ 1

+ ... + erm•

Then if C = IA I + B, it is easy to see that {A

< Sk < B, 1 ~ k

and therefore, since ( 1 ,

••• ,

~

rm} £ {1( 1 1 < C, ... , I(, I < C},

(,are independent and identically distributed,

Yn ~ P{ let I < C, ... , I(, I < C} = We notice that ciently large m,

Ve

1

r

0 P{l(d < C} =

= m[1 - (p- q) 2 ]. Hence, for 0

P{l( 1 1 < c}:::; where 8 1 < 1, since

(P{I(tl < C})'.

ve 1

(18)

i=l

81,

~ C 2 if P{IC 1 1 ~ C}

= 1.

< p < 1 and suffi(19)

88

I. Elementary Probability Theory

If p = 0 or p = 1, then P{ i(d < C} = 0 for sufficiently large m, and consequently (19) is satisfied for 0 ::::;; p ::::;; 1. It follows from (18) and (19) that for sufficiently large n

Yn::::;; e", where e = e~fm < 1. According to (16), oc

+ f3

(20)

= 1. Therefore

(oc - OCn)

+ ({3 - f3n)

=

Yn,

and since oc ~ ocn, f3 ~ f3n, we have 0 ::::;; oc - ocn ::::;; Yn ::::;; e",

0 ::::;; f3 - /3n ::::;; Yn ::::;; e",

e < 1.

There are similar inequalities for the differences oc(x) - ocn(x) and f3(x)- /3n(x). 4. We now consider the question of the mean duration of the random walk. Let mk(x) = E-r: be the expectation of the stopping time -r:,k::::;; n. Proceeding as in the derivation of the recurrent relations for f3x(x), we find that, for x E (A, B), mk(x)

= E-r: = = =

I

1 :s;l:s;k

I

1 :s;l:s;k

lP(-r: = l)

l· [pP(-rk = ~~~1 = 1)

I [. [pP(-r:~I

1 :s;l :s;k

I

(l

O:s;l:s;k-1

= pmk- 1(x

+

I

+

+

1)[pP(-r:~f = l)

1)

11~1

= -1)]

= l - 1) + qP(-r:=I = l - 1)]

+ qmk- 1(x-

[pP(-r:~f = l)

O:s;l:s;k-1

= pmk-1(x

+ qP(-rk =

+ qP(-r:=:f

1)

+ qP(-r:=:f

+ 1) + qmk- 1(x-

= l)]

1)

= l)]

+ 1.

Thus,forx E (A, B)andO::::;; k::::;; n, thefunctionsmk(x)satisfytherecurrent relations (21)

with m0 (x) = 0. From these equations together with the boundary conditions mk(A)

=

mk(B)

= 0,

we can successively find m 1(x), ... , mn(x). Since mk(x) ::::;; mk+ 1(x), the limit m(x) = lim mix) n-+oo

(22)

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

89

exists, and by (21) it satisfies the equation m(x)

+ 1 + pm(x + 1) + qm(x

- 1)

(23)

with the boundary conditions m(A) = m(B) = 0.

(24)

To solve this equation, we first suppose that m(x) < oo,

x

E

(A, B).

(25)

Then if p =I= q there is a particular solution of the form a.j(q - p) and the general solution (see (9)) can be written in the form m(x) = -Xq-p

+ a + b (q)x - . p

Then by using the boundary conditions m(A) = m(B) = 0 we find that 1 m(x) = - - (B{J(x) p-q

+ Aa.(x) -

x],

where fJ(x) and a.(x) are defined by (10) and (13). If p = q = solution of (23) has the form m(x)

=

a

+ bx -

(26)

!, the general

x 2,

and since m(A) = m(B) = 0 we have m(x) = (B - x)(x - A).

(27)

It follows, in particular, that if the players start with equal bankrolls (B = -A), then m(O) = B 2 •

If we take B = 10, and suppose that each turn takes a second, then the (limiting) time to the ruin of one player is rather long: 100 seconds. We obtained (26) and (27) under the assumption that m(x) < oo, x E (A, B). Let us now show that in fact m(x) is finite for all x E (A, B). We consider only the case x = 0; the general case can be analyzed similarly. Let p = q =!.We introduce the random variableS<" defined in terms of the sequence S0 , Sl> ... , Sn and the stopping time •n = .-~by the equation n

s,n =

L Skl{tn=k)(w). k=O

(28)

The descriptive meaning of S
•n·

•n

90

I. Elementary Probability Theory

Let us show that when p = q =

t.

ES,n = 0,

(29)

Es;n =Ern.

(30)

To establish the first equation we notice that n

ES,n =

L E[SkJ{tn=k}(w)]

k=O n

=

n

L E[SnJ{tn=k}(w)] + L E[(Sk- Sn)Jitn=k}(w)]

k=O

= ESn

k=O

+

n

L E[(Sk- Sn)II
k=O

(31)

where we evidently have ESn = 0. Let us show that n

L E[(Sk- Sn)Jitn=k}(w)] = 0.

k=O

To do this, we notice that {tn > k} = {A < S 1 < B, ... , A < Sk < B} when 0:::;; k < n. The event {A< S 1 < B, ... , A< Sk < B} can evidently be written in the form (32) where Ak is a subset of { -1, + 1}k. In other words, this set is determined by just the values of e1, ... , ek and does not depend on ek+t• ... , en. Since

{tn = k} = {tn > k- 1}\{tn > k}, this is also a set of the form (32). It then follows from the independence of 1 , •.. , en and from Problem 9 of §4 that the random variables Sn - Sk and I l
e

E[(Sn- Sk)II
ES;" =

n

n

k=O

k=O

L ESfJ{tn=k} = L E([Sn + (Sk- Sn)] 2I{tn=k}) n

=

L [ES;II
k=O

n

= n-

L (n -

k=O

n

k)P(tn = k) =

L kP(tn = k) = Et".

k=O

§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing

(p

Thus we have (29) and (30) when p = q = + q = 1) it can be shown similarly that

= (p

ES,n

=

E[S,"- r. · E~ 1 ] 2

t.

91

For general p and q (33)

- q) ·Ern,

V~ 1 ·Ern,

(34)

where E~ 1 = p- q, V~ 1 = 1 - (p- q) 2 • With the aid of the results obtained so far we can now show that limn-+ 00 mn(O) = m(O) < 00. If p = q = !, then by (30) (35)

If p =F q, then by (33), E

r.:::;;

max( lA I, B)

(36)

Ip-q I '

from which it is clear that m(O) < oo. We also notice that when p = q =!

Ern=

es;n = A 2 • (X.+ B 2 • fJ. + E[S;I{A<Sn
and therefore

It follows from this and (20) that as n _... oo, Er. converges with exponential rapidity to

m(O) = A 2 lX

B

A

B-A

B-A

+ B 2 {J = A 2 · - - - B 2 · - - =

IABI.

There is a similar result when p =F q: Er. _... m(0) -_ ocA

+ fJB ,

p-q

exponentially fast.

5. PROBLEMS 1. Establish the following generalizations of (33) and (34):

Es:. = x

+ (p

- q)Et~,

E[Stlf- •: · E~tJ 2 = V~ 1 . Et:. 2. Investigate the limits of oc(x), P(x), and m(x) when the level A ! 3. Let p = q =

- oo. tin the Bernoulli scheme. What is the order of EIS.l for large n?

92

I. Elementary Probability Theory

4. Two players each toss their own symmetric coins, independently. Show that the probability that each has the same number of heads after n tosses is r 2" D=o (C=) 2 • Hence deduce the equation D=o (C=>2 = Cin· Let u. be the first time when the number of heads for the first player coincides with the number of heads for the second player (if this happens within n tosses; un = n + 1 if there is no such time). Find Emin(u., n).

§10. Random Walk. II. Reflection Principle. Arcsine Law 1. As in the preceding section, we suppose that ~ 1 , ~ 2 , ••• , ~ 2 " is a sequence of independent identically distributed Bernoulli random variables with

= 1) = p, + · · · + ~k•

P(~j

sk =

~1

1~ k

~

2n;

We define a2n

= min{1

:::; k :::; 2n:

sk = 0},

putting G'2n = 00 if sk =1: 0 for 1 ~ k :::; 2n. The descriptive meaning of a 2 n is clear: it is the time of first return to zero. Properties of this time are studied in the present section, where we assume that the random walk is symmetric, i.e. p = q = !. For 0 :::; k :::; n we write u2k

= P(S 2 k = 0),

f2k

= P(a2n = 2k).

(1)

It is clear that u0 = 1 and Our immediate aim is to show that for 1 :::; k :::; n the probability f 2 k is given by (2)

It is clear that

{a 2 n = 2k} = {S 1 =1: 0, S 2 =1: 0, ... , S 2 k- 1 =1: 0, S 2 k = 0} for 1 :::; k :::; n, and by symmetry !2k

o, ... , s2k-1 =1: o, s2k = O} 2P{s1 > o, ... , s2k-1 > o, s2k = O}.

= P{S1 =

=1:

(3)

93

§10. Random Walk. II. Reflection Principle. Arcsine Law

A sequence (S 0 , ••• , Sk) is called a path of length k; we denote by Lk(A) the number of paths of length k having some specified property A. Then f2k

=

L2n<s1 >

2

o, ... , s2k-1 > o, s2k = o,

and S2k+1 = a2k+1• ... ,S2n = a2k+1

= 2L2k(s1 >

+ ··· + a2n)·2- 2"

o, ... , s2k-1 > o, s2k = O) · 2- 2\

(4)

where the summation is over all sets (a 2k+ 1, ... , a2n) with a; = ± 1. Consequently the determination of the probability f 2 k reduces to calculating the number of paths L 2k(S 1 > 0, ... , S 2k- 1 > 0, S 2k = 0).

Lemma 1. Let a and b be nonnegative integers, a - b > 0 and k = a Then a-b Lk(S1>0, ... ,Sk-1>0,Sk=a-b)=-k-Ck·

+ b. (5)

PRooF. In fact,

Lk(S 1 > 0, ... , Sk- 1 > 0, Sk =a- b) = Lk(S 1 = 1, S2 > 0, ... , Sk- 1 > 0, Sk =a- b) = Lk(S 1 = 1, Sk =a- b) - Lk(S 1 = 1, Sk =a- b; and 3 i, 2 :::;; i :::;; k - 1, such that S; :::;; 0).

(6)

In other words, the number of positive paths (S 1, S2, ... , Sk) that originate at (1, 1) and terminate at (k, a- b) is the same as the total number of paths from (1, 1) to (k, a - b) after excluding the paths that touch or intersect the time axis.* We now notice that

Lk(S 1 = 1, Sk =a- b; 3 i, 2 :::;; i:::;; k- 1, such that S;:::;; 0) =

Lk(S 1 = -1, Sk =a- b),

(7)

i.e. the number of paths from IX = (1, 1) to f3 = (k, a - b), neither touching nor intersecting the time axis, is equal to the total number of paths that connect IX* = (1, -1) with {3. The proof of this statement, known as the reflection principle, follows from the easily established one-to-one correspondence between the paths A= (S 1, ... , Sa, Sa+ 1, ... , Sk) joining IX and {3, and paths B = ( -S 1, ... , -Sa, Sa+~> ... , Sk)joining IX* and f3 (Figure 17); a is the first point where A and B reach zero.

* A path (S 1 , ••• , Sk) is called positive (or nonnegative) if all S1 > 0 (S1 ~ 0); a path is said to touch the time axis if S1 ~ 0 or else S1 :::;; 0, for I s j s k, and there is an i, I :::;; i :::;; k, such

that S 1 = 0; and a path is said to intersect the time axis if there are two times i and j such that 0 and sj < 0.

s, >

94

I. Elementary Probability Theory

fJ

Figure 17. The reflection principle.

From (6) and (7) we find

Lk(S 1 > 0, ... , Sk- 1 > 0, Sk =a- b)

= Lk(S 1 = 1,Sk =a- b)- Lk(S 1 = a-1

a

= ck-1- ck-1 =

-1,Sk =a- b)

a-bca -k- k,

which establishes (5). Turning to the calculation ofj2 k, we find that by (4) and (5) (with a = k, b = k- 1), f2k

o, ... , s2k-1 > o,

= o) · 2- 2k

=

2L2k(sl >

=

2Lzk-1(S1 > O, ... ,Szk- 1 = 1)·2- 2k

s2k

t 1 ck 1 2k-1 = 2k Uz(k-1)· = 2 . rzk . 2k-

Hence (2) is established. We present an alternative proof of this formula, based on the following observation. A straightforward verification shows that

1 2k Uz(k-1) = u2(k-1) -

(8)

Uzk>

At the same time, it is clear that

= 2k} = {a2n > 2(k- 1)}\{a2n > {a2n > 2!} = {S1 =I= 0, ... , S 21 =!= 0}

{azn

2k},

and therefore {a2n = 2k} = {S 1 =I= 0, ... , S2(k- 1) =I= 0}\{S 1 =I= 0, ... , S 2k =I= 0}.

Hence fzk

=

P{S1 =I= 0, ... , Sz(k-1) =I= 0} - P{S1 =I= 0, ... , S2k =I= 0},

95

§10. Random Walk. II. Reflection Principle. Arcsine Law

a

Figure 18

and consequently, because of (8), in order to show that it is enough to show only that

L2kCS1 ¥ 0, ... , S2k ¥ 0)

f 2k = (1/2k)u 2
= L2kCS2k = 0).

(9)

For this purpose we notice that evidently

L2k(S1 ¥ 0, ... , S2k ¥ 0) = 2L2k(St > 0, ... , S2k > 0). Hence to verify (9) we need only establish that

2L2k(St > 0, ... , S2k > 0) = L2k(S1

~

0, ... , S2k ~ 0)

(10)

and

(11) Now (10) will be established if we show that we can establish a one-to-one correspondence between the paths A = (S~> ... , S 2 k) for which at least one S; = 0, and the positive paths B = (S ~> ... , S 2k). Let A = (S ~> ... , S2k) be a nonnegative path for which the first zero occurs at the point a (i.e., Sa = 0). Let us construct the path, starting at (a, 2), (Sa + 2, Sa+ 1 + 2, ... , S 2 k + 2) (indicated by the broken lines in Figure 18). Then the path B = (S 1, ..• , Sa_ 1 , Sa+ 2, ... , S 2 k + 2) is positive. Conversely, let B = (S 1, ..• , S 2k) be a positive path and b the last instant at which Sb = 1 (Figure 19). Then the path

A = (S 1' . . . , Sb, Sb+ 1

-

b

Figure 19

2, ... , Sk - 2)

96

I. Elementary Probability Theory

2k

-m Figure 20

is nonnegative. It follows from these constructions that there is a one-to-one correspondence between the positive paths and the nonnegative paths with at least one Si = 0. Therefore formula (10) is established. We now establish (11). From symmetry and (10) it is enough to show that L2k(S1 > 0, ... , S 2k > 0) + L 2k(S 1 :2:: 0, ... , S 2k :2:: 0 and 3 i, 1 ~ i ~ 2k, such that S; = 0) = L 2 k(S 2 k = 0).

The set of paths (S 2 k = 0) can be represented as the sum of the two sets "t' 1 and "t'2 , where "t' 1 contains the paths (S 0 , ••• , S 2k) that have just one minimum, and "t' 2 contains those for which the minimum is attained at at least two points. Let C 1 E "t' 1 (Figure 20) and let y be the minimum point. We put the path C 1 = (S 0 , St. ... , S 2 k) in correspondence with the path Ct obtained in the following way (Figure 21). We reflect (S 0 , S 1, ••• , S1 ) around the vertical line through the point l, and displace the resulting path to the right and upward, thus releasing it from the point (2k, 0). Then we move the origin to the point (l, -m). The resulting path Ct will be positive. In the same way, if C 2 e "t' 2 we can use the same device to put it into correspondence with a nonnegative path C~.

Figure 21


§10. Random Walk. II. Reflection Principle. Arcsine Law

Conversely, let C_1* = (S_1 > 0, ..., S_{2k} > 0) be a positive path with S_{2k} = 2m (see Figure 21). We make it correspond to the path C_1 that is obtained in the following way. Let p be the last point at which S_p = m. Reflect (S_p, ..., S_{2k}) with respect to the vertical line x = p and displace the resulting path downward and to the left until its right-hand end coincides with the point (0, 0). Then we move the origin to the left-hand end of the resulting path (this is just the path drawn in Figure 20). The resulting path C_1 = (S_0, ..., S_{2k}) has just one minimum, and S_{2k} = 0. A similar construction applied to paths (S_1 ≥ 0, ..., S_{2k} ≥ 0 and ∃ i, 1 ≤ i ≤ 2k, with S_i = 0) leads to paths for which there are at least two minima and S_{2k} = 0. Hence we have established a one-to-one correspondence, which establishes (11).

Therefore we have established (9) and consequently also the formula

    f_{2k} = u_{2(k-1)} - u_{2k} = (1/2k) u_{2(k-1)}.

By Stirling's formula,

    u_{2k} = 2^{-2k} C_{2k}^k ∼ 1/√(πk),  k → ∞.

Therefore

    f_{2k} ∼ 1/(2k √(πk)),  k → ∞.

Hence it follows that the expectation of the first time when zero is reached, namely

    E min(σ_{2n}, 2n) = Σ_{k=1}^n 2k P(σ_{2n} = 2k) + 2n u_{2n} = Σ_{k=1}^n u_{2(k-1)} + 2n u_{2n},

can be arbitrarily large. In addition, Σ_{k=1}^∞ u_{2(k-1)} = ∞, and consequently the limiting value of the mean time for the walk to reach zero (in an unbounded number of steps) is ∞.

This property accounts for many of the unexpected properties of the symmetric random walk that we have been discussing. For example, it would be natural to suppose that after time 2n the number of zero net scores in a game between two equally matched players (p = q = 1/2), i.e. the number of instants i at which S_i = 0, would be proportional to 2n. However, in fact the number of zeros has order √(2n) (see [F1]). Hence it follows, in particular, that, contrary to intuition, the "typical" walk (S_0, S_1, ..., S_n) does not have a sinusoidal character (so that roughly half the time the particle would be on the positive side and half the time on the negative side), but instead must resemble a stretched-out wave. The precise formulation of this statement is given by the arcsine law, which we proceed to investigate.
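The relations f_{2k} = u_{2(k-1)} - u_{2k} = (1/2k) u_{2(k-1)} and the Stirling asymptotics u_{2k} ∼ 1/√(πk) are easy to check numerically. The following Python sketch (the function names are ours, chosen for illustration) computes both expressions exactly:

```python
from math import comb, pi, sqrt

def u(two_k):
    """u_{2k} = P(S_{2k} = 0) = C(2k, k) * 2^(-2k)."""
    k = two_k // 2
    return comb(2 * k, k) / 4 ** k

def f(two_k):
    """f_{2k} = P(first return to zero occurs at time 2k)."""
    k = two_k // 2
    return u(2 * (k - 1)) - u(2 * k)

for k in (1, 5, 50):
    # the two expressions for f_{2k} agree
    assert abs(f(2 * k) - u(2 * (k - 1)) / (2 * k)) < 1e-12

# Stirling: u_{2k} is already close to 1/sqrt(pi*k) for moderate k
print(u(100), 1 / sqrt(pi * 50))
```

The ratio of the two printed numbers tends to 1 as k grows, in agreement with the Stirling estimate.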


2. Let P_{2k,2n} be the probability that during the interval [0, 2n] the particle spends 2k units of time on the positive side.*

Lemma 2. Let u_0 = 1 and 0 ≤ k ≤ n. Then

    P_{2k,2n} = u_{2k} · u_{2(n-k)}.  (12)

PROOF. It was shown above that f_{2k} = u_{2(k-1)} - u_{2k}. Let us show that

    u_{2k} = Σ_{r=1}^k f_{2r} · u_{2(k-r)}.  (13)

Since {S_{2k} = 0} ⊆ {σ_{2n} ≤ 2k}, we have

    {S_{2k} = 0} = {S_{2k} = 0} ∩ {σ_{2n} ≤ 2k} = Σ_{1≤l≤k} {S_{2k} = 0} ∩ {σ_{2n} = 2l}.

Consequently

    u_{2k} = P(S_{2k} = 0) = Σ_{1≤l≤k} P(S_{2k} = 0, σ_{2n} = 2l)
           = Σ_{1≤l≤k} P(S_{2k} = 0 | σ_{2n} = 2l) P(σ_{2n} = 2l).

But

    P(S_{2k} = 0 | σ_{2n} = 2l)
      = P(S_{2k} = 0 | S_1 ≠ 0, ..., S_{2l-1} ≠ 0, S_{2l} = 0)
      = P(S_{2l} + (ξ_{2l+1} + ... + ξ_{2k}) = 0 | S_1 ≠ 0, ..., S_{2l-1} ≠ 0, S_{2l} = 0)
      = P(S_{2l} + (ξ_{2l+1} + ... + ξ_{2k}) = 0 | S_{2l} = 0)
      = P(ξ_{2l+1} + ... + ξ_{2k} = 0) = P(S_{2(k-l)} = 0).

Therefore

    u_{2k} = Σ_{1≤l≤k} P(S_{2(k-l)} = 0) P(σ_{2n} = 2l),

which establishes (13).

We turn now to the proof of (12). It is obviously true for k = 0 and k = n. Now let 1 ≤ k ≤ n - 1. If the particle is on the positive side for exactly 2k instants, it must pass through zero. Let 2r be the time of first passage through zero. There are two possibilities: either S_i ≥ 0 for all i ≤ 2r, or S_i ≤ 0 for all i ≤ 2r.

The number of paths of the first kind is easily seen to be

    (1/2) · 2^{2r} f_{2r} · 2^{2(n-r)} P_{2(k-r),2(n-r)} = (1/2) · 2^{2n} f_{2r} P_{2(k-r),2(n-r)}.

* We say that the particle is on the positive side in the interval [m - 1, m] if one, at least, of the values S_{m-1} and S_m is positive.


The corresponding number of paths of the second kind is

    (1/2) · 2^{2n} f_{2r} P_{2k,2(n-r)}.

Consequently, for 1 ≤ k ≤ n - 1,

    P_{2k,2n} = (1/2) Σ_{r=1}^k f_{2r} · P_{2(k-r),2(n-r)} + (1/2) Σ_{r=1}^{n-k} f_{2r} · P_{2k,2(n-r)}.  (14)

Let us suppose that P_{2k,2m} = u_{2k} · u_{2m-2k} holds for m = 1, ..., n - 1. Then we find from (13) and (14) that

    P_{2k,2n} = (1/2) u_{2n-2k} Σ_{r=1}^k f_{2r} · u_{2k-2r} + (1/2) u_{2k} Σ_{r=1}^{n-k} f_{2r} · u_{2n-2r-2k}
              = (1/2) u_{2n-2k} · u_{2k} + (1/2) u_{2k} · u_{2n-2k} = u_{2k} · u_{2(n-k)}.

This completes the proof of the lemma.

Now let γ(2n) be the number of time units that the particle spends on the positive axis in the interval [0, 2n]. Then, when x < 1,

    P{1/2 < γ(2n)/2n ≤ x} = Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n}.
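Before passing to the limit, the identity of Lemma 2 can be confirmed by brute-force enumeration of all 2^{2n} paths for a small n. The sketch below (our own sanity check, not part of the proof) uses the footnote's convention for "time on the positive side":

```python
from itertools import product
from math import comb

def u(two_k):
    """u_{2k} = C(2k, k) * 2^(-2k)."""
    k = two_k // 2
    return comb(2 * k, k) / 4 ** k

n = 4  # paths of length 2n = 8
count = {k: 0 for k in range(n + 1)}
for steps in product((-1, 1), repeat=2 * n):
    s, pos = 0, 0
    for x in steps:
        prev, s = s, s + x
        # the segment [m-1, m] is "positive" if at least one endpoint is > 0
        if prev > 0 or s > 0:
            pos += 1
    assert pos % 2 == 0  # time on the positive side is always even
    count[pos // 2] += 1

for k in range(n + 1):
    # empirical P_{2k,2n} equals u_{2k} * u_{2(n-k)}
    assert abs(count[k] / 4 ** n - u(2 * k) * u(2 * (n - k))) < 1e-12
print([count[k] / 4 ** n for k in range(n + 1)])
```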

Since

    u_{2k} ∼ 1/√(πk)

as k → ∞, we have

    P_{2k,2n} = u_{2k} · u_{2(n-k)} ∼ 1/(π √(k(n - k)))

as k → ∞ and n - k → ∞. Therefore

    Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n} - Σ_{{k: 1/2 < 2k/2n ≤ x}} (1/πn) [ (k/n)(1 - k/n) ]^{-1/2} → 0,

whence

    Σ_{{k: 1/2 < 2k/2n ≤ x}} P_{2k,2n} - (1/π) ∫_{1/2}^x dt/√(t(1 - t)) → 0,  n → ∞.

But, by symmetry,

    Σ_{{k: k/n ≤ 1/2}} P_{2k,2n} → 1/2,  n → ∞,


and

    (1/π) ∫_{1/2}^x dt/√(t(1 - t)) = (2/π) arcsin √x - 1/2.

Consequently we have proved the following theorem.

Theorem (Arcsine Law). The probability that the fraction of the time spent by the particle on the positive side is at most x tends to 2π^{-1} arcsin √x:

    Σ_{{k: k/n ≤ x}} P_{2k,2n} → 2π^{-1} arcsin √x.  (15)

We remark that the integrand p(t) in the integral

    (1/π) ∫_0^x dt/√(t(1 - t))

represents a U-shaped curve that tends to infinity as t → 0 or 1. Hence it follows that, for large n,

    P{0 < γ(2n)/2n < Δ} > P{1/2 < γ(2n)/2n < 1/2 + Δ},

i.e., it is more likely that the fraction of the time spent by the particle on the positive side is close to zero or one than to the intuitive value 1/2.

Using a table of arcsines and noting that the convergence in (15) is indeed quite rapid, we find that

    P{γ(2n)/2n ≤ 0.024} ≈ 0.1,
    P{γ(2n)/2n ≤ 0.1}  ≈ 0.2,
    P{γ(2n)/2n ≤ 0.2}  ≈ 0.3,
    P{γ(2n)/2n ≤ 0.65} ≈ 0.6.

Hence if, say, n = 1000, then in about one case in ten the particle spends only 24 units of time on the positive axis, and therefore spends the greatest amount of time, 976 units, on the negative axis.

3. PROBLEMS

1. How fast does E min(σ_{2n}, 2n) → ∞ as n → ∞?

2. Let τ_n = min{1 ≤ k ≤ n: S_k = 1}, where we take τ_n = ∞ if S_k < 1 for 1 ≤ k ≤ n. What is the limit of E min(τ_n, n) as n → ∞ for symmetric (p = q = 1/2) and for unsymmetric (p ≠ q) walks?
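The limit (15) is easy to observe in simulation. The following Python sketch (walk length, trial count, and x are arbitrary choices made for illustration) estimates P{γ(2n)/2n ≤ x} by Monte Carlo and compares it with 2π^{-1} arcsin √x:

```python
import random
from math import asin, pi, sqrt

random.seed(1)
n, trials = 200, 5000
x = 0.2
hits = 0
for _ in range(trials):
    s, pos = 0, 0
    for _ in range(2 * n):
        prev = s
        s += random.choice((-1, 1))
        # count the segment as "positive" if either endpoint is > 0
        if prev > 0 or s > 0:
            pos += 1
    if pos / (2 * n) <= x:
        hits += 1
print(hits / trials, 2 / pi * asin(sqrt(x)))
```

Both printed values should be close to 0.3, in agreement with the small table above.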


§11. Martingales. Some Applications to the Random Walk

1. The Bernoulli random walk discussed above was generated by a sequence ξ_1, ..., ξ_n of independent random variables. In this and the next section we introduce two important classes of dependent random variables, those that constitute martingales and Markov chains.

The theory of martingales will be developed in detail in Chapter VII. Here we shall present only the essential definitions, prove a theorem on the preservation of the martingale property for stopping times, and apply this to deduce the "ballot theorem." In turn, the latter theorem will be used for another proof of proposition (10.5), which was obtained above by applying the reflection principle.

2. Let (Ω, 𝒜, P) be a finite probability space and 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n a sequence of decompositions.

Definition 1. A sequence of random variables ξ_1, ..., ξ_n is called a martingale (with respect to the decompositions 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n) if

(1) ξ_k is 𝒟_k-measurable,
(2) E(ξ_{k+1} | 𝒟_k) = ξ_k, 1 ≤ k ≤ n - 1.

In order to emphasize the system of decompositions with respect to which the random variables form a martingale, we shall use the notation

    ξ = (ξ_k, 𝒟_k),  (1)

where for the sake of simplicity we often do not mention explicitly that 1 ≤ k ≤ n. When 𝒟_k is induced by ξ_1, ..., ξ_k, i.e.

    𝒟_k = 𝒟_{ξ_1,...,ξ_k},

instead of saying that ξ = (ξ_k, 𝒟_k) is a martingale, we simply say that the sequence ξ = (ξ_k) is a martingale.

Here are some examples of martingales.

Example 1. Let η_1, ..., η_n be independent Bernoulli random variables with

    P(η_k = 1) = P(η_k = -1) = 1/2,
    S_k = η_1 + ... + η_k  and  𝒟_k = 𝒟_{η_1,...,η_k}.

We observe that the decompositions 𝒟_k have a simple structure:

    𝒟_1 = {D^+, D^-},

where

    D^+ = {ω: η_1 = +1},  D^- = {ω: η_1 = -1};

    𝒟_2 = {D^{++}, D^{+-}, D^{-+}, D^{--}},

where

    D^{++} = {ω: η_1 = +1, η_2 = +1}, ..., D^{--} = {ω: η_1 = -1, η_2 = -1},

etc. It is also easy to see that 𝒟_{η_1,...,η_k} = 𝒟_{S_1,...,S_k}.

Let us show that (S_k, 𝒟_k) forms a martingale. In fact, S_k is 𝒟_k-measurable, and by (8.12), (8.18) and (8.24),

    E(S_{k+1} | 𝒟_k) = E(S_k + η_{k+1} | 𝒟_k) = E(S_k | 𝒟_k) + E(η_{k+1} | 𝒟_k) = S_k + Eη_{k+1} = S_k.

If we put S_0 = 0 and take 𝒟_0 = {Ω}, the trivial decomposition, then the sequence (S_k, 𝒟_k)_{0≤k≤n} also forms a martingale.
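The martingale property of (S_k, 𝒟_k) can be verified mechanically: each atom of 𝒟_k is a prefix (η_1, ..., η_k), and the average of S_{k+1} over the two equally likely continuations must equal S_k. A small Python check (a restatement of the computation above, not a proof):

```python
from itertools import product

n = 5
# atoms of D_k are the prefixes (eta_1, ..., eta_k) of signs
for k in range(n - 1):
    for prefix in product((-1, 1), repeat=k):
        s_k = sum(prefix)
        # E(S_{k+1} | D_k): average over the two equally likely continuations
        cond_exp = sum(s_k + e for e in (-1, 1)) / 2
        assert cond_exp == s_k
print("martingale property verified on all atoms")
```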

Example 2. Let η_1, ..., η_n be independent Bernoulli random variables with P(η_i = 1) = p, P(η_i = -1) = q. If p ≠ q, each of the sequences ξ = (ξ_k) with

    ξ_k = (q/p)^{S_k}  and  ξ_k = S_k - k(p - q),

where S_k = η_1 + ... + η_k, is a martingale.

Example 3. Let η be a random variable, 𝒟_1 ≼ ... ≼ 𝒟_n, and

    ξ_k = E(η | 𝒟_k).  (2)

Then the sequence ξ = (ξ_k, 𝒟_k) is a martingale. In fact, it is evident that E(η | 𝒟_k) is 𝒟_k-measurable, and by (8.20)

    E(ξ_{k+1} | 𝒟_k) = E[E(η | 𝒟_{k+1}) | 𝒟_k] = E(η | 𝒟_k) = ξ_k.

In this connection we notice that if ξ = (ξ_k, 𝒟_k) is any martingale, then by (8.20)

    ξ_k = E(ξ_{k+1} | 𝒟_k) = E[E(ξ_{k+2} | 𝒟_{k+1}) | 𝒟_k] = E(ξ_{k+2} | 𝒟_k) = ... = E(ξ_n | 𝒟_k).  (3)

Consequently the set of martingales ξ = (ξ_k, 𝒟_k) is exhausted by the martingales of the form (2). (We note that for infinite sequences ξ = (ξ_k, 𝒟_k)_{k≥1} this is, in general, no longer the case; see Problem 7 in §1 of Chapter VII.)


Example 4. Let η_1, ..., η_n be a sequence of independent identically distributed random variables, S_k = η_1 + ... + η_k, and

    𝒟_1 = 𝒟_{S_n},  𝒟_2 = 𝒟_{S_n,S_{n-1}},  ...,  𝒟_n = 𝒟_{S_n,S_{n-1},...,S_1}.

Let us show that the sequence ξ = (ξ_k, 𝒟_k) with

    ξ_1 = S_n/n,  ξ_2 = S_{n-1}/(n - 1),  ...,  ξ_k = S_{n+1-k}/(n + 1 - k),  ...,  ξ_n = S_1

is a martingale. In the first place, it is clear that 𝒟_k ≼ 𝒟_{k+1} and ξ_k is 𝒟_k-measurable. Moreover, we have by symmetry, for j ≤ n - k + 1,

    E(η_j | 𝒟_k) = E(η_1 | 𝒟_k)  (4)

(compare (8.26)). Therefore

    (n - k + 1) E(η_1 | 𝒟_k) = Σ_{j=1}^{n-k+1} E(η_j | 𝒟_k) = E(S_{n-k+1} | 𝒟_k) = S_{n-k+1},

and consequently

    ξ_k = S_{n-k+1}/(n - k + 1) = E(η_1 | 𝒟_k),

and it follows from Example 3 that ξ = (ξ_k, 𝒟_k) is a martingale.

Example 5. Let η_1, ..., η_n be independent Bernoulli random variables with P(η_i = +1) = P(η_i = -1) = 1/2, and S_k = η_1 + ... + η_k. Let A and B be integers with A < 0 < B. Then for 0 < λ < π/2, the sequence ξ = (ξ_k) with

    ξ_k = (cos λ)^{-k} exp{iλ(S_k - (B + A)/2)}  (5)

is a complex martingale (i.e., the real and imaginary parts of ξ = (ξ_k) form martingales).

3. It follows from the definition of a martingale that the expectation Eξ_k is the same for every k:

    Eξ_k = Eξ_1.

It turns out that this property persists if the time k is replaced by a random time. In order to formulate this property we introduce the following definition.

Definition 2. A random variable τ = τ(ω) that takes the values 1, 2, ..., n is called a stopping time (with respect to a decomposition (𝒟_k)_{1≤k≤n}, 𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n) if, for k = 1, ..., n, the random variable I_{{τ=k}}(ω) is 𝒟_k-measurable.

If we consider 𝒟_k as the decomposition induced by observations for k steps (for example, 𝒟_k = 𝒟_{η_1,...,η_k}, the decomposition induced by the variables η_1, ..., η_k), then the 𝒟_k-measurability of I_{{τ=k}}(ω) means that the realization or nonrealization of the event {τ = k} is determined only by observations for k steps (and is independent of the "future"). If ℬ_k = α(𝒟_k), then the 𝒟_k-measurability of I_{{τ=k}}(ω) is equivalent to the assumption that

    {τ = k} ∈ ℬ_k.  (6)

We have already introduced specific examples of stopping times: the times

τ_k and σ_{2n} introduced in §§9 and 10. Those times are special cases of stopping times of the form

    τ_A = min{0 < k ≤ n: ξ_k ∈ A},
    σ_A = min{0 ≤ k ≤ n: ξ_k ∈ A},  (7)

which are the times (respectively the first time after zero and the first time) at which a sequence ξ_0, ξ_1, ..., ξ_n attains a point of the set A.

4. Theorem 1. Let ξ = (ξ_k, 𝒟_k)_{1≤k≤n} be a martingale and τ a stopping time with respect to the decomposition (𝒟_k)_{1≤k≤n}. Then

    E(ξ_τ | 𝒟_1) = ξ_1,  (8)

where

    ξ_τ = Σ_{k=1}^n ξ_k I_{{τ=k}}(ω)  (9)

and

    Eξ_τ = Eξ_1.  (10)

PROOF (compare the proof of (9.29)). Let D ∈ 𝒟_1. Using (3) and the properties of conditional expectations, we find that

    E(ξ_τ | D) = E(ξ_τ I_D)/P(D) = (1/P(D)) Σ_{l=1}^n E[ξ_l I_{{τ=l}} I_D]
               = (1/P(D)) Σ_{l=1}^n E[ξ_n I_{{τ=l}} I_D]
               = (1/P(D)) E(ξ_n I_D) = E(ξ_n | D),

where the second equality holds because, by (3), ξ_l = E(ξ_n | 𝒟_l) and I_{{τ=l}} I_D is 𝒟_l-measurable,


and consequently

    E(ξ_τ | 𝒟_1) = E(ξ_n | 𝒟_1) = ξ_1.

The equation Eξ_τ = Eξ_1 then follows in an obvious way. This completes the proof of the theorem.

Corollary. For the martingale (S_k, 𝒟_k)_{1≤k≤n} of Example 1, and any stopping time τ (with respect to (𝒟_k)), we have the formulas

    ES_τ = 0,  ES_τ² = Eτ,  (11)

known as Wald's identities (cf. (9.29) and (9.30); see also Problem 1 and Theorem 3 in §2 of Chapter VII).
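Wald's identities hold for any stopping time with values in 1, ..., n. The exhaustive check below evaluates both sides exactly, with rational arithmetic, for one such stopping time (the first exit from (-2, 2), truncated at n — our own choice, made only for illustration):

```python
from itertools import product
from fractions import Fraction

n = 8
e_s = e_s2 = e_tau = Fraction(0)
p = Fraction(1, 2 ** n)  # probability of each sign sequence
for etas in product((-1, 1), repeat=n):
    s, tau, s_tau = 0, None, None
    for k, e in enumerate(etas, start=1):
        s += e
        # stop the first time |S_k| = 2, or at time n at the latest
        if tau is None and (abs(s) == 2 or k == n):
            tau, s_tau = k, s
    e_s += p * s_tau
    e_s2 += p * s_tau ** 2
    e_tau += p * tau

assert e_s == 0 and e_s2 == e_tau  # Wald's identities, exactly
print(e_tau)
```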

5. Let us use Theorem 1 to establish the following proposition.

Theorem 2 (Ballot Theorem). Let η_1, ..., η_n be a sequence of independent identically distributed random variables whose values are nonnegative integers, and let S_k = η_1 + ... + η_k, 1 ≤ k ≤ n. Then

    P{S_k < k for all k, 1 ≤ k ≤ n | S_n} = (1 - S_n/n)^+,  (12)

where a^+ = max(a, 0).

PROOF. On the set {ω: S_n ≥ n} the formula is evident. We therefore prove (12) for the sample points at which S_n < n. Let us consider the martingale ξ = (ξ_k, 𝒟_k)_{1≤k≤n} introduced in Example 4, with ξ_k = S_{n+1-k}/(n + 1 - k) and 𝒟_k = 𝒟_{S_{n+1-k},...,S_n}. We define

    τ = min{1 ≤ k ≤ n: ξ_k ≥ 1},

taking τ = n on the set {ξ_k < 1 for all k such that 1 ≤ k ≤ n} = {max_{1≤l≤n}(S_l/l) < 1}. It is clear that S_1 = 0 on this set, and therefore ξ_τ = ξ_n = S_1 = 0, so that

    {max_{1≤l≤n} S_l/l < 1} = {max_{1≤l≤n} S_l/l < 1, S_n < n} ⊆ {ξ_τ = 0}.  (13)

Now let us consider those outcomes for which simultaneously max_{1≤l≤n}(S_l/l) ≥ 1 and S_n < n. Write σ = n + 1 - τ. It is easy to see that

    σ = max{1 ≤ k ≤ n: S_k ≥ k},

and therefore (since S_n < n) we have σ < n, S_σ ≥ σ, and S_{σ+1} < σ + 1. Consequently η_{σ+1} = S_{σ+1} - S_σ < (σ + 1) - σ = 1, i.e. η_{σ+1} = 0. Therefore σ ≤ S_σ = S_{σ+1} < σ + 1, and consequently S_σ = σ and

    ξ_τ = S_{n+1-τ}/(n + 1 - τ) = S_σ/σ = 1.


Therefore

    {max_{1≤l≤n} S_l/l ≥ 1, S_n < n} ⊆ {ξ_τ = 1}.  (14)

From (13) and (14) we find that

    {max_{1≤l≤n} S_l/l ≥ 1, S_n < n} = {ξ_τ = 1} ∩ {S_n < n}.

Therefore, on the set {S_n < n}, we have

    P{S_k < k for all k, 1 ≤ k ≤ n | S_n} = P{ξ_τ = 0 | S_n} = 1 - P{ξ_τ = 1 | S_n} = 1 - E(ξ_τ | S_n),

where the last equation follows because ξ_τ takes only the two values 0 and 1. Let us notice now that E(ξ_τ | S_n) = E(ξ_τ | 𝒟_1), and (by Theorem 1) E(ξ_τ | 𝒟_1) = ξ_1 = S_n/n. Consequently, on the set {S_n < n}, we have

    P{S_k < k for all k such that 1 ≤ k ≤ n | S_n} = 1 - S_n/n.

This completes the proof of the theorem.

We now apply this theorem to obtain a different proof of Lemma 1 of §10, and explain why it is called the ballot theorem.

Let ξ_1, ..., ξ_n be independent Bernoulli random variables with P(ξ_i = 1) = P(ξ_i = -1) = 1/2, S_k = ξ_1 + ... + ξ_k, and let a and b be nonnegative integers such that a - b > 0, a + b = n. We are going to show that

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b} = (a - b)/(a + b).  (15)

In fact, by symmetry,

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b}
      = P{S_1 < 0, ..., S_n < 0 | S_n = -(a - b)}
      = P{S_1 + 1 < 1, ..., S_n + n < n | S_n + n = n - (a - b)}
      = P{η_1 < 1, ..., η_1 + ... + η_n < n | η_1 + ... + η_n = n - (a - b)}
      = (a - b)/n = (a - b)/(a + b),

where we have put η_k = ξ_k + 1 and applied (12). Now formula (10.5) follows from (15) in an evident way; the formula was also established in Lemma 1 of §10 by using the reflection principle.


Let us interpret ξ_i = +1 as a vote for candidate A and ξ_i = -1 as a vote for B. Then S_k is the difference between the numbers of votes cast for A and B at the time when k votes have been recorded, and

    P{S_1 > 0, ..., S_n > 0 | S_n = a - b}

is the probability that A was always ahead of B, with the understanding that A received a votes in all, B received b votes, and a - b > 0, a + b = n. According to (15) this probability is (a - b)/n.
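Formula (15) can be checked by enumerating the orderings of a votes for A and b votes for B; the fraction of orderings in which A stays strictly ahead is exactly (a - b)/(a + b). A small Python sketch (the values of a and b are chosen arbitrarily for illustration):

```python
from itertools import permutations
from fractions import Fraction

a, b = 5, 3
votes = [1] * a + [-1] * b
favorable = total = 0
# distinct orderings of a (+1)-votes and b (-1)-votes
for order in set(permutations(votes)):
    total += 1
    s, ahead = 0, True
    for v in order:
        s += v
        if s <= 0:          # A not strictly ahead at some count
            ahead = False
            break
    favorable += ahead

assert Fraction(favorable, total) == Fraction(a - b, a + b)
print(favorable, total)
```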

6. PROBLEMS

1. Let 𝒟_0 ≼ 𝒟_1 ≼ ... ≼ 𝒟_n be a sequence of decompositions with 𝒟_0 = {Ω}, and let η_k be 𝒟_k-measurable variables, 1 ≤ k ≤ n. Show that the sequence ξ = (ξ_k, 𝒟_k) with

    ξ_k = Σ_{l=1}^k [η_l - E(η_l | 𝒟_{l-1})]

is a martingale.

2. Let the random variables η_1, ..., η_k satisfy E(η_k | η_1, ..., η_{k-1}) = 0. Show that the sequence ξ = (ξ_k)_{1≤k≤n} with ξ_1 = η_1 and

    ξ_{k+1} = Σ_{i=1}^k η_{i+1} f_i(η_1, ..., η_i),

where the f_i are given functions, is a martingale.

3. Show that every martingale ξ = (ξ_k, 𝒟_k) has uncorrelated increments: if a < b < c < d then

    E(ξ_d - ξ_c)(ξ_b - ξ_a) = 0.

4. Let ξ = (ξ_1, ..., ξ_n) be a random sequence such that ξ_k is 𝒟_k-measurable (𝒟_1 ≼ 𝒟_2 ≼ ... ≼ 𝒟_n). Show that a necessary and sufficient condition for this sequence to be a martingale (with respect to the system (𝒟_k)) is that Eξ_τ = Eξ_1 for every stopping time τ (with respect to (𝒟_k)). (The phrase "for every stopping time" can be replaced by "for every stopping time that assumes two values.")

5. Show that if ξ = (ξ_k, 𝒟_k)_{1≤k≤n} is a martingale and τ is a stopping time, then

    E[ξ_n I_{{τ=k}}] = E[ξ_k I_{{τ=k}}]

for every k.

6. Let ξ = (ξ_k, 𝒟_k) and η = (η_k, 𝒟_k) be two martingales with ξ_1 = η_1 = 0. Show that

    Eξ_n η_n = Σ_{k=2}^n E(ξ_k - ξ_{k-1})(η_k - η_{k-1}),

and in particular that

    Eξ_n² = Σ_{k=2}^n E(ξ_k - ξ_{k-1})².


7. Let η_1, ..., η_n be a sequence of independent identically distributed random variables with Eη_i = 0. Show that each of the sequences ξ = (ξ_k) with

    ξ_k = (Σ_{i=1}^k η_i)² - k Eη_1²,

    ξ_k = exp{λ(η_1 + ... + η_k)} / (E exp λη_1)^k

is a martingale.

8. Let η_1, ..., η_n be a sequence of independent identically distributed random variables taking values in a finite set Y. Let f_0(y) = P(η_1 = y), y ∈ Y, and let f_1(y) be a nonnegative function with Σ_{y∈Y} f_1(y) = 1. Show that the sequence ξ = (ξ_k, 𝒟_k) with 𝒟_k = 𝒟_{η_1,...,η_k} and

    ξ_k = [f_1(η_1) ··· f_1(η_k)] / [f_0(η_1) ··· f_0(η_k)]

is a martingale. (The variables ξ_k, known as likelihood ratios, are extremely important in mathematical statistics.)

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

1. We have discussed the Bernoulli scheme with

    Ω = {ω: ω = (x_1, ..., x_n), x_i = 0, 1},

where the probability p(ω) of each outcome is given by

    p(ω) = p(x_1) ··· p(x_n),  (1)

with p(x) = p^x q^{1-x}. With these hypotheses, the variables ξ_1, ..., ξ_n with ξ_i(ω) = x_i are independent and identically distributed with

    P(ξ_1 = x) = ... = P(ξ_n = x) = p(x),  x = 0, 1.

If we replace (1) by

    p(ω) = p_1(x_1) ··· p_n(x_n),

where p_i(x) = p_i^x (1 - p_i)^{1-x}, 0 ≤ p_i ≤ 1, the random variables ξ_1, ..., ξ_n are still independent, but in general are differently distributed:

    P(ξ_1 = x) = p_1(x), ..., P(ξ_n = x) = p_n(x).

We now consider a generalization that leads to dependent random variables that form what is known as a Markov chain. Let us suppose that

    Ω = {ω: ω = (x_0, x_1, ..., x_n), x_i ∈ X},


where X is a finite set. Let there be given nonnegative functions p_0(x), p_1(x, y), ..., p_n(x, y) such that

    Σ_{x∈X} p_0(x) = 1,   Σ_{y∈X} p_k(x, y) = 1,  k = 1, ..., n,  (2)

and put

    p(ω) = p_0(x_0) p_1(x_0, x_1) ··· p_n(x_{n-1}, x_n).  (3)

It is easily verified that Σ_{ω∈Ω} p(ω) = 1, and consequently the set of numbers p(ω) together with the space Ω and the collection of its subsets defines a probabilistic model, which it is usual to call a model of experiments that form a Markov chain.

Let us introduce the random variables ξ_0, ξ_1, ..., ξ_n with ξ_i(ω) = x_i. A simple calculation shows that

    P(ξ_0 = a_0) = p_0(a_0),
    P(ξ_0 = a_0, ..., ξ_k = a_k) = p_0(a_0) p_1(a_0, a_1) ··· p_k(a_{k-1}, a_k).  (4)

We now establish the validity of the following fundamental property of conditional probabilities:

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k, ..., ξ_0 = a_0} = P{ξ_{k+1} = a_{k+1} | ξ_k = a_k}  (5)

(under the assumption that P(ξ_k = a_k, ..., ξ_0 = a_0) > 0). By (4),

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k, ..., ξ_0 = a_0}
      = P{ξ_{k+1} = a_{k+1}, ..., ξ_0 = a_0} / P{ξ_k = a_k, ..., ξ_0 = a_0}
      = [p_0(a_0) p_1(a_0, a_1) ··· p_{k+1}(a_k, a_{k+1})] / [p_0(a_0) ··· p_k(a_{k-1}, a_k)]
      = p_{k+1}(a_k, a_{k+1}).

In a similar way we verify

    P{ξ_{k+1} = a_{k+1} | ξ_k = a_k} = p_{k+1}(a_k, a_{k+1}),  (6)

which establishes (5).

Let 𝒟_k = 𝒟_{ξ_0,...,ξ_k} be the decomposition induced by ξ_0, ..., ξ_k, and ℬ_k = α(𝒟_k). Then, in the notation introduced in §8, it follows from (5) that

    P{ξ_{k+1} = a_{k+1} | ξ_0, ..., ξ_k} = P{ξ_{k+1} = a_{k+1} | ξ_k}  (7)

or

    E(I_{{ξ_{k+1} = a_{k+1}}} | ξ_0, ..., ξ_k) = E(I_{{ξ_{k+1} = a_{k+1}}} | ξ_k).

If we use the evident equation

    P(AB | C) = P(A | BC) P(B | C),

we find from (7) that

    P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ℬ_k} = P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_k}  (8)

or

    P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_0, ..., ξ_k} = P{ξ_n = a_n, ..., ξ_{k+1} = a_{k+1} | ξ_k}.  (9)

This equation admits the following intuitive interpretation. Let us think of ξ_k as the position of a particle "at present," (ξ_0, ..., ξ_{k-1}) as the "past," and (ξ_{k+1}, ..., ξ_n) as the "future." Then (9) says that if the past and the present are given, the future depends only on the present and is independent of how the particle arrived at ξ_k, i.e. is independent of the past (ξ_0, ..., ξ_{k-1}).

Let

    F = {ξ_n = a_n, ..., ξ_{k+1} = a_{k+1}},  N = {ξ_k = a_k},  B = {ξ_{k-1} = a_{k-1}, ..., ξ_0 = a_0}.

Then it follows from (9) that

    P(F | NB) = P(F | N),

from which we easily find that

    P(FB | N) = P(F | N) P(B | N).  (10)

In other words, it follows from (7) that for a given present N, the future F and the past B are independent. It is easily shown that the converse also holds: if (10) holds for all k = 0, 1, ..., n - 1, then (7) holds for every k = 0, 1, ..., n - 1.

The property of the independence of future and past, or, what is the same thing, the lack of dependence of the future on the past when the present is given, is called the Markov property, and the corresponding sequence of random variables ξ_0, ..., ξ_n is a Markov chain. Consequently if the probabilities p(ω) of the sample points are given by (3), the sequence (ξ_0, ..., ξ_n) with ξ_i(ω) = x_i forms a Markov chain.

We give the following formal definition.

Definition. Let (Ω, 𝒜, P) be a (finite) probability space and let ξ = (ξ_0, ..., ξ_n) be a sequence of random variables with values in a (finite) set X. If (7) is satisfied, the sequence ξ = (ξ_0, ..., ξ_n) is called a (finite) Markov chain. The set X is called the phase space or state space of the chain. The set of probabilities (p_0(x)), x ∈ X, with p_0(x) = P(ξ_0 = x) is the initial distribution, and the matrix ‖p_k(x, y)‖, x, y ∈ X, with p_k(x, y) = P{ξ_k = y | ξ_{k-1} = x} is the matrix of transition probabilities (from state x to state y) at time k = 1, ..., n.


When the transition probabilities p_k(x, y) are independent of k, that is, p_k(x, y) = p(x, y), the sequence ξ = (ξ_0, ..., ξ_n) is called a homogeneous Markov chain with transition matrix ‖p(x, y)‖.

Notice that the matrix ‖p(x, y)‖ is stochastic: its elements are nonnegative and the sum of the elements in each row is 1: Σ_y p(x, y) = 1, x ∈ X. We shall suppose that the phase space X is a finite set of integers (X = {0, 1, ..., N}, X = {0, ±1, ..., ±N}, etc.), and use the traditional notation p_i = p_0(i) and p_ij = p(i, j). It is clear that the properties of a homogeneous Markov chain are completely determined by the initial distribution (p_i) and the transition probabilities (p_ij). In specific cases we describe the evolution of the chain, not by writing out the matrix ‖p_ij‖ explicitly, but by a (directed) graph whose vertices are the states in X, in which an arrow from state i to state j, labeled with the number p_ij, indicates that it is possible to pass from point i to point j with probability p_ij. When p_ij = 0, the corresponding arrow is omitted.

Example

1. Let X = {0, 1, 2} and

    ‖p_ij‖ = ( 1    0    0  )
             ( 1/2  0   1/2 )
             ( 2/3  0   1/3 ).

The following graph corresponds to this matrix:

[graph of transitions]

Here state 0 is said to be absorbing: if the particle gets into this state it remains there, since p_00 = 1. From state 1 the particle goes to the adjacent states 0 or 2 with equal probabilities; state 2 has the property that the particle remains there with probability 1/3 and goes to state 0 with probability 2/3.

Example

for

2. Let X= {0, ± 1, ... , ±N}, Po= 1, PNN = P(-N)(-N) = 1, and,

Iii< N,

p, j = i + 1, { Pii = q, j = i - 1, 0 otherwise.

(11)


The transitions corresponding to this chain can be presented graphically in the following way (N = 3):

[graph of transitions]

This chain corresponds to the two-player game discussed earlier, when each player has a bankroll N and at each turn the first player wins +1 from the second with probability p, and loses (wins -1) with probability q. If we think of state i as the amount won by the first player from the second, then reaching state N or -N means the ruin of the second or first player, respectively.

In fact, if η_1, η_2, ..., η_n are independent Bernoulli random variables with P(η_i = +1) = p, P(η_i = -1) = q, S_0 = 0 and S_k = η_1 + ... + η_k the amounts won by the first player from the second, then the sequence S_0, S_1, ..., S_n is a Markov chain with p_0 = 1 and transition matrix (11), since

    P{S_{k+1} = j | S_k = i_k, S_{k-1} = i_{k-1}, ...}
      = P{S_k + η_{k+1} = j | S_k = i_k, S_{k-1} = i_{k-1}, ...}
      = P{S_k + η_{k+1} = j | S_k = i_k} = P{η_{k+1} = j - i_k}.

This Markov chain has a very simple structure:

    S_{k+1} = S_k + η_{k+1},  0 ≤ k ≤ n - 1,

where η_1, η_2, ..., η_n is a sequence of independent random variables.

The same considerations show that if ξ_0, η_1, ..., η_n are independent random variables, then the sequence ξ_0, ξ_1, ..., ξ_n with

    ξ_{k+1} = f_k(ξ_k, η_{k+1}),  0 ≤ k ≤ n - 1,  (12)

is also a Markov chain. It is worth noting in this connection that a Markov chain constructed in this way can be considered as a natural probabilistic analog of a (deterministic) sequence x = (x_0, ..., x_n) generated by the recurrent equations x_{k+1} = f_k(x_k).

113

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

waiting line at time k, ~ 0 = 0. Then if ~k = i, at the next time k ~k + 1 of the waiting line is equal to j.=

{1Jk+ 1 i- 1

+ 1Jk+ 1

+ 1 the length

if i = 0, ifizl.

In other words, 0

s

k

s

n- 1,

where a+ = max(a, 0), and therefore the sequence ~ = (~ 0 , Markov chain.

..• ,

~")

is a

4. This example comes from the theory of branching processes. A branching process with discrete times is a sequence of random variables ~ 0 , ~ 1 , .•. , ~"'where ~k is interpreted as the number of particles in existence at time k, and the process of creation and annihilation of particles is as follows: each particle, independently of the other particles and of the "prehistory" of the process, is transformed into j particles with probability pi, j = 0, 1, ... , M. We suppose that at the initial time there is just one particle, ~ 0 = 1. If at time k there are ~k particles (numbered 1, 2, ... , ~k), then by assumption ~k+ 1 is given as a random sum of random variables, EXAMPLE

~k+ 1

=

IJlk)

+ ... + 1]~~,

where 1Jlkl is the number of particles produced by particle number i. It is clear that if ~k = 0 then ~k+ 1 = 0. If we suppose that all the random variables IJ~k>, k 2 0, are independent of each other, we obtain Pgk+1 = ik+1i~k = ik, ~k-1 = ik-t• · · .} = Pgk+t = ik+ti~k = ik} = P{IJ\kl + · · · + IJ!:> = ik+ d. It is evident from this that the sequence ~ 0 , ~ 1 ,

... , ~"is a Markov chain. A particularly interesting case is that in which each particle either vanishes with probability q or divides in two with probability p, p + q = 1. In this case it is easy to calculate that

is given by the formula Pii

=

{c 0

1,."12pi12qi- j/2, ). = 0' ... , 2'I, in all other cases.

2. Let ~ = ( ~k, Ill, IP') be a homogeneous Markov chain with st~rting vectors (rows) Ill = (p;) and transition matrix Ill = IIPiill. It is clear that Pii = P{~1 =j/~o = i} = ... = P{~n =j/~n-1 = i}.

114

I. Elementary Probability Theory

We shall use the notation

for the probability of a transition from state i to state j in k steps, and

for the probability of finding the particle at point j at time k. Also let

Let us show that the transition probabilities pj'> satisfy the Kolmogorov-

Chapman equation

\~ I) = " p\k)p(l! P'1 l.J ra ~J'

(13)

~

or, in matrix form,

(14) The proof is extremely simple: using the formula for total probability and the Markov property, we obtain P!~+IJ

= P(~k+l = j l~o =

i)

=

L P(~k+l = j, ~k = IXI~o = i) ~

=

L P(~k+l = j I~k = 1X)P(~k = IX I~0 = i) = L p~}P!~>. ~

~

The following two cases of (13) are particularly important: the backward equation 1J = "P· P\~+ IJ £...., I~ p
(15)

v

and the forward equation \~+ 1) P IJ

= "p\k)p" . £...., I~ ~}

(16)

~

(see Figures 22 and 23). The forward and backward equations can be written in the following matrix forms

= IP. IP,

(17)

IP.

(18)

IP
§12. Markov Chains. Ergodic Theorem. Strong Markov Property

0

115

I+ I

Figure 22. For the backward equation.

Similarly, we find for the (unconditional) probabilities p)k) that

'\' p(k)p(l) PJ(_k +I) = ~ a: '2.]'

(19)

a

or in matrix form

fl (k +I) = fl (k) • I]J>(I). In particular,

rn
rn
·IP

(forward equation) and

fl (k+ 1) = fl (1).

I]J>(k)

(backward equation). Since IP 0 ) = IP, fl (1) = fl, it follows from these equations that JP(k) =

!Pk,

Consequently for homogeneous Markov chains the k-step transitiOn probabilities p~j) are the elements of the kth powers of the matrix IP, so that many properties of such chains can be investigated by the methods of matrix analysis.

k k

+1

Figure 23. For the forward equation.

116

I. Elementary Probability Theory

5. Consider a homogeneous Markov chain with the two states 0 and 1 and the matrix

EXAMPLE

IFD

= (Poo Pot)· Pu

Pto

It is easy to calculate that p2

Po;(Poo + Pu)) Ptt + PotPto

= ( P~o + Po1P1o Pto(Poo

and (by induction)

+ Pu)

1 (1 - 1-

IFD" _ - 2 - Poo - Pu

Pu 1 - Pu

Poo) 1 - Poo

+ (Poo + Pu

- 1)" ( 1 - Poo 2 - Poo - Pu -(1 - Pu)

-(1 - Poo)) 1- Pu

(under the hypothesis that IPoo + p 11 - 11 < 1). Hence it is clear that if the elements of IFD satisfy Ip00 + p11 - 11 < 1 (in particular, if all the transition probabilities Pii are positive), then as n-+ oo IFD" -+

1 (1 - 1-

2 - Poo - Pu

Pu 1 - Pu

Poo) 1 - Poo '

(20)

and therefore l - p 11 lim (nl = P,o 2 ' n - Poo- Pu

. Pil(n) 1tm "

1 - Poo

= -,-------'--'---

2- Poo- Pu

Consequently if IPoo + p 11 - 11 < 1, such a Markov chain exhibits regular behavior of the following kind: the influence of the initial state on the probability of finding the particle in one state or another eventually becomes negligible (plj> approach limits ni, independent of i and forming a probability distribution: n0 ~ 0, n 1 ~ 0, n0 + n 1 = 1); if also all Pii > 0 then n0 > 0 and n 1 > 0. 3. The following theorem describes a wide class of Markov chains that have the property called ergodicity: the limits ni = limn Pii not only exist, are independent of i, and form a probability distribution (ni ~ 0, Li ni = 1}, but also ni > 0 for allj (such a distribution ni is said to be ergodic).

Theorem 1 (Ergodic Theorem). Let 1P

= IIPiill be the transition matrix of a chain with a .finite state space X = {1, 2, ... , N}.

(a) If there is an n0 such that min p\~ol > 0' lJ i,j

(21)

117

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

then there are numbers n 1,

••• ,

nN

such that (22)

and

n- oo

(23)

for every i e X. if there are numbers n 1, ••. , nN satisfying (22) and (23), there is an n0 such that (21) holds. (c) 1he numbers (n 1 , ... , nN) satisfy the equations

(b) Conversely,

= 1, ... ,N.

j PROOF.

(24)

(a) Let

m(n> = min p!~> ) I] '

M)"> = max Pl~P.

i

i

Since 1) = "P· p
(25)

~

we have 1 > =min p!'!+ 1 > =min" P· pmin" P· min p<") = m(n> m("+ ) I) ' L, I~ ~) L, I~ ~) ) '

i

i

a

i

eX

a

whence m}"> :::;; m)"+ >and similarly M}"> ~ M}"+ 1l. Consequently, to establish (23) it will be enough to prove that 1

M("> - m("> .,... 0 1

Let e

=

1

n .,... oo, j = 1, ... , N.

'

min;, i Pljo> > 0. 'fhen = "p(no>p]p<") P!'!o+n) I) L, I~ ~) I~ )~ ~) ~

+ s "p(">p
~

~)

~

(p(no) - sp(">]p
= "

L,

+ sp(~n) 11 •

~

But p!no) - sp("> > 0 '· therefore la Ja. -

> m(n)." [p!no) - sp(">J P!'!o+n) I) ) L, I~ )~

+ sp(~n) = ))

m("l(1 - s) )

~

and consequently m!no+n) > m("l(1 - s) 1 1

+ sp(~n) )) •

M(no+n> < M(">(1 - s) 1 1

+ sp!~n> )) •

In a similar way Combining these inequalities, we obtain M(no+n> _ m(no+n> < (M("> _ m(">). (1 _ s) 1

1

-

1

1

+ sp(~n) Jl '

118

I. Elementary Probability Theory

and consequently M<J<no+n> _ m<.kno+n> < (M<.n> _ m<.">)(1 _ e)k! 0 J

J

-

J

J

'

k--+ 00.

Thus M~"tll - m~npl --+ 0 for some subsequence np, np --+ oo. But the difference M~nl - m~n) is monotonic in n, and therefore M~nl - m~nl --+ 0, n--+ oo. If we put ni = limn m~">, it follows from the preceding inequalities that lp \~)IJ

7C·I < M(n)m<.n> < (1 - e)[n/no)-1 J J J -

for n ~ n0 , that is, p!j> converges to its limit ni geometrically (i.e., as fast as a geometric progression). It is also clear that m~n> ~ m~no) ~ e > 0 for n ~ n0 , and therefore ni > 0. (b) Inequality (21) follows from (23) and (25). (c) Equation (24) follows from (23) and (25). This completes the proof of the theorem. 4. Equations (24) play a major role in the theory of Markov chains. A nonnegative solution (n 1 , ••• , nN) satisfying I,. n~ = 1 is said to be a stationary or invariant probability distribution for the Markov chain with transition matrix IIPiill. The reason for this terminology is as follows. Let us select an initial distribution (n 1, ... , nN) and take Pi= ni. Then PJ(l)

= "" 1C

p . = 1C.J

£...,~~} ~

and in general p)"> = ni. In other words, if we take (n 1, ... , nN) as the initial distribution, this distribution is unchanged as time goes on, i.e. for any k j = 1, ... ,N.

Moreover, with this initial distribution the Markov chain ~ =(~,Ill, IP) is really stationary: the joint distribution of the vector (~k• ~k+ ~> ... , ~k+ 1) is independent of k for alll (assuming that k + 1 :::;; n). Property (21) guarantees both the existence of limits ni = lim plj>, which are independent of i, and the existence of an ergodic distribution, i.e. one with ni > 0. The distribution (n 1 , •.• , nN) is also a stationary distribution. Let us now show that the set (n 1 , .•. , nN) is the only stationary distribution. In fact, let (ft 1, •.• , ftN) be another stationary distribution. Then

and since P~1--+ ni we have iii

= L (ft~ · n) = ni. ~

These problems will be investigated in detail in Chapter VIII for Markov chains with countably many states as well as with finitely many states.

119

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

We note that a stationary probability distribution (even unique) may exist for a nonergodic chain. In fact, if

then [p>2n

=

(0 1)

1 0 '

[p>2n+ 1

=

(1 0)

0 1 '

and consequently the limits lim pjj> do not exist. At the same time, the system j

= 1, 2,

reduces to

of which the unique solution satisfying n 1 + n 2 = 1 is(!,!). We also notice that for this example the system (24) has the form no

=

noPoo

+ n1P1o·

from which, by the condition n 0 = n 1 = 1, we find that the unique stationary distribution (n 0 , n 1) coincides with the one obtained above: no=

1- Pu ' 2 - Poo - Pu

nl

=

1- Poo

2- Poo-

P11

.

We now consider some corollaries of the ergodic theorem. Let A be a set of states, A ~ X and IA(x)

1, XEA, = { 0, x ¢:A.

Consider

which is the fraction of the time spent by the particle in the set A. Since E[JA((k)i(o = i] = P((kEAI(o = i) = LPI,l(=p\kl(A)), jeA

120

I. Elementary Probability Theory

we have

1

E[vA(n)l~o = i] = - 1

n

+

and in particular

E[vw(n)l~o = i] = -

n

1-

L pjkl(A) k=o n

±

+ 1 k=o

pjj>.

It is known from analysis (see also Lemma 1 in §3 of Chapter IV) that if an-+ a then (a 0 + · · · + an)/(n + 1)-+ a, n-+ oo. Hence if pj'l-+ ni, k-+ oo, then where

nA =

L ni.

jeA

For ergodic chains one can in fact prove more, namely that the following result holds for I A(~ 0 ), ••• , I A( ~n), ....

Law of Large Numbers. If ~ 0 , ~ 1 , ... form a .finite ergodic Markov chain, then

n-+ oo,

(26)

for every e > 0 and every initial distribution. Before we undertake the proof, let us notice that we cannot apply the results of §5 directly to IA(~ 0 ), ... , JA(~n), ... , since these variables are, in general, dependent. However, the proof can be carried through along the same lines as for independent variables if we again use Chebyshev's inequality, and apply the fact that for an ergodic chain with finitely many states there is a number p, 0 < p < 1, such that (27)

Let us consider states i and j (which might be the same) and show that, fore> 0,

n-+ oo. By Chebyshev's inequality,

Hence we have only to show that

n-+ oo.

(28)

121

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

A simple calculation shows that

=

1 (n

+

n

n

'\'

'\' m!~· I) L., l} ' k=o l=o L.,

1)2

where

s = min(k, l) and t = lk- 11. By (27),

P!'!> l}

= n:.J + e!'!>

l}'

Therefore lml~· 1 >1:::; C1[p•

+ P + Pk + /], 1

where C 1 is a constant. Consequently

1

n

n

--,-----____,.,...,.. '\' '\' m
<

- (n

C

n

1

'\'

n

'\' [

s

+ 1)2 k~o 1~0 p 4C 1 2(n + 1) 2 + 1) 1 - p

+

t

p

+

k+

p

1]

p

8C 1

(n

+ 1)(1

- p)

--+

O

,

n --+ oo.

Then (28) follows from this, and we obtain (26) in an obvious way. 5. In §9 we gave, for a random walk S 0 , S 1, ... generated by a Bernoulli scheme, recurrent equations for the probability and the expectation of the exit time at either boundary. We now derive similar equations for Markov chains. Let~= (~ 0 , ••• , ~")be a Markov chain with transition matrix IIPiill and phase space X = {0, ± 1, ... , ± N}. Let A and B be two integers, - N :::; A :::; 0 :::; B :::; N, and x EX. Let f!lk+ 1 be the set of paths {x 0 , x 1 , ..• , xk), xi E X, that leave the interval (A, B) for the first time at the upper end, i.e. leave (A, B) by going into the set (B, B + 1, ... , N). , For A :::; x :::; B, put f3k(x) = P{(~o .... , ~k) E fflk+ll~o = x}.

In order to find these probabilities (for the first exit of the Markov chain from (A, B) through the upper boundary) we use the method that was applied in the deduction of the backward equations.

122

I. Elementary Probability Theory

We have {3k(x) = P{(eo, ... ' ek) E .?Jk+ 11 eo = X} =

L Pxy. P{(eo, ... ' ek) E .?Jk+lleo =X, el = y

y},

where, as is easily seen by using the Markov property and the homogeneity of the chain,

P{(eo, ... ' ek)

.?Jk+lleo =X, el = y} = P{(x, y, e2, ... ' ek) E .?Jk+lleo =X, el = P{(y, e2, ... , ek) E BBkle1 = y}

E

= y}

= P{(y, eb ... ' ek- dE .?lkl eo = y} = Pk-l(y). Therefore y

for A < x < B and 1 :::;; k :::;; n. Moreover, it is clear that Pix)= 1,

X

= B, B + 1, ... ' N,

and X= -N, ... ,A.

In a similar way we can find equations for ocix), the probabilities for first exit from (A, B) through the lower boundary. Let rk = min{O :::;; l :::;; k: ¢(A, B)}, where rk = k if the set { ·} = 0. Then the same method, applied to mix)= E(rkl eo = x), leads to the following recurrent equations:

e,

mk(x) = 1 +

L mk-l (Y)Pxy y

(here 1 :::;; k :::;; n, A < x < B). We define

x ¢(A, B). It is clear that if the transition matrix is given by (11) the equations for ock(x), {3k(x) and mk(x) become the corresponding equations from §9, where they were obtained by essentially the same method that was used here. These equations have the most interesting applications in the limiting case when the walk continues for an unbounded length of time. Just as in §9, the corresponding equations can be obtained by a formal limiting process (k --+

CX) ).

By way of example, we consider the Markov chain with states {0, 1, ... , B} and transition probabilities

Poo

= 1,

PBB =

1,

123

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

and

Pi > 0, ~ = ~ + 1, pij = { ri, J = z, qi > 0, j = i - 1, for 1 :-:;; i :-:;; B - 1, where Pi + qi + ri = 1. For this chain, the corresponding graph is

a~---·~e 0

q1

1

2

P1

qB-1 B- 1 PB-1

B

It is clear that states 0 and B are absorbing, whereas for every other state i the particle stays there with probability ri, moves one step to the right with probability Pi• and to the left with probability qi. Let us find a(x) = limk-oo ak(x), the limit of the probability that a particle starting at the point x arrives at state zero before reaching state B. Taking limits ask--+ oo in the equations for aix), we find that

when 0 < j < B, with the boundary conditions

a(O)

=

1,

a(B) = 0.

Since rj- 1 - qj- pj, we have pj(a(j + 1) - a(j)) = qj(a(j) - a(j - 1)) and consequently a(j

+ 1) -

a(j) = pj(a(1) - 1),

where Po= 1.

But a(j

+ 1) -

j

1=

L (a(i + 1) -

a(i)).

i=O

Therefore a(j

+ 1) -

j

1 = (a(1) - 1) ·

L Pi·

i=l

124

I. Elementary Probability Theory

Ifj = B - 1, we have r:~.(j

+ 1) =

0, and therefore

r:~.(B) =

1

r:J.(l)=1=-'\'B-1

L..i= 1 Pi

'

whence an d

'\'B-1

L..i=i Pi ' L..i=1 Pi

( r:I.J

0)

='\'B-1

j = 1,

0

0

0'

B.

(This should be compared with the results of §9.) Now let m(x) = limk mk(x), the limiting value of the average time taken to arrive at one of the states 0 or B. Then m(O) = m(B) = 0, m(x)

= 1 + L m(y)Pxy y

and consequently for the example that we are considering,

m(j) = 1 + qimU - 1)

+ rimU) + pimU + 1)

for j = 1, 2, ... , B - 1. To find mU) we put

M(j) = m(j) - m(j- 1),

j = 0, 1,. 0. 'B.

Then

piM(j + 1)

= qiM(j)

- 1,

j = 1, ... , B- 1,

and consequently we find that

M(j

+ 1) =

piM(1)- Ri,

where q1 .. qj 0

P1 · · Pi ' 0

Therefore j-1

m(i) = mU)- m(O) =

L M(i + 1)

i=O

j-1

=

L

i=O

(pim(l)- Ri) = m(l)

j-1

j-1

i=O

i=O

L Pi- L Ri.

It remains only to determine m(1). But m(B) = 0, and therefore '\'B-l R m(1) = Ik=o1 i, i=O Pi

and for 1 < j ::;; B,

( 0)

mJ =

j-1

'\'~-1R.

j-1

'\' R L..• = 0 I '\' i~o Pi. Lf;-o1 Pi - i:--o io

125

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

(This should be compared with the results in §9 for the case ri = 0, Pi = p,

qi

= q.)

6. In this subsection we consider a stronger version of the Markov property (8), namely that it remains valid if time k is replaced by a random time (see also Theorem 2). The significance of this, the strong Markov property, can be illustrated in particular by the example of the derivation of the recurrent relations (38), which play an important role in the classification of the states of Markov chains (Chapter VIII). Let ~ = (~ 1 , . . . , ~n) be a homogeneous Markov chain with transition matrix IIPijll; let ~~ = (~f)oo;;k,;;n be a system of decompositions, ~f = ~ ~o •...• ~k. Let ~f denote the algebra ri(~i) generated by the decomposition ~f.

B

We first put the Markov property (8) into a somewhat different form. Let Let us show that then

E ~f.

= ak+llB n (~k = ak)} = P{~n =an, ... , ~k+l =

Pgn =an,···, ~k+l

ak+ll~k

= ak}

(29)

(assuming that P{B n (~k = ak)} > 0). In fact, B can be represented in the form

B=

where

I* {~~ .... , ~k = at},

I* extends over some set (a~, ... ' an Consequently P{~n =an, .. ·, ~k+l P{(~n

= ak+llB n

(~k

= ak)}

=an, ... , ~k = ak) n B} P{(~k = ak) n B}

I* P{(~n

= aw .. , ~k = ak) n (~ 0 = a6, ... , ~k =a%)} P{(~k = ak) n B}

(30)

But, by the Markov property, P{(~n

=an, ... , ~k = ak) n (~o = a~, ... , ~k =at)}

=

{

P{~n =an, ... , ~k+l = ak+ll~o =a~,···, ~k =at} x P{~ 0 =a~, ... , ~k =at} ifak =at, 0 if ak i= at,

=

{

P{~n =an, ... , ~k+l = ak+ll~k = ak}P{~o =a~, ... , ~k =at} if ak =at, 0 if ak i= at,

={

pgn =an,··., ~k+1 0

if ak =at, if ak i= at.

= ak+ll~k = ak}P{(~k = ak) n B}

126

I. Elementary Probability Theory

Therefore the sum I* in (30) is equal to P{ ~n = an, ... , ~k+ 1 = ak+ d ~k = adP{(~k = ak) n B},

This establishes (29). Let r be a stopping time (with respect to the system D~ = (DDo 5 k 5 n; see Definition 2 in §11).

Definition. We say that a set Bin the algebra~~ belongs to the system of sets ~; if, for each k, 0 ~ k ~ n, B n {r

= k}

(31)

E ~f.

It is easily verified that the collection of such sets B forms an algebra (called the algebra of events observed at timer).

Theorem 2. Let ~ = (~ 0 , ••• , ~n) be a homogeneous Markov chain with transition matrix IIPull, r a stopping time (with respect to 2fi~), BE~~ and A = {w: r + l ~ n}. Then if P{A n B n (~, = a0 )} > 0, we have P{~r+t

and if P{A n

(~,

= =

a1,

••• ,

P{~r+t

= a 1 IA n B n (~, = a 0 )} = a1, ••• , ~r+l = a 1 IA n (~, = a 0 )}, ~r+l

(32)

= a0 )} > 0 then

For the sake of simplicity, we give the proof only for the case l = 1. Since B n (r = k) E ~L we have, according to (29), P{~r+l =

a 1, An B n (~, = a0 )}

L

k5n-l

P{~k+l

= a1, ~k =

a0 , r

= k, B}

k5n-l k5n-l

=

Paoal.

L

k5n-l

P{~k = ao, r = k, B} =

Paoal.

P{A n B n (~,

= ao)},

which simultaneously establishes (32) and (33) (for (33) we have to take B= Q).

Remark. When l

= 1, the strong Markov property (32), (33) is evidently equivalent to the property that

(34)

127

§12. Markov Chains. Ergodic Theorem. Strong Markov Property

for every C s;; X, where

Pao(C) =

L Paoa1·

a1EC

In turn, (34) can be restated as follows: on the set A = {T

~

n- 1}, (35)

which is a form of the strong Markov property that is commonly used in the general theory of homogeneous Markov processes. 7. Let ~ = (~ 0 , .••• , matrix IIPiill, and let

~n)

be a homogeneous Markov chain with transition

!!7>

= P{ ~k = i, ~~ =F i, 1 ~ l ~ k - 11 ~ 0 = i}

(36)

f!J>

= P{~k =j, ~~ #j,

1 ~ l ~ k- 11~ 0 = i}

(37)

and

for i =F j be respectively the probability of first return to state i at time k and the probability of first arrival at state j at time k. Let us show that n

P I}\~)

= i..J "'

j\~>p<.~-k)

k=1

I}

}} ·

'

where

p)~> =

1.

(38)

The intuitive meaning of the formula is clear: to go from state i to state j inn steps, it is necessary to reach statej for the first time ink steps (1 ~ k ~ n) and then to go from state j to state j in n - k steps. We now give a rigorous derivation. Let j be given and ' = min{l ~ k ~ n: ~k = j}, assuming that'! = n

+ 1 if {·}

=

0. Then f!J> = P{T =

kl~o = i} and

P!~j> = P{~n =jl~o = i}

L

P{~n = j, '! = kl~o = i}

L

P{~r+n-k=j,T=kl~o=i},

15k5n

15k5n

(39)

where the last equation follows because ~t+n-k = ~n on the set {'! = k}. Moreover, the set {' = k} = {' = k, ~r = j} for every k, 1 ~ k ~ n. Therefore if P{~ 0 = i,' = k} > 0, it follows from Theorem 2 that P{~r+n-k =jl~o = i, '! = k} = P{~r+n-k =jl~o = i, '! = k, ~t =j}

= P{~t+n-k = j I~t = j} = ptf-k>

128

I. Elementary Probability Theory

and by (37)

P!j> = =

n

L P{~t+n-k = jl~o = i,

7:

k= 1

= k}P{r = kl~o = i}

n

~ p(~-k>:r(~)

L,

k=l

JJ

I] '

which establishes (38). 8.PROBLEMS

= (~ 0 , ... , ~.)be a Markov chain with values in X and f = f(x) (x EX) a function. Will the sequence (f(~ 0 ), ... ,f(~.)) form a Markov chain? Will the "reversed" sequence

1. Let ~

(~., ~.- 1> • • · ' ~o)

form a Markov chain? 2. Let IJll = IIP;)I, 1 :::; i,j:::; r, be a stochastic matrix and A. an eigenvalue of the matrix, i.e. a root of the characteristic equation det IIIJll - A.E II = 0. Show that A.0 = 1 is an eigenvalue and that all the other eigenvalues have moduli not exceeding 1. If all the eigenvalues ..1. 1, ••• , A., are distinct, then p~~l admits the representation p~~l = ni

+ a;;{1)A.~ + · ·· + a;k)A.~,

where ni, a;;(!), ... , a;;{r) can be expressed in terms of the elements of IJll. (It follows from this algebraic approach to the study of Markov chains that, in particular, when I..1. 1 1< 1, ... , IA., I < 1, the limit lim Pl~l exists for every j and is independent of i.) 3.

Let~ = (~ 0 , ••. , ~.)be

tion matrix

IJll

=

a homogeneous Markov chain with state space X and transi-

IIPxyll. Let

T
E[
(=

~
Let the nonnegative function


x EX.

Show that the sequence of random variables

is a martingale. 4. Let ~=(,.,Ill, IJll) and ~=(,.,Ill, IJll) be two Markov chains with different initial distributions Ill = (p 1, ••• , p,) and li = (ph ... , p,). Show that if min;.i Pii ::::: e > 0 then

L lftl"l- p~"ll :::; 2(1 -e)".

i=l

CHAPTER II

Mathematical Foundations of Probability Theory

§I. Probabilistic Model for an Experiment with Infinitely Many Outcomes. Kolmogorov's Axioms 1. The models introduced in the preceding chapter enabled us to give a probabilistic-statistical description of experiments with a finite number of outcomes. For example, the triple (Q, d, P) with

and p(w) = pEa•qn-r.a, is a model for the experiment in which a coin is tossed n times "independently" with probability p of falling head. In this model the number N(Q) of outcomes, i.e. the number of points in n, is the finite number 2n. We now consider the problem of constructing a probabilistic model for the experiment consisting of an infinite number of independent tosses of a coin when at each step the probability offalling head is p. It is natural to take the set of outcomes to be the set

n=

{co: w = (al, a2, .. .), ai = 0, 1},

i.e. the space of sequences w = (a 1, a 2 , .• •) whose elements are 0 or 1. What is the cardinality N(Q) of Q? It is well known that every number a e [0, 1) has a unique binary expansion (containing an infinite number of zeros) (ai = 0, 1).

130

II. Mathematical Foundations of Probability Theory

Hence it is clear that there is a one-to-one correspondence between the points OJ ofQ and the points a of the set [0, 1), and therefore Q has the cardinality of the continuum. Consequently if we wish to construct a probabilistic model to describe experiments like tossing a coin infinitely often, we must consider spaces Q of a rather complicated nature. We shall now try to see what probabilities ought reasonably to be assigned (or assumed) in a model of infinitely many independent tosses of a fair coin {p + q = !). Since we may take Q to be the set [0, 1), our problem can be considered as the problem of choosing points at random from this set. For reasons of symmetry, it is clear that all outcomes ought to be equiprobable. But the set [0, 1) is uncountable, and if we suppose that its probability is 1, then it follows that the probability p(OJ) of each outcome certainly must equal zero. However, this assignment of probabilities {p{OJ) = 0, OJ e [0, 1)) does not lead very far. The fact is that we are ordinarily not interested in the probability of one outcome or another, but in the probability that the result of the experiment is in one or another specified set A of outcomes (an event). In elementary probability theory we use the probabilities p(OJ) to find the probability P(A) of the event A: P(A) = p(OJ). In the present case, with p(OJ) = 0, OJ E [0, 1), we cannot define, for example, the probability that a point chosen at random from [0, 1) belongs to the set [0, !). At the same time, it is intuitively clear that this probability should be!. These remarks should suggest that in constructing probabilistic models for uncountable spaces Q we must assign probabilities, not to individual outcomes but to subsets of Q. The same reasoning as in the first chapter shows that the collection of sets to which probabilities are assigned must be closed with respect to unions, intersections and complements. Here the following definition is useful.

LroeA

Definition 1. Let Q be a set of points OJ. A system d of subsets of Q is called an algebra if (a) Qed, (b) A, BEd=> Au BEd, (c) Aed=>Aed

An Bed,

(Notice that in condition (b) it is sufficient to require only that either A u B E d or that A n B E d, since A u B = A n Band A n B = A u B.) The next definition is needed in formulating the concept of a probabilistic model.

Definition 2. Let d be an algebra of subsets of Q. A set function J.l = J.l(A), A e d, taking values in [0, oo ], is called a finitely additive measure defined

131

§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes

on .91 if Jl(A

+ B) =

Jl(A)

+ Jl(B).

(1)

for every pair of disjoint sets A and B in .91. A finitely additive measure Jl with Jl(Q) < oo is called finite, and when Jl(Q) = 1 it is called a finitely additive probability measure, or a finitely additive probability.

2. We now define a probabilistic model (in the extended sense).

Definition 3. An ordered triple (Q, .91, P), where (a) Q is a set of points w; (b) .91 is an algebra of subsets of Q; (c) Pis a finitely additive probability on A, is a probabilistic model in the extended sense. It turns out, however, that this model is too broad to lead to a fruitful mathematical theory. Consequently we must restrict both the class of subsets of Q that we consider, and the class of admissible probability measures.

Definition 4. A system ff of subsets of Q is a (1-algebra if it is an algebra and satisfies the following additional condition (stronger than (b) of Definition 1): (b*) if An E ff, n = 1, 2, ... , then (it is sufficient to require either that

u An

E

$' or that

n An

E

$').

Definition 5. The space Q together with a (1-algebra ff of its subsets is a measurable space, and is denoted by (Q, $'). Definition 6. A finitely additive measure Jl defined on an algebra .91 of subsets of Q is countably additive (or (1-additive), or simply a measure, if, for all pairwise disjoint subsets A 1, A 2 , ••• of A,

Jl(~l A") =

Jl

Jl(An>·

A finitely additive measure Jl is said to be (1-finite ifQ can be represented in the form n=l

with Jl(Qn) <

00,

n = 1, 2, ....

132

II. Mathematical Foundations of Probability Theory

If a countably additive measure P on the algebra A satisfies P(Q) = 1, it is called a probability measure or a probability (defined on the sets that belong to the algebra d). Probability measures have the following properties.

If 0 is the empty set then

P(0) = 0.

If A, B E d then P(A u B)= P(A)

If A, B E d and B

!:;;;;

+ P(B)-

P(A n B).

A then P(B) ::; P(A).

If A. E d, n = 1, 2, ... , and

U A. E d, then

The first three properties are evident. To establish the last one it is enough to observe that 1 B., where B 1 =At> B.= A1 n · · · n 1 A.= A._ 1 n A., n ~ 2, B; n Bj = 0, i # j, and therefore

L:'=

L:'=

The next theorem, which has many applications, provides conditions under which a finitely additive set function is actually countably additive.

Theorem. Let P be a finitely additive set function defined over the algebra d, with P(Q) = 1. The following four conditions are equivalent: (1) P is a-additive (Pis a probability); (2) P is continuous from below, i.e. for any sets A 1 , A 2 , •• • Ed such that A. !:;;;; An+ 1 and 1 A. Ed,

U:'=

(3) Pis continuous from above, i.e. for any sets A 1 , A 2 ,

and

n:'=

1

A. E d,

•••

such that A. 2 A.+ 1

133

§I. Probabilistic Model for an Experiment with Infinitely Many Outcomes

(4) Pis continuous at 0, i.e. for any sets A1, A 2 , ••• Ed such that An+ 1 and 1 An= 0,

n:=

lim P(An)

=

~

An

0.

n

Since

PROOF. (1) => (2).

00

UAn =

A1 + (Az \A1) + (A3 \Az) + · · ·,

n=l we have

PC91An)

=

P(A 1) + P(A 2 \A 1) + P(A 3\A 2 ) + ...

=

P(A 1) + P(A 2 )

-

P(A 1) + P(A 3)- P(A 2 ) + · · ·

=lim P(An). n

(2) => (3). Let n

~

1; then

The sequence {A 1\An}n> 1 of sets is nondecreasing (see the table in Subsection 3 below) and 00

00

U
n=l

=

A1\ nAn. n=l

Then, by (2)

and therefore n

=

P(Al)- PC91(A1\An))

=

P(A1)- P(A1)

=

+ P(Q1An)

P(A1)- P(Al\0/n) =

P(Q1An)

(3) = rel="nofollow"> (4). Obvious. (4) =>(1). Let Ab A 2 , •.. Ed be pairwise disjoint and let Then

L:-'=

1

An Ed.

n=l

U A.

00

Af:,.B

A+B A\B

0 AnB=0

An B (or AB)

AuB

A=!l\A

ff A Eff

n

(J)

Notation

Table

union of the sets A 1 , A 2 , •••

sum of sets, i.e. union of disjoint sets difference of A and B, i.e. the set of points that belong to A but not to B symmetric difference of sets, i.e. (A \B) u (B\A)

outcome, sample point, elementary event sample space; certain event a-algebra of events event (if wE A, we say that event A occurs) event that A does not occur event that either A or B occurs

element or point set of points a-algebra of subsets set of points complement of A, i.e. the set of points w that are not in A union of A and B, i.e. the set of points w belonging either to A or to B intersection of A and B, i.e. the set of points w belonging to both A and B empty set A and B are disjoint

event that at least one of A 1 , A 2 , ••• occurs

event that A or B occurs, but not both

impossible event events A and B are mutually exclusive, i.e. cannot occur simultaneously event that one of two mutually exclusive events occurs event that A occurs and B does not

event that both A and B occur

Interpretation in probability theory

Set-theoretic interpretation

;. "'

0

...,

'<

(\)

-< -l =-0

~

"'

0

....., ..., 0 g"

~



~

P-

"'=

.,e:..

r;·

~

3

(\)

-a::

.j;>.

v.>

-

A.

li~ 1 A.)

* i.o.

=

infinitely often.

(or lim inf A.)

limA.

lim A. (or lim sup A. or* {A. i.o.})

A

=

=

A li~ 1' A")

1A

(or

A.

(or

A. 1' A

n= 1

00

n A.

n=l

I

00

00

00

n= 1 k= n

00

un

the set

Ak

UAk

n= 1 k=n

00

n

the set

A 1 2 A 2 2 ···and A

= n=1

00

n A"

n=l

U A.

the decreasing sequence of sets A" converges to A, i.e.

A 1 c;:; A 2 c;:; ···and A=

00

the increasing sequence of sets A. converges to A, I.e.

intersection of A~> A 2 , •••

sum, i.e. union of pairwise disjoint sets A 1 , A 2 , .•• A~>

A2, •••

occur

•••

event that all the events A 1 , A 2 , ••• occur with the possible exception of a finite number of them

event that infinitely many of events A 1 , A 2 •..• occur

the decreasing sequence of events converges to event A

the increasing sequence of events converges to event A

event that all the events

event that one of the mutually exclusive events A 1 , A 2 , occurs

,.,

0

Vo

! j.)

-

~

a

0 c: ;:;

'<

Ol

a:: =

.:;

::n

= g.

:§. &

~

§"

(1)

$....

Ol

=

0' ....

g.

0.

a:: 0

r;·



g

0 cr"

::,0

'e:

136

II. Mathematical Foundations of Probability Theory

and since L~n+ 1 A; !

0, n --+

oo, we have

3. We can now formulate Kolmogorov's generally accepted axiom system, which forms the basis for the concept of a probability space.

Fundamental Definition. An ordered triple (Q, fi', P) where

n

(a) is a set of points w, (b) fi' is a u-algebra of subsets of n, (c) Pis a probability on fi',

is called a probabilistic model or a probability space. Here n is the sample space or space of elementary events, the sets A in ff are events, and P(A) is the probability of the event A. It is clear from the definition that the axiomatic formulation of probability theory is based on set theory and measure theory. Accordingly, it is useful to have a table (pp. 134-135) displaying the ways in which various concepts are interpreted in the two theories. In the next two sections we shall give examples of the measurable spaces that are most important for probability theory and of how probabilities are assigned on them.

4.

PROBLEMS

n = {r: r E [0, 1]} be the set of rational points of [0, 1], d the algebra of sets each of which is a finite sum of disjoint sets A of one of the forms {r: a < r < b}, {r: a::;; r < b}, {r: a< r::;; b}, {r: a::;; r::;; b}, and P(A) = b- a. Show that P(A), A Ed, is finitely additive set function but not countably additive.

1. Let

2. Let Q be a countable set and :!F the collection of all its subsets. Put JJ(A) = 0 if A is finite and JJ(A) = oo if A is infinite. Show that the set function JJ is finitely additive but not countably additive. 3. Let JJ be a finite measure on a u-algebra :!F, A. E :!F, n = 1, 2, ... , and A (i.e., A = lim. A. = lim. A.). Show that JJ(A) = lim. JJ(A.). 4. Prove that P(A 6 B)

= P(A) + P(B) -

2P(A n B).

= lim. A.

137

§2. Algebras and a-Algebras. Measurable Spaces

5. Show that the "distances" p 1(A, B) and p2(A, B) defined by p 1(A, B)= P(A

~B),

P(A ~B) { piA, B) = :(A v B)

ifP(A v B)# 0, if P(A v B)= 0

satisfy the triangle inequality. 6. Let f.1. be a finitely additive measure on an algebra d, and let the sets A 1 , A 2 , ••• Ed be pairwise disjoint and satisfy A = 1 f..I.(A;). 1 Ai Ed. Then f.J.(A) ~

Ir;

Ir;

7. Prove that lim sup An= lim inf A., lim inf A. ~ lim sup A.,

lim inf An = lim sup An,

lim sup(A. v B.) = lim sup A. v lim sup B.,

lim sup A. n lim inf B. ~ lim sup( A. n B.) ~ lim sup A. n lim sup B•.

If A. j A or A.

l

A, then lim inf An= lim sup A•.

8. Let {x.} be a sequence of numbers and A.= (- oo, xn). Show that x =lim sup x. and A = lim sup A. are related in the following way: (- oo, x) ~ A ~ (- oo, x]. In other words, A is equal to either (- oo, x) or to (- oo, x]. 9. Give an example to show that if a measure takes the value general that countable additivity implies continuity at 0.

+ oo, it does not follow in

§2. Algebras and a-Algebras. Measurable Spaces 1. Algebras and a-algebras are the components out of which probabilistic models are constructed. We shall present some examples and a number of results for these systems. Let n be a sample space. Evidently each of the collections of sets ~* =

{0, Q},

~*

= {A: A£:; Q}

is both an algebra and a a-algebra. In fact, ~* is trivial, the "poorest" a-algebra, whereas ~* is the "richest" a-algebra, consisting of all subsets

ofn.

When n is a finite space, the a-algebra ~* is fully surveyable, and commonly serves as the system of events in the elementary theory. However, when the space is uncountable the class ~* is much too large, since it is impossible to define "probability" on such a system of sets in any consistent way. If A £:; Q, the system ~A = {A,

A, 0, Q}

138

II. Mathematical Foundations of Probability Theory

is another example of an algebra (and a a-algebra), the algebra (or a-algebra) generated by A. This system of sets is a special case of the systems generated by decompositions. In fact, let ~

= {D 1 , D 2 , ••. }

be a countable decomposition of Q into nonempty sets: Q = D1

+ D2 + ··· ;

D; n Di =

0,

i =F j.

Then the system d = a(~), formed by the sets that are unions of finite numbers of elements of the decomposition, is an algebra. The following lemma is particularly useful since it establishes the important principle that there is a smallest algebra, or a-algebra, containing a given collection of sets.

Lemma 1. Let tff be a collection of subsets of n. Then there are a smallest algebra a(&) and a smallest a-algebra a(&) containing all the sets that are inS. The class !F* of all subsets of Q is a a-algebra. Therefore there are at least one algebra and one a-algebra containing S. We now define a(S) (or a(S)) to consist of all sets that belong to every algebra (or a-algebra) containing&. It is easy to verify that this system is an algebra (or a-algebra) and indeed the smallest. PROOF.

Remark. The algebra r:x(E) (or a(E), respectively) is often referred to as the smallest algebra (or a-algebra) generated by S. We often need to know what additional conditions will make an algebra, or some other system of sets, into a a-algebra. We shall present several results of this kind. Definition 1. A collection .H of subsets of Q is a monotonic class if An E .A, n = 1, 2, ... , together with An j A or An ! A, implies that A E .H. Let tff be a system of sets. Let f.l(S) be the smallest monotonic class containing&. (The proof of the existence of this class is like the proof of Lemma 1.)

Lemma 2. A necessary and sufficient condition for an algebra d to be a a-algebra is that it is a monotonic class. PRooF. A a-algebra is evidently a monotonic class. Now let d be a monotonic class and An Ed, n = 1, 2, .... It is clear that Bn = 1 A; Ed and Bn £ Bn+ 1 . Consequently, by the definition of a monotonic class, Bn j U~ 1 A; Ed. Similarly we could show that n~ 1 A; Ed.

Ui'=

By using this lemma, we can prove that, starting with an algebra d, we can construct the a-algebra a(d) by means of monotonic limiting processes.

139

§2. Algebras and a-Algebras. Measurable Spaces

Theorem 1. Let .91 be an algebra. Then f.l( .91) = a(.91).

(1)

PROOF. By Lemma 2, μ(𝒜) ⊆ σ(𝒜). Hence it is enough to show that μ(𝒜) is a σ-algebra. But 𝓜 = μ(𝒜) is a monotonic class, and therefore, by Lemma 2 again, it is enough to show that μ(𝒜) is an algebra.

Let A ∈ 𝓜; we show that Ā ∈ 𝓜. For this purpose, we shall apply a principle that will often be used in the future, the principle of appropriate sets, which we now illustrate. Let

𝓜̃ = {B : B ∈ 𝓜, B̄ ∈ 𝓜}

be the sets that have the property that concerns us. It is evident that 𝒜 ⊆ 𝓜̃ ⊆ 𝓜. Let us show that 𝓜̃ is a monotonic class. Let B_n ∈ 𝓜̃; then B_n ∈ 𝓜, B̄_n ∈ 𝓜, and therefore

lim↑ B_n ∈ 𝓜,  lim↓ B_n ∈ 𝓜,  lim↑ B̄_n = (lim↓ B_n)̄ ∈ 𝓜,  lim↓ B̄_n = (lim↑ B_n)̄ ∈ 𝓜,

and therefore 𝓜̃ is a monotonic class. But 𝓜̃ ⊆ 𝓜 and 𝓜 is the smallest monotonic class. Therefore 𝓜̃ = 𝓜, and if A ∈ 𝓜 = μ(𝒜), then we also have Ā ∈ 𝓜, i.e. 𝓜 is closed under the operation of taking complements.

Let us now show that 𝓜 is closed under intersections. Let A ∈ 𝓜 and

𝓜_A = {B : B ∈ 𝓜, A ∩ B ∈ 𝓜}.

From the equations

lim↓ (A ∩ B_n) = A ∩ lim↓ B_n,  lim↑ (A ∩ B_n) = A ∩ lim↑ B_n

it follows that 𝓜_A is a monotonic class. Moreover, it is easily verified that

(A ∈ 𝓜_B) ⇔ (B ∈ 𝓜_A).   (2)

Now let A ∈ 𝒜; then since 𝒜 is an algebra, for every B ∈ 𝒜 the set A ∩ B ∈ 𝒜, and therefore

𝒜 ⊆ 𝓜_A ⊆ 𝓜.

But 𝓜_A is a monotonic class, and 𝓜 is the smallest monotonic class. Therefore 𝓜_A = 𝓜 for all A ∈ 𝒜. But then it follows from (2) that

(A ∈ 𝓜_B) ⇔ (B ∈ 𝓜_A = 𝓜)


II. Mathematical Foundations of Probability Theory

whenever A ∈ 𝒜 and B ∈ 𝓜. Consequently, if A ∈ 𝒜 then A ∈ 𝓜_B for every B ∈ 𝓜. Since A is any set in 𝒜, it follows that 𝒜 ⊆ 𝓜_B. But 𝓜_B is a monotonic class and 𝓜 is the smallest monotonic class, so for every B ∈ 𝓜

𝓜_B = 𝓜,

i.e. if B ∈ 𝓜 and C ∈ 𝓜 then C ∩ B ∈ 𝓜. Thus 𝓜 is closed under complementation and intersection (and therefore under unions). Consequently 𝓜 is an algebra, and the theorem is established.

Definition 2. Let Ω be a space. A class 𝒟 of subsets of Ω is a d-system if

(a) Ω ∈ 𝒟;
(b) A, B ∈ 𝒟 and A ⊆ B imply B\A ∈ 𝒟;
(c) A_n ∈ 𝒟 and A_n ⊆ A_{n+1} imply ∪A_n ∈ 𝒟.

If ℰ is a collection of sets, then d(ℰ) denotes the smallest d-system containing ℰ.

Theorem 2. If the collection ℰ of sets is closed under intersections, then

d(ℰ) = σ(ℰ).   (3)

PROOF. Every σ-algebra is a d-system, and consequently d(ℰ) ⊆ σ(ℰ). Hence if we prove that d(ℰ) is closed under intersections, then d(ℰ) is a σ-algebra and the opposite inclusion σ(ℰ) ⊆ d(ℰ) follows.

The proof once again uses the principle of appropriate sets. Let

ℰ_1 = {B ∈ d(ℰ) : B ∩ A ∈ d(ℰ) for all A ∈ ℰ}.

If B ∈ ℰ then B ∩ A ∈ ℰ for all A ∈ ℰ, and therefore ℰ ⊆ ℰ_1. But ℰ_1 is a d-system. Hence d(ℰ) ⊆ ℰ_1. On the other hand, ℰ_1 ⊆ d(ℰ) by definition. Consequently ℰ_1 = d(ℰ).

Now let

ℰ_2 = {B ∈ d(ℰ) : B ∩ A ∈ d(ℰ) for all A ∈ d(ℰ)}.

Again it is easily verified that ℰ_2 is a d-system. If B ∈ ℰ, then by the definition of ℰ_1 we obtain that B ∩ A ∈ d(ℰ) for all A ∈ ℰ_1 = d(ℰ). Consequently ℰ ⊆ ℰ_2 and d(ℰ) ⊆ ℰ_2. But ℰ_2 ⊆ d(ℰ); hence d(ℰ) = ℰ_2, and therefore


whenever A and B are in d(ℰ), the set A ∩ B also belongs to d(ℰ), i.e. d(ℰ) is closed under intersections. This completes the proof of the theorem.

We next consider some measurable spaces (Ω, 𝓕) which are extremely important for probability theory.
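Theorem 2 can be probed by brute force on a finite Ω. On a finite space the increasing-union condition (c) adds nothing new (an increasing chain of finitely many sets ends in its largest member), so the d-system closure amounts to adding Ω and closing under proper differences. The space, generators, and helper name below are illustrative assumptions, not from the text.

```python
def d_system(omega, gens):
    """Smallest d-system on a *finite* omega containing the generators:
    it contains omega and is closed under proper differences
    (A subset of B implies B \\ A belongs to the system)."""
    omega = frozenset(omega)
    d = {omega} | {frozenset(g) for g in gens}
    changed = True
    while changed:
        changed = False
        for a in list(d):
            for b in list(d):
                if a <= b and (b - a) not in d:
                    d.add(b - a)
                    changed = True
    return d

# The generators {1,2} and {1} are closed under intersection
# ({1,2} ∩ {1} = {1}), so by Theorem 2 the d-system they generate
# should already be the sigma-algebra with atoms {1}, {2}, {3,4}.
d = d_system({1, 2, 3, 4}, [{1, 2}, {1}])
```

Running the closure indeed produces all 8 sets of the σ-algebra, in line with Theorem 2; with intersection-free generators the d-system can be strictly smaller.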

2. The measurable space (R, ℬ(R)). Let R = (−∞, ∞) be the real line and let

(a, b] = {x ∈ R : a < x ≤ b}

for all a and b, −∞ ≤ a < b < ∞. The interval (a, ∞] is taken to be (a, ∞). (This convention is required if the complement of an interval (−∞, b] is to be an interval of the same form, i.e. open on the left and closed on the right.)

Let 𝒜 be the system of subsets of R which are finite sums of disjoint intervals of the form (a, b]:

A ∈ 𝒜 if A = Σ_{i=1}^{n} (a_i, b_i],  n < ∞.

It is easily verified that this system of sets, in which we also include the empty set ∅, is an algebra. However, it is not a σ-algebra, since if A_n = (0, 1 − 1/n] ∈ 𝒜, we have ∪A_n = (0, 1) ∉ 𝒜.

Let ℬ(R) be the smallest σ-algebra σ(𝒜) containing 𝒜. This σ-algebra, which plays an important role in analysis, is called the Borel algebra of subsets of the real line, and its sets are called Borel sets.

If 𝓘 is the system of intervals I of the form (a, b], and σ(𝓘) is the smallest σ-algebra containing 𝓘, it is easily verified that σ(𝓘) is again the Borel algebra. In other words, we can obtain the Borel algebra from 𝓘 without going through the algebra 𝒜, since σ(𝓘) = σ(α(𝓘)). We observe that

(a, b) = ∪_{n=1}^{∞} (a, b − 1/n],  a < b,

[a, b] = ∩_{n=1}^{∞} (a − 1/n, b],  a < b,

{a} = ∩_{n=1}^{∞} (a − 1/n, a].

Thus the Borel algebra contains not only the intervals (a, b] but also the singletons {a} and all sets of the six forms

(a, b),  [a, b],  [a, b),  (−∞, b),  (−∞, b],  (a, ∞).   (4)
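The countable-operation identities above can be probed numerically with membership predicates. A computer can only search the union (or check the intersection) up to a finite n, so the cutoff n_max below is an illustrative truncation, not part of the mathematics; the function names are likewise ad hoc.

```python
def in_half_open(a, b, x):
    """Membership in a generating interval (a, b]."""
    return a < x <= b

def in_open(a, b, x, n_max=10**4):
    """(a, b) = union over n of (a, b - 1/n]; search the union up to n_max."""
    return any(in_half_open(a, b - 1.0 / n, x) for n in range(1, n_max + 1))

def in_closed(a, b, x, n_max=10**4):
    """[a, b] = intersection over n of (a - 1/n, b]; check up to n_max."""
    return all(in_half_open(a - 1.0 / n, b, x) for n in range(1, n_max + 1))

def in_singleton(a, x, n_max=10**4):
    """{a} = intersection over n of (a - 1/n, a]."""
    return in_closed(a, a, x, n_max)
```

For points safely away from the truncation error the predicates agree with the set identities: 0.5 lies in (0, 1) but the right endpoint 1.0 does not, while both endpoints lie in [0, 1].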


Let us also notice that the construction of ℬ(R) could have been based on any of the six kinds of intervals instead of on (a, b], since all the minimal σ-algebras generated by systems of intervals of any of the forms (4) are the same as ℬ(R).

Sometimes it is useful to deal with the σ-algebra ℬ(R̄) of subsets of the extended real line R̄ = [−∞, ∞]. This is the smallest σ-algebra generated by intervals of the form

(a, b] = {x ∈ R̄ : a < x ≤ b},  −∞ ≤ a < b ≤ ∞,

where (−∞, b] is to stand for the set {x ∈ R̄ : −∞ ≤ x ≤ b}.

Remark 1. The measurable space (R, ℬ(R)) is often denoted by (R, ℬ) or (R^1, ℬ^1).

Remark 2. Let us introduce the metric

ρ_1(x, y) = |x − y| / (1 + |x − y|)

on the real line R (this is equivalent to the usual metric |x − y|), and let ℬ_0(R) be the smallest σ-algebra generated by the open sets S_ρ(x^0) = {x ∈ R : ρ_1(x, x^0) < ρ}, ρ > 0, x^0 ∈ R. Then ℬ_0(R) = ℬ(R) (see Problem 7).

3. The measurable space (R^n, ℬ(R^n)). Let R^n = R × ⋯ × R be the direct, or Cartesian, product of n copies of the real line, i.e. the set of ordered n-tuples x = (x_1, ..., x_n), where −∞ < x_k < ∞, k = 1, ..., n. The set

I = I_1 × ⋯ × I_n,

where I_k = (a_k, b_k], i.e. the set {x ∈ R^n : x_k ∈ I_k, k = 1, ..., n}, is called a rectangle, and I_k is a side of the rectangle. Let 𝓘 be the set of all rectangles I. The smallest σ-algebra σ(𝓘) generated by the system 𝓘 is the Borel algebra of subsets of R^n and is denoted by ℬ(R^n).

Let us show that we can arrive at this Borel algebra by starting in a different way. Instead of the rectangles I = I_1 × ⋯ × I_n, let us consider the rectangles B = B_1 × ⋯ × B_n with Borel sides (B_k is the Borel subset of the real line that appears in the kth place in the direct product R × ⋯ × R). The smallest σ-algebra containing all rectangles with Borel sides is denoted by

ℬ(R) ⊗ ⋯ ⊗ ℬ(R)

and called the direct product of the σ-algebras ℬ(R). Let us show that in fact

ℬ(R^n) = ℬ(R) ⊗ ⋯ ⊗ ℬ(R).


In other words, the smallest σ-algebra generated by the rectangles I = I_1 × ⋯ × I_n and the one generated by the (broader) class of rectangles B = B_1 × ⋯ × B_n with Borel sides are actually the same. The proof depends on the following proposition.

Lemma 3. Let ℰ be a class of subsets of Ω, let B ⊆ Ω, and define

ℰ ∩ B = {A ∩ B : A ∈ ℰ}.   (5)

Then

σ(ℰ ∩ B) = σ(ℰ) ∩ B.   (6)

PROOF. Since ℰ ⊆ σ(ℰ), we have

ℰ ∩ B ⊆ σ(ℰ) ∩ B.   (7)

But σ(ℰ) ∩ B is a σ-algebra; hence it follows from (7) that

σ(ℰ ∩ B) ⊆ σ(ℰ) ∩ B.

To prove the conclusion in the opposite direction, we again use the principle of appropriate sets. Define

ℰ_B = {A ∈ σ(ℰ) : A ∩ B ∈ σ(ℰ ∩ B)}.

Since σ(ℰ) and σ(ℰ ∩ B) are σ-algebras, ℰ_B is also a σ-algebra, and evidently

ℰ ⊆ ℰ_B ⊆ σ(ℰ),

whence σ(ℰ) ⊆ σ(ℰ_B) = ℰ_B ⊆ σ(ℰ), and therefore σ(ℰ) = ℰ_B. Therefore

A ∩ B ∈ σ(ℰ ∩ B)

for every A ∈ σ(ℰ), and consequently σ(ℰ) ∩ B ⊆ σ(ℰ ∩ B). This completes the proof of the lemma.

Proof that ℬ(R^n) and ℬ(R) ⊗ ⋯ ⊗ ℬ(R) are the same. This is obvious for n = 1. We now show that it is true for n = 2. Since ℬ(R^2) ⊆ ℬ ⊗ ℬ, it is enough to show that the Borel rectangle B_1 × B_2 belongs to ℬ(R^2).

Let R^2 = R_1 × R_2, where R_1 and R_2 are the "first" and "second" real lines, and let ℬ̃_1 = ℬ_1 × R_2 and ℬ̃_2 = R_1 × ℬ_2, where ℬ_1 × R_2 (or R_1 × ℬ_2) is the collection of sets of the form B_1 × R_2 (or R_1 × B_2), with B_1 ∈ ℬ_1 (or B_2 ∈ ℬ_2). Also let 𝓘_1 and 𝓘_2 be the sets of intervals in R_1 and R_2, and 𝓙_1 = 𝓘_1 × R_2, 𝓙_2 = R_1 × 𝓘_2. Then, writing B̃_1 = B_1 × R_2 and B̃_2 = R_1 × B_2, we have by (6)

B_1 × B_2 = B̃_1 ∩ B̃_2 ∈ ℬ̃_1 ∩ B̃_2 = σ(𝓙_1) ∩ B̃_2 = σ(𝓙_1 ∩ B̃_2) ⊆ σ(𝓙_1 ∩ 𝓙_2) = σ(𝓘_1 × 𝓘_2) ⊆ ℬ(R^2),

as was to be proved.


The case of any n, n > 2, can be discussed in the same way.

Remark. Let ℬ_0(R^n) be the smallest σ-algebra generated by the open sets

S_ρ(x^0) = {x ∈ R^n : ρ_n(x, x^0) < ρ},  ρ > 0,  x^0 ∈ R^n,

in the metric

ρ_n(x, x^0) = Σ_{k=1}^{n} 2^{−k} ρ_1(x_k, x_k^0),

where x = (x_1, ..., x_n), x^0 = (x_1^0, ..., x_n^0). Then ℬ_0(R^n) = ℬ(R^n) (Problem 7).

4. The measurable space (R^∞, ℬ(R^∞)) plays a significant role in probability theory, since it is used as the basis for constructing probabilistic models of experiments with infinitely many steps. The space R^∞ is the space of ordered sequences of numbers

x = (x_1, x_2, ...),  −∞ < x_k < ∞,  k = 1, 2, ....

Let I_k and B_k denote, respectively, the intervals (a_k, b_k] and the Borel subsets of the kth line (with coordinate x_k). We consider the cylinder sets

𝓙(I_1 × ⋯ × I_n) = {x : x = (x_1, x_2, ...), x_1 ∈ I_1, ..., x_n ∈ I_n},   (8)

𝓙(B_1 × ⋯ × B_n) = {x : x = (x_1, x_2, ...), x_1 ∈ B_1, ..., x_n ∈ B_n},   (9)

𝓙(B^n) = {x : (x_1, ..., x_n) ∈ B^n},   (10)

where B^n is a Borel set in ℬ(R^n).

Each cylinder 𝓙(B_1 × ⋯ × B_n), or 𝓙(B^n), can also be thought of as a cylinder with base in R^{n+1}, R^{n+2}, ..., since

𝓙(B_1 × ⋯ × B_n) = 𝓙(B_1 × ⋯ × B_n × R),
𝓙(B^n) = 𝓙(B^{n+1}),

where B^{n+1} = B^n × R.

It follows that both systems of cylinders 𝓙(B_1 × ⋯ × B_n) and 𝓙(B^n) are algebras. It is easy to verify that the unions of disjoint cylinders 𝓙(I_1 × ⋯ × I_n) also form an algebra.

Let ℬ(R^∞), ℬ_1(R^∞) and ℬ_2(R^∞) be the smallest σ-algebras containing all the sets (8), (9) or (10), respectively. (The σ-algebra ℬ_1(R^∞) is often denoted by ℬ(R) ⊗ ℬ(R) ⊗ ⋯.) It is clear that

ℬ(R^∞) ⊆ ℬ_1(R^∞) ⊆ ℬ_2(R^∞).

As a matter of fact, all three σ-algebras are the same. To prove this, we put

𝒞_n = {A ∈ ℬ(R^n) : {x : (x_1, ..., x_n) ∈ A} ∈ ℬ(R^∞)},  n = 1, 2, ....

The system 𝒞_n contains all the rectangles I_1 × ⋯ × I_n, since the corresponding cylinders are of the form (8); and 𝒞_n is a σ-algebra. Therefore ℬ(R^n) ⊆ 𝒞_n, i.e. {x : (x_1, ..., x_n) ∈ B^n} ∈ ℬ(R^∞) for every B^n ∈ ℬ(R^n), and consequently ℬ_2(R^∞) ⊆ ℬ(R^∞).

Thus ℬ(R^∞) = ℬ_1(R^∞) = ℬ_2(R^∞). From now on we shall describe sets in ℬ(R^∞) as Borel sets (in R^∞).

Remark. Let ℬ_0(R^∞) be the smallest σ-algebra generated by the open sets

S_ρ(x^0) = {x ∈ R^∞ : ρ_∞(x, x^0) < ρ},  ρ > 0,  x^0 ∈ R^∞,

in the metric

ρ_∞(x, x^0) = Σ_{k=1}^{∞} 2^{−k} ρ_1(x_k, x_k^0),

where x = (x_1, x_2, ...), x^0 = (x_1^0, x_2^0, ...). Then ℬ(R^∞) = ℬ_0(R^∞) (Problem 7).

Here are some examples of Borel sets in R^∞:

(a) {x ∈ R^∞ : sup x_n > a},  {x ∈ R^∞ : inf x_n < a};

(b) {x ∈ R^∞ : lim sup x_n ≤ a},  {x ∈ R^∞ : lim inf x_n > a}, where, as usual,

lim sup x_n = inf_n sup_{m≥n} x_m,  lim inf x_n = sup_n inf_{m≥n} x_m;

(c) {x ∈ R^∞ : x_n →}, the set of x ∈ R^∞ for which lim x_n exists and is finite;

(d) {x ∈ R^∞ : lim x_n > a};

(e) {x ∈ R^∞ : Σ_{n=1}^{∞} |x_n| > a};

(f) {x ∈ R^∞ : Σ_{k=1}^{n} x_k = 0 for at least one n ≥ 1}.

To be convinced, for example, that the sets in (a) belong to the system ℬ(R^∞), it is enough to observe that

{x : sup x_n > a} = ∪_n {x : x_n > a} ∈ ℬ(R^∞),

{x : inf x_n < a} = ∪_n {x : x_n < a} ∈ ℬ(R^∞).
5. The measurable space (Rr, PJ(RT)), where Tis an arbitrary set. The space RT is the collection of real functions x = (x 1) defined for t E Tt. IIi general we shall be interested in the case when Tis an uncountable subset of the real t We shall also use the notations x = (x,),.Rr and x = (x,), t ERr, for elements of Rr.

146

II. Mathematical Foundations of Probability Theory

line. For simplicity and definiteness we shall suppose for the present that T = [0, oo). We shall consider three types of cylinder sets X ••. X

In)= {x: x,,

E

J 1,, ... ,tn(B 1

X··· X

Bn) =

EB 1, ... ,X1nEBn},

{x:X11

Ib

x,n E I 1},

J,,, ... ,tn(J 1

••. '

J,,, ... ,r"(Bn) = {x: (x,,, ... , x,J E Bn},

(11) (12) (13)

where Ik is a set of the form (ak, bk], Bk is a Borel set on the line, and Bn is a Borel set in Rn. The set J,,, ... ,r" (I 1 x · · · x In) is just the set of functions that, at times t 1 , ••• ,tn, "get through the windows" I~ rel="nofollow">···•In and at other times have arbitrary values (Figure 24). Let BI(RT), 91 1(RT) and 91 2 (RT) be the smallest a-algebras corresponding respectively to the cylinder sets (11), (12) and (13). It is clear that (14)

As a matter of fact, all three of these a-algebras are the same. Moreover, we can give a complete description of the structure of their sets.

Theorem 3. LetT be any uncountable set. Then BI(RT) = 91 1(RT) = 91 2 (RT), and every set A E P.l(RT) has the following structure: there are a countable set of points tt> t 2 , ••• ofT and a Borel set Bin fJI(R such that 00 )

(15) PRooF. Let t! denote the collection of sets of the form (15) (for various aggregates (t 1 , t 2 , ..•) and Borel sets Bin BI(R 00 )). If A1 , A 2 , ... Et! and the corresponding aggregates are r< 1> = (t~ll, t~1 >, •• •), r< 2 >= (t~2 >, t~2 >, •• •), ••. ,

Of-----+-,,-+-,, -------Figure 24

r•

147

§2. Algebras and a-Algebras. Measurable Spaces

then the set r = representation

Uk r can be taken as a basis, so that every A
Xr 2 ,

•••)

E

B;},

where B; is a set in one and the same a-algebra :?4(R 00 ), and r; E r. Hence it follows that the system C is a a-algebra. Clearly this a-algebra contains all cylinder sets of the form (1) and, since PA 2 (RT) is the smallest a-algebra containing these sets, and since we have (14), we obtain (16) Let us consider a set A from C, represented in the form (15). For a given aggregate (t 1 , t 2 , •• •), the same reasoning as for the space (R 00 , :?I(R 00 )) shows that A is an element of the a-algebra generated by the cylinder sets ( 11 ). But this a-algebra evidently belongs to the a-algebra :?I(RT); together with (16), this established both conclusions of the theorem. Thus every Borel set A in the a-algebra :?I(RT) is determined by restrictions imposed on the functions x = (x1), t E T, on an at most countable set of points t 1, t 2 , •••• Hence it follows, in particular, that the sets A 1 = {x: sup x 1 < C for all t

E

[0, 1]},

A 2 = {x: x 1 = 0 for at least one t

A 3 = {x:

X1

E

[0, 1]},

is continuous at a given point t0

E

[0, 1]},

which depend on the behavior of the function on an uncountable set of points, cannot be Borel sets. And indeed none of' these three sets belongs to 84(RI 0 • 1l). Let us establish this for A 1 . If A 1 E PA(R 10 · 1l), then by our theorem there are a point (t?, t~, . .. ) and a set B 0 E :?I(R 00 ) such that

=

It is clear that the function y1 C - 1 belongs to (Yr?• ... ) E B 0 . Now form the function 21

{c- 1,

A~>

and consequently

t E (t?, t~, .. .),

= C + 1, t ¢ (t?, t~, .. .).

It is clear that

and consequently the function z = (z1) belongs to the set {x: (x 1?, ... } E B0 }. But at the same time itisclearthatitdoesnot belong to the set {x: sup X 1 < C}. This contradiction shows that A 1 ¢ 84(R 10 • 1l).

148

II. Mathematical Foundations of Probability Theory

Since the sets A 1 , A 2 and A 3 are nonmeasurable with respect to the 0'-algebra dl[R[o, 11) in the space of all functions x = (x,), t e [0, 1], it is natural to consider a smaller class of functions for which these sets are measurable. It is intuitively clear that this will be the case if we take the intial space to be, for example, the space of continuous functions.

6. The measurable space (C, dl(C)). LetT = [0, 1] and let C be the space of continuous functions x = (x,), 0 s:; t s:; 1. This is a metric space with the metric p(x, y) = SUPteT lx,- y,l. We introduce two 0'-algebras in C: dl(C) is the 0'-algebra generated by the cylinder sets, and d6' 0 (C) is generated by the open sets (open with respect to the metric p(x, y)). Let us show that in fact these 0'-algebras are the same: dl(C) = d6' 0 (C). Let B = {x: x 10 < b} be a cylinder set. It is easy to see that this set is open. Hence it follows that {x: x 11 < b1 , ••• , x 1" < bn} E d6' 0(C), and therefore dl(C) s;; d6'0 (C). Conversely, consider a set B P = {y: y e S p(x 0 )} where x 0 is an element of C and Sp(x 0 ) = {x e C: SUPreTix1 - x?i < p} is an open ball with center at x 0 • Since the functions in C are continuous,

BP

=

{y E C:

ye SP(x0)} = {y E C: m~x IYr- x?i < P} =

n{y

E

C: IYtk- x~l < p}

E

dl(C),

(17)

lk

where tk are the rational points of [0, 1]. Therefore d6' 0 (C) s;; dl(C). The following example is fundamental.

7. The measurable space (D, dl(D)), where Dis the space offunctions x = (x1), t e [0, 1], that are continuous on the right (x 1 = x,+ for all t < 1) and have limits from the left (at every t > 0). Just as for C, we can introduce a metric d(x, y) on D such that the 0'-algebra .16'0 (D) generated by the open sets will coincide with the 0'-algebra dl(D) generated by the cylinder sets. This metric d(x, y), which was introduced by Skorohod, is defined as follows: d(x, y) = inf{e

> 0:3 A. e A: sup lx,- YA +sup It- A.(t)i s:; e}, t

(18)

t

where A is the set of strictly increasing functions A. = A.(t) that are continuous on [0, 1] and have A.(O) = 0, A.(1) = 1.

8. The measurable space CflreT Or, I®lreT ~). Along with the space (RT, dl(RT)), which is the direct product ofT copies of the real line together with the system of Borel sets, probability theory also uses the measurable space COteT n" RteT :F,), which is defined in the following way.

149

§3. Methods of Introducing Probability Measures on Measurable Spaces

Let T be any set of indices and (Q0 ~1) a measurable space, t E T. Let il10 the set of functions w = (w 1), t E T, such that w 1 E il1 for each

OreT

Q = tE T.

The collection of cylinder sets

.F1,, ... , 1.(B 1

X ••• X

B.)= {w: W 11

E

B 1, ••• , w 1•

E

B.},

where B1; E ~r;• is easily shown to be an algebra. The smallest a-algebra T ~ 1 , and the measurable containing all these cylinder sets is denoted by space Qi, ~1) is called the direct product of the measurable spaces (Q0 ~ 1 ), t E T.

<0

9.

Plre

PI

PROBLEMS

1. Let ffl 1 and f!l 2 be a-algebras of subsets of n. Are the following systems of sets aalgebras? ffl 1 n ffl 2

= {A: A e ffl 1 and A e ffl 2 },

ffl 1 u ffl 2 ={A: A ef!l 1 or A ef!l 2 }.

2. Let C!fi = { D 1 , D 2 , ••• } be a countable decomposition of n and f!l =

a(~).

Are there

also only countably many sets in f!l?

3. Show that f!l(R") ® f!l(R) = f!l(R"+ 1).

4. Prove that the sets (b)-(f) (see Subsection 4) belong to f!l(R 00 ).

5. Prove that the sets A 2 and A 3 (see Subsection 5) do not belong to f!l(RIO, 11). 6. Prove that the function (15) actually defines a metric. 7. Prove that 96 0 (R")

=

96(R"), n ~ 1, and fli 0 (R"') = f.!I(R"').

8. Let C = C[O, oo) be the space of continuous functions x = (x,) defined for t Show that with the metric p(x, y) =

I z-• min[ sup lx,- y,l, 1],

n= 1

~

0.

x,yeC,

Ostsn

this is a complete separable metric space and that the a-algebra f!l 0 (C) generated by the open sets coincides with the a-algebra f!l(C) generated by the cylinder sets.

§3. Methods of Introducing Probability Measures on Measurable Spaces 1. The measurable space (R, r!4(R)). Let P = P(A) be a probability measure defined on the Borel subsets A of the real line. Take A = (- oo, x] and put F(x)

= P(- oo, x],

X

ER.

(1)

150

II. Mathematical Foundations of Probability Theory

This function has the following properties: (1) F(x) is nondecreasing; (2) F(- oo) = 0, F( + oo) = 1, where F(- oo) = lim F(x),

F( + oo) = lim F(x); xf oo

X~- 00

(3) F(x) is continuous on the right and has a limit on the left at each x

E

R.

The first property is evident, and the other two follow from the continuity properties of probability measures.

Definition 1. Every function F = F(x) satisfying conditions (1)-(3) is called a distribution function (on the real line R). Thus to every probability measure P on (R, &b(R)) there corresponds (by (1)) a distribution function. It turns out that the converse is also true.

Theorem 1. Let F = F(x) be a distribution function on the real lineR. There exists a unique probability measure P on (R, &b(R)) such that P(a, b] = F(b)- F(a)

(2)

for all a, b, - oo :::;; a < b < oo. PRooF. Let d be the algebra of the subsets A of R that are finite sums of disjoint intervals of the form (a, b]:

A =

n

L (ak, bk].

k= 1

On these sets we define a set function P0 by putting n

P 0 (A) =

L [F(bk)- F(ak)],

A Ed.

(3)

k=l

This formula defines, evidently uniquely, a finitely additive set function on d. Therefore if we show that this function is also countably additive on this algebra, the existence and uniqueness of the required measure P on &b(R) will follow immediately from a general result of measure theory (which we quote without proof).

CarathOOdory's Theorem. Let n be a space, d an algebra of its subsets, and FA = a{d) the smallest a-algebra containing d. Let J.l.o be a a-finite measure on (0, A). Then there is a unique measure J.l. on (Q, a(d)) which is an extension of J.l.o, i.e. satisfies J.l.(A) = J.l.o(A),

A Ed.

§3. Methods of Introducing Probability Measures on Measurable Spaces

151

We are now to show that P 0 is countably additive on .91. By a theorem from §1 it is enough to show that P 0 is continuous at 0. i.e. to verify that

LetA 1 , A 2 , .•. be a sequence of sets from .91 with the property An! 0. Let us suppose first that the sets An belong to a closed interval [- N, N], N < oo. Since A is the sum of finitely many intervals of the form (a, b] and since

P0 (a', b)= F(b)- F(a') -t F(b)- F(a)

=

P0 (a, b]

as a' ! a, because F(x) is continuous on the right, we can find, for every An, a set Bn E .91 such that its closure [BnJ £ An and Po(An)- P 0 (Bn)

~

e · 2-n,

where e is a preassigned positive number. [BnJ = 0. But the sets [Bn] By hypothesis, nAn= 0 and therefore are closed, and therefore there is a finite n0 = n0 (e) such that

n

(4) n= 1

(In fact, [ -N, N] is compact, and the collection of sets {[ -N, N]\[BnJ}n~l is an open covering of this compact set. By the Heine-Bore! theorem there is a finite subcovering: no

U([- N, N]\[Bn]) =

[ - N,

N]

n=1

and therefore n~~ 1 [BnJ = 0). Using (4) and the inclusions Ano £ Ano- 1 £ · · · £ A1o we obtain Po(Ano)

=

Po(Ano\01 Bk)

= Po(Ano\01

~

Bk)

+ Po(01 Bk)

~ Po(9 (Ak\Bk)) 1

no

no

k=1

k=1

L Po(Ak\Bk) ~ L e · 2-k ~e.

Therefore P 0 (An)! 0, n -t oo. We now abandon the assumption that An £ [ - N, N] for some N. Take an e > 0 and choose N so that P 0 [ -N, N] > 1 - e/2. Then, since

An = An n [- N, N]

+ An n

[- N, N],

~e have

P 0 (A")

= ~

P 0 (An[ -N, N] + P 0 (An n [ -N, N]) P 0 (An n [ -N, N]) + e/2

152

II. Mathematical Foundations of Probability Theory

and, applying the preceding reasoning (replacing An by An n [- N, N]), we find that P0 (An n [- N, N]) s:; e/2 for sufficiently large n. Hence once again P 0 (An)! 0, n--+ oo. This completes the proof ofthe theorem. Thus there is a one-to-one correspondence between probability measures P on (R, fJI(R)) and distribution functions F on the real line R. The measure P constructed from the function F is usually called the Lebesgue-Stieltjes

probability measure corresponding to the distribution function F. The case when F(x) =

0, X< 0, { x, 0 s:; x s:; 1, 1, X> 1.

is particularly important. In this case the corresponding probability measure (denoted by A.) is Lebesgue measure on [0, 1]. Clearly A.(a, b] = b- a. In other words, the Lebesgue measure of (a, b] (as well as of any ofthe intervals (a, b), [a, b] or [a, b)) is simply its length b -a. Let

fJI([O, 1]) = {A n [0, 1]: A E fJI(R)} be the collection of Borel subsets of [0, 1]. It is often necessary to consider, besides these sets, the Lebesgue measurable subsets of [0, 1]. We say that a set A £ [0, 1] belongs to ~([0, 1)] if there are Borel sets A and B such that A £ A £ Band A.(B\A) = O.lt is easily verified that .si([O, 1]) is au-algebra. It is known as the system of Lebesgue measurable subsets of [0, 1]. Clearly fJI([O, 1]) £ ,sj([O, 1]). The measure A., defined so far only for sets in .?1([0, 1]), extends in a natural way to the system ,sj([O, 1]) of Lebesgue measurable sets. Specifically, if Ae,sj([O, 1]) and A£ A£ B, where A and Be~([O, 1]) and A.(B\A) = 0, we define A:(A) = A.(A). The set function A: = A:(A), A e ~([0, 1]), is easily seen to be a probability measure on ([0, 1], ~([0, 1])). It is usually called Lebesgue measure (on the system of Lebesgue-measurable sets).

Remark. This process of completing (or extending) a measure can be applied,

and is useful, in other situations. For example, let (Q, fF, P) be a probability space. Let §P be the collection of all the subsets A of Q for which there are sets B 1 and B 2 of fF such that B 1 £ A £ B 2 and P(B 2 \B 1) = 0. The probability measure can be defined for sets A e #'Pin a natural way (by P(A) = P(B 1)). The resulting probability space is the completion of (Q, fF, P) with respect to P. A probability measure such that §P = fF is called complete, and the corresponding space (Q, fF, P) is a complete probability space. The correspondence between probability measures P and distribution functions F established by the equation P(a, b] = F(b)- F(a) makes it

153

§3. Methods of Introducing Probability Measures on Measurable Spaces F(x)

I I ~F(x3) I 1

~F(xz)

I

I~F(x,)

x,

x2

X

XJ

Figure 25

possible to construct various probability measures by obtaining the corresponding distribution functions.

Discrete measures are measures P for which the corresponding distributions F = F(x) are piecewise constant (Figure 25), changing their values at the points x 1 , x 2 , ••• (L'1F(xJ > 0, where L'1F(x) = F(x)- F(x- ). In this case the measure is concentrated at the points x 1 , x 2 , .•. :

The set of numbers (p 1 , p2, .. .), where Pk = P( {xk}), is called a discrete probability distribution and the corresponding distribution function F = F(x) is called discrete. We present a table of the commonest types of discrete probability distribution, with their names. Table 1 Distribution

Probabilities Pk

Parameters

Discrete uniform Bernoulli Binomial

1/N, k = 1,2, ... ,N Pt = p, Po= q C~pkq•-\ k = 0, 1, ... , n

N = 1,2, ... 0 :-:; p :-:; 1, q = 1 - p 0 :-:; p :-:; 1, q = 1 - p, n = 1,2, ...

Poisson Geometric Negative binomial

l- 1p, k = q=lP'l-',

e-kfk!, k

=

0, 1, .. . 0, 1, .. . k = r,r + 1, ...

1>0 0 :-:; p :-:; 1, q = 1 - p 0 :-:; p :-:; 1, q = 1 - p,

r = 1, 2, ...

Absolutely continuous measures. These are measures for which the corresponding distribution functions are such that F(x) = roof(t) dt,

(5)

154

II. Mathematical Foundations of Probability Theory

where f = f(t) are nonnegative functions and the integral is at first taken in the Riemann sense, but later (see §6) in that of Lebesgue. The function f = f(x), x E R, is the density of the distribution function F = F(x) (or the density of the probability distribution, or simply the density) and F = F(x) is called absolutely continuous. It is clear that every nonnegative f = f(x) that is Riemann integrable and such that J~ cxJ(x) dx = 1 defines a distribution function by (5). Table 2 presents some important examples of various kinds of densities f = f(x) with their names and parameters (a density f(x) is taken to be zero for values of x not listed in the table). Table 2

Uniform on [a, b] Normal or Gaussian

Parameters

Density

Distribution

lj(b - a),

a,beR; a
a ::::; x ::::; b

(27ta)-li2e-<x-m)2;(2a2)'

X E

R

meR,a>O

x•-le-xiP

Gamma

----, r(IX}P"

IX> 0, f3 > 0

x~O

r>O,s>O

Beta Exponential (gamma with IX = 1, f3 = 1/A.)

Bilateral exponential Chi-squared, x2 (gamma with a IX = n/2, f3 = 2) Student, t

F

Cauchy

x2) -en+ 1)/2 r(!{n + 1)) ( , x (nn)112r(nj2) 1 + --;;-

(m/n)mf2

xm/2-1

B(m/2, n/2) (1

0 n(x

2

+ mxjn)<m+n)/Z

+ 02)

,

E

R

n

=

1, 2, ...

n

=

1, 2, ...

m, n

=

1, 2, ...

xER

Singular measures. These are measures whose distribution functions are continuous but have all their points of increases on sets of zero Lebesgue measure. We do not discuss this case in detail; we merely give an example of such a function.

§3. Methods of Introducing Probability Measures on Measurable Spaces

155

I

2

X

Figure 26

We consider the interval [0, 1] and construct F(x) by the following procedure originated by Cantor. We divide [0, 1] into thirds and put (Figure 26)

F2(x) =

(t, 1),

z,1

X E

1

xe(i,~),

3

(~, !), X= 0,

4' 4'

0,

X E

1, x=l defining it in the intermediate intervals by linear interpolation. Then we divide each of the intervals [0, and [i, 1] into three parts and define the function (Figure 27) with its values at other points determined by linear interpolation.

tJ

X

Figure 27

156

II. Mathematical Foundations of Probability Theory

Continuing this process, we construct a sequence of functions Fn(x),

n = 1, 2, ... , which converges to a nondecreasing continuous function F(x)

(the Cantor function), whose points of increase (xis a point of increase of F(x) + e) - F(x - e) > 0 for every e > 0) form a set of Lebesgue measure zero. In fact, it is clear from the construction of F(x) that the total length of the intervals (j, ~), (!, ~), (~, !), ... on which the function is constant is

if F(x

! + ~ + _i_ + ... = ! 3

9

27

f (~)n = 1.

3 n=O 3

(6)

Let % be the set of points of increase of the Cantor function F(x). It follows from (6) that A.(%) = 0. At the same time, if Jl. is the measure corresponding to the Cantor function F(x), we have JJ.(%) = 1. (We then say that the measure is singular with respect to Lebesgue measure A..) Without any further discussion of possible types of distribution functions, we merely observe that in fact the three types that have been mentioned cover all possibilities. More precisely, every distribution function can be represented in the form p 1 F 1 + p2 F 2 + p 3 F 3 , where F 1 is discrete, F 2 is absolutely continuous, and F 3 is singular, and P; are nonnegative numbers, p1 + p2 + P3 = 1. 2. Theorem 1 establishes a one-to-one correspondence between probability measures on (R, Ei(R)) and distribution functions on R. An analysis of the proof of the theorem shows that in fact a stronger theorem is true, one that in particular lets us introduce Lebesgue measure on the real line. Let Jl. be a u-finite measure on (Q, d), where d is an algebra of subsets of n. It turns out that the conclusion of Caratheodory's theorem on the extension of a measure and an algebra d to a minimal u-algebra u(d) remains valid with au-finite measure; this makes it possible to generalize Theorem 1. A Lebesgue-Stieltjes measure on (R, Ei(R)) is a (countably additive) measure Jl. such that the measure JJ.(I) of every bounded interval I is finite. A generalized distribution junction on the real line R is a nondecreasing function G = G(x), with values on (- oo, oo), that is continuous on the right. Theorem 1 can be generalized to the statement that the formula

JJ.(a, b] = G(b)- G(a),

a< b,

again establishes a one-to-one correspondence between Lebesgue-Stieltjes measures Jl. and generalized distribution functions G. In fact, if G( + oo) - G(- oo) < oo, the proof of Theorem 1 can be taken over without any change, since this case reduces to the case when G( + oo) G(- oo) = 1 and G(- oo) = 0. Now let G( + oo)- G(- oo) = oo. Put G(x), Gn(x) = { G(n) G( -n),

lxl ~ n,

n, x = -n.

X =

157

§3. Methods of Introducing Probability Measures on Measurable Spaces

On the algebra 𝒜 let us define a finitely additive measure μ₀ such that μ₀(a, b] = G(b) − G(a), and let μₙ be the countably additive measures previously constructed (by Theorem 1) from the functions Gₙ(x). Evidently μₙ ↑ μ₀ on 𝒜.

Now let A₁, A₂, … be disjoint sets in 𝒜 with A = Σ Aₙ ∈ 𝒜. Then (Problem 6 of §1)

    μ₀(A) ≥ Σ_{n=1}^∞ μ₀(Aₙ).

If Σ_{n=1}^∞ μ₀(Aₙ) = ∞, then μ₀(A) = Σ_{n=1}^∞ μ₀(Aₙ). Let us suppose that Σ μ₀(Aₙ) < ∞. Then

    μ₀(A) = limₙ μₙ(A) = limₙ Σ_{k=1}^∞ μₙ(Aₖ).

By hypothesis, Σ μ₀(Aₙ) < ∞. Therefore

    0 ≤ μ₀(A) − Σ_{k=1}^∞ μ₀(Aₖ) = limₙ [ Σ_{k=1}^∞ (μₙ(Aₖ) − μ₀(Aₖ)) ] ≤ 0,

since μₙ ≤ μ₀. Thus the σ-finite finitely additive measure μ₀ is countably additive on 𝒜, and therefore (by Carathéodory's theorem) it can be extended to a countably additive measure μ on σ(𝒜).

The case G(x) = x is particularly important. The measure λ corresponding to this generalized distribution function is Lebesgue measure on (R, ℬ(R)). As for the interval [0, 1] of the real line, we can define the system ℬ̄(R) of Lebesgue-measurable sets by writing Ā ∈ ℬ̄(R) if there are Borel sets A and B such that A ⊆ Ā ⊆ B and λ(B\A) = 0. Then Lebesgue measure λ̄ on ℬ̄(R) is defined by λ̄(Ā) = λ(A) whenever A ⊆ Ā ⊆ B, Ā ∈ ℬ̄(R) and λ(B\A) = 0.

3. The measurable space (Rⁿ, ℬ(Rⁿ)). Let us suppose, as for the real line, that P is a probability measure on (Rⁿ, ℬ(Rⁿ)). Let us write

    Fₙ(x₁, …, xₙ) = P((−∞, x₁] × ⋯ × (−∞, xₙ]),

or, in a more compact form,

    Fₙ(x) = P(−∞, x],

where x = (x₁, …, xₙ) and (−∞, x] = (−∞, x₁] × ⋯ × (−∞, xₙ]. Let us introduce the difference operator Δ_{aᵢbᵢ} : Rⁿ → R, defined by the formula

    Δ_{aᵢbᵢ} Fₙ(x₁, …, xₙ) = Fₙ(x₁, …, x_{i−1}, bᵢ, x_{i+1}, …) − Fₙ(x₁, …, x_{i−1}, aᵢ, x_{i+1}, …),


II. Mathematical Foundations of Probability Theory

where aᵢ ≤ bᵢ. A simple calculation shows that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) = P(a, b],    (7)

where (a, b] = (a₁, b₁] × ⋯ × (aₙ, bₙ]. Hence it is clear, in particular, that (in contrast to the one-dimensional case) P(a, b] is in general not equal to Fₙ(b) − Fₙ(a).

Since P(a, b] ≥ 0, it follows from (7) that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) ≥ 0    (8)

for arbitrary a = (a₁, …, aₙ), b = (b₁, …, bₙ). It also follows from the continuity of P that Fₙ(x₁, …, xₙ) is continuous on the right with respect to the variables collectively, i.e. if x⁽ᵏ⁾ ↓ x, x⁽ᵏ⁾ = (x₁⁽ᵏ⁾, …, xₙ⁽ᵏ⁾), then

    Fₙ(x⁽ᵏ⁾) ↓ Fₙ(x),    k → ∞.    (9)

It is also clear that

    lim Fₙ(x₁, …, xₙ) = 1  as all the coordinates xᵢ ↑ +∞,    (10)

and

    lim_{x↓y} Fₙ(x₁, …, xₙ) = 0,    (11)

if at least one coordinate of y is −∞.

Definition 2. An n-dimensional distribution function (on Rn) is a function F = F(x 1, ••• , Xn) with properties (8)-(11). The following result can be established by the same reasoning as in Theorem 1.

Theorem 2. Let F = Fₙ(x₁, …, xₙ) be a distribution function on Rⁿ. Then there is a unique probability measure P on (Rⁿ, ℬ(Rⁿ)) such that

    P(a, b] = Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ).    (12)

Here are some examples of n-dimensional distribution functions.

Let F¹, …, Fⁿ be one-dimensional distribution functions (on R) and

    Fₙ(x₁, …, xₙ) = F¹(x₁) ⋯ Fⁿ(xₙ).

It is clear that this function is continuous on the right and satisfies (10) and (11). It is also easy to verify that

    Δ_{a₁b₁} ⋯ Δ_{aₙbₙ} Fₙ(x₁, …, xₙ) = ∏_{k=1}ⁿ [Fᵏ(bₖ) − Fᵏ(aₖ)] ≥ 0.

Consequently Fₙ(x₁, …, xₙ) is a distribution function.
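As a small numerical illustration of (7) (the particular numbers below are illustrative choices, not taken from the text), the following Python sketch takes n = 2 with F² built as a product of two uniform distribution functions, and checks that the rectangle probability is the mixed difference, not Fₙ(b) − Fₙ(a):

```python
# Sketch: verify numerically that the mixed difference
# D_{a1 b1} D_{a2 b2} F2(x1, x2) equals the rectangle probability P(a, b]
# for the product distribution function F2(x1, x2) = F(x1) F(x2),
# with F the uniform-[0, 1] distribution function.

def F(x):
    """Uniform-[0, 1] distribution function."""
    return min(max(x, 0.0), 1.0)

def F2(x1, x2):
    return F(x1) * F(x2)

def rect_prob(a1, b1, a2, b2):
    """Apply the mixed difference operator to F2 (inclusion-exclusion
    over the four corners of the rectangle (a, b])."""
    return F2(b1, b2) - F2(a1, b2) - F2(b1, a2) + F2(a1, a2)

# For the uniform product measure, P(a, b] is the area of the rectangle:
p = rect_prob(0.1, 0.4, 0.2, 0.7)
assert abs(p - 0.3 * 0.5) < 1e-12

# ...and, as the text notes, it is NOT F2(b) - F2(a) in general:
assert abs((F2(0.4, 0.7) - F2(0.1, 0.2)) - p) > 1e-3
```

The same inclusion-exclusion pattern (2ⁿ signed corner values) evaluates the n-fold mixed difference in any dimension.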



The case when

    Fᵏ(xₖ) = { 0,   xₖ < 0,
             { xₖ,  0 ≤ xₖ ≤ 1,
             { 1,   xₖ > 1,

is particularly important. In this case

    Fₙ(x₁, …, xₙ) = x₁ ⋯ xₙ,    0 ≤ xᵢ ≤ 1.

The probability measure corresponding to this n-dimensional distribution function is n-dimensional Lebesgue measure on [0, 1]ⁿ. Many n-dimensional distribution functions appear in the form

    Fₙ(x₁, …, xₙ) = ∫_{−∞}^{x₁} ⋯ ∫_{−∞}^{xₙ} fₙ(t₁, …, tₙ) dt₁ ⋯ dtₙ,

where fₙ(t₁, …, tₙ) is a nonnegative function such that

    ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} fₙ(t₁, …, tₙ) dt₁ ⋯ dtₙ = 1,

and the integrals are Riemann (more generally, Lebesgue) integrals. The function f = fₙ(t₁, …, tₙ) is called the density of the n-dimensional distribution function, the density of the n-dimensional probability distribution, or simply an n-dimensional density.

When n = 1, the function

    f(x) = (1/(σ√(2π))) e^{−(x−m)²/(2σ²)},    x ∈ R,

with σ > 0 is the density of the (nondegenerate) Gaussian or normal distribution. There are natural analogs of this density when n > 1.

Let 𝐑 = ‖rᵢⱼ‖ be a nonnegative definite symmetric n × n matrix:

    Σ_{i,j=1}ⁿ rᵢⱼ λᵢ λⱼ ≥ 0,    λᵢ ∈ R, i = 1, …, n.

When 𝐑 is a positive definite matrix, |𝐑| = det 𝐑 > 0 and consequently there is an inverse matrix A = ‖aᵢⱼ‖. The function

    fₙ(x₁, …, xₙ) = (|A|^{1/2}/(2π)^{n/2}) exp{ −½ Σ_{i,j=1}ⁿ aᵢⱼ (xᵢ − mᵢ)(xⱼ − mⱼ) },    (13)

where mᵢ ∈ R, i = 1, …, n, has the property that its (Riemann) integral over the whole space equals 1 (this will be proved in §13) and therefore, since it is also positive, it is a density. This function is the density of the n-dimensional (nondegenerate) Gaussian or normal distribution (with vector mean m = (m₁, …, mₙ) and covariance matrix 𝐑 = A⁻¹).



Figure 28. Density of the two-dimensional Gaussian distribution.

When n = 2 the density f₂(x₁, x₂) can be put in the form

    f₂(x₁, x₂) = (1/(2πσ₁σ₂√(1−ρ²)))
                 × exp{ −(1/(2(1−ρ²))) [ (x₁−m₁)²/σ₁² − 2ρ(x₁−m₁)(x₂−m₂)/(σ₁σ₂) + (x₂−m₂)²/σ₂² ] },    (14)

where σᵢ > 0, |ρ| < 1. (The meanings of the parameters mᵢ, σᵢ and ρ will be explained in §8.)

Figure 28 indicates the form of the two-dimensional Gaussian density.
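The agreement between (13) and (14) for n = 2 is easy to check numerically. In the Python sketch below, the parameters m₁, m₂, σ₁, σ₂, ρ are illustrative choices (not from the text); the 2 × 2 covariance matrix is inverted by hand and the two expressions for the density are compared at a point:

```python
import math

# Sketch: for n = 2, evaluate the Gaussian density (13) through the
# inverse covariance matrix A and check it against the classical
# two-parameter form (14) in terms of sigma1, sigma2 and rho.

m1, m2 = 1.0, -2.0
s1, s2, rho = 0.8, 1.5, 0.6     # illustrative parameters

# covariance matrix R = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]],
# det R = s1^2 s2^2 (1 - rho^2); inverse A computed by hand for 2x2:
det = (s1 * s2) ** 2 * (1 - rho ** 2)
a11 = s2 ** 2 / det
a22 = s1 ** 2 / det
a12 = -rho * s1 * s2 / det

def density_13(x1, x2):
    """Formula (13): |A|^(1/2) / (2 pi) * exp(-q/2) with q the quadratic form."""
    q = a11 * (x1 - m1) ** 2 + 2 * a12 * (x1 - m1) * (x2 - m2) + a22 * (x2 - m2) ** 2
    return math.sqrt(a11 * a22 - a12 ** 2) / (2 * math.pi) * math.exp(-0.5 * q)

def density_14(x1, x2):
    """Formula (14) in standardized coordinates u, v."""
    u, v = (x1 - m1) / s1, (x2 - m2) / s2
    norm = 1.0 / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))
    return norm * math.exp(-(u * u - 2 * rho * u * v + v * v) / (2 * (1 - rho ** 2)))

assert abs(density_13(0.3, -1.0) - density_14(0.3, -1.0)) < 1e-12
```

The check works because a₁₁(x₁−m₁)² + 2a₁₂(x₁−m₁)(x₂−m₂) + a₂₂(x₂−m₂)² reduces exactly to (u² − 2ρuv + v²)/(1−ρ²) in the standardized variables u, v.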

Remark. As in the case n = 1, Theorem 2 can be generalized to (similarly defined) Lebesgue-Stieltjes measures on (Rⁿ, ℬ(Rⁿ)) and generalized distribution functions on Rⁿ. When the generalized distribution function Gₙ(x₁, …, xₙ) is x₁ ⋯ xₙ, the corresponding measure is Lebesgue measure on the Borel sets of Rⁿ. It clearly satisfies

    λ(a, b] = ∏_{i=1}ⁿ (bᵢ − aᵢ),

i.e. the Lebesgue measure of the "rectangle" is its "content."

4. The measurable space (R^∞, ℬ(R^∞)). For the spaces Rⁿ, n ≥ 1, the probability measures were constructed in the following way: first for the elementary sets (rectangles (a, b]), then, in a natural way, for the sets A = Σ (aᵢ, bᵢ], and finally, by using Carathéodory's theorem, for the sets in ℬ(Rⁿ). A similar construction of probability measures also works for the space (R^∞, ℬ(R^∞)).



Let BE 14(R"), denote a cylinder set in Roo with base BE &I(R"). We see at once that it is natural to take the cylinder sets as elementary sets in R 00 , with their probabilities defined by the probability measure on the sets of &I(R 00 ). Let P be a probability measure on (R 00 , &I(R 00 )). For n = 1, 2, ... , we take BE &I(R").

(15)

The sequence of probability measures P t> P 2 , ••• defined respectively on (R, &I(R)), (R 2 , &I(R 2 )), ••• , has the following evident consistency property: for n = 1, 2, ... and B E &I(R"), (16)

It is noteworthy that the converse also holds.

Theorem 3 (Kolmogorov's Theorem on the Extension of Measures in (R 00 , &B(R 00 ))). Let P 1 , P 2 , ••• be a sequence of probability measures on (R, 14(R)), (R 2 , &B(R 2 )), •• • , possessing the consistency property (16). Then there is a unique probability measure P on (R 00 , &I(R 00 )) such that

BE &I(R").

(17)

for n = 1, 2, .... E &B(R") and let Jn(B") be the cylinder with base B". We assign the measure P(J.(B")) to this cylinder by taking P(J.(B")) = P.(B").

PRooF. Let B"

Let us show that, in virtue of the consistency condition, this definition is consistent, i.e. the value of P(Jn(B")) is independent of the representation of the set Jn(B"). In fact, let the same cylinder be represented in two way: J.(B") = Jn+k(B"+k). It follows that, if (x 1, ... , xn+k)

E

R"+k, we have

(18) and therefore, by (16) and (18), Pn(B") = pn+ l((xb ... 'Xn+ l):(xl, ... 'xn) E B") = ... = pn+k((xl, ... , Xn+k):(xl, ... 'x.) E B") = pn+k(B"+k).

Let 𝒜(R^∞) denote the collection of all cylinder sets B̂ⁿ = 𝒥ₙ(Bⁿ), Bⁿ ∈ ℬ(Rⁿ), n = 1, 2, ….



Now let B̂₁, …, B̂ₖ be disjoint sets in 𝒜(R^∞). We may suppose without loss of generality that B̂ᵢ = 𝒥ₙ(Bᵢⁿ), i = 1, …, k, for some n, where B₁ⁿ, …, Bₖⁿ are disjoint sets in ℬ(Rⁿ). Then

    P(Σᵢ₌₁ᵏ B̂ᵢ) = P(𝒥ₙ(Σᵢ₌₁ᵏ Bᵢⁿ)) = Pₙ(Σᵢ₌₁ᵏ Bᵢⁿ) = Σᵢ₌₁ᵏ Pₙ(Bᵢⁿ) = Σᵢ₌₁ᵏ P(B̂ᵢ),

i.e. the set function P is finitely additive on the algebra 𝒜(R^∞).

Let us show that P is "continuous at zero," i.e. if the sequence of sets B̂ₙ ↓ ∅, n → ∞, then P(B̂ₙ) → 0, n → ∞. Suppose the contrary, i.e. let limₙ P(B̂ₙ) = δ > 0. We may suppose without loss of generality that {B̂ₙ} has the form

    B̂ₙ = {x : (x₁, …, xₙ) ∈ Bⁿ},    Bⁿ ∈ ℬ(Rⁿ).

We use the following property of the probability measures Pₙ on (Rⁿ, ℬ(Rⁿ)) (see Problem 9): if Bⁿ ∈ ℬ(Rⁿ), then for a given δ > 0 we can find a compact set Aⁿ ∈ ℬ(Rⁿ) such that Aⁿ ⊆ Bⁿ and

    Pₙ(Bⁿ \ Aⁿ) ≤ δ/2ⁿ⁺¹.

Therefore if

    Âₙ = {x : (x₁, …, xₙ) ∈ Aⁿ},

we have

    P(B̂ₙ \ Âₙ) = Pₙ(Bⁿ \ Aⁿ) ≤ δ/2ⁿ⁺¹.

Form the set Ĉₙ = ⋂ₖ₌₁ⁿ Âₖ and let Cₙ be such that Ĉₙ = {x : (x₁, …, xₙ) ∈ Cₙ}. Then, since the sets B̂ₙ decrease, we obtain

    P(B̂ₙ \ Ĉₙ) ≤ Σₖ₌₁ⁿ P(B̂ₙ \ Âₖ) ≤ Σₖ₌₁ⁿ P(B̂ₖ \ Âₖ) ≤ δ/2.

But by assumption limₙ P(B̂ₙ) = δ > 0, and therefore limₙ P(Ĉₙ) ≥ δ/2 > 0. Let us show that this contradicts the condition Ĉₙ ↓ ∅.

Let us choose a point x̂⁽ⁿ⁾ = (x₁⁽ⁿ⁾, x₂⁽ⁿ⁾, …) in Ĉₙ. Then (x₁⁽ⁿ⁾, …, xₙ⁽ⁿ⁾) ∈ Cₙ for n ≥ 1. Let (n₁) be a subsequence of (n) such that x₁^{(n₁)} → x₁⁰, where x₁⁰ is a point in C₁. (Such a subsequence exists since x₁⁽ⁿ⁾ ∈ C₁ and C₁ is compact.) Then select a subsequence (n₂) of (n₁) such that (x₁^{(n₂)}, x₂^{(n₂)}) → (x₁⁰, x₂⁰) ∈ C₂. Similarly, (x₁^{(nₖ)}, …, xₖ^{(nₖ)}) → (x₁⁰, …, xₖ⁰) ∈ Cₖ. Finally form the diagonal sequence (mₖ), where mₖ is the kth term of (nₖ). Then xᵢ^{(mₖ)} → xᵢ⁰ as mₖ → ∞ for i = 1, 2, …, and (x₁⁰, x₂⁰, …) ∈ Ĉₙ for n = 1, 2, …, which evidently contradicts the assumption that Ĉₙ ↓ ∅, n → ∞. This completes the proof of the theorem.



Remark. In the present case, the space R^∞ is a countable product of lines, R^∞ = R × R × ⋯. It is natural to ask whether Theorem 3 remains true if (R^∞, ℬ(R^∞)) is replaced by a direct product of measurable spaces (Ωᵢ, ℱᵢ), i = 1, 2, ….

We may notice that in the preceding proof the only topological property of the real line that was used was that every set in ℬ(Rⁿ) contains a compact subset whose probability measure is arbitrarily close to the probability measure of the whole set. It is known, however, that this is a property not only of the spaces (Rⁿ, ℬ(Rⁿ)), but also of arbitrary complete separable metric spaces with σ-algebras generated by the open sets. Consequently Theorem 3 remains valid if we suppose that P₁, P₂, … is a sequence of consistent probability measures on

    (Ω₁, ℱ₁), (Ω₁ × Ω₂, ℱ₁ ⊗ ℱ₂), …,

where (Ωᵢ, ℱᵢ) are complete separable metric spaces with σ-algebras generated by the open sets, and (R^∞, ℬ(R^∞)) is replaced by

    (Ω₁ × Ω₂ × ⋯, ℱ₁ ⊗ ℱ₂ ⊗ ⋯).

In §9 (Theorem 2) it will be shown that the result of Theorem 3 remains valid for arbitrary measurable spaces (Ωᵢ, ℱᵢ) if the measures Pₙ are concentrated in a particular way. However, Theorem 3 may fail in the general case (without any hypotheses on the topological nature of the measurable spaces or on the structure of the family of measures {Pₙ}). This is shown by the following example.

Let us consider the space Ω = (0, 1], which is evidently not complete, and construct a sequence ℱ₁ ⊆ ℱ₂ ⊆ ⋯ of σ-algebras in the following way. For n = 1, 2, …, let

    φₙ(ω) = { 1,  0 < ω < 1/n,
            { 0,  1/n ≤ ω ≤ 1,

    𝒞ₙ = {A ⊆ Ω : A = {ω : φₙ(ω) ∈ B}, B ∈ ℬ(R)},

and let ℱₙ = σ{𝒞₁, …, 𝒞ₙ} be the smallest σ-algebra containing the sets 𝒞₁, …, 𝒞ₙ. Clearly ℱ₁ ⊆ ℱ₂ ⊆ ⋯. Let ℱ = σ(⋃ ℱₙ) be the smallest σ-algebra containing all the ℱₙ.

Consider the measurable space (Ω, ℱₙ) and define a probability measure Pₙ on it as follows:

    Pₙ{ω : (φ₁(ω), …, φₙ(ω)) ∈ Bⁿ} = { 1  if (1, …, 1) ∈ Bⁿ,
                                      { 0  otherwise,

where Bⁿ ∈ ℬ(Rⁿ). It is easy to see that the family {Pₙ} is consistent: if A ∈ ℱₙ then Pₙ₊₁(A) = Pₙ(A). However, we claim that there is no probability measure P on (Ω, ℱ) such that its restriction P|ℱₙ (i.e., the measure P



considered only on sets in ℱₙ) coincides with Pₙ for n = 1, 2, …. In fact, let us suppose that such a probability measure P exists. Then

    P{ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = Pₙ{ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = 1    (19)

for n = 1, 2, …. But

    {ω : φ₁(ω) = ⋯ = φₙ(ω) = 1} = (0, 1/n) ↓ ∅,

which contradicts (19) and the hypothesis of countable additivity (and therefore continuity at the "zero" ∅) of the set function P.

We now give an example of a probability measure on (R^∞, ℬ(R^∞)). Let F₁(x), F₂(x), … be a sequence of one-dimensional distribution functions. Define the functions G₁(x₁) = F₁(x₁), G₂(x₁, x₂) = F₁(x₁)F₂(x₂), …, and denote the corresponding probability measures on (R, ℬ(R)), (R², ℬ(R²)), … by P₁, P₂, …. Then it follows from Theorem 3 that there is a measure P on (R^∞, ℬ(R^∞)) such that

    P{x ∈ R^∞ : (x₁, …, xₙ) ∈ B} = Pₙ(B),    B ∈ ℬ(Rⁿ),

and, in particular,

    P{x ∈ R^∞ : x₁ ≤ a₁, …, xₙ ≤ aₙ} = F₁(a₁) ⋯ Fₙ(aₙ).

Let us take Fᵢ(x) to be a Bernoulli distribution,

    Fᵢ(x) = { 0,  x < 0,
            { q,  0 ≤ x < 1,
            { 1,  x ≥ 1.

Then we can say that there is a probability measure P on the space Ω of sequences of numbers x = (x₁, x₂, …), xᵢ = 0 or 1, together with the σ-algebra of its Borel subsets, such that

    P{x : x₁ = a₁, …, xₙ = aₙ} = p^{Σ aᵢ} q^{n − Σ aᵢ}.
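A brief numerical sketch of this Bernoulli example (the value of p below is an illustrative choice): the cylinder probabilities p^{Σaᵢ}q^{n−Σaᵢ} form a consistent family in the sense of (16), and each Pₙ has total mass 1.

```python
from itertools import product

# Sketch: cylinder probabilities for i.i.d. Bernoulli(p) coordinates,
# and a check of the Kolmogorov consistency condition (16).

def cylinder_prob(a, p):
    """P{x1 = a[0], ..., xn = a[n-1]} = p^(sum a_i) * q^(n - sum a_i)."""
    q = 1.0 - p
    s = sum(a)
    return p ** s * q ** (len(a) - s)

p = 0.3
a = (1, 0, 1)

# consistency: P_3{a} = P_4{(a, 0)} + P_4{(a, 1)}
assert abs(cylinder_prob(a, p)
           - (cylinder_prob(a + (0,), p) + cylinder_prob(a + (1,), p))) < 1e-15

# total mass of P_n is 1 (sum over all 2^n cylinders of length n)
total = sum(cylinder_prob(b, p) for b in product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12
```

Summing out the last coordinate of Pₙ₊₁ recovers Pₙ, which is exactly the consistency that Theorem 3 requires before extending to a measure on (R^∞, ℬ(R^∞)).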



This is precisely the result that was not available in the first chapter for stating the law of large numbers in the form (1.5.8).

5. The measurable space (R^T, ℬ(R^T)). Let T be a set of indices t ∈ T and Rₜ a real line corresponding to the index t. We consider a finite unordered set τ = [t₁, …, tₙ] of distinct indices tᵢ, tᵢ ∈ T, n ≥ 1, and let P_τ be a probability measure on (R^τ, ℬ(R^τ)), where R^τ = R_{t₁} × ⋯ × R_{tₙ}.

We say that the family {P_τ} of probability measures, where τ runs through all finite unordered sets, is consistent if, for all sets τ = [t₁, …, tₙ] and σ = [s₁, …, sₖ] such that σ ⊆ τ, we have

    P_σ{(x_{s₁}, …, x_{sₖ}) : (x_{s₁}, …, x_{sₖ}) ∈ B} = P_τ{(x_{t₁}, …, x_{tₙ}) : (x_{s₁}, …, x_{sₖ}) ∈ B}    (20)

for every B ∈ ℬ(R^σ).



Theorem 4 (Kolmogorov's Theorem on the Extension of Measures in (R^T, ℬ(R^T))). Let {P_τ} be a consistent family of probability measures on (R^τ, ℬ(R^τ)). Then there is a unique probability measure P on (R^T, ℬ(R^T)) such that

    P(𝒥_τ(B)) = P_τ(B)    (21)

for all unordered sets τ = [t₁, …, tₙ] of distinct indices tᵢ ∈ T, B ∈ ℬ(R^τ), and 𝒥_τ(B) = {x ∈ R^T : (x_{t₁}, …, x_{tₙ}) ∈ B}.

PROOF. Let the set B̂ ∈ ℬ(R^T). By the theorem of §2 there is an at most countable set S = {s₁, s₂, …} ⊆ T such that B̂ = {x : (x_{s₁}, x_{s₂}, …) ∈ B}, where B ∈ ℬ(R^S), R^S = R_{s₁} × R_{s₂} × ⋯. In other words, B̂ = 𝒥_S(B) is a cylinder set with base B ∈ ℬ(R^S).

We can define a set function P on such cylinder sets by putting

    P(𝒥_S(B)) = P_S(B),    (22)

where P_S is the probability measure whose existence is guaranteed by Theorem 3. We claim that P is in fact the measure whose existence is asserted in the theorem. To establish this we first verify that the definition (22) is consistent, i.e. that it leads to a unique value of P(B̂) for all possible representations of B̂; and second, that this set function is countably additive.

Let B̂ = 𝒥_{S₁}(B₁) and B̂ = 𝒥_{S₂}(B₂). It is clear that then B̂ = 𝒥_{S₁∪S₂}(B₃) with some B₃ ∈ ℬ(R^{S₁∪S₂}); therefore it is enough to show that if S ⊆ S′ and B ∈ ℬ(R^S), then P_{S′}(B′) = P_S(B), where

    B′ = {(x_{s′₁}, x_{s′₂}, …) : (x_{s₁}, x_{s₂}, …) ∈ B}

with S′ = {s′₁, s′₂, …}, S = {s₁, s₂, …}. But by the assumed consistency (20) this equation follows immediately from Theorem 3. This establishes that the value of P(B̂) is independent of the representation of B̂.

To verify the countable additivity of P, let us suppose that {B̂ₙ} is a sequence of pairwise disjoint sets in ℬ(R^T). Then there is an at most countable set S ⊆ T such that B̂ₙ = 𝒥_S(Bₙ) for all n ≥ 1, where Bₙ ∈ ℬ(R^S). Since P_S is a probability measure, we have

    P(Σ B̂ₙ) = P(𝒥_S(Σ Bₙ)) = P_S(Σ Bₙ) = Σ P_S(Bₙ) = Σ P(𝒥_S(Bₙ)) = Σ P(B̂ₙ).

Finally, property (21) follows immediately from the way in which P was constructed. This completes the proof.

Remark 1. We emphasize that T is any set of indices. Hence, by the remark after Theorem 3, the present theorem remains valid if we replace the real lines Rₜ by arbitrary complete separable metric spaces Ωₜ (with σ-algebras generated by the open sets).



Remark 2. The original probability measures {P_τ} were assumed defined on unordered sets τ = [t₁, …, tₙ] of distinct indices. It is also possible to start from a family of probability measures {P_τ} where τ runs through all ordered sets τ = (t₁, …, tₙ) of distinct indices. In this case, in order to have Theorem 4 hold we have to adjoin to (20) a further consistency condition:

    P_{(t₁, …, tₙ)}(A_{t₁} × ⋯ × A_{tₙ}) = P_{(t_{i₁}, …, t_{iₙ})}(A_{t_{i₁}} × ⋯ × A_{t_{iₙ}}),    (23)

where (i₁, …, iₙ) is an arbitrary permutation of (1, …, n) and A_{tᵢ} ∈ ℬ(R_{tᵢ}). As a necessary condition for the existence of P this follows from (21) (with P_{t₁,…,tₙ}(B) replaced by P_{(t₁,…,tₙ)}(B)).

As an example, consider the Gaussian density

    φₜ(y|x) = (1/√(2πt)) e^{−(y−x)²/(2t)}

with mean x and variance t, and for each τ = [t₁, …, tₙ], t₁ < t₂ < ⋯ < tₙ, and each set B = I₁ × ⋯ × Iₙ, construct the measure P_τ(B) according to the formula

    P_τ(I₁ × ⋯ × Iₙ) = ∫_{I₁} ⋯ ∫_{Iₙ} φ_{t₁}(a₁|0) φ_{t₂−t₁}(a₂|a₁) ⋯ φ_{tₙ−tₙ₋₁}(aₙ|aₙ₋₁) da₁ ⋯ daₙ    (24)

(integration in the Riemann sense). Now we define the set function P for each cylinder set

    𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) = {x ∈ R^T : x_{t₁} ∈ I₁, …, x_{tₙ} ∈ Iₙ}

by taking

    P(𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ)) = P_τ(I₁ × ⋯ × Iₙ).

The intuitive meaning of this method of assigning a measure to the cylinder set 𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) is as follows. The set 𝒥_{t₁…tₙ}(I₁ × ⋯ × Iₙ) is the set of functions that at times t₁, …, tₙ pass through the "windows" I₁, …, Iₙ (see Figure 24 in §2). We shall interpret

§3. Methods of Introducing Probability Measures on Measurable Spaces

167


φ_{tₖ−tₖ₋₁}(aₖ|aₖ₋₁) as the probability density that a particle, starting at aₖ₋₁ at time tₖ₋₁, arrives at time tₖ in a neighborhood of aₖ. Then the product of densities that appears in (24) describes a certain independence of the increments of the displacements of the moving "particle" in the time intervals [0, t₁], [t₁, t₂], …, [tₙ₋₁, tₙ].

The family of measures {P_τ} constructed in this way is easily seen to be consistent, and therefore it can be extended to a measure on (R^{[0,∞)}, ℬ(R^{[0,∞)})). The measure so obtained plays an important role in probability theory. It was introduced by N. Wiener and is known as Wiener measure.
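The finite-dimensional formula (24) can be illustrated numerically. In the Python sketch below, the times t₁, t₂ and the windows I₁, I₂ are illustrative choices of mine; P_τ(I₁ × I₂) is approximated both by a Riemann sum of the product of Gaussian transition densities and by a Monte Carlo simulation of independent Gaussian increments, and the two estimates agree.

```python
import math, random

# Sketch: approximate the Wiener-measure probability (24) that a path
# is in window I1 at time t1 and in window I2 at time t2.

def phi(t, y, x):
    """Gaussian transition density with mean x and variance t."""
    return math.exp(-(y - x) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)

t1, t2 = 1.0, 2.0
I1, I2 = (-1.0, 1.0), (0.0, 2.0)      # illustrative windows

# Midpoint Riemann sum for the double integral in (24)
N = 400
h1 = (I1[1] - I1[0]) / N
h2 = (I2[1] - I2[0]) / N
p_int = 0.0
for i in range(N):
    a1 = I1[0] + (i + 0.5) * h1
    w1 = phi(t1, a1, 0.0) * h1        # density of the first displacement
    for j in range(N):
        a2 = I2[0] + (j + 0.5) * h2
        p_int += w1 * phi(t2 - t1, a2, a1) * h2

# Monte Carlo: x_{t1} and the increment x_{t2} - x_{t1} are independent Gaussians
rng = random.Random(0)
M = 200_000
hits = 0
for _ in range(M):
    x1 = rng.gauss(0.0, math.sqrt(t1))
    x2 = x1 + rng.gauss(0.0, math.sqrt(t2 - t1))
    if I1[0] < x1 <= I1[1] and I2[0] < x2 <= I2[1]:
        hits += 1
p_mc = hits / M

assert abs(p_int - p_mc) < 0.01
```

The agreement of the two estimates reflects exactly the independence of increments that the product of densities in (24) encodes.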

6. PROBLEMS

1. Let F(x) = P(−∞, x]. Verify the following formulas:

    P(a, b] = F(b) − F(a),
    P(a, b) = F(b−) − F(a),
    P[a, b] = F(b) − F(a−),
    P[a, b) = F(b−) − F(a−),
    P{x}   = F(x) − F(x−),

where F(x−) = lim_{y↑x} F(y).

2. Verify formula (7).

3. Prove Theorem 2.

4. Show that a distribution function F = F(x) on R has at most a countable set of points of discontinuity. Does a corresponding result hold for distribution functions on Rⁿ?

5. Show that each of the functions

    G(x, y) = { 1,  x + y ≥ 0,
              { 0,  x + y < 0,

    G(x, y) = [x + y], the integral part of x + y,

is continuous on the right and continuous in each argument, but is not a (generalized) distribution function on R².

6. Let μ be the Lebesgue-Stieltjes measure generated by a continuous distribution function. Show that if the set A is at most countable, then μ(A) = 0.

7. Let c be the cardinal number of the continuum. Show that the cardinal number of the collection of Borel sets in Rⁿ is c, whereas that of the collection of Lebesgue-measurable sets is 2^c.

8. Let (Ω, ℱ, P) be a probability space and 𝒜 an algebra of subsets of Ω such that σ(𝒜) = ℱ. Using the principle of appropriate sets, prove that for every ε > 0 and B ∈ ℱ there is a set A ∈ 𝒜 such that P(A △ B) ≤ ε.



9. Let P be a probability measure on (Rⁿ, ℬ(Rⁿ)). Using Problem 8, show that, for every ε > 0 and B ∈ ℬ(Rⁿ), there is a compact set A ∈ ℬ(Rⁿ) such that A ⊆ B and P(B\A) ≤ ε. (This was used in the proof of Theorem 1.)

10. Verify the consistency of the measure defined by (21).

§4. Random Variables. I

1. Let (Ω, ℱ) be a measurable space and let (R, ℬ(R)) be the real line with the system ℬ(R) of Borel sets.

Definition 1. A real function ξ = ξ(ω) defined on (Ω, ℱ) is an ℱ-measurable function, or a random variable, if

    {ω : ξ(ω) ∈ B} ∈ ℱ    (1)

for every B ∈ ℬ(R); or, equivalently, if the inverse image

    ξ⁻¹(B) ≡ {ω : ξ(ω) ∈ B}

is a measurable set in Ω.

When (Ω, ℱ) = (Rⁿ, ℬ(Rⁿ)), the ℬ(Rⁿ)-measurable functions are called Borel functions.

Borel functions.

The simplest example of a random variable is the indicator IA(w) of an arbitrary (measurable) set A E ff'. A random variable ~ that has a representation ~(w)

=

00

L x;IA;(w),

i= 1

(2)

where L Ai = 0, Ai E ~. is called discrete. If the sum in (2) is finite, the random variable is called simple. With the same interpretation as in §4 of Chapter I, we may say that a random variable is a numerical property of an experiment, with a value depending on "chance." Here the requirement (1) of measurability is fundamental, for the following reason. If a probability measure P is defined on (0, ~), it then makes sense to speak of the probability of the event {~(w) E B} that the value of the random variable belongs to a Borel set B. We introduce the following definitions. Definition 2. A probability measure P~ on (R, BI(R)) with P~(B)

= P{w: ~(w) E B},

BE BI(R),

is called the probability distribution of~ on (R, BI(R)).

169

§4. Random Variables. I

Definition 3. The function F~(x)

= P(w:

::::; x},

~(w)

XER,

is called the distribution function of ~For a discrete random variable the measure P~ is concentrated on an at most countable set and can be represented in the form P ~(B)

I

=

p(xk),

(3)

{k: XkEB)

wherep(xk) = P{~ = xd = M~(xk). The converse is evidently true: If P ~ is represented in the form (3) then ~ is a discrete random variable. A random variable~ is called continuous if its distribution function F~(x) is continuous for x E R. A random variable~ is called absolutely continuous if there is a nonnegative function f = Hx), called its density, such that

F~(x) = fJ~(y) dy,

XER,

(4)

(the integral can be taken in the Riemann sense, or more generally in that of Lebesgue; see §6 below). 2. To establish that a function ~ = ~(w) is a random variable, we have to verify property (1) for all sets BE~- The following lemma shows that the class of such "test" sets can be considerably narrowed. Lemma 1. Let g be a system of sets such that a( g) = §l(R). A necessary and sufficient condition that a function~ = ~(w) is ~-measurable is that

(5)

for all E

E

g_

PRooF. The necessity is evident. To prove the sufficiency we again use the principle of appropriate sets. Let~ be the system of those Borel sets D in f!.6(R) for which C 1(D) E ~­ The operation "form the inverse image" is easily shown to preserve the settheoretic operations of union, intersection and complement:

yB") = yC cl( 0B") 0Cl(B"), C 1(

1

(B"),

=

~ l(B") =

C l(B").

(6)

170

II. Mathematical Foundations of Probability Theory

It follows that f0 is a a-algebra. Therefore

and

a(S)

~

a(f0) = f0

~

PJ(R).

But a{E) = PJ(R) and consequently f0 = PJ(R).

Corollary. A necessary and sufficient condition for variable is that for every x

for every x

E

E

< x}

E

IF

x}

E

IF

{w:

~(w)

{w:

~(w) ~

~

= ~(w) to be a random

R, or that

R.

The proof is immediate, since each of the systems

S 1 = {x: x < c, c E R}, S2 =

{X:

X

~

c, c E R}

generates the a-algebra PJ(R): a{£ 1 ) = a(E 2 ) = f!J(R) (see §2). The following lemma makes it possible to construct random variables as functions of other random variables.

Lemma 2. Let cp = cp(x) be a Bore/function and~= ~(w) a random variable. Then the composition 17 = cp o (, i.e. the function 17(w) = cp(((w)), is also a random variable. The proof follows from the equations {w:1](W)EB} = {w:cp(((w))EB} = {w:~(w)Ecp- 1 (B)}EIF

(7)

for BE PJ(R), since cp- 1(B) E PJ(R). Therefore if~ is a random variable, so are, for examples,~·,~+ = max((, 0), = -min(~' 0), and In since the functions x", X+' X- and IX I are Borel functions (Problem 4).

c

3. Starting from a given collection of random variables {~.},we can construct new functions, for example, 1 I~k I, lim ~., lim ~., etc. Notice that in general such functions take values on the extended real line R = [- oo, oo]. Hence it is advisable to extend the class of IF-measurable functions somewhat by allowing them to take the values ± oo.

Lk'=

171

§4. Random Variables. I

Definition 4. A function ~ = ~(w) defined on (0, §) with values in R = [ -oo, oo] will be called an extended random variable if condition (1) is satisfied for every Borel set BE PJ(R). The following theorem, despite its simplicity, is the key to the construction of the Lebesgue integral (§6).

Theorem 1. ~ = ~(w) (extended ones included) there is a sequence of simple random variables ~ 1 , ~ 2 , .•• , such that I~" I : : ; I~ I and ~n(w)--+ ~(w), n--+ oo,for all wE 0. (b) If also ~(w) ~ 0, there is a sequence of simple random variables ~t> ~ 2 , •.• , such that ~n(w) i ~(w), n--+ oo, for all wE 0.

(a) For every random variable

PRooF. We begin by proving the second statement. For n = 1, 2, ... , put n2" k- 1 ~n(w) = k~1 ~Ik,n(w)

+ nlww>~n)(w),

where Ik,n is the indicator of the set {(k - 1)/2" ::::;; ~(w) < k/2"}. It is easy to verify that the sequence ~"(w) so constructed is such that ~"(w) i ~(w) for all w E 0. The first statement follows from this if we merely observe that ~can be represented in the form~ = ~+ - ~-.This completes the proof of the theorem. We next show that the class of extended random variables is closed under pointwise convergence. For this purpose, we note fir~t that if ~ 1 , ~ 2 , .•. is a sequence of extended random variables, then sup~"' inf ~"'lim~" and lim~" are also random variables (possibly extended). This follows immediately from {w:sup ~n > x} = {w: inf ~n

< x} =

U{w: ~n > x} E§, n U {w: ~n < x} E §, n

and lim ~n = inf sup ~m• n m2:.n

Theorem 2. Let ~(w)

~t> ~ 2 ,

•••

lim ~n

= sup sup ~m. n

m2:.n

be a sequence of extended random variables and

= lim ~n(w). Then ~(w) is also an extended random variable.

The prooffollows immediately from the remark above and the fact that

{w:

~(w)

< x}

= {w: lim ~n(w) < x} = {w: lim ~n(w) =lim ~n(w)} 11 {lim ~n(w) < x} <X} = {lim ~nCw) <X} E §.

= 011 {lim ~n(w)

172

II. Mathematical Foundations of Probability Theory

4. We mention a few more properties of the simplest functions of random variables considered on the measurable space (Q, /F) and possibly taking values on the extended real lineR = [ - oo, oo].t If e and '1 are random variables, e + '1· e - '1· e'1. and e/'1 are also random variables (assuming that they are defined, i.e. that no indeterminate forms like oo - oo, oojoo, a/0 occur. In fact, let {en} and {1'/n} be sequences of random variables converging to e and '1 (see Theorem 1). Then en

± '1n --+ e ± '1.

en '1n --+ en ------~------+ 1 1'/n + -J{.,n=O}(w)

~1'/. e -· '1

n

The functions on the left-hand sides of these relations are simple random variables. Therefore, by Theorem 2, the limit functions e ± 17, e11 and e/rt are also random variables.

e

5. Let be a random variable. Let us consider sets from :F of the form {w: e(w) E B}, BE PA(R). It is easily verified that they form a a-algebra, called the a-algebra generated by and denoted by ~If qJ is a Borel function, it follows from Lemma 2 that the function 17 = qJ o e is also a random variable, and in fact ~-measurable, i.e. such that

e.

{w: 17(w) E B} E ~.

BE PA(R)

(see (7)). It turns out that the converse is also true.

Theorem 3. Let '1 be a :F~-measurable random variable. Then there is a Borel jimction qJ such that '1 = qJ i.e. 1J(W) = qJ(e(w)) for every WE Q. 0

e,

PROOF. Let Cl> be the class of ~-measurable functions 1J = 1J(w) and &~ the class of ~-measurable functions representable in the form qJ 0 where qJ is a Borel function. It is clear that & s; Cl>~. The conclusion of the theorem is that in fact&~ = Cl>~. Let A E ~and 17(w) = IA(w). Let us show that 1J E &~.In fact, if A E ~ there is aBE PA(R) such that A = {w: e(w) E B}. Let

e.

( )_{1,0,

XBX

-

X EB, X¢: B.

Then/A(w) = XB(e(w)) E
L7=

† We shall assume the usual conventions about arithmetic operations in R̄: if a ∈ R then a ± ∞ = ±∞, a/±∞ = 0; a·∞ = ∞ if a > 0, and a·∞ = −∞ if a < 0; 0·(±∞) = 0, ∞ + ∞ = ∞, −∞ − ∞ = −∞.



Now let η be an arbitrary ℱ_ξ-measurable function. By Theorem 1 there is a sequence of simple ℱ_ξ-measurable functions {ηₙ} such that ηₙ(ω) → η(ω), n → ∞, ω ∈ Ω. As we have just shown, there are Borel functions φₙ such that ηₙ(ω) = φₙ(ξ(ω)), and φₙ(ξ(ω)) → η(ω), n → ∞, ω ∈ Ω. Let B = {x ∈ R : limₙ φₙ(x) exists}; this is a Borel set. The function

    φ(x) = { limₙ φₙ(x),  x ∈ B,
           { 0,           x ∉ B,

is also a Borel function (see Problem 7). But then it is evident that η(ω) = limₙ φₙ(ξ(ω)) = φ(ξ(ω)) for all ω ∈ Ω. Consequently Φ̃_ξ = Φ_ξ.
6. Let us consider a probability space (Ω, ℱ, P), where the σ-algebra ℱ is generated by a finite or countably infinite decomposition 𝒟 = {D₁, D₂, …}, Σ Dᵢ = Ω, P(Dᵢ) > 0. Here we shall suppose that the Dᵢ are atoms with respect to P, i.e. if A ⊆ Dᵢ, A ∈ ℱ, then either P(A) = 0 or P(Dᵢ\A) = 0.

Lemma 3. Let ξ be an ℱ-measurable function, where ℱ = σ(𝒟). Then ξ is constant on the atoms of the decomposition, i.e. ξ has the representation

    ξ(ω) = Σ_{k=1}^∞ xₖ I_{Dₖ}(ω)    (P-a.s.).†    (8)

(The notation "ξ = η (P-a.s.)" means that P(ξ ≠ η) = 0.)

PROOF. Let D be an atom of the decomposition with respect to P. Let us show that ξ is constant on D (P-a.s.), i.e. P{D ∩ (ξ ≠ const)} = 0. Let K = sup{x ∈ R : P{D ∩ (ξ < x)} = 0}. Then

    P{D ∩ (ξ < K)} = P[ ⋃_{r<K, r rational} {ω ∈ D : ξ(ω) < r} ] = 0,

since if P{D ∩ (ξ < x)} = 0, then also P{D ∩ (ξ < y)} = 0 for all y ≤ x. Let x > K; then P{D ∩ (ξ < x)} > 0 and therefore P{D ∩ (ξ ≥ x)} = 0, since D is an atom. Therefore

    P{D ∩ (ξ > K)} = P[ ⋃_{r>K, r rational} {ω ∈ D : ξ(ω) ≥ r} ] = 0.

Thus P{D ∩ (ξ > K)} = P{D ∩ (ξ < K)} = 0 and therefore P{D ∩ (ξ ≠ K)} = 0. Then (8) follows in general since Σ Dᵢ = Ω. This completes the proof of the lemma.

† a.s. = almost surely.

7. PROBLEMS

1. Show that the random variable ξ is continuous if and only if P(ξ = x) = 0 for all x ∈ R.

2. If |ξ| is ℱ-measurable, is it true that ξ is also ℱ-measurable?

3. Show that ξ = ξ(ω) is an extended random variable if and only if {ω : ξ(ω) ∈ B} ∈ ℱ for all B ∈ ℬ(R̄).

4. Prove that xⁿ, x⁺ = max(x, 0), x⁻ = −min(x, 0), and |x| = x⁺ + x⁻ are Borel functions.

5. If ξ and η are ℱ-measurable, then {ω : ξ(ω) = η(ω)} ∈ ℱ.

6. Let ξ and η be random variables on (Ω, ℱ), and A ∈ ℱ. Then the function

    ζ(ω) = ξ(ω)·I_A(ω) + η(ω)·I_Ā(ω)

is also a random variable.

7. Let ξ₁, …, ξₙ be random variables and φ(x₁, …, xₙ) a Borel function. Show that φ(ξ₁(ω), …, ξₙ(ω)) is also a random variable.

8. Let ξ and η be random variables, both taking the values 1, 2, …, N. Suppose that ℱ_ξ = ℱ_η. Show that there is a permutation (i₁, i₂, …, i_N) of (1, 2, …, N) such that {ω : ξ = j} = {ω : η = iⱼ} for j = 1, 2, …, N.

§5. Random Elements

1. In addition to random variables, probability theory and its applications involve random objects of more general kinds, for example random points, vectors, functions, processes, fields, sets, measures, etc. In this connection it is desirable to have the concept of a random object of any kind.

Definition 1. Let (Ω, ℱ) and (E, ℰ) be measurable spaces. We say that a function X = X(ω), defined on Ω and taking values in E, is ℱ/ℰ-measurable, or is a random element (with values in E), if

    {ω : X(ω) ∈ B} ∈ ℱ    (1)

for every B ∈ ℰ. Random elements (with values in E) are sometimes called E-valued random variables.

Let us consider some special cases.

If (E, ℰ) = (R, ℬ(R)), the definition of a random element is the same as the definition of a random variable (§4).

Let (E, ℰ) = (Rⁿ, ℬ(Rⁿ)). Then a random element X(ω) is a "random point" in Rⁿ. If πₖ is the projection of Rⁿ on the kth coordinate axis, X(ω) can



be represented in the form

    X(ω) = (ξ₁(ω), …, ξₙ(ω)),    (2)

where ξₖ = πₖ ∘ X. It follows from (1) that ξₖ is an ordinary random variable. In fact, for B ∈ ℬ(R) we have

    {ω : ξₖ(ω) ∈ B} = {ω : ξ₁(ω) ∈ R, …, ξₖ₋₁(ω) ∈ R, ξₖ(ω) ∈ B, ξₖ₊₁(ω) ∈ R, …}
                    = {ω : X(ω) ∈ (R × ⋯ × R × B × R × ⋯ × R)} ∈ ℱ,

since R × ⋯ × R × B × R × ⋯ × R ∈ ℬ(Rⁿ).

Definition 2. An ordered set (η₁(ω), ..., ηₙ(ω)) of random variables is called an n-dimensional random vector.

According to this definition, every random element X(ω) with values in Rⁿ is an n-dimensional random vector. The converse is also true: every random vector X(ω) = (ξ₁(ω), ..., ξₙ(ω)) is a random element in Rⁿ. In fact, if Bₖ ∈ 𝓑(R), k = 1, ..., n, then

{ω: X(ω) ∈ (B₁ × ⋯ × Bₙ)} = ⋂ₖ₌₁ⁿ {ω: ξₖ(ω) ∈ Bₖ} ∈ 𝓕.

But 𝓑(Rⁿ) is the smallest σ-algebra containing the sets B₁ × ⋯ × Bₙ. Consequently we find immediately, by an evident generalization of Lemma 1 of §4, that whenever B ∈ 𝓑(Rⁿ), the set {ω: X(ω) ∈ B} belongs to 𝓕.

Let (E, 𝓔) = (Z, 𝓑(Z)), where Z is the set of complex numbers z = x + iy, x, y ∈ R, and 𝓑(Z) is the smallest σ-algebra containing the sets {z: z = x + iy, a₁ < x ≤ b₁, a₂ < y ≤ b₂}. It follows from the discussion above that a complex-valued random variable Z(ω) can be represented as Z(ω) = X(ω) + iY(ω), where X(ω) and Y(ω) are random variables. Hence we may also call Z(ω) a complex random variable.

Let (E, 𝓔) = (R^T, 𝓑(R^T)), where T is a subset of the real line. In this case every random element X = X(ω) can evidently be represented as X = (ξₜ)ₜ∈T with ξₜ = πₜ ∘ X, and is called a random function with time domain T.
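The correspondence between a random element with values in Rⁿ and its coordinate random variables can be sketched on a small finite sample space. This is a purely illustrative example; the four-point space and the map X below are hypothetical, not taken from the text.

```python
from fractions import Fraction

Omega = [0, 1, 2, 3]                         # outcomes of a toy sample space
P = {w: Fraction(1, 4) for w in Omega}       # uniform probability measure

# A random element X with values in R^2, specified pointwise.
X = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

# Coordinate projections recover ordinary random variables xi_k = pi_k ∘ X.
xi1 = {w: X[w][0] for w in Omega}
xi2 = {w: X[w][1] for w in Omega}

# {w: xi1(w) ∈ B} coincides with {w: X(w) ∈ B × R}, as in the text.
B = {0}
event_via_xi = {w for w in Omega if xi1[w] in B}
event_via_X = {w for w in Omega if X[w][0] in B}
assert event_via_xi == event_via_X
print(sorted(event_via_xi))
```

The same identification of events is what makes measurability of X equivalent to joint measurability of its coordinates.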

Definition 3. Let T be a subset of the real line. A set of random variables X = (ξₜ)ₜ∈T is called a random process† with time domain T.

If T = {1, 2, ...}, we call X = (ξ₁, ξ₂, ...) a random process with discrete time, or a random sequence.

If T = [0, 1], (−∞, ∞), [0, ∞), ..., we call X = (ξₜ)ₜ∈T a random process with continuous time.

† Or stochastic process (Translator).


II. Mathematical Foundations of Probability Theory

It is easy to show, by using the structure of the σ-algebra 𝓑(R^T) (§2), that every random process X = (ξₜ)ₜ∈T (in the sense of Definition 3) is also a random function on the space (R^T, 𝓑(R^T)).

Definition 4. Let X = (ξₜ)ₜ∈T be a random process. For each given ω ∈ Ω the function (ξₜ(ω))ₜ∈T is said to be a realization or a trajectory of the process, corresponding to the outcome ω.

The following definition is a natural generalization of Definition 2 of §4.

Definition 5. Let X = (ξₜ)ₜ∈T be a random process. The probability measure P_X on (R^T, 𝓑(R^T)) defined by

P_X(B) = P{ω: X(ω) ∈ B},  B ∈ 𝓑(R^T),

is called the probability distribution of X. The probabilities

P_{t1,...,tn}(B) = P{ω: (ξ_{t1}, ..., ξ_{tn}) ∈ B}

with t₁ < t₂ < ⋯ < tₙ, tᵢ ∈ T, are called finite-dimensional probabilities (or probability distributions). The functions

F_{t1,...,tn}(x₁, ..., xₙ) = P{ω: ξ_{t1} ≤ x₁, ..., ξ_{tn} ≤ xₙ}

with t₁ < t₂ < ⋯ < tₙ, tᵢ ∈ T, are called finite-dimensional distribution functions.

Let (E, 𝓔) = (C, 𝓑₀(C)), where C is the space of continuous functions x = (xₜ)ₜ∈T on T = [0, 1] and 𝓑₀(C) is the σ-algebra generated by the open sets (§2). We show that every random element X on (C, 𝓑₀(C)) is also a random process with continuous trajectories in the sense of Definition 3. In fact, according to §2 the set A = {x ∈ C: xₜ < a} is open, and hence

{ω: ξₜ(ω) < a} = {ω: X(ω) ∈ A} ∈ 𝓕.

On the other hand, let X = (ξₜ(ω))ₜ∈T be a random process (in the sense of Definition 3) whose trajectories are continuous functions for every ω ∈ Ω. According to (2.14),

{x ∈ C: x ∈ S_ρ(x⁰)} = ⋂_{tₖ} {x ∈ C: |x_{tₖ} − x⁰_{tₖ}| < ρ},

where the tₖ are the rational points of [0, 1]. Therefore

{ω: X(ω) ∈ S_ρ(x⁰)} = ⋂_{tₖ} {ω: |ξ_{tₖ}(ω) − x⁰_{tₖ}| < ρ} ∈ 𝓕,

and therefore we also have {ω: X(ω) ∈ B} ∈ 𝓕 for every B ∈ 𝓑₀(C).

Similar reasoning will show that every random element of the space (D, 𝓑₀(D)) can be considered as a random process with trajectories in the space of functions with no discontinuities of the second kind; and conversely.


2. Let (Ω, 𝓕, P) be a probability space and (E_α, 𝓔_α) measurable spaces, where α belongs to an (arbitrary) set 𝔄.

Definition 6. We say that the 𝓕/𝓔_α-measurable functions (X_α(ω)), α ∈ 𝔄, are independent (or collectively independent) if, for every finite set of indices α₁, ..., αₙ, the random elements X_{α1}, ..., X_{αn} are independent, i.e.

P(X_{α1} ∈ B_{α1}, ..., X_{αn} ∈ B_{αn}) = P(X_{α1} ∈ B_{α1}) ⋯ P(X_{αn} ∈ B_{αn}),  (3)

where B_α ∈ 𝓔_α.

Let 𝔄 = {1, 2, ..., n}, let ξ_α, α ∈ 𝔄, be random variables, and let

F_ξ(x₁, ..., xₙ) = P(ξ₁ ≤ x₁, ..., ξₙ ≤ xₙ)

be the n-dimensional distribution function of the random vector ξ = (ξ₁, ..., ξₙ). Let F_{ξᵢ}(xᵢ) be the distribution functions of the random variables ξᵢ, i = 1, ..., n.

Theorem. A necessary and sufficient condition for the random variables ξ₁, ..., ξₙ to be independent is that

F_ξ(x₁, ..., xₙ) = F_{ξ1}(x₁) ⋯ F_{ξn}(xₙ)  (4)

for all (x₁, ..., xₙ) ∈ Rⁿ.

Proof. The necessity is evident. To prove the sufficiency we put a = (a₁, ..., aₙ), b = (b₁, ..., bₙ),

P_ξ(a, b] = P{ω: a₁ < ξ₁ ≤ b₁, ..., aₙ < ξₙ ≤ bₙ},  P_{ξᵢ}(aᵢ, bᵢ] = P{aᵢ < ξᵢ ≤ bᵢ}.

Then

P_ξ(a, b] = ∏ᵢ₌₁ⁿ [F_{ξᵢ}(bᵢ) − F_{ξᵢ}(aᵢ)] = ∏ᵢ₌₁ⁿ P_{ξᵢ}(aᵢ, bᵢ]

by (4) and (3.7), and therefore

P{ξ₁ ∈ I₁, ..., ξₙ ∈ Iₙ} = ∏ᵢ₌₁ⁿ P{ξᵢ ∈ Iᵢ},  (5)

where Iᵢ = (aᵢ, bᵢ]. We fix I₂, ..., Iₙ and show that

P{ξ₁ ∈ B₁, ξ₂ ∈ I₂, ..., ξₙ ∈ Iₙ} = P{ξ₁ ∈ B₁} ∏ᵢ₌₂ⁿ P{ξᵢ ∈ Iᵢ}  (6)

for all B₁ ∈ 𝓑(R). Let 𝓜 be the collection of sets in 𝓑(R) for which (6) holds. Then 𝓜 evidently contains the algebra 𝓐 of sets consisting of sums of disjoint intervals of the form I₁ = (a₁, b₁]. Hence 𝓐 ⊆ 𝓜 ⊆ 𝓑(R). From


the countable additivity (and therefore continuity) of probability measures it also follows that 𝓜 is a monotonic class. Therefore (see Subsection 1 of §2)

μ(𝓐) ⊆ 𝓜 ⊆ 𝓑(R).

But μ(𝓐) = σ(𝓐) = 𝓑(R) by Theorem 1 of §2. Therefore 𝓜 = 𝓑(R).

Thus (6) is established. Now fix B₁, I₃, ..., Iₙ; by the same method we can establish (6) with I₂ replaced by the Borel set B₂. Continuing in this way, we can evidently arrive at the required equation,

P{ξ₁ ∈ B₁, ..., ξₙ ∈ Bₙ} = ∏ᵢ₌₁ⁿ P{ξᵢ ∈ Bᵢ},

where Bᵢ ∈ 𝓑(R). This completes the proof of the theorem.
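The factorization (4) of the joint distribution function can be checked exactly on a small product space. The following sketch is purely illustrative; the two coordinate distributions below are hypothetical choices, not data from the text.

```python
from fractions import Fraction
from itertools import product

# Two independent coordinates built on a product space.
vals1, p1 = [0, 1], {0: Fraction(1, 2), 1: Fraction(1, 2)}
vals2, p2 = [1, 2, 3], {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Product measure P(w1, w2) = p1(w1) p2(w2) makes the coordinates independent.
P = {(w1, w2): p1[w1] * p2[w2] for w1, w2 in product(vals1, vals2)}

def F_joint(x1, x2):
    return sum(p for (w1, w2), p in P.items() if w1 <= x1 and w2 <= x2)

def F1(x): return sum(p1[w] for w in vals1 if w <= x)
def F2(x): return sum(p2[w] for w in vals2 if w <= x)

# (4): the joint distribution function factorizes at every grid point.
for x1 in [-1, 0, 1]:
    for x2 in [0, 1, 2, 3]:
        assert F_joint(x1, x2) == F1(x1) * F2(x2)
print("F factorizes on the grid")
```

Conversely, the theorem says that this pointwise factorization already forces the factorization over all Borel rectangles.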

3. Problems

1. Let ξ₁, ..., ξₙ be discrete random variables. Show that they are independent if and only if

P(ξ₁ = x₁, ..., ξₙ = xₙ) = ∏ᵢ₌₁ⁿ P(ξᵢ = xᵢ)

for all real x₁, ..., xₙ.

2. Carry out the proof that every random function (in the sense of Definition 1) is a random process (in the sense of Definition 3), and conversely.

3. Let X₁, ..., Xₙ be random elements with values in (E₁, 𝓔₁), ..., (Eₙ, 𝓔ₙ), respectively. In addition let (E′₁, 𝓔′₁), ..., (E′ₙ, 𝓔′ₙ) be measurable spaces and let g₁, ..., gₙ be 𝓔₁/𝓔′₁-, ..., 𝓔ₙ/𝓔′ₙ-measurable functions, respectively. Show that if X₁, ..., Xₙ are independent, the random elements g₁ ∘ X₁, ..., gₙ ∘ Xₙ are also independent.

§6. Lebesgue Integral. Expectation

1. When (Ω, 𝓕, P) is a finite probability space and ξ = ξ(ω) is a simple random variable,

ξ(ω) = Σₖ₌₁ⁿ xₖ I_{Aₖ}(ω),  (1)

the expectation Eξ was defined in §4 of Chapter I. The same definition of the expectation of a simple random variable ξ can be used for any probability space (Ω, 𝓕, P). That is, we define

Eξ = Σₖ₌₁ⁿ xₖ P(Aₖ).  (2)
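Formula (2) is a finite sum and is easy to evaluate exactly. The following sketch does so for a hypothetical simple random variable on a six-point space (the partition and values are illustrative only) and also checks that evaluating ξ pointwise gives the same number, so that (2) does not depend on the particular representation (1).

```python
from fractions import Fraction

# A six-point space with uniform measure (hypothetical example).
Omega = range(6)
P = {w: Fraction(1, 6) for w in Omega}

# Simple random variable: partition A_k -> value x_k.
partition = {(0, 1, 2): 10, (3, 4): -2, (5,): 7}

# (2): E xi = sum_k x_k P(A_k)
E_by_formula = sum(x * sum(P[w] for w in A) for A, x in partition.items())

# Pointwise evaluation gives the same expectation.
xi = {w: x for A, x in partition.items() for w in A}
E_pointwise = sum(xi[w] * P[w] for w in Omega)
assert E_by_formula == E_pointwise
print(E_by_formula)   # 11/2
```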


This definition is consistent (in the sense that Eξ is independent of the particular representation of ξ in the form (1)), as can be shown just as for finite probability spaces. The simplest properties of the expectation can be established similarly (see Subsection 5 of §4 of Chapter I).

In the present section we shall define and study the properties of the expectation Eξ of an arbitrary random variable. In the language of analysis, Eξ is merely the Lebesgue integral of the 𝓕-measurable function ξ = ξ(ω) with respect to the measure P. In addition to Eξ we shall use the notation ∫_Ω ξ(ω)P(dω) or ∫_Ω ξ dP.

Let ξ = ξ(ω) be a nonnegative random variable. We construct a sequence of simple nonnegative random variables {ξₙ}ₙ≥₁ such that ξₙ(ω) ↑ ξ(ω), n → ∞, for each ω ∈ Ω (see Theorem 1 in §4). Since Eξₙ ≤ Eξₙ₊₁ (cf. Property 3) of Subsection 5, §4, Chapter I), the limit limₙ Eξₙ exists, possibly with the value +∞.

Definition 1. The Lebesgue integral of the nonnegative random variable ξ = ξ(ω), or its expectation, is

Eξ = limₙ Eξₙ.  (3)

To see that this definition is consistent, we need to show that the limit is independent of the choice of the approximating sequence {ξₙ}. In other words, we need to show that if ξₙ ↑ ξ and ηₘ ↑ ξ, where {ηₘ} is a sequence of simple functions, then

limₙ Eξₙ = limₘ Eηₘ.  (4)

Lemma 1. Let η and the ξₙ be simple random variables, n ≥ 1, and ξₙ ↑ ξ ≥ η. Then

limₙ Eξₙ ≥ Eη.  (5)

Proof. Let ε > 0 and

Aₙ = {ω: ξₙ ≥ η − ε}.

It is clear that Aₙ ↑ Ω and

ξₙ = ξₙ I_{Aₙ} + ξₙ I_{Āₙ} ≥ ξₙ I_{Aₙ} ≥ (η − ε) I_{Aₙ}.

Hence by the properties of the expectations of simple random variables we find that

Eξₙ ≥ E(η − ε)I_{Aₙ} = E η I_{Aₙ} − εP(Aₙ) = Eη − E η I_{Āₙ} − εP(Aₙ) ≥ Eη − C·P(Āₙ) − ε,


where C = max_ω η(ω). Since ε is arbitrary, the required inequality (5) follows.

It follows from this lemma that limₙ Eξₙ ≥ limₘ Eηₘ, and by symmetry limₘ Eηₘ ≥ limₙ Eξₙ, which proves (4).

The following remark is often useful.

Remark 1. The expectation Eξ of the nonnegative random variable ξ satisfies

Eξ = sup {Es: s ∈ S, s ≤ ξ},  (6)

where S = {s} is the set of simple random variables (Problem 1).

Thus the expectation is well defined for nonnegative random variables. We now consider the general case.

Let ξ be a random variable and ξ⁺ = max(ξ, 0), ξ⁻ = −min(ξ, 0).

Definition 2. We say that the expectation Eξ of the random variable ξ exists, or is defined, if at least one of Eξ⁺ and Eξ⁻ is finite:

min(Eξ⁺, Eξ⁻) < ∞.

In this case we define

Eξ = Eξ⁺ − Eξ⁻.

The expectation Eξ is also called the Lebesgue integral (of the function ξ with respect to the probability measure P).

Definition 3. We say that the expectation of ξ is finite if Eξ⁺ < ∞ and Eξ⁻ < ∞.

Since |ξ| = ξ⁺ + ξ⁻, the finiteness of Eξ, i.e. |Eξ| < ∞, is equivalent to E|ξ| < ∞. (In this sense one says that the Lebesgue integral is absolutely convergent.)

Remark 2. In addition to the expectation Eξ, significant numerical characteristics of a random variable ξ are the numbers Eξʳ (if defined) and E|ξ|ʳ, r > 0, which are known as the moment of order r (or rth moment) and the absolute moment of order r (or absolute rth moment) of ξ.

Remark 3. In the definition of the Lebesgue integral ∫_Ω ξ(ω)P(dω) given above, we supposed that P was a probability measure (P(Ω) = 1) and that the 𝓕-measurable functions (random variables) ξ had values in R = (−∞, ∞). Suppose now that μ is any measure defined on a measurable space (Ω, 𝓕), possibly taking the value +∞, and that ξ = ξ(ω) is an 𝓕-measurable function with values in R̄ = [−∞, ∞] (an extended random variable). In this case the Lebesgue integral ∫_Ω ξ(ω)μ(dω) is defined in the


same way: first, for nonnegative simple ξ (by (2) with P replaced by μ), then for arbitrary nonnegative ξ, and in general by the formula

∫_Ω ξ(ω)μ(dω) = ∫_Ω ξ⁺(ω)μ(dω) − ∫_Ω ξ⁻(ω)μ(dω),

provided that no indeterminacy of the form ∞ − ∞ arises.

A case that is particularly important for mathematical analysis is that in which (Ω, 𝓕) = (R, 𝓑(R)) and μ is Lebesgue measure. In this case the integral ∫_R ξ(x)μ(dx) is written ∫_R ξ(x) dx, or ∫_{−∞}^{∞} ξ(x) dx, or (L) ∫_{−∞}^{∞} ξ(x) dx to emphasize its difference from the Riemann integral (R) ∫_{−∞}^{∞} ξ(x) dx. If the measure μ (Lebesgue–Stieltjes) corresponds to a generalized distribution function G = G(x), the integral ∫_R ξ(x)μ(dx) is also called a Lebesgue–Stieltjes integral and is denoted by (L–S) ∫_R ξ(x)G(dx), a notation that distinguishes it from the corresponding Riemann–Stieltjes integral

(R–S) ∫_R ξ(x)G(dx)

(see Subsection 10 below).

It will be clear from what follows (Property D) that if Eξ is defined then so is the expectation E(ξ I_A) for every A ∈ 𝓕. The notations E(ξ; A) or ∫_A ξ dP are often used for E(ξ I_A) or its equivalent, ∫_Ω ξ I_A dP. The integral ∫_A ξ dP is called the Lebesgue integral of ξ with respect to P over the set A.

Similarly, we write ∫_A ξ dμ instead of ∫_Ω ξ·I_A dμ for an arbitrary measure μ. In particular, if μ is an n-dimensional Lebesgue–Stieltjes measure and A = (a₁, b₁] × ⋯ × (aₙ, bₙ], we use the notation

∫_{a₁}^{b₁} ⋯ ∫_{aₙ}^{bₙ} ξ(x₁, ..., xₙ) μ(dx₁, ..., dxₙ)

instead of ∫_A ξ dμ. If μ is Lebesgue measure, we write simply dx₁ ⋯ dxₙ instead of μ(dx₁, ..., dxₙ).

2. Properties of the expectation Eξ of the random variable ξ.

A. Let c be a constant and let Eξ exist. Then E(cξ) exists and

E(cξ) = cEξ.

B. Let ξ ≤ η; then

Eξ ≤ Eη,

with the understanding that

if −∞ < Eξ, then −∞ < Eη and Eξ ≤ Eη,

or

if Eη < ∞, then Eξ < ∞ and Eξ ≤ Eη.

C. If Eξ exists, then |Eξ| ≤ E|ξ|.

D. If Eξ exists, then E(ξ I_A) exists for each A ∈ 𝓕; if Eξ is finite, E(ξ I_A) is finite.

E. If ξ and η are nonnegative random variables, or such that E|ξ| < ∞ and E|η| < ∞, then

E(ξ + η) = Eξ + Eη.

(See Problem 2 for a generalization.)

Let us establish A–E.

A. This is obvious for simple random variables. Let ξ ≥ 0, ξₙ ↑ ξ, where the ξₙ are simple random variables and c ≥ 0. Then cξₙ ↑ cξ, and therefore

E(cξ) = lim E(cξₙ) = lim cEξₙ = cEξ.

In the general case we need to use the representation ξ = ξ⁺ − ξ⁻ and notice that (cξ)⁺ = cξ⁺ and (cξ)⁻ = cξ⁻ when c ≥ 0, whereas when c < 0, (cξ)⁺ = −cξ⁻ and (cξ)⁻ = −cξ⁺.

B. If 0 ≤ ξ ≤ η, then Eξ and Eη are defined, and the inequality Eξ ≤ Eη follows directly from (6). Now let Eξ > −∞; then Eξ⁻ < ∞. If ξ ≤ η, we have ξ⁺ ≤ η⁺ and ξ⁻ ≥ η⁻. Therefore Eη⁻ ≤ Eξ⁻ < ∞; consequently Eη is defined and

Eξ = Eξ⁺ − Eξ⁻ ≤ Eη⁺ − Eη⁻ = Eη.

The case when Eη < ∞ can be discussed similarly.

C. Since −|ξ| ≤ ξ ≤ |ξ|, Properties A and B imply −E|ξ| ≤ Eξ ≤ E|ξ|, i.e. |Eξ| ≤ E|ξ|.

D. This follows from B and the relations (ξ I_A)⁺ = ξ⁺ I_A ≤ ξ⁺, (ξ I_A)⁻ = ξ⁻ I_A ≤ ξ⁻.

E. Let ξ ≥ 0, η ≥ 0, and let {ξₙ} and {ηₙ} be sequences of simple functions such that ξₙ ↑ ξ and ηₙ ↑ η. Then E(ξₙ + ηₙ) = Eξₙ + Eηₙ and

E(ξₙ + ηₙ) ↑ E(ξ + η),  Eξₙ ↑ Eξ,  Eηₙ ↑ Eη,

and therefore E(ξ + η) = Eξ + Eη. The case when E|ξ| < ∞ and E|η| < ∞ reduces to this if we use the facts that ξ = ξ⁺ − ξ⁻, η = η⁺ − η⁻ and (ξ + η)⁺ + ξ⁻ + η⁻ = (ξ + η)⁻ + ξ⁺ + η⁺.


The following group of statements about expectations involves the notion of "P-almost surely." We say that a property holds "P-almost surely" if there is a set 𝒩 ∈ 𝓕 with P(𝒩) = 0 such that the property holds for every point ω of Ω∖𝒩. Instead of "P-almost surely" we often say "P-almost everywhere" or simply "almost surely" (a.s.) or "almost everywhere" (a.e.).

F. If ξ = 0 (a.s.), then Eξ = 0.

In fact, if ξ is a simple random variable, ξ = Σ xₖ I_{Aₖ}(ω), and xₖ ≠ 0, we have P(Aₖ) = 0 by hypothesis, and therefore Eξ = 0. If ξ ≥ 0 and 0 ≤ s ≤ ξ, where s is a simple random variable, then s = 0 (a.s.), and consequently Es = 0 and Eξ = sup{Es: s ∈ S, s ≤ ξ} = 0. The general case follows from this by means of the representation ξ = ξ⁺ − ξ⁻ and the facts that ξ⁺ ≤ |ξ|, ξ⁻ ≤ |ξ| and |ξ| = 0 (a.s.).

G. If ξ = η (a.s.) and E|ξ| < ∞, then E|η| < ∞ and Eξ = Eη (see also Problem 3).

In fact, let 𝒩 = {ω: ξ ≠ η}. Then P(𝒩) = 0 and ξ = ξ I_𝒩 + ξ I_𝒩̄, η = η I_𝒩 + η I_𝒩̄ = η I_𝒩 + ξ I_𝒩̄. By Properties E and F, we have Eξ = E ξ I_𝒩 + E ξ I_𝒩̄ = E ξ I_𝒩̄. But E η I_𝒩 = 0, and therefore Eη = E η I_𝒩 + E ξ I_𝒩̄ = Eξ, by Property E.

H. Let ξ ≥ 0 and Eξ = 0. Then ξ = 0 (a.s.).

For the proof, let A = {ω: ξ(ω) > 0}, Aₙ = {ω: ξ(ω) ≥ 1/n}. It is clear that Aₙ ↑ A and 0 ≤ ξ·I_{Aₙ} ≤ ξ·I_A. Hence, by Property B,

0 ≤ E ξ I_{Aₙ} ≤ Eξ = 0.

Consequently

0 = E ξ I_{Aₙ} ≥ (1/n) P(Aₙ),

and therefore P(Aₙ) = 0 for all n ≥ 1. But P(A) = lim P(Aₙ), and therefore P(A) = 0.

I. Let ξ and η be such that E|ξ| < ∞, E|η| < ∞ and E(ξ I_A) ≤ E(η I_A) for all A ∈ 𝓕. Then ξ ≤ η (a.s.).

In fact, let B = {ω: ξ(ω) > η(ω)}. Then E(η I_B) ≤ E(ξ I_B) ≤ E(η I_B), and therefore E(ξ I_B) = E(η I_B). By Property E, we have E((ξ − η) I_B) = 0, and by Property H we have (ξ − η) I_B = 0 (a.s.), whence P(B) = 0.

J. Let ξ be an extended random variable and E|ξ| < ∞. Then |ξ| < ∞ (a.s.).

In fact, let A = {ω: |ξ(ω)| = ∞} and suppose P(A) > 0. Then E|ξ| ≥ E(|ξ| I_A) = ∞·P(A) = ∞, which contradicts the hypothesis E|ξ| < ∞. (See also Problem 4.)

3. Here we consider the fundamental theorems on taking limits under the expectation sign (or the Lebesgue integral sign).


Theorem 1 (On Monotone Convergence). Let η, ξ, ξ₁, ξ₂, ... be random variables.

(a) If ξₙ ≥ η for all n ≥ 1, Eη > −∞, and ξₙ ↑ ξ, then Eξₙ ↑ Eξ.
(b) If ξₙ ≤ η for all n ≥ 1, Eη < ∞, and ξₙ ↓ ξ, then Eξₙ ↓ Eξ.

Proof. (a) First suppose that η ≥ 0. For each k ≥ 1 let {ξₖ⁽ⁿ⁾}ₙ≥₁ be a sequence of simple functions such that ξₖ⁽ⁿ⁾ ↑ ξₖ, n → ∞. Put ζ⁽ⁿ⁾ = max_{1≤k≤n} ξₖ⁽ⁿ⁾. Then

ζ⁽ⁿ⁻¹⁾ ≤ ζ⁽ⁿ⁾ = max_{1≤k≤n} ξₖ⁽ⁿ⁾ ≤ max_{1≤k≤n} ξₖ = ξₙ.

Let ζ = limₙ ζ⁽ⁿ⁾. Since

ξₖ⁽ⁿ⁾ ≤ ζ⁽ⁿ⁾ ≤ ξₙ

for 1 ≤ k ≤ n, we find by taking limits as n → ∞ that

ξₖ ≤ ζ ≤ ξ

for every k ≥ 1, and therefore ξ = ζ. The random variables ζ⁽ⁿ⁾ are simple and ζ⁽ⁿ⁾ ↑ ζ. Therefore

Eξ = Eζ = lim E ζ⁽ⁿ⁾ ≤ lim Eξₙ.

On the other hand, it is obvious, since ξₙ ≤ ξₙ₊₁ ≤ ξ, that

lim Eξₙ ≤ Eξ.

Consequently lim Eξₙ = Eξ.

Now let η be any random variable with Eη > −∞. If Eη = ∞, then Eξₙ = Eξ = ∞ by Property B, and our proposition is proved. Let Eη < ∞. Then together with Eη > −∞ we find E|η| < ∞. It is clear that 0 ≤ ξₙ − η ↑ ξ − η for all ω ∈ Ω. Therefore, by what has been established, E(ξₙ − η) ↑ E(ξ − η), and therefore (by Property E and Problem 2)

Eξₙ − Eη ↑ Eξ − Eη.

But E|η| < ∞, and therefore Eξₙ ↑ Eξ, n → ∞.

The proof of (b) follows from (a) if we replace the original variables by their negatives.
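Monotone convergence can be watched numerically. The sketch below is illustrative only: it builds a hypothetical geometric-type distribution on a finite set of atoms, truncates the variable at level n, and checks that the expectations increase to the full expectation.

```python
from fractions import Fraction

# P(w = k) = 2^{-k}, k = 1, ..., 60, with the remaining tail folded into the last atom.
N = 60
P = {k: Fraction(1, 2**k) for k in range(1, N + 1)}
P[N] += 1 - sum(P.values())          # make P a probability measure

xi = {k: k for k in P}               # nonnegative random variable xi(w) = w

def E(f):                            # exact expectation on this finite space
    return sum(f[k] * P[k] for k in P)

# xi_n = min(xi, n) is simple, nondecreasing in n, and xi_n ↑ xi.
prev = Fraction(0)
for n in [1, 2, 5, 10, N]:
    xin = {k: min(v, n) for k, v in xi.items()}
    assert E(xin) >= prev            # E xi_n is nondecreasing
    prev = E(xin)
assert prev == E(xi)                 # the limit of E xi_n is E xi
print(E(xi))
```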

Corollary. Let {ηₙ}ₙ≥₁ be a sequence of nonnegative random variables. Then

E Σₙ₌₁^∞ ηₙ = Σₙ₌₁^∞ Eηₙ.


The proof follows from Property E (see also Problem 2), the monotone convergence theorem, and the remark that

Σₙ₌₁^k ηₙ ↑ Σₙ₌₁^∞ ηₙ,  k → ∞.

Theorem 2 (Fatou's Lemma). Let η, ξ₁, ξ₂, ... be random variables.

(a) If ξₙ ≥ η for all n ≥ 1 and Eη > −∞, then

E lim inf ξₙ ≤ lim inf Eξₙ.

(b) If ξₙ ≤ η for all n ≥ 1 and Eη < ∞, then

lim sup Eξₙ ≤ E lim sup ξₙ.

(c) If |ξₙ| ≤ η for all n ≥ 1 and Eη < ∞, then

E lim inf ξₙ ≤ lim inf Eξₙ ≤ lim sup Eξₙ ≤ E lim sup ξₙ.  (7)

Proof. (a) Let ζₙ = inf_{m≥n} ξₘ; then

lim inf ξₙ = limₙ inf_{m≥n} ξₘ = limₙ ζₙ.

It is clear that ζₙ ↑ lim inf ξₙ and ζₙ ≥ η for all n ≥ 1. Then by Theorem 1

E lim inf ξₙ = E limₙ ζₙ = limₙ Eζₙ = lim infₙ Eζₙ ≤ lim inf Eξₙ,

which establishes (a). The second conclusion follows from the first. The third is a corollary of the first two.

and (9) as n--.. oo. PROOF.

llm ~~~ =

Formula (7) is valid by Fatou's lemma. By hypothesis, lim~~~= ~ (a.s.). Therefore by Property G, E lim ~~~ = limE~ .. = limE~ .. = E lim ~~~ = E~,

which establishes (8). It is also clear that I~ I ~ tl· Hence EI~ I < oo. Conclusion (9) can be proved in the same way if we observe that

1~..

-" ~ 21].

186

II. Mathematical Foundations of Probability Theory

Corollary Let fl, e. e1• ••• be random variables such that Ien I :=:;; f/, en -+ e (a.s.) and E17P < oo for some p > 0. Then E 1e1P < oo and E1e- eniP-+ 0, n-+ oo. 0

For the proof, it is sufficient to observe that 1e1 :=:;; fl, I~- ~niP:=:;;
The condition "I en I :=:;; f/, Ef/ < 00" that appears in Fatou's lemma and the dominated convergence theorem, and ensures the validity of formulas (7)-(9), can be somewhat weakened. In order to be able to state the corresponding result (Theorem 4), we introduce the following definition. Definition 4. A family integrable if sup 11

{en} 11 ~ 1

f

J{i~nl>c)

of random variables is said to be uniformly c-+

le,IP(dw)-+0,

00,

(10)

or, in a different notation, sup E[ Ie~~l Iu~"' >c}J -+ 0,

c-+

00.

(11)

II

It is clear that if ell' n ;;::: 1, satisfy Iell I :=:; f/, Ef/ < oo, then the family {ell} 11 ~ 1 is uniformly integrable. Theorem 4. Let

Then

{~ 11 }n~ 1

be a uniformly integrable family ofrandom variables.

(a) E lim e11 :=:;; lim Ee11 :=:;; ITiii Ee, :=:; E ITiii eli. (b) If in addition 11 -+ (a.s.) then~ is integrable and

e e

Ee,-+ Ee, Ele~~-el-+0,

n-+ oo, n-+oo.

PRooF. (a) For every c > 0 (12)

By uniform integrability, for every e > 0 we can take c so large that sup IE[e~~I 1~"<-cj]l <e.

(13)

II

By Fatou's lemma, lim E[e,I 1~"~ -c}J ;;::: E[lim ~nl{~n~ -c}J.

But

~,If~"~ -cl

;;:::

e, and therefore

lim E[e,J!~"~-clJ;;::: E[lim e,].

(14)

187

§6. Lebesgue Integral. Expectation

From (12)-(14) we obtain

Since e > 0 is arbitrary, it follows that lim E~n 2: E lim ~n· The inequality with upper limits, 1lm E~n :::;; E lim~"' is proved similarly. Conclusion (b) can be deduced from (a) as in Theorem 3. The deeper significance of the concept of uniform integrability is revealed by the following theorem, which gives a necessary and sufficient condition for taking limits under the expectation sign. ~n-> ~and E~n < oo. Then E~n-> E~ < oo the family {~n}n> 1 is uniformly integrable.

Theorem 5. Let 0 :::;;

if and only if

PROOF. The sufficiency follows from conclusion (b) of Theorem 4. For the proof of the necessity we consider the (at most countable) set

A

= {a:

P(~

= a) > 0}.

Then we have ~n/{~" Or~
is uniformly integrable. Hence, by the sufficiency part of the theorem, we have E~nlr~" E~/~~"
Take an e rel="nofollow"> 0 and choose a 0 N 0 so large that

E

A,

n-> oo.

(15)

A so large that E ~/ (~ 2: ao) < ej2; then choose

E~nlr~"2:aoJ :::;; E~lr~~aoJ

+ e/2

for all n 2: N 0 , and consequently E~n/{~"~ao}:::;; e. Then choose a 1 2: a0 so large that E~r~"2:a!}::::;; e for all n:::;; N 0 • Then we have supE~nlr~"~ad :::;; n

e,

which establishes the uniform integrability of the family {~n} n 2: 1 of random variables.

4. Let us notice some tests for uniform integrability. We first observe that if {~n} is a family of uniformly integrable random variables, then sup EI~n I < oo. n

(16)

188

II. Mathematical Foundations of Probability Theory

In fact, for a given ~> > 0 and sufficiently large c > 0 sup E l~nl =sup [E(I~n IIu~nl~cl) n

n

s

supE(I~n IImnl~c}} n

+ E(l~n IIu~nl
which establishes (16). It turns out that (16) together with a condition of uniform continuity is necessary and sufficient for uniform integrability.

Lemma 2. A necessary and sufficient condition for a family {~n}n~ 1 of random variables to be uniformly integrable is that E I~n I, n ;;::: 1, are uniformly bounded (i.e., (16) holds) and that E{ I~n IIA},n ;;::: 1, are uniformly absolutely continuous (i.e. sup E{I ~n II A} --+ 0 when P(A) --+ 0). PROOF.

Necessity. Condition (16) was verified above. Moreover, E{l~niiA} = E{l~niiArl{l~nl~c}} + E{l~niiAn{i~nl
Take c so large that supn E{I~II 11 ~"I"cJ}

s

~>/2.

sup E{I ~n II A} S

Then if P(A)

s

~>/2c,

(17) we have

I>

by (17). This establishes the uniform absolute continuity. Sufficiency. Let 1> > 0 and b > 0 be chosen so that P(A) < b implies that E (I ~n II A) $ e, uniformly in n. Since El~nl ~ EJ~nliu~nl~cl ~ cP{l~nl ~ c}

for every c > 0 (cf. Chebyshev's inequality), we have sup

P{l~nl;;:::

n

c}

1 c

s- sup E l~nl-+ 0,

c--+ 00,

and therefore, when c is sufficiently large, any set {I ~n I ;;::: c }, n ;;::: 1, can be taken as A. Therefore sup E(l ~nl I 0 ~" 1 ~, 1 ) s ~>,which establishes the uniform integrability. This completes the proof of the lemma. The following proposition provides a simple sufficient condition for uniform integrability.

Lemma 3. Let ξ₁, ξ₂, ... be a sequence of integrable random variables and G = G(t) a nonnegative increasing function, defined for t ≥ 0, such that

lim_{t→∞} G(t)/t = ∞,  (18)

sup_n E[G(|ξₙ|)] < ∞.  (19)

Then the family {ξₙ}ₙ≥₁ is uniformly integrable.


Proof. Let ε > 0, M = sup_n E[G(|ξₙ|)], a = M/ε. Take c so large that G(t)/t ≥ a for t ≥ c. Then

E[|ξₙ| I{|ξₙ| ≥ c}] ≤ (1/a) E[G(|ξₙ|)·I{|ξₙ| ≥ c}] ≤ M/a = ε

uniformly for n ≥ 1.

5. If ξ and η are independent simple random variables, we can show, as in Subsection 5 of §4 of Chapter I, that Eξη = Eξ·Eη. Let us now establish a similar proposition in the general case (see also Problem 5).

Theorem 6. Let ξ and η be independent random variables with E|ξ| < ∞, E|η| < ∞. Then E|ξη| < ∞ and

Eξη = Eξ·Eη.  (20)

Proof. First let ξ ≥ 0, η ≥ 0. Put

ξₙ = Σ_{k≥0} (k/n) I{k/n ≤ ξ(ω) < (k+1)/n},  ηₙ = Σ_{k≥0} (k/n) I{k/n ≤ η(ω) < (k+1)/n}.

Then ξₙ ≤ ξ, |ξₙ − ξ| ≤ 1/n and ηₙ ≤ η, |ηₙ − η| ≤ 1/n. Since Eξ < ∞ and Eη < ∞, it follows from Lebesgue's dominated convergence theorem that

lim Eξₙ = Eξ,  lim Eηₙ = Eη.

Moreover, since ξ and η are independent,

Eξₙηₙ = Σ_{k,l≥0} (kl/n²) E[I{k/n ≤ ξ < (k+1)/n} I{l/n ≤ η < (l+1)/n}]
= Σ_{k,l≥0} (kl/n²) E I{k/n ≤ ξ < (k+1)/n} · E I{l/n ≤ η < (l+1)/n} = Eξₙ · Eηₙ.

Now notice that

|Eξη − Eξₙηₙ| ≤ E[|ξ|·|η − ηₙ|] + E[|ηₙ|·|ξ − ξₙ|] ≤ (1/n)Eξ + (1/n)E(η + 1/n) → 0,  n → ∞.

Therefore Eξη = limₙ Eξₙηₙ = lim Eξₙ · lim Eηₙ = Eξ·Eη, and Eξη < ∞.

The general case reduces to this one if we use the representations ξ = ξ⁺ − ξ⁻, η = η⁺ − η⁻, ξη = ξ⁺η⁺ − ξ⁻η⁺ − ξ⁺η⁻ + ξ⁻η⁻. This completes the proof.

6. The inequalities for expectations that we develop in this subsection are regularly used both in probability theory and in analysis.
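For discrete random variables, (20) is a finite identity that can be verified exactly. The example below is hypothetical (the two distributions are arbitrary illustrative choices); independence is encoded by taking the product measure.

```python
from fractions import Fraction
from itertools import product

vals1, p1 = [1, 2], {1: Fraction(1, 3), 2: Fraction(2, 3)}
vals2, p2 = [0, 5], {0: Fraction(1, 4), 5: Fraction(3, 4)}

# Product measure: xi and eta are independent by construction.
P = {(a, b): p1[a] * p2[b] for a, b in product(vals1, vals2)}

E_xi = sum(a * p1[a] for a in vals1)
E_eta = sum(b * p2[b] for b in vals2)
E_prod = sum(a * b * p for (a, b), p in P.items())

assert E_prod == E_xi * E_eta        # (20): E xi eta = E xi · E eta
print(E_prod)                        # 25/4
```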


Chebyshev's Inequality. Let ξ be a nonnegative random variable. Then for every ε > 0

P(ξ ≥ ε) ≤ Eξ/ε.  (21)

The proof follows immediately from

Eξ ≥ E[ξ·I{ξ ≥ ε}] ≥ ε E I{ξ ≥ ε} = εP(ξ ≥ ε).

From (21) we can obtain the following variants of Chebyshev's inequality: if ξ is any random variable, then

P(ξ ≥ ε) ≤ Eξ²/ε²  (22)

and

P(|ξ − Eξ| ≥ ε) ≤ Vξ/ε²,  (23)

where Vξ = E(ξ − Eξ)² is the variance of ξ.
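Inequality (23) can be checked exactly for any finite distribution. The three-point distribution below is a hypothetical example chosen only to make the arithmetic explicit.

```python
from fractions import Fraction

# A small distribution: value -> probability (illustrative numbers).
dist = {-2: Fraction(1, 8), 0: Fraction(5, 8), 4: Fraction(2, 8)}

E = sum(x * p for x, p in dist.items())              # mean: 3/4
V = sum((x - E) ** 2 * p for x, p in dist.items())   # variance: 63/16

# (23): P(|xi - E xi| >= eps) <= V xi / eps^2 for each eps.
for eps in [1, 2, 3, 4]:
    lhs = sum(p for x, p in dist.items() if abs(x - E) >= eps)
    assert lhs <= V / eps**2
print(E, V)
```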

The Cauchy–Bunyakovskii Inequality. Let ξ and η satisfy Eξ² < ∞, Eη² < ∞. Then E|ξη| < ∞ and

(E|ξη|)² ≤ Eξ² · Eη².  (24)

Proof. Suppose that Eξ² > 0, Eη² > 0. Then, with ξ̃ = ξ/√(Eξ²), η̃ = η/√(Eη²), we find, since 2|ξ̃η̃| ≤ ξ̃² + η̃², that

2E|ξ̃η̃| ≤ Eξ̃² + Eη̃² = 2,

i.e. E|ξ̃η̃| ≤ 1, which establishes (24).

On the other hand, if, say, Eξ² = 0, then ξ = 0 (a.s.) by Property H, and then Eξη = 0 by Property F, i.e. (24) is still satisfied.

Jensen's Inequality. Let the Borel function g = g(x) be convex downward and E|ξ| < ∞. Then

g(Eξ) ≤ Eg(ξ).  (25)

Proof. If g = g(x) is convex downward, for each x₀ ∈ R there is a number λ(x₀) such that

g(x) ≥ g(x₀) + (x − x₀)·λ(x₀)  (26)

for all x ∈ R. Putting x = ξ and x₀ = Eξ, we find from (26) that

g(ξ) ≥ g(Eξ) + (ξ − Eξ)·λ(Eξ),

and consequently Eg(ξ) ≥ g(Eξ).


A whole series of useful inequalities can be derived from Jensen's inequality. We obtain the following one as an example.

Lyapunov's Inequality. If 0 < s < t, then

(E|ξ|ˢ)^{1/s} ≤ (E|ξ|ᵗ)^{1/t}.  (27)

To prove this, let r = t/s. Then, putting η = |ξ|ˢ and applying Jensen's inequality to g(x) = |x|ʳ, we obtain |Eη|ʳ ≤ E|η|ʳ, i.e.

(E|ξ|ˢ)^{t/s} ≤ E|ξ|ᵗ,

which establishes (27).

The following chain of inequalities among absolute moments is a consequence of Lyapunov's inequality:

E|ξ| ≤ (E|ξ|²)^{1/2} ≤ ⋯ ≤ (E|ξ|ⁿ)^{1/n}.  (28)

Hölder's Inequality. Let 1 < p < ∞, 1 < q < ∞, and (1/p) + (1/q) = 1. If E|ξ|ᵖ < ∞ and E|η|^q < ∞, then E|ξη| < ∞ and

E|ξη| ≤ (E|ξ|ᵖ)^{1/p} (E|η|^q)^{1/q}.  (29)

If E|ξ|ᵖ = 0 or E|η|^q = 0, (29) follows immediately, as for the Cauchy–Bunyakovskii inequality (which is the special case p = q = 2 of Hölder's inequality). Now let E|ξ|ᵖ > 0, E|η|^q > 0, and put

ξ̃ = ξ/(E|ξ|ᵖ)^{1/p},  η̃ = η/(E|η|^q)^{1/q}.

We apply the inequality

x^a y^b ≤ ax + by,  (30)

which holds for positive x, y, a, b with a + b = 1, and follows immediately from the concavity of the logarithm:

ln[ax + by] ≥ a ln x + b ln y = ln x^a y^b.

Then, putting x = |ξ̃|ᵖ, y = |η̃|^q, a = 1/p, b = 1/q, we find that

|ξ̃η̃| ≤ (1/p)|ξ̃|ᵖ + (1/q)|η̃|^q,

whence

E|ξ̃η̃| ≤ (1/p)E|ξ̃|ᵖ + (1/q)E|η̃|^q = 1/p + 1/q = 1.

This establishes (29).
192

II. Mathematical Foundations of Probability Theory

Minkowski's Inequality.JfE I~IP < oo, E1'71P < oo, 1 $ p < oo, then we have E I~ + 'liP < oo and (E I~ + '71P)l/p

$

(E I~ IP)lfp + (E I'71P)lip.

(31)

We begin by establishing the following inequality: if a, b > 0 and p then

~

1,

(32) In fact, consider the function F(x) = (a + x)P - 2p- 1(aP + xP). Then F'(x) = p(a + x)p-l- 2P- 1pxp-l,

and since p

~

1, we have F'(a) = 0, F'(x) > 0 for x < a and F'(x) < 0 for

x > a. Therefore

F(b) $ max F(x) = F(a) = 0, from which (32) follows. According to this inequality,

and therefore if EI~ IP < oo and EI'71P < oo it follows that EI~ + 'liP < oo. If p = 1, inequality (31) follows from (33). Now suppose that p > 1. Take q > 1 so that (1/p)

+

(1/q)

=

1. Then

I~+ 'lip= I~+ '71·1~ + '71p-l Sl~l·l~ + '71p-l + 1'711~ + '71p-l.

(34)

Notice that (p - 1)q = p. Consequently E(l~

+ '7ip-l)q = Ei~ +'liP< oo,

and therefore by Holder's inequality E(l~ll~

+ '7ip-1)

+ '7i(p-l>q)liq = (E I~ IP) 11P(E I~ + '7ip)lfq < 00. ~ (Ei~IP)liP(Ei~

In the same way, E( I'1 II~ + '71p-l) $ (E I'71P) 11P(E I~ + '7ip)lfq, Consequently, by (34),

EI~ + 'lip

$

(E I~ + '71P)lfq((E I~ IP)l/p + (E I'71P) 11P).

IfE I~ + 'liP = 0, the desired inequality (31) is evident. Now let EI~ Then we obtain

(E I~ + '71P)l-(l/q)

~

(E I~ IP)lfp + (E I'71P)lfp

from (35), and (31) follows since 1 - (1/q)

= 1/p.

(35)

+ 'liP > 0.


7. Let ξ be a random variable for which Eξ is defined. Then, by Property D, the set function

Q(A) = ∫_A ξ dP,  A ∈ 𝓕,  (36)

is well defined. Let us show that this function is countably additive. First suppose that ξ is nonnegative. If A₁, A₂, ... are pairwise disjoint sets from 𝓕 and A = ΣAₙ, the corollary to Theorem 1 implies that

Q(A) = E(ξ·I_A) = E(ξ·I_{ΣAₙ}) = E(Σ ξ·I_{Aₙ}) = Σ E(ξ·I_{Aₙ}) = Σ Q(Aₙ).

If ξ is an arbitrary random variable for which Eξ is defined, the countable additivity of Q(A) follows from the representation

Q(A) = Q⁺(A) − Q⁻(A),  (37)

where

Q⁺(A) = ∫_A ξ⁺ dP,  Q⁻(A) = ∫_A ξ⁻ dP,

together with the countable additivity for nonnegative random variables and the fact that min(Q⁺(Ω), Q⁻(Ω)) < ∞.

Thus if Eξ is defined, the set function Q = Q(A) is a signed measure, that is, a countably additive set function representable as Q = Q₁ − Q₂, where at least one of the measures Q₁ and Q₂ is finite.

We now show that Q = Q(A) has the following important property of absolute continuity with respect to P:

if P(A) = 0 then Q(A) = 0  (A ∈ 𝓕)

(this property is denoted by the abbreviation Q ≪ P).

To prove this property we first consider nonnegative random variables. If ξ = Σₖ₌₁ⁿ xₖ I_{Aₖ} is a simple nonnegative random variable and P(A) = 0, then

Q(A) = E(ξ·I_A) = Σₖ₌₁ⁿ xₖ P(Aₖ ∩ A) = 0.

If {ξₙ}ₙ≥₁ is a sequence of nonnegative simple functions such that ξₙ ↑ ξ ≥ 0, then the theorem on monotone convergence shows that

Q(A) = E(ξ·I_A) = lim E(ξₙ·I_A) = 0,

since E(ξₙ·I_A) = 0 for all n ≥ 1 and every A with P(A) = 0.

Thus the Lebesgue integral Q(A) = ∫_A ξ dP, considered as a function of sets A ∈ 𝓕, is a signed measure that is absolutely continuous with respect to P (Q ≪ P). It is quite remarkable that the converse is also valid.


Radon–Nikodym Theorem. Let (Ω, 𝓕) be a measurable space, μ a σ-finite measure, and λ a signed measure (i.e., λ = λ₁ − λ₂, where at least one of the measures λ₁ and λ₂ is finite) which is absolutely continuous with respect to μ. Then there is an 𝓕-measurable function f = f(ω) with values in R̄ = [−∞, ∞] such that

λ(A) = ∫_A f(ω)μ(dω),  A ∈ 𝓕.  (38)

The function f(ω) is unique up to sets of μ-measure zero: if h = h(ω) is another 𝓕-measurable function such that λ(A) = ∫_A h(ω)μ(dω), A ∈ 𝓕, then μ{ω: f(ω) ≠ h(ω)} = 0. If λ is a measure, then f = f(ω) has its values in R̄₊ = [0, ∞].

Remark. The function f = f(ω) in the representation (38) is called the Radon–Nikodym derivative, or the density of the measure λ with respect to μ, and is denoted by dλ/dμ or (dλ/dμ)(ω).

The Radon–Nikodym theorem, which we quote without proof, will play a key role in the construction of conditional expectations (§7).

8. If ξ = Σᵢ₌₁ⁿ xᵢ I_{Aᵢ} is a simple random variable, then

E g(ξ) = Σᵢ g(xᵢ)P(Aᵢ) = Σᵢ g(xᵢ) ΔF_ξ(xᵢ).  (39)

In other words, in order to calculate the expectation of a function of the (simple) random variable ξ it is unnecessary to know the probability measure P completely; it is enough to know the probability distribution P_ξ or, equivalently, the distribution function F_ξ of ξ.

The following important theorem generalizes this property.
**Theorem 7** (Change of Variables in a Lebesgue Integral). Let $(\Omega, \mathcal{F})$ and $(E, \mathcal{E})$ be measurable spaces and $X = X(\omega)$ an $\mathcal{F}/\mathcal{E}$-measurable function with values in $E$. Let $P$ be a probability measure on $(\Omega, \mathcal{F})$ and $P_X$ the probability measure on $(E, \mathcal{E})$ induced by $X = X(\omega)$:

$$P_X(A) = P\{\omega : X(\omega) \in A\}, \qquad A \in \mathcal{E}. \qquad (40)$$

Then

$$\int_A g(x)\,P_X(dx) = \int_{X^{-1}(A)} g(X(\omega))\,P(d\omega), \qquad A \in \mathcal{E}, \qquad (41)$$

for every $\mathcal{E}$-measurable function $g = g(x)$, $x \in E$ (in the sense that if one integral exists, the other is well defined, and the two are equal).

Proof. Let $A \in \mathcal{E}$ and $g(x) = I_B(x)$, where $B \in \mathcal{E}$. Then (41) becomes

$$P_X(A \cap B) = P(X^{-1}(A) \cap X^{-1}(B)), \qquad (42)$$


§6. Lebesgue Integral. Expectation

which follows from (40) and the observation that $X^{-1}(A) \cap X^{-1}(B) = X^{-1}(A \cap B)$. It follows from (42) that (41) is valid for nonnegative simple functions $g = g(x)$, and therefore, by the monotone convergence theorem, also for all nonnegative $\mathcal{E}$-measurable functions. In the general case we need only represent $g$ as $g^+ - g^-$. Then, since (41) is valid for $g^+$ and $g^-$, if (for example) $\int_A g^+(x)\,P_X(dx) < \infty$, we have

$$\int_{X^{-1}(A)} g^+(X(\omega))\,P(d\omega) < \infty$$

also, and therefore the existence of $\int_A g(x)\,P_X(dx)$ implies the existence of $\int_{X^{-1}(A)} g(X(\omega))\,P(d\omega)$.

**Corollary.** Let $(E, \mathcal{E}) = (R, \mathcal{B}(R))$ and let $\xi = \xi(\omega)$ be a random variable with probability distribution $P_\xi$. Then if $g = g(x)$ is a Borel function and either of the integrals $\int_A g(x)\,P_\xi(dx)$ or $\int_{\xi^{-1}(A)} g(\xi(\omega))\,P(d\omega)$ exists, we have

$$\int_A g(x)\,P_\xi(dx) = \int_{\xi^{-1}(A)} g(\xi(\omega))\,P(d\omega).$$

In particular, for $A = R$ we obtain

$$E g(\xi(\omega)) = \int_\Omega g(\xi(\omega))\,P(d\omega) = \int_R g(x)\,P_\xi(dx). \qquad (43)$$

The measure $P_\xi$ can be uniquely reconstructed from the distribution function $F_\xi$ (Theorem 1 of §3). Hence the Lebesgue integral $\int_R g(x)\,P_\xi(dx)$ is often denoted by $\int_R g(x)\,F_\xi(dx)$ and called a Lebesgue–Stieltjes integral (with respect to the measure corresponding to the distribution function $F_\xi(x)$).

Let us consider the case when $F_\xi(x)$ has a density $f_\xi(x)$, i.e. let

$$F_\xi(x) = \int_{-\infty}^x f_\xi(y)\,dy, \qquad (44)$$

where $f_\xi = f_\xi(x)$ is a nonnegative Borel function and the integral is a Lebesgue integral with respect to Lebesgue measure on the set $(-\infty, x]$ (see Remark 2 in Subsection 1). With the assumption of (44), formula (43) takes the form

$$E g(\xi(\omega)) = \int_{-\infty}^{\infty} g(x) f_\xi(x)\,dx, \qquad (45)$$

where the integral is the Lebesgue integral of the function $g(x) f_\xi(x)$ with respect to Lebesgue measure. In fact, if $g(x) = I_B(x)$, $B \in \mathcal{B}(R)$, the formula becomes

$$P_\xi(B) = \int_B f_\xi(x)\,dx, \qquad B \in \mathcal{B}(R); \qquad (46)$$
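Formula (45) can be checked numerically; a sketch with a hypothetical choice of distribution and $g$ — here $\xi \sim \mathrm{Exp}(1)$ and $g(x) = x^2$, so both sides should be near $\Gamma(3) = 2$:

```python
# Sketch of (45): the left side E g(xi) is estimated by Monte Carlo
# over the underlying space, the right side by a Riemann sum of
# g(x) f_xi(x).  Exp(1) and g(x) = x^2 are arbitrary choices.
import math
import random

def f_xi(x):                  # density of Exp(1)
    return math.exp(-x) if x >= 0 else 0.0

def g(x):
    return x * x

# Right side of (45): midpoint rule on [0, 40] (the tail is negligible).
steps, b = 100_000, 40.0
h = b / steps
integral = sum(g((k + 0.5) * h) * f_xi((k + 0.5) * h) for k in range(steps)) * h

# Left side of (45): sample xi directly and average g(xi).
random.seed(0)
mc = sum(g(random.expovariate(1.0)) for _ in range(100_000)) / 100_000
```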


its correctness follows from Theorem 1 of §3 and the formula

$$F_\xi(b) - F_\xi(a) = \int_a^b f_\xi(x)\,dx.$$

In the general case, the proof is the same as for Theorem 7.

9. Let us consider the special case of measurable spaces $(\Omega, \mathcal{F})$ with a measure $\mu$, where $\Omega = \Omega_1 \times \Omega_2$, $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, and $\mu = \mu_1 \times \mu_2$ is the direct product of the measures $\mu_1$ and $\mu_2$ (i.e., the measure on $\mathcal{F}$ such that

$$\mu(A \times B) = \mu_1(A)\,\mu_2(B), \qquad A \in \mathcal{F}_1,\ B \in \mathcal{F}_2;$$

the existence of this measure follows from the proof of Theorem 8). The following theorem plays the same role as the theorem on the reduction of a double Riemann integral to an iterated integral.

**Theorem 8** (Fubini's Theorem). Let $\xi = \xi(\omega_1, \omega_2)$ be an $\mathcal{F}_1 \otimes \mathcal{F}_2$-measurable function, integrable with respect to the measure $\mu_1 \times \mu_2$:

$$\int_{\Omega_1 \times \Omega_2} |\xi(\omega_1, \omega_2)|\,d(\mu_1 \times \mu_2) < \infty. \qquad (47)$$

Then the integrals $\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)$ and $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$

(1) are defined for all $\omega_1$ and $\omega_2$;
(2) are respectively $\mathcal{F}_2$- and $\mathcal{F}_1$-measurable functions with

$$\mu_2\Bigl\{\omega_2 : \int_{\Omega_1} |\xi(\omega_1, \omega_2)|\,\mu_1(d\omega_1) = \infty\Bigr\} = 0, \qquad \mu_1\Bigl\{\omega_1 : \int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2) = \infty\Bigr\} = 0 \qquad (48)$$

and (3)

$$\int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_2}\Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)\Bigr]\mu_2(d\omega_2). \qquad (49)$$

Proof. We first show that $\xi_{\omega_1}(\omega_2) = \xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable with respect to $\omega_2$, for each $\omega_1 \in \Omega_1$. Let $F \in \mathcal{F}_1 \otimes \mathcal{F}_2$ and $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$. Let


$F_{\omega_1} = \{\omega_2 \in \Omega_2 : (\omega_1, \omega_2) \in F\}$ be the cross-section of $F$ at $\omega_1$, and let $\mathcal{C}_{\omega_1} = \{F \in \mathcal{F} : F_{\omega_1} \in \mathcal{F}_2\}$. We must show that $\mathcal{C}_{\omega_1} = \mathcal{F}$ for every $\omega_1$. If $F = A \times B$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

$$(A \times B)_{\omega_1} = \begin{cases} B & \text{if } \omega_1 \in A, \\ \varnothing & \text{if } \omega_1 \notin A. \end{cases}$$

Hence rectangles with measurable sides belong to $\mathcal{C}_{\omega_1}$. In addition, if $F \in \mathcal{F}$, then $(\bar F)_{\omega_1} = \overline{F_{\omega_1}}$, and if $\{F^n\}$ are sets in $\mathcal{F}$, then $\bigl(\bigcup_n F^n\bigr)_{\omega_1} = \bigcup_n F^n_{\omega_1}$. It follows that $\mathcal{C}_{\omega_1} = \mathcal{F}$.

Now let $\xi(\omega_1, \omega_2) \ge 0$. Then, since the function $\xi(\omega_1, \omega_2)$ is $\mathcal{F}_2$-measurable for each $\omega_1$, the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is defined. Let us show that this integral is an $\mathcal{F}_1$-measurable function and

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2). \qquad (50)$$

Let us suppose that $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$. Then since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

$$\int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2)\,\mu_2(d\omega_2) = I_A(\omega_1)\,\mu_2(B), \qquad (51)$$

and consequently the integral on the left of (51) is an $\mathcal{F}_1$-measurable function. Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$. Let us show that the integral $f(\omega_1) = \int_{\Omega_2} I_F(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is $\mathcal{F}_1$-measurable. For this purpose we put $\mathcal{C} = \{F \in \mathcal{F} : f(\omega_1) \text{ is } \mathcal{F}_1\text{-measurable}\}$. According to what has been proved, the set $A \times B$ belongs to $\mathcal{C}$ ($A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$) and therefore the algebra $\mathcal{A}$ consisting of finite sums of disjoint sets of this form also belongs to $\mathcal{C}$. It follows from the monotone convergence theorem that $\mathcal{C}$ is a monotonic class, $\mathcal{C} = \mu(\mathcal{C})$. Therefore, because of the inclusions $\mathcal{A} \subseteq \mathcal{C} \subseteq \mathcal{F}$ and Theorem 1 of §2, we have $\mathcal{F} = \sigma(\mathcal{A}) = \mu(\mathcal{A}) \subseteq \mu(\mathcal{C}) = \mathcal{C} \subseteq \mathcal{F}$, i.e. $\mathcal{C} = \mathcal{F}$.

Finally, if $\xi(\omega_1, \omega_2)$ is an arbitrary nonnegative $\mathcal{F}$-measurable function, the $\mathcal{F}_1$-measurability of the integral $\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ follows from the monotone convergence theorem and Theorem 2 of §4.

Let us now show that the measure $\mu = \mu_1 \times \mu_2$ defined on $\mathcal{F} = \mathcal{F}_1 \otimes \mathcal{F}_2$, with the property $(\mu_1 \times \mu_2)(A \times B) = \mu_1(A) \cdot \mu_2(B)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, actually exists and is unique. For $F \in \mathcal{F}$ we put

$$\mu(F) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{F_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1).$$

As we have shown, the inner integral is an $\mathcal{F}_1$-measurable function of $\omega_1$, and consequently the set function $\mu(F)$ is actually defined for $F \in \mathcal{F}$. It is clear


that if $F = A \times B$, then $\mu(A \times B) = \mu_1(A)\,\mu_2(B)$. Now let $\{F^n\}$ be disjoint sets from $\mathcal{F}$. Then

$$\mu\Bigl(\sum_n F^n\Bigr) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{(\sum_n F^n)_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \sum_n I_{F^n_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \sum_n \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{F^n_{\omega_1}}(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \sum_n \mu(F^n),$$

i.e. $\mu$ is a ($\sigma$-finite) measure on $\mathcal{F}$. It follows from Carathéodory's theorem that this measure $\mu$ is the unique measure with the property that $\mu(A \times B) = \mu_1(A)\,\mu_2(B)$.

We can now establish (50). If $\xi(\omega_1, \omega_2) = I_{A \times B}(\omega_1, \omega_2)$, $A \in \mathcal{F}_1$, $B \in \mathcal{F}_2$, then

$$\int_{\Omega_1 \times \Omega_2} I_{A \times B}(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = (\mu_1 \times \mu_2)(A \times B), \qquad (52)$$

and since $I_{A \times B}(\omega_1, \omega_2) = I_A(\omega_1) I_B(\omega_2)$, we have

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} I_{A \times B}(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[I_A(\omega_1)\int_{\Omega_2} I_B(\omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \mu_1(A)\,\mu_2(B). \qquad (53)$$

But, by the definition of $\mu_1 \times \mu_2$,

$$(\mu_1 \times \mu_2)(A \times B) = \mu_1(A)\,\mu_2(B).$$

Hence it follows from (52) and (53) that (50) is valid for $I_{A \times B}(\omega_1, \omega_2)$.

Now let $\xi(\omega_1, \omega_2) = I_F(\omega_1, \omega_2)$, $F \in \mathcal{F}$. The set function

$$\lambda(F) = \int_{\Omega_1 \times \Omega_2} I_F(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2)$$

is evidently a $\sigma$-finite measure. It is also easily verified that the set function

$$\nu(F) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} I_F(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1)$$

is a $\sigma$-finite measure. As was shown above, $\lambda$ and $\nu$ coincide on sets of the form $F = A \times B$, and therefore on the algebra $\mathcal{A}$. Hence it follows by Carathéodory's theorem that $\lambda$ and $\nu$ coincide for all $F \in \mathcal{F}$.


We turn now to the proof of the full conclusion of Fubini's theorem. By (47),

$$\int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty, \qquad \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty.$$

By what has already been proved, the integral $\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)$ is an $\mathcal{F}_1$-measurable function of $\omega_1$ and

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) < \infty.$$

Consequently, by Problem 4 (see also Property J in Subsection 2),

$$\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}).$$

In the same way

$$\int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}),$$

and therefore

$$\int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2) < \infty \qquad (\mu_1\text{-a.s.}).$$

It is clear that, except on a set $\mathcal{N}$ of $\mu_1$-measure zero,

$$\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2) = \int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2) - \int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2). \qquad (54)$$

Taking the integrals to be zero for $\omega_1 \in \mathcal{N}$, we may suppose that (54) holds for all $\omega_1 \in \Omega_1$. Then, integrating (54) with respect to $\mu_1$ and using (50), we obtain

$$\int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) = \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^+(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1) - \int_{\Omega_1}\Bigl[\int_{\Omega_2} \xi^-(\omega_1, \omega_2)\,\mu_2(d\omega_2)\Bigr]\mu_1(d\omega_1)$$
$$= \int_{\Omega_1 \times \Omega_2} \xi^+(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) - \int_{\Omega_1 \times \Omega_2} \xi^-(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2).$$


Similarly we can establish the first equation in (48) and the equation

$$\int_{\Omega_1 \times \Omega_2} \xi(\omega_1, \omega_2)\,d(\mu_1 \times \mu_2) = \int_{\Omega_2}\Bigl[\int_{\Omega_1} \xi(\omega_1, \omega_2)\,\mu_1(d\omega_1)\Bigr]\mu_2(d\omega_2).$$

This completes the proof of the theorem.
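When $\mu_1$ and $\mu_2$ are counting measures on the positive integers, Fubini's theorem becomes a statement about double series (cf. Problem 13 below): if $\sum_{i,j} |a_{ij}| < \infty$, the two iterated sums agree. A sketch with the hypothetical array $a_{ij} = (-1)^{i+j}/2^{i+j}$, whose double sum is $(\sum_{i \ge 1}(-1/2)^i)^2 = 1/9$:

```python
# Absolutely summable double array: iterated sums in either order agree.
# Truncation at N = 60 leaves a geometrically small tail.
N = 60
a = [[(-1) ** (i + j) / 2 ** (i + j) for j in range(1, N)]
     for i in range(1, N)]

rows_then_cols = sum(sum(row) for row in a)
cols_then_rows = sum(sum(a[i][j] for i in range(N - 1)) for j in range(N - 1))
# Both are close to 1/9 = 0.111...
```

Problem 14 asks for an array where absolute summability fails and the two iterated sums differ, which is why condition (47) cannot be dropped.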

**Corollary.** If $\int_{\Omega_1}\bigl[\int_{\Omega_2} |\xi(\omega_1, \omega_2)|\,\mu_2(d\omega_2)\bigr]\mu_1(d\omega_1) < \infty$, the conclusion of Fubini's theorem is still valid.

In fact, under this hypothesis (47) follows from (50), and consequently the conclusions of Fubini's theorem hold.

Example. Let $(\xi, \eta)$ be a pair of random variables whose distribution has a two-dimensional density $f_{\xi\eta}(x, y)$, i.e.

$$P((\xi, \eta) \in B) = \int_B f_{\xi\eta}(x, y)\,dx\,dy, \qquad B \in \mathcal{B}(R^2),$$

where $f_{\xi\eta}(x, y)$ is a nonnegative $\mathcal{B}(R^2)$-measurable function, and the integral is a Lebesgue integral with respect to two-dimensional Lebesgue measure. Let us show that the one-dimensional distributions for $\xi$ and $\eta$ have densities $f_\xi(x)$ and $f_\eta(y)$, and furthermore

$$f_\xi(x) = \int_{-\infty}^\infty f_{\xi\eta}(x, y)\,dy \qquad \text{and} \qquad f_\eta(y) = \int_{-\infty}^\infty f_{\xi\eta}(x, y)\,dx. \qquad (55)$$

In fact, if $A \in \mathcal{B}(R)$, then by Fubini's theorem

$$P(\xi \in A) = P((\xi, \eta) \in A \times R) = \int_{A \times R} f_{\xi\eta}(x, y)\,dx\,dy = \int_A \Bigl[\int_R f_{\xi\eta}(x, y)\,dy\Bigr] dx.$$

This establishes both the existence of a density for the probability distribution of $\xi$ and the first formula in (55). The second formula is established similarly.

According to the theorem in §5, a necessary and sufficient condition that $\xi$ and $\eta$ are independent is that

$$F_{\xi\eta}(x, y) = F_\xi(x) F_\eta(y), \qquad (x, y) \in R^2.$$

Let us show that when there is a two-dimensional density $f_{\xi\eta}(x, y)$, the variables $\xi$ and $\eta$ are independent if and only if

$$f_{\xi\eta}(x, y) = f_\xi(x) f_\eta(y) \qquad (56)$$

(where the equation is to be understood in the sense of holding almost surely with respect to two-dimensional Lebesgue measure).
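The marginal formula (55) is easy to check numerically; a sketch with the hypothetical joint density $f_{\xi\eta}(x, y) = x + y$ on $[0, 1]^2$, whose $\xi$-marginal is $f_\xi(x) = x + \tfrac12$:

```python
# Sketch of (55): integrate the joint density over y (midpoint rule)
# and compare with the known marginal x + 1/2.
def f_joint(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def marginal(x, n=10_000):
    """f_xi(x) = integral over y of f_joint(x, y)."""
    h = 1.0 / n
    return sum(f_joint(x, (k + 0.5) * h) for k in range(n)) * h

for x in (0.1, 0.5, 0.9):
    assert abs(marginal(x) - (x + 0.5)) < 1e-9

# This joint density does not factor as f_xi(x) * f_eta(y), so by the
# criterion (56) xi and eta are not independent in this example.
```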


In fact, if (56) holds, then by Fubini's theorem

$$F_{\xi\eta}(x, y) = \int_{(-\infty,x] \times (-\infty,y]} f_{\xi\eta}(u, v)\,du\,dv = \int_{(-\infty,x] \times (-\infty,y]} f_\xi(u) f_\eta(v)\,du\,dv = \int_{(-\infty,x]} f_\xi(u)\,du \cdot \int_{(-\infty,y]} f_\eta(v)\,dv = F_\xi(x) F_\eta(y),$$

and consequently $\xi$ and $\eta$ are independent.

Conversely, if they are independent and have a density $f_{\xi\eta}(x, y)$, then again by Fubini's theorem

$$\int_{(-\infty,x] \times (-\infty,y]} f_{\xi\eta}(u, v)\,du\,dv = \Bigl(\int_{(-\infty,x]} f_\xi(u)\,du\Bigr)\Bigl(\int_{(-\infty,y]} f_\eta(v)\,dv\Bigr) = \int_{(-\infty,x] \times (-\infty,y]} f_\xi(u) f_\eta(v)\,du\,dv.$$

It follows that

$$\int_B f_{\xi\eta}(x, y)\,dx\,dy = \int_B f_\xi(x) f_\eta(y)\,dx\,dy$$

for every $B \in \mathcal{B}(R^2)$, and it is easily deduced from Property I that (56) holds.

10. In this subsection we discuss the relation between the Lebesgue and Riemann integrals.

We first observe that the construction of the Lebesgue integral is independent of the measurable space $(\Omega, \mathcal{F})$ on which the integrands are given. On the other hand, the Riemann integral is not defined on abstract spaces in general, and for $\Omega = R^n$ it is defined sequentially: first for $R^1$, and then extended, with corresponding changes, to the case $n > 1$.

We emphasize that the constructions of the Riemann and Lebesgue integrals are based on different ideas. The first step in the construction of the Riemann integral is to group the points $x \in R^1$ according to their distances along the $x$ axis. On the other hand, in Lebesgue's construction (for $\Omega = R^1$) the points $x \in R^1$ are grouped according to a different principle: by the distances between the values of the integrand. It is a consequence of these different approaches that the Riemann approximating sums have limits only for "mildly" discontinuous functions, whereas the Lebesgue sums converge to limits for a much wider class of functions.

Let us recall the definition of the Riemann–Stieltjes integral. Let $G = G(x)$ be a generalized distribution function on $R$ (see Subsection 2 of §3) and $\mu$ its corresponding Lebesgue–Stieltjes measure, and let $g = g(x)$ be a bounded function that vanishes outside $[a, b]$.


Consider a decomposition $\mathcal{P} = \{x_0, \ldots, x_n\}$, $a = x_0 < x_1 < \cdots < x_n = b$, of $[a, b]$, and form the upper and lower sums

$$\overline{\Sigma} = \sum_{i=1}^n \bar g_i\,[G(x_i) - G(x_{i-1})], \qquad \underline{\Sigma} = \sum_{i=1}^n \underline{g}_i\,[G(x_i) - G(x_{i-1})],$$

where

$$\bar g_i = \sup_{x_{i-1} < y \le x_i} g(y), \qquad \underline{g}_i = \inf_{x_{i-1} < y \le x_i} g(y).$$

Define simple functions $\bar g_{\mathcal{P}}(x)$ and $\underline{g}_{\mathcal{P}}(x)$ by taking

$$\bar g_{\mathcal{P}}(x) = \bar g_i, \qquad \underline{g}_{\mathcal{P}}(x) = \underline{g}_i \qquad \text{on } x_{i-1} < x \le x_i,$$

and define $\bar g_{\mathcal{P}}(a) = \underline{g}_{\mathcal{P}}(a) = g(a)$. It is clear that then

$$\overline{\Sigma} = \text{(L-S)}\int_a^b \bar g_{\mathcal{P}}(x)\,G(dx) \qquad \text{and} \qquad \underline{\Sigma} = \text{(L-S)}\int_a^b \underline{g}_{\mathcal{P}}(x)\,G(dx).$$

Now let $\{\mathcal{P}_k\}$ be a sequence of decompositions such that $\mathcal{P}_k \subseteq \mathcal{P}_{k+1}$. Then

$$\bar g_{\mathcal{P}_1}(x) \ge \bar g_{\mathcal{P}_2}(x) \ge \cdots \ge g(x) \ge \cdots \ge \underline{g}_{\mathcal{P}_2}(x) \ge \underline{g}_{\mathcal{P}_1}(x),$$

and if $|g(x)| \le C$ we have, by the dominated convergence theorem,

$$\lim_{k \to \infty} \overline{\Sigma}_{\mathcal{P}_k} = \text{(L-S)}\int_a^b \bar g(x)\,G(dx), \qquad \lim_{k \to \infty} \underline{\Sigma}_{\mathcal{P}_k} = \text{(L-S)}\int_a^b \underline{g}(x)\,G(dx), \qquad (57)$$

where $\bar g(x) = \lim_k \bar g_{\mathcal{P}_k}(x)$, $\underline{g}(x) = \lim_k \underline{g}_{\mathcal{P}_k}(x)$.

If the limits $\lim_k \overline{\Sigma}_{\mathcal{P}_k}$ and $\lim_k \underline{\Sigma}_{\mathcal{P}_k}$ are finite and equal, and their common value is independent of the sequence of decompositions $\{\mathcal{P}_k\}$, we say that $g = g(x)$ is Riemann–Stieltjes integrable, and the common value of the limits is denoted by

$$\text{(R-S)}\int_a^b g(x)\,G(dx). \qquad (58)$$

When $G(x) = x$, the integral is called a Riemann integral and denoted by $\text{(R)}\int_a^b g(x)\,dx$.
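The squeezing of (57)–(58) can be observed numerically; a sketch with the hypothetical choices $G(x) = x^2$ and $g(x) = x$ on $[0, 1]$, where the Riemann–Stieltjes integral equals $\int_0^1 x \cdot 2x\,dx = 2/3$:

```python
# Upper and lower Darboux-Stieltjes sums for monotone g under uniform
# partitions; refining the partition squeezes them toward 2/3.
def G(x):
    return x * x

def g(x):
    return x

def upper_lower(n):
    xs = [k / n for k in range(n + 1)]
    up = lo = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        w = G(x1) - G(x0)            # G-mass of (x0, x1]
        up += max(g(x0), g(x1)) * w  # sup of monotone g on the interval
        lo += min(g(x0), g(x1)) * w  # inf of monotone g on the interval
    return up, lo

u1, l1 = upper_lower(10)
u2, l2 = upper_lower(1000)
# The refinement n=1000 contains n=10, so the sums are nested.
```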


Now let $\text{(L-S)}\int_a^b g(x)\,G(dx)$ be the corresponding Lebesgue–Stieltjes integral (see Remark 2 in Subsection 2).

**Theorem 9.** If $g = g(x)$ is continuous on $[a, b]$, it is Riemann–Stieltjes integrable and

$$\text{(R-S)}\int_a^b g(x)\,G(dx) = \text{(L-S)}\int_a^b g(x)\,G(dx). \qquad (59)$$

Proof. Since $g(x)$ is continuous, we have $\bar g(x) = g(x) = \underline{g}(x)$. Hence by (57), $\lim_{k \to \infty} \overline{\Sigma}_{\mathcal{P}_k} = \lim_{k \to \infty} \underline{\Sigma}_{\mathcal{P}_k}$. Consequently $g = g(x)$ is Riemann–Stieltjes integrable, and its Riemann–Stieltjes integral coincides with the Lebesgue–Stieltjes integral (again by (57)).

Let us consider in more detail the question of the correspondence between the Riemann and Lebesgue integrals for the case of Lebesgue measure on the line $R$.

**Theorem 10.** Let $g(x)$ be a bounded function on $[a, b]$.

(a) The function $g = g(x)$ is Riemann integrable on $[a, b]$ if and only if it is continuous almost everywhere (with respect to Lebesgue measure $\lambda$ on $\mathcal{B}([a, b])$).
(b) If $g = g(x)$ is Riemann integrable, it is Lebesgue integrable and

$$\text{(R)}\int_a^b g(x)\,dx = \text{(L)}\int_a^b g(x)\,\lambda(dx). \qquad (60)$$

Proof. (a) Let $g = g(x)$ be Riemann integrable. Then, by (57),

$$\text{(L)}\int_a^b \bar g(x)\,\lambda(dx) = \text{(L)}\int_a^b \underline{g}(x)\,\lambda(dx).$$

But $\underline{g}(x) \le g(x) \le \bar g(x)$, and hence by Property H

$$\underline{g}(x) = g(x) = \bar g(x) \qquad (\lambda\text{-a.s.}), \qquad (61)$$

from which it is easy to see that $g(x)$ is continuous almost everywhere (with respect to $\lambda$).

Conversely, let $g = g(x)$ be continuous almost everywhere (with respect to $\lambda$). Then (61) is satisfied and consequently $g(x)$ differs from the (Borel) measurable function $\bar g(x)$ only on a set $\mathcal{N}$ with $\lambda(\mathcal{N}) = 0$. But then

$$\{x : g(x) \le c\} = \{x : g(x) \le c\} \cap \mathcal{N} + \{x : g(x) \le c\} \cap \overline{\mathcal{N}} = \{x : g(x) \le c\} \cap \mathcal{N} + \{x : \bar g(x) \le c\} \cap \overline{\mathcal{N}}.$$

It is clear that the set $\{x : \bar g(x) \le c\} \cap \overline{\mathcal{N}} \in \mathcal{B}([a, b])$, and that $\{x : g(x) \le c\} \cap \mathcal{N}$


is a subset of $\mathcal{N}$ having Lebesgue measure zero and therefore also belonging to $\overline{\mathcal{B}}([a, b])$, the completion of $\mathcal{B}([a, b])$ with respect to $\lambda$. Therefore $g(x)$ is $\overline{\mathcal{B}}([a, b])$-measurable and, as a bounded function, is Lebesgue integrable. Therefore by Property G,

$$\text{(L)}\int_a^b g(x)\,\lambda(dx) = \text{(L)}\int_a^b \bar g(x)\,\lambda(dx) = \text{(L)}\int_a^b \underline{g}(x)\,\lambda(dx),$$

which completes the proof of (a).

(b) If $g = g(x)$ is Riemann integrable, then according to (a) it is continuous ($\lambda$-a.s.). It was shown above that then $g(x)$ is Lebesgue integrable and its Riemann and Lebesgue integrals are equal. This completes the proof of the theorem.

Remark. Let $\mu$ be a Lebesgue–Stieltjes measure on $\mathcal{B}([a, b])$. Let $\mathcal{B}_\mu([a, b])$ be the system consisting of those subsets $A \subseteq [a, b]$ for which there are sets $A_1$ and $A_2$ in $\mathcal{B}([a, b])$ such that $A_1 \subseteq A \subseteq A_2$ and $\mu(A_2 \setminus A_1) = 0$. Let $\bar\mu$ be the extension of $\mu$ to $\mathcal{B}_\mu([a, b])$ ($\bar\mu(A) = \mu(A_1)$ for $A$ such that $A_1 \subseteq A \subseteq A_2$ and $\mu(A_2 \setminus A_1) = 0$). Then the conclusion of the theorem remains valid if we consider $\bar\mu$ instead of Lebesgue measure $\lambda$, and the Riemann–Stieltjes and Lebesgue–Stieltjes integrals with respect to $\bar\mu$ instead of the Riemann and Lebesgue integrals.
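Dirichlet's function (Problem 11 below) shows how part (a) of Theorem 10 can fail: every subinterval of $[0, 1]$ contains both rational and irrational points, so tagged Riemann sums can be forced to either extreme value. A sketch using a toy symbolic encoding — a point is stored as a pair $(q, m)$ standing for $q + m\sqrt{2}$ with $q, m$ rational, so rationality is exactly decidable (this encoding is an illustration device, not real arithmetic):

```python
# d = 1 at irrationals, 0 at rationals; tagged Riemann sums depend on
# the choice of tags, so no common limit exists.
from fractions import Fraction

def d(point):
    q, m = point
    return 1 if m != 0 else 0

def tagged_riemann_sum(n, use_irrational_tags):
    h = Fraction(1, n)
    # Tag in [k/n, (k+1)/n): either k/n itself (rational) or the
    # nearby irrational point k/n + sqrt(2)/10^9.
    shift = Fraction(1, 10**9) if use_irrational_tags else Fraction(0)
    return sum(d((k * h, shift)) * h for k in range(n))

assert tagged_riemann_sum(1000, False) == 0   # all-rational tags
assert tagged_riemann_sum(1000, True) == 1    # all-irrational tags
```

No refinement reconciles the two values, so $d$ is not Riemann integrable; its Lebesgue integral, by contrast, is 1, since the rationals form a $\lambda$-null set.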

11. In this subsection we present a useful theorem on integration by parts for the Lebesgue–Stieltjes integral. Let two generalized distribution functions $F = F(x)$ and $G = G(x)$ be given on $(R, \mathcal{B}(R))$.

**Theorem 11.** The following formulas are valid for all real $a$ and $b$, $a < b$:

$$F(b)G(b) - F(a)G(a) = \int_a^b F(s-)\,dG(s) + \int_a^b G(s)\,dF(s), \qquad (62)$$

or equivalently

$$F(b)G(b) - F(a)G(a) = \int_a^b F(s-)\,dG(s) + \int_a^b G(s-)\,dF(s) + \sum_{a < s \le b} \Delta F(s) \cdot \Delta G(s), \qquad (63)$$

where $F(s-) = \lim_{t \uparrow s} F(t)$, $\Delta F(s) = F(s) - F(s-)$.

Remark 1. Formula (62) can be written symbolically in "differential" form

$$d(FG) = F_-\,dG + G\,dF. \qquad (64)$$


Remark 2. The conclusion of the theorem remains valid for functions $F$ and $G$ of bounded variation on $[a, b]$. (Every such function that is continuous on the right and has limits on the left can be represented as the difference of two monotone nondecreasing functions.)

Proof. We first recall that in accordance with Subsection 1 an integral $\int_a^b (\cdot)$ means $\int_{(a, b]} (\cdot)$. Then (see formula (2) in §3)

$$(F(b) - F(a))(G(b) - G(a)) = \int_a^b dF(s) \cdot \int_a^b dG(t).$$

Let $F \times G$ denote the direct product of the measures corresponding to $F$ and $G$. Then by Fubini's theorem

$$(F(b) - F(a))(G(b) - G(a)) = \int_{(a,b] \times (a,b]} d(F \times G)(s, t)$$
$$= \int_{(a,b] \times (a,b]} I_{\{s \ge t\}}(s, t)\,d(F \times G)(s, t) + \int_{(a,b] \times (a,b]} I_{\{s < t\}}(s, t)\,d(F \times G)(s, t)$$
$$= \int_a^b (G(s) - G(a))\,dF(s) + \int_a^b (F(t-) - F(a))\,dG(t)$$
$$= \int_a^b G(s)\,dF(s) + \int_a^b F(s-)\,dG(s) - G(a)(F(b) - F(a)) - F(a)(G(b) - G(a)), \qquad (65)$$

where $I_A$ is the indicator of the set $A$.

Formula (62) follows immediately from (65). In turn, (63) follows from (62) if we observe that

$$\int_a^b (G(s) - G(s-))\,dF(s) = \sum_{a < s \le b} \Delta G(s) \cdot \Delta F(s). \qquad (66)$$

**Corollary 1.** If $F(x)$ and $G(x)$ are distribution functions, then

$$F(x)G(x) = \int_{-\infty}^x F(s-)\,dG(s) + \int_{-\infty}^x G(s)\,dF(s). \qquad (67)$$

If also $F(x) = \int_{-\infty}^x f(s)\,ds$, then

$$F(x)G(x) = \int_{-\infty}^x F(s)\,dG(s) + \int_{-\infty}^x G(s) f(s)\,ds. \qquad (68)$$
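For purely discrete $F$ and $G$, each Lebesgue–Stieltjes integral in (63) is a finite sum over jump points, so Theorem 11 can be verified exactly; a sketch with hypothetical jump data:

```python
# Verify (63) for step functions F, G on (0, 3]: jump times -> sizes.
F_jumps = {1.0: 0.5, 2.0: 0.5}
G_jumps = {1.0: 0.2, 1.5: 0.3, 2.5: 0.5}

def cdf(jumps, t):
    return sum(h for s, h in jumps.items() if s <= t)

def cdf_minus(jumps, t):       # left limit F(t-)
    return sum(h for s, h in jumps.items() if s < t)

a, b = 0.0, 3.0
int_F_dG = sum(cdf_minus(F_jumps, s) * h
               for s, h in G_jumps.items() if a < s <= b)
int_G_dF = sum(cdf_minus(G_jumps, s) * h
               for s, h in F_jumps.items() if a < s <= b)
cross = sum(F_jumps.get(s, 0.0) * G_jumps.get(s, 0.0)
            for s in set(F_jumps) | set(G_jumps) if a < s <= b)

lhs = cdf(F_jumps, b) * cdf(G_jumps, b) - cdf(F_jumps, a) * cdf(G_jumps, a)
rhs = int_F_dG + int_G_dF + cross
assert abs(lhs - rhs) < 1e-12
```

The `cross` term is exactly the sum $\sum \Delta F(s)\,\Delta G(s)$ in (63); it is nonzero only at the common jump point $s = 1$ here.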


**Corollary 2.** Let $\xi$ be a random variable with distribution function $F(x)$ and $E|\xi|^n < \infty$. Then

$$\int_0^\infty x^n\,dF(x) = n \int_0^\infty x^{n-1}[1 - F(x)]\,dx, \qquad (69)$$

$$\int_{-\infty}^0 |x|^n\,dF(x) = -\int_0^\infty x^n\,dF(-x) = n \int_0^\infty x^{n-1} F(-x)\,dx, \qquad (70)$$

and

$$E|\xi|^n = \int_{-\infty}^\infty |x|^n\,dF(x) = n \int_0^\infty x^{n-1}[1 - F(x) + F(-x)]\,dx. \qquad (71)$$

To prove (69) we observe that

$$\int_0^b x^n\,dF(x) = -\int_0^b x^n\,d(1 - F(x)) = -b^n(1 - F(b)) + n \int_0^b x^{n-1}(1 - F(x))\,dx. \qquad (72)$$

Let us show that since $E|\xi|^n < \infty$,

$$b^n(1 - F(b) + F(-b)) \le b^n P(|\xi| \ge b) \to 0, \qquad b \to \infty. \qquad (73)$$

In fact,

$$E|\xi|^n = \sum_{k=1}^\infty \int_{k-1}^k |x|^n\,dF(x) < \infty$$

and therefore

$$\sum_{k \ge b+1} \int_{k-1}^k |x|^n\,dF(x) \to 0, \qquad b \to \infty.$$

But

$$\sum_{k \ge b+1} \int_{k-1}^k |x|^n\,dF(x) \ge b^n P(|\xi| \ge b),$$

which establishes (73). Taking the limit as $b \to \infty$ in (72), we obtain (69). Formula (70) is proved similarly, and (71) follows from (69) and (70).
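The tail formula (69) can be checked numerically; a sketch with the hypothetical choice $\xi \sim \mathrm{Exp}(1)$ and $n = 2$, where $1 - F(x) = e^{-x}$ and $E\xi^2 = \Gamma(3) = 2$:

```python
# E xi^2 = 2 * int_0^inf x (1 - F(x)) dx for xi ~ Exp(1), via (69).
import math

n, b, steps = 2, 40.0, 400_000
h = b / steps
# Midpoint rule for int_0^b x^(n-1) * (1 - F(x)) dx with 1-F(x) = e^-x;
# the truncated tail beyond b = 40 is negligible.
tail_integral = sum(((k + 0.5) * h) ** (n - 1) * math.exp(-(k + 0.5) * h)
                    for k in range(steps)) * h
moment = n * tail_integral        # should be close to E xi^2 = 2
```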

12. Let $A(t)$, $t \ge 0$, be a function of locally bounded variation (i.e., of bounded variation on each finite interval $[a, b]$), which is continuous on the right and has limits on the left. Consider the equation

$$Z_t = 1 + \int_0^t Z_{s-}\,dA(s), \qquad (74)$$


which can be written in differential form as

$$dZ = Z_-\,dA, \qquad Z_0 = 1. \qquad (75)$$

The formula that we have proved for integration by parts lets us solve (74) explicitly in the class of functions of locally bounded variation. We introduce the function

$$\mathscr{E}_t(A) = e^{A(t) - A(0)} \prod_{0 \le s \le t} (1 + \Delta A(s))\,e^{-\Delta A(s)}, \qquad (76)$$

where $\Delta A(s) = A(s) - A(s-)$ for $s > 0$, and $\Delta A(0) = 0$. The function $A(s)$, $0 \le s \le t$, has bounded variation and therefore has at most a countable number of discontinuities, and the series $\sum_{0 \le s \le t} |\Delta A(s)|$ converges. It follows that

$$\prod_{0 \le s \le t} (1 + \Delta A(s))\,e^{-\Delta A(s)}$$

is a function of locally bounded variation. If $A^c(t) = A(t) - \sum_{0 \le s \le t} \Delta A(s)$ is the continuous component of $A(t)$, we can rewrite (76) in the form

$$\mathscr{E}_t(A) = e^{A^c(t) - A^c(0)} \prod_{0 \le s \le t} (1 + \Delta A(s)). \qquad (77)$$

Let us write

$$F(t) = e^{A^c(t) - A^c(0)}, \qquad G(t) = \prod_{0 \le s \le t} (1 + \Delta A(s)).$$

Then by (62)

$$\mathscr{E}_t(A) = F(t)G(t) = 1 + \int_0^t F(s)\,dG(s) + \int_0^t G(s-)\,dF(s)$$
$$= 1 + \sum_{0 < s \le t} F(s) G(s-)\,\Delta A(s) + \int_0^t G(s-) F(s)\,dA^c(s)$$
$$= 1 + \int_0^t \mathscr{E}_{s-}(A)\,dA(s).$$

Therefore $\mathscr{E}_t(A)$, $t \ge 0$, is a (locally bounded) solution of (74).

Let us show that this is the only locally bounded solution. Suppose that there are two such solutions and let $Y = Y(t)$, $t \ge 0$, be their difference. Then

$$Y(t) = \int_0^t Y(s-)\,dA(s).$$

Put

$$T = \inf\{t \ge 0 : Y(t) \ne 0\},$$

where we take $T = \infty$ if $Y(t) = 0$ for $t \ge 0$.


Since $A(t)$ is a function of locally bounded variation, there are two generalized distribution functions $A_1(t)$ and $A_2(t)$ such that $A(t) = A_1(t) - A_2(t)$. If we suppose that $T < \infty$, we can find a finite $T' > T$ such that

$$[A_1(T') - A_1(T)] + [A_2(T') - A_2(T)] \le \tfrac12.$$

Then it follows from the equation

$$Y(t) = \int_T^t Y(s-)\,dA(s), \qquad t \ge T,$$

that

$$\sup_{t \le T'} |Y(t)| \le \tfrac12 \sup_{t \le T'} |Y(t)|,$$

and since $\sup_{t \le T'} |Y(t)| < \infty$, we have $Y(t) = 0$ for $T < t \le T'$, contradicting the assumption that $T < \infty$.

Thus we have proved the following theorem.

**Theorem 12.** There is a unique locally bounded solution of (74), and it is given by (76).
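For a pure-jump $A$ (hypothetical jump data below), the recursion (74) reduces to $Z_{t_k} = Z_{t_{k-1}}(1 + \Delta A(t_k))$ and the closed form (77) becomes a plain product, since $A^c \equiv 0$; a sketch:

```python
# Step-by-step solution of (74) versus the closed form (77) when A
# consists only of jumps.
jumps = [(0.5, 0.3), (1.0, -0.2), (2.0, 0.1)]   # (time, jump size dA)

Z = 1.0
for _, dA in jumps:
    Z = Z + Z * dA           # Z_{t_k} = Z_{t_{k-1}} + Z_{t_k-} * dA(t_k)

closed_form = 1.0
for _, dA in jumps:
    closed_form *= (1.0 + dA)   # E_t(A) = prod (1 + dA(s)) when A^c = 0

assert abs(Z - closed_form) < 1e-12
# Here Z = 1.3 * 0.8 * 1.1 = 1.144
```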

13. PROBLEMS

1. Establish the representation (6).

2. Prove the following extension of Property E. Let $\xi$ and $\eta$ be random variables for which $E\xi$ and $E\eta$ are defined and the sum $E\xi + E\eta$ is meaningful (does not have the form $\infty - \infty$ or $-\infty + \infty$). Then

$$E(\xi + \eta) = E\xi + E\eta.$$

3. Generalize Property G by showing that if $\xi = \eta$ (a.s.) and $E\xi$ exists, then $E\eta$ exists and $E\xi = E\eta$.

4. Let $\xi$ be an extended random variable, $\mu$ a $\sigma$-finite measure, and $\int_\Omega |\xi|\,d\mu < \infty$. Show that $|\xi| < \infty$ ($\mu$-a.s.) (cf. Property J).

5. Let $\mu$ be a $\sigma$-finite measure, $\xi$ and $\eta$ extended random variables for which $E\xi$ and $E\eta$ are defined. If $\int_A \xi\,d\mu \le \int_A \eta\,d\mu$ for all $A \in \mathcal{F}$, then $\xi \le \eta$ ($\mu$-a.s.). (Cf. Property I.)

6. Let $\xi$ and $\eta$ be independent nonnegative random variables. Show that $E\xi\eta = E\xi \cdot E\eta$.

7. Using Fatou's lemma, show that

$$P(\varliminf A_n) \le \varliminf P(A_n), \qquad \varlimsup P(A_n) \le P(\varlimsup A_n).$$

8. Find an example to show that in general it is impossible to weaken the hypothesis "$|\xi_n| \le \eta$, $E\eta < \infty$" in the dominated convergence theorem.

9. Find an example to show that in general the hypothesis "$\xi_n \le \eta$, $E\eta > -\infty$" in Fatou's lemma cannot be omitted.

10. Prove the following variants of Fatou's lemma. Let the family $\{\xi_n^+\}_{n \ge 1}$ of random variables be uniformly integrable and let $E \varlimsup \xi_n$ exist. Then

$$\varlimsup E\xi_n \le E \varlimsup \xi_n.$$

Let $\xi_n \le \eta_n$, $n \ge 1$, where the family $\{\xi_n^+\}_{n \ge 1}$ is uniformly integrable and $\eta_n$ converges a.s. (or only in probability — see §10 below) to a random variable $\eta$. Then $\varlimsup E\xi_n \le E \varlimsup \xi_n$.

11. Dirichlet's function

$$d(x) = \begin{cases} 1, & x \text{ irrational}, \\ 0, & x \text{ rational}, \end{cases}$$

is defined on $[0, 1]$, Lebesgue integrable, but not Riemann integrable. Why?

12. Find an example of a sequence of Riemann integrable functions $\{f_n\}_{n \ge 1}$, defined on $[0, 1]$, such that $|f_n| \le 1$, $f_n \to f$ almost everywhere (with Lebesgue measure), but $f$ is not Riemann integrable.

13. Let $(a_{ij};\ i, j \ge 1)$ be a sequence of real numbers such that $\sum_{i,j} |a_{ij}| < \infty$. Deduce from Fubini's theorem that

$$\sum_i \Bigl(\sum_j a_{ij}\Bigr) = \sum_j \Bigl(\sum_i a_{ij}\Bigr). \qquad (78)$$

14. Find an example of a sequence $(a_{ij};\ i, j \ge 1)$ for which $\sum_{i,j} |a_{ij}| = \infty$ and the equation in (78) does not hold.

15. Starting from simple functions and using the theorem on taking limits under the Lebesgue integral sign, prove the following result on integration by substitution. Let $h = h(y)$ be a nondecreasing continuously differentiable function on $[a, b]$, and let $f(x)$ be (Lebesgue) integrable on $[h(a), h(b)]$. Then the function $f(h(y)) h'(y)$ is integrable on $[a, b]$ and

$$\int_{h(a)}^{h(b)} f(x)\,dx = \int_a^b f(h(y)) h'(y)\,dy.$$

16. Prove formula (70).

17. Let $\xi, \xi_1, \xi_2, \ldots$ be nonnegative integrable random variables such that $E\xi_n \to E\xi$ and $P(\xi - \xi_n > \varepsilon) \to 0$ for every $\varepsilon > 0$. Show that then $E|\xi_n - \xi| \to 0$, $n \to \infty$.

18. Let $\xi, \eta, \zeta$ and $\xi_n, \eta_n, \zeta_n$, $n \ge 1$, be random variables such that

$$\xi_n \xrightarrow{P} \xi, \qquad \eta_n \xrightarrow{P} \eta, \qquad \zeta_n \xrightarrow{P} \zeta, \qquad \eta_n \le \xi_n \le \zeta_n,$$

$E\eta_n \to E\eta$, $E\zeta_n \to E\zeta$, and the expectations $E\xi, E\eta, E\zeta$ are finite. Show that then $E\xi_n \to E\xi$ (Pratt's lemma). If also $\eta_n \le 0 \le \zeta_n$, then $E|\xi_n - \xi| \to 0$. Deduce that if $\xi_n \xrightarrow{P} \xi$, $E|\xi_n| \to E|\xi|$ and $E|\xi| < \infty$, then $E|\xi_n - \xi| \to 0$.


§7. Conditional Probabilities and Conditional Expectations with Respect to a σ-Algebra

1. Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $A \in \mathcal{F}$ be an event such that $P(A) > 0$. As for finite probability spaces, the conditional probability of $B$ with respect to $A$ (denoted by $P(B|A)$) means $P(BA)/P(A)$, and the conditional probability of $B$ with respect to the finite or countable decomposition $\mathcal{D} = \{D_1, D_2, \ldots\}$ with $P(D_i) > 0$, $i \ge 1$ (denoted by $P(B|\mathcal{D})$), is the random variable equal to $P(B|D_i)$ for $\omega \in D_i$, $i \ge 1$:

$$P(B|\mathcal{D}) = \sum_{i \ge 1} P(B|D_i)\,I_{D_i}(\omega).$$

In a similar way, if $\xi$ is a random variable for which $E\xi$ is defined, the conditional expectation of $\xi$ with respect to the event $A$ with $P(A) > 0$ (denoted by $E(\xi|A)$) is $E(\xi I_A)/P(A)$ (cf. (I.8.10)).

The random variable $P(B|\mathcal{D})$ is evidently measurable with respect to the σ-algebra $\mathcal{G} = \sigma(\mathcal{D})$, and is consequently also denoted by $P(B|\mathcal{G})$ (see §8 of Chapter I).

However, in probability theory we may have to consider conditional probabilities with respect to events whose probabilities are zero. Consider, for example, the following experiment. Let $\xi$ be a random variable that is uniformly distributed on $[0, 1]$. If $\xi = x$, toss a coin for which the probability of head is $x$, and the probability of tail is $1 - x$. Let $\nu$ be the number of heads in $n$ independent tosses of this coin. What is the "conditional probability $P(\nu = k \,|\, \xi = x)$"? Since $P(\xi = x) = 0$, the conditional probability $P(\nu = k \,|\, \xi = x)$ is undefined, although it is intuitively plausible that "it ought to be $C_n^k x^k (1 - x)^{n-k}$."

Let us now give a general definition of conditional expectation (and, in particular, of conditional probability) with respect to a σ-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$, and compare it with the definition given in §8 of Chapter I for finite probability spaces.

2. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $\mathcal{G}$ a σ-algebra, $\mathcal{G} \subseteq \mathcal{F}$ ($\mathcal{G}$ is a σ-subalgebra of $\mathcal{F}$), and $\xi = \xi(\omega)$ a random variable. Recall that, according to §6, the expectation $E\xi$ was defined in two stages: first for a nonnegative random variable $\xi$, then in the general case by

$$E\xi = E\xi^+ - E\xi^-,$$

and only under the assumption that

$$\min(E\xi^+, E\xi^-) < \infty.$$

A similar two-stage construction is also used to define conditional expectations $E(\xi|\mathcal{G})$.
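The coin experiment described in Subsection 1 can be probed by simulation: conditioning on $\xi$ falling in a small bin around $x$ stands in for the impossible event $\{\xi = x\}$, and the empirical frequency of $\{\nu = k\}$ is then close to the binomial guess. A sketch (bin width, sample size and the values of $n$, $k$, $x$ are arbitrary choices):

```python
# xi uniform on [0,1]; given xi = x, toss n coins with head prob. x.
import math
import random

random.seed(1)
n, k, x0, eps, trials = 5, 2, 0.3, 0.02, 400_000
hits = total = 0
for _ in range(trials):
    x = random.random()
    if abs(x - x0) < eps:          # condition on xi being near x0
        total += 1
        nu = sum(random.random() < x for _ in range(n))
        hits += (nu == k)

empirical = hits / total
binom = math.comb(n, k) * x0 ** k * (1 - x0) ** (n - k)   # = 0.3087
```

The conditional-expectation machinery below makes this intuition rigorous without any limiting bins.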


**Definition 1.** (1) The conditional expectation of a nonnegative random variable $\xi$ with respect to the σ-algebra $\mathcal{G}$ is a nonnegative extended random variable, denoted by $E(\xi|\mathcal{G})$ or $E(\xi|\mathcal{G})(\omega)$, such that

(a) $E(\xi|\mathcal{G})$ is $\mathcal{G}$-measurable;
(b) for every $A \in \mathcal{G}$

$$\int_A \xi\,dP = \int_A E(\xi|\mathcal{G})\,dP. \qquad (1)$$

(2) The conditional expectation $E(\xi|\mathcal{G})$, or $E(\xi|\mathcal{G})(\omega)$, of any random variable $\xi$ with respect to the σ-algebra $\mathcal{G}$, is considered to be defined if

$$\min(E(\xi^+|\mathcal{G}),\ E(\xi^-|\mathcal{G})) < \infty, \qquad P\text{-a.s.},$$

and it is given by the formula

$$E(\xi|\mathcal{G}) = E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G}),$$

where, on the set (of probability zero) of sample points for which $E(\xi^+|\mathcal{G}) = E(\xi^-|\mathcal{G}) = \infty$, the difference $E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G})$ is given an arbitrary value, for example zero.

We begin by showing that, for nonnegative random variables, $E(\xi|\mathcal{G})$ actually exists. By (6.36) the set function

$$Q(A) = \int_A \xi\,dP, \qquad A \in \mathcal{G}, \qquad (2)$$

is a measure on $(\Omega, \mathcal{G})$, and is absolutely continuous with respect to $P$ (considered on $(\Omega, \mathcal{G})$, $\mathcal{G} \subseteq \mathcal{F}$). Therefore (by the Radon–Nikodym theorem) there is a nonnegative $\mathcal{G}$-measurable extended random variable $E(\xi|\mathcal{G})$ such that

$$Q(A) = \int_A E(\xi|\mathcal{G})\,dP. \qquad (3)$$

Then (1) follows from (2) and (3).

Remark 1. In accordance with the Radon–Nikodym theorem, the conditional expectation $E(\xi|\mathcal{G})$ is defined only up to sets of $P$-measure zero. In other words, $E(\xi|\mathcal{G})$ can be taken to be any $\mathcal{G}$-measurable function $f(\omega)$ for which $Q(A) = \int_A f(\omega)\,dP$, $A \in \mathcal{G}$ (a "variant" of the conditional expectation).

Let us observe that, in accordance with the remark on the Radon–Nikodym theorem,

$$E(\xi|\mathcal{G}) = \frac{dQ}{dP}(\omega), \qquad (4)$$

i.e. the conditional expectation is just the Radon–Nikodym derivative of the measure $Q$ with respect to $P$ (considered on $(\Omega, \mathcal{G})$).

Remark 2. In connection with (1), we observe that we cannot in general put $E(\xi|\mathcal{G}) = \xi$, since $\xi$ is not necessarily $\mathcal{G}$-measurable.

Remark 3. Suppose that $\xi$ is a random variable for which $E\xi$ does not exist. Then $E(\xi|\mathcal{G})$ may be definable as a $\mathcal{G}$-measurable function for which (1) holds. This is usually just what happens. Our definition $E(\xi|\mathcal{G}) = E(\xi^+|\mathcal{G}) - E(\xi^-|\mathcal{G})$ has the advantage that for the trivial σ-algebra $\mathcal{G} = \{\varnothing, \Omega\}$ it reduces to the definition of $E\xi$ but does not presuppose the existence of $E\xi$. (For example, if $\xi$ is a random variable with $E\xi^+ = \infty$, $E\xi^- = \infty$, and $\mathcal{G} = \mathcal{F}$, then $E\xi$ is not defined, but in terms of Definition 1, $E(\xi|\mathcal{G})$ exists and is simply $\xi = \xi^+ - \xi^-$.)

Remark 4. Let the random variable $\xi$ have a conditional expectation $E(\xi|\mathcal{G})$ with respect to the σ-algebra $\mathcal{G}$. The conditional variance (denoted by $V(\xi|\mathcal{G})$ or $V(\xi|\mathcal{G})(\omega)$) of $\xi$ is the random variable

$$V(\xi|\mathcal{G}) = E[(\xi - E(\xi|\mathcal{G}))^2 \,|\, \mathcal{G}].$$

(Cf. the definition of the conditional variance $V(\xi|\mathcal{D})$ of $\xi$ with respect to a decomposition $\mathcal{D}$, as given in Problem 2, §8, Chapter I.)

**Definition 2.** Let $B \in \mathcal{F}$. The conditional expectation $E(I_B|\mathcal{G})$ is denoted by $P(B|\mathcal{G})$, or $P(B|\mathcal{G})(\omega)$, and is called the conditional probability of the event $B$ with respect to the σ-algebra $\mathcal{G}$, $\mathcal{G} \subseteq \mathcal{F}$.

It follows from Definitions 1 and 2 that, for a given $B \in \mathcal{F}$, $P(B|\mathcal{G})$ is a random variable such that

(a) $P(B|\mathcal{G})$ is $\mathcal{G}$-measurable,
(b) for every $A \in \mathcal{G}$

$$P(A \cap B) = \int_A P(B|\mathcal{G})\,dP. \qquad (5)$$

**Definition 3.** Let $\xi$ be a random variable and $\mathcal{G}_\eta$ the σ-algebra generated by a random element $\eta$. Then $E(\xi|\mathcal{G}_\eta)$, if defined, means $E(\xi|\eta)$ or $E(\xi|\eta)(\omega)$, and is called the conditional expectation of $\xi$ with respect to $\eta$.

The conditional probability $P(B|\mathcal{G}_\eta)$ is denoted by $P(B|\eta)$ or $P(B|\eta)(\omega)$, and is called the conditional probability of $B$ with respect to $\eta$.

3. Let us show that the definition of $E(\xi|\mathcal{G})$ given here agrees with the definition of conditional expectation in §8 of Chapter I.


Let $\mathcal{D} = \{D_1, D_2, \ldots\}$ be a finite or countable decomposition with atoms $D_i$ with respect to the probability $P$ (i.e. $P(D_i) > 0$, and if $A \subseteq D_i$, then either $P(A) = 0$ or $P(D_i \setminus A) = 0$).

**Theorem 1.** If $\mathcal{G} = \sigma(\mathcal{D})$ and $\xi$ is a random variable for which $E\xi$ is defined, then

$$E(\xi|\mathcal{G}) = E(\xi|D_i) \qquad (P\text{-a.s. on } D_i),$$

or equivalently

$$E(\xi|\mathcal{G}) = \frac{E(\xi I_{D_i})}{P(D_i)} \qquad (P\text{-a.s. on } D_i). \qquad (6)$$

(The notation "$\xi = \eta$ ($P$-a.s. on $A$)," or "$\xi = \eta$ ($A$; $P$-a.s.)" means that $P(A \cap \{\xi \ne \eta\}) = 0$.)

Proof. According to Lemma 3 of §4, $E(\xi|\mathcal{G}) = K_i$ on $D_i$, where $K_i$ are constants. But

$$\int_{D_i} \xi\,dP = \int_{D_i} E(\xi|\mathcal{G})\,dP = K_i\,P(D_i),$$

whence

$$K_i = \frac{1}{P(D_i)} \int_{D_i} \xi\,dP = \frac{E(\xi I_{D_i})}{P(D_i)} = E(\xi|D_i).$$

This completes the proof of the theorem.

Consequently the concept of the conditional expectation $E(\xi|\mathcal{D})$ with respect to a finite decomposition $\mathcal{D} = \{D_1, \ldots, D_n\}$, as introduced in Chapter I, is a special case of the concept of conditional expectation with respect to the σ-algebra $\mathcal{G} = \sigma(\mathcal{D})$.

4. Properties of conditional expectations. (We shall suppose that the expectations are defined for all the random variables that we consider and that $\mathcal{G} \subseteq \mathcal{F}$.)

A*. If $C$ is a constant and $\xi = C$ (a.s.), then $E(\xi|\mathcal{G}) = C$ (a.s.).

B*. If $\xi \le \eta$ (a.s.), then $E(\xi|\mathcal{G}) \le E(\eta|\mathcal{G})$ (a.s.).

C*. $|E(\xi|\mathcal{G})| \le E(|\xi|\,|\,\mathcal{G})$ (a.s.).

D*. If $a$, $b$ are constants and $aE\xi + bE\eta$ is defined, then

$$E(a\xi + b\eta \,|\, \mathcal{G}) = aE(\xi|\mathcal{G}) + bE(\eta|\mathcal{G}) \qquad \text{(a.s.)}.$$

E*. Let $\mathcal{F}^* = \{\varnothing, \Omega\}$ be the trivial σ-algebra. Then

$$E(\xi|\mathcal{F}^*) = E\xi \qquad \text{(a.s.)}.$$

F*. $E(\xi|\mathcal{F}) = \xi$ (a.s.).

G*. $E(E(\xi|\mathcal{G})) = E\xi$.

H*. If $\mathcal{G}_1 \subseteq \mathcal{G}_2$, then

$$E[E(\xi|\mathcal{G}_2)\,|\,\mathcal{G}_1] = E(\xi|\mathcal{G}_1) \qquad \text{(a.s.)}.$$

I*. If $\mathcal{G}_1 \supseteq \mathcal{G}_2$, then

$$E[E(\xi|\mathcal{G}_2)\,|\,\mathcal{G}_1] = E(\xi|\mathcal{G}_2) \qquad \text{(a.s.)}.$$

J*. Let a random variable $\xi$ for which $E\xi$ is defined be independent of the σ-algebra $\mathcal{G}$ (i.e., independent of $I_B$, $B \in \mathcal{G}$). Then

$$E(\xi|\mathcal{G}) = E\xi \qquad \text{(a.s.)}.$$

K*. Let $\eta$ be a $\mathcal{G}$-measurable random variable, $E|\eta| < \infty$ and $E|\xi\eta| < \infty$. Then

$$E(\xi\eta|\mathcal{G}) = \eta\,E(\xi|\mathcal{G}) \qquad \text{(a.s.)}.$$
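On a finite probability space both formula (6) and the tower property H* can be verified directly; a sketch with hypothetical numbers, where $\mathcal{G}_2 = \sigma(\text{fine partition})$ refines $\mathcal{G}_1 = \sigma(\text{coarse partition})$:

```python
# E(xi | sigma(partition)) computed atom by atom via (6), then H*
# checked: smoothing E(xi|G2) by G1 gives E(xi|G1).
P  = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}
xi = {'a': 4.0, 'b': 1.0, 'c': 3.0, 'd': 2.0}
fine   = [{'a'}, {'b'}, {'c', 'd'}]      # generates G2
coarse = [{'a', 'b'}, {'c', 'd'}]        # generates G1, with G1 in G2

def cond_exp(f, partition, w):
    """E(f | sigma(partition)) at the point w, via formula (6)."""
    D = next(D for D in partition if w in D)
    return sum(f[v] * P[v] for v in D) / sum(P[v] for v in D)

# Defining relation (1): integrating E(xi|G2) over an atom of G2
# recovers the integral of xi itself.
for D in fine:
    lhs = sum(cond_exp(xi, fine, w) * P[w] for w in D)
    rhs = sum(xi[w] * P[w] for w in D)
    assert abs(lhs - rhs) < 1e-12

# Tower property H*: E(E(xi|G2) | G1) = E(xi|G1) at every point.
inner = {w: cond_exp(xi, fine, w) for w in P}
for w in P:
    assert abs(cond_exp(inner, coarse, w) - cond_exp(xi, coarse, w)) < 1e-12
```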

Let us establish these properties.

A*. A constant function is measurable with respect to $\mathcal{G}$. Therefore we need only verify that

$$\int_A \xi\,dP = \int_A C\,dP, \qquad A \in \mathcal{G}.$$

But, by the hypothesis $\xi = C$ (a.s.) and Property G of §6, this equation is obviously satisfied.

B*. If $\xi \le \eta$ (a.s.), then by Property B of §6

$$\int_A \xi\,dP \le \int_A \eta\,dP, \qquad A \in \mathcal{G},$$

and therefore

$$\int_A E(\xi|\mathcal{G})\,dP \le \int_A E(\eta|\mathcal{G})\,dP, \qquad A \in \mathcal{G}.$$

The required inequality now follows from Property I (§6).

C*. This follows from the preceding property if we observe that $-|\xi| \le \xi \le |\xi|$.

D*. If $A \in \mathcal{G}$, then by Problem 2 of §6,

$$\int_A (a\xi + b\eta)\,dP = a\int_A \xi\,dP + b\int_A \eta\,dP = a\int_A E(\xi|\mathcal{G})\,dP + b\int_A E(\eta|\mathcal{G})\,dP = \int_A [aE(\xi|\mathcal{G}) + bE(\eta|\mathcal{G})]\,dP,$$

which establishes D*.

§7. Conditional Probabilities and Expectations with Respect to a σ-Algebra

E*. This property follows from the remark that Eξ is an ℱ*-measurable function and the evident fact that if A = ∅ or A = Ω, then

∫_A ξ dP = ∫_A Eξ dP.

F*. Since ξ is ℱ-measurable and

∫_A ξ dP = ∫_A ξ dP,  A ∈ ℱ,

we have E(ξ|ℱ) = ξ (a.s.).

G*. This follows from E* and H* by taking 𝒢_1 = {∅, Ω} and 𝒢_2 = 𝒢.

H*. Let A ∈ 𝒢_1; then

∫_A E(ξ|𝒢_1) dP = ∫_A ξ dP.

Since 𝒢_1 ⊆ 𝒢_2, we have A ∈ 𝒢_2 and therefore

∫_A ξ dP = ∫_A E(ξ|𝒢_2) dP = ∫_A E[E(ξ|𝒢_2)|𝒢_1] dP.

Consequently, when A ∈ 𝒢_1,

∫_A E(ξ|𝒢_1) dP = ∫_A E[E(ξ|𝒢_2)|𝒢_1] dP,

and by Property I (§6) and Problem 5 (§6)

E(ξ|𝒢_1) = E[E(ξ|𝒢_2)|𝒢_1]  (a.s.).

I*. If A ∈ 𝒢_1, then by the definition of E[E(ξ|𝒢_2)|𝒢_1]

∫_A E[E(ξ|𝒢_2)|𝒢_1] dP = ∫_A E(ξ|𝒢_2) dP.

The function E(ξ|𝒢_2) is 𝒢_2-measurable and, since 𝒢_2 ⊆ 𝒢_1, also 𝒢_1-measurable. It follows that E(ξ|𝒢_2) is a variant of the expectation E[E(ξ|𝒢_2)|𝒢_1], which proves Property I*.

J*. Since Eξ is a 𝒢-measurable function, we have only to verify that

∫_B ξ dP = ∫_B Eξ dP,  B ∈ 𝒢,

i.e. that E[ξ · I_B] = Eξ · E I_B. If E|ξ| < ∞, this follows immediately from Theorem 6 of §6. The general case can be reduced to this by applying Problem 6 of §6.

The proof of Property K* will be given a little later; it depends on conclusion (a) of the following theorem.


Theorem 2 (On Taking Limits Under the Expectation Sign). Let {ξ_n}_{n≥1} be a sequence of extended random variables.

(a) If |ξ_n| ≤ η, Eη < ∞ and ξ_n → ξ (a.s.), then

E(ξ_n|𝒢) → E(ξ|𝒢)  (a.s.)

and

E(|ξ_n − ξ| |𝒢) → 0  (a.s.).

(b) If ξ_n ≥ η, Eη > −∞ and ξ_n ↑ ξ (a.s.), then

E(ξ_n|𝒢) ↑ E(ξ|𝒢)  (a.s.).

(c) If ξ_n ≤ η, Eη < ∞, and ξ_n ↓ ξ (a.s.), then

E(ξ_n|𝒢) ↓ E(ξ|𝒢)  (a.s.).

(d) If ξ_n ≥ η, Eη > −∞, then

E(lim inf ξ_n|𝒢) ≤ lim inf E(ξ_n|𝒢)  (a.s.).

(e) If ξ_n ≤ η, Eη < ∞, then

lim sup E(ξ_n|𝒢) ≤ E(lim sup ξ_n|𝒢)  (a.s.).

(f) If ξ_n ≥ 0, then

E(Σ_n ξ_n|𝒢) = Σ_n E(ξ_n|𝒢)  (a.s.).

Proof. (a) Let ζ_n = sup_{m≥n} |ξ_m − ξ|. Since ξ_n → ξ (a.s.), we have ζ_n ↓ 0 (a.s.). The expectations Eξ_n and Eξ are finite; therefore by Properties D* and C* (a.s.)

|E(ξ_n|𝒢) − E(ξ|𝒢)| = |E(ξ_n − ξ|𝒢)| ≤ E(|ξ_n − ξ| |𝒢) ≤ E(ζ_n|𝒢).

Since E(ζ_{n+1}|𝒢) ≤ E(ζ_n|𝒢) (a.s.), the limit h = lim_n E(ζ_n|𝒢) exists (a.s.). Then

0 ≤ ∫_Ω h dP ≤ ∫_Ω E(ζ_n|𝒢) dP = ∫_Ω ζ_n dP → 0,  n → ∞,

where the last statement follows from the dominated convergence theorem, since 0 ≤ ζ_n ≤ 2η, Eη < ∞. Consequently ∫_Ω h dP = 0 and then h = 0 (a.s.) by Property H.

(b) First let η = 0. Since E(ξ_n|𝒢) ≤ E(ξ_{n+1}|𝒢) (a.s.), the limit ζ(ω) = lim_n E(ξ_n|𝒢) exists (a.s.). Then by the equation

∫_A ξ_n dP = ∫_A E(ξ_n|𝒢) dP,  A ∈ 𝒢,

and the theorem on monotone convergence,

∫_A ξ dP = ∫_A ζ dP,  A ∈ 𝒢.

Consequently E(ξ|𝒢) = ζ (a.s.) by Property I and Problem 5 of §6.

§7. Conditional Probabilities and Expectations with Respect to a σ-Algebra

For the proof in the general case, we observe that 0 ≤ ξ_n⁺ ↑ ξ⁺, and by what has been proved,

E(ξ_n⁺|𝒢) ↑ E(ξ⁺|𝒢)  (a.s.).    (7)

But 0 ≤ ξ_n⁻ ≤ η⁻, Eη⁻ < ∞, and ξ_n⁻ → ξ⁻; therefore by (a)

E(ξ_n⁻|𝒢) → E(ξ⁻|𝒢),

which, with (7), proves (b).

Conclusion (c) follows from (b).

(d) Let ζ_n = inf_{m≥n} ξ_m; then ζ_n ↑ ζ, where ζ = lim inf ξ_n. According to (b), E(ζ_n|𝒢) ↑ E(ζ|𝒢) (a.s.). Therefore (a.s.)

E(lim inf ξ_n|𝒢) = E(ζ|𝒢) = lim_n E(ζ_n|𝒢) = lim inf E(ζ_n|𝒢) ≤ lim inf E(ξ_n|𝒢).

Conclusion (e) follows from (d).

(f) If ξ_n ≥ 0, by Property D* we have

E(Σ_{k=1}^n ξ_k|𝒢) = Σ_{k=1}^n E(ξ_k|𝒢)  (a.s.),

which, with (b), establishes the required result. This completes the proof of the theorem.

We can now establish Property K*. Let η = I_B, B ∈ 𝒢. Then, for every A ∈ 𝒢,

∫_A ξη dP = ∫_{A∩B} ξ dP = ∫_{A∩B} E(ξ|𝒢) dP = ∫_A I_B E(ξ|𝒢) dP = ∫_A ηE(ξ|𝒢) dP.

By the additivity of the Lebesgue integral, the equation

∫_A ξη dP = ∫_A ηE(ξ|𝒢) dP    (8)

remains valid for the simple random variables η = Σ_{k=1}^n y_k I_{B_k}, B_k ∈ 𝒢. Therefore, by Property I (§6), we have

E(ξη|𝒢) = ηE(ξ|𝒢)  (a.s.)    (9)

for these random variables. Now let η be any 𝒢-measurable random variable with E|η| < ∞, and let {η_n}_{n≥1} be a sequence of simple 𝒢-measurable random variables such that |η_n| ≤ |η| and η_n → η. Then by (9)

E(ξη_n|𝒢) = η_n E(ξ|𝒢)  (a.s.).

It is clear that |ξη_n| ≤ |ξη|, where E|ξη| < ∞. Therefore E(ξη_n|𝒢) → E(ξη|𝒢) (a.s.) by Property (a). In addition, since E|ξ| < ∞, we have E(ξ|𝒢) finite (a.s.) (see Property C* and Property J of §6). Therefore η_n E(ξ|𝒢) → ηE(ξ|𝒢) (a.s.). (The hypothesis that E(ξ|𝒢) is finite, almost surely, is essential, since, according to the footnote on p. 172, 0·∞ = 0, but if η_n = 1/n, η = 0, we have 1/n · ∞ ↛ 0·∞ = 0.)

218

II. Mathematical Foundations of Probability Theory

5. Here we consider the more detailed structure of conditional expectations E(el~,). which we also denote, as usual, by E(el'l). Since E(el'l) is a ~.,-measurable function, then by Theorem 3 of §4 (more precisely, by its obvious modification for extended random variables) there is a Borel function m = m(y) from R toR such that m('l(ro)) for all

OJ E

n.

= E(el'l)(ro)

(10)

we denote this function m(y) by E(

e

eI, = y) and call it the

conditional expectation of with respect to the event {, = y}' or the conditional expectation of under the condition that , = y.

e

Correspondingly we define (11)

Ae~.,.

Therefore by Theorem 7 of §6 (on change of variable under the Lebesgue integral sign)

f

m(17) dP =

J{ro:.,eB}

f m(y)P,(dy), JB

BeBI(R),

where P., is the probability distribution of 'I· Consequently m Borel function such that

f

(12)

=

m(y) is a

edP= fm(y)dP,.

J{ro: 11eB}

(13)

JB

for every B e BI(R). This remark shows that we can give a different definition of the conditional expectation E(el '1 = y).

e

Definition 4. Let and '1 be random variables (possible, extended) and let Ee be defined. The conditional expectation of the random variable under the condition that 'I = y is any BI(R)-measurable function m = m(y) for which

f

J{ro:.,eB}

e

edP = JBf m(y)P.,(dy),

BeBI(R).

(14)

That such a function exists follows again from the Radon-Nikodym theorem if we observe that the set function Q(B) =

f

J{ro: 11eB}

edP

is a signed measure absolutely continuous with respect to the measure P,.

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

219

Now suppose that m(y) is a conditional expectation in the sense of Definition 4. Then if we again apply the theorem on change of variable under the Lebesgue integral sign, we obtain

r

J{ro:~eB}

~ dP =

rm(y)P~(dy) = J{ro:~eB} r m(11)P~(dy),

JB

BE &H(R).

The function m(17) is ~~-measurable, and the sets {w: 11 E B}, BE &H(R), exhaust the subsets of ~~. Hence it follows that m(IJ) is the expectation E(~ IIJ). Consequently if we know E(~l11 = y) we can reconstruct E(~l17), and conversely from E(~l11) we can find E(~ I11 = y). From an intuitive point of view, the conditional expectation E(~ 1'1 = y) is simpler and more natural than E(~ I17). However, E(~ I17), considered as a ~~-measurable random variable, is more convenient to work with. Observe that Properties A*-K* above and the conclusions of Theorem 2 can easily be transferred to E(~l11 = y) (replacing "almost surely" by "P~-almost surely"). Thus, for example, Property K* transforms as follows: if E 1~1 < oo and E llf(IJ)I < oo, where f = f(y) is a &H(R) measurable function, then E(lf(11)111 = y) = f(y)E(~IIJ = y) In addition (cf. Property J*), E(~l11

if~

(P~-a.s.).

(15)

and 11 are independent, then

= y) = E~

(P~-a.s.).

We also observe that if BE &H(R 2 ) and ~and 11 are independent, then (16) and if cp = cp(x, y) is a .c?6'(R 2 )-measurable function such that E Icp(~, 1J) I < oo, then

To prove (16) we make the following observation. If B = B 1 x B 2 , the validity of (16) will follow from

r

J{ro:~eA}

IB,xB2(~. 11)P(dw) =

r

J(yeA)

EIB,xBi~. y)P~(dy).

But the left-hand side is P{~ E Bt. 11 E An B 2 }, and the right-hand side is P( ~ E B 1)P(17 E A n B 2); their equality follows from the independence of ~ and 11· In the general case the proof depends on an application of Theorem 1, §2, on monotone classes (cf. the corresponding part of the proof of Fubini's theorem).

Definition 5. The conditional probability of the event A E ~under the condition that 11 = y (notation: P(A I11 = y)) is E(J A I1J = y).

220

II. Mathematical Foundations of Probability Theory

It is clear that P(A 111 = y) can be defined as the .14(R)-measurable function such that P(A n {IJEB}) =

ft(AIIJ

=

y)P~(dy),

BE .14(R).

(17)

6. Let us calculate some examples of conditional probabilities and conditional expectations. EXAMPLE

Lk'=

1

P(IJ

1. Let 11 be a discrete random variable with P(IJ = Yk) > 0, = Yk) = 1. Then P(AI

11

=

Yk

{17 = yk}) P( _ ) , 11- Yk

) = P(A n

For y¢{y 1 ,y 2 , ... } the conditional probability P(AIIJ = y) can be defined in any way, for example as zero. If~ is a random variable for which E~ exists, then

When y ¢ {y 1, y 2 , .•• } the conditional expectation E( ~ 111 in any way (for example, as zero).

= y) can be defined

2. Let ( ~, 11) be a pair of random variables whose distribution has a density ~~~(x, y):

EXAMPLE

P{(~,IJ)EB} = f!~~(x,y)dxdy, Let Hx) and UY) be the densities of the probability distribution of~ and 11 (see (6.46), (6.55) and (6.56). Let us put

r ( I ) - h~(x, y) X y j,(y) '

(18)

J~l~

taking J~ 1 ~(x Iy) = 0 if f~(y) = 0. Then

P(~ E CIIJ = y) = J/~ 1 ~(xly) dx,

C E .14(R),

i.e. !~ 1 ~(x ly) is the density of a conditional probability distribution.

(19)

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

A

221

In fact, in order to prove (19) it is enough to verify (17) for BE PJ(R), {~ E C}. By (6.43), (6.45} and Fubini's theorem,

=

L[J/~~~(xly)

dx

L

JP~(dy) = [L.h 1 ~(xly) dx Jf~(y) dy =

J J~ 1 ~(xiy)~(y) J f~~(x,

dx dy

CxB

=

y) dx dy

CxB

=

P{(~,

IJ) E C x B}

=

P{(~ E

C) n (IJ E B)},

which proves (17). In a similar way we can show that if E~ exists, then

(20) EXAMPLE

3. Let the length of time that a piece of apparatus will continue to

operate be described by a nonnegative random variable 11 = IJ(w) whose distribution Fq(y) has a density j~(y) (naturally, F ~(y) = j~(y) = 0 for y < 0). Find the conditional expectation E(IJ - a 111 2 a), i.e. the average time for which the apparatus will continue to operate on the hypothesis that it has already been operating for time a. Let P(IJ 2 a) > 0. Then according to the definition (see Subsection 1) and (6.45), E(11 - a I'1 > a) -

=

E[(l'/ - a)I 1paJ P(l'/ 2 a)

=

Jn (1'/ -

a)I 1 ~>a)P(dw)

"-=--'-'---=--c-'----"'-'=-='___:_____:__

P(l'/ 2 a)

_ J:' (y -

- s:

a)f~(y) dy f~(y) dy

It is interesting to observe that if '1 is exponentially distributed, i.e.

A. -.0 f,(y) = { e ' y- ' ~ 0 y < 0,

(21)

then El] = E(IJIIJ 2 0) = 1/A. and E(IJ -aiiJ 2 a)= 1/A. for every a> 0. In other words, in this case the average time for which the apparatus continues to operate, assuming that it has already operated for time a, is independent of a and simply equals the average time El]. Under the assumption (21) we can find the conditional distribution P(l'/- a :S:: xl11 2 a).


We have

P(η − a ≤ x|η ≥ a) = P(a ≤ η ≤ a + x)/P(η ≥ a)
  = [F_η(a + x) − F_η(a) + P(η = a)]/[1 − F_η(a) + P(η = a)]
  = ([1 − e^{−λ(a+x)}] − [1 − e^{−λa}])/(1 − [1 − e^{−λa}])
  = e^{−λa}[1 − e^{−λx}]/e^{−λa}
  = 1 − e^{−λx}.

e

e

B = {(a, x): Ia I ~

1[

2'

x E [0, teas a] u [1 - tcos a, 1]},

then A = {w: (fJ, e) E B}, and therefore the probability in question is P(A)

= EIA(w) = Ela(fJ(w), e(w)).

Figure 29

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

223

By Property G* and formula (16), EJa(tl(w),

~(w)) =

E(E[Ja(tl(w),

~(w))ltl(w)])

= LE[I 8 (tl(w), ~(w))ltl(w)]P(dw) =

f

tt/2

=1

n

_,12 E[I8 (tl(w), ~(w))ltl(w) = tx]P8(da)

J"

12

-tt/2

Ela(a,

~(w))da = 1

n

J"

12

-n/2

cos ada=-, 2 n

where we have used the fact that

Ela(a, ~(w)) = P{~ E [0,

t cos a] u [1 - t cos a]}= cos a.

Thus the probability that a "random" toss of the needle intersects one of the lines is 2/n. This result could be used as the basis for an experimental evaluation of n. In fact, let the needle be tossed N times independently. Define ~ito be 1 if the needle intersects a line on the ith toss, and 0 otherwise. Then by the law of large numbers (see, for example, (1.5.6))

P{l~ 1 + ·; + ~N- P(A)I > e}-+ 0,

N-+

oo.

for every e > 0. In this sense the frequency satisfies ~1

+ ... + ~N N

~

P(A) = ~2

and therefore 2N

-----~1!:.

~1

+ ... +

~N

This formula has actually been used for a statistical evaluation of n. In 1850, R. Wolf (an astronomer in Zurich) threw a needle 5000 times and obtained the value 3.1596 for n. Apparently this problem was one of the first applications (now known as Monte Carlo methods) of probabilisticstatistical regularities to numerical analysis. 7. If { ~n}n> 1 is a sequence of nonnegative random variables, then according to conclusion (f) of Theorem 2,

In particular, if B 1 , B 2 , ••• is a sequence of pairwise disjoint sets, (22)

224

II. Mathematical Foundations of Probability Theory

It must be emphasized that this equation is satisfied only almost surely and that consequently the conditional probability P(B \~)(w) cannot be considered as a measure on B for given w. One might suppose that, except for a set ..;V of measure zero, P( ·I~) (w) would still be a measure for w E Y. However, in general this is not the case, for the following reason. Let .K(B 1 , B 2 , •• •) be the set of sample points w such that the countable additivity property (22) fails for these B 1, B 2 , ...• Then the excluded set ..;Vis

(23) where the union is taken over all B 1, B 2 , ••. in fJ'. Although the P-measure of each set .K(B 1 , B 2 , .• . ) is zero, the P-measure of ..;V can be different from zero (because of an uncountable union in (23)). (Recall that the Lebesgue measure of a single point is zero, but the measure of the set ..;V = [0, 1], which is an uncountable sum of the individual points {x}, is 1). However, it would be convenient if the conditional probability P(·\ ~)(w) were a measure for each wEn, since then, for example, the calculation of conditional probabilities E(e \~)could be carried out (see Theorem 3 below) in a simple way by averaging with respect to the measure P( ·I~)(w):

E(e\~) =

fa ~(w)P(dw\~)

(a.s.)

(cf. (1.8.10)). We introduce the following definition. Definition 6. A function P(w; B), defined for all wEn and BE ff, is a regular conditional probability with respect to ~ if

(a) P(w; ·)is a probability measure on$' for every wEn; (b) For each B E $' the function P( w; B), as a function of w, is a variant of the conditional probability P(B\~)(w), i.e. P(w: B)= P(B\~)(w) (a.s.). Theorem3. Let P(w; B) be a regular conditional probability with respect to ~ and let be an integrable random variable. Then

e

Ece\~)(w) = [ ~(w)P(w; dw)

Jn

PROOF.

If

e= I

B'

(a.s.).

(24)

BE ff, the required formula (24) becomes P(B\~)(w) =

P(w; B) (a.s.),

which holds by Definition 6(b). Consequently (24) holds for simple functions.

§7. Conditional Probabilities and Expectations with Respect to a a-Algebra

225

Now let ¢ ~ 0 and ¢n i ¢, where ¢n are simple functions. Then by (b) of Theorem 2 we have E(¢/~)(w) =limn E(¢nl~)(w) (a.s.). But since P(w; ·) is a measure for every wE Q, we have lim n

E(¢n/~)(w) =lim n

( ¢kiJ)P(w; dw) = ( ¢(w)P(w; dw)

Jn

Jn

by the monotone convergence theorem. The general case reduces to this one if we use the representation ¢ =

e+- c.

This completes the proof.

Corollary. Let ~ = ~~' where rt is a random variable, and let the pair (¢, rt) have a probability distribution with density f~lx, y). Let E /g(¢) / < oo. Then

where f~ 1 ~(x/y) is the density of the conditional distribution (see (18)).

In order to be able to state the basic result on the existence of regular conditional probabilities, we need the following definitions.

Definition 7. Let (E, r!) be a measurable space, X= X(w) a random element with values in E, and ~ a a-subalgebra of g;, A function Q(w; B), defined for wEn and BE S is a regular conditional distribution of X with respect to ~if

(a) for each wEn the function Q(w; B) is a probability measure on (E, rff); (b) for each B E C the function Q( w; B), as a function of w, is a variant of the conditional probability P(X E B/ ~)(w), i.e. Q(w; B)= P(X EB/~)(w)

(a.s.).

Definition 8. Let ¢ be a random variable. A function F = F(w; x), wE Q, XE

R, is a regular distribution function for ¢ with respect to ~

if :

(a) F(w; x) is, for each wE Q, a distribution function on R; (b) F(w; x) = P(¢ :.:::; x /~)(w)(a.s.), for each x E R.

Theorem 4. A regular distribution function and a regular conditional distribution function always exist for the random variable

ewith respect to ~.

226

II. Mathematical Foundations of Probability Theory

PROOF. For each rational number r E R, define F,(OJ) = P(~ :::; ri~)(OJ), where P(~:::; ri~)(OJ) = E(J 1 ~ 9li~)(OJ) is any variant of the conditional probability, with respect to~. of the event {~:::; r}. Let {r;} be the set of rational numbers in R. If r; < ri, Property B* implies that P(~ :::; r;i ~) :::; P(~:::; rii~) (a.s.), and therefore if Aii = {OJ: F,/OJ) < F,,(OJ)}, A= UAii• we have P(A) = 0. In other words, the set of points OJ at which the distribution function F,(OJ), r E {r;}, fails to be monotonic has measure zero. Now let

B; = {OJ: lim Fr,+( 1 /n)(OJ) ¥= F,,(OJ)l,

J

n-+ oo

00

B=

U B;.

i= 1

It is clear that /{~:Sr,+( 1 /n)}! /{~~ril• n-+ oo. Therefore, by (a) of Theorem 2, F,,+( 1 /nl(OJ)-+ F,,(OJ) (a.s.), and therefore the set Bon which continuity on the right fails (with respect to the rational numbers) also has measure zero, P(B) = 0.

In addition, let

c= Then, since P(C) = 0. Now put

{OJ: lim FnCOJ) ¥= 1} u {OJ: lim F nCOJ) > n-+ oo

n-+- oo

g :::; n} i Q,

n-+ oo, and {~ :::; n}!

F(W,. X )

0,

o}.

n-+ - oo, we have

= {lim F,(OJ), OJ¢ A u B u C, r!x G(x),

wE

Au B u C,

where G(x) is any distribution function on R; we show that F(OJ; x) satisfies the conditions of Definition 8. Let ¢Au B u C. Then it is clear that F(OJ; x) is a nondecreasing function ofx. Ifx < x' :::; r, then F(OJ; x) :::; F(OJ; x'):::; F(w; r) = F,(OJ)! F(w, x) when r! x. Consequently F(OJ; x) is continuous on the right. Similarly limx-. 00 F(OJ; x) = 1, limx-.-oo F(OJ; x) = 0. Since F(OJ; x) = G(x) when OJ E Au B u C, it follows that F(OJ; x) is a distribution function on R for every OJ E Q, i.e. condition (a) of Definition 8 is satisfied. By construction, P(~:::; r)I~)(OJ) = F,(OJ) = F(OJ; r). If r! x, we have F(OJ; r)! F(OJ; x) for all OJ E Q by the continuity on the right that we just established. But by conclusion (a) of Theorem 2, we have P(~ :::; ri ~)(OJ)-+ P(~:::; xi~)(OJ) (a.s.). Therefore F(OJ; x) = P(~:::; xiG)(OJ) (a.s.), which establishes condition (b) of Definition 8. We now turn to the proof of the existence of a regular conditional distribution of~ with respect to ~Let F(OJ; x) be the function constructed above. Put

w

Q(OJ; B) = LF(OJ; dx),

227

§7. Conditional Probabilities and Expectations with Respect to au-Algebra

where the integral is a Lebesgue-Stieltjes integral. From the properties of the integral (see §6, Subsection 7), it follows that Q(ro; B) is a measure on B for each given wEn. To establish that Q(ro; B) is a variant of the conditional probability P(~ E Bl~)(ro), we use the principle of appropriate sets. Let ~ be the collection of sets B in gj(R) for which Q(ro; B)= P(~EBI~)(ro) (a.s.). Since F(w;'x) = P(~ ~ xl~)(w) (a.s.), the system~ contains the sets B of the form B = (- oo, x], x E R. Therefore ~ also contains the intervals of the form (a, b], and the algebra d consisting of finite sums of disjoint sets of the form (a, b]. Then it follows from the continuity properties of Q(w; B) (w fixed) and from conclusion (b) of Theorem 2 that~ is a monotone class, and since d £ ~ £ gj(R), we have, from Theorem 1 of§2, gj(R) = a(d) £

a(~)= Jl(~)

=

~ £

gj(R),

whence ~ = gj(R). This completes the proof of the theorem. By using topological considerations we can extend the conclusion of Theorem 4 on the existence of a regular conditional distribution to random elements with values in what are known as Borel spaces. We need the following definition.

Definition 9. A measurable space (E, tf) is a Borel space if it is Borel equivalent to a Borel subset of the real line, i.e. there is a one-to-one mapping cp = cp(e): (E, tf)

--+

(R, gj(R)) such that

=

(1) cp(E) {cp(e): eeE} is a set in gj(R); (2) cp is tS'-measurable (cp- 1(A) E I, A E cp(E) n gj(R)), (3) cp- 1 is BI(R)/tS'-measurable (cp(B) E cp(E) n BI(R), BE Iff).

Theorem 5. Let X = X(w) be a random element with values in the Borel space (E, S). Then there is a regular conditional distribution of X with respect to ~Let cp = cp(e) be the function in Definition 9. By (2), cp(X(ro)) is a random variable. Hence, by Theorem 4, we can define the conditional distribution Q(ro; A) of cp(X(ro)) with respect to r§, A E cp(E) n BI(R). We introduce the function Q(w; B) = Q(w; cp(B)), BE tf. By (3) of Definition 9, qJ(B) E qJ(E) n BI(R) and consequently Q(w; B) is defined. Evidently Q(w; B) is a measure on BE S for every ro. Now fix BE G. By the one-to-one character of the mapping cp = cp(e), PROOF.

Q(w; B)= Q(w; cp(B)) = P{cp(X) E cp(B)I~} = P{X E Bl~}

(a.s.).

Therefore Q(w; B) is a regular conditional distribution of X with respect to~.

This completes the proof of the theorem.

228

II. Mathematical Foundations of Probability Theory

Corollary .. Let X = X( w) be a random element with values in a complete separable metric space (E, 6"). Then there is a regular conditional distribution of X with respect to t§. In particular, such a distribution exists for the spaces (R", ~(R")) and (R 00 , ~(R 00 )). The proof follows from Theorem 5 and the well known topological result that such spaces are Borel spaces.

8. The theory of conditional expectations developed above makes it possible

to give a generalization of Bayes's theorem; this has applications in statistics. with Recall that if f:» = {A 1 , . . . , An} is a partition of the space that P(A;) > 0, Bayes's theorem (1.3.9) states

n

P(A; IB)

=

n

P(A;)P(B IA;)

Li=l P(A)P(BIA) for every B with P(B) > 0. Therefore if e = 2:? = a ;I 1

. A,

(25) is a discrete random

variable then, according to (1.8.10),

E[g(O)IBJ

L?=: g(a;)P(A;)P(BIA;)' Li=l P(A)P(BIA)

(26)

J~oo g(a)P(BIO = a)P9(da) J~oo P(BIO = a)P 9(da)

(27)

=

or

E[g(O)IBJ

=

On the basis of the definition ofE[g(B)JB] given at the beginning of this section, it is easy to establish that (27) holds for all events B with P(B) > 0, random variables and functions g = g(a) with E Ig(O) I < oo. We now consider an analog of (27) for conditional expectations E[g( 8) It§] with respect to a a-algebra t§, t§ ~ ff. Let

e

Q(B) = Lg(O)P(dw),

BEt§.

(28)

Then by (4)

(29) We also consider the a-algebra t§ 9 . Then, by (5),

P(B) =

Lt
t§9)

dP

(30)

or, by the formula for change of variable in Lebesgue integrals, (31)

§7. Conditional Probabilities and Expectations with Respect to a (j-Algebra

229

Since

we have

Q(B)

=

f_'Xl}(a)P(BIO = a)P 0(da).

Now suppose that the conditional probability P(B I0 admits the representation P(BIO =a)= Lp(w; a)A(dw),

(32)

= a) is regular and (33)

where p = p(w; a) is nonnegative and measurable in the two variables jointly, and A is a a-finite measure on (Q, <;§). Let E lg(O)I < oo. Let us show that (P-a.s.) E[g(O)I<;§] = J~oo g(a)p(w; a)Po(da) J~ oo p(w; a)P 0(da)

(34)

(generalized Bayes theorem). In proving (34) we shall need the following lemma.

Lemma. Let (Q, F) be a measurable space. (a) Let 11- and A be a-finite measures, and f

=

f(w) an ff-measurablefunction.

Then

(35)

(in the sense that !feither integral exists, the other exists and they are equal).

(b) If v is a signed measure and fl, A are a-finite measures v ~ fl, 11- ~ A, then

dv dA

=

dv dfl dfl. dA.

(A.-a.s.)

(36)

(11--a.s.)

(37)

and dv dfl PROOF. (a) Since

I!( A)

=

=

dvldfl dA. dA.

L(~~)

dA.,

I/;

I A,. The general case (35) is evidently satisfied for simple functions f = monotone converthe and f+ f = f follows from the representation gence theorem.

230

II. Mathematical Foundations of Probability Theory

(b) From (a) with f = dv/d/1 we obtain

Then v

~

A. and therefore

J dA.dv dA.,

=

v(A)

A

whence (36) follows since A is arbitrary, by Property I (§6). Property (37) follows from (36) and the remark that

i

dJ.l = 0} = 11 { w:dA.

-dJ.l

{ro: d!lfd).

= 0) dA.

dA. = 0

(on the set {w: dJ1/dA. = 0} the right-hand side of (37) can be defined arbitrarily, for example as zero). This completes the proof of the lemma. To prove (34) we observe that by Fubini's theorem and (33), Q(B)

=

L[J: L[f_

P(B) =

00

(38)

g(a)p(w; a)P0(da)}(dw),

00 00

(39)

p(w; a)Po(da)}(dw).

Then by the lemma dQ dQjd). dP = dPjd).

(P-a.s.).

Taking account of (38), (39) and (29), we have (34).

Remark. Formula (34) remains valid if we replace() by a random element with values in some measurable space (E, C) (and replace integration over R by integration over E). Let us consider some special cases of (34). Let the a-algebra r§ be generated by the random variable Suppose that

P(~ E A I() = a) = {

q(x; a)A.(dx),

~' r§

A E fJIJ(R),

=

r§ ~.

(40)

where q = q(x; a) is a nonnegative function, measurable with respect to both variables jointly, and ). is a a-finite measure on (R, fJIJ(R)). Then we obtain

E[g(())l~ = x] = J:?oo g(a)q(x; a)Po(da)

J:?oo q(x; a)Po(da)

(41)

§7. Conditional Probabilities and Expectations with Respect to a u-Algebra

231

In particular, let((},~) be a pair of discrete random variables,(} = La;! A,, xiB j" Then, taking A. to be the counting measure (A.( {x;}) = 1, i = 1, 2, ...) we find from (40) that ~='I

(Compare (26).) Now let (0, ~) be a pair of absolutely continuous measures with density fo.~(a, x). Then by (19) the representation (40) applies with q(x; a)= !~ 16 (xla) and Lebesgue measure A.. Therefore

E[g(O) I~ = x] = s~ <Xl g(a)f~lfJ(x la)fo(a) da J~ oo J~ 1 o(x Ia)fo(a) da 9.

(43)

PROBLEMS

1. Let eand '1 be independent identically distributed random variables with Ee defined.

Show that

E(ele + '1) = E(IJie + '1) = -e+IJ 2- (a.s.).

ee

2. Let 1, 2 , ••• be independent identically distributed random variables with E1e;1 < oo. Show that

s.

E(e11 s., s.+ 1• ...) = - (a.s.), n where s.

=

e + ... + e•. 1

3. Suppose that the random elements (X, Y) are such that there is a regular distribution Px(B) = P(YeBJX = x). Show that ifE Jg(X, Y)l < oo then

E[g(X, Y)IX = x] = 4. Let

f

g(x, y)Px(dy)

(Px-a.s.).

ebe a random variable with distribution function Fix). Show that E<eJIX <

(assuming that F~(b)-

F~(a)

J!x dFix)

e~b)= F~(b)- Fia)

> 0).

5. Let g = g(x) be a convex Borel function with E Jg<e)J < oo. Show that Jensen's inequality

holds for the conditional expectations.

232

II. Mathematical Foundations of Probability Theory

e

6. Show that a necessary and sufficient condition for the random variable and the u-algebra f§ to be independent (i.e., the random variables.; and Ia(w) are independent for every Bet§) is that E(g(e)lf§) = Eg(e) for every Borel function g(x) with Elg<e)l < oo.

e JA edP, is u-finite.

be a nonnegative random variable and <§ a u-algebra, <§ s;; !F. Show that E(el<§) < oo (a.s.) if and only if the measure Q, defined on sets A e <§by Q(A) =

7. Let

§8. Random Variables. II 1. In the first chapter we introduced characteristics of simple random variables, such as the variance, covariance, and correlation coefficient. These extend similarly to the general case. Let (Q, !F, P) be a probability space and ~ = ~(w) a random variable for which E~ is defined. The variance of ~ is

fo

The number q = + is the standard deviation. If~ is a random variable with a Gaussian (normal) density fi( ) _ ~

x -

1 foq e

the parameters m and

q

-[(x-m)2]/2a2

,

q

> 0,

- oo < m < oo,

(1)

in (1) are very simple: m

=

E~,

Hence the probability distribution ofthis random variable~. which we call Gaussian, or normally distributed, is completely determined by its mean value m and variance q 2 , (It is often convenient to write~ "' .;V (m, q 2 ).) Now let(~, 17) be a pair of random variables. Their covariance is (2)

(assuming that the expectations are defined). lfcov(~, 17) = 0 we say that~ and '1 are uncorrelated. If V ~ > 0 and V17 > 0, the number (3)

is the correlation coefficient of ~ and '7· The properties of variance, covariance, and correlation coefficient were investigated in §4 of Chapter I for simple random variables. In the general case these properties can be stated in a completely analogous way.

233

§8. Random Variables. II

Let e = <el, ... ' en) be a random vector whose components have finite second moments. The covariance matrix of e is then x n matrix~= IIRiJII, where Ril = cov(e;, ei). It is clear that ~ is symmetric. Moreover, it is nonnegative definite, i.e. n

L1 Rw~·)·i ~ o

i,j=

for all A.i e R, i = 1, ... , n, since

The following lemma shows that the converse is also true.

Lemma. A necessary and sufficient condition that an n x n matrix

~ is the covariance matrix of a vector e = <el, ... ' en) is that the matrix is symmetric and nonnegative definite, or, equivalently, that there is an n x k matrix A (1 ~ k ~ n) such that

where T denotes the transpose. PROOF. We showed above that every covariance matrix is symmetric and nonnegative definite. Conversely, let~ be a matrix with these properties. We know from matrix theory that corresponding to every symmetric nonnegative definite matrix ~ there is an orthogonal matrix (!) (i.e., (!)(!)T = E, the unit matrix) such that (!)T~(!) =

where D =

(d0

1

••

D,

'

0)

dn

is a diagonal matrix with nonnegative elements di, i = 1, ... , n. It follows that ~ = (!)D(!)T = ((!)B)(BT(!)T),

where B is the diagonal matrix with elements bi = + .jd;, i = 1, ... , n. Consequently if we put A = (!)B we have the required representation ~ = AATfor ~. It is clear that every matrix AAT is symmetric and nonnegative definite. Consequently we have only to show that ~ is the covariance matrix of some random vector. Let tit• ti 2 , ••• , tin be a sequence of independent normally distributed random variables, ..¥(0, 1). (The existence of such a sequence follows, for example, from Corollary 1 of Theorem 1, §9, and in principle could easily

234

II. Mathematical Foundations of Probability Theory

be derived from Theorem 2 of §3.) Then the random vector~ = A17 (vectors are thought of as column vectors) has the required properties. In fact, E~~T

= E(A1'/)(A1'/)T =A. E,,T. AT= AEAT = AAT.

(If ( = li(iill is a matrix whose elements are random variables, E( means the matrix IIE~iill). This completes the proof of the lemma.

We now turn our attention to the two-dimensional Gaussian (normal) density

(4)

characterized by the five parameters m1 , m2 , a 1 , a 2 and p (cf. (3.14)), where Im1 I < oo, Im2 1 < oo, a 1 > 0, a 2 > 0, IpI < 1. An easy calculation identifies these parameters :

m1 =

E~,

m2 = E1],

ai

= V~,

a~ = V1],

p = p(~, 1]).

In§4ofChapter I we explained that if~ and 17 areuncorrelated(p(~, 17) = 0), it does not follow that they are independent. However, if the pair (~, 17) is Gaussian, it does follow that if ~ and 17 are uncorrelated then they are independent. In fact, if p = 0 in (4), then

But by (6.55) and (4),

frf._x) =

J

oo

~~~(x,

-oo

y) dy =

1

foal

e-[(x-md21J2at,

Consequently ~~~(x,

y) = f~(x) · fiy),

from which it follows that~ and 17 are independent (see the end of Subsection 8 of §6).

235

§8. Random Variables. II

2. A striking example of the utility of the concept of conditional expectation (introduced in §7) is its application to the solution of the following problem which is connected with estimation theory (cf. Subsection 8 of §4 of Chapter 1). Let(~, '1) be a pair of random variables such that~ is observable but '1 is not. We ask how the unobservable component '1 can be "estimated" from the knowledge of observations of~To state the problem more precisely, we need to define the concept of an estimator. Let qJ = qJ(x) be a Borel function. We call the random variable qJ(~) an estimator of11 in terms of~. and E['1 - ({J(~)] 2 the (mean square) error of this estimator. An estimator qJ*(~) is called optimal (in the mean-square sense) if

L\

=E['1- qJ*(~)Y = infE['1- ({J(~)Y, qJ

(5)

where inf is taken over all Borel functions qJ = qJ(x). Theorem 1. Let E11 2 < oo. Then there is an optimal estimator qJ* = qJ*(~) and qJ*(x) can be taken to be thejimction

qJ*(x) = E('11 ~ = x).

(6)

PRooF. Without loss of generality we may consider only estimators tp(~) for which EqJ 2 (~) < oo. Then if ({J(~) is such an estimator, and qJ*(~) = E('11 ~), we have E['1 - ({J(~)] 2 = E[('1 - qJ*(~)) + (({J*(~) _ ({J(~))]2 = E['7 - qJ*(~)]2 + E[({J*(~) - ({J(~)]2 + 2E[('1 - qJ*(~))(({J*(~) - ({J(~))] ~ E[17 - qJ*(~)] 2 , since E[qJ*(~) - qJ(~)] 2 ~ 0 and, by the properties of conditional expectations, E[('1-

qJ*(~))(({J*(~)- ({J(~))]

= E{E[('1-

qJ*(~))(({J*(~)- ({J(~)])I~]}

= E{(({J*(~)- ({J(~))E('1- qJ*(~)I~)} = 0.

This completes the proof of the theorem. Remark. It is clear from the proof that the conclusion of the theorem is still valid when ~ is not merely a random variable but any random element with values in a measurable space (E, tf). We would then assume that (/) = qJ(x) is an tS'/81(R)-measurable function. Let us consider the form of qJ*(x) on the hypothesis that(~, '1) is a Gaussian pair with density given by (4).

236

II. Mathematical Foundations of Probability Theory

From (1), (4) and (7.10) we find that the density J, 1 ~(yJx) of the conditional probability distribution is given by

f~ I~{y IX)

=

1 J2rc(1 - p2 )u 2

e[(y-

m(x))2]/[2a~( 1 - p2)]'

(7)

where

(8) Then by the Corollary of Theorem 3, §7, (9)

and V('71~

=

x)

=E[('7- E('71~ = x)) 1~ = x] 2

=

J_'XJoo (y-

= u~{1 -

m(x)) 2f~ 1 ~(yJx) dy

p 2 ).

Notice that the conditional variance V('71 ~ therefore

(10)

=

x) is independent of x and

(11) Formulas (9) and (11) were obtained under the assumption that V~ > 0 and V17 > 0. However, if V~ > 0 and V17 = 0 they are still evidently valid. Hence we have the following result (cf. (1.4.16) and (1.4.17)).

Theorem 2. Let (~, 17) be a Gaussian vector with estimator of '7 in terms of~ is

V~

> 0. Then the optimal

(12) and its error is

(13)

Remark. The curve y(x) = E('71 ~ = x) is the curve of regression of '7 on ~ or of '7 with respect to ~. In the Gaussian case E('71 ~ = x) = a + bx and consequently the regression of '1 and ~ is linear. Hence it is not surprising that the right-hand sides of (12) and (13) agree with the corresponding parts of (1.4.6) and (1.4.17) for the optimal linear estimator and its error.

237

§8. Random Variables. II

Corollary. Let e 1 and e2 be independent Gaussian random variables with mean zero and unit variance, and

Then E~ and if af

= E11 = 0, V~ = af + a~ > 0, then E(

11

+a~, V11

= bf + bLcov(~, 1'/) = a,b, + a2 b 2 ,

I~)= a 1 b1 + a 2 b2 ~ af +a~

(14)

'

Ll = (a 1 b2 - azbd 2 ai +a~

(15)

3. Let us consider the problem of determining the distribution functions of random variables that are functions of other random variables. Let~ be a random variable with distribution function F~(x) (and density Nx), if it exists), let q> = q>(x) be a Borel function and 11 = q>(~). Letting I Y = (- oo, y), we obtain

F~(y) =

P(ry::::;; y)

= P(q>(~)Eiy) = P(~Eq>- 1 (/y)) =

f

F~(dx),

(16)

"'- l(ly)

· which expresses the distribution function F~(y) in terms of F~(x) and q>. For example, if 11 = a~ + b, a > 0, we have

( y- b) (y- b)

F ~(y) = p ~ ::::;; -a- = F ~ -a- .

(17)

If 11 = ~ 2 , it is evident that F ~(y) = 0 for y < 0, while for y 2 0

F /Y) = P(

e ::;

y) = P(- Jy ::::;; ~ ::::;; Jy)

= F~(Jy)- F~( -Jy) + P(~ = -Jy).

(18)

We now turn to the problem of determining f~(y). Let us suppose that the range of~ is a (finite or infinite) open interval I = (a, b), and that the function q> = q>(x), with domain (a, b), is continuously differentiable and either strictly increasing or strictly decreasing. We also suppose that q>'(x) =f. 0, xEI. Let us write h(y) = q>- 1(y) and suppose for definiteness that cp(x) is strictly increasing. Then when y E q>(/),

= P(e ::::;; h(y)) =

f

h(y)

_ 00Nx) dx.

(19)

238

II. Mathematical Foundations of Probability Theory

By Problem 15 of §6,

h(y) fy f - OONx) dx = - oof;(h(z))h'(z) dz

(20)

and therefore

fq(y)

= f~;(h(y))h'(y).

(21)

Similarly, if cp(x) is strictly decreasing,

fq(y)

=

Nh(y))(( -h'(y)).

Hence in either case J~(y)

For example, if 1J = a~

= Nh(y)) Ih'(y) 1.

+ b, a #

0, we have

y-b h(y) =-aand If~ ~

JV (m, u 2) and 11 fq(y)

=

= el:,

(22)

fq(y)

1 (y-b) = j;f !~;-a- ·

we find from (22) that

1 exp[{ $uy

ln(y/~)2], 2u

0

y > 0,

(23)

y :::;; 0,

with M =em. A probability distribution with the density (23) is said to be lognormal (logarithmically normal). If cp = cp(x) is neither strictly increasing nor strictly decreasing, formula (22) is inapplicable. However, the following generalization suffices for many applications. Let cp = cp(x) be defined on the set 1 [ak, bk], continuously differentiable and either strictly increasing or strictly decreasing on each open interval Ik = (ab bk), and with cp'(x) # 0 for x E Ik. Let hk = hk(y) be the inverse of cp(x) for x Elk. Then we have the following generalization of(22):

Lk=

n

fq(y) =

L f~;(hk(y))lh~(y)l· Ivk(y),

(24)

k=1

where Dk is the domain of hk(y). For example, if 11 = ~ 2 we can take I 1 = ( - oo, 0), I 2 find that h1 (y) = h 2(y) = and therefcre

-.JY,

.JY,

f~(y) = {2Jy [f~;(.jY) + fl-JY)], 0,

y > 0, y :::;; 0.

= (0,

oo ), and

(25)

239

§8. Random Variables. II

Wecanobservethatthisresultalsofollowsfrom(18),sinceP(~ = In particular, if~ ,..., .;V (0, 1),

h2(y) =

{k

-JY) = 0.

y > 0,

e-y/2,

y

0,

~

(26)

0.

A straightforward calculation shows that

fi~l(y)

f +v'l~l (y) --

=

{f~(y) + f~(- y), y > 0, 0,

y::;; 0.

{2y(f~(y2) 0,

+ f~(- y2)),

y > 0, y ~ 0.

(27) (28)

4. We now consider functions of several random variables. If ~ and '7 are random variables with joint distribution F ~,(x, y), and


F~:,(z) =

J

{x, y: q>(x, y) :S z}

dF~,(x, y).

(29)

For example, if cp(x, y) = x + y, and~ and '7 are independent (and therefore F~,(x, y) = F~(x) · F,(y)) then Fubini's theorem shows that

F,(z) =

f

{x,y:x+y:Sz)

=

=

(

JR2

dF~(x) · dF,(y)

I{x+y:s;zJ(x, y) dF~(x) dF,(y)

J:oo dF~(x){J:,/{x+y:s;z)(x, y) dF,(y)} =

J:ooF,(z- x)

dF~(x) (30)

and similarly

F,(z) =

J:ooF~(z- y) dF,(y).

(31)

IfF and G are distribution functions, the function

H(z) = J:00 F(z- x) dG(x) is denoted by F * G and called the convolution ofF and G. Thus the distribution function F 1 of the sum of two independent random variables ~ and '7 is the convolution of their distribution functions F ~ and F,: F, = F~*F,.

It is clear that F~ * F, = F, * F~.

II. Mathematical Foundations of Probability Theory

240

Now suppose that the independent random variables ~ and '1 have densities f~ and f~. Then we find from (31 ), with another application ofFubini's theorem, that

F~(z) = J:oo [f~Yf~(u) du ]f~(y) dy = s:oo

[f

J

=

ooh(u - y) du J.,(y) dy

whence J{(z)

=

f_

00

/iz -

00

f

00

[f_0000j~(u -

J

y)J,(y) dy du,

(32)

y)J.,(y) dy,

and similarly (33) Let us see some examples of the use of these formulas. Let ~to ~ 2 , ••• , ~n be a sequence of independent identically distributed random variables with the uniform density on [ -1, 1]:

f(x) =

{!. 0,

lxl ~ 1, lxl > 1.

Then by (32) we have f~, +~ 2 (x)

={

2 _...,...:.-....:.,

I 2 IX~'

0,

lxl > 2,

-lxl 4

(3 - lxl) 2 , 16 !~, +~2+~3(x)

= 3 - x2

1~

lxl

~

3,

0 ~ lxl ~ 1,

8

lxl > 3,

0, and by induction

1

[(n+x)/21

_ { 2"( _ 1) 1 · n /~, + ... +~n(x)-

0, Now let~ ,..., .% (m 1,

L

(-1)kC~(n

o-n and '1 ,..., .% (m cp(x)

+ x- 2k)"-t, lxl

~ n,

k=O

lxl > n. 2,

o-~). If we write

= _1_ e-x2J2,

fo

241

§8. Random Variables. II

then

and the formula

follows easily from (32). Therefore the sum of two independent Gaussian random variables is again a

Gaussian random variable with mean m1

e

en

+ m2 and variance uf + u~.

Let 1, •• ,, be independent random variables each of which is normally distributed with mean 0 and variance 1. Then it follows easily from (26) (by induction)

t::+ . . ~ +"(x)

{-=-2n:-;;;t2=:,.,..(n....,./2,.,..) x
ef

x > 0, X~

x;,

e;

(34)

0.

The variable + · · · + is usually denoted by and its distribution (with density (30)) is the x2 -distribution ("chi-square distribution") with n degrees of freedom (cf. Table 2 in §3).

If we write Xn =

+·.JX!, it follows from (28) and (34) that fx . (x) =

2x"-le-x2f2 { 2"12r(n/2) , X ~ 0,

(35)

X< 0.

0,

The probability distribution with this density is the x-distribution (chidistribution) with n degrees of freedom. Again let ~ and 11 be independent random variables with densities f~ and J,. Then

F~~(z) =

JJ

f~(x)f~(y) dx dy,

{x,y:xy:Sz)

F~1 ~(z) =

JJ

f~x)j,(y) dx dy.

{x, y: x/y:Sz)

Hence we easily obtain

~~~(z)

=

oo (z) dy dx f-oof~ Y J.,(y) IYT = foo-oof~ (z) X J~(x) TXT

(36)

and (37)

242

II. Mathematical Foundations of Probability Theory

e

e

e;)!n,

en

Putting =eo and '1 = j(e~ + · · · + in (37), where eo. 1 , •.. , are independent Gaussian random variables with mean 0 and variance rJ 2 > 0, and using (35), we find that

f~o/[.j(l/nH~t+ ... +~~)](x)

I r(n; 1)

=

C y'Ttn

( ) r~

1 (

(38)

x2)(n+ 1)/2

1+-

2

n

The variable eo/[)(1/n)(ei + ... + e:)J is denoted by t, and its distribution is the t-distribution, or Student's distribution, with n degrees of freedom (cf. Table 2 in §3). Observe that this distribution is independent of rJ.

5.

PROBLEMS

1. Verify formulas (9), (10), (24), (27), (28), and (34)-(38). 2. Let ~t> ••• , ~ •• n ~ 2, be independent identically distributed random variables with distribution function F(x) (and density f(x), if it exists), and let~= max(~t• ... , ~.), ~ = min(~t• ... , ~.), p = ~ - ~-Show that

F- (y x) = {(F(y))"- (F(y)- F(x))", ~.~ ' (F(y))",

y > x, y ~ x,

n(n- 1)[F(y)- F(x)]"- 2 f(x)f(y),

f~.~(y, x) = { 0,

y > x, <X,

y

- {n s~oo [F(y)- F(y- x)]"-tf(y) dy, X~ 0,

~00-0 '

0

X< '

n(n- 1) s~oo [F(y)- F(y- x)]"- 2 f(y- x)f(y) dy,

fP(x) = { 0,

X>

0,

X< 0•

3. Let ~t and ~ 2 be independent Poisson random variables with respective parameters At and A2 • Show that ~t + ~ 2 has a Poisson distribution with parameter At + A2 • 4. Let mt = m2 = 0 in (4). Show that

5. The maximal correlation coefficient of ~ and Yf is p*(~, rt) = sup"·" p(u<e), v<e)), where the supremum is taken over the Borel functions u = u(x) and v = v(x) for which the correlation coefficient p(u(~). v(e)) is defined. Show that' and Yf are independent if and only if p*(~. rt) = 0. 6. Let "t 1 , "t 2 , ••• , "t• be independent nonnegative identically distributed random variables with the exponential density

f(t) = Ae-At,

t

~

0.

§9. Construction of a Process with Given Finite-Dimensional Distribution

Show that the distribution of r 1

243

+ · · · + rk has the density

A_ktk-1e-A'

t

(k-1)!'

~

0, 1 :o;; k :o;; n,

and that P(r 1

+ · · · + rk >

t) =

k-1

(A.tY

i=O

l.

L e-A•-.-1 .

7. Let ~ ~ %(0, a 2 ). Show that, for every p ~ 1,

EI~ IP =

CPaP,

where

and f(s)

=

So e-xxs-

1

dx is the gamma function. In particular, for each integer n ~ 1, E~ 2 " = (2n - 1)!! a 2 ".

§9. Construction of a Process with Given Finite-Dimensional Distribution 1. Let

~ = ~(w)

(Q, $', P), and let

be a random variable defined on the probability space F~(x) =

P{w:

~(w)::;

x}

be its distribution function. It is clear that F ~(x) is a distribution function on the real line in the sense of Definition 1 of §3. We now ask the following question. Let F = F(x) be a distribution function on R. Does there exist a random variable whose distribution function is

F(x)?

One reason for asking this question is as follows. Many statements in probability theory begin, "Let~ be a random variable with the distribution function F(x); then ... ". Consequently if a statement of this kind is to be meaningful we need to be certain that the object under consideration actually exists. Since to know a random variable we first have to know its domain (Q, g;), and in order to speak of its distribution we need to have a probability measure P on (Q, g;), a correct way of phrasing the question of the existence of a random variable with a given distribution function F(x) is this: Do there exist a probability space (Q, $', P) and a random variable~ on it, such that

P{w:

~(w)::;

x}

=

F(x)?

= ~(w)

244

II. Mathematical Foundations of Probability Theory

Let us show that the answer is positive, and essentially contained in Theorem 1 of§ 1. In fact, let us put :F = PJ(R).

!l=R,

It follows from Theorem 1 of §1 that there is a probability measure P (and only one) on (R, PJ(R)) for which P(a, b)] = F(b) - F(a), a < b. Put c;(w) = w. Then

P{w: c;(w)

~ x}

= P{w: w

~ x}

= P(- oo, x] = F(x).

Consequently we have constructed the required probability space and the random variable on it. 2. Let us now ask a similar question for random processes. Let X= (c;,),er be a random process (in the sense of Definition 3, §5) defined on the probability space (Q, !F, P), with t E T £ R. From a physical point of view, the most fundamental characteristic of a random process is the set {F,,, ... ,1"(x 1 , ... , xn)} of its finite-dimensional distribution functions

defined for all sets tt. ... , tn with t 1 < t 2 < · · · < tn. We see from (1) that, for each set t 1 , .•. , tn with t 1 < t 2 < · · · < tn the functions F 1,, ... ,1Jx 1, ... , xn) are n-dimensional distribution functions (in the sense of Definition 2, §3) and that the collection {F,,, ... ,1"(x 1, ... , Xn)} has the following consistency property: lim F,,, ... ,tn(xl, ... ' Xn)

=

F,,, ... ,tk, ... ,tn(xl, ... '

xk> ... ' Xn)

(2)

Xkf 00

where ~ indicates an omitted coordinate. Now it is natural to ask the following question: under what conditions can a given family {F 11 , ... , 1"(xt. ... , xn)} of distribution functions F,,, ... ,1"(x 1 , •.• , xn) (in the sense of Definition 2, §3) be the family of finitedimensional distribution functions of a random process? It is quite remarkable that all such conditions are covered by the consistency condition (2).

Theorem 1 (Kolmogorov's Theorem on the Existence of a Process). Let {F1,,. .. , 1"(x 1, ... , Xn)}, with t; E T £ R, t 1 < t 2 < · · · < tn, n ~ 1, be a given

family of finite-dimensional distribution functions, satisfying the consistency condition (2). Then there are a probability space (Q, !F, P) and a random process X = (c;,),er such that (3)

§9. Construction of a Process with Given Finite-Dimensional Distribution

245

PRooF. Put

i.e. taken to be the space of real functions w = (w,),eT with the a-algebra generated by the cylindrical sets. LetT= [t 1 , ... , tn], t 1 < t 2 < · · · < tn. Then by Theorem 2 of §3 we can construct on the space (R", PA(R")) a unique probability measure Pr such that

It follows from the consistency condition (2) that the family {P r} is also consistent (see (3.20)). According to Theorem 4 of §3 there is a probability measure P on (RT, PA(RT)) such that P{w: (w11 ,

••• ,

w,J E B} = Pr(B)

for every set r = [t 1 , .. , tn], t 1 < · · · < tn. From this, it also follows that (4) is satisfied. Therefore the required random process X= (~,(w)),eT can be taken to be the process defined by tE T.

(5)

This completes the proof of the theorem.

Remark 1. The probability space (RT, PA(RT), P) that we have constructed is called canonical, and the construction given by (5) is called the coordinate method of constructing the process. Remark 2. Let (Ea, ga) be complete separable metric spaces, where oc belongs

to some set mof indices. Let {Pr} be a set of consistent finite-dimensional distribution functions P" T = [IX 1, ••. , ocnJ on (£1% 1

X ··• X

EIXn' rff/% 1 ® • • • ® rff/XJ

Then there are a probability space (Q, !F, P) and a family of !F /rff a-measurable functions (Xa(w))ae!ll such that P{(Xa 1,

••• ,

XaJ E B} = P,(B)

for all r = [oc 1 , .•• , ocnJ and BE ~fa ®···®Can· This result, which generalizes Theorem 1, follows from Theorem 4 of §3 if we put n = Ea., !F = I®Ia tel% and XIX(w) = WI% for each Q) = w(wl%), 0( em.

Tia

Corollary 1. Let F 1(x), F 2 (x), ... be a sequence of one-dimensional distribution junctions. Then there exist a probability space (Q, ff, P) and a sequence of independent random variables ~ 1 , ~ 2 , ••• such that P{w: ~;(w) ::; x} = F;(x).

(6)

246

II. Mathematical Foundations of Probability Theory

In particular, there is a probability space (Q, ~ P) on which an infinite sequence of Bernoulli random variables is defined (in this connection see Subsection 2 of §5 of Chapter 1). Notice that Q can be taken to be the space Q = {ro: w = (a 1, a2 ,

•• •),

ai = 0, 1}

(cf. also Theorem 2). To establish the corollary it is enough to put F 1 , ... ,ix1 , F 1(x 1 ) • • • Fn(Xn) and apply Theorem 1.

••• ,

Xn) =

Corollary 2. Let T = [0, oo) and let {p(s, x; t, B} be a family of nonnegative functions defined for s, t E T, t > s, x E R, BE Bl(R), and satisfying the following conditions: (a) p(s, x; t, B) is a propability measure on B for given s, x and t; (b) for given s, t and B, the function p(s, x; t, B) is a Borel function of x; (c) for 0 ::S; s < t < rand BE Bl(R), the Kolmogorov-Chapman equation

p(s, x; r, B)

=

{p(s, x; t, dy)p(t, y; r, B)

(7)

is satisfied. Also let 1t = n(B) be a probability measure on (R, Bl(R)). Then there are a probability space (Q, ~ P) and a random process X = (~ 1)1 2: 0 defined on it, such that

(8)

for 0 = t 0 < t 1 < · · · < tn. The process X so constructed is a Markov process with initial distribution nand transition probabilities {p(s, x; t, B}.

Corollary 3. Let T = {0, 1, 2, ... } and let {Pk(x; B)} be a family of nonnegative functions defined for k ;;::: 1, x E R, BE PJ(R), and such that pk(x; B) is a probability measure on B (for given k and x) and measurable in x (for given k and B). In addition, let n = n(B) be a probability measure on (R, Bl(R)). Then there is a probability space (Q, ~ P) with a family of random variables X= {~ 0 , ~ 1 , ••. } defined on it, such that

247

§9. Construction of a Process with Given Finite-Dimensional Distribution

3. In the situation of Corollary 1, there is a sequence of independent random variables ~ 1 , ~ 2 , . . . whose one-dimensional distribution functions are F 1 , F 2 , ••• , respectively. Now let (£ 1 , £8' 1 ), (£ 2 , £8' 2 ), •.. be complete separable metric spaces and let P 1 , P 2 , ... be probability measures on them. Then it follows from Remark 2 that there are a probability space (Q, ff', P) and a sequence of independent elements X1o X 2 , ..• such that Xn is ~/c8'n-measurable and P(XnEB) = PiB), BE @"n• It turns out that this result remains valid when the spaces (En,@" n) are

arbitrary measurable spaces.

Theorem 2 (lonescu Tulcea's Theorem on Extending a Measure and the Existence of a Random Sequence). Let (Qn, ff,), n = 1, 2, ... , be arbitrary

n

PI

measurable spaces and n = nn, ~ = ff,. Suppose that a probability measure P1 is given on (!1 1, ~1 ) and that, for every set (wt> ... , wn) E !1 1 x ... X nn,n ~ l,probabilitymeasuresP(wl, ... ' wn;· )aregivenon(Qn+ I•~+ 1). Suppose that for every BE ff,+ 1 the functions P(wl> ... , wn; B) are Borel fimctions on (w 1 , ••• , wn) and let

n

~

1.

(9)

Then there is a unique probability measure P on (Q, ~) such that

for every n such that

~

1, and there is a random sequence X= (X 1 (w), X 2 (w), ...)

where A; E c8';. PROOF. The first step is to establish that for each n > 1 the set function Pn defined by (9) on the rectangle A 1 x · · · x An can be extended to the a-algebra ~1 ® · · · ® ff,. For each n ~ 2 and BE ~1 ® · · · ® ff, we put

X

l

nn

I B(fl!1, ... , Wn)P(w1, ... , Wn- 1; dwn).

(12)

It is easily seen that when B = A 1 x · · · x An the right-hand side of (12) is the same as the right-hand side of (9). Moreover, when n = 2 it can be

248

II. Mathematical Foundations of Probability Theory

shown, just as in Theorem 8 of §6, that P 2 is a measure. Consequently it is easily established by induction that Pn is a measure for all n ~ 2. The next step is the same as in Kolmogorov's theorem on the extension of a measure in (R 00 , 14(R 00 ) ) (Theorem 3, §3). Thus for every cylindrical set Jn(B) = {roe 0: (ro 1, ... , ron) e B}, Be §'1 ® · · · ® !F,, we define the set function P by (13)

If we use (12) and the fact that P(ro 1, ... , rok; ·)are measures, it is easy to

establish that the definition (13) is consistent, in the sense that the value of P(Jn(B)) is independent of the representation of the cylindrical set. It follows that the set function P defined in (13) for cylindrical sets, and in an obvious way on the algebra that contains all the cylindrical sets, is a finitely additive measure on this algebra. It remains to verify its countable additivity and apply Caratheodory's theorem. In Theorem 3 of §3 the corresponding verification was based on the property of (R", 14(Rn)) that for every Borel set B there is a compact set A £ B whose probability measure is arbitrarily close to the measure of B. In the present case this part of the proof needs to be modified in the following way. As in Theorem 3 of §3, let {Bn} n~ 1 be a sequence of cylindrical sets

Bn = {ro: (rol> ... ' ron) E Bn}, that decrease to the empty set

0. but have lim P(Bn) > 0.

(14)

n-+oo

For n > 1, we have from (12)

where

Since .Bn+1 £ Bn, we have Bn+1 £ Bn

X

on+1 and therefore

JBn+l(ro1, ... ' ron+1) ~ [Bn(ro,, ... ' ron)Inn+l(ron+1>· Hence the sequence {f~1 >(ro 1 )}n~ 1 decreases. Let J(l>(ro 1) = limn f~1 >(ro 1 ). By the dominated convergence theorem lim P(Bn) =lim n

n

i f~ >(ro 1 )P 1(dro 1 ) i n,

1

=

n,

J< 1>(ro 1)P 1(dro 1).

By hypothesis, limn P(Bn) > 0. It follows that there is an ro~ e B such that > 0, since if ro 1 rl B 1 then f~1 >(ro 1 ) = 0 for n ~ 1.

j< 1 >(ro~)

§9. Construction of a Process with Given Finite-Dimensional Distribution

249

Moreover, for n > 2, (15)

where

f~2 >(w 2 ) =

LP(w~, ···l IB"(w~,w2,

w 2 ; dw 3 )

On

... ,wn)P(w~,w 2 ,

•••

,wn-bdwn).

We can establish, as for {f~l)(w 1 )}, that {f~2 >(w 2 )} is decreasing. Let

j< 2 >(w 2 ) = limn .... oo Jf>(w 2 ). Then it follows from (15) that

0<

J(l>(w~) = j j< 2 >(w 2 )P(w~; dw 2 ),

Jn2

and there is a point w~ E Q 2 such that f(2)(w~) > 0. Then (w~, w~) E B 2. Continuing this process, we find a point (w~, ... , w~) E Bn for each n. 0 Consequently (w 01, ... , wn, .. .) E n~ Bn, but by hypothesis we have n~ Bn = 0. This contradiction shows that limn P(Bn) = 0. Thus we have proved the part of the theorem about the existence of the probability measure P. The other part follows from this by putting XnCw) = Wn, n;;::: 1.

.

Corollary 1. Let (En, Cn)n<: 1 be any measurable spaces and (Pn)n<: 1 , measures on them. Then there are a probability space (Q, §', P) and a family of independent random elements X 1,X 2, ... with values in (E 1,C 1), (E 2 ,C 2 ), •.. , respectively, such that P{w: Xiw)

E

B}

= Pn(B),

Corollary 2. Let E = {1, 2, ... }, and let {pk(x, y)} be a family of nonnegative functions, k;;::: 1, x,yEE, such that LyeEPk(x;y) = 1, xEE, k;;::: 1. Also letn = n(x)beaprobabilitydistributiononE(thatis,n(x);;::: O,LxeE n(x) = 1). Then there are a probability space (Q, §', P) and a family X = { ~ 0 , ~ 1 , of random variables on it, such that

Pgo = Xo, ~1 = X1, · · ·, ~n = Xn} = n(xo)P1(xo, X1) · · · Pn(Xn-1• Xn)

••• }

(16)

(cf. (1.12.4)) for all x; E E and n ;;::: 1. We may take Q to be tl".e space Q

= {w: w = (x 0 , x 1,

••. ), X; E

E}.

A sequence X= g 0 , ~ 1 , ... } of random variables satisfying (16) is a Markov chain with a countable set E of states, transition matrix {pk(x, y)} and initial probability distribution n. (Cf. the definition in §12 of Chapter 1.)

250 4.

II. Mathematical Foundations of Probability Theory

PROBLEMS

1. Let 0 = [0, 1], let :F be the class of Borel subsets of [0, 1], and let P be Lebesgue measure on [0, 1]. Show that the space (0, !F, P) is universal in the following sense. For every distribution function F(x) on (0, !F, P) there is a random variable = ro) such that its distribution function F ~x) = P(e ~ x) coincides with F(x). (Hint. e(ro) = r 1(ro), 0 < ro < 1, where r 1(ro) = sup{x: F(x) < ro}, when 0 < ro < 1, and e(O), W) can be chosen arbitrarily.)

e e(

2. Verify the consistency of the families of distributions in the corollaries to Theorems 1 and 2. 3. Deduce Corollary 2, Theorem 2, from Theorem 1.

§10. Various Kinds of Convergence of Sequences of Random Variables 1. Just as in analysis, in probability theory we need to use various kinds of convergence of random variables. Four of these are particularly important:

in probability, with probability one, in mean of order p, in distribution.

First some definitions. Let~. ~ 1 , ~ 2 , ••• be random variables defined on a probability space (Q, !F, P). Definition 1. The sequence

e

1,

~ 2 , ••.

of random variables converges in

probability to the random variable~ (notation: ~n E.~) if for every s > 0 P{l~n- ~~

> s}-+ 0,

n-+

oo.

(1)

We have already encountered this convergence in connection with the law of large numbers for a Bernoulli scheme, which stated that n-+

oo

(see §5 of Chapter 1). In analysis this is known as convergence in measure.

Definition 2. The sequence ~ 1 , ~ 2 , ••• of random variables converges with

probability one (almost surely, almost everywhere) to the random variable ~if

P{w: ~n

+ ~} =

0,

(2)

i.e. if the set of sample points w for which ~n(w) does not converge to ~has probability zero. This convergence is denoted by ~" -+ ~ (P-a.s.), or ~n ~· ~ or ~n ~· ~.

§10. Various Kinds of Convergence of Sequences of Random Variables

251

Definition 3. The sequence ~ 1 , ~ 2 , ••• of random variables converges in mean of order p, 0 < p < oo, to the random variable~ if n--+ oo.

(3)

e

~. In analysis this is known as convergence in LP, and denoted by ~n In the special case p = 2 it is called mean square convergence and denoted by ~ = l.i.m. ~n (for "limit in the mean").

Definition 4. The sequence ~ 1 , ~ 2 , . . . of random variables converges in distribution to the random variable ~ (notation: ~n ~ ~) if n--+ oo,

(4)

for every bounded continuous function f = f(x). The reason for the terminology is that, according to what will be proved in Chapter III, §1, condition (4) is equivalent to the convergence of the distribution F~"(x) to F~(x) at each point x of continuity ofF ~(x). This convergence is denoted by F~" => F~.

We emphasize that the convergence of random variables in distribution is defined only in terms of the convergence of their distribution functions. Therefore it makes sense to discuss this mode of convergence even when the random variables are defined on different probability spaces. This convergence will be studied in detail in Chapter III, where, in particular, we shall explain why in the definition ofF~" => F ~ we require only convergence at points of continuity of F~(x) and not at all x. 2. In solving problems of analysis on the convergence (in one sense or another) of a given sequence of functions, it is useful to have the concept of a fundamental sequence (or Cauchy sequence). We can introduce a similar concept for each of the first three kinds of convergence of a sequence of random variables. Let us say that a sequence {~n}n~ 1 of random variables is fundamental in probability, or with probability 1, or in mean of order, p, 0 < p < oo, if the corresponding one of the following properties is satisfied: P{ I~n - ~I > e} --+ 0, as m, n--+ oo for every e > 0; the sequence gn(w)}n~ 1 is fundamental for almost all WE Q; the sequence {~n( W)} n ~ 1 is fundamental in U, i.e. EI~n - ~m IP --+ 0 as n, m--+ 00.

3. Theorem 1.

(a) A necessary and sufficient condition that

~n

P{~~~~~k- ~I~ e}--+ 0, for every e > 0.

--+

~

(P-a.s.) is that n--+ oo.

(5)

252

II. Mathematical Foundations of Probability Theory

(b) The sequence {~"} "~ 1 is fundamental with probability 1 if and only

n--+ oo,

if (6)

for every e > 0; or equivalently

~n+k - ~n I 2

P{sup I k~O

PROOF.

n __.. w.

e}--+ 0,

=n:;,

(a) Let A~= {w: l~n- ~~ 2 e}, A'= IlniA~

+ 0 = U A' = U A

1

(7) Uk~n A;;. Then

00

{w: ~n

11 m.

m=1

e~O

But P(A') = lim n

P( U A;;), k~n

Hence (a) follows from the following chain of implications:

~" + ~} =

0 = P{w:

P(U

e>O

A')¢> P( UA1m)= 0

¢> P(A 1 1m) = 0,

1

m=1

m 2 1 ¢> P(A') = 0, e > 0,

¢> PCY" A;;)--+ 0,

n--+

00

¢>

P(~~~~~k- ~I 2 e)--+ 0, n --+ oo.

(b) Let

n U Bk,. 00

B' =

n= 1

k~n l~n

Then {w: g"{w)}"~ 1 is notfundamental} = U,~ 0 B', and it can be shown as in (a) that P{w: {~n(w)}n~ 1 is not fundamental}= 0¢>(6). The equivalence of (6) and (7) follows from the obvious inequalities supJ~n+k- ~nl:::; supl~n+k- ~n+ll:::; 2 supl~n+k- ~nl·

This completes the proof of the theorem.

Corollary. Since

253

§10. Various Kinds of Convergence of Sequences of Random Variables

a sufficient condition for

~n ~ ~

is that

00

L P{i~k- ~I~ e} <

k= 1

(8)

00

is satisfied for every e > 0. It is appropriate to observe at this point that the reasoning used in obtaining (8) lets us establish the following simple but important result which is essential in studying properties that are satisfied with probability 1. Let A 1, A 2 , ••• be a sequence of events in F. Let (see the table in §1) {An i.o.} denote the event ITiiiAn that consists in the realization of infinitely many of A 1 , Az, .... Borei-Cantelli Lemma.

(a) If'[. P(An) < 00 then P{An i.o.} = 0. (b) If'[. P(An) = oo and A1, A 2 , ••• are independent, then P{An i.o.} = 1. PROOF.

(a) By definition

{An i.o.} = ITiii An=

rl

n= 1

U Ak.

k?!n

Consequently

P{An i.o.} = Pt01 kyn Ak} =lim P(Yn Ak) and (a) follows. (b) If A 1, A 2 ,

•••

are independent, so are

we have

p

Con Ak)

=

and it is then easy to deduce that

PCQ Ak) Since log(1 - x)

~

-x, 0

00

log

fl

k=n

~

=

li lt

A1, A2 , ••••

Hence for N ~ n

P(Ak),

(9)

P(Ak).

x < 1, 00

[1 - P(Ak)] =

~lim k~nP(Ak),

I

k=n

00

log[1 - P(Ak)] ~ -

Consequently

for all n, and therefore P(An i.o.) = 1. This completes the proof of the lemma.

L

k=n

P(Ak) = - oo.

254

II. Mathematical Foundations of Probability Theory

Corollary 1. If $A_n^{\varepsilon} = \{\omega \colon |\xi_n - \xi| \ge \varepsilon\}$ then (8) shows that $\sum_{n=1}^{\infty} P(A_n^{\varepsilon}) < \infty$, $\varepsilon > 0$, and then by the Borel-Cantelli lemma we have $P(A^{\varepsilon}) = 0$, $\varepsilon > 0$, where $A^{\varepsilon} = \overline{\lim}\, A_n^{\varepsilon}$. Therefore

$$\sum_{k=1}^{\infty} P\{|\xi_k - \xi| \ge \varepsilon\} < \infty,\ \varepsilon > 0 \ \Rightarrow\ P(A^{\varepsilon}) = 0,\ \varepsilon > 0 \ \Rightarrow\ P\{\omega \colon \xi_n \nrightarrow \xi\} = 0,$$

as we already observed above.

Corollary 2. Let $(\varepsilon_n)_{n \ge 1}$ be a sequence of positive numbers such that $\varepsilon_n \downarrow 0$, $n \to \infty$. If

$$\sum_{n=1}^{\infty} P\{|\xi_n - \xi| \ge \varepsilon_n\} < \infty, \tag{10}$$

then $\xi_n \xrightarrow{\text{a.s.}} \xi$.

In fact, let $A_n = \{|\xi_n - \xi| \ge \varepsilon_n\}$. Then $P(A_n \text{ i.o.}) = 0$ by the Borel-Cantelli lemma. This means that, for almost every $\omega \in \Omega$, there is an $N = N(\omega)$ such that $|\xi_n(\omega) - \xi(\omega)| < \varepsilon_n$ for $n \ge N(\omega)$. But $\varepsilon_n \downarrow 0$, and therefore $\xi_n(\omega) \to \xi(\omega)$ for almost every $\omega \in \Omega$.

4. Theorem 2. We have the following implications:

$$\xi_n \xrightarrow{\text{a.s.}} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \tag{11}$$

$$\xi_n \xrightarrow{L^p} \xi \ \Rightarrow\ \xi_n \xrightarrow{P} \xi, \quad p > 0, \tag{12}$$

$$\xi_n \xrightarrow{P} \xi \ \Rightarrow\ \xi_n \xrightarrow{d} \xi. \tag{13}$$

PROOF. Statement (11) follows from comparing the definition of convergence in probability with (5), and (12) follows from Chebyshev's inequality.

To prove (13), let $f(x)$ be a continuous function with $|f(x)| \le c$, let $\varepsilon > 0$, and let $N$ be such that $P(|\xi| > N) \le \varepsilon/4c$. Take $\delta$ so that $|f(x) - f(y)| \le \varepsilon/2c$ for $|x| < N$ and $|x - y| \le \delta$. Then (cf. the proof of Weierstrass's theorem in Subsection 5, §5, Chapter I)

$$
\begin{aligned}
E|f(\xi_n) - f(\xi)| &= E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| \le N) \\
&\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| \le \delta,\ |\xi| > N) \\
&\quad + E(|f(\xi_n) - f(\xi)|;\ |\xi_n - \xi| > \delta) \\
&\le \varepsilon/2 + \varepsilon/2 + 2cP\{|\xi_n - \xi| > \delta\} = \varepsilon + 2cP\{|\xi_n - \xi| > \delta\}.
\end{aligned}
$$

But $P\{|\xi_n - \xi| > \delta\} \to 0$, and hence $E|f(\xi_n) - f(\xi)| \le 2\varepsilon$ for sufficiently large $n$; since $\varepsilon > 0$ is arbitrary, this establishes (13). This completes the proof of the theorem.

We now present a number of examples which show, in particular, that the converses of (11) and (12) are false in general.


EXAMPLE 1 ($\xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$; $\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure. Put

$$A_n^i = \Big[\frac{i-1}{n}, \frac{i}{n}\Big], \qquad \xi_n^i = I_{A_n^i}, \qquad i = 1, 2, \ldots, n;\ n \ge 1.$$

Then the sequence

$$\{\xi_1^1;\ \xi_2^1, \xi_2^2;\ \xi_3^1, \xi_3^2, \xi_3^3;\ \ldots\}$$

of random variables converges both in probability and in mean of order $p > 0$, but does not converge at any point $\omega \in [0, 1]$.

EXAMPLE 2 ($\xi_n \xrightarrow{\text{a.s.}} \xi \Rightarrow \xi_n \xrightarrow{P} \xi \nRightarrow \xi_n \xrightarrow{L^p} \xi$, $p > 0$). Again let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, $P$ = Lebesgue measure, and let

$$\xi_n(\omega) = \begin{cases} e^n, & 0 \le \omega \le 1/n, \\ 0, & \omega > 1/n. \end{cases}$$

Then $\{\xi_n\}$ converges with probability 1 (and therefore in probability) to zero, but

$$E|\xi_n|^p = \frac{e^{np}}{n} \to \infty, \qquad n \to \infty,$$

for every $p > 0$.

EXAMPLE 3 ($\xi_n \xrightarrow{L^p} \xi \nRightarrow \xi_n \xrightarrow{\text{a.s.}} \xi$). Let $\{\xi_n\}$ be a sequence of independent random variables with

$$P(\xi_n = 1) = p_n, \qquad P(\xi_n = 0) = 1 - p_n.$$

Then it is easy to show that

$$\xi_n \xrightarrow{P} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{14}$$

$$\xi_n \xrightarrow{L^p} 0 \ \Leftrightarrow\ p_n \to 0, \qquad n \to \infty, \tag{15}$$

$$\xi_n \xrightarrow{\text{a.s.}} 0 \ \Rightarrow\ \sum_{n=1}^{\infty} p_n < \infty. \tag{16}$$

In particular, if $p_n = 1/n$ then $\xi_n \xrightarrow{L^p} 0$ for every $p > 0$, but $\xi_n \overset{\text{a.s.}}{\nrightarrow} 0$.
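Example 3 with $p_n = 1/n$ can be simulated directly. The sketch below (mine, not from the book; the helper name `last_one` is illustrative) shows that on every simulated path the value 1 keeps appearing arbitrarily late, so the path cannot converge to 0 even though $E|\xi_n|^p = 1/n \to 0$:

```python
import random

# For independent xi_n with P(xi_n = 1) = 1/n, sum p_n diverges, so by the
# Borel-Cantelli lemma (b) each path takes the value 1 infinitely often.
rng = random.Random(1)

def last_one(n_max):
    """Largest n <= n_max with xi_n = 1 on one simulated path."""
    last = 0
    for n in range(1, n_max + 1):
        if rng.random() < 1.0 / n:
            last = n
    return last

hits = [last_one(100_000) for _ in range(5)]
print(hits)   # ones keep appearing very late on every path
```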

The following theorem singles out an interesting case when almost sure convergence implies convergence in L 1 •

Theorem 3. Let $(\xi_n)$ be a sequence of nonnegative random variables such that $\xi_n \xrightarrow{\text{a.s.}} \xi$ and $E\xi_n \to E\xi < \infty$. Then

$$E|\xi_n - \xi| \to 0, \qquad n \to \infty. \tag{17}$$


PROOF. We have $E\xi_n < \infty$ for sufficiently large $n$, and therefore for such $n$

$$E|\xi - \xi_n| = E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi)I_{\{\xi_n > \xi\}} = 2E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} + E(\xi_n - \xi).$$

But $0 \le (\xi - \xi_n)I_{\{\xi \ge \xi_n\}} \le \xi$. Therefore, by the dominated convergence theorem, $\lim_n E(\xi - \xi_n)I_{\{\xi \ge \xi_n\}} = 0$, which together with $E\xi_n \to E\xi$ proves (17).

Remark. The dominated convergence theorem also holds when almost sure convergence is replaced by convergence in probability (see Problem 1). Hence in Theorem 3 we may replace "$\xi_n \xrightarrow{\text{a.s.}} \xi$" by "$\xi_n \xrightarrow{P} \xi$".

5. It is shown in analysis that every fundamental sequence (xn), xn E R, is convergent (Cauchy criterion). Let us give a similar result for the convergence of a sequence of random variables.

Theorem 4 (Cauchy Criterion for Almost Sure Convergence). A necessary and sufficient condition for the sequence $(\xi_n)_{n \ge 1}$ of random variables to converge with probability 1 (to a random variable $\xi$) is that it is fundamental with probability 1.

PROOF. If $\xi_n \xrightarrow{\text{a.s.}} \xi$ then

$$\sup_{k \ge n,\, l \ge n} |\xi_k - \xi_l| \le 2 \sup_{k \ge n} |\xi_k - \xi|,$$

whence the necessity follows.

Now let $(\xi_n)_{n \ge 1}$ be fundamental with probability 1. Let $\mathscr{N} = \{\omega \colon (\xi_n(\omega)) \text{ is not fundamental}\}$. Then whenever $\omega \in \Omega \setminus \mathscr{N}$ the sequence of numbers $(\xi_n(\omega))_{n \ge 1}$ is fundamental and, by Cauchy's criterion for sequences of numbers, $\lim \xi_n(\omega)$ exists. Let

$$\xi(\omega) = \begin{cases} \lim \xi_n(\omega), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}. \end{cases} \tag{18}$$

The function so defined is a random variable, and evidently $\xi_n \xrightarrow{\text{a.s.}} \xi$. This completes the proof.

Before considering the case of convergence in probability, let us establish the following useful result.

Theorem 5. If the sequence $(\xi_n)$ is fundamental (or convergent) in probability, it contains a subsequence $(\xi_{n_k})$ that is fundamental (or convergent) with probability 1.


PROOF. Let $(\xi_n)$ be fundamental in probability. By Theorem 4, it is enough to show that it contains a subsequence that converges almost surely.

Take $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which

$$P\{|\xi_t - \xi_s| > 2^{-k}\} < 2^{-k}$$

for all $s \ge n$, $t \ge n$. Then

$$\sum_k P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k}\} \le \sum_k 2^{-k} < \infty,$$

and by the Borel-Cantelli lemma

$$P\{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k} \text{ i.o.}\} = 0.$$

Hence

$$\sum_{k=1}^{\infty} |\xi_{n_{k+1}} - \xi_{n_k}| < \infty$$

with probability 1. Let $\mathscr{N} = \{\omega \colon \sum |\xi_{n_{k+1}}(\omega) - \xi_{n_k}(\omega)| = \infty\}$. Then if we put

$$\xi(\omega) = \begin{cases} \xi_{n_1}(\omega) + \displaystyle\sum_{k=1}^{\infty} (\xi_{n_{k+1}}(\omega) - \xi_{n_k}(\omega)), & \omega \in \Omega \setminus \mathscr{N}, \\ 0, & \omega \in \mathscr{N}, \end{cases}$$

we obtain $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$. If the original sequence converges in probability, then it is fundamental in probability (see also (19)), and consequently this case reduces to the one already considered. This completes the proof of the theorem.
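Theorem 5 can be watched in action on the "typewriter" sequence of Example 1 above, which converges in probability but at no point; the subsequence formed by the first indicator of each block converges at every $\omega > 0$. A small sketch (mine, not from the book; the function name `typewriter` is illustrative):

```python
def typewriter(m):
    """m-th element (0-based) of the sequence I_{[(i-1)/n, i/n]},
    listed block by block for n = 1, 2, ..."""
    n, i = 1, m
    while i >= n:
        i -= n
        n += 1
    lo, hi = i / n, (i + 1) / n
    return lambda w: 1.0 if lo <= w <= hi else 0.0

w = 0.3
path = [typewriter(m)(w) for m in range(200)]
print(max(path[100:]))   # the full sequence still hits 1: no pointwise limit

# Subsequence: first element of each block, i.e. the indicator of [0, 1/n].
sub = [typewriter(n * (n - 1) // 2)(w) for n in range(1, 30)]
print(sub[-1])           # the subsequence has settled at 0 at this omega
```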

Theorem 6 (Cauchy Criterion for Convergence in Probability). A necessary and sufficient condition for a sequence $(\xi_n)_{n \ge 1}$ of random variables to converge in probability is that it is fundamental in probability.

PROOF. If $\xi_n \xrightarrow{P} \xi$ then

$$P\{|\xi_n - \xi_m| \ge \varepsilon\} \le P\{|\xi_n - \xi| \ge \varepsilon/2\} + P\{|\xi_m - \xi| \ge \varepsilon/2\}, \tag{19}$$

and consequently $(\xi_n)$ is fundamental in probability.

Conversely, if $(\xi_n)$ is fundamental in probability, by Theorem 5 there are a subsequence $(\xi_{n_k})$ and a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$. But then

$$P\{|\xi_n - \xi| \ge \varepsilon\} \le P\{|\xi_n - \xi_{n_k}| \ge \varepsilon/2\} + P\{|\xi_{n_k} - \xi| \ge \varepsilon/2\},$$

from which it is clear that $\xi_n \xrightarrow{P} \xi$. This completes the proof.

Before discussing convergence in mean of order $p$, we make some observations about $L^p$ spaces.


We denote by $L^p = L^p(\Omega, \mathscr{F}, P)$ the space of random variables $\xi = \xi(\omega)$ with $E|\xi|^p = \int_{\Omega} |\xi|^p\,dP < \infty$. Suppose that $p \ge 1$ and put

$$\|\xi\|_p = (E|\xi|^p)^{1/p}.$$

It is clear that

$$\|\xi\|_p \ge 0, \tag{20}$$

$$\|c\xi\|_p = |c|\,\|\xi\|_p, \qquad c \text{ constant}, \tag{21}$$

and by Minkowski's inequality (6.31)

$$\|\xi + \eta\|_p \le \|\xi\|_p + \|\eta\|_p. \tag{22}$$

Hence, in accordance with the usual terminology of functional analysis, the function $\|\cdot\|_p$, defined on $L^p$ and satisfying (20)-(22), is (for $p \ge 1$) a seminorm.

For it to be a norm, it must also satisfy

$$\|\xi\|_p = 0 \Rightarrow \xi = 0. \tag{23}$$

This property is, of course, not satisfied, since according to Property H (§6) we can only say that $\xi = 0$ almost surely.

However, if $L^p$ means the space whose elements are not random variables $\xi$ with $E|\xi|^p < \infty$, but equivalence classes of random variables ($\xi$ is equivalent to $\eta$ if $\xi = \eta$ almost surely), then $\|\cdot\|_p$ becomes a norm, so that $L^p$ is a normed linear space. If we select from each equivalence class of random variables a single element, taking the identically zero function as the representative of the class equivalent to it, we obtain a space (also denoted by $L^p$) which is actually a normed linear space of functions (rather than of equivalence classes).

It is a basic result of functional analysis that the spaces $L^p$, $p \ge 1$, are complete, i.e. that every fundamental sequence has a limit. Let us state and prove this in probabilistic language.

Theorem 7 (Cauchy Criterion for Convergence in Mean of Order $p$). A necessary and sufficient condition for a sequence $(\xi_n)_{n \ge 1}$ of random variables in $L^p$ to converge in mean of order $p$ to a random variable in $L^p$ is that the sequence is fundamental in mean of order $p$.

PROOF. The necessity follows from Minkowski's inequality. Now let $(\xi_n)$ be fundamental ($\|\xi_n - \xi_m\|_p \to 0$, $n, m \to \infty$). As in the proof of Theorem 5, we select a subsequence $(\xi_{n_k})$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$, where $\xi$ is a random variable with $\|\xi\|_p < \infty$.

Let $n_1 = 1$ and define $n_k$ inductively as the smallest $n > n_{k-1}$ for which

$$\|\xi_t - \xi_s\|_p < 2^{-2k}$$

for all $s \ge n$, $t \ge n$. Let

$$A_k = \{|\xi_{n_{k+1}} - \xi_{n_k}| > 2^{-k}\}.$$

Then by Chebyshev's inequality

$$P(A_k) \le \frac{E|\xi_{n_{k+1}} - \xi_{n_k}|^p}{2^{-kp}} \le \frac{2^{-2kp}}{2^{-kp}} = 2^{-kp} \le 2^{-k}.$$

As in Theorem 5, we deduce that there is a random variable $\xi$ such that $\xi_{n_k} \xrightarrow{\text{a.s.}} \xi$.

We now deduce that $\|\xi_n - \xi\|_p \to 0$ as $n \to \infty$. To do this, we fix $\varepsilon > 0$ and choose $N = N(\varepsilon)$ so that $\|\xi_n - \xi_m\|_p^p < \varepsilon$ for all $n \ge N$, $m \ge N$. Then for any fixed $n \ge N$, by Fatou's lemma,

$$E|\xi_n - \xi|^p = E\Big\{\lim_{k \to \infty} |\xi_n - \xi_{n_k}|^p\Big\} = E\Big\{\underline{\lim}_{k \to \infty} |\xi_n - \xi_{n_k}|^p\Big\} \le \underline{\lim}_{k \to \infty} E|\xi_n - \xi_{n_k}|^p \le \varepsilon.$$

Consequently $E|\xi_n - \xi|^p \to 0$, $n \to \infty$. It is also clear that since $\xi = (\xi - \xi_n) + \xi_n$ we have $E|\xi|^p < \infty$ by Minkowski's inequality. This completes the proof of the theorem.

Remark 1. In the terminology of functional analysis a complete normed linear space is called a Banach space. Thus $L^p$, $p \ge 1$, is a Banach space.

Remark 2. If $0 < p < 1$, the function $\|\xi\|_p = (E|\xi|^p)^{1/p}$ does not satisfy the triangle inequality (22) and consequently is not a norm. Nevertheless the space (of equivalence classes) $L^p$, $0 < p < 1$, is complete in the metric $d(\xi, \eta) = E|\xi - \eta|^p$.

Remark 3. Let $L^\infty = L^\infty(\Omega, \mathscr{F}, P)$ be the space (of equivalence classes of) random variables $\xi = \xi(\omega)$ for which $\|\xi\|_\infty < \infty$, where $\|\xi\|_\infty$, the essential supremum of $\xi$, is defined by

$$\|\xi\|_\infty = \operatorname{ess\,sup} |\xi| = \inf\{0 \le c \le \infty \colon P(|\xi| > c) = 0\}.$$

The function $\|\cdot\|_\infty$ is a norm, and $L^\infty$ is complete in this norm.
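On a finite sample space with the uniform measure, $\|\xi\|_p$ and the essential supremum reduce to elementary computations, and Minkowski's inequality (22) can be checked directly. A small sketch (mine, not from the book; the values of `xi` and `eta` are arbitrary):

```python
def norm_p(x, p):
    """||xi||_p = (E|xi|^p)^(1/p) under the uniform measure on a finite space."""
    return (sum(abs(a) ** p for a in x) / len(x)) ** (1 / p)

def ess_sup(x):
    # On a finite space with positive point masses this is just max |xi|.
    return max(abs(a) for a in x)

xi = [1.0, -2.0, 0.5, 3.0]
eta = [0.0, 1.0, -1.0, 2.0]
s = [a + b for a, b in zip(xi, eta)]

print(norm_p(s, 2) <= norm_p(xi, 2) + norm_p(eta, 2))  # True: (22)
print(ess_sup(xi))                                      # 3.0
```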

PROBLEMS

1. Use Theorem 5 to show that almost sure convergence can be replaced by convergence in probability in Theorems 3 and 4 of §6.

2. Prove that $L^\infty$ is complete.

3. Show that if $\xi_n \xrightarrow{P} \xi$ and also $\xi_n \xrightarrow{P} \eta$ then $\xi$ and $\eta$ are equivalent ($P(\xi \ne \eta) = 0$).

4. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$, and let $\xi$ and $\eta$ be equivalent. Show that

$$P\{|\xi_n - \eta_n| \ge \varepsilon\} \to 0, \qquad n \to \infty,$$

for every $\varepsilon > 0$.

5. Let $\xi_n \xrightarrow{P} \xi$, $\eta_n \xrightarrow{P} \eta$. Show that $a\xi_n + b\eta_n \xrightarrow{P} a\xi + b\eta$ ($a$, $b$ constants), $|\xi_n| \xrightarrow{P} |\xi|$, and $\xi_n\eta_n \xrightarrow{P} \xi\eta$.

6. Let $E(\xi_n - \xi)^2 \to 0$. Show that $E\xi_n^2 \to E\xi^2$.

7. Show that if $\xi_n \xrightarrow{d} C$, where $C$ is a constant, then this sequence converges in probability: $\xi_n \xrightarrow{d} C \Rightarrow \xi_n \xrightarrow{P} C$.

8. Let $(\xi_n)_{n \ge 1}$ have the property that $\sum_{n=1}^{\infty} E|\xi_n|^p < \infty$ for some $p > 0$. Show that $\xi_n \to 0$ (P-a.s.).

9. Let $(\xi_n)_{n \ge 1}$ be a sequence of independent identically distributed random variables. Show that

$$E|\xi_1| < \infty \ \Leftrightarrow\ \sum_{n=1}^{\infty} P\{|\xi_1| > \varepsilon n\} < \infty \ \text{for every } \varepsilon > 0.$$

10. Let $(\xi_n)_{n \ge 1}$ be a sequence of random variables. Suppose that there are a random variable $\xi$ and a sequence $\{n_k\}$ such that $\xi_{n_k} \to \xi$ (P-a.s.) and $\max_{n_k < l \le n_{k+1}} |\xi_l - \xi_{n_k}| \to 0$ (P-a.s.) as $k \to \infty$. Show that $\xi_n \to \xi$ (P-a.s.).
§11. The Hilbert Space of Random Variables with Finite Second Moment

1. An important role among the Banach spaces $L^p$, $p \ge 1$, is played by the space $L^2 = L^2(\Omega, \mathscr{F}, P)$, the space of (equivalence classes of) random variables with finite second moments. If $\xi$ and $\eta \in L^2$, we put

$$(\xi, \eta) = E\xi\eta. \tag{1}$$

It is clear that if $\xi, \eta, \zeta \in L^2$ then

$$(a\xi + b\eta, \zeta) = a(\xi, \zeta) + b(\eta, \zeta), \qquad a, b \in R,$$

$$(\xi, \xi) \ge 0,$$

and $(\xi, \xi) = 0 \Leftrightarrow \xi = 0$.

Consequently $(\xi, \eta)$ is a scalar product. The space $L^2$ is complete with respect to the norm

$$\|\xi\| = (\xi, \xi)^{1/2} \tag{2}$$

induced by this scalar product (as was shown in §10). In accordance with the terminology of functional analysis, a space with the scalar product (1) is a Hilbert space.

Hilbert space methods are extensively used in probability theory to study properties that depend only on the first two moments of random variables ("$L^2$-theory"). Here we shall introduce the basic concepts and facts that will be needed for an exposition of $L^2$-theory (Chapter VI).

2. Two random variables $\xi$ and $\eta$ in $L^2$ are said to be orthogonal ($\xi \perp \eta$) if $(\xi, \eta) = E\xi\eta = 0$. According to §8, $\xi$ and $\eta$ are uncorrelated if $\operatorname{cov}(\xi, \eta) = 0$, i.e. if

$$E\xi\eta = E\xi \cdot E\eta.$$

It follows that the properties of being orthogonal and of being uncorrelated coincide for random variables with zero mean values.

A set $M \subseteq L^2$ is a system of orthogonal random variables if $\xi \perp \eta$ for every $\xi, \eta \in M$ ($\xi \ne \eta$). If also $\|\xi\| = 1$ for every $\xi \in M$, then $M$ is an orthonormal system.

El~- ;t1 ~;'1;1 2 = 11~- it1 a;'1f = (~- ;t1 a;'1;, ~- ;t1 a;'1;)

it a;(~, + (t J1

= Wl2-

2

= 11~11 2 -

2

n

L

i= 1

-

Wl 2

-

~

where we used the equation

L

i= 1 n

I

i= 1

a;'1;)

n

a;(~,

'1;) + L af

n

= Wl 2

a;'1;,

'1;)

i= 1 n

1(~, '1;W

1<~. 11;)1 2 ,

+ L

i= 1

Ia;- (~, '1;W (3)

262

II. Mathematical Foundations of Probability Theory

Ii=

It is now clear that the infimum of E I~ 1 a;1Jd 2 over all real at> ... , an is attained for a; = (~, IJ;), i = 1, ... , n. Consequently the best (in the mean-square sense) estimator for~ in terms of 1] 1, •.• , IJn is n

~=

I

i= 1

c~, IJ;)'li·

Here

~ = infEI~- it1 a;IJ;'2 = El~- ~12 = Wlz-

(4)

J1 1(~,1];)12

(5)

(compare (1.4.17) and (8.13)). Inequality (3) also implies Bessel's inequality: if M = {1] 1, 1] 2, ... } an orthonormal system and ~ E L 2 , then 00

I

i= 1

1(~,

'1;)1 2

IS

Wl 2 ;

(6)

(~, IJ;)}I;.

(7)

~

and equality is attained if and only if n

~ = l.i.m. n

I

i= 1

The best linear estimator of~ is often denoted byE(~ I1] 1, ••. , IJn) and called the conditional expectation (of~ with respect to 1] 1, ..• , IJn) in the wide sense. The reason for the terminology is as follows. If we consider all estimators cp = cp(1J 1, ..• , 1'/n) of ( in terms of 17 t> ... , 1'/n (where cp is a Borel function), the best estimator will be cp* = E(( Ii'J 1, ••• , 1'/n), i.e. the conditional expectation of ( with respect to 1] 1 , ••• , 1'/n (cf. Theorem 1, §8). Hence the best linear estimator is, by analogy, denoted by E(~l1'f 1 , . . . , 1'/n) and called the conditional expectation in the wide sense. We note that if 'lt> ... , 1'/n form a Gaussian system (see §13 below), then E((I1J 1 , ... ,1Jn) and E((I1J 1 , ... ,1Jn) are the same. Let us discuss the geometric meaning of~ = E(~ I'lll ... , IJn). Let !E = !E{1] 1, •.• , IJn} denote the linear manifold spanned by the orthonormal system of random variables 1] 1 , ... , 'ln (i.e., the set of random variables of the form Lf= 1 a;IJ;, a; E R). Then it follows from the preceding discussion that ~ admits the "orthogonal decomposition" (8)

where ~ E !E and ~ - ~ .l !E in the sense that ~ - ~ .ill for every ll E !E. It is natural to call ~ the projection of ( on !E (the element of !t' "closest" to ~),and to say that~ - ~is perpendicular to !E. 4. The concept of orthonormality of the random variables '1~> ... , 'ln makes it easy to find the best linear estimator (the projection) ~ of ~ in terms of

§11. The Hilbert Space of Random Variables with Finite Second Moment

263

'7 1, .•. , 'ln· The situation becomes complicated if we give up the hypothesis of orthonormality. However, the case of arbitrary '7 1, ... , 'ln can in a certain sense be reduced to the case of orthonormal random variables, as will be shown below. We shall suppose for the sake of simplicity that all our random variables have zero mean values. We shall say that the random variables 'It> ... , 'In are linearly independent if the equation n

L a;'l; =

0 (P-a.s.)

i= 1

is satisfied only when all a; are zero. Consider the covariance matrix ~ =E'7'7T

of the vector '7 = ('7 1, ... , 'In). It is symmetric and nonnegative definite, and as noticed in §8, can be diagonalized by an orthogonal matrix l!J: l!JT~l!J =

D,

where

0)

D=(d1 .. 0 · dn

has nonnegative elements d;, the eigenvalues of ~' i.e. the zeros A. of the characteristic equation det(~ - A.E) = 0. If '1 b ... , 'ln are linearly independent, the Gram determinant (det ~) is not zero and therefore d; > 0. Let

and

(Ft·.·Jd,.

B

=

f3

= B-ll!JT'l·

0

0 ) (9)

Then the covariance matrix of f3 is Ef3f3T = B-1l!.JTE'7'7Tl!.JB-1 =

and therefore f3 = (/3 1 , It is also clear that

••• ,

B-1l!JT~l!JB-1

= E,

f3n) consists of uncorrelated random variables.

'I= (l!.JB)/3.

(10)

Consequently if '7 1, •.. , 'ln are linearly independent there is an orthonormal system such that (9) and (10) hold. Here

.!l'{'lb · .. ' '7n} = .!l'{/31, ·"' /3n}. This method of constructing an orthonormal system f3 to ••• , Pn is frequently inconvenient. The reason is that if we think of '7; as the value of the random sequence ('7 1 , ..• , 'ln) at the instant i, the value /3; constructed above

264

II. Mathematical Foundations of Probability Theory

depends not only on the "past," (Yf 1 , ••• , Y/;), but also on the "future," (Y/;+ t> ••• , Yfn). The Gram-Schmidt orthogonalization process, described below, does not have this defect, and moreover has the advantage that it can be applied to an infinite sequence of linearly independent random variables (i.e. to a sequence in which every finite set of the variables are linearly independent). Let Yft> Yf 2, .•. be a sequence of linearly independent random variables in L 2 • We construct a sequence e1,e 2 , •.. as follows. Let e1 = Yft!IIY/ 1 11. If e1 , •.• , e"_ 1 have been selected so that they are orthonormal, then e

n

=

Yfn- ~n

(11)

IIYfn - ~nil'

where ~" is the projection of Yfn on the linear manifold !l'(e 1 , generated by

••• ,

en_ 1 )

n-1

~" =

L (Yfn, ek)ek.

(12)

k=1

Since Yft>···•Yfn are linearly independent and !l'{Yf 1 , .•. ,Yfn-d = !l'{e 1 , •.• , en- 1 }, we have IIYfn- ~nil > 0 and consequently en is well defined. By construction, lien II = 1 for n ~ 1, and it is clear that (en, ek) = 0 for k < n. Hence the sequence e1, e2 , ••• is orthonormal. Moreover, by (11), where b" = II Yin - ~nil and ~n is defined by (12). Now let Yf 1, ••• , Yfn be any set of random variables (not necessarily linearly llriill is the covariance matrix of independent). Let det ~ = 0, where ~ (17 1, ... , Yfn), and let

=

rank~=r
Then, from linear algebra, the quadratic form n

Q(a) =

L riiaiai,

i,j= 1

has the property that there are n - r linearly independent vectors a(l>, ... , a(n-r) such that Q(dil) = 0, i = 1, ... , n- r. But

Consequently n

L a~>'1k =

k= 1

with probability 1.

0,

i = 1, ... , n- r,

§II. The Hilbert Space of Random Variables with Finite Second Moment

265

In other words, there are n - r linear relations among the variables '11> ... , '7n· Therefore if, for example, 17 1, ... , '7, are linearly independent, the other variables '7r+ 1, ... , '7n can be expressed linearly in terms of them, and consequently 2'{171> ... , '7n} = !l'{e 1, ... , e,}. Hence it is clear that we can find r orthonormal random variables et> ... , e, such that 17 1, ... , '7n can be expressed linearly in terms ofthem and 2'{17 1, ... , '7n} = !l'{et> ... , e,}. 5. Let '7 1, 17 2 , ••• be a sequence of random variables in L 2 • Let .!l = .!l{'7t> 17 2 , •• •} be the linear manifold spanned by 17 1, 17 2 , •• . ,i.e. the set of random variables of the form 1 a; I'/;. n ;:::: 1, a; e R. Then .!l = .!l{'7t> 17 2 , •• •} denotes the closed linear manifold spanned by 17 1,17 2, ... , i.e. the set of random variables in !l' together with their mean-square limits. We say that a set 17 1 , 17 2 , ••• is a countable orthonormal basis (or a complete orthonormal system) if:

Li'=

(a) '11> 17 2 , ••• is an orthonormal system, (b) .!l{'71• '12, .. .} = L 2. A Hilbert space with a countable orthonormal basis is said to be separable. By (b), for every~ eL 2 and a given e > 0 there are numbers a 1 , ... , an such that

Then by (3)

Consequently every element of a separable Hilbert space L 2 can be represented as 00

~ =

L <~. '7;). '7;,

(13)

i= 1

or more precisely as n

~ = l.i.m. n

L (~, '7)'7;·

i= 1

We infer from this and (3) that Parseval's equation holds:

Wl 2

L 1(~. '7;)1 00

=

i=1

2,

(14)

It is easy to show that the converse is also valid: if '7 1 , 17 2 , ••• is an orthonormal system and either (13) or (14) is satisfied, then the system is a basis. We now give some examples of separable Hilbert spaces and their bases.

266

II. Mathematical Foundations of Probability Theory

EXAMPLE 1. Let $\Omega = R$, $\mathscr{F} = \mathscr{B}(R)$, and let $P$ be the Gaussian measure,

$$P(-\infty, a] = \int_{-\infty}^{a} \varphi(x)\,dx, \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}.$$

Let $D = d/dx$ and

$$H_n(x) = \frac{(-1)^n D^n\varphi(x)}{\varphi(x)}, \qquad n \ge 0. \tag{15}$$

We find easily that

$$D\varphi(x) = -x\varphi(x), \qquad D^2\varphi(x) = (x^2 - 1)\varphi(x), \qquad D^3\varphi(x) = (3x - x^3)\varphi(x), \ldots \tag{16}$$

It follows that the $H_n(x)$ are polynomials (the Hermite polynomials). From (15) and (16) we find that

$$H_0(x) = 1, \qquad H_1(x) = x, \qquad H_2(x) = x^2 - 1, \qquad H_3(x) = x^3 - 3x, \ldots$$

A simple calculation shows that

$$(H_m, H_n) = \int_{-\infty}^{\infty} H_m(x)H_n(x)\,dP = \int_{-\infty}^{\infty} H_m(x)H_n(x)\varphi(x)\,dx = n!\,\delta_{mn},$$

where $\delta_{mn}$ is the Kronecker delta (0 if $m \ne n$, and 1 if $m = n$). Hence if we put

$$h_n(x) = \frac{H_n(x)}{\sqrt{n!}},$$

the system of normalized Hermite polynomials $\{h_n(x)\}_{n \ge 0}$ will be an orthonormal system.

We know from functional analysis that if

$$\lim_{c \downarrow 0} \int_{-\infty}^{\infty} e^{c|x|}\,P(dx) < \infty, \tag{17}$$

the system $\{1, x, x^2, \ldots\}$ is complete in $L^2$, i.e. every function $\xi = \xi(x)$ in $L^2$ can be represented either as $\sum_i a_i\eta_i(x)$, where $\eta_i(x) = x^i$, or as a limit of these functions (in the mean-square sense). If we apply the Gram-Schmidt orthogonalization process to the sequence $\eta_1(x), \eta_2(x), \ldots$, with $\eta_i(x) = x^i$, the resulting orthonormal system will be precisely the system of normalized Hermite polynomials.

In the present case, (17) is satisfied. Hence $\{h_n(x)\}_{n \ge 0}$ is a basis and therefore every random variable $\xi = \xi(x)$ on this probability space can be represented in the form

$$\xi(x) = \operatorname{l.i.m.}_n \sum_{i=0}^n (\xi, h_i)h_i(x). \tag{18}$$

EXAMPLE 2. Let $\Omega = \{0, 1, 2, \ldots\}$ and let $P = \{P_0, P_1, P_2, \ldots\}$ be the Poisson distribution

$$P_x = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x = 0, 1, \ldots;\ \lambda > 0.$$

Put $\Delta f(x) = f(x) - f(x-1)$ ($f(x) = 0$, $x < 0$), and by analogy with (15) define the Poisson-Charlier polynomials

$$\Pi_n(x) = \frac{(-1)^n\Delta^n P_x}{P_x}, \qquad n \ge 1;\ \Pi_0 = 1. \tag{19}$$

Since

$$(\Pi_m, \Pi_n) = \sum_{x=0}^{\infty} \Pi_m(x)\Pi_n(x)P_x = c_n\delta_{mn},$$

where the $c_n$ are positive constants, the system of normalized Poisson-Charlier polynomials $\{\pi_n(x)\}_{n \ge 0}$, $\pi_n(x) = \Pi_n(x)/\sqrt{c_n}$, is an orthonormal system, which is a basis since it satisfies (17).

EXAMPLE 3. In this example we describe the Rademacher and Haar systems, which are of interest in function theory as well as in probability theory.

Let $\Omega = [0, 1]$, $\mathscr{F} = \mathscr{B}([0, 1])$, and let $P$ be Lebesgue measure. As we mentioned in §1, every $x \in [0, 1]$ has a unique binary expansion

$$x = \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots,$$

where $x_i = 0$ or 1. To ensure uniqueness of the expansion, we agree to consider only expansions containing an infinite number of zeros. Thus we choose the first of the two expansions

-=-+-+-+ . 2 22 23 2 2 2 2 2 3 .. ·=-+-+-+"· We define random variables ~ 1 (x), ~ 2 (x), ... by putting

268

II. Mathematical Foundations of Probability Theory R 2 (x)

R 1(x) I I I I I I I

1.!.

0

I

X

2

0

I

1.!. 4

1.!. IJ. 2

I I

I

I

-1

~

X

4

I I

I I

I I

I I I I

I I I I I I I

I I I I I I I

I I

I I I

-1

~

I I I I I I I

I I

I I I

I I ..._...I ..._..

Figure 30

Then for any numbers $a_i$, equal to 0 or 1,

$$P\{x \colon \xi_1 = a_1, \ldots, \xi_n = a_n\} = P\Big\{x \colon x \in \Big[\frac{a_1}{2} + \cdots + \frac{a_n}{2^n},\ \frac{a_1}{2} + \cdots + \frac{a_n}{2^n} + \frac{1}{2^n}\Big]\Big\} = \frac{1}{2^n}.$$

It follows immediately that $\xi_1, \xi_2, \ldots$ form a sequence of independent Bernoulli random variables (Figure 30 shows the construction of $\xi_1 = \xi_1(x)$ and $\xi_2 = \xi_2(x)$).

If we now set $R_n(x) = 1 - 2\xi_n(x)$, $n \ge 1$, it is easily verified that $\{R_n\}$ (the Rademacher functions, Figure 31) are orthonormal:

$$ER_nR_m = \int_0^1 R_n(x)R_m(x)\,dx = \delta_{nm}.$$

Notice that $(1, R_n) = ER_n = 0$. It follows that this system is not complete. However, the Rademacher system can be used to construct the Haar system, which also has a simple structure and is both orthonormal and complete.

Notice that (1, R.) ER. = 0. It follows that this system is not complete. However, the Rademacher system can be used to construct the Haar system, which also has a simple structure and is both orthonormal and

complete.

~(x)

~(x)

I~

~

I I I I I

I I I I

I

I

I

I I

0

t

I I I

I I I

I I

I I I I

I

I

I

I I

I

I

I

X

0

*

I

I

r-; I I I

I I I

I

I I

t i

Figure 31. Rademacher functions.

I I I I I I I

I

I

X

269

§11. The Hilbert Space of Random Variables with Finite Second Moment

Again let $\Omega = [0, 1)$ and $\mathscr{F} = \mathscr{B}([0, 1))$. Put

$$H_1(x) = 1, \qquad H_2(x) = R_1(x),$$

and, for $n = 2^j + k$, $1 \le k \le 2^j$, $j \ge 1$,

$$H_n(x) = \begin{cases} 2^{j/2}R_{j+1}(x), & \dfrac{k-1}{2^j} \le x < \dfrac{k}{2^j}, \\[4pt] 0, & \text{otherwise.} \end{cases}$$

It is easy to see that $H_n(x)$ can also be written in the form

$$H_{2^m+1}(x) = \begin{cases} 2^{m/2}, & 0 \le x < 2^{-(m+1)}, \\ -2^{m/2}, & 2^{-(m+1)} \le x < 2^{-m}, \\ 0, & \text{otherwise,} \end{cases} \qquad m = 1, 2, \ldots,$$

$$H_{2^m+j}(x) = H_{2^m+1}\Big(x - \frac{j-1}{2^m}\Big), \qquad j = 1, \ldots, 2^m.$$

Figure 32 shows graphs of the first eight functions, to give an idea of the structure of the Haar functions.

Figure 32. The Haar functions $H_1(x), \ldots, H_8(x)$.

It is easy to see that the Haar system is orthonormal. Moreover, it is complete both in $L^1$ and in $L^2$, i.e. if $f = f(x) \in L^p$ for $p = 1$ or 2, then

$$\int_0^1 \Big|f(x) - \sum_{k=1}^n (f, H_k)H_k(x)\Big|^p\,dx \to 0, \qquad n \to \infty.$$

The system also has the property that

$$\sum_{k=1}^n (f, H_k)H_k(x) \to f(x), \qquad n \to \infty,$$

with probability 1 (with respect to Lebesgue measure).

In §4, Chapter VII, we shall prove these facts by deriving them from general theorems on the convergence of martingales. This will, in particular, provide a good illustration of the application of martingale methods to the theory of functions.
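The partial Haar sums above can be computed explicitly. This is my own sketch, not from the book: it implements the closed form of $H_n$ and approximates $f(x) = x$ by its first eight Haar coefficients (the partial sum then averages $f$ over dyadic intervals of length $1/8$):

```python
def haar(n, x):
    """Haar function H_n(x) on [0, 1), following the closed form above."""
    if n == 1:
        return 1.0
    j = (n - 1).bit_length() - 1       # n = 2**j + k with 1 <= k <= 2**j
    k = n - 2 ** j
    a, b = (k - 1) / 2 ** j, k / 2 ** j
    if not (a <= x < b):
        return 0.0
    return 2 ** (j / 2) if x < (a + b) / 2 else -2 ** (j / 2)

N = 1024
mid = [(2 * i + 1) / (2 * N) for i in range(N)]     # dyadic midpoints
f = lambda x: x
coef = [sum(f(x) * haar(n, x) for x in mid) / N for n in range(1, 9)]
S = lambda x: sum(c * haar(n, x) for n, c in zip(range(1, 9), coef))

err = max(abs(f(x) - S(x)) for x in mid)
print(err < 1 / 8)    # True: the partial sum is already within 1/8 of f
```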

6. If $\eta_1, \ldots, \eta_n$ is a finite orthonormal system then, as was shown above, for every random variable $\xi \in L^2$ there is a random variable $\hat\xi$ in the linear manifold $\mathscr{L} = \mathscr{L}\{\eta_1, \ldots, \eta_n\}$, namely the projection of $\xi$ on $\mathscr{L}$, such that

$$\|\xi - \hat\xi\| = \inf\{\|\xi - \zeta\| \colon \zeta \in \mathscr{L}\{\eta_1, \ldots, \eta_n\}\}.$$

Here $\hat\xi = \sum_{i=1}^n (\xi, \eta_i)\eta_i$. This result has a natural generalization to the case when $\eta_1, \eta_2, \ldots$ is a countable orthonormal system (not necessarily a basis). In fact, we have the following result.

Theorem. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system of random variables, and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by the system. Then there is a unique element $\hat\xi \in \overline{\mathscr{L}}$ such that

$$\|\xi - \hat\xi\| = \inf\{\|\xi - \zeta\| \colon \zeta \in \overline{\mathscr{L}}\}. \tag{20}$$

Moreover,

$$\hat\xi = \operatorname{l.i.m.}_n \sum_{i=1}^n (\xi, \eta_i)\eta_i \tag{21}$$

and $\xi - \hat\xi \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$.

PROOF. Let $d = \inf\{\|\xi - \zeta\| \colon \zeta \in \overline{\mathscr{L}}\}$ and choose a sequence $\zeta_1, \zeta_2, \ldots$ such that $\|\xi - \zeta_n\| \to d$. Let us show that this sequence is fundamental. A simple calculation shows that

$$\|\zeta_n - \zeta_m\|^2 = 2\|\zeta_n - \xi\|^2 + 2\|\zeta_m - \xi\|^2 - 4\Big\|\frac{\zeta_n + \zeta_m}{2} - \xi\Big\|^2.$$

It is clear that $(\zeta_n + \zeta_m)/2 \in \overline{\mathscr{L}}$; consequently $\|[(\zeta_n + \zeta_m)/2] - \xi\|^2 \ge d^2$ and therefore $\|\zeta_n - \zeta_m\|^2 \to 0$, $n, m \to \infty$.

The space $L^2$ is complete (Theorem 7, §10). Hence there is an element $\hat\xi$ such that $\|\zeta_n - \hat\xi\| \to 0$. But $\overline{\mathscr{L}}$ is closed, so $\hat\xi \in \overline{\mathscr{L}}$. Moreover, $\|\xi - \zeta_n\| \to d$, and consequently $\|\xi - \hat\xi\| = d$, which establishes the existence of the required element.

Let us show that $\hat\xi$ is the only element of $\overline{\mathscr{L}}$ with the required property. Let $\tilde\xi \in \overline{\mathscr{L}}$ and let

$$\|\xi - \tilde\xi\| = \|\xi - \hat\xi\| = d.$$

Then (by Problem 3)

$$\|\hat\xi + \tilde\xi - 2\xi\|^2 + \|\hat\xi - \tilde\xi\|^2 = 2\|\hat\xi - \xi\|^2 + 2\|\tilde\xi - \xi\|^2 = 4d^2.$$

But

$$\|\hat\xi + \tilde\xi - 2\xi\|^2 = 4\|\tfrac{1}{2}(\hat\xi + \tilde\xi) - \xi\|^2 \ge 4d^2.$$

Consequently $\|\hat\xi - \tilde\xi\|^2 = 0$. This establishes the uniqueness of the element of $\overline{\mathscr{L}}$ that is closest to $\xi$.

Now let us show that $\xi - \hat\xi \perp \zeta$, $\zeta \in \overline{\mathscr{L}}$. By (20),

$$\|\xi - \hat\xi - c\zeta\| \ge \|\xi - \hat\xi\|$$

for every $c \in R$. But

$$\|\xi - \hat\xi - c\zeta\|^2 = \|\xi - \hat\xi\|^2 + c^2\|\zeta\|^2 - 2(\xi - \hat\xi, c\zeta).$$

Therefore

$$c^2\|\zeta\|^2 \ge 2(\xi - \hat\xi, c\zeta). \tag{22}$$

Take $c = \lambda(\xi - \hat\xi, \zeta)$, $\lambda \in R$. Then we find from (22) that

$$(\xi - \hat\xi, \zeta)^2\big[\lambda^2\|\zeta\|^2 - 2\lambda\big] \ge 0.$$

We have $\lambda^2\|\zeta\|^2 - 2\lambda < 0$ if $\lambda$ is a sufficiently small positive number. Consequently $(\xi - \hat\xi, \zeta) = 0$, $\zeta \in \overline{\mathscr{L}}$.

It remains only to prove (21). The set $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ is a closed subspace of $L^2$ and therefore a Hilbert space (with the same scalar product). Now the system $\eta_1, \eta_2, \ldots$ is a basis for $\overline{\mathscr{L}}$ (Problem 4), and consequently

$$\hat\xi = \operatorname{l.i.m.}_n \sum_{k=1}^n (\hat\xi, \eta_k)\eta_k. \tag{23}$$

But $\xi - \hat\xi \perp \eta_k$, $k \ge 1$, and therefore $(\hat\xi, \eta_k) = (\xi, \eta_k)$, $k \ge 1$. This, with (23), establishes (21). This completes the proof of the theorem.

~

=

is the orthogonal decomposition

~

+ (~-

of~.

~)

272

II. Mathematical Foundations of Probability Theory

We also denote~ by E(el'7 1 , 17 2 , •••) and call it the conditional expectation in the wide sense (of with respect to '11> 17 2 , •••). From the point of view of estimating in terms of '71> '72' the variable eis the optimal linear estimator, with error

e

e

0

ll

0

0

'

= e1e- ~1 2 = 11e- ~11 2 = 11e11 2

00

L l<e. '7i)l

-

2•

i=l

which follows from (5) and (23).

7. PROBLEMS

1. Show that if $\xi = \operatorname{l.i.m.} \xi_n$ then $\|\xi_n\| \to \|\xi\|$.

2. Show that if $\xi = \operatorname{l.i.m.} \xi_n$ and $\eta = \operatorname{l.i.m.} \eta_n$ then $(\xi_n, \eta_n) \to (\xi, \eta)$.

3. Show that the norm $\|\cdot\|$ has the parallelogram property

$$\|\xi + \eta\|^2 + \|\xi - \eta\|^2 = 2(\|\xi\|^2 + \|\eta\|^2).$$

4. Let $(\xi_1, \ldots, \xi_n)$ be a family of orthogonal random variables. Show that they have the Pythagorean property,

$$\Big\|\sum_{i=1}^n \xi_i\Big\|^2 = \sum_{i=1}^n \|\xi_i\|^2.$$

5. Let $\eta_1, \eta_2, \ldots$ be an orthonormal system and $\overline{\mathscr{L}} = \overline{\mathscr{L}}\{\eta_1, \eta_2, \ldots\}$ the closed linear manifold spanned by $\eta_1, \eta_2, \ldots$. Show that the system is a basis for the (Hilbert) space $\overline{\mathscr{L}}$.

6. Let $\xi_1, \xi_2, \ldots$ be a sequence of orthogonal random variables and $S_n = \xi_1 + \cdots + \xi_n$. Show that if $\sum_{n=1}^{\infty} E\xi_n^2 < \infty$ there is a random variable $S$ with $ES^2 < \infty$ such that $\operatorname{l.i.m.} S_n = S$, i.e. $\|S_n - S\|^2 = E|S_n - S|^2 \to 0$, $n \to \infty$.

7. Show that in the space $L^2 = L^2([-\pi, \pi], \mathscr{B}([-\pi, \pi]))$ with Lebesgue measure $\mu$ the system $\{(1/\sqrt{2\pi})e^{i\lambda n},\ n = 0, \pm 1, \ldots\}$ is an orthonormal basis.

§12. Characteristic Functions 1. The method of characteristic functions is one of the main tools of the analytic theory of probability. This will appear very clearly in Chapter III in the proofs of limit theorems and, in particular, in the proof of the central limit theorem, which generalizes the De Moivre-Laplace theorem. In the present section we merely define characteristic functions and present their basic properties. First we make some general remarks. Besides random variables which take real values, the theory of characteristic functions requires random variables that take complex values (see Subsection 1 of §5).

273

§12. Characteristic Functions

Many definitions and properties involving random variables can easily be carried over to the complex case. For example, the expectation E( of a complex random variable C= ~ + i17 will exist if the expectations E~ and E17 exist. In this case we define E( = E~ + iE17. It is easy to deduce from the definition of the independence of random elements (Definition 6, §5) that the complex random variables ( 1 = ~ 1 + i17 1 and ( 2 = ~ 2 + i17 2 are independent if and only if the pairs (~ 1 , 17 1 ) and (~ 2 , 17 2 ) are independent; or, equivalently, the a--algebras !l' ~ .. ~~ and !l' ~M 2 are independent. Besides the space L 2 of real random variables with finite second moment, we shall consider the Hilbert space of complex random variables C= ~ + i17 with EICI 2 < oo, where 1(1 2 = ~ 2 + 17 2 and the scalar product (( 1, ( 2 ) is defined by EC 1 C2 , where C2 is the complex conjugate of(. The term "random variable" will now be used for both real and complex random variables, with a comment (when necessary) on which is intended. Let us introduce some notation. We consider a vector a E R" to be a column vector,

$a = (a_1, \ldots, a_n)^T$, and $a^T$ to be a row vector, $a^T = (a_1, \ldots, a_n)$. If $a$ and $b \in R^n$, their scalar product $(a, b)$ is $\sum_{i=1}^n a_i b_i$. Clearly $(a, b) = a^T b$. If $a \in R^n$ and $\mathbb{R} = \|r_{ij}\|$ is an $n$ by $n$ matrix,
$$(\mathbb{R}a, a) = a^T \mathbb{R} a = \sum_{i,j=1}^n r_{ij} a_i a_j. \qquad (1)$$

2. Definition 1. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function in $(R^n, \mathscr{B}(R^n))$. Its characteristic function is
$$\varphi(t) = \int_{R^n} e^{i(t,x)}\, dF(x), \qquad t \in R^n. \qquad (2)$$

Definition 2. If $\xi = (\xi_1, \ldots, \xi_n)$ is a random vector defined on the probability space $(\Omega, \mathscr{F}, P)$ with values in $R^n$, its characteristic function is
$$\varphi_\xi(t) = \int_{R^n} e^{i(t,x)}\, dF_\xi(x), \qquad t \in R^n,$$
where $F_\xi = F_\xi(x_1, \ldots, x_n)$ is the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$.

If $F(x)$ has a density $f = f(x)$ then
$$\varphi(t) = \int_{R^n} e^{i(t,x)} f(x)\, dx. \qquad (3)$$


II. Mathematical Foundations of Probability Theory

In other words, in this case the characteristic function is just the Fourier transform of $f(x)$.
It follows from (3) and Theorem 6.7 (on change of variable in a Lebesgue integral) that the characteristic function $\varphi_\xi(t)$ of a random vector can also be defined by
$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t \in R^n. \qquad (4)$$
We now present some basic properties of characteristic functions, stated and proved for $n = 1$. Further important results for the general case will be given as problems.
Let $\xi = \xi(\omega)$ be a random variable, $F_\xi = F_\xi(x)$ its distribution function, and $\varphi_\xi(t) = Ee^{it\xi}$ its characteristic function. We see at once that if $\eta = a\xi + b$ then
$$\varphi_\eta(t) = Ee^{it\eta} = Ee^{it(a\xi+b)} = e^{itb}Ee^{iat\xi}.$$
Therefore
$$\varphi_\eta(t) = e^{itb}\varphi_\xi(at). \qquad (5)$$
Moreover, if $\xi_1, \xi_2, \ldots, \xi_n$ are independent random variables and $S_n = \xi_1 + \cdots + \xi_n$, then
$$\varphi_{S_n}(t) = \prod_{i=1}^n \varphi_{\xi_i}(t). \qquad (6)$$
In fact,
$$\varphi_{S_n}(t) = Ee^{it(\xi_1 + \cdots + \xi_n)} = Ee^{it\xi_1}\cdots Ee^{it\xi_n} = \prod_{j=1}^n \varphi_{\xi_j}(t),$$

where we have used the property that the expectation of a product of independent (bounded) random variables (either real or complex; see Theorem 6 of §6, and Problem 1) is equal to the product of their expectations.
Property (6) is the key to the proofs of limit theorems for sums of independent random variables by the method of characteristic functions (see §3, Chapter III). In this connection we note that the distribution function $F_{S_n}$ is expressed in terms of the distribution functions of the individual terms in a rather complicated way, namely $F_{S_n} = F_{\xi_1} * \cdots * F_{\xi_n}$, where $*$ denotes convolution (see §8, Subsection 4).
Here are some examples of characteristic functions.

Example 1. Let $\xi$ be a Bernoulli random variable with $P(\xi = 1) = p$, $P(\xi = 0) = q$, $p + q = 1$, $1 > p > 0$; then
$$\varphi_\xi(t) = pe^{it} + q.$$


If $\xi_1, \ldots, \xi_n$ are independent identically distributed random variables like $\xi$, then, writing $T_n = (S_n - np)/\sqrt{npq}$, we have
$$\varphi_{T_n}(t) = Ee^{iT_n t} = e^{-itnp/\sqrt{npq}}\left[pe^{it/\sqrt{npq}} + q\right]^n = \left[pe^{it\sqrt{q/(np)}} + qe^{-it\sqrt{p/(nq)}}\right]^n. \qquad (7)$$
Notice that it follows that as $n \to \infty$
$$\varphi_{T_n}(t) \to e^{-t^2/2}, \qquad T_n = \frac{S_n - np}{\sqrt{npq}}. \qquad (8)$$
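The convergence in (8) is easy to observe numerically. The sketch below (ours, not the book's; function name and parameter values are arbitrary choices) evaluates the exact characteristic function (7) of $T_n$ for a large $n$ and compares it with the limit $e^{-t^2/2}$.

```python
import cmath
import math

def phi_Tn(t, n, p):
    # exact characteristic function (7) of T_n = (S_n - np)/sqrt(npq)
    q = 1 - p
    return (p * cmath.exp(1j * t * math.sqrt(q / (n * p)))
            + q * cmath.exp(-1j * t * math.sqrt(p / (n * q)))) ** n

for t in (0.5, 1.0, 2.0):
    limit = math.exp(-t * t / 2)          # right-hand side of (8)
    approx = phi_Tn(t, 100_000, 0.3)
    print(t, abs(approx - limit))         # small for large n
```

The discrepancy decreases like $n^{-1/2}$, in line with the De Moivre-Laplace theorem that (8) generalizes.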

Example 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$, $|m| < \infty$, $\sigma^2 > 0$. Let us show that
$$\varphi_\xi(t) = e^{itm - t^2\sigma^2/2}. \qquad (9)$$
Let $\eta = (\xi - m)/\sigma$. Then $\eta \sim \mathscr{N}(0, 1)$ and, since
$$\varphi_\xi(t) = e^{itm}\varphi_\eta(t\sigma)$$
by (5), it is enough to show that
$$\varphi_\eta(t) = e^{-t^2/2}. \qquad (10)$$
We have
$$\varphi_\eta(t) = Ee^{it\eta} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{itx}e^{-x^2/2}\, dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\sum_{n=0}^{\infty}\frac{(itx)^n}{n!}\,e^{-x^2/2}\, dx = \sum_{n=0}^{\infty}\frac{(it)^n}{n!}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n e^{-x^2/2}\, dx = \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}\,(2n-1)!! = \sum_{n=0}^{\infty}\frac{(it)^{2n}}{(2n)!}\cdot\frac{(2n)!}{2^n n!} = \sum_{n=0}^{\infty}\frac{(-t^2/2)^n}{n!} = e^{-t^2/2},$$
where we have used the formula (see Problem 7 in §8)
$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} x^{2n} e^{-x^2/2}\, dx = (2n-1)!! = \frac{(2n)!}{2^n n!}$$
(the integrals of the odd powers vanish).
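As a sanity check on (9), one can approximate $Ee^{it\xi}$ for $\xi \sim \mathscr{N}(m, \sigma^2)$ by integrating $e^{itx}$ against the density numerically. The helper below is a sketch of ours (a midpoint Riemann sum over $m \pm 10\sigma$, which is ample for the Gaussian tail); the names and step counts are arbitrary.

```python
import cmath
import math

def phi_normal_numeric(t, m, sigma, half_width=10.0, steps=200_000):
    # approximate E e^{it xi} = \int e^{itx} f(x) dx for xi ~ N(m, sigma^2)
    a, b = m - half_width * sigma, m + half_width * sigma
    h = (b - a) / steps
    total = 0j
    for k in range(steps):
        x = a + (k + 0.5) * h
        f = math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        total += cmath.exp(1j * t * x) * f * h
    return total

t, m, sigma = 0.7, 1.0, 2.0
exact = cmath.exp(1j * t * m - t * t * sigma * sigma / 2)   # formula (9)
print(abs(phi_normal_numeric(t, m, sigma) - exact))          # essentially zero
```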

Example 3. Let $\xi$ be a Poisson random variable,
$$P(\xi = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k = 0, 1, \ldots.$$
Then
$$\varphi_\xi(t) = Ee^{it\xi} = \sum_{k=0}^{\infty} e^{itk}\,\frac{e^{-\lambda}\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{(\lambda e^{it})^k}{k!} = \exp\{\lambda(e^{it} - 1)\}. \qquad (11)$$


3. As we observed in §9, Subsection 1, with every distribution function in $(R, \mathscr{B}(R))$ we can associate a random variable of which it is the distribution function. Hence in discussing the properties of characteristic functions (in the sense either of Definition 1 or Definition 2), we may consider only characteristic functions $\varphi(t) = \varphi_\xi(t)$ of random variables $\xi = \xi(\omega)$.

Theorem 1. Let $\xi$ be a random variable with distribution function $F = F(x)$ and $\varphi(t) = Ee^{it\xi}$ its characteristic function. Then $\varphi$ has the following properties:
(1) $|\varphi(t)| \le \varphi(0) = 1$;
(2) $\varphi(t)$ is uniformly continuous for $t \in R$;
(3) $\varphi(t) = \overline{\varphi(-t)}$;
(4) $\varphi(t)$ is real-valued if and only if $F$ is symmetric;
(5) if $E|\xi|^n < \infty$ for some $n \ge 1$, then $\varphi^{(r)}(t)$ exists for every $r \le n$, and
$$\varphi^{(r)}(t) = \int_R (ix)^r e^{itx}\, dF(x), \qquad (12)$$
$$E\xi^r = \frac{\varphi^{(r)}(0)}{i^r}, \qquad (13)$$
$$\varphi(t) = \sum_{r=0}^{n}\frac{(it)^r}{r!}\,E\xi^r + \frac{(it)^n}{n!}\,\varepsilon_n(t), \qquad (14)$$
where $|\varepsilon_n(t)| \le 3E|\xi|^n$ and $\varepsilon_n(t) \to 0$, $t \to 0$;
(6) if $\varphi^{(2n)}(0)$ exists and is finite then $E\xi^{2n} < \infty$;
(7) if $E|\xi|^n < \infty$ for all $n \ge 1$ and
$$\overline{\lim_n}\,\frac{(E|\xi|^n)^{1/n}}{n} = \frac{1}{R} < \infty,$$
then
$$\varphi(t) = \sum_{n=0}^{\infty}\frac{(it)^n}{n!}\,E\xi^n \qquad (15)$$

for all $|t| < R$.
PROOF. Properties (1) and (3) are evident. Property (2) follows from the inequality
$$|\varphi(t + h) - \varphi(t)| = |Ee^{it\xi}(e^{ih\xi} - 1)| \le E|e^{ih\xi} - 1|$$
and the dominated convergence theorem, according to which $E|e^{ih\xi} - 1| \to 0$, $h \to 0$.
Property (4). Let $F$ be symmetric. Then if $g(x)$ is a bounded odd Borel function, we have $\int_R g(x)\, dF(x) = 0$ (observe that for simple odd functions


this follows directly from the definition of the symmetry of $F$). Consequently $\int_R \sin tx\, dF(x) = 0$ and therefore
$$\varphi(t) = E\cos t\xi.$$
Conversely, let $\varphi_\xi(t)$ be a real function. Then by (3)
$$\varphi_{-\xi}(t) = \varphi_\xi(-t) = \overline{\varphi_\xi(t)} = \varphi_\xi(t), \qquad t \in R.$$
Hence (as will be shown below in Theorem 2) the distribution functions $F_{-\xi}$ and $F_\xi$ of the random variables $-\xi$ and $\xi$ are the same, and therefore (by Theorem 3.1)
$$P(\xi \in B) = P(-\xi \in B) = P(\xi \in -B)$$

for every $B \in \mathscr{B}(R)$, i.e. $F$ is symmetric.
Property (5). If $E|\xi|^n < \infty$, we have $E|\xi|^r < \infty$ for $r \le n$, by Lyapunov's inequality (6.28). Consider the difference quotient
$$\frac{\varphi(t + h) - \varphi(t)}{h} = Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right).$$
Since
$$\left|\frac{e^{ihx} - 1}{h}\right| \le |x|$$
and $E|\xi| < \infty$, it follows from the dominated convergence theorem that the limit
$$\lim_{h\to 0} Ee^{it\xi}\left(\frac{e^{ih\xi} - 1}{h}\right)$$
exists and equals
$$iE(\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\, dF(x). \qquad (16)$$
Hence $\varphi'(t)$ exists and
$$\varphi'(t) = i(E\xi e^{it\xi}) = i\int_{-\infty}^{\infty} xe^{itx}\, dF(x).$$

The existence of the derivatives $\varphi^{(r)}(t)$, $1 < r \le n$, and the validity of (12), follow by induction. Formula (13) follows immediately from (12). Let us now establish (14). Since
$$e^{iy} = \cos y + i\sin y = \sum_{k=0}^{n-1}\frac{(iy)^k}{k!} + \frac{(iy)^n}{n!}\left[\cos\theta_1 y + i\sin\theta_2 y\right]$$

278

II. Mathematical Foundations of Probability Theory

for real $y$, with $|\theta_1| \le 1$ and $|\theta_2| \le 1$, we have
$$e^{it\xi} = \sum_{k=0}^{n-1}\frac{(it\xi)^k}{k!} + \frac{(it\xi)^n}{n!}\left[\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi\right] \qquad (17)$$
and
$$\varphi(t) = \sum_{k=0}^{n}\frac{(it)^k}{k!}\,E\xi^k + \frac{(it)^n}{n!}\,\varepsilon_n(t), \qquad (18)$$
where
$$\varepsilon_n(t) = E\left[\xi^n(\cos\theta_1(\omega)t\xi + i\sin\theta_2(\omega)t\xi - 1)\right].$$

It is clear that $|\varepsilon_n(t)| \le 3E|\xi|^n$. The theorem on dominated convergence shows that $\varepsilon_n(t) \to 0$, $t \to 0$.
Property (6). We give a proof by induction. Suppose first that $\varphi''(0)$ exists and is finite. Let us show that in that case $E\xi^2 < \infty$. By L'Hôpital's rule and Fatou's lemma,
$$\varphi''(0) = \lim_{h\to 0}\frac{1}{2}\left[\frac{\varphi'(2h) - \varphi'(0)}{2h} + \frac{\varphi'(0) - \varphi'(-2h)}{2h}\right] = \lim_{h\to 0}\frac{\varphi'(2h) - \varphi'(-2h)}{4h} = \lim_{h\to 0}\frac{1}{4h^2}\left[\varphi(2h) - 2\varphi(0) + \varphi(-2h)\right] = \lim_{h\to 0}\int_{-\infty}^{\infty}\left(\frac{e^{ihx} - e^{-ihx}}{2h}\right)^2 dF(x) = -\lim_{h\to 0}\int_{-\infty}^{\infty}\left(\frac{\sin hx}{h}\right)^2 dF(x) \le -\int_{-\infty}^{\infty}\lim_{h\to 0}\left(\frac{\sin hx}{hx}\right)^2 x^2\, dF(x) = -\int_{-\infty}^{\infty} x^2\, dF(x).$$
Therefore
$$\int_{-\infty}^{\infty} x^2\, dF(x) \le -\varphi''(0) < \infty.$$

Now let $\varphi^{(2k+2)}(0)$ exist and be finite, and let $\int_{-\infty}^{\infty} x^{2k}\, dF(x) < \infty$. If $\int_{-\infty}^{\infty} x^{2k}\, dF(x) = 0$, then $\int_{-\infty}^{\infty} x^{2k+2}\, dF(x) = 0$ also. Hence we may suppose that $\int_{-\infty}^{\infty} x^{2k}\, dF(x) > 0$. Then, by Property (5),
$$\varphi^{(2k)}(t) = \int_{-\infty}^{\infty} (ix)^{2k} e^{itx}\, dF(x)$$
and therefore
$$(-1)^k\varphi^{(2k)}(t) = \int_{-\infty}^{\infty} e^{itx}\, dG(x),$$
where $G(x) = \int_{-\infty}^{x} u^{2k}\, dF(u)$.


Consequently the function $(-1)^k\varphi^{(2k)}(t)\,G^{-1}(\infty)$ is the characteristic function of the probability distribution $G(x)\cdot G^{-1}(\infty)$, and by what we have proved,
$$G^{-1}(\infty)\int_{-\infty}^{\infty} x^2\, dG(x) < \infty.$$
But $G^{-1}(\infty) > 0$, and therefore
$$\int_{-\infty}^{\infty} x^{2k+2}\, dF(x) = \int_{-\infty}^{\infty} x^2\, dG(x) < \infty.$$

Property (7). Let $0 < t_0 < R$. Then, by Stirling's formula, we find that the series $\sum_n [E|\xi|^n\, t_0^n/n!]$ converges by Cauchy's test, and consequently the series $\sum_{r=0}^{\infty}[(it)^r/r!]\,E\xi^r$ converges for $|t| \le t_0$. But by (14), for $n \ge 1$,
$$\varphi(t) = \sum_{r=0}^{n}\frac{(it)^r}{r!}\,E\xi^r + R_n(t), \qquad R_n(t) = \frac{(it)^n}{n!}\,\varepsilon_n(t),$$
and $|R_n(t)| \le 3|t|^n E|\xi|^n/n! \to 0$, $n \to \infty$, for $|t| \le t_0$. Therefore
$$\varphi(t) = \sum_{r=0}^{\infty}\frac{(it)^r}{r!}\,E\xi^r$$
for all $|t| < R$. This completes the proof of the theorem.
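Property (5) is easy to test numerically: by (13), $E\xi^r = \varphi^{(r)}(0)/i^r$. The sketch below (ours, not the book's) approximates the first two derivatives of the Poisson characteristic function (11) by central differences and recovers $E\xi = \lambda$ and $E\xi^2 = \lambda + \lambda^2$.

```python
import cmath

lam = 2.0

def phi(t):
    # Poisson characteristic function (11): exp{lambda(e^{it} - 1)}
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

h = 1e-4
d1 = (phi(h) - phi(-h)) / (2 * h)             # ~ phi'(0)  = i * E(xi)
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h**2   # ~ phi''(0) = i^2 * E(xi^2)
print(d1 / 1j)        # ~ E(xi)   = lambda
print(-d2)            # ~ E(xi^2) = lambda + lambda^2
```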

Remark 1. By a method similar to that used for (14), we can establish that if $E|\xi|^n < \infty$ for some $n \ge 1$, then
$$\varphi(t) = \sum_{k=0}^{n}\frac{i^k(t-s)^k}{k!}\int_{-\infty}^{\infty} x^k e^{isx}\, dF(x) + \frac{i^n(t-s)^n}{n!}\,\varepsilon_n(t-s), \qquad (19)$$
where $|\varepsilon_n(t-s)| \le 3E|\xi|^n$ and $\varepsilon_n(t-s) \to 0$ as $t - s \to 0$.

Remark 2. With reference to the condition that appears in Property (7), see also Subsection 9, below, on the "uniqueness of the solution of the moment problem."
4. The following theorem shows that a distribution function is uniquely determined by its characteristic function.


Figure 33. The function $f^\varepsilon = f^\varepsilon(x)$: a continuous, piecewise-linear approximation to the indicator of $(a, b]$, equal to 1 between $a$ and $b$, equal to 0 outside the $\varepsilon$-neighborhood $[a - \varepsilon, b + \varepsilon]$, and linear in between.

Theorem 2 (Uniqueness). Let $F$ and $G$ be distribution functions with the same characteristic function, i.e.
$$\int_{-\infty}^{\infty} e^{itx}\, dF(x) = \int_{-\infty}^{\infty} e^{itx}\, dG(x) \qquad (20)$$
for all $t \in R$. Then $F(x) = G(x)$.

PROOF. Choose $a$ and $b \in R$, and $\varepsilon > 0$, and consider the function $f^\varepsilon = f^\varepsilon(x)$ shown in Figure 33. We show that
$$\int_{-\infty}^{\infty} f^\varepsilon(x)\, dF(x) = \int_{-\infty}^{\infty} f^\varepsilon(x)\, dG(x). \qquad (21)$$
Let $n \ge 0$ be large enough so that $[a - \varepsilon, b + \varepsilon] \subseteq [-n, n]$, and let the sequence $\{\delta_n\}$ be such that $1 \ge \delta_n \downarrow 0$, $n \to \infty$. Like every continuous function on $[-n, n]$ that has equal values at the endpoints, $f^\varepsilon = f^\varepsilon(x)$ can be uniformly approximated by trigonometric polynomials (Weierstrass's theorem), i.e. there is a finite sum
$$f_n^\varepsilon(x) = \sum_k a_k e^{i\pi kx/n} \qquad (22)$$
such that
$$\sup_{-n \le x \le n}|f^\varepsilon(x) - f_n^\varepsilon(x)| \le \delta_n.$$
Let us extend the periodic function $f_n^\varepsilon(x)$ to all of $R$, and observe that
$$\sup_x |f_n^\varepsilon(x)| \le 2.$$
Then, since by (20)
$$\int_{-\infty}^{\infty} f_n^\varepsilon(x)\, dF(x) = \int_{-\infty}^{\infty} f_n^\varepsilon(x)\, dG(x), \qquad (23)$$


we have
$$\left|\int_{-\infty}^{\infty} f^\varepsilon\, dF - \int_{-\infty}^{\infty} f^\varepsilon\, dG\right| = \left|\int_{[-n,n]} f^\varepsilon\, dF - \int_{[-n,n]} f^\varepsilon\, dG\right| \le \left|\int_{[-n,n]} f_n^\varepsilon\, dF - \int_{[-n,n]} f_n^\varepsilon\, dG\right| + 2\delta_n \le \left|\int_{-\infty}^{\infty} f_n^\varepsilon\, dF - \int_{-\infty}^{\infty} f_n^\varepsilon\, dG\right| + 2\delta_n + 2F(R\setminus[-n,n]) + 2G(R\setminus[-n,n]), \qquad (24)$$
where $F(A) = \int_A dF(x)$, $G(A) = \int_A dG(x)$. As $n \to \infty$, the right-hand side of (24) tends to zero, and this establishes (21).
As $\varepsilon \to 0$, we have $f^\varepsilon(x) \to I_{(a,b]}(x)$. It follows from (21) by the dominated convergence theorem that
$$\int_{-\infty}^{\infty} I_{(a,b]}(x)\, dF(x) = \int_{-\infty}^{\infty} I_{(a,b]}(x)\, dG(x),$$
i.e. $F(b) - F(a) = G(b) - G(a)$. Since $a$ and $b$ are arbitrary, it follows that $F(x) = G(x)$ for all $x \in R$. This completes the proof of the theorem.

5. The preceding theorem says that a distribution function $F = F(x)$ is uniquely determined by its characteristic function $\varphi = \varphi(t)$. The following theorem gives an explicit representation of $F$ in terms of $\varphi$.

Theorem 3 (Inversion Formula). Let $F = F(x)$ be a distribution function and
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$$
its characteristic function.
(a) For any pair of points $a$ and $b$ ($a < b$) at which $F = F(x)$ is continuous,
$$F(b) - F(a) = \lim_{c\to\infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt. \qquad (25)$$
(b) If $\int_{-\infty}^{\infty}|\varphi(t)|\, dt < \infty$, the distribution function $F(x)$ has a density $f(x)$,
$$F(x) = \int_{-\infty}^{x} f(y)\, dy \qquad (26)$$
and
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt. \qquad (27)$$

PROOF.

We first observe that if $F(x)$ has density $f(x)$ then
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx, \qquad (28)$$
and (27) is just the Fourier transform of the (integrable) function $\varphi(t)$. Integrating both sides of (27) and applying Fubini's theorem, we obtain
$$F(b) - F(a) = \int_a^b f(x)\, dx = \frac{1}{2\pi}\int_a^b\left[\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt\right] dx = \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\left[\int_a^b e^{-itx}\, dx\right] dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\varphi(t)\,\frac{e^{-ita} - e^{-itb}}{it}\, dt.$$

After these remarks, which to some extent clarify (25), we turn to the proof.
(a) We have
$$\Phi_c \equiv \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt = \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\left[\int_{-\infty}^{\infty} e^{itx}\, dF(x)\right] dt = \int_{-\infty}^{\infty}\Psi_c(x)\, dF(x), \qquad (29)$$
where we have put
$$\Psi_c(x) = \frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\, dt$$
and applied Fubini's theorem, which is applicable in this case because
$$\left|\frac{e^{-ita} - e^{-itb}}{it}\,e^{itx}\right| = \left|\frac{e^{-ita} - e^{-itb}}{it}\right| = \left|\int_a^b e^{-itx}\, dx\right| \le b - a$$
and
$$\int_{-c}^{c}\int_{-\infty}^{\infty}(b - a)\, dF(x)\, dt \le 2c(b - a) < \infty.$$


In addition,
$$\Psi_c(x) = \frac{1}{2\pi}\int_{-c}^{c}\frac{\sin t(x-a) - \sin t(x-b)}{t}\, dt = \frac{1}{2\pi}\int_{-c(x-a)}^{c(x-a)}\frac{\sin v}{v}\, dv - \frac{1}{2\pi}\int_{-c(x-b)}^{c(x-b)}\frac{\sin v}{v}\, dv. \qquad (30)$$
The function
$$g(s, t) = \int_s^t \frac{\sin v}{v}\, dv$$
is uniformly continuous in $s$ and $t$, and
$$g(s, t) \to \pi \qquad (31)$$
as $s \downarrow -\infty$ and $t \uparrow \infty$. Hence there is a constant $C$ such that $|\Psi_c(x)| < C < \infty$ for all $c$ and $x$. Moreover, it follows from (30) and (31) that
$$\Psi_c(x) \to \Psi(x), \qquad c \to \infty,$$
where
$$\Psi(x) = \begin{cases} 0, & x < a,\ x > b, \\ \tfrac{1}{2}, & x = a,\ x = b, \\ 1, & a < x < b. \end{cases}$$
Let $\mu$ be a measure on $(R, \mathscr{B}(R))$ such that $\mu(a, b] = F(b) - F(a)$. Then if we apply the dominated convergence theorem and use the formulas of Problem 1 of §3, we find that, as $c \to \infty$,
$$\Phi_c = \int_{-\infty}^{\infty}\Psi_c(x)\, dF(x) \to \int_{-\infty}^{\infty}\Psi(x)\, dF(x) = \mu(a, b) + \tfrac{1}{2}\mu\{a\} + \tfrac{1}{2}\mu\{b\} = F(b-) - F(a) + \tfrac{1}{2}\left[F(a) - F(a-) + F(b) - F(b-)\right] = \frac{F(b) + F(b-)}{2} - \frac{F(a) + F(a-)}{2} = F(b) - F(a),$$
where the last equation holds for all points $a$ and $b$ of continuity of $F(x)$. Hence (25) is established.
(b) Let $\int_{-\infty}^{\infty}|\varphi(t)|\, dt < \infty$ and put
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt.$$


It follows from the dominated convergence theorem that this is a continuous function of $x$ and therefore is integrable on $[a, b]$. Consequently we find, applying Fubini's theorem again, that
$$\int_a^b f(x)\, dx = \frac{1}{2\pi}\int_a^b\left[\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt\right] dx = \lim_{c\to\infty}\frac{1}{2\pi}\int_{-c}^{c}\frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\, dt = F(b) - F(a)$$
for all points $a$ and $b$ of continuity of $F(x)$. Hence it follows that
$$F(x) = \int_{-\infty}^{x} f(y)\, dy, \qquad x \in R,$$
and since $f(x)$ is continuous and $F(x)$ is nondecreasing, $f(x)$ is the density of $F(x)$. This completes the proof of the theorem.
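The inversion formula (27) lends itself to direct numerical experiment. The sketch below (ours; the truncation bound and step count are arbitrary choices) recovers the Cauchy density $1/(\pi(1+x^2))$ from its characteristic function $\varphi(t) = e^{-|t|}$, which also appears below as Pólya's example $\varphi_1$.

```python
import math

def density_from_cf(x, cf, T=40.0, steps=300_000):
    # inversion formula (27): f(x) = (1/2pi) \int e^{-itx} phi(t) dt,
    # truncated to [-T, T]; for a real even phi this reduces to a cosine integral
    h = T / steps
    s = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        s += cf(t) * math.cos(t * x) * h
    return s / math.pi

# phi(t) = e^{-|t|} is the characteristic function of the Cauchy density 1/(pi(1+x^2))
for x in (0.0, 1.0, 2.5):
    print(x, density_from_cf(x, lambda t: math.exp(-abs(t))), 1 / (math.pi * (1 + x * x)))
```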

Corollary. The inversion formula (25) provides a second proof of Theorem 2.

Theorem 4. A necessary and sufficient condition for the components of the random vector $\xi = (\xi_1, \ldots, \xi_n)$ to be independent is that its characteristic function is the product of the characteristic functions of the components:
$$Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \prod_{k=1}^{n} Ee^{it_k\xi_k}, \qquad (t_1, \ldots, t_n) \in R^n.$$
PROOF. The necessity follows from Problem 1. To prove the sufficiency we let $F(x_1, \ldots, x_n)$ be the distribution function of the vector $\xi = (\xi_1, \ldots, \xi_n)$ and $F_k(x)$ the distribution functions of the $\xi_k$, $1 \le k \le n$. Put $G = G(x_1, \ldots, x_n) = F_1(x_1)\cdots F_n(x_n)$. Then, by Fubini's theorem, for all $(t_1, \ldots, t_n) \in R^n$,
$$\int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\, dG(x_1, \ldots, x_n) = \prod_{k=1}^{n} Ee^{it_k\xi_k} = Ee^{i(t_1\xi_1 + \cdots + t_n\xi_n)} = \int_{R^n} e^{i(t_1x_1 + \cdots + t_nx_n)}\, dF(x_1, \ldots, x_n).$$
Therefore by Theorem 2 (or rather, by its multidimensional analog; see Problem 3) we have $F = G$, and consequently, by the theorem of §5, the random variables $\xi_1, \ldots, \xi_n$ are independent.
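Theorem 4 can be illustrated on a small discrete example: for a vector with independent components, the joint characteristic function computed from the joint distribution factorizes exactly. The two marginal distributions below are arbitrary choices of ours.

```python
import cmath

# joint pmf of (xi1, xi2) with independent components:
# xi1 uniform on {0, 1}, xi2 uniform on {0, 1, 2}
p1 = {0: 0.5, 1: 0.5}
p2 = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
joint = {(x, y): p1[x] * p2[y] for x in p1 for y in p2}

def phi_joint(t1, t2):
    return sum(p * cmath.exp(1j * (t1 * x + t2 * y)) for (x, y), p in joint.items())

def phi1(t):
    return sum(p * cmath.exp(1j * t * x) for x, p in p1.items())

def phi2(t):
    return sum(p * cmath.exp(1j * t * y) for y, p in p2.items())

t1, t2 = 0.7, -1.3
print(abs(phi_joint(t1, t2) - phi1(t1) * phi2(t2)))   # ~ 0: the product property holds
```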


6. Theorem 1 gives us necessary conditions for a function to be a characteristic function. Hence if $\varphi = \varphi(t)$ fails to satisfy, for example, one of the first three conclusions of the theorem, that function cannot be a characteristic function. We quote without proof some results in the same direction.

Bochner-Khinchin Theorem. Let $\varphi(t)$ be continuous, $t \in R$, with $\varphi(0) = 1$. A necessary and sufficient condition that $\varphi(t)$ is a characteristic function is that it is positive semi-definite, i.e. that for all real $t_1, \ldots, t_n$ and all complex $\lambda_1, \ldots, \lambda_n$, $n = 1, 2, \ldots$,
$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar{\lambda}_j \ge 0. \qquad (32)$$
The necessity of (32) is evident since if $\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$ then
$$\sum_{i,j=1}^{n}\varphi(t_i - t_j)\lambda_i\bar{\lambda}_j = \int_{-\infty}^{\infty}\left|\sum_{k=1}^{n}\lambda_k e^{it_kx}\right|^2 dF(x) \ge 0.$$
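The necessity argument can be replayed numerically: for a genuine characteristic function the Hermitian form in (32) is nonnegative for any choice of points and complex weights. A minimal sketch (the function $\varphi$, points, and weights below are our arbitrary choices):

```python
import math

def phi(t):
    # the N(0, 1) characteristic function e^{-t^2/2}, formula (10)
    return math.exp(-t * t / 2)

ts = [-1.5, -0.3, 0.0, 0.4, 2.0]
lams = [1 - 2j, 0.5 + 0j, -1.0 + 0j, 2j, -0.7 + 0.1j]

# quadratic form of (32); it must be real and nonnegative
q = sum(phi(ts[i] - ts[j]) * lams[i] * lams[j].conjugate()
        for i in range(len(ts)) for j in range(len(ts)))
print(q.real, abs(q.imag))
```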

The proof of the sufficiency of (32) is more difficult.

Pólya's Theorem. Let a continuous even function $\varphi(t)$ satisfy $\varphi(t) \ge 0$, $\varphi(0) = 1$, $\varphi(t) \to 0$ as $t \to \infty$, and let $\varphi(t)$ be convex on $0 \le t < \infty$. Then $\varphi(t)$ is a characteristic function.

This theorem provides a very convenient method of constructing characteristic functions. Examples are
$$\varphi_1(t) = e^{-|t|}, \qquad \varphi_2(t) = \begin{cases} 1 - |t|, & |t| \le 1, \\ 0, & |t| > 1. \end{cases}$$
Another is the function $\varphi_3(t)$ drawn in Figure 34. On $[-a, a]$, the function $\varphi_3(t)$ coincides with $\varphi_2(t)$. However, the corresponding distribution functions $F_2$ and $F_3$ are evidently different. This example shows that in general two characteristic functions can be the same on a finite interval without their distribution functions' being the same.

Figure 34. The functions $\varphi_2(t)$ (vanishing for $|t| \ge 1$) and $\varphi_3(t)$, which coincides with $\varphi_2(t)$ on $[-a, a]$.


Marcinkiewicz's Theorem. If a characteristic function $\varphi(t)$ is of the form $\exp\mathscr{P}(t)$, where $\mathscr{P}(t)$ is a polynomial, then this polynomial is of degree at most 2.

It follows, for example, that $e^{-t^4}$ is not a characteristic function.

7. The following theorem shows that a property of the characteristic function of a random variable can lead to a nontrivial conclusion about the nature of the random variable.

Theorem 5. Let $\varphi_\xi(t)$ be the characteristic function of the random variable $\xi$.
(a) If $|\varphi_\xi(t_0)| = 1$ for some $t_0 \ne 0$, then $\xi$ is concentrated at the points $a + nh$, $h = 2\pi/t_0$, for some $a$, that is,
$$\sum_{n=-\infty}^{\infty} P\{\xi = a + nh\} = 1, \qquad (33)$$
where $a$ is a constant.
(b) If $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ for two different points $t$ and $\alpha t$, where $\alpha$ is irrational, then $\xi$ is degenerate:
$$P\{\xi = a\} = 1,$$
where $a$ is some number.
(c) If $|\varphi_\xi(t)| \equiv 1$, then $\xi$ is degenerate.

PROOF.

(a) If $|\varphi_\xi(t_0)| = 1$, $t_0 \ne 0$, there is a number $a$ such that $\varphi(t_0) = e^{it_0a}$. Then
$$1 = \int_{-\infty}^{\infty}\cos t_0(x - a)\, dF(x) \Rightarrow \int_{-\infty}^{\infty}\left[1 - \cos t_0(x - a)\right] dF(x) = 0.$$
Since $1 - \cos t_0(x - a) \ge 0$, it follows from Property H (Subsection 2 of §6) that
$$1 = \cos t_0(\xi - a) \qquad \text{(P-a.s.)},$$
which is equivalent to (33).
(b) It follows from $|\varphi_\xi(t)| = |\varphi_\xi(\alpha t)| = 1$ and from (33) that, for some $a$ and $b$,
$$\sum_{n=-\infty}^{\infty} P\left\{\xi = a + \frac{2\pi}{t}\,n\right\} = \sum_{m=-\infty}^{\infty} P\left\{\xi = b + \frac{2\pi}{\alpha t}\,m\right\} = 1.$$
If $\xi$ is not degenerate, there must be at least two pairs of common points,
$$a + \frac{2\pi}{t}\,n_1 = b + \frac{2\pi}{\alpha t}\,m_1, \qquad a + \frac{2\pi}{t}\,n_2 = b + \frac{2\pi}{\alpha t}\,m_2,$$


in the sets
$$\left\{a + \frac{2\pi}{t}\,n,\ n = 0, \pm 1, \ldots\right\} \quad \text{and} \quad \left\{b + \frac{2\pi}{\alpha t}\,m,\ m = 0, \pm 1, \ldots\right\},$$
whence
$$\frac{2\pi}{t}(n_1 - n_2) = \frac{2\pi}{\alpha t}(m_1 - m_2),$$
and this contradicts the assumption that $\alpha$ is irrational. Conclusion (c) follows from (b). This completes the proof of the theorem.
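Conclusion (a) is easy to observe for an integer-valued (lattice, $a = 0$, $h = 1$) random variable: $|\varphi_\xi(2\pi/h)| = |\varphi_\xi(2\pi)| = 1$, while at a generic point the modulus is strictly less than 1. The pmf below is an arbitrary example of ours.

```python
import cmath
import math

pmf = {0: 0.2, 1: 0.5, 3: 0.3}           # an integer-valued (lattice) distribution

def phi(t):
    return sum(p * cmath.exp(1j * t * k) for k, p in pmf.items())

print(abs(phi(2 * math.pi)))              # = 1, as in (33) with a = 0, h = 1
print(abs(phi(1.0)))                      # strictly less than 1
```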

8. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector,
$$\varphi_\xi(t) = Ee^{i(t,\xi)}, \qquad t = (t_1, \ldots, t_k),$$
its characteristic function. Let us suppose that $E|\xi_i|^n < \infty$ for some $n \ge 1$, $i = 1, \ldots, k$. From the inequalities of Hölder (6.29) and Lyapunov (6.27) it follows that the (mixed) moments $E(\xi_1^{\nu_1}\cdots\xi_k^{\nu_k})$ exist for all nonnegative $\nu_1, \ldots, \nu_k$ such that $\nu_1 + \cdots + \nu_k \le n$.
As in Theorem 1, this implies the existence and continuity of the partial derivatives
$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1}\cdots\partial t_k^{\nu_k}}\,\varphi_\xi(t_1, \ldots, t_k)$$
for $\nu_1 + \cdots + \nu_k \le n$. Then if we expand $\varphi_\xi(t_1, \ldots, t_k)$ in a Taylor series, we see that
$$\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1!\cdots\nu_k!}\,m_\xi^{(\nu_1, \ldots, \nu_k)}\,t_1^{\nu_1}\cdots t_k^{\nu_k} + o(|t|^n), \qquad (34)$$
where
$$m_\xi^{(\nu_1, \ldots, \nu_k)} = E\xi_1^{\nu_1}\cdots\xi_k^{\nu_k}$$
is the mixed moment of order $\nu = (\nu_1, \ldots, \nu_k)$.
Now $\varphi_\xi(t_1, \ldots, t_k)$ is continuous, $\varphi_\xi(0, \ldots, 0) = 1$, and consequently this function is different from zero in some neighborhood $|t| < \delta$ of zero. In this neighborhood the partial derivative
$$\frac{\partial^{\nu_1 + \cdots + \nu_k}}{\partial t_1^{\nu_1}\cdots\partial t_k^{\nu_k}}\,\ln\varphi_\xi(t_1, \ldots, t_k)$$
exists and is continuous, where $\ln z$ denotes the principal value of the logarithm (if $z = re^{i\theta}$, we take $\ln z$ to be $\ln r + i\theta$). Hence we can expand $\ln\varphi_\xi(t_1, \ldots, t_k)$ by Taylor's formula,
$$\ln\varphi_\xi(t_1, \ldots, t_k) = \sum_{\nu_1 + \cdots + \nu_k \le n}\frac{i^{\nu_1 + \cdots + \nu_k}}{\nu_1!\cdots\nu_k!}\,s_\xi^{(\nu_1, \ldots, \nu_k)}\,t_1^{\nu_1}\cdots t_k^{\nu_k} + o(|t|^n), \qquad (35)$$


where the coefficients $s_\xi^{(\nu_1, \ldots, \nu_k)}$ are the (mixed) semi-invariants, or cumulants, of order $\nu = (\nu_1, \ldots, \nu_k)$ of $\xi = (\xi_1, \ldots, \xi_k)$.
Observe that if $\xi$ and $\eta$ are independent, then
$$\ln\varphi_{\xi+\eta}(t) = \ln\varphi_\xi(t) + \ln\varphi_\eta(t), \qquad (36)$$
and therefore
$$s_{\xi+\eta}^{(\nu_1, \ldots, \nu_k)} = s_\xi^{(\nu_1, \ldots, \nu_k)} + s_\eta^{(\nu_1, \ldots, \nu_k)}. \qquad (37)$$
(It is this property that gives rise to the term "semi-invariant" for $s_\xi^{(\nu_1, \ldots, \nu_k)}$.)
To simplify the formulas and make (34) and (35) look "one-dimensional," we introduce the following notation. If $\nu = (\nu_1, \ldots, \nu_k)$ is a vector whose components are nonnegative integers, we put
$$\nu! = \nu_1!\cdots\nu_k!, \qquad |\nu| = \nu_1 + \cdots + \nu_k, \qquad t^\nu = t_1^{\nu_1}\cdots t_k^{\nu_k}.$$
We also put $s_\xi^{(\nu)} = s_\xi^{(\nu_1, \ldots, \nu_k)}$, $m_\xi^{(\nu)} = m_\xi^{(\nu_1, \ldots, \nu_k)}$. Then (34) and (35) can be written
$$\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}\,m_\xi^{(\nu)} t^\nu + o(|t|^n), \qquad (38)$$
$$\ln\varphi_\xi(t) = \sum_{|\nu| \le n}\frac{i^{|\nu|}}{\nu!}\,s_\xi^{(\nu)} t^\nu + o(|t|^n). \qquad (39)$$

The following theorem and its corollaries give formulas that connect moments and semi-invariants.

Theorem 6. Let $\xi = (\xi_1, \ldots, \xi_k)$ be a random vector with $E|\xi_i|^n < \infty$, $i = 1, \ldots, k$, $n \ge 1$. Then for $\nu = (\nu_1, \ldots, \nu_k)$ such that $|\nu| \le n$,
$$m_\xi^{(\nu)} = \sum_{q=1}^{|\nu|}\frac{1}{q!}\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{\nu!}{\lambda^{(1)}!\cdots\lambda^{(q)}!}\prod_{p=1}^{q} s_\xi^{(\lambda^{(p)})}, \qquad (40)$$
$$s_\xi^{(\nu)} = \sum_{q=1}^{|\nu|}\frac{(-1)^{q-1}}{q}\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}\frac{\nu!}{\lambda^{(1)}!\cdots\lambda^{(q)}!}\prod_{p=1}^{q} m_\xi^{(\lambda^{(p)})}, \qquad (41)$$
where $\sum_{\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu}$ indicates summation over all ordered sets of nonnegative integral vectors $\lambda^{(p)}$, $|\lambda^{(p)}| > 0$, whose sum is $\nu$.

PROOF.
Since

cpr,(t) = exp(ln cpr,(t)), if we expand the function exp by Taylor's formula and use (39), we obtain

cpr,(t) = 1 +

1( iiAI )q L I L s~A>tA + o(ltl"). 11 q=l q., lSIAisn n

11..

(42)



Comparing terms in e· on the right-hand sides of (38) and (42), and using 1..1<1)1 + · · · + IA.(qJI = 1..1< 1> + · · · + A_
L ~~~ m~"lt" + o( It 1")]. cp~(t) = ln[1 + 1:<>1.<1:-;n ·

(43)

For small z we have the expansion ln(1

+ z)

=

L n

q= 1

(

1)q-1 q

zq

+ o(zry.

Using this in (43) and then comparing the coefficients oft;., with the corresponding coefficients on the right-hand side of (38), we obtain (41).

Corollary 1. The following formulas connect moments and semi-invariants:

L

1

I

n[ X

(44) (}.Ul)J'j v. ) m~ - {rt)..(l)+ .. ·+rx).(X)=v} r1! • • • rX! (A_(l)!)'' • • • (A_(X)!yxj=l s~ (v) -

s~'l-

L {rt),(IJ + ... + rx;.,(x) = v}

Il [

<.
where 'L1,,;.(1J+ .. ·+rx. 0, and over all ordered sets of positive integral numbers ri such that r 1A.< 1l + · · · + rxA_(x) = v. To establish (44) we suppose that among all the vectors A_(ll, ... , A_(q) that occur in (40), there are r 1 equal to A_(itl, ... , rx equal to A_(ixl (ri > 0, r1 + · · · + rx = q), where all the A_(i.J are different. There are q !f(r 1! ... rx !) different sets of vectors, corresponding (except for order) with the set {A_(ll, ... il(ql}). But if two sets, say, {il< 1>, •.• , il
Corollary 2. Let us consider the special case when v = (1, ... , 1). In this case the moments m~v) E~ 1 · · · ~k' and the corresponding semi-invariants, are called simple.

=

Formulas connecting simple moments and simple semi-invariants can be read off from the formulas given above. However, it is useful to have them written in a different way. For this purpose, we introduce the following notation. Let~= (~I> ... , ~k) be a vector, and I~= {1, 2, ... , k} its set of indices. If I c:;: I~, let ~~denote the vector consisting of the components of~ whose



indices belong to $I$. Let $\chi(I)$ be the vector $(\chi_1, \ldots, \chi_k)$ for which $\chi_i = 1$ if $i \in I$, and $\chi_i = 0$ if $i \notin I$. These vectors are in one-to-one correspondence with the sets $I \subseteq I_\xi$. Hence we can write
$$m_\xi(I) = m_\xi^{(\chi(I))}, \qquad s_\xi(I) = s_\xi^{(\chi(I))}.$$
In other words, $m_\xi(I)$ and $s_\xi(I)$ are simple moments and semi-invariants of the subvector $\xi_I$ of $\xi$.
In accordance with the definition given on p. 12, a decomposition of a set $I$ is an unordered collection of disjoint nonempty sets $I_p$ such that $\sum_p I_p = I$. In terms of these definitions, we have the formulas
$$m_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}\ \prod_{p=1}^{q} s_\xi(I_p), \qquad (46)$$
$$s_\xi(I) = \sum_{\sum_{p=1}^{q} I_p = I}(-1)^{q-1}(q-1)!\prod_{p=1}^{q} m_\xi(I_p), \qquad (47)$$
where $\sum_{\sum_{p=1}^{q} I_p = I}$ denotes summation over all decompositions of $I$, $1 \le q \le N(I)$.
We shall derive (46) from (44). If $\nu = \chi(I)$ and $\lambda^{(1)} + \cdots + \lambda^{(q)} = \nu$, then $\lambda^{(p)} = \chi(I_p)$, $I_p \subseteq I$, where the $\lambda^{(p)}$ are all different, $\lambda^{(p)}! = \nu! = 1$, and every unordered set $\{\chi(I_1), \ldots, \chi(I_q)\}$ is in one-to-one correspondence with the decomposition $I = \sum_{p=1}^{q} I_p$. Consequently (46) follows from (44). In a similar way, (47) follows from (45).

e be. a

random variable (k = 1) and mn = m~n> = Ee",

sn =st. Then (40) and (41) imply the following formulas: ml = sl,

+ si, s 3 + 3s 1s 2 + s~, s4 + 3s~ + 4sls3 + 6sis2 + st,

m2 = s2 m3 = m4 =

(48)

and s1

= m 1 = Ee,

s2 = m2- mi =

ve,

s 3 = m3 - 3m 1 m2 + 2mt s4 = m4 - 3m~- 4m 1 m3 + 12mim2 - 6mt,

(49)



Example 2. Let $\xi \sim \mathscr{N}(m, \sigma^2)$. Since, by (9),
$$\ln\varphi_\xi(t) = itm - \frac{t^2\sigma^2}{2},$$
we have $s_1 = m$, $s_2 = \sigma^2$ by (39), and all the semi-invariants, from the third on, are zero: $s_n = 0$, $n \ge 3$.
We may observe that by Marcinkiewicz's theorem a function $\exp\mathscr{P}(t)$, where $\mathscr{P}$ is a polynomial, can be a characteristic function only when the degree of that polynomial is at most 2. It follows, in particular, that the Gaussian distribution is the only distribution with the property that all its semi-invariants $s_n$ are zero from a certain index onward.

Example 3. If $\xi$ is a Poisson random variable with parameter $\lambda > 0$, then by (11)
$$\ln\varphi_\xi(t) = \lambda(e^{it} - 1).$$
It follows that
$$s_n = \lambda \qquad (50)$$
for all $n \ge 1$.

4.

Let~

=

(~ 1 , ••. , ~n)

m~(l)

=

be a random vector. Then

s~(l),

+ s~(l)s~(2), mp, 2, 3) = s~(1, 2, 3) + s~(1, 2)s~(3) + + s~(l, 3)sp) + + s~(2, 3)sll) + s~(l)se(2)sl3) m~(l,

2) =

s~(l,

2)

(51)

These formulas show that the simple moments can be expressed in terms of the simple semi-invariants in a very symmetric way. If we put ~ 1 = ~ 2 = ~k, we then, of course, obtain (48). The group-theoretical origin of the coefficients in (48) becomes clear from (51). It also follows from (51) that

··· =

se(l, 2) = me(l, 2)- me(l)mp) =

E~ 1 ~ 2 - E~ 1 E~ 2 ,

(52)

i.e., $s_\xi(1, 2)$ is just the covariance of $\xi_1$ and $\xi_2$.

9. Let $\xi$ be a random variable with distribution function $F = F(x)$ and characteristic function $\varphi(t)$. Let us suppose that all the moments $m_n = E\xi^n$, $n \ge 1$, exist. It follows from Theorem 2 that a characteristic function uniquely determines a probability distribution. Let us now ask the following question



(uniqueness for the moment problem): Do the moments $\{m_n\}_{n\ge 1}$ determine the probability distribution? More precisely, let $F$ and $G$ be distribution functions with the same moments, i.e.
$$\int_{-\infty}^{\infty} x^n\, dF(x) = \int_{-\infty}^{\infty} x^n\, dG(x) \qquad (53)$$
for all integers $n \ge 0$. The question is whether $F$ and $G$ must be the same.
In general, the answer is "no." To see this, consider the distribution $F$ with density
$$f(x) = \begin{cases} ke^{-\alpha x^\lambda}, & x > 0, \\ 0, & x \le 0, \end{cases}$$
where $\alpha > 0$, $0 < \lambda < \tfrac{1}{2}$, and $k$ is determined by the condition $\int_0^\infty f(x)\, dx = 1$. Write $\beta = \alpha\tan\lambda\pi$ and let $g(x) = 0$ for $x \le 0$ and
$$g(x) = ke^{-\alpha x^\lambda}\left[1 + \varepsilon\sin(\beta x^\lambda)\right], \qquad |\varepsilon| < 1, \quad x > 0.$$
It is evident that $g(x) \ge 0$. Let us show that
$$\int_0^\infty x^n e^{-\alpha x^\lambda}\sin(\beta x^\lambda)\, dx = 0 \qquad (54)$$
for all integers $n \ge 0$.
For $p > 0$ and complex $q$ with $\operatorname{Re} q > 0$, we have
$$\int_0^\infty t^{p-1} e^{-qt}\, dt = \frac{\Gamma(p)}{q^p}.$$
Take $p = (n+1)/\lambda$, $q = \alpha + i\beta$, $t = x^\lambda$. Then
$$\int_0^\infty x^n e^{-\alpha x^\lambda}\left[\cos(\beta x^\lambda) - i\sin(\beta x^\lambda)\right] dx = \frac{\Gamma\left(\frac{n+1}{\lambda}\right)}{\lambda\,(\alpha + i\beta)^{(n+1)/\lambda}}. \qquad (55)$$
But
$$(\alpha + i\beta)^{(n+1)/\lambda} = \alpha^{(n+1)/\lambda}(1 + i\tan\lambda\pi)^{(n+1)/\lambda} = \alpha^{(n+1)/\lambda}(\cos\lambda\pi)^{-(n+1)/\lambda}\,e^{i\pi(n+1)},$$
which is real.


Hence the right-hand side of (55) is real, and therefore (54) is valid for all integers $n \ge 0$. Now let $G(x)$ be the distribution function with density $g(x)$. It follows from (54) that the distribution functions $F$ and $G$ have equal moments, i.e. (53) holds for all integers $n \ge 0$. We now give some conditions that guarantee the uniqueness of the solution of the moment problem.
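The equality (54) can be confirmed numerically for a concrete choice of parameters. Below (a sketch of ours) we take $\alpha = 1$, $\lambda = \tfrac{1}{4}$, so $\beta = \tan(\pi/4) = 1$; after the substitution $u = x^\lambda$ (i.e. $x = u^4$, $dx = 4u^3\,du$) the integral becomes $4\int_0^\infty u^{4n+3}e^{-u}\sin u\, du$, which the code approximates by a midpoint sum.

```python
import math

def I(n, U=80.0, steps=100_000):
    # \int_0^inf x^n e^{-x^{1/4}} sin(x^{1/4}) dx  (alpha = 1, lambda = 1/4, beta = 1)
    # after u = x^{1/4}:  4 \int_0^inf u^{4n+3} e^{-u} sin(u) du
    h = U / steps
    s = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        s += u ** (4 * n + 3) * math.exp(-u) * math.sin(u) * h
    return 4 * s

for n in range(4):
    print(n, I(n))   # all ~ 0: f and g share every moment
```

The values are negligible relative to the natural scale $\Gamma(4n+4)$ of the integrand, confirming (54).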

Theorem 7. Let $F = F(x)$ be a distribution function and $\mu_n = \int_{-\infty}^{\infty}|x|^n\, dF(x)$. If
$$\overline{\lim_n}\,\frac{\mu_n^{1/n}}{n} < \infty, \qquad (56)$$
the moments $\{m_n\}_{n\ge 1}$, where $m_n = \int_{-\infty}^{\infty} x^n\, dF(x)$, determine the distribution $F = F(x)$ uniquely.

PROOF. It follows from (56) and conclusion (7) of Theorem 1 that there is a $t_0 > 0$ such that, for all $|t| \le t_0$, the characteristic function
$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x)$$
can be represented in the form
$$\varphi(t) = \sum_{k=0}^{\infty}\frac{(it)^k}{k!}\,m_k,$$
and consequently the moments $\{m_n\}_{n\ge 1}$ uniquely determine the characteristic function $\varphi(t)$ for $|t| \le t_0$.
Take a point $s$ with $|s| \le t_0/2$. Then, as in the proof of (15), we deduce from (56) that, for $|t - s| \le t_0$,
$$\varphi(t) = \sum_{k=0}^{\infty}\frac{i^k(t-s)^k}{k!}\,\varphi^{(k)}(s),$$
where
$$\varphi^{(k)}(s) = i^k\int_{-\infty}^{\infty} x^k e^{isx}\, dF(x)$$
is uniquely determined by the moments $\{m_n\}_{n\ge 1}$. Consequently the moments determine $\varphi(t)$ uniquely for $|t| \le \tfrac{3}{2}t_0$. Continuing this process, we see that $\{m_n\}_{n\ge 1}$ determines $\varphi(t)$ uniquely for all $t$, and therefore also determines $F(x)$. This completes the proof of the theorem.

Corollary 1. The moments completely determine the probability distribution if it is concentrated on a finite interval.



Corollary 2. A sufficient condition for the moment problem to have a unique solution is that
$$\overline{\lim_{n\to\infty}}\,\frac{(m_{2n})^{1/2n}}{2n} < \infty. \qquad (57)$$

For the proof it is enough to observe that the odd moments can be estimated in terms of the even ones, and then use (56).

Example. Let $F(x)$ be the normal distribution function,
$$F(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{x} e^{-t^2/2\sigma^2}\, dt.$$

Then m 2 n+ 1 = 0, m 2 n = [(2n) !/2"n !]u 2 ", and it follows from (57) that these are the moments only of the normal distribution. Finally we state, without proof: Carleman's test for the uniqueness of the moment problem. (a) Let

{mn}n~ 1 be

the moments of a probability distribution, and let 00

1

L (mzn)1f2n =

n=O

00 ·

Then they determine the probability distribution uniquely. (b) If {mn}n~l are the moments of a distribution that is concentrated on [0, oo ), then the solution will be unique if we require only that 00

1

L (mn)lf2n

n=O

=

00 ·

10. Let $F = F(x)$ and $G = G(x)$ be distribution functions with characteristic functions $f = f(t)$ and $g = g(t)$, respectively. The following theorem, which we give without proof, makes it possible to estimate how close $F$ and $G$ are to each other (in the uniform metric) in terms of the closeness of $f$ and $g$.

Theorem (Esseen's Inequality). Let $G(x)$ have derivative $G'(x)$ with $\sup_x|G'(x)| \le C$. Then for every $T > 0$
$$\sup_x|F(x) - G(x)| \le \frac{2}{\pi}\int_0^T\frac{|f(t) - g(t)|}{t}\, dt + \frac{24}{\pi}\cdot\frac{\sup_x|G'(x)|}{T}. \qquad (58)$$

(This will be used in §6 of Chapter III to prove a theorem on the rapidity of convergence in the central limit theorem.)



11.

PROBLEMS

e

1. Let $\xi$ and $\eta$ be independent random variables, $f(x) = f_1(x) + if_2(x)$, $g(x) = g_1(x) + ig_2(x)$, where $f_k(x)$ and $g_k(x)$ are Borel functions, $k = 1, 2$. Show that if $E|f(\xi)| < \infty$ and $E|g(\eta)| < \infty$, then
$$E|f(\xi)g(\eta)| < \infty \quad \text{and} \quad Ef(\xi)g(\eta) = Ef(\xi)\cdot Eg(\eta).$$

2. Let $\xi = (\xi_1, \ldots, \xi_n)$ and $E\|\xi\|^n < \infty$, where $\|\xi\| = +\sqrt{(\xi, \xi)}$. Show that
$$\varphi_\xi(t) = \sum_{k=0}^{n}\frac{i^k}{k!}\,E(t, \xi)^k + \varepsilon_n(t)\|t\|^n,$$
where $t = (t_1, \ldots, t_n)$ and $\varepsilon_n(t) \to 0$, $t \to 0$.

3. Prove Theorem 2 for $n$-dimensional distribution functions $F = F_n(x_1, \ldots, x_n)$ and $G_n(x_1, \ldots, x_n)$.

4. Let $F = F(x_1, \ldots, x_n)$ be an $n$-dimensional distribution function and $\varphi = \varphi(t_1, \ldots, t_n)$ its characteristic function. Using the notation of (3.12), establish the inversion formula
$$P(a, b] = \lim_{c\to\infty}\frac{1}{(2\pi)^n}\int_{-c}^{c}\cdots\int_{-c}^{c}\ \prod_{k=1}^{n}\frac{e^{-it_ka_k} - e^{-it_kb_k}}{it_k}\,\varphi(t_1, \ldots, t_n)\, dt_1\cdots dt_n.$$
(We are to suppose that $(a, b]$ is an interval of continuity of $P(a, b]$, i.e. for $k = 1, \ldots, n$ the points $a_k$, $b_k$ are points of continuity of the marginal distribution functions $F_k(x_k)$ which are obtained from $F(x_1, \ldots, x_n)$ by taking all the variables except $x_k$ equal to $+\infty$.)

5. Let $\varphi_k(t)$, $k \ge 1$, be characteristic functions, and let the nonnegative numbers $\lambda_k$, $k \ge 1$, satisfy $\sum\lambda_k = 1$. Show that $\sum\lambda_k\varphi_k(t)$ is a characteristic function.

6. If $\varphi(t)$ is a characteristic function, are $\operatorname{Re}\varphi(t)$ and $\operatorname{Im}\varphi(t)$ characteristic functions?

7. Let $\varphi_1$, $\varphi_2$ and $\varphi_3$ be characteristic functions, and $\varphi_1\varphi_2 = \varphi_1\varphi_3$. Does it follow that $\varphi_2 = \varphi_3$?

e

9. Let be an integral-valued random variable and Show that P(e = k) = 1 -

2n

f" .

e-u"cp~(t) dt,

-x

cp~

(t) its characteristic function.

k = 0,± 1,± 2 ....

§13. Gaussian Systems

1. Gaussian, or normal, distributions, random variables, processes, and systems play an extremely important role in probability theory and in mathematical statistics. This is explained in the first instance by the central



limit theorem (§4 of Chapter III and §8 of Chapter VII), of which the De Moivre-Laplace limit theorem is a special case (§6, Chapter I). According to this theorem, the normal distribution is universal in the sense that the distribution of the sum of a large number of random variables or random vectors, subject to some not very restrictive conditions, is closely approximated by this distribution. This is what provides a theoretical explanation of the "law of errors" of applied statistics, which says that errors of measurement that result from large numbers of independent "elementary" errors obey the normal distribution.
A multidimensional Gaussian distribution is specified by a small number of parameters; this is a definite advantage in using it in the construction of simple probabilistic models. Gaussian random variables have finite second moments, and consequently they can be studied by Hilbert space methods. Here it is important that in the Gaussian case "uncorrelated" is equivalent to "independent," so that the results of $L^2$-theory can be significantly strengthened.

2. Let us recall that (see §8) a random variable $\xi = \xi(\omega)$ is Gaussian, or normally distributed, with parameters $m$ and $\sigma^2$ ($\xi \sim \mathscr{N}(m, \sigma^2)$), $|m| < \infty$, $\sigma^2 > 0$, if its density $f_\xi(x)$ has the form

1"( ) _ _1_ -(x-m)2f2a2 x - ;;;= e ,

J~

....; 2na

(1)

+P.

where a = As a! 0, the density f~(x) "converges to the a-function supported at x = m." It is natural to say that ~ is normally distributed with mean m and a 2 = 0 (~ "' .K(m, 0)) if~ has the property that P(~ = m) = 1. We can, however, give a definition that applies both to the nondegenerate (a 2 > 0) and the degenerate (a 2 = 0) cases. Let us consider the characteristic function cp~(t) Eei 1 ~, t E R. If P(~ = m) = 1, then evidently

=

eitm,

(2)

eitm-(1/2)t2 a 2 •

(3)

cp~(t) =

whereas if~ "' .K(m, a 2 ), a 2 > 0, cp~(t)

=

It is obvious that when a 2 = 0 the right-hand sides of (2) and (3) are the same. It follows, by Theorem 1 of §12, that the Gaussian random variable with parameters m and a 2 (I m I < oo, a 2 ~ 0) must be the same as the random variable whose characteristic function is given by (3). This is an illustration of the "attraction of characteristic functions," a very useful technique in the multidimensional case.

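Formula (3) can be verified numerically for any particular pair (m, σ) by integrating e^{itx} against the density (1). The sketch below (Python/NumPy, values chosen arbitrarily for illustration) does this on a wide grid:

```python
import numpy as np

# Check (3): integrate e^{itx} f(x) dx for the density (1) and compare with
# exp(itm - t^2 sigma^2 / 2).  The grid covers +-12 sigma, so truncation of
# the Gaussian tails is negligible.
m, sigma = 1.5, 0.7
x = np.linspace(m - 12 * sigma, m + 12 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

for t in (0.0, 0.5, 1.0, 2.0):
    numeric = (np.exp(1j * t * x) * f).sum() * dx           # approx E e^{it xi}
    exact = np.exp(1j * t * m - 0.5 * t ** 2 * sigma ** 2)  # formula (3)
    assert abs(numeric - exact) < 1e-6
```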
Let ξ = (ξ₁, …, ξ_n) be a random vector and

    φ_ξ(t) = E e^{i(t, ξ)},   t = (t₁, …, t_n) ∈ Rⁿ,                         (4)

its characteristic function (see Definition 2, §12).

Definition 1. A random vector ξ = (ξ₁, …, ξ_n) is Gaussian, or normally distributed, if its characteristic function has the form

    φ_ξ(t) = e^{i(t, m) − (1/2)(ℝt, t)},                                     (5)

where m = (m₁, …, m_n), |m_k| < ∞, and ℝ = ‖r_kl‖ is a symmetric nonnegative definite n × n matrix; we use the abbreviation ξ ~ 𝒩(m, ℝ).

This definition immediately makes us ask whether (5) is in fact a characteristic function. Let us show that it is.

First suppose that ℝ is nonsingular. Then we can define the inverse A = ℝ⁻¹ and the function

    f(x) = (|A|^{1/2}/(2π)^{n/2}) exp{−(1/2)(A(x − m), (x − m))},            (6)

where x = (x₁, …, x_n) and |A| = det A. This function is nonnegative. Let us show that

    ∫_{Rⁿ} e^{i(t, x)} f(x) dx = e^{i(t, m) − (1/2)(ℝt, t)},

or equivalently that

    I_n ≡ ∫_{Rⁿ} e^{i(t, x−m)} (|A|^{1/2}/(2π)^{n/2}) e^{−(1/2)(A(x−m), (x−m))} dx = e^{−(1/2)(ℝt, t)}.   (7)

Let us make the change of variable

    x − m = 𝒪u,   t = 𝒪v,

where 𝒪 is an orthogonal matrix such that

    𝒪ᵀℝ𝒪 = D,

and D = diag(d₁, …, d_n) is a diagonal matrix with d_i ≥ 0 (see the proof of the lemma in §8). Since |ℝ| = det ℝ ≠ 0, we have d_i > 0, i = 1, …, n. Therefore

    |A| = |ℝ|⁻¹ = (d₁ ⋯ d_n)⁻¹.                                              (8)


Moreover (for notation, see Subsection 1, §12),

    i(t, x − m) − (1/2)(A(x − m), x − m)
        = i(𝒪v, 𝒪u) − (1/2)(A𝒪u, 𝒪u)
        = i(𝒪v)ᵀ(𝒪u) − (1/2)(𝒪u)ᵀA(𝒪u)
        = ivᵀu − (1/2)uᵀ𝒪ᵀA𝒪u = ivᵀu − (1/2)uᵀD⁻¹u.

Together with (8) and (12.9), this yields

    I_n = (2π)^{−n/2}(d₁ ⋯ d_n)^{−1/2} ∫_{Rⁿ} exp(ivᵀu − (1/2)uᵀD⁻¹u) du
        = ∏_{k=1}^{n} (2πd_k)^{−1/2} ∫_{−∞}^{∞} exp(iv_k u_k − u_k²/(2d_k)) du_k
        = ∏_{k=1}^{n} exp(−(1/2)v_k² d_k)
        = exp(−(1/2)vᵀDv) = exp(−(1/2)vᵀ𝒪ᵀℝ𝒪v) = exp(−(1/2)tᵀℝt) = exp(−(1/2)(ℝt, t)).

It also follows from (6) that

    ∫_{Rⁿ} f(x) dx = 1.                                                      (9)

Therefore (5) is the characteristic function of a nondegenerate n-dimensional Gaussian distribution (see Subsection 3, §3).

Now let ℝ be singular. Take ε > 0 and consider the positive definite symmetric matrix ℝᵉ = ℝ + εE. Then by what has been proved,

    φᵉ(t) = e^{i(t, m) − (1/2)(ℝᵉt, t)}

is a characteristic function:

    φᵉ(t) = ∫_{Rⁿ} e^{i(t, x)} dFᵉ(x),

where Fᵉ(x) = Fᵉ(x₁, …, x_n) is an n-dimensional distribution function. As ε → 0,

we find from (12.35) and the formulas that connect the moments and the semi-invariants that

    m₁ = s_ξ^{(1,0,…,0)} = Eξ₁, …, m_k = s_ξ^{(0,…,0,1)} = Eξ_k.

Similarly

    r₁₁ = s_ξ^{(2,0,…,0)} = Vξ₁,

and generally

    r_kl = cov(ξ_k, ξ_l).
Consequently m is the mean-value vector of ξ and ℝ is its covariance matrix.

If ℝ is nonsingular, we can obtain this result in a different way. In fact, in this case ξ has a density f(x) given by (6). A direct calculation shows that

    Eξ_k = ∫ x_k f(x) dx = m_k,                                              (11)
    cov(ξ_k, ξ_l) = ∫ (x_k − m_k)(x_l − m_l) f(x) dx = r_kl.

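Formulas (9) and (11) can be checked by direct numerical integration of the density (6). A small Python/NumPy sketch for n = 2, with an arbitrary positive definite matrix chosen purely for illustration:

```python
import numpy as np

# Numerical check of (9) and (11) in n = 2 dimensions for one concrete
# example of m and R (names follow the text; the values are arbitrary).
m = np.array([1.0, -2.0])
R = np.array([[2.0, 0.6],
              [0.6, 1.0]])
A = np.linalg.inv(R)

# Density (6) evaluated on a grid covering many standard deviations.
g = np.linspace(-10, 10, 801)
dx = g[1] - g[0]
X, Y = np.meshgrid(m[0] + g, m[1] + g, indexing="ij")
Z = np.stack([X - m[0], Y - m[1]], axis=-1)
quad = np.einsum("...i,ij,...j->...", Z, A, Z)
f = np.sqrt(np.linalg.det(A)) / (2 * np.pi) * np.exp(-0.5 * quad)

assert abs(f.sum() * dx ** 2 - 1) < 1e-6                   # (9)
assert abs((X * f).sum() * dx ** 2 - m[0]) < 1e-6          # E xi_1 = m_1
cov12 = ((X - m[0]) * (Y - m[1]) * f).sum() * dx ** 2
assert abs(cov12 - R[0, 1]) < 1e-6                         # cov(xi_1, xi_2) = r_12
```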
4. Let us discuss some properties of Gaussian vectors.

Theorem 1.
(a) The components of a Gaussian vector are uncorrelated if and only if they are independent.
(b) A vector ξ = (ξ₁, …, ξ_n) is Gaussian if and only if, for every vector λ = (λ₁, …, λ_n), λ_k ∈ R, the random variable (ξ, λ) = λ₁ξ₁ + ⋯ + λ_nξ_n has a Gaussian distribution.

PROOF. (a) If the components of ξ = (ξ₁, …, ξ_n) are uncorrelated, it follows from the form of the characteristic function φ_ξ(t) that it is a product of characteristic functions. Therefore, by Theorem 4 of §12, the components are independent. The converse is evident, since independence always implies lack of correlation.

(b) If ξ is a Gaussian vector, it follows from (5) that

    E exp{it(ξ₁λ₁ + ⋯ + ξ_nλ_n)} = exp{it(Σ_k λ_k m_k) − (t²/2)(Σ_{k,l} r_kl λ_k λ_l)},   t ∈ R,

and consequently the random variable (ξ, λ) is Gaussian.

Conversely, to say that the random variable (ξ, λ) = ξ₁λ₁ + ⋯ + ξ_nλ_n is Gaussian for every λ means, in particular, that

    E e^{i(ξ, λ)} = exp{i Σ_k λ_k m_k − (1/2) Σ_{k,l} r_kl λ_k λ_l},

where m_k = Eξ_k and r_kl = cov(ξ_k, ξ_l). Since λ₁, …, λ_n are arbitrary, it follows from Definition 1 that the vector ξ = (ξ₁, …, ξ_n) is Gaussian. This completes the proof of the theorem.

Remark. Let (θ, ξ) be a Gaussian vector with θ = (θ₁, …, θ_k) and ξ = (ξ₁, …, ξ_l). If θ and ξ are uncorrelated, i.e. cov(θ_i, ξ_j) = 0, i = 1, …, k; j = 1, …, l, they are independent.

The proof is the same as for conclusion (a) of the theorem.

Let ξ = (ξ₁, …, ξ_n) be a Gaussian vector; let us suppose, for simplicity, that its mean-value vector is zero. If rank ℝ = r < n, then (as was shown in §11) there are n − r linear relations connecting ξ₁, …, ξ_n. We may then suppose that, say, ξ₁, …, ξ_r are linearly independent, and the others can be expressed linearly in terms of them. Hence all the basic properties of the vector ξ = (ξ₁, …, ξ_n) are determined by the first r components (ξ₁, …, ξ_r), for which the corresponding covariance matrix is already known to be nonsingular. Thus we may suppose that the original vector ξ = (ξ₁, …, ξ_n) had linearly independent components and therefore that |ℝ| > 0.

Let 𝒪 be an orthogonal matrix that diagonalizes ℝ,

    𝒪ᵀℝ𝒪 = D.

The diagonal elements of D are positive and therefore determine the inverse matrix. Put B² = D and

    β = B⁻¹𝒪ᵀξ.

Then it is easily verified that

    E e^{i(t, β)} = e^{−(1/2)(t, t)},

i.e. the vector β = (β₁, …, β_n) is a Gaussian vector with components that are uncorrelated and therefore (Theorem 1) independent. Then if we write A = 𝒪B we find that the original Gaussian vector ξ = (ξ₁, …, ξ_n) can be represented as

    ξ = Aβ,                                                                  (12)

where β = (β₁, …, β_n) is a Gaussian vector with independent components, β_k ~ 𝒩(0, 1). Hence we have the following result. Let ξ = (ξ₁, …, ξ_n) be a vector with linearly independent components such that Eξ_k = 0, k = 1, …, n. This vector is Gaussian if and only if there are independent Gaussian variables β₁, …, β_n, β_k ~ 𝒩(0, 1), and a nonsingular matrix A of order n such that ξ = Aβ. Here ℝ = AAᵀ is the covariance matrix of ξ.

If |ℝ| ≠ 0, then by the Gram–Schmidt method (see §11)

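The representation ξ = Aβ with A = 𝒪B is also the standard recipe for sampling a Gaussian vector with prescribed covariance matrix. A hedged Python/NumPy sketch (the matrix below is an arbitrary example, and the Monte Carlo tolerance is generous):

```python
import numpy as np

# Build A = O B from the eigendecomposition O^T R O = D, B^2 = D,
# and sample xi = A beta with beta ~ N(0, E).
R = np.array([[2.0, 0.6],
              [0.6, 1.0]])
d, O = np.linalg.eigh(R)           # O^T R O = diag(d), columns orthonormal
B = np.diag(np.sqrt(d))
A = O @ B

assert np.allclose(A @ A.T, R)     # R = A A^T, as stated in the text

rng = np.random.default_rng(0)
beta = rng.standard_normal((200_000, 2))
xi = beta @ A.T                    # each row is one realization of xi = A beta

emp = np.cov(xi, rowvar=False)     # empirical covariance matrix
assert np.max(np.abs(emp - R)) < 0.05
```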
    ξ_k = ξ̂_k + b_k ε_k,   k = 1, …, n,                                     (13)

where ε = (ε₁, …, ε_k) ~ 𝒩(0, E) is a Gaussian vector,

    ξ̂_k = Σ_{l=1}^{k−1} (ξ_k, ε_l) ε_l,                                     (14)

    b_k = ‖ξ_k − ξ̂_k‖,                                                      (15)

and

    L̄{ξ₁, …, ξ_k} = L̄{ε₁, …, ε_k}.                                         (16)

We see immediately from the orthogonal decomposition (13) that

    ξ̂_k = E(ξ_k | ξ_{k−1}, …, ξ₁).                                           (17)

From this, with (16) and (14), it follows that in the Gaussian case the conditional expectation E(ξ_k | ξ_{k−1}, …, ξ₁) is a linear function of (ξ₁, …, ξ_{k−1}):

    E(ξ_k | ξ_{k−1}, …, ξ₁) = Σ_{i=1}^{k−1} a_i ξ_i.                         (18)

(This was proved in §8 for the case k = 2.) Since, according to a remark made on Theorem 1 of §8, E(ξ_k | ξ_{k−1}, …, ξ₁) is an optimal estimator (in the mean-square sense) for ξ_k in terms of ξ₁, …, ξ_{k−1}, it follows from (18) that in the Gaussian case the optimal estimator is linear.

We shall use these results in looking for optimal estimators of θ = (θ₁, …, θ_k) in terms of ξ = (ξ₁, …, ξ_l) under the hypothesis that (θ, ξ) is Gaussian. Let

    m_θ = Eθ,   m_ξ = Eξ

be the column-vectors of mean values and

    V_θθ = cov(θ, θ) = ‖cov(θ_i, θ_j)‖,   1 ≤ i, j ≤ k,
    V_θξ = cov(θ, ξ) = ‖cov(θ_i, ξ_j)‖,   1 ≤ i ≤ k, 1 ≤ j ≤ l,
    V_ξξ = cov(ξ, ξ) = ‖cov(ξ_i, ξ_j)‖,   1 ≤ i, j ≤ l,

the covariance matrices. Let us suppose that V_ξξ has an inverse. Then we have the following theorem.

Theorem 2 (Theorem on Normal Correlation). For a Gaussian vector (θ, ξ), the optimal estimator E(θ | ξ) of θ in terms of ξ, and its error matrix

    Δ = E[θ − E(θ | ξ)][θ − E(θ | ξ)]ᵀ,

are given by the formulas

    E(θ | ξ) = m_θ + V_θξ V_ξξ⁻¹ (ξ − m_ξ),                                  (19)

    Δ = V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ.                                          (20)

PROOF. Form the vector

    η = (θ − m_θ) − V_θξ V_ξξ⁻¹ (ξ − m_ξ).                                   (21)

We can verify at once that Eη(ξ − m_ξ)ᵀ = 0, i.e. η is not correlated with (ξ − m_ξ). But since (θ, ξ) is Gaussian, the vector (η, ξ) is also Gaussian. Hence, by the remark on Theorem 1, η and ξ − m_ξ are independent. Therefore η and ξ are independent, and consequently E(η | ξ) = Eη = 0. Therefore

    E[θ − m_θ | ξ] − V_θξ V_ξξ⁻¹ (ξ − m_ξ) = 0,

which establishes (19).

To establish (20) we consider the conditional covariance

    cov(θ, θ | ξ) = E[(θ − E(θ | ξ))(θ − E(θ | ξ))ᵀ | ξ].                     (22)

Since θ − E(θ | ξ) = η, and η and ξ are independent, we find that

    cov(θ, θ | ξ) = E(ηηᵀ | ξ) = Eηηᵀ
        = V_θθ + V_θξ V_ξξ⁻¹ V_ξξ V_ξξ⁻¹ (V_θξ)ᵀ − 2V_θξ V_ξξ⁻¹ (V_θξ)ᵀ
        = V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ.

Since cov(θ, θ | ξ) does not depend on "chance," we have

    Δ = E cov(θ, θ | ξ) = cov(θ, θ | ξ),

and this establishes (20).

Corollary. Let (θ, ξ₁, …, ξ_n) be an (n + 1)-dimensional Gaussian vector, with ξ₁, …, ξ_n independent. Then

    E(θ | ξ₁, …, ξ_n) = Eθ + Σ_{i=1}^{n} (cov(θ, ξ_i)/Vξ_i)(ξ_i − Eξ_i)

(cf. (8.12) and (8.13)).

5. Let ξ₁, ξ₂, … be a sequence of Gaussian random vectors that converge in probability to ξ. Let us show that ξ is also Gaussian. In accordance with (b) of Theorem 1, it is enough to establish this only for random variables.

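Formulas (19) and (20) lend themselves to a direct Monte Carlo check. The sketch below (Python/NumPy; the joint covariance is an arbitrary positive definite example, and the ordering of coordinates as (θ, ξ₁, ξ₂) is my own convention) verifies that the residual η is uncorrelated with ξ and that the mean-square error matches Δ:

```python
import numpy as np

# Numerical illustration of (19)-(20) for a scalar theta and a 2-dim xi,
# all means zero.  Coordinate order: (theta, xi_1, xi_2).
V = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.0]])
Vtt, Vtx, Vxx = V[0, 0], V[0, 1:], V[1:, 1:]

rng = np.random.default_rng(1)
sample = rng.multivariate_normal(np.zeros(3), V, size=400_000)
theta, xi = sample[:, 0], sample[:, 1:]

w = Vtx @ np.linalg.inv(Vxx)                   # coefficients of (19)
est = xi @ w                                   # E(theta | xi)
delta = Vtt - Vtx @ np.linalg.inv(Vxx) @ Vtx   # (20), a scalar here

eta = theta - est
assert abs(np.mean(eta * xi[:, 0])) < 0.01     # eta uncorrelated with xi
assert abs(np.mean(eta ** 2) - delta) < 0.01   # error matches (20)
```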
Let m_n = Eξ_n, σ_n² = Vξ_n. Then by Lebesgue's dominated convergence theorem,

    lim_{n→∞} E e^{itξ_n} = E e^{itξ},

i.e.

    lim_{n→∞} e^{itm_n − (1/2)t²σ_n²} = E e^{itξ}.

It follows from the existence of the limit on the left-hand side that there are numbers m and σ² such that

    m = lim_{n→∞} m_n,   σ² = lim_{n→∞} σ_n².

Consequently

    E e^{itξ} = e^{itm − (1/2)t²σ²},

i.e. ξ ~ 𝒩(m, σ²).

It follows, in particular, that the closed linear manifold L̄(ξ₁, ξ₂, …) generated by the Gaussian variables ξ₁, ξ₂, … (see §11, Subsection 5) consists of Gaussian variables.

6. We now turn to the concept of Gaussian systems in general.

Definition 2. A collection of random variables ξ = (ξ_α), where α belongs to some index set 𝔄, is a Gaussian system if the random vector (ξ_{α₁}, …, ξ_{α_n}) is Gaussian for every n ≥ 1 and all indices α₁, …, α_n chosen from 𝔄.

Let us notice some properties of Gaussian systems.

(a) If ξ = (ξ_α), α ∈ 𝔄, is a Gaussian system, then every subsystem ξ′ = (ξ_{α′}), α′ ∈ 𝔄′ ⊆ 𝔄, is also Gaussian.
(b) If ξ_α, α ∈ 𝔄, are independent Gaussian variables, then the system ξ = (ξ_α), α ∈ 𝔄, is Gaussian.
(c) If ξ = (ξ_α), α ∈ 𝔄, is a Gaussian system, the closed linear manifold L̄(ξ), consisting of all variables of the form Σ_{i=1}^{n} c_{α_i} ξ_{α_i}, together with their mean-square limits, forms a Gaussian system.

Let us observe that the converse of (a) is false in general. For example, let ξ₁ and η₁ be independent and ξ₁ ~ 𝒩(0, 1), η₁ ~ 𝒩(0, 1). Define the system

    (ξ, η) = { (ξ₁, |η₁|)    if ξ₁ ≥ 0,
               (ξ₁, −|η₁|)   if ξ₁ < 0.                                      (23)

Then it is easily verified that ξ and η are both Gaussian, but (ξ, η) is not.

Let ξ = (ξ_α)_{α∈𝔄} be a Gaussian system with mean-value vector m = (m_α), α ∈ 𝔄, and covariance matrix ℝ = (r_{αβ})_{α,β∈𝔄}, where m_α = Eξ_α. Then ℝ is evidently symmetric (r_{αβ} = r_{βα}) and nonnegative definite in the sense that for every vector c = (c_α)_{α∈𝔄} with values in R^𝔄 and only a finite number of nonzero coordinates c_α,

    (ℝc, c) = Σ_{α,β} r_{αβ} c_α c_β ≥ 0.                                    (24)

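The system (23) is easy to simulate. The Python/NumPy sketch below (illustration only, with an arbitrary seed) checks that η is marginally standard Gaussian while the sum ξ + η, which would be Gaussian if (ξ, η) were jointly Gaussian, puts almost no mass near zero:

```python
import numpy as np
from math import erf, sqrt

# The system (23): eta carries |eta_1| with the sign of xi_1.
rng = np.random.default_rng(2)
n = 500_000
xi1, eta1 = rng.standard_normal((2, n))
xi = xi1
eta = np.where(xi1 >= 0, np.abs(eta1), -np.abs(eta1))

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))   # standard normal d.f.

# eta is marginally N(0, 1): its empirical d.f. matches Phi.
for a in (-1.0, 0.0, 0.5, 1.5):
    assert abs(np.mean(eta <= a) - Phi(a)) < 0.005

# But xi + eta = sign(xi_1)(|xi_1| + |eta_1|) is thin near 0, far thinner
# than a Gaussian of the same variance would be: (xi, eta) is not Gaussian.
p_emp = np.mean(np.abs(xi + eta) < 0.2)
p_gauss = 2 * Phi(0.2 / sqrt(np.var(xi + eta))) - 1
assert p_emp < 0.03 < 0.08 < p_gauss
```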

We now ask the converse question. Suppose that we are given a parameter set 𝔄 = {α}, a vector m = (m_α)_{α∈𝔄} and a symmetric nonnegative definite matrix ℝ = (r_{αβ})_{α,β∈𝔄}. Do there exist a probability space (Ω, 𝓕, P) and a Gaussian system of random variables ξ = (ξ_α)_{α∈𝔄} on it, such that

    Eξ_α = m_α,   cov(ξ_α, ξ_β) = r_{αβ},   α, β ∈ 𝔄?

If we take a finite set α₁, …, α_n, then for the vector m̃ = (m_{α₁}, …, m_{α_n}) and the matrix ℝ̃ = (r_{αβ}), α, β = α₁, …, α_n, we can construct in Rⁿ the Gaussian distribution F_{α₁,…,α_n}(x₁, …, x_n) with characteristic function

    φ(t) = exp{i(t, m̃) − (1/2)(ℝ̃t, t)},   t = (t_{α₁}, …, t_{α_n}).

It is easily verified that the family

    {F_{α₁,…,α_n}(x₁, …, x_n); α_i ∈ 𝔄}

is consistent. Consequently, by Kolmogorov's theorem (Theorem 1, §9, and Remark 2 on it) the answer to our question is positive.

7. If 𝔄 = {1, 2, …}, then in accordance with the terminology of §5 the system of random variables ξ = (ξ_α)_{α∈𝔄} is a random sequence and is denoted by ξ = (ξ₁, ξ₂, …). A Gaussian sequence is completely described by its mean-value vector m = (m₁, m₂, …) and covariance matrix ℝ = ‖r_ij‖, r_ij = cov(ξ_i, ξ_j). In particular, if r_ij = σ_i² δ_ij, then ξ = (ξ₁, ξ₂, …) is a Gaussian sequence of independent random variables with ξ_i ~ 𝒩(m_i, σ_i²), i ≥ 1.

When 𝔄 = [0, 1], [0, ∞), (−∞, ∞), …, the system ξ = (ξ_t), t ∈ 𝔄, is a random process with continuous time. Let us mention some examples of Gaussian random processes. If we take their mean values to be zero, their probabilistic properties are completely described by the covariance matrices ‖r_st‖. We write r(s, t) instead of r_st and call it the covariance function.

Example 1. If T = [0, ∞) and

    r(s, t) = min(s, t),                                                     (25)

the Gaussian process ξ = (ξ_t)_{t≥0} with this covariance function (see Problem 2) and ξ₀ = 0 is a Brownian motion or Wiener process.

Observe that this process has independent increments; that is, for arbitrary t₁ < t₂ < ⋯ < t_n the random variables

    ξ_{t₂} − ξ_{t₁}, …, ξ_{t_n} − ξ_{t_{n−1}}

are independent. In fact, because the process is Gaussian it is enough to verify only that the increments are uncorrelated. But if s < t ≤ u < v, then

    E[ξ_t − ξ_s][ξ_v − ξ_u] = [r(t, v) − r(t, u)] − [r(s, v) − r(s, u)]
                            = (t − t) − (s − s) = 0.

Example 2. The process ξ = (ξ_t), 0 ≤ t ≤ 1, with ξ₀ = 0 and

    r(s, t) = min(s, t) − st                                                 (26)

is a conditional Wiener process (observe that since r(1, 1) = 0 we have P(ξ₁ = 0) = 1).

Example 3. The process ξ = (ξ_t), −∞ < t < ∞, with

    r(s, t) = e^{−|t−s|}                                                     (27)

is a Gauss–Markov process.

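A quick numerical sanity check of the nonnegative definiteness of (25)–(27), which Problem 2 below asks to prove: the matrices ‖r(t_i, t_j)‖ built on a finite grid of times have no negative eigenvalues (up to roundoff). A Python/NumPy sketch — this is evidence, not a proof:

```python
import numpy as np

# Gram matrices of the three covariance functions on a grid of times.
t = np.linspace(0.05, 0.95, 40)
S, T = np.meshgrid(t, t, indexing="ij")

kernels = [
    np.minimum(S, T),            # (25) Wiener process
    np.minimum(S, T) - S * T,    # (26) conditional Wiener process
    np.exp(-np.abs(S - T)),      # (27) Gauss-Markov process
]
for K in kernels:
    assert np.linalg.eigvalsh(K).min() > -1e-10
```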
8. PROBLEMS

1. Let ξ₁, ξ₂, ξ₃ be independent Gaussian random variables, ξ_i ~ 𝒩(0, 1). Show that

    (ξ₁ + ξ₂ξ₃)/√(1 + ξ₃²) ~ 𝒩(0, 1).

(In this case we encounter the interesting problem of describing the nonlinear transformations of independent Gaussian variables ξ₁, …, ξ_n whose distributions are still Gaussian.)

2. Show that (25), (26) and (27) are nonnegative definite (and consequently are actually covariance functions).

3. Let A be an m × n matrix. An n × m matrix A⊕ is a pseudoinverse of A if there are matrices U and V such that

    A A⊕ A = A,   A⊕ = U Aᵀ = Aᵀ V.

Show that A⊕ exists and is unique.

4. Show that (19) and (20) in the theorem on normal correlation remain valid when V_ξξ is singular, provided that V_ξξ⁻¹ is replaced by the pseudoinverse V_ξξ⊕.

5. Let (θ, ξ) = (θ₁, …, θ_k; ξ₁, …, ξ_l) be a Gaussian vector with nonsingular matrix Δ ≡ V_θθ − V_θξ V_ξξ⁻¹ (V_θξ)ᵀ. Show that the distribution function

    P(θ ≤ a | ξ) = P(θ₁ ≤ a₁, …, θ_k ≤ a_k | ξ)

has (P-a.s.) the density p(a₁, …, a_k | ξ) defined by

    (|Δ|^{−1/2}/(2π)^{k/2}) exp{−(1/2)(a − E(θ|ξ))ᵀ Δ⁻¹ (a − E(θ|ξ))}.

6. (S. N. Bernstein). Let ξ and η be independent identically distributed random variables with finite variances. Show that if ξ + η and ξ − η are independent, then ξ and η are Gaussian.

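Problem 1 can be checked by simulation: conditionally on ξ₃ the ratio is a standard Gaussian, hence so is it unconditionally. A Python/NumPy sketch (Monte Carlo with a fixed seed; illustration, not a proof):

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo check that (xi1 + xi2*xi3)/sqrt(1 + xi3^2) ~ N(0, 1).
rng = np.random.default_rng(3)
x1, x2, x3 = rng.standard_normal((3, 400_000))
z = (x1 + x2 * x3) / np.sqrt(1 + x3 ** 2)

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))   # standard normal d.f.
for a in (-1.5, -0.5, 0.0, 1.0, 2.0):
    assert abs(np.mean(z <= a) - Phi(a)) < 0.005
```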
CHAPTER III

Convergence of Probability Measures. Central Limit Theorem

§1. Weak Convergence of Probability Measures and Distributions

1. Many important results of probability theory are formulated as limit theorems. So, indeed, were James Bernoulli's law of large numbers, as well as the De Moivre-Laplace limit theorem, the theorems with which the true theory of probability began. In the present chapter we discuss two central aspects of limit theorems: one is the concept of weak convergence; the other is the method of characteristic functions, one of the most powerful methods for proving and refining limit theorems.

We begin by recalling the statement of the law of large numbers (Chapter I, §5) for the Bernoulli scheme. Let ξ₁, ξ₂, … be a sequence of independent identically distributed random variables with P(ξ_i = 1) = p, P(ξ_i = 0) = q, p + q = 1. In terms of the concept of convergence in probability (Chapter II, §10), Bernoulli's law of large numbers can be stated as follows:

    S_n/n →ᴾ p,   n → ∞,                                                     (1)

where S_n = ξ₁ + ⋯ + ξ_n. (It will be shown in Chapter IV that in fact we have convergence with probability 1.) We put

    F_n(x) = P{S_n/n ≤ x},   F(x) = { 1, x ≥ p; 0, x < p },                  (2)

where F(x) is the distribution function of the degenerate random variable ξ ≡ p. Also let P_n and P be the probability measures on (R, 𝓑(R)) corresponding to the distributions F_n and F.

In accordance with Theorem 2 of §10, Chapter II, convergence in probability, S_n/n →ᴾ p, implies convergence in distribution, S_n/n →ᵈ p, which means that
    Ef(S_n/n) → Ef(p),   n → ∞,                                              (3)

for every function f = f(x) belonging to the class C(R) of bounded continuous functions on R. Since

    Ef(S_n/n) = ∫_R f(x)P_n(dx),   Ef(p) = ∫_R f(x)P(dx),

(3) can be written in the form

    ∫_R f(x)P_n(dx) → ∫_R f(x)P(dx),   f ∈ C(R),                             (4)

or (in accordance with §6 of Chapter II) in the form

    ∫_R f(x) dF_n(x) → ∫_R f(x) dF(x),   f ∈ C(R).                           (5)

In analysis, (4) is called weak convergence (of P_n to P, n → ∞) and written P_n →ʷ P. It is also natural to call (5) weak convergence of F_n to F and denote it by F_n →ʷ F. Thus we may say that in a Bernoulli scheme

    S_n/n →ᴾ p  ⇒  F_n →ʷ F.                                                 (6)

It is also easy to see from (1) that, for the distribution functions defined in (2), F_n(x) → F(x), n → ∞, for all points x ∈ R except for the single point x = p, where F(x) has a discontinuity. This shows that weak convergence F_n →ʷ F does not imply pointwise convergence of F_n(x) to F(x), n → ∞, for all points x ∈ R. However, it turns out that, both for Bernoulli schemes and for arbitrary distribution functions, weak convergence is equivalent (see Theorem 2 below) to "convergence in general" in the sense of the following definition.

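The behavior just described is visible numerically. The Python sketch below computes F_n(x) = P(S_n/n ≤ x) exactly from the binomial distribution (in log space, to avoid overflow; the values of p and n are arbitrary): F_n converges to the degenerate d.f. at every x ≠ p, but F_n(p) → 1/2 rather than F(p) = 1.

```python
from math import lgamma, log, exp

# Exact binomial computation of F_n(x) = P(S_n/n <= x) for the Bernoulli scheme.
p, n = 0.5, 10_000

def F_n(x):
    k = int(n * x)        # P(S_n <= floor(n x))
    lpmf = lambda j: (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                      + j * log(p) + (n - j) * log(1 - p))
    return sum(exp(lpmf(j)) for j in range(k + 1))

assert F_n(0.45) < 1e-6          # -> F(0.45) = 0
assert F_n(0.55) > 1 - 1e-6      # -> F(0.55) = 1
assert abs(F_n(p) - 0.5) < 0.01  # F_n(p) -> 1/2, although F(p) = 1
```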
Definition 1. A sequence of distribution functions {F_n}, defined on the real line, converges in general to the distribution function F (notation: F_n ⇒ F) if, as n → ∞,

    F_n(x) → F(x),   x ∈ P_C(F),

where P_C(F) is the set of points of continuity of F = F(x).

For Bernoulli schemes, F = F(x) is degenerate, and it is easy to see (see Problem 7 of §10, Chapter II) that

    (F_n ⇒ F) ⇔ (S_n/n →ᴾ p).

Therefore, taking account of Theorem 2 below,

    (S_n/n →ᴾ p) ⇔ (F_n →ʷ F) ⇔ (F_n ⇒ F),                                   (7)

and consequently the law of large numbers can be considered as a theorem on the weak convergence of the distribution functions defined in (2).

Let us write

    F_n(x) = P{(S_n − np)/√(npq) ≤ x},   F(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du.   (8)

Definition 2. A sequence of probability measures {P n} converges weakly to the probability measure P (notation: P n ~ P) if

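The convergence in (8) can also be observed numerically. A Python sketch (illustrative parameters; the binomial d.f. is computed exactly in log space) compares the standardized binomial d.f. with the normal d.f.:

```python
from math import lgamma, log, exp, erf, sqrt

# De Moivre-Laplace: the d.f. of (S_n - np)/sqrt(npq) is close to the
# standard normal d.f. for large n.
n, p = 2000, 0.3
q = 1 - p

def binom_cdf(k):
    lp = [lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
          + j * log(p) + (n - j) * log(q) for j in range(k + 1)]
    return sum(map(exp, lp))

Phi = lambda a: 0.5 * (1 + erf(a / sqrt(2)))

for x in (-1.0, 0.0, 1.0):
    k = int(n * p + x * sqrt(n * p * q))    # largest k with (k - np)/sqrt(npq) <= x
    assert abs(binom_cdf(k) - Phi(x)) < 0.02
```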
L

f(x)P n(dx)

--+

L

J(x)P(dx)

(9)

for every function f = f(x) in the class IC(E) of continuous bounded functions on E.

§I. Weak Convergence of Probability Measures and Distributions

309

Definition 3. A sequence of probability measures {P n} converges in general to the probability measure P (notation: Pn => P) if (10) for every set A of G for which P(oA)

= 0.

(11)

(Here oA denotes the boundary of A: oA

=

[A] n [A],

where [A] is the closure of A.) The following fundamental theorem shows the eqmvalence of the concepts of weak convergence and convergence in general for probability measures, and contains still another equivalent statement.

Theorem 1. The following statements are equivalent. (I) Pn ~ P. (II) lim Pn(A) ::::;; P(A), A closed. (Ill) lim Pn(A) 2': P(A), A open. (IV) Pn=>P. PRooF. (I)=> (II). Let A be closed, f(x)

fix)=

=

IA(x) and

g(~ p(x, A)),

s > 0,

where p(x, A)

= inf{p(x, y): yEA},

1, g(t) = { 1 - t, 0,

t::::;; 0, 0 ::::;; t ::::;; 1, t 2': 1.

Let us also put A,= {x: p(x, A) < s} and observe that A, t A as s t 0. Since.fe(x) is bounded, continuous, and satisfies

we have

which establishes the required implication.

310

III. Convergence of Probability Measures. Central Limit Theorem

The implications (II) => (Ill) and (Ill) => (II) become obvious if we take the complements of the sets concerned. (III)=> (IV). Let A 0 = A\oA be the interior, and [A] the closure, of A. Then from (II), (III), and the hypothesis P(ilA) = 0, we have lim Pn(A)

~lim n

n

Pn([A])

~

P([A]) = P(A),

n

n

and therefore Pn(A) ~ P(A) for every A such that P(ilA) = 0. (IV)~ (1). Letf = f(x) be a bounded continuous function with lf(x)l M. We put

~

= {tER: P{x:f(x) = t} =I= 0} and consider a decomposition 7k = (t 0 , t 1 , •.. , tk) of [- M, M]: D

- M =

t0

<

t1

< ··· <

tk =

k

M,

~ 1,

with ti ¢: D, i = 0, 1, ... , k. (Observe that D is at most countable since the sets f- 1 { t} are disjoint and P is finite.) Let Bi = {x: t; ~ f(x) < t;+ d. Since f(x) is continuous and therefore the set f- 1(t;, ti+ 1 ) is open, we have oB; £ f- 1 {t;} u f- 1 {ti+ 1 }. The points t;, t;+ 1 ¢ D; therefore P(ilB;) = 0 and, by (IV), k-1

k-1

L

t;Pn(B;) ~

i=O

But

I

Lf(x)Pn(dx)- Lf(x)P(dx)

I~

L

:t~ t;Pn(B;)

+ I:t~ t; Pn(B;) -

:t~ t; P(B;) I

Lf(x)Pn(dx)-

t;P(B;)- Lf(x)P(dx)

2 max

(t;+ 1

O:s;i:s;k-1

whence, by (12), since the

7k (k

~

1) are arbitrary,

lim ( f(x)Pn(dx) = ( f(x)P(dx). n

JE

I

I

+I :t~ ~

(12)

t;P(B;).

t=O

JE

This completes the proof of the theorem.

-

t;)

I


Remark 1. The functions f(x) = I_A(x) and f_ε(x) that appear in the proof that (I) ⇒ (II) are respectively upper semicontinuous and uniformly continuous. Hence it is easy to show that each of the conditions of the theorem is equivalent to one of the following:

(V) ∫_E f(x)P_n(dx) → ∫_E f(x)P(dx) for all bounded uniformly continuous f(x);
(VI) lim sup_n ∫_E f(x)P_n(dx) ≤ ∫_E f(x)P(dx) for all bounded f(x) that are upper semicontinuous (lim sup f(x_n) ≤ f(x), x_n → x);
(VII) lim inf_n ∫_E f(x)P_n(dx) ≥ ∫_E f(x)P(dx) for all bounded f(x) that are lower semicontinuous (lim inf f(x_n) ≥ f(x), x_n → x).

Remark 2. Theorem 1 admits a natural generalization to the case when the probability measures P and P_n defined on (E, ℰ, ρ) are replaced by arbitrary (not necessarily probability) finite measures μ and μ_n. For such measures we can introduce weak convergence μ_n →ʷ μ and convergence in general μ_n ⇒ μ and, just as in Theorem 1, we can establish the equivalence of the following conditions:

(I*) μ_n →ʷ μ;
(II*) lim sup_n μ_n(A) ≤ μ(A), where A is closed, and μ_n(E) → μ(E);
(III*) lim inf_n μ_n(A) ≥ μ(A), where A is open, and μ_n(E) → μ(E);
(IV*) μ_n ⇒ μ.

Each of these is equivalent to any of (V*), (VI*), and (VII*), which are (V), (VI), and (VII) with P_n and P replaced by μ_n and μ.

3. Let (R, 𝓑(R)) be the real line with the system 𝓑(R) of sets generated by the Euclidean metric ρ(x, y) = |x − y| (compare Remark 2 of Subsection 2 of §2 of Chapter II). Let P and P_n, n ≥ 1, be probability measures on (R, 𝓑(R)) and let F and F_n, n ≥ 1, be the corresponding distribution functions.

Theorem 2. The following conditions are equivalent:
(1) P_n →ʷ P,
(2) P_n ⇒ P,
(3) F_n →ʷ F,
(4) F_n ⇒ F.

PROOF. Since (2) ⇔ (1) ⇔ (3), it is enough to show that (2) ⇔ (4).

If P_n ⇒ P, then in particular

    P_n(−∞, x] → P(−∞, x]

for all x ∈ R such that P{x} = 0. But this means that F_n ⇒ F.

Now let F_n ⇒ F. To prove that P_n ⇒ P it is enough (by Theorem 1) to show that lim inf_n P_n(A) ≥ P(A) for every open set A. If A is open, there is a countable collection of disjoint open intervals I₁, I₂, … (of the form (a, b)) such that A = Σ_{k=1}^{∞} I_k. Choose ε > 0 and in


each interval I_k = (a_k, b_k) select a subinterval Ĩ_k = (ã_k, b̃_k] such that ã_k, b̃_k ∈ P_C(F) and P(I_k) ≤ P(Ĩ_k) + ε·2⁻ᵏ. (Since F(x) has at most countably many discontinuities, such intervals Ĩ_k, k ≥ 1, certainly exist.) By Fatou's lemma,

    lim inf_n P_n(A) = lim inf_n Σ_{k=1}^{∞} P_n(I_k) ≥ Σ_{k=1}^{∞} lim inf_n P_n(I_k) ≥ Σ_{k=1}^{∞} lim inf_n P_n(Ĩ_k).

But

    P_n(Ĩ_k) = F_n(b̃_k) − F_n(ã_k) → F(b̃_k) − F(ã_k) = P(Ĩ_k).

Therefore

    lim inf_n P_n(A) ≥ Σ_{k=1}^{∞} P(Ĩ_k) ≥ Σ_{k=1}^{∞} (P(I_k) − ε·2⁻ᵏ) = P(A) − ε.

Since ε > 0 is arbitrary, this shows that lim inf_n P_n(A) ≥ P(A) if A is open. This completes the proof of the theorem.

4. Let (E, ℰ) be a measurable space. A collection 𝒦₀(E) ⊆ ℰ of subsets is a determining class if whenever two probability measures P and Q on (E, ℰ) satisfy

    P(A) = Q(A)   for all   A ∈ 𝒦₀(E),

it follows that the measures are identical, i.e.

    P(A) = Q(A)   for all   A ∈ ℰ.

If (E, ℰ, ρ) is a metric space, a collection 𝒦₁(E) ⊆ ℰ is a convergence-determining class if whenever probability measures P, P₁, P₂, … satisfy

    P_n(A) → P(A)   for all   A ∈ 𝒦₁(E)   with   P(∂A) = 0,

it follows that

    P_n(A) → P(A)   for all   A ∈ ℰ   with   P(∂A) = 0.

When (E, ℰ) = (R, 𝓑(R)), we can take a determining class 𝒦₀(R) to be the class of "elementary" sets 𝒦 = {(−∞, x], x ∈ R} (Theorem 1, §3, Chapter II). It follows from the equivalence of (2) and (4) of Theorem 2 that this class 𝒦 is also a convergence-determining class. It is natural to ask about such determining classes in more general spaces.

For Rⁿ, n ≥ 2, the class 𝒦 of "elementary" sets of the form (−∞, x] = (−∞, x₁] × ⋯ × (−∞, x_n], where x = (x₁, …, x_n) ∈ Rⁿ, is both a determining class (Theorem 2, §3, Chapter II) and a convergence-determining class (Problem 2).

Figure 35

For R^∞ the cylindrical sets 𝒦₀(R^∞) are the "elementary" sets whose probabilities uniquely determine the probabilities of the Borel sets (Theorem 3, §3, Chapter II). It turns out that in this case the class of cylindrical sets is also the class of convergence-determining sets (Problem 3). Therefore 𝒦₁(R^∞) = 𝒦₀(R^∞).

We might expect that the cylindrical sets would still constitute determining classes in more general spaces. However, this is, in general, not the case. For example, consider the space (C, 𝓑₀(C), ρ) with the uniform metric ρ (see Subsection 6, §2, Chapter II). Let P be the probability measure concentrated on the element x = x(t) ≡ 0, 0 ≤ t ≤ 1, and let P_n, n ≥ 1, be the probability measures each of which is concentrated on the element x = x_n(t) shown in Figure 35. It is easy to see that P_n(A) → P(A) for all cylindrical sets A with P(∂A) = 0. But if we consider, for example, the set

    A = {α ∈ C: |α(t)| ≤ 1/2, 0 ≤ t ≤ 1} ∈ 𝓑₀(C),

then P(∂A) = 0, P_n(A) = 0, P(A) = 1, and consequently P_n ⇏ P. Therefore 𝒦₀(C) = 𝓑₀(C) but 𝒦₀(C) ⊂ 𝒦₁(C) (with strict inclusion).

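A numerical sketch of this counterexample (Python/NumPy). The exact shape of the functions x_n(t) in Figure 35 is not reproduced here, so the code assumes the standard choice — a "tent" of height 1 with support [0, 2/n] — which is consistent with the axis labels 1/n and 2/n: every fixed coordinate x_n(t) tends to 0 (so cylindrical-set probabilities converge), while the uniform norm of x_n does not tend to 0.

```python
import numpy as np

# Tent functions: x_n(t) = max(0, 1 - |n t - 1|), peak 1 at t = 1/n.
def x_n(t, n):
    return np.maximum(0.0, 1.0 - np.abs(n * t - 1.0))

t = np.linspace(0.0, 1.0, 10_001)
for n in (10, 100, 1000):
    assert x_n(0.37, n) == 0.0          # pointwise: x_n(t0) -> 0 for fixed t0 > 0
    assert np.max(x_n(t, n)) > 0.999    # but sup_t |x_n(t)| stays equal to 1
```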
5. PROBLEMS

1. Let us say that a function F = F(x), defined on Rⁿ, is continuous at x ∈ Rⁿ provided that, for every ε > 0, there is a δ > 0 such that |F(x) − F(y)| < ε for all y ∈ Rⁿ that satisfy

    x − δe < y < x + δe,

where e = (1, …, 1) ∈ Rⁿ. Let us say that a sequence of distribution functions {F_n} converges in general to the distribution function F (F_n ⇒ F) if F_n(x) → F(x) for all points x ∈ Rⁿ where F = F(x) is continuous. Show that the conclusion of Theorem 2 remains valid for Rⁿ, n > 1. (See the remark on Theorem 2.)


2. Show that the class 𝒦 of "elementary" sets in Rⁿ is a convergence-determining class.

3. Let E be one of the spaces R^∞, C or D. Let us say that a sequence {P_n} of probability measures (defined on the σ-algebra ℰ of Borel sets generated by the open sets) converges in general in the sense of finite-dimensional distributions to the probability measure P (notation: P_n ⇒^f P) if P_n(A) → P(A), n → ∞, for all cylindrical sets A with P(∂A) = 0. For R^∞, show that

    (P_n ⇒^f P) ⇔ (P_n ⇒ P).

4. Let F and G be distribution functions on the real line and let

    L(F, G) = inf{h > 0: F(x − h) − h ≤ G(x) ≤ F(x + h) + h}

be the Lévy distance (between F and G). Show that convergence in general is equivalent to convergence in the Lévy metric:

    (F_n ⇒ F) ⇔ L(F_n, F) → 0.

5. Let F_n ⇒ F and let F be continuous. Show that in this case F_n(x) converges uniformly to F(x):

    sup_x |F_n(x) − F(x)| → 0,   n → ∞.

6. Prove the statement in Remark 1 on Theorem 1.

7. Establish the equivalence of (I*)-(IV*) as stated in Remark 2 on Theorem 1.

8. Show that P_n →ʷ P if and only if every subsequence {P_{n′}} of {P_n} contains a subsequence {P_{n″}} such that P_{n″} →ʷ P.

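The Lévy distance of Problem 4 can be approximated on a grid. A Python/NumPy sketch (accuracy is limited by the grid and the step in h; the normal distribution function and the shifts are arbitrary illustrations):

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda a: 0.5 * (1 + erf(a / sqrt(2))))  # normal d.f.
x = np.linspace(-8, 8, 4001)

def levy(F, G):
    # Scan h upward until both defining inequalities hold on the grid.
    for h in np.arange(0.0, 1.0, 0.001):
        if (np.all(F(x - h) - h <= G(x) + 1e-12)
                and np.all(G(x) <= F(x + h) + h + 1e-12)):
            return h
    return 1.0

assert levy(Phi, Phi) == 0.0
# For the normal d.f., shifting by s gives a Levy distance below s/2,
# and the distance shrinks with the shift.
prev = 1.0
for s in (0.5, 0.1, 0.02):
    d = levy(Phi, lambda a, s=s: Phi(a - s))
    assert d <= s / 2 + 0.002 and d <= prev
    prev = d
```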
§2. Relative Compactness and Tightness of Families of Probability Distributions

1. If we are given a sequence of probability measures, then before we can consider the question of its (weak) convergence to some probability measure, we have of course to establish whether the sequence converges in general to some measure, or has at least one convergent subsequence.

For example, the sequence {P_n}, where P_{2n} = P, P_{2n+1} = Q, and P and Q are different probability measures, is evidently not convergent, but has the two convergent subsequences {P_{2n}} and {P_{2n+1}}.

It is easy to construct a sequence {P_n} of probability measures P_n, n ≥ 1, that not only fails to converge, but contains no convergent subsequences at all. All that we have to do is to take P_n, n ≥ 1, to be concentrated at {n} (that is, P_n{n} = 1). In fact, since lim_n P_n(a, b] = 0 whenever a < b, a limit measure would have to be identically zero, contradicting the fact that 1 = P_n(R) ↛ 0,

n → ∞. It is interesting to observe that in this example the corresponding sequence {F_n} of distribution functions,

    F_n(x) = { 1,  x ≥ n,
               0,  x < n,

is evidently convergent: for every x ∈ R,

    F_n(x) → G(x) ≡ 0.

However, the limit function G = G(x) is not a distribution function (in the sense of Definition 1 of §3, Chapter II). This instructive example shows that the space of distribution functions is not compact. It also shows that if a sequence of distribution functions is to converge to a limit that is also a distribution function, we must have some hypothesis that will prevent mass from "escaping to infinity."

After these introductory remarks, which illustrate the kinds of difficulty that can arise, we turn to the basic definitions.

2. Let us suppose that all measures are defined on the metric space (E, ℰ, ρ).

Definition 1. A family of probability measures 𝒫 = {P_α; α ∈ 𝔄} is relatively compact if every sequence of measures from 𝒫 contains a subsequence which converges weakly to a probability measure.

We emphasize that in this definition the limit measure is to be a probability measure, although it need not belong to the original class 𝒫. (This is why the word "relatively" appears in the definition.)

It is often far from simple to verify that a given family of probability measures is relatively compact. Consequently it is desirable to have simple and useable tests for this property. We need the following definitions.

Definition 2. A family of probability measures 𝒫 = {P_α; α ∈ 𝔄} is tight if, for every ε > 0, there is a compact set K ⊆ E such that

    sup_{α∈𝔄} P_α(E∖K) ≤ ε.                                                  (1)

Definition 3. A family of distribution functions F = {F_α; α ∈ 𝔄} defined on Rⁿ, n ≥ 1, is relatively compact (or tight) if the same property is possessed by the family 𝒫 = {P_α; α ∈ 𝔄} of probability measures, where P_α is the measure constructed from F_α.

3. The following result is fundamental for the study of weak convergence of probability measures.

Theorem 1 (Prohorov's Theorem). Let ~ = {P"'; rx Em} be a family of probability measures defined on a complete separable metric space (E, iff, p). Then~ is relatively compact if and only if it is tight.
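Tightness in the sense of Definition 2 is often easy to verify directly. As a small numerical illustration (our own sketch, anticipating Problem 2 below: the particular parameter ranges, the tolerance, and the use of math.erf for the Gaussian distribution function are all our choices, not from the text), the following finds a single compact interval [−K, K] that carries almost all the mass of every member of a Gaussian family with uniformly bounded means and variances.

```python
import math

def norm_cdf(x, m=0.0, s=1.0):
    # distribution function of the Gaussian measure with mean m and std s
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

def mass_outside(K, m, s):
    # P_alpha(R \ [-K, K]) for the Gaussian measure with parameters (m, s)
    return 1.0 - (norm_cdf(K, m, s) - norm_cdf(-K, m, s))

# A family with |m_alpha| <= 2 and sigma_alpha <= 2:
params = [(m, s) for m in (-2.0, 0.0, 2.0) for s in (0.5, 1.0, 2.0)]

eps = 1e-6
K = 2.0
while max(mass_outside(K, m, s) for m, s in params) > eps:
    K += 1.0          # enlarge the compact set [-K, K]
print(K)              # one compact set works for the whole family
```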


III. Convergence of Probability Measures. Central Limit Theorem

We shall give the proof only when the space is the real line. (The proof can be carried over, almost unchanged, to arbitrary Euclidean spaces Rⁿ, n ≥ 2. Then the theorem can be extended successively to R^∞, to σ-compact spaces, and finally to general complete separable metric spaces, by reducing each case to the preceding one.)

Necessity. Let the family 𝒫 = {P_α; α ∈ 𝔄} of probability measures defined on (R, ℬ(R)) be relatively compact but not tight. Then there is an ε > 0 such that for every compact K ⊆ R

    sup_α P_α(R\K) > ε,

and therefore, for each interval I = (a, b),

    sup_α P_α(R\I) > ε.

It follows that for every interval I_n = (−n, n), n ≥ 1, there is a measure P_{α_n} such that

    P_{α_n}(R\I_n) > ε.

Since the original family 𝒫 is relatively compact, we can select from {P_{α_n}}_{n ≥ 1} a subsequence {P_{α_{n_k}}} such that P_{α_{n_k}} ⇒ Q, where Q is a probability measure. Then, by the equivalence of conditions (I) and (II) in Theorem 1 of §1, we have

    lim sup_k P_{α_{n_k}}(R\I_n) ≤ Q(R\I_n)        (2)

for every n ≥ 1. But Q(R\I_n) ↓ 0, n → ∞, and the left side of (2) exceeds ε > 0. This contradiction shows that relatively compact sets are tight.

To prove the sufficiency we need a general result (Helly's theorem) on the sequential compactness of families of generalized distribution functions (Subsection 2 of §3 of Chapter II).

Let 𝒥 = {G} be the collection of generalized distribution functions G = G(x) that satisfy:

(1) G(x) is nondecreasing;
(2) 0 ≤ G(−∞), G(+∞) ≤ 1;
(3) G(x) is continuous on the right.

Then 𝒥 clearly contains the class of distribution functions ℱ = {F} for which F(−∞) = 0 and F(+∞) = 1.

Theorem 2 (Helly's Theorem). The class 𝒥 = {G} of generalized distribution functions is sequentially compact, i.e. for every sequence {G_n} of functions from 𝒥 we can find a function G ∈ 𝒥 and a sequence {n_k} ⊆ {n} such that

    G_{n_k}(x) → G(x),   k → ∞,

for every point x of the set P_C(G) of points of continuity of G = G(x).


PROOF. Let T = {x₁, x₂, ...} be a countable dense subset of R. Since the sequence of numbers {G_n(x₁)} is bounded, there is a subsequence N₁ = {n₁⁽¹⁾, n₂⁽¹⁾, ...} such that G_{n_i⁽¹⁾}(x₁) approaches a limit g₁ as i → ∞. Then we extract from N₁ a subsequence N₂ = {n₁⁽²⁾, n₂⁽²⁾, ...} such that G_{n_i⁽²⁾}(x₂) approaches a limit g₂ as i → ∞; and so on.

On the set T ⊆ R we can define a function G_T(x) by

    G_T(x_i) = g_i,   x_i ∈ T,

and consider the "Cantor" diagonal sequence N = {n₁⁽¹⁾, n₂⁽²⁾, ...}. Then, for each x_i ∈ T, as m → ∞, we have

    G_{n_m⁽ᵐ⁾}(x_i) → G_T(x_i).

Finally, let us define G = G(x) for all x ∈ R by putting

    G(x) = inf{G_T(y): y ∈ T, y > x}.        (3)

We claim that G = G(x) is the required function and G_{n_m⁽ᵐ⁾}(x) → G(x) at all points x of continuity of G.

Since all the functions G_n under consideration are nondecreasing, we have G_{n_m⁽ᵐ⁾}(x) ≤ G_{n_m⁽ᵐ⁾}(y) for all x and y that belong to T and satisfy the inequality x ≤ y. Hence for such x and y,

    G_T(x) ≤ G_T(y).

It follows from this and (3) that G = G(x) is nondecreasing.

Now let us show that it is continuous on the right. Let x_k ↓ x and d = lim_k G(x_k). Clearly G(x) ≤ d, and we have to show that actually G(x) = d. Suppose the contrary, that is, let G(x) < d. It follows from (3) that there is a y ∈ T, x < y, such that G_T(y) < d. But x < x_k < y for sufficiently large k, and therefore G(x_k) ≤ G_T(y) < d and lim_k G(x_k) < d, which contradicts d = lim_k G(x_k). Thus we have constructed a function G that belongs to 𝒥.

We now establish that G_{n_m⁽ᵐ⁾}(x₀) → G(x₀) for every x₀ ∈ P_C(G). If x₀ < y ∈ T, then

    lim sup_m G_{n_m⁽ᵐ⁾}(x₀) ≤ lim_m G_{n_m⁽ᵐ⁾}(y) = G_T(y),

whence

    lim sup_m G_{n_m⁽ᵐ⁾}(x₀) ≤ inf{G_T(y): y > x₀, y ∈ T} = G(x₀).        (4)

On the other hand, let x₁ < y < x₀, y ∈ T. Then

    G(x₁) ≤ G_T(y) = lim_m G_{n_m⁽ᵐ⁾}(y) ≤ lim inf_m G_{n_m⁽ᵐ⁾}(x₀).

Hence if we let x₁ ↑ x₀ we find that

    G(x₀−) ≤ lim inf_m G_{n_m⁽ᵐ⁾}(x₀).        (5)

But if G(x₀−) = G(x₀) we can infer from (4) and (5) that G_{n_m⁽ᵐ⁾}(x₀) → G(x₀), m → ∞. This completes the proof of the theorem.

We can now complete the proof of Theorem 1.

Sufficiency. Let the family 𝒫 be tight and let {P_n} be a sequence of probability measures from 𝒫. Let {F_n} be the corresponding sequence of distribution functions. By Helly's theorem, there are a subsequence {F_{n_k}} ⊆ {F_n} and a generalized distribution function G ∈ 𝒥 such that F_{n_k}(x) → G(x) for x ∈ P_C(G). Let us show that because 𝒫 was assumed tight, the function G = G(x) is in fact a genuine distribution function (G(−∞) = 0, G(+∞) = 1).

Take ε > 0, and let I = (a, b] be the interval for which

    sup_n P_n(R\I) < ε,

or, equivalently,

    P_n(a, b] ≥ 1 − ε,   n ≥ 1.

Choose points a', b' ∈ P_C(G) such that a' < a, b' > b. Then

    1 − ε ≤ P_{n_k}(a, b] ≤ P_{n_k}(a', b'] = F_{n_k}(b') − F_{n_k}(a') → G(b') − G(a').

It follows that G(+∞) − G(−∞) = 1, and since 0 ≤ G(−∞) ≤ G(+∞) ≤ 1, we have G(−∞) = 0 and G(+∞) = 1.

Therefore the limit function G = G(x) is a distribution function and F_{n_k} ⇒ G. Together with Theorem 2 of §1 this shows that P_{n_k} ⇒ Q, where Q is the probability measure corresponding to the distribution function G. This completes the proof of Theorem 1.

4. PROBLEMS

1. Carry out the proofs of Theorems 1 and 2 for Rⁿ, n ≥ 2.

2. Let P_α be a Gaussian measure on the real line, with parameters m_α and σ_α², α ∈ 𝔄. Show that the family 𝒫 = {P_α; α ∈ 𝔄} is tight if and only if there are constants a and b such that

    |m_α| ≤ a,   σ_α² ≤ b,   α ∈ 𝔄.

3. Construct examples of tight and nontight families 𝒫 = {P_α; α ∈ 𝔄} of probability measures defined on (R^∞, ℬ(R^∞)).

§3. Proofs of Limit Theorems by the Method of Characteristic Functions

1. The proofs of the first limit theorems of probability theory, the law of large numbers and the De Moivre-Laplace and Poisson theorems for Bernoulli schemes, were based on direct analysis of the limit functions of the distributions F_n, which are expressed rather simply in terms of binomial probabilities. (In the Bernoulli scheme, we are adding random variables that take only two values, so that in principle we can find F_n explicitly.) However, it is practically impossible to apply a similar direct method to the study of more complicated random variables.

The first step in proving limit theorems for sums of arbitrarily distributed random variables was taken by Chebyshev. The inequality that he discovered, and which is now known as Chebyshev's inequality, not only makes it possible to give an elementary proof of James Bernoulli's law of large numbers, but also lets us establish very general conditions for this law to hold, when stated in the form

    P{ |(S_n − ES_n)/n| ≥ ε } → 0,   n → ∞,  for every ε > 0,        (1)

for sums S_n = ξ₁ + ⋯ + ξ_n, n ≥ 1, of independent random variables. (See Problem 2.)

Furthermore, Chebyshev created (and Markov perfected) the "moment method," which made it possible to show that the conclusion of the De Moivre-Laplace theorem, written in the form

    P{ (S_n − ES_n)/√(VS_n) ≤ x } → (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du,        (2)

is "universal," in the sense that it is valid under very general hypotheses concerning the nature of the random variables. For this reason it is known as the central limit theorem of probability theory. Somewhat later Lyapunov proposed a different method for proving the central limit theorem, based on the idea (which goes back to Laplace) of the characteristic function of a probability distribution. Subsequent developments have shown that Lyapunov's method of characteristic functions is extremely effective for proving the most diverse limit theorems. Consequently it has been extensively developed and widely applied. In essence, the method is as follows. 2. We already know (Chapter II, §12) that there is a one-to-one correspondence between distribution functions and characteristic functions. Hence we can study the properties of distribution functions by using the corresponding characteristic functions. It is a fortunate circumstance that weak convergence Fn ~ F of distributions is equivalent to pointwise convergence q>n-+ q> of the corresponding characteristic functions. Moreover, we have the following result, which provides the basic method of proving theorems on weak convergence for distributions on the real line.

Theorem 1 (Continuity Theorem). Let {F_n} be a sequence of distribution functions F_n = F_n(x), x ∈ R, and let {φ_n} be the corresponding sequence of characteristic functions,

    φ_n(t) = ∫_{−∞}^{∞} e^{itx} dF_n(x),   t ∈ R.

(1) If F_n ⇒ F, where F = F(x) is a distribution function, then φ_n(t) → φ(t), t ∈ R, where φ(t) is the characteristic function of F = F(x).

(2) If lim_n φ_n(t) exists for each t ∈ R and φ(t) = lim_n φ_n(t) is continuous at t = 0, then φ(t) is the characteristic function of a probability distribution F = F(x), and

    F_n ⇒ F.

The proof of conclusion (1) is an immediate consequence of the definition of weak convergence, applied to the functions Re e^{itx} and Im e^{itx}. The proof of (2) requires some preliminary propositions.

Lemma 1. Let {P_n} be a tight family of probability measures. Suppose that every weakly convergent subsequence {P_{n'}} of {P_n} converges to the same probability measure P. Then the whole sequence {P_n} converges to P.

PROOF. Suppose that P_n ⇏ P. Then there is a bounded continuous function f = f(x) such that

    ∫_R f(x) P_n(dx) ↛ ∫_R f(x) P(dx).

It follows that there exist ε > 0 and an infinite sequence {n'} ⊆ {n} such that

    | ∫_R f(x) P_{n'}(dx) − ∫_R f(x) P(dx) | ≥ ε > 0.        (3)

By Prohorov's theorem (§2) we can select a subsequence {P_{n''}} of {P_{n'}} such that P_{n''} ⇒ Q, where Q is a probability measure. By the hypotheses of the lemma, Q = P, and therefore

    ∫_R f(x) P_{n''}(dx) → ∫_R f(x) P(dx),

which leads to a contradiction with (3). This completes the proof of the lemma.

Lemma 2. Let {P_n} be a tight family of probability measures on (R, ℬ(R)). A necessary and sufficient condition for the sequence {P_n} to converge weakly to a probability measure is that for each t ∈ R the limit lim_n φ_n(t) exists, where φ_n(t) is the characteristic function of P_n:

    φ_n(t) = ∫_R e^{itx} P_n(dx).

PROOF. If {P_n} is tight, by Prohorov's theorem there are a subsequence {P_{n'}} and a probability measure P such that P_{n'} ⇒ P. Suppose that the whole sequence {P_n} does not converge to P (P_n ⇏ P). Then, by Lemma 1, there are a subsequence {P_{n''}} and a probability measure Q such that P_{n''} ⇒ Q, and P ≠ Q.

Now we use the existence of lim_n φ_n(t) for each t ∈ R. Then

    lim_{n'} ∫_R e^{itx} P_{n'}(dx) = lim_{n''} ∫_R e^{itx} P_{n''}(dx),

and therefore

    ∫_R e^{itx} P(dx) = ∫_R e^{itx} Q(dx),   t ∈ R.

But the characteristic function determines the distribution uniquely (Theorem 2, §12, Chapter II). Hence P = Q, which contradicts P ≠ Q; therefore the whole sequence {P_n} converges to P.

The converse part of the lemma follows immediately from the definition of weak convergence.

The following lemma estimates the "tails" of a distribution function in terms of the behavior of its characteristic function in a neighborhood of zero.

Lemma 3. Let F = F(x) be a distribution function on the real line and let φ = φ(t) be its characteristic function. Then there is a constant K > 0 such that for every a > 0

    ∫_{|x| ≥ 1/a} dF(x) ≤ (K/a) ∫_0^a [1 − Re φ(t)] dt.        (4)

PROOF. Since Re φ(t) = ∫_{−∞}^{∞} cos tx dF(x), we find by Fubini's theorem that

    (1/a) ∫_0^a [1 − Re φ(t)] dt
        = (1/a) ∫_0^a [ ∫_{−∞}^{∞} (1 − cos tx) dF(x) ] dt
        = ∫_{−∞}^{∞} [ (1/a) ∫_0^a (1 − cos tx) dt ] dF(x)
        = ∫_{−∞}^{∞} ( 1 − (sin ax)/(ax) ) dF(x)
        ≥ ∫_{|x| ≥ 1/a} ( 1 − (sin ax)/(ax) ) dF(x)
        ≥ inf_{|y| ≥ 1} ( 1 − (sin y)/y ) · ∫_{|x| ≥ 1/a} dF(x)
        = (1/K) ∫_{|x| ≥ 1/a} dF(x),

where

    1/K = inf_{|y| ≥ 1} ( 1 − (sin y)/y ) = 1 − sin 1 ≥ 1/7,

so that (4) holds with K = 7. This establishes the lemma.

Proof of conclusion (2) of Theorem 1. Let φ_n(t) → φ(t), n → ∞, where φ(t) is continuous at 0. Let us show that it follows that the family of probability measures {P_n} is tight, where P_n is the measure corresponding to F_n. By (4) and the dominated convergence theorem,

    P_n{ R \ (−1/a, 1/a) } = ∫_{|x| ≥ 1/a} dF_n(x) ≤ (K/a) ∫_0^a [1 − Re φ_n(t)] dt
        → (K/a) ∫_0^a [1 − Re φ(t)] dt

as n → ∞. Since, by hypothesis, φ(t) is continuous at 0 and φ(0) = 1, for every ε > 0 there is an a > 0 such that

    (K/a) ∫_0^a [1 − Re φ(t)] dt ≤ ε/2,

and hence P_n{R \ (−1/a, 1/a)} ≤ ε for all sufficiently large n; since each of the finitely many remaining measures is tight by itself, a can be decreased, if necessary, so that P_n{R \ (−1/a, 1/a)} ≤ ε for all n ≥ 1. Consequently {P_n} is tight, and by Lemma 2 there is a probability measure P such that P_n ⇒ P. Hence

    φ_n(t) → ∫_R e^{itx} P(dx),

but also φ_n(t) → φ(t). Therefore φ(t) is the characteristic function of P. This completes the proof of the theorem.
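Inequality (4) can be sanity-checked on a distribution whose characteristic function is known in closed form. The sketch below uses the standard Cauchy law, for which F(x) = 1/2 + (1/π) arctan x and φ(t) = e^{−|t|} (standard facts, not derived in the text); both sides of (4) with K = 7 then reduce to elementary expressions.

```python
import math

def lhs(a):
    # F-mass of {|x| >= 1/a} for the standard Cauchy distribution
    return 1.0 - (2.0 / math.pi) * math.atan(1.0 / a)

def rhs(a):
    # (7/a) * integral_0^a [1 - Re phi(t)] dt with phi(t) = exp(-|t|);
    # the integral evaluates to a - 1 + exp(-a)
    return (7.0 / a) * (a - 1.0 + math.exp(-a))

for a in (0.1, 0.5, 1.0, 2.0, 10.0):
    assert lhs(a) <= rhs(a)
print("inequality (4) holds at the sampled values of a")
```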

Corollary. Let {F_n} be a sequence of distribution functions and {φ_n} the corresponding sequence of characteristic functions. Also let F be a distribution function and φ its characteristic function. Then F_n ⇒ F if and only if φ_n(t) → φ(t) for all t ∈ R.

Remark. Let η, η₁, η₂, ... be random variables and F_{η_n} ⇒ F_η. In accordance with the definition of §10 of Chapter II, we then say that the random variables η₁, η₂, ... converge to η in distribution, and write η_n →ᵈ η. Since this notation is self-explanatory, we shall frequently use it instead of F_{η_n} ⇒ F_η when stating limit theorems.


3. In the next section, Theorem 1 will be applied to prove the central limit theorem for independent but not identically distributed random variables. In the present section we shall merely apply the method of characteristic functions to prove some simple limit theorems.

Theorem 2 (Law of Large Numbers). Let ξ₁, ξ₂, ... be a sequence of independent identically distributed random variables with E|ξ₁| < ∞, S_n = ξ₁ + ⋯ + ξ_n and Eξ₁ = m. Then S_n/n →ᴾ m, that is, for every ε > 0,

    P{ |S_n/n − m| ≥ ε } → 0,   n → ∞.

PROOF. Let φ(t) = Ee^{itξ₁} and φ_{S_n/n}(t) = Ee^{itS_n/n}. Then

    φ_{S_n/n}(t) = [φ(t/n)]ⁿ

by (II.12.6). But according to (II.12.14),

    φ(t) = 1 + itm + o(t),   t → 0.

Therefore for each given t ∈ R,

    φ(t/n) = 1 + i(t/n)m + o(1/n),   n → ∞,

and therefore

    φ_{S_n/n}(t) = [1 + i(t/n)m + o(1/n)]ⁿ → e^{itm},   n → ∞.

The function e^{itm} is the characteristic function of the random variable that is identically equal to m. Hence, by the corollary to Theorem 1,

    S_n/n →ᵈ m,

and consequently (see Problem 7, §10, Chapter II)

    S_n/n →ᴾ m.
This completes the proof of the theorem.

324

III. Convergence of Probability Measures. Central Limit Theorem

Theorem 3 (Central Limit Theorem for Independent Identically Distributed Random Variables). Let 1 , 2 , ••• be a sequence of independent identically distributed (nondegenerate) random variables with Ee~ < oo and Sn = + · · · + Then as n- oo

ee

en·

e1

xeR,

(5)

where

1 W(x) = ;;,::. y 2n PROOF.

Let

Ee

1

=

m,

Ve

1

=

fx

e-" 212 du.

-oo

a 2 and cp(t) =

Eeit(~,

-m>.

Then if we put

ES}

S cp"(t) = E exp{ it ~" , we find that

But by (11.12.14)

t-o. Therefore

cpn(t)

=

[1 - ;;~: + o(~)

r-

e-1 2 /2.

as n - oo for fixed t. The function e- 1212 is the characteristic function of a random variable (denoted by %(0, 1)) with mean zero and unit variance. This, by Theorem 1, also establishes (5). In accordance with the remark in Theorem 1, this can also be written in the form

S~n J4 .ff(O, 1). vsn

(6)

This completes the proof of the theorem. The preceding two theorems have dealt with the behavior of the probabilities of (normalized and symmetrized) sums of independent and identically distributed random variables. However, in order to state Poisson's theorem (§6, Chapter I) we have to use a more general model.

325

§3. Proofs of Limit Theorems by the Method of Characteristic Functions

Let us suppose that for each n ~ 1 we are given a sequence of independent random variables ~" 1 , ••• , ~nn· In other words, let there be given a triangular array

(~11 ~21> ~22

)

~31• ~32• ~33

of random variables, those in each row being independent. Put

sn = ~n1 + ... + ~nn·

Theorem 4 (Poisson's Theorem). For each n ~ 1 let the independent random variables ~" 1 , ••• , ~nn be such that P(~nk = 1) =

Pnk•

+ qnk = 1. Suppose that

Pnk

max Pnk-+ 0,

and

Ll:=

1

Pnk -+

,( >

n-+ oo,

0, n -+ oo. Then,Jor each m = 0, 1, ... , e-).A.m P(Sn = m) -+ - -1- , n -+ oo.

(7)

m.

PRooF.

Since

for 1

k

~

~

n, by our assumptions we have

CfJsn(t) = EeitSn =

rr (pnkeit + qnk) n

k=

1

n

=

f1 (1 + Pnieit -

1))-+ exp{A.(eit - 1)},

k=1

n-+ oo.

The function cp(t) = exp{A.(eit - 1)} is the characteristic function of the Poisson distribution (II.l2.11), so that (7) is established. If n(A.) denotes a Poisson random variable with parameter A., then (7) can be written like (6), in the form

sn ~ n(A.).

This completes the proof of the theorem.

4.

PROBLEMS

1. Prove Theorem 1 for R", n

~

2.

2. Let eto ~ 2 , ••• be a sequence of independent random variables with finite means E 1.;.1 and variances V such that V ~ K < oo, where K is a constant. Use Chebyshev's inequality to prove the law of large numbers (1).

e.

e.

326

III. Convergence of Probability Measures. Central Limit Theorem

3. Show, as a corollary to Theorem 1, that the family {cp.} is uniformly continuous and that
4.

-+

q> uniformly on every finite interval.

Let~., n ;;:::: 1, be random variables with characteristic functions q>~"(t), n :2:: 1. Show that ~ • .!4 0 if and only if q>~"(t) -+ 1, n -+ oo, in some neighborhood oft = 0.

5. Let X~> X 2 , ••• be a sequence of independent random vectors (with values in Rk) with mean zero and (finite) covariance matrix r. Show that

xl + · · · + x • .!!.. JV(O, r).

Jn

(Compare Theorem 3.)

§4. Central Limit Theorem for Sums of Independent Random Variables

1. Let us suppose that for every n ≥ 1 we have a sequence

    ξ_{n1}, ..., ξ_{nn}

of independent random variables with

    Eξ_{nk} = 0,   σ²_{nk} = Vξ_{nk},   Σ_{k=1}^{n} σ²_{nk} = 1.

Let

    S_n = ξ_{n1} + ⋯ + ξ_{nn},   F_{nk}(x) = P(ξ_{nk} ≤ x),
    Φ(x) = (2π)^{−1/2} ∫_{−∞}^{x} e^{−y²/2} dy,   Φ_{nk}(x) = Φ(x/σ_{nk}).

Theorem 1. A sufficient (and necessary) condition that S_n →ᵈ 𝒩(0, 1) is that

(A)    Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0,   n → ∞,

for every ε > 0.

This theorem implies, in particular, the traditional statement of the central limit theorem under the Lindeberg condition.

Theorem 2. Suppose that the Lindeberg condition is satisfied, that is, that for every ε > 0

(L)    Σ_{k=1}^{n} ∫_{|x|>ε} x² dF_{nk}(x) → 0,   n → ∞;

then S_n →ᵈ 𝒩(0, 1).


Before proving these theorems (notice that Theorem 2 is an easy corollary of Theorem 1) we discuss the significance of conditions (A) and (L).

Since

    max_{1≤k≤n} σ²_{nk} ≤ ε² + Σ_{k=1}^{n} E( ξ²_{nk} I(|ξ_{nk}| > ε) ),

it follows from the Lindeberg condition (L) that

    max_{1≤k≤n} σ²_{nk} → 0,   n → ∞.        (1)

From this it follows (by Chebyshev's inequality) that the random variables are asymptotically infinitesimal (negligible in the limit), that is, that, for every ε > 0,

    max_{1≤k≤n} P{ |ξ_{nk}| > ε } → 0,   n → ∞.        (2)

Consequently we may say that Theorem 2 provides a condition for the validity of the central limit theorem when the random variables that are summed are asymptotically infinitesimal. Limit theorems which depend on a condition of this type are known as theorems with a classical formulation.

It is easy to give examples which satisfy neither the Lindeberg condition nor the condition of being asymptotically infinitesimal, but where the central limit theorem nevertheless holds. Here is a particularly simple example. Let ξ₁, ξ₂, ... be a sequence of independent normally distributed random variables with Eξ_n = 0, Vξ₁ = 1, Vξ_k = 2^{k−2}, k ≥ 2. Put S_n = ξ_{n1} + ⋯ + ξ_{nn} with

    ξ_{nk} = ξ_k / ( Σ_{i=1}^{n} Vξ_i )^{1/2}.

It is easily verified (Problem 1) that neither the Lindeberg condition nor the condition of being asymptotically infinitesimal is satisfied, although the central limit theorem is evident since S_n is normally distributed with ES_n = 0, VS_n = 1.

Later (Theorem 3) we shall show that the Lindeberg condition (L) implies condition (A). Nevertheless we may say that Theorem 1 also covers "nonclassical" situations (in which no condition of being asymptotically infinitesimal is used). In this sense we say that (A) is an example of a nonclassical condition for the validity of the central limit theorem.

2. PROOF OF THEOREM 1. We shall not take the time to verify the necessity of (A), but we prove its sufficiency. Let

    f_{nk}(t) = ∫_{−∞}^{∞} e^{itx} dF_{nk}(x),   φ_{nk}(t) = ∫_{−∞}^{∞} e^{itx} dΦ_{nk}(x),
    f_n(t) = Π_{k=1}^{n} f_{nk}(t),   φ(t) = e^{−t²/2}.

It follows from §12, Chapter II, that φ_{nk}(t) = e^{−σ²_{nk}t²/2}, and therefore, since Σ_{k=1}^{n} σ²_{nk} = 1,

    φ(t) = Π_{k=1}^{n} φ_{nk}(t).

By the corollary to Theorem 1 of §3, we have S_n →ᵈ 𝒩(0, 1) if and only if f_n(t) → φ(t), n → ∞, for all real t. We have

    f_n(t) − φ(t) = Π_{k=1}^{n} f_{nk}(t) − Π_{k=1}^{n} φ_{nk}(t).

Since |f_{nk}(t)| ≤ 1 and |φ_{nk}(t)| ≤ 1, we obtain

    |f_n(t) − φ(t)| = | Π_{k=1}^{n} f_{nk}(t) − Π_{k=1}^{n} φ_{nk}(t) |
        ≤ Σ_{k=1}^{n} |f_{nk}(t) − φ_{nk}(t)|
        = Σ_{k=1}^{n} | ∫_{−∞}^{∞} e^{itx} d(F_{nk} − Φ_{nk}) |
        = Σ_{k=1}^{n} | ∫_{−∞}^{∞} ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk}) |,        (3)

where we have used the equations

    ∫_{−∞}^{∞} xⁱ d(F_{nk} − Φ_{nk}) = 0,   i = 1, 2.

Let us integrate by parts (Theorem 11 of Chapter II, §6, and Remark 2 following it) in the integral

    ∫_a^b ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk}),

and then let a → −∞ and b → ∞. Since

    x² |F_{nk}(x) − Φ_{nk}(x)| → 0,   x → ±∞,

we obtain

    ∫_{−∞}^{∞} ( e^{itx} − itx + ½t²x² ) d(F_{nk} − Φ_{nk})
        = −it ∫_{−∞}^{∞} ( e^{itx} − 1 − itx )( F_{nk}(x) − Φ_{nk}(x) ) dx.        (4)

From (3) and (4),

    |f_n(t) − φ(t)| ≤ Σ_{k=1}^{n} |t| ∫_{−∞}^{∞} |e^{itx} − 1 − itx| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ ½|t|³ε Σ_{k=1}^{n} ∫_{|x|≤ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx
            + 2t² Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ |t|³ε + 2t² Σ_{k=1}^{n} ∫_{|x|>ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx,        (5)

where we have used the estimates |e^{itx} − 1 − itx| ≤ ½t²x² ≤ ½t²ε|x| for |x| ≤ ε and |e^{itx} − 1 − itx| ≤ 2|t||x| for |x| > ε, together with the inequality

    ∫_{|x|≤ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx ≤ 2σ²_{nk},        (6)

which is easily deduced from (71), §6, Chapter II. Since ε > 0 is arbitrary, it follows from (5) and (A) that f_n(t) → φ(t) as n → ∞, for all t ∈ R. This completes the proof of the theorem.

3. The following theorem gives the connections between (A) and (L).

Theorem 3. (1) (L) ⇒ (A). (2) If max_{1≤k≤n} Eξ²_{nk} → 0, n → ∞, then (A) ⇒ (L).

PROOF. (1) We noticed above that (L) implies that max_{1≤k≤n} σ²_{nk} → 0. Consequently, since Σ_{k=1}^{n} σ²_{nk} = 1, we obtain

    Σ_{k=1}^{n} ∫_{|x|>ε} x² dΦ_{nk}(x) ≤ ∫ x² dΦ(x) → 0,   n → ∞,        (7)

where the integration on the right is over |x| > ε/√(max_{1≤k≤n} σ²_{nk}). This, together with (L), shows that

    Σ_{k=1}^{n} ∫_{|x|>ε} x² d[ F_{nk}(x) + Φ_{nk}(x) ] → 0,   n → ∞,        (8)

for every ε > 0. Now fix ε > 0 and let h = h(x) be a continuously differentiable even function such that |h(x)| ≤ x², h'(x) sgn x ≥ 0, h(x) = x² for |x| > 2ε, h(x) = 0 for |x| ≤ ε, and |h'(x)| ≤ 4|x| for ε < |x| ≤ 2ε. Then, by (8),

    Σ_{k=1}^{n} ∫_{|x|>ε} h(x) d[ F_{nk}(x) + Φ_{nk}(x) ] → 0,   n → ∞.

By integration by parts we then find that, as n → ∞,

    Σ_{k=1}^{n} ∫_{x>ε} h'(x)[ (1 − F_{nk}(x)) + (1 − Φ_{nk}(x)) ] dx → 0,
    Σ_{k=1}^{n} ∫_{x<−ε} |h'(x)| [ F_{nk}(x) + Φ_{nk}(x) ] dx → 0.

Since h'(x) = 2x for |x| ≥ 2ε, and since |F_{nk}(x) − Φ_{nk}(x)| ≤ (1 − F_{nk}(x)) + (1 − Φ_{nk}(x)) for x > 0 while |F_{nk}(x) − Φ_{nk}(x)| ≤ F_{nk}(x) + Φ_{nk}(x) for x < 0, we therefore have

    Σ_{k=1}^{n} ∫_{|x|≥2ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0,   n → ∞.

Consequently, since ε > 0 is arbitrary, we obtain (L) ⇒ (A).

(2) For the function h = h(x) introduced above, we obtain by (7) and the condition max_{1≤k≤n} σ²_{nk} → 0 that

    Σ_{k=1}^{n} ∫_{|x|>ε} h(x) dΦ_{nk}(x) → 0,   n → ∞.        (9)

Again integrating by parts, we find that when (A) is satisfied,

    | Σ_{k=1}^{n} ∫_{|x|≥ε} h(x) d[F_{nk} − Φ_{nk}] |
        ≤ | Σ_{k=1}^{n} ∫_{x≥ε} h(x) d[ (1 − F_{nk}) − (1 − Φ_{nk}) ] |
            + | Σ_{k=1}^{n} ∫_{x≤−ε} h(x) d[F_{nk} − Φ_{nk}] |
        ≤ Σ_{k=1}^{n} ∫_{|x|≥ε} |h'(x)| |F_{nk}(x) − Φ_{nk}(x)| dx
        ≤ 4 Σ_{k=1}^{n} ∫_{|x|≥ε} |x| |F_{nk}(x) − Φ_{nk}(x)| dx → 0.        (10)

It follows from (9) and (10) that

    Σ_{k=1}^{n} ∫_{|x|≥2ε} x² dF_{nk}(x) ≤ Σ_{k=1}^{n} ∫_{|x|≥ε} h(x) dF_{nk}(x) → 0,   n → ∞;

that is, (L) is satisfied.

4. PROOF OF THEOREM 2. According to Theorem 3, condition (L) implies (A); hence Theorem 2 follows at once from Theorem 1.
§4. Central Limit Theorem for Sums of Independent Random Variables

331

5. We mention some corollaries in which ~ 1 , ~ 2 , .•• is a sequence of independent random variables with finite second moments. Let mk = E~k> af = v~k > o,sn = ~1 + ... + ~n• = Lk=1 af,andletFk = Fk(x)bethe distribution function of ~k.

v;

Corollary 1. Let the Lindeberg condition be satisfied: for every 8 > 0,

n --+ oo.

(11)

Then (12)

Corollary 2. Let the Lyapunov condition be satisfied: 1

~

v;+a k';;1E lsk- mki f:

2H

--+

0,

n-+ oo.

(13)

Then the central limit theorem (12) holds.

It is enough to prove that the Lyapunov condition implies the Lindeberg condition. Let 8 > 0; then

El~k- mki 2 H = f_

00

2: 2:

00

lx- mki 2 H dFk(x)

I

J{x:lx-mki~
8°V~ I

lx- mki 2 H dFk(x)

J{x:lx-mki~
(x - mk) 2 dFix)

and therefore

Corollary 3. Let

~ 1 , ~ 2 , ... be independent identically distributed random variables with m = E~ 1 and 0 < a 2 = V~ 1 < oo. Then

332

III. Convergence of Probability Measures. Central Limit Theorem

Therefore the Lindeberg condition is satisfied and consequently Theorem 3 of §3 follows from Theorem 2.

Corollary 4. Let n 2 1,

ee 1,

be independent random variables such that,for all

2 , .•.

lenl:::; K, where K is a constant and V,.-+ oo as n-+ oo. Then by Chebyshev's inequality

f

{x: lx-mkl ~eVn

lx- mkl 2 dFk(x)

= E[(ek-

mk)2 I!ek- mk! 2 sV,.)]

:::; (2K) 2 P{Iek- mkl 2 sV,.}:::; (2k) 2

(]2 2

S

Vk 2 -+ 0, n

n-+ oo.

Consequently the Lindeberg condition is again applicable, and the central limit theorem holds. 6. We remarked (without proof) in Theorem 1 that condition (A) is also necessary. The following (Lindeberg-Feller) theorem shows that, with the supplementary condition max 1 s ks n Ee;k -+ 0, the Lindeberg condition (L) is also necessary. Theorem 4. Let max 1 sksn Ee;k-+ 0, n -+ oo. Then (L) is necessary and suf .ficientfor the validity of the central limit theorem: Sn ~ %(0, 1). The proof is based on the following proposition, which estimates the "tails~· of the variance in terms of the behavior of the characteristic function at the origin (compare Lemma 3, §3, Chapter III).

Lemma. Let e be a random variable with distribution function F = F(x), Ee = 0, ve = a 2 < oo. Thenforeacha > 0

(

Jlxl~1/a

x 2 dF(x) :::; _.;. [Re J(jUa) - 1 + 3a 2a2 ]. a

PRooF. We have Re f(t) - 1

+ !a2 t 2

=

!a 2 t 2

-

=

!a 2 t 2

-

-

Taking t

=

(

Jlxl:51/a

(

Jlxl>1/a

2 !a 2 t 2 =

J:oo [1 -

(!t 2

-

-

(14)

cos tx] dF(x)

[1 - cos tx] dF(x)

[1 - cos tx] dF(x)

!t 2 2a 2 )

(

Jlxl:->1/a (

Jlxl>1/a

x 2 dF(x) - 2a 2 x 2 dF(x).

aJU, we obtain (14), as required.

(

Jlxl>1/a

x 2 dF(x)

333

§4. Central Limit Theorem for Sums of Independent Random Variables

PROOF OF THEOREM 4. The sufficiency was established in Theorem 2. We now prove the necessity. Let

0,

E~nk =

max a;k

--+

n--+ oo.

0,

(15)

1 o<;ko<;n

Since

n fnk(t)--+ n

(16)

e-(1/2)t2,

k=1

we can find, for a given t, a number no = no(t) such that ~ n0 (t) and consequently

n

nfnk(t) n

In

nk=

1

J,k(t) > 0 for

n

L In J,k(t),

=

k= 1

k= 1

where the logarithms are well defined. Then since

Ifnk( t) - 11 S a;k t 2 , we have, by (15),

Ikt1{In [1 + Unk(t) -

1)] - Unk(t) - 1]}

I

n

S Llfnit)-11 2 k=1

t4

s - · 4

max

n

L u~k k=1

u;k ·

1o<;kos;n

t4

2 0, -- -4 max ank--+

n--+ oo.

1o<;ko<;n

Consequently, by using (16), we have n

Re

L Unk(t) -

k=1

n

1]

+ !t 2 = L [Re fnk(t) k=1

- 1

+ !t 2 a;k] --+ 0.

In particular, if we take t = a~ we find that n

I

k= 1

[Re fnk(a~) - 1

+ 3a 2a;k]

and therefore, by (14), and for every

1::

--+

0,

n--+ oo,

= 1/a > 0,

kt1~xi;,;ex 2 dFnk(x) s B2kt1[Re f,k(a~) -

2

1 + 3a 2a ]--+

which shows that the Lindeberg condition is satisfied.

00,

n--+ oo,

334

III. Convergence of Probability Measures. Central Limit Theorem

7. The method that we used for Theorem 1 can be used to obtain a corresponding condition for the central limit theorem without assuming that the second moments are finite. For each n ~ 1 let

be independent random variables with

E~nk =

0,

Let g = g(x) be a bounded nonnegative even function with the following properties: g(x) = x 2 for lxl ~ 1, minlxl<:l g(x) > 0, lg'(x)l ~canst. Define f:.nk(g) by the equation

J oo g(x) dFnk(x) = Joo g(x) d
-

00

00

Theorem 5. Let

and for each

£

> 0 let

Then

The proof is left to the reader (Problem 4); it can be carried out along the same lines as the proof of Theorem 1.

8. PROBLEMS

1. Let ξ₁, ξ₂, ... be a sequence of independent normally distributed random variables with Eξ_k = 0, k ≥ 1, and Vξ₁ = 1, Vξ_k = 2^{k−2}, k ≥ 2. Show that {ξ_{nk}} does not satisfy the Lindeberg condition and also is not asymptotically infinitesimal.

2. Prove (4).

3. Let ξ₁, ξ₂, ... be a sequence of independent identically distributed random variables with Eξ₁ = 0, Eξ₁² = 1. Show that ...

4. Prove Theorem 5. (Hint: use the method of the proof of Theorem 1, applying integration by parts twice in the integral ∫ (e^{itx} − itx + ½t²x²) d(F_{nk} − Φ_{nk}).)
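Problem 1 can be made concrete with a few lines of arithmetic: for the array ξ_{nk} = ξ_k/(Σ_{i=1}^{n} Vξ_i)^{1/2} with Vξ₁ = 1 and Vξ_k = 2^{k−2}, k ≥ 2, the normalized variances σ²_{nk} sum to 1 while the largest of them stays at 1/2 for every n ≥ 2, so max_k σ²_{nk} ↛ 0 and (the ξ_k being Gaussian) the summands are not asymptotically infinitesimal. A quick check, as our own sketch:

```python
# Variances of the example in Problem 1: V xi_1 = 1, V xi_k = 2^(k-2), k >= 2.
def variances(n):
    return [1.0] + [2.0 ** (k - 2) for k in range(2, n + 1)]

def normalized_variances(n):
    # sigma_{nk}^2 = V xi_k / sum_i V xi_i; the total equals 2^(n-1)
    v = variances(n)
    total = sum(v)
    return [vk / total for vk in v]

for n in (5, 10, 30):
    s = normalized_variances(n)
    assert abs(sum(s) - 1.0) < 1e-12      # row variances sum to 1
    assert abs(max(s) - 0.5) < 1e-12      # the last summand keeps half the variance
print("max_k sigma_{nk}^2 = 1/2 for every n >= 2: no asymptotic negligibility")
```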


§5. Infinitely Divisible and Stable Distributions

1. In stating Poisson's theorem in §3 we found it necessary to use a triangular array, supposing that for each n ≥ 1 there was a sequence of independent random variables {ξ_{n,k}}, 1 ≤ k ≤ n. Put

    T_n = ξ_{n,1} + ⋯ + ξ_{n,n},   n ≥ 1.        (1)

The idea of an infinitely divisible distribution arises in the following problem: how can we determine all the distributions that can be expressed as limits of sequences of distributions of random variables T_n, n ≥ 1?

Generally speaking, the problem of limit distributions is indeterminate in such great generality. Indeed, if ξ is a random variable and ξ_{n,1} = ξ, ξ_{n,k} = 0, 1 < k ≤ n, then T_n = ξ, and consequently the limit distribution is the distribution of ξ, which can be arbitrary. In order to have a more meaningful problem, we shall suppose in the present section that the variables ξ_{n,1}, ..., ξ_{n,n} are, for each n ≥ 1, not only independent, but also identically distributed.

Recall that this was the situation in Poisson's theorem (Theorem 4 of §3). The same framework also includes the central limit theorem (Theorem 3 of §3) for sums S_n = ξ₁ + ⋯ + ξ_n, n ≥ 1, of independent identically distributed random variables ξ₁, ξ₂, .... In fact, if we put

    ξ_{n,k} = (ξ_k − Eξ_k)/√(VS_n),

then

    T_n = Σ_{k=1}^{n} ξ_{n,k} = (S_n − ES_n)/√(VS_n).

Consequently both the normal and the Poisson distributions can be presented as limits in a triangular array. If T_n → T, it is intuitively clear that since T_n is a sum of independent identically distributed random variables, the limit variable T must also be a sum of independent identically distributed random variables. With this in mind, we introduce the following definition.

Definition 1. A random variable T, its distribution F_T, and its characteristic function φ_T are said to be infinitely divisible if, for each n ≥ 1, there are independent identically distributed random variables η₁, ..., η_n such that† T ᵈ= η₁ + ⋯ + η_n (or, equivalently, F_T = F_{η₁} * ⋯ * F_{η_n}, or φ_T = (φ_{η₁})ⁿ).

Theorem 1. A random variable T can be a limit of sums T_n = Σ_{k=1}^{n} ξ_{n,k} if and only if T is infinitely divisible.

† The notation ξ ᵈ= η means that the random variables ξ and η agree in distribution, i.e. F_ξ(x) = F_η(x), x ∈ R.

336

Ill. Convergence of Probability Measures. Central Limit Theorem

PROOF. If T is infinitely divisible, for each n 2:: 1 there are independent identically distributed random variables ~n.t• ..• , ~n.k such that T 4. ~n. 1 + · · · + ~n.k• and this means that T 4. T,, n 2:: 1. Conversely, let T, ~ T. Let us show that T is infinitely divisible, i.e. for each k there are independent identically distributed random variables 1/t, ••• , 1'/k such that T 4. 17 1 + · · · + 1'/k· Choose a k 2:: 1 and represent T,k in the form C~l) + · · · + C~k>, where Y(l) _

'>n

-

J! ':.nk,1

+ ... + '>nk,n•·"•'>n J! y(k)

_

-

J! '>nk,n(k-1)+1

+ ... + '>nk,nk· ;;

Since T_{nk} →d T, n → ∞, the sequence of distribution functions corresponding to the random variables T_{nk}, n ≥ 1, is relatively compact and therefore, by Prohorov's theorem, is tight. Moreover,

[P(ζ_n^(1) > z)]^k = P(ζ_n^(1) > z, ..., ζ_n^(k) > z) ≤ P(T_{nk} > kz)

and

[P(ζ_n^(1) < −z)]^k = P(ζ_n^(1) < −z, ..., ζ_n^(k) < −z) ≤ P(T_{nk} < −kz).

The family of distributions of ζ_n^(1), n ≥ 1, is tight because of the preceding two inequalities and because the family of distributions of T_{nk}, n ≥ 1, is tight. Therefore there are a subsequence {n_i} ⊆ {n} and a random variable η_1 such that ζ_{n_i}^(1) →d η_1 as n_i → ∞. Since the variables ζ_n^(1), ..., ζ_n^(k) are identically distributed, we have ζ_{n_i}^(2) →d η_2, ..., ζ_{n_i}^(k) →d η_k, where η_1 =d η_2 =d ··· =d η_k. Since ζ_n^(1), ..., ζ_n^(k) are independent, it follows from the corollary to Theorem 1 of §3 that η_1, ..., η_k are independent and

T_{n_i k} = ζ_{n_i}^(1) + ··· + ζ_{n_i}^(k) →d η_1 + ··· + η_k.

But T_{n_i k} →d T; therefore (Problem 1) T =d η_1 + ··· + η_k. This completes the proof of the theorem.
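Definition 1 can be illustrated numerically (this sketch is not part of the original text): the Poisson distribution with parameter λ is, for every n, the n-fold convolution of Poisson(λ/n) laws. The parameter values below are arbitrary choices.

```python
import math

def poisson_pmf(lam, kmax):
    # P(X = k) = e^{-lam} lam^k / k!  for k = 0..kmax
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)]

def convolve(p, q):
    # law of the sum of two independent nonnegative integer-valued variables
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

lam, n, kmax = 3.0, 6, 20
summand = poisson_pmf(lam / n, kmax)      # law of one eta_i
total = summand
for _ in range(n - 1):                    # n-fold convolution
    total = convolve(total, summand)

target = poisson_pmf(lam, kmax)
err = max(abs(total[k] - target[k]) for k in range(kmax + 1))
print(err)  # should be at the level of floating-point rounding
```

For k ≤ kmax the truncation is exact, so the two probability vectors agree up to rounding error, confirming T =d η_1 + ··· + η_n for the Poisson case.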

Remark. The conclusion of the theorem remains valid if we replace the hypothesis that ξ_{n,1}, ..., ξ_{n,n} are identically distributed for each n ≥ 1 by the hypothesis that they are uniformly asymptotically infinitesimal (§4.2).

2. To test whether a given random variable T is infinitely divisible, it is simplest to begin with its characteristic function φ(t). If we can find characteristic functions φ_n(t) such that φ(t) = [φ_n(t)]^n for every n ≥ 1, then T is infinitely divisible. In the Gaussian case,

φ(t) = exp{itβ − (t²σ²)/2},

and if we put

φ_n(t) = exp{itβ/n − (t²σ²)/(2n)},

we see at once that φ(t) = [φ_n(t)]^n.


§5. Infinitely Divisible and Stable Distributions

In the Poisson case,

φ(t) = exp{λ(e^{it} − 1)},   φ_n(t) = exp{(λ/n)(e^{it} − 1)},

and again φ(t) = [φ_n(t)]^n. If T has a gamma-distribution with density

f(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α) for x > 0, and f(x) = 0 for x < 0,

it is easy to show that its characteristic function is

φ(t) = 1/(1 − iβt)^α = [1/(1 − iβt)^{α/n}]^n,

and therefore T is infinitely divisible. We quote without proof the following result on the general form of the characteristic functions of infinitely divisible distributions.

Theorem 2 (Lévy–Khinchin Theorem). A random variable T is infinitely divisible if and only if its characteristic function has the form φ(t) = exp ψ(t), where

ψ(t) = itβ − (t²σ²)/2 + ∫_{−∞}^{∞} (e^{itx} − 1 − itx/(1 + x²)) · ((1 + x²)/x²) dλ(x),   (2)

β ∈ R, σ² ≥ 0, and λ is a finite measure on (R, B(R)) with λ{0} = 0.
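The gamma factorization used above can also be checked numerically; a minimal sketch (the parameter values are arbitrary choices, not from the text), using the fact that the principal branch is well defined since Re(1 − iβt) = 1 > 0:

```python
def gamma_cf(t, alpha, beta):
    # characteristic function of the gamma distribution with density
    # x^{alpha-1} e^{-x/beta} / (Gamma(alpha) beta^alpha), x > 0
    return (1 - 1j * beta * t) ** (-alpha)

alpha, beta, n = 2.7, 1.3, 8
for t in (-3.0, -0.5, 0.0, 1.0, 4.2):
    lhs = gamma_cf(t, alpha, beta)
    rhs = gamma_cf(t, alpha / n, beta) ** n   # n-th power of the "n-th root" cf
    assert abs(lhs - rhs) < 1e-12, (t, lhs, rhs)
print("phi(t) = [phi_n(t)]^n verified at the sample points")
```

So the gamma law passes the divisibility test φ = (φ_n)^n for every n, as claimed.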

3. Let ξ_1, ξ_2, ... be a sequence of independent identically distributed random variables and S_n = ξ_1 + ··· + ξ_n. Suppose that there are constants b_n and a_n > 0, and a random variable T, such that

(S_n − b_n)/a_n →d T.   (3)

We ask for a description of the distributions (of the random variables T) that can be obtained as limit distributions in (3).

If the independent identically distributed random variables ξ_1, ξ_2, ... satisfy 0 < σ² = Vξ_1 < ∞, then if we put b_n = nEξ_1 and a_n = σ√n, we find by §4 that T has the normal distribution N(0, 1). If f(x) = θ/(π(x² + θ²)) is the Cauchy density (with parameter θ > 0) and ξ_1, ξ_2, ... are independent random variables with density f(x), then the characteristic function of ξ_1 is e^{−θ|t|}, so that the characteristic function of S_n/n is [e^{−θ|t|/n}]^n = e^{−θ|t|}; thus with a_n = n and b_n = 0 the limit T in (3) again has a Cauchy distribution.

Consequently there are other limit distributions besides the normal: the Cauchy distribution, for example. If we put ξ_{n,k} = (ξ_k/a_n) − (b_n/(n a_n)), 1 ≤ k ≤ n, we find that

Σ_{k=1}^{n} ξ_{n,k} = (S_n − b_n)/a_n (= T_n).

Therefore all the distributions that can appear as limits in (3) are necessarily (in agreement with Theorem 1) infinitely divisible. However, the specific structure of the variables T_n = (S_n − b_n)/a_n may make it possible to obtain further information on the limit distributions that arise. For this reason we introduce the following definition.

Definition 2. A random variable T, its distribution function F(x), and its characteristic function φ(t) are stable if, for every n ≥ 1, there are constants a_n > 0, b_n, and independent random variables ξ_1, ..., ξ_n, each distributed like T, such that

a_n T + b_n =d ξ_1 + ··· + ξ_n   (4)

or, equivalently, F((x − b_n)/a_n) = (F * ··· * F)(x) (n-fold convolution), or

φ(a_n t) e^{itb_n} = [φ(t)]^n.   (5)
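The Cauchy case of Definition 2 (with a_n = n, b_n = 0) can be seen in a Monte-Carlo sketch, not taken from the text: the mean of n standard Cauchy variables is again standard Cauchy, so its quartiles stay at ±1 instead of concentrating. Sample sizes and tolerances here are ad hoc choices.

```python
import math
import random

random.seed(1)

def std_cauchy():
    # inverse-transform sampling: tan(pi (U - 1/2)) is standard Cauchy
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 50, 20000
means = sorted(sum(std_cauchy() for _ in range(n)) / n for _ in range(trials))

# For a standard Cauchy the quartiles are -1 and +1; the sample mean of n
# Cauchy variables reproduces them (there is no law of large numbers here).
q1 = means[trials // 4]
q3 = means[3 * trials // 4]
print(round(q1, 2), round(q3, 2))
```

The empirical quartiles of the mean stay near −1 and +1 for every n, as stability requires.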

Theorem 3. A necessary and sufficient condition for the random variable T to be a limit in distribution of random variables (S_n − b_n)/a_n, a_n > 0, is that T is stable.

PROOF. If T is stable, then by (4)

T =d (S_n − b_n)/a_n,

where S_n = ξ_1 + ··· + ξ_n, and consequently (S_n − b_n)/a_n →d T.

Conversely, let ξ_1, ξ_2, ... be a sequence of independent identically distributed random variables, S_n = ξ_1 + ··· + ξ_n and (S_n − b_n)/a_n →d T, a_n > 0. Let us show that T is a stable random variable. If T is degenerate, it is evidently stable. Let us suppose that T is nondegenerate. Choose k ≥ 1 and write

S_n^(1) = ξ_1 + ··· + ξ_n, ..., S_n^(k) = ξ_{(k−1)n+1} + ··· + ξ_{kn};
T_n^(1) = (S_n^(1) − b_n)/a_n, ..., T_n^(k) = (S_n^(k) − b_n)/a_n.

It is clear that all the variables T_n^(1), ..., T_n^(k) have the same distribution and

T_n^(i) →d T, n → ∞, i = 1, ..., k.


Write U_n^(k) = T_n^(1) + ··· + T_n^(k). Then

U_n^(k) →d T^(1) + ··· + T^(k),

where T^(1), ..., T^(k) are independent and T^(1) =d ··· =d T^(k) =d T. On the other hand,

U_n^(k) = (ξ_1 + ··· + ξ_{kn} − k b_n)/a_n = α_n^(k) V_{kn} + β_n^(k),   (6)

where

V_{kn} = (S_{kn} − b_{kn})/a_{kn}

and

α_n^(k) = a_{kn}/a_n, β_n^(k) = (b_{kn} − k b_n)/a_n.

It is clear from (6) that

V_{kn} = (U_n^(k) − β_n^(k))/α_n^(k),

where V_{kn} →d T, U_n^(k) →d T^(1) + ··· + T^(k), n → ∞. It follows from the lemma established below that there are constants α > 0 and β such that α_n^(k) → α and β_n^(k) → β as n → ∞. Therefore

T =d (T^(1) + ··· + T^(k) − β)/α,

which shows that T is a stable random variable. This completes the proof of the theorem. We now state and prove the lemma that we used above.

Lemma. Let ξ_n →d ξ and let there be constants a_n > 0 and b_n such that

a_n ξ_n + b_n →d ξ̃,

where the random variables ξ and ξ̃ are not degenerate. Then there are constants a > 0 and b such that lim a_n = a, lim b_n = b, and ξ̃ =d aξ + b.


PROOF. Let φ_n, φ and φ̃ be the characteristic functions of ξ_n, ξ and ξ̃, respectively. Then the characteristic function of a_n ξ_n + b_n is equal to e^{itb_n} φ_n(a_n t) and, by Theorem 1 and Problem 3 of §3,

e^{itb_n} φ_n(a_n t) → φ̃(t),   (7)

φ_n(t) → φ(t),   (8)

uniformly in t on every finite interval.

Let {n_i} be a subsequence of {n} such that a_{n_i} → a. Let us first show that a < ∞. Suppose that a = ∞. By (7),

sup_{|t|≤c} | |φ_n(a_n t)| − |φ̃(t)| | → 0, n → ∞,

for every c > 0. We replace t by t_0/a_{n_i}. Then, since a_{n_i} → ∞, we have

| |φ_{n_i}(t_0)| − |φ̃(t_0/a_{n_i})| | → 0,

and therefore

|φ_{n_i}(t_0)| → |φ̃(0)| = 1.

But |φ_{n_i}(t_0)| → |φ(t_0)|. Therefore |φ(t_0)| = 1 for every t_0 ∈ R, and consequently, by Theorem 5, §12, Chapter II, the random variable ξ must be degenerate, which contradicts the hypotheses of the lemma. Thus a < ∞.

Now suppose that there are two subsequences {n_i} and {n_i′} such that a_{n_i} → a, a_{n_i′} → a′, where a ≠ a′; suppose for definiteness that 0 ≤ a′ < a. Then by (7) and (8),

|φ_{n_i}(a_{n_i} t)| → |φ(at)|, |φ_{n_i}(a_{n_i} t)| → |φ̃(t)|,

and

|φ_{n_i′}(a_{n_i′} t)| → |φ(a′t)|, |φ_{n_i′}(a_{n_i′} t)| → |φ̃(t)|.

Consequently

|φ(at)| = |φ(a′t)|,

and therefore, for all t ∈ R,

|φ(t)| = |φ((a′/a)t)| = ··· = |φ((a′/a)^n t)| → |φ(0)| = 1, n → ∞.

Therefore |φ(t)| = 1 and, by Theorem 5 of §12, Chapter II, it follows that ξ is a degenerate random variable. This contradiction shows that a = a′ and therefore that there is a finite limit lim a_n = a, with a ≥ 0.

Let us now show that there is a limit lim b_n = b, and that a > 0. Since (8) is satisfied uniformly on each finite interval, we have

φ_n(a_n t) → φ(at),


and therefore, by (7), the limit lim_{n→∞} e^{itb_n} exists for all t such that φ(at) ≠ 0. Let δ > 0 be such that φ(at) ≠ 0 for all |t| < δ. For such t, lim e^{itb_n} exists. Hence we can deduce (Problem 9) that lim |b_n| < ∞. Let there be two sequences {n_i} and {n_i′} such that lim b_{n_i} = b and lim b_{n_i′} = b′. Then e^{itb} = e^{itb′} for |t| < δ, and consequently b = b′. Thus there is a finite limit b = lim b_n and, by (7),

φ̃(t) = e^{itb} φ(at),

which means that ξ̃ =d aξ + b. Since ξ̃ is not degenerate, we have a > 0. This completes the proof of the lemma.

4. We quote without proof a theorem on the general form of the characteristic functions of stable distributions.

Theorem 4 (Lévy–Khinchin Representation). A random variable T is stable if and only if its characteristic function φ(t) has the form φ(t) = exp ψ(t),

ψ(t) = itβ − d|t|^α (1 + iθ (t/|t|) G(t, α)),   (11)

where 0 < α ≤ 2, |θ| ≤ 1, β ∈ R, d ≥ 0; here G(t, α) = tan(πα/2) if α ≠ 1, and G(t, α) = −(2/π) ln|t| if α = 1.

5. PROBLEMS

1. Show that ξ =d η if ξ_n →d ξ and ξ_n →d η.

2. Show that if φ_1 and φ_2 are infinitely divisible characteristic functions, so is φ_1 · φ_2.

3. Let φ_n be infinitely divisible characteristic functions and let φ_n(t) → φ(t) for every t ∈ R, where φ(t) is a characteristic function. Show that φ(t) is infinitely divisible.

4. Show that the characteristic function of an infinitely divisible distribution cannot take the value 0.

5. Give an example of a random variable that is infinitely divisible but not stable.

6. Show that a stable random variable ξ always satisfies the inequality E|ξ|^r < ∞ for all r ∈ (0, α).


7. Show that if ξ is a stable random variable with parameter 0 < α ≤ 1, then φ(t) is not differentiable at t = 0.

8. Prove that e^{−d|t|^α} is a characteristic function provided that d ≥ 0, 0 < α ≤ 2.

9. Let (b_n)_{n≥1} be a sequence of numbers such that lim_n e^{itb_n} exists for all |t| < δ, δ > 0. Show that lim |b_n| < ∞.

§6. Rapidity of Convergence in the Central Limit Theorem

1. Let ξ_{n1}, ..., ξ_{nn} be a sequence of independent random variables, S_n = ξ_{n1} + ··· + ξ_{nn}, F_n(x) = P(S_n ≤ x). If S_n →d N(0, 1), then F_n(x) → Φ(x) for every x ∈ R. Since Φ(x) is continuous, the convergence here is actually uniform (Problem 5 in §1):

sup_x |F_n(x) − Φ(x)| → 0, n → ∞.   (1)

In particular, it follows that

P(S_n ≤ x) − Φ((x − ES_n)/√(VS_n)) → 0, n → ∞

(under the assumption that ES_n and VS_n exist and are finite).

It is natural to ask how rapid the convergence in (1) is. We shall establish a result for the case when

S_n = (ξ_1 + ··· + ξ_n)/(σ√n), n ≥ 1,

where ξ_1, ξ_2, ... is a sequence of independent identically distributed random variables with Eξ_k = 0, Vξ_k = σ² and E|ξ_1|³ < ∞.

Theorem (Berry and Esseen). We have the bound

sup_x |F_n(x) − Φ(x)| ≤ C E|ξ_1|³/(σ³√n),   (2)

where C is an absolute constant ((2π)^{−1/2} ≤ C < 0.8).

PROOF. For simplicity, let σ² = 1 and β_3 = E|ξ_1|³. By Esseen's inequality (Subsection 10, §12, Chapter II)

sup_x |F_n(x) − Φ(x)| ≤ (2/π) ∫_0^T |(f_n(t) − φ(t))/t| dt + (24/(πT)) · (1/√(2π)),   (3)


where qJ(t) = e- 1212 and

fit) = [f(tjjn)]", with f(t) = Ee; 1 ~'. In (3) we may take T arbitrarily. Let us choose

jn/(5/33).

T =

We are going to show that for this T,

I fit)- ({J(t)l::; ~

}n ltl

3 e- 1214 ,

ltl ::; T.

(4)

The required estimate (2), with C an absolute constant, will follow immediately from (3) by means of (4). (A more detailed analysis shows that c < 0.8.) We now turn to the proof of (4). By Taylor's formula with integral remainder, t f(t) = 1 + 1 ! f'(O)

t3 Jl (1 -

t2

+ 2 ! f"(O) + 2

0

v) 2f"'(vt) dv.

(5)

Hence

where 181 ::; 1. If It I ::; Jn/(5/3 3 ), we have, since t2 2n

/3 3 z

a 3 = 1.

lt 3 1/33

+

6n 312

:$

1 25 ·

Consequently

f(t/Jn) for

It I ::;

T

z ~~

= Jn/(5/3 3 ), and hence we may write [f(t/jn)]" =

(6)

enlnnf(t/,fti).

By (5) (with f(t) replaced by In f(tjjll)) we find that t2

In f(t/Jn) = - 2n here

8t 3

+ 6n 312

18 1 1::; 1 and IOn f)/Ill = 1[f'" · ! 2

-

3f" ·

r ·J + 2U'n · f-

::; (/33 + 3PzPt + 2PD<~~)- 3 where

Pk =

E I~ 1 1\ k = 1, 2, 3.

(7)

(In f) 111(8 1 t/jn);

::;

7/3 3 ,

3

1 (8)

344

III. Convergence of Probability Measures. Central Limit Theorem

Using the inequality le•- 11 :s; lzlel•l, we can now show, for I tl :s; T = Jn/(5P 3 ), that

l[f(t/Jn)J" _ e-t2 /21 = le"In/(t/Jiil _

<

e-t2f2l

2P31tl3 exp{- t2 + 21tl3 ~}

-6

Jn

2

6

Jn

< 2P31 t 13e-t2f4.

-6

Jn

This completes the proof of the theorem.

Remark. We observe that unless we make some supplementary hypothesis about the behavior of the random variables that are added, (2) cannot be improved. In fact, let ~ 1 , ~ 2 , ••• be independent identically distributed random variables with

It is evident by symmetry that

and hence by Stirling's formula

= l.Cn . 2-2n""' _1_ = 2

2fo

2"

1

j(2n) · (2n).

It follows, in particular, that the constant C in (2) cannot be less than (2n) - 1 12 •

2.

PROBLEMS

1. Prove (8).

2.

be independent identically distributed random variables with E~k = 0, Ele 1 l3 < oo. It is known [53] that the following nonuniform inequality holds: for all x E R,

Let~~> ~ 2 , •••

vek

= u 2and

IFn(x)- ~x)l :::;;

CEI~ 1 1 3

1

u\fo . (1 + lxl)3.

Prove this, at least for Bernoulli random variables.

345

§7. Rapidity of Convergence in Poisson's Theorem

§7. Rapidity of Convergence in Poisson's Theorem 1. Let 1] 1, 1] 2 , .•• , 1Jn be independent Bernoulli random variables, taking the values 1 and 0 with probabilities

P(1Jk = 1) = Pk>

P(1Jk

= 0) =

1~ k

1 - Pk>

~

n.

Write sn

= 111 + ... + 1Jn,

Pk = P(S = k),

= 0, 1, ... ; A. > 0.

k

In §6 of Chapter I we observe that when p 1 = · · · = P. = p and A. = np we have Prohorov's inequality,

L IPk- nkl ~ C1(A.)p, 00

k=O

where C 1 (A.) = 2 min(2, A.). When the Pk are not necessarily equal, but that

Lk=

1

Pk

= A., LeCam showed

L IPk- nkl ~ Cz(A.) max Pk> 00

k=O

1:5k:5n

where Cz(A.) = 2 min(9, A.). The object of the present section is to prove the following theorem, which provides an estimate of the approximation of the P k by the nk, specifically not including the assumption that 1 Pk = A.. Although the proof does not produce very good constants (see C(A.) below), it may be of interest because of its simplicity.

Li:=

Theorem. (1) We have the following inequality: (1)

where min is taken over all permutations i = (i 1, i2 , ••• , in) of(l, 2, ... , n), P;0 = 0, and [ns] is the integral part of ns. (2) IfLi:= 1 Pk = A., then

k~0 1Pk -nkl ~ C(A.)~in 0~~~~~k~/;k- A.sl ~

C(A.) max pk,

where

C(A.) = (2

+ 4A.)eu.

(2)

346

III. Convergence of Probability Measures. Central Limit Theorem

2. The key to the proof is the following lemma.

Lemma 1. Let S(t)

= Ll"2o 1'fk, where 1'fo = 0, 0 ~ t

Pk(t) = P(S(t) = k),

Then for every tin 0

~

t

~

(A.t)ke-.
nk(t) =

k!

~ 1,

k = 0, 1, ....

,

1,

k~o IPk(t)- nk(t)l ~ eur(2 + 4~~!k) ~~~srI ~~k- A.s



(3)

PRooF. Let us introduce the variables Xk(t) = J(S(t) = k),

where I(A) is the indicator of A. For each sample point, the function S(t), 0 ~ t ~ 1, is a generalized distribution function, for which, consequently, the Lebesgue-Stieltjes integral

J~xk(s-) dS(s) is defined; it is in fact just the sum

(j _ 1)

[nt]

LXk - - 'lj·

n

i= 1

A simple argument shows that Xk(t), k ;;::: 0, satisfy X 0 (t)

=

1-

Xk(t) = -

{x

i

dS(s),

f~[Xk(s-)- Xk_ 1(s-)] dS(s),

where X 0 (0) = 1 and Xk(O) Now EXk(t) = Pk(t) and

E

0 (s-)

k

~

1,

= 0 fork~ 1. [nt]

t

Xk(s-) dS(s) = E Lxk o i= 1

= =

[nt]

1) n (j _1)E17j = ~ 17;

('

LEXk - n

i= 1

{Pk(s-) dA(s),

where [nt]

A(t) =

L Pk

k=O

( =ES(t)).

[nt]

(j _ 1)

Lpk --Pi

i= 1

n

(4)

347

§7. Rapidity of Convergence in Poisson's Theorem

Hence if we average the left-hand and right-hand sides of(4) we find that P 0 (t)

= 1 - J:P 0 (s-) dA(s), (5)

k

Pk(t) = - s:[Pk(s-)- Pk_ 1(s-)] dA(s),

In turn, it is easily verified that nk(t), k system n 0 (t)

~

1.

~

0, 0 :::;:; t :::;:; 1, satisfy the similar

= 1 - J:n 0 (s-) d(A.s),

nk(t) = -

1

k

[nis-) - nk-l (s- )] d(A.s),

~ 1.

Therefore n 0 (t)- P 0 (t)

= - {[n0 (s-)- P 0 (s-)] d(A.s)

+ {P0 (s-) d(A(s)-

(6)

A.s)

and nk(t)- Pk(t)

= - {[nk(s-)-

Pk(s- )] d(A.s)

+ {[nk- 1(s-)- Pk_ 1(s-)] d(A.s) + {[Pk(s-)-

Pk_ 1 (s- )] d[A(s)- A.s].

By the formula for integration by parts (namely "dUV = U dV see Theorem 11, §6, Chapter II) and by (5),

J~P 0 (s-) d(A(s) -

A.s) = (A(t)- A.t)P 0 (t)

+ J~(A(s)-

(7)

+ VdU";

A.s)P 0 (s-) dA(s),

(8) { [Pk(s-) - Pk-t (s-)] d(A(s) - A.s)

=

(A(t) - A.t)(Pk(t) - Pk- 1(t))

+ {[Pk(s-)-

2Pk_ 1(s-)

+ Pk_ 2 (s-)](A(s)-

where it is convenient to suppose that P _ 1 (s)

=0.

A.s)dA(s), (9)

348

III. Convergence of Probability Measures. Central Limit Theorem

From (6)-(9) we obtain

L

k~0 lnk(t)- Pk(t)l ~ 2 k~o lnk(s-)- Pk(s- )ld(A.s) + 21A(t)- A.tl + 4A(t)max IA(s)- A.sl O:;;;s:;;;t

L

~ 2 k~o lnk(s-)- Pk(s- )ld(A.s) +(2 + 4A(t)) max IA(s)- A.sl. O:;;;s:;;;t Hence, by Lemma 2, which will be proved in Subsection 3, max IA(s)- ..lsi, L IPk(t)- nk(t)l ~ e M(2 + 4A(t)) o:;;;s:;;;t k=O where we recall that A(t) = 21"2 Pk· 00

2

(10)

0

This completes the proof of Lemma 1.

3. PROOF OF THE THEOREM. Inequality (1) follows immediately from Lemma 1 if we observe that Pk = Pk(l), nk = nk(1), and that the probability Pk = P{1] 1 + · · · + 1Jn = k} is the same as P{1]; 1 + · · · + 1];" = k}, where (i 1 , i2 , ••• , in) is an arbitrary permutation of(1, 2, ... , n). Moreover, the first inequality in (2) and the estimate (3) follow from ( 1). We have only to show that

I L Pik O:;;;s:;;; I k=O

min sup i

[ns]

I

(11)

A.s ~ max Pk> 1 :;;;k:;;;n

where we may evidently suppose that A. = 1. We write F;(s) = Lk~o P;k, G(s) = s, 0 ~ s ~ 1. These are distribution functions: Fi(s) is a discrete distribution function with masses Pi 1 , Pi 2 , • • • , Pi" at the points 1/n, 2/n, ... , 1; and G(s) is a uniform distribution on [0, 1]. Our object is to find a permutation i* = (i!, ... , i:) such that sup IFi(s)- G(s)l ~ max Pk· Since sup IF;"(s)- G(s)l = max

!:;;;k:;;;n

O:;;;s:;;;l

I (~ Fi

n

-)

-~I· n

it is sufficient to study the deviations F;•(s-) - G(s) only at the points s = kjn, k = 1, ... , n. We observe that if all the Pk are equal (p 1 = · · · = Pn = 1/n), then

1 n

sup IF;(s)- G(s)l = - = max Pk· O:;;;s:;;;l

!:;;;k:;;;n

349

§7. Rapidity of Convergence in Poisson's Theorem

We shall accordingly suppose that at least one of p 1, ... , Pn is not equal to 1/n. With this assumption, we separate the set of numbers P1> . .. , Pn into the union of two (nonempty) sets A = {pi: Pi> 1/n}

B = {pi:

and

Pi~

To simplify the notation, we write F*(s) = Fi.(s), Pit It is clear that F*(1) = 1, F*(1 - (1/n)) = 1 - p:,

F*(1-

~) =

1-

(p: + P:- 1), ••• ,F*(~)

1-

=

1/n}.

= Pt.

(p~ + ··· + P2*).

Hence we see that the distribution F*(s), 0 ~ s ~ 1, can be generated by successively choosing p~, then p~_ 1, etc. The following picture indicates the inductive construction of p:, P!-1• · · ·, pf:

I

(0, 1 ) - -

I

-~--- ~--

I

I

I I i-i-- -:--- ](1, I

I

1

I

1

I

1

I

- - -: - - - J - - - L - - ~- - I

I

I

I

I

I 1 1 - - _j - - - ...!- - - t - - : I I

I

--,--I

n

I

~----~~

I

I/

1

/I

---r7{_ -~-- _

*

p,

v

/1 / I

2 n

/I

~:-~ I

I

}fL/~--~ I

I

I

/1

I

--~-~~~---~---~

I

n

P:

--~J

I

-- -~---~----~- --

I)

n

I I

I

I

I I

I

I

I

I I

_j_

I

I

I I ---1- __ ,

I

I

I

2 n

1--

I

1-n

On the right-hand side of the square [0, 1] x [0, 1] we start from (1, 1) and descend an amount p~, where p~ is the largest number (or one of the largest numbers) in A, and from the point (1, 1 - p~) we draw a (dotted) line parallel to the diagonal of the square that extends from (0, 0) to {1, 1).

350

III. Convergence of Probability Measures. Central Limit Theorem

Now we draw a horizontal line segment of length 1/n from the point (1, 1 - p:>. Its left-hand end will be at the point (1 - (1/n), 1 - p:), from which we descend an amount 1 is the largest number (or one 1 , where of the largest numbers) in B. Since 1 :::;; 1/n, it is easily seen that this line segment does not intersect the dotted line and therefore G(1 - (1/n)) F*((1 - (1/n))-) :::;; From the point (1 - (1/n), 1 1 ), we again draw a horizontal either the left-hand end possibilities: two are There line segment oflength 1/n. the diagonal or on it; below falls ) 1 of this interval (1 - (2/n), 1 diagonal. In the first the above is ) 1 or the point (1 - (2/n), 1 length of segment 2 , where case, we descend from this point by a line {p: _ 1 }. B\ set the in numbers) largest the of p:_ 2 is the largest number (or one is clear it Again, 2)/n.) 2 > (n(This set is not empty, since Pi+···+ descend we case second the In :::;; that G(1 - (2/n)) - F*((1 - (2/n)-) 2 is the largest number (or one of by a line segment oflength 2 , where the largest numbers) in the set A\{p:}. (This set is not empty since Pi+···+ P:- 2 > (n- 2)/2).) Since P:- 2 :::;; p:, it is clear that in this case

p:_

p:.

p:_

p:_

p: - p:_ p: - p:_ p: - p:_

P:-

p:_ p:.

P:-

p:_

Continuing in this way, we construct the sequence P:- 3 , .•• , Pi· It is clear from the construction that

for 1 :::;; k:::;; n. Since min sup i

05s51

II

k=O

p;k- s

I : :;

sup IF*(s)- G(s)l :::;; p:, 05s51

we have also established the second inequality in (2).

Remark. Let us observe that the lower bound min sup i

05s51

I L P;k - s I~ !P: [ns]

k=O

is evident. 4. Let A = A(t), t ~ 0, be a nondecreasing function, continuous on the right, with limits on the left, and with A(O) = 0. In accordance with Theorem 12 of §6, Chapter II, the equation

Zr = K +

J~z._ dA(s)

(12)

351

§7. Rapidity of Convergence in Poisson's Theorem

has (in the class of locally bounded functions that are continuous on the right and have limits on the left) a unique solution, given by the formula

Z, = KtS',(A), where

tS',(A) = eA
(13)

n (1 + L\A(s))e-&<•>.

(14)

oss:!>f

Let us now suppose that the function V(t), t ~ 0, which is locally bounded, continuous on the right, and with limits on the left, satisfies

Jt;

~ K + LV.- dA(s),

where K is a constant, for every t

Lemma 2. For every t

~

(15)

0.

~

0, (16)

PRooF. Let T = inf{t ~ 0: Jt; > KtS',(A)}, where inf{0} = oo. If T = oo, then (16) holds. Suppose T < oo; we show that this leads to a contradiction. By the definition of T, From this and (12),

Vr

~ KtS'r(A) =

K

~K

+K

LTs._(A) dA.

+

V.-

s:

dA(s)

~ Vr.

(17)

If Vr > KtS' r(A), inequality (17) yields Vr > Vr, which is impossible, since

IVrl <

00.

Now let Vr = Kef r(A). Then, by (17),

Vr

=K +

s:

V.-

dA(s).

From the definition ofT, and the continuity on the right of Jt;, Kcf,(A) and A(t), it follows that there is an h > 0 such that when T < t ~ T + h we have

Jt; > KtS',(A) and Ar+h - Ar ~

t.

Let us write 1/11 = Jt; - Kcf,(A). Then 0 < 1/1,

~ J:t/1.- dA.,

T < t

~ T + h,

and therefore 0:::;;; sup 1/11

:::;;;

t

sup

1/J,.

T:s;r:s;T+.!

352

III. Convergence of Probability Measures. Central Limit Theorem

Hence it follows that ljl, = 0 for T assumption that T < oo.

~

t

T

~

+ h,

which contradicts the

Corollary (Gronwall-Bellman Lemma). In (15) let A(t) = J~ a(s) ds and K;;;::: 0. Then (18)

5.

PROBLEMS

1. Prove formula (4). ~ 0, be functions of locally bounded variation, continuous on the right and with limits on the left. Let A(O) = B(O) = 0 and L\A(t) > -1, t ~ 0. Show that the equation

2. Let A = A(t), B = B(t), t

z,

=

J~z._ dA(s) + B(t)

has a unique solution $.(A, B) of locally bounded variation, given by

3. Let ~ and 11 be random variables taking the values 0, 1, .... Let p(~,

1'/) = supj P(~

E

A)- P(l'/

E

A)j,

where sup is taken over all subsets of {0, 1, ... }. Prove that (1) p(~, 1'/) = t Ik'=ol P(~ = k)- P('l = k)l; (2) p(~. 1'/) s p(~. ~) + p(~. 1'/); (3) if Cis independent of(~, 1'/), then p(~

(4) If the vectors (~ 1 , ••• ,

~.)and

+ ,, '1 + 0 s

it 1'/i)

s

J/(~;. 1'/;).

Let~= ~(p) be a Bernoulli random variable with P(~ = 1) = p, P(~ = 0) = 1 - p, 0 < p < 1; and n = n(p) a Poisson random variable with En= p. Show that

p(~(p),

5.

1'/);

('7 1 , ••• , 11.) are independent, then

p(t1 ~i• 4.

p(~.

n(p)) = p(1 - e-P)

s

p2 •

Let~ = ~(p) be a Bernoulli random variable with P(~ = 1) = p, P(~ = 0) = 1 - p, 0 < p < 1, and let n = n(A.) be a Poisson random variable such that P(~ = 0) = P(n = 0). Show that A.= -ln(1 - p) and p(~(p),

n(A.)) = 1 - e--<- A_e-.t

s tA. 2 •

§7. Rapidity of Convergence in Poisson's Theorem

353

6. Show, using Property (4) of Problem 3, and the conclusions of Problems 4 and 5, that if ~ 1 = ~ 1(pd, ... , ~. = ~.(p.) are independent Bernoulli random variables, 0 < P; < 1, and A; = -ln(1 - p;), 1 :::;; i :::;; n, then

and

CHAPTER IV

Sequences and Sums of Independent Random Variables

§1. Zero-or-One Laws 1. The series 2::': 1 (1/n) diverges and the series 2::': 1 ( -1)"(1/n) converges. We ask the following question. What can we say about the convergence or divergence of a series L.."'= 1 (~n/n), where ~ 1 , ~ 2 , ••• isasequenceofindependent identically distributed Bernoulli random variables with P(~ 1 = + 1) = P(~ 1 = --'-1) = In other words, what can be said about the convergence of a series whose general term is ± 1/n, where the signs are chosen in a random manner, according to the sequence ~ 1 , ~ 2 , .•• ? Let

t?

A1 =

{w: I

n= 1

~"converges} n

2::':

be the set of sample points for which 1 (~n/n) converges (to a finite number) and consider the probability P(A 1 ) of this set. It is far from clear, to begin with, what values this probability might have. However, it is a remarkable fact that we are able to say that the probability can have only two values, 0 or 1. This is a corollary of Kolmogorov's "zero-one law," whose statement and proof form the main content of the present section. 2. Let (Q, §, P) be a probability space, and let ~I> ~ 2 , ••• be a sequence of random variables. Let !F': = a(~"' ~n+ 1 , •••) be the a-algebra generated by ~"' ~n+ 1 , ••• , and write !!'

=

n ~F:. 00

n=1

355

§I. Zero-or-One Laws

Since an intersection of a-algebras is again a a-algebra, X is a a-algebra. It is called a tail algebra (or terminal or asymptotic algebra), because every event A EX is independent of the values of ~ 1 , ..• , ~n for every finite number n, and is determined, so to speak, only by the behavior of the infinitely remote values of~ 1 , ~ 2 , •••• Since, for every k 2::: 1, A1 we have A1

E

={I ~nn converges} = {I ~nn converges} n=k

n=1

nk ff'k =X. In the same way, if A2 = {

E

ff'k,

~1• ~2• ... is any sequence,

~ ~n converges} EX.

The following events are also tail events:

A3 = where In

E

{~n E

In for infinitely many n},

fJ6(R), n 2::: 1; A4

=

{Ji~ ~n < 00 };

A5 =

+ " ' + ~n {I~ 1m ~ 1

A6 =

{I

A7

{~n converges};

=

As =

n

n

1m ~ 1

+ ' ' ' + ~n n

n

{rrm: n

Sn

J2n log n

=

}
}
1}.

On the other hand,

= 0 for all n

B1 =

{~n

B2 =

{li~(~ 1 + · · · + ~n) exists and is less than c}

2::: 1},

are examples of events that do not belong to X. Let us now suppose that our random variables are independent. Then by the Borel-Cantelli lemma it follows that P(A3)

=

P(A3) =

L P(~n E Jn) < 1 ~ L P(~n E Jn) =

0~

00, 00.

Therefore the probability of A 3 can take only the values 0 or 1 according to the convergence or divergence of L P(~n E Jn). This is Borel's zero-one law.

356

IV. Sequences and Sums of Independent Random Variables

Theorem 1 (Kolmogorov's Zero-One Law). Let ~ 1 , ~ 2 , ••. be a sequence of independent random variables and let A E !!f. The P(A) can only have one of the values zero or one. PROOF. The idea of the proof is to show that every tail event A is independent of itself and therefore P(A n A) = P(A) · P(A), i.e. P(A) = P 2 (A), so that P(A) = 0 or 1. If A EX then A E ffr" = crg 1 , ~ 2 , ... } = cr(Un ffD, where ff~ = a{~ 1, ... , ~n}, and we can find (Problem 8, §3, Chapter II) sets An E ff~, m :2: 1, such that P(A !::,. An) --+ 0, n --+ oo. Hence

P(An)--+ P(A),

P(An n A)--+ P(A).

(1)

But if A E X, the events An and A are independent for every n :2: 1. Hence it follows from (1) that P(A) = P 2 (A) and therefore P(A) = 0 or 1. This completes the proof of the theorem.

Corollary. Let 1J be a random variable that is measurable with respect to the tail a-algebra X, i.e. {IJ E B} E !!f, BE £?4(R). Then 1J is degenerate, i.e. there is a constant c such that P(IJ = c) = 1. 3. Theorem 2 below provides an example of a nontrivial application of Kolmogorov's zero-one law. Let ~ 1 , ~ 2 , ••. be a sequence of independent Bernoulli random variables with P(~n = 1) = p, P(~n = -1) = q, p + q = 1, n :2: 1, and let Sn = ~ 1 + · · · + ~n· It seems intuitively clear that in the symmetric case (p = !) a "typical" path of the random walk Sn, n :2: 1, will cross zero infinitely often, whereas when p # 1it will go off to infinity. Let us give a precise formulation.

Theorem 2. (a)

(b) lfp #

~fp =

1 then P(Sn =

1, then P(Sn =

0 i.o.) = 1.

0 i.o.) = 0.

PRooF. We first observe that the event B = (Sn = 0 i.o.) is not a tail event, i.e. B¢!!f=nff;:o, ff:O=a{~n,~n+t•· .. }· Consequently it is, in principle, not clear that B should have only the values 0 or 1. Statement (b) is easily proved by applying (the first part of) the BorelCantelli lemma. In fact, if B 2 n = {S 2 n = 0}, then by Stirling's formula P(B ) = 2n

en n n 2nP q

"'(4pqt !:: -vnn

and therefore L P(B 2 n) < oo. Consequently P(Sn = 0 i.o.) = 0. To prove (a), it is enough to prove that the event A =

~ Sn {hm Jn =

has probability 1 (since A

5;:;

B).

Jn

. Sn } oo, hm = - oo

357

§1. Zero-or-One Laws

Let

Then Ac! A, c--+ oo, and all the events A, A 0 A~, A~ are tail events. Let us show that P(A~) = P(A~) = 1 for each c > 0. Since A~ E ?£and A~ E ?£, it is sufficient to show only that P(A~) > 0, P(A~) > 0. But by Problem 5

P(lim }n <-c)= P(rrm }n >c)~ umP(}n >c)> O, where the last inequality follows from the De Moivre-Laplace theorem. Thus P(Ac) = 1 for all c > 0 and therefore P(A) = limc .... oo P(Ac) = 1. This completes the proof of the theorem. 4. Let us observe again that B = { Sn = 0 i.o.} is not a tail event. Nevertheless, it follows from Theorem 2 that, for a Bernoulli scheme, the probability of this event, just as for tail events, takes only the values 0 and 1. This phenomenon is not accidental: it is a corollary of the Hewitt-Savage zero-one law, which for independent identically distributed random variables extends the result of Theorem 1 to the class of" symmetric" events (which includes the class of tail events). Let us give the essential definitions. A one-to-one mapping n = (n 1, n 2 , •• •) of the set (1, 2, ... ) on itself is said to be a finite permutation if rt 11 = n for every n with a finite number of exceptions. If~= ~ 1 , ~ 2 , ••• is a sequence of random variables, n(~) denotes the sequence (~,11 , ~ 112 , ••• ). If A is the event {~ E B}, BE PJ(R 00 ), then n(A) denotes the event {n(~) E B}, BE .94(R We call an event A = {~ E B}, BE PJ(R 00 ), symmetric if n(A) coincides with A for every finite permutation n. An example of a symmetric event is A = {Sn = 0 i.o.}, where Sn = ~ 1 + · · · + ~n· Moreover, we may suppose (Problem 4) that every event in ff;:'(S) = a{w: Sn, Sn+ 1 , ..• } generated by the tail a-algebra ?I(S) = s1 = ~1• s2 = ~1 + ~2• ... is symmetric. 00

).

n

Theorem 3 (Hewitt-Savage Zero-One Law). Let ~ 1 , ~ 2 , ••• be a sequence of independent identically distributed random variables, and A=

a symmetric event. Then P(A)

{w:(~ 1 ,~ 2 , ... )EB}

= 0

or 1.

PRooF. Let A = { ~ E B} be a symmetric event. Choose sets Bn E PJ(R") such that, for An = {w: (~ 1• ... , ~n) E Bn}, n--+ oo.

(2)

358

IV. Sequences and Sums of Independent Random Variables

Since the random variables ~ 1 , ~ 2 , •.• are independent and identically distributed, the probability distributions P~(B) = P(~ E B) and P,..w(B) = P(n.(O E B) coincide. Therefore P(A lJ. A.)

= P~(B lJ. B.) = P,..w(B lJ. B.).

Since A is symmetric, we have A

Therefore P,.<~>(B

=g

E

B} = n.(A)

=

{nnC~) E

(3)

B}.

lJ. B.)= P{n.(~) E B) lJ. (n.(~) E B.)} = P{(~ E B) lJ. (n.(~) E B.)} = P{A i:J. n.(A.)}.

(4)

Hence, by (3) and (4), P(A lJ. A.) = P(A lJ. n.(A.)).

(5)

It then follows from (2) that P(A lJ. (A. n n.(A.)))

--+

n--+ oo.

0,

(6)

Hence, by (2), (5) and (6), we obtain P(A.)

--+

P(A),

P(n.(A))

P(A. n n.(A.))

--+

--+

P(A),

P(A).

(7)

Moreover, since ~ 1 and ~ 2 are independent, P(A.

f1

n.(A.)) = P{(~ 1 , ... , ~.) E B., (~n+ 1• ... , ~2n) E B;,} = P{(~I• ... , ~.) E B.}· P{(~n+I• ... , ~2n) E B.} = P(A.)P(n:.(A.)),

whence by (7) P(A)

= P 2 (A)

and therefore P(A) = 0 or 1. This completes the proof of the theorem.

5.

PROBLEMS

1. Prove the corollary to Theorem 1.

2. Show that if (e.) is a sequence of independent random variables, the random variables Urn and lim are degenerate.

e.

e.

e

e.,

3. Let (e.) be a sequence of independent random variables, s. = 1 + · · · + and let the constants b. satisfy 0 < b. j oo. Show that the random variables Iiiii(S./b.) and lim(S.Jb.) are degenerate.

e.,

4. Lets.= el + ... + n ~ 1, and ~(S) = Show that every event in ~(S) is symmetric.

n$';:'(S), $';:'(S) = a{ro: s., s.+ I• ...}.

S. Let (e.) be a sequence of random variables. Show that {Iiiii for each c > 0.

e.> c} 2ilm{e. > c}

359

§2. Convergence of Series

§2. Convergence of Series 1. Let us suppose that ~ 1 , ~ 2 , ••. is a sequence of independent random variables, Sn = ~ 1 + · · · + ~"' and let A be the set of sample points w for which ~nCw) converges to a finite limit. It follows from Kolmogorov's zero-one law that P(A) = 0 or 1, i.e. the series ~n converges or diverges with probability 1. The object of the present section is to give criteria that will determine whether a sum of independent random variables converges or diverges.

L:

L:

Theorem 1 (Kolmogorov and Khinchin). (a) Let E~n = 0, n ~ 1. Then if (1)

L

the series ~n converges with probability 1. (b) !{the random variables~"' n ~ 1, are unfformly bounded (i.e., P(l~nl s c) = 1, c < oo ), the converse is true: the convergence of ~n with probability 1 implies (1).

L

The proof depends on

Kolmogorov's Inequality (a) Let~ 1, ~ 2 , •.. , ~n be independent random variables with E~i = 0, E~t < oo, i S n. Then for every c; > 0 (2)

(b) If also

P(\~d

S c)= 1, iS n, then (3)

PRooF. (a) Put

A= {max\Ski ~ e}, Ak = {\Sd < Then A= LAk and

t:,

i = 1, ... ,k- 1, \Ski~ e},

1 S k S n.

360

IV. Sequences and Sums of Independent Random Variables

But

ES;IAk = E(Sk + (ek+t + · · · + en)) 2 1Ak = ESf/Ak + 2ESiek+l + ... + en)/Ak ~ ESf/Ak'

+ E(ek+l + ... + en) 2 1Ak

since

+ .. · + en)/Ak = ESk/Ak · E(ek+t + .. ·+en)= 0 because of independence and the conditions Eei = 0, i :::;; n. Hence ESk(ek+t

Es; ~

L ESf/ Ak ~ 8 2 L P(Ak) =

8 2 P(A),

which proves the first inequality. (b) To prove (3), we observe that

ES;IA

= Es;

= ES;

- ES;Ix ~ Es; - 82 P(A) On the other hand, on the set Ak

1sk-tl:::;; and therefore

Es;IA

=

1sk1:::;; 1sk-tl

8,

:::;; (8

82

+ 82 P(A).

(4)

+ 1ek1:::;; 8 + c

L ESf/Ak + L E(/Ak(Sn- Sk) k

-

2)

k

n

n

k=l

j=k+l

+ c) 2 L P(Ak) + L P(Ak) L Ee] k

:::;; P(A{(8

+ c) 2 +

J 1

J

+ c) 2 + ES;),

Ee] = P(A)[(8

(5)

From (4) and (5) we obtain P(A) > -

Es; -

(8

= 1-

82

+ c) 2 + Es;

-

82

(8

(8

+

> 1-

c)2

+ c) 2 + Es;

-

(8

+

c)2

Es;

82 -

·

This completes the proof of (3). PROOF OF THEOREM 1. (a) By Theorem 4 of §10, Chapter II, the sequence (Sn), n ~ 1, converges with probability 1, if and only if it is fundamental with probability 1. By Theorem 1 of §10, Chapter II, the sequence (Sn), n ~ 1, is fundamental (P-a.s.) if and only if

n-... oo.

(6)

By (2), P{sup ISn+k- Snl k~l

~ 8} =

limP{ max ISn+k- Snl

N-+oo

1Sk:SN

Therefore (6) is satisfied if:Lk"'=. 1 Ee~ < oo, and consequently with probability 1.

~ 8} L ek converges

361

§2. Convergence of Series

(b) Let

L ~k converge. Then, by (6), for sufficiently large n, P{sup ISn+k- Snl ~ e} < 1-.

(7)

k~l

By (3), P{ supiSn+k- Snl ~ e} ~ 1k~

1

Therefore if we suppose that

Lf=

1

(c + e) E1'2" "oo 2

L...k=n

'ok

E~f = oo, we obtain

P{sup ISn+k- S.l k~1

~ e} = 1,

which contradicts (7). This completes the proof of the theorem. If ~ 1 , ~ 2 , ... is a sequence of independent Bernoulli random variables with P(~n = + 1) = P(~n = -1) = !, then the series ~nan, with Ian I ~ c, converges with probability 1, if and only if La; < oo.

ExAMPLE.

L

2. Theorem 2 (Two-Series Theorem). A sufficient condition for the convergence of the series L ~" of independent random variables, with probability 1, is that both series L E~n and L V~"converge. If P( I~n I ~ c) = 1, the condition is also necessary. PROOF. Ifl: V~" < oo, then by Theorem 1 the series L (~n- E~n) converges (P-a.s.). But by hypothesis the series E~n converges; hence ~"converges (P-a.s.) To prove the necessity we use the following symmetrization method. In addition to the sequence ~ 1 , ~ 2 , ... we consider a different sequence ~ 1 ,

L

L

~2 ,

•••

of independent random variables such that ~" has the same distribu-

tion as ~"' n ~ 1. (When the original sample space is sufficiently rich, the existence of such a sequence follows from Theorem 1 of §9, Chapter II. We can also show that this assumption involves no loss of generality.) Then if ~"converges (P-a.s.), the series ~"also converges, and hence so does L (~. - ~.). But E(~. - ~.) = 0 and P( I~" - ~"I ~ 2c) = 1. Therefore LV(~. ~.) < oo by Theorem 1. In addition,

L -

L

L: v~. = 1- L: V(~. - ~.) < oo. Consequently, by Theorem 1, L (~. - E~.) converges with probability 1, and therefore L E~" converges. Thus if L ~"converges (P-a.s.)(and P( I~" I ~ c) = 1, n ~ 1) it follows that both L E~. and LV~. converge. This completes the proof of the theorem.

3. The following theorem provides a necessary and sufficient condition for the convergence of L ~" without any boundedness condition on the random variables.

362

IV. Sequences and Sums of Independent Random Variables

Let

c be a constant and

~c = {~' ~~~ ~ C, 0,

>c.

1~1

Theorem 3 (Kolmogorov's Three-Series Theorem). Let ~ 1 , ~ 2 , ••• be a sequence of independent random variables. A necessary condition for the convergence of"[. ~ .. with probability 1 is that the series

L

"[. E~~. v~~. "[. P(l~.. l;?: c) converge for every c > 0; a sufficient condition is that these series converge for some c > 0.

PRooF. Sufficiency. By the two-series theorem,"[. ~~converges with probability 1. But if "[. P( I~.. I ;?: c) < oo, then by the Borel-Cantelli lemma we have I(l ~ .. I ;?: c) < oo with probability 1. Therefore ~ .. = ~~ for all n with at

L

L.

most finitely many exceptions. Therefore ~ also converges (P-a.s.). Necessity. If ~ converges (P-a.s.) then ~ .. -+ 0 (P-a.s.), and therefore, for every c > 0, at most a finite number of the events {I~ .. I ;?: c} can occur (P-a.s.). Therefore I( I~ .. I ;?: c) < oo (P-a.s.), and, by the second part of the Borel-Cantelli lemma, P( I~ .. I > c) < oo. Moreover, the convergence of ~ implies the convergence of ~~. Therefore, by the two-series theorem, both of the series E~~ and LV~~ converge. This completes the proof of the theorem.

L. L

L .

L

L

L

Corollary. Let ~ 1 , ~ 2 , ••• be independent variables withE~, = 0. Then if ~2

L E 1 + ~~~~~~~ <

00,

the series "[. ~ .. converges with probability 1. For the proof we observe that ~2

L E 1 + ~~~~~~~ < 00 <=> L E[~;I(I~nl ~ 1) + l~nii(I~nl > 1)] < 00. Therefore if~! = ~ .. I( I~ .. I ~ 1), we have

L E(~!)2

<

00.

Since E~, = 0, we have

LIE~! I=

L IE~,I(I~.. I ~

1)1 =

L IE~,I(I~.. I > 1)1

~ l:EI~ .. II(I~ .. I > 1) < oo. Therefore both inequality, P{l~ .. l

L

L E~!

> 1} =

and LV~! converge. Moreover, by Chebyshev's P{I~.. II(I~ .. I > 1)

Therefore P( I~ .. I > 1) the three-series theorem.

> 1}

~ E(I~ .. II(I~ .. I > 1).

< oo. Hence the convergence of L ~.. follows from

363

§3. Strong Law of Large Numbers

4.

PROBLEMS

1. Let ~ 1 , ~ 2 , ..• be a sequence of independent random variables, S" = ~ 1 ,

••. , ~n·

Show, using the three-series theorem, that (a) if I ~;; < oo (P-a.s.) then I ~" converges with probability 1, if and only if

IE

U(l~d ~!)converges;

(b) if I~" converges (P-a.s.) then I~;; < oo (P-a.s.) if and only if

I

(E l~nll(lenl ~ 1))2 < oo.

2. Let eI• ~2' ... be a sequence of independent random variables. Show that I (P-a.s.) if and only if ~;;

IE - - 2 < 1 + ~"

e;;

<

00

00.

3. Let ~ 1 , ~ 2 , ••• be a sequence of independent random variables. Show that I~" converges (P-a.s.) if and only if it converges in probability.

§3. Strong Law of Large Numbers 1. Let ~ 1 , ~ 2 , .•• be a sequence of independent random variables with finite second moments; Sn = ~ 1 + · · · + ~n· By Problem 2, §3, Chapter III, if the numbers V~i are uniformly bounded, we have the law of large numbers: Sn- ESn

P

---"'-+

0

n

n-+ oo.

'

(1)

A strong law of large numbers is a proposition in which convergence in probability is replaced by convergence with probability 1. One of the earliest results in this direction is the following theorem. Theorem 1 (Cantelli). Let ~ 1 , finite fourth moments and let

~ 2 , .••

be independent random variables with

El~n-E~ni 4 ~C,

n~l,

for some constant C. Then as n-+ oo

Sn- ES"

- - - -+

n

0

(P-a.s.).

(2)

PRooF. Without loss of generality, we may assume that E~" = 0 for n ~ 1. By the corollary to Theorem 1, §10, Chapter II, we will have Sn/n-+ 0 (P-a.s.) provided that

for every e > 0. In turn, by Chebyshev's inequality, this will follow from

364

IV. Sequences and Sums oflndependent Random Variables

Let us show that this condition is actually satisfied under our hypotheses. We have 4

s.

n

4

4

= (~1 + ... + ~.) = i~1~i-

b

4! 2 2 2!2! ~i~j

i<j

"

4!

+ loFJ .L.. 3'1' ~i~j· • • Remembering that

E~k

n

ES! =

s

= 0, k ::::;; n, we then obtain n

L E~i + 6 L: i=1

i,j=1

nC

+

3

E~?E~J s nC + 6

6n(n - 1) C = (3n 2 2

L

i,j=1 i<j

JE~i · E~J

2n)C < 3n 2 C.

-

Consequently

L E(s: )

4

::::;; 3C

L n1

2

< oo.

This completes the proof of the theorem. 2. The hypotheses of Theorem 1 can be considerably weakened by the use of more precise methods. In this way we obtain a stronger law of large numbers.

Theorem 2 (Kolmogorov). Let~ 1 , ~ 2 , ..• be a sequence of independent random variables with finite second moments, and let there be positive numbers b. such that b. i oo and

"v~. < oo.

(3)

L.f? n

Then

s. -b ES. __,. O n

In particular,

(

)

P-a.s..

(4)

if (5)

then

S" - ES. --.. 0 (P- a.s. ) ---'--n For the proof ofthis, and of Theorem 2 below, we need two lemmas.

(6)

365

§3. Strong Law of Large Numbers

Lemma 1 (Toeplitz). Let {a.} be a sequence of nonnegative numbers, b. = > 0 for n ~ 1, and b. i oo, n--+ oo. Let {x.} be a sequence of

L~= 1 a;, b.

numbers converging to x. Then

(7) In particular,

if a.

=

1 then

x1

+ · · · + x.

(8)

------+X.

n

PROOF. Let e > 0 and let n0 Choose n1 > n0 so that

n0 (e) be such that Ix. -

=

1

b

no

.L

" ' j=

Ixi

-

xI ::;

~:/2

for all n

~

n0 .

xI < e/2.

1

Then, for n > n 1,

This completes the proof of the lemma.

Lemma 2 (Kronecker). Let {b.} be a sequence of positive increasing numbers, b. j oo, n--+ oo, and let {x.} be a sequence of numbers such that L x. converges. Then 1

n

b.

j= 1

- L bixi--+ 0, In particular, if b. = n, x. = Y./n and Y1

Let b 0

n

n --+ oo.

(10)

= 0, S 0 = 0, s. = L}= 1 xi. Then (by summation by parts) n

L bjxj = L b/Sj- sj-1) =

j=1

(9)

L (y./n) converg_es, then

+ · · · + Yn --+, 0 n

PROOF.

n --+ oo.

j=1

bnSn - boSo -

n

L sj-1(bj- b;-1)

j=l

366

IV. Sequences and Sums of Independent Random Variables

and therefore

since, if sn

~X,

then by Toeplitz' lemma,

1 -b

n

L Si _ 1ai ~ x.

n j= 1

This establishes the lemma. PROOF OF THEOREM 1. Since

Sn - ES" bn

=

2_

I bk(~k -bkE~k),

bn k=1

a sufficient condition for (4) is, by Kronecker's lemma, that the series [(~k - E~k)/bk] converges (P-a.s.). But this series does converge by (3) of Theorem 1, §2. This completes the proof of the theorem.

L

EXAMPLE 1. Let ~ 1 , ~ 2 , •.• be a sequence of independent Bernoulli random variables with P(~n= l)=P(~n= -1)=1. Then, since L [l/(n log 2 n)] < oo, we have

r::. sn

v n log n

) ~ ~ 0 (P-a.s ..

(11)

3. In the case when the variables ~ 1 , ~ 2 , •.• are not only independent but also identically distributed, we can obtain a strong Jaw of large numbers without requiring (as in Theorem 2) the existence of the second moment, provided that the first absolute moment exists.

Theorem 3 (Kolmogorov). Let ~ 1 , ~ 2 , .•• be a sequence of independent identically distributed random variables withE I~ 1 1< oo. Then

sn ~ m

n where m

(P-a.s.)

(12)

= E~ 1 .

We need the following lemma.

Lemma 3. Let~ be a nonnegative random variable. Then 00

00

n=l

n=l

L P( ~ :2: n) ~ E~ ~ 1 + L P( ~ ;;:: n).

(13)

367

§3. Strong Law of Large Numbers

The proof consists of the following chain of inequalities:

L P(e ~ n) = L L P(k ~ e< 00

00

n= 1

k + 1)

n= 1 k;;,n 00

= I

kP(k ~ e < k

k=1

+ 1) =

00

I

~

k=O

E[ei
r E[kl(k ~ e < k + 1)J 00

k=O

+ 1)J

r E[(k + 1)I(k ~ e< k + 1n 00

= Ee ~ 00

I

=

k=O

(k

k=O

+ 1)P(k ~ e< k + 1)

00

00

L P(e ~ n) + L P(k ~ e<

=

n=1

k=O

00

k + 1) =

L P(e ~ n) +

n=1

1.

PRooF OF THEOREM 3. By Lemma 3 and the Borel-Cantelli lemma, E le11 <

00

¢> ¢>

L P{le11 ~ n} < 00 L P{lenl ~ n} < oo

¢>

P{lenl ~ n i.o.} = 0.

Hence Ien I < n, except for a finite number of n, with probability 1. Let us put

and suppose that Een and only if ( 1 + · · · +

e

Een

= 0, n

~

1. Then (e 1 + · · · + en)/n--. 0 (P-a.s.), if

en )/n --. 0 (P-a.s. ). Note that in general Een # 0 but

= Een !(len I<

n)

= Eeti(Ietl

< n) ___. Eet

= 0.

Hence by Toeplitz' lemma

n--. oo, and consequently (e 1 + · · · + en)!n-+ 0 (P-a.s.), if and only if

<e~ - E~~) +···+<en- Ee") --., 0 n Write ~n = ~n

-

n --.

oo

(P-a.s.),

n --.

oo. (14)

E~n· By Kronecker's lemma, (14) will be established if

I (~jn) converges (P-a.s.). In turn, by Theorem 1 of §2, this will follow if we show that, when E Ie1 I < 00, the series L (V ~n/n 2 ) converges.

368

IV. Sequences and Sums of Independent Random Variables

We have

"V~n < ~ E~; L.,

n

2-L.

n=l

n

2

00

=

L E[~iJ(k-

1 ~ l~tl < k)].

k=1

00

1

00

L

2

n

n=k

1

~ 2 k~1 kE[~if(k- 1 ~ 1~ 1 1 < k)]

L E[1~11J(k -1 ~ l~tl < k)] = 00

~ 2

2E1~11

<

00.

k= 1

This completes the proof of the theorem.

Remark 1. The theorem admits a converse in the following sense. Let be a sequence of independent identically distributed random variables such that ~ 1 , ~ 2 , •.•

~1

+ "' + ~n

-=-----"-+

n

C,

with probability 1, where Cis a (finite) constant. Then E I~ 1 I < oo and C In fact, if Sn/n -+ C (P-a.s.) then

~n = n

Sn _ (n n n

Sn1) n-1

1 -+

=

E~ 1 •

O (P-a.s.)

and therefore P( I~n I > n i.o.) = 0. By the Borel-Cantelli lemma,

L P(l~1l > n) < oo, and by Lemma 3 we have E I~ 1 I < oo. Then it follows from the theorem that C = E~ 1 .

Consequently for independent identically distributed random variables the conditionE I ~ 1 1 < oo is necessary and sufficient for the convergence (with probability 1) of the ratio Sn/n to a finite limit.

Remark 2. If the expectation m = E~ 1 exists but is not necessarily finite, the conclusion (10) of the theorem remains valid. In fact, let, for example, E~1 < oo and E~i = oo. With C > 0, put n

s; = L ~J(~; ~ C). i= 1

369

§3. Strong Law of Large Numbers

Then (P-a.s.).

But as C -+ oo, Ee 1 I(e 1 ~C)-+ Ee 1 = oo; therefore Sn/n -+

+ oo (P-a.s.)

4. Let us give some applications of the strong law oflarge numbers. 1 (Application to number theory). Let Q = [0, 1), let 11 be the algebra of Borel subsets of Q and let P be Lebesgue measure on [0, 1). Consider the binary expansions OJ=O. OJ 1 0J 2 •.• of numbers OJ e Q (with infinitely many O's) and define random variables 1(OJ), 2(OJ), ... by putting en(OJ) = OJn. Since, for all no~ 1 and all x 1 , ..• , xn taking the values 0 or 1, EXAMPLE

e e

the P-measure of this set is 1/2n. It follows that e 1 , en, ... is a sequence of independent identically distributed random variables with P<el =

o> =

P<el = 1) = !.

Hence, by the strong law of large numbers, we have the following result of Borel: almost every number in [0, 1) is normal, in the sense that with probability 1 the proportion of zeros and ones in its binary expansion tends to ! , i.e.

1

n

- L I(ek = n k=t

1)-+! (P-a.s.).

EXAMPLE 2 (The Monte Carlo method). Let f(x) be a continuous function defined on [0, 1], with values on [0, 1]. The following idea is the foundation of the statistical method of calculating JAf(x) dx (the "Monte Carlo method"). Let e 1 , '7 1 , e 2 , '7 2 , .•• be a sequence of independent random variables, uniformly distributed on [0, 1]. Put

P· =



It is clear that

{1

o

if f(O > '7;. if < 'li·

f(ei>

370

IV. Sequences and Sums of Independent Random Variables

By the strong law of large numbers (Theorem 3)

1 n

(1

n;~/; -+ Jo f(x) dx

(P-a.s.).

Consequently we can approximate an integral f~ f(x) dx by taking a simulation consisting of a pair of random variables(~;, '7;), i ;;:: 1, and then 1 P;· calculating P; and (1/n)

L7=

5. PROBLEMS 1. Show that

Ee

2

<

00

eI > n) <

if and only if L:'=t nP( I

00.

ee

2. Supposing that 1, 2 , ••• are independent and identically distributed, show that if E Ietl" < 00 for some IX, 0 < IX < 1, then S./n 11"--> 0 (P-a.s.), and if EI~tiP < 00 for some {J, 1 ~ fJ < 2, then (S. - nE~ 1 )/n 11 P-+ 0 (P-a.s.).

ee

3. Let 1, 2 , ••• be a sequence of independent identically distributed random variables and let E I ~ 1 1 = oo. Show that

~I~

-I a.

=

oo

(P-a.s.)

for every sequence of constants {a.}. 4. Show that a rational number on [0, 1) is never normal (in the sense of Example 1, Subsection 4).

§4. Law of the Iterated Logarithm 1. Let ~ 1 , ~ 2 , ••• be a sequence of independent Bernoulli random variables with P(~n = 1) = P(~n = -1) = t; let Sn = ~ 1 + · · · + ~n· It follows from the proof of Theorem 2, §1, that (1)

with probability 1. On the other hand, by (3.11),

JnnSnlog n -+ 0

(P-a.s.).

(2)

Let us compare these results. It follows from (1) that with probability 1 the paths of (Sn)n~ 1 intersect the "curves" infinitely often for any given but at the same time (2)

±eJn

e;

~4.

371

Law of the Iterated Logarithm

shows that they only finitely often leave the region bounded by the curves ± eJn log n. These two results yield useful information on the amplitude of the oscillations of the symmetric random walk (Sn)n ~ 1• The law of the iterated logarithm, which we present below, improves this picture of the amplitude of the oscillations of (Sn)n> 1 . Let us introduce the following definition. We call a function cp* = cp*(n), n ;;::: 1, upper (for (Sn)n~ 1) if, with probability 1, Sn ~ cp*(n) for all n from n = n 0 (w) on. Wecallafunctioncp* = cp*(n),n;;::: 1,lower(for(Sn)n~ 1 )if,withprobability 1, > cp*(n) for infinitely many n. Using these definitions, and appealing to (1) and (2), we can say that every function cp* = eJn log n, e > 0, is upper, whereas cp* = eJn is lower, e > 0. Let cp = cp(n) be a function and cp: = (1 + e)cp, cp*" = (1 - e)cp, where e > 0. Then it is easily seen that

sn

{rrm (/)~~> ~ 1} = {1i~ [~~~ (/)~:>] ~ t} ¢:> { sup

m~n,(<)

¢:.

S(m) cp m

{Sm ~ (1

~ 1 + e for every e > 0, from some n (e) on} 1

+ e)cp(m) for every e >

0, from some n 1(e) on}. (3)

In the same way,

{11m(/)~~) ; : : 1} = {li~ [~~~ (/)~:)] ;;::: 1} ¢:>{sup S(m) m2:n2(<) cp m

~ 1 + e.foreverye > O,.fromsomen (e)on} 1

¢:> {Sm ;;::: (1 - e)cp(m) for every e > 0 and for infinitely many m larger than some n 3 (e) ;;::: n 2 (e)}.

(1

(4)

It follows from (3) and (4) that in order to verify that each function cp: = + e)cp, e > 0, is upper, we have to show that (5)

But to show that cp*. = (1 - e)cp, e > 0, is lower, we have to show that (6)

372

IV. Sequences and Sums of Independent Random Variables

2. Theorem 1 (Law of the Iterated Logarithm). Let~ 1 , ~ 2 , ••. be a sequence of independent identically distributed random variables with E~; = 0 and E~t = CJ 2 > 0. Then (7)

where tf;(n) = J2CJ 2 n log log n.

(8)

For uniformly bounded random variables, the law of the iterated logarithm was established by Khinchin (1924). In 1929 Kolmogorov generalized this result to a wide class of independent variables. Under the conditions of Theorem 1, the law of the iterated logarithm was established by Hartman and Wintner (1941). Since the proof of Theorem 1 is rather complicated, we shall confine ourselves to the special case when the random variables ~n are normal, ~n " ' %(0, 1), n ~ 1. We begin by proving two auxiliary results.

Lemma 1. Let~ 1, •.• , ~n be independent random variables that are symmetrically distributed (P( ~k E B) = P(- ~k E B) for every B E 84 (R), k ::;; n). Then for every real number a

P( max

Sk >

1'5k:5,n

a) : ; 2P(S" > a).

(9)

PROOF. Let A= {max1:5k:5n sk >a}, Ak = {S;::;; a, i::;; k- 1; sk >a} and B = {Sn > a}. Since S" > a on Ak (because Sk ::;; S"), we have

P(B n Ak) ~ P(Ak n {S" ~ Sd) = P(Ak)P(S" ~ Sk) = P(Ak)P(~k+ 1 + · · · + ~n ~ 0). By the symmetry of the distributions of the random variables ~ 1, ... , ~n, we have P(~k+ 1 +

Hence P(~k+ 1 +

··· +

· · · + ~n >

~n > 0) = P(~k+ 1

+ · · · + ~n <

0).

0) ~ -!,and therefore

n

P(B) ~ k~1 P(Ak n B) ~

1

n

2k~1 P(Ak) =

1

2 P(A),

which establishes (9).

Lemma 2. Let Sn ,...., JV (0, CJ 2 (n)), CJ 2 (n) j oo, and let a(n), n ~ 1, satisfy a(n)/CJ(n)--.. oo, n--.. oo. Then P(Sn > a(n)) ,....,

CJ(n)

11::..

-y

2na(n)

exp{ -ta 2 (n)/CJ 2 (n)}.

(10)

373

§4. Law of the Iterated Logarithm

The proof follows from the asymptotic formula

-1-

faa e - y1/2 dy - -1- e -x2/2

fox

fox

.

,

X-+ 00,

since S,Ju(n) ,.... .Af(O, 1). PROOF OF THEOREM 1 (for~; "' .Af(O, 1)). Let us first establish (5). Let e > 0, A.= l + e, nk =A.\ where k ~ k0 , and k0 is chosen so that In In k0 is defined. We also define

Ak = {Sn > A.t/l(n) for some n E (nk, nk+ tJ},

(11)

and put

A = {Ak i.o.}

= {Sn > A.t/J(n) for infinitely many n}.

In accordance with (3), we can establish (5) by showing that P(A) = 0. Let us show that P(Ak) < oo. Then P(A) = 0 by the Borel-Cantelli lemma. From (11), (9) and (10) we find that

L

P(Ak)

~

P{Sn > A.t/J(nk) for some n E (nk, nk+ 1)}

~ P{S"

> A.t/J(nk) for some n

~

nk+ tl

~ 2P{Snk+1 > A.t/J(nk)} "'fo~nk) exp{ -tA.2[t/J(nk)/AJl} ~

C 1 exp( -A. In In A.k)

~

Ce-J.Ink = C2 k-;.,

where C 1 and C2 are constants. But L::;. 1 k-;. < oo, and therefore

L P(Ak) <

oo.

Consequently (5) is established. We turn now to the proof of(6). In accordance with (4) we must show that, with A. = 1 - e, e > 0, we have with probability 1 that sn ~ A.t/J(n) for infinitely many n. Let us apply (5), which we just proved, to the sequence (- Sn)n<::.t· Then we find that for all n, with finitely many exceptions, -Sn ~ 21/J(n) (P-a.s.). Consequently if nk = N\ N > 1, then for sufficiently large k, either snk-1 ~ -21/J(nk-l) or (12)

where Yk = snk - snk-1. Hence if we show that for infinitely many k

lk >

A.t/J(nk)

+ 21/J(nk-l),

(13)

374

IV. Sequences and Sums of Independent Random Variables

this and (12) show that (P-a.s.) s.k > A.t/J(nk) for infinitely many k. Take some ).' E (A., 1). Then there is anN > 1 such that for all k

A.'[2(Nk- Nk- 1) In In Nk] 112 > A.(2Nk In In Nk) 112 + 2(2Nk- 1 In In Nk- 1) 112

=A.t/J(Nk) + 21/J(Nk- 1).

It is now enough to show that ~

> A.'[2(Nk - Nk- 1) In In Nk] 112

(14)

for infinitely many k. Evidently ~ ""' %(0, Nk - Nk- 1). Therefore, by Lemma2, P{Y. > A.'[2(Nk- Nk-1) In In Nkr/2}"'

k

>

1 foX(2 In In Nk) 112

c1

k-(A')2

- (In k) 112

e-0.'>2JntnNk

c2>-

k In k ·

L

Since (1/k Ink) = oo, it follows from the second part of the Borel-Cantelli lemma that, with probability 1, inequality (14) is satisfied for infinitely many k, so that (6) is established. This completes the proof of the theorem.

Remark 1. Applying (7) to the random variables (- s.).;;,; 1, we find that

s.

r

(15)

tm cp(n) = -1.

It follows from (7) and (15) that the law of the iterated logarithm can be put in the form (16)

Remark 2. The law of the iterated logarithm says that for every 8 > 0 each function t/Ji = (1 + 8)1/1 is upper, and t/1*, = (1 - 8)1/1 is lower. The conclusion (7) is also equivalent to the statement that, for each 8 > 0,

3.

P{IS.I

~

(1 - 8)1/J(n) i.o.}

=

1,

P{IS.I

~

(1

+ 8)1/J(n) i.o.}

=

o.

PROBLEMS

1. Let ~ 1 , ~ 2 , ... be a sequence of independent random variables Show that = 1} = 1. P{IIm v~ 2ln n

with~.~

JV(O, 1).

375

§4. Law of the Iterated Logarithm

2. Let ~ 1 , ~ 2 , .•. be a sequence of independent random variables, distributed according to Poisson's law with parameter A. > 0. Show that (independently of A.)

{ Pnm

=~.In Inn =1 } =1. Inn

3. Let ~ 1 , ~ 2 , .•. be a sequence of independent identically distributed random variables with

0
{ I

} S lll(lnlnn) = ell• = 1. p Ilrri -" nil•

4. Establish the following generalization of (9). Let ~ 1 , .•• , ~.be independent random variables. Levy's inequality

PL~::. [Sk + Jl(S. -

Sk)] >

a}

:$

2P(S. a), >

S0

=

0,

holds for every real a, where /1(~) is the median of~, i.e. a constant such that

CHAPTER V

Stationary (Strict Sense) Random Sequences and Ergodic Theory

§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations 1. Let (Q, :F, P) be a probability space and~= (~t. ~ 2 , ..• ) a sequence of random variables or, as we say, a random sequence. Let (}k ~denote the sequence (~k+1• ~k+2• ... ).

Definition l. A random sequence ~ is stationary (in the strict sense) if the probability distributions of (}k~ and~ are the same for every k 2: 1:

BE PA(R 00 ). The simplest example is a sequence ~ = (~ 1 , ~ 2 , ..•) of independent identically distributed random variables. Starting from such a sequence, we can construct a broad class of stationary sequences 11 = (17 t. 17 2 , ••• ) by choosing any Borel function g(x 1, ... , x.) and setting '1k = g(~k• ~k+ 1, ... , ~k+ 1). If ~ = (~ 1 , ~ 2 , .•. ) is a sequence of independent identically distributed random variables with E I ~ 1 1 < oo and E~ 1 = m, the law of large numbers tells us that, with probability 1, ~1 + ... + ~. __.m, _____

n

n ~ oo.

In 1931 Birkhoff obtained a remarkable generalization of this fact for the case of stationary sequences. The present chapter consists mainly of a proof of Birkhoff's theorem. The following presentation is based on the idea of measure-preserving transformations, something that brings us in contact with an interesting

§I. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations

377

branch of analysis (ergodic theory), and at the same time shows the connection between this theory and stationary random proceesses. Let (Q, fF, P) be a probability space.

Definition 2. A transformation T ofQ into n is measurable if, for every A E fF, T- 1 A = {w: TwEA}EfF.

Definition 3. A measurable transformation T is a measure-preserving transformation (or morphism) if, for every A E F, P(T- 1A) = P(A). LetT be a measure-preserving transformation, T" its nth iterate, and ~ 1 = n ~ 2, and consider the sequence~= (~ 1 , ~ 2 , ... ). We claim that this sequence is stationary. In fact, let A= {w: ~EB} and A 1 = {w: 0 1 ~EB}, where BE~(R 00 ). Since A = {w: (~ 1 (w), ~ 1 (Tw), .. .) E B}, and A 1 = {w: (~ 1 (Tw), ~ 1 (T 2 w), ...) E B}, we have wE A 1 if and only if either Tw E A or A 1 = r- 1A. But P(T- 1A) = P(A), and therefore P(Ad = P(A). Similarly P(Ak) = P(A) for every Ak = {w: Ok~ EB}, k ~ 2. Thus we can use measure-preserving transformations to construct stationary (strict sense) random variables. In a certain sense, there is a converse result: for every stationary sequence ~considered on (Q, fF, P) we can construct a new probability space (0, #, P), a random variable ~ 1 (w) and a measure-preserving transformation f, such that the distribution of~= {~ 1 (&), ~ 1 (fw), ... } coincides with the distribution of~. In fact, take 0 to be the coordinate space Roo and put :fft = BI(R 00 ), P = P~, where P~(B) = P{w: ~ E B}, BE BH(R 00 ). The action of f on Q is given by ~ 1 (w) a random variable. Put ~k(w) = ~ 1 (Tn- 1 w),

If w = (xt. x 2 ,

••• ),

put

~1(&)

Now let A = {w: (x 1,

=

x1,

... ,

t-

1

Uw) = ~ 1 (fn- 1 w),

n ~ 2.

xk) E B}, BE ~(Rk), and

A = {w:(x 2 ,

...

,xk+ 1 )EB}.

Then the property of being stationary means that P(A) = P{w:(~ 1 ,

...

,~dEB} = P{w:(~ 2 ,

...

,~k+ 1 )EB} =

P(f- 1A),

i.e. Tis a measure-preserving transformation. Since P{w: (~ 1 , ••• , ~k) E B} = P {w: (~ 1 , ... , ~k)EB} for every k, it follows that~ and~ have the same distribution. Here are some examples of measure-preserving transformations.

378

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

EXAMPLE 1.

Let Q = {w 1 , ... , ron} consist ofn points (a finite number), n 2:: 2, letffbethecollectionofitssubsets,andlet Tw; = W;+ 1, 1::;; i::;; n- 1,and Tw" = w 1 • If P (w;) = 1/n, the transformation T is measure-preserving.

2.1f n = [0, 1), §' = ~([0, 1)), Pis Lebesgue measure, A. E [0, 1), then Tx = (x + A.) mod 1 and T = 2x mod 1 are both measure-preserving transformations. ExAMPLE

2. Let us pause to consider the physical hypotheses that led to the consideration of measure-preserving transformations. Let us suppose that Q is the phase space of a system that evolves (in discrete time) according to a given law of motion. If w is the state at instant n = 1, then Tnw, where Tis the translation operator induced by the given law of motion, is the state attained by the system after n steps. Moreover, if A is some set of states w then T- 1A = { w: Tw E A} is, by definition, the set of states w that lead to A in one step. Therefore if we interpret Q as an incompressible fluid, the condition P(T- 1 A) = P(A) can be thought of as the rather natural condition of conservation of volume. (For the classical conservative Hamiltonian systems, Liouville's theorem asserts that the corresponding transformation T preserves Lebesgue measure.)

3. One of the earliest results on measure-preserving transformations was Poincare's recurrence theorem (1912). Theorem 1. Let (Q, ff, P) be a probability space, let T be a measure-preserving transformation, and let A E F. Then, for almost every point wE A, we have Tnw E A for infinitely many n 2:: 1.

PROOF. Let C = {wEA: Tnw~A, for all n 2:: 1}. Since C n T-nc = 0 for all n 2:: 1, we have T-mc n T-<m+nlc = T-m(C n T-nC) = 0. Therefore the sequence {T-nC} consists of disjoint sets of equal measure. Therefore L:':o P(C) = L:':o P(T-nC)::;; P(O) = 1 and consequently P(C) = 0. Therefore, for almost every point wE A, for at least one n 2:: 1, we have Tnw EA. It follows that Tnw E A for infinitely many n. Let us apply the preceding result to T\ k 2:: 1. Then for every wE A\N, where N is a set of probability zero, the union of the corresponding sets corresponding to the various values of k, there is an nk such that (Tkrw EA. It is then clear that rnw E A for infinitely many n. This completes the proof of the theorem.

Corollary. Let ~(w) 2:: 0. Then

L ~(Tkw) = 00

k=O

on the set {w: ~(w)

> 0}.

oo

(P-a.s.)

379

§2. Ergodicity and Mixing

In fact, let An= {w: ~(w) ~ 1/n}. Then, according to the theorem, = oo (P-a.s.) on An, and the required result follows by letting n--+ oo.

Lk'=o ~(Tkw)

Remark. The theorem remains valid if we replace the probability measure P by any finite measure Jl with Jl(Q) < oo. 4. PROBLEMS 1. Let Tbeameasure-preservingtransformationand~ = ~(w)arandom variable whose expectation E~(w) exists. Show that E~(w) = E~(Tw). 2. Show that the transformations in Examples 1 and 2 are measure-preserving. 3. Let n = [0, 1), F = af([O, 1)) and let P be a measure whose distribution function is continuous. Show that the transformations Tx = ..l.x, 0 < A. < 1, and Tx = x 2 are not measure-preserving.

§2. Ergodicity and Mixing 1. In the present section T denotes a measure-preserving transformation on the probability space (Q, fl', P).

Definition 1. A set A E fJ' is invariant if T- 1A = A. A set A E fJ' is almost invariant if A and r- 1A differ only by a set of measure zero, i.e. P(A 6. r- 1 A) =Q . It is easily verified that the classes J and J* of invariant or almost invariant sets, respectively, are u-algebras.

Definition 2. A measure-preserving transformation Tis ergodic (or metrically transitive) if every invariant set A has measure either zero or one. Definition 3. A random variable~ = ~(w) is invariant (or almost invariant) if ~(w) = ~(Tw) for all wEn (or for almost all wE Q). The following lemma establishes a connection between invariant and almost invariant sets.

Lemma 1. If A is almost invariant, there is an invariant set B such that P(A!::. B) = 0. PRooF. Let B = Ilm r-nA. Then T- 1B = Ilm r-A = B, i.e. B EJ.It is easily seen that A!::. B s;;; Uk'=o (T-kA!::. r-A). But P(T-kA!::. r-A)

Hence P(A !'::!,. B) = 0.

= P(A 6. T- 1A) = 0.

380

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

Lemma 2. A transformation T is ergodic if and only if every almost invariant set has measure zero or one.

PROOF. Let A ∈ 𝒥*; then according to Lemma 1 there is an invariant set B such that P(A △ B) = 0. But T is ergodic and therefore P(B) = 0 or 1, whence P(A) = 0 or 1. The converse is evident, since 𝒥 ⊆ 𝒥*. This completes the proof of the lemma.

Theorem 1. Let T be a measure-preserving transformation. Then the following conditions are equivalent:
(1) T is ergodic;
(2) every almost invariant random variable is (P-a.s.) constant;
(3) every invariant random variable is (P-a.s.) constant.

PROOF. (1) ⇒ (2). Let T be ergodic and ξ almost invariant, i.e. (P-a.s.) ξ(ω) = ξ(Tω). Then for every c ∈ R we have A_c = {ω: ξ(ω) ≤ c} ∈ 𝒥*, and then P(A_c) = 0 or 1 by Lemma 2. Let C = sup{c: P(A_c) = 0}. Since A_c ↑ Ω as c ↑ ∞ and A_c ↓ ∅ as c ↓ −∞, we have |C| < ∞. Then

    P{ω: ξ(ω) < C} = P{ ∪_{n≥1} {ξ(ω) ≤ C − 1/n} } = 0,

and similarly P{ω: ξ(ω) > C} = 0. Consequently P{ω: ξ(ω) = C} = 1.

(2) ⇒ (3). Evident.

(3) ⇒ (1). Let A ∈ 𝒥; then I_A is an invariant random variable and therefore, (P-a.s.), I_A = 0 or I_A = 1, whence P(A) = 0 or 1.

Remark. The conclusion of the theorem remains valid when "random variable" is replaced by "bounded random variable".

We illustrate the theorem with the following example.

EXAMPLE. Let Ω = [0, 1), ℱ = ℬ([0, 1)), let P be Lebesgue measure, and let Tω = (ω + λ) mod 1. Let us show that T is ergodic if and only if λ is irrational.

Let ξ = ξ(ω) be an invariant random variable with Eξ²(ω) < ∞. Then we know that the Fourier series Σ_{n=−∞}^∞ c_n e^{2πinω} of ξ(ω) converges in the mean-square sense, Σ|c_n|² < ∞, and, because T is a measure-preserving transformation (Example 2, §1) and ξ(ω) = ξ(Tω), we have (Problem 1, §1)

    c_n = E ξ(ω)e^{−2πinω} = E ξ(Tω)e^{−2πinTω} = e^{−2πinλ} E ξ(Tω)e^{−2πinω} = e^{−2πinλ} E ξ(ω)e^{−2πinω} = c_n e^{−2πinλ}.

So c_n(1 − e^{−2πinλ}) = 0. By hypothesis, λ is irrational, and therefore e^{−2πinλ} ≠ 1 for all n ≠ 0. Therefore c_n = 0, n ≠ 0, so ξ(ω) = c₀ (P-a.s.), and T is ergodic by Theorem 1.

On the other hand, let λ be rational, i.e. λ = k/m, where k and m are integers. Consider the set

    A = ∪ {ω: j/(2m) ≤ ω < (j + 1)/(2m)},  j = 0, 2, ..., 2m − 2.

It is clear that this set is invariant, but P(A) = 1/2. Consequently T is not ergodic.
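The dichotomy in this example can be seen numerically. Anticipating the ergodic theorem of §3, the following sketch (ours, not part of the text) computes time averages of f(ω) = I_{[0, 1/3)}(ω) along an orbit of the rotation: for irrational λ they approach the space average Ef = 1/3, while for the rational shift λ = 1/4 the orbit visits only four points and the average stays at 1/4.

```python
import numpy as np

# Time averages along an orbit of T(w) = (w + lam) mod 1, with
# f(w) = 1_{[0, 1/3)}(w): for irrational lam the average approaches
# E f = 1/3; for lam = 1/4 the orbit is periodic and the average is not 1/3.

def time_average(w0, lam, n):
    """Average of f(T^k w0), k = 0..n-1, with f the indicator of [0, 1/3)."""
    w, s = w0, 0
    for _ in range(n):
        s += (w < 1 / 3)
        w = (w + lam) % 1.0
    return s / n

n = 100_000
print("irrational lam:", time_average(0.1, np.sqrt(2) - 1, n))   # close to 1/3
print("rational lam:  ", time_average(0.1, 0.25, n))             # stays at 0.25
```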

2.

Definition 4. A measure-preserving transformation T is mixing (or has the mixing property) if, for all A and B ∈ ℱ,

    lim_{n→∞} P(A ∩ T^{-n}B) = P(A)P(B).    (1)

The following theorem establishes a connection between ergodicity and mixing.

Theorem 2. Every mixing transformation T is ergodic.

PROOF. Let A ∈ ℱ and B ∈ 𝒥. Then B = T^{-n}B, n ≥ 1, and therefore

    P(A ∩ T^{-n}B) = P(A ∩ B)

for all n ≥ 1. Because of (1), P(A ∩ B) = P(A)P(B). Taking A = B, we find that P(B) = P²(B), and consequently P(B) = 0 or 1. This completes the proof.

3. PROBLEMS

1. Show that a random variable ξ is invariant if and only if it is 𝒥-measurable.

2. Show that a set A is almost invariant if and only if either P(T^{-1}A \ A) = 0 or P(A \ T^{-1}A) = 0.

3. Show that the transformation considered in the example of Subsection 1 of the present section is not mixing.

4. Show that a transformation T is mixing if and only if, for all random variables ξ and η with Eξ² < ∞ and Eη² < ∞,

    E ξ(T^n ω)η(ω) → Eξ · Eη,  n → ∞.

§3. Ergodic Theorems

1.

Theorem 1 (Birkhoff and Khinchin). Let T be a measure-preserving transformation and ξ = ξ(ω) a random variable with E|ξ| < ∞. Then (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) = E(ξ | 𝒥).    (1)

If also T is ergodic, then (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) = Eξ.    (2)

The proof given below is based on the following proposition, whose simple proof was given by A. Garsia (1965).

Lemma (Maximal Ergodic Theorem). Let T be a measure-preserving transformation, let ξ be a random variable with E|ξ| < ∞, and let

    S_k(ω) = ξ(ω) + ξ(Tω) + ··· + ξ(T^{k−1}ω),
    M_k(ω) = max{0, S₁(ω), ..., S_k(ω)}.

Then, for every n ≥ 1,

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ 0.

PROOF. If n ≥ k, we have M_n(Tω) ≥ S_k(Tω) and therefore ξ(ω) + M_n(Tω) ≥ ξ(ω) + S_k(Tω) = S_{k+1}(ω). Since it is evident that ξ(ω) ≥ S₁(ω) − M_n(Tω), we have

    ξ(ω) ≥ max(S₁(ω), ..., S_n(ω)) − M_n(Tω).

Therefore

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ E[(max(S₁(ω), ..., S_n(ω)) − M_n(Tω)) I_{{M_n > 0}}(ω)].

But max(S₁, ..., S_n) = M_n on the set {M_n > 0}. Consequently,

    E[ξ(ω) I_{{M_n > 0}}(ω)] ≥ E[(M_n(ω) − M_n(Tω)) I_{{M_n > 0}}(ω)] ≥ E[M_n(ω) − M_n(Tω)] = 0,

since if T is a measure-preserving transformation we have EM_n(ω) = EM_n(Tω) (Problem 1, §1). This completes the proof of the lemma.

PROOF OF THE THEOREM. Let us suppose that E(ξ | 𝒥) = 0 (otherwise replace ξ by ξ − E(ξ | 𝒥)). Let η̄ = lim sup (S_n/n) and η̲ = lim inf (S_n/n). It will be enough to establish that (P-a.s.)

    0 ≤ η̲ ≤ η̄ ≤ 0.

Consider the random variable η̄ = η̄(ω). Since η̄(ω) = η̄(Tω), the variable η̄ is invariant, and consequently, for every ε > 0, the set A_ε = {ω: η̄(ω) > ε} is also invariant. Let us introduce the new random variable

    ξ*(ω) = (ξ(ω) − ε) I_{A_ε}(ω),

and put

    S_k*(ω) = ξ*(ω) + ··· + ξ*(T^{k−1}ω),    M_k*(ω) = max(0, S₁*, ..., S_k*).

Then, by the lemma,

    E[ξ* I_{{M_n* > 0}}] ≥ 0

for every n ≥ 1. But as n → ∞,

    {M_n* > 0} = { max_{1≤k≤n} S_k* > 0 } ↑ { sup_{k≥1} S_k* > 0 } = { sup_{k≥1} S_k*/k > 0 } = { sup_{k≥1} S_k/k > ε } ∩ A_ε = A_ε,

where the last equation follows because sup_{k≥1} (S_k/k) ≥ η̄ and A_ε = {ω: η̄ > ε}. Moreover, E|ξ*| ≤ E|ξ| + ε. Hence, by the dominated convergence theorem,

    E[ξ* I_{{M_n* > 0}}] → E[ξ* I_{A_ε}],  n → ∞.

Thus

    0 ≤ E[ξ* I_{A_ε}] = E[(ξ − ε) I_{A_ε}] = E[ξ I_{A_ε}] − εP(A_ε) = E[E(ξ | 𝒥) I_{A_ε}] − εP(A_ε) = −εP(A_ε),

so that P(A_ε) = 0 and therefore P(η̄ ≤ 0) = 1. Similarly, if we consider −ξ(ω) instead of ξ(ω), we find that

    lim sup (−S_n/n) = −lim inf (S_n/n) = −η̲,

1 n-1 lim- L P(A n r-kB) = P(A)P(B). n

n k=O

if and only if, for (3)

To prove the ergodicity of Twe use A = BE J in (3). Then A n T- kB = B and therefore P(B) = P 2 (B), i.e. P(B) = 0 or 1. Conversely, let T be ergodic. Then if we apply (2) to the random variable~ = Ja(w), where BE f7, we find that (P-a.s.) 1 n-1 lim- L IT-kiw) = P(B). n

n k=O

384

V. Stationary (Strict Sense) Random Sequences and Ergodic Theory

If we now integrate both sides over A theorem, we obtain (3) as required.

E

fF and use the dominated convergence

2. We now show that, under the hypotheses of Theorem 1, there is not only almost sure convergence in (1) and (2), but also convergence in mean. (This result will be used below in the proof of Theorem 3.)

Theoreml. LetT be a measure-preserving transformation and let random variable with EI I < 00. Then

e

E

~~ :t>(Tkw)- E(el~ I--. 0,

e=

e(w) be a

n--. oo.

(4)

If also T is ergodic, then n--. oo.

(5)

PRooF. For every e > 0 there is a bounded random variable such that E1e- '71:::;; e. Then

'7( I'7(w) I :::;; M)

(6)

Since I'71 :::;; M, then by the dominated convergence theorem and by using (1) we find that the second term on the right of (6) tends to zero as n--. oo. The first and third terms are each at most e. Hence for sufficiently large n the lefthand side of(6) is less than 2e, so that (4) is proved. Finally, if Tis ergodic, then (5) follows from (4) and the remark that E(eii) = Ee (P-a.s.). This completes the proof of the theorem. 3. We now turn to the question of the validity of the ergodic theorem for )defined on a probstationary (strict sense) random sequences = <el> ability space (Q, !F, P). In general, (Q, !F, P) need not carry any measurepreserving transformations, so that it is not possible to apply Theorem 1 directly. However, as we observed in §1, we can construct a coordinate probability space (fi, §;, P), random variables ~ = (~ 1 , ~ 2 , .•• ), and a measurepreserving transformation T such that ~n(w) = ~ 1 (T"- 1 w) and the distributions of and ~ are the same. Since such properties as almost sure convergence and convergence in the mean are defined only for proba1w) (P-a.s. bility distributions, from the convergence of (1/n) 1 ~1 and in mean) to a random variable ij it follows that (1/n) also 1 converges (P-a.s. and in mean) to a random variable '1 such that '1 :4: ij. It

e

ez, .. .

e

Li=

(Tk-

:D= ek<w)

385

§3. Ergodic Theorems

follows from Theorem 1 that if EI~ 1 I < oo then = E( ~ 1 If), where .J is a collection of invariant sets (E is the average with respect to the measure P). We now describe the structure of '1·

Definition 1. A set A E §i' is invariant with respect to the sequence~ if there is a set BE 86(R oo) such that for n ~ 1 A = {w:

(~., ~n+ 1, ...) E

B}.

The collection of all such invariant sets is a a-algebra, denoted by

J~.

Definition 2. A stationary sequence ~ is ergodic if the measure of every invariant set is either 0 or 1. Let us now show that the random variable 1J can be taken equal toE(~ 1 IJ ~). In fact, let A E f ~. Then since

L

1 n-1

E1~k n k~ 1 we have

f ~k ~ f

!n I

k~ 1

A

{w;

dP

A

A

(7)

'1 dP.

= {w: (~k• ~k+ 1, ... ) E B}

Let B Eei(R 00 ) be such that A since ~ is stationary,

f ~k dP = Jl

I

'1 ~ 0,

-

(~k. ~k+ 1 ....)eB}

~k dP =

Hence it follows from (7) that for all A

that '1 = E(¢ 1 1...¢';). Here E(¢ 1 1...¢';) =

E

f{ro;(~t. ~2 J

~,

... ·)eB}

for all k ~ 1. Then

~1 dP =

f ~1 dP. A

which implies (see §7, Chapter II)

E~ 1 if~

is ergodic.

Therefore we have proved the following theorem.

Theorem 3 (Ergodic Theorem). Let ξ = (ξ₁, ξ₂, ...) be a stationary (strict sense) random sequence with E|ξ₁| < ∞. Then (P-a.s., and in the mean)

    lim_n (1/n) Σ_{k=1}^n ξ_k(ω) = E(ξ₁ | 𝒥_ξ).

If ξ is also an ergodic sequence, then (P-a.s., and in the mean)

    lim_n (1/n) Σ_{k=1}^n ξ_k(ω) = Eξ₁.
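Theorem 3 can be illustrated numerically. The sketch below (ours, not part of the text) builds a simple stationary ergodic sequence that is not i.i.d., namely the one-dependent sequence ξ_k = ε_k + ½ε_{k+1} with (ε_k) i.i.d. uniform on [0, 1); its time averages should approach Eξ₁ = 3/2 · Eε₁ = 0.75.

```python
import numpy as np

# Illustration of Theorem 3: for the stationary ergodic (one-dependent)
# sequence xi_k = eps_k + 0.5 * eps_{k+1}, with eps_k i.i.d. uniform on
# [0, 1), the time averages (1/n) sum xi_k approach E xi_1 = 0.75.

rng = np.random.default_rng(0)
n = 200_000
eps = rng.random(n + 1)
xi = eps[:-1] + 0.5 * eps[1:]          # a stationary, non-i.i.d. sequence
avg = xi.mean()
print("time average:", avg, " expected:", 0.75)
```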

4. PROBLEMS

1. Let ξ = (ξ₁, ξ₂, ...) be a Gaussian stationary sequence with Eξ_n = 0 and covariance function R(n) = Eξ_{k+n}ξ_k. Show that R(n) → 0 is a sufficient condition for ξ to be ergodic.

2. Show that every sequence ξ = (ξ₁, ξ₂, ...) of independent identically distributed random variables is ergodic.

3. Show that a stationary sequence ξ is ergodic if and only if, for every B ∈ ℬ(R),

    (1/n) Σ_{l=1}^n I_B(ξ_l) → P{ξ_k ∈ B}  (P-a.s.),  k = 1, 2, ....

CHAPTER VI

Stationary (Wide Sense) Random Sequences. L²-Theory

§1. Spectral Representation of the Covariance Function

1. According to the definition given in the preceding chapter, a random sequence ξ = (ξ₁, ξ₂, ...) is stationary in the strict sense if, for every set B ∈ ℬ(R^∞) and every n ≥ 1,

    P{(ξ₁, ξ₂, ...) ∈ B} = P{(ξ_{n+1}, ξ_{n+2}, ...) ∈ B}.    (1)

It follows, in particular, that if Eξ₁² < ∞ then Eξ_n is independent of n:

    Eξ_n = Eξ₁,    (2)

and the covariance cov(ξ_{n+m}, ξ_n) = E(ξ_{n+m} − Eξ_{n+m})(ξ_n − Eξ_n) depends only on m:

    cov(ξ_{n+m}, ξ_n) = cov(ξ_{1+m}, ξ₁).    (3)

In the present chapter we study sequences that are stationary in the wide sense (and have finite second moments), namely those for which (1) is replaced by the (weaker) conditions (2) and (3). The random variables ξ_n are understood to be defined for n ∈ ℤ = {0, ±1, ...} and to be complex-valued. The latter assumption not only does not complicate the theory, but makes it more elegant. It is also clear that results for real random variables can easily be obtained as special cases of the corresponding results for complex random variables.

Let H² = H²(Ω, ℱ, P) be the space of (complex) random variables ξ = α + iβ, α, β ∈ R, with E|ξ|² < ∞, where |ξ|² = α² + β². If ξ and η ∈ H², we put

    (ξ, η) = E ξη̄,    (4)

where η̄ = α − iβ is the complex conjugate of η = α + iβ, and

    ‖ξ‖ = (ξ, ξ)^{1/2}.    (5)

As for real random variables, the space H² (more precisely, the space of equivalence classes of random variables; compare §§10 and 11 of Chapter II) is complete under the scalar product (ξ, η) and norm ‖ξ‖. In accordance with the terminology of functional analysis, H² is called the complex (or unitary) Hilbert space (of random variables considered on the probability space (Ω, ℱ, P)). If ξ, η ∈ H², their covariance is

    cov(ξ, η) = E(ξ − Eξ)(η̄ − Eη̄).    (6)

It follows from (4) and (6) that if Eξ = Eη = 0 then

    cov(ξ, η) = (ξ, η).    (7)

Definition. A sequence of complex random variables ξ = (ξ_n)_{n∈ℤ} with E|ξ_n|² < ∞, n ∈ ℤ, is stationary (in the wide sense) if, for all n ∈ ℤ,

    Eξ_n = Eξ₀,    cov(ξ_{k+n}, ξ_k) = cov(ξ_n, ξ₀),  k ∈ ℤ.    (8)

As a matter of convenience, we shall always suppose that Eξ₀ = 0. This involves no loss of generality, but does make it possible (by (7)) to identify the covariance with the scalar product and hence to apply the methods and results of the theory of Hilbert spaces.

Let us write

    R(n) = cov(ξ_n, ξ₀),  n ∈ ℤ,    (9)

and (assuming R(0) = E|ξ₀|² ≠ 0)

    ρ(n) = R(n)/R(0),  n ∈ ℤ.    (10)

We call R(n) the covariance function, and ρ(n) the correlation function, of the sequence ξ (assumed stationary in the wide sense). It follows immediately from (9) that R(n) is nonnegative-definite, i.e. for all complex numbers a₁, ..., a_m and t₁, ..., t_m ∈ ℤ, m ≥ 1, we have

    Σ_{i,j=1}^m a_i ā_j R(t_i − t_j) ≥ 0.    (11)

It is then easy to deduce (either from (11) or directly from (9)) the following properties of the covariance function (see Problem 1):

    R(0) ≥ 0,    R(−n) = R̄(n),    |R(n)| ≤ R(0),
    |R(n) − R(m)|² ≤ 2R(0)[R(0) − Re R(n − m)].    (12)
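Properties (11) and (12) can be checked numerically for a concrete covariance function. The sketch below (ours; the choice R(n) = a^{|n|} anticipates Example 5, where this form arises for a first-order autoregression up to a constant factor) verifies that the matrix (R(t_i − t_j)) is nonnegative-definite and that the inequality in (12) holds.

```python
import numpy as np

# Numerical check (a sketch) of (11) and (12) for R(n) = a^{|n|}.

a = 0.6
R = lambda n: a ** abs(n)

# (11): the matrix (R(t_i - t_j)) must be nonnegative-definite.
t = np.arange(20)
M = np.array([[R(i - j) for j in t] for i in t])
print("smallest eigenvalue:", np.linalg.eigvalsh(M).min())   # nonnegative

# (12): |R(n) - R(m)|^2 <= 2 R(0) [R(0) - Re R(n - m)]
for n in range(-5, 6):
    for m in range(-5, 6):
        assert abs(R(n) - R(m)) ** 2 <= 2 * R(0) * (R(0) - R(n - m).real) + 1e-12
print("inequality (12) verified for n, m in -5..5")
```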


2. Let us give some examples of stationary sequences ξ = (ξ_n)_{n∈ℤ}. (From now on, the words "in the wide sense" and the statement n ∈ ℤ will both be omitted.)

EXAMPLE 1. Let ξ_n = ξ₀ · g(n), where Eξ₀ = 0, E|ξ₀|² = 1, and g = g(n) is a function. The sequence ξ = (ξ_n) will be stationary if and only if g(k + n)ḡ(k) depends only on n. Hence it is easy to see that there is a λ such that

    g(n) = g(0)e^{iλn}.

Consequently the sequence of random variables ξ_n = ξ₀ · g(0)e^{iλn} is stationary with

    R(n) = |g(0)|² e^{iλn}.

In particular, the random "constant" ξ_n ≡ ξ₀ is a stationary sequence.

where z 1, •.• , zN are orthogonal (Ez;zi = 0, i =F j) random variables with zero means and E lzkl 2 =a~ > 0; -n ~ Ak < n, k = 1, ... , N; A; =F Ai, i =F j. The sequence~ = (~n) is stationary with N

L a~eiJ.•n.

R(n) =

(14)

k=l

As a generalization of (13) we now suppose that ~n

L 00

=

k=-

zkei).•n,

(15)

00

where zk, k E Z, have the same properties as in (13). If we suppose that oo a~ < oo, the series on the right of (15) converges in mean-square and

Lr'= _

R(n)

L 00

=

k=-

a~e;;.•n.

(16)

00

Let us introduce the function

L

F(A) =

af.

(17)

{k: Ak:5 ).)

Then the covariance function (16) can be written as a Lebesgue-Stieltjes integral,

R(n)

=

r"

e;;.n dF(A.).

(18)

390

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

The stationary sequence (15) is represented as a sum of "harmonics" ei).kn with "frequencies" Ak and random "amplitudes" zk of "intensities" (Jl = E jzkj 2. Consequently the values of F(A) provide complete information on the "spectrum" of the sequence ~' i.e. on the intensity with which each frequency appears in (15). By (18), the values of F(A) also completely determine the structure of the covariance function R(n). Up to a constant multiple, a (nondegenerate) F(A) is evidently a distribution function, which in the examples considered so far has been piecewise constant. It is quite remarkable that the covariance function of every stationary (wide sense) random sequence can be represented (see the theorem in Subsection 3) in the form (18), where F(A) is a distribution function (up to normalization), whose support is concentrated on [- n, n), i.e. F(A) = 0 for A < -nand F(A) = F(n) for A > n. The result on the integral representation of the covariance function, if compared with (15) and (16), suggests that every stationary sequence also admits an "integral" representation. This is in fact the case, as will be shown in §3 by using what we shall learn to call stochastic integrals with respect to orthogonal stochastic measures (§2). EXAMPLE 3 (White noise). Let e = (en) be an orthonormal sequence of random variables, Een = 0, Eeiei = bii' where bii is the Kronecker delta. Such a sequence is evidently stationary, and R(n) =

{10,

0,

n= n # 0.

Observe that R(n) can be represented in the form R(n) =

f/""

(19)

dF(A),

where

F(A)

=

f/(v) dv;

1

j(A) = 2n'

-n::;; A< n.

(20)

Comparison of the spectral functions (17) and (20) shows that whereas the spectrum in Example 2 is discrete, in the present example it is absolutely continuous with constant "spectral density" j(A) tn. In this sense we can say that the sequence e = (en) "consists of harmonics of equal intensities." It is just this property that has led to calling such a sequence e = (en) "white noise" by analogy with white light, which consists of different frequencies with the same intensities.

=

EXAMPLE 4 (Moving averages). Starting from the white noise ε = (ε_n) introduced in Example 3, let us form the new sequence

    ξ_n = Σ_{k=−∞}^∞ a_k ε_{n−k},    (21)


where the a_k are complex numbers such that Σ_{k=−∞}^∞ |a_k|² < ∞. By Parseval's equation,

    cov(ξ_{n+m}, ξ_m) = cov(ξ_n, ξ₀) = Σ_{k=−∞}^∞ a_{n+k} ā_k,

so that ξ = (ξ_k) is a stationary sequence, which we call the sequence obtained from ε = (ε_k) by a (two-sided) moving average. In the special case when the a_k with negative index are zero, i.e.

    ξ_n = Σ_{k=0}^∞ a_k ε_{n−k},

the sequence ξ = (ξ_n) is a one-sided moving average. If, in addition, a_k = 0 for k > p, i.e. if

    ξ_n = a₀ε_n + a₁ε_{n−1} + ··· + a_p ε_{n−p},    (22)

then ξ = (ξ_n) is a moving average of order p. We can show (Problem 5) that (22) has a covariance function of the form R(n) = ∫_{−π}^{π} e^{iλn} f(λ) dλ, where the spectral density is

    f(λ) = (1/2π) |P(e^{−iλ})|²,    (23)

with

    P(z) = a₀ + a₁z + ··· + a_p z^p.

+ bl~n-1 + ''' + bq~n-q =

Sn.

(24)

Under what conditions on b 1 , ••• , bn can we say that (24) has a stationary solution? To find an answer, let us begin with the case q = 1: (25) where a.= -b 1 . If la.l < 1, it is easy to verify that the stationary sequence ~ = (~n) with

L a.ien00

~n =

j

(26)

j=O

is a solution of (25). (The series on the right of (26) converges in mean-square.) Let us now show that, in the class of stationary sequences ~ = (~") (with finite second moments) this is the only solution. In fact, we find from (25), by successive iteration, that

392

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Hence it follows that

k-+

00.

Therefore when Ia I < 1 a stationary solution of (25) exists and is representable as the one-sided moving average (26). There is a similar result for every q > 1 : if all the zeros of the polynomial (27) lie outside the unit disk, then the autoregression equation (24) has a unique stationary solution, which is representable as a one-sided moving average (Problem 2). Here the covariance function R(n) can be represented (Problem 5) in the form R(n) =

f/An dF().),

F().) =

f/(v)

dv,

(28)

where

(29) In the special case q = 1, we find easily from (25) that Eeo = 0, 2

Eeo

= t -

1

lo::l2'

and R(n)

a"

= 1-lo::l2'

n~O

(when n < 0 we have R(n) = R(- n)), Here

1 1 J().) = 2n '11 - ae-;;.1 2 ' EXAMPLE 6.

This example illustrates how autoregression arises in the construction of probabilistic models in hydrology. Consider a body of water; we try to construct a probabilistic model of the deviations of the level ofthe water from its average value because of variations in the inflow and evaporation from the surface. If we take a year as the unit of time and let H n denote the water level in yearn, we obtain the following balance equation:

Hn+ 1 = Hn - KS(H")

+ ~n+ 1•

(30)

where ~n+ 1 is the inflow in year (n + 1), S(H) is the area of the surface of the water at level H, and K is the coefficient of evaporation.

393

§I. Spectral Representation of the Covariance Function

Let ~n = Hn - H be the deviation from the mean level (which is obtained from observations over many years) and suppose that S(H) = S(H) + c(H - H). Then it follows from the balance equation that ~n satisfies (31)

with a = 1 - cK, en = :En - KS(H). It is natural to assume that the random variables ~>n have zero means and are identically distributed. Then, as we showed in Example 5, equation (31) has (for Ia I < 1) a unique stationary solution, which we think of as the steady-state solution (with respect to time in years) of the oscillations of the level in the body of water. As an example of practical conclusions that can be drawn from a (theoretical) model (31 ), we call attention to the possibility of predicting the level for the following year from the results of the observations of the present and preceding years. It turns out (see also Example 2 in §6) that (in the meansquare sense) the optimal linear estimator of ~n+ 1 in terms of the values of •.• , ~n-l• ~n is simply O!~n· 7 (Autoregression and moving average (mixed model)). If we suppose that the right-hand side of (24) contains a0 Bn + a 1Bn- 1 + · · · + apsn- P instead of en, we obtain a mixed model with autoregression and moving average of order (p, q):

ExAMPLE

~n

+ b1~n-1 + · ·· + bq~n-q = aoBn + a1Bn-1 + ··· + apBn-p·

(32)

Under the same hypotheses as in Example 5 on the zeros it will be shown later (Corollary 2 to Theorem 3 of§3) that (32) has the stationary solution~ = (~") e;;.n dF(A.) with F(A.) = for which the covariance function is R(n) = J~,f(v) dv, where

J':,

3. Theorem (Herglotz). Let R(n) be the covariance function of a stationary

(wide sense) random sequence with zero mean. Then there is, on

([ -n, n), g6([ -n, n))), a finite measure F = F(B),

BE g6([ -n,

R(n) = PRooF. For N

~

n)), such that for every n E 7L

f/;." F(dA.).

(33)

1 and A. E [ -n, n], put

1

fN(A.) = 2 N 1t

N

N

I I

k=1 1=1

R(k - T)e- ik;.ew'.

(34)

394

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Since R(n) is nonnegative definite, fN(A.) is nonnegative. Since there are N - Im I pairs (k, I) for which k - l = m, we have

L

fN(A.) = __!_ 2n lmi
(1 - ~)R(m)e-im'-. N

(35)

Let BE~([ -n,

n)).

Then

lnl < lnl

N,} (36)

~N.

The measures F N• N ~ 1, are supported on the interval [ -n, n] and F N([ -n, n]) = R(O) < oo for all N ~ 1. Consequently the family of measures {FN}, N ~ 1, is tight, and by Prohorov's theorem {Theorem 1 of §2 of Chapter III) there are a sequence {Nk}

£;

{N} and a measure F such that

F Nk ~ F. {The concepts of tightness, relative compactness, and weak

convergence, together with Prohorov's theorem, can be extended in an obvious way from probability measures to any finite measures.) It then follows from (36) that f/'"nF(dA.)

=

N~~oo f~/'-npNk(dA.) =

R(n).

The measure F so constructed is supported on [ -n, n]. Without changing the integral J~oo ei'-nF(dA.), we can redefine F by transferring the "mass" F( { n} ), which is concentrated at n, to -n. The resulting new measure (which we again denote by F) will be supported on [ -n, n). This completes the proof of the theorem.

Remark 1. The measure F = F(B) involved in (33) is known as the spectral measure, and F(A.) = F([ -n, A.]) as the spectral function, of the stationary sequence with covariance function R(n). In Example 2 above the spectral measure was discrete (concentrated at A.k, k = 0, ± 1, ...). In Examples 3-6 the spectral measures were absolutely continuous.

Remark 2. The spectral measure F is uniquely defined by the covariance function. In fact, let F 1 and F 2 be two spectral measures and let f/'-nFl(dA.)

= f/'"nFz(dA.),

n E Z.

§2. Orthogonal Stochastic Measures and Stochastic Integrals

395

Since every bounded continuous function g(A.) can be uniformly approximated on [ -n, n) by trigonometric polynomials, we have {,.g(A.)F 1(dA.) = {,.g(A.)FidA.). It follows (compare the proof in Theorem 2, §12, Chapter II) that F 1(B) =FiB) for all Be&l([ -n, n)).

Remark 3. If e=(en) is a stationary sequence of real random variables en, then R(n) = {,.cos A.n F(dA.).

4.

PROBLEMS

1. Derive (12) from (11). 2. Show that the autoregression equation (24) has a stationary solution if all the zeros ofthe polynomial Q(z) defined by (27) lie outside the unit disk.

3. Prove that the covariance function (28) admits the representation (29) with spectral density given by (30). 4. Show that the sequence

e= (e.) of random variables, where e. = I <Xl

k=l

and cxk and f3k are real random variables, can be represented in the form

e. = I

<Xl

zkeu••

k=-oo

with zk = !(f3k- icxk) fork~ 0 and zk = z_k, A.k = -A.-k fork < 0. 5. Show that the spectral functions of the sequences (22) and (24) have densities given respectively by (23) and (29). 6. Show that if I IR(n) I < oo, the spectral function F(A.) has density f(A.) given by 1

f(A.} = - I e-i.
§2. Orthogonal Stochastic Measures and Stochastic Integrals 1. As we observed in §1, the integral representation of the covariance function and the example of a stationary sequence

en=

I

00

k=-oo

zke;;. ••

(1)

396

VI. Stationary (Wide Sense) Random Sequences. £ 2 -Theory

with pairwise orthogonal random variables zk, k e 7L, suggest the possibility of representing an arbitrary stationary sequence as a corresponding integral generalization of (1). If we put (2) Z(A.) = zk,

L

{k:.l.kS.l.}

we can rewrite (1) in the form co

e. = L

k= -co

=

eilk".1Z(A.k),

(3)

where .1Z(A.k) Z(A.k) - Z(A.k -) = zk. The right-hand side of (3) reminds us of an approximating sum for an integral j'':., ei-.n dZ(A.) of Riemann-Stieltjes type. However, in the present case Z(A.) is a random function (it also depends on w). Hence it is clear that for an integral representation of a general stationary sequence we need to use functions Z(A.) that do not have bounded variation for each ro. Consequently the simple interpretation of eiAn dZ(A.) as a Riemann-Stieltjes integral for each w is inapplicable.

J':.,

2. By analogy with the general ideas of the Lebesgue, Lebesgue-Stieltjes and Riemann-Stieltjes integrals (§6, Chapter II), we begin by defining stochastic measure. Let (Q, JF, P) be a probability space, and let E be a subset, with an algebra S 0 of subsets and the u-algebra S generated by S 0 •

Definition 1. A complex-valued function Z(.1) = Z(w; .1), defined for wen and .1 e S 0 , is a finitely additive stochastic measure if (1) E I Z(.1) 12 < oo for every .1 E S 0 ; (2) for every pair A 1 and .12 of disjoint sets in tff 0 ,

Z(.11

+ .12) =

Z(.1 1)

+ Z(.12)

(P-a.s.)

(4)

Definition 2. A finitely additive stochastic measure Z(.1) is an elementary stochastic measure if, for all disjoint sets .1 1, .1 2, ... of tff 0 such that .1 = Lk'= 1 .1k E tff 0• n--+ oo.

(5)

Remark 1. In this definition of an elementary stochastic measure on subsets of S 0 , it is assumed that its values are in the Hilbert space H 2 = H 2 (il, JF, P), and that countable additivity is understood in the mean-square sense (5). There are other definitions of stochastic measures, without the requirement of the existence of second moments, where countable additivity is defined (for example) in terms of convergence in probability or with probability one.

397

§2. Orthogonal Stochastic Measures and Stochastic Integrals

Remark 2. In analogy with nonstochastic measures, one can show that for finitely additive stochastic measures the condition (5) of countable additivity (in the mean-square sense) is equivalent to continuity (in the mean-square sense) at "zero": (6)

A particularly important class of elementary stochastic measures consists of those that are orthogonal according to the following definition.

Definition 3. An elementary stochastic measure (or.a measure with orthogonal values) if

Z(~), ~ E

fff 0 , is orthogonal

(7)

for every pair of disjoint sets

~1

and

EZ(~ 1 )Z(~ 2 )

for all

~1

and

~2

~2

in fff 0 ; or, equivalently, if

= E IZ(~1

n ~2W

(8)

in fff 0 .

We write m(~)

= E IZ(~W,

(9)

For elementary orthogonal stochastic measures, the set function m = m(~), ~ E fff 0 , is, as is easily verified, a finite measure, and consequently by Caratheodory's theorem (§3, Chapter II) it can be extended to (E, f%). The resulting measure will again be denoted by m = m(~) and called the structure function (of the elementary orthogonal stochastic measure Z = Z(~). ~ E

fff 0 ).

The following question now arises naturally: since the set function m = m(~) defined on (E, fff 0 ) admits an extension to (E, Iff), where Iff = a(fff 0 ), cannot an elementary orthogonalstochastic measure Z = Z(~), ~ E fff 0 , be extended to sets ~in E in such a way that EI Z(~) 12 = m(~), ~ E fff? The answer is affirmative, as follows from the construction given below. This construction, at the same time, leads to the stochastic integral which we need for the integral representation of stationary sequences. 3. Let Z = Z(~) be an elementary orthogonal stochastic measure, with structure function m. = m(~), ~ E f%. For every function

~ E fff 0 ,

(10)

with only a finite number of different (complex) values, we define the random variable

398

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Let L 2 = L 2 (E, 8, m) be the Hilbert space of complex-valued functions with the scalar product

(f, g)

=

L

f(A.)g(A.)m(dA.)

and the norm 11!11 = (f,J) 112 , and let H 2 = H 2 (0., $', P) be the Hilbert space of complex-valued random variables with the scalar product

=

(~, t~)

E~ii

and the norm 11~11 = (~, ~) 1 ' 2 . Then it is clear that, for every pair of functions

(J(f), J(g))

f

and g of the form (10),

= (f, g)

and IIJ(f)ll 2 = 11!11 2 = 11f
n,m ..... oo.

Therefore the sequence {J(f,.)} is fundamental in the mean-square sense and by Theorem 7, §10, Chapter II, there is a random variable (denoted by J(f)) such that J(f) E H 2 and IIJ(J,.) - f(f)ll ..... 0, n ..... oo. The random variable f(f) constructed in this way is uniquely defined (up to stochastic equivalence) and is independent of the choice of the approximating sequence {J,.}. We call it the stochastic integral off E L 2 with respect to the elementary orthogonal stochastic measure Z and denote it by f( f) =

f

fCA.)Z(dA.).

We note the following basic properties of the stochastic integral J(f); these are direct consequences of its construction (Problem 1). Let g, f, and fnEL 2 . Then (J(f), J(g)) = (f, g); IIJ(f)ll

J(af + bg)

= 11!11;

= aJ(f) + bJ(g) (P-a.s.)

(11) (12) (13)

where a and bare constants; IIJ(J,.)- J(f)ll ..... o, if II fn - f II ..... o, n ..... oo.

(14)

399

§2. Orthogonal Stochastic Measures and Stochastic Integrals

4. Let us use the preceding definition of the stochastic integral to extend the elementary stochastic measure Z(A), A E C 0 , to sets inC = u(C 0 ). Since m is assumed to be finite, we have I d = I d(A.) E L 2 for all A E C. Write Z(A) = J(I d). It is clear that Z(A) = Z(A) for A E C 0 . It follows from (13) that if At n A2 = 0 for At and A2 E C, then Z(At

+ A2 )

= Z(At)

+ Z(A 2 )

(P-a.s.)

and it follows from (12) that EIZ(A)I 2 = m(A),

AEC.

Let us show that the random set function Z(A), A E C, is countably additive in the mean-square sense. In fact, let Ak E C and A = Lk"= t Ak. Then n

Z(A) -

L Z(Ak) = J(gn),

k= t

where n

9n(A) = Id(A.) -

L Idk(A) =

Il:n(A.),

k= t

But n-+ oo,

i.e. n

E I Z(A) -

L Z(Ak) 2 -+ 0, 1

n-+ oo.

k=1

It also follows from (11) that

when At n A2 = 0, At, A2 EC. Thus our function Z(A), defined on A E C, is countably additive in the mean-square sense and coincides with Z(A) on the sets A E C 0 • We shall call Z(A), A E C, an orthogonal stochastic measure (since it is an extension of the elementary orthogonal stochastic measure Z(A)) with respect to the structure function m(A), A E C; and we call the integral J(f) = .fE f(A.)Z(dA.), defined above, a stochastic integral with respect to this measure. 5. We now consider the case (E, C) = (R, ~(R)), which is the most important for our purposes. As we know (§3, Chapter II), there is a one-to-one correspondence between finite measures m = m(A) on (R, ~(R)) and certain (generalized) distribution functions G = G(x), with m(a, b] = G(b)- G(a). It turns out that there is something similar for orthogonal stochastic measures. We introduce the following definition.


VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Definition 4. A set of (complex-valued) random variables {Z_λ}, λ ∈ R, defined on (Ω, 𝓕, P), is a random process with orthogonal increments if:

(1) E|Z_λ|² < ∞, λ ∈ R;
(2) E|Z_λ − Z_{λ_n}|² → 0 whenever λ_n ↓ λ, for every λ ∈ R;
(3) whenever λ₁ < λ₂ < λ₃ < λ₄,

E(Z_{λ₄} − Z_{λ₃})(Z_{λ₂} − Z_{λ₁})* = 0.

Condition (3) is the condition of orthogonal increments. Condition (1) means that Z_λ ∈ H². Finally, condition (2) is included for technical reasons; it is a requirement of continuity on the right (in the mean-square sense) at each λ ∈ R.

Let Z = Z(Δ) be an orthogonal stochastic measure with respect to the structure function m = m(Δ), of finite measure, with the (generalized) distribution function G(λ). Let us put

Z_λ = Z(−∞, λ].

Then E|Z_λ|² = m(−∞, λ] = G(λ) < ∞, and conditions (2) and (3) are evidently satisfied as well. Thus this process {Z_λ} is a process with orthogonal increments.

On the other hand, if {Z_λ} is such a process with E|Z_λ|² = G(λ), G(−∞) = 0, G(+∞) < ∞, we put

Z(Δ) = Z_b − Z_a

when Δ = (a, b]. Let ℰ₀ be the algebra of sets

Δ = Σ_{k=1}^n (a_k, b_k]  and  Z(Δ) = Σ_{k=1}^n Z(a_k, b_k].

It is clear that

E|Z(Δ)|² = m(Δ),

where m(Δ) = Σ_{k=1}^n [G(b_k) − G(a_k)], and

E Z(Δ₁) Z*(Δ₂) = 0

for disjoint intervals Δ₁ = (a₁, b₁] and Δ₂ = (a₂, b₂]. Therefore Z = Z(Δ), Δ ∈ ℰ₀, is an elementary stochastic measure with orthogonal values. The set function m = m(Δ), Δ ∈ ℰ₀, has a unique extension to a measure on ℰ = 𝓑(R), and it follows from the preceding constructions that Z = Z(Δ), Δ ∈ ℰ₀, can also be extended to the sets Δ ∈ ℰ = 𝓑(R), with E|Z(Δ)|² = m(Δ), Δ ∈ 𝓑(R).


Therefore there is a one-to-one correspondence between processes {Z_λ}, λ ∈ R, with orthogonal increments and E|Z_λ|² = G(λ), G(−∞) = 0, G(+∞) < ∞, and orthogonal stochastic measures Z = Z(Δ), Δ ∈ 𝓑(R), with structure functions m = m(Δ). The correspondence is given by

Z_λ = Z(−∞, λ],  G(λ) = m(−∞, λ]

and

m(a, b] = G(b) − G(a).

By analogy with the usual notation of the theory of Riemann–Stieltjes integration, the stochastic integral ∫_R f(λ) dZ_λ, where {Z_λ} is a process with orthogonal increments, means the stochastic integral ∫_R f(λ) Z(dλ) with respect to the corresponding orthogonal stochastic measure.
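A concrete instance of this correspondence is the Brownian-motion-style case G(λ) = λ: simulating Z_λ on a grid with independent Gaussian increments, the interval values Z(a, b] = Z_b − Z_a form an orthogonal stochastic measure with structure function m(a, b] = b − a. The following Monte Carlo sketch (grid step and interval choices are illustrative only) checks E|Z(Δ)|² = m(Δ) and the orthogonality of values on disjoint intervals.

```python
import math
import random

random.seed(1)

STEP = 0.1   # grid spacing on which Z_lambda is simulated
N = 100      # number of grid cells; here G(lambda) = lambda (Lebesgue case)

def sample_path():
    """One path of Z_lambda on the grid: independent N(0, STEP) increments."""
    z, path = 0.0, [0.0]
    for _ in range(N):
        z += random.gauss(0.0, math.sqrt(STEP))
        path.append(z)
    return path

def z_interval(path, i, j):
    """Z(a, b] = Z_b - Z_a for the grid-cell range (i, j]."""
    return path[j] - path[i]

trials = 20_000
e_sq = 0.0      # accumulates |Z(D1)|^2 for D1 = (0, 2]   (cells 0..20)
e_cross = 0.0   # accumulates Z(D1) Z(D2) for D2 = (3, 5] (cells 30..50)
for _ in range(trials):
    p = sample_path()
    z1 = z_interval(p, 0, 20)
    z2 = z_interval(p, 30, 50)
    e_sq += z1 * z1
    e_cross += z1 * z2

print(e_sq / trials)     # near m(D1) = 2.0
print(e_cross / trials)  # near 0: orthogonal values on disjoint intervals
```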

6. PROBLEMS

1. Prove the equivalence of (5) and (6).

2. Let f ∈ L². Using the results of Chapter II (Theorem 1 of §4, the Corollary to Theorem 3 of §6, and Problem 9 of §3), prove that there is a sequence of functions f_n of the form (10) such that ‖f − f_n‖ → 0, n → ∞.

3. Establish the following properties of an orthogonal stochastic measure Z(Δ) with structure function m(Δ):

E|Z(Δ₁) − Z(Δ₂)|² = m(Δ₁ △ Δ₂),
Z(Δ₁ \ Δ₂) = Z(Δ₁) − Z(Δ₁ ∩ Δ₂)  (P-a.s.),
Z(Δ₁ △ Δ₂) = Z(Δ₁) + Z(Δ₂) − 2Z(Δ₁ ∩ Δ₂)  (P-a.s.).

§3. Spectral Representation of Stationary (Wide Sense) Sequences

1. If ξ = (ξ_n) is a stationary sequence with Eξ_n = 0, n ∈ Z, then, by the theorem of §1, there is a finite measure F = F(Δ) on ([−π, π), 𝓑([−π, π))) such that the covariance function R(n) = cov(ξ_{k+n}, ξ_k) admits the spectral representation

R(n) = ∫_{−π}^{π} e^{iλn} F(dλ).  (1)

The following result provides the corresponding spectral representation of the sequence ξ = (ξ_n), n ∈ Z, itself.


Theorem 1. There is an orthogonal stochastic measure Z = Z(Δ), Δ ∈ 𝓑([−π, π)), such that for every n ∈ Z (P-a.s.)

ξ_n = ∫_{−π}^{π} e^{iλn} Z(dλ).  (2)

Moreover, E|Z(Δ)|² = F(Δ).

The simplest proof is based on properties of Hilbert spaces. Let L²(F) = L²(E, ℰ, F) be the Hilbert space of complex functions on E = [−π, π), ℰ = 𝓑([−π, π)), with the scalar product

(f, g) = ∫_{−π}^{π} f(λ) g*(λ) F(dλ),  (3)

and let L₀²(F) be the linear manifold (L₀²(F) ⊆ L²(F)) spanned by the functions e_n = e_n(λ), n ∈ Z, where e_n(λ) = e^{iλn}. Observe that, since E = [−π, π) and F is finite, the closure of L₀²(F) coincides (Problem 1) with L²(F).

Also let L₀²(ξ) be the linear manifold spanned by the random variables ξ_n, n ∈ Z, and let L²(ξ) be its closure in the mean-square sense (with respect to P). We establish a one-to-one correspondence between the elements of L₀²(F) and L₀²(ξ), denoted by "↔", by setting

e_n ↔ ξ_n,  n ∈ Z,  (4)

and defining it for elements in general (more precisely, for equivalence classes of elements) by linearity:

Σ α_n e_n ↔ Σ α_n ξ_n  (5)

(here we suppose that only finitely many of the complex numbers α_n are different from zero). Observe that (5) is a consistent definition, in the sense that Σ α_n e_n = 0 almost everywhere with respect to F if and only if Σ α_n ξ_n = 0 (P-a.s.). The correspondence "↔" is an isometry, i.e. it preserves scalar products. In fact, by (3),

(e_n, e_m) = ∫_{−π}^{π} e_n(λ) e_m*(λ) F(dλ) = ∫_{−π}^{π} e^{iλ(n−m)} F(dλ) = R(n − m) = E ξ_n ξ_m* = (ξ_n, ξ_m),

and similarly

(Σ α_n e_n, Σ β_m e_m) = (Σ α_n ξ_n, Σ β_m ξ_m).  (6)


Now let η ∈ L²(ξ). Since L²(ξ) is the closure of L₀²(ξ), there is a sequence {η_n} such that η_n ∈ L₀²(ξ) and ‖η_n − η‖ → 0, n → ∞. Consequently {η_n} is a fundamental sequence, and therefore so is the sequence {f_n}, where f_n ∈ L₀²(F) and f_n ↔ η_n. The space L²(F) is complete, and consequently there is an f ∈ L²(F) such that ‖f_n − f‖ → 0. There is an evident converse: if f ∈ L²(F) and ‖f − f_n‖ → 0, f_n ∈ L₀²(F), there is an element η of L²(ξ) such that ‖η − η_n‖ → 0, η_n ∈ L₀²(ξ), η_n ↔ f_n.

Up to now the isometry "↔" has been defined only between elements of L₀²(ξ) and L₀²(F). We extend it by continuity, taking f ↔ η when f and η are the elements considered above. It is easily verified that the correspondence obtained in this way is one-to-one (between classes of equivalent random variables and of functions), is linear, and preserves scalar products.

Consider the function f(λ) = I_Δ(λ), where Δ ∈ 𝓑([−π, π)), and let Z(Δ) be the element of L²(ξ) such that I_Δ(λ) ↔ Z(Δ). It is clear that ‖I_Δ(λ)‖² = F(Δ) and therefore E|Z(Δ)|² = F(Δ). Moreover, if Δ₁ ∩ Δ₂ = ∅, we have E Z(Δ₁) Z*(Δ₂) = 0 and E|Z(Δ) − Σ_{k=1}^n Z(Δ_k)|² → 0, n → ∞, where Δ = Σ_{k=1}^∞ Δ_k. Hence the family of elements Z(Δ), Δ ∈ 𝓑([−π, π)), forms an orthogonal stochastic measure, with respect to which (according to §2) we can define the stochastic integral

J(f) = ∫_{−π}^{π} f(λ) Z(dλ).

Let f ∈ L²(F) and η ↔ f. Denote the element η by Φ(f) (more precisely, select single representatives from the corresponding equivalence classes of random variables or functions). Let us show that (P-a.s.)

J(f) = Φ(f).  (7)

In fact, if

f(λ) = Σ α_k I_{Δ_k}(λ)  (8)

is a finite linear combination of functions I_{Δ_k}(λ), Δ_k = (a_k, b_k], then, by the very definition of the stochastic integral, J(f) = Σ α_k Z(Δ_k), which is evidently equal to Φ(f). Therefore (7) is valid for functions of the form (8). But if f ∈ L²(F) and ‖f_n − f‖ → 0, where the f_n are functions of the form (8), then ‖Φ(f_n) − Φ(f)‖ → 0 and ‖J(f_n) − J(f)‖ → 0 (by (2.14)). Therefore Φ(f) = J(f) (P-a.s.).

Consider the function f(λ) = e^{iλn}. Then Φ(e^{iλn}) = ξ_n by (4), but on the other hand J(e^{iλn}) = ∫_{−π}^{π} e^{iλn} Z(dλ). Therefore

ξ_n = ∫_{−π}^{π} e^{iλn} Z(dλ),  n ∈ Z  (P-a.s.)

by (7). This completes the proof of the theorem.


Corollary 1. Let ξ = (ξ_n) be a stationary sequence of real random variables ξ_n, n ∈ Z. Then the stochastic measure Z = Z(Δ) involved in the spectral representation (2) has the property that

Z*(Δ) = Z(−Δ)  (9)

for every Δ ∈ 𝓑([−π, π)), where −Δ = {λ: −λ ∈ Δ}.

In fact, let f(λ) = Σ α_k e^{iλk} and η = Σ α_k ξ_k (finite sums). Then f ↔ η and, since the ξ_k are real, therefore

f*(−λ) ↔ η*.  (10)

Since I_Δ(λ) ↔ Z(Δ), it follows from (10) that I_Δ*(−λ) = I_{−Δ}(λ) ↔ Z*(Δ). On the other hand, I_{−Δ}(λ) ↔ Z(−Δ). Therefore Z*(Δ) = Z(−Δ) (P-a.s.).

Corollary 2. Again let ξ = (ξ_n) be a stationary sequence of real random variables, and let Z(Δ) = Z₁(Δ) + iZ₂(Δ). Then

E Z₁(Δ₁) Z₂(Δ₂) = 0  (11)

for every Δ₁ and Δ₂; and if Δ₁ ∩ Δ₂ = ∅, then

E Z₁(Δ₁) Z₁(Δ₂) = 0,  E Z₂(Δ₁) Z₂(Δ₂) = 0.  (12)

In fact, since Z*(Δ) = Z(−Δ), we have

Z₁(−Δ) = Z₁(Δ),  Z₂(−Δ) = −Z₂(Δ).  (13)

Moreover, since E Z(Δ₁) Z*(Δ₂) = E|Z(Δ₁ ∩ Δ₂)|², we have Im E Z(Δ₁) Z*(Δ₂) = 0, i.e.

E Z₂(Δ₁) Z₁(Δ₂) − E Z₁(Δ₁) Z₂(Δ₂) = 0.  (14)

If we take the interval −Δ₁ instead of Δ₁, we therefore obtain

E Z₂(−Δ₁) Z₁(Δ₂) − E Z₁(−Δ₁) Z₂(Δ₂) = 0,

which, by (13), can be transformed into

E Z₂(Δ₁) Z₁(Δ₂) + E Z₁(Δ₁) Z₂(Δ₂) = 0.  (15)

Then (11) follows from (14) and (15). On the other hand, if Δ₁ ∩ Δ₂ = ∅, then E Z(Δ₁) Z*(Δ₂) = 0, whence Re E Z(Δ₁) Z*(Δ₂) = 0 and Re E Z(−Δ₁) Z*(Δ₂) = 0, which, with (13), provides an evident proof of (12).

Corollary 3. Let ξ = (ξ_n) be a Gaussian sequence. Then, for every family Δ₁, …, Δ_k, the vector (Z₁(Δ₁), …, Z₁(Δ_k), Z₂(Δ₁), …, Z₂(Δ_k)) is normally distributed.


In fact, the linear manifold L₀²(ξ) consists of (complex-valued) Gaussian random variables η, i.e. the vector (Re η, Im η) has a Gaussian distribution. Then, according to Subsection 5, §13, Chapter II, the closure of L₀²(ξ) also consists of Gaussian variables.

It follows from Corollary 2 that, when ξ = (ξ_n) is a Gaussian sequence, the real and imaginary parts Z₁ and Z₂ are independent, in the sense that the families of random variables (Z₁(Δ₁), …, Z₁(Δ_k)) and (Z₂(Δ₁), …, Z₂(Δ_k)) are independent. It also follows from (12) that when the sets Δ₁, …, Δ_k are disjoint, the random variables Z_i(Δ₁), …, Z_i(Δ_k) are collectively independent, i = 1, 2.

Corollary 4. If ξ = (ξ_n) is a stationary sequence of real random variables, then (P-a.s.)

ξ_n = ∫_{−π}^{π} cos λn Z₁(dλ) − ∫_{−π}^{π} sin λn Z₂(dλ).  (16)

Remark. If {Z_λ}, λ ∈ [−π, π), is a process with orthogonal increments, corresponding to an orthogonal stochastic measure Z = Z(Δ), then, in accordance with §2, the spectral representation (2) can also be written in the following form:

ξ_n = ∫_{−π}^{π} e^{iλn} dZ_λ,  n ∈ Z.  (17)

2. Let ξ = (ξ_n) be a stationary sequence with the spectral representation (2), and let η ∈ L²(ξ). The following theorem describes the structure of such random variables.

Theorem 2. If η ∈ L²(ξ), there is a function φ ∈ L²(F) such that (P-a.s.)

η = ∫_{−π}^{π} φ(λ) Z(dλ).  (18)

PROOF. If

η_n = Σ_{|k|≤n} α_k ξ_k,  (19)

then by (2)

η_n = ∫_{−π}^{π} φ_n(λ) Z(dλ),  (20)

i.e. (18) is satisfied with

φ_n(λ) = Σ_{|k|≤n} α_k e^{iλk}.  (21)

In the general case, when η ∈ L²(ξ), there are variables η_n of type (19) such that ‖η − η_n‖ → 0, n → ∞. But then ‖φ_n − φ_m‖ = ‖η_n − η_m‖ → 0, n, m → ∞.


Consequently {φ_n} is fundamental in L²(F), and therefore there is a function φ ∈ L²(F) such that ‖φ − φ_n‖ → 0, n → ∞. By property (2.14) we have ‖J(φ_n) − J(φ)‖ → 0, and since η_n = J(φ_n) we also have η = J(φ) (P-a.s.). This completes the proof of the theorem.

Remark. Let H₀(ξ) and H₀(F) be the respective closed linear manifolds spanned by the variables ξ_n and by the functions e_n for n ≤ 0. Then if η ∈ H₀(ξ) there is a function φ ∈ H₀(F) such that (P-a.s.) η = ∫_{−π}^{π} φ(λ) Z(dλ).

3. Formula (18) describes the structure of the random variables that are obtained from ξ_n, n ∈ Z, by linear transformations, i.e. in the form of finite sums (19) and their mean-square limits.

A special but important class of such linear transformations is defined by means of what are known as (linear) filters. Let us suppose that, at instant m, a system (filter) receives as input a signal x_m, and that the output of the system is, at instant n, the signal h(n − m)x_m, where h = h(s), s ∈ Z, is a complex-valued function called the impulse response (of the filter). Therefore the total signal obtained from the input can be represented in the form

y_n = Σ_{m=−∞}^{∞} h(n − m) x_m.  (22)

For physically realizable systems, the values of the output at instant n are determined only by the "past" values of the input, i.e. the values x_m for m ≤ n. It is therefore natural to call a filter with the impulse response h(s) physically realizable if h(s) = 0 for all s < 0, in other words if

y_n = Σ_{m=−∞}^{n} h(n − m) x_m = Σ_{m=0}^{∞} h(m) x_{n−m}.  (23)

An important spectral characteristic of a filter with the impulse response h is its Fourier transform

φ(λ) = Σ_{m=−∞}^{∞} e^{−iλm} h(m),  (24)

known as the frequency characteristic or transfer function of the filter.

Let us now take up conditions, about which nothing has been said so far, for the convergence of the series in (22) and (24). Let us suppose that the input is a stationary random sequence ξ = (ξ_n), n ∈ Z, with covariance function R(n) and spectral representation (2). Then if

Σ_{k,l=−∞}^{∞} h(k) R(k − l) h*(l) < ∞,  (25)


the series Σ_{m=−∞}^{∞} h(n − m) ξ_m converges in mean-square, and therefore there is a stationary sequence η = (η_n) with

η_n = Σ_{m=−∞}^{∞} h(n − m) ξ_m = Σ_{m=−∞}^{∞} h(m) ξ_{n−m}.  (26)

In terms of the spectral measure, (25) is evidently equivalent to saying that φ(λ) ∈ L²(F), i.e.

∫_{−π}^{π} |φ(λ)|² F(dλ) < ∞.  (27)

Under (25) or (27), we obtain from (26) and (2) the spectral representation of η,

η_n = ∫_{−π}^{π} e^{iλn} φ(λ) Z(dλ).  (28)

Consequently the covariance function R_η(n) of η is given by the formula

R_η(n) = ∫_{−π}^{π} e^{iλn} |φ(λ)|² F(dλ).  (29)

In particular, if the input to a filter with frequency characteristic φ(λ) is white noise ε = (ε_n), the output will be a stationary sequence (moving average)

η_n = Σ_{m=−∞}^{∞} h(m) ε_{n−m}  (30)

with spectral density

f_η(λ) = (1/2π) |φ(λ)|².

The following theorem shows that there is a sense in which every stationary sequence with a spectral density is obtainable by means of a moving average.
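The relation f_η(λ) = (1/2π)|φ(λ)|² can be verified directly for a finite impulse response applied to unit white noise: the covariance of (30) is then R_η(n) = Σ_m h(m)h(m+n), and its Fourier series, as in (1), must reproduce |φ(λ)|²/2π. A small sketch with an illustrative (assumed) filter h:

```python
import cmath
import math

h = [1.0, 0.5, 0.25]   # illustrative impulse response h(0), h(1), h(2)

def phi(lam):
    """Frequency characteristic phi(lambda) = sum_m e^{-i lam m} h(m), cf. (24)."""
    return sum(hm * cmath.exp(-1j * lam * m) for m, hm in enumerate(h))

def cov(n):
    """R_eta(n) = sum_m h(m) h(m + n) for a moving average of unit white noise."""
    n = abs(n)
    return sum(h[m] * h[m + n] for m in range(len(h) - n))

def density_from_cov(lam):
    """f_eta(lambda) = (1/2pi) sum_n R_eta(n) e^{-i lam n}; R_eta is real and even."""
    s = cov(0) + 2 * sum(cov(n) * math.cos(lam * n) for n in range(1, len(h)))
    return s / (2 * math.pi)

lam = 0.7
f1 = abs(phi(lam)) ** 2 / (2 * math.pi)   # (1/2pi)|phi(lambda)|^2
f2 = density_from_cov(lam)
print(f1, f2)   # the two expressions agree
```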

Theorem 3. Let η = (η_n) be a stationary sequence with spectral density f_η(λ). Then (possibly at the expense of enlarging the original probability space) we can find a sequence ε = (ε_n) representing white noise, and a filter, such that the representation (30) holds.

PROOF. For the given (nonnegative) function f_η(λ) we can find a function φ(λ) such that f_η(λ) = (1/2π)|φ(λ)|². Since ∫_{−π}^{π} f_η(λ) dλ < ∞, we have φ(λ) ∈ L²(μ), where μ is Lebesgue measure on [−π, π). Hence φ can be approximated by trigonometric polynomials:

∫_{−π}^{π} |φ(λ) − Σ_{|m|≤n} e^{−iλm} h(m)|² dλ → 0,  n → ∞.


Let

η_n = ∫_{−π}^{π} e^{iλn} Z(dλ),  n ∈ Z.

Besides the measure Z = Z(Δ) we introduce another, independent, orthogonal stochastic measure Z̃ = Z̃(Δ) with E|Z̃(a, b]|² = (b − a)/2π. (The possibility of constructing such a measure depends, in general, on having a sufficiently "rich" original probability space.) Let us put

Z̄(Δ) = ∫_Δ φ⊕(λ) Z(dλ) + ∫_Δ [1 − φ⊕(λ)φ(λ)] Z̃(dλ),

where

a⊕ = a⁻¹ if a ≠ 0, and a⊕ = 0 if a = 0.

The stochastic measure Z̄ = Z̄(Δ) is a measure with orthogonal values, and for every Δ = (a, b] we have

E|Z̄(Δ)|² = (1/2π) ∫_Δ |φ⊕(λ)|² |φ(λ)|² dλ + (1/2π) ∫_Δ |1 − φ⊕(λ)φ(λ)|² dλ = |Δ|/2π,

where |Δ| = b − a. Therefore the stationary sequence ε = (ε_n), n ∈ Z, with

ε_n = ∫_{−π}^{π} e^{iλn} Z̄(dλ),

is a white noise.

We now observe that

η_n = ∫_{−π}^{π} e^{iλn} φ(λ) Z̄(dλ)  (P-a.s.)  (31)

and, on the other hand, by property (2.14), (P-a.s.)

∫_{−π}^{π} e^{iλn} φ(λ) Z̄(dλ) = ∫_{−π}^{π} e^{iλn} (Σ_{m=−∞}^{∞} e^{−iλm} h(m)) Z̄(dλ) = Σ_{m=−∞}^{∞} h(m) ε_{n−m},

which, together with (31), establishes the representation (30). This completes the proof of the theorem.

Remark. If f_η(λ) > 0 (almost everywhere with respect to Lebesgue measure), the introduction of the auxiliary measure Z̃ = Z̃(Δ) becomes unnecessary (since then 1 − φ⊕(λ)φ(λ) = 0 almost everywhere with respect to Lebesgue measure), and the reservation concerning the necessity of extending the original probability space can be omitted.

Corollary 1. Let the spectral density f_η(λ) > 0 (almost everywhere with respect to Lebesgue measure) and

f_η(λ) = (1/2π) |φ(λ)|²,

where

φ(λ) = Σ_{k=0}^{∞} e^{−iλk} h(k),  Σ_{k=0}^{∞} |h(k)|² < ∞.

Then the sequence η admits a representation as a one-sided moving average,

η_n = Σ_{m=0}^{∞} h(m) ε_{n−m}.

In particular, let P(z) = a₀ + a₁z + ⋯ + a_p z^p be a polynomial that has no zeros on {z: |z| = 1}. Then the sequence η = (η_n) with spectral density

f_η(λ) = (1/2π) |P(e^{−iλ})|²

can be represented in the form

η_n = a₀ε_n + a₁ε_{n−1} + ⋯ + a_p ε_{n−p}.

Corollary 2. Let ξ = (ξ_n) be a sequence with rational spectral density

f_ξ(λ) = (1/2π) |P(e^{−iλ})|² / |Q(e^{−iλ})|².  (32)

Let us show that if P(z) and Q(z) have no zeros on {z: |z| = 1}, there is a white noise ε = (ε_n) such that (P-a.s.)

ξ_n + b₁ξ_{n−1} + ⋯ + b_q ξ_{n−q} = a₀ε_n + a₁ε_{n−1} + ⋯ + a_p ε_{n−p}.  (33)

Conversely, every stationary sequence ξ = (ξ_n) that satisfies this equation with some white noise ε = (ε_n) and some polynomial Q(z) with no zeros on {z: |z| = 1} has a spectral density (32).

In fact, let η_n = ξ_n + b₁ξ_{n−1} + ⋯ + b_q ξ_{n−q}. Then f_η(λ) = (1/2π)|P(e^{−iλ})|², and the required representation follows from Corollary 1. On the other hand, if (33) holds and F_η(λ) and F_ξ(λ) are the spectral functions of η and ξ, then

F_η(λ) = ∫_{−π}^{λ} |Q(e^{−iν})|² dF_ξ(ν) = (1/2π) ∫_{−π}^{λ} |P(e^{−iν})|² dν.

Since |Q(e^{−iν})|² > 0, it follows that F_ξ(λ) has a density defined by (32).
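A quick numerical check of (32)-(33), with illustrative (assumed) coefficients: for an ARMA(1,1) sequence ξ_n + bξ_{n−1} = ε_n + aε_{n−1}, the density f(λ) = (1/2π)|1 + a e^{−iλ}|²/|1 + b e^{−iλ}|² must produce covariances R(n) = ∫ e^{iλn} f(λ) dλ satisfying R(k) + bR(k−1) = 0 for k ≥ 2, since ε_n + aε_{n−1} is uncorrelated with ξ_{n−k} for k ≥ 2 (the zero of Q lies outside the unit circle, so ξ is a causal function of the noise).

```python
import cmath
import math

a, b = 0.5, -0.4   # illustrative ARMA(1,1) coefficients; the zero of Q(z) = 1 + b z
                   # is at -1/b, outside the unit circle

def f(lam):
    """Rational spectral density (32) with P(z) = 1 + a z, Q(z) = 1 + b z."""
    z = cmath.exp(-1j * lam)
    return abs(1 + a * z) ** 2 / abs(1 + b * z) ** 2 / (2 * math.pi)

def R(n):
    """Covariance R(n) = int_{-pi}^{pi} e^{i lam n} f(lam) d lam (periodic rule)."""
    m = 20000
    s = 0.0
    for j in range(m):
        lam = -math.pi + 2 * math.pi * j / m
        s += math.cos(lam * n) * f(lam)   # f is even, so the cosine part suffices
    return s * 2 * math.pi / m

# For k >= 2, cov(xi_n + b xi_{n-1}, xi_{n-k}) = 0, i.e. R(k) + b R(k-1) = 0.
lhs2 = R(2) + b * R(1)
lhs3 = R(3) + b * R(2)
print(lhs2, lhs3)   # both near zero
```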


4. The following mean-square ergodic theorem can be thought of as an analog of the law of large numbers for stationary (wide sense) random sequences.

Theorem 4. Let ξ = (ξ_n), n ∈ Z, be a stationary sequence with Eξ_n = 0, covariance function (1), and spectral representation (2). Then

(1/n) Σ_{k=0}^{n−1} ξ_k → Z({0})  in mean square,  (34)

and

(1/n) Σ_{k=0}^{n−1} R(k) → F({0}).  (35)

PROOF. By (2),

(1/n) Σ_{k=0}^{n−1} ξ_k = ∫_{−π}^{π} φ_n(λ) Z(dλ),

where

φ_n(λ) = (1/n) Σ_{k=0}^{n−1} e^{iλk}.  (36)

Since |sin λ| ≥ (2/π)|λ| for |λ| ≤ π/2, we have, for λ ≠ 0,

|φ_n(λ)| = |sin(nλ/2)| / (n |sin(λ/2)|) ≤ π / (n|λ|).

Moreover, φ_n → I_{{0}}(λ) in L²(F), and therefore by (2.14)

∫_{−π}^{π} φ_n(λ) Z(dλ) → ∫_{−π}^{π} I_{{0}}(λ) Z(dλ) = Z({0})  in mean square,

which establishes (34). Relation (35) can be proved in a similar way. This completes the proof of the theorem.

Corollary. If the spectral function is continuous at zero, i.e. F({0}) = 0, then Z({0}) = 0 (P-a.s.), and by (34) and (35),

(1/n) Σ_{k=0}^{n−1} ξ_k → 0  in mean square  and  (1/n) Σ_{k=0}^{n−1} R(k) → 0.

Since (1/n) Σ_{k=0}^{n−1} R(k) = E[((1/n) Σ_{k=0}^{n−1} ξ_k) ξ_0*], the converse implication also holds:

(1/n) Σ_{k=0}^{n−1} ξ_k → 0 in mean square  ⟹  (1/n) Σ_{k=0}^{n−1} R(k) → 0.

Therefore the condition (1/n) Σ_{k=0}^{n−1} R(k) → 0 is necessary and sufficient for the convergence (in the mean-square sense) of the arithmetic means (1/n) Σ_{k=0}^{n−1} ξ_k to zero. It follows that if the original sequence ξ = (ξ_n) has expectation m (that is, Eξ_0 = m), then

(1/n) Σ_{k=0}^{n−1} R(k) → 0  ⟺  (1/n) Σ_{k=0}^{n−1} ξ_k → m  in mean square,  (37)

where R(n) = E(ξ_n − Eξ_n)(ξ_0 − Eξ_0)*.

Let us also observe that if Z({0}) ≠ 0 (P-a.s.) and m = 0, then ξ_n "contains a random constant α":

ξ_n = α + η_n,

where α = Z({0}); and in the spectral representation η_n = ∫_{−π}^{π} e^{iλn} Z_η(dλ) the measure Z_η = Z_η(Δ) is such that Z_η({0}) = 0 (P-a.s.). Conclusion (34) means that the arithmetic mean converges in mean-square to precisely this random constant α.
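The corollary can be watched numerically. In this illustrative sketch (MA(1) data chosen purely for concreteness), the sequence ξ_k = ε_k + ε_{k−1} has an absolutely continuous spectral function, so F({0}) = 0 and the arithmetic means converge to 0 in mean square; here E|(1/n) Σ ξ_k|² ≈ 4/n, which the simulation reproduces.

```python
import random

random.seed(2)

def mean_of_ma1(n):
    """Arithmetic mean (1/n) sum_{k<n} xi_k for xi_k = eps_k + eps_{k-1}."""
    prev = random.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n):
        cur = random.gauss(0.0, 1.0)
        total += cur + prev
        prev = cur
    return total / n

def mean_square(n, trials=4000):
    """Monte Carlo estimate of E|(1/n) sum xi_k|^2."""
    return sum(mean_of_ma1(n) ** 2 for _ in range(trials)) / trials

e100 = mean_square(100)
e400 = mean_square(400)
print(e100, e400)   # roughly 4/100 and 4/400: the means go to 0 in mean square
```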

5. PROBLEMS

1. Show that the closure of L₀²(F) equals L²(F) (for the notation see the proof of Theorem 1).

2. Let ξ = (ξ_n) be a stationary sequence with the property that ξ_{n+N} = ξ_n for some N and all n. Show that the spectral representation of such a sequence reduces to (1.13).

3. Let ξ = (ξ_n) be a stationary sequence such that Eξ_n = 0 and

(1/N²) Σ_{k=0}^{N} Σ_{l=0}^{N} R(k − l) = (1/N) Σ_{|k|≤N−1} R(k) [1 − |k|/N] ≤ C N^{−α}

for some C > 0, α > 0. Use the Borel–Cantelli lemma to show that then

(1/N) Σ_{k=0}^{N} ξ_k → 0  (P-a.s.)

4. Let the spectral density f_ξ(λ) of the sequence ξ = (ξ_n) be rational,

f_ξ(λ) = (1/2π) |P_{n−1}(e^{−iλ})|² / |Q_n(e^{−iλ})|²,  (38)

where P_{n−1}(z) = a₀ + a₁z + ⋯ + a_{n−1}z^{n−1} and Q_n(z) = 1 + b₁z + ⋯ + b_n z^n, and all the zeros of these polynomials lie outside the unit disk.

Show that there is a white noise ε = (ε_m), m ∈ Z, such that the sequence (ξ_m) is a component of an n-dimensional sequence (ξ_m¹, ξ_m², …, ξ_mⁿ), ξ_m¹ = ξ_m, that satisfies the system of equations

ξ_{m+1}^i = ξ_m^{i+1} + β_i ε_{m+1},  i = 1, …, n − 1,
ξ_{m+1}^n = −Σ_{j=0}^{n−1} b_{n−j} ξ_m^{j+1} + β_n ε_{m+1}.  (39)

§4. Statistical Estimation of the Covariance Function and the Spectral Density

1. Problems of the statistical estimation of various characteristics of the probability distributions of random sequences arise in the most diverse branches of science (geophysics, medicine, economics, etc.). The material presented in this section will give the reader an idea of the concepts and methods of estimation, and of the difficulties that are encountered.

To begin with, let ξ = (ξ_n), n ∈ Z, be a sequence, stationary in the wide sense (for simplicity, real), with expectation Eξ_n = m and covariance R(n) = ∫_{−π}^{π} e^{iλn} F(dλ). Let x₀, x₁, …, x_{N−1} be the results of observing the random variables ξ₀, ξ₁, …, ξ_{N−1}. How are we then to construct a "good" estimator of the (unknown) mean value m?

Let us put

m̂_N(x) = (1/N) Σ_{k=0}^{N−1} x_k.  (1)

Then it follows from the elementary properties of the expectation that this is a "good" estimator of m in the sense that it is unbiased "in the mean over all kinds of data x₀, …, x_{N−1}", i.e.

E m̂_N(ξ) = E((1/N) Σ_{k=0}^{N−1} ξ_k) = m.

In addition, it follows from Theorem 4 of §3 that our estimator is consistent (in mean-square),

E|m̂_N(ξ) − m|² → 0,  N → ∞,  (2)

whenever

(1/N) Σ_{k=0}^{N} R(k) → 0,  N → ∞.  (3)

§4. Statistical Estimation of the Covariance Function and the Spectral Density

413

Since R(n) = E~n+k~k• it is natural to estimate this function on the basis of N observations x 0 , x 1, ••• , xN-l (when 0 ~ n < N) by

It is clear that this estimator is unbiased in the sense that

0 ~ n < N. Let us now consider the question of its consistency. If we replace ~k in (3.37) by ~n+k~k and suppose that the sequence~ = C~n) under consideration has a fourth moment (E ~ti < oo ), we find that the condition N-+ oo,

(4)

is necessary and sufficient for N-+ oo. Let us suppose that the original sequence~ = mean and covariance R(n)). Then by (11.12.51) E[~n+k~k - R(n)] [~n~O - R(n)]

= = =

C~n)

E~n+k~k~n~o

(5)

is Gaussian (with zero

-

E~n+k~k · E~n~o

R 2 (n)

+ E~n+k~n • E~k~o

+ E~n+k~o · E~k~n - R 2(n) R 2 (k) + R(n + k)R(n - k).

Therefore in the Gaussian case condition (4) is equivalent to 1

N-1

N

k=O

- L [R Since IR(n

2 (k)

+ k)R(n-

+ R(n + k)R(n

k)l ~ IR(n

N-+ oo.

(6)

+ kW + IR(n- k)l 2 , the condition

_!_Nil R2(k)-+ 0,

N

- k)] -+ 0,

N-+ oo,

(7)

k=O

implies (6). Conversely, if (6) holds for n = 0, then (7) is satisfied. We have now established the following theorem.

Theorem. Let ξ = (ξ_n) be a Gaussian stationary sequence with Eξ_n = 0 and covariance function R(n). Then (7) is a necessary and sufficient condition that, for every n ≥ 0, the estimator R̂_N(n; x) is mean-square consistent (i.e. that (5) is satisfied).
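A numerical illustration of the estimator R̂_N(n; x) = (1/(N − n)) Σ_{k=0}^{N−n−1} x_{n+k}x_k (a sketch, with Gaussian white noise as the simplest data satisfying (7)): the averages of R̂_N(0; ·) and R̂_N(1; ·) over many independent samples are close to R(0) = 1 and R(1) = 0, confirming unbiasedness.

```python
import random

random.seed(4)

def r_hat(x, n):
    """R_hat_N(n; x) = (1/(N - n)) sum_{k=0}^{N-n-1} x_{n+k} x_k."""
    big_n = len(x)
    return sum(x[n + k] * x[k] for k in range(big_n - n)) / (big_n - n)

def gaussian_white_noise(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

trials = 2000
r0 = [r_hat(gaussian_white_noise(200), 0) for _ in range(trials)]
r1 = [r_hat(gaussian_white_noise(200), 1) for _ in range(trials)]

mean_r0 = sum(r0) / trials   # estimates R(0) = 1 without bias
mean_r1 = sum(r1) / trials   # estimates R(1) = 0 without bias
print(mean_r0, mean_r1)
```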


Remark. If we use the spectral representation of the covariance function, we obtain

(1/N) Σ_{k=0}^{N−1} R²(k) = ∫_{−π}^{π} ∫_{−π}^{π} (1/N) Σ_{k=0}^{N−1} e^{i(λ−ν)k} F(dλ) F(dν) = ∫_{−π}^{π} ∫_{−π}^{π} f_N(λ, ν) F(dλ) F(dν),

where (compare (3.36))

f_N(λ, ν) = (1/N) Σ_{k=0}^{N−1} e^{i(λ−ν)k},

which equals 1 when λ = ν. But as N → ∞,

f_N(λ, ν) → f(λ, ν) = { 1, λ = ν; 0, λ ≠ ν }.

Therefore

(1/N) Σ_{k=0}^{N−1} R²(k) → ∫_{−π}^{π} ∫_{−π}^{π} f(λ, ν) F(dλ) F(dν) = ∫_{−π}^{π} F({λ}) F(dλ) = Σ_λ F²({λ}),

where the sum over λ contains at most a countable number of terms, since the measure F is finite. Hence (7) is equivalent to

Σ_λ F²({λ}) = 0,  (8)

which means that the spectral function F(λ) = F([−π, λ]) is continuous.

2. We now turn to the problem of finding estimators for the spectral function F(λ) and the spectral density f(λ) (under the assumption that they exist).

A method that naturally suggests itself for estimating the spectral density follows from the proof of Herglotz' theorem that we gave earlier. Recall that the function

f_N(λ) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R(n) e^{−iλn},

introduced in §1, has the property that the function

F_N(λ) = ∫_{−π}^{λ} f_N(ν) dν  (9)

converges on the whole (Chapter III, §1) to the spectral function F(λ). Therefore if F(λ) has a density f(λ), we have

∫_{−π}^{λ} f_N(ν) dν → ∫_{−π}^{λ} f(ν) dν  (10)

for each λ ∈ [−π, π). Starting from these facts and recalling that an estimator for R(n) (on the basis of the observations x₀, x₁, …, x_{N−1}) is R̂_N(n; x), we take as an estimator for f(λ) the function

f̂_N(λ; x) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R̂_N(n; x) e^{−iλn},  (11)

putting R̂_N(n; x) = R̂_N(|n|; x) for |n| < N. The function f̂_N(λ; x) is known as a periodogram. It is easily verified that it can also be represented in the following more convenient form:

f̂_N(λ; x) = (1/2πN) |Σ_{n=0}^{N−1} x_n e^{−iλn}|².  (12)

Since E R̂_N(n; ξ) = R(n), |n| < N, we have

E f̂_N(λ; ξ) = f_N(λ).

If the spectral function F(λ) has density f(λ), then, since f_N(λ) can also be written in the form (1.34), we find that

f_N(λ) = (1/2πN) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} ∫_{−π}^{π} e^{iν(k−l)} e^{iλ(l−k)} f(ν) dν = ∫_{−π}^{π} Φ_N(λ − ν) f(ν) dν.

The function

Φ_N(λ) = (1/2πN) |Σ_{k=0}^{N−1} e^{iλk}|² = (1/2πN) · sin²(Nλ/2) / sin²(λ/2)

is the Fejér kernel. It is known, from the properties of this function, that for almost every λ (with respect to Lebesgue measure)

∫_{−π}^{π} Φ_N(λ − ν) f(ν) dν → f(λ).  (13)

Therefore for almost every λ ∈ [−π, π),

E f̂_N(λ; ξ) → f(λ);  (14)

in other words, the estimator f̂_N(λ; x) of f(λ) on the basis of x₀, x₁, …, x_{N−1} is asymptotically unbiased.


In this sense the estimator f̂_N(λ; x) can be considered to be "good". However, at the individual observed values x₀, …, x_{N−1} the values of the periodogram f̂_N(λ; x) usually turn out to be far from the actual values f(λ). In fact, let ξ = (ξ_n) be a stationary sequence of independent Gaussian random variables, ξ_n ~ 𝒩(0, 1). Then f(λ) ≡ 1/2π and

f̂_N(λ; ξ) = (1/2π) |(1/√N) Σ_{k=0}^{N−1} ξ_k e^{−iλk}|².

Then at the point λ = 0 we have f̂_N(0; ξ) coinciding in distribution with (1/2π) times the square of the Gaussian random variable η ~ 𝒩(0, 1). Hence, for every N,

E|f̂_N(0; ξ) − f(0)|² = (1/4π²) E|η² − 1|² > 0.

Moreover, an easy calculation shows that if f(λ) is the spectral density of a stationary sequence ξ = (ξ_n) that is constructed as a moving average

ξ_n = Σ_{k=0}^{∞} a_k ε_{n−k}  (15)

with Σ_{k=0}^{∞} |a_k| < ∞, where ε = (ε_n) is white noise with Eε_n⁴ < ∞, then

lim_{N→∞} E|f̂_N(λ; ξ) − f(λ)|² = { 2f²(0), λ = 0, ±π; f²(λ), λ ≠ 0, ±π }.  (16)

f Z'
r1t WN(A.- v)]N(v;

x) dv,

(17)

which is obtained from the periodogram fN(A.; x) and a smoothing function WN(A.), and which we call a spectral window. Natural requirements on WN(A.) are: (a) WN(A.) has a sharp maximum at A. = 0; (b) J':" WN(A.) d)..= 1; (c) PI]~·(A.; ~)- f(A.W-+ 0, N-+ oo, A. E [ -n, n). By (14) and (b) the estimators ]';(A.;~) are asymptotically unbiased. Condition (c) is the condition of asymptotic consistency in mean-square, which, as we showed above, is violated for the periodogram. Finally, condition (a) ensures that the required frequency A. is "picked out" from the periodogram. Let us give some examples of estimators of the form (17). Bartlett's estimator is based on the spectral window

WN(A.)

= aNB(aNA.),

417

§4. Statistical Estimation of the Covariance Function and the Spectral Density

where aN j oo, aN/N -+ 0, N -+ oo, and 2 ( ,) = __!__ I sin(A./2) 1

B II.

2n

A./2

·

Parzen's estimator takes the spectral window to be

WN(A.) = aNP(aNA.), where aN are the same as before and p

4 ( ,) = ]_ I sin(A./4) 1 II.

8n

A./4

·

Zhurbenko's estimator is constructed from a spectral window of the form

with

Z(A.) = {

-(X+ 1 1..1.1" +(X+ 1 ' lA. I~ 1, 2cx

2cx

1..1.1 rel="nofollow"> 1,

0,

where 0 < ex ~ 2 and the aN are selected in a particular way. We shall not spend any more time on problems of estimating spectral densities; we merely note that there is an extensive statistical literature dealing with the construction of spectral windows and the comparison of the corresponding estimators ]';(A.; x).

3. We now consider the problem of estimating the spectral function F(A.) = F([ -n, A.]). We begin by defining

FN(A.) = f!N(v) dv,

FN(A.; x) =

fJN(v; x) dv,

where ]N(v; x) is the periodogram constructed with (x 0 , x 1, It follows from the proof of Herglotz' theorem (§1) that

•.. ,

xN_ 1 ).

f/).n dFN(A.)-+ f/).n dF(A.) for every n E 7L. Hence it follows (compare the corollary to Theorem 1, §3, Chapter III) that F N => F, i.e. FN(A.) converges to F(A.) at each point of continuity of F(A.). Observe that

418

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

for all lnl < N. Therefore if we suppose that RN(n; ~) converges to R(n) with probability one as N -+ oo, we have

f/An dFN().; 0-+ f/An dF(A)

(P-a.s.)

and therefore FN().; ~) ~ F().) (P-a.s.). It is then easy to deduce (if necessary, passing from a sequence to a subsequence) that if RN(n; ~)-+ R(n) in probability, then FN().; ~) ~ F().) in probability. 4.

PROBLEMS

1. In (15) let e.

~

.%(0, 1). Show that

(N- n)VRN(n,

~)-+ 2n

fy +

e 2i"l)f 2(A.) dA.

for every n, as N -+ oo. 2. Establish (16) and the following generalization: lim COV(]N(A_; ~), ]N(v; N~oo

m

2/ 2 (0),

={

/ 2 (A), ~

A.= v = 0, ±n, A.= v # 0, ±n, A.# ±v.

§5. Wold's Expansion

1. In contrast to the representation (3.2), which gives an expansion of a stationary sequence in the frequency domain, Wold's expansion operates in the time domain. The main point of this expansion is that a stationary sequence ξ = (ξ_n), n ∈ Z, can be represented as the sum of two stationary sequences, one of which is completely predictable (in the sense that its values are completely determined by its "past"), whereas the second does not have this property.

We begin with some definitions. Let H_n(ξ) = L²(ξⁿ) and H(ξ) = L²(ξ) be the closed linear manifolds spanned respectively by ξⁿ = (…, ξ_{n−1}, ξ_n) and ξ = (…, ξ_{n−1}, ξ_n, …). Let

S(ξ) = ⋂_n H_n(ξ).

For every η ∈ H(ξ), denote by

π̂_n(η) = Ê(η | H_n(ξ))

the projection of η on the subspace H_n(ξ) (see §11, Chapter II). We also write

π̂_{−∞}(η) = Ê(η | S(ξ)).

Every element η ∈ H(ξ) can be represented as

η = π̂_{−∞}(η) + (η − π̂_{−∞}(η)),

where η − π̂_{−∞}(η) ⊥ π̂_{−∞}(η). Therefore H(ξ) is represented as the orthogonal sum

H(ξ) = S(ξ) ⊕ R(ξ),

where S(ξ) consists of the elements π̂_{−∞}(η) with η ∈ H(ξ), and R(ξ) consists of the elements of the form η − π̂_{−∞}(η).

We shall now assume that Eξ_n = 0 and Vξ_n > 0. Then H(ξ) is automatically nontrivial (contains elements different from zero).

~

= (~n) is regular if

H(~) = R(~)

and singular if H(~)

=

S(~).

Remark. Singular sequences are also called deterministic and regular sequences are called purely or completely nondeterministic. If S(~) is a proper subspace of H(O we just say that~ is nondeterministic. Theorem 1. Every stationary (wide sense) random sequence decomposition

~

has a unique (1)

where

regular and ~s = ~~for all nand m).

~r = (~~)is

onal(~~ _l

PROOF.

(~~)is

singular. Here

~r

amd

~s

are orthog-

We define

for every n, we have S(~') _l S(~). On the other hand, S(~') s; S(~) and therefore S(n is trivial (contains only random sequences that coincide almost surely with zero). Consequently ~r is regular. Moreover, Hn(~) S:; Hi~ 5 ) Ef> HnW) and Hi~s) S:; Hn(~), HnW) S:; Hi~). Therefore Hn(~) = Hn(~ 5) Ef) HnW) and hence

Since

~~ _l S(~),

(2)

for every n. Since

~~ _l

S(¢) it follows from (2) that S( ~) s; H n
420

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

and therefore S(~) s; S(~") s; H(~"). But~~ s; S(~); hence H(~") £ S(~) and consequently S( ~) = S( ~") = Hce'), which means that ~· is singular. The orthogonality of~· and ~·follows in an obvious way from ~~ E S(~) and ~~ _i S( ~). Let us now show that (1) is unique. Let ~n = 11~ + 17~, where 17' and '7 8 are regular and singular orthogonal sequences. Then since Hn('1") = H(17•), we have Hn(~) =

Hn('1') E9 HnC17•) = Hn('1') E9 H(17•),

and therefore S(~) = S(17') S(~)

= H(17

Ee H(17

8 ).

But S(17') is trivial, and therefore

8 ).

Since '7~ E H(17 8 ) = S( ~) and '7~ l_ H(17") = S( ~), we have E( ~n IS(~)) = E(17~ + 11~ IS(~)) = 17~,i.e. 11~ coincides with~~; this establishes the uniqueness of (1). This completes the proof of the theorem.

2. Definition 2. Let $\xi = (\xi_n)$ be a nondegenerate stationary sequence. A random sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi$) if

(a) $\varepsilon = (\varepsilon_n)$ consists of pairwise orthogonal random variables with $\mathsf{E}\varepsilon_n = 0$, $\mathsf{E}|\varepsilon_n|^2 = 1$;
(b) $H_n(\xi) = H_n(\varepsilon)$ for all $n \in \mathbb{Z}$.

Remark. The reason for the term "innovation" is that $\varepsilon_{n+1}$ provides, so to speak, new "information" not contained in $H_n(\xi)$ (in other words, it "innovates" in $H_n(\xi)$ the information that is needed for forming $H_{n+1}(\xi)$).

The following fundamental theorem establishes a connection between one-sided moving averages (Example 4, §1) and regular sequences.

Theorem 2. A necessary and sufficient condition for a nondegenerate sequence $\xi$ to be regular is that there exist an innovation sequence $\varepsilon = (\varepsilon_n)$ and a sequence $(a_n)$, $n \geq 0$, of complex numbers with $\sum_{n=0}^{\infty} |a_n|^2 < \infty$, such that
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \qquad (\text{P-a.s.}). \qquad (3)$$

Proof. Necessity. We represent $H_n(\xi)$ in the form
$$H_n(\xi) = H_{n-1}(\xi) \oplus B_n.$$
Since $H_n(\xi)$ is spanned by the elements of $H_{n-1}(\xi)$ and elements of the form $\beta \cdot \xi_n$, where $\beta$ is a complex number, the dimension of $B_n$ is either zero or one. But the space $H_n(\xi)$ cannot coincide with $H_{n-1}(\xi)$ for any value of $n$.


In fact, if $B_n$ is trivial for some $n$, then by stationarity $B_k$ is trivial for all $k$, and therefore $H(\xi) = S(\xi)$, contradicting the assumption that $\xi$ is regular. Thus $\dim B_n = 1$.

Let $\eta_n$ be a nonzero element of $B_n$. Put
$$\varepsilon_n = \frac{\eta_n}{\|\eta_n\|},$$
where $\|\eta_n\|^2 = \mathsf{E}|\eta_n|^2 > 0$. For given $n$ and $k \geq 0$, consider the decomposition
$$H_n(\xi) = H_{n-k}(\xi) \oplus B_{n-k+1} \oplus \cdots \oplus B_n.$$
Then $\varepsilon_{n-k+1}, \ldots, \varepsilon_n$ is an orthogonal basis in $B_{n-k+1} \oplus \cdots \oplus B_n$ and
$$\xi_n = \sum_{j=0}^{k-1} a_j \varepsilon_{n-j} + \hat\pi_{n-k}(\xi_n), \qquad (4)$$
where $a_j = \mathsf{E}\xi_n\bar\varepsilon_{n-j}$. By Bessel's inequality (II.11.16),
$$\sum_{j=0}^{\infty} |a_j|^2 \leq \|\xi_n\|^2 < \infty.$$
It follows that $\sum_{j=0}^{\infty} a_j\varepsilon_{n-j}$ converges in mean square, and then, by (4), equation (3) will be established as soon as we show that
$$\hat\pi_{n-k}(\xi_n) \xrightarrow{L^2} 0, \qquad k \to \infty.$$
It is enough to consider the case $n = 0$. Since
$$\xi_0 = \hat\pi_{-k}(\xi_0) + \sum_{j=0}^{k-1} a_j \varepsilon_{-j}$$
and the terms that appear in this sum are orthogonal, we have for every $k \geq 0$
$$\|\hat\pi_{-k}(\xi_0)\|^2 = \|\xi_0\|^2 - \sum_{j=0}^{k-1} |a_j|^2.$$
Therefore the limit $\lim_{k\to\infty} \hat\pi_{-k}(\xi_0)$ exists (in mean square). Now $\hat\pi_{-k}(\xi_0) \in H_{-k}(\xi)$ for each $k$, and therefore the limit in question must belong to $\bigcap_{k > 0} H_{-k}(\xi) = S(\xi)$. But, by assumption, $S(\xi)$ is trivial, and therefore $\hat\pi_{-k}(\xi_0) \xrightarrow{L^2} 0$, $k \to \infty$.

Sufficiency. Let the nondegenerate sequence $\xi$ have the representation (3), where $\varepsilon = (\varepsilon_n)$ is an orthonormal system (not necessarily satisfying the condition $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$). Then $H_n(\xi) \subseteq H_n(\varepsilon)$, and therefore $S(\xi) = \bigcap_k H_k(\xi) \subseteq H_n(\varepsilon)$ for every $n$. But $\varepsilon_{n+1} \perp H_n(\varepsilon)$, and therefore $\varepsilon_{n+1} \perp S(\xi)$ for every $n$; at the same time $\varepsilon = (\varepsilon_n)$ is a basis in $H(\varepsilon) \supseteq S(\xi)$. It follows that $S(\xi)$ is trivial, and consequently $\xi$ is regular. This completes the proof of the theorem.


Remark. It follows from the proof that a nondegenerate sequence $\xi$ is regular if and only if it admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} \tilde a_k \tilde\varepsilon_{n-k}, \qquad (5)$$
where $\tilde\varepsilon = (\tilde\varepsilon_n)$ is an orthonormal system which (it is important to emphasize this!) does not necessarily satisfy the condition $H_n(\xi) = H_n(\tilde\varepsilon)$, $n \in \mathbb{Z}$. In this sense the conclusion of Theorem 2 says more, specifically that for a regular sequence $\xi$ there exist $a = (a_n)$ and an orthonormal system $\varepsilon = (\varepsilon_n)$ such that not only (5) but also (3) is satisfied, with $H_n(\xi) = H_n(\varepsilon)$, $n \in \mathbb{Z}$.

The following theorem is an immediate corollary of Theorems 1 and 2.

Theorem 3 (Wold's Expansion). If $\xi = (\xi_n)$ is a nondegenerate stationary sequence, then
$$\xi_n = \xi_n^s + \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \qquad (6)$$
where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and $\varepsilon = (\varepsilon_n)$ is an innovation sequence (for $\xi^r$).
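The moving-average part of (6) can be illustrated numerically. The following sketch is an added illustration, not part of the original text; it assumes Python with numpy, and the MA(1) coefficients $a_0 = 2$, $a_1 = 1$ are a hypothetical example. It builds $\xi_n = \sum_k a_k\varepsilon_{n-k}$ from simulated orthonormal white noise and checks empirically that, for real coefficients, the covariance function is $R(n) = \sum_k a_k a_{k+n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, 1.0])           # coefficients a_0, a_1 (hypothetical example)
N = 400_000
eps = rng.standard_normal(N)       # orthonormal noise: E eps_n = 0, E eps_n^2 = 1

# One-sided (here finite) moving average  xi_n = sum_k a_k eps_{n-k}
xi = np.convolve(eps, a, mode="valid")

# For real coefficients the covariance is R(n) = sum_k a_k a_{k+n}:
# R(0) = 4 + 1 = 5, R(1) = 2*1 = 2, R(n) = 0 for n >= 2.
R_theory = {0: 5.0, 1: 2.0, 2: 0.0}
for n, r in R_theory.items():
    r_hat = np.mean(xi[: len(xi) - n] * xi[n:])
    assert abs(r_hat - r) < 0.1, (n, r_hat, r)
print("empirical covariances match R(n) = sum_k a_k a_{k+n}")
```

The same computation applies to any square-summable sequence of real coefficients; only the finite truncation is specific to this sketch.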

3. The significance of the concepts introduced here (regular and singular sequences) becomes particularly clear if we consider the following (linear) extrapolation problem, for whose solution the Wold expansion (6) is especially useful.

Let $H_0(\xi) = \overline{L}^2(\xi^0)$ be the closed linear manifold spanned by the variables $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consider the problem of constructing an optimal (least-squares) linear estimator $\hat\xi_n$ of $\xi_n$ in terms of the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. It follows from §11, Chapter II, that
$$\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi)). \qquad (7)$$
(In the notation of Subsection 1, $\hat\xi_n = \hat\pi_0(\xi_n)$.) Since $\xi^r$ and $\xi^s$ are orthogonal and $H_0(\xi) = H_0(\xi^r) \oplus H_0(\xi^s)$, we obtain, by using (6),
$$\begin{aligned}
\hat\xi_n &= \hat{\mathsf{E}}(\xi_n^r + \xi_n^s \mid H_0(\xi)) = \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi)) + \hat{\mathsf{E}}(\xi_n^s \mid H_0(\xi))\\
&= \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi^r) \oplus H_0(\xi^s)) + \hat{\mathsf{E}}(\xi_n^s \mid H_0(\xi^r) \oplus H_0(\xi^s))\\
&= \hat{\mathsf{E}}(\xi_n^r \mid H_0(\xi^r)) + \xi_n^s = \xi_n^s + \hat{\mathsf{E}}\Bigl(\sum_{k=0}^{\infty} a_k\varepsilon_{n-k} \Bigm| H_0(\xi^r)\Bigr).
\end{aligned}$$
In (6), the sequence $\varepsilon = (\varepsilon_n)$ is an innovation sequence for $\xi^r = (\xi_n^r)$, and therefore $H_0(\xi^r) = H_0(\varepsilon)$. Therefore
$$\hat\xi_n = \xi_n^s + \hat{\mathsf{E}}\Bigl(\sum_{k=0}^{\infty} a_k\varepsilon_{n-k} \Bigm| H_0(\varepsilon)\Bigr) = \xi_n^s + \sum_{k=n}^{\infty} a_k \varepsilon_{n-k}, \qquad (8)$$


and the mean-square error of predicting $\xi_n$ by $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is
$$\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \qquad (9)$$

We can draw two important conclusions.

(a) If $\xi$ is singular, then for every $n \geq 1$ the (extrapolation) error $\sigma_n^2$ is zero; in other words, we can predict $\xi_n$ without error from its "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$.

(b) If $\xi$ is regular, then $\sigma_n^2 \leq \sigma_{n+1}^2$ and
$$\lim_{n\to\infty} \sigma_n^2 = \sum_{k=0}^{\infty} |a_k|^2. \qquad (10)$$
Since
$$\sum_{k=0}^{\infty} |a_k|^2 = \mathsf{E}|\xi_n|^2,$$
it follows from (10) and (9) that
$$\sigma_n^2 \uparrow \mathsf{E}|\xi_n|^2, \qquad n \to \infty;$$
i.e. as $n$ increases, the prediction of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ becomes trivial (reducing simply to $\mathsf{E}\xi_n = 0$).

4. Let us suppose that $\xi$ is a nondegenerate regular stationary sequence. According to Theorem 2, every such sequence admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k}, \qquad (11)$$
where $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and the orthonormal sequence $\varepsilon = (\varepsilon_n)$ has the important property that
$$H_n(\xi) = H_n(\varepsilon), \qquad n \in \mathbb{Z}. \qquad (12)$$
The representation (11) means (see Subsection 3, §3) that $\xi_n$ can be interpreted as the output signal of a physically realizable filter with impulse response $a = (a_k)$, $k \geq 0$, when the input is $\varepsilon = (\varepsilon_n)$.

Like any sequence of two-sided moving averages, a regular sequence has a spectral density $f(\lambda)$. But since a regular sequence admits a representation as a one-sided moving average, it is possible to obtain additional information about the properties of its spectral density. In the first place, it is clear that
$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$


where
$$\varphi(\lambda) = \sum_{k=0}^{\infty} a_k e^{-i\lambda k}. \qquad (13)$$
Put
$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \qquad (14)$$
This function is analytic in the open domain $|z| < 1$, and since $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ it belongs to the Hardy class $H_2$, the class of functions $g = g(z)$, analytic in $|z| < 1$, satisfying
$$\sup_{0 \leq r < 1} \frac{1}{2\pi} \int_{-\pi}^{\pi} |g(re^{i\theta})|^2\, d\theta < \infty. \qquad (15)$$
In fact,
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |\Phi(re^{i\theta})|^2\, d\theta = \sum_{k=0}^{\infty} |a_k|^2 r^{2k}$$
and
$$\sup_{0 \leq r < 1} \sum_{k=0}^{\infty} |a_k|^2 r^{2k} = \sum_{k=0}^{\infty} |a_k|^2 < \infty.$$
It is shown in the theory of functions of a complex variable that the boundary function $\Phi(e^{-i\lambda})$, $-\pi \leq \lambda < \pi$, of a function $\Phi \in H_2$ that is not identically zero satisfies
$$\int_{-\pi}^{\pi} \ln|\Phi(e^{-i\lambda})|\, d\lambda > -\infty. \qquad (16)$$
In our case
$$f(\lambda) = \frac{1}{2\pi} |\Phi(e^{-i\lambda})|^2,$$
whence
$$\ln f(\lambda) = \ln\frac{1}{2\pi} + 2\ln|\Phi(e^{-i\lambda})|,$$
and consequently the spectral density $f(\lambda)$ of a regular process satisfies
$$\int_{-\pi}^{\pi} \ln f(\lambda)\, d\lambda > -\infty. \qquad (17)$$
On the other hand, let the spectral density $f(\lambda)$ satisfy (17). It again follows from the theory of functions of a complex variable that there is then a function $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$ in the class $H_2$ such that, almost everywhere with respect to Lebesgue measure,
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2.$$

Therefore, if we put $\varphi(\lambda) = \Phi(e^{-i\lambda})$, we obtain
$$f(\lambda) = \frac{1}{2\pi} |\varphi(\lambda)|^2,$$
where $\varphi(\lambda)$ is of the form (13). Then it follows from the corollary to Theorem 3, §3, that $\xi$ admits a representation as a one-sided moving average (11), where $\varepsilon = (\varepsilon_n)$ is an orthonormal sequence. From this and from the Remark on Theorem 2, it follows that $\xi$ is regular. Thus we have the following theorem.

Theorem 4 (Kolmogorov). Let $\xi$ be a nondegenerate regular stationary sequence. Then there is a spectral density $f(\lambda)$ such that
$$\int_{-\pi}^{\pi} \ln f(\lambda)\, d\lambda > -\infty. \qquad (18)$$
In particular, $f(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure).

Conversely, if $\xi$ is a stationary sequence with a spectral density satisfying (18), then the sequence is regular.
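As a numerical aside (added here, not part of the original text; Python with numpy is assumed), condition (18) can be checked directly for the density $f(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2$ of the moving average $\xi_n = 2\varepsilon_n + \varepsilon_{n-1}$. The same computation illustrates the Szego-Kolmogorov formula $\sigma_1^2 = 2\pi\exp\{\frac{1}{2\pi}\int\ln f\,d\lambda\}$ (stated as Problem 3 in §6), which here recovers the one-step prediction error $|a_0|^2 = 4$.

```python
import numpy as np

# Spectral density of the MA(1) sequence xi_n = 2*eps_n + eps_{n-1}
lam = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
f = np.abs(2.0 + np.exp(-1j * lam)) ** 2 / (2 * np.pi)

mean_log = np.log(f).mean()        # = (1/2pi) * integral of ln f over [-pi, pi)
assert np.isfinite(mean_log)       # condition (18): the integral of ln f is finite

# Szego-Kolmogorov: 2*pi*exp{(1/2pi) int ln f dlam} = |a_0|^2 = 4, the error in (9)
assert abs(2 * np.pi * np.exp(mean_log) - 4.0) < 1e-6
print("2*pi*exp(mean ln f) =", 2 * np.pi * np.exp(mean_log))
```

The mean over a uniform periodic grid is used as the normalized integral; for this smooth density the quadrature error is negligible.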

PROBLEMS

1. Show that a stationary sequence with discrete spectrum (piecewise-constant spectral function $F(\lambda)$) is singular.

2. Let $\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2$, $\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi))$. Show that if $\sigma_n^2 = 0$ for some $n \geq 1$, the sequence is singular; if $\sigma_n^2 \to R(0)$ as $n \to \infty$, the sequence is regular.

3. Show that the stationary sequence $\xi = (\xi_n)$, $\xi_n = e^{i\varphi n}$, where $\varphi$ is a uniform random variable on $[0, 2\pi]$, is regular. Find the estimator $\hat\xi_n$ and the number $\sigma_n^2$, and show that the nonlinear estimator
$$\tilde\xi_n = \left(\frac{\xi_0}{\xi_{-1}}\right)^{\!n}\xi_0$$
provides a correct estimate of $\xi_n$ from the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, i.e.
$$\mathsf{E}|\xi_n - \tilde\xi_n|^2 = 0, \qquad n \geq 1.$$

§6. Extrapolation, Interpolation and Filtering

1. Extrapolation. According to the preceding section, a singular sequence admits error-free prediction (extrapolation) of $\xi_n$, $n \geq 1$, in terms of the "past" $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$. Consequently it is reasonable, in considering the extrapolation problem for arbitrary stationary sequences, to begin with the case of regular sequences.

According to Theorem 2 of §5, every regular sequence $\xi = (\xi_n)$ admits a representation as a one-sided moving average
$$\xi_n = \sum_{k=0}^{\infty} a_k \varepsilon_{n-k} \qquad (1)$$
with $\sum_{k=0}^{\infty} |a_k|^2 < \infty$ and some innovation sequence $\varepsilon = (\varepsilon_n)$. It follows from §5 that the representation (1) solves the problem of finding the optimal (linear) estimator $\hat\xi_n = \hat{\mathsf{E}}(\xi_n \mid H_0(\xi))$, since, by (5.8),
$$\hat\xi_n = \sum_{k=n}^{\infty} a_k \varepsilon_{n-k} \qquad (2)$$
and
$$\sigma_n^2 = \mathsf{E}|\xi_n - \hat\xi_n|^2 = \sum_{k=0}^{n-1} |a_k|^2. \qquad (3)$$
However, this can be considered only as a theoretical solution, for the following reasons. The sequences that we consider are ordinarily given not by means of their representations (1), but by their covariance functions $R(n)$ or their spectral densities $f(\lambda)$ (which exist for regular sequences). Hence the solution (2) can be regarded as satisfactory only if the coefficients $a_k$ are expressed in terms of $R(n)$ or of $f(\lambda)$, and the $\varepsilon_k$ in terms of the values $\ldots, \xi_{k-1}, \xi_k$. Without discussing the problem in general, we consider only the special case (of interest in applications) in which the spectral density has the form
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2, \qquad (4)$$
where $\Phi(z) = \sum_{k=0}^{\infty} b_k z^k$ has radius of convergence $r > 1$ and has no zeros in $|z| \leq 1$. Let
$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z(d\lambda) \qquad (5)$$
be the spectral representation of $\xi = (\xi_n)$, $n \in \mathbb{Z}$.

Theorem 1. If the spectral density of $\xi$ has the form (4), then the optimal (linear) estimator $\hat\xi_n$ of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is given by
$$\hat\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda), \qquad (6)$$
where
$$\hat\varphi_n(\lambda) = e^{i\lambda n}\,\frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})} \qquad (7)$$
and
$$\Phi_n(z) = \sum_{k=n}^{\infty} b_k z^k.$$

Proof. According to the Remark on Theorem 2 of §3, every variable $\hat\eta \in H_0(\xi)$ admits a representation in the form
$$\hat\eta = \int_{-\pi}^{\pi} \varphi(\lambda) Z(d\lambda), \qquad (8)$$
where $\varphi$ belongs to $H_0(F)$, the closed linear manifold spanned by the functions $e_n = e^{i\lambda n}$ for $n \leq 0$ (here $F(\lambda) = \int_{-\pi}^{\lambda} f(\nu)\, d\nu$). Since
$$\mathsf{E}|\xi_n - \hat\eta|^2 = \mathsf{E}\left|\int_{-\pi}^{\pi} (e^{i\lambda n} - \varphi(\lambda))Z(d\lambda)\right|^2 = \int_{-\pi}^{\pi} |e^{i\lambda n} - \varphi(\lambda)|^2 f(\lambda)\, d\lambda,$$
the proof that (6) is optimal reduces to proving that
$$\inf_{\varphi \in H_0(F)} \int_{-\pi}^{\pi} |e^{i\lambda n} - \varphi(\lambda)|^2 f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 f(\lambda)\, d\lambda. \qquad (9)$$
It follows from Hilbert-space theory (§11, Chapter II) that the optimal function $\hat\varphi_n(\lambda)$ (in the sense of (9)) is determined by the two conditions
$$\text{(1) } \hat\varphi_n(\lambda) \in H_0(F), \qquad \text{(2) } e^{i\lambda n} - \hat\varphi_n(\lambda) \perp H_0(F). \qquad (10)$$
Since
$$e^{i\lambda n}\Phi_n(e^{-i\lambda}) = e^{i\lambda n}\bigl[b_n e^{-i\lambda n} + b_{n+1}e^{-i\lambda(n+1)} + \cdots\bigr] = b_n + b_{n+1}e^{-i\lambda} + \cdots$$
and, in a similar way, $1/\Phi(e^{-i\lambda}) \in H_0(F)$, the function $\hat\varphi_n(\lambda)$ defined in (7) belongs to $H_0(F)$. Therefore, in proving that $\hat\varphi_n(\lambda)$ is optimal, it is sufficient to verify that, for every $m \geq 0$,
$$e^{i\lambda n} - \hat\varphi_n(\lambda) \perp e^{-i\lambda m},$$
i.e.
$$I_{n,m} \equiv \int_{-\pi}^{\pi} e^{i\lambda(n+m)}\left[1 - \frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})}\right] f(\lambda)\, d\lambda = 0, \qquad m \geq 0.$$
The following chain of equations shows that this is actually the case:
$$\begin{aligned}
I_{n,m} &= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\left[1 - \frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})}\right]|\Phi(e^{-i\lambda})|^2\, d\lambda\\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\bigl[\Phi(e^{-i\lambda}) - \Phi_n(e^{-i\lambda})\bigr]\,\overline{\Phi(e^{-i\lambda})}\, d\lambda\\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda(n+m)}\Bigl[\sum_{k=0}^{n-1} b_k e^{-i\lambda k}\Bigr]\Bigl[\sum_{l=0}^{\infty} \bar b_l e^{i\lambda l}\Bigr] d\lambda = 0,
\end{aligned}$$
where the last equation follows because each term is of the form $e^{i\lambda(n+m-k+l)}$ with $n - k \geq 1$, $m \geq 0$, $l \geq 0$, so that
$$\int_{-\pi}^{\pi} e^{i\lambda(n+m-k+l)}\, d\lambda = 0,$$
and the term-by-term integration is justified by the hypothesis $r > 1$. This completes the proof of the theorem.

Remark 1. Expanding $\hat\varphi_n(\lambda)$ in a Fourier series,
$$\hat\varphi_n(\lambda) = c_0 + c_{-1}e^{-i\lambda} + c_{-2}e^{-2i\lambda} + \cdots,$$
we find that the predicted value $\hat\xi_n$ of $\xi_n$, $n \geq 1$, in terms of the past $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is given by the formula
$$\hat\xi_n = c_0\xi_0 + c_{-1}\xi_{-1} + c_{-2}\xi_{-2} + \cdots.$$

Remark 2. A typical example of a spectral density that can be represented in the form (4) is the rational function
$$f(\lambda) = \frac{1}{2\pi}\,\frac{|P(e^{-i\lambda})|^2}{|Q(e^{-i\lambda})|^2},$$
where the polynomials $P(z) = a_0 + a_1 z + \cdots + a_p z^p$ and $Q(z) = 1 + b_1 z + \cdots + b_q z^q$ have no zeros in $\{z\colon |z| \leq 1\}$. In fact, in this case it is enough to put $\Phi(z) = P(z)/Q(z)$. Then $\Phi(z) = \sum_{k=0}^{\infty} c_k z^k$, and the radius of convergence of this series is greater than one.

Let us illustrate Theorem 1 with two examples.

Example 1. Let the spectral density be
$$f(\lambda) = \frac{1}{2\pi}(5 + 4\cos\lambda).$$
The corresponding covariance function $R(n)$ has the shape of a triangle with
$$R(0) = 5, \qquad R(\pm 1) = 2, \qquad R(n) = 0 \quad\text{for } |n| \geq 2. \qquad (11)$$
Since this spectral density can be represented in the form
$$f(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2,$$
we may apply Theorem 1 with $\Phi(z) = 2 + z$. We find easily that
$$\hat\varphi_1(\lambda) = e^{i\lambda}\,\frac{e^{-i\lambda}}{2 + e^{-i\lambda}} = \frac{1}{2 + e^{-i\lambda}}, \qquad \hat\varphi_n(\lambda) = 0 \quad\text{for } n \geq 2. \qquad (12)$$
Therefore $\hat\xi_n = 0$ for all $n \geq 2$; i.e. the (linear) prediction of $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is trivial, which is not at all surprising if we observe that, by (11), the correlation between $\xi_n$ and any of $\xi_0, \xi_{-1}, \ldots$ is zero for $n \geq 2$.

For $n = 1$ we find from (6) and (12) that
$$\hat\xi_1 = \int_{-\pi}^{\pi}\frac{Z(d\lambda)}{2 + e^{-i\lambda}} = \frac{1}{2}\int_{-\pi}^{\pi}\Bigl(1 + \frac{e^{-i\lambda}}{2}\Bigr)^{-1} Z(d\lambda) = \sum_{k=0}^{\infty}\frac{(-1)^k}{2^{k+1}}\int_{-\pi}^{\pi} e^{-i\lambda k}Z(d\lambda) = \sum_{k=0}^{\infty}\frac{(-1)^k}{2^{k+1}}\,\xi_{-k}.$$
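The geometric expansion behind the formula for $\hat\xi_1$ can be verified numerically. The sketch below is an added illustration, not part of the original text (Python with numpy assumed): it recovers the Fourier coefficients of $1/(2 + e^{-i\lambda})$ and checks that they equal $(-1)^k/2^{k+1}$.

```python
import numpy as np

# Fourier coefficients of 1/(2 + e^{-i*lam}) on a uniform periodic grid
lam = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
g = 1.0 / (2.0 + np.exp(-1j * lam))

for k in range(8):
    # coefficient of e^{-i*lam*k}: (1/2pi) * integral of g(lam) * e^{+i*lam*k}
    c_k = (g * np.exp(1j * lam * k)).mean()
    assert abs(c_k - (-1) ** k / 2 ** (k + 1)) < 1e-10, (k, c_k)
print("predictor coefficients match (-1)^k / 2^(k+1)")
```

Because the integrand is smooth and periodic, the grid mean reproduces the normalized integral essentially to machine precision.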

Example 2. Let the covariance function be
$$R(n) = a^{|n|}, \qquad |a| < 1.$$
Then (see Example 5 in §1)
$$f(\lambda) = \frac{1}{2\pi}\cdot\frac{1 - |a|^2}{|1 - ae^{-i\lambda}|^2},$$
i.e.
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2,$$
where
$$\Phi(z) = \frac{(1 - |a|^2)^{1/2}}{1 - az} = (1 - |a|^2)^{1/2}\sum_{k=0}^{\infty}(az)^k,$$
from which
$$\hat\varphi_n(\lambda) = e^{i\lambda n}\,\frac{\Phi_n(e^{-i\lambda})}{\Phi(e^{-i\lambda})} = e^{i\lambda n}\cdot a^n e^{-i\lambda n} = a^n$$
and
$$\hat\xi_n = \int_{-\pi}^{\pi} a^n Z(d\lambda) = a^n\xi_0.$$
In other words, in order to predict the value of $\xi_n$ from the observations $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$, it is sufficient to know only the last observation $\xi_0$.
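A small added check, not in the original (Python with numpy; $a = 0.6$ and $n = 3$ are hypothetical values): the estimator $a^n\xi_0$ indeed satisfies the orthogonality conditions $\mathsf{E}[(\xi_n - a^n\xi_0)\bar\xi_{-k}] = 0$, and the corresponding prediction error agrees with formula (3).

```python
import numpy as np

a, n = 0.6, 3                     # hypothetical parameter values
R = lambda m: a ** abs(m)         # covariance function R(m) = a^{|m|}

# Orthogonality of the error: E[(xi_n - a^n xi_0) xi_{-k}] = R(n+k) - a^n R(k) = 0
for k in range(10):
    assert abs(R(n + k) - a ** n * R(k)) < 1e-12

# Prediction error, consistent with (3): Phi(z) = sqrt(1-a^2)/(1-az) gives
# a_k = sqrt(1-a^2)*a^k, hence sigma_n^2 = sum_{k<n} a_k^2 = 1 - a^(2n)
sigma2 = sum((np.sqrt(1 - a ** 2) * a ** k) ** 2 for k in range(n))
assert abs(sigma2 - (1 - a ** (2 * n))) < 1e-12
print("orthogonality holds; sigma_n^2 = 1 - a^(2n) =", round(sigma2, 6))
```

The computation is purely algebraic, so the assertions hold to rounding error for any real $|a| < 1$.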

Remark 3. It follows from the Wold expansion of a regular sequence $\xi = (\xi_n)$,
$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k}, \qquad (13)$$
that the spectral density $f(\lambda)$ admits the representation
$$f(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2, \qquad (14)$$
where
$$\Phi(z) = \sum_{k=0}^{\infty} a_k z^k. \qquad (15)$$
It is evident that the converse also holds: if $f(\lambda)$ admits the representation (14) with a function $\Phi(z)$ of the form (15), then the Wold expansion of $\xi_n$ has the form (13). Therefore the problem of representing the spectral density in the form (14) and the problem of determining the coefficients $a_k$ in the Wold expansion are equivalent.

The assumptions in Theorem 1 that $\Phi(z)$ has no zeros for $|z| \leq 1$ and that $r > 1$ are in fact not essential. In other words, if the spectral density of a regular sequence is represented in the form (14), then the optimal estimator (in the mean-square sense) for $\xi_n$ in terms of $\xi^0 = (\ldots, \xi_{-1}, \xi_0)$ is determined by formulas (6) and (7).

Remark 4. Theorem 1 (with the preceding remark) solves the prediction problem for regular sequences. Let us show that in fact the same answer remains valid for arbitrary stationary sequences. More precisely, let
$$\tilde\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda),$$
and let $f^r(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2$ be the spectral density of the regular sequence $\xi^r = (\xi_n^r)$. Then $\hat\xi_n$ is determined by (6) and (7). In fact, let (see Subsection 3, §5)
$$\hat\xi_n^r = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z^r(d\lambda),$$
where $Z^r(\Delta)$ is the orthogonal stochastic measure in the representation of the regular sequence $\xi^r$. Then
$$\mathsf{E}|\xi_n - \tilde\xi_n|^2 = \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 F(d\lambda) \geq \int_{-\pi}^{\pi} |e^{i\lambda n} - \hat\varphi_n(\lambda)|^2 F^r(d\lambda) = \mathsf{E}|\xi_n^r - \hat\xi_n^r|^2. \qquad (16)$$
But $\xi_n - \tilde\xi_n = \xi_n^r - \hat\xi_n^r$. Hence $\mathsf{E}|\xi_n - \tilde\xi_n|^2 = \mathsf{E}|\xi_n^r - \hat\xi_n^r|^2$, and it follows from (16) that we may take
$$\hat\xi_n = \int_{-\pi}^{\pi} \hat\varphi_n(\lambda) Z(d\lambda).$$

2. Interpolation. Suppose that $\xi = (\xi_n)$ is a regular sequence with spectral density $f(\lambda)$. The simplest interpolation problem is that of constructing the optimal (mean-square) linear estimator of $\xi_0$ from the results of the measurements $\{\xi_n,\ n = \pm 1, \pm 2, \ldots\}$, i.e. omitting $\xi_0$.

Let $H^0(\xi)$ be the closed linear manifold spanned by the $\xi_n$, $n \neq 0$. Then, according to Theorem 2 of §3, every random variable $\eta \in H^0(\xi)$ can be represented in the form
$$\eta = \int_{-\pi}^{\pi} \varphi(\lambda) Z(d\lambda),$$
where $\varphi$ belongs to $H^0(F)$, the closed linear manifold spanned by the functions $e^{i\lambda n}$, $n \neq 0$. The estimator
$$\hat\xi_0 = \int_{-\pi}^{\pi} \hat\varphi(\lambda) Z(d\lambda) \qquad (17)$$
will be optimal if and only if
$$\inf_{\eta \in H^0(\xi)} \mathsf{E}|\xi_0 - \eta|^2 = \inf_{\varphi \in H^0(F)} \int_{-\pi}^{\pi} |1 - \varphi(\lambda)|^2 F(d\lambda) = \int_{-\pi}^{\pi} |1 - \hat\varphi(\lambda)|^2 F(d\lambda) = \mathsf{E}|\xi_0 - \hat\xi_0|^2.$$
It follows from the perpendicularity properties of the Hilbert space $H^0(F)$ that $\hat\varphi(\lambda)$ is completely determined (compare (10)) by the two conditions
$$\text{(1) } \hat\varphi(\lambda) \in H^0(F), \qquad \text{(2) } 1 - \hat\varphi(\lambda) \perp H^0(F). \qquad (18)$$

Theorem 2 (Kolmogorov). Let $\xi = (\xi_n)$ be a regular sequence such that
$$\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)} < \infty. \qquad (19)$$
Then
$$\hat\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}, \qquad (20)$$
where
$$\alpha = 2\pi\left[\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)}\right]^{-1}, \qquad (21)$$
and the interpolation error $\delta^2 = \mathsf{E}|\xi_0 - \hat\xi_0|^2$ is given by $\delta^2 = 2\pi\alpha$.

Proof. We shall give the proof only under the very stringent hypothesis on the spectral density that
$$0 < c \leq f(\lambda) \leq C < \infty. \qquad (22)$$
It follows from condition (2) of (18) that
$$\int_{-\pi}^{\pi} [1 - \hat\varphi(\lambda)] e^{in\lambda} f(\lambda)\, d\lambda = 0 \qquad (23)$$
for every $n \neq 0$. By (22), the function $[1 - \hat\varphi(\lambda)]f(\lambda)$ belongs to the Hilbert space $L^2([-\pi, \pi], \mathscr{B}[-\pi, \pi], \mu)$ with Lebesgue measure $\mu$. In this space the functions $\{e^{in\lambda}/\sqrt{2\pi},\ n = 0, \pm 1, \ldots\}$ form an orthonormal basis (Problem 7, §11, Chapter II). Hence it follows from (23) that $[1 - \hat\varphi(\lambda)]f(\lambda)$ is a constant, which we denote by $\alpha$. Thus the second condition in (18) leads to the conclusion that
$$\hat\varphi(\lambda) = 1 - \frac{\alpha}{f(\lambda)}. \qquad (24)$$
Starting from the first condition of (18), we now determine $\alpha$. By (22), $\hat\varphi \in L^2$, and the condition $\hat\varphi \in H^0(F)$ is equivalent to the condition that $\hat\varphi$ belong to the closed (in the $L^2$ norm) linear manifold spanned by the functions $e^{i\lambda n}$, $n \neq 0$. Hence it is clear that the zeroth coefficient in the Fourier expansion of $\hat\varphi(\lambda)$ must be zero. Therefore
$$0 = \int_{-\pi}^{\pi}\hat\varphi(\lambda)\, d\lambda = 2\pi - \alpha\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)},$$
and hence $\alpha$ is determined by (21). Finally,
$$\delta^2 = \mathsf{E}|\xi_0 - \hat\xi_0|^2 = \int_{-\pi}^{\pi}|1 - \hat\varphi(\lambda)|^2 f(\lambda)\, d\lambda = |\alpha|^2\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)} = 4\pi^2\left[\int_{-\pi}^{\pi}\frac{d\lambda}{f(\lambda)}\right]^{-1} = 2\pi\alpha.$$
This completes the proof (under condition (22)).

Corollary. If
$$\hat\varphi(\lambda) = \sum_{0 < |k| < \infty} c_k e^{i\lambda k},$$
then
$$\hat\xi_0 = \sum_{0 < |k| < \infty} c_k\xi_k.$$

Example 3. Let $f(\lambda)$ be the spectral density in Example 2 above. Then an easy calculation shows that
$$\hat\varphi(\lambda) = \frac{ae^{-i\lambda} + \bar a e^{i\lambda}}{1 + |a|^2}, \qquad \hat\xi_0 = \frac{a\xi_{-1} + \bar a\xi_1}{1 + |a|^2},$$
and the interpolation error is
$$\delta^2 = 2\pi\alpha = \frac{1 - |a|^2}{1 + |a|^2}.$$
3. Filtering. Let $(\theta, \xi) = ((\theta_n), (\xi_n))$, $n \in \mathbb{Z}$, be a partially observed sequence, where $\theta = (\theta_n)$ and $\xi = (\xi_n)$ are respectively the unobserved and the observed components.

Each of the sequences $\theta$ and $\xi$ will be supposed stationary (wide sense) with zero mean; let the spectral representations be
$$\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\theta(d\lambda), \qquad \xi_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\xi(d\lambda).$$
We write
$$F_\theta(\Delta) = \mathsf{E}|Z_\theta(\Delta)|^2, \qquad F_\xi(\Delta) = \mathsf{E}|Z_\xi(\Delta)|^2$$
and
$$F_{\theta\xi}(\Delta) = \mathsf{E} Z_\theta(\Delta)\overline{Z_\xi(\Delta)}.$$
In addition, we suppose that $\theta$ and $\xi$ are connected in a stationary way, i.e. that their covariance functions $\operatorname{cov}(\theta_n, \xi_m) = \mathsf{E}\theta_n\bar\xi_m$ depend only on the difference $n - m$. Let $R_{\theta\xi}(n) = \mathsf{E}\theta_n\bar\xi_0$; then
$$R_{\theta\xi}(n) = \int_{-\pi}^{\pi} e^{i\lambda n} F_{\theta\xi}(d\lambda).$$

The filtering problem that we shall consider is the construction of the optimal (mean-square) linear estimator $\hat\theta_n$ of $\theta_n$ in terms of some observation of the sequence $\xi$.

The problem is easily solved under the assumption that $\hat\theta_n$ is to be constructed from all the values $\xi_m$, $m \in \mathbb{Z}$. In fact, since $\hat\theta_n = \hat{\mathsf{E}}(\theta_n \mid H(\xi))$, there is a function $\hat\varphi(\lambda)$ such that
$$\hat\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\hat\varphi(\lambda) Z_\xi(d\lambda). \qquad (25)$$
As in Subsections 1 and 2, the conditions to impose on the optimal $\hat\varphi(\lambda)$ are
$$e^{i\lambda n}\hat\varphi(\lambda) \in H(F_\xi), \qquad \theta_n - \hat\theta_n \perp H(\xi).$$
From the latter condition we find $\mathsf{E}(\theta_n - \hat\theta_n)\bar\xi_m = 0$ for every $m \in \mathbb{Z}$, i.e.
$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)} F_{\theta\xi}(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda(n-m)}\hat\varphi(\lambda) F_\xi(d\lambda). \qquad (26)$$
Therefore, if we suppose that $F_{\theta\xi}(\lambda)$ and $F_\xi(\lambda)$ have densities $f_{\theta\xi}(\lambda)$ and $f_\xi(\lambda)$, we find from (26) that
$$\int_{-\pi}^{\pi} e^{i\lambda(n-m)}\bigl[f_{\theta\xi}(\lambda) - \hat\varphi(\lambda)f_\xi(\lambda)\bigr]\, d\lambda = 0.$$
If $f_\xi(\lambda) > 0$ (almost everywhere with respect to Lebesgue measure), we find immediately that
$$\hat\varphi(\lambda) = \frac{f_{\theta\xi}(\lambda)}{f_\xi(\lambda)}. \qquad (27)$$
In the general case we put
$$\hat\varphi(\lambda) = f_{\theta\xi}(\lambda) f_\xi^{\oplus}(\lambda),$$
where $f_\xi^{\oplus}(\lambda)$ is the "pseudotransform" of $f_\xi(\lambda)$, i.e.
$$f_\xi^{\oplus}(\lambda) = \begin{cases} 1/f_\xi(\lambda), & f_\xi(\lambda) \neq 0,\\ 0, & f_\xi(\lambda) = 0. \end{cases}$$
Then the filtering error is
$$\Delta^2 \equiv \mathsf{E}|\theta_n - \hat\theta_n|^2 = \int_{-\pi}^{\pi}\bigl[f_\theta(\lambda) - |f_{\theta\xi}(\lambda)|^2 f_\xi^{\oplus}(\lambda)\bigr]\, d\lambda. \qquad (28)$$
As is easily verified, $e^{i\lambda n}\hat\varphi \in H(F_\xi)$, and consequently the estimator (25), with the function (27), is optimal.

Example 4 (Detection of a signal in the presence of noise). Let $\xi_n = \theta_n + \eta_n$, where the signal $\theta = (\theta_n)$ and the noise $\eta = (\eta_n)$ are uncorrelated sequences with spectral densities $f_\theta(\lambda)$ and $f_\eta(\lambda)$. Then
$$\hat\theta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\hat\varphi(\lambda) Z_\xi(d\lambda),$$
where
$$\hat\varphi(\lambda) = f_\theta(\lambda)\bigl[f_\theta(\lambda) + f_\eta(\lambda)\bigr]^{\oplus},$$
and the filtering error is
$$\Delta^2 = \int_{-\pi}^{\pi} f_\theta(\lambda) f_\eta(\lambda)\bigl[f_\theta(\lambda) + f_\eta(\lambda)\bigr]^{\oplus}\, d\lambda.$$
The solution (25) obtained above can now be used to construct an optimal estimator $\hat{\hat\theta}_{n+m}$ of $\theta_{n+m}$ from the observations $\xi_k$, $k \leq n$, where $m$ is a given element of $\mathbb{Z}$.

Let us suppose that $\xi = (\xi_n)$ is regular, with spectral density
$$f_\xi(\lambda) = \frac{1}{2\pi}|\Phi(e^{-i\lambda})|^2,$$
where $\Phi(z) = \sum_{k=0}^{\infty} a_k z^k$. By the Wold expansion,
$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k},$$
where $\varepsilon = (\varepsilon_k)$ is white noise with the spectral resolution
$$\varepsilon_n = \int_{-\pi}^{\pi} e^{i\lambda n} Z_\varepsilon(d\lambda).$$
Since
$$Z_\xi(d\lambda) = \Phi(e^{-i\lambda}) Z_\varepsilon(d\lambda)$$
and
$$\hat\theta_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda(n+m)}\hat\varphi(\lambda)\Phi(e^{-i\lambda}) Z_\varepsilon(d\lambda) = \sum_{k \leq n+m} \hat a_{n+m-k}\,\varepsilon_k,$$
where
$$\hat a_l = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{i\lambda l}\hat\varphi(\lambda)\Phi(e^{-i\lambda})\, d\lambda, \qquad (29)$$
then
$$\hat{\hat\theta}_{n+m} = \hat{\mathsf{E}}(\hat\theta_{n+m} \mid H_n(\xi)).$$
But $H_n(\xi) = H_n(\varepsilon)$, and therefore
$$\hat{\hat\theta}_{n+m} = \sum_{k \leq n}\hat a_{n+m-k}\,\varepsilon_k = \int_{-\pi}^{\pi}\Bigl[\sum_{k \leq n}\hat a_{n+m-k} e^{i\lambda k}\Bigr] Z_\varepsilon(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n}\Bigl[\sum_{l=0}^{\infty}\hat a_{l+m} e^{-i\lambda l}\Bigr]\Phi^{\oplus}(e^{-i\lambda}) Z_\xi(d\lambda),$$
where $\Phi^{\oplus}$ is the pseudotransform of $\Phi$. We have therefore established the following theorem.

Theorem 3. If the sequence $\xi = (\xi_n)$ under observation is regular, then the optimal (mean-square) linear estimator $\hat{\hat\theta}_{n+m}$ of $\theta_{n+m}$ in terms of $\xi_k$, $k \leq n$, is given by
$$\hat{\hat\theta}_{n+m} = \int_{-\pi}^{\pi} e^{i\lambda n}\hat H_m(e^{-i\lambda}) Z_\xi(d\lambda), \qquad (30)$$
where
$$\hat H_m(e^{-i\lambda}) = \Bigl[\sum_{l=0}^{\infty}\hat a_{l+m} e^{-i\lambda l}\Bigr]\Phi^{\oplus}(e^{-i\lambda}) \qquad (31)$$
and the coefficients $\hat a_k$ are defined by (29).

PROBLEMS

1. Let $\xi$ be a nondegenerate regular sequence with spectral density (4). Show that $\Phi(z)$ has no zeros for $|z| \leq 1$.

2. Show that the conclusion of Theorem 1 remains valid even without the hypotheses that $\Phi(z)$ has radius of convergence $r > 1$ and that the zeros of $\Phi(z)$ all lie in $|z| > 1$.

3. Show that, for a regular process, the function $\Phi(z)$ introduced in (4) can be represented in the form
$$\Phi(z) = \sqrt{2\pi}\,\exp\Bigl\{\tfrac{1}{2}c_0 + \sum_{k=1}^{\infty} c_k z^k\Bigr\}, \qquad |z| < 1,$$
where
$$c_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ik\lambda}\ln f(\lambda)\, d\lambda.$$
Deduce from this formula and (5.9) that the one-step prediction error $\sigma_1^2 = \mathsf{E}|\xi_1 - \hat\xi_1|^2$ is given by the Szego-Kolmogorov formula
$$\sigma_1^2 = 2\pi\exp\Bigl\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln f(\lambda)\, d\lambda\Bigr\}.$$

4. Prove Theorem 2 without using (22).

5. Let a signal $\theta$ and a noise $\eta$, not correlated with each other, have spectral densities
$$f_\theta(\lambda) = \frac{1}{2\pi}\cdot\frac{1}{|1 + b_1 e^{-i\lambda}|^2} \quad\text{and}\quad f_\eta(\lambda) = \frac{1}{2\pi}\cdot\frac{1}{|1 + b_2 e^{-i\lambda}|^2}.$$
Using Theorem 3, find an estimator $\hat{\hat\theta}_{n+m}$ for $\theta_{n+m}$ in terms of $\xi_k$, $k \leq n$, where $\xi_k = \theta_k + \eta_k$. Consider the same problem for the spectral densities
$$f_\theta(\lambda) = \frac{1}{2\pi}|2 + e^{-i\lambda}|^2 \quad\text{and}\quad f_\eta(\lambda) = \frac{1}{2\pi}.$$

§7. The Kalman-Bucy Filter and Its Generalizations

1. From a computational point of view, the solution presented above for the problem of filtering out an unobservable component $\theta$ by means of observations of $\xi$ is not practical, since, being expressed in terms of the spectrum, it has to be carried out by spectral methods. In the method proposed by Kalman and Bucy, the synthesis of the optimal filter is carried out recursively; this makes it possible to implement it on a digital computer. There are also other reasons for the wide use of the Kalman-Bucy filter, one being that it still "works" even without the assumption that the sequence $(\theta, \xi)$ is stationary.

We shall present not only the usual Kalman-Bucy method, but also a generalization in which the recurrent equations determined by $(\theta, \xi)$ have coefficients that depend on all the data observed in the past.

Thus, let us suppose that $(\theta, \xi) = ((\theta_n), (\xi_n))$ is a partially observed sequence, and let
$$\theta_n = (\theta_1(n), \ldots, \theta_k(n)) \quad\text{and}\quad \xi_n = (\xi_1(n), \ldots, \xi_l(n))$$
be governed by the recurrent equations
$$\begin{aligned}
\theta_{n+1} &= a_0(n, \xi) + a_1(n, \xi)\theta_n + b_1(n, \xi)\varepsilon_1(n+1) + b_2(n, \xi)\varepsilon_2(n+1),\\
\xi_{n+1} &= A_0(n, \xi) + A_1(n, \xi)\theta_n + B_1(n, \xi)\varepsilon_1(n+1) + B_2(n, \xi)\varepsilon_2(n+1).
\end{aligned} \qquad (1)$$


Here e1(n) = (e 11 (n), ... , elk(n))

and

e2(n) = (e 21 (n), ... , e21(n))

are independent Gaussian vectors with independent components, each of which is normally distributed with parameters 0 and 1; a0(n, ~) = (a 01 (n, ~), ... , rxOk(n, ~))and A 0(n, ~) = (A 01 (n, ~), ... , A 0ln, ~))are vector functions, where the dependence on ~ = {~ 0 , ... , ~n) is determined without looking ahead, i.e. for a given n the functions a 0 (n, ~), ... , A 01 (n, ~) depend only on ~ 0 , ••• , ~n; the matrix functions b1(n, ~)

= llb!Jl(n, ~)II,

B1(n, ~) = IIBIJ>(n, ~)II, a1(n, ~)

= lla!J>(n, ~)II,

bz(n, ~) = llbljl(n, ~)II, Bz(n, ~)

= IIBljl(n, ~)II,

A1(n, ~) = IIA\jl(n, ~)II

have orders k x k, k x l, l x k, l x I, k x k, l x k, respectively, and also depend on ~ without looking ahead. We also suppose that the initial vector (0 0, ~ 0 ) is independent of the sequences e1 = (e 1(n)) and e2 = (ez(n)). To simplify the presentation, we shall frequently not indicate the dependence of the coefficients on~So that the system (1) will have a solution with finite second moments, we assume that E(11Boll 2 + ll~oll 2 ) < oo

and if g(n, ~) is any of the functions a0 i, A0 i, blJ>, b\J>, BIJ> or BIJ> then E lg(n, ~W < oo, n = 0, 1,... . With these assumptions, (0, ~) has E(110nll 2 + ll~nll 2 ) < oo, n ~ 0. Now let ff~ = tr{ro: ~ 0 , ••• , ~"} be the smallest u-algebra generated by ~0 ,

.•. ,

~nand

According to Theorem 1, §8, Chapter II, mn = (m 1(n), ... , mk(n))is an optimal estimator (in the mean square sense) for the vector (}n = (0 1(n), ... , (}k(n)), and Eyn = E[(On- mn)((}n- mn)*] is the matrix of errors of observation. To determine these matrices for arbitrary sequences ((}, ~) governed by equations (1) is a very difficult problem. However, there is a further supplementary condition on (0 0 , ~ 0 ) that leads to a system of recurrent equations for mn and Ym that still contains the Kalman-Bucy filter. This is the condition that the conditional distribution P(0 0 ::::; a I~ 0 I) is Gaussian, P(00

::::;

1 fa exp{- (x - m0 ) 2 } dx, a! ~ 0 ) = ;;:c::2 2 y 2ny - oo Yo 0

with parameters m0 = mo(~o), Yo = YoC~o). To begin with, let us establish an important auxiliary result.

(2)

438

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Lemma 1. Under the assumptions made above about the coefficients of (1), together with (2), the sequence (fJ, 0 is conditionally Gaussian, i.e. the conditional distribution function P{fJ 0

::::;;

ao, ... , 1'/n::::;; anl~n

is (P-a.s.) the distribution.fimction of ann-dimensional Gaussian vector whose mean and covariance matrix depend on (~ 0 , •.. , ~n). PROOF. We prove only the Gaussian character of P(fJn::::;; al~~); this is enough to let us obtain equations for mn and Yn· First we observe that (1) implies that the conditional distribution

P(fJn+l ::::;; a1, ~n+l ::::;; xl~~,f}n =b) is Gaussian with mean-value vector

A

o

+

A b = ( ao t Ao

+ a 1b ) + Atb

and covariance matrix IEB =

(

bob boB) (b a B)* B a B '

where bob= b 1 bf + b2 bi, b a B = b 1 Bf + b2 Bi, B a B = B 1 Bf Let Cn = ((}n, ~n) and t = (t 1, ... , tk+l). Then E[exp(it*Cn+ 1 )1~~' fJnJ

= exp{it*(A 0 (n,

~)

+ B2 Bi.

+ A 1(n, ~)f}n)- !t*IEB(n, ~)t}. (3)

Suppose now that the conclusion of the lemma holds for some n

~

0. Then

E[exp(it*A 1(n, ~)f}n)I~~J = exp(it*A 1(n, Omn- !t*(A 1(n, ~)ynAfCn, ~))t. (4) Let us show that (4) is also valid when n is replaced by n From (3) and (4), we have E[exp(it*Cn+ 1 )1~~]

+

1.

+ A 1(n, ~)mn) -±t*IEB(n, ~)t- ±t*(Al(n, OynAf(n, ~))t}.

= exp{it*(A 0 (n, 0

Hence the conditional distribution P(fJn+l::::;; a, ~n+l::::;; xl~~)

(5)

is Gaussian. As in the proof of the theorem on normal correlation (Theorem 2, §13, Chapter II) we can verify that there is a matrix C such that the vector 1J = [fJn+l- E(fJn+li~~)J- C[~n+l- E(~n+l~~~)J

has the property that (P-a.s.) E[17(~n+ 1

-

E(~n+ tl ~~))*I~~] = 0.

439

§7. The Kalman-Bucy Filter and Its Generalizations

It follows that the conditionally-Gaussian vectors 11 and under the condition g;~, are independent, i.e.

~n+ 1,

considered

P(11 E A, ~n+1 E Big;~)= P(11 E A lg;~)' P(~n+ 1 E Big;~) for all A E BB(Rk), BE &B(R 1). Therefore if s = (s 1, ... , sn) then E[exp(is*en+1)1g;~, ~n+1J

= E{exp(is*[E(en+11g;~) + 11 + C[~n+1- E(~n+11g;~)]J)Ig;~, ~n+tl = exp{is*[E(en+tlg;~) + C[~n+t- E(~n+tlg;~)]} x E[exp(is*11)lg;~, ~n+1] = exp{is*[E(en+11g;m + C[~n+1- E(~n+11g;~)]} (6) x E(exp(is*11)1g;n

By (5), the conditional distribution P(11 :::::; ylg;~) is Gaussian. With (6), this shows that the conditional distribution P(en+t:::::; alg;~+ 1 ) is also Gaussian. This completes the proof of the lemma.

Theorem 1. Let (e, ~)be a partial observation of a sequence that satisfies the system (1) and condition (2). Then (mn, Yn) obey the following recursion relations:

mn+ 1 = [a 0 X

+ a 1mnJ +[boB+ a 1ynATJ[B oB + A 1ynAiJE!l

[~n+1-

Ao- A1mn],

(7)

Yn+ 1 = [a 1ynai +bob] - [boB+ a1ynAi] [BoB+ A 1ynAiJE!l (8) X [b B + a1ynAiJ*. 0

PROOF.

From (1), (9)

and

en+ 1

-

+ b 1et(n + 1) + b2ein + 1), A1[en- mnJ + B1e1(n + 1) + B2e2(n + 1).

E(en+ 1 lg;~) = at[en- mnJ

~n+ 1 - E(~n+11g;~) =

(10) Let us write

d11 = cov(en+ l• en+ 11 g;~) = E{[en+1- E(en+tlg;~)][en+t- E(en+11g;m*;g;H,

d12 = cov(en+ 1• ~n+ 1l g;~) = E{[en+1- E[en+11g;m[~n+1- E(~n+11g;m*;g;n,

d22 =COV(~n+l• ~n+tlg;~) = E{[~n+1- E(~n+11g;~)J[~n+1- E(~n+11g;~)J*/g;~}.

440

VI. Stationary (Wide Sense) Random Sequences. L 2 -Theory

Then, by (10),

d22

=

AlynAT

+ B a B. (11)

By the theorem on normal correlation (see Theorem 2 and Problem 4, §13, Chapter II), mn+l = E(I'Jn+li~~, en+l) = E(On+li~~)

+ d12d?2(c;n+1-

E(c;n+li~~))

and

Yn+l = cov(On-1• On+ll~~. en+l)= du - d12d?2dT2· If we then use the expressions from (9) for E(On+ll~~) and E(c;n+ll~~) and those for d 11 , d 12 , d22 from (11), we obtain the required recursion formulas (7) and (8). This completes the proof of the theorem.

Corollary 1. If the coefficients a0 (n, c;), ... , B2(n, c;) in (1) are independent of c; the corresponding method is known as the Kalman-Bucy method, and equations (7) and (8)for mn and Yn describe the Kalman-Bucy filter. It is important to observe that in this case the conditional and unconditional error matrices Yn agree, i.e. Corollary 2. Suppose that a partially observed sequence (On, c;n) has the property that (Jn satisfies the first equation (1), and that t;n satisfies the equation i;n = A 0 (n- 1, i;) + A 1(n- 1, i;)On + B 1(n - 1, i;)e 1(n) + B2(n- 1, i;)ein).

(12)

Then evidently

+ A1(n, i;)[ao(n, i;) + a1(n, i;)(Jn + b1(n, c;)e 1(n + 1) +bin, c;)e2(n+ 1)] + B 1(n, i;)e 1(n + 1) + Bin, i;)ein + 1),

en+l = Ao(n, i;)

and with the notation Ao

=

Bl

=

+ Alao, Albl + Bl, Ao

Al

=

Alal,

B2 = A 1b2 + B 2,

we find that the case under consideration also depends on the model (1), and that mn and Yn satisfy (7) and (8). 2. We now consider a linear model (compare (1))

+ alen + a2en + blel(n + 1) + b2e2(n + 1), Ao + A 10n + A 2i;n + B 1e1(n + 1) + B2e2(n + 1),

en+l = ao i;n+l =

(13)

441

§7. The Kalman-Bucy Filter and Its Generalizations

where the coefficients a_0, ..., B_2 may depend on n (but not on ξ), and ε_1(n), ε_2(n) are independent Gaussian random variables with Eε_i(n) = 0 and Eε_i²(n) = 1. Let (13) be solved with initial values (θ_0, ξ_0) such that the conditional distribution P(θ_0 ≤ a | ξ_0) is Gaussian with parameters m_0 = E(θ_0 | ξ_0) and γ_0 = cov(θ_0, θ_0 | ξ_0). Then, by the theorem on normal correlation and by (7) and (8), the optimal estimator m_n = E(θ_n | F_n^ξ) is a linear function of ξ_0, ξ_1, ..., ξ_n.

This remark makes it possible to prove the following important statement about the structure of the optimal linear filter without the assumption that the sequence is Gaussian.

Theorem 2. Let (θ, ξ) = (θ_n, ξ_n)_{n≥0} be a partially observed sequence that satisfies (13), where ε_1(n), ε_2(n) are uncorrelated random variables with Eε_i(n) = 0, Eε_i²(n) = 1, and the components of the initial vector (θ_0, ξ_0) have finite second moments. Then the optimal linear estimator m_n = Ê(θ_n | ξ_0, ..., ξ_n) satisfies (7) with a_0(n, ξ) = a_0(n) + a_2(n)ξ_n and A_0(n, ξ) = A_0(n) + A_2(n)ξ_n, and the error matrix γ_n = E[(θ_n − m_n)(θ_n − m_n)*] satisfies (8), with initial values

m_0 = cov(θ_0, ξ_0) cov⁺(ξ_0, ξ_0) · ξ_0,
γ_0 = cov(θ_0, θ_0) − cov(θ_0, ξ_0) cov⁺(ξ_0, ξ_0) cov*(θ_0, ξ_0),   (14)

where cov⁺ denotes the pseudoinverse of the covariance matrix.

For the proof of this theorem we need the following lemma, which reveals the role of the Gaussian case in determining optimal linear estimators.

Lemma 2. Let (α, β) be a two-dimensional random vector with E(α² + β²) < ∞, and let (ᾱ, β̄) be a two-dimensional Gaussian vector with the same first and second moments as (α, β), i.e.

Eᾱ^i = Eα^i,  Eβ̄^i = Eβ^i,  i = 1, 2;  Eᾱβ̄ = Eαβ.

Let λ(b) be a linear function of b such that λ(b) = E(ᾱ | β̄ = b). Then λ(β) is the optimal (in the mean-square sense) linear estimator of α in terms of β, i.e.

Ê(α | β) = λ(β).

Here Eλ(β) = Eα.

PROOF. We first observe that the existence of a linear function λ(b) coinciding with E(ᾱ | β̄ = b) follows from the theorem on normal correlation. Moreover, let λ̄(b) be any other linear estimator. Then

E[ᾱ − λ̄(β̄)]² ≥ E[ᾱ − λ(β̄)]²,


VI. Stationary (Wide Sense) Random Sequences. L 2 - Theory

and since λ̄(b) and λ(b) are linear and the hypotheses of the lemma are satisfied, we have

E[α − λ̄(β)]² = E[ᾱ − λ̄(β̄)]² ≥ E[ᾱ − λ(β̄)]² = E[α − λ(β)]²,

which shows that λ(β) is optimal in the class of linear estimators. Finally,

Eλ(β) = Eλ(β̄) = E[E(ᾱ | β̄)] = Eᾱ = Eα.

This completes the proof of the lemma.

PROOF OF THEOREM 2. We consider, besides (13), the system

θ̄_{n+1} = a_0 + a_1 θ̄_n + a_2 ξ̄_n + b_1 ε̄_{11}(n + 1) + b_2 ε̄_{12}(n + 1),
ξ̄_{n+1} = A_0 + A_1 θ̄_n + A_2 ξ̄_n + B_1 ε̄_{21}(n + 1) + B_2 ε̄_{22}(n + 1),   (15)

where the ε̄_{ij}(n) are independent Gaussian random variables with Eε̄_{ij}(n) = 0 and Eε̄_{ij}²(n) = 1. Let (θ̄_0, ξ̄_0) also be a Gaussian vector which has the same first moments and covariance as (θ_0, ξ_0) and is independent of the ε̄_{ij}(n). Then, since (15) is linear, the vector (θ̄_0, ..., θ̄_n, ξ̄_0, ..., ξ̄_n) is Gaussian, and therefore the conclusion of the theorem follows from Lemma 2 (more precisely, from its multidimensional analog) and the theorem on normal correlation. This completes the proof of the theorem.

3. Let us consider some illustrations of Theorems 1 and 2.

EXAMPLE 1. Let θ = (θ_n) and η = (η_n) be two stationary (wide sense) uncorrelated random sequences with Eθ_n = Eη_n = 0 and spectral densities

f_θ(λ) = (1/2π) · 1/|1 + b_1 e^{−iλ}|²  and  f_η(λ) = (1/2π) · 1/|1 + b_2 e^{−iλ}|²,

where |b_1| < 1, |b_2| < 1. We are going to interpret θ as a useful signal and η as noise, and suppose that observation produces the sequence ξ = (ξ_n) with ξ_n = θ_n + η_n.

According to Corollary 2 to Theorem 3 of §3 there are (mutually uncorrelated) white noises ε_1 = (ε_1(n)) and ε_2 = (ε_2(n)) such that

θ_{n+1} = −b_1 θ_n + ε_1(n + 1),  η_{n+1} = −b_2 η_n + ε_2(n + 1).

Then

ξ_{n+1} = θ_{n+1} + η_{n+1} = −b_1 θ_n − b_2 η_n + ε_1(n + 1) + ε_2(n + 1)
 = −b_2(θ_n + η_n) − θ_n(b_1 − b_2) + ε_1(n + 1) + ε_2(n + 1)
 = −b_2 ξ_n − (b_1 − b_2)θ_n + ε_1(n + 1) + ε_2(n + 1).


Hence θ and ξ satisfy the recursion relations

θ_{n+1} = −b_1 θ_n + ε_1(n + 1),
ξ_{n+1} = −(b_1 − b_2)θ_n − b_2 ξ_n + ε_1(n + 1) + ε_2(n + 1),   (16)

and, according to Theorem 2, m_n = Ê(θ_n | ξ_0, ..., ξ_n) and γ_n = E(θ_n − m_n)² satisfy the following system of recursion equations for optimal linear filtering:

m_{n+1} = −b_1 m_n + [1 + b_1(b_1 − b_2)γ_n]/[2 + (b_1 − b_2)²γ_n] · [ξ_{n+1} + (b_1 − b_2)m_n + b_2 ξ_n],
γ_{n+1} = (1 + b_1²γ_n) − [1 + b_1(b_1 − b_2)γ_n]²/[2 + (b_1 − b_2)²γ_n].   (17)

Let us find the initial conditions under which we should solve this system. Write d_11 = Eθ_n², d_12 = Eθ_n ξ_n, d_22 = Eξ_n². Then we find from (16) that

d_12 = b_1(b_1 − b_2)d_11 + b_1 b_2 d_12 + 1,
d_22 = (b_1 − b_2)²d_11 + b_2²d_22 + 2b_2(b_1 − b_2)d_12 + 2,

from which

d_11 = d_12 = 1/(1 − b_1²),  d_22 = (2 − b_1² − b_2²)/[(1 − b_1²)(1 − b_2²)],

which, by (14), leads to the following initial values:

m_0 = (d_12/d_22)ξ_0 = [(1 − b_2²)/(2 − b_1² − b_2²)]ξ_0,
γ_0 = d_11 − d_12²/d_22 = 1/(1 − b_1²) − (1 − b_2²)/[(1 − b_1²)(2 − b_1² − b_2²)] = 1/(2 − b_1² − b_2²).   (18)

Thus the optimal (in the least-squares sense) linear estimators m_n for the signal θ_n in terms of ξ_0, ..., ξ_n and the mean-square error are determined by the system of recurrent equations (17), solved under the initial conditions (18). Observe that the equation for γ_n does not contain any random components, and consequently the number γ_n, which is needed for finding m_n, can be calculated in advance, before the filtering problem has been solved.
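The recursions of this example are easy to run numerically. The following is a minimal sketch, not an implementation from the text: the coefficients b_1 = 0.5, b_2 = −0.3, the run length, and the random seed are arbitrary choices, and the signal and observation are started at their zero means rather than drawn from the stationary law.

```python
import random

# Sketch of the filtering recursions for the signal-plus-noise model:
# theta and xi evolve by (16); m and gamma by the Kalman-type updates (17),
# started from the initial values (18) (with xi_0 = 0, so m_0 = 0).
b1, b2, N = 0.5, -0.3, 200
random.seed(1)

gamma = 1.0 / (2.0 - b1 ** 2 - b2 ** 2)        # gamma_0 from (18)
theta, xi, m = 0.0, 0.0, 0.0                   # simplification: start at the means

for n in range(N):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    theta_new = -b1 * theta + e1                                # (16), signal
    xi_new = -(b1 - b2) * theta - b2 * xi + e1 + e2             # (16), observation
    gain = (1.0 + b1 * (b1 - b2) * gamma) / (2.0 + (b1 - b2) ** 2 * gamma)
    m = -b1 * m + gain * (xi_new + b2 * xi + (b1 - b2) * m)     # estimator update
    gamma = (1.0 + b1 ** 2 * gamma) - gain * (1.0 + b1 * (b1 - b2) * gamma)
    theta, xi = theta_new, xi_new

print(round(gamma, 4))   # gamma_n settles to a limit independent of the data
```

As the text observes, the γ_n-recursion involves no randomness, so the printed limiting error could equally be computed before any observations arrive.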

EXAMPLE 2. This example is instructive because it shows that the result of Theorem 2 can be applied to find the optimal linear filter in a case where the sequence (θ, ξ) is described by a (nonlinear) system which is different from (13).


Let ε_1 = (ε_1(n)) and ε_2 = (ε_2(n)) be two independent Gaussian sequences of independent random variables with Eε_i(n) = 0 and Eε_i²(n) = 1, n ≥ 1. Consider the pair of sequences (θ, ξ) = (θ_n, ξ_n), n ≥ 0, with

θ_{n+1} = aθ_n + (1 + θ_n)ε_1(n + 1),
ξ_{n+1} = Aθ_n + ε_2(n + 1).   (19)

We shall suppose that θ_0 is independent of (ε_1, ε_2) and that θ_0 ~ N(m_0, γ_0). The system (19) is nonlinear, and Theorem 2 is not immediately applicable. However, if we put

ε̄_1(n + 1) = (1 + θ_n)ε_1(n + 1)/√(E(1 + θ_n)²),

we can observe that Eε̄_1(n) = 0, Eε̄_1(n)ε̄_1(m) = 0 for n ≠ m, and Eε̄_1²(n) = 1. Hence we have reduced (19) to the linear system

θ_{n+1} = aθ_n + b_1 ε̄_1(n + 1),
ξ_{n+1} = Aθ_n + ε_2(n + 1),   (20)

where b_1 = √(E(1 + θ_n)²) and {ε̄_1(n)} is a sequence of uncorrelated random variables. Now (20) is a linear system of the same type as (13), and consequently the optimal linear estimator m_n = Ê(θ_n | ξ_0, ..., ξ_n) and its error γ_n can be determined from (7) and (8) via Theorem 2, applied in the following form in the present case:

m_{n+1} = am_n + [aγ_n A/(1 + A²γ_n)] · [ξ_{n+1} − Am_n],
γ_{n+1} = (a²γ_n + b_1²) − (aγ_n A)²/(1 + A²γ_n),

where b_1 = √(E(1 + θ_n)²) must be found from the first equation in (19).

~0 =

0,

(21)

where e 1 is as in (1). Then from (7) and (8), with m. = E((J I$'~) andy., we find that mn+l =

m.

+ y.A!(n,

X [~n+l-

~)[(B 1 B!)(n, ~)

+ A 1(n,

~)y.A!(n, ~)]~

A 0 (n, ~)- A 1(n, ~)m.],

Yn+l = Yn- y.A!(n, ~)[(B 1 B!)(n, ~)

+ A 1(n, ~)y.A!(n, ~)]~ A 1(n, ~)y•. (22)

445

§7. The Kalman-Bucy Filter and Its Generalizations

If the matrices B 1Bt are nonsingular, the solution of (22) is given by

+ Ymto Af(m, ~)(B 1 Bt)- 1 (m, ~)Af(m, ~)r 1

mn+1 = [ E X [

Yn+l =

m

[E

+ ymtOAt(m, ~)(B1Bt)- 1 (m, ~)(~m+l-

Ao(m,

+ Ymt0 Af(m, ~)(B1Bt)- 1 (m, ~)A1(m, ~)r 1 y,

mJ (23)

where E is a unit matrix.

4.

PROBLEMS

1. Show that the vectors m. and

e.- m. in (1) are uncorrelated: E[m:(o - m.)] = 0.

2. In (1), let y and the coefficients other than a0 (n, e) and A 0 (n, e) be independent of "chance" (i.e. of e). Show that then the conditional covariance 'l'n is independent of "chance": 'l'n = Ey•. 3. Show that the solution of (22) is given by (23). 4. Let (0,

0

= (0.,

e.) be a Gaussian sequence satisfying the following special case of(1):

Show that if A =I 0, b =I 0, B =I 0, the limiting error of filtering, y = exists and is determined as the positive root of the equation

2+ [B2(1A2- a2) - b2] y -

y

b2 B2

Al =

0.

lim.~oo

y.,

CHAPTER VII

Sequences of Random Variables That Form Martingales

§1. Definitions of Martingales and Related Concepts 1. The study of the dependence of random variables arises in various ways in probability theory. In the theory of stationary (wide sense) random sequences, the basic indicator of dependence is the covariance function, and the inferences made in this theory are determined by the properties of that function. In the theory of Markov chains (§12 of Chapter I; Chapter VIII) the basic dependence is supplied by the transition function, which completely determines the development of the random variables involved in Markov dependence. In the present chapter (see also §11, Chapter I), we single out a rather wide class of sequences of random variables (martingales and their generalizations) for which dependence can be studied by methods based on a discussion of the properties of conditional expectations.

2. Let (0, §', P) be a given probability space, and let (§,;) be a family of a-algebras §,;, n ~ 0, such that $'o s; ~ s; · · · s; §'. Let X 0 , X 1 , .•• be a sequence of random variables defined on (0, §', P). If, for each n ~ 0, the variable X,. is §,;-measurable, we say that the set X = (X,.,§,;), n ~ 0, or simply X = (X,.,§,;), is a stochastic sequence. If a stochastic sequence X = (X,.,§,;) has the property that, for each n ~ 1, the variable X,. is §.;_cmeasurable, we write X= (X,., §,;_ 1 ), taking F _ 1 = F 0 , and call X a predictable sequence. We call such a sequence increasing if X 0 = 0 and X,. ::s;; X,.+ 1 (P-a.s.).

Definition 1. A stochastic sequence X= (X,.,§',.) is a martingale, or a submartingale, if, for all n ~ 0, (1)

447

§1. Definitions of Martingales and Related Concepts

and, respectively, E(X.+ 1 1~)

or

E(X.+ll~) ~

=

x.

x.

(P-a.s.)(martingale) (2)

(P-a.s.)(submartingale).

A stochastic sequence X = (X.~) is a supermartingale if the sequence -X= (-X.,~) is a submartingale. where = u{w: X 0 , ... , X.}, and In the special case when~= the stochastic sequence X= (X.,~) is a martingale (or submartingale), we say that the sequence (X.).~ 0 itself is a martingale (or submartingale). It is easy to deduce from the properties of conditional expectations that (2) is equivalent to the property that, for every n ~ 0 and A E ~.

ff:,

L L

ff:

L ~L

Xn+l dP =

or

xn+l dP

x.dP (3)

x.dP.

1. If (~.).~ 0 is a sequence of independent random variables with = 0 and x. = ~ 0 + · · · + ~., ~ = u{w: ~ 0 , ••. , ~.}, the stochastic sequence X= {X.,~) is a martingale.

ExAMPLE

E~.

2. If (~.).~ 0 is a sequence of independent random variables with E~. = 1, the stochastic sequence (X.,~) with X.= n~=o ~k• ~ = u{w: ~ 0 , •.• , ~.}is also a martingale. ExAMPLE

ExAMPLE

3. Let

~

be a random variable with EI~ I < oo and ~s;;;~s;;;

Then the sequence X

=(X.,~)

with

... s;;;ff.

x. =

E(~ I~)

is a martingale.

ExAMPLE 4. If (~.).~ 0 is a sequence of nonnegative integrable random variables, the sequence (X.) with x. = ~ 0 + · · · + ~.is a submartingale. ExAMPLE 5. If X= (X.,~) is a martingale and g(x) is convex downward with E lg(X.)I < oo, n ~ 0, then the stochastic sequence (g(X.), ~) is a submartingale (as follows from Jensen's inequality). If X= (X.,~) is a submartingale and g(x) is convex downward and nondecreasing, with Elg(X.)I < oo for all n ~ 0, then (g(X.), ~) is also a su bmartingale.

Assumption (1) in Definition 1 ensures the existence of the conditional expectations E(X•+ 1 I~), n ~ 0. However, these expectations can also exist without the assumption that E IX.+ 1 1 < oo. Recall that by §7 of Chapter

448

VII. Sequences of Random Variables That Form Martingales

II, E(X ;+ 1I~) and E(X;+ 1I~) are always defined. Let us write A = B (P-a.s.) when P(A!::,. B)= 0. Then if {w:E(X;+ 1 1~)

< oo} u

{w:E(X,;-1~)

< oo}

= Q

(P-a.s.)

we say that E(Xn + 1 I~) is also defined and is given by E(Xn+11~) = E(X:+1I~)- E(X,;-+11~).

After this, the following definition is natural.

Definition 2. A stochastic sequence X = (Xn,

~)is a generalized martingale E(Xn+ll~) are defined expectations conditional the if (or submartingale) satisfied. is (2) and 0 ~ n for every

Notice that it follows from this definition that E(X;+ 1 1~) < oo for a generalized submartingale, and the E(IXn+tll~) < oo (P-a.s.) for a generalized martingale. 3. In the following definition we introduce the concept of a Markov time, which plays a very important role in the subsequent theory.

Definition 3. A random variable τ = τ(ω) with values in the set {0, 1, ..., +∞} is a Markov time (with respect to (F_n)) (or a random variable "independent of the future") if, for each n ≥ 0,

{τ = n} ∈ F_n.   (4)

When P(τ < ∞) = 1, a Markov time τ is called a stopping time.

Let X = (X_n, F_n) be a stochastic sequence and let τ be a Markov time (with respect to (F_n)). We write

X_τ = Σ_{n=0}^{∞} X_n I_{{τ=n}}(ω)

(hence X_τ = 0 on the set {ω: τ = ∞}). Then for every B ∈ 𝓑(R),

{ω: X_τ ∈ B} = ⋃_{n=0}^{∞} {X_n ∈ B, τ = n} ∈ F,

and consequently X_τ is a random variable.

EXAMPLE 6. Let X = (X_n, F_n) be a stochastic sequence and let B ∈ 𝓑(R). Then the time of first hitting the set B, that is,

τ_B = inf{n ≥ 0: X_n ∈ B}

(with τ_B = +∞ if {·} = ∅), is a Markov time, since

{τ_B = n} = {X_0 ∉ B, ..., X_{n−1} ∉ B, X_n ∈ B} ∈ F_n

for every n ≥ 0.

EXAMPLE 7. Let X = (X_n, F_n) be a martingale (or submartingale) and τ a Markov time (with respect to (F_n)). Then the "stopped" process X^τ = (X_{τ∧n}, F_n) is also a martingale (or submartingale). In fact, the equation

X_{n∧τ} = Σ_{m=0}^{n−1} X_m I_{{τ=m}} + X_n I_{{τ≥n}}

implies that the variables X_{n∧τ} are F_n-measurable, are integrable, and satisfy

X_{(n+1)∧τ} − X_{n∧τ} = I_{{τ>n}}(X_{n+1} − X_n),

whence

E[X_{(n+1)∧τ} − X_{n∧τ} | F_n] = I_{{τ>n}} E[X_{n+1} − X_n | F_n] = 0  (or ≥ 0).

Every system (F_n) and Markov time τ corresponding to it generate a collection of sets

F_τ = {A ∈ F: A ∩ {τ = n} ∈ F_n for all n ≥ 0}.

It is clear that Ω ∈ F_τ and that F_τ is closed under countable unions. Moreover, if A ∈ F_τ, then Ā ∩ {τ = n} = {τ = n}\(A ∩ {τ = n}) ∈ F_n, and therefore Ā ∈ F_τ. Hence it follows that F_τ is a σ-algebra. If we think of F_n as a collection of events observed up to time n (inclusive), then F_τ can be thought of as a collection of events observed at the "random" time τ. It is easy to show (Problem 3) that the random variables τ and X_τ are F_τ-measurable.
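Examples 6 and 7 can be illustrated with a short sketch: τ is the first time a simulated fair walk hits the set B, and the stopped path freezes at the hitting level. The set B = {3}, the walk length, and the seed are arbitrary illustrative choices.

```python
import random

# Illustration of Example 6 (first hitting time) and Example 7 (stopped
# process): tau_B is a Markov time, and X_{n ^ tau} is constant after tau.
random.seed(7)
S, path = 0, [0]
for _ in range(100):
    S += random.choice((-1, 1))
    path.append(S)

B = 3
hits = [n for n, x in enumerate(path) if x == B]
if hits:
    tau = hits[0]                                        # tau_B
    stopped = [path[min(n, tau)] for n in range(len(path))]
    assert all(x == B for x in stopped[tau:])            # frozen after tau
else:
    tau = float("inf")                                   # tau_B = +infinity if never hit
print(tau)
```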

4. Definition 4. A stochastic sequence X = (X_n, F_n) is a local martingale (or submartingale) if there is a (localizing) sequence (τ_k)_{k≥1} of Markov times such that τ_k ≤ τ_{k+1} (P-a.s.), τ_k ↑ ∞ (P-a.s.) as k → ∞, and every "stopped" sequence X^{τ_k} = (X_{τ_k∧n}·I_{{τ_k>0}}, F_n) is a martingale (or submartingale).

In Theorem 1 below, we show that in fact the class of local martingales coincides with the class of generalized martingales. Moreover, every local martingale can be obtained by a "martingale transformation" from a martingale and a predictable sequence.


Definition 5. Let Y = (Y_n, F_n) be a stochastic sequence and let V = (V_n, F_{n−1}) be a predictable sequence (F_{−1} = F_0). The stochastic sequence V·Y = ((V·Y)_n, F_n) with

(V·Y)_n = V_0 Y_0 + Σ_{i=1}^{n} V_i ΔY_i,   (5)

where ΔY_i = Y_i − Y_{i−1}, is called the transform of Y by V. If, in addition, Y is a martingale, we say that V·Y is a martingale transform.

Theorem 1. Let X = (X_n, F_n)_{n≥0} be a stochastic sequence with X_0 = 0 (P-a.s.). The following conditions are equivalent:

(a) X is a local martingale;
(b) X is a generalized martingale;
(c) X is a martingale transform, i.e. there are a predictable sequence V = (V_n, F_{n−1}) with V_0 = 0 and a martingale Y = (Y_n, F_n) with Y_0 = 0 such that X = V·Y.

PROOF. (a) ⇒ (b). Let X be a local martingale and let (τ_k) be a localizing sequence of Markov times for X. Then for every m ≥ 0

E|X_{m∧τ_k} I_{{τ_k>0}}| < ∞,   (6)

and therefore

E|X_{n+1}| I_{{τ_k>n}} < ∞.   (7)

The random variable I_{{τ_k>n}} is F_n-measurable. Hence it follows from (7) that

E[|X_{n+1}| I_{{τ_k>n}} | F_n] = I_{{τ_k>n}} E[|X_{n+1}| | F_n] < ∞  (P-a.s.).

Here I_{{τ_k>n}} → 1 (P-a.s.) as k → ∞, and therefore

E[|X_{n+1}| | F_n] < ∞  (P-a.s.).   (8)

Under this condition E[X_{n+1} | F_n] is defined, and it remains only to show that E[X_{n+1} | F_n] = X_n (P-a.s.). To do this, we need to show that

∫_A X_{n+1} dP = ∫_A X_n dP

for A ∈ F_n. By Problem 7, §7, Chapter II, we have E[|X_{n+1}| | F_n] < ∞ (P-a.s.) if and only if the measure ∫_A |X_{n+1}| dP, A ∈ F_n, is σ-finite. Therefore if we show that the measure ∫_A |X_n| dP, A ∈ F_n, is also σ-finite, then in order to establish the equation in which we are interested it will be sufficient to establish it only for those sets B ∈ F_n for which ∫_B |X_{n+1}| dP < ∞.


Since X^{τ_k} is a martingale, |X^{τ_k}| = (|X_{τ_k∧n}|·I_{{τ_k>0}}, F_n) is a submartingale and hence, if we recall that {τ_k > n} ∈ F_n, we obtain

∫_{B∩{τ_k>n}} |X_n| dP = ∫_{B∩{τ_k>n}} |X_{n∧τ_k} I_{{τ_k>0}}| dP ≤ ∫_{B∩{τ_k>n}} |X_{(n+1)∧τ_k} I_{{τ_k>0}}| dP = ∫_{B∩{τ_k>n}} |X_{n+1}| dP.

Letting k → ∞, we find that

∫_B |X_n| dP ≤ ∫_B |X_{n+1}| dP.

It follows that if B ∈ F_n is such that ∫_B |X_{n+1}| dP < ∞, then (by Lebesgue's dominated convergence theorem) we can take limits as k → ∞ in the martingale equation

∫_{B∩{τ_k>n}} X_n dP = ∫_{B∩{τ_k>n}} X_{n+1} dP.

Thus

∫_B X_n dP = ∫_B X_{n+1} dP

for B ∈ F_n such that ∫_B |X_{n+1}| dP < ∞. Hence it follows that this equation is valid for all B ∈ F_n, and implies E(X_{n+1} | F_n) = X_n (P-a.s.).

(b) ⇒ (c). Let

V_0 = 0 and

V_n = E[|ΔX_n| | F_{n−1}],  n ≥ 1.

Put W_n = V_n^⊕ (the generalized inverse: W_n = V_n^{−1} if V_n > 0, and W_n = 0 otherwise), Y_0 = 0, and

Y_n = Σ_{i=1}^{n} W_i ΔX_i,  n ≥ 1.

It is clear that E[|ΔY_n| | F_{n−1}] ≤ 1 and

E[ΔY_n | F_{n−1}] = 0.

Consequently Y = (Y_n, F_n) is a martingale. Moreover, X_0 = V_0·Y_0 = 0 and Δ(V·Y)_n = ΔX_n. Therefore

X = V·Y.

(c) ⇒ (a). Let X = V·Y, where V is a predictable sequence, Y is a martingale, and V_0 = Y_0 = 0. Put

τ_k = inf{n ≥ 0: |V_{n+1}| > k},

and suppose that τ_k = ∞ if the set {·} = ∅. Since V_{n+1} is F_n-measurable, the variables τ_k are Markov times for every k ≥ 1.

Consider the "stopped" sequence X^{τ_k} = ((V·Y)_{n∧τ_k} I_{{τ_k>0}}, F_n). On the set {τ_k > 0}, the inequality |V_{n∧τ_k}| ≤ k is in effect. Hence it follows that E|(V·Y)_{n∧τ_k} I_{{τ_k>0}}| < ∞ for every n ≥ 1. In addition, for n ≥ 1,

E{[(V·Y)_{(n+1)∧τ_k} − (V·Y)_{n∧τ_k}] I_{{τ_k>0}} | F_n} = I_{{τ_k>0}} V_{(n+1)∧τ_k} E{Y_{(n+1)∧τ_k} − Y_{n∧τ_k} | F_n} = 0,

since (see Example 7) E{Y_{(n+1)∧τ_k} − Y_{n∧τ_k} | F_n} = 0. Thus for every k ≥ 1 the stochastic sequences X^{τ_k} are martingales, τ_k ↑ ∞ (P-a.s.), and consequently X is a local martingale. This completes the proof of the theorem.

5. EXAMPLE 8. Let (η_n)_{n≥1} be a sequence of independent identically distributed Bernoulli random variables with P(η_n = 1) = p, P(η_n = −1) = q, p + q = 1. We interpret the event {η_n = 1} as success (gain) and {η_n = −1} as failure (loss) of a player at the nth turn. Let us suppose that the player's stake at the nth turn is V_n. Then the player's total gain through the nth turn is

X_n = Σ_{i=1}^{n} V_i η_i = X_{n−1} + V_n η_n,  X_0 = 0.

It is quite natural to suppose that the amount V_n at the nth turn may depend on the results of the preceding turns, i.e. on V_1, ..., V_{n−1} and on η_1, ..., η_{n−1}. In other words, if we put F_0 = {∅, Ω} and F_n = σ{ω: η_1, ..., η_n}, then V_n is an F_{n−1}-measurable random variable, i.e. the sequence V = (V_n, F_{n−1}) that determines the player's "strategy" is predictable. Putting Y_n = η_1 + ··· + η_n, we find that

X_n = Σ_{i=1}^{n} V_i ΔY_i,

i.e. the sequence X = (X_n, F_n) with X_0 = 0 is the transform of Y by V. From the player's point of view, the game in question is fair (or favorable, or unfavorable) if, at every stage, the conditional expectation

E(X_{n+1} − X_n | F_n) = 0  (or ≥ 0, or ≤ 0).

Moreover, it is clear that the game is

fair if p = q = 1/2,
favorable if p > q,
unfavorable if p < q.

Since X = (X_n, F_n) is a

martingale if p = q = 1/2,
submartingale if p > q,
supermartingale if p < q,


we can say that the assumption that the game is fair (or favorable, or unfavorable) corresponds to the assumption that the sequence X is a martingale (or submartingale, or supermartingale).

Let us now consider the special class of strategies V = (V_n, F_{n−1})_{n≥1} with V_1 = 1 and, for n > 1,

V_n = 2^{n−1} if η_1 = −1, ..., η_{n−1} = −1;  V_n = 0 otherwise.   (9)

In such a strategy, a player, having started with a stake V_1 = 1, doubles the stake after a loss and drops out of the game immediately after a win. If η_1 = −1, ..., η_n = −1, the total loss to the player after n turns will be

Σ_{i=1}^{n} 2^{i−1} = 2^n − 1.

Therefore if also η_{n+1} = 1, we have

X_{n+1} = X_n + V_{n+1} = −(2^n − 1) + 2^n = 1.

Let τ = inf{n ≥ 1: X_n = 1}. If p = q = 1/2, i.e. the game in question is fair, then P(τ = n) = (1/2)^n, P(τ < ∞) = 1, P(X_τ = 1) = 1, and EX_τ = 1. Therefore even for a fair game, by applying the strategy (9), a player can in a finite time (with probability unity) complete the game "successfully," increasing his capital by one unit (EX_τ = 1 > X_0 = 0). In gambling practice, this system (doubling the stakes after a loss and dropping out of the game after a win) is called a martingale. This is the origin of the mathematical term "martingale."
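The doubling strategy is easily simulated. In the sketch below (the trial count and seed are arbitrary choices), every simulated game ends with a gain of exactly one unit, while the stopping time τ is geometric with mean Eτ = Σ n(1/2)^n = 2.

```python
import random

# Simulation of the doubling ("martingale") strategy (9) in the fair case
# p = q = 1/2: the gain at the stopping time tau is +1 on every path.
random.seed(4)
taus = []
for _ in range(10_000):
    n, gain = 0, 0
    while True:
        n += 1
        stake = 2 ** (n - 1)           # the stake V_n of strategy (9)
        if random.random() < 0.5:      # win: eta_n = +1, drop out
            gain += stake
            break
        gain -= stake                  # loss: eta_n = -1, double and continue
    assert gain == 1                   # X_tau = 1 on every path
    taus.append(n)

avg_tau = sum(taus) / len(taus)
print(round(avg_tau, 2))   # close to E tau = 2
```

Note how the "success" is paid for: the stakes 2^{n−1} are unbounded, which is exactly the physical unrealizability discussed in the Remark that follows.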

Remark. When p = q = 1/2, the sequence X = (X_n, F_n) with X_0 = 0 is a martingale, and therefore

EX_n = EX_0 = 0  for every n ≥ 1.

We may therefore expect that this equation is preserved if the instant n is replaced by a random instant τ. It will appear later (Theorem 1, §2) that EX_τ = EX_0 in "typical" situations. Violations of this equation (as in the game discussed above) arise in what we may describe as physically unrealizable situations, when either τ or |X_n| takes values that are much too large. (Note that the game discussed above would be physically unrealizable, since it supposes an unbounded time for playing and an unbounded initial capital for the player.)

6. Definition 6. A stochastic sequence ξ = (ξ_n, F_n) is a martingale-difference if E|ξ_n| < ∞ for all n ≥ 0 and

E(ξ_{n+1} | F_n) = 0  (P-a.s.).   (10)


The connection between martingales and martingale-differences is clear from Definitions 1 and 6. Thus if X = (X_n, F_n) is a martingale, then ξ = (ξ_n, F_n) with ξ_0 = X_0 and ξ_n = ΔX_n, n ≥ 1, is a martingale-difference. In turn, if ξ = (ξ_n, F_n) is a martingale-difference, then X = (X_n, F_n) with X_n = ξ_0 + ··· + ξ_n is a martingale. In agreement with this terminology, every sequence ξ = (ξ_n)_{n≥0} of independent integrable random variables with Eξ_n = 0 is a martingale-difference (with F_n = σ{ω: ξ_0, ξ_1, ..., ξ_n}).

7. The following theorem elucidates the structure of submartingales (or supermartingales).

Theorem 2 (Doob). Let X = (X_n, F_n) be a submartingale. Then there are a martingale m = (m_n, F_n) and a predictable increasing sequence A = (A_n, F_{n−1}) such that, for every n ≥ 0, Doob's decomposition

X_n = m_n + A_n  (P-a.s.)   (11)

holds. A decomposition of this kind is unique.

PROOF. Let us put m_0 = X_0, A_0 = 0 and

m_n = m_0 + Σ_{j=0}^{n−1} [X_{j+1} − E(X_{j+1} | F_j)],   (12)

A_n = Σ_{j=0}^{n−1} [E(X_{j+1} | F_j) − X_j].   (13)

It is evident that m and A, defined in this way, have the required properties. In addition, let X_n = m'_n + A'_n, where m' = (m'_n, F_n) is a martingale and A' = (A'_n, F_{n−1}) is a predictable increasing sequence. Then

A'_{n+1} − A'_n = (A_{n+1} − A_n) + (m_{n+1} − m_n) − (m'_{n+1} − m'_n),

and if we take conditional expectations on both sides, we find that (P-a.s.) A'_{n+1} − A'_n = A_{n+1} − A_n. But A_0 = A'_0 = 0, and therefore A_n = A'_n and m_n = m'_n (P-a.s.) for all n ≥ 0. This completes the proof of the theorem.

It follows from (11) that the sequence A = (A_n, F_{n−1}) compensates X = (X_n, F_n) so that it becomes a martingale. This observation is justified by the following definition.

Definition 7. The predictable increasing sequence A = (A_n, F_{n−1}) appearing in the Doob decomposition (11) is called a compensator (of the submartingale X).

The Doob decomposition plays a key role in the study of square integrable martingales M = (M_n, F_n), i.e. martingales for which EM_n² < ∞, n ≥ 0; this


depends on the observation that the stochastic sequence M² = (M_n², F_n) is a submartingale. According to Theorem 2 there are a martingale m = (m_n, F_n) and a predictable increasing sequence ⟨M⟩ = (⟨M⟩_n, F_{n−1}) such that

M_n² = m_n + ⟨M⟩_n.   (14)

The sequence ⟨M⟩ is called the quadratic variation of M and, in many respects, determines its structure and properties. It follows from (13) that

⟨M⟩_n = Σ_{j=1}^{n} E[(ΔM_j)² | F_{j−1}]   (15)

and, for all l ≤ k,

E[(M_k − M_l)² | F_l] = E[M_k² − M_l² | F_l] = E[⟨M⟩_k − ⟨M⟩_l | F_l].   (16)

In particular, if M_0 = 0 (P-a.s.), then

EM_k² = E⟨M⟩_k.   (17)

It is useful to observe that if M_0 = 0 and M_n = ξ_1 + ··· + ξ_n, where (ξ_n) is a sequence of independent random variables with Eξ_i = 0 and Eξ_i² < ∞, then the quadratic variation

⟨M⟩_n = Σ_{i=1}^{n} Eξ_i²   (18)

is not random, and indeed coincides with the variance VM_n.

If X = (X_n, F_n) and Y = (Y_n, F_n) are square integrable martingales, we put

⟨X, Y⟩_n = (1/4)[⟨X + Y⟩_n − ⟨X − Y⟩_n].   (19)

It is easily verified that (X_n Y_n − ⟨X, Y⟩_n, F_n) is a martingale and therefore, for l ≤ k,

E[(X_k − X_l)(Y_k − Y_l) | F_l] = E[⟨X, Y⟩_k − ⟨X, Y⟩_l | F_l].   (20)

In the case when X_n = ξ_1 + ··· + ξ_n and Y_n = η_1 + ··· + η_n, where (ξ_n) and (η_n) are sequences of independent random variables with Eξ_i = Eη_i = 0 and Eξ_i² < ∞, Eη_i² < ∞, the variable ⟨X, Y⟩_n is given by

⟨X, Y⟩_n = Σ_{i=1}^{n} cov(ξ_i, η_i).

The sequence ⟨X, Y⟩ = (⟨X, Y⟩_n, F_{n−1}) is often called the mutual variation of the (square integrable) martingales X and Y.
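Both the Doob decomposition and the quadratic variation can be seen concretely on a fair ±1 random walk M_n = η_1 + ··· + η_n: here E((ΔM_j)² | F_{j−1}) = 1, so by (15) ⟨M⟩_n = n, the martingale part of (14) is m_n = M_n² − n, and (17) gives EM_n² = n. A minimal numerical sketch (walk length, trial count, and seed are arbitrary choices):

```python
import random

# Doob decomposition (14) and quadratic variation (15)-(18) for a fair
# +-1 random walk: <M>_n = n and m_n = M_n^2 - n.
random.seed(8)
M, bracket, mart = 0, 0, 0
for _ in range(100):
    M += random.choice((-1, 1))
    bracket += 1                       # increment of <M>_n by (15)
    mart = M * M - bracket             # martingale part of (14)
    assert M * M == mart + bracket     # the decomposition holds on the path

# check (17), E M_n^2 = E <M>_n = n, by simulation at n = 100
trials, acc = 50_000, 0
for _ in range(trials):
    S = sum(random.choice((-1, 1)) for _ in range(100))
    acc += S * S
print(round(acc / trials))   # close to 100
```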

PROBLEMS

1. Show that (2) and (3) are equivalent. 2. Let u and 1: be Markov times. Show that 1: times; and if P(u:;;; 1:) = 1, then s;,. s; s;,.

+ u, 1:

1\

u, and

1:

v u are also Markov

456

VII. Sequences of Random Variables That Form Martingales

3. Show that rand X, are ff,-measurable. 4. Let Y = (Y,, ff.) be a martingale (or submartingale), let V = (V,., §,;_ 1 ) be a predictable sequence, and let (V. Y). be integrable random variables, n ;::: 0. Show that V. Yis a martingale (or submartingale). 5. Let 3"1 ~ 3"2 ~ · • · be a nondecreasing family of a-algebras and ~ an integrable random variable. Show that (X.).;, 1 with X. = E(~ I§,;) is a martingale. 6. Let!'§ 1 2 !'§ 2 2 · · · be a nonincreasing family of a-algebras and let ~be an integrable random variable. Show that (X.).;, 1 with x. = E(~ I"§.) is a bounded martingale, i.e. E(X.IX.+ 1 , x.+ 2 ,

••• )

= x.+l

(P-a.s.)

for every n ;::: 1. 7. Let ~ 1 , ~ 2 , ~ 3 , .•• be independent random variables, P(~; = 0) = P(~; = 2) =!and x. = TI7= 1 ~;·Show that there do not exist an integrable random variable~ and a nondecreasing family(§,;) of a-algebras such that X" = E( ~I§,;). This example shows that not every martingale (X.).;, 1 can be represented in the form (E(~I§.;)).;, 1 ; compare Example 3, §11, Chapter 1.) 8. Let X= (X.,§,;), n;::: 0, be a square integrable martingale with EX.= 0. Show that it has orthogonal increments, i.e. m# n, where L1Xk = Xk- Xk-! fork :2: 1 and £1X 0 = X 0 . (Consequently the square integrable martingales occupy a position in the class of stochastic sequences with zero mean and finite second moments, intermediate between sequences with independent increments and sequences with orthogonal increments.)

§2. Preservation of the Martingale Property Under Time Change at a Random Time 1. If X = (Xn, ~)n<:O is a martingale, we have

EXn = EXo

(1)

for every n ;:::: 1. Is this property preserved if the time n is replaced by a Markov timer? Example 8 of the preceding section shows that, in general, the answer is "no": there exist a martingale X and a Markov time r (finite with probability 1) such that (2)

The following basic theorem describes the "typical" situation, in which, in particular, EX,= EX 0 .

§2. Preservation of the Martingale Property Under Time Change at a Random Time

457

Theorem 1 (Doob). Let X= (Xn, ~) be a martingale (or submartingale), and r 1 and r 2 , stopping timesfor which i = 1, 2, lim n- oo

Then

r

J > n}

IXnldP

= 0,

(3)

i = 1, 2.

(4)

{ti

(5)

(6) (Here and in the formulas below, read the upper symbol for martingales and the lower symbol for submartingales.) PROOF.

It is sufficient to show that, for every A E ~., (7)

For this, in turn, it is sufficient to show that, for every n 2:: 0,

or, what amounts to the same thing, (8)

where B =An {r1 We have (

JBn{<2~n)

= (,;;)

n}

E ~.

X n dP = (

( JBn{n,;;<2Sm}

whence

=

X n dP

X" 2 dP

+ ( JBn{<2>n}

JBn{<2=n}

+ (

JBn{<2>m}

Xm dP,

X n dP = (,;)

( JBn{<2=n}

X ndP

458

VII. Sequences of Random Variables That Form Martingales

and since Xm = 2X!.- IXml, we have, by (4), ( JBI"I{t2 ;;:n}

X,2 dP

= lim[ (

(;;,)

m-->oo

XndP- (

JBI"\{nSt2}

XmdP]

JB1"1{m<<2l

which establishes (8), and hence (5). Finally, (6) follows from (5). This completes the proof of the theorem.

Corollary 1. If there is a constant N such that P(• 1 :::;; N) = 1and P(• 2 :s; N) = 1, then (3) and (4) are satisfied. Hence if, in addition, P(< 1 :::;; <2 ) = 1 and X is a martingale, then (9)

Corollary 2. If the random variables {Xn} are uniformly integrable (in particular, ifiXnl :::;; C < oo, n ~ 0), then (3) and (4) are satisfied. In fact, P(<; > n)- 0, n- oo, and hence (4) follows from Lemma 2, §6, Chapter II. In addition, since the family {Xn} is uniformly integrable, we have (see Il.6.(16)) (10) If • is a stopping time and X is a submartingale, then by Corollary 1, applied

to the bounded time

•N = •

1\

N,

EX 0

:::;;

EX,N.

Therefore

EIX,NI = 2EX~- EX,N :s; 2EX~- EX 0 •

(11)

The sequence x+ = (X:,~) is a submartingale (Example 5, §1) and therefore EX~ =

Ni L

j=O

+

i

{
Xf dP +

f. x: dP :::;; LNi {t>N}

Xt dP =EXt

j=O

{tN=J1

:s; EIXNI:::;; supEIXNI· N

~>m

From this and (11) we have E IX,NI:::;; 3 supE IXNI, N

and hence by Fatou's lemma E IX,I :::;; 3 sup E IXNI· N

x: dP

§2. Preservation of the Martingale Property Under Time Change at a Random Time

Therefore if we take i = 1, 2.

T

=

Ti,

i

459

= 1, 2, and use (10), we obtain E I Xr, I < oo,

Remark. In Example 8 of the preceding section,

i

IXnl dP

(2n- 1)P{T > n}

=

(2n- 1). rn--+ 1,

=

n --+ oo,

{t>n}

and consequently (4) is violated (for

T 2 = T).

2. The following proposition, which we shall deduce from Theorem 1, is often useful in applications.

Theorem 2. Let X= (Xn) be a martingale (or submartingale) and T a stopping = O"{w: X 0 , ... , Xn). Suppose that time (with respect to (ff;), where

g;;

ET < 00, and that for some n

~

0 and some constant C

E{IXn+1- Xnll ff;}:::; C ({T

~ n}; P-a.s.).

Then EIXrl
and (12)

We first verify that hypotheses (3) and (4) of Theorem 1 are satisfied with Tz

=

T.

Let j ~ 1.

Then IXrl:::;

LJ=o lj and

L Ln 00

n=O j=O

The set {T ~ j}

f

{t<>:j}

=

i

{t=n}

lj dP =

0\ {T < j}

ljdP =

f

{t<>:j}

E

L L 00

00

j=O n=j

i

{t=n}

lj dP =

L 00

j=O

i

{t<'=J1

ffJ_ 1, j ~ 1. Therefore

E[l]IX0 ,

... ,

Xi_ 1] dP:::; CP{T

~ j}

lj dP.

460 for j

VII. Sequences of Random Variables That Form Martingales

~

1; and

EIX,I

~ E(t }j) ~ EIXol + Ci~t P{r ~j} = EIXol + CEr
Moreover, if r > n, then n

t

j=O

j=O

I lJ ~ I lJ,

and therefore

r

J{t>n}

Hence since (by (13)) E I.i= 0 convergence theorem yields

lim (

n-+oo J{t>n}

IXnldP

}j

~

r

±

ljdP.

J{t>n} j=O

< oo and {r > n}

I X" I dP

~

lim (

(

! 0, n --+ oo, the dominated

I }j) dP = 0.

n-+oo J{t>n} j=O

Hence the hypotheses of Theorem 1 are satisfied, and (12) follows as required. This completes the proof of the theorem. 3. Here we present some applications of the preceding theorems.

Theorem 3 (Wald's Identities). Let ~b ~ 2 , .•• be independent identically distributed random variables with EI~d < oo and r a stopping time (with respect to§"~), where§"~= a{ro: ~b •.. , ~n}, r ~ 1), and Er < oo. Then If also

(14)

Ea < oo then E{(~ 1

+ · · · + ~.)-

rE~d 2 = V~ 1 • Er.

PROOF. It is clear that X= (Xn, F~)n>t with Xn = (~ 1 is a martingale with

+ ··· + ~n)-

(15) nE~1

E[IXn+t- XniiXt, ... , X"]= E[l~n+t- E~tll~t• ... , ~nJ = El~n+t- E~tl ~ 2EI~tl
Therefore EX,= EX 0 = 0, by Theorem 2, and (14) is established. Similar considerations applied to the martingale Y = (Y,, §"~) with Y, = nV~ 1 lead to a proof of (15).

x; -

CoroUary. Let~ b ~ 2 •.•• be independent identically distributed random variables with

§2. Preservation of the Martingale Property Under Time Change at a Random Time

461

and r = inf{n;;::: 1:Sn = 1}. Then P{r
Y,

e'"s"( cp(to))- ".

=

Then Y = ( Y,, $'~)" > 1 is a martingale with EY, = 1 and, on the set {r ;;::: n},

E{l Y,+ 1

-

Y,ll Y1, ... , Y,}

Y,E{Ie;·;;:;-

=

111~1• .. ·' ~n}

= Y,·E{je 10 ~ 1
11}

:$;

B < oo,

where B is a constant. Therefore Theorem 2 is applicable, and (16) follows since EY1 = 1. This completes the proof.

Example 1. This example illustrates the use of the preceding theorems to find the probabilities of ruin and the mean duration of play in games (see §9, Chapter I). Let ξ₁, ξ₂, ... be a sequence of independent Bernoulli random variables with P(ξᵢ = 1) = p, P(ξᵢ = −1) = q, p + q = 1, S_n = ξ₁ + ··· + ξ_n, and

τ = inf{n ≥ 1: S_n = B or A},   (17)

where (−A) and B are positive integers. It follows from (1.9.20) that P(τ < ∞) = 1 and Eτ < ∞. Then if α = P(S_τ = A) and β = P(S_τ = B), we have α + β = 1. If p = q = ½ we obtain, from (14),

0 = ES_τ = αA + βB,

whence

α = B/(B + |A|),  β = |A|/(B + |A|).

Applying (15), we obtain

Eτ = ES_τ² = αA² + βB² = |AB|.

VII. Sequences of Random Variables That Form Martingales

However, if p ≠ q we find, by considering the martingale ((q/p)^{S_n})_{n≥1}, that

E(q/p)^{S_τ} = E(q/p)^{S₁} = 1,

and therefore

α(q/p)^A + β(q/p)^B = 1.

Together with the equation α + β = 1 this yields

α = [(q/p)^B − 1]/[(q/p)^B − (q/p)^A],  β = [1 − (q/p)^A]/[(q/p)^B − (q/p)^A].   (18)

Finally, since ES_τ = (p − q)Eτ, we find

Eτ = ES_τ/(p − q) = (αA + βB)/(p − q),
where α and β are defined by (18).

Example 2. In the example just considered, let p = q = ½. Let us show that for every λ with 0 < λ < π/(B + |A|) and the time τ defined in (17),

E(cos λ)^{−τ} = cos(λ(B + A)/2) / cos(λ(B + |A|)/2).   (19)

For this purpose we consider the martingale X = (X_n, F_n^ξ)_{n≥0} with

X_n = (cos λ)^{−n} cos λ(S_n − (B + A)/2)   (20)

and S₀ = 0. It is clear that

EX_n = EX₀ = cos(λ(B + A)/2).   (21)

Let us show that the family {X_{n∧τ}} is uniformly integrable. For this purpose we observe that, by Corollary 1 to Theorem 1, for 0 < λ < π/(B + |A|),

cos(λ(B + A)/2) = EX_{n∧τ} = E[(cos λ)^{−(n∧τ)} cos λ(S_{n∧τ} − (B + A)/2)] ≥ E(cos λ)^{−(n∧τ)} · cos(λ(B + |A|)/2),

since |S_{n∧τ} − (B + A)/2| ≤ (B + |A|)/2 and λ(B + |A|)/2 < π/2. Therefore, by (21),

E(cos λ)^{−(n∧τ)} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2),

and consequently, by Fatou's lemma,

E(cos λ)^{−τ} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2).   (22)

Consequently, by (20),

|X_{n∧τ}| ≤ (cos λ)^{−τ}.

With (22), this establishes the uniform integrability of the family {X_{n∧τ}}. Then, by Corollary 2 to Theorem 1,

cos(λ(B + A)/2) = EX₀ = EX_τ = E[(cos λ)^{−τ} cos λ(S_τ − (B + A)/2)] = E(cos λ)^{−τ} · cos(λ(B + |A|)/2),

from which the required equality (19) follows.
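The closed forms of Example 1 can be compared with a direct simulation of the game. The helper names below are ours, written for this illustration; the chosen parameters (p = 0.55, A = −4, B = 3) are arbitrary.

```python
import random

def ruin_formulas(p, A, B):
    """alpha = P(S_tau = A), beta = P(S_tau = B) and E tau via (18),
    for the asymmetric case p != q."""
    q = 1.0 - p
    rho = q / p
    alpha = (rho ** B - 1) / (rho ** B - rho ** A)
    beta = (1 - rho ** A) / (rho ** B - rho ** A)
    e_tau = (alpha * A + beta * B) / (p - q)
    return alpha, beta, e_tau

def ruin_simulation(p, A, B, n_sims=20000, seed=3):
    """Empirical P(S_tau = A) and E tau for the stopped +/-1 walk."""
    rng = random.Random(seed)
    hits_A, tau_total = 0, 0
    for _ in range(n_sims):
        s, n = 0, 0
        while A < s < B:
            s += 1 if rng.random() < p else -1
            n += 1
        tau_total += n
        if s == A:
            hits_A += 1
    return hits_A / n_sims, tau_total / n_sims

alpha, beta, e_tau = ruin_formulas(0.55, -4, 3)
alpha_hat, e_tau_hat = ruin_simulation(0.55, -4, 3)
```

For these parameters (18) gives α ≈ 0.2686, β ≈ 0.7314 and Eτ ≈ 11.2.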

4. PROBLEMS

1. Show that Theorem 1 remains valid for submartingales if (4) is replaced by

lim_{n→∞} ∫_{{τᵢ > n}} X_n⁺ dP = 0,  i = 1, 2.

2. Let X = (X_n, F_n)_{n≥0} be a square-integrable martingale, τ a stopping time, and

lim_{n→∞} ∫_{{τ > n}} X_n² dP = 0,  lim_{n→∞} ∫_{{τ > n}} |X_n| dP = 0.

Show that then

EX_τ² = E Σ_{j=0}^{τ} (ΔX_j)²,

where ΔX₀ = X₀, ΔX_j = X_j − X_{j−1}, j ≥ 1.

3. Show that

E|X_τ| ≤ lim E|X_n|

for every martingale or nonnegative submartingale X = (X_n, F_n)_{n≥0} and every stopping time τ.

4. Let X = (X_n, F_n)_{n≥0} be a submartingale such that X_n ≥ E(ξ|F_n) (P-a.s.), n ≥ 0, where E|ξ| < ∞. Show that if τ₁ and τ₂ are stopping times with P(τ₁ ≤ τ₂) = 1, then

X_{τ₁} ≤ E(X_{τ₂} | F_{τ₁})  (P-a.s.).

5. Let ξ₁, ξ₂, ... be a sequence of independent random variables with P(ξᵢ = 1) = P(ξᵢ = −1) = ½, let a and b be positive numbers with b > a,

X_n = a Σ_{k=1}^{n} I(ξ_k = +1) − b Σ_{k=1}^{n} I(ξ_k = −1),

and τ = inf{n ≥ 1: X_n ≤ −r}, r > 0. Show that Ee^{λτ} < ∞ for λ ≤ α₀ and Ee^{λτ} = ∞ for λ > α₀, where

α₀ = (a/(a+b)) ln(2a/(a+b)) + (b/(a+b)) ln(2b/(a+b)).

6. Let ξ₁, ξ₂, ... be a sequence of independent random variables with Eξᵢ = 0, Vξᵢ = σᵢ², S_n = ξ₁ + ··· + ξ_n, F_n^ξ = σ{ω: ξ₁, ..., ξ_n}. Prove the following generalizations of Wald's identities (14) and (15): if E Σ_{j=1}^{τ} E|ξ_j| < ∞, then ES_τ = 0; if E Σ_{j=1}^{τ} Eξ_j² < ∞, then

ES_τ² = E Σ_{j=1}^{τ} ξ_j² = E Σ_{j=1}^{τ} σ_j².   (23)

§3. Fundamental Inequalities

1. Let X = (X_n, F_n)_{n≥0} be a stochastic sequence, p > 0,

X_n* = max_{0≤j≤n} |X_j|,  ‖X_n‖_p = (E|X_n|^p)^{1/p}.

Theorem 1 (Doob). Let X = (X_n, F_n) be a nonnegative submartingale. Then, for every ε > 0 and all n ≥ 0,

P{X_n* ≥ ε} ≤ (1/ε) ∫_{{X_n* ≥ ε}} X_n dP ≤ EX_n/ε,   (1)

‖X_n‖_p ≤ ‖X_n*‖_p ≤ (p/(p − 1)) ‖X_n‖_p  if p > 1,   (2)

‖X_n‖₁ ≤ ‖X_n*‖₁ ≤ (e/(e − 1)){1 + ‖X_n ln⁺ X_n‖₁}  if p = 1.   (3)

PROOF. Put

τ_n = min{j ≤ n: X_j ≥ ε},

taking τ_n = n if max_{0≤j≤n} X_j < ε. Then, by (2.6),

EX_n ≥ EX_{τ_n} = ∫_{{X_n* ≥ ε}} X_{τ_n} dP + ∫_{{X_n* < ε}} X_{τ_n} dP ≥ εP{X_n* ≥ ε} + ∫_{{X_n* < ε}} X_n dP.

Therefore

εP{X_n* ≥ ε} ≤ EX_n − ∫_{{X_n* < ε}} X_n dP = ∫_{{X_n* ≥ ε}} X_n dP ≤ EX_n,

which establishes (1).

The first inequalities in (2) and (3) are evident. In proving the second inequality in (2), we first suppose that

‖X_n*‖_p < ∞,   (4)

and use the equation

Eη^r = r ∫₀^∞ t^{r−1} P(η ≥ t) dt,   (5)

which is satisfied for every nonnegative random variable η and every r > 0. Then we find from (1) and Fubini's theorem that, when p > 1,

E(X_n*)^p = p ∫₀^∞ t^{p−1} P{X_n* ≥ t} dt ≤ p ∫₀^∞ t^{p−2} (∫_Ω X_n I{X_n* ≥ t} dP) dt
= p ∫_Ω X_n [∫₀^{X_n*} t^{p−2} dt] dP = (p/(p − 1)) E[X_n (X_n*)^{p−1}].   (6)

Hence, by Hölder's inequality,

E(X_n*)^p ≤ q‖X_n‖_p · ‖(X_n*)^{p−1}‖_q = q‖X_n‖_p [E(X_n*)^p]^{1/q},   (7)

where q = p/(p − 1). If (4) is satisfied, the second inequality in (2) follows at once. If (4) is not satisfied, we may proceed as follows. In (6), we consider (X_n* ∧ L) instead of X_n*, where L is a constant. We obtain

E(X_n* ∧ L)^p ≤ qE[X_n(X_n* ∧ L)^{p−1}] ≤ q‖X_n‖_p [E(X_n* ∧ L)^p]^{1/q},

and it follows from the inequality E(X_n* ∧ L)^p ≤ L^p < ∞ that

E(X_n* ∧ L)^p ≤ q^p EX_n^p = q^p ‖X_n‖_p^p,

and therefore

E(X_n*)^p = lim_{L→∞} E(X_n* ∧ L)^p ≤ q^p ‖X_n‖_p^p.

We can now prove the second inequality in (3). Applying (1) again, we find that

EX_n* − 1 ≤ E(X_n* − 1)⁺ = ∫₀^∞ P{X_n* − 1 ≥ t} dt ≤ ∫₀^∞ [1/(1 + t)] [∫_{{X_n* ≥ 1+t}} X_n dP] dt
= E[X_n ∫₀^{X_n* − 1} dt/(1 + t)] = EX_n ln X_n*.

Since

a ln b ≤ a ln⁺ a + b e^{−1}   (8)

for all a ≥ 0 and b > 0, we have

EX_n* − 1 ≤ EX_n ln X_n* ≤ EX_n ln⁺ X_n + e^{−1} EX_n*.

If EX_n* < ∞ we then obtain (3) immediately. If EX_n* = ∞, we again introduce X_n* ∧ L in place of X_n* and proceed as above.

This completes the proof of the theorem.

Corollary 1. Let X = (X_n, F_n) be a square-integrable martingale. Then X² = (X_n², F_n) is a submartingale, and it follows from (1) that

P{max_{j≤n} |X_j| ≥ ε} ≤ EX_n²/ε².   (9)

In particular, if X_j = ξ₀ + ··· + ξ_j, where (ξ_j) is a sequence of independent random variables with Eξ_j = 0 and Eξ_j² < ∞, inequality (9) becomes Kolmogorov's inequality (§2, Chapter IV).

Corollary 2. If X = (X_n, F_n) is a square-integrable martingale, we find from (2) that

E[max_{j≤n} X_j²] ≤ 4EX_n².   (10)
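Inequalities (9) and (10) can be seen at work on a symmetric ±1 walk, for which ES_n² = n exactly. The simulation below is our illustrative sketch (names and parameters are assumptions, not the book's).

```python
import random

def doob_check(n=50, eps=10.0, n_sims=20000, seed=4):
    """For the martingale S (symmetric +/-1 walk) estimate
    P(max_j |S_j| >= eps), E max_j S_j^2 and E S_n^2, to compare with
    Kolmogorov's inequality (9) and Doob's L^2 inequality (10)."""
    rng = random.Random(seed)
    exceed = 0
    max_sq_sum = 0.0
    last_sq_sum = 0.0
    for _ in range(n_sims):
        s, max_abs = 0, 0
        for _ in range(n):
            s += 1 if rng.random() < 0.5 else -1
            max_abs = max(max_abs, abs(s))
        if max_abs >= eps:
            exceed += 1
        max_sq_sum += max_abs ** 2
        last_sq_sum += s * s
    return exceed / n_sims, max_sq_sum / n_sims, last_sq_sum / n_sims

p_exceed, e_max_sq, e_last_sq = doob_check()
# (9):  p_exceed <= e_last_sq / eps^2;  (10): e_max_sq <= 4 * e_last_sq
```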

2. Let X= (X",§,;) be a nonnegative submartingale and let

Xn = Mn +An be its Doob decomposition. Then since EM"= 0 it follows from (1) that P{Xn* ~

} ::5;-. EAn

B

B

Theorem 2 (below) will show that this inequality is valid not only for submartingales but also for the wider class of sequences that satisfy a dominance relation, in the following sense.

467

§3. Fundamental Inequalities

Definition. Let (Xn, ff.) be a nonnegative stochastic sequence and let A = (An, ff._ 1) be an increasing predictable sequence. We say that X is dominated by A if (11)

for every stopping time τ.

Theorem 2. If X = (X_n, F_n) is a nonnegative stochastic sequence dominated by the increasing predictable sequence A = (A_n, F_{n−1}), then, for every ε > 0, a > 0, and stopping time τ,

P{X_τ* ≥ ε} ≤ EA_τ/ε,   (12)

P{X_τ* ≥ ε} ≤ (1/ε) E(A_τ ∧ a) + P(A_τ ≥ a),   (13)

‖X_τ*‖_p ≤ ((2 − p)/(1 − p))^{1/p} ‖A_τ‖_p,  0 < p < 1.   (14)

PROOF. Put

σ = min{j ≤ τ ∧ n: X_j ≥ ε},

taking σ = τ ∧ n if {·} = ∅. Then

EA_τ ≥ EA_σ ≥ EX_σ ≥ ∫_{{X*_{τ∧n} ≥ ε}} X_σ dP ≥ εP{X*_{τ∧n} ≥ ε},

whence

P{X*_{τ∧n} ≥ ε} ≤ EA_τ/ε.

Inequality (12) now follows from Fatou's lemma. To prove (13) we introduce the time

γ = inf{j: A_{j+1} ≥ a},

taking γ = ∞ if {·} = ∅. Since A is predictable, γ is a stopping time, and A_{τ∧γ} ≤ A_τ ∧ a; moreover X_τ* = X*_{τ∧γ} on {A_τ < a}. Then

P{X_τ* ≥ ε} = P{X_τ* ≥ ε, A_τ < a} + P{X_τ* ≥ ε, A_τ ≥ a}
≤ P{X*_{τ∧γ} ≥ ε} + P{A_τ ≥ a} ≤ (1/ε) EA_{τ∧γ} + P{A_τ ≥ a} ≤ (1/ε) E(A_τ ∧ a) + P{A_τ ≥ a},

where we have used (12) and the inequality A_{τ∧γ} ≤ A_τ ∧ a. Finally, by (13),

‖X_τ*‖_p^p = E(X_τ*)^p = ∫₀^∞ P{(X_τ*)^p ≥ t} dt = ∫₀^∞ P{X_τ* ≥ t^{1/p}} dt
≤ ∫₀^∞ t^{−1/p} E(A_τ ∧ t^{1/p}) dt + ∫₀^∞ P{A_τ^p ≥ t} dt
= E ∫₀^{A_τ^p} dt + E ∫_{A_τ^p}^∞ A_τ t^{−1/p} dt + EA_τ^p
= EA_τ^p + (p/(1 − p)) EA_τ^p + EA_τ^p = ((2 − p)/(1 − p)) EA_τ^p.

This completes the proof of the theorem.

Remark. Suppose that, under the hypotheses of the theorem, the increasing sequence A = (A_n, F_n) is not predictable, but for some c > 0

P(sup_k |ΔA_k| ≤ c) = 1,

where ΔA_k = A_k − A_{k−1} for k ≥ 1. Then (compare (13)) we have the inequality

P(X_τ* ≥ ε) ≤ (1/ε) E(A_τ ∧ (a + c)) + P(A_τ ≥ a).

(The proof is the same as for (13), with γ = inf{j: A_{j+1} ≥ a} replaced by γ = inf{j: A_j ≥ a}, taking account of the inequality A_{τ∧γ} ≤ a + c.)

Corollary. Let the sequences Xⁿ and Aⁿ satisfy, for each n ≥ 1, the hypotheses of Theorem 2 or of the preceding Remark (with P(sup_k |ΔA_kⁿ| ≤ c) = 1), and for some sequence of stopping times {τ_n} let

A^n_{τ_n} → 0 in probability,  n → ∞.

Then

(Xⁿ)*_{τ_n} → 0 in probability,  n → ∞.
3. In this subsection we present (without proofs, but with applications) a number of significant inequalities for martingales. These generalize the inequalities of Khinchin and of Marcinkiewicz and Zygmund for sums of independent random variables. Khinchin's Inequalities. Let ~ 1 , ~ 2 , . . . be independent identically distributed Bernoulli random variables with P(~; = 1) = P(~; = -1) = 1 and let (cn)n2! 1 be a sequence of numbers.

469

§3. Fundamental Inequalities

Then for every p, 0 < p < oo, there are universal constants AP and BP (independent of(cn)) such that (15)

for every n

~

1.

The following result generalizes these inequalities (for p

~

1):

Marcinkiewicz and Zygmund's Inequalities. If ~ 1 , ~ 2 , ••• is a sequence of independent integrable random variables with E~; = 0, then for p ~ 1 there are universal constants AP and BP (independent of(~n)) such that

for every n

~

1.

In (15) and (16) the sequences X= (Xn) with Xn = 'f,j= 1 ci~i and Xn =

Li= 1 ~i are martingales. It is natural to ask whether the inequalities can be

extended to arbitrary martingales. The first result in this direction was obtained by Burkholder.

Burkholder's Inequalities. If X= (Xn, ~) is a martingale, then for every p > 1 there are universal constants AP and BP (independent of X) such that (17) for every n ~ 1, where [X]" is the quadratic variation of X",

[XJn =

n

L (£\X) 2 ,

X 0 = 0.

(18)

j= 1

The constants AP and BP can be taken to have the values

It follows from (17), by using (2), that

APIIJEX111p

~ IIX:IIp ~ B;IIJ[Xr.;IIP'

(19)

where

Burkholder's inequalities (17) hold for p > 1, whereas the MarcinkiewiczZygmund inequalities (16) also hold when p = 1. What can we say about the validity of (17) for p = 1? It turns out that a direct generalization to p = 1 is impossible, as the following example shows.

EXAMPLE. Let ξ₁, ξ₂, ... be independent Bernoulli random variables with P(ξᵢ = 1) = P(ξᵢ = −1) = ½, and let

X_n = Σ_{j=1}^{n∧τ} ξ_j,

where

τ = inf{n ≥ 1: ξ₁ + ··· + ξ_n = 1}.

The sequence X = (X_n, F_n^ξ) is a martingale with sup_n E|X_n| ≤ 2 < ∞. But

‖[X]_n^{1/2}‖₁ = E[X]_n^{1/2} = E(Σ_{j=1}^{n∧τ} 1)^{1/2} = E(n ∧ τ)^{1/2} → Eτ^{1/2} = ∞,  n → ∞.

Consequently the first inequality in (17) fails for p = 1. It turns out that when p = 1 we must generalize not (17), but (19) (which is equivalent to it when p > 1).
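For p = 2, on the other hand, the two sides of (17) actually coincide for martingales starting from 0: EX_n² = E[X]_n, because martingale increments are uncorrelated. This can be checked exactly by enumerating all paths of a small example; the code below is our sketch, using the martingale X_j = S_j² − j built from a symmetric ±1 walk.

```python
from itertools import product

def exact_second_moments(n=6):
    """Enumerate all 2^n equally likely paths of the symmetric +/-1 walk S
    and the martingale X_j = S_j^2 - j (X_0 = 0).  Return the exact values
    of E X_n^2 and E [X]_n, which must be equal."""
    e_x2 = 0.0
    e_qv = 0.0
    paths = 0
    for signs in product((-1, 1), repeat=n):
        s = 0
        x_prev = 0.0
        qv = 0.0
        for j, e in enumerate(signs, start=1):
            s += e
            x = s * s - j
            qv += (x - x_prev) ** 2   # (Delta X_j)^2
            x_prev = x
        e_x2 += x_prev ** 2
        e_qv += qv
        paths += 1
    return e_x2 / paths, e_qv / paths

e_x2, e_qv = exact_second_moments()
```

For this martingale one can also compute EX_n² = 2n(n − 1) by hand (= 60 at n = 6), which the enumeration reproduces.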

Davis's Inequality. If X = (X_n, F_n) is a martingale, there are universal constants A and B, 0 < A < B < ∞, such that

A ‖[X]_n^{1/2}‖₁ ≤ ‖X_n*‖₁ ≤ B ‖[X]_n^{1/2}‖₁,   (20)

where [X]_n = Σ_{j=1}^n (ΔX_j)².

Corollary 1. Let ξ₁, ξ₂, ... be independent identically distributed random variables, S_n = ξ₁ + ··· + ξ_n. If E|ξ₁| < ∞ and Eξ₁ = 0, then

ES_τ = 0   (21)

for every stopping time τ (with respect to (F_n^ξ)) for which Eτ < ∞.

It turns out that (21) is still valid under a weaker hypothesis than Eτ < ∞ if we impose stronger conditions on the random variables. In fact, if

E|ξ₁|^r < ∞,

where 1 < r ≤ 2, the condition Eτ^{1/r} < ∞ is a sufficient condition for ES_τ = 0. For the proof, we put τ_n = τ ∧ n, Y = sup_n |S_{τ_n}|, and let m = [t^r] (integral part of t^r) for t > 0. By Corollary 1 to Theorem 1, §2, we have ES_{τ_n} = 0. Therefore a sufficient condition for ES_τ = 0 is (by the dominated convergence theorem) that E sup_n |S_{τ_n}| < ∞, i.e. EY < ∞.

Using (1) and (17), we obtain

P(Y ≥ t) = P(τ ≥ t^r, Y ≥ t) + P(τ < t^r, Y ≥ t)
≤ P(τ ≥ t^r) + P{max_{1≤j≤m} |S_{τ_j}| ≥ t}
≤ P(τ ≥ t^r) + t^{−r} E|S_{τ_m}|^r
≤ P(τ ≥ t^r) + t^{−r} B_r E Σ_{j=1}^{τ_m} |ξ_j|^r.

Notice that (with F₀^ξ = {∅, Ω})

E Σ_{j=1}^{τ_m} |ξ_j|^r = E Σ_{j=1}^∞ I(j ≤ τ_m)|ξ_j|^r = Σ_{j=1}^∞ E E[I(j ≤ τ_m)|ξ_j|^r | F_{j−1}^ξ]
= Σ_{j=1}^∞ E I(j ≤ τ_m) E[|ξ_j|^r | F_{j−1}^ξ] = E Σ_{j=1}^{τ_m} E|ξ_j|^r = μ_r Eτ_m,

where μ_r = E|ξ₁|^r. Consequently,

P(Y ≥ t) ≤ P(τ ≥ t^r) + B_r μ_r t^{−r} Eτ_m
≤ P(τ ≥ t^r) + B_r μ_r t^{−r} [mP(τ ≥ t^r) + ∫_{{τ < t^r}} τ dP]
≤ (1 + B_r μ_r) P(τ ≥ t^r) + B_r μ_r t^{−r} ∫_{{τ < t^r}} τ dP

(since m ≤ t^r), and therefore

EY = ∫₀^∞ P(Y ≥ t) dt ≤ (1 + B_r μ_r) Eτ^{1/r} + B_r μ_r ∫₀^∞ t^{−r} [∫_{{τ < t^r}} τ dP] dt
= (1 + B_r μ_r) Eτ^{1/r} + B_r μ_r E[τ ∫_{τ^{1/r}}^∞ t^{−r} dt]
= (1 + B_r μ_r + B_r μ_r/(r − 1)) Eτ^{1/r} < ∞.

Corollary 2. Let M = (M_n) be a martingale with E|M_n|^{2r} < ∞ for some r ≥ 1 and such that (with M₀ = 0)

Σ_{n=1}^∞ E|ΔM_n|^{2r} / n^{1+r} < ∞.   (22)

Then (compare Theorem 2 of §3, Chapter IV) we have the strong law of large numbers:

M_n/n → 0  (P-a.s.),  n → ∞.   (23)

When r = 1 the proof follows the same lines as the proof of Theorem 2, §3, Chapter IV. In fact, let

m_n = Σ_{k=1}^n ΔM_k/k.

Then

M_n = Σ_{k=1}^n k Δm_k,

and, by Kronecker's lemma (§3, Chapter IV), a sufficient condition for the limit relation (P-a.s.)

(1/n) Σ_{k=1}^n k Δm_k → 0,  n → ∞,

is that the limit lim_n m_n exists and is finite (P-a.s.), which in turn (Theorems 1 and 4, §10, Chapter II) is true if and only if

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} → 0,  n → ∞.   (24)

By (1),

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} ≤ ε^{−2} Σ_{k=n}^∞ E(ΔM_k)²/k².

Hence the required result follows from (22) and (24).

Now let r > 1. Then the statement (23) is equivalent (Theorem 1, §10, Chapter II) to the statement that

P{sup_{j≥n} |M_j|/j ≥ ε} → 0,  n → ∞,   (25)

for every ε > 0. By inequality (29) of Problem 1 (applied to the submartingale |M_j|^{2r} with V_j = j^{−2r}),

ε^{2r} P{sup_{n≤j≤N} |M_j|/j ≥ ε} ≤ E|M_n|^{2r}/n^{2r} + Σ_{j=n+1}^N j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}).

It follows from Kronecker's lemma and (22) that

lim_{n→∞} E|M_n|^{2r}/n^{2r} = 0.

Hence to prove (25) we need only prove that

Σ_{j≥2} j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}) < ∞.   (26)

We have, by summation by parts,

Σ_{j=2}^N j^{−2r} E(|M_j|^{2r} − |M_{j−1}|^{2r}) ≤ Σ_{j=3}^N [(j − 1)^{−2r} − j^{−2r}] E|M_{j−1}|^{2r} + E|M_N|^{2r}/N^{2r}.

By Burkholder's inequality (17) and Hölder's inequality,

E|M_j|^{2r} ≤ C E[Σ_{i=1}^j (ΔM_i)²]^r ≤ C j^{r−1} Σ_{i=1}^j E|ΔM_i|^{2r}.

Hence

Σ_{j=2}^{N−1} [j^{−2r} − (j + 1)^{−2r}] j^{r−1} Σ_{i=1}^j E|ΔM_i|^{2r} ≤ C₁ Σ_{j=2}^{N−1} j^{−r−2} Σ_{i=1}^j E|ΔM_i|^{2r} ≤ C₂ Σ_{j=2}^N E|ΔM_j|^{2r}/j^{r+1} + C₃

(C, C₁, C₂, C₃ are constants). By (22), this establishes (26).
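The strong law (23) can be watched numerically on a martingale with growing but (22)-compatible increment variances. The sketch below is our illustration (names and the variance profile k^{0.4} are assumptions chosen so that (22) holds with r = 1: Σ k^{0.4}/k² < ∞).

```python
import random

def avg_abs_mn_over_n(n=20000, n_paths=50, seed=7):
    """Average of |M_n / n| over independent paths, where M is a martingale
    with independent centered Gaussian increments of variance k**0.4
    (standard deviation k**0.2 at step k)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_paths):
        m = 0.0
        for k in range(1, n + 1):
            m += rng.gauss(0.0, k ** 0.2)
        acc += abs(m) / n
    return acc / n_paths

avg = avg_abs_mn_over_n()
# Var(M_n / n) ~ n**(-0.6) / 1.4, so avg should already be small at n = 20000.
```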

The sequence of random variables {X_n}_{n≥1} has a limit lim X_n (finite or infinite) with probability 1 if and only if the number of its "oscillations between two arbitrary rational numbers a and b, a < b" is finite with probability 1. Theorem 3, below, provides an upper bound for the number of "oscillations" for submartingales. In the next section, this will be applied to prove the fundamental result on their convergence.

Let us choose two numbers a and b, a < b, and define the following times in terms of the stochastic sequence X = (X_n, F_n):

τ₀ = 0,
τ₁ = min{n > 0: X_n ≤ a},
τ₂ = min{n > τ₁: X_n ≥ b},
. . . . . . . . . . . . . . . . . . . .
τ_{2m−1} = min{n > τ_{2m−2}: X_n ≤ a},
τ_{2m} = min{n > τ_{2m−1}: X_n ≥ b},

taking τ_k = ∞ if the corresponding set {·} is empty. In addition, for each n ≥ 1 we define the random variables

β_n(a, b) = 0 if τ₂ > n;  β_n(a, b) = max{m: τ_{2m} ≤ n} if τ₂ ≤ n.

In words, β_n(a, b) is the number of upcrossings of [a, b] by the sequence X₁, ..., X_n.

Theorem 3 (Doob). Let X = (X_n, F_n)_{n≥1} be a submartingale. Then, for every n ≥ 1,

Eβ_n(a, b) ≤ E[X_n − a]⁺/(b − a).   (27)

PROOF. The number of upcrossings of [a, b] by X = (X_n, F_n) is equal to the number of upcrossings of [0, b − a] by the nonnegative submartingale X⁺ = ((X_n − a)⁺, F_n). Hence it is sufficient to suppose that X is nonnegative, take a = 0, and show that

Eβ_n(0, b) ≤ EX_n/b.   (28)

Put X₀ = 0, F₀ = {∅, Ω}, and for i = 1, 2, ..., let

φᵢ = 1 if τ_m < i ≤ τ_{m+1} for some odd m;  φᵢ = 0 if τ_m < i ≤ τ_{m+1} for some even m.

It is easily seen that

b β_n(0, b) ≤ Σ_{i=1}^n φᵢ(Xᵢ − X_{i−1})

and

{φᵢ = 1} = ∪_{odd m} [{τ_m < i}\{τ_{m+1} < i}] ∈ F_{i−1}.

Therefore

b Eβ_n(0, b) ≤ E Σ_{i=1}^n φᵢ(Xᵢ − X_{i−1}) = Σ_{i=1}^n ∫_{{φᵢ=1}} (Xᵢ − X_{i−1}) dP
= Σ_{i=1}^n ∫_{{φᵢ=1}} E(Xᵢ − X_{i−1} | F_{i−1}) dP = Σ_{i=1}^n ∫_{{φᵢ=1}} [E(Xᵢ | F_{i−1}) − X_{i−1}] dP
≤ Σ_{i=1}^n ∫_Ω [E(Xᵢ | F_{i−1}) − X_{i−1}] dP = EX_n,

which establishes (28).
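The definition of β_n(a, b) translates directly into a small state machine (wait to go ≤ a, then wait to go ≥ b, count, repeat), and Doob's bound (27) can be compared with an empirical upcrossing count. The code below is our sketch; the drifting-walk submartingale and the parameter values are illustrative assumptions.

```python
import random

def upcrossings(xs, a, b):
    """Number of upcrossings of [a, b] by the sequence xs, i.e. beta_n(a, b)
    defined through the alternating times tau_1 <= tau_2 <= ..."""
    count = 0
    below = False              # True once the path has reached level <= a
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count

def doob_upcrossing_check(p=0.55, a=0.0, b=2.0, n=100,
                          n_sims=20000, seed=8):
    """Compare the Monte Carlo estimate of E beta_n(a, b) with the bound
    E (X_n - a)^+ / (b - a) of (27), for the submartingale X_j = S_j
    (a +/-1 walk with positive drift)."""
    rng = random.Random(seed)
    beta_total = 0
    plus_total = 0.0
    for _ in range(n_sims):
        s, xs = 0, []
        for _ in range(n):
            s += 1 if rng.random() < p else -1
            xs.append(s)
        beta_total += upcrossings(xs, a, b)
        plus_total += max(s - a, 0.0)
    return beta_total / n_sims, (plus_total / n_sims) / (b - a)

e_beta, bound = doob_upcrossing_check()
```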

4. PROBLEMS

1. Let X = (X_n, F_n) be a nonnegative submartingale and let V = (V_n, F_{n−1}) be a predictable sequence such that 0 ≤ V_{n+1} ≤ V_n ≤ C (P-a.s.), where C is a constant. Establish the following generalization of (1):

εP{max_{1≤j≤n} V_j X_j ≥ ε} + ∫_{{max_{1≤j≤n} V_j X_j < ε}} V_n X_n dP ≤ E Σ_{j=1}^n V_j ΔX_j,   (29)

where ΔX_j = X_j − X_{j−1}, X₀ = 0.

2. Let X = (X_n, F_n) be a supermartingale. Show that

P{max_{1≤j≤n} |X_j| ≥ ε} ≤ (C/ε) · max_{1≤j≤n} E|X_j|,

where C ≤ 3 (the constant C can be taken to be 1 if X is a martingale or if X does not change sign).

3. Establish Krickeberg's decomposition: every martingale X = (X_n, F_n) with sup_n E|X_n| < ∞ can be represented as the difference of two nonnegative martingales.

4. Let X = (X_n, F_n) be a submartingale. Show that, for every ε > 0 and n ≥ 1,

εP{min_{1≤j≤n} X_j ≤ −ε} ≤ E(X_n − X₁) − ∫_{{min_{1≤j≤n} X_j ≤ −ε}} X_n dP ≤ EX_n⁺ − EX₁.

5. Let ξ₁, ξ₂, ... be a sequence of independent random variables, S_n = ξ₁ + ··· + ξ_n and S_{m,n} = Σ_{j=m+1}^n ξ_j. Establish Ottaviani's inequality:

P{max_{1≤j≤n} |S_j| > 2ε} ≤ P{|S_n| > ε} / min_{1≤j≤n} P{|S_{j,n}| ≤ ε},

and deduce that

∫₀^∞ P{max_{1≤j≤n} |S_j| > 2t} dt ≤ 2E|S_n| + 2 ∫_{2E|S_n|}^∞ P{|S_n| > t} dt.   (30)

6. Let ξ₁, ξ₂, ... be a sequence of independent random variables with Eξᵢ = 0. Use (30) to show that in this case we can strengthen inequality (3) to

ES_n* ≤ 8E|S_n|,

where S_n* = max_{1≤j≤n} |S_j|.

7. Verify formula (5).

8. Establish inequality (8).

9. Let the σ-algebras F₀, ..., F_n be such that F₀ ⊆ F₁ ⊆ ··· ⊆ F_n, and let the events A_k ∈ F_k, k = 1, ..., n. Use (13) to establish Dvoretzky's inequality: for each ε > 0,

P(∪_{k=1}^n A_k) ≤ ε + P(Σ_{k=1}^n P(A_k | F_{k−1}) > ε).

10. Let (X_n)_{n≥1} be a sequence of random variables with zero means, finite second moments, and orthogonal differences, E ΔX_m ΔX_n = 0, m ≠ n, where ΔX_k = X_k − X_{k−1}, k > 1, and ΔX₁ = X₁ (compare Problem 8, §1). Prove (compare Doob's inequality (10)) the Rademacher–Menshov inequality

E[max_{j≤n} X_j²] ≤ (ln 4n / ln 2)² EX_n².

476

VII. Sequences of Random Variables That Form Martingales

§4. General Theorems on the Convergence of Submartingales and Martingales 1. The following result, which is fundamental for all problems about the convergence of submartingales, can be thought of as an analog of the fact that in real analysis a bounded monotonic sequence of numbers has a (finite) limit.

Theorem 1 (Doob). Let X= (Xn,

~)be

a submartingale with (1)

supEIXnl
Then with probability 1, the limit lim X" = X co exists and E IX co I < oo. PROOF.

Suppose that (2)

Then since {liiTIXn > limXn} =

U {limXn > b >a> limXn}

a
(here a and bare rational numbers), there are values a and b such that P{limXn > b >a> limXn} > 0.

(3)

Let f3n(a, b) be the number of upcrossings of (a, b) by the sequence X 1 , . . . , Xn, and let f3co(a, b) = limnf3nCa, b). By (3.27), E/3 ( b)< E[Xn- a]+ <Ex: + lal b -a b -a n a, and therefore E/3 co (a, b)

+ lal < 00, . E/3 n(a, b) :::;; sup"Ex: = 11m b -a

n

which follows from (1) and the remark that supEIXnl
n

for submartingales (since Ex::::;; EIXnl = 2EX:- EX":::;; 2Ex:- EX 1 ). But the condition Ef3co(a, b) < oo contradicts assumption (3). Hence lim Xn = X co exists with probability 1, and then by Fatou's lemma E IX co I:::;; sup E IXnl <

00.

n

This completes the proof of the theorem. Corollary 1. If X is a nonpositive submartingale, then with probability 1 the limit lim X n exists and is finite.

§4. General Theorems on the Convergence of Submartingales and Martingales

477

Corollary 2. If X= (Xn, ~)n~ 1 is a nonpositive submartingale, the sequence X= (Xn,~) with 1 S n soo, X 00 = limXn and §'oo = a{U~} is a (nonpositive) submartingale. In fact, by Fatou's lemma EX 00 = E lim Xn;::: lim EXn;::: EX 1 > -oo and (P-a.s.) E(Xool§'m) = E(limXnl§'m);:::

Corollary 3. If X = (Xn, with probability 1.

~)is

lim E(Xnl§'m);::: Xm.

a nonnegative martingale, then lim Xn exists

In fact, in that case supEIXnl =sup EX"= EX 1
2. Let ~ b ~ 2 , ... be a sequence of independent random variables with P(~; = 0) = P(~; = 2) = !. Then X= (Xn, §'~), with Xn = 1 ~i and 0 §'~ = a{w: ~ 1 , . . . , ~n} is a martingale with EXn = 1 and Xn---* X 00 (P-a.s.). At the same time, it is clear that E IX n - X oo I = 1 and therefore Xn ~ X 00 • Therefore condition (1) does not in general guarantee the convergence of X n to X oo in the L 1 sense. Theorem 2 below shows that if hypothesis (1) is strengthened to uniform integrability of the family {X"} (from which (1) follows by Subsection 4, §6, Chapter II), then besides almost sure convergence we also have convergence in L 1 .

0i'=

=

Theorem 2. Let X = {Xn, ~} be a uniformly integrable submartingale (that is, the family {Xn} is uniformly integrable). Then there is a random variable X 00 withE IXoo I < oo, such that as n---* oo (4)

(5)

Moreover, the sequence X= (Xn, ~), 1 s n also a submartingale.

s

oo, with $'oo = a(U~), is

Statement (4) follows from Theorem 1, and (5) follows from (4) and Theorem 4, §6, Chapter II. Moreover, if A E ~and m ;::: n, then

PROOF.

m---* oo, and therefore lim

m-+oo

f

A

X m dP =

f

A

X oo dP.

478

VII. Sequences of Random Variables That Form Martingales

The sequence {JAXm dP}m~n is nondecreasing and therefore

~{

{ XndP

~{

XmdP

X 00 dP,

whence Xn ~ E(X oo Iff.) (P-a.s.) for n z 1. This completes the proof of the theorem.

Corollary. If X= (Xn, ff.) is a submartingale and,for some p > 1, (6)

supEIXniP
then there is an integrable random variable X 00 for which (4) and (5) are satisfied. For the proof, it is enough to observe that, by Lemma 3 of §6 of Chapter II, condition (6) guarantees the uniform integrability of the family{X"}.

3. We now present a theorem on the continuity properties of conditional expectations; this was one of the very first results about the convergence of martingales.

Theorem 3 (P. Levy). Let (Q, §', P) be a probability space, and let

(ff.)n~ 1

be a nondecreasing family of a-algebras, §'1 s;;: §'2 s;;: • • • s;;: §'. Let ~ be a ~). Then, both P-a.s. and random variable with E I~I < oo and §'00 = a( in the L 1 sense, n ...... oo. (7)

U"

PRooF. Let Xn =

f

Joxd~a)

E(~lff.),

IX; I dP

n

z

~ f

1. Then, with a> 0 and b > 0,

J{IXd~a)

E(l ~II F;) dP =

::; f

Joxd~a)n{l~l,;;b)

I~ I dP +

~ bP{IX;I z a}+

f

{1~1 >b)

~~EIX;I+f

l~ldP

f

l~ldP.

a

~ ~EI~I + a

ll~l>bJ

ll~l>bJ

f

J{IXd~a)

f

J{IXd~a)n{i~l>b}

1~1 dP

Letting a ...... oo and then b ...... oo, we obtain lim sup a-+oo

i

r

J{IX;I~a)

IX;! dP = 0,

i.e. the family {Xn} is uniformly integrable.

I~ I dP I~ I dP

479

§4. General Theorems on the Convergence of Submartingales and Martingales

Therefore, by Theorem 2, there is a random variable X oo such that Xn = E(~IFn) ~ Xoo ((P-a.s.) and in the L 1 sense). Hence we have only to show that Xoo = Let m

~

E(~l$'oo)

(P-a.s.).

n and A E ~. Then LxmdP= LxndP=

LE(~IFn)dP= L~dP.

Since the family {X"} is uniformly integrable and since, by Theorem 5, §6, Chapter II, we have EI A IX m - X oo I ~ 0 as m ~ oo, it follows that (8)

U:'=

This equation is satisfied for all A E %,. and therefore for all A E 1 %,. . Since E IX oo I < oo and E I~ I < oo, the left-hand and right-hand sides of (8) are a-additive measures; possibly taking negative as well as positive values, but finite and agreeing on the algebra U~; 1 ~·Because of the uniqueness of the extension of a a-additive measure to an algebra over the smallest aalgebra containing it (Caratheodory's theorem, §3, Chapter II, equation (8) remains valid for sets A E F oo = a(UFn). Thus, L XoodP = L

~dP =

L

E(~l$'00 )dP,

AE$'00 •

(9)

Since X oo and E(~ I$'00 ) are $'00 -measurable, it follows from Property I of Subsection 2, §6, Chapter II, and from (9), that X 00 = E(~l$'00 ) (P-a.s.). This completes the proof of the theorem. Corollary. A stochastic sequence X = (Xn, ~) is a uniformly integrable martingale if and only if there is a random variable ~ with EI~ I < oo such that X"= E(~IFn)for all n ~ 1. Here Xn ~ E(~IFoo) (both P-a.s. and in the U sense) as n ~ oo.

In fact, if X = (X",~) is a uniformly integrable martingale, then by Theorem 2 there is an integrable random variable X oo such that X" ~ X oo (P-a.s. and in the L 1 sense) and Xn = E(X oo IFn). As the random variable~ we may take the$"' 00 -measurable variable X oo. The converse follows from Theorem 3.

4. We now turn to some applications of these theorems. 1. The" zero or one" law. Let ~ 1 , ~ 2 , ... be a sequence of independent random variables,$"'~= a{w: ~ 1 , ... , ~"}and let X be the a-algebra of the "tail" events. By Theorem 3, we have E(IAIF~) ~ E(IAIF~) = IA (P-a.s.). But ]A and en) are independent. Since E(JA IF~)= EIA and therefore IA = EIA (P-a.s.), we find that either P(A) = 0 or P(A) = 1. ExAMPLE

eel, ... '

480

VII. Sequences of Random Variables That Form Martingales

2. Let ~ 1 , ~ 2 , ••. be a sequence of random variables. According to Theorem 2 of §10, Chapter II, the convergence of the series L~n implies its convergence in probability and in distribution. It turns out that if we also assume that ~ 1 , ~ 2 , •.• are independent, there is also a converse: the con-

ExAMPLE

vergence ofL~n in distribution implies its convergence in probability and with probability 1.

Thus for a series L~n whose terms are independent random variables, all three kinds of convergence (with probability 1, in probability, or in distribution) turn out to be equivalent (compare Problem 3 of §2, Chapter IV). For the proof we must evidently show that if Sn = ~ 1 + · · · + ~n, n ~ 1, then the convergence of in distribution, as n--? oo, implies its convergence with probability 1. A simple and elegant proof of this can be based on Theorem 1. In fact, let Sn !!... S. Then, for every real t, we have Eeits. --? Eeits. There is clearly a b > 0 such that IEeirs I > 0 for all It I < i5. Take It 0 I < b; then there is also an n0 = n0 (t0 ) such that jEeitoS"I ~ c > 0 for all n > n0 , where c is a constant. For n ~ n0 , form the sequence X= (Xn, ~)with

sn

Since the variables ~ 1 , ~ 2 , .•• are assumed independent, the stochastic sequence X= (Xn, ~)is a martingale and supEIXnl::::;; c- 1
Hence it follows from Theorem 1 that, with probability 1, the limit limnXn exists and is finite, and therefore the limit lim eitos. exists with probability 1. Consequently we may say that there is a b > 0 such that for every t in the set T = {t: It I < b} the limit limn e; 1s" exists P-a.s. Write T X Q = {(t, w): t E T, wE Q}, let aJ(T) be the a-algebra of Lebesgue sets on T, let A. denote Lebesgue measure on (T, Rl(T)), and let

c = {(t, w) E T X n: lim eitS.(ro) exists}. It is clear that C E ~(T) ® :F. According to what was established above, we have P(C1) = 1 for each t E T, where C1 = {wEn: (t, w) E C} is the cross-section of C at t. Then, by Fubini's theorem (Theorem 8, §6, Chapter II)

ixo lc(t, w)d(A.

X

P) =

i( 1

/c(t, w)dP) dA.

= { P(C1) dA. = A.(T) = 2b > 0.

§4. General Theorems on the Convergence of Submartingales and Martingales

481

On the other hand, again by Fubini's theorem,

Jc(T)=

Lx

0

Ic(t,w)d(Jc

X

P)= IndP(f/c(t,w)d,l)= L,l(Cw)dP,

where Cw = {t: (t, w) E C}. Hence it follows that there is a set 0 with P(0) = 1 such that

Jc(Cw) = Jc(T) = 2c5 > 0 for all w EO. Hence we can say that, for each w EO, the limit limneitSn(w) exists for all t E C"'; moreover, the Lebesgue measure of Cw is positive. From this and Problem 8 it follows that limnSnCw) exists and is finite for w EO; since P(O) = 1, this also holds P-a.s. The next two examples illustrate possible applications of the preceding results to convergence theorems in analysis. 3. Iff= f(x) satisfies a Lipschitz condition on [0, 1), it is absolutely continuous and, as is shown in courses in analysis, there is a (Lebesgue) integrable function g = g(x) such that

EXAMPLE

f(x) - f(O) =

f

g(y) dy.

(10)

(In this sense, g(x) is a" derivative" of f(x).) Let us show how this result can be deduced from Theorem 1. Let Q = [0, 1), !#'" = .16([0, 1)), and let P denote Lebesgue measure. Put 2

"k-1 {k-1 -----y- : : ; < 2k}

.;'n(x) = k~1 -----y- I

:Fn = u{x:

~1,

... ,

~n} =

u{x:

~n},

Xn =J(~n

X

n

,

and

+ 2-")- f(~.). 2-n

Since for a given ~" the random variable ~n+ 1 takes only the values ~.and ~. + 2-
= E[Xn+11~nJ = 2"+ 1 E[f(~n+1 + r<•+ 1l)- f(~n+1)1~.]

= 2n+l{t[f(~n

+ 2-(n+l))- J(~n)J + t[f(~n + 2-")- J(~n + 2-n+1))]}

= 2"{f(~n + 2-")

- /(~.)}

= X •.

It follows that X = (X.,~) is a martingale, and it is uniformly integrable since IXnl::::;; L, where Lis the Lipschitz constant: lf(x)- f(y)l::::;; Llx- yj. Observe that !#'" = .16([0, 1)) = u(U ff,). Therefore, by the corollary to

482

VII. Sequences of Random Variables That Form Martingales

Theorem 3, there is an (P-a.s.) and

~-measurable

function g = g(x) such that Xn---. g (11)

Xn = E[gl~]. Consider the set B = [0, kf2n]. Then by (11)

(k)

f 2n

f(O) =

-

t/2"

Jo

L

k/2"

Xndx =

g(x) dx,

and since nand k are arbitrary, we obtain the required equation (10). EXAMPLE 4. Let n = [0, 1), ~ = gjJ([O, 1)) and let p denote Lebesgue measure. Consider the Haar system {Hn(x)}n:
E[f(x)l~nJ =

L akHk(x)

(P-a.s.),

(12)

k=l

for every Borel function! E L, where ak

= (J, Hk) =

f

f(x)Hk(x) dx.

In other words, the conditional expectation E[f (x) Iff,.] is a partial sum of the Fourier series of f(x) in the Haar system. Then if we apply Theorem 3 to the martingale we find that, as n ---. CJJ, n

I

k=l

(J, H")H"(x)---. f(x)

(P-a.s.)

and

5. PROBLEMS 1. Let{~.} be a nonincreasing family of a-algebras, ~ 1 2 ~ 2 2 .. ·~ 00 = n(§'., and let '7 be an integrable random variable. Establish the following analog of Theorem 3: asn-+oo, E(I'/IG.)-+ E(I'/IG 00 ) (P-a.s. and in the L 1 sense).

el>

2. Let ~2• .•• be a sequence of independent identically distributed random variables withE 1~ 1 1
e

E(~d S"'

s.+ 1, ...) =

s.

E(~ 1 IS.) = -

n

(P-a.s.),

deduce from Problem 1 a stronger form of the law of large numbers: as n -+ oo,

s.

1

- -+ m (P-a.s. and in the L sense). n

483

§5. Sets of Convergence of Submartingales and Martingales

3. Establish the following result, which combines Lebesgue's dominated convergence theorem and P. Levy's theorem. Let {~.}., 1 be a sequence of random variables such that~.-+~ (P-a.s.), 1~.1::;; I'J, E11 < oo and {~ mlm 21 is a nondecreasing family of ualgebras, with ~oo = u(U ~).Then lim E(~.l~m) = E(~l~oo) (P-a.s.). 4. Establish formula (12). 5. LeHl = [0, 1], ~ = aJ([O, 1)), let P denote Lebesgue measure, and letf = f(x) E L 1. Put

f

(k+ 1)2

f.(x)

=

2"

-n

f(y) dy,

kl-•

Show that f.(x) -+ f(x) (P-a.s.). 6. Let Q = [0, 1), ~ = at([O, 1)), let P denote Lebesgue measure and letf Continue this function periodically on [0, 2) and put

= f(x) EL 1•

2" i=l

Show that j~(x)-+ f(x) (P-a.s.). 7. Prove that Theorem 1 remains valid for generalized submartingales X= ifinfm sup. 2m E(X:I~m) < 00 (P-a.s.).

(X.,~),

8. Let a., n ;;::: 1, be a sequence of real numbers such that for all real numbers t with It I < b, b > 0, the limit lim. eira. exists. Prove that then the limit lim a. exists and is finite.

§5. Sets of Convergence of Submartingales and Martingales 1. Let X= (Xn, ~)be a stochastic sequence. Let us denote by {Xn-.. }, or {- oo
{X"-..} =

n

(P-a.s.).

Let us consider the structure of sets {X" -..} of convergence for submartingales when the hypothesis sup EIX" I < oo is not satisfied. Let a > 0, and Ta = inf{n ~ 1: Xn > a} with Ta = oo if { ·} = 0.

Definition. A stochastic sequence X= (Xn, ~)belongs to class c+ (X E c+) if E(L\X,J+I{ra < oo}
for every a> 0, where L\Xn = Xn- Xn-l• X 0 = 0.

(1)

484

VII. Sequences of Random Variables That Form Martingales

It is evident that X

E

c+ if (2) n

or, all the more so, if lilXnl ~ C
~

(P-a.s.),

(3)

1.

Theorem 1. If the submartingale X

E

c+ then

{supXn < oo} = {Xn-+}

(P-a.s.).

(4)

PROOF. The inclusion {Xn--+} ~ {sup Xn < oo} is evident. To establish the inclusion in the opposite direction, we consider the stopped submartingale X'• = (X'•""' ~).Then, by (1),

sup EX~""~ n

~

a+ E[X~ ·I{ra
(5)

and therefore by Theorem 1 of §4, {ra = oo} ~ {Xn --+}

(P-a.s.).

But Ua>o{ra = oo} = {sup Xn < oo}; hence {sup Xn < oo} ~ {Xn--+} (P-a.s.}. This completes the proof of the theorem.

Corollary. Let $X$ be a martingale with $E \sup_n |\Delta X_n| < \infty$. Then (P-a.s.)

$$\{X_n \to\} \cup \{\varliminf X_n = -\infty,\ \varlimsup X_n = +\infty\} = \Omega. \tag{6}$$

In fact, if we apply Theorem 1 to $X$ and to $-X$, we find that (P-a.s.)

$$\{\varlimsup X_n < \infty\} = \{\sup X_n < \infty\} = \{X_n \to\}, \qquad \{\varliminf X_n > -\infty\} = \{\inf X_n > -\infty\} = \{X_n \to\}.$$

Therefore (P-a.s.) $\{\varlimsup X_n < \infty\} \cup \{\varliminf X_n > -\infty\} = \{X_n \to\}$, which establishes (6).

Statement (6) means that, provided $E \sup |\Delta X_n| < \infty$, either almost all trajectories of the martingale $X$ have finite limits, or else all behave very badly, in the sense that $\varlimsup X_n = +\infty$ and $\varliminf X_n = -\infty$.

2. If $\xi_1, \xi_2, \ldots$ is a sequence of independent random variables with $E\xi_i = 0$ and $|\xi_i| \le c < \infty$, then by Theorem 1 of §2, Chapter IV, the series $\sum \xi_i$ converges (P-a.s.) if and only if $\sum E\xi_i^2 < \infty$. The sequence $X = (X_n, \mathscr{F}_n)$ with

$$X_n = \xi_1 + \cdots + \xi_n, \qquad \mathscr{F}_n = \sigma\{\omega\colon \xi_1, \ldots, \xi_n\},$$

is a square-integrable martingale with $\langle X\rangle_n = \sum_{i=1}^n E\xi_i^2$, and the proposition just stated can be interpreted as follows:

$$\{\langle X\rangle_\infty < \infty\} = \{X_n \to\} = \Omega \quad (P\text{-a.s.}),$$

where $\langle X\rangle_\infty = \lim_n \langle X\rangle_n$. The following proposition generalizes this result to more general martingales and submartingales.

Theorem 2. Let $X = (X_n, \mathscr{F}_n)$ be a submartingale and

$$X_n = m_n + A_n$$

its Doob decomposition.

(a) If $X$ is a nonnegative submartingale, then (P-a.s.)

$$\{A_\infty < \infty\} \subseteq \{X_n \to\} \subseteq \{\sup X_n < \infty\}. \tag{7}$$

(b) If $X \in C^+$, then (P-a.s.)

$$\{X_n \to\} = \{\sup X_n < \infty\} \subseteq \{A_\infty < \infty\}. \tag{8}$$

(c) If $X$ is a nonnegative submartingale and $X \in C^+$, then (P-a.s.)

$$\{X_n \to\} = \{\sup X_n < \infty\} = \{A_\infty < \infty\}. \tag{9}$$

PROOF. (a) The second inclusion in (7) is obvious. To establish the first inclusion we introduce the times

$$\sigma_a = \inf\{n \ge 1\colon A_{n+1} > a\}, \quad a > 0,$$

taking $\sigma_a = +\infty$ if $\{\cdot\} = \varnothing$. Then $A_{\sigma_a} \le a$, and by Corollary 1 to Theorem 1 of §2 the stopped sequence $Y^a = (Y_n^a, \mathscr{F}_n)$ with $Y_n^a = X_{\sigma_a \wedge n}$ is a submartingale, with

$$\sup_n EY_n^a = \sup_n [Em_{\sigma_a \wedge n} + EA_{\sigma_a \wedge n}] \le EX_1 + a < \infty.$$

Since this submartingale is nonnegative, it follows from Theorem 1 of §4 that $Y_n^a$ converges (P-a.s.); and on the set $\{\sigma_a = \infty\} = \{A_\infty \le a\}$ we have $Y_n^a = X_n$ for all $n$. Therefore (P-a.s.)

$$\{A_\infty < \infty\} = \bigcup_{a>0} \{A_\infty \le a\} \subseteq \{X_n \to\}.$$

(b) The first equation follows from Theorem 1. To prove the second, we notice that, since $Em_{\tau_a \wedge n} = Em_1 = EX_1$, in accordance with (5)

$$EA_{\tau_a \wedge n} = EX_{\tau_a \wedge n} - EX_1 \le EX^+_{\tau_a \wedge n} - EX_1 \le a + E[(\Delta X_{\tau_a})^+ I\{\tau_a < \infty\}] - EX_1 < \infty,$$

and therefore $EA_{\tau_a} = E\lim_n A_{\tau_a \wedge n} < \infty$. Hence $\{\tau_a = \infty\} \subseteq \{A_\infty < \infty\}$ (P-a.s.), and therefore

$$\{\sup X_n < \infty\} = \bigcup_{a>0}\{\tau_a = \infty\} \subseteq \{A_\infty < \infty\}.$$

Conclusion (c) follows immediately from (a) and (b). This completes the proof of the theorem.
Remark. The hypothesis that $X$ is nonnegative can be replaced by the hypothesis $\sup_n EX_n^- < \infty$.

Corollary 1. Let $X_n = \xi_1 + \cdots + \xi_n$, where $\xi_i \ge 0$, $E\xi_i < \infty$, the $\xi_i$ are $\mathscr{F}_i$-measurable, and $\mathscr{F}_0 = \{\varnothing, \Omega\}$. Then (P-a.s.)

$$\Bigl\{\sum_{n=1}^\infty E(\xi_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} \subseteq \Bigl\{\sum_{n=1}^\infty \xi_n < \infty\Bigr\}, \tag{10}$$

and if, in addition, $E\sup_n \xi_n < \infty$, then (P-a.s.)

$$\Bigl\{\sum_{n=1}^\infty E(\xi_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \Bigl\{\sum_{n=1}^\infty \xi_n < \infty\Bigr\}. \tag{11}$$

Corollary 2 (Borel–Cantelli–Lévy Lemma). If the events $B_n \in \mathscr{F}_n$, then, putting $\xi_n = I_{B_n}$ in (11), we find that

$$\Bigl\{\sum_{n=1}^\infty P(B_n \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \Bigl\{\sum_{n=1}^\infty I_{B_n} < \infty\Bigr\}. \tag{12}$$

3.

Theorem 3. Let $M = (M_n, \mathscr{F}_n)_{n \ge 1}$ be a square-integrable martingale. Then (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} \subseteq \{M_n \to\}. \tag{13}$$

If also $E\sup_n |\Delta M_n|^2 < \infty$, then (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} = \{M_n \to\}, \tag{14}$$

where

$$\langle M\rangle_\infty = \sum_{n=1}^\infty E((\Delta M_n)^2 \mid \mathscr{F}_{n-1}) \tag{15}$$

with $M_0 = 0$, $\mathscr{F}_0 = \{\varnothing, \Omega\}$.

PROOF. Consider the two submartingales $M^2 = (M_n^2, \mathscr{F}_n)$ and $(M+1)^2 = ((M_n+1)^2, \mathscr{F}_n)$. Let their Doob decompositions be

$$M_n^2 = m_n' + A_n', \qquad (M_n + 1)^2 = m_n'' + A_n''.$$

Then $A_n'$ and $A_n''$ are the same, since

$$A_n' = \sum_{k=1}^n E(\Delta M_k^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E((\Delta M_k)^2 \mid \mathscr{F}_{k-1})$$

and

$$A_n'' = \sum_{k=1}^n E(\Delta(M_k+1)^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E(\Delta M_k^2 \mid \mathscr{F}_{k-1}) = \sum_{k=1}^n E((\Delta M_k)^2 \mid \mathscr{F}_{k-1}).$$

Hence (7) implies that (P-a.s.)

$$\{\langle M\rangle_\infty < \infty\} \subseteq \{M_n^2 \to\} \cap \{(M_n+1)^2 \to\} = \{M_n \to\}.$$

Because of (9), equation (14) will be established if we show that the condition $E\sup_n |\Delta M_n|^2 < \infty$ guarantees that $M^2$ belongs to $C^+$. Let $\tau_a = \inf\{n \ge 1\colon M_n^2 > a\}$, $a > 0$. Then, on the set $\{\tau_a < \infty\}$,

$$|\Delta M_{\tau_a}^2| = |M_{\tau_a}^2 - M_{\tau_a-1}^2| \le |M_{\tau_a} - M_{\tau_a-1}|^2 + 2|M_{\tau_a-1}|\cdot|M_{\tau_a} - M_{\tau_a-1}| \le (\Delta M_{\tau_a})^2 + 2a^{1/2}|\Delta M_{\tau_a}|,$$

whence

$$E|\Delta M_{\tau_a}^2|\, I\{\tau_a < \infty\} \le E\sup_n (\Delta M_n)^2 + 2a^{1/2} E\sup_n |\Delta M_n| < \infty,$$

i.e. $M^2 \in C^+$. This completes the proof.
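The content of (13) and (14) is easy to observe numerically. The following sketch (not from the book; the coefficients $a_k = 1/k$ are an illustrative assumption) simulates a square-integrable martingale $M_n = \sum_{k \le n} a_k \varepsilon_k$ with independent standard Gaussian $\varepsilon_k$, for which $\langle M\rangle_\infty = \sum_k a_k^2 < \infty$, and checks that the trajectory settles down to a finite limit:

```python
import numpy as np

# Illustrative sketch (not from the book): a square-integrable martingale
# M_n = sum_{k<=n} a_k * eps_k with independent eps_k ~ N(0,1).  Its quadratic
# variation is <M>_n = sum_{k<=n} a_k^2, so with a_k = 1/k we have
# <M>_inf = pi^2/6 < infinity, and by (13)-(14) M_n converges a.s.
rng = np.random.default_rng(1)
N = 100_000
a = 1.0 / np.arange(1, N + 1)
M = np.cumsum(a * rng.standard_normal(N))

# oscillation of the trajectory over its second half: small if M_n converges
tail_osc = M[N // 2:].max() - M[N // 2:].min()
print(tail_osc)
```

Replacing `a` by a sequence with divergent $\sum a_k^2$ (e.g. $a_k = 1/\sqrt{k}$) destroys the convergence, in agreement with (14).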
Theorem 4. Let $M = (M_n, \mathscr{F}_n)$ be a square-integrable martingale and let $A = (A_n, \mathscr{F}_{n-1})$ be a predictable increasing sequence with $A_1 \ge 1$, $A_\infty = \infty$ (P-a.s.). If (P-a.s.)

$$\sum_{n=1}^\infty \frac{E((\Delta M_n)^2 \mid \mathscr{F}_{n-1})}{A_n^2} < \infty, \tag{16}$$

then

$$\frac{M_n}{A_n} \to 0, \quad n \to \infty, \tag{17}$$

with probability 1. In particular, if $\langle M\rangle = (\langle M\rangle_n, \mathscr{F}_{n-1})$ is the quadratic variation of the square-integrable martingale $M = (M_n, \mathscr{F}_n)$ and $\langle M\rangle_\infty = \infty$ (P-a.s.), then with probability 1

$$\frac{M_n}{\langle M\rangle_n} \to 0, \quad n \to \infty. \tag{18}$$

PROOF. Consider the square-integrable martingale $m = (m_n, \mathscr{F}_n)$ with

$$m_n = \sum_{i=1}^n \frac{\Delta M_i}{A_i}.$$

Then

$$\langle m\rangle_\infty = \sum_{i=1}^\infty \frac{E((\Delta M_i)^2 \mid \mathscr{F}_{i-1})}{A_i^2}. \tag{19}$$

Since

$$\frac{M_n}{A_n} = \frac{\sum_{k=1}^n A_k \Delta m_k}{A_n},$$

we have, by Kronecker's lemma (§3, Chapter IV), $M_n/A_n \to 0$ (P-a.s.) if the limit $\lim_n m_n$ exists (finite) with probability 1. By (13),

$$\{\langle m\rangle_\infty < \infty\} \subseteq \{m_n \to\}. \tag{20}$$

Therefore it follows from (19) that (16) is a sufficient condition for (17). If now $A_n = \langle M\rangle_n$, then (16) is automatically satisfied (see Problem 6), and consequently we have $M_n/\langle M\rangle_n \to 0$ (P-a.s.). This completes the proof of the theorem.

EXAMPLE. Consider a sequence $\xi_1, \xi_2, \ldots$ of independent random variables with $E\xi_i = 0$, $V\xi_i = V_i > 0$, and let the sequence $X = (X_n)_{n\ge 0}$ be defined recursively by

$$X_{n+1} = \theta X_n + \xi_{n+1}, \tag{21}$$

where $X_0$ is independent of $\xi_1, \xi_2, \ldots$ and $\theta$ is an unknown parameter, $-\infty < \theta < \infty$.

We interpret $X_n$ as the result of an observation made at time $n$ and ask for an estimator of the unknown parameter $\theta$. As an estimator of $\theta$ in terms of $X_0, X_1, \ldots, X_n$, we take

$$\hat\theta_n = \frac{\sum_{k=1}^n X_{k-1}X_k/V_k}{\sum_{k=1}^n X_{k-1}^2/V_k}, \tag{22}$$

taking this to be 0 if the denominator is 0. (The number $\hat\theta_n$ is the least-squares estimator.) It is clear from (21) and (22) that

$$\hat\theta_n = \theta + \frac{M_n}{\langle M\rangle_n},$$

where

$$M_n = \sum_{k=1}^n \frac{X_{k-1}\xi_k}{V_k}, \qquad \langle M\rangle_n = \sum_{k=1}^n \frac{X_{k-1}^2}{V_k}.$$

Therefore if the true value of the unknown parameter is $\theta$, then

$$P(\hat\theta_n \to \theta) = 1 \tag{23}$$

when (P-a.s.)

$$\frac{M_n}{\langle M\rangle_n} \to 0, \quad n \to \infty. \tag{24}$$

(An estimator with property (23) is said to be strongly consistent; compare the notion of consistency in §7, Chapter I.)

Let us show that the conditions

$$\sup_n \frac{V_{n+1}}{V_n} < \infty, \qquad \sum_{n=1}^\infty E\Bigl(\frac{\xi_n^2}{V_n} \wedge 1\Bigr) = \infty \tag{25}$$

are sufficient for (24), and therefore sufficient for (23). We have $\xi_k = X_k - \theta X_{k-1}$, so that $\xi_k^2 \le 2X_k^2 + 2\theta^2 X_{k-1}^2$; together with the first condition in (25) this shows that the divergence of $\sum \xi_n^2/V_n$ entails $\langle M\rangle_\infty = \infty$. Therefore

$$\Bigl\{\sum_{n=1}^\infty \Bigl(\frac{\xi_n^2}{V_n} \wedge 1\Bigr) = \infty\Bigr\} \subseteq \{\langle M\rangle_\infty = \infty\}.$$

By the three-series theorem (Theorem 3 of §2, Chapter IV) the divergence of $\sum_{n=1}^\infty E((\xi_n^2/V_n) \wedge 1)$ guarantees the divergence (P-a.s.) of $\sum_{n=1}^\infty ((\xi_n^2/V_n) \wedge 1)$. Therefore $P\{\langle M\rangle_\infty = \infty\} = 1$. Moreover, if

$$m_n = \sum_{i=1}^n \frac{\Delta M_i}{\langle M\rangle_i},$$

then

$$\langle m\rangle_n = \sum_{i=1}^n \frac{\Delta\langle M\rangle_i}{\langle M\rangle_i^2},$$

and (see Problem 6) $P(\langle m\rangle_\infty < \infty) = 1$. Hence (24) follows directly from Theorem 4.

(In Subsection 5 of the next section we continue the discussion of this example for Gaussian variables $\xi_1, \xi_2, \ldots$.)
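The strong consistency (23) is easy to observe numerically. The following sketch (not from the book; the values $\theta = 0.5$ and $V_i \equiv 1$ are illustrative assumptions under which conditions (25) obviously hold) simulates the recursion (21) and the least-squares estimator (22):

```python
import numpy as np

# Illustrative sketch (not from the book): strong consistency of the
# least-squares estimator (22) for X_{n+1} = theta*X_n + xi_{n+1}.
# Assumed for illustration: theta = 0.5 and V_i = 1, so (25) holds.
rng = np.random.default_rng(0)
theta, N = 0.5, 200_000
xi = rng.standard_normal(N)

X = np.empty(N + 1)
X[0] = rng.standard_normal()            # X_0, independent of xi_1, xi_2, ...
for n in range(N):
    X[n + 1] = theta * X[n] + xi[n]     # recursion (21)

# estimator (22) with V_k = 1 (set to 0 if the denominator vanishes)
den = np.sum(X[:-1] ** 2)
theta_hat = np.sum(X[:-1] * X[1:]) / den if den > 0 else 0.0
print(abs(theta_hat - theta))           # small for large N
```

Here the error is $M_N/\langle M\rangle_N$ in the notation above, and it shrinks as $N$ grows, in accordance with Theorem 4.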


Theorem 5. Let $X = (X_n, \mathscr{F}_n)$ be a submartingale, and let

$$X_n = m_n + A_n$$

be its Doob decomposition. If $|\Delta X_n| \le C$, then (P-a.s.)

$$\{X_n \to\} = \{A_\infty + \langle m\rangle_\infty < \infty\}, \tag{26}$$

or, equivalently,

$$\{X_n \to\} = \Bigl\{\sum_{k=1}^\infty \bigl[E(\Delta X_k \mid \mathscr{F}_{k-1}) + E((\Delta X_k)^2 \mid \mathscr{F}_{k-1})\bigr] < \infty\Bigr\}. \tag{27}$$

PROOF. Since

$$A_n = \sum_{k=1}^n E(\Delta X_k \mid \mathscr{F}_{k-1}) \tag{28}$$

and

$$m_n = \sum_{k=1}^n [\Delta X_k - E(\Delta X_k \mid \mathscr{F}_{k-1})], \tag{29}$$

it follows from the assumption $|\Delta X_k| \le C$ that the martingale $m = (m_n, \mathscr{F}_n)$ is square-integrable with $|\Delta m_n| \le 2C$. Then by (13)

$$\{\langle m\rangle_\infty < \infty\} \subseteq \{m_n \to\}, \tag{30}$$

and according to (8), $\{X_n \to\} \subseteq \{A_\infty < \infty\}$. Therefore, by (14) and (20),

$$\{X_n \to\} = \{X_n \to\} \cap \{A_\infty < \infty\} = \{X_n \to\} \cap \{A_\infty < \infty\} \cap \{m_n \to\} = \{X_n \to\} \cap \{A_\infty + \langle m\rangle_\infty < \infty\} = \{A_\infty + \langle m\rangle_\infty < \infty\}.$$

Finally, the equivalence of (26) and (27) follows because, by (29),

$$\langle m\rangle_n = \sum_{k=1}^n \bigl\{E[(\Delta X_k)^2 \mid \mathscr{F}_{k-1}] - [E(\Delta X_k \mid \mathscr{F}_{k-1})]^2\bigr\},$$

and the convergence of the series $\sum_{k=1}^\infty E(\Delta X_k \mid \mathscr{F}_{k-1})$ of nonnegative terms implies the convergence of $\sum_{k=1}^\infty [E(\Delta X_k \mid \mathscr{F}_{k-1})]^2$. This completes the proof.

4. Kolmogorov's three-series theorem (Theorem 3 of §2, Chapter IV) gives a necessary and sufficient condition for the convergence, with probability 1, of a series $\sum \xi_n$ of independent random variables. The following theorem, whose proof is based on Theorems 2 and 3, describes sets of convergence of $\sum \xi_n$ without the assumption that the random variables $\xi_1, \xi_2, \ldots$ are independent.

Theorem 6. Let $\xi = (\xi_n, \mathscr{F}_n)$, $n \ge 1$, be a stochastic sequence, let $\mathscr{F}_0 = \{\varnothing, \Omega\}$, and let $c$ be a positive constant. Then the series $\sum \xi_n$ converges on the set $A$ of sample points for which the three series

$$\sum P(|\xi_n| \ge c \mid \mathscr{F}_{n-1}), \qquad \sum E(\xi_n^c \mid \mathscr{F}_{n-1}), \qquad \sum V(\xi_n^c \mid \mathscr{F}_{n-1})$$

converge, where $\xi_n^c = \xi_n I(|\xi_n| \le c)$.

PROOF. Let $X_n = \sum_{k=1}^n \xi_k$. Since the series $\sum P(|\xi_n| \ge c \mid \mathscr{F}_{n-1})$ converges on $A$, by Corollary 2 of Theorem 2, and by the convergence of the series $\sum E(\xi_n^c \mid \mathscr{F}_{n-1})$, we have

$$A \cap \{X_n \to\} = A \cap \Bigl\{\sum_{k=1}^n \xi_k I(|\xi_k| \le c) \to\Bigr\} = A \cap \Bigl\{\sum_{k=1}^n [\xi_k I(|\xi_k| \le c) - E(\xi_k I(|\xi_k| \le c) \mid \mathscr{F}_{k-1})] \to\Bigr\}. \tag{31}$$

Let $\eta_k = \xi_k I(|\xi_k| \le c) - E(\xi_k I(|\xi_k| \le c) \mid \mathscr{F}_{k-1})$ and let $Y_n = \sum_{k=1}^n \eta_k$. Then $Y = (Y_n, \mathscr{F}_n)$ is a square-integrable martingale with $|\eta_k| \le 2c$. By Theorem 3, we have

$$A \subseteq \Bigl\{\sum V(\xi_n^c \mid \mathscr{F}_{n-1}) < \infty\Bigr\} = \{\langle Y\rangle_\infty < \infty\} \subseteq \{Y_n \to\}. \tag{32}$$

Then it follows from (31) that $A \cap \{X_n \to\} = A$, and therefore $A \subseteq \{X_n \to\}$; this completes the proof.

PROBLEMS

1. Show that if a submartingale $X = (X_n, \mathscr{F}_n)$ satisfies $E\sup_n |X_n| < \infty$, then it belongs to class $C^+$.

2. Show that Theorems 1 and 2 remain valid for generalized submartingales.

3. Show that generalized submartingales satisfy (P-a.s.) the inclusion

4. Show that the corollary to Theorem 1 remains valid for generalized martingales.

5. Show that every generalized submartingale of class $C^+$ is a local submartingale.

6. Let $a_n > 0$, $n \ge 1$, and let $b_n = \sum_{k=1}^n a_k$. Show that

$$\sum_{n=1}^\infty \frac{a_n}{b_n^2} < \infty.$$

§6. Absolute Continuity and Singularity of Probability Distributions

1. Let $(\Omega, \mathscr{F})$ be a measurable space on which there is defined a family $(\mathscr{F}_n)_{n\ge 1}$ of $\sigma$-algebras such that $\mathscr{F}_1 \subseteq \mathscr{F}_2 \subseteq \cdots \subseteq \mathscr{F}$ and

$$\mathscr{F} = \sigma\Bigl(\bigcup_{n=1}^\infty \mathscr{F}_n\Bigr). \tag{1}$$

Let us suppose that two probability measures $P$ and $\tilde P$ are given on $(\Omega, \mathscr{F})$. Let us write

$$P_n = P \mid \mathscr{F}_n, \qquad \tilde P_n = \tilde P \mid \mathscr{F}_n$$

for the restrictions of these measures to $\mathscr{F}_n$; i.e., let $P_n$ and $\tilde P_n$ be measures on $(\Omega, \mathscr{F}_n)$, and for $B \in \mathscr{F}_n$ let $P_n(B) = P(B)$, $\tilde P_n(B) = \tilde P(B)$.

Definition 1. The probability measure $\tilde P$ is absolutely continuous with respect to $P$ (notation, $\tilde P \ll P$) if $\tilde P(A) = 0$ whenever $P(A) = 0$, $A \in \mathscr{F}$. When $\tilde P \ll P$ and $P \ll \tilde P$, the measures $P$ and $\tilde P$ are equivalent (notation, $\tilde P \sim P$). The measures $P$ and $\tilde P$ are singular (or orthogonal) if there is a set $A \in \mathscr{F}$ such that $\tilde P(A) = 1$ and $P(\bar A) = 1$ (notation, $\tilde P \perp P$).

Definition 2. We say that $\tilde P$ is locally absolutely continuous with respect to $P$ (notation, $\tilde P \overset{\mathrm{loc}}{\ll} P$) if

$$\tilde P_n \ll P_n \tag{2}$$

for every $n \ge 1$.

The fundamental question that we shall consider in this section is the determination of conditions under which local absolute continuity $\tilde P \overset{\mathrm{loc}}{\ll} P$ implies one of the properties $\tilde P \ll P$, $\tilde P \sim P$, $\tilde P \perp P$. It will become clear that martingale theory is the mathematical apparatus that lets us give definitive answers to these questions.

Let us then suppose that $\tilde P \overset{\mathrm{loc}}{\ll} P$. We write

$$z_n = \frac{d\tilde P_n}{dP_n},$$

the Radon–Nikodým derivative of $\tilde P_n$ with respect to $P_n$. It is clear that $z_n$ is $\mathscr{F}_n$-measurable; and if $A \in \mathscr{F}_n$, then

$$\int_A z_{n+1}\,dP = \int_A \frac{d\tilde P_{n+1}}{dP_{n+1}}\,dP_{n+1} = \tilde P_{n+1}(A) = \tilde P_n(A) = \int_A \frac{d\tilde P_n}{dP_n}\,dP_n = \int_A z_n\,dP.$$

It follows that, with respect to $P$, the stochastic sequence $Z = (z_n, \mathscr{F}_n)_{n\ge 1}$ is a martingale.

Write

$$z_\infty = \varlimsup_n z_n.$$

Since $Ez_n = 1$, it follows from Theorem 1 of §4 that $\lim z_n$ exists P-a.s., and therefore $P(z_\infty = \lim z_n) = 1$. (In the course of the proof of Theorem 1 it will be established that $\lim z_n$ exists also for $\tilde P$, so that $\tilde P(z_\infty = \lim z_n) = 1$.)

The key to problems on absolute continuity and singularity is Lebesgue's decomposition:

Theorem 1. Let $\tilde P \overset{\mathrm{loc}}{\ll} P$. Then for every $A \in \mathscr{F}$,

$$\tilde P(A) = \int_A z_\infty\,dP + \tilde P\{A \cap (z_\infty = \infty)\}, \tag{3}$$

and the measures $\mu(A) = \tilde P\{A \cap (z_\infty = \infty)\}$ and $P(A)$, $A \in \mathscr{F}$, are singular.

PRooF. Let us notice first that the classical Lebesgue decomposition shows that if P and P are two measures, there are unique measures A. and ,u such that P = A. + ,u, where A. « P and ,u l. P. Conclusion (3) can be thought of as a specialization of this decomposition under the assumption that P, « P,,n;;::l. Let us introduce the probability measures

0 = !(P + P),

n;;:: 1,

and the notation

dP

3 = dO '

dP,

dP 3 = dO '

dP,

3, = dO, '

3n

= dO, •

Since 1!'(3 = 0) = P(3 = 0) = 0, we have 0(3 = 0, 3 = 0) = 0. Consequently the product 3 · 3- 1 can be defined consistently on the set !l\ {3 = 0, 3 = 0}; we define it to be zero on the set {3 = 0, 3 = 0}. Since P, « P, « 0,, we have (see (11.7.36))

dP,

dP, dP, (O

dO = dP · dO n

n

n

-a.s.

)

(4)

i.e.

3, =

z,3,

(0-a.s.)

(5)

whence

3, · 3,;- 1 (0-a.s.) where, as before, we take 3, · 3,;- 1 = 0 on the set {3, = 0, 3, = 0}, which is of z,

0-measure zero.

=

494

VII. Sequences of Random Variables That Form Martingales

Each of the sequences (3n, .?,;) and (3", .?,;) is (with respect to 0) a uniformly integrable martingale and consequently the limits lim 3n and lim 3n exist. Moreover (0-a.s.) lim 3n

= §,

= 3·

lim 3n

(6)

From this and the equations z" = 3n3; 1 (0-a.s.) and 0(3 = 0, 3 = 0) = 0, it follows that (0-a.s.) the limit lim z" = Z 00 exists and is equal to 3 · 3- 1 . It is clear that P « 0 and P « 0. Therefore lim z" exists both with respect to P and with respect to P. Now let

A( A)

L

=

Z 00

,u(A)

dP,

= P{A n (zoo = oo)}.

To establish (3), we must show that

P(A)

=

A(A)

+ .u(A),

We have

P(A)

L3 L33~ +L = L3~ dO

=

=

dP

dO

+

A « P,

L

3[1 -

[1 - 3!J dP

=

.u l. P.

3~] dO

Lz

00

dP

+ P{A n

(3

= 0)}, (7)

where the last equation follows from

Furthermore,

P{A n (3

= 0)} = P{A n (3 = 0) n (3 > 0)} = P{A n (3 ·r 1 = oo)} = P{A n (z = oo)}, 00

which, together with (7), establishes (3). It is clear from the construction of A that A « P and that P(z oo < oo) But we also have

,u(zoo < oo)

=

1.

= P{(Z < oo) n (z = oo)} = 0. 00

00

Consequently the theorem is proved. The Lebesgue decomposition (3) implies the following useful tests for absolute continuity or singularity for locally absolutely continuous probability measures.

495

§6. Absolute Continuity and Singularity of Probability Distributions -

lac

Theorem 2. Let P « P, i.e.

..._

~""n

« Pn, n;;::: 1. Then

P « P¢>EZ 00 = 1¢>P(z 00 EZ 00

= 0¢>P(Z = 00

oo)

=

(8)

1,

(9)

where E denotes averaging with respect to P. PRooF. Putting A = Q in (3), we find that

Ezoo Ezoo

= 1 ¢>p(z = = 0¢>P(z = 00

oo)

00

oo)

= 0, = 1.

If P(z oo = oo) = 0, it again follows from (3) that P « P. Conversely, let P « P. Then since P(z 00 = oo) = 0, we have P(z 00

= 0.

(10) (11)

=

oo)

In addition, if P j_ P there is a set BE:#' with P(B) = 1 and P(B) = 0. Then P(B n (zoo = oo )) = 1 by (3), and therefore P(z 00 = oo) = 1. If, on the other hand, P(z oo = oo) = 1 the property P j_ P is evident, since P(Z 00 = oo) = 0. This completes the proof of the theorem. 2. It is clear from Theorem 2 that the tests for absolute continuity or singularity can be expressed either in terms of P (verify the equation Ezoo = 1 or Ez 00 = 0), or in terms ofP (verify that P(z 00 < oo) = 1 or that P(z 00 = oo) = 1. By Theorem 5 of §6, Chapter II, the condition Ez 00 = 1 is equivalent to the uniform integrability (with respect to P) of the family {zn}.;, 1 • This allows us to give simple sufficient conditions for the absolute continuity P « P. For example, if (12) n

or if sup Ez~ +e < oo,

6

> 0,

(13)

then, by Lemma 3 of §6, Chapter II, the family of random variables {zn}n;;, 1 is uniformly integrable and therefore P « P. In many cases it is preferable to verify the property of absolute continuity or of singularity by using a test in terms of P, since then the question is reduced to the investigation of the probability of the "tail" event {z 00 < oo }, where one can use propositions like the "zero-one" law. Let us show, by way of illustration, that the "Kakutani dichotomy" can be deduced from Theorem 2. Let (Q, :#', P) be a probability space, let (Roo, f?J 00 ) be a measurable space of sequences x = (x 1, x 2 , ..• ) of numbers with f?J 00 = PJ(R 00 ), and let f?Jn = u{x: {x 1 , . . . , xn)}. Let ~ = (~ 1 , ~ 2 , •.. ) and ~ = (~ 1 , ~ 2 , ... ) be sequences of independent random variables.

496

VII. Sequences of Random Variables That Form Martingales

Let P and P be the probability distributions on (R 00 , f!4 00 ) for ~ and ~. respectively, i.e. P(B) = P{~eB},

F(B)

=

P{~eB},

Also let be the restrictions of P and

P to f!l" and let

Theorem 3 (Kakutani Dichotomy). Let~ = (~ 1 , ~ 2 , ••• ) and~= (~ 1 , ~ 2 , •.• ) be sequences of independent random variables for which p~n

Then either



P or

«

p~n'

n ;;:: 1.

(14)

P j_ P.

PROOF. Condition (14) is evidently equivalent toP"« P", n;;:: 1, i.e. P ~< P. It is clear that

where (15)

Consequently {x:z 00
J 1

lnqi(xi)
The event {x: L~ 1 In qi(xi) < oo} is a tail event. Therefore, by the Kolmogorov zero-one law {Theorem 1 of §1, Chapter IV) the probability P{x:zoo
Theorem 4. Let ,.... « P and let n;;:: 1, with z 0 = 1. Then (with

j5 «

p <;:;>

ffo

= {0,!1})

j5t~1 [1 -

p j_ p <;:;> PL~1 [1 -

E{A I

!/'n-1)] < 00} =

1,

{16)

E(AI~-1)] = 00} =

1.

(17)

§6. Absolute Continuity and Singularity of Probability Distributions

497

PROOF. Since

F\{z. =

0} =

I

z"dP = 0,

J(zn=O}

we have (P-a.s.)

zn =

rl

k=1

= exp{

r:t.k

±

k=1

(18)

In r:x.k}.

Putting A = {z 00 = 0} in (3), we find that P{z 00 = 0} = 0. Therefore, by (18), we have (P-a.s.) {Z 00
= {0 < Z
= { - oo < lim

f

k=1

In r:x.k < oo} .

(19)

Let us introduce the function u(x)

=

{

lxl ::::; 1,

X,

.

Sign

X,

lxl > 1.

Then

{- oo < lim kt In r:x.k < oo} = {- oo < lim

J

1 u(ln r:x.k) < oo}.

(20)

Let E denote averaging with respect to P and let 11 be an ~-measurable integrable random variable. It follows from the properties of conditional expectations (Problem 4) that zn_ 1E(1Jiff,- 1 ) = E(IJznlff,- 1 ) E(IJI~-1)

(P- and P-a.s.),

= z~- 1 E(1Jzn1~- 1 ) (P-a.s.).

(21) (22)

Recalling that r:x.n = z~_ 1 z", we obtain the following useful formula from (22):

(23) From this it follows, in particular, that E(r:x.nl~- 1 ) =

1 (P-a.s.).

By (23),

Since xu(ln x)

~

x - 1 for x ~ 0, we have, by (24),

E[u(ln r:x.n)l ~- 1 ] ~ 0

(P-a.s.).

(24)

498

VII. Sequences of Random Variables That Form Martingales

= (Xn,

It follows that the stochastic sequence X

~)with

n

Xn =

L u(ln o:k) k=l

is a submartingale with respect to P; and I-1Xn I = Iu(ln o:n)l :::;; 1. Then, by Theorem 5 of §5, we have (P-a.s.) {- oo < lim

J 1

u(ln o:k) < oo} = {J1 E[u(ln o:k)

+ u2 (ln o:k)l g;-k_ 1] < oo}. (25)

Hence we find, by combining (19), (20), (22) and (25), that (P-a.s.) {zoo
t~1 E[u(lno:k) + u (1no:k)l${_ 2

1 ]
t~1 E[o:ku(ln e>:k) + o:ku (ln o:k)lg;-k_ 2

1]

< oo}

and consequently, by Theorem 2, P«

P~P{J1 E[e>:ku(lno:k) + o:ku (ln e>:k)l${2

1

P ..L P ~ Pt~ E[e>:ku(ln o:k)

1 ]
+ o:ku 2 (1n o:k)l${- 1 ]= oo}

= 1,

(26)

= 1.

(27)

We now observe that by (24),

and for x

~

A(1 -

0 there are constants A and B (0 < A < B < oo) such that

jx) 2 :::;; xu (ln x) + xu 2 (ln x) + 1 -

x :::;; B(1 -

jX) 2 .

(28)

Hence (16) and (17) follow from (26), (27) and (24), (28). This completes the proof of the theorem.

Corollary 1. If, for all n ~ 1, the a-algebras a(o:J and ~- 1 are independent - loc with respect to P(or P), and P « P, then we have the dichotomy: either P « P or P ..L P. Correspondingly,



00

p~

p j_ p~

L

n=l

[1 - EJa:J < oo,

00

L

n=l

[1- EJa:J

=00.

499

§6. Absolute Continuity and Singularity of Probability Distributions

In particular, in the Kakutani situation (see Theorem 3) 1Xn = qn and

L [1- EjqTxJ]
P « P~

n=1

L [1 00

j5 .l p ~

EjqTxJ] = 00.

n=1

Corollary 2. Let

loc

P « P.

Then

Pt~1 E(1Xn In 1Xnl ~-1) < 00} =

1 =>



P.

for the proof, it is enough to notice that x In x

for all x

~

+ !(1 -

(29)

x) ~ 1 - x 1' 2 ,

0, and apply (16) and (24).

Corollary 3. Since the series L:': 1 [1 - E(A I~- 1 )], which has nonnegative

L

lin (P-a.s.) terms, converges or diverges with the series conclusions ( 16) and (17) of Theorem 4 can be put in the form



E(Aiffn_ 1 )1,

P~Pt~1 1lnE(AI~-1)1
(30)

P .l P~Pt~

1 1lnE(AI~-1)1 =oo} = 1.

Corollary 4. Let there exist constants A and B such that 0 :s; A < 1, B P{ 1 - A :s; 1Xn :s; 1 + B} Then

=

1,

n

~

(31) ~

0 and

1.

!{P !:'< P we have· P « P ~ Pt~ E[(1 - 1Xn) 21~-1] < 00}

1

P .l P ~ Pt~1 E[(1

= 1,

~-1] = 00} =

- 1Xn) 21

1.

For the proof it is enough to notice that when x e [1 - A, 1 + B], where 0 :s; A < 1, B ~ 0, there are constants c and C (0 < c < C < oo) such that c(l - x) 2 :s; (1 -

Jx)

2

:s; C(1 - x) 2 •

(32)

4. With the notation of Subsection 2, let us suppose that ~ = (~ 1 , ~ 2 , ••• ) and ~ = (~ 1 , ~ 2 , ••• ) are Gaussian sequences, Pn"' Pn, n ~ 1. Let us show that, for such sequences, the "Hajek-Feldman dichotomy," either P - P or P .l P, follows from the "predictable" test given above. By the theorem on normal correlation (Theorem 2 of §13, Chapter II) the conditional expectations M(xn Iffln_ 1 ) and M(xn Iflin-t), where M and M

500

VII. Sequences of Random Variables That Form Martingales

are averages with respect to P and P, respectively, are linear functions of X 1' Xn- 1 We denote these linear functions by an- 1(x) and an- 1(x) and put 0

0

0

'

0

bn- 1 = (M[xn- an- 1(x)] 2) 112 , bn-1 = (M[xn- an-l(x)] 2) 112 ·

Again by the theorem on normal correlation, there are sequences and 8 = (8 1 , 82 , .•. ) of independent Gaussian random variables with zero means and unit variances, such that

8 = (8 1 , 8 2 , .• •)

+ bn-l8n, an-1(x) + 5n-1iln

Xn = an-1(x) Xn =

(P-a.s.),

(33)

(P-a.s.).

Notice that if bn-l = 0, or 5n_ 1 = 0, it is generally necessary to extend the probability space in order to construct 8n or iln. However, if bn- 1 = 0 the extended vector (x 1, ... , xn) will be contained (P-a.s.) in the linear manifold Xn = an_ 1(x), and since by hypothesis Pn "" P", we have bn_ 1 = 0, an- 1 = an-l (x), and 1Xn(x) = 1 (P- or P-a.s.). Hence we may suppose without loss of generality that > 0, > 0 for all n ;;::: 1, since otherwise the contribution of the corresponding terms of the sum l::= 1 [1 - M.fi:IBn_ 1 ] (see (16) and (17)) is zero. Using the Gaussian hypothesis, we find from (33) that, for n ;;::: 1,

b;

5;

(34) where dn = Ibn · 5;; 1 1 and a0 (x) = E~t>

ii 0 (x) = E~1'

b~ = v~1.

5~= v~l·

From (34),

d;_l

1 + d;_1 Since ln [2dn- d(l

P « P<:!>P{

+ d;_ 1)]

an-l (x) - an-1 (x)) 2 bn-1

:::; 0, statement (30) can be written in the form

f [tln 1+2dn-1 d;_l + d;_~ . (an-t(x)- an-t(x))2]
n=l

= 1.

The series

(35)

501

§6. Absolute Continuity and Singularity of Probability Distributions

converge or diverge together; hence it follows from (35) that

p« p~p{Jo [(~-

1r

+ i\i~x)J


(36)

where i\n(x) = aix) - tln(x). Since an(x) and an(x) are linear, the sequence of random variables {i\n(x)/bn}n;o,o is a Gaussian system (with respect to both P and P). As follows from a lemma that will be proved below, such sequences satisfy an analog of the zero-one law:

p{z:(i\~~x)r
(37)

Hence it follows from (36) that



p~ Jo [ M(i\~~x)r + (~J- 1rJ
and in a similar way

(i\ (x))2 oo [ + M p j_ p ~ Jo

---t-

(62bi - 1)2] = oo.

Then it is clear that if P and Pare not singular measures, we have P « P. But by hypothesis, Pn - Pn, n ~ 1; hence by symmetry we have P « P. Therefore we have the following theorem. . . . ) and ~ = be Gaussian sequences whose finite-dimensional distributions are equivalent: Pn "' Pn, n ~ 1. Then either P"' P or P j_ P. Moreover,

Theorem 5 (Hajek-Feldman Dichotomy). Let ~ = (~ 1 , ~ 2 ,

(~ 1 , ~ 2 ,

... )

P"'

P~

Loo

[

n=O

p j_ p~ n~O oo

[

M

(A (x))2 + ([)2---i- 1)2]
bn

bn

(x))2 (62 )2] M ---t- + bi- 1 (i\

(38) =OO.

Let us now prove the zero-one law for Gaussian sequences that we need for the proof of Theorem 5.

Lemma. Let f3 = (f3n)n;o, 1 be a Gaussian sequence defined on (Q, $'; P). Then P{J1 p; <

oo} = 1 ~ J

1

E/3; < oo.

(39)

The implication <= follows from Fubini's theorem. To establish the opposite proposition, we first suppose that Ef3n = 0, n ~ 1. Here it is enough to show that

PROOF.

(40)

502

VII. Sequences of Random Variables That Form Martingales

since then the condition P{'f.fJ~ < oo} = 1 will imply that the right-hand side of (40) is finite. Select an n ~ 1. Then it follows from §§11 and 13, Chapter II, that there are independent Gaussian random variables Pk,n• k = 1, ... , r :::; n, with EfJk,n = 0, such that r

I. P~.n· I. P~ = k=1 k=1 If we write EP~.n =

A.k,n, we see easily that r

r

E

L

k=1

L Ak,n

fJ~.n =

(41)

k=1

and E exp(-

kt1 fJ~.n)

lJY + 2A.k,n)-112.

=

(42)

Comparing the right-hand sides of (41) and (42), we obtain E J1

p~ = E kt1 P~.n:::; [E exp(- kt1 P~.n)rz =

[E exp(-

kt1 p~

)rz'

from which, by letting n ..... oo, we obtain the required inequality (40). Now suppose that E/Jn ;f= 0. Let us consider again the sequence P= CPn)n?. 1 with the same distribution as f3 = (f3n)n?. 1 but independent of it (if necessary, extending the original probability space).lf P{'f.:'= 1/J; < oo} = 1, then P{L:'= 1(/Jn- P.) 2 < oo} = 1, and by what we have proved, 00

00

2

L E(f3n -

E/3n) 2 =

n=1

L E(/Jn - Pn)

n=1

2

< 00.

Since (E/Jn) 2

:::;

2/J~

+ 2(/Jn

- EfJ.) 2 ,

we have 'f.:'= 1(E/Jn) 2 < oo and therefore n=1

00

00

00

L

EfJ; =

L

n=1

(E/Jn) 2

+ L

n=1

E(/Jn - M /Jn) 2 < 00.

This completes the proof of the lemma. 5. We continue the discussion of the example in Subsection 3 of the preceding section, assuming that ~ 0 , ~ 1 , ... are independent Gaussian random variables withE~;= 0, V~; = ~ > 0. Again we let

§6. Absolute Continuity and Singularity of Probability Distributions

503

for n ~ 1, where Xo = ~O• and the unknown parameter e that is to be estimated has values in R. Let{)" be the least-squares estimator (see (5.22)).

Theorem 6. A necessary and sufficient condition for the estimator{)"' n ~ 1, to be strongly consistent is that

f l=

(43)

00.

n=OV,+1

PROOF. Sufficiency. Let P 0 denote the probability distribution on (R 00 , f16 00 ) corresponding to the sequence (X 0 , X 1 , •• •) when the true value of the unknown parameter is e. Let E0 denote an average with respect to P 0 • We have already seen that

Mn

A

e"

e + <M).'

=

where

X2

n-1

<M)n=

L ~· k=O k+1

According to the lemma from the preceding subsection,

Po(<M)oo = oo) = 1-Eo<M)oo = oo, i.e. <M) 00 = oo (P 0-a.s.) if and only if

f

k=O

Eoxr ~+1

= oo

(44) .

But

EoXr

=

k

L

ezi~-i

i=O

and E xz ~ k=O ~+1

Ioo

=

Ioo - 1

k=O

~+1

(

Ik

i=O

ezi~-i

)

(45)

Hence (44) follows from (43) and therefore, by Theorem 4, the estimator {)"' n ~ 1, is strongly consistent for every e. Necessity. For all e E R, let P 0(0" ~ e) = 1. It follows that if e 1 =f. e 2 , the measures P 0 , and P 02 are singular (P 0 , _l P 02 ). In fact, since the sequence (X 0 , X 1 , . . . ) is Gaussian, by Theorem 5 of §5 the measures P 0 , and P02 are either singular or equivalent. But they cannot be equivalent, since if P0 , "' P02 ,

504

VII. Sequences of Random Variables That Form Martingales

but P8 ,(0.-+ 0 1 ) = 1, then also P02(0.-+ 0 1 ) = 1. However, by hypothesis, P82(0n-+ 02 ) = 1 and 02 "# 0 1 . Therefore P0 , _l P02 for 0 1 "# 02 . According to (5.38),

Po,

_l

Po 2 ~ (01 - 02 ) 2

f

k=O

E6 ,[v.Xf

k+1

J

=

oo

for 0 1 "# 02 • Taking 0 1 = 0 and 02 "# 0, we obtain from (45) that

P0

_l

P62 ~

v.

L~ = 00

oo,

i=O ~'i+ 1

which establishes the necessity of (43). This completes the proof of the theorem.

6.

PROBLEMS

1. Prove (6).

2. Let

f\-

P., n;;:: 1. Show that

P- P=P{z
P ..L P = P{z 00

= P{z 00

= 00} = 1

> 0} = 1,

or

P{z 00 = 0} = 1.

3. Let P. ~ P., n ;;:: 1, let" be a stopping time (with respect to (§,;)),and let P, = PI~ and P, = PI~ be the restrictions of P and P to the a-algebra ~. Show that P, ~ P, if and only if {'t" = oo} = {z""
P,
4. Prove (21) and (22). 5. Verify (28), (29) and (32). 6. Prove (34). 7. In Subsection 2 let the sequences ~ = (~ 1 , ~ 2 , ••. ) and ~ = (~ 1 , ~ 2 , ••• ) consist of independent identically distributed random variables. Show that if P~, ~ P~,, then P ~ P if and only if the measures P~, and P~, coincide. If, however, P~, ~ P~, and P~, of. P~, then P ..L P.

§7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary I. Let ~ 1 , ~ 2 , ••• be a sequence of independent identically distributed random variables. Lets. = ~ 1 + · · · + ~ •. let g = g(n) be a "boundary," n;;;:: 1, and let -r = inf{n ~ 1: Sn
505

§7. Asymptotics of the Probability of the Outcome of a Random Walk

It is difficult to discover the exact form of the distribution of the time r. In the present section we find the asymptotic form of the probability P(r > n) as n-+ 00, for a wide class of boundaries g = g(n) and assuming that the are normally distributed. The method of proof is based on the idea of an absolutely continuous change of measure together with a number of the properties of martingales and Markov times that were presented earlier.

ei

Theorem 1. Let ~ 1 , ~ 2 , .•• be independent identically distributed random variables, with ~; ""%(0, 1). Suppose that g = g(n) is such that g(1) < 0 and, for n ~ 2, 0 ::::;; Ag(n + 1) ::::;; Ag(n), (1) where Ag(n)

= g(n) - g(n ln n =

Then P( r > n) = exp{

1) and

o(t

2

n-+ oo.

[Ag(k)Jl) ,

-tt

[Ag(k)] 2 (1

+ o(l))},

(2)

n-+ oo.

(3)

Before starting the proof, let us observe that (1) and (2) are satisfied if, for example,

g(n)

= an• + b,

! < v ::::;; 1,

a + b < 0,

or (for sufficiently large n)

! : : ; v::::;; 1,

g(n) = n•L(n),

where L(n) is a slowly varying function (for example, L(n) = C(ln n)fl with arbitrary f3 for! < v < 1 or with f3 > 0 for v = !). 2. We shall need the following two auxiliary propositions for the proof of Theorem 1. Let us suppose that ~ 1 , ~ 2 , .•• is a sequence of independent identically distributed random variables, ~; ""%(0, 1). Let ff0 = {0, Q}, ~ = u{w: ~ 1 , ... , ~n}, and let oc =(an, ~- 1 ) be a predictable sequence with P(l ocn I : : ; C) = 1, n ~ 1, where C is a constant. Form the sequence z = (zn, ~)with Zn

= exp{

n L ock ek k=1

1 -2

n } L ocf ' k=1

n;:::l.

(4)

It is easily verified that (with respect to P) the sequence z = (zn, ~) is a martingale with Ezn = 1, n ~ 1. Choose a value n ~ 1 and introduce a probability measure Pn on the measurable space (Q, ~) by putting

(5)

506

VII. Sequences of Random Variables That Form Martingales

Lemma 1. With respect to Pn, the random variables ~k = ~k - IJ.k, 1 :s;; k :s;; n, are independent and normally distributed, ~k "'%(0, 1). PROOF. Let En denote averaging with respect to Pn. Then for A.k E R, 1 :s;; k :s;; n, En

exp{ikt/k~k} = E exp{ikt/k~k}zn =

E[exp{i :t:

A.k~k}zn-1 ·E{exp(iA.n(~n- IY.n) + IJ.n~n- ~) ~~-1}]

= E[exp{
Lt A.~}· 1

Now the desired conclusion follows from Theorem 4 of §12, Chapter II.

Lemma 2. Let X zero and

(Xn,

=

~)n"' 1

(J

=

be a square-integrable martingale with mean

inf{n

~

1:

xn :s;;

-b},

where b is a constant, b > 0. Suppose that P(X 1 <-b)> 0.

Then there is a constant C > 0 such that,for all n

~

1,

c

(6)

P((J > n) 2': EX2 . n

PROOF. By Corollary 1 to Theorem VII.2.1 we have EX""n - EJ((J :s;; n)X"

=

0, whence

= EJ((J > n)Xn.

(7)

On the set {(J :s;; n} Therefore, for n

~

1,

-EJ((J :s;; n)X" ~ bP((J :s;; n) ~ bP((J

= 1) = bP(X 1 < -b)> 0.

(8)

On the other hand, by the Cauchy-Schwarz inequality, EJ((J > n)Xn :s;; [P((J > n) · Ex;r 12 ,

(9)

which, with (7) and (8), leads to the required inequality with C

= (bP(X 1 < - b)) 2 .

PROOF OF THEOREM 1. It is enough to show that

!~~In P(r > n)

Ikt

[Ag(k)] 2

~

-

i

(10)

507

§7. Asymptotics of the Probability of the Outcome of a Random Walk

and

kt

!~ ln P(r > n) I

[Ag(k)] 2

t.

::;; -

(11)

For this purpose we consider the (nonrandom) sequence (O!n)n;;, 1 with 0! 1 =

0,

O!n =

n

Ag(n),

~

2,

and the probability measure (P n)n;;, 1 defined by (5). Then by Holder's inequality

P nCr > n)

=

E/( r > n) zn ::;; (P( r > n)) 11 q(Ez:) 11P,

(12)

where p > 1 and q = p/(p - 1). The last factor is easily calculated explicitly:

(Ez,;) 11 P = exp{p - 1

2

I

(13)

[Ag(k)] 2 }.

k=2

Now let us estimate the probability Pn(r > n) that appears on the lefthand side of (12). We have

Pir > n)

=

Pn(Sk ~ g(k), 1 ::;; k::;; n)

=

PiSk

~ g(1), 1 ::;; k::;; n),

where sk = L~= 1 ~i• ~i = ~i - 0!;. By Lemma 1, the variables are independent and normally distributed, ~i ""%(0, 1), with respect to the measure Pn. Then by Lemma 2 (applied to b = -g(l), p = Pn, xn = Sn) we find that c

P(r > n) ~ -, n

(14)

where cis a constant. Then it follows from (12)-(14) that, for every p > 1,

p P(r > n) ~ CPexp { - 2

Ln

[Ag(k)] 2

-

k=2

p

}

-------=-lnn , 1 p

(15)

where CP is a constant. Then (15) implies the lower bound (10) by the hypotheses of the theorem, since p > 1 is arbitrary. To obtain the upper bound (11), we first observe that since zn > 0 (Pand P-a.s.), we have by (5)

P(r > n) = En/(r > n)z; 1 ,

(16)

where E_n denotes expectation with respect to P_n. In the case under consideration, α_1 = 0 and α_n = Δg(n), n ≥ 2, and therefore for n ≥ 2

z_n^{−1} = exp{ −Σ_{k=2}^n Δg(k)·ξ_k + ½ Σ_{k=2}^n [Δg(k)]² }.


VII. Sequences of Random Variables That Form Martingales

By the formula for summation by parts (see the proof of Lemma 2 of §3, Chapter IV),

Σ_{k=2}^n Δg(k)·ξ_k = Δg(n)·S_n − Σ_{k=2}^n S_{k−1} Δ(Δg(k)),

where Δ(Δg(k)) = Δg(k) − Δg(k−1), with the convention Δg(1) = 0.
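The summation-by-parts identity can be checked numerically. The following sketch uses arbitrary illustrative data for g and the increments ξ_k (the variable names are ours), together with the convention Δg(1) = 0:

```python
import random

random.seed(0)
n = 12
g = [0.0] + [random.uniform(-1, 1) for _ in range(n)]    # g(1), ..., g(n); g[0] unused
xi = [0.0] + [random.gauss(0, 1) for _ in range(n)]       # xi_1, ..., xi_n
S = [0.0] * (n + 1)
for k in range(1, n + 1):
    S[k] = S[k - 1] + xi[k]                               # partial sums S_k

dg = [0.0] * (n + 1)                                      # dg[k] = g(k) - g(k-1), dg[1] := 0
for k in range(2, n + 1):
    dg[k] = g[k] - g[k - 1]

lhs = sum(dg[k] * xi[k] for k in range(2, n + 1))
# summation by parts: sum dg(k)*xi_k = dg(n)*S_n - sum S_{k-1}*(dg(k) - dg(k-1))
rhs = dg[n] * S[n] - sum(S[k - 1] * (dg[k] - dg[k - 1]) for k in range(2, n + 1))
assert abs(lhs - rhs) < 1e-10
print("summation-by-parts identity verified")
```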

Hence, if we recall that by hypothesis Δg(k) ≥ 0 and Δ(Δg(k)) ≤ 0, we find that, on the set {τ > n} = {S_k ≥ g(k), 1 ≤ k ≤ n},

Σ_{k=2}^n Δg(k)·ξ_k ≥ Δg(n)·g(n) − Σ_{k=3}^n g(k−1)Δ(Δg(k)) − ξ_1Δg(2)
   = Σ_{k=2}^n [Δg(k)]² + g(1)Δg(2) − ξ_1Δg(2).

Thus, by (16),

P(τ > n) ≤ exp{ −½ Σ_{k=2}^n [Δg(k)]² − g(1)Δg(2) } · E_n I(τ > n) e^{ξ_1Δg(2)},

where, with respect to P_n, ξ_1 = ξ̃_1 ∼ N(0, 1), so that E_n I(τ > n)e^{ξ_1Δg(2)} ≤ E_n e^{ξ̃_1Δg(2)} < ∞. Therefore

P(τ > n) ≤ C exp{ −½ Σ_{k=2}^n [Δg(k)]² },

where C is a positive constant; this establishes the upper bound (11). This completes the proof of the theorem.

3. The idea of an absolutely continuous change of measure can be used to study similar problems, including the case of a two-sided boundary. We present (without proof) a result in this direction.

Theorem 2. Let ξ_1, ξ_2, … be independent identically distributed random variables with ξ_i ∼ N(0, 1). Suppose that f = f(n) is a positive function such that f(n) → ∞, n → ∞, and …, n → ∞. Then, if

σ = inf{n ≥ 1: |S_n| ≥ f(n)},

we have

P(σ > n) = exp{ −(π²/8) Σ_{k=1}^n f^{−2}(k)·(1 + o(1)) },  n → ∞.   (17)
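A crude Monte Carlo illustration of the exponent in (17). Note the hedge: a constant boundary f(n) ≡ c does *not* satisfy the theorem's hypothesis f(n) → ∞; it is used below only because then (π²/8)Σ_{k≤n} f^{−2}(k) = (π²/(8c²))n, the classical escape rate of Brownian motion from (−c, c), which the Gaussian walk mimics for large c. All parameter choices (c, horizon, number of paths) are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
c, n_max, n_paths = 10.0, 300, 20_000

steps = rng.standard_normal((n_paths, n_max), dtype=np.float32)
S = np.cumsum(steps, axis=1)                         # S_1, ..., S_300 for each path
running_max = np.maximum.accumulate(np.abs(S), axis=1)
surv = (running_max < c).mean(axis=0)                # surv[k-1] estimates P(sigma > k)

# empirical decay rate of ln P(sigma > n) between n = 150 and n = 300
rate = (np.log(surv[149]) - np.log(surv[299])) / 150.0
theory = np.pi ** 2 / (8 * c ** 2)                   # about 0.0123 per step
print(f"empirical rate {rate:.4f}, predicted {theory:.4f}")
assert 0.5 * theory < rate < 1.5 * theory            # loose, since the walk overshoots the barrier
```

The discrete walk slightly overshoots the barrier at the exit time, so the empirical rate sits somewhat below π²/(8c²); the agreement improves as c grows.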

4. PROBLEMS

1. Show that the sequence defined in (4) is a martingale.
2. Establish (13).
3. Prove (17).

§8. Central Limit Theorem for Sums of Dependent Random Variables

1. Let us suppose that, on the probability space (Ω, ℱ, P), we are given stochastic sequences

ξⁿ = (ξ_{nk}, ℱ_k^n),  0 ≤ k ≤ n,  n ≥ 1,

with ξ_{n0} = 0, ℱ_0^n = {∅, Ω}, ℱ_k^n ⊆ ℱ_{k+1}^n ⊆ ℱ. Let

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk},  0 ≤ t ≤ 1,

and make the convention ℱ_{−1}^n = {∅, Ω}.

Theorem 1. For a given t, 0 < t ≤ 1, let the following conditions be satisfied: for each ε ∈ (0, 1], as n → ∞,

(A) Σ_{k=1}^{[nt]} P(|ξ_{nk}| > ε | ℱ_{k−1}^n) →^P 0,

(B) Σ_{k=1}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n] →^P 0,

(C) Σ_{k=1}^{[nt]} V[ξ_{nk} I(|ξ_{nk}| ≤ ε) | ℱ_{k−1}^n] →^P σ_t², where the constant σ_t² ≥ 0.

Then X_t^n →^d N(0, σ_t²).

Let us begin by discussing the hypotheses of the theorem. The difference from the hypotheses in §4 of Chapter III is that there we considered independent random variables, whereas here we consider arbitrary dependent variables, and moreover without assuming that E|ξ_{nk}| is finite. It will be shown in Theorem 2 that hypothesis (A) is equivalent to

(A*) max_{1≤k≤[nt]} |ξ_{nk}| →^P 0.

Therefore (compare Theorem 1 of §4, Chapter III) our Theorem 1 is a statement about the validity of the central limit theorem under the assumption that the summands are uniformly asymptotically infinitesimal (see, however, Theorem 5 below).

Hypotheses (A) and (B) guarantee that X_t^n can be represented in the form X_t^n = Y_t^n + Z_t^n with Z_t^n →^P 0 and Y_t^n = Σ_{k=0}^{[nt]} η_{nk}, where the sequence ηⁿ = (η_{nk}, ℱ_k^n) is a martingale-difference, i.e. E(η_{nk} | ℱ_{k−1}^n) = 0, with |η_{nk}| ≤ c uniformly for 1 ≤ k ≤ n and n ≥ 1. Consequently, in the cases under consideration, the proof reduces to proving the central limit theorem for martingale-differences.

In the case when the variables ξ_{n1}, …, ξ_{nn} are independent, conditions (A), (B) and (C), with t = 1 and σ² = σ_1², become

(a) Σ_{k=1}^n P(|ξ_{nk}| > ε) → 0,

(b) Σ_{k=1}^n E[ξ_{nk} I(|ξ_{nk}| ≤ 1)] → 0,

(c) Σ_{k=1}^n V[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → σ².

These are well known; see the book by Gnedenko and Kolmogorov [G5]. Hence we have the following corollary to Theorem 1.

Corollary. If ξ_{n1}, …, ξ_{nn} are independent random variables, n ≥ 1, then

(a), (b), (c) ⇒ X_1^n →^d N(0, σ²).
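As a minimal numerical illustration of the corollary, take the (hypothetical) triangular array ξ_{nk} = (B_k − p)/√(np(1−p)) with B_k ∼ Bernoulli(p) independent; conditions (a)–(c) then hold with σ² = 1, and X_1^n is a standardized binomial, which should be approximately N(0, 1) for large n. All parameters below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, n_rep = 400, 0.3, 20_000

# X^n = sum_k xi_nk is a standardized Binomial(n, p)
B = rng.binomial(n, p, size=n_rep)
X = (B - n * p) / np.sqrt(n * p * (1 - p))

print(f"mean {X.mean():+.3f}, variance {X.var():.3f}")
assert abs(X.mean()) < 0.05
assert abs(X.var() - 1.0) < 0.05
# compare one value of the distribution function with Phi(0.5) = 0.6915 (tables)
assert abs((X <= 0.5).mean() - 0.6915) < 0.05
```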

Remark 1. In hypothesis (C), the case σ_t² = 0 is not excluded. Hence, in particular, Theorem 1 also yields a condition for convergence to a degenerate distribution (X_t^n →^d 0).

Remark 2. The method used to prove Theorem 1 lets us state and prove the following more general proposition. Let 0 < t_1 < t_2 < ··· < t_j ≤ 1, σ_{t_1}² ≤ σ_{t_2}² ≤ ··· ≤ σ_{t_j}², σ_{t_0}² = 0, and let ε_1, …, ε_j be independent Gaussian random variables with zero means and Eε_k² = σ_{t_k}² − σ_{t_{k−1}}². Form the (Gaussian) vector (W_{t_1}, …, W_{t_j}) with W_{t_k} = ε_1 + ··· + ε_k.

Let conditions (A), (B) and (C) be satisfied for t = t_1, …, t_j. Then the joint distribution P^n_{t_1,…,t_j} of the random variables (X^n_{t_1}, …, X^n_{t_j}) converges weakly to the Gaussian distribution P_{t_1,…,t_j} of the variables (W_{t_1}, …, W_{t_j}):

P^n_{t_1,…,t_j} ⇒ P_{t_1,…,t_j}.

2. Theorem 2. (1) Condition (A) is equivalent to (A*). (2) Assuming (A) or (A*), condition (C) is equivalent to

(C*) Σ_{k=0}^{[nt]} [ξ_{nk} − E(ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n)]² →^P σ_t².

Theorem 3. For each n ≥ 1 let the sequence ξⁿ = (ξ_{nk}, ℱ_k^n), 1 ≤ k ≤ n, be a square-integrable martingale-difference:

E(ξ_{nk} | ℱ_{k−1}^n) = 0,  Eξ_{nk}² < ∞.

Suppose that the Lindeberg condition is satisfied: for every ε > 0,

(L) Σ_{k=0}^{[nt]} E[ξ_{nk}² I(|ξ_{nk}| > ε) | ℱ_{k−1}^n] →^P 0.

Then (C) is equivalent to

⟨Xⁿ⟩_t →^P σ_t²,   (1)

where the quadratic variation

⟨Xⁿ⟩_t = Σ_{k=0}^{[nt]} E(ξ_{nk}² | ℱ_{k−1}^n),   (2)

and (C*) is equivalent to

[Xⁿ]_t →^P σ_t²,   (3)

where the quadratic variation

[Xⁿ]_t = Σ_{k=0}^{[nt]} ξ_{nk}².   (4)

The next theorem is a corollary of Theorems 1–3.

Theorem 4. Let the square-integrable martingale-differences ξⁿ = (ξ_{nk}, ℱ_k^n), n ≥ 1, satisfy (for a given t, 0 < t ≤ 1) the Lindeberg condition (L). Then

Σ_{k=0}^{[nt]} E(ξ_{nk}² | ℱ_{k−1}^n) →^P σ_t²  ⇒  X_t^n →^d N(0, σ_t²),   (5)

Σ_{k=0}^{[nt]} ξ_{nk}² →^P σ_t²  ⇒  X_t^n →^d N(0, σ_t²).   (6)
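A simulation sketch of statement (5): the martingale-difference array below, with a coin-flip-modulated conditional variance v_k, is our own toy construction (not the book's). Since v_k is determined by the past, E(ξ_{nk} | ℱ_{k−1}) = 0, the Lindeberg condition holds, and (1/n)Σ_k v_k →^P 1.25, so X_1^n should be approximately N(0, 1.25):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 500, 10_000

eps = rng.standard_normal((n_rep, n))
v = np.ones((n_rep, n))
v[:, 1:] += 0.5 * (eps[:, :-1] > 0)      # v_k = 1 + 0.5*I(eps_{k-1} > 0): predictable
xi = eps * np.sqrt(v / n)                # martingale differences: E(xi_k | past) = 0
X = xi.sum(axis=1)

# sum_k E(xi_k^2 | F_{k-1}) = (1/n) sum_k v_k -> 1.25, so the limit is N(0, 1.25)
print(f"mean {X.mean():+.3f}, variance {X.var():.3f}")
assert abs(X.mean()) < 0.06
assert abs(X.var() - 1.25) < 0.10
kurt = (X ** 4).mean() / X.var() ** 2    # should be near 3 for a Gaussian limit
assert abs(kurt - 3.0) < 0.3
```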


3. PROOF OF THEOREM 1. Let us represent X_t^n in the form

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| ≤ 1) + Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1).   (7)

We define

B_t^n = Σ_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | ℱ_{k−1}^n],   (8)

μ_k^n(Γ) = I(ξ_{nk} ∈ Γ),  ν_k^n(Γ) = P(ξ_{nk} ∈ Γ | ℱ_{k−1}^n),

where Γ is a set from the smallest σ-algebra σ(R \ {0}) and P(ξ_{nk} ∈ Γ | ℱ_{k−1}^n) is the regular conditional distribution of ξ_{nk} with respect to ℱ_{k−1}^n. Then (7) can be rewritten in the following form:

X_t^n = B_t^n + Σ_{k=0}^{[nt]} ∫_{|x|>1} x dμ_k^n + Σ_{k=0}^{[nt]} ∫_{|x|≤1} x d(μ_k^n − ν_k^n),   (9)

which is known as the canonical decomposition of (X_t^n, ℱ_t^n). (The integrals are to be understood as Lebesgue–Stieltjes integrals, defined for every sample point.)

According to (B) we have B_t^n →^P 0. Let us show that (A) implies

Σ_{k=0}^{[nt]} ∫_{|x|>1} |x| dμ_k^n →^P 0.   (10)

We have

…   (11)

For every δ ∈ (0, 1),

…   (12)

It is clear that

Σ_{k=0}^{[nt]} I(|ξ_{nk}| > 1) = Σ_{k=0}^{[nt]} ∫_{|x|>1} dμ_k^n (= U_{[nt]}).

By (A),

V_{[nt]} = Σ_{k=0}^{[nt]} ∫_{|x|>1} dν_k^n →^P 0,   (13)

and ν_k^n is ℱ_{k−1}^n-measurable. Then by the corollary to Theorem 2 of §3, Chapter VII,

V_{[nt]} →^P 0 ⇒ U_{[nt]} →^P 0.   (14)

(By the same corollary and the inequality ΔU_{[nt]} ≤ 1, we also have the converse implication

U_{[nt]} →^P 0 ⇒ V_{[nt]} →^P 0,   (15)

which will be needed in the proof of Theorem 2.) The required proposition (10) now follows from (11)–(14). Thus

X_t^n = Y_t^n + Z_t^n,   (16)

where

Y_t^n = Σ_{k=0}^{[nt]} ∫_{|x|≤1} x d(μ_k^n − ν_k^n)   (17)

and

Z_t^n = B_t^n + Σ_{k=0}^{[nt]} ∫_{|x|>1} x dμ_k^n →^P 0.   (18)

It then follows by Problem 1 that, to establish X_t^n →^d N(0, σ_t²), we need only show that

Y_t^n →^d N(0, σ_t²).   (19)

Let us represent Y_t^n in the form

Y_t^n = Y_{[nt]}(ε) + Δ_{[nt]}(ε),   (20)

where, for ε ∈ (0, 1],

Y_{[nt]}(ε) = Σ_{k=0}^{[nt]} ∫_{ε<|x|≤1} x d(μ_k^n − ν_k^n),  Δ_{[nt]}(ε) = Σ_{k=0}^{[nt]} ∫_{|x|≤ε} x d(μ_k^n − ν_k^n).   (21)

As in the proof of (14), it is easily verified that, because of (A), we have Y_{[nt]}(ε) →^P 0, n → ∞.
514

VII. Sequences of Random Variables That Form Martingales

The sequence L\"(e) = (L\~(e), ~;:), 1 martingale with quadratic variation (L\"(e))k =

.± [ (

•=0

Jlxl:s;e

X2

~

k

~

dv'; - (

n, is a square-integrable

r Jlxl:s:e

X

dv';)

2

]

Because of (C),

Hence, for every e E (0, 1], max{y[m1(e), I(L\n(e)) 1nrJ - a} I} ~ 0. By Problem 2 there is then a sequence of numbers en! 0 such that Ylnr](en) ~ 0,

(L\"(en))[nt) ~ at.

Therefore, again by Problem 1, it is enough to prove only that M(nr] -4 JV(O, at),

(22)

±f

(23)

where

Mi: For

r

E

=L\i: (en)

=

i=O

Jlxl :S:en

x(p,i - v';).

a(R\ {0}), let ,U~(r) =

l(L\Mi: E r),

be a regular conditional probability, L\M~ = MI. - M~_ 1 , k ~ 1, M 0 = 0. Then the square-integrable martingale M" = (M~, ~k), 1 ~ k ~ n, can evidently be written in the form

M'k =

L: L\M'i = L k

k

i=1

i=l

f

lxl:s;2en

X

d,U;:.

(Notice that IL\M'il ~ 2e. by (23).) To establish (22) we have to show that, for every real Jc, E exp{iJcM[nrJ}--+ exp(- !Jc 2 at). Put

and

8i:(Gn)

=

k

TI (1 + L\G~).

j= 1

(24)

515

§8. Central Limit Theorem for Sums of Dependent Random Variables

Observe that 1+

~Gi:

= 1

+ ( (e;;.x J!x!~2En

- 1) dvi: = (

J!x!~2Ea

e;;.x dvi:

and consequently $i:(G")

=

k

0

j= 1

(1

+ ~GD.

On the basis of a lemma that will be proved in Subsection 4, (24) will follow if for every real A. 18[nr1(G")I =

IX\(nt] E[exp(iA.~Mj)I§j_ I~ C(A.) > 0 1]

(25)

S[ntl(G") ~ exp(- !A. 2 o}).

(26)

and

To see this we represent Si:(G") in the form k

8i:(G") = exp(Gi:) ·

0

(1

+ ~Gj) exp(-

~Gj).

j= 1

(Compare the function E,(A) defined by (76) in §6, Chapter II.) Since

(

J!x!,;; 2En

we have Gi: =

xdvj =

I ( ~2En

j=

1 J!x!

E(~Mjl§j_ 1 ) =

0,

(eux - 1 - iA.x) dvj.

(27)

Therefore

I~Gi:l

:::; (

J!x!~2En

< -

lei.l.x- 1 - iA.xl dvi::::; !A. 2

2 (2e ) 2 -+ l.A_ 2 n

0

(

J!x!~2£n

x 2 dvf: (28)

and (29)

By (C), (30)

516

VII. Sequences of Random Variables That Form Martingales

Suppose first that (M")k (28), (29) and Problem 3,

_:5;

[nt)

0

(1

a (P-a.s.), k

_:5;

[nt], where a ~ a}

+ dGi:) exp(- dGk) ~

+

1. Then by

n-+ oo,

1,

k=1

and therefore to establish (26) we have only to show that n G[nt]-+

-

2

1 12

(31)

211. (jt'

i.e., after (27), (29) and (30), that (32) But

Therefore if (M")£ntl _:5; a (P-a.s.), (31) is established and consequently (26) is established also. Let us now verify (25). Since Ieih - 1 - iA.x I _:5; !(A.x) 2 , we find from (28) that, for sufficiently large n, k

k

j= 1

j= 1

l$i:(G")I = I

0 (1 + dG?)I ~ fl (1

- 1Jc 2 d(M"))

= exp{t1 ln(1- 1A. 2d(M")i)} But !Jc 2d(M")i ln(1 - .lJc 2d(M") .) > 1 - 1Jc2d(M")i 1 2 and d(M")i:::; (2s.) 2 ! 0, n-+ oo. Therefore there is an n0 = n0 (Jc) such that for all n ~ n0 (Jc),

and therefore

lt'[nr1(G")I

~ exp{ -Jc 2 (M")£ntJ ~

e-J.la.

Hence the theorem is proved under the assumption that (M")[ntJ (P-a.s.). To remove this assumption, we proceed as follows.

_:5;

a

517

§8. Central Limit Theorem for Sums of Dependent Random Variables

Let r" = min{k ~ [nt]: (M")k ~ uf

u;

+ 1},

taking r" = oo if (M")[ntl ~ + 1. Then for Mi: = Mi: ·"" we have (M") 1ntl = (M") 1ntl A'" ~ 1 + u;

+ 2e;

~ 1 + u;

+ 2ef

(=a),

and by what has been proved, E exp{iA.M[ntJ}--+ exp(- !A. 2 u;).

But limiE{exp(iA.M[ntJ)- exp(iA.M[ntJ)},I

~

2lim P(r" < oo) = 0.

n

n

Consequently limE exp(iA.M[ntJ) =lim E{exp(iA.M[ntJ)- exp(iA.M[ntJ)} n

n

+ limE exp(iA.M[n11)

= exp(- !A. 2 u;).

n

This completes the proof of Theorem 1.

Remark. To prove the statement made in Remark 2 to Theorem 1, we need (using the Cramer-Wold method [B3]) to show that for all real numbers

At, ... ' A.i

j

--+ exp( -iA.iu;,)-

L iA.~(u;k- uLJ.

k=2

The proof of this is similar to the proof of (24), replacing (Mi:, $'J:) by the square-integrable martingales (Mi:, $'i:), k

MJ: = L

i= 1

vitlM'i,

4. In this subsection we prove a simple lemma which lets us reduce the verification of (24) to the verification of (25) and (26). Let 17" = ('7nk' $'i:), 1 ~ k ~ n, n ~ 1, be stochastic sequences, let n

yn =

L Yfnk'

k=l

let n

C"(A.) =

fl

k=l

E[exp(iA.'7nk)I$'J:_l],

AE R,

518

VII. Sequences of Random Variables That Form Martingales

and let Y be a random variable with

Lemma. If (for a given /c)l 6""(/c) I 2:: c(/c) > 0, n 2:: 1, a sufficient condition for the limit relation

(33) is that

(34) PROOF. Let ei.
m"(Jc) = @""(Jc). Then Im"(Jc)l ~ c - 1 (Jc) < oo, and it is easily verified that

Em"(Jc) = 1. Hence by (34) and the Lebesgue dominated convergence theorem,

IEei.
= IE(m"(Jc)[&'"(Jc)- &'(Jc)])l

~

c- 1 (Jc)E I&'"(Jc)- &'(Jc)l--+ 0,

n --+oo.

Remark. It follows, from (33) and the hypothesis that &""(Jc) 2:: c(Jc) > 0, that tff(Jc) =f. 0. In fact, the conclusion of the lemma remains valid without the assumption that lt&'"(Jc)l 2:: c(Jc) > 0, if restated in the form: if @""(Jc) .E. g(Jc) and 8(Jc) =f. 0, then (33) holds (Problem 5). 5. PROOF OF THEOREM 2. (1) Let e > 0, (j Since max I~nk I ~ e lsksn

+

E

(0, e), and for simplicity lett = 1.

n

L

I~nk I/(I ~nk I > e)

k~l

and

we have

If (A) is satisfied, i.e.

I I dvZ > (j} --+ o P{k~lJixi>E

519

§8. Central Limit Theorem for Sums of Dependent Random Variables

then (compare (14)) we also have

p{ ± f k=1

Jlxl>e

dJ1~ > (j} -+ 0.

Therefore (A) => (A*). Conversely, let (Jn

= min{k:::; n: l~nkl

~ e/2},

supposingthat(Jn = ooifmax 1 sksnl~nkl < e/2.By(A*),limnPC(Jn
[~1" f(l~nkl ~ e/2) > (j}

= 0.

t,~~~"" l~nkl ~ ie}

and

coincide, and by (A*)

"I"

f(l

k= 1

Therefore by (15)

~nkl ~ B/2)

""""L i k= 1

lxl<>:e

dv~ :::;

="I" f

d11Z

""""L i

dv~ ~ 0,

Jlxl<>:e/2

k= 1

k= 1

lxl<>:t/2

which, together with the property limn P( (J n < oo)

=>(A).

(2) Again suppose that t integrable martingales d"((j) with

(j E

=

~ 0.

= 0, prove that (A*)

1. Choose an e E (0, 1] and consider the square-

= (d;;(b), ~k)

(1 :::; k :::; n),

(0, e]. For the given e E (0, 1], we have, according to (C),
(Jf.

It is then easily deduced from (A) that for every

(j E

(0, e]

(35) Let us show that it follows from (C*), (A) and (A*) that, for every (j [d"(c))]n ~

E

(0, e], (36)

(Jf,

where [d"((j)]n

=

I

k=l

[~nkf(l ~nk I :::; b) - f

Jlxlsd

X

dvk] 2•

In fact, it is easily verified that by (A) [A"(b)]n - [d"(l)Jn ~ 0.

(37)

520

VII. Sequences of Random Variables That Form Martingales

But

n

::; 5

L J(l~nkl rel="nofollow">

k=l

::; 5 max l:Sk:Sn

I

~nk 2 1

l)l~nkl 2

f (

k=l

Jlxl>l

df.l'k

--+

o.

(38)

Hence (36) follows from (37) and (38). Consequently to establish the equivalence of (C) and (C*) it is enough to establish that when (C) is satisfied (for a given e E (0, 1]), then (C*) is also satisfied for every a > 0: lim lim b-0

P{l[~"(a)Jn- <~"(J))nl

n

>a}= 0.

(39)

Let 1 ::; k ::; n.

The sequence m"(J) = (m'k(J), ~'k) is a square-integrable martingale, and (m"(J)) 2 is dominated (in the sense of the definition of §3 on p. 467) by the sequences [m"(J)] and <m"(J)) ). It is clear that n

[m"(J)Jn

=

L

(~m'k(J)f ::; max l~m'k(J)I{[~"(~)Jn

k=l

+

<~"(~))n}

l:Sk:Sn

(40)

Since [~"(J)] and <~"(J)) dominate each other, it follows from (40) that (m"(J)) 2 is dominated by the sequences 6J 2 [~"(J)] and M 2 <~"(~)). Hence if (C) is satisfied, then for sufficiently small J (for example, for

J <

ib(af + 1))

n

and hence, by the corollary to Theorem 2 of §3, we have (39). If also (C*) is satisfied, then for the same values of J, (41) n

Since I~[~"( J)]k I ::; (215) 2 , the validity of (39) follows from (41) and another appeal to Theorem 2 of §3. This completes the proof of Theorem 2.

521

§8. Central Limit Theorem for Sums of Dependent Random Variables

6. PRooF OF THEOREM 3. On account of the Lindeberg condition (L), the equivalence of (C) and (1), and of (C*) and (3), can be established by direct calculation (Problem 6). 7. PRooF OF THEOREM 4. Condition (A) follows from the Lindeberg condition (L). As for condition (B), it is sufficient to observe that when ~n is a martingale-difference, the variables B~ that appear in the canonical decomposition (9) can be represented in the form B~ = -

L

[nt]

k=O

l

lxl>1

xdv!.

Therefore B~ .f. 0 by the Lindeberg condition (L). 8. The fundamental theorem of the present section, namely Theorem 1, was proved under the hypothesis that the terms that are summed are uniformly asymptotically infinitesimal. It is natural to ask for conditions for the central limit theorem without such a hypothesis. For independent random variables, examples of such theorems are given by Theorem 1 (assuming finite second moments) or Theorem 5 (assuming finite first moments) from §4, Chapter III. We quote (without proof) an analog of the first of these theorems, applicable only to sequences ~n = (~nk• ~~)that are square-integrable martingale differences. Let ~nk(x) = P(~nk ~ xl~~- 1 ) be a regular distribution function of ~nk with respect to ~i:- 1 , and let L\nk = E(~;kl~~- 1 ).

Theorem 5. If a square-integrable martingale-difference ~n 0 :::;; k :::;; n, n ;;::: 1, ~no = 0, satisfies the condition [nt]

L

k=O

L\nk

J.

= (~nk•

~i:),

at,

and for every e > 0

L

[nt]l

k=O

lxl>•

I

X ) dx .P. 0, lxll~k(x)- W( ~

yilnk

then

9. PROBLEMS 1. Let ~n = 'In

+ ~n• n ~

1, where 'In.!!.. '1 and ~n .!4 0. Prove that ~n .!4 'I·

2. Let (~n(e)), n ~ 1, e > 0, be a family ofrandom variables such that ~n(e) f. 0 for each e > 0 as n -+oo. Using, for example, Problem 11 of §10, Chapter II, prove that there is a sequence e. ! 0 such that ~.(e.) f. 0.

522 3. Let

VII. Sequences of Random Variables That Form Martingales

(~).

1 s; k s; n, n

~

1, be a complex-valued random variable such that (P-a.s.) n

L

IIX~I ::5

c,

n•

+ IX~) exp( -IXZ) =

k=l

Show that then (P-a.s.)

lim •

(1

k=l

4. Prove the statement made in Remark 2 to Theorem 1. 5. Prove the statement made in Remark 1 to the lemma. 6. Prove Theorem 3. 7. Prove Theorem 5.

1.

CHAPTER VIII

Sequences of Random Variables That Form Markov Chains

§1. Definitions and Basic Properties 1. In Chapter I (§12), for finite probability spaces, we took the basic idea

to be that of Markov dependence between random variables. We also presented a variety of examples and considered the simplest regularities that are possessed by random variables that are connected by a Markov chain. In the present chapter we give a general definition of a stochastic sequence of random variables that are connected by Markov dependence, and devote our main attention to the asymptotic properties of Markov chains with countable state spaces.

2. Let (Q, fF, P) be a probability space with a distinguished nondecreasing family(~) of u-algebras, ~ s;;;: $i'1 s;;;: • • • s;;;: !F. Definidon. A stochastic sequence X = (Xn, $i'n) is called a Markov chain (with respect to the measure P) if (1)

for all n 2::: m 2::: 0 and all BE fJB(R). Property (1), the Markov property, can be stated in a number of ways. For example, it is equivalent to saying that E[g(Xn)l$i'mJ = E[g(Xn)IXm]

for every bounded Borel function g = g(x).

(P-a.s.)

(2)

524

VIII. Sequences of Random Variables That Form Markov Chains

Property (1) is also equivalent to the statement that, for a given "present"

Xm, the "future" F and the "past" Pare independent, i.e. (3) where FE a{w: X;, i ~ m}, and BEg;"' n.:::; m. In the special case when

g;"

=

g;;

= a{w:

X0 ,

... ,

X.}

and the stochastic sequence X = (X., g;;) is a Markov chain, we say that the sequence {X.} itself is a Markov chain. It is useful to notice that if X= {X., g;n} is a Markov chain, then (X.) is also a Markov chain.

Remark. It was assumed in the definition that the variables X m are realvalued. In a similar way, we can also define Markov chains for the case when X n takes values in some measurable space (E. 8). In this case, if all singletons are measurable, the space is called a phase space, and we say that X = (X., g;.) is a Markov chain with values in the phase space (E, 8). When E is finite or countably infinite (and 8 is the a-algebra of all its subsets) we say that the Markov chain is discrete. In turn, a discrete chain with a finite phase space is called a finite chain. The theory of finite Markov chains, as presented in §12, Chapter I, shows that a fundamental role is played by the one-step transition probabilities P(Xn+ 1 E Bl X.). By Theorem 3, §7, Chapter II, there arefunctions Pn+ 1 (x; B), the regular conditional probabilities, which (for given x) are measures on (R, gJ(R)), and (for given B) are measurable functions of x, such that

(4) The functions P. = P.(x, B), n ~ 0, are called transition functions, and in the case when they coincide (P 1 = P 2 = · · ·), the corresponding Markov chain is said to be homogeneous (in time). From now on we shall consider only homogeneous Markov chains, and the transition function P 1 = P 1(x, B) will be denoted simply by P = P(x, B). Besides the transition function, an important probabilistic property of a Markov chain is the initial distribution n = n(B), that is, the probability distribution defined by n(B) = P(X 0 E B). The set of pairs (n, P), where n is an initial distribution and Pis a transition function, completely determines the probabilistic properties of X, since every finite-dimensional distribution can be expressed (Problem 2) in terms of n and P: for every n ~ 0 and A E gJ(R"+ 1) P{(X 0 ,

=

... ,

X.) E A}

Ln(dx 0 ) LP(x 0 ;dx 1 ) · · · f/A(x 0 ,

...

,x.)P(x._ 1 ;dxn).

(5)

525

§1. Definitions and Basic Properties

We deduce, by a standard limiting process, that for any f?I(W+ 1 )-measurable function g(x 0 , •.. , xn), either of constant sign or bounded, Eg(Xo, ... , Xn)

= {n(dxo) LP(x 0 ;dx 1 )

...

Lg(x 0 ,

...

,xn)P(xn-t;dx.).

(6)

3. Let p<•> = p<•>(x; B) denote a regular variant of the n-step transition probability: (7)

It follows at once from the Markov property that for all k and l, (k, l ;;:::: 1), p(X 0 ; B)= Lp(X 0 ; dy)P< 1>(y; B)

It does not follow, of course, that for all x

E

(P-a.s.).

(8)

R

p(x; B)= Lp(x; dy)P< 1>(y; B).

(9)

It turns out, however, that regular variants of the transition probabilities can be chosen so that (9) will be satisfied for all x E R (see the discussion in the historical and bibliographical notes, p. 559). Equation (9) is the Kolmogorov-Chapman equation (compare (112.13)) and is the starting point for the study oft he probabilistic properties of Markov chains. 4. It follows from our discussion that with every Markov chain X = (X n, JF.), defined on (Q, ff, P) there is associated a set (n, P). It is natural to ask what properties a set (n, P) must have in order for n = n(B) to be a probability distribution on (R, .cJJ(R)) and for P = P(x; B) to be a function that is measurable in x for given B, and a probability measure on B for every x, so that n will be the initial distribution, and P the transition function, for some Markov chain. As we shall now show, no additional hypotheses are required. In fact, let us take (Q, !F) to be the measurable space (R 00 , !?J(R 00 )). On the sets A E !?J(W+ 1) we define a probability measure by the right-hand side of formula (5). It follows from §9, Chapter II, that a probability measure P exists on (R 00 , !?J(R 00 ) for which P{w: (x 0 ,

... ,

x.) E A}

= {n(dxo)

f/(x 0 ;dx 1)

..

·f/A(x 0 , ... ,x.)P(x._ 1 ;dx.). (10)

Let us show that if we put x.(w) = x. for w = (xo, Xt, ... ), the sequence X = (X.)n~o will constitute a Markov chain (with respect to the measure P just constructed).

526

VIII. Sequences of Random Variables That Form Markov Chains

In fact, if Be BI(R) and C e BI(R"+ 1), then P{Xn+l eB, (X 0 ,

••• ,

Xn)e C}

= { n(dx 0 ) LP(x0 ; dx 1)

•••

f/a<xn+l)Ic(x 0 , ••• , Xn)P(xn; dxn+l)

= { n(dx 0 ) LP(x0 ; dx 1 )

• ..

LP(xn; B)Ic(x 0 ,

=

f

... ,

Xn)P(xn-l; dxn)

P(Xn; B)dP,

{co: (Xo, ... , Xn)eC}

whence (P-a.s.) (11)

Similarly we can verify that (P-a.s.) (12)

Equation (1) now follows from (11) and (12). It can be shown in the same way that for every k ;;::: 1 and n ;;::: 0,

This implies the homogeneity of Markov chains. The Markov chain X= (Xn) that we have constructed is known as the Markov chain generated by (n, P). To emphasize that the measure P on (R 00 , 91(R 00 )) has precisely the initial distribution n, it is often denoted by P,. If n is concentrated at the single point x, we write P x instead of P "' and the corresponding Markov chain is called the chain generated by the point x (since Px{X 0 = x} = 1). Consequently, each transition function P = P(x, B) is in fact connected with the whole family of probability measures {P x• x e R}, and therefore with the whole family of Markov chains that arise when the sequence (Xn)n2:o is considered with respect to the measures Px• x e R. From now on, we shall use the phrase "Markov chain with given transition function" to mean the family of Markov chains in the sense just described. We observe that the measures P" and P x constructed from the transition function P = P(x, B) are consistent in the sense that, when A e 91(R 00 ),

and

527

§1. Definitions and Basic Properties

5. Let us suppose that (Q, !F)= (R 00 , PJ(R 00 )) and that we are considering a sequence X= (X.) that is defined coordinate-wise, that is, X.(w) = x. for w = (x 0 , x 1 , ... ). Also let !F, = a{w: X 0 , •.• , X.), n 2: 0. Let us define the shifting operators e., n 2: 0, on Q by the equation

= (x., Xn+l• ... ), and let us define, for every random variable 11 = IJ(w), the random variables 8.(xo,

Xb ... )

e.IJ by putting

In this notation, the Markov property of homogeneous chains can (Problem 1) be given the following form: For every ff-measurable 11 = IJ(w), every n 2: 0, and BE PJ(R), P{8.1JEB/!F,}

=

PxJIJEB}

(P-a.s.).

(15)

This form of the Markov property allows us to give the following important generalization: (15) remains valid if we replace n by stopping times r.

Theorem. Let X= (X.) be a homogeneous Markov chain defined on (R 00 , PJ(R 00 ), P) and let r be a stopping time. Then the following strong Markov property is valid: (16) P{8r1JEB/~} = PxJIJEB} (P-a.s.). PROOF. If A E ~ then

L P{8r1JEB, A, r = n} 00

P{8r1JEB, A}=

n=O

00

= L P{8.1JEB, A, r = n}.

(17)

n=O

The events An {r = n} E :F., and therefore P{8.1JEB, An {r = n}} = =

f f

An{r=n}

An~=~

P{8.1JEB/!F,} dP Px)IJ EB} dP =

f

PxJIJ E B} dP,

An~=~

which, with (17), establishes (16).

Corollary. If a is a stopping time such that P(a 2: r) = 1 and a is ~-measurable, then ({a< oo}; P-a.s.). (18) 6. As we said above, we are going to consider only discrete Markov chains (with phase space E = {... , i,j, k, .. .}). To simplify the notation, we shall now denote the transition functions P(i; U}) by Pij and call them transition

528

VIII. Sequences of Random Variables That Form Markov Chains

probabilities; an n-step transition probability from i to j will be denoted by p~';>. Let E = { 1, 2, ... }. The principal questions that we study in §§2-4 are intended to clarify the conditions under which: (A) The limits ni = limp~~} exist and are independent of i; (B) The limits (n 1 , n 2 , •• •) form a probability distribution, that is, ni 2 0,

ni = 1; (C) The chain is ergodic, that is, the limits (n 1, n 2 , •• •) have the properties ni > 0, L~ 1 ni = 1; (D) There is one and only one stationary probability distribution iQI = (q 1 , q2 , •• •), that is, one such that qi 2 0, L~ 1 qi = 1, and qi = Li qiPii• jEE. In the course of answering these questions we shall develop a classification of the states of a Markov chain as they depend on the arithmetic and asymptotic properties of p\'!l and p\~l l) u 0

7. PROBLEMS 1. Prove the equivalence of definitions (1), (2), (3) and (15) of the Markov property. 2. Prove formula (5). 3. Prove equation (18). 4. Let (X.)• ., 0 be a Markov chain. Show that the reversed sequence( . .. ,X.,X n-l•· . . ,X 0 )

is also a Markov chain.

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties of the Transition Probabilities p~'P 1. We say that a state i E E = { 1, 2, ... } is inessential if, with positive probability, it is possible to escape from it after a finite number of steps, without ever returning to it; that is, there exist m andj such that p~jl > 0, but p)il = 0 for all n and j. Let us delete all the inessential states from E. Then the remaining set of essential states has the property that a wandering particle that encounters it can never leave it (Figure 36). As will become clear later, it is essential states that are the most interesting.

Let us now consider the set of essential states. We say that state j is accessible from the point i (i---. j) if there is an m 2 0 such that p~jl > 0

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties

I I

529

t~@: 1

I

I

I

l1t

I I

I I

I

(

2

1

I

I

2

t1 2

I.

------------1

I ___________ _

Inessential state

Essential state Figure 36

(p!J> =

1 if i = j, and 0 if i =F j). States i and j communicate (i- j) if j is accessible from i and i is accessible from j. By the definition, the relation "-" is symmetric and reflexive. It is easy to verify that it is also transitive (i- j, j - k => i - k). Consequently the set of essential states separates into a finite or countable number of disjoint sets E 1, E 2 , •.• , each of which consists of communicating sets but with the property that passage between different sets is impossible. By way of abbreviation, we call the sets £ 1, E 2 , •.• classes or indecomposable classes (of essential communicating sets), and we call a Markov chain indecomposable if its states form a single indecomposable class. As an illustration we consider the chain with matrix

1 l:o i:o 1

'3

3 .

4

IP=

•••

0

0 0

0

0 0 0 0

•••••••••••

o:o o:.l o:o • 2

1 0 =

0

(~1

1

~J

2

1 0

The graph of this chain, with set of states E = {1, 2, 3, 4, 5} has the form I

1

2

~ t I

It is clear that this chain has two indecomposable classes E 1 = {1, 2}, E 2 = {3, 4, 5}, and the investigation of their properties reduces to the investigation of the two separate chains whose states are the sets E 1 and E 2 , and whose transition matrices are IP 1 and IP 2 • Now let us consider any indecomposable class E, for example the one sketched in Figure 37.

530

VIII. Sequences of Random Variables That Form Markov Chains

2

Figure 37. Example of a Markov chain with period d = 2.

Observe that in this case a return to each state is possible only after an even number of steps; a transition to an adjacent state, after an odd number; the transition matrix has block structure,

o o:t t o o:t t

IP=

... ···.··· ... 1_

2

I

'0 0

2:

! t:o o Therefore it is clear that the class E = {1, 2, 3, 4} separates into two subclasses C 0 = {1, 2} and C 1 = {3, 4} with the following cyclic property: after one step from C 0 the particle necessarily enters C 1 , and from C 1 it returns to C 0 . This example suggests a classification of indecomposable classes into cyclic subclasses.

·

2. Let us say that state j has period d = d(j) if the following two conditions are satisfied:

p}'p > 0 only for values of n of the form dm; (2) dis the largest number satisfying (1).

(1)

In other words, d is the greatest common divisor of the numbers n for which

p}'j> > 0. (If p)jl = 0 for all n ~ 1, we put dU) = 0.)

Let us show that all states of a single indecomposable class E have the same period d, which is therefore naturally called the period of the class, d = d(E). Let i and j E E. Then there are numbers k and l such that PW > 0 and ll -> p\~>p\!l > 0' and therefore k + I is divisible PJl(!l > 0• Consequently p\~+ u lj Jl by d(i). Suppose that n > 0 and n is not divisible by d(i). Then n + k + I is also not divisible by d(i) and consequently P!7+k+l) = 0. But P(~+k+l) II

> -

p\~)p\'!lp\!) IJ

JJ

II

§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties

531

/~~

c,_,o\\

/c,

c3•

•c2

Figure 38. Motion among cyclic subclasses.

and therefore PY? = 0. It follows that if p}'jl > 0 we have n divisible by d(i), and therefore d(i) :::; d(j). By symmetry, d(j) :::; d(i). Consequently d(i) = d(j). If d(j) = 1 (d(E) = 1), the state j (or class E) is said to be aperiodic. Let d = d(E) be the period of an indecomposable class E. The transitions within such a class may be quite freakish, but (as in the preceding example) there is a cyclic character to the transitions from one group of states to another. To show this, let us select a state i 0 and introduce (for d 2 1) the following subclasses:

$$C_0 = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv 0 \ (\mathrm{mod}\ d)\},$$
$$C_1 = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv 1 \ (\mathrm{mod}\ d)\},$$
$$\vdots$$
$$C_{d-1} = \{j \in E : p_{i_0 j}^{(n)} > 0 \Rightarrow n \equiv d - 1 \ (\mathrm{mod}\ d)\}.$$

Clearly $E = C_0 + C_1 + \cdots + C_{d-1}$. Let us show that the motion from subclass to subclass is as indicated in Figure 38. In fact, let the state $i \in C_p$ and $p_{ij} > 0$. Let us show that necessarily $j \in C_{p+1(\mathrm{mod}\ d)}$. Let $n$ be such that $p_{i_0 i}^{(n)} > 0$. Then $n = ad + p$, and therefore $n \equiv p \ (\mathrm{mod}\ d)$ and $n + 1 \equiv p + 1 \ (\mathrm{mod}\ d)$. Hence $p_{i_0 j}^{(n+1)} > 0$ and $j \in C_{p+1(\mathrm{mod}\ d)}$.

Let us observe that it now follows that the transition matrix $\mathbb{P}$ of an indecomposable chain has a block structure in which the only nonzero blocks are those describing the transitions from $C_p$ to $C_{p+1(\mathrm{mod}\ d)}$, $p = 0, 1, \ldots, d - 1$.


Figure 39. Classification of the states of a Markov chain in terms of arithmetic properties of the probabilities $p_{ij}^{(n)}$.

Consider a subclass $C_p$. If we suppose that a particle is in the set $C_0$ at the initial time, then at the times $n = p + dt$, $t = 0, 1, \ldots$, it will be in the subclass $C_p$. Consequently, with each subclass $C_p$ we can connect a new Markov chain, with transition matrix $(p_{ij}^{(d)})_{i, j \in C_p}$, which is indecomposable and aperiodic. Hence, if we take account of the classification that we have outlined (see the summary in Figure 39), we infer that in studying problems on the limits of the probabilities $p_{ij}^{(n)}$ we can restrict our attention to aperiodic indecomposable chains.

3. PROBLEMS

1. Show that the relation "$\leftrightarrow$" is transitive.

2. For Example 1, §5, show that when $0 < p < 1$, all states belong to a single class with period $d = 2$.

3. Show that the Markov chains discussed in Examples 4 and 5 of §5 are aperiodic.

§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties of the Probabilities $p_{ij}^{(n)}$

1. Let $\mathbb{P} = \|p_{ij}\|$ be the transition matrix of a Markov chain,

$$f_{ii}^{(k)} = P_i\{X_k = i,\ X_l \ne i,\ 1 \le l \le k - 1\} \qquad (1)$$

and, for $i \ne j$,

$$f_{ij}^{(k)} = P_i\{X_k = j,\ X_l \ne j,\ 1 \le l \le k - 1\}. \qquad (2)$$


For $X_0 = i$, these are respectively the probability of first return to state $i$ at time $k$, and the probability of first arrival at state $j$ at time $k$. Using the strong Markov property (1.16), we can show, as in (1.12.38), that

$$p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)}\, p_{jj}^{(n-k)}. \qquad (3)$$

For each $i \in E$ we introduce

$$f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)}, \qquad (4)$$

which is the probability that a particle that leaves state $i$ will sooner or later return to that state. In other words, $f_{ii} = P_i\{\sigma_i < \infty\}$, where $\sigma_i = \inf\{n \ge 1 : X_n = i\}$, with $\sigma_i = \infty$ when $\{\cdot\} = \varnothing$.

We say that a state $i$ is recurrent if $f_{ii} = 1$, and nonrecurrent if $f_{ii} < 1$.

Every recurrent state can, in turn, be classified according to whether the average time of return is finite or infinite. Let us say that a recurrent state $i$ is positive if

$$\mu_i^{-1} = \left(\sum_{n=1}^{\infty} n f_{ii}^{(n)}\right)^{-1} > 0$$

and null if

$$\mu_i^{-1} = \left(\sum_{n=1}^{\infty} n f_{ii}^{(n)}\right)^{-1} = 0.$$

Thus we obtain the classification of the states of the chain displayed in Figure 40: the set of all states divides into the nonrecurrent and the recurrent states, and the recurrent states in turn into the positive and the null states.

Figure 40. Classification of the states of a Markov chain in terms of the asymptotic properties of the probabilities $p_{ij}^{(n)}$.
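The quantities $f_{ii}^{(n)}$, $f_{ii}$ and $\mu_i$ can be computed numerically from the return probabilities $p_{ii}^{(n)}$ by inverting the renewal identity (3). A hedged Python sketch (the two-state example and the helper names are our own illustration, not part of the text):

```python
def first_return_probs(p_ret, N):
    # invert p^(n) = sum_{k=1}^n f^(k) p^(n-k), identity (3) with i = j
    f = [0.0] * (N + 1)
    for n in range(1, N + 1):
        f[n] = p_ret[n] - sum(f[k] * p_ret[n - k] for k in range(1, n))
    return f

# two-state chain: from 0 jump to 1 with prob a, from 1 jump to 0 with prob b
a, b = 0.3, 0.6
P = [[1 - a, a], [b, 1 - b]]
M = [[1.0, 0.0], [0.0, 1.0]]
p00 = [1.0]                                   # p_00^(0) = 1
for _ in range(40):
    M = [[sum(M[i][k] * P[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    p00.append(M[0][0])

f = first_return_probs(p00, 40)
f00 = sum(f[1:])                              # ≈ 1: state 0 is recurrent
mu0 = sum(n * f[n] for n in range(1, 41))     # ≈ 1.5, the mean return time
print(f00, mu0)
```

The value $\mu_0 = 1.5$ agrees with $1/\pi_0$ for the stationary probability $\pi_0 = b/(a + b) = 2/3$ of this two-state chain.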

534

VIII. Sequences of Random Variables That Form Markov Chains

2. Since the calculation of the functions $f_{ij}^{(n)}$ can be quite complicated, it is useful to have the following tests for whether a state $i$ is recurrent or not.

Lemma 1. (a) The state $i$ is recurrent if and only if

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty. \qquad (5)$$

(b) If state $j$ is recurrent and $i \leftrightarrow j$, then state $i$ is also recurrent.

PROOF. (a) By (3),

$$p_{ii}^{(n)} = \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)},$$

and therefore (with $p_{ii}^{(0)} = 1$)

$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)} = \sum_{k=1}^{\infty} f_{ii}^{(k)} \sum_{n=k}^{\infty} p_{ii}^{(n-k)} = f_{ii}\left(1 + \sum_{n=1}^{\infty} p_{ii}^{(n)}\right).$$

Therefore if $\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$, we have $f_{ii} < 1$, and therefore state $i$ is nonrecurrent. Furthermore, let $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$. Then

$$\sum_{n=1}^{N} p_{ii}^{(n)} = \sum_{n=1}^{N} \sum_{k=1}^{n} f_{ii}^{(k)}\, p_{ii}^{(n-k)} = \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{n=k}^{N} p_{ii}^{(n-k)} \le \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{l=0}^{N} p_{ii}^{(l)},$$

and therefore

$$f_{ii} \ge \sum_{k=1}^{N} f_{ii}^{(k)} \ge \frac{\sum_{n=1}^{N} p_{ii}^{(n)}}{\sum_{l=0}^{N} p_{ii}^{(l)}} \to 1, \qquad N \to \infty.$$

Thus if $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ then $f_{ii} = 1$, that is, the state $i$ is recurrent.

(b) Let $p_{ij}^{(s)} > 0$ and $p_{ji}^{(t)} > 0$. Then

$$p_{ii}^{(n+s+t)} \ge p_{ij}^{(s)}\, p_{jj}^{(n)}\, p_{ji}^{(t)},$$

and if $\sum_{n=1}^{\infty} p_{jj}^{(n)} = \infty$, then also $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$, that is, the state $i$ is recurrent.

3. From (5) it is easy to deduce a first result on the asymptotic behavior of $p_{ij}^{(n)}$.

Lemma 2. If state $j$ is nonrecurrent, then, for every $i$,

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} < \infty \qquad (6)$$

and therefore

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty. \qquad (7)$$

PROOF. By (3) and Lemma 1,

$$\sum_{n=1}^{\infty} p_{ij}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ij}^{(k)}\, p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \sum_{n=0}^{\infty} p_{jj}^{(n)} = f_{ij} \sum_{n=0}^{\infty} p_{jj}^{(n)} \le \sum_{n=0}^{\infty} p_{jj}^{(n)} < \infty.$$

Here we used the inequality $f_{ij} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \le 1$, which holds because this series represents the probability that a particle starting at $i$ eventually arrives at $j$. This establishes (6) and therefore (7).

Let us now consider recurrent states.

Lemma 3. Let $j$ be a recurrent state with $d(j) = 1$.

(a) If $i$ communicates with $j$, then

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty. \qquad (8)$$

If in addition $j$ is a positive state, then

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad n \to \infty. \qquad (9)$$

If, however, $j$ is a null state, then

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty. \qquad (10)$$

(b) If $i$ and $j$ belong to different classes of communicating states, then

$$p_{ij}^{(n)} \to \frac{f_{ij}}{\mu_j}, \qquad n \to \infty. \qquad (11)$$

The proof of the lemma depends on the following theorem from analysis. Let $f_1, f_2, \ldots$ be a sequence of nonnegative numbers with $\sum_{j=1}^{\infty} f_j = 1$, such that the greatest common divisor of the indices $j$ for which $f_j > 0$ is 1. Let $u_0 = 1$, $u_n = \sum_{k=1}^{n} f_k u_{n-k}$, $n = 1, 2, \ldots$, and let $\mu = \sum_{n=1}^{\infty} n f_n$. Then $u_n \to 1/\mu$ as $n \to \infty$. (For a proof, see [F1], §10 of Chapter XIII.)

Taking account of (3), we apply this to $u_n = p_{jj}^{(n)}$, $f_k = f_{jj}^{(k)}$. Then we immediately find that

$$p_{jj}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty,$$

where $\mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)}$.

Taking $p_{jj}^{(s)} = 0$ for $s < 0$, we can rewrite (3) in the form

$$p_{ij}^{(n)} = \sum_{k=1}^{\infty} f_{ij}^{(k)}\, p_{jj}^{(n-k)}. \qquad (12)$$

By what has been proved, $p_{jj}^{(n-k)} \to \mu_j^{-1}$, $n \to \infty$, for each given $k$. Therefore if we suppose that

$$\lim_n \sum_{k=1}^{\infty} f_{ij}^{(k)}\, p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \lim_n p_{jj}^{(n-k)}, \qquad (13)$$

we immediately obtain

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}\left(\sum_{k=1}^{\infty} f_{ij}^{(k)}\right) = \frac{f_{ij}}{\mu_j}, \qquad (14)$$

which establishes (11).

Recall that $f_{ij}$ is the probability that a particle starting from state $i$ arrives, sooner or later, at state $j$. State $j$ is recurrent, and if $i$ communicates with $j$, it is natural to suppose that $f_{ij} = 1$. Let us show that this is indeed the case.

Let $\tilde f_{ij}$ denote the probability that a particle, starting from state $i$, visits state $j$ infinitely often. Clearly $f_{ij} \ge \tilde f_{ij}$. Therefore if we show that, for a recurrent state $j$ and a state $i$ that communicates with it, the probability $\tilde f_{ij} = 1$, we will have established that $f_{ij} = 1$. According to part (b) of Lemma 1, the state $i$ is also recurrent, and therefore

$$f_{ii} = \sum_{k=1}^{\infty} f_{ii}^{(k)} = 1. \qquad (15)$$

Let

$$\sigma_i = \inf\{n \ge 1 : X_n = i\}$$

be the first time (among the times $n \ge 1$) at which the particle reaches state $i$; take $\sigma_i = \infty$ if no such time exists. Then

$$1 = f_{ii} = \sum_{n=1}^{\infty} f_{ii}^{(n)} = \sum_{n=1}^{\infty} P_i(\sigma_i = n) = P_i(\sigma_i < \infty), \qquad (16)$$

and consequently to say that state $i$ is recurrent means that a particle starting at $i$ will eventually return to that state (at the random time $\sigma_i$). But after returning to this state the "life" of the particle starts over, so to speak (because of the strong Markov property). Hence it appears that if state $i$ is recurrent, the particle must return to it infinitely often:

$$P_i\{X_n = i \text{ for infinitely many } n\} = 1. \qquad (17)$$

Let us now give a formal proof. Let $i$ be a state (recurrent or nonrecurrent). Let us show that the probability of return to that state at least $r$ times is $(f_{ii})^r$.

For $r = 1$ this follows from the definition of $f_{ii}$. Suppose that the proposition has been proved for $r = m - 1$. Then, using the strong Markov property and (16), we have

$P_i$(number of returns to $i$ is at least $m$)

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k, \text{ and the number of returns to } i \text{ after time } k \text{ is at least } m - 1)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k)\, P_i(\text{at least } m - 1 \text{ of the values } X_{\sigma_i + 1}, X_{\sigma_i + 2}, \ldots \text{ equal } i \mid \sigma_i = k)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k)\, P_i(\text{at least } m - 1 \text{ of the values } X_1, X_2, \ldots \text{ equal } i)$$

$$= \sum_{k=1}^{\infty} f_{ii}^{(k)} (f_{ii})^{m-1} = (f_{ii})^m.$$

Hence it follows, in particular, that formula (17) holds for a recurrent state $i$. If the state is nonrecurrent, then

$$P_i\{X_n = i \text{ for infinitely many } n\} = 0. \qquad (18)$$
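The geometric law just proved, that the probability of at least $r$ returns equals $(f_{ii})^r$, is easy to test by simulation. For the walk on the integers that moves right with probability $p$ and left with probability $q = 1 - p$, $p > q$, the classical walk computations give $f_{00} = 2q$; we use that value here as an assumption in the following sketch (our own illustration, not part of the text):

```python
import random

def num_returns(p, steps, rng):
    """Number of returns to 0 of the biased walk within `steps` steps."""
    x, r = 0, 0
    for _ in range(steps):
        x += 1 if rng.random() < p else -1
        if x == 0:
            r += 1
    return r

rng = random.Random(1)
p = 0.7
f00 = 2 * (1 - p)                  # = 0.6 for this transient walk (assumed)
trials = 20000
est = sum(num_returns(p, 300, rng) >= 2 for _ in range(trials)) / trials
print(est, f00 ** 2)               # both close to 0.36
```

Truncating at 300 steps is harmless here, since for a transient walk late returns are overwhelmingly unlikely.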

We now turn to the proof that $\tilde f_{ij} = 1$. Since the state $i$ is recurrent, we have, by (17) and the strong Markov property,

$$1 = P_i(X_n = i \text{ for infinitely many } n)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_j = k, \text{ infinitely many values of } X_{\sigma_j + 1}, X_{\sigma_j + 2}, \ldots \text{ equal } i) + P_i(\sigma_j = \infty, X_n = i \text{ for infinitely many } n)$$

$$= \sum_{k=1}^{\infty} P_i(\sigma_j = k)\, P_j(\text{infinitely many values of } X_1, X_2, \ldots \text{ equal } i) + P_i(\sigma_j = \infty)$$

$$= \sum_{k=1}^{\infty} f_{ij}^{(k)}\, \tilde f_{ji} + (1 - f_{ij}) = f_{ij}\, \tilde f_{ji} + (1 - f_{ij}).$$

Thus

$$1 = f_{ij}\, \tilde f_{ji} + (1 - f_{ij}),$$

that is, $f_{ij}(1 - \tilde f_{ji}) = 0$.


Since $i \leftrightarrow j$, we have $f_{ij} > 0$, and consequently $\tilde f_{ji} = 1$ and $f_{ji} = 1$; interchanging the roles of $i$ and $j$ (both states are recurrent and communicate), we obtain in the same way $\tilde f_{ij} = 1$ and $f_{ij} = 1$. Therefore if we assume (13), it follows from (14) and the equation $f_{ij} = 1$ that, for communicating states $i$ and $j$,

$$p_{ij}^{(n)} \to \frac{1}{\mu_j}, \qquad n \to \infty.$$

As for (13), its validity follows from the theorem on dominated convergence, together with the remarks that

$$p_{jj}^{(n-k)} \to \frac{1}{\mu_j}, \qquad n \to \infty,$$

and $\sum_{k=1}^{\infty} f_{ij}^{(k)} = f_{ij} \le 1$.

This completes the proof of the lemma.

Next we consider periodic states.

Lemma 4. Let $j$ be a recurrent state and let $d(j) > 1$.

(a) If $i$ and $j$ belong to the same class (of states), with $i$ in the cyclic subclass $C_r$ and $j$ in $C_{r+a}$, then

$$p_{ij}^{(nd+a)} \to \frac{d}{\mu_j}. \qquad (19)$$

(b) For an arbitrary $i$,

$$p_{ij}^{(nd+a)} \to \left[\sum_{r=0}^{\infty} f_{ij}^{(rd+a)}\right] \frac{d}{\mu_j}, \qquad a = 0, 1, \ldots, d - 1. \qquad (20)$$

PROOF. (a) First let $a = 0$. With respect to the transition matrix $\mathbb{P}^d$, the state $j$ is recurrent and aperiodic. Consequently, by (8),

$$p_{ij}^{(nd)} \to \frac{1}{\sum_{k=1}^{\infty} k f_{jj}^{(kd)}} = \frac{d}{\sum_{k=1}^{\infty} kd\, f_{jj}^{(kd)}} = \frac{d}{\mu_j}.$$

Suppose that (19) has been proved for $a = r$. Then, for a state $j$ lying $r + 1$ subclasses ahead of the subclass of $i$,

$$p_{ij}^{(nd+r+1)} = \sum_{\alpha} p_{i\alpha}\, p_{\alpha j}^{(nd+r)} \to \frac{d}{\mu_j},$$

since $p_{i\alpha} > 0$ only for $\alpha$ in the subclass following that of $i$, for each such $\alpha$ we have $p_{\alpha j}^{(nd+r)} \to d/\mu_j$ by the induction hypothesis, and $\sum_{\alpha} p_{i\alpha} = 1$ (dominated convergence).

(b) Clearly

$$p_{ij}^{(nd+a)} = \sum_{k=1}^{nd+a} f_{ij}^{(k)}\, p_{jj}^{(nd+a-k)}.$$

State $j$ has period $d$, and therefore $p_{jj}^{(nd+a-k)} = 0$ except when $k - a$ has the form $rd$. Therefore

$$p_{ij}^{(nd+a)} = \sum_{r=0}^{n} f_{ij}^{(rd+a)}\, p_{jj}^{((n-r)d)},$$

and the required result (20) follows from (19). This completes the proof of the lemma.


Lemmas 2–4 imply, in particular, the following result about the limits of $p_{ij}^{(n)}$.

Theorem 1. Let a Markov chain be indecomposable (that is, let its states form a single class of essential communicating states) and aperiodic. Then:

(a) if all states are either null or nonrecurrent, then, for all $i$ and $j$,

$$p_{ij}^{(n)} \to 0, \qquad n \to \infty; \qquad (21)$$

(b) if all states $j$ are positive, then, for all $i$,

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad n \to \infty. \qquad (22)$$

4. Let us discuss the conclusion of this theorem in the case of a Markov chain with a finite number of states, $E = \{1, 2, \ldots, r\}$. Let us suppose that the chain is indecomposable and aperiodic. It turns out that it is then automatically recurrent and positive:

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Rightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \qquad (23)$$

For the proof, suppose that all states are nonrecurrent. Then, by (21) and the finiteness of the set of states of the chain,

$$1 = \lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \lim_n p_{ij}^{(n)} = 0. \qquad (24)$$

The resulting contradiction shows that not all states can be nonrecurrent. Let $i_0$ be a recurrent state and $j$ an arbitrary state. Since $i_0 \leftrightarrow j$, Lemma 1 shows that $j$ is also recurrent. Thus all states of an aperiodic indecomposable chain are recurrent.

Let us now show that all recurrent states are positive. If we suppose that they are all null states, we again obtain a contradiction with (24). Consequently there is at least one positive state, say $i_0$. Let $i$ be any other state. Since $i \leftrightarrow i_0$, there are $s$ and $t$ such that $p_{i i_0}^{(s)} > 0$ and $p_{i_0 i}^{(t)} > 0$, and therefore

$$p_{ii}^{(n+s+t)} \ge p_{i i_0}^{(s)}\, p_{i_0 i_0}^{(n)}\, p_{i_0 i}^{(t)} \to p_{i i_0}^{(s)} \cdot \frac{1}{\mu_{i_0}} \cdot p_{i_0 i}^{(t)} > 0. \qquad (25)$$

Hence there is a positive $\varepsilon$ such that $p_{ii}^{(n)} \ge \varepsilon > 0$ for all sufficiently large $n$. But $p_{ii}^{(n)} \to 1/\mu_i$, and therefore $1/\mu_i > 0$, that is, the state $i$ is positive. Consequently (23) is established.

Let $\pi_j = 1/\mu_j$. Then $\pi_j > 0$ by (22), and since

$$1 = \lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \pi_j,$$


the (aperiodic indecomposable) chain is ergodic. Clearly, for every ergodic finite chain,

there is an $n_0$ such that $\min_{i,j} p_{ij}^{(n)} > 0$ for all $n \ge n_0$. $\qquad$ (26)

It was shown in §12 of Chapter I that the converse is also valid: (26) implies ergodicity. Consequently we have the following implications:

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Leftrightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \Rightarrow \text{ergodicity} \Leftrightarrow (26).$$

However, we can prove more.

Theorem 2. For a finite Markov chain,

$$\begin{pmatrix}\text{indecomposability}\\ d = 1\end{pmatrix} \Leftrightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix} \Leftrightarrow \text{ergodicity} \Leftrightarrow (26).$$

PROOF. We have only to establish

$$\text{ergodicity} \Rightarrow \begin{pmatrix}\text{indecomposability}\\ \text{recurrence}\\ \text{positivity}\\ d = 1\end{pmatrix}.$$

Indecomposability follows from (26). As for aperiodicity, recurrence, and positivity, they are valid in more general situations (the existence of a limiting distribution is sufficient), as will be shown in Theorem 2, §4.
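Condition (26) can be checked mechanically by examining powers of the transition matrix until every entry is positive. A small sketch (the three-state chain below is our own example of an indecomposable aperiodic chain, not from the text):

```python
def min_entry_power(P, n):
    """Minimum entry of P^n; (26) asks this to be > 0 for all large n."""
    m = len(P)
    M = [row[:] for row in P]
    for _ in range(n - 1):
        M = [[sum(M[i][k] * P[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return min(min(row) for row in M)

# cycles of lengths 2 and 3 coexist, so the chain is aperiodic
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
print(min_entry_power(P, 1), min_entry_power(P, 8))  # 0.0, then positive
```

For such a primitive matrix, some finite power already has all entries positive, which is exactly condition (26).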

5. PROBLEMS

1. Consider an indecomposable chain with states $0, 1, 2, \ldots$. A necessary and sufficient condition for it to be nonrecurrent is that the system of equations $u_i = \sum_j p_{ij} u_j$, $i \ne 0$, has a bounded solution with $u_i \not\equiv c$, $i = 0, 1, \ldots$.

2. A sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent is that there is a sequence $(u_0, u_1, \ldots)$ with $u_i \to \infty$, $i \to \infty$, such that $\sum_j p_{ij} u_j \le u_i$ for all $i \ne 0$.

3. A necessary and sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent and positive is that the system of equations $u_j = \sum_i u_i p_{ij}$, $j = 0, 1, \ldots$, has a solution, not identically zero, such that $\sum_i |u_i| < \infty$.


4. Consider a Markov chain with states $0, 1, \ldots$ and transition probabilities $p_{01} = p_0 > 0$,

$$p_{ij} = \begin{cases} p_i > 0, & j = i + 1, \\ r_i \ge 0, & j = i, \\ q_i > 0, & j = i - 1, \\ 0 & \text{otherwise}. \end{cases}$$

Let $\rho_0 = 1$, $\rho_m = (q_1 \cdots q_m)/(p_1 \cdots p_m)$. Prove the following propositions:

the chain is recurrent $\Leftrightarrow \sum \rho_m = \infty$;
the chain is nonrecurrent $\Leftrightarrow \sum \rho_m < \infty$;
the chain is positive $\Leftrightarrow \sum \dfrac{1}{\rho_m p_m} < \infty$;
the chain is null $\Leftrightarrow \sum \rho_m = \infty$ and $\sum \dfrac{1}{\rho_m p_m} = \infty$.

5. Show that $f_{ik} \ge f_{ij} f_{jk}$ and $\sup_n p_{ik}^{(n)} \le f_{ik} \le \sum_n p_{ik}^{(n)}$.

6. Show that for every Markov chain with countably many states the limit of $p_{ij}^{(n)}$ always exists in the Cesàro sense:

$$\lim_n \frac{1}{n} \sum_{k=1}^{n} p_{ij}^{(k)} = \frac{f_{ij}}{\mu_j}.$$

7. Consider a Markov chain $\xi_0, \xi_1, \ldots$ with $\xi_{k+1} = (\xi_k - 1)^+ + \eta_{k+1}$, where $\eta_1, \eta_2, \ldots$ is a sequence of independent identically distributed random variables with $P(\eta_k = j) = p_j$, $j = 0, 1, \ldots$. Write out the transition matrix and show that if $p_0 > 0$, $p_0 + p_1 < 1$, the chain is recurrent if and only if $\sum_k k p_k \le 1$.
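Problem 6 can be illustrated numerically: even when $p_{jj}^{(n)}$ oscillates, as for a periodic chain, the Cesàro averages converge to $f_{jj}/\mu_j$. A sketch on the deterministic two-state flip chain (our own example, for which $f_{00} = 1$ and $\mu_0 = 2$):

```python
def cesaro_average(P, j, n=10000):
    """(1/n) * sum_{k=1}^n p_jj^(k), computed by propagating a row vector."""
    m = len(P)
    row = [1.0 if i == j else 0.0 for i in range(m)]
    acc = 0.0
    for _ in range(n):
        row = [sum(row[i] * P[i][r] for i in range(m)) for r in range(m)]
        acc += row[j]
    return acc / n

P = [[0.0, 1.0], [1.0, 0.0]]   # flip chain: p_00^(n) is 0, 1, 0, 1, ...
print(cesaro_average(P, 0))    # ≈ 1/2 = f_00 / mu_0
```

The ordinary limit of $p_{00}^{(n)}$ does not exist here, but the Cesàro limit does, in accordance with Problem 6.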

§4. On the Existence of Limits and of Stationary Distributions

1. We begin with some necessary conditions for the existence of stationary distributions.

Theorem 1. Let a Markov chain with countably many states $E = \{1, 2, \ldots\}$ and transition matrix $\mathbb{P} = \|p_{ij}\|$ be such that the limits

$$\lim_n p_{ij}^{(n)} = \pi_j$$

exist for all $i$ and $j$ and do not depend on $i$.


Then

(a) $\sum_j \pi_j \le 1$, $\sum_i \pi_i p_{ij} = \pi_j$;
(b) either all $\pi_j = 0$, or $\sum_j \pi_j = 1$;
(c) if all $\pi_j = 0$, there is no stationary distribution; if $\sum_j \pi_j = 1$, then $\Pi = (\pi_1, \pi_2, \ldots)$ is the unique stationary distribution.

PROOF. By Fatou's lemma,

$$\sum_j \pi_j = \sum_j \lim_n p_{ij}^{(n)} \le \liminf_n \sum_j p_{ij}^{(n)} = 1.$$

Moreover,

$$\sum_i \pi_i p_{ij} = \sum_i \left(\lim_n p_{ki}^{(n)}\right) p_{ij} \le \liminf_n \sum_i p_{ki}^{(n)} p_{ij} = \liminf_n p_{kj}^{(n+1)} = \pi_j,$$

that is, for each $j$,

$$\sum_i \pi_i p_{ij} \le \pi_j.$$

Suppose that $\sum_i \pi_i p_{ij_0} < \pi_{j_0}$ for some $j_0$. Then

$$\sum_j \pi_j > \sum_j \left(\sum_i \pi_i p_{ij}\right) = \sum_i \pi_i \sum_j p_{ij} = \sum_i \pi_i.$$

This contradiction shows that

$$\sum_i \pi_i p_{ij} = \pi_j \qquad (1)$$

for all $j$.

It follows from (1), by iteration, that

$$\pi_j = \sum_i \pi_i p_{ij}^{(n)}, \qquad n \ge 1.$$

Therefore, letting $n \to \infty$ and using the theorem on dominated convergence, we obtain

$$\pi_j = \sum_i \pi_i \lim_n p_{ij}^{(n)} = \left(\sum_i \pi_i\right)\pi_j,$$

that is, for all $j$,

$$\pi_j = \pi_j \sum_i \pi_i,$$

from which (b) follows.


Now let $Q = (q_1, q_2, \ldots)$ be a stationary distribution. Since $\sum_i q_i p_{ij}^{(n)} = q_j$, letting $n \to \infty$ we obtain $\sum_i q_i \pi_j = q_j$, that is, $\pi_j = q_j$ for all $j$, and therefore this stationary distribution must coincide with $\Pi = (\pi_1, \pi_2, \ldots)$. Therefore if all $\pi_j = 0$, there is no stationary distribution; if, however, $\sum_j \pi_j = 1$, then $\Pi = (\pi_1, \pi_2, \ldots)$ is the unique stationary distribution. This completes the proof of the theorem.

Let us state and prove a fundamental result on the existence of a unique stationary distribution.

Theorem 2. For Markov chains with countably many states, there is a unique stationary distribution if and only if the set of states contains precisely one positive recurrent class (of essential communicating states).

PROOF. Let $N$ be the number of positive recurrent classes. Suppose $N = 0$. Then all states are either nonrecurrent or recurrent null states, and by (3.10) and (3.20), $\lim_n p_{ij}^{(n)} = 0$ for all $i$ and $j$. Consequently, by Theorem 1, there is no stationary distribution.

Let $N = 1$ and let $C$ be the unique positive recurrent class. If $d(C) = 1$ we have, by (3.8),

$$p_{ij}^{(n)} \to \frac{1}{\mu_j} > 0, \qquad i, j \in C.$$

If $j \notin C$, then $j$ is nonrecurrent, and $p_{ij}^{(n)} \to 0$ for all $i$ as $n \to \infty$, by (3.7). Put

$$q_j = \begin{cases} \dfrac{1}{\mu_j} > 0, & j \in C, \\ 0, & j \notin C. \end{cases}$$

Then, by Theorem 1, the set $Q = (q_1, q_2, \ldots)$ is the unique stationary distribution.

Now let $d = d(C) > 1$, and let $C_0, \ldots, C_{d-1}$ be the cyclic subclasses. With respect to $\mathbb{P}^d$, each subclass $C_k$ is a recurrent aperiodic class. Then if $i$ and $j \in C_k$ we have

$$p_{ij}^{(nd)} \to \frac{d}{\mu_j} > 0$$

by (3.19). Therefore on each set $C_k$ the numbers $d/\mu_j$, $j \in C_k$, form (with respect to $\mathbb{P}^d$) the unique stationary distribution. Hence it follows, in particular, that $\sum_{j \in C_k} (d/\mu_j) = 1$, that is, $\sum_{j \in C_k} (1/\mu_j) = 1/d$. Let us put

$$q_j = \begin{cases} \dfrac{1}{\mu_j}, & j \in C = C_0 + \cdots + C_{d-1}, \\ 0, & j \notin C, \end{cases}$$


and show that, for the original chain, the set $Q = (q_1, q_2, \ldots)$ is the unique stationary distribution. In fact, for $i \in C$,

$$p_{ii}^{(nd)} = \sum_{j \in C} p_{ij}^{(nd-1)}\, p_{ji}.$$

Then, by Fatou's lemma,

$$\frac{d}{\mu_i} = \lim_n p_{ii}^{(nd)} \ge \sum_{j \in C} \lim_n p_{ij}^{(nd-1)}\, p_{ji} = \sum_{j \in C} \frac{d}{\mu_j}\, p_{ji},$$

and therefore

$$\frac{1}{\mu_i} \ge \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}.$$

But

$$\sum_{i \in C} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \left(\sum_{i \in C_k} \frac{1}{\mu_i}\right) = \sum_{k=0}^{d-1} \frac{1}{d} = 1.$$

As in Theorem 1, it can now be shown that in fact

$$\frac{1}{\mu_i} = \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}.$$

This shows that the set $Q = (q_1, q_2, \ldots)$ is a stationary distribution, which is unique by Theorem 1.

Now let there be $N \ge 2$ positive recurrent classes. Denote them by $C^1, \ldots, C^N$, and let $Q^i = (q_1^i, q_2^i, \ldots)$ be the stationary distribution corresponding to the class $C^i$, constructed according to the formula

$$q_j^i = \begin{cases} \dfrac{1}{\mu_j} > 0, & j \in C^i, \\ 0, & j \notin C^i. \end{cases}$$

Then, for all nonnegative numbers $a_1, \ldots, a_N$ such that $a_1 + \cdots + a_N = 1$, the set $a_1 Q^1 + \cdots + a_N Q^N$ also forms a stationary distribution, since

$$(a_1 Q^1 + \cdots + a_N Q^N)\mathbb{P} = a_1 Q^1 \mathbb{P} + \cdots + a_N Q^N \mathbb{P} = a_1 Q^1 + \cdots + a_N Q^N.$$

Hence it follows that when $N \ge 2$ there is a continuum of stationary distributions. Therefore there is a unique stationary distribution only in the case $N = 1$. This completes the proof of the theorem.

2. The following theorem answers the question of when there is a limit distribution for a Markov chain with a countable set of states $E$.


Theorem 3. A necessary and sufficient condition for the existence of a limit distribution is that there is, in the set $E$ of states of the chain, exactly one aperiodic positive recurrent class $C$ such that $f_{ij} = 1$ for all $j \in C$ and $i \in E$.

PROOF. Necessity. Let $q_j = \lim_n p_{ij}^{(n)}$ and let $Q = (q_1, q_2, \ldots)$ be a distribution ($q_i \ge 0$, $\sum_i q_i = 1$). Then by Theorem 1 this limit distribution is the unique stationary distribution, and therefore by Theorem 2 there is one and only one positive recurrent class $C$. Let us show that this class has period $d = 1$. Suppose the contrary, that is, let $d > 1$, and let $C_0, C_1, \ldots, C_{d-1}$ be the cyclic subclasses. If $i \in C_0$ and $j \in C_1$, then by (19), $p_{ij}^{(nd+1)} \to d/\mu_j$ and $p_{ij}^{(nd)} = 0$ for all $n$. But $d/\mu_j > 0$, and therefore $p_{ij}^{(n)}$ does not have a limit as $n \to \infty$; this contradicts the hypothesis that $\lim_n p_{ij}^{(n)}$ exists.

Now let $j \in C$ and $i \in E$. Then, by (3.11), $p_{ij}^{(n)} \to f_{ij}/\mu_j$. Consequently $\pi_j = f_{ij}/\mu_j$. But $\pi_j$ is independent of $i$. Therefore $f_{ij} = f_{jj} = 1$.

Sufficiency. By (3.11), (3.10) and (3.7),

$$p_{ij}^{(n)} \to \begin{cases} \dfrac{f_{ij}}{\mu_j}, & j \in C,\ i \in E, \\ 0, & j \notin C,\ i \in E. \end{cases}$$

Therefore if $f_{ij} = 1$ for all $j \in C$ and $i \in E$, then $q_j = \lim_n p_{ij}^{(n)}$ is independent of $i$. The class $C$ is positive, and therefore $q_j > 0$ for $j \in C$. Then, by Theorem 1, we have $\sum_j q_j = 1$, and the set $Q = (q_1, q_2, \ldots)$ is a limit distribution.

3. Let us summarize the results obtained above on the existence of a limit distribution, the uniqueness of a stationary distribution, and ergodicity, for the case of finite chains.

Theorem 4. We have the following implications for finite Markov chains:

(ergodicity) $\overset{\{1\}}{\Longleftrightarrow}$ (the chain is indecomposable, recurrent, positive, with $d = 1$)

$\Downarrow$

(a limit distribution exists) $\overset{\{2\}}{\Longleftrightarrow}$ (there exists exactly one recurrent positive class, with $d = 1$)

$\Downarrow$

(a unique stationary distribution exists) $\overset{\{3\}}{\Longleftrightarrow}$ (there exists exactly one recurrent positive class)

PROOF. The "vertical" implications are evident. {1} is established in Theorem 2, §3; {2} in Theorem 3; {3} in Theorem 2.

4. PROBLEMS

1. Show that, in Example 1 of §5, neither a stationary nor a limit distribution exists.

2. Discuss the question of stationary and limit distributions for the Markov chain with transition matrix

$$\mathbb{P} = \begin{pmatrix} \tfrac12 & 0 & \tfrac12 & 0 \\ 0 & 0 & 0 & 1 \\ \tfrac14 & \tfrac14 & \tfrac12 & 0 \\ 0 & \tfrac12 & \tfrac12 & 0 \end{pmatrix}.$$

3. Let $\mathbb{P} = \|p_{ij}\|$ be a finite doubly stochastic matrix, that is, $\sum_{i=1}^{m} p_{ij} = 1$, $j = 1, \ldots, m$. Show that the stationary distribution of the corresponding Markov chain is the vector $Q = (1/m, \ldots, 1/m)$.
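The claim of Problem 3 is quick to confirm numerically for any particular doubly stochastic matrix; a sketch (the matrix below is our own example, with both rows and columns summing to 1):

```python
def is_stationary(pi, P, tol=1e-10):
    """Check the stationarity equations: sum_i pi_i p_ij = pi_j for all j."""
    n = len(P)
    return all(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) < tol
               for j in range(n))

# doubly stochastic: every row and every column sums to 1
P = [[0.2, 0.3, 0.5],
     [0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2]]
m = len(P)
print(is_stationary([1.0 / m] * m, P))   # True
```

Since each column sums to 1, the uniform vector satisfies $\sum_i (1/m) p_{ij} = 1/m$ exactly, which is what the check verifies.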

§5. Examples

1. We present a number of examples to illustrate the concepts introduced above and the results on the classification and limit behavior of transition probabilities.

EXAMPLE 1. A simple random walk is a Markov chain in which, at each step, a particle passes to one of its two neighboring states with certain probabilities. The simple random walk considered here describes the motion of a particle among the states $E = \{0, \pm 1, \ldots\}$, with transitions one unit to the right with probability $p$ and one unit to the left with probability $q$. It is clear that the transition probabilities are

$$p_{ij} = \begin{cases} p, & j = i + 1, \\ q, & j = i - 1, \\ 0 & \text{otherwise}, \end{cases} \qquad p + q = 1.$$

If $p = 0$, the particle moves deterministically to the left; if $p = 1$, to the right. These cases are of little interest, since all states are then inessential. We therefore assume that $0 < p < 1$. Under this assumption the states of the chain form a single class (of essential communicating states). A particle can return to each state only after $2, 4, 6, \ldots$ steps; hence the chain has period $d = 2$.

Since, for each $i \in E$,

$$p_{ii}^{(2n)} = C_{2n}^n (pq)^n = \frac{(2n)!}{(n!)^2}(pq)^n,$$

then by Stirling's formula (which says $n! \sim \sqrt{2\pi n}\, n^n e^{-n}$) we have

$$p_{ii}^{(2n)} \sim \frac{(4pq)^n}{\sqrt{\pi n}}.$$

Therefore $\sum_n p_{ii}^{(2n)} = \infty$ if $p = q$, and $\sum_n p_{ii}^{(2n)} < \infty$ if $p \ne q$. In other words, the chain is recurrent if $p = q$, but if $p \ne q$ it is nonrecurrent.

It was shown in §10, Chapter I, that $f_{ii}^{(2n)} \sim 1/(2\sqrt{\pi}\, n^{3/2})$, $n \to \infty$, if $p = q = \frac12$. Therefore $\mu_i = \sum_n (2n) f_{ii}^{(2n)} = \infty$, that is, all recurrent states are null states. Hence, by Theorem 1 of §3, $p_{ij}^{(n)} \to 0$ as $n \to \infty$ for all $i$ and $j$. There are no stationary, limit, or ergodic distributions.
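The dichotomy between $p = q$ and $p \ne q$ is visible in the partial sums of $p_{ii}^{(2n)}$. A sketch using log-gamma to avoid overflow (our own numerical illustration, not part of the text):

```python
from math import lgamma, exp, log

def p00_2n(n, p):
    """p_00^(2n) = C(2n, n) (pq)^n, evaluated in logarithms."""
    log_binom = lgamma(2 * n + 1) - 2 * lgamma(n + 1)
    return exp(log_binom + n * log(p * (1 - p)))

def partial_sum(p, N):
    return sum(p00_2n(n, p) for n in range(1, N + 1))

print(partial_sum(0.5, 10000))   # ≈ 2*sqrt(N/pi): the series diverges
print(partial_sum(0.4, 10000))   # stays bounded: the series converges
```

For $p = q = \tfrac12$ the terms behave like $1/\sqrt{\pi n}$ and the sums grow like $2\sqrt{N/\pi}$, while for $p \ne q$ the factor $(4pq)^n < 1$ forces convergence.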

EXAMPLE 2. Consider a simple random walk with $E = \{0, 1, 2, \ldots\}$, where 0 is an absorbing barrier.

State 0 forms a unique positive recurrent class with $d = 1$. All other states are nonrecurrent. Therefore, by Theorem 2 of §4, there is a unique stationary distribution

$$\Pi = (\pi_0, \pi_1, \pi_2, \ldots)$$

with $\pi_0 = 1$ and $\pi_i = 0$, $i \ge 1$.

Let us now consider the question of limit distributions. Clearly $p_{00}^{(n)} = 1$ and $p_{ij}^{(n)} \to 0$, $j \ge 1$, $i \ge 0$. Let us now show that, for $i \ge 1$, the numbers $\alpha(i) = \lim_n p_{i0}^{(n)}$ are given by the formulas

$$\alpha(i) = \begin{cases} \left(\dfrac{q}{p}\right)^i, & p > q, \\ 1, & p \le q. \end{cases} \qquad (1)$$

We begin by observing that, since state 0 is absorbing, we have $p_{i0}^{(n)} = f_{i0}^{(1)} + \cdots + f_{i0}^{(n)}$, and consequently $\alpha(i) = f_{i0}$, that is, $\alpha(i)$ is the probability that a particle starting from state $i$ sooner or later reaches the null


state. By the method of §12, Chapter I (see also §2 of Chapter VII) we can obtain the recursion relation

$$\alpha(i) = p\,\alpha(i + 1) + q\,\alpha(i - 1), \qquad (2)$$

with $\alpha(0) = 1$. The general solution of this equation has the form

$$\alpha(i) = a + b\,(q/p)^i, \qquad (3)$$

and the condition $\alpha(0) = 1$ imposes the condition $a + b = 1$. If we suppose that $q > p$, then since $\alpha(i)$ is bounded we see at once that $b = 0$, and therefore $\alpha(i) = 1$. This is quite natural, since when $q > p$ the particle tends to move toward the null state. If, on the other hand, $p > q$, the opposite is true: the particle tends to move to the right, and so it is natural to expect that

$$\alpha(i) \to 0, \qquad i \to \infty, \qquad (4)$$

and consequently $a = 0$ and

$$\alpha(i) = (q/p)^i. \qquad (5)$$

To establish this equation we shall not start from (4), but proceed differently. In addition to the absorbing barrier at 0 we introduce an absorbing barrier at the integer point $N$. Let us denote by $\alpha_N(i)$ the probability that a particle that starts at $i$ reaches the zero state before reaching $N$. Then $\alpha_N(i)$ satisfies (2) with the boundary conditions $\alpha_N(0) = 1$, $\alpha_N(N) = 0$, and, as we have already shown in §9, Chapter I,

$$\alpha_N(i) = \frac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, \qquad 0 \le i \le N. \qquad (6)$$

Hence

$$\lim_N \alpha_N(i) = \left(\frac{q}{p}\right)^i,$$

and consequently to prove (5) we have only to show that

$$\alpha(i) = \lim_N \alpha_N(i). \qquad (7)$$
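The convergence in (7) is already visible numerically for moderate $N$; a sketch evaluating formula (6) (our own illustration, for the case $p > q$):

```python
def alpha_N(i, N, p):
    """Formula (6): probability of hitting 0 before N, for p != q."""
    r = (1 - p) / p                 # = q/p
    return (r ** i - r ** N) / (1 - r ** N)

p = 0.6                             # p > q, so absorption at 0 is uncertain
for N in (5, 10, 20, 40):
    print(N, alpha_N(3, N, p))      # increases towards (q/p)^3 ≈ 0.2963
```

As the second barrier $N$ recedes, $\alpha_N(i)$ increases monotonically to the limit $(q/p)^i$ of formula (5).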

This is intuitively clear; a formal proof can be given as follows. Let us suppose that the particle starts from a given state $i$. Then

$$\alpha(i) = P_i(A), \qquad (8)$$

where $A$ is the event that there is an $N$ such that a particle starting from $i$ reaches the zero state before reaching state $N$. If

$$A_N = \{\text{the particle reaches 0 before } N\},$$

then $A = \bigcup_{N=i+1}^{\infty} A_N$. It is clear that $A_N \subseteq A_{N+1}$ and

$$P_i\left(\bigcup_{N=i+1}^{\infty} A_N\right) = \lim_{N \to \infty} P_i(A_N). \qquad (9)$$

But $\alpha_N(i) = P_i(A_N)$, so that (7) follows directly from (8) and (9).

Thus if $p > q$ the limit $\lim_n p_{i0}^{(n)}$ depends on $i$, and consequently there is no limit distribution in this case. If, however, $p \le q$, then in all cases $\lim_n p_{i0}^{(n)} = 1$ and $\lim_n p_{ij}^{(n)} = 0$, $j \ge 1$. Therefore in this case the limit distribution has the form $\Pi = (1, 0, 0, \ldots)$.

EXAMPLE 3.

Consider a simple random walk with absorbing barriers at 0 and $N$.

Here there are two positive recurrent classes, $\{0\}$ and $\{N\}$. All other states $\{1, \ldots, N - 1\}$ are nonrecurrent. It follows from Theorem 1, §3, that there are infinitely many stationary distributions $\Pi = (\pi_0, \pi_1, \ldots, \pi_N)$ with $\pi_0 = a$, $\pi_N = b$, $\pi_1 = \cdots = \pi_{N-1} = 0$, where $a \ge 0$, $b \ge 0$, $a + b = 1$. From Theorem 4, §4, it also follows that there is no limit distribution. This is also a consequence of the equations (Subsection 2, §9, Chapter I)

$$\lim_{n \to \infty} p_{i0}^{(n)} = \begin{cases} \dfrac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, & p \ne q, \\[2mm] 1 - \dfrac{i}{N}, & p = q, \end{cases} \qquad (10)$$

$$\lim_n p_{iN}^{(n)} = 1 - \lim_n p_{i0}^{(n)}, \qquad \text{and} \qquad \lim_n p_{ij}^{(n)} = 0, \quad 1 \le j \le N - 1.$$

EXAMPLE

4. Consider a simple random walk with $E = \{0, 1, \ldots\}$ and a reflecting barrier at 0.

It is easy to see that the chain is periodic with period $d = 2$. Suppose that $p > q$ (the moving particle tends to move to the right). Let $i > 1$; to determine the probability $f_{i1}$ we may use formula (1), from which it follows that

$$f_{i1} = \left(\frac{q}{p}\right)^{i-1} < 1, \qquad i > 1.$$

All states of this chain communicate with each other. Therefore if state $i$ is recurrent, state 1 will also be recurrent. But (see the proof of Lemma 3 in §3) in that case $f_{i1}$ must be 1. Consequently when $p > q$ all the states of the chain are nonrecurrent. Therefore $p_{ij}^{(n)} \to 0$, $n \to \infty$, for all $i$ and $j \in E$, and there is neither a limit distribution nor a stationary distribution.

Now let $p \le q$. Then, by (1), $f_{i1} = 1$ for $i > 1$, and $f_{11} = q + p f_{21} = 1$. Hence the chain is recurrent. Consider the system of equations determining the stationary distribution $\Pi = (\pi_0, \pi_1, \ldots)$:

$$\pi_0 = \pi_1 q, \qquad \pi_1 = \pi_0 + \pi_2 q, \qquad \pi_2 = \pi_1 p + \pi_3 q, \qquad \ldots,$$

that is,

$$\pi_1 = \pi_1 q + \pi_2 q, \qquad \pi_2 = \pi_2 q + \pi_3 q, \qquad \ldots,$$

whence

$$\pi_j = \left(\frac{p}{q}\right)\pi_{j-1}, \qquad j = 2, 3, \ldots.$$

If $p = q$ we have $\pi_1 = \pi_2 = \cdots$, and consequently the condition $\sum_j \pi_j = 1$ cannot be satisfied. In other words, if $p = q$ there is no stationary distribution, and therefore no limit distribution. From this and Theorem 3, §4, it follows, in particular, that in this case all states of the chain are null states.

It remains to consider the case $p < q$. From the condition $\sum_{j=0}^{\infty} \pi_j = 1$ we find that

$$\pi_1 = \frac{q - p}{2q}$$

and

$$\pi_j = \frac{q - p}{2q}\left(\frac{p}{q}\right)^{j-1}, \qquad j \ge 2.$$

Therefore the distribution $\Pi$ is the unique stationary distribution. Hence when $p < q$ the chain is recurrent and positive (Theorem 2, §4). The distribution $\Pi$ is also a limit distribution and is ergodic.

EXAMPLE 5. Again consider a simple random walk, now with reflecting barriers at 0 and $N$ ($0 < p < 1$).
All the states of the chain are periodic with period $d = 2$, recurrent, and positive. According to Theorem 4 of §4, the chain is ergodic. Solving the system $\pi_j = \sum_{i=0}^{N} \pi_i p_{ij}$ subject to $\sum_{i=0}^{N} \pi_i = 1$, we obtain the ergodic distribution $\Pi = (\pi_0, \pi_1, \ldots, \pi_N)$, in which $\pi_j$ is proportional to $(p/q)^{j-1}$ for $2 \le j \le N - 1$.
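The ergodic distribution of Example 5 can also be found numerically. The sketch below assumes the simplest reading of the reflecting barriers (from 0 the particle moves to 1 with probability 1, and from $N$ to $N - 1$ with probability 1; an assumption on our part, since the diagram is not reproduced here), and computes the stationary vector as a Cesàro average, which converges despite the period $d = 2$:

```python
def reflecting_P(N, p):
    """Walk on {0, ..., N}, reflected with probability 1 at both barriers
    (our assumed boundary behavior)."""
    q = 1 - p
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    P[0][1] = 1.0
    P[N][N - 1] = 1.0
    for i in range(1, N):
        P[i][i + 1] = p
        P[i][i - 1] = q
    return P

def stationary(P, iters=4000):
    """Cesaro average of the distributions at times 1..iters."""
    n = len(P)
    pi = [1.0 / n] * n
    acc = [0.0] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        acc = [a + x for a, x in zip(acc, pi)]
    return [a / iters for a in acc]

P = reflecting_P(6, 0.4)
pi = stationary(P)
print(round(sum(pi), 6))    # 1.0: a probability vector solving pi P = pi
```

The Cesàro averaging step is exactly the device of Problem 6, §3: the ordinary iterates oscillate between the two cyclic subclasses, but their averages converge.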

2. EXAMPLE 6. It follows from Example 1 that the simple random walk considered there, on the integral points of the line, is recurrent if $p = q$ but nonrecurrent if $p \ne q$. Now let us consider simple random walks in the plane and in space from the point of view of recurrence or nonrecurrence. For the plane, we suppose that a particle in any state $(i, j)$ moves up, down, to the right, or to the left, with probability $\frac14$ for each (Figure 41).

Figure 41. A walk in the plane.


For definiteness, consider the state (0, 0). Then the probability Pk = p~~. o).(O, o> of going from (0, 0) to (0, 0) ink steps is given by p2n+l

p2n

= 0, =

n = 0, 1, 2, ... ,

._,

L,

{(i,j):i+j=n,OS:iS:n}

(2n)! 1 2n ~(4) ' l.l.J.J.

n = 1, 2, ....

Multiplying numerators and denominators by (n!) 2 , we obtain n

p 2n = (1.)2" C"2n ._, C;n C"i = (l)Z"(C" )z 4 f..., n 4 2n ' i=O

smce n

._, C; c•-i L.. n n

i=O

= C"2n •

Applying Stirling's formula, we find that

Pzn

~

1 nn

-,

and therefore I P 2 • = oo. Consequently the state (0, 0) (likewise any other) is recurrent. It turns out, however, that in three or more dimensions the symmetric random walk is nonrecurrent. Let us prove this for walks on the integral points (i,j, k) in space. Let us suppose that a particle moves from (i, j, k) by one unit along a coordinate direction, with probability i for each.


Then if Pk is the probability of going from (0, 0, 0) to (0, 0, 0) ink steps, we have

Pzn+ 1 2n -

2

-

'1)2 ( l.

L..

{(i,j):Oo>i+ j,;n, O,;i,;n, O,;j,;n)

-- -12n c 2n <

n

0,

"

_

p

=

=

0, 1, ... ,

1 2n (2n)! (6) (} '1)2(( . n- l· - }') 1)2 .

I

[

_1_ cn _1_cn 2n 3" 22n

·1 '1

!..} ·

{(i,j):Oo>i+ j,;n, O,;i,;n, O,;j,;n)

"





(n -

l -

. 1 '1 (

L..

{(i,j):Oo;i+ j,;n, O,;i,;n, O,;j,;n) l.J ·

n

]2 C1_)2n 3

·1 J) · n!

n-

1_ n

') 1 (3)

.

l-} .

(11)

1, 2, ... ,

=

where

en=

max

{(i,j):Oo>i+jo>n, O,;i,;n, O,;j,;n)

[

J

n! . i!j! (n- i- j)!

(12)

Let us show that when n is large, the max in (12) is attained for i ~ n/3, j ~ n/3. Let i0 and j 0 be the values at which the max is attained. Then the following inequalities are evident:

n! n! ---<----------------~ ~----~~-jo! (i 0 - 1)! (n- jo- io + 1)!- jo! io! (n- jo- io)!' n!

n!

1)! -jo- io +-~~ Uo --l)!io!(n -jo- io- 1)!+ l)!(n ---------~--------~<-----------------

j 0 !(io

n! - j 0 - i 0 -1)!' Uo + 1)!i0 !(n -<---------~ ---------whence

n - i0

-

1 ::;; 2j 0

::;;

n - i0

n - j0

-

1 ::;; 2i 0

::;;

n - j0

+ +

1, 1,

and therefore we have, for large n, i 0 ~ n/3,j 0 ~ n/3, and

cn ~ By .Stirling's formula,

n!

554

VIII. Sequences of Random Variables That Form Markov Chains

and since 00

n~l

3)3

2n312n312 < oo,

we have Ln P zn < oo. Consequently the state (0, 0, 0), and likewise any other state, is nonrecurrent. A similar result holds for dimensions greater than 3. Thus we have the following result (P6lya): Theorem. For R 1 and R 2 , the symmetric random walk is recurrent; for R", n ?: 3, it is nonrecurrent.
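As a numerical companion to the bound just derived (an added sketch, not part of the original text; Python standard library only), one can compute the three-dimensional return probabilities exactly and observe the $n^{-3/2}$ decay that makes $\sum p_{2n}$ converge:

```python
from math import comb, factorial, pi, sqrt

def p2n_space(n):
    # Exact return probability after 2n steps for the symmetric walk on Z^3:
    # p_{2n} = C(2n,n)/2^(2n) * sum over i+j<=n of [n!/(i! j! (n-i-j)!)]^2 / 9^n.
    s = sum((factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))) ** 2
            for i in range(n + 1) for j in range(n - i + 1))
    return comb(2 * n, n) * s / (2 ** (2 * n) * 9 ** n)

# p_2 = 1/6 (six equally likely two-step returns); for each n shown, p_{2n}
# lies below the bound 3*sqrt(3)/(2*(pi*n)^(3/2)), and successive terms
# shrink roughly by the factor 2^(-3/2).
for n in (1, 5, 20):
    print(n, p2n_space(n), 3 * sqrt(3) / (2 * (pi * n) ** 1.5))
```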

3. PROBLEMS

1. Derive the recursion relation (1).

2. Establish (4).

3. Show that in Example 5 all states are aperiodic, recurrent, and positive.

4. Classify the states of a Markov chain with transition matrix

$$\begin{pmatrix} p & q & 0 \\ 0 & 0 & 1 \\ 0 & p & q \end{pmatrix},$$

where $p + q = 1$, $p \ge 0$, $q \ge 0$.

Historical and Bibliographical Notes

Introduction

The history of probability theory up to the time of Laplace is described by Todhunter [T1]. The period from Laplace to the end of the nineteenth century is covered by Gnedenko and Sheinin in [K10]. Maistrov [M1] discusses the history of probability theory from the beginning to the thirties of the present century. There is a brief survey in Gnedenko [G4]. For the origin of much of the terminology of the subject see Aleksandrova [A3].

For the basic concepts see Kolmogorov [K8], Gnedenko [G4], Borovkov [B4], Gnedenko and Khinchin [G6], A. M. and I. M. Yaglom [Y1], Prohorov and Rozanov [P5], Feller [F1, F2], Neyman [N3], Loève [L7], and Doob [D3]. We also mention [M3], which contains a large number of problems on probability theory.

In putting this text together, the author has consulted a wide range of sources. We mention particularly the books by Breiman [B5], Ash [A4, A5], and Ash and Gardner [A6], which (in the author's opinion) contain an excellent selection and presentation of material.

For current work in the field see, for example, Annals of Probability (formerly Annals of Mathematical Statistics) and Theory of Probability and its Applications (translation of Teoriya Veroyatnostei i ee Primeneniya). Mathematical Reviews and Zentralblatt für Mathematik contain abstracts of current papers on probability and mathematical statistics from all over the world.

For tables for use in computations, see [A1].


Chapter I

§1. Concerning the construction of probabilistic models see Kolmogorov [K7] and Gnedenko [G4]. For further material on problems of distributing objects among boxes see, e.g., Kolchin, Sevastyanov and Chistyakov [K3].

§2. For other probabilistic models (in particular, the one-dimensional Ising model) that are used in statistical physics, see Isihara [I3].

§3. Bayes's formula and theorem form the basis for the "Bayesian approach" to mathematical statistics. See, for example, De Groot [D1] and Zacks [Z1].

§4. A variety of problems about random variables and their probabilistic description can be found in Meshalkin [M3].

§5. A combinatorial proof of the law of large numbers (originating with James Bernoulli) is given in, for example, Feller [F1]. For the empirical meaning of the law of large numbers see Kolmogorov [K7].

§6. For sharper forms of the local and integral theorems, and of Poisson's theorem, see Borovkov [B4] and Prohorov [P3].

§7. The examples of Bernoulli schemes illustrate some of the basic concepts and methods of mathematical statistics. For more detailed discussions see, for example, Cramér [C5] and van der Waerden [W1].

§8. Conditional probability and conditional expectation with respect to a partition will help the reader understand the concepts of conditional probability and conditional expectation with respect to σ-algebras, which will be introduced later.

§9. The ruin problem was considered in essentially the present form by Laplace. See Gnedenko and Sheinin [K10]. Feller [F1] contains extensive material from the same circle of ideas.

§10. Our presentation essentially follows Feller [F1]. The method for proving (10) and (11) is taken from Doherty [D2].

§11. Martingale theory is thoroughly covered in Doob [D3]. A different proof of the ballot theorem is given, for instance, in Feller [F1].

§12. There is extensive material on Markov chains in the books by Feller [F1], Dynkin [D4], Kemeny and Snell [K2], Sarymsakov [S1], and Sirazhdinov [S8]. The theory of branching processes is discussed by Sevastyanov [S3].

Chapter II

§1. Kolmogorov's axioms are presented in his book [K8].

§2. Further material on algebras and σ-algebras can be found in, for example, Kolmogorov and Fomin [K9], Neveu [N1], Breiman [B5], and Ash [A5].

§3. For a proof of Carathéodory's theorem see Loève [L7] or Halmos [H1].


§§4-5. More material on measurable functions is available in Halmos [H1].

§6. See also Kolmogorov and Fomin [K9], Halmos [H1], and Ash and Gardner [A6]. The Radon-Nikodym theorem is proved in these books. The inequality

$$P(|\xi| \ge \varepsilon) \le \frac{E\xi^2}{\varepsilon^2}$$

is sometimes called Chebyshev's inequality, and the inequality

$$P(|\xi| \ge \varepsilon) \le \frac{E|\xi|^r}{\varepsilon^r}, \qquad r > 0,$$

is called Markov's inequality. For Pratt's lemma see [P2].

§7. The definitions of conditional probability and conditional expectation with respect to a σ-algebra were given by Kolmogorov [K8]. For additional material see Breiman [B5] and Ash [A5]. The result quoted in the Corollary to Theorem 5 can be found in [M5].

§8. See also Borovkov [B4], Ash [A5], Cramér [C5], and Gnedenko [G4].

§9. Kolmogorov's theorem on the existence of a process with given finite-dimensional distributions is in his book [K8]. For Ionescu-Tulcea's theorem see also Neveu [N1] and Ash [A5]. The proof in the text follows [A5].

§§10-11. See also Kolmogorov and Fomin [K9], Ash [A5], Doob [D3], and Loève [L7].

§12. The theory of characteristic functions is presented in many books. See, for example, Gnedenko [G4], Gnedenko and Kolmogorov [G5], Ramachandran [R1], Lukacs [L8], and Lukacs and Laha [L9]. Our presentation of the connection between moments and semi-invariants follows Leonov and Shiryayev [L4].

§13. See also Ibragimov and Rozanov [I2], Breiman [B5], and Liptser and Shiryayev [L5].

Chapter III

§1. Detailed investigations of problems on weak convergence of probability measures are given in Gnedenko and Kolmogorov [G5] and Billingsley [B3].

§2. Prohorov's theorem appears in his paper [P4].

§3. The monograph [G5] by Gnedenko and Kolmogorov studies the limit theorems of probability theory by the method of characteristic functions. See also Billingsley [B3]. Problem 2 includes both Bernoulli's law of large numbers and Poisson's law of large numbers (which assumes that $\xi_1, \xi_2, \ldots$ are independent and take only two values (1 and 0), but in general are differently distributed: $P(\xi_i = 1) = p_i$, $P(\xi_i = 0) = 1 - p_i$, $i \ge 1$).


§4. Concerning "nonclassical" conditions for the validity of the central limit theorem, see Zolotarev [Z2, Z3] and Rotar [R3]. The statement and proof of Theorem 1 were given by Rotar. The lemma in Subsection 6 is also due to him.

§5. Here the presentation follows Gnedenko and Kolmogorov [G5] and Ash [A5].

§6. The proof follows Sazonov [S2].

§7. See also Valkeila [V1].

Chapter IV

§1. Kolmogorov's zero-or-one law appears in his book [K8]. For the Hewitt-Savage zero-or-one law see also Borovkov [B4], Breiman [B5], and Ash [A5].

§§2-4. Here the fundamental results were obtained by Kolmogorov and Khinchin (see [K8] and the references given there). See also Petrov [P1] and Stout [S9]. For probabilistic methods in number theory see Kubilius [K11].

Chapter V

§§1-3. Our exposition of the theory of (strict sense) stationary random processes is based on Breiman [B5], Sinai [S7], and Lamperti [L2]. The simple proof of the maximal ergodic theorem was given by Garsia [G1].

Chapter VI

§1. The books by Rozanov [R4] and Gihman and Skorohod [G2, G3] are devoted to the theory of (wide sense) stationary random processes. Example 6 was frequently presented in Kolmogorov's lectures.

§2. For orthogonal stochastic measures and stochastic integrals see also Doob [D3], Gihman and Skorohod [G3], Rozanov [R4], and Ash and Gardner [A6].

§3. The spectral representation (2) was obtained by Cramér and Loève (see, for example, [L7]). The same representation (in different language) is contained in Kolmogorov [K5]. Also see Doob [D3], Rozanov [R4], and Ash and Gardner [A6].

§4. There is a detailed exposition of problems of statistical estimation of the covariance function and spectral density in Hannan [H2, H3].

§§5-6. See also Rozanov [R4], Lamperti [L2], and Gihman and Skorohod [G2, G3].

§7. The presentation follows Liptser and Shiryayev [L5].


Chapter VII

§1. Most of the fundamental results of the theory of martingales were obtained by Doob [D3]. Theorem 1 is taken from Meyer [M4]. Also see Meyer and Dellacherie [M5], Liptser and Shiryayev [L5], and Gihman and Skorohod [G3].

§2. Theorem 1 is often called the theorem "on transformation under a system of optional stopping" (Doob [D3]). For the identities (14) and (15) and Wald's fundamental identity see Wald [W2].

§3. Chow and Teicher [C3] contains an illuminating study of the results presented here, including proofs of the inequalities of Khinchin, Marcinkiewicz and Zygmund, Burkholder, and Davis. Theorem 2 was given by Lenglart [L3].

§4. See Doob [D3].

§5. Here we follow Kabanov, Liptser and Shiryayev [K1], Engelbert and Shiryayev [E1], and Neveu [N2]. Theorem 4 and the example were given by Liptser.

§6. This approach to problems of absolute continuity and singularity, and the results given here, can be found in Kabanov, Liptser and Shiryayev [K1]. Theorem 6 was obtained by Kabanov.

§7. Theorems 1 and 2 were given by Novikov [N4]. Lemma 1 is a discrete analog of Girsanov's lemma (see [K1]).

§8. The exposition follows Liptser and Shiryayev [L6].

Chapter VIII

§1. For the basic definitions see Dynkin [D4], Ventzel [V2], Doob [D3], and Gihman and Skorohod [G3]. The existence of regular transition probabilities such that the Kolmogorov-Chapman equation (9) is satisfied for all x ∈ R is proved in [N1] (corollary to Proposition V.2.1) and in [G3] (Volume I, Chapter II, §4). Kuznetsov (see Abstracts of the Twelfth European Meeting of Statisticians, Varna, 1979) has established the validity (which is far from trivial) of a similar result for Markov processes with continuous time and values in universal measurable spaces.

§§2-5. Here the presentation follows Kolmogorov [K4], Borovkov [B4], and Ash [A4].

References†

[A1] M. Abramovitz and I. A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. National Bureau of Standards, Washington, D.C., 1964.
[A2] P. S. Aleksandrov. Einführung in die Mengenlehre und die Theorie der reellen Funktionen. DVW, Berlin, 1956.
[A3] N. V. Aleksandrova. Mathematical Terms [Matematicheskie terminy]. Vysshaia Shkola, Moscow, 1978.
[A4] R. B. Ash. Basic Probability Theory. Wiley, New York, 1970.
[A5] R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972.
[A6] R. B. Ash and M. F. Gardner. Topics in Stochastic Processes. Academic Press, New York, 1975.
[B1] S. N. Bernshtein. Chebyshev's work on the theory of probability (in Russian), in The Scientific Legacy of P. L. Chebyshev [Nauchnoe nasledie P. L. Chebysheva], pp. 43-68, Akademiya Nauk SSSR, Moscow-Leningrad, 1945.
[B2] S. N. Bernshtein. Theory of Probability [Teoriya veroyatnostei], 4th ed. Gostehizdat, Moscow, 1946.
[B3] P. Billingsley. Convergence of Probability Measures. Wiley, New York, 1968.
[B4] A. A. Borovkov. Wahrscheinlichkeitstheorie: eine Einführung. Birkhäuser, Basel-Stuttgart, 1976.
[B5] L. Breiman. Probability. Addison-Wesley, Reading, MA, 1968.
[C1] P. L. Chebyshev. Theory of Probability: Lectures Given in 1879 and 1880 [Teoriya veroyatnostei: Lektsii akad. P. L. Chebysheva chitannye v 1879, 1880 gg.]. Edited by A. N. Krylov from notes by A. N. Lyapunov. Moscow-Leningrad, 1936.
[C2] Y. S. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. Houghton-Mifflin, Boston, 1971.

† Translator's note: References to translations into Russian have been replaced by references to the original works. Russian references have been replaced by their translations whenever I could locate translations; otherwise they are reproduced (with translated titles). Names of journals are abbreviated according to the forms used in Mathematical Reviews (1982 version).
[C3] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer-Verlag, New York, 1978.
[C4] Kai-Lai Chung. Markov Chains with Stationary Transition Probabilities. Springer-Verlag, New York, 1967.
[C5] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ, 1957.
[D1] M. H. De Groot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
[D2] M. Doherty. An amusing proof in fluctuation theory. Lecture Notes in Mathematics, no. 452, 101-104. Springer-Verlag, Berlin, 1975.
[D3] J. L. Doob. Stochastic Processes. Wiley, New York, 1953.
[D4] E. B. Dynkin. Markov Processes. Plenum, New York, 1963.
[E1] H. J. Engelbert and A. N. Shiryayev. On the sets of convergence and generalized submartingales. Stochastics 2 (1979), 155-166.
[F1] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 1, 3rd ed. Wiley, New York, 1968.
[F2] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 2, 2nd ed. Wiley, New York, 1966.
[G1] A. Garsia. A simple proof of E. Hopf's maximal ergodic theorem. J. Math. Mech. 14 (1965), 381-382.
[G2] I. I. Gihman [Gikhman] and A. V. Skorohod [Skorokhod]. Introduction to the Theory of Random Processes. Saunders, Philadelphia, 1969.
[G3] I. I. Gihman and A. V. Skorohod. Theory of Stochastic Processes, 3 vols. Springer-Verlag, New York-Berlin, 1974-1979.
[G4] B. V. Gnedenko. Lehrbuch der Wahrscheinlichkeitstheorie. Akademische Verlagsgesellschaft, Berlin, 1967.
[G5] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Cambridge, MA, 1954.
[G6] B. V. Gnedenko and A. Ya. Khinchin. Elementary Introduction to the Theory of Probability. Freeman, San Francisco, 1961.
[H1] P. R. Halmos. Measure Theory. Van Nostrand, New York, 1950.
[H2] E. J. Hannan. Time Series Analysis. Methuen, London, 1960.
[H3] E. J. Hannan. Multiple Time Series. Wiley, New York, 1970.
[I1] I. A. Ibragimov and Yu. V. Linnik. Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen, 1971.
[I2] I. A. Ibragimov and Yu. A. Rozanov. Gaussian Random Processes. Springer-Verlag, New York, 1978.
[I3] A. Isihara. Statistical Physics. Academic Press, New York, 1971.
[K1] Yu. M. Kabanov, R. Sh. Liptser, and A. N. Shiryayev. On the question of the absolute continuity and singularity of probability measures. Math. USSR-Sb. 33 (1977), 203-221.
[K2] J. Kemeny and L. J. Snell. Finite Markov Chains. Van Nostrand, Princeton, 1960.
[K3] V. F. Kolchin, B. A. Sevastyanov, and V. P. Chistyakov. Random Allocations. Halsted, New York, 1978.
[K4] A. N. Kolmogorov. Markov chains with countably many states. Byull. Moskov. Univ. 1 (1937), 1-16 (in Russian).
[K5] A. N. Kolmogorov. Stationary sequences in Hilbert space. Byull. Moskov. Univ. Mat. 2 (1941), 1-40 (in Russian).
[K6] A. N. Kolmogorov. The contribution of Russian science to the development of probability theory. Uchen. Zap. Moskov. Univ. 1947, no. 91, 56ff. (in Russian).

[K7] A. N. Kolmogorov. Probability theory (in Russian), in Mathematics: Its Contents, Methods, and Value [Matematika, ee soderzhanie, metody i znachenie]. Akad. Nauk SSSR, vol. 2, 1956.
[K8] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New York, 1956.
[K9] A. N. Kolmogorov and S. V. Fomin. Elements of the Theory of Functions and Functional Analysis. Graylock, Rochester, NY, 1957.
[K10] A. N. Kolmogorov and A. P. Yushkevich, editors. Mathematics of the Nineteenth Century [Matematika XIX veka]. Nauka, Moscow, 1978.
[K11] J. Kubilius. Probabilistic Methods in the Theory of Numbers. American Mathematical Society, Providence, 1964.
[L1] J. Lamperti. Probability. Benjamin, New York, 1966.
[L2] J. Lamperti. Stochastic Processes. Springer-Verlag, New York, 1977.
[L3] E. Lenglart. Relation de domination entre deux processus. Ann. Inst. H. Poincaré Sect. B (N.S.) 13 (1977), 171-179.
[L4] V. P. Leonov and A. N. Shiryayev. On a method of calculation of semi-invariants. Theory Probab. Appl. 4 (1959), 319-329.
[L5] R. S. Liptser and A. N. Shiryayev. Statistics of Random Processes. Springer-Verlag, New York, 1977.
[L6] R. Sh. Liptser and A. N. Shiryayev. A functional central limit theorem for semimartingales. Theory Probab. Appl. 25 (1980), 667-688.
[L7] M. Loève. Probability Theory. Springer-Verlag, New York, 1977-78.
[L8] E. Lukacs. Characteristic Functions. Hafner, New York, 1960.
[L9] E. Lukacs and R. G. Laha. Applications of Characteristic Functions. Hafner, New York, 1964.
[M1] D. E. Maistrov. Probability Theory: A Historical Sketch. Academic Press, New York, 1974.
[M2] A. A. Markov. Calculus of Probabilities [Ischislenie veroyatnostei], 3rd ed. St. Petersburg, 1913.
[M3] L. D. Meshalkin. Collection of Problems on Probability Theory [Sbornik zadach po teorii veroyatnostei]. Moscow University Press, 1963.
[M4] P.-A. Meyer. Martingales and stochastic integrals, I. Lecture Notes in Mathematics, no. 284. Springer-Verlag, Berlin-New York, 1972.
[M5] P.-A. Meyer and C. Dellacherie. Probabilities and Potential. North-Holland Mathematical Studies, no. 29. Hermann, Paris; North-Holland, Amsterdam, 1978.
[N1] J. Neveu. Mathematical Foundations of the Calculus of Probability. Holden-Day, San Francisco, 1965.
[N2] J. Neveu. Discrete Parameter Martingales. North-Holland, Amsterdam, 1975.
[N3] J. Neyman. First Course in Probability and Statistics. Holt, New York, 1950.
[N4] A. A. Novikov. On estimates and the asymptotic behavior of the probability of nonintersection of moving boundaries by sums of independent random variables. Math. USSR-Izv. 17 (1980), 129-145.
[P1] V. V. Petrov. Sums of Independent Random Variables. Springer-Verlag, Berlin-New York, 1975.
[P2] J. W. Pratt. On interchanging limits and integrals. Ann. Math. Stat. 31 (1960), 74-77.
[P3] Yu. V. Prohorov [Prokhorov]. Asymptotic behavior of the binomial distribution. Uspekhi Mat. Nauk 8, no. 3(55) (1953), 135-142 (in Russian).
[P4] Yu. V. Prohorov. Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl. 1 (1956), 157-214.
[P5] Yu. V. Prohorov and Yu. A. Rozanov. Probability Theory. Springer-Verlag, Berlin-New York, 1969.
[R1] B. Ramachandran. Advanced Theory of Characteristic Functions. Statistical Publishing Society, Calcutta, 1967.
[R2] A. Rényi. Probability Theory. North-Holland, Amsterdam, 1970.
[R3] V. I. Rotar. An extension of the Lindeberg-Feller theorem. Math. Notes 18 (1975), 660-663.
[R4] Yu. A. Rozanov. Stationary Random Processes [Statsionarnye sluchainye protsessy]. Fizmatgiz, Moscow, 1963.
[S1] T. A. Sarymsakov. Foundations of the Theory of Markov Processes [Osnovy teorii protsessov Markova]. GITTL, Moscow, 1954.
[S2] V. V. Sazonov. Normal Approximation: Some Recent Advances. Lecture Notes in Mathematics, no. 879. Springer-Verlag, Berlin-New York, 1981.
[S3] B. A. Sevastyanov [Sewastjanow]. Verzweigungsprozesse. Oldenbourg, Munich-Vienna, 1975.
[S4] A. N. Shiryayev. Random Processes [Sluchainye protsessy]. Moscow State University Press, 1972.
[S5] A. N. Shiryayev. Probability, Statistics, Random Processes [Veroyatnost, statistika, sluchainye protsessy], vols. I and II. Moscow State University Press, 1973, 1974.
[S6] A. N. Shiryayev. Statistical Sequential Analysis [Statisticheskii posledovatelnyi analiz]. Nauka, Moscow, 1976.
[S7] Ya. G. Sinai. Introduction to Ergodic Theory [Vvedenie v ergodicheskuyu teoriyu]. Erevan University Press, 1973.
[S8] S. H. Sirazhdinov. Limit Theorems for Stationary Markov Chains [Predelnye teoremy dlya odnorodnyh tsepei Markova]. Akad. Nauk Uzbek. SSR, Tashkent, 1955.
[S9] W. F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
[T1] I. Todhunter. A History of the Mathematical Theory of Probability from the Time of Pascal to that of Laplace. Macmillan, London, 1865.
[V1] E. Valkeila. A general Poisson approximation theorem. Stochastics 7 (1982), 159-171.
[V2] A. D. Ventzel. Course in the Theory of Random Processes [Kurs teorii sluchainyh protsessov]. Nauka, Moscow, 1975.
[W1] B. L. van der Waerden. Mathematical Statistics. Springer-Verlag, Berlin-New York, 1969.
[W2] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[Y1] A. M. and I. M. Yaglom. Wahrscheinlichkeit und Information. DVW, Berlin, 1960; Probabilité et Information. Dunod, Paris, 1969.
[Z1] S. Zacks. The Theory of Statistical Inference. Wiley, New York, 1971.
[Z2] V. M. Zolotarev. A generalization of the Lindeberg-Feller theorem. Theory Probab. Appl. 12 (1967), 606-618.
[Z3] V. M. Zolotarev. Théorèmes limites pour les sommes de variables aléatoires indépendantes qui ne sont pas infinitésimales. C. R. Acad. Sci. Paris Sér. A-B 264 (1967), A799-A800.

Index of Symbols

u,

u. n, n

134, 135 a' boundary 309 0, empty set 11, 134 EF> 419 ® 30 =,identity, or definition 149 "'• asymptotic equality 20; or equivalence 296 ff. =>,<=>,implications 139, 140 =>,also used for "convergence in general" 307, 309 4 335 b 314 ~, finer than 13 j,! 135 j_ 262,492 ~.~ • .!4, -4,24 250; ~ 306 loc «, « 492 (X, Y) 455 {X. --+} 483 [A], closure of A 151, 309 A, complement of A 11, 134 [t 1 , . . . , t.] combination, unordered set 7 (t 1 , . . . , t.) permutation, ordered set 7 A + B union of disjoint sets 11, 134 A\B 11, 134

A 6. B 43, 134 {A. i.o.} = lim sup A. 135 a- = min(a, 0); a+ = max(a, 0) 105 a$= a- 1 ,a # O;O,a = 0 434 a A b = min(a, b); a v b = max(a, b) 455 A$ 305 d 13 ~. index set 315 PJ(R) = PJ(R 1) = ffd = P4 1 141, 142 f!4(R") 142, 157 P4 0 (R") 144 PJ(R 00 ) 144, 160; PJ 0 (R 00 ) 145 PJ(R 1) 146, 164 PJ(C), Pl0 (C) 149 ffd(D) 148 PJ[O, 1] 152 ffd(R) ® · · · ® PJ(R) = PJ(R") 142 c 148 C = C[O, oo) 149 (C, PJ(C)) 148 c~ 6 c+ 483 cov (~. '1) 41, 232 D, (D, PJ(D)) 148 ~ 169 E 37 (E, C) 175


E<eJD),

76 211 E(~, '/) 79, 218 E(~J'It• o o o, 'In) 79 EC~I'It·ooo,'!k) 272 ess sup 259 (f, g), <J, g) 398 F* G 239 g; 131, 134 ~p 152 $'/lr 174 $'*,$'*'$'A 137 H 2 424 hn(x), Hn(x) 266 inf 44 § 161, 394 IA,I(A) 33 i - j 529 JA~dP 181 ~ dP 179 (L-S) (R-S) (L) (R) L 2 260 258 IE 262 !l 265

IP'k

E(~j.@)

E(~J~)

Jn

J,

J,

J,

0

Rn

251

7

A

138

N(A) 15 N(d) 13 N(Q) 5 JV(m, a 2 ) 232 0

233,263 p(w) 13 !!}> 315 p 18, 131 P(A) 18, 132 (!)

P(A I D), P(A I£&)

212 P(A I~) 74, 212 Px, P" 526 Pc(F) 307 IP' 113 P(AI~)

74

142, 157 159, 233, 263 (R, 81) = (R 1 ; PJ) = (R, &I(R)) 142 Rn, PA(Rn) 142, 157 R"', R 00 , EJI(R 00 ) 144, 160 RT, (RT, &I(RT)) 145, 164; (RT, &I(RT), P) 245 tt(x) 497 v 41 V(~l~) 212 Xn* = maxisniXil 464 IIXnllp 464 z 175 7l. 387 408 Z(A.), Z(A) 396 a(£&) 12 bij 266 A 59, 157 (Jk~ 376, 527 Jl, p(A) 131 /1 1 x Jlz 196 ~ 32 c:+ 44 ~ 270 n (permutation) 568 Ill= IIPijll 113 l0I 148 Cf1r s T np I®It E T g;;) 148 p(~. '/) 41 (p, t$ 427 (p 431 (x) 61 x. x2 154, 241 XB 172 n 5 (0, d, P) 18, 29 ffo

IR

J

lim, lim, lim sup, lim inf, lim liml 135

l.iomo

0

R 141 R. 142, 172

181

z

u

(M)n

114

IIPiill 113 llp(x, y)ll 110, 111 !!J> = {Pa;etE2l} 315 I[) = (qt. q2' o) 543

i,

141,

Index

Also see the Historical and Bibliographical Notes (pp. 555-559), and the Index of Symbols. Absolute continuity with respect toP 193, 492 Absolute moment 180 Absolutely continuous distributions 154 functions 154 measures 153 probability measures 193, 492 random variables 169 Absorbing Ill, 547 Accessible state 528 a.e. 183 Algebra correspondence with decomposition 80 induced by a decomposition 12 of events 12, 126 of sets 130, 137 If. (J131, 137ff. smallest 138 tail 355 If. Almost everywhere 183 Almost invariant 379 Almost periodic 389 Almost surely I 73, 183 Amplitude 390 Aperiodic state 531 a posteriori probability 27 Appropriate sets, principle of 139 a priori probability 27 Arcsine law 100 Arithmetic properties 528

a.s. 173, 183 Asymptotic algebra 355 Asymptotic properties 532 If. Asymptotically infinitesimal 327 uniformly 510 unbiased 415 Asymptotics 505 If. Atoms 12, 173 Attraction of characteristic functions 296 Autoregression, autoregressive 391, 393 Average time of return 533 Averages, moving 391, 393, 409 Axioms 13, 136

Backward equation I 14 Balance equation 392 Ballot theorem 105 If. Banach space 259 Barriers Ill, 54 7, 549 Bartlett's estimator 416 Basis, orthonormal 265 Bayes's formula 26 Bayes's theorem 27, 228 If. Bellman, R. 352 Bernoulli, James 2 distribution 153 law of large numbers 49 random variable 34, 46 scheme 30, 45 If., 55 If., 68 If.

568 Bernstein, S. N. 3, 4, 5, 305 inequality 55 polynomials 54 proof of Weierstrass' theorem 54 Berry-Esseen theorem 63, 342 Bessel's inequality 262 Best estimator 42, 69. Also see Optimal estimator Beta distribution !54 Bilateral exponential distribution !54 Binary expansion 129, 369 Binomial distribution 17 tf., !52 negative 153 Binomial random variable 34 Birkhotf, G. D. 376 Birthday problem 15 Bochner-Khinchin theorem 285 Borel, E. 4 algebra 141 ff. function 168 rectangle 143 sets 141, 145 space 227 zero-or-one law 355 Borel-Cantelli lemma 253 Borel-Cantelli-Levy theorem 486 Bose-Einstein 10 Boundary 504 Bounded variation 205 Branching process 113 Brownian motion 304 Butfon's needle 222 Bunyakovskii, V. Ya. 38, 190 Burkholder's inequality 469

Canonical decomposition 512 probability space 245 Cantelli, F. P. 253, 363 Cantor, G. diagonal process 317 function 155 Caratheodory's theorem 150 Carleman's test 294 Cauchy distribution 154, 337 inequality 38 sequence 251 Cauchy-Bunyakovskii inequality 38, 190 Cauchy criterion for almost sure convergence 256 convergence in probability 257

Index convergence in mean-p 258 Central limit theorem 3, 319, 324, 326 ff. for dependent variables 509 ff. nonclassical condition for 327 Certain event II, 134 Cesaro limit 541 Change of variable in integral 194 Chapman, D. G. 114, 246, 525 Characteristic function of distribution 4, 272 tf. random vector 273 set 33 Charlier, C. V. L. 267 Chebyshev, P. L. 3, 319 inequality 47, 55, 190 Chi, chi-squared distributions 154, 241 Class c+ 483 convergence-determining 312 determining 312 monotonic 138 of states 529 Classical method 15 models 17 ff. Classification of states 528 ff. Closed linear manifold 265 Coin tossing I, 5, 17, 33, 81 ff., 129 ff. Coincidence problem 15 Collectively independent 36 Combinations 6 Communicating states 529 Compact relatively 315 sequentially 316 Compensator 454 Complement II, 134 Complete function space 258 ff. probability measure 152 probabiiity space !52 Completely nondeterministic 419 Conditional distribution 225 probability 23 tf., 74, 219 regular 224 with respect to a decomposition 74, 210 u-algebra 210, 212 random variable 75, 212 variance 81, 212 Wiener process 305 Conditional expectation in the wide sense 262, 272

Index

with respect to decomposition 76 event 218 set of variables 79 u-algebra 211 Conditionally Gaussian 438 Confidence interval 71 Consistency property 161, 244 Consistent estimator 68 Construction of a process 243 Continuity theorem 320 Continuous at 0 !51, 162 Continuous from above or below 132 Continuous time 175 Convergence-determining class 312 Convergence of martingales and submartingales 476 ff., 483 ff. probability measures Chap. III, 306 ff. random variables: equivalences 480 sequences. See Convergence of sequences series 359 Convergence of sequences almost everywhere 183, 250, 322 almost sure 173, 183, 250, 322 at points of continuity 251 dominated 185 in distribution 250, 322 in general 307, 313 in mean 250 in mean of order p 250 in mean square 250 in measure 250, 322 in probability 250 monotone 184 weak 306 with probability I 250 Convolution 239 Coordinate method 245 Correlation coefficient 41, 232 function 388 maximal 242 Counting measure 231 Covariance 41, 232, 291 function 304, 388, 395 matrix 233 Cramer-Wold method 517 Cumulant 288 Curve of regression 236 Curvilinear boundary 504 ff. Cyclic property 530 Cyclic subclasses 530 Cylinder set 144

569 Davis's inequality 470 Decomposition canonical 512 countable 138 Doob 454 Krickeberg 475 Lebesgue 493 of martingale 475 ofQ 12, 138 of probability measure 493 of random sequence 419 of set 12, 290 of submartingale 454 trivial 77 Degenerate distribution 510 distribution function 286 random variable 296 Delta function 296 Delta, Kronecker 266 De Moivre, A. 2, 49 De Moivre-Laplace limit theorem 62 Density Gaussian 66, 154, 160, 234 n-dimensional 159 of distribution !54 of measure with respect to a measure 194 of random variable 169 Dependent random variables 101 central limit theorem for 509 ff. Derivative, Radon-Nikodym 194 Detection of signal 434 Determining class 312 Deterministic 419 Dichotomy Hajek-Feldman 501 Kakutani 496 Difference of sets II, 134 Direct product 30, 31, 142, 147, 149 Dirichlet's function 209 Discrete measure 153 random variable 169 time 175 uniform density !53 Disjoint 134 Dispersion 41 Distribution. Also see Distribution function, Probability distribution Bernoulli 153 beta 154 binomial 17, 18, 153 Cauchy 241, 338

570 chi, chi-squared 241 conditional 225 degenerate 510 discrete (list) 153 discrete uniform !53 entropy of 51 ergodic 116 exponential 154 Fisher 241 gamma 154 Gaussian 66, 154, 159, 291 geometric 153 hypergeometric 21 infinitely divisible 335 initial II 0, 524 invariant 118 limit 545 lognormal 238 multidimensional 35, 158 multinomial 20 negative binomial !53 normal 66, 154 of process 176 ofsum 36 Poisson 64, 153 polynomial 20 probability 33 stable 338 stationary 118 Student's 154, 242 154, 242 1two-sided exponential 154 uniform 154 with density (list) 154 Distribution function 34, 35, 150, 169 absolutely continuous 154 degenerate 286 discrete 153 finite-dimensional 244 generalized 156 n-dimensional 158 of functions of random variables 36, 237 ff. of sum 36, 239 Distribution of objects in cells 8 .@-measurable 77 Dominated convergence 185 Dominated sequence 467 Doob, J. L. 454, 457, 464, 474, 476 Doubling stakes 87, 453 Doubly stochastic 546 d-system 140 Duration of random walk 88 Dvoretzky's inequality 475

Index

Efficient estimator 69 Eigenvalue 128 Electric circuit 32 Elementary events 5, 134 probability theory Chap. I stochastic measure 396 Empty set II, 134 Entropy 51 Equivalent measures 492 Ergodic distribution 46 sequence 385 theorems 116, 381 ff., 385 maximal 382 mean-square 410 theory 377 ff. transformation 379 Ergodicity 116, 379 ff., 545 Errors law of 296 mean-square 43 of observation 2, 3 Esseen's inequality 294 Essential state 528 Essential supremum 259 Estimation 68, 235, 412 ff., 426 Estimator 42, 68 ff., 235 Bartlett's 416 best 42, 69 consistent 68 efficient 69 for parameters 444, 488, 503 linear 43 of spectral quantities 414 optimal 69, 235, 301, 426, 433, 435, 441 Parzen's 417 unbiased 68, 412 Zhurbenko's 417 Events 5, 10, 134 certain II, 134 impossible II, 134 independent 28, 29 mutually exclusive 134 symmetric 357 Existence of limits and stationary distributions 541 ff. Expectation 37, 179 inequalities for 190 ff. of function 55 of maximum 45 of random variable with respect to decomposition 76 set of random variables 79



    σ-algebra 211
  of sum 38
Expected value 37
Exponential distribution 154
Exponential random variable 154, 242, 243
Extended random variable 171
Extension of a measure 150, 247, 399
Extrapolation 422, 425 ff.

Fair game 452
Fatou's lemma 185, 209
Favorable game 86, 452
F-distribution 154
Feldman, J. 501
Feller, W. 332
Fermat, P. de 2
Fermi-Dirac 10
Filter 406 ff., 436 ff.
  physically realizable 423
Filtering 432 ff., 436 ff.
Finer decomposition 78
Finite second moment 260 ff.
Finite-dimensional distribution function 244
Finitely additive 131
First
  arrival 127, 533
  exit 121
  return 92, 127, 533
Fisher's information 70
ℱ-measurable 168
Forward equation 114
Foundations Chap. II, 129 ff.
Fourier transform 274
Frequencies 390
Frequency 46
Frequency characteristic 406
Fubini's theorem 196 ff.
Fundamental inequalities (for martingales) 464 ff.

Fundamental sequence 251 ff., 256 ff.

Gamma distribution 154, 337
Garsia, A. 382
Gauss, C. F. 3
Gaussian
  density 66, 154, 160, 234
  distribution 66, 154, 159, 291
  measure 266
  random variables 232, 241, 296
  random vector 297
  sequence 304, 385, 404, 413, 438
    zero-one law for 501

  systems 295 ff., 303
Gauss-Markov process 305
Generalized
  Bayes theorem 229
  distribution function 156
  martingale 448, 491
  submartingale 448, 491
Geometric distribution 153
Gnedenko, B. V. vii, 510
Gram determinant 263
Gram-Schmidt process 264
Gronwall-Bellman lemma 352
Haar functions 268, 482
Hajek-Feldman dichotomy 501
Hardy class 424
Harmonics 390
Hartman, P. 372
Helly's theorem 316
Herglotz, G. 393
Hermite polynomials 266
Hewitt, E. 357
Hilbert space 261
  complex 273, 388
  separable 265
  unitary 388
Hinchin. See Khinchin
History 556-559
Hölder inequality 191
Huygens, C. 2
Hydrology 392, 393
Hypergeometric distribution 21
Hypotheses 27

Impossible event 11, 134
Impulse response 406
Increasing sequence 134
Increments
  independent 304
  uncorrelated 107, 304
Indecomposable 529
Independence 27 ff.
  linear 263. Also see Linear dependence
  of components of a random vector 284

Independent
  algebras 28, 29
  events 28, 29
  functions 177
  increments 304
  random variables 36, 75, 79, 177, 480
Independent of the future 448

Indeterminacy 50 ff.
Indicator 33, 43
Inequalities
  Bernstein 55
  Bessel 262
  Burkholder 469
  Cauchy-Bunyakovskii 38, 70, 190
  Cauchy-Schwarz 38
  Chebyshev 47, 55, 190
  Davis 470
  Dvoretzky 475
  Hölder 191
  Jensen 190
  Khinchin 468, 469
  Kolmogorov 466
  Lévy 375
  Lyapunov 191
  Marcinkiewicz-Zygmund 469
  Markov 557
  martingale 464 ff.
  Minkowski 192
  Ottaviani 475
  Rademacher-Menshov 475
  Rao-Cramér 70
  Schwarz 38
  two-dimensional Chebyshev 55
Inessential state 528
Infinitely divisible 335
Infinitely many outcomes 129 ff.
Information 70
Initial distribution 110, 246
Innovation sequence 420
Integral
  Lebesgue 179, 181
  Lebesgue-Stieltjes 181
  Riemann 181, 201
  Riemann-Stieltjes 201, 202
  stochastic 390, 398
Integral equation 206-208, 350 ff.
Integral theorem 62
Integration by
  parts 204
  substitution 209
Intensity 390
Interpolation 430 ff.
Intersection 11, 134
Introducing probability measures 149 ff.
Invariant set 379, 385
Inversion formulas 281, 295
i.o. 135
Ionescu Tulcea, C. T. vii, 247
Ising model 23
Isometry 402
Iterated logarithm 372


Jensen's inequality 190, 231

Kakutani dichotomy 496
Kalman-Bucy filter 436 ff.
Khinchin, A. Ya. 285, 359, 372, 468
Kolmogorov, A. N. vii, 3, 4, 510
  axioms 136
  inequality 359
Kolmogorov-Chapman equation 114, 246, 525
Kolmogorov's theorems
  convergence of series 359
  existence of process 244
  extension of measures 161, 165
  iterated logarithm 372
  stationary sequences 425, 431
  strong law of large numbers 364, 366
  three-series theorem 362
  two-series theorem 361
  zero-or-one law 356
Krickeberg's decomposition 475
Kronecker, L. 365
  delta 266

Λ (condition) 326
Laplace, P. S. 2, 35, 62
Law of large numbers 49, 65, 323
  for Markov chains 120
  for square-integrable martingales 487
  Poisson's 557
  strong 363 ff.
Law of the iterated logarithm 370 ff.
Least squares 3
Lebesgue, H.
  decomposition 493
  dominated convergence theorem 185
  integral 178 ff.
    change of variable in 194
  measure 152, 157
Lebesgue-Stieltjes integral 195
Lebesgue-Stieltjes measure 152, 156, 204
LeCam, L. 345
Lévy, P.
  convergence theorem 478
  distance 314
  inequality 375
Lévy-Khinchin representation 341
Lévy-Khinchin theorem 337
Likelihood ratio 108
lim inf, lim sup 135, 137
Limit theorems 49 ff., 55 ff.



Limits under
  expectation signs 216
  integral signs 183, 187
Lindeberg condition 326
Lindeberg-Feller theorem 332
Linear manifold 262, 265
Linearly independent 263
Liouville's theorem 378
Lipschitz condition 481
Local limit theorem 55, 56
Local martingale, submartingale 449
Locally absolutely continuous 492
Locally bounded variation 206
Lognormal 238
Lottery 15, 22
Lower function 371
L²-theory Chap. VI, 387 ff.
Lyapunov, A. M. 3, 319
  condition 331
  inequality 191

McMillan's theorem 53
Marcinkiewicz's theorem 286
Marcinkiewicz-Zygmund inequality 469
Markov, A. A. 3, 319
  dependence 523
  process 246
  property 110, 125, 523
  time 448
Markov chains 108 ff., 249, Chap. VIII, 523 ff.
  classification of 528-541
  discrete 524
  examples 111 ff., 541 ff.
  finite 524
  homogeneous 111, 524
Martingale 100 ff., Chap. VII (446 ff.), 453
  bounded 456
  convergence of 476 ff.
  generalized 448
  inequalities for 464 ff.
  in gambling 453
  local 449
  oscillations of 473
  sets of convergence for 483 ff.
  square-integrable 454 ff., 466, 506
  uniformly integrable 479
Martingale-difference 453, 511, 521
Martingale transform 450
Mathematical expectation 37, 74. Also see Expectation

Mathematical foundations Chap. II, 129 ff.
Matrix
  covariance 233
  doubly stochastic 546
  of transition probabilities 110
  orthogonal 233, 263
  stochastic 111
  transition 111
Maximal
  correlation coefficient 242
  ergodic theorem 382
Maxwell-Boltzmann 10
Mean
  duration 81, 88, 461
  -square 42
    ergodic theorem 410
  value 37
  vector 299
Measurable
  function 168
  random variable 77
  set 152
  spaces 131
    (C, ℬ(C)) 148
    (D, ℬ(D)) 148
    (R, ℬ(R)) 141, 150
    (R^∞, ℬ(R^∞)) 144, 160
    (R^n, ℬ(R^n)) 142, 157
    (R^T, ℬ(R^T)) 145, 164
  transformation 377
Measure 131
  absolutely continuous 153
  complete 152
  consistent 526
  countably additive 131
  counting 231
  discrete 153
  elementary 396
  extending a 150, 247, 399
  finite 131
  finitely additive 130, 131
  Gaussian 266
  Lebesgue 152
  Lebesgue-Stieltjes 156
  orthogonal 397
  probability 131, 132
  restriction of 163
  σ-additive 131
  σ-finite 131
  signed 194
  singular 154 ff.
  stochastic 390
  Wiener 166, 167


Measure-preserving transformations 376 ff.
Median 44
Menshov, D. E. 475
Method
  of characteristic functions 318 ff.
  of moments 4, 319
Metrically transitive 379
Minkowski inequality 192
Mises, R. von 4
Mixed model 393
Mixed moment 287
Mixing 381
Moivre. See De Moivre
Moment 180
  absolute 180
  and semi-invariant 288
  method 4, 319
  mixed 287
  problem 292 ff.
Monotone convergence theorem 184
Monotonic class 138
Monte Carlo method 223, 369
Moving averages 390, 393, 409
Multinomial distribution 20, 21
Multiplication formula 26
Mutual variation 455
Needle (Buffon) 222
Negative binomial 153
Noise 390, 407, 412
Nondeterministic 419
Nonlinear estimator 425
Nonnegative definite 233
Nonrecurrent state 533
Norm 258
Normal
  correlation 301, 305
  density 66, 159, 264
  distribution function 62, 66, 154, 159
  number 369
Normally distributed 297
Null state 533
Occurrence of event 134
Optimal estimator 69, 235, 301, 426, 433, 435, 441
Optional stopping 559
Ordered sample 6
Orthogonal
  increments 400
  matrix 233, 263
  random variables 261

  stochastic measures 390, 397, 492
  system 261
  values 397
Orthogonalization 264
Orthonormal 265
Oscillations of
  martingales 473
  water level 393
Ottaviani's inequality 475
Outcome 5, 134
Pairwise independence 29
P-almost surely, almost everywhere 183
Parallelogram property 272
Parseval's equation 265
Partially observed sequences 432 ff.
Parzen's estimator 417
Pascal, B. 2
Path 48, 83, 93
Pauli exclusion principle 10
Period of Markov chain 530
Periodogram 415
Permutation 7, 357
Perpendicular 262
Phase space 110, 378, 524
Physically realizable 406, 423
π 223
Poincaré recurrence principle 378
Poisson, D. 3
  distribution 64, 153
  law of large numbers 557
  limit theorem 64, 325
Poisson-Charlier polynomials 267
Pólya's theorems
  characteristic functions 285
  random walk 554
Polynomials
  Bernstein 54
  Hermite 266
  Poisson-Charlier 267
Positive semi-definite 285
Positive state 533
Pratt's lemma 209, 557
Predictable sequence 418, 446
Preservation of martingale property 456 ff.
Principle of appropriate sets 139
Probabilistic model 5, 14, 129, 136
  in the extended sense 131
Probability 13, 132
  a posteriori, a priori 27
  classical 15
  conditional 23, 74, 212
  finitely additive 131



  measure 131, 149-152
  multiplication 26
  of first arrival or return 533
  of mean duration 88
  of ruin 86
  of success 68 ff.
  total 25, 75, 77
  transition 525
Probability distribution 33, 168, 176
  discrete 153
  lognormal 238
  stationary 528
  table 153, 154
Probability measure 131, 132, 152, 492
  absolutely continuous 492
  complete 152
Probability space 14, 136
  canonical 245
  complete 152
  universal 250
Problems on
  arrangements 8
  coincidence 15
  ruin 82, 86
Process
  branching 113
  Brownian motion 304
  construction of 243 ff.
  Gaussian 305
  Gauss-Markov 305
  Markov 246
  stochastic 4, 175
  Wiener 304, 305
  with independent increments 305
Prohorov, Yu. V. vii, 64, 315
Projection 262, 271
Pseudoinverse 305
Pseudotransform 434
Purely nondeterministic 419
Pythagorean property 272
Quadratic variation 455
Queueing theory 112
Rademacher-Menshov inequality 475
Rademacher system 268
Radon-Nikodym
  derivative 194
  theorem 194, 557
Random elements 174 ff.

Quadratic variation 455 Queueing theory 112

Rademacher-Menshov inequality 475 Rademacher system 268 Radon-Nikodym derivative 194 theorem 194, 557 Random elements 174 If.

  function 175
  process 175, 304
    with orthogonal increments 400
  sequences 4, Chap. V, 366 ff.
    existence of 247, 304
    orthogonal 419
Random variables 32 ff., 168 ff., 232 ff.
  absolutely continuous 169
  almost invariant 379
  complex 175
  continuous 169
  degenerate 296
  discrete 168, 169
  exponential 242, 243
  E-valued 174
  extended 171
  Gaussian 232, 296
  invariant 379
  normally distributed 232
  simple 168
  uncorrelated 232
Random vectors 35, 175
  Gaussian 297, 299
Random walk 18, 82 ff., 356
  in two and three dimensions 552
  simple 546
  symmetric 92, 356, 552
  with curvilinear boundary 504 ff.
Rao-Cramér inequality 70
Rapidity of convergence 342 ff., 345 ff.
Realization of a process 176
Recurrent state 533
Reflecting barrier 549 ff.
Reflection principle 93
Regression 236
Regular
  conditional distribution 225
  conditional probability 224
  stationary sequence 419
Relatively compact 315
Reliability 71
Restriction of a measure 163
Riemann integral 201
Riemann-Stieltjes integral 201 ff.
Ruin 82, 86, 461

Sample points, space 5
Sampling
  with replacement 6
  without replacement 7, 21, 23
Savage, L. J. 357

Scalar product 261
Schwarz inequality 38. Also see Bunyakovskii, Cauchy
Semicontinuous 311
Semi-definite 285
Semi-invariant 288
Semi-norm 258
Separable 265
Sequences
  almost periodic 389
  moving average 390
  of independent random variables 354 ff.
  partially observed 432
  predictable 418, 446
  random 175
  regular 419
  singular 419
  stationary (strict sense) 376 ff.
  stationary (wide sense) 387 ff.
  stochastic 446, 483
Sequential compactness 316
Series of random variables 359
Sets of convergence 483 ff.
Shifting operators 527
σ-additive 132
Sigma algebra 131, 134
  asymptotic 355
  generated by ξ 172
  tail, terminal 355
Signal, detection of 434
Significance level 71
Simple
  moments 289
  random variable 32
  random walk 546
  semi-invariants 289
Singular measure 154 ff., 492
Singular sequence 419
Singularity of distributions 493 ff.
Skorohod, A. V. 148
Slowly varying 505
Spectral
  characteristic 406
  density 390
    rational 409, 428
  function 394
  measure 394
  representation of
    covariance function 387 ff., 393
    sequences 401, 402
  window 416
Spectrum 390
Square-integrable martingale 454 ff., 466, 487, 506

Stable 338
Standard deviation 41, 232
State space 110
States, classification of 528 ff., 533 ff.
Stationary
  distribution 118, 528, 542
  Markov chain 118
  sequence Chap. V, 376 ff.; Chap. VI, 387 ff.
Statistical estimation 412 ff.
Statistically independent 28
Statistics 4, 50
Stieltjes, T. J. 181, 201
Stirling's formula 20, 22
Stochastic
  integral 390
  matrix 111, 546
  measure 390, 396 ff.
    extension of 399
    orthogonal 397
    with orthogonal values 397, 398
  process 4, 175
  sequence 446, 483
Stochastically independent 42
Stopped process 449
Stopping time 82, 103, 448
Strong law of large numbers 363 ff., 472, 482
Strong Markov property 125
Structure function 397
Student distribution 154, 242
Submartingales 446 ff.
  convergence of 476 ff.
  generalized 448, 491
  local 449
  nonnegative 477
  nonpositive 476
  sets of convergence of 483 ff.
  uniformly integrable 477
Substitution, integration by 209
Sum of
  dependent random variables 509 ff.
  events 11
  exponential random variables 243
  Gaussian random variables 241
  independent random variables 326 ff., Chap. IV, 354 ff.
  Poisson random variables 242
  sets 11, 134
Summation by parts 365
Supermartingale 447
Symmetric difference Δ 43, 134
Symmetric events 357
Szegő-Kolmogorov formula 436



Tables
  continuous densities 154
  discrete densities 153
  terms in set theory and probability 134, 135
Tail 49, 321, 332
  algebra 355 ff.
Taxi stand 112
t-distribution 154, 242
Terminal algebra 355
Three-series theorem 362
Tight 315
Time
  change (in martingale) 456 ff.
  continuous 175
  discrete 175
  domain 175
Toeplitz, O. 365
Total probability 25, 75, 77
Trajectory 176
Transfer function 406
Transform 450
Transformation, measure-preserving 376 ff.
Transition
  function 524
  matrix 111
  probabilities 111, 246, 524
Trial 30
Trivial algebra 12
Tulcea. See Ionescu Tulcea
Two-dimensional Gaussian density 160
Two-series theorem 361
Typical
  path 48, 52
  realization 48

Unbiased estimator 68, 412
Uncorrelated 42, 232
  increments 107
Unfavorable game 86, 87, 452
Uniform distribution 153, 154
Uniformly
  asymptotically infinitesimal 510
  continuous 326
  integrable 186 ff.

Union 11, 134
Uniqueness of
  distribution function 280
  solution of moment problem 293
Universal probability space 250
Unordered
  samples 6
  sets 166
Upper function 371

Variance 41
  conditional 81
  of sum 42
Variation, mutual, quadratic 455
Vector
  Gaussian 236
  random 35, 175, 236, 297, 299

Wald's identities 105, 460 ff., 464
Water level 392, 393
Weak convergence 306 ff.
Weierstrass approximation theorem
  for polynomials 54
  for trigonometric polynomials 280
White noise 390, 407, 412
Wiener, N.
  measure 166, 167
  process 304, 305
Window, spectral 416
Wintner, A. 372
Wold's
  expansion 418 ff., 422
  method 517
Wolf, R. 223

Zero-or-one laws 354 ff., 479
  Borel 355
  for Gaussian sequences 501
  Hewitt-Savage 357
  Kolmogorov 354, 356
Zhurbenko's estimator 417
Zygmund, A. 469

Graduate Texts in Mathematics

1 TAKEUTI/ZARING. Introduction to Axiomatic Set Theory. 2nd ed.
2 OXTOBY. Measure and Category. 2nd ed.
3 SCHAEFFER. Topological Vector Spaces.
4 HILTON/STAMMBACH. A Course in Homological Algebra.
5 MAC LANE. Categories for the Working Mathematician.
6 HUGHES/PIPER. Projective Planes.
7 SERRE. A Course in Arithmetic.
8 TAKEUTI/ZARING. Axiomatic Set Theory.
9 HUMPHREYS. Introduction to Lie Algebras and Representation Theory.
10 COHEN. A Course in Simple Homotopy Theory.
11 CONWAY. Functions of One Complex Variable. 2nd ed.
12 BEALS. Advanced Mathematical Analysis.
13 ANDERSON/FULLER. Rings and Categories of Modules.
14 GOLUBITSKY/GUILLEMIN. Stable Mappings and Their Singularities.
15 BERBERIAN. Lectures in Functional Analysis and Operator Theory.
16 WINTER. The Structure of Fields.
17 ROSENBLATT. Random Processes. 2nd ed.
18 HALMOS. Measure Theory.
19 HALMOS. A Hilbert Space Problem Book. 2nd ed., revised.
20 HUSEMOLLER. Fibre Bundles. 2nd ed.
21 HUMPHREYS. Linear Algebraic Groups.
22 BARNES/MACK. An Algebraic Introduction to Mathematical Logic.
23 GREUB. Linear Algebra. 4th ed.
24 HOLMES. Geometric Functional Analysis and its Applications.
25 HEWITT/STROMBERG. Real and Abstract Analysis.
26 MANES. Algebraic Theories.
27 KELLEY. General Topology.
28 ZARISKI/SAMUEL. Commutative Algebra. Vol. I.
29 ZARISKI/SAMUEL. Commutative Algebra. Vol. II.
30 JACOBSON. Lectures in Abstract Algebra I: Basic Concepts.
31 JACOBSON. Lectures in Abstract Algebra II: Linear Algebra.
32 JACOBSON. Lectures in Abstract Algebra III: Theory of Fields and Galois Theory.
33 HIRSCH. Differential Topology.
34 SPITZER. Principles of Random Walk. 2nd ed.
35 WERMER. Banach Algebras and Several Complex Variables. 2nd ed.
36 KELLEY/NAMIOKA et al. Linear Topological Spaces.
37 MONK. Mathematical Logic.
38 GRAUERT/FRITZSCHE. Several Complex Variables.
39 ARVESON. An Invitation to C*-Algebras.
40 KEMENY/SNELL/KNAPP. Denumerable Markov Chains. 2nd ed.
41 APOSTOL. Modular Functions and Dirichlet Series in Number Theory.
42 SERRE. Linear Representations of Finite Groups.
43 GILLMAN/JERISON. Rings of Continuous Functions.
44 KENDIG. Elementary Algebraic Geometry.
45 LOÈVE. Probability Theory I. 4th ed.
46 LOÈVE. Probability Theory II. 4th ed.
47 MOISE. Geometric Topology in Dimensions 2 and 3.
48 SACHS/WU. General Relativity for Mathematicians.
49 GRUENBERG/WEIR. Linear Geometry. 2nd ed.
50 EDWARDS. Fermat's Last Theorem.

Graduate Texts in Mathematics

51 KLINGENBERG. A Course in Differential Geometry.
52 HARTSHORNE. Algebraic Geometry.
53 MANIN. A Course in Mathematical Logic.
54 GRAVER/WATKINS. Combinatorics with Emphasis on the Theory of Graphs.
55 BROWN/PEARCY. Introduction to Operator Theory I: Elements of Functional Analysis.
56 MASSEY. Algebraic Topology: An Introduction.
57 CROWELL/FOX. Introduction to Knot Theory.
58 KOBLITZ. p-adic Numbers, p-adic Analysis, and Zeta-Functions.
59 LANG. Cyclotomic Fields.
60 ARNOLD. Mathematical Methods in Classical Mechanics.
61 WHITEHEAD. Elements of Homotopy Theory.
62 KARGAPOLOV/MERZLJAKOV. Fundamentals of the Theory of Groups.
63 BOLLOBÁS. Graph Theory.
64 EDWARDS. Fourier Series. Vol. I. 2nd ed.
65 WELLS. Differential Analysis on Complex Manifolds. 2nd ed.
66 WATERHOUSE. Introduction to Affine Group Schemes.
67 SERRE. Local Fields.
68 WEIDMANN. Linear Operators in Hilbert Spaces.
69 LANG. Cyclotomic Fields II.
70 MASSEY. Singular Homology Theory.
71 FARKAS/KRA. Riemann Surfaces.
72 STILLWELL. Classical Topology and Combinatorial Group Theory.
73 HUNGERFORD. Algebra.
74 DAVENPORT. Multiplicative Number Theory. 2nd ed.
75 HOCHSCHILD. Basic Theory of Algebraic Groups and Lie Algebras.
76 IITAKA. Algebraic Geometry.
77 HECKE. Lectures on the Theory of Algebraic Numbers.
78 BURRIS/SANKAPPANAVAR. A Course in Universal Algebra.
79 WALTERS. An Introduction to Ergodic Theory.
80 ROBINSON. A Course in the Theory of Groups.
81 FORSTER. Lectures on Riemann Surfaces.
82 BOTT/TU. Differential Forms in Algebraic Topology.
83 WASHINGTON. Introduction to Cyclotomic Fields.
84 IRELAND/ROSEN. A Classical Introduction to Modern Number Theory.
85 EDWARDS. Fourier Series. Vol. II. 2nd ed.
86 VAN LINT. Introduction to Coding Theory.
87 BROWN. Cohomology of Groups.
88 PIERCE. Associative Algebras.
89 LANG. Introduction to Algebraic and Abelian Functions. 2nd ed.
90 BRØNDSTED. An Introduction to Convex Polytopes.
91 BEARDON. On the Geometry of Discrete Groups.
92 DIESTEL. Sequences and Series in Banach Spaces.
93 DUBROVIN/FOMENKO/NOVIKOV. Modern Geometry: Methods and Applications. Vol. I.
94 WARNER. Foundations of Differentiable Manifolds and Lie Groups.
95 SHIRYAYEV. Probability, Statistics, and Random Processes.
96 ZEIDLER. Nonlinear Functional Analysis and Its Applications I: Fixed-Point Theorems.

