Mathematical Theory
of Bayesian Statistics
Sumio Watanabe
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20180402
International Standard Book Number-13: 978-1-482-23806-8 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface ix
2 Statistical Models 35
2.1 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . 35
2.2 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . 41
2.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.5 Finite Normal Mixture . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Nonparametric Mixture . . . . . . . . . . . . . . . . . . . . . 59
2.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
References 309
Index 317
Preface
Sumio Watanabe
Chapter 1
Definition of Bayesian
Statistics
introduced.
(8) Statistical estimation in a conditionally independent case is studied, in
which the cross validation loss cannot be used but WAIC can be.
For readers who are new to probability theory, chapter 10 will be helpful.
If we knew both P(x1, x2, ..., xn |q) and P(q), where P(q) is an a priori
probability distribution of a true distribution, then by Bayes' theorem,

P(q|x1, x2, ..., xn) = P(x1, x2, ..., xn |q) P(q) / P(x1, x2, ..., xn),

which would give the statistical inference of q(x) from a sample {x1, x2, ...,
xn}. However, in the real world, we do not have any information about
either of them, so P(q|x1, x2, ..., xn) cannot be obtained.
A problem whose answer cannot be uniquely determined because of a lack
of information is called an ill-posed problem. Statistical inferences in the
real world are ill-posed. In an ill-posed problem, we cannot determine a
uniquely optimal method by which a correct answer is automatically
obtained, which leads us to propose a new way:
C = (2π)^{N/2} √det(S).

and

∫ δ(x) dx = 1.

q(x) = (1/3) δ(x − 1) + (2/3) δ(x − 2).
If X is a random variable which is subject to q(x) and Q, then Y = f(X) is
also a random variable which is subject to

p(y) = ∫ δ(y − f(x)) q(x) dx,

P(A) = ∫_{f(x)∈A} q(x) dx.
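The change-of-variable formula and the event probability P(A) above can be checked numerically by Monte Carlo. The sketch below assumes a hypothetical choice, q = standard normal density and f(x) = x^2, neither of which is specified in the text:

```python
import random

random.seed(1)
n = 200_000
# Hypothetical choice: q(x) is the standard normal density, f(x) = x**2.
# Then P(A) = integral of q(x) over {x : f(x) in A}; take A = [0, 1].
hits = sum(1 for _ in range(n) if random.gauss(0.0, 1.0) ** 2 <= 1.0)
p_hat = hits / n
# For a standard normal X, P(X**2 <= 1) = P(-1 <= X <= 1), about 0.6827.
print(f"estimated P(f(X) in A) = {p_hat:.3f}  (exact ~ 0.683)")
```

The same sampling scheme estimates p(y) itself if the indicator is replaced by a histogram of the values f(x).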
if the right-hand side is finite, where ( )^T denotes the transposed matrix.
If N = 1, then V[X] and V[X]^{1/2} are called the variance and the standard
deviation, respectively.
where q(x) and q(y) are called the marginal probability densities of X and Y,
respectively. The conditional probability density of Y for a given X is
defined by

q(y|x) = q(x, y) / q(x).

If q(x) = 0, then q(y|x) is not defined; however, we define 0 · q(y|x) = 0. The
conditional probability density function q(x|y) is also defined by q(x, y)/q(y).
Then it follows that q(x, y) = q(y|x) q(x) = q(x|y) q(y).
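As a minimal numerical illustration of the definition q(y|x) = q(x, y)/q(x), the following sketch uses a hypothetical discrete joint distribution, not one from the text:

```python
# Hypothetical discrete joint distribution q(x, y) on {0,1} x {0,1},
# used to illustrate q(y|x) = q(x, y) / q(x).
q_joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def q_x(x):
    # Marginal q(x) = sum over y of q(x, y).
    return sum(p for (xx, _), p in q_joint.items() if xx == x)

def q_y_given_x(y, x):
    # Conditional q(y|x); defined only when q(x) > 0.
    return q_joint[(x, y)] / q_x(x)

print(q_y_given_x(1, 0))  # 0.3 / 0.4 = 0.75
```

For each fixed x with q(x) > 0, the values q(y|x) sum to one over y, as a conditional distribution must.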
X^n = (X1, X2, ..., Xn).

Throughout this book, the notation n is used for the number of random
variables. Sometimes X^n and n are referred to as a sample and a sample
size, respectively. A realized value of X^n in a trial is denoted by

x^n = (x1, x2, ..., xn).
If X^n is subject to the probability density function

q(x1) q(x2) · · · q(xn),

then X^n is called a set of independent random variables which are subject
to the same probability density q(x). Here q(x) is sometimes referred to as
the true probability density. In practical applications, we do not know q(x),
but we assume that such a density q(x) exists.
For an arbitrary function f : x^n ↦ f(x^n) ∈ R, the expected value of
f(X^n) over X^n is denoted by E[ ]. That is to say,

E[f(X^n)] = ∫ · · · ∫ f(x^n) Π_{i=1}^{n} q(xi) dx1 dx2 · · · dxn.
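This n-fold integral can be approximated by averaging f over repeated i.i.d. draws of the whole sample x^n. The sketch below assumes a hypothetical q = Uniform(0, 1) and f = sample mean, for which the exact expected value is 1/2:

```python
import random

random.seed(2)

def f(xn):
    # Hypothetical test function: the sample mean of x^n.
    return sum(xn) / len(xn)

# Estimate E[f(X^n)] with X1,...,Xn i.i.d. from q = Uniform(0, 1):
# the n-fold integral of f(x^n) q(x1)...q(xn) becomes an average over
# repeated draws of the whole sample x^n.
n, trials = 10, 50_000
est = sum(f([random.random() for _ in range(n)]) for _ in range(trials)) / trials
print(f"E[f(X^n)] ~ {est:.3f}  (exact 0.5 for this choice of f and q)")
```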
E[Sn] = S,

V[Sn] = (1/n) [ ∫ q(x, y) (log q(y|x))^2 dx dy − S^2 ].
For the case when Y n is independent for a given X n , see Sections 1.8 and
5.5.
Assume that a true distribution is q(x) = p(x|0, 1) and the number of inde-
pendent random variables is n. In this case, the parameter that attains the
true density is unique,
N(x) = (1/√(2π)) exp(−x^2/2).
by which one might expect that the posterior distribution will concentrate
on the neighborhood of the true parameter (0.5, 0.3). The real posterior
distributions for n = 100, n = 1000, and n = 10000 are shown in Figures
1.4, 1.5, and 1.6, respectively.
[Figures 1.4, 1.5, and 1.6: posterior distributions for n = 100, n = 1000,
and n = 10000; graphics not reproduced.]
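The concentration of the posterior as n grows can be reproduced in a simplified one-parameter analogue. The sketch below uses a Bernoulli model on a parameter grid with a uniform prior; this model is an assumption for illustration, not the model used in the figures:

```python
import math, random

random.seed(3)

# Simplified one-parameter analogue of posterior concentration:
# grid posterior for a Bernoulli parameter with a uniform prior.
grid = [i / 200 for i in range(1, 200)]

def posterior_sd(n, p_true=0.5):
    # Draw n Bernoulli(p_true) observations.
    xs = [1 if random.random() < p_true else 0 for _ in range(n)]
    k = sum(xs)
    # Unnormalized log posterior on the grid (uniform prior drops out).
    logpost = [k * math.log(p) + (n - k) * math.log(1 - p) for p in grid]
    m = max(logpost)
    w = [math.exp(lp - m) for lp in logpost]
    z = sum(w)
    mean = sum(p * wi for p, wi in zip(grid, w)) / z
    var = sum((p - mean) ** 2 * wi for p, wi in zip(grid, w)) / z
    return math.sqrt(var)

for n in (100, 1000, 10000):
    print(n, round(posterior_sd(n), 4))
```

The posterior standard deviation shrinks roughly like 1/√n, mirroring the tightening seen in the figures as n goes from 100 to 10000.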
Note that both Gn and Tn are random variables. Let S be the entropy
of a true distribution given by eq.(1.1). Then it immediately follows that

Gn − S = − ∫ q(x) log p(x|X^n) dx + ∫ q(x) log q(x) dx
       = ∫ q(x) log [ q(x) / p(x|X^n) ] dx
       = K(q(x)||p(x|X^n)),                              (1.15)

which shows that a smaller generalization loss is equivalent to a smaller
Kullback-Leibler distance. Two training losses Tn(1) and Tn(2) can be
defined for both sets, but they do not have such a property. In other words,
a smaller training loss does not imply a smaller generalization error.
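The identity (1.15) can be checked numerically in the discrete case, where the integrals become sums; the distributions q and p below are hypothetical stand-ins for the true density and the predictive density:

```python
import math

# Numerical check of eq.(1.15) for discrete distributions: with a true
# distribution q and a predictive distribution p (hypothetical values),
# G - S equals the Kullback-Leibler divergence K(q||p).
q = [0.2, 0.5, 0.3]      # true distribution
p = [0.25, 0.45, 0.30]   # stands in for p(x|X^n)

G = -sum(qi * math.log(pi) for qi, pi in zip(q, p))  # generalization loss
S = -sum(qi * math.log(qi) for qi in q)              # entropy of q
KL = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

print(G - S, KL)  # the two values agree
```

Since K(q||p) is nonnegative, Gn ≥ S always holds, with equality exactly when the predictive density recovers the true density.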
Definition 2. Assume n ≥ 2. Let X^n \ Xi be the set of random variables
X1, X2, ..., Xn which does not contain Xi, and let p(x|X^n \ Xi) be the
predictive density using X^n \ Xi. The cross validation loss is defined by

Cn = − (1/n) Σ_{i=1}^{n} log p(Xi |X^n \ Xi).            (1.16)
Cn ≥ Tn .