ISYE 6420 Bayesian Statistics

Academic details

Georgia Tech Fall 2019

Roshan Vengazhiyil, Ph.D., Professor Brani Vidakovic, Ph.D. Professor (Video Lectures)

Course Website
https://www2.isye.gatech.edu/~brani/isye6420/

HomeWork Page
https://www2.isye.gatech.edu/~brani/isye6420/homeworks.html

Software

WinBUGS and OpenBUGS statistical software
free download:
https://www.mrc-bsu.cam.ac.uk/software/bugs/
http://www.openbugs.net/w/FrontPage

Tutorial: openbugs.net/Manuals/Tutorial.html

OCTAVE or R or Python (needed for calculations for which WinBUGS/OpenBUGS are not appropriate)
free downloads:
www.gnu.org/software/octave/ (Quick Guide: http://ubcmatlabguide.github.io/)
www.r-project.org/
www.python.org/

edX Link : https://courses.edx.org/dashboard?tpa_hint=saml-gatech
Youtube : https://www.youtube.com/playlist?list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO

Extra resources
University of Auckland (New Zealand?)
Stats 331 - Bayesian Statistics
https://www.youtube.com/watch?v=HVlXEwVD8dw&list=PLuRpZIQQRQedb2GM2WhKSEzojGN-BIIR9

Lectures - Unit 1

Introduction

Welcome to Bayesian Statistics. We will not get too theoretical, this course is mostly an applied, hands on approach. Most of what we do will be in WINBugs. You should have some basic statistics and programming under your belt.

Topics Covered

What are we going to cover? Unit 1 - Is simply an intro to this class
Unit 2 - History behind Bayes and comparison to classical methods of statistical inference
Unit 3 - Review of probability with a focus on conditional probabilities
Unit 4 - Bayes Theorem, Conjugation, Prior elicitation
Unit 5 - Bayesian Computation, Markov chain Monte Carlo, algorithms
Unit 6 - Graphical Models and advanced WinBugs
Unit 7 - Hierarchical models, Bayesian Linear Models
Unit 8 - Missing/censored data
Unit 9 - Model building and selection
Unit 10 - Applications
Unit 11 - Review Exam prep

Refresher

https://www.youtube.com/watch?v=uSAE1-wfIKU

Likelihood Principle: All of the information in a sample is contained in the likelihood function, a density or a distribution function

  • The data is modeled by a likelihood function
  • Not all statistical paradigms agree with this principle

Recall:
For a Probability Density function $ f(x, \theta) $ where $\theta$ is a fixed but arbitrary parameter and x is variable
The likelihood function is $ f(x, \theta) $ where $\theta$ is variable and x is fixed The Maximum Likelihood will occur where f is maximized for a fixed x. This is determined using the maximum likelihood estimator computed as follows

$$ \hat{\theta} = max_{\theta} f(x_1,...,x_n,\theta) $$

Classical/frequentists methods of inference consider $\theta$ to be fixed and unknown

  • performance of methods evaluated by repeated sampling
  • consider all possible data sets Bayesian consider $\theta$ to be random
  • only consider observed data set and prior information

Lectures - Unit 2

Bayesians vs Frequentist/Classical

Classical statistics assign probabilities to only those quantities that are truly random in nature, i.e., these probabilities have the connotation of the above quantities being results of random physical experiments.

In Bayesian statistics, probabilities are subjective. Probabilities can be assigned to events that are not even physically random in nature. In fact, probabilities can be assigned to any statement. Probabilities can also be assigned to deterministic quantities, and these probabilities only express the degree of belief.

The essential difference between Bayesian and Frequentist statisticians is in how probability is used. Frequentists use probability only to model certain processes broadly described as "sampling." Bayesians use probability more widely to model both sampling and other kinds of uncertainty.

Example 1

Suppose we have a 2 sided coin and we want to determine the probability of a heads.

  • A Frequentist would estimate P(H) = $\hat{p}$ based solely on the results of an experiment
  • A Bayesian would decide on a prior probability (prior to the experiment beginning) and then incorporate the results of the experiment.

After conducting an experiment of 10 coin flips, we count 0 heads and 10 tails
The Frequentist claims:
The relative frequency of an event is a good estimator of the probability of that event
ie: $ \hat{p} = \frac{X}{n} = \frac{0}{10} = 0 $
and the likelihood of an event is given as
$ L(p) = (1-p)^{10} $ with a max at p=0

Of course this is a rather odd result and few would believe it.

The Bayesian would approach this experiment in a slightly different fashion

  • First they would determine the prior probability of heads (prior to the experiment beginning)
  • Second they would determine the posterior incorporating the results
  • Third they repeat the process with an updated prior equal to the current posterior Finally they would estimate the probability from the posterior, with the posterior distibution being the end goal for a bayesian. $P(H)=\hat{p}$ could be estimated from the posterios distributions mean, median or mode.

The posterior distribution in this case would be $ (1-p)^{10} $
We then turn this into a density by multiplying by 11. (So as to make the integral from 0 to 1 equal to 1) Then take the mean to get 1/12.

To check the above statements we'll need a bit of a refresher For a Probability density function f(x) $$ \int_{} f(X) dx = 1 $$ and $$ E(X) = \int_{} x f(X) dx $$

Lectures - Unit 3

Basic Probability Review

(S) Sample Space: Is the set of all possible outcomes from an experiment
$S_i$ is often used to denote the i'th outcome
(E) Event: Is the set of elementary outcomes for an experiment
Often {$E = S_i$, for multiple i} making E a subset of the sample space
($E^c$) Event complement = Events in S NOT in E

$P(S_i)$ is the probability of the i'th outcome
$P(E)$ is the probability of an Event
$P(E^c) = 1-P(E)$ is the probability of the complement

Example: Experiment = Rolling a 6-sided die and observing the result
Sample Space = S = {1,2,3,4,5,6}
Event = This can be defined multiple ways depending on your target interest
E1 = P(Si) = #n/#S = 1/6 = where Si is any 1 of the possible outcomes
E2 = P(Si) = #n/#S = 3/6 = where Si is an even number so E={S2,S4,S6}

Basic Set Theory Review

$P(A \cup B)$ Union: For 2 or more events it is the event composed of all elements from A and B
$P(A \cap B)$ Intersection: For 2 or more events it is the event composed of all elements in both A and B
$P(A-B)$Difference: For 2 or more events it is the events in the LHS argument less those in RHS Exclusive/Non-overlapping: Events that share no common outcomes

Basic Observation wrt to probabilities $P(S)$ = 1, because S is the set of all outcomes
$P(S^c)$ = 0, The empty set
$P(A \cup B) = P(A)+P(B)$ when A,B are exclusive
$P(A^c) = 1-P(A)$ when $S = P(A \cup A^c)$ (Note that A and $A^c$ are always exclusive) $P(A \cup B) = P(A)+P(B)-P(AB)$ general/arbitrary form
$P(AB)=P(A)P(B)$ A,B are independent events

Reminder:
Exclusive events share no outcomes
Independent events may share outcomes.
Independence is defined as: Probability of an intersection equal to the product of the each probability.

Example - Circuit Problem

Serial Circuit: A series of connections lined up in a linear fashion (inline)
System $---A_1---A_2---A_3---A_n-----$
In this case any failure in the nodes will result in a system failure
Similarly Success means all of the nodes are successful $P(S works) = P(A_{1} A_{2} A_{3} ... A_{n})$

Parallel Circuit: A series of connections lined up in a horizontal Fashion
 flow can take any one of multiple paths
  $|---A_1---|$ ---
  $|---A_2---|$
  $|---A_3---|$
--- $|---A_n---|$
In this case a system failure will only occur if ALL of the nodes fail because of independence.
$P(S works) = 1-P(A_{1}^{c} A_{2}^{c} A_{3}^{c} ... A_{n}^{c})$

Independence & Conditional probabilities

The probability of an event can sometimes be affected by a preceeding event. For example the probability of say rain could depend on the number of days it has already rained. This is a conditional probability as the number of preceding rain days influences the probability. However, if we computed the probability of rain on the relative frequency of rain over a time frame then we would be dealing with an unconditional probability.

Conditional Probability of an event A given that event B has occured
$ P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(AB)}{P(B)} \text{assuming P(B) > 0 }$

Similarly, by symmetry, we also have
$ P(B|A) = \frac{P(A \cap B)}{P(A)} $

Independence
A and B are said to be independent if one of the following are true

  1. $P(A|B) = P(A)$ or $P(B|A) = P(B)$
  2. $P(A \cap B) = P(A)P(B)$
    Otherwise they are said to be dependent NB $P(A \cap B)$ is often written P(AB)

Example 2

Consider the following example/Experiment
We have a standard deck cards with 52 cards. It can be evenly divided into 4 suits each containing 13 choices. And it contains 4 queens.
We select 1 card at random
A = card is a spade
B = card is a queen
So $(A \cap B)$ = card is a queen of spade

Observe that $P(A \cap B) = 1/52$
$P(A)P(B) = \frac{13}{52} x \frac{4}{52} = \frac{1}{52} = P(A \cap B)$ So we have independence
But what if we remove the 2 of diamonds? Then
$P(A)P(B) = \frac{13}{51} x \frac{4}{51} \ne P(A \cap B)$

Laws of Probabilities

Multiplicative Law
$ P(A \cap B) = P(A)P(B|A) = P(B)P(A|B) $
Whenever A and B are independent then this reduces to
$ P(A \cap B) = P(A)P(B) $

Additive Law
$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $
if A and B are exclusive then $P(A \cap B)= 0 $

Total Law of Probability

Let
$B_1, B_2,..., B_n$ represent a total partition of the sample space S
such that
$ B_i \cap B_j = 0 \text{ for } i \ne j$

If A is any subset of S and ${B_1,B_2,...,B_n}$ is a partition of S
then A can be decomposed as A = $(A \cap B_1) \cup (A \cap B_2) \cup ... \cup (A \cap B_n)$
with
$ P(A) = \sum_{i=1}^{k} P(A|B_i)P(B_i) $

Note Often times we will partition the sample space according to a hypothesis that also partitions the event we are interested in.

Example Let $S = H_1 \cap H_2 \cap H_3 \cap ... \cap H_n $
where $H_i H_j = 0$ ie they're exclusive
Let A be an event in S
then
$A = AS$
$= A(H_1 \cap H_2 \cap H_3 \cap ... \cap H_n)$
$= AH_1 \cup AH_2 \cup AH_3 \cup ... \cup AH_n $
We now compute P(A)
$P(A) = P(AH_1)+P(AH_2)+P(AH_3)+...+P(AH_n)$ By exclusivity $= P(A|H_1)P(H_1)+P(A|H_2)P(H_2)+P(A|H_3)P(H_3)+...=P(A|H_n)P(H_n)$ as intersections

Ex - Manufacturing Bayes

There are three machines 1,2,3

machine Conforming Volume
$M_1$ 0.94 .3
$M_2$ 0.95 .5
$M_3$ 0.97 .2

One item is random selected. What is the probability of conforming (Denoted C)?

Let our Hypothesis by $H_i$ item is from machine i
Then by the law of total probability
$P(C)=P(C|M_1)P(M_1) + P(C|M_2)P(M_2) + P(C|M_3)P(M_3)$
$=(.94)(.3)+(.95)(.5)+(.97)(.2)$
$=0.957$

Ex - Bridged Circuit

Imagine a circuit(S) with 5 nodes: A1 through A5
title

Observe that $A_5$ bridges the two branches making this cicuit both serial and parallel
To reduce the problem to what we've seen before we define our hypotheses as
$H_1:A_5$ works
$H_2:A_5$ fails ($=H_{1}^{c}$)

Observe that $H_1,H_2$ is an exclusive partition of our sample space
Now using our two hypotheses we have 2 scenarios to determine S works

$H_1$ composed of $(A_1 \cup A_3)(A_2 \cup A_4)$ 2 parallel cicuits connected at A5
$H_2$ composed of $(A_1 \cap A_2)\cup(A_3 \cap A_4)$ 2 parallel inline cicuits (Top line or bottom line)

$P(S|H_1)=(1-q_1 q_3)(1-q_2 q_4)$ where $q_i=(1-p_i)$
= 0.8536
and $P(S|H_2)=1 - (1-q_1 q_2)(1-q_3 q_4)$ This is 1 - (probability both inlines fail)
= 0.7984

Now using the law of total probability
$P(S) = P(S|H_1)P(H_1) + P(S|H_2)P(H_1)$
= .8315 = 83.15%

Bayesian Learning

Suppose we are interested in $P(H_i | A)$ for some hypothesis $H_i$
This implies an update to $P(H_i)$ given that A has already occured
By total probability we can compute P(A) (which is the weighted average of conditional probabilities, where the weights are the probabilities of the hypotheses)
We also know that by definition
$$P(H_i | A) = \frac{P(A \cap H_i)}{P(A)} = \frac{P(A | H_i)P(H_i)}{P(A)}$$

The terms on the right end represent
The probability of an intersction as a product of probabilities conditioned on a prior event A
This is in fact Bayes formula

$P(H_i)$ is known before A happens then A happens
Now $P(H_i)$ must be updated in light of the event A
ie $P(H_i) \text{ becomes } P(H_i | A) $
This is a shift from a prior probability to a posterior probability that incorporates new information

Recall our earlier example using the deck of cards
$P(Q_{spades}) = 1/52$
Selection 1 $S_1=2_{diamonds}$
Now
$P(Q_{spades} | S_1=2_{diamonds} ) = 1/51$

Bayes Rule

$ P(B_j|A) = \frac{P(A|B_j)P(B_j)}{P(A)} $

$ P(B_j) $ is the prior, or marginal, probability of $B_j$

$ P(B_j|A) $ is the posterior, or conditional, probability of $B_j$, after A happens

$ P(A|B_j) $ is the likelihood

Alternate way of looking at Bayes

Posterior = (Likelihood x Prior) / Evidence

Ex - v2 Manufacturing Bayes

Let's re-examine our earlier example using Bayes formula
Previously we used total probability to find that
$P(Conforming) = 0.957 $ for a randomly selected item
Now we wish to determine the probability that it came from Machine 1

$\begin{equation} \begin{split} P(M_1 | Conforming ) & = \frac{ P(Conforming | came from M_1)P(came from M_1) }{P(Conforming)} \\ & = \frac{(0.94)(0.3)}{(0.957)} \\ & = 0.2965 \end{split} \end{equation}$
Eazy peazy

Ex - v2 Bridged Circuit

Previously we found that

  1. $P(Circuit(S) works) = 0.8315$
  2. $P(S|H_1)=0.8536$ (where $H_1 = A_5$ works hypothesis)

Now we want: Probability that $A_5$ works given that S works
$\begin{equation} \begin{split} P(A_5=works | S=works) & = \frac{P(S=works|H_1)P(H_1)}{P(S=works)} \\ & = \frac{(0.8536)(0.6)}{(0.8315)} \\ & = 0.6159 \end{split} \end{equation} $

Ex - Two headed coin

We have a box with N coins, N-1 are fair and 1 of them are two headed (head on both sides)
We select 1 coin at random (and we don't look at it)
Experiment: We flip the selected coin k times
Result: we get heads for all the flips

What is the probability that the 2-headed coin was selected?

Begin by defining the Event
A = Coin lands heads up k times in k flips Define your hypotheses
$H_1$ A fair coin was selected (NB $P(H_1)=(N-1)/N$
$H_2$ Unfair coin was selected (NB $P(H_2)=1/N$

Now we consider the implications of each Hypothesis
$P(A|H_1) = \frac{1}{2} ... \frac{1}{2} \text{k-times} = \frac{1}{2^k} $
$P(A|H_2) = 1$

Using the law of total probability
$\begin{equation} \begin{split} P(A) & = P(A|H_1)P(H_1) + P(A|H_2)P(H_2) \\ & = \frac{1}{2^k} \frac{(N-1)}{N} + 1 \frac{1}{N} \\ & = \frac{N-1+2^k}{2^kN} \\ \end{split} \end{equation} $
Putting this back into Bayes yields
$\begin{equation} \begin{split} P(H_2 | A) & = \frac{\frac{1}{N}x1}{\frac{N-1+2^k}{2^kN}} \\ & = \frac{2^k}{(N-1)+2^k} \text{ simplified } \end{split} \end{equation} $

Now you need to choose a k and an N and simulate
Let's take N = 1,000,000
for k=20, we get $P(H_2|A)=.5119$
for k=40, we get $P(H_2|A)=.9999$ truncated to 4 decimal

Basic Bayes Network

Lectures - Unit 4

Basic Distributions Review

Models and Parameters

Random Variable is a mapping, not really a variable. It maps numbers from the sample space S to the real numbers. Recall that the sample space is the set of elementary outcomes from an experiment. Then a function F that maps those outcomes to a set of real numbers can be thought of as a random variable.

Random variables are divided into two main types based on their realizations, the numbers they map to. They can be discrete, with fixed values, or continuous, in an interval. In either case the summation of the probabilities must equal 1.

Discrete Distributions are called Probability mass functions and written as a summation
$ P(X = x_i) = p_i ; \sum p_i = 1 $
NB: Each $x_i$ is atomic with probability $p_i$

Continuous Distributions are called Probability density functions and written as an integral
$ P(X \in I) = \int_I f(x) dx = 1$ when I is the entire interval
NB: The realization of a continuous distribution is always on an interval.

Commonly discrete Distributions
Bernoulli $X \sim Ber(p) \sim p,q=1-p$
1 parameter: p, and 2 possible realization 0, or 1

Binomial $X \sim Bin(n,p) \sim p_k = P(X=k)= {n \choose k} p^k q^{n-k} $
2 parameters n & p, and n possible realizations
k cannot exceed n.

Poisson $ X \sim Pois(\lambda) \sim p_k = P(X=k) = \frac{\lambda^k}{k!} e^{-\lambda}$
Often used to model counts in a space

  • 2 parameters $\lambda$ (aka rate parameter) and k possible realizations
  • Unlike the binomial, k may exceed n, it's not bounded by n
  • $\lambda$ represents the limiting distribution of the binomial
  • Often we approx $\lambda$ as np for a large n and a small p
  • as n goes to infinity np should converge to $\lambda$

Common Continuous Distributions
Uniform
$X \sim U(a,b): f(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} & a \leq x \leq b \\ 0 & \quad else \end{array} \right. $

Exponential $X \sim Exp(\lambda): f(x) = \lambda e^{-\lambda x}; \lambda > 0, x \geq 0 $
The version above is the Paramter(rate). Alternatively you can take $\lambda = \frac{1}{\mu}$ to get the Parameter(scale) version.
$f(x) = \frac{1}{\mu} e^{-\frac{x}{\mu}};$

Beta
$X \sim Be(a,b): f(x) = \frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}$
where $B(a,b) = \int_{0}^{1} x^{a-1} (1-x)^{b-1} dx$

Normal - the most widely used of all distributions
$$X \sim N(\mu,\sigma^2) : f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x-\mu)^2}{2 \sigma^2} } $$ where $\sigma >0, -\infty < x,\mu < \infty$

  • $\mu$ is the mean aka: location shift parameter
  • $\sigma$ is the variance aka: scale parameter

Cumulative Distribution Function
Denoted with a capital F rather than a little f
Is the area up to the point x
$F(x) = P(X \leq x) = \left\{ \begin{array}{ll} \sum_{x_i \leq x} P(X=x_i) \\ \int_{-\infty}^{x} f(x) dx \end{array} \right. $

Example - Exponential for $x \geq 0$
$\begin{equation} \begin{split} F(x) & = \int_{0}^{x} \lambda e^{-\lambda t} dt \\ & = -e^{-\lambda t} \Big|_0^x \\ & = (-e^{-\lambda x}) - (-1) \\ & = 1-(e^{-\lambda x}) \end{split} \end{equation} $

Numerical Characteristics

Expectations
$ \begin{equation} \begin{split} E(X) & = \sum_{i} x_i p_i \\ & = \int x f(x) dx \end{split} \end{equation} $

k'th moment
$ \begin{equation} \begin{split} E(X^k) & = \sum_{i} x_i^k p_i \\ & = \int x^k f(x) dx \end{split} \end{equation} $

General function
let g(x) be the general func
$ \begin{equation} \begin{split} E(X^k) & = \sum_{i} g(x_i) p_i \\ & = \int g(x) f(x) dx \end{split} \end{equation} $

Variance
$\sigma^2 = Var(X) = E(X - E(X))^2 = E(X^2) - E(X)^2$
$SD(X) = \sqrt{Var(X)} = \sigma$

Example - Exponential
$X \sim Exp(\lambda)$
$Var(X) = \frac{1}{\lambda^2}$ and $SD(X) = \frac{1}{\lambda^2}$

Quantiles
It is the point where the Cumulative distribution is equal to the probability
$E_p$ is the p'th quantile of $F$ if $F(E_p)=p$

  • $E_p = inf \{x|\sum_{x_i \leq x} p_i \geq p \}$
  • $\int_{0}^{E_p} f(x) dx = p$ or $F^{-1}(x) = E_p$

Example - Exponential
$X \sim Exp(\lambda); F(x) = 1-e^{-\lambda x}$
$F(E_p) = p; E_p = -\frac{1}{\lambda} log(1-p), 0 \leq p \leq 1$

Median: The point where the distribution is divided into 2. the 50% or 0.5 point

$Q_1$ and $Q_3$ are the 25% point and 75% point

Joint/Conditional Distributions

Let $X = (X_1, X_2, ... X_n)$ be a vector of Random Variables (RVs)
Then
$X = (X_1,X_2)$ is the joint distribution
and
$f(x_1|x_2) = \frac{f(x_1,x_2}{f(x_2)}$
gives the conditional distribution of $X_1$ given $X_2$
note that $f(x_2)$ is the marginal distribution for $X_2$
given by $ \int_{-\infty}^{\infty} f(x_1,x_2) d x_1 $ - the integral wrt $x_1$

Ex1 - Discrete
Consider the following table (Joint distribution)

X/Y-101Pr(X)
00.10.30.10.5
10.10.050.10.25
20.050.10.10.25
Pr(Y)0.250.450.3

We determined the last column & last row by summing the column/row. These are the marginal distributions
So we can now determine one of three conditional distributions as
$P(X|Y=0) = (.3/.45)+(.05/.45)+(.1/.45) $
Similarly
$P(Y|X=2) = (0.05/0.25)+(0.1/0.25)+(0.1/0.25)$

Ex2 - Continuous
Let
$f(x,y) = \frac{1}{\pi} e^{-\frac{1}{2}(x^2 - 2xy+5y^2)}; x,y \in \mathbb{R} $
We want to find the conditionals for both $X|Y=y$ and $Y|X=x$ for which we'll need the marginals as well
First we need to get this into an easy to integrate form
$ \begin{equation} \begin{split} f(x,y) & = \frac{1}{\pi} e^{-\frac{1}{2}(x^2 - 2xy+5y^2)} \\ & = \frac{1}{\pi} e^{-2y^2} e^{-\frac{1}{2}(x - y)^2} \text{by squaring the exponent} \\ & = \frac{1}{\pi} e^{-2y^2} \sqrt{2 \pi} \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2}(x - y)^2} \end{split} \end{equation} $

Now we can see that
$f(y) = \int_{\mathbb{R}} f(x,y) d x = \sqrt{\frac{2}{\pi}} e^{-2y^2} \int_{\mathbb{R}} \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2}(x - y)^2} $

  • The first term has nothing to do with x and thus the integral passes through
  • Recall that integral($e^{ax}$) = $\frac{1}{a}e^{ax}$

Finally we get
$ = \frac{1}{\sqrt{2 \pi \frac{1}{4}}} e^{- \frac{y^2}{2-\frac{1}{4}} } . 1 $ as the marginal of Y If you stare at this long enough you may notice that this is $N(0,(\frac{1}{2})^2)$

A similar process can be performed to get the marginal of X
Square the exponent, reduce, then integrate
You should get $N(0,(\frac{\sqrt{5}}{2})^2)$

Now we are ready to compute the conditional distributions
Recall from above $f(x_1|x_2) = \frac{f(x_1,x_2)}{f(x_2)}$
$f(x|y) = \frac{\frac{1}{\pi} e^{-\frac{1}{2}(x^2 - 2xy + 5y^2)}}{\frac{1}{\sqrt{2 \pi \frac{1}{4}}} e^{- \frac{1}{2} 4y^2 } } = \frac{1}{\sqrt{ 2 \pi}} e^{- \frac{1}{2} (x - y)^2}$
Which you may notice is simply $X|Y = y \sim N(y,1)$

For Y|X it's much more messy.
Following a similar procedure as previously should yield $ N(\frac{x}{5}; (\frac{1}{\sqrt{5}})^2) $

Bayes Theorem

Ingredients for Bayes Theorem

NB Bayes Rule should only be used for events. Here we speak about Bayes theorem which deals with the distribution functions

Let's begin by defining our terminology
Observations $X_1, X_2, ... , X_n$ are modeled as
$X_i \sim f^{iid}(x_i | \theta ) $
iid independent and identically distributed

Then the joint distribution of the sample $X_1, X_2, ... , X_n$ is given by
$f(x_1 | \theta)f(x_2 | \theta) \dotsm f(x_n | \theta) = \prod_{i=1}^{n} f(x_i | \theta)$

As a function of $\theta$, the joint distribution $\prod_{i=1}^{n} f(x_i | \theta)$ is called the likelihood
$ L(\theta | x_1, x_2, \dotsm x_n) \prod_{i=1}^{n} f(x_i | \theta)$

Likelihood Principle
All information about the experiment is contained in the likelihood

Example
Each $X_i$ in sample is an Exponentially distributed $Exp(\lambda)$.
Let $X_1=2,X_2=3,X_3=1$ be the observations
Then the likelihood is given by
$L(\lambda | x_1, x_2, x_3) = \lambda e^{-2\lambda} x \lambda e^{-3\lambda} x \lambda e^{-1\lambda} = \lambda^{3} e^{-6\lambda} $

Alternatively, if the x's were unknown a more generic form can be written as
$$L(\lambda | x_1, x_2, x_3) = \lambda^{3} e^{-\lambda \sum_{1}^{3}x_i} $$

Let $\theta$ be a parameter in $f(x|\theta)$
Define a prior distribution on $\theta$, as $\theta \sim \pi(\theta)$
where $\theta \in \Theta$, $\Theta$ is the sample space

The joint distribution of $(X,\theta)$ is $h(x,\theta)$
And the marginal of X is $m(x) = \int_{\Theta} h(x,\theta) d \theta$

Recall that a frequentist would think of $\theta$ as fixed

Now let's bridge together Bayes Rule and the theorem

$$\text{The rule says: }P(AH_i) = P(A|H_i)P(H_i) = P(H_i|A)P(A) $$$$\text{Which is also: }P(H_i|A) = \frac{P(A|H_i)P(H_i)}{P(A)} $$

By analogy we can construct Bayes Theorem
$$ h(x,\theta) = f(x |\theta)\pi(\theta) = \pi(\theta|x)m(x)$$ $$ \pi(\theta|x) = \frac{f(x |\theta)\pi(\theta)}{m(x)} $$ $$\text{Which is bayes theorem!}$$

Verbally
The posterior distribution $\pi(\theta|x)$ is the likelihood $f(x|\theta)$ multiplied by the prior $\pi(\theta)$ divided by the marginal $m(x)$

Example
Normal Likelihood + Normal prior
$x|\theta \sim N(\theta,\sigma^2),\sigma^2$ is known
$\theta \sim N(\mu,\tau^2), \mu,\tau $ elicited

Then
$ \begin{equation} \begin{split} h(x,\theta) & = N(\theta,\sigma^2) x N(\mu,\tau^2) \end{split} \end{equation} $

Which is a really long and ugly equation that boils down to the following result

$\theta|X \sim N(\frac{\tau^2}{\sigma^2 + \tau^2}x + \frac{\sigma^2}{\sigma^2 + \tau^2} \mu ; \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2} ) $

$X \sim N(\mu, \sigma^2 + \tau^2)$

We can now rewrite
$\frac{\tau^2}{\sigma^2 + \tau^2}x + \frac{\sigma^2}{\sigma^2 + \tau^2} \mu = wx + (1-w)\mu$

$\tau^2 \gg \sigma^2 \rightarrow w \simeq 1$; Posterior mean is close to x

$\sigma^2 \gg \tau^2 \rightarrow w \simeq 0$; Posterior mean is close to $\mu$

Conjugate families

DEFINE
For a likelihood f and a prior $\pi$:
If the prior and posterior belong to the same family of distributions then the pair $(f,\pi)$ is conjugate

If the pair $(f,\pi)$ is conjugate, there is no need to compute the normalizing constant m(x) in
$f(x| \theta)\pi(\theta) \propto \pi(\theta|x)$

Example
Binomial Likelihood + Beta Prior
$X|p \sim Bin(n,p)$ ; $f(x,p) = {n \choose x}p^x (1-p)^{n-x}$
$p \sim Beta(\alpha , \beta )$ ; $\pi(p) = \frac{1}{B(\alpha,\beta)} p^{\alpha-1}(1-p)^{\beta-1}$

$\pi(p|x) \propto f(x|p)\pi(p)$
$= C p^x (1-p)^{n-x} p^{\alpha-1} (1-p)^{\beta-1}$ for some constant C
$= C p^{x+\alpha-1} (1-p)^{n-x+\beta-1}$
$\implies p|X \sim Beta(x+\alpha, n - x + \beta)$
$X \sim Beta(\alpha,\beta) \implies E(X) = \frac{\alpha}{\alpha + \beta}$

$$\begin{equation} \begin{split} E(p|X) & = \frac{x+\alpha}{n+\alpha+\beta} \\ & = \frac{n}{n+\alpha+\beta} \frac{x}{n} + \frac{\alpha+\beta}{n+\alpha+\beta} \frac{\alpha}{\alpha+\beta} \\ & = w \hat{p} + (1-w) (\text{prior mean}) \end{split} \end{equation} $$

Excercise
Show that A Poisson likelihood and a Gamma prior form a conjugate pair
Hint
$x|\lambda \sim Poi(\lambda)$ ; $f(x|\lambda) = \frac{\lambda^x}{x!} e^{-\lambda} $

Important Conjugates

? denotes an unknown

Likelihood-$X|\theta$------------Prior-$\pi(\theta)$Posterior-$\theta | X$
BinomialBetaBeta
NegBinomialBetaBeta
PoissonGammaGamma
GeometricBetaBeta
ExponentialGammaGamma
Normal($\mu=?$)NormalNormal
Normal($\sigma^2=?$)InvGammaInvGamma
Normal($\mu=?;\sigma^2=?$)Normal/GammaNormal/Gamma
MultinomialDirichletDirichlet

Point Estimation

The posterior is the ultimate summary of an experiment for a bayesian. The probability measure, mean, is the most frequently used bayes estimator for a parameter. Posterior mode & median are alternatives and secondary in importance.

Interestingly the posterior mode maximizes the posterior density in the same way that the Maximum likelihood estimator maximizes the density. When used this way it is termed MAP=Maximum Posterior. It is computationally less demanding than mean or median.

To find: $argmax_\theta \pi(\theta|x) = argmax_\theta f(x|\theta)\pi(\theta) $

Suppose $X \sim U(0,\theta)$ with observations of $X_1=2; X_2=5; X_3=0.5; X_4=3$
Let the prior $\pi(\theta) \sim Pareto(\theta_0, \alpha)$ for $\theta_0 = 6 \text{ and } \alpha=2$
Then the posterior is also $Pareto(\theta^*, \alpha^*)$
with $\theta^* = max\{\theta_0, X_{(n)} \} = max\{6,5\} = 6$
and $\alpha^* = \alpha + n =$ 2 + 4 = 6

Posterior mean is $\frac{\alpha^* \theta^*}{\alpha^* - 1}$ = 36/5 = 7.2
Posterior Median is $\theta^* 2^{1/\alpha^*} = 6/2^{1/6}=6.7348$

Prior Elicitation

Prior distributions are carriers of prior information that is coherantly incorporated via Bayes theorem into an inference. At the same time parameters are unobservable and prior specification is subjective in nature.

If one has no prior information then there are many noninformative choices are possible (invariant priors, Jeffreys' priors, default priors, reference priors, etc). A noninformative prior is one which is dominated by the likelihood, or that is flat relative to the likelihood.

Flat Prior $\pi(\theta)=C$ for location(mean) and $\pi(\theta)=1/\theta$ for the scale/rate parameter Jeffreys' priors are obtained from a particular functional of density
For Binomial it is proportional to $p^{-1/2}(1-p)^{-1/2}$
For an exponential rate parameter it is proportional to $1/\lambda$
For a Normal: it is flat for the mean, but for the variance it is proportional to 1/$\sigma^2$

Example
If $X_1=1.7; X_2=0.6; X_3=5.2;$ come from an exponential with rate $\lambda$. Find the bayesian estimator if the prior on $\lambda$ is 1/ $\lambda$

We compute

  1. the likelihood as $\lambda^3 e^{-\lambda \sum_1^3 X_i}$
  2. the posterior as $\frac{1}{\lambda} \lambda^3 e^{-\lambda \sum_1^3 X_i}$

Which we recognize as Gamma(3,$\sum_1^3 X_i$)
Thus the Bayesian estimator (as the posterior mean) is given by
$\hat{\lambda} = \frac{3}{\sum_1^3 X_i} = \frac{1}{X} = 1/2.5 = 0.4$

Credible Sets: Bayesian interval estimation

In Classical stats a confidence interval is a set that in a large number of trials will contain the parameter with confidence 1-alpha. From Wikipedia: the confidence level represents the frequency (i.e. the proportion) of possible confidence intervals that contain the true value of the unknown population parameter. In other words, if confidence intervals are constructed using a given confidence level from an infinite number of independent sample statistics, the proportion of those intervals that contain the true value of the parameter will be equal to the confidence level

In Bayesian Stats: the probability of a parameter belonging to a set C, for credible set, is (1-alpha)100%.
Formally
Let C be a subset of the parameter space $\Theta$
Then C is a credible set with credibility (1-$\alpha$)100% if
$P(\theta \in C|X) = E(I(\theta \in C) | X) = \int_C \pi(\theta|x) d \theta \geq 1-\alpha $
In the discrete case replace the integral with a sum

Lemma
For a fixed posterior and a (1-$\alpha$)100% credibility, a credible set is not unique.
There are 2 main versions: Highest posterior density (HPD) and equal-tail credible sets

HPD
Suppose you have a posterior distribution $\pi(\theta|X)$ then
$C = \{\theta \in \Theta | \pi(\theta|x) \geq k(\alpha)) \}$
$P^{\theta|X}(\theta \in C) \geq 1- \alpha $
The challenge here is to set $k(\alpha)$ in such a way that the area below the posterior distribution is at least $1-\alpha$. This may involve numerical solutions of complex equations

Equitailed
Take the posterior distribution and form 2 intervals Upper(U) and Lower(L)
Then $\int_{-\infty}^{L} \pi(\theta | x) d \theta \leq \frac{\alpha}{2}$
and $\int_{U}^{+\infty} \pi(\theta | x) d \theta \leq \frac{\alpha}{2}$
so $\mathbb{P}^{\theta|X}(\theta \in [L,U]) \geq 1- \alpha $
This can be done using quantiles

Recall Jeremy's IQ
Normal density + Normal Prior => Normal Posterior
$X|\theta \sim N(\theta,\sigma^2)$
$\theta \sim N(\mu, \tau)$
With elicited/observed value X=98, we determined $\sigma^2=80; \tau^2 = 120$
Yields $\theta|X \sim N(102.8, 48)$

Because this is a Normal distribution the HPD and Equitailed coincide. This is generally true for symmetric distributions.
We can easily compute the 95% Credible set as
$102.8 \pm z_{1-\frac{\alpha}{2}} * \sqrt{48} = z_{0.975} = 1.96 $
t/f $\theta \in [89.2207,116.3793]; L=27.1586$

Compare to the Confidence Interval from classical stats
$98 \pm z_{1-\frac{\alpha}{2}} * \sqrt{80} = z_{0.975} = 1.96 $
t/f $\theta \in [80.4692,115.5308]; L=35.0615$

EXAMPLE 2
Given $X_1,X_2,...,X_n \sim Exp(\lambda); \lambda $ - rate parameter
We determine
$\sum_{1}^{n} X_i \sim Ga(n,\lambda)$ and
$\lambda \sim Ga(\alpha,\beta)$
Now we compute
$f(x|\lambda) \propto \lambda^n e^{-\lambda \sum x_i}$
$\pi(\lambda) \propto \lambda^{\alpha-1} e^{-\beta \lambda}$

Posterior:
$\pi(\lambda | \sum X) \sim Ga(n+\alpha,\sum x_i + \beta)$

Let $X_1=2; X_2=7; X_3=12$ be the lifetimes of a particular device
Assume that the $X_i$'s as exponential with unknown $\lambda$
Let prior on $\lambda$ be a Gamma(1,8)

Question Find the bayes estimators for $\lambda$ and both credible sets (HPD & equitailed)

Sol'n
$\sum X_i$ = 12+7+2=21; n=3
$\lambda | \sum X_i \sim Ga(3+1, 21+8) = Ga(4,29)$ recall E(Gamma)=a/b
Posterior mean: $\mathbb{E}(\lambda | \sum X_i) = 4/29 = 0.1379$
Posterior mode: Mode $ = (4-1)/29 = 0.1034$
Posterior median: GammaInverse(0.5,4,1/29)=0.1266 (No closed form so we used numerical methods)

NB $\lambda$ is the rate parameter so the expected lifetime is the reciprocal

Recall that $\mathbb{E}(X_i | \lambda) = \frac{1}{\lambda}$ so the lifetimes are computed as the reciprocals of each estimator
Lifetimes
= {1/0.1379, 1/0.1034, 1/0.1266}
= {7.2516, 9.6712, 7.8975}

Bayesian Testing

Recall: Classical approach to hypothesis testing, we formulate $H_0$ as $H_0:\theta = 0 \text{ vs } H_1:\theta \gt 0$

The Bayesian approach: The posterior with the highest probability is always favoured. It can be either $H_0 \text{ or } H_1$, makes no difference.

Formulate $\Theta_0 \text{ and } \Theta_1$ as two nonoverlapping sets for parameter $\theta$. Generally $\Theta_{0}^{c} = \Theta_1$ but this is not mandatory so long as they don't overlap they can be handled

Define
Null Hypothesis: $H_0: \theta \in \Theta_0$
Alternative Hypothesis: $H_1: \theta \in \Theta_1$

Bayesian testing involves a comparison of posterior probabilities of $\Theta_0 \text{ and } \Theta_1$, which is the regions corresponding to the two hypothesis. So if $\pi(\theta|X)$ is our posterior dist'n then the hypothesis corresponding to the smallest area will be rejected

$p_0 = \mathbb{P}(H_0|X) = \int_{\Theta_0} \pi(\theta|x) d \theta$
$p_1 = \mathbb{P}(H_1|X) = \int_{\Theta_1} \pi(\theta|x) d \theta$

Jeremy's IQ - Tested
Recall that we found the posterior to be $N(102.48,48)$

Jeremy says he had a bad day and that his true IQ should be at least 105
So we compute the posterior probability of his claim as $\theta \geq 105$ $p_0 = \mathbb{P}(\theta \geq 105 |X) = \mathbb{P}( Z \geq \frac{105-102.8}{\sqrt(48)} = 1 - \Phi(0.3175) = 0.3754 $

Since this is less than 1/2 we reject his claim of $\theta \geq 105$ in favour of $\theta \lt 105$ with probability $(1-0.3754)$

What is the prior & posterior odds in favour of $H_0$?
Let $\frac{\pi_0}{\pi_1} = \frac{\mathbb{P}(H_0)}{\mathbb{P}(H_1)}$
and $\frac{p_0}{p_1} = \frac{\mathbb{P}(H_0|X)}{\mathbb{P}(H_1|X)}$

Then Bayes Factor gives the odds in favour of $H_0$ as the ratio of the corresponding posterior to prior odds as
$$ B_{01}^{\pi}(x) = \frac{\mathbb{P}(H_0|X)}{\mathbb{P}(H_1|X)} / \frac{\mathbb{P}(H_0)}{\mathbb{P}(H_1)} = \frac{p_0/p_1}{\pi_0 / \pi_1} $$

NB: $B_{01}$ is the odds of Null hypothesis to hypothesis 1; Similarly $B_{10}$ is the odds of hypothesis 1 to the null hypothesis

Jeremy's IQ - Odds
Posterior Odds in favour of $H_0$ are $\frac{0.3754}{1-0.3754} = 0.4652 \lt 1$

Precise Hypothesis
If you recall from basic stats for a continuous distribution the probability of a point is always 0. So the null hypothesis given above presents a unique challenge.

Suppose we want to test $H_0:\theta = \theta_0$ versus $H_1:\theta \neq \theta_0$
then the prior on theta will contain the point mass at $\theta_0$
If you don't take the point mass then P(H0) will always be zero
Hence
$\pi(\theta) = \pi_0 \bullet \delta_{\theta_0} + (1-\pi_0) \bullet \xi(\theta)$ where $(1-\pi_0) = \pi_1$
Which is a mixture prior

Now we find the marginal as $m(x) = \pi_0 \bullet f(x|\theta_0) + \pi_1 \bullet m_1(x) $
and then integrate the likelihood wrt this prior
$m_1(x) = \int_{\theta \neq \theta_0} f(x|\theta) \xi(\theta) d \theta $
We find $pi_0$ times the likelihood evaluated at $\theta_0$
$$\begin{equation} \begin{split} \pi(\theta|x) & = \frac{f(x|\theta_0)\pi_0 }{m(x)} \\ & = \frac{f(x|\theta_0)\pi_0 }{\pi_0 \bullet f(x|\theta_0) + \pi_1 \bullet m_1(x)} \\ & = (1 + \frac{\pi_0}{\pi_1} \bullet \frac{m_1(x)}{f(x|\theta_0)})^{-1} \end{split} \end{equation} $$

Bayes Factor Calibration

Evidence against $H_0$Value (Interval)-XXXXXXXXXXXXXXX
Poor$0 \leq Log_{10} B_{10} \leq 0.5 $
Substantial$0.5 \lt Log_{10} B_{10} \leq 1 $
Strong $1 \lt Log_{10} B_{10} \leq 1.5 $
Very Strong $1.5 \lt Log_{10} B_{10} \leq 2 $
Decisive$2 \lt Log_{10} B_{10} $

Jeremy's IQ Revisited

Take $H_0: \theta \leq 100$ v.s. $H_1: \theta \gt 100$
$\theta | X \sim N(102.8,48)$

$p_0 = \mathbb{P}^{\theta | X} (H_0) $
$ = \int_{-\infty}^{100} \frac{1}{\sqrt{2 \pi 48}} e^{- \frac{(\theta-102.8)^2}{2*48}} d \theta$
= normcdf(100,102.8,sqrt(48))
= 0.3431
Hence
$p_1 = \mathbb{P}^{\theta | X} (H_1) = 1 - 0.3431 = 0.6569$

In a similar fashion we find that
$\pi_0 = \mathbb{P}^{\theta} $ $ = \int_{-\infty}^{100} \frac{1}{\sqrt{2 \pi 120}} e^{- \frac{(\theta-110)^2}{2*120}} d \theta$
= normcdf(100,110,sqrt(120))
= 0.1807
and $\pi_1 = 1-0.1807 = 0.8193$

Finally
$B_{10} = \frac{p_1/p_0}{\pi_1 / \pi_0} = \frac{0.6569/0.3431}{0.8193/0.1807} = 0.4223 $

$ \frac{p_1}{p_0} = B_{10} \bullet \frac{\pi_1}{\pi_0} $

$log_{10} B_{01} = -log_{10} B_{10} = 0.3744$ Poor evidence

Bayesian Prediction

We are often faced with a problem of predicting a new observation a new observation $X_{n+1}$ after we've observed $X_1,X_2, ..., X_n$. We can assume that the new observation will have a likelihood of $f(x_{n+1} | \theta )$, while the observed samples have a posterior of $\theta$ as $\pi(\theta|X_1,X_2, ..., X_n)$

The posterior predictive distribution for $X_{n+1}$ can be obtained from the likelihood by integrating out the paramter $\theta$ using the posterior distribution

$$ f(x_{n+1} | X_1,...,X_n) = \int_{\Theta} f(x_{n+1}|\theta) \pi(\theta |X_1,...,X_n ) d \theta $$

Where $\Theta$ is the domain for $\theta$

Note that the marginal dist'n also integrates out the parameter but uses the prior instead of the posterior $$ m(x) = \int_{\Theta} f(x|\theta) \pi(\theta) d \theta $$ For this reason the marginal is often called the prior predictive distribution

Definition:
The prediction for $X_{n+1}$ is the expectation $\mathbb{E}(X_{n+1})$ taken with rest to the predictive distribution
$\hat{X}_{n+1} = \int_{\mathbb{R}} x_{n+1} f(x_n+1 | X_1,...,X_n ) d x_{n+1}$
While the predictive variance
$Var(X_{n+1}) = \int_{\mathbb{R}} (x_{n+1} - \hat{X}_{n+1})^2 f(x_n+1 | X_1,...,X_n ) d x_{n+1}$
Can be used to assess the precision of the prediction

Example: Exponential Survival Time
Suppose $X \sim Exp(\lambda)$ represents the survival time of patients affected by a particular disease
The pdf of X is given by $f(x|\lambda) = \lambda e^{-\lambda x}; x \gt 0$

Additionally we are given a prior on $\lambda$ as
$ \pi(x) = Gamma(\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}, \lambda \gt 0$
And a likelihood on the sample $X_1,...,X_n$ of exponential R.V's is given by $ \lambda^n exp\{-\lambda \sum_1^n X_i \} $
We can find the posterior as $ \lambda^{n + \alpha - 1 } exp\{- ( \sum_1^n X_i + \beta) \lambda \} $
Which is recognized as $Gamma(\alpha + n , \beta + \sum X_i )$
And can be completed as
$ \pi(\lambda | X_1,...,X_n ) = \frac{( \sum_1^n X_i + \beta)^{n+\alpha}}{\Gamma(n+\alpha)} \lambda^{n+\alpha-1} exp\{- ( \sum_1^n X_i + \beta) \lambda \} $

So the predictive distribution for a new $X_{n+1}$ is
$f(x_{n+1} | X_1,...,X_n )$
$= \int_{0}^{\infty} \lambda exp\{ -\lambda x_{n+1} \} \pi(\lambda | X_1,...,X_n ) d \lambda$
$= \frac{(n+\alpha) ( \sum_1^n X_i + \beta)^{n+\alpha} }{ ( \sum_1^n X_i + \beta + x_{n+1})^{n+\alpha+1} }, x_{n+1} \gt 0$

Prior Elicitation

Often times Bayesian's are criticized for using subjective priors which are not objective. The bayesians try to minimize this criticism by expressing personal opinion as a probability distribution of a prior. So how do we elicit this prior?

Suppose you are confident in the family of distributions that a prior belongs to but your not sure about the numerical specification. You could specify some values such as mean, and variance, moments and quantiles, modes etc etc. These will specify the prior.

Suppose you decide to put an exponential prior on the paramter theta
The exponential prior would have the hyper parameter lambda and you know that theta is elicited to be approx 2
Formally
$\theta \sim Exp(\lambda)$ and $E(\theta)=2 \implies \frac{1}{\lambda} = 2 $ that is $\lambda = \frac{1}{\lambda}$

We could also use the median
Say family exponential and median is equal to 4
$\theta \sim Exp(\lambda); F(\xi_{1/2})=\frac{1}{2} \implies F(4)=\frac{1}{2} $
$\frac{1}{2} = (1 - e^{-4 \lambda}) \implies e^{-4 \lambda} = \frac{1}{2} \implies \lambda=\frac{log 2}{4} = 0.1733 $

Elict Beta prior
We want to elicit a Beta prior on $\theta$ if $E(\theta)=1/2 \text{ and } Var(\theta)=1/8$
So $\theta \sim Beta(\alpha,\beta); \alpha, \beta$ to be specified
$\frac{1}{2} = E(\theta) = \frac{\alpha}{\alpha + \beta} \implies \alpha+\beta = 2 \alpha \implies \alpha=\beta$

Now solve for $Var(\theta)$
$\frac{1}{8} = Var(\theta) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = \frac{\alpha^2}{4 \alpha^2 (2\alpha+1)} = \frac{1}{4(2\alpha+1)}$

Hence $4(2 \alpha + 1) = 8 \implies 8 \alpha = 4 \implies \alpha = \frac{1}{2} (= \beta) $

In general
for $E(\theta)=\mu$ and $Var(\theta)=\sigma^2$ we will find that
$\alpha = \mu(\frac{\mu (1-\mu) }{\sigma^2} - 1)$
and
$\beta = (1-\mu) (\frac{\mu (1-\mu) }{\sigma^2} - 1)$

Non-Informative Priors

Unlike the previous section where we may know something about the prior, what if you don't know anything about the prior distribution? ie you have no information?

In such a case we turn to non-informative priors (which is vaguely defined). This allows us to avoid the subjective criticism coming from classical statisticians

Invariance Principle in Selecting Priors
Let $X|\theta \sim f(x - \theta);$ $\theta$ here is the location parameter

Jeffrey's Prior
Is the prior that satisfies $p(\theta) \propto \sqrt{det I(\theta)}$
Where $I(\theta)$ is the fisher information for $\theta$ and is invariant under reparametrization of the parameter vector $\theta$. det stands for the determinant

Effective Sample Size

It is possible to calibrate the amount of information a prior is carrying by assigning a sample size value.

For a Binomial + Beta(a,b) prior

  • prior mean is taken as $\frac{a}{a+b}$
  • posterior mean is $\frac{a+X}{a+b+n}$
    so ESS = a+b

A Gamma(a,b) prior on Poisson rate $\lambda$ is conjugate so

  • Bayes rule a/b without data
  • goes to
  • $\frac{\sum X_i + a}{b+n}$ with the data
    so ESS = b

A Gamma(a,b) prior on normal precision $\tau = 1/\sigma^2$

  • Bayes rules are a/b
  • and $(a+n/2) / (b + 1/2 \sum (X_i - \mu)^2 $
    So ESS = 2a

Lectures - Unit 5

Bayesian Computation

The crux of Bayesian stats lies in how to compute the posterior, as well as the Bayesian estimators.

There are Two Approaches to these problems:

  • A: The classical approach where we use numerical methods to compute integrals
  • B: The monte carlo approach where we use Markov Chains (which is used by OpenBugs)

Recall that bayes theorem states
$\pi(\theta|x) = \frac{ f(x|\theta)\pi(\theta)}{m(x)}$
Which is conceptually easy to understand but difficult in practice. The difficulty lies in the marginal. In our previous examples we we're happy to drop the marginal and use the concept of proportionality. In the conjugate case the proportionality lead to a density. But in many cases we will need to compute the marginal in order to form a proper density.

It's easy to find $\pi(\theta|x) \propto f(x|\theta)\pi(\theta)$
but if we need to use $\pi(\theta|x) = \frac{ f(x|\theta)\pi(\theta)}{m(x)}$
then we need to find m(x) in order to normalize it in order to be a density.
Recall that the normalized constant $m(x)$ is the integral $m(x) = \int f(x|\theta)\pi(\theta) d \theta$

The above situation arises whenever we are faced with a nonconjugate case

Consider:
$x|\theta \sim N(\theta,1)$ (Notice that the mean is unknown, but the variance is known)
$\theta \sim Cauchy(0,1)$
Likelihood is $f(x|\theta) \propto e^{-\frac{1}{2} (x-\theta)^2 }$
Prior is $\pi(\theta) \propto \frac{1}{1+\theta^2}$

Now we want to compute the integral:
$m(x) = \int_{-\infty}^{\infty} e^{-\frac{1}{2} (x-\theta)^2} \frac{1}{1+\theta^2} d \theta $

Which is not solvable using elementary methods. So we will need to resort to numerical methods.

Similarly we could try to find the bayesian estimator :
$\delta_B(x) = \frac{\int \theta f(x|\theta) \pi(\theta) }{\int f(x|\theta) \pi(\theta) } $

In either case we are faced with an integral without a closed form.

We could avoid the integration all together using sampling methods. Using a sample from either the prior or the likelihood.

Here's how it works.

We write out estimator as

$\delta_B(x) = \frac{\int_{-\infty}^{\infty} \frac{\theta}{1+\theta^2} e^{-\frac{1}{2} (\theta - x)^2} d \theta }{\int_{-\infty}^{\infty} \frac{1}{1+\theta^2} e^{-\frac{1}{2} (\theta - x)^2} d \theta } $

If we choose to sample from Cauchy(0,1), our likelihood, then we can estimate our integral as a sum.
So generate a random sample $\theta_1,...,\theta_N$ from Cauchy(0,1)
Compute
$\delta_B(x) = \frac{\sum \theta_i e^{-\frac{1}{2} (\theta - x)^2} }{\sum e^{-\frac{1}{2} (\theta_i - x)^2} } $

Similarly we could choose to generate the random sample from the prior distribution as $\theta_1,...,\theta_N \sim N(x,1)$
Then take the sum
as $\delta_B(x) = \frac{\sum \frac{\theta_i}{1 + \theta_i^2} }{\sum \frac{1}{1 + \theta_i^2} } $

Let's see this inpractice using Python

Normal-Cauchy

In [2]:
from scipy import integrate, inf, exp
x = 2
# Integrating over Cauchy from Above
num = lambda th: th * exp(-0.5*(x-th)**2)/(1+th**2)
denom = lambda th: exp(-0.5*(x-th)**2)/(1+th**2) 
delta2 = integrate.quad(num,-inf,inf)[0]/integrate.quad(denom,-inf,inf)[0]
#delta(2)
print(delta2) #1.2821951026935339

# Errors

numerator =integrate.quad(num,-inf,inf)[0]    #0.9159546679977636
denominator=integrate.quad(denom,-inf,inf)[0] #0.714364503556127
errnum=integrate.quad(num,-inf,inf)[1]        #1.0415234856193602e-09
errdenom=integrate.quad(denom,-inf,inf)[1]    #1.2022419107752649e-08

err = delta2 * (errnum/numerator + errdenom/denominator)
print(err)   #2.3036713479165735e-08
1.2821951026935339
2.3036713479165735e-08

Laplace Method

Is another method of approximating integrals. Much more complex as it requires finding the determinant of a matrix of parameters.

Markov Chain Review

Markov Chains Review:
$X_0,X_1,...,X_{n-1},X_{n},X_{n+1},....$ forms a Markov chain if
$P(X_{n+1} \in A | X_0,X_1,...,X_{n}) = P(X_{n+1} \in A | X_{n})$
In other words only the most recent observation has an effect.
Future events are independent on past events

Define our transition kernel as
$Q(A|X_n) = P(X_{n+1} \in A | X_{n})$ This is a one step transition kernel
$Q(A|X_n = x) = \int_{A} q(x,y) d y = \int_{A} q(y|x) d y $

Definition: An Invariant distribution
$\prod$ is an invariant distribution if $\prod(A) = \int Q(A|x) \prod(dx)$

Stationary:
A stationary distribution of a Markov Chain is a probability distribution
that remains unchanged in the Markov chain process as Time progresses.

In Bayesian terms:
Suppose $\pi$ is a density for $\prod$,
then $\pi$ is stationary if $q(x|y)\pi(y)=q(y|x)\pi(x)$
This is called Detailed Balance Equation

Equilibrium
If $Q^{n}(A|x)=P(X_n \in A | X_0 = x)$
And $\lim_{x\to\infty} Q^{n}(A|x) = \prod(A)$

Markov Chains require that the stationary or equilibrium distribution correspond to the posterior
The initial condition $X_0=x$ is forgotten as n grows larger.
When n is large $X_n$ is a random variable sampled from the posterior distribution

Monte Carlo

Two popular methods: Gibbs sampling and Metropolis

High level procedure

To find
$\mu_{\pi}(g) = \int g(\theta) \pi(\theta) d \theta$
We sample iid R.V.s $\theta_0,...,\theta_n \sim \pi$
then $\mu_{\pi}(g) \approx \frac{1}{n} \sum_{i=1}^{n} g(\theta_i) $

MCMC Metropolis

Recall the detailed balance equation for a stationary distribution is given by $q(x|y)f(y)=q(y|x)f(x)$
where $\pi$ is the target distribution (aka the Posterior distribution)
and $q$ is the kernel density of Markov Chain (aka Transition)

NB: We get to pick q
In fact it can be any conditional distribution that you please. As long as the following condition, of admissibility holds

Call $q$ admissible if
Support $\pi_x \subset \bigcup_x$ support $q(\bullet | x)$
ie: Support of the target density is subset of the union with respect to
all possible values x of the support of the conditional density

In general the detailed balance equation may not hold for this choice of q even
if it is admissible. In general one side will need to be multiplied by a number
less than 1 to ensure equality

Suppose the left hand side is larger. Then we multiply by a factor $\rho(x,y)$
Our balance equation now looks like
$q(x|y)\rho(x,y) \pi(y)=q(y|x)\pi(x)\bullet 1$
Solving for $\rho(x,y)$ yields $\rho(x,y) = min \{1; \frac{q(x|y)\pi(y)}{q(x|y)\pi(y)} \}$

Observe that $\rho(x,y)$ depends on the ratio of target distribution. So if we target
the posterior distribution then the ratio gets the marginals
$\frac{\pi(y)}{\pi(x)}$ which cancel out

In a similar fashion: If the choice of a kernel density q is symmetric
ie $q(y|x) = q(x|y)$
then they cancel out
Leaving us with $\rho(x,y) = min \{1; \frac{\pi(y)}{\pi(y)} \}$
we will talk more about this in a moment

Enough theory ... Let's see how it works now

Metropolis Algorithm

STEP000 - Choose a symmentric proposal distribution $q(\theta|\theta')=q(\theta'|\theta)$
STEP001 - Start with an arbitrary x_0 from support of target $\pi$
STEP003 - For i from 1 to n-1

  • generate a proposal y from $q(y|x_n)$
  • generate $U \sim U(0,1)$
  • Test: $U \leq \rho(x_n,y) $
  • if TRUE then set $x_{n+1} = y$ with probability $\rho(x_n,y)$
  • if FALSE then set $x_{n+1} = x_n$ with probability $ 1 - \rho(x_n,y)$

STEP004 - increase n, go to STEP002

where
$\rho(x,y) = min \{1; \frac{q(x|y)\pi(y)}{q(x|y)\pi(y)} \}$

Choosing q any admissible and supported distribution will suffice but some are favoured
For example:
If q is symmetric ( $q(y|x) = q(x|y)$ ) then they cancel
Hence $\rho(x,y) = min \{1; \frac{\pi(y)}{\pi(y)} \}$

Lemmas:
if $q(y|x) = q(x|y) = $q(|x-y|)$ then the algorithm is called Metropolis Random Walk

if $q(y|x) \equiv q(y) $ then the algorithm is called an Independence Metropolis

Example 0.1

Binomial Distribution with Nonstandard Prior
See http://galton.uchicago.edu/~eichler/stat24600/Handouts/l12.pdf

Let
$Y=(Y_1,Y_2,...,Y_n)^T$ where $Y_i \stackrel{iid}{\sim} Bin(1,\theta) $
$S_n = \sum_{i=1}^{n} Y_i$ Just to make the notation cleaner
Prior: $\pi(\theta) = 2 cos^2(4\pi \theta) $

We compute the posterior as
$\begin{equation} \begin{split} \pi(\theta | Y) & \sim f(Y|\theta) \pi(\theta) \\ & \propto \left[ \theta^{S_n} (1-\theta)^{n-S_n} \right] cos^2(4\pi\theta) \\ \end{split} \end{equation} $

We will use the exponential as our proposal distribution
$ q(\theta' | \theta ) \sim exp\left( \frac{1}{2\sigma^2} (\theta-\theta')^2 \right) $

Our test sample is
$\begin{equation} \begin{split} \rho(\theta | Y) & = \left[ \frac{\pi(\theta'|Y) q(\theta | \theta' ) }{\pi(\theta|Y) q(\theta' | \theta )} \right] \\ & = \left[ \frac{\pi(\theta'|Y) }{\pi(\theta|Y) } \right] \text{q & q' cancel out} \\ & = \left[ \frac{\theta'^{S_n} (1-\theta')^{n-S_n} cos^2(4\pi\theta') }{\theta^{S_n} (1-\theta)^{n-S_n} cos^2(4\pi\theta)} \right] \\ \end{split} \end{equation} $

We got a bit lucky here as our proposal distribution is symmetric (ie $q(x|y)=q(y|x)$) so we can cancel them out.

In order to implement the above we'll need to choose a variance. This can be tricky as the convergence rate will depend highly upon the choice.

Example 0.2

Example 1

Recall the example of Normal + Cauchy
$X|\theta \sim N(\theta,1)$ and $\theta \sim Cauchy(0,1)$

$\pi(\theta | x) \propto \frac{e^{-\frac{(x-\theta)^2}{2}}}{1+\theta^2}$
$\theta$ is the current status
$\theta'$ is the proposal status

Let $q(\theta' | \theta) \sim N( x |\tau^2): q \propto e^{\frac{1}{2\tau^2}(\theta' - x)^2 }$

$$\gamma = \frac{\pi(\theta') q(\theta|\theta')}{\pi(\theta) q(\theta'|\theta)} = \frac{\frac{e^{-\frac{(x - \theta')^2}{2}}}{1+\theta'^2} e^{-\frac{(\theta-x)^2}{2}} } {\frac{e^{-\frac{(x - \theta)^2}{2}}}{1+\theta^2} e^{-\frac{(\theta'-x)^2}{2}} } = \frac{1+\theta^2}{1+\theta'^2}$$

Now use the final equation as our acceptance criteria
Take X=2; $\theta_0 = 1$ as the inital seed
You should get a mean of 1.2825

see norcaumet.m for an octave/matlab implementation of the above

Example 2

$T_1,...,T_n \sim Weibull(\alpha,\eta)$
Where Weibull pdf is given by : $f(t|\alpha,\eta) = \alpha \eta t^{\alpha-1} e^{-\eta t^{\alpha}} $
With prior exponentials
$\pi(\alpha,\eta) \propto e^{-\alpha} \eta^{\beta-1} e^{-\xi \eta} $

We propose the product of two exponentials (Why? When alpha=1 a weibull is just an exponential)
$q(\alpha',\eta' | \alpha,\eta) = \frac{1}{\alpha \eta} e^{- \frac{\alpha'}{\alpha} - \frac{\eta'}{\eta}}$

Now compute
$ \frac{[ \prod \alpha' \eta' t^{\alpha'-1} e^{-\eta' t^{\alpha'}}] e^{-\alpha'} \eta'^{\beta-1} e^{-\xi \eta'} \frac{1}{\alpha' \eta'} e^{- \frac{\alpha}{\alpha'} - \frac{\eta}{\eta'}} }{[ \prod \alpha \eta t^{\alpha-1} e^{-\eta t^{\alpha}}] e^{-\alpha} \eta^{\beta-1} e^{-\xi \eta} \frac{1}{\alpha \eta} e^{- \frac{\alpha'}{\alpha} - \frac{\eta'}{\eta}} } $

While this looks messy computationally it's rather straight forward to program

Gibbs Sampling Algo

Gibbs sampling is a special case of the Metropolis algorithm. The big distinction in Gibbs is that the vector of parameters is updated component wise. This leads to a scenario where all proposals will be accepted, there is no need to text here. (NB This is the algorithm used by Bugs for all simulations)

Let $f(X|\theta)\pi(\theta)$ be the numerator of the posterior
Suppose that we can find full conditionals for the components of $\theta$ (here $\theta = (\theta_1,...,\theta_p)$ is a vector parameters.
then we can compute the full conditional given all other parameters as
$\pi(\theta_1 | \theta_2,...,\theta_p , X)$
...
$\pi(\theta_{p} | \theta_1,...,\theta_{p-1} , X)$

once you've found these conditionals, which is hard, Gibbs becomes rather trivial

STEP001
Set $\theta^0 = (\theta_1^0,...,\theta_p^0)$ as our initial values

STEP002
Get sample $\theta_1^{n+1} = \pi(\theta_1 | \theta_2^n,...,\theta_p^n, X)$
Get sample $\theta_2^{n+1} = \pi(\theta_2 | \theta_1^{n+1},\theta_3^n,...,\theta_p^n, X)$
...
Get sample $\theta_p^{n+1} = \pi(\theta_p | \theta_1^{n+1},\theta_2^{n+1},...,\theta_{p-1}^{n+1}, X)$

STEP003
Increase n
Go back to STEP002

Finding the conditional

  1. Form a kernel of the joint distribution of all parameters and data (The product of the likelihood times all priors)
  2. To find the full conditional of $\theta_i$, select only the parts of the kernel containing $\theta_i$. You can treat all other terms as constants
  3. Normalize the selected part as a distribution. (Often times it should be recognizable from the kernel)

Example 1

$X_1,...,X_n \sim N(\mu,1/\tau)$; Precision=$\tau$; $\sigma^2 = (1/\tau) $
Priors are given as
$\mu \sim N(0,1) \propto e^{-\frac{1}{2} \mu^2} $
$\tau \sim Gamma(2,1) \propto \tau e^{-\tau}$

Likelihood is given by
$N(\mu,1/\tau) \propto (2\pi)^{-\frac{n+1}{2}} \tau^{\frac{n}{2}} e^{-\frac{\tau}{2} \sum (x_i - \mu)^2 }$

So the joint distribution is expressed by
$joint \propto likelihood(X)*prior(\mu)*prior(\tau) $
$joint \propto (2\pi)^{-\frac{n+1}{2}} \tau^{\frac{n}{2}} e^{-\frac{\tau}{2} \sum (x_i - \mu)^2 } e^{-\frac{1}{2} \mu^2} \tau e^{-\tau} $

Keeping only the terms depending on $\mu$ yields
$\pi(\mu | \tau,X) \propto e^{-\frac{\tau}{2} \sum (x_i - \mu)^2 } e^{-\frac{1}{2} \mu^2} $
Reduces to $\pi(\mu | \tau,X) \propto e^{-\frac{1}{2} (1+n\tau) ( \mu - \frac{\tau \sum x_i}{1+n\tau}) } $
Is recognized as $\mu | \tau,X) \sim N\left(\frac{\tau \sum x_i}{1+n\tau}, \frac{1}{1+n\tau} \right)$

Similarly we can find the full conditional for $\tau$ by keeping only the effected terms
$\pi(\tau|\mu,X) \propto \tau^{\frac{n}{2}} e^{-\frac{\tau}{2} \sum (x_i - \mu)^2 } \tau e^{-\tau} $
Reduces to
$\pi(\tau|\mu,X) \propto \tau^{\frac{n}{2} + 1} e^{-\tau ( 1 + \frac{1}{2} \sum (x_i - \mu)^2 } $
Is recognized as
$\tau|\mu,X \sim Gamma\left( \frac{n}{2} + 2 ; 1 + \frac{1}{2} \sum (x_i - \mu)^2 \right)$

Now all you need to do is take some random values, as your initials, and run your Gibbs.

Example 2 N+Cauchy

Let's revisit our Normal + Cauchy example from previous lessons
we are given
$X|\theta \sim N(\theta,1) $
$\theta \sim Cauchy(0,1) = \frac{1}{ \pi \gamma \left[ 1+(\frac{x-x_0}{\gamma})^2 \right] } $
Our goal is to find $\delta(2)$ using Gibbs sampling

Example 3 Pumps

There are 10 pumps that are either working correctly or failing. The data collected is as follows

Pump12345678910
Time94.3215.7262.88125.765.2431.441.0481.0482.09610.48
#Fails5151431911422

We are also given the following:
$x_i|\theta_i \sim Poi(\lambda=\theta_i,t_i)$; so we compute $f(x_i|\theta_i) \propto \theta_i^{x_i} e^{-\theta_i t_i} $

$\theta_i \sim Gamma(\alpha,\beta)$; so for $\alpha = 1$ we get $\pi(\theta) \propto \beta e^{-\beta \theta_i}$

Prior on the hyper parameter
$\beta \sim Gamma(c,d)$; so we get the hyperprior $\pi(\beta) \propto \beta^{c-1} e^{-d \beta}$

our joint distribution is given as the product
$joint \propto \left( \prod \theta_i^{x_i} e^{-\theta_i t_i} e^{-\beta \theta_i} \right) \beta^n \beta^{c-1} e^{-d \beta} $

Now we are ready to find the conditionals
$\pi(\theta_i|\theta_{\neq i}, \beta,x) \propto \theta_i^{x_i} e^{-\theta_i t_i} e^{-\beta \theta_i} = Gamma(x_i + 1; \beta + t_i) $

$\pi(\beta|\theta,x) \propto e^{-\beta \sum \theta_i} \beta^{n+c-1} e^{-d \beta} = Gamma(n+c,\sum \theta_i + d) $

Now take
c=0.1
Theta initial = [1, 1, 1, 1, ... ] 10 of them
and run the simulation

Graphical Models

Introduction

From Wikipedia

From Wikipedia

Each Node A, B, etc corresponds to a Random Variable.
Oriented edges correspond to Statistical dependencies between the nodes

They are an easy way illustrate the intuitive representation of relationships and to assess independence. Efficiency in inference by Markovian Independence.

Recall Markovian Property => P(C|A,B)=P(C|B) ( Where A,B,C and linear events)
Lemma: P(A|B,C)=P(A|B)
Proof:
$\begin{equation} \begin{split} P(A|B,C) & = \frac{P(B,C|A)P(A)}{P(B,C)} BayesRule\\ & = \frac{P(B|A)P(C|A,B)P(A)}{P(C|B)P(B)} ChainRule\\ & = \frac{P(B|A)P(C|B)P(A)}{P(C|B)P(B)} MarkovianRule \\ & = \frac{P(B|A)P(A)}{P(B)} \\ & = P(A|B) Bayes \end{split} \end{equation}$

Consider the following simple network: A -> B -> C
Where A and C are unobserved and B is observed
Are A and C independent given B? ie is P(A,C|B)=P(A|B)P(C|B) true?
Yes!

$\begin{equation} \begin{split} P(A,C|B) & = \frac{P(A,B,C)}{P(B)}\\ & = \frac{P(A)P(B|A)P(C|A,B)}{P(B)} \\ & = \frac{P(A)P(B|A)P(C|B)}{P(B)} \\ & = \frac{P(A)P(B|A)}{P(B)} P(C|B) \\ & = P(A|B)P(C|B) \end{split} \end{equation}$

Doodle BUGS

none

Example

GoTo OpenBugs

  1. Create a new graph
    Doodle > New > (keep defaults or change)
  2. Add a node (Every node should have a name)
    Left-Click (NB You'll want to give it a name)
    A node will appear with the default settings
  3. Add a plate
    CTRL + Left-Click
  4. Add edge
    Highlight the target node
    CTRL + Left-Click the source node A line will be drawn from your source to the target

When you're done building your graph
Edit > Select Document (to highlight your graph model)
Right click and copy

Using your model File > New Document Paste your document Add your data code Add your inits code

Running Your model simulation.
Our model is graphical but the steps are the same as before
Model > Specification

  1. Click on the graphical model that was pasted into the program Check Model
    Load data Compile
    Load inits

Modelling

Topics Hierarchical models, Bayesian Linear models, generalized linear models

Hierarchical Models

Suppose you have the following prior with hyper-parameters
$\pi(\theta) = \int \pi_1(\theta|\theta_1) \pi_2(\theta_1|\theta_2) ... \pi_{n}(\theta_{n-1}|\theta_n)\pi_{n+1}(\theta_n) d\theta_1...d\theta_n$

You could represent this as
$X|\theta \sim f(x|\theta)$
$\theta|\theta_1 \sim \pi_1(\theta|\theta_1)$
....
$\theta_n \sim \pi_{n+1}(\theta_n)$

Which is equivalent to $X|\theta \sim f(x|\theta)$ and $\theta \sim \pi(\theta)$

This represents a flow of information from the bottom of the hierarchy to the top
X <- $\theta$ <- $\theta_1$ <- ... <- $\theta_{n-1}$ <- $\theta_n$
This works IFF all the $\theta$'s are observed. If not then there's a break in the system. Furthermore we want the flow to operate both ways. If we have an observed X=x then the parameters should incorporate the information accordingly.

To ensure that the flow is unidirectional we use the joint distribution
$f(x,\theta,\theta_1,...,\theta_n) \propto f(x|\theta,\theta_1,...,\theta_n) \times \pi_1(\theta|\theta_1,...,\theta_n) \times \pi_2(\theta_1|\theta_2,...,\theta_n) \times ... \times \pi_{n}(\theta_{n-1}|\theta_n) \times \pi_{n+1}(\theta_n)$

Applying the markovian property allows us to reduce the joint distribution to
$f(x,\theta,\theta_1,...,\theta_n) \propto f(x|\theta)\pi_1(\theta|\theta_1) .... \pi_{n+1}(\theta_n)$

Why the hierarchy?

  1. Modelling requirements may ask for hierarchy (Bayesian Meta-Analysis)
  2. Prior information can be separated into structural and subjective parts
  3. Robustness + Objectivity = Let the data determine the hyperparameters
  4. Computational intensity: A single prior can be difficult to implement.

Consider three scenarios
Scenario 1:
A single theta for all R.V.s
$\prod_{i=1}^n f(y_i|\theta) \pi(\theta)$

Scenario 2:
One theta for each R.V. (thetas are independent)
$\prod_{i=1}^n f(y_i|\theta_i) \pi(\theta_i)$

Scenario 3:
One theta for each R.V. plus one hyper parameter for all $\theta$s
$\pi(\phi) \prod_{i=1}^n f(y_i|\theta_i) \pi(\theta_i|\phi)$

In scenario 3 the thetas are said to be exchangeable. Exchangeability is roughly described as a limited dependence.
More precisely the distribution of the vector of RVs $(Y_1,Y_2,...,Y_n)$ is equivalent to the distribution of any of it's permutations.

Of course if the RVs are independent then they will be exchangeable.

As an example consider the Multivariate Normal distribution with mean=0 and covariance matrix, and correlation $\rho$
$(X,Y) \sim MVN_2 \left(0, \begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix} \right)$

X & Y are correlated, they're not independent, But they're exchangeable as the distribution of $(X,Y)\sim(Y,X)$

Priors with Structural Information

Example 1
Let $X|p \sim Bin(n,p)$
With $p|k \sim Beta(k,k); k \in \mathbb{N}$ this would imply that the expectation is 1/2. This is structural
Also $k|r \sim Geom(r); p(k=i)=(1-r)^{i-1} r$ for i=1,2,... and 0<r<1
Finally $r \sim Beta(2,2)$ to close the hierarchy we place a non-informative prior.(Note that you must always close your hierarchy with a specific prior)

Analytical Approach requires the use of Symbolic Software

In openbugs

code
model{
x ~ dbin(p,n)
p ~ dbeta(k,k)
k ~ dgeom(r)
r ~ dbeta(2,2)
}

DATA - Suppose we want to model 3 successes in 5 trials
list(n=5,x=3)

INITS - Some random start points
list(r=0.5,k=2,p=0.5)

Priors as Hidden Mixtures

Meta-Analysis via Hierarchical Models

Bayesian Linear Models

One-Way ANOVA

Assume you have multiple trials of an experiment $(Y_1,...,Y_a)$. You are interested in modelling the responses from these experiments as Normal $(N(\mu_1,\sigma^2), ... ,N(\mu_a,\sigma^2) )$. You can think of these experiments as treatments and their respective responses. Furthermore, for each experiment we have the observation $(y_{11},...,y_{1,n_1})$

We've assumed normality so the distribution of these Y's is normal, and we've assumed that sigma, the variance, is the same across all the experiments. These are the traditional ANOVA assumptions.

In an ANOVA problem we are interested in comparing the means of each treatment.
ie Test: $H_0: \mu_1=\mu_2=...=\mu_a$ v.s $H_1: \text{ not } H_0$
Where our Model is: $y_{ij} \sim N(\mu_i,\sigma^2)$
with $\mu_i = \mu+\alpha_i; i=1,2,...a; j=1,...,n_i$ where $\mu$ is the grand mean which we do not know
We can rewrite our test in several was
$H_0: \alpha_1=\alpha_2=...=\alpha_a=0$
or we test $\sum_{1}^{a} \alpha_i = 0$ as a sum to zero constraint (STZ)
or we test $\alpha_1 = 0$ as a corner value constraint

In classical Stats we use an ANOVA table to test $H_0$. If it is rejected then you must move to multiple comparisons

Bayesian Approach
We treat all these parameters as RVs with non-informative priors, at the least. Informative priors may also be used if there is knowledge about them. Now any function, or linear combo, of $\alpha$s is estimable and testable. This opens up more information to us.

Example
See Codes4Unit7 folder, anovacoagulation.odc

Is an animals coagulation dependent on their diet?
24 animals were given 1 or 4 diets. The response variable y is the time until coagulation.

Factorial Designs Two-Way ANOVA

Here there are two factors contributing to the response:
Factor 1: 1,2,...,a
Factor 2: 1,2,...,b

12...b
1
2(i,j)
...
a

Where $(i,j)=y_{i,j_1},y_{i,j_2},...,y_{i,n_{ij}} $ represents an observation
$y_{ijk}=\mu+\alpha_i+\beta_j+(\alpha \beta)_{ij}+\epsilon_{ijk}$
$\mu_{ij}=\mu+\alpha_i+\beta_j+(\alpha \beta)_{ij}$
$\epsilon_{ijk} \sim N(0,\sigma^2)$

Constraints:
$\sum \alpha_i=0;\sum \beta_j=0;\sum_i (\alpha \beta)_{ij}=\sum_j (\alpha \beta)_{ij}=0 $ STZ: Sum to zero constraint $\alpha_1=0; \beta_1=0;(\alpha \beta)_{1j}=(\alpha \beta)_{i1}=0$ CR: Corner value constraint

Test: $H_{01}: (\alpha \beta)_{ij}=0$ The first test should be the interaction between terms.
Main Effect: $H_{02}:\alpha_i=0$ and $H_{03}: \beta_j=0$

Example
See Codes4Unit7 folder simvastatin.odc
Students were asked to explore the reaction, growth, of cells under different conditions (factors). In particular one condition was the drug Simvastatin, which is a cholesterol lowering medication.

Description:
In a quantitative physiology lab II at Georgia Tech, students were asked to find a therapeutic model to test on MC3T3-E1 cell line to enhance osteoblastic growth. The students found a drug called Simvastatin, a cholesterol lowering drug to test on these cells. Using a control and three different concentrations: 10^-9, 10*-8 and 10-7 M, cells were treated with the drug. These cells were plated on four, 24 well plates with each well plate having a different treatment. To test for osteoblastic differentiation an assay, pNPP, was used to test for alkaline phosphatase activity. The higher the alkaline phosphatase activity the better the cells are differentiating, and become more bone like. This assay was performed 6 times total within 11 days. Each time the assay was performed, four wells from each plate were used.

The following snippet is the model

code
for(i in 1:n){
   apa[i]  ~ dnorm( mu[i], tau )
    mu[i] <-  mu0 + alpha[ conc[i] ]  + beta[ time[i]  ] + alpha.beta[ conc[i], time[i] ]
    }

apa represents the measured time: mu[i] is the mean of the height, vol of growth, measurement, and tau is the precision
mu0 is the grand mean

Then we have our constraints (this is also part of the model in Openbugs):

code
##STZ (sum-to-zero) constraints
   alpha[1] <- - sum(alpha[2:leva])
   beta[1] <- - sum(beta[2:levb])
   for(a in 1:leva) {alpha.beta[a,1] <- - sum(alpha.beta[a, 2:levb])}
   for(b in  2:levb) {alpha.beta[1,b] <- - sum(alpha.beta[2:leva, b])}

Note that the first element is skipped!

And we configure the priors as:

code
#PRIORS
    mu0 ~ dnorm(0, 0.0001) #Non-Informative
    for(a in 2:leva) {alpha[a] ~ dnorm(0, 0.0001)}
    for(b in 2:levb) {beta[b] ~ dnorm(0, 0.0001)}
    for(a in 2:leva) {for(b in 2:levb){
          alpha.beta[a,b] ~ dnorm(0, 0.0001) }}
    tau ~ dgamma(0.001, 0.001)
    s <- 1/sqrt(tau)

Regression (single)

Recall
Basic Regression Model: $\hat{y} = \hat{\alpha} + \hat{\beta}x$

Types
Simple: One covariate x, Normal error
Multiple: 2 or more covariates
Non-Normal errors
Multivariate Response

We focus on simple and multiple.

Given Data in the form: $(x_1,y_1), ... (x_n,y_n)$
A Model given by: $y_i=\beta_0+\beta_1 x_1 + \epsilon_i$ Assuming a linear relationship
Where errors are: $\epsilon_i \sim N(0,\sigma^2)$

We compute the sum of squares for the x's and y's as:
$S_{yy} = \sum (y_i-\bar{y})^2;S_{xx} = \sum (x_i-\bar{x})^2;S_{xy} = \sum (y_i-\bar{y})(x_i-\bar{x}); $

We compute the sum of squares for the errors and residuals as:
$S_{se} = \sum (y_i-\hat{y_i})^2;$
$S_{sr} = \sum (\hat{y_i} - \bar{y})^2 $ The amount of variability predicted by the y hats

Now we can determine our parameters as
$\hat{\beta_1} = \frac{S_{XY}}{S_{XX}};$ which is the best possible slope
$\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$ which is the best possible intercept
$\epsilon_i=\hat{y_i}-y_i$
$\sigma^2 = \frac{S_{se}}{n-2}$
$R^2 = \frac{S_{sr}}{S_{st}} = 1- \frac{S_{se}}{S_{st}}; S_{st} $ is the total variability.

Finally our model is given as:
$\hat{y} = \hat{\beta_0} + \hat{\beta_1}x ; $ as computed from above

Example See Codes4Unit7 folder fat2.odc

Regression (multiple)

For i=1,2,...,n
Data is in the form: $(x_{i1},x_{i2},...,x_{ik},y_i);$
A Model given by: $y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i$ Assuming a linear relationship
With errors: $\epsilon_i \sim N(0,\sigma^2)$
The number of parameters is k+1=p, because we include the intercept.

We represent these in matric notation for compactness and convenience
$ y = \begin{bmatrix} y_1 \\ y_2 \\ ... \\ y_n \end{bmatrix} $ $ X = \begin{bmatrix} 1 & x_{11} & ... x_{1k} \\ 1 & x_{21} & ... x_{2k} \\ ... \\ 1 & x_{n1} & ... x_{nk} \\ \end{bmatrix} $ $ \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ ... \\ \beta_k \end{bmatrix} $ $ \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ ... \\ \epsilon_n \end{bmatrix} $

Finally our model can be written same as before: $y=X\beta+\epsilon$. However this time around these are matrices

This time around we have:
$\hat{\beta} = (X^TX)^{-1} X^Ty$ as our least square estimator
$\hat{y} = X \hat{\beta}$
so let $H=X(X^TX)^{-1} X^T$ Generally called the "hat matrix"
then $\hat{y} = Hy$ is our predictor
and $\epsilon=(I-H)y$ is fitted error residuals

Furthermore
SST = $y^T(I-\frac{1}{n}J)y;$ where J=n by n matrix of 1s
SSE = $y^T(I-H)y$

Hence $R^2 = 1- \frac{SSE}{SST}$
and adjusted $R_{adj}^2 = 1- \frac{(n-1)SSE}{(n-p)SST}$

Bayesian spin on Regression
We put priors on the all parameters, the $\beta$s and on the error $\epsilon$. We want the posteriors to determine the best fit. The model is the same as before.
$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i$
with $\epsilon_i \sim N(0,\sigma^2)$ or $N(0,\tau)$ where $\tau=\frac{1}{\sigma^2}$

How it's done?
Method 1
We could take independent priors for all $\beta$s.
ie $\beta_j \sim N(0,\xi);$ Using precision $ \xi=\frac{1}{Var(\beta_j)}$ for j=0,...,k
$\xi$ the precision parameter typically very small like 10E-5
$\tau \sim Gamma(a,b)$
a,b should be equal and very small like 10E-3

Weak prior information: $\pi(\beta,\sigma^2)=\frac{1}{\sigma^2}$
Recall that this is Jeffery's prior

Method 2: Normal+Inverse Gamma
Used when the Betas are not independent of the precision or variance
$[\beta|\sigma^2] \sim MVN(\mu,\sigma^2 V)$ v=prior covariance matrix
$\sigma^2 \sim InvGamma(a,b)$
$\tau \sim Gamma(a,b)$

with estimates:
$\hat{\beta} = E[\beta|y]=A \beta^{OLS} + (I-A)\mu$ (OLS=Ordinary Least Square) where $A = (X^TX + V^{-1})^{-1} X^TX$

NB if V is the identity matrix you get ridge regression!!

Method 3: Zellner's Priors
Let g be Zellner's prior then
$[\beta|\sigma^2] \sim MVN(\mu,g \bullet \sigma^2 V)$ V=prior covariance matrix
$\sigma^2 \sim InvGamma(a,b)$
Typical choices for g are: $g=n;g=p^2;g=max{n,p^2} $ the choice of g reflects the strength of information

It should be noted that $V=(X^T X)^{-1}$

Example See Codes4Unit7 folder fatmulti.odc

Generalized Linear Models (GLMs)

What does it mean to be generalized?

  1. Here $y_i$s remain independent but the distribution is generalized from the normal to the exponential family.
    NB: the exponential family includes Binomial, Normal, Poisson, Negative Binomial, gamma etc. It's a large family!
  2. The mean of the $y_i$s is a function of $\beta_0+\beta_1 x_{i1}+ ... +\beta_k x_{ik}=l_i$
    With $\mu_i = E[y_i] = g(l_i)$ -> g is called the link function
    For Simple Linear regression the link function is simply the identity (1)
  3. Variance of $y_i$ depends on $\mu_i$, and therefore is not constant.

Exponential GLMs
The exponential family of distributions have the form
$f(y|\theta,\phi) = e^{\frac{y\theta - b(\theta)}{\phi} + c(y,\phi) } $ in paradigmatic (General) form where $\theta$ - is a canonical parameter
and $\phi$ is often called the dispersion parameter

Example 1: Normal
Normal is from exponential family
Taking $\mu = \theta;$ and $\phi = \sigma^2$
Yields $f(x|\mu,\sigma^2) = exp \left[ \frac{y\mu - \mu^2 / 2}{\sigma^2} - \frac{1}{2}(\frac{y^2}{\sigma^2} + log(2\pi\sigma^2)) \right]$
Here $\theta=\mu$ is the link identity since $\mu=E[y_i]$

Example 2: Bernoulli
Bernoulli distribution with p = probability of success
for $y_i \sim Ber(p);$ we have $ \theta = log(\frac{p}{1-p})=logit(p); p=E[y_i]$

Example 3: Poisson
Poisson Distribution
$y_i \sim Poi(\lambda)$ so we have $f(y|\lambda)=\frac{\lambda^y}{y!}e^{-1} = exp\{y*log(\lambda)-\lambda+log(y!)\}$
Which implies that $\theta = log(\lambda)$ and $\lambda = E[y_i] = \mu$
The link is $log(\lambda)=\beta_0+\beta_1 x_{1}+ ... +\beta_k x_{k}$
Why?: $\lambda$ is positive, and log maps pos numbers to $[-\infty,\infty]$

Example 4: Binomial
This is primarily used when observations are between 0 and 1
We have several options for a link function
The Logit function:
$log(\frac{p}{1-p})=F^{-1}(p) = \beta_0+\beta_1 x_{1}+ ... +\beta_k x_{k} $ F is the logistic C.D.F.
The Probit Function:
$F^{-1}(p) = \beta_0+\beta_1 x_{1}+ ... +\beta_k x_{k} $ F is NORMAL C.D.F.
The cloglog function (cloglog is the complementary log-log):

  • F is C.D.F. of Gumbel type I distribution
  • $F(x)=1-exp(-exp(x))$
  • $F^{-1}(p) = cloglog(p) = log(-log(1-p)) = \beta_0+\beta_1 x_{1}+ ... +\beta_k x_{k} $

Performance Measurement
The standard measure of model performance are deviances D
where $D = -2*log(\frac{Fitted-Model-Likelihood}{Saturated-Model-Likelihood}) $

In the Logistic case:
$D = -2 * \sum \left[ y_i log(\frac{\hat{y_i}}{y_i}) + (n_i-y_i)*log(\frac{n_i-\hat{y_i}}{n_i-y_i}) \right]$
where $\hat{y_i} = n_i \hat{p_i}$ is the model fit for $y_i$
and $\hat{p_i}=\frac{y_i}{n_i}; \hat{y_i}=y_i$ for the saturated model and $n_i$ is the number of cases for which the covariates coincide, and k is the number of such groups

In the Poisson Case:
$D = -2 * \sum \left[ y_i log(\frac{{y_i}}{\hat{y_i}}) - (y_i-\hat{y_i}) \right]$
with $\hat{y_i} = exp\{ \beta_0+\beta_1 x_{i1}+ ... +\beta_k x_{ik} \} $

Multinomial Logit

Is a type of regression used on a categorical variable with more than 2 categories. Recall that in logistic regression the response was binary, ie 0 or 1. That can be used as a classification for a response with only 2 possible outcomes. It could be, for example, the response is agree/disagree, disease/no disease, and so forth.

What happens if you have more than two categories? Suppose you have K categories, in this case that one observation is multinomial.

So let $(y_1,...,y_n)$ be our n observations, each one having k dimensions
with $(y_1,...,y_n) \sim Mn(p,1)$
and $p=(p_1,...,p_k)$ be the corresponding vector of probabilities.
(ie $p_i$ is the probability an observation is in the i'th category. As always they must add up to one $\sum p_i = 1$
(ie $y_i = (y_{i1},...,y_{ik}); y_{i=j}=1;y_{i \neq j}=0;$ for $j \in \{1,2,...K\}$)

Example: $y_i=(0,0,0,1,0)$
Then $K=5; y_{i=4}=1; y_{i \neq 4}=0;p=(p_1,...,p_k) $

Example:
Suppose we have n subjects for a medical treatment.
The i'th subject, person, has the covariates $x_{i1},...,x_{i,p-1}$ (Covariates = characterisitics)
We want to determine the probability that this subject falls into the j'th category.
$p_{ij} = \frac{exp\{n_{ij}\}} {\sum_k exp\{n_{ik}\} }; $
$n_{ij}= \beta_{0j}+\beta_{1j} x_{i1}+ ... +\beta_{p-1} x_{i,p-1}$

You can also pick one category to be reference. Let's say that first category, j=1 is your reference category.
so $p_{i1} = P(y_{i1} = 1) = \frac{1}{1+\sum_{k=2} exp\{n_{ik}\} }$
and $p_{ij} = P(y_{ij} = 1) = \frac{exp\{n_{ij}\} }{1+\sum_{k=2} exp\{n_{ik}\} }$
NB: in the first equation the numerator is 1, since all the betas are 0 and $e^0 = 1$

Why do we use a reference category?
It allows us to compare the probabilities and the probabilities of all categories different than 1 compared to the probability of the first category for i subject.
$log(\frac{p_{ij}}{p_{i1}}) = n_{ij} = \beta_{0j}+\beta_{1j} x_{i1}+ ... +\beta_{p-1,j} x_{i,p-1} $

This log ratio is equal to exactly a linear combination of the predictors for this particular subject. So that is how the multinomial logit works, generalization of the binary regression.

Multilevel Models

Many data models involve actually modelling at different levels of the subjects. So the highest modeling resolution is the subject itself, but the subject may belong to some community. The communities can belong to bigger communities, and so on, so forth. So there are different levels and different variables connected to different levels that account for the total variability and total modeling, and enter into the comprehensive model.

Typical example is random effect models that we're going to use as an example. Imagine you have to model something about school students. At the student level, let's say growth, or some educational growth, that you model as linear regression.

We set up a Mixed Effects Model:
$y_i = \alpha_{j(i)}+\beta x_i + \epsilon_i$, for students i=1,2,...n
$\alpha_j = a + b \mu_j + n_j$, for schools j=1,2,...,J
Think of the $x_i$s as time for example.

So why would you do this multilevel modeling?
Of course, you may learn more about treatments and about responses in a more comprehensive way. You may borrow strength and make all data having influence at all different levels of the model. So it is simply information flowing nicely in the model. Ultimately, you will get better-fitting models by using a model that account for uncertainty at all different levels.

Imagine that you have Poisson regression and you have observations which are counts.
so $y_i \sim Poi(\lambda_i)$ note that the lambdas are connected to the $y_i$s
and you decide to use the log link for lambda
ie $\lambda_i = log(\beta_{i0}+\beta_{1} x_{i1}+ ... +\beta_{k} x_{ik})$
Because It's poisson we know the mean and variance as just lambda
ie $E[y_i]=Var(y_i)=\lambda_i$

However, in practice these are rarely if ever equal. Generally, $E[y_i] < Var(y_i)$ Not always, but it happens.

This tells you Poisson model as is is not good model because if it is good model, then expectation variance should be close to each other. So what you do is add a so called random effect to the list of the covariates.

This random effect would be center at zero, but the variance may not be zero. And it could act as an sponge to collect the Xs of variance into Ci. And then have this parameters Beta 1, Beta 2, Beta 0 up to Beta k, better estimated. So usually in the case of over this version, the random effect models are much better fitting models.

Missing/Censored Data

Missing Data

Handling techniques: Multiple Imputations, WinBUGS

Easy to handle types:

  • MCAR: Missing completely at random, doesn't depend on observed or unobserved data
  • MAR: Missing at Random, depends only on observed data

These both are ignorable missingness and can be handled using Multiple Imputations.

Hard to handle:

  • MNAR: Missing not at Random describes the situation where the data influences the missingness.
    The lack of randomness here makes this difficult to deal with. An example might be data magnitude Here we must be more careful. If you impute the missing data using some statistical model and you use a single imputation, then this could potentially be misleading. You've imputed the data meaning it now looks complete, but then when performing the inference the estimation of the error is not correct because it hasn't accounted for the uncertainty in the imputed data.

Multiple Imputations to the rescue!

Let's assume we have both some missing data in our total observations Y
ie $Y = (Y_{obs},Y_{mis}) $
Then $P(\theta | Y_{obs}) = \int P(\theta | Y_{obs},Y_{mis}) P(Y_{mis} | Y_{obs}) d Y_{mis} $
Interpreted as (Actual posterior of $\theta$) = AVG(Complete data posterior distribution of $\theta$)

Similarly: $\mathbb{E}[\theta|Y_{obs}] = \mathbb{E}[ \mathbb{E}[\theta | Y_{obs},Y_{mis}]|Y_{obs}] $ Interpretation: (posterior of $\theta$) = AVG(repeated complete data posterior means of $\theta$)

Finally: $VAR[\theta|Y_{obs}] = \mathbb{E}[ VAR[\theta | Y_{obs},Y_{mis}]|Y_{obs}] + VAR[ \mathbb{E}[\theta | Y_{obs},Y_{mis}]|Y_{obs}]$
Interpretation: (Posterior variance of $\theta$) = AVG(repeated complete data variances of $\theta$) + VAR(repeated complete data posterior means of $\theta$)

The process is rather straight forward. Basically you compute the estimate for each imputation, then take the average over all the estimators.

The nice thing about WinBUGS is that it will automatically handle missing observations for you.

see the 3 rats examples in
Fall2019_ISYE6420_BayesianStats\Lecture_Material\Unit8_TimeToEvent_missingData

Time-To-Event Models

Used when the observations we are interested in modeling are measurements of time needed, or spent, waiting for an event to happen.

2 Main Theories: Reliability Theory, Survival Theory

Let's formalize our notation:

  • T is the lifetime, or time-to-event. Clearly T>=0
  • $F(t)=P(T \leq t)$ is the cdf
  • S(t) = F(t) = P(T > t) is our survival function
  • f(t) - density of T, f(t) = $\frac{dF(t)}{dt} = - \frac{dS(t)}{dt}$
  • $P(a \leq T \leq b ) = F(b)-F(a) = S(a)-S(b)$

We can take the above to define a hazard function h(t) in the following way
$h(t)dt = P(t \leq T \leq t+dt | T \geq t ) = \frac{S(t)-S(t+dt)}{S(t)} $
$h(t) = \lim_{dt \to 0} \left( \frac{S(t)-S(t+dt)}{dt} \bullet \frac{1}{S(t)} \right) = - \frac{S(t)'}{S(t)} = -(log S(t))' $

In [ ]: