\[ \newcommand{\prob}[1]{\mathbb{P}\!\left(#1\right)} \] \[ \newcommand{\expt}[1]{\mathbb{E}\!\left[#1\right]} \] \[ \newcommand{\Cov}[2]{\text{Cov}\!\left(#1,#2 \right)} \] \[ \newcommand{\Var}[1]{\text{Var}\!\left(#1\right)} \] \[ \newcommand{\Natural}{\mathbb{N}} \] \[ \newcommand{\real}{\mathbb{R}} \] \[ \newcommand{\dint}{\displaystyle\int} \] \[ \newcommand{\ind}[1]{I\!\left(#1\right)} \] \[ \newcommand{\bx}{\mathbb{x}} \]

1 Data

In the rat tumor example, the historical data were in fact a set of observations of tumor incidence in 70 groups of rats (Table 5.1). In the \(j\)th historical experiment, let the number of rats with tumors be \(y_j\) and the total number of rats be \(n_j\). We model the \(y_j\)’s as independent binomial data, given sample sizes \(n_j\) and study-specific means \(\theta_j\). Assuming that the beta prior distribution with parameters \((\alpha, \beta)\) is a good description of the population distribution of the \(\theta_j\)’s in the historical experiments, we can display the hierarchical model schematically as in Figure 5.1, with \(\theta_{71}\) and \(y_{71}\) corresponding to the current experiment.

(Table 5.1 in BDA; Figure 5.1 in BDA)

Now, load data as follows:
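A minimal sketch of the loading step (the file name `rat-tumors.csv` and the list names `N`, `y`, `n` are assumptions; the placeholder numbers below are for illustration only, not the actual Table 5.1 values):

```r
# Normally the full data (Table 5.1 plus the current experiment) would be
# read from a file, e.g.:
# rats <- read.csv("rat-tumors.csv")   # columns y (tumors) and n (rats)

# For illustration only: a few hypothetical groups in the same format
rats <- data.frame(y = c(0, 2, 1, 4), n = c(20, 13, 19, 14))

# Assemble the list that the Stan programs expect
stan_data <- list(N = nrow(rats), y = rats$y, n = rats$n)
```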

2 Inference

2.1 Model

2.1.1 Complete Pooling

\(\newcommand{\bino}{\textrm{Binomial}}\) \(\newcommand{\unif}{\textrm{Uniform}}\)

\[ \begin{align*} p(y_j|\theta) &= \bino(y_j|n_j, \theta), \\ p(y|\theta) &= \prod_{j=1}^N p(y_j|\theta), \\ p(\theta) &= \unif(\theta|0, 1) = 1 \\ \end{align*} \]
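A Stan program implementing this model could look like the following minimal sketch (the actual `pool.stan` in the appendix may differ; the data names `N`, `n`, `y` are assumptions). The declared bounds on `theta` give it the implicit Uniform(0, 1) prior:

```stan
data {
  int<lower=0> N;      // number of experiments
  int<lower=0> n[N];   // rats per experiment
  int<lower=0> y[N];   // rats with tumors per experiment
}
parameters {
  real<lower=0, upper=1> theta;  // single shared tumor probability
}
model {
  // theta is implicitly Uniform(0, 1) via its declared bounds
  y ~ binomial(n, theta);
}
```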

2.1.2 No Pooling

\[ \begin{align*} p(y_j|\theta_j) &= \bino(y_j|n_j, \theta_j), \\ p(y|\theta) &= \prod_{j=1}^N p(y_j|\theta_j), \\ p(\theta_j) &= \unif(\theta_j|0,1), \\ p(\theta) &= \prod_{j=1}^N p(\theta_j). \end{align*} \]

2.1.3 Partial Pooling

2.1.3.1 Tumor Incidence Rates (BDA 5.3)

\[ \begin{align*} p(y_j|\theta_j) &= \bino(y_j|n_j, \theta_j), \\ p(y|\theta) &= \prod_{j=1}^N p(y_j|\theta_j), \\ p(\theta_j|\alpha, \beta) &= \textrm{Beta}(\theta_j|\alpha, \beta), \\ p(\theta|\alpha, \beta) &= \prod_{j=1}^N p(\theta_j|\alpha, \beta), \\ \alpha &= \kappa \phi, \quad \beta = \kappa (1-\phi), \\ p(\phi) &= \unif(\phi|0, 1), \\ p(\kappa) &= \textrm{Pareto}(\kappa|1, 0.5) \propto \kappa^{-1.5} \\ \end{align*} \] \[ \begin{align*} \kappa &= \alpha + \beta, \\ \phi &= \frac{\alpha}{\alpha + \beta} \\ \end{align*} \]
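A possible Stan sketch of this model (the actual `hier.stan` may differ; names are assumptions). The lower bound \(\kappa \ge 1\) matches the support of the Pareto prior, and the bounds on \(\phi\) give it the implicit Uniform(0, 1) prior:

```stan
data {
  int<lower=0> N;
  int<lower=0> n[N];
  int<lower=0> y[N];
}
parameters {
  real<lower=0, upper=1> phi;         // population mean of the theta's
  real<lower=1> kappa;                // population "concentration"
  vector<lower=0, upper=1>[N] theta;  // per-experiment tumor probabilities
}
model {
  kappa ~ pareto(1, 0.5);             // p(kappa) proportional to kappa^(-1.5)
  theta ~ beta(phi * kappa, (1 - phi) * kappa);  // alpha = kappa*phi, beta = kappa*(1-phi)
  y ~ binomial(n, theta);
}
```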

2.1.3.2 Log Odds of Tumor Incidence Rates (BDA 5.4)

\[ \begin{align*} p(y_j|\theta_j) &= \bino(y_j|n_j, \theta_j), \\ p(y|\theta) &= \prod_{j=1}^N p(y_j|\theta_j), \\ \alpha_j &= \textrm{logit}(\theta_j) = \log\left(\frac{\theta_j}{1-\theta_j}\right), \\ p(\alpha_j|\mu, \sigma) &= N(\alpha_j|\mu, \sigma), \\ p(\mu) &= N(\mu|-1, 1), \\ p(\sigma) &= 2N(\sigma|0, 1)\textrm{I}(\sigma>0). \end{align*} \]
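The appendix distinguishes a centered and a non-centered version of this model (`hier-logit.stan` being the non-centered one). A non-centered sketch could look as follows (an assumed sketch, not necessarily the actual file): sampling standardized log odds and rescaling them, which typically mixes better when the data per group are sparse.

```stan
data {
  int<lower=0> N;
  int<lower=0> n[N];
  int<lower=0> y[N];
}
parameters {
  real mu;               // population mean of the log odds
  real<lower=0> sigma;   // population sd of the log odds
  vector[N] alpha_std;   // standardized log odds (non-centered)
}
model {
  mu ~ normal(-1, 1);
  sigma ~ normal(0, 1);        // half-normal via the lower bound
  alpha_std ~ normal(0, 1);    // implies alpha_j ~ normal(mu, sigma)
  y ~ binomial_logit(n, mu + sigma * alpha_std);
}
```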

2.2 Results

The convergence check is shown only for the hierarchical model, as an example.

2.2.1 Diagnostics for Convergence

After checking for convergence with the trace plot and ACF plot (details in Computation, Part 3), the fitted models can be used directly for inference about parameters such as \(\phi\), \(\kappa\), \(\theta\), \(\alpha\), \(\beta\), and any other quantities of interest. Judging by the following plots, the model seems to have converged well, so we proceed to the inference! Note that all the following inferences are based on samples generated from the parameters’ posterior distribution.

Trace Plot

Autocorrelation Function (ACF) Plot

2.2.3 Customized Inference

In addition to inference on the parameters, we can also infer other quantities of interest. In Stan, this is implemented by declaring the quantities of interest in the “generated quantities” block. Stan automatically provides full Bayesian inference by producing draws from the posterior distribution of any calculated event probabilities, predictions, or statistics. Different quantities of interest have been declared in our 4 models: e.g., for the pooled model, the posterior p-values for the max and mean tests, \(\theta_{1}\), \(\theta_{33}\), \(\theta_{54}\), \(\theta_{71}\), and the probability that any \(\theta\) exceeds 0.35 are declared and retrieved.
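For the pooled model, such quantities could be declared roughly as follows (a hypothetical sketch; the actual generated quantities block and variable names may differ):

```stan
generated quantities {
  int y_rep[N];             // replicated data for posterior predictive checks
  int p_max;                // indicator draw behind the max-test p-value
  int some_ability_gt_350;  // indicator: any theta above 0.35?
  for (j in 1:N)
    y_rep[j] = binomial_rng(n[j], theta);
  p_max = max(y_rep) >= max(y);
  some_ability_gt_350 = theta > 0.35;  // single theta in the pooled model
}
```

Averaging the indicator draws over the posterior sample then yields the posterior predictive p-values reported below.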

Inference for Complete Pooling model:

## Inference for Stan model: pool.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## p_max               0.03    0.00 0.16 0.00 0.00 0.00 0.00  1.00  4017    1
## p_mean              0.52    0.01 0.50 0.00 0.00 1.00 1.00  1.00  2628    1
## theta[1]            0.15    0.00 0.01 0.14 0.15 0.15 0.16  0.17  1486    1
## theta[33]           0.15    0.00 0.01 0.14 0.15 0.15 0.16  0.17  1486    1
## theta[54]           0.15    0.00 0.01 0.14 0.15 0.15 0.16  0.17  1486    1
## theta[71]           0.15    0.00 0.01 0.14 0.15 0.15 0.16  0.17  1486    1
## some_ability_gt_350 0.00     NaN 0.00 0.00 0.00 0.00 0.00  0.00   NaN  NaN
## 
## Samples were drawn using NUTS(diag_e) at Sat May 18 03:48:15 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Inference for No pooling model:

## Inference for Stan model: no-pool.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## p_max               0.95       0 0.23 0.00 1.00 1.00 1.00  1.00  3892    1
## p_mean              0.99       0 0.11 1.00 1.00 1.00 1.00  1.00  4108    1
## theta[1]            0.05       0 0.04 0.00 0.01 0.03 0.06  0.16  5344    1
## theta[33]           0.17       0 0.10 0.03 0.09 0.15 0.23  0.41  7289    1
## theta[54]           0.24       0 0.09 0.09 0.17 0.23 0.30  0.43  7663    1
## theta[71]           0.31       0 0.11 0.12 0.23 0.31 0.39  0.55  6471    1
## some_ability_gt_350 1.00     NaN 0.04 1.00 1.00 1.00 1.00  1.00   NaN    1
## 
## Samples were drawn using NUTS(diag_e) at Sat May 18 03:48:48 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Inference for hierarchical models:

## Inference for Stan model: hier.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd 2.5%   25%   50%   75% 97.5% n_eff Rhat
## phi    0.14    0.00 0.01 0.12  0.14  0.14  0.15  0.17  4871    1
## kappa 14.90    0.16 5.24 7.72 11.17 14.01 17.43 27.39  1035    1
## 
## Samples were drawn using NUTS(diag_e) at Sat May 18 03:49:21 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
## Inference for Stan model: hier-logit.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## mu    -1.93       0 0.13 -2.20 -2.01 -1.93 -1.84 -1.70  2975    1
## sigma  0.69       0 0.13  0.46  0.60  0.69  0.78  0.97  1544    1
## 
## Samples were drawn using NUTS(diag_e) at Sat May 18 03:49:53 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

3 Data Analysis

3.1 Model Checking (Bayesian p-values)

The Bayesian p-value (i.e., the posterior predictive p-value) has a direct definition as a probability, which can be approximated using Monte Carlo methods:

\[ \begin{equation} p_{B}=\operatorname{Pr}\left[T\left(y^{\mathrm{rep}}\right) \geq T(y) | y\right] \end{equation} \]

\[ \begin{equation} p_{B} \approx \frac{1}{M} \sum_{m=1}^{M} \mathrm{I}\left[T\left(y^{\mathrm{rep}(m)}\right) \geq T(y)\right] \end{equation} \]

We compute a Bayesian p-value by comparing the test statistic on the simulated data with its value on the actual data. Unlike a frequentist p-value, the test statistic is not conditioned on a point estimate of \(\theta\); instead, \(\theta\) is allowed to vary in accordance with its posterior distribution. The p-value is a measure of model (mis)fit, and values near 0 or 1 indicate a discrepancy between the model and the data.
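The Monte Carlo approximation above is simply the mean of an indicator over the \(M\) replications. A minimal sketch (synthetic draws stand in for the replicated test statistics; the numbers are hypothetical, not from the rat data):

```r
# p_B ~= (1/M) * sum of I[T(y_rep_m) >= T(y)]
set.seed(1)
T_y_rep <- rnorm(4000, mean = 20, sd = 3)  # hypothetical T(y^rep) draws
T_y <- 25                                  # hypothetical observed T(y)

p_B <- mean(T_y_rep >= T_y)  # proportion of replications at least as extreme
```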

##           mean     se_mean        sd 2.5% 25% 50% 75% 97.5%    n_eff
## p_max  0.02575 0.002499200 0.1584084    0   0   0   0     1 4017.483
## p_mean 0.52350 0.009743989 0.4995099    0   0   1   1     1 2627.935
## p_min  1.00000         NaN 0.0000000    1   1   1   1     1      NaN
##             Rhat
## p_max  0.9992371
## p_mean 0.9998626
## p_min        NaN
##           mean     se_mean        sd 2.5% 25% 50% 75% 97.5%    n_eff
## p_max  0.94625 0.003615632 0.2255519    0   1   1   1     1 3891.566
## p_mean 0.98800 0.001699112 0.1088989    1   1   1   1     1 4107.740
## p_min  1.00000         NaN 0.0000000    1   1   1   1     1      NaN
##             Rhat
## p_max  0.9996334
## p_mean 0.9997706
## p_min        NaN
##          mean     se_mean        sd 2.5% 25% 50% 75% 97.5%    n_eff
## p_max  0.7605 0.007206893 0.4268317    0   1   1   1     1 3507.658
## p_mean 0.5180 0.007680991 0.4997384    0   0   1   1     1 4233.025
## p_min  1.0000         NaN 0.0000000    1   1   1   1     1      NaN
##            Rhat
## p_max  1.000015
## p_mean 1.000172
## p_min       NaN
##         mean     se_mean        sd 2.5% 25% 50% 75% 97.5%    n_eff
## p_max  0.737 0.007015620 0.4403175    0   0   1   1     1 3939.125
## p_mean 0.527 0.008037236 0.4993329    0   0   1   1     1 3859.818
## p_min  1.000         NaN 0.0000000    1   1   1   1     1      NaN
##             Rhat
## p_max  0.9995801
## p_mean 0.9998822
## p_min        NaN
# statistics
y_min <- min(y);
y_max <- max(y);
y_mean <- mean(y);
y_sd <- sd(y);

# function to calculate ppp and save as data frame
pvals_frame <- function(ss, model_name, M = length(ss$min_y_rep)) {
  df_pvals_min <- data.frame(list(test_stat = rep("min", M), 
                  replication = ss$min_y_rep), model = rep(model_name, M))
  df_pvals_max <- data.frame(list(test_stat = rep("max", M),
                  replication = ss$max_y_rep), model = rep(model_name, M))
  df_pvals_mean <- data.frame(list(test_stat = rep("mean", M),
                  replication = ss$mean_y_rep), model = rep(model_name, M))
  df_pvals_sd <- data.frame(list(test_stat = rep("sd", M),
                  replication = ss$sd_y_rep), model = rep(model_name, M))
  return(rbind(df_pvals_min, df_pvals_max, df_pvals_mean, df_pvals_sd))
}

# extract parameters from fitted models
ss_hier = extract(fit_rats_hier, permuted = TRUE)
ss_hier_logit = extract(fit_rats_hier_logit, permuted = TRUE)
ss_pool = extract(fit_rats_pool, permuted = TRUE)
ss_no_pool = extract(fit_rats_no_pool, permuted = TRUE)

# calculate ppp
df_pvals <- rbind(pvals_frame(ss_hier, "partial pool"),
  pvals_frame(ss_hier_logit, "partial (logit)"),
  pvals_frame(ss_pool, "complete pool"),
  pvals_frame(ss_no_pool, "no pool"))

# plotting
M = length(ss_hier$p_min)
post_test_stat_plot <- ggplot(df_pvals, aes(replication)) +
  facet_grid(model ~ test_stat) +
  geom_histogram(binwidth = 0.5, colour="black", size = 0.25, fill="white") +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  xlab("value in replicated data set") +
  geom_vline(aes(xintercept = y_val),
             data = data.frame(y_val = rep(c(rep(y_min, M), rep(y_max, M),
                    rep(y_mean, M), rep(y_sd, M)), 4),
                    test_stat = df_pvals$test_stat,
                    replication = df_pvals$replication),
             colour = "blue", size = 0.25) +
  ggtitle("Posterior p-values")

post_test_stat_plot

3.2 Model Comparison (WAIC)

Several such measures exist: AIC, BIC, DIC, WAIC, LOO-CV, etc. Here we show WAIC, in which the dependence of the bias correction on the parameter uncertainty in the posterior is more explicit, and which is recommended by Gelman: “Compared to AIC and DIC, WAIC has the desirable property of averaging over the posterior distribution rather than conditioning on a point estimate. This is especially relevant in a predictive context, as WAIC is evaluating the predictions that are actually being used for new data in a Bayesian context.” (Ch. 7)

\[ \begin{aligned} \mathrm{lppd} &= \log \textrm{ pointwise predictive density} \\ &= \log \prod_{i=1}^{n} p_{\textrm{post}}(y_i) = \sum_{i=1}^{n} \log \int p(y_i|\theta)\, p_{\textrm{post}}(\theta)\, d\theta \end{aligned} \]

\[ \begin{equation} p_{\mathrm{WAIC}}=\sum_{i=1}^{n} \operatorname{var}_{\mathrm{post}}\left(\log p\left(y_{i} | \theta\right)\right) \end{equation} \]

\[ \widehat{\mathrm{elppd}}_{\mathrm{WAIC}} = \mathrm{lppd} - p_{\mathrm{WAIC}} \]
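These formulas can be computed directly from an \(M \times n\) log-likelihood matrix (rows = posterior draws, columns = observations). A sketch on a synthetic matrix (the output below comes from the `loo` package in practice; this hand-rolled version is for illustration):

```r
waic_from_loglik <- function(log_lik) {
  # lppd: sum over i of log posterior-mean p(y_i | theta), via log-sum-exp
  lppd <- sum(apply(log_lik, 2,
                    function(l) max(l) + log(mean(exp(l - max(l))))))
  # p_WAIC: sum over i of the posterior variance of log p(y_i | theta)
  p_waic <- sum(apply(log_lik, 2, var))
  c(elpd_waic = lppd - p_waic,
    p_waic    = p_waic,
    waic      = -2 * (lppd - p_waic))  # WAIC on the deviance scale
}

# Illustration on a synthetic log-likelihood matrix
set.seed(1)
log_lik <- matrix(rnorm(4000 * 10, mean = -2, sd = 0.1), nrow = 4000)
waic_from_loglik(log_lik)
```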

## 
## Computed from 4000 by 1 log-likelihood matrix
## 
##           Estimate SE
## elpd_waic   -151.9 NA
## p_waic        39.5 NA
## waic         303.8 NA
## 
## Computed from 4000 by 1 log-likelihood matrix
## 
##           Estimate SE
## elpd_waic   -172.7 NA
## p_waic         0.5 NA
## waic         345.4 NA
## 
## Computed from 4000 by 1 log-likelihood matrix
## 
##           Estimate SE
## elpd_waic   -149.7 NA
## p_waic        40.5 NA
## waic         299.4 NA
## 
## Computed from 4000 by 1 log-likelihood matrix
## 
##           Estimate SE
## elpd_waic   -160.7 NA
## p_waic        41.9 NA
## waic         321.4 NA

4 Computation

The model is fitted through iterative simulation, namely Markov chain Monte Carlo (MCMC), but the process poses the following problems, which can result in misleading inference if not attended to. For more information on MCMC, please refer to the Bayes and Computation slides from StanKorea Meetup 1.

4.1 Problems

Two problems need attention when using MCMC:

  1. Unrepresentative simulations

Simulations may be grossly unrepresentative of the target distribution if the iterations have not proceeded long enough.

  2. Within-sequence correlation

Simulation inference from correlated draws is generally less precise than from the same number of independent draws.

4.2 Solutions

Three solutions are suggested for the above problems:

  1. Multi chain with different starting points

Design the simulation runs to allow effective monitoring of convergence by simulating multiple sequences with starting points dispersed throughout parameter space (Figure 11.1a).

  2. Simulate until ‘within’ variation roughly equals ‘between’ variation

Monitor the convergence of all quantities of interest by comparing variation between and within simulated sequences until ‘within’ variation roughly equals ‘between’ variation (Figure 11.1b). Only when the distribution of each simulated sequence is close to the distribution of all the sequences mixed together can they all be approximating the target distribution.
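The between/within comparison is exactly what the potential scale reduction factor \(\widehat{R}\) in the tables above summarizes. A simplified sketch (without the split-chain refinement used by Stan) from an iterations-by-chains matrix of draws:

```r
# Simplified Rhat: compare between-chain and within-chain variance
rhat <- function(draws) {
  n <- nrow(draws)                       # iterations per chain
  B <- n * var(colMeans(draws))          # between-chain variance
  W <- mean(apply(draws, 2, var))        # within-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)                     # ~1 when within ~ between
}

# Four well-mixed synthetic chains should give Rhat close to 1
set.seed(1)
draws <- matrix(rnorm(1000 * 4), ncol = 4)
rhat(draws)
```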

  3. Alter the algorithm if the simulation efficiency is unacceptably low (Sections 12.1, 12.2)

In order to implement these solutions, ‘warm-up’ and ‘thin’ are used.

4.2.1 Warm-up

:= practice of discarding early iterations in Markov chain simulation

purpose:

  • diminish the influence of the starting values

  • to make sure the distributions of the simulated values \(\theta_t\) are close to the target distribution \(p(\theta|y)\) for large enough \(t\)

in Stan:

  • stan() > warmup = floor(iter/2)

  • discard the first half as a default, but different warm-up fractions can be appropriate depending on the context

4.2.2 Thin

:= keep every kth simulation draw from each sequence

purpose:

  • avoid the dependence of the iterations in each sequence

in Stan:

  • stan() > thin = 1

  • if the sequences have reached approximate convergence, they can be directly used for inferences
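Putting the two settings together, a hypothetical `stan()` call (the actual call in `rat-tumors.R` may differ; file and data names are assumptions) and the bookkeeping it implies:

```r
# Where warm-up and thinning are set in rstan (sketch, not executed here):
#
# fit <- stan("hier.stan", data = stan_data,
#             chains = 4,                # dispersed starting points
#             iter = 2000,               # total iterations per chain
#             warmup = floor(2000 / 2),  # default: discard the first half
#             thin = 1)                  # default: keep every draw

# These defaults imply the draw counts reported in the output above
iter <- 2000; warmup <- floor(iter / 2); thin <- 1
post_warmup_per_chain <- (iter - warmup) / thin  # 1000 per chain
```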

5 Appendix

5.1 Source Code (R + stan)

You can download the source code by clicking the following links.

File                      Description
pool.stan                 Stan code for the complete pooling model
no-pool.stan              Stan code for the no pooling model
hier.stan                 Stan code for the partial pooling model (tumor incidence rates)
hier-logit-centered.stan  Stan code for the partial pooling model (log odds; centered model)
hier-logit.stan           Stan code for the partial pooling model (log odds; non-centered model)
rat-tumors.R              R code for fitting the Stan models and saving output as Rdata