Bayesian analysis of longitudinal multilevel data - part 6
Overview
This is an extension of a cmdstanr/Stan model to fit longitudinal data using Bayesian multilevel/hierarchical/mixed-effects models. It is a continuation of a prior series of posts. You should start at the beginning.
Here is the Stan code for this example and this is the R script that runs everything.
Introduction
As described in the prior post I want to implement an ordinary differential equation (ODE) model with Stan. I am slowly building up to that.
While I had planned to hop from the 2-parameter cmdstanr and Stan model straight to the ODE model, I ended up deciding on one more intermediate step. Namely, I know that the ODE model I want to use has 4 main parameters, as opposed to the 2 parameters used so far. I figured I might first build a more complex, 4-parameter non-ODE model before I try to implement the ODE model.
New deterministic model
The obvious candidate for that 4-parameter model is the equation I mentioned previously and that we ended up using in one of our papers. This equation was introduced as a good model to fit virus load data for an acute infection in this paper.
I’m deviating from the original paper by giving the parameters different names (\(\alpha\), \(\beta\), \(\gamma\), \(\eta\) instead of \(p\), \(g\), \(d\), \(k\)) to be consistent with what I’ve been doing so far.
Also, since the original model was developed and set up with the assumption that all 4 main parameters are positive, I need to ensure positivity, which I do with the previously described approach of exponentiating the parameters.
This leads to this equation for the virus load trajectory.
\[ \begin{aligned} \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \end{aligned} \]
Here’s a bit of R code to explore this equation and to determine values for parameter priors that produce curves that are broadly consistent with our simulated data.
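# brief plotting of model to get idea for priors
t = seq(0.1, 40, length = 100)
# all parameters are the log of their original values
alph = 30; # approximately the peak of virus
bet = 1.0; # approx. growth rate
gamm = 2.5; # approx. peak time
et = 0.3; # approx. decay rate
num = 2*exp(alph)
d1 = exp( - exp(bet)*(t - exp(gamm)) )
d2 = exp( exp(et) * (t - exp(gamm)) )
mu = log( num /(d1 + d2) )
plot(t, mu, type = "l") #looks somewhat like virus load in acute infection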
Full model structure
The rest of the model follows the previous one, just with more parameters. They all get their own equations. Here are the main components of the new model.
\[ \begin{aligned} \textrm{Outcome} \\ Y_{i,t} \sim \mathrm{Normal}\left(\mu_{i,t}, \sigma\right) \\ \\ \textrm{main model describing the virus trajectory} \\ \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \\ \textrm{Deterministic models for main parameters} \\ \alpha_{i} = a_{0,i} + a_1 \left(\log (D_i) - \log (D_m)\right) \\ \beta_{i} = b_{0,i} + b_1 \left(\log (D_i) - \log (D_m)\right) \\ \gamma_{i} = g_{0,i} + g_1 \left(\log (D_i) - \log (D_m)\right) \\ \eta_{i} = e_{0,i} + e_1 \left(\log (D_i) - \log (D_m)\right) \\ \end{aligned} \]
Parameter distributions
To fully specify the model, we need to give all parameters distributions. Here are the distributions for the population-level parameters, which do not vary between individuals. I'm choosing values for the prior distributions assuming that dose will lead to an increase in peak, time to peak and growth rate, and a reduction in decay rate. I'm only indicating the values with an X below, since I keep changing them in the code and they would then be out of sync here. Check the R script to see the chosen values.
\[ \begin{aligned} \textrm{population-level priors} \\ \sigma \sim \mathrm{Exponential}(X) \\ a_1 \sim \mathrm{Normal}(X,X) \\ b_1 \sim \mathrm{Normal}(X,X) \\ g_1 \sim \mathrm{Normal}(X,X) \\ e_1 \sim \mathrm{Normal}(X,X) \\ \end{aligned} \]
In addition, we allow some parameters to differ between individuals, and we’ll implement hyper-parameters to allow these values to inform each other across individuals. This is again the adaptive pooling concept discussed previously.
I’m setting values for the prior distributions such that the virus load curve looks somewhat reasonable, based on the quick exploration of the model above. Again, not showing exact values here to not create confusion between what I write here and potentially different values I end up using in the code (and forgetting to update here). See the code for the actual values.
\[ \begin{aligned} \textrm{individal-level priors} \\ a_{0,i} \sim \mathrm{Normal}(\mu_a, \sigma_a) \\ b_{0,i} \sim \mathrm{Normal}(\mu_b, \sigma_b) \\ g_{0,i} \sim \mathrm{Normal}(\mu_g, \sigma_g) \\ e_{0,i} \sim \mathrm{Normal}(\mu_e, \sigma_e) \\ \\ \textrm{hyper priors} \\ \mu_a \sim \mathrm{Normal}(X, X) \\ \mu_b \sim \mathrm{Normal}(X, X) \\ \mu_g \sim \mathrm{Normal}(X, X) \\ \mu_e \sim \mathrm{Normal}(X, X) \\ \sigma_a \sim \mathrm{Exponential}(X) \\ \sigma_b \sim \mathrm{Exponential}(X) \\ \sigma_g \sim \mathrm{Exponential}(X) \\ \sigma_e \sim \mathrm{Exponential}(X) \\ \end{aligned} \]
And that’s the full model. The basic structure is the same as before, but the model is bigger because I’m now modeling the virus trajectory (given by \(\mu_{i,t}\)) with 4 main parameters.
Model implementation
We’ll follow the same setup as in the previous post. Links to the Stan and R code files are given at the top of this document.
library('here') #for file loading
library('dplyr') # for data manipulation
library('ggplot2') # for plotting
library('fs') #for file path
library('cmdstanr') #for model fitting
library('bayesplot') #for plotting results
library('posterior') #for post-processing
library('loo') #for model diagnostics
Some setup bits
############################################
# some general definitions and setup stuff
############################################
#setting random number seed for reproducibility
rngseed = 12345
# I'll be saving results so we can use them without always running the model
# Note that the files are often too large for standard Git/GitHub - where this project lives
# Git Large File Storage should be able to handle it
# I'm using a simple hack so I don't have to set up Git LFS
# I am saving these large files to a folder that is synced with Dropbox
# adjust accordingly for your setup
#filepath = fs::path("C:","Data","Dropbox","datafiles","longitudinalbayes")
filepath = fs::path("D:","Dropbox","datafiles","longitudinalbayes")
filename = "cmdstanr4par.Rds"
stanfile <- here('posts','2024-02-16-longitudinal-multilevel-bayes-6',"stancode-4par.stan")
Data
We'll use the same data as before. I'm making one more change: instead of hard-coding the values for the prior distributions into the Stan code, I'm passing them into the Stan code as part of the data. This makes exploring the Stan model more flexible, since I don't need to re-edit the Stan code if I want to try different values for the priors. I could do this for all parameters, but out of laziness, and because I don't change them much, I'm hard-coding the sigma parameters inside the Stan file.
# adjust as necessary
simdatloc <- here::here('posts','2022-02-22-longitudinal-multilevel-bayes-1','simdat.Rds')
simdat <- readRDS(simdatloc)
Nind = length(unique(simdat$m3$id))
Ntot = length(simdat$m3$id)
# values for prior distributions
# allows for exploring different values without having to edit Stan model code
priorvals = list(mu_a_mu = 1, mu_a_sd = 1,
                 mu_b_mu = 1, mu_b_sd = 1,
                 mu_g_mu = 2.5, mu_g_sd = 1,
                 mu_e_mu = 0.3, mu_e_sd = 1,
                 a1_mu = 0.5, a1_sd = 1,
                 b1_mu = 0.1, b1_sd = 1,
                 g1_mu = 0.1, g1_sd = 1,
                 e1_mu = -0.1, e1_sd = 1
                 )
# all data as one list, this is how Stan needs it
fitdatbase = list(id = simdat[[3]]$id,
                  outcome = simdat[[3]]$outcome,
                  time = simdat[[3]]$time,
                  dose_adj = simdat[[3]]$dose_adj[1:Nind], #first Nind values
                  Ntot = Ntot,
                  Nind = Nind
                  )
fitdat = c(fitdatbase, priorvals)
Stan code
I again wrote the Stan code in a separate Stan file. Here is the code. The model is basically like the previous one, updated to reflect the model equations above. As mentioned above, instead of hard-coding values for prior distributions inside the Stan code, I'm passing some of them into the code as data. The advantage of passing them in is that I can more quickly play around with different values and see how results change. It also ensures that I use the same values in all parts of the model (e.g., the model and generated quantities blocks).
//
// Stan code for fitting a hierarchical model to time-series data
//
data{
int<lower = 1> Ntot; //number of observations
int<lower = 1> Nind; //number of individuals
vector[Ntot] outcome; //virus load
vector[Ntot] time; // times at which virus load is measured
vector[Nind] dose_adj; //dose after adjustment, 1 value per individual
array[Ntot] int id; //vector of person IDs to keep track which data points belong to whom
//everything below are variables that contain values for prior distributions
real mu_a_mu;
real mu_b_mu;
real mu_g_mu;
real mu_e_mu;
real mu_a_sd;
real mu_b_sd;
real mu_g_sd;
real mu_e_sd;
real a1_mu;
real b1_mu;
real g1_mu;
real e1_mu;
real a1_sd;
real b1_sd;
real g1_sd;
real e1_sd;
}
parameters{
// population variance
real<lower=0> sigma;
// variance of priors
real<lower=0> sigma_a;
real<lower=0> sigma_b;
real<lower=0> sigma_g;
real<lower=0> sigma_e;
// population-level dose dependence parameters
real a1;
real b1;
real g1;
real e1;
// hyper-parameters to implement adaptive pooling
real mu_a;
real mu_b;
real mu_g;
real mu_e;
// individual level variation parameters
vector[Nind] a0;
vector[Nind] b0;
vector[Nind] g0;
vector[Nind] e0;
}
// Generated/intermediate parameters
transformed parameters{
// predicted virus load from model
vector[Ntot] virus_pred;
// main model parameters
// each individual has their potentially own value
// I'm removing the last letter from each variable name
// just to avoid potential conflicts with built-in names of Stan
// such as beta for a beta distribution
// it might not matter, but I wanted to be safe
vector[Nind] alph;
vector[Nind] bet;
vector[Nind] gamm;
vector[Nind] et;
// intermediate variables for denominators
// just to make equation easier to understand
real d1;
real d2;
// compute main model parameters
// multiply a by 30 to get better scaling
for ( i in 1:Nind ) {
alph[i] = exp( 30*( a0[i] + a1 * dose_adj[i]) );
bet[i] = exp( b0[i] + b1 * dose_adj[i] );
gamm[i] = exp( g0[i] + g1 * dose_adj[i] );
et[i] = exp( e0[i] + e1 * dose_adj[i] );
}
// loop over all observations
// since parameters are saved in vectors of length corresponding to number of individuals
// we need to index with that extra id[i] notation
for (i in 1:Ntot)
{
d1 = exp( -bet[id[i]] * (time[i] - gamm[id[i]]) );
d2 = exp( et[id[i]] * (time[i] - gamm[id[i]]) );
virus_pred[i] = log( 2*alph[id[i]] / ( d1 + d2) );
}
} // end transformed parameters block
model{
// hyper-priors to allow for adaptive pooling among individuals
// values for the distributions are passed into the Stan code as part of the data
mu_a ~ normal( mu_a_mu , mu_a_sd );
mu_b ~ normal( mu_b_mu , mu_b_sd );
mu_g ~ normal( mu_g_mu , mu_g_sd );
mu_e ~ normal( mu_e_mu , mu_e_sd );
// variance of priors
sigma_a ~ exponential(1);
sigma_b ~ exponential(1);
sigma_g ~ exponential(1);
sigma_e ~ exponential(1);
// individual variation of each ODE model parameter
a0 ~ normal( mu_a , sigma_a );
b0 ~ normal( mu_b , sigma_b );
g0 ~ normal( mu_g , sigma_g );
e0 ~ normal( mu_e , sigma_e );
// average dose-dependence of each ODE model parameter
// values for the distributions are passed into the Stan code as part of the data
a1 ~ normal( a1_mu , a1_sd);
b1 ~ normal( b1_mu , b1_sd);
g1 ~ normal( g1_mu , g1_sd);
e1 ~ normal( e1_mu , e1_sd);
// residual population variation
sigma ~ exponential(1);
// distribution of outcome (virus load)
// all computations to get the time-series trajectory for the outcome are done
// inside the transformed parameters block
outcome ~ normal( virus_pred , sigma );
}
// for model diagnostics and exploration
generated quantities {
// define quantities that are computed in this block
// for the individual level parameters, every individual
// has the same prior so we only specify one
real mu_a_prior;
real mu_b_prior;
real mu_g_prior;
real mu_e_prior;
real<lower=0> sigma_a_prior;
real<lower=0> sigma_b_prior;
real<lower=0> sigma_g_prior;
real<lower=0> sigma_e_prior;
real a0_prior;
real b0_prior;
real g0_prior;
real e0_prior;
real a1_prior;
real b1_prior;
real g1_prior;
real e1_prior;
real<lower=0> sigma_prior;
vector[Ntot] log_lik;
vector[Ntot] ypred;
// this is so one can plot priors and compare with posterior later
// simulate the priors
mu_a_prior = normal_rng( mu_a_mu , mu_a_sd);
mu_b_prior = normal_rng( mu_b_mu , mu_b_sd);
mu_g_prior = normal_rng( mu_g_mu , mu_g_sd);
mu_e_prior = normal_rng( mu_e_mu , mu_e_sd);
sigma_a_prior = exponential_rng( 1 );
sigma_b_prior = exponential_rng( 1 );
sigma_g_prior = exponential_rng( 1 );
sigma_e_prior = exponential_rng( 1 );
a0_prior = normal_rng(mu_a_prior, sigma_a_prior);
b0_prior = normal_rng(mu_b_prior, sigma_b_prior);
g0_prior = normal_rng(mu_g_prior, sigma_g_prior);
e0_prior = normal_rng(mu_e_prior, sigma_e_prior);
a1_prior = normal_rng( a1_mu , a1_sd);
b1_prior = normal_rng( b1_mu , b1_sd);
g1_prior = normal_rng( g1_mu , g1_sd);
e1_prior = normal_rng( e1_mu , e1_sd);
sigma_prior = exponential_rng( 1 );
// compute log-likelihood and predictions
for(i in 1:Ntot)
{
log_lik[i] = normal_lpdf(outcome[i] | virus_pred[i], sigma);
ypred[i] = normal_rng(virus_pred[i], sigma);
}
} //end generated quantities block
This loads and compiles the Stan model.
# make Stan model.
stanmod1 <- cmdstanr::cmdstan_model(stanfile,
                                    pedantic = TRUE,
                                    force_recompile = TRUE)
Model fitting settings
These are the settings for the model fitting routine. Basically the same as before, only more initial conditions now because we have more parameters. And of course different values, since our model changed.
#settings for fitting
fs_m1 = list(warmup = 1500,
             sampling = 2000,
             max_td = 20, #tree depth
             adapt_delta = 0.99999,
             chains = 5,
             cores = 5,
             seed = rngseed,
             save_warmup = TRUE)
# separate definition of initial values, added to fs_m1 structure
# a different sample will be drawn for each chain
# there's probably a better way to do that than a for loop
set.seed(rngseed) #make inits reproducible
init_vals_1chain <- function() (list(mu_a = runif(1,1,1.5),
                                     mu_b = runif(1,0.8,1.2),
                                     mu_g = runif(1,2,3),
                                     mu_e = runif(1,0.2,0.6),
                                     sigma_a = runif(1,0,1),
                                     sigma_b = runif(1,0,1),
                                     sigma_g = runif(1,0,1),
                                     sigma_e = runif(1,0,1),
                                     a0 = runif(Nind,1,1.5),
                                     b0 = runif(Nind,0.8,1.5),
                                     g0 = runif(Nind,1.5,2.5),
                                     e0 = runif(Nind,0,1),
                                     a1 = runif(1,0.5,0.6),
                                     b1 = runif(1,0.1,0.1),
                                     g1 = runif(1,0.1,0.1),
                                     e1 = runif(1,-0.1,-0.1),
                                     sigma = runif(1,0,1))
                                )
inits = NULL
for (n in 1:fs_m1$chains)
{
  inits[[n]] = init_vals_1chain()
}
fs_m1$init = inits
Model fitting
This runs the model. It’s not actually run here to speed up generation of this Quarto file, but the code chunk is in the R script, so you can run it.
res_m1 <- stanmod1$sample(data = fitdat,
                          chains = fs_m1$chains,
                          init = fs_m1$init,
                          seed = fs_m1$seed,
                          parallel_chains = fs_m1$chains,
                          iter_warmup = fs_m1$warmup,
                          iter_sampling = fs_m1$sampling,
                          save_warmup = fs_m1$save_warmup,
                          max_treedepth = fs_m1$max_td,
                          adapt_delta = fs_m1$adapt_delta,
                          output_dir = filepath
                          )
res_m1$save_object(fs::path(filepath,filename))
Model result loading
To save time, we don't run the model each time; instead, we save the results and load them.
# loading previously saved fit.
# useful if we don't want to re-fit
# every time we want to explore the results.
# since the file is too large for GitHub
# it is stored in a local cloud-synced folder
# adjust accordingly for your setup
res_m1 <- readRDS(fs::path(filepath,filename))
Model diagnostics
First, we look at diagnostics from the fitting routine to make sure nothing obviously wrong shows up.
res_m1$cmdstan_diagnose()
Processing csv files: D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-1-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-2-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-3-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-4-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-5-39fd59.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
210 of 17500 (1.20%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.
Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.07, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete.
Ok, so the sampler isn’t quite happy. We should sample more and more stringently, but that would take very long. So for the purpose of this investigation, and given that I’m only exploring this model as a stepping stone to the ODE model I’m really interested in, I’ll leave it the way it is. If this were an actual research project, I would obviously need to improve the model performance.
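For completeness, here is a minimal sketch of what "sampling more and more stringently" might look like. This is not something I ran, and the specific numbers below are just illustrative placeholders, not values I tested.

# sketch of more stringent sampler settings (illustrative values only, not run here)
fs_strict = fs_m1
fs_strict$warmup = 5000          # much longer warmup
fs_strict$sampling = 5000        # more post-warmup samples
fs_strict$adapt_delta = 0.999999 # force even smaller step sizes
# res_strict <- stanmod1$sample(data = fitdat,
#                               chains = fs_strict$chains,
#                               init = fs_m1$init,
#                               seed = fs_m1$seed,
#                               parallel_chains = fs_strict$chains,
#                               iter_warmup = fs_strict$warmup,
#                               iter_sampling = fs_strict$sampling,
#                               max_treedepth = fs_strict$max_td,
#                               adapt_delta = fs_strict$adapt_delta,
#                               output_dir = filepath)

Note that the low E-BFMI warning points to posterior geometry issues, so, as the diagnostic output itself suggests, stricter settings alone would likely not be enough and a reparameterization of the model would probably also be needed.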
Another important check is to make a few diagnostic plots. We'll first need to get the samples, both with and without warmup, to be able to make various figures.
#this uses the posterior package to get draws
samp_m1 <- res_m1$draws(inc_warmup = FALSE, format = "draws_df")
allsamp_m1 <- res_m1$draws(inc_warmup = TRUE, format = "draws_df")
Now we can look at a few figures. Here I’m again showing a trace plot and a pairs plot.
# only main parameters; excluding the per-individual parameters, which would be too many
plotpars = c("a1","b1","g1","e1","sigma")
bayesplot::color_scheme_set("viridis")
bp1 <- bayesplot::mcmc_trace(samp_m1, pars = plotpars)
bp2 <- bayesplot::mcmc_pairs(samp_m1, pars = plotpars)
plot(bp1)
plot(bp2)
The plots look reasonable. Well-mixing chains and no noticeable correlations among parameters.
Model results
Let's also look at the results, namely the posterior distributions of the parameters. We'll again do both a table and a figure. We can't really compare with the values used to generate the data, since we are now fitting a different model, so we shouldn't expect any parameters to be similar. Thus, I'm not focusing further on the values. Again, for a real research project, you would want to carefully evaluate the parameters (after addressing the problems with the algorithm not working well).
print(head(res_m1$summary(),15))
# A tibble: 15 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -323. -3.26e+2 3.78e+1 3.86e+1 -3.78e+2 -2.56e+2 1.04 99.0
2 sigma 5.87 5.86e+0 2.67e-1 2.68e-1 5.45e+0 6.33e+0 1.00 5947.
3 sigma_a 0.0240 2.13e-2 1.58e-2 1.54e-2 3.97e-3 5.39e-2 1.03 232.
4 sigma_b 0.0160 1.27e-2 1.30e-2 1.19e-2 1.49e-3 4.21e-2 1.04 145.
5 sigma_g 0.0118 1.00e-2 8.34e-3 7.92e-3 1.69e-3 2.80e-2 1.01 200.
6 sigma_e 0.275 2.68e-1 5.55e-2 5.31e-2 1.99e-1 3.77e-1 1.00 3843.
7 a1 0.112 1.12e-1 1.15e-2 1.17e-2 9.33e-2 1.32e-1 1.00 5467.
8 b1 0.105 1.05e-1 2.21e-2 2.21e-2 6.94e-2 1.41e-1 1.00 1531.
9 g1 -0.00380 -3.27e-3 1.58e-2 1.58e-2 -2.98e-2 2.16e-2 1.00 1623.
10 e1 -0.299 -2.98e-1 3.55e-2 3.45e-2 -3.58e-1 -2.42e-1 1.00 1856.
11 mu_a 0.869 8.69e-1 2.23e-2 2.21e-2 8.33e-1 9.06e-1 1.01 1017.
12 mu_b 3.86 3.86e+0 4.36e-2 4.37e-2 3.79e+0 3.93e+0 1.01 514.
13 mu_g 0.480 4.79e-1 3.13e-2 3.05e-2 4.32e-1 5.34e-1 1.02 487.
14 mu_e 0.241 2.43e-1 6.63e-2 6.40e-2 1.28e-1 3.49e-1 1.00 3324.
15 a0[1] 0.878 8.76e-1 3.42e-2 3.04e-2 8.26e-1 9.36e-1 1.00 2078.
# ℹ 1 more variable: ess_tail <dbl>
bp3 <- bayesplot::mcmc_dens_overlay(samp_m1, pars = plotpars)
plot(bp3)
Priors and Posteriors
Next, we’ll compare prior and posterior distributions. This can give an indication if the priors were selected well or are too broad or overly influential. To be able to show priors, we needed all that extra information in the generated quantities block in the Stan code.
# data manipulation to get in shape for plotting
postdf1 <- samp_m1 %>%
  select(!ends_with('prior')) %>%
  select(!starts_with(".")) %>%
  select(-"lp__") %>%
  select(!contains("["))
# awkward way of getting some further parameters
# namely values from first individual for a0,b0,g0,e0
postdf2 <- samp_m1 %>%
  select(contains("0[1]")) %>%
  rename_with(~ gsub("[1]", "", .x, fixed = TRUE) )
postdf <- cbind(postdf1, postdf2)
priordf <- samp_m1 %>%
  select(ends_with('prior')) %>%
  rename_with(~ gsub("_prior", "", .x, fixed = TRUE) )
postlong <- tidyr::pivot_longer(data = postdf, cols = everything(), names_to = "parname", values_to = "value") %>% mutate(type = "posterior")
priorlong <- tidyr::pivot_longer(data = priordf, cols = everything(), names_to = "parname", values_to = "value") %>% mutate(type = "prior")
ppdf <- dplyr::bind_rows(postlong, priorlong)
m1_p1 <- ppdf %>%
  ggplot() +
  geom_density(aes(x = value, color = type), linewidth = 1) +
  facet_wrap("parname", scales = "free") +
  theme_minimal()
plot(m1_p1)
The plots show that the priors don't overly influence the results; the data dominates the posteriors (they move and get more peaked, an indication that the data controls the posterior shape).
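As a rough numerical complement to the visual check, one could also compare the spread of the prior and posterior samples, for example along these lines. This is just a quick sketch that reuses the ppdf data frame built above; a shrinkage value close to 1 means the posterior is much narrower than the prior.

# rough check of how much the posteriors tightened relative to the priors
# (sketch only; reuses the ppdf data frame created above)
ppdf %>%
  dplyr::group_by(parname, type) %>%
  dplyr::summarize(sd = sd(value), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = type, values_from = sd) %>%
  dplyr::mutate(shrinkage = 1 - posterior / prior)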
Observed versus predicted
Another useful plot is to look at observed versus predicted results. This is shown in the following plot.
ypred_df <- samp_m1 %>% select(starts_with("ypred"))
m1_p2 <- bayesplot::ppc_dens_overlay(fitdat$outcome, as.matrix(ypred_df))
plot(m1_p2)
# just to get a picture that can be shown together with the post
#ggsave("featured.png",m1_p2)
The data (black line, \(y\) variable) and the model (thin green line, \(y_{rep}\)) are a bit off. That indicates that the model didn’t fully capture the patterns found in the data and might need modification.
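If one wanted to narrow down where the mismatch comes from, bayesplot offers further posterior predictive checks. For example, one could compare summary statistics of the data with those of the replicated data, something like the following sketch (not run here; it reuses the ypred_df object created above).

# sketch: compare summary statistics of the data and the model-replicated data
# (not run here; reuses the ypred_df object created above)
bayesplot::ppc_stat(fitdat$outcome, as.matrix(ypred_df), stat = "median")
bayesplot::ppc_stat(fitdat$outcome, as.matrix(ypred_df), stat = "sd")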
Cross-validation tests
Here's again some further exploration via cross-validation with the loo package.
# uses loo package
loo_m1 <- res_m1$loo(cores = fs_m1$chains, save_psis = TRUE)
print(loo_m1)
Computed from 10000 by 264 log-likelihood matrix
Estimate SE
elpd_loo -862.2 11.2
p_loo 29.7 3.4
looic 1724.3 22.3
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 249 94.3% 812
(0.5, 0.7] (ok) 13 4.9% 303
(0.7, 1] (bad) 2 0.8% 161
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
plot(loo_m1)
Some values aren’t too great. This again suggests that we need to tweak the model or run it longer with more stringent settings.
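To see which specific observations drive the problematic Pareto k values, the loo package can return their indices. A minimal sketch:

# sketch: identify observations with high Pareto k values
bad_ids <- loo::pareto_k_ids(loo_m1, threshold = 0.7)
print(bad_ids)
loo::pareto_k_values(loo_m1)[bad_ids]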
Here’s some more LOO diagnostics.
ypred_df <- samp_m1 %>% select(starts_with("ypred"))
m1_p3 <- bayesplot::ppc_loo_pit_overlay(
  y = fitdat$outcome,
  yrep = as.matrix(ypred_df),
  lw = weights(loo_m1$psis_object)
)
plot(m1_p3)
The marginal posterior predictive plot suggests some improvement might be possible (so that the solid line is more on top of the green lines). See here for more.
Model predictions
Finally, we again want to look at the actual data and the model predictions. It’s exactly the same code as for the 2-parameter model.
# averages and CI for
# estimates of deterministic model trajectory
# for each observation
# this is computed in the transformed parameters block of the Stan code
mu <- samp_m1 |>
  select(starts_with("virus_pred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t()
rownames(mu) <- NULL
# estimate and CI for prediction intervals
# the predictions factor in additional uncertainty around the mean (mu)
# as indicated by sigma
# this is computed in the generated quantities block of the Stan code
# the average of mu and preds should be more or less the same
# but preds will have wider uncertainty due to the residual variation sigma
preds <- samp_m1 |>
  select(starts_with("ypred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t()
rownames(preds) <- NULL
# change dose so it looks nicer in plot
dose = as.factor(fitdat$dose_adj)
levels(dose)[1] <- "low"
levels(dose)[2] <- "medium"
levels(dose)[3] <- "high"
#place everything into a data frame
fitpred = data.frame(id = as.factor(fitdat$id),
                     dose = dose,
                     time = fitdat$time,
                     Outcome = fitdat$outcome,
                     Estimate = mu[,2],
                     Qmulo = mu[,1], Qmuhi = mu[,3],
                     Qsimlo = preds[,1], Qsimhi = preds[,3]
                     )
#make the plot
#not including the prediction intervals since it looks too messy
predplot <- ggplot(data = fitpred, aes(x = time, y = Estimate, group = id, color = dose)) +
  geom_line() +
  geom_ribbon(aes(x = time, ymin = Qmulo, ymax = Qmuhi, fill = dose, color = NULL), alpha = 0.3, show.legend = F) +
  #geom_ribbon(aes(x = time, ymin = Qsimlo, ymax = Qsimhi, fill = dose, color = NULL), alpha = 0.1, show.legend = F) +
  geom_point(aes(x = time, y = Outcome, group = id, color = dose), shape = 1, size = 2, stroke = 2) +
  scale_y_continuous(limits = c(-30,50)) +
  labs(y = "Virus load",
       x = "days post infection") +
  theme_minimal()
plot(predplot)
The model fits don't look that great. I'm not sure why the model isn't capturing the data better: whether it needs more iterations and/or a better choice of priors, or whether there's something inherent about the structure of this model that keeps it from capturing data generated by the 2-parameter model, even though it is more flexible with its 4 parameters. I suspect it's the last one. It seems that even though this new model has more parameters, and thus should be more flexible, it can't quite capture the more gentle increase and decrease of the data generated by the simpler model.
In a research/real-world setting, I would explore the model, priors, and algorithm settings to make sure that 1) I don't have any remaining mistakes in my code 🙄, 2) the problem is not due to a poor choice of priors or starting values, and 3) the hierarchical structure is not constraining the model too much. If those are ruled out, I would change the model to something that is better able to capture the patterns seen in the data.
In fact, I did a bit of that and figured I’ll show it too.
Alternative model
Since the previous model isn't fitting too well, I wanted to understand a bit more why that might be. I decided to implement a simpler model. It's the same 4-parameter equation for the virus load as above, but with a simpler parameter structure. I only use individual-level parameters, which are independent of each other, and there's no dependence on dose. The complete independence of the parameters is expected to make the model too flexible and prone to overfitting. The dose dependence is the scientific question, so removing it makes the model scientifically pointless. But I wanted to see if a model that basically allows each individual to have their own completely independent set of main model parameters could get me something that matches the data more closely. In some sense, this is the most flexible model I can make. If that doesn't work, it suggests the underlying model for the virus load is not able to capture the pattern seen here.
This is the updated model. I would of course not need both the \(\alpha, \beta, ...\) and \(a_0, b_0,...\) parameters since they are the same here and just duplicates. I decided to keep it anyway so it’s easier to compare to the above model - and I had to make fewer changes to the code 😁.
\[ \begin{aligned} \textrm{Outcome} \\ Y_{i,t} \sim \mathrm{Normal}\left(\mu_{i,t}, \sigma\right) \\ \\ \textrm{main model describing the virus trajectory} \\ \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \\ \textrm{Deterministic models for main parameters} \\ \alpha_{i} = a_{0,i} \\ \beta_{i} = b_{0,i} \\ \gamma_{i} = g_{0,i} \\ \eta_{i} = e_{0,i} \\ \\ \textrm{Priors} \\ a_{0,i} \sim \mathrm{Normal}(X, X) \\ b_{0,i} \sim \mathrm{Normal}(X, X) \\ g_{0,i} \sim \mathrm{Normal}(X, X) \\ e_{0,i} \sim \mathrm{Normal}(X, X) \\ \sigma \sim \mathrm{HalfCauchy}(X) \\ \end{aligned} \]
Alternative model implementation
Here’s the code to set up and run this model. We of course have a new Stan model and need to pick new priors and new initial conditions.
############################
# Try an alternative model
############################
filepath = fs::path("D:","Dropbox","datafiles","longitudinalbayes")
filename_simple = "cmdstanr4par-simple.Rds"
stanfile <- here('posts','2024-02-16-longitudinal-multilevel-bayes-6',"stancode-4par-simple.stan")
rngseed = 12345
# need to update priors
priorvals2 = list(a0_mu = 1, a0_sd = 1,
                  b0_mu = 1, b0_sd = 1,
                  g0_mu = 2.5, g0_sd = 1,
                  e0_mu = 0.3, e0_sd = 1
                  )
fitdat2 = c(fitdatbase, priorvals2)
# need different initial values
set.seed(rngseed) #make inits reproducible
init_vals_1chain <- function() (list(a0_mu = runif(Nind,1,1),
                                     b0_mu = runif(Nind,0.8,1.2),
                                     g0_mu = runif(Nind,2,3),
                                     e0_mu = runif(Nind,0,0.5),
                                     a0_sd = runif(Nind,1,1),
                                     b0_sd = runif(Nind,1,1),
                                     g0_sd = runif(Nind,1,1),
                                     e0_sd = runif(Nind,1,1),
                                     sigma = runif(1,0,1))
                                )
inits = NULL
for (n in 1:fs_m1$chains)
{
  inits[[n]] = init_vals_1chain()
}
fs_m1$init = inits
Stan Code
This is the updated Stan code. Here is the file.
//
// Stan code for fitting a hierarchical model to time-series data
//
data{
int<lower = 1> Ntot; //number of observations
int<lower = 1> Nind; //number of individuals
vector[Ntot] outcome; //virus load
vector[Ntot] time; // times at which virus load is measured
vector[Nind] dose_adj; //dose after adjustment, 1 value per individual
array[Ntot] int id; //vector of person IDs to keep track which data points belong to whom
//everything below are variables that contain values for prior distributions
real a0_mu;
real b0_mu;
real g0_mu;
real e0_mu;
real a0_sd;
real b0_sd;
real g0_sd;
real e0_sd;
}
parameters{
// population variance
real<lower=0> sigma;
// individual-level parameters
vector[Nind] a0;
vector[Nind] b0;
vector[Nind] g0;
vector[Nind] e0;
}
// Generated/intermediate parameters
transformed parameters{
// predicted virus load from model
vector[Ntot] virus_pred;
// main model parameters
// each individual has their potentially own value
// I'm removing the last letter from each variable name
// just to avoid potential conflicts with built-in names of Stan
// such as beta for a beta distribution
// it might not matter, but I wanted to be safe
vector[Nind] alph;
vector[Nind] bet;
vector[Nind] gamm;
vector[Nind] et;
// intermediate variables for denominators
// just to make equation easier to understand
real d1;
real d2;
// compute main model parameters
for ( i in 1:Nind ) {
alph[i] = exp( 30*a0[i] );
bet[i] = exp(b0[i]);
gamm[i] = exp(g0[i]);
et[i] = exp(e0[i]);
}
// loop over all observations
// since parameters are saved in vectors of length corresponding to number of individuals
// we need to index with that extra id[i] notation
for (i in 1:Ntot)
{
d1 = exp( -bet[id[i]] * (time[i] - gamm[id[i]]) );
d2 = exp( et[id[i]] * (time[i] - gamm[id[i]]) );
virus_pred[i] = log( 2*alph[id[i]] / ( d1 + d2) );
}
} // end transformed parameters block
model{
// individual variation of each ODE model parameter
a0 ~ normal( a0_mu , a0_sd );
b0 ~ normal( b0_mu , b0_sd );
g0 ~ normal( g0_mu , g0_sd );
e0 ~ normal( e0_mu , e0_sd );
// residual population variation
sigma ~ cauchy(0, 1);
// distribution of outcome (virus load)
// all computations to get the time-series trajectory for the outcome are done
// inside the transformed parameters block
outcome ~ normal( virus_pred , sigma );
}
// for model diagnostics and exploration
generated quantities {
// define quantities that are computed in this block
vector[Ntot] ypred;
vector[Ntot] log_lik;
//real<lower=0> sigma_prior;
real sigma_prior;
real a0_prior;
real b0_prior;
real g0_prior;
real e0_prior;
// this is so one can plot priors and compare with posterior later
// simulate the priors
sigma_prior = abs(cauchy_rng(0, 1));
a0_prior = normal_rng( a0_mu , a0_sd);
b0_prior = normal_rng( b0_mu , b0_sd);
g0_prior = normal_rng( g0_mu , g0_sd);
e0_prior = normal_rng( e0_mu , e0_sd);
// compute log-likelihood and predictions
for(i in 1:Ntot)
{
log_lik[i] = normal_lpdf(outcome[i] | virus_pred[i], sigma);
ypred[i] = normal_rng(virus_pred[i], sigma);
}
} //end generated quantities block
Compiling the Stan model.
# make Stan model.
stanmod2 <- cmdstanr::cmdstan_model(stanfile,
                                    pedantic = TRUE,
                                    force_recompile = TRUE)
Model run
This runs the model and saves results.
# most of it the same as previously
res_m2 <- stanmod2$sample(data = fitdat2,
                          chains = fs_m1$chains,
                          init = fs_m1$init,
                          seed = rngseed,
                          parallel_chains = fs_m1$chains,
                          iter_warmup = fs_m1$warmup,
                          iter_sampling = fs_m1$sampling,
                          save_warmup = fs_m1$save_warmup,
                          max_treedepth = fs_m1$max_td,
                          adapt_delta = fs_m1$adapt_delta,
                          output_dir = filepath
                          )
res_m2$save_object(fs::path(filepath,filename_simple))
Result exploration
Load results.
res_m2 <- readRDS(fs::path(filepath,filename_simple))
Get the samples.
#this uses the posterior package to get draws
samp_m2 <- res_m2$draws(inc_warmup = FALSE, format = "draws_df")
allsamp_m2 <- res_m2$draws(inc_warmup = TRUE, format = "draws_df")
Diagnostics. The non-fatal warnings when reading the CSV files are still there, but otherwise the sampler diagnostics look fine for this model.
res_m2$cmdstan_diagnose()
Processing csv files: D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-1-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-2-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-3-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-4-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-5-5ba055.csvWarning: non-fatal error reading adaptation data
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
A few plots. They look fairly reasonable.
# only a few parameters
= c("a0[1]","b0[1]","g0[1]","e0[1]","sigma")
plotpars ::color_scheme_set("viridis")
bayesplot<- bayesplot::mcmc_trace(samp_m2, pars = plotpars)
bp1 <- bayesplot::mcmc_dens_overlay(samp_m2, pars = plotpars)
bp3 plot(bp1)
plot(bp3)
Comparing priors and posteriors.
# data manipulation to get in shape for plotting
# start with manipulation of posterior parameters
postdf1 <- samp_m2 %>%
  select(!ends_with('prior')) %>%
  select(contains("sigma"))
# awkward way of getting some further parameters
# namely values from first individual for a0,b0,g0,e0
postdf2 <- samp_m2 %>%
  select(contains("0[1]")) %>%
  rename_with(~ gsub("[1]", "", .x, fixed = TRUE) )
postdf <- cbind(postdf1, postdf2)
postlong <- tidyr::pivot_longer(data = postdf,
                                cols = everything(),
                                names_to = "parname",
                                values_to = "value") %>%
  dplyr::mutate(type = "posterior")
# manipulation of prior parameters
priordf <- samp_m2 %>%
  select(ends_with('prior')) %>%
  rename_with(~ gsub("_prior", "", .x, fixed = TRUE) )
priorlong <- tidyr::pivot_longer(data = priordf,
                                 cols = everything(),
                                 names_to = "parname",
                                 values_to = "value") %>%
  dplyr::mutate(type = "prior")
ppdf <- dplyr::bind_rows(postlong, priorlong)
m2_p1 <- ppdf %>%
  ggplot() +
  geom_density(aes(x = value, color = type), linewidth = 1) +
  facet_wrap("parname", scales = "free") +
  theme_minimal()
plot(m2_p1)
Skipping over all the other diagnostic plots we did above.
Outcome predictions
Going straight to plotting of data and model predictions.
mu2 <- samp_m2 |>
  select(starts_with("virus_pred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t()
rownames(mu2) <- NULL
preds2 <- samp_m2 |>
  select(starts_with("ypred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t()
rownames(preds2) <- NULL
# change dose so it looks nicer in plot
dose = as.factor(fitdat2$dose_adj)
levels(dose)[1] <- "low"
levels(dose)[2] <- "medium"
levels(dose)[3] <- "high"
fitpred2 = data.frame(id = as.factor(fitdat2$id),
                      dose = dose,
                      time = fitdat2$time,
                      Outcome = fitdat2$outcome,
                      Estimate = mu2[,2],
                      Qmulo = mu2[,1], Qmuhi = mu2[,3],
                      Qsimlo = preds2[,1], Qsimhi = preds2[,3]
                      )
#make the plot
#not including the CI or prediction intervals since it looks too messy
predplot2 <- ggplot(data = fitpred2, aes(x = time, y = Estimate, group = id, color = dose)) +
  geom_line() +
  #geom_ribbon(aes(x = time, ymin = Qmulo, ymax = Qmuhi, fill = dose, color = NULL), alpha = 0.3, show.legend = F) +
  #geom_ribbon(aes(x = time, ymin = Qsimlo, ymax = Qsimhi, fill = dose, color = NULL), alpha = 0.1, show.legend = F) +
  geom_point(aes(x = time, y = Outcome, group = id, color = dose), shape = 1, size = 2, stroke = 2) +
  scale_y_continuous(limits = c(-30,50)) +
  labs(y = "Virus load",
       x = "days post infection") +
  theme_minimal()
plot(predplot2)
This doesn’t really look any better compared to the other model. So I – preliminarily – conclude that the main 4-parameter deterministic model I use here to describe the virus load trajectories just can’t capture the shape of the data generated by the simpler 2-parameter model.
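If one wanted a more formal comparison than eyeballing the trajectories, one could also compute LOO for this model and compare it to the first one, along these lines (a sketch, not run here):

# sketch: formal model comparison via LOO (not run here)
# loo_m2 <- res_m2$loo(cores = fs_m1$chains)
# loo::loo_compare(list(full = loo_m1, simple = loo_m2))

loo_compare reports the difference in expected log predictive density together with a standard error, which would give a more quantitative sense of whether either model predicts better.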
Summary and continuation
This completes the 4-parameter model. I would not use this model if I really wanted to fit this data. But my goal was just to get a working model as a stepping stone; I'm not really interested in the model results, only in setting the stage for the next version, which is the 4-parameter ODE model. So it's finally time to tackle that one.