The data files and R code used in this course can be downloaded from this link (INLA_course.zip).
1 A simple linear regression model
For a dataset \(\left\{(x_i,y_i) : i = 1,\ldots,n\right\}\), we consider the linear regression model \(y_{i}\sim N\left(\mu_{i},\sigma\right)\), where
\[\mu_i = \beta_0 + \beta_1 x_i\] We estimate this model in the Bayesian framework, assuming that \(\beta_{0}\sim N\left(0,10^{4}\right)\) and \(\beta_{1}\sim N\left(3,1\right)\).
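In INLA, Gaussian priors on the fixed effects are supplied through `control.fixed` as prior means and precisions (precision = 1/variance). A minimal sketch of the call, assuming a data frame `dt` with columns `x` and `y`:

```r
library(INLA)

# beta_0 ~ N(0, 10^4)  ->  prec.intercept = 1 / 10^4
# beta_1 ~ N(3, 1)     ->  mean = 3, prec = 1
mb <- inla(y ~ 1 + x, family = "gaussian", data = dt,
           control.fixed = list(mean = 3, prec = 1,
                                mean.intercept = 0, prec.intercept = 1e-4))
mb$summary.fixed[, 1:5]
```

Note that `inla` expects precisions, not variances or standard deviations, so a "vague" prior corresponds to a small precision.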
First, we simulate a dataset \(\left\{ \left(x_{i},y_{i}\right)~:~i=1,...,n=50\right\}\) obeying the linear regression model: \[y_{i}\sim N\left(5+x_{i}~;~\sigma=1\right)\] Note \(\beta_0=5\) and \(\beta_1=1\).
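The simulation takes a few lines of base R (the seed is the one used throughout these notes):

```r
set.seed(19941998)
b <- c(5, 1)   # true intercept and slope
sg <- 1        # residual standard deviation
n <- 50
x <- seq(0, 10, length = n)
mu <- b[1] + b[2] * x
y <- mu + rnorm(n, 0, sg)
dt <- data.frame(x, y)
```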
If we impose on the Bayesian estimation a very confident (and wrong) prior that the slope is close to 1.5 (precision \(10^4\)), the posterior is dominated by that prior:
                mean        sd 0.025quant 0.5quant 0.975quant
(Intercept) 2.528839 0.2358034   2.062863 2.529498   2.991121
x           1.492742 0.0100154   1.473102 1.492742   1.512385
Next we plot this Bayesian fit together with the true regression line and the frequentist fit. The misguided prior has pulled the estimated slope strongly toward its mean.
Let’s first simulate some contaminated data for a linear regression. We have the same data as before plus a new set of observations that departs markedly from the rest:
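A sketch of the contamination step: the clean `x`, `y` are the data simulated earlier, and five extra points are placed well away from the true line:

```r
set.seed(19941998)
x <- seq(0, 10, length = 50)
y <- 5 + x + rnorm(50, 0, 1)
# five contaminating observations that depart from the regression line
xz <- seq(12, 13, length = 5)
yz <- 0.5 * xz + 1 + rnorm(5, 0, 1)
dc <- data.frame(xc = c(x, xz), yc = c(y, yz))
```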
2 Multivariate linear regression
We now simulate a linear regression model with two independent variables \(x_1\) and \(x_2\). Specifically, the simulated model is \(y_{i}\sim N\left(\mu_{i},\sigma\right)\) with \(\sigma=1\) and:
\[\mu_i = 5 + 3x_{1i} - 2x_{2i}\]
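The data can be simulated and fitted by maximum likelihood as follows (a sketch; the true coefficients are \(\beta = (5, 3, -2)\) with \(\sigma = 1\)):

```r
set.seed(19941998)
b <- c(5, 3, -2)   # true coefficients
sg <- 1
n <- 100
x1 <- rnorm(n, 9, 1)
x2 <- rnorm(n, 5, 1)
mu <- b[1] + b[2] * x1 + b[3] * x2
y <- mu + rnorm(n, 0, sg)
dt <- data.frame(x1, x2, y)
ml <- lm(y ~ x1 + x2)   # frequentist (maximum likelihood) fit
coef(ml)
```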
3 Logistic regression: association between obesity and low level of education (study of Telde)
Subjects with a low level of education are older than the rest:
	Welch Two Sample t-test
data: Age by Low.Educational.Level
t = -15.207, df = 668.03, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-12.485504 -9.629977
sample estimates:
mean in group 0 mean in group 1
44.05008 55.10782
Obesity and low level of education are associated variables:
tIO <- table(dt$Low.Educational.Level, 2 - dt$Obesity)
chisq.test(tIO)
Pearson's Chi-squared test with Yates' continuity correction
data: tIO
X-squared = 38.909, df = 1, p-value = 4.44e-10
summary(epi2x2(tIO))
Epidemiological 2x2 Table Analysis
Input Matrix:
1 2
0 167 492
1 165 206
Pearson Chi-Squared Statistic (Includes Yates' Continuity Correction): NA
Associated p.value for H0: There is no association between exposure and outcome vs. HA: There is an association : NA
p.value using Fisher's Exact Test (1 DF) : 0
Estimate of Odds Ratio: 0.424
95% Confidence Limits for true Odds Ratio are: [0.324, 0.555]
Estimate of Relative Risk (Cohort, Col1): 0.57
95% Confidence Limits for true Relative Risk are: [0.479, 0.678]
Estimate of Risk Difference (p1 - p2) in Cohort Studies: -0.191
95% Confidence Limits for Risk Difference: [-0.254, -0.129]
Estimate of Risk Difference (p1 - p2) in Case Control Studies: -0.202
95% Confidence Limits for Risk Difference: [-0.264, -0.14]
Note: Above Confidence Intervals employ a continuity correction.
Now we fit (by maximum likelihood) a logistic regression model for the relationship between Obesity and low educational level adjusting by age:
ml <- glm(Obesity ~ Age + Low.Educational.Level,
          family = binomial(link = logit), data = dt)
summary(ml)
Call:
glm(formula = Obesity ~ Age + Low.Educational.Level, family = binomial(link = logit),
data = dt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3261 -0.8626 -0.7157 1.2361 1.8066
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.10003 0.29693 -7.073 1.52e-12 ***
Age 0.02286 0.00627 3.646 0.000266 ***
Low.Educational.Level 0.61433 0.15300 4.015 5.94e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1294.9 on 1029 degrees of freedom
Residual deviance: 1242.5 on 1027 degrees of freedom
AIC: 1248.5
Number of Fisher Scoring iterations: 4
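The adjusted odds ratio for low educational level follows from exponentiating the coefficient and its Wald limits; a quick check from the printed estimate (0.61433) and standard error (0.15300):

```r
# adjusted OR for Low.Educational.Level and its 95% Wald confidence interval
or <- exp(0.61433)
ci <- exp(0.61433 + c(-1, 1) * 1.96 * 0.15300)
round(c(OR = or, lower = ci[1], upper = ci[2]), 2)
```

Equivalently, `exp(coef(ml))` and `exp(confint.default(ml))` on the fitted model give the same quantities.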
The Bayesian estimation is:
formula <- Obesity ~ 1 + Age + Low.Educational.Level
m.log <- inla(formula, family = "binomial", data = dt,
              control.fixed = list(mean = list(Age = 0, Low.Educational.Level = 1),
                                   prec = list(Age = 0.0001, Low.Educational.Level = 100),
                                   mean.intercept = 0, prec.intercept = 0.000001),
              control.predictor = list(compute = TRUE),
              control.compute = list(dic = TRUE, cpo = TRUE))
smry <- summary(m.log)
smry$fixed <- smry$fixed[, 1:5]
smry
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant
(Intercept) -1.986 0.293 -2.567 -1.984 -1.415
Age 0.018 0.006 0.007 0.018 0.030
Low.Educational.Level 0.885 0.084 0.721 0.885 1.050
Deviance Information Criterion (DIC) ...............: 1250.24
Deviance Information Criterion (DIC, saturated) ....: -7082.37
Effective number of parameters .....................: 2.31
4 Tobacco-attributable cancer mortality in the Canary Islands
Each row in the data file corresponds to the number of deaths due to different types of cancer in the Canary Islands according to year (from 1980 to 2002), sex and age group (4 age groups are considered). In particular, we will analyse the number of deaths from lung cancer attributable to smoking (variable D_05) among subjects in age group 2 (between 45 and 55 years). To do so, we will fit a Poisson model, considering as offset the number of deceased subjects in the population for each year, sex and age group. Frequentist estimation using maximum likelihood produces the following result:
Call:
glm(formula = D_05 ~ offset(log(N)) + Sex.Male * Year, family = poisson,
data = dtabg2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.41941 -0.78271 -0.03452 0.67019 2.03603
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -158.23354 29.53446 -5.358 8.43e-08 ***
Sex.Male 126.89682 31.19693 4.068 4.75e-05 ***
Year 0.07453 0.01481 5.034 4.80e-07 ***
Sex.Male:Year -0.06267 0.01564 -4.007 6.16e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 720.508 on 45 degrees of freedom
Residual deviance: 45.066 on 42 degrees of freedom
AIC: 254.11
Number of Fisher Scoring iterations: 4
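The coefficients are on the log-rate scale, so exponentiating gives multiplicative annual changes in the mortality rate; a quick computation from the printed estimates:

```r
# implied annual multiplicative change in the lung-cancer mortality rate
exp(0.07453)             # women (Sex.Male = 0): about +7.7% per year
exp(0.07453 - 0.06267)   # men: Year effect plus interaction, about +1.2% per year
```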
The following figure shows the fit of the model:
b <- mtab$coeff
pd <- as.numeric(predict(mtab))
dtabg2 <- dtabg2 %>% mutate(Adj.Rate = 100000 * exp(pd) / N)
ggplot(dtabg2, aes(Year, Rate)) +
  geom_point(colour = "blue") +
  geom_line(aes(Year, Adj.Rate, colour = Sex)) +
  labs(x = "Year", y = "Deaths per 100,000 person-years",
       title = "Age group 45 to 55 years old")
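The Bayesian version of this Poisson model keeps the offset and places vague Gaussian priors on the fixed effects; a sketch of the `inla` call, following the same pattern as the logistic model above:

```r
formula <- D_05 ~ offset(log(N)) + Sex.Male * Year
m.pois <- inla(formula, family = "poisson", data = dtabg2,
               control.fixed = list(mean = 0, prec = 0.000001,
                                    mean.intercept = 0, prec.intercept = 0.000001),
               control.predictor = list(compute = TRUE),
               control.compute = list(dic = TRUE, cpo = TRUE))
summary(m.pois)
```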
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant
(Intercept) -157.984 29.501 -216.986 -157.602 -101.208
Sex.Male 126.698 31.163 66.531 126.351 188.834
Year 0.074 0.015 0.046 0.074 0.104
Sex.Male:Year -0.063 0.016 -0.094 -0.062 -0.032
Deviance Information Criterion (DIC) ...............: 254.19
Deviance Information Criterion (DIC, saturated) ....: -558.17
Effective number of parameters .....................: 4.02
The following figure shows the Bayesian credible intervals for the predictions of the model:
fitBayes <- m.pois$summary.fitted.values
dtabg2 <- dtabg2 %>%
  mutate(`Adj.Rate (Bayes)` = 100000 * fitBayes$mean / N,
         adj.median = 100000 * fitBayes[["0.5quant"]] / N,
         adj.q2.5 = 100000 * fitBayes[["0.025quant"]] / N,
         adj.q97.5 = 100000 * fitBayes[["0.975quant"]] / N)
ggplot(dtabg2, aes(Year, Rate)) +
  geom_point(colour = "blue") +
  geom_line(aes(Year, `Adj.Rate (Bayes)`, colour = Sex)) +
  geom_line(aes(Year, adj.median, colour = Sex)) +
  geom_line(aes(Year, adj.q2.5, colour = Sex), linetype = 2) +
  geom_line(aes(Year, adj.q97.5, colour = Sex), linetype = 2) +
  labs(x = "Year", y = "Deaths per 100,000 person-years",
       title = "Bayesian Estimation.\nAge group 45 to 55 years old")
Alternatively we could have modelled these data using a negative binomial model, but the fit does not improve substantially (it is even slightly worse). The estimated parameter values are almost unchanged:
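The negative binomial fit reuses the same formula with `family = "nbinomial"` (a sketch; priors as in the Poisson model):

```r
m.nbin <- inla(D_05 ~ offset(log(N)) + Sex.Male * Year,
               family = "nbinomial", data = dtabg2,
               control.fixed = list(mean = 0, prec = 0.000001,
                                    mean.intercept = 0, prec.intercept = 0.000001),
               control.predictor = list(compute = TRUE),
               control.compute = list(dic = TRUE, cpo = TRUE))
summary(m.nbin)
```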
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant
(Intercept) -157.760 30.233 -218.219 -157.360 -99.575
Sex.Male 125.057 32.491 62.150 124.752 189.690
Year 0.074 0.015 0.045 0.074 0.105
Sex.Male:Year -0.062 0.016 -0.094 -0.062 -0.030
Model hyperparameters:
                                                           mean        sd 0.025quant 0.5quant 0.975quant mode
size for the nbinomial observations (1/overdispersion) 22506.29 334503.83      30.52   160.12   47472.18   NA
Deviance Information Criterion (DIC) ...............: 254.47
Deviance Information Criterion (DIC, saturated) ....: -557.89
Effective number of parameters .....................: 4.58
5 Mixed-models.
We consider a sample of n subjects from a given population. In the i-th subject and for a set of covariates \(x_{i,1},...,x_{i,n_i}\) we observe the corresponding random variables \(y_{i,1},...,y_{i,n_i}\) , which obey the mixed model:
\[y_{i,j}=\beta_0+\beta_1 x_{i,j}+a_i+e_{i,j}\] Here, \(a_1,\ldots,a_n\) are iid \(N(0,\sigma_a)\) random variables denoting the random effects of individuals, and for each \(i=1,\ldots,n\), \(e_{i,1},\ldots,e_{i,n_i}\) are iid \(N(0,\sigma_e)\) random variables corresponding to within-subject variability.
Mixed-models 1.
We begin by simulating this model:
set.seed(19941998)
n <- 15   ### Number of subjects
m <- 10   ### Number of observations per subject
b <- c(5, 3)
a <- rnorm(n, 0, 1)
z <- seq(0, 0.5, length = m)
dt <- data.frame(id = gl(n, m), x = rep(z, n)) %>%
  mutate(y = b[1] + b[2] * x + a[as.integer(id)] + rnorm(n * m, 0, 0.2))
ggplot(dt, aes(x, y, color = id)) +
  geom_point() +
  geom_line(aes(group = id)) +
  ylim(c(2, 10))
The frequentist estimation is:
lmm <- lmer(y ~ x + (1 | id), data = dt, REML = FALSE)
summary(lmm)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ x + (1 | id)
Data: dt
AIC BIC logLik deviance df.resid
39.0 51.0 -15.5 31.0 146
Scaled residuals:
Min 1Q Median 3Q Max
-2.3126 -0.5656 -0.1156 0.5606 2.5212
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.94190 0.9705
Residual 0.04187 0.2046
Number of obs: 150, groups: id, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 4.6602 0.2525 18.46
x 2.9008 0.1047 27.71
Correlation of Fixed Effects:
(Intr)
x -0.104
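Most of the total variance is between subjects; the intraclass correlation can be read off the estimated variance components:

```r
# intraclass correlation implied by the fitted variance components
v_id  <- 0.94190   # between-subject variance (from the lmer output above)
v_res <- 0.04187   # residual variance
v_id / (v_id + v_res)   # roughly 0.96
```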
And the Bayesian estimation:
formula <- y ~ 1 + x + f(id, model = "iid")
bmm <- inla(formula, family = "gaussian", data = dt,
            control.fixed = list(mean = 0, prec = 0.000001,
                                 mean.intercept = 0, prec.intercept = 0.000001))
smry <- summary(bmm)
smry$fixed <- smry$fixed[, 1:5]
smry
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant
(Intercept) 4.660 0.261 4.143 4.660 5.177
x 2.901 0.105 2.694 2.901 3.107
Model hyperparameters:
                                         mean    sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 24.05 2.918     18.726    23.90      30.21   NA
Precision for id                         1.13 0.401      0.509     1.07       2.07   NA
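INLA reports these hyperparameters as precisions; converting the posterior means back to the standard-deviation scale gives a rough plug-in check against the `lmer` fit (a proper transform would use the full posterior):

```r
# precisions -> standard deviations (posterior means, plug-in)
1 / sqrt(24.05)   # residual sd, comparable to lmer's 0.2046
1 / sqrt(1.13)    # between-subject sd, comparable to lmer's 0.9705
```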
Posterior density
### Posterior density
dpost.intercept <- data.frame(bmm$marginals.fixed$`(Intercept)`)
dpost.x <- data.frame(bmm$marginals.fixed$x)
ggplot(dpost.x, aes(x, y)) +
  geom_line(colour = "#33CC99", linewidth = 1.5) +
  geom_vline(xintercept = 3, colour = "#3300FF")
Mixed-models 2.
Now we simulate another dataset in which each individual has its own slope, different from the general slope for the whole population.
n <- 15   ### Number of subjects
m <- 10   ### Number of observations per subject
a <- rnorm(n, 2.5, 0.3)
b <- 3 - a
z <- rep(seq(0, 0.5, length = m), n)
dt <- data.frame(id = gl(n, m)) %>%
  mutate(x = z + a[as.integer(id)],
         y = b[as.integer(id)] + 1.5 * z + rnorm(n * m, 0, 0.2))
ggplot(dt, aes(x, y, color = id)) +
  geom_point() +
  geom_line(aes(group = id))
Frequentist Estimation:
Simple linear model (bad idea)
ml <- lm(y ~ x, data = dt)
summary(ml)
Call:
lm(formula = y ~ x, data = dt)
Residuals:
Min 1Q Median 3Q Max
-1.14726 -0.30964 0.01695 0.31673 0.78883
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.29653 0.23184 9.905 < 2e-16 ***
x -0.49756 0.08749 -5.687 6.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3927 on 148 degrees of freedom
Multiple R-squared: 0.1793, Adjusted R-squared: 0.1738
F-statistic: 32.34 on 1 and 148 DF, p-value: 6.678e-08
Linear mixed model
lmm <- lmer(y ~ x + (1 | id), data = dt, REML = FALSE)
summary(lmm)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ x + (1 | id)
Data: dt
AIC BIC logLik deviance df.resid
5.3 17.3 1.4 -2.7 146
Scaled residuals:
Min 1Q Median 3Q Max
-2.05300 -0.60358 -0.04049 0.53140 2.92777
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.59183 0.7693
Residual 0.03434 0.1853
Number of obs: 150, groups: id, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.63440 0.31643 -8.325
x 1.38120 0.09367 14.745
Correlation of Fixed Effects:
(Intr)
x -0.777
Bayesian Estimation
### Bayesian estimation
formula <- y ~ 1 + x + f(id, model = "iid")
bmm <- inla(formula, family = "gaussian", data = dt,
            control.fixed = list(mean = 0, prec = 0.001,
                                 mean.intercept = 0, prec.intercept = 0.001))
smry <- summary(bmm)
smry$fixed <- smry$fixed[, 1:5]
smry
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant
(Intercept) -2.623 0.327 -3.273 -2.619 -1.989
x 1.377 0.097 1.186 1.377 1.565
Model hyperparameters:
                                         mean   sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 29.32 3.56     22.807    29.16      36.83   NA
Precision for id                         1.80 0.67      0.801     1.70       3.40   NA
6 Nonlinear regression: autoregressive processes.
National Morbidity, Mortality, and Air Pollution Study. The National Morbidity, Mortality and Air Pollution Study (NMMAPS) is a large time series study designed to estimate the effect of air pollution on the health of individuals living in 108 US cities during the period 1987-2000. Data on the daily concentration of particulate matter with an aerodynamic diameter less than 10 \(\mu\)m (\(PM_{10}\)) and nitrogen dioxide (\(NO_2\)), both measured in \(\mu g/m^3\), as well as the daily temperature for Salt Lake City, are contained in the NMMAPSraw.csv file. We use this data set to study the relationship between \(PM_{10}\) and temperature as an illustration of a linear regression model.
We denote by \(PM_{T,M,d}\) the concentration of particles with aerodynamic diameter less than 10 \(\mu\)m according to temperature (\(T\)), month (\(M\)) and day (\(d\)). We assume that \(PM_{T,M,d} \sim N\left(\mu_{T,M,d},\sigma\right)\), where: \[\mu_{T,M,d}=\beta_{0}+\beta_{1}\cdot T+\gamma_{M}+z_{d}\] Here \(\gamma_M\) denotes the effect of month \(M\) (January is taken as the reference month, hence \(\gamma_{Jan}=0\)). We also assume that \(z_d\), \(d=1,\ldots,N\), is an autoregressive process of order 1, so that \(z_{d} \sim N\left(\phi\cdot z_{d-1},\sigma\right)\).
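This model is specified in INLA with a latent `ar1` effect indexed by day (`id`), with a log-gamma prior on its precision (the specification used in the code accompanying this section; `dataNMMAPS` is the data set read from the NMMAPS file):

```r
formula <- pm10 ~ 1 + temperature + month +
  f(id, model = "ar1",
    hyper = list(prec = list(prior = "loggamma", param = c(1, 0.01))))
ml <- inla(formula, family = "gaussian", data = dataNMMAPS,
           control.predictor = list(compute = TRUE))
round(ml$summary.fixed[, 1:5], 3)
```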
7 INLA on the web
Here is a list of some free resources on the web to continue learning Bayesian methods with INLA:
Home page of the R-INLA Project: https://www.r-inla.org/home
Bayesian inference with INLA, Virgilio Gómez-Rubio: https://becarioprecario.bitbucket.io/inla-gitbook/
Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny, Paula Moraga: https://www.paulamoraga.com/book-geospatial/
INLA examples & tutorials: https://www.r-inla.org/examples-tutorials
INLA for (generalized) linear models: https://www.flutterbys.com.au/stats/tut/tut12.10.html
Bayesian Regression Modeling with INLA, Xiaofeng Wang, Yu Ryan Yue, Julian Faraway: https://julianfaraway.github.io/brinlabook/ (examples: https://julianfaraway.github.io/brinla/examples/)
Some interesting papers on INLA:
Van Niekerk, J., Bakka, H., Rue, H., & Schenk, O. (2021). New Frontiers in Bayesian Modeling Using the INLA Package in R. Journal of Statistical Software, 100, 1-28.
Bivand, R.S., Gómez-Rubio, V., & Rue, H. (2015). Spatial Data Analysis with R-INLA with Some Extensions. Journal of Statistical Software, 63(20).
Rue, H., Riebler, A., Sørbye, S.H., Illian, J.B., & Simpson, D.P. (2017). Bayesian Computing with INLA: A Review. Annual Review of Statistics and Its Application, 4, 395-421.