Open access peer-reviewed chapter

Bayesian Multilevel Modeling in Dental Research

Written By

Edilberta Tino-Salgado, Flaviano Godínez-Jaimes, Cruz Vargas-De-León, Norma Samanta Romero-Castro, Salvador Reyes-Fernández and Victor Othon Serna-Radilla

Reviewed: 04 October 2022 Published: 26 November 2022

DOI: 10.5772/intechopen.108442

From the Edited Volume

Recent Advances in Medical Statistics

Edited by Cruz Vargas-De-León


Abstract

Clinical designs in dentistry collect measurements on the teeth of each subject, forming complex data structures; however, standard statistical methods (Student's t-test, ANOVA, and regression models) do not treat the data as grouped; that is, the measurements are treated as independent even though they are not. A disadvantage of not accounting for the dependence in multilevel data is that a substantial correlation between the observations is ignored by the researcher, who may consequently find statistically significant results when in fact there are none. Bayesian methods have the advantage of not assuming normality, unlike maximum likelihood estimation, and they are appropriate when samples are small. We show the minimum statistical theory for the use of multilevel models in dental research when the response variable is numerical. In this regard, a Bayesian multilevel analysis was carried out to determine the clinical factors associated with periodontal probing depth. We adapted the bottom-up strategy for specifying a multilevel model from the frequentist approach to the Bayesian approach. We checked the adequacy of the fit of the postulated model using the posterior predictive density.

Keywords

  • periodontal probing depth
  • dental research
  • nested data structures
  • Bayesian multilevel modeling
  • bottom-up methodology

1. Introduction

The most widely used statistical methods in dental research are the t-test, ANOVA (one, two, and three factors), non-parametric tests, and regression models [1]. These methods assume that the observations of the studied variables are independent. Nested data structures, however, are frequently found in dental research. An example is an experimental design in which multiple measurements are performed on the same individual. If, in addition to performing multiple measurements on an individual, we perform multiple measurements on each tooth, we obtain a nested data structure. This nesting of the data results in grouped data. Typically, for clinical and dental data, contextual variables are measured for each individual (i.e., socioeconomic level, educational level, etc.), and these characteristics can form another grouping of the data. Considering the detection of bacterial plaque in each tooth of individuals who live in a home with a high marginality index, two nested groupings are distinguished, namely, teeth nested in individuals and individuals nested in groups. The word "nested" can be understood as "within" or "contained in." It is to be expected that items from the same group are more similar to each other than items from different groups; that is, measurements from one individual are expected to be more similar to each other than to measurements from other individuals. This fact indicates that the assumption of independence does not hold for nested data. Multilevel models take into account the non-independence of the observations. One consequence of ignoring the dependence of observations is that the results of some tests may be statistically significant when, actually, they are not. Under the classical approach, the parameters of a multilevel model are estimated by maximum likelihood, which has optimal properties in many scenarios; however, problems such as non-compliance with model assumptions or lack of convergence of the iterative methods can occur. The Bayesian approach has some advantages over the classical approach.

The purpose of this chapter is to show the minimum statistical theory for the use of multilevel models in dental research when the response variable is numerical. For this, we will recall the definitions of multilevel models and multilevel generalized linear models (MGLM), in addition to the main Bayesian concepts and their application to MGLM. We will use an adaptation of the bottom-up strategy to specify a multilevel model; our adaptation uses the Bayesian leave-one-out cross-validation (LOO-CV) criterion between the different steps for the comparison of models. We will check the adequacy of the fit of the postulated model using the posterior predictive density. Finally, we will provide an example of this model applied to a numerical response variable, periodontal probing depth.


2. Multilevel models

Multilevel models partition the variance of the dependent variable across the different levels of data grouping. At least two types of variance are distinguished: the intra-group (within-group) variance $\sigma_w^2$, or individual-level variance (level one), and the between-group variance $\sigma_b^2$, which defines the variation at the group level (level two).

The dependency of the observations in the same group is measured with the intraclass correlation coefficient (ICC). Shrout and Fleiss in 1979 defined the ICC as the ratio of the between-group variance to the total variance (the sum of the between-group and within-group variances):

$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} \tag{1}$$

The ICC varies between 0 and 1, since the variances cannot be negative. Before using a multilevel model, it is necessary to determine whether the ICC is significant at each level of the data. To that end, using the null model (defined in the next section), we determine whether the variance of the residuals at each level is significant. If so, the ICC is also significant, which means that the observations within groups are dependent, and therefore it is necessary to use a multilevel model instead of an ordinary multiple regression model [2].
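As an illustration of Eq. (1), a two-level null model can be fitted and the ICC computed from its estimated variance components. The following sketch uses the brms package employed later in this chapter and assumes a hypothetical data frame dat with a numerical response y and a grouping factor group; it is a minimal example, not the analysis of Section 5.

```r
library(brms)

# Null (intercept-only) two-level model: y_ij = gamma_00 + u_0j + e_ij
fit0 <- brm(y ~ 1 + (1 | group), data = dat, family = gaussian(),
            chains = 4, iter = 4000, warmup = 3000)

# Posterior draws of the two standard deviations
draws   <- as_draws_df(fit0)
sigma_b <- draws$sd_group__Intercept   # between-group SD (level two)
sigma_w <- draws$sigma                 # within-group (residual) SD (level one)

# ICC of Eq. (1), summarized over the posterior
icc <- sigma_b^2 / (sigma_b^2 + sigma_w^2)
quantile(icc, probs = c(0.025, 0.5, 0.975))
```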

2.1 Two-level models

Let $y_{ij}$ be the dependent variable measured on the $i$-th individual in the $j$-th level-two unit (e.g., the $j$-th group), $i = 1, \ldots, N$, where $N$ is the total sample size, and $j = 1, \ldots, J$ for $J$ level-two units.

The simplest two-level model is the null model (intercept-only model, unconditional means model, or one-way random-effects analysis of the variance). The model is defined by two equations:

$$\begin{aligned} y_{ij} &= \beta_{0j} + e_{ij} \\ \beta_{0j} &= \gamma_{00} + u_{0j} \end{aligned} \tag{2}$$

$\beta_{0j}$ is the mean of $y$ in group $j$, which varies across groups; $e_{ij}$ is the individual variation around this mean; $\gamma_{00}$ is the overall intercept, that is, the grand mean of $y$; and $u_{0j}$ is the deviation of $\beta_{0j}$ from $\gamma_{00}$.

Substitution of $\beta_{0j}$ in $y_{ij}$ produces the single-equation model:

$$y_{ij} = \gamma_{00} + u_{0j} + e_{ij} \tag{3}$$

Eq. (3) is composed of a fixed part, $\gamma_{00}$, and a random part corresponding to two random effects, $u_{0j}$ and $e_{ij}$. Assuming that $e_{ij} \sim N(0, \sigma^2)$ and $u_{0j} \sim N(0, \sigma_{u0}^2)$ and that $e_{ij}$ and $u_{0j}$ are independent, the variance of $y_{ij}$ in Eq. (3) is

$$\mathrm{var}(y_{ij}) = \mathrm{var}(\gamma_{00} + u_{0j} + e_{ij}) = \sigma_{u0}^2 + \sigma^2 \tag{4}$$

where $\sigma_{u0}^2 = \sigma_b^2$ is the between-group variance, and $\sigma^2 = \sigma_w^2$ is the within-group variance used in Eq. (1) to calculate the ICC.

Now, let us consider a two-level model with two level-one independent variables: $x_1$ with a fixed effect $\alpha_1$, which does not vary between groups, and $x_2$ with a random effect $\beta_{2j}$, which does vary between groups.

This model is defined by

$$\begin{aligned} y_{ij} &= \beta_{0j} + \alpha_1 x_{1ij} + \beta_{2j} x_{2ij} + e_{ij} \\ \beta_{0j} &= \gamma_{00} + \gamma_{01} w_{1j} + u_{0j} \\ \beta_{2j} &= \gamma_{10} + \gamma_{11} w_{1j} + u_{1j} \end{aligned} \tag{5}$$

In the above equations, both $\beta_{0j}$ and $\beta_{2j}$ depend on a level-two independent variable, $w_1$, and $u_{qj}$ is the deviation of the effect of the variable $w_1$ on $y_{ij}$ in group $j$ with respect to the average effect $\gamma_{q0}$.

Substitution of $\beta_{0j}$ and $\beta_{2j}$ in $y_{ij}$ produces the single-equation model:

$$y_{ij} = \gamma_{00} + \alpha_1 x_{1ij} + \gamma_{01} w_{1j} + \gamma_{10} x_{2ij} + \gamma_{11} w_{1j} x_{2ij} + u_{0j} + u_{1j} x_{2ij} + e_{ij} \tag{6}$$

Generalizing the above two-level model to the case where level one includes $P$ independent variables $x_p$ that have a fixed effect, $Q$ independent variables $x_q$ that have a random effect, and $M$ level-two independent variables $w_m$, which also have a fixed effect, we have:

$$y_{ij} = \gamma_{00} + \sum_{m=1}^{M}\gamma_{0m}w_{mj} + \sum_{p=1}^{P}\alpha_p x_{pij} + \sum_{q=P+1}^{P+Q}\gamma_{q0}x_{qij} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\gamma_{qm}w_{mj}x_{qij} + \left(u_{0j} + \sum_{q=P+1}^{P+Q}x_{qij}u_{qj} + e_{ij}\right) \tag{7}$$

Model 7 is composed of fixed effects (the coefficients γ and α) and random effects (all terms in parentheses). Two-level models can also be expressed in a matrix form by

$$Y = X\beta + Wu + e \tag{8}$$

where $Y$ is the vector of measurements of the dependent variable, $X$ is the design matrix of the fixed-effect parameter vector $\beta$ (containing the overall mean, main effects, and interactions), $W$ is the design matrix of the random effects given by the vector $u$, and $e$ is the vector of level-one residual errors.

2.2 Three-level models

Let $i$, $j$, and $k$ indicate the observation units of levels 1, 2, and 3, respectively. In addition, level 3 has $K$ units, each level-three unit has $J_k$ level-two units, and the $j$-th level-two unit in the $k$-th level-three unit has $n_{jk}$ level-one units. The null model is

$$\begin{aligned} y_{ijk} &= \beta_{0jk} + e_{ijk} \\ \beta_{0jk} &= \gamma_{00k} + u_{0jk} \\ \gamma_{00k} &= \xi_{000} + \nu_{00k} \end{aligned} \tag{9}$$

In the first equation, $\beta_{0jk}$ is the level-one random intercept that varies between the level-two groups, and $e_{ijk}$ is the level-one residual variation around $\beta_{0jk}$. In the second equation, $\gamma_{00k}$ is the level-two random intercept that varies between the level-three units, and $u_{0jk}$ is the residual variation of group $j$ with respect to $\gamma_{00k}$. In the third equation, $\xi_{000}$ is the general intercept, that is, the grand mean of $y$, and $\nu_{00k}$ is the variation between the means of the level-three groups (i.e., the deviation of the mean of group $k$ with respect to the grand mean).

Substituting $\gamma_{00k}$ in $\beta_{0jk}$ and then $\beta_{0jk}$ in $y_{ijk}$ yields

$$y_{ijk} = \xi_{000} + \nu_{00k} + u_{0jk} + e_{ijk} \tag{10}$$

Assuming that $e_{ijk} \sim N(0, \sigma^2)$, $u_{0jk} \sim N(0, \sigma_{u0}^2)$, and $\nu_{00k} \sim N(0, \sigma_{\nu 0}^2)$ and that $e_{ijk}$, $u_{0jk}$, and $\nu_{00k}$ are independent, the variance of $y_{ijk}$ in Eq. (10) is

$$\mathrm{var}(y_{ijk}) = \sigma_{\nu 0}^2 + \sigma_{u0}^2 + \sigma^2 \tag{11}$$

One way to define the ICC at levels two and three, attributed to Davis and Scott [3], is

$$\rho_{\text{level 3}} = \frac{\sigma_{\nu 0}^2}{\sigma_{\nu 0}^2 + \sigma_{u0}^2 + \sigma^2} \tag{12}$$
$$\rho_{\text{level 2}} = \frac{\sigma_{u0}^2}{\sigma_{\nu 0}^2 + \sigma_{u0}^2 + \sigma^2} \tag{13}$$

Now, we consider a three-level multilevel model with two level-one independent variables, $x_1$ and $x_2$; the former has a fixed effect and the latter a random effect:

$$y_{ijk} = \beta_{0jk} + \alpha_1 x_{1ijk} + \beta_{2jk} x_{2ijk} + e_{ijk} \tag{14}$$

Let us suppose that the random coefficients $\beta_{0jk}$ and $\beta_{2jk}$ are explained by a second-level variable, $w_1$, through the relationships

$$\begin{aligned} \beta_{0jk} &= \gamma_{00k} + \gamma_{01k} w_{1jk} + u_{0jk} \\ \beta_{2jk} &= \gamma_{10k} + \gamma_{11k} w_{1jk} + u_{1jk} \end{aligned} \tag{15}$$

and that the random coefficients in Eq. (15) are explained by a third-level variable, $z_1$, through the equations

$$\begin{aligned} \gamma_{00k} &= \xi_{000} + \xi_{001} z_{1k} + \nu_{00k} \\ \gamma_{01k} &= \xi_{010} + \xi_{011} z_{1k} + \nu_{01k} \\ \gamma_{10k} &= \xi_{100} + \xi_{101} z_{1k} + \nu_{10k} \\ \gamma_{11k} &= \xi_{110} + \xi_{111} z_{1k} + \nu_{11k} \end{aligned} \tag{16}$$

Substituting Eqs. (16) in (15) and then in Eq. (14), we have the three-level multilevel model:

$$\begin{aligned} y_{ijk} = {} & \xi_{000} + \alpha_1 x_{1ijk} + \xi_{001} z_{1k} + \xi_{010} w_{1jk} + \xi_{100} x_{2ijk} + \xi_{011} z_{1k} w_{1jk} + \xi_{101} z_{1k} x_{2ijk} + \xi_{110} w_{1jk} x_{2ijk} + \xi_{111} z_{1k} w_{1jk} x_{2ijk} \\ & + \left(u_{0jk} + \nu_{00k} + \nu_{10k} x_{2ijk} + u_{1jk} x_{2ijk} + \nu_{01k} w_{1jk} + \nu_{11k} w_{1jk} x_{2ijk} + e_{ijk}\right) \end{aligned} \tag{17}$$

where the regression coefficients ξ and α are the fixed part of the model, and the residual terms of each level contained in parentheses are the random part.

We can generalize the three-level model of Eq. (17). Suppose level one contains $P$ independent variables $x_p$ that have a fixed effect $\alpha_p$ and $Q$ independent variables $x_q$ ($q = P+1, \ldots, P+Q$) that have a random effect. Level two contains $M$ variables $w_m$ ($m = 1, \ldots, M$). Level three contains $L$ independent variables $z_l$ ($l = 1, \ldots, L$).

$$\begin{aligned} y_{ijk} = {} & \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \sum_{l=1}^{L}\xi_{00l}z_{lk} \\ & + \sum_{m=1}^{M}\sum_{l=1}^{L}\xi_{0ml}z_{lk}w_{mjk} + \sum_{q=P+1}^{P+Q}\sum_{l=1}^{L}\xi_{q0l}z_{lk}x_{qijk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\xi_{qm0}w_{mjk}x_{qijk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\sum_{l=1}^{L}\xi_{qml}z_{lk}w_{mjk}x_{qijk} \\ & + \left(\nu_{00k} + u_{0jk} + \sum_{q=P+1}^{P+Q}\nu_{q0k}x_{qijk} + \sum_{q=P+1}^{P+Q}u_{qjk}x_{qijk} + \sum_{m=1}^{M}\nu_{0mk}w_{mjk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\nu_{qmk}w_{mjk}x_{qijk} + e_{ijk}\right) \end{aligned} \tag{18}$$

When the three-level model includes random slopes at level one and at level two, the model quickly accumulates many parameters (an interaction and a residual effect for each random slope coefficient), which easily causes convergence problems unless the data set is sufficiently large. Therefore, most three-level models have only a few random slope coefficients.

The equivalent matrix model for a three-level model is

$$Y = X\beta + Wu + Z\nu + e \tag{19}$$

Again, X is the design matrix of the fixed effect parameter vector β (containing the overall mean, main effects, and interactions), W is the design matrix of the random effects given by the vector u, Z is the design matrix of the random effects given by the vector ν, and e is the vector of level-one residual errors.

2.3 Assumptions of multilevel models

Statistical assumptions such as normal distribution, variance at each level, and independence between errors at different levels have been mentioned in the definition of the null multilevel model. They are explicitly defined in this section.

The dimension of the vector $u$ depends on the number of random coefficients in the level-one equation; for example, in Eq. (14), the dimension is two. Similarly, the dimension of the vector $\nu$ depends on the number of random coefficients in the level-two equation; for example, in Eq. (15), the dimension is four. Let $e$ denote the vector of all level-one errors, $u = (u_{0jk}, u_{1jk}, \ldots, u_{sjk})^T$, and $\nu = (\nu_{00k}, \ldots, \nu_{0tk}, \ldots, \nu_{t0k}, \ldots, \nu_{ttk})^T$, with dimensions $N$, $s$, and $t^2$, respectively.

Multilevel models’ assumptions are:

$$\begin{pmatrix} \nu \\ u \\ e \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} D & 0 & 0 \\ 0 & G & 0 \\ 0 & 0 & R \end{pmatrix} \right) \tag{20}$$

where 0 is the vector of zeros with the appropriate dimension and

$$D = \begin{pmatrix} \sigma_{\nu 0}^2 & \sigma_{\nu 01} & \cdots & \sigma_{\nu 0t} \\ \sigma_{\nu 01} & \sigma_{\nu 1}^2 & \cdots & \sigma_{\nu 1t} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{\nu 0t} & \sigma_{\nu 1t} & \cdots & \sigma_{\nu t}^2 \end{pmatrix}, \quad G = \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} & \cdots & \sigma_{u0s} \\ \sigma_{u01} & \sigma_{u1}^2 & \cdots & \sigma_{u1s} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{u0s} & \sigma_{u1s} & \cdots & \sigma_{us}^2 \end{pmatrix}, \quad R = \sigma^2 I \tag{21}$$

Eq. (20) says:

  1. Level-one errors, $e$, are independent and identically normally distributed with mean zero and variance $\sigma^2$.

  2. Level-two errors, $u$, follow a multivariate normal distribution with mean $0$ and covariance matrix $G$.

  3. Level-three errors, $\nu$, follow a multivariate normal distribution with mean $0$ and covariance matrix $D$.

  4. Level-one and level-two errors are independent, $\mathrm{Cov}(e, u) = 0$.

  5. Level-one and level-three errors are independent, $\mathrm{Cov}(e, \nu) = 0$.

  6. Level-two and level-three errors are independent, $\mathrm{Cov}(u, \nu) = 0$.

2.4 Multilevel model estimation

The estimation of a multilevel model is complex because, in addition to the residuals at the individual level in the model, there are more residual terms of random intercepts and/or slopes of higher levels. Simultaneously, three types of parameters need to be estimated: the fixed effects, the random effects, and the residual variance/covariance components in matrices D, G, and R. Statistical theory and estimation algorithms for multilevel modeling are beyond the scope of this chapter, but some ideas are given.

When the matrices $D$, $G$, and $R$ are known, they can be used to estimate the combined model using generalized least squares (GLS). The variance of $y$, given that the matrices $D$ and $G$ are known, is

$$\hat{V} = W G W^{T} + Z D Z^{T} + R \tag{22}$$

The inverse of the $\hat{V}$ matrix can be used as a weight, and the regression coefficients of the model can then be estimated using GLS. In practice, however, the matrices $D$ and $G$ are unknown.

The maximum likelihood estimation method is the most widely used for estimating multilevel models. It consists of maximizing the likelihood function, which generally involves an iterative process that takes the current parameter estimates as the initial values for the next iteration. This process is repeated until the parameter estimates have stabilized from one iteration to the next. The tolerance, which can sometimes be set by the user, is usually a sufficiently small number, for example, $10^{-8}$. The model converges if the change between two consecutive iterations falls below the tolerance. However, sometimes this does not happen: if the specified iteration limit is reached before the tolerance is met, the method is said not to converge, and this may indicate model specification problems or a small sample size.
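As a frequentist illustration of this iterative estimation, the sketch below fits a three-level random-intercept model by restricted maximum likelihood with the lme4 package and shows how the optimizer and its iteration limit might be adjusted when convergence warnings appear. The data frame dat and the column names y, x, patient, and tooth are hypothetical.

```r
library(lme4)

# Sites nested in teeth nested in patients, estimated by REML
fit_ml <- lmer(y ~ x + (1 | patient/tooth), data = dat,
               control = lmerControl(optimizer = "bobyqa",
                                     optCtrl = list(maxfun = 1e5)))

summary(fit_ml)      # fixed effects and variance components
isSingular(fit_ml)   # TRUE often signals an overparameterized random part
```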

Other estimation methods used in multilevel models are generalized estimating equations, bootstrap methods, and Bayesian methods [3]. When the assumptions of the multilevel models (Section 2.3) are not met, these methods are adequate.

2.5 Multilevel generalized linear models

Multilevel generalized linear models (MGLM) are an extension of generalized linear models. What makes both models different is that the former assumes dependence in the observations of the dependent variable and the latter assumes independence in the observations.

A three-level MGLM of the dependent variable $Y$ conditioned on the random effects $\nu$ and $u$ is

$$g\left(E(Y \mid \nu, u)\right) = \eta = X\beta + Wu + Z\nu \tag{23}$$

where $g$ is the link function, a known monotonic, differentiable function, and $\eta$ is the linear predictor. As in multilevel models, the random effects are assumed to have normal distributions with zero mean vectors and variance/covariance matrices $D$ and $G$, respectively. The multilevel models described in Sections 2.1 and 2.2 are a particular case of MGLM with $g$ the identity function.


3. Bayesian inference

Bayesian inference is a more attractive alternative to frequentist maximum likelihood estimation when: (1) we have information about the parameters in the model, (2) the frequentist estimation method does not converge, (3) the sample size is small at the highest level of the data, or (4) nonlinear functions of the parameters are to be estimated. With this motivation, let us define some concepts.

The heart of Bayesian inference is the posterior distribution of $\theta$, $p(\theta \mid y)$, which is obtained from the joint probability distribution of the observed data $y$ and the parameter $\theta$, $p(y, \theta) = p(y \mid \theta)\,p(\theta)$, conditioned on the known value of $y$, with $p(y) = \int p(\theta)\,p(y \mid \theta)\,d\theta$. Using Bayes' theorem, we obtain

$$p(\theta \mid y) = \frac{p(\theta)\,p(y \mid \theta)}{p(y)} \tag{24}$$

where $p(y \mid \theta)$ is the likelihood of the data $y$, and $p(\theta)$ is the prior distribution of $\theta$. The term $p(y) = \int p(\theta)\,p(y \mid \theta)\,d\theta$, with $y$ fixed, is a normalization constant that does not depend on $\theta$. So, an equivalent expression to Eq. (24) is

$$p(\theta \mid y) \propto p(\theta)\,p(y \mid \theta) \tag{25}$$

Prior distributions can be informative or non-informative. When the researcher has a high degree of certainty about $\theta$, the prior distribution will have a small variance and will therefore be informative. If this is not the case, that is, the researcher has a low degree of certainty about $\theta$, the prior distribution will have a large variance and will be non-informative. Since the prior distribution is a factor in the posterior distribution, an informative prior will have a great impact on the posterior, so the researcher must be careful when an informative prior distribution is used.

Bayesian estimators are usually the mean or median (vector) of the posterior distribution; for example, the posterior mean is $\hat{\theta} = \int \theta\, \frac{p(\theta)\,p(y \mid \theta)}{p(y)}\,d\theta$. However, if $\theta$ has high dimension, this requires multiple integrals that usually do not have a closed-form solution. Sometimes $\theta = (\theta_a, \theta_b)$, where $\theta_b$ is a vector of nuisance parameters that are not of interest. The solution is to integrate the posterior distribution with respect to the nuisance parameters, but again, this multiple integral may have no closed-form solution.

Markov chain Monte Carlo (MCMC) is the most widely used method to obtain means, medians, and quantiles of the posterior distribution.

3.1 Markov chain Monte Carlo

3.1.1 Markov chain

A discrete-time Markov chain is a sequence of random variables, $\{X_n,\, n \geq 1\}$, that take values in a finite or countable set $\Omega$ and that satisfy

$$P(X_{n+1} = j \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} = j \mid X_n = i_n) \tag{26}$$

for all $n$ and any states $i_0, \ldots, i_n, j$ in $\Omega$. Under regularity conditions, the chain will gradually forget its initial state $i_0$, and the distribution of $X_t$ given $X_0 = i_0$ will converge to a unique stationary (invariant) distribution $\phi$ that does not depend on $t$ or $i_0$.

As the number of sampled points $X_t$ increases, they will look more and more like dependent samples from $\phi$. The burn-in of an MCMC run is the number of initial iterations, $m$, to discard so that the remaining draws behave like dependent samples from the stationary distribution $\phi$ [4]. When the number of burn-in samples is $m$, an estimator of the expectation of $f(X)$ is

$$\widehat{E[f(X)]} = \frac{1}{n - m} \sum_{t = m+1}^{n} f(X_t) \tag{27}$$

3.1.2 Hamiltonian Monte Carlo

The Gibbs sampler and the random-walk Metropolis algorithm are methods whose sampling distributions converge to the target distribution; however, complex models with a large number of parameters may require an unacceptably long time to converge to the target distribution. This problem is largely caused by inefficient random walks through the parameter space.

The Hamiltonian Monte Carlo (HMC) algorithm, or hybrid Monte Carlo algorithm, eliminates random-walk behavior by using momentum variables that transform the problem of sampling from the target distribution into the problem of simulating Hamiltonian dynamics. The Störmer–Verlet ("leapfrog") integrator is used to simulate the time evolution of this system. At iteration $m$, given a step size $\varepsilon$ and a number of steps $L$, the HMC algorithm resamples the momentum variables $r$ from a standard multivariate normal distribution (this can be considered a Gibbs sampling update) and then applies $L$ leapfrog updates to the position and momentum variables $(\theta, r)$ to generate a proposed pair $(\tilde{\theta}, \tilde{r})$, which is accepted or rejected according to the Metropolis rule; if accepted, $\theta^{m} = \tilde{\theta}$ and $r^{m} = \tilde{r}$. For more details, see [5]. In general, specifying the step size $\varepsilon$ and the number of steps $L$ is quite difficult: poorly chosen values produce trajectories that are too short, too long, or that double back on themselves.

This method for generating MCMC is implemented in the brms package [6] to perform Bayesian estimation in multilevel models.
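brms relies on Stan's adaptive variant of HMC (the no-U-turn sampler, NUTS), so $\varepsilon$ and $L$ are tuned automatically; nevertheless, the sampler can be adjusted from the brm call. The control arguments below are real brms/Stan options, while the formula and data are hypothetical placeholders.

```r
library(brms)

fit <- brm(
  y ~ x + (1 | patient/tooth),          # hypothetical multilevel formula
  data    = dat,
  chains  = 4,
  iter    = 4000, warmup = 3000,        # draws per chain and burn-in
  control = list(adapt_delta   = 0.95,  # forces smaller leapfrog step sizes
                 max_treedepth = 12)    # allows longer trajectories if needed
)
```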

3.1.3 MCMC diagnostics

After a large enough number of iterations, the MCMC eventually converges to the posterior distribution. A diagnostic statistic is needed to determine whether the MCMC has already converged to the stationary distribution or whether more iterations are needed. Several diagnostic statistics have been proposed, but we will use the Gelman–Rubin diagnostic and graphical diagnostics.

Gelman and Rubin diagnostic (GR) [7]. This diagnostic uses several chains, $X_i^{(0)}, \ldots, X_i^{(n-1)}$, $i = 1, \ldots, m$, drawn from a density that is overdispersed with respect to the target density $\pi$. In 1992, Gelman and Rubin defined two estimators of the variance of $X$ when $X \sim \pi$:

  1. The within-chain variance: $W = \sum_{i=1}^{m}\sum_{j=0}^{n-1}\left(X_{ij} - \bar{X}_i\right)^2 / \left[m(n-1)\right]$ and

  2. The pooled variance: $\hat{V} = \frac{n-1}{n}W + \frac{B}{n}$,

where $B/n = \sum_{i=1}^{m}\left(\bar{X}_i - \bar{X}\right)^2 / (m-1)$ is the between-chain variance estimate, $\bar{X}_i$ is the mean of chain $i$, $i = 1, \ldots, m$, and $\bar{X}$ is the overall mean. The potential scale reduction factor (PSRF), or Rhat, is defined by:

$$\hat{R} = \frac{\hat{V}}{W} \tag{28}$$

The variance in the numerator of $\hat{R}$ overestimates the target variance, while the variance in the denominator underestimates it. This produces $\hat{R}$ greater than 1. One criterion for stopping the MCMC simulation is that $\hat{R} \approx 1$ or $\hat{R} < 1.1$. The GR and effective sample size (ESS) diagnostics are implemented in the coda package [8].

Graphical diagnostics. MCMC trace plots are the most widely used diagnostic plots for assessing convergence. They are time series plots that show the movement of the Markov chains around their state space and their realizations at each iteration. When visible trends or changes in the dispersion of the chain trace appear, the MCMC has not reached a stationary state. In contrast, when good mixing is observed, the MCMC sampling is said to have converged to the target distribution.
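With brms, both diagnostics can be obtained directly from a fitted object; fit is the hypothetical brmsfit from the previous sketch.

```r
# Gelman-Rubin / Rhat values for all parameters
rhats <- rhat(fit)
any(rhats > 1.1)               # TRUE would suggest running more iterations

# Trace and posterior-density plots for visual inspection of mixing
plot(fit)                      # default brms plots, parameter by parameter
mcmc_plot(fit, type = "trace")
```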

3.2 Model checking and model comparison

Any Bayesian analysis should include a check of the adequacy of the fit of the postulated model to the data. The adequacy of the fit of a model is measured by how well the distribution of the proposed model approximates the distribution of the data; the better the fit of the postulated model to the data, the better the model. But if the fit is poor, it does not mean that the model is bad, but rather that it contains deficiencies that can be improved. This section explains a model assessment method based on the posterior predictive distribution.

Let us define the replicated data $y^{\mathrm{rep}}$ as data that could be observed tomorrow if the experiment that produced the current data $y$ were replicated with the same model and the same values of $\theta$ that produced $y$. The distribution of $y^{\mathrm{rep}}$ given the current data $y$ is called the posterior predictive distribution and is defined as [9]

$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta \tag{29}$$

If the model is accurate, that is, it has a reasonably good fit, the replicated data should be similar to the observed data.
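In brms, this comparison between $y$ and $y^{\mathrm{rep}}$ can be drawn directly from the fitted model with pp_check; fit is again a hypothetical brmsfit object.

```r
# Overlay the observed density and the densities of replicated data sets
pp_check(fit, type = "dens_overlay", ndraws = 100)  # ndraws is called nsamples in older brms

# Compare a test statistic computed on y and on each replicated data set
pp_check(fit, type = "stat", stat = "median")
```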

3.2.1 Log pointwise predictive density

The performance of the fitted model can be measured by the quality of its predictions for new data $y^{f}$. Pointwise predictions are predictions of each element $y_i^{f}$ of $y^{f}$, summarized using an appropriate statistic.

Access to $y^{f}$ is not always easy and is sometimes impossible. Instead, the performance of the fitted model can be assessed using the current data $y$. This way of calculating predictive accuracy and comparing models is known as within-sample predictive accuracy.

The log pointwise predictive density (lppd) of the model fitted to the observed data, with unknown parameter $\theta$, is defined as

$$\mathrm{lppd} = \log \prod_{i=1}^{n} p(y_i \mid y) = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y)\, d\theta \tag{30}$$

In general, the expected predictive accuracy of a fitted model for new data is poorer than the expected predictive accuracy of the same model for the observed data. The computed lppd (clppd) evaluates this expression using a sufficiently large number of draws $\theta^{s}$, $s = 1, \ldots, S$, from $p(\theta \mid y)$ obtained with MCMC:

$$\mathrm{clppd} = \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{s})\right) \tag{31}$$

The clppd of the observed data y is an overestimate of the clppd for future data.

A second method to assess posterior predictive expectation is the adjusted within-sample predictive accuracy that consists of a bias correction of the lppd estimated using information criteria such as Akaike information criterion, deviance information criterion, or Watanabe–Akaike information criterion.

A third method to assess posterior predictive expectation is the cross-validation, which captures the out-of-sample predictive error by fitting the model to the training data and assessing the predictive fit in the holdout data [9]. In model comparison, the best model is the one with the lowest predictive error. Let us explain this method in detail:

Leave-one-out cross-validation (LOO-CV) works with $n$ partitions in which each holdout set contains only one observation, which generates $n$ different posterior inferences, $p_{\mathrm{post}(-i)}$, each obtained through $S$ posterior simulations $\theta^{is}$.

The Bayesian LOO-CV estimate of the predictive fit out of the sample is

$$\mathrm{lppd}_{\text{loo-cv}} = \sum_{i=1}^{n} \log p_{\mathrm{post}(-i)}(y_i) \approx \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{is})\right) \tag{32}$$

Each prediction is conditioned on $n - 1$ data points, which underestimates the predictive fit. For large $n$, the difference is negligible; however, for small $n$, a first-order bias correction $b = \mathrm{lppd} - \overline{\mathrm{lppd}}_{-i}$ can be used, where

$$\overline{\mathrm{lppd}}_{-i} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \log p_{\mathrm{post}(-i)}(y_j) \approx \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_j \mid \theta^{is})\right) \tag{33}$$

The bias-corrected Bayesian LOO-CV is

$$\mathrm{lppd}_{\text{cloo-cv}} = \mathrm{lppd}_{\text{loo-cv}} + b \tag{34}$$

An estimate of the effective number of parameters is

$$p_{\text{loo-cv}} = \mathrm{lppd} - \mathrm{lppd}_{\text{loo-cv}} \tag{35}$$

When comparing two fitted models, we can estimate the difference in their expected predictive accuracy by the difference in $\mathrm{elpd}_{\text{loo-cv}}$. The standard error of the difference can be computed using a paired estimate to take advantage of the fact that the same set of $n$ data points is used to fit both models.

Suppose we are comparing models I and II, with corresponding fit measures $\mathrm{elpd}^{\mathrm{I}}_{\text{loo-cv}}$ and $\mathrm{elpd}^{\mathrm{II}}_{\text{loo-cv}}$; then the difference and its standard error are

$$\mathrm{elpd\_diff} = \mathrm{elpd}^{\mathrm{I}}_{\text{loo-cv}} - \mathrm{elpd}^{\mathrm{II}}_{\text{loo-cv}}, \qquad \mathrm{se\_diff} = \mathrm{se}\!\left(\mathrm{elpd}^{\mathrm{I}}_{\text{loo-cv}} - \mathrm{elpd}^{\mathrm{II}}_{\text{loo-cv}}\right) = \sqrt{n\, V_{i=1}^{n}\!\left(\mathrm{elpd}^{\mathrm{I}}_{\mathrm{loo},i} - \mathrm{elpd}^{\mathrm{II}}_{\mathrm{loo},i}\right)} \tag{36}$$

where $V_{i=1}^{n}$ denotes the sample variance over the $n$ pointwise contributions.

When two models are compared using the LOO-CV statistic, the one with the lowest predictive error is declared the best model. If elpd_diff is obtained with the loo_compare function of the brms library [6], the difference is reported together with its se_diff; in Table 2, the difference for each comparison appears in the column of the more complex model of the pair. There is more evidence of the superiority of one model over another when the magnitude of elpd_diff is larger than se_diff.
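A sketch of this comparison in brms, assuming two hypothetical fitted models fit1 and fit2:

```r
# Bayesian LOO-CV for each model (Pareto-smoothed importance sampling)
loo1 <- loo(fit1)
loo2 <- loo(fit2)

# elpd_diff and se_diff; the model in the first row has the highest elpd
loo_compare(loo1, loo2)
```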

Levels                   Variables
Level 3: Patient         Age, plaque, calculus, insulin resistance, smoking, root remnants, and mismatched restorations
Level 2: Tooth           Mobility
Level 1: Probing site    Bleeding

Table 1.

Independent variables on levels.


4. Multilevel model methodology

To propose a multilevel model, it is necessary to determine which variables will be in the fixed part, which will be in the random part, and which cross-level interactions will be included. This task can be complex, so we need a strategy to build the model.

4.1 Multilevel model building strategy

In this section, we show an adaptation of the bottom-up strategy to specify a three-level multilevel model. The bottom-up methodology is used in the frequentist approach [3]. Our adaptation uses the Bayesian LOO-CV criterion from Step 2 to Step 7 for model comparison.

Step 1. Fitting the intercept-only model:

$$y_{ijk} = \xi_{000} + \nu_{00k} + u_{0jk} + e_{ijk} \tag{37}$$

This model provides a baseline against which to compare the subsequent models.

Step 2. Add all the level-one independent variables as fixed effects:

$$y_{ijk} = \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \nu_{00k} + u_{0jk} + e_{ijk} \tag{38}$$

It must be determined which level-one variable has a significant effect on y. We will assume that all the P level-one variables are statistically significant. Models 38 and 37 must be compared.

Step 3. Add the level-two independent variables as fixed effects:

$$y_{ijk} = \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \nu_{00k} + u_{0jk} + e_{ijk} \tag{39}$$

It must be determined which level-two variable has a significant effect on y. If the variables wm explain the variability of y, Model 39 should be superior to Model 38. Again, we assume that all the M level-two independent variables are statistically significant.

Step 4. Add the level-three independent variables as fixed effects:

$$y_{ijk} = \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \sum_{l=1}^{L}\xi_{00l}z_{lk} + \nu_{00k} + u_{0jk} + e_{ijk} \tag{40}$$

It must be determined which level-three variable has a significant effect on y. If the variables zl explain the variability of y, Model 40 should be superior to Model 39.

Steps 1–4 concern the specification of the fixed part of the three-level multilevel model. Now we will specify the random part of the model.

Step 5. Assess whether any of the slopes of the level-one independent variables has a significant variance component between groups at level two or level three:

$$y_{ijk} = \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \sum_{l=1}^{L}\xi_{00l}z_{lk} + \left(\nu_{00k} + u_{0jk} + \sum_{q=P+1}^{P+Q}\nu_{q0k}x_{qijk} + \sum_{q=P+1}^{P+Q}u_{qjk}x_{qijk} + e_{ijk}\right) \tag{41}$$

where $u_{qjk}$ are the level-two residuals of the slopes of the level-one independent variables $x_q$, and $\nu_{q0k}$ are the corresponding level-three residuals of those slopes.

Level-one independent variables that do not have a significant fixed slope may still have a significant random slope. This step and the next should be performed carefully, because the model can easily become overparameterized and/or suffer problems such as non-convergence or extremely slow computation. It is advisable to assess the significance of the random slopes variable by variable. Next, the model is formulated with all the variables that have significant random slopes. If Model 41 is not better than Model 40, the procedure for specifying a three-level multilevel model stops.

Step 6. Assess whether any of the slopes of the level-two independent variables has a significant variance component among the level-three groups:

$$y_{ijk} = \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \sum_{l=1}^{L}\xi_{00l}z_{lk} + \left(\nu_{00k} + u_{0jk} + \sum_{q=P+1}^{P+Q}\nu_{q0k}x_{qijk} + \sum_{q=P+1}^{P+Q}u_{qjk}x_{qijk} + \sum_{m=1}^{M}\nu_{0mk}w_{mjk} + e_{ijk}\right) \tag{42}$$

where the ν’s are the level-three residuals of the slopes of the level-two independent variable wm.

The assessment of the random slopes of the level-two variables should be performed variable by variable, and then all these variables should be included in a single model to assess the improvement with respect to Model 41.

Step 7. Add interactions between the level-three independent variables and the level-one and level-two independent variables that showed significant slope variances in Steps 5 and 6. This produces the full model:

$$\begin{aligned} y_{ijk} = {} & \xi_{000} + \sum_{p=1}^{P}\alpha_p x_{pijk} + \sum_{q=P+1}^{P+Q}\xi_{q00}x_{qijk} + \sum_{m=1}^{M}\xi_{0m0}w_{mjk} + \sum_{l=1}^{L}\xi_{00l}z_{lk} \\ & + \sum_{m=1}^{M}\sum_{l=1}^{L}\xi_{0ml}z_{lk}w_{mjk} + \sum_{q=P+1}^{P+Q}\sum_{l=1}^{L}\xi_{q0l}z_{lk}x_{qijk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\xi_{qm0}w_{mjk}x_{qijk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\sum_{l=1}^{L}\xi_{qml}z_{lk}w_{mjk}x_{qijk} \\ & + \left(\nu_{00k} + u_{0jk} + \sum_{q=P+1}^{P+Q}\nu_{q0k}x_{qijk} + \sum_{q=P+1}^{P+Q}u_{qjk}x_{qijk} + \sum_{m=1}^{M}\nu_{0mk}w_{mjk} + \sum_{q=P+1}^{P+Q}\sum_{m=1}^{M}\nu_{qmk}w_{mjk}x_{qijk} + e_{ijk}\right) \end{aligned} \tag{43}$$

When the variances of the random slopes are explained in terms of contextual variables, the model automatically includes cross-level interaction terms, which become part of the fixed part of the model. It is recommended to add the variables that explain the variance of the random slope coefficients one by one, rather than all at once as shown in this step (this was done here only to avoid writing more equations); a consolidated sketch of the whole sequence is shown below.
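As an illustration of Steps 1–7, the nested sequence of models can be written as brms formulas and compared step by step with LOO-CV. The variable names (x1, w1, z1) and grouping factors (lvl2 nested in lvl3) are hypothetical placeholders, only one variable per level is shown, and only x1 receives a random slope.

```r
library(brms)

f1 <- y ~ 1 + (1 | lvl3/lvl2)                 # Step 1: intercept-only model
f2 <- y ~ x1 + (1 | lvl3/lvl2)                # Step 2: level-one fixed effect
f3 <- y ~ x1 + w1 + (1 | lvl3/lvl2)           # Step 3: + level-two fixed effect
f4 <- y ~ x1 + w1 + z1 + (1 | lvl3/lvl2)      # Step 4: + level-three fixed effect
f5 <- y ~ x1 + w1 + z1 + (1 + x1 | lvl3/lvl2) # Step 5: random slope for x1
f7 <- y ~ x1 * z1 + w1 + (1 + x1 | lvl3/lvl2) # Step 7: cross-level interaction

fits <- lapply(list(f1, f2, f3, f4, f5, f7), function(f)
  brm(f, data = dat, chains = 4, iter = 4000, warmup = 3000))

# Compare all fitted models with Bayesian LOO-CV
loo_compare(lapply(fits, loo))
```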

For an MGLM, the methodology changes slightly: instead of defining the models in terms of $y$, they are defined in terms of $g(\mu_{ijk}) = g\left(E(Y_{ijk} \mid \nu, u)\right)$, and the residual errors at the individual level are no longer specified. An example of this methodology for an MGLM is illustrated below.


5. Application: periodontal probing depth

In this section, an example is given in which a multilevel generalized linear model is used for data from a cross-sectional study conducted by Romero-Castro et al. [10]. This study was carried out among adults residing in the state of Guerrero, Mexico, who attended the external dental clinical service of the Dental School of the Autonomous University of Guerrero (UAGro) in search of treatment between August 2015 and February 2016. The protocol was approved (registration no. CB005/2015) by the ethics committee at UAGro.

The goal of this multilevel analysis was to determine the clinical factors associated with the depth of periodontal probing.

Thirty-two teeth were examined in each of the 116 patients. Probing pocket depth was recorded at six sites on each tooth, that is, the mesiobuccal, mid-buccal, distobuccal, mesiolingual, mid-lingual, and distolingual locations, using a Florida probe. The response variable was probing depth measured in millimeters; that is, probing depth is a continuous variable greater than zero. The data set consisted of 18,358 observations.

The independent variables, except age, were all dichotomous: bleeding, mobility, plaque, calculus, insulin resistance (fasting plasma glucose > 100 mg/dL), smoking, root remnants, and mismatched restorations, where 0 indicated absence and 1 presence.

Figure 1 and Table 1 show the three levels of the data and the variables at each level. The first level corresponded to the probing sites, where the independent variables bleeding and furcation and the response variable probing depth were measured. Level two corresponded to the dental piece, that is, the teeth, for which only the independent variable mobility was measured, and level three corresponded to the patients, for whom the independent variables age, plaque, calculus, insulin resistance, smoking, root remnants, and mismatched restorations were measured.

Figure 1.

Multilevel structure of the probing depth of 1 tooth out of 32 teeth for each patient.

A first data analysis was done using a three-level multilevel model assuming a normal distribution for the probing depth. The frequentist fit had two problems: the residuals did not have a normal distribution and the numerical method to obtain the estimates did not converge.

The minimum probing depth was 0.2 mm, Q1 was 0.8 mm, Q2 was 1.2 mm, Q3 was 1.8 mm, and the maximum was 9 mm. In addition, its distribution was right-skewed (skewness = 1.6 and kurtosis = 8.0). Therefore, it was assumed that probing depth had a gamma distribution with mean $\mu$ and variance $\mu^2/\alpha$:

$$f(y) = \frac{(\alpha/\mu)^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} \exp\!\left(-\frac{\alpha y}{\mu}\right) \tag{44}$$

It is well known that gamma regression belongs to the generalized linear model family. However, since the data studied are hierarchical, the appropriate model is the multilevel generalized linear model. Given that the response variable is positive, the natural logarithm was used as the link function, so that the expected probing depth is greater than zero.

5.1 Bayesian estimation

Likelihood: It was assumed that probing depth follows a gamma distribution.

Prior distribution: The prior was defined as a product of marginal prior distributions for each component of $\beta$ in Model 23. $\beta$ is composed of the overall mean, main effects, and interactions, $\xi_{000}, \xi_{q00}, \xi_{0m0}, \xi_{00l}, \xi_{0ml}, \xi_{q0l}, \xi_{qm0}, \xi_{qml}$, and all of them had $N(0, 10^2)$ priors. The brm function uses a special parameterization of the matrices $D$ and $G$ in Eq. (21), $G = F(\sigma_k)\,\Omega_k\,F(\sigma_k)$, where $F(\sigma_k)$ is a diagonal matrix with diagonal elements $\sigma_k$ and $\Omega_k$ is a correlation matrix [6]. Priors for $D$ and $G$ therefore only require priors for $\sigma_k$ and $\Omega_k$, which were $\sigma_k \sim \text{Half-Cauchy}(10)$ and $\Omega_k \sim \text{LKJ}(1)$. Finally, the shape hyperparameter had a $\text{Gamma}(0.01, 0.01)$ prior.

The analysis of this model was performed with the brms library [6, 11] that uses the probabilistic programming language Stan [12] in the environment of R Software 4.0.5.

Simulation: All the MCMC runs had four chains; the number of iterations and the burn-in were not the same for all the models studied, but all of them used a final sample of 4000 draws.

The MCMC for Models 37, 38, and 39 (that is, the null model and the models with level-one and level-two variables) used 4000 iterations and a burn-in of 3000. Model 40 used 5000 iterations and a burn-in of 4000, Model 41 used 7000 iterations and a burn-in of 6000, and Model 43 used 8000 iterations and a burn-in of 7000.
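A sketch of how the prior specification and sampler settings described above might be passed to brm for the null model (Eq. (45)). The data frame perio and the column names pd, patient, and tooth are hypothetical; the prior classes and the Gamma(link = "log") family are standard brms arguments.

```r
library(brms)

priors <- c(
  set_prior("normal(0, 10)",     class = "Intercept"),
  set_prior("cauchy(0, 10)",     class = "sd"),     # half-Cauchy, since sd > 0
  set_prior("gamma(0.01, 0.01)", class = "shape")
)

fit45 <- brm(
  pd ~ 1 + (1 | patient/tooth),   # Eq. (45): random intercepts for patients and teeth
  data   = perio,
  family = Gamma(link = "log"),
  prior  = priors,
  chains = 4, iter = 4000, warmup = 3000,
  seed   = 1
)
summary(fit45)
```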

Bayesian estimators: The mean of the posterior distribution was used as the Bayesian estimator; this corresponds to minimizing the squared loss function.

Models studied: We studied three-level multilevel generalized linear models, where $i$ represents the level-one units, $j$ the level-two units, and $k$ the level-three units. Although the Rhat values are not shown, the MCMC for all the studied models converged, since all the Rhats were at most 1.01.

Step 1. The null model is

$$\log \mu_{ijk} = \xi_{000} + \nu_{00k} + u_{0jk} \tag{45}$$

The Model 45 column of Table 2 shows the Bayesian estimates of the null model. The credible intervals did not contain zero, so the variances at the tooth level and at the patient level were significant. This supports the use of an MGLM.

Parameter                      Model 45             Model 46             Model 47             Model 48             Model 49             Model 50
                               Coef (95% CrI)       Coef (95% CrI)       Coef (95% CrI)       Coef (95% CrI)       Coef (95% CrI)       Coef (95% CrI)
Group-level effects: Patient (116 levels)
  sd(Intercept)                0.20 (0.17, 0.23)    0.20 (0.17, 0.23)    0.20 (0.17, 0.23)    0.19 (0.17, 0.22)    0.19 (0.17, 0.22)    0.19 (0.17, 0.22)
  sd(Bleeding)                 —                    —                    —                    —                    0.12 (0.02, 0.21)    0.13 (0.02, 0.23)
Group-level effects: Patient:Tooth (3131 levels)
  sd(Intercept)                0.23 (0.22, 0.24)    0.22 (0.21, 0.23)    0.22 (0.21, 0.23)    0.22 (0.21, 0.23)    0.22 (0.21, 0.23)    0.22 (0.21, 0.23)
  sd(Bleeding)                 —                    —                    —                    —                    0.25 (0.16, 0.33)    0.25 (0.16, 0.34)
Population-level effects
  Intercept                    0.34 (0.30, 0.37)    0.33 (0.29, 0.37)    0.33 (0.29, 0.36)    0.29 (0.24, 0.34)    0.29 (0.25, 0.34)    0.29 (0.25, 0.34)
  Bleeding                     —                    0.15 (0.11, 0.19)    0.15 (0.10, 0.19)    0.15 (0.10, 0.19)    0.13 (0.07, 0.20)    0.11 (0.02, 0.19)
  Mobility                     —                    —                    0.04 (−0.00, 0.08)   0.03 (−0.01, 0.07)   0.03 (−0.01, 0.07)   0.03 (−0.01, 0.08)
  Calculus                     —                    —                    —                    0.10 (0.03, 0.18)    0.10 (0.03, 0.18)    0.10 (0.03, 0.18)
  Smoking                      —                    —                    —                    −0.02 (−0.14, 0.09)  −0.02 (−0.14, 0.10)  −0.02 (−0.13, 0.09)
  Bleeding:Calculus            —                    —                    —                    —                    —                    0.06 (−0.07, 0.19)
Specific parameters
  Shape                        4.50 (4.41, 4.60)    4.51 (4.41, 4.61)    4.51 (4.41, 4.61)    4.51 (4.42, 4.61)    4.55 (4.45, 4.64)    4.54 (4.45, 4.64)
elpd_diff (se_diff)            —                    −16.2 (8.2)          −2.3 (2.0)           −3.4 (1.5)           −10.7 (7.9)#         −1.3 (1.8)*

Table 2.

Bayesian estimates of Models 45–50.

95% CrI: 95% credible interval; Coef: coefficient. The elpd_diff (se_diff) row reports the LOO-CV comparisons between consecutive models: Model 45 vs. 46, Model 46 vs. 47, Model 47 vs. 48, Model 48 vs. 49 (#), and Model 49 vs. 50 (*), with each difference reported in the column of the second model of the pair.

Model                                                                      elpd_diff   se_diff
Comparison of models with one level-three independent variable
  ξ001 calculus1k                                                            0.0         0.0
  ξ003 insulin resistance3k                                                 −0.1         1.6
  ξ004 root remnants4k                                                      −0.8         1.5
  ξ005 plaque5k                                                             −1.5         1.6
  ξ002 smoking2k                                                            −1.7         1.6
  ξ006 age6k                                                                −1.8         1.5
  ξ007 mismatched restorations7k                                            −2.5         1.5
Comparison of models with two level-three independent variables
  ξ001 calculus1k + ξ002 smoking2k                                           0.0         0.0
  ξ001 calculus1k + ξ003 insulin resistance3k                               −2.1         1.5
  ξ001 calculus1k + ξ006 age6k                                              −2.6         1.5
  ξ001 calculus1k + ξ005 plaque5k                                           −2.9         1.5
  ξ001 calculus1k + ξ007 mismatched restorations7k                          −4.4         1.5
  ξ001 calculus1k + ξ004 root remnants4k                                    −5.2         1.5
Comparison of the best models with one and two level-three independent variables
  ξ001 calculus1k + ξ002 smoking2k                                           0.0         0.0
  ξ001 calculus1k                                                           −2.6         1.5
Comparison of models with three level-three independent variables
  ξ001 calculus1k + ξ002 smoking2k + ξ004 root remnants4k                    0.0         0.0
  ξ001 calculus1k + ξ002 smoking2k + ξ006 age6k                             −0.2         1.6
  ξ001 calculus1k + ξ002 smoking2k + ξ003 insulin resistance3k              −0.6         1.5
  ξ001 calculus1k + ξ002 smoking2k + ξ005 plaque5k                          −1.2         1.6
  ξ001 calculus1k + ξ002 smoking2k + ξ007 mismatched restorations7k         −1.6         1.5
Comparison of the best models with two and three level-three independent variables
  ξ001 calculus1k + ξ002 smoking2k                                           0.0         0.0
  ξ001 calculus1k + ξ002 smoking2k + ξ004 root remnants4k                   −2.9         1.5

Table 3.

Forward variable selection for the level-three variables*.

*All the models have the structure $\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \xi_{010}\,\mathrm{mobility}_{1jk} + \mathrm{var1} + \mathrm{var2} + \mathrm{var3} + \nu_{00k} + u_{0jk}$, where var1 is the independent variable that produces the best fit among all seven models with one independent variable. Similarly, var2 is the second independent variable that produces the best fit among all six models, having var1 in common, with two independent variables, and so on for var3.

Step 2. The model with level-one variable, bleeding, is

$$\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \nu_{00k} + u_{0jk} \tag{46}$$

The Bayesian estimates of this model showed that the bleeding coefficient was significant (Model 46 column of Table 2). The comparison of Models 45 and 46 using LOO-CV indicates that the model including the level-one variable was better (elpd_diff row, Model 46 column of Table 2).

Step 3. The model with level-two variable, mobility, is

$$\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \xi_{010}\,\mathrm{mobility}_{1jk} + \nu_{00k} + u_{0jk} \tag{47}$$

The mobility fixed effect was not significant (Model 47 column of Table 2); however, it was retained in the model because it was the only level-two variable and in order to obtain estimates of the effects of the level-three independent variables adjusted for the level-two variable. The LOO-CV criterion indicated that this model was slightly better (elpd_diff row, Model 47 column of Table 2).

There are seven level-three contextual variables (Table 1); before specifying the model containing only the significant level-three variables, a forward selection of variables was performed to avoid an overparameterized model. Table 3 shows the variable selection procedure, where each model contains the level-one variable bleeding and the level-two variable mobility. The LOO-CV model comparison indicates that the model including the calculus and smoking variables is the best model.

Step 4. The model with level-three variables, calculus and smoking, is

$$\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \xi_{010}\,\mathrm{mobility}_{1jk} + \xi_{001}\,\mathrm{calculus}_{1k} + \xi_{002}\,\mathrm{smoking}_{2k} + \nu_{00k} + u_{0jk} \tag{48}$$

The Model 48 column of Table 2 shows that the variable smoking was not significant; however, the model containing smoking is better than the other candidates (Table 3). Model 48 was also better than Model 47 (elpd_diff row, Model 48 column of Table 2).

Step 5. The model with a random slope for the variable bleeding.

In Eq. (49), a random slope for the variable bleeding is added that varies at the patient and tooth levels; that is, the relationship between probing depth and bleeding is allowed to vary between patients and between teeth.

$$\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \xi_{010}\,\mathrm{mobility}_{1jk} + \xi_{001}\,\mathrm{calculus}_{1k} + \xi_{002}\,\mathrm{smoking}_{2k} + \nu_{10k}\,\mathrm{bleeding}_{1ijk} + u_{1jk}\,\mathrm{bleeding}_{1ijk} + \nu_{00k} + u_{0jk} \tag{49}$$
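In brms formula syntax, the random slope of bleeding at both grouping levels in Eq. (49) could be written as follows. Column names, priors, and sampler settings follow the earlier null-model sketch and are assumptions, not the authors' exact code.

```r
fit49 <- brm(
  pd ~ bleeding + mobility + calculus + smoking +
       (1 + bleeding | patient/tooth),        # random intercepts and bleeding slopes
  data   = perio,
  family = Gamma(link = "log"),
  prior  = c(priors,
             set_prior("normal(0, 10)", class = "b"),   # fixed-effect coefficients
             set_prior("lkj(1)",        class = "cor")),
  chains = 4, iter = 7000, warmup = 6000
)
pp_check(fit49, ndraws = 100)   # posterior predictive check as in Figure 2
```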

Finally, in the next model, interaction terms were added based on signs that occur in periodontal disease.

The Model 49 column of Table 2 shows that the random slope of bleeding was significant at both the patient and tooth levels. This model was compared with Model 48 using the LOO-CV criterion, and the best model was Model 49, which contains the random slopes (elpd_diff row, Model 49 column of Table 2).

Step 7. The model with cross-level interactions is

$$\log \mu_{ijk} = \xi_{000} + \alpha_1\,\mathrm{bleeding}_{1ijk} + \xi_{010}\,\mathrm{mobility}_{1jk} + \xi_{001}\,\mathrm{calculus}_{1k} + \xi_{002}\,\mathrm{smoking}_{2k} + \xi_{101}\,\mathrm{bleeding}_{1ijk}\,\mathrm{calculus}_{1k} + \nu_{10k}\,\mathrm{bleeding}_{1ijk} + u_{1jk}\,\mathrm{bleeding}_{1ijk} + \nu_{00k} + u_{0jk} \tag{50}$$

Eq. (50) includes an interaction between the level-three variable calculus and the level-one variable bleeding. The Model 50 column of Table 2 shows that the interaction was not significant (its credible interval contained zero). Finally, the model comparison indicated that the best model was Model 49, the bleeding random slope model (elpd_diff row, Model 50 column of Table 2). This is therefore the model that is interpreted.

Figure 2 shows the posterior predictive fit of Model 49 to the data. The replicated data are plotted in a light color, and the observed data are plotted in black. As the two curves agree very well, the posterior predictive density fits the distribution of the probing depth very well. Both distributions are clearly asymmetric and appear to follow a gamma distribution; a normal distribution was clearly not an appropriate assumption for probing depth. In conclusion, the random slope model (49) had a good fit.

Figure 2.

Posterior predictive density of Model 49 with a random slope for bleeding between teeth and patient.

Figure 3 shows the histograms of the empirical posterior distributions of the parameters. Finally, the MCMC of Model 49 converged since all the Rhats were at most 1.01, and the trace plots of Figure 3 show that the chains mix well.

Figure 3.

Posterior distribution of parameters of Model 49 and time series plots showing the MCMC output.

5.2 Discussion

In this example of probing depth, the variance at the tooth level (1.59) and the variance at the patient level (1.49) were significant (Table 2); that is, the mean of the dependent variable varied between teeth nested in patients. The ICC at the tooth level (0.45) was higher than that at the patient level (0.42); that is, there was greater dependence between measurements of probing sites within the same tooth than between measurements of probing sites of different teeth of the same patient. This finding shows that using a multilevel model for these probing depth data was better than using a single-level model, and the former produced more accurate estimates and credible intervals. In addition, the random slope of bleeding was significant between teeth; that is, there was a positive relationship between probing depth and bleeding that varied between the teeth of the patients (between teeth, probing depth increased by an average factor of 1.28 if the site was bleeding). On the other hand, calculus is a form of hardened dental plaque. In the random slope model (Eq. (49)), bleeding and calculus were significant: on average, the probing depth of bleeding sites was 1.14 times that of sites that did not exhibit bleeding, and the probing depth of patients who had calculus on any of their teeth was, on average, 1.11 times that of patients who did not have calculus. The credible intervals for these multiplicative effects of bleeding and calculus were (1.07, 1.22) and (1.03, 1.20), respectively.

In the random slope model, mobility and smoking were not significantly associated with probing depth, but if we decide to interpret them anyway, we can say that, on average, the probing depth of teeth that presented mobility was 1.03 times that of teeth without mobility. Similarly, smoking patients had, on average, a probing depth 0.98 times that of non-smoking patients. Different results and interpretations could be obtained if the independent variables were measured at levels other than those used in this example; specifically, the variable calculus could have been measured at the tooth level. Before fitting the Bayesian multilevel model, we tried to estimate the multilevel model using restricted maximum likelihood; however, the numerical method did not converge. More practical examples using R can be found in [13].


6. Conclusions

In certain clinical research designs, the data have a nested structure (in other words, a hierarchical structure). Data with a nested structure are modeled using multilevel models because these models simultaneously estimate the effects of the variables at the individual level and the effects of the contextual variables, or variables at the group level. A significant ICC determines whether it is necessary to use a multilevel model; if the ICC is not significant, an ordinary regression model is sufficient to model the nested data. A disadvantage of multilevel models is that they easily contain a large number of parameters to be estimated. On the other hand, modeling the data levels separately incurs a large type I error even when the ICC is small, and this leads to incorrect inferences. The maximum likelihood estimation of the parameters of a multilevel model requires that the distributional assumptions be satisfied. More general methods, such as Bayesian estimation, make it possible to estimate the parameters without requiring that the assumptions of the multilevel models be satisfied. In addition, Bayesian estimation is robust to small sample sizes, a situation that is more likely to occur for the higher-level units, and, in general, it can deal with technical problems such as multicollinearity of the data.

In this chapter, we adapted the bottom-up strategy for specifying a multilevel model from the frequentist approach to the Bayesian approach. Our proposal was to use Bayesian LOO-CV between the different steps for the comparison of models. The deviance information criterion (DIC) could also be used instead of Bayesian LOO-CV.

Two factors had a significant association with probing depth: bleeding (a site-level covariate) and dental calculus (a patient-level covariate). At the tooth level, no factor associated with probing depth was found.

The methodology set out in this chapter can be applied to other areas of the health sciences with data with a hierarchical structure and numerical response variable.


Acknowledgments

Tino-Salgado is indebted to CONACYT for the fellowship that enabled her to pursue graduate studies for the degree of Maestría en Matemáticas Aplicadas. The authors thank the sixth-semester periodontics students of the 2013–2018 cohort for their assistance and logistical support during the patient recruitment stage.


Conflict of interest

The authors declare no conflict of interest.


Abbreviations

DIC: Deviance information criterion
HMC: Hamiltonian Monte Carlo
ESS: Effective sample size
GEE: Generalized estimating equations
GR: Gelman and Rubin diagnostic
ICC: Intraclass correlation coefficient
LOO-CV: Leave-one-out cross-validation
lppd: Log pointwise predictive density
MCMC: Markov chain Monte Carlo
MGLM: Multilevel generalized linear models
NUTS: No-U-turn sampling
PSRF: Potential scale reduction factor

References

  1. Kim JS, Kim D-K, Hong SJ. Assessment of errors and misused statistics in dental research. International Dental Journal. 2011;61(3):163-167
  2. Wang J, Xie H, Fisher JF. Multilevel Models: Applications Using SAS. Berlin, Germany: de Gruyter; 2011. DOI: 10.1515/9783110267709
  3. Hox JJ, Moerbeek M, van de Schoot R. Multilevel Analysis: Techniques and Applications. New York, United States: Routledge; 2017
  4. Gilks WR, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in Practice. Florida, United States: CRC Press; 1995
  5. Hoffman MD, Gelman A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research. 2014;15(1):1593-1623
  6. Bürkner P-C. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software. 2017;80(1):1-28. DOI: 10.18637/jss.v080.i01
  7. Roy V. Convergence diagnostics for Markov chain Monte Carlo. Annual Review of Statistics and Its Application. 2020;7:387-412
  8. Plummer M, Best N, Cowles K, Vines K. CODA: Convergence diagnosis and output analysis for MCMC. R News. 2006;6(1):7-11
  9. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. Florida, United States: CRC Press; 2013
  10. Romero-Castro NS, Castro-Alarcon N, Reyes-Fernández S, Flores-Alfaro E, Parra-Rojas I. Periodontal disease distribution, risk factors, and importance of primary healthcare in the clinical parameters improvement. International Journal of OdontoStomatology. 2020;14(2):183-190. DOI: 10.4067/S0718-381X2020000200183
  11. Bürkner P-C. Advanced Bayesian multilevel modeling with the R package brms. The R Journal. 2018;10(1):395-411. DOI: 10.32614/RJ-2018-017
  12. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A probabilistic programming language. Journal of Statistical Software. 2017;76(1):1-32. DOI: 10.18637/jss.v076.i01
  13. Finch WH, Bolin JE, Kelley K. Multilevel Modeling Using R. Florida, United States: CRC Press; 2019
