
Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

By Mu Yue

Submitted: April 30th 2021 | Reviewed: September 17th 2021 | Published: October 19th 2021

DOI: 10.5772/intechopen.100506


Abstract

In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation to select tuning parameters and tend to retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods for the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter addresses the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have broad application prospects.

Keywords

  • sparse boosting
  • high-dimensional data
  • machine learning
  • variable selection
  • data analysis

1. Introduction

High-dimensional models have become very popular in the statistical literature, and many new machine learning techniques have been developed to deal with data with a very large number of features. In the past decades, researchers have carried out a great deal of high-dimensional data analysis where the sample size $n$ is relatively small but the number of features $p$ under consideration is extremely large. It is widely known that including irrelevant predictors in the statistical model may result in unstable estimation and severe computing issues. Thus, variable selection is crucial to address these challenges. Among all developments, regularization procedures such as LASSO [1], smoothly clipped absolute deviation (SCAD) [2], MCP [3] and their various extensions [4, 5, 6] have been thoroughly studied and widely used to perform variable selection and estimation simultaneously in order to improve the prediction accuracy and interpretability of the statistical model. However, those penalized estimation approaches all have tuning parameters that must be selected by computationally expensive methods like cross-validation.

In recent years, machine learning methods such as boosting have become very prominent in high-dimensional data settings since they can improve the selection accuracy substantially and reduce the chance of including irrelevant features. The original boosting algorithm was proposed by Schapire [7]; it is an ensemble method that iteratively combines weak learners to minimize the expected loss. The major difference among boosting algorithms is the loss function. For example, AdaBoost [8] has the exponential loss function, L2 boosting [9] has the squared error loss function, sparse boosting [10] has the penalized loss function and HingeBoost [11] has the weighted hinge loss function. Recently, more variants of boosting algorithms have been proposed. See, for example, Bühlmann and Hothorn [12] for twin boosting; Komori and Eguchi [13] for pAUCBoost; Wang [14] for twin HingeBoost; Zhao [15] for GSBoosting and Yang and Zou [16] for ER-Boost. Besides these extensions, much effort has been made in understanding the advantages of boosting, such as a relatively lower over-fitting risk, smaller computational cost, and simpler adjustment to include additional constraints.

In this chapter we review sparse boosting based methods for the following high-dimensional problems, based on three research papers. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers [17]. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time [18]. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations [19]. This chapter addresses the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have broad application prospects.

The rest of the chapter is arranged as follows. In Section 2, a sparse boosting method to fit high-dimensional survival data is studied. In Section 3, a two-step sparse boosting approach to carry out variable selection and model-based prediction by fitting high-dimensional models with longitudinal data is studied. In Section 4, a subgroup identification method incorporating multi-step sparse boosting algorithm for high-dimensional dense longitudinal data is studied. Finally, Section 5 provides concluding remarks.


2. Sparse boosting for survival data

Survival time data are usually referred to as time-to-event data and are usually censored. Predicting survival time and identifying risk factors can be very helpful for patient treatment selection, disease prevention strategy, or disease management in evidence-based medicine. A well-known model in survival analysis is the Cox proportional hazards (PH) model [20], which assumes multiplicative covariate effects on the hazard function. Another popular model is the accelerated failure time (AFT) model [21], which assumes that the covariate effect is to accelerate or decelerate the lifetime of a disease. The coefficients in the regression model have a direct interpretation as covariate effects on the mean survival time. Recently, researchers have developed boosting methods to analyze survival data. For example, Schmid and Hothorn [22] proposed a flexible boosting method for parametric AFT models, and Wang and Wang [23] proposed Buckley-James boosting for survival data with right censoring and high dimensionality.

In this section, a sparse boosting method to fit high-dimensional varying-coefficient AFT models is presented. In particular, sparse boosting techniques for right censored survival data are studied. In Section 2.1, the varying-coefficient AFT model for survival data is formulated and a detailed sparse boosting algorithm to fit the model is proposed. In Section 2.2, the proposed sparse boosting techniques are evaluated through simulation studies. In Section 2.3, the performance of sparse boosting is examined via a lung cancer data example.

2.1 Methodology

2.1.1 Model and estimation

Let $T_i$ and $C_i$ be the logarithm of the survival time and the censoring time for the $i$th subject in a random sample of size $n$, respectively. In reality, $Y_i = \min(T_i, C_i)$ and the censoring indicator $\delta_i = I(T_i \le C_i)$ [24] are observed. Denote $X_i = (X_{i,1}, \ldots, X_{i,p-1})$ the corresponding $(p-1)$-dimensional predictors, such as gene expressions or biomarkers, for the $i$th subject, and $U_i$ the univariate index variable. The observed data set $\{(X_i, \delta_i, Y_i, U_i): X_i \in \mathbb{R}^{p-1}, \delta_i \in \{0,1\}, Y_i \in \mathbb{R}, U_i \in \mathbb{R},\ i = 1, 2, \ldots, n\}$ is an independently and identically distributed random sample from $(X, \delta, Y, U)$. The varying-coefficient AFT model is:

$$T_i = \beta_0(U_i) + \sum_{j=1}^{p-1} X_{i,j}\,\beta_j(U_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}$$

where $\beta_0(\cdot), \beta_1(\cdot), \ldots, \beta_{p-1}(\cdot)$ are the unknown varying-coefficient functions of the confounder $U$ and $\varepsilon_i$ is the random error with $E(\varepsilon_i \mid X_i, U_i) = 0$.

A weighted least squares estimation approach is adopted. Let the $w_i$'s be the Kaplan–Meier weights [25], which are the jumps in the Kaplan–Meier estimator, computed as $w_1 = \delta_{(1)}/n$ and $w_i = \frac{\delta_{(i)}}{n-i+1}\prod_{j=1}^{i-1}\left(\frac{n-j}{n-j+1}\right)^{\delta_{(j)}}$, $i = 2, \ldots, n$. Let $Y_{(1)} \le \cdots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, $\delta_{(1)}, \ldots, \delta_{(n)}$ be the corresponding censoring indicators of the ordered $Y_i$'s, and let $X_{(1),j}, \ldots, X_{(n),j}$, $j = 1, \ldots, p-1$, and $U_{(1)}, \ldots, U_{(n)}$ be defined similarly. Then the weighted least squares loss function is

$$\sum_{i=1}^{n} w_i\left(Y_{(i)} - \beta_0(U_{(i)}) - \sum_{j=1}^{p-1} X_{(i),j}\,\beta_j(U_{(i)})\right)^2. \tag{2}$$
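To make the weight construction concrete, the following Python sketch computes the Kaplan–Meier (Stute) weights defined above; the function name and array layout are illustrative assumptions, not code from the chapter.

```python
import numpy as np

def km_weights(y, delta):
    """Kaplan-Meier (Stute) weights attached to the ordered sample.

    y     : (n,) observed follow-up values Y_i = min(T_i, C_i)
    delta : (n,) censoring indicators, 1 = event, 0 = censored
    Returns (order, w): the sorting permutation and the weight w_i of the
    i-th ordered observation, matching the product formula above.
    """
    n = len(y)
    order = np.argsort(y)                 # indices giving Y_(1) <= ... <= Y_(n)
    d = np.asarray(delta, dtype=float)[order]
    w = np.empty(n)
    prod = 1.0                            # running product over j < i of ((n-j)/(n-j+1))^delta_(j)
    for i in range(n):                    # 0-based i corresponds to i+1 in the text
        w[i] = d[i] / (n - i) * prod      # delta_(i) / (n - i + 1) in 1-based indexing
        prod *= ((n - i - 1) / (n - i)) ** d[i]
    return order, w
```

These weights form the diagonal of $W$ in Eq. (5); censored observations receive weight zero.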

Let $B(\cdot) = (B_1(\cdot), \ldots, B_L(\cdot))^T$ be an equally spaced B-spline basis, where $L$ is the dimension of the basis. Under certain smoothness conditions, the Curry–Schoenberg theorem [26] implies that every smooth function $\beta_j(\cdot)$ can be approximated by

$$\beta_j(\cdot) \approx B^T(\cdot)\,\gamma_j, \quad j = 0, \ldots, p-1, \tag{3}$$

where $\gamma_j$ is a vector of length $L$. Then the weighted least squares loss function Eq. (2) can be approximated by

$$\sum_{i=1}^{n} w_i\left(Y_{(i)} - B^T(U_{(i)})\gamma_0 - \sum_{j=1}^{p-1} X_{(i),j}\,B^T(U_{(i)})\gamma_j\right)^2. \tag{4}$$

Denote $\tilde Y = (Y_{(1)}, \ldots, Y_{(n)})^T$, $X_{(i),0} = 1$ for $i = 1, \ldots, n$, $\tilde X_j = (B(U_{(1)})X_{(1),j}, \ldots, B(U_{(n)})X_{(n),j})^T$, $\tilde X = (\tilde X_0, \ldots, \tilde X_{p-1})$, $W = \mathrm{diag}(w_1, \ldots, w_n)$ and $\gamma = (\gamma_0^T, \ldots, \gamma_{p-1}^T)^T$. Then the objective function Eq. (4) may be written in the following matrix form:

$$(\tilde Y - \tilde X\gamma)^T W (\tilde Y - \tilde X\gamma). \tag{5}$$
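To illustrate how the B-spline blocks $\tilde X_j$ and the weighted objective in Eq. (5) can be assembled, here is a minimal Python sketch. It evaluates an equally spaced B-spline basis through scipy's BSpline with identity coefficients; the helper names and the default dimension and degree are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, L, degree=3):
    """Evaluate an L-dimensional equally spaced B-spline basis at the points u."""
    u = np.asarray(u, dtype=float)
    lo, hi = u.min(), u.max()
    n_inner = L - degree - 1                                  # number of interior knots
    inner = np.linspace(lo, hi, n_inner + 2)[1:-1]
    knots = np.r_[[lo] * (degree + 1), inner, [hi] * (degree + 1)]
    # With identity coefficients, column j of the result is B_j evaluated at u.
    return BSpline(knots, np.eye(L), degree, extrapolate=True)(u)   # shape (n, L)

def make_blocks(X, u, L=6):
    """Blocks [X_tilde_0, ..., X_tilde_{p-1}]: intercept block B(U_i) and B(U_i) * X_{i,j}."""
    B = bspline_basis(u, L)                                   # (n, L)
    return [B] + [B * X[:, [j]] for j in range(X.shape[1])]

def weighted_loss(Y, blocks, gammas, w):
    """Weighted least squares objective of Eq. (5) with W = diag(w)."""
    r = Y - sum(Xj @ gj for Xj, gj in zip(blocks, gammas))
    return float(r @ (w * r))
```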

The estimation yields a closed-form solution for the coefficients when the dimensionality $p$ is small or moderate; under high dimensionality the solution cannot be easily achieved. Let $\gamma^{(\hat K)} = (\gamma_0^{(\hat K)T}, \ldots, \gamma_{p-1}^{(\hat K)T})^T$ be the estimator of $\gamma$ from the sparse boosting approach with the weighted squared loss function Eq. (5), where $\hat K$ is the estimated number of stopping iterations. Then the estimates of the coefficient functions are given by

$$\hat\beta_j(u) = B^T(u)\,\gamma_j^{(\hat K)}, \quad j = 0, \ldots, p-1. \tag{6}$$

Instead of using regularized estimation approaches, a sparse boosting method to estimate $\gamma^{(\hat K)}$ is presented in the following subsection.

2.1.2 Sparse boosting techniques

The key idea of sparse boosting is to replace the empirical risk function in L2 boosting with the penalized empirical risk function which is a combination of squared loss and the trace of boosting operator as a measure of boosting complexity, and then perform gradient descent in a function space iteratively. Thus sparse boosting produces sparser models compared to L2 boosting. The g-prior minimum description length (gMDL) proposed by [27] can be used as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion. The gMDL takes the form:

$$\mathrm{gMDL}(\mathrm{RSS}, \mathrm{trace}(B)) = \log S + \frac{\mathrm{trace}(B)}{n}\log\left(\frac{\tilde Y^T\tilde Y - \mathrm{RSS}}{\mathrm{trace}(B)\times S}\right), \quad S = \frac{\mathrm{RSS}}{n - \mathrm{trace}(B)}. \tag{7}$$

Here $\mathrm{RSS}$ is the residual sum of squares and $B$ is the boosting operator. The model that achieves the shortest description of the data will be selected. The advantage is that gMDL has a data-dependent penalty for each dimension, since it is explicitly given as a function of the data only; thus the selection of a tuning parameter can be avoided.
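A direct transcription of Eq. (7) into Python could look like the following; the guard against degenerate values is an added assumption rather than part of the chapter.

```python
import numpy as np

def gmdl(rss, trace_b, yty, n):
    """g-prior MDL score of Eq. (7); smaller values indicate a shorter description.

    rss     : (weighted) residual sum of squares of the candidate fit
    trace_b : trace of the corresponding boosting operator (model complexity)
    yty     : total (weighted) sum of squares of the response, e.g. Y' W Y
    n       : sample size
    """
    if trace_b <= 0 or trace_b >= n or yty <= rss:
        return np.inf                  # degenerate case: never preferred
    s = rss / (n - trace_b)
    return np.log(s) + (trace_b / n) * np.log((yty - rss) / (trace_b * s))
```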

The sparse boosting procedure is described in detail. The initial value of $\gamma$ is set to the zero vector, i.e. $\gamma^{(k)} = 0$ for $k = 0$, while in the $k$th iteration ($1 \le k \le K$, with $K$ the total number of iterations) the current residual $R^{(k)} = \tilde Y - \tilde X\gamma^{(k-1)}$ is regressed on every $j$th working element $\tilde X_j$, $j = 0, \ldots, p-1$. The fit, denoted by $\hat\lambda_j^{(k)}$, is obtained by minimizing the weighted squared loss function $(R^{(k)} - \tilde X_j\lambda)^T W (R^{(k)} - \tilde X_j\lambda)$ with respect to $\lambda$. Hence the weighted least squares estimate is $\hat\lambda_j^{(k)} = (\tilde X_j^T W \tilde X_j)^{-1}\tilde X_j^T W R^{(k)}$, the corresponding hat matrix is $H_j = \tilde X_j(\tilde X_j^T W \tilde X_j)^{-1}\tilde X_j^T W$, and the weighted residual sum of squares is $\mathrm{RSS}_j^{(k)} = (R^{(k)} - \tilde X_j\hat\lambda_j^{(k)})^T W (R^{(k)} - \tilde X_j\hat\lambda_j^{(k)})$. The selected component $\hat s_k$ is obtained by:

$$\hat s_k = \arg\min_{0 \le j \le p-1} \mathrm{gMDL}\left(\mathrm{RSS}_j^{(k)}, \mathrm{trace}(B_j^{(k)})\right), \tag{8}$$

where $B_j^{(1)} = H_j$ and $B_j^{(k)} = I - (I - H_j)(I - \nu H_{\hat s_{k-1}})\cdots(I - \nu H_{\hat s_1})$ for $k > 1$ is the boosting operator for selecting the $j$th component in the $k$th iteration. Therefore, at each iteration only one working component $\tilde X_{\hat s_k}$ is chosen, and only the corresponding coefficient vector $\gamma_{\hat s_k}^{(k)}$ changes, i.e. $\gamma_{\hat s_k}^{(k)} = \gamma_{\hat s_k}^{(k-1)} + \nu\hat\lambda_{\hat s_k}^{(k)}$, where $\nu$ is the step size, while all the other $\gamma_j^{(k)}$ for $j \ne \hat s_k$ remain the same. This process is repeated for $K$ iterations, and the stopping iteration $\hat K$ is estimated by

$$\hat K = \arg\min_{1 \le k \le K} \mathrm{gMDL}\left(\mathrm{RSS}_{\hat s_k}^{(k)}, \mathrm{trace}(B^{(k)})\right), \tag{9}$$

where $B^{(k)} = I - (I - \nu H_{\hat s_k})\cdots(I - \nu H_{\hat s_1})$.

From this sparse boosting procedure, the estimator of $\gamma$ is obtained as $\gamma^{(\hat K)} = (\gamma_0^{(\hat K)T}, \ldots, \gamma_{p-1}^{(\hat K)T})^T$. The sparse boosting algorithm for the varying-coefficient AFT model can be summarized as follows:

Sparse Boosting Algorithm for Varying-Coefficient AFT Model.

  1. Initialization. Set $k = 0$ and $\gamma_0^{(k)} = 0, \ldots, \gamma_{p-1}^{(k)} = 0$ (component-wise).

  2. Iteration. $k = k + 1$. Compute $\hat s_k = \arg\min_{0 \le j \le p-1} \mathrm{gMDL}(\mathrm{RSS}_j^{(k)}, \mathrm{trace}(B_j^{(k)}))$, where $B_j^{(1)} = H_j$ and $B_j^{(k)} = I - (I - H_j)(I - \nu H_{\hat s_{k-1}})\cdots(I - \nu H_{\hat s_1})$ for $k > 1$.

  3. Update. $\gamma_j^{(k)} = \gamma_j^{(k-1)}$ for $j \ne \hat s_k$ and $\gamma_{\hat s_k}^{(k)} = \gamma_{\hat s_k}^{(k-1)} + \nu\hat\lambda_{\hat s_k}^{(k)}$, where $\nu$ is the step size.

  4. Iteration. Repeat steps 2–3 for $K$ iterations.

  5. Stopping. Estimate $\hat K = \arg\min_{1 \le k \le K}\mathrm{gMDL}(\mathrm{RSS}_{\hat s_k}^{(k)}, \mathrm{trace}(B^{(k)}))$, where $B^{(k)} = I - (I - \nu H_{\hat s_k})\cdots(I - \nu H_{\hat s_1})$. Thus, $\gamma^{(\hat K)} = (\gamma_0^{(\hat K)T}, \ldots, \gamma_{p-1}^{(\hat K)T})^T$ is the estimate for $\gamma$ and $\hat\beta_j(u) = B^T(u)\gamma_j^{(\hat K)}$, $j = 0, \ldots, p-1$, are the estimators of the varying coefficients. The final estimator of $\tilde Y$ is $\tilde Y^{(\hat K)} = \tilde X\gamma^{(\hat K)}$.

According to [10] and references therein, the selection of the step size $\nu$ is of minor importance as long as it is small. A smaller value of $\nu$ achieves higher prediction accuracy but requires a larger number of boosting iterations and more computing time. A typical value used in the literature is $\nu = 0.1$.
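Putting the pieces together, the sketch below shows one way the componentwise sparse boosting loop of the above algorithm could be implemented. It reuses the hypothetical `gmdl` helper from the previous sketch, takes the blocks $\tilde X_j$ and Kaplan–Meier weights as inputs, and tracks the boosting operator as a dense $n \times n$ matrix, which is simple but memory-hungry; it is a sketch of the algorithm, not the authors' implementation.

```python
import numpy as np

def sparse_boost(Y, blocks, w, nu=0.1, K=200):
    """Componentwise sparse boosting for the weighted loss of Eq. (5).

    Y      : (n,) (ordered) responses
    blocks : list of (n, L) design blocks [X_tilde_0, ..., X_tilde_{p-1}]
    w      : (n,) Kaplan-Meier weights
    Returns the list of coefficient vectors gamma_j at the gMDL-stopped iteration.
    """
    n, p = len(Y), len(blocks)
    W = np.diag(w)
    # Per-block weighted projections P_j (lambda_hat = P_j r) and hat matrices H_j.
    P = [np.linalg.pinv(Xj.T @ W @ Xj) @ Xj.T @ W for Xj in blocks]
    H = [blocks[j] @ P[j] for j in range(p)]
    gamma = [np.zeros(Xj.shape[1]) for Xj in blocks]
    yty = float(Y @ (w * Y))
    B = np.zeros((n, n))                        # boosting operator B^(k-1), starts at 0
    path, scores = [], []
    for k in range(1, K + 1):
        r = Y - sum(Xj @ gj for Xj, gj in zip(blocks, gamma))
        Q = np.eye(n) - B                       # (I - nu*H_{s_{k-1}}) ... (I - nu*H_{s_1})
        best, s_k, rss_sel = np.inf, 0, np.inf
        for j in range(p):                      # Eq. (8): gMDL over candidate components
            res_j = r - H[j] @ r
            rss_j = float(res_j @ (w * res_j))
            B_j = np.eye(n) - (np.eye(n) - H[j]) @ Q
            score = gmdl(rss_j, np.trace(B_j), yty, n)
            if score < best:
                best, s_k, rss_sel = score, j, rss_j
        gamma[s_k] = gamma[s_k] + nu * (P[s_k] @ r)        # nu-damped update of chosen block
        B = np.eye(n) - (np.eye(n) - nu * H[s_k]) @ Q      # B^(k)
        path.append([g.copy() for g in gamma])
        scores.append(gmdl(rss_sel, np.trace(B), yty, n))  # stopping criterion, Eq. (9)
    K_hat = int(np.argmin(scores)) + 1                     # estimated stopping iteration
    return path[K_hat - 1]
```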

2.2 Simulation

The above sparse boosting algorithm is evaluated on simulated data. The L2 boosting and sparse boosting methods are compared in terms of variable selection and function estimation. The sparse boosting method is the one presented in this section, while L2 boosting is a relatively simpler version that in general may not achieve a sparse solution.

The simulation results from [17] show that both boosting methods can identify the important variables, while sparse boosting selects far fewer irrelevant variables than L2 boosting. Although the in-sample prediction errors (defined as $\sum_{i=1}^{n}\delta_i(Y_i - Y_i^{(\hat K)})^2 / \sum_{i=1}^{n}\delta_i$) using L2 boosting are slightly smaller than those using sparse boosting, since the former has larger model sizes, the average over $j = 0, \ldots, 5$ of the root mean integrated squared errors (defined as $\big(\frac{1}{n}\sum_{i=1}^{n}\{\beta_j(u_i) - \hat\beta_j(u_i)\}^2\big)^{1/2}$) using sparse boosting is much smaller than that using L2 boosting. Furthermore, when the smoothness assumption of the Curry–Schoenberg theorem is violated for the coefficient functions, the variable selection performance remains good. In summary, sparse boosting outperforms L2 boosting in terms of parameter estimation and variable selection.

2.3 Lung cancer data analysis

Lung cancer is the top cancer killer in the U.S. Identifying relevant gene expressions in lung cancer is important for treatment and prevention. The data are from a large multi-site blinded validation study [28] with 442 lung adenocarcinomas. Age is treated as the potential confounder in this analysis, since it is usually strongly correlated with survival time [29]. After removing patients with missing measurements or missing overall survival, a total of 439 patients are left in the analysis. For each patient, 22,283 gene expressions are available. The median follow-up time is 46 months (range: 0.03 to 204 months) with an overall censoring rate of 46.47%. The median age at diagnosis is 65 years (range: 33 to 87 years). After adopting a marginal screening procedure to screen out irrelevant genes, variable selection approaches are used to identify important genes associated with lung cancer. For comparison, in addition to L2 boosting and the proposed sparse boosting, the following existing variable selection approaches for constant-coefficient AFT models are also considered: Buckley-James boosting with linear least squares [23], Buckley-James twin boosting with linear least squares [23], and Buckley-James regression with the elastic net penalty [30] and with the SCAD penalty, respectively.

The results from [17] show that L2 boosting and sparse boosting for the varying-coefficient AFT model not only produce relatively sparser models, but also have smaller in-sample and out-of-sample prediction errors compared to the four methods for the constant-coefficient AFT model. Again, sparse boosting produces an even sparser model than L2 boosting. In conclusion, including age in the varying-coefficient AFT model can lead to more accurate estimates than the constant-coefficient AFT model, and the proposed sparse boosting method for the varying-coefficient AFT model performs well in terms of estimation, prediction, and sparsity.


3. Two-step sparse boosting for longitudinal data

Longitudinal data contain repeated measurements collected from the same respondents over time. The assumption that all measurements are independent does not hold for such data. One important question in longitudinal analysis is how to make efficient inference by taking into account the within-subject correlation. This question has been investigated in depth by many researchers [31, 32] for parametric models. Semiparametric and nonparametric models for longitudinal data are also presented in the literature; see [33, 34]. Recently, there have been some developments on longitudinal data with high dimensionality using varying-coefficient models [35, 36]. All previous studies adopted penalty methods.

In this section, a two-step sparse boosting approach is presented to perform variable selection and model-based prediction. Specifically, high-dimensional varying-coefficient models with longitudinal data are considered. In the first step, the sparse boosting approach is utilized to obtain an estimate of the correlation structure. In the second step, the within-subject correlation structure is taken into account, and variable selection and coefficient estimation are achieved by sparse boosting again. The rest of this section is arranged as follows. In Section 3.1, the varying-coefficient model for longitudinal data is formulated and a two-step sparse boosting algorithm is presented. In Section 3.2, simulation studies are conducted to illustrate the validity of the two-step sparse boosting method. In Section 3.3, the performance of the two-step method is assessed by studying yeast cell cycle gene expression data.

3.1 Methodology

3.1.1 Model and estimation

Let $Y_{ij}$ be the continuous outcome for the $j$th measurement of individual $i$ taken at time $t_{ij} \in \mathcal{T}$, where $\mathcal{T}$ is the time interval on which the measurements are taken. Denote $X_{ij} = (X_{ij,1}, \ldots, X_{ij,p-1})$ the corresponding $(p-1)$-dimensional covariate vector. The varying-coefficient model, which can capture the dynamic impacts of the covariates on the response variable, is considered:

$$Y_{ij} = \beta_0(t_{ij}) + \sum_{d=1}^{p-1} X_{ij,d}\,\beta_d(t_{ij}) + \varepsilon_{ij}, \quad i = 1, \ldots, n,\ j = 1, \ldots, n_i, \tag{10}$$

where $\beta_0(\cdot), \beta_1(\cdot), \ldots, \beta_{p-1}(\cdot)$ are the unknown smooth coefficient functions of time and $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{in_i})^T$, $i = 1, \ldots, n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different $i$, but the components of $\varepsilon_i$ are correlated with each other. Without loss of generality, a balanced longitudinal study is considered in the following implementation, i.e., $t_{ij} = t_{kj}$ and $n_i = m$ for all $i$.

The estimation procedure is presented below. In the first step, the within-subject correlation is ignored and the coefficients are estimated by minimizing the following least squares loss function:

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y_{ij} - \beta_0(t_{ij}) - \sum_{d=1}^{p-1} X_{ij,d}\,\beta_d(t_{ij})\right)^2. \tag{11}$$

The B-spline basis is used to estimate the coefficient functions $\beta_0(\cdot), \beta_1(\cdot), \ldots, \beta_{p-1}(\cdot)$. Denote $B(\cdot) = (B_1(\cdot), \ldots, B_L(\cdot))^T$ an equally spaced B-spline basis of dimension $L$. Under certain smoothness assumptions, the function $\beta_d(\cdot)$ can be approximated by

$$\beta_d(\cdot) \approx B^T(\cdot)\,\gamma_d, \quad d = 0, \ldots, p-1, \tag{12}$$

where $\gamma_d$ is a loading vector of length $L$. Then the least squares loss function Eq. (11) can be approximated by

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\left(Y_{ij} - B^T(t_{ij})\gamma_0 - \sum_{d=1}^{p-1} X_{ij,d}\,B^T(t_{ij})\gamma_d\right)^2. \tag{13}$$

Further denote $Y_i = (Y_{i1}, \ldots, Y_{im})^T$, $Y = (Y_1^T, \ldots, Y_n^T)^T$, $X_{ij,0} = 1$, $\tilde X_{i,d} = (B(t_{i1})X_{i1,d}, \ldots, B(t_{im})X_{im,d})^T$, $\tilde X_i = (\tilde X_{i,0}, \ldots, \tilde X_{i,p-1})$, $\tilde X = (\tilde X_1^T, \ldots, \tilde X_n^T)^T$ and $\gamma = (\gamma_0^T, \ldots, \gamma_{p-1}^T)^T$. Then the target function Eq. (13) can be expressed in matrix form:

$$\sum_{i=1}^{n}(Y_i - \tilde X_i\gamma)^T(Y_i - \tilde X_i\gamma) = (Y - \tilde X\gamma)^T(Y - \tilde X\gamma). \tag{14}$$

Denote $\gamma^{(\hat K_1)}$ the estimator of $\gamma$ by sparse boosting with the squared loss function Eq. (14) as the loss function, where $\hat K_1$ is the estimated number of stopping iterations in this step. There is no exact closed form for $\gamma^{(\hat K_1)}$ since it is derived from an iterative algorithm; however, it can be evaluated very quickly in a computer implementation. The detailed algorithm will be presented in the next subsection.

The first step coefficient estimates are given by

$$\tilde\beta_d(t) = B^T(t)\,\gamma_d^{(\hat K_1)}, \quad d = 0, \ldots, p-1. \tag{15}$$

Write $\hat\varepsilon_i = Y_i - \tilde X_i\gamma^{(\hat K_1)}$, $i = 1, \ldots, n$. The $m \times m$ covariance matrix $\mathrm{Cov}(Y_i) \equiv \Sigma$ can be estimated by the following empirical estimator

$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}\hat\varepsilon_i\hat\varepsilon_i^T. \tag{16}$$

In the second step, the estimated correlation structure within repeated measurements is taken into account to form the weighted least squares loss function as follows:

$$\sum_{i=1}^{n}(Y_i - \tilde X_i\gamma)^T\hat\Sigma^{-1}(Y_i - \tilde X_i\gamma) = (Y - \tilde X\gamma)^T W (Y - \tilde X\gamma), \tag{17}$$

where $W = \mathrm{diag}(\hat\Sigma^{-1}, \ldots, \hat\Sigma^{-1})$ is the estimated $(n \times m) \times (n \times m)$ weight matrix. Denote $\gamma^{(\hat K_2)}$ the estimator of $\gamma$ by sparse boosting with the weighted loss function Eq. (17) as the loss function, where $\hat K_2$ is the estimated number of stopping iterations in the second step. Then the coefficient estimates from the second step are given by

$$\hat\beta_d(t) = B^T(t)\,\gamma_d^{(\hat K_2)}, \quad d = 0, \ldots, p-1. \tag{18}$$

Reliable estimates of the coefficient functions can then be obtained. More details on how to use sparse boosting to obtain $\gamma^{(\hat K_1)}$ and $\gamma^{(\hat K_2)}$ are provided in the following subsection.
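As a small illustration of the hand-off between the two steps, the residual covariance of Eq. (16) and the block-diagonal weight matrix $W$ of Eq. (17) could be formed as follows for the balanced case; the function name and the reshape layout (subjects in rows) are assumptions.

```python
import numpy as np

def step_two_weights(Y, X_tilde, gamma1, n, m):
    """Empirical covariance (Eq. 16) and block-diagonal weights (Eq. 17).

    Y, X_tilde : stacked response (n*m,) and design matrix (n*m, q), subject by subject
    gamma1     : first-step sparse boosting coefficient vector
    """
    resid = (Y - X_tilde @ gamma1).reshape(n, m)   # row i holds hat-epsilon_i
    sigma_hat = resid.T @ resid / n                # (m, m) empirical covariance
    sigma_inv = np.linalg.inv(sigma_hat)           # a small ridge may be needed if m is close to n
    W = np.kron(np.eye(n), sigma_inv)              # diag(Sigma^-1, ..., Sigma^-1)
    return sigma_hat, W
```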

3.1.2 Two-step sparse boosting techniques

gMDL can be adopted as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion. gMDL can be expressed in the following form:

$$\mathrm{gMDL}(\mathrm{RSS}, \mathrm{trace}(B)) = \log F + \frac{\mathrm{trace}(B)}{n \times m}\log\left(\frac{Y^T Y - \mathrm{RSS}}{\mathrm{trace}(B)\times F}\right), \quad F = \frac{\mathrm{RSS}}{n \times m - \mathrm{trace}(B)}, \tag{19}$$

where $B$ is the boosting operator and $\mathrm{RSS}$ is the residual sum of squares.

The two-step sparse boosting approach is now presented more specifically. In the first step, the starting value of $\gamma$ is set to the zero vector, i.e. $\gamma^{(0)} = 0$, and in the $k_1$th iteration ($0 < k_1 \le K_1$, where $K_1$ is the maximum number of iterations considered in the first step), the current residual $R^{(k_1)} = Y - \tilde X\gamma^{(k_1-1)}$ is used to fit each $d$th component $\tilde X_{\cdot,d} = (\tilde X_{1,d}^T, \ldots, \tilde X_{n,d}^T)^T$, $d = 0, \ldots, p-1$, treating all within-subject observations as uncorrelated. The fit, denoted by $\hat\lambda_d^{(k_1)}$, is calculated by minimizing the squared loss function $(R^{(k_1)} - \tilde X_{\cdot,d}\lambda)^T(R^{(k_1)} - \tilde X_{\cdot,d}\lambda)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat\lambda_d^{(k_1)} = (\tilde X_{\cdot,d}^T\tilde X_{\cdot,d})^{-1}\tilde X_{\cdot,d}^T R^{(k_1)}$, the corresponding hat matrix is $H_d = \tilde X_{\cdot,d}(\tilde X_{\cdot,d}^T\tilde X_{\cdot,d})^{-1}\tilde X_{\cdot,d}^T$, and the residual sum of squares is $\mathrm{RSS}_d^{(k_1)} = (R^{(k_1)} - \tilde X_{\cdot,d}\hat\lambda_d^{(k_1)})^T(R^{(k_1)} - \tilde X_{\cdot,d}\hat\lambda_d^{(k_1)})$. The chosen element $\hat s_{k_1}$ is attained by:

$$\hat s_{k_1} = \arg\min_{0 \le d \le p-1}\mathrm{gMDL}\left(\mathrm{RSS}_d^{(k_1)}, \mathrm{trace}(B_d^{(k_1)})\right), \tag{20}$$

where $B_d^{(1)} = H_d$ and $B_d^{(k_1)} = I - (I - H_d)(I - \nu H_{\hat s_{k_1-1}})\cdots(I - \nu H_{\hat s_1})$ for $k_1 > 1$ is the first-step boosting operator for choosing the $d$th element in the $k_1$th iteration. Hence, a unique element $\tilde X_{\cdot,\hat s_{k_1}}$ is selected at each iteration, and only the corresponding coefficient vector $\gamma_{\hat s_{k_1}}^{(k_1)}$ changes, i.e., $\gamma_{\hat s_{k_1}}^{(k_1)} = \gamma_{\hat s_{k_1}}^{(k_1-1)} + \nu\hat\lambda_{\hat s_{k_1}}^{(k_1)}$, where $\nu$ is the pre-specified step-size parameter; all the other $\gamma_d^{(k_1)}$ for $d \ne \hat s_{k_1}$ remain unchanged. This procedure is repeated $K_1$ times and the stopping iteration is estimated by

$$\hat K_1 = \arg\min_{1 \le k_1 \le K_1}\mathrm{gMDL}\left(\mathrm{RSS}_{\hat s_{k_1}}^{(k_1)}, \mathrm{trace}(B^{(k_1)})\right), \tag{21}$$

where $B^{(k_1)} = I - (I - \nu H_{\hat s_{k_1}})\cdots(I - \nu H_{\hat s_1})$.

From the first step of sparse boosting, the estimator of $\gamma$ is obtained as $\gamma^{(\hat K_1)} = (\gamma_0^{(\hat K_1)T}, \ldots, \gamma_{p-1}^{(\hat K_1)T})^T$. Then the weight matrix $W$ can easily be obtained as well.

In the second step, sparse boosting is used again, taking into account the correlation structure of the repeated measurements estimated in the first step. The initial value of $\gamma$ is set to the coefficient estimator from the first step of sparse boosting, i.e. $\gamma^{(0)} = \gamma^{(\hat K_1)}$, and in the $k_2$th iteration ($0 < k_2 \le K_2$, where $K_2$ is the maximum number of iterations under consideration in the second step), the current residual $R^{(k_2)} = Y - \tilde X\gamma^{(k_2-1)}$ is used to fit each $d$th working element $\tilde X_{\cdot,d}$, $d = 0, \ldots, p-1$, incorporating the within-subject correlation estimator from the first step. The fit, denoted by $\hat\lambda_d^{(k_2)}$, is obtained by minimizing the weighted squared loss function $(R^{(k_2)} - \tilde X_{\cdot,d}\lambda)^T W(R^{(k_2)} - \tilde X_{\cdot,d}\lambda)$ with respect to $\lambda$. Thus, the weighted least squares estimate is $\hat\lambda_d^{(k_2)} = (\tilde X_{\cdot,d}^T W\tilde X_{\cdot,d})^{-1}\tilde X_{\cdot,d}^T W R^{(k_2)}$, the corresponding hat matrix is $H_d = \tilde X_{\cdot,d}(\tilde X_{\cdot,d}^T W\tilde X_{\cdot,d})^{-1}\tilde X_{\cdot,d}^T W$, and the weighted residual sum of squares is $\mathrm{RSS}_d^{(k_2)} = (R^{(k_2)} - \tilde X_{\cdot,d}\hat\lambda_d^{(k_2)})^T W(R^{(k_2)} - \tilde X_{\cdot,d}\hat\lambda_d^{(k_2)})$. The chosen element $\hat s_{k_2}$ is obtained by:

$$\hat s_{k_2} = \arg\min_{0 \le d \le p-1}\mathrm{gMDL}\left(\mathrm{RSS}_d^{(k_2)}, \mathrm{trace}(B_d^{(k_2)})\right), \tag{22}$$

where $B_d^{(1)} = I - (I - B^{(\hat K_1)})(I - H_d)$ and $B_d^{(k_2)} = I - (I - B^{(\hat K_1)})(I - H_d)(I - \nu H_{\hat s_{k_2-1}})\cdots(I - \nu H_{\hat s_1})$ for $k_2 > 1$ is the second-step boosting operator for choosing the $d$th element in the $k_2$th iteration. Thus, a unique element $\tilde X_{\cdot,\hat s_{k_2}}$ is selected each time, and only the corresponding coefficient vector $\gamma_{\hat s_{k_2}}^{(k_2)}$ changes, i.e., $\gamma_{\hat s_{k_2}}^{(k_2)} = \gamma_{\hat s_{k_2}}^{(k_2-1)} + \nu\hat\lambda_{\hat s_{k_2}}^{(k_2)}$, while all the other $\gamma_d^{(k_2)}$ for $d \ne \hat s_{k_2}$ remain the same. This procedure is repeated $K_2$ times and the estimated stopping iteration $\hat K_2$ is

$$\hat K_2 = \arg\min_{1 \le k_2 \le K_2}\mathrm{gMDL}\left(\mathrm{RSS}_{\hat s_{k_2}}^{(k_2)}, \mathrm{trace}(B^{(k_2)})\right), \tag{23}$$

where $B^{(k_2)} = I - (I - B^{(\hat K_1)})(I - \nu H_{\hat s_{k_2}})\cdots(I - \nu H_{\hat s_1})$.

From the second step of sparse boosting, the estimator of $\gamma$ is obtained as $\gamma^{(\hat K_2)} = (\gamma_0^{(\hat K_2)T}, \ldots, \gamma_{p-1}^{(\hat K_2)T})^T$. The two-step sparse boosting algorithm for the varying-coefficient model with longitudinal data can be summarized as follows:

Two-step Sparse Boosting Algorithm with Longitudinal Data.

Step I: Use sparse boosting to estimate covariance matrix.

  1. Initialization. Let $k_1 = 0$ and $\gamma_0^{(k_1)} = 0, \ldots, \gamma_{p-1}^{(k_1)} = 0$.

  2. Increase $k_1$ by 1. Calculate $\hat s_{k_1} = \arg\min_{0 \le d \le p-1}\mathrm{gMDL}(\mathrm{RSS}_d^{(k_1)}, \mathrm{trace}(B_d^{(k_1)}))$, where $B_d^{(1)} = H_d$ and $B_d^{(k_1)} = I - (I - H_d)(I - \nu H_{\hat s_{k_1-1}})\cdots(I - \nu H_{\hat s_1})$ for $k_1 > 1$.

  3. Update. $\gamma_d^{(k_1)} = \gamma_d^{(k_1-1)}$ for $d \ne \hat s_{k_1}$ and $\gamma_{\hat s_{k_1}}^{(k_1)} = \gamma_{\hat s_{k_1}}^{(k_1-1)} + \nu\hat\lambda_{\hat s_{k_1}}^{(k_1)}$, where $\nu$ is the step-size parameter.

  4. Iteration. Repeat steps 2–3 for some large iteration number $K_1$.

  5. Stopping. The optimal iteration number can be taken as $\hat K_1 = \arg\min_{1 \le k_1 \le K_1}\mathrm{gMDL}(\mathrm{RSS}_{\hat s_{k_1}}^{(k_1)}, \mathrm{trace}(B^{(k_1)}))$, where $B^{(k_1)} = I - (I - \nu H_{\hat s_{k_1}})\cdots(I - \nu H_{\hat s_1})$.

Thus, $\gamma^{(\hat K_1)} = (\gamma_0^{(\hat K_1)T}, \ldots, \gamma_{p-1}^{(\hat K_1)T})^T$ is the first-step estimator for $\gamma$ from sparse boosting and $\tilde\beta_d(t) = B^T(t)\gamma_d^{(\hat K_1)}$, $d = 0, \ldots, p-1$, are the varying-coefficient estimates ignoring the within-subject correlation. $\mathrm{Cov}(Y_i)$ can be estimated by

$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \tilde X_i\gamma^{(\hat K_1)})(Y_i - \tilde X_i\gamma^{(\hat K_1)})^T.$$

Step II: Use sparse boosting again by incorporating covariance matrix estimator.

  1. Initialization. Let $k_2 = 0$ and $\gamma^{(k_2)} = \gamma^{(\hat K_1)}$.

  2. Increase $k_2$ by 1. Calculate $\hat s_{k_2} = \arg\min_{0 \le d \le p-1}\mathrm{gMDL}(\mathrm{RSS}_d^{(k_2)}, \mathrm{trace}(B_d^{(k_2)}))$, where $B_d^{(1)} = I - (I - B^{(\hat K_1)})(I - H_d)$ and $B_d^{(k_2)} = I - (I - B^{(\hat K_1)})(I - H_d)(I - \nu H_{\hat s_{k_2-1}})\cdots(I - \nu H_{\hat s_1})$ for $k_2 > 1$.

  3. Update. $\gamma_d^{(k_2)} = \gamma_d^{(k_2-1)}$ for $d \ne \hat s_{k_2}$ and $\gamma_{\hat s_{k_2}}^{(k_2)} = \gamma_{\hat s_{k_2}}^{(k_2-1)} + \nu\hat\lambda_{\hat s_{k_2}}^{(k_2)}$.

  4. Iteration. Repeat steps 2–3 for some large iteration number $K_2$.

  5. Stopping. The optimal iteration number can be taken as $\hat K_2 = \arg\min_{1 \le k_2 \le K_2}\mathrm{gMDL}(\mathrm{RSS}_{\hat s_{k_2}}^{(k_2)}, \mathrm{trace}(B^{(k_2)}))$, where $B^{(k_2)} = I - (I - B^{(\hat K_1)})(I - \nu H_{\hat s_{k_2}})\cdots(I - \nu H_{\hat s_1})$.

Therefore, $\gamma^{(\hat K_2)} = (\gamma_0^{(\hat K_2)T}, \ldots, \gamma_{p-1}^{(\hat K_2)T})^T$ and $\hat\beta_d(t) = B^T(t)\gamma_d^{(\hat K_2)}$, $d = 0, \ldots, p-1$, are the final estimator for $\gamma$ and the varying-coefficient estimates from the two-step sparse boosting. The final estimate for $Y$ is $\hat Y = \tilde X\gamma^{(\hat K_2)}$.
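At a high level the two steps can be chained as in the sketch below. It assumes a generic componentwise routine `sparse_boost_w(Y, blocks, W, gamma0, B0, nu, K)` that mirrors the single-step algorithm of Section 2 and returns the gMDL-stopped coefficients together with the corresponding boosting operator, plus the `step_two_weights` helper sketched earlier; these names are assumptions used only to show the data flow, not the chapter's implementation.

```python
import numpy as np

def two_step_sparse_boost(Y, blocks, n, m, nu=0.1, K1=200, K2=200):
    """Two-step sparse boosting for the longitudinal varying-coefficient model.

    Step I ignores the within-subject correlation; Step II reruns the boosting
    with the estimated block-diagonal weight matrix, warm-started at the Step I
    solution and carrying its boosting operator B^(K1_hat) into the complexity
    term, in the spirit of Eqs. (22)-(23).
    """
    N = n * m
    X_tilde = np.hstack(blocks)

    # Step I: unweighted componentwise sparse boosting from a zero start.
    gamma1, B1 = sparse_boost_w(Y, blocks, W=np.eye(N),
                                gamma0=None, B0=np.zeros((N, N)), nu=nu, K=K1)

    # Residual covariance and block-diagonal weights from the Step I fit.
    _, W = step_two_weights(Y, X_tilde, np.concatenate(gamma1), n, m)

    # Step II: weighted sparse boosting, warm-started at the Step I estimate.
    gamma2, _ = sparse_boost_w(Y, blocks, W=W,
                               gamma0=gamma1, B0=B1, nu=nu, K=K2)
    return gamma2
```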

3.2 Simulation

Simulation studies are conducted to evaluate the performance of the above two-step sparse boosting algorithm. The following four methods are compared in terms of variable selection and function estimation performance. M1: two-step L2 boosting (uses the squared loss for the update criterion and gMDL for the stopping criterion); M2: two-step sparse boosting; M3: two-step lasso (performs lasso regression in the first step to calculate the estimated within-subject correlation structure using Eq. (14), and uses lasso regression in the second step taking into account the estimated correlation structure); and M4: two-step elastic net regression (similar to M3, with the elastic net mixing parameter 0.5).

The simulation results from [18] show that all methods are able to identify the important variables. However, in terms of sparsity, the two-step sparse boosting method performs best, with the smallest number of false positives. Both penalization methods select many more irrelevant variables than the boosting methods, with elastic net selecting the most. For two-step sparse boosting, the variable selection results are quite stable from step I to step II, but for the other approaches the numbers of false positives, and thus the model sizes, grow from step I to step II. Two-step sparse boosting yields the smallest bias in coefficient estimation among the competing methods. The refined estimates that incorporate the within-subject correlation generally perform better than the initial estimates that ignore it, since the two-step methods gain a reduction in bias, especially when the within-subject correlation is high. In other words, the reduction in bias from step I to step II is much larger when the within-subject correlation is higher. This is intuitive, as the second step takes into account the within-subject correlation structure estimated in the first step. Similar results are obtained for the bias of the estimated covariance matrix: the bias under a smaller within-subject correlation is smaller than under a larger within-subject correlation, and two-step sparse boosting yields a smaller bias of the estimated covariance matrix than the other competing methods when the within-subject correlation is high. In summary, the variable selection and functional coefficient estimation performance of two-step sparse boosting is quite satisfactory.

3.3 Yeast cell cycle gene expression data analysis

The cell cycle is one of the most important activities of life, by which cells grow, replicate their chromosomes, undergo mitosis, and split into daughter cells. Thus, identifying cell cycle-regulated genes is very important. Adopting a model-based approach, Luan and Li [37] identified $n = 297$ cell cycle-regulated genes based on the α-factor synchronization experiments. All gene expression levels were measured at $m = 18$ different time points covering two cell-cycle periods. Using the same subset of the original data as in [38], a total of $p = 96$ transcription factors (TFs) are included as predictors in the downstream analysis. Wei, Huang and Li [39] showed that the effects of the TFs on gene expression levels are time-dependent. After independence screening by the $l_2$-norm [40] to screen out irrelevant predictors in a first stage, several methods can be used to identify the key TFs involved in gene regulation. In addition to two-step L2 boosting and two-step sparse boosting, which take into account the within-subject correlation in the second step, one-step L2 boosting and one-step sparse boosting, which ignore the within-subject correlation, are also considered for a better comparison. Besides, some two-step penalized approaches are also considered: two-step lasso, two-step adaptive lasso and two-step elastic net (with the elastic net mixing parameter 0.5).

The results from [18] show that the boosting approaches yield sparser models than the penalized methods. Sparse boosting yields an even sparser model and smaller estimation and prediction errors than L2 boosting. Two-step boosting achieves better performance than one-step boosting, with smaller estimation and prediction errors. The two-step sparse boosting method yields the sparsest model, with the smallest in-sample and out-of-sample prediction errors compared to the other methods. In terms of the selected TFs, there is a significant overlap between two-step sparse boosting and each of the other methods. In conclusion, the two-step sparse boosting approach performs quite well in terms of variable selection, coefficient estimation and prediction, and can provide useful information for identifying the important TFs that take part in the regulatory network.


4. Multi-step sparse boosting for subgroup identification

As personalized medicine gains popularity, identifying subgroups of patients who can gain higher efficacy from a treatment becomes greatly important. Recently, a number of statistical approaches have been proposed to identify subgroups of patients who may be suitable for different treatments. Traditionally, subgroup identification is achieved by partitioning approaches such as Bayesian approaches [41] or classification and regression trees (CART) [42]. Recently, recursive partitioning methods have gained popularity since they achieve greater generalizability and efficiency. Such methods include MOB [43], PRIM [44], sequential-BATTing [45] and other non-parametric methods. For a detailed literature review of subgroup identification, refer to Lipkovich et al. [46]. In this section, a sparse boosting based subgroup identification method is presented in the context of dense longitudinal data.

In particular, a formal subgroup identification method for high-dimensional dense longitudinal data is presented. It incorporates multi-step sparse boosting into homogeneity pursuit via change point detection. First, a sparse boosting algorithm for individualized modeling is performed to obtain initial estimates. Then, change point detection via binary segmentation is used to identify the subgroup structure of the patients. Lastly, the model is refitted on each identified subgroup, and sparse boosting is again utilized to remove irrelevant predictors and yield reliable final estimates. The rest of the section is organized as follows. In Section 4.1, the subgroup model is formulated and a detailed method for subgroup identification and estimation is presented. In Section 4.2, the subgroup identification technique is evaluated through simulation studies. In Section 4.3, the feasibility and applicability of the approach are validated by studying a wallaby growth dataset.

4.1 Methodology

4.1.1 Patients model

Let $Y_{it}$ be the continuous measurement at the $t$th follow-up for patient $i$, where $i = 1, \ldots, n$, $t = 1, \ldots, T_i$. Let $X_{it} = (X_{it,1}, \ldots, X_{it,p})$ be the corresponding $p$-dimensional predictors. Assume the $n$ patients are independent. The following longitudinal model for the patients is considered:

$$Y_{it} = \tilde\beta_{i,0} + \sum_{j=1}^{p} X_{it,j}\,\tilde\beta_{i,j} + \varepsilon_{it}, \quad i = 1, \ldots, n,\ t = 1, \ldots, T_i, \tag{24}$$

where $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{iT_i})^T$, $i = 1, \ldots, n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different $i$, but the components of $\varepsilon_i$ are correlated with each other.

Moreover, the model is further assumed to have the following subgroup structure:

$$\tilde\beta_{i,j} = \begin{cases}\beta_{1,j}, & i \in \Omega_{1,j},\\ \beta_{2,j}, & i \in \Omega_{2,j},\\ \;\;\vdots & \\ \beta_{N_j+1,j}, & i \in \Omega_{N_j+1,j}.\end{cases} \tag{25}$$

The partition $\{\Omega_{k,j}: 1 \le k \le N_j+1\}$ of the regression coefficients $\{\tilde\beta_{i,j}: 1 \le i \le n\}$ is unknown, and thus there are $N_j+1$ subgroups for the $j$th predictor. All patients are divided by the model into at least $\max_j (N_j+1)$ and at most $\prod_{j=0}^{p}(N_j+1)$ subgroups. The patients in the same subgroup share a similar relationship between the response and the predictors and have the same set of regression coefficients, while different subgroups have different overall relationships between the response and covariates. The main aim is to investigate the effects of the predictors on the response for the different subgroups.

However, if the number of predictors under consideration is much larger than the number of patients and the number of follow-ups, estimating the regression coefficients becomes a serious challenge. Therefore, instead of adopting traditional methods (e.g., MLE), the sparse boosting method can be used to estimate the regression coefficients. In this way, the dimensionality of the features can be reduced and the coefficients can be estimated simultaneously.

4.1.2 Subgroup identification and estimation

Denote $\tilde\beta_i = (\tilde\beta_{i,0}, \ldots, \tilde\beta_{i,p})^T$ and $\tilde\beta = (\tilde\beta_1^T, \ldots, \tilde\beta_n^T)^T$. First, an initial estimator of $\tilde\beta_i$ is calculated for each subject $i$ through the sparse boosting approach using his or her own repeated measurements; then, homogeneity pursuit via change point detection is used to identify the change points among the $\beta_{k,j}$'s; lastly, the $\tilde\beta_i$'s are replaced according to the identified subgroup structure, and the final estimator of the regression coefficients is obtained by the sparse boosting algorithm again. The steps for estimating $\tilde\beta_i$ are outlined below.

In the first step, individualized modeling via sparse boosting is performed. For each individual $i$, the initial coefficients $\tilde\beta_i$ can be estimated by minimizing the following least squares loss function:

$$\sum_{t=1}^{T_i}\left(Y_{it} - \tilde\beta_{i,0} - \sum_{j=1}^{p} X_{it,j}\,\tilde\beta_{i,j}\right)^2. \tag{26}$$

Let $Y_i = (Y_{i1}, \ldots, Y_{iT_i})^T$, $X_{it,0} = 1$, $X_{i,j} = (X_{i1,j}, \ldots, X_{iT_i,j})^T$, $X_i = (X_{i,0}, \ldots, X_{i,p})$. Then the function Eq. (26) can be written in matrix form:

$$(Y_i - X_i\tilde\beta_i)^T(Y_i - X_i\tilde\beta_i). \tag{27}$$

Denote $\tilde\beta_i^{(\hat L_i)} = (\tilde\beta_{i,0}^{(\hat L_i)}, \ldots, \tilde\beta_{i,p}^{(\hat L_i)})^T$ the estimator of $\tilde\beta_i$ by sparse boosting with Eq. (27) as the loss function, where $\hat L_i$ is the estimated number of stopping iterations in this step. This is the initial estimator of $\tilde\beta_i$. The detailed sparse boosting algorithm will be presented in the next subsection.

In the second step, homogeneity pursuit via change point detection is performed. The binary segmentation algorithm [47] is used to detect the change points among the $\tilde\beta_{i,j}$, $i = 1, \ldots, n$, and to identify the subgroup structure. Let $\tilde\beta_{i,j}^{(\hat L_i)}$ be the $(j+1)$th component of $\tilde\beta_i^{(\hat L_i)}$. For the $j$th covariate, the $\tilde\beta_{i,j}^{(\hat L_i)}$, $i = 1, \ldots, n$, are sorted in ascending order and denoted by $b_1 \le \cdots \le b_n$. Denote $r_{i,j}$ the rank of $\tilde\beta_{i,j}^{(\hat L_i)}$.

For any $1 \le l_1 < l_2 \le n$, denote the scaled difference between the partial means of the first $\tau - l_1 + 1$ observations and the last $l_2 - \tau$ observations by

$$H_{l_1,l_2}(\tau) = \sqrt{\frac{(l_2-\tau)(\tau-l_1+1)}{l_2-l_1+1}}\;\left|\frac{\sum_{i=\tau+1}^{l_2} b_i}{l_2-\tau} - \frac{\sum_{i=l_1}^{\tau} b_i}{\tau-l_1+1}\right|. \tag{28}$$

Denote $\delta$ the threshold, a tuning parameter that can be selected by AIC or BIC; the binary segmentation algorithm then proceeds as follows (a code sketch is given after the algorithm):

  1. Find $\hat t_1$ such that

    $$H_{1,n}(\hat t_1) = \max_{1 \le \tau < n} H_{1,n}(\tau). \tag{29}$$

If $H_{1,n}(\hat t_1) \le \delta$, there is no change point among $b_l$, $l = 1, \ldots, n$, and the change point detection process terminates. Otherwise, $\hat t_1$ is added to the set of change points and the region $\{\tau: 1 \le \tau \le n\}$ is divided into two subregions: $\{\tau: 1 \le \tau \le \hat t_1\}$ and $\{\tau: \hat t_1 + 1 \le \tau \le n\}$.

  2. Find the change points in the two subregions derived in part (1), respectively. Consider the region $\{\tau: 1 \le \tau \le \hat t_1\}$ first. Find $\hat t_2$ such that

    $$H_{1,\hat t_1}(\hat t_2) = \max_{1 \le \tau < \hat t_1} H_{1,\hat t_1}(\tau). \tag{30}$$

    If $H_{1,\hat t_1}(\hat t_2) \le \delta$, there is no change point in the region $\{\tau: 1 \le \tau \le \hat t_1\}$. Otherwise, add $\hat t_2$ to the set of change points and divide the region $\{\tau: 1 \le \tau \le \hat t_1\}$ into two subregions: $\{\tau: 1 \le \tau \le \hat t_2\}$ and $\{\tau: \hat t_2 + 1 \le \tau \le \hat t_1\}$. Similarly, for the region $\{\tau: \hat t_1 + 1 \le \tau \le n\}$, $\hat t_3$ can be found such that

    $$H_{\hat t_1+1,n}(\hat t_3) = \max_{\hat t_1+1 \le \tau < n} H_{\hat t_1+1,n}(\tau). \tag{31}$$

    If $H_{\hat t_1+1,n}(\hat t_3) \le \delta$, there is no change point in the region $\{\tau: \hat t_1 + 1 \le \tau \le n\}$. Otherwise, add $\hat t_3$ to the set of change points and divide the region $\{\tau: \hat t_1 + 1 \le \tau \le n\}$ into two subregions: $\{\tau: \hat t_1 + 1 \le \tau \le \hat t_3\}$ and $\{\tau: \hat t_3 + 1 \le \tau \le n\}$.

  3. For each subregion derived in part (2), the above search is repeated until no change point is detected in any subregion.

  4. The estimated change point locations are sorted in increasing order and denoted by

    $$\hat t_1 < \hat t_2 < \cdots < \hat t_{\hat N_j}, \tag{32}$$

    where $\hat N_j$ is the number of detected change points and can be used to estimate $N_j$.

Further denote $\hat t_0 = 0$ and $\hat t_{\hat N_j+1} = n$. Let $\hat R_{i,j} = \ell$ if $\hat t_{\ell-1} < r_{i,j} \le \hat t_{\ell}$, $1 \le \ell \le \hat N_j + 1$; then $\{\hat R_{i,j}: 1 \le i \le n\}$ can be used to estimate the grouping index $\{R_{i,j}: 1 \le i \le n\}$. The above algorithm is applied to identify the change points for all $j = 0, \ldots, p$, and correspondingly $\{\hat R_{i,j}: 1 \le i \le n,\ 0 \le j \le p\}$ is obtained. Let $\{\hat R_{\ell,j}: 1 \le \ell \le \hat N,\ 0 \le j \le p\}$ be the unique rows of $\{\hat R_{i,j}: 1 \le i \le n,\ 0 \le j \le p\}$; then $\hat N$ is the estimated total number of patient subgroups, and the set of patient indices in group $\ell$ is

$$\hat\Omega_\ell = \{i: \hat R_{i,j} = \hat R_{\ell,j},\ 0 \le j \le p\}, \quad 1 \le \ell \le \hat N. \tag{33}$$

All the coefficients $\tilde\beta_{i,j}$ in the same estimated subgroup $\hat\Omega_\ell$ are treated as equal.
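The change point search of step 2 can be sketched as follows: `H` implements the scaled mean difference of Eq. (28) on the sorted coefficients, and the recursion mirrors the binary segmentation steps listed above (the threshold `delta` is assumed to be supplied, e.g. chosen by BIC); the function names are illustrative.

```python
import numpy as np

def H(b, l1, l2, tau):
    """Scaled difference of partial means, Eq. (28); b is sorted, indices are 1-based."""
    left = b[l1 - 1:tau].mean()                       # mean of b_{l1}, ..., b_{tau}
    right = b[tau:l2].mean()                          # mean of b_{tau+1}, ..., b_{l2}
    scale = np.sqrt((l2 - tau) * (tau - l1 + 1) / (l2 - l1 + 1))
    return scale * abs(right - left)

def binary_segment(b, l1, l2, delta, found):
    """Recursively collect change points detected on the region {l1, ..., l2}."""
    if l2 - l1 < 1:
        return                                        # region too small to split
    stats = [H(b, l1, l2, tau) for tau in range(l1, l2)]
    t_hat = l1 + int(np.argmax(stats))
    if max(stats) <= delta:
        return                                        # no change point in this region
    found.append(t_hat)
    binary_segment(b, l1, t_hat, delta, found)        # left subregion
    binary_segment(b, t_hat + 1, l2, delta, found)    # right subregion

def detect_change_points(beta_j, delta):
    """Change points among the sorted j-th coefficients of the n patients."""
    b = np.sort(np.asarray(beta_j, dtype=float))
    found = []
    binary_segment(b, 1, len(b), delta, found)
    return sorted(found)
```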

In the third step, subgroup modeling is performed by sparse boosting. Incorporating the patient structure identified in step 2, the model is refitted to each of the subgroups via sparse boosting with the following least squares loss function

$$\sum_{i \in \hat\Omega_\ell}\sum_{t=1}^{T_i}\left(Y_{it} - \tilde\beta_{i,0} - \sum_{j=1}^{p} X_{it,j}\,\tilde\beta_{i,j}\right)^2, \quad 1 \le \ell \le \hat N. \tag{34}$$

Further denote $Y_\ell = (Y_{\hat\Omega_\ell(1)}^T, \ldots, Y_{\hat\Omega_\ell(|\hat\Omega_\ell|)}^T)^T$, $X_{\ell,j} = (X_{\hat\Omega_\ell(1),j}^T, \ldots, X_{\hat\Omega_\ell(|\hat\Omega_\ell|),j}^T)^T$, $X_\ell = (X_{\ell,0}, \ldots, X_{\ell,p})$, and $\tilde\beta_\ell$ the common coefficient vector shared by all patients in $\hat\Omega_\ell$, for $\ell = 1, \ldots, \hat N$, where $\hat\Omega_\ell(i)$ is the $i$th element of $\hat\Omega_\ell$ and $|\hat\Omega_\ell|$ is the number of elements in $\hat\Omega_\ell$. The function Eq. (34) can be written in matrix form:

$$(Y_\ell - X_\ell\tilde\beta_\ell)^T(Y_\ell - X_\ell\tilde\beta_\ell), \quad 1 \le \ell \le \hat N. \tag{35}$$

Denote $\tilde\beta_\ell^{(\hat L_\ell)}$ the estimate of $\tilde\beta_\ell$ by sparse boosting with Eq. (35) as the loss function, where $\hat L_\ell$ is the estimated number of stopping iterations in this step. The estimator of the coefficients $\tilde\beta_i$ is

$$\hat\beta_i = \tilde\beta_\ell^{(\hat L_\ell)} \quad \text{for } i \in \hat\Omega_\ell, \quad 1 \le i \le n. \tag{36}$$

More details about how to use sparse boosting to obtain $\{\tilde\beta_i^{(\hat L_i)}\}_{1 \le i \le n}$ and $\{\tilde\beta_\ell^{(\hat L_\ell)}\}_{1 \le \ell \le \hat N}$ are given in the following subsection.

4.1.3 Multi-step sparse boosting techniques

gMDL can be used as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion, which avoids the selection of a tuning parameter. gMDL can be expressed in the following form:

$$\mathrm{gMDL}(Y, \mathrm{RSS}, \mathrm{trace}(B)) = \log F + \frac{\mathrm{trace}(B)}{|Y|}\log\left(\frac{Y^T Y - \mathrm{RSS}}{\mathrm{trace}(B)\times F}\right), \quad F = \frac{\mathrm{RSS}}{|Y| - \mathrm{trace}(B)}, \tag{37}$$

where $Y$ is the vector of the response variable, $|Y|$ is the length of $Y$, $B$ is the boosting operator and $\mathrm{RSS}$ is the residual sum of squares.

The sparse boosting procedure is described in detail. The starting value of $\tilde\beta_i$ is set to the zero vector, i.e. $\tilde\beta_i^{(0)} = 0$, and in the $l_i$th iteration ($0 < l_i \le L_i$, where $L_i$ is the maximum number of iterations considered in this step), the current residual $R^{(l_i)} = Y_i - X_i\tilde\beta_i^{(l_i-1)}$ is used to fit each $j$th element $X_{i,j}$, $j = 0, \ldots, p$. The fit, denoted by $\hat\lambda_j^{(l_i)}$, is obtained by minimizing the squared loss function $(R^{(l_i)} - X_{i,j}\lambda)^T(R^{(l_i)} - X_{i,j}\lambda)$ with respect to $\lambda$. Thus, the least squares estimate is $\hat\lambda_j^{(l_i)} = (X_{i,j}^T X_{i,j})^{-1}X_{i,j}^T R^{(l_i)}$, the corresponding hat matrix is $H_j = X_{i,j}(X_{i,j}^T X_{i,j})^{-1}X_{i,j}^T$, and the residual sum of squares is $\mathrm{RSS}_j^{(l_i)} = (R^{(l_i)} - X_{i,j}\hat\lambda_j^{(l_i)})^T(R^{(l_i)} - X_{i,j}\hat\lambda_j^{(l_i)})$. The selected entry $\hat s_{l_i}$ is obtained by:

$$\hat s_{l_i} = \arg\min_{0 \le j \le p}\mathrm{gMDL}\left(Y_i, \mathrm{RSS}_j^{(l_i)}, \mathrm{trace}(B_j^{(l_i)})\right), \tag{38}$$

where $B_j^{(1)} = H_j$ and $B_j^{(l_i)} = I - (I - H_j)(I - \nu H_{\hat s_{l_i-1}})\cdots(I - \nu H_{\hat s_1})$ for $l_i > 1$ is the boosting operator for choosing the $j$th entry in the $l_i$th iteration of this step. Hence, a unique element $X_{i,\hat s_{l_i}}$ is selected at each iteration, and only the corresponding coefficient $\tilde\beta_{i,\hat s_{l_i}}^{(l_i)}$ changes, i.e., $\tilde\beta_{i,\hat s_{l_i}}^{(l_i)} = \tilde\beta_{i,\hat s_{l_i}}^{(l_i-1)} + \nu\hat\lambda_{\hat s_{l_i}}^{(l_i)}$, where $\nu$ is the pre-specified step-size parameter; all the other $\tilde\beta_{i,j}^{(l_i)}$ for $j \ne \hat s_{l_i}$ remain unchanged. This procedure is repeated $L_i$ times and the stopping iteration can be estimated by

$$\hat L_i = \arg\min_{1 \le l_i \le L_i}\mathrm{gMDL}\left(Y_i, \mathrm{RSS}_{\hat s_{l_i}}^{(l_i)}, \mathrm{trace}(B^{(l_i)})\right), \tag{39}$$

where $B^{(l_i)} = I - (I - \nu H_{\hat s_{l_i}})\cdots(I - \nu H_{\hat s_1})$.

From the above sparse boosting approach, the estimator of $\tilde\beta_i$ is $\tilde\beta_i^{(\hat L_i)} = (\tilde\beta_{i,0}^{(\hat L_i)}, \ldots, \tilde\beta_{i,p}^{(\hat L_i)})^T$, $i = 1, \ldots, n$. Then the subgroup structure can be obtained by homogeneity pursuit via change point detection.

Next, sparse boosting is used again for each estimated subgroup. The starting value of $\tilde\beta_\ell$ is set to the zero vector, i.e. $\tilde\beta_\ell^{(0)} = 0$, and in the $l_\ell$th iteration ($0 < l_\ell \le L_\ell$, where $L_\ell$ is the maximum number of iterations considered in this stage), the current residual $R^{(l_\ell)} = Y_\ell - X_\ell\tilde\beta_\ell^{(l_\ell-1)}$ is used to fit each $j$th component $X_{\ell,j}$, $j = 0, \ldots, p$. The fit, denoted by $\hat\lambda_j^{(l_\ell)}$, is calculated by minimizing the squared loss function $(R^{(l_\ell)} - X_{\ell,j}\lambda)^T(R^{(l_\ell)} - X_{\ell,j}\lambda)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat\lambda_j^{(l_\ell)} = (X_{\ell,j}^T X_{\ell,j})^{-1}X_{\ell,j}^T R^{(l_\ell)}$, the corresponding hat matrix is $H_j = X_{\ell,j}(X_{\ell,j}^T X_{\ell,j})^{-1}X_{\ell,j}^T$, and the residual sum of squares is $\mathrm{RSS}_j^{(l_\ell)} = (R^{(l_\ell)} - X_{\ell,j}\hat\lambda_j^{(l_\ell)})^T(R^{(l_\ell)} - X_{\ell,j}\hat\lambda_j^{(l_\ell)})$. The chosen element $\hat s_{l_\ell}$ is attained by:

$$\hat s_{l_\ell} = \arg\min_{0 \le j \le p}\mathrm{gMDL}\left(Y_\ell, \mathrm{RSS}_j^{(l_\ell)}, \mathrm{trace}(B_j^{(l_\ell)})\right), \tag{40}$$

where $B_j^{(1)} = H_j$ and $B_j^{(l_\ell)} = I - (I - H_j)(I - \nu H_{\hat s_{l_\ell-1}})\cdots(I - \nu H_{\hat s_1})$ for $l_\ell > 1$ is the boosting operator for choosing the $j$th element in the $l_\ell$th iteration of this stage. Hence, a unique element $X_{\ell,\hat s_{l_\ell}}$ is selected at each iteration, and only the corresponding coefficient $\tilde\beta_{\ell,\hat s_{l_\ell}}^{(l_\ell)}$ changes, i.e., $\tilde\beta_{\ell,\hat s_{l_\ell}}^{(l_\ell)} = \tilde\beta_{\ell,\hat s_{l_\ell}}^{(l_\ell-1)} + \nu\hat\lambda_{\hat s_{l_\ell}}^{(l_\ell)}$, where $\nu$ is the pre-specified step-size parameter; all the other $\tilde\beta_{\ell,j}^{(l_\ell)}$ for $j \ne \hat s_{l_\ell}$ remain unchanged. This procedure is repeated $L_\ell$ times and the stopping iteration can be estimated by

$$\hat L_\ell = \arg\min_{1 \le l_\ell \le L_\ell}\mathrm{gMDL}\left(Y_\ell, \mathrm{RSS}_{\hat s_{l_\ell}}^{(l_\ell)}, \mathrm{trace}(B^{(l_\ell)})\right), \tag{41}$$

where $B^{(l_\ell)} = I - (I - \nu H_{\hat s_{l_\ell}})\cdots(I - \nu H_{\hat s_1})$.

From this second application of sparse boosting, the estimator of $\tilde\beta_\ell$ is $\tilde\beta_\ell^{(\hat L_\ell)} = (\tilde\beta_{\ell,0}^{(\hat L_\ell)}, \ldots, \tilde\beta_{\ell,p}^{(\hat L_\ell)})^T$, $\ell = 1, \ldots, \hat N$.
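Finally, the three stages can be chained as in the sketch below. Here `sparse_boost_ols` stands for an unweighted componentwise sparse boosting fit with gMDL stopping (a named assumption, not the chapter's code), `detect_change_points` is the helper sketched in Section 4.1.2, and patients with identical label vectors are pooled as in Eq. (33).

```python
import numpy as np
from collections import defaultdict

def multi_step_sparse_boost(Y_list, X_list, delta, nu=0.1, K=200):
    """Subgroup identification via multi-step sparse boosting.

    Y_list[i] : (T_i,) responses of patient i
    X_list[i] : (T_i, p+1) design of patient i (first column of ones)
    delta     : threshold for binary segmentation
    """
    n, p1 = len(Y_list), X_list[0].shape[1]

    # Step 1: individualized sparse boosting fits.
    beta_init = np.vstack([sparse_boost_ols(Y_list[i], X_list[i], nu=nu, K=K)
                           for i in range(n)])              # (n, p+1)

    # Step 2: change point detection per coefficient, then label each patient.
    labels = np.zeros((n, p1), dtype=int)
    for j in range(p1):
        cps = np.array(detect_change_points(beta_init[:, j], delta))
        rank = np.empty(n, dtype=int)
        rank[np.argsort(beta_init[:, j])] = np.arange(1, n + 1)   # ranks r_{i,j}
        labels[:, j] = np.searchsorted(cps, rank, side="left")    # group label for covariate j

    # Patients sharing the whole label vector form one subgroup (Eq. 33).
    groups = defaultdict(list)
    for i in range(n):
        groups[tuple(labels[i])].append(i)

    # Step 3: refit each subgroup by sparse boosting on the pooled data.
    beta_final = np.zeros_like(beta_init)
    for members in groups.values():
        Yg = np.concatenate([Y_list[i] for i in members])
        Xg = np.vstack([X_list[i] for i in members])
        beta_final[members] = sparse_boost_ols(Yg, Xg, nu=nu, K=K)
    return beta_final, groups
```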

4.2 Simulation

Extensive simulations are conducted to evaluate the performance of the proposed procedure. The accuracy of subgrouping, feature selection, coefficient estimation and prediction is assessed under different numbers of patients and repeated measurements. To better understand the advantage of the proposed method, the following four approaches are compared. M1: the homogeneous model fitting method, which treats all patients as one group and uses sparse boosting on the single model to estimate $\tilde\beta$; M2: the heterogeneous model fitting method, which uses the initial pre-grouping estimate $\tilde\beta_i^{(\hat L_i)}$ as the final estimate of $\tilde\beta_i$; M3: the same as the proposed method, but in step 2, instead of detecting the change points for the coefficients of each covariate $\tilde\beta_{i,j}^{(\hat L_i)}$, $i = 1, \ldots, n$, for $j = 0, \ldots, p$, it detects the change points among $(\tilde\beta_1^{(\hat L_1)T}, \ldots, \tilde\beta_n^{(\hat L_n)T})^T$ jointly, similarly to Ke et al. [48]; M4: the proposed method.

The results from [19] show that the naive homogeneous model fitting method M1 can rarely identify the important covariates, while the over-parameterized model fitting method M2 and the other two methods (M3 & M4), which identify subgroup structures, consistently yield true positives equal to the true number of important covariates. Among the three methods that can identify the important covariates, the proposed method produces the fewest false positives. In addition, the number of false positives decreases as the cluster size increases. Neither the homogeneous model fitting method nor the heterogeneous model fitting method is able to identify the true structure among patients. Method M3 produces many more subgroups than actually exist, while the proposed method M4 identifies a number of subgroups closest to the actual number. Furthermore, the probability of identifying the true subgroups becomes larger when the number of repeated measurements increases. For in-sample prediction, the over-parameterized model M2 performs the best, while methods M3 & M4 perform very competitively. However, for out-of-sample prediction, method M4 is the best. M1 is inferior to M4, yielding poor estimation and prediction results. In summary, the proposed method performs well in terms of subgroup identification, variable selection, estimation and prediction.

4.3 Wallaby growth data analysis

The proposed subgroup identification method is applied to wallaby growth data, which come from the Australasian Data and Story Library (OzDASL) and can be found at http://www.statsci.org/data/oz/wallaby.html. The data set contains growth measurements for 77 Tammar wallabies, taken longitudinally. The response variable is the weight of the wallabies (in tenths of a gram). The predictors are the lengths of the head, ear, body, arm, leg, tail and foot, and their second-order interactions; therefore, a total of 35 predictors are included in the analysis. After removing missing data, 43 Tammar wallabies are kept in the dataset. The number of repeated measurements ranges from 9 to 34 (median: 23). To better understand the wallabies' growth trend, the questions of which body parts affect the weight, and whether the length of each body part has the same effect on the weight for all wallabies (i.e., whether there are any subgroups among the wallabies), are investigated. In addition to the above subgroup identification method (SB-CPD1), the other three methods studied in the simulation are also considered, i.e. the homogeneous model fitting method (SB-Homogeneous), the heterogeneous model fitting method (SB-Heterogeneous), and the method similar to SB-CPD1 but identifying subgroups via the alternative approach of Ke et al. [48] (SB-CPD2). In addition, the following subgroup identification methods incorporating penalized methods are also investigated: similar to the proposed method, but with lasso (Lasso-CPD1), elastic net (ElasticNet-CPD1), SCAD (SCAD-CPD1) or MCP (MCP-CPD1) used instead of sparse boosting.

The results from [19] show that although Lasso-CPD1 and ElasticNet-CPD1 yield smaller in-sample prediction errors by keeping all 35 covariates, they have relatively large out-of-sample prediction errors due to over-fitting. The subgroup identification methods via sparse boosting keep a smaller number of predictors and achieve sparser models than the penalized methods. The proposed method SB-CPD1 identifies fewer subgroups and predictors than the alternative competing methods while producing the smallest out-of-sample prediction errors. In conclusion, the proposed subgroup identification method provides a more precise definition of the various subgroups. It may also lead to more accurate medical decision making for these subjects.


5. Conclusions

In this chapter, we discussed various sparse boosting based machine learning methods in the context of high-dimensional data problems. Specifically, we presented the sparse boosting procedure and the two-step sparse boosting procedure for nonparametric varying-coefficient models with survival data and with repeatedly measured longitudinal data, respectively, to simultaneously perform variable selection and estimation of the functional coefficients. We further presented the multi-step sparse boosting based subgroup identification method for longitudinal patient data to identify subgroups that exhibit different treatment effects. Extensive numerical studies show the validity and effectiveness of the proposed methods, and the real data analyses further demonstrate their usefulness and advantages.
