In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation to select tuning parameters, and they tend to retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods for the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter intends to solve the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have broad application prospects.
- sparse boosting
- high-dimensional data
- machine learning
- variable selection
- data analysis
High-dimensional models have become very popular in the statistical literature, and many new machine learning techniques have been developed to deal with data with a very large number of features. In the past decades, researchers have carried out a great deal of high-dimensional data analysis where the sample size is relatively small but the number of features under consideration is extremely large. It is widely known that including irrelevant predictors in the statistical model may result in unstable estimation and serious computational issues. Thus, variable selection is crucial to address these challenges. Among all developments, regularization procedures such as the LASSO, the smoothly clipped absolute deviation (SCAD) penalty, the MCP and their various extensions [4, 5, 6] have been thoroughly studied and widely used to perform variable selection and estimation simultaneously, in order to improve the prediction accuracy and interpretability of the statistical model. However, these penalized estimation approaches all involve tuning parameters that must be selected by computationally expensive methods such as cross-validation.
In recent years, machine learning methods such as boosting have become very prominent in high-dimensional data settings, since they can improve the selection accuracy substantially and reduce the chance of including irrelevant features. The original boosting algorithm was proposed by Schapire; it is an ensemble method that iteratively combines weak learners to minimize the expected loss. The major difference among boosting algorithms lies in the loss function. For example, AdaBoost uses the exponential loss function, L2 boosting uses the squared error loss function, sparse boosting uses a penalized loss function, and HingeBoost uses a weighted hinge loss function. Recently, more variants of boosting algorithms have been proposed. See, for example, Bühlmann and Hothorn for twin boosting; Komori and Eguchi for pAUCBoost; Wang for twin HingeBoost; Zhao for GSBoosting; and Yang and Zou for ER-Boost. Besides these extensions, much effort has been made in understanding the advantages of boosting, such as its relatively lower over-fitting risk, smaller computational cost, and simpler adjustment to include additional constraints.
In this chapter we review some sparse boosting based methods for the following high-dimensional problems based on three research papers. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter intends to solve the problem of how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which have broad application prospects.
The rest of the chapter is arranged as follows. In Section 2, a sparse boosting method to fit high-dimensional survival data is studied. In Section 3, a two-step sparse boosting approach to carry out variable selection and model-based prediction by fitting high-dimensional models with longitudinal data is studied. In Section 4, a subgroup identification method incorporating a multi-step sparse boosting algorithm for high-dimensional dense longitudinal data is studied. Finally, Section 5 provides concluding remarks.
2. Sparse boosting for survival data
Survival time data are often referred to as time-to-event data, and they are usually censored. Predicting survival time and identifying the risk factors can be very helpful for patient treatment selection, disease prevention strategy, and disease management in evidence-based medicine. A well-known model in survival analysis is the Cox proportional hazards (PH) model, which assumes multiplicative covariate effects on the hazard function. Another popular model is the accelerated failure time (AFT) model, which assumes that the covariate effect is to accelerate or decelerate the lifetime of a disease. The coefficients in this regression model have a direct interpretation as the covariate effects on the mean survival time. Recently, researchers have developed boosting methods to analyze survival data. For example, Schmid and Hothorn proposed a flexible boosting method for parametric AFT models, and Wang and Wang proposed Buckley-James boosting for survival data with right censoring and high dimensionality.
In this section, a sparse boosting method to fit high-dimensional varying-coefficient AFT models is presented. In particular, sparse boosting techniques for right censored survival data are studied. In Section 2.1, the varying-coefficient AFT model for survival data is formulated and a detailed sparse boosting algorithm to fit the model is proposed. In Section 2.2, the proposed sparse boosting technique is evaluated through simulation studies. In Section 2.3, the performance of sparse boosting is examined via a lung cancer data example.
2.1.1 Model and estimation
Let T_i and C_i be the logarithms of the survival time and the censoring time for the i-th subject in a random sample of size n, respectively. In reality, Y_i = min(T_i, C_i) and the censoring indicator δ_i = I(T_i ≤ C_i) are observed. Denote X_i = (X_i1, …, X_ip)ᵀ to be the corresponding p-dimensional predictors, such as gene expressions or biomarkers, for the i-th subject, and U_i to be the univariate index variable. Our observed data set {(Y_i, δ_i, X_i, U_i): i = 1, …, n} is an independently and identically distributed random sample. The varying-coefficient AFT model is:

T_i = Σ_{k=1}^p β_k(U_i) X_ik + ε_i, (1)

where β_1(·), …, β_p(·) are the unknown varying-coefficient functions of the confounder U and ε_i is the random error with E(ε_i) = 0.
A weighted least squares estimation approach is adopted. Let the ω_i's be the Kaplan–Meier weights, which are the jumps in the Kaplan–Meier estimator, computed as ω_1 = δ_(1)/n and ω_i = δ_(i)/(n−i+1) ∏_{j=1}^{i−1} ((n−j)/(n−j+1))^{δ_(j)} for i = 2, …, n. Let Y_(1) ≤ ⋯ ≤ Y_(n) be the order statistics of the Y_i's, δ_(1), …, δ_(n) be the corresponding censoring indicators of the ordered Y_i's, and let X_(i) and U_(i) be defined similarly. Then the weighted least squares loss function is

(1/2) Σ_{i=1}^n ω_i (Y_(i) − Σ_{k=1}^p β_k(U_(i)) X_(i)k)². (2)
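As a concrete illustration, the Kaplan–Meier weights described above can be computed with a few lines of numpy; this is a minimal sketch (the function name `km_weights` is ours, not from the chapter):

```python
import numpy as np

def km_weights(time, delta):
    """Kaplan-Meier (Stute) weights: the jumps of the Kaplan-Meier
    estimator, used as weights in the least squares loss.

    time  : observed times (event or censoring)
    delta : censoring indicators (1 = event observed, 0 = censored)
    Returns (w, order): weights aligned with the sorted sample, and
    the sort order itself.
    """
    n = len(time)
    order = np.argsort(time)
    d = np.asarray(delta, dtype=float)[order]
    w = np.zeros(n)
    surv = 1.0  # running product term of the Kaplan-Meier estimator
    for i in range(n):  # i is the 0-based rank of the ordered observation
        w[i] = d[i] / (n - i) * surv
        surv *= ((n - i - 1) / (n - i)) ** d[i]
    return w, order
```

With no censoring the weights reduce to 1/n each, so the weighted loss recovers ordinary least squares; a censored observation receives zero weight and its mass is redistributed to later event times.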
Let B(u) = (B_1(u), …, B_L(u))ᵀ be an equal-spaced B-spline basis, where L is the dimension of the basis. Under certain smoothness conditions, the Curry–Schoenberg theorem implies that every smooth function β_k(u) can be approximated by

β_k(u) ≈ B(u)ᵀγ_k, (3)

where γ_k is a vector of length L. Then the weighted least squares loss function Eq. (2) can be approximated by

(1/2) Σ_{i=1}^n ω_i (Y_(i) − Σ_{k=1}^p B(U_(i))ᵀγ_k X_(i)k)². (4)
Denote Z_ik = X_(i)k B(U_(i)) for k = 1, …, p, Z_i = (Z_i1ᵀ, …, Z_ipᵀ)ᵀ, Z = (Z_1, …, Z_n)ᵀ, γ = (γ_1ᵀ, …, γ_pᵀ)ᵀ, Y = (Y_(1), …, Y_(n))ᵀ and W = diag(ω_1, …, ω_n). Then the objective function Eq. (4) may be written in the following matrix form:

(1/2)(Y − Zγ)ᵀW(Y − Zγ). (5)
The estimation may yield a closed-form solution for the coefficients when the dimensionality is small or moderate; with high dimensionality the solution cannot be easily achieved. Let γ̂ be the estimator of γ from the sparse boosting approach with the weighted squared loss function Eq. (5), where the number of stopping iterations is estimated from the data. Then the estimates of the coefficient functions are given by

β̂_k(u) = B(u)ᵀγ̂_k, k = 1, …, p. (6)
Instead of using regularized estimation approaches, a sparse boosting method to estimate γ is presented in the following subsection.
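The equal-spaced B-spline basis used above can be made concrete with the Cox-de Boor recursion. The sketch below is our own helper (not the chapter's code); it evaluates all L basis functions at given index-variable values, producing the columns that are multiplied by each predictor to form the design matrix:

```python
import numpy as np

def bspline_basis(u, n_basis=6, degree=3, lo=0.0, hi=1.0):
    """Evaluate an equal-spaced (clamped) B-spline basis at points u
    via the Cox-de Boor recursion.  Returns a (len(u), n_basis) matrix."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    # clamped knot vector: extra copies at each end, equal-spaced inside
    breaks = np.linspace(lo, hi, n_basis - degree + 1)
    t = np.r_[[lo] * degree, breaks, [hi] * degree]
    # degree 0: indicator functions of the knot spans
    B = np.array([(t[j] <= u) & (u < t[j + 1]) for j in range(len(t) - 1)],
                 dtype=float).T
    # Cox-de Boor recursion, with the convention 0/0 := 0 at repeated knots
    for d in range(1, degree + 1):
        Bn = np.zeros((len(u), B.shape[1] - 1))
        for j in range(Bn.shape[1]):
            if t[j + d] > t[j]:
                Bn[:, j] += (u - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                Bn[:, j] += (t[j + d + 1] - u) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
        B = Bn
    return B
```

The design entries of the linearized model are then X_ik times these basis columns, so each predictor contributes a block of L columns.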
2.1.2 Sparse boosting techniques
The key idea of sparse boosting is to replace the empirical risk function in L2 boosting with a penalized empirical risk function, which combines the squared loss with the trace of the boosting operator as a measure of boosting complexity, and then to perform gradient descent in function space iteratively. Thus sparse boosting produces sparser models than L2 boosting. The g-prior minimum description length (gMDL) proposed by Hansen and Yu can be used as the penalized empirical risk function for the update criterion in each iteration and for the stopping criterion. The gMDL takes the form:

gMDL = (n/2) log S + (k/2) log F + log n, with S = RSS/(n − k) and F = (YᵀY − RSS)/(kS). (7)

Here RSS is the residual sum of squares and k is the trace of the boosting operator. The model that achieves the shortest description of the data is selected. The advantage is that gMDL has a data-dependent penalty for each dimension, since the penalty is explicitly given as a function of the data only; thus the selection of a tuning parameter can be avoided.
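A minimal implementation of this criterion might look as follows; the constants follow Hansen and Yu's regression form of gMDL, so treat it as a sketch rather than the chapter's exact display:

```python
import numpy as np

def gmdl(y, rss, df):
    """g-prior MDL score; smaller means a shorter description of the data.

    y   : response vector
    rss : residual sum of squares of the current boosting fit
    df  : trace of the boosting operator (model complexity)
    """
    n = len(y)
    s = rss / (n - df)                     # variance estimate
    f = (np.sum(y ** 2) - rss) / (df * s)  # F-like signal-to-noise term
    return 0.5 * n * np.log(s) + 0.5 * df * np.log(f) + np.log(n)
```

In the boosting loop the candidate component with the smallest `gmdl` value is selected, and the iteration minimizing it over the whole path gives the stopping time, so no cross-validated tuning parameter is needed.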
The sparse boosting procedure is described in detail as follows. The initial value of the coefficient vector is set to be a zero vector, i.e. γ̂_k^(0) = 0 for k = 1, …, p, while in each m-th iteration (m = 1, …, M, with M being the total number of iterations) the current residual R^(m) = Y − Zγ̂^(m−1) is regressed on every k-th working element Z_k. The fit, denoted by γ̃_k, can be obtained by minimizing the weighted squared loss function with respect to γ_k. Hence the weighted least squares estimate is γ̃_k = (Z_kᵀWZ_k)⁻¹Z_kᵀWR^(m), the corresponding hat matrix is H_k = Z_k(Z_kᵀWZ_k)⁻¹Z_kᵀW, and the weighted residual sum of squares is RSS_k = (R^(m) − Z_kγ̃_k)ᵀW(R^(m) − Z_kγ̃_k). The selected component k̂_m is the one minimizing the gMDL criterion evaluated at RSS_k and the trace of the updated boosting operator B_{m,k} = B_{m−1} + νH_k(I − B_{m−1}), k = 1, …, p, which is the boosting operator for selecting the k-th component in the m-th iteration. Therefore, at each iteration only one working component is chosen, and only the corresponding coefficient vector changes, i.e. it is incremented by νγ̃ of the selected component, where ν is the step size, while all the other coefficient vectors remain the same. This process is repeated for M iterations, and the stopping iteration m̂ is estimated by minimizing the gMDL criterion over m = 1, …, M.
From this sparse boosting procedure, the estimator of γ is obtained as γ̂ = γ̂^(m̂). The sparse boosting algorithm for the varying-coefficient AFT model can be summarized as follows:
Sparse Boosting Algorithm for Varying-Coefficient AFT Model.
Initialization. Set m = 0 and γ̂^(0) = 0 (component-wise).
Iteration. Increase m by 1. Compute the componentwise weighted least squares fits of the current residual and select the component k̂_m that minimizes the gMDL criterion.
Update. Add ν times the componentwise fit of the selected component to its coefficient vector, where ν is the step size; keep all other coefficient vectors unchanged.
Iteration. Repeat steps (b)-(c) for M iterations.
Stopping. Estimate the stopping iteration as m̂ = argmin_{1≤m≤M} gMDL(m). Thus, γ̂ = γ̂^(m̂) is the estimate for γ, and β̂_k(u) = B(u)ᵀγ̂_k, k = 1, …, p, are the final estimators for the varying coefficients.
According to the boosting literature, the selection of the step size ν is of minor importance as long as it is small. A smaller value of ν achieves higher prediction accuracy but requires a larger number of boosting iterations and more computing time. A typical value used in the literature is ν = 0.1.
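Putting the pieces together, the loop above can be sketched as a simplified componentwise sparse boosting routine. This is our own simplification (constant coefficients, no censoring weights, an internal gMDL helper with small numerical guards), not the chapter's exact algorithm:

```python
import numpy as np

def sparse_boost(X, y, n_iter=150, nu=0.1):
    """Componentwise sparse boosting with a gMDL-type update/stopping rule.

    At each iteration every single column is fit to the current residual;
    the component whose tentative update minimizes the gMDL score (which
    charges the trace of the accumulated boosting operator as complexity)
    is selected, and its coefficient moves by a small step `nu`.
    Returns the coefficients at the estimated stopping iteration.
    """
    def gmdl(rss, df):
        n = len(y)
        s = max(rss, 1e-12) / max(n - df, 1e-12)
        f = max((y @ y - rss) / max(df * s, 1e-12), 1e-12)
        return 0.5 * n * np.log(s) + 0.5 * df * np.log(f) + np.log(n)

    n, p = X.shape
    beta = np.zeros(p)
    B = np.zeros((n, n))          # accumulated boosting operator
    resid = y.copy()
    path, scores = [], []
    for _ in range(n_iter):
        best = None
        for j in range(p):
            xj = X[:, j]
            gamma = (xj @ resid) / (xj @ xj)     # LS fit to current residual
            H = np.outer(xj, xj) / (xj @ xj)     # hat matrix of column j
            B_j = B + nu * H @ (np.eye(n) - B)   # tentative operator update
            r_j = resid - nu * gamma * xj        # tentative residual
            score = gmdl(r_j @ r_j, np.trace(B_j))
            if best is None or score < best[0]:
                best = (score, j, gamma, B_j, r_j)
        score, j, gamma, B, resid = best
        beta[j] += nu * gamma
        path.append(beta.copy())
        scores.append(score)
    return path[int(np.argmin(scores))]          # estimated stopping iteration
```

For the varying-coefficient AFT model, the columns would be the spline-expanded blocks and the inner products would carry the Kaplan–Meier weights; the selection and stopping logic is unchanged.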
The performance of the above sparse boosting algorithm is evaluated on simulated data. The L2 boosting and sparse boosting methods are compared in terms of variable selection and function estimation. The sparse boosting method is the one presented in this section, while L2 boosting is a relatively simpler version that in general does not achieve sparse solutions.
The simulation results show that both boosting methods can identify the important variables, while sparse boosting selects far fewer irrelevant variables than L2 boosting. Although the in-sample prediction error of L2 boosting is slightly smaller than that of sparse boosting (since the former yields larger model sizes), the average root mean integrated squared error of sparse boosting is much smaller than that of L2 boosting. Furthermore, when the smoothness assumption of the Curry–Schoenberg theorem is violated for the coefficient functions, the variable selection performance remains good. In summary, sparse boosting outperforms L2 boosting in terms of parameter estimation and variable selection.
2.3 Lung cancer data analysis
Lung cancer is the top cancer killer in the U.S. Identifying relevant gene expressions in lung cancer is important for treatment and prevention. Our data are from a large multi-site blinded validation study with 442 lung adenocarcinomas. Age is treated as the potential confounder in this analysis, since it is usually strongly correlated with survival time. After removing patients with missing measurements in the predictors or overall survival, a total of 439 patients are left in the analysis. For each patient, 22,283 gene expressions are available. The median follow-up time is 46 months (range: 0.03 to 204 months), with an overall censoring rate of 46.47%. The median age at diagnosis is 65 years (range: 33 to 87 years). After adopting a marginal screening procedure to screen out irrelevant genes, variable selection approaches are used to identify important genes associated with lung cancer. For comparison, in addition to L2 boosting and the proposed sparse boosting, the following existing variable selection approaches for constant-coefficient AFT models are also considered: Buckley-James boosting with linear least squares, Buckley-James twin boosting with linear least squares, and Buckley-James regression with the elastic net penalty and the SCAD penalty, respectively.
The results show that L2 boosting and sparse boosting for the varying-coefficient AFT model not only produce relatively sparser models, but also have smaller in-sample and out-of-sample prediction errors compared to the four methods for constant-coefficient AFT models. Again, sparse boosting produces an even sparser model than L2 boosting. In conclusion, including age in the varying-coefficient AFT model can lead to more accurate estimates than the constant-coefficient AFT model, and the proposed sparse boosting method for the varying-coefficient AFT model performs well in terms of estimation, prediction, and sparsity.
3. Two-step sparse boosting for longitudinal data
Longitudinal data contain repeated measurements collected from the same respondents over time. The assumption that all measurements are independent does not hold for such data. One important question in longitudinal analysis is how to make efficient inference by taking into account the within-subject correlation. This question has been investigated in depth by many researchers [31, 32] for parametric models. Semiparametric and nonparametric models for longitudinal data are also presented in the literature; see [33, 34]. Recently, there have been some developments on high-dimensional longitudinal data using varying-coefficient models [35, 36]. All previous studies adopted penalty methods.
In this section, a two-step sparse boosting approach is presented to perform variable selection and model-based prediction. Specifically, high-dimensional varying-coefficient models with longitudinal data are considered. In the first step, the sparse boosting approach is utilized to obtain an estimate of the correlation structure. In the second step, the within-subject correlation structure is incorporated, and variable selection and coefficient estimation are achieved by sparse boosting again. The rest of this section is arranged as follows. In Section 3.1, the varying-coefficient model for longitudinal data is formulated and a two-step sparse boosting algorithm is presented. In Section 3.2, simulation studies are conducted to illustrate the validity of the two-step sparse boosting method. In Section 3.3, the performance of the two-step method is assessed by studying yeast cell cycle gene expression data.
3.1.1 Model and estimation
Let Y_ij be the continuous outcome for the j-th measurement of individual i taken at time t_ij ∈ T, where T is the time interval on which the measurements are taken. Denote X_ij to be the corresponding p-dimensional covariate vector. The following varying-coefficient model, which can capture the dynamic impacts of the covariates on the response variable, is considered:

Y_ij = Σ_{k=1}^p β_k(t_ij) X_ij,k + ε_ij, (10)

where β_1(·), …, β_p(·) are the unknown smooth coefficient functions of time and ε_i = (ε_i1, …, ε_im)ᵀ are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different individuals, but the components of ε_i are correlated with each other. Without loss of generality, a balanced longitudinal study is considered in the following implementation, i.e., t_ij = t_j and m_i = m for all i.
The estimation procedure is presented below. In the first step, the within-subject correlation is ignored and the coefficients are estimated by minimizing the following least squares loss function:

(1/2) Σ_{i=1}^n Σ_{j=1}^m (Y_ij − Σ_{k=1}^p β_k(t_j) X_ij,k)². (11)
The B-spline basis is used to estimate the coefficient functions β_k(·). Denote B(t) = (B_1(t), …, B_L(t))ᵀ to be an equal-spaced B-spline basis of dimension L. Under certain smoothness assumptions, the function β_k(t) can be approximated by

β_k(t) ≈ B(t)ᵀγ_k, (12)

where γ_k is a loading vector of length L. Then the least squares loss function Eq. (11) is close to

(1/2) Σ_{i=1}^n Σ_{j=1}^m (Y_ij − Σ_{k=1}^p B(t_j)ᵀγ_k X_ij,k)². (13)
Further denote Z_ij,k = X_ij,k B(t_j), Z_ij = (Z_ij,1ᵀ, …, Z_ij,pᵀ)ᵀ, Z_i = (Z_i1, …, Z_im)ᵀ, Z = (Z_1ᵀ, …, Z_nᵀ)ᵀ, Y_i = (Y_i1, …, Y_im)ᵀ, Y = (Y_1ᵀ, …, Y_nᵀ)ᵀ and γ = (γ_1ᵀ, …, γ_pᵀ)ᵀ. Then the target function Eq. (13) can be expressed in the matrix format:

(1/2)(Y − Zγ)ᵀ(Y − Zγ). (14)
Denote γ̂_I to be the estimator of γ by sparse boosting with the squared loss function Eq. (14) being the loss function, where the number of stopping iterations in this step is estimated from the data. There is no exact closed form for γ̂_I since it is derived from an iterative algorithm; however, it can be evaluated very quickly in a computer implementation. The detailed algorithm will be presented in the next subsection.
The first step coefficient estimates are given by

β̂_k^I(t) = B(t)ᵀγ̂_I,k, k = 1, …, p. (15)
Write r_i = Y_i − Z_iγ̂_I for the within-subject residual vectors. The covariance matrix Σ can be estimated by the following empirical estimator

Σ̂ = (1/n) Σ_{i=1}^n r_i r_iᵀ. (16)
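In code, the step-I residual matrix directly yields the empirical covariance and the weight matrix used in the second step. This is a sketch; the small ridge term is our own safeguard against singularity, not part of the chapter's estimator:

```python
import numpy as np

def estimate_weights(resid, ridge=1e-6):
    """resid : (n, m) matrix whose i-th row is subject i's step-I residual
    vector over the m time points.  Returns (Sigma_hat, W_hat), where
    Sigma_hat = (1/n) sum_i r_i r_i^T and W_hat is its (ridged) inverse,
    to be used as the weight matrix in the step-II loss."""
    n, m = resid.shape
    sigma = resid.T @ resid / n               # empirical covariance
    w = np.linalg.inv(sigma + ridge * np.eye(m))
    return sigma, w
```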
In the second step, the estimated correlation structure within the repeated measurements is taken into account to form the weighted least squares loss function as follows:

(1/2) Σ_{i=1}^n (Y_i − Z_iγ)ᵀŴ(Y_i − Z_iγ), (17)

where Ŵ = Σ̂⁻¹ is the estimated weight matrix. Denote γ̂_II to be the estimator of γ by sparse boosting with the weighted loss function Eq. (17) being the loss function, where the number of stopping iterations in the second step is estimated from the data. Then the coefficient estimates from the second step are given by

β̂_k^II(t) = B(t)ᵀγ̂_II,k, k = 1, …, p. (18)
Reliable estimates for the coefficient functions can then be obtained. More details about how to use sparse boosting to get γ̂_I and γ̂_II are provided in the following subsection.
3.1.2 Two-step sparse boosting techniques
gMDL can be adopted as the penalized empirical risk function to determine the update criterion in each iteration and the stopping criterion. It takes the same form as in Section 2.1.2, with the residual sum of squares and the trace of the boosting operator computed from the longitudinal fit.
The two-step sparse boosting approach is now presented in detail. In the first step, the starting value of the coefficient vector is set to the zero vector, i.e. γ̂^(0) = 0, and in each m-th iteration (m = 1, …, M_I, where M_I is the maximum number of iterations considered in the first step), the current residual is used to fit each k-th component separately, treating all the within-subject observations as uncorrelated. The fit, denoted by γ̃_k, is calculated by minimizing the squared loss function with respect to γ_k; hence the least squares estimate is γ̃_k = (Z_kᵀZ_k)⁻¹Z_kᵀR^(m), the corresponding hat matrix is H_k = Z_k(Z_kᵀZ_k)⁻¹Z_kᵀ, and the residual sum of squares is ‖R^(m) − H_kR^(m)‖². The chosen element is the component that minimizes the gMDL criterion computed with this residual sum of squares and the trace of the correspondingly updated boosting operator. Hence, a unique element is selected at each iteration, and only the corresponding coefficient vector changes, i.e. it is incremented by νγ̃_k, where ν is the pre-specified step-size parameter; all the other coefficient vectors remain unchanged. This procedure is repeated M_I times, and the number of stopping iterations m̂_I is estimated by minimizing the gMDL criterion over m = 1, …, M_I.
From the first step of sparse boosting, the estimator γ̂_I of γ is obtained. The weight matrix Ŵ = Σ̂⁻¹ can then be easily obtained as well.
In the second step, sparse boosting is used again, taking into account the correlation structure of the repeated measurements estimated in the first step. The initial value of the coefficient vector is set to be the coefficient estimator from the first step of sparse boosting, i.e. γ̂^(0) = γ̂_I, and in each m-th iteration (m = 1, …, M_II, where M_II is the maximum number of iterations under consideration in the second step), the current residual is used to fit each k-th working element by incorporating the within-subject correlation estimator from the first step. The fit, denoted by γ̃_k, is obtained by minimizing the weighted squared loss function with respect to γ_k; the weighted least squares estimate, the corresponding hat matrix, and the weighted residual sum of squares take the same forms as in the first step with the weight matrix Ŵ inserted into each inner product. The chosen element is again the component that minimizes the gMDL criterion. Thus, a unique element is selected at each iteration, and only the corresponding coefficient vector changes by νγ̃_k, while all the others remain the same. This procedure is repeated M_II times, and the estimated number of stopping iterations m̂_II is the minimizer of the gMDL criterion over m = 1, …, M_II.
From the second step of sparse boosting, the estimator γ̂_II of γ is obtained. The two-step sparse boosting algorithm for the varying-coefficient model with longitudinal data can be summarized as follows:
Two-step Sparse Boosting Algorithm with Longitudinal Data.
Step I: Use sparse boosting to estimate covariance matrix.
Initialization. Let m = 0 and γ̂^(0) = 0.
Increase m by 1. Compute the componentwise least squares fits of the current residual and select the component that minimizes the gMDL criterion.
Update. Add ν times the selected componentwise fit to the corresponding coefficient vector, where ν is the step-size parameter; keep all other coefficient vectors unchanged.
Iteration. Repeat steps (b)-(c) for some large iteration number M_I.
Stopping. The optimal iteration number is taken as m̂_I = argmin_{1≤m≤M_I} gMDL(m).
Thus, γ̂_I = γ̂^(m̂_I) is the first step estimator of γ from sparse boosting, and β̂_k^I(t) = B(t)ᵀγ̂_I,k, k = 1, …, p, are the varying coefficient estimates ignoring the within-subject correlation. The covariance matrix Σ can then be estimated by the empirical estimator Eq. (16).
Step II: Use sparse boosting again by incorporating covariance matrix estimator.
Initialization. Let m = 0 and γ̂^(0) = γ̂_I.
Increase m by 1. Compute the componentwise weighted least squares fits of the current residual, incorporating the estimated weight matrix Ŵ, and select the component that minimizes the gMDL criterion.
Update. Add ν times the selected componentwise fit to the corresponding coefficient vector; keep all other coefficient vectors unchanged.
Iteration. Repeat steps (b)-(c) for some large iteration number M_II.
Stopping. The optimal iteration number is taken as m̂_II = argmin_{1≤m≤M_II} gMDL(m).
Therefore, γ̂_II = γ̂^(m̂_II) is the final estimator of γ, and β̂_k(t) = B(t)ᵀγ̂_II,k, k = 1, …, p, are the final varying coefficient estimates from the two-step sparse boosting.
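One convenient way to implement step II is to whiten the data with a Cholesky root of the estimated weight matrix, after which the same unweighted componentwise boosting code from step I applies unchanged. This reduction is a standard generalized-least-squares device, sketched here with our own names:

```python
import numpy as np

def whiten(Y, X, W):
    """Transform longitudinal data so the weighted loss becomes unweighted.

    Y : (n, m) responses, X : (n, m, q) subject design blocks, W : (m, m)
    weight matrix (inverse covariance).  With W = L L^T, the weighted loss
    sum_i (y_i - X_i g)^T W (y_i - X_i g) equals the ordinary squared loss
    of the transformed data (L^T y_i, L^T X_i).
    """
    L = np.linalg.cholesky(W)
    Yw = Y @ L                             # row i becomes (L^T y_i)^T
    Xw = np.einsum('imq,mk->ikq', X, L)    # block i becomes L^T X_i
    return Yw, Xw
```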
Simulation studies are conducted to evaluate the performance of the above two-step sparse boosting algorithm. The following four methods are compared in terms of variable selection and function estimation performance. M1: two-step L2 boosting (uses the squared loss for the update criterion and gMDL for the stopping criterion); M2: two-step sparse boosting; M3: two-step lasso (performs lasso regression in the first step to calculate the estimated within-subject correlation structure using Eq. (14), and uses lasso regression in the second step taking into account the estimated correlation structure); and M4: two-step elastic net regression (similar to M3, with the elastic net mixing parameter 0.5).
The simulation results show that all methods are able to identify the important variables. However, in terms of sparsity, the two-step sparse boosting method performs best, with the smallest number of false positives. Both penalization methods select many more irrelevant variables than the boosting methods, with the elastic net selecting the most. For two-step sparse boosting, the variable selection results are quite stable from step I to step II, but for the other approaches the false positives, and thus the model sizes, grow from step I to step II. Two-step sparse boosting also yields the smallest bias in coefficient estimation among the competing methods. The refined estimates after incorporating the within-subject correlation generally perform better than the initial estimates that ignore it, since the two-step methods gain a reduction in bias, especially when the within-subject correlation is high. In other words, the reduction in bias from step I to step II is much larger when the within-subject correlation is higher. This is intuitive, as the second step takes into account the within-subject correlation structure estimated in the first step. Similar results are obtained for the bias of the estimated covariance matrix: the bias under smaller within-subject correlation is smaller than under larger within-subject correlation, and two-step sparse boosting yields a smaller bias of the estimated covariance matrix than the other competing methods when the within-subject correlation is high. In summary, the performance of two-step sparse boosting in variable selection and functional coefficient estimation is quite satisfactory.
3.3 Yeast cell cycle gene expression data analysis
The cell cycle is one of the most important activities of life, by which cells grow, replicate their chromosomes, undergo mitosis, and split into daughter cells. Thus, identifying cell cycle-regulated genes is very important. Adopting a model-based approach, Luan and Li identified cell cycle-regulated genes based on the α-factor synchronization experiments. All gene expression levels were measured at different time points covering two cell-cycle periods. Using the same subset of the original data, a set of transcription factors (TFs) is included as predictors in the downstream analysis. Wei, Huang and Li showed that the effects of the TFs on gene expression levels are time-dependent. After a norm-based independence screening to filter out irrelevant predictors in a first step, several methods can be used to identify the key TFs involved in gene regulation. In addition to two-step L2 boosting and two-step sparse boosting, which take into account the within-subject correlation in the second step, one-step L2 boosting and one-step sparse boosting, which ignore the within-subject correlation, are also considered for a better comparison. Besides, some two-step penalized approaches are also considered: two-step lasso, two-step adaptive lasso and two-step elastic net (with the elastic net mixing parameter 0.5).
The results show that the boosting approaches yield sparser models than the penalized methods. Sparse boosting yields an even sparser model and smaller estimation and prediction errors than L2 boosting. Two-step boosting achieves better performance than one-step boosting, with smaller estimation and prediction errors. The two-step sparse boosting method yields the sparsest model, with the smallest in-sample and out-of-sample prediction errors compared to the other methods. In terms of the selected TFs, there is a significant overlap between two-step sparse boosting and each of the other methods. In conclusion, the two-step sparse boosting approach performs quite well in terms of variable selection, coefficient estimation and prediction, and can provide useful information in identifying the important TFs that take part in the regulatory network.
4. Multi-step sparse boosting for subgroup identification
As personalized medicine gains popularity, identifying subgroups of patients that can gain a higher efficacy from a treatment becomes greatly important. Recently, many statistical approaches have been proposed to identify subgroups of patients who may be suitable for different treatments. Traditionally, subgroup identification is achieved by parametric partitioning approaches such as Bayesian approaches or classification and regression trees (CART). More recently, recursive partitioning methods have gained popularity, since they achieve greater generalizability and efficiency. Such methods include MOB, PRIM, sequential-BATTing and other non-parametric methods. For a detailed literature review of subgroup identification, refer to Lipkovich et al. In this section, a sparse boosting based subgroup identification method is presented in the context of dense longitudinal data.
In particular, a formal subgroup identification method for high-dimensional dense longitudinal data is presented. It incorporates multi-step sparse boosting into homogeneity pursuit via change point detection. First, a sparse boosting algorithm for individual modeling is performed to obtain initial estimates. Then, change point detection via binary segmentation is used to identify the subgroup structure of the patients. Lastly, the model is refitted on each identified subgroup, and sparse boosting is again utilized to remove irrelevant predictors and yield reliable final estimates. The rest of the section is organized as follows. In Section 4.1, the subgroup model is formulated and a detailed method for subgroup identification and estimation is presented. In Section 4.2, the subgroup identification technique is evaluated through simulation studies. In Section 4.3, the feasibility and applicability of the approach are validated by studying a wallaby growth dataset.
4.1.1 Patients model
Denote Y_ij to be the continuous measurement of the j-th follow-up for patient i, where i = 1, …, n and j = 1, …, m. Let X_ij be the corresponding p-dimensional predictors. Assume the patients are independent. The following longitudinal model for the patients is considered:

Y_ij = Σ_{k=1}^p β_ik X_ij,k + ε_ij,

where ε_i = (ε_i1, …, ε_im)ᵀ are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different patients, but the components of ε_i are correlated with each other.
Moreover, the model is further assumed to have the following subgroup structure: for each predictor, the patients fall into a small number of subgroups within which the corresponding coefficients are equal. The partition for the k-th regression coefficient is unknown, and thus there are G_k subgroups for the k-th predictor. All patients are divided into at least max_k G_k and at most ∏_{k=1}^p G_k subgroups by the model. The patients in the same subgroup share a similar relationship between the response and the predictors and have the same set of regression coefficients, while different subgroups have different overall relationships between the response and the covariates. The main aim is to investigate the effects of the predictors on the response for the different subgroups.
However, when the number of predictors under consideration is much larger than the number of patients and the number of follow-ups, estimating the regression coefficients becomes seriously challenging. Therefore, instead of adopting traditional methods (e.g., maximum likelihood estimation), the sparse boosting method can be used to estimate the regression coefficients. In this way, the dimensionality of the features can be reduced and the coefficient estimates can be obtained simultaneously.
4.1.2 Subgroup identification and estimation
Denote β_i = (β_i1, …, β_ip)ᵀ for i = 1, …, n. Firstly, an initial estimator for β_i is calculated for each subject through the sparse boosting approach using his or her own repeated-measurements data; then, homogeneity pursuit via change point detection is used to identify the change points among the β̂_i's; lastly, the β̂_i's are replaced according to the identified subgroup structure, and the final estimator of the regression coefficients is obtained by the sparse boosting algorithm again. The steps for estimating β_i are outlined below.
In the first step, individualized modeling via sparse boosting is performed. For each i-th individual, the initial coefficients are estimated by minimizing the following least squares loss function:

(1/2) Σ_{j=1}^m (Y_ij − Σ_{k=1}^p β_ik X_ij,k)². (26)

Let Y_i = (Y_i1, …, Y_im)ᵀ and X_i = (X_i1, …, X_im)ᵀ. Then the function Eq. (26) can be written in the matrix form:

(1/2)(Y_i − X_iβ_i)ᵀ(Y_i − X_iβ_i). (27)

Denote β̂_i to be the estimator of β_i by sparse boosting with Eq. (27) being the loss function, where the number of stopping iterations in this step is estimated by the gMDL criterion. This is the initial estimator of β_i. The detailed sparse boosting algorithm will be presented in the next subsection.
In the second step, homogeneity pursuit via change point detection is performed. The binary segmentation algorithm is used to detect the change points among the initial estimates and to identify the subgroup structure. Let β̂_ik be the k-th component of β̂_i. For the k-th covariate, the estimates β̂_ik, i = 1, …, n, are sorted in ascending order and denoted by b_(1)k ≤ ⋯ ≤ b_(n)k. Denote r_ik to be the rank of β̂_ik.

For any 1 ≤ t < n, denote the scaled difference between the partial means of the first t observations and the last n − t observations to be

Δ_k(t) = √(t(n−t)/n) |(1/t) Σ_{i=1}^t b_(i)k − (1/(n−t)) Σ_{i=t+1}^n b_(i)k|. (28)
Denote δ to be the threshold, which is a tuning parameter and can be selected by AIC or BIC; then the binary segmentation algorithm proceeds as follows:
(1) Find $\hat l_1$ such that
$$H_k(\hat l_1) = \max_{1 \le l < n} H_k(l). \tag{29}$$
If $H_k(\hat l_1) \le \delta$, there is no change point among $\hat\beta_{(1)k}^{(0)}, \dots, \hat\beta_{(n)k}^{(0)}$, and the change point detection process terminates. Otherwise, $\hat l_1$ is added to the set of change points and the region $\{1, \dots, n\}$ is divided into two subregions: $\{1, \dots, \hat l_1\}$ and $\{\hat l_1 + 1, \dots, n\}$.
(2) Find the change points in the two subregions derived in part (1), respectively. Consider the region $\{1, \dots, \hat l_1\}$ first. Find $\hat l_2$ such that
$$H_k(\hat l_2) = \max_{1 \le l < \hat l_1} H_k(l), \tag{30}$$
with the statistic computed within this subregion. If $H_k(\hat l_2) \le \delta$, there is no change point in the region $\{1, \dots, \hat l_1\}$. Otherwise, add $\hat l_2$ to the set of change points and divide the region into two subregions: $\{1, \dots, \hat l_2\}$ and $\{\hat l_2 + 1, \dots, \hat l_1\}$. Similarly, for the region $\{\hat l_1 + 1, \dots, n\}$, $\hat l_3$ can be found such that
$$H_k(\hat l_3) = \max_{\hat l_1 < l < n} H_k(l), \tag{31}$$
again computed within that subregion. If $H_k(\hat l_3) \le \delta$, there is no change point in the region $\{\hat l_1 + 1, \dots, n\}$. Otherwise, add $\hat l_3$ to the set of change points and divide the region into two subregions: $\{\hat l_1 + 1, \dots, \hat l_3\}$ and $\{\hat l_3 + 1, \dots, n\}$.
(3) For each subregion derived in part (2), the above search is repeated until no change point is detected in any subregion.
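The recursive search in parts (1) through (3) can be sketched as follows. This is an illustrative implementation, not the chapter's code: the function names, the threshold value and the toy data are assumptions, and the scaled partial-mean difference is recomputed within each subregion.

```python
import numpy as np

def scaled_diff(seg, l):
    """Scaled difference between the partial means of the first l and
    the last len(seg) - l values of a sorted segment."""
    n = len(seg)
    return np.sqrt(l * (n - l) / n) * abs(seg[:l].mean() - seg[l:].mean())

def binary_segmentation(b, delta, lo=0, hi=None):
    """Recursively detect change points among the sorted values b.
    Returns indices k such that a change occurs between b[k-1] and b[k]."""
    if hi is None:
        hi = len(b)
    if hi - lo < 2:                      # a segment of size < 2 cannot split
        return []
    seg = b[lo:hi]
    stats = [scaled_diff(seg, l) for l in range(1, len(seg))]
    l_star = int(np.argmax(stats)) + 1
    if stats[l_star - 1] <= delta:       # maximal statistic below threshold
        return []
    cp = lo + l_star                     # change point in global coordinates
    return sorted(binary_segmentation(b, delta, lo, cp) + [cp] +
                  binary_segmentation(b, delta, cp, hi))

# Toy example: sorted coefficient estimates from three latent subgroups
b = np.sort(np.concatenate([np.zeros(5), np.full(5, 2.0), np.full(5, 5.0)]))
print(binary_segmentation(b, delta=1.0))   # -> [5, 10]
```

Each recursive call splits at the maximizer of the statistic only when it exceeds the threshold $\delta$, mirroring the stopping rule in parts (1) and (2).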
The estimated locations of the change points are sorted in increasing order and denoted by
$$\hat l_{(1)} < \hat l_{(2)} < \cdots < \hat l_{(\hat K_k)}, \tag{32}$$
where $\hat K_k$ is the number of detected change points and can be used to estimate the true number of change points for the $k$th covariate. Further denote $\hat l_{(0)} = 0$ and $\hat l_{(\hat K_k + 1)} = n$. Let
$$\hat g_{ik} = s \quad \text{if } \hat l_{(s-1)} < r_i \le \hat l_{(s)}, \qquad s = 1, \dots, \hat K_k + 1, \tag{33}$$
where $\hat g_{ik}$ can be used to estimate the grouping index of subject $i$ for the $k$th covariate. The above algorithm is applied to identify the change points for all $p$ covariates and correspondingly obtain $\hat g_{i1}, \dots, \hat g_{ip}$ for each subject. Combining the groupings across covariates, so that two patients fall in the same subgroup only if they share the same group index for every covariate, $\hat G$ is the estimated total number of subgroups and the index set of patients in group $g$ is denoted by $\mathcal{I}_g$, $g = 1, \dots, \hat G$.
All the coefficient vectors $\beta_i$ in the same estimated subgroup are treated as equal.
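Mapping change points detected among the sorted estimates back to subgroup labels for the original patients uses their ranks. A minimal sketch, with illustrative names and data (one covariate; multi-covariate grouping would intersect these labels across covariates):

```python
import numpy as np

def group_labels(b, change_points):
    """Assign a subgroup label to each subject from change points
    detected among the sorted values of b (a single covariate)."""
    order = np.argsort(b)                       # subjects in ascending order of b
    labels = np.empty(len(b), dtype=int)
    bounds = [0] + list(change_points) + [len(b)]
    for g in range(len(bounds) - 1):            # segment g covers a run of ranks
        labels[order[bounds[g]:bounds[g + 1]]] = g
    return labels

b = np.array([5.1, 0.2, 4.9, 0.1, 2.0])
print(group_labels(b, [2, 3]))   # -> [2 0 2 0 1]
```

Subjects whose initial estimates fall between consecutive change points receive the same label, which is exactly the grouping index used in the refitting step.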
In the third step, subgroup modeling is performed by sparse boosting. Incorporating the patient structure identified in step 2, the model is refitted to each of the $\hat G$ subgroups via sparse boosting with the following least squares loss function:
$$L_{(g)}(\beta_{(g)}) = \sum_{i \in \mathcal{I}_g} \sum_{j=1}^{m_i} \bigl(Y_{ij} - X_{ij}^\top \beta_{(g)}\bigr)^2. \tag{34}$$
Further denote by $Y_{(g)}$ and $X_{(g)}$ the responses and covariates of all $n_g$ patients in subgroup $g$ stacked together, for $g = 1, \dots, \hat G$, where $n_g$ is the number of elements in $\mathcal{I}_g$. The function Eq. (34) can be written in the matrix form:
$$L_{(g)}(\beta_{(g)}) = \|Y_{(g)} - X_{(g)} \beta_{(g)}\|^2. \tag{35}$$
Denote by $\hat\beta_{(g)}$ the estimate of $\beta_{(g)}$ obtained by sparse boosting with Eq. (35) as the loss function, run up to the estimated number of stopping iterations for this step. The estimator of the coefficient of each patient $i \in \mathcal{I}_g$ is then $\hat\beta_i = \hat\beta_{(g)}$.
More details about how to use sparse boosting to obtain $\hat\beta_i^{(0)}$ and $\hat\beta_{(g)}$ are given in the following subsection.
4.1.3 Multi-step sparse boosting techniques
gMDL can be used as the penalized empirical risk function, serving both as the update criterion in each iteration and as the stopping criterion, thereby avoiding the selection of a tuning parameter. gMDL can be expressed in the following form:
$$\mathrm{gMDL}(B, \mathrm{RSS}) = \log S + \frac{\mathrm{tr}(B)}{n} \log F, \qquad S = \frac{\mathrm{RSS}}{n - \mathrm{tr}(B)}, \qquad F = \frac{Y^\top Y - \mathrm{RSS}}{\mathrm{tr}(B)\, S},$$
where $Y$ is the vector of the response variable, $n$ is the length of $Y$, $B$ is the boosting operator and $\mathrm{RSS}$ is the residual sum of squares.
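As a concrete illustration, the score can be computed directly from any linear smoother with fitted values $BY$. A sketch under stated assumptions (the function name and the toy one-covariate data are illustrative):

```python
import numpy as np

def gmdl(Y, B):
    """gMDL score of a linear smoother with fitted values B @ Y;
    tr(B) plays the role of the model dimension."""
    n = len(Y)
    rss = float(np.sum((Y - B @ Y) ** 2))   # residual sum of squares
    k = float(np.trace(B))                  # effective degrees of freedom
    S = rss / (n - k)
    F = (Y @ Y - rss) / (k * S)
    return np.log(S) + (k / n) * np.log(F)

# One-covariate example: B is the hat matrix of a single column x
x = np.array([1.0, 1.0, 2.0, 2.0])
Y = np.array([1.0, 2.0, 3.0, 4.0])
B = np.outer(x, x) / (x @ x)                # tr(B) = 1
print(round(gmdl(Y, B), 4))
```

Because both the fit (through RSS) and the complexity (through tr(B)) enter the score, minimizing gMDL trades them off without a cross-validated tuning parameter.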
The sparse boosting procedure is described in detail. The starting value of the coefficient vector is set to the zero vector, i.e. $\hat\beta^{[0]} = 0$, and in the $m$th iteration ($m = 1, \dots, M$, where $M$ is the maximum number of iterations considered in this step), the current residual $U^{[m]} = Y_i - X_i \hat\beta^{[m-1]}$ is used to fit each of the $p$ candidate components. The fit for the $j$th component, denoted by $\hat\gamma_j^{[m]}$, can be obtained by minimizing the squared loss function with respect to the $j$th coefficient. Thus, the least squares estimate is $\hat\gamma_j^{[m]} = (X_{i,j}^\top X_{i,j})^{-1} X_{i,j}^\top U^{[m]}$ with $X_{i,j}$ the $j$th column of $X_i$, the corresponding hat matrix is $H_j = X_{i,j} (X_{i,j}^\top X_{i,j})^{-1} X_{i,j}^\top$, and the residual sum of squares is $\mathrm{RSS}_{m,j}$. The selected entry is obtained by:
$$\hat s_m = \arg\min_{1 \le j \le p} \mathrm{gMDL}\bigl(B_m^{(j)}, \mathrm{RSS}_{m,j}\bigr),$$
where $B_m^{(j)} = I - (I - \nu H_j)(I - B_{m-1})$ for $j = 1, \dots, p$ is the boosting operator for choosing the $j$th entry in the $m$th iteration in this step. Hence, there is a unique element selected at each iteration, and only the corresponding coefficient changes, i.e., $\hat\beta^{[m]}_{\hat s_m} = \hat\beta^{[m-1]}_{\hat s_m} + \nu \hat\gamma_{\hat s_m}^{[m]}$, where $\nu$ is the pre-specified step-size parameter. All the other components $\hat\beta^{[m]}_j$ for $j \ne \hat s_m$ remain unchanged. This procedure is repeated for $M$ times and the number of iterations is estimated by
$$\hat m = \arg\min_{1 \le m \le M} \mathrm{gMDL}\bigl(B_m, \mathrm{RSS}_m\bigr).$$
From the above sparse boosting approach, the initial estimator of $\beta_i$ is $\hat\beta_i^{(0)} = \hat\beta^{[\hat m]}$, $i = 1, \dots, n$. Then the subgroup structure can be obtained by homogeneity pursuit via change point detection.
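The iteration just described can be sketched end to end. The following is an illustrative implementation under stated assumptions (componentwise least-squares base learners, the gMDL form given earlier, a simulated data set), not the chapter's exact code:

```python
import numpy as np

def gmdl(Y, trB, rss):
    # gMDL score as given earlier, with tr(B) as the effective dimension
    n = len(Y)
    S = rss / (n - trB)
    F = (Y @ Y - rss) / (trB * S)
    return np.log(S) + (trB / n) * np.log(F)

def sparse_boost(X, Y, nu=0.1, M=100):
    """Componentwise sparse L2 boosting: each iteration selects the
    entry whose trial update minimizes gMDL; the stopping iteration
    minimizes gMDL along the boosting path."""
    n, p = X.shape
    I = np.eye(n)
    beta = np.zeros(p)
    B = np.zeros((n, n))                       # current boosting operator
    path, scores = [], []
    hats = [np.outer(X[:, j], X[:, j]) / (X[:, j] @ X[:, j]) for j in range(p)]
    for m in range(M):
        resid = Y - X @ beta
        best = (np.inf, None, None)
        for j in range(p):
            B_trial = I - (I - nu * hats[j]) @ (I - B)   # operator if j is chosen
            rss = float(np.sum(((I - B_trial) @ Y) ** 2))
            score = gmdl(Y, float(np.trace(B_trial)), rss)
            if score < best[0]:
                best = (score, j, B_trial)
        score, j, B = best
        beta[j] += nu * (X[:, j] @ resid) / (X[:, j] @ X[:, j])  # shrunken update
        path.append(beta.copy())
        scores.append(score)
    m_hat = int(np.argmin(scores)) + 1         # estimated stopping iteration
    return path[m_hat - 1], m_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)  # only column 0 is informative
beta_hat, m_hat = sparse_boost(X, Y)
```

On this toy data the estimator concentrates on the informative first column while the uninformative coefficients stay near zero, which is the sparsity effect the gMDL-based selection is designed to produce.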
Next, sparse boosting is used again within each estimated subgroup. The procedure is the same as in the first stage but applied to the pooled subgroup data: starting from the zero vector, in each of at most $M$ iterations the current residual $Y_{(g)} - X_{(g)} \hat\beta^{[m-1]}_{(g)}$ is fitted componentwise by least squares, the unique component whose trial update minimizes gMDL is selected and its coefficient is updated by the pre-specified step size $\nu$ while all other components remain unchanged, and the number of iterations is estimated by minimizing gMDL over the boosting path. From this second stage of sparse boosting, the estimator of $\beta_{(g)}$ is $\hat\beta_{(g)}$, $g = 1, \dots, \hat G$.
Extensive simulations are conducted to evaluate the performance of the proposed procedure. The accuracy of subgrouping, feature selection, coefficient estimation and prediction is assessed under different numbers of patients and repeated measurements. To better understand the advantages of the proposed method, the following four approaches are compared. M1: the homogeneous model fitting method, which treats all patients as one group and uses sparse boosting on the single pooled model; M2: the heterogeneous model fitting method, which uses the initial pre-grouping estimates $\hat\beta_i^{(0)}$ as the final estimates; M3: the same as the proposed method except that in step 2, instead of detecting change points for the coefficients of each covariate separately, it detects change points among the coefficient vectors jointly, similarly to Ke et al. ; M4: the proposed method.
The simulation results show that the naive homogeneous model fitting method M1 can rarely identify the important covariates, while the over-parameterized model fitting method M2 and the two methods that identify subgroup structures (M3 and M4) consistently yield true positives equal to the true number of important covariates. Among the three methods that can identify the important covariates, the proposed method produces the fewest false positives. In addition, the number of false positives decreases as the cluster size increases. Neither the homogeneous nor the heterogeneous model fitting method is able to identify the true structure among patients. Method M3 produces many more subgroups than actually exist, while the proposed method M4 identifies a number of subgroups closest to the actual number. Furthermore, the probability of identifying the true subgroups becomes larger as the number of repeated measurements increases. For in-sample prediction, the over-parameterized model M2 performs best, while methods M3 and M4 perform very competitively. However, for out-of-sample prediction, method M4 is the best; M1 is inferior to M4, yielding poor estimation and prediction results. In summary, the proposed method performs well in terms of subgroup identification, variable selection, estimation and prediction.
4.3 Wallaby growth data analysis
The proposed subgroup identification method is applied to the wallaby growth data from the Australasian Data and Story Library (OzDASL), available at http://www.statsci.org/data/oz/wallaby.html. The data set contains growth measurements taken longitudinally on 77 Tammar wallabies. The response variable is the weight of the wallabies (in tenths of a gram). The predictors are the lengths of the head, ear, body, arm, leg, tail and foot, together with their second-order interactions, so a total of 35 predictors are included in the analysis. After removing records with missing data, 43 Tammar wallabies remain in the dataset, with the number of repeated measurements ranging from 9 to 34 (median: 23). To better understand the wallabies' growth trend, two questions are investigated: which parts of the body affect the weight, and whether the lengths of the body parts have the same effects on weight for all wallabies, i.e. whether there are subgroups among the wallabies. Besides the above subgroup identification method (SB-CPD1), the other three methods studied in the simulation are also considered: the homogeneous model fitting method (SB-Homogeneous), the heterogeneous model fitting method (SB-Heterogeneous) and the method similar to SB-CPD1 but identifying subgroups via the approach of Ke et al.  (SB-CPD2). In addition, the following subgroup identification methods incorporating penalized estimation are investigated: each is similar to the proposed method but, instead of sparse boosting, uses the lasso (Lasso-CPD1), the elastic net (ElasticNet-CPD1), SCAD (SCAD-CPD1) or MCP (MCP-CPD1).
The results show that although Lasso-CPD1 and ElasticNet-CPD1 yield smaller in-sample prediction errors by keeping all 35 covariates, they have relatively large out-of-sample prediction errors due to over-fitting. The subgroup identification methods via sparse boosting retain fewer predictors and achieve sparser models than the penalized methods. The proposed method SB-CPD1 identifies fewer subgroups and predictors than the competing methods while producing the smallest out-of-sample prediction errors. In conclusion, the proposed subgroup identification method provides a more precise definition of the various subgroups, which may also lead to more accurate medical decision making for these subjects.
In this chapter, we discussed various sparse boosting based machine learning methods for high-dimensional data problems. Specifically, we presented the sparse boosting procedure and the two-step sparse boosting procedure for nonparametric varying-coefficient models with survival data and repeatedly measured longitudinal data, respectively, to simultaneously perform variable selection and estimation of the functional coefficients. We further presented the multi-step sparse boosting based subgroup identification method for longitudinal patient data to identify subgroups that exhibit different treatment effects. Extensive numerical studies show the validity and effectiveness of the proposed methods, and the real data analyses further demonstrate their usefulness and advantages.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996 Jan;58(1):267–288
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association. 2001 Dec 1;96(456):1348–1360
Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics. 2010;38(2):894–942
Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006 Dec 1;101(476):1418–1429
Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005 Apr 1;67(2):301–320
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49–67
Schapire RE. The strength of weak learnability. Machine learning. 1990 Jun 1;5(2):197–227
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences. 1997 Aug 1;55(1):119–139
Bühlmann P, Yu B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association. 2003 Jun 1;98(462):324–339
Bühlmann P, Yu B, Singer Y, Wasserman L. Sparse Boosting. Journal of Machine Learning Research. 2006 Jun 1;7(6)
Wang Z. HingeBoost: ROC-based boost for classification and variable selection. The International Journal of Biostatistics. 2011 Feb 4;7(1)
Bühlmann P, Hothorn T. Twin boosting: improved feature selection and prediction. Statistics and Computing. 2010 Apr;20(2):119–138
Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC bioinformatics. 2010 Dec;11(1):1–7
Wang Z. Multi-class hingeboost. Methods of information in medicine. 2012;51(02):162–167
Zhao J. General sparse boosting: improving feature selection of l2 boosting by correlation-based penalty family. Communications in Statistics-Simulation and Computation. 2015 Jul 3;44(6):1612–1640
Yang Y, Zou H. Nonparametric multiple expectile regression via ER-Boost. Journal of Statistical Computation and Simulation. 2015 May 3;85(7):1442–1458
Yue M, Li J, Ma S. Sparse boosting for high-dimensional survival data with varying coefficients. Statistics in medicine. 2018 Feb 28;37(5):789–800
Yue M, Li J, Cheng MY. Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients. Computational Statistics & Data Analysis. 2019 Mar 1;131:222–234
Yue M, Huang L. A new approach of subgroup identification for high-dimensional longitudinal data. Journal of Statistical Computation and Simulation. 2020 Jul 23;90(11):2098–2116
Cox DR. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society: Series B (Methodological). 1972;34(2):187–220
Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in medicine. 1992;11(14–15):1871–1879
Schmid M, Hothorn T. Flexible boosting of accelerated failure time models. BMC bioinformatics. 2008 Dec;9(1):1–3
Wang Z, Wang CY. Buckley-James boosting for survival analysis with high-dimensional biomarker data. Statistical Applications in Genetics and Molecular Biology. 2010 Jun 8;9(1)
Li J, Ma S. Survival analysis in medicine and genetics. CRC Press; 2013 Jun 4
Stute W. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis. 1993 Apr 1;45(1):89–103
Curry HB, Schoenberg IJ. On Pólya frequency functions IV: the fundamental spline functions and their limits. In: IJ Schoenberg Selected Papers. 1988 (pp. 347–383). Birkhäuser, Boston, MA
Hansen MH, Yu B. Model selection and the principle of minimum description length. Journal of the American Statistical Association. 2001 Jun 1;96(454):746–774
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC. Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine. 2008 Aug;14(8):822
Consonni D, Bertazzi PA, Zocchetti C. Why and how to control for age in occupational epidemiology. Occupational and environmental medicine. 1997 Nov 1;54(11):772–776
Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics. 2008 Mar;64(1):132–140
Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data: Oxford University Press. 2002
Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998 John Wiley & Sons. Hoboken NJ. 2012
Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001 Sep 1;96(455):1045–1056
Fan J, Huang T, Li R. Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association. 2007 Jun 1;102(478):632–641
Cheng MY, Honda T, Li J, Peng H. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Annals of Statistics. 2014;42(5):1819–1849
Cheng MY, Honda T, Li J. Efficient estimation in semivarying coefficient models for longitudinal/clustered data. The Annals of Statistics. 2016;44(5):1988–2017
Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003 Mar 1;19(4):474–482
Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007 Jun 15;23(12):1486–1494
Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica. 2011 Oct 1;21(4):1515
Yue M, Li J. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. The international journal of biostatistics. 2017 May 18;13(1)
Sivaganesan S, Müller P, Huang B. Subgroup finding via Bayesian additive regression trees. Statistics in medicine. 2017 Jul 10;36(15):2391–2403
Zhang H, Singer BH. Recursive partitioning and applications. Springer Science & Business Media; 2010 Jul 1
Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics. 2008 Jun 1;17(2):492–514
Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in medicine. 2015 Jan 30;34(2):317–342
Huang X, Sun Y, Trow P, Chatterjee S, Chakravartty A, Tian L, Devanarayan V. Patient subgroup identification for clinical drug development. Statistics in medicine. 2017 Apr 30;36(9):1414–1428
Lipkovich I, Dmitrienko A, B D’Agostino Sr R. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in medicine. 2017 Jan 15;36(1):136–196
Bai J. Estimating multiple breaks one at a time. Econometric theory. 1997 Jun 1:315–352
Ke Y, Li J, Zhang W. Structure identification in panel data analysis. Annals of Statistics. 2016;44(3):1193–1233