Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

Mu Yue

doi:10.5772/intechopen.100506

Abstract

In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation to select tuning parameters, and they tend to retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods for the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method that carries out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations. This chapter addresses how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, all of which have broad application prospects.

Keywords

sparse boosting
high-dimensional data
machine learning
variable selection
data analysis

Author Information

Mu Yue*
- Singapore University of Technology and Design, Singapore

*Address all correspondence to: yuemu.moon@gmail.com

1. Introduction

High-dimensional models have become very popular in the statistical literature, and many new machine learning techniques have been developed to deal with data with a very large number of features. In the past decades, researchers have carried out a great deal of high-dimensional data analysis in which the sample size n is relatively small but the number of features p under consideration is extremely large. It is widely known that including irrelevant predictors in a statistical model may result in unstable estimation and serious computational issues. Thus, variable selection is crucial to address these challenges. Among all developments, regularization procedures such as LASSO [1], smoothly clipped absolute deviation (SCAD) [2], MCP [3] and their various extensions [4, 5, 6] have been thoroughly studied and widely used to perform variable selection and estimation simultaneously in order to improve the prediction accuracy and interpretability of the statistical model. However, these penalized estimation approaches all have tuning parameters that must be selected by computationally expensive methods such as cross-validation.

In recent years, machine learning methods such as boosting have become very prominent in high-dimensional data settings since they can improve the selection accuracy substantially and reduce the chance of including irrelevant features. The original boosting algorithm, proposed by Schapire [7], is an ensemble method that iteratively combines weak learners to minimize the expected loss. The major difference among boosting algorithms is the loss function. For example, AdaBoost [8] uses the exponential loss function, L2 boosting [9] uses the squared error loss function, sparse boosting [10] uses a penalized loss function and HingeBoost [11] uses the weighted hinge loss function. More recently, further versions of boosting algorithms have been proposed. See, for example, Bühlmann and Hothorn [12] for twin boosting; Komori and Eguchi [13] for pAUCBoost; Wang [14] for twin HingeBoost; Zhao [15] for GSBoosting and Yang and Zou [16] for ER-Boost. Besides these extensions, much effort has been made to understand the advantages of boosting, such as its relatively lower over-fitting risk, smaller computational cost, and simpler adjustment to include additional constraints.

In this chapter we review sparse boosting based methods for the following high-dimensional problems, based on three research papers. First, a sparse boosting method to select important biomarkers is studied for right censored survival data with high-dimensional biomarkers [17]. Then, a two-step sparse boosting method that carries out variable selection and model-based prediction is studied for high-dimensional longitudinal observations measured repeatedly over time [18]. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for high-dimensional dense longitudinal observations [19]. This chapter addresses how to improve the accuracy and computational speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and to develop new methods for high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, all of which have broad application prospects.

The rest of the chapter is arranged as follows. In Section 2, a sparse boosting method to fit high-dimensional survival data is studied. In Section 3, a two-step sparse boosting approach to carry out variable selection and model-based prediction by fitting high-dimensional models with longitudinal data is studied. In Section 4, a subgroup identification method incorporating multi-step sparse boosting algorithm for high-dimensional dense longitudinal data is studied. Finally, Section 5 provides concluding remarks.

2. Sparse boosting for survival data

Survival time data are usually referred to as time-to-event data, and they are usually censored. Predicting survival time and identifying risk factors can be very helpful for patient treatment selection, disease prevention strategy or disease management in evidence-based medicine. A well-known model in survival analysis is the Cox proportional hazards (PH) model [20], which assumes multiplicative covariate effects in the hazard function. Another popular model is the accelerated failure time (AFT) model [21], which assumes that the effect of a covariate is to accelerate or decelerate the life time of a disease. The coefficients in the AFT regression model have a direct interpretation as covariate effects on the mean survival time. Recently, researchers have developed boosting methods to analyze survival data. For example, Schmid and Hothorn [22] proposed a flexible boosting method for parametric AFT models, and Wang and Wang [23] proposed Buckley-James boosting for survival data with right censoring and high dimensionality.

In this section, a sparse boosting method to fit high-dimensional varying-coefficient AFT models is presented. In particular, sparse boosting techniques for right censored survival data are studied. In Section 2.1, the varying-coefficient AFT model for survival data is formulated and a detailed sparse boosting algorithm to fit the model is proposed. In Section 2.2, the proposed sparse boosting techniques are evaluated through simulation studies. In Section 2.3, the performance of sparse boosting is examined on a lung cancer data example.

2.1 Methodology

2.1.1 Model and estimation

Let $T_i$ and $C_i$ be the logarithms of the survival time and the censoring time, respectively, for the $i$th subject in a random sample of size $n$. In practice, $Y_i=\min(T_i,C_i)$ and the censoring indicator $\delta_i=I(T_i\le C_i)$ [24] are observed. Denote by $X_i=(X_{i,1},\cdots,X_{i,p-1})$ the corresponding $(p-1)$-dimensional predictors, such as gene expressions or biomarkers, for the $i$th subject, and by $U_i$ the univariate index variable. The observed data set $\{(X_i,\delta_i,Y_i,U_i): X_i\in\mathbb{R}^{p-1},\ \delta_i\in\{0,1\},\ Y_i\in\mathbb{R},\ U_i\in\mathbb{R},\ i=1,2,\cdots,n\}$ is an independent and identically distributed random sample from $(X,\delta,Y,U)$. The varying-coefficient AFT model is:

$$T_i=\beta_0(U_i)+\sum_{j=1}^{p-1}X_{i,j}\beta_j(U_i)+\varepsilon_i,\quad i=1,\ldots,n, \tag{1}$$

where $\beta_0(\cdot),\beta_1(\cdot),\cdots,\beta_{p-1}(\cdot)$ are the unknown varying-coefficient functions of the confounder $U$ and $\varepsilon_i$ is the random error with $E(\varepsilon_i\mid X_i,U_i)=0$.

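A weighted least squares estimation approach is adopted. Let $Y_{(1)}\le\cdots\le Y_{(n)}$ be the order statistics of the $Y_i$'s, let $\delta_{(1)},\cdots,\delta_{(n)}$ be the corresponding censoring indicators of the ordered $Y_i$'s, and define $X_{(1),j},\cdots,X_{(n),j}$, $j=1,\cdots,p-1$, and $U_{(1)},\cdots,U_{(n)}$ similarly. Let the $w_i$'s be the Kaplan–Meier weights [25], which are the jumps in the Kaplan–Meier estimator, computed as $w_1=\frac{\delta_{(1)}}{n}$ and $w_i=\frac{\delta_{(i)}}{n-i+1}\prod_{j=1}^{i-1}\left(\frac{n-j}{n-j+1}\right)^{\delta_{(j)}}$, $i=2,\ldots,n$. Then the weighted least squares loss function is

$$\sum_{i=1}^{n}w_i\Big(Y_{(i)}-\beta_0(U_{(i)})-\sum_{j=1}^{p-1}X_{(i),j}\beta_j(U_{(i)})\Big)^2. \tag{2}$$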
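As an illustration, the Kaplan–Meier weights can be computed in a few lines. The sketch below uses Python with NumPy, which is an assumption of this example and not part of the original study; the function name `kaplan_meier_weights` is hypothetical, and ties among the observed times are simply broken by input order.

```python
import numpy as np

def kaplan_meier_weights(y, delta):
    """Return (order, w): the sort order of y and the Kaplan-Meier weights.

    y     : observed (log) times Y_i = min(T_i, C_i)
    delta : censoring indicators, 1 = event observed, 0 = censored
    w[i-1] is the weight attached to the i-th smallest observation.
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = y.size
    order = np.argsort(y, kind="stable")       # indices of the order statistics
    d = delta[order]                           # delta_(1), ..., delta_(n)
    i = np.arange(1, n + 1)
    # factor_j = ((n - j) / (n - j + 1)) ** delta_(j)
    factors = ((n - i) / (n - i + 1.0)) ** d
    # product over j < i; the empty product for i = 1 equals 1
    prev_prod = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w = d / (n - i + 1.0) * prev_prod
    return order, w
```

Censored observations receive weight zero, and without censoring every observation receives weight $1/n$, so the loss reduces to ordinary least squares up to a constant factor.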

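Let $B(\cdot)=(B_1(\cdot),\ldots,B_L(\cdot))^T$ be an equally spaced B-spline basis, where $L$ is the dimension of the basis. Under certain smoothness conditions, the Curry-Schoenberg theorem [26] implies that every smooth function $\beta_j(\cdot)$ can be approximated by

$$\beta_j(\cdot)\approx B^T(\cdot)\gamma_j,\quad j=0,\cdots,p-1, \tag{3}$$

where $\gamma_j$ is a vector of length $L$. Then the weighted least squares loss function Eq. (2) can be approximated by

$$\sum_{i=1}^{n}w_i\Big(Y_{(i)}-B^T(U_{(i)})\gamma_0-\sum_{j=1}^{p-1}X_{(i),j}B^T(U_{(i)})\gamma_j\Big)^2. \tag{4}$$

Denote $\tilde Y=(Y_{(1)},\cdots,Y_{(n)})^T$, $X_{(i),0}=1$ for $i=1,\cdots,n$, $\tilde X_j=(B(U_{(1)})X_{(1),j},\cdots,B(U_{(n)})X_{(n),j})^T$, $\tilde X=(\tilde X_0,\cdots,\tilde X_{p-1})$, $W=\mathrm{diag}(w_1,\cdots,w_n)$ and $\gamma=(\gamma_0^T,\cdots,\gamma_{p-1}^T)^T$. Then the objective function Eq. (4) may be written in the following matrix form:

$$(\tilde Y-\tilde X\gamma)^TW(\tilde Y-\tilde X\gamma). \tag{5}$$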
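To make the matrix form concrete, the following sketch assembles the working blocks $\tilde X_j$ from a precomputed basis matrix and evaluates the weighted loss. It is written in Python with NumPy as an assumption of this example; the names `build_design` and `weighted_loss` are hypothetical, and the basis matrix (one row $B(U_i)^T$ per subject, already sorted by the observed times) is taken as given rather than computed here.

```python
import numpy as np

def build_design(basis, X):
    """Build the working blocks X~_0, ..., X~_{p-1}.

    basis : (n, L) array whose i-th row is B(U_i)^T
    X     : (n, p-1) array of predictors
    Each returned block is (n, L); block 0 is the intercept block (X_{i,0} = 1).
    """
    n = basis.shape[0]
    X0 = np.column_stack([np.ones(n), X])          # prepend the constant column
    return [basis * X0[:, [j]] for j in range(X0.shape[1])]

def weighted_loss(blocks, gamma, y, w):
    """Evaluate (Y~ - X~ gamma)^T W (Y~ - X~ gamma) with W = diag(w)."""
    X_tilde = np.hstack(blocks)                    # (n, p * L)
    r = y - X_tilde @ gamma
    return float(r @ (w * r))
```

Here `gamma` is the stacked vector $(\gamma_0^T,\cdots,\gamma_{p-1}^T)^T$ of length $pL$, so the blocks and the coefficient vector share the same ordering.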

The estimation may yield a closed-form solution for the coefficients when the dimensionality $p$ is small or moderate, but with high dimensionality such a solution cannot be easily obtained. Let $\gamma^{\hat K}=((\gamma_0^{\hat K})^T,\cdots,(\gamma_{p-1}^{\hat K})^T)^T$ be the estimator of $\gamma$ from the sparse boosting approach with the weighted squared loss function Eq. (5), where $\hat K$ is the estimated stopping iteration. Then the estimates of the coefficient functions are given by

$$\hat\beta_j(u)=B^T(u)\gamma_j^{\hat K},\quad j=0,\cdots,p-1. \tag{6}$$

Instead of using regularized estimation approaches, a sparse boosting method to estimate $\gamma^{\hat K}$ is presented in the following subsection.

2.1.2 Sparse boosting techniques

The key idea of sparse boosting is to replace the empirical risk function in L2 boosting with the penalized empirical risk function which is a combination of squared loss and the trace of boosting operator as a measure of boosting complexity, and then perform gradient descent in a function space iteratively. Thus sparse boosting produces sparser models compared to L2 boosting. The g-prior minimum description length (gMDL) proposed by [27] can be used as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion. The gMDL takes the form:

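$$\mathrm{gMDL}(\mathrm{RSS},\mathrm{trace}(B))=\log(S)+\frac{\mathrm{trace}(B)}{n}\log\left(\frac{\tilde Y^T\tilde Y-\mathrm{RSS}}{\mathrm{trace}(B)\times S}\right),\quad S=\frac{\mathrm{RSS}}{n-\mathrm{trace}(B)}. \tag{7}$$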

Here RSS is the residual sum of squares and B is the boosting operator. The model that achieves the shortest description of data will be selected. The advantage is that it has a data-dependent penalty for each dimension since it is explicitly given as a function of data only, thus the selection of the tuning parameter can be avoided.
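For readers who prefer code to formulas, the gMDL criterion can be written as a small function. This is a minimal sketch in Python (an assumption of this example); the function name `gmdl` and its argument names are hypothetical, and `yty` stands for $\tilde Y^T\tilde Y$.

```python
import math

def gmdl(rss, trace_b, yty, n):
    """gMDL score: log(S) + trace(B)/n * log((Y'Y - RSS) / (trace(B) * S)),
    with S = RSS / (n - trace(B)). Smaller values indicate a shorter description."""
    s = rss / (n - trace_b)
    return math.log(s) + trace_b / n * math.log((yty - rss) / (trace_b * s))
```

For a fixed residual sum of squares, a larger trace of the boosting operator gives a larger score, which is how the criterion penalizes model complexity without a separate tuning parameter.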

The sparse boosting procedure is now described in detail. The initial value of $\gamma$ is set to the zero vector, i.e. $\gamma^{k}=0$ for $k=0$. In the $k$th iteration ($1\le k\le K$, where $K$ is the total number of iterations), only the current residual $R^{k}=\tilde Y-\tilde X\gamma^{k-1}$ is regressed on each $j$th working element $\tilde X_j$, $j=0,\cdots,p-1$. The fit, denoted by $\hat\lambda_j^{k}$, is obtained by minimizing the weighted squared loss function $(R^{k}-\tilde X_j\lambda)^TW(R^{k}-\tilde X_j\lambda)$ with respect to $\lambda$. Hence the weighted least squares estimate is $\hat\lambda_j^{k}=(\tilde X_j^TW\tilde X_j)^{-1}\tilde X_j^TWR^{k}$, the corresponding hat matrix is $H_j=\tilde X_j(\tilde X_j^TW\tilde X_j)^{-1}\tilde X_j^TW$ and the weighted residual sum of squares is $\mathrm{RSS}_j^{k}=(R^{k}-\tilde X_j\hat\lambda_j^{k})^TW(R^{k}-\tilde X_j\hat\lambda_j^{k})$. The selected component $\hat s_k$ is obtained by:

$$\hat s_k=\arg\min_{0\le j\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_j^{k},\mathrm{trace}(B_j^{k})\big), \tag{8}$$

where $B_j^{1}=H_j$ and $B_j^{k}=I-(I-H_j)(I-\nu H_{\hat s_{k-1}})\cdots(I-\nu H_{\hat s_1})$ for $k>1$ is the boosting operator for selecting the $j$th component in the $k$th iteration. Therefore, at each iteration only one working component $\tilde X_{\hat s_k}$ is chosen, and only the corresponding coefficient vector $\gamma_{\hat s_k}^{k}$ changes, i.e. $\gamma_{\hat s_k}^{k}=\gamma_{\hat s_k}^{k-1}+\nu\hat\lambda_{\hat s_k}^{k}$, where $\nu$ is the step size, while all the other $\gamma_j^{k}$ for $j\ne\hat s_k$ remain the same. This process is repeated for $K$ iterations, and the stopping iteration is estimated by

$$\hat K=\arg\min_{1\le k\le K}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_k}^{k},\mathrm{trace}(B^{k})\big), \tag{9}$$

where $B^{k}=I-(I-\nu H_{\hat s_k})\cdots(I-\nu H_{\hat s_1})$.
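The product form of the boosting operator follows from a short recursion on the residuals. The derivation below is a sketch that uses only the quantities defined above, together with the identity $\tilde X_{\hat s_k}\hat\lambda_{\hat s_k}^{k}=H_{\hat s_k}R^{k}$:

```latex
\begin{aligned}
R^{k+1} &= \tilde Y-\tilde X\gamma^{k}
         = R^{k}-\nu\tilde X_{\hat s_k}\hat\lambda_{\hat s_k}^{k}
         = (I-\nu H_{\hat s_k})R^{k}, \qquad R^{1}=\tilde Y,\\
\tilde Y-\tilde X\gamma^{k} &= (I-\nu H_{\hat s_k})\cdots(I-\nu H_{\hat s_1})\tilde Y,\\
\tilde X\gamma^{k} &= \big[I-(I-\nu H_{\hat s_k})\cdots(I-\nu H_{\hat s_1})\big]\tilde Y
                    = B^{k}\tilde Y .
\end{aligned}
```

In other words, $B^{k}$ is the linear operator that maps the response to the boosted fit after $k$ updates, and its trace plays the role of the model complexity in the gMDL criterion.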

From this sparse boosting procedure, the estimator of $\gamma$ is obtained as $\gamma^{\hat K}=((\gamma_0^{\hat K})^T,\cdots,(\gamma_{p-1}^{\hat K})^T)^T$. The sparse boosting algorithm for the varying-coefficient AFT model can be summarized as follows:

Sparse Boosting Algorithm for Varying-Coefficient AFT Model.

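(a) Initialization. Set $k=0$ and $\gamma_0^{k}=0,\cdots,\gamma_{p-1}^{k}=0$ (component-wise).
(b) Iteration. Increase $k$ by 1. Compute $\hat s_k=\arg\min_{0\le j\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_j^{k},\mathrm{trace}(B_j^{k})\big)$, where $B_j^{1}=H_j$ and $B_j^{k}=I-(I-H_j)(I-\nu H_{\hat s_{k-1}})\cdots(I-\nu H_{\hat s_1})$ for $k>1$.
(c) Update. Set $\gamma_j^{k}=\gamma_j^{k-1}$ for $j\ne\hat s_k$ and $\gamma_{\hat s_k}^{k}=\gamma_{\hat s_k}^{k-1}+\nu\hat\lambda_{\hat s_k}^{k}$, where $\nu$ is the step size.
(d) Iteration. Repeat steps (b)-(c) for $K$ iterations.
(e) Stopping. Estimate $\hat K=\arg\min_{1\le k\le K}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_k}^{k},\mathrm{trace}(B^{k})\big)$, where $B^{k}=I-(I-\nu H_{\hat s_k})\cdots(I-\nu H_{\hat s_1})$.

Thus, $\gamma^{\hat K}=((\gamma_0^{\hat K})^T,\cdots,(\gamma_{p-1}^{\hat K})^T)^T$ is the estimate of $\gamma$ and $\hat\beta_j(u)=B^T(u)\gamma_j^{\hat K}$, $j=0,\cdots,p-1$, are the estimators of the varying coefficients. The final estimator of $\tilde Y$ is $\tilde Y^{\hat K}=\tilde X\gamma^{\hat K}$.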
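The steps above can be sketched in code as follows. This is a minimal illustration in Python with NumPy, not the implementation used in [17]: the function names are hypothetical, the working blocks $\tilde X_j$ and the weights are assumed to be precomputed, dense $n\times n$ matrices are used for clarity rather than speed, and a pseudo-inverse is substituted for the matrix inverse in case a block is rank deficient. As one reading of the stopping rule, the stopping score pairs $\mathrm{trace}(B^{k})$ with the weighted residual sum of squares of the boosted fit after the $k$th update; this choice is an assumption of the sketch.

```python
import numpy as np

def gmdl(rss, trace_b, yty, n):
    """gMDL score as in the criterion above."""
    s = rss / (n - trace_b)
    return np.log(s) + trace_b / n * np.log((yty - rss) / (trace_b * s))

def sparse_boost(blocks, y, w, nu=0.1, K=100):
    """Component-wise sparse boosting with weighted squared loss.

    blocks : list of (n, L) arrays, the working elements X~_j
    y      : (n,) response, w : (n,) nonnegative weights
    Returns (gamma, k_hat): list of coefficient vectors at the estimated
    stopping iteration, and that iteration.
    """
    n, p = y.size, len(blocks)
    yty = float(y @ y)
    # A_j maps a residual to its weighted least squares coefficients on block j
    A = [np.linalg.pinv(Xj.T @ (w[:, None] * Xj)) @ (Xj.T * w) for Xj in blocks]
    H = [Xj @ Aj for Xj, Aj in zip(blocks, A)]      # hat matrices H_j
    gamma = [np.zeros(Xj.shape[1]) for Xj in blocks]
    R = y.astype(float).copy()                      # current residual R^k
    P = np.eye(n)                                   # (I - nu H_{s_k}) ... (I - nu H_{s_1})
    scores, path = [], []
    for _ in range(K):
        tr_p = np.trace(P)
        best = None
        for j in range(p):
            lam = A[j] @ R
            res = R - blocks[j] @ lam
            rss_j = float(res @ (w * res))
            # trace(B_j^k) = trace(I - (I - H_j) P) = n - trace(P) + trace(H_j P)
            tr_j = n - tr_p + np.sum(H[j] * P.T)
            score = gmdl(rss_j, tr_j, yty, n)
            if best is None or score < best[0]:
                best = (score, j, lam)
        _, s, lam = best
        gamma[s] = gamma[s] + nu * lam              # update only the selected block
        R = R - nu * (blocks[s] @ lam)
        P = (np.eye(n) - nu * H[s]) @ P
        # stopping score: RSS of the boosted fit with trace(B^k) = n - trace(P)
        scores.append(gmdl(float(R @ (w * R)), n - np.trace(P), yty, n))
        path.append([g.copy() for g in gamma])
    k_hat = int(np.argmin(scores)) + 1
    return path[k_hat - 1], k_hat
```

Blocks whose coefficient vector is still zero at iteration $\hat K$ correspond to predictors that were never selected, which is how the procedure performs variable selection.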

According to [10] and references therein, the choice of the step size ν is of minor importance as long as it is small. A smaller value of ν achieves higher prediction accuracy but requires a larger number of boosting iterations and more computing time. A typical value used in the literature is ν=0.1.

2.2 Simulation

The above sparse boosting algorithm is evaluated on simulated data. L2 boosting and sparse boosting are compared in terms of variable selection and function estimation. Sparse boosting is the method presented in this section, while L2 boosting is a relatively simpler version that may not achieve a sparse solution in general.

The simulation results from [17] show that both boosting methods can identify important variables, while sparse boosting selects far fewer irrelevant variables than L2 boosting. Although the in-sample prediction error (defined as $\sum_{i=1}^{n}\delta_i(Y_i-Y_i^{\hat K})^2/\sum_{i=1}^{n}\delta_i$) of L2 boosting is slightly smaller than that of sparse boosting, since the former has larger model sizes, the average of root mean integrated squared errors (defined as $\frac{1}{n}\sum_{j=0}^{5}\sum_{i=1}^{n}(\beta_j(u_i)-\hat\beta_j(u_i))^2$) of sparse boosting is much smaller than that of L2 boosting. Furthermore, when the smoothness assumption in the Curry-Schoenberg theorem is violated for the coefficient functions, the performance of variable selection remains good. In summary, sparse boosting outperforms L2 boosting in terms of parameter estimation and variable selection.

2.3 Lung cancer data analysis

Lung cancer is the leading cause of cancer death in the U.S. Identifying relevant gene expressions in lung cancer is important for treatment and prevention. The data come from a large multi-site blinded validation study [28] with 442 lung adenocarcinomas. Age is treated as the potential confounder in this analysis, since it is usually strongly correlated with survival time [29]. After removing missing measurements and predictors in overall survival, a total of 439 patients remain in the analysis. For each patient, 22,283 gene expressions are available. The median follow-up time is 46 months (range: 0.03 to 204 months) with an overall censoring rate of 46.47%. The median age at diagnosis is 65 years (range: 33 to 87 years). After a marginal screening procedure is applied to screen out irrelevant genes, variable selection approaches are used to identify important genes associated with lung cancer. For comparison, besides L2 boosting and the proposed sparse boosting, the following existing variable selection approaches for constant-coefficient AFT models are also considered: Buckley-James boosting with linear least squares [23], Buckley-James twin boosting with linear least squares [23], and Buckley-James regression with elastic net penalty [30] and with SCAD penalty.

The results from [17] show that L2 boosting and sparse boosting for the varying-coefficient AFT model not only produce relatively sparser models, but also have smaller in-sample and out-of-sample prediction errors than the four methods for the constant-coefficient AFT model. Again, sparse boosting produces an even sparser model than L2 boosting. In conclusion, including age in the varying-coefficient AFT model could lead to more accurate estimates than the constant-coefficient AFT model, and the proposed sparse boosting method for the varying-coefficient AFT model performs well in terms of estimation, prediction and sparsity.

3. Two-step sparse boosting for longitudinal data

Longitudinal data contain repeated measurements collected from the same respondents over time. The assumption that all measurements are independent does not hold for such data. One important question in longitudinal analysis is how to make efficient inference by taking the within-subject correlation into account. This question has been investigated in depth by many researchers [31, 32] for parametric models. Semiparametric and nonparametric models for longitudinal data have also been presented in the literature, see [33, 34]. Recently, there have been some developments on longitudinal data with high dimensionality using varying-coefficient models [35, 36]. All previous studies adopted penalty methods.

In this section, a two-step sparse boosting approach is presented to perform variable selection and model-based prediction. Specifically, high-dimensional varying-coefficient models with longitudinal data are considered. In the first step, the sparse boosting approach is used to obtain an estimate of the correlation structure. In the second step, the within-subject correlation structure is taken into account, and variable selection and coefficient estimation are achieved by sparse boosting again. The rest of this section is arranged as follows. In Section 3.1, the varying-coefficient model for longitudinal data is formulated and a two-step sparse boosting algorithm is presented. In Section 3.2, simulation studies are conducted to illustrate the validity of the two-step sparse boosting method. In Section 3.3, the performance of the two-step method is assessed on yeast cell cycle gene expression data.

3.1 Methodology

3.1.1 Model and estimation

Let $Y_{ij}$ be the continuous outcome for the $j$th measurement of individual $i$ taken at time $t_{ij}\in\mathcal{T}$, where $\mathcal{T}$ is the time interval on which the measurements are taken. Denote by $X_{ij}=(X_{ij,1},\cdots,X_{ij,p-1})$ the corresponding $(p-1)$-dimensional covariate vector. The following varying-coefficient model, which can capture the dynamic impacts of the covariates on the response variable, is considered:

$$Y_{ij}=\beta_0(t_{ij})+\sum_{d=1}^{p-1}X_{ij,d}\beta_d(t_{ij})+\varepsilon_{ij},\quad i=1,\cdots,n,\ j=1,\cdots,n_i, \tag{10}$$

where $\beta_0(\cdot),\beta_1(\cdot),\cdots,\beta_{p-1}(\cdot)$ are the unknown smooth coefficient functions of time and $\varepsilon_i=(\varepsilon_{i1},\cdots,\varepsilon_{in_i})^T$, $i=1,\cdots,n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated for different $i$, but the components of $\varepsilon_i$ are correlated with each other. Without loss of generality, a balanced longitudinal study is considered in the following implementation, i.e., $t_{ij}=t_{kj}$ for all subjects $i$ and $k$, and $n_i=m$ for all $i$.

The estimation procedure is presented below. In the first step, the within-subject correlation is ignored and the coefficients are estimated by minimizing the following least squares loss function:

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\Big(Y_{ij}-\beta_0(t_{ij})-\sum_{d=1}^{p-1}X_{ij,d}\beta_d(t_{ij})\Big)^2. \tag{11}$$

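A B-spline basis is used to estimate the coefficient functions $\beta_0(\cdot),\beta_1(\cdot),\cdots,\beta_{p-1}(\cdot)$. Denote by $B(\cdot)=(B_1(\cdot),\ldots,B_L(\cdot))^T$ an equally spaced B-spline basis of dimension $L$. Under certain smoothness assumptions, the function $\beta_d(\cdot)$ can be approximated by

$$\beta_d(\cdot)\approx B^T(\cdot)\gamma_d,\quad d=0,\cdots,p-1, \tag{12}$$

where $\gamma_d$ is a loading vector of length $L$. Then the least squares loss function Eq. (11) is close to

$$\sum_{i=1}^{n}\sum_{j=1}^{m}\Big(Y_{ij}-B^T(t_{ij})\gamma_0-\sum_{d=1}^{p-1}X_{ij,d}B^T(t_{ij})\gamma_d\Big)^2. \tag{13}$$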

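Further denote $Y_i=(Y_{i1},\cdots,Y_{im})^T$, $Y=(Y_1^T,\cdots,Y_n^T)^T$, $X_{ij,0}=1$, $\tilde X_{i,d}=(B(t_{i1})X_{i1,d},\cdots,B(t_{im})X_{im,d})^T$, $\tilde X_i=(\tilde X_{i,0},\cdots,\tilde X_{i,p-1})$, $\tilde X=(\tilde X_1^T,\cdots,\tilde X_n^T)^T$ and $\gamma=(\gamma_0^T,\cdots,\gamma_{p-1}^T)^T$. Then the target function Eq. (13) can be expressed in matrix form:

$$\sum_{i=1}^{n}(Y_i-\tilde X_i\gamma)^T(Y_i-\tilde X_i\gamma)\equiv(Y-\tilde X\gamma)^T(Y-\tilde X\gamma). \tag{14}$$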

Denote by $\gamma^{\hat K_1}$ the estimator of $\gamma$ obtained by sparse boosting with the squared loss function Eq. (14), where $\hat K_1$ is the estimated stopping iteration in this step. There is no exact closed form for $\gamma^{\hat K_1}$ since it is derived from an iterative algorithm. However, it can be evaluated very quickly in a computer implementation. The detailed algorithm is presented in the next subsection.

The first step coefficient estimates are given by

$$\tilde\beta_d(t)=B^T(t)\gamma_d^{\hat K_1},\quad d=0,\cdots,p-1. \tag{15}$$

Write $\hat\varepsilon_i=Y_i-\tilde X_i\gamma^{\hat K_1}$, $i=1,\cdots,n$. The $m\times m$ covariance matrix $\mathrm{Cov}(Y_i)\equiv\Sigma$ can be estimated by the following empirical estimator

$$\hat\Sigma=\frac{1}{n}\sum_{i=1}^{n}\hat\varepsilon_i\hat\varepsilon_i^T. \tag{16}$$

In the second step, the estimated correlation structure within repeated measurements is taken into account to form the weighted least squares loss function as follows:

$$\sum_{i=1}^{n}(Y_i-\tilde X_i\gamma^{\star})^T\hat\Sigma^{-1}(Y_i-\tilde X_i\gamma^{\star})\equiv(Y-\tilde X\gamma^{\star})^TW(Y-\tilde X\gamma^{\star}), \tag{17}$$

where $W=\mathrm{diag}(\hat\Sigma^{-1},\cdots,\hat\Sigma^{-1})$ is the estimated $(n\times m)\times(n\times m)$ weight matrix. Denote by $\gamma^{\star\hat K_2}$ the estimator of $\gamma^{\star}$ obtained by sparse boosting with the weighted loss function Eq. (17), where $\hat K_2$ is the estimated stopping iteration in the second step. Then the coefficient estimates from the second step are given by

$$\hat\beta_d(t)=B^T(t)\gamma_d^{\star\hat K_2},\quad d=0,\cdots,p-1. \tag{18}$$

Reliable estimates of the coefficient functions can then be obtained. More details about how to use sparse boosting to obtain $\gamma^{\hat K_1}$ and $\gamma^{\star\hat K_2}$ are provided in the following subsection.
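The covariance estimator and the block-diagonal weight matrix are simple to form once the first-step residuals are available. The sketch below is in Python with NumPy (an assumption of this example); the function names are hypothetical, and the residuals are assumed to be arranged as an $n\times m$ array with one row per subject, which matches the subject-wise stacking of $Y$.

```python
import numpy as np

def estimate_covariance(residuals):
    """Empirical covariance (1/n) * sum_i eps_i eps_i^T.

    residuals : (n, m) array, row i holds the first-step residuals of subject i.
    """
    n = residuals.shape[0]
    return residuals.T @ residuals / n

def block_weight(sigma_hat, n):
    """W = diag(Sigma^{-1}, ..., Sigma^{-1}) with n diagonal blocks."""
    return np.kron(np.eye(n), np.linalg.inv(sigma_hat))

def weighted_rss(residuals, W):
    """(Y - X~ gamma)^T W (Y - X~ gamma) for residuals stacked subject by subject."""
    e = residuals.reshape(-1)
    return float(e @ W @ e)
```

A useful sanity check is that the weighted loss evaluated at the first-step residuals themselves equals $n\times m$, because $\sum_i\hat\varepsilon_i^T\hat\Sigma^{-1}\hat\varepsilon_i=\mathrm{trace}(\hat\Sigma^{-1}\,n\hat\Sigma)$. The inverse requires $\hat\Sigma$ to be nonsingular, which in practice needs at least as many subjects as time points.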

3.1.2 Two-step sparse boosting techniques

gMDL can be adopted as the penalized empirical risk function to estimate the update criterion in each iteration and the stopping criterion. gMDL can be expressed in the following form:

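$$\mathrm{gMDL}(\mathrm{RSS},\mathrm{trace}(B))=\log(F)+\frac{\mathrm{trace}(B)}{n\times m}\log\left(\frac{Y^TY-\mathrm{RSS}}{\mathrm{trace}(B)\times F}\right),\quad F=\frac{\mathrm{RSS}}{n\times m-\mathrm{trace}(B)}, \tag{19}$$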

where B is the boosting operator and RSS is the residual sum of squares.

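The two-step sparse boosting approach is now presented in more detail. In the first step, the starting value of $\gamma$ is set to the zero vector, i.e. $\gamma^{0}=0$, and in the $k_1$th iteration ($0<k_1\le K_1$, where $K_1$ is the maximum number of iterations considered in the first step), the current residual $R^{k_1}=Y-\tilde X\gamma^{k_1-1}$ is used to fit each $d$th component $\tilde X_{,d}=(\tilde X_{1,d}^T,\cdots,\tilde X_{n,d}^T)^T$, $d=0,\cdots,p-1$, treating all the within-subject observations as uncorrelated. The fit, denoted by $\hat\lambda_d^{k_1}$, is calculated by minimizing the squared loss function $(R^{k_1}-\tilde X_{,d}\lambda)^T(R^{k_1}-\tilde X_{,d}\lambda)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat\lambda_d^{k_1}=(\tilde X_{,d}^T\tilde X_{,d})^{-1}\tilde X_{,d}^TR^{k_1}$, the corresponding hat matrix is $H_d=\tilde X_{,d}(\tilde X_{,d}^T\tilde X_{,d})^{-1}\tilde X_{,d}^T$ and the residual sum of squares is $\mathrm{RSS}_d^{k_1}=(R^{k_1}-\tilde X_{,d}\hat\lambda_d^{k_1})^T(R^{k_1}-\tilde X_{,d}\hat\lambda_d^{k_1})$. The chosen element $\hat s_{k_1}$ is obtained by:

$$\hat s_{k_1}=\arg\min_{0\le d\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_d^{k_1},\mathrm{trace}(B_d^{k_1})\big), \tag{20}$$

where $B_d^{1}=H_d$ and $B_d^{k_1}=I-(I-H_d)(I-\nu H_{\hat s_{k_1-1}})\cdots(I-\nu H_{\hat s_1})$ for $k_1>1$ is the first step boosting operator for choosing the $d$th element in the $k_1$th iteration. Hence, a unique element $\tilde X_{,\hat s_{k_1}}$ is selected at each iteration, and only the corresponding coefficient vector $\gamma_{\hat s_{k_1}}^{k_1}$ changes, i.e., $\gamma_{\hat s_{k_1}}^{k_1}=\gamma_{\hat s_{k_1}}^{k_1-1}+\nu\hat\lambda_{\hat s_{k_1}}^{k_1}$, where $\nu$ is the pre-specified step-size parameter. All the other $\gamma_d^{k_1}$ for $d\ne\hat s_{k_1}$ remain unchanged. This procedure is repeated $K_1$ times, and the stopping iteration is estimated by

$$\hat K_1=\arg\min_{1\le k_1\le K_1}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_{k_1}}^{k_1},\mathrm{trace}(B^{k_1})\big), \tag{21}$$

where $B^{k_1}=I-(I-\nu H_{\hat s_{k_1}})\cdots(I-\nu H_{\hat s_1})$.

From the first step of sparse boosting, the estimator of $\gamma$ is obtained as $\gamma^{\hat K_1}=((\gamma_0^{\hat K_1})^T,\cdots,(\gamma_{p-1}^{\hat K_1})^T)^T$. The weight matrix $W$ can then be obtained from the resulting residuals.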

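In the second step, sparse boosting is used again, taking into account the correlation structure of the repeated measurements estimated in the first step. The initial value of $\gamma^{\star}$ is set to the coefficient estimator from the first step of sparse boosting, i.e. $\gamma^{\star 0}=\gamma^{\hat K_1}$, and in the $k_2$th iteration ($0<k_2\le K_2$, where $K_2$ is the maximum number of iterations under consideration in the second step), the current residual $R^{\star k_2}=Y-\tilde X\gamma^{\star k_2-1}$ is used to fit each $d$th working element $\tilde X_{,d}$, $d=0,\cdots,p-1$, incorporating the within-subject correlation estimator from the first step. The fit, denoted by $\hat\lambda_d^{\star k_2}$, is obtained by minimizing the weighted squared loss function $(R^{\star k_2}-\tilde X_{,d}\lambda)^TW(R^{\star k_2}-\tilde X_{,d}\lambda)$ with respect to $\lambda$. Thus, the weighted least squares estimate is $\hat\lambda_d^{\star k_2}=(\tilde X_{,d}^TW\tilde X_{,d})^{-1}\tilde X_{,d}^TWR^{\star k_2}$, the corresponding hat matrix is $H_d^{\star}=\tilde X_{,d}(\tilde X_{,d}^TW\tilde X_{,d})^{-1}\tilde X_{,d}^TW$ and the weighted residual sum of squares is $\mathrm{RSS}_d^{\star k_2}=(R^{\star k_2}-\tilde X_{,d}\hat\lambda_d^{\star k_2})^TW(R^{\star k_2}-\tilde X_{,d}\hat\lambda_d^{\star k_2})$. The chosen element $\hat s_{k_2}$ is obtained by:

$$\hat s_{k_2}=\arg\min_{0\le d\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_d^{\star k_2},\mathrm{trace}(B_d^{\star k_2})\big), \tag{22}$$

where $B_d^{\star 1}=I-(I-B^{\hat K_1})(I-H_d^{\star})$ and $B_d^{\star k_2}=I-(I-B^{\hat K_1})(I-H_d^{\star})(I-\nu H_{\hat s_{k_2-1}}^{\star})\cdots(I-\nu H_{\hat s_1}^{\star})$ for $k_2>1$ is the second step boosting operator for choosing the $d$th element in the $k_2$th iteration. Thus, a unique element $\tilde X_{,\hat s_{k_2}}$ is selected at each iteration, and only the corresponding coefficient vector $\gamma_{\hat s_{k_2}}^{\star k_2}$ changes, i.e., $\gamma_{\hat s_{k_2}}^{\star k_2}=\gamma_{\hat s_{k_2}}^{\star k_2-1}+\nu\hat\lambda_{\hat s_{k_2}}^{\star k_2}$, while all the other $\gamma_d^{\star k_2}$ for $d\ne\hat s_{k_2}$ remain the same. This procedure is repeated $K_2$ times, and the estimated stopping iteration $\hat K_2$ is

$$\hat K_2=\arg\min_{1\le k_2\le K_2}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_{k_2}}^{\star k_2},\mathrm{trace}(B^{\star k_2})\big), \tag{23}$$

where $B^{\star k_2}=I-(I-B^{\hat K_1})(I-\nu H_{\hat s_{k_2}}^{\star})\cdots(I-\nu H_{\hat s_1}^{\star})$.

From the second step of sparse boosting, the estimator of $\gamma^{\star}$ is obtained as $\gamma^{\star\hat K_2}=((\gamma_0^{\star\hat K_2})^T,\cdots,(\gamma_{p-1}^{\star\hat K_2})^T)^T$. The two-step sparse boosting algorithm for the varying-coefficient model with longitudinal data can be summarized as follows: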

Two-step Sparse Boosting Algorithm with Longitudinal Data.

Step I: Use sparse boosting to estimate covariance matrix.

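(a) Initialization. Let $k_1=0$ and $\gamma_0^{k_1}=0,\cdots,\gamma_{p-1}^{k_1}=0$.
(b) Increase $k_1$ by 1. Calculate $\hat s_{k_1}=\arg\min_{0\le d\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_d^{k_1},\mathrm{trace}(B_d^{k_1})\big)$, where $B_d^{1}=H_d$ and $B_d^{k_1}=I-(I-H_d)(I-\nu H_{\hat s_{k_1-1}})\cdots(I-\nu H_{\hat s_1})$ for $k_1>1$.
(c) Update. Set $\gamma_d^{k_1}=\gamma_d^{k_1-1}$ for $d\ne\hat s_{k_1}$ and $\gamma_{\hat s_{k_1}}^{k_1}=\gamma_{\hat s_{k_1}}^{k_1-1}+\nu\hat\lambda_{\hat s_{k_1}}^{k_1}$, where $\nu$ is the step-size parameter.
(d) Iteration. Repeat steps (b)-(c) for some large iteration number $K_1$.
(e) Stopping. The optimal iteration number can be taken as $\hat K_1=\arg\min_{1\le k_1\le K_1}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_{k_1}}^{k_1},\mathrm{trace}(B^{k_1})\big)$, where $B^{k_1}=I-(I-\nu H_{\hat s_{k_1}})\cdots(I-\nu H_{\hat s_1})$.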

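Thus, $\gamma^{\hat K_1}=((\gamma_0^{\hat K_1})^T,\cdots,(\gamma_{p-1}^{\hat K_1})^T)^T$ is the first step estimator of $\gamma$ from sparse boosting and $\tilde\beta_d(t)=B^T(t)\gamma_d^{\hat K_1}$, $d=0,\cdots,p-1$, are the varying coefficient estimates ignoring the within-subject correlation. $\mathrm{Cov}(Y_i)$ can be estimated by

$$\hat\Sigma=\frac{1}{n}\sum_{i=1}^{n}(Y_i-\tilde X_i\gamma^{\hat K_1})(Y_i-\tilde X_i\gamma^{\hat K_1})^T.$$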

Step II: Use sparse boosting again by incorporating covariance matrix estimator.

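(a) Initialization. Let $k_2=0$ and $\gamma^{\star k_2}=\gamma^{\hat K_1}$.
(b) Increase $k_2$ by 1. Calculate $\hat s_{k_2}=\arg\min_{0\le d\le p-1}\mathrm{gMDL}\big(\mathrm{RSS}_d^{\star k_2},\mathrm{trace}(B_d^{\star k_2})\big)$, where $B_d^{\star 1}=I-(I-B^{\hat K_1})(I-H_d^{\star})$ and $B_d^{\star k_2}=I-(I-B^{\hat K_1})(I-H_d^{\star})(I-\nu H_{\hat s_{k_2-1}}^{\star})\cdots(I-\nu H_{\hat s_1}^{\star})$ for $k_2>1$.
(c) Update. Set $\gamma_d^{\star k_2}=\gamma_d^{\star k_2-1}$ for $d\ne\hat s_{k_2}$ and $\gamma_{\hat s_{k_2}}^{\star k_2}=\gamma_{\hat s_{k_2}}^{\star k_2-1}+\nu\hat\lambda_{\hat s_{k_2}}^{\star k_2}$.
(d) Iteration. Repeat steps (b)-(c) for some large iteration number $K_2$.
(e) Stopping. The optimal iteration number can be taken as $\hat K_2=\arg\min_{1\le k_2\le K_2}\mathrm{gMDL}\big(\mathrm{RSS}_{\hat s_{k_2}}^{\star k_2},\mathrm{trace}(B^{\star k_2})\big)$, where $B^{\star k_2}=I-(I-B^{\hat K_1})(I-\nu H_{\hat s_{k_2}}^{\star})\cdots(I-\nu H_{\hat s_1}^{\star})$.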

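Therefore, $\gamma^{\star\hat K_2}=((\gamma_0^{\star\hat K_2})^T,\cdots,(\gamma_{p-1}^{\star\hat K_2})^T)^T$ and $\hat\beta_d(t)=B^T(t)\gamma_d^{\star\hat K_2}$, $d=0,\cdots,p-1$, are the final estimator of $\gamma^{\star}$ and the varying coefficient estimates from the two-step sparse boosting. The final estimate of $Y$ is $\hat Y=\tilde X\gamma^{\star\hat K_2}$.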
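The two steps can be chained in code as sketched below. This is a minimal illustration in Python with NumPy, not the implementation used in [18]: all function names are hypothetical, dense matrices are used for clarity, a pseudo-inverse stands in for the matrix inverse, and the stopping score uses the residual sum of squares of the current boosted fit, which is an assumption of this sketch. The second pass is warm-started with the first-step coefficients and with the first-step residual operator $I-B^{\hat K_1}$; because the trace is invariant under cyclic permutation of a matrix product, the order in which this factor enters the product does not change $\mathrm{trace}(B_d^{\star k_2})$.

```python
import numpy as np

def gmdl(rss, tr, yty, N):
    """gMDL score with N = n * m observations."""
    f = rss / (N - tr)
    return np.log(f) + tr / N * np.log((yty - rss) / (tr * f))

def boost_pass(blocks, y, W, gamma0, P0, nu=0.1, K=50):
    """One sparse boosting pass under weight matrix W.

    gamma0 : starting coefficients (list of vectors, one per block)
    P0     : residual operator I - B inherited from an earlier pass
             (identity matrix for the first pass)
    Returns the coefficients and residual operator at the best gMDL score.
    """
    N, yty = y.size, float(y @ y)
    A = [np.linalg.pinv(X.T @ W @ X) @ X.T @ W for X in blocks]
    H = [X @ a for X, a in zip(blocks, A)]
    gamma = [g.copy() for g in gamma0]
    R = y - sum(X @ g for X, g in zip(blocks, gamma))
    P = P0.copy()
    best = None
    for _ in range(K):
        tr_p = np.trace(P)
        cands = []
        for j, X in enumerate(blocks):
            lam = A[j] @ R
            res = R - X @ lam
            tr_j = N - tr_p + np.sum(H[j] * P.T)    # trace of I - (I - H_j) P
            cands.append((gmdl(float(res @ W @ res), tr_j, yty, N), j, lam))
        _, s, lam = min(cands, key=lambda c: c[0])
        gamma[s] = gamma[s] + nu * lam
        R = R - nu * (blocks[s] @ lam)
        P = (np.eye(N) - nu * H[s]) @ P
        score = gmdl(float(R @ W @ R), N - np.trace(P), yty, N)
        if best is None or score < best[0]:
            best = (score, [g.copy() for g in gamma], P.copy())
    return best[1], best[2]

def two_step(blocks, y, n, m, nu=0.1, K1=50, K2=50):
    """Step I ignores correlation; Step II reweights by the estimated covariance."""
    N = n * m
    zeros = [np.zeros(X.shape[1]) for X in blocks]
    g1, P1 = boost_pass(blocks, y, np.eye(N), zeros, np.eye(N), nu, K1)
    resid = (y - sum(X @ g for X, g in zip(blocks, g1))).reshape(n, m)
    sigma_hat = resid.T @ resid / n                 # empirical covariance estimate
    W = np.kron(np.eye(n), np.linalg.inv(sigma_hat))
    g2, _ = boost_pass(blocks, y, W, g1, P1, nu, K2)
    return g1, g2, sigma_hat
```

The response `y` is assumed to be stacked subject by subject, so that reshaping the residual vector to $n\times m$ recovers one row per subject.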

3.2 Simulation

Simulation studies are conducted to evaluate the performance of the above two-step sparse boosting algorithm. The following four methods are compared in terms of variable selection and function estimation performance. M1: two-step L2 boosting (uses squared loss for the update criterion and gMDL for the stopping criterion); M2: two-step sparse boosting; M3: two-step lasso (performs lasso regression in the first step to calculate the estimated within-subject correlation structure using Eq. (14), and uses lasso regression in the second step, taking into account the estimated correlation structure) and M4: two-step elastic net regression (similar to M3, with the elastic net mixing parameter 0.5).

The simulation results from [18] show that all methods are able to identify important variables. However, in terms of sparsity, the two-step sparse boosting method performs best, with the smallest number of false positives. Both penalization methods select many more irrelevant variables than the boosting methods, with elastic net selecting the most. For two-step sparse boosting, the variable selection results are quite stable from step I to step II, but for the other approaches the false positives, and thus the model sizes, expand from step I to step II. Two-step sparse boosting yields the smallest bias in coefficient estimation among the competing methods. The refined estimates that incorporate the within-subject correlation generally perform better than the initial estimates that ignore it, since the two-step methods achieve a reduction in bias, especially when the within-subject correlation is high. In other words, the reduction in bias from step I to step II is much larger when the within-subject correlation is higher. This is intuitive, since in the second step the within-subject correlation structure estimated in the first step has been taken into account. Similar results are obtained for the bias of the estimated covariance matrix. The bias under smaller within-subject correlation is smaller than under larger within-subject correlation. Two-step sparse boosting yields a smaller bias of the estimated covariance matrix than the other competing methods when the within-subject correlation is high. In summary, the performance of two-step sparse boosting in variable selection and functional coefficient estimation is quite satisfactory.

3.3 Yeast cell cycle gene expression data analysis

The cell cycle is one of the most important activities in life, by which cells grow, replicate their chromosomes, undergo mitosis, and split into daughter cells. Thus, identifying cell cycle-regulated genes is very important. Adopting a model-based approach, Luan and Li [37] identified n=297 cell cycle-regulated genes based on the α-factor synchronization experiments. All gene expression levels were measured at m=18 different time points covering two cell-cycle periods. Using the same subset of the original data as in [38], a total of p=96 transcriptional factors (TFs) are included as predictors in the downstream analysis. Wei, Huang and Li [39] proved that the effects of the TFs on gene expression levels are time-dependent. After independence screening by the $\ell_2$-norm [40] is applied in a first step to screen out irrelevant predictors, several methods can be used to identify the key TFs involved in gene regulation. Besides two-step L2 boosting and two-step sparse boosting, which take the within-subject correlation into account in the second step, one-step L2 boosting and one-step sparse boosting, which ignore the within-subject correlation, are also considered for better comparison. In addition, some two-step penalized approaches are considered: two-step lasso, two-step adaptive lasso and two-step elastic net (with the elastic net mixing parameter 0.5).

The results from [18] show that the boosting approaches yield sparser models than the penalized methods. Sparse boosting yields an even sparser model and smaller estimation and prediction errors than L2 boosting. Two-step boosting performs better than one-step boosting, with smaller estimation and prediction errors. The two-step sparse boosting method yields the sparsest model, with the smallest in-sample and out-of-sample prediction errors among all methods. In terms of the selected TFs, there is a significant overlap between two-step sparse boosting and each of the other methods. In conclusion, the two-step sparse boosting approach performs quite well in terms of variable selection, coefficient estimation and prediction, and can provide useful information for identifying the important TFs that take part in the regulatory network.

4. Multi-step sparse boosting for subgroup identification

As personalized medicine gains popularity, identifying subgroups of patients who can gain a higher efficacy from a treatment becomes increasingly important. Recently, many statistical approaches have been proposed to identify subgroups of patients who may be suitable for different treatments. Traditionally, subgroup identification is achieved by parametric partitioning approaches such as Bayesian approaches [41] or classification and regression tree (CART) [42]. More recently, recursive partitioning methods have gained popularity since they achieve greater generalizability and efficiency. Such methods include MOB [43], PRIM [44], sequential-BATTing [45] and other non-parametric methods. For a detailed literature review of subgroup identification, refer to Lipkovich et al. [46]. In this section, a sparse boosting based subgroup identification method is presented in the context of dense longitudinal data.

In particular, a formal subgroup identification method for high-dimensional dense longitudinal data is presented. It incorporates multi-step sparse boosting into the homogeneous pursuit via change point detection. Firstly, sparse boosting algorithm for individual modeling is first performed to obtain initial estimates. Then, change point detection via binary segmentation is used to identify the subgroup structure of patients. Lastly, the model on each identified subgroups is refitted and again sparse boosting is utilized to remove irrelevant predictors and yield reliable final estimates. The rest of the section is organized as follows. In Section 4.1, the subgroup model is formulated and a detailed method for subgroup identification and estimation is presented. In Section 4.2, the subgroup identification technique is evaluated through simulation studies. In Section 4.3, the feasibility and applicability of the approach is validated by studying a wallaby growth dataset.

4.1 Methodology

4.1.1 Patients model

Denote by $Y_{it}$ the continuous measurement at the $t$th follow-up for patient $i$, where $i=1,\dots,n$, $t=1,\dots,T_i$. Let $X_{it}=(X_{it,1},\dots,X_{it,p})$ be the corresponding $p$-dimensional predictors. Assume that the $n$ patients are independent. The following longitudinal model for the patients is considered:

$$Y_{it}=\tilde\beta_{i,0}+\sum_{j=1}^{p}X_{it,j}\tilde\beta_{i,j}+\varepsilon_{it},\quad i=1,\dots,n,\ t=1,\dots,T_i, \tag{24}$$

where $\varepsilon_i=(\varepsilon_{i1},\dots,\varepsilon_{iT_i})^T$, $i=1,\dots,n$, are multivariate error terms with mean zero. Errors are assumed to be uncorrelated across different $i$, but the components of $\varepsilon_i$ are correlated with each other.

Moreover, the model is further assumed to have the following subgroup structure:

$$\tilde\beta_{i,j}=\begin{cases}\beta_{1,j} & \text{when } i\in\Omega_{1,j},\\ \beta_{2,j} & \text{when } i\in\Omega_{2,j},\\ \quad\vdots & \quad\vdots\\ \beta_{N_j+1,j} & \text{when } i\in\Omega_{N_j+1,j}.\end{cases} \tag{25}$$

The partition $\{\Omega_{k,j}:1\le k\le N_j+1\}$ of the regression coefficients $\{\tilde\beta_{i,j}:1\le i\le n\}$ is unknown, and there are $N_j+1$ subgroups for the $j$th predictor. The model divides all patients into at least $\max_j(N_j+1)$ and at most $\prod_{j=0}^{p}(N_j+1)$ subgroups. Patients in the same subgroup share a similar relationship between the response and the predictors and have the same set of regression coefficients, while different subgroups have different overall relationships between response and covariates. The main aim is to investigate the effects of the predictors on the response for different subgroups.
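The two bounds on the number of subgroups can be checked on a toy case. The following is a minimal sketch, not part of the original method: the label table `labels` and the helper `subgroup_count_bounds` are hypothetical names introduced only for illustration, with four patients and two coefficients (j = 0, 1).

```python
from math import prod

def subgroup_count_bounds(groups_per_coefficient):
    """Return (lower, upper) bounds on the number of patient subgroups.

    groups_per_coefficient[j] plays the role of N_j + 1 in Eq. (25).
    """
    return max(groups_per_coefficient), prod(groups_per_coefficient)

# Hypothetical group labels: labels[i][j] is the group of patient i
# for coefficient j (4 patients, coefficients j = 0 and j = 1).
labels = [(1, 1), (1, 2), (2, 3), (2, 3)]

# Number of distinct groups per coefficient: N_0 + 1 = 2, N_1 + 1 = 3.
groups_per_coefficient = [len({row[j] for row in labels}) for j in range(2)]

lower, upper = subgroup_count_bounds(groups_per_coefficient)

# Patients belong to the same subgroup only if all their labels agree.
n_subgroups = len(set(labels))
```

Here the distinct label combinations give three subgroups, which sits between the lower bound 3 and the upper bound 6.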

However, if the number of predictors under consideration is much larger than the number of patients and the number of follow-ups, estimating the regression coefficients becomes a serious challenge. Therefore, instead of adopting traditional methods (e.g., MLE), the sparse boosting method can be used to estimate the regression coefficients. In this way, the dimensionality of the features is reduced and the coefficient estimates are obtained simultaneously.

4.1.2 Subgroup identification and estimation

Denote $\tilde\beta_i=(\tilde\beta_{i,0},\dots,\tilde\beta_{i,p})^T$ and $\tilde\beta=(\tilde\beta_1^T,\dots,\tilde\beta_n^T)^T$. First, an initial estimator of $\tilde\beta_i$ is calculated for each subject $i$ through the sparse boosting approach, using his or her own repeated measurements; then, homogeneity pursuit via change point detection is used to identify the change points among the $\beta_{k,j}$s; lastly, the $\tilde\beta_i$s are replaced by the identified subgroup structure, and the final estimator of the regression coefficients is obtained by applying the sparse boosting algorithm again. The steps for estimating $\tilde\beta_i$ are outlined below.

In the first step, individualized modeling via sparse boosting is performed. For the $i$th individual, the initial coefficients $\tilde\beta_i$ can be estimated by minimizing the following least squares loss function:

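$$\sum_{t=1}^{T_i}\Big(Y_{it}-\tilde\beta_{i,0}-\sum_{j=1}^{p}X_{it,j}\tilde\beta_{i,j}\Big)^2. \tag{26}$$

Let $Y_i=(Y_{i1},\dots,Y_{iT_i})^T$, $X_{it,0}=1$, $X_{i,j}=(X_{i1,j},\dots,X_{iT_i,j})^T$, $X_i=(X_{i,0},\dots,X_{i,p})$. Then the function in Eq. (26) can be written in matrix form:

$$(Y_i-X_i\tilde\beta_i)^T(Y_i-X_i\tilde\beta_i). \tag{27}$$

Denote by $\tilde\beta_i^{[\hat L_i]}=(\tilde\beta_{i,0}^{[\hat L_i]},\dots,\tilde\beta_{i,p}^{[\hat L_i]})^T$ the estimator of $\tilde\beta_i$ obtained by sparse boosting with Eq. (27) as the loss function, where $\hat L_i$ is the estimated number of stopping iterations in this step. This is the initial estimator of $\tilde\beta_i$. The detailed sparse boosting algorithm is presented in the next subsection.

In the second step, homogeneity pursuit via change point detection is performed. The binary segmentation algorithm [47] is used to detect the change points among $\tilde\beta_{i,j}$, $i=1,\dots,n$, and to identify the subgroup structure. Let $\tilde\beta_{i,j}^{[\hat L_i]}$ be the $(j+1)$th component of $\tilde\beta_i^{[\hat L_i]}$. For the $j$th covariate, $\tilde\beta_{i,j}^{[\hat L_i]}$, $i=1,\dots,n$, are sorted in ascending order and denoted by $b_1\le\dots\le b_n$. Denote by $r_{i,j}$ the rank of $\tilde\beta_{i,j}^{[\hat L_i]}$.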

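For any $1\le l_1<l_2\le n$, denote the scaled difference between the partial means of the first $\tau-l_1+1$ observations and the last $l_2-\tau$ observations by

$$H_{l_1,l_2}(\tau)=\sqrt{\frac{(l_2-\tau)(\tau-l_1+1)}{l_2-l_1+1}}\,\left|\frac{\sum_{i=\tau+1}^{l_2}b_i}{l_2-\tau}-\frac{\sum_{i=l_1}^{\tau}b_i}{\tau-l_1+1}\right|. \tag{28}$$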
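To make the scaled difference concrete, here is a minimal sketch in plain Python. It assumes the usual CUSUM-type form of the statistic (square-root scaling and an absolute difference of the two partial means); the function name `scaled_difference` and the toy values in `b` are illustrative assumptions, not taken from the chapter.

```python
from math import sqrt

def scaled_difference(b, l1, l2, tau):
    """Scaled difference H_{l1,l2}(tau) for sorted values b.

    Indices are 1-based and inclusive, as in the text:
    the left block is b_{l1..tau}, the right block is b_{tau+1..l2}.
    """
    left = b[l1 - 1:tau]
    right = b[tau:l2]
    scale = sqrt(len(right) * len(left) / (l2 - l1 + 1))
    return scale * abs(sum(right) / len(right) - sum(left) / len(left))

# Toy sorted estimates with one clear jump between ranks 3 and 4.
b = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
h_values = {tau: scaled_difference(b, 1, 6, tau) for tau in range(1, 6)}
best_tau = max(h_values, key=h_values.get)
```

In this toy case the statistic is largest at the split `tau = 3`, where the two partial means differ the most and both blocks are equally long.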

Let $\delta$ denote the threshold, a tuning parameter that can be selected by AIC or BIC. The binary segmentation algorithm is then as follows:

(1) Find $\hat t_1$ such that
$$H_{1,n}(\hat t_1)=\max_{1\le\tau<n}H_{1,n}(\tau). \tag{29}$$
If $H_{1,n}(\hat t_1)\le\delta$, there is no change point among $b_l$, $l=1,\dots,n$, and the change point detection process terminates. Otherwise, $\hat t_1$ is added to the set of change points and the region $\{\tau:1\le\tau\le n\}$ is divided into two subregions: $\{\tau:1\le\tau\le\hat t_1\}$ and $\{\tau:\hat t_1+1\le\tau\le n\}$.
(2) Find the change points in the two subregions derived in part (1), respectively. Consider the region $\{\tau:1\le\tau\le\hat t_1\}$ first. Find $\hat t_2$ such that
$$H_{1,\hat t_1}(\hat t_2)=\max_{1\le\tau<\hat t_1}H_{1,\hat t_1}(\tau). \tag{30}$$
If $H_{1,\hat t_1}(\hat t_2)\le\delta$, there is no change point in the region $\{\tau:1\le\tau\le\hat t_1\}$. Otherwise, add $\hat t_2$ to the set of change points and divide the region $\{\tau:1\le\tau\le\hat t_1\}$ into two subregions: $\{\tau:1\le\tau\le\hat t_2\}$ and $\{\tau:\hat t_2+1\le\tau\le\hat t_1\}$. Similarly, for the region $\{\tau:\hat t_1+1\le\tau\le n\}$, $\hat t_3$ can be found such that
$$H_{\hat t_1+1,n}(\hat t_3)=\max_{\hat t_1+1\le\tau<n}H_{\hat t_1+1,n}(\tau). \tag{31}$$
If $H_{\hat t_1+1,n}(\hat t_3)\le\delta$, there is no change point in the region $\{\tau:\hat t_1+1\le\tau\le n\}$. Otherwise, add $\hat t_3$ to the set of change points and divide the region $\{\tau:\hat t_1+1\le\tau\le n\}$ into two subregions: $\{\tau:\hat t_1+1\le\tau\le\hat t_3\}$ and $\{\tau:\hat t_3+1\le\tau\le n\}$.
(3) For each subregion derived in part (2), repeat the procedure applied to the subregions $\{\tau:1\le\tau\le\hat t_1\}$ and $\{\tau:\hat t_1+1\le\tau\le n\}$ in part (2), until no change point is detected in any subregion.
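The three parts of the algorithm can be sketched as a loop over a stack of regions. This is an illustrative sketch only: it assumes the square-root scaled, absolute-difference form of the statistic $H$, uses a fixed threshold `delta` instead of selecting it by AIC or BIC, and the names `binary_segmentation` and `scaled_difference` as well as the toy data are hypothetical.

```python
from math import sqrt

def scaled_difference(b, l1, l2, tau):
    """H_{l1,l2}(tau) for sorted values b (1-based, inclusive indices)."""
    left, right = b[l1 - 1:tau], b[tau:l2]
    scale = sqrt(len(right) * len(left) / (l2 - l1 + 1))
    return scale * abs(sum(right) / len(right) - sum(left) / len(left))

def binary_segmentation(b, delta):
    """Return the sorted change points (1-based ranks) among sorted values b."""
    change_points = []
    regions = [(1, len(b))]              # regions still to be examined
    while regions:
        l1, l2 = regions.pop()
        if l1 >= l2:
            continue                     # a single observation cannot be split
        # Candidate split: the tau maximizing the scaled difference.
        tau_hat = max(range(l1, l2),
                      key=lambda t: scaled_difference(b, l1, l2, t))
        if scaled_difference(b, l1, l2, tau_hat) <= delta:
            continue                     # no change point in this region
        change_points.append(tau_hat)
        regions.append((l1, tau_hat))    # left subregion
        regions.append((tau_hat + 1, l2))  # right subregion
    return sorted(change_points)

# Toy sorted estimates clustered around 0, 1 and 3.
b = [-0.05, 0.0, 0.05, 0.95, 1.0, 1.05, 2.95, 3.0, 3.05]
change_points = binary_segmentation(b, delta=0.5)   # splits after ranks 3 and 6
```

With a very large threshold the first test already fails and no change point is reported, which corresponds to the termination case in part (1).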

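The estimated locations of the change points are sorted in increasing order and denoted by

$$\hat t_1<\hat t_2<\dots<\hat t_{\hat N_j}, \tag{32}$$

where $\hat N_j$ is the number of detected change points and can be used to estimate $N_j$. Further denote $\hat t_0=0$ and $\hat t_{\hat N_j+1}=n$. Let $\hat R_{i,j}=\{\ell:\hat t_{\ell-1}<r_{i,j}\le\hat t_\ell,\ 1\le\ell\le\hat N_j+1\}$, where $\{\hat R_{i,j}:1\le i\le n\}$ can be used to estimate the grouping index $\{R_{i,j}:1\le i\le n\}$. The above algorithm can be used to identify the change points for all $j=0,\dots,p$ and correspondingly obtain $\{\hat R_{i,j}:1\le i\le n,\ 0\le j\le p\}$. Let $\{\hat R^\star_{\ell,j}:1\le\ell\le\hat N,\ 0\le j\le p\}$ be the unique rows of $\{\hat R_{i,j}:1\le i\le n,\ 0\le j\le p\}$; then $\hat N$ is the estimated total number of subgroups of patients, and the index set of the patients in group $\ell$ is

$$\hat\Omega_\ell=\{i:\hat R_{i,j}=\hat R^\star_{\ell,j}\},\quad 1\le\ell\le\hat N. \tag{33}$$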
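The mapping from ranks and change points to subgroups can be sketched as follows. This is a minimal illustration under assumed inputs: `ranks` and `change_points_per_coef` are made-up toy values, patients are indexed from 0, and the helper names are hypothetical.

```python
from bisect import bisect_left

def group_index(rank, change_points):
    """Return l such that t_{l-1} < rank <= t_l, with t_0 = 0 and t_{N+1} = n."""
    return bisect_left(change_points, rank) + 1

def identify_subgroups(ranks, change_points_per_coef):
    """Group patients whose grouping indices agree for every coefficient.

    ranks[i][j] is the rank of patient i's initial estimate for coefficient j;
    change_points_per_coef[j] is the sorted list of change points for j.
    """
    rows = [
        tuple(group_index(r, cps) for r, cps in zip(row, change_points_per_coef))
        for row in ranks
    ]
    patterns = sorted(set(rows))          # the unique rows
    return {pat: [i for i, row in enumerate(rows) if row == pat]
            for pat in patterns}

# Toy example: 4 patients, 2 coefficients.
# Coefficient 0 has one change point after rank 2; coefficient 1 has none.
ranks = [(1, 1), (2, 3), (3, 2), (4, 4)]
change_points_per_coef = [[2], []]
subgroups = identify_subgroups(ranks, change_points_per_coef)
```

The keys of `subgroups` are the unique rows of grouping indices and the values are the patient index sets, so the number of keys corresponds to the estimated number of subgroups.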

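All the coefficients $\tilde\beta_{i,j}$ in the same estimated subgroup $\hat\Omega_\ell$ are treated as equal.

In the third step, subgroup modeling is performed by sparse boosting. Incorporating the patient structure identified in step 2, the model is refitted to each of the subgroups via sparse boosting with the following least squares loss function:

$$\sum_{i\in\hat\Omega_\ell}\sum_{t=1}^{T_i}\Big(Y_{it}-\tilde\beta_{i,0}-\sum_{j=1}^{p}X_{it,j}\tilde\beta_{i,j}\Big)^2,\quad 1\le\ell\le\hat N. \tag{34}$$

Further denote $Y^\star_\ell=(Y^T_{\hat\Omega_{\ell 1}},\dots,Y^T_{\hat\Omega_{\ell|\hat\Omega_\ell|}})^T$, $X^\star_{\ell,j}=(X^T_{\hat\Omega_{\ell 1},j},\dots,X^T_{\hat\Omega_{\ell|\hat\Omega_\ell|},j})^T$, $X^\star_\ell=(X^\star_{\ell,0},\dots,X^\star_{\ell,p})$ and $\tilde\beta^\star_\ell=(\tilde\beta^T_{\hat\Omega_{\ell 1}},\dots,\tilde\beta^T_{\hat\Omega_{\ell|\hat\Omega_\ell|}})^T$ for $\ell=1,\dots,\hat N$, where $\hat\Omega_{\ell i}$ is the $i$th element of $\hat\Omega_\ell$ and $|\hat\Omega_\ell|$ is the number of elements in $\hat\Omega_\ell$. The function in Eq. (34) can be written in matrix form:

$$(Y^\star_\ell-X^\star_\ell\tilde\beta^\star_\ell)^T(Y^\star_\ell-X^\star_\ell\tilde\beta^\star_\ell),\quad 1\le\ell\le\hat N. \tag{35}$$

Denote by $\tilde\beta_\ell^{\star[\hat L^\star_\ell]}$ the estimate of $\tilde\beta^\star_\ell$ obtained by sparse boosting with Eq. (35) as the loss function, where $\hat L^\star_\ell$ is the estimated number of stopping iterations in this step. The estimator of the coefficient $\tilde\beta_i$ is

$$\hat\beta_i=\tilde\beta_\ell^{\star[\hat L^\star_\ell]}\quad\text{for } i\in\hat\Omega_\ell,\ 1\le i\le n. \tag{36}$$

More details about how to use sparse boosting to obtain $\{\tilde\beta_i^{[\hat L_i]}\}_{1\le i\le n}$ and $\{\tilde\beta_\ell^{\star[\hat L^\star_\ell]}\}_{1\le\ell\le\hat N}$ are given in the following subsection.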
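The pooling in the third step amounts to stacking the rows of all patients in a subgroup and fitting one coefficient vector to the stacked data. The sketch below illustrates only this stacking and the assignment of the fitted vector back to each patient; ordinary least squares via `numpy.linalg.lstsq` is used as a simple stand-in for sparse boosting, and the function names and toy data are assumptions made for illustration.

```python
import numpy as np

def refit_by_subgroup(Y_list, X_list, subgroups, fit):
    """Pool each subgroup's rows and fit one coefficient vector per subgroup."""
    beta_hat = {}
    for members in subgroups:
        Y_star = np.concatenate([Y_list[i] for i in members])  # stacked responses
        X_star = np.vstack([X_list[i] for i in members])       # stacked designs
        beta_group = fit(X_star, Y_star)
        for i in members:
            beta_hat[i] = beta_group     # every member shares the group estimate
    return beta_hat

def least_squares(X, y):
    """Stand-in fitting routine (NOT sparse boosting): ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Toy data: 4 patients, 5 follow-ups each, intercept plus one predictor.
rng = np.random.default_rng(1)
true_beta = {0: [0.0, 1.0], 1: [0.0, 1.0], 2: [0.0, 3.0], 3: [0.0, 3.0]}
X_list = [np.column_stack([np.ones(5), rng.normal(size=5)]) for _ in range(4)]
Y_list = [X_list[i] @ np.array(true_beta[i]) for i in range(4)]  # noise-free

beta_hat = refit_by_subgroup(Y_list, X_list, [[0, 1], [2, 3]], least_squares)
```

Passing the fitting routine as an argument keeps the stacking logic separate from the estimator, so a sparse boosting routine could be supplied in place of `least_squares`.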

4.1.3 Multi-step sparse boosting techniques

gMDL can be used as the penalized empirical risk function, serving both as the update criterion in each iteration and as the stopping criterion, which avoids the selection of a tuning parameter. gMDL can be expressed in the following form:

$$\mathrm{gMDL}(Y,\mathrm{RSS},\mathrm{trace}(B))=\log(F)+\frac{\mathrm{trace}(B)}{|Y|}\log\left(\frac{Y^TY-\mathrm{RSS}}{\mathrm{trace}(B)\times F}\right),\quad F=\frac{\mathrm{RSS}}{|Y|-\mathrm{trace}(B)}, \tag{37}$$

where $Y$ is the response vector, $|Y|$ is the length of $Y$, $B$ is the boosting operator and RSS is the residual sum of squares.
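As a quick numerical illustration of the criterion, the following sketch evaluates gMDL for a toy response vector. The function name `gmdl` and the input values are hypothetical and chosen only to show how the criterion reacts to the trace of the boosting operator.

```python
from math import log

def gmdl(y, rss, trace_b):
    """gMDL criterion: log(F) + trace(B)/|Y| * log((Y'Y - RSS) / (trace(B) * F))."""
    n = len(y)
    f = rss / (n - trace_b)
    yty = sum(v * v for v in y)
    return log(f) + trace_b / n * log((yty - rss) / (trace_b * f))

y = [1.0, 2.0, 3.0, 4.0]          # toy response, Y'Y = 30
score_small = gmdl(y, rss=1.0, trace_b=1.0)
score_large = gmdl(y, rss=1.0, trace_b=2.0)
```

For the same residual sum of squares, the fit with the larger trace receives the larger (worse) score in this example, which is how the criterion penalizes model complexity.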

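The sparse boosting procedure is described in detail as follows. The starting value of $\tilde\beta_i$ is set to the zero vector, i.e. $\tilde\beta_i^{[0]}=0$. In the $l_i$th iteration ($0<l_i\le L_i$, where $L_i$ is the maximum number of iterations considered in this step), the residual $R^{[l_i]}=Y_i-X_i\tilde\beta_i^{[l_i-1]}$ of the current iteration is fitted against each component $X_{i,j}$, $j=0,\dots,p$. The fit, denoted by $\hat\lambda_j^{[l_i]}$, is obtained by minimizing the squared loss function $(R^{[l_i]}-X_{i,j}\lambda)^T(R^{[l_i]}-X_{i,j}\lambda)$ with respect to $\lambda$. Thus, the least squares estimate is $\hat\lambda_j^{[l_i]}=(X_{i,j}^TX_{i,j})^{-1}X_{i,j}^TR^{[l_i]}$, the corresponding hat matrix is $H_j=X_{i,j}(X_{i,j}^TX_{i,j})^{-1}X_{i,j}^T$ and the residual sum of squares is $\mathrm{RSS}_j^{[l_i]}=(R^{[l_i]}-X_{i,j}\hat\lambda_j^{[l_i]})^T(R^{[l_i]}-X_{i,j}\hat\lambda_j^{[l_i]})$. The selected entry $\hat s_{l_i}$ is obtained by:

$$\hat s_{l_i}=\arg\min_{0\le j\le p}\mathrm{gMDL}\big(Y_i,\mathrm{RSS}_j^{[l_i]},\mathrm{trace}(B_j^{[l_i]})\big), \tag{38}$$

where $B_j^{[1]}=H_j$ and $B_j^{[l_i]}=I-(I-H_j)(I-\nu H_{\hat s_{l_i-1}})\cdots(I-\nu H_{\hat s_1})$ for $l_i>1$ is the boosting operator for choosing the $j$th entry in the $l_i$th iteration in this step. Hence, a unique element $X_{i,\hat s_{l_i}}$ is selected at each iteration, and only the corresponding coefficient $\tilde\beta_{i,\hat s_{l_i}}^{[l_i]}$ changes, i.e., $\tilde\beta_{i,\hat s_{l_i}}^{[l_i]}=\tilde\beta_{i,\hat s_{l_i}}^{[l_i-1]}+\nu\hat\lambda_{\hat s_{l_i}}^{[l_i]}$, where $\nu$ is the pre-specified step-size parameter. All the other $\tilde\beta_{i,j}^{[l_i]}$ for $j\ne\hat s_{l_i}$ remain unchanged. This procedure is repeated $L_i$ times, and the number of iterations can be estimated by

$$\hat L_i=\arg\min_{1\le l_i\le L_i}\mathrm{gMDL}\big(Y_i,\mathrm{RSS}_{\hat s_{l_i}}^{[l_i]},\mathrm{trace}(B^{[l_i]})\big), \tag{39}$$

where $B^{[l_i]}=I-(I-\nu H_{\hat s_{l_i}})\cdots(I-\nu H_{\hat s_1})$.

From the above sparse boosting approach, the estimator of $\tilde\beta_i$ is $\tilde\beta_i^{[\hat L_i]}=(\tilde\beta_{i,0}^{[\hat L_i]},\dots,\tilde\beta_{i,p}^{[\hat L_i]})^T$, $i=1,\dots,n$. Then the subgroup structure can be obtained by homogeneity pursuit via change point detection.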
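The first-stage procedure for a single patient can be sketched with NumPy as below. This is a minimal sketch under stated assumptions rather than the author's implementation: the function names, the simulated data and the choice `nu=0.1` are illustrative; the stopping score uses the residual sum of squares of the step-size-shrunken fit, which is one reading of Eq. (39); and the trace of the candidate operator is computed through the identity trace(B_j) = n − trace(M) + x_jᵀ M x_j / (x_jᵀ x_j), where M is the running product of the factors (I − ν H).

```python
import numpy as np

def gmdl(y, rss, trace_b):
    """gMDL criterion: log(F) + trace(B)/|Y| * log((Y'Y - RSS) / (trace(B) * F))."""
    n = y.size
    f = rss / (n - trace_b)
    return np.log(f) + trace_b / n * np.log((y @ y - rss) / (trace_b * f))

def sparse_boost(X, y, nu=0.1, max_iter=300):
    """Componentwise sparse boosting; returns (coefficients, stopping iteration)."""
    n, d = X.shape
    beta = np.zeros(d)
    M = np.eye(n)                                   # product of (I - nu * H_s) so far
    norm2 = np.einsum("ij,ij->j", X, X)             # x_j' x_j for every column
    path, scores = [], []
    for _ in range(max_iter):
        resid = y - X @ beta
        lam = X.T @ resid / norm2                   # least squares fit per column
        rss = resid @ resid - lam**2 * norm2        # RSS_j of each candidate fit
        # trace(B_j) for B_j = I - (I - H_j) M, via the identity in the lead-in.
        trace_j = n - np.trace(M) + np.einsum("ij,ij->j", X, M @ X) / norm2
        s = int(np.argmin([gmdl(y, r, t) for r, t in zip(rss, trace_j)]))
        beta[s] += nu * lam[s]                      # update only the selected entry
        x = X[:, s]
        M = M - nu * np.outer(x, x @ M) / norm2[s]  # M <- (I - nu * H_s) M
        new_resid = y - X @ beta
        scores.append(gmdl(y, new_resid @ new_resid, n - np.trace(M)))
        path.append(beta.copy())
    best = int(np.argmin(scores))                   # estimated stopping iteration - 1
    return path[best], best + 1

# Simulated data for one hypothetical patient: 60 follow-ups, intercept plus
# 10 predictors, of which only the first two have non-zero effects.
rng = np.random.default_rng(0)
n_obs, p = 60, 10
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, p))])
beta_true = np.zeros(p + 1)
beta_true[:3] = [1.0, 2.0, -1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n_obs)

beta_hat, n_iter = sparse_boost(X, y)
```

Keeping the whole coefficient path makes it possible to return the estimate at the iteration that minimizes the stopping score, rather than the estimate after the last iteration.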

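Next, sparse boosting is used again for each estimated subgroup. The starting value of $\tilde\beta^\star_\ell$ is set to the zero vector, i.e. $\tilde\beta_\ell^{\star[0]}=0$. In the $l^\star_\ell$th iteration ($0<l^\star_\ell\le L^\star_\ell$, where $L^\star_\ell$ is the maximum number of iterations considered in this stage), the residual $R^{\star[l^\star_\ell]}=Y^\star_\ell-X^\star_\ell\tilde\beta_\ell^{\star[l^\star_\ell-1]}$ of the current iteration is fitted against each component $X^\star_{\ell,j}$, $j=0,\dots,p$. The fit, denoted by $\hat\lambda_j^{\star[l^\star_\ell]}$, is calculated by minimizing the squared loss function $(R^{\star[l^\star_\ell]}-X^\star_{\ell,j}\lambda)^T(R^{\star[l^\star_\ell]}-X^\star_{\ell,j}\lambda)$ with respect to $\lambda$. Therefore, the least squares estimate is $\hat\lambda_j^{\star[l^\star_\ell]}=(X_{\ell,j}^{\star T}X^\star_{\ell,j})^{-1}X_{\ell,j}^{\star T}R^{\star[l^\star_\ell]}$, the corresponding hat matrix is $H^\star_j=X^\star_{\ell,j}(X_{\ell,j}^{\star T}X^\star_{\ell,j})^{-1}X_{\ell,j}^{\star T}$ and the residual sum of squares is $\mathrm{RSS}_j^{\star[l^\star_\ell]}=(R^{\star[l^\star_\ell]}-X^\star_{\ell,j}\hat\lambda_j^{\star[l^\star_\ell]})^T(R^{\star[l^\star_\ell]}-X^\star_{\ell,j}\hat\lambda_j^{\star[l^\star_\ell]})$. The chosen element $\hat s^\star_{l^\star_\ell}$ is obtained by:

$$\hat s^\star_{l^\star_\ell}=\arg\min_{0\le j\le p}\mathrm{gMDL}\big(Y^\star_\ell,\mathrm{RSS}_j^{\star[l^\star_\ell]},\mathrm{trace}(B_j^{\star[l^\star_\ell]})\big), \tag{40}$$

where $B_j^{\star[1]}=H^\star_j$ and $B_j^{\star[l^\star_\ell]}=I-(I-H^\star_j)(I-\nu H^\star_{\hat s_{l^\star_\ell-1}})\cdots(I-\nu H^\star_{\hat s_1})$ for $l^\star_\ell>1$ is the boosting operator for choosing the $j$th element in the $l^\star_\ell$th iteration in this stage. Hence, a unique element $X^\star_{\ell,\hat s_{l^\star_\ell}}$ is selected at each iteration, and only the corresponding coefficient $\tilde\beta_{\ell,\hat s_{l^\star_\ell}}^{\star[l^\star_\ell]}$ changes, i.e., $\tilde\beta_{\ell,\hat s_{l^\star_\ell}}^{\star[l^\star_\ell]}=\tilde\beta_{\ell,\hat s_{l^\star_\ell}}^{\star[l^\star_\ell-1]}+\nu\hat\lambda_{\hat s_{l^\star_\ell}}^{\star[l^\star_\ell]}$, where $\nu$ is the pre-specified step-size parameter. All the other $\tilde\beta_{\ell,j}^{\star[l^\star_\ell]}$ for $j\ne\hat s^\star_{l^\star_\ell}$ remain unchanged. This procedure is repeated $L^\star_\ell$ times, and the number of iterations can be estimated by

$$\hat L^\star_\ell=\arg\min_{1\le l^\star_\ell\le L^\star_\ell}\mathrm{gMDL}\big(Y^\star_\ell,\mathrm{RSS}_{\hat s_{l^\star_\ell}}^{\star[l^\star_\ell]},\mathrm{trace}(B^{\star[l^\star_\ell]})\big), \tag{41}$$

where $B^{\star[l^\star_\ell]}=I-(I-\nu H^\star_{\hat s_{l^\star_\ell}})\cdots(I-\nu H^\star_{\hat s_1})$.

From this second stage of sparse boosting, the estimator of $\tilde\beta^\star_\ell$ is $\tilde\beta_\ell^{\star[\hat L^\star_\ell]}=(\tilde\beta_{\ell,0}^{\star[\hat L^\star_\ell]},\dots,\tilde\beta_{\ell,p}^{\star[\hat L^\star_\ell]})^T$, $\ell=1,\dots,\hat N$.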

4.2 Simulation

Extensive simulations are conducted to evaluate the performance of the proposed procedure. The accuracy of subgrouping, feature selection, coefficient estimation and prediction is assessed under different numbers of patients and repeated measurements. To better understand the advantages of the proposed method, the following four approaches are compared. M1: the homogeneous model fitting method, which treats all patients as one group and uses sparse boosting on the single model to estimate $\tilde\beta$; M2: the heterogeneous model fitting method, which uses the initial pre-grouping estimate $\tilde\beta_i^{[\hat L_i]}$ as the final estimate of $\tilde\beta_i$; M3: the same as the proposed method except that, in step 2, instead of detecting the change points among the coefficients of each covariate, $\tilde\beta_{i,j}^{[\hat L_i]}$, $i=1,\dots,n$, separately for $j=0,\dots,p$, it detects the change points among $(\tilde\beta_1^{[\hat L_1]T},\dots,\tilde\beta_n^{[\hat L_n]T})^T$, similarly to Ke et al. [48]; M4: the proposed method.

The results from [19] show that the naive homogeneous model fitting method M1 can rarely identify the important covariates, while the over-parameterized model fitting method M2 and the two methods that identify subgroup structures (M3 and M4) consistently yield a number of true positives equal to the true number of important covariates. Among these three methods that can identify the important covariates, the proposed method produces the fewest false positives. In addition, the number of false positives decreases as the cluster size increases. Neither the homogeneous nor the heterogeneous model fitting method is able to identify the true structure among patients. Method M3 produces many more subgroups than actually exist, while the proposed method M4 identifies a number of subgroups closest to the actual number. Furthermore, the probability of identifying the true subgroups increases with the number of repeated measurements. For in-sample prediction, the over-parameterized model M2 performs best, while methods M3 and M4 are very competitive. For out-of-sample prediction, however, method M4 is the best. M1 is inferior to M4, yielding poor estimation and prediction results. In summary, the proposed method performs well in terms of subgroup identification, variable selection, estimation and prediction.

4.3 Wallaby growth data analysis

The proposed subgroup identification method is applied to wallaby growth data from the Australian Dataset and Story Library (OzDASL), which can be found at http://www.statsci.org/data/oz/wallaby.html. The data set contains longitudinal growth measurements of 77 Tammar wallabies. The response variable is the weight of the wallabies (in tenths of a gram). The predictors are the lengths of the head, ear, body, arm, leg, tail and foot, together with their second-order interactions, so a total of 35 predictors are included in the analysis. After removing the missing data, 43 Tammar wallabies are kept in our dataset. The number of repeated measurements ranges from 9 to 34 (median: 23). To better understand the wallabies' growth trend, two questions are investigated: which body parts affect the weight, and whether the length of each body part has the same effect on the weight for all wallabies, i.e. whether there are any subgroups among the wallabies. In addition to the above subgroup identification method (SB-CPD1), the other three methods studied in the simulation are also considered, i.e. the homogeneous model fitting method (SB-Homogeneous), the heterogeneous model fitting method (SB-Heterogeneous) and the method similar to SB-CPD1 that identifies subgroups via the alternative method in Ke et al. [48] (SB-CPD2). Furthermore, the following subgroup identification methods incorporating penalized methods are investigated: they are similar to our proposed method but use lasso (Lasso-CPD1), elastic net (ElasticNet-CPD1), SCAD (SCAD-CPD1) or MCP (MCP-CPD1) instead of sparse boosting.

The results from [19] show that although Lasso-CPD1 and ElasticNet-CPD1 yield smaller in-sample prediction errors by keeping all 35 covariates, they have relatively large out-of-sample prediction errors due to over-fitting. The subgroup identification method via sparse boosting keeps fewer predictors and achieves a sparser model than the penalized methods. The proposed method SB-CPD1 identifies fewer subgroups and predictors than the competing methods while producing the smallest out-of-sample prediction errors. In conclusion, the proposed subgroup identification method provides a more precise definition of the various subgroups. It may also lead to more accurate medical decision making for these subjects.

5. Conclusions

In this chapter, we discussed various sparse boosting based machine learning methods in the context of high-dimensional data problems. Specifically, we presented the sparse boosting procedure and the two-step sparse boosting procedure for nonparametric varying-coefficient models with survival data and repeatedly measured longitudinal data, respectively, to simultaneously perform variable selection and estimation of functional coefficients. We further presented the multi-step sparse boosting based subgroup identification method for longitudinal patient data to identify subgroups that exhibit different treatment effects. The extensive numerical studies show the validity and effectiveness of the proposed methods, and the real data analyses further demonstrate their usefulness and advantages.

References

1. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996 Jan;58(1):267–288
2. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association. 2001 Dec 1;96(456):1348–1360
3. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics. 2010;38(2):894–942
4. Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006 Dec 1;101(476):1418–1429
5. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005 Apr 1;67(2):301–320
6. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49–67
7. Schapire RE. The strength of weak learnability. Machine learning. 1990 Jun 1;5(2):197–227
8. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences. 1997 Aug 1;55(1):119–139
9. Bühlmann P, Yu B. Boosting with the L 2 loss: regression and classification. Journal of the American Statistical Association. 2003 Jun 1;98(462):324–339
10. Bühlmann P, Yu B, Singer Y, Wasserman L. Sparse Boosting. Journal of Machine Learning Research. 2006 Jun 1;7(6)
11. Wang Z. HingeBoost: ROC-based boost for classification and variable selection. The International Journal of Biostatistics. 2011 Feb 4;7(1)
12. Bühlmann P, Hothorn T. Twin boosting: improved feature selection and prediction. Statistics and Computing. 2010 Apr;20(2):119–138
13. Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC bioinformatics. 2010 Dec;11(1):1–7
14. Wang Z. Multi-class hingeboost. Methods of information in medicine. 2012;51(02):162–167
15. Zhao J. General sparse boosting: improving feature selection of l2 boosting by correlation-based penalty family. Communications in Statistics-Simulation and Computation. 2015 Jul 3;44(6):1612–1640
16. Yang Y, Zou H. Nonparametric multiple expectile regression via ER-Boost. Journal of Statistical Computation and Simulation. 2015 May 3;85(7):1442–1458
17. Yue M, Li J, Ma S. Sparse boosting for high-dimensional survival data with varying coefficients. Statistics in medicine. 2018 Feb 28;37(5):789–800
18. Yue M, Li J, Cheng MY. Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients. Computational Statistics & Data Analysis. 2019 Mar 1;131:222–234
19. Yue M, Huang L. A new approach of subgroup identification for high-dimensional longitudinal data. Journal of Statistical Computation and Simulation. 2020 Jul 23;90(11):2098–2116
20. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society. 1972;34(2):187–220
21. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in medicine. 1992;11(14–15):1871–1879
22. Schmid M, Hothorn T. Flexible boosting of accelerated failure time models. BMC bioinformatics. 2008 Dec;9(1):1–3
23. Wang Z, Wang CY. Buckley-James boosting for survival analysis with high-dimensional biomarker data. Statistical Applications in Genetics and Molecular Biology. 2010 Jun 8;9(1)
24. Li J, Ma S. Survival analysis in medicine and genetics. CRC Press; 2013 Jun 4
25. Stute W. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis. 1993 Apr 1;45(1):89–103
26. Curry HB, Schoenberg IJ. On Pólya frequency functions IV: the fundamental spline functions and their limits. In: I. J. Schoenberg Selected Papers. 1988 (pp. 347–383). Birkhäuser, Boston, MA
27. Hansen MH, Yu B. Model selection and the principle of minimum description length. Journal of the American Statistical Association. 2001 Jun 1;96(454):746–774
28. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC. Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine. 2008 Aug;14(8):822
29. Consonni D, Bertazzi PA, Zocchetti C. Why and how to control for age in occupational epidemiology. Occupational and environmental medicine. 1997 Nov 1;54(11):772–776
30. Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics. 2008 Mar;64(1):132–140
31. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data: Oxford University Press. 2002
32. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998 John Wiley & Sons. Hoboken NJ. 2012
33. Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001 Sep 1;96(455):1045–1056
34. Fan J, Huang T, Li R. Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association. 2007 Jun 1;102(478):632–641
35. Cheng MY, Honda T, Li J, Peng H. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Annals of Statistics. 2014;42(5):1819–1849
36. Cheng MY, Honda T, Li J. Efficient estimation in semivarying coefficient models for longitudinal/clustered data. The Annals of Statistics. 2016;44(5):1988–2017
37. Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003 Mar 1;19(4):474–482
38. Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007 Jun 15;23(12):1486–1494
39. Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica. 2011 Oct 1;21(4):1515
40. Yue M, Li J. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. The international journal of biostatistics. 2017 May 18;13(1)
41. Sivaganesan S, Müller P, Huang B. Subgroup finding via Bayesian additive regression trees. Statistics in medicine. 2017 Jul 10;36(15):2391–2403
42. Zhang H, Singer BH. Recursive partitioning and applications. Springer Science & Business Media; 2010 Jul 1
43. Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics. 2008 Jun 1;17(2):492–514
44. Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in medicine. 2015 Jan 30;34(2):317–342
45. Huang X, Sun Y, Trow P, Chatterjee S, Chakravartty A, Tian L, Devanarayan V. Patient subgroup identification for clinical drug development. Statistics in medicine. 2017 Apr 30;36(9):1414–1428
46. Lipkovich I, Dmitrienko A, D’Agostino RB Sr. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in medicine. 2017 Jan 15;36(1):136–196
47. Bai J. Estimating multiple breaks one at a time. Econometric theory. 1997 Jun 1:315–352
48. Ke Y, Li J, Zhang W. Structure identification in panel data analysis. Annals of Statistics. 2016;44(3):1193–1233

[1] 1. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996 Jan;58(1):267–288

[2] 2. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association. 2001 Dec 1;96(456):1348–1360

[3] 3. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics. 2010;38(2):894–942

[4] 4. Zou H. The adaptive lasso and its oracle properties. Journal of the American statistical association. 2006 Dec 1;101(476):1418–1429

[5] 5. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005 Apr 1;67(2):301–320

[6] 6. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb;68(1):49–67

[7] 7. Schapire RE. The strength of weak learnability. Machine learning. 1990 Jun 1;5(2):197–227

[8] 8. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences. 1997 Aug 1;55(1):119–139

[9] 9. Bühlmann P, Yu B. Boosting with the L 2 loss: regression and classification. Journal of the American Statistical Association. 2003 Jun 1;98(462):324–339

[10] 10. Bühlmann P, Yu B, Singer Y, Wasserman L. Sparse Boosting. Journal of Machine Learning Research. 2006 Jun 1;7(6)

[11] 11. Wang Z. HingeBoost: ROC-based boost for classification and variable selection. The International Journal of Biostatistics. 2011 Feb 4;7(1)

[12] 12. Bühlmann P, Hothorn T. Twin boosting: improved feature selection and prediction. Statistics and Computing. 2010 Apr;20(2):119–138

[13] 13. Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC bioinformatics. 2010 Dec;11(1):1–7

[14] 14. Wang Z. Multi-class hingeboost. Methods of information in medicine. 2012;51(02):162–167

[15] 15. Zhao J. General sparse boosting: improving feature selection of l2 boosting by correlation-based penalty family. Communications in Statistics-Simulation and Computation. 2015 Jul 3;44(6):1612–1640

[16] 16. Yang Y, Zou H. Nonparametric multiple expectile regression via ER-Boost. Journal of Statistical Computation and Simulation. 2015 May 3;85(7):1442–1458

[17] 17. Yue M, Li J, Ma S. Sparse boosting for high-dimensional survival data with varying coefficients. Statistics in medicine. 2018 Feb 28;37(5):789–800

[18] 18. Yue M, Li J, Cheng MY. Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients. Computational Statistics Data Analysis. 2019 Mar 1;131:222–234

[19] 19. Yue M, Huang L. A new approach of subgroup identification for high-dimensional longitudinal data. Journal of Statistical Computation and Simulation. 2020 Jul 23;90(11):2098–2116

[20] 20. David CR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society. 1972;34(2):187–220

[21] 21. Wei LJ. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Statistics in medicine. 1992;11(14–15):1871–1879

[22] 22. Schmid M, Hothorn T. Flexible boosting of accelerated failure time models. BMC bioinformatics. 2008 Dec;9(1):1–3

[23] 23. Wang Z, Wang CY. Buckley-James boosting for survival analysis with high-dimensional biomarker data. Statistical Applications in Genetics and Molecular Biology. 2010 Jun 8;9(1)

[24] 24. Li J, Ma S. Survival analysis in medicine and genetics. CRC Press; 2013 Jun 4

[25] 25. Stute W. Consistent estimation under random censorship when covariables are present. Journal of Multivariate Analysis. 1993 Apr 1;45(1):89–103

[26] 26. Curry HB, Schoenberg IJ. On Pólya frequency functions IV: the fundamental spline functions and their limits. InIJ Schoenberg Selected Papers 1988 (pp. 347–383). Birkhäuser, Boston, MA

[27] 27. Hansen MH, Yu B. Model selection and the principle of minimum description length. Journal of the American Statistical Association. 2001 Jun 1;96(454):746–774

[28] 28. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC. Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature medicine. 2008 Aug;14(8):822

[29] 29. Consonni D, Bertazzi PA, Zocchetti C. Why and how to control for age in occupational epidemiology. Occupational and environmental medicine. 1997 Nov 1;54(11):772–776

[30] 30. Wang S, Nan B, Zhu J, Beer DG. Doubly penalized Buckley–James method for survival data with high-dimensional covariates. Biometrics. 2008 Mar;64(1):132–140

[31] 31. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data: Oxford University Press. 2002

[32] 32. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis, vol. 998 John Wiley & Sons. Hoboken NJ. 2012

[33] 33. Lin X, Carroll RJ. Semiparametric regression for clustered data using generalized estimating equations. Journal of the American Statistical Association. 2001 Sep 1;96(455):1045–1056

[34] 34. Fan J, Huang T, Li R. Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association. 2007 Jun 1;102(478):632–641

[35] 35. Cheng MY, Honda T, Li J, Peng H. Nonparametric independence screening and structure identification for ultra-high dimensional longitudinal data. Annals of Statistics. 2014;42(5):1819–1849

[36] 36. Cheng MY, Honda T, Li J. Efficient estimation in semivarying coefficient models for longitudinal/clustered data. The Annals of Statistics. 2016;44(5):1988–2017

[37] 37. Luan Y, Li H. Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics. 2003 Mar 1;19(4):474–482

[38] 38. Wang L, Chen G, Li H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics. 2007 Jun 15;23(12):1486–1494

[39] 39. Wei F, Huang J, Li H. Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica. 2011 Oct 1;21(4):1515

[40] 40. Yue M, Li J. Improvement screening for ultra-high dimensional data with censored survival outcomes and varying coefficients. The international journal of biostatistics. 2017 May 18;13(1)

[41] 41. Sivaganesan S, MÃ¼ller P, Huang B. Subgroup finding via Bayesian additive regression trees. Statistics in medicine. 2017 Jul 10;36(15):2391–2403

[42] 42. Zhang H, Singer BH. Recursive partitioning and applications. Springer Science & Business Media; 2010 Jul 1

[43] 43. Zeileis A, Hothorn T, Hornik K. Model-based recursive partitioning. Journal of Computational and Graphical Statistics. 2008 Jun 1;17(2):492–514

[44] 44. Chen G, Zhong H, Belousov A, Devanarayan V. A PRIM approach to predictive-signature development for patient stratification. Statistics in medicine. 2015 Jan 30;34(2):317–342

[45] 45. Huang X, Sun Y, Trow P, Chatterjee S, Chakravartty A, Tian L, Devanarayan V. Patient subgroup identification for clinical drug development. Statistics in medicine. 2017 Apr 30;36(9):1414–1428

[46] 46. Lipkovich I, Dmitrienko A, B D’Agostino Sr R. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in medicine. 2017 Jan 15;36(1):136–196

[47] 47. Bai J. Estimating multiple breaks one at a time. Econometric theory. 1997 Jun 1:315–352

[48] 48. Ke Y, Li J, Zhang W. Structure identification in panel data analysis. Annals of Statistics. 2016;44(3):1193–1233

Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

Computational Statistics and Applications

Abstract

Keywords

Author Information

Mu Yue*

1. Introduction

2. Sparse boosting for survival data

2.1 Methodology

2.1.1 Model and estimation

2.1.2 Sparse boosting techniques

2.2 Simulation

2.3 Lung cancer data analysis

3. Two-step sparse boosting for longitudinal data

3.1 Methodology

3.1.1 Model and estimation

3.1.2 Two-step sparse boosting techniques

3.2 Simulation

3.3 Yeast cell cycle gene expression data analysis

4. Multi-step sparse boosting for subgroup identification

4.1 Methodology

4.1.1 Patients model

4.1.2 Subgroup identification and estimation

4.1.3 Multi-step sparse boosting techniques

4.2 Simulation

4.3 Wallaby growth data analysis

5. Conclusions

References

Fast Computation of the EM Algorithm for Mixture Models

Sparse Boosting Based Machine Learning Methods for High-Dimensional Data

Computational Statistics and Applications

Abstract

Keywords

Author Information

Mu Yue*

1. Introduction

2. Sparse boosting for survival data

2.1 Methodology

2.1.1 Model and estimation

2.1.2 Sparse boosting techniques

2.2 Simulation

2.3 Lung cancer data analysis

3. Two-step sparse boosting for longitudinal data

3.1 Methodology

3.1.1 Model and estimation

3.1.2 Two-step sparse boosting techniques

3.2 Simulation

3.3 Yeast cell cycle gene expression data analysis

4. Multi-step sparse boosting for subgroup identification

4.1 Methodology

4.1.1 Patients model

4.1.2 Subgroup identification and estimation

4.1.3 Multi-step sparse boosting techniques

4.2 Simulation

4.3 Wallaby growth data analysis

5. Conclusions

References

Continue reading from the same book

Computational Statistics and Applications