Smoothing Spline ANOVA Models and their Applications in Complex and Massive Datasets

Jingyi Zhang; Honghe Jin; Ye Wang; Xiaoxiao Sun; Ping Ma and
Wenxuan Zhong

doi:10.5772/intechopen.75861

Abstract

Complex and massive datasets can be easily accessed using the newly developed data acquisition technology. In spite of the fact that the smoothing spline ANOVA models have proven to be useful in a variety of fields, these datasets impose the challenges on the applications of the models. In this chapter, we present a selected review of the smoothing spline ANOVA models and highlight some challenges and opportunities in massive datasets. We review two approaches to significantly reduce the computational costs of fitting the model. One real case study is used to illustrate the performance of the reviewed methods.

Keywords

smoothing spline
smoothing spline ANOVA models
reproducing kernel Hilbert space
penalized likelihood
basis sampling

Author Information

Show +

Jingyi Zhang
- Department of Statistics, The University of Georgia, Athens, GA, USA
Honghe Jin
- Department of Statistics, The University of Georgia, Athens, GA, USA
Ye Wang
- Department of Statistics, The University of Georgia, Athens, GA, USA
Xiaoxiao Sun
- Department of Statistics, The University of Georgia, Athens, GA, USA
Ping Ma*
- Department of Statistics, The University of Georgia, Athens, GA, USA
Wenxuan Zhong
- Department of Statistics, The University of Georgia, Athens, GA, USA

*Address all correspondence to: pingma@uga.edu

1. Introduction

Among the nonparametric models, smoothing splines have been widely used in many real applications. There has been a rich body of literature in smoothing splines such as the additive smoothing spline [1, 2, 3, 4, 5, 6], the interaction smoothing spline [7, 8, 9, 10], and smoothing spline ANOVA (SSANOVA) models [11, 12, 13, 14].

In this chapter, we focus on studying the SSANOVA models. Suppose that the data yixi and i=1,2,…,n are independent and identically distributed (i.i.d.) copies of YX, where Y∈Y⊂R is the response variable and X∈X⊂Rd is the covariate variable. We consider the regression model:

yi=ηxi+ϵi,i=1,2,…,n,E1

where yi is the response, η is the nonparametric function varying in an infinite-dimensional space, xi=xi1…xidT is on the domain X⊂Rd, and ϵi∼i.i.d.N0σ2. More general cases, in which the conditional distribution of Y given x, denoted as Y∣x, which follows different distributions instead of the Gaussian distribution, will be discussed later. The nonparametric function η in (1) can be decomposed into

ηx=ηc+∑j=1dηjxj+∑k<jηkjxkxj+…

through the functional ANOVA, where ηc is a constant function, ηj is the main effect function of xj, ηkj is the interaction effect of xk and xj, and so on.

In the model (1), η can be estimated by minimizing the following penalized likelihood functional:

Lη+λJη,E2

where Lη is a log likelihood measuring the goodness of fit of η, Jη is a quadratic functional on η to quantify its smoothness, and λ is the smoothing parameter balancing trade-offs between the goodness of fit and the smoothness of η [11, 12, 13]. The computational complexity of estimating η by minimizing (2) is of the order On3 for the sample of size n. Therefore, the high computational costs render SSANOVA models impractical for massive datasets. In this chapter, we review two methods to lower the computational costs. One approach is through the adaptive basis selection algorithm [14]. By carefully sampling a smaller set of basic functions conditional on the response variables, the adaptive sampling reduces the computational costs to Onn∗2, where n∗≪n is the number of the sampled basis functions. The computational costs can also be reduced by the rounding algorithm [15]. This algorithm can significantly decrease the sample size to μ by rounding the data with a given precision, where μ≪n. After rounding, the computational costs can be dramatically reduced to Oμ3.

The rest of the chapter is organized as follows. Section 2 provides a detailed introduction to SSANOVA models and the model estimation. The details of adaptive basis selection algorithm and rounding algorithm are reviewed in Section 3. In Appendix, we demonstrate the numerical implementations using the R software.

2. Smoothing spline ANOVA models

In this section, we first review smoothing spline models and the reproducing kernel Hilbert space. Second, we present how to decompose a nonparametric function on tensor product domains, which lays the theoretical foundation for SSANOVA models. In the end, we show the estimation of SSANOVA models and illustrate the model with a real data example.

2.1. Introduction of smoothing spline models

In the model (1), η is located in an infinite-dimensional space. One way to estimate it is to add some constraints and estimate η in a finite-dimensional space. With the smoothness constraint, we estimate η by minimizing the penalized likelihood functional (2), and the minimizer of (2) is called a smoothing spline.Example 1.

Cubic smoothing splines

Suppose that Y∣x follows a normal distribution, that is, Y∣xi∼Nηxiσ2. Then, the penalized likelihood functional (2) can be reduced as the penalized least squares:

1n∑i=1nyi−ηxi2+λ∫Xη¨x2dx,E3

where η¨=d2η/dx2. The minimizer of (3) is called a cubic smoothing spline [16, 17, 18]. In (3), the first term quantifies the fidelity to the data, and the second term controls the roughness of the function.

Example 2.

Exponential family smoothing splines

Suppose that Y∣x follows an exponential family distribution:

Y∣xi∼expyηxi−bηxi/aϕ+cyϕ,

where a>0, b, and c are known functions and ϕ is either known or a nuisance parameter. Then, η can be estimated by minimizing the following penalized likelihood functional [19, 20]:

−1n∑i=1nyiηxi−bηxi+λJη.E4

Note that the cubic smoothing spline in Example 1 is a special case of exponential family smoothing splines when Y∣x follows the Gaussian distribution.

The smoothing parameter λ is sensitive to the estimation of η (see Figure 1). Therefore, it is crucial to implement some proper smoothing parameter selection methods to estimate λ. One of the most popular methods is the generalized cross validation (GCV) [21, 22]. More details will be discussed in Section 2.6.2.

Figure 1.
The data are generated by the model ϵ. y=5+e3x1+106x2111−x26+104x221−x210+5cos2πx1−x2+ϵ, where ϵ∼N01. Panels (a) and (b) show the data and the true function, respectively. The estimated functions depending on different smoothing parameters are shown in panels (c), (d), and (e). We set λ→0 in panel (c) and λ→∞ in panel (e). The proper λ selected by generalized cross validation (GCV) is used in panel (d).

2.2. Reproducing kernel Hilbert space

We assume that readers are familiar with Hilbert space, which is a complete vector space with an inner product well defined that allows length and angle to be measured [23]. In a general Hilbert space, the continuity of a functional, which is required in minimizing (2) on H=Jη<∞, is not always satisfied. To circumvent the problem, we optimize (2) in a special Hilbert space named reproducing kernel Hilbert space [24].

For each g∈H, there exists a corresponding continuous linear functional Lg such that Lgf=gf, where f∈H and ⋅⋅ defines the inner product in H. Conversely, an element gL∈H can also be found such that gLf=Lf for any continuous linear functional L of H [23]. This is known as the Riesz representation theorem.Theorem 2.1.

Riesz representation

Let H be a Hilbert space. For any functional L of H, there uniquely exists an element gL∈H such that

L⋅=gL⋅,

where gL is called the representer of L. The uniqueness is in the sense that g1 and g2 are considered as the same representer for any g1 and g2 satisfying ‖g1−g2‖=0, where ∥⋅∥=⋅⋅ defines the norm in H.

For a better construction of estimator minimizing (2), one needs the continuity of evaluation functional xf=fx. Roughly speaking, this means that if two functions f and g are close in norm, that is, ‖f−g‖ is small, then f and g are also pointwise close, that is, ∣fx−gx∣ is small for all x.Definition 1.

Reproducing kernel Hilbert space

Consider a Hilbert space H consisting of functions on domain X. For every element x∈X, define an evaluation functional x such that xf=fx. If all the evaluation functional x s are continuous, ∀x∈X, then H is called a reproducing kernel Hilbert space.

By Theorem 2.1, for every evaluation functional x, there exists a corresponding function Rx∈H on X as the representer of x, such that Rxf=xf=fx and ∀f∈H. By the definition of evaluation functional, it follows

Rxy=RxRy=Ryx.E5

The bivariate function Rxy=RxRy is called the reproducing kernel of H, which is unique if it exists. The essential meaning of the name “reproducing kernel” comes from its reproducing property

Rx⋅f=Rx⋅f=fx

for any f∈H. In general, a reproducing kernel Hilbert space defines a reproducing kernel function that is both symmetric and positive definite. In addition, Moore-Aronszajn theorem states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space [25], and hence one can construct a reproducing kernel Hilbert space simply by specifying its reproducing kernel.

We now introduce the concept of tensor sum decomposition. Suppose that H is a Hilbert space and G is a linear subspace of H. The linear subspace Gc=f∈Hfg=0∀g∈G is called the orthogonal complement of G. It is easy to verify that for any f∈H, there exists a unique decomposition f=fG+fGc, where fG∈G and fGc∈Gc. This decomposition is called a tensor sum decomposition, denoted by H=G⊕Gc. Suppose that H1 and H2 are two Hilbert spaces with inner products ⋅⋅1 and ⋅⋅2. If the only common element of these two spaces is 0, one can also define a tensor sum Hilbert space H=H1⊕H2. For any f,g∈H, one has unique decompositions f=f1+f2 and g=g1+g2, where f1,g1∈H1 and f2,g2∈H2. Moreover, the inner product defined on H would be fg=f1g11+f2g22. The following theorem provides the rules in the tensor sum decomposition of a reproducing kernel Hilbert space.Theorem 2.2

Suppose that R1 and R2 are the reproducing kernel Hilbert spaces H1 and H2, respectively. If H1∩H2=0, then H=H1⊕H2 has a reproducing kernel R=R1+R2.

Conversely, if the reproducing kernel R of H can be decomposed into R=R1+R2, where both R1 and R2 are positive definite, and they are orthogonal to each other, that is, R1x1⋅R2x2⋅=0 for ∀x1,x2∈X, then the spaces H1 and H2 corresponding to the kernels R1 and R2 form a tensor sum decomposition H=H1⊕H2.

2.3. Representer theorem

In (2), the smoothness penalty term Jη=Jηη is nonnegative definite, that is, Jηη≥0, and hence it is a squared semi-norm on the reproducing kernel Hilbert space H=η:Jη<∞. Denote NJ=η∈H:Jη=0 as the null space of Jη and HJ as its orthogonal complement. By the tensor sum decomposition H=NJ⊕HJ, one may decompose the η into two parts: one in the null space NJ that has no contribution on the smoothness penalty and the other in HJ “reproduced” by the reproducing kernel R⋅⋅ [12].Theorem 2.3.

There exist coefficient vectors d=d1…dMT∈RM and c=c1…cnT∈Rn such that

ηx=∑k=1Mdkξkx+∑i=1nciRxix,E6

where ξkk=1..M is the basis of null space NJ and R⋅⋅ is the reproducing kernel of HJ.

This theorem indicates that although the minimization problem is in an infinite-dimensional space, the minimizer of (2) lies in a data-adaptive finite-dimensional space.

2.4. Function decomposition

The decomposition of a multivariate function is similar to the classical ANOVA. In this section, we present the functional ANOVA which lays the foundation for SSANOVA models.

2.4.1. One-way ANOVA decomposition

We consider a classical one-way ANOVA model yij=μi+ϵij, where yij is the observed data, μi is the treatment mean for i=1,…,K and j=1,…,J, and ϵij s are the random errors. The treatment mean μi can be further decomposed as μi=μ+αi, where μ is the overall mean and αi is the treatment effect with the constraint ∑i=1Kαi=0.

Similar to the classical ANOVA decomposition, a univariate function fx can be decomposed as

f=Af+I−Af=fc+fx,E7

where A is an averaging operator that averages the effect of x and I is an identity operator. The operator A averages a function f to a constant function fc satisfying AI−A=0. For example, one can take Af=∫01fxdx in L101=f:∫01fxdx<∞. In (7), fc=Af is the mean function, and fx=I−Af is the treatment effect.

2.4.2. Multiway ANOVA decomposition

On a d-dimensional product domain X=∏j=1dXj∈Rd, a multivariate function fx1…xd can be decomposed similarly to the one-way ANOVA decomposition. Let Aj, j=1,…,d, be the average operator on Xj, and then Ajf is a constant function on Xj. One can define the ANOVA decomposition on X as

f=∏j=1dI−Aj+Ajf=∑S∏j∈SI−Aj∏j∉SAjf=∑SfS,E8

where S⊆1…d. The term fc=∏j=1dAjf is the constant function, fj=I−Aj∏α≠jAαf is the main effect term of xj, the term fμν=I−AμI−Aν∏α≠μ,νAαf is the interaction of xμ and xν, and so on.

2.5. Some examples of model conduction

Smoothing splines on Cm01. If we define

Jη=∫01ηm2dx

in the space Cm01=f:fm∈L201, where fm denotes the m th differentiation of f, L2=f:∫01fx2dx<∞, then the minimizer of (2) is called a polynomial smoothing spline.

Here, we use an inner product

fg=∑ν=0m−1∫01fνxdx∫01gνxdx+∫01fmxgmxdx.E9

One can easily check that (9) is a well-defined inner product in Cm01 with H0=f:fm=0 equipped with the inner product ∑ν=0m−1∫01fνxdx∫01gνxdx [21]. To construct the reproducing kernel, define

kνx=−∑μ=−∞−1+∑μ=1∞exp2πiμx2πiμν,E10

where i=−1. One can verify that ∫01kνμxdx=δνμ and ν,μ=0,1,…,m−1, where δνμ is the Kronecker delta [26]. Indeed, k0…km−1 forms an orthonormal basis of H0. Then, the reproducing kernel in H0 is

R0xy=∑ν=0m−1kνxkνy.

For space

H1=f:∫01fνxdx=0ν=01…m−1fm∈L201,

one can check that the reproducing kernel in H1 is

R1xy=kmxkmy+−1m−1k2mx−y,

(See details in [11]; Section~2.3).

SSANOVA models on product domains: A natural way to construct reproducing kernel Hilbert space on product domain ∏j=1dXj is taking the tensor product of spaces constructed on the marginal domains Xj s. According to the Moore-Aronszajn theorem, every nonnegative definite function R corresponds to a reproducing kernel Hilbert space with R as its reproducing kernel. Therefore, the construction of the tensor product reproducing kernel Hilbert space is induced by constructing its reproducing kernel.Theorem 2.4.

Suppose that if R1x1y1 is nonnegative definite on X1 and R2x2y2 is nonnegative definite on X2, then Rxy=R1x1y1R2x2y2 is nonnegative definite on X=X1×X2.

Theorem 2.4 implies that a reproducing kernel R on tensor product reproducing kernel Hilbert space can be derived from the reproducing kernels on marginal domains. Indeed, let Hj be the space on Xj with reproducing kernel Rj, where j=1,2. Then, R=R1R2 is nonnegative definite on X1×X2. The space H corresponding to R⋅⋅ is called the tensor product space of H1 and H2, denoted by H=H1⊗H2.

One can decompose each marginal space Hj into Hj=Hj0⊕Hj1, where Hj0 denotes the averaging space and Hj1 denotes the orthogonal complement. Then, by the discussion in Section 2.4, the one-way ANOVA decomposition on each marginal space can be generalized to a multidimensional space H=⊗j=1dHj as

H=⊗j=1dHj0⊕Hj1=⊕S⊗j∈SHj1⊗⊗j∉SHj0=⊕SHS,E11

where S denotes all the subsets of 1…d. The component fS in (8) is in the space HS. Based on the decomposition, the minimizer of (2) is called a tensor product smoothing spline. One can construct a tensor product smoothing spline following Theorem 2.3, in which the reproducing kernel term may be calculated in the same way as the tensor product (11).

In the following, we will give some examples of tensor product smoothing splines on product domains.

2.5.1. Smoothing splines on 1…K×01

We construct the reproducing kernel Hilbert space by using

R10=1/KandR11=Ix1=y1

on 1…K. On 01, assume that if m=2, then we have

R20=1+k1x2k1y2

and

R21=k2x2k2y2−k4x2−y2.

In this case, the space H can be further decomposed as

H=H10⊕H11⊗H200⊕H201⊕H21.E12

The reproducing kernels of tensor product cubic spline on 1…K×01 are listed in Table 1.

Subspace	Reproducing kernel
H10⊗H200	1/K
H10⊗H201	k1x2k1y2/K
H10⊗H21	k2x2k2y2−k4x2−y2/K
H11⊗H200	Ix1=y1−1/K
H11⊗H201	Ix1=y1−1/Kk1x2k1y2
H11⊗H21	Ix1=y1−1/Kk2x2k2y2−k4x2−y2

Table 1.

Reproducing kernels of (12) on 1…K×01 when m=2.

On other product domains, for example, 012, the tensor product reproducing kernel Hilbert space can be decomposed in a similar way. More examples are available in ([11], Section~2.4).

2.5.1.1. General form

In general, a tensor product reproducing kernel Hilbert space can be specified as H=⊕jHj, where j∈B is a genetic index. Suppose that Hj is equipped with a reproducing kernel Rj and an inner product fgj. Denote fj as the projection of f onto Hj. Then, an inner product in H can be defined as

Jfg=∑jθj−1fjgjj,E13

where θj≥0 are the tuning parameters. If a penalty J in (2) has the form (13), the SSANOVA models can be defined on the space H=⊕jHj with the reproducing kernel:

R=∑jθjRj.E14

2.6. Estimation

In this section, we show the procedure of estimating the minimizer η̂ of (2) under the Gaussian assumption and selecting the smoothing parameters.

2.6.1. Penalized least squares

We consider the same model shown in (1), and then the η can be estimated by minimizing the penalized least squares:

1n∑i=1nyi−ηxi2+λJη.E15

Let S denote the n×M matrix with the ij th entry ξjxi as in (6) and R denote the n×n matrix with the ij th entry Rxixj with the form (14). Then, based on Theorem 2.3, η can be expressed as

η=Sd+Rc,

where η=ηx1…ηxnT, d=d1…dMT, and c=c1…cnT. The least squares term in (15) becomes

1n∑i=1nyi−ηxi2=1ny−Sd−RcTy−Sd−Rc,

where y=y1…ynT.

By the reproducing property (5), the roughness penalty term can be expressed as

Jη=∑i=1n∑j=1nciRxixjcj=cTRc.

Therefore, the penalized least squares criterion (15) becomes

1ny−Sd−RcTy−Sd−Rc+λcTRc.E16

The penalized least squares (16) is a quadratic form of both d and c. By differentiating (16), one can obtain the linear system:

STSSTRRTSRTR+nλRdc=STyRTy.E17

Note that (17) only works for penalized least squares (15), and hence a normal assumption is needed in this case.

2.6.2. Selection of smoothing parameters

In SSANOVA models, properly selecting smoothing parameters is important to estimate η [9, 27, 28], as shown in Figure 1. Here, we introduce the generalized cross validation (GCV) method for the smoothing parameter selection.

For the multivariate predictors, the penalty term in (15) has the form

λJf=λ∑j=1Sθj−1fjfjj,

where S is the number of smoothing parameters, which is related to the functional ANOVA decomposition, and θj's are the extra smoothing parameters. To avoid overparameterization, we treat λ=λ/θ1⋯λ/θST as the effective smoothing parameters.

A GCV score is defined as

Vλ=n−1yTI−Aλ2yn−1trI−Aλ2,

where Aλ is a symmetric matrix similar to the hat matrix in linear regression. We can select a proper λ by minimizing the GCV score [21].

2.7. Case study: Twitter data

Tweets in the contiguous United States were collected over five weekdays in January 2014. The dataset contains information of time, GPS location, and tweet counts (see Figure 2). To illustrate the application of SSANOVA models, we study the time and spatial patterns in this data.

Figure 2.
Heatmaps of tweet counts in the contiguous United States. (a) Tweet counts at 2:00 a.m. (b) Tweet counts at 6:00 p.m.

The bivariate function ηx1x2 is a function of time and location, where x1 denotes the time and x2 represents the longitude and latitude coordinates. We use the thin-plate spline for the spatial variable and cubic spline for the time variable. As a rotation-free method, the thin-plate spline is popular for modeling spatial data [29, 30, 31]. For a better interpretation, we decompose the function η as

ηx1x2=ηc+η1x1+η2x2+η12x1x2,

where ηc is a constant function; η1 and η2 are the main effects of time and location, respectively; and η12 is the spatial-time interaction effect.

The main effects of time and location are shown in Figure 3. Obviously, in panel (a), the number of tweets has the periodic effect, where it attains the maximum value at 8:00 p.m. and the minimum value at 5:00 a.m. The main effect of time shows the variations of Twitter usages in the United States. In addition, we can infer how the tweet counts vary across different locations based on panel (b) in Figure 3. There tend to be more tweets in the east than those in the west regions and more tweets in the coastal zone than those in the inland. We use the scaled dot product

π=η̂12Tŷ/∥ŷ∥2

to quantify the percentage decomposition of the sum of squares of ŷ [11], where ŷ=ŷ1…ŷnT is the predicted values of log#Tweets, and η̂12=η12x1…η12xnT is the estimated interaction effect term, where η12x=η12x1x2. In our fitted model, π=3×10−16, which is so small that the interaction term is negligible. This indicates that there is no significant difference for the Twitter usages across time in the contiguous United States.

Figure 3.
(a) The main effect function of time (hours). (b) The main effect function of location.

3. Efficient approximation algorithm in massive datasets

In this section, we consider SSANOVA models under the big data settings. The computational cost of solving (17) is of the order On3 and thus gives rise to a challenge on the application of SSANOVA models when the volume of data grows. To reduce the computational load, an obvious way is to select a subset of basis functions randomly. However, it is hard to keep the data features by uniform sampling. In the following section, we present an adaptive basis selection method and show its advantages over uniform sampling [14]. Instead of selecting basis functions, another approach to reduce the computational cost is shrinking the original sample size by rounding algorithm [15].

3.1. Adaptive basis selection

A natural way to select the basis functions is through uniform sampling. Suppose that we randomly select a subset x⌣=x⌣1…x⌣n⌣ from x1…xn, where n⌣ is the subsample size. Thus, the kernel matrix would be Rx⌣ix, i=1,…,n⌣. Then, one minimizes (17) in the effective model space:

HE=N⊕spanRx⌣ixi=12…n⌣.

The computational cost will be reduced significantly to Onn⌣2 if n⌣ is much smaller than n. Furthermore, it can be proven that the minimizer of (2), η⌣, by uniform sampling basis selection, has the same asymptotic convergence rate as the full basis minimizer η̂.

Although the uniform basis selection reduces the computational cost and the corresponding η⌣ achieves the optimal asymptotic convergence rate, it may fail to retain the data features occasionally. For example, when the data are not evenly distributed, it is hard for uniform sampling to capture the feature where there are only a few data points. In [14], an adaptive basis selection method is proposed. The main idea is to sample more basis functions where the response functions change largely and fewer basis functions on those flat regions. More details of adaptive basis selection method are shown in the following procedure:

Step 1 Divide the range of responses yii=1n into K disjoint intervals, S1,…,SK. Denote ∣Sk∣ as the number of observations in Sk.

Step 2 For each Sk, draw a random sample of size nk from this collection. Let x∗k=x1∗k…xnk∗k be the predictor values.

Step 3 Combine x∗1,…,x∗K together to form a set of sampled predictor values x1∗…xn∗∗, where n∗=∑k=1Knk.

Step 4 Define

HE=H0⊕spanRxi∗⋅i=12…n∗

as the effective model space.

By adaptive basis selection, the minimizer of (2) keeps the same form as that in Theorem 2.3:

ηAx=∑k=1Mdkξkx+∑i=1n∗ciRxi∗x.

Let R∗ be an n×n∗ matrix, and its ij th entry is Rxixj∗. Let R∗∗ be an n∗×n∗ matrix, and its ij th entry is Rxi∗xj∗. Then, the estimator ηA satisfies

ηA=SdA+R∗cA,

where ηA=ηAx1⋯ηAxn∗T, dA=d1…dMT, and cA=c1…cn∗T. Similar to (17), the linear system of equations in this case is

STSSTR∗R∗TSR∗TR∗+nλR∗∗dAcA=STyR∗Ty.E18

The computational complexity of solving (18) is of the order Onn∗2, so the method decreases the computational cost significantly. It can also be shown that the adaptive sampling basis selection smoothing spline estimator ηA has the same convergence property as the full basis method. More details about the consistency theory can be found in [14]. Moreover, adaptive sampling basis selection method for exponential family smoothing spline models was developed in [32].

3.2. Rounding algorithm

Other than sampling a smaller set of basis functions to save the computational resources, for example, the adaptive basis selection method presented previously, [15] proposed a new rounding algorithm to fit SSANOVA models in the context of big data.

Rounding algorithm: The details of rounding algorithm can be shown in the following procedure:

Step 1 Assume that all predictors are continuous.

Step 2 Convert all predictors to the interval 01.

Step 3 Round the raw data by using the transformation:

zij=RDxij/rjrj,fori∈1⋯n,j∈1⋯d,

where the rounding parameter rj∈01 and rounding function RD⋅ transform input data to the nearest integer.

Step 4 After replacing xij with zij, we redefine S and R in (16) and then estimate η by minimizing the penalized least squares (16).Remark 1

In Step 3, if rj is the rounding parameter for j th predictor and its value is 0.03, then each zij is formed by rounding the corresponding xij to its nearest 0.03.

Remark 2

It is evident that the value of rounding parameter can influence the precision of approximation. The smaller the rounding parameter, the better the model estimation and the higher the computational cost.

Computational benefits: We now briefly explain why the implementation of rounding algorithm can reduce the computational loads. For example, if the rounding parameter r=0.01, it is obvious that u≤101, where u denotes the number of uniquely observed values. In conclusion, using user-tunable rounding algorithm can dramatically reduce the computational burden of fitting SSANOVA models from the order of On3 to Ou3, where u≪n.

Case study: To illustrate the benefit of the rounding algorithm, we apply the algorithm to the electroencephalography (EEG) dataset. Note that EEG is a monitoring method to record the electrical activity of the brain. It can be used to diagnose sleep disorders, epilepsy, encephalopathies, and brain death.

The dataset [33] contains 44 controls and 76 alcoholics. Each subject was repeatedly measured 10 times by using visual stimulus at a frequency of 256 Hz. This brings about n=10 replications ×120 subjects ×256 time points =307,200 observations. There are two predictors, time and group (control vs. alcoholic). We apply the cubic spline to the time effect and the nominal spline to the group effect.

After applying the model to the unrounded data, rounded data with rounding parameter r=0.01 and r=0.05 for time covariate, we can obtain a summary table about GCV, AIC [34], BIC [35], and running time in Table 2.

	GCV	AIC	BIC	CPU time (seconds)
Unrounded data	85.9574	2,240,019	2,240,562	15.65
Rounded data with r=0.01	86.6667	2,242,544	2,242,833	1.22
Rounded data with r=0.05	86.7654	2,242,893	2,243,089	1.13

Table 2.

Fit statistics and running time for SSANOVA models.

Based on Table 2, we can easily see that there are no significant difference among the GCV scores and AIC/BIC. In addition using rounding algorithm reduces 92% CPU time compared to using unrounded dataset.

4. Conclusion

Smoothing spline ANOVA (SSANOVA) models are widely used in applications [11, 20, 36, 37]. In this chapter, we introduced the general framework of the SSANOVA models in Section 2. In Section 3, we discussed the models under the big data settings. When the volume of data grows, fitting the models is computing-intensive [11]. The adaptive basis selection algorithm [14] and rounding algorithm [15] we presented can significantly reduce the computational cost.

Acknowledgments

This work is partially supported by the NIH grants R01 GM122080 and R01 GM113242; NSF grants DMS-1222718, DMS-1438957, and DMS-1228288; and NSFC grant 71331005.

Conflict of interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interests; and expert testimony or patent-licensing arrangements) or nonfinancial interest (such as personal or professional relationships, affiliations, knowledge, or beliefs) in the subject matter or materials discussed in this manuscript.

In this appendix, we use two examples to illustrate how to implement smoothing spline ANOVA (SSANOVA) models in R. The gss package in R, which can be downloaded on the CRAN https://cran.r-project.org/, is utilized.

We now load the gss package:

library(gss)

Example I: Apply the smoothing spline to a simulated dataset.

Suppose that the predictor x follows a uniform distribution on 01, and the response y is generated based on y=5+2cos3πx+ϵ, where ϵ∼N01.

x<-runif(100);y<-5+2*cos(3*pi*x)+rnorm(x)

Then, fit cubic smoothing spline model:

cubic.fit<-ssanova(y^˜x)

To evaluate the predicted values, one uses:

new<-data.frame(x=seq(min(x),max(x),len=50))

est<-predict(cubic.fit,new,se=TRUE)

The se.fit parameter indicates if one can get the pointwise standard errors for the predicted values. The predicted values and Bayesian confidence interval, shown in Figure 4, are generated by:

Figure 4.
The solid red line represents the fitted values. The green lines represent the 95% Bayesian confidence interval. The raw data are shown as the circles.

plot(x,y,col=1)

lines(new$x,est$fit,col=2)

lines(new$x,est$fit+1.96*est$se,col=3)

lines(new$x,est$fit-1.96*est$se,col=3)

Example II: Apply the SSANOVA model to a real dataset.

In this example, we illustrate how to implement the SSANOVA model using the gss package. The data is from an experiment in which a single-cylinder engine is run with ethanol to see how the nox concentration nox in the exhaust depends on the compression ratio comp and the equivalence ratio equi. The fitted model contains two predictors (comp and equi) and one interaction term.

data(nox)

nox.fit <- ssanova(log10(nox)^˜comp*equi,data=nox)

The predicted values are shown in Figure 5.

Figure 5.
The x-axis, y-axis, and z-axis represent the compression ratio, the equivalence ratio, and the predicted values, respectively.

x=seq(min(nox$comp),max(nox$comp),len=50)

y=seq(min(nox$equi),max(nox$equi),len=50)

temp <- function(x, y){

new=data.frame(comp=x,equi=y)

return(predict(nox.fit,new,se=FALSE))

}

z=outer(x, y, temp)

persp(x,y,z,theta = 30).

References

1. Buja A, Hastie T, Tibshirani R. Linear smoothers and additive models. The Annals of Statistics. 1989;17:453-510
2. Burman P. Estimation of generalized additive models. Journal of Multivariate Analysis. 1990;32(2):230-255
3. Friedman JH, Grosse E, Stuetzle W. Multidimensional additive spline approximation. SIAM Journal on Scientific and Statistical Computing. 1983;4(2):291-301
4. Hastie TJ. Generalized additive models. In: Statistical Models in S. Routledge; 2017. pp. 249-307
5. Stone CJ. Additive regression and other nonparametric models. The Annals of Statistics. 1985;13:689-705
6. Stone CJ. The dimensionality reduction principle for generalized additive models. The Annals of Statistics. 1986;14:590-606
7. Barry D et al. Nonparametric bayesian regression. The Annals of Statistics. 1986;14(3):934-953
8. Chen Z. Interaction Spline Models. University of Wisconsin–Madison; 1989
9. Gu C, Wahba G. Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM Journal on Scientific and Statistical Computing. 1991;12(2):383-398
10. Wahba G. Partial and Interaction Splines for the Semiparametric Estimation of Functions of Several Variables. University of Wisconsin, Department of Statistics; 1986
11. Gu C, Smoothing Spline ANOVA. Models, Volume 297. In: Springer Science & Business Media; 2013
12. Wahba G. Spline Models for Observational Data. SIAM; 1990
13. Wang Y. Smoothing Splines: Methods and Applications. CRC Press; 2011
14. Ma P, Huang JZ, Zhang N. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika. 2015;102(3):631-645
15. Helwig NE, Ma P. Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, Special Issue on Statistical and Computational Theory and Methodology for Big Data. 2016;9:433-444
16. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press; 1993
17. Kimeldorf GS, Wahba G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics. 1970;41(2):495-502
18. Kimeldorf GS, Wahba G. Spline functions and stochastic processes. Sankhya: The Indian Journal of Statistics, Series A; 1970. pp. 173-180
19. O’sullivan F, Yandell BS, Raynor WJ Jr. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association. 1986;81(393):96-103
20. Wahba G, Wang Y, Gu C, Klein R, Klein B. Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. The Annals of Statistics; 1995:1865-1895
21. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1978;31(4):377-403
22. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21(2):215-223
23. Kreyszig E. Introductory Functional Analysis with Applications, Volume 1. New York: Wiley; 1989
24. Berlinet A, Thomas-Agnan C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. In: Springer Science & Business Media; 2011
25. Aronszajn N. Theory of reproducing kernels. Transactions of the American Mathematical Society. 1950;68(3):337-404
26. Abramowitz M, Stegun IA. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, Volume 55. Courier Corporation; 1964
27. Hurvich CM, Simonoff JS, Tsai C-L. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1998;60(2):271-293
28. Mallows CL. Some comments on Cp. Technometrics. 2000;42(1):87-94
29. Duchon J. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. Constructive Theory of Functions of Several Variables; 1977. pp. 85-100
30. Meinguet J. Multivariate interpolation at arbitrary points made simple. Zeitschrift für Angewandte Mathematik und Physik (ZAMP). 1979;30(2):292-304
31. Wahba G, Wendelberger J. Some new mathematical methods for variational objective anal-ysis using splines and cross validation. Monthly Weather Review. 1980;108(8):1122-1143
32. Ma P, Zhang N, Huang JZ, Zhong W. Adaptive basis selection for exponential family smoothing splines with application in joint modeling of multiple sequencing samples. Statistica Sinica, in press; 2017
33. Lichman M. UCI Machine Learning Repository. 2013. URL. http://archive.ics.uci.edu/ml
34. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G, editors. Selected Papers of Hirotugu Akaike, Springer Series in Statistics. New York: Springer; 1998. pp. 199-213
35. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461-464. DOI: 10.1214/aos/1176344136
36. Helwig NE, Shorter KA, Ma P, Hsiao-Wecksler ET. Smoothing spline analysis of variance models: A new tool for the analysis of cyclic biomechanical data. Journal of Biomechanics. 2016;49(14):3216-3222
37. Lin X, Wahba G, Xiang D, Gao F, Klein R, Klein B. Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. The Annals of Statistics. 2000;28:1570-1600

[1] 1. Buja A, Hastie T, Tibshirani R. Linear smoothers and additive models. The Annals of Statistics. 1989;17:453-510

[2] 2. Burman P. Estimation of generalized additive models. Journal of Multivariate Analysis. 1990;32(2):230-255

[3] 3. Friedman JH, Grosse E, Stuetzle W. Multidimensional additive spline approximation. SIAM Journal on Scientific and Statistical Computing. 1983;4(2):291-301

[4] 4. Hastie TJ. Generalized additive models. In: Statistical Models in S. Routledge; 2017. pp. 249-307

[5] 5. Stone CJ. Additive regression and other nonparametric models. The Annals of Statistics. 1985;13:689-705

[6] 6. Stone CJ. The dimensionality reduction principle for generalized additive models. The Annals of Statistics. 1986;14:590-606

[7] 7. Barry D et al. Nonparametric bayesian regression. The Annals of Statistics. 1986;14(3):934-953

[8] 8. Chen Z. Interaction Spline Models. University of Wisconsin–Madison; 1989

[9] 9. Gu C, Wahba G. Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM Journal on Scientific and Statistical Computing. 1991;12(2):383-398

[10] 10. Wahba G. Partial and Interaction Splines for the Semiparametric Estimation of Functions of Several Variables. University of Wisconsin, Department of Statistics; 1986

[11] 11. Gu C, Smoothing Spline ANOVA. Models, Volume 297. In: Springer Science & Business Media; 2013

[12] 12. Wahba G. Spline Models for Observational Data. SIAM; 1990

[13] 13. Wang Y. Smoothing Splines: Methods and Applications. CRC Press; 2011

[14] 14. Ma P, Huang JZ, Zhang N. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika. 2015;102(3):631-645

[15] 15. Helwig NE, Ma P. Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, Special Issue on Statistical and Computational Theory and Methodology for Big Data. 2016;9:433-444

[16] 16. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press; 1993

[17] 17. Kimeldorf GS, Wahba G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics. 1970;41(2):495-502

[18] 18. Kimeldorf GS, Wahba G. Spline functions and stochastic processes. Sankhya: The Indian Journal of Statistics, Series A; 1970. pp. 173-180

[19] 19. O’sullivan F, Yandell BS, Raynor WJ Jr. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association. 1986;81(393):96-103

[20] 20. Wahba G, Wang Y, Gu C, Klein R, Klein B. Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. The Annals of Statistics; 1995:1865-1895

[21] 21. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik. 1978;31(4):377-403

[22] 22. Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics. 1979;21(2):215-223

[23] 23. Kreyszig E. Introductory Functional Analysis with Applications, Volume 1. New York: Wiley; 1989

[24] 24. Berlinet A, Thomas-Agnan C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. In: Springer Science & Business Media; 2011

[25] 25. Aronszajn N. Theory of reproducing kernels. Transactions of the American Mathematical Society. 1950;68(3):337-404

[26] 26. Abramowitz M, Stegun IA. Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables, Volume 55. Courier Corporation; 1964

[27] 27. Hurvich CM, Simonoff JS, Tsai C-L. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1998;60(2):271-293

[28] 28. Mallows CL. Some comments on Cp. Technometrics. 2000;42(1):87-94

[29] 29. Duchon J. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. Constructive Theory of Functions of Several Variables; 1977. pp. 85-100

[30] 30. Meinguet J. Multivariate interpolation at arbitrary points made simple. Zeitschrift für Angewandte Mathematik und Physik (ZAMP). 1979;30(2):292-304

[31] 31. Wahba G, Wendelberger J. Some new mathematical methods for variational objective anal-ysis using splines and cross validation. Monthly Weather Review. 1980;108(8):1122-1143

[32] 32. Ma P, Zhang N, Huang JZ, Zhong W. Adaptive basis selection for exponential family smoothing splines with application in joint modeling of multiple sequencing samples. Statistica Sinica, in press; 2017

[33] 33. Lichman M. UCI Machine Learning Repository. 2013. URL. http://archive.ics.uci.edu/ml

[34] 34. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G, editors. Selected Papers of Hirotugu Akaike, Springer Series in Statistics. New York: Springer; 1998. pp. 199-213

[35] 35. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461-464. DOI: 10.1214/aos/1176344136

[36] 36. Helwig NE, Shorter KA, Ma P, Hsiao-Wecksler ET. Smoothing spline analysis of variance models: A new tool for the analysis of cyclic biomechanical data. Journal of Biomechanics. 2016;49(14):3216-3222

[37] 37. Lin X, Wahba G, Xiang D, Gao F, Klein R, Klein B. Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. The Annals of Statistics. 2000;28:1570-1600

Smoothing Spline ANOVA Models and their Applications in Complex and Massive Datasets

Topics in Splines and Applications

Abstract

Keywords

Author Information

Jingyi Zhang

Honghe Jin

Ye Wang

Xiaoxiao Sun

Ping Ma*

Wenxuan Zhong