Open access peer-reviewed chapter

An Optimization Approach to Supervised Principal Component Analysis

Written By

Anthony O. Smith and Anand Rangarajan

Submitted: 15 August 2023 Reviewed: 04 September 2023 Published: 05 December 2023

DOI: 10.5772/intechopen.1003668

From the Edited Volume

New Insights on Principal Component Analysis

Fausto Pedro García Márquez, Mayorkinos Papaelias and René-Vinicio Sánchez Loja


Abstract

Supervised dimensionality reduction has become an important theme in the last two decades. Despite the plethora of models and formulations, there is a lack of a simple model that aims to project the set of patterns into a space defined by the classes (or categories). We set up a model in which each class is represented as a 1D subspace of the vector space formed by the features. Assuming the number of classes does not exceed the number of features, the model results in multi-class supervised learning in which the patterns of each class are projected into their class subspace. Class discrimination is guaranteed via the imposition of orthogonality on the 1D class subspaces. The resulting optimization problem, formulated as the minimization of a sum of quadratic functions on a Stiefel manifold, is non-convex (due to the constraints), yet has a structure for which we can identify when we have reached a global minimum. After formulating a version with standard inner products, we extend the formulation to a reproducing kernel Hilbert space (RKHS) and thereby obtain a kernel version. Comparisons with multi-class Fisher discriminants and principal component analysis showcase the relative merits toward dimensionality reduction.

Keywords

  • dimensionality reduction
  • optimization
  • classification
  • supervised learning
  • Stiefel manifold
  • category space
  • Fisher discriminants
  • principal component analysis
  • multi-class

1. Introduction

Dimensionality reduction and supervised learning have long been active themes in machine learning. For example, principal component analysis (PCA) and the support vector machine (SVM) are standard bearers for dimensionality reduction and supervised learning, respectively. Even now, machine learning researchers are accustomed to performing PCA when seeking a simple dimensionality reduction technique, even though it is an unsupervised learning approach. In the past decade, there has been considerable interest in including supervision (expert label information) in dimensionality reduction techniques. Beginning with the well-known EigenFaces versus FisherFaces debate [1], considerable activity has centered around using Fisher linear discriminants (FLD) and other supervised learning approaches in dimensionality reduction. Since the Fisher linear discriminant has a multi-class extension, it is natural to begin there. However, it is also natural to ask whether this is the only possible approach. In this work, we pursue this fundamental goal and design a category space approach that uses multi-class information to reduce dimensionality.

The venerable Fisher discriminant is a supervised dimensionality reduction technique wherein a maximally discriminative one-dimensional subspace is estimated from the data. The criterion used for discrimination is the ratio between the squared distance of the projected class means and a weighted sum of the projected variances. This criterion has a closed-form solution yielding the best 1D subspace. The Fisher discriminant also has an extension to the multi-class case. Here, the criterion used is more complex and highly unusual: the ratio between the sum of squared distances of each class's projected mean from the total projected mean and the sum of the projected variances. This, too, results in a closed-form solution, but with the subspace dimension being one less than the number of classes.

The above description of the multi-class FLD sets the stage for our approach. We assume the set of categories (classes) spans a subspace of the original feature space (similar to FLD). However, we add the restriction that the category bases are mutually orthogonal, with the origin of the vector space belonging to no category. Given this restriction, the multi-class category space dimensionality reduction criterion is quite straightforward: we discover the category space by maximizing the square of the inner product between each pattern and its category axis. (Setting the origin is a highly technical issue and is therefore not described here.) The result is a sum of quadratic objective functions on a Stiefel manifold—the category space of orthonormal basis vectors. This objective function has coincidentally received quite a bit of treatment recently [2, 3, 4, 5]. Furthermore, there is no need to restrict ourselves to sums of quadratic objective functions, provided we are willing to forego useful analysis of this base case. The unusual aspect of an objective function comprising sums of quadratic terms is that we can formulate a criterion which, if satisfied by the achieved solution, guarantees that we have reached a global minimum.

While numerous alternatives exist to the FLD (such as canonical correlation analysis [6, 7]) and while there are many nonlinear unsupervised dimensionality reduction techniques (such as local linear embedding [8, 9, 10], ISOMAP [11, 12] and Laplacian Eigenmaps [13, 14, 15]), we have not encountered a simple dimensionality reduction technique which is based on projecting the data into a space spanned by the categories. Unfortunately, no algorithm at present can a priori guarantee satisfaction of this criterion; hence, we can only check on a case-by-case basis. Despite this, our experimental results show that we get efficient solutions, competitive with those obtained from other dimensionality reduction algorithms.


2. Related work

Traditional dimensionality reduction techniques like principal component analysis (PCA) [16, 17, 18] and supervised algorithms such as Fisher linear discriminant analysis [19] seek to retain significant features while removing insignificant, redundant, or noisy ones. Many real-world problems have been solved using these algorithms as preprocessing steps before applying a classification algorithm. A limitation of most methods is that there is no specific connection between the dimensionality reduction technique and the supervised learning-driven classifier. Dimensionality reduction techniques such as canonical correlation analysis (CCA) [20] and partial least squares (PLS) [21, 22] on the one hand, and classification algorithms such as support vector machines (SVMs) [23] on the other, seek to optimize different criteria. In contrast, in this paper, we analyze dimensionality reduction from the perspective of multi-class classification. Using a category vector space (with dimension equal to the class cardinality) is an integral aspect of this approach.

In supervised learning, it is customary for classification methodologies to regard classes as nominal labels without any internal structure. This remains true regardless of whether a discriminant or a classifier is sought. Discriminants are designed by attempting to separate patterns into opposing classes [24, 25, 26]. When generalization to a multi-class classifier is required, many oppositional discriminants are combined, with the final classifier being a winner-take-all (or voting-based) decision w.r.t. the set of nominal labels. Convex objective functions based on misclassification error minimization (or approximation) are not that different either. Least-squares or logistic regression methods set up convex objective functions with nominal labels converted to binary outputs [27, 28]. When extensions to multi-class are sought, the binary labels are extended to a one-of-K encoding, with K being the number of classes.

Support vector machines (SVMs) were inherently designed for two-class discrimination, and all formulations of multi-class SVMs extend this oppositional framework using one-versus-one or one-versus-all schemes. Below, we describe different approaches to the multi-class problem. This is not meant to be exhaustive but provides an overview of some popular methods and approaches researched in classification and dimensionality reduction.

Foley and Sammon [29, 30] studied the two-class problem and feature selection, focusing on criteria with the greatest potential to discriminate. Feature selection aims to find a set of features with the best discrimination properties. To identify the best feature vectors, they chose the generalized Fisher optimality criterion proposed by [31]. The selected directions maximize the Fisher criterion, which has attractive discrimination properties. Principal component analysis (PCA) permits the reduction of dimensions of high dimensional data without losing significant information [16, 20, 32]. Principal components identify patterns or significant features without taking discrimination into account [33]. Supervised PCA (SPCA), derived from PCA, is a method for obtaining useful subspaces when the labels are taken into account. This technique was first described in [34] under the title “supervised clustering.” The idea behind SPCA is to perform selective dimensionality reduction using carefully chosen subsets of labeled samples, which are then used to build a prediction model [35]. Unfortunately, despite the superficial similarity to our work, SPCA does not carefully reformulate supervised dimensionality reduction as an optimization problem (as we do) and develop an algorithm that projects vectors into the space of categories, which is the present work’s goal.

While we have addressed the most popular techniques in dimensionality reduction and multi-class classification, this is not an exhaustive study of the literature. We focus primarily on discriminative dimensionality reduction methods that improve multi-class classification performance. The closest work we have seen in relation to our category space approach is [36, 37]. These authors mention the importance and usefulness of modeling categories as vector spaces for document retrieval and explain how unrelated items should have an orthogonal relationship, which is to say that they should have no features in common. The structured SVM in [38] is another effort at going beyond nominal classes. Here, classes can have internal structures in the form of strings, trees, etc. However, explicit modeling of classes as vector spaces is not carried out.

From the above, the modest goal of the present work should be clear. We seek to project the input feature vectors into a category space, a subspace spanned by category basis vectors. The multi-class FLD falls short of this goal since the number of projected dimensions is one less than the number of classes. The multi-class (and more recently multi-label) SVM [39] literature is fragmented due to a lack of agreement regarding the core issue of multi-class discrimination. The varieties of supervised PCA do not begin by clearly formulating a criterion for category space projection. Variants such as CCA [40, 41], PLS [42, 43], and structured SVMs [38], while attempting to add structure to the categories, do not go as far as the present work in attempting to fit a category subspace. Kernel variants of the above also do not touch the basic issue addressed in the present work. Nonlinear (and manifold learning-based) dimensionality reduction techniques [8, 11, 13, 44, 45] are unsupervised and therefore do not qualify.


3. Dimensionality reduction using a category space formulation

3.1 Maximizing the square of the inner product

The principal goal is a new form of supervised dimensionality reduction. Specifically, when we seek to marry principal component analysis with supervised learning, the simplest synthesis is category space dimensionality reduction with orthogonal class vectors. Assume the existence of a feature space with each feature vector $x_i\in\mathbb{R}^D$. We aim to perform supervised dimensionality reduction by reducing the number of feature dimensions from $D$ to $K$, where $K\le D$. Here $K$ is the number of classes, and the first simplifying assumption made in this work is that we will represent the category space using $K$ orthonormal basis vectors $w_k$ together with an origin $x_0\in\mathbb{R}^D$. The second assumption we make is that each feature vector $x_i$ should have a large magnitude inner product with its assigned class. From the orthonormality constraint above, this automatically implies a small magnitude inner product with all other weight vectors. A candidate objective function and constraints following the above considerations are

$$E(W)=-\frac{1}{2}\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[w_k^T\left(x_{i_k}-x_0\right)\right]^2\tag{1}$$

and

$$w_k^Tw_l=\begin{cases}1, & k=l\\ 0, & k\neq l\end{cases}\tag{2}$$

respectively. In Eq. (1), $W=\left[w_1,w_2,\ldots,w_K\right]$. Note that we have referred to this as a candidate objective function for two reasons. First, the origin $x_0$ is still unspecified, and obviously, we cannot minimize Eq. (1) w.r.t. $x_0$ as the minimum value is not bounded from below. Second, it is unclear why we cannot use the inner product’s absolute value or other symmetric functions. Both these issues are addressed later in this work. We currently resolve the origin issue by setting $x_0$ to the centroid of all the feature vectors (with this choice getting a principled justification below).

The objective function in Eq. (1) is the negative of a quadratic function. Since the function $-x^2$ is concave, it admits a Legendre transform-based majorization [46] using the tangent of the function. That is, we propose to replace objective functions of the form $-\frac{1}{2}x^2$ with $\min_y\left\{-xy+\frac{1}{2}y^2\right\}$, which can quickly be checked to be valid for an unconstrained auxiliary variable $y$. Note that this transformation yields a linear objective function w.r.t. $x$, which is to be expected from the geometric interpretation of a tangent.
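To spell out the check (a one-line verification we add for completeness; it is not worked out in the original text), set the derivative of the surrogate w.r.t. $y$ to zero:

$$\frac{\partial}{\partial y}\left(-xy+\frac{1}{2}y^2\right)=-x+y=0\ \Rightarrow\ y^{\star}=x,\qquad \min_y\left\{-xy+\frac{1}{2}y^2\right\}=-x^2+\frac{1}{2}x^2=-\frac{1}{2}x^2,$$

so the linear surrogate touches $-\frac{1}{2}x^2$ exactly at the tangent point $y=x$ and lies above it elsewhere.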

Consider the following Legendre transformation of the objective function in Eq. (1). The new objective function is

$$E_{\text{quad}}(W,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^T\left(x_{i_k}-x_0\right)+\frac{1}{2}z_{ki_k}^2\right]\tag{3}$$

where $Z=\left\{z_{ki_k}:\ k\in\{1,\ldots,K\},\ i_k\in\{1,\ldots,|C_k|\}\right\}$. Consider this an objective function over $x_0$ as well. We require additional constraints to avoid minima at negative infinity (i.e., for the objective function to be coercive). One such constraint (and perhaps not the only one) is of the form $\sum_{i_k\in C_k}z_{ki_k}=0,\ \forall k$. When this constraint is imposed, we obtain a new objective function,

$$E_{\text{quad}}(W,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}+\frac{1}{2}z_{ki_k}^2\right]\tag{4}$$

to be minimized subject to the constraints

$$\sum_{i_k\in C_k}z_{ki_k}=0,\quad\forall k\tag{5}$$

which are in addition to the orthonormality constraints in Eq. (2). Minimizing this objective function yields a $Z$ that removes the class-specific centroid of $C_k$ for every class.

3.2 Maximizing the absolute value of the inner product

Our choice of centroid removal has thus been justified indirectly, via the constraints imposed on the Legendre transform auxiliary variables. The above objective function can be suitably modified to use different forms (the absolute value of the inner product, etc.). To see this, consider the following objective function, which minimizes the negative of the magnitude of the inner product:

$$E(W)=-\sum_{k=1}^{K}\sum_{i_k\in C_k}\left|w_k^T\left(x_{i_k}-x_0\right)\right|.\tag{6}$$

Since $-|x|$ is also a concave function, it too can be majorized. Consider first replacing the non-differentiable function $-|x|$ with $-\sqrt{x^2+\epsilon^2}$ (also concave), where $\epsilon$ can be chosen to be a suitably small value. Now consider replacing $-\sqrt{x^2+\epsilon^2}$ with $\min_y\left\{-xy-\epsilon\sqrt{1-y^2}\right\}$, which can again quickly be checked to be valid for a constrained auxiliary variable $y\in[-1,1]$. The constraint is somewhat less relevant since the minimum w.r.t. $y$ occurs at $y=\frac{x}{\sqrt{x^2+\epsilon^2}}$, which lies within the constraint interval. Note that this transformation yields a linear objective function w.r.t. $x$. As before, we introduce a new objective function

$$E_{\text{abs}}(W,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}-\epsilon\sqrt{1-z_{ki_k}^2}\right]\tag{7}$$

to be minimized subject to the constraints $\sum_{i_k\in C_k}z_{ki_k}=0,\ \forall k$, and $z_{ki_k}\in[-1,1]$, which are the same as in Eq. (5), in addition to the orthonormality constraints in Eq. (2).
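As a quick verification of this second majorization (again added here for completeness rather than taken from the original text), setting the derivative of $-xy-\epsilon\sqrt{1-y^2}$ w.r.t. $y$ to zero gives

$$-x+\epsilon\frac{y}{\sqrt{1-y^2}}=0\ \Rightarrow\ y^{\star}=\frac{x}{\sqrt{x^2+\epsilon^2}},\qquad \min_{y\in[-1,1]}\left\{-xy-\epsilon\sqrt{1-y^2}\right\}=-\sqrt{x^2+\epsilon^2},$$

which approaches $-|x|$ as $\epsilon\to 0$, confirming that the surrogate majorizes the smoothed absolute value.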

3.3 Extension to RKHS kernels

The generalization to RKHS kernels is surprisingly straightforward. First, we follow standard kernel PCA and write the weight vector in terms of the RKHS projected patterns $\phi(x_i)$ to get

$$w_k=\sum_{i=1}^{N}\alpha_{ki}\,\phi(x_i).\tag{8}$$

Note that the expansion of the weight vector is in terms of all the patterns rather than just the class-specific ones. This assumes that the weight vector for each class lives in the subspace (potentially infinite dimensional) spanned by the RKHS projected patterns—the same assumption as in standard kernel PCA. The orthogonality constraint between weight vectors becomes

$$\langle w_k,w_l\rangle=\left\langle\sum_{i=1}^{N}\alpha_{ki}\,\phi(x_i),\ \sum_{j=1}^{N}\alpha_{lj}\,\phi(x_j)\right\rangle=\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{ki}\alpha_{lj}\,\langle\phi(x_i),\phi(x_j)\rangle=\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{ki}\alpha_{lj}\,K(x_i,x_j)\tag{9}$$

which is equal to one if $k=l$ and zero otherwise. In matrix form, the orthonormality constraints become

$$A\,G\,A^T=I_K\tag{10}$$

where $A_{ki}\equiv\alpha_{ki}$ and $G_{ij}=K(x_i,x_j)$ is the well-known Gram matrix of pairwise RKHS inner products between the patterns.
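As a small illustration of these kernel-side quantities (our own sketch, not code from the chapter; the Gaussian RBF kernel and all variable names are assumptions), the Gram matrix and the residual of the orthonormality constraint in Eq. (10) can be computed as follows:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix G_ij = K(x_i, x_j) for a Gaussian RBF kernel (our illustrative kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def orthonormality_residual(A, G):
    """Frobenius norm of A G A^T - I_K, i.e., the violation of Eq. (10)."""
    K = A.shape[0]
    return np.linalg.norm(A @ G @ A.T - np.eye(K))

# toy usage: N = 20 patterns in D = 5 dimensions, K = 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
G = rbf_gram(X, sigma=0.8)
A = rng.normal(size=(3, 20))          # unconstrained coefficients alpha_{ki}
print(orthonormality_residual(A, G))  # nonzero for a random A; zero once Eq. (10) is enforced
```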

The corresponding squared inner product and the absolute value of the inner product objective functions are

$$E_{\text{Kquad}}(A,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})+\frac{1}{2}z_{ki_k}^2\right]\tag{11}$$

and

$$E_{\text{Kabs}}(A,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})-\epsilon\sqrt{1-z_{ki_k}^2}\right]\tag{12}$$

respectively. These have to be minimized subject to the orthonormality constraints in Eq. (10) and the origin constraints in Eq. (5). Note that the two objective functions are identical w.r.t. the matrix $A$. The parameter $\epsilon$ can be set to a very small but positive value.


4. An algorithm for supervised dimensionality reduction

We now return to the objective functions and constraints in Eqs. (4) and (7) prior to tackling the corresponding kernel versions in Eqs. (11) and (12), respectively. It turns out that the approach for minimizing the former can be readily generalized to the latter, with the former being easier to analyze. Note that the objective functions in Eqs. (4) and (7) are identical w.r.t. $W$. Consequently, we dispense with the optimization problems w.r.t. $Z$, which are straightforward, and focus on the optimization problem w.r.t. $W$.

4.1 Weight matrix estimation with orthogonality constraints

The objective function and constraints on W can be written as

$$E_{\text{equiv}}(W)=-\sum_{k=1}^{K}\sum_{i_k\in C_k}z_{ki_k}\,w_k^Tx_{i_k}\tag{13}$$

and

$$w_k^Tw_l=\begin{cases}1, & k=l\\ 0, & k\neq l.\end{cases}\tag{14}$$

Note that the set $Z$ is not included in this objective function despite its presence in the larger objective functions of Eqs. (4) and (7). The orthonormality constraints can be expressed using a Lagrange parameter matrix to obtain the following Lagrangian:

$$L(W,\Lambda)=-\sum_{k=1}^{K}\sum_{i_k\in C_k}z_{ki_k}\,w_k^Tx_{i_k}+\operatorname{trace}\left[\Lambda\left(W^TW-I_K\right)\right].\tag{15}$$

Setting the gradient of $L$ w.r.t. $W$ to zero, we obtain

$$\nabla_WL(W,\Lambda)=-Y+W\left(\Lambda+\Lambda^T\right)=0\tag{16}$$

where the matrix $Y$ of size $D\times K$ is defined as

$$Y\equiv\left[\sum_{i_1\in C_1}z_{1i_1}x_{i_1},\ \ldots,\ \sum_{i_K\in C_K}z_{Ki_K}x_{i_K}\right].\tag{17}$$

Using the constraint $W^TW=I_K$, we get

$$\Lambda+\Lambda^T=W^TY.\tag{18}$$

Since $\Lambda+\Lambda^T$ is symmetric, this immediately implies that $W^TY$ is symmetric. From Eq. (16), we also get

$$\left(\Lambda+\Lambda^T\right)W^TW\left(\Lambda+\Lambda^T\right)=\left(\Lambda+\Lambda^T\right)^2=Y^TY.\tag{19}$$

Expanding $Y$ using its singular value decomposition (SVD) as $Y=U\Sigma V^T$, the above relations can be simplified to

$$Y=U\Sigma V^T=UV^T\,V\Sigma V^T=W\left(\Lambda+\Lambda^T\right)\tag{20}$$

giving

$$\Lambda+\Lambda^T=V\Sigma V^T\tag{21}$$

and

$$W=UV^T.\tag{22}$$

We have shown that the optimal solution for $W$ is the polar decomposition of $Y$, namely $W=UV^T$. Since $Z$ has been held fixed during the estimation of $W$, in the subsequent step, we can hold $W$ fixed and solve for $Z$, and repeat. We obtain an alternating algorithm that iterates between estimating $W$ and $Z$ until a convergence criterion is met.
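To make the W-step concrete, here is a minimal NumPy sketch of the update just derived (our own illustration under assumed variable names and data layout, not the authors' code): build $Y$ column by column as in Eq. (17) and take the polar factor of its thin SVD.

```python
import numpy as np

def update_W(X, Z, labels, K):
    """One W-step of the alternating algorithm: W = U V^T from the thin SVD of Y (Eqs. 17 and 22).

    X      : (N, D) data matrix, one pattern per row
    Z      : (N,)   auxiliary variables z_{k i_k}, one per pattern
    labels : (N,)   class indices in {0, ..., K-1}
    """
    N, D = X.shape
    Y = np.zeros((D, K))
    for k in range(K):
        idx = labels == k
        Y[:, k] = X[idx].T @ Z[idx]        # column k: sum_{i_k in C_k} z_{k i_k} x_{i_k}
    U, _, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ Vt                           # orthonormal columns: W^T W = I_K
```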

4.2 Estimation of the auxiliary variable Z

The objective function and constraints on Z depend on whether we use objective functions based on the square or absolute value of the inner product. We separately consider the two cases. The inner product squared effective objective function

$$E_{\text{quad-eff}}(Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}+\frac{1}{2}z_{ki_k}^2\right]\tag{23}$$

is minimized w.r.t. $Z$ subject to the constraints $\sum_{i_k\in C_k}z_{ki_k}=0,\ \forall k$. The straightforward solution obtained via standard minimization is

$$z_{ki_k}=w_k^Tx_{i_k}-\frac{1}{|C_k|}\sum_{i_k\in C_k}w_k^Tx_{i_k}=w_k^T\left(x_{i_k}-\frac{1}{|C_k|}\sum_{i_k\in C_k}x_{i_k}\right).\tag{24}$$

The absolute value effective objective function

$$E_{\text{abs-eff}}(Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}-\epsilon\sqrt{1-z_{ki_k}^2}\right]\tag{25}$$

is also minimized w.r.t. $Z$ subject to the constraints $\sum_{i_k\in C_k}z_{ki_k}=0,\ \forall k$. A heuristic solution obtained (eschewing standard minimization) is

$$z_{ki_k}=\frac{w_k^Tx_{i_k}}{\sqrt{\left(w_k^Tx_{i_k}\right)^2+\epsilon^2}}-\frac{1}{|C_k|}\sum_{i_k\in C_k}\frac{w_k^Tx_{i_k}}{\sqrt{\left(w_k^Tx_{i_k}\right)^2+\epsilon^2}}\tag{26}$$

which has to be checked to be valid. The heuristic solution acts as an initial condition for constraint satisfaction (which can be efficiently obtained via 1D line minimization). The first-order Karush-Kuhn-Tucker (KKT) conditions obtained from the Lagrangian

$$L_{\text{abs-eff}}(Z,M)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}-\epsilon\sqrt{1-z_{ki_k}^2}\right]-\sum_{k=1}^{K}\mu_k\sum_{i_k\in C_k}z_{ki_k}\tag{27}$$

are

$$-w_k^Tx_{i_k}+\epsilon\frac{z_{ki_k}}{\sqrt{1-z_{ki_k}^2}}-\mu_k=0,\quad\forall k\tag{28}$$

from which we obtain

$$z_{ki_k}=\frac{w_k^Tx_{i_k}+\mu_k}{\sqrt{\left(w_k^Tx_{i_k}+\mu_k\right)^2+\epsilon^2}}.\tag{29}$$

We see that the constraint $z_{ki_k}\in[-1,1]$ is also satisfied. For each category $C_k$, there exists a solution for the Lagrange parameter $\mu_k$ such that $\sum_{i_k\in C_k}z_{ki_k}=0$. This can be obtained via any efficient 1D search procedure like golden section [47].
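A small numerical sketch of this 1D search follows (our own illustration; we use plain bisection on $\mu_k$ as a stand-in for the golden section search cited in the text, and the variable names and tolerances are assumptions):

```python
import numpy as np

def solve_mu(p, eps=1e-3, tol=1e-10, max_iter=200):
    """Find mu_k such that sum_i z_i = 0, with z_i given by Eq. (29).

    p : 1D array of projections w_k^T x_{i_k} for the patterns of one class C_k.
    The map mu -> sum_i (p_i + mu) / sqrt((p_i + mu)^2 + eps^2) is strictly increasing,
    so simple bisection on a bracketing interval suffices.
    """
    g = lambda mu: np.sum((p + mu) / np.sqrt((p + mu) ** 2 + eps ** 2))
    lo, hi = -np.max(np.abs(p)) - 1.0, np.max(np.abs(p)) + 1.0   # g(lo) < 0 < g(hi)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    mu = 0.5 * (lo + hi)
    z = (p + mu) / np.sqrt((p + mu) ** 2 + eps ** 2)   # satisfies |z_i| < 1 and sum(z) ~ 0
    return mu, z
```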

4.3 Extension to the kernel setting

The objective function and constraints on the weight matrix A in the kernel setting are

$$E_{\text{Kequiv}}(A)=-\sum_{k=1}^{K}\sum_{i_k\in C_k}z_{ki_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})\tag{30}$$

with the constraints

$$A\,G\,A^T=I_K\tag{31}$$

where $A_{ki}=\alpha_{ki}$ and $G_{ij}=K(x_i,x_j)$ is the $N\times N$ kernel Gram matrix. The constraints can be expressed using a Lagrange parameter matrix to obtain the following Lagrangian,

$$L_{\text{ker}}(A,\Lambda_{\text{ker}})=-\sum_{k=1}^{K}\sum_{i_k\in C_k}z_{ki_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})+\operatorname{trace}\left[\Lambda_{\text{ker}}\left(A\,G\,A^T-I_K\right)\right].\tag{32}$$

Setting the gradient of $L_{\text{ker}}$ w.r.t. $A$ to zero, we obtain

$$-Y_{\text{ker}}+\left(\Lambda_{\text{ker}}+\Lambda_{\text{ker}}^T\right)A\,G=0\tag{33}$$

where the matrix $Y_{\text{ker}}$ of size $K\times N$ is defined as

$$\left[Y_{\text{ker}}\right]_{kj}\equiv\sum_{i_k\in C_k}z_{ki_k}K(x_j,x_{i_k}).\tag{34}$$

Using the constraint $A\,G\,A^T=I_K$, we obtain

$$\left(\Lambda_{\text{ker}}+\Lambda_{\text{ker}}^T\right)A\,G\,A^T\left(\Lambda_{\text{ker}}+\Lambda_{\text{ker}}^T\right)=\left(\Lambda_{\text{ker}}+\Lambda_{\text{ker}}^T\right)^2=Y_{\text{ker}}\,G^{-1}\,Y_{\text{ker}}^T.\tag{35}$$

Expanding $Y_{\text{ker}}G^{-\frac{1}{2}}$ using its singular value decomposition as $Y_{\text{ker}}G^{-\frac{1}{2}}=U_{\text{ker}}S_{\text{ker}}V_{\text{ker}}^T$, the above relations can be simplified to

$$\Lambda_{\text{ker}}+\Lambda_{\text{ker}}^T=U_{\text{ker}}S_{\text{ker}}U_{\text{ker}}^T\tag{36}$$

and

$$A\,G^{\frac{1}{2}}=U_{\text{ker}}V_{\text{ker}}^T\ \Rightarrow\ A=U_{\text{ker}}V_{\text{ker}}^T\,G^{-\frac{1}{2}}.\tag{37}$$

We have shown that the optimal solution for $A$ is related to the polar decomposition of $Y_{\text{ker}}G^{-\frac{1}{2}}$, namely $A=U_{\text{ker}}V_{\text{ker}}^T\,G^{-\frac{1}{2}}$. Since $Z$ has been held fixed during the estimation of $A$, in the subsequent step, we can hold $A$ fixed and solve for $Z$, and repeat. We thereby obtain an alternating algorithm that iterates between estimating $A$ and $Z$ until a convergence criterion is met. This is analogous to the non-kernel version above.
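A minimal sketch of this kernel A-step follows (our own construction; the eigendecomposition-based $G^{-1/2}$ and the small eigenvalue floor are assumptions, not prescriptions from the chapter):

```python
import numpy as np

def inv_sqrt_psd(G, floor=1e-10):
    """G^{-1/2} for a symmetric positive semi-definite Gram matrix, flooring tiny eigenvalues."""
    evals, evecs = np.linalg.eigh(G)
    evals = np.maximum(evals, floor)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

def update_A(Y_ker, G):
    """Kernel weight update A = U_ker V_ker^T G^{-1/2} (Eq. 37) from the thin SVD of Y_ker G^{-1/2}."""
    G_ih = inv_sqrt_psd(G)                       # G^{-1/2}
    U, _, Vt = np.linalg.svd(Y_ker @ G_ih, full_matrices=False)
    A = U @ Vt @ G_ih
    return A                                     # satisfies A G A^T = I_K up to numerical error
```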

The solutions for Z in this setting are very straightforward to obtain. We eschew the derivation and merely state that

$$z_{ki_k}=\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})-\frac{1}{|C_k|}\sum_{i_k\in C_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})=\sum_{j=1}^{N}\alpha_{kj}\left[K(x_j,x_{i_k})-\frac{1}{|C_k|}\sum_{i_k\in C_k}K(x_j,x_{i_k})\right]\tag{38}$$

for the squared inner product kernel objective and

$$z_{ki_k}=\frac{\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})}{\sqrt{\left(\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})\right)^2+\epsilon^2}}-\frac{1}{|C_k|}\sum_{i_k\in C_k}\frac{\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})}{\sqrt{\left(\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})\right)^2+\epsilon^2}}\tag{39}$$

for the absolute value kernel objective. This heuristic solution acts as an initial condition for constraint satisfaction (which can be efficiently obtained via 1D line minimization). Following the line of Eqs. (27)–(29) above, the solution can be written as

$$z_{ki_k}=\frac{\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})+\mu_k}{\sqrt{\left(\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})+\mu_k\right)^2+\epsilon^2}}.\tag{40}$$

For each category $C_k$, as before, there exists a solution for the Lagrange parameter $\mu_k$ such that $\sum_{i_k\in C_k}z_{ki_k}=0$. Once again, this can be obtained via any efficient 1D search procedure like golden section [47].

4.4 Analysis

4.4.1 Euclidean setting

The simplest objective function in the above sequence, analyzed in the literature, is based on the squared inner product. Below, we summarize this work by closely following the treatment in [2, 48]. First, to bring our work in sync with the literature, we eliminate the auxiliary variable Z from the squared inner product objective function (treated as a function of both W and Z here):

$$E_{\text{quad-eff}}(W,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\,w_k^Tx_{i_k}+\frac{1}{2}z_{ki_k}^2\right].\tag{41}$$

Setting $z_{ki_k}=w_k^T\left(x_{i_k}-\frac{1}{|C_k|}\sum_{i\in C_k}x_i\right)$, which is the optimum solution for $Z$, we get

$$E_{\text{quad}}(W)=-\frac{1}{2}\sum_{k=1}^{K}w_k^TR_kw_k\equiv-\frac{1}{2}\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[w_k^T\left(x_{i_k}-\frac{1}{|C_k|}\sum_{i\in C_k}x_i\right)\right]^2\tag{42}$$

where $R_k$ is the class-specific covariance matrix:

$$R_k\equiv\sum_{i_k\in C_k}\left(x_{i_k}-\frac{1}{|C_k|}\sum_{i\in C_k}x_i\right)\left(x_{i_k}-\frac{1}{|C_k|}\sum_{i\in C_k}x_i\right)^T.\tag{43}$$

We seek to minimize Eq. (42) w.r.t. $W$ under the orthonormality constraints $W^TW=I_K$.

A set of $K$ orthonormal vectors $\left\{w_k\in\mathbb{R}^D\right\}_{k=1}^{K}$ in a $D$-dimensional Euclidean space is a point on the well-known Stiefel manifold, denoted here by $M_{D,K}$ with $K\le D$. The problem in Eq. (42) is equivalent to the maximization of a sum of heterogeneous quadratic functions on a Stiefel manifold. The functions are heterogeneous in our case since the class-specific covariance matrices $R_k$ are not identical in general. The Lagrangian corresponding to this problem (with $Z$ removed via direct minimization) is

$$L_{\text{quad}}(W,\Lambda)=-\frac{1}{2}\sum_{k=1}^{K}w_k^TR_kw_k+\operatorname{trace}\left[\Lambda^T\left(W^TW-I_K\right)\right].\tag{44}$$

Setting the gradient of the above Lagrangian w.r.t. $W$ to zero, we obtain

$$\left[R_1w_1,\ R_2w_2,\ \ldots,\ R_Kw_K\right]=W\left(\Lambda+\Lambda^T\right).\tag{45}$$

Noting that $\Lambda+\Lambda^T$ is symmetric and using the Stiefel orthonormality constraint $W^TW=I_K$, we get

$$\Lambda+\Lambda^T=W^T\left[R_1w_1,\ R_2w_2,\ \ldots,\ R_Kw_K\right].\tag{46}$$

The above can be considerably simplified. First, we introduce a new vector $w\in M_{D,K}$ defined as $w\equiv\left[w_1^T,w_2^T,\ldots,w_K^T\right]^T$ and then rewrite Eq. (45) in vector form to get

$$R\,w=S(w)\,w\tag{47}$$

where

$$R\equiv\begin{bmatrix}R_1 & 0 & \cdots & 0\\ 0 & R_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & R_K\end{bmatrix}\tag{48}$$

is a $KD\times KD$ matrix and

$$S(w)\equiv\begin{bmatrix}\left(w_1^TR_1w_1\right)I_D & \frac{1}{2}\left(w_1^TR_1w_2+w_2^TR_2w_1\right)I_D & \cdots & \frac{1}{2}\left(w_1^TR_1w_K+w_K^TR_Kw_1\right)I_D\\ \frac{1}{2}\left(w_1^TR_1w_2+w_2^TR_2w_1\right)I_D & \left(w_2^TR_2w_2\right)I_D & \cdots & \frac{1}{2}\left(w_2^TR_2w_K+w_K^TR_Kw_2\right)I_D\\ \vdots & \vdots & \ddots & \vdots\\ \frac{1}{2}\left(w_1^TR_1w_K+w_K^TR_Kw_1\right)I_D & \frac{1}{2}\left(w_2^TR_2w_K+w_K^TR_Kw_2\right)I_D & \cdots & \left(w_K^TR_Kw_K\right)I_D\end{bmatrix}\tag{49}$$

a $KD\times KD$ symmetric matrix. The reason $S(w)$ can be made symmetric is that it is closely related to the solution for $\Lambda+\Lambda^T$, which has to be symmetric. The first- and second-order necessary conditions for a feasible point $w_0\in M_{D,K}$ to be a local minimum of the problem in Eq. (42) are as follows:

$$R\,w_0=S(w_0)\,w_0\tag{50}$$

and

$$\left.\left(R-S(w_0)\right)\right|_{T_M(w_0)}\tag{51}$$

is negative semi-definite. In Eq. (51), $T_M(w_0)$ is the tangent space of the Stiefel manifold at $w_0$. In a tour de force proof, Rapcsák further shows in [2] that if the matrix $R-S(w_0)$ is negative semi-definite, then a feasible point $w_0$ is a global minimum. This is an important result since it adds a sufficient condition for a global minimum for the problem of minimizing a heterogeneous sum of quadratic forms on a Stiefel manifold (see Note 1).
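This certificate can be verified numerically for a candidate solution. The sketch below is our own construction (the variable names, the way the tangent-space basis is computed, and the tolerance values are assumptions, not taken from the chapter): it assembles the block-diagonal $R$ of Eq. (48) and the symmetric $S(w_0)$ of Eq. (49) from the class covariances, builds an orthonormal basis of $T_M(w_0)$ as the null space of the constraint Jacobian, and reports the largest eigenvalue of the restricted matrix (a value at or below zero is consistent with the certificate).

```python
import numpy as np

def global_min_certificate(W0, R_list, rank_tol=1e-10):
    """Largest eigenvalue of (R - S(w0)) restricted to the tangent space T_M(w0) (cf. Eq. 51).

    W0     : (D, K) feasible point with W0^T W0 = I_K
    R_list : list of K class-specific covariance matrices R_k (Eq. 43), each (D, D)
    """
    D, K = W0.shape
    R = np.zeros((K * D, K * D))
    S = np.zeros((K * D, K * D))
    for k in range(K):
        R[k * D:(k + 1) * D, k * D:(k + 1) * D] = R_list[k]            # Eq. (48)
        for l in range(K):
            c = 0.5 * (W0[:, k] @ R_list[k] @ W0[:, l] + W0[:, l] @ R_list[l] @ W0[:, k])
            S[k * D:(k + 1) * D, l * D:(l + 1) * D] = c * np.eye(D)    # Eq. (49)
    # Jacobian of the constraints w_k^T w_l - delta_{kl} (k <= l) in the stacked coordinates.
    rows = []
    for k in range(K):
        for l in range(k, K):
            J = np.zeros(K * D)
            J[k * D:(k + 1) * D] += W0[:, l]
            J[l * D:(l + 1) * D] += W0[:, k]
            rows.append(J)
    Jac = np.array(rows)
    # Orthonormal basis of the tangent space = null space of the Jacobian.
    _, sv, Vt = np.linalg.svd(Jac)
    rank = int(np.sum(sv > rank_tol))
    T = Vt[rank:].T                      # columns span T_M(w0)
    M = T.T @ (R - S) @ T                # restriction to the tangent space
    return float(np.max(np.linalg.eigvalsh(M)))
```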

4.4.2 The RKHS setting

We can readily extend the above analysis to the kernel version of the squared inner product. The complete objective function w.r.t. both the coefficients A and the auxiliary variable Z is

$$E_{\text{Kequiv}}(A,Z)=\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[-z_{ki_k}\sum_{j=1}^{N}\alpha_{kj}K(x_j,x_{i_k})+\frac{1}{2}z_{ki_k}^2\right].\tag{52}$$

Setting $z_{ki_k}=\sum_{j=1}^{N}\alpha_{kj}\left[K(x_j,x_{i_k})-\frac{1}{|C_k|}\sum_{i\in C_k}K(x_j,x_i)\right]$, which is the optimum solution for $Z$ (cf. Eq. (38)), we get

$$E_{\text{Kquad}}(A)=-\frac{1}{2}\sum_{k=1}^{K}\sum_{i_k\in C_k}\left[\sum_{j=1}^{N}\alpha_{kj}\left(K(x_j,x_{i_k})-\frac{1}{|C_k|}\sum_{i\in C_k}K(x_j,x_i)\right)\right]^2=-\frac{1}{2}\sum_{k=1}^{K}\alpha_k^TG_k\alpha_k\tag{53}$$

where $\left[\alpha_k\right]_j=\alpha_{kj}$, $A=\left[\alpha_1,\alpha_2,\ldots,\alpha_K\right]^T$ and

$$\left[G_k\right]_{jm}\equiv\sum_{i_k\in C_k}\left(K(x_j,x_{i_k})-\frac{1}{|C_k|}\sum_{i\in C_k}K(x_j,x_i)\right)\left(K(x_m,x_{i_k})-\frac{1}{|C_k|}\sum_{i\in C_k}K(x_m,x_i)\right).\tag{54}$$

The constraints on $A$ can be written as

$$A\,G\,A^T=I_K\ \Longleftrightarrow\ \left(G^{\frac{1}{2}}A^T\right)^T\left(G^{\frac{1}{2}}A^T\right)=I_K.\tag{55}$$

Introducing a new variable $B=G^{\frac{1}{2}}A^T$, we may rewrite the kernel objective function and constraints as

$$E_{\text{Kquad-new}}(B)=-\frac{1}{2}\sum_{k=1}^{K}\beta_k^TH_k\beta_k\equiv-\frac{1}{2}\sum_{k=1}^{K}\beta_k^TG^{-\frac{1}{2}}G_kG^{-\frac{1}{2}}\beta_k\tag{56}$$

(where $B\equiv\left[\beta_1,\beta_2,\ldots,\beta_K\right]$) and

$$B^TB=I_K\tag{57}$$

respectively. This is now in the same form as the objective function and constraints in Section 4.4.1, and therefore the Rapcsák analysis of that section can be directly applied here. The above change of variables is predicated on the positive definiteness of $G$. If this does not hold, principal component analysis has to be applied to $G$, resulting in a positive definite matrix in a reduced space, after which the above approach can be applied.
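A brief sketch of this change of variables, including the PCA-style fallback for a rank-deficient Gram matrix (the eigenvalue threshold and function names are our assumptions), is given below.

```python
import numpy as np

def gram_sqrt_factors(G, rel_tol=1e-10):
    """Return (G_half, G_inv_half) restricted to the strictly positive eigen-directions of G.

    When G is positive definite nothing is discarded; otherwise only the retained
    (PCA-reduced) subspace is used, as suggested in the text above.
    """
    evals, evecs = np.linalg.eigh(G)
    keep = evals > rel_tol * evals.max()
    evals, evecs = evals[keep], evecs[:, keep]
    G_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    G_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return G_half, G_inv_half

# Change of variables: B = G^{1/2} A^T satisfies B^T B = I_K exactly when A G A^T = I_K,
# and the coefficients are recovered via A = (G^{-1/2} B)^T.
```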

In addition to providing necessary conditions for global minima, the authors in [4] developed an iterative procedure as a method of solution. We have adapted this to suit our purposes. A block coordinate descent algorithm, which successively updates $W$ and $Z$, is presented in Algorithm 1.

Algorithm 1. Iterative process for minimization of the sum of squares of inner products objective function.

  • Input: A set of labeled patterns $\left\{x_{i_k}\right\}_{i_k=1}^{|C_k|}$, $k\in\{1,\ldots,K\}$.

  • Initialize:

    • Convergence threshold $\delta$.

    • Arbitrary orthonormal system $W^{(0)}$.

  • Repeat (constructing the sequence $W^{(1)},W^{(2)},\ldots$; assume $W^{(m)}$ has been constructed for $m=0,1,2,\ldots$):

    • Update the auxiliary variable $Z^{(m+1)}$ satisfying $\sum_{i_k\in C_k}z_{ki_k}=0$:

      $z_{ki_k}^{(m+1)}=\left(w_k^{(m)}\right)^Tx_{i_k}-\frac{1}{|C_k|}\sum_{i_k\in C_k}\left(w_k^{(m)}\right)^Tx_{i_k}$.

    • Perform the SVD of $\left[\sum_{i_1\in C_1}z_{1i_1}^{(m+1)}x_{i_1},\ \ldots,\ \sum_{i_K\in C_K}z_{Ki_K}^{(m+1)}x_{i_K}\right]$ to get $U^{(m+1)}S^{(m+1)}\left(V^{(m+1)}\right)^T$, where $S^{(m+1)}$ is $K\times K$.

    • Set $W^{(m+1)}=U^{(m+1)}\left(V^{(m+1)}\right)^T$, the polar decomposition.

  • Loop until $\left\|W^{(m+1)}-W^{(m)}\right\|_F\le\delta$.

  • Output: $W$.
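A compact NumPy implementation sketch of Algorithm 1 for the squared inner product (CQS) objective is given below. This is our own rendering of the steps above, not the authors' released code; the data layout, random initialization, and iteration cap are assumptions.

```python
import numpy as np

def category_space_cqs(X, labels, K, delta=1e-8, max_iter=500, seed=0):
    """Block coordinate descent of Algorithm 1 for the squared inner product objective.

    X      : (N, D) data matrix, one pattern x_i per row
    labels : (N,)   class indices in {0, ..., K-1}
    Returns W of shape (D, K) with orthonormal columns (W^T W = I_K).
    """
    N, D = X.shape
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.normal(size=(D, K)))      # arbitrary orthonormal system W^(0)
    for _ in range(max_iter):
        # Z-step (Eq. 24): class-centered projections, so sum_{i_k in C_k} z_{k i_k} = 0.
        Z = np.zeros(N)
        for k in range(K):
            idx = labels == k
            p = X[idx] @ W[:, k]
            Z[idx] = p - p.mean()
        # W-step (Eqs. 17 and 22): polar factor of Y.
        Y = np.zeros((D, K))
        for k in range(K):
            idx = labels == k
            Y[:, k] = X[idx].T @ Z[idx]
        U, _, Vt = np.linalg.svd(Y, full_matrices=False)
        W_new = U @ Vt
        if np.linalg.norm(W_new - W) <= delta:        # Frobenius norm convergence test
            W = W_new
            break
        W = W_new
    return W
```

After convergence, the patterns would be projected into the category space as `Y = X @ W` (each row being $y_i=W^Tx_i$); this is how the embeddings used in the experiments below would be obtained under this sketch.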


5. Experimental results

5.1 Quantitative results for linear and kernel dimensionality reduction

In practice, dimensionality reduction is used in conjunction with a classification algorithm. By definition, the purpose of dimensionality reduction, as it relates to classification, is to reduce the complexity of the data while retaining discriminating information. Thus, we utilize a popular classification algorithm to analyze the performance of our proposed dimensionality reduction technique. In this section, we report the results of several experiments with dimensionality reduction combined with SVM classification. In the multi-class setting, we compare against other state-of-the-art algorithms that perform dimensionality reduction and then evaluate performance using a multi-class one-vs-all linear SVM scheme. The classification technique uses the traditional training and testing phases, outputting the class it considers the best prediction for a given test sample. We measure the accuracy of these predictions averaged over all test sets. In Table 1, we demonstrate the effectiveness of both the sum of quadratic and absolute value functions, denoted as category quadratic space (CQS) and category absolute value space (CAS), respectively. We benchmark their overall classification accuracy against several classical dimensionality reduction techniques, namely least squares linear discriminant analysis (LS-LDA) [27], multi-class Fisher linear discriminant (MC-FLD) [19], principal component analysis (PCA) [33], and their multi-class and kernel counterparts (when applicable). In each experiment, we chose two-thirds of the data for training, and the remaining third of the samples was used for testing. The results are shown in Table 1.

Name (# Classes)  | CQS   | CAS   | LS-LDA | PCA   | MC-FLD
Vehicle (4)       | 53.91 | 53.05 | 76.56  | 55.36 | 76.82
Wine (3)          | 96.07 | 96.82 | 95.51  | 77.19 | 97.28
Iris (3)          | 97.55 | 96.88 | 96.11  | 96.77 | 96.77
Seeds (3)         | 90.39 | 90.79 | 95.15  | 92.53 | 95.79
Thyroid (3)       | 94.02 | 94.08 | 94.02  | 92.57 | 93.92
Satellite (6)     | 85.30 | 85.20 | 86.38  | 85.45 | 86.52
Segmentation (7)  | 93.14 | 93.44 | 94.62  | 94.40 | 94.43
Vertebral (3)     | 84.13 | 82.79 | 81.45  | 80.05 | 81.18

Table 1.

Linear dimensionality reduction w/SVM classification.

Databases: To illustrate the performance of the methods proposed in Section 3, we conducted experiments using publicly available data sets taken from the UCI machine learning repository [49]. We have chosen a variety of data sets that vary in terms of class cardinality (K), number of samples (N), and number of features (D) to demonstrate the versatility of our approach. For a direct comparison of results, we use the Vehicle Silhouettes, Wine, Iris, Seeds, Thyroid, Satellite, Segmentation, and Vertebral Column databases. More details about the individual sets are available at the repository site.

We divide the results into linear and kernel groups (as is normal practice). Table 1 shows the results for linear dimensionality reduction followed by linear SVM classification. All dimensionality reduction algorithms were implemented and paired with a linear SVM classifier tuned for optimal classification results (via cross-validation). It can be seen that the category space projection scheme consistently provides a good projection on which standard classification algorithms can operate. Several of the data sets comprise only three classes, and it can be seen that the proposed method is competitive in performance and, in some instances, performs slightly better.

Also, for comparison, Table 2 reports the performance of the proposed kernel formulations followed by a linear SVM classifier. These proposed methods achieve accuracy rates similar to their kernel counterparts; results for the proposed kernel methods with angle-based classification are shown in Table 3.

Name (# Classes)  | K-CQS | K-CAS | K-PCA | K-MC-FLD
Vehicle (4)       | 40.27 | 40.92 | 44.81 | 74.35
Wine (3)          | 92.95 | 95.63 | 95.95 | 96.88
Iris (3)          | 95.55 | 93.33 | 95.55 | 94.44
Seeds (3)         | 90.21 | 90.47 | 91.53 | 93.65
Thyroid (3)       | 41.97 | 40.24 | 43.08 | 72.34
Satellite (6)     | 81.54 | 86.23 | 89.69 | 90.61
Segmentation (7)  | 72.96 | 77.24 | 83.01 | 92.43
Vertebral (3)     | 70.96 | 69.53 | 70.96 | 82.25

Table 2.

Kernel dimensionality reduction w/SVM classification.

Name (# Classes)  | K-CQS-A | K-CAS-A
Vehicle (4)       | 67.96   | 68.24
Wine (3)          | 95.32   | 95.32
Iris (3)          | 95.55   | 95.18
Seeds (3)         | 91.79   | 91.79
Thyroid (3)       | 67.90   | 66.79
Satellite (6)     | 83.33   | 76.29
Segmentation (7)  | 50.21   | 48.94
Vertebral (3)     | 77.59   | 77.77

Table 3.

Kernel dimensionality reduction w/angle classification.

The iterative approach in Algorithm 1 was applied to obtain an optimal orthonormal basis $W$ (which is $D\times K$) for the category space, so that $D$-dimensional input patterns can be projected into the smaller $K$-dimensional category space when $D>K$. We start with a set of $N$ labeled input vectors $x_i\in\mathbb{R}^D$ drawn randomly from multiple classes $C_k$, $k\in\{1,\ldots,K\}$. The optimization technique searches over Stiefel manifold elements, as explained above. The algorithm is terminated when the Frobenius norm difference between iterations satisfies $\left\|W^{(m+1)}-W^{(m)}\right\|_F\le\delta$ (with $\delta=10^{-8}$). Once we have determined the optimal $W$, the patterns are mapped into the category space by the transformation $y_i=W^Tx_i$ to obtain the corresponding set of $N$ samples $y_i\in\mathbb{R}^K$, where $K$ is the dimension of the reduced space.

The results above show that our proposed methods lead to classification rates that can be compared to classical approaches. However, the main focus of this work is to provide an algorithm that retains important classification information while introducing a geometry (category vector subspace) with attractive semantic and visualization properties. The results suggest that our classification results are competitive with other techniques while learning a category space.

5.2 Visualization of kernel dimensionality reduction

Another valuable aspect of this research can be seen in the kernel formulation, which demonstrates the warping of the projected patterns toward their respective category axes. This suggests a geometric approach to classification, i.e., we could consider the angle of deviation of a test set pattern from each category axis as a measure of class membership. Within the category space, each category is represented by one of the bases (axes) that define the space. Class membership is, therefore, inversely proportional to the angle between the pattern and the respective category axis. Figures 1 through 3 illustrate the warped space for various three-class problems, for a variation of the width parameter $\sigma$ of a Gaussian radial basis function kernel in the range $\sigma\in[0.1,0.8]$. Note the improved visualization semantics of the category space approach when compared to the other dimensionality reduction techniques.
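For concreteness, here is a minimal sketch of the angle-based decision rule described above (our own illustration; the use of the signed cosine and the variable names are our choices, and in the kernel case the projections would be computed from A and the Gram matrix instead of W):

```python
import numpy as np

def angle_classify(Y):
    """Assign each projected pattern to the category axis it deviates from the least.

    Y : (N, K) array of patterns already projected into the category space (rows y_i = W^T x_i).
    Axis k is the k-th canonical basis vector of the category space, so the cosine of the
    angle between y_i and axis k is y_{ik} / ||y_i||; the smallest angle = largest cosine.
    """
    norms = np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12   # guard against zero vectors
    cosines = Y / norms
    return np.argmax(cosines, axis=1)
```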

Figure 1.

Reduced dimensionality projection for a medium σ value. From top to bottom: Row 1: Vertebral, Row 2: Thyroid, Row 3: Wine, Row 4: Iris, Row 5: Seeds.

Figure 2.

Reduced dimensionality projection for a small σ value. From top to bottom: Row 1: Vertebral, Row 2: Thyroid, Row 3: Wine, Row 4: Iris, Row 5: Seeds.

Figure 3.

Reduced dimensionality projection for a large σ value. From top to bottom: Row 1: Vertebral, Row 2: Thyroid, Row 3: Wine, Row 4: Iris, Row 5: Seeds.


6. Conclusions

In this work, we presented a new approach to supervised dimensionality reduction, one that attempts to learn orthogonal category axes during training. The motivation for this work stems from the observation that the semantics of the multi-class Fisher linear discriminant are unclear, especially w.r.t. defining a space for the categories (classes). Beginning with this observation, we designed objective functions comprising sums of quadratic and absolute value terms (aimed at maximizing the inner product between each training set pattern and its class axis) with Stiefel manifold constraints (since the category axes are orthonormal). It turns out that recent work has characterized such problems and provided sufficient conditions for the detection of global minima (despite the presence of non-convex constraints). The availability of a straightforward Stiefel manifold optimization algorithm tailored to this problem (which has no step size parameters to estimate) is an attractive by-product of this formulation. The extension to the kernel setting is entirely straightforward. Since the kernel dimensionality reduction approach warps the patterns toward orthogonal category axes, this raises the possibility of using the angle between each pattern and the category axes as a classification measure. We conducted experiments in the kernel setting and demonstrated reasonable performance for the angle-based classifier, suggesting a new avenue for future research. Finally, visualization of dimensionality reduction for three classes showcases the category space geometry, with clear semantic advantages over standard principal components and the multi-class Fisher discriminant.

Several opportunities exist for future research. We notice a clustering of patterns near the origin of the category space, clearly calling for an origin margin (as in support vector machines) [43]. At the same time, we can also remove the orthogonality assumption (in the linear case) while continuing to pursue multi-class discrimination. Finally, extensions to the multi-label case [42] are warranted and suggest interesting opportunities for future work.


Additional information

An earlier version of this chapter was previously published as a preprint in: 1. Smith AO, Rangarajan A. A Category Space Approach to Supervised Dimensionality Reduction [Internet]. arXiv.org. 2016 [cited 2023 Sep 5]. Available from: https://arxiv.org/abs/1610.08838


Nomenclature

PCA: principal component analysis
SVM: support vector machine
FLD: Fisher linear discriminant
CCA: canonical correlation analysis
PLS: partial least squares
SPCA: supervised PCA
RKHS: reproducing kernel Hilbert space
KKT: Karush-Kuhn-Tucker
CQS: category quadratic space
CAS: category absolute value space
LS-LDA: least squares linear discriminant analysis
MC-FLD: multi-class Fisher linear discriminant
K-CQS: kernel category quadratic space
K-CAS: kernel category absolute value space
K-PCA: kernel principal component analysis
K-MC-FLD: kernel multi-class Fisher linear discriminant
K-CQS-A: kernel CQS w/angle classification
K-CAS-A: kernel CAS w/angle classification

References

  1. Belhumeur PN, Hespanha JP, Kriegman DJ. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19(7):711-720
  2. Rapcsák T. On minimization on Stiefel manifolds. European Journal of Operational Research. 2002;143(2):365-376
  3. Jiang B, Dai Y-H. A framework of constraint preserving update schemes for optimization on Stiefel manifold. Mathematical Programming. 2015;153(2):535-575
  4. Bolla M, Michaletzky G, Tusnady G, Ziermann M. Extrema of sums of heterogeneous quadratic forms. Linear Algebra and its Applications. 1998;269(1-3):331-365
  5. Liu H, Wu W, So AM-C. Quadratic optimization with orthogonality constraints: Explicit Lojasiewicz exponent and linear convergence of line-search methods. In: Proceedings of Machine Learning Research, International Conference on Machine Learning. PMLR; 2016. pp. 1158-1167. Available from: https://proceedings.mlr.press/v48/liue16.html
  6. Hardoon DR, Szedmak SR, Shawe-Taylor JR. Canonical correlation analysis: An overview with application to learning methods. Neural Computation. 2004;16(12):2639-2664
  7. Xu M, Zhu Z, Zhang X, Zhao Y, Li X. Canonical correlation analysis with l2,1-norm for multiview data representation. IEEE Transactions on Cybernetics. 2019;50(11):4772-4782
  8. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323-2326
  9. Chen J, Liu Y. Locally linear embedding: A survey. Artificial Intelligence Review. 2011;36:29-48
  10. Ghojogh B, Ghodsi A, Karray F, Crowley M. Locally linear embedding and its variants: Tutorial and survey. arXiv preprint arXiv:2011.10925. 2020
  11. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319-2323
  12. Jenkins OC, Matarić MJ. A spatio-temporal extension to isomap nonlinear dimension reduction. In: Proceedings of the Twenty-First International Conference on Machine Learning. 2004. p. 56
  13. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003;15(6):1373-1396
  14. Li B, Li Y-R, Zhang X-L. A survey on Laplacian eigenmaps based manifold learning methods. Neurocomputing. 2019;335:336-351
  15. Zhu H, Koniusz P. Generalized Laplacian eigenmaps. Advances in Neural Information Processing Systems. 2022;35:30783-30797
  16. Jolliffe IT. Principal Component Analysis. Springer Series in Statistics. 2nd ed. New York: Springer; 2002
  17. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2010;2(4):433-459
  18. Bro R, Smilde AK. Principal component analysis. Analytical Methods. 2014;6(9):2812-2831
  19. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7(2):179-188
  20. Hotelling H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology. 1933;24(6):417-441
  21. Arenas-García J, Petersen KB, Hansen LK. Sparse kernel orthonormalized PLS for feature extraction in large data sets. Advances in Neural Information Processing Systems. 2007;19:33-40
  22. Vinzi VE, Chin WW, Henseler J, Wang H, et al. Handbook of Partial Least Squares. Vol. 201. Springer; 2010. DOI: 10.1007/978-3-540-32827-8
  23. Vapnik VN. Statistical Learning Theory. John Wiley & Sons; 1998. Available from: https://www.wiley.com/en-us/Statistical+Learning+Theory-p-978047103003
  24. Bishop CM. Neural Networks for Pattern Recognition. 1st ed. Oxford University Press; 1996. Available from: https://global.oup.com/academic/product/neural-networks-for-pattern-recognition-9780198538646?cc=us&lang=en&
  25. Duda RO, Hart P, Stork DG. Pattern Classification. 2nd ed. New York, NY: Wiley Interscience; 2000
  26. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(1):155-176
  27. Ye J. Least squares linear discriminant analysis. In: Proceedings of the 24th International Conference on Machine Learning (ICML). ACM; 2007. pp. 1087-1093. DOI: 10.1145/1273496.1273633
  28. Bishop CM. Pattern Recognition and Machine Learning. 1st ed. New York: Springer; 2006
  29. Sammon JW. An optimal discriminant plane. IEEE Transactions on Computers. 1970;100(9):826-829
  30. Foley DH, Sammon JW. An optimal set of discriminant vectors. IEEE Transactions on Computers. 1975;100(3):281-289
  31. Anderson TW, Bahadur RR. Classification into two multivariate normal distributions with different covariance matrices. The Annals of Mathematical Statistics. 1962;33(2):420-431
  32. Schölkopf B, Burges CJC. Advances in Kernel Methods: Support Vector Learning. MIT Press; 1999. DOI: 10.7551/mitpress/1130.001.0001
  33. Rao CR. The use and interpretation of principal component analysis in applied research. Sankhyā: The Indian Journal of Statistics, Series A. 1964;26(4):329-358
  34. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology. 2004;2(4):e108
  35. Bair E, Hastie T, Paul D, Tibshirani R. Prediction by supervised principal components. Journal of the American Statistical Association. 2012;101(473):119-137
  36. Widdows D. Geometry and Meaning. Vol. 773. Stanford: CSLI Publications; 2004
  37. Widdows D. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In: Proceedings of 41st Annual Meeting on Association for Computational Linguistics. Vol. 1. Association for Computational Linguistics; 2003. pp. 136-143. DOI: 10.3115/1075096.1075114
  38. Tsochantaridis I, Hofmann T, Joachims T, Altun Y. Support vector machine learning for interdependent and structured output spaces. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML). ACM; 2004. p. 104. DOI: 10.1145/1015330.1015341
  39. Ji S, Ye J. Linear dimensionality reduction for multi-label classification. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI). Vol. 9. 2009. pp. 1077-1082. DOI: 10.5555/1661445.1661617
  40. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6th ed. Pearson; 2002. DOI: 10.1007/978-3-662-45171-7
  41. Sun L, Ji S, Ye J. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;33(1):194-200
  42. Sun L, Ji S, Ye J. Multi-Label Dimensionality Reduction. CRC Press; 2013. DOI: 10.1201/b16017
  43. Shajari H, Rangarajan A. A unified framework for multiclass and multilabel support vector machines. CoRR, abs/2003.11197. 2020. DOI: 10.48550/arXiv.2003.11197
  44. Gao W, Ma Z, Xiong C, Gao T. Dimensionality reduction of SPD data based on Riemannian manifold tangent spaces and local affinity. Applied Intelligence. 2023;53(2):1887-1911
  45. Ghojogh B, Crowley M, Karray F, Ghodsi A. Elements of Dimensionality Reduction and Manifold Learning. Springer Nature; 2023. DOI: 10.1007/978-3-031-10602-6
  46. Yuille AL, Rangarajan A. The concave-convex procedure. Neural Computation. 2003;15(4):915-936
  47. Kiefer J. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society. 1953;4(3):502-506
  48. Rapcsák T. On minimization of sums of heterogeneous quadratic functions on Stiefel manifolds. In: Migdalas A, Pardalos PM, Värbrand P, editors. From Local to Global Optimization. Springer; 2001. pp. 277-290. DOI: 10.1007/978-1-4757-5284-7_12
  49. Kelly M, Longjohn R, Nottingham K. The UCI Machine Learning Repository. 2013. Available from: http://archive.ics.uci.edu

Notes

  1. Note that this problem is fundamentally different from, and cannot be reduced to, the minimization of $\operatorname{trace}\left(AW^TBW\right)$ subject to $W^TW=I_K$, which has a closed-form solution.
