Open access peer-reviewed chapter

Computing on Vertices in Data Mining

Written By

Leon Bobrowski

Submitted: 17 June 2021 Reviewed: 08 July 2021 Published: 14 December 2021

DOI: 10.5772/intechopen.99315

From the Edited Volume

Data Mining - Concepts and Applications

Edited by Ciza Thomas


Abstract

The main challenges in data mining are related to large, multi-dimensional data sets. There is a need to develop algorithms that are precise and efficient enough to deal with big data problems. The Simplex algorithm from linear programming can be seen as an example of a successful big data problem solving tool. According to the fundamental theorem of linear programming, the solution of the optimization problem can be found in one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors.

Keywords

  • data mining
  • basis exchange algorithms
  • small samples of multivariate vectors
  • gene clustering
  • prognostic models

1. Introduction

Various data mining tools are proposed to extract patterns from data sets [1]. Large, multidimensional data sets impose high requirements on the precision and efficiency of the calculations used to extract patterns (regularities) useful in practice [2]. In this context, there is still a need to develop new data mining algorithms [3]. New types of patterns are also obtained as a result of combining different types of classification or prognosis models [4].

The Simplex algorithm from linear programming is used as an effective big data mining tool [5]. According to the basic theorem of linear programming, the solution to the linear optimization problem with linear constraints can be found at one of the vertices in the parameter space. Narrowing the search area to a finite number of vertices is a source of the efficiency of the Simplex algorithm.

Basis exchange algorithms also look for an optimal solution among a finite number of vertices in the parameter space [6]. The basis exchange algorithms are based on the Gauss-Jordan transformation and, for this reason, are similar to the Simplex algorithm. Controlling the basis exchange algorithm is related to the minimization of convex and piecewise linear (CPL) criterion functions [7].

The perceptron and collinearity criterion functions belong to the family of CPL functions. The minimization of the perceptron criterion function makes it possible to check the linear separability of data sets and to design piecewise linear classifiers [8]. Minimizing the collinearity criterion function makes it possible to detect collinear (flat) patterns in data sets and to design multiple interaction models [9].

Data sets consisting of a small number of multivariate feature vectors generate specific problems in data mining [10]. Genetic data sets are of this type. Minimizing the perceptron criterion function or the collinearity criterion function makes it possible to solve discrimination or regression problems also in the case of a small set of multidimensional feature vectors, by using complex layers of low-dimensional linear classifiers or prognostic models [11].


2. Linear separability vs. linear dependence

Let us assume that each of m objects O_j from a given database is represented by an n-dimensional feature vector x_j = [x_{j,1}, ..., x_{j,n}]^T belonging to the feature space F[n] (x_j ∈ F[n]). The data set C consists of m such feature vectors x_j:

C = \{ x_j \}, \text{ where } j = 1, \ldots, m \qquad (1)

The components x_{j,i} of the feature vector x_j are numerical values (x_{j,i} ∈ R or x_{j,i} ∈ {0, 1}) of the individual features X_i of the j-th object O_j. In this context, each feature vector x_j (x_j ∈ F[n]) represents n features X_i belonging to the feature set F(n) = {X_1, ..., X_n}.

The pairs {G_k^+, G_k^-} (k = 1, ..., K) of the learning sets G_k^+ and G_k^- (G_k^+ ∩ G_k^- = ∅) are formed from some feature vectors x_j selected from the data set C (1):

G_k^{+} = \{ x_j : j \in J_k^{+} \}, \quad \text{and} \quad G_k^{-} = \{ x_j : j \in J_k^{-} \} \qquad (2)

where J_k^+ and J_k^- are non-empty sets of indices j of the vectors x_j (J_k^+ ∩ J_k^- = ∅).

The positive learning set G_k^+ is composed of m_k^+ feature vectors x_j (j ∈ J_k^+). Similarly, the negative learning set G_k^- is composed of m_k^- feature vectors x_j (j ∈ J_k^-), where m_k^+ + m_k^- ≤ m.

The possibility of separating the learning sets G_k^+ and G_k^- (2) with a hyperplane H(w_k, θ_k) in the feature space F[n] is investigated in pattern recognition [1]:

H(w_k, \theta_k) = \{ x : w_k^T x = \theta_k \} \qquad (3)

where w_k = [w_{k,1}, ..., w_{k,n}]^T ∈ R^n is the weight vector, θ_k ∈ R^1 is the threshold, and w_k^T x = Σ_i w_{k,i} x_i is the scalar product.

Definition 1: The learning sets G_k^+ and G_k^- (2) are linearly separable in the feature space F[n] if and only if there exist a weight vector w_k (w_k ∈ R^n) and a threshold θ_k (θ_k ∈ R^1) such that the hyperplane H(w_k, θ_k) (3) separates these sets [7]:

(\exists w_k, \theta_k) \; (\forall x_j \in G_k^{+}) \; w_k^T x_j \geq \theta_k + 1, \quad \text{and} \quad (\forall x_j \in G_k^{-}) \; w_k^T x_j \leq \theta_k - 1 \qquad (4)

According to the above inequalities, all vectors x_j from the learning set G_k^+ (2) are located on the positive side of the hyperplane H(w_k, θ_k) (3), and all vectors x_j from the set G_k^- lie on the negative side of this hyperplane.

The hyperplane H(w_k, θ_k) (3) separates (4) the sets G_k^+ and G_k^- (2) with the following margin δ_L2(w_k) based on the Euclidean (L2) norm, which is used in the Support Vector Machines (SVM) method [12]:

\delta_{L2}(w_k) = 2 / \| w_k \|_{L2} = 2 / (w_k^T w_k)^{1/2} \qquad (5)

where || wk ||L2 = (wkTwk)1/2 is the Euclidean length of the weight vector wk.

The margin δ_L1(w_k) with the L1 norm, related to the hyperplane H(w_k, θ_k) (3) which separates (4) the learning sets G_k^+ and G_k^- (2), is determined by analogy to (5) as [11]:

\delta_{L1}(w_k) = 2 / \| w_k \|_{L1} = 2 / (|w_{k,1}| + \cdots + |w_{k,n}|) \qquad (6)

where || wk ||L1 = |wk,1| + ... +| wk,n| is the L1 length of the weight vector wk.

The margins δ L2(wk) (5) or δ L1(wk) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets Gk+ and Gk(2) [7].
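For illustration, both margins (5) and (6) can be computed directly from a weight vector. The short Python sketch below uses an arbitrary example vector (not taken from the chapter) and assumes only NumPy.

```python
import numpy as np

def l2_margin(w):
    # Euclidean (L2) margin (5): 2 / ||w||_L2
    return 2.0 / np.sqrt(w @ w)

def l1_margin(w):
    # L1 margin (6): 2 / (|w_1| + ... + |w_n|)
    return 2.0 / np.sum(np.abs(w))

w_k = np.array([0.5, -1.0, 0.0, 2.0])   # hypothetical weight vector
print(l2_margin(w_k), l1_margin(w_k))
```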

The following set of mk′ = mk+ + mk linear equations can be formulated on the basis of the linear separability inequalities (4):

(\forall j \in J_k^{+}) \; x_j^T w_k = \theta_k + 1, \quad \text{and} \quad (\forall j \in J_k^{-}) \; x_j^T w_k = \theta_k - 1 \qquad (7)

If we assume that the threshold θ_k can be determined later, then we have n unknown weights w_{k,i} (w_k = [w_{k,1}, ..., w_{k,n}]^T) in an underdetermined system of m_k' = m_k^+ + m_k^- (m_k' ≤ m < n) linear Eqs. (7). In order to obtain a system of n linear equations with n unknown weights w_{k,i}, additional linear equations based on n - m_k' selected unit vectors e_i (i ∈ I_k) are taken into account [6]:

(\forall i \in I_k) \; e_i^T w_k = 0 \qquad (8)

The parameter vertex wk = [wk,1,..., wk,n]T can be determined by the linear Eqs. (7) and (8) if the feature vectors xj forming the learning sets Gk+ and Gk(2) are linearly independent [7].

The feature vector x_{j'} (x_{j'} ∈ G_k^+ ∪ G_k^- (2)) is a linear combination of some other vectors x_{j(i)} (j(i) ≠ j') from the learning sets (2) if there are parameters α_{j',i} (α_{j',i} ≠ 0) such that the following relation holds:

x_{j'} = \alpha_{j',1} x_{j(1)} + \cdots + \alpha_{j',l} x_{j(l)} \qquad (9)

Definition 2: Feature vectors x_j making up the learning sets G_k^+ and G_k^- (2) are linearly independent if none of these vectors x_{j'} (x_{j'} ∈ G_k^+ ∪ G_k^-) can be expressed as a linear combination (9) of l (l ∈ {1, ..., m - 1}) other vectors x_{j(l)} from the learning sets.

If the number m_k' = m_k^+ + m_k^- of elements x_j of the learning sets G_k^+ and G_k^- (2) is smaller than the dimension n of the feature space F[n] (m_k^+ + m_k^- ≤ n), then the parameter vertex w_k(θ_k) can be defined by the linear equations in the following matrix form [13]:

B_k w_k(\theta_k) = 1_k(\theta_k) \qquad (10)

where

1_k(\theta_k) = [\theta_k + 1, \ldots, \theta_k + 1, \theta_k - 1, \ldots, \theta_k - 1, 0, \ldots, 0]^T \qquad (11)

and

B_k = [x_1, \ldots, x_{m_k'}, e_{i(m_k'+1)}, \ldots, e_{i(n)}]^T \qquad (12)

The first m_k^+ components of the vector 1_k(θ_k) are equal to θ_k + 1, the next m_k^- components are equal to θ_k - 1, and the last n - m_k^+ - m_k^- components are equal to 0. The first m_k^+ rows of the square matrix B_k (12) are formed by the feature vectors x_j (j ∈ J_k^+) from the set G_k^+ (2), the next m_k^- rows are formed by the vectors x_j (j ∈ J_k^-) from the set G_k^- (2), and the last n - m_k^+ - m_k^- rows are made up of the unit vectors e_i (i ∈ I_k).

If the matrix B_k (12) is non-singular, then the inverse matrix B_k^{-1} exists:

B_k^{-1} = [r_1, \ldots, r_{m_k'}, r_{i(m_k'+1)}, \ldots, r_{i(n)}] \qquad (13)

In this case, the parameter vertex w_k(θ_k) (10) can be defined by the following equation:

w_k(\theta_k) = B_k^{-1} 1_k(\theta_k) = (\theta_k + 1) r_k^{+} + (\theta_k - 1) r_k^{-} = \theta_k (r_k^{+} + r_k^{-}) + (r_k^{+} - r_k^{-}) \qquad (14)

where the vector r_k^+ is the sum of the first m_k^+ columns r_i of the inverse matrix B_k^{-1} (13), and the vector r_k^- is the sum of the next m_k^- columns r_i of this matrix.

The last n - (m_k^+ + m_k^-) components w_{k,i}(θ_k) of the vector w_k(θ_k) = [w_{k,1}(θ_k), ..., w_{k,n}(θ_k)]^T (14) linked to the zero components of the vector 1_k(θ_k) (11) are equal to zero:

(\forall i \in \{ m_k^{+} + m_k^{-} + 1, \ldots, n \}) \;\; w_{k,i}(\theta_k) = 0 \qquad (15)

The conditions w_{k,i}(θ_k) = 0 (15) result from the equations e_i^T w_k(θ_k) = 0 (8) at the vertex w_k(θ_k) (14).

The length ||w_k(θ_k)||_L1 of the weight vector w_k(θ_k) (14) in the L1 norm is the sum of the absolute values of the m_k' = m_k^+ + m_k^- components w_{k,i}(θ_k):

\| w_k(\theta_k) \|_{L1} = |w_{k,1}(\theta_k)| + \cdots + |w_{k,m_k'}(\theta_k)| \qquad (16)

In accordance with Eq. (14), the components w_{k,i}(θ_k) can be determined as follows:

(\forall i \in \{ 1, \ldots, m_k^{+} + m_k^{-} \}) \;\; w_{k,i}(\theta_k) = \theta_k (r_{k,i}^{+} + r_{k,i}^{-}) + (r_{k,i}^{+} - r_{k,i}^{-}) \qquad (17)

The length ||w_k(θ_k)||_L1 (16) of the vector w_k(θ_k) (14) with the L1 norm is minimized to increase the margin δ_L1(w_k(θ_k)) (6). The length ||w_k(θ_k)||_L1 (16) can be minimized by selecting the optimal threshold value θ_k^* on the basis of Eq. (14):

(\forall \theta_k) \;\; \delta_{L1}(w_k(\theta_k^{*})) \geq \delta_{L1}(w_k(\theta_k)) \qquad (18)

where the optimal vertex w_k(θ_k^*) is defined by Eq. (14).
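The following Python sketch illustrates Eqs. (10)-(18) on a hypothetical toy example: it builds the basis B_k (12) from a few linearly independent feature vectors, computes the vertex w_k(θ_k) (14) by solving the linear system, and scans a grid of thresholds θ_k to approximate the optimal value θ_k^* (18). The data and the grid search are illustrative assumptions, not part of the chapter (which derives θ_k^* from (14) and (17)).

```python
import numpy as np

# Hypothetical toy learning sets in a 5-dimensional feature space (m' = 3 < n = 5).
X_plus = np.array([[1.0, 0.0, 2.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0, 1.0, 0.0]])   # G_k^+
X_minus = np.array([[2.0, 1.0, 0.0, 0.0, 0.0]])  # G_k^-
n = X_plus.shape[1]
m_plus, m_minus = len(X_plus), len(X_minus)
m_prime = m_plus + m_minus

# Basis B_k (12): feature vectors as the first rows, unit vectors e_i as the rest.
unit_rows = np.eye(n)[m_prime:]
B = np.vstack([X_plus, X_minus, unit_rows])

def vertex(theta):
    # Right-hand side 1_k(theta) (11) and vertex w_k(theta) (14).
    rhs = np.concatenate([np.full(m_plus, theta + 1.0),
                          np.full(m_minus, theta - 1.0),
                          np.zeros(n - m_prime)])
    return np.linalg.solve(B, rhs)

# Scan a grid of thresholds to maximize the L1 margin (6), cf. (18).
thetas = np.linspace(-5, 5, 1001)
best_theta = max(thetas, key=lambda t: 2.0 / np.sum(np.abs(vertex(t))))
print("theta* ~", best_theta, " w_k(theta*) =", vertex(best_theta))
```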

Theorem 1: The learning sets G_k^+ and G_k^- (2) formed by m (m ≤ n) linearly independent (9) feature vectors x_j are linearly separable (4) in the feature space F[n] (x_j ∈ F[n]).

Proof: If the learning sets G_k^+ and G_k^- (2) are formed by m linearly independent feature vectors x_j, then the non-singular matrix B_k = [x_1, ..., x_m, e_{i(m+1)}, ..., e_{i(n)}]^T (12) containing these m vectors x_j and n - m unit vectors e_i (i ∈ I_k) can be defined [10]. In this case, the inverse matrix B_k^{-1} (13) exists and determines the vertex w_k(θ_k) (14). The vertex equation B_k w_k(θ_k) = 1_k(θ_k) (10) can be reformulated for the feature vectors x_j (2) as follows:

(\forall x_j \in G_k^{+}) \; w_k(\theta_k)^T x_j = \theta_k + 1, \quad \text{and} \quad (\forall x_j \in G_k^{-}) \; w_k(\theta_k)^T x_j = \theta_k - 1 \qquad (19)

The solution of the Eqs. (19) satisfies the linear separability inequalities (4).

It is possible to enlarge the learning sets G_k^+ and G_k^- (2) in a way that maintains their linear separability (4).

Lemma 1: Increasing the positive learning set G_k^+ (2) by a new vector x_{j'} (x_{j'} ∉ G_k^+) which is a linear combination, with the parameters α_{j',i} (9), of some feature vectors x_{j(l)} (2) from this set (x_{j(l)} ∈ G_k^+) preserves the linear separability (4) of the learning sets if the parameters α_{j',i} fulfill the following condition:

\alpha_{j',1} + \cdots + \alpha_{j',l} \geq 1 \qquad (20)

If the assumptions of the lemma are met, then

w_k^T x_{j'} = w_k^T (\alpha_{j',1} x_{j(1)} + \cdots + \alpha_{j',l} x_{j(l)}) = \alpha_{j',1} w_k^T x_{j(1)} + \cdots + \alpha_{j',l} w_k^T x_{j(l)} \geq \alpha_{j',1} (\theta_k + 1) + \cdots + \alpha_{j',l} (\theta_k + 1) \geq \theta_k + 1 \qquad (21)

The above inequality means that the linear separability conditions (4) still apply after enlarging the learning set G_k^+ (2).

Lemma 2: Increasing the negative learning set G_k^- (2) by a new vector x_{j'} (x_{j'} ∉ G_k^-) which is a linear combination, with the parameters α_{j',i} (9), of some feature vectors x_{j(l)} (2) from this set (x_{j(l)} ∈ G_k^-) preserves the linear separability (4) of the learning sets if the parameters α_{j',i} fulfill the following condition:

\alpha_{j',1} + \cdots + \alpha_{j',l} \geq 1 \qquad (22)

The justification of Lemma 2 may be based on an inequality similar to (21).


3. Perceptron criterion function

The minimization of the perceptron criterion function makes it possible to assess the degree of linear separability (4) of the learning sets G_k^+ and G_k^- (2) in different feature subspaces F[n'] (F[n'] ⊂ F[n + 1]) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors y_j (y_j ∈ F[n + 1]) and augmented weight vectors v_k (v_k ∈ R^{n+1}) [1]:

(\forall j \in J_k^{+} \, (2)) \; y_j = [x_j^T, 1]^T, \quad \text{and} \quad (\forall j \in J_k^{-} \, (2)) \; y_j = -[x_j^T, 1]^T \qquad (23)

and

v_k = [w_k^T, -\theta_k]^T = [w_{k,1}, \ldots, w_{k,n}, -\theta_k]^T \qquad (24)

The augmented vectors yj are constructed (23) on the basis of the learning sets Gk+ and Gk(2). These learning sets are extracted from the data set C(1) according to some additional knowledge. The linear separability (4) of the learning sets Gk+ and Gk(2) can be reformulated using the following set of m inequalities with the augmented vectors yj(23) [7]:

(\exists v_k) \; (\forall j \in J_k^{+} \cup J_k^{-} \, (2)) \; v_k^T y_j \geq 1 \qquad (25)

The dual hyperplanes hjp in the parameter space Rn + 1 (vRn + 1) are defined on the basis of the augmented vectors yj [6]:

(\forall j \in J_k^{+} \cup J_k^{-} \, (2)) \; h_j^{p} = \{ v : y_j^T v = 1 \} \qquad (26)

The dual hyperplanes h_j^p (26) divide the parameter space R^{n+1} (v ∈ R^{n+1}) into a finite number L of disjoint regions (convex polyhedra) D_l^p (l = 1, ..., L) [7]:

D_l^{p} = \{ v : (\forall j \in J_l^{+}) \; y_j^T v \geq 1 \;\; \text{and} \;\; (\forall j \in J_l^{-}) \; y_j^T v < 1 \} \qquad (27)

where J_l^+ and J_l^- are disjoint subsets (J_l^+ ∩ J_l^- = ∅) of indices j of the feature vectors x_j making up the learning sets G_k^+ and G_k^- (2).

The perceptron penalty functions φjp(v) are defined as follows for each of augmented feature vectors yj(23) [6]:

(\forall j \in J_k^{+} \cup J_k^{-})
\varphi_j^{p}(v) = \begin{cases} 1 - y_j^T v & \text{if } y_j^T v < 1 \\ 0 & \text{if } y_j^T v \geq 1 \end{cases} \qquad (28)

The j-th penalty function φ_j^p(v) (28) is greater than zero if and only if the weight vector v is located on the wrong side (y_j^T v < 1) of the j-th dual hyperplane h_j^p (26). The function φ_j^p(v) (28) is linear and greater than zero as long as the parameter vector v = [v_{k,1}, ..., v_{k,n+1}]^T remains on the wrong side of the hyperplane h_j^p (26). The convex and piecewise-linear (CPL) penalty functions φ_j^p(v) (28) are used to enforce the linear separation (4) of the learning sets G_k^+ and G_k^- (2).

The perceptron criterion function Φkp(v) is defined as the weighted sum of the penalty functions φjp(v) (28) [6]:

\Phi_k^{p}(v) = \sum_j \alpha_j \, \varphi_j^{p}(v) \qquad (29)

Positive parameters αj (αj > 0) can be treated as prices of individual feature vectors xj:

(\forall j \in J_k^{+} \, (2)) \; \alpha_j = 1 / (2 m_k^{+}), \quad \text{and} \quad (\forall j \in J_k^{-} \, (2)) \; \alpha_j = 1 / (2 m_k^{-}) \qquad (30)

where m_k^+ (m_k^-) is the number of elements x_j in the learning set G_k^+ (G_k^-) (2).
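A minimal Python sketch of evaluating the perceptron criterion function (28)-(30) is given below; the toy learning sets and the tested vector v are hypothetical. A value of zero signals linear separability of the two sets (Theorem 2, stated further below).

```python
import numpy as np

def augment(X_plus, X_minus):
    # Augmented vectors y_j (23): [x_j, 1] for G_k^+ and -[x_j, 1] for G_k^-.
    Y_plus = np.hstack([X_plus, np.ones((len(X_plus), 1))])
    Y_minus = -np.hstack([X_minus, np.ones((len(X_minus), 1))])
    return np.vstack([Y_plus, Y_minus])

def perceptron_criterion(v, X_plus, X_minus):
    # Phi_k^p(v) (29) with the standard prices alpha_j (30).
    Y = augment(X_plus, X_minus)
    alphas = np.concatenate([np.full(len(X_plus), 1.0 / (2 * len(X_plus))),
                             np.full(len(X_minus), 1.0 / (2 * len(X_minus)))])
    margins = Y @ v
    penalties = np.where(margins < 1.0, 1.0 - margins, 0.0)   # phi_j^p(v) (28)
    return np.sum(alphas * penalties)

# Toy 2-D example (hypothetical data); a result of 0 indicates linear separability.
X_plus = np.array([[2.0, 2.0], [3.0, 1.0]])
X_minus = np.array([[-1.0, -1.0], [-2.0, 0.0]])
v = np.array([1.0, 1.0, 0.0])          # v = [w, -theta] (24)
print(perceptron_criterion(v, X_plus, X_minus))
```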

The perceptron criterion function Φkp(v) (29) was built on the basis of the error correction algorithm, the basic algorithm in the Perceptron model of learning processes in neural networks [14].

The criterion function Φ_k^p(v) (29) is convex and piecewise-linear (CPL) [6]. This means, among other things, that the function Φ_k^p(v) (29) remains linear within each region D_l^p (27):

(\forall l \in \{1, \ldots, L\}) \; (\forall v \in D_l^{p}) \;\; \Phi_k^{p}(v) = \sum_j \alpha_j (1 - y_j^T v) \qquad (31)

where the summation is performed on all vectors yj(23) fulfilling the condition yjTv < 1.

The optimal vector vk* determines the minimum value Φkp(vk*) of the criterion function Φkp(v) (29):

(\exists v_k^{*}) \; (\forall v \in R^{n+1}) \;\; \Phi_k^{p}(v) \geq \Phi_k^{p}(v_k^{*}) \geq 0 \qquad (32)

Since the criterion function Φ_k^p(v) (29) is linear in each convex polyhedron D_l^p (27), the optimal point v_k^* representing the minimum Φ_k^p(v_k^*) (32) can be located in a selected vertex of some polyhedron D_{l'}^p (27). This property of the optimal vector v_k^* (32) follows from the fundamental theorem of linear programming [5].

It has been shown that the minimum value Φkp(vk*) (32) of the perceptron criterion function Φkp(v) (29) with the parameters αj(30) is normalized as follows [6]:

0 \leq \Phi_k^{p}(v_k^{*}) \leq 1 \qquad (33)

The following theorem has been proved [6]:

Theorem 2: The minimum value Φkp(vk*) (32) of the perceptron criterion function Φkp(v) (29) is equal to zero (Φkp(vk*) = 0) if and only if the learning sets Gk+ and Gk(2) are linearly separable (4).

The minimum value Φ_k^p(v_k^*) (32) is close to one (Φ_k^p(v_k^*) ≈ 1) if the sets G_k^+ and G_k^- (2) almost completely overlap. It can also be proved that the minimum value Φ_k^p(v_k^*) (32) of the perceptron criterion function Φ_k^p(v) (29) does not depend on invertible linear transformations of the feature vectors y_j (23) [6].

The regularized criterion function Ψ_k^p(v) is defined as the sum of the perceptron criterion function Φ_k^p(v) (29) and some additional penalty functions [13]. These additional CPL functions are equal to the costs γ_i (γ_i > 0) of individual features X_i multiplied by the absolute values |w_i| of the weights w_i, where v = [w^T, -θ]^T = [w_1, ..., w_n, -θ]^T ∈ R^{n+1} (24):

\Psi_k^{p}(v) = \Phi_k^{p}(v) + \lambda \sum_i \gamma_i \, |w_i| \qquad (34)

where λ (λ ≥ 0) is the cost level. The standard values of the cost parameters γi are equal to one (∀i ∈ {1, ..., n} γi = 1).

The optimal vector v_{k,λ}^* constitutes the minimum value Ψ_k^p(v_{k,λ}^*) of the CPL criterion function Ψ_k^p(v) (34), which is defined on the elements x_j of the learning sets G_k^+ and G_k^- (2):

(\exists v_{k,\lambda}^{*}) \; (\forall v \in R^{n+1}) \;\; \Psi_k^{p}(v) \geq \Psi_k^{p}(v_{k,\lambda}^{*}) > 0 \qquad (35)

Similarly to the case of the perceptron criterion function Φ_k^p(v) (29), the optimal vector v_{k,λ}^* (35) can be located in a selected vertex of some polyhedron D_{l'}^p (27). The minimum value Ψ_k^p(v_{k,λ}^*) (35) of the criterion function Ψ_k^p(v) (34) is used, among others, in the relaxed linear separability (RLS) method of gene subset selection [15].


4. Collinearity criterion function

Minimizing the collinearity criterion function is used to extract collinear patterns from large, multidimensional data sets C(1) [7]. Linear models of multivariate interactions can be formulated on the basis of representative collinear patterns [9].

The collinearity penalty functions φj(w) are determined by individual feature vectors xj = [xj,1,...,xj,n]T in the following manner [9]:

(\forall x_j \in C \, (1))
\varphi_j(w) = |1 - x_j^T w| = \begin{cases} 1 - x_j^T w & \text{if } x_j^T w \leq 1 \\ x_j^T w - 1 & \text{if } x_j^T w > 1 \end{cases} \qquad (36)

The penalty functions φj(w) (36) can be related to the following dual hyperplanes hj1 in the parameter (weight) space Rn (wRn):

(\forall j \in \{1, \ldots, m\}) \;\; h_j^{1} = \{ w : x_j^T w = 1 \} \qquad (37)

The CPL penalty function φ_j(w) (36) is equal to zero (φ_j(w) = 0) at the point w = [w_1, ..., w_n]^T if and only if the point w is located on the dual hyperplane h_j^1 (37).

The collinearity criterion function Φk(w) is defined as the weighted sum of the penalty functions φj(w) (36) determined by feature vectors xj forming the data subset Ck (Ck ⊂ C(1)):

\Phi_k(w) = \sum_j \beta_j \, \varphi_j(w) \qquad (38)

where the sum takes into account only the indices j from the set J_k = {j : x_j ∈ C_k}, and the positive parameters β_j (β_j > 0) in the function Φ_k(w) (38) can be treated as the prices of particular feature vectors x_j. The standard choice of the parameter values β_j is one ((∀ j ∈ J_k) β_j = 1.0).

The collinearity criterion function Φk(w) (38) is convex and piecewise-linear (CPL) as the sum of this type of penalty functions φj(w) (36) [9]. The vector wk* determines the minimum value Φk(wk*) of the criterion function Φk(w) (38):

(\exists w_k^{*}) \; (\forall w) \;\; \Phi_k(w) \geq \Phi_k(w_k^{*}) \geq 0 \qquad (39)

Definition 3: The data subset Ck (Ck ⊂ C(1)) is collinear when all feature vectors xj from this subset are located on some hyperplane H(w, θ) = {x: wTx = θ} with θ ≠ 0.

Theorem 3: The minimum value Φ_k(w_k^*) (39) of the collinearity criterion function Φ_k(w) (38) defined on the feature vectors x_j constituting a data subset C_k (C_k ⊂ C (1)) is equal to zero (Φ_k(w_k^*) = 0) when this subset C_k is collinear (Def. 3) [9].

Different collinear subsets C_k can be extracted from a data set C (1) with a large number m of elements x_j by minimizing the collinearity criterion function Φ_k(w) (38) [9].

The minimum value Φ_k(w_k^*) (39) of the collinearity criterion function Φ_k(w) (38) can be reduced to zero by omitting some feature vectors x_j from the data subset C_k (C_k ⊂ C (1)). If the minimum value Φ_k(w_k^*) (39) is greater than zero (Φ_k(w_k^*) > 0), then we can select the feature vectors x_j (j ∈ J_k(w_k^*)) with the penalty φ_j(w_k^*) (36) greater than zero:

(\forall j \in J_k(w_k^{*})) \;\; \varphi_j(w_k^{*}) = |1 - x_j^T w_k^{*}| > 0 \qquad (40)

Omitting one feature vector x_{j'} (j' ∈ J_k(w_k^*)) with the above property results in the following reduction of the minimum value Φ_k(w_k^*) (39):

\Phi_{k'}(w_{k'}^{*}) \leq \Phi_k(w_k^{*}) - \varphi_{j'}(w_k^{*}) \qquad (41)

where Φ_{k'}(w_{k'}^*) is the minimum value (39) of the collinearity criterion function Φ_{k'}(w) (38) defined on the feature vectors x_j constituting the data subset C_k reduced by the vector x_{j'}.
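A small Python sketch of Eqs. (36), (38) and (40) is given below; it evaluates the collinearity criterion for a hypothetical subset C_k at a fixed weight vector w and shows how omitting the vector with the largest penalty φ_j(w) lowers the criterion value at that w (and hence bounds the reduced minimum in (41) from above). The data and the vector w are illustrative.

```python
import numpy as np

def collinearity_criterion(w, X, betas=None):
    # Phi_k(w) (38) with penalties phi_j(w) (36); standard prices beta_j = 1.
    if betas is None:
        betas = np.ones(len(X))
    penalties = np.abs(1.0 - X @ w)
    return np.sum(betas * penalties), penalties

# Hypothetical subset C_k: the first three vectors lie on the hyperplane w^T x = 1.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [3.0, 3.0]])    # outlier, off the collinear pattern
w = np.array([1.0, 1.0])
value, penalties = collinearity_criterion(w, X)
print("Phi_k(w) =", value)

# Omitting the vector with the largest penalty (40) reduces the value, cf. (41).
j_worst = int(np.argmax(penalties))
value_reduced, _ = collinearity_criterion(w, np.delete(X, j_worst, axis=0))
print("after removing x_%d: %.3f" % (j_worst, value_reduced))
```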

The regularized criterion function Ψ_k(w) is defined as the sum of the collinearity criterion function Φ_k(w) (38) and some additional CPL penalty functions φ_i^0(w) [7]:

\Psi_k(w) = \Phi_k(w) + \lambda \sum_i \gamma_i \, \varphi_i^{0}(w) = \sum_j \beta_j \, \varphi_j(w) + \lambda \sum_i \gamma_i \, \varphi_i^{0}(w) \qquad (42)

where λ ≥ 0 is the cost level. The standard values of the cost parameters γ_i are equal to one ((∀ i ∈ {1, ..., n}) γ_i = 1). The additional CPL penalty functions φ_i^0(w) are defined below [7]:

(\forall i \in \{1, \ldots, n\})
\varphi_i^{0}(w) = |e_i^T w| = |w_i| = \begin{cases} -w_i & \text{if } w_i \leq 0 \\ w_i & \text{if } w_i > 0 \end{cases} \qquad (43)

The functions φ_i^0(w) (43) are related to the following dual hyperplanes h_i^0 in the parameter (weight) space R^n (w ∈ R^n):

(\forall i \in \{1, \ldots, n\}) \;\; h_i^{0} = \{ w : e_i^T w = 0 \} = \{ w : w_i = 0 \} \qquad (44)

The CPL penalty function φ_i^0(w) (43) is equal to zero (φ_i^0(w) = 0) at the point w = [w_1, ..., w_n]^T if and only if this point is located on the dual hyperplane h_i^0 (44).


5. Parameter vertices

The perceptron criterion function Φ_k^p(v) (29) and the collinearity criterion function Φ_k(w) (38) are convex and piecewise-linear (CPL). The minimum values of such CPL criterion functions can be located in parameter vertices of some convex polyhedra. We consider here the parameter vertices w_k (w_k ∈ R^n) related to the collinearity criterion function Φ_k(w) (38).

Definition 4: The parameter vertex w_k of the rank r_k (r_k ≤ n) in the weight space R^n (w_k ∈ R^n) is the intersection point of r_k hyperplanes h_j^1 (37) defined by linearly independent feature vectors x_j (j ∈ J_k) from the data set C (1) and of n - r_k hyperplanes h_i^0 (44) defined by unit vectors e_i (i ∈ I_k) [7].

The j-th dual hyperplane h_j^1 (37) defined by the feature vector x_j (1) passes through the k-th vertex w_k if the equation w_k^T x_j = 1 holds.

Definition 5: The k-th weight vertex wk of the rank rk is degenerate in the parameter space Rn if the number mk of hyperplanes hj1(37) passing through this vertex (wkTxj = 1) is greater than the rank rk (mk > rk).

The vertex wk can be defined by the following set of n linear equations:

(\forall j \in J_k(w_k)) \;\; w_k^T x_j = 1 \qquad (45)

and

(\forall i \in I_k(w_k)) \;\; w_k^T e_i = 0 \qquad (46)

Eqs. (45) and (46) can be represented in the below matrix form [7]:

B_k w_k = 1_k \qquad (47)

where 1_k = [1, ..., 1, 0, ..., 0]^T is the vector with the first r_k components equal to one and the remaining n - r_k components equal to zero.

The square matrix B_k (47) consists of r_k feature vectors x_j (j ∈ J_k (45)) and n - r_k unit vectors e_i (i ∈ I_k (46)):

B_k = [x_1, \ldots, x_{r_k}, e_{i(r_k+1)}, \ldots, e_{i(n)}]^T \qquad (48)

where the symbol e_{i(l)} denotes the unit vector forming the l-th row of the matrix B_k.

Since the feature vectors x_j (j ∈ J_k(w_k) (45)) making up the first r_k rows of the matrix B_k (48) are linearly independent, the inverse matrix B_k^{-1} exists:

B_k^{-1} = [r_1, \ldots, r_{r_k}, r_{i(r_k+1)}, \ldots, r_{i(n)}] \qquad (49)

The inverse matrix Bk−1(49) can be obtained starting from the unit matrix I = [e1,..., en]T and using the basis exchange algorithm [8].

The non-singular matrix Bk(48) is the basis of the feature space F[n] related to the vertex wk = [wk,1,…, wk,n]T. Since the last n - rk components of the vector 1k(47) are equal to zero, the following equation holds:

w_k = B_k^{-1} 1_k = r_1 + \cdots + r_{r_k} \qquad (50)

According to Eq. (50), the weight vertex w_k is the sum of the first r_k columns r_i of the inverse matrix B_k^{-1} (49).

Remark 1: The n - r_k components w_{k,i} of the vector w_k = [w_{k,1}, ..., w_{k,n}]^T (50) linked to the zero components of the vector 1_k = [1, ..., 1, 0, ..., 0]^T (47) are equal to zero:

(\forall i \in \{ r_k + 1, \ldots, n \}) \;\; w_{k,i} = 0 \qquad (51)

The conditions wk.i = 0 (51) result from the equations wkTei = 0 (46) at the vertex wk.
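The computation of a parameter vertex from Eqs. (47)-(51) can be illustrated with a few lines of Python; the basis below (two hypothetical feature vectors plus two unit vectors in R^4) is an arbitrary example.

```python
import numpy as np

# Hypothetical vertex of rank r_k = 2 in R^4: two feature vectors and two unit vectors.
x1 = np.array([1.0, 2.0, 0.5, 1.0])
x2 = np.array([2.0, 1.0, 1.0, 0.0])
e3 = np.array([0.0, 0.0, 1.0, 0.0])
e4 = np.array([0.0, 0.0, 0.0, 1.0])

B = np.vstack([x1, x2, e3, e4])           # basis B_k (48), rows are the basis vectors
B_inv = np.linalg.inv(B)                  # inverse matrix B_k^{-1} (49)
w = B_inv[:, 0] + B_inv[:, 1]             # vertex w_k = r_1 + r_2 (50)

print("w_k =", w)
print("w_k^T x_1 =", w @ x1, " w_k^T x_2 =", w @ x2)   # both equal 1, cf. (45)
print("w_k,3 =", w[2], " w_k,4 =", w[3])                # zero components, cf. (51)
```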

The fundamental theorem of linear programming shows that the minimum Φ_k(w_k^*) (39) of the CPL collinearity criterion function Φ_k(w) (38) can always be located in one of the vertices w_k (50) [5]. The regularized criterion function Ψ_k(w) (42), another function of the CPL type, has the same property [7].

We can see that all those feature vectors x_j (1) which define hyperplanes h_j^1 (37) passing through the vertex w_k are located on the hyperplane H(w_k, 1) = {x : w_k^T x = 1} (3) in the feature space F[n]. A large number m_k of feature vectors x_j (1) located on the hyperplane H(w_k, 1) (3) forms the collinear cluster C(w_k) based on the vertex w_k [8]:

C(w_k) = \{ x_j \in C \, (1) : w_k^T x_j = 1 \} \qquad (52)

If the vertex wk of the rank rk is degenerate in the parameter space Rn then the collinear cluster C(wk) (52) contains more than rk feature vectors xj(1).

The k-th vertex w_k = [w_{k,1}, ..., w_{k,n}]^T in the parameter space R^n (w_k ∈ R^n) is linked by Eq. (47) to the non-singular matrix B_k (48). The rows of the matrix B_k (48) can form the basis of the feature space F[n]. Each unit vector e_i contained in the basis B_k (48) enforces a zero weight w_{k,i} at the vertex w_k (cf. (46) and (51)):

(\forall i \in \{1, \ldots, n\}) \;\; \text{if } e_i \in B_k \, (48), \text{ then } w_{k,i} = 0 \qquad (53)

Each feature vector x_j from the data set C (1) represents n features X_i belonging to the feature set F(n) = {X_1, ..., X_n}. The k-th vertexical feature subset R_k(r_k) consists of the r_k features X_i that are connected to the weights w_{k,i} different from zero (w_{k,i} ≠ 0):

R_k(r_k) = \{ X_{i(1)}, \ldots, X_{i(r_k)} \} \qquad (54)

The k-th vertexical subspace F_k[r_k] (F_k[r_k] ⊂ F[n]) contains the reduced vectors x_j[r_k] (x_j[r_k] ∈ F_k[r_k]) with the r_k components x_{j,i(l)} related to the weights w_{k,i} different from zero:

(\forall j \in \{1, \ldots, m\}) \;\; x_j[r_k] = [x_{j,i(1)}, \ldots, x_{j,i(r_k)}]^T \qquad (55)

The reduced vectors xj[rk] (55) are obtained from the feature vectors xj = [xj,1,...,xj,n]T belonging to the data set C(1) by omitting the n - rk components xj,i related to the weights wk.i equal to zero (wk.i = 0).

We consider the optimal vertexical subspace F_k^*[r_k] (F_k^*[r_k] ⊂ F[n]) related to the reduced optimal vertex w_k^*[r_k], which determines the minimum Φ_k(w_k^*) (39) of the collinearity criterion function Φ_k(w) (38). The optimal collinear cluster C(w_k^*[r_k]) (52) is based on the optimal vertex w_k^*[r_k] = [w_{k,1}^*, ..., w_{k,r_k}^*]^T with r_k non-zero components w_{k,i}^* (w_{k,i}^* ≠ 0). Feature vectors x_j belonging to the collinear cluster C(w_k^*) (52) satisfy the equations w_k^*[r_k]^T x_j[r_k] = 1, hence:

(\forall x_j \in C(w_k^{*}[r_k])) \;\; w_{k,1}^{*} x_{j,i(1)} + \cdots + w_{k,r_k}^{*} x_{j,i(r_k)} = 1 \qquad (56)

where x_{j,i(l)} are the components of the j-th feature vector x_j related to the non-zero weights w_{k,i}^* (w_{k,i}^* ≠ 0).

A large number mk of feature vectors xj(1) belonging to the collinear cluster C(wk*[rk]) (52) justifies the following collinear model of interaction between selected features Xi(l) which is based on the Eqs. (56) [9]:

w_{k,1}^{*} X_{i(1)} + \cdots + w_{k,r_k}^{*} X_{i(r_k)} = 1 \qquad (57)

The collinear interaction model (57) allows, inter alia, the design of the following prognostic models for each feature X_{i'} from the subset R_k(r_k) (54):

(\forall i' \in \{1, \ldots, r_k\}) \;\; X_{i'} = \alpha_{i',0} + \alpha_{i',1} X_{i(1)} + \cdots + \alpha_{i',r_k} X_{i(r_k)} \qquad (58)

where α_{i',0} = 1 / w_{k,i'}^*, α_{i',i'} = 0, and (∀ i(l) ≠ i') α_{i',i(l)} = -w_{k,i(l)}^* / w_{k,i'}^*.

The feature X_{i'} is the dependent variable in the prognostic model (58), and the remaining r_k - 1 features X_{i(l)} are the independent variables (i(l) ≠ i'). The family of r_k prognostic models (58) can be designed on the basis of one collinear interaction model (57). The models (58) are better justified when the number m_k of feature vectors x_j (1) in the collinear cluster C(w_k^*[r_k]) (52) is large.
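A sketch of turning an optimal vertex into the prognostic models (57)-(58) is given below; it simply solves the collinear interaction model (57) for the chosen dependent feature, so the coefficient signs follow from that rearrangement. The vertex and the test point are hypothetical.

```python
import numpy as np

def prognostic_model(w_star, i_prime):
    # Solve the collinear interaction model (57), sum_l w_l * X_l = 1, for X_{i_prime}:
    # X_{i_prime} = 1 / w_{i_prime} - sum_{l != i_prime} (w_l / w_{i_prime}) X_l, cf. (58).
    alpha_0 = 1.0 / w_star[i_prime]
    alphas = -w_star / w_star[i_prime]
    alphas[i_prime] = 0.0
    return alpha_0, alphas

def predict(alpha_0, alphas, x):
    return alpha_0 + alphas @ x

# Hypothetical optimal vertex with r_k = 3 non-zero weights.
w_star = np.array([0.5, -1.0, 2.0])
alpha_0, alphas = prognostic_model(w_star, i_prime=0)

# Any x lying on the collinear pattern w_star^T x = 1 is reproduced exactly.
x = np.array([4.0, 3.0, 1.0])           # 0.5*4 - 1*3 + 2*1 = 1
print(predict(alpha_0, alphas, x), "vs actual", x[0])
```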


6. Basis exchange algorithm

The collinearity criterion function Φ(w) (38), like other convex and piecewise linear (CPL) criterion functions, can be minimized using the basis exchange algorithm [8]. The basis exchange algorithm aimed at minimization of the collinearity criterion function Φ(w) (38) is described below.

According to the basis exchange algorithm, the optimal vertex w_k^*, which constitutes the minimum value Φ_k(w_k^*) (39) of the collinearity function Φ_k(w) (38), is reached after a finite number L of steps as a result of a guided movement between selected vertices w_k (50) [8]:

w_0 \rightarrow w_1 \rightarrow \cdots \rightarrow w_L \qquad (59)

The sequence of vertices wk(59) is related by (47) to the following sequence of the inverse matrices Bk−1(49):

B_0^{-1} \rightarrow B_1^{-1} \rightarrow \cdots \rightarrow B_L^{-1} \qquad (60)

The sequence of vertices w_k (59) typically starts at the vertex w_0 = [0, ..., 0]^T related to the identity matrix B_0 = I_n = [e_1, ..., e_n]^T of dimension n × n [7]. The final vertex w_L (59) should assure the minimum value of the collinearity criterion function Φ(w) (38):

(\forall w) \;\; \Phi(w) \geq \Phi(w_L) \geq 0 \qquad (61)

If the criterion function Φ(w) (38) is defined on m (m ≤ n) linearly independent vectors x_j (x_j ∈ C (1)), then the value Φ(w_L) of the collinearity criterion function Φ(w) (38) at the final vertex w_L (59) becomes zero (Φ(w_L) = 0) [8]. The rank r_L (Def. 4) of the final vertex w_L (59) can be equal to the number m of feature vectors x_j (r_L = m) or it can be less than m (r_L < m). The rank r_L of the final vertex w_L (59) is less than m (r_L < m) if the final vertex w_L is degenerate [7].

Consider the invertible matrix B_k = [x_1, ..., x_k, e_{i(k+1)}, ..., e_{i(n)}]^T (48), which determines the vertex w_k (50) and the value Φ(w_k) of the criterion function Φ(w) (38) in the k-th step. In the step (k + 1), one of the unit vectors e_i in the matrix B_k (48) is replaced by the feature vector x_{k+1}, and the matrix B_{k+1} = [x_1, ..., x_k, x_{k+1}, e_{i(k+2)}, ..., e_{i(n)}]^T appears. The unit vector e_{i(k+1)} leaving the matrix B_k (48) is indicated by an exit criterion based on the gradient of the collinearity criterion function Φ(w) (38) [7]. The exit criterion determines the exit edge r_{k+1} (49) of the greatest descent of the collinearity criterion function Φ(w) (38). As a result of replacing the unit vector e_{i(k+1)} with the feature vector x_{k+1}, the value of the collinearity function Φ(w) (38) decreases (41):

\Phi(w_{k+1}) \leq \Phi(w_k) - \varphi_{k+1}(w_k) \qquad (62)

After a finite number L (L ≤ m) of the steps k, the collinearity function Φ(w) (38) reaches its minimum (61) at the final vertex wL(59).

The sequence (60) of the inverse matrices B_k^{-1} is obtained in a multi-step process of minimizing the function Φ(w) (38). During the k-th step, the matrix B_{k-1} = [x_1, ..., x_{k-1}, e_{i(k)}, ..., e_{i(n)}]^T (48) is transformed into the matrix B_k by replacing the unit vector e_{i(k)} with the feature vector x_k:

(\forall k \in \{1, \ldots, L\}) \;\; B_{k-1} \rightarrow B_k \qquad (63)

According to the vector Gauss-Jordan transformation, replacing the unit vector e_{i(l+1)} with the feature vector x_{l+1} during the (l + 1)-th stage results in the following modifications of the columns r_i(l) of the inverse matrix B_l^{-1} (49) of the basis B_l = [x_1, ..., x_l, e_{i(l+1)}, ..., e_{i(n)}]^T [6]:

r_{i(l+1)}(l+1) = \left( 1 / \left( r_{i(l+1)}(l)^T x_{l+1} \right) \right) r_{i(l+1)}(l) \qquad (64)

and

(\forall i \neq i(l+1)) \;\; r_i(l+1) = r_i(l) - \left( r_i(l)^T x_{l+1} \right) r_{i(l+1)}(l+1) = r_i(l) - \left( r_i(l)^T x_{l+1} / r_{i(l+1)}(l)^T x_{l+1} \right) r_{i(l+1)}(l)

where i(l+1) is the index of the unit vector ei(l+1) leaving the basis Bl = [x1,..., xl, ei(l+1),..., ei(n)]T during the l-th stage.

Remark 2: The vector Gauss-Jordan transformation (64) resulting from replacing the unit vector e_{i(k)} with the feature vector x_k in the basis B_{k-1} = [x_1, ..., x_{k-1}, e_{i(k)}, ..., e_{i(n)}]^T cannot be executed when the following collinearity condition is met [7]:

r_{i(k)}(k-1)^T x_k = 0 \qquad (65)

The collinearity condition (65) causes a division by zero in Eq. (64).
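The column update (64), together with the collinearity check of Remark 2, can be written as a short Python function. The sketch below follows the vector Gauss-Jordan transformation under the row-basis convention of (48)-(49) and verifies one exchange step against direct matrix inversion; the example data are arbitrary.

```python
import numpy as np

def exchange_column(B_inv, x_new, pos):
    """Replace the basis row at index `pos` (currently a unit vector) with the
    feature vector `x_new`, updating the columns r_i of B^{-1} as in (64)."""
    pivot = B_inv[:, pos] @ x_new
    if abs(pivot) < 1e-12:
        # Collinearity condition (65): the exchange cannot be executed (Remark 2).
        raise ValueError("collinearity condition met, choose another unit vector")
    new_inv = B_inv.copy()
    new_inv[:, pos] = B_inv[:, pos] / pivot
    for i in range(B_inv.shape[1]):
        if i != pos:
            new_inv[:, i] = B_inv[:, i] - (B_inv[:, i] @ x_new) * new_inv[:, pos]
    return new_inv

# Start from B_0 = I_n and enter one feature vector into the basis.
n = 3
B_inv = np.eye(n)
x1 = np.array([2.0, 1.0, 0.5])
B_inv = exchange_column(B_inv, x1, pos=0)

# Check against direct inversion of B_1 = [x1, e2, e3]^T.
B1 = np.vstack([x1, np.eye(n)[1:]])
print(np.allclose(B_inv, np.linalg.inv(B1)))   # True
```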

Let the symbol rl[k] denote the l-th column rl(k) = [rl,1(k),..., rl,n(k)]T of the inverse matrix Bk−1 = [r1(k),…, rk-1(k), rk(k),…., rn(k)] (49) after the reduction of the last n - k components rl,i(k):

r_l[k] = [r_{l,1}(k), \ldots, r_{l,k}(k)]^T \qquad (66)

Similarly, the symbol x_j[k] = [x_{j,1}, ..., x_{j,k}]^T denotes the reduced vector obtained from the feature vector x_j = [x_{j,1}, ..., x_{j,n}]^T after dropping the last n - k components x_{j,i}:

(\forall j \in \{1, \ldots, m\}) \;\; x_j[k] = [x_{j,1}, \ldots, x_{j,k}]^T \qquad (67)

Lemma 3: The collinearity condition (65) appears during the k-th step when the reduced vector x_k[k] (67) is a linear combination of the basis reduced vectors x_j[k] (67) with j < k:

x_k[k] = \alpha_1 x_1[k] + \cdots + \alpha_{k-1} x_{k-1}[k] \qquad (68)

where (∀ i ∈ {1, ..., k - 1}) α_i ∈ R^1.

The proof of this lemma results directly from the collinearity condition (65) [7].


7. Small samples of multivariate feature vectors

A small sample of multivariate vectors appears when the number m of feature vectors x_j in the data set C (1) is much smaller than the dimension n of these vectors (m << n). The basis exchange algorithms allow for efficient minimization of the CPL criterion functions also in the case of small samples of multivariate vectors [10]. However, for small samples, some new properties of the basis exchange algorithms become more important. In particular, the regularization (42) of the CPL criterion functions becomes crucial. These new properties of the basis exchange algorithms in the case of a small number m of multidimensional feature vectors x_j (1) are discussed below on the example of the collinearity criterion function Φ(w) (38) and the regularized criterion function Ψ(w) (42).

Lemma 4: The value Φ(w_L) of the collinearity criterion function Φ(w) (38) at the final vertex w_L (59) is equal to zero if all m linear Eqs. (45) are fulfilled at the vertex w_L, which is related by Eq. (47) to the matrix B_L = [x_1, ..., x_m, e_{i(1)}, ..., e_{i(n-m)}]^T (48) containing the unit vectors e_i with indices i from the subset I_L (i ∈ I_L).

Theorem 4: If the feature vectors xj constituting the subset Ck (Ck ⊂ C(1)) and used in the definition of the function Φ(w) (38) are linearly independent (Def. 2), then the value Φ(wL) of the collinearity criterion function Φ(w) at the final vertex wL(59) is equal to zero (Φ(wL) = 0).

The proof of Theorem 4 can be based on the stepwise inversion of the matrices Bk(48) [16]. The final vertex wL(59) can be found by inverting the related matrix BL = [x1,..., xrk, ei(1),..., ei(n-rk)]T(48).

The final vertex w_L (59) which zeroes the criterion function Φ(w) (38) (Φ(w_L) = 0) can be related to the optimal matrix B_L = [x_1, ..., x_L, e_{i(L+1)}, ..., e_{i(n)}]^T (48) built from L (L ≤ m) feature vectors x_j (j ∈ J(w_L) (45)) from the data set C (1) and from n - L selected unit vectors e_i (i ∈ I(w_L) (46)). Different subsets of the unit vectors e_i in the final matrix B_L (48) result in different positions of the final vertices w_L(l) (59) in the parameter space R^n. The criterion function Φ(w) (38) is equal to zero (Φ(w_L(l)) = 0) at each of these vertices w_L(l) (59).

The position of the final vertices w_L(l) (59) in the parameter space R^n depends on which unit vectors e_i (i ∈ I_L(l)) are included in the basis B_L(l) (48), where:

(\forall l \in \{1, \ldots, l_{max}\}) \;\; \Phi(w_{L(l)}) = 0 \qquad (69)

The maximal number lmax(69) of different vertices wL(l)(59) can be large when m << n:

l_{max} = n! \, / \, (m! \, (n - m)!) \qquad (70)
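As a quick numerical illustration of (70), with hypothetical values of n and m, the number of candidate final vertices grows combinatorially:

```python
import math

n, m = 1000, 20                        # hypothetical small sample: m << n
l_max = math.comb(n, m)                # n! / (m! (n - m)!), Eq. (70)
print(f"l_max = {float(l_max):.3e}")   # an astronomically large number of candidate vertices
```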

The choice between different final vertices wL(l)(59) can be based on the minimization of the regularized criterion function Ψ(w) (42). The regularized function Ψ(w) (42) is the sum of the collinearity function Φ(w) (38) and the weighted sum of the cost functions φi0(w) (43). If Φ(wL(l)) = 0 (38), then the value Ψ(wL(l)) of the criterion function Ψ(w) (42) at the final vertex wL(l)(59) can be given as follows:

\Psi(w_{L(l)}) = \lambda \sum_i \gamma_i \, \varphi_i^{0}(w_{L(l)}) = \lambda \sum_i \gamma_i \, |w_{L(l),i}| \qquad (71)

where the above sums take into account only the indices i of the subset I(wL(l)) of the non-zero components wL(l),i of the final vertex wL(l) = [wL(l),1,…, wL(l),n]T(59):

I(w_{L(l)}) = \{ i : e_i^T w_{L(l)} \neq 0 \} = \{ i : w_{L(l),i} \neq 0 \} \qquad (72)

If the final vertex w_L(l) (59) is not degenerate (Def. 5), then the matrix B_{m(l)} (48) is built from all m feature vectors x_j (j ∈ {1, ..., m}) making up the data set C (1) and from n - m selected unit vectors e_i (i ∉ I(w_L(l)) (72)):

B_{m(l)} = [x_1, \ldots, x_m, e_{i(m+1)}, \ldots, e_{i(n)}]^T \qquad (73)

The problem of the constrained minimization of the regularized function Ψ(w) (42) at the vertices w_L(l) (59) satisfying the conditions Φ(w_L(l)) = 0 (69) can be formulated in the following way:

\min_l \{ \Psi(w_{L(l)}) : \Phi(w_{L(l)}) = 0 \} = \min_l \{ \sum_i \gamma_i \, |w_{L(l),i}| : \Phi(w_{L(l)}) = 0 \} \qquad (74)

According to the above formulation, the search for the minimum of the regularized criterion function Ψ(w) (42) takes place at all those vertices w_L(l) (59) where the collinearity function Φ(w) (38) is equal to zero. The regularized criterion function Ψ(w) (42) is defined as follows at the final vertices w_L(l) = [w_{L(l),1}, ..., w_{L(l),n}]^T (59), where Φ(w_L(l)) = 0:

(\forall w_{L(l)}) \;\; \Psi'(w_{L(l)}) = \sum_i \gamma_i \, |w_{L(l),i}| \qquad (75)

The optimal vertex w_L(l)^* constitutes the minimum value Ψ'(w_L(l)^*) of the CPL criterion function Ψ'(w) (75) defined on those final vertices w_L(l) (59) where Φ(w_L(l)) = 0 (38):

(\exists w_{L(l)}^{*}) \; (\forall w_{L(l)} : \Phi(w_{L(l)}) = 0) \;\; \Psi'(w_{L(l)}) \geq \Psi'(w_{L(l)}^{*}) > 0 \qquad (76)

As in the case of the minimization of the perceptron criterion function Φkp(v) (29), the optimal vector wL(l)* (76) may be located at a selected vertex of some convex polyhedron (27) in the parameter space Rn (wRn) [7].

If the cost parameters γ_i (42) have the standard values of one ((∀ i ∈ {1, ..., n}) γ_i = 1), then the constrained minimization problem (74) leads to the optimal vertex w_L(l)^* with the smallest L1 length ||w_L(l)^*||_L1 = |w_{L(l),1}^*| + ... + |w_{L(l),n}^*|, where Φ(w_L(l)^*) = 0 (38):

(\exists w_{L(l)}^{*}) \; (\forall w_{L(l)} : \Phi(w_{L(l)}) = 0) \;\; \| w_{L(l)} \|_{L1} \geq \| w_{L(l)}^{*} \|_{L1} \qquad (77)

The optimal vertex w_L(l)^* with the smallest L1 length ||w_L(l)^*||_L1 (77) is related to the largest L1 margin δ_L1(w_L(l)^*) (6) [11]:

\delta_{L1}(w_{L(l)}^{*}) = 2 / \| w_{L(l)}^{*} \|_{L1} = 2 / (|w_{L(l),1}^{*}| + \cdots + |w_{L(l),n}^{*}|) \qquad (78)

The basis exchange algorithm allows one to solve the constrained minimization problem (74) and to find the optimal vertex w_L(l)^* (77) with the largest L1 margin δ_L1(w_L(l)^*).
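For very small toy problems, the constrained problem (74)/(77) can even be solved by brute force, enumerating every admissible set of unit vectors and keeping the vertex with the smallest L1 length. The Python sketch below (with hypothetical data) is only meant to make the search space concrete; the basis exchange algorithm avoids this exhaustive enumeration.

```python
import itertools
import numpy as np

# Hypothetical small sample: m = 2 feature vectors in n = 4 dimensions (m < n).
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])
m, n = X.shape

best = None
for active in itertools.combinations(range(n), m):
    # Keep unit vectors e_i for the inactive coordinates; the vertex then solves
    # x_j^T w = 1 with w_i = 0 outside `active`, i.e. an m x m system, cf. (47).
    cols = list(active)
    sub = X[:, cols]
    if abs(np.linalg.det(sub)) < 1e-12:
        continue                     # collinearity condition (65), skip this basis
    w = np.zeros(n)
    w[cols] = np.linalg.solve(sub, np.ones(m))
    l1 = np.sum(np.abs(w))
    if best is None or l1 < best[0]:
        best = (l1, cols, w)

l1, cols, w = best
print("optimal vertex (77):", w, " L1 margin (78):", 2.0 / l1)
```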

Support Vector Machines (SVM) is the most popular method for designing linear classifiers or prognostic models with large margins [12]. According to the SVM approach, the optimal linear classifier or prognostic model is defined by such an optimal weight vector w^* that the margin δ_L2(w^*) based on the Euclidean (L2) norm is maximal:

\delta_{L2}(w^{*}) = 2 / \| w^{*} \|_{L2} = 2 / (w^{*T} w^{*})^{1/2} \qquad (79)

Maximization of the Euclidean margins δL2(w) (79) is performed using quadratic programming [2].


8. Complex layers of linear prognostic models

Complex layers of linear classifiers or prognostic models have been proposed as a scheme for obtaining general classification or forecasting rules designed on the basis of a small number of multidimensional feature vectors x_j [11]. According to this scheme, when designing linear prognostic models, averaging over a small number m of feature vectors x_j of dimension n (m << n) is replaced by averaging over collinear clusters of selected features (genes) X_i. Such an approach to averaging can be linked to ergodic theory [17].

In the case of a small sample of multivariate vectors, the number m of feature vectors xj in the data set C(1) may be much smaller than the dimension n of these vectors (m << n). In this case, the collinear cluster C(wk*[rk]) (52) may contain all feature vectors xj from the set C(1) and the vertex wk*[rk] may have the rank rk equal to m (rk = m).

As follows from Theorem 4, if the collinearity criterion function Φ(w) (38) is defined on linearly independent (Def. 2) feature vectors x_j, then the value Φ(w_m(l)) of this function at each final vertex w_m(l) (59) is equal to zero (Φ(w_m(l)) = 0). Each final vertex w_m(l) (59) can be reached in m steps k (k = 1, ..., m) starting from the vertex w_0 = [0, ..., 0]^T related to the identity matrix B_0 = I_n = [e_1, ..., e_n]^T.

Minimization of the collinearity criterion function Φ(w) (38), followed by minimization of the criterion function Ψ'(w_L(l)) (75) at the final vertices w_L(l) (59), makes it possible to determine the optimal vertex w_L(l)^* (77) with the largest L1 margin δ_L1(w_L(l)^*) (78). If the feature vectors x_j (1) are linearly independent, then the optimal vertex w_L(l)^* (77) is related to the optimal basis B_L(l)^* = [x_1, ..., x_m, e_{i(m+1)}, ..., e_{i(n)}]^T, which contains all m feature vectors x_j (1) and n - m unit vectors e_i whose indices i correspond to the zero components of w_L(l)^* (cf. (53) and (72)).

The optimal basis Bm* = [x1,..., xm, ei(m + 1),..., ei(n)]T(73) is found in two stages. In the first stage, m feature vectors xj(1) are introduced into matrices Bk = [x1,..., xk, ei(k + 1),..., ei(n)]T (k = 0, 1,…, m - 1). The inverse matrices Bk−1(49) are computed in accordance with the vector Gauss-Jordan transformation (64). In the second stage, the unit vectors ei(l) in the matrices Bm(l)(73) are exchanged to minimize the CPL function Ψ′(wm(l)) (75) at the final vertices wm(l)(77). The optimal basis Bm* defines (47) the optimal vertex wm(l)* (77), which is characterized by the largest margin δL1(wm(l)*) (78).

The vertexical feature subspace F1*[m] (F1*[m] ⊂ F[n] (1)) can be obtained on the basis of the optimal vertex wm(l)* (77) with the largest margin δL1(wm(l)*) (78). The vertexical subspace F1*[m] contains the reduced vectors x1,j[m] with the dimension m [7]:

(\forall j \in \{1, \ldots, m\}) \;\; x_{1,j}[m] \in F_1^{*}[m] \qquad (80)

The reduced vectors x_{1,j}[m] (80) are obtained from the feature vectors x_j = [x_{j,1}, ..., x_{j,n}]^T (x_j ∈ F[n]) by ignoring those components x_{j,i} which are related to the unit vectors e_i in the optimal basis B_1^* = [x_1, ..., x_m, e_{i(m+1)}, ..., e_{i(n)}]^T (73). The reduced vectors x_{1,j}[m] are represented by those m features X_i (X_i ∈ R_1^* (81)) which are not linked to the unit vectors e_i (i ∈ I_{m(l)}^*) in the basis B_{m(l)}^* (73) representing the optimal vertex w_{m(l)}^* (77):

R_1^{*} = \{ X_{i(1)}, \ldots, X_{i(m)} : i(l) \in I(w_{m(l)}^{*}) \, (72) \} \qquad (81)

The m features X_{i(l)} belonging to the optimal subset R_1^* (X_{i(l)} ∈ R_1^* (81)) are related to the weights w_{k,l}^* (w_k^*[m] = [w_{k,1}^*, ..., w_{k,m}^*]^T) that are not equal to zero (w_{k,l}^* ≠ 0).

The optimal feature subset R_1^* (81) consists of m collinear features X_i. The optimal vertex w_1^*[m] (Φ(w_1^*[m]) = 0 (69)) in the reduced parameter space R^m (w_1^*[m] ∈ R^m) is based on these m features X_i. The reduced optimal vertex w_1^*[m] with the largest margin δ_L1(w_1^*[m]) (78) is the unique solution of the constrained optimization problem (74). Maximizing the L1 margin δ_L1(w_l^*) (78) leads to the first reduced vertex w_1^*[m] = [w_{k,1}^*, ..., w_{k,m}^*]^T with non-zero components w_{k,i}^* (w_{k,i}^* ≠ 0).

The collinear interaction model between m collinear features Xi(l) from the optimal subset R1*(m) (81) can be formulated as follows (57):

w_{k,1}^{*} X_{i(1)} + \cdots + w_{k,m}^{*} X_{i(m)} = 1 \qquad (82)

The prognostic models for each feature Xi′ from the subset R1* (81) may have the following form (58):

(\forall i' \in \{1, \ldots, m\}) \;\; X_{i'} = \alpha_{i',0} + \alpha_{i',1} X_{i(1)} + \cdots + \alpha_{i',m} X_{i(m)} \qquad (83)

where α_{i',0} = 1 / w_{k,i'}^*, α_{i',i'} = 0, and (∀ i(l) ≠ i') α_{i',i(l)} = -w_{k,i(l)}^* / w_{k,i'}^*.

In the case of a data set C with a small number m (m << n) of multidimensional feature vectors x_j (1), the prognostic models (83) for individual features X_{i'} can be weak. It is known that sets (ensembles) of weak models can have strong generalizing properties [4]. A set of weak prognostic models (83) for a selected feature (dependent variable) X_{i'} can be implemented in a complex layer of L prognostic models (83) [11].

The complex layer can be built on the basis of the sequence of L optimal vertices w_l^* (77) related to the m features X_i constituting the subsets R_l^* (81), where l = 1, ..., L:

(w_1^{*}, R_1^{*}), \; (w_2^{*}, R_2^{*}), \; \ldots, \; (w_L^{*}, R_L^{*}) \qquad (84)

Design assumption: Each subset R_l^* (81) in the sequence (84) contains an a priori selected feature (dependent variable) X_{i'} and m - 1 other features (independent variables) X_{i(l)}. The other features X_{i(l)} (X_{i(l)} ∈ R_l^*) should be different in successive subsets R_l^* (l = 1, ..., L).

The first optimal vertex w_1^* (77) in the sequence (84) is designed on the basis of m feature vectors x_j (1), which are represented by all n features X_i constituting the feature set F(n) = {X_1, ..., X_n}. The vertex w_1^* (77) is found by solving the constrained optimization problem (74) according to the two-stage procedure outlined earlier. This procedure makes it possible to find the optimal vertex w_1^* (77) with the largest L1 margin δ_L1(w_1^*) (78).

The second optimal vertex w_2^* (77) in the sequence (84) is obtained on the basis of m reduced feature vectors x_j[n - (m - 1)] (67), which are represented by n - (m - 1) features X_i constituting the reduced feature subset F_2(n - (m - 1)):

F_2(n - (m - 1)) = F(n) \setminus \{ R_1^{*} \setminus X_{i'} \} \qquad (85)

The l-th optimal vertex wl* (77) in the sequence (84) is designed on the basis of m reduced vectors xj[n - l(m - 1)] (67), which are represented by n - l(m - 1) features Xi constituting the feature subset Fl(n - l(m - 1)):

F_l(n - l(m - 1)) = F_{l-1}(n - (l - 1)(m - 1)) \setminus \{ R_{l-1}^{*} \setminus X_{i'} \} \qquad (86)

The sequence (84) of L optimal vertices w_l^* (77) related to the subsets F_l(n - l(m - 1)) (86) of features is characterized by decreasing L1 margins δ_L1(w_l^*) (78) [18]:

\delta_{L1}(w_1^{*}) \geq \delta_{L1}(w_2^{*}) \geq \cdots \geq \delta_{L1}(w_L^{*}) \qquad (87)

The prognostic models (83) for the dependent feature (variable) X_{i'} are designed for each subset F_l(n - l(m - 1)) (86) of features X_i, where l = 1, ..., L (84):

(\forall l \in \{1, \ldots, L\})
X_{i'}(l) = \alpha_{i',0}(l) + \alpha_{i',1}(l) X_{i(1)}(l) + \cdots + \alpha_{i',m}(l) X_{i(m)}(l) \qquad (88)

The final forecast X_{i'} for the dependent feature (variable) X_{i'} based on the complex layer of L prognostic models (88) can have the following form:

X_{i'} = (X_{i'}(1) + \cdots + X_{i'}(L)) \, / \, L \qquad (89)

In accordance with Eq. (89), the final forecast X_{i'} for the feature X_{i'} results from averaging the forecasts of the L individual models X_{i'}(l) (88).
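A minimal sketch of the averaging rule (89) is given below; the models and feature indices are hypothetical placeholders for the L prognostic models (88) designed on their respective feature subsets.

```python
import numpy as np

def layer_forecast(models, x):
    """Average the forecasts of L parallel prognostic models (88), cf. (89).
    Each model is a triple (alpha_0, alphas, idx) acting on its own feature subset."""
    forecasts = [alpha_0 + alphas @ x[idx] for alpha_0, alphas, idx in models]
    return float(np.mean(forecasts))

# Hypothetical layer of L = 3 models, each using a different subset of features.
models = [
    (0.5, np.array([1.0, -0.5]), [1, 2]),
    (1.0, np.array([0.2, 0.3]), [3, 4]),
    (-0.2, np.array([0.7, 0.1]), [0, 5]),
]
x = np.array([1.0, 2.0, 0.5, 1.5, 2.0, 0.4])
print("final forecast of X_i':", layer_forecast(models, x))
```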


9. Concluding remarks

The article considers computational schemes for designing classifiers or prognostic models on the basis of a data set C (1) which consists of a small number m of high-dimensional feature vectors x_j (m << n).

The concept of a complex layer composed of many linear prognostic models (88) built in low-dimensional feature subspaces is discussed in more detail. These models (88) are built by using a small number m of collinear features Xi belonging to the optimal feature clusters Rl* (81). The optimal feature clusters Rl* (81) are formed by the search for the largest margins δL1(wl*) (78) in the L1 norm.

The averaged prognostic models X_{i'} (89) are based on the layer of L parallel models X_{i'}(l) (88). In line with ergodic theory, averaging over a small number m of feature vectors x_j has been replaced with averaging over L collinear clusters R_l^* (81) of features X_i. Such an averaging scheme should allow for a more stable extraction of general patterns from small samples of high-dimensional feature vectors x_j (1) [11].

References

  1. Duda R. O., Hart P. E., Stork D. G.: Pattern Classification. J. Wiley, New York, 2001
  2. Hand D., Smyth P., Mannila H.: Principles of Data Mining. MIT Press, Cambridge, 2001
  3. Bishop C. M.: Pattern Recognition and Machine Learning. Springer Verlag, 2006
  4. Kuncheva L.: Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition. J. Wiley, New Jersey, 2014
  5. Simonnard M.: Linear Programming. Prentice-Hall, Englewood Cliffs, New York, 1966
  6. Bobrowski L.: Data Mining Based on Convex and Piecewise Linear (CPL) Criterion Functions (in Polish). Białystok University of Technology, 2005
  7. Bobrowski L.: Data Exploration and Linear Separability, pp. 1-172. Lambert Academic Publishing, 2019
  8. Bobrowski L.: "Design of piecewise linear classifiers from formal neurons by some basis exchange technique". Pattern Recognition, 24(9), pp. 863-870, 1991
  9. Bobrowski L., Zabielski P.: "Models of Multiple Interactions from Collinear Patterns", pp. 153-165 in: Bioinformatics and Biomedical Engineering (IWBBIO 2018), Eds.: I. Rojas, F. Guzman, LNCS 10208, Springer Verlag, 2018
  10. Bobrowski L.: "Small Samples of Multidimensional Feature Vectors" (ICCCI 2020), pp. 87-98 in: Advances in Computational Collective Intelligence, Eds.: Hernes M. et al., Springer, 2020
  11. Bobrowski L.: "Complexes of Low Dimensional Linear Classifiers with L1 Margins", pp. 29-40 in: ACIIDS 2021, Springer Verlag, 2021
  12. Boser B. E., Guyon I., Vapnik V. N.: "A training algorithm for optimal margin classifiers". In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, ACM, Pittsburgh, 1992
  13. Bobrowski L., Łukaszuk T.: "Repeatable functionalities in complex layers of formal neurons". In: EANN 2021, Engineering Applications of Neural Networks, Springer, 2021
  14. Rosenblatt F.: Principles of Neurodynamics. Spartan Books, Washington, 1962
  15. Bobrowski L., Łukaszuk T.: "Relaxed Linear Separability (RLS) Approach to Feature (Gene) Subset Selection", pp. 103-118 in: Selected Works in Bioinformatics, Ed.: Xuhua Xia, INTECH, 2011
  16. Bobrowski L.: "Large Matrices Inversion Using the Basis Exchange Algorithm". British Journal of Mathematics & Computer Science, 21(1), pp. 1-11, 2017
  17. Petersen K.: Ergodic Theory (Cambridge Studies in Advanced Mathematics). Cambridge University Press, 1990
  18. Bobrowski L., Zabielski P.: "Feature (gene) clustering with collinearity models". ICCCI 2021 (to appear), Springer Verlag, 2021
