Computing on Vertices in Data Mining

Leon Bobrowski

doi:10.5772/intechopen.99315

Abstract

The main challenges in data mining are related to large, multi-dimensional data sets. There is a need to develop algorithms that are precise and efficient enough to deal with big data problems. The Simplex algorithm from linear programming can be seen as an example of a successful big data problem solving tool. According to the fundamental theorem of linear programming, the solution of the optimization problem can be found in one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors.

Keywords

data mining
basis exchange algorithms
small samples of multivariate vectors
gene clustering
prognostic models

Author Information

Leon Bobrowski*
- Faculty of Computer Science, Białystok University of Technology, Poland
- Institute of Biocybernetics and Biomedical Engineering, PAS, Poland

*Address all correspondence to: l.bobrowski@pb.edu.pl

1. Introduction

Various data mining tools are proposed to extract patterns from data sets [1]. Large, multidimensional data sets impose high requirements on the precision and efficiency of the calculations used to extract patterns (regularities) useful in practice [2]. In this context, there is still a need to develop new data mining algorithms [3]. New types of patterns are also obtained as a result of combining different types of classification or prognosis models [4].

The Simplex algorithm from linear programming is used as an effective big data mining tool [5]. According to the basic theorem of linear programming, the solution to the linear optimization problem with linear constraints can be found at one of the vertices in the parameter space. Narrowing the search area to a finite number of vertices is a source of the efficiency of the Simplex algorithm.

Basis exchange algorithms also look for an optimal solution among a finite number of vertices in the parameter space [6]. The basis exchange algorithms are based on the Gauss - Jordan transformation and, for this reason, are similar to the Simplex algorithm. Controlling the basis exchange algorithm is related to the minimization of convex and piecewise linear (CPL) criterion functions [7].

The perceptron and collinearity criterion functions belong to the family of CPL functions. The minimization of the perceptron criterion function makes it possible to check the linear separability of data sets and to design piecewise linear classifiers [8]. Minimizing the collinearity criterion function makes it possible to detect collinear (flat) patterns in data sets and to design multiple interaction models [9].

Data sets consisting of a small number of multivariate feature vectors generate specific problems in data mining [10]. This type of data includes genetic data sets. Minimizing the perceptron criterion function or the collinearity function enables solving problems related to discrimination or regression also in the case of a small set of multidimensional feature vectors by using complex layers of low dimensional linear classifiers or prognostic models [11].

2. Linear separability vs. linear dependence

Let us assume that each of m objects O_j from a given database is represented by the n-dimensional feature vector x_j = [x_j,1,...,x_j,n]^T belonging to the feature space F[n] (x_j ∈ F[n]). The data set C consists of m such feature vectors x_j:

$$C = \{x_j\}, \quad \text{where } j = 1, \ldots, m \tag{1}$$

The components x_j,i of the feature vector x_j are numerical values (x_j,i ∈ R or x_j,i ∈ {0, 1}) of the individual features X_i of the j-th object O_j. In this context, each feature vector x_j (x_j ∈ F[n]) represents n features X_i belonging to the feature set F(n) = {X₁,…, X_n}.

The pairs {G_k⁺, G_k⁻} (k = 1, …, K) of the learning sets G_k⁺ and G_k⁻ (G_k⁺ ∩ G_k⁻ = ∅) are formed from some feature vectors x_j selected from the data set C (1):

$$G_k^+ = \{x_j : j \in J_k^+\}, \quad \text{and} \quad G_k^- = \{x_j : j \in J_k^-\} \tag{2}$$

where J_k⁺ and J_k⁻ are non-empty sets of indices j of vectors x_j (J_k⁺ ∩ J_k⁻ = ∅).

The positive learning set G_k⁺ is composed of m_k⁺ feature vectors x_j (j ∈ J_k⁺). Similarly, the negative learning set G_k⁻ is composed of m_k⁻ feature vectors x_j (j ∈ J_k⁻), where m_k⁺ + m_k⁻ ≤ m.

The possibility of separating the learning sets G_k⁺ and G_k⁻ (2) by a hyperplane H(w_k, θ_k) in the feature space F[n] is investigated in pattern recognition [1]:

$$H(w_k, \theta_k) = \{x : w_k^T x = \theta_k\} \tag{3}$$

where w_k = [w_k,1,..., w_k,n]^T ∈ Rⁿ is the weight vector, θ_k ∈ R¹ is the threshold, and w_k^Tx = Σ_i w_k,i x_i is the scalar product.

Definition 1: The learning sets G_k⁺ and G_k⁻ (2) are linearly separable in the feature space F[n] if and only if there exist a weight vector w_k (w_k ∈ Rⁿ) and a threshold θ_k (θ_k ∈ R¹) such that the hyperplane H(w_k, θ_k) (3) separates these sets [7]:

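$$(\exists w_k, \theta_k)\quad (\forall x_j \in G_k^+)\ w_k^T x_j \ge \theta_k + 1 \quad \text{and} \quad (\forall x_j \in G_k^-)\ w_k^T x_j \le \theta_k - 1 \tag{4}$$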

According to the above inequalities, all vectors x_j from the learning set G_k⁺(2) are located on the positive side of the hyperplane H(w_k, θ_k) (3), and all vectors x_j from the set G_k⁻ lie on the negative side of this hyperplane.

The hyperplane H(w_k, θ_k) (3) separates (4) the sets G_k⁺ and G_k⁻ (2) with the following margin δ_L2(w_k) based on the Euclidean (L₂) norm, which is used in the Support Vector Machines (SVM) method [12]:

$$\delta_{L2}(w_k) = 2 / \|w_k\|_{L2} = 2 / (w_k^T w_k)^{1/2} \tag{5}$$

where || w_k ||_L2 = (w_k^Tw_k)^1/2 is the Euclidean length of the weight vector w_k.

The margin δ_L1(w_k) with the L₁ norm related to the hyperplane H(w_k, θ_k) (3), which separates (4) the learning sets G_k⁺ and G_k⁻ (2), was determined by analogy to (5) as [11]:

$$\delta_{L1}(w_k) = 2 / \|w_k\|_{L1} = 2 / (|w_{k,1}| + \ldots + |w_{k,n}|) \tag{6}$$

where || w_k ||_L1 = |w_k,1| + ... +| w_k,n| is the L₁ length of the weight vector w_k.

The margins δ _L2(w_k) (5) or δ _L1(w_k) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets G_k⁺ and G_k⁻(2) [7].
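As an illustration of the separability inequalities (4) and of the margins (5) and (6), the following minimal Python sketch checks whether a given pair (w, θ) separates two small learning sets and evaluates both margins. The function names and the toy data are assumptions made only for this example; they do not come from the original text.

```python
import math


def dot(u, v):
    """Scalar product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def separates(w, theta, g_pos, g_neg):
    """Check the linear separability inequalities (4) for the hyperplane H(w, theta)."""
    return (all(dot(w, x) >= theta + 1 for x in g_pos)
            and all(dot(w, x) <= theta - 1 for x in g_neg))


def margin_l2(w):
    """Margin (5) based on the Euclidean (L2) norm of the weight vector."""
    return 2.0 / math.sqrt(dot(w, w))


def margin_l1(w):
    """Margin (6) based on the L1 norm of the weight vector."""
    return 2.0 / sum(abs(c) for c in w)


# Hypothetical toy learning sets in a two-dimensional feature space.
g_pos = [[2, 1], [3, 0]]
g_neg = [[0, 1], [-1, 2]]
w, theta = [1, 0], 1

is_separable = separates(w, theta, g_pos, g_neg)
```

A weight vector with a smaller norm gives a larger margin, which is why the norms in (5) and (6) are minimized.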

The following set of m_k′ = m_k⁺ + m_k⁻ linear equations can be formulated on the basis of the linear separability inequalities (4):

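$$(\forall j \in J_k^+)\quad x_j^T w_k = \theta_k + 1 \quad \text{and} \quad (\forall j \in J_k^-)\quad x_j^T w_k = \theta_k - 1 \tag{7}$$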

If we assume that the threshold θ_k can be determined later, then we have n unknown weights w_k,i (w_k = [w_k,1,..., w_k,n]^T) in an underdetermined system of m_k′ = m_k⁺ + m_k⁻ (m_k′ ≤ m_k < n) linear Eqs. (7). In order to obtain a system of n linear equations with n unknown weights w_k,i, additional linear equations based on n - m_k′ selected unit vectors e_i (i ∈ I_k) were taken into account [6]:

$$(\forall i \in I_k)\quad e_i^T w_k = 0 \tag{8}$$

The parameter vertex w_k = [w_k,1,..., w_k,n]^T can be determined by the linear Eqs. (7) and (8) if the feature vectors x_j forming the learning sets G_k⁺ and G_k⁻(2) are linearly independent [7].

The feature vector x_j′ (x_j′ ∈ G_k⁺ ∪ G_k⁻ (2)) is a linear combination of some other vectors x_j(i) (j(i) ≠ j′) from the learning sets (2) if there exist parameters α_j′,i (α_j′,i ≠ 0) such that the following relation holds:

$$x_{j'} = \alpha_{j',1}\, x_{j(1)} + \ldots + \alpha_{j',l}\, x_{j(l)} \tag{9}$$

Definition 2: Feature vectors x_j making up the learning sets G_k⁺ and G_k⁻ (2) are linearly independent if none of these vectors x_j′ (x_j′ ∈ G_k⁺ ∪ G_k⁻) can be expressed as a linear combination (9) of l (l ∈ {1,…, m - 1}) other vectors x_j(l) from the learning sets.

If the number m_k′ = m_k⁺ + m_k⁻ of elements x_j of the learning sets G_k⁺ and G_k⁻ (2) is not greater than the dimension n of the feature space F[n] (m_k⁺ + m_k⁻ ≤ n), then the parameter vertex w_k(θ_k) can be defined by the linear equations in the following matrix form [13]:

$$B_k\, w_k(\theta_k) = 1_k(\theta_k) \tag{10}$$

where

$$1_k(\theta_k) = [\theta_k + 1, \ldots, \theta_k + 1, \theta_k - 1, \ldots, \theta_k - 1, 0, \ldots, 0]^T \tag{11}$$

and

$$B_k = [x_1, \ldots, x_{m_k'}, e_{i(m_k'+1)}, \ldots, e_{i(n)}]^T \tag{12}$$

The first m_k⁺ components of the vector 1_k(θ_k) are equal to θ_k + 1, the next m_k⁻ components are equal to θ_k - 1, and the last n - m_k⁺ − m_k⁻ components are equal to 0. The first m_k⁺ rows of the square matrix B_k (12) are formed by the feature vectors x_j (j ∈ J_k⁺) from the set G_k⁺ (2), the next m_k⁻ rows are formed by vectors x_j (j ∈ J_k⁻) from the set G_k⁻ (2), and the last n - m_k⁺ − m_k⁻ rows are made up of unit vectors e_i (i ∈ I_k).

If the matrix B_k(12) is non-singular, then there exists the inverse matrix B_k⁻¹:

$$B_k^{-1} = [r_1, \ldots, r_{m_k'}, r_{i(m_k'+1)}, \ldots, r_{i(n)}] \tag{13}$$

In this case, the parameter vertex w_k(θ_k) (10) can be defined by the following equation:

$$w_k(\theta_k) = B_k^{-1}\, 1_k(\theta_k) = (\theta_k + 1)\, r_k^+ + (\theta_k - 1)\, r_k^- = \theta_k (r_k^+ + r_k^-) + (r_k^+ - r_k^-) \tag{14}$$

where the vector r_k⁺ is the sum of the first m_k⁺ columns r_i of the inverse matrix B_k⁻¹ (13), and the vector r_k⁻ is the sum of the successive m_k⁻ columns r_i of this matrix.

The last n - (m_k⁺ + m_k⁻) components w_k.i(θ_k) of the vector w_k(θ_k) = [w_k,1(θ_k),…, w_k,n(θ_k)]^T(14) linked to the zero components of the vector 1_k(θ_k) (11) are equal to zero:

$$(\forall i \in \{m_k^+ + m_k^- + 1, \ldots, n\})\quad w_{k,i}(\theta_k) = 0 \tag{15}$$

The conditions w_k.i(θ_k) = 0 (15) result from the equations e_i^Tw_k(θ_k) = 0 (8) at the vertex w_k(θ_k) (14).

The length ||w_k(θ_k)||_L1 of the weight vector w_k(θ_k) (14) in the L₁ norm is the sum of m_k′ = m_k⁺ + m_k⁻ components |w_k,i(θ_k)|:

$$\|w_k(\theta_k)\|_{L1} = |w_{k,1}(\theta_k)| + \ldots + |w_{k,m_k'}(\theta_k)| \tag{16}$$

In accordance with the Eq. (14), components |w_k,i(θ_k)| can be determined as follows:

$$(\forall i \in \{1, \ldots, m_k^+ + m_k^-\})\quad |w_{k,i}(\theta_k)| = |\theta_k (r_{k,i}^+ + r_{k,i}^-) + (r_{k,i}^+ - r_{k,i}^-)| \tag{17}$$

The length ||w_k(θ_k)||_L1(16) of the vector w_k(θ_k) (14) with the L₁ norm is minimized to increase the margin δ_L1(w_k(θ_k)) (6). The length ||w_k(θ_k)||_L1(16) can be minimized by selecting the optimal threshold value θ_k* on the basis of the Eq. (14).

$$(\forall \theta_k)\quad \delta_{L1}(w_k(\theta_k^*)) \ge \delta_{L1}(w_k(\theta_k)) \tag{18}$$

where the optimal vertex w_k(θ_k*) is defined by the Eq. (14).
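The construction in Eqs. (10) to (18) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the matrix is inverted by a plain Gauss-Jordan elimination written for this example (not the basis exchange algorithm of the text), the helper names are hypothetical, and the threshold search uses the fact that ||w_k(θ_k)||_L1 is convex and piecewise linear in θ_k, so its minimum lies at one of the breakpoints where a component of (17) vanishes.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a non-singular square matrix by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(matrix)
    # Augment [B | I] and reduce the left block to the identity matrix.
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vertex_coefficients(B, m_pos, m_neg):
    """Return (a, b) such that w(theta) = theta * a + b, following Eq. (14)."""
    inv = invert(B)
    n = len(B)
    r_pos = [sum(inv[i][c] for c in range(m_pos)) for i in range(n)]
    r_neg = [sum(inv[i][c] for c in range(m_pos, m_pos + m_neg)) for i in range(n)]
    a = [p + q for p, q in zip(r_pos, r_neg)]  # r_k+ + r_k-
    b = [p - q for p, q in zip(r_pos, r_neg)]  # r_k+ - r_k-
    return a, b


def vertex(a, b, theta):
    return [theta * ai + bi for ai, bi in zip(a, b)]


def best_threshold(a, b):
    """Threshold minimizing the L1 length (16); candidates are the breakpoints of (17)."""
    candidates = [-bi / ai for ai, bi in zip(a, b) if ai != 0] or [Fraction(0)]
    return min(candidates, key=lambda t: sum(abs(c) for c in vertex(a, b, t)))


# Hypothetical example: n = 3, one positive vector, one negative vector, unit vector e_3.
B = [[2, 1, 0],   # x_1 from G+
     [0, 1, 5],   # x_2 from G-
     [0, 0, 1]]   # e_3
a, b = vertex_coefficients(B, m_pos=1, m_neg=1)
theta_star = best_threshold(a, b)
w_star = vertex(a, b, theta_star)
```

In this toy case w(θ) = [1, θ − 1, 0], so the L₁ length 1 + |θ − 1| is smallest at θ = 1, which gives the margin δ_L1 = 2.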

Theorem 1: The learning sets G_k⁺ and G_k⁻(2) formed by m (m ≤ n) linearly independent (9) feature vectors x_j are linearly separable (4) in the feature space F[n] (x_j ∈ F[n]).

Proof: If the learning sets G_k⁺ and G_k⁻ (2) are formed by m linearly independent feature vectors x_j, then the non-singular matrix B_k = [x₁,…, x_m, e_i(m+1),…, e_i(n)]^T (12) containing these m vectors x_j and n - m unit vectors e_i (i ∈ I_k) can be defined [10]. In this case, the inverse matrix B_k⁻¹ (13) exists and can determine the vertex w_k(θ_k) (14). The vertex equation B_k w_k(θ_k) = 1_k(θ_k) (10) can be reformulated for the feature vectors x_j (2) as follows:

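$$(\forall x_j \in G_k^+)\quad w_k(\theta_k)^T x_j = \theta_k + 1 \quad \text{and} \quad (\forall x_j \in G_k^-)\quad w_k(\theta_k)^T x_j = \theta_k - 1 \tag{19}$$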

The solution of the Eqs. (19) satisfies the linear separability inequalities (4).

It is possible to enlarge the learning sets G_k⁺ and G_k⁻ (2) in a way that maintains their linear separability (4).

Lemma 1: Increasing the positive learning set G_k⁺(2) by such a new vector x_j′ (x_j′∉ G_k⁺), which is a linear combination with the parameters α_j′,i(9) of some feature vectors x_j(l)(2) from this set (x_j(l) ∈ G_k⁺) preserves the linear separability (4) of the learning sets if the parameters α_j′,i fulfill the following condition:

$$\alpha_{j',1} + \ldots + \alpha_{j',l} \ge 1 \tag{20}$$

If the assumptions of the lemma are met, then

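$$w_k^T x_{j'} = w_k^T (\alpha_{j',1}\, x_{j(1)} + \ldots + \alpha_{j',l}\, x_{j(l)}) = \alpha_{j',1}\, w_k^T x_{j(1)} + \ldots + \alpha_{j',l}\, w_k^T x_{j(l)} = \alpha_{j',1} (\theta_k + 1) + \ldots + \alpha_{j',l} (\theta_k + 1) \ge \theta_k + 1 \tag{21}$$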

The above inequality means that the linear separability conditions (4) still apply after the enlargement of the learning set G_k⁺ (2).

Lemma 2: Increasing the negative learning set G_k⁻(2) by such a new vector x_j′ (x_j′ ∉ G_k⁻), which is a linear combination with the parameters α_j′,i(9) of some feature vectors x_j(l)(2) from this set (x_j(l) ∈ G_k⁻) preserves the linear separability (4) of the learning sets if the parameters α_{j′, i} fulfill the following condition:

$$\alpha_{j',1} + \ldots + \alpha_{j',l} \le -1 \tag{22}$$

The justification of Lemma 2 may be based on an inequality similar to (21).

3. Perceptron criterion function

The minimization of the perceptron criterion function makes it possible to assess the degree of linear separability (4) of the learning sets G_k⁺ and G_k⁻ (2) in different feature subspaces F[n′] (F[n′] ⊂ F[n + 1]) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors y_j (y_j ∈ F[n + 1]) and augmented weight vectors v_k (v_k ∈ Rⁿ⁺¹) [1]:

$$(\forall j \in J_k^+\ (2))\quad y_j = [x_j^T, 1]^T, \qquad (\forall j \in J_k^-\ (2))\quad y_j = -[x_j^T, 1]^T \tag{23}$$

and

$$v_k = [w_k^T, -\theta_k]^T = [w_{k,1}, \ldots, w_{k,n}, -\theta_k]^T \tag{24}$$

The augmented vectors y_j are constructed (23) on the basis of the learning sets G_k⁺ and G_k⁻(2). These learning sets are extracted from the data set C(1) according to some additional knowledge. The linear separability (4) of the learning sets G_k⁺ and G_k⁻(2) can be reformulated using the following set of m inequalities with the augmented vectors y_j(23) [7]:

$$(\exists v_k)\ (\forall j \in J_k^+ \cup J_k^-\ (2))\quad v_k^T y_j \ge 1 \tag{25}$$

The dual hyperplanes h_j^p in the parameter space Rⁿ⁺¹ (v ∈ Rⁿ⁺¹) are defined on the basis of the augmented vectors y_j [6]:

$$(\forall j \in J_k^+ \cup J_k^-\ (2))\quad h_j^p = \{v : y_j^T v = 1\} \tag{26}$$

Dual hyperplanes h_j^p (26) divide the parameter space Rⁿ⁺¹ (v ∈ Rⁿ⁺¹) into a finite number L of disjoint regions (convex polyhedra) D_l^p (l = 1,…, L) [7]:

$$D_l^p = \{v : (\forall j \in J_l^+)\ y_j^T v \ge 1 \ \text{and}\ (\forall j \in J_l^-)\ y_j^T v < 1\} \tag{27}$$

where J_l⁺ and J_l⁻ are disjoint subsets (J_l⁺ ∩ J_l⁻ = ∅) of indices j of feature vectors x_j making up the learning sets G_k⁺ and G_k⁻ (2).

The perceptron penalty functions φ_j^p(v) are defined as follows for each of the augmented feature vectors y_j (23) [6]:

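$$(\forall j \in J_k)\quad \varphi_j^p(v) = \begin{cases} 1 - y_j^T v & \text{if } y_j^T v < 1 \\ 0 & \text{if } y_j^T v \ge 1 \end{cases} \tag{28}$$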

The j-th penalty function φ_j^p(v) (28) is greater than zero if and only if the weight vector v is located on the wrong side (y_j^Tv < 1) of the j-th dual hyperplane h_j^p (26). The function φ_j^p(v) (28) is linear and greater than zero as long as the parameter vector v = [v_k,1,..., v_k,n+1]^T remains on the wrong side of the hyperplane h_j^p (26). Convex and piecewise-linear (CPL) penalty functions φ_j^p(v) (28) are used to enforce the linear separation (4) of the learning sets G_k⁺ and G_k⁻ (2).

The perceptron criterion function Φ_k^p(v) is defined as the weighted sum of the penalty functions φ_j^p(v) (28) [6]:

$$\Phi_k^p(v) = \sum_j \alpha_j\, \varphi_j^p(v) \tag{29}$$

Positive parameters α_j (α_j > 0) can be treated as prices of individual feature vectors x_j:

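$$(\forall j \in J_k^+\ (2))\quad \alpha_j = 1 / (2 m_k^+), \quad \text{and} \quad (\forall j \in J_k^-\ (2))\quad \alpha_j = 1 / (2 m_k^-) \tag{30}$$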

where m_k⁺ (m_k⁻) is the number of elements x_j in the learning set G_k⁺ (G_k⁻) (2).

The perceptron criterion function Φ_k^p(v) (29) was built on the basis of the error correction algorithm, the basic algorithm in the Perceptron model of learning processes in neural networks [14].

The criterion function Φ_k^p(v) (29) is convex and piecewise-linear (CPL) [6]. This means, among other things, that the function Φ_k^p(v) (29) remains linear within each region D_l^p (27):

$$(\forall l \in \{1, \ldots, L\})\ (\forall v \in D_l^p)\quad \Phi_k^p(v) = \sum_j \alpha_j (1 - y_j^T v) \tag{31}$$

where the summation is performed on all vectors y_j(23) fulfilling the condition y_j^Tv < 1.

The optimal vector v_k* determines the minimum value Φ_k^p(v_k*) of the criterion function Φ_k^p(v) (29):

$$(\exists v_k^*)\ (\forall v \in R^{n+1})\quad \Phi_k^p(v) \ge \Phi_k^p(v_k^*) \ge 0 \tag{32}$$

Since the criterion function Φ_k^p(v) (29) is linear in each convex polyhedron D_l^p (27), the optimal point v_k* representing the minimum Φ_k^p(v_k*) (32) can be located at a vertex of some polyhedron D_l′^p (27). This property of the optimal vector v_k* (32) follows from the fundamental theorem of linear programming [5].

It has been shown that the minimum value Φ_k^p(v_k*) (32) of the perceptron criterion function Φ_k^p(v) (29) with the parameters α_j(30) is normalized as follows [6]:

$$0 \le \Phi_k^p(v_k^*) \le 1 \tag{33}$$

The following theorem has been proved [6]:

Theorem 2: The minimum value Φ_k^p(v_k*) (32) of the perceptron criterion function Φ_k^p(v) (29) is equal to zero (Φ_k^p(v_k*) = 0) if and only if the learning sets G_k⁺ and G_k⁻(2) are linearly separable (4).

The minimum value Φ_k^p(v_k*) (32) is close to one (Φ_k^p(v_k*) ≈ 1) if the sets G_k⁺ and G_k⁻ (2) overlap almost completely. It can also be proved that the minimum value Φ_k^p(v_k*) (32) of the perceptron criterion function Φ_k^p(v) (29) does not depend on invertible linear transformations of the feature vectors y_j (23) [6]. The perceptron criterion function Φ_k^p(v) (29) remains linear inside each region D_l^p (27).
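The definitions (23), (28), (29) and (30) translate directly into code. The following Python sketch only evaluates the perceptron criterion function at a given point v; it does not minimize it. The function names and the toy learning sets are assumptions introduced for this illustration.

```python
def dot(u, v):
    """Scalar product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def augment(g_pos, g_neg):
    """Build the augmented vectors y_j (23) and the prices alpha_j (30)."""
    ys = [list(x) + [1] for x in g_pos]
    ys += [[-c for c in list(x) + [1]] for x in g_neg]
    alphas = ([1 / (2 * len(g_pos))] * len(g_pos)
              + [1 / (2 * len(g_neg))] * len(g_neg))
    return ys, alphas


def perceptron_criterion(v, ys, alphas):
    """Weighted sum (29) of the penalties (28): 1 - y^T v if y^T v < 1, else 0."""
    return sum(a * max(0, 1 - dot(y, v)) for y, a in zip(ys, alphas))


# Hypothetical toy learning sets; v = [w_1, w_2, -theta] as in (24).
g_pos = [[2, 1], [3, 0]]
g_neg = [[0, 1], [-1, 2]]
ys, alphas = augment(g_pos, g_neg)

value_at_separating_v = perceptron_criterion([1, 0, -1], ys, alphas)
value_at_origin = perceptron_criterion([0, 0, 0], ys, alphas)
```

With the prices (30), the value at v = 0 is always one, which is the upper bound in (33), while a vector v that satisfies (25) gives the value zero, in line with Theorem 2.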

The regularized criterion function Ψ_k^p(v) is defined as the sum of the perceptron criterion function Φ_k^p(v) (29) and some additional penalty functions [13]. These additional CPL functions are equal to the costs γ_i (γ_i > 0) of individual features X_i multiplied by the absolute values |w_i| of the weights w_i, where v = [w^T, −θ]^T = [w₁,..., w_n, −θ]^T ∈ Rⁿ⁺¹ (24):

$$\Psi_k^p(v) = \Phi_k^p(v) + \lambda \sum_i \gamma_i |w_i| \tag{34}$$

where λ (λ ≥ 0) is the cost level. The standard values of the cost parameters γ_i are equal to one (∀i ∈ {1, ..., n} γ_i = 1).

The optimal vector v_k,λ* constitutes the minimum value Ψ_k^p(v_k,λ*) of the CPL criterion function Ψ_k^p(v) (34), which is defined on elements x_j of the learning sets G_k⁺ and G_k⁻(2):

$$(\exists v_{k,\lambda}^*)\ (\forall v \in R^{n+1})\quad \Psi_k^p(v) \ge \Psi_k^p(v_{k,\lambda}^*) > 0 \tag{35}$$

As in the case of the perceptron criterion function Φ_k^p(v) (29), the optimal vector v_k,λ* (35) can be located at a vertex of some polyhedron D_l′^p (27). The minimum value Ψ_k^p(v_k,λ*) (35) of the criterion function Ψ_k^p(v) (34) is used, among other applications, in the relaxed linear separability (RLS) method of gene subset selection [15].

4. Collinearity criterion function

Minimizing the collinearity criterion function is used to extract collinear patterns from large, multidimensional data sets C(1) [7]. Linear models of multivariate interactions can be formulated on the basis of representative collinear patterns [9].

The collinearity penalty functions φ_j(w) are determined by individual feature vectors x_j = [x_j,1,...,x_j,n]^T in the following manner [9]:

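$$(\forall x_j \in C\ (1))\quad \varphi_j(w) = |1 - x_j^T w| = \begin{cases} 1 - x_j^T w & \text{if } x_j^T w \le 1 \\ x_j^T w - 1 & \text{if } x_j^T w > 1 \end{cases} \tag{36}$$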

The penalty functions φ_j(w) (36) can be related to the following dual hyperplanes h_j¹ in the parameter (weight) space Rⁿ (w ∈ Rⁿ):

$$(\forall j = 1, \ldots, m)\quad h_j^1 = \{w : x_j^T w = 1\} \tag{37}$$

The CPL penalty φ_j(w) (36) is equal to zero (φ_j(w) = 0) at the point w = [w₁,..., w_n]^T if and only if the point w is located on the dual hyperplane h_j¹ (37).

The collinearity criterion function Φ_k(w) is defined as the weighted sum of the penalty functions φ_j(w) (36) determined by feature vectors x_j forming the data subset C_k (C_k ⊂ C(1)):

$$\Phi_k(w) = \sum_j \beta_j\, \varphi_j(w) \tag{38}$$

where the sum takes into account only the indices j of the set J_k = {j: x_j ∈ C_k}, and the positive parameters β_j (β_j > 0) in the function Φ_k(w) (38) can be treated as the prices of particular feature vectors x_j. The standard choice of the parameter values β_j is one ((∀j ∈ J_k) β_j = 1.0).

The collinearity criterion function Φ_k(w) (38) is convex and piecewise-linear (CPL) as the sum of this type of penalty functions φ_j(w) (36) [9]. The vector w_k* determines the minimum value Φ_k(w_k*) of the criterion function Φ_k(w) (38):

$$(\exists w_k^*)\ (\forall w)\quad \Phi_k(w) \ge \Phi_k(w_k^*) \ge 0 \tag{39}$$

Definition 3: The data subset C_k (C_k ⊂ C(1)) is collinear when all feature vectors x_j from this subset are located on some hyperplane H(w, θ) = {x: w^Tx = θ} with θ ≠ 0.

Theorem 3: The minimum value Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38) defined on the feature vectors x_j constituting a data subset C_k (C_k ⊂ C (1)) is equal to zero (Φ_k(w_k*) = 0) when this subset C_k is collinear (Def. 3) [9].

Different collinear subsets C_k can be extracted from the data set C (1) with a large number m of elements x_j by minimizing the collinearity criterion function Φ_k(w) (38) [9].

The minimum value Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38) can be reduced to zero by omitting some feature vectors x_j from the data subset C_k (C_k ⊂ C (1)). If the minimum value Φ_k(w_k*) (39) is greater than zero (Φ_k(w_k*) > 0), then we can select feature vectors x_j (j ∈ J_k(w_k*)) with the penalty φ_j(w_k*) (36) greater than zero:

$$(\forall j \in J_k(w_k^*))\quad \varphi_j(w_k^*) = |1 - x_j^T w_k^*| > 0 \tag{40}$$

Omitting one feature vector x_j′ (j′ ∈ J_k(w_k*)) with the above property results in the following reduction of the minimum value Φ_k(w_k*) (39):

$$\Phi_{k'}(w_{k'}^*) \le \Phi_k(w_k^*) - \varphi_{j'}(w_k^*) \tag{41}$$

where Φ_k′(w_k′*) is the minimum value (39) of the collinearity criterion function Φ_k′(w) (38) defined on feature vectors x_j constituting the data subset C_k reduced by the vector x_j′.
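The penalty (36), the criterion function (38) and the selection rule (40) can be illustrated with a short Python sketch. The weight vector w below is fixed by hand for a toy data subset and is not claimed to be the minimizer w_k* of the full subset; the function names and the data are assumptions made for this example.

```python
def dot(u, v):
    """Scalar product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def collinearity_penalty(w, x):
    """Penalty (36): distance of w^T x from one."""
    return abs(1 - dot(x, w))


def collinearity_criterion(w, vectors, betas=None):
    """Weighted sum (38) of the penalties (36); the default prices beta_j are one."""
    betas = betas or [1.0] * len(vectors)
    return sum(b * collinearity_penalty(w, x) for b, x in zip(betas, vectors))


# Hypothetical data subset: three vectors on the line x_1 + x_2 = 2 and one off it.
subset = [[1, 1], [2, 0], [0, 2], [3, 3]]
w = [0.5, 0.5]  # defines the hyperplane {x : w^T x = 1}, i.e. x_1 + x_2 = 2

value = collinearity_criterion(w, subset)
# Vectors with a positive penalty at w, as in (40).
off_plane = [x for x in subset if collinearity_penalty(w, x) > 0]
reduced = [x for x in subset if x not in off_plane]
reduced_value = collinearity_criterion(w, reduced)
```

Dropping the single vector with a positive penalty leaves a subset on which the criterion function is zero at w, so the reduced subset is collinear in the sense of Definition 3.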

The regularized criterion function Ψ_k(w) is defined as the sum of the collinearity criterion function Φ_k(w) (38) and some additional CPL penalty functions φ_i⁰(w) [7]:

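$$\Psi_k(w) = \Phi_k(w) + \lambda \sum_i \gamma_i |w_i| = \sum_j \beta_j\, \varphi_j(w) + \lambda \sum_i \gamma_i\, \varphi_i^0(w) \tag{42}$$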

where λ ≥ 0 is the cost level. The standard values of the cost parameters γ_i are equal to one ((∀i ∈ {1,…,n}) γ_i = 1). The additional CPL penalty functions φ_i⁰(w) are defined below [7]:

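$$(\forall i = 1, \ldots, n)\quad \varphi_i^0(w) = |e_i^T w| = \begin{cases} -w_i & \text{if } w_i \le 0 \\ w_i & \text{if } w_i > 0 \end{cases} \tag{43}$$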

The functions φ_i⁰(w) (43) are related to the following dual hyperplanes h_i⁰ in the parameter (weight) space Rⁿ (w ∈ Rⁿ):

$$(\forall i = 1, \ldots, n)\quad h_i^0 = \{w : e_i^T w = 0\} = \{w : w_i = 0\} \tag{44}$$

The CPL penalty function φ_i⁰(w) (43) is equal to zero (φ_i⁰(w) = 0) at the point w = [w₁,..., w_n]^T if and only if this point is located on the dual hyperplane h_i⁰ (44).
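A small numerical sketch may help to show what the regularization term adds. The Python function below evaluates the regularized criterion as the collinearity criterion plus λ times the weighted sum of |w_i|, which is how the additional penalties |e_i^T w| enter the sum. The function name, the default prices and costs, and the toy data are assumptions made for this illustration.

```python
def dot(u, v):
    """Scalar product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def regularized_criterion(w, vectors, lam, betas=None, gammas=None):
    """Collinearity criterion plus lam * sum_i gamma_i * |w_i|."""
    betas = betas or [1.0] * len(vectors)
    gammas = gammas or [1.0] * len(w)
    phi = sum(b * abs(1 - dot(x, w)) for b, x in zip(betas, vectors))
    cost = sum(g * abs(wi) for g, wi in zip(gammas, w))
    return phi + lam * cost


# Hypothetical case with one feature vector in a two-dimensional feature space.
# Both candidate points lie on the dual hyperplane {w : x^T w = 1}, so the
# collinearity part is zero and only the cost term distinguishes them.
x = [2, 4]
psi_a = regularized_criterion([0.5, 0.0], [x], lam=1.0)
psi_b = regularized_criterion([0.0, 0.25], [x], lam=1.0)
```

Here both points make the collinearity part vanish, but the second one has the smaller L₁ length and therefore the smaller regularized value.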

5. Parameter vertices

The perceptron criterion function Φ_k^p(v) (29) and the collinearity criterion function Φ_k(w) (38) are convex and piecewise-linear (CPL). The minimum values of such CPL criterion functions can be located at parameter vertices of some convex polyhedra. We consider the parameter vertices w_k (w_k ∈ Rⁿ) related to the collinearity criterion function Φ_k(w) (38).

Definition 4: The parameter vertex w_k of the rank r_k (r_k ≤ n) in the weight space Rⁿ (w_k ∈ Rⁿ) is the intersection point of r_k hyperplanes h_j¹ (37) defined by linearly independent feature vectors x_j (j ∈ J_k) from the data set C (1) and n - r_k hyperplanes h_i⁰ (44) defined by unit vectors e_i (i ∈ I_k) [7].

The j-th dual hyperplane h_j¹ (37) defined by the feature vector x_j (1) passes through the k-th vertex w_k if the equation w_k^Tx_j = 1 holds.

Definition 5: The k-th weight vertex w_k of the rank r_k is degenerate in the parameter space Rⁿ if the number m_k of hyperplanes h_j¹(37) passing through this vertex (w_k^Tx_j = 1) is greater than the rank r_k (m_k > r_k).

The vertex w_k can be defined by the following set of n linear equations:

$$(\forall j \in J_k(w_k))\quad w_k^T x_j = 1 \tag{45}$$

and

$$(\forall i \in I_k(w_k))\quad w_k^T e_i = 0 \tag{46}$$

Eqs. (45) and (46) can be represented in the following matrix form [7]:

$$B_k w_k = 1_k \tag{47}$$

where 1_k = [1,…,1, 0,…,0]^T is the vector with the first r_k components equal to one and the remaining n - r_k components equal to zero.

The square matrix B_k(47) consists of k feature vectors x_j (j ∈ J_k(45)) and n - k unit vectors e_i (i ∈ I_k(46)) []:

$$B_k = [x_1, \ldots, x_k, e_{i(k+1)}, \ldots, e_{i(n)}]^T \tag{48}$$

where the symbol e_i(l) denotes the unit vector that forms the l-th row of the matrix B_k.

Since the feature vectors x_j (∀j ∈ J_k(w_k) (45)) making up r_k rows of the matrix B_k (48) are linearly independent, the inverse matrix B_k⁻¹ exists:

$$B_k^{-1} = [r_1, \ldots, r_k, r_{i(k+1)}, \ldots, r_{i(n)}] \tag{49}$$

The inverse matrix B_k⁻¹(49) can be obtained starting from the unit matrix I = [e₁,..., e_n]^T and using the basis exchange algorithm [8].

The non-singular matrix B_k(48) is the basis of the feature space F[n] related to the vertex w_k = [w_k,1,…, w_k,n]^T. Since the last n - r_k components of the vector 1_k(47) are equal to zero, the following equation holds:

$$w_k = B_k^{-1} 1_k = r_1 + \ldots + r_k \tag{50}$$

According to Eq. (50), the weight vertex w_k is the sum of the first k columns r_i of the inverse matrix B_k⁻¹(49).

Remark 1: The n - k components w_k.i of the vector w_k = [w_k,1,…, w_k,n]^T (50) linked to the zero components of the vector 1_k = [1,…, 1, 0,…, 0]^T (47) are equal to zero:

$$(\forall i \in \{k + 1, \ldots, n\})\quad w_{k,i} = 0 \tag{51}$$

The conditions w_k.i = 0 (51) result from the equations w_k^Te_i = 0 (46) at the vertex w_k.
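Eqs. (47) to (51) can be checked numerically. The following Python sketch builds the basis from feature vectors and unit vectors, inverts it, and sums the leading columns of the inverse. It is a minimal sketch under stated assumptions: the inverse is computed by an ordinary Gauss-Jordan elimination written for this example, not by the basis exchange algorithm, and the function names and the toy vectors are hypothetical.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a non-singular square matrix by Gauss-Jordan elimination (exact arithmetic)."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vertex_from_basis(features, unit_indices, n):
    """Vertex w_k as the sum of the first r_k columns of the inverse basis matrix."""
    rows = [list(x) for x in features]
    rows += [[int(i == j) for j in range(n)] for i in unit_indices]
    inverse = invert(rows)
    r_k = len(features)
    return [sum(inverse[i][c] for c in range(r_k)) for i in range(n)]


# Hypothetical example: two feature vectors in a three-dimensional space plus e_3.
features = [[2, 0, 1], [0, 4, 1]]
w = vertex_from_basis(features, unit_indices=[2], n=3)

# Each feature vector in the basis satisfies w^T x = 1, as required by (45).
products = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
```

The third component of w is zero because the unit vector e_3 is in the basis, which is the content of Remark 1.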

The fundamental theorem of linear programming shows that the minimum Φ_k(w_k*) (39) of the CPL collinearity criterion function Φ_k(w) (38) can always be located in one of the vertices w_k (50) [5]. The regularized criterion function Ψ_k(w) (42), another function of the CPL type, has the same property [7].

We can see that all feature vectors x_j (1) which define hyperplanes h_j¹ (37) passing through the vertex w_k are located on the hyperplane H(w_k, 1) = {x: w_k^Tx = 1} (3) in the feature space F[n]. A large number m_k of feature vectors x_j (1) located on the hyperplane H(w_k, 1) (3) form the collinear cluster C(w_k) based on the vertex w_k [8]:

$$C(w_k) = \{x_j \in C\ (1) : w_k^T x_j = 1\} \tag{52}$$

If the vertex w_k of the rank r_k is degenerate in the parameter space Rⁿ then the collinear cluster C(w_k) (52) contains more than r_k feature vectors x_j(1).

The k-th vertex w_k = [w_k,1,…, w_k,n]^T in the parameter space Rⁿ (w_k ∈ Rⁿ) is linked by the Eq. (47) to the non-singular matrix B_k(48). The rows of the matrix B_k(48) can form the basis of the feature space F[n]. The conditions w_k.i = 0 (51) result from the equations w_k^Te_i = 0 (46) at the vertex w_k.

$$(\forall i = 1, \ldots, n)\quad \text{if } e_i \in B_k\ (48), \text{ then } w_{k,i} = 0 \tag{53}$$

Each feature vector x_j from the data set C(1) represents n features X_i belonging to the feature set R(n) = {X₁,…, X_n}. The k-th vertexical feature subset R_k(r_k) consists of r_k features X_i that are connected to the weights w_k.i different from zero (w_k.i ≠ 0):

$$R_k(r_k) = \{X_{i(1)}, \ldots, X_{i(r_k)}\} \tag{54}$$

The k-th vertexical subspace F_k[r_k] (F_k[r_k] ⊂ F[n]) contains the reduced vectors x_j[r_k] with r_k components x_j,i(l) (x_j[r_k] ∈ F_k[r_k]) related to the weights w_k.i different from zero:

$$(\forall j \in \{1, \ldots, m\})\quad x_j[r_k] = [x_{j,i(1)}, \ldots, x_{j,i(r_k)}]^T \tag{55}$$

The reduced vectors x_j[r_k] (55) are obtained from the feature vectors x_j = [x_j,1,...,x_j,n]^T belonging to the data set C(1) by omitting the n - r_k components x_j,i related to the weights w_k.i equal to zero (w_k.i = 0).

We consider the optimal vertexical subspace F_k*[r_k] (F_k*[r_k] ⊂ F[n]) related to the reduced optimal vertex w_k*[r_k], which determines the minimum Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38). The optimal collinear cluster C(w_k*[r_k]) (52) is based on the optimal vertex w_k*[r_k] = [w_k,1*,…, w_k,rk*]^T with r_k non-zero components w_k,i* (w_k.i* ≠ 0). Feature vectors x_j belonging to the collinear cluster C(w_k*) (52) satisfy the equations w_k*[r_k]^Tx_j[r_k] = 1, hence:

$$(\forall x_j \in C(w_k^*))\quad w_{k,1}^*\, x_{j,i(1)} + \ldots + w_{k,r_k}^*\, x_{j,i(r_k)} = 1 \tag{56}$$

where x_j,i(l) are components of the j-th feature vectors x_j related to the weights w_k.i different from zero (w_k.i ≠ 0).

A large number m_k of feature vectors x_j(1) belonging to the collinear cluster C(w_k*[r_k]) (52) justifies the following collinear model of interaction between selected features X_i(l) which is based on the Eqs. (56) [9]:

$$w_{k,1}^*\, X_{i(1)} + \ldots + w_{k,r_k}^*\, X_{i(r_k)} = 1 \tag{57}$$

The collinear interaction model (57) makes it possible, inter alia, to design the following prognostic models for each feature X_i′ from the subset R_k(r_k) (54):

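$$(\forall i' \in \{1, \ldots, r_k\})\quad X_{i'} = \alpha_{i',0} + \alpha_{i',1} X_{i(1)} + \ldots + \alpha_{i',r_k} X_{i(r_k)} \tag{58}$$

where α_i′,0 = 1 / w_k.i′*, α_i′,i′ = 0, and (∀ i(l) ≠ i′) α_i′,i(l) = −w_k.i(l)* / w_k.i′*, as follows from solving Eq. (57) for X_i′.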

Feature X_i′ is a dependent variable in the prognostic model (58), and the remaining r_k - 1 features X_i(l) are independent variables (i(l) ≠ i′). The family of r_k prognostic models (58) can be designed on the basis of one collinear interaction model (57). Models (58) have a better justification for a large number m_k of feature vectors x_j (1) in the collinear cluster C(w_k*[r_k]) (52).
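A prognostic model for one feature can be obtained by solving the collinear interaction model (57) for that feature: dividing by its weight gives an intercept equal to the reciprocal of the weight and slopes equal to minus the ratio of the remaining weights to it. The Python sketch below does exactly this; the function names and the toy weights are assumptions made for this illustration.

```python
from fractions import Fraction


def prognostic_model(weights, target):
    """Solve w_1 X_1 + ... + w_r X_r = 1 for the feature with index `target`.

    Returns the intercept and a dict mapping each other feature index to its slope.
    """
    w_target = weights[target]
    if w_target == 0:
        raise ValueError("the target feature must have a non-zero weight")
    intercept = 1 / w_target
    slopes = {i: -w / w_target for i, w in enumerate(weights) if i != target}
    return intercept, slopes


def predict(intercept, slopes, x):
    """Value of the dependent feature predicted from the remaining components of x."""
    return intercept + sum(s * x[i] for i, s in slopes.items())


# Hypothetical reduced optimal vertex with two non-zero weights.
weights = [Fraction(1, 2), Fraction(1, 4)]
intercept, slopes = prognostic_model(weights, target=0)

# Both reduced vectors below satisfy the interaction model 0.5 * X_1 + 0.25 * X_2 = 1.
predictions = [predict(intercept, slopes, x) for x in ([2, 0], [0, 4])]
```

For vectors that lie exactly on the hyperplane of the collinear cluster, the prediction reproduces the first component, here 2 and 0.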

6. Basis exchange algorithm

The collinearity criterion function Φ(w) (38), like other convex and piecewise linear (CPL) criterion functions, can be minimized using the basis exchange algorithm [8]. The basis exchange algorithm aimed at minimization of the collinearity criterion function Φ(w) (38) is described below.

According to the basis exchange algorithm, the optimal vertex w_k*, which constitutes the minimum value Φ_k(w_k*) (39) of the collinearity function Φ_k(w) (38), is achieved after a finite number L of the steps l as a result of guided movement between selected vertices w_k(50) [8]:

$$w_0 \rightarrow w_1 \rightarrow \ldots \rightarrow w_L \tag{59}$$

The sequence of vertices w_k(59) is related by (47) to the following sequence of the inverse matrices B_k⁻¹(49):

$$B_0^{-1} \rightarrow B_1^{-1} \rightarrow \ldots \rightarrow B_L^{-1} \tag{60}$$

The sequence of vertices w_k(l) (59) typically starts at the vertex w₀ = [0,..., 0]^T related to the identity matrix B₀ = I_n = [e₁,..., e_n]^T of the dimension n × n [7]. The final vertex w_L (59) should assure the minimum value of the collinearity criterion function Φ(w) (38):

$$(\forall w)\quad \Phi(w) \ge \Phi(w_L) \ge 0 \tag{61}$$

If the criterion function Φ(w) (38) is defined on m (m ≤ n) linearly independent vectors x_j (x_j ∈ C(1)) then the value Φ(w_L) of the collinearity criterion function Φ(w) (38) at the final vertex w_L(59) becomes zero (Φ(w_L) = 0) [8]. The rank r_L (Def. 4) of the final vertex w_L(59) can be equal to the number m of feature vectors x_j (r_L = m) or it can be less than m (r_L < m). The rank r_L of the final vertex w_L(59) is less than m (r_L < m) if the final vertex w_L is degenerate [7].

Consider the invertible matrix B_k = [x₁,..., x_k, e_i(k+1),..., e_i(n)]^T (48), which determines the vertex w_k (50) and the value Φ_k(w_k) of the criterion function Φ_k(w) (38) in the k-th step. In the step (k + 1), one of the unit vectors e_i in the matrix B_k (48) is replaced by the feature vector x_k+1 and the matrix B_k+1 = [x₁,..., x_k, x_k+1, e_i(k+2),..., e_i(n)]^T appears. The unit vector e_i(k+1) leaving the matrix B_k (48) is indicated by an exit criterion based on the gradient of the collinearity criterion function Φ(w) (38) [7]. The exit criterion makes it possible to determine the exit edge r_k+1 (49) of the greatest descent of the collinearity criterion function Φ(w) (38). As a result of replacing the unit vector e_i(k+1) with the feature vector x_k+1, the value Φ(w_k) of the collinearity function Φ(w) (38) decreases (41):

$$\Phi(w_{k+1}) \le \Phi(w_k) - \varphi_{k+1}(w_k) \tag{62}$$

After a finite number L (L ≤ m) of the steps k, the collinearity function Φ(w) (38) reaches its minimum (61) at the final vertex w_L(59).

The sequence (60) of the inverse matrices B_k⁻¹ is obtained in a multi-step process of minimizing the function Φ(w) (38). During the k-th step, the matrix B_k−1 = [x₁,…, x_k−1, e_i(k),…, e_i(n)]^T (48) is transformed into the matrix B_k by replacing the unit vector e_i(k) with the feature vector x_k:

$$(\forall k \in \{1, \ldots, L\})\quad B_{k-1} \rightarrow B_k \tag{63}$$

According to the vector Gauss-Jordan transformation, replacing the unit vector e_i(l+1) with the feature vector x_l+1 during the l-th stage results in the following modifications of the columns r_i(l) of the inverse matrix B_l⁻¹ = [r₁(l),…, r_n(l)] (49) [6]:

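$$r_{i(l+1)}(l+1) = \left(1 / r_{i(l+1)}(l)^T x_{l+1}\right) r_{i(l+1)}(l) \tag{64}$$

and

$$(\forall i \ne i(l+1))\quad r_i(l+1) = r_i(l) - \left(r_i(l)^T x_{l+1}\right) r_{i(l+1)}(l+1) = r_i(l) - \left(r_i(l)^T x_{l+1} / r_{i(l+1)}(l)^T x_{l+1}\right) r_{i(l+1)}(l)$$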

where i(l+1) is the index of the unit vector e_i(l+1) leaving the basis B_l = [x₁,..., x_l, e_i(l+1),..., e_i(n)]^T during the l-th stage.

Remark 2: The vector Gauss-Jordan transformation (64) resulting from the replacement of the unit vector e_i(k) with the feature vector x_k in the basis B_k−1 = [x₁,..., x_k−1, e_i(k),..., e_i(n)]^T cannot be executed when the following collinearity condition is met [7]:

$$r_{i(k)}(k)^T x_k = 0 \tag{65}$$

The collinearity condition (65) causes a division by zero in Eq. (64).
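One basis exchange step can be sketched as an update of the columns of the inverse matrix, following the vector Gauss-Jordan transformation (64) and refusing to proceed when the collinearity condition (65) holds. In this minimal Python sketch the leaving unit vector is chosen by hand; the exit criterion based on the gradient of the criterion function is not implemented. The function names and the toy vectors are assumptions made for this example.

```python
from fractions import Fraction


def dot(u, v):
    """Scalar product of two vectors given as sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def identity_columns(n):
    """Columns of the inverse of the initial basis B_0 = I_n."""
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def exchange(inv_cols, leaving, x):
    """Replace basis row `leaving` by the vector x and update the inverse columns (64)."""
    pivot = dot(inv_cols[leaving], x)
    if pivot == 0:
        # Collinearity condition (65): the update would divide by zero.
        raise ValueError("collinearity condition: exchange cannot be executed")
    new_leaving = [c / pivot for c in inv_cols[leaving]]
    new_cols = []
    for i, r in enumerate(inv_cols):
        if i == leaving:
            new_cols.append(new_leaving)
        else:
            coeff = dot(r, x)
            new_cols.append([a - coeff * b for a, b in zip(r, new_leaving)])
    return new_cols


# Hypothetical two-step sequence in a three-dimensional parameter space.
cols = identity_columns(3)
cols = exchange(cols, leaving=0, x=[2, 0, 1])  # x_1 replaces e_1
cols = exchange(cols, leaving=1, x=[0, 4, 1])  # x_2 replaces e_2

# Vertex (50): the sum of the columns linked to the feature vectors in the basis.
w = [cols[0][i] + cols[1][i] for i in range(3)]
```

Starting from the identity matrix, each step costs on the order of n² operations, which is cheaper than inverting the new basis from scratch.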

Let the symbol r_l[k] denote the l-th column r_l(k) = [r_l,1(k),..., r_l,n(k)]^T of the inverse matrix B_k⁻¹ = [r₁(k),…, r_k-1(k), r_k(k),…., r_n(k)] (49) after the reduction of the last n - k components r_l,i(k):

$$r_l[k] = [r_{l,1}(k), \ldots, r_{l,k}(k)]^T \tag{66}$$

Similarly, the symbol x_j[k] = [x_j,1,...,x_j,k]^T means the reduced vector obtained from the feature vector x_j = [x_j,1,...,x_j,n]^T after the reduction of the last n - k components x_j,i:

$$(\forall j \in \{1, \ldots, m\})\quad x_j[k] = [x_{j,1}, \ldots, x_{j,k}]^T \tag{67}$$

Lemma 3: The collinearity condition (65) appears during the k-th step when the reduced vector x_k[k] (67) is a linear combination of the reduced basis vectors x_j[k] (67) with j < k:

$$x_k[k] = \alpha_1 x_1[k] + \ldots + \alpha_{k-1} x_{k-1}[k] \tag{68}$$

where (∀i ∈ {1,…, k - 1}) α_i ∈ R¹.

The proof of this lemma results directly from the collinearity condition (65) [7].

7. Small samples of multivariate feature vectors

A small sample of multivariate vectors appears when the number m of feature vectors x_j in the data set C (1) is much smaller than the dimension n of these vectors (m << n). The basis exchange algorithms allow for efficient minimization of the CPL criterion functions also in the case of small samples of multivariate vectors [10]. However, for small samples, some new properties of the basis exchange algorithms are more important. In particular, the regularization (42) of the CPL criterion functions becomes crucial. New properties of the basis exchange algorithms in the case of a small number m of multidimensional feature vectors x_j (1) are discussed using the example of the collinearity criterion function Φ(w) (38) and the regularized criterion function Ψ(w) (42).

Lemma 4: The value Φ(w_K) of the collinearity criterion function Φ(w) (38) at the final vertex w_L(59) is equal to zero if all m linear Eqs. (45) are fulfilled in the vertex w_L which is related by the Eq. (47) to the matrix B_L = [x₁,..., x_m, e_i(1),..., e_i(n-m)]^T(48) containing the unit vectors e_i with the indices i from the subset I_L (i ∈ I_L).

Theorem 4: If the feature vectors x_j constituting the subset C_k (C_k ⊂ C(1)) and used in the definition of the function Φ(w) (38) are linearly independent (Def. 2), then the value Φ(w_L) of the collinearity criterion function Φ(w) at the final vertex w_L(59) is equal to zero (Φ(w_L) = 0).

The proof of Theorem 4 can be based on the stepwise inversion of the matrices B_k(48) [16]. The final vertex w_L(59) can be found by inverting the related matrix B_L = [x₁,..., x_rk, e_i(1),..., e_i(n-rk)]^T(48).

The final vertex w_L(59) at which the criterion function Φ(w) (38) vanishes (Φ(w_L) = 0) can be related to the optimal matrix B_L = [x₁,..., x_L, e_i(L + 1),..., e_i(n)]^T(48) built from L (L ≤ m) feature vectors x_j (j ∈ J(w_L) (45)) from the data set C(1) and from n - L selected unit vectors e_i (i ∈ I(w_L) (46)). Different subsets of the unit vectors e_i in the final matrix B_L(48) result in different positions of the final vertices w_L(l)(59) in the parameter space Rⁿ. The criterion function Φ(w) (38) is equal to zero (Φ(w_L(l)) = 0) at each of these vertices w_L(l)(59).

The position of the final vertices w_L(l)(59) in the parameter space Rⁿ depends on which unit vectors e_i (i ∈ I_L(l)) are included in the basis B_L(l)(48), where:

$$(\forall l \in \{1, \ldots, l_{\max}\}) \quad \Phi_k(w_{L(l)}) = 0 \tag{69}$$

The maximal number l_max(69) of different vertices w_L(l)(59) can be large when m << n:

$$l_{\max} = \frac{n!}{m!\,(n-m)!} \tag{70}$$
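To get a feeling for how quickly the bound (70) grows, the binomial coefficient can be evaluated directly. The sketch below only illustrates the arithmetic; the function name `max_vertices` and the sample values of n and m are chosen for this example and do not come from the chapter.

```python
from math import comb


def max_vertices(n, m):
    """Upper bound l_max = n! / (m! (n - m)!) on the number of final vertices (70)."""
    return comb(n, m)


print(max_vertices(20, 3))     # 1140
print(max_vertices(1000, 10))  # a very large integer: the bound explodes when m << n
```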

The choice between different final vertices w_L(l)(59) can be based on the minimization of the regularized criterion function Ψ(w) (42). The regularized function Ψ(w) (42) is the sum of the collinearity function Φ(w) (38) and the weighted sum of the cost functions φ_i⁰(w) (43). If Φ(w_L(l)) = 0 (38), then the value Ψ(w_L(l)) of the criterion function Ψ(w) (42) at the final vertex w_L(l)(59) can be given as follows:

$$\Psi(w_{L(l)}) = \lambda \sum_{i} \gamma_i \, \varphi_i^{0}(w_{L(l)}) = \lambda \sum_{i} \gamma_i \, |w_{L(l),i}| \tag{71}$$

where the above sums take into account only the indices i of the subset I(w_L(l)) of the non-zero components w_L(l),i of the final vertex w_L(l) = [w_L(l),1,…, w_L(l),n]^T(59):

$$I(w_{L(l)}) = \{ i : e_i^{T} w_{L(l)} \neq 0 \} = \{ i : w_{L(l),i} \neq 0 \} \tag{72}$$

If the final vertex w_L(l)(59) is not degenerate (Def. 5), then the matrix B_L(l)(48) is built from all m feature vectors x_j (j ∈ {1,..., m}) making up the data set C(1) and from n - m selected unit vectors e_i (i ∈ I(w_L(l)) (72)):

$$B_m = [x_1, \ldots, x_m, e_{i(m+1)}, \ldots, e_{i(n)}]^{T} \tag{73}$$

The constrained minimization of the regularized function Ψ(w) (71) at the vertices w_L(l)(59) satisfying the conditions Φ(w_L(l)) = 0 (69) can be formulated in the following way:

$$\min_{l} \{ \Psi(w_{L(l)}) : \Phi(w_{L(l)}) = 0 \} = \min_{l} \Big\{ \sum_{i} \gamma_i \, |w_{L(l),i}| : \Phi(w_{L(l)}) = 0 \Big\} \tag{74}$$

According to the above formulation, the search for the minimum of the regularized criterion function Ψ(w) (42) takes place over all those vertices w_L(l)(59) where the collinearity function Φ(w) (38) is equal to zero. The regularized criterion function Ψ(w) (42) is defined as follows at the final vertices w_L(l) = [w_L(l),1,…, w_L(l),n]^T(59), where Φ(w_L(l)) = 0:

$$(\forall w_{L(l)}) \quad \Psi'(w_{L(l)}) = \sum_{i} \gamma_i \, |w_{L(l),i}| \tag{75}$$

The optimal vertex w_L(l)* is the vertex with the minimum value Ψ′(w_L(l)*) of the CPL criterion function Ψ′(w) (75) among those final vertices w_L(l)(59) where Φ(w_L(l)) = 0 (38):

$$(\exists\, w_{L(l)}^{*}) \; (\forall\, w_{L(l)} : \Phi(w_{L(l)}) = 0) \quad \Psi'(w_{L(l)}) \ge \Psi'(w_{L(l)}^{*}) > 0 \tag{76}$$

As in the case of the minimization of the perceptron criterion function Φ_k^p(v) (29), the optimal vector w_L(l)* (76) may be located at a selected vertex of some convex polyhedron (27) in the parameter space Rⁿ (w ∈ Rⁿ) [7].

If the cost parameters γ_i(42) have standard values of one ((∀i ∈ {1,…,n}) γ_i = 1), then the constrained minimization problem (74) leads to the optimal vertex w_L(l)* with the smallest L₁ length || w_L(l)* ||_L1 = |w_L(l),1*| + … + |w_L(l),n*|, where Φ(w_L(l)*) = 0 (38):

$$(\exists\, w_{L(l)}^{*}) \; (\forall\, w_{L(l)} : \Phi(w_{L(l)}) = 0) \quad \| w_{L(l)} \|_{L_1} \ge \| w_{L(l)}^{*} \|_{L_1} \tag{77}$$

The optimal vertex w_L(l)* with the smallest L₁ length || w_L(l)* ||_L1(77) is related to the largest L₁ margin δ_L1(w_L(l)*) (6) [11]:

$$\delta_{L_1}(w_{L(l)}^{*}) = \frac{2}{\| w_{L(l)}^{*} \|_{L_1}} = \frac{2}{|w_{L(l),1}^{*}| + \ldots + |w_{L(l),n}^{*}|} \tag{78}$$

The basis exchange algorithm makes it possible to solve the constrained minimization problem (74) and to find the optimal vertex w_L(l)* (77) with the largest L₁ margin δ_L1(w_L(l)*).
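For very small n and m, problem (74) can be checked by exhaustive enumeration, which makes the role of the L₁ length and of the margin (78) concrete. The sketch below is not the basis exchange algorithm. It assumes that the vertex equations have the form x_j^T w = 1 for the feature vectors and w_i = 0 for the unit vectors kept in the basis, and that all cost parameters are equal to one. The helper names `solve` and `min_l1_vertex` and the toy data are hypothetical.

```python
from itertools import combinations


def solve(A, b, tol=1e-12):
    """Solve A w = b by Gauss-Jordan elimination; return None if A is singular."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < tol:
            return None
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def min_l1_vertex(X):
    """Enumerate all vertices with Phi = 0 and return the one of smallest L1 length.

    X holds m feature vectors of dimension n (m <= n). For each choice S of m
    coordinates allowed to be non-zero, the m x m system X[:, S] w_S = 1 is solved;
    the remaining n - m components are fixed to zero (unit vectors in the basis).
    """
    m, n = len(X), len(X[0])
    best = None
    for S in combinations(range(n), m):
        w_s = solve([[row[i] for i in S] for row in X], [1.0] * m)
        if w_s is None:
            continue  # singular sub-matrix: no vertex for this choice
        w = [0.0] * n
        for i, v in zip(S, w_s):
            w[i] = v
        length = sum(abs(v) for v in w)  # Psi' with all gamma_i = 1, Eq. (75)
        if best is None or length < best[0]:
            best = (length, w, S)
    return best


X = [[1.0, 2.0, 4.0], [2.0, 1.0, 3.0]]  # toy data: m = 2 vectors, n = 3 features
length, w_opt, support = min_l1_vertex(X)
print(support, length, 2.0 / length)  # chosen coordinates, L1 length, L1 margin (78)
```

The enumeration visits up to l_max (70) candidate bases, so it is only usable for toy sizes; it is meant as a reference against which a basis exchange implementation could be compared.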

Support Vector Machines (SVM) are the most popular method for designing linear classifiers or prognostic models with large margins [12]. According to the SVM approach, the optimal linear classifier or prognostic model is defined by the optimal weight vector w* that has the maximum margin δ_L2(w*) based on the Euclidean (L₂) norm:

$$\delta_{L_2}(w^{*}) = \frac{2}{\| w^{*} \|_{L_2}} = \frac{2}{(w^{*T} w^{*})^{1/2}} \tag{79}$$

Maximization of the Euclidean margins δ_L2(w) (79) is performed using quadratic programming [2].
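The two margin definitions (78) and (79) differ only in the norm used, which a short numerical comparison makes visible. The weight vector below is an arbitrary illustrative value, and the function names are hypothetical.

```python
import math


def margin_l1(w):
    """L1 margin 2 / ||w||_L1, as in Eq. (78)."""
    return 2.0 / sum(abs(v) for v in w)


def margin_l2(w):
    """Euclidean margin 2 / ||w||_L2, as in Eq. (79)."""
    return 2.0 / math.sqrt(sum(v * v for v in w))


w = [0.2, 0.0, 0.2]  # illustrative weight vector
print(margin_l1(w), margin_l2(w))
```

Because the Euclidean norm of a vector never exceeds its L₁ norm, the Euclidean margin of a given vector is never smaller than its L₁ margin.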

8. Complex layers of linear prognostic models

Complex layers of linear classifiers or prognostic models have been proposed as a scheme for obtaining general classification or forecasting rules designed on the basis of a small number of multidimensional feature vectors x_j [11]. According to this scheme, when designing linear prognostic models, averaging over a small number m of feature vectors x_j of dimension n (m << n) is replaced by averaging over collinear clusters of selected features (genes) X_i. Such an approach to averaging can be linked to ergodic theory [17].

In the case of a small sample of multivariate vectors, the number m of feature vectors x_j in the data set C(1) may be much smaller than the dimension n of these vectors (m << n). In this case, the collinear cluster C(w_k*[r_k]) (52) may contain all feature vectors x_j from the set C(1) and the vertex w_k*[r_k] may have the rank r_k equal to m (r_k = m).

As follows from Theorem 4, if the collinearity criterion function Φ(w) (38) is defined on linearly independent (Def. 2) feature vectors x_j, then the values Φ(w_m(l)) of this function at each final vertex w_m(l)(59) are equal to zero (Φ(w_m(l)) = 0). Each final vertex w_m(l)(59) can be reached in m steps k (k = 1, …, m) starting from the vertex w₀ = [0,..., 0]^T related to the identity matrix B₀ = I_n = [e₁,..., e_n]^T.

Minimization of the collinearity criterion function Φ(w) (38), followed by minimization of the criterion function Ψ′(w_L(l)) (75) at the final vertices w_L(l)(59), makes it possible to determine the optimal vertex w_L(l)* (77) with the largest L₁ margin δ_L1(w_L(l)*) (78). If the feature vectors x_j(1) are linearly independent, then the optimal vertex w_L(l)* (77) is related to the optimal basis B_L(l)* = [x₁,..., x_m, e_i(m + 1),..., e_i(n)]^T, which contains all m feature vectors x_j(1) and n - m unit vectors e_i with the indices i belonging to the optimal subset I(w_L(l)*) (72) (i ∈ I(w_L(l)*)).

The optimal basis B_m* = [x₁,..., x_m, e_i(m + 1),..., e_i(n)]^T(73) is found in two stages. In the first stage, m feature vectors x_j(1) are introduced into matrices B_k = [x₁,..., x_k, e_i(k + 1),..., e_i(n)]^T (k = 0, 1,…, m - 1). The inverse matrices B_k⁻¹(49) are computed in accordance with the vector Gauss-Jordan transformation (64). In the second stage, the unit vectors e_i(l) in the matrices B_m(l)(73) are exchanged to minimize the CPL function Ψ′(w_m(l)) (75) at the final vertices w_m(l)(77). The optimal basis B_m* defines (47) the optimal vertex w_m(l)* (77), which is characterized by the largest margin δ_L1(w_m(l)*) (78).
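The first stage of this procedure can be sketched as follows. Since Eqs. (47) and (64) are not restated in this section, the code assumes the standard Gauss-Jordan row-exchange update and a vertex defined by B w = 1′, where 1′ has ones in the rows occupied by feature vectors and zeros elsewhere. The rule for choosing the leaving unit vector (largest absolute pivot) is a choice made for this illustration only, and the names `stage_one`, `slot` and `inv_cols` are hypothetical. The second stage is not shown.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def stage_one(X, tol=1e-12):
    """Introduce the m rows of X into the basis, starting from B_0 = I_n (m <= n).

    Returns the vertex w and a list `slot`, where slot[i] is the index j of the
    feature vector x_j placed in row i, or None if the unit vector e_i stayed.
    """
    n = len(X[0])
    inv_cols = [[1.0 if r == c else 0.0 for r in range(n)] for c in range(n)]
    slot = [None] * n
    for j, x in enumerate(X):
        free = [i for i in range(n) if slot[i] is None]
        # Illustrative pivot rule: the unit vector with the largest |r_i^T x| leaves.
        k = max(free, key=lambda i: abs(dot(inv_cols[i], x)))
        pivot = dot(inv_cols[k], x)
        if abs(pivot) < tol:
            raise ValueError("feature vector %d cannot enter the basis" % j)
        new_k = [v / pivot for v in inv_cols[k]]
        for i in range(n):
            if i != k:
                c = dot(inv_cols[i], x)
                inv_cols[i] = [a - c * b for a, b in zip(inv_cols[i], new_k)]
        inv_cols[k] = new_k
        slot[k] = j
    # w = B^{-1} 1' is the sum of the inverse columns belonging to feature-vector rows.
    w = [sum(inv_cols[i][c] for i in range(n) if slot[i] is not None)
         for c in range(n)]
    return w, slot


X = [[1.0, 2.0, 4.0], [2.0, 1.0, 3.0]]  # toy data: m = 2, n = 3
w, slot = stage_one(X)
print(w, slot)
```

At the returned vertex every feature vector satisfies x_j^T w = 1 and the components tied to the remaining unit vectors are zero, which is the situation Φ(w) = 0 described above.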

The vertexical feature subspace F₁*[m] (F₁*[m] ⊂ F[n] (1)) can be obtained on the basis of the optimal vertex w_m(l)* (77) with the largest margin δ_L1(w_m(l)*) (78). The vertexical subspace F₁*[m] contains the reduced vectors x_1,j[m] with the dimension m [7]:

$$(\forall j \in \{1, \ldots, m\}) \quad x_{1,j}[m] \in F_1^{*}[m] \tag{80}$$

The reduced vectors x_1,j[m] (80) are obtained from the feature vectors x_j = [x_j,1,...,x_j,n]^T (x_j ∈ F[n]) by ignoring those components x_j,i which are related to the unit vectors e_i in the optimal basis B₁* = [x₁,..., x_m, e_i(m + 1),..., e_i(n)]^T(73). The reduced vectors x_1,j[m] are represented by those m features X_i (X_i ∈ R₁* (54)) which are not linked to the unit vectors e_i (i ∉ I_m(l)*) in the basis B_m(l)* (73) representing the optimal vertex w_m(l)* (77):

$$R_1^{*} = \{ X_{i(1)}, \ldots, X_{i(m)} : i(l) \notin I_{m(l)}^{*} \ \text{(72)} \} \tag{81}$$

The m features X_i(l) belonging to the optimal subset R₁* (X_i(l) ∈ R₁* (81)) are related to the weights w_k,l* (w_k*[m] = [w_k,1*,…, w_k,m*]^T) that are not zero (w_k,l* ≠ 0).

The optimal feature subset R₁* (81) consists of m collinear features X_i. The optimal vertex w₁*[m] (Φ(w₁*[m]) = 0 (69)) in the reduced parameter space R^m (w₁*[m] ∈ R^m) is based on these m features X_i. The reduced optimal vertex w₁*[m] with the largest margin δ_L1(w₁*[m]) (77) is the unique solution of the constrained optimization problem (74). Maximizing the L₁ margin δ_L1(w_l*) (78) leads to the first reduced vertex w₁*[m] = [w_k,1*,…, w_k,m*]^T with non-zero components w_k.i * (w_k.i * ≠ 0).
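The passage from the full vertex to the reduced vectors can be sketched in a few lines: only the coordinates with non-zero weights are kept. The function name, the tolerance and the toy numbers below are assumptions made for illustration.

```python
def reduce_to_vertex_features(X, w, tol=1e-12):
    """Keep only the features whose weights in the vertex w are non-zero.

    Returns the kept feature indices, the reduced feature vectors and the
    reduced weight vector.
    """
    keep = [i for i, v in enumerate(w) if abs(v) > tol]
    X_red = [[row[i] for i in keep] for row in X]
    w_red = [w[i] for i in keep]
    return keep, X_red, w_red


X = [[1.0, 2.0, 4.0], [2.0, 1.0, 3.0]]  # toy data: m = 2, n = 3
w = [0.2, 0.0, 0.2]                      # illustrative vertex with one zero weight
keep, X_red, w_red = reduce_to_vertex_features(X, w)
print(keep, X_red, w_red)
```

In this toy case the reduced vectors still satisfy the collinear relation with the reduced weights, because the dropped component had zero weight.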

The collinear interaction model between m collinear features X_i(l) from the optimal subset R₁*(m) (81) can be formulated as follows (57):

$$w_{k,1}^{*} X_{i(1)} + \ldots + w_{k,m}^{*} X_{i(m)} = 1 \tag{82}$$

The prognostic models for each feature X_i′ from the subset R₁* (81) may have the following form (58):

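$$(\forall i' \in \{1, \ldots, m\}) \quad X_{i'} = \alpha_{i',0} + \alpha_{i',1} X_{i(1)} + \ldots + \alpha_{i',m} X_{i(m)} \tag{83}$$

where α_i′,0 = 1 / w_k,i′*, α_i′,i′ = 0, and (∀ i(l) ≠ i′) α_i′,i(l) = − w_k,i(l)* / w_k,i′*. The minus sign follows from solving the collinear model (82) for X_i′.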
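Solving the collinear model (82) for one feature gives the coefficients of the prognostic model (83): the intercept is 1 / w_i′ and every other coefficient is the corresponding weight divided by w_i′ with a negative sign. A minimal sketch, with hypothetical function names and toy weights:

```python
def prognostic_model(w, target):
    """Coefficients of model (83) for feature `target`, derived from w^T X = 1 (82)."""
    w_t = w[target]
    if w_t == 0:
        raise ValueError("the target feature must have a non-zero weight")
    intercept = 1.0 / w_t
    # Moving the other terms of (82) to the right-hand side flips their sign.
    coeffs = [0.0 if l == target else -w[l] / w_t for l in range(len(w))]
    return intercept, coeffs


def predict(intercept, coeffs, x):
    """Evaluate the linear model; the coefficient of the target itself is zero."""
    return intercept + sum(a * v for a, v in zip(coeffs, x))


w = [0.2, 0.2]                      # illustrative reduced vertex
intercept, coeffs = prognostic_model(w, target=0)
print(intercept, coeffs)
print(predict(intercept, coeffs, [1.0, 4.0]))  # recovers the first component, 1.0
```

For vectors lying exactly on the hyperplane (82), such as [1, 4] and [2, 3] here, the model reproduces the target component without error.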

In the case of a data set C with a small number m (m << n) of multidimensional feature vectors x_j(1), the prognostic models (83) for individual features X_i′ can be weak. It is known that sets (ensembles) of weak models can have strong generalizing properties [4]. A set of weak prognostic models (83) for a selected feature (dependent variable) X_i′ can be implemented in the complex layer of L prognostic models (83) [11].

The complex layer can be built on the basis of the sequence of L optimal vertices w_l* (77) related to m features X_i constituting the subsets R_l* (81), where l = 0, 1,..., L:

$$w_1^{*}(R_1^{*}), \ldots, w_L^{*}(R_L^{*}) \tag{84}$$

Design assumption: Each subset R_l* (81) in the sequence (84) contains an a priori selected feature (dependent variable) X_i′ and m - 1 other features (independent variables) X_i(l). The other features X_i(l) (X_i(l) ∈ R_l*) should be different in successive subsets R_l* (l = 0, 1,..., L).

The first optimal vertex w₁* (77) in the sequence (84) is designed on the basis of m feature vectors x_j(1), which are represented by all n features X_i constituting the feature set F(n) = {X₁,…, X_n}. The vertex w₁* (77) is found by solving the constrained optimization problem (74) according to the two-stage procedure outlined earlier. The two-stage procedure makes it possible to find the optimal vertex w₁* (77) with the largest L₁ margin δ_L1(w₁*) (78).

The second optimal vertex w₂* (77) in the sequence (84) is obtained on the basis of m reduced feature vectors x_j[n - (m - 1)] (67), which are represented by n - (m - 1) features X_i constituting the reduced feature subset F₂(n - (m - 1)):

$$F_2(n - (m-1)) = (F(n) \setminus R_1^{*}) \cup \{X_{i'}\} \tag{85}$$

The l-th optimal vertex w_l* (77) in the sequence (84) is designed on the basis of m reduced vectors x_j[n - l(m - 1)] (67), which are represented by n - l(m - 1) features X_i constituting the feature subset F_l(n - l(m - 1)):

$$F_l(n - l(m-1)) = (F_{l-1}(n - l(m-1)) \setminus R_{l-1}^{*}) \cup \{X_{i'}\} \tag{86}$$
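The bookkeeping behind (85) and (86) can be sketched with plain lists: after each layer, the independent features of the selected cluster are removed and the dependent feature is retained. The cluster selection itself requires solving problem (74); below it is replaced by a hypothetical stand-in selector that simply takes the first available features, so only the subset construction is illustrated.

```python
def feature_subsets(features, target, select_cluster, n_layers):
    """Build successive pairs (F_l, R_l) following the scheme of (85)-(86).

    select_cluster(current, target) must return a cluster that contains the
    target and m - 1 other features taken from `current`.
    """
    current = list(features)
    layers = []
    for _ in range(n_layers):
        cluster = select_cluster(current, target)
        if target not in cluster:
            raise ValueError("each cluster must contain the dependent feature")
        layers.append((current, cluster))
        used = set(cluster) - {target}
        # Remove the independent features of the cluster, keep the target.
        current = [f for f in current if f not in used]
    return layers


def first_features_selector(m):
    """Hypothetical stand-in for the optimization step: take the first m - 1 features."""
    def select(current, target):
        others = [f for f in current if f != target][: m - 1]
        if len(others) < m - 1:
            raise ValueError("not enough features left for another cluster")
        return [target] + others
    return select


features = ["X%d" % i for i in range(1, 11)]  # n = 10 illustrative features
layers = feature_subsets(features, "X1", first_features_selector(3), n_layers=3)
for subset, cluster in layers:
    print(len(subset), cluster)
```

With n = 10 and m = 3 the successive subsets in this sketch shrink by m - 1 = 2 features per layer.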

The sequence (84) of L optimal vertices w_l* (77) related to the subsets F_l(n - l(m - 1)) (86) of features is characterized by non-increasing L₁ margins δ_L1(w_l*) (78) [18]:

$$\delta_{L_1}(w_1^{*}) \ge \delta_{L_1}(w_2^{*}) \ge \ldots \ge \delta_{L_1}(w_L^{*}) \tag{87}$$

The prognostic models (83) for the dependent feature (variable) X_i′ are designed for each subset F_l(n - l(m - 1)) (86) of features X_i, where l = 0, 1,..., L (84):

$$(\forall l \in \{0, 1, \ldots, L\}) \quad X_{i'}(l) = \alpha_{i',0}(l) + \alpha_{i',1}(l)\, X_{i(1)}(l) + \ldots + \alpha_{i',m}(l)\, X_{i(m)}(l) \tag{88}$$

The final forecast X_i′^∧ for the dependent feature (variable) X_i′ based on the complex layer of L + 1 prognostic models (88) can have the following form:

$$\hat{X}_{i'} = \frac{X_{i'}(0) + X_{i'}(1) + \ldots + X_{i'}(L)}{L + 1} \tag{89}$$

In accordance with Eq. (89), the final forecast X_i′^∧ for the feature X_i′ results from averaging the forecasts of L + 1 individual models X_i′(l) (88).
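The averaging step can be sketched as follows. Each model is represented by an intercept and a list of coefficients, and each model receives the values of its own independent features. The numbers are made up for illustration, and the function name `layer_forecast` is hypothetical.

```python
def layer_forecast(models, inputs):
    """Average the forecasts of the individual linear models of a complex layer.

    models: list of (intercept, coeffs) pairs, one per prognostic model.
    inputs: list of feature-value lists, one per model, aligned with its coeffs.
    """
    forecasts = [
        intercept + sum(a * v for a, v in zip(coeffs, x))
        for (intercept, coeffs), x in zip(models, inputs)
    ]
    return sum(forecasts) / len(forecasts)


# Three illustrative single-feature models and their input values.
models = [(5.0, [-1.0]), (3.0, [-0.5]), (1.0, [0.25])]
inputs = [[4.0], [2.0], [4.0]]
print(layer_forecast(models, inputs))  # mean of the forecasts 1.0, 2.0 and 2.0
```

An unweighted mean is used here because the averaging described above treats all models of the layer equally.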

9. Concluding remarks

The article considers computational schemes for designing classifiers or prognostic models based on a data set C(1) that consists of a small number m of high-dimensional feature vectors x_j (m << n).

The concept of a complex layer composed of many linear prognostic models (88) built in low-dimensional feature subspaces is discussed in more detail. These models (88) are built by using a small number m of collinear features X_i belonging to the optimal feature clusters R_l* (81). The optimal feature clusters R_l* (81) are formed by the search for the largest margins δ_L1(w_l*) (78) in the L₁ norm.

The averaged prognostic models X_i′^∧(89) are based on the layer of L parallel models X_i′(l) (88). In line with ergodic theory, averaging over a small number m of feature vectors x_j has been replaced with averaging over L collinear clusters R_l* (81) of features X_i. Such an averaging scheme should allow for a more stable extraction of general patterns from small samples of high-dimensional feature vectors x_j(1) [11].

References

1. Duda R. O., Hart P. E., and Stork D. G., Pattern classification, J. Wiley, New York, 2001
2. Hand D., Smyth P., and Mannila H., Principles of data mining, MIT Press, Cambridge (2001)
3. Bishop C. M., Pattern Recognition and Machine Learning. Springer Verlag, 2006
4. Kuncheva L.: Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition, J. Wiley, New Jersey (2014)
5. Simonnard M., Linear Programming, Prentice – Hall, New York, Englewood Cliffs, 1966
6. Bobrowski L., Data mining based on convex and piecewise linear (CPL) criterion functions (in Polish), Białystok University of Technology, 2005
7. Bobrowski L., Data Exploration and Linear Separability, pp. 1 - 172, Lambert Academic Publishing, 2019
8. Bobrowski, L.: ″Design of piecewise linear classifiers from formal neurons by some basis exchange technique″, Pattern Recognition, 24(9), pp. 863-870 (1991)
9. Bobrowski L., Zabielski P., ″Models of Multiple Interactions from Collinear Patterns″, pp. 153-165 in: Bioinformatics and Biomedical Engineering (IWBBIO 2018), Eds.: I. Rojas, F. Guzman, LNCS 10208, Springer Verlag, 2018
10. Bobrowski L., Small Samples of Multidimensional Feature Vectors (ICCCI 2020), pp. 87 - 98 in: Advances in Computational Collective Intelligence, Eds.: Hernes M, et al., Springer 2020
11. Bobrowski L., ″Complexes of Low Dimensional Linear Classifiers with L₁ Margins″, pp. 29 - 40 in: ACIIDS 2021, Springer Verlag, 2021
12. Boser B. E., Guyon I., Vapnik V. N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop of Computational Learning Theory, 5, 144–152. Pittsburgh, ACM, 1992
13. Bobrowski L., Łukaszuk T.: Repeatable functionalities in complex layers of formal neurons, EANN 2021, Engineering Applications of Neural Networks, Springer 2021
14. Rosenblatt F.: Principles of neurodynamics, Spartan Books, Washington, 1962
15. Bobrowski L., Łukaszuk, T.: Relaxed Linear Separability (RLS) Approach to Feature (Gene) Subset Selection, pp. 103 - 118 in: Selected Works in Bioinformatics, Edited by: Xuhua Xia, INTECH, 2011
16. Bobrowski L.: ″Large Matrices Inversion Using the Basis Exchange Algorithm″, British Journal of Mathematics & Computer Science, 21(1): 1-11, 2017
17. Petersen K.: Ergodic Theory (Cambridge Studies in Advanced Mathematics), Cambridge University Press, 1990
18. Bobrowski L., Zabielski P.: ″Feature (gene) clustering with collinearity models″, ICCCI 2021 (to appear), Springer Verlag, 2021

[1] 1. Duda O. R., Hart P. E., and Stork D. G., Pattern classification, J. Wiley, New York, 2001

[2] 2. Hand D., Smyth P., and Mannila H., Principles of data mining, MIT Press, Cambridge (2001)

[3] 3. Bishop C. M., Pattern Recognition and Machine Learning. Springer Verlag, 2006

[4] 4. Kuncheva L.: Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition, J. Wiley, New Jersey (2014)

[5] 5. Simonnard M., Linear Programming, Prentice – Hall, New York, Englewood Cliffs, 1966

[6] 6. Bobrowski L., Data mining based on convex and piecewise linear (CPL) criterion functions (in Polish), Białystok University of Technology, 2005

[7] 7. Bobrowski L., Data Exploration and Linear Separability, pp. 1 - 172, Lambert Academic Publishing, 2019

[8] 8. Bobrowski, L.: ″Design of piecewise linear classifiers from formal neurons by some basis exchange technique″, Pattern Recognition, 24(9), pp. 863-870 (1991)

[9] 9. Bobrowski L., Zabielski P., ″Models of Multiple Interactions from Collinear Patterns″, pp. 153-165 in: Bioinformatics and Biomedical Engineering (IWBBIO 2018), Eds.: I. Rojas, F. Guzman, LNCS 10208, Springer Verlag, 2018

[10] 10. Bobrowski L., Small Samples of Multidimensional Feature Vectors (ICCCI 2020), pp. 87 - 98 in: Advances in Computational Collective Intelligence, Eds.: Hernes M, et al., Springer 2020

[11] 11. Bobrowski L., ″Complexes of Low Dimensional Linear Classifiers with L₁ Margins″, pp. 29 - 40 in: ACIIDS 2021, Springer Verlag, 2021

[12] 12. Boser B. E., Guyon I., Vapnik V. N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop of Computational Learning Theory, 5, 144–152. Pittsburgh, ACM, 1992

[13] 13. Bobrowski L., Łukaszuk T.: Repeatable functionalities in complex layers of formal neurons, EANN 2021, Engineering Applications of Neural Networks, Springer 2021

[14] 14. Rosenblatt F.: Principles of neurodynamics, Spartan Books, Washington, 1962

[15] 15. Bobrowski L., Łukaszuk, T.: Relaxed Linear Separability (RLS) Approach to Feature (Gene) Subset Selection, pp. 103 - 118 in: Selected Works in Bioinformatics, Edited by: Xuhua Xia, INTECH, 2011

[16] 16. Bobrowski L.: ″Large Matrices Inversion Using the Basis Exchange Algorithm″, British Journal of Mathematics & Computer Science, 21(1): 1-11, 2017

[17] 17. Petersen K.: Ergodic Theory (Cambridge Studies in Advanced Mathematics), Cambridge University Press, 1990

[18] 18. Bobrowski L., Zabielski P.: ″Feature (gene) clustering with collinearity models″, ICCCI 2021 (to appear), Springer Verlag, 2021

Computing on Vertices in Data Mining

Data Mining - Concepts and Applications

Abstract

Keywords

Author Information

Leon Bobrowski*

1. Introduction

2. Linear separability vs. linear dependence

3. Perceptron criterion function

4. Collinearity criterion function

5. Parameter vertices

6. Basis exchange algorithm

7. Small samples of multivariate feature vectors

8. Complex layers of linear prognostic models

9. Concluding remarks

References

Artificial Intelligence and Its Application in Optimization under Uncertainty

Computing on Vertices in Data Mining

Data Mining - Concepts and Applications

Abstract

Keywords

Author Information

Leon Bobrowski*

1. Introduction

2. Linear separability vs. linear dependence

3. Perceptron criterion function

4. Collinearity criterion function

5. Parameter vertices

6. Basis exchange algorithm

7. Small samples of multivariate feature vectors

8. Complex layers of linear prognostic models

9. Concluding remarks

References

Continue reading from the same book

Data Mining