Computing on Vertices in Data Mining

The main challenges in data mining are related to large, multi-dimensional data sets. There is a need to develop algorithms that are precise and efficient enough to deal with big data problems. The Simplex algorithm from linear programming can be seen as an example of a successful tool for solving big data problems. According to the fundamental theorem of linear programming, the solution of the optimization problem can be found in one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors.


Introduction
Various data mining tools have been proposed to extract patterns from data sets [1]. Large, multidimensional data sets impose high requirements on the precision and efficiency of the calculations used to extract patterns (regularities) that are useful in practice [2]. In this context, there is still a need to develop new data mining algorithms [3]. New types of patterns can also be obtained by combining different types of classification or prognosis models [4].
The Simplex algorithm from linear programming is used as an effective big data mining tool [5]. According to the fundamental theorem of linear programming, the solution to a linear optimization problem with linear constraints can be found at one of the vertices in the parameter space. Narrowing the search area to a finite number of vertices is the source of the efficiency of the Simplex algorithm.
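The vertex property behind the Simplex algorithm's efficiency can be illustrated with a minimal sketch. The function name and the brute-force enumeration below are illustrative assumptions, not the Simplex method itself: they only demonstrate that the optimum of a linear program is attained at a vertex of the feasible polyhedron, so a finite search over vertices suffices.

```python
import itertools

import numpy as np

def lp_by_vertex_enumeration(c, A, b):
    """Maximize c^T x subject to A x <= b and x >= 0 by examining every
    vertex of the feasible polyhedron; by the fundamental theorem of
    linear programming the optimum is attained at one of these vertices."""
    c = np.asarray(c, dtype=float)
    n = len(c)
    A_full = np.vstack([A, -np.eye(n)])          # add the constraints x_i >= 0
    b_full = np.concatenate([b, np.zeros(n)])
    best_x, best_val = None, -np.inf
    # A vertex is a feasible intersection point of n linearly independent
    # constraint hyperplanes.
    for rows in itertools.combinations(range(len(b_full)), n):
        A_sq = A_full[list(rows)]
        if abs(np.linalg.det(A_sq)) < 1e-12:
            continue
        x = np.linalg.solve(A_sq, b_full[list(rows)])
        if np.all(A_full @ x <= b_full + 1e-9):  # keep only feasible vertices
            val = float(c @ x)
            if val > best_val:
                best_val, best_x = val, x
    return best_x, best_val

# Maximize 3x + 2y subject to x + y <= 4, x <= 3, y <= 2, x >= 0, y >= 0.
x_opt, v_opt = lp_by_vertex_enumeration(
    [3.0, 2.0],
    np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([4.0, 3.0, 2.0]))
```

The optimum lands at the vertex (3, 1), where the constraints x + y <= 4 and x <= 3 intersect; the Simplex algorithm reaches such a vertex by guided movement along edges instead of exhaustive enumeration.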
Basis exchange algorithms also look for an optimal solution among a finite number of vertices in the parameter space [6]. The basis exchange algorithms are based on the Gauss-Jordan transformation and, for this reason, are similar to the Simplex algorithm. Controlling the basis exchange algorithm is related to the minimization of convex and piecewise linear (CPL) criterion functions [7].
The perceptron and collinearity criterion functions belong to the family of CPL functions. The minimization of the perceptron criterion function makes it possible to check the linear separability of data sets and to design piecewise linear classifiers [8]. Minimizing the collinearity criterion function makes it possible to detect collinear (flat) patterns in data sets and to design models of multiple interactions [9].
Data sets consisting of a small number of multivariate feature vectors generate specific problems in data mining [10]. Genetic data sets belong to this type of data. Minimizing the perceptron criterion function or the collinearity criterion function makes it possible to solve discrimination or regression problems also in the case of a small set of multidimensional feature vectors, by using complex layers of low-dimensional linear classifiers or prognostic models [11].

Linear separability vs. linear dependence
Let us assume that each of m objects O_j from a given database is represented by an n-dimensional feature vector x_j = [x_j,1, ..., x_j,n]^T belonging to the feature space F[n] (x_j ∈ F[n]). The data set C consists of m such feature vectors x_j:

C = {x_j : j = 1, ..., m} (1)

The components x_j,i of the feature vector x_j are numerical values (x_j,i ∈ R or x_j,i ∈ {0, 1}) of the individual features X_i of the j-th object O_j. In this context, each feature vector x_j (x_j ∈ F[n]) represents n features X_i belonging to the feature set F(n) = {X_1, ..., X_n}.
The pairs {G_k^+, G_k^-} (k = 1, ..., K) of the learning sets G_k^+ and G_k^- (G_k^+ ∩ G_k^- = ∅) are formed from feature vectors x_j selected from the data set C (1):

G_k^+ = {x_j : j ∈ J_k^+} and G_k^- = {x_j : j ∈ J_k^-} (2)

where J_k^+ and J_k^- are non-empty sets of indices j of vectors x_j (J_k^+ ∩ J_k^- = ∅). The positive learning set G_k^+ is composed of m_k^+ feature vectors x_j (j ∈ J_k^+). Similarly, the negative learning set G_k^- is composed of m_k^- feature vectors x_j (j ∈ J_k^-), where m_k^+ + m_k^- ≤ m.
The possibility of separating the learning sets G_k^+ and G_k^- (2) by a hyperplane H(w_k, θ_k) in the feature space F[n] is investigated in pattern recognition [1]:

H(w_k, θ_k) = {x : w_k^T x = θ_k} (3)

where w_k = [w_k,1, ..., w_k,n]^T ∈ R^n is the weight vector, θ_k ∈ R^1 is the threshold, and w_k^T x = Σ_i w_k,i x_i is the scalar product.
Definition 1: The learning sets G_k^+ and G_k^- (2) are linearly separable in the feature space F[n] if and only if there exist a weight vector w_k (w_k ∈ R^n) and a threshold θ_k (θ_k ∈ R^1) such that the hyperplane H(w_k, θ_k) (3) separates these sets [7]:

(∀x_j ∈ G_k^+) w_k^T x_j ≥ θ_k + 1 and (∀x_j ∈ G_k^-) w_k^T x_j ≤ θ_k - 1 (4)

According to the above inequalities, all vectors x_j from the learning set G_k^+ (2) are located on the positive side of the hyperplane H(w_k, θ_k) (3), and all vectors x_j from the set G_k^- lie on the negative side of this hyperplane.
The hyperplane H(w_k, θ_k) (3) separates (4) the sets G_k^+ and G_k^- (2) with the following margin δ_L2(w_k) based on the Euclidean (L_2) norm, which is used in the Support Vector Machines (SVM) method [12]:

δ_L2(w_k) = 2 / ||w_k||_L2 (5)

where ||w_k||_L2 = (w_k^T w_k)^1/2 is the Euclidean length of the weight vector w_k. The margin δ_L1(w_k) with the L_1 norm related to the hyperplane H(w_k, θ_k) (3), which separates (4) the learning sets G_k^+ and G_k^- (2), is determined by analogy to (5) as [11]:

δ_L1(w_k) = 2 / ||w_k||_L1 (6)

where ||w_k||_L1 = |w_k,1| + ... + |w_k,n| is the L_1 length of the weight vector w_k.
The margins δ_L2(w_k) (5) and δ_L1(w_k) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets G_k^+ and G_k^- (2) [7].
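With the normalization used in inequalities (4), both margins reduce to simple norm ratios. A minimal sketch (the helper name is an assumption; it uses the standard SVM-style margin 2/||w|| between the support hyperplanes w^T x = θ + 1 and w^T x = θ - 1):

```python
import numpy as np

def margins(w):
    """Euclidean (L2) and L1 margins of the separating hyperplane pair
    w^T x = theta + 1 and w^T x = theta - 1, i.e. 2 / ||w||."""
    w = np.asarray(w, dtype=float)
    delta_l2 = 2.0 / np.linalg.norm(w, 2)  # SVM-style Euclidean margin
    delta_l1 = 2.0 / np.linalg.norm(w, 1)  # L1 margin used with CPL functions
    return delta_l2, delta_l1

d2, d1 = margins([3.0, 4.0])  # ||w||_L2 = 5, ||w||_L1 = 7
```

Maximizing either margin therefore amounts to minimizing the corresponding norm of the weight vector w.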
The following set of m_k^0 = m_k^+ + m_k^- linear equations can be formulated on the basis of the linear separability inequalities (4):

(∀j ∈ J_k^+) w_k^T x_j = θ_k + 1 and (∀j ∈ J_k^-) w_k^T x_j = θ_k - 1 (7)

If we assume that the threshold θ_k can be determined later, then we have n unknown weights w_k,i (w_k = [w_k,1, ..., w_k,n]^T) in an underdetermined system of Eqs. (7). In order to obtain a system of n linear equations with n unknown weights w_k,i, additional linear equations based on n - m_k^0 selected unit vectors e_i (i ∈ I_k) are taken into account [6]:

(∀i ∈ I_k) w_k^T e_i = 0 (8)

The parameter vertex w_k = [w_k,1, ..., w_k,n]^T can be determined by the linear Eqs. (7) and (8) if the feature vectors x_j forming the learning sets G_k^+ and G_k^- (2) are linearly independent [7].
The feature vector x_j0 (x_j0 ∈ G_k^+ ∪ G_k^- (2)) is a linear combination of some other vectors x_j(i) (j(i) ≠ j0) from the learning sets (2) if there are parameters α_j0,i (α_j0,i ≠ 0) such that the following relation holds:

x_j0 = Σ_i α_j0,i x_j(i) (9)

Definition 2: The feature vectors x_j making up the learning sets G_k^+ and G_k^- (2) are linearly independent if none of these vectors x_j0 (x_j0 ∈ G_k^+ ∪ G_k^-) can be expressed as a linear combination (9) of l (l ∈ {1, ..., m - 1}) other vectors x_j(l) from the learning sets.
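Definition 2 can be checked numerically: the vectors are linearly independent exactly when the rank of the matrix stacking them row-wise equals their number. A small sketch (the function name is an assumption):

```python
import numpy as np

def linearly_independent(X):
    """True when the feature vectors stored row-wise in X are linearly
    independent, i.e. none is a linear combination of the others."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return bool(np.linalg.matrix_rank(X) == len(X))
```

For example, [[1, 0, 1], [0, 1, 1]] passes the check, while [[1, 2], [2, 4]] fails because the second vector is twice the first.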
If the number m_k^0 = m_k^+ + m_k^- of elements x_j of the learning sets G_k^+ and G_k^- (2) does not exceed the dimension n of the feature space F[n] (m_k^+ + m_k^- ≤ n), then the parameter vertex w_k(θ_k) can be defined by the linear equations in the following matrix form [13]:

B_k w_k(θ_k) = 1_k(θ_k) (10)

where

1_k(θ_k) = [θ_k + 1, ..., θ_k + 1, θ_k - 1, ..., θ_k - 1, 0, ..., 0]^T (11)

and

B_k = [x_j(1), ..., x_j(mk0), e_i(mk0+1), ..., e_i(n)]^T (12)

The first m_k^+ components of the vector 1_k(θ_k) are equal to θ_k + 1, the next m_k^- components are equal to θ_k - 1, and the last n - m_k^+ - m_k^- components are equal to 0. The first m_k^+ rows of the square matrix B_k (12) are formed by the feature vectors x_j (j ∈ J_k^+) from the set G_k^+ (2), the next m_k^- rows are formed by the vectors x_j (j ∈ J_k^-) from the set G_k^- (2), and the last n - m_k^+ - m_k^- rows are made up of unit vectors e_i (i ∈ I_k). If the matrix B_k (12) is non-singular, then there exists the inverse matrix B_k^-1:

B_k^-1 = [r_1, ..., r_n] (13)

In this case, the parameter vertex w_k(θ_k) (10) can be defined by the following equation:

w_k(θ_k) = B_k^-1 1_k(θ_k) = (θ_k + 1) r_k^+ + (θ_k - 1) r_k^- (14)

where the vector r_k^+ is the sum of the first m_k^+ columns r_i of the inverse matrix B_k^-1 (13), and the vector r_k^- is the sum of the next m_k^- columns r_i of this matrix.
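A minimal numeric sketch of Eqs. (10)-(12): the basis B_k stacks the learning vectors and is completed to a square matrix with unit vectors, and the vertex w_k(θ_k) solves B_k w = 1_k(θ_k). The function name and the naive choice of the trailing unit vectors are illustrative assumptions (the basis exchange algorithm selects them adaptively):

```python
import numpy as np

def parameter_vertex(X_pos, X_neg, theta):
    """Solve B_k w = 1_k(theta) for the parameter vertex, where B_k stacks
    the positive vectors, the negative vectors, and enough trailing unit
    vectors to make the matrix square."""
    X_pos = np.atleast_2d(np.asarray(X_pos, dtype=float))
    X_neg = np.atleast_2d(np.asarray(X_neg, dtype=float))
    n = X_pos.shape[1]
    m = len(X_pos) + len(X_neg)
    B = np.vstack([X_pos, X_neg, np.eye(n)[m:]])   # naive unit-vector choice
    rhs = np.concatenate([np.full(len(X_pos), theta + 1.0),
                          np.full(len(X_neg), theta - 1.0),
                          np.zeros(n - m)])
    return np.linalg.solve(B, rhs)

# Two vectors in R^3: x_1 in G+, x_2 in G-, threshold theta = 0.
w = parameter_vertex([[1.0, 0.0, 1.0]], [[0.0, 1.0, 1.0]], 0.0)
```

The resulting w satisfies w^T x_1 = 1 and w^T x_2 = -1, so the hyperplane H(w, 0) separates the two vectors as in (4).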
The length ||w_k(θ_k)||_L1 of the weight vector w_k(θ_k) (14) in the L_1 norm is the sum of m_k^0 = m_k^+ + m_k^- components |w_k,i(θ_k)|:

||w_k(θ_k)||_L1 = Σ_i |w_k,i(θ_k)| (15)

In accordance with Eq. (14), the components |w_k,i(θ_k)| can be determined as follows:

|w_k,i(θ_k)| = |(θ_k + 1) r_k,i^+ + (θ_k - 1) r_k,i^-| (16)

The length ||w_k(θ_k)||_L1 (15) of the vector w_k(θ_k) (14) with the L_1 norm is minimized to increase the margin δ_L1(w_k(θ_k)) (6). The length ||w_k(θ_k)||_L1 (15) can be minimized by selecting the optimal threshold value θ_k* on the basis of Eq. (14):

θ_k* = arg min_θk ||w_k(θ_k)||_L1 (17)

where the optimal vertex w_k(θ_k*) is defined by Eq. (14).
Theorem 1: The learning sets G_k^+ and G_k^- (2) formed by m (m ≤ n) linearly independent (9) feature vectors x_j are linearly separable (4) in the feature space F[n] (x_j ∈ F[n]).
Proof: If the learning sets G_k^+ and G_k^- (2) are formed by m linearly independent feature vectors x_j, then the non-singular matrix B_k = [x_1, ..., x_m, e_i(m+1), ..., e_i(n)]^T (12) containing these m vectors x_j and n - m unit vectors e_i (i ∈ I_k) can be defined [10]. In this case, the inverse matrix B_k^-1 (13) exists and determines the vertex w_k(θ_k) (14). The vertex equation B_k w_k(θ_k) = 1_k(θ_k) (10) can be reformulated for the feature vectors x_j (2) as follows:

(∀j ∈ J_k^+) w_k(θ_k)^T x_j = θ_k + 1 and (∀j ∈ J_k^-) w_k(θ_k)^T x_j = θ_k - 1 (19)

The solution of Eqs. (19) satisfies the linear separability inequalities (4).
It is possible to enlarge the learning sets G_k^+ and G_k^- (2) in a way that maintains their linear separability (4).
Lemma 1: Increasing the positive learning set G_k^+ (2) by a new vector x_j0 (x_j0 ∉ G_k^+), which is a linear combination with the parameters α_j0,i (9) of some feature vectors x_j(l) (2) from this set (x_j(l) ∈ G_k^+), preserves the linear separability (4) of the learning sets if the parameters α_j0,i fulfill the following condition:

(θ_k + 1) (Σ_i α_j0,i - 1) ≥ 0 (20)

If the assumptions of the lemma are met, then by (19)

w_k^T x_j0 = Σ_i α_j0,i w_k^T x_j(i) = (θ_k + 1) Σ_i α_j0,i ≥ θ_k + 1 (21)

The above inequality means that the linear separability conditions (4) still apply after the enlargement of the learning set G_k^+ (2).
Lemma 2: Increasing the negative learning set G_k^- (2) by a new vector x_j0 (x_j0 ∉ G_k^-), which is a linear combination with the parameters α_j0,i (9) of some feature vectors x_j(l) (2) from this set (x_j(l) ∈ G_k^-), preserves the linear separability (4) of the learning sets if the parameters α_j0,i fulfill the following condition:

(θ_k - 1) (Σ_i α_j0,i - 1) ≤ 0 (22)

The justification of Lemma 2 may be based on an inequality similar to (21).

Perceptron criterion function
The minimization of the perceptron criterion function makes it possible to assess the degree of linear separability (4) of the learning sets G_k^+ and G_k^- (2) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors y_j (y_j ∈ F[n + 1]) and augmented weight vector v_k (v_k ∈ R^(n+1)) [1]:

(∀j ∈ J_k^+) y_j = [x_j^T, 1]^T and (∀j ∈ J_k^-) y_j = -[x_j^T, 1]^T (23)

v_k = [w_k^T, -θ_k]^T = [w_k,1, ..., w_k,n, -θ_k]^T (24)

The augmented vectors y_j (23) are constructed on the basis of the learning sets G_k^+ and G_k^- (2). These learning sets are extracted from the data set C (1) according to some additional knowledge. The linear separability (4) of the learning sets G_k^+ and G_k^- (2) can be reformulated using the following set of m inequalities with the augmented vectors y_j (23) [7]:

(∀j ∈ J_k^+ ∪ J_k^-) y_j^T v ≥ 1 (25)

The dual hyperplanes h_j^p in the parameter space R^(n+1) (v ∈ R^(n+1)) are defined on the basis of the augmented vectors y_j [6]:

(∀j ∈ J_k^+ ∪ J_k^-) h_j^p = {v : y_j^T v = 1} (26)

The dual hyperplanes h_j^p (26) divide the parameter space R^(n+1) (v ∈ R^(n+1)) into a finite number L of disconnected regions (convex polyhedra) D_l^p (l = 1, ..., L) (27) [7], where J_l^+ and J_l^- are disjoint subsets (J_l^+ ∩ J_l^- = ∅) of the indices j of the feature vectors x_j making up the learning sets G_k^+ and G_k^- (2). The perceptron penalty functions φ_j^p(v) are defined as follows for each of the augmented feature vectors y_j (23) [6]:

φ_j^p(v) = 1 - y_j^T v, if y_j^T v < 1
φ_j^p(v) = 0, if y_j^T v ≥ 1 (28)

The j-th penalty function φ_j^p(v) (28) is greater than zero if and only if the weight vector v is located on the wrong side (y_j^T v < 1) of the j-th dual hyperplane h_j^p (26). The function φ_j^p(v) (28) is linear and greater than zero as long as the parameter vector v = [v_k,1, ..., v_k,n+1]^T remains on the wrong side of the hyperplane h_j^p (26). The convex and piecewise-linear (CPL) penalty functions φ_j^p(v) (28) are used to enforce the linear separation (4) of the learning sets G_k^+ and G_k^- (2).
The perceptron criterion function Φ_k^p(v) is defined as the weighted sum of the penalty functions φ_j^p(v) (28) [6]:

Φ_k^p(v) = Σ_j α_j φ_j^p(v) (29)

The positive parameters α_j (α_j > 0) can be treated as the prices of individual feature vectors x_j:

(∀j ∈ J_k^+) α_j = 1 / (2 m_k^+) and (∀j ∈ J_k^-) α_j = 1 / (2 m_k^-) (30)

where m_k^+ (m_k^-) is the number of elements x_j in the learning set G_k^+ (G_k^-) (2). The perceptron criterion function Φ_k^p(v) (29) was built on the basis of the error correction algorithm, the basic algorithm in the Perceptron model of learning processes in neural networks [14].
The criterion function Φ_k^p(v) (29) is convex and piecewise-linear (CPL) [6]. This means, among others, that the function Φ_k^p(v) (29) remains linear within each region D_l^p (27):

(∀v ∈ D_l^p) Φ_k^p(v) = Σ_j α_j (1 - y_j^T v) (31)

where the summation is performed over all vectors y_j (23) fulfilling the condition y_j^T v < 1. Since the criterion function Φ_k^p(v) (29) is linear in each convex polyhedron D_l^p (27), the optimal point v_k* representing the minimum of this function,

Φ_k^p(v_k*) = min_v Φ_k^p(v) (32)

can be located in a vertex of some polyhedron D_l'^p (27). This property of the optimal vector v_k* (32) follows from the fundamental theorem of linear programming [5].
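The CPL structure of Φ_k^p(v) is easy to evaluate directly. A minimal sketch (the function name is an assumption; the penalty (28) is written compactly as max(0, 1 - y_j^T v), and the prices α_j default to 1 rather than the values (30)):

```python
import numpy as np

def perceptron_criterion(v, Y, alpha=None):
    """Perceptron criterion Phi(v) = sum_j alpha_j * max(0, 1 - y_j^T v)
    for augmented vectors y_j stored row-wise in Y."""
    Y = np.asarray(Y, dtype=float)
    alpha = np.ones(len(Y)) if alpha is None else np.asarray(alpha, float)
    margins = Y @ np.asarray(v, dtype=float)
    return float(np.sum(alpha * np.maximum(0.0, 1.0 - margins)))

# Augmented vectors (23) for x = 2 in G+ and x = -1 in G- (1-D example):
Y = np.array([[2.0, 1.0],     # y = [x, 1] for the positive vector
              [1.0, -1.0]])   # y = -[x, 1] for the negative vector
```

Here perceptron_criterion([1.0, 0.0], Y) evaluates to 0, confirming linear separability (the vector v = [1, 0] satisfies both inequalities (25)), while at v = 0 the criterion equals 2.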
It has been shown that the minimum value Φ_k^p(v_k*) (32) of the perceptron criterion function Φ_k^p(v) (29) with the parameters α_j (30) is normalized as follows [6]:

0 ≤ Φ_k^p(v_k*) ≤ 1 (33)

The following theorem has been proved [6]: the minimum value Φ_k^p(v_k*) (32) is equal to zero if and only if the learning sets G_k^+ and G_k^- (2) are linearly separable (4), and it is equal to one when these sets cover almost completely. It can also be proved that the minimum value Φ_k^p(v_k*) (32) of the perceptron criterion function Φ_k^p(v) (29) does not depend on invertible linear transformations of the feature vectors y_j (23) [6].
The regularized criterion function Ψ_k^p(v) is defined as the sum of the perceptron criterion function Φ_k^p(v) (29) and some additional penalty functions [13]. These additional CPL functions are equal to the costs γ_i (γ_i > 0) of individual features X_i multiplied by the absolute values |w_i| of the weights w_i, where v = [w^T, -θ]^T = [w_1, ..., w_n, -θ]^T ∈ R^(n+1) (24):

Ψ_k^p(v) = Φ_k^p(v) + λ Σ_i γ_i |w_i| (34)

where λ (λ ≥ 0) is the cost level. The standard values of the cost parameters γ_i are equal to one ((∀i ∈ {1, ..., n}) γ_i = 1).
The optimal vector v_k,λ* defines the minimum value Ψ_k^p(v_k,λ*) of the CPL criterion function Ψ_k^p(v) (34), which is defined on the elements x_j of the learning sets G_k^+ and G_k^- (2):

Ψ_k^p(v_k,λ*) = min_v Ψ_k^p(v) (35)

As in the case of the perceptron criterion function Φ_k^p(v) (29), the optimal vector v_k,λ* (35) can be located in a vertex of some polyhedron D_l' (27). The minimum value Ψ_k^p(v_k,λ*) (35) of the criterion function Ψ_k^p(v) (34) is used, among others, in the relaxed linear separability (RLS) method of gene subset selection [15].
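The regularized function Ψ_k^p(v) can be evaluated directly from its definition. A minimal sketch (the function name is an assumption; the penalty (28) is written as max(0, 1 - y_j^T v) with unit prices α_j, and only the weight part w of v = [w^T, -θ]^T is cost-penalized):

```python
import numpy as np

def regularized_criterion(v, Y, lam, gamma=None):
    """Psi(v) = Phi(v) + lam * sum_i gamma_i * |w_i| for v = [w, -theta]
    and augmented vectors y_j stored row-wise in Y; costs gamma_i
    default to 1."""
    v = np.asarray(v, dtype=float)
    w = v[:-1]                                   # last component is -theta
    gamma = np.ones(len(w)) if gamma is None else np.asarray(gamma, float)
    phi = float(np.sum(np.maximum(0.0, 1.0 - np.asarray(Y, float) @ v)))
    return phi + lam * float(np.sum(gamma * np.abs(w)))

Y = np.array([[2.0, 1.0], [1.0, -1.0]])  # augmented vectors of a separable pair
psi = regularized_criterion([1.0, 0.0], Y, lam=0.5)  # Phi = 0, cost = 0.5 * |1|
```

Raising λ makes more weights w_i worth zeroing out, which is the mechanism behind feature (gene) selection in the RLS method.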

Collinearity criterion function
Minimizing the collinearity criterion function is used to extract collinear patterns from large, multidimensional data sets C (1) [7]. Linear models of multivariate interactions can be formulated on the basis of representative collinear patterns [9]. The collinearity penalty functions φ_j(w) are determined by the individual feature vectors x_j = [x_j,1, ..., x_j,n]^T in the following manner [9]:

φ_j(w) = 1 - w^T x_j, if w^T x_j ≤ 1
φ_j(w) = w^T x_j - 1, if w^T x_j > 1 (36)

The penalty functions φ_j(w) (36) can be related to the following dual hyperplanes h_j^1 in the parameter (weight) space R^n (w ∈ R^n):

(∀x_j ∈ C_k) h_j^1 = {w : w^T x_j = 1} (37)

The CPL penalty φ_j(w) (36) is equal to zero (φ_j(w) = 0) at the point w = [w_1, ..., w_n]^T if and only if the point w is located on the dual hyperplane h_j^1 (37). The collinearity criterion function Φ_k(w) is defined as the weighted sum of the penalty functions φ_j(w) (36) determined by the feature vectors x_j forming the data subset C_k (C_k ⊂ C (1)):

Φ_k(w) = Σ_j β_j φ_j(w) (38)

where the sum takes into account only the indices j from the set J_k = {j : x_j ∈ C_k}, and the positive parameters β_j (β_j > 0) in the function Φ_k(w) (38) can be treated as the prices of particular feature vectors x_j. The standard choice of the parameter values β_j is one ((∀j ∈ J_k) β_j = 1.0).
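Evaluating the collinearity criterion is direct, since the penalty (36) collapses to an absolute value. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def collinearity_criterion(w, X, beta=None):
    """Collinearity criterion Phi_k(w) = sum_j beta_j * |w^T x_j - 1| for
    feature vectors stored row-wise in X; prices beta_j default to 1.
    The value is zero exactly when every x_j lies on the hyperplane
    w^T x = 1."""
    X = np.asarray(X, dtype=float)
    beta = np.ones(len(X)) if beta is None else np.asarray(beta, float)
    return float(np.sum(beta * np.abs(X @ np.asarray(w, float) - 1.0)))

# Three points on the line x1 + x2 = 1 form a collinear (flat) pattern:
flat = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
```

Here collinearity_criterion([1.0, 1.0], flat) returns 0.0, while moving the third point off the line makes the criterion positive.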
The collinearity criterion function Φ_k(w) (38) is convex and piecewise-linear (CPL) as the sum of penalty functions φ_j(w) (36) of this type [9]. The vector w_k* determines the minimum value Φ_k(w_k*) of the criterion function Φ_k(w) (38):

Φ_k(w_k*) = min_w Φ_k(w) (39)

Definition 3: The data subset C_k (C_k ⊂ C (1)) is collinear when all feature vectors x_j from this subset are located on some hyperplane H(w, θ) = {x : w^T x = θ} with θ ≠ 0.
Theorem 3: The minimum value Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38) defined on the feature vectors x_j constituting a data subset C_k (C_k ⊂ C (1)) is equal to zero (Φ_k(w_k*) = 0) when this subset C_k is collinear (Def. 3) [9].
Different collinear subsets C_k can be extracted from a data set C (1) with a large number m of elements x_j by minimizing the collinearity criterion function Φ_k(w) (38) [9].
The minimum value Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38) can be reduced to zero by omitting some feature vectors x_j from the data subset C_k (C_k ⊂ C (1)). If the minimum value Φ_k(w_k*) (39) is greater than zero (Φ_k(w_k*) > 0), then we can select the feature vectors x_j (j ∈ J_k(w_k*)) with the penalty φ_j(w_k*) (36) greater than zero:

J_k(w_k*) = {j : φ_j(w_k*) > 0} (40)

Omitting one feature vector x_j0 (j0 ∈ J_k(w_k*)) with the above property results in the following reduction of the minimum value Φ_k(w_k*) (39):

Φ_k'(w_k'*) ≤ Φ_k(w_k*) - β_j0 φ_j0(w_k*) (41)

where Φ_k'(w_k'*) is the minimum value (39) of the collinearity criterion function Φ_k'(w) (38) defined on the feature vectors x_j constituting the data subset C_k reduced by the vector x_j0.
The regularized criterion function Ψ_k(w) is defined as the sum of the collinearity criterion function Φ_k(w) (38) and some additional CPL penalty functions φ_i^0(w) [7]:

Ψ_k(w) = Φ_k(w) + λ Σ_i γ_i φ_i^0(w) (42)

where λ (λ ≥ 0) is the cost level. The standard values of the cost parameters γ_i are equal to one ((∀i ∈ {1, ..., n}) γ_i = 1). The additional CPL penalty functions φ_i^0(w) are defined below [7]:

(∀i ∈ {1, ..., n}) φ_i^0(w) = |w_i| (43)

The functions φ_i^0(w) (43) are related to the following dual hyperplanes h_i^0 in the parameter (weight) space R^n (w ∈ R^n):

(∀i ∈ {1, ..., n}) h_i^0 = {w : w^T e_i = 0} (44)

The CPL penalty function φ_i^0(w) (43) is equal to zero if and only if the point w is located on the dual hyperplane h_i^0 (44).

Parameter vertices
The perceptron criterion function Φ_k^p(v) (29) and the collinearity criterion function Φ_k(w) (38) are convex and piecewise-linear (CPL). The minimum values of such CPL criterion functions can be located in the parameter vertices of some convex polyhedra. We consider the parameter vertices w_k (w_k ∈ R^n) related to the collinearity criterion function Φ_k(w) (38).
Definition 4: The parameter vertex w_k of the rank r_k (r_k ≤ n) in the weight space R^n (w_k ∈ R^n) is the intersection point of r_k hyperplanes h_j^1 (37) defined by linearly independent feature vectors x_j (j ∈ J_k) from the data set C (1) and n - r_k hyperplanes h_i^0 (44) defined by unit vectors e_i (i ∈ I_k) [7].
The j-th dual hyperplane h_j^1 (37) defined by the feature vector x_j (1) passes through the k-th vertex w_k if the equation w_k^T x_j = 1 holds.
Definition 5: The k-th weight vertex w_k of the rank r_k is degenerate in the parameter space R^n if the number m_k of hyperplanes h_j^1 (37) passing through this vertex (w_k^T x_j = 1) is greater than the rank r_k (m_k > r_k).
The vertex w_k can be defined by the following set of n linear equations:

(∀j ∈ J_k) w_k^T x_j = 1 (45)

and

(∀i ∈ I_k) w_k^T e_i = 0 (46)

Eqs. (45) and (46) can be represented in the following matrix form [7]:

B_k w_k = 1_k (47)

where 1_k = [1, ..., 1, 0, ..., 0]^T is the vector with the first r_k components equal to one and the remaining n - r_k components equal to zero.
The square matrix B_k (47) consists of r_k feature vectors x_j (j ∈ J_k (45)) and n - r_k unit vectors e_i (i ∈ I_k (46)):

B_k = [x_j(1), ..., x_j(rk), e_i(rk+1), ..., e_i(n)]^T (48)

where the symbol e_i(l) denotes the unit vector which forms the l-th row of the matrix B_k.
Since the feature vectors x_j (∀j ∈ J_k (45)) making up the r_k rows of the matrix B_k (48) are linearly independent, the inverse matrix B_k^-1 exists:

B_k^-1 = [r_1, ..., r_n] (49)

The inverse matrix B_k^-1 (49) can be obtained starting from the unit matrix I_n = [e_1, ..., e_n]^T and using the basis exchange algorithm [8].
The non-singular matrix B_k (48) is the basis of the feature space F[n] related to the vertex w_k = [w_k,1, ..., w_k,n]^T. Since the last n - r_k components of the vector 1_k (47) are equal to zero, the following equation holds:

w_k = B_k^-1 1_k = r_1 + ... + r_rk (50)

According to Eq. (50), the weight vertex w_k is the sum of the first r_k columns r_i of the inverse matrix B_k^-1 (49).
Remark 1: The n - r_k components w_k,i of the vector w_k = [w_k,1, ..., w_k,n]^T (50) linked to the zero components of the vector 1_k = [1, ..., 1, 0, ..., 0]^T (47) are equal to zero:

(∀i ∈ I_k) w_k,i = 0 (51)

The conditions w_k,i = 0 (51) result from the equations w_k^T e_i = 0 (46) at the vertex w_k.
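Eq. (50) can be checked numerically in a few lines. The example data are illustrative assumptions: r_k = 2 feature vectors in R^3, with the basis completed by the unit vector e_3.

```python
import numpy as np

X = np.array([[1.0, 2.0, 0.0],        # r_k = 2 linearly independent
              [2.0, 1.0, 0.0]])       # feature vectors in R^3
B = np.vstack([X, np.eye(3)[2:]])     # basis B_k completed with e_3
B_inv = np.linalg.inv(B)
w_k = B_inv[:, :2].sum(axis=1)        # sum of the first r_k columns of B_k^-1
```

The vertex w_k = (1/3, 1/3, 0) satisfies w_k^T x_j = 1 for both feature vectors, so both dual hyperplanes h_j^1 pass through it, and w_k,3 = 0 as Remark 1 predicts.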
The fundamental theorem of linear programming shows that the minimum Φ_k(w_k*) (39) of the CPL collinearity criterion function Φ_k(w) (38) can always be located in one of the vertices w_k (50) [5]. The regularized criterion function Ψ_k(w) (42), another function of the CPL type, has the same property [7].
We can see that all feature vectors x_j (1) which define hyperplanes h_j^1 (37) passing through the vertex w_k are located on the hyperplane H(w_k, 1) (3). A large number m_k of feature vectors x_j (1) located on the hyperplane H(w_k, 1) (3) form the collinear cluster C(w_k) based on the vertex w_k [8]:

C(w_k) = {x_j : w_k^T x_j = 1} (52)

If the vertex w_k of the rank r_k is degenerate in the parameter space R^n, then the collinear cluster C(w_k) (52) contains more than r_k feature vectors x_j (1).
The k-th vertex w_k = [w_k,1, ..., w_k,n]^T in the parameter space R^n (w_k ∈ R^n) is linked by Eq. (47) to the non-singular matrix B_k (48), whose rows form a basis of the feature space F[n].
Each feature vector x_j from the data set C (1) represents n features X_i belonging to the feature set F(n) = {X_1, ..., X_n}. The k-th vertexical feature subset R_k(r_k) consists of the r_k features X_i connected to the weights w_k,i different from zero (w_k,i ≠ 0):

R_k(r_k) = {X_i : w_k,i ≠ 0} (54)

The k-th vertexical subspace F_k[r_k] (F_k[r_k] ⊂ F[n]) contains the reduced vectors x_j[r_k] with r_k components x_j,i(l) (x_j[r_k] ∈ F_k[r_k]) related to the weights w_k,i different from zero:

x_j[r_k] = [x_j,i(1), ..., x_j,i(rk)]^T (55)

The reduced vectors x_j[r_k] (55) are obtained from the feature vectors x_j = [x_j,1, ..., x_j,n]^T belonging to the data set C (1) by omitting the n - r_k components x_j,i related to the weights w_k,i equal to zero (w_k,i = 0).
We consider the optimal vertexical subspace F_k*[r_k] (F_k*[r_k] ⊂ F[n]) related to the reduced optimal vertex w_k*[r_k] which determines the minimum Φ_k(w_k*) (39) of the collinearity criterion function Φ_k(w) (38). The optimal collinear cluster C(w_k*[r_k]) (52) is based on the optimal vertex w_k*[r_k] = [w_k,1*, ..., w_k,rk*]^T with r_k components w_k,i* different from zero (w_k,i* ≠ 0). The feature vectors x_j belonging to the collinear cluster C(w_k*) (52) satisfy the equations w_k*[r_k]^T x_j[r_k] = 1, hence:

(∀x_j ∈ C(w_k*[r_k])) Σ_l w_k,i(l)* x_j,i(l) = 1 (56)

where x_j,i(l) are the components of the j-th feature vector x_j related to the weights w_k,i different from zero (w_k,i ≠ 0).
A large number m_k of feature vectors x_j (1) belonging to the collinear cluster C(w_k*[r_k]) (52) justifies the following collinear model of interaction between the selected features X_i(l), which is based on Eqs. (56) [9]:

w_k,i(1)* X_i(1) + ... + w_k,i(rk)* X_i(rk) = 1 (57)

The collinear interaction model (57) allows, inter alia, to design the following prognostic models for each feature X_i0 from the subset R_k(r_k) (54):

X_i0 = β_i0,0 - Σ_l β_i0,i(l) X_i(l) (58)

where β_i0,0 = 1 / w_k,i0*, β_i0,i0 = 0, and (∀i(l) ≠ i0) β_i0,i(l) = w_k,i(l)* / w_k,i0*. The feature X_i0 is the dependent variable in the prognostic model (58); the remaining r_k - 1 features X_i(l) are independent variables (i(l) ≠ i0). A family of r_k prognostic models (58) can be designed on the basis of one collinear interaction model (57). The models (58) have a better justification for a large number m_k of feature vectors x_j (1) in the collinear cluster C(w_k*[r_k]) (52).
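Deriving the family of prognostic models from one collinear pattern w_k* amounts to solving w^T x = 1 for the chosen feature. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def prognostic_model(w, i0):
    """Solve w^T x = 1 for feature i0:
    x_i0 = 1/w_i0 - sum_{i != i0} (w_i / w_i0) * x_i.
    Returns the intercept beta_0 and the coefficient vector."""
    w = np.asarray(w, dtype=float)
    intercept = 1.0 / w[i0]
    coef = -w / w[i0]
    coef[i0] = 0.0                     # the dependent feature drops out
    return intercept, coef

b0, coef = prognostic_model([2.0, 1.0], i0=0)   # pattern 2*X1 + X2 = 1
```

For a point (0.25, 0.5) lying on the pattern, the model predicts X_1 = b0 + coef[1] * 0.5 = 0.25, recovering the dependent feature exactly; each choice of i0 yields one member of the family of models.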

Basis exchange algorithm
The collinearity criterion function Φ(w) (38), like other convex and piecewise linear (CPL) criterion functions, can be minimized using the basis exchange algorithm [8]. The basis exchange algorithm aimed at minimizing the collinearity criterion function Φ(w) (38) is described below.
According to the basis exchange algorithm, the optimal vertex w_k*, which constitutes the minimum value Φ_k(w_k*) (39) of the collinearity function Φ(w) (38), is achieved after a finite number L of steps as a result of guided movement between selected vertices w_k (50) [8]:

w_0 → w_1 → ... → w_L (59)

The sequence of vertices w_k (59) is related by (47) to the following sequence of inverse matrices B_k^-1 (49):

B_0^-1 → B_1^-1 → ... → B_L^-1 (60)

The sequence of vertices (59) typically starts at the vertex w_0 = [0, ..., 0]^T related to the identity matrix B_0 = I_n = [e_1, ..., e_n]^T of dimension n × n [7]. The final vertex w_L (59) should assure the minimum value of the collinearity criterion function Φ(w) (38):

Φ(w_L) = min_w Φ(w) (61)

If the criterion function Φ(w) (38) is defined on m (m ≤ n) linearly independent vectors x_j (x_j ∈ C (1)), then the value Φ(w_L) of the collinearity criterion function Φ(w) (38) at the final vertex w_L (59) becomes zero (Φ(w_L) = 0) [8]. The rank r_L (Def. 4) of the final vertex w_L (59) can be equal to the number m of feature vectors x_j (r_L = m) or it can be less than m (r_L < m). The rank r_L of the final vertex w_L (59) is less than m (r_L < m) if the final vertex w_L is degenerate [7].
Consider the invertible matrix B_k = [x_1, ..., x_k, e_i(k+1), ..., e_i(n)]^T (48), which determines the vertex w_k (50) and the value Φ(w_k) of the criterion function Φ(w) (38) in the k-th step. In the step k + 1, one of the unit vectors e_i in the matrix B_k (48) is replaced by the feature vector x_k+1, and the matrix B_k+1 = [x_1, ..., x_k, x_k+1, e_i(k+2), ..., e_i(n)]^T appears. The unit vector e_i(k+1) leaving the matrix B_k (48) is indicated by an exit criterion based on the gradient of the collinearity criterion function Φ(w) (38) [7]. The exit criterion makes it possible to determine the exit edge of the greatest descent of the collinearity criterion function Φ(w) (38). As a result of replacing the unit vector e_i(k+1) with the feature vector x_k+1, the value Φ(w_k) of the collinearity function Φ(w) (38) decreases:

Φ(w_k+1) < Φ(w_k) (62)

After a finite number L (L ≤ m) of steps, the collinearity function Φ(w) (38) reaches its minimum (61) at the final vertex w_L (59).
The sequence (60) of inverse matrices B_k^-1 is obtained in a multi-step process of minimizing the function Φ(w) (38). During the k-th step, the matrix B_k-1 = [x_1, ..., x_k-1, e_i(k), ..., e_i(n)]^T is transformed into the matrix B_k by replacing the unit vector e_i(k) with the feature vector x_k:

B_k = [x_1, ..., x_k-1, x_k, e_i(k+1), ..., e_i(n)]^T (63)

According to the vector Gauss-Jordan transformation, replacing the unit vector e_i(k) with the feature vector x_k during the k-th step results in the following modifications of the columns r_i(k - 1) of the inverse matrix B_k-1^-1 (49) [6]:

r_l(k) = r_l(k - 1) / (x_k^T r_l(k - 1))
(∀i ≠ l) r_i(k) = r_i(k - 1) - (x_k^T r_i(k - 1) / x_k^T r_l(k - 1)) r_l(k - 1) (64)

where l is the index of the column related to the unit vector e_i(k) leaving the basis B_k-1 during the k-th step.
Remark 2: The vector Gauss-Jordan transformation (64) resulting from the replacement of the unit vector e_i(k) with the feature vector x_k in the basis B_k-1 = [x_1, ..., x_k-1, e_i(k), ..., e_i(n)]^T cannot be executed when the below collinearity condition is met [7]:

x_k^T r_l(k - 1) = 0 (65)

The collinearity condition (65) causes a division by zero in Eq. (64). Let the symbol r_l[k] denote the l-th column r_l(k) = [r_l,1(k), ..., r_l,n(k)]^T of the inverse matrix B_k^-1 = [r_1(k), ..., r_n(k)] (49) after the reduction of the last n - k components r_l,i(k):

r_l[k] = [r_l,1(k), ..., r_l,k(k)]^T (66)

Similarly, the symbol x_j[k] = [x_j,1, ..., x_j,k]^T denotes the reduced vector obtained from the feature vector x_j = [x_j,1, ..., x_j,n]^T after removing the last n - k components x_j,i:

x_j[k] = [x_j,1, ..., x_j,k]^T (67)

Lemma 3: The collinearity condition (65) appears during the k-th step when the reduced vector x_k[k] (67) is a linear combination of the reduced basis vectors x_j[k] (67) with j < k:

x_k[k] = Σ_{j<k} α_j x_j[k] (68)

The proof of this lemma results directly from the collinearity condition (65) [7].
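The column transformation (64) together with the collinearity guard (65) can be sketched as a rank-one update of the inverse matrix (the function name is an assumption; it follows the standard Gauss-Jordan pivoting used when a basis row is replaced):

```python
import numpy as np

def basis_exchange_step(B_inv, x, l):
    """Update the columns of B^-1 when row l of the basis B is replaced by
    the feature vector x: r_l <- r_l / (x^T r_l) and, for i != l,
    r_i <- r_i - (x^T r_i / x^T r_l) * r_l. Refuses to pivot when the
    collinearity condition x^T r_l = 0 holds."""
    r = np.array(B_inv, dtype=float)
    pivot = float(x @ r[:, l])
    if abs(pivot) < 1e-12:
        raise ZeroDivisionError("collinearity condition (65): x^T r_l = 0")
    r_l_new = r[:, l] / pivot
    r -= np.outer(r_l_new, x @ r)      # subtract (x^T r_i) * r_l_new from r_i
    r[:, l] = r_l_new
    return r

# Replace row 0 of the identity basis with x = [2, 1]:
R = basis_exchange_step(np.eye(2), np.array([2.0, 1.0]), l=0)
```

R is the exact inverse of the new basis [[2, 1], [0, 1]], so each step keeps B_k^-1 current without a full re-inversion; this is the source of the algorithm's efficiency.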

Small samples of multivariate feature vectors
A small sample of multivariate vectors appears when the number m of feature vectors x_j in the data set C (1) is much smaller than the dimension n of these vectors (m << n). The basis exchange algorithms allow for efficient minimization of the CPL criterion functions also in the case of small samples of multivariate vectors [10]. However, for small samples, some new properties of the basis exchange algorithms become more important. In particular, the regularization (42) of the CPL criterion functions becomes crucial. The new properties of the basis exchange algorithms in the case of a small number m of multidimensional feature vectors x_j (1) are discussed on the example of the collinearity criterion function Φ(w) (38) and the regularized criterion function Ψ(w) (42).
Theorem 4: If the feature vectors x_j constituting the subset C_k (C_k ⊂ C (1)) and used in the definition of the function Φ(w) (38) are linearly independent (Def. 2), then the value Φ(w_L) of the collinearity criterion function Φ(w) at the final vertex w_L (59) is equal to zero (Φ(w_L) = 0).
The proof of Theorem 4 can be based on the stepwise inversion of the matrices B_k (48) [16].
Since the final vertex w_L (59) can be reached with different choices of the n - m unit vectors e_i completing the basis, many different final vertices w_L(l) (59) with Φ(w_L(l)) = 0 may exist. The maximal number l_max of different vertices w_L(l) (59) can be large when m << n:

l_max = n! / (m! (n - m)!) (69)

The choice between different final vertices w_L(l) (59) can be based on the minimization of the regularized criterion function Ψ(w) (42). The regularized function Ψ(w) (42) is the sum of the collinearity function Φ(w) (38) and the weighted sum of the cost functions φ_i^0(w) (43). If Φ(w_L(l)) = 0 (38), then the value Ψ(w_L(l)) of the criterion function Ψ(w) (42) at the final vertex w_L(l) (59) can be given as follows:

Ψ(w_L(l)) = λ Σ_{i ∈ I(w_L(l))} γ_i |w_L(l),i| (70)

where the above sum takes into account only the indices i from the subset I(w_L(l)) of the non-zero components w_L(l),i of the final vertex w_L(l) = [w_L(l),1, ..., w_L(l),n]^T (59):

I(w_L(l)) = {i : w_L(l),i ≠ 0} (71)

If the final vertex w_L(l) (59) is not degenerate (Def. 5), then the matrix B_L(l) (48) is built from all m feature vectors x_j (j ∈ {1, ..., m}) making up the data set C (1) and from n - m selected unit vectors e_i (i ∉ I(w_L(l)) (71)).
The problem of the constrained minimization of the regularized function Ψ(w) (70) at the vertices w_L(l) (59) satisfying the conditions Φ(w_L(l)) = 0 can be formulated in the following way:

min {Ψ(w_L(l)) : w_L(l) (59) and Φ(w_L(l)) = 0} (72)

According to the above formulation, the search for the minimum of the regularized criterion function Ψ(w) (42) takes place at all such vertices w_L(l) (59) where the collinearity function Φ(w) (38) is equal to zero. The regularized criterion function Ψ(w) (42) is defined as follows at the final vertices w_L(l) = [w_L(l),1, ..., w_L(l),n]^T (59), where Φ(w_L(l)) = 0:

Ψ_0(w_L(l)) = λ Σ_{i ∈ I(w_L(l))} γ_i |w_L(l),i| (75)

The optimal vertex w_L(l)* constitutes the minimum value Ψ_0(w_L(l)*) of the CPL criterion function Ψ_0(w) (75) defined on such final vertices w_L(l) (59) where Φ(w_L(l)) = 0 (38):

Ψ_0(w_L(l)*) = min {Ψ_0(w_L(l)) : Φ(w_L(l)) = 0} (76)

As in the case of the minimization of the perceptron criterion function Φ_k^p(v) (29), the optimal vector w_L(l)* (76) may be located at a selected vertex of some convex polyhedron (27) in the parameter space R^n (w ∈ R^n) [7].

Complex layers of linear prognostic models
Complex layers of linear classifiers or prognostic models have been proposed as a scheme for obtaining general classification or forecasting rules designed on the basis of a small number of multidimensional feature vectors x_j [11]. According to this scheme, when designing linear prognostic models, averaging over a small number m of feature vectors x_j of dimension n (m << n) is replaced by averaging over collinear clusters of selected features (genes) X_i. Such an approach to averaging can be linked to the ergodic theory [17].
In the case of a small sample of multivariate vectors, the number m of feature vectors x j in the data set C (1) may be much smaller than the dimension n of these