## Abstract

The main challenges in data mining are related to large, multi-dimensional data sets. There is a need to develop algorithms that are precise and efficient enough to deal with big data problems. The Simplex algorithm from linear programming can be seen as an example of a successful big data problem-solving tool. According to the fundamental theorem of linear programming, the solution of the optimization problem can be found in one of the vertices in the parameter space. The basis exchange algorithms also search for the optimal solution among a finite number of vertices in the parameter space. Basis exchange algorithms enable the design of complex layers of classifiers or predictive models based on a small number of multivariate data vectors.

### Keywords

- data mining
- basis exchange algorithms
- small samples of multivariate vectors
- gene clustering
- prognostic models

## 1. Introduction

Various data mining tools are proposed to extract patterns from data sets [1]. Large, multidimensional data sets impose high requirements as to the precision and efficiency of the calculations used to extract patterns (regularities) useful in practice [2]. In this context, there is still a need to develop new data mining algorithms [3]. New types of patterns are also obtained as a result of combining different types of classification or prognosis models [4].

The Simplex algorithm from linear programming is used as an effective big data mining tool [5]. According to the basic theorem of linear programming, the solution to the linear optimization problem with linear constraints can be found at one of the vertices in the parameter space. Narrowing the search area to a finite number of vertices is a source of the efficiency of the Simplex algorithm.

Basis exchange algorithms also look for an optimal solution among a finite number of vertices in the parameter space [6]. The basis exchange algorithms are based on the Gauss-Jordan transformation and, for this reason, are similar to the Simplex algorithm. Controlling the basis exchange algorithm is related to the minimization of convex and piecewise linear (CPL) criterion functions [7].

The perceptron and collinearity criterion functions belong to the family of CPL functions. The minimization of the perceptron criterion function makes it possible to check the linear separability of data sets and to design piecewise linear classifiers [8]. Minimizing the collinearity criterion function makes it possible to detect collinear (flat) patterns in data sets and to design multiple interaction models [9].

Data sets consisting of a small number of multivariate feature vectors generate specific problems in data mining [10]. This type of data includes genetic data sets. Minimizing the perceptron criterion function or the collinearity criterion function makes it possible to solve discrimination or regression problems even for a small set of multidimensional feature vectors, by using complex layers of low-dimensional linear classifiers or prognostic models [11].

## 2. Linear separability vs. linear dependence

Let us assume that each of *m* objects O_{j} from a given database is represented by an *n*-dimensional feature vector **x**_{j} = [x_{j,1},..., x_{j,n}]^{T} belonging to the feature space **F**[*n*] (**x**_{j} ∈ **F**[*n*]). The data set **C** consists of *m* such feature vectors **x**_{j}:

The components x_{j,i} of the feature vector **x**_{j} are numerical values (x_{j,i} ∈ *R* or x_{j,i} ∈ {0, 1}) of the individual features X_{i} of the *j*-th object O_{j}. In this context, each feature vector **x**_{j} (**x**_{j} ∈ **F**[*n*]) represents *n* features X_{i} belonging to the feature set *F*(*n*) = {X_{1},…, X_{n}}.

The pairs {G_{k}^{+}, G_{k}^{−}} (*k* = 1,…, *K*) of the learning sets G_{k}^{+} and G_{k}^{−} (G_{k}^{+} ∩ G_{k}^{−} = ∅) are formed from some feature vectors **x**_{j} selected from the data set **C** (1):

where J_{k}^{+} and J_{k}^{−} are non-empty sets of indices *j* of the vectors **x**_{j} (J_{k}^{+} ∩ J_{k}^{−} = ∅).

The *positive* learning set G_{k}^{+} is composed of m_{k}^{+} feature vectors **x**_{j} (*j* ∈ J_{k}^{+}). Similarly, the *negative* learning set G_{k}^{−} is composed of m_{k}^{−} feature vectors **x**_{j} (*j* ∈ J_{k}^{−}), where m_{k}^{+} + m_{k}^{−} ≤ *m*.

The possibility of separating the learning sets G_{k}^{+} and G_{k}^{−} (2) by a hyperplane H(**w**_{k}, θ_{k}) in the feature space **F**[*n*] is investigated in pattern recognition [1]:

where **w**_{k} = [w_{k,1},..., w_{k,n}]^{T} ∈ *R*^{n} is the weight vector, θ_{k} ∈ *R*^{1} is the threshold, and **w**_{k}^{T}**x** = Σ_{i} w_{k,i}x_{i} is the scalar product.

*Definition* 1: The learning sets G_{k}^{+} and G_{k}^{−} (2) are *linearly separable* in the feature space **F**[*n*] if and only if there exist a weight vector **w**_{k} (**w**_{k} ∈ *R*^{n}) and a threshold θ_{k} (θ_{k} ∈ *R*^{1}) such that the hyperplane H(**w**_{k}, θ_{k}) (3) separates these sets [7]:

According to the above inequalities, all vectors **x**_{j} from the learning set G_{k}^{+} (2) are located on the positive side of the hyperplane H(**w**_{k}, θ_{k}) (3), and all vectors **x**_{j} from the set G_{k}^{−} lie on the negative side of this hyperplane.

The hyperplane H(**w**_{k}, θ_{k}) (3) separates (4) the sets G_{k}^{+} and G_{k}^{−} (2) with the following margin δ_{L2}(**w**_{k}) based on the Euclidean (L_{2}) norm, which is used in the Support Vector Machines (SVM) method [12]:

where ||**w**_{k}||_{L2} = (**w**_{k}^{T}**w**_{k})^{1/2} is the Euclidean length of the weight vector **w**_{k}.

The margin δ_{L1}(**w**_{k}) with the L_{1} norm related to the hyperplane H(**w**_{k}, θ_{k}) (3), which separates (4) the learning sets G_{k}^{+} and G_{k}^{−} (2), is determined by analogy to (5) as [11]:

where ||**w**_{k}||_{L1} = |w_{k,1}| + ... + |w_{k,n}| is the L_{1} length of the weight vector **w**_{k}.

The margins δ_{L2}(**w**_{k}) (5) and δ_{L1}(**w**_{k}) (6) are maximized to improve the generalization properties of linear classifiers designed from the learning sets G_{k}^{+} and G_{k}^{−} (2) [7].
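Since formulas (5) and (6) are given as displayed equations not reproduced here, the sketch below assumes the common unit-offset forms δ_{L2}(**w**) = 2 / ||**w**||_{L2} and δ_{L1}(**w**) = 2 / ||**w**||_{L1}; either way, both margins grow as the corresponding length of **w** shrinks, which is the point made above:

```python
def margin_l2(w):
    """SVM-style Euclidean margin, assuming the unit-offset form
    delta_L2(w) = 2 / ||w||_L2 (an assumption; Eq. (5) is not shown)."""
    return 2.0 / sum(wi * wi for wi in w) ** 0.5

def margin_l1(w):
    """Margin based on the L1 length of w, by analogy: 2 / ||w||_L1
    (an assumption; Eq. (6) is not shown)."""
    return 2.0 / sum(abs(wi) for wi in w)

w = [3.0, 4.0]
print(margin_l2(w))  # 2 / 5 = 0.4
print(margin_l1(w))  # 2 / 7 ≈ 0.2857
```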

The following set of m_{k}′ = m_{k}^{+} + m_{k}^{−} linear equations can be formulated on the basis of the linear separability inequalities (4):

If we assume that the threshold θ_{k} can be determined later, then we have *n* unknown weights w_{k,i} (**w**_{k} = [w_{k,1},..., w_{k,n}]^{T}) in an underdetermined system of m_{k}′ = m_{k}^{+} + m_{k}^{−} (m_{k}′ ≤ m_{k} < *n*) linear Eqs. (7). In order to obtain a system of *n* linear Eqs. (7) with *n* unknown weights w_{k,i}, additional linear equations based on selected *n* - m_{k}′ unit vectors **e**_{i} (*i* ∈ I_{k}) are taken into account [6]:

The parameter vertex **w**_{k} = [w_{k,1},..., w_{k,n}]^{T} can be determined from the linear Eqs. (7) and (8) if the feature vectors **x**_{j} forming the learning sets G_{k}^{+} and G_{k}^{−} (2) are linearly independent [7].
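The vertex computation from Eqs. (7) and (8) can be sketched as a square linear system: feature-vector rows with right-hand sides θ_{k} + 1 or θ_{k} − 1, completed by unit-vector rows with right-hand side 0. The helper names and the pure-Python Gaussian-elimination solver are illustrative choices, not part of the original method:

```python
def solve(A, b):
    """Solve the square system A w = b by Gauss-Jordan elimination
    with partial pivoting (pure-Python sketch)."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def parameter_vertex(G_plus, G_minus, theta, unit_indices):
    """Vertex w_k(theta_k): feature-vector rows with right-hand sides
    theta+1 / theta-1 (Eq. 7), completed by unit-vector rows e_i (Eq. 8)."""
    n = len((G_plus + G_minus)[0])
    A = [x[:] for x in G_plus] + [x[:] for x in G_minus]
    b = [theta + 1.0] * len(G_plus) + [theta - 1.0] * len(G_minus)
    for i in unit_indices:                       # e_i^T w = 0
        A.append([1.0 if c == i else 0.0 for c in range(n)])
        b.append(0.0)
    return solve(A, b)

# n = 3, two feature vectors, one unit vector e_3 completes the basis
w = parameter_vertex([[1.0, 0.0, 5.0]], [[0.0, 1.0, 7.0]], 0.0, [2])
print(w)  # [1.0, -1.0, 0.0]
```

Note that the component linked to the unit-vector row comes out as zero, in line with conditions (15).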

The feature vector **x**_{j′} (**x**_{j′} ∈ G_{k}^{+} ∪ G_{k}^{−} (2)) is a linear combination of some other vectors **x**_{j(i)} (*j*(*i*) ≠ *j*′) from the learning sets (2) if there exist parameters α_{j′,i} (α_{j′,i} ≠ 0) such that the following relation holds:

*Definition* 2: Feature vectors **x**_{j} making up the learning sets G_{k}^{+} and G_{k}^{−} (2) are linearly independent if none of these vectors **x**_{j′} (**x**_{j′} ∈ G_{k}^{+} ∪ G_{k}^{−}) can be expressed as a linear combination (9) of *l* (*l* ∈ {1,…, *m* - 1}) other vectors **x**_{j(l)} from the learning sets.

If the number m_{k}′ = m_{k}^{+} + m_{k}^{−} of elements **x**_{j} of the learning sets G_{k}^{+} and G_{k}^{−} (2) is smaller than the dimension *n* of the feature space **F**[*n*] (m_{k}^{+} + m_{k}^{−} ≤ *n*), then the parameter vertex **w**_{k}(θ_{k}) can be defined by linear equations in the following matrix form [13]:

where

and

The first m_{k}^{+} components of the vector **1**_{k}(θ_{k}) are equal to θ_{k} + 1, the next m_{k}^{−} components are equal to θ_{k} - 1, and the last *n* - m_{k}^{+} − m_{k}^{−} components are equal to 0. The first m_{k}^{+} rows of the square matrix B_{k} (12) are formed by the feature vectors **x**_{j} (*j* ∈ J_{k}^{+}) from the set G_{k}^{+} (2), the next m_{k}^{−} rows are formed by the vectors **x**_{j} (*j* ∈ J_{k}^{−}) from the set G_{k}^{−} (2), and the last *n* - m_{k}^{+} − m_{k}^{−} rows are made up of unit vectors **e**_{i} (*i* ∈ I_{k}):

If the matrix B_{k} (12) is non-singular, then the inverse matrix B_{k}^{−1} exists:

In this case, the parameter vertex **w**_{k}(θ_{k}) (10) can be defined by the following equation:

where the vector **r**_{k}^{+} is the sum of the first m_{k}^{+} columns **r**_{i} of the inverse matrix B_{k}^{−1} (13), and the vector **r**_{k}^{−} is the sum of the successive m_{k}^{−} columns **r**_{i} of this matrix.

The last *n* - (m_{k}^{+} + m_{k}^{−}) components w_{k,i}(θ_{k}) of the vector **w**_{k}(θ_{k}) = [w_{k,1}(θ_{k}),…, w_{k,n}(θ_{k})]^{T} (14) linked to the zero components of the vector **1**_{k}(θ_{k}) (11) are equal to zero:

The conditions w_{k,i}(θ_{k}) = 0 (15) result from the equations **e**_{i}^{T}**w**_{k}(θ_{k}) = 0 (8) at the vertex **w**_{k}(θ_{k}) (14).

The length ||**w**_{k}(θ_{k})||_{L1} of the weight vector **w**_{k}(θ_{k}) (14) in the L_{1} norm is the sum of the m_{k}′ = m_{k}^{+} + m_{k}^{−} components |w_{k,i}(θ_{k})|:

In accordance with Eq. (14), the components |w_{k,i}(θ_{k})| can be determined as follows:

The length ||**w**_{k}(θ_{k})||_{L1} (16) of the vector **w**_{k}(θ_{k}) (14) with the L_{1} norm is minimized to increase the margin δ_{L1}(**w**_{k}(θ_{k})) (6). The length ||**w**_{k}(θ_{k})||_{L1} (16) can be minimized by selecting the optimal threshold value θ_{k}* on the basis of Eq. (14):

where the optimal vertex **w**_{k}(θ_{k}*) is defined by Eq. (14).

*Theorem* 1: The learning sets G_{k}^{+} and G_{k}^{−} (2) formed by *m* (*m* ≤ *n*) linearly independent (9) feature vectors **x**_{j} are linearly separable (4) in the feature space **F**[*n*] (**x**_{j} ∈ **F**[*n*]).

*Proof*: If the learning sets G_{k}^{+} and G_{k}^{−} (2) are formed by *m* linearly independent feature vectors **x**_{j}, then the non-singular matrix B_{k} = [**x**_{1},…, **x**_{m}, **e**_{i(m + 1)},…, **e**_{i(n)}]^{T} (12) containing these *m* vectors **x**_{j} and *n* - *m* unit vectors **e**_{i} (*i* ∈ I_{k}) can be defined [10]. In this case, the inverse matrix B_{k}^{−1} (13) exists and can determine the vertex **w**_{k}(θ_{k}) (14). The vertex equation B_{k}**w**_{k}(θ_{k}) = **1**_{k}(θ_{k}) (10) can be reformulated for the feature vectors **x**_{j} (2) as follows:

The solution of Eqs. (19) satisfies the linear separability inequalities (4).

It is possible to enlarge the learning sets G_{k}^{+} and G_{k}^{−} (2) in a way that maintains their linear separability (4).

*Lemma* 1: Enlarging the positive learning set G_{k}^{+} (2) by a new vector **x**_{j′} (**x**_{j′} ∉ G_{k}^{+}) that is a linear combination (9) with the parameters α_{j′,i} of some feature vectors **x**_{j(l)} (2) from this set (**x**_{j(l)} ∈ G_{k}^{+}) preserves the linear separability (4) of the learning sets if the parameters α_{j′,i} fulfill the following condition:

If the assumptions of the lemma are met, then

The above inequality means that the linear separability conditions (4) still hold after the enlargement of the learning set G_{k}^{+} (2).

*Lemma* 2: Enlarging the negative learning set G_{k}^{−} (2) by a new vector **x**_{j′} (**x**_{j′} ∉ G_{k}^{−}) that is a linear combination (9) with the parameters α_{j′,i} of some feature vectors **x**_{j(l)} (2) from this set (**x**_{j(l)} ∈ G_{k}^{−}) preserves the linear separability (4) of the learning sets if the parameters α_{j′,i} fulfill the following condition:

The justification of Lemma 2 can be based on an inequality similar to (21).

## 3. Perceptron criterion function

The minimization of the perceptron criterion function makes it possible to assess the degree of linear separability (4) of the learning sets G_{k}^{+} and G_{k}^{−} (2) in different feature subspaces **F**[*n*′] (**F**[*n*′] ⊂ **F**[*n* + 1]) [6]. When defining the perceptron criterion function, it is convenient to use the following augmented feature vectors **y**_{j} (**y**_{j} ∈ **F**[*n* + 1]) and augmented weight vectors **v**_{k} (**v**_{k} ∈ *R*^{n + 1}) [1]:

and

The augmented vectors **y**_{j} are constructed (23) on the basis of the learning sets G_{k}^{+} and G_{k}^{−} (2). These learning sets are extracted from the data set **C** (1) according to some additional knowledge. The linear separability (4) of the learning sets G_{k}^{+} and G_{k}^{−} (2) can be reformulated using the following set of *m* inequalities with the augmented vectors **y**_{j} (23) [7]:

The dual hyperplanes h_{j}^{p} in the parameter space *R*^{n + 1} (**v** ∈ *R*^{n + 1}) are defined on the basis of the augmented vectors **y**_{j} [6]:

The dual hyperplanes h_{j}^{p} (26) divide the parameter space *R*^{n + 1} (**v** ∈ *R*^{n + 1}) into a finite number *L* of disconnected regions (*convex polyhedra*) D_{l}^{p} (*l* = 1,…, *L*) [7]:

where J_{l}^{+} and J_{l}^{−} are disjoint subsets (J_{l}^{+} ∩ J_{l}^{−} = ∅) of the indices *j* of the feature vectors **x**_{j} making up the learning sets G_{k}^{+} and G_{k}^{−} (2).

The perceptron penalty functions φ_{j}^{p}(**v**) are defined as follows for each of the augmented feature vectors **y**_{j} (23) [6]:

The *j*-th penalty function φ_{j}^{p}(**v**) (28) is greater than zero if and only if the weight vector **v** is located on the wrong side (**y**_{j}^{T}**v** < 1) of the *j*-th dual hyperplane h_{j}^{p} (26). The function φ_{j}^{p}(**v**) (28) is linear and greater than zero as long as the parameter vector **v** = [v_{k,1},..., v_{k,n + 1}]^{T} remains on the wrong side of the hyperplane h_{j}^{p} (26). Convex and piecewise-linear (*CPL*) penalty functions φ_{j}^{p}(**v**) (28) are used to enforce the linear separation (25) of the learning sets G_{k}^{+} and G_{k}^{−} (2).

The perceptron criterion function Φ_{k}^{p}(**v**) is defined as the weighted sum of the penalty functions φ_{j}^{p}(**v**) (28) [6]:

The positive parameters α_{j} (α_{j} > 0) can be treated as the prices of individual feature vectors **x**_{j}:

where m_{k}^{+} (m_{k}^{−}) is the number of elements **x**_{j} in the learning set G_{k}^{+} (G_{k}^{−}) (2).
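A sketch of Φ_{k}^{p}(**v**) (29) under two assumptions (the explicit formulas (28) and (30) are not reproduced above): the hinge form φ_{j}^{p}(**v**) = max(0, 1 − **y**_{j}^{T}**v**), which matches the stated properties of (28), and the usual augmentation convention **y**_{j} = [**x**_{j}^{T}, 1]^{T} on G_{k}^{+} and **y**_{j} = −[**x**_{j}^{T}, 1]^{T} on G_{k}^{−}, with prices α_{j} = 1/(2m_{k}^{+}) and 1/(2m_{k}^{−}):

```python
def perceptron_criterion(v, G_plus, G_minus):
    """Phi_k^p(v): weighted sum of CPL penalties over augmented vectors y_j.
    Hinge penalty and augmentation convention are assumptions (see text)."""
    def dot(u, w):
        return sum(a * b for a, b in zip(u, w))
    # assumed augmentation: y = [x, 1] for G+, y = -[x, 1] for G-
    ys_plus = [x + [1.0] for x in G_plus]
    ys_minus = [[-c for c in x] + [-1.0] for x in G_minus]
    a_plus, a_minus = 1.0 / (2 * len(G_plus)), 1.0 / (2 * len(G_minus))
    total = sum(a_plus * max(0.0, 1.0 - dot(y, v)) for y in ys_plus)
    total += sum(a_minus * max(0.0, 1.0 - dot(y, v)) for y in ys_minus)
    return total

G_plus, G_minus = [[2.0], [3.0]], [[-2.0], [-3.0]]
print(perceptron_criterion([1.0, 0.0], G_plus, G_minus))  # 0.0: separable sets
print(perceptron_criterion([0.0, 0.0], G_plus, G_minus))  # 1.0: zero vector v
```

The zero value on a separable pair and the value near one on a non-informative vector illustrate the behavior described by Theorem 2 and the normalization (33).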

The perceptron criterion function Φ_{k}^{p}(**v**) (29) was built on the basis of the *error correction* algorithm, the basic algorithm in the *Perceptron* model of learning processes in neural networks [14].

The criterion function Φ_{k}^{p}(**v**) (29) is convex and piecewise-linear (*CPL*) [6]. This means, among other things, that the function Φ_{k}^{p}(**v**) (29) remains linear within each region D_{l}^{p} (27):

where the summation is performed over all vectors **y**_{j} (23) fulfilling the condition **y**_{j}^{T}**v** < 1.

The optimal vector **v**_{k}* determines the minimum value Φ_{k}^{p}(**v**_{k}*) of the criterion function Φ_{k}^{p}(**v**) (29):

Since the criterion function Φ_{k}^{p}(**v**) (29) is linear in each convex polyhedron D_{l}^{p} (27), the optimal point **v**_{k}* representing the minimum Φ_{k}^{p}(**v**_{k}*) (32) can be located in a selected vertex of some polyhedron D_{l′}^{p} (27). This property of the optimal vector **v**_{k}* (32) follows from the *fundamental theorem of linear programming* [5].

It has been shown that the minimum value Φ_{k}^{p}(**v**_{k}*) (32) of the perceptron criterion function Φ_{k}^{p}(**v**) (29) with the parameters α_{j} (30) is normalized as follows [6]:

The theorem below has been proved [6]:

*Theorem* 2: The minimum value Φ_{k}^{p}(**v**_{k}*) (32) of the perceptron criterion function Φ_{k}^{p}(**v**) (29) is equal to zero (Φ_{k}^{p}(**v**_{k}*) = 0) if and only if the learning sets G_{k}^{+} and G_{k}^{−} (2) are linearly separable (4).

The minimum value Φ_{k}^{p}(**v**_{k}*) (32) is close to one (Φ_{k}^{p}(**v**_{k}*) ≈ 1) if the sets G_{k}^{+} and G_{k}^{−} (2) overlap almost completely. It can also be proved that the minimum value Φ_{k}^{p}(**v**_{k}*) (32) of the perceptron criterion function Φ_{k}^{p}(**v**) (29) does not depend on invertible linear transformations of the feature vectors **y**_{j} (23) [6]. The perceptron criterion function Φ_{k}^{p}(**v**) (29) remains linear inside each region D_{l}^{p} (27).

The regularized criterion function Ψ_{k}^{p}(**v**) is defined as the sum of the perceptron criterion function Φ_{k}^{p}(**v**) (29) and some additional penalty functions [13]. These additional *CPL* functions are equal to the costs γ_{i} (γ_{i} > 0) of individual features X_{i} multiplied by the absolute values |w_{i}| of the weights w_{i}, where **v** = [**w**^{T}, −θ]^{T} = [w_{1},..., w_{n}, −θ]^{T} ∈ *R*^{n + 1} (24):

where λ (λ ≥ 0) is the *cost level*. The standard values of the cost parameters γ_{i} are equal to one ((∀*i* ∈ {1,..., *n*}) γ_{i} = 1).

The optimal vector **v**_{k,λ}* constitutes the minimum value Ψ_{k}^{p}(**v**_{k,λ}*) of the *CPL* criterion function Ψ_{k}^{p}(**v**) (34), which is defined on the elements **x**_{j} of the learning sets G_{k}^{+} and G_{k}^{−} (2):

As in the case of the perceptron criterion function Φ_{k}^{p}(**v**) (29), the optimal vector **v**_{k,λ}* (35) can be located in a selected vertex of some polyhedron D_{l′}^{p} (27). The minimum value Ψ_{k}^{p}(**v**_{k,λ}*) (35) of the criterion function Ψ_{k}^{p}(**v**) (34) is used, among others, in the *relaxed linear separability* (*RLS*) method of gene subset selection [15].

## 4. Collinearity criterion function

Minimization of the collinearity criterion function is used to extract collinear patterns from large, multidimensional data sets (1) [7]. Linear models of multivariate interactions can be formulated on the basis of representative collinear patterns [9].

The collinearity penalty functions φ_{j}(**w**) are determined by the individual feature vectors **x**_{j} = [x_{j,1},..., x_{j,n}]^{T} in the following manner [9]:

The penalty functions φ_{j}(**w**) (36) can be related to the following dual hyperplanes h_{j}^{1} in the parameter (weight) space *R*^{n} (**w** ∈ *R*^{n}):

The *CPL* penalty function φ_{j}(**w**) (36) is equal to zero (φ_{j}(**w**) = 0) at the point **w** = [w_{1},..., w_{n}]^{T} if and only if the point **w** is located on the dual hyperplane h_{j}^{1} (37).

The collinearity criterion function Φ_{k}(**w**) is defined as the weighted sum of the penalty functions φ_{j}(**w**) (36) determined by the feature vectors **x**_{j} forming the data subset C_{k} (C_{k} ⊂ **C** (1)):

where the sum takes into account only the indices *j* of the set J_{k} = {*j*: **x**_{j} ∈ C_{k}}, and the positive parameters β_{j} (β_{j} > 0) in the function Φ_{k}(**w**) (38) can be treated as the *prices* of particular feature vectors **x**_{j}. The standard choice of the parameter values β_{j} is one ((∀*j* ∈ J_{k}) β_{j} = 1.0).

The collinearity criterion function Φ_{k}(**w**) (38) is convex and piecewise-linear (*CPL*) as a sum of penalty functions φ_{j}(**w**) (36) of this type [9]. The vector **w**_{k}* determines the minimum value Φ_{k}(**w**_{k}*) of the criterion function Φ_{k}(**w**) (38):

*Definition* 3: The data subset C_{k} (C_{k} ⊂ **C** (1)) is *collinear* when all feature vectors **x**_{j} from this subset are located on some hyperplane H(**w**, θ) = {**x**: **w**^{T}**x** = θ} with θ ≠ 0.

*Theorem* 3: The minimum value Φ_{k}(**w**_{k}*) (39) of the collinearity criterion function Φ_{k}(**w**) (38) defined on the feature vectors **x**_{j} constituting a data subset C_{k} (C_{k} ⊂ **C** (1)) is equal to zero (Φ_{k}(**w**_{k}*) = 0) when this subset C_{k} is collinear (*Def*. 3) [9].

Different collinear subsets C_{k} can be extracted from a data set **C** (1) with a large number *m* of elements **x**_{j} by minimizing the collinearity criterion function Φ_{k}(**w**) (38) [9].

The minimum value Φ_{k}(**w**_{k}*) (39) of the collinearity criterion function Φ_{k}(**w**) (38) can be reduced to zero by omitting some feature vectors **x**_{j} from the data subset C_{k} (C_{k} ⊂ **C** (1)). If the minimum value Φ_{k}(**w**_{k}*) (39) is greater than zero (Φ_{k}(**w**_{k}*) > 0), then we can select feature vectors **x**_{j} (*j* ∈ J_{k}(**w**_{k}*)) with the penalty φ_{j}(**w**_{k}*) (36) greater than zero:

Omitting one feature vector **x**_{j′} (*j*′ ∈ J_{k}(**w**_{k}*)) with the above property results in the following reduction of the minimum value Φ_{k}(**w**_{k}*) (39):

where Φ_{k′}(**w**_{k′}*) is the minimum value (39) of the collinearity criterion function Φ_{k′}(**w**) (38) defined on the feature vectors **x**_{j} constituting the data subset C_{k} reduced by the vector **x**_{j′}.

The regularized criterion function Ψ_{k}(**w**) is defined as the sum of the collinearity criterion function Φ_{k}(**w**) (38) and some additional *CPL* penalty functions φ_{j}^{0}(**w**) [7]:

where λ ≥ 0 is the *cost level*. The standard values of the cost parameters γ_{i} are equal to one ((∀*i* ∈ {1,…, *n*}) γ_{i} = 1). The additional *CPL* penalty functions φ_{j}^{0}(**w**) are defined below [7]:

The functions φ_{j}^{0}(**w**) (43) are related to the following dual hyperplanes h_{j}^{0} in the parameter (*weight*) space *R*^{n} (**w** ∈ *R*^{n}):

The *CPL* penalty function φ_{j}^{0}(**w**) (43) is equal to zero (φ_{j}^{0}(**w**) = 0) at the point **w** = [w_{1},..., w_{n}]^{T} if and only if this point is located on the dual hyperplane h_{j}^{0} (44).

## 5. Parameter vertices

The perceptron criterion function Φ_{k}^{p}(**v**) (29) and the collinearity criterion function Φ_{k}(**w**) (38) are convex and piecewise-linear (*CPL*). The minimum values of such *CPL* criterion functions can be located in the parameter vertices of some convex polyhedra. We consider the parameter vertices **w**_{k} (**w**_{k} ∈ *R*^{n}) related to the collinearity criterion function Φ_{k}(**w**) (38).

*Definition* 4: The *parameter vertex* **w**_{k} of the *rank* r_{k} (r_{k} ≤ *n*) in the weight space *R*^{n} (**w**_{k} ∈ *R*^{n}) is the intersection point of r_{k} hyperplanes h_{j}^{1} (37) defined by linearly independent feature vectors **x**_{j} (*j* ∈ J_{k}) from the data set **C** (1) and of *n* - r_{k} hyperplanes h_{i}^{0} (44) defined by unit vectors **e**_{i} (*i* ∈ I_{k}) [7].

The *j*-th dual hyperplane h_{j}^{1} (37) defined by the feature vector **x**_{j} (1) passes through the *k*-th *vertex* **w**_{k} if the equation **w**_{k}^{T}**x**_{j} = 1 holds.

*Definition* 5: The *k*-th weight vertex **w**_{k} of the rank r_{k} is *degenerate* in the parameter space *R*^{n} if the number m_{k} of hyperplanes h_{j}^{1} (37) passing through this vertex (**w**_{k}^{T}**x**_{j} = 1) is greater than the rank r_{k} (m_{k} > r_{k}).

The vertex **w**_{k} can be defined by the following set of *n* linear equations:

and

Eqs. (45) and (46) can be represented in the matrix form below [7]:

where **1**_{k} = [1,…, 1, 0,…, 0]^{T} is the vector with the first r_{k} components equal to one and the remaining *n* - r_{k} components equal to zero.

The square matrix B_{k} (47) consists of r_{k} feature vectors **x**_{j} (*j* ∈ J_{k} (45)) and *n* - r_{k} unit vectors **e**_{i} (*i* ∈ I_{k} (46)):

where the symbol **e**_{i(l)} denotes the unit vector that forms the *l*-th row of the matrix B_{k}.

Since the feature vectors **x**_{j} (∀*j* ∈ J_{k}(**w**_{k}) (45)) making up r_{k} rows of the matrix B_{k} (48) are linearly independent, the inverse matrix B_{k}^{−1} exists:

The inverse matrix B_{k}^{−1} (49) can be obtained starting from the unit matrix **I**_{n} = [**e**_{1},..., **e**_{n}]^{T} and using the basis exchange algorithm [8].

The non-singular matrix B_{k} (48) is the *basis* of the feature space **F**[*n*] related to the vertex **w**_{k} = [w_{k,1},…, w_{k,n}]^{T}. Since the last *n* - r_{k} components of the vector **1**_{k} (47) are equal to zero, the following equation holds:

According to Eq. (50), the weight vertex **w**_{k} is the sum of the first r_{k} columns **r**_{i} of the inverse matrix B_{k}^{−1} (49).

*Remark* 1: The *n* - r_{k} components w_{k,i} of the vector **w**_{k} = [w_{k,1},…, w_{k,n}]^{T} (50) linked to the zero components of the vector **1**_{k} = [1,…, 1, 0,…, 0]^{T} (47) are equal to zero:

The conditions w_{k,i} = 0 (51) result from the equations **w**_{k}^{T}**e**_{i} = 0 (46) at the vertex **w**_{k}.

The *fundamental theorem of linear programming* shows that the minimum Φ_{k}(**w**_{k}*) (39) of the *CPL* collinearity criterion function Φ_{k}(**w**) (38) can always be located in one of the vertices **w**_{k} (50) [5]. The regularized criterion function Ψ_{k}(**w**) (42), another function of the *CPL* type, has the same property [7].

We can see that all feature vectors **x**_{j} (1) which define hyperplanes h_{j}^{1} (37) passing through the vertex **w**_{k} are located on the hyperplane H(**w**_{k}, 1) = {**x**: **w**_{k}^{T}**x** = 1} (3) in the feature space **F**[*n*]. A large number m_{k} of feature vectors **x**_{j} (1) located on the hyperplane H(**w**_{k}, 1) (3) form the *collinear cluster* **C**(**w**_{k}) based on the vertex **w**_{k} [8]:

If the vertex **w**_{k} of the rank r_{k} is degenerate in the parameter space *R*^{n}, then the collinear cluster **C**(**w**_{k}) (52) contains more than r_{k} feature vectors **x**_{j} (1).

The *k*-th vertex **w**_{k} = [w_{k,1},…, w_{k,n}]^{T} in the parameter space *R*^{n} (**w**_{k} ∈ *R*^{n}) is linked by Eq. (47) to the non-singular matrix B_{k} (48). The rows of the matrix B_{k} (48) can form the *basis* of the feature space **F**[*n*]. The conditions w_{k,i} = 0 (51) result from the equations **w**_{k}^{T}**e**_{i} = 0 (46) at the vertex **w**_{k}.

Each feature vector **x**_{j} from the data set **C** (1) represents *n* features X_{i} belonging to the feature set *F*(*n*) = {X_{1},…, X_{n}}. The *k*-th *vertexical feature subset* R_{k}(r_{k}) consists of the r_{k} features X_{i} that are connected to the weights w_{k,i} different from zero (w_{k,i} ≠ 0):

The *k*-th *vertexical subspace* F_{k}[r_{k}] (F_{k}[r_{k}] ⊂ **F**[*n*]) contains the reduced vectors **x**_{j}[r_{k}] with r_{k} components x_{j,i(l)} (**x**_{j}[r_{k}] ∈ F_{k}[r_{k}]) related to the weights w_{k,i} different from zero:

The reduced vectors **x**_{j}[r_{k}] (55) are obtained from the feature vectors **x**_{j} = [x_{j,1},..., x_{j,n}]^{T} belonging to the data set **C** (1) by omitting the *n* - r_{k} components x_{j,i} related to the weights w_{k,i} equal to zero (w_{k,i} = 0).

We consider the optimal vertexical subspace F_{k}*[r_{k}] (F_{k}*[r_{k}] ⊂ **F**[*n*]) related to the reduced optimal vertex **w**_{k}*[r_{k}], which determines the minimum Φ_{k}(**w**_{k}*) (39) of the collinearity criterion function Φ_{k}(**w**) (38). The optimal collinear cluster **C**(**w**_{k}*[r_{k}]) (52) is based on the optimal vertex **w**_{k}*[r_{k}] = [w_{k,1}*,…, w_{k,rk}*]^{T} with r_{k} components w_{k,i}* different from zero (w_{k,i}* ≠ 0). Feature vectors **x**_{j} belonging to the collinear cluster **C**(**w**_{k}*) (52) satisfy the equations **w**_{k}*[r_{k}]^{T}**x**_{j}[r_{k}] = 1, hence:

where x_{j,i(l)} are the components of the *j*-th feature vector **x**_{j} related to the weights w_{k,i} different from zero (w_{k,i} ≠ 0).

A large number m_{k} of feature vectors **x**_{j} (1) belonging to the collinear cluster **C**(**w**_{k}*[r_{k}]) (52) justifies the following collinear model of interaction between selected features X_{i(l)}, which is based on Eqs. (56) [9]:

The collinear interaction model (57) allows, inter alia, the design of the following prognostic models for each feature X_{i′} from the subset R_{k}(r_{k}) (54):

where β_{i′,0} = 1 / w_{k,i′}*, β_{i′,i′} = 0, and (∀*i*(*l*) ≠ *i*′) β_{i′,i(l)} = w_{k,i(l)}* / w_{k,i′}*.

The feature X_{i′} is a dependent variable in the prognostic model (58); the remaining r_{k} - 1 features X_{i(l)} are independent variables (*i*(*l*) ≠ *i*′). A family of r_{k} prognostic models (58) can be designed on the basis of one collinear interaction model (57). The models (58) have a better justification when a large number m_{k} of feature vectors **x**_{j} (1) belong to the collinear cluster **C**(**w**_{k}*[r_{k}]) (52).

## 6. Basis exchange algorithm

The collinearity criterion function Φ(**w**) (38), like other convex and piecewise-linear (*CPL*) criterion functions, can be minimized using the basis exchange algorithm [8]. The basis exchange algorithm aimed at minimization of the collinearity criterion function Φ(**w**) (38) is described below.

According to the basis exchange algorithm, the optimal vertex **w**_{k}*, which constitutes the minimum value Φ_{k}(**w**_{k}*) (39) of the collinearity function Φ_{k}(**w**) (38), is reached after a finite number *L* of steps *l* as a result of guided movement between selected vertices **w**_{k} (50) [8]:

The sequence of vertices **w**_{k} (59) is related by (47) to the following sequence of inverse matrices B_{k}^{−1} (49):

The sequence of vertices **w**_{k(l)} (59) typically starts at the vertex **w**_{0} = [0,..., 0]^{T} related to the identity matrix B_{0} = **I**_{n} = [**e**_{1},..., **e**_{n}]^{T} of dimension *n* × *n* [7]. The final vertex **w**_{L} (59) should assure the minimum value of the collinearity criterion function Φ(**w**) (38):

If the criterion function Φ(**w**) (38) is defined on *m* (*m* ≤ *n*) linearly independent vectors **x**_{j} (**x**_{j} ∈ **C** (1)), then the value Φ(**w**_{L}) of the collinearity criterion function Φ(**w**) (38) at the final vertex **w**_{L} (59) becomes zero (Φ(**w**_{L}) = 0) [8]. The rank r_{L} (*Def.* 4) of the final vertex **w**_{L} (59) can be equal to the number *m* of feature vectors **x**_{j} (r_{L} = *m*) or it can be less than *m* (r_{L} < *m*). The rank r_{L} of the final vertex **w**_{L} (59) is less than *m* (r_{L} < *m*) if the final vertex **w**_{L} is degenerate [7].

Consider the reversible matrix B_{k} = [**x**_{1},..., **x**_{k}, **e**_{i(k + 1)},..., **e**_{i(n)}]^{T} (48), which determines the vertex **w**_{k} (50) and the value Φ_{k}(**w**_{k}) of the criterion function Φ_{k}(**w**) (38) in the *k*-th step. In step (*k* + 1), one of the unit vectors **e**_{i} in the matrix B_{k} (48) is replaced by the feature vector **x**_{k + 1}, and the matrix B_{k + 1} = [**x**_{1},..., **x**_{k}, **x**_{k + 1}, **e**_{i(k + 2)},..., **e**_{i(n)}]^{T} appears. The unit vector **e**_{i(k + 1)} leaving the matrix B_{k} (48) is indicated by an *exit criterion* based on the gradient of the collinearity criterion function Φ(**w**) (38) [7]. The exit criterion allows determination of the exit edge **r**_{k + 1} (49) of the greatest descent of the collinearity criterion function Φ(**w**) (38). As a result of replacing the unit vector **e**_{i(k + 1)} with the feature vector **x**_{k + 1}, the value Φ(**w**_{k}) of the collinearity function Φ(**w**) (38) decreases (41):

After a finite number *L* (*L* ≤ *m*) of steps *k*, the collinearity function Φ(**w**) (38) reaches its minimum (61) at the final vertex **w**_{L} (59).

The sequence (60) of inverse matrices B_{k}^{−1} is obtained in a multi-step process of minimizing the function Φ(**w**) (38). During the *k*-th step, the matrix B_{k-1} = [**x**_{1},…, **x**_{k-1}, **e**_{i(k)},…, **e**_{i(n)}]^{T} (48) is transformed into the matrix B_{k} by replacing the unit vector **e**_{i(k)} with the feature vector **x**_{k}:

According to the vector Gauss-Jordan transformation, replacing the unit vector **e**_{i(l+1)} with the feature vector **x**_{l+1} results in the following modifications of the columns **r**_{i} of the inverse matrix B_{l}^{−1} (49) of the basis B_{l} = [**x**_{1},…, **x**_{l}, **e**_{i(l+1)},…, **e**_{i(n)}]^{T} [6]:

where * i*(

*+1) is the index of the unit vector*l

e

_{i(l+1)}leaving the basis

B

_{l}= [

x

_{1},...,

x

_{l},

e

_{i(l+1)},...,

e

_{i(n)}]

^{T}during the

*-th stage.*l

*Remark* 2: The vector Gauss-Jordan transformation (64) resulting from replacing the unit vector **e**_{i(k)} with the feature vector **x**_{k} in the basis **B**_{k−1} = [**x**_{1},..., **x**_{k−1}, **e**_{i(k)},..., **e**_{i(n)}]^{T} cannot be executed when the below *collinearity condition* is met [7]:

The collinearity condition (65) causes a division by zero in Eq. (64).
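A minimal numerical sketch (in Python, with illustrative data; the column-update rule is written out from the description above rather than copied from Eq. (64)) may clarify both the transformation and the collinearity condition: the inverse **B**^{−1} is stored as a list of columns **r**_{i}, and the pivot **x** · **r**_{k} that divides every updated column is exactly the quantity that vanishes when condition (65) is met.

```python
def gauss_jordan_exchange(inv_cols, x, k):
    """One basis exchange step: replace the k-th basis vector by the
    feature vector x and update the columns r_i of the inverse matrix.
    Raises ZeroDivisionError when the collinearity condition is met,
    i.e. when the pivot x . r_k equals zero."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    pivot = dot(x, inv_cols[k])                # x . r_k
    if pivot == 0:
        raise ZeroDivisionError("collinearity condition: x . r_k = 0")
    r_k_new = [c / pivot for c in inv_cols[k]]
    new_cols = []
    for i, r_i in enumerate(inv_cols):
        if i == k:
            new_cols.append(r_k_new)
        else:
            coef = dot(x, r_i)                 # x . r_i
            new_cols.append([a - coef * b for a, b in zip(r_i, r_k_new)])
    return new_cols

# Start from B_0 = I_2; the inverse columns are the unit vectors e_i.
inv_cols = [[1.0, 0.0], [0.0, 1.0]]
inv_cols = gauss_jordan_exchange(inv_cols, [2.0, 1.0], 0)   # e_1 -> x_1
print(inv_cols)   # [[0.5, 0.0], [-0.5, 1.0]]
# A vector collinear with x_1 cannot enter the basis (zero pivot):
try:
    gauss_jordan_exchange(inv_cols, [4.0, 2.0], 1)
except ZeroDivisionError as err:
    print(err)    # collinearity condition: x . r_k = 0
```

Each exchange costs only one pass over the columns, which is what makes the stepwise inversion cheap compared to inverting the basis from scratch.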

Let the symbol **r**_{l}[k] denote the l-th column **r**_{l}(k) = [r_{l,1}(k),..., r_{l,n}(k)]^{T} of the inverse matrix **B**_{k}^{−1} = [**r**_{1}(k),…, **r**_{k−1}(k), **r**_{k}(k),…, **r**_{n}(k)] (49) after the reduction of the last n − k components r_{l,i}(k):

Similarly, the symbol **x**_{j}[k] = [x_{j,1},..., x_{j,k}]^{T} means the reduced vector obtained from the feature vector **x**_{j} = [x_{j,1},..., x_{j,n}]^{T} after reducing the last n − k components x_{j,i}:

*Lemma* 3: The collinearity condition (65) appears during the k-th step when the reduced vector **x**_{k}[k] (66) is a linear combination of the reduced basis vectors **x**_{j}[k] (67) with j < k:

where (∀i ∈ {1,…, k − 1}) α_{i} ∈ R^{1}.

The proof of this lemma results directly from the collinearity condition (65) [7].

## 7. Small samples of multivariate feature vectors

A small sample of multivariate vectors appears when the number m of feature vectors **x**_{j} in the data set **C** (1) is much smaller than the dimension n of these vectors (m << n). The basis exchange algorithms allow for efficient minimization of the CPL criterion functions also in the case of small samples of multivariate vectors [10]. However, for small samples, some new properties of the basis exchange algorithms become more important. In particular, the regularization (42) of the CPL criterion functions becomes crucial. New properties of the basis exchange algorithms in the case of a small number m of multidimensional feature vectors **x**_{j} (1) are discussed on the example of the collinearity criterion function Φ(**w**) (38) and the regularized criterion function Ψ(**w**) (42).

*Lemma* 4: The value Φ(**w**_{L}) of the collinearity criterion function Φ(**w**) (38) at the final vertex **w**_{L} (59) is equal to zero if all m linear Eqs. (45) are fulfilled at the vertex **w**_{L}, which is related by Eq. (47) to the matrix **B**_{L} = [**x**_{1},..., **x**_{m}, **e**_{i(1)},..., **e**_{i(n−m)}]^{T} (48) containing the unit vectors **e**_{i} with the indices i from the subset I_{L} (i ∈ I_{L}).

*Theorem* 4: If the feature vectors **x**_{j} constituting the subset C_{k} (C_{k} ⊂ **C** (1)) and used in the definition of the function Φ(**w**) (38) are linearly independent (*Def.* 2), then the value Φ(**w**_{L}) of the collinearity criterion function Φ(**w**) at the final vertex **w**_{L} (59) is equal to zero (Φ(**w**_{L}) = 0).

The proof of Theorem 4 can be based on the stepwise inversion of the matrices **B**_{k} (48) [16]. The final vertex **w**_{L} (59) can be found by inverting the related matrix **B**_{L} = [**x**_{1},..., **x**_{rk}, **e**_{i(1)},..., **e**_{i(n−rk)}]^{T} (48).

The final vertex **w**_{L} (59) zeroing the criterion function Φ(**w**) (38) (Φ(**w**_{L}) = 0) can be related to the optimal matrix **B**_{L} = [**x**_{1},..., **x**_{L}, **e**_{i(L+1)},..., **e**_{i(n)}]^{T} (48) built from L (L ≤ m) feature vectors **x**_{j} (j ∈ J(**w**_{L}) (45)) from the data set **C** (1) and from n − L selected unit vectors **e**_{i} (i ∈ I(**w**_{L}) (46)). Different subsets of the unit vectors **e**_{i} in the final matrix **B**_{L} (48) result in different positions of the final vertices **w**_{L(l)} (59) in the parameter space R^{n}. The criterion function Φ(**w**) (38) is equal to zero (Φ(**w**_{L(l)}) = 0) at each of these vertices **w**_{L(l)} (59).

The position of the final vertices **w**_{L(l)} (59) in the parameter space R^{n} depends on which unit vectors **e**_{i} (i ∈ I_{L(l)}) are included in the basis **B**_{L(l)} (48), where:

The maximal number l_{max} (69) of different vertices **w**_{L(l)} (59) can be large when m << n:
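For intuition, with hypothetical sample sizes: each final basis combines the m feature vectors with n − m of the n unit vectors **e**_{i}, so the number of ways of selecting those unit vectors already gives a (combinatorial upper bound on the) count of candidate final vertices.

```python
from math import comb

# Hypothetical small sample: m = 20 feature vectors, n = 10000 features.
m, n = 20, 10000
# Each final basis B_L(l) pairs the m feature vectors with n - m of the
# n unit vectors e_i, so the number of such selections bounds the number
# of candidate final vertices w_L(l):
upper_bound = comb(n, n - m)
print(upper_bound == comb(n, m))   # True: keeping n - m vectors equals dropping m
```

Even for these modest sizes the bound exceeds 10^60, which is why an exhaustive search over vertices is out of the question and a directed exchange strategy is needed.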

The choice between the different final vertices **w**_{L(l)} (59) can be based on the minimization of the regularized criterion function Ψ(**w**) (42). The regularized function Ψ(**w**) (42) is the sum of the collinearity function Φ(**w**) (38) and the weighted sum of the cost functions φ_{i}^{0}(**w**) (43). If Φ(**w**_{L(l)}) = 0 (38), then the value Ψ(**w**_{L(l)}) of the criterion function Ψ(**w**) (42) at the final vertex **w**_{L(l)} (59) can be given as follows:

where the above sums take into account only the indices i of the subset I(**w**_{L(l)}) of the non-zero components w_{L(l),i} of the final vertex **w**_{L(l)} = [w_{L(l),1},…, w_{L(l),n}]^{T} (59):

If the final vertex **w**_{L(l)} (59) is not degenerate (*Def.* 5), then the matrix **B**_{L(l)} (48) is built from all m feature vectors **x**_{j} (j ∈ {1,..., m}) making up the data set **C** (1) and from n − m selected unit vectors **e**_{i} (i ∈ I(**w**_{L(l)}) (71)).

The problem of the constrained minimization of the regularized function Ψ(**w**) (42) at the vertices **w**_{L(l)} (59) satisfying the condition Φ(**w**_{L(l)}) = 0 (69) can be formulated in the following way:

According to the above formulation, the search for the minimum of the regularized criterion function Ψ(**w**) (42) takes place at all such vertices **w**_{L(l)} (59) where the collinearity function Φ(**w**) (38) is equal to zero. The regularized criterion function Ψ(**w**) (42) is defined as follows at the final vertices **w**_{L(l)} = [w_{L(l),1},…, w_{L(l),n}]^{T} (59), where Φ(**w**_{L(l)}) = 0:

The optimal vertex **w**_{L(l)}* yields the minimum value Ψ′(**w**_{L(l)}*) of the CPL criterion function Ψ′(**w**) (75) defined on such final vertices **w**_{L(l)} (59) where Φ(**w**_{L(l)}) = 0 (38):

As in the case of the minimization of the perceptron criterion function Φ_{k}^{p}(**v**) (29), the optimal vector **w**_{L(l)}* (76) may be located at a selected vertex of some convex polyhedron (27) in the parameter space R^{n} (**w** ∈ R^{n}) [7].

If the cost parameters γ_{i} (42) have the standard value of one ((∀i ∈ {1,…, n}) γ_{i} = 1), then the constrained minimization problem (74) leads to the optimal vertex **w**_{L(l)}* with the smallest L_{1} length ||**w**_{L(l)}*||_{L1} = |w_{L(l),1}*| + … + |w_{L(l),n}*|, where Φ(**w**_{L(l)}*) = 0 (38):

The optimal vertex **w**_{L(l)}* with the smallest L_{1} length ||**w**_{L(l)}*||_{L1} (77) is related to the largest L_{1} margin δ_{L1}(**w**_{L(l)}*) (6) [11]:

The basis exchange algorithms allow solving the constrained minimization problem (74) and finding the optimal vertex **w**_{L(l)}* (77) with the largest L_{1} margin δ_{L1}(**w**_{L(l)}*).
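The following toy sketch is not the basis exchange algorithm itself; assuming the linear equations (45) have the form **x**_{j}^{T}**w** = 1, it finds the same kind of solution by brute force, enumerating every vertex (support of at most m non-zero weights) and keeping the one with the smallest L_{1} length, as problem (74)/(77) selects. The basis exchange algorithm reaches such a vertex without exhaustive enumeration.

```python
from itertools import combinations

def solve(A, b):
    """Solve a small square linear system by Gauss-Jordan elimination;
    return None if the chosen sub-matrix is singular (degenerate choice)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                M[r] = [v - M[r][col] * w for v, w in zip(M[r], M[col])]
    return [row[-1] for row in M]

def min_l1_vertex(X):
    """Enumerate vertices w with at most m non-zero components satisfying
    x_j . w = 1 for every feature vector x_j, and return the one with the
    smallest L1 length."""
    m, n = len(X), len(X[0])
    best, best_norm = None, float("inf")
    for support in combinations(range(n), m):
        sub = [[x[i] for i in support] for x in X]     # m x m sub-system
        w_sub = solve(sub, [1.0] * m)
        if w_sub is None:
            continue
        w = [0.0] * n
        for i, v in zip(support, w_sub):
            w[i] = v
        norm = sum(abs(v) for v in w)
        if norm < best_norm:
            best, best_norm = w, norm
    return best, best_norm

X = [[1.0, 2.0, 0.0],     # two feature vectors (m = 2) in R^3 (n = 3)
     [0.0, 2.0, 3.0]]
w_star, l1 = min_l1_vertex(X)
print(w_star, l1)         # [0.0, 0.5, 0.0] 0.5
```

In this small example the best vertex puts all its weight on the feature shared by both vectors; with m << n, the same preference for few non-zero weights is what drives the feature (gene) selection described later.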

Support Vector Machines (SVM) are the most popular method for designing linear classifiers or prognostic models with large margins [12]. According to the SVM approach, the optimal linear classifier or prognostic model is defined by such an optimal weight vector **w*** that has a maximum margin δ_{L2}(**w***) based on the Euclidean (L_{2}) norm:

Maximization of the Euclidean margins δ_{L2}(** w**) (79) is performed using quadratic programming [2].

## 8. Complex layers of linear prognostic models

Complex layers of linear classifiers or prognostic models have been proposed as a scheme for obtaining general classification or forecasting rules designed on the basis of a small number of multidimensional feature vectors **x**_{j} [11]. According to this scheme, when designing linear prognostic models, averaging over a small number m of feature vectors **x**_{j} of dimension n (m << n) is replaced by averaging over collinear clusters of selected features (genes) X_{i}. Such an approach to averaging can be linked to ergodic theory [17].

In the case of a small sample of multivariate vectors, the number m of feature vectors **x**_{j} in the data set **C** (1) may be much smaller than the dimension n of these vectors (m << n). In this case, the collinear cluster **C**(**w**_{k}*[r_{k}]) (52) may contain all feature vectors **x**_{j} from the set **C** (1), and the vertex **w**_{k}*[r_{k}] may have the rank r_{k} equal to m (r_{k} = m).

As follows from Theorem 4, if the collinearity criterion function Φ(**w**) (38) is defined on linearly independent (*Def.* 2) feature vectors **x**_{j}, then the values Φ(**w**_{m(l)}) of this function at each final vertex **w**_{m(l)} (59) are equal to zero (Φ(**w**_{m(l)}) = 0). Each final vertex **w**_{m(l)} (59) can be reached in m steps k (k = 1, …, m) starting from the vertex **w**_{0} = [0,..., 0]^{T} related to the identity matrix **B**_{0} = I_{n} = [**e**_{1},..., **e**_{n}]^{T}.

Minimization of the collinearity criterion function Φ(**w**) (38), followed by minimization of the criterion function Ψ′(**w**_{L(l)}) (75) at the final vertices **w**_{L(l)} (59), allows determining the optimal vertex **w**_{L(l)}* (77) with the largest L_{1} margin δ_{L1}(**w**_{L(l)}*) (78). If the feature vectors **x**_{j} (1) are linearly independent, then the optimal vertex **w**_{L(l)}* (77) is related to the optimal basis **B**_{L(l)}* = [**x**_{1},..., **x**_{m}, **e**_{i(m+1)},..., **e**_{i(n)}]^{T}, which contains all m feature vectors **x**_{j} (1) and n − m unit vectors **e**_{i} with the indices i belonging to the optimal subset I(**w**_{L(l)}*) (71) (i ∈ I(**w**_{L(l)}*)).

The optimal basis **B**_{m}* = [**x**_{1},..., **x**_{m}, **e**_{i(m+1)},..., **e**_{i(n)}]^{T} (73) is found in two stages. In the first stage, m feature vectors **x**_{j} (1) are introduced into the matrices **B**_{k} = [**x**_{1},..., **x**_{k}, **e**_{i(k+1)},..., **e**_{i(n)}]^{T} (k = 0, 1,…, m − 1). The inverse matrices **B**_{k}^{−1} (49) are computed in accordance with the vector Gauss-Jordan transformation (64). In the second stage, the unit vectors **e**_{i(l)} in the matrices **B**_{m(l)} (73) are exchanged in order to minimize the CPL function Ψ′(**w**_{m(l)}) (75) at the final vertices **w**_{m(l)} (77). The optimal basis **B**_{m}* defines (47) the optimal vertex **w**_{m(l)}* (77), which is characterized by the largest margin δ_{L1}(**w**_{m(l)}*) (78).

The vertexical feature subspace F_{1}*[m] (F_{1}*[m] ⊂ F[n] (1)) can be obtained on the basis of the optimal vertex **w**_{m(l)}* (77) with the largest margin δ_{L1}(**w**_{m(l)}*) (78). The vertexical subspace F_{1}*[m] contains the reduced vectors **x**_{1,j}[m] of dimension m [7]:

The reduced vectors **x**_{1,j}[m] (80) are obtained from the feature vectors **x**_{j} = [x_{j,1},..., x_{j,n}]^{T} (**x**_{j} ∈ F[n]) by ignoring such components x_{j,i} which are related to the unit vectors **e**_{i} in the optimal basis **B**_{1}* = [**x**_{1},..., **x**_{m}, **e**_{i(m+1)},..., **e**_{i(n)}]^{T} (73). The reduced vectors **x**_{1,j}[m] are represented by such m features X_{i} (X_{i} ∈ R_{1}* (54)) which are not linked to the unit vectors **e**_{i} (i ∉ I_{m(l)}*) in the basis **B**_{m(l)}* (73) representing the optimal vertex **w**_{m(l)}* (77).
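The reduction (80) amounts to simple component selection; a sketch with hypothetical data and an assumed index set of the unit vectors remaining in the optimal basis:

```python
def reduce_vectors(X, unit_indices):
    """Drop from each feature vector x_j the components x_j,i linked to
    the unit vectors e_i remaining in the optimal basis (Eq. 80); the
    kept components correspond to the selected features X_i."""
    dropped = set(unit_indices)
    kept = [i for i in range(len(X[0])) if i not in dropped]
    return [[x[i] for i in kept] for x in X], kept

# Hypothetical: n = 5 features, unit vectors e_4 and e_5 (indices 3, 4)
# stay in the basis, so the features with indices 0, 1, 2 span the
# reduced (vertexical) subspace.
X = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 0]]
reduced, kept = reduce_vectors(X, unit_indices=[3, 4])
print(reduced)   # [[1, 2, 3], [6, 7, 8]]
print(kept)      # [0, 1, 2]
```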

The m features X_{i(l)} belonging to the optimal subset R_{1}* (X_{i(l)} ∈ R_{1}* (81)) are related to the weights w_{k,l}* (**w**_{k}*[m] = [w_{k,1}*,…, w_{k,m}*]^{T}) that are not zero (w_{k,l}* ≠ 0).

The optimal feature subset R_{1}* (81) consists of m collinear features X_{i}. The optimal vertex **w**_{1}*[m] (Φ(**w**_{1}*[m]) = 0 (69)) in the reduced parameter space R^{m} (**w**_{1}*[m] ∈ R^{m}) is based on these m features X_{i}. The reduced optimal vertex **w**_{1}*[m] with the largest margin δ_{L1}(**w**_{1}*[m]) (77) is the unique solution of the constrained optimization problem (74). Maximizing the L_{1} margin δ_{L1}(**w**_{l}) (78) leads to the first reduced vertex **w**_{1}*[m] = [w_{k,1}*,…, w_{k,m}*]^{T} with non-zero components w_{k,i}* (w_{k,i}* ≠ 0).

The collinear interaction model between m collinear features X_{i(l)} from the optimal subset R_{1}*(m) (81) can be formulated as follows (57):

The prognostic models for each feature X_{i′} from the subset R_{1}* (81) may have the following form (58):

where α_{i′,0} = 1 / w_{k,i′}*, α_{i′,i′} = 0, and (∀i(l) ≠ i′) α_{i′,i(l)} = w_{k,i(l)}* / w_{k,i′}*.
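Assuming, as in the sketch earlier, that the vertex equations (45) have the form **w**^{T}**x** = 1, solving for the chosen component x_{i′} gives a per-feature model of this kind; the helper below is an illustrative sketch under that assumption (and the signs of its coefficients follow from this assumed form).

```python
def prognostic_model(w, i_dep):
    """Solve w . x = 1 for the dependent component x_i':
    x_i' = (1 - sum over i != i' of w_i * x_i) / w_i'."""
    if w[i_dep] == 0:
        raise ValueError("the weight of the dependent feature must be non-zero")
    def forecast(x):
        rest = sum(w[i] * x[i] for i in range(len(w)) if i != i_dep)
        return (1.0 - rest) / w[i_dep]
    return forecast

w = [0.5, 0.25]                       # a vertex with non-zero weights
model = prognostic_model(w, i_dep=0)  # forecast the first feature from the second
# x = [1.0, 2.0] satisfies w . x = 0.5 + 0.5 = 1, so the model recovers
# the dependent component (the placeholder 0.0 is ignored):
print(model([0.0, 2.0]))              # 1.0
```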

In the case of a data set with a small number m (m << n) of multidimensional feature vectors **x**_{j} (1), the prognostic models (83) for individual features X_{i′} can be weak. It is known that sets (ensembles) of weak models can have strong generalizing properties [4]. A set of weak prognostic models (83) for a selected feature (dependent variable) X_{i′} can be implemented in the complex layer of L prognostic models (83) [11].

The complex layer can be built on the basis of the sequence of L optimal vertices **w**_{l}* (77) related to m features X_{i} constituting the subsets R_{l}* (81), where l = 0, 1,..., L.

*Design assumption*: Each subset R_{l}* (81) in the sequence (84) contains the a priori selected feature (dependent variable) X_{i′} and m − 1 other features (independent variables) X_{i(l)}. The other features X_{i(l)} (X_{i(l)} ∈ R_{l}*) should be different in successive subsets R_{l}* (l = 0, 1,..., L).

The first optimal vertex **w**_{1}* (77) in the sequence (84) is designed on the basis of m feature vectors **x**_{j} (1), which are represented by all n features X_{i} constituting the feature set F(n) = {X_{1},…, X_{n}}. The vertex **w**_{1}* (77) is found by solving the constrained optimization problem (74) according to the two-stage procedure outlined earlier. The two-stage procedure allows finding the optimal vertex **w**_{1}* (77) with the largest L_{1} margin δ_{L1}(**w**_{1}*) (78).

The second optimal vertex **w**_{2}* (77) in the sequence (84) is obtained on the basis of m reduced feature vectors **x**_{j}[n − (m − 1)] (67), which are represented by n − (m − 1) features X_{i} constituting the reduced feature subset F_{2}(n − (m − 1)):

The l-th optimal vertex **w**_{l}* (77) in the sequence (84) is designed on the basis of m reduced vectors **x**_{j}[n − l(m − 1)] (67), which are represented by n − l(m − 1) features X_{i} constituting the feature subset F_{l}(n − l(m − 1)):

The sequence (84) of L optimal vertices **w**_{l}* (77) related to the feature subsets F_{l}(n − l(m − 1)) (86) is characterized by decreasing L_{1} margins δ_{L1}(**w**_{l}*) (78) [18].

The prognostic models (83) for the dependent feature (variable) X_{i′} are designed for each subset F_{l}(n − l(m − 1)) (86) of features X_{i}, where l = 0, 1,..., L (84):

The final forecast X_{i′}^{∧} for the dependent feature (variable) X_{i′} based on the complex layer of L + 1 prognostic models (88) can have the following form:

In accordance with Eq. (89), the final forecast X_{i′}^{∧} for the feature X_{i′} results from averaging the forecasts of the L + 1 individual models X_{i′}(l) (88).
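A toy sketch of this averaging, with three made-up linear models standing in for the L + 1 parallel models (88):

```python
def layer_forecast(models, x):
    """Final forecast (89): the mean of the forecasts produced by the
    parallel prognostic models of the complex layer."""
    forecasts = [model(x) for model in models]
    return sum(forecasts) / len(forecasts)

# Hypothetical layer of three weak linear models for the same feature X_i':
models = [
    lambda x: 0.5 * x[0] + 0.5,
    lambda x: 0.25 * x[0] + 0.75,
    lambda x: 0.75 * x[0] + 0.25,
]
print(layer_forecast(models, [2.0]))   # (1.5 + 1.25 + 1.75) / 3 = 1.5
```

The individual forecasts disagree, yet their mean is more stable than any single model, which is the intended effect of the complex layer on small samples.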

## 9. Concluding remarks

The article considers computational schemes of designing classifiers or prognostic models on the basis of a data set (1) which consists of a small number m of high-dimensional feature vectors **x**_{j} (m << n).

The concept of a complex layer composed of many linear prognostic models (88) built in low-dimensional feature subspaces is discussed in more detail. These models (88) are built by using a small number m of collinear features X_{i} belonging to the optimal feature clusters R_{l}* (81). The optimal feature clusters R_{l}* (81) are formed by the search for the largest margins δ_{L1}(**w**_{l}*) (78) in the L_{1} norm.

The averaged prognostic models X_{i′}^{∧} (89) are based on the layer of L parallel models X_{i′}(l) (88). In line with ergodic theory, averaging over a small number m of feature vectors **x**_{j} has been replaced with averaging over L collinear clusters R_{l}* (81) of features X_{i}. Such an averaging scheme should allow for a more stable extraction of general patterns from small samples of high-dimensional feature vectors **x**_{j} (1) [11].

## References

- 1. Duda R. O., Hart P. E., Stork D. G.: Pattern Classification, J. Wiley, New York, 2001
- 2. Hand D., Smyth P., Mannila H.: Principles of Data Mining, MIT Press, Cambridge, 2001
- 3. Bishop C. M.: Pattern Recognition and Machine Learning, Springer Verlag, 2006
- 4. Kuncheva L.: Combining Pattern Classifiers: Methods and Algorithms, 2nd Edition, J. Wiley, New Jersey, 2014
- 5. Simonnard M.: Linear Programming, Prentice-Hall, Englewood Cliffs, New York, 1966
- 6. Bobrowski L.: Data Mining Based on Convex and Piecewise Linear (CPL) Criterion Functions (in Polish), Białystok University of Technology, 2005
- 7. Bobrowski L.: Data Exploration and Linear Separability, pp. 1-172, Lambert Academic Publishing, 2019
- 8. Bobrowski L.: "Design of piecewise linear classifiers from formal neurons by some basis exchange technique", Pattern Recognition, 24(9), pp. 863-870, 1991
- 9. Bobrowski L., Zabielski P.: "Models of Multiple Interactions from Collinear Patterns", pp. 153-165 in: Bioinformatics and Biomedical Engineering (IWBBIO 2018), Eds.: I. Rojas, F. Guzman, LNCS 10208, Springer Verlag, 2018
- 10. Bobrowski L.: "Small Samples of Multidimensional Feature Vectors" (ICCCI 2020), pp. 87-98 in: Advances in Computational Collective Intelligence, Eds.: Hernes M. et al., Springer, 2020
- 11. Bobrowski L.: "Complexes of Low Dimensional Linear Classifiers with L_{1} Margins", pp. 29-40 in: ACIIDS 2021, Springer Verlag, 2021
- 12. Boser B. E., Guyon I., Vapnik V. N.: "A training algorithm for optimal margin classifiers", in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5, pp. 144-152, Pittsburgh, ACM, 1992
- 13. Bobrowski L., Łukaszuk T.: "Repeatable functionalities in complex layers of formal neurons", EANN 2021, Engineering Applications of Neural Networks, Springer, 2021
- 14. Rosenblatt F.: Principles of Neurodynamics, Spartan Books, Washington, 1962
- 15. Bobrowski L., Łukaszuk T.: "Relaxed Linear Separability (RLS) Approach to Feature (Gene) Subset Selection", pp. 103-118 in: Selected Works in Bioinformatics, Ed.: Xuhua Xia, INTECH, 2011
- 16. Bobrowski L.: "Large Matrices Inversion Using the Basis Exchange Algorithm", British Journal of Mathematics & Computer Science, 21(1), pp. 1-11, 2017
- 17. Petersen K.: Ergodic Theory (Cambridge Studies in Advanced Mathematics), Cambridge University Press, 1990
- 18. Bobrowski L., Zabielski P.: "Feature (gene) clustering with collinearity models", ICCCI 2021, Springer Verlag, 2021