Open access peer-reviewed chapter

Quantitative Structure-Activity Relationship Modeling and Bayesian Networks: Optimality of Naive Bayes Model

By Oleg Kupervasser

Submitted: June 4th 2018. Reviewed: March 20th 2019. Published: May 29th 2019.

DOI: 10.5772/intechopen.85976



Previously, computational drug design was usually based on simplified laws of molecular physics, used to calculate a ligand's interaction with the active site of a protein (enzyme). Currently, however, this interaction is widely estimated from statistical properties of known ligand-protein complexes. Such statistical properties are described by quantitative structure-activity relationships (QSAR). Bayesian networks can help us evaluate the stability of a ligand-protein complex using the found statistics. Moreover, we can prove the optimality of the Naive Bayes model, which makes these evaluations simple and easy to realize in practice. We prove here the optimality of the Naive Bayes model, using ligand-protein interaction as an illustration.


  • quantitative structure-activity relationship
  • Naive Bayes model
  • optimality
  • Bayes classifier
  • Bayesian networks
  • protein-ligand complex
  • computational drug design
  • molecular recognition and binding
  • ligand-active site of protein
  • likelihood
  • probability

1. Introduction

The derivation in this chapter is based on the paper [1]. Bayes classifiers are widely used today for recognition, identification, and knowledge discovery. Fields of application include image processing, personalized medicine [2], and chemistry (QSAR, quantitative structure-activity relationship [3, 4]; see Figure 1). Bayes classifiers are especially important in medical diagnostics and bioinformatics; cogent illustrations of this can be found in the work of Raymer and colleagues [5].

Figure 1.

Quantitative structure-activity relationship.

Let us give an example of the use of QSAR from papers [3, 4]:

“Molecular recognition and binding performed by proteins are the background of all biochemical processes in a living cell. In particular, the usual mechanism of drug function is effective binding and inhibition of activity of a target protein. Direct modeling of molecular interactions in protein-inhibitor complexes is the basis of modern computational drug design but is an extremely complicated problem. In the current paradigm, site similarity is recognized by the existence of chemically and spatially analogous regions from binding sites. We present a novel notion of binding site local similarity based on the analysis of complete protein environments of ligand fragments. Comparison of a query protein binding site (target) against the 3D structure of another protein (analog) in complex with a ligand enables ligand fragments from the analog complex to be transferred to positions in the target site, so that the complete protein environments of the fragment and its image are similar. The revealed environments are similarity regions and the fragments transferred to the target site are considered as binding patterns. The set of such binding patterns derived from a database of analog complexes forms a cloudlike structure (fragment cloud), which is a powerful tool for computational drug design.”

However, these Bayes classifiers have a remarkable property: surprisingly, the Naive Bayes classifier usually gives a good description of the recognition, and more complex Bayes classifier models cannot improve it significantly [1]. In paper [6] the authors explain this exceptional property. However, they use some assumptions (zero-one loss) that reduce the universality and generality of their proof. In this chapter we give a general proof of Naive Bayes classifier optimality. The derivation in the current chapter is similar to [1]. Subsequent attention was given to the Naive Bayes optimality problem in [7, 8]; surprisingly, however, those papers do not include any analysis of the earlier one [1].

We would like to prove Naive Bayes classifier optimality using QSAR terminology. Indeed, we use QSAR only for clarity; the proof is valid for any field of application of the Naive Bayes classifier.

Let us define the basic problem that we attempt to solve in this chapter. Suppose that we have a set of states for a complex of a ligand and the active site of a protein, and a set of factors that characterize these states. For each state, we know the probability distribution of each factor. However, we have no information, even approximate, about the correlations between the factors. Now suppose that we measure the factor values for some sample of the state. What is the probability that this sample corresponds to a given state? This is a typical recognition problem under incomplete information.

In the simplest case, we can define two states for the “ligand-active site of protein” complex: 0 (the ligand is not bound to the active site of the protein) or 1 (the ligand is bound to the active site of the protein).

The next step is the definition of factors (called reliabilities below) that characterize the strength of the bond in the “ligand-active site of protein” complex. Let us give an illustration of such factors from the QSAR experience of papers [3, 4]:

“First, consider the protein 5 Å-environment A = {a1, a2,…aN} of one ligand atom X in the analog protein, that is, all atoms from the binding site that are in the 5 Å-neighborhood of X. Suppose that the complete target binding site T consists of N′ atoms: T = {t1, t2,…tN′} and there exists a subset T0 ⊂ T of size n (N′ ≥ n ≥ 4) such that n atoms from T0 are similar to n atoms A0 = {ai1, ai2,…ain} ⊂ A in their chemical types and spatial arrangement. The search for A0 and T0 is performed using a standard clique detection technique in the graph whose nodes represent pairs (ai, ti) of chemically equivalent atoms and edges reflect similarity of corresponding pairwise distances. If the search is successful, the optimal rigid motion superimposing matched protein atoms is applied both to the initial ligand atom X and its complete environment A (Figure 2(a) in [3]). The atoms are thus transferred to the target binding site. Then we extend the matching between A0 and T0 by such atom pairs (ai, ti) that ai and ti have the same chemical atom type in the coarser 10-type typification mentioned above, and the distance between ti and the image a′i of atom ai is below a threshold. Next, a reliability value R, with 0 ≤ R ≤ 1, is assigned to the image X′ of X in the target site and reflects the similarity between the environments of X and its image X′. If the environments are highly similar (R ≈ 1) we expect that the position of X′ is the place where an atom with chemical type identical to X can be bound by the target, since the environment of X′ contains only atoms required for binding with no “alien” atoms. However, as illustrated in Figure 2(a) in [3], the analog site may contain extra binding atoms (shown on the lower side) that decrease the reliability value. In a simple form, the reliability R can be defined as the sum of the number of matched atoms divided by the total number of analog and target atoms in the 5 Å-environments of X and X′, respectively (Figure 2(b) in [3]):

Figure 2.


R = 2n/(N + N′), using the notation presented above. In fact, we use a somewhat more complicated definition that accounts for the quality of spatial superposition of matched atoms and their distance from X′.”
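The simple form of R above lends itself to a direct sketch in code. This is an illustrative sketch only (the coordinates, the 5 Å cutoff, and the function names are our assumptions), not the implementation of [3, 4]:

```python
import numpy as np

def environment(atoms, center, cutoff=5.0):
    """Indices of atoms within `cutoff` angstroms of `center`
    (the 5 A-environment of a ligand atom)."""
    d = np.linalg.norm(atoms - center, axis=1)
    return np.nonzero(d <= cutoff)[0]

def reliability(n_matched, n_analog_env, n_target_env):
    """Simple reliability R = 2n / (N + N'), with 0 <= R <= 1."""
    return 2.0 * n_matched / (n_analog_env + n_target_env)
```

For example, n = 6 matched atoms out of N = 8 analog-environment atoms and N′ = 8 target-environment atoms give R = 0.75.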

We do not want to discuss these definitions of the factors and states here. Our purpose is not to demonstrate the effectiveness of these definitions or the effectiveness of QSAR; the interested reader can learn about it from papers [3, 4] and the references therein. As we said above, we use QSAR only for clarity; the proof is valid for any field of application of the Naive Bayes classifier.

Let us consider the case when no correlations exist between the reliabilities. In this case, the Naive Bayes model is the correct solution of the problem. We show in this chapter that when the correlations between the reliabilities are unknown, even approximately, the Naive Bayes model is not the correct but the optimal solution in a certain sense. More precisely, we prove that the Naive Bayes model gives the minimal mean error over all possible correlation models, assuming that all correlation models are equally probable. We think that this result can explain the mysterious optimality of the Naive Bayes model described above.

The chapter is organized as follows. We give an exact mathematical formulation of the problem for two states and two reliabilities in Section 2. We introduce our notation in Section 3. We derive the general form of the conditional probability for all possible correlations of our reliabilities in Section 4. We characterize the limitations on the functions describing the correlations in Section 5. We define the distance between two probability (correlation) models in Section 6. We find constraints on our basic functions in Section 7. We solve our main problem, proving the Naive Bayes model's optimality for a uniform distribution over all possible correlations, in Section 8. We find the mean error between the Naive Bayes model and the true model for a uniform distribution over all possible correlations in Section 9. We consider the case of more than two states and reliabilities in Section 10. We draw conclusions in Section 11.


2. Definition of the task

Suppose that A is the state of the “ligand-active site of protein” complex: 0 (the ligand is not bound to the active site of the protein) or 1 (the ligand is bound to the active site of the protein). Assume that the a priori probability P(A) = P(A = 1) is known, and denote it by θ. Let X1, X2 be two reliability values (defined above), with values in the set [0, 1]. For generality, however, we will define X1, X2 on (−∞, +∞), with the probability density of finding X1, X2 in (−∞, 0] or [1, +∞) equal to zero. We have the following data: X1 = x1 and X2 = x2 (obtained through measurement). Moreover, we have two functions, “classifiers,” which for given x1 and x2 give us


We want to find the probability

P(A = 1 | X1 = x1, X2 = x2)

in terms of α, β, and θ. More particularly, we wish to find a function Γopt(α, β, θ) which on average is the best estimate of P(A | x1, x2), in a sense to be defined explicitly in the following consideration (see Figure 2).


3. Notation and preliminaries

ρ_{X1,X2}(x1, x2)—the joint PDF (probability density function) of X1 and X2.

ρ_{X1,X2|A}(x1, x2) = h(x1, x2)—the joint PDF of X1 and X2 given A = 1. In terms of h(x1, x2) and θ, we can find P(A | x1, x2) as follows:

P(A | x1, x2) = θ h(x1, x2) / (θ h(x1, x2) + (1 − θ) h̄(x1, x2)).

h̄(x1, x2) ≡ ρ_{X1,X2|Ā}(x1, x2)—the joint PDF of X1 and X2 given A = 0.

We can find


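The posterior follows from Bayes' rule: P(A | x1, x2) = θ h(x1, x2) / (θ h(x1, x2) + (1 − θ) h̄(x1, x2)). A minimal numeric sketch; the two densities below are illustrative assumptions (normalized on [0, 1]²), not taken from the chapter:

```python
def posterior(theta, h, h_bar, x1, x2):
    """P(A=1 | x1, x2) by Bayes' rule from the prior theta and the
    class-conditional joint densities h (A=1) and h_bar (A=0)."""
    num = theta * h(x1, x2)
    return num / (num + (1.0 - theta) * h_bar(x1, x2))

# Assumed example densities: bound complexes (A=1) favor high reliabilities,
# unbound ones (A=0) favor low reliabilities.
h = lambda x1, x2: 4.0 * x1 * x2                      # density given A=1
h_bar = lambda x1, x2: 4.0 * (1.0 - x1) * (1.0 - x2)  # density given A=0
```

With θ = 0.5 and measurements x1 = 0.9, x2 = 0.8, the posterior is about 0.97: two high reliabilities make binding very likely.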
4. Generic form of P(A | x1, x2)

Let us define the functions g(x1, x2) and ḡ(x1, x2):


Note that if X1 and X2 are conditionally independent, i.e.,




Let us define the following monotonously nondecreasing probability distribution functions:


Note that since H1(x1), H2(x2), H̄1(x1), and H̄2(x2) are monotonous (at this point we can assume that h1(x1), h2(x2), h̄1(x1), h̄2(x2) > 0, so that H1(x1), H2(x2), H̄1(x1), and H̄2(x2) are monotonously increasing; however, this limitation will prove unnecessary, as we will see in the following derivation), the inverse functions H1^(−1), H2^(−1), H̄1^(−1), and H̄2^(−1) must exist. As a result, we can define


For brevity, let us use the following concise notation:


By the definition


We now obtain


As a result from Eqs. (2) and (3)


Now from Eq. (1)


Note that for J = J̄ = 1 (conditional independence of x1 and x2), Eq. (8) becomes the exact solution for the optimal model:


5. Limitations for the functions J(a, b) and J̄(a, b)

We can write


As a result


Thus, we obtain the following condition:


and similarly


Similarly, we can get



J(a, b), J̄(a, b) ≥ 0. (14)

All solutions of Eqs. (11)–(15), together with Eq. (8), define the set of all possible realizations of P(A | x1, x2).

Let us give an example of a solution of Eqs. (11), (12), (14), and (15):

Let ρ(x) be a function such that ρ(x) ≥ 0 and ∫_0^1 ρ(x) dx = 1. Then


satisfy Eqs. (11), (12), (14), and (15).


6. Definition of distance

We define the distance between the proposed approximation Γ(α, β, θ) of P(A | x1, x2) and the actual function P(A | x1, x2) as follows:


Now, from Eqs. (2), (3), and (4)–(7), we have



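Such a distance can be checked numerically. The sketch below assumes that the distance is the expected squared deviation of Γ from the true posterior under the joint density of (X1, X2); the sampler and the test functions are illustrative assumptions:

```python
import random

def distance(gamma, p_true, sampler, n=100_000, seed=0):
    """Monte Carlo estimate of E[(gamma(x1,x2) - p_true(x1,x2))^2],
    with (x1, x2) drawn from their joint density by `sampler`."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x1, x2 = sampler(rng)
        acc += (gamma(x1, x2) - p_true(x1, x2)) ** 2
    return acc / n

# Example sampler: a uniform joint density on [0, 1]^2 (assumed).
uniform = lambda rng: (rng.random(), rng.random())
```

For two constant models differing by 0.1 the estimate is 0.01, independent of the sampler.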

7. Constraints for basic functions

We will further consider all functions as functions of the arguments F1, F2, with 1 ≥ F1, F2 ≥ 0, rather than of x1, x2. We have six functions of F1, F2 that define Eq. (16): J, J̄, H1, H2, α, β. Let us write the functions in terms of F1, F2 and find restrictions on these functions:


In the same way,


We know that the functions H1, F1 and H2, F2 are cumulative distribution functions of x1 and x2, correspondingly. These functions are monotonously nondecreasing and change from 0 to 1, by the definition of a cumulative distribution function. Therefore, we can conclude that the following restraints on H1, H2 as functions of F1, F2 exist:


In the same way,


8. Optimization

We shall find the best approximation Γ(α, β, θ) as follows:


where the expected value (also called the expectation, the mean, or the first moment) E is taken with respect to the joint PDF of possible realizations of J, J̄, α, β, H1, H2 for given F1 and F2.

For the sake of brevity, we denote


Then from Eqs. (17) and (16)




It remains to calculate the expected value in Eq. (19).

We have by obvious assumptions


Lemma 1: E[J(a, b)] = E[J̄(a, b)] = 1.


Proof: Consider the function ρ(J(a, b) | a, b). The domain of the function J(a, b) is the square 0 ≤ a, b ≤ 1. By dividing this square into small squares (i, j), we obtain a sampling of the function J. Then, on every square (i, j), we can define the value Jij of the function. We can write the following restraints for the function J:


Here i = 1, …, N; j = 1, …, N.

All matrices Jij that satisfy the above conditions have the same probability. So we can define the probability density function


This density function must be symmetric under transpositions of columns and rows of the matrix Jij, because it assigns the same probability to all matrices Jij that satisfy the above conditions, and these conditions are themselves symmetric under such transpositions. Hence the function ρ is invariant under transpositions of columns and rows of Jij.

We can consider the function ρ(u | i, j), a discretization of the function ρ(J(a, b) | a, b):


We can transpose columns and rows of Jij in such a way that the element Jij is replaced by another element Jnk, while the function ρ remains unchanged. So from the above equation, we get


From this equation we can conclude that ρ(u | i, j) does not depend on (i, j), so ρ(J | a, b) does not depend on (a, b), and






we can conclude that


So we obtain Const = 1 in Eq. (21).

Lemma 2: The probability distribution functions of α and β do not depend on F1 and F2:


Proof: Let us sample the function α(F1) by dividing its domain F1 ∈ [0, 1] into intervals of length 1/N, N ≫ 1. Then the restriction conditions for αk, k = 1, …, N, are


All columns αk that satisfy these conditions have equal probability. We can consider the corresponding function ρ(α1, …, αk, …, αl, …, αN). From the symmetry, under transpositions αk ↔ αl, of the conditions defining this function, the function ρ(α1, …, αk, …, αl, …, αN) is also invariant under such transpositions. As a result, we can write


From this equation, we can conclude that the function ρ(α | F1) does not depend on F1:


From (20) we obtain


Let us define


By Lemma 1, E[J] = E[J̄] = 1. Hence


It remains to find




if the expression in square brackets is minimized at each point, then the whole integral in Eq. (22) is minimized. Thus, we may proceed as follows:


Hence the optimal Γ(α, β, θ) is given by

Γopt(α, β, θ) = αβ(1 − θ) / (αβ(1 − θ) + (1 − α)(1 − β)θ).
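In code, the optimal model is the standard Naive Bayes combination of the two classifier outputs α = P(A=1 | x1) and β = P(A=1 | x2) with the prior θ; a minimal sketch:

```python
def gamma_opt(alpha, beta, theta):
    """Naive Bayes combination of two classifier outputs with prior theta:
    alpha*beta*(1-theta) / (alpha*beta*(1-theta) + (1-alpha)*(1-beta)*theta)."""
    num = alpha * beta * (1.0 - theta)
    return num / (num + (1.0 - alpha) * (1.0 - beta) * theta)
```

A useful sanity check: an uninformative classifier (α = θ) leaves the other output unchanged, gamma_opt(θ, β, θ) = β, and two uninformative classifiers return the prior.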

9. Mean distance between the proposed approximation Γ(α, β, θ) of P(A | x1, x2) and the actual function P(A | x1, x2)

The mean distance, from Eq. (18), is


where Const in this equation is defined by


From this equation we can find the boundaries of Const. From 0 ≤ P(A | x1, x2) ≤ 1 we can conclude


The second condition is


So from these two equations, we can conclude


In the next step, we would like to find the function ρ(α) (and similarly ρ(β)) in the equation for DIS.

The restrictions on the function α(F1), 0 ≤ F1 ≤ 1, are the following:


In discrete form (for N → ∞), we can rewrite α as the set α_set = (α1, α2, …, αN)


Let us define a function U(α_set) in the following way:

U(α_set) = Σ_{i=1}^{N} αi for 0 ≤ αi ≤ 1, i = 1, 2, …, N; +∞ otherwise,
Ui(αi) = αi for 0 ≤ αi ≤ 1; +∞ otherwise.

Then the function that realizes the equal-probability distribution subject to restrictions (i) and (ii) is the following:


Here δ is the Dirac delta function.

We can define the constant Cby


It can be proved (for N → ∞) that distribution (23) is equal to the following distribution (from statistical mechanics [9]; the transformation from the microcanonical to the canonical distribution):


Here we can find Z and K from the following equations:


The sought function ρ(α) can be found by




From Eqs. (24) and (25), we can find


where Λ(K) is a decreasing function


If K is the root of Eq. (29), then from Eqs. (26)–(29) we can write, for the function ρ(α),

ρ(α) =
  1 for 0 ≤ α ≤ 1 (0 otherwise), for K = 0;
  2δ(α) for 0 ≤ α ≤ 1 (0 otherwise), for K = +∞;
  2δ(α − 1) for 0 ≤ α ≤ 1 (0 otherwise), for K = −∞;
  (1/D) e^(−Kα) for 0 ≤ α ≤ 1 (0 otherwise), for all other K,

where 2∫_0^1 δ(α − 1) dα = 2∫_0^1 δ(α) dα = 1 and 1/D = K/(1 − e^(−K)).
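For finite nonzero K, the canonical density ρ(α) = e^(−Kα)/D on [0, 1], with 1/D = K/(1 − e^(−K)), can be sampled by inversion of its CDF; a sketch (the function name is ours):

```python
import math, random

def sample_rho_alpha(K, n, seed=0):
    """Draw n samples from rho(alpha) = exp(-K*alpha)/D on [0, 1]
    by inverse-CDF sampling; K = 0 reduces to the uniform density."""
    rng = random.Random(seed)
    if K == 0:
        return [rng.random() for _ in range(n)]
    c = 1.0 - math.exp(-K)  # the CDF is (1 - exp(-K*a)) / c
    return [-math.log(1.0 - rng.random() * c) / K for _ in range(n)]
```

The sample mean converges to the mean of this truncated exponential, 1/K − e^(−K)/(1 − e^(−K)).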


10. The case of more than two states A and reliabilities X

Let A be a state with values in the set {0, 1, …, L}. This number can characterize the strength of a bond. Assume that the a priori probability P(A = i) is known, and denote it by θi; here i = 1, …, L. Let X1, …, XK be random variables with values in some set, say (−∞, +∞). We have the following information: X1 = x1, …, XK = xK (obtained through measurement). Furthermore, we have systems, “classifiers,” which for given x1, …, xK produce

αij = P(A = i | Xj = xj), i = 1, …, L; j = 1, …, K.


We want to find the probability P(A = i | X1 = x1, …, XK = xK) in terms of αij and θi. In more detail, we want to find a function Γopt,M(αij, θi) which is the best approximation of P(A = M | x1, …, xK) on average. In the same way as in the case of two variables, it can be shown that Γopt,M(αij, θi) is defined by the following equation:


We have the evident restraints for αij, θi:


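By analogy with the two-variable case, the combined posterior is proportional to θ_M^(1−K) ∏_j α_Mj. A sketch under that assumption (the generalization is ours; alpha[i][j] stands for P(A = i | Xj = xj)):

```python
import math

def gamma_opt_multi(alpha, theta):
    """Naive Bayes combination for several states and K classifiers.
    alpha[i][j] = P(A=i | X_j = x_j); theta[i] = prior P(A=i).
    Returns the combined posterior for every state."""
    K = len(alpha[0])
    weights = [theta[i] ** (1 - K) * math.prod(alpha[i])
               for i in range(len(theta))]
    total = sum(weights)
    return [w / total for w in weights]
```

For two states and two classifiers this reduces to the two-variable result proved above.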
11. Conclusions

Using QSAR as an illustration, we demonstrated that the Naive Bayes model gives minimal mean error over a uniform distribution of all possible correlations between the characteristic reliabilities. This result can explain the mysterious optimality of the Naive Bayes model described above. We also found the mean error that the Naive Bayes model gives over a uniform distribution of all possible correlations of the reliabilities.

Medicinal chemistry (quantitative structure-activity relationship, QSAR) prediction increasingly relies on Bayesian network-based methods. Its importance derives partly from the difficulty and inaccuracy of present quantum chemical models (e.g., in SYBYL and other software) and from the impracticality of sufficiently characterizing the structure of drug molecules and receptor active sites, including vicinal waters in and around hydrophobic pockets in active sites. This is particularly so for biologicals (protein and nucleic acid active pharmaceutical ingredients (APIs)) and target applications that exhibit extensive inter-receptor trafficking, genomic polymorphisms, and other systems biology phenomena. The effectiveness and accuracy of Bayesian methods for drug development likewise depend on certain prerequisites, such as an adequate distance metric by which to measure the similarity or difference between combinatorial library molecules and known successful ligand molecules targeting a particular receptor and addressing a particular clinical indication. In this connection, the distance metric proposed in Section 6 of this chapter and the associated lemmas and proofs are of substantial value for the future of high-throughput screening (HTS) and medicinal chemistry.

However, our purpose here was not to demonstrate the effectiveness of these definitions or the effectiveness of QSAR. The interested reader can learn about it from papers [3, 4] and the references therein. As we said above, we use QSAR only for clarity; the proof is valid for any field of application of the Naive Bayes classifier.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Oleg Kupervasser (May 29th 2019). Quantitative Structure-Activity Relationship Modeling and Bayesian Networks: Optimality of Naive Bayes Model. In: Douglas McNair (Ed.), Bayesian Networks - Advances and Novel Applications. IntechOpen. DOI: 10.5772/intechopen.85976.
