## Abstract

Previously, computational drug design was usually based on simplified laws of molecular physics, which were used to calculate the interaction of a ligand with the active site of a protein (enzyme). Currently, however, this interaction is widely estimated from statistical properties of known ligand-protein complexes. Such statistical properties are described by quantitative structure-activity relationships (QSAR). Bayesian networks can help us evaluate the stability of a ligand-protein complex using these statistics. Moreover, it is possible to prove the optimality of the Naive Bayes model, which makes these evaluations simple and easy to implement in practice. Here we prove the optimality of the Naive Bayes model, using ligand-protein interaction as an illustration.

### Keywords

- quantitative structure-activity relationship
- Naive Bayes model
- optimality
- Bayes classifier
- Bayesian networks
- protein-ligand complex
- computational drug design
- molecular recognition and binding
- ligand-active site of protein
- likelihood
- probability

## 1. Introduction

The derivation in this chapter is based on paper [1]. Bayes classifiers are now widely used for recognition, identification, and knowledge discovery. Fields of application include, for example, image processing, personalized medicine [2], and chemistry (quantitative structure-activity relationship, QSAR [3, 4]; see Figure 1). Bayes classifiers are of particular importance in medical diagnostics and bioinformatics. Convincing illustrations of this can be found in the work of Raymer and colleagues [5].

Let us give an example of the use of QSAR from papers [3, 4]:

“Molecular recognition and binding performed by proteins are the background of all biochemical processes in a living cell. In particular, the usual mechanism of drug function is effective binding and inhibition of activity of a target protein. Direct modeling of molecular interactions in protein-inhibitor complexes is the basis of modern computational drug design but is an extremely complicated problem. In the current paradigm, site similarity is recognized by the existence of chemically and spatially analogous regions from binding sites. We present a novel notion of binding site local similarity based on the analysis of complete protein environments of ligand fragments. Comparison of a query protein binding site (target) against the 3D structure of another protein (analog) in complex with a ligand enables ligand fragments from the analog complex to be transferred to positions in the target site, so that the complete protein environments of the fragment and its image are similar. The revealed environments are similarity regions and the fragments transferred to the target site are considered as binding patterns. The set of such binding patterns derived from a database of analog complexes forms a cloudlike structure (fragment cloud), which is a powerful tool for computational drug design.”

However, Bayes classifiers have a remarkable property: strangely, the Naive Bayes classifier usually gives a good description of recognition. More complex Bayes classifier models cannot improve on it significantly [1]. The authors of paper [6] explain this remarkable property. However, they use some assumptions (zero-one loss) that reduce the universality and generality of their proof. In this chapter we give a general proof of Naive Bayes classifier optimality. The derivation in the current chapter is similar to [1]. The Naive Bayes classifier optimality problem was subsequently considered in [7, 8]. Unfortunately, these papers do not include any analysis of the earlier paper [1].

We would like to prove Naive Bayes classifier optimality using QSAR terminology. We use QSAR only for clarity; the proof holds for any field of application of the Naive Bayes classifier.

Let us define the main problem that we try to solve in this chapter. Suppose that we have a set of states for a ligand-active site of protein complex and a set of factors that characterize these states. For each state, we know the probability distribution of each factor. However, we have no information about the relationships between the factors, not even approximately. Now suppose that we know the factor values for some sample of the state. What is the probability that this sample corresponds to a given state? This is a typical problem of recognition under incomplete information.

In the simplest case, we can define two states for the “ligand-active site of protein” complex: 0 (the ligand is not bound to the active site of the protein) or 1 (the ligand is bound to the active site of the protein).

The next step is the definition of factors (called reliabilities below) that characterize the strength of the bond in the “ligand-active site of protein” complex. Let us give an example of such factors from the QSAR experience in papers [3, 4]:

“First, consider the protein 5 A°-environment A = {a_{1}, a_{2},…a_{N}} of one ligand atom X in the analog protein, that is, all atoms from the binding site that are in the 5 A°-neighborhood of X. Suppose that the complete target binding site T consists of N′ atoms: T = {t_{1}, t_{2},…t_{N’}} and there exists a subset T_{0}_{0} are similar to n atoms A_{0} = {a_{i1}, a_{i2},…a_{in}} _{0} and T_{0} is performed using a standard clique detection technique in the graph whose nodes represent pairs (a_{i}, t_{i}) of chemically equivalent atoms and edges reflect similarity of corresponding pairwise distances. If the search is successful, the optimal rigid motion superimposing matched protein atoms is applied both to the initial ligand atom X and its complete environment A (**Figure 2(a)** in [3]). The atoms are thus transferred to the target binding site. Then we extend the matching between A_{0} and T_{0} by such atom pairs (a_{i},t_{i}) that a_{i} and t_{i} have the same chemical atom type in the coarser 10-type typification mentioned above, and the distance between t_{i} and the image a′_{i} of atom a_{i} is below a threshold. Next, a reliability value R, with 0 ≤ R ≤ 1, is assigned to the image X′ of X in the target site and reflects the similarity between the environments of X and its image X′. If the environments are highly similar (R ≈ 1) we expect that the position of X′ is the place where an atom with chemical type identical to X can be bound by the target, since the environment of X′ contains only atoms required for binding with no “alien” atoms. However, as illustrated in Figure 2(a) in [3], the analog site may contain extra binding atoms (shown on the lower side) that decrease the reliability value. In a simple form, the reliability R can be defined as the sum of the number of matched atoms divided by the total number of analog and target atoms in the 5 A°-environments of X and X′, respectively (Figure 2(b) in [3]):

R = 2n/(N + N′), using the notation presented above. In fact, we use a somewhat more complicated definition that accounts for the quality of spatial superposition of matched atoms and their distance from X′.”
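The simple form of the reliability quoted above can be sketched in a few lines of Python. This is only an illustration of the formula R = 2n/(N + N′); the function name and its argument names are hypothetical, and the more complicated definition that the authors of [3] actually use (which accounts for the quality of spatial superposition) is not reproduced here.

```python
def reliability(n_matched: int, n_analog: int, n_target: int) -> float:
    """Simple-form reliability R = 2n / (N + N').

    n_matched -- n, the number of matched atom pairs
    n_analog  -- N, the number of analog atoms in the environment of X
    n_target  -- N', the number of target atoms in the environment of X'
    (All three names are illustrative assumptions, not taken from [3].)
    """
    if min(n_matched, n_analog, n_target) < 0:
        raise ValueError("atom counts must be non-negative")
    if n_matched > min(n_analog, n_target):
        raise ValueError("cannot match more atoms than either environment holds")
    total = n_analog + n_target
    if total == 0:
        raise ValueError("both environments are empty")
    # Each matched pair contributes one analog atom and one target atom,
    # hence the factor 2 in the numerator.
    return 2.0 * n_matched / total


if __name__ == "__main__":
    # 3 matched pairs, 5 analog atoms, 7 target atoms
    print(reliability(3, 5, 7))
```

With this definition R stays in the interval from 0 to 1, and R = 1 only when every atom in both environments is matched, which agrees with the interpretation of highly similar environments given in the quotation.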

We do not want to discuss these definitions of factors and states here. Our purpose is not to demonstrate the effectiveness of these definitions or of QSAR. The interested reader can learn about it from papers [3, 4] and the references therein. As stated above, we use QSAR only for clarity; the proof holds for any field of application of the Naive Bayes classifier.

Let us consider the case when no relationships exist between the reliabilities. In this case, the Naive Bayes model is an exact solution of the problem. We prove in this chapter that when we do not know the relationships between reliabilities even approximately, the Naive Bayes model is not exact, but it is an optimal solution in a certain sense. In more detail, we prove that the Naive Bayes model gives the minimal mean error over all possible models of relationship. We assume in this proof that all relationship models have the same probability. We think that this result can explain the mysterious optimality of the Naive Bayes model described above.
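For readers who prefer code, the Naive Bayes estimate for two states and several reliabilities can be sketched as follows. This is a minimal illustration of the standard Naive Bayes rule, in which the factors are treated as conditionally independent given the state; it is not the chapter's own derivation. The function name, its arguments, and the numerical values are assumptions made purely for the example.

```python
from typing import Sequence


def naive_bayes_posterior(
    prior_bound: float,
    lik_bound: Sequence[float],
    lik_unbound: Sequence[float],
) -> float:
    """Posterior probability of state 1 (ligand bound) under Naive Bayes.

    prior_bound -- prior probability of state 1
    lik_bound   -- likelihood of each observed factor value given state 1
    lik_unbound -- likelihood of each observed factor value given state 0
    (Hypothetical names; the inputs are assumed to be known in advance.)
    """
    if len(lik_bound) != len(lik_unbound):
        raise ValueError("one likelihood per factor is needed for each state")

    # Independence assumption: the joint likelihood is a plain product.
    joint_bound = prior_bound
    for p in lik_bound:
        joint_bound *= p

    joint_unbound = 1.0 - prior_bound
    for q in lik_unbound:
        joint_unbound *= q

    evidence = joint_bound + joint_unbound
    if evidence == 0.0:
        raise ValueError("observed values are impossible under both states")
    return joint_bound / evidence


if __name__ == "__main__":
    # Two reliabilities, equal priors; the numbers are illustrative only.
    print(naive_bayes_posterior(0.5, [0.8, 0.6], [0.2, 0.4]))
```

When the factors really are independent given the state, this product rule is exact. When they are not, it is only an approximation, and the quality of that approximation is what the chapter analyzes.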

This chapter is organized as follows. We give an exact mathematical description of the problem for two states and two reliabilities in Section 2. We define our notation in Section 3. We define the general form of the conditional probability for all possible relationships between our reliabilities in Section 4. We describe the limitations on the functions that describe the relationships in Section 5. We find the formula for the distance between two models of probability (correlation) in Section 6. We find constraints for our basic functions in Section 7. We solve our main problem in Section 8, where we prove the optimality of the Naive Bayes model for a uniform distribution of all possible relationships. We find the mean error between the Naive Bayes model and the true model for a uniform distribution of all possible relationships in Section 9. We consider the case of more than two states and reliabilities in Section 10. We draw conclusions in Section 11.

## 2. Definition of the task

Suppose that A is a state of the “ligand-active site of protein” complex: 0 (the ligand is not bound to the active site of the protein) or 1 (the ligand is bound to the active site of the protein). Assume that the

We want to find the likelihood

in terms of

## 3. Notation and preliminaries

here

We can find

## 4. Generic form of *P*(*A*/*x*_{1}, *x*_{2})

Let us define the function

Let us say that if

then

Let us define the following *monotonically nondecreasing* probability distribution functions:

Note that since

To be brief, let us use the following concise designation:

By the definition

We currently obtain

As a result from Eqs. (2) and (3)

Now from Eq. (1)

Note, that for values of *x*_{1} and *x*_{2}) equation (8) becomes the exact solution for the optimal model:

## 5. Limitations for the functions *J*(*a*, *b*) and *J̄*(*a*, *b*)

We can write

As a result

Thus, we obtain the following condition:

and similarly

Similarly, we can get

Obviously

All the solutions of Eqs. (11)–(15) together with (8) can define the set of all possible realizations of

Let us give some example of a solution of (11), (12) and (14), (15):

Let

## 6. Definition of distance

We define the distance between the proposed approximation of

Now we have from Eqs. (2) and (3) and Eqs. (4)–(7)

Here

## 7. Constraints for basic functions

We will consider further all functions with arguments *F*_{1}, *F*_{2}) and find restrictions for these functions:

By the same way

We know from the definition of cumulative distribution functions that these functions are *monotonically nondecreasing* and change from 0 to 1. Therefore, we can conclude the following constraints for functions

By the same way

## 8. Optimization

We shall find the best approximation of

where the expected value (or expectation or mathematical expectation or mean or the first moment)

For the sake of brevity, we denote

Thus

It remains to calculate the expected value in Eq. (19).

We have by obvious assumptions

Lemma 1

Proof: We can take into the consideration the function

Here

All matrixes

This density function should be symmetric according to transpositions of columns and rows of the matrix

We can consider function

We can transpose columns and rows

From this equation we can conclude that

and

From

we can conclude that

So we can obtain that

Lemma 2: Probability distribution functions

Proof: Let us make sampling of the function

All columns

From this equation, we can conclude that function

From (20) we obtain

Let us define

By Lemma 1,

It remains to find

Since

if the expression in square brackets is minimized at each point, then the whole integral in Eq. (22) is minimized. Thus, we may proceed as follows:

Hence the optimum

## 9. Mean distance between the proposed approximation Γ(*α*, *β*, *θ*) of *P*(*A*/*x*_{1}, *x*_{2}) and the actual function *P*(*A*/*x*_{1}, *x*_{2})

The mean distance from (18) is

where

From this equation we can find boundaries of the

The second condition is

So from these two equations, we can conclude

In the next step, we would like to find the function

Restrictions for function

In discrete form (for

Let us define a function

Then the function that satisfies equal probability distribution with considering restrictions (i) and (ii) is the following:

here

We can define the constant

It can be proved for

Here we can find

The sought function

where

From Eqs. (24) and (25), we can find

where

If

where

## 10. The case of more than two states *A* and reliabilities *X*

Let A be a state, with values in set

We want to find the probability

We have obvious constraints for

## 11. Conclusions

Using QSAR as an illustration, we proved that the Naive Bayes model gives the minimal mean error over a uniform distribution of all possible relationships between the characteristic reliabilities. This result can explain the mysterious optimality of the Naive Bayes model described above. We also found the mean error that the Naive Bayes model gives for a uniform distribution of all possible relationships of reliabilities.

Medicinal chemistry (quantitative structure-activity relationships, QSAR) prediction increasingly relies on Bayesian network-based methods. Their importance derives partly from the difficulty and inaccuracy of present quantum chemical models (e.g., in SYBYL and other software) and from the impracticality of sufficiently characterizing the structure of drug molecules and receptor active sites, including vicinal waters in and around hydrophobic pockets in active sites. This is particularly so for biologicals (protein and nucleic acid active pharmaceutical ingredients, APIs) and for target applications that exhibit extensive inter-receptor trafficking, genomic polymorphisms, and other systems biology phenomena. The effectiveness and accuracy of Bayesian methods for drug development likewise depend on certain prerequisites, such as an adequate distance metric by which to measure similarity or difference between combinatorial library molecules and known successful ligand molecules that target a particular receptor and address a particular clinical indication. In this connection, the distance metric proposed in Section 6 of this chapter and the associated lemmas and proofs are of substantial value for the future of high-throughput screening (HTS) and medicinal chemistry.

However, our purpose here was not to demonstrate the effectiveness of these definitions or of QSAR. The interested reader can learn about it from papers [3, 4] and the references therein. As stated above, we use QSAR only for clarity; the proof holds for any field of application of the Naive Bayes classifier.