Lipinski’s Rule of Five
This chapter presents in silico approaches used in protein structure prediction and drug discovery research.
The structural and functional diversity of animal toxins makes them interesting tools for therapeutic drug design. This diversity is also of great interest in the search for natural or synthetic inhibitors of these toxins.
Computational techniques are highly important in drug design. They are used in the search for candidate ligands binding to a receptor.
Structure-based drug design has become a highly developed technology and is used in large pharmaceutical companies. Its first requirement is that the structure of the protein of interest be known. Therefore, molecular modelling plays an important role in the discovery of new drugs.
If the structure of the receptor is known, then the application is essentially a problem of structure-based drug design. These methods have specific goals, such as identifying the location of the active site for the ligand and the geometry of the ligand in that site. Another goal is to rank a set of related binders by affinity or by an estimate of the binding free energy.
The strategy of virtual screening has been used to contribute to the increase in hit rate in the selection of new drug candidates.
Virtual screening (VS) is a modern methodology that has been used in the identification of new bioactive substances. It is an in silico method that aims to identify small molecules contained in large databases of compounds with high potential for interaction with target proteins for subsequent biochemical analyses.
The strategy of VS can be divided into ligand-based virtual screening (LBVS), where a large number of molecules can be evaluated based on the similarity of known ligands, and structure-based virtual screening (SBVS), where a number of molecules can be evaluated for specifically binding to the active sites of target proteins (Figure 1).
Molecular docking is used to determine the best orientation and conformation of a ligand in its receptor site. The aim is to generate a range of conformations of the protein-ligand complex and sort them according to their scores, which are based on their stabilities. In order to do this, the protein structure and a database of ligands (potential candidates) are used as inputs to the docking software. Thus, large collections of virtual compounds are subjected to docking into a protein-binding site and sorted according to their affinities for the macromolecular target, as suggested by the score function.
The focus of this chapter is to present the strategy of SBVS and the basic concepts of the methodologies involved. Examples of these approaches applied to the identification of animal venom inhibitors are presented at the end of the chapter.
2. Structure-Based Virtual Screening (SBVS)
SBVS involves the evaluation of databases based on the simulation of interactions between the ligands (small molecules) and receptors (target protein). The various steps in the process of SBVS are briefly shown in Figure 2. After obtaining the structure of the receptor and ligand, the next step in the process is molecular docking, which involves the coupling of the ligands with the receptor. At this stage, various conformations and orientations are generated and classified according to the score function. The target protein can be obtained from a database or by modelling.
2.1. Obtaining the Structure of the Protein Target
Knowledge of the target protein structure is essential for structure-based drug design. The 3-dimensional structure of the protein may be determined experimentally by X-ray diffraction or by nuclear magnetic resonance (NMR). If the structure of the target protein has already been solved, it can easily be found in public databases such as the Protein Data Bank (PDB), which contains more than 80,000 experimentally solved structures.
However, sometimes the structure of the target is not known, and this poses a problem in the drug design process. This situation can be resolved by making use of computational methods for predicting protein structure.
Such methods are divided into 2 groups: those based on templates and those that are template-free. The first group includes comparative or homology modelling and threading. The second group includes methods that do not depend on templates to build the model, such as ab initio modelling (Figure 3).
2.1.1. Template-Based Modelling
Homology modelling is based on the use of proteins that share an ancestral relationship with the target protein, that is, proteins that are evolutionarily related and therefore tend to have similar structures. Thus, this method basically involves knowledge of the primary chain of the target protein and a search among databases for homologous proteins that have solved structures. These proteins are used as templates.
Threading modelling is based on the principle that proteins may have similar structures without sharing the same ancestral relationship because the structure tends to be more conserved than the primary sequence. In this case, these methods evaluate the primary chain of the target protein in relation to proteins that have solved structures.
2.1.1.1. Comparative/Homology Modelling
Comparative or homology modelling constructs a model structure of the target protein using its primary chain and the information obtained from homologous proteins that have solved structures. Therefore, this method depends on the availability of proteins that have structures similar to those of the target and can be used as templates. The whole process requires not only the construction of the model, but also the refinement and evaluation of the obtained model. The process can be divided into stages as follows: selection of the templates, which involves the identification of homologous sequences in a database of proteins that will be used as templates in the modelling process; sequence alignment between the target and the templates; refinement of the alignment; construction of the model, adding loops and side chains; and evaluation of the model (Figure 4).
The construction of the model depends on the availability of templates. For this purpose, alignment of target and template sequences is widely used and is very efficient. Sequence alignments are typically generated by searching for the result that presents the largest region of identity and similarity. Generally, an identity percentage of at least 25% is considered significant.
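The identity threshold mentioned above can be illustrated with a minimal Python sketch. It assumes a pairwise alignment is already available as two gapped strings of equal length; the function name, the example sequences, and the choice to count identity over gap-free columns only are assumptions made for illustration (alignment tools differ in the denominator they report).

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned sequences (gaps shown as '-').
target = "MKT-AYIAKQR"
template = "MKSGAY-ARQR"

identity = percent_identity(target, template)
# Apply the 25% identity guideline from the text.
is_candidate = identity >= 25.0
```

Here the template would be retained as a candidate because its identity to the target is well above 25%.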
There are several tools available for sequence alignment. They differ in the methods used, which can be exhaustive or heuristic, as well as the number of sequences involved in the alignment (multiple or pairwise comparisons). Among these tools, BLAST/PSIBLAST [1, 2] is a tool that performs local alignments based on the profiles between the target sequence and each sequence belonging to a known database.
The results of the alignment can be evaluated using the E-value. The E-value shows an inverse relationship with the identity/similarity between the sequences. Because it is a heuristic method, the results reported by BLAST are generally suboptimal.
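For reference, the E-value reported by BLAST is usually described by the Karlin-Altschul statistics below. The symbols are not defined in this chapter and are introduced here only for illustration: $m$ and $n$ are the lengths of the query and of the database, $S$ is the raw alignment score, $S'$ is the bit score, and $K$ and $\lambda$ are statistical parameters that depend on the scoring system.

```latex
E = K\, m\, n\, e^{-\lambda S},
\qquad
S' = \frac{\lambda S - \ln K}{\ln 2},
\qquad
E = m\, n\, 2^{-S'}
```

Because $E$ decreases exponentially as the score increases, better alignments give smaller E-values, which is the inverse relationship described above.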
If more than 1 template with similar scores is found, the one with the highest resolution can be selected as the best template.
When more than one template is selected, and taking into account that the results are usually suboptimal, there is a need for an alignment between the target protein and the selected templates. In this case, multiple alignments are indicated. There are several tools that perform multiple alignments, such as ClustalW.
After obtaining the alignments between the target and templates, the process of obtaining the model of the target protein begins. There are several software tools available, which differ with respect to the method applied. Prominent among these are MODELLER [9, 33] and SWISS-MODEL. The software that has shown the best performance is MODELLER. The program models the backbone using a homology-derived restraint method, which is based on the multiple alignment between the target and templates to differentiate between highly conserved and less conserved residues. The model is optimised by energy minimisation and molecular dynamics methods (Figure 5).
The regions of the target that are not aligned with the protein template generally represent loop regions. There are usually some regions caused by insertions and deletions producing gaps in the alignment. Closing these gaps requires modelling of the loops. The loops and the side chains are shaped during the refinement of the model. For this, methods that do not rely on templates can be applied. These include the use of physics parameters and knowledge-based data.
The loops are usually modelled using a database of fragments or by ab initio modelling. The use of a database involves finding parts of protein structures known to fit onto 2 regions (stems) of the target protein, which are the regions that precede and follow the loop to be modelled. The conformation of the best matching fragment is used to model the loop.
Ab initio methods generate many random loops and look for one that presents a low-energy state and includes conformational angles contained within the allowed regions of the Ramachandran plot. The software CODA can be used for loop modelling.
The side chains can be modelled by programs that make use of libraries of rotamers, such as the software SCWRL4. The use of rotamer libraries reduces computational time because only favourable torsion angles are examined.
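The selection step of ab initio loop modelling can be sketched as follows. This is a toy illustration, not the procedure of CODA or any other program: the candidate loops, their energies, and the rectangular approximation of the allowed Ramachandran regions are all hypothetical.

```python
from typing import List, Optional, Tuple

# A candidate loop: (energy, list of (phi, psi) angles in degrees).
Loop = Tuple[float, List[Tuple[float, float]]]


def in_allowed_region(phi: float, psi: float) -> bool:
    """Crude rectangular approximation of allowed regions (assumption)."""
    beta = -180.0 <= phi <= -45.0 and 90.0 <= psi <= 180.0
    alpha_right = -160.0 <= phi <= -45.0 and -70.0 <= psi <= -10.0
    alpha_left = 45.0 <= phi <= 90.0 and 0.0 <= psi <= 90.0
    return beta or alpha_right or alpha_left


def pick_loop(candidates: List[Loop]) -> Optional[Loop]:
    """Return the lowest-energy loop whose angles are all allowed."""
    allowed = [
        loop for loop in candidates
        if all(in_allowed_region(phi, psi) for phi, psi in loop[1])
    ]
    return min(allowed, key=lambda loop: loop[0]) if allowed else None


# Hypothetical candidates: the lowest-energy one has a disallowed residue.
candidates = [
    (-12.0, [(-60.0, -45.0), (60.0, -120.0)]),
    (-9.5, [(-60.0, -45.0), (-120.0, 130.0)]),
    (-3.0, [(-70.0, -40.0), (-65.0, -35.0)]),
]
best_loop = pick_loop(candidates)
```

In this example the loop with energy -12.0 is rejected because one of its residues falls outside the allowed regions, so the loop with energy -9.5 is selected.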
After obtaining the model, its quality must be evaluated. This should be done to make sure that the model has structural features consistent with the physical and chemical rules. Several errors in modelling can occur due to poor choice of template, bad alignment between the target and template, and incorrect determination of loops and side chains.
In the evaluation stage of the model, the structural characteristics as well as the stereochemical accuracy of the model must be examined.
There are tools available for analysing stereochemical properties, such as PROCHECK. PROCHECK checks general physicochemical parameters such as phi-psi angles (Ramachandran plot) and chirality. The parameters of the model are compared with those already compiled.
To validate the model for chemical correctness, it is possible to use the software WHAT IF. WHAT IF is a server that checks planarity and bond angles, among other parameters. It also displays the Ramachandran plot.
Verify3D [4, 26] can be used for the analysis of the pseudo-energy profile of the model. It has a database containing environmental profiles based on secondary structures, and the solvent exposure of solved structures at high resolution. It should be noted that the results may be different when different programs are used for verification.
To distinguish correct from incorrect regions, the ERRAT program can be used; this is based on analysis of the characteristics of atomic interactions compared to those of highly refined structures.
PROtein Volume Evaluation (PROVE) calculates the volume of the atoms in the macromolecules using an algorithm that treats the atoms as spheres, analysing the model in relation to the highly resolved and refined structures stored in the PDB.
Threading modelling is generally used when the template and target sequences share less than 30% identity. Thus, structures that do not share an evolutionary relationship with the target protein can be used as templates. However, the target protein has to adopt a fold similar to that of the protein that has had its structure solved. The method can be classified as a pairwise energy-based method.
Using the sequence of the target protein as input, a search is conducted on a database of structures in order to find the best structural match using the criterion of energy calculation. The process is accomplished through a search for solved structures that are most appropriate for the target protein. The comparison highlights secondary structures because they are evolutionarily conserved.
A model is constructed by placing aligned residues between the structure of the template and the target residues. In the next step, the energy of this model is calculated. This is done on various structures in the database. In the end, the models obtained are ranked based on the energy. The model presenting the lowest energy constitutes the most compatible folding model (Figure 6).
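The ranking step of threading can be sketched with a toy example. The sketch assumes each template is reduced to a list of residue-position pairs that are in contact, mounts the target sequence on it without gaps, and scores it with a hypothetical pair potential that only rewards hydrophobic contacts; real threading programs use far richer energy functions and gapped alignments.

```python
from typing import Dict, List, Tuple

HYDROPHOBIC = set("AVLIMFWC")


def contact_energy(res_a: str, res_b: str) -> float:
    """Hypothetical pair potential: -1 for a hydrophobic-hydrophobic contact."""
    return -1.0 if res_a in HYDROPHOBIC and res_b in HYDROPHOBIC else 0.0


def threading_energy(target: str, contacts: List[Tuple[int, int]]) -> float:
    """Energy of the target mounted, ungapped, on a template contact map."""
    return sum(
        contact_energy(target[i], target[j])
        for i, j in contacts
        if i < len(target) and j < len(target)
    )


def rank_templates(
    target: str, templates: Dict[str, List[Tuple[int, int]]]
) -> List[Tuple[float, str]]:
    """Rank templates from lowest (best) to highest energy."""
    return sorted(
        (threading_energy(target, contacts), name)
        for name, contacts in templates.items()
    )


# Hypothetical target sequence and template contact maps.
target_sequence = "LKVEAIDL"
template_library = {
    "fold_A": [(0, 2), (2, 5), (4, 7), (1, 6)],
    "fold_B": [(1, 3), (3, 6), (0, 6)],
    "fold_C": [(0, 5), (1, 3)],
}
ranking = rank_templates(target_sequence, template_library)
```

The first entry of `ranking` corresponds to the fold with the lowest energy, which is taken as the most compatible folding model.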
2.1.2. Template-Free Modelling
One of the biggest problems in comparative modelling is the lack of templates. Template-free methods generate models based on the physicochemical and thermodynamic properties of the primary chain of the target protein. The processes are iterative: the conformation of the structure is altered until a configuration of lower potential energy is found.
Some methods use knowledge-based force fields as a scoring function. These methods are not strictly free of templates since they employ structures of small fragments of proteins, as in, for example, ASTRO-FOLD [19, 35]. Others use energy functions based on first principles of energy and movement of atoms. Generally, these methods involve the calculation of energies of the structures, which has a high computational cost. They are therefore limited to small proteins (approximately 100 residues), as in the case of the software ROSETTA.
Firstly, ROSETTA breaks the sequence of the target protein into several short fragments and predicts the secondary structures of the fragments using HMMs. These fragments are then arranged (assembled) into a tertiary setting. Random combinations of these fragments generate a large number of models, which have their energies calculated. The conformation that presents the lowest global energy value is chosen as the best model (Figure 7).
3. Molecular Docking
One application of molecular docking is virtual screening, in which a library of compounds is compared to one or more targets, thereby providing an analysis of compounds ranked by potential.
Virtual screening computational techniques are applied to the selection of compounds that can be active in a target protein.
In molecular docking, a ligand is usually placed in the binding site of a predetermined structure of a receptor (Figure 8). In other words, this is a method based on structure. The receptor is typically a protein and the ligand is a small molecule or a peptide. The optimal position and orientation of the ligand are determined using a search algorithm and a scoring function that ranks the solutions.
The first step of the process of molecular docking is to determine the binding sites of the protein. This can be done by software programs such as Q-Sitefinder.
The metaPocket method predicts binding sites by combining 4 methods (LIGSITEcs, PASS, Q-Sitefinder, and SURFnet), which together increase the success rate of prediction. The methods LIGSITEcs, PASS, and SURFnet use only the geometrical characteristics of the protein structure, detecting regions that have the potential to be binding sites. Such methods do not require prior knowledge of the ligands.
In Q-Sitefinder, the surface of the protein is covered with a layer of methyl probes for the calculation of van der Waals interactions between the protein and the probe. Probes with favourable interaction energies are retained, and are classified into groups based on the number of probes per group. The largest and most energetically favourable group is ranked first and considered the best potential binding site.
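The probe-based idea described above can be sketched in a few lines. This is not the Q-Sitefinder implementation: the probe coordinates and energies, the energy and distance cutoffs, and the choice to rank clusters by their summed energy are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# A probe: (x, y, z, interaction energy); all values are hypothetical.
Probe = Tuple[float, float, float, float]


def cluster_probes(
    probes: List[Probe],
    energy_cutoff: float = -1.0,
    distance_cutoff: float = 1.5,
) -> List[List[Probe]]:
    """Keep favourable probes, group neighbours, rank groups by total energy."""
    unassigned = [p for p in probes if p[3] <= energy_cutoff]
    clusters: List[List[Probe]] = []
    while unassigned:
        stack = [unassigned.pop()]
        cluster: List[Probe] = []
        while stack:
            current = stack.pop()
            cluster.append(current)
            # Collect every remaining probe within the distance cutoff.
            near = [
                p for p in unassigned
                if math.dist(current[:3], p[:3]) <= distance_cutoff
            ]
            for p in near:
                unassigned.remove(p)
            stack.extend(near)
        clusters.append(cluster)
    # Most negative summed energy first.
    return sorted(clusters, key=lambda c: sum(p[3] for p in c))


probes = [
    (0.0, 0.0, 0.0, -2.0),
    (1.0, 0.0, 0.0, -1.5),
    (2.0, 0.0, 0.0, -1.2),
    (10.0, 10.0, 10.0, -3.0),
    (11.0, 10.0, 10.0, -0.2),  # discarded: energy above the cutoff
    (20.0, 0.0, 0.0, -1.1),
]
sites = cluster_probes(probes)
```

The first element of `sites` is the group of three neighbouring probes, which would be reported as the best potential binding site in this toy example.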
Another step is to define the position of the ligand in the pocket. This can be predicted by molecular docking algorithms.
Several docking methods have been developed, and they differ in their scoring functions and search methodologies.
The search algorithms have to be able to present different configurations and orientations of the ligand in a short time. Search algorithms, such as those used in molecular dynamics, Monte Carlo simulations, and genetic algorithms, among others, are all suitable for molecular docking.
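As an illustration of one of these search strategies, the following minimal sketch applies a Metropolis Monte Carlo search to a single pose coordinate. The one-dimensional scoring landscape, the step size, and the temperature are hypothetical; a real docking program searches over translations, rotations, and torsion angles.

```python
import math
import random
from typing import Tuple


def score(pose: float) -> float:
    """Hypothetical 1-D scoring landscape with several local minima."""
    return 0.1 * (pose - 2.0) ** 2 + math.sin(3.0 * pose)


def metropolis_search(
    start: float,
    steps: int = 5000,
    step_size: float = 0.5,
    temperature: float = 0.5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the best pose and score visited during a Metropolis walk."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        trial = current + rng.uniform(-step_size, step_size)
        trial_score = score(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse poses with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


best_pose, best_score = metropolis_search(start=-3.0)
```

Accepting some unfavourable moves is what allows the search to escape local minima of the scoring function.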
Scoring functions must be able to discriminate between different ligand-receptor interactions. These can be grouped into force-field, empirical, and knowledge-based methods.
The algorithms can be classified into rigid body docking and flexible docking algorithms. In rigid-body docking, both the ligand and receptor are rigid. These methods are faster, but do not allow ligand and receptor to adapt to the binding. In flexible methods, the computational cost is higher compared to rigid methods. However, in these cases, the flexibility of the ligand and/or receptor is considered.
Another important factor to be considered in ligand-receptor interactions is the presence of water. Some methods allow water molecules to be positioned. In cases where this is not possible, the position of water molecules can be predicted using a software program such as GRID.
GRID calculates the interactions between chemical groups and small molecules with known 3-dimensional structures. The energies are calculated as the sum of Lennard-Jones, electrostatic, and hydrogen-bonding interactions between the chemical groups and the 3-dimensional structures, using a position-dependent dielectric function.
GOLD uses a genetic algorithm that propagates multiple copies of flexible models of the ligand in the active site of the receptor and recombines segments of these copies at random until a converged set of structures is generated.
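The kind of energy terms mentioned above can be illustrated with a generic sketch. It is not the GRID energy function: it uses the standard 12-6 Lennard-Jones form and a Coulomb term with a simple distance-dependent dielectric, eps(r) = slope * r, which is an assumption chosen for illustration; the hydrogen-bonding term is omitted, and all parameter values are hypothetical.

```python
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2)


def lennard_jones(r: float, epsilon: float, sigma: float) -> float:
    """Standard 12-6 Lennard-Jones energy for two atoms at distance r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def coulomb(r: float, q1: float, q2: float, slope: float = 4.0) -> float:
    """Coulomb energy with a distance-dependent dielectric eps(r) = slope * r."""
    return COULOMB_CONSTANT * q1 * q2 / (slope * r * r)


def pair_energy(
    r: float, epsilon: float, sigma: float, q1: float, q2: float
) -> float:
    """Sum of the van der Waals and electrostatic terms for one atom pair."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)


# Hypothetical parameters: well depth 0.2 kcal/mol, sigma 3.5 Angstrom.
r_min = 3.5 * 2.0 ** (1.0 / 6.0)  # distance of the Lennard-Jones minimum
energy_at_minimum = lennard_jones(r_min, epsilon=0.2, sigma=3.5)
example_pair = pair_energy(4.0, epsilon=0.2, sigma=3.5, q1=0.4, q2=-0.4)
```

The Lennard-Jones term is zero at r = sigma and reaches its minimum of -epsilon at r = 2^(1/6) * sigma, which gives a quick check on any implementation.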
The process of searching the databases can be time consuming. One way to reduce the search space is to filter the database with a fast algorithm and retain only the best-ranked candidates; a slower, more accurate search algorithm then generates a new ranking of the retained ligands. Another way to reduce the number of ligands being studied is to search for those that offer the greatest possibility of being used in drug design. In this case, it is possible to filter the database by using an ADMET (absorption, distribution, metabolism, excretion, and toxicity) filter.
Lipinski's rule of 5 can be used. The rule of 5 is a set of properties that characterise compounds that exhibit good oral bioavailability. It states that, in general, an orally active drug has no more than 1 violation of the rules (Table 1):
|Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)|
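The two-stage strategy can be sketched as a simple funnel. The ligand names and scores below are hypothetical, and the two scoring callables stand in for a fast and a slow docking protocol; lower scores are assumed to be better.

```python
from typing import Callable, List


def two_stage_screen(
    ligands: List[str],
    fast_score: Callable[[str], float],
    slow_score: Callable[[str], float],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Rank with a fast method, keep the top fraction, re-rank with a slow one."""
    ranked_fast = sorted(ligands, key=fast_score)
    n_keep = max(1, int(len(ranked_fast) * keep_fraction))
    shortlist = ranked_fast[:n_keep]
    # Only the shortlist is passed to the expensive scoring step.
    return sorted(shortlist, key=slow_score)


# Hypothetical precomputed scores.
fast_scores = {"a": -7.1, "b": -5.0, "c": -8.3, "d": -6.9, "e": -4.2, "f": -7.8}
slow_scores = {"a": -9.4, "b": -6.0, "c": -8.8, "d": -7.5, "e": -5.1, "f": -7.0}

final_ranking = two_stage_screen(
    list(fast_scores), fast_scores.__getitem__, slow_scores.__getitem__
)
```

Only half of the library reaches the slow step here, and the final order can differ from the order given by the fast method.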
|Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)|
|A molecular mass less than 500 daltons|
|An octanol-water partition coefficient log P not greater than 5|
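The four criteria in Table 1 translate directly into a filter. The sketch below assumes the descriptors have already been computed by some other tool; the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_mass: float  # daltons
    log_p: float  # octanol-water partition coefficient


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-5 criteria are violated."""
    return sum([
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
        d.molecular_mass >= 500.0,
        d.log_p > 5.0,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """An orally active drug has, in general, no more than 1 violation."""
    return lipinski_violations(d) <= max_violations


# Hypothetical compounds.
small_polar = Descriptors(1, 4, 180.16, 1.2)
borderline = Descriptors(2, 5, 520.0, 4.0)
large_sticky = Descriptors(6, 12, 650.0, 3.0)

results = [passes_rule_of_five(c) for c in (small_polar, borderline, large_sticky)]
```

The second compound is retained despite exceeding the mass limit, because a single violation is tolerated.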
Analysis of the metabolic fate and chemical toxicity of the compounds can be accomplished using the software programs DEREK and METEOR. DEREK predicts whether a given chemical is toxic to humans, mammals, and bacteria. METEOR uses the knowledge of metabolism rules to predict the metabolic fate of chemicals, assisting in the choice of more efficient molecules.
4. Ligand-Based Virtual Screening (LBVS)
Other methods can also be used for screening databases of compounds, such as those based on ligands (LBVS). In this case, a similarity search can be made between known bioactive compounds and molecules contained in databases. LBVS techniques include methods based on the pharmacophore and quantitative structure-activity relationship (QSAR) modelling.
In pharmacophore-based virtual screening, a hypothetical pharmacophore is taken as a template. The goal of screening is to identify molecules that show chemical similarities to the template.
QSAR is based on the similarity between structures. It is a quantitative relationship between a biological activity and the molecular descriptors that are used to predict the activity. QSAR searches for similarities between known ligands and each structure in a database, investigating how the biological activity of the ligands can be correlated to their structural features.
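A similarity search of this kind is often illustrated with the Tanimoto coefficient, which is used here as an example measure and is not prescribed by this chapter. The sketch assumes each molecule is represented by a set of integer feature identifiers (a fingerprint); the fingerprints, names, and threshold are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Shared features divided by the total number of distinct features."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(
    query: Set[int], database: Dict[str, Set[int]], threshold: float = 0.5
) -> List[Tuple[float, str]]:
    """Return (similarity, name) hits at or above the threshold, best first."""
    hits = [(tanimoto(query, fp), name) for name, fp in database.items()]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)


# Hypothetical fingerprints of a known active and three database molecules.
known_active = {1, 2, 3, 5}
database = {
    "mol_A": {1, 2, 3, 4},
    "mol_B": {7, 8, 9},
    "mol_C": {1, 2, 3, 5},
}
hits = similarity_search(known_active, database)
```

Molecules that share most features with the known active are returned first, and dissimilar molecules are dropped.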
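In its simplest form, a QSAR model is a regression of activity on one descriptor. The following minimal sketch fits a straight line by ordinary least squares; the descriptor values, the activities, and the single-descriptor model are hypothetical, and practical QSAR models use many descriptors and require validation.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: one descriptor value and one activity per ligand.
descriptor_values = [1.0, 2.0, 3.0, 4.0]
activities = [2.1, 3.9, 6.2, 7.8]

slope, intercept = fit_line(descriptor_values, activities)
# Predict the activity of an untested compound with descriptor value 5.0.
predicted_activity = slope * 5.0 + intercept
```

The fitted relationship is then used to estimate the activity of database compounds from their descriptors alone.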
5. Examples of Virtual Screening / Molecular Docking in Animal Venom
One study performed a virtual screening against α-cobratoxin. The neurotoxin α-cobratoxin (Cbtx), isolated from the venom of the Thai cobra Naja kaouthia, causes paralysis by preventing acetylcholine (ACh) binding to nicotinic acetylcholine receptors (nAChRs). A search for α-cobratoxin structures was carried out in the PDB, and the virtual screening of 1990 compounds was performed using the program AutoDock. On [3H]epibatidine and on [125I]α-bungarotoxin, NSC121865 (compound 23) was most potent in binding with Ac (Kd = 16.26 nM; Kd = 36.63 nM). The results indicated that NSC121865 would be a very useful potential lead in the development of a new treatment for snakebite victims. This inhibitor can be used for the development of a more potent and specific anti-cobratoxin.
Another study investigated the effects of protease inhibitors, including phenylmethylsulfonyl fluoride (PMSF), benzamidine (BMD), and their derivatives, on the activity of recombinant gloshedobin, a snake venom thrombin-like enzyme (SVTLE) from the snake Gloydius shedaoensis. The structural model of gloshedobin was built by homology modelling using the modelling package MODELLER. The stereochemical quality of the homology model was assessed using the PROCHECK program, and the software AutoDock was used to dock inhibitors onto the structural model of gloshedobin. The docking results indicated that the strongest inhibitor, PMSF, bound covalently to the catalytic Ser195.
A third study evaluated the inhibitory effect of 1-(3-dimethylaminopropyl)-1-(4-fluorophenyl)-3-oxo-1,3-dihydroisobenzofuran-5-carbonitrile (DFD) on viper venom-induced haemorrhagic and PLA2 activities. Molecular docking studies of DFD and snake venom metalloproteases (SVMPs) were performed to understand the mechanism of inhibition by DFD, since SVMPs constitute one of the protein groups responsible for venom-induced haemorrhage. The docking results showed that DFD binds to a hydrophobic pocket in SVMPs with a Ki of 19.26 × 10⁻⁹ (kcal/mol) without chelating Zn2+ in the active site.
In silico approaches used in protein structure prediction and in drug discovery research have been presented in this chapter.
Computational methods used in the search for inhibitors play an essential role in the process of discovering new drugs.
The application of protein modelling methods has contributed significantly in cases where the structure of the target protein has not been solved, allowing the SBVS process to be completed.
Good results obtained by virtual screening depend on the quality of structures, databases to be scanned, the search algorithms, and scoring functions. Therefore, there must be a good interaction and exchange of information between in silico and experimental methods. Careful application of these strategies is necessary for successful drug design.
Table 2 presents a list of software tools and server web sites.
The author would like to thank CAPES-PROEX and CNPq for financial support.