Relative ranking of positive controls using the two scoring methods compared to experimental data.
All living cells have a tendency to maintain their genomic stability with as few mutations as possible. This is of crucial importance to the normal function of cells in complex environments, correctly timed cell cycle progression, and a commitment to apoptosis when appropriate (Wood, et al., 2001). In this context, the balance between constancy and mutability in the context of genomic stability must be precisely regulated and controlled. To achieve this objective, a number of multiple and overlapping DNA repair pathways have been crafted within the cell (Harper & Elledge, 2007). Nevertheless, an elevated activity of these pathways could significantly decrease cancer cells’ sensitivity to many known anticancer agents and, consequently, increase their antitumor drug resistance. This unforeseen role stems from the fact that most cancer chemotherapy in clinical use today, directly or indirectly damage DNA by causing single- or double-stranded DNA breaks or by interfering with the functions of crucial DNA interacting proteins. As a natural cellular response, following the detection of damage, DNA repair pathways attempt to restore the genome and restore the normal state of the cell. During this course, the cell’s fate is mainly determined by the effectiveness of DNA repair mechanisms which allow the cell to survive or, if the damage is too heavy, induce apoptosis, causing the cell to die (Harper & Elledge, 2007). Consequently, to improve existing cancer therapies, DNA repair pathways have been considered as novel therapeutic targets. Several DNA repair inhibitors have been reported, some of which have been recently proven to be successful (Damia & D'Incalci, 2007)
This review paper focuses on our efforts directed at in silico searches for inhibitors of proteins that control the DNA repair circuitry. The targets chosen here play critical roles in tumor cell initiation and progression, hence their regulation offers promise for the improvement of current cancer therapy. Two of these targets are DNA repair proteins that are directly linked to the hallmark “relapse” or “drug resistance” phenomena. These are Excision Repair Cross-Complementation Group 1 (ERCC1) (Kang, et al., 2006), and DNA polymerase beta (pol β) (Parsons, et al., 2004). The former is a key player in Nucleotide Excision Repair (NER), while the latter is the error-prone polymerase of Base Excision Repair (BER). The third target is p53 (Teodoro, et al., 2007), a so-called guardian protein of the genome that is inactivated in more than half of all human cancers investigated. An additional aim of this review is to share with the reader our experience as a computational drug discovery group by describing the virtual screening protocol we have developed in order to successfully address these biological problems. This chapter is divided into two main sections. The first gives a description of the computational workflow that we typically follow in our virtual screening tasks. The second is a summary of our findings for the individual targets listed above.
2. An improved virtual screening (VS) protocol
Fig. 1 illustrates the essential steps taken in order to execute the virtual screening (VS) protocol used in our lab. In a nutshell, the developed protocol employs molecular docking, molecular dynamics simulations and clustering techniques to filter a given library of compounds for inhibitors of a particular target. The concepts behind VS and other computational tools are described elsewhere (Limin, et al., 2011); (Stahura & Bajorath, 2004). However, a detailed description and rationale behind each step of this workflow are summarized below. Except for a few steps that need carful preparation, the whole process has been automated. It starts with a collection of 3D structures of ligands and a well-prepared target structure. It finally yields a set of top hit structures in their preferred binding modes with the target. Although the following steps were applied to the three specific targets described in the following section, the procedure is general and the same method is applicable to almost any bio-molecular target.
2.1. Target preparation
2.1.1. Primary assessment of target structure
In general, the downloaded “crude” crystal structure of a target contains many details that must be taken into account. This includes non-standard amino acids; co-factors; other small molecules that are present due to the crystallization process; ions and co-crystallized water molecules. For most small molecules like polyethylene glycol, it is advisable to remove them from the structure, since they are not included in the native form of the target but were required for the crystallization process. Moreover, non-standard amino acids must be carefully assessed and modeled. In many protein structures, these unusual amino acids lack several atoms because most structure handling packages do not check automatically inspect for them. Their parameters must be appended to the used Force Field (FF) before starting further simulations. Co-factors, ions and co-crystallized water molecules should be included within the simulated structure.
Water molecules that are located close to or within the binding site can mediate several interactions with the ligands. However, it is important to find out which water molecules are conserved within these regions. Any unpreserved (misplaced) water molecule can obstruct the docking simulation and lead to incorrect results. One way to identify important water molecules is to compare several crystal structures of the same target (if applicable) and choose the water molecules to be kept during the docking procedure. When a limited number of target structures is available, it is important to run different docking exercises by removing/keeping these water molecules and selecting the cases that lead to realistic and favorable binding modes.
An additional decision-making tool for the selection of water molecules is to use prediction software packages (e. g. ConSolv 1.0) that check whether a bound water molecule is likely to be conserved or displaced in other, independently solved crystallographic structures of the same target.
Finally, it is necessary to verify that no parts of the protein structure are missing. These missing residues are usually mentioned at the header of the Protein Data Bank (PDB)-file and must be added and relaxed within the target structure. Regarding the targets that were studied in this work, all missing amino acids were distant away from the binding site.
Nonetheless, we added and relaxed them using molecular dynamics (MD) simulations before running the docking experiments (see below).
2.1.2. Identifying the binding site
The starting point of any VS study is the identification of the binding site within the target protein. This portion of the protein is directly related to the biological activity that needs to be regulated. At this stage, it is important to consult previously published work and determine if there are any known active compounds that bind to the target protein (positive controls) and to ascertain their binding location. If the binding site is not exactly known, however, there is a set of active molecular structures that exist, one should run a series of blind docking experiments until a suitable and experimentally verified binding site is found (Bennett, et al., 2010); (Hazan, et al., 2008). Regarding the three targets that we focus on here, the binding sites were accurately known, mainly because they are protein-protein interaction sites (e.g. ERCC1-XPA and p53-MDM2/MDM4), or protein-DNA binding sites (e.g. DNA-pol β) where crystal structures of the interacting subunits are available.
2.1.3. Protonation states of charged residues
Proper adjustment of the protonation states, “the assignment of Hydrogen atoms” of the ionizable groups, contained by the target structure is important for any successful VS simulation. These residues play key roles in inter-protein, protein-solvent and protein-ligand interactions. The protonation states can be determined by predicting the pKa value of charged residues and comparing it to the pH value at which the simulation is performed. In this work, all protonation states of ionizable residues were calculated using the software PROBKA and adjusted at physiological pH of 7.0 (H. Li, et al., 2005). PROBKA is a very fast and accurate method that relates the structure and environment of the charged residues to the change of the pKa values from their intrinsic ones. Once the protonation states have been decided, all hydrogen atoms are then added to the system according to a given force field (FF). For the three targets, the AMBER99SB FF was used (Hornak, et al., 2006). At this stage, the protein structure is ready for the docking or MD simulations.
2.2. Ligand collection preparation
In parallel with target preparation, the organization and cleaning up of the set of compounds is undertaken for in silico screening. Currently, there are many suitable, easy to access compound databases that contain millions of molecules spanning various structural families. Prior to any screening exercises, one should decide on a set of compounds to be filtered and build up a virtual compound collection (VCC) of compounds. This collection will be repeatedly used against many targets. A typical VCC should include marketed drugs, lead-like compounds, fragment structures, commercially available chemicals and other high-activity molecules. It is also important to represent these molecules in different protonation, stereo and conformational states. An effective VCC should be constructed from molecules that are suitable for further lead optimizations, after they show biological activity.
2.2.1. Construction of the VCC
Five different databases comprise the core of our VCC. These are the National Cancer Institute diversity set (NCIDS), the DrugBank database (Wishart, et al., 2006), subsets of the ZINC database (Irwin & Shoichet, 2005) and finally, the French national chemical library “la Chimiothèque Nationale” (CN). Some of them are used in the first iteration of VS and others are retained for higher-order screening exercises.
The NCIDS is a collection of approximately 2,000 compounds that are structurally representative of a wide range of molecules, representing almost 140,000 compounds that are available for testing at the NCI. A number of its ligands contain rare earth elements and cannot be properly parameterized for docking experiments, leaving us with 1,883 compounds that can be actually used. We use a cleaned 3D version of the NCIDS formatted for use in AutoDock (the main docking program used by us) (Goodsell & Olson, 1990) and was prepared by the AutoDock Scripps team. What makes the NCIDS so valuable and extensively screened by many groups (even in HTS) is that its individual molecules have distinctive structures and are the cluster representatives of their parent families. Having screened and ranked the molecules, one can re-screen the subset of the representative structures, instead of screening the entire NCI set of compounds.
The DrugBank database is not only a set of molecules representing FDA-approved (and investigational) drugs, but also it is a unique bioinformatics and cheminformatics resource since it relates each drug to its target(s). It includes details about the different pathways, structural information and chemical characteristics of these targets and the way they are involved in a particular disease. This information is stored on a freely accessible website that is linked to other databases (KEGG, PubChem, ChEBI, PDB, Swiss-Prot and GenBank) and to a range of structure displaying applets. The DrugBank collection includes ~4,800 drug structures including >1,350 FDA-approved small molecule drugs, 123 FDA-approved biotech (protein/peptide) drugs, 71 nutraceuticals and >3,243 experimental drugs. Once a hit is identified from this library, it is simply a drug. This means it overcomes many barriers of preclinical and clinical testing and development and can be directly tested for its novel biological activity. Moreover, a hit from this collection may explain a mysterious side effect that would not be discovered before its identification as a regulator of the examined target.
ZINC is a free database dedicated to VS It includes more than 13 million purchasable compounds most of which are “drug-like” or “lead-like”. These compounds are available in several 3D formats and compatible with several docking programs. The ZINC database has many other interesting features. For example, one can easily create a subset of the whole database with any given set of properties such as specific functional groups, molecular weight, and a calculated logP. Most of the compounds also exist in multiple protonation states suitable for different pH values, several tautomeric forms, all possible stereochemistries, and different 3D conformations. The database is also organized so that the origin of each molecule is known. That is, one can determine the vendor and original catalog number for each commercial source of a compound. Similarly to the DrugBank database, a molecule can be annotated for its function or activity. It also has a powerful web server that helps in searching, browsing, creating subsets, and downloading some or all of the molecules in the database.
The CN chemical library (~100,000 compounds) is a repository of all synthetic, natural compounds and natural extracts in the existing French public laboratories. This database is divided into two main categories. The first part includes information about all synthetic products, while the second contains the natural compounds and extracts. In this work, we used the whole CN database in our screening. In contrast to the previously mentioned databases, compounds in this library are represented by 2D SDF structures with no hydrogen atoms attached. This required a number of cleaning and preparation steps before using them in our VS simulations (see below).
2.2.2. Enriching the VCC core
This is where ligand-based methods come to play a significant role in the pre-screening process. Any molecule that is known to bind to the target-binding site can serve as a positive control. Such molecules can be identified through published articles or previous patents. Besides their function in directing and verifying the simulation parameters, they can be used as seeds in the searches for similar chemical structures to enrich the VCC. This is a crucial step, which should be taken even if the identified similar structures have been previously removed from the VCC in its early construction steps.
Following this strategy, we have used known inhibitors for the p53-MDM2 interaction (see section 3.2) and DNA pol β (see section 3.3) to enrich their representative VCCs For the ERCC1-XPA interaction (see section 3.1), initially, there was no active compound confirmed to bind to the ERCC1 pocket. Hence, for the first round of VS, we started from scratch and did not apply this enrichment method. However, it was used in the second round of screening, after the first iteration identified a list of novel binders to the ERCC1 target.
2.2.3. Cleaning up the VCC
Having decided which collection of compounds to use in the screening process, one should spend time and effort to ensure the quality of the used ligand structures. As mentioned before, it is important to adopt proper protonation and conformational states for the ligands. For example, the original CN library of compounds is a collection of 2D structures with no hydrogen atoms. Ligands in this state are not suitable for docking using many of the popular docking programs. These software packages require 3D structures with proper placement of hydrogen atoms. One solution to this problem, which was followed in the ERCC1-XPA case (see section 3.1), is to use conversion software that can translate the 2D information into its 3D representative structure Many of such programs are available (e.g. Open Babel and LigPrep from Schrödinger). We prefer LigPrep for this task because it produces structures with few errors compared to Open Babel, especially in bond connection and hydrogen atoms assignment.
2.3. Generation of an ensemble of target structures
Proteins are inherently dynamical macromolecules. Their dynamical behavior is essential in order to recognize and bind to other molecules inside the cell. Although many attempts have been made to partly include the flexibility of the molecular target within docking algorithms (Schneider & Bohm, 2002), there are still many barriers and challenges that impede progress in this field. One major challenge is the enormous number of conformations that are accessible to the target under equilibrium conditions The range of these conformations is very wide and includes many local and global movements within the structure of the protein. These dynamical transitions can be as small as minor rotations of the side-chains or as large as the complete dislocation of domains within the same target. There are many crystal structures in the PDB that give evidence to this bizarre dynamical behavior. These conformational changes can be illustrated by comparing different crystal structures of the same target, especially, between its bound and unbound forms.
2.3.1. Hybrid MD-docking methods
One way to accommodate receptor flexibility and to offer more accurate scoring techniques is to implement a hybrid method between docking and MD simulations. Originally, the use of MD simulations in VS studies was intended to create a set of receptor conformations (Broughton, 2000; Carlson, et al., 2000). However, it was always debatable whether to use structures derived from MD simulations or NMR data. In our opinion, if a reasonable ensemble of NMR structures exist, one should consider using them all, instead of running long MD simulations. However, if the VS exercise departs from a single X-ray crystal structure, it is important to generate such an ensemble using MD simulations.
In this context, a successful approach, reported by McCammon and his team, is the relaxed complex scheme (RCS)(Lin, et al., 2002; Amaro, et al., 2008). This method, illustrated in Fig. 2, forms the foundation of the VS protocol presented here. In the RCS approach, all-atom MD simulations (e.g., 2-5 ns simulation) are applied to explore the conformational space of the target, while docking is subsequently used for the fast screening of drug libraries against an ensemble of receptor conformations This ensemble is extracted at predetermined time intervals (e.g., 10 ps) from the simulation, resulting in hundreds of thousands of protein conformations. Each conformation is then used as a target for an independent docking experiment.
2.3.2. Principle component analysis and sampling convergence
A typical MD trajectory displays the time dependence of atomistic Cartesian coordinates. Although the duration of the whole trajectory is typically very short (at best, on the order of hundreds of ns) compared to real life biological dynamics, it involves a huge number of snapshots that contain a mixture of fast and slow modes of motion. It is impossible to segregate or understand this mixed dynamics through simple analysis (e.g. visual inspection). However, covariance, or principle component, analysis (PCA) can break up these two types of motions and extract the essential dynamics (ED) spanned by the protein structure. This essential dynamics represents the collective movements that are directly linked to the function of the protein and are essential for its role. In fact, PCA transforms the original space of correlated variables from a large MD simulation into a reduced space of independent variables (Garcia, 1992); (Amadei, et al., 1993). For a typical protein, the system’s dimensionality is thereby reduced from tens of thousands to fewer than fifty degrees of freedom.
To perform PCA for a subset of N atoms, the entire MD trajectory is RMSD fitted to a reference structure, in order to remove all rotations and translations. The covariance matrix can then be calculated from their Cartesian atomic co-ordinates as:
Where ri represents the three Cartesian co-ordinates () and the eigenvectors of the covariance matrix constitute the essential vectors of the motion. It is generally accepted that the larger an eigenvalue, the more important its corresponding eigenvector in the collective motion. PCA can also be employed to predict the completeness of sampling during the MD simulation. This step is critical and was used in three cases to answer a very important question: when to stop the simulation and start extracting the dominant conformations of the protein? In this, we follow a method proposed by Hess (Hess, 2002) that divides an MD trajectory into separate parts, and their normalized overlap is calculated using the covariant matrices for each pair of parts:
where C1 and C2 are the covariant matrices, and the symbol tr denotes the trace operation. If the overlap is 0, then the two sets are considered to be orthogonal, whereas an overlap of 1 indicates that the matrices are identical. In this context, for the three targets studied described below, the individual whole trajectories were divided into three parts and the normalized overlap between each pair was calculated to determine the completeness of sampling.
2.3.3. Iterative clustering to extract dominant conformations
Once a sufficient sampling is confirmed through the aforementioned PC calculations, clustering analysis is used to extract a set of target structures that represent dominant conformations. Unfortunately, there is no universally accepted clustering algorithm or parameters that can be used to extract all information contained within the MD simulation. However, recent studies suggest that a number of clustering algorithms, such as average-linkage, means and self-organizing maps (SOM) can be used accurately to cluster MD data (Shao, et al., 2007). In this work, the clustering quality was anticipated by calculating a number of clustering metrics. These metrics can reveal the optimal number of clusters to be extracted and their population size. These are the Davies-Bouldin index (DBI)(Davies DL & Bouldin DW, 1979) and the "elbow criterion" (Shao, et al., 2007). A high-quality clustering scheme is correlated with high DBI values. On the other hand, using the elbow criterion, the percentage of variance explained by the data is expected to plateau for cluster counts exceeding the optimal number of clusters. Using these metrics, by varying the number of clusters, one should expect for adequate clustering, a local minimum for DBI and a horizontal line for the percentage of variance explained by the data. Fig. 3 describes an example of such calculations.
Our implementation employs an iterative clustering algorithm using the above-mentioned hypothesis. The procedure is established as an in-house code using the PTRAJ utility of AMBER10 (Case, et al., 2005). A modified version of the code is also used to cluster the docking results. MD trajectories’ clustering runs the average-linkage algorithm for a number of clusters ranging from 5 to 150 clusters. Structures are extracted at 2 ps intervals over the entire simulation time. In order to remove the overall rotation and translation, all C∝atoms are fitted to the minimized initial structure. RMSD-clustering is performed on the residues contained in the investigated binding sites. These residues are clustered into groups of similar conformations using the atom-positional RMSD of the entire amino acid, including side chains and hydrogen atoms, as the similarity criterion. The centroid of each cluster, the structure having the smallest RMSD to all members of the cluster, is chosen as the cluster representative structure and the most dominant structures are used as rigid templates for the ensemble-based docking experiments (see below).
2.4. Docking ligands to the ensemble of target structures
As stated above, the outcome of the iterative clustering step is an ensemble of protein structures that are used as targets for docking. The main docking program that was used for the three cases analyzed was AutoDock version 4 (Garrett MM, et al., 1999). AutoDock is one of the most popular docking packages that utilize different conformational search methods, including Simulated Annealing (SA), traditional Genetic Algorithm (GA), and Lamarckian Genetic Algorithm (LGA). Here, we use the LGA approach. The approach is well-described in the original paper by Morris et al. (Garrett MM, et al., 1999).
2.4.1. Automated clustering of docked poses
AutoDock can cluster output poses into subgroups depending on their RMSD values referred to a reference structure. Although this approach is widely used, the number of clusters and the population size of each cluster strongly depends on the RMSD cut-off used. Consequently, it is impossible to predict the optimal cut-off for the RMSD in order to produce a clustering pattern with the highest confidence. This motivated us to use an alternative approach when clustering the docked ligand structures In fact, we extended and automated the clustering methodology that was used in section 2.3.3 to couple the elbow criterion (Shao J, et al., 2007) with the clustering module of PTRAJ (Case, et al., 2005). This method exploits the fact that the percentage of variance exhibited by the data (λ), is expected to plateau for cluster counts exceeding the optimal number.
The percentage of variance is defined by:
where (SSR) is the sum-of-squares regression from each cluster summed over all clusters and (SST) is the total sum of squares. Here, we used the SOM algorithm to cluster the docking results. This modified clustering program increases the number of clusters required until the percentage of variance exhibited by the data (λ) plateaus. The convergence of clustering can be determined by calculating the first and second derivatives of the percentage of variance with respect to the clusters number (dλ/dN and d2 λ/dN2 ) after each attempt to increase the cluster counts. The clustering process then stops at an acceptable value for these derivatives that is close to zero. In this way, the clustering procedure depends only on the system itself and adjusts itself to arrive at the optimal clustering pattern for that specific system.
2.4.2 Preliminary ranking of docking results
The VS protocol then sorts the docking results by the lowest binding energy of the most populated cluster. The compounds can also be ranked using their weighted average binding energies according to the following formula:
where i is the index number of each ensemble cluster, whose percent distribution sums up to 100% and M is number of different structures included in the ensemble. The VS protocol only considers a compound among the top hits if the most populated cluster from any of the VS experiments includes at least 25% of all docked conformations. The top N hits of the combined docking runs construct an irredundant set of promising compounds that are used for further analysis. In this work, a typical preliminary set (N) includes from 200 to 500 compounds.
2.4.3 Visual inspection and selection of a focused set of hits
Visual inspection of the preliminary set of hits is necessary before proceeding to the later computationally rigorous steps. Although this step involves more human intervention, it assures the quality of the docking results, which are not precise in terms of ranking or the final selection of binding geometries.
2.5. Molecular dynamics simulations on selected hit-target complexes
There are many factors (e.g. water content, protein flexibility, etc.) that are not well characterized within the docking context (Warren, et al., 2006). During this step, the VS protocol aims at accounting for these factors by performing MD simulations. Each simulation starts from the final docked structure. The important aspect at this stage is the solvation of the docked models. It is generally accepted that water molecules are not only involved in solvation/desolvation of the protein-ligand complexes, but also mediate their interactions and help in generating more suitable binding modes. MD simulations relax the structures, rearrange water and ion molecules and generate trajectories that are used during the next step of binding energy calculations. The output obtained from this step is a set of snapshots representing the trajectory of the MD simulations for each complex. Although this procedure requires extensive computational resources, it tends to improve the protein–ligand interactions and enhance their molecular complementarity.
2.6. Rescoring of hits using the MM-PBSA method
Besides using MD simulations to refine the docked structures, another essential constraint for a successful VS experiment is to accurately predict the binding energies. To correctly perform this task, we need to move away from simple docking scoring methods. However, we are also restricted by the need for a fairly fast method that can be applied to many systems at a reasonable computational cost. In this context, the VS protocol utilizes a fast and efficient scoring method to suggest the final ranked set of top hits. This method is the molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) technique. The method was initially proposed by Kollman et al. (Kollman, et al., 2000) and it combines molecular mechanics with continuum solvation models. The method has been extensively tested on many systems and shown to reproduce, with an acceptable range of accuracy, experimental binding data. It was also validated as a VS refining tool and revealed excellent results in predicting the actual binding affinities and in discriminating true binders from inactive (decoy) compounds (Abagyan & Totrov, 2001; Schneider & Bohm, 2002; Shoichet, et al., 1993). Its main advantages are the lack of adjustable parameters and the option of using a single MD simulation for the complete system to determine all energy values.
In this work we used the MM-PBSA method as implemented in AMBER. The total free energy is the sum of average molecular mechanical gas-phase energies (EMM), solvation free energies (Gsolv), and entropy contributions (-TSsolute) of the binding reaction:
The total molecular mechanical energies can be further decomposed into contributions from electrostatic (Eele), van der Walls (Evdw) and internal energies (Eint):
Furthermore, the solvation free energy can be expressed as a sum of non-electrostatic and electrostatic contributions:
The non-electrostatic part was approximated by a linear function of the (SASA). That is:
3. Targeting DNA repair proteins
Below is a description of our target proteins and a summary of the results we found after applying the above-mentioned VS protocol to regulate their activity.
3.1. Case I: inhibitors of the ERCC1-XPA interaction
Platinum-based cancer therapy is one of the most efficacious treatments for many cancer types including testes, ovary, head, neck, and lung cancers (Boulikas & Vougiouka, 2003). While generally efficient at inducing apoptosis, acquired resistance to platinum compounds has limited their efficacy and therefore reduced successful clinical use of these agents (Mc Guire & Ozols, 1998). One way of acquiring this type of drug resistance is the remarkable high activity of the Nucleotide Excision Repair (NER) pathway (Rabik & Dolan, 2007).
Although many proteins are involved in the NER machinery, only the over-expression ERCC1 (a 33-kDa protein) correlates with the augmented platinum resistance. This spectacular conclusion has been reached from several independent clinical trial investigations on ovarian (Kang, et al., 2006), colorectal (Shirota, et al., 2001), and non–small cell lung cancer (Lord, et al., 2002). ERCC1 forms a tight heterodimer endonuclease complex with XPF. At the final stages of NER, the ERCC1-XPF enzyme cleaves the damaged DNA strand at the phosphodiester bonds on the 5’ side of the damage. Prior to the incision step in NER, the ERCC1-XPF endonuclease is recruited to the damaged DNA site through a secondary interaction between ERCC1 and XPA (L. Li, et al., 1994). This protein-protein interaction is necessary for a functional NER mechanism.
The NMR crystal structure was resolved by Tsodikov’s group (Tsodikov, et al., 2007). The critical residue-residue interactions as determined through our binding energy predictions are shown in Fig. 4. A 14-residue peptide from XPA that includes three essential consecutive glycines (residues 72–74) is buried within a hydrophobic cleft within the central domain of ERCC1. This peptide has two critical characteristics (Tsodikov, et al., 2007). First, it is necessary and sufficient for binding to ERCC1. Second, and more importantly, it can compete with the full-length XPA protein in binding to ERCC1 and disrupting NER in vitro. Moreover, there is no other cellular function beyond NER that has been observed for XPA (Rosenberg, et al., 2001). These observations, coupled with the available crystal structure of this interaction make ERCC1 and XPA an extremely attractive target for computationally based development of small molecule inhibitors that are targeted for use in combination therapies involving cisplatin.
3.1.1. Two-stage filtering procedure
Following the VS protocol described in section 2, we carried out a two-stage virtual screening procedure. Top hits from the first iteration were used as positive controls for the second First, we screened relatively small chemical libraries from the constructed VCC (see section 2.2.1). The objective at this stage was to discover novel lead compounds that can be used as positive controls for subsequent screening simulations and experiments. Furthermore, we wished to construct a pharmacophore model that can guide future research toward discovering potent and specific inhibitors of this interaction. A detailed description of the parameters used and the results obtained in this study can be found in the published manuscript (K. H. Barakat, et al., 2009). Here, we summarize the major steps and report important findings.
Prior to the screening process, it was important to determine precisely the key residues that mediated the ERCC1-XPA interaction (see Fig. 4). To do so, we used binding energy decomposition analysis using the MM-PBSA method in order to mark the essential residues from both sides. Residues within ERCC1 highlighted the binding site of the target. Following that, we extended the MD simulations up to 50 ns and extracted 6 dominant structures using clustering analysis. The length of the simulation was determined by convergence of the normalized overlap using PCA (see section 2.3.2 for details). Beside the 6 extracted structures we added 2 more protein conformations, one for a minimized initial structure and the other was an equilibrated conformation obtained from PCA. We also tried two docking alternatives. The intention was to compare docking against one target structure with flexible side chains to docking against eight rigid protein conformations. Flexible docking exploited the findings of the earlier analysis of binding energy decomposition, where we found Y145 to contribute more than 25% of the overall binding energy (data not shown). Hence, we decided to allow the full side-chain flexibility for this residue in docking to the minimized initial structure. In the other approach, we docked the two compound libraries to the eight rigid models of the target. We also used 3 different methods to rank the compounds; all of them were based on the AutoDock scoring function. In all ranking methods we used only the binding modes that possessed the largest docking cluster. The first ranking method was based on the minimum energy of the flexible docking run. The second used the average binding energy from the eight different rigid simulations. The final ranking was based on the weighted average binding energy using Equ. 4.
The binding mode for three selected top hits within their most favored binding site conformations is shown in Fig. 5. Electrostatic surface maps are included to provide an additional perspective of the charge distribution in the ERCC1 cavity. The binding cleft is mainly positively charged with small negatively charged spots on boundaries of the binding site. This electrostatic potential distribution indicates that the binding site may exhibit a weak positive electrostatic potential. Although, the charge distribution changed slightly between the two representative binding sites indicating the perseverance of its overall shape, the positive potential is apparent on closed conformation. It is worth mentioning that thirty compounds out of this study were experimentally tested and two of them (including the one shown in Fig. 5-A) exhibited positive activity. These two identified top hits were then used as positive controls during the second stage of VS (see below).
Two important lessons were learnt from the first stage and guided us toward more successful outcomes in the second round of screening. The first was to use NMR structures when available, instead of carrying out long MD simulations to extract dominant conformations of the target. The second lesson was to employ more accurate methods in ranking of the compounds, instead of depending solely on the docking scoring function, no matter which combination of scoring methods was used (i.e. average or weighted average binding energies). Hence, in executing the second screening simulation we used 10 NMR target structures, a larger library of compounds (CN chemical library ~100,000 compounds), and the MM-PBSA scoring method to rank the top hits. Eleven compounds from the top hits of this screening were tested experimentally. Five compounds showed positive activity leading to a significantly better-hit rate than the initial round of screening (data not shown). The top hit of the second round of ERCC1 screening is shown in Fig. 6.
3.2. Case II: dual inhibitors of p53-MDM2/4
Over the last two decades, the tumor suppressor protein p53 has been called the “guardian of the genome”. P53 earned this label due to its vital roles in cell cycle, apoptosis, DNA repair and senescence (Teodoro, et al., 2007). In these processes, p53 responds to cellular stresses, such as hypoxia and DNA damage, by accumulating in the nucleus and activating various pathways to maintain the cell’s functional normality (Vogelstein, et al., 2000). As such, tumor cells have developed numerous ways to disable its function. Certainly, the gene TP53, which encodes for p53, is mutated or deleted in ~50% of human cancers (Feki & Irminger-Finger, 2004). In the remaining human cancers, while p53 retains its wild type structure, its activity is eradicated by its main cellular inhibitors, murine double minute 2/4 (MDM2/4) proteins (Kubbutat, et al., 1997; Kussie, et al., 1996). Originally, MDM2 was discovered as the main regulator of p53 activity. It acts as an E3 ubiquitin ligase that exports p53 out of the nucleus and promotes its degradation. Moreover, by binding to the transactivation domain of p53 within the nucleus, MDM2 inhibits p53 function as a transcription factor for other proteins. Consequently, over-expression of MDM2 reduces the cellular ability to activate the p53 pathway under stress conditions (Fakharzadeh, et al., 1991).
Structurally related to MDM2, MDM4 (also known as MDMX or HDMX) is a second cellular regulator of p53 (Shvarts, et al., 1996). Although MDM4 lacks the intrinsic E3 ligase activity of MDM2, current models suggest that it acts as a major p53 transcriptional antagonist independent of MDM2 (Toledo, et al., 2006). The binding domains of p53 within MDM2 and MDM4 are very similar (V. Bottger, et al., 1999), offering promise for the discovery of new small molecule compounds that can simultaneously target the two proteins.
The high-resolution crystal structure of the p53-MDM2 complex demonstrated the essentialinteracting regions located in the MDM2-p53 interface (see
Fig. 7) (Kussie, et al., 1996). Essentially, p53 forms an amphipathic-helix peptide (residues 15-29) that is partly buried inside a small but deep, hydrophobic groove on the surface of the MDM2 N-terminal domain (residues 19-102). This interaction involves four key residues from p53, namely F19, L22, W23 and L26 and at least 13 residues from MDM2 (L54, L57, I61, M62, Y67, Q72, V75, F86, F91, V93, I99, Y100 and I103) (A. Bottger, et al., 1997). Interestingly, 10 out of the 13 most important MDM2 residues described above are conserved in MDMX, which indicates that the binding site of p53 within the surface of MDMX is similar to, but not identical with, that of MDM2.
The last decade witnessed the identification of many small-molecule p53-MDM2 inhibitors with promising binding affinities (Patel & Player, 2008). These are analogs of cis-imidazoline (Nutlins) (Vassilev, 2004), spiro-oxindole (MI-63 and MI219) (Dastidar, et al., 2008; Ding, et al., 2006), benzodiazepinedione (TDP665759) (Grasberger, et al., 2005), terphenyl,(L. Chen, et al., 2005), quilinol,(Lu, et al., 2006), chalcone (Stoll, et al., 2001) and sulfonamide (Galatin & Abraham, 2004). Only three compounds, namely, Nutlin-3, MI-219 and TDP665759 showed sufficiently high binding affinity, and desirable pharmacokinetic profiles in cells (Shangary & Wang, 2008) to be seriously considered for clinical development. However, these compounds are more highly selective for MDM2 than for its homolog MDM4. Even nutlin-3 has been shown to be inactive in cancer cells that over-express MDM4 (B. Hu, et al., 2006), opening a new avenue in p53 research and requiring a new generation of MDM2-inhibitors that can target its homolog, MDM4, as well.
3.2.1. Screening against two targets
In our efforts to discover novel compounds that can restore the p53 activity we screened for dual inhibitors of the p53 interactions with both MDM2 and MDM4 (K. Barakat, et al., 2010). Our strategy followed from the evident similarity between the p53 binding sites within the two proteins. We first filtered a subset of our VCC for MDM2 inhibitors Top hits from this search were then screened against the MDM4 target. Compounds that can simultaneously bind to the two targets were considered as potential dual inhibitors. Our VCC-subset included the NCI diversity set, DrugBank compounds. We also enriched the docked compounds with more than 3,168 derivative structures extracted from the known MDM2-inhibitors. This enrichment was obvious, as similar targets are more than likely to bind similar compounds. The initial screening against MDM2 used 28 dominant protein conformations. These conformations represented the apo- and holo-MDM2’s collective conformational dynamics and were extracted from MD simulations, PCA and clustering analyses.
|Compound||MDM2 Ranking (kcal/mol)||MDM4 Ranking (kcal/mol)|
|MI-219||-10.6 ± 1.5||-9.1 ± 2.2||-11.4||-5.3 ± 1.5||-6.8 ± 2.2||-5.9|
|Nutlin-3||-9.3 ± 1.3||-8.2 ± 2.2||-9.7||-6.1 ± 1.6||-5.8 ± 2.2||Negligible|
|TDP665759||-9.5 ± 1.5||-9.1 ± 2.2||-8.4||-5.6 ± 1.4||-8.2 ± 2.2||Negligible|
|PMI||-10.4 ± 1.6||N/A||-11.6||-12.8 ± 1.5||N/A||-11.5|
Scoring of the top MDM2 hits employed two ranking steps. First was a docking-based ranking similar to what was described in the previous ERCC1 study. The objective was to suggest a modest number (300 in this case) of promising hits for the subsequent re-ranking step. This final scoring utilized the MM-PBSA method. Prior to the application of the MM-PBSA method and as described in Fig. 1, all 300 hits were prepared through all-atoms and solvated MD simulations. We also included a recently discovered peptidic MDM2/4 dual inhibitor (PMI) in the rescoring step (Pazgier, et al., 2009). The 300 suggested MDM2 top hits were also docked to the p53-binding site within MDM4 followed by rescoring their binding using MD and MM-PBSA calculations. Table 1 describes the relative ranking of the used positive controls. The apparent IC50 values for Nutlin3, MI-219, TDP665759 and PMI in binding to MDM2 are 90 nM, 5 nM, 704 nM (Koblish, et al., 2006; Vassilev, 2007) and 3.4 nM35 at 250C, respectively. We did not find explicit values for the binding affinities of the three non-peptide molecules regarding their binding to MDM4, hoverer, it has been experimentally confirmed that these compounds are weak binders to MDMX (Vassilev, 2007) (Pazgier, et al., 2009).
Although the discrepancy in the MM-PBSA calculations for the interactions of the four inhibitors with MDM2 was about 1 kcal/mol, the predicted values were in an excellent agreement with the experimental data compared to the values obtained by the AutoDock scoring function (see Table 1). These results also illustrated the limitations of AutoDock scoring function in eliminating false positive ligands, i.e. compounds that cannot practically bind but are predicted to bind, from active compounds. For example, the TDP665759 compound was predicted to bind to MDM4 with a relatively high binding energy compared to the rest of the compounds. On the other hand, the MM-PBSA approach selected the real binders for the two protein targets. For MDM2, the four ligands can bind strongly to the protein, while, for MDM4, only the PMI peptide can bind with a very high binding energy.
Not only did this study reveal a number of promising hits that can simultaneously bind to the two targets, but it also explained why known MDM2-inhibitors such as Nutlin 3 could not bind to MDM4. Although the two binding sites are fairly similar, the MDM4 pocket seemed to be more compact than that of MDM2. This was mainly due to the three residues Pro95, Ser96 and Pro97 in MDM4 that have been replaced by His96, Arg97 and Lys98 in MDM2 (see Fig. 8).
These substitutions are located on one of the alpha helices that comprise the p53 binding site within the two proteins. Consequently, the proline residues (Pro95 and Pro97) in MDM4 shifted this helical domain in MDM4 relative to MDM2 and caused Lys98 and Tyr99 to protrude into the p53-binding cleft within MDM4, making it shallower and less accessible to many of the MDM2 top hits we found. Moreover, we noticed very minor differences in the electrostatic potential distributions around the surfaces of the two proteins (data not shown), where MDM2 was more positively charged in certain regions deeply located within the binding site. These slight variations in both shape and electrical properties of the two proteins played a considerable role in governing the final conformation adopted by the ligands.
This observation is clear when comparing the binding modes of nutlin within the two pockets (see Fig. 8-a). While Tyr100 and Leu99 of MDM2 extend the binding site allowing nutlin to intimately bind to MDM2, the same residues in MDM4 clash with the drug preventing it from taking the normal conformation that was adopted within MDM2. On the other hand, Fig. 8b-c show how two compounds from the discovered set of proposed MDM2/MDM4 inhibitors were able to tolerate the structural variations between the two binding sites.
3.3. Case III: inhibitors of DNA polymerase beta
DNA polymerase beta (polβ), the smallest naturally occurring DNA polymerase enzyme, belongs to the X-family of DNA polymerases (Uchiyama, et al., 2009). DNA polβ is a vital member of the base excision repair (BER) pathway (Beard & Wilson, 2006). The enzyme plays a significant role in chemotherapeutic agent resistance, as its over-expression reduces the efficacy of anticancer drug therapies including bleomycin (Parsons, et al., 2004), monofunctional alkylating agents (Liu, et al., 2002), cisplatin (Hoffmann, et al., 1996), and other platinum-based compounds.. Furthermore, small-scale studies on different types of cancer showed that polβ is mutated in approximately 30% of tumors, which in turn reduces polβ fidelity in DNA synthesis exposing the genome to serious and often deleterious mutations (Chan, et al., 2007; Starcevic, et al., 2004). Based on these findings, polβ, the error-prone polymerase of BER, has been seriously considered as a promising therapeutic target for cancer treatment.
Many inhibitors of DNA polβ have been identified during the last two decades. To name but a few, this list includes polypeptides (Husain, et al., 1995), fatty acids (Mizushina, et al., 1996), triterpenoids (Tanaka, et al., 1998), sulfolipids (Mizushina, et al., 1998), polar lipids (Ogawa, Murate, Izuta, et al., 1998), secondary bile acids (Ogawa, Murate, Suzuki, et al., 1998), phenalenone-derivatives (Perpelescu, et al., 2002), anacardic acid (J. Z. Chen, Y.; Wang, L.; Sucheck, S.; Snow, A.; Hecht, S., 1998), harbinatic acid (Deng, et al., 1999), flavanoid derivatives (Maloney, et al., 2005), and pamoic acid (H. Y. Hu, et al., 2004). However, most of these inhibitors are either not potent enough or lack sufficient specificity to eventually become approved drugs. Among these compounds pamoic acid (PA) was one of the few compounds that had promising activity against polβ and a well-defined binding mode. The compound was initially discovered by Hu and his co-workers (H. Y. Hu, et al., 2004) Their NMR analysis revealed that PA binds to the 8-kDa domain of polβ and suggested that the binding pocket is located between the two helices: helix-2 and helix-4 of the 8-kDa domain. Interestingly, the same region has been recognized in different studies to be essential in the DNA binding and deoxyribose phosphate lyase activities of the enzyme (Pelletier, et al., 1994). The precise interactions between PA and the lyase domain of polβ were further investigated in a different study (Hazan, et al., 2008) which used a combination protocol of blind docking and NMR analysis to confirm the earlier findings of Hu et al (H. Y. Hu, et al., 2004).
3.3.1. Screening against the lyase active site of polβ
Following the procedure described in Fig. 1, we focused our search space on the binding site of PA, using it as a positive control. Our aim was to discover more potent drug candidates through filtering a library of ~12,500 compounds against 11 protein structures. The molecules tested included the NCI diversity set, the DrugBank set of small-molecules and more than 9,000 fragment structures with drug-like properties extracted from the ZINC database (see section 2.2.1 for a detailed description of these compound libraries). The top 300 hits that showed strong affinity for polβ have been validated and rescored using a more robust scoring function, the MM-PBSA method.
The reported KD value for PA binding to polβ is 9 µM (H. Y. Hu, et al., 2004). Using the AutoDock scoring function, we obtained a value of -6.2 kcal/mol as an estimate for this binding energy. Although this value is in excellent agreement with the experimental measurement (-6.9 kcal/mol) as calculated using the KD value, based on the previously described studies, docking scoring functions were not efficient in discriminating false positives in VS experiments and are biased toward their training set of compounds (Tondi, et al., 1999). Consequently, in this work, the top 300 hits were rescored using the MM-PBSA method in order to validate their docking results and confirm their binding to the protein. Fig. 9 demonstrates the binding modes of the top three hits of the MM-PBSA ranking. Similarly to a substantial number of our suggested top hits, the shown compounds are small in size, however, they occupy a considerable portion of the DNA-binding pocket. These lead compounds can be employed as the basis for a further fragment-based drug design step, in order to construct potent and more specific pol β inhibitors.
DNA repair pathways control the balance between genomic stability’s constancy and mutability (Harper & Elledge, 2007). The mode of action of modern anticancer treatments is by inducing damage to the DNA. Over-expression of proteins involved in the DNA repair circuitry boosts the repair activity, removes most of the induced damage and hence, reduces the efficacy of DNA-damaging agents. This unexpected mechanism represents one of the major factors behind the antitumor drug resistance phenomena observed for these agents (Harper & Elledge, 2007). Therefore. DNA repair proteins are currently considered as valuable targets to improve cancer therapy (Damia & D'Incalci, 2007).
Our group has been involved in in silico searches for novel inhibitors of a number of DNA repair proteins. This chapter reviews our efforts in applying computational high throughput screening methods to filter compound libraries for such inhibitors. The chapter contains two main parts. First is a detailed description of all the computational steps that are used in the virtual screening workflow that we follow. Second is a summary of the results we found in applying this protocol to three important DNA repair targets. The three targets are ERCC1-XPA (K. H. Barakat, et al., 2009), an important element of the NER pathway; MDM2 and MDM4 (K. Barakat, et al., 2010), the two cellular inhibitors of p53; and finally DNA polβ (K. Barakat, et al., In press); (K. Barakat & Tuszynski, 2011), the error prone polymerase of BER. The aim of this review is to share with the reader our experience in this field from a computational drug discovery perspective. Furthermore, we have attempted to demonstrate that computational tools can be easily applied to DNA repair proteins and eventually arrive at compounds capable of regulating their activities.
Funding for this work was obtained from the Alberta Cancer Foundation, Canadian Breast Cancer Foundation, Alberta’s Advanced Education and Technology and the Allard Foundation and NSERC.