Chemical similarity networks are an emerging area of interest in medicinal chemistry, chemical biology, and systems chemoinformatics, and they are currently being applied to drug target prediction, drug repurposing, and drug discovery within the new paradigm of polypharmacology and systems biology. In this chapter, we discuss the network-based drug target identification and discovery framework called chemical similarity network analysis pull-down (CSNAP) and its applications. We highlight the utility of CSNAP in identifying novel antimitotic drugs and their targets through practical case studies.
- drug discovery
- chemical similarity networks
- target identification
Chemical similarity is an important concept in drug discovery used to identify compounds with similar bioactivities based on structural similarity between two ligands [1, 2]. Once a lead compound has been discovered from a chemical screen, a drug designer can design a series of structural analogues with improved pharmaceutical properties. The fundamental principle behind similarity-based drug discovery is the “chemical similarity principle,” which states that if two molecules share similar structures, then they will likely have similar bioactivities. While there are exceptions, correlation between chemical structure and compound activities has been well established in medicinal chemistry. Consequently, determining whether two molecules are structurally similar is a prerequisite for similarity-based drug discovery. At a rudimentary level, the similarity between two ligands can be easily discerned through visual inspection by identifying common functional groups, structural motifs, or substructures. However, human intervention is often subjective and not suitable for large-scale analysis. Thus, applying computational algorithms for unbiased chemical similarity comparison and database searching is essential for a successful drug discovery campaign.
Several computational chemical similarity search algorithms have been developed [1, 4, 5]. The most commonly used approaches use chemical substructure fingerprints. Non-hashed structural fingerprints such as MACCS keys or Obabel FP3 fingerprints detect predefined substructures or functional group patterns in a molecule by mapping common chemical motifs into binary arrays known as structural keys. To compare the chemical similarity between two molecules, each molecule is converted into a binary series of 0 and 1, indicating the absence and presence of a particular substructure. On the other hand, chemical hashed fingerprints such as Daylight fingerprints or Obabel FP2 fingerprints use path information derived from molecular graphs to compare chemical structures. While path-based fingerprints usually confer higher specificity, structural fingerprints can nevertheless be useful for detecting hits with distinct chemical scaffolds. Once the chemical fingerprints have been determined in a chemical search and the molecules have been converted to appropriate data representations, the next step is to evaluate the chemical similarity using a distance or similarity metric. Common distance measures include Euclidean, Manhattan, and Mahalanobis metrics, which have been widely applied in chemoinformatics and bioinformatics applications. However, in the case of binary chemical fingerprints, the simplest and most direct measure is the Tanimoto index, a similarity coefficient. The Tanimoto index is the number of bits set in both fingerprints divided by the number of bits set in either fingerprint, and it ranges from 0 to 1. Although there is no universal Tanimoto index cutoff (Tc) to determine whether two molecules are sufficiently similar, a Tc value of 0.7 is a reasonable starting point for most chemical searches. Alternatively, statistical scores such as a
In addition to 2D fingerprints, 3D chemical similarity fingerprints have also been developed. 3D chemical similarity fingerprints utilize the 3D structural information of the ligands such as molecular shape, pharmacophore points, or molecular interaction fields (MIF) for structural similarity comparison. Although 3D chemical similarity comparison can often capture structural features essential for protein-ligand binding, 3D alignment algorithms often require extensive optimization procedures to maximize the overlapped volume and are computationally intensive. Alternatively, nonalignment methods based on chemical descriptors such as GETAWAY or 3D-MoRSE descriptors can also be used [8, 9]. The 3D chemical descriptor is capable of capturing 3D ligand properties from 2D information and may improve computational time. However, substantial postvalidation may be required to confirm 3D structural similarity.
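The Tanimoto comparison of binary 2D fingerprints described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the code used by any particular fingerprinting package: the function name `tanimoto` and the two 8-bit example fingerprints are hypothetical, and real fingerprints are typically hundreds to thousands of bits long.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto index of two equal-length binary fingerprints (lists of 0/1)."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint (union).
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints carry no information; report 0.0 by convention here.
    return shared / either if either else 0.0


# Hypothetical 8-bit structural keys for a query and a reference molecule.
fp_query = [1, 1, 0, 1, 0, 0, 1, 0]
fp_ref = [1, 1, 0, 0, 0, 0, 1, 1]

tc = tanimoto(fp_query, fp_ref)  # 3 shared bits / 5 bits set in either
is_similar = tc >= 0.7           # the Tc = 0.7 starting point mentioned in the text
```

In this toy case three bits are shared and five are set in at least one fingerprint, so the pair falls below the 0.7 starting cutoff.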
2. Network-based target prediction and drug discovery
The application of chemical similarity searches for ligand bioactivity prediction has recently gained substantial interest. Due to the high failure rate of many new chemical entities (NCE) in the late stage of clinical trials, understanding on- and off-target binding of a drug to predict mechanisms of action and adverse reactions has become crucial for drug discovery programs. If the chemical structure of a compound is known, then it is possible to predict compound bioactivities based on the chemical similarity methods described previously. Drug targets can be inferred from bioactivity databases by transferring the annotated targets of the database ligands that share the highest similarity with the query molecule. Many public bioactivity databases are freely available and can be used for this purpose, including ChEMBL, PubChem, DrugBank, and Binding Database, to name a few [12–14].
The simplest approach for drug target inference is a chemical similarity search in which the target of a query compound is inferred from the annotated ligand sharing the highest chemical similarity (Figure 1). However, there are several limitations to this approach. First, target information for the reference molecules may be incomplete; thus, target inference from a single molecular entity can miss potential targets from molecules sharing lower chemical similarity. Likewise, pair-wise target predictions may not provide consistent predictions for a group of structurally similar ligands. Second, chemical similarity values are not effective at ranking on- and off-targets and do not consider the structure-activity relationship (SAR) of congeneric series. Most importantly, simple ligand-based searches cannot be applied to analyzing large numbers of ligands such as the unannotated hits from a chemical screen. To circumvent these shortcomings, we recently proposed a new network target inference approach based on chemical similarity networks called chemical similarity network analysis pull-down (CSNAP).
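The simple nearest-neighbor inference described above can be sketched as follows. This is an illustrative sketch only: fingerprints are represented as Python sets of on-bit positions, and the reference names (`ref_A`, ...) and target labels (`TARGET_X`, ...) are hypothetical placeholders rather than entries from any real database.

```python
def tanimoto_sets(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def nearest_neighbor_targets(query_bits, reference):
    """Transfer the targets of the single most similar annotated ligand.

    reference maps a ligand name to (set of on-bits, set of annotated targets).
    """
    if not reference:
        raise ValueError("reference set is empty")
    best_name, best_tc = None, -1.0
    for name, (bits, _targets) in reference.items():
        tc = tanimoto_sets(query_bits, bits)
        if tc > best_tc:
            best_name, best_tc = name, tc
    return best_name, best_tc, reference[best_name][1]


# Hypothetical annotated reference ligands.
reference = {
    "ref_A": ({1, 2, 3, 4}, {"TARGET_X"}),
    "ref_B": ({1, 2, 3, 5, 6}, {"TARGET_X", "TARGET_Y"}),
    "ref_C": ({7, 8, 9}, {"TARGET_Z"}),
}
query = {1, 2, 3, 5}

best, tc, targets = nearest_neighbor_targets(query, reference)
```

Only the top-ranked ligand contributes to the prediction, so any annotation that exists solely on `ref_A` (Tc = 0.6 in this toy example) would be ignored, which is the first limitation noted in the text.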
CSNAP uses a network-based algorithm to predict drug targets and does not rely on absolute chemical similarity values. It utilizes a scoring function (
3. CSNAP implementation
3.1. Chemical similarity network algorithms
Mathematically, a chemical similarity network can be considered as a graph G(V, E), where each vertex in V represents a compound and each edge in E connects two compounds whose chemical similarity is above an arbitrary threshold. The CSNAP algorithm is performed in three steps: (1) chemical similarity database search, (2) chemical similarity network construction, and (3) drug target scoring and inference.
3.1.1. Chemical similarity search
Chemical similarity searching is the first step in the CSNAP algorithm (Figure 2A). The chemical similarity comparisons are performed using various 2D Obabel fingerprints including FP2, FP3, MACCS, and others. Query compounds prepared in SMILES or SDF formats are used as inputs to the CSNAP program. The compounds are searched sequentially against the ChEMBL database. To identify the ChEMBL compounds most similar to the query, the relative chemical similarity score is quantified by a
3.1.2. Chemical similarity network construction
To generate chemical similarity networks, pair-wise chemical similarity values are evaluated between every pair of compounds. A network edge is established between two ligands whenever their similarity value is above a predefined threshold (>0.7) (Figure 2B). When large compound data sets are analyzed by the CSNAP algorithm, structurally diverse compounds are partitioned into subnetworks of distinct chemical scaffolds, known as “chemotypes.” The chemical similarity networks can be used to estimate the chemical diversity of input structures at this stage.
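The network construction step can be sketched as a thresholded all-pairs comparison followed by a connected-component search. This is a minimal sketch under stated assumptions, not the CSNAP implementation: fingerprints are sets of on-bit positions, the compound names and bit sets are hypothetical, and connected components are used here as a simple stand-in for the chemotype subnetworks.

```python
from itertools import combinations


def tanimoto_sets(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def build_network(fingerprints, threshold=0.7):
    """Return an adjacency map with an edge wherever similarity > threshold."""
    adjacency = {name: set() for name in fingerprints}
    for a, b in combinations(fingerprints, 2):
        if tanimoto_sets(fingerprints[a], fingerprints[b]) > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Partition the graph into subnetworks with an iterative depth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical compounds: a three-member chain, a pair, and a singleton.
fingerprints = {
    "cpd1": {1, 2, 3, 4, 5, 6, 7, 8},
    "cpd2": {1, 2, 3, 4, 5, 6, 7, 9},
    "cpd3": {1, 2, 3, 4, 5, 6, 9, 10},
    "cpd4": {20, 21, 22, 23},
    "cpd5": {20, 21, 22, 23, 24},
    "cpd6": {30, 31},
}

network = build_network(fingerprints, threshold=0.7)
subnetworks = connected_components(network)
```

Here `cpd1` and `cpd3` are not directly connected (Tc = 0.6) but end up in the same subnetwork through `cpd2`, and the number of subnetworks gives a rough estimate of the chemical diversity of the input set.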
3.1.3. Drug target scoring and inference
CSNAP infers drug targets using consensus statistics. Specifically, drug targets in the first neighbor of the query are identified and ranked based on their target annotation frequency (Figure 2C). A consensus score called an
3.2. Case study: CSNAP web server for automated drug target prediction
To reduce concept to practice, we constructed a CSNAP web server for large-scale target prediction and drug discovery. The CSNAP web includes a front-end graphical user interface (GUI) that provides user interaction and output visualization, while target prediction is performed at the back-end by running the CSNAP algorithm.
3.2.1. CSNAP web input
The CSNAP web server accepts two ligand input formats: SDF and SMILES, which are two of the most commonly used molecular formats that handle large compound databases. In addition, a JME molecular editor is also included, which can be used to convert a chemical structure to a SMILES string on the fly (Figure 3A). Several chemical fingerprints are provided to perform chemical comparisons during the search and network clustering steps, including Obabel FP2, FP3, FP4, and MACCS fingerprints (Figure 3B). Obabel FP2 fingerprints use a path-based algorithm and are more specific than FP3, FP4, and MACCS, which utilize a predefined set of substructures for chemical searches. On the other hand, when structural analogues are not available in the chemical database, FP3 can instead be used to search structurally distinct chemicals with similar bioactivities. To perform chemical searches, the chemical similarity cutoff needs to be defined. Here, CSNAP web supports a combination of absolute cutoff based on Tanimoto coefficient (Tc > 0.7) and relative chemical similarity cutoff based on a Z-score. From our experience, the default option using a
Once the query ligands and chemical search parameters have been defined, the CSNAP algorithm will search the ChEMBL compound activity database to identify structurally similar compounds for target inference (Figure 3B). The ChEMBL database assigns targets to a compound based on the level of target specificity (confidence score). Similarly, the compounds are also classified based on the assay type from which they are derived, including biochemical, functional, or ADMET assays. These database parameters will also need to be selected to perform the CSNAP analysis.
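The combination of an absolute Tanimoto cutoff and a relative Z-score cutoff mentioned for the search step can be illustrated as below. This sketch rests on assumptions that the text does not confirm: the Z-score is taken relative to the distribution of the query's Tanimoto scores against all database compounds, a hit must pass both cutoffs, and the Z-score threshold passed in the example is an arbitrary placeholder, not the server default.

```python
from statistics import mean, pstdev


def select_hits(scores, tc_cutoff, z_cutoff):
    """Keep database compounds passing both an absolute and a relative cutoff.

    scores maps a database compound name to its Tanimoto score versus the query.
    Returns a dict of name -> (Tanimoto score, Z-score) for the retained hits.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the score list
    hits = {}
    for name, tc in scores.items():
        # With zero spread every score is identical, so no hit stands out.
        z = (tc - mu) / sigma if sigma > 0 else 0.0
        if tc > tc_cutoff and z > z_cutoff:
            hits[name] = (tc, z)
    return hits


# Hypothetical Tanimoto scores of one query against six database compounds.
scores = {"a": 0.90, "b": 0.75, "c": 0.30, "d": 0.25, "e": 0.20, "f": 0.20}

strict_hits = select_hits(scores, tc_cutoff=0.7, z_cutoff=1.5)
loose_hits = select_hits(scores, tc_cutoff=0.7, z_cutoff=1.0)
```

A relative cutoff of this kind adapts to the query: a score that looks high in absolute terms is retained only if it also stands out from the background of scores for that query.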
3.2.2. CSNAP web output
The CSNAP output page consists of three main panels: (1) chemical similarity networks, (2) chemical structure information, and (3) ligand-target interaction fingerprint (LTIF) (Figure 3C).
3.2.3. Chemical similarity networks
The chemical similarity networks panel displays the generated chemical networks using the CSNAP algorithm based on input ligands. The chemical similarity network connects query (red) and annotated ligands (gray) from the ChEMBL database, and the targets are inferred using consensus statistics. For large compound sets, the number of generated chemical clusters can be used to estimate the chemical diversity of the sets. To retrieve additional information regarding a specific ligand, the user can click on the node and the relevant information will be displayed in the chemical structure information panel.
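The consensus idea used for target inference, in which the targets annotated on a query's first neighbors are ranked by annotation frequency, can be sketched as a simple count. This is only an illustration of frequency ranking: it is not the consensus score used by CSNAP, and the node names and target labels below are hypothetical.

```python
from collections import Counter


def rank_targets_by_frequency(query, adjacency, annotations):
    """Rank targets by how many first neighbors of the query carry them.

    adjacency maps a node to the set of its neighbors in the similarity network.
    annotations maps an annotated ligand to its set of known targets.
    """
    counts = Counter()
    for neighbor in adjacency.get(query, ()):
        # Unannotated neighbors (for example, other query compounds) add nothing.
        counts.update(annotations.get(neighbor, ()))
    # Sort by descending frequency, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical subnetwork: one query linked to three annotated ligands and
# one other unannotated query compound.
adjacency = {
    "query1": {"ref_A", "ref_B", "ref_C", "query2"},
    "query2": {"query1"},
}
annotations = {
    "ref_A": {"TARGET_X"},
    "ref_B": {"TARGET_X", "TARGET_Y"},
    "ref_C": {"TARGET_X"},
}

ranking = rank_targets_by_frequency("query1", adjacency, annotations)
```

Because every first neighbor votes, a target supported by several moderately similar ligands can outrank one that appears only on the single most similar ligand.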
3.2.4. Chemical structure information
The chemical structure information panel displays the chemical information selected from the chemical similarity network panel. The panel consists of several columns that include chemical structure information (chemical ID, chemical structure, SMILES string, InChI key) and the predicted target information (target name and UniProt ID). In the ChEMBL prediction column, the predicted targets of each compound are ranked by the
3.2.5. Ligand-target interaction fingerprint (LTIF)
To analyze the results from large-scale target prediction searches, the ligand-target interaction fingerprint (LTIF) is provided in the CSNAP web output. The LTIF panel displays the predicted
3.2.6. Target spectrum and Gene Ontology (GO) search
To further differentiate primary targets from off-targets in the LTIF, the CSNAP web also computes a target spectrum, by summing the
4. Application of CSNAP for drug target prediction and discovery
4.1. CSNAP validation
The CSNAP algorithm was validated using 206 known drugs from the Directory of Useful Decoys (DUD) set. The benchmark set included 46 angiotensin-converting enzyme (ACE), 47 cyclin-dependent kinase 2 (CDK2), 23 heat-shock protein 90 (HSP90), 34 HIV reverse-transcriptase (HIVRT), 25 HMG-CoA reductase (HMGA), and 31 poly [ADP-ribose] polymerase (PARP) inhibitors. Using the default search criteria (fingerprint: FP2, Tc = 1,
4.2. Target prediction of hits from an antimitotic chemical screen
We applied the CSNAP algorithm to predict the drug targets of a set of 212 compounds that were inhibitors of cell division. CSNAP clustering of the mitotic compounds resulted in 85 chemical similarity subnetworks (Figure 7A). To identify the most common targets within the set, we applied the LTIF analysis. The target spectrum derived from the LTIF revealed four broad classes of mitotic targets including fatty acid desaturases (SCD, SCD1, and FADS2), ABL1 kinase, non-receptor-type tyrosine phosphatases (PTPN7, PTPN12, PTPN22, PTPRC, and ACP1), and beta tubulins. In particular, the target spectrum showed that beta tubulin had the largest peak height and was the most prominent protein target for the mitotic compounds. Further analysis showed that 51 compounds were associated with tubulin-targeting chemotypes. The predicted drug targets were validated by comparing siRNA-treated and drug-treated mitotic phenotypes in cell culture using immunofluorescence microscopy. In addition,
4.3. Discovery of novel tubulin-targeting antimitotics
Using a negative selection strategy, we identified seven novel tubulin-targeting agents that were active in our tubulin polymerization assay but had not been associated with known tubulin chemotypes (Figure 7A). The seven compounds were analogues of phenyl-sulfanyl-thiazol-acetamide scaffolds that exhibit various degrees of tubulin destabilizing effects through a mechanism similar to that of the tubulin destabilizing agent colchicine. The most potent compound, compound
4.4. Characterization of novel tubulin-targeting antimitotics
To investigate how the novel antimitotics interacted with beta tubulin, we performed structural alignments between compound
5. CSNAP3D: a 3D upgrade to the CSNAP approach
Chemical similarity searches based on 2D chemical structures have several limitations. First, compounds with distinct scaffolds can exhibit similar bioactivity due to “scaffold hopping” by interacting with a common receptor [25, 26]. Second, although 2D fingerprints based on substructure or fragment searches have the potential to detect scaffold hopping, the scaffold enrichment rate is low. Furthermore, 2D searches do not capture essential features of protein-ligand interactions in three-dimensional space. Consequently, 3D chemical searches based on the three-dimensional information of the ligands will offer additional opportunities to discover novel compounds.
5.1. 3D chemical similarity search
The most common approach to compare ligand similarity in 3D is by shape superposition, which maximizes the Gaussian volume overlap between two ligands. Alternatively, ligand alignments that use molecular interaction fields (MIF) or pharmacophores have also been proposed [28, 29]. These approaches take into account the shared chemical features arranged in three-dimensional space. To identify the optimal 3D chemical descriptors, we performed an unbiased screen of diverse 3D chemical descriptors based on molecular shapes or pharmacophores. Using 206 benchmark compounds from the DUD set, we tested the ability of each 3D descriptor to enrich class-specific scaffolds ranked by respective similarity scores. The lowest energy conformer of each ligand was generated using the MOE program. The results showed that 3D chemical descriptors using a combination of shape and pharmacophore features achieved the highest enrichment rate and ligand alignment accuracy compared to those based on shape or pharmacophore alone. This observation agrees with our current understanding that shape complementarity and chemical matching are essential for the protein-ligand binding process.
We subsequently developed a 3D chemical similarity search method called “ShapeAlign” that utilized two open-source software packages: “Shape-it” and “Align-it”. Similar to the combo score implemented in the ROCS program, the ShapeAlign algorithm also used a combination of shape and pharmacophore for 3D chemical searches. However, ShapeAlign incorporated a 2D fingerprint similarity score as an integral part of the searching process. Given a ligand with a pre-generated 3D conformation, the ShapeAlign algorithm first detects ligands from the chemical database with the highest shape matching evaluated by a shape Tanimoto index. The hit molecules are then aligned and rescored according to the degree of pharmacophore matching using the Align-it program.
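The two-stage logic described above, a shape-based preselection followed by pharmacophore rescoring, can be sketched as a filter-then-rerank procedure. This sketch does not call Shape-it or Align-it and does not reproduce the ShapeAlign scoring: it assumes both scores have already been computed by external tools, omits the 2D fingerprint term, and uses hypothetical ligand names, score values, and a hypothetical `top_n` parameter.

```python
def shape_then_pharmacophore(shape_scores, pharm_scores, top_n):
    """Preselect the top_n ligands by shape score, then rerank by pharmacophore score.

    shape_scores and pharm_scores map ligand names to precomputed similarity
    scores against the query (higher means more similar).
    """
    # Stage 1: keep only the best shape matches (ties broken by name).
    by_shape = sorted(shape_scores, key=lambda n: (-shape_scores[n], n))[:top_n]
    # Stage 2: reorder the survivors by pharmacophore agreement.
    return sorted(by_shape, key=lambda n: (-pharm_scores[n], n))


# Hypothetical precomputed scores for four database ligands.
shape_scores = {"lig1": 0.90, "lig2": 0.80, "lig3": 0.40, "lig4": 0.85}
pharm_scores = {"lig1": 0.30, "lig2": 0.70, "lig3": 0.95, "lig4": 0.50}

ranked = shape_then_pharmacophore(shape_scores, pharm_scores, top_n=3)
```

In this toy example `lig3` has the best pharmacophore score but never reaches the second stage because its shape match is poor, which shows how the order of the two stages shapes the final hit list.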
5.2. Drug target prediction using CSNAP3D
We incorporated the “ShapeAlign” algorithm into our CSNAP program to create “CSNAP3D,” which clusters chemical structures and predicts drug targets based on 3D ligand similarity. To evaluate CSNAP3D performance, we assessed the average true-positive rate (TPR) and false-positive rate (FPR) of predicting drug targets for the 206 benchmark compounds. The results showed that CSNAP3D achieved a TPR of >95% at a Tanimoto cutoff of 0.85 when compared with 2D target prediction approaches including CSNAP2D, SEA, and PASS. A comparison of CSNAP3D- and CSNAP2D-generated networks showed that diverse 2D scaffold subnetworks were clustered into smaller subsets of 3D chemical networks, suggesting that CSNAP3D could be used to identify scaffold hopping ligands not identifiable by conventional 2D methods (Figure 9).
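The average TPR and FPR used in this evaluation can be computed per compound and then averaged, as sketched below. The exact evaluation protocol is not given in the text, so this is one plausible reading under stated assumptions: each compound has a set of predicted targets and a set of known targets drawn from a fixed universe of candidate targets, and all names and values are hypothetical.

```python
def tpr_fpr(predicted, actual, all_targets):
    """True-positive and false-positive rates for one compound's target prediction."""
    tp = len(predicted & actual)                # correctly predicted targets
    fp = len(predicted - actual)                # predicted but not known targets
    fn = len(actual - predicted)                # known targets that were missed
    tn = len(all_targets - predicted - actual)  # correctly left unpredicted
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def average_rates(predictions, truths, all_targets):
    """Average per-compound TPR and FPR over a benchmark set."""
    rates = [tpr_fpr(predictions[c], truths[c], all_targets) for c in truths]
    n = len(rates)
    return sum(r[0] for r in rates) / n, sum(r[1] for r in rates) / n


# Hypothetical two-compound benchmark over five candidate targets.
all_targets = {"T1", "T2", "T3", "T4", "T5"}
predictions = {"cpd1": {"T1", "T2"}, "cpd2": {"T3"}}
truths = {"cpd1": {"T1"}, "cpd2": {"T3", "T4"}}

mean_tpr, mean_fpr = average_rates(predictions, truths, all_targets)
```

Repeating this calculation while sweeping the similarity cutoff would trace how sensitivity trades off against false positives for a given method.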
5.2.1. Target prediction of HIVRT inhibitors
As further validation, we present a case study of predicting targets for a set of HIVRT inhibitors using the CSNAP3D algorithm. HIVRT inhibitors can be classified as nucleoside-based analogues (NRTIs) or non-nucleoside-based analogues (NNRTIs). In particular, NNRTIs have been a difficult drug class for computational drug target prediction because of their chemical diversity: many compounds are scaffold hopping ligands that bind to a common nucleotidyltransferase binding site. Although 3D ligand-based target predictions that use either the alignment or nonalignment methods have been attempted, many of these approaches yielded low predictability. Here, we applied CSNAP3D to predict the drug targets of 34 structurally diverse HIVRT NNRTIs and compared the prediction results with the CSNAP2D approach (Figure 9). Initial 2D chemical similarity network analysis clustered the 34 NNRTIs into 20 structurally diverse chemical similarity scaffolds. Further LTIF analysis, by mapping target prediction
5.2.2. Discovery of novel taxol scaffold hopping ligands
Taxol (paclitaxel) is a well-known anticancer natural product derived from the yew tree, whose antiproliferative effect was first discovered in the 1960s from an NCI anticancer drug screening campaign. Taxol has since been found to be effective for treating a wide range of cancers including ovarian, breast, lung, bladder, prostate, melanoma, esophageal, and other solid tumors. However, the utility of taxol has been limited by severe side effects, toxicity, and poor synthetic feasibility. Thus, identification of low-molecular-weight taxol mimetics with more tolerable drug profiles is critical. While several taxol mimetics have been discovered including Synstab B and GS-164, both discovered by chemical screening, their binding mechanisms have remained undetermined [33, 34].
Here, we sought a rational approach to discover taxol mimetics using the CSNAP3D algorithm based on our existing structural knowledge of the original taxol molecule. CSNAP3D analysis of the 212 mitotic compounds from a chemical screen identified 42 potential taxol mimetics linked to 30 taxol structural conformers. Seven predicted taxol mimetics were found to be true positives with a >25% fold change in optical density when tested in tubulin polymerization assays
6. Conclusions and future directions
Chemical similarity is an important concept in medicinal chemistry and drug discovery to identify similar compounds with improved bioactivities. Here, we have expanded on this concept to chemical similarity network theory, where descriptive network statistics and graph topology can be applied to large-scale analysis of chemical diversity, bioactivities, and target identification. To demonstrate the utility of this approach, we have implemented the CSNAP algorithm, which can be used for large-scale compound analysis and target predictions. Analogous to protein function prediction in PPI networks, we applied consensus statistics to identify the common targets of each query ligand. We showed that this scoring function outperforms several target prediction methods based on simple chemical similarity searches. To address the challenge of scaffold hopping, where structurally diverse ligands can potentially interact with a common receptor, we developed the CSNAP3D algorithm as a CSNAP extension. CSNAP3D searches chemical structures using the “ShapeAlign” protocol, which utilizes a combination of shape and pharmacophore descriptors. We found that CSNAP3D improves target prediction, particularly for challenging drug classes such as HIVRT NNRTIs, which show high structural diversity and are scaffold hopping ligands. Finally, we successfully applied CSNAP3D to rationally discover low molecular weight taxol mimetics, which exhibit a taxol-like anticancer mechanism and potentially possess better transport and pharmacokinetic properties than their natural counterpart.
The current CSNAP framework can be extended in several directions. For instance, consensus scoring can be expanded by considering higher-order neighbors, which has been demonstrated to improve prediction accuracy in PPI networks. Similarly, graph theoretical approaches based on maximum network flow and other global optimization approaches can be applied for target assignments. To improve posttarget validation, high throughput functional genomics data can be incorporated to aid in the identification of critical targets relevant to a disease pathway. One example is multilayer network approaches that integrate drug, target, and annotation interaction networks to enhance target predictions and validations. While CSNAP3D has substantially improved the predictability of CSNAP2D, the algorithm is limited to receptors with bound ligands and the ligand alignment is based on the lowest energy conformer. This shortcoming can be circumvented by considering multi-conformer networks that correlate ligand conformation with target specificities. Likewise, pseudo-ligands generated as the mirror image of an orphan receptor can be considered for receptor deorphanization.
In conclusion, chemical similarity networks are an emerging field in ligand-based drug discovery where the collective properties of a ligand can be easily dissected using descriptive network statistics and graph topology. Here, we presented a new network-based approach for drug discovery and target identification called chemical similarity network analysis pull-down (CSNAP) and a new CSNAP framework called CSNAP3D. The CSNAP computational framework represents a new concept in computational drug discovery with practical application in target identification and drug discovery. We anticipate that the CSNAP approach will stimulate further work in systems and network-based drug discovery that will aid in the discovery of novel drugs for the treatment of cancer and other important diseases.