General classification of Mtb genes. Adapted from .
Tuberculosis (TB) continues to be a major health hazard worldwide due to the resurgence of drug-resistant strains of Mycobacterium tuberculosis (Mtb) and co-infection. For decades, drug discovery has concentrated on identifying ligands for ~10 Mtb targets; hence most of the identified essential proteins are not utilised in TB chemotherapy. Here computational techniques were used to identify ligands for the orphan Mtb proteins. These range from ligand-based and structure-based virtual screening to modelling the proteome of the bacterium. Identification of ligands for most of the Mtb proteins will provide novel TB drugs and targets and hence address drug resistance, toxicity and the duration of TB treatment.
- Mycobacterium tuberculosis
- target deorphaning
- target deconvolution
- proteome modelling
- virtual screening
Tuberculosis (TB) continues to be a major public health concern with over 2 billion people currently infected, 8.6 million new cases per year, and more than 1.3 million deaths annually. The current drug regimen for drug-sensitive TB consists of isoniazid, rifampicin, ethambutol and pyrazinamide, administered over 6 months. If this treatment fails, second-line drugs are used, such as para-aminosalicylate (PAS) and fluoroquinolones, which are usually either less effective or more toxic with serious side effects. Although the first-line regimen has a high success rate, it is marred by compliance issues, which have resulted in the rise of multidrug resistant (MDR), extensively drug resistant (XDR) and totally drug resistant (TDR) strains of the causative agent,
The conventional target deorphaning process involves experimental work, which characteristically includes genetic, proteomic and transcriptional profiling, followed by identification of the ligands for the proteins using further chemical-proteomic approaches. This approach is usually slow and expensive. However, developments in bioinformatics and chemoinformatics, together with advances in computer tools and resources, have revolutionised target deorphaning. Bioinformatics describes the target space in
Therefore, in this chapter we present an overview of the genome of
An extensive literature search was performed to give an overview of the genome of the
3. Genome sequence of
Cole and co-workers  in 1998 reported the complete sequence of
The genome of
|Function||No. of genes||% of total genes||% of total coding capacity|
|Cell wall and cell processes||517||13.0||13.5|
|IS elements and bacteriophages||137||3.4||2.5|
|PE and PPE Proteins||167||4.2||7.1|
|Intermediary metabolism and respiration||877||22.0||24.6|
|Virulence, detoxification and adaptation||91||2.3||2.4|
|Conserved hypothetical function||911||22.9||18.4|
|Proteins of unknown function||607||15.3||9.9|
3.1 Current status of tuberculosis drugs and targets
3.1.1 Tuberculosis drugs
The success of TB chemotherapy derives from an “intensive” phase involving a cocktail of four first-line drugs: rifampicin (RIF), isoniazid (INH), pyrazinamide (PZA) and ethambutol (EMB). A threatening global issue of this epidemic is the emergence of drug-resistant bacteria, a trend that is on the rise, as such strains are easily spread with low fitness costs associated with transmission
In addition, comorbidity with HIV causes massive diagnostic and therapeutic challenges and results in adverse drug interactions. This is because RIF is a potent inducer of drug-metabolising enzymes, including cytochrome P450 (CYP) 3A4. This induction dramatically reduces plasma levels of several highly active antiretroviral therapy drugs; thus, patients are often forced to complete their TB treatment before beginning HIV treatment. Patients who contract MDR-TB with HIV have a very poor prognosis due to the duration of treatment; these individuals frequently succumb within a few months. Therefore, there is an urgent need to continually develop new active agents to combat MDR-TB, a need that has been compounded by the emergence of XDR-TB. Furthermore, cases of TDR-TB have been noted in China, India, Africa, and Eastern Europe. In TDR-TB, the
Mycobacterium tuberculosis drug targets
To date the number of essential
|Targets||Function||Conventional drugs||New ligands|
|Enoyl-(acyl-carrier-protein) reductase (InhA), Fatty acid synthase||Biosynthesis of mycolic acids, which are essential for growth and virulence||Isoniazid|| |
|DNA gyrase||An ATP-dependent enzyme that acts by creating a transient double-stranded DNA break||Fluoroquinolones||Clinafloxacin|
|Ubiquinol-cytochrome C-reductase (QcrB)||Electron carriers of the respiratory chain|| ||Pyrrolo[3,4-c]pyridine-1,3(2H)diones|
|Transmembrane transport protein large (MmpL3)||Responsible for heme uptake into the cell; responsible for the transport of ions, drugs, fatty acids and bile salts|| || |
|Decaprenylphosphoryl-β-D-ribofuranose-2-oxidase (DprE1)||Cell wall synthesis|| ||Benzothiazinones (BTZ043); 4-aminoquinolone piperidine amides|
|RNA polymerase||Responsible for transcription||Rifampicin|| |
|Protein synthase||Protein synthesis||Linezolid (https://www.drugbank.ca/drugs/DB00601)||PNU100480|
|ATP Synthase||ATP synthesis||Bedaquiline||D-Dethiobiotin|
|Cytidine triphosphate (CTP) synthetase||Catalysis of amination of uridine triphosphate (UTP) into CTP|| ||Thiophenecarboxamide; 4-(pyridine 2-yl) thiazole|
|Transcription factor (IdeR)||Regulating the intracellular levels of iron|| ||Benzo-thiazol benzene sulfonic acid|
|Lysine-ε-amino transferase (LAT)||Catalysing reversibly the transamination of lysine into α-ketoglutaric acid|| ||Benzothiazole|
This is of paramount importance because
4. Computer resources and tools for tuberculosis drug targets
Developments in genomics, coupled with advances in high performance computing and validation of molecular targets, have introduced new approaches to drug discovery that provide a shift from the historical pipeline, which focuses on target identification and in most cases involves single targets. In this era of extensive discovery of new chemical entities for treatment of TB and other infectious diseases like HIV/AIDS, a number of research institutes as well as pharmaceutical companies are eagerly developing computational tools and protocols to facilitate drug discovery and development. Genomics provides DNA, RNA, transcriptomic and proteomic data, housed in a variety of databases and resources such as those of the European Bioinformatics Institute (EBI) https://www.ebi.ac.uk/ and the National Center for Biotechnology Information (NCBI) https://www.ncbi.nlm.nih.gov/, which can be easily retrieved and analysed, thereby shifting the drug discovery focus from a single to a multi-protein target approach. In this approach
The revolution in genomics led to the availability of a number of mycobacterial genomes and the development of a variety of databases consisting of
5. Computational target deorphaning techniques
A number of computational methods are being explored in order to identify ligands for both host and pathogen targets and for targets from other organisms like
5.1 Ligand-based and structure-based virtual screening methods
Structure-based virtual screening is an approach used in drug discovery to computationally screen small molecule databases for compounds that target experimentally validated proteins of known 3D structure. Brian Shoichet has pointed out that this approach was first published in the 1970s; however, most new ligands and their targets were not identified until the early 2000s. The method offers the opportunity to access a large number of potential new chemical ligands for old and new targets. When ligands are already available for named biological targets, ligand-based virtual screening may be carried out using a variety of techniques ranging from molecular similarity and pharmacophoric search to machine learning and, most recently, deep learning.
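As an illustration of the molecular similarity technique mentioned above, the following minimal Python sketch ranks a small library against a known ligand using the Tanimoto coefficient. It is not the method used in this chapter: the fingerprints are represented as plain sets of on-bit indices, and the compound names and bit values are hypothetical. In practice the fingerprints would come from a chemoinformatics toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(
    query: FrozenSet[int], library: Dict[str, FrozenSet[int]]
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each integer is the index of a set bit.
known_ligand = frozenset({1, 4, 7, 9, 12})
library = {
    "cmpd_A": frozenset({1, 4, 7, 9, 13}),
    "cmpd_B": frozenset({2, 3, 5}),
    "cmpd_C": frozenset({1, 4, 7, 9, 12}),
}

ranking = rank_by_similarity(known_ligand, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Compounds at the top of such a ranking would be the first candidates for testing against the target of the known ligand; the similarity threshold used to call a hit is a separate choice that this sketch does not make.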
5.1.1 Structure-based techniques
Structure-based virtual screening plays a significant role in drug discovery in that it is used to identify ligands for biological targets when the 3D structures of the
5.1.1.1 Molecular docking
Molecular docking calculations can predict the binding conformation of ligands inside the binding pocket of a target; as such, they are used to map small molecules onto targets and hence provide essential binding information for structure-based drug design. To achieve this, docking algorithms perform either a stochastic conformational search, as in AutoDock, or a systematic search, as in GLIDE. In a stochastic search, structural parameters such as the torsional, translational and rotational degrees of freedom of the ligand are randomly modified to generate an ensemble of molecular conformations and increase the chances of finding the global energy minimum, whilst in a systematic conformational search structural features are gradually changed until a local or global minimum is reached. During the search, conformations of a number of potential binding compounds are explored and evaluated using a specific scoring function, and the conformations are ranked based on their calculated binding energy. Highly ranked compounds are selected as candidate ligands for the target. Conversely, reverse or inverse docking is used to identify the targets of phenotypic drug hits from a sea of targets. In this way, structure-based screening helps to identify and explain polypharmacology and the molecular mechanism of action of substances, to facilitate drug repurposing, and to detect adverse drug reactions and hence toxicity.
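The difference between the two search strategies can be sketched on a toy problem. The Python example below is an illustration only and does not reproduce AutoDock or GLIDE: it uses a single torsion angle as the only degree of freedom and a made-up energy function (`toy_energy`) in place of a real scoring function. The systematic search steps through the angle in fixed increments, while the stochastic search is a simple Metropolis Monte Carlo walk; the step size, temperature and seed are arbitrary assumptions.

```python
import math
import random
from typing import Tuple


def toy_energy(torsion_deg: float) -> float:
    """Made-up energy surface with several minima; NOT a real scoring function."""
    t = math.radians(torsion_deg)
    return math.cos(3 * t) + 0.5 * math.cos(t + 1.0)


def systematic_search(step_deg: float = 5.0) -> Tuple[float, float]:
    """Change the torsion in fixed increments and keep the lowest-energy angle."""
    n_points = int(360 / step_deg)
    angles = [i * step_deg for i in range(n_points)]
    best = min(angles, key=toy_energy)
    return best, toy_energy(best)


def stochastic_search(
    n_steps: int = 2000,
    max_move_deg: float = 30.0,
    temperature: float = 0.3,
    seed: int = 0,
) -> Tuple[float, float]:
    """Metropolis Monte Carlo: random moves, uphill moves sometimes accepted."""
    rng = random.Random(seed)
    current = rng.uniform(0.0, 360.0)
    e_current = toy_energy(current)
    best, e_best = current, e_current
    for _ in range(n_steps):
        trial = (current + rng.uniform(-max_move_deg, max_move_deg)) % 360.0
        e_trial = toy_energy(trial)
        accept = e_trial <= e_current or rng.random() < math.exp(
            -(e_trial - e_current) / temperature
        )
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


best_sys, e_sys = systematic_search()
best_sto, e_sto = stochastic_search()
print(f"systematic: angle={best_sys:.1f} energy={e_sys:.3f}")
print(f"stochastic: angle={best_sto:.1f} energy={e_sto:.3f}")
```

With one degree of freedom the systematic scan is cheap and exhaustive. For a real ligand, the number of grid points grows rapidly with the number of rotatable bonds, which is one reason stochastic sampling is used.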
5.1.1.2 Deorphaning the HTH transcription regulator,
In an effort to de-orphan the HTH transcription regulator, EthR, and identify the binding mode of the ligand, we docked 200 fragment-like compounds from the Maybridge database to the highest quality crystal structure of the 23 PDB entries using the GOLD algorithm (unpublished work). We used Arpeggio, an online tool that identifies non-covalent interactions in protein structures, to assess the role of each EthR binding site residue and each small-molecule ligand moiety in contributing to protein-ligand interactions. Visual assessment involved calculating interactions using the Arpeggio web server (http://structure.bioc.cam.ac.uk/arpeggio) and downloading the results as PyMOL session files, in order to analyse the non-covalent interactions of each residue. We found that, in addition to using polar contacts, most ligands are stabilised by a cascade of pi-interactions starting from Tyr103, close to the entrance of the allosteric pocket, to Phe114, located close to the HTH-domain, and beyond (Figure 2). Furthermore, potential ligands for the protein were identified. Information obtained from these results is vital for identifying ligands with a higher probability of binding to EthR, and so improving the potency and safety of ethionamide (ETH).
Similarly, docking calculations were used to assess the binding of ligands, identified from high-throughput screening, for a novel TB drug target, the inosine monophosphate dehydrogenase (IMPDH) protein GuaB2, which is responsible for the synthesis of xanthosine monophosphate (XMP) from IMP. Hit compounds were identified in a single-shot high-throughput screen, validated by dose response and subjected to further biochemical analysis. The compounds were also assessed using molecular docking experiments, providing a platform for their further optimisation using medicinal chemistry. From the results, it was observed that occupation of the nicotinamide sub-site was correlated with interactions of the ligands with the purine ring of IMP.
5.1.1.3 Applying concerted computational and experimental approaches
Likewise, we used a combination of ligand-based and structure-based chemogenomic approaches, followed by biophysical and biochemical methods, to identify targets for
6. Modelling proteomes for mycobacteria, hotspots and druggability
A comprehensive understanding of the structural proteomes of mycobacteria is essential for novel drug discovery and for elucidating the roles of mutations in drug resistance. Most researchers begin by defining the 3D structure using X-ray crystallography, NMR or, increasingly, cryo-EM. For phenotypic screening and understanding off-target hits, where the target is not identified, prior knowledge of the structures of all gene products in the target organism is helpful. This has stimulated the establishment of several consortia in what is usually known as structural genomics, but might more appropriately be termed “structural proteomics”.
6.1 Evolution of structural genomics consortia
The Structural Genomics Consortium (SGC), which has focused on proteins of interest to medicine, has impressive achievements, in 2011 defining ~40% of the structures of proteins from human parasites deposited in the PDB. The Tuberculosis Structural Genomics Consortium (TBSGC), an international collaboration involving 53 countries, has focused on 3D structures of
6.2 Comparative 3D modelling of proteins
Comparative modelling of proteins, based on fold recognition and structural alignment with the closest homologues that have experimentally solved structures, began using interactive graphics in the 1970s [41, 42, 43]. The development of automated modelling software began in the 1980s, initially with Composer and later with Comparer and Modeller, based on satisfaction of 3D restraints derived from structurally aligned homologues. Modeller has now been cited ~10,500 times in the literature!
6.2.1 Computational modelling pipelines and structural proteome databases
Rapid progress in this and other related software, coupled with increasing computing power, has enabled genome-scale prediction of protein structures as a viable alternative to experimental determination. In order to construct computational models of all gene products, which we here refer to as the structural proteome, we identify templates by a sequence-structure homology search using Fugue, which uses local-structural-environment-specific substitution tables to predict the likelihood of a common 3D structure. We have incorporated Fugue into a pipeline (Vivace), in which templates are selected from TOCCATA (Ochoa Montaño and Blundell, unpublished), a database of consensus profiles built from the CATH 3.5 and SCOP 1.75A classifications of protein structures (PDB files). PDB entries within each profile are clustered by sequence similarity using CD-HIT, and structures are aligned using BATON, a modified version of COMPARER. After further optimisation of the clusters by discarding templates whose sequence identity is more than 20% below that of the maximum hit, the remaining templates are classified into states based on ligand binding and oligomerisation. Five different states, known as “liganded-monomeric,” “liganded-complexed,” “apo-monomeric,” “apo-complexed” and “any,” are generated in each profile hit. Models are built in each of these states using Modeller 9.10 and refined. The quality of the models is then assessed using NDOPE, GA341, MolProbity and SSAG.
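The template filtering and state classification steps described above can be sketched as follows. This is a simplified illustration, not the Vivace implementation: the `Template` record, the use of chain count as a proxy for oligomeric state, the assumption that the “any” state collects every retained template, and the template identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Template:
    pdb_id: str          # hypothetical identifier, not a real PDB entry
    seq_identity: float  # percent sequence identity to the query
    has_ligand: bool
    n_chains: int        # 1 = monomeric, >1 = complexed (assumed proxy)


def filter_templates(hits: List[Template], max_drop: float = 20.0) -> List[Template]:
    """Discard templates more than `max_drop` identity points below the best hit."""
    if not hits:
        return []
    best = max(t.seq_identity for t in hits)
    return [t for t in hits if best - t.seq_identity <= max_drop]


def classify_states(templates: List[Template]) -> Dict[str, List[Template]]:
    """Group templates into the five states named in the text."""
    states: Dict[str, List[Template]] = {
        "liganded-monomeric": [],
        "liganded-complexed": [],
        "apo-monomeric": [],
        "apo-complexed": [],
        "any": [],
    }
    for t in templates:
        ligand = "liganded" if t.has_ligand else "apo"
        assembly = "monomeric" if t.n_chains == 1 else "complexed"
        states[f"{ligand}-{assembly}"].append(t)
        states["any"].append(t)  # assumption: "any" pools all retained templates
    return states


hits = [
    Template("tmplA", 62.0, has_ligand=True, n_chains=1),
    Template("tmplB", 55.0, has_ligand=False, n_chains=2),
    Template("tmplC", 38.0, has_ligand=True, n_chains=2),  # 24 points below best
    Template("tmplD", 45.0, has_ligand=False, n_chains=1),
]

kept = filter_templates(hits)
states = classify_states(kept)
for name, members in states.items():
    print(name, [t.pdb_id for t in members])
```

Each non-empty state would then be passed to the model-building step, so that a single query can yield, for example, both an apo and a liganded model.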
6.2.2 Mycobacterial proteome databases
The first application of this approach was to construct the Chopin Database (http://mordred.bioc.cam.ac.uk/chopin/about), a database of protein structures for H37Rv strain of
Similar models of the structural proteome for
6.2.3 Oligomeric protein models
Current work on structural proteomes includes efforts to extend the modelling pipeline to homo-oligomeric (and eventually hetero-oligomeric) structures using comparative approaches (Malhotra et al., unpublished), extending models and improving models of small molecule complexes, and linking individual protein structures into the metabolic networks and interactions in the cell (Bannerman et al., unpublished). An example of an oligomeric structure is CTP-synthase, encoded by
The models were built by using templates PDB-IDs: 4zdI and 4zdK for PyrG of
6.3 Structural implications of mutations
We have also spent more than two decades analysing the impacts of mutations evident in the increasing wealth of available genome sequences for pathogenic mycobacteria and cancers. We originally developed SDM in 1997, a method depending on statistical analysis of environment-dependent amino-acid substitution tables [57, 58]. In 2013 machine learning was introduced with the arrival of Douglas Pires in Cambridge, who first developed mCSM for stability, followed by several “flavours” including mCSM-PPI for impacts on protein-protein interactions, mCSM-NA for nucleic acid interactions and mCSM-lig for impacts on small-molecule ligand interactions, useful for understanding drug resistance. A critical part of using machine learning is to have an extensive database of experimentally defined impacts of mutations on stability and interactions, such as Platinum, developed by David Ascher when in Cambridge, a database of experimentally measured effects of mutations on structurally defined protein-ligand complexes that was developed for mCSM-lig. These two structural approaches to predicting the impacts of mutations (SDM & mCSM) have proved complementary and more reliable than most sequence-only methods. They also allow the application of saturation mutagenesis, facilitating
6.4 Active sites, cavities and fragment hotspot maps
Although comparative modelling of homologues in complex with ligands can often give clues about active sites, cofactor binding and substrate or other ligand binding sites, this is not always possible. In order to indicate putative binding sites in the absence of appropriate experimental data, we have exploited cavity-defining software such as VolSite  for novel binding site description together with an alignment and comparison tool (Shaper) . We have used FuzCav, a novel alignment-free high-throughput algorithm to compute pairwise similarities between protein-ligand binding sites  and GHECOM , to study the small pockets that often characterise protein-protein and protein-peptide interactions.
Further to the identification of cavities and pockets, it is also useful to be able to identify hotspots, region(s) of the binding site defined as a major contributor to the binding free energy, and often characterised by their ability to bind fragment-sized organic molecules in well-defined orientations. The usual understanding is that the fragment, with a mixed polar and hydrophobic character, can displace an “unhappy water.” We have tried to mimic this
The models of individual molecules of the modelled proteome can be individually decorated with the hotspot maps. They give a good indication of the known functional sites on experimentally defined structures of proteins, often demonstrating that a functional site comprises several hotspots involved in binding substrates and cofactors. They also provide a good indication of the location of allosteric sites .
In summary we can move from the study of individual targets to an understanding of the majority of targets coded by the genome. Indeed, we can build 3D structures for a majority of the genes, so providing a model of the “structural proteome”. Hotspots and cavities provide a basis for identification of the ligandability of putative binding sites and have been used in our group to predict pharmacophores that can be used in docking and virtual screening and so deorphaning of mycobacterial proteins.
To identify druggable proteins from the structural proteome, we have adopted a hierarchical selection process in which chokepoint analysis is first performed to identify metabolic reactions that are critical to cell survival. Gene products identified in this screen are then subjected to essentiality analysis using either flux balance analysis (FBA) based models or data from transposon saturation mutagenesis experiments in the literature. Essential genes are retained at this stage and their expression profiles in different growth conditions are analysed; genes whose expression is condition specific are excluded. For the selected genes, the structural information of the corresponding proteins is then analysed in the context of prior knowledge and previous drug discovery attempts, druggable pockets and fragment hotspot maps, small-molecule-bound states, absence of human homologues, absence of homologues in the human microbiome, cellular localisation and biochemical properties of the proteins. Structure-guided virtual screening is performed on the selected drug targets with a choice of fragment and compound libraries using CCDC GOLD (The Cambridge Crystallographic Data Centre). The best-scoring poses guide the experimental process of structure-guided fragment-based drug discovery.
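The first three tiers of this selection process can be sketched in a few lines of Python. The example is a sketch under stated assumptions rather than the pipeline used by the authors: a chokepoint is taken, following a commonly used definition, to be a reaction that is the sole producer or sole consumer of some metabolite; the essentiality and expression filters are reduced to set operations on precomputed gene lists; and the toy network, reaction names and gene names are hypothetical. The later structural criteria are not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# reaction name -> (substrates, products)
Network = Dict[str, Tuple[Set[str], Set[str]]]


def chokepoint_reactions(reactions: Network) -> Set[str]:
    """Reactions that uniquely produce or uniquely consume some metabolite."""
    producers: Dict[str, Set[str]] = defaultdict(set)
    consumers: Dict[str, Set[str]] = defaultdict(set)
    for rxn, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(rxn)
        for metabolite in products:
            producers[metabolite].add(rxn)
    chokepoints: Set[str] = set()
    for table in (producers, consumers):
        for rxns in table.values():
            if len(rxns) == 1:
                chokepoints |= rxns
    return chokepoints


def select_targets(
    reactions: Network,
    gene_of: Dict[str, str],
    essential_genes: Set[str],
    condition_specific: Set[str],
) -> List[str]:
    """Chokepoint genes that are essential and not condition specific."""
    candidates = {gene_of[rxn] for rxn in chokepoint_reactions(reactions)}
    candidates &= essential_genes      # essentiality tier (FBA or transposon data)
    candidates -= condition_specific   # expression tier
    return sorted(candidates)


# Hypothetical toy network: R3 and R6 have no alternative route.
reactions: Network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
    "R4": ({"C"}, {"D"}),
    "R5": ({"C"}, {"D"}),
    "R6": ({"D"}, {"E"}),
}
gene_of = {f"R{i}": f"g{i}" for i in range(1, 7)}

targets = select_targets(
    reactions,
    gene_of,
    essential_genes={"g1", "g3", "g6"},
    condition_specific={"g6"},
)
print(targets)
```

Ordering the tiers from cheapest to most expensive means that the structural analysis and virtual screening are only run on the small set of genes that survive the network-level and expression-level filters.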
The challenge now is to test the computational methods outlined here for identifying ligands and understanding the druggability of the proteome—several thousand gene products from the whole genome of
LYB and GCM are grateful to Chinhoyi University of Technology for their support in introducing computational drug discovery and development research work at the University and all our collaborators. TLB and SCV thank the Gates Foundation, the Cystic Fibrosis Trust and the American Leprosy Mission for their funding of computational and experimental work on approaches to combating disease from mycobacterial infections. They also thank colleagues in Cambridge and elsewhere who have contributed over the years to our efforts to develop new approaches to structural biology, computational bioinformatics and drug discovery.