Computational Deorphaning of Mycobacterium tuberculosis Targets

Tuberculosis (TB) continues to be a major health hazard worldwide due to the resurgence of drug-resistant strains of Mycobacterium tuberculosis (Mtb) and co-infection. For decades drug discovery has concentrated on identifying ligands for ~10 Mtb targets, hence most of the identified essential proteins are not utilised in TB chemotherapy. Here computational techniques were used to identify ligands for the orphan Mtb proteins. These range from ligand-based and structure-based virtual screening to modelling the proteome of the bacterium. Identification of ligands for most of the Mtb proteins will provide novel TB drugs and targets and hence address drug resistance, toxicity and the duration of TB treatment.


Introduction
Tuberculosis (TB) continues to be a major public health concern with over 2 billion people currently infected, 8.6 million new cases per year, and more than 1.3 million deaths annually [1]. The current drug-regimen combination for drug sensitive TB consists of isoniazid, rifampicin, ethambutol and pyrazinamide, administered over 6 months [2]. If this treatment fails, second-line drugs are used, such as para-aminosalicylate (PAS) and fluoroquinolones, which are usually either less effective or more toxic with serious side effects. Although this regimen has a high success rate, it is marred by compliance issues, which have resulted in the rise of multidrug resistant (MDR), extensively drug resistant (XDR) and totally drug resistant (TDR) strains of the causative agent, Mycobacterium tuberculosis (Mtb) [3,4], in both immunocompetent and immunocompromised patients worldwide [5]. However, it took about 40 years for a new TB drug to be discovered and most of the current TB drugs target a total of only ~10 proteins, even though the complete genome of Mtb was published nearly 20 years ago [6]. Consequently, most of the essential proteins are orphans since their ligands are still to be identified. In our context, target deorphaning or deconvolution encompasses identification of ligands for Mtb proteins not currently exploited in TB chemotherapy and those of old TB targets. Targeting further essential proteins should allow the fight against drug resistance to be enhanced, and possibly lead to a reduction in the duration of TB treatment.
The conventional target deorphaning process involves experimental work, which characteristically includes genetic, proteomic and transcriptional profiling, followed by identification of the ligands for the proteins using a range of chemical proteomic approaches [7]. This approach is usually slow and expensive. However, developments in bioinformatics and chemoinformatics, together with advances in computer tools and resources, have revolutionised target deorphaning. Bioinformatics describes the target space in Mtb from the genome to the proteome, whilst chemoinformatics provides information about the available chemical space and tools for navigation of the space. Together these developments have led to a mushrooming of computer-based target deorphaning methods, ranging from proteome modelling and virtual screening to machine and deep learning and chemogenomics [8][9][10]. When used effectively in conjunction with experimental work, computational methods can facilitate identification of new TB targets and drugs [11][12][13].
Therefore, in this chapter we present an overview of the genome of Mtb and give a detailed account of how computational techniques have been used to de-orphan Mtb targets, including case studies, the current and proposed future impacts of these techniques on the number of de-orphaned Mtb targets, and their impact in boosting the biomedical efficacy of TB drugs. The collated data will provide researchers in academia and industry with knowledge of target-ligand pairs and interactions, information crucial for the design of novel drugs with known targets that are less prone to resistance and have minimal side effects and interactions with, e.g., anti-HIV drugs.

Method
An extensive literature search was performed to give an overview of the genome of Mtb and the status of the currently used tuberculosis drugs and their targets. An analysis of the essential proteins in Mtb and the number of proteins targeted by the current TB drugs was performed. To supplement these data, Mtb target-ligand data were extracted from the ChEMBL database version 24 (https://www.ebi.ac.uk/chembl/beta/g/#browse/targets) and used to determine the number of proposed new targets. An overview of computational deorphaning of Mtb targets is provided, using data extracted from the literature and a description of the efforts made in our laboratory. Finally, a detailed account of modelling the proteome for mycobacteria, and of identifying the hotspots and druggability of the proteins, is given.

Genome sequence of Mycobacterium tuberculosis
Cole and co-workers [14] reported the complete genome sequence of Mtb in 1998; it comprises 4,411,529 base pairs. The genome has an evenly distributed guanine-cytosine (G+C) content of 65.6% and, at the time of publication, represented the second-largest bacterial genome sequence available. Additionally, the genome is rich in repetitive DNA, particularly insertion sequences, and in new multi-gene families and duplicated housekeeping genes, providing evidence for horizontally-transferred pathogenicity islands of a particular base composition [14].
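The base-composition figure quoted above is the percentage of guanine (G) and cytosine (C) bases in the sequence. As a minimal sketch of that calculation, the following uses a short made-up sequence rather than the real H37Rv genome; the function name `gc_content` is illustrative and not taken from any particular library.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


# Toy sequence for illustration only; not taken from the Mtb genome.
toy = "GCGCCGATGC"
print(f"{gc_content(toy):.1f}%")  # 8 of 10 bases are G or C -> 80.0%
```

Applied to a full genome read from a FASTA file, the same function would give the overall G+C percentage; a sliding-window version would be needed to judge how evenly the content is distributed.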
The genome of Mtb has some exceptional features; for example, there are over 200 genes that encode enzymes for the metabolism of fatty acids, comprising 6% of the total (Table 1). Among these, about 100 are predicted to function in the oxidation of fatty acids. This large number of Mtb enzymes that putatively have fatty acids as substrates may be linked to the ability of this pathogen to grow in the tissues of the infected host, where fatty acids may be the major carbon source. Another unusual feature of the Mtb genome is the presence of the unrelated Pro-Glu (PE) and Pro-Pro-Glu (PPE) families of proteins that have conserved N-terminal domains of 100 and 180 amino acids respectively. The antigenicity of these proteins has led to the assumption that at least some of them may be involved in antigenic variation of Mtb during infection [15].

Tuberculosis drugs
The success of TB chemotherapy derives from an "intensive" phase involving a cocktail of four first-line drugs: rifampicin (RIF), isoniazid (INH), pyrazinamide (PZA) and ethambutol (EMB). A threatening global issue of this epidemic is the emergence of drug-resistant bacteria, a trend that is on the rise, as such strains are easily spread with low fitness costs associated with transmission [16]. The World Health Organisation (WHO) reported that globally 3.5% of naive infections already expressed resistance to the two most efficacious frontline agents used to treat the disease, RIF and INH, thereby classifying the infection as multidrug resistant tuberculosis (MDR-TB) [17]. Treatment of drug-resistant Mtb is already difficult, requiring 6-9 months of combination therapy with second-line drugs such as PAS, fluoroquinolones (e.g. levofloxacin), aminoglycosides (e.g. kanamycin), capreomycin, ethionamide and cycloserine. Complicating the issue is the fact that TB is endemic to the developing world; thus, access to adequate healthcare facilities and drugs can be limited for those patients. This leads to non-compliance by most patients, relapse of the disease and severe side-effects, especially from second-line drugs [18]. Treatment for MDR-TB can extend upwards of 2 years and relies on more toxic, less efficacious second-line drugs, many of which are even more scarce than frontline drugs in affected areas [16].
In addition, comorbidity with HIV causes massive diagnostic and therapeutic challenges and results in adverse drug interactions [19]. This is because RIF is a potent inducer of drug-metabolising enzymes, including cytochrome P450 (CYP) 3A4. This induction dramatically reduces plasma levels of several highly active antiretroviral therapy drugs; thus, patients are often forced to complete their TB treatment before beginning HIV treatment [20]. Patients who contract MDR-TB with HIV have a very poor prognosis due to the duration of treatment; these individuals frequently succumb within a few months. Therefore, there is an urgent need to continually develop new active agents to combat MDR-TB, a need which has been compounded by the emergence of XDR-TB. Furthermore, cases of TDR-TB have been noted in China, India, Africa and Eastern Europe. In TDR-TB, the mycobacteria are resistant to all available therapeutics [19]. To address this, in 2012 the U.S. Food and Drug Administration (FDA) approved bedaquiline for MDR-TB [21], and later delamanid was approved as a compassionate care option for XDR-TB and TDR-TB infections; the European Medicines Agency (EMA), however, approved both agents for MDR-TB [22]. The biggest challenge is that these drugs have reported human ether-a-go-go related gene (hERG) toxicity, as well as multiple absorption, distribution, metabolism and excretion (ADME) issues due to their high lipophilicity [21]. This leads to an urgent need for the development of new agents that have successful therapeutic effects.

Mycobacterium tuberculosis drug targets
To date, just over 500 of the proteins encoded by the approximately 4000 Mtb genes have been identified as essential (Figure 1), and this provides a rich source of novel targets for new and current TB drugs. However, Lamichhane et al. [23] reported that TB chemotherapy exploited only 10 of these proteins; Table 2 gives a summary of the targets and their current and/or new drug ligands. The most popular target is enoyl-[acyl-carrier-protein] reductase, important for the biosynthesis of mycolic acid. Efforts to identify genes that code for new potential drug targets are underway, as evidenced by 76 TB data points recorded in the ChEMBL database version 24 (https://www.ebi.ac.uk/chembl/beta/g/#browse/targets), consisting of small bioactive compounds, their targets and bioassay data. There are 73 single proteins, including the 10 proteins already targeted by both first-line and second-line drugs during TB chemotherapy. Thus, 63 new drug targets are being explored in a plethora of bioassays. This is of paramount importance because Mtb secreted proteins play a vital role in host-pathogen interactions: they facilitate nutrient acquisition, pilot the host immune response and interfere with therapeutic intervention. The Mtb secretome therefore consists of proteins essential for successful invasion and in vivo growth during host infection. The essential proteins are the most suitable targets for the development of diagnostic tools and new drugs, because of their key role in in vivo bacterial survival and growth. Identifying ligands for these proteins, which are required for growth and survival in the infected host, could lead to the discovery of potentially useful biomarkers to add to the above-mentioned drug targets [27].
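The count of new targets above is a simple set difference: the single-protein targets with bioassay data, minus those already exploited in chemotherapy. A minimal sketch of that bookkeeping follows; the identifiers are placeholders invented for illustration and are not real ChEMBL target IDs.

```python
# Placeholder identifiers: "T1".."T73" stand in for the 73 single-protein
# targets; they are not real ChEMBL target IDs.
all_targets = {f"T{i}" for i in range(1, 74)}

# Assume, purely for illustration, that the first ten placeholders are the
# targets already exploited by first-line and second-line drugs.
exploited = {f"T{i}" for i in range(1, 11)}

# Targets under exploration that are not yet used in TB chemotherapy.
new_targets = all_targets - exploited
print(len(all_targets), len(exploited), len(new_targets))  # 73 10 63
```

With real data, the two sets would be filled from the database export and from Table 2 respectively, and the same subtraction would list the candidate targets by name.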

Computer resources and tools for tuberculosis drug targets
Developments in genomics, coupled with advances in high performance computing and validation of molecular targets, have introduced new approaches to drug discovery that provide a shift from the historical pipeline, which focuses on target identification and in most cases involves single targets. In this era of extensive discovery of new chemical entities for treatment of TB and other infectious diseases like HIV/AIDS, a number of research institutes as well as pharmaceutical companies are eagerly developing computational tools and protocols to facilitate drug discovery and development [28]. Genomics provides DNA, RNA, transcriptomic and proteomic data that are housed in a variety of databases and resources, e.g. at the European Bioinformatics Institute (EBI) https://www.ebi.ac.uk/ and the National Center for Biotechnology Information (NCBI) https://www.ncbi.nlm.nih.gov/, from which they can be easily retrieved and analysed, thereby shifting the drug discovery focus from a single to a multi-protein target approach. In this approach Mtb genomic data are analysed for the network, structure and function of a number of essential proteins that are druggable and validated as potential targets for a number of bactericidal or bacteriostatic chemical compounds. In this section, different databases, resources and tools for target deorphaning are discussed with a particular focus on Mtb targets.
The revolution in genomics led to the availability of a number of mycobacterial genomes and the development of a variety of databases consisting of Mtb genomic and transcriptomic data. The genomic databases provide information about the structure, function and evolution of Mtb genes, whilst the transcriptomic databases provide information crucial for analysis of gene expression using large scale RNA sequences [29]. Proteomics, on the other hand, provides information about the function, networks and structure of proteins. In their paper, Machado et al. [29] give a detailed summary of most computational resources for TB and we encourage readers to consult the article for more information. Similarly, a number of chemogenomic resources and databases containing data for ligand-annotated Mtb targets have been developed. Examples include the ChEMBL database [30], a database of small bioactive molecules and their targets; TIBLE [31], a database containing MIC and target data for mycobacterial species; and TDR Targets, containing target-ligand information for neglected tropical diseases including TB. The databases are freely available and provide easy access to target-ligand data for Mtb. In these databases each target is associated with ligand(s) obtained from bioassays and vice versa.

Computational target deorphaning techniques
A number of computational methods are being explored in order to identify ligands for both host and pathogen targets and for targets from other organisms like Plasmodium falciparum [32]. In most cases two or more complementary ligand-based and structure-based deorphaning approaches are used. Statistical methods involving machine learning [8] and deep learning strategies are applied in conjunction with biological and/or biophysical methods that validate the computational results; alternatively, the computational methods are used to provide protein-ligand binding information in the absence of X-ray co-crystallised structures of the ligand [12,13]. In their work, Mendes and Blundell [13] applied cheminformatics to complement current efforts for target identification of fragment-sized molecules that target, e.g., PanC, which synthesises pantothenate, important for generation of the Mtb co-enzyme A. This has led to the identification of 'hotspots' in the binding pockets of a number of proteins, which highlight the most favoured binding spots for the protein. Hotspots and druggability will be discussed in detail in Section 6.

Ligand-based and structure-based virtual screening methods
Structure-based virtual screening is an approach used in drug discovery to computationally screen small molecule databases for compounds that target proteins of known 3D structure that are experimentally validated. Brian Shoichet [33] has pointed out that this approach was first published in the 1970s; however, most new ligands and their targets were not identified until the early 2000s. The method offers the opportunity to access a large number of potential new chemical ligands for old and new targets.
Where ligands are already available for named biological targets, ligand-based virtual screening may be applied, using a variety of techniques ranging from molecular similarity and pharmacophoric search to machine learning and, most recently, deep learning.
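Molecular similarity, the simplest of the ligand-based techniques listed above, can be illustrated with the Tanimoto coefficient between binary fingerprints. The sketch below is a toy version under stated assumptions: the fingerprints are made-up sets of bit indices rather than output from a cheminformatics toolkit, and the compound names and function names are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each integer is the index of a set bit.
known_ligand = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},   # shares 4 of 6 distinct bits with the query
    "cmpd_B": {2, 3, 5},          # shares none
    "cmpd_C": {1, 4, 7, 9, 12},   # identical to the query
}
for name, score in rank_by_similarity(known_ligand, library):
    print(f"{name}: {score:.2f}")
```

In a screening setting, compounds scoring above a chosen similarity threshold to a known ligand would be prioritised for testing against that ligand's target; the threshold itself is a judgment call and depends on the fingerprint used.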

Structure-based techniques
Structure-based virtual screening plays a significant role in drug discovery in that it is used to identify ligands for biological targets when the 3D structures of the Mtb targets from X-ray crystallography, nuclear magnetic resonance (NMR) or cryo-electron microscopy are available in the Protein Data Bank (PDB), or when homology models are available in the CHOPIN database and/or generated in house. This method applies structural data of proteins/receptors to provide small molecules with specific structural attributes for good binding affinity [34]. Generally, the process involves three crucial steps, namely preparation of the 3D crystal structures of proteins obtained from the PDB and of the ligand structures, docking calculation, and data analysis. Protein structure preparation involves adding hydrogen atoms that are normally missing in the coordinate files, adding missing residues, optimising hydrogen bonds, removing atomic clashes, and sampling degrees of freedom that are not clear in standard resolution crystal structures, for example the 180° flips of chain-terminal rotatable side-chain groups (e.g. in the shape-symmetric amino acids Asn and Gln), tautomer and/or ionisation states, and relaxation of the target and ligand structure [35]. Most docking software is associated with protein and ligand preparation tools; for example, Autodock4 or VINA require structures prepared using AutoDockTools (ADT) and the protein preparation script to generate Autodock-type atoms containing Gasteiger charges and to produce the pdbqt files that are compatible with the tool [36]. Similarly, the Primex and Ligprep tools are used to prepare the protein and ligand structures respectively before docking with GLIDE [37]. The quality of the input structure files contributes to the quality of the docking results, and the importance of protein and ligand preparation has been highlighted by Sastry [35].

Molecular docking
Molecular docking calculations are capable of predicting the binding conformation of ligands inside the binding pocket of a target; as such, they are used to map small molecules onto targets and hence provide essential binding information for structure-based drug design. To achieve this, some docking algorithms, such as Autodock [36], perform a stochastic conformational search, whereas others, e.g. GLIDE [37], perform a systematic search [34]. In a stochastic search structural parameters, such as torsional, translational and rotational degrees of freedom of the ligand, are randomly modified to generate an ensemble of molecular conformations and increase the chances of finding the global energy minimum, whilst in a systematic conformational search structural features are gradually changed until a local or global minimum is reached [34]. During the search, conformations of a number of potential binding compounds are explored and evaluated using a specific scoring function. The conformations are then ranked based on their calculated binding energy, and highly ranked compounds are selected as ligands for the target. Reverse or inverse docking, on the other hand, is used for identifying the targets of drug phenotypic hits from a sea of targets. In this way, structure-based screening helps to identify and explain polypharmacology and the molecular mechanism of action of substances, to facilitate drug repurposing, and to detect adverse drug reactions and hence toxicity.
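The difference between the two search strategies can be shown on a toy problem with a single torsion angle. The sketch below is an illustration only: `toy_energy` is a made-up energy profile standing in for a scoring function, the systematic search is simplified to a fixed-step scan, and none of this reproduces the algorithms used in Autodock or GLIDE.

```python
import math
import random


def toy_energy(angle_deg: float) -> float:
    """Made-up torsional energy profile; not a real docking scoring function."""
    theta = math.radians(angle_deg)
    return math.cos(theta) + 0.5 * math.cos(3 * theta)


def systematic_search(step_deg: int = 10) -> int:
    """Change the torsion in fixed increments and keep the lowest-energy angle."""
    angles = range(0, 360, step_deg)
    return min(angles, key=toy_energy)


def stochastic_search(n_trials: int = 200, seed: int = 0) -> float:
    """Sample random torsion angles and keep the lowest-energy one."""
    rng = random.Random(seed)
    samples = [rng.uniform(0.0, 360.0) for _ in range(n_trials)]
    return min(samples, key=toy_energy)


best_sys = systematic_search()
best_rand = stochastic_search()
print(f"systematic: {best_sys} deg, E = {toy_energy(best_sys):.3f}")
print(f"stochastic: {best_rand:.1f} deg, E = {toy_energy(best_rand):.3f}")
```

With one degree of freedom both strategies find the same well easily. The practical distinction appears with many rotatable bonds plus translation and rotation, where an exhaustive scan grows combinatorially and random sampling becomes the more tractable way to reach low-energy poses.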

Deorphaning the HTH transcription regulator, EthR
In an effort to de-orphan the HTH transcription regulator, EthR, and identify the binding mode of the ligand, we docked 200 fragment-like compounds from the Maybridge database to the highest quality crystal structure of the 23 PDB entries using the GOLD algorithm (unpublished work). We used Arpeggio [38], an online tool that identifies non-covalent interactions in protein structures, to assess the role of each EthR binding site residue and each small-molecule ligand moiety in contributing to protein-ligand interactions. Visual assessment involved calculating interactions using the Arpeggio web server (http://structure.bioc.cam.ac.uk/arpeggio) and downloading the results as PyMOL session files, to analyse the non-covalent interactions of each residue. We found that in addition to using polar contacts, most ligands are stabilised by a cascade of pi-interactions starting from Tyr103, close to the entrance of the allosteric pocket, to Phe114, located close to the HTH-domain, and beyond (Figure 2). Furthermore, potential ligands for the protein were identified. Information obtained from these results is vital to identify ligands with a higher probability of binding to EthR, and so improve the potency and safety of ethionamide (ETH). Similarly, docking calculations were used to assess the binding of ligands, identified from high-throughput screening, for a novel TB drug target, the inosine monophosphate dehydrogenase (IMPDH) protein GuaB2, which is responsible for the synthesis of xanthosine monophosphate (XMP) from IMP [12]. Hit compounds were identified in a single-shot high-throughput screen, validated by dose response and subjected to further biochemical analysis. The compounds were also assessed using molecular docking experiments, providing a platform for their further optimisation using medicinal chemistry. From the results, it was observed that occupation of the nicotinamide sub-site was correlated with interactions of the ligands with the purine ring of IMP.

Applying concerted computational and experimental approaches
Likewise, we used a combination of ligand-based and structure-based chemogenomic approaches, followed by biophysical and biochemical methods, to identify targets for Mtb phenotypic hits deposited in the ChEMBL database [11]. In this work, EthR and InhA emerged as potential targets for many of the hits, and some of the hits displayed activity through both targets. Of the 35 predicted EthR inhibitors, 25 displayed an inhibition of better than 50%, of which eight showed an IC50 better than 50 μM against Mtb EthR and three were confirmed to be also active against InhA. Further, the EthR-ligand complexes were validated using X-ray crystallography in the Blundell laboratory to give new crystal structures, which were deposited in the Protein Data Bank. These results provide new lead compounds that could be further developed into highly active ligands of EthR and InhA and enhance treatment of drug-resistant TB.

Modelling proteomes for mycobacteria, hotspots and druggability
A comprehensive understanding of the structural proteomes of mycobacteria is essential for novel drug discovery and for elucidating the roles of mutations in drug resistance. Most researchers begin by defining the 3D structure using X-ray crystallography, NMR or, increasingly, cryo-EM. For phenotypic screening and for understanding off-target hits, where the target is not identified, prior knowledge of the structures of all gene products in the target organism is helpful. This has stimulated the establishment of several consortia in what is usually known as structural genomics, but might more appropriately be termed "structural proteomics".

Evolution of structural genomics consortia
The Structural Genomics Consortium (SGC) [39], which has focused on proteins of interest to medicine, has impressive achievements, having by 2011 defined ~40% of the structures of proteins from human parasites deposited in the PDB [40]. The Tuberculosis Structural Genomics Consortium (TBSGC), an international collaboration involving 53 countries, has focused on 3D structures of Mtb [40]. This consortium and others working on Mtb proteomes have deposited 2274 structures in the PDB, but these still represent fewer than 583 gene products, only 13.97% of the genome. Although this is a small percentage, it compares impressively with knowledge of protein structures of two other mycobacterial pathogens of great clinical interest: for M. leprae, which causes leprosy, there are experimentally-defined 3D structures for 15 gene products, and for M. abscessus, a free-living Mycobacterium that is a growing challenge for cystic fibrosis patients, there are 53 experimentally-defined 3D structures in the PDB.

Comparative 3D modelling of proteins
Comparative modelling of proteins, based on fold recognition and structural alignment with the closest homologues that have experimentally solved structures, began using interactive graphics in the 1970s [41][42][43]. The development of automated modelling software began in the 1980s, initially with Composer [44] and later with Comparer [45] and Modeller [46], the latter based on satisfaction of 3D restraints derived from structurally aligned homologues. Modeller has now been cited ~10,500 times in the literature.

Computational modelling pipelines and structural proteome databases
Rapid progress in this and other related software, coupled with increasing computing power, has enabled genome-scale prediction of protein structures as a viable alternative to experimental determination. In order to construct computational models of all gene products, which we here refer to as the structural proteome, we identify templates by a sequence-structure homology search using Fugue [47], which uses local-structural-environment-specific substitution tables to predict the likelihood of a common 3D structure. We have incorporated Fugue into a pipeline (Vivace), in which templates are selected from TOCCATA (Ochoa Montaño and Blundell, unpublished), a database of consensus profiles built from the CATH 3.5 [48] and SCOP 1.75A [49] classifications of protein structures (PDB files). PDB entries within each profile are clustered based on sequence similarity using CD-HIT [50] and structures are aligned using BATON, a modified version of COMPARER [45]. After further optimisation of the clusters by discarding templates with more than 20% difference in sequence identity to the maximum hit, the remaining templates are classified into states based on ligand binding and oligomerisation. Five different states, known as "liganded-monomeric," "liganded-complexed," "apo-monomeric," "apo-complexed" and "any," are generated in each profile hit. Models are built in each of these states using Modeller 9.10 [46] and refined. Finally, NDOPE, GA341 [51], Molprobity [52] and SSAG [53] are used to determine the quality of the models.
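The template filtering and state classification steps described above can be sketched in a few lines. This is not the Vivace code: the `Template` record, its field names and the template identifiers are hypothetical, the 20% cut-off is read here as 20 percentage points of sequence identity below the best hit, "complexed" is taken to mean more than one chain, and "any" is taken to mean all retained templates. Each of these is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str          # hypothetical identifier, not a real PDB entry
    identity: float    # % sequence identity to the query
    has_ligand: bool   # True if the template structure contains a bound ligand
    n_chains: int      # number of chains in the assembly (assumed field)


def filter_templates(templates, max_drop=20.0):
    """Keep templates within `max_drop` identity points of the best hit."""
    best = max(t.identity for t in templates)
    return [t for t in templates if best - t.identity <= max_drop]


def state_of(t: Template) -> str:
    """Name one of the four specific states from ligand and assembly status."""
    ligand = "liganded" if t.has_ligand else "apo"
    assembly = "complexed" if t.n_chains > 1 else "monomeric"
    return f"{ligand}-{assembly}"


def group_by_state(templates):
    """Map each state to its templates; 'any' holds every retained template."""
    groups = {"any": list(templates)}
    for t in templates:
        groups.setdefault(state_of(t), []).append(t)
    return groups


hits = [
    Template("tmpl_A", 89.0, True, 4),
    Template("tmpl_B", 85.0, False, 1),
    Template("tmpl_C", 60.0, True, 1),   # 29 points below the best hit
]
groups = group_by_state(filter_templates(hits))
for state in sorted(groups):
    print(state, [t.name for t in groups[state]])
```

In this toy run `tmpl_C` is discarded by the identity filter, and the two remaining templates populate the "liganded-complexed", "apo-monomeric" and "any" states; a model would then be built separately for each populated state.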

Mycobacterial proteome databases
The first application of this approach was to construct the Chopin Database (http://mordred.bioc.cam.ac.uk/chopin/about), a database of protein structures for the H37Rv strain of Mtb. This has provided structures that are reasonably certain for around 65% of gene products. These have proved reliable indicators of the overall structures but may have some uncertainties, especially in loop regions and domain-domain relationships. A further ~19% probably have correct folds, while the remainder are unlikely to be correct. Nevertheless, compared to those structures defined experimentally by X-ray analysis, this represents a 6-fold increase in the structural information available that might be useful in assessing druggability and the impacts of mutations.
Similar models of the structural proteome for M. abscessus (Skwark et al., unpublished) and M. leprae (Vedithi et al., unpublished) have been developed in the group. In M. leprae, of the 1615 gene products, templates were identified for 1429 and we were able to model 1161 proteins with high confidence. A total of 36,408 models were built in different ligand-bound and oligomeric states for the 1161 proteins. The distribution of Fugue Z scores across models indicates that only 4% of the proteome has no hits and 15% has poor scores, while ~80% of the proteome has acceptable or good hits, as judged by the corresponding Z scores. Around 47% of the protein queries identified templates with identity and coverage greater than 40%, and 67% of the models in the proteome are of best quality as estimated by NDOPE, GA341, Molprobity and Secondary Structure Agreement (SSAG).

Oligomeric protein models
Current work on structural proteomes includes efforts to extend the modelling pipeline to homo-oligomeric (and eventually hetero-oligomeric) structures using comparative approaches (Malhotra et al., unpublished), extending and improving models of small molecule complexes, and linking individual protein structures into the metabolic networks and interactions in the cell (Bannerman et al., unpublished). An example of an oligomeric structure is CTP-synthase, encoded by PyrG, which is an essential gene in Mtb identified by transposon saturation mutagenesis [54]; it catalyses ATP-dependent amination of UTP to CTP with either L-glutamine or ammonia. The allosteric effector GTP functions by stabilising the protein conformation that binds to the tetrahedral intermediates formed during glutamine hydrolysis. Its closest homologue in M. leprae, ML1363, is a target of choice and was modelled using Vivace during the proteome modelling exercise. We modelled the apo and ligand-bound states of the protein and oligomerised the protomer using our in-house oligomerisation pipeline. The protomeric and oligomeric states are depicted in Figure 3A and B.
The models were built using the templates with PDB IDs 4ZDI and 4ZDK for PyrG of Mtb [55]. Both templates have 89% sequence identity to the query sequence and cover 100% of it. Superposition of the models with the templates indicated a root mean square deviation (RMSD) of 0.758 Å.
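The RMSD used above to compare a model with its template is the square root of the mean squared distance between paired atoms. A minimal sketch follows, assuming the two coordinate sets are already superposed and paired by index; the fitting step itself (for example a least-squares superposition) is not shown, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z).

    Assumes the structures are already superposed and that atoms are paired
    by index; no fitting is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates for three atoms, for illustration only.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(model, template))  # every atom is displaced by 1.0, so the RMSD is 1.0
```

For real structures the paired atoms are usually the C-alpha atoms of aligned residues, and the value is reported in the same length unit as the input coordinates.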

Structural implications of mutations
We have also spent over two decades analysing the impacts of mutations evident in the increasing wealth of available genome sequences for pathogenic mycobacteria and cancers. We originally developed SDM [56] in 1997, a method depending on statistical analysis of environment-dependent amino-acid substitution tables [57,58]. In 2013 machine learning was introduced with the arrival of Douglas Pires in Cambridge, who developed first mCSM for stability [59], followed by several "flavours" including mCSM-PPI for impacts on protein-protein interactions, mCSM-NA [60] for nucleic acid interactions and mCSM-lig for impacts on small-molecule ligand interactions, useful for understanding drug resistance [61]. A critical part of using machine learning is to have an extensive database of experimentally-defined impacts of mutations on stability and interactions, such as Platinum, developed by David Ascher when in Cambridge [62], a database of experimentally measured effects of mutations on structurally defined protein-ligand complexes that was built for mCSM-lig. These two structural approaches to predicting the impacts of mutations (SDM and mCSM) have proved complementary and more reliable than most sequence-only methods. They also allow the application of saturation mutagenesis, facilitating in silico systematic analysis of mutations [63], an approach now being applied to whole proteomes, where every residue in each of the proteins in the proteome is mutated to all the other 19 amino acids and the effects of the mutations are estimated using the methods mentioned above. In structure-guided fragment-based drug discovery, this provides comprehensive information on the regions of the protein that are less likely to lead to drug resistance and can therefore be probed by elaboration of fragments/small molecules. We performed saturation mutagenesis on the drug targets in M. leprae for leprosy, and the average or highest impact a mutation can induce at each residue position is depicted on the structure (Figure 4).
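The enumeration behind in silico saturation mutagenesis, and the reduction of the results to one value per residue position, can be sketched as follows. The sequence and scores are made up, the function names are illustrative, and the scoring itself (by SDM, mCSM or similar) is not reproduced; the sketch also assumes that a more negative score means a more destabilising mutation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def saturation_mutants(sequence: str):
    """Yield every single-point mutant label, e.g. 'M1A', for a protein sequence."""
    for position, wild_type in enumerate(sequence.upper(), start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield f"{wild_type}{position}{mutant}"


def worst_per_position(scores: dict) -> dict:
    """Reduce per-mutation scores to the most destabilising value per position.

    `scores` maps labels such as 'M1A' to a predicted stability change, where
    more negative is assumed to mean more destabilising.
    """
    worst = {}
    for label, value in scores.items():
        position = int(label[1:-1])
        worst[position] = min(value, worst.get(position, value))
    return worst


toy_sequence = "MKT"  # made-up tripeptide, not a real drug target
mutants = list(saturation_mutants(toy_sequence))
print(len(mutants))   # 3 positions x 19 substitutions = 57
print(mutants[:3])
```

Each label would be passed to a predictor, and `worst_per_position` (or an average over the 19 substitutions) gives the per-residue value that can be mapped onto the structure.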

Active sites, cavities and fragment hotspot maps
Although comparative modelling of homologues in complex with ligands can often give clues about active sites, cofactor binding and substrate or other ligand binding sites, this is not always possible. In order to indicate putative binding sites in the absence of appropriate experimental data, we have exploited cavity-defining software such as VolSite [64] for novel binding site description, together with an alignment and comparison tool (Shaper) [65]. We have used FuzCav, a novel alignment-free high-throughput algorithm, to compute pairwise similarities between protein-ligand binding sites [66], and GHECOM [67] to study the small pockets that often characterise protein-protein and protein-peptide interactions.
Beyond the identification of cavities and pockets, it is also useful to identify hotspots: regions of the binding site that are major contributors to the binding free energy, often characterised by their ability to bind fragment-sized organic molecules in well-defined orientations. The usual understanding is that the fragment, with a mixed polar and hydrophobic character, can displace an "unhappy water". We have tried to mimic this in silico by using SuperStar [68] to generate atomic interaction propensities on a grid. We then carry out a search with three fragments, each having a six-membered carbon ring but carrying a donor, an acceptor or a non-polar substituent. The resulting map is convoluted with an estimate of the depth below the surface, which generally appears to correlate with the favourable entropic gain from water release on binding of a ligand [69]. The hotspot maps computed in this way, indicating donor, acceptor and lipophilic interactions, correlate well with experimental binding sites of fragments that can be elaborated in fragment-based discovery. For ligand-bound structures, lower contouring can provide "warm spots" for the binding sites, indicating possibilities for elaborating the fragment in the binding pocket.
Each model in the modelled proteome can be individually decorated with the hotspot maps. The maps give a good indication of the known functional sites on experimentally defined structures of proteins, often demonstrating that a functional site comprises several hotspots involved in binding substrates and cofactors. They also provide a good indication of the location of allosteric sites [70].
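The way a propensity grid is combined with a depth estimate and then contoured at two levels can be sketched as follows. This is a toy illustration under stated assumptions, not the SuperStar interface or the published hotspot algorithm: the grid points, propensity and depth values are invented, and simple multiplication is assumed as the weighting scheme.

```python
def weight_by_depth(propensity, depth):
    """Scale each grid point's interaction propensity by its burial depth."""
    # Assumption: plain multiplication; points with no depth estimate score 0.
    return {point: value * depth.get(point, 0.0)
            for point, value in propensity.items()}


def contour(scored, level):
    """Return the grid points whose weighted score reaches the contour level."""
    return {point for point, score in scored.items() if score >= level}


# Invented propensities for the three probe types on a three-point grid.
propensity_maps = {
    "donor":    {(0, 0, 0): 2.0, (1, 0, 0): 4.0, (2, 0, 0): 1.0},
    "acceptor": {(0, 0, 0): 0.5, (1, 0, 0): 1.0, (2, 0, 0): 3.0},
    "apolar":   {(0, 0, 0): 1.0, (1, 0, 0): 2.0, (2, 0, 0): 2.0},
}
# Invented depth below the protein surface for the same grid points.
depth = {(0, 0, 0): 5.0, (1, 0, 0): 1.0, (2, 0, 0): 3.0}

HOT_LEVEL, WARM_LEVEL = 8.0, 3.5  # hypothetical contour levels

for probe, grid in propensity_maps.items():
    scored = weight_by_depth(grid, depth)
    hot = contour(scored, HOT_LEVEL)
    warm = contour(scored, WARM_LEVEL) - hot
    print(probe, "hot:", sorted(hot), "warm:", sorted(warm))
```

Lowering the contour level only ever adds points, which is why the "warm spots" appear as an extension around the hotspots in this sketch.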

Conclusion
In summary, we can move from the study of individual targets to an understanding of the majority of targets encoded by the genome. Indeed, we can build 3D structures for the products of a majority of the genes, so providing a model of the "structural proteome". Hotspots and cavities provide a basis for assessing the ligandability of putative binding sites and have been used in our group to predict pharmacophores that can be used in docking and virtual screening, and so in the deorphaning of mycobacterial proteins.
To identify druggable proteins from the structural proteome, we have adopted a hierarchical selection process in which chokepoint analysis is first performed to identify metabolic reactions that are critical to cell survival. Gene products identified in this screen are then subjected to essentiality analysis, using either flux balance analysis (FBA) based models or data from transposon saturation mutagenesis experiments in the literature. Genes that are essential are retained at this stage and their expression profiles under different growth conditions are analysed; genes whose expression is condition-specific are excluded. For the selected genes, the structural information of the corresponding proteins is then analysed in the context of prior knowledge and earlier attempts at drug discovery, druggable pockets and fragment hotspot maps, small-molecule-bound states, absence of human homologues, absence of homologues in the human microbiome, cellular localization and biochemical properties of the proteins. Structure-guided virtual screening is performed on the selected drug targets with a choice of fragment and compound libraries using GOLD (The Cambridge Crystallographic Data Centre) [71]. The best poses with good scores guide the experimental process of structure-guided fragment-based drug discovery.
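The first stages of this hierarchical selection can be sketched in Python. The sketch assumes one common definition of a chokepoint, namely a reaction that is the only consumer or the only producer of some metabolite; the toy network, the gene names and the essentiality and expression sets are all hypothetical, and in practice the essentiality set would come from FBA models or transposon mutagenesis data.

```python
from collections import defaultdict


def chokepoint_reactions(reactions):
    """Find reactions that uniquely consume or uniquely produce a metabolite.

    reactions maps a reaction id to a (substrates, products) pair.
    """
    consumers, producers = defaultdict(set), defaultdict(set)
    for rxn, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(rxn)
        for metabolite in products:
            producers[metabolite].add(rxn)
    chokepoints = set()
    for index in (consumers, producers):
        for rxns in index.values():
            if len(rxns) == 1:  # sole consumer or sole producer
                chokepoints |= rxns
    return chokepoints


def select_targets(reactions, gene_of, essential, condition_specific):
    """Apply the filters in order: chokepoint, essential, not condition-specific."""
    candidates = {gene_of[r] for r in chokepoint_reactions(reactions)
                  if r in gene_of}
    return sorted(g for g in candidates
                  if g in essential and g not in condition_specific)


# Hypothetical toy network: R1 and R2 are redundant, R3 and R6 are not.
reactions = {
    "R1": ({"A"}, {"B"}), "R2": ({"A"}, {"B"}),
    "R3": ({"B"}, {"C"}),
    "R4": ({"C"}, {"D"}), "R5": ({"C"}, {"D"}),
    "R6": ({"D"}, {"E"}),
}
gene_of = {"R1": "geneA", "R3": "geneC", "R4": "geneD", "R6": "geneF"}
essential = {"geneA", "geneC", "geneF"}   # assumed essentiality calls
condition_specific = {"geneF"}            # assumed expression filter

print(select_targets(reactions, gene_of, essential, condition_specific))
```

The later stages described above (structural analysis, homology filters and docking) depend on external tools and data and are deliberately left out of this sketch.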
The challenge now is to test the computational methods outlined here for identifying ligands and understanding the druggability of the proteome, which comprises several thousand gene products from the whole genome of Mtb. We can then begin to assess the degree to which we can deorphan the many Mtb proteins that have until now not featured as targets in the worldwide efforts to combat the global challenge that TB poses to the health and well-being of humankind.