Computational Identification of Indispensable Virulence Proteins of Salmonella Typhi CT18

Typhoid infections have become an alarming concern with the increase of multidrug-resistant strains of Salmonella serovars. The new pathogenic Gram-negative strains are resistant to most antibiotics, such as chloramphenicol, ampicillin, trimethoprim, ciprofloxacin and even co-trimoxazole and their derivatives, thereby causing numerous outbreaks in the Indian subcontinent and in Southeast Asian and African countries. Conventional and modern methods of typing have been adopted to differentiate outbreak strains. However, identifying the most indispensable proteins from the complete set of proteins of the whole genome of Salmonella sp., comprising the Salmonella pathogenicity islands (SPI) responsible for virulence, has remained a challenging task. We have adopted a network-based method to identify, albeit theoretically, the most significant proteins that might be involved in the antibiotic resistance of Salmonella sp. An understanding of these proteins will provide insight into the conditions encountered by this pathogen during the course of infection, which will further contribute to identifying new targets for antimicrobial agents.


Introduction
Food-borne infections are common and distributed worldwide, though such diseases can have several sources. Human salmonellosis or typhoid, which causes systemic infection of the human gastrointestinal tract and diarrhoea, is one such common disease caused by Salmonella enterica serovar Typhi. With a prevalence of probably 10 million cases and hundreds of thousands of deaths every year [1], the disease has become a major cause for concern with the emergence of multidrug-resistant (MDR) Salmonella strains [2]. Such new strains are resistant to chloramphenicol, ampicillin, trimethoprim, ciprofloxacin and even co-trimoxazole and their derivatives, thereby causing numerous outbreaks in the Indian subcontinent and in Southeast Asian and African countries [3,4]. Thus, newer drugs like cephalosporins and quinolone derivatives need to be explored to combat the situation [5].
To deal with the threats of multidrug resistance, several health intervention strategies have been undertaken. However, the prospects for finding new antibiotics for several classes of Gram-negative pathogens are especially poor, because their outer membrane blocks the entry of some existing antibiotics and their efflux pumps expel many of the remainder [6]. Conventional strategies for dealing with such pathogens have thus become less effective, or at times completely ineffective, against the defences these pathogens deploy. In such cases, the complexities posed can be addressed by adopting non-conventional approaches to finding drug targets for these pathogens. Proteins, being the functional units of the cell of any living organism, have always been good targets for combating diseases. Diseases, on the other hand, serve as interesting examples of complex protein interactions among several other heterogeneous entities of and between organisms. However, understanding the complexity of such interacting protein partners, especially with respect to the combat against the pathogens, has always been elusive. Thus, analyses of the mesh or network of interacting proteins, commonly known as protein interaction networks (PINs), can provide sufficient insight to reveal the indispensable virulent proteins that are valuable drug targets [7].
Analyses of a PIN to highlight important and/or indispensable proteins can be as simple as centrality measurements interpreted with respect to the biological scenario. These can start by determining the number of interacting partners of a particular protein to identify its degree centrality (DC), which correlates with its biological importance. Thus, high-degree proteins (or hubs) are known to correspond to proteins that are essential [8]. As a protein can be affected locally while interacting with its other partners in the global network, other centrality measures are also given importance based on their relevance. Thus, we have discussed the importance of measures like closeness centrality (CC), betweenness centrality (BC) and eigenvector centrality (EC) [8] for the PINs comprising the Salmonella pathogenicity islands (SPI), which harbour the specialized virulent proteins characterized by the type III secretion system (T3SS), among others. To date, 17 such discrete sets have been reported for S. Typhi [9], along with the five SPI (1 to 5) characterized experimentally [10], among which SicA has been identified as the indispensable one in the phylogenetically closest neighbour, S. enterica serovar Typhimurium strain LT2 [11].
However, extracting knowledge of the most indispensable virulence proteins from among the stipulated sets of SPI proteins alone could be quite insufficient. Thus, we have carried out further analyses of the whole genome of S. Typhi CT18, decomposing the whole-genome protein interactome to a core of highly interacting proteins through the k-core analysis approach [12]. We have further performed cartographic analyses to identify the functional modules in the network [13] and predicted the indispensability of certain sets of proteins, which have been shown to share similar functional modules that are empirically important for drug targets.

Dataset collection
Proteins for the 17 Salmonella pathogenicity islands (SPIs) were collected from an in silico study of SPI for S. enterica serovar Typhi strain CT18 [9]. The locus tags of all the SPI proteins of S. Typhi CT18 were submitted as queries to the STRING 10.0 biological meta-database [14] to get all the possible interactions of each protein (date and time of access: Jul 28 2016 13:07:15). The detailed protein links file under the accession number 220341 in STRING was used to collect all the interactions of the whole-genome proteins of S. Typhi.

Interactome construction
All individual protein interaction data, with the medium confidence values obtained by default from STRING 10.0, were imported into Cytoscape version 3.3.0 [15] to integrate and build the interactomes of the networks comprising SPI-1 to -13 and SPI-15 to -18 individually, and all these 17 SPI collectively (AS). The interaction information for all the proteins of the S. Typhi genome, weighted by interaction strength as per STRING, was imported into Gephi 0.9.1 [16] to construct and visualize the interactome of the whole genome. An interactome of proteins can be perceived as a protein interaction network (PIN) and can be represented as an undirected graph G = (V, E) consisting of a finite set V of vertices (or nodes) and a set E of edges. An edge e = (u, v) connects two vertices (nodes) u and v. Each protein in the above PIN is represented as a vertex/node. The number of connections/interactions/associations/links a node v has with other nodes is its degree d(v) [17].
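The graph representation described above can be illustrated with a minimal Python sketch. It is not the Cytoscape or Gephi workflow used in this study: the edge list is made up, and the cutoff of 400 assumes that STRING's "medium confidence" corresponds to a combined score of 0.4 on the 0-1000 scale used in its link files.

```python
from collections import defaultdict

# Hypothetical weighted interactions: (protein_u, protein_v, combined_score).
# Scores are on an assumed 0-1000 scale; the rows are invented for illustration.
EDGES = [
    ("A", "B", 900),
    ("A", "C", 450),
    ("B", "C", 700),
    ("C", "D", 410),
    ("D", "E", 150),  # below the cutoff, so this link is dropped
]


def build_graph(edges, min_score=400):
    """Return an undirected weighted adjacency map {u: {v: score}}."""
    graph = defaultdict(dict)
    for u, v, score in edges:
        if u == v or score < min_score:
            continue  # skip self-loops and low-confidence links
        graph[u][v] = score
        graph[v][u] = score  # undirected: store both directions
    return dict(graph)


def degree(graph, node):
    """d(v): number of distinct interaction partners of a node."""
    return len(graph.get(node, {}))


G = build_graph(EDGES)
print({v: degree(G, v) for v in sorted(G)})
```

Nodes whose only links fall below the cutoff never enter the adjacency map, so they do not appear in the printed degree table.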

SPI-PIN
All the SPI-PIN interactomes were viewed in Cytoscape version 3.3.0 as graphs of the aforementioned interconnected proteins. The networks were subsequently analysed with the Cytoscape-integrated Java plugin CytoNCA [18] to compute values for the network centrality parameters, namely EC, DC, CC and BC. The combined scores from the different parameters considered in STRING were taken as edge weights for computing the CytoNCA scores. The top 20 proteins for each centrality measure were used to draw Venn diagrams and find the proteins common to all measures.
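The overlap step behind the Venn diagrams amounts to intersecting the top-ranked sets of each measure. The sketch below assumes centrality scores are already available as dictionaries; the protein names and values are hypothetical, and the helper names `top_k` and `common_top` are introduced here only for illustration.

```python
def top_k(scores, k):
    """Names of the k highest-scoring proteins (ties broken by name)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return set(ranked[:k])


def common_top(measures, k):
    """Proteins that appear in the top k of every centrality measure."""
    sets = [top_k(scores, k) for scores in measures.values()]
    return set.intersection(*sets)


# Hypothetical scores for four proteins under the four measures.
MEASURES = {
    "DC": {"P1": 5, "P2": 4, "P3": 1, "P4": 3},
    "CC": {"P1": 0.9, "P2": 0.5, "P3": 0.7, "P4": 0.8},
    "BC": {"P1": 12.0, "P2": 0.0, "P3": 2.0, "P4": 7.0},
    "EC": {"P1": 0.6, "P2": 0.5, "P3": 0.1, "P4": 0.55},
}

print(sorted(common_top(MEASURES, k=2)))
print(sorted(common_top(MEASURES, k=3)))
```

In the study itself k would be 20 and the scores would come from CytoNCA.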

WhoG-PIN
As a few (21) nodes of the whole genome were isolated from the major part of the network, they were considered to have little impact on the overall topology and were thus ignored. Further analyses were based on the large connected component (LCC) of the network, comprising 4508 protein partners with 1041182 interactions. The analyses were carried out in MATLAB version 7.11, a programming environment developed by MathWorks [19].
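Restricting a network to its largest connected component can be sketched with a breadth-first search. This is an illustrative Python version, not the MATLAB code used in the study, and the toy adjacency map is hypothetical.

```python
from collections import deque


def connected_components(graph):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def largest_component(graph):
    """Sub-network induced by the largest connected component."""
    keep = max(connected_components(graph), key=len)
    return {u: {v for v in graph[u] if v in keep} for u in keep}


# Hypothetical network: one main component, one small pair, one isolated node.
TOY = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"},
    "X": {"Y"}, "Y": {"X"},
    "Z": set(),
}

LCC = largest_component(TOY)
print(sorted(LCC))
```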
For a primary understanding of the network, the distribution of the network degree (k) was plotted as a Complementary Cumulative Distribution Function (CCDF). To extract significant information from the topology of the large and complex Whole Genome Protein Interaction Network (WhoG-PIN), the role of each protein was derived from a cartographic representation of the within-module degree z-score of the protein versus its participation coefficient, as per the methodology described by Guimera et al. [20]. The participation of each protein reflects its positioning within its own module and with respect to other modules, where the modules were calculated with the Rosvall method [21]. To identify the core group of very specific proteins that might play a variety of roles in the whole-genome context, a k-core analysis was performed, following network decomposition (pruning) techniques that produce a sequence of subgraphs of gradually increasing cohesion [12].

Features of the 17 SPIs
The virulence proteins of Salmonella are spread across the 17 Salmonella pathogenicity islands (SPIs) in S. Typhi, as implied by Ong et al. [9]. Among these, five have been well characterized and reported to have SicA as the most indispensable protein, as identified computationally by Lahiri et al. [11]. A detailed look at these SPI proteins reveals that SPI-1 and -2 encode the proteins of the type III secretion systems (T3SSs), while SPI-4 encodes those of a type I secretion system (T1SS) mediated by a giant non-fimbrial adhesin, which is co-regulated by the invasion genes encoded by SPI-1 [22]. The sit gene cluster proteins of SPI-1 T3SS, encoding an iron uptake system, are involved in the invasion of eukaryotic host non-phagocytic cells, mediated by the delivery of effectors that directly engage host cell signalling pathways [10]. For the systemic phase of infection, proteins of the SPI-2 cluster are essential for survival and replication in eukaryotic host cells [23], which are aided by the high-affinity magnesium uptake system encoded by mgtCB, harboured by SPI-3 [24]. The effector proteins of enteropathogenesis are harboured by SPI-5; they are induced by distinct regulatory cues and targeted to different T3SSs: SopB is secreted by the SPI-1 T3SS, whereas PipB is translocated by the SPI-2 T3SS to the Salmonella-containing vacuole and Salmonella-induced filaments.
The 59 kb SPI-6 consists of a type VI secretion system (T6SS), the safABCD fimbrial gene cluster, the invasin pagN, two pseudogenes that are transposase remnants (STY0343 and STY0344), the fimbrial operon tcfABCD and the genes tinR and tioA [25][26][27][28][29]. The largest SPI identified to date is SPI-7, with a size of 134 kb [25,30,31] and 150 genes inserted between duplicated pheU tRNA sequences [30,32]; it contains the Vi capsule biosynthesis genes [33], a type IVB pilus operon [34] and the SopE prophage (ST44) [35]. SPI-9 is a 16 kb locus containing three genes encoding a T1SS and one encoding a large protein [36]. SPI-10 is an island found next to the leuX tRNA gene at centisome 93. It is a 33 kb fragment [25] carrying a full P4-related prophage, termed ST46 [37][38][39]. ST46 harbours the prpZ cluster as cargo genes encoding eukaryotic-type Ser/Thr protein kinases and phosphatases involved in S. Typhi survival in macrophages [40]. SPI-11 is a 10 kb fragment in S. Typhi and includes the phoP-activated genes pagD and pagC, which are involved in intramacrophage survival [41,42]. The 6.3 kb SPI-12 contains the effector SspH2 [43] along with three ORFs that are pseudogenes (STY2466a, STY2468 and STY2469). SPI-13 was initially identified in serovar Gallinarum [44]. In S. Typhi, it is a 25 kb gene cluster found next to the pheV tRNA gene at centisome 67. An 8 kb portion of this island corresponds to SPI-8, whose virulence function is unknown, and it harbours two bacteriocin immunity proteins (STY3281 and STY3283) and four pseudogenes [25]. SPI-14 is absent in S. Typhi [36,44]. SPI-15 in S. Typhi is a 6.5 kb island of five ORFs encoding hypothetical proteins [44]. SPI-16 is a 4.5 kb fragment inserted next to an argU tRNA site and encodes five or seven open reading frames (ORFs), four of which are pseudogenes; the three remaining ORFs show a high level of identity with P22 phage genes involved in seroconversion [45].
SPI-17 is a 5-kb island encoding six ORFs inserted next to an argW tRNA site [45]. SPI-18 was recently identified in S. Typhi as a 2.3 kb fragment harbouring only two ORFs: STY1498 (clyA) and STY1499 [46] of which the former encodes a 34 kDa pore-forming secreted cytolysin [46,47].

The individual and the combined SPI-PINs
To focus on the most indispensable proteins of a highly complex virulent phenotype such as that of Salmonella, an integrated picture comprising all the SPI and their connected associated proteins must be taken into account. Thus, with the ultimate goal of identifying the indispensable virulent proteins as potential candidates for therapeutic targets, we have constructed the PINs or interactomes of the 17 individual SPI mentioned above, along with a combined network of all of these SPI-PINs (AS). These were then analysed to identify the most important proteins among the group with the highest number of interacting partners. This was done by utilizing the four important concepts of centrality applied to biological networks, namely eigenvector centrality (EC), degree centrality (DC), closeness centrality (CC) and betweenness centrality (BC) [48][49][50].
Amongst the four centrality measures mentioned above, DC is the most basic, as it brings out the involvement of a protein in a large number of interactions in a network. However, in a biological scenario such as Salmonella infection, whose primary stages are attachment and invasion, the interactions counted by DC need not occur in the sequential order required to carry out a particular function. In such cases, CC can be a good measure, as it reveals the close proximity of proteins expected to communicate sequentially with other network proteins essential for a particular function. Again, a one-to-many type of simultaneous interaction of a protein, rendering different functions, is to be expected given the complexity of a biological phenotype like virulence. Thus, a protein with a high proportion of interactions lying 'in between', and thereby connecting many other proteins in the network, is revealed through BC. Such a protein may be quite important, although BC does not capture whether it connects other important proteins. EC captures this last concept and reflects an indispensable protein that connects other important proteins. A comparative picture of the parametric values of the top 20 rank holders, in descending order, has been consolidated in tabular form (Table 1). The top rankers under each of these measures are the proteins indicated to be important.
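The four measures discussed above can be made concrete with a small, unweighted Python sketch. It is an illustration under simplifying assumptions rather than the CytoNCA computation used in this study: edge weights are ignored, the graph is assumed to be connected, betweenness uses Brandes' accumulation scheme, and eigenvector centrality uses power iteration on A + I (which has the same leading eigenvector as the adjacency matrix A). The toy graph is hypothetical.

```python
from collections import deque


def bfs(graph, source):
    """Distances, shortest-path counts, predecessors and visit order."""
    dist, sigma, preds, order = {source: 0}, {source: 1}, {source: []}, []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:  # first time w is reached
                dist[w], sigma[w], preds[w] = dist[v] + 1, 0, []
                queue.append(w)
            if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def degree_centrality(graph):
    return {v: len(graph[v]) for v in graph}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path distances from v."""
    out = {}
    for v in graph:
        dist = bfs(graph, v)[0]
        total = sum(dist.values())
        out[v] = (len(dist) - 1) / total if total else 0.0
    return out


def betweenness_centrality(graph):
    """Brandes' algorithm; each unordered node pair is counted once."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        _, sigma, preds, order = bfs(graph, s)
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}


def eigenvector_centrality(graph, iterations=200):
    """Power iteration on A + I, rescaled so the largest score is 1."""
    x = dict.fromkeys(graph, 1.0)
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in graph[v]) for v in graph}
        top = max(nxt.values())
        x = {v: val / top for v, val in nxt.items()}
    return x


# Hypothetical network: a triangle A-B-C with a tail C-D-E.
TOY = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D"}}

dc = degree_centrality(TOY)
cc = closeness_centrality(TOY)
bc = betweenness_centrality(TOY)
ec = eigenvector_centrality(TOY)
for name in sorted(TOY):
    print(name, dc[name], round(cc[name], 3), bc[name], round(ec[name], 3))
```

In this toy network D has the same degree as A and B, but it bridges E to the rest of the graph and therefore scores far higher on betweenness, which mirrors the distinction drawn between DC and BC above.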
Three clear trends were observed across the topmost rankers of the SPI-PINs for the measures DC, BC, CC and EC. In most cases, all four measures agree unanimously on the top-ranking protein, indicating its utmost importance, approaching indispensability. SPI-PINs of this category are -1, -3, -4, -5, -7, -8, -9, -10 to -13 and -15 to -17. In the other categories, either three or two of the centrality measures agree on the top-ranking protein. SPI-2, -18 and the all-SPI network (AS-PIN) have BC differing in the top-ranking position, whereas SPI-6 and -10 have DC and EC segregating from CC and BC for the top-ranking positions. The common top-ranking proteins across these 17 SPI and the AS are shown in Figure 1 with Venn diagrams.
In SPI-1, the protein HilA is ranked highest. HilA is the central regulator in SPI-1; it activates the sip operon, which encodes secreted proteins, as well as the inv/spa and prg operons, which encode components of the secretion apparatus [51,52]. SPI-2 to -4 have the secretion apparatus inner membrane proteins SsaG, FidL and STY4452 as their respective top rankers. Noteworthy among the other top rankers are the inositol phosphate phosphatase SopB of SPI-5; the atypical fimbria chaperone protein SafB and the ImpA-related N-family protein STY0286 of SPI-6; the pilin protein PilL of SPI-7; the bacteriocin immunity protein STY3281 of SPI-8; t2643 of SPI-9, a large repetitive protein with six Bacterial_Ig-like domains; the bacteriophage gene regulatory protein STY4826 of SPI-10; the cytolethal distending toxin protein CdtB of SPI-11; the uronate isomerase UxaC of SPI-13; and BarA of SPI-18, a sensory histidine kinase protein with a role in motility and virulence.
With respect to the above analyses of the individual interactomes of the SPI, an idea about the importance of these proteins in their individual SPI and finally across all SPI could be obtained. However, for a drug to be effective, the indispensability issue of these proteins needs to be taken care of. Thus, a broader picture with respect to the whole genome proteins of S. Typhi is then delineated to address the concern.

Feature of the WhoG-PIN
The WhoG-PIN, built from the empirical and theoretical results of physical and functional interactions among proteins deposited in STRING, could conceivably be a random network like that proposed by Erdős and Rényi [53] or a small-world network of the type proposed by Watts and Strogatz [54].
The idea was to see whether the connectivity distribution P(k), the probability that a node in the network is connected to k other nodes, decays exponentially for large values of k. It was observed that the WhoG-PIN roughly follows a power law and is free of a characteristic scale [55], with a tailed degree distribution (Figure 2).
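An empirical CCDF of the degree sequence can be computed as in the sketch below. It assumes the convention P(K >= k) and uses a made-up degree list; it is not the MATLAB routine used for Figure 2. On log-log axes an approximately straight CCDF is consistent with power-law-like behaviour, whereas an exponentially decaying distribution bends downwards.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Return sorted (k, P(K >= k)) pairs for a list of node degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # number of nodes with degree >= current k
    ccdf = []
    for k in sorted(counts):
        ccdf.append((k, remaining / n))
        remaining -= counts[k]
    return ccdf


# Hypothetical degree sequence: many low-degree nodes and one hub.
DEGREES = [1, 1, 1, 1, 2, 2, 3, 8]

for k, p in degree_ccdf(DEGREES):
    print(k, p)
```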

Decomposition of WhoG-PIN
In order to identify the indispensable ones among the barrage of proteins involved in the individual SPI-PINs and the AS, we have performed a k-core analysis. A k-core is a maximal subgraph in which every node has a degree of at least k within that subgraph. Nodes that are part of the k-core but not of the (k+1)-core form the k-shell. This classifies the nodes (proteins, in our study) based on the variety of their interacting partners. Proteins that belong to an outer shell have a lower k value and thus a limited number of interacting partner proteins. In contrast, proteins that belong to the inner k-core/shell are specific ones that interact highly with each other and can thus be considered the most important. Any further pruning of this core would dissolve the network entirely, which makes it the innermost core.
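The pruning procedure can be sketched in a few lines of Python. This is a simple quadratic-time illustration of k-core decomposition on a hypothetical toy network, not the implementation used for the WhoG-PIN: the node of smallest remaining degree is removed repeatedly, and the running maximum of those degrees gives each node's core number.

```python
def core_numbers(graph):
    """Map each node to the largest k for which it belongs to the k-core."""
    degree = {v: len(nbrs) for v, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])  # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in graph[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def k_shell(core, k):
    """Nodes in the k-core but not in the (k+1)-core."""
    return {v for v, c in core.items() if c == k}


# Hypothetical network: a 4-clique A-B-C-D, E attached to A and B, F attached to E.
TOY = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "B", "F"},
    "F": {"E"},
}

core = core_numbers(TOY)
innermost = k_shell(core, max(core.values()))
print(core)
print(sorted(innermost))
```

Here E has the same degree as C and D but sits in a lower shell, because one of its links goes to a peripheral node; this is why the core number can rank proteins differently from plain degree.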
After decomposition of the WhoG-PIN, we obtained the inner core member proteins, which are highly robust, central and thus highly interactive in nature [56]. We arrived at the 154th core, comprising 2180 proteins (Figure 3; data not shown). The idea was then to locate the top-ranked proteins of the AS-PIN obtained through the EC, DC, CC or BC measures. Interestingly, it was found that the top ranker across the EC, DC and CC measures, PilL, belongs to the 111th core and not the 154th core. On the contrary, the top-ranking BC protein, BarA, was in the 154th core, along with the closely ranked PilV in the 150th core. The only other protein amongst the unanimous top rankers of the AS-PIN, STY4521, was in the 145th core. Very strikingly, two other BC top rankers were also in the 154th innermost core along with BarA. These were the RNA polymerase sigma factor, RpoS, and the chaperone protein, SicA. Comparing the top-ranking proteins of EC and BC for the AS-PIN, proteins of the latter group had higher ranks in the whole-genome context, with STY4586, STY4644 and STY4664 all lying in the 154th innermost core. On the contrary, those from the former group (EC) mostly fell in cores 56-70. This indicates that the BC rankers were more important in their interaction with other proteins, forming bridges amongst them and thereby acquiring high betweenness.
In an earlier work by Lahiri et al., SicA was found to be in the group of innermost core of the interactome comprising the five most extensively worked out SPI of S. Typhimurium    can foresee that the needle proteins are quite important virulence factors when it comes to searching for drug targets. To top them all, SicA stands out as being one of the topmost rankers in the BC measure of the AS-PIN and in the innermost core of the WhoG-PIN. This is quite justified, as SicA is a Salmonella type III secretion-associated invasin chaperone protein required for the stabilization of SipB and SipC, preventing their premature association, which may lead to their targeting for degradation. Along with InvF, SicA is required for the transcriptional activation of several virulence genes like sigDE (sopB, pipC), sipBCDA and sopE [57].

Cartographic analyses of WhoG-PIN
To classify the proteins of S. Typhi CT18 based on their functional role and region in the network space, we have performed a cartographic analysis of the WhoG-PIN. As described earlier, this is delineated by the within-module z-score of each node (protein) and its participation coefficient within and between modules [20]. The within-module degree z-score measures how 'well connected' a node 'i' is to other nodes in its module, while the participation coefficient measures how the node 'i' is positioned in its own module and with respect to other modules. These measures are based on the modules of the network, which are calculated by the Rosvall method [21]. The proteins are divided into two major categories, namely the hub nodes and the non-hub nodes.
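The two cartographic coordinates can be computed as in the following Python sketch, which follows the usual definitions from the cited methodology [20]: z_i = (kappa_i - mean(kappa_s)) / sd(kappa_s), where kappa_i is the number of links from node i to its own module s, and P_i = 1 - sum over modules s of (k_is / k_i)^2. The module assignment is assumed to be given (in this study it would come from the Rosvall method); the toy network, the module labels and the use of the population standard deviation are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def cartography(graph, module):
    """Return {node: (z, P)} for an undirected graph and a node -> module map."""
    # kappa[i]: number of links from i to nodes of its own module.
    kappa = {i: sum(1 for j in graph[i] if module[j] == module[i]) for i in graph}

    # Mean and (population) standard deviation of kappa per module.
    by_module = {}
    for i, value in kappa.items():
        by_module.setdefault(module[i], []).append(value)
    stats = {}
    for s, values in by_module.items():
        mean = sum(values) / len(values)
        sd = sqrt(sum((x - mean) ** 2 for x in values) / len(values))
        stats[s] = (mean, sd)

    result = {}
    for i in graph:
        mean, sd = stats[module[i]]
        z = (kappa[i] - mean) / sd if sd > 0 else 0.0
        k_i = len(graph[i])
        links = Counter(module[j] for j in graph[i])  # links of i per module
        p = 1.0 - sum((c / k_i) ** 2 for c in links.values()) if k_i else 0.0
        result[i] = (z, p)
    return result


# Hypothetical network with two modules joined by the link C-D.
TOY = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
MODULE = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}

ROLES_INPUT = cartography(TOY, MODULE)
for name in sorted(ROLES_INPUT):
    z, p = ROLES_INPUT[name]
    print(name, round(z, 3), round(p, 3))
```

A node with all links inside its own module has P = 0, while a node whose links are spread evenly over many modules approaches P = 1.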
As the name suggests, a hub is a connection point for many nodes. The non-hub nodes can be assigned four different roles, namely R1 comprising ultra-peripheral nodes, R2 peripheral nodes, R3 non-hub connector nodes and R4 non-hub kinless nodes. Likewise, the hub nodes can be assigned three different roles, namely R5 comprising provincial hubs, R6 connector hubs and R7 kinless hubs (Figure 4). The kinless hub nodes, which have many connections both within their module and between modules, are supposed to be important in terms of functionality. Conversely, the ultra-peripheral nodes occupy the least connected positions in the network, followed by the peripheral nodes. These nodes can be pruned easily without much affecting the whole network while decomposing it to reach the core (see the k-core analysis in the previous section). The non-hub connectors are expected to take part in only a small but fundamental set of interactions. This is the opposite of the provincial hubs, which have many within-module connections. The non-hub kinless nodes are those with links homogeneously distributed among all modules. The most conserved in terms of decomposition as well as evolution would be, however, the connector hubs, with many links to most of the other modules. The system would try to retain these connections as essential for its very survival.
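The assignment of a role from a (z, P) pair can be sketched as a simple threshold lookup. The cut-off values below are those commonly quoted for the cartography methodology cited as [20] (hubs at z >= 2.5, with separate participation thresholds for hubs and non-hubs); the present text does not state which values were used for the WhoG-PIN, so they should be read as assumptions.

```python
def assign_role(z, p, hub_z=2.5):
    """Map a within-module z-score and participation coefficient to R1-R7."""
    if z < hub_z:  # non-hub nodes
        if p <= 0.05:
            return "R1"  # ultra-peripheral
        if p <= 0.62:
            return "R2"  # peripheral
        if p <= 0.80:
            return "R3"  # non-hub connector
        return "R4"      # non-hub kinless
    # hub nodes
    if p <= 0.30:
        return "R5"      # provincial hub
    if p <= 0.75:
        return "R6"      # connector hub
    return "R7"          # kinless hub


# Hypothetical (z, P) pairs, one per role.
EXAMPLES = [(0.0, 0.0), (1.0, 0.5), (1.0, 0.7), (0.0, 0.9),
            (3.0, 0.1), (3.0, 0.5), (3.0, 0.8)]
print([assign_role(z, p) for z, p in EXAMPLES])
```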
As can be perceived from the above classification of the connectors and the hubs, the proteins playing the R4, R6 and R7 roles are crucial and can be regarded as potential drug targets. In our WhoG-PIN, the only R7 protein is a putative transposase, STY0115, which is reminiscent of the Tn5 transposase, an enzyme that helps bacteria to share antibiotic resistance genes [58,59]. This is closely followed by the plasmid transfer protein, TrhC, in the R6 group. This could very well be a good drug target, as plasmids are known to be a powerhouse of antibiotic resistance genes [60]. Uncoupling of the phosphotransferase system could also be an effective way of obtaining targets for novel drugs, as exemplified by PtsG, TreB, NagE and t0287 [61]. Inhibition of glutamate synthase, GltB, has already been utilized against Mycobacterium tuberculosis [62], as has that of uroporphyrinogen decarboxylase, HemE, albeit in a different context [63]. Recently, bacterial GCN5-related N-acetyltransferases of the R4 group have been considered essential drug targets as well [64]. All the functions of R7, R6 and R4 are listed in Table 2.

Conclusion
This work schematically delineates a process for identifying the most indispensable protein in a system of interacting proteins of S. Typhi. It sets out a computational framework for building theoretical networks comprising the 17 individual SPI-PINs along with the AS-PIN, followed by the conventional parametric approach of identifying the most highly interacting protein connected to other important proteins in the phenotype of virulence. This is reinforced by decomposing the WhoG-PIN to its innermost core of proteins essential for virulence. Together, these network centrality and decomposition analyses lead to the identification of SicA as the most indispensable protein among a group of other virulent proteins. A further investigation of the WhoG-PIN brought forth proteins of an important conserved class, which are potentially the most important and thus indispensable among the barrage of other proteins of the whole genome of S. Typhi CT18.