Systemic Approach to the Genome Integration Process of Human Lentivirus



Introduction
The human genome is one of the most complex molecular structures found in nature. Its extraordinary information content has revealed a surprising mosaicism between coding and non-coding sequences [1][2][3][4]. This highly regionalized structure introduces complex patterns for understanding gene structure and repetitive DNA sequence composition, providing a new scenario for studying biological processes such as the integration of lentiviral cDNA into the host genome. In the field of genome analysis, bioinformatics provides the key connection between the different forms of data gathered by new high-throughput techniques such as systematic sequencing, expression arrays and high-throughput screening, among others. Although the success of bioinformatics in genome analysis is undeniable, in some cases it has complicated the relationship between computation and experimental biology. There is a need to attend to the pressing demand for bioinformatics applications without forgetting other, perhaps less evident but equally important, aspects of computation in biology.
The study of particular systems is the source of inspiration that guides the formation of general ideas, from specific cases to general principles. The systemic approach therefore extends towards the study of fundamental biological questions, such as gene assembly, protein folding and the nature of functional specificity. Such issues extend beyond the current perception of bioinformatics as a support discipline and address aspects of biological complexity, including the simulation of molecular interaction networks.

An overview of the human genome
The coding regions of the genome are defined, in part, by an alternative series of motifs responsible for a variety of functions that take place on DNA and RNA sequences, such as gene regulation, RNA transcription, RNA splicing and DNA methylation. For example, sequencing of the human genome revealed a controversial number of interrupted genes (25,000-32,000) which, with their regulatory sequences [1,2], represent about 2% of the genome. These genes are immersed in a giant sea of different types of non-coding sequences, which make up around 98% of the genome. The non-coding regions are characterized by many kinds of repetitive DNA sequences; almost 10.6% of the human genome consists of Alu sequences, a type of SINE (short interspersed element) [3].
Alu elements are not randomly distributed throughout the genome but are biased toward gene-rich regions [5]. They can act as insertional mutagens, although the vast majority appear to be genetically inert [6]. LINEs, MIR, MER, LTRs, DNA transposons and introns are other kinds of non-coding sequences, which together make up about 86% of the genome. In addition, some of these sequences overlap one another (for example, CpG islands, CGIs), which complicates analysis of the genomic landscape. In turn, each chromosome is characterized by particular properties of structure and function.

Human lentiviral integration
The two closely related human lentiviruses, HIV-1 and HIV-2, are responsible for the AIDS pandemic of the 21st century [7][8][9]. Most current therapeutic approaches use combinations of antiviral drugs that inhibit the activities of viral enzymes such as reverse transcriptase, protease and integrase; nevertheless, none of them has succeeded in controlling the infection [10][11][12]. One option to overcome this problem is to explore new therapies informed by the integration dynamics of human lentiviruses, because this would help to explain the alterations of cellular homeostasis that occur when a cell is infected [13]. Additionally, analysis of the integration process is important both in HIV-induced disease and in lentivirus-based gene therapy [14].
Integration is a crucial step in the retroviral life cycle, permitting the incorporation of viral cDNA into the host genome [15][16][17]. cDNA integration is mediated by the virally encoded integrase enzyme and other viral and cellular proteins in a molecular complex called the preintegration complex (PIC) [18]. One cellular factor involved in HIV targeting is the lens epithelium-derived growth factor (LEDGF) [19,20], which binds to both HIV-1 integrase and chromatin, tethering the viral integration machinery to chromatin [21]. HIV-1 integration has been extensively studied using a wide array of molecular biology, biochemistry and structural biology approaches [22]. However, it is critical to identify directly the distribution of proviruses within the human genome in order to understand, at the genomic level, the relationship between the composition and topology of chromatin and target site selection.
As shown by previous studies, target site selection for integration is not entirely random [23][24][25][26]; there are pronounced favored and disfavored chromosomal regions, which differ among retroviruses [27]. The favored regions of host genomes are characterized by a high frequency of integration events, are known as "hotspots", and are distributed along the genome of the host cell [28,29]. In HIV-1, most proviruses are located in transcriptionally active regions, not only in exons and introns but also in sequences around transcription start sites [30,31].
A related study by Felice et al. (2009) [32] compared the chromosomal integration patterns of a gammaretrovirus (Moloney leukemia virus, MLV) and a lentivirus (human immunodeficiency virus type 1, HIV-1), finding that gammaretroviral, but not lentiviral, vectors integrate in genomic regions enriched in cell-type-specific subsets of transcription factor binding sites (TFBSs), independently of their position relative to genes and transcription start sites. It was therefore proposed that TFBSs could be differential genomic determinants of retroviral target site selection in the human genome.
Several in vitro and in vivo studies have shown that HIV-1 integrates predominantly in active transcription units and in genomic zones with high gene density, a high frequency of Alu elements, a low content of CpG islands and open chromatin [33]. Notwithstanding this evidence, the particular characteristics of local chromatin that facilitate integration at a genome-wide scale remain to be elucidated.
The objective of this chapter is to present the main results that our research group has obtained from statistically testing the genomic variables that define a preferred genomic environment for human lentiviral integration and from localizing them to specific chromosomal loci. We also describe the construction of gene/protein interaction networks among the cellular genes located around several lentivirus integration sites in naturally infected humans, as a systemic approach to better understand the lentiviral integration process.
To test our hypothesis we conducted in silico studies of the integration profile in the genomic DNA of peripheral blood mononuclear cells (PBMCs) and macrophages for both human lentiviruses (HIV-1 and HIV-2), using an analysis window of 100 kb. The statistical analyses included several genomic variables, such as the chromosomal locus, the numbers of CpG islands, protein-coding genes and transcripts, and the distribution of SINEs, LINEs, LTRs and other repeats. We also explored genomic regions in which epigenetic mechanisms could be associated with the integration process. Together, the results allow us to propose common genomic environments that favor the target chromatin zones for both human lentiviruses.

Data mining and statistical analyses
A total of 352 human genome sequences flanking the 5' LTR of human lentiviruses (176 sequences for HIV-1 [27] and 176 for HIV-2 [33]) were obtained from GenBank (NCBI) under accession numbers CL529260 to CL529766 (HIV-1) and DQ632388 to DQ632563 (HIV-2). Using the BLAST algorithm (NCBI; http://blast.ncbi.nlm.nih.gov/Blast.cgi), the sequences were aligned to the draft human genome (hg18), and those that met the following criteria were considered authentic integration sites: (i) contained the terminal 3' end of the HIV-1 or HIV-2 LTR; (ii) had matching genomic DNA within five bp of the end of the viral LTR; (iii) had at least 95% homology to human genomic sequence across the entire sequenced region; (iv) matched a single human genetic locus with at least 95% homology across the entire sequenced region; and (v) had a minimum size of 50 bp.
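As an illustration only, the five acceptance criteria can be expressed as a simple filter. The sketch below assumes that each alignment has already been summarized into a small record; the `Alignment` class, its field names and `is_authentic_site` are hypothetical and are not part of BLAST or of the original analysis.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one flanking sequence aligned to the genome."""
    has_ltr_end: bool         # (i) read contains the terminal 3' end of the LTR
    offset_from_ltr_bp: int   # (ii) bp between the LTR end and the genomic match
    percent_identity: float   # (iii) identity across the entire sequenced region
    n_loci_at_95: int         # (iv) number of loci matched at >= 95% identity
    length_bp: int            # (v) length of the flanking sequence


def is_authentic_site(a: Alignment) -> bool:
    """Apply criteria (i) to (v) from the text; all must hold."""
    return (
        a.has_ltr_end
        and 0 <= a.offset_from_ltr_bp <= 5
        and a.percent_identity >= 95.0
        and a.n_loci_at_95 == 1
        and a.length_bp >= 50
    )
```

A record such as `Alignment(True, 2, 97.5, 1, 80)` passes the filter, whereas the same record matching two loci would be rejected under criterion (iv).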
BLAST (NCBI) and the BLAT algorithm of the Genome Browser (University of California, Santa Cruz, Human Genome Project) (http://www.genome.ucsc.edu/) were used to obtain information about protein-coding genes (RefSeq), transcripts, CpG islands and repetitive elements. Additional genomic information, including molecular process and molecular function, was obtained from Gene Ontology (GO) (http://www.geneontology.org/index.shtml), GeneCards (http://www.genecards.org/cgi-bin/carddisp.pl) and Entrez Gene (http://www.ncbi.nlm.nih.gov/ncbi/geneentrez). The chromosomal localization of the HIV-1 and HIV-2 proviruses was identified using the G-banding pattern of each chromosome, as proposed by the Paris Conference (1971) [35], updated to 850-band resolution.
As the highest number of HIV-1 and HIV-2 proviruses was recorded on chromosome 17, an extensive characterization of its chromatin structure was performed using the genomic information available in several tracks of the Genome Browser: CpG islands and the distribution of their methylation; methylation of histone H3 at lysines 4 and 27, from the ENCODE histone modification ChIP-seq data of the University of Washington; nucleosome occupancy probabilities from A375 cells (Washington University); and DNase I hypersensitivity (ENCODE, University of Washington) in GM12878 cells.
All statistical analyses were performed using STATISTICA 7 [35]. The Mann-Whitney test (Wilcoxon rank-sum) was used to establish differences between HIV-1 and HIV-2 chromosomal integration. Differences in function, molecular process and cell localization were analyzed using the t-test for independent samples. The Kolmogorov-Smirnov test was used to assess the normality of the data. To avoid an erroneous significance level in multiple comparisons, a Bonferroni correction was applied. Multiple regression analyses were performed to assess the association among the numbers of CpG islands, the numbers of genes and the numbers of integrations.
CpG island numbers and genes per Mb for each chromosome were obtained from the NCBI and Ensembl databases (2010 update).
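The analyses in this chapter were run in STATISTICA, so the following is only a minimal sketch of the rank statistic behind the Mann-Whitney comparison of two samples (for instance, per-chromosome integration counts for HIV-1 and HIV-2). The function name and the example values are hypothetical; the p-value step is omitted.

```python
def mann_whitney_u(x, y):
    """Return the Mann-Whitney U statistic for sample x.

    Observations from both samples are pooled and ranked; tied values
    receive the average of the ranks they span.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # extend j to the last element of the current group of ties
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2


# Hypothetical counts, for illustration only
u_stat = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

A U value near zero (or near the product of the two sample sizes) indicates that one sample tends to rank below (or above) the other; values near half that product indicate no separation.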

Patterns of provirus distribution
No significant differences were observed between the lymphocyte integration profiles of HIV-1 and HIV-2 (p>0.05, Mann-Whitney test). Integration events for both human lentiviruses were recorded in all chromosomes except the Y chromosome (figure 1). However, significant differences between the numbers of HIV-1 and HIV-2 proviruses were observed for chromosomes 4, 8, 9, 11 and 16 (p<0.05, χ² test). The largest share of integrations (39/352) occurred in chromosome 17 (figure 1). Proviruses tended to be distributed towards the telomeric and subtelomeric regions of most human chromosomes. Consistent with this, other authors have shown that centromeric alphoid repeat regions are disfavored as integration sites [36]. Although proviruses were widely distributed, we identified some chromatin regions with only HIV-1 integrations in chromosomes 4, 6 and 9, and with only HIV-2 integrations in chromosome 21.
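One plausible form of the per-chromosome χ² comparison, given that both viruses contribute the same number of sites (176 each), is a goodness-of-fit test with one degree of freedom against equal expected counts. The sketch below assumes that form; the function name and the counts used in the example are hypothetical and do not come from the data set.

```python
from math import erfc, sqrt


def chi2_two_counts(n_a, n_b):
    """Chi-square goodness-of-fit (1 df) for two counts with equal expectation.

    Returns the statistic and its p-value. For one degree of freedom the
    upper-tail probability is erfc(sqrt(chi2 / 2)).
    """
    total = n_a + n_b
    if total == 0:
        raise ValueError("at least one integration is required")
    expected = total / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of HIV-1 and HIV-2 proviruses on one chromosome
chi2, p_value = chi2_two_counts(12, 3)
```

With small expected counts the χ² approximation is unreliable, so an exact test would be a safer choice for sparsely hit chromosomes.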

Functional characterization of genes flanking integration sites
The ontology of the genes hosting HIV integration events was analyzed using Gene Ontology (GO). Overall, 83% (146/176) of HIV-1 and 77% (135/176) of HIV-2 integrations occurred close to chromatin regions containing protein-coding genes (p>0.05, Student's t-test). Within the 100 kb stretches of chromatin that harbored HIV-1 and HIV-2 proviruses, no differences were observed in the functional categories of genes (p>0.05, Bonferroni correction). According to molecular function, 46% of HIV-1 and 57% of HIV-2 integrations were associated with molecular binding, while 19% and 18%, respectively, occurred in regions containing genes associated with enzymatic function (figure 2a). An exploration of biological process revealed preferential integration in a collection of genes involved in metabolism and gene expression for both HIV-1 (36%) and HIV-2 (37%) (p>0.05, Bonferroni correction) (figure 2b).

Distribution of the repetitive elements flanking integration sites
A low number of repetitive elements, including SINEs, LINEs and LTRs, was identified in association with proviruses within 100 kb of flanking host chromatin. In general, there were no differences in the distribution of repetitive element categories (SINEs, LINEs and LTRs) between HIV-1 and HIV-2 integrations (p>0.05, χ² test). Our results showed that both lentiviruses integrated preferentially close to Alu elements, which are SINEs. Within LINEs, differences among L1, L2 and L3 were recorded. The other classes of repetitive elements, such as LTRs, simple repeats and low-complexity sequences, represented a minor proportion of the integration-associated chromatin (figure 3).
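The window-based counting of flanking elements can be sketched as follows. This is a minimal illustration that assumes a 100 kb window centred on the integration site (the text does not state whether the window is centred or extends 100 kb to each side) and assumes that all features lie on the same chromosome as the site; the function name and the feature tuples are hypothetical.

```python
from collections import Counter


def features_in_window(site_pos, features, window_bp=100_000):
    """Count annotated features, by class, that overlap a window around a site.

    features: iterable of (start, end, feature_class) tuples on the same
    chromosome as the integration site (assumed layout).
    """
    half = window_bp // 2
    win_start, win_end = site_pos - half, site_pos + half
    counts = Counter()
    for start, end, feature_class in features:
        # two intervals overlap when each one starts before the other ends
        if start <= win_end and end >= win_start:
            counts[feature_class] += 1
    return counts


# Hypothetical annotations around a site at position 50,000
example_features = [
    (1_000, 1_300, "SINE"),
    (40_000, 46_000, "LINE"),
    (99_900, 100_200, "SINE"),
    (200_000, 200_300, "LTR"),
]
example_counts = features_in_window(50_000, example_features)
```

The same routine applies to genes, transcripts or CpG islands by changing the feature classes supplied.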

Definition of the common genomic environment of integrations
Because integration does not follow a random model [23][24][25], the characteristics of the chromatin in regions with a high level of provirus integration support the hypothesis that preferential integration is conditioned by structural and functional states of local chromatin. These states are defined by several genomic variables, which were studied in this work and which together would define genomic environments.
The multiple-regression analysis conducted on the HIV-1 and HIV-2 data sets showed differential distributions of CpG islands, genes and Alu elements that together conditioned a specific genomic environment per chromosome (R² = 0.91, p<0.05). Gene density was the independent variable that contributed most to the prediction of the dependent variable (integrations), having the highest regression coefficient (B = 0.83; p<0.05). The highest relative likelihood of hosting a lentiviral integration event in the human genome was registered for chromosome 17 (figure 4a). To test whether integration events are favored by gene-rich regions in all chromosomes, the two variables were compared; the comparison indicated that a high gene density in chromatin regions determines a favorable environment for integration, even when chromosome 17 is excluded (figure 4b).
Experimental studies have demonstrated that regulatory regions in general, and promoters in particular, tend to be DNase sensitive and are targets for the integration of the majority of retroviruses [37,38]. In 2006, the complete nucleotide sequence of chromosome 17 was published [39]. This chromosome is rich in protein-coding genes, having the second highest gene density in the genome (16.2 genes per Mb), with a relative excess of short interspersed elements (SINEs, 22.3%) and a deficit of long interspersed elements (LINEs, 14.4%). Likewise, this chromosome has a high average CpG content (45.5%) and a high euchromatin density [39] (figure 6). Our statistical analysis determined that chromosome 17 had the highest number of integrations, mainly concentrated towards the telomeres of both arms. A previous study [40] found that histone and integrase (IN) acetylation may favor integration by tethering the virus to acetylated, decondensed regions of the chromatin.
We conclude that the structural characteristics and the epigenetic modifications observed in regions with a high frequency of viral cDNA integration would synergistically configure a local "genomic environment" that facilitates target site selection during retroviral integration.
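For reference, the multiple-regression model discussed in this section can be written out explicitly. The notation is an assumption introduced here for illustration: for chromosome c, I_c is the number of integrations, G_c the gene density, C_c the number of CpG islands and A_c the number of Alu elements.

```latex
I_c = \beta_0 + \beta_1 G_c + \beta_2 C_c + \beta_3 A_c + \varepsilon_c,
\qquad
R^2 = 1 - \frac{\sum_c \left(I_c - \hat{I}_c\right)^2}{\sum_c \left(I_c - \bar{I}\right)^2}
```

In this notation the reported B = 0.83 would correspond to the coefficient on gene density, \beta_1, and R² = 0.91 to the fraction of the between-chromosome variance in integrations explained by the three predictors together.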

Construction of HIV-1 gene/protein networks
Host-virus interactions represent a complex level of systems information that permits a thorough understanding of how the virus exploits the host cell and uses the cellular machinery to integrate into the host genome. Recently, the HIV-1 Human Protein Interaction Database (HHPID) registered 3959 interactions among 1452 human proteins and nineteen HIV proteins (fifteen of them structural and four intermediate proteins) [41]. Previous studies have found that most human cell pathways are disturbed by at least one interaction with an HIV-1 protein during the virus life cycle [42][43][44]. These interactions are of two types: direct, between host cell proteins and viral proteins, or indirect, such as regulatory interactions that alter the expression of human genes [45,46]. The cc-cytokine signaling network is both disrupted and exploited by HIV at various stages of infection. In addition, 22 candidate human class E proteins were connected into a coherent network by 43 different protein-protein interactions, in which AIP1 plays a key role in linking complexes that act early (TSG101/ESCRT-I) and late (CHMP4/ESCRT-III) in the HIV infection pathways [47,49].
Monocyte/macrophage infection is characterized by viral dynamics substantially different from those of T lymphocytes. In vivo, HIV infection of activated CD4 T lymphocytes accounts for the majority of the daily production of virus particles. However, a large number of lymphocytes are in a resting state, unable to sustain a complete and productive virus life cycle, and contribute only minimally to daily virus production [50][51][52]. Because of the limited HIV-induced cytopathic effect and their ability to accumulate high levels of HIV particles in intracellular compartments, HIV-infected macrophages serve as a potentially important reservoir and as "Trojan horses" exploited by the virus to favor its dissemination to different tissues [53,54].
Cytoscape v.2.63 [55] was used to construct a gene expression network from two kinds of files. The first were text files (.pvals) of gene expression profiles imported from microarray expression experiments (GEO profiles, NCBI). The second were annotation text files (.sif) describing each gene-gene interaction (from online databases). For the first, gene expression values were collected from the microarray data series GSE19236, which comprises two Agilent platforms (GPL6480 and GPL6848) and 48 samples of monocyte-to-macrophage cells, macrophages and dendritic cells. These data are available from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository (accession number GEO: GSE19236); for our analysis we selected all macrophage expression samples (GSM476720, GSM476721, GSM476722, GSM476723, GSM476724, GSM476725). An ANOVA test was used to identify genes whose expression differed significantly among the microarray samples, considering a p-value < 0.001 as significant. Additionally, a hierarchical clustering analysis of the samples was performed using the Euclidean distance method and average linkage. MultiExperiment Viewer v4.1 [56] was used for the corresponding statistical analyses.
Using data from BOND (Biomolecular Network Data Bank, http://bond.unleashedinformatics.com/Action), BioGRID (Biological General Repository for Interaction Datasets, http://thebiogrid.org/) and KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/), all available online, a new file was constructed with the interaction data of 28 genes located close to integration sites. Cytoscape v2.6 was used to visualize and analyze the genetic interaction networks among these 28 human macrophage genes and their interactions. The BiNGO v2.6 plug-in (Biological Networks Gene Ontology tool) was used to determine which Gene Ontology (GO) terms were significantly overrepresented in a set of genes.
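To make the role of the .sif interaction files concrete, the sketch below reads lines in the simple interaction format (a source node, an interaction type and one or more target nodes) into an undirected adjacency map and counts neighbors per node. It is a simplified reader written for illustration: it splits on any whitespace, so node names containing spaces are not supported, and it ignores self-loops. The function names and the placeholder node names are hypothetical.

```python
def parse_sif(text):
    """Parse SIF lines ('source type target [target ...]') into an adjacency map."""
    neighbors = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source, targets = fields[0], fields[2:]
        neighbors.setdefault(source, set())  # keeps isolated nodes as well
        for target in targets:
            if target == source:
                continue  # self-loops are ignored in this sketch
            neighbors[source].add(target)
            neighbors.setdefault(target, set()).add(source)
    return neighbors


def node_degrees(neighbors):
    """Number of distinct neighbors of each node."""
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


# Placeholder network, not taken from the chapter's data
example_sif = "A pp B C\nB pp D\nE\n"
example_net = parse_sif(example_sif)
example_degrees = node_degrees(example_net)
```

Ranking `example_degrees` by value gives the kind of per-gene interaction count used to identify the most connected nodes of a network.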
A hypergeometric test was applied to determine which categories were significantly represented (p-value < 0.01); the significance value was adjusted for multiple hypothesis testing using the Bonferroni family-wise error rate correction [57]. The network topology parameters were calculated using the Network Analyzer plug-in (Max Planck Institute for Informatics), which reports the network diameter, the number of connected pairs of nodes and the average number of neighbors; it also analyzes node degrees, shortest paths, clustering coefficients and topological coefficients.
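The enrichment test named above can be illustrated with a short sketch of the hypergeometric upper-tail probability followed by a Bonferroni adjustment. This is not the plug-in's own code; the function names, argument names and example numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` genes are sampled without replacement
    from `population` genes, `annotated` of which carry the GO term."""
    upper = min(annotated, drawn)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favorable / comb(population, drawn)


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy numbers: 10 genes in total, 5 annotated, 5 selected, all 5 annotated
p_toy = hypergeom_enrichment_p(10, 5, 5, 5)
adjusted_toy = bonferroni_adjust([0.004, 0.5])
```

A category would then be called significant when its adjusted p-value falls below the chosen threshold (0.01 in the text).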
To identify active sub-networks as highly connected regions of the main network, we used the jActiveModules plug-in, which groups genes according to the significance (p-values) of their expression over particular subsets of samples. The result is a list of active modules, ordered by number of nodes, each with an associated Z-score. An active module with a Z-score greater than 3.0 indicates a significant response to the conditions of the experiment. We kept the default parameter values, as these are the most effective for initial analyses [58].
A total of 11,713 significant genes out of 41,000 probes were clustered into two significantly different groups of cells: one included only dendritic cells, whereas the second grouped monocyte-to-macrophage cells and macrophages, which share similar gene expression patterns. A total of 2,770 interactions among the 28 genes located close to HIV-1 proviruses in human macrophages were recorded. AKT3 was the gene with the highest number of interactions (456), followed by FLT1 (381), STAT5A (356) and AXIN1 (328) (figure 6a). In contrast, ZNF36, DYRK1A and RBMS3 had the lowest numbers of interactions. The normal macrophage gene network showed three components: a main cluster composed of 26 macrophage genes and their interactions, and two minor clusters, one with ZNF36 as the central node with five interactions and the other with STX1A as the central node with twelve interactions (figure 6a).
To further identify active sub-networks inside the main gene network, we performed an expression clustering analysis using p-values calculated by comparing gene expression levels among five macrophage samples. We found five sub-networks, of which the most significant active module comprised 222 genes with a score of 3.15 (p<0.01). Within it, 12 genes related to provirus integration sites were found: AXIN1, NFAT5, STAT5A, FLT1, AKT3, HTT, RIPK2, DGCR8, WWOX, NRG1, DYRK1A and SLC2A14 (figure 6b).
The significant GO functional categories in this active module showed enrichment for positive regulation of biological process and cell proliferation. Most of the genes identified in this sub-network were associated with cellular pathways that play a significant role in modulating cell signaling networks, including Wnt signaling, MAPK signaling and ErbB signaling.
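The module Z-scores mentioned in this section can be illustrated with a minimal sketch. It assumes the commonly described scoring scheme for active modules, in which each gene's expression p-value is converted to a z-score with the inverse normal distribution and the z-scores of a module are summed and divided by the square root of the module size. The calibration of this raw score against randomly sampled gene sets is omitted, so the values are not directly comparable to the plug-in's output; the function name and p-values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def aggregate_z(p_values):
    """Raw aggregate z-score of a candidate module.

    Each p-value p (0 < p < 1) is mapped to z = inverse_normal_cdf(1 - p);
    the module score is the sum of z-scores divided by sqrt(module size).
    """
    if not p_values:
        raise ValueError("a module needs at least one gene")
    standard_normal = NormalDist()
    z_scores = [standard_normal.inv_cdf(1.0 - p) for p in p_values]
    return sum(z_scores) / sqrt(len(z_scores))


# Hypothetical module of four genes, each with p = 0.001
score_toy = aggregate_z([0.001] * 4)
```

Under this scheme a module made of uniformly significant genes scores higher as it grows, which is why a size-dependent calibration is normally applied before comparing a score with a threshold such as 3.0.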

Effects on normal macrophage gene networks by HIV-1 integration
To better understand how HIV-1 integration alters macrophage homeostasis, our analyses focused on simulating the effects of viral cDNA integration on several gene expression networks in human macrophages. In general, the topology of the non-infected macrophage gene network was dramatically changed by the HIV-1 integration events, which turned off the expression of five genes through the integration of proviral cDNA (figure 7).
The evaluation of several topological parameters, such as the clustering coefficient, shortest paths, network heterogeneity, centralization, average number of neighbors and characteristic path length, showed changes in the values of the HIV-1 infected macrophage gene network compared with the normal macrophage network. The non-altered network was more condensed, had a larger number of interactions, was wide open and rich in shortest paths, and was composed of one major component and two minor clusters, being more heterogeneous and multi-functional (table 1).
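The kind of comparison described above, between a network and the same network with some genes switched off, can be sketched on a toy graph. The code below is an illustration under stated assumptions: the network is an undirected adjacency map, a silenced gene is modeled by deleting its node and edges, and only two of the topological parameters are computed. Function names and the toy graph are hypothetical.

```python
def remove_nodes(neighbors, silenced):
    """Copy of the adjacency map without the silenced nodes and their edges."""
    silenced = set(silenced)
    return {
        node: {n for n in adjacent if n not in silenced}
        for node, adjacent in neighbors.items()
        if node not in silenced
    }


def average_neighbors(neighbors):
    """Mean number of neighbors per node (0.0 for an empty network)."""
    if not neighbors:
        return 0.0
    return sum(len(adjacent) for adjacent in neighbors.values()) / len(neighbors)


def clustering_coefficient(neighbors, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    adjacent = neighbors[node]
    k = len(adjacent)
    if k < 2:
        return 0.0
    links = sum(1 for a in adjacent for b in adjacent
                if a < b and b in neighbors[a])
    return 2 * links / (k * (k - 1))


# Toy network: a triangle A-B-C with an extra node D attached to A
toy = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
toy_silenced = remove_nodes(toy, ["C"])
```

On this toy graph, removing node C lowers both the average number of neighbors and the clustering coefficient of node A, which is the direction of change reported for the infected network.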
Statistical differences between the topological states of the two networks were registered for the topological coefficients, closeness centrality and neighborhood connectivity distribution (Kolmogorov-Smirnov test, p<0.05), but not for the average clustering coefficient distribution. These results indicate that the normal network was significantly more central and more densely connected than the HIV-1 infected macrophage network. Using the Random network plug-in for Cytoscape, we found that the clustering coefficients of the non-infected network and the simulated infected network showed no statistical differences from those of randomly generated networks (Kruskal-Wallis test, p = 0.317). We interpret these data as support for the topology of both reported networks and for the validity of our gene network simulation.
We tested our hypothesis that HIV-1 integration disturbs gene expression, with a global effect on cellular networks and essential biological pathways. The enriched GO terms were categorized for the normal and infected macrophage networks to identify the functional cellular changes caused by HIV-1 integration. From all the GO categories covered by the 28 macrophage genes and their interactions, the ten most significant categories of enriched GO terms are listed in table 2.
The Gene Ontology (GO) enrichment analysis showed that the normal network was composed of 423 significant functional categories out of a total of 1190. These individual significant categories could be further classified into two major groups: regulation of cell function and signaling of biological processes. In contrast, the HIV-1 infected macrophage gene network was enriched in 10 significant functional categories out of a total of 40. The significantly overrepresented categories indicated that this emergent gene network was composed of genes involved in metabolic processes and DNA repair.
In this study we simulated, at the systemic level, the alteration of cellular pathways when the HIV provirus integrates into genes, turning them off and producing dysregulation of several local signaling pathways. One of the target genes associated with HIV-1 integration was AKT3, also called PKB, a member of the serine/threonine protein kinase family. It is involved in a wide range of biological processes, including cell proliferation, differentiation, apoptosis, stimulation of cell growth and regulation of other biological responses [59,60]. It has also been identified as playing an important regulatory role in the G2/M transition of the cell cycle.

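Table 2. Ten most significant enriched GO categories in the normal and HIV-1 infected macrophage networks (columns: Normal Network, p-value, HIV-1 infected Network, p-value).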
AKT3 signaling via JNK interacts with NFTA and Jun, which are targets in the HIV-1 macrophage integration network and are included in the mitogen-activated protein kinase (MAPK) cascade, which performs essential functions such as proliferation, survival, inflammation and apoptosis in all cell types. This pathway is associated with others, including the phosphatidylinositol signaling system, the Wnt signaling pathway, the ERK5 pathway and the p53 signaling pathway [61][62][63]. In accordance with these previous data, we propose that when AKT3 is turned off by HIV-1 integration, the cross talk with other pathways is disrupted, leading to a signaling dysfunction of metabolism-associated pathways. When AKT3 is inactive, the loss of its direct interaction with MKK7 disrupts JNK and subsequently JUN, which would result in a failure to activate, by phosphorylation, apoptotic and cell cycle processes. On the other hand, inactivation of the MAPK pathway in both macrophages and dendritic cells leads to inhibition of proinflammatory cytokine secretion, downregulation of co-stimulatory molecules such as CD80 and CD86, and ineffective T cell priming. The net result is an impaired innate and adaptive immune response [64,65].
Recently it has been reported that HIV-1 infection triggers the activation of the PI3K/Akt cell survival pathway in primary human macrophages, as reflected by decreased PTEN protein expression and increased Akt kinase activity, and renders these cells resistant to cytotoxic insults [54,61,64,65]. As a result of HIV-1 integration close to AKT3, PTEN, AKT1 and AKT2, FOXO1 and MDM2, which are included in the macrophage gene network, a disruption of the apoptotic process would be expected.

Conclusions
We can conclude that a general effect of HIV-1 integration into macrophage DNA is the disruption of several signaling pathways that control normal cell homeostasis. Comparison of the top 10 GO functional categories between normal and infected macrophages showed a dramatic change from a non-infected macrophage, whose main cellular functions are devoted to maintaining crucial cell signaling functions, to an infected one, in which the most important functions are macromolecular biosynthetic process, maintenance of fidelity during DNA-dependent DNA replication, mismatch repair, age-dependent response to reactive oxygen species during chronological cell aging, and oxidation reduction. As the HIV-infected macrophage is an abnormal reservoir in which the metabolic cascades are altered, it is possible to propose that the metabolism of the macrophage adapts to perform survival functions, in which the apoptotic process is interrupted and an SOS metabolism makes the macrophage change its lifestyle.
In silico studies are based upon statistical calculations that permit generalizations to be drawn about a biological process. However, since some variables could affect the process as a whole, in order to obtain a realistic account of lentivirus integration it is important to consider that other factors, including physiological processes and cellular compartments, may influence integration site selection in vivo. Some of these are the cell-cycle phase, the transcriptional state of the cell, the topology of chromosomal DNA, the cell type infected and the presence of co-helper molecules during the formation of the PIC.
By providing these new testable hypotheses, we hope that our results will accelerate experimental efforts to define reliably how lentivirus integration disturbs the complex relationships among genes in PBMCs and macrophages, which are critical immune cells responsible for a wide range of immune functions and play multifaceted roles in HIV pathogenesis.

Author details
Felipe