The functional domains of the predicted genes located within the intergenic sequences of
Gene finding can be defined as a problem of identifying a stretch of the genomic DNA sequence that is biologically functional. Such a genomic DNA sequence is known as a gene. A gene performs a function like protein coding or regulation at the molecular level and plays a biological role, such as growth, metabolism, and intelligence. Traditionally, gene finding relies on numerous biological experiments and statistical analysis to pinpoint the location of a new gene in a genetic map. With the advent of bioinformatics, gene finding has largely become a computational problem. Genes are predictable based on the genomic sequence alone. However, the determination of the specific function and biological role of a gene would still demand
A newly sequenced genome is annotated thoroughly so that the information it carries can be utilized. In essence, genome annotation is to identify the locations of genes and all of the coding regions in a genome, and determine their protein products as well as functions. Hundreds of bacterial genome sequences are publicly available and the number will soon reach a new milestone. Gene annotation by hand is almost impossible to handle the deluge of new genome sequences appearing at this pace. The need for automated, large-scale, high-throughput genome annotation is imminent (Overbeek, Begley et al. 2005; Van Domselaar, Stothard et al. 2005; Stothard and Wishart 2006). The basic level of genome annotation is the use of BLAST (Altschul, Gish et al. 1990) for finding similarities between related genomic sequences. Integration with other sources of information and experimental data is a trend in genome annotation.
A recent study indicates that many genomes could be either over-annotated (too many genes) or under-annotated (too few genes), and a large percentage of genes may have been assigned a wrong start codon (Nielsen and Krogh 2005). The fact that the original genome annotation is accurate and complete upon submission does not guarantee that it will not be changed, as new experimental evidence and knowledge would continue to arrive and constant updates would be inevitable. However, re-annotation of the whole genome is not very fruitful, as most of the genes have been identified in the first annotation. For example, the re-annotation of the H37Rv genome resulted in about 2% of new protein-coding sequences (CDS) added to the genome. The result reflects the limitation with current genome annotation technology. To address the issue, we developed a new method for gene finding in an annotated genome. We select the genome of
In our previous studies, we found that some intergenic sequences in
2. Research methods and design
The developed method of gene finding in the intergenic sequences proceeds as follows:
Transcription analysis to identify intergenic regions exhibiting significant gene expression activity.
Coding potential and gene structure analysis on active intergenic elements identified based on transcription evidence.
Protein domain search to identify functional domains in each active intergenic element with significant transcription activity and coding potential.
Homology search based on BLAST to seek homologue evidence.
The flowchart of the method is displayed in Figure 1.
The method was applied to the originally annotated
2.1. RNA isolation
2.2. Microarray hybridization
In this study, we used the anti-sense Affymetrix
Microarray hybridization followed the Affymetrix protocol. In brief, the assay utilized reverse transcriptase and random hexamer primers to produce DNA complementary to the RNA. The cDNA products were then fragmented by DNAase and labeled with terminal transferase and biotinylated GeneChip DNA Labeling Reagent at the 3' terminal.
Each RNA sample underwent hybridization with one gene array to produce the expression data of all genes on the array. We performed eleven independent bacterial cultures and RNA extractions at different times, and collected eleven sets of microarray data for this study. A global normalization scheme was applied so that each array's median value was adjusted to a predefine value (500). The scale factor for achieving this transformed median value for an array was uniformly applied to all the probe set values on a specific array to result in the determined signal value for all the probe sets on the array. In this manner, corresponding probe sets can be directly compared across arrays.
2.3. Gene expression analysis
The gene expression data were analyzed by the program GCOS (GeneChip Operating Software) version 1.4. In the program, the Detection algorithm determined whether a measured transcript was detected (P Call) or not detected (A Call) on a single array according to the Detection
2.4. Gene prediction
Protein-coding region identification and gene prediction were performed by the programs, GeneMark and GeneMark.hmm (Lukashin and Borodovsky 1998; Besemer and Borodovsky 2005) ( respectively. The prokaryotic version and the
2.5. Protein domain search
The Pfam program version 20.0 (Finn, Mistry et al. 2006) ( was employed to conduct protein domain search after the input DNA sequence was translated into a protein sequence in six possible frames. The search mode was set to “global and local alignments merged”, and the cut-off E-value set to 0.001, which was more stringent than the default value of 1.0. Pfam maintained a comprehensive collection of multiple sequence alignments and hidden Markov models for 8296 common protein families based on the Swissprot 48.9 and SP-TrEMBL 31.9 protein sequence databases.
2.6. Homology search
The BLASTx program (Altschul, Gish et al. 1990) ( was used to identify high-scoring homologous sequences. The program first translated the input DNA sequence into a protein sequence in six possible frames, and then matched it against the non-redundant protein sequence database (nr) in the GenBank and calculated the statistical significance of the matches. The default cut-off E-value was 10.0 but we set it to . Orthologs refer to homologs in different strains of the same species. Orthologs provide critical evidence for gene finding and characterization in a new genome sequence. A typical prokaryotic gene has the following structure: the promoter, transcription initiation, the 5' untranslated region, translation initiation, the coding region, translation stop, the 3' untranslated region, and transcription stop.
In our previous research, we conducted a genome-wide expression analysis on intergenic regions using the Affymetrix GeneChip (Fu and Shinnick 2007). The transcriptional activity of intergenic regions was measured based on a set of eleven independent RNA samples extracted from
In this work, genes in the intergenic sequences were recognized based on transcriptional activity, structural patterns, and coding potential, and subsequently validated through sequence comparison with orthologs from other
3.1. Transcriptional analysis
An intergenic region was assumed to transcribe if there existed transcripts in the RNA sample that were bound to the probes encoding that intergenic sequence. The presence or absence of a given transcript was determined in accordance with the Detection algorithm of the Affymetrix system. In this study, a gene or intergenic region was determined to express (transcriptionally active) only if the derived mRNA was present (P-call) in more than 90% of the collected RNA samples with a Detection
3.2. Sequence-based gene prediction
In the sequence-based approach to gene prediction, gene structure and coding potential are the two mutually supportive elements. The GeneMark algorithm was applied to an intergenic sequence for checking whether it contained a probable coding region, and the GeneMark.hmm algorithm was used for predicting a gene within the sequence. The criteria based on the predefined transcriptional evidence, coding potential, and gene structure yielded 65 candidate genes in the intergenic regions of
3.3. Protein domain search
The biological function of a gene is determined using
3.4. Homologue evidence
In evolutionary biology, a reliable means for predicting the function of an unknown gene sequence is based on homologs or orthologs. BLAST is a bioinformatics program for database search, allowing functional and evolutionary inference between sequences. In this study, BLAST was employed to retrieve from sequence databases all proteins that produced statistically significant alignment with a given intergenic sequence under consideration. The sequences retrieved by BLAST were homologous to the query sequence. It turned out that the highest-scoring homologous sequences with ≥98% identity were consistently those belonging to the same strain (H37Rv) or different strains of
A homologous sequence found in different strains of the same species often represents an ortholog that shares similar function, whereas a homologous sequence found in the same organism is a paralog (which is produced via gene duplication within a genome) that tends to have a different function. No evidence suggested paralogs in our analysis, as argued based on the following observation. We noted that, given an intergenic sequence, when BLAST returned a homologous sequence pertaining to the H37Rv strain, it was apparently the same protein-coding sequence contained in the intergenic sequence. This is because the intergenic sequence used as a query and its homologous sequence returned by BLAST occupied the same physical location within the genome, as inferred from information given by the Affymetrix Genechip. This coincidence was further explained by the fact that the intergenic sequence was named according to the early version of the H37Rv genome annotation while the homologous sequence was retrieved from the GenBank which contained all up-to-date genes. The results are significant. First, we demonstrated that our method was able to identify protein-coding genes in intergenic regions previously considered as non-coding sequences. Secondly, our method based on bioinformatics and transcriptional evidence correctly predicted these changes on a high-throughput, genomic scale. The changes refer to
IG1061 → (containing) Rv1322A
IG499 → Rv0634B
IG617 → Rv0787A
IG1741 → Rv2219A
IG2500 → Rv3198A
IG2053 → Rv2631
IG1179 → Rv1489A
IG2522 → Rv3224B
IG1291 → Rv1638A
IG398 → Rv0500A
IG2870 → Rv3678A
IG188 → Rv0236A
IG2498 → Rv3196A,
IG2591 → Rv3312A
IG595 → Rv0755A
IG1814 → Rv2309A
IG1030 → Rv1290A
IG2141 → Rv2737A
In the above findings, each intergenic region contained an independent gene/CDS with the only exception that part of IG2053 was incorporated in its left-flanking CDS. The presence of a gene structure in an IG and its lack of functional correlation with its adjacent genes suggested that it was not a run-away segment from adjacent genes.
In our analysis, predicted genes located within intergenic sequences that met the criteria defined based on protein-coding potential, structural patterns, and transcription evidence, were called “candidate” genes. If a candidate gene of unknown function was homologous to another gene of know function, the candidate gene was assigned the function associated with its homologous gene. Nonetheless, the strategy of inferring the function of an uncharacterized sequence from its orthologs had limited value in analyzing intergenic data mainly because most of the orthologs found in this study were hypothetical proteins with unknown function. We did not assign a specific function to a candidate gene until it had an ortholog of known function, whether or not the candidate gene carried a known functional domain. In the absence of a specific function assigned, a CDS was termed a hypothetical protein rather than a gene in our system.
In this work, six intergenic sequences were identified that met the criteria we defined, including protein coding, structural patterns, transcription, and ortholog evidence: IG499, IG617, IG1741, IG2500, IG1567, and IG2229, among which four genes had been reported in the
|IG||Lt Flank||Rt Flank||Domain ID||Re-annotated H37Rv Gene|
Computational algorithms for gene prediction are divided in two classes: One is based on sequence similarity and the other based on gene structure and signal. The latter is known as
The methods presented here did not address the issue of genes that did not code proteins. There are a number of regulatory, non-coding RNAs assuming a distinct role from mRNA, rRNA and tRNA. Many such RNAs have been identified and characterized both in prokaryotes and eukaryotes and their main functions are posttranscriptional regulation of gene expression and RNA-directed DNA methylation (Erdmann, Barciszewska et al. 2001; Pickford and Cogoni 2003). A non-coding RNA has neither a long open reading frame nor a gene structure. The DNA sequence that encodes a non-coding RNA is called a gene if its regulatory function can be defined. Thus it is possible that an isolated expression element lacking a gene structure is a non-coding, regulatory RNA. However, it was confirmed that the potential protein-coding genes found in this study did not match any RNA family published in the RNA-families database .
High-throughput gene finding on a newly sequenced genome is enabled through advanced computational genome annotation software. However, genome annotation does not guarantee all genes to be identified since knowledge and concepts about what constitutes a gene are evolving and yet to be perfected. Genome re-annotation using the same kind of computational heuristics offers limited help, unless supported with new