In proteomics analyses, protein identification by mass spectrometry (MS) is usually performed using protein sequence databases such as RefSeq (NCBI; http://www.ncbi.nlm.nih.gov/RefSeq/), UniProt (http://www.uniprot.org/) or IPI (http://www.ebi.ac.uk/IPI/IPIhelp.html). Because these databases usually target the longest (main) open reading frame (ORF) in the corresponding mRNA sequence, whether shorter ORFs on the same mRNA are actually translated has remained a mystery. It was long considered that almost all eukaryotic mRNAs contain only one ORF and function as monocistronic mRNAs. It is now known, however, that some eukaryotic mRNAs have multiple ORFs and are recognized as polycistronic mRNAs. One well-known type of extra ORF is the upstream ORF (uORF), which functions as a regulator of mRNA translation (Diba et al., 2001; Geballe & Morris, 1994; Morris & Geballe, 2000; Vilela & McCarthy, 2003; Zhang & Dietrich, 2005). To obtain clues to the mystery of diversified short ORFs, full-length mRNA sequence databases with complete 5'-untranslated regions (5'-UTRs) were essential (Morris & Geballe, 2000; Suzuki et al., 2001).
The oligo-capping method was developed to construct full-length cDNA libraries (Maruyama & Sugano, 1994), and the corresponding sequences were stored in a database called DBTSS (DataBase of Transcriptional Start Sites; http://dbtss.hgc.jp/) (Suzuki et al., 1997, 2002, 2004; Tsuchihara et al., 2009; Wakaguri et al., 2008; Yamashita et al., 2006). A comparison of the DBTSS dataset with the corresponding RefSeq entries showed that about 50% of the RefSeq entries had at least one upstream ATG (uATG) in addition to the functional ATG initiator codon (Yamashita et al., 2003). Although it had been suggested that upstream AUGs (uAUGs) and uORFs play important roles in the translation of the main ORF, none of the proteins from these uORFs had been detected in biological experiments in vivo. Our previous proteomics analysis focused on small proteins provided the first evidence for the existence of four novel small proteins translated from uORFs in vivo, using highly sensitive nanoflow liquid chromatography (LC) coupled with electrospray ionization-tandem mass spectrometry (ESI-MS/MS) (Oyama et al., 2004). Large-scale analysis based on in-depth separation by two-dimensional LC also led to the identification of eight additional novel small proteins, not only from uORFs but also from downstream ORFs, and one of them was found to be translated from a non-AUG initiator codon (Oyama et al., 2007). The discovery of these novel small proteins indicates the possibility of diverse control mechanisms of translation initiation.
In this chapter, we first introduce the widely recognized mechanism of translation initiation and the functional roles of uORFs in translational regulation. We then review how we identified novel small proteins by MS, and lastly discuss the progress of bioinformatics analyses for elucidating the diversification of short coding regions defined by the transcriptome.
2. Translational regulation by short ORFs
It is well known that the 5'-UTRs of some mRNAs contain functional elements for translational regulation defined by uAUGs and uORFs. In this section, we show how uAUGs and uORFs affect protein synthesis on eukaryotic mRNAs.
2.1. Outline of translation initiation
A small (40S) ribosomal subunit binds near the 5' end of the mRNA, i.e. the cap structure.
The 40S subunit migrates linearly downstream along the 5'-UTR until it encounters the optimum AUG initiator codon.
A large (60S) ribosomal subunit joins the paused 40S subunit.
The complete ribosomal complex (40S + 60S) starts protein synthesis.
In addition to the above mechanism, initiation of translation without the ribosome scanning step is also known. It is called "internal initiation" and depends on a particular structure on the mRNA termed the internal ribosome entry site (IRES).
2.2. The relationship between uORF and main ORF
When an mRNA contains a uORF, two models for the initiation of translation have been proposed (Fig. 2) (Hatzigeorgiou, 2002). One is called "leaky scanning" and the other "reinitiation". If the first AUG codon is in an unfavorable sequence context as defined by Kozak (see section 3.2), the small ribosomal subunit (40S) ignores the first AUG and initiates translation from a more favorable AUG codon located downstream. This phenomenon is known as "leaky scanning" (Fig. 2-(A)). When a ribosome translates the main ORF after terminating translation of the uORF on the same mRNA, the process is termed "reinitiation" (Fig. 2-(B)).
The relationships between the two ORFs are classified into three types: (1) a distant type (in-frame or out-of-frame), (2) a contiguous type (in-frame) and (3) an overlapped type (in-frame or out-of-frame) (Fig. 3). In-frame means that the uORF and the main ORF are in the same reading frame of the mRNA sequence, whereas out-of-frame means that they are in different frames. According to a previous analysis of accumulated 5'-end sequence data, the average size of uORFs was estimated at 31 amino acids, and 20% of ORFs were categorized as Type (3) (Yamashita et al., 2003).
These different relationships may lead to different events in translation initiation. In eukaryotes, the efficiency of reinitiation tends to increase as the distance between the uORF and the main ORF becomes longer (Kozak, 1991; Meijer & Thomas, 2002; Morris & Geballe, 2000). Therefore, ORFs classified as Types (2) and (3) are unlikely to be regulated by reinitiation. It has also been reported that reinitiation occurs only when the uORF is short (Kozak, 1991), whereas the sequence context of the inter-ORF region, the region upstream of the uORF, the uORF itself and even the main ORF can also affect reinitiation (Morris & Geballe, 2000). In contrast, ORFs of Type (3) may readily cause leaky scanning (Geballe & Morris, 1994; Yamashita et al., 2003). As a special case, when the termination codon of the uORF is near the AUG initiator codon of the downstream ORF, within about 50 nucleotides, ribosomes could scan backwards and reinitiate translation from the AUG codon of the downstream ORF (Peabody et al., 1986).
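The three relationship types described above can be illustrated with a short script. The following is a minimal sketch, not the procedure used in the cited studies. It assumes that "distant" means the uORF stop codon ends before the main AUG, "contiguous" means the stop codon is immediately followed by the main AUG, and "overlapped" means the uORF extends past the main AUG. Since Fig. 3 is not reproduced here, these definitions are an interpretation, and the names `find_uorfs` and `UORF` are hypothetical.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class UORF(NamedTuple):
    start: int      # 0-based index of the A in the upstream ATG
    end: int        # 0-based index just past the uORF stop codon
    in_frame: bool  # True if the uATG shares the reading frame of the main ATG
    relation: str   # "distant", "contiguous" or "overlapped" (assumed definitions)


def find_uorfs(mrna: str, main_start: int) -> List[UORF]:
    """List uORFs that start upstream of the main initiator codon.

    main_start is the 0-based index of the A in the main ATG.
    Upstream ATGs without an in-frame stop codon are skipped.
    """
    seq = mrna.upper().replace("U", "T")
    uorfs = []
    for start in range(main_start):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the uATG until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end < main_start:
                    relation = "distant"
                elif end == main_start:
                    relation = "contiguous"
                else:
                    relation = "overlapped"
                in_frame = (main_start - start) % 3 == 0
                uorfs.append(UORF(start, end, in_frame, relation))
                break
    return uorfs


# Toy sequences (invented for illustration only).
distant = find_uorfs("GGATGAAATAACCCATGGCCTGA", main_start=14)
contiguous = find_uorfs("ATGAAATAAATGGCCTGA", main_start=9)
overlapped = find_uorfs("ATGCATGGCTAAGTGA", main_start=4)
```

Under these assumed definitions, a contiguous uORF is necessarily in-frame, which agrees with the classification in the text, where only the distant and overlapped types are listed as in-frame or out-of-frame.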
2.3. The role of short ORFs in translation regulation
The 5'-UTR elements such as uAUGs and uORFs are well known as important regulators of translation initiation. For some genes that have multiple uORFs, considerably different effects on the translation of the main ORF can be generated depending on which combination of uORFs is translated. Some uORFs seem to promote reinitiation at the main ORF, whereas others seem to inhibit it. These effects are thought to be caused by the nucleotide sequences at the 3' ends of the uORFs, the sequences of the uORFs themselves or the protein products encoded by the uORFs. Such differential enhancement of translation is considered to be one of the adaptive responses to the environment (Altmann & Trachsel, 1993; Diba et al., 2001; Geballe & Morris, 1994; Hatzigeorgiou, 2002; Iacono et al., 2005; Meijer & Thomas, 2002; Morris & Geballe, 2000; Vilela & McCarthy, 2003; Wang & Rothnagel, 2004; Zhang & Dietrich, 2005). In addition, various factors or events are known to influence the translational inhibition of the main ORF: the presence of arginine, stalling of a ribosomal complex at the termination codon, or an interaction between a ribosomal complex and the peptide encoded by the uORF. This indicates that down-regulation by uORFs is a general phenomenon (Diba et al., 2001; Geballe & Morris, 1994; Iacono et al., 2005; Meijer & Thomas, 2002; Morris & Geballe, 2000; Vilela & McCarthy, 2003; Zhang & Dietrich, 2005).
As for downstream ORFs, there is also a report that a peptide encoded in the 3'-UTR may be expressed (Rastinejad & Blau, 1993). However, whether and how such peptides control the translation initiation of the main ORF is still unknown.
3. Variability of translation start sites
How a ribosomal complex (40S + 60S) recognizes an initiator codon on the mRNA is of vital importance for defining the proteome. Here we present some of the elements that have been proposed for the regulation of translation initiation.
3.1. The first-AUG rule
Traditionally, the first-AUG rule has been widely accepted for the initiation of translation (Kozak, 1987, 1989, 1991). It states that ribosomes start translation from the first AUG on the corresponding mRNA. Although this rule is not absolute, 90-95% of vertebrate ORFs were initiated at the first AUG codon on the mRNA (Kozak, 1987, 1989, 1991). Our previous proteomics analysis of small proteins also indicated that about 84% of proteins in RefSeq were translated from the first AUG of the corresponding mRNAs (Oyama et al., 2004). On the other hand, there are also many reports that contradict the rule: 29% of cDNAs contained at least one ATG codon in their 5'-UTR (Suzuki et al., 2000); 41% of transcripts had more than one uAUG and 24% of genes had more than two uAUGs (Peri & Pandey, 2001); about 50% of the RefSeq entries had at least one uAUG (Yamashita et al., 2003); and about 44% of 5'-UTRs had uAUGs and uORFs (Iacono et al., 2005). There are also reports that the first AUG is skipped if it is too close to the cap structure, within 12 (Kozak, 1991) to 14 (Sedman et al., 1990) nucleotides (see section 3.3). In this chapter, we cite a variety of statistical data on UTRs. Because they are based on different versions or generations of sequence databases, the figures vary widely (Meijer & Thomas, 2002), which should be taken into account when comparing them.
3.2. Kozak’s consensus sequence
The strongest bias for the initiation of translation in vertebrates is the sequence context called "Kozak's sequence", known as GCC(A/G)CCATGG (Kozak, 1987). The nucleotides in positions -3 (A or G) and +4 (G) are highly conserved and strongly promote the initiation of translation by a ribosomal complex (Kozak, 1987, 2002; Matsui et al., 2007; Suzuki et al., 2001; Wang & Rothnagel, 2004). Position -3 is the most highly conserved and functionally the most important part of the context of an AUG codon; the context is regarded as strong or optimal only when this position is A or G. Position +4 is also highly conserved (Kozak, 2002). Some reports mentioned that only 0.86% (Kozak, 1987) to 6% (Iacono et al., 2005) of functional initiator codons lacked Kozak's sequence in positions -3 and +4, whereas 37% (Suzuki et al., 2000) to 46% (Kozak, 1987) of uATGs would be skipped because of an unfavorable Kozak's sequence in both positions. In contrast, another report mentioned that most initiator codons were not in close agreement with Kozak's consensus sequence (Peri & Pandey, 2001).
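The check of positions -3 and +4 can be expressed in a few lines. The following is a minimal sketch, assuming the usual numbering in which the A of the ATG is position +1. The three labels ("strong", "adequate", "weak") are a convention adopted for this illustration rather than a scheme taken from the reports cited above, and the function name `kozak_context` is hypothetical.

```python
def kozak_context(seq: str, atg_index: int) -> str:
    """Classify the context of an ATG by positions -3 and +4.

    atg_index is the 0-based index of the A in the ATG (position +1).
    Labels are an assumption of this sketch:
      "strong"   : purine (A/G) at -3 and G at +4
      "adequate" : only one of the two positions matches
      "weak"     : neither position matches
    """
    seq = seq.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # Position -3 is three nucleotides upstream of the A; +4 follows the G.
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    matches = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[matches]


# Both variants of the consensus GCC(A/G)CCATGG are classified as strong.
consensus_a = kozak_context("GCCACCATGG", 6)
consensus_g = kozak_context("GCCGCCATGG", 6)
```

An ATG that lies fewer than three nucleotides from the 5' end has no position -3 and is treated here as not matching at that position, which is a simplification.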
3.3. The length of the 5'-UTR
The length of the 5'-UTR also affects translation initiation from an AUG codon near the 5' end of the mRNA (Kozak, 1991; Sedman et al., 1990). About half of ribosomes skip an AUG codon even in an optimal context if the 5'-UTR is shorter than 12 nucleotides (as mentioned in section 3.1), and this type of leaky scanning can be reduced if the 5'-UTR is 20 nucleotides or longer (Kozak, 1991). In the traditional analysis based on incomplete 5'-UTR sequences, the distance from the 5' end to the AUG initiator codon in vertebrate mRNAs was generally between 20 and 100 nucleotides (Kozak, 1987). A previous analysis using RefSeq human mRNA sequences indicated that 85% of 5'-UTR sequences shorter than 100 nucleotides contain no uAUGs (Peri & Pandey, 2001). This evidence suggested that the first-AUG rule was widely supported in eukaryotes. In a more recent analysis based on full-length 5'-UTR sequences, the 5'-UTR was 125 nucleotides long on average (Suzuki et al., 2000), and transcriptional start sites (TSSs) vary widely (Carninci et al., 2006; Kimura et al., 2006; Suzuki et al., 2001). The average scatter in the length of the 5'-UTR was more than 61.7 nucleotides, with a standard deviation of 19.5 nucleotides (Suzuki et al., 2001), and 52% of the human RefSeq genes contained 3.1 TSS clusters on average, separated by intervals of more than 500 nucleotides (Kimura et al., 2006) (Fig. 4). In protein-coding genes, differentially regulated alternative TSSs are common (Carninci et al., 2006). Because the diversity of transcription initiation greatly affects the length of the 5'-UTR, some doubt remains as to whether the length of the 5'-UTR contributes to the efficiency of translation initiation. There is also a report that the degree of leaky scanning is not affected by the length of the 5'-UTR (Wang & Rothnagel, 2004).
3.4. Non-AUG initiator codon
In the general translation model, a non-AUG codon is considered to be ignored by ribosomes unless a downstream AUG codon is in a relatively weak context (Geballe & Morris, 1994; Kozak, 1999). When an upstream non-AUG codon, such as ACG, CUG or GUG, satisfies Kozak's consensus sequence, it may function as an initiator of translation in addition to the first AUG initiator codon (Kozak, 1999, 2002). Besides Kozak's consensus sequence, a downstream stem-and-loop structure and a highly structured GC-rich context in the 5'-UTR could enhance translation initiation from a non-AUG codon (Kozak, 1991, 2002).
4. Protein identification by MS
Recent progress in proteomic methodologies based on highly sensitive liquid chromatography-tandem mass spectrometry (LC-MS/MS) technology has enabled us to identify hundreds or thousands of proteins in a single analysis.
We succeeded in discovering novel small proteins translated from short ORFs using a direct nanoflow LC-MS/MS system (Oyama et al., 2004, 2007). Among 54 proteins of less than 100 amino acids that were identified by searching several sequence databases with a representative search engine, Mascot (Matrix Science; http://www.matrixscience.com/), four turned out to be encoded in 5'-UTRs (Oyama et al., 2004). This was the first direct evidence that peptide products of uORFs are actually translated in human cells. In the subsequent analysis using a more sophisticated two-dimensional LC system, we discovered eight more novel small proteins (Oyama et al., 2007), five of which were encoded in the 5'-UTR and three in the 3'-UTR of the corresponding mRNA. Even based on the accumulated DBTSS data, two ORFs had no putative AUG codon, which indicated the possibility that they were translated from non-AUG initiator codons. In that study, 197 proteins smaller than 20 kDa were identified by Mascot. The procedure for identifying novel proteins by MS is described below.
4.1. Materials and methods
The proteins in cultured cell lysates were first separated according to their size. Small protein-enriched fractions prepared through acid extraction and SDS-PAGE were digested with enzymes. In the case of SDS-PAGE, the digested peptides were extracted from the gel. The samples were then desalted and concentrated for introduction into the MS system. The schematic procedure is shown in Fig. 5.
4.2. Protein identification
The samples were analyzed using a nanoflow LC-MS/MS system. The purified peptides were eluted with a linear gradient of acetonitrile and sprayed into a high-resolution tandem mass spectrometer. The acquired tandem mass (MS/MS) spectra were then converted to text files and searched against sequence databases using Mascot. Based on the principle that each peptide has an MS/MS spectrum with unique characteristics, the search engine compares the measured data on precursor and product ions with those theoretically calculated from protein sequence data (Fig. 6). The MS/MS spectrum file contains the mass-to-charge ratio (m/z) values of precursor and product ions along with their intensities. The measured spectrum lists are searched against sequence databases to identify the corresponding peptides in a statistical manner. The theoretical spectrum lists depend entirely on the contents of the sequence databases themselves.
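How theoretical m/z values are derived from a peptide sequence can be sketched as follows. This is a minimal illustration using standard monoisotopic residue masses, not a reproduction of Mascot's probability-based scoring. It assumes unmodified residues (no fixed or variable modifications such as cysteine alkylation) and singly charged b and y product ions. The function names and the 10 ppm default tolerance are hypothetical choices for this example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # mass of a proton


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def precursor_mz(peptide: str, charge: int) -> float:
    """Theoretical m/z of the precursor ion [M + zH]z+."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


def fragment_ions(peptide: str):
    """Singly charged b and y ions for each backbone cleavage site.

    b[i] and y[i] are the complementary fragments of the same cleavage.
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def within_tolerance(measured: float, theoretical: float,
                     tol_ppm: float = 10.0) -> bool:
    """True if the measured m/z lies within tol_ppm of the theoretical value."""
    return abs(measured - theoretical) / theoretical * 1e6 <= tol_ppm
```

A search engine performs this kind of calculation for every candidate peptide derived from the database, which is why a peptide whose sequence is absent from the database cannot be matched regardless of spectrum quality.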
4.3. Finding of novel small proteins
To explore novel small proteins, two types of sequence databases were used: one was an artificial database computationally translated from the cDNA sequences in all the reading frames, and the other was an already established protein database. To compare the large-scale protein identification data from the two kinds of databases, several Perl scripts were developed based on the definition that candidates for novel small proteins are those identified only in the cDNA database(s) (Fig. 7). In a result datasheet using RefSeq sequences, each protein was annotated with NM numbers for the cDNA database and with NP numbers for the protein database. The Perl scripts then converted the NM numbers to the corresponding NP numbers and compared them.
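The comparison logic described above can be sketched briefly. The original analysis used Perl scripts that are not reproduced in this chapter, so the following Python fragment is only an illustration of the stated definition (a candidate is a hit found in the cDNA-derived database but not in the protein database). The function name, the mapping table and the accession strings are hypothetical placeholders, not real RefSeq entries.

```python
from typing import Dict, List, Set


def novel_candidates(cdna_hits: Set[str],
                     protein_hits: Set[str],
                     nm_to_np: Dict[str, str]) -> List[str]:
    """Return NM accessions identified only in the cDNA-derived database.

    cdna_hits    : NM accessions of hits from the translated cDNA database
    protein_hits : NP accessions of hits from the protein database
    nm_to_np     : NM -> NP conversion table (assumed to be available)
    """
    candidates = []
    for nm in sorted(cdna_hits):
        np_acc = nm_to_np.get(nm)
        # Keep the hit when its annotated protein was not identified,
        # or when the transcript has no annotated protein at all.
        if np_acc is None or np_acc not in protein_hits:
            candidates.append(nm)
    return candidates


# Placeholder accessions, invented for illustration only.
mapping = {"NM_X1": "NP_X1", "NM_X2": "NP_X2"}
result = novel_candidates({"NM_X1", "NM_X2", "NM_X3"}, {"NP_X1"}, mapping)
```

A candidate list obtained this way is only a starting point: a hit found solely in the cDNA database may still map to the annotated coding region of a protein that was missed in the protein search, so the position of each matched peptide on the transcript needs to be inspected before it is called a novel small protein.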
5. Bioinformatics approach
To advance MS-based identification of novel coding regions of mRNAs, MS systems, sequence databases and bioinformatics methodologies need to improve together. Regarding bioinformatics, two aspects seem to be in demand: one is retrieving target proteins from an enormous volume of database search results, and the other is constructing platforms to predict novel coding sequences (CDSs).
5.1. Contribution of sequence databases & bioinformatics to MS-based proteomics
Recent advances in MS-based proteomics technology have enabled us to perform large-scale protein identification with high sensitivity. The accumulation of well-established sequence databases has also made a great contribution to efficient identification in proteomics analyses. One representative type is a specialized 5'-end cDNA database such as DBTSS, and another is the series of whole genome sequence databases for various species. For investigating the mechanisms of transcriptional control, DBTSS has lately attracted considerable attention because it contains accumulated information on the transcriptional regulation of each gene (Suzuki et al., 2002, 2004; Tsuchihara et al., 2009; Wakaguri et al., 2008; Yamashita et al., 2006). Based on the accumulated data, the diverse distribution of TSSs was clearly indicated (Kimura et al., 2006; Suzuki et al., 2000, 2001). In parallel, many whole genome sequencing projects are progressing all over the world (GOLD: Genomes Online Database; http://www.genomesonline.org/). The completion and maintenance of sequence databases for various species should help to find more novel proteins across species. For example, several reports have used bioinformatics approaches to explore novel functional uORFs by comparing the 5'-UTR regions of orthologs based on multiple sequence alignments (Zhang & Dietrich, 2005), by using ORF Finder (http://bioinformatics.org/sms/orf_find.html) and a machine learning technique, inductive logic programming (ILP), with biological background knowledge (Selpi et al., 2006), or by applying comparative genomics and a heuristic rule-based expert system (Cvijovic et al., 2007). Using advanced sequence databases, new protein CDSs have been added as a result of prediction by various algorithms (e.g. Hatzigeorgiou, 2002; Ota et al., 2004). Based on the well-established cDNA databases, MS could evaluate in a high-throughput manner whether these CDSs are actually translated.
Construction of more detailed sequence databases will lead to the detection of more novel small proteins in the presumed 5'-UTRs (Oyama et al., 2004). To make good use of such exhaustive sequence databases, bioinformatics techniques are indispensable, especially data mining tools such as search engines that retrieve target proteins from an enormous volume of database search results.
5.2. Contribution of MS-based proteomics to sequence databases & bioinformatics
In addition to the technological progress of MS, sequence databases and data mining tools, the development of another class of bioinformatics techniques, called prediction tools, is also important. Ad hoc algorithms for predicting new CDSs, as mentioned above, could be improved by using MS-based novel protein data. These novel proteins can serve as supervised training data for machine learning, pattern recognition or rule-based manual approaches. There is an interesting bioinformatics report which hypothesized that a uORF in the transcript down-regulates transcription of the corresponding RNA via RNA decay mechanisms (Matsui et al., 2007). The authors obtained human and mouse transcripts from RefSeq and UniGene (http://www.ncbi.nlm.nih.gov/unigene) and classified the transcripts into Level 0 (not containing a uORF) and Levels 1-3 (containing a uORF). They then prepared data on the expression intensities and half-lives of mRNA transcripts, mainly from SymAtlas (now linked to BioGPS; http://biogps.gnf.org/#goto=welcome) and the Genome Research website (http://genome.cshlp.org/). Although they suggested that not only the expression level but also the half-life of transcripts was clearly lower in the latter group, they did not demonstrate any interaction between uORFs and transcripts.
Advanced MS instruments can not only evaluate whether uORFs are actually translated but also quantify time-course changes in their expression levels. Stable isotope labeling with amino acids in cell culture (SILAC) technology enables us to quantify the changes in all the proteins in vivo (Oyama et al., 2009). Based on time-course changes of specific peptides, we could also hypothesize some regulatory interactions. In combination with measurement of the dynamics of the corresponding mRNAs using microarrays or reverse transcription-polymerase chain reaction (RT-PCR), transcriptional regulation by short ORFs will be analyzed at the system level.
Although the roles of 5'-UTR elements, especially uORFs, as translational regulators of the main ORFs had been well discussed in the biological context, whether the proteins encoded by the uORFs were actually translated had not been addressed for a long time. We first unraveled this mystery by demonstrating the existence of novel protein products defined by these ORFs using advanced proteomics technology. Thanks to the progress of nanoLC-MS/MS-based shotgun proteomics strategies, thousands of proteins can now be identified from protein mixtures such as cell lysates. Some of the presumed UTRs are no longer "untranslated", and other noncoding transcripts are no longer "noncoding". One of the novel small proteins revealed in our analysis was indeed defined by a short transcript variant generated by the use of a downstream alternative promoter (Oyama et al., 2007). Alternative use of diverse transcription initiation, splicing and translation start sites could increase the complexity of short protein-coding regions, and MS-based annotation of these novel small proteins will enable a more detailed and systematic analysis of the real outline of the proteome, along with the translational regulation by the diversified short ORFeome.