Gene expression is regulated by cis-regulatory elements, sequence signatures that are interspersed throughout the noncoding regions of the genome and are also found within coding regions. These elements orchestrate gene expression by regulating the different steps in the flow of genetic information. Transcription (DNA to RNA) and translation (RNA to protein) are controlled at different levels by different regulatory elements present in the genome. This chapter describes the structural and functional elements present in the coding and noncoding regions of the genome. We then discuss the role of regulatory elements in the regulation of gene expression in prokaryotes and eukaryotes. Finally, we discuss the DNA structural properties of regulatory regions and their role in gene expression. Identifying and characterizing cis-regulatory elements would be useful for engineering the regulation of gene expression.
- DNA structural properties
- gene architecture
- gene expression
- promoter structure
The genome, the blueprint of life, is essentially composed of coding (genes) and noncoding (regulatory regions and other repetitive sequences) DNA. Genetic information is embedded in the coding regions (genes) that encode proteins. The flow of information from gene to protein is a multistep pathway: transcription, the synthesis of RNA from DNA, followed by translation, the synthesis of protein from RNA. Control of this flow of information is crucial for the fate of the cell and is known as the regulation of gene expression. The function of a cell is determined by the amount and type of its RNA and protein molecules, which are set by the regulation of gene expression. The flow of information involves several steps: chromatin domain organization, transcription (initiation, elongation and termination), post-transcriptional modification, RNA export (exclusive to eukaryotes), translation and mRNA degradation. Among these regulated stages, transcription initiation is the most frequently used point of regulation. In prokaryotes, transcription is coupled with translation because, in the absence of a nucleus, ribosomes are available in the same compartment. In eukaryotes, however, the process is far more complicated because of additional steps, namely RNA splicing and RNA export. These additional steps provide extra stages at which gene expression can be regulated in eukaryotes.
Gene expression is regulated by harnessing the regulatory elements located in the noncoding as well as the coding regions of the genome. This chapter focuses on the different structural and functional elements present in the coding regions (genes) and noncoding regions (regulatory regions), which the cell uses to regulate gene expression.
2. Gene architecture
2.1. Noncoding elements of the genes
Genes are the repositories of the primary information content of inheritance in the genome, and their expression determines the phenotypes, which in turn decide the fate of the cell in multicellular organisms. The functioning of gene products, viz. mRNA (messenger RNA) and ncRNA (noncoding RNA), is modulated by complex gene regulatory networks. Eukaryotic genomes are characterized largely by compositional properties (repetitive sequences, codon usage bias, mutational information, etc.) and functional signals (TATA box, Inr element, cap signal, Kozak sequence, GT-AG splicing sites, etc.). Processing of the transcript is an important phase of gene expression, which also provides an additional level of regulation in eukaryotes. In prokaryotes, transcription and translation are coupled because ribosomes have direct access to the mRNA, whereas in eukaryotes the transcript undergoes several levels of processing in the nucleus and the processed transcript is finally exported to the cytoplasm for translation. Complexity in gene structure results in phenotypic diversity, and this complexity arises from the occurrence and arrangement of the noncoding elements that intersperse the coding region. Diversification of gene expression is achieved through trailer sequences known as untranslated regions (UTRs) and intervening noncoding sequences known as introns. These elements exert several direct and indirect functions.
2.2. Untranslated regions
UTRs are the trailer sequences located at the 5′ and 3′ ends of the coding region; they are part of the transcribed mRNA but remain untranslated. The presence of alternative promoters or of more than one transcription start site results in multiple 5′ UTRs, which in turn control gene expression in several ways [3, 4, 5, 6]. The G-quadruplex (G4) is a predominant secondary structure found in guanine-rich 5′ UTRs, where it hinders translation [7, 8, 9]. Highly and constitutively expressed genes tend to have short, guanine-poor 5′ UTRs, which facilitates translation. Sequential unwinding of natural stem-loop structures located in the 5′ UTR of some mRNAs has been found to be associated with efficient translation [11, 12, 13].
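As a rough illustration of how guanine-rich 5′ UTRs might be screened for putative G4-forming sequences, the sketch below uses a commonly quoted pattern of four runs of at least three guanines separated by loops of 1 to 7 nucleotides. The pattern, the function name `find_g4_motifs` and the loop limits are assumptions made for this example and do not come from the studies cited above; a match indicates only a candidate motif, not an experimentally confirmed structure.

```python
import re

# Assumed pattern (not taken from the chapter): four G-runs of >= 3 guanines
# separated by three loops of 1-7 nucleotides each.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGU]{1,7}G{3,}){3}")


def find_g4_motifs(utr: str) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) of each putative G4 motif.

    Accepts RNA or DNA; T is converted to U so one pattern covers both.
    Coordinates are 0-based and half-open.
    """
    seq = utr.upper().replace("T", "U")
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]
```

Calling `find_g4_motifs` on a 5′ UTR returns an empty list when no candidate motif is present, which would be the expected outcome for the short, guanine-poor UTRs of highly expressed genes.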
Internal ribosome entry sites (IRES), usually located upstream of the initiation codon (in the 5′ UTR), drive cap-independent translation initiation by recruiting the ribosome near the initiation site [5, 14, 15, 16]. IRES-mediated translational regulation occurs under stress conditions such as cellular stress, nutritional stress and mitotic stress [17, 18, 19]. Conserved upstream open reading frames (uORFs), located in the 5′ UTR upstream of the main start codon (AUG), have also been found to regulate protein translation [20, 21, 22]. Antibiotic resistance in pathogenic bacteria has also been found to be associated with uORF-mediated regulation. In a recent study, fusing a uORF upstream of an auto-activated immune receptor gene conferred resistance to plant diseases in Arabidopsis and rice.
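The notion of a uORF can be made concrete with a small scan of the 5′ UTR for AUG codons that are followed, in the same reading frame, by a stop codon before the UTR ends. This is only a minimal sketch: the function name `find_uorfs` and the `min_codons` threshold are hypothetical, and uORFs that overlap the main coding sequence are deliberately ignored.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_uorfs(five_prime_utr: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) of each AUG...stop frame lying fully inside the UTR.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    `min_codons` counts the AUG plus sense codons, excluding the stop.
    """
    seq = five_prime_utr.upper().replace("T", "U")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this AUG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    uorfs.append((start, pos + 3))
                break  # the first in-frame stop ends this uORF
    return uorfs
```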
Apart from the regulatory regions located in the 5′ UTR, the 3′ UTR, located at the end of the coding sequence, is also rich in regulatory sequences. The conserved motifs associated with the 3′ UTR play crucial roles in gene expression at the post-transcriptional level. The 3′ UTR performs various regulatory functions, such as stabilizing the mRNA through polyadenylation, directing transcript cleavage and serving as the binding site for microRNAs. Different mRNA isoforms with alternative 3′ UTRs are derived from alternative splicing and polyadenylation. In humans, varying expression levels and spatiotemporal localization of the same protein are achieved through differing 3′ UTR sequences [25, 26, 27]. AU-rich elements (AREs), which are 50 to 150 nucleotides long and contain multiple copies of the pentanucleotide AUUUA, regulate gene expression by controlling the stability of the mRNA [28, 29]. The abundance of AREs in the 3′ UTRs of a wide range of gene families indicates a significant role in gene regulation. MicroRNA response elements (MREs) are mostly located in the 3′ UTR, where single-stranded microRNAs about 22 nucleotides long bind to regulate the expression of the mRNA. The poly(A) tail is a stretch of adenosines (around 250 nucleotides) attached to the 3′ end of the RNA by polyadenylation. Poly(A)-binding proteins (PABPs), a specific class of nuclear and cytoplasmic regulatory proteins, bind to the poly(A) tail and perform regulatory functions such as mRNA stabilization, export and decay. These proteins play a vital role in gene regulation [32, 33, 34, 35].
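Since AREs are described above in terms of repeated AUUUA pentanucleotides, a first-pass screen of a 3′ UTR can simply count those pentamers. The sketch below counts overlapping occurrences; the function name `count_auuua` is hypothetical, and a pentamer count alone is an assumption-laden proxy rather than a validated ARE classifier.

```python
def count_auuua(three_prime_utr: str) -> int:
    """Count AUUUA pentamers, allowing overlaps (AUUUAUUUA contains two)."""
    seq = three_prime_utr.upper().replace("T", "U")
    # str.startswith with an offset tests every position, so overlapping
    # copies that share the flanking A are each counted.
    return sum(seq.startswith("AUUUA", i) for i in range(len(seq) - 4))
```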
2.3. Intronic regions
An intron is a noncoding DNA sequence that is transcribed but not translated; it is removed during the processing of pre-mRNA (precursor mRNA) into mature RNA, a process known as RNA splicing. There are four types of introns, distinguished by their splicing mechanisms. Spliceosomal introns were the first to be discovered and are the best characterized; they are excised by the spliceosome, a ribonucleoprotein complex [36, 37]. Group I introns are widely present in the mRNA, rRNA and tRNA of a variety of organisms, including algae, fungi, lower eukaryotes and a few bacteria [38, 39, 40, 41, 42]. Similarly, group II introns are large autocatalytic ribozymes widely present in mitochondria, chloroplasts, plants, fungi, yeast and many bacteria, and they play a major role in genome evolution [43, 44, 45, 46]. The tRNA introns, widely present in all domains of life, are exceptional in that enzymes are involved in removing the intron and in joining the two halves [47, 48, 49]. Gene regulation is modulated to a great extent by the number, length and position of introns, which have several direct and indirect biological functions. Multiple protein isoforms of the same gene are derived from regulated alternative splicing in eukaryotes [51, 52, 53, 54]. Introns modulate gene expression either through the transcriptional regulatory elements they contain or through intron-mediated enhancement [55, 56, 57]. Introns also regulate gene expression by mediating chromatin assembly (chromatin structure modulation) and by controlling mRNA export [58, 59, 60, 61]. Apart from these direct biological functions, introns also exert an indirect influence; for example, the position and length of an intron within a gene can affect the expression level of the transcript [62, 63, 64, 65].
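The removal of introns and joining of exons can be sketched as a simple string operation. The example below assumes that intron coordinates are already known and checks only the canonical GT-AG boundaries of spliceosomal introns; the function name `splice` and the half-open coordinate convention are choices made for this illustration, and the other intron classes described above are not modelled.

```python
def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns given as 0-based half-open (start, end) and join exons.

    Raises ValueError if an intron does not begin with GT (GU in RNA)
    and end with AG, the canonical spliceosomal boundaries.
    """
    seq = pre_mrna.upper()
    exons, cursor = [], 0
    for start, end in sorted(introns):
        intron = seq[start:end]
        if not (intron.startswith(("GT", "GU")) and intron.endswith("AG")):
            raise ValueError(f"intron {start}-{end} lacks GT-AG boundaries")
        exons.append(seq[cursor:start])  # exon preceding this intron
        cursor = end
    exons.append(seq[cursor:])  # final exon
    return "".join(exons)
```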
3. Promoter structure
3.1. Different promoter elements
Promoters are stretches of genomic sequence where the transcription machinery (RNAP and other accessory proteins) assembles prior to the initiation of transcription. Although prokaryotic and eukaryotic polymerases share functional similarity, their promoter architectures differ in complexity. In prokaryotes, a single type of RNA polymerase, together with specific σ factors, recognizes the promoter elements. The −10 and −35 elements located upstream of the transcription start site (TSS) are recognized by different domains of specific σ factors, while the UP element, an AT-rich sequence situated from −40 to −60, is recognized by the C-terminal domains (CTDs) of the α subunits of RNAP (Figure 1). A −15 element (TGnT), an extension of the extended −10 element, has also been proposed as a new element situated from −15 to −12. The −15 element has been found to determine the overall promoter strength by compensating for a weak −10 element.
In eukaryotes, on the other hand, the complexity of promoter architecture increases from yeast to mammals. Different types of RNA polymerase (normally three) are responsible for generating the various classes of RNA, such as ribosomal RNA, messenger RNA (mRNA) and tRNA. As with bacterial RNAP, archaeal RNA polymerase and eukaryotic RNA Pol II (responsible for transcribing mRNA) also require specific factors and promoter elements to initiate transcription at specific sites in the genome. Eukaryotic promoters can be broadly classified into three categories: core, proximal and distal. The core promoter (a sequence of approximately 50 nucleotides) is the platform on which RNA polymerase and the associated general transcription factors (GTFs), collectively referred to as the pre-initiation complex (PIC), assemble [68, 69]. Various promoter elements (Table 1) in the vicinity of the transcription start site, in both the upstream and downstream regions, are recognized by Pol II and other factors: the TATA box is recognized by the TATA-binding protein (TBP), the B recognition element (BRE) by TFIIB and other elements by TBP-associated factors (TAFs) (Figure 2). Apart from these, core promoter regions also contain the Inr element and may contain downstream elements such as the downstream promoter element (DPE) and the motif ten element (MTE) (in humans).
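The arrangement of the −35 and −10 elements can be illustrated with a small consensus scan. The sketch below assumes the widely quoted E. coli σ70 consensus hexamers TTGACA (−35) and TATAAT (−10) and a spacer of 16 to 18 bp; these values, the mismatch threshold and the function names are assumptions for this example and are not taken from the chapter. Real promoter prediction generally relies on position weight matrices rather than fixed strings.

```python
MINUS_35 = "TTGACA"  # assumed sigma70 -35 consensus
MINUS_10 = "TATAAT"  # assumed sigma70 -10 consensus


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scan_sigma70(seq: str, max_mismatch: int = 2, spacers=(16, 17, 18)):
    """Return (pos35, pos10, spacer, total_mismatches), best hits first.

    `max_mismatch` applies to each hexamer separately; positions are 0-based.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], MINUS_35)
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + 6 + spacer  # start of the candidate -10 hexamer
            if j + 6 > len(seq):
                continue
            m10 = mismatches(seq[j:j + 6], MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, spacer, m35 + m10))
    return sorted(hits, key=lambda hit: hit[3])
```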
|Name|Location (relative to TSS at +1)|Associated factor(s)|
|---|---|---|
|BREu|Upstream of the TATA box|TFIIB|
|TATA box|−30/−31 to −23/−24|TBP|
|BREd|Downstream of the TATA box|TFIIB|
|Inr|−2 to +4/+5|TAF1 & TAF2|
|DCE|+6 to +11, +16 to +21, +30 to +34|TAF1|
|MTE|+18 to +29|TAF6 & TAF9|
|DPE|+28 to +33|TAF6 & TAF9|
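The positional information in the table above can be encoded as a small lookup structure, for example to ask which core promoter elements could cover a given position relative to the TSS. In this sketch the widest quoted span is used where the table gives alternatives (for instance −31 to −23 for the TATA box), and BREu and BREd are omitted because the table gives no numeric coordinates for them. The names `CORE_ELEMENTS` and `elements_at` are hypothetical.

```python
# (spans relative to TSS at +1, associated factors), transcribed from the table.
CORE_ELEMENTS = {
    "TATA box": ([(-31, -23)], "TBP"),
    "Inr": ([(-2, 5)], "TAF1 & TAF2"),
    "DCE": ([(6, 11), (16, 21), (30, 34)], "TAF1"),
    "MTE": ([(18, 29)], "TAF6 & TAF9"),
    "DPE": ([(28, 33)], "TAF6 & TAF9"),
}


def elements_at(position: int) -> list[tuple[str, str]]:
    """Return (element, factor) pairs whose span includes `position`."""
    return sorted(
        (name, factor)
        for name, (spans, factor) in CORE_ELEMENTS.items()
        if any(low <= position <= high for low, high in spans)
    )
```

For example, position +20 falls within both the second DCE sub-element and the MTE, which illustrates that the downstream elements overlap in the table.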
Proximal promoters are located immediately upstream (up to a few hundred base pairs) of the core promoter and comprise the GC box, CAAT box, cis-regulatory modules (CRMs), etc. CpG islands are short stretches of DNA rich in GC content, located upstream of housekeeping and other regulated gene promoters [72, 73]. Proximal promoters mostly work as tethering elements for distal promoters instead of acting as direct activators. Distal promoters, on the other hand, act over long distances. Enhancers, silencers and insulators are present in the distal promoter regions. Enhancers, also known as “promoters of the promoter”, mainly control the specificity of gene expression, with distinct enhancers deployed in different cell types. Multiple enhancers associated with a single gene, and single enhancers activating multiple genes, provide an additional level of phenotypic diversity. In contrast to other regulators, enhancers exert their effects over tens of kilobases of DNA [75, 76]. Silencers are sequence-specific elements to which negative transcription factors bind to down-regulate gene expression. Insulators, also referred to as boundary elements, block the effect of the transcriptional activity of neighboring genes [77, 78].
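Whether a stretch of DNA resembles a CpG island can be checked with two simple statistics: the G+C fraction and the observed/expected CpG ratio. The sketch below applies the classical thresholds of at least 200 bp, G+C above 50% and an observed/expected ratio above 0.6; these thresholds and the function names are assumptions brought in for illustration and are not stated in the chapter.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Apply the assumed length, G+C and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_oe
```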
3.2. Promoter structure and nucleosome dynamics
The locations and strengths of the transcription factor and RNAP binding sites, also known as cis-regulatory elements, together with the list of all nucleosome-binding sites, collectively define the promoter structure. Nucleosomes are not only involved in packaging DNA but also bring order to the eukaryotic genome by regulating replication and transcription [79, 80]. Nucleosomes provide the first line of defense against unwanted transcription initiation. Actively transcribed gene promoters require that the DNA be accessible to the RNAP machinery and associated factors, which is facilitated by a nucleosome-free region (NFR) or nucleosome-depleted region (NDR) [81, 82]. Nucleosome positioning is the probability of finding a nucleosome at a given genomic location relative to the surrounding locations, while nucleosome occupancy refers to the average number of nucleosomes present at a given genomic location in a given population of cells [83, 84]. Cellular gene expression is the final outcome of nucleosome dynamics, which itself depends on a complex interplay between nucleosome positioning and occupancy [85, 86, 87].
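The distinction between occupancy and positioning can be illustrated with a toy calculation. The sketch below assumes that, for each cell in a population, the start coordinates of its nucleosomes on a locus are known, and that each nucleosome covers a fixed footprint of 147 bp; the input format, the footprint value and both function names are assumptions made for this example. Occupancy is computed per base pair as the average coverage across cells, and positioning as the share of nearby nucleosome starts that fall exactly at the queried coordinate.

```python
NUCLEOSOME_BP = 147  # assumed nucleosome footprint in base pairs


def occupancy(cells: list[list[int]], length: int,
              footprint: int = NUCLEOSOME_BP) -> list[float]:
    """Average nucleosome coverage per base pair across a cell population.

    `cells` holds, for each cell, the 0-based start coordinates of its
    nucleosomes on a locus of `length` bp.
    """
    covered = [0] * length
    for starts in cells:
        for start in starts:
            for pos in range(max(start, 0), min(start + footprint, length)):
                covered[pos] += 1
    return [count / len(cells) for count in covered]


def positioning(cells: list[list[int]], position: int, window: int = 73) -> float:
    """Share of nucleosome starts within +/- window that sit at `position`."""
    nearby = [s for starts in cells for s in starts if abs(s - position) <= window]
    return nearby.count(position) / len(nearby) if nearby else 0.0
```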
3.3. DNA structural properties of promoter regions
The DNA sequence determines not only the base-specific interactions but also the overall conformational shape, which is recognized by different proteins through non-base-specific interactions. DNA-binding proteins achieve higher binding specificity by combining different readout mechanisms, with DNA shape playing an important role in gene regulation and genome organization. The sequence-dependent structural properties of DNA can be roughly divided into two categories, conformational and physicochemical. Conformational properties represent the static DNA structure and are influenced by the geometry of base-pair steps, described by translational (shift, slide and rise) and rotational (tilt, roll and twist) parameters. These also determine the variation in major and minor groove dimensions, which are crucial for DNA-protein interactions. The physicochemical properties refer to dynamic DNA structural properties such as persistence length, stress-induced duplex destabilization, DNA duplex stability, protein-induced bendability and intrinsic curvature.
The structural properties of a given DNA sequence can be calculated using the di-, tri- and tetranucleotide models reported in experimental as well as theoretical studies. These models provide property values (lookup tables) for the different oligonucleotides; using these values and a sliding window of appropriate length, a given DNA sequence can be converted into a series of numerical values referred to as a structural property profile. Such a profile shows how the structural property varies over the different regions of the sequence (Figure 3). An average structural property profile is calculated by aligning the different sequences and taking the mean of the feature value at each position. DNA structural features such as low stability, protein-induced bendability and intrinsic curvature are consistently observed in prokaryotic and eukaryotic promoters (Figure 4) [93, 94, 95, 96]. The promoter regions of different categories of transcripts (primary, internal, antisense and noncoding RNA) present in the prokaryotic transcriptome show distinctly different DNA structural features. Moreover, the promoter regions of orthologous genes show conserved DNA structural properties in prokaryotes and plants [98, 99, 100]. These findings suggest that the DNA structural properties of promoter regions are conserved across the various classes of organisms.
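The conversion of a sequence into a structural property profile, and the averaging of profiles over aligned sequences, can be sketched as follows. Because the published lookup tables are not reproduced in this chapter, the example uses a placeholder dinucleotide table that scores each step by its G+C count (0, 1 or 2); in practice this would be replaced by a published table for stability, bendability or curvature. The function names, the window size and the assumption that sequences are pre-aligned and contain only A, C, G and T are all choices made for this illustration.

```python
def gc_score_table() -> dict[str, int]:
    """Placeholder lookup table: G+C count of each dinucleotide step."""
    return {a + b: (a in "GC") + (b in "GC") for a in "ACGT" for b in "ACGT"}


def property_profile(seq: str, table: dict, window: int = 3) -> list[float]:
    """Mean property over `window` consecutive dinucleotide steps.

    The window slides one step at a time along the sequence.
    """
    seq = seq.upper()
    steps = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return [sum(steps[i:i + window]) / window
            for i in range(len(steps) - window + 1)]


def average_profile(seqs: list[str], table: dict, window: int = 3) -> list[float]:
    """Position-wise mean profile over aligned sequences (e.g. at the TSS)."""
    profiles = [property_profile(s, table, window) for s in seqs]
    shortest = min(len(p) for p in profiles)
    return [sum(p[i] for p in profiles) / len(profiles) for i in range(shortest)]
```

A sequence of length N yields N − 1 dinucleotide steps and therefore N − window profile values, so the choice of window trades positional resolution against smoothing.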
4. Modulation of gene expression
The activity of RNAP in prokaryotes and of RNAPII in eukaryotes is tightly regulated to ensure the proper level of gene expression. Transcription factors (TFs), proteins that bind to specific regulatory sequences (cis-regulatory elements, CREs), are the key regulators of transcription. The complexity of gene regulation in eukaryotes is a consequence of the large number of transcription factors available and of the localization of the cis-regulatory elements.
4.1. Gene expression noise and its regulation
Variation in the copy number of mRNA or protein molecules of a given gene from cell to cell is referred to as gene expression noise. It is largely under the control of regulatory DNA, since it is linked to the promoter structure. The strength of the TATA box, the number, strength and position of transcription factor binding sites in the promoter, and the nucleosome-binding sites in the regulatory region all have an enormous effect on gene expression noise in eukaryotes. Although transcriptional regulation in prokaryotes is quite well understood at the molecular level, very little is known about gene expression noise in these organisms. Transcription factors and inducer molecules play a major role in gene regulation. Additionally, genome condensation assisted by nucleoid-associated proteins and DNA supercoiling also play a vital role in gene regulation in bacteria. Gene expression noise is essential for achieving phenotypic heterogeneity and has been found to be universal in nature.
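Gene expression noise, as defined above, can be quantified from per-cell copy numbers. Two commonly used summary statistics are the squared coefficient of variation (variance divided by the squared mean) and the Fano factor (variance divided by the mean); neither metric is prescribed by the chapter, so their use here, along with the function name `noise_metrics`, is an assumption for illustration.

```python
from statistics import mean, pvariance


def noise_metrics(counts: list[float]) -> dict[str, float]:
    """Summarize cell-to-cell variation in copy number for one gene.

    `counts` holds the mRNA or protein copy number measured in each cell.
    """
    mu = mean(counts)
    if mu == 0:
        raise ValueError("noise is undefined when the mean copy number is zero")
    var = pvariance(counts, mu)  # population variance across cells
    return {"mean": mu, "cv2": var / mu ** 2, "fano": var / mu}
```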
4.2. DNA structural properties and their role in gene expression
Nucleosome organization in the genome has been found to be closely associated with gene expression and its variability [82, 84, 85]. Genes with dissimilar expression levels tend to have sequences with different structural features in order to attain the required nucleosome organization [103, 104, 105]. Plasticity of gene expression, also known as gene expression variability, is crucial for cell survival and is closely linked with the DNA structural properties of the promoter region in S. cerevisiae. Promoters of genes with low plasticity (less responsive) are less stable, less bendable and have lower nucleosome occupancy than the promoters of genes with high plasticity (highly responsive) [106, 107]. A recent study of six prokaryotes with variable genomic GC content (ranging from 39 to 58%) shows a good correlation between the DNA structural properties of promoter regions and gene expression. Promoter regions associated with high gene expression were found to be less stable, less bendable and more curved than those associated with low gene expression, as seen in Figure 5. Intrinsic curvature was found to be the most significant property, being distinctly more pronounced in promoter regions associated with high gene expression than in those associated with low gene expression across all organisms. Hence, estimating and characterizing the DNA structural features of promoter regions could be very informative in analyzing the expression of the associated gene.
The growing plethora of genomic information in the form of whole genome sequences requires annotation to make sense of it. Mere delineation of coding sequences (gene identification) is not enough for a complete understanding of functional genomics, since the regulation of gene expression orchestrates the fate of cells. Gene expression regulation depends on different regulatory elements localized in the noncoding and coding regions of the genome. Identifying and characterizing these regulatory elements is the next challenge in genome annotation. Studies on the DNA structural features of regulatory regions show quite promising results toward achieving this goal. Moreover, characterization of regulatory regions based on DNA structural properties is more sensitive and precise than sequence-based approaches and, most importantly, is universal in nature, being applicable to all domains of life. Accumulating evidence shows a close relationship between gene expression and the structural properties of promoter DNA; furthermore, this information can be used to engineer regulatory sequences to modulate gene expression.
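A comparison of the kind described above, between promoters of highly and lowly expressed genes, can be sketched by ranking genes by expression and comparing the mean value of a promoter structural property in the top and bottom groups. The input format (pairs of expression level and property value), the quartile cut-off and the function name `compare_by_expression` are assumptions for this illustration and do not reproduce the procedure of the cited study.

```python
from statistics import mean


def compare_by_expression(genes: list[tuple[float, float]],
                          fraction: float = 0.25) -> dict[str, float]:
    """Compare a promoter property between low- and high-expression genes.

    `genes` holds (expression_level, property_value) pairs; the bottom and
    top `fraction` of genes by expression form the two groups.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    k = max(1, int(len(ranked) * fraction))  # group size, at least one gene
    low = mean(value for _, value in ranked[:k])
    high = mean(value for _, value in ranked[-k:])
    return {"low_expression": low, "high_expression": high,
            "difference": high - low}
```

A positive difference for a property such as intrinsic curvature would be in line with the trend reported above, although a proper analysis would also need a statistical test.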
MB is Senior Scientist, Indian National Science Academy (INSA) and the recipient of J. C. Bose National Fellowship of Department of Science and Technology (DST), India.
Conflict of interest