Open access peer-reviewed chapter

Machine Learning and Rule Mining Techniques in the Study of Gene Inactivation and RNA Interference

By Saurav Mallik, Ujjwal Maulik, Namrata Tomar, Tapas Bhadra, Anirban Mukhopadhyay and Ayan Mukherji

Submitted: February 22nd 2018; Reviewed: December 11th 2018; Published: April 23rd 2019

DOI: 10.5772/intechopen.83470

Abstract

RNA interference (RNAi) and gene inactivation are widely used terms in biomedical research. Two categories of small ribonucleic acid (RNA) molecules, viz., microRNA (miRNA) and small interfering RNA (siRNA), are central to RNAi. Various algorithms have been developed in relation to RNAi and gene silencing. In this book chapter, we provide a comprehensive review of machine learning and association rule mining algorithms developed to handle different biological problems such as the detection of gene signatures, biomarkers, gene modules, potentially disordered proteins, differentially methylated regions and many more. We also provide a comparative study of different well-known classifiers along with other methods in use. In addition, we briefly present the biological challenges for gene inactivation as well as the advantages, disadvantages and possible therapeutic strategies. Finally, our study helps bioinformaticians to understand the overall picture across different research dimensions, including several learning algorithms, for the benefit of disease discovery.

Keywords

  • machine learning
  • association rule mining
  • RNAi
  • gene silencing
  • multi-omics data

1. Introduction

RNAi [1] is an innate biological process in which RNA molecules inhibit gene expression or translation [2] by suppressing targeted mRNA molecules. Since the discovery of RNAi by Andrew Fire and Craig Mello, it has become evident that RNAi has immense potential for suppressing desired genes [3]. The first evidence that double-stranded RNA (dsRNA) could achieve efficient gene silencing through RNAi came from studies on the nematode Caenorhabditis elegans [4] and on Drosophila melanogaster [5], which led toward an understanding of the biochemical nature of the RNAi pathway. Two types of small ribonucleic acid (RNA) molecules, microRNA (miRNA) and small interfering RNA (siRNA), are central to RNAi [6]. The two types differ in how they elicit RNAi: an siRNA must be fully complementary to its target mRNA, whereas an miRNA only needs to be partially complementary to its target mRNA. In organisms like C. elegans and D. melanogaster, RNAi can be induced by introducing long dsRNA complementary to the target mRNA to be degraded. In mammalian cells and organisms, however, introducing dsRNA longer than 30 bp activates a potent antiviral response. To overcome this limitation, siRNAs are used to induce RNAi in mammalian cells and organisms [7, 8, 9].

The discovery of both siRNA and miRNA provides a new therapeutic approach [10, 11] for the treatment of diseases by targeting genes that are mutated or normal genes that are overexpressed. The RNAi process is as follows. Degradation of specific endogenous mRNAs induced by siRNAs is a very common phenomenon in eukaryotic cells and inhibits protein production at the post-transcriptional level [12]. The RNAi process is initiated by short dsRNAs of 21–25 nucleotides that lead to the sequence-specific inhibition of their homologous mRNAs. These siRNAs are normally produced in cells by cleavage of longer dsRNA precursors by Dicer, a member of the ribonuclease III family. The cleaved fragments are incorporated into a multi-component nuclease complex known as the RNA-induced silencing complex (RISC), which contains the slicer protein Argonaute-2 (Ago-2) [13]. The single-stranded RNA (ssRNA) derived from the short dsRNA acts as an antisense strand that directs the complex to the specific target mRNA, where a RISC-associated endoribonuclease cleaves the target mRNA [14]. Therapeutic approaches based on siRNA involve the introduction of a synthetic siRNA into the target cells to elicit RNAi, thereby inhibiting the expression of a specific messenger RNA (mRNA) to produce a gene silencing effect [15]. RNAi is beneficial in accelerating cures in medicine, especially when a disease is thought to be due to a defective gene [16]. Historically, the first application of RNAi therapy was in age-related macular degeneration (AMD), where siRNAs were used to suppress the vascular endothelial growth factor (VEGF) pathway that causes abnormal growth of blood vessels behind the retina; the siRNAs were delivered directly to the patient's eye [17].

RNAi techniques have also been used against the spread of tumor growth and to increase tumor sensitization toward drug treatment. RNAi technology can selectively affect cancer cells without damaging normal cells, because RNAi therapy against cancer directly targets the oncogenes; it has therefore been found to stop the progression and invasion of tumor cells [18, 19] and to increase the sensitization of tumors to drugs [20]. As RNAi can silence disease-associated genes in tissue culture and animal models, the development of RNAi-based reagents for therapeutic applications involves technological enhancements that improve siRNA stability and delivery in vivo [21] while minimizing off-target and nonspecific effects.

A number of different approaches have been developed for the in vivo delivery of siRNA. Among them, rapid infusion by hydrodynamic injection of siRNA achieves the best delivery in rodents [22]. However, delivery by this route is restricted to highly vascularized tissues such as the liver [23], and it is currently not a viable method for delivery in human clinical studies. Lipid-based in vivo applications have also been devised [24]. Lipids have been used extensively for cell culture experiments, but there are some issues: the cationic nature of the lipids used in cell culture leads to aggregation when they are used in animals and results in rapid serum clearance and lung accumulation. Even so, an increasing number of reports cite success with lipid-mediated delivery of siRNAs in vivo. To improve the delivery of siRNA into human liver cells [25] without transfection agents, lipophilic siRNAs were prepared by conjugation with derivatives of cholesterol, lithocholic acid, or lauric acid, where the lipid moieties were covalently linked to the 5′-ends of the RNAs using phosphoramidite chemistry [26]. These conjugates could down-regulate the expression of a LacZ expression construct. Soutschek et al. [27] improved the pharmacological properties of siRNA molecules by conjugating cholesterol to the 3′-end of the sense strand of the siRNA by means of a pyrrolidine linker. The advantages of cholesterol attachment are that the siRNA becomes more resistant to nuclease degradation, is more stable in the blood through increased binding to human serum albumin, and is taken up more readily by the liver. Intravascular delivery of siRNA molecules is a very simple technique. Song et al. [28] used it to protect mice from fulminant hepatitis with siRNAs against the Fas receptor, administering Fas siRNA by intravenous injection into mice over a 24-hour period. The authors showed that the effects persisted for 10 days and protected the mice against experimentally induced liver fibrosis.

Local delivery of siRNAs into the eye has also been tried to target the VEGF pathway, and it has been shown to be therapeutically beneficial in neovascularization-related eye diseases. Topical siRNA gels have also been used to deliver siRNAs to cells in dermatological applications and cervical cancer treatment [29]. The gene gun method has been used for intradermal administration of nucleic acids to enhance cancer vaccine potency [30]. Another technique is electroporation, which has been used to deliver siRNAs into the brain [31] and muscles of rodents. Direct injection of viral vectors for the in vivo delivery of siRNA has also been tried: an adeno-associated virus (AAV) based shRNA vector was injected directly into the midbrain neurons of adult mice and silenced the tyrosine hydroxylase gene near the site of injection for several weeks. There is also an alternative to injection, an ex vivo approach that was used to generate human immunodeficiency virus (HIV)-1-resistant lymphocytes and macrophages [32]. This was accomplished by using a lentiviral vector to introduce an anti-rev siRNA construct into CD34(+) hematopoietic progenitor cells. The siRNA-transduced progenitor cells were allowed to mature into macrophages in vitro and T-cells in vivo [33].

Many machine learning, bio-statistical [34] and association rule mining methods [35] have been developed to solve different problems related to gene silencing and disease discovery. In this book chapter, we provide a comprehensive survey of different machine learning and association rule mining algorithms developed for tackling various biological challenges such as the detection of gene signatures, biomarkers, gene modules, potentially disordered proteins and differentially methylated regions, multi-omics data integration, etc. We also describe a comparative study of different well-known classifiers along with other methods used in such studies. Meanwhile, many gene module discovery approaches have been developed that employ several machine learning, deep learning and soft computing techniques. In addition, many multi-objective algorithms have been developed to find optimal multi-omics genetic signatures for a given disease. Furthermore, we briefly present the biological challenges for gene inactivation together with the advantages, disadvantages and possible therapeutic strategies. Several challenges still remain to be handled, such as off-target effects, cytotoxicity, non-specific gene silencing, activation of the innate immune system, siRNA activity itself, and the lack of efficient in vivo delivery systems and delivery vehicles for clinical implementation. Apart from these challenges, the development of efficient tissue-specific and differentiation-dependent expression of siRNA is essential for transgenic and therapeutic approaches. Nevertheless, successful in vitro and in vivo experiments raise hopes of treating human diseases with RNAi [36].

Moreover, our study is useful for researchers to understand the central ideas of RNAi and gene silencing, along with the current machine/deep learning and association rule mining algorithms related to them (Figure 1).

Figure 1.

Flowchart of the RNAi mechanism [37, 38].

2. Fundamental concepts

In this section, some basic notions of graph mining, pattern recognition [39] and information theory are described. A graph is an ordered pair $G = (V, E)$ comprising a set of vertices $V$ and a set of edges $E$. To avoid ambiguity, the graphs considered here are undirected and simple. Let $Q = (N, E)$ be an unweighted, undirected graph, and let $H$ be a subgraph (hypograph) of it, $H \subseteq N$. The density of $H$, denoted by $Ds(H)$, is defined as $Ds(H) = |IE(H)| / |H|$, where $IE(H)$ denotes the induced edge set of $H$ and $|H|$ refers to the cardinality of $H$. The highest density over the subgraphs of $Q$ is then $\max_{H \subseteq N} Ds(H)$. If $Q = (N, E)$ is a weighted graph, the density becomes $Ds(H) = \sum_{e \in IE(H)} wt(e) / |H|$, where $IE(H)$ again denotes the induced edge set of $H$ and $wt(e)$ denotes the weight of the edge $e \in IE(H)$. The entropy of a random variable measures the amount of uncertainty associated with the variable [40]. The entropy of a discrete variable $A$, referred to as $EP(A)$, is defined as $EP(A) = -\sum_{a \in A} p(a) \log_b p(a)$, where $p(a)$ refers to the probability mass function of $A$, and the value of $b$ is, in general, taken as 2. Mutual information [41] between two random variables estimates the quantity of information that they share, i.e., the mutual dependency between them. When the mutual information is zero, the two variables are entirely independent of each other; the higher the mutual information, the more strongly the two variables depend on each other.
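The definitions above can be illustrated with a minimal Python sketch. The function names (`density`, `densest_subgraph`, `entropy`, `mutual_information`) are hypothetical and chosen only for this illustration. Two assumptions are made that are not stated in the text: the densest subgraph is found by brute-force enumeration, which is feasible only for very small graphs, and mutual information is estimated from paired samples with the standard identity I(X;Y) = EP(X) + EP(Y) − EP(X,Y).

```python
import math
from collections import Counter
from itertools import combinations


def density(nodes, edges, weights=None):
    """Ds(H): number (or total weight) of edges induced by H, divided by |H|."""
    nodes = set(nodes)
    induced = [e for e in edges if e[0] in nodes and e[1] in nodes]
    if weights is None:
        total = len(induced)                      # unweighted case: |IE(H)|
    else:
        total = sum(weights[e] for e in induced)  # weighted case: sum of wt(e)
    return total / len(nodes)


def densest_subgraph(vertices, edges):
    """Brute-force maximum of Ds(H) over all non-empty vertex subsets.

    Exponential in the number of vertices; meant for tiny examples only.
    """
    candidates = (
        frozenset(c)
        for r in range(1, len(vertices) + 1)
        for c in combinations(vertices, r)
    )
    best = max(candidates, key=lambda h: density(h, edges))
    return best, density(best, edges)


def entropy(values, base=2):
    """EP(A) = -sum_a p(a) log_b p(a), with p estimated from observed frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def mutual_information(xs, ys, base=2):
    """Estimate I(X;Y) as EP(X) + EP(Y) - EP(X,Y) from paired observations."""
    joint = list(zip(xs, ys))
    return entropy(xs, base) + entropy(ys, base) - entropy(joint, base)


# Toy graph: a 4-clique {a, b, c, d} with one pendant vertex e attached to d.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
best_nodes, best_density = densest_subgraph(vertices, edges)
```

In this toy graph the clique has density 6/4, which exceeds the density 7/5 of the whole graph, so the clique is returned. Two identical binary variables share all of their entropy, whereas two variables whose value pairs are uniformly distributed share none.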

Topological Overlap Measure (TOM) and other related measures: Ravasz et al. [42] proposed the Topological Overlap Measure (TOM), which quantifies the similarity between two nodes of a network based on the nearest-neighbor concept. Various modified versions of TOM, such as weighted TOM (wTOM) [43] and generalized TOM (GTOM) [44], are present in the literature. To compute the wTOM, Pearson correlation coefficients are first evaluated for all pairs of vertices, and a soft-thresholding power $\beta \geq 1$ is then chosen from the correlation coefficient matrix according to the scale-free topology criterion. After that, the weighted adjacency matrix is calculated from the correlation matrix using the chosen power $\beta$, and the wTOM is computed from the weighted adjacency matrix. GTOM is defined like TOM except that it counts the number of $m$-step neighbors when calculating the TOM measure between two vertices. For GTOM of order 0 (i.e., GTOM0), the adjacency score itself is the GTOM0 score. For GTOM of order higher than zero (i.e., GTOM1, GTOM2, GTOM3, …), the same procedure as for TOM is followed, but neighbors up to the $m$-th step are counted for each vertex ($m = 1, 2, 3, \ldots$). Notably, GTOM1, GTOM2 and the other higher-order GTOM measures work only on a binary matrix. So, before using those measures, the weighted adjacency matrix is converted into a binary matrix in which an adjacency value greater than a specified cutoff (e.g., 70% of the distance between the minimum and maximum adjacency values) is converted into 1, and a value lower than the cutoff is converted into 0.
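The wTOM pipeline described above (correlation, soft thresholding, weighted adjacency, topological overlap) and the binarization step used before the higher-order GTOM measures can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [42, 43, 44]. The function names are hypothetical, the power `beta` is simply fixed by the caller instead of being selected by the scale-free topology criterion, and the overlap uses a commonly cited formulation, TOM(i, j) = (Σ_u a_iu a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij), which is not spelled out in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weighted_adjacency(profiles, beta=6):
    """Soft thresholding: a_ij = |cor(x_i, x_j)| ** beta, with a zero diagonal.

    beta is fixed here for illustration; the text selects it through the
    scale-free topology criterion, which is not implemented in this sketch.
    """
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = abs(pearson(profiles[i], profiles[j])) ** beta
    return adj


def tom(adj):
    """Topological overlap matrix computed from a symmetric adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]            # connectivity of each node
    out = [[1.0] * n for _ in range(n)]      # diagonal is set to 1
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j]
                             for u in range(n) if u not in (i, j))
                out[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return out


def binarize(adj, fraction=0.7):
    """Convert a weighted adjacency matrix to 0/1 using a cutoff placed at
    `fraction` of the distance between the minimum and maximum off-diagonal values."""
    n = len(adj)
    values = [adj[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(values), max(values)
    cutoff = lo + fraction * (hi - lo)
    return [[1.0 if i != j and adj[i][j] > cutoff else 0.0 for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2 given directly as a binary adjacency matrix.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
path_tom = tom(path)
```

For the path graph, the two end nodes are not adjacent but share their only neighbor, so their overlap is 0.5. Applying `tom` to the output of `binarize` gives an overlap based on direct neighbors only; counting neighbors up to a higher step $m$ is left out of this sketch.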

In data mining, hierarchical clustering is one of the most popular cluster analysis methods for forming a hierarchy of clusters. There exist two types of strategies: agglomerative and divisive [45]. Agglomerative hierarchical clustering does not need any input parameter except the similarity matrix. Thus, there is no extra burden of cluster initialization: it simply merges the two closest clusters at each iteration and continues until a single cluster remains. Divisive hierarchical clustering follows the same style but in reverse order. This is the major benefit of hierarchical clustering over the traditional K-means clustering algorithm, which is sensitive to initialization.
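The agglomerative strategy can be sketched in a few lines of Python. The sketch takes only a similarity matrix as input, as described above. The choice of average linkage to decide which two clusters are "closest" is an assumption made for this illustration, and the function name `agglomerative` and the optional `n_clusters` stopping parameter are hypothetical conveniences.

```python
def agglomerative(similarity, n_clusters=1):
    """Merge the two most similar clusters until n_clusters remain.

    similarity: symmetric matrix (list of lists) of pairwise similarities.
    Returns the final clusters and the merge history (left, right, score).
    Average linkage is assumed for the cluster-to-cluster similarity.
    """
    clusters = [[i] for i in range(len(similarity))]  # start from singletons
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                score = sum(similarity[i][j] for i, j in pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        merges.append((clusters[a], clusters[b], score))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Four genes: 0 and 1 are similar to each other, as are 2 and 3.
sim = [[1.00, 0.90, 0.10, 0.20],
       [0.90, 1.00, 0.15, 0.10],
       [0.10, 0.15, 1.00, 0.80],
       [0.20, 0.10, 0.80, 1.00]]
clusters, merges = agglomerative(sim, n_clusters=2)
```

No initial cluster centers are needed, in contrast to K-means. Running the loop down to `n_clusters=1` yields the full merge history, which is the hierarchy itself.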

Association rule mining (ARM) [46] is a popular method for generating interesting relationships among different items (viz., genes). Suppose $GST = \{g_1, g_2, \ldots, g_n\}$ is an item set (gene set) and $SST = \{s_1, s_2, \ldots, s_m\}$ is a sample set (viz., transaction set). An association rule can then be stated as $A \Rightarrow C$, where $A, C \subseteq GST$ and $A \cap C = \phi$. Here $A$ and $C$ are called the antecedent and the consequent, respectively. An association rule can be described as a cause-effect relationship between the corresponding item sets in the transactions of a transactional data profile, such as that of a big shopping market, where a set of bought items falls into one transaction. In a similar fashion, many genes may occur together in a sample (transaction) of a gene expression profile or a similar profile. Many of these genes may be up-regulated or down-regulated, whereas the remaining genes will be non-differentially expressed.
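As a small illustration of a rule $A \Rightarrow C$ over samples treated as transactions, the sketch below evaluates a rule with the two standard ARM measures, support and confidence. These measures are not defined in the text above and are introduced here as an assumption, as are the function names and the encoding of discretized gene states (`"g1+"` for an up-regulated gene, `"g2-"` for a down-regulated one).

```python
def support(itemset, transactions):
    """Fraction of transactions (samples) that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A union C) / support(A) for the rule A => C, with A and C disjoint."""
    a, c = frozenset(antecedent), frozenset(consequent)
    if a & c:
        raise ValueError("antecedent and consequent must be disjoint")
    return support(a | c, transactions) / support(a, transactions)


# Each sample lists the genes that are differentially expressed in it;
# non-differentially expressed genes are simply absent from the transaction.
samples = [
    frozenset({"g1+", "g2-"}),
    frozenset({"g1+", "g2-", "g3+"}),
    frozenset({"g1+", "g3+"}),
    frozenset({"g2-"}),
]

rule_support = support({"g1+", "g2-"}, samples)
rule_confidence = confidence({"g1+"}, {"g2-"}, samples)
```

Here the rule {g1 up} ⇒ {g2 down} holds in two of the four samples, and in two of the three samples in which g1 is up-regulated.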

3. Machine learning and rule mining approaches for gene inactivation

Currently, omics data analysis is one of the most popular research domains. It can be categorized into two major types: single-omics data analysis and multi-omics data analysis. Earlier, single-omics data processing such as gene expression data processing was highly popular, and microarray gene expression data was the main data type. Now, microarray data is becoming obsolete, while RNA-seq, next-generation sequencing (NGS) and whole exome sequencing (WES) data have become popular. The major aims of single-omics data analysis were genetic marker identification and gene module identification. In the current era, multi-omics data integration is a big challenge for any researcher since it involves various kinds of profiles that are either proportional or inversely proportional to each other. Different kinds of regression analysis (logistic regression, sglasso [47, 48], flasso [47], etc.) are popular for integrating multi-omics data. For multi-omics data, the aim is to determine a single (or combinatorial) gene marker, a gene signature, or a multi-biomolecular closed bio-circuit. Many machine learning and association rule mining methods have been developed to solve different problems related to gene silencing and disease discovery (see Table 1 for tools and Table 2 for their applications). In this regard, Bandyopadhyay et al. provided a comprehensive survey of various statistical tests for determining differentially expressed transcripts from microarray or other related datasets [69]. A rank-based weighted association rule mining method, RANWAR, was then developed to identify weighted interesting genomic rules applicable to any kind of genomic or epigenomic data [9]. A new gene-based association rule mining approach was developed in [62].

Next, another statistics-based association rule mining technique, “StatBicRM”, was proposed that utilizes a statistical test and the Binary Inclusion-maximal algorithm (BiMax) to find classification-based genetic rules [46]. Recently, the “StatBicRM” algorithm was further enhanced, and a new method of combinatorial marker discovery was developed whose central concept is the inverse relationship between gene expression and methylation pattern [50]. In addition, a mutual information based feature selection strategy was incorporated into the statistical methodology, and a new method of identifying epigenetic biomarkers from bi-omics datasets through maximal relevance and minimal redundancy based feature (gene) selection was proposed [63]. A new method of multi-view gene-module identification was also proposed that applies an integrated methodology of statistical testing and dense subgraph mining [49]. Detection of strongly connected genetic modules in multi-omics regulatory networks is important for the integrated analysis of network-based architecture. Many profiles belonging to multi-omics datasets consist of a massive number of genes, many of which are noisy and redundant. Such noisy and redundant genes (or features) are irrelevant for obtaining knowledge from the data. Furthermore, it is computationally impractical to apply any clustering technique to such huge data profiles to get dense genetic clusters. Researchers often face problems while calculating and subsequently storing the similarity matrix of such massive dimensions, which contains the mutual dependency information between all possible gene pairs for every such profile. So, managing the high dimensionality of the underlying profile is a critical challenge for researchers.

To overcome the “curse of dimensionality” problem, feature selection is treated as one of the most important preprocessing tasks for removing such noisy and redundant genes, which in turn decreases the total elapsed time. The main purpose of feature selection is to find an optimal subset of features with respect to some optimization criteria so that efficient knowledge discovery can be performed [70]. Depending on the availability of class labels, feature selection can be organized into two types: supervised and unsupervised [71]. Unsupervised feature selection does not need class label information while choosing the reduced feature subset [72], whereas supervised feature selection selects a subset of favorable features by utilizing the knowledge of class labels in the feature selection procedure. In supervised feature selection, the significance test [73] and mutual information [74] are some broadly used measures to evaluate the quality of candidate features. In biological research, a statistical test is generally treated as one of the important tools for obtaining the significant genes from large datasets, and it therefore aids in decreasing the size of the dataset. There are different types of statistical tests in the literature, such as the t-test, significance analysis of microarrays, the empirical Bayes test, etc.
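Supervised feature selection with a statistical test can be sketched as follows: each gene is scored by a two-sample t statistic between the two classes, and only the top-ranked genes are kept. This is a minimal illustration rather than the procedure of any cited method. The unequal-variance (Welch) form of the t statistic, the ranking by absolute value in place of a p-value cutoff, and the names `welch_t` and `rank_genes` are assumptions made for this sketch.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic with unequal variances (Welch form).

    Assumes both groups have at least two values and are not both constant.
    """
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, labels, top_k):
    """Return the top_k genes with the largest absolute t statistic.

    expression: dict mapping gene name to a list of values, one per sample.
    labels: list of 0/1 class labels aligned with the samples.
    """
    scores = {}
    for gene, values in expression.items():
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = abs(welch_t(case, ctrl))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy expression profile: three genes measured in three cases and three controls.
expression = {
    "geneA": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],   # clearly different between classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],   # same mean in both classes
    "geneC": [3.0, 3.5, 2.5, 2.0, 2.6, 1.6],   # moderate difference
}
labels = [1, 1, 1, 0, 0, 0]
selected = rank_genes(expression, labels, top_k=2)
```

Keeping only the selected genes shrinks the similarity matrix that later steps such as clustering or graph mining have to compute and store.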

| Method name | Reference | Type | Brief description |
| --- | --- | --- | --- |
| Multi-view gene modules using hypograph mining | Bhadra et al. [49] | Gene-module detection | Module detection from multi-view data using the statistical test and mutual information based dense subgraph. |
| RANWAR | Mallik et al. [9] | Rank based genomic rule mining | Rank based weighted association rule mining to identify interesting genomic rules applicable to any genomic/epigenomic data. |
| Combinatorial marker discovery by integrating multiple profiles | Bandyopadhyay et al. [50] | Combinatorial marker discovery | Integrating gene expression and methylation profiles, and identifying combinatorial gene markers. |
| DTFP-growth | Mallik et al. [51] | Gene based ARM | Multiple-threshold based ARM integrating gene expression, methylation and protein-protein interaction profiles. |
| StatBicRM | Maulik et al. [46] | Statistical biclustering-based rule mining | Statistical biclustering-based rule mining and analyzing the gene expression and methylation data profiles using it. |
| sglasso | Augugliaro [47, 48] | Regression method | The sglasso tool implements the structured graphical lasso estimator for the weighted l1-penalized RCON(V, E) model. |
| flasso | Augugliaro [47, 52] | Regression method | Implements the weighted l1-penalized factorial dynamic Gaussian graphical model. |
| MVDA | Serra et al. [53] | Multi-view genomic profile integration | Combines the different kinds of data at the level of the outcomes of every single-view clustering iteration. |
| Machine learning for epigenetics and future medical applications | Holder et al. [54] | Machine learning and deep learning approaches | Active learning and imbalanced class learning are utilized to address the shortcomings of machine learning, building better feature selection and solving the imbalanced data problem. |
| A machine learning approach to integrate big data for precision medicine | Lee et al. [55] | Molecular marker discovery | Robust molecular markers that might be useful for targeted treatment of acute myeloid leukemia are identified. |
| Deep learning based multi-omics integration robustly predicts survival | Chaudhary et al. [56] | Deep learning based multi-omics integration method | A deep learning method is used to integrate multi-omics data and to perform a survival study on hepatocellular carcinoma. |
| Deep learning for genomics: a concise overview | Yue et al. [57] | Deep learning applications on genomic data | The strengths of various deep learning methodologies applicable to any kind of genomic profile are demonstrated. |
| intNMF | Chalise and Fridley [58] | Integrative clustering method | Integrative clustering of several high dimensional profiles and subtype classification by non-negative matrix factorization (NMF). |
| Multi-modal data analysis for heterogeneous data | Yang and Michailidis [59] | Module detection for heterogeneous data | Multi-modal profile analysis is conducted for heterogeneous data based on NMF. |
| Comparative study and evaluation of the integrative techniques for the multilevel omics data | Pucher et al. [60] | Integrative method for multilevel omics profiles | A comparative study of three integrative methods (viz., NMF, sparse canonical correlation analysis (sCCA) and the logic data mining tool MicroArray Logic Analyzer (MALA)) is conducted on simulated data and real omics profiles. |
| WeCoMXP | Mallik and Bandyopadhyay [61] | Weighted connectivity (similarity) measure | WeCoMXP integrates co-expression, co-methylation and protein-protein interactions, and is useful for determining the similarity between any two molecules. |
| Tumor prediction using integrated analysis of expression and methylation | Mallik et al. [62] | Rule-based classifier | Integrated analysis of gene expression and DNA methylation and classification rule mining for tumor/cancer prediction. |
| Epigenetic gene marker discovery through feature selection | Mallik et al. [63] | Gene based ARM | Epigenetic gene marker discovery using maximal relevance and minimal redundancy based feature selection. |

Table 1.

The machine learning and rule mining methods related to gene inactivation and RNAi.

| Method name | Reference | Type | Brief description |
| --- | --- | --- | --- |
| TF-MiRNA-gene network based modules for cytosine variants | Sen et al. [64] | Module detection | TF-MiRNA-gene network based module detection for 5hmC and 5mC brain samples between human and rhesus. |
| IDPT | Mallik et al. [65] | Intrinsically disordered protein finding | Potential intrinsically disordered protein identification through transcriptomic analysis of genes for epigenetic data. |
| Integration of DNA methylation data and gene expression data | Singh et al. [66] | Finding differentially methylated regions | Differentially methylated regions are determined and further statistical analysis is performed. |
| Application of machine-learning algorithms for gene expression regulation | Cheng and Worzel [67] | Applications of machine learning methods on gene regulation | Machine learning strategies for gene regulation are reviewed, and their functional links mediated by histone modifications and transcription factors are demonstrated. |
| Application of machine-learning techniques on histone methylation | Xu et al. [68] | Regression-based predictive model of gene expression from epigenetic factors | A new model is developed to predict gene expression as a function of histone modification levels through multi-linear regression and multivariate adaptive regression splines. |

Table 2.

Applications of machine learning and rule mining methods related to gene inactivation.

The significant genes therefore provide a weighted graph in which the nodes refer to the significant genes and the weighted edges signify the association between the two connected nodes. Graph data now arise in many emerging fields of study to represent complicated structures, viz., biological networks, chemical compounds, social networks, protein structures, etc. With the increasing demand for the analysis of large structured data, graph mining has become one of the most demanding research topics for identifying the critical relationships among the various entities included in large graphs [75]. In the recent era, analyzing multi-omics datasets is one of the emerging research topics, where different profiles representing several views are used to carry out important tasks, viz., marker determination, classification, and clustering. In this regard, many research works have been performed in directions such as marker identification [76], classification [77], clustering [78], etc. Recently, Bhadra et al. [49] developed a new algorithm that combines a statistical method with normalized mutual information oriented hypograph mining to find the co-similar genetic modules present in multi-omics datasets. Previously, various statistical (viz., correlation and regression oriented) and/or weight-based techniques (viz., [79]) had been developed for multi-omics data integration, but not for multi-omics genetic module detection. Furthermore, some multi-view data integration mechanisms employ various soft-computing methods such as clustering, non-negative matrix factorization, etc. Recently, Serra et al. [53] proposed a framework for combining the different data profiles of multi-view datasets by integrating the clustering results obtained on each profile through non-negative matrix factorization. Pucher et al. [60] provided a comprehensive review and comparative study of three integrative methods (viz., non-negative matrix factorization (NMF), sparse canonical correlation analysis (sCCA) and the logic data mining tool MicroArray Logic Analyzer (MALA)) on simulated data as well as real omics profiles. In addition, many deep learning techniques have also been developed to handle biological data. Chaudhary et al. [56] proposed a deep learning based methodology to integrate multi-omics data and robustly perform a survival study on hepatocellular carcinoma. Furthermore, there are many interesting applications of the above machine learning and deep learning techniques. For example, Xu et al. [68] developed a new regression-based model to predict gene expression as a function of histone modification/variant levels through consecutive regression methods (viz., multi-linear regression and multivariate adaptive regression splines). Mallik et al. [65] performed a comprehensive analysis to identify potential intrinsically disordered proteins through the transcriptomic analysis of genes for expression and methylation data. Finding differentially methylated regions is also an area of interest. A comparison of the different classifiers used in many tools related to RNAi and gene inactivation is given in Table 3.

| C4.5 classifier | K-nearest neighbors (KNN) classifier | Naive Bayes classifier | Support vector machines (SVM) classifier | Artificial neural networks (ANN) classifier |
| --- | --- | --- | --- | --- |
| Can use both discrete and continuous values. | Can use only continuous values. | Can use both discrete and continuous values. | Can use only continuous values. | Can use both discrete and continuous values. |
| Handles noise. | Sensitive to noisy features. | Sensitive to noisy features. | Is less effective when data contains noisy features. | Handles noisy features. |
| Classes need not be linearly separable. | Classes need not be linearly separable. | Classes need not be linearly separable. | Works well even if data is not linearly separable in the input feature space. | Works fine even if data is not linearly separable in the input feature space. |
| Faces the problem of overfitting. | Overcomes the problem of overfitting. | Faces the problem of overfitting. | Overcomes the problem of overfitting. | Overcomes the problem of overfitting. |
| Needs large searching time. | Requires higher searching time for larger data. | Enormous computational efficiency. | Needs higher searching time for larger data. | Needs high processing time if the neural network is huge. |

Table 3.

Comparison of different classifiers.

4. Biological challenges for gene inactivation

Several challenges remain to be addressed for RNAi, including off-target effects, cytotoxicity, non-specific gene silencing, activation of the innate immune system, the activity of the siRNA itself, and the lack of efficient in vivo delivery systems and vehicles needed for clinical implementation [80]. Effective in vivo delivery of RNAi therapeutics is one of the most important of these challenges, and several parameters have to be considered for efficient silencing: particle size, duration of the RNAi effect, siRNA stability and modification, the delivery system, and the clearing of off-target effects [81]. Apart from these challenges, the development of efficient tissue-specific and differentiation-dependent expression of siRNA is essential for transgenic and therapeutic approaches. Bioactive drugs have been shown to perturb the naturally running system because they can clog or saturate biochemical pathways. Since siRNA/shRNA relies on the endogenous microRNA machinery, high doses of ectopic RNA risk saturating the components of the miRNA pathway. Grimm et al. [82] observed fatality associated with high doses of liver-directed AAV-encoded shRNAs in mice, where high doses killed the recipient mice within 2 months. The length threshold of siRNAs seems to vary among cell types, and it is an important consideration because long dsRNA induces innate immune responses that eventually lead to cell death in mammalian cells. dsRNA shorter than 30 nucleotides has been shown not to induce cellular toxicity in mammalian cells, whereas longer dsRNA is known to rapidly induce interferon responses. This calls for careful risk assessment when using longer and more potent Dicer-substrate siRNAs. Moreover, correct RNAi targets are a must, though ideal specificity of RNAi targets has not been shown. If RNAi silences off-targets, it can alter gene function, which is clearly undesirable; care should therefore be taken beforehand not to suppress off-targets. About one third of randomly chosen siRNAs have been reported to result in a toxic phenotype [83]. A comparison of siRNA and miRNA is given in Table 4. Despite these challenges, successful in vitro and in vivo experiments raise hopes of treating human disease with RNAi.

The epigenetic network is one of the complex regulatory networks in which epigenetic mechanisms, such as DNA methylation and modifications to histone proteins, regulate gene expression and high-order DNA structure [84]. Epigenetics is the study of heritable changes in phenotype that do not involve changes in the DNA sequence. DNA methylation [85] is an epigenetic factor that represents the addition of a methyl group (–CH3) to the fifth position of a cytosine pyrimidine ring or to the sixth nitrogen position of an adenine purine ring in genomic DNA. DNA methylation generally decreases the gene expression level. In this connection, copy number variation (CNV) [86] is another recent domain of research in genomics. It is an event in which portions of the genome are repeated, and the number of repeats varies from individual to individual in the human population. Copy number variation is a category of structural change: it is a duplication or deletion event that generally affects a considerable number of base pairs. Recent research indicates that around two-thirds of the human genome is made up of repeats. In mammals, copy number alteration contributes significantly to the variation observed in both the population and disease phenotypes. Cancer arises from various types of somatic genetic changes, including copy-number alterations, which affect the activity of critical genes regulating cell growth.

The disadvantages and advantages of RNAi, together with possible strategies to overcome the disadvantages, are summarized briefly in Table 5.

| siRNA | miRNA |
| --- | --- |
| Must be fully complementary to its target mRNA. | Can be partially complementary to its target mRNA. |
| 21–23 nucleotide RNA duplex; the origin of endogenous siRNAs is more debated. | 19–25 nucleotide RNA duplex, derived from gene units. |
| dsRNA (30–100 nucleotides) before Dicer processing. | Precursor miRNA (70–100 nucleotides) with interspersed mismatches and hairpin structure before Dicer processing. |
| One mRNA target. | Can have multiple targets (>100 at the same time). |
| Gene regulation occurs through endonucleolytic cleavage of mRNA. | Gene regulation occurs through translational repression or degradation of mRNA. |
| Used as a therapeutic agent. | Used as a drug target, therapeutic agent, and diagnostic and biomarker tool. |
| siRNAs shut down gene expression at the post-transcriptional level through mRNA degradation. | miRNAs silence their target genes mainly through translational repression. |
| Occurs in plants and lower animals; occurrence in mammals is questionable. | Occurs in plants and animals. |
| Rarely evolutionarily conserved. | Evolutionarily conserved most of the time in related organisms. |

Table 4.

Comparison of siRNA and miRNA.
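
The first row of Table 4 (full versus partial complementarity) can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: pairing is strict Watson-Crick (G:U wobble pairs are ignored), the guide sequence is made up, and the `max_mismatches` cut-off and the function names are hypothetical choices for illustration, not values or methods taken from this chapter.

```python
# Strict Watson-Crick pairing for RNA (assumption: G:U wobble is ignored).
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(PAIR[base] for base in reversed(rna.upper()))


def mismatches(guide: str, site: str) -> int:
    """Count unpaired positions between a guide and an equal-length target site."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    # The site that a perfectly complementary guide would bind.
    expected = reverse_complement(guide)
    return sum(1 for a, b in zip(expected, site.upper()) if a != b)


def classify_sites(guide: str, mrna: str, max_mismatches: int = 3):
    """Slide over the mRNA and label each window as a 'full' or 'partial' match.

    'full' means zero mismatches (siRNA-like pairing); 'partial' means between
    1 and max_mismatches mismatches (miRNA-like pairing). The cut-off is a
    hypothetical illustration value.
    """
    hits = []
    n = len(guide)
    for start in range(len(mrna) - n + 1):
        m = mismatches(guide, mrna[start:start + n])
        if m == 0:
            hits.append((start, "full"))
        elif m <= max_mismatches:
            hits.append((start, "partial"))
    return hits


guide = "AUGGCUAGCUAGGCUAAUCGA"  # made-up 21-nt guide strand
perfect_site = reverse_complement(guide)
print(classify_sites(guide, "AAAA" + perfect_site))
```

In this simplified view, a window labelled `full` corresponds to the siRNA-style requirement of complete complementarity, while `partial` windows stand in for miRNA-style binding; real target prediction uses considerably richer rules than a mismatch count.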

| Disadvantages | Advantages and possible therapeutic strategies |
| --- | --- |
| RNAi-based therapeutics can trigger several off-target (unintended) effects and hence host innate immune responses. | Strategies for selective internalization that work with the endogenous mechanism without disrupting the natural pathway should be used to achieve maximal benefit from RNAi-based therapeutics. |
| Pol III-expressed shRNAs delivered in an AAV vector by tail-vein injection in mice were lethal due to acute liver failure. | Levels of ectopic expression of therapeutic shRNAs should be carefully controlled (low yet effective) to avoid off-target effects. |
| Naked siRNA has poor cellular uptake, activates toll-like receptors and does not target specific cell types. | Naked siRNAs are comparatively stable and non-immunogenic. |
| Viral vectors for shRNA are expensive to create and cause immune reactions. | High affinity toward infecting target cells; expression can be long-lasting. |
| Lack of efficient delivery systems is the most critical challenge for the therapeutic applications of small RNAs. | Identify the critical problem from the literature and allow researchers to publish failed ideas. |

Table 5.

Disadvantages, advantages of RNAi and possible therapeutic strategies.

5. Conclusion

RNAi and gene inactivation are well-known topics in biomedical research, and miRNA and siRNA are closely associated with RNAi. Various categories of algorithms associated with RNAi and gene silencing have been developed in the last two decades. In this book chapter, we provided a comprehensive review of various machine/deep learning and association rule mining algorithms that have been developed for handling different biological problems such as gene signature detection, multi-omics data integration, single/combinatorial biomarker identification, gene module detection, potentially disordered protein detection, differentially methylated region finding, and many more. We then included a comparative study of several well-known classifiers along with other approaches used in this area. In addition, we provided a brief description of the major biological challenges for gene inactivation, along with the advantages, disadvantages and possible therapeutic strategies of RNAi. Finally, this chapter helps bioinformaticians understand the central idea of RNAi and gene silencing, along with the associated machine/deep learning and association rule mining algorithms, for the benefit of disease discovery and possible therapeutic value.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Saurav Mallik, Ujjwal Maulik, Namrata Tomar, Tapas Bhadra, Anirban Mukhopadhyay and Ayan Mukherji (April 23rd 2019). Machine Learning and Rule Mining Techniques in the Study of Gene Inactivation and RNA Interference, Modulating Gene Expression - Abridging the RNAi and CRISPR-Cas9 Technologies, Aditi Singh and Mohammad W. Khan, IntechOpen, DOI: 10.5772/intechopen.83470. Available from:
