Entropy Based Biological Sequence Study

SARS-CoV-2 virus strains are analysed as digitized sequences of information using the notion of entropy. Special attention is paid to the occurrence of particular patterns in the coronavirus sequence. The incidence of a genetic word is represented as a density. The incidence frequency of each q-gram genetic word is determined along the sequence with a finite impulse response (FIR) filter. This frequency is in turn used to determine the probability distribution of genetic word incidence, which is the input for calculating the entropy of the sequence. The sequence entropy is then used in principal component analysis (PCA) to determine the similarity or dissimilarity between viral sequences. We consider seven human coronavirus sequences and present an entropy-based similarity study for SARS-CoV-2 strains.


Introduction
The entropy of the amino acid sequences encoded in the DNA of an organism can be considered a measure of the diversity of its proteins. The higher the entropy, the greater the possible variation in the information content coded by the nucleic acid [1]. This idea is used in the present study to understand the variation in the genetic sequences of different novel coronaviruses that have infected people across the world, leading to one of the world's biggest pandemics. The pandemic itself highlights the importance of tracking the dynamics of viral transmission in real time. Moreover, because the virus mutates frequently, each sequence is studied and compared with the others to understand the variation in the information transmitted from one species to another. A hyper-variable genomic hotspot for the novel coronavirus SARS-CoV-2 has already been identified by Wen et al. [2]. Likewise, similarities in the genetic code provide important information for understanding the virus and preventing its spread.
This study presents the identification and analysis of regions of similarity in SARS-CoV genetic sequences [11][12][13]. From the viewpoint of information theory, the individuals of a species can be regarded as aggregates that propagate information from the past to the future. The Shannon entropy is taken as a measure of the order or disorder of the nucleotide sequences of DNA [14]. The information in a genetic code consists of a sequence over the four letters A, C, G and T, which symbolize the four nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). The sequences of most SARS-CoV-2 genes have been determined and are accessible in computer-readable form. The probability of occurrence of a particular group of symbols in a sequence is a measure of the order in that sequence. An alignment-free approach to DNA sequence analysis, n-mer (word) frequency estimation, is used in this work.

Methodology
Our method is based on observation through a sliding "counter" of width W over the DNA sequence [15]. A certain number of q-grams, called bins, are set in the counter. As there are only four letters in the DNA alphabet, viz., {A, C, G, T}, the number of all possible q-grams in a DNA sequence is $4^q$.

Definition 1. q-gram of Sequence.
Given a sequence 'seq', its q-grams are formed when a window of length q slides over the characters of 'seq'. For a sequence 'seq', there are $|seq| - (q - 1)$ q-grams. The number of all possible q-grams, called "bins", is $4^q$. Bins can be arranged in lexicographic order, and $b_i$ denotes the $i$-th bin in this order. All the possible bins are denoted as $B_q = \{b_1, b_2, \ldots, b_{4^q}\}$. For a sequence, the q-gram bin signature $S_j$ is a mapping for the bin $b_j$ ($b_j \in B_q$), in which the $i$-th bit of $S_j$ corresponds to the presence or absence of $b_j$ at position $i$. For a sequence 'seq', there are $|seq| - (|b_j| - 1)$ bits in $S_j$.
Example 2. Consider the sequence S = "AACTCG". For q = 2, the signature of the second bin in lexicographic order, "AC", is $S_2 = [0\ 1\ 0\ 0\ 0]$.
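Definition 1 and Example 2 can be illustrated with a minimal Python sketch. The function names (make_bins, qgrams, bin_signature) are hypothetical and chosen only for this illustration; they are not taken from the algorithms in Tables 1-3.

```python
from itertools import product

ALPHABET = "ACGT"  # already in lexicographic order

def make_bins(q):
    """Return all 4**q possible q-grams (bins) in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=q)]

def qgrams(seq, q):
    """Slide a window of length q over seq and collect its q-grams."""
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]

def bin_signature(seq, word):
    """Binary signature: bit i is 1 when `word` starts at position i of seq."""
    return [1 if g == word else 0 for g in qgrams(seq, len(word))]

bins = make_bins(2)
print(len(bins))                          # 16 bins for q = 2
print(qgrams("AACTCG", 2))                # ['AA', 'AC', 'CT', 'TC', 'CG']
print(bin_signature("AACTCG", bins[1]))   # bins[1] is "AC" -> [0, 1, 0, 0, 0]
```

The signature has one bit per q-gram, so its length is the sequence length minus (q - 1), which gives five bits for the six-letter example.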

Definition 3. Filter.
A sequence x[n] is filtered by mapping it to an output sequence y[n] through a weighted window b of width W, by means of the convolution summation
$$y[n] = \sum_{k=0}^{W-1} b[k]\, x[n-k],$$
where b is independent of x[n] and y[n], and n is the time index. y[n] is the response of the filter to the input signal x[n]. The filter is a finite impulse response (FIR) digital filter. It is called a digital filter because it operates on discrete-time signals, and finite impulse response because the filter output is computed as a weighted sum of a finite number of past and present input values (Figure 1).
Example 3. The weighted filter output of $S_A$ is computed with the weighted window β = [0.2 0.1 0.3 0.4]. For the nucleotide density calculation, an evenly distributed window of unit value is used. As explained, the output of the convolution summation represents the nucleotide density along the sequence. The detailed algorithms for bin construction, bin signature and filter operation are given in Tables 1-3, respectively.
Table 1 (bin construction). Input: q, the length of a bin. Output: the set of bins.
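The filter operation of Definition 3 can be sketched in a few lines of Python. This is an illustrative sketch, not the algorithm of Table 3: the names fir_filter and word_density are hypothetical, and samples before the start of the sequence are assumed to be zero.

```python
def fir_filter(x, b):
    """Causal FIR filter: y[n] = sum_k b[k] * x[n - k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(b):
            if n - k >= 0:          # assumption: zero padding before the sequence start
                acc += weight * x[n - k]
        y.append(acc)
    return y

def word_density(signature, width):
    """Density of a word: its bin signature filtered with a unit-valued window."""
    return fir_filter(signature, [1.0] * width)

signature = [0, 1, 0, 0, 0]                          # signature of "AC" in "AACTCG"
print(word_density(signature, 3))                    # occurrences within the last 3 positions
print(fir_filter(signature, [0.2, 0.1, 0.3, 0.4]))   # weighted window applied to the same signature
```

Because the signature contains a single 1, the weighted output simply reproduces the window weights shifted to the position of the word, which is a quick way to check the filter. Dividing the unit-window output by the window width would turn the counts into relative frequencies; whether the source normalizes in this way is not stated.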

Sequence analysis
The filter output is taken as a density distribution for DNA sequences. The density distribution is based on the q-gram word density, which in turn is used to determine the Shannon entropy as
$$Y_i = -\sum_{j} p_{ij} \log p_{ij}, \qquad (2)$$
where $p_{ij}$ is the probability of appearance of the $j$-th genetic letter at the $i$-th position in the genetic sequence. We then want a similarity/dissimilarity measure between two entropy distributions $\rho_i = (y_{i1}, y_{i2}, \ldots, y_{in})$ and $\rho_j = (y_{j1}, y_{j2}, \ldots, y_{jn})$. We construct the data matrix $D = [\rho_1, \rho_2, \ldots, \rho_m]^{T}$, where m is the number of sequences. Principal component analysis (PCA) is used to estimate scores between density distributions; it reduces multidimensional data sets to lower dimensions while remaining consistent with the original data matrix [16].
We determine the dissimilarity between two sequences from the scores on the first three principal components by computing the Euclidean distance between pairs of density distributions in the m-by-n data matrix D. Rows of D correspond to sequences (observations) and columns correspond to the position index in the sequence (variables). Thus, the Euclidean distance X is a row vector of length m(m-1)/2, corresponding to pairs of observations in D. The distances are arranged in the order (2, 1), (3, 1), … , (m, 1), (3, 2), … , (m, 2), … , (m, m-1). X is used as a dissimilarity matrix in clustering or multidimensional scaling. The unweighted pair group method with arithmetic mean (UPGMA) is applied to the PC scores to construct a phylogenetic tree [17]. UPGMA uses a local objective function to construct a rooted bifurcating tree.
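A position-wise entropy profile of this kind can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the probabilities are estimated as relative q-gram frequencies inside a sliding window of `width` q-grams, the logarithm is taken to base 2, and the function name window_entropy is hypothetical.

```python
import math
from collections import Counter

def window_entropy(seq, q, width):
    """Shannon entropy (in bits) of the q-gram distribution in each sliding window."""
    grams = [seq[i:i + q] for i in range(len(seq) - q + 1)]
    profile = []
    for start in range(len(grams) - width + 1):
        counts = Counter(grams[start:start + width])
        # Assumption: probability = relative frequency of each word in the window.
        probs = [c / width for c in counts.values()]
        profile.append(-sum(p * math.log2(p) for p in probs))
    return profile

# One entropy value per window position along a short toy sequence.
print(window_entropy("AACTCGAACT", 1, 4))
```

A window containing a single repeated word gives zero entropy, while a window in which all four mono-mers are equally frequent gives the maximum of 2 bits, matching the usual limiting behaviour of the Shannon entropy.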
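The PCA scores and the pairwise distance vector described above can be sketched with NumPy. This is a minimal sketch, not the authors' code: computing PCA through a singular value decomposition of the centered data matrix is an assumption, and the names pca_scores and condensed_distances are hypothetical.

```python
import numpy as np

def pca_scores(D, n_components=3):
    """Project the rows of D (sequences) onto the leading principal components."""
    centered = D - D.mean(axis=0)
    # SVD of the centered matrix; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    return centered @ Vt[:k].T

def condensed_distances(scores):
    """Euclidean distances in the order (2,1), (3,1), ..., (m,1), (3,2), ..., (m,m-1)."""
    m = scores.shape[0]
    return np.array([np.linalg.norm(scores[i] - scores[j])
                     for j in range(m) for i in range(j + 1, m)])

# Toy data matrix: 4 "sequences" (rows) by 6 positions (columns).
rng = np.random.default_rng(0)
D = rng.random((4, 6))
X = condensed_distances(pca_scores(D))
print(X.shape)   # one distance per pair of rows: m(m-1)/2 entries
```

If SciPy is available, a vector in this condensed form can be passed to `scipy.cluster.hierarchy.linkage(X, method="average")`, whose "average" method is the UPGMA algorithm.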

Results and discussions
The nucleotide density distribution was obtained with the FIR filter. We calculated the density distribution for one-, two- and three-gram nucleotide words for the different species. We then calculated the entropy distributions ρi = (yi1, yi2, … , yin) and ρj = (yj1, yj2, … , yjn). The variation of entropy with position was calculated for all sequences for the above three word lengths. The entropy values were found to be lowest for mono-mer density distributions in individual sequences, increasing linearly for di-mers and codons respectively. Observations based on the position of the n-mers in the SARS-CoV-2 sequences reveal significant minimum-entropy regions for codons. Figure 2 shows the entropy profile calculated over 29000 bases for 7 DNA sequences. The corresponding profiles for mono-mers and di-mers do not show overlapping regions for different sequences. This suggests that codons are more effective in transferring information between species. Codon bias has been reported for the HIV-1 virus [18]. It can therefore be inferred that, in the various novel coronavirus strains, the codons at specific positions have the highest bias, which corresponds to minimum entropy, and hence carry the maximum information. Further studies of the sequences at these loci could be useful in genetic engineering for developing vaccines or for controlling the spread of the second wave of the pandemic.
We have chosen seven SARS coronavirus (SARS-COV) sequences from various countries. Details of the organisms are presented in Table 4.
Based on FIR filtering, the nucleotide density distribution is generated first, and the entropy distributions are then calculated from it. Figure 3 displays the spatial variation of the entropy along the SARS-COV sequence for the seven species.
It is inconvenient to follow all of the entropy variation in a 2D graphical representation. For example, the organism HKU1 shows entropy minima at positions around 7400, 10000 and 23000, among others, and the Amsterdam strain, NL63, shows minima at around 7300 and 8000. The entropy profiles of the other strains, however, are crowded, and it is difficult to distinguish their variation individually. It is more comprehensive to show the entropy variation for all seven sequences in a single panel, as in Figure 3.
The present work assesses the variability and complexity at each nucleotide site by calculating the entropy at each position with the Shannon entropy formula, Eq. (2). The low-entropy regions around positions 7400 and 9000 are common to all 7 sequences (Figure 3). The entropy ($Y_i$) is an important parameter for understanding sequence stability. $Y_i$ is maximal when all symbols occur with equal probability. Conversely, $Y_i$ is minimal when one symbol occurs with probability 1, in which case the other symbols are forbidden. This means that the lower the entropy, the more stable and less complex the site. Under this assumption, the zones around positions 7400 and 9000 are the most stable for all strains/species. There may be a structural relationship between the regions of low entropy and the secondary structure of proteins, which includes α-helices, β-sheets and loop regions. Strains 4-7 (HCoV-OC43, NL63, HCoV_229E and HKU1) show stability, with lower entropy, around site positions 8 K, 9 K, 11 K and 12 K. This behaviour is not exhibited by strains 1-3 (Wuhan-Hu-1, CV7 and MERS-CoV/C1272). Taking these strains as a whole, the entropy, and with it the complexity, is seen to increase. This is an indication of evolutionary development among the SARS-COV strains. Based on the site entropy, we prepared the dissimilarity matrix for the sequences (Figure 4).
The dissimilarity matrix demonstrates the existence of 4 different clusters. The SARS-COV sequences within a cluster show little dissimilarity among themselves; in other words, sequences residing in the same cluster are highly similar [19]. The COVID sequence appearing in cluster I is the typical one from Wuhan, China. Examination of the Wuhan virus genome sequence identified a β-CoV strain [20]. The Wuhan novel β-CoV showed 88% similarity with the sequences of two bat-derived SARS-COV strains, including bat-SL-CoVZC45, and it was named "SARS-CoV-2" by the International Virus Classification Commission. The genome of SARS-CoV-2 is similar to that of typical CoVs. It encompasses more than ten open reading frames (ORFs). The first ORF covers about two-thirds of the viral RNA, which is translated into two large polyproteins, pp1a and pp1ab. These proteins help to form the viral replicase transcriptase complex [21]. The remaining one-third of the viral RNA is translated into four structural proteins: spike (S), envelope (E), nucleocapsid (N) and membrane (M) proteins [22].
Cluster II comprises the two strains CV7 and MERS-CoV, which belong to the β-CoV genera, as does the SARS-CoV-2 strain placed singly in cluster I. The two HCoV strains HCoV-229E and HCoV-OC43, which fall in the mixed region of clusters III and IV, are members of the α-CoV genera. The cluster presentation (Figure 5) shows that they belong to cluster III. The remaining two strains, NL63 and HKU1, are placed in cluster IV.
The phylogenetic relation among the strains is represented in Figure 6. We obtain the phylogenetic tree of the data set by applying the unweighted pair group method with arithmetic mean (UPGMA) to the PC scores. The phylogenetic tree clearly shows the relationship among all COVID strains within each cluster. We further sub-cluster the strains in each cluster based on their genetic distance (GD). The PC scores are used to determine the dissimilarity, or genetic distance, between two organisms.
The COVID strains are shown explicitly as clusters in Figure 5. The scores are determined by principal component analysis, and three principal components are taken into consideration. Each strain is represented as a state point in a scatter plot in the three-PC space. The cluster presentation is in good agreement with the phylogenetic relations. The Wuhan-Hu-1 strain is well isolated from all other strains and belongs to cluster I. Each of the other three clusters contains two strains. Cluster II comprises the two strains CV7 and MERS-CoV, belonging to the β-CoV genera (encircled by the blue ellipse in Figure 5). As mentioned above, the strains HCoV-229E and HCoV-OC43 lie in cluster III, displayed as two state points encircled by the green ellipse. The remaining pair of strains, NL63 and HKU1, are placed in cluster IV, which is marked by the yellow ellipse.

Conclusions
Entropy has been used to select regions of SARS-COV genomes for the detection of stability zones. Even though a great deal of genetic variation is generally found, the present entropy calculation is sufficient to reveal regions of low informational complexity, which represent the conserved sites of the sequence. The low-entropy regions are related to important functional domains in the proteins of these viruses. Based on the entropy calculations, seven SARS-COV genomes have been described phylogenetically, and the formation of clusters among the genomes is well understood.