Analysis of Next-generation Sequencing Data in Virology - Opportunities and Challenges

Viruses are the most abundant and the smallest organisms, which are relatively simple to sequence. Genome sequence data of viruses for individual species to populations out‐ number that of other species. Although this offers an opportunity to study viral diversity at varying levels of taxonomic hierarchy, it also poses challenges for systematic and struc‐ tured organization of data and its downstream processing. Extensive computational anal‐ yses using a number of algorithms and programs have opened exciting opportunities for virus discovery and diagnostics, apart from augmenting our understanding of the intri‐ guing world of viruses. Unravelling evolutionary dynamics of viruses permits improved understanding of phenomena such as quasispecies diversity, role of mutations in host switching and drug resistance, which enables the tangible measurements of genotype and phenotype of viruses. Improved understanding of geno-/serotype diversity in corre‐ lation with antigenic diversity will facilitate rational design and development of effica‐ cious vaccines against emerging and re-emerging viruses. Mathematical models developed using the genomic data could be used to predict the spread of viruses due to vector switching and the (re)emergence due to host switching and, thereby, contribute to‐ wards designing public health policies for disease management and control.


Viruses: Special class of organisms
Viruses form a major class of biological entities encompassing diverse environments ranging from algae in marine ecosystems to soil, plant, human and animal systems. Several metage-nomic studies have revealed the possibility of viruses being the dominant species of our biosphere [1]. Deep sequencing efforts have shown that viruses form 10 6 -10 9 particles per millilitre of seawater [2]. It is also interesting to note that ~90% of the reads obtained from such experiments did not encode proteins, which are reported in other organisms, including viruses, that have been characterised so far. This clearly demonstrates that the actual viral diversity has not been sampled in an adequate manner so far. A crucial aspect of viral studies is the disease burden associated with them, which is known to be enormous with serious economic implications. World Health Organization documents that the global burden of communicable diseases (of which viral diseases form a major chunk) is ~15 million annually [3].
Beyond abundance aspects, study of viral evolution and genetic variations enabled the proposal of the virocentric standpoint of the evolution. Viruses gained centre stage for reasons such as being smallest replicating entities, having short generation time, large population sizes and high replication and mutation rates. Attributes such as variation in genome sizes, gene pool, shape and assembly of particles are responsible for viruses to attain pivotal role in the study of evolution [4]. It has been observed that all plausible replication and expression strategies have been employed by viruses to dynamically adapt to the ever-changing environments. Processes like complementation, recombination, reassortment, high mutation rate and existence as quasispecies enable the viruses to outgrow and outcompete the host immune system. The molecular forces driving these processes can be delineated by sequencing and the subsequent analyses.

Viral sequencing methods
The distinction of complete genome ever to be sequenced belongs to bacteriophage ΦX174 with a genome size of 5,386 bases and was achieved through the Sanger's shotgunsequencing approach [5]. The major aim of early sequencing projects was to characterize the genomic content of an organism in terms of its coding potential. Over the last few years, the unprecedented growth in the area of sequencing technologies has had a huge impact on the way viral genomes are being addressed. The scale of generating and handling data, which was unimaginable previously, has become a reality today due to the advent of Next-Generation Sequencing (NGS) technologies. Advantages of NGS over the conventional Sanger sequencing approach are the rapid generation of sequencing data on a very massive scale and at affordable cost. NGS also provides scope for wide range of studies that include transcriptomics, gene expression and regulation (DNA-protein interaction), single-nucleotide polymorphism (SNP) and RNA profiling. Sequencing of viruses, in particular, has been important to understand the spread of epidemics, the circulating viral particles and the improvement of strains for vaccine design. Different technologies such as Roche 454 [6], Illumina [7], Ion Torrent [8] and more recently the fourth-generation sequencing methodologies popularly called single-cell sequencing, viz. Oxford Nanopore [9] and Pacific Biosciences [10], are available for sequencing.
Sample preparation and enrichment are the prerequisites for sequencing the viromes. Filtration and centrifugation on caesium chloride density gradient have proved to enrich the virus-like particles. A strategy like depletion of host rRNAs is also known to increase the virus fraction and has been attributed to the discovery of several novel RNA viruses [11]. In plant virology, use of CF11 cellulose spin column is routinely used for deep sequencing of dsRNA.
There exist several scenarios for sequencing viral genomes such as sequencing of individual strains or population [12]. Sequencing of individual genomes helps to catalogue the genes encoded in a particular strain and is a vital step for in-depth characterization studies. Sequencing of multiple isolates/strains/species enables understanding of the factors responsible for varying virulence using comparative genomic approaches [13]. For understanding the coevolution of viral and host genomes, in particular, archaea and bacteria, Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) spacer sequencing is used [14]. CRISPR are found in archaea and bacteria that serve as an antiviral mechanism in which viral genomic sequences are integrated as CRISPR spacers into the host, thereby making it immune to viral infection [15]. Understanding complex dynamics of virus-host interactions in higher organisms using sequencing provides valuable insights into transmission between animal reservoirs [16]. Sequencing of 'Auxiliary metabolic genes', which are involved in processes like motility and transcriptional repression, enables to unravel the viral genes that influence host machinery in diverse ways [17].

Data assembly and annotations
Output from NGS technologies results in gigabases of raw sequence data per experiment. Extensive computational analysis using a number of algorithms and applications is required to infer biological significance. Generic steps include mapping of reads using either de novo approach or re-sequencing approach, identification of SNPs and detection of insertions/ deletions (indels) and further downstream processing.
The various steps involved in data preprocessing are:

i.
Removal of adaptors and low-quality sequences: This is an important step in data pre-processing, and tools such as FASTX [18] and FASTQC [19] are used for this purpose. Care should be taken in case of paired-end sequences to ensure that the reads trimmed based on the quality is reflected in both the forward and the reverse FASTQ files. In case of multiplex sequencing data, an additional step of 'de-multiplexing' based on barcodes is mandatory.

ii.
Screening host sequences: Despite the methods being available for viral enrichment, it has been observed that contamination of host/vector sequences is a routine scenario. Filtering of such data ensures that no error is propagated.
Following preprocessing, reference-based mapping or de novo assembly of the processed reads can be carried out.

Reference-mapping
Alignment with a reference genome is a method of choice for most NGS experiments. Preprocessed reads when mapped to a well-annotated reference genome ensure transfer of annotations to the query genome in a hassle-free manner with statistical confidence, especially in indel-free regions. Polymorphic regions can also be identified, which account for the isolatespecific variants that may be responsible for the observed phenotype. The algorithms generally rely on indexing of either the query reads or the reference genome using suffix tree or hashing strategy [20][21][22]. Indexing the reference genome has been proved to be computationally advantageous and is widely preferred. Indexing is followed by gapped or ungapped alignment based on either Smith-Waterman [23] or Needleman-Wunsch dynamic programming approaches [24]. Gaps indicate indels and are important to gain strain-/species-specific properties. The quality of the reference alignment can be improved by using large inserts available in paired-end reads as compared to single-end reads wherein forward and reverse orientation of reads cannot be calculated. Downstream processing of aligned and assembled reads involves delineating the variant regions followed by annotation. It is also important to remove polymerase chain reaction (PCR) artefacts before variant calling as the duplicated reads hamper its sensitivity. Discovery of Schmallemberg virus, a new member of genus Orthobunyavirus that causes foetal abnormalities in ruminants [25], is attributed to a referencebased assembly approach.
Delineation of variant regions: All deviations from reference genome can be delineated as variants, which include SNPs and indels. Variant regions contribute to the nucleotide diversity in virus populations and hence play a vital role in their evolution and dynamics. One of the main parameters indicative of nucleotide diversity is the comparison of synonymous to nonsynonymous codon substitution. Synonymous mutations result in neutral substitution, which enable in maintaining the phenotype, as compared to non-synonymous substitutions, which lead to amino acid alteration and hence may affect phenotype. It is interesting to note that the existence of overlapping reading frames in viruses often constrains synonymous substitutions. Hence, computation of the magnitude of synonymous and non-synonymous polymorphism within viral populations will provide a handle to assess the role of neutral evolution and genetic drift in viral evolution. A more detailed discussion of the role of these substitution ratios in adaptive evolution of viruses is given in Section 4.5.
Tools like SNPgenie [26] and VirVarSeq [27] have been developed with a focus on calling SNPs from pooled viral samples by including codon information in an explicit manner and hence are more sensitive than traditional SNP callers [28,29].

De novo assembly
Preprocessed reads are assembled using de novo approaches, when a closely related homologue is unavailable to serve as a reference. It should be mentioned that genome assembly is computationally challenging and also requires trained manpower. Sequencing depth plays a major role in determining the quality of the assembly as does the length of the reads. Popularly used assemblers are based on de Bruijin graph approach in which reads are divided into subsequences called k-mers of length k [30]. The k-mers form the nodes of a graph, which are linked when a k-1mer is shared among them. The overall process requires large amounts of computer memory (RAM) and specialized compute clusters.
The steps involved in assembly process are: i.
Based on Overlap-Layout-Consensus principle, information stored in scattered reads are used to make contiguous regions termed 'contigs', which are generally devoid of polymorphisms. ii.
Using insert information, 'contigs' are combined to form 'scaffolds'. Gaps between contigs are usually filled with nucleotides (Ns).

iii.
Scaffolds in conjunction with synteny and geneorder information are used to build larger scaffolds.
Building a draft genome is an iterative process and involves parameter optimization, and it is advised that more than one type of assembler be used as each of them has been built for a definite purpose and has unique features. The final assembled genome is evaluated on the basis of N50 parameter. N50 is the median of assembled sequence lengths, in which longer sequences are given more weightage. Mis-assemblies due to wrong orientation of reads and low-complexity regions are, however, not accounted for in N50 parameter and tools like amosvalidate, which combines multiple validation procedures, are recommended [31].
One of the major limitations of de novo assembly using NGS data is its reporting of large proportion of incorrect recombinants. This arises mainly due to overlapping of short reads of varying quality and coverage, which in turn pave way for the introduction of spurious SNPs, ultimately resulting in artefacts in assembly. The in silico chimeras thus produced amplify diversity estimation and complicate true recombination detection. Efforts are being made to overcome this issue using probabilistic method, which assumes that true SNPs are under selection pressure and hence co-occur within a haplotype as compared to random SNPs [32]. Methods such as Iterative Virus Assembler (IVA) [33] and Paired-Read Iterative Contig Extension (PRICE) [34] have also been developed to overcome caveats associated with varying read depths and enable detection of regions with extensive genomic diversity. Assembly pipelines like VirAmp [35], VICUNA [36], SPAdes [37] offer many choices of tools and parameters for carrying out hassle-free assembly of viral genomes.
Novel approaches are also being introduced with special emphasis on viral metagenomic projects, viz. Progressive Filtering of Overlapping small RNAs (PFOR) [38]. PFOR is capable of identifying replicating circular RNAs by separating terminal small RNAs from internal small RNAs based on k-mer overlap. PFOR2, a multi-threaded version of PFOR, has recently been developed, which reduced the running time of filtering step by 90%. Novel viroids like Hop stunt viroid (HpSVd), Grapevine yellow speckle viroid (GYSVd) and Grapevine hammerhead viroid-like RNA (GHVd RNA) have been identified using this tool. Hence, de novo assembly has tremendous scope in unravelling the vast virome that has been unaddressed previously and there exists need for development of more efficient assembly algorithms, which will make it more tractable for use by larger scientific community.

Genome databases
Initial effort towards sequencing of viral genomes resulted in accumulation of genomic data in primary repositories such as GenBank [39], European Molecular Biology Laboratory (EMBL) [40] and DNA Data Bank of Japan (DDBJ) [41] and now continues to rise in International Nucleotide Sequence Database Collaboration (INSDC) [42]. Genome databases and resources dedicated to viruses were developed subsequently [43][44][45][46][47]. Lists of useful databases, resources and analysis tools have also been compiled previously [13,48]. Most of these resources archive complete genome sequences, their annotations and derived data such as viral variations, multiple sequence alignments (MSAs) and phylogenetic trees, to name a few. Some of the viral genome resources are briefly described below.

National Center for Biotechnology Information (NCBI) viral genome resource
This reference resource is designed to catalogue publicly available genomic sequences of viruses deposited in INSDC [49]. It attempts to curate reference genome sequences and leverages on the knowledge of experts to annotate as well as to identify important viral sequences.

ViralZone
This resource is developed and maintained at the Swiss Institute of Bioinformatics. The objective of the resource is to link textbook knowledge, fact sheets and images to the genomic and proteomic data with an objective to facilitate the study of viral diversity [50].

Virus Pathogen Database and Analysis Resource (ViPR)
The ViPR [51] is supported under the Bioinformatics Resource Centers (BRC) programme of National Institute of Allergy and Infectious Diseases (NIAID). The database currently provides access to molecular data of viruses including complete genomes of 14 viral families. Analytical and visualization tools for metadata-driven statistical sequence analysis, data filtering, analytical workflows and utility of personal workbench are provided to the users.
In addition to these, several organism-specific resources have been developed such as HCV Database [52] for Hepatitis C virus and IVDB [53] for Influenza virus and HIV [54].
Annotation of the sequence (gene/genome/protein) records is an integral step in downstream processing of database entries. A well-curated reference record serves as template for transfer of annotation in terms of features such as gene boundaries, associated functions (molecular/ cellular/pathway) and non-coding regions [49]. Such annotations will be highly useful in subsequent analysis and model building. The challenges of managing dedicated resources for viral genomes are relatively different as compared to the genomic databases of model and other organisms. The pace of sequencing and the quantum of genomic data being generated are affecting identification of reference genomes and annotations of genomes of strains and isolates. Additionally, to study the spatio-temporal evolution and to model the viral popula-tions, it is desirable to tag metadata such as the place and date of isolation of viruses with the corresponding genomic entries.

Impact of NGS technologies on virology
Molecular analysis of viruses using data generated by NGS has revolutionized virology. While understanding the sequence-structure-function relationships, it has also resulted in the development of new areas of research such as phyloinformatics and immunoinformatics, which translates raw data into information. The information generated from these independent yet interlinked areas, when put together fits as pieces of jigsaw puzzle (Figure 1), leading to an improved understanding of the viral diseases and, thereby, the development of antiviral therapies.

Unravelling mutational landscapes in viral quasispecies
Viral quasispecies are mutant swarms generated mainly by RNA viruses during replication, which is known to be error-prone due to the lack of proofreading activity of RNA-dependent RNA polymerase. The resulting mosaic is a dynamic distribution of non-identical but related replicons that cannot be detected using conventional sequencing approaches. Hence, quasispecies remained unexplored for a considerable time, even though the theoretical concept for quasispecies was put forth by Eigen in 1970 [55]. With the advent of NGS technologies, the generation of large genomic datasets became a reality. Due to the sequencing error issues, it was still tough to demarcate true genetic variations. Circular Sequencing (CirSeq), a novel experimental approach that creates template of tandem repeats of circularized genomic RNA fragments has been developed by Andino's group [56]. CirSeq reduces the sequencing error drastically as the repeats get sequenced in a redundant manner for every genomic fragment. A consensus reduces the theoretical error close to 10 −11 , which enables capture of the entire mutational spectrum of RNA virus populations. CirSeq was employed to study seven serial passages of Poliovirus replicated in HeLa cells. Mutation frequency was computed for every passage and their fitness was determined by mapping onto the 3D structure of proteins. As expected, majority of the mutations detected were neutral substitutions, thus highlighting robustness as driving force for adaptation and evolution [56]. This study clearly delineates the viral mutations responsible for quasispecies structure and highlights the extent of genetic variation that can be maintained in a population.
Microevolution in an evolving quasispecies population is responsible for the sequence diversity in Porcine reproductive and respiratory syndrome virus (PRRSV). PRRSV is the causative agent of late-term reproductive failure in sows and respiratory distress in pigs and hence has large economic impact. Genomic complexity of PRRSV due to multiple circulating genotypes results in antigenic diversity, which, in turn, is responsible for lack of effective vaccine development [57]. Sanger sequencing has identified open reading frames ORF5 and ORF7 as the polymorphic regions of the virus genome, encoding major immunogenic epitopes. In order to study the genome-wide polymorphisms, deep sequencing of PRRSV was carried out and amino acid substitutions in ORFs 2-7 in PRRSV strains obtained from pigs that lack B and T cells were studied [58]. By analysing nucleotide substitutions over time followed by comparative genomics with non-pathogenic variants, the role of mutation and selection in preserving the pathogenesis or fitness of PRRSV was well documented in this study.

Detection of low-frequency variants
Low-frequency variants or minority quasispecies are the variants that occur with a frequency of <20-25% in a viral population [59]. Minority quasispecies refers to the memory genomes that were dominant at an earlier phase of quasispecies evolution and can play an important role in conferring drug resistance in viruses such as Human Immunodeficiency Virus type-1 (HIV-1) and Influenza virus. Minority quasispecies of drug-resistant viruses can rapidly reemerge as major populations after the reintroduction of drug pressure. In case of HIV-1, presence of such low-frequency variants has been linked with early failure to the antiretroviral therapy [59,60]. Emergence of highly pathogenic subtype of Avian Influenza viruses (HPAI) has also been explained on the basis of low-frequency variants. Ultra-deep sequencing was used to study the emergence of HPAI from that of less pathogenic (Low Pathogenic Avian Influenza (LPAI)) progenitor viruses [61].

Inter-and intra-host genetic diversity
The rate of viral evolution and the effectiveness of its transmission are determined by interand intra-host genetic diversity. Mutation rate and selection pressure ascertain viral diversity. Factors like mixed infections and random processes such as genetic drift and population bottlenecks also contribute to the genetic diversity of viruses both within and among hosts. Transmission fitness influences the effective spread of viruses and is responsible for its stable maintenance in the environment [62].
Intra-host genetic diversity in Zucchini yellow mosaic virus (ZYMV), a plant RNA virus known to infect Cucurbitaceae plants, has been studied using NGS [63]. Population bottlenecks were investigated for this aphid-borne virus and are thought to occur during both inter-host vector transmission and systemic movement within an individual plant. ZYMV populations infecting cucumbers with and without vector were sequenced followed by de novo assembly and variant calling. Analysis revealed that the low-frequency mutants present in the initial population got fixed rapidly in vector-transmitted viruses, whereas the same continued to remain as minor variants in mechanically inoculated viruses. In addition, regions known to be responsible for vector transmission were conserved in all samples. It is interesting to know that previous studies using Sanger sequencing of the coat protein of ZYMV, which is involved in interaction with aphids, could not detect mutations when transmitted between or within plants. However, this study reported six mutations in coat protein with frequency of occurrence as low as ~3%.
Such studies provide an insight into the complex dynamics of genetic diversity of an emerging viral infection with implications in disease management.

Viral metagenomics
NGS has revolutionized metagenomics in a major way by ensuring high data throughput and by removing the hassles of cultivation/isolation by providing cost-effective options. Metagenomics involves sequencing of samples from diverse environments spanning across the biosphere [64]. The initial attempts at characterizing the viral metagenomes were more of an enumeration nature [65] and provided a glimpse of the enormous diversity underlying the previously unculturable communities. NGS has paved way for extensive characterization of the functional role of virome in hosts harbouring them [66,67]. Analysis of metagenomics data is challenging as it includes simultaneous assembly of multiple genomes/transcriptomes and the complex interplay between them. Two major methods based on 'sequence-similarity' and 'sequence composition' are usually used for categorization of samples in metagenomics. It has been observed that the alignment-free 'sequence composition'-based methods provide better means of classifying viral samples as 'sequence similarity'-based methods could only classify up to 30% of the reads [68].
In a major study involving analysis of dsDNA viruses from 43 ocean samples obtained from across the globe revealed several intriguing observations [69]. Genes shared across different samples were used as 'core genes' for comparison. 'Niche-differentiation' of different viral populations based on the layer of the ocean they occupy was observed. As viruses rely on the host machinery to replicate, a direct relationship was observed between the community structures of both viruses and hosts. Environmental factors like salinity also influenced the viral persistence and hence their diversity. Technological advances in viral metagenomics would help to unravel the underlying rules of viral evolution and ecology, the so-called 'Genomic rulebook of viruses' [70].

Receptor switching
A key event during any viral infection is the interaction of viruses with the host receptors on the plasma membrane. This serves as an entry point for viruses to access resources of the host cell and is very crucial for tropism. This interaction is known to be very specific and is responsible for activation of the signalling processes that recruit cellular machinery of the host for viral replication. The specificity of receptor binding defines host range that a virus can infect and the extent of tissue tropism that a virus can display. Switching of receptors thus enables the virus to increase its host range and/or gain access to the previously unaffected cell types.
HIV-1 enters the target host cell by binding to CD4 receptor along with a co-receptor (in majority of cases, chemokine C-C motif receptor 5 (CCR5)) using its spike protein.
Monitoring of the co-receptor usage using phenotype-based assays provided clues for the likely shift from CCR5 to chemokine C-X-C motif receptor 4 (CXCR4). However, due to the low resolution of these procedures, this transition could not be captured effectively. NGS of the variable loop region (V3) of the envelope gene containing determinants of coreceptor usage revealed the stepwise mutational pathway involved in the transition from CCR5 to CXCR4 [71]. The observation of the low-frequency intermediate variants provided an insight into the fitness landscape of HIV-1 and provided clues to tackle the disease progression in a rational manner.

Immune escape
The de novo sequencing approach has helped to analyse the heterogeneity of Influenza A virus (strain A/Nagano/RC1-L/200 or H1N1) isolated from 2009 pandemic. The amino acid changes in haemagglutinin protein (G172E and G239N) were observed to be associated with the immune escape [72].

Bioinformatics methods for viral genomics
Bioinformatics approaches help to estimate and analyse population diversity by studying genetic recombination, mutation, selection and, thereby, assist in correlation of genotype to phenotype. The methods relevant to these aspects are discussed below with emphasis on the analysis of viral populations.

Methods for quasispecies reconstruction
Quasispecies reconstruction refers to the estimation of number of viral variants and their frequency. Each viral variant in a quasispecies is considered as a haplotype. Tools available for this purpose include Short Read Assembly into Haplotypes (ShoRAH) [73], Quasispecies Reconstruction algorithm (QuRe) [74] and QuasiRecomb [75].

Short Read Assembly into Haplotypes (ShoRAH)
Principle: This method uses Bayesian principle to estimate the genetic diversity of mixed samples obtained through NGS by incorporating subroutines for correction of sequencing errors [73]. It can detect viral haplotypes with frequencies as low as 0.1%.
Algorithm steps: i. Alignment: The program requires a FASTA input file of NGS reads along with a reference sequence. It performs pairwise alignment of all reads to the reference sequence and generates a multiple sequence alignment (MSA).

ii.
Error correction (local haplotype reconstruction): Using MSA as a starting point, a set of overlapping windows is analysed by employing a model-based probabilistic clustering algorithm to obtain (i) haplotype sequences, (ii) their frequencies, (iii) corrected reads and (iv) posterior probability of the reconstruction.

iii.
Global haplotype reconstruction: The set of corrected reads is analysed under parsimony principle, which results in identification of set of unique reads of maximum length.

iv.
Frequency estimation: Using maximum likelihood (ML) and expectation maximization algorithm, the frequencies of the reconstructed haplotypes are estimated.

Quasispecies Reconstruction algorithm (QuRe)
Principle: QuRe [74] is based on a heuristic algorithm and automatically reconstructs a set of error-free, full-gene/genome variants from a collection of long NGS reads (>100 bp).
Algorithm steps:

i.
Overlaps between the reference genome and reads are generated in terms of k-mers.

ii.
Mapping of k-mers is then carried out to obtain genomic co-ordinates.

iii.
Generates a multinomial distribution based on the alignment scores of true matches along with the matches with randomly shuffled reads.

iv.
Coverage, nucleotide content and entropy of each mapped genomic position are then calculated.

v.
Errors are corrected based on Poisson distribution model, parameterized differently for homopolymeric and non-homopolymeric regions.

vi.
Reconstruction of quasispecies is carried out using the sliding window approach by calculating maximal coverage and read diversity, which reduces the false positives, i.e., in-silico recombinants.

QuasiRecomb
Principle: It employs the jumping Hidden Markov Model (HMM)-based probabilistic statistics for inference of viral quasispecies, especially for estimating the intra-patient viral haplotype distribution [75]. This method assumes that the true genetic diversity is generated by a few sequences (called generators) through mutation and recombination, and that the observed diversity results from additional sequencing errors.
Algorithm steps: i. Distribution of haplotypes in a given population is modelled to account for either point mutation or recombination in the form of probability tables and jumping HMM states respectively.

ii.
Expectation maximization algorithm is used to estimate posterior probabilities associated with rare events of mutation and recombination.

Methods to study viral population genetics
Genetic structure of a population refers to the number of distinct subpopulations, identified using a characteristic set of allele frequencies [76]. A model-based population analysis can be performed using the STRUCTURE program [77] based on genomic data. The program can infer the genetic structure in haploid, diploid and polyploid species [78].

STRUCTURE program
Principle: This method is based on Bayesian clustering approach and employs Markov Chain Monte Carlo (MCMC) algorithm to identify genetically distinct subpopulations based on allele frequencies. It assigns individuals to subpopulations based on likelihood estimates. In case of haploids, the program assumes that the loci are in linkage equilibrium or only weakly linked [78]. The program accounts for recombination by incorporating ancestry models such as admixture and linkage models. An admixed strain is assigned with a membership score to belong to two or more subpopulations, to indicate its mixed ancestry. Linkage model is an extension of admixture model to account for weak linkage that arises as a result of admixture linkage disequilibrium (LD). Therefore, the extent of linkage equilibrium within the markers needs to be tested prior to usage of the STRUCTURE program. The relevant linkage analysis (LIAN) programs and measures are discussed in Section 4.3.
Input genotype data: A wide range of markers such as multi-locus genotype data, microsatellites, SNPs can be used as an input. In case of viruses, the polymorphic sites or more specifically the parsimony-informative (PIs) sites obtained from genome-based alignment are suitable markers for population genetic analyses. A PI site contains at least two types of nucleotide bases and at least two of which occur with a minimum frequency of two. The position of each PI corresponds to a locus. At every locus, any of the four bases (A, T, G and C) and the gap is considered as an allele.
Algorithm steps: i. Carry out MSA of complete genomes and extract PI sites.

ii.
Estimate the degree of linkage equilibrium and test the null hypothesis about the same.

iii.
Simulate data using burn-in and burn-length with values in the range of 10,000-1,00,000. Check the convergence of parameters and consistency of clustering results.

iv.
Estimate the appropriate number of clusters (K) using independent runs with varying values of K.

v.
Determine the best K either by comparing mean of log likelihoods [77] or based on an ad hoc statistic, ∆K [79].

vi.
Validate the genetic structure hypothesis using Analysis of MOlecularVAriance (AMOVA) based on Fixation index (F ST ) as implemented in ARLEQUIN software [80]. F ST represents the extent of genetic differentiation among subpopulations and ranges between 0 (no differentiation) and 1 (complete differentiation).

Salient features of the STRUCTURE program:
i. This method is advantageous over traditional molecular phylogenetic methods in terms of classification of recombinant strains.

ii.
User can incorporate prior information such as geographic location of samples.

Limitations:
i. Variation in sample size may affect the clustering.
ii. This method is not suitable for datasets having high linkage disequilibrium.
Case studies: The ability of the admixture model to account for recombination has been used to analyse the extent of recombination and its role in determining the population structure of viruses such as Hepatitis B virus [81] and Rhinoviruses [82].
Population genomic study of Hepatitis B virus (HBV) was carried out using both admixture and linkage models (with burn-in of 20,000 and burn-length of 40,000). HBV is an enveloped DNA virus and belongs to the genus Orthohepadnavirus and family Hepadnaviridae. It is known to consist of eight genotypes designated as A-H, each of which has characteristic geographic distribution. This method helped to resolve the hierarchical nature of population subdivision with the presence of four major clusters (F ST = 0.497, p < 0.0001) and eight sub-clusters. The extent of recombination was observed to be low [81].
Rhinoviruses represent the highly diverse members of genus Enterovirus and family Picornaviridae. They are ss (+) RNA viruses with genome of ~7,200 bases. There are three species, viz.
Rhinovirus A, -B and -C, each of which is further subdivided into distinct serotypes. The STRUCTURE-based analysis revealed a strong evidence for existence of seven genetically distinct subpopulations (with F ST = 0.45, p = 0). Rhinovirus A and Rhinovirus C were subdivided into four and two subpopulations respectively, whereas Rhinovirus B species remain undivided. Furthermore, usage of both the admixture and the linkage models (with burn-in of 20,000 and burn-length of 40,000) helped to resolve the role of recombination in diversification of subpopulations. In case of Rhinovirus A, intra-species recombination was common, whereas in case of Rhinovirus C, intra-and inter-species recombination were observed to cause diversity [82].

Methods to compute linkage disequilibrium
Linkage equilibrium refers to the statistical independence of alleles at all loci and indicates evidence of free recombination [83]. Thus, linkage disequilibrium is a measure of the correlation between the occurrences of nucleotides at different loci of the genome. The extent to which recombination occurs can be estimated in terms of the degree of linkage disequilibrium [84] using measures made available by specialized programs such as Linkage Analysis (LIAN) [83] and DNA Sequence Polymorphism (DnaSP) [85]. The extent of linkage can be inferred based on the following parameters.

i. Standardized index of association, I S A:
It is a measure of the degree of haplotypewide linkage derived from a given dataset. I S A is computed using a formula, ii. |D'| and r 2 : The |D'| measure is the absolute value of the difference between the observed and the expected haplotype frequency in the absence of linkage disequilibrium, which is normalized by the maximum (or minimum) possible value of this difference. The squared value of the difference between the observed and the expected haplotype frequency normalized by the variance of the allele frequency is denoted by r 2 . These measures can be computed using DnaSP program [85]. The values for these measures can range between 0 (no linkage disequilibrium) and 1 (complete linkage disequilibrium) [84,86].

Case studies:
LD provides a good measure for analysing the extent of recombination in viruses [82,87]. For example, in case of Rhinoviruses, low values for LD measures (I S A = 0.0666, p < 10 −4 ; |D'| = 0.5409 and that of r 2 = 0.0613) were observed and correlated well with the evidence of recombination obtained using independent methods [82]. Similarly, LD analyses in serotypes of Foot and mouth disease virus [87] helped to reveal low values of |D'| and r 2 , supporting high recombination.

Methods for detection of recombination
In addition to undergoing mutations, viruses are known to generate new variants through genetic recombination. Genetic recombination refers to the exchange of genetic material between strains of the same or different species of viruses [88]. Within a host, co-infected with viruses, the recombination occurs either by homologous recombination or by reassortment [89]. Homologous recombination can occur between highly similar RNA genomes usually through the process called 'copy-choice' or 'template-switching' mechanism, whereas reassortment involves exchange of genomic regions between viruses that have segmented genomes. Presence of recombinants can hamper analyses pertaining to molecular clock [90], selection pressure, phylogenetic classification [91,92] and thus need to be detected prior to such analyses.

Virus Recombination Mapper (ViReMa)
ViReMa is developed to analyse the recombinants within the viral genome data derived through NGS [93]. It can detect inter-virus or virus-host recombination. This method can also detect insertion and substitution events and multiple recombination junctions within a single read.
Algorithm steps: i. Alignment of 5' end of each read to the reference genome(s) using seed-based approach.
ii. Dynamic generation of a new read segment: 3' end of the read that fail to align is extracted or the first nucleotide from the read is trimmed. This step is iterated until all the reads are either mapped or trimmed or a combination of both.

iii.
For each read, all possible recombinations are reported.

Recombination Detection Program version 4 (RDP4) package
In order to detect recombination, various methods have been developed and are provided in RDP4 package [94]. It identifies the significant evidence of recombination events based on the p-value and identifies the potential recombinant sequences and its both parents (major and minor). The main strength of the package is that it does not need any prior knowledge pertaining to non-recombinant set of reference sequences. The starting point of analysis is MSA of genomic sequences.

ii.
Following the detection of a 'recombination signal', RDP4 determines approximate breakpoint positions using HMM and then identifies the recombinant sequence using various methods such as phylogenetic profiling (PHYLPRO) [105] and Visual Recombination Detection (VisRD) [106]. iii.
The minimum number of recombination events that are needed to account for these signals are then inferred. It involves sequential disassembly of the identified recombinant sequences into respective components and iteratively rescanning the resulting expanded dataset until no further recombination signals are evident.

Salient feature:
RDP4 package provides a unified interface for multiple methods and facilitates visualization of recombination events using genomic data (up to 2,500 sequences).
The genomic dataset up to 200 million nucleotides can be analysed and is reported to have operational limits for large genomic datasets.

ii.
Recombination analysis is likely to fail in case of poor alignments, if recombinant sequences are used as reference and sequences having ambiguous characters are included.

Methods for selection pressure analysis
Natural selection is one of the fundamental evolutionary processes that shape the genetic structure of viral populations. The ratio of non-synonymous substitution rate (dN) to synonymous substitution rate (dS) is a useful means to infer selection pressure based on a codon alignment for a particular gene. Positive selection (dN/dS > 1) increases the frequency of advantageous alleles, whereas the negative selection (dN/dS < 1) is responsible for purging (removal) of deleterious alleles.
Broadly, the selection pressure can be classified as pervasive and episodic. Pervasive selection acts across all the lineages in a phylogenetic tree, whereas the episodic selection operates on a few lineages of a tree. Various statistical methods for analysis of pervasive and episodic selection are available at the Datamonkey web-server of Hypothesis testing using Phylogenies (HyPhy) software package [107][108][109].

Single Likelihood Ancestor Counting (SLAC)
Principle: This method belongs to a class called counting methods [110]. It is suitable for pervasive selection analysis and involves estimating the number of non-synonymous and synonymous changes that have occurred at each codon throughout the evolutionary history of the sample. It involves reconstructing the ancestral sequences using likelihood-based method [111].
Algorithm steps: i. Nucleotide model fit: Using maximum likelihood (ML), a nucleotide model of timereversible class is fitted to the data and tree, to obtain branch lengths and substitution rates. If multiple segments are present in the input codon alignment, base frequencies and substitution rates are inferred jointly from the whole alignment, while branch lengths are estimated for each segment separately.

ii.
Codon model fit: To obtain a global ω = dN/dS ratio, the branch lengths and substitution rate parameters are considered constant at the values estimated in 'step i'. A codon model is obtained using a combination of MG94 model and the nucleotide model of 'step i' and then fitted to the data.

iii.
Ancestral sequence reconstruction: Based on the parameter estimates obtained using steps i and ii, codons of ancestral sequences are reconstructed site by site using maximization of the likelihood of the data at the site over all possible ancestral character states. Inferred ancestral sequences are treated as known for the next step.

Fixed-Effect Likelihood (FEL) and Internal Fixed-Effect Likelihood (IFEL)
Principle: These belong to a class of methods called 'fixed effects'. It analyses pervasive selection and involves fitting substitution rates on a site-by-site basis by assuming that the synonymous substitution rate is the same for all sites. Thus, FEL and IFEL assume the same dN/dS (ω) ratio, which is applicable to all branches and to interior branches, respectively [111].
Algorithm steps: i. Nucleotide and codon model fitting procedure in these methods is similar to those of SLAC method as detailed in Section 4.5.1.

ii.
Site-by-site likelihood ratio test (LRT): FEL method: For every site, based on the parameter estimates obtained using nucleotide-and codon-fit procedure, two rate parameters namely α and β are first fitted independently and then under the constraint of α = β. Here, the parameter α represents the instantaneous synonymous site rate, while β represents the instantaneous non-synonymous site rate. Furthermore, LRT is performed to infer whether α is different from β and a p-value is computed. If the p-value is significant, the site is classified based on whether α > β (indicates negative selection) or α < β (indicates positive selection).
IFEL method: It differs from FEL in following aspects: Analysis of Next-generation Sequencing Data in Virology -Opportunities and Challenges http://dx.doi.org/10.5772/61610 • The selection is only tested for internal branches of the phylogenetic tree.
• Each site has three rate parameters, α, β_I (instantaneous non-synonymous site rate for internal branches) and β_L (instantaneous non-synonymous site rate for terminal branches).
Here, the null model assumes that α = β_I.

Mixed Effects Model of Evolution (MEME)
Principle: MEME is categorized under the 'branch-site random effects' phylogenetic methods [112]. Though this method is a generalization of FEL method, it differs from FEL and IFEL, by accounting for episodic positive selection that particularly affects a subset of lineages. MEME uniquely allows the distribution of dN/dS (ω) to vary from site to site (the fixed effect) and also from branch to branch at a site (the random effect).
Algorithm steps: i. The steps 'i' and 'ii' are same as that of the SLAC method (Section 4.5.1), whereas there is variation in step 'iii' as follows: ii.
The ω ratio is modelled across lineages at an individual site, i.e., each site is treated as a fixed-effect component of the model using a two-bin random distribution with ω− ≤ 1 (proportion p) and ω+ (unrestricted, proportion 1−p). Thus, a proportion (p) of branches at a site evolve neutrally (or under negative selection), while the remaining (1-p) may evolve under diversifying selection. To test for evidence of episodic selection, a likelihood ratio test is applied.

Methods for reconstruction of molecular phylogeny
Molecular phylogenetic analyses are the most commonly performed studies in virology with major applications in viral taxonomy, systematics and genotyping. Methods for reconstruction of phylogenetic tree are broadly classified into three main categories, viz. distance-based, character-based and Bayesian-based and are reviewed earlier [113,114]. Distance-based methods use pairwise distance matrix as an input for tree building. Neighbour-joining [115], minimum evolution [116] and least square [117,118] methods are widely used methods under this category. These methods are computationally efficient and suitable for the analysis of large datasets with low levels of sequence divergence. However, these methods do not perform equally well in case of highly divergent sequences with low levels of sequence similarity. Moreover, uncertainties can be introduced due to positioning of gaps in the MSA. Characterbased methods assume each site in MSA to evolve independently. The two classical methods under this category are maximum parsimony and maximum likelihood [119], which estimate the tree score based on the minimum number of changes and the log-likelihood value respectively. However, it needs to be mentioned that alignment-based phylogenetic methods are observed to misclassify taxa with mixed ancestry and/or recombination [91,92].
The alignment-free methods have been developed as an alternative and can be classified into four categories based on the underlying principles employed. They are k-mer/word composition, substring theory, information theory and graphical representation [120].
Whole genome-based phylogenetic trees are widely used for various viruses owing to their small genome sizes and conservation of genomic structure. Phylogenomics field has gained importance as whole genome data became available enabling the study of evolution in general and epidemiology and disease surveillance, in particular. This field when analysed in the context of spatio-temporal data helps to understand the disease spread and progression during outbreaks. The program such as Bayesian Evolutionary Analysis by Sampling Trees (BEAST) has been exclusively designed for phylogeography studies [121] and is used widely to study spatio-temporal dynamics of viruses at population scale.
BEAST software provides a Bayesian Markov chain Monte Carlo (MCMC) framework for parameter estimation and hypothesis testing of evolutionary models from molecular sequence data. It brings together a large number of evolutionary models into a single coherent framework for evolutionary inference. Available evolutionary models include substitution, insertion-deletion, demographic, tree shape priors, node calibration and relaxed clock models. This combinatorial principle is advantageous as it provides a flexible system to specify models to understand various aspects of virus evolution. BEAST uniquely incorporates the time-scale data to explicitly model the rate of molecular evolution on each branch in the tree. Under the uniform rate assumption over the entire tree, the molecular clock model becomes applicable. It is the first software to incorporate the relaxed molecular clock model that does not assume constant rate across lineages.

Methods for typing of viruses
Phylogenetic analysis, whether alignment-based or alignment-free, is routinely used for genotyping/serotyping of viruses. Such analysis is carried out using the regions that are identified as markers for the purpose of classification by the expert evolutionary virologists and the International Committee of viruses (ICTV) [122]. It has been observed that genotype information for less than 10% of the viral genomes is available as part of their sequence records. As NGS technologies are producing a large number of genomic sequences for various strains, isolates and viral species, the genotype assignment gap is ever-increasing. Several tools for genotyping have been developed using both alignment-based and alignment-free methods and are most often organism-specific. NCBI Genotyping Tool is based on the sequence similarity for identifying the genotype of recombinant and non-recombinant viral sequences [123]. Similar tools exist for Influenza virus, viz. FluGenom [124]. Alignment-free method for phylogeny and genotyping of viruses based on the concept of Return Time Distribution has been developed in-house and its applicability for genotyping of viruses such as Mumps virus, Dengue virus and West Nile virus has been demonstrated [125][126][127].

The way forward
NGS has proved to be extremely useful and has become an integral part of virus research and opened up new vistas in studying viral evolution. Ample proof of the same is the characteri-zation of the Ebola virus infection in West Africa (2014 outbreak), wherein the patient samples were sequenced using NGS to trace the origin and transmission of the infection as part of the global epidemic surveillance strategy [128]. The discovery followed by the development of vaccine [129] has been made in a short time span owing to the genomics-enabled translational research. In order to harness the use of NGS in virology, care needs to be exerted to avoid misinterpretation and over-interpretation of the data. It must be noted that starting from sample collection, DNA/RNA extraction, PCR amplification, library preparation up to sequencing are prone to errors, which have been explained [130] very comprehensively. Circumventing these issues, application of NGS in virology has enabled basic and applied research to take a quantum leap. The thorough understanding of the intricacies of a quasispecies structure aids in tracing the mutational network operational due to selection pressures. Furthermore, characterization of intra-and inter-host viral evolution helps in understanding the role of host immune system on the genetic variability of viruses. Such data when analysed in the context of population genetics provide constructs to understand emergence of new strains/lineages. Reverse vaccinology [131] enabled via genomics is expected to accelerate the rate of vaccine discovery, thereby, reducing the virus-associated disease burden.