For almost half a century, Sanger sequencing has been the conventional method for sequencing DNA. However, its utility for sequencing heterogeneous viral populations is limited because it can only detect mutations that are present in a significant portion of the DNA molecules. Several molecular methods that quantify mutations present at low levels in viral populations were proposed for evaluation of genetic consistency of viral vaccines; however, these methods are only suitable for single site polymorphisms, and cannot be used to screen for unknown mutations.
- deep sequencing
- DNA and RNA libraries
- influenza viruses
- mutational profiles
- sequence heterogeneity
1. Introductory comments on influenza viruses and vaccines
1.1. Influenza viruses
Influenza A viruses are the causative agents of seasonal epidemics and periodic pandemics. There are many serotypes that infect birds, especially waterfowl, and a few serotypes that infect mammals, including humans. Although some influenza A strains from birds and pigs have jumped the species barrier to infect humans, the majority of human infections are caused by the spread of endemic strains. The endemic strains evolve continually, which is one of the reasons that influenza A infections remain a persistent problem. Strains of influenza virus that are used in vaccine production are prone to mutations during the manufacturing process, which can compromise the consistency of vaccine quality. Such mutations can lead to changes in the antigenic structure of the virus and thus affect vaccine effectiveness.
Influenza A viruses belong to the Orthomyxoviridae family of viruses. There are several common features shared among the viruses in this family: they are all negative‐sense, single‐stranded RNA viruses that replicate in the nucleus of the host cell. The influenza A virus genome is composed of eight RNA segments that encode more than 11 proteins. Two of the segments each encode a major antigenic protein: the fourth largest segment encodes the hemagglutinin (HA) protein and the sixth largest segment encodes the neuraminidase (NA) protein. The different subtypes and strains of influenza A viruses are distinguished by the HA and NA proteins that coat the surface of the virus. There are at least 18 different HA types and 11 NA types. The segmented nature of the viral genome enables two viruses co‐infecting the same cell to exchange their segments to produce reassortant progeny. Replication of influenza A viruses also results in mutated viral genomes because of the high error frequency of the RNA polymerase and actions by host defensive elements [3, 4]. Mutations contribute to the emergence of new endemic strains and reassortment may lead to the emergence of epidemic or pandemic strains.
The presence of mutations in influenza A populations has been examined in a variety of contexts. Several groups have isolated clones and used Sanger sequencing to identify mutations. Isolation of a sufficient number of clones has resulted in estimates of the mutation frequencies ranging from 6 × 10⁻⁴ to 2 × 10⁻⁶ [3, 5, 6]. Although the information about heterogeneity is of great interest, caution must be exercised to ensure that it is accurately reflected in sequence databases. The presence of errors in influenza databases has been noted [7, 8]. To limit discrepancies, some groups have used next‐generation sequencing (NGS) to identify sequence heterogeneities [9–12]. Additional technologies such as multisegment reverse‐transcription PCR have also been employed. These studies revealed several interesting things. For example, it has been found that differences in viral sequences may occur after a single passage, that the same antigenic variants can be detected in different individuals, and that oseltamivir‐resistant and oseltamivir‐sensitive viruses can be found together as part of heterogeneous viral populations.
Most human infections are caused by influenza B viruses and the influenza A serotypes H1N1 and H3N2. In addition to the endemic human influenza transmission, there are cases reported each year of influenza A infections originating from an animal. Influenza A (H3N2) variant viruses from swine are sometimes transmitted to humans, especially those in close contact with pigs in agricultural settings. Poultry workers are also frequently seropositive for a variety of different avian influenza strains [18–20], suggesting infrequent but detectable transmission. In most cases, there is no person‐to‐person transmission of the animal viruses. Nevertheless, candidate vaccine viruses (CVVs) are prepared each year against some of these viruses to provide a prophylactic option in the event of an outbreak (http://www.who.int/influenza/en/). The CVVs for vaccines against endemic strains and potentially pandemic strains are provided to vaccine manufacturers for use as seed viruses in the manufacturing process.
1.2. Influenza vaccine production
There are several different methods that are used to produce influenza vaccines. Errors may be introduced into the antigenic protein during the replication of the seed virus, no matter which production method is used. The different licensed vaccines are produced from influenza virus grown in eggs or cell culture, or from recombinant viruses expressing the influenza HA from an alternative viral backbone grown in cell culture. Contemporary strains isolated from patients during the current epidemic season often do not grow well in cell substrates used for vaccine production. To increase virus yields, they are recombined with reference high‐growth strains, such that the CVVs have the HA‐ and NA‐coding RNA segments from the contemporary strain, and the RNA segments coding for replicative proteins from the high‐growth reference virus. HA is the primary protective antigen and is responsible for binding the cellular receptor. Receptor properties in human and chicken cells differ, forcing the virus HA to adapt to the new receptor; this adaptation can change antigenic specificity and potentially affect vaccine potency.
The most common manufacturing process used in FDA‐licensed vaccines is to grow influenza virus in eggs and then inactivate the virus. The inactivated virus is purified and then diluted to the desired potency for filling vials or syringes. Live attenuated influenza vaccines (LAIV) are also grown in eggs but are administered as a nasal spray. Codon deoptimization has also been proposed as a method for creating attenuated viruses [21, 22]. Unlike the current licensed products these may be grown in cell culture.
Some inactivated influenza vaccines are grown in cells. Production of cell‐grown vaccines currently uses the same egg isolated seed virus that is used for egg inactivated vaccine production. The cell‐grown viruses are harvested, inactivated, and filled into vials or syringes for distribution in a similar manner to the egg grown viruses. The production of recombinant viruses to prepare HA does not require a live seed virus. The HA sequence is cloned into the virus used for production. Although the frequency of errors during replication may differ from that of an influenza virus, the concern still remains. Even if vaccine strains are produced using cloned DNA sequences or synthetic sequences, there is still the possibility of errors arising during amplification of the seed virus. Errors may emerge because of inaccuracies inherent in the replication system or as a response to the host cell defenses.
1.3. Influenza vaccine seed viruses
The seed viruses used to produce influenza vaccines are derived from different sources. Because the influenza virus spreads throughout the world and strains are continually evolving, a network of academic, governmental, and commercial organizations work together to produce new seed viruses. New virus isolates are collected by National Influenza Centers and sent to the World Health Organization Collaborating Centers (WHO CCs). The viruses are typed according to strain and subtype using antigenic and genetic analyses. Viruses are usually isolated in Madin‐Darby Canine Kidney Epithelial Cells (MDCK cells) and then amplified in eggs. It is important that the seed virus is as close in sequence and antigenicity to the original isolate as possible. To this end, the egg‐amplified virus is used to immunize ferrets for the production of antiserum. The antiserum is then used for antigenic typing of viruses. Such analyses are used to determine how well the strains used in the influenza vaccine are matched to the currently circulating strains.
To produce sufficient quantities of vaccine, the CVVs must have good growth characteristics. Ideally, the viruses should replicate efficiently, have a high antigen (primarily hemagglutinin) to total protein ratio and not have increased pathogenicity. Many viruses do not produce high yields in eggs without adaptation. Some viruses have been propagated for many years and have well‐known growth characteristics. They include the influenza A strains Puerto Rico/8/1934, cold‐adapted A/Ann Arbor/6/1960, A/Leningrad/134/17/1957 and variants of these viruses. Where appropriate, these strains have been used as a backbone for influenza A CVVs. Combining the high‐yield characteristics with the antigenic characteristics of contemporary strains facilitates vaccine production. Reassortant viruses with the desired antigenicity and growth characteristics are produced by two different methods: classical reassortment and genetic reassortment. Dr. Kilbourne of the New York Medical College (NYMC) developed the classical method to create reassortant viruses that expressed the HA and NA from a seasonal strain in the background of a high‐growth virus. This method involves co‐culture of a contemporary virus and a high‐yield strain with antibodies to select against the HA and NA of the high‐yield strain. Although the resulting viruses have the desired HA and NA genomic segments, the remaining segments may come from either the high‐growth strain or the contemporary strain. Another approach based on genetic engineering allows the production of viruses from plasmids expressing the eight influenza virus segments. Genetic engineering also allows the expression of HA and NA proteins in a vector‐based system such as a baculovirus. The desired genetic sequence of HA engineered in this system may be incompatible with baculovirus components, which can result in changes as more efficient mutants displace the parental virus (the so‐called gene constellation effect).
All CVVs go through several rounds of replication at manufacturers’ facilities as they produce their own virus stocks, working seed, and the final product. Some changes to the virus may occur during manufacture so tests to verify the identity, purity, potency and stability of vaccine lots are required. The potency of inactivated influenza vaccines and influenza vaccine produced from recombinant viruses is determined using standardized reagents supplied by national regulatory authorities. The potency of a live‐attenuated virus is calculated from the amount of viable attenuated virus. Genetic characterization of the vaccine viruses is currently achieved by partial genome sequencing or restriction analysis.
It has been suggested that most of the differences between natural isolates and vaccine seed viruses occur during the selection and clonal isolation of the candidate virus prior to manufacture. The fidelity of replication will vary among viruses and will depend on other factors such as the host cell line and multiplicity of infection used. There are some limits on the number of times that seed viruses may be passaged so that mutations are less likely to occur. The European Pharmacopoeia monograph 0158 for inactivated influenza vaccines states that the seed virus should not be passaged more than 15 times. However, because regulations tend to lag behind scientific development, there is no universally accepted guideline for influenza vaccine manufacture that covers egg‐derived, cell‐derived, and synthetic reassortant viruses.
2. Importance of the evaluation of genetic consistency of influenza A vaccine viruses
Virus populations are composed of genetically variable viruses and this can affect their replication, evolution, attenuation, and pathogenesis [26, 27]. Understanding the mutations present in a virus population, even at low levels, is important for understanding how the viruses grow and cause infections. It has recently been shown that an influenza population containing two variants involved in cell exit grows better than populations containing either variant alone. Although good growth properties are a desirable feature in vaccine seed viruses, it is critical that other parts of the genome, such as those causing attenuation or encoding the major antigenic regions, remain stable. Consistency of manufacture is important and having suitable means to assess genetic consistency is valuable. New assays capable of assessing entire viral genomes, and detecting mutations present at a low level, are needed.
The emergence of mutations in the course of vaccine manufacture was shown to contribute to partial reversion to virulence in the oral polio vaccine (OPV). Mutant analysis by PCR and restriction enzyme cleavage (MAPREC) is used to control batches of oral polio vaccine for the presence of neurovirulent mutations [29, 30] and has been expanded to be used for other viruses [31, 32]. Mutations emerging during virus growth may also change antigenic properties and therefore affect protective potency of live and inactivated vaccines. New approaches that can be used not only for monitoring genetic stability of live vaccines, but also for controlling consistency of inactivated vaccines are needed. Influenza vaccines are manufactured in embryonated chicken eggs or cell culture. There is evidence that vaccine seed viruses adapt to grow efficiently in the different substrates and this can lead to changes in the receptor‐recognition site of viral hemagglutinin, which is the major protective antigen [33–36]. For this reason, it is important to monitor the changes that may take place in major protective epitopes of the virus. It is also important to know that mutations responsible for attenuated phenotypes are maintained. Knowing which mutations are emerging during virus growth in production substrates could also be used to optimize genetic structure of vaccine strains. Consistently accumulating mutations have higher fitness, and if they have no deleterious properties, their incorporation into the genome of vaccine virus could increase its yield and improve vaccine potency. Given these concerns it is imperative to screen genomes of viral vaccines for emerging mutations.
3. Methods used for evaluation of genetic consistency of vaccine viruses
As mentioned above, viral populations are highly heterogeneous, and even small quantities of mutants in virus stocks may affect their biological properties. Although PCR and restriction analysis of reassortant influenza viruses can demonstrate which parental strain each genomic segment was derived from, it cannot detect new mutations. Even traditional sequencing methods are not sensitive enough to detect small amounts of mutants, and highly sensitive PCR‐based methods can only analyze one or a few known mutations at a time.
Conventional sequencing approaches are suitable for discovery of mutations that make up a substantial fraction of the population, usually around 20–25% or more. Determining the actual frequency using conventional sequencing requires the labor‐ and time‐intensive analysis of a large number of virus clones (plaques).
There are indirect approaches based on analysis of electrophoretic mobility in gels, which are insufficiently sensitive and do not allow mutations to be located accurately. Mass spectrometry (MALDI‐TOF) and hybridization with microarrays of short oligonucleotides [40–42] are sensitive, but are laborious and may require follow‐up by direct sequencing. The highly sensitive mutant analysis by PCR and restriction enzyme cleavage (MAPREC) [29–31] can detect and quantify mutants at levels as low as 0.1% of the viral population. Recently we developed a quantitative allele‐specific PCR (asqPCR) for detection of a low level of mutants in viral vaccines. RT‐PCR has been proposed as a method for checking the homogeneity of influenza vaccine seed candidates. However, these methods are only suitable for analysis of one known mutation at a time.
Several versions of high‐throughput sequencing technology, also known as deep or massively parallel sequencing (MPS), have been used to assess influenza vaccine viruses. These technologies enable rapid generation of large amounts of sequence information. Three different platforms for deep sequencing are widely used at present: the Roche/454 FLX (http://www.454.com/) (http://454.com/products/technology.asp), the Illumina/Solexa Genome Analyzer (http://www.illumina.com/technology/next‐generation‐sequencing/sequencing‐technology.html), and the Applied Biosystems SOLiD™ System (http://www.thermofisher.com/us/en/home/brands/applied‐biosystems.html). Two new sequencing platforms that are optimized for long reads have been developed recently: the Pacific Biosciences SMRT Sequencing (http://www.pacb.com/smrt‐science/smrt‐sequencing/) and Oxford Nanopore Technologies MinION (https://www.nanoporetech.com/technology/the‐minion‐device‐a‐miniaturised‐sensing‐system/the‐minion‐device‐a‐miniaturised‐sensing‐system). These systems are also called “single molecule” sequencers and do not require any amplification of DNA fragments prior to sequencing.
The deep sequencing technologies were shown to be suitable for analysis of heterogeneities in viral populations. They can produce very large amounts of sequence data in a single run. They are used for de novo sequencing of large genomes, metagenomics studies (virome, microbiome, etc.), screening for genomic markers, and many other applications [50–58]. Previously, it was demonstrated that deep sequencing can be used to monitor the genetic stability of oral polio vaccines, and could replace the WHO‐recommended MAPREC assay for lot release of OPV. Recently we showed that deep sequencing is suitable for evaluation of genetic consistency of influenza vaccine viruses [36, 60].
4. Description of the most used deep sequencing platforms
Deep or massively parallel sequencing refers to several high‐throughput methods for DNA sequencing that are often referred to as NGS. They have dramatically improved the ability of biotechnology, scientific, and healthcare researchers to analyze viruses by providing large amounts of sequence information covering entire genomes. The high‐throughput sequencing field has witnessed the rise of many technologies capable of massive genomic analysis. In the virology field, deep sequencing has made it simple to sequence full viral genomes. Likewise, identification and classification of novel and known viruses, unbiased characterization of viral populations without the need for virus culturing (viromes), molecular epidemiology, viral diversity and evolution, transmission and pathogenesis, and medical virology have greatly benefited from the use of deep sequencing. The cost of deep sequencing has become affordable because of competition between vendors and the ability to analyze multiple samples in one lane of the sequencing flow cell. This has allowed virologists to study a large number of viral samples, including mixtures of viral populations, and to study low‐level mutants in a wide range of viruses [36, 59–61].
There are several platforms for deep sequencing. The most widely used sequencing platforms are the Roche/454 FLX (http://www.454.com/), the Illumina/Solexa Genome Analyzer (http://www.illumina.com/technology/next‐generation‐sequencing/solexa‐technology.html), and the Applied Biosystems SOLiD System (http://www.appliedbiosystems.com/absite/us/en/home/applications‐technologies/solid‐next‐generation‐sequencing/next‐generation‐systems/solid‐4‐system.html?CID=FL‐091411_solid4).
The differences between these platforms include DNA library preparation procedures and chemistry, the sequencing reactions on the amplified strands, the length of reads, the amount of data generated per run, the hardware, the software engineering and the technology used to amplify single strands of a fragment from the library.
In general, DNA libraries of target fragments are generated, and adaptors containing universal priming sites are ligated to the ends of the fragments, allowing complex genomes to be amplified with a common set of PCR primers. After ligation, the DNA is separated into single strands and attached or immobilized to a solid surface or support. The immobilization of spatially separated template sites allows thousands to billions of sequencing reactions to be performed simultaneously.
Immobilization and separation of the millions of molecules to different surfaces can be achieved by a variety of methods including the Polonator and PicoTiter Plate [47, 62–64]. Attachment of forward and reverse primers to a slide and use of solid‐phase amplification also result in the enrichment and amplification of separate template strands (Illumina/Solexa).
Two newer sequencing platforms with longer reads differ from those described above. They are sometimes referred to as “single molecule” sequencers because they sequence molecule by molecule and do not require any amplification of DNA fragments prior to sequencing. The Pacific Biosciences system involves the attachment of a DNA polymerase to the DNA molecule. During the sequencing phase the polymerase adds bases labeled with a fluorophore. The fluorescence unique to each base is recorded and, as each new base is added, the fluorescent label is removed. The system quickly generates long sequencing reads (10–15 kb) from single molecules of DNA.
The Oxford Nanopore system runs the sample through very small (1 nm wide) pores. As the DNA passes through these nanopores, the Oxford machine records the change in electrical current associated with the bases passing through the pore, like a signature. It can produce even longer reads (>100 kb).
A detailed description of the two deep sequencing platforms most widely used for analysis of influenza viruses and their vaccines is given below.
4.1. Roche/454 FLX pyrosequencer
Roche/454 FLX sequencing is based on pyrosequencing technology (http://my454.com/products/technology.asp), in which the incorporation of each nucleotide by DNA polymerase releases pyrophosphate, which initiates a cascade of enzymatic reactions that converts the pyrophosphate to a light signal. This light is recorded by a CCD camera. This approach, as with most NGS procedures, starts with DNA library preparation. The library DNAs with 454‐specific adaptors are denatured into single strands and mixed with agarose beads whose surfaces carry oligonucleotides complementary to the 454‐specific adaptor sequences on the fragment library, so that each bead is associated with a single fragment. The DNA fragments captured by beads are amplified by emulsion PCR (ePCR) to produce approximately one million copies of each DNA fragment on the surface of each bead. These amplified single molecules are then sequenced on a picotiter plate (a fused silica capillary structure) that holds a single bead in each of several hundred thousand wells, which provides a fixed location at which each sequencing reaction can be monitored.
Individual dNTPs are added to the template one at a time in the presence of a DNA polymerase. The sequencing reaction releases pyrophosphate (PPi) after the incorporation of a complementary nucleotide. The released PPi is used by an ATP sulfurylase to generate ATP from adenosine 5'‐phosphosulfate. The ATP is then used by luciferase to generate light by converting luciferin into oxyluciferin. Unincorporated dNTPs are degraded by an apyrase, and dATPαS (which is not a substrate for luciferase) is used instead of dATP. This pyrosequencing reaction is repeated along the entire length of the target DNA. This sequencing technology can now produce sequencing reads of up to 1000 bp in length (http://454.com/products/gs‐flx‐system/).
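The way a series of light signals becomes a base sequence can be illustrated with a minimal sketch. It assumes that nucleotides are flowed in a fixed repeating order (the `TACG` order used here is an assumption for illustration), that the light intensity is roughly proportional to the number of identical bases incorporated in that flow, and that intensities can simply be rounded to whole numbers. The function name `decode_flowgram` and the example intensities are hypothetical; real base callers use calibrated signal models.

```python
from typing import Sequence


def decode_flowgram(signals: Sequence[float], flow_order: str = "TACG") -> str:
    """Convert per-flow light intensities into a base sequence.

    Each flow adds one kind of dNTP. A signal near n means that n identical
    bases were incorporated (a homopolymer of length n); a signal near 0
    means the flowed nucleotide was not complementary to the template.
    """
    bases = []
    for flow_index, signal in enumerate(signals):
        nucleotide = flow_order[flow_index % len(flow_order)]
        # Round to the nearest whole number of incorporations, never below 0.
        run_length = max(0, int(signal + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical intensities for two cycles of the T, A, C, G flows.
flows = [1.02, 0.05, 0.97, 2.10, 0.03, 1.95, 0.00, 1.04]
print(decode_flowgram(flows))  # TCGGAAG
```

Because the length of a homopolymer has to be inferred from signal strength, long runs of the same base are the positions where this kind of decoding is most likely to miscount.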
These raw reads are processed by the 454 analysis software and then filtered to remove poor‐quality sequences, mixed sequences, and sequences without the initiating TCGA sequence. Recently, the 454‐FLX system was upgraded to reach 99.9% accuracy after filtering and an output of 14 Gb of data per run within 24 h.
4.2. Illumina genome sequencer
The Illumina sequencing method begins with preparation of a library of DNA fragments flanked by Illumina‐specific adapters. Sequencing templates are immobilized on a proprietary flow cell surface that carries oligonucleotides complementary to the adapters and is designed to present the DNA in a manner that facilitates access to enzymes while ensuring high stability of surface‐bound template and low non‐specific binding of fluorescently labeled nucleotides. Solid‐phase amplification of each single‐stranded DNA molecule from the library is performed by bridge amplification, which results in the generation of several million dense clusters of single‐stranded DNA in each channel of the flow cell.
The Illumina system sequences DNA in the presence of four reversible terminator‐bound dNTPs. At each sequencing step, a fluorescently labeled dNTP is added to the growing strand. The fluorescent signal is recorded and then the fluorophore is removed to allow sequencing to continue. The base calls correlate with the signal intensity. Illumina sequencing technology can now produce sequencing reads of up to 600 bp in length (http://www.illumina.com/systems/sequencing.html). The sequencing results are written to files in which each base of a raw read has an assigned quality score, so that the software can apply a weighting factor when calling differences and generating confidence scores. Illumina data collection software enables users to align sequences to a reference in resequencing applications. This software suite includes the full range of data collection, processing, and analysis modules to streamline collection and analysis of data.
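To make the role of per‐base quality scores concrete, the following sketch converts quality characters into error probabilities. It assumes the standard Phred definition, Q = −10 log10(P), and the Phred+33 ASCII encoding commonly used in FASTQ files; neither detail is specified in the text above, and the function names are hypothetical.

```python
from typing import List


def decode_quality_string(quality_string: str, offset: int = 33) -> List[int]:
    """Convert FASTQ quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(character) - offset for character in quality_string]


def phred_to_error_probability(quality: int) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-quality / 10)


scores = decode_quality_string("II5+!")
print(scores)  # [40, 40, 20, 10, 0]

# A base with Q = 20 is expected to be wrong about once in 100 calls,
# so it deserves far less weight in a variant call than a base with Q = 40.
for score in scores:
    print(score, phred_to_error_probability(score))
```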
4.3. Deep sequencing data analysis
The massively parallel scale of sequencing implies a similarly massive scale of computational analysis. The conventional pipeline for analysis of next‐generation sequencing data includes the following stages: quality control and source data filtering; alignment (mapping); reference profiling (variant‐calling, pileup); single‐nucleotide polymorphism (SNP) calling (genotyping); and some form of clustering or classification analysis of samples to discover up‐ or down‐regulation of genes, detect overabundance of SNP positions and correlate those with function and phenotype. Because of the sheer size of the data and the amount of calculation needed, such analyses place significant demands on the information technology (IT) infrastructure. Lack of computational power, insufficient actively accessible storage in laboratory information management systems (LIMS) and limited network capacity to move data add significantly to the overhead required for high‐throughput data production. The hardware aspect of next‐generation sequencing is complicated by the imperfections of current sequence analysis tools, which are suited to shorter sequence read data. There are multiple implementations for all of the stages of the analysis, and some of those are considered to be industry‐standard tools that handle a formidable volume of biomedical analytics. Large‐scale analysis of thousands of samples using a variety of available tools has highlighted important issues with data quality, pre‐analytic quality controls, software reproducibility and post‐analytic quality controls. Existing data analysis pipelines and algorithms must be modified to accommodate extra‐large amounts of short read sequences and combinations of shorter and longer read technologies.
To analyze the deep sequencing data for genetic consistency evaluation of influenza vaccine viruses, we have used the corresponding viral reference sequences from NCBI GenBank as a template for alignment of individual sequencing reads. First, sequencing reads with low quality (Phred) scores are removed from the data set, and the remaining sequences are aligned with the reference influenza virus sequence using custom software: the High‐performance Integrated Virtual Environment (HIVE, https://hive.biochemistry.gwu.edu/dna.cgi?cmd=main) computer cluster [67, 68].
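The read filtering step can be sketched as follows. This is not the HIVE implementation; it is a minimal illustration that assumes four‐line FASTQ records, Phred+33 quality encoding, and a hypothetical mean‐quality threshold of 20. The helper names are invented for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, str]  # (name, bases, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (name, sequence, quality) from simple four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines]
    for start in range(0, len(stripped) - 3, 4):
        header, sequence, _separator, quality = stripped[start:start + 4]
        yield header[1:], sequence, quality  # drop the leading "@"


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a read (Phred+33 encoding assumed)."""
    return sum(ord(character) - offset for character in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Keep reads whose mean Phred score reaches the (hypothetical) threshold."""
    return [read for read in reads
            if read[2] and mean_phred(read[2]) >= min_mean_quality]


fastq_text = """@read1
ACGT
+
IIII
@read2
ACGT
+
!!!!
"""
kept = filter_reads(parse_fastq(fastq_text.splitlines()))
print([name for name, _, _ in kept])  # ['read1']
```

Filtering on the mean score is only one possible rule; trimming low‐quality read ends or requiring a minimum score at every base are common alternatives.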
To create a quantifiable measure for comparing the quality of the sequencing and mapping at different positions on a genome, we developed a metric for assessing positional variant‐call quality. First, a histogram is built at every position of the genome, accumulating the number of times a base has occurred at each position within the reads that cover it. Positions of insertions and deletions are collected in a similar histogram. The underlying assumption of the next‐generation sequencing method is that the DNA amplification and digestion procedure is random and that the short sequences produced by DNA digestion are neither strongly biased nor sequence dependent. The default assumption is therefore that a genuine variant call should be confirmed at many different positions on many reads, so that the histogram is approximately uniform along the entire length of the sequence read.
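A minimal sketch of how such a histogram could be accumulated is shown below. It assumes that each aligned read is represented as a 0‐based reference start position plus its base string, and that alignments contain no insertions or deletions; both are simplifications made for illustration, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_position_histograms(
    alignments: Iterable[Tuple[int, str]], read_length: int
) -> Dict[Tuple[int, str], List[int]]:
    """For each (reference position, base), count how often the base was
    observed at each offset within the reads that cover that position."""
    histograms: Dict[Tuple[int, str], List[int]] = defaultdict(
        lambda: [0] * read_length
    )
    for reference_start, read in alignments:
        for offset, base in enumerate(read[:read_length]):
            histograms[(reference_start + offset, base)][offset] += 1
    return dict(histograms)


# Four ungapped reads tiled over a short reference (hypothetical data).
alignments = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC"), (3, "TACG")]
histograms = read_position_histograms(alignments, read_length=4)

# Base T at reference position 3 is seen once at every read offset,
# which is the uniform pattern expected for a genuine base.
print(histograms[(3, "T")])  # [1, 1, 1, 1]
```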
Post‐alignment quality control includes identification of mutations distributed non‐randomly along individual sequencing reads, which may indicate artifacts in PCR amplification or DNA sequencing procedures (Figure 1A and B). Biased distribution of mutations along sequencing reads was revealed by calculating Shannon entropy values. A low entropy value indicates an abnormal bias in the distribution of a mutation along the reads and suggests that the mutation could be an artifact produced in sequencing procedures. This entropy‐based post‐alignment quality control value is the normalized first‐order moment of the logarithmic probability distribution for a particular base (b) at a particular position (r) of the reference genome:
$$H_r(b) = -\frac{\sum_{i=1}^{L} p_i(b)\,\log p_i(b)}{\log L}$$
where L is the length of the longest read and p_i(b) is the frequency distribution of base b in the reference frame of the reads mapped at location r, that is, the fraction of the observations of b at r that fall at read position i. The index i runs over all of the available positions from 1 to L. The denominator ensures that the Shannon entropy is normalized, so that an entirely uniform distribution has the maximum value of 1. In contrast, a singular distribution, in which the base is observed at only one read position, has an entropy of zero. This value is computed for every base at every reference position.
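The normalized entropy described here can be computed directly from a read‐position histogram. The sketch below assumes the quantity is the Shannon entropy of the read‐position frequencies divided by log L, which is one reading of the description above; the function name and example counts are hypothetical.

```python
import math
from typing import Sequence


def normalized_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy of read-position counts, scaled to the range 0..1.

    counts[i] is the number of times the base was seen at read position i;
    len(counts) plays the role of L, the length of the longest read.
    """
    read_length = len(counts)
    total = sum(counts)
    if total == 0 or read_length < 2:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # terms with p = 0 contribute nothing
            probability = count / total
            entropy -= probability * math.log(probability)
    return entropy / math.log(read_length)


uniform = normalized_entropy([5, 5, 5, 5])     # close to 1: spread evenly
singular = normalized_entropy([0, 20, 0, 0])   # 0: always the same read position
skewed = normalized_entropy([17, 1, 1, 1])     # in between: suspicious bias
print(uniform, singular, skewed)
```

A variant whose supporting reads all carry it at the same read position scores near zero and would be flagged as a likely artifact, whereas a variant supported evenly along the reads scores near one.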
Finally, aligned sequencing reads were used to compute SNP profiles for the entire viral genome.
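As an illustration of what an SNP profile contains, the sketch below piles up aligned bases at each reference position and reports every non‐reference base whose frequency reaches a chosen threshold. It assumes ungapped alignments given as a 0‐based start position plus a base string, ignores base qualities, and uses hypothetical names and data; it is not the software used in this work.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Variant = Tuple[int, str, str, float]  # (1-based position, reference base, variant base, frequency)


def snp_profile(
    reference: str,
    alignments: Iterable[Tuple[int, str]],
    min_frequency: float = 0.01,
) -> List[Variant]:
    """Report non-reference bases at or above min_frequency at each position."""
    pileup = [Counter() for _ in reference]
    for reference_start, read in alignments:
        for offset, base in enumerate(read):
            position = reference_start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants: List[Variant] = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        for base, count in sorted(counts.items()):
            if base != reference[position] and depth and count / depth >= min_frequency:
                variants.append((position + 1, reference[position], base, count / depth))
    return variants


reference = "ACGTAC"
alignments = [(0, "ACGT"), (0, "ACGT"), (0, "ACTT"), (2, "GTAC")]
print(snp_profile(reference, alignments, min_frequency=0.05))
# [(3, 'G', 'T', 0.25)]
```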
5. The use of deep sequencing for evaluation of genetic consistency of influenza A vaccine viruses
Influenza A viruses are enveloped, single‐stranded RNA viruses belonging to the Orthomyxoviridae family, which also contains four other genera: influenza B virus, influenza C virus, thogotovirus, and isavirus. The segmented genome of influenza A virus is about 13.6 kb in size and encodes at least 11 proteins. Its genome is highly variable due to the low fidelity of RNA polymerase and reassortment between co‐infecting strains. New virus mutants emerge continuously, allowing viruses to survive in the presence of host immunity and cause repeated annual epidemics and occasionally pandemics. Because of this frequent change of the antigens, influenza vaccines must be frequently reformulated to include antigens of the currently circulating strains. Both live and inactivated influenza vaccines are produced mostly by reassortment with high‐growth strains for vaccine production [72, 73]. As stated above, adaptation to growth in different cells can lead to changes in the viral receptor‐binding region, and also in protective epitopes. Therefore, it is desirable to monitor genetic stability of viruses used in vaccine manufacture to ensure that their antigenic structure remains unchanged.
Deep sequencing technology has opened up the possibility of characterizing viral genomes directly from samples [74, 75]. The viral metagenome, or “virome”, refers to the collection of viruses found in a particular sample from humans, animals, or plants, or in a specific environmental sample. Virome studies can lead to the discovery of new viruses and to their association with known or novel diseases. Numerous viruses, including influenza A viruses, have been identified in virome studies.
Deep sequencing technologies are well suited to investigating genetically complex populations of influenza viruses and to detecting minority mutant variants of clinical or epidemiological relevance. Deep sequencing‐based methods have recently been applied to assess the diversity of influenza A viruses and the dynamics of their evolution [60, 76, 77]. Other studies have focused on the evolution of avian influenza strains with the potential to become pandemic in humans [78–82], as well as on the detection of virulence signatures and reassortment patterns. Still others have investigated the transmission and adaptation of avian influenza viruses to humans as part of preparedness for a potential influenza pandemic [12, 53, 85–98].
Because avian and swine influenza viruses can give rise to epidemics and pandemics in humans, it is crucial to study their transmission and adaptation, and many deep sequencing studies have described the evolution of avian and swine influenza viruses [99–103]. Other studies have investigated the predominance and spread of different human influenza viruses in specific geographic areas [16, 104–106].
The study of drug escape variants is an important aspect of epidemiological and clinical virology. Sanger sequencing can only detect mutations present in about 20% or more of the viral population [37, 107–109], which rules it out for quantitation of low‐level mutant variants. Deep sequencing has been shown to detect drug‐resistant mutant variants at levels as low as 0.1% of the virus population [110–116]. Other studies have focused on the use of deep sequencing for surveillance of mutations associated with resistance to both NA inhibitors and adamantanes [117–127]. Deep sequencing has also been used for the detection and subtyping of human influenza A viruses and reassortants [61, 84].
Recently, deep sequencing‐based methods have been proposed for assessing the antigenic stability of influenza A viruses using complete influenza A genomes and exploiting the ability to detect and quantify mutations in heterogeneous viral populations. Deep sequencing was used to study the evolution of influenza A viruses in vaccinated pigs. The genetic diversity and evolution of the virus at the intra‐host level were analyzed directly from nasal swabs collected during infection. The results demonstrated remarkable diversity of influenza A viruses and rapid change of these viruses during infection of vaccinated pigs. Studies of this complexity can be done only by high‐throughput sequence analysis.
To evaluate the genetic stability of influenza vaccine viruses, we used a deep sequencing approach that was recently qualified for quantitation of all mutants in the entire genome, including those present at low levels in viral populations. We recently explored the utility of deep sequencing methods for monitoring the consistency of influenza A vaccines [36, 60]. In that work, we also proposed new protocols for simultaneous amplification of all segments of influenza A genomes and new bioinformatic tools to analyze the data and to identify artifacts generated during PCR amplification and deep sequencing. Amplification of the entire genome of influenza viruses is challenging because the eight genomic segments differ in size and sequence composition.
We described PCR conditions that allow amplification of all genomic segments of influenza A virus in one reaction; these conditions were subsequently optimized during an analysis of the A/California/07/2009 (H1N1) vaccine viruses (derived from the X‐179A, X‐181, and 121XP viruses). We used both total RNA (without specific amplification of viral cDNA) and DNA amplicons to prepare RNA and DNA libraries, respectively, and compared the two protocols for consistency in quantitation of mutant variants.
The protocols for deep sequencing of viral DNA libraries and whole‐RNA libraries, used to determine quantitative profiles of mutations along the entire genome of influenza A/California/07/2009 (H1N1) vaccine viruses, have been described. The steps followed to perform the deep sequencing are presented in Figure 2 and can be summarized as follows. The PCR product was purified with the QIAquick PCR Purification Kit (Qiagen) and fragmented with an ultrasonicator (Covaris) to generate the optimal fragment sizes for Illumina sequencing; the fragmented DNA was then used for library preparation with the NEBNext® DNA Sample Prep Reagent Set 1 (New England BioLabs).
For preparation of Illumina sequencing libraries from total RNA, the NEBNext mRNA Sample Prep Master Mix Set 1 (New England BioLabs) was used. Briefly, total RNA (extracted as mentioned above) was fragmented as described above to generate the optimal fragment sizes. Double‐stranded cDNA was prepared and ligated to the Illumina paired‐end adaptors. Finally, the libraries were amplified by 15 cycles of PCR with multiplex indexed primers and purified with Agencourt AMPure magnetic beads (Beckman Coulter). Deep sequencing was performed at Macrogen (Seoul, Korea) using a HiSeq2000 (Illumina) or in our laboratory using a MiSeq (Illumina).
The sequencing data were analyzed using custom software, the highly integrated virtual environment (HIVE) computer cluster (https://hive.biochemistry.gwu.edu/dna.cgi?cmd=main), as described above. The RNA sequences of the X‐179A, X‐181, 121XP, A/California/07/2009 (H1N1), and A/PR/08/34 viruses deposited in NCBI GenBank were used as references for alignment of the viral sequence reads. For each segment of influenza virus, we analyzed the depth of sequencing, the single‐nucleotide polymorphism profile, and the entropy, which allows us to distinguish sequencing bias from true mutations (see Figure 3 for an example). The data analysis also generated consensus sequences for each segment.
The deep sequencing results revealed several heterogeneities in most genomic segments, and several mutations led to amino acid changes. Deep sequencing of whole‐RNA libraries was found to be more reproducible than sequencing of DNA libraries. This may be due to errors introduced by DNA polymerase during PCR amplification and to non‐specific annealing of primers.
The deep sequencing of the X‐179A passaged viruses identified several mutations in the HA and NA genes. In HA, two non‐synonymous mutations were identified: Pro314Gln (in 17% of the virus population) and Asn146Asp (in 78% of the virus population). The Asn146Asp mutation lies in the antigenic site Sa; it was detected at 11% in the A/California/07/2009 (H1N1) strain and as the dominant residue in the X‐181 virus. One X‐179A stock contained the Lys328Thr mutation at a low level (9%). Viruses derived from X‐179A were heterogeneous and contained some complete nucleotide substitutions relative to their published sequences in the PB2, PB1, NP, and NS segments. The X‐181 virus was developed from the X‐179A seed lot by another round of reassortment and was also subjected to several passages in eggs. Deep sequencing results showed that the G756T mutation (Glu252Asp, present at 47%) emerged in HA of the passaged X‐181 virus; it is located in the conserved region of the antigenic site Ca.
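The relationship between a nucleotide substitution and the resulting amino acid change, such as G756T and Glu252Asp, can be checked with a short Python sketch. It assumes that nucleotide numbering starts at the first base of the HA coding sequence and uses the standard genetic code. Under that assumption, position 756 is the third base of codon 252, and a glutamate codon ending in G must be GAG. The helper names are hypothetical.

```python
from typing import Tuple

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def translate(codon: str) -> str:
    """Translate one DNA codon into a one-letter amino acid code."""
    first, second, third = (BASES.index(base) for base in codon.upper())
    return AMINO_ACIDS[16 * first + 4 * second + third]


def locate(nucleotide_position: int) -> Tuple[int, int]:
    """Map a 1-based coding-sequence position to (codon number, base within codon)."""
    zero_based = nucleotide_position - 1
    return zero_based // 3 + 1, zero_based % 3 + 1


def substitute(codon: str, base_in_codon: int, new_base: str) -> str:
    """Return the codon with the base at the given 1-based offset replaced."""
    index = base_in_codon - 1
    return codon[:index] + new_base + codon[index + 1:]


codon_number, base_in_codon = locate(756)            # codon 252, third base
mutant_codon = substitute("GAG", base_in_codon, "T")  # GAG becomes GAT
change = translate("GAG") + str(codon_number) + translate(mutant_codon)
```

With these assumptions `change` evaluates to `E252D`, the one‐letter form of Glu252Asp, which is consistent with the annotation reported for G756T.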
Unlike the X‐179A and X‐181 viruses, 121XP was developed by reverse genetics. Deep sequencing of the 121XP virus passaged 10 times in eggs (121XP‐M4 virus) showed that this virus is more heterogeneous than the X‐179A and X‐181 viruses passaged 10 times in eggs (X‐179A‐M1 and X‐181‐M4 viruses, respectively; Figure 4). In the passaged 121XP virus, the Lys226Glu mutation emerged at a low level (18%) in the Ca antigenic site of HA, very close to the region that participates in the modulation of HA receptor specificity and that enables H3 influenza viruses to switch specificity from avian to human [131–133]; another mutation, Lys136Asn, emerged at a high level (78%) close to the HA antigenic site Sa within the sialic acid‐binding pocket. Recently, a similar deep sequencing approach was used to study the genetic and potential antigenic diversity of influenza viruses infecting humans, some of whom became infected despite recent vaccination.
We found that the deep sequencing approach based on RNA library preparation was effective and reproducible for detection of low quantities of mutants in the entire genome of influenza A vaccine viruses. The deep sequencing approach revealed that the viruses derived from three pandemic A/California/07/2009 (H1N1) vaccine viruses have varying levels of sequence heterogeneity, some of it in antigenic sites, which may affect their efficacy.
In the last few years, the use of deep sequencing has expanded greatly to tackle problems in many fields of virology. The greatest benefit of deep sequencing is its ability to detect minor mutant variants present at levels as low as 0.1% of the virus population [36, 59, 60, 110, 111, 113, 114]. The deep sequencing approach based on RNA library preparation is effective and reproducible for detection of low quantities of mutants in the entire genome of influenza A vaccine viruses, and it eliminates the need for full‐length amplification. Deep sequencing platforms are improving continuously to combine low error rates with long reads and relatively low cost. Deep sequencing has played a key role in the discovery of many new viruses, in the characterization of virus populations in humans, and in exploring their potential association with the pathogenesis of several diseases. As described here, deep sequencing is facilitating and accelerating the evaluation of the genetic consistency of vaccine viruses. It is an important tool for monitoring vaccine consistency during manufacture and after vaccination. Deep sequencing‐based assays are already being implemented for evaluating the genetic consistency of oral polio vaccine and influenza A vaccine viruses [36, 59, 60]. The ability to quantify potentially undesirable mutations in vaccine batches makes this method suitable for quality control to ensure the manufacture of safe and effective vaccines.
We thank Dr. Konstantin Chumakov and Dr. Christian Sauder for their critical review of this chapter. The contents of this chapter represent solely the opinion of the authors and do not represent the official view of the FDA.