Open access peer-reviewed chapter

Clinical Validation of a Whole Exome Sequencing Pipeline

Written By

Debra O. Prosser, Indu Raja, Kelly Kolkiewicz, Antonio Milano and Donald Roy Love

Submitted: March 3rd, 2020 Reviewed: June 23rd, 2020 Published: July 15th, 2020

DOI: 10.5772/intechopen.93251



Abstract

Establishing whole exome sequencing (WES) in an accredited clinical diagnostic space is challenging. The validation (as opposed to verification) of an approach that will lead to clinical reports requires adherence to international guidelines and recommendations, and the development of a robust analytical pipeline that can scale to meet the increasing clinical demand for comprehensive gene screening. This chapter presents a step-wise approach to WES validation that any laboratory can follow. The focus is on the pivotal technical issues that must be addressed in validating WES, and on the analytical tools and QC metrics that must be considered before implementing WES in a clinical environment.


Keywords

  • whole exome sequencing
  • next-generation sequencing
  • validation
  • bioinformatics
  • diagnostics

1. Introduction

The decision as to which type of genetic test should be implemented by a clinical laboratory is largely driven by the type of referrals received by the laboratory and the complexity of patients’ clinical phenotypes. In the main, testing has advanced from single-gene to multi-gene panels in which next-generation sequencing (NGS) has offered the technical means of undertaking this approach at low cost and high throughput. However, with the increasing awareness of genetic heterogeneity combined with gene discovery, whole exome sequencing (WES) offers laboratories a more streamlined approach. By implementing a single wet-work pipeline of exome capture coupled with the ability to analyze a virtual gene panel or report on the whole exome, laboratories can perform NGS in a more efficient manner.

Since the inception of NGS over a decade ago, multiple recommendations and guidelines have been published for NGS [1, 2, 3]. Using these guidelines, the College of American Pathologists (CAP) and Association for Molecular Pathology (AMP) published their Practical Framework for Designing and Implementing NGS Tests for Inherited Disorders in 2019 [4], which is available through the CAP website.

We adopted this framework to establish a diagnostic NGS service using whole exome sequencing as our capture procedure and analyzing virtual gene panels or WES for reporting purposes.

The framework provides guidance and editable worksheets for the five steps involved in test establishment and validation.

  1. Test design: setup

  2. Assay design and optimization

  3. Test validation

  4. Quality management

  5. Bioinformatics and IT

Throughout the validation process, it is essential that the NGS workflow is informed by the real-world local environment in which clinical testing will be performed.


2. Test design: setup

In view of the diverse range of referrals made to the authors’ genetics laboratory (serving a 400-bed women’s and children’s hospital in the Middle East), a whole exome capture solution was chosen for library preparation. The principal motivation was to achieve an efficient workflow that would allow appropriate batching together with a defined turnaround time (TAT) for all referrals.

The limited number of staff in the authors’ laboratory demanded a WES workflow that could be easily automated, paired with a data analysis package offering secure remote access and a robust database function. The Whole Exome Solution from SOPHiA™ Genetics was chosen for library preparation. This platform supports the analysis of WES, clinical exome sequencing (CES) and clinical gene panels, together with the identification of single-nucleotide variants (SNVs) and copy number variants (CNVs) using SOPHiA™ DDM software.


3. Assay design and optimization

From the outset, the validation pipeline must be grounded in the requirements of the test, which must take into account the sample types that the laboratory will receive and the parameters that need to be satisfied (see Table 1).

Test requirements | Must have | Nice to have
Clinical panels | Y
CNV detection | Y
Necessary sample throughput per month | 16 | 32
How deeply does each position need to be covered for accurate variant calling (if known; otherwise address during test optimization) | >20x | >50x
DNA from whole blood collected in EDTA | Y
DNA from external/commercial sources (limitations) | Y
Required/expected TAT | 3 months | 2 months
Combine different tests (existing or planned) within a sequencing run | Y

Table 1.

Test requirements and limitations.

WES, whole exome sequencing; CES, clinical exome sequencing; CNV, copy number variant; TAT, turnaround time.

Routinely, whole blood samples collected in EDTA are received by the authors’ laboratory for testing. Therefore, our validation focused only on genomic DNA extracted from whole blood using our standard methods. The baseline validation of the WES data required the inclusion of two HapMap gDNA samples: the NIST control (NA12878) and the commercial control (SG063) supplied by SOPHiA™ Genetics.

The WES capture by SOPHiA™ Genetics was used for library preparation, following all the steps set out in the automated WES 32-reaction protocol. For instrumentation, our validation was restricted to automated library preparation using the PerkinElmer Sciclone® G3 NGS workstation and sequencing using the Illumina® HiSeq4000 platform.

A critical additional consideration was the need to call copy number variants. This required a minimum batch size of eight patients and high coverage, which meant restricting each Illumina® HiSeq4000 lane to one pool of eight patients.

Importantly, the naming of the sequence files (.bam, .fastq, etc.) should be considered during the early phase of test design and validation. The file naming conventions used in the bioinformatic process may restrict the special characters allowed and/or the length of the name. Following the recommendations in the AMP/CAP Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines [5], the identity of the sample must be preserved throughout all steps of the bioinformatic pipeline. These authors recommend that the following four unique identifiers be applied to the sample file name:

  1. Unique sample identifier

  2. Unique patient identifier

  3. Unique run identifier

  4. Laboratory location identifier

It is essential that the file naming convention that is decided upon for validation adheres to the above recommendations and can be universally implemented for all subsequent testing.
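As an illustration of such a convention, the following is a minimal sketch that joins the four identifiers into a single file stem and recovers them again. The field order, the underscore separator, the allowed character set and the function names are all assumptions made for this example; they are not taken from the guidelines or from the authors' pipeline.

```python
import re

# Hypothetical convention: SAMPLE_PATIENT_RUN_LAB, e.g. "S0001_P0042_RUN001_LAB01".
# Allowed characters and maximum length are assumptions chosen to stay
# compatible with tools that reject special characters or long names.
FIELD = re.compile(r"[A-Za-z0-9-]{1,16}")
FIELD_NAMES = ("sample_id", "patient_id", "run_id", "lab_id")


def build_file_stem(sample_id: str, patient_id: str, run_id: str, lab_id: str) -> str:
    """Join the four identifiers into one file stem, rejecting unsafe values."""
    fields = (sample_id, patient_id, run_id, lab_id)
    for value in fields:
        if not FIELD.fullmatch(value):
            raise ValueError(f"identifier {value!r} has unsupported characters or length")
    return "_".join(fields)


def parse_file_stem(stem: str) -> dict:
    """Recover the four identifiers from a stem produced by build_file_stem."""
    parts = stem.split("_")
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected 4 identifiers, found {len(parts)} in {stem!r}")
    return dict(zip(FIELD_NAMES, parts))
```

Because the underscore is reserved as the separator and excluded from the identifiers themselves, a stem can always be split back into its four parts, which is one way of preserving sample identity through every pipeline step.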


4. Test validation

Test validation requires an assessment of accuracy, precision and stability. These assessments must be made in the context of expected clinical workloads and performance. For the authors’ laboratory, the batch size was set at 16 samples per validation batch, and a total of three validation runs were performed on different days by different technologists.

Analytical performance was characterized by the assessment of precision, sensitivity and concordance of variant calls against previously validated data.

Inter-run and intra-run data were obtained by replicate analysis of two HapMap gDNAs (the NIST sample, NA12878, and the commercial control supplied by SOPHiA™ Genetics, SG063) as well as four well-characterized clinical samples previously reported by accredited laboratories. The remaining samples were a representative group of the clinical samples received by the authors’ laboratory (see Table 2).

Sample ID | Description | Purpose | Purpose (detail) | Specific variant/s of interest | Variant type | Measured metric
VAL-1 | NA12878 | Baseline validation | N/A | N/A | N/A | Intra-run variability; Inter-run variability
VAL-2 | SG063 | Baseline validation | N/A | N/A | N/A | Intra-run variability; Inter-run variability
VAL-3 | Anonymized patient specimen | Baseline validation | Variant type | Ciliopathy gene panel; CCDC39:c.2017G>T p.(Glu673*); CCDC39: Deletion of exons 14 to 20 | SNV, CNV | Inter-run variability; Sensitivity
VAL-4 | Anonymized patient specimen | Baseline validation | Variant type prevalent in gene | Single-gene analysis; CFTR:c.1521_1523delCTT p.(Phe508del) | DEL | Inter-run variability; Sensitivity
VAL-5 | Anonymized patient specimen | Baseline validation | Variant type | Craniosynostosis gene panel; CACNA1H:c.4318_4319delinsGC p.(Phe1440Ala) | DELINS | Inter-run variability; Sensitivity
VAL-6 | Anonymized patient specimen | Baseline validation | Variant type prevalent in gene | Tuberous sclerosis gene panel; TSC2: Deletion of exons 2 to 16 | CNV | Inter-run variability; Sensitivity
VAL-7 | Anonymized patient specimen | Gene-specific validation | Variant type | Arrhythmia cardiomyopathy gene panel; SCN5A:c.4867C>T p.(Arg1623*) | SNV (stop) | Sensitivity
VAL-8 | Anonymized patient specimen | Gene-specific validation | Variant type | Custom panel of 196 genes; 200 genomic coordinates | SNV, DEL/DUP | Sensitivity
VAL-9 | Anonymized patient specimen | Gene-specific validation | Variant type | Paroxysmal Dystonia gene panel; Del 16p11.2 chr16:29,656,684-30,190,568 | CNV | Sensitivity
VAL-10 | Anonymized patient specimen | Gene-specific validation | Variant type | Leukodystrophy gene panel; MLC1:c.908_918delinsGCA p.(Val303Glyfs*96) | DELINS | Sensitivity
VAL-11 | Anonymized patient specimen | Gene-specific validation | Variant type | Epilepsy gene panel; WWOX: Deletion of exons 1–5 | CNV | Sensitivity
VAL-12 | Anonymized patient specimen | Gene-specific validation | Variant range | Epilepsy gene panel | SNV, DEL/DUP | Sensitivity
VAL-13 | Anonymized patient specimen | Gene-specific validation | Variant type | Single-gene analysis; CFTR: deletion of exons 4–8 | CNV | Sensitivity
VAL-14 | Anonymized patient specimen | Gene-specific validation | Variant range | Neuropathy gene panel | SNV, DEL/DUP | Sensitivity
VAL-15 | Anonymized patient specimen | Gene-specific validation | Variant range | Cholestasis gene panel | SNV, DEL/DUP | Sensitivity
VAL-16 | Anonymized patient specimen | Gene-specific validation | Variant type | Tuberous sclerosis gene panel (2 genes); TSC2:c.5238_5255del p.(His1746_Arg1751del) | DEL | Sensitivity
VAL-17 | Anonymized patient specimen | Chromosomal CNV validation | Variant type | Molecular karyotype referral; Dup 22q11.21 chr22:18,661,724-21,809,099 | CNV | Sensitivity
VAL-18 | Anonymized patient specimen | Gene-specific validation | Variant range | Primary ciliary dyskinesia gene panel; DNAH5: Gain of exons 1 to 50; DNAH5:c.5503C>T p.(Gln1835*) | SNV, CNV | Sensitivity
VAL-19 | Anonymized patient specimen | Gene-specific validation (pseudogene) | Variant range | Inherited cancer gene panel; CDKN2A:c.9_32dup p.(Ala4_Pro11dup) | SNV, DEL | Sensitivity
VAL-20 | Anonymized patient specimen | Gene-specific validation | Variant range | Custom panel of 196 genes; 200 genomic coordinates | SNV, DEL/DUP | Blind analysis
VAL-21 | Anonymized patient specimen | Chromosomal CNV validation | Variant type | Molecular karyotype referral; Duplication at 16p13.11, deletion at 12p31 and duplication at Xp21.1 | CNV | Sensitivity
VAL-22 | Anonymized patient specimen | Gene-specific validation | Variant type prevalent in gene | Single-gene analysis; DMD: duplication exons 45–62 | CNV | Sensitivity
VAL-23 | Anonymized patient specimen | Gene-specific validation | Variant type prevalent in gene | Dystrophinopathy gene panel; DMD: deletion of exons 8–34 | CNV | Sensitivity
VAL-24 | Anonymized patient specimen | Gene-specific validation | Variant range | Custom panel of 196 genes; 200 genomic coordinates | SNV, DEL/DUP | Sensitivity
VAL-25 | Anonymized patient specimen | Gene-specific validation (pseudogene) | Pseudogene | Custom panel of nine genes | SNV, DEL/DUP | Sensitivity
VAL-26 | Anonymized patient specimen | Gene-specific validation | Variant type | Primary Immunodeficiency gene panel; TBX1:c.1383_1421del p.(Ala464_Ala476del) | DEL | Sensitivity
VAL-27 | Anonymized patient specimen | Gene-specific validation | Variant type | Dilated cardiomyopathy gene panel; TTN:c.75984_75985insTACCA p.(Ala25329Tyrfs*32) | INS | Sensitivity
VAL-28 | Anonymized patient specimen | Gene-specific validation | Variant type | Pediatric cancer gene panel; SMARCB1:c.159_160delinsTATCTGGAGGCG (p.Leu54Ilefs*20) | DELINS | Sensitivity

Table 2.

Sample list.

DEL, deletion; INS, insertion; DUP, duplication; SNV, single-nucleotide variant; CNV, copy number variant.

The complete NGS workflow should be included in the validation, from library preparation through bioinformatic analysis to report generation, as outlined below.

  • Sample collection and DNA extraction. Genomic DNA is extracted and purified from blood samples using either the Gentra® PureGene® DNA Blood Mini Kit or the QIAsymphony® DSP DNA Midi kit (QIAGEN, Hilden, Germany). DNA quality is initially assessed by NanoDrop™ spectrophotometry.

  • Genomic DNA preparation. The initial preparation of gDNA used in NGS library preparation is the most critical step in the NGS workflow, and the care and time taken here are key to successful library amplification and sequencing.

High-quality gDNA can be quantified using a Qubit™ fluorometer, followed by sequential dilution with further quantification until the desired input concentration is reached. It is essential to minimize the pipetting of gDNA volumes of less than 5 μl during dilution. In our study, gDNA was prepared to a working concentration of 40 ng/μl. After Qubit™ quantification, the integrity of the gDNA can be analyzed using an Agilent TapeStation 4200. Samples with a DNA integrity number (DIN) greater than 7.5 can proceed to WES capture.
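The dilution arithmetic for a single step can be sketched as follows, using the standard relation C1 × V1 = C2 × V2. The 40 ng/μl target and the 5 μl pipetting floor come from the text; the 50 μl final volume and the function name are assumptions for illustration only.

```python
def dilution_volumes(stock_ng_per_ul: float,
                     target_ng_per_ul: float = 40.0,
                     final_volume_ul: float = 50.0,   # assumed working volume
                     min_pipette_ul: float = 5.0):
    """Return (gDNA volume, diluent volume) in microlitres for one dilution step.

    Uses C1 * V1 = C2 * V2. Raises ValueError when the stock is too dilute,
    or when the required gDNA volume falls below the pipetting floor, in
    which case an intermediate (sequential) dilution is needed first.
    """
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock concentration is below the target concentration")
    dna_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    if dna_ul < min_pipette_ul:
        raise ValueError(
            f"{dna_ul:.2f} ul of gDNA is below the {min_pipette_ul} ul floor; "
            "use an intermediate dilution"
        )
    return round(dna_ul, 2), round(final_volume_ul - dna_ul, 2)
```

For example, a 200 ng/μl stock would need 10 μl of gDNA and 40 μl of diluent, whereas a 1000 ng/μl stock would need only 2 μl and is therefore rejected in favor of a sequential dilution.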

  • Library preparation, targeted capture and sequencing. Whole exome sequencing was performed according to the SOPHiA™ Whole Exome Solution 32 Samples User Guide, in combination with the SOPHiA™ Library Preparation and Capture User Guide—automation with PerkinElmer Sciclone® G3 NGS workstation. Each validation run consists of 16 samples that are divided into 2 pools of 8 samples each, as shown in the validation grid in Table 3.

Run 001Run 002Run 003
Pool AAVAL-1
Pool BEVAL-1

Table 3.

Validation grid.

Copy number variant (CNV) samples are indicated in bold.

The SOPHiA™ WES protocol for library construction subjects genomic DNA (200 ng) to enzymatic fragmentation, end repair, A-tailing and adapter ligation. All these steps are performed on a Sciclone® G3 NGS workstation. The adapter-ligated DNA then undergoes limited amplification using an eight-cycle PCR protocol.

Post-amplification cleanup of the libraries is carried out using the Sciclone® G3 NGS workstation, and libraries are prepared for quantitation with a dilution factor of 4.

Amplified libraries are analyzed using a Qubit™ fluorometer and an Agilent TapeStation 4200 to assess the quantity and quality of each individual library. Library DNA fragments should have a size distribution between 300 and 700 bp. Genomic DNA that has been fragmented, end repaired, A-tailed and adapter-ligated can then be considered library DNA, which is ready for pooling followed by hybridization and capture. In the case of the SOPHiA™ WES protocol, eight samples are pooled (200 ng of each library) per capture.

Prepared pools are hybridized for 4 h followed by post-capture amplification and cleanup on the Sciclone® G3 NGS workstation.

Final library quantification is performed for each captured library pool using a Qubit™ fluorometer and an Agilent TapeStation 4200. The pools are then diluted to 20 nM (in a total volume of 20 μl) and sequenced on an Illumina® HiSeq4000 platform.
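The conversion from a mass concentration (Qubit™) and a mean fragment size (TapeStation) to molarity, and the subsequent dilution to 20 nM in 20 μl, can be sketched as below. The sketch assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; the function names are hypothetical, and the vendor protocol may prescribe its own calculation.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float,
                        g_per_mol_per_bp: float = 660.0) -> float:
    """Convert a dsDNA concentration in ng/ul to nM for a given mean fragment size.

    nM = (ng/ul * 1e6) / (660 g/mol/bp * fragment length in bp)
    """
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_fragment_bp)


def dilute_to_target(pool_nm: float, target_nm: float = 20.0,
                     final_volume_ul: float = 20.0):
    """Return (pool volume, buffer volume) in ul to reach target_nm in final_volume_ul."""
    if pool_nm < target_nm:
        raise ValueError("pool is already more dilute than the target molarity")
    pool_ul = target_nm * final_volume_ul / pool_nm
    return pool_ul, final_volume_ul - pool_ul
```

Under these assumptions, a pool measured at 10 ng/μl with a mean fragment size of 400 bp is roughly 37.9 nM, so about 10.6 μl of pool made up to 20 μl gives 20 nM.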

  • Sequence analysis: performance metrics. Baseline performance metrics for the WES validation study must involve the analysis of well-characterized reference samples: the NIST sample (NA12878) and the SOPHiA™ Genetics control SG063. The sequence metrics for each sample in the run must be recorded and averages established using the reference samples. Samples must meet the sequencing metrics shown in Table 4 in order to reach the threshold for clinical reporting.

Selected sequencing metrics | Must have | Nice to have
Q30 score | >80 | >85
Total number of reads per sample | >70 M | 80–100 M
Percentage of mapped reads | >80% | >85%
Total percentage on-target reads | >90% | >95%
Coverage 10% quantile (at this depth, 90% target covered) | 20x | 50x

Table 4.

Sequencing metrics.
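A per-sample check against the must-have thresholds in Table 4 could be automated along the following lines. This is a sketch only: the dictionary keys are hypothetical names, and treating the 20x coverage figure as "at least" while the other thresholds are strict inequalities is an interpretation of the table.

```python
# Must-have thresholds transcribed from Table 4; the key names are hypothetical.
# Each entry is (threshold, strict): strict=True means "greater than",
# strict=False means "greater than or equal to".
MUST_HAVE = {
    "q30_score": (80.0, True),
    "total_reads_millions": (70.0, True),
    "mapped_reads_percent": (80.0, True),
    "on_target_percent": (90.0, True),
    "coverage_10pct_quantile_x": (20.0, False),
}


def failed_metrics(sample_metrics: dict) -> list:
    """Return the names of must-have metrics a sample misses.

    A metric that is absent from sample_metrics is reported as a failure.
    """
    failures = []
    for name, (threshold, strict) in MUST_HAVE.items():
        value = sample_metrics.get(name)
        if value is None:
            failures.append(name)
        elif strict and not value > threshold:
            failures.append(name)
        elif not strict and not value >= threshold:
            failures.append(name)
    return failures
```

A sample would be eligible for clinical reporting only when the returned list is empty; any names returned identify which metric needs to be investigated.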

Analytical sensitivity and specificity must be calculated separately for each variant type (SNV, indel, CNV, etc.). Additional runs may be required to meet acceptable confidence intervals for less frequent variant types of insertions and deletions. For 95% confidence and 95% reliability, 59 variants of each type (and insertion/deletion range) should be analyzed [5]. The variant types that do not have strong confidence intervals must be listed in the test limitations of the clinical report until such time that the desired confidence levels have been achieved.
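The figure of 59 variants is consistent with the zero-failure (success-run) sample size relation n = ln(1 − C) / ln(R), where C is the confidence level and R the reliability. The sketch below reproduces that arithmetic; it assumes that every variant tested is detected, and the function name is hypothetical.

```python
import math


def min_variants(confidence: float = 0.95, reliability: float = 0.95) -> int:
    """Smallest n such that n consecutive correct detections demonstrate
    the stated reliability at the stated confidence (no failures allowed).

    Derived from reliability ** n <= 1 - confidence.
    """
    if not (0.0 < confidence < 1.0 and 0.0 < reliability < 1.0):
        raise ValueError("confidence and reliability must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


print(min_variants())            # 59
print(min_variants(0.95, 0.99))  # 299
```

The second call illustrates how quickly the requirement grows: demonstrating 99% reliability at the same confidence would need 299 variants of a given type, which is why less frequent variant types may require additional runs.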


5. Quality management

The worksheets described by Santani et al. [4] set out very clear guidance on all the quality aspects that need to be taken into consideration for the test to meet CAP requirements. A validation study will uncover the majority of a test’s limitations, which can be recorded against the QC parameters. Table 5 summarizes the quality metrics that need to be addressed.

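Section | Category | Criteria | Specific requirement (note that these may vary between tests and laboratories)
Pre-analytical QC (per sample) | Specimen quality | Wrong specimen type | Whole blood
Pre-analytical QC (per sample) | Specimen quality | Wrong type of tube | Purple top EDTA tube
Pre-analytical QC (per sample) | Specimen quality | Insufficient quantity | ≥0.5 ml
Pre-analytical QC (per sample) | Specimen quality | Clotting (blood only) | No visible clots
Pre-analytical QC (per sample) | Specimen quality | Insufficient labelling | Labelling contains name, DOB, barcode, date of collection
Pre-analytical QC (per sample) | Specimen quality | Expired specimen | ≤7 days since collection
Pre-analytical QC (per sample) | Specimen quality | Expired collection tube | Collection tube not expired
Pre-analytical QC (per sample) | DNA quality and quantity | OD 260/280 ratio | >1.7
Pre-analytical QC (per sample) | DNA quality and quantity | Electrophoretic analysis | Shows intact high molecular weight DNA band
Pre-analytical QC (per sample) | DNA quality and quantity | Quantification | ≥500 ng
Pre-analytical QC (per sample) | DNA quality and quantity | DNA integrity number (DIN) | >7.5
Analytical QC (per instrument run) | Instrument run QC | Cluster density | Not taken into account
Analytical QC (per instrument run) | Instrument run QC | Base quality | Q30 ≥ 80
Analytical QC (per instrument run) | Pipeline QC | Total reads passing filter | >280 M per lane
Analytical QC (per instrument run) | Pipeline QC | % reads not assigned to any sample | <5%
Analytical QC (per instrument run) | Control samples | Positive control | Expected variants found
Analytical QC (per sample) | Library preparation | Fragment size and distribution | >80% of fragments between 300 and 700 bp
Analytical QC (per sample) | Library preparation | Pooled library concentration | >20 nM
Analytical QC (per sample) | Sample de-multiplexing | % reads assigned to sample | 8–12%
Analytical QC (per sample) | Read alignment | % reads aligned to target | >90%
Analytical QC (per sample) | Read alignment | Distribution of coverage | >95% within 25–200×
Analytical QC (per sample) | Read alignment | Coverage 10% quantile (at this depth 90% target covered at x) | >40×
Analytical QC (per sample) | Read alignment | PCR duplicates | <20%
Analytical QC (per sample) | Specimen identity | Accurate specimen identity, file names with 4 points of identification | All worksheets and transfers during bench work are witness checked for accurate specimen identification
Analytical QC (per sample) | Data transfer | Integrity | Data transfer to secure analysis platform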

Table 5.

Quality management.
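As one example of turning Table 5 into an automated check, the sketch below evaluates the two de-multiplexing criteria for a lane: 8–12% of reads assigned to each sample and fewer than 5% of reads unassigned. It assumes that both percentages are taken over all reads in the lane, including unassigned reads; the function name and the structure of the returned report are hypothetical.

```python
def demultiplex_report(reads_per_sample: dict, unassigned_reads: int,
                       share_range=(8.0, 12.0), max_unassigned: float = 5.0) -> dict:
    """Check per-sample read shares and the unassigned fraction for one lane.

    Percentages are computed over all reads in the lane (assigned + unassigned),
    which is an assumption about how the Table 5 criteria are defined.
    """
    total = sum(reads_per_sample.values()) + unassigned_reads
    if total <= 0:
        raise ValueError("lane contains no reads")
    low, high = share_range
    shares = {s: 100.0 * n / total for s, n in reads_per_sample.items()}
    # Keep only the samples whose share falls outside the accepted window.
    out_of_range = {s: round(p, 2) for s, p in shares.items() if not low <= p <= high}
    unassigned_pct = 100.0 * unassigned_reads / total
    return {
        "unassigned_percent": round(unassigned_pct, 2),
        "unassigned_ok": unassigned_pct < max_unassigned,
        "out_of_range": out_of_range,
    }
```

With eight samples per lane, an evenly balanced pool sits close to the upper end of the 8–12% window, so a sample falling below 8% points to an under-represented library in that pool.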


6. Bioinformatics and IT

To assess accuracy, the genetic variants called must be compared against publicly available reference data obtained from the 1000 Genomes Project.
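In its simplest form, such a comparison reduces to set operations on variant calls. The following is a minimal sketch that assumes both call sets have already been normalized to the same representation and restricted to the same target regions; the `Variant` tuple and the function name are hypothetical, and a production comparison would normally rely on a dedicated benchmarking tool.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    """A variant keyed by position and alleles (assumes a shared reference build)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def concordance(called: Set[Variant], truth: Set[Variant]) -> dict:
    """Count true positives, false positives and false negatives by exact match,
    then derive sensitivity and precision (None when undefined)."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
    }
```

Running the comparison separately for each variant type (SNV, insertion, deletion and so on) yields the per-type sensitivity figures that the validation requires.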

Clinical association, gene validity and mutation spectrum are applied to the creation of virtual gene panels in order to aid variant interpretation and reporting. The considerations associated with constructing virtual gene panels and the analysis of variants are shown in Table 6.

Gene selection | Clinical association | ClinGen
Gene analysis | Appropriate transcripts | LRG; PanelApp – Genes and Entities
Gene analysis | Evaluated homopolymeric regions | Ivády et al. [6], DOI: 10.1186/s12864-018-4544-x
Gene analysis | Mutation spectrum: reported deep intronic and/or promoter region variants | PanelApp – Genes and Entities
Gene analysis | CNV analysis | ClinVar
Gene analysis | Establish if critical variants are not covered by assay
Virtual panel creation | Expert reviewed panels | PanelApp

Table 6.

Considerations for gene selection, analysis and virtual panel creation.
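Conceptually, a virtual gene panel is a filter applied to exome-wide variant calls so that only variants in the genes of interest are presented for interpretation. The sketch below illustrates the idea under simple assumptions: each variant is a dictionary with a `gene` key, the panel is a plain list of gene symbols, and the function name is hypothetical. It does not describe how SOPHiA™ DDM implements panels.

```python
def apply_virtual_panel(variants: list, panel_genes) -> list:
    """Return only the variants whose gene symbol belongs to the virtual panel.

    Gene symbols are compared case-insensitively; variants without a gene
    annotation are excluded.
    """
    panel = {gene.upper() for gene in panel_genes}
    return [v for v in variants if v.get("gene", "").upper() in panel]


# Illustrative input using two variants listed in Table 2.
exome_calls = [
    {"gene": "TSC2", "hgvs": "c.5238_5255del"},
    {"gene": "CFTR", "hgvs": "c.1521_1523delCTT"},
]
tuberous_sclerosis_panel = ["TSC1", "TSC2"]  # assumed content of a two-gene panel
print(apply_virtual_panel(exome_calls, tuberous_sclerosis_panel))
```

Because the capture and sequencing are identical for every referral, changing the test amounts to changing the gene list, which is what makes a single wet-work pipeline practical.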


7. Conclusions

The decision to implement WES in a clinical diagnostic environment must take into account the local context, which encompasses clinical complexity, staff resources, equipment resources and bioinformatic expertise. The decisions described here were based on these considerations, with a view to creating opportunities, the most important of which was a WES pipeline that could scale over time in the number of patients tested and that had the potential to become a regional resource.

It should be stressed, however, that a WES pipeline is sandwiched between two critical elements: first, the need to focus on the quality and accurate quantitation of genomic DNA, which dictates the quality of everything that happens downstream; and second, the need to understand that, while the identification of DNA variants is technically demanding, the classification of those variants is not currently a fully automated process. The former can sometimes be overlooked, while the latter can be a daunting exercise. Approaches to variant classification are perhaps the subject of another book chapter.


Conflicts of interest

The authors declare no conflicts of interest.



Acknowledgments

The authors wish to thank Mr. Duncan Kay of Custom Science (NZ) for his generous suggestions regarding commercial providers for WES data analysis and Javier Botet of SOPHiA™ Genetics for his advice regarding quality management considerations.


References

  1. Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al. ACMG clinical laboratory standards for next-generation sequencing. Genetics in Medicine. 2013;15(9):733-747. DOI: 10.1038/gim.2013.92
  2. Hegde M, Santani A, Mao R, Ferreira-Gonzalez A, Weck KE, Voelkerding KV. Development and validation of clinical whole-exome and whole-genome sequencing for detection of germline variants in inherited disease. Archives of Pathology & Laboratory Medicine. 2017;141:798-805. DOI: 10.5858/arpa.2016-0622-RA
  3. Matthijs G, Souche E, Alders M, Corveleyn A, Eck S, Feenstra I, et al. Guidelines for diagnostic next-generation sequencing. European Journal of Human Genetics. 2016;24:2-5. DOI: 10.1038/ejhg.2015.226
  4. Santani A, Simen BB, Briggs M, Lebo M, Merker JD, Nikiforova M, et al. Designing and implementing NGS tests for inherited disorders: a practical framework with step-by-step guidance for clinical laboratories. The Journal of Molecular Diagnostics. 2019;21:369-374. DOI: 10.1016/j.jmoldx.2018.11.004
  5. Roy S, Coldren C, Karunamurthy A, Kip NS, Klee EW, Lincoln SE, et al. Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: a joint recommendation of the Association for Molecular Pathology and the College of American Pathologists. The Journal of Molecular Diagnostics. 2018;20(1):4-27. DOI: 10.1016/j.jmoldx.2017.11.003
  6. Ivády G, Madar L, Dzsudzsák E, Koczok K, Kappelmayer J, Krulisova V, et al. Analytical parameters and validation of homopolymer detection in a pyrosequencing-based next generation sequencing system. BMC Genomics. 2018;19:158. DOI: 10.1186/s12864-018-4544-x
