Results of HLA DNA typing for the HLA-C locus by the Luminex, SBT or SS-SBT methods. The / is possible ambiguity, + is more than the possible ambiguities indicated by /, Parenthesis and bold letters indicate tentative allele names, not yet officially approved by the WHO Nomenclature Committee.
Human Leukocyte Antigen (HLA) is the major histocompatibility complex in humans and it is critically involved in the rejection of hematopoietic stem cell or organ transplants [1-3] and in the pathogenesis of numerous autoimmune diseases . Rejection and autoimmunity is believed to occur because the HLA presents aberrant histocompatibility antigen or public epitopes that play a key role in self-nonself recognition via various mechanisms including molecular mimicry and antibody mediated rejection [5-9]. The HLA genomic region on chromosome 6p21 encodes more than 200 genes including nine classical HLA genes, HLA-A, -B, -C in the class I region and HLA–DPA1, -DPB1, -DQA1, -DQB1, -DRA, -DRB1 in the class II region, that are the most polymorphic in the human genome contributing to over 7000 alleles and numerous HLA haplotypes implicated in disease resistance or susceptibility . There are also a variety of non-classical HLA genes, such as HLA-DO and HLA-DM , HLA-E, -F, -G, MICA and MICB, which have received less attention in clinical medical research than the classical HLA genes [12,13].
The determination of the classical HLA alleles by DNA typing techniques over the past thirty years has assisted with the efficient and rapid HLA matching in transplantation therapy [2,14,15], research into autoimmunity and HLA related diseases [4,10], population diversity studies [16-18] and in forensic and paternity testing . HLA gene alleles are also a target for pharmacogenomics research into drug hypersensitivity , such as the association of HLA- B*57:01 with hypersensitivity to the antiviral agent abacavir  and the association of HLA- B*15:02 with anticonvulsants in Asian populations .
Many variations of the conventional PCR typing method have been used for HLA DNA typing, such as incorporating restriction fragment polymorphisms , single strand conformation polymorphism , sequence specific oligonucleotides (SSOs) , sequence specific primers (SSPs)  and sequence based typing (SBT), like the Sanger method . The HLA DNA typing methods mainly applied today are PCR-SSO, such as the Luminex commercial methodology [27,28], and SBT by the Sanger method employing capillary sequencing of cloned chain-termination reactions [26,29]. However, the current, conventional DNA typing methods cannot readily distinguish between polymorphisms on the same chromosome (
The next generation sequencing (NGS) technologies, which are also referred to as second-generation sequence technologies, are new sequencing developments [31-33] that have followed on from the first generation sequencing technologies of the Sanger-Coulson sequencing method using chain-termination dideoxynucleotide sequencing of single-stranded DNA  and chemical cleavage of double-stranded DNA using the Maxam-Gilbert method . Recent advances in technology and cost effectiveness of NGS point the way for its implementation and wider use in HLA research and diagnostic settings with the generation of haplotype (in-phase) sequencing and a massive level of parallelism producing millions of sequencing reads for analysis [36-43]. There are a variety of commercial NGS platforms currently available for HLA gene amplicon resequencing, such as those from Roche-454 (454 GS-FLX-Titanium, GS Junior), Applied Biosystems-Agencourt (3730XL, SOLiD 3), Illumina-Solexa (Genome Analyzer GAII, MiSeq, HiSeq models), Life Technologies (Ion Torrent Proton with different high density sequencing Chips), HeliScope (Helicos) and PacBio RS II (PacBio); including three small benchtop sequencers, the Roche Genome Sequencer (GS) Junior, Ion Personal Genome Machine (PGM) sequencer and the Illumina MiSeq, that are much cheaper than their larger counterparts and that provide faster turnover rates, but a more limited data throughput [31,44-46].
Recently, the performances of the three commercially available benchtop sequencers were compared against each other directly by different investigators comparing the sequencing of bacterial genomes [45-47]. There were large differences obtained from the three platforms in cost, sequence capacity and in performance outcomes of genome depth, stability of coverage and read lengths and quality, due in part to their different sequencing methods. The Roche GS Junior was of intermediate price with the lowest sequencing capacity of >6 Mb per run and a run time of 8 h, but the highest cost per run because of expensive reagents. Its sequencing is dependent on using the pyrosequencing technique, measuring the fluorescent light emitted when fluorophore-labeled dNTPs are added to sample DNA templates during the polymerase reaction [31,32]. The MiSeq was the most expensive instrument with the highest throughput at 1.5-2 Gb and a run time of 27 h. The running cost was 60 times cheaper than the Roche GS Junior. The MiSeq uses reversible terminator chemistry for sequencing in a cyclic method that involves fluorophore-labeled nucleotide incorporation, fluorescence imaging and cleavage [31,48]. The Ion PGM was the cheapest instrument with a throughput of 100 Mb using the 316 Chip, and 1 Gb using the 318 Chip, and a run time of 2-3 h. The running cost was 8 to 50 times cheaper than the Roche GS Junior depending on which Chip was purchased. The Ion PGM uses semi-conductor technology and ion-sensitive transistors to sequence DNA using only DNA polymerase and natural nucleotides in a sequencing-by-synthesis approach, with each polymerisation event resulting in pH and voltage changes identified by electronic sensors . Therefore, the Ion PGM is the only “non-light” sequencer currently available in the market place.
Most pre-sequencing workflows for the benchtop sequencers and the other machines require DNA template fragmentation and library preparation where the fragments are labeled in vitro with oligonucleotide tags and adapters using commercial kits such as NimbleGen, Sure Select and other systems in order to be captured by beads or probes in preparation for clonal amplification of single stranded DNA fragments . The labeling of DNA libraries with barcode sample tags, such as the multiplex identifier (MID) for Roche/454 sequencing, allows the libraries to be pooled to maximize the sequence output as a multiplex amplicon sequencing step for each sequencing run [49,50]. After construction of the template libraries, the DNA fragments are clonally amplified by emulsion PCR [51,52] or by solid phase PCR using primers attached to a solid surface  in order to generate sufficient single stranded DNA molecules and detectable signal for generating sequencing data . Apart from selecting a suitable sequencing machine, NGS also provides challenges in analyzing and interpreting complex HLA genomic data from the millions of sequencing reads generated from the next-generation sequencers, which are different to those generated from traditional sequencers, and other HLA DNA typing methods and platforms.
We have described a new HLA DNA typing method, called Super high-resolution Single molecule - Sequence Based Typing (SS-SBT), that employs NGS and the Roche GS Junior  and the Ion PGM  massively parallel sequencing bench-top platforms (Figure 1). The SS-SBT method allows sequencing of the entire HLA gene region (promoter/enhancer, 5’UT, exons, introns, 3’UT) to detect new alleles and null alleles and solves the problem of phase ambiguity by using bioinformatics and computing for accurate phase alignments, after long-range PCR and sequencing clonally amplified single DNA molecules [42, 56].
This chapter updates our progress with the SS-SBT method  and describes some of the tasks required to identify the polymorphisms and other variants generated by NGS for SNP haplotyping from different classical HLA loci of individuals such as tissue donors and recipients in a relatively simple and economical way.
1.1. Aims of chapter
Here we describe and examine our latest modifications to the SS-SBT method for six-classical class I and class II HLA loci (HLA-A, -B, -C, -DPB1, -DQB1 and -DRB1) at a super-high resolution (formerly known as the 8-digit level of resolution) and three non classical class II HLA loci HLA-DRB3, -DRB4 and –DRB5 at high resolution using single DNA molecules to solve the ambiguity problem by undertaking:
Specific amplification of the entire gene sequence.
Comparisons between two different NGS platforms, the Roche GS Junior and Ion PGM.
Comparisons between different DNA typing software for in phase sequence analysis and validation of the huge amounts of NGS data.
Comparison of workflow differences in the construction of single gene and multiplex gene sequencing template libraries for 11 HLA class I and class II loci.
Application of the SS-SBT method including PCR multiplexing for the construction of template libraries for more efficient sequencing runs.
2. Materials and methods
2.1. HLA allele nomenclature and definition of super-high resolution
HLA genotyped alleles can be assigned at different levels of detail according to a recommended, standardized nomenclature . Designations begin with HLA- as the prefix for an HLA gene and the locus name, followed by a separator * and then one or more two-digit numbers separated by colons (field separators) that specify the allele. The first two digits specify a group of alleles (first order, field one or a low resolution level). The third and fourth digits specify a non- synonymous allele (second order, field two or an intermediate resolution level). Digits five and six denote any synonymous mutations within the coding frame of the gene (third order, field three or a high resolution level). The seventh and eighth digits distinguish mutations outside the coding region (fourth order, field four or a super high resolution level). A ninth digit usually specifies an expression level or other non-genomic data and it is designated by a letter such as A (‘Aberrant’ expression), C (present in the 'Cytoplasm' but not on the cell surface), L (‘Low’ cell surface expression), N (‘Null’ alleles), Q (‘Questionable’ expression) or S (expressed as a soluble 'Secreted' molecule but is not present on the cell surface). Thus, a completely described allele may be up to 9 digits long. An example of an eight-digit or a super high resolution of an HLA-A allele that includes sequence variation within the introns or 5’/3’ extremities of the gene is HLA-A*01:01:01:01.
2.2. DNA samples
One hundred genomic DNA samples were obtained from Japanese subjects for this study on PCR amplification and NGS of HLA genes. We obtained written consent from each individual and ethical approval from Tokai University School of Medicine where the research was performed. The DNA samples were isolated from the peripheral white blood cells using the QIAamp DNA Blood Mini Kit (QIAGEN, Germany). The HLA class I and class II gene alleles for nine samples (TU1 to TU8 and TU10) had been previously assigned to the field 2 allelic level of resolution (formerly known as 4-digit typing)  by the Luminex method [27,58] and LABType SSO kits (One Lambda, CA).
2.3. DNA extraction
Genomic DNA samples were isolated from the peripheral white blood cells using the QIAamp DNA Blood Mini Kit (Qiagen). The purity of the genomic DNA for each sample was determined by measuring the absorbance at 260 and 280 nm, with the A260/A280 values being in the range of 1.5-1.9, and the concentration of the DNA was adjusted to 10-20 ng/μl using PicoGreen assay (Life Technologies, CA).
2.4. HLA DNA typing of genomic DNA samples by the Luminex method
The samples were typed at HLA-A, -B, -C, -DRB1, -DQB1 and -DPB1 using the Luminex method [27,28,58] and reagents supplied by the LABType SSO kits (One Lambda). An outline of the workflow for the Luminex method is shown in Figure 2.
2.5. HLA DNA typing of genomic DNA samples by SBT and the Sanger sequencing method
The DNA samples were typed at HLA-A, -B and -C (exons 2, 3 and 4) and -DRB1 (exon 2) using AlleleSEQR HLA-SBT Reagents (Abbott Laboratories, Abbott Park, IL). To confirm that the HLA alleles from both chromosomes were amplified with a 1:1 ratio and to validate newly discovered SNPs and indels, the nucleotide sequences of the PCR products were also directly sequenced by using the Sanger method and the ABI3130 genetic analyzer (Life Technologies, CA) in accordance with the protocol of the Big Dye terminator method (Life Technologies). The sequence-generated chromatogram data was analyzed by the Sequencher ver.4.10 DNA sequence assembly software (Gene Code Co., MI). Sequence data were also analyzed using the assign-atf software (Conexio Genomics, Australia, ). An outline of the workflow for the Sanger sequencing method is shown in Figure 2.
2.6. SS-SBT by NGS
We used two different benchtop NGS platforms (Figure 1); massively parallel pyrosequencing with the Roche GS Junior Bench Top system  and massively parallel semiconductor sequencing with the Ion Personal Genome Machine (PGM) system . A general review of the NGS workflows used for the Roche 450 system is presented by Margulies et al. , Metzger  and Erlich  and for the Ions PGM/Torrents system by Rothberg et al. .
For SS-SBT, we used a five-step workflow for HLA DNA genotyping and haplotyping for both the Roche GS Junior Bench Top system and the Ions PGM (Figure 2). These steps were LR-PCR, amplicon library construction, template preparation by emulsion (em) PCR, NGS run and HLA DNA data analysis. The first and last steps were essentially the same for both platforms, whereas steps 2 to 4 were platform specific. Two different sequencing runs, a single gene-sequencing run (SGSR) or a multiplex gene sequencing run (MGSR) were performed for the different HLA gene loci. SGSR was performed for only a single gene locus per run, whereas MGSR was performed at the same time for all of the HLA gene loci, whereby all of the LR-PCR amplicons were pooled together to construct the template libraries required for the sequencing platforms.
2.6.1. Step 1: LR-PCR
We developed and used LR-PCR primers for eleven HLA loci, A, B, C, DRB1, DRB3, DRB4, DRB5 (DRB3/4/5), DQA1, DQB1, DPA1 and DPB1 [42, 56]. For amplification of the HLA-DRB1, group-specific primers were employed to prevent concomitant amplification of DRB1-like genes, such as DRB3, DRB4, DRB5 and DRB pseudogenes including DRB2, DRB6, DRB8 and DRB9. Others have reported different LR-PCR primer sets for some of the same HLA genes [40,43].
Pooling the LR-PCR amplicons
Amplicons from all class I and class II loci for each individual were pooled in equal amounts to yield 2-3 μg per pool of a single sample.
2.6.2. Steps 2 to 4: Amplicon library construction, emPCR and NGS
Both NGS platforms require the preparation of a fragmented template library of the PCR products and further clonal amplification of the templates by emPCR for sequencing of singled stranded DNA and detection of base-called signals (Figure 2).
220.127.116.11. Workflow for Roche GS Junior Bench Top system
Titanium libraries of single-stranded template DNA fragments were prepared for emPCR and sequencing using a GS FLX Titanium Rapid Library Preparation kit (cat no. 05608228001, 454 Life Sciences, Schweiz, CA) together with the GS Titanium Rapid Library MID Adaptors Kit (cat no. 05619211001, 454 Life Sciences) that permits the preparation of up to 12 individually MID labeled libraries per kit . The workflow with the preparation kit involved fragmenting the LR-PCR amplicons by nebulization using disposable nebulisers, repairing fragment ends in a thermocycler reaction with T4 polymerase and Taq polymerase, adding adaptors such as the fusion A and B capture/sequence adapters and MID tags to the fragmented DNA by ligation with ligase and then linking the fragments of 100 bp or larger to the AMPure paramagnetic beads (Beckman Coulter Genomics, MA) with a selective binding buffer. Small fragments, primers and nucleotides were removed from the magnetic beads by washing them with 70% ethanol. Each single-strand DNA library was eluted from the beads with water or a Tris-acetate buffer, pH 8.
A MID adaptor was substituted for the adaptor provided in the GS FLX Titanium Rapid Library Preparation kit during the preparation of a library from a PCR DNA sample. The MID adaptor was ligated to each fragmented sequence of a library to provide a recognizable sequence tag at the beginning of each read. In this way, multiple libraries prepared with different MIDs can be sequenced together allowing the data analysis software to assign each read to the correct library and therefore a particular sample.
The libraries were quantified in 96-well format using the Fluoroskan Ascent CF Ver. 2.6 software (Thermo Fisher Scientific, DE) to detect the fluorescence in a PicoGreen assay (Life Technologies). The library size for each sample was measured by an Agilent 2100 Bioanalyzer using an Agilent High Sensitivity DNA Kit (Agilent Technologies, CA) and pooled with the other libraries at equimolar concentrations to create a multiplexed library.
The multiplexed libraries were clonally amplified by emPCR in order to produce hundreds of thousands or millions of copies of the same template sequence so that sufficient signal would be generated to be easily detected and recorded by the sequencing system. For preparation of emulsion beads, the single-stranded DNA fragmented libraries were added to the capture beads A and B, emulsified with oil and then amplified for at least 6 hours with an enzyme and reaction mix provided in the GS Junior Titanium emPCR Kit (Lib-L). An emulsion of PCR reagents in microreactors was prepared by adding beads A and B to the PCR reaction mix (1X amplification mix, Amplification Primers, 0.15U/ml Platinum Taq (Life Technologies), and emulsion oil and mixing vigorously using a Tissue Lyser (Qiagen, Germany).
The emulsion PCR of the library templates was carried out in an ABI 9700 (Life Technologies) thermocycler using the following cycling conditions: hotstart activation for 4 minutes at 94°C, 50 cycles of 94°C for 30 seconds, 58°C for 4.5 minutes, 68°C for 30 seconds. Sequencing primers were added to the mixture of beads and annealing buffer and annealed to the template using the following thermocycler conditions: 65°C for 5 minutes, and kept on ice for 2 minutes..
After emPCR, the emulsion was broken by isopropanol, the beads carrying the single-stranded DNA templates were enriched with primers A and B, counted, and 0.5 million beads deposited along with enzymes and buffer into a single loading port of a GS Junior Titanium PicoTiterPlate (PTP) (Cat. No 05 996 619 001, 454 Life Sciences, Carlsbad, CA) to obtain sequence reads by pyrosequencing using the Roche GS Junior system. Packing beads, sample beads and enzyme beads were applied to the PTP as per manufacturer instructions. The GS Junior Titanium PicoTiterPlate permits only a single sequencing run per PTP. During sequencing, nucleotides flowed across the PicoTiterPlate in a fixed order and a cooled CCD camera recorded the amount of light generated by the pyrosequencing reactions. The images from the CCD camera were then converted to sequence data by the instrument’s software. Bi-directional sequencing for the fragments was achieved because the single strands captured by the A and B beads allow both forward and reverse reads to be identified.
After the DNA sequencing run by the Roche GS Junior Bench Top system, data analysis such as image processing, signal correction and base-calling, binning, trimming and mapping of the sequence reads and assignment of HLA alleles were carried out in accordance with our previous publication . Image processing, signal correction and base-calling were performed by the GS Run Processor Ver. 2.5 (454 Life Sciences) with full processing for shotgun or filter analysis. Quality-filter sequence reads that were passed by the assembler software (single sff file) were binned on the basis of the MID labels into 10 separate sequence sff files using the sfffile software (454 Life Sciences). These files were further quality trimmed to remove poor sequence at the end of the reads with quality values (QVs) of less than 20.
18.104.22.168. Workflow for Ion PGM system
Barcoded-library DNAs were prepared from PCR amplicons with an Ion Xpress Plus Fragment Library Kit and Ion Xpress Barcode Adaptors 1-16 Kit according to the manufacture’s protocol for 200 base-read sequencing (Life Technologies). One hundred nanograms of the HLA pooled amplicon products from four individuals were used for the preparation of each DNA library. Each DNA library was clonally amplified by eight cycles of PCR. The DNA size for each library was measured by an Agilent 2100 Expert Bioanalyzer using the Agilent High Sensitivity DNA Kit (Agilent Technologies) and the concentration of each library was measured with the Ion Library Quantitation Kit using the 7500 Real-Time PCR System (Life Technologies). Each barcoded-library was mixed at equimolar concentrations then diluted according to the manufacture’s recommendation. emPCR was performed using the resulting multiplexed library with the Ion Xpress Template 200 Kit on an ABI 9700 (Life Technologies) with the following cycling parameters: primary denaturation 94◦C/6 min, followed by 40 cycles for 94◦C/30 s, 58◦C/30 s and 72◦C/90 s, and 10 cycles for 94◦C/30 s and 68◦C/6 min. After the emulsion was broken with 1-butanol, a magnetic-bead-based process according to the manufacture’s recommendation enriched the beads carrying the single-stranded DNA templates. Sequencing was performed using the Ion Sequencing 200 Kit and Ion 316 chip with a flow number of 440 for 200 base-read .
Ion Torrent uses a pH change (release and detection of a proton) to detect the incorporation of a base into the growing DNA strand rather than a flash of light, as used in the 454 Sequencing System .
In the case of the Ion PGM DNA sequencing system, the raw data processing and base-calling, trimming and output of quality-filter reads that were binned on the basis of the Ion Xpress Barcodes into 10 separate sequence sff files were all performed by the Torrent Suite 1.5.1 (Life Technologies) with full processing for shotgun analysis. These files were further quality trimmed to remove poor sequence at the end of the reads with QVs of less than 10.
22.214.171.124. Sequencing parameters, variables and definitions
The NGS methodology generates a number of sequencing parameters, variables and errors that need to be defined and noted.
A sequencing read is a contiguous length of nucleotide bases that is generated using a sequencing machine. The full set of aligned reads (coverage) reveals the entire genomic sequence in the DNA sample selected for sequencing.
Both the NGS methods described here are designed to generate hundreds of thousands to millions of short (100 to 800 bp) contiguous reads of low to high quality. Intrinsic quality measures or filters are automatically built into the sequencing analysis software to recognize and measure or statistically predict quality values (QV or Q) and remove incorrect repetitive sequence, indels, low-quality sequence at the beginning and end of runs (end clipping) and recognize accurate consensus sequences [62,63].
Generally, the QVs or Q’s are based on estimated Phred quality scores  so that a quality score of 10 represents a 1 in 10 probability of an incorrect base call or 90% base call accuracy, 20 represents a 1 in 100 probability of an incorrect base call or 99% base call accuracy, 30 represents a 1 in 1000 probability of an incorrect base call or 99.9% base call accuracy, 40 represents a 1 in 10000 probability of an incorrect base call or 99.99% base call accuracy, and so on.
QV scores are clearly valuable because they can reveal how much of the data from a given run is usable in a resequencing or assembly experiment. Sequencing data with lower quality scores can result in a significant portion of the reads being unusable, resulting in wasted time and expense. A QV of ≥ 20 is usually considered a satisfactory cut-off score for base calling accuracy. The QV and Q reads from the GS Junior and the Ion PGM are slightly different from each other and not directly comparable, with some studies showing the GS Junior overestimates the base-calling accuracy while the Ion PGM underestimates the base-call accuracy [45,65-67].
Read depth is the number of individual sequence reads that align to a particular nucleotide position . The average read depth (coverage, redundancy or depth) is the number of average reads representing or covering a given nucleotide in the reconstructed sequence, often calculated as N x L/G where G is the original length of the genome sequence, N is the number of reads and L is the average read length. Thus, for a 1000 bp sequence G reconstructed from 10 reads N with and average length of 200 nucleotides per read L, there is a 2x coverage or redundancy. However, reads are not distributed evenly over an entire genome and many bases will be covered by fewer reads than the average coverage, while other bases will be covered by more reads than average .
Another important variable to consider, particularly for HLA typing with NGS is the allelic balance  or the ratio of the average depth , which is the relative number of reads originating from each of the two alleles of each locus. This variable is important to monitor in order to prevent allele drop-out. Allelic imbalance is usually not a sequencing problem, but most often is due to poor genomic PCR primer design where a set of primers favours the amplification of one haplotype over the other.
Sequence accuracy to estimate the quality of a sequencing run was estimated by including internal control DNA beads of known sequence with each run.
2.6.3. Step 5: DNA sequence data analysis and HLA assignments
The trimmed and MID or barcoded binned sequences were mapped as perfectly matched parameters using the GS Reference Mapper Ver. 2.5 (Life Technologies). Reference sequences used for mapping of the sequence reads were selected by nucleotide similarity searches with HLA allele sequences in the IMGT-HLA database using the BLAT program . If a reference sequence covering the entire gene region was not available, we converted the sff files to ace files and constructed a new virtual sequence by
Three different automated NGS data processing systems, the Sequence Alignment Based Assigning Software (SEABASS) (an in house development of Tokai University, Isehara), Omixon Target (Omixon, ) and Assign MPS v1 (Conexio, ) were also assessed and compared for in phase sequence alignment of HLA genes and for allele assignment at the 8-digit level. The three programs were designed to provide software or a suite of software tools for analyzing targeted sequencing data from the next generation sequencing platforms, Roche 454, Ion Torrent and Illumina. The HLA edition of the Omixon Target can be obtained for use with MacOSX or Windows as a credit-based or annual based license fee. Conexio software for Windows can be purchased outright, whereas SEABASS is an in house development for Linux and is not available commercially at this time.
3.1. HLA typing by Sanger SBT, the Luminex method or SS-SBT by NGS
Ten genomic samples (TU1-TU10) were genotyped at six HLA loci (HLA-A, -B, -C, -DRB1, -DQB1 and –DPB1) by three different typing methods, the Luminex, the Sanger SBT and the NGS SS-SBT method. At least one or more pairs of unresolved alleles were detected in the samples by the SBT and/or Luminex methods, whereas all the samples were easily resolved at least to the 4-digit level by the SS-SBT typing method (see the Tables and Supplementary Tables in the paper by Shiina et al ). Table 1 shows an example of the DNA typing results obtained for the HLA-C locus by the Luminex, SBT or SS-SBT methods as previously reported.
Figure 3 shows how the SS-SBT NGS typing method resolves the phase ambiguity at the HLA-DRB1 locus of a sample that the Luminex beads and other conventional typing methods were not able to resolve. Essentially, the Luminex bead method analyzed the mixture of the PCR products amplified from the paternal and maternal chromosomes and could not distinguish between HLA-DRB1*15:01/DRB1*04:05 on Sample “A” and HLA-DRB1*15:02/DRB1*04:10 on Sample “B”, and therefore, resulting in ambiguity. On the other hand, in the SS-SBT method the sequenced PCR products were each amplified from a single DNA molecule (clonal amplification method by emulsion PCR), which helps to determine whether each polymorphism is paternal or maternal, thus resolving phase ambiguity. In the case shown in Figure 3, the sample was typed as DRB1*15:02/DRB1*04:10 on Sample “B”.
3.2. HLA typing by SS-SBT using two different NGS benchtop platforms
126.96.36.199. Single LR-PCR
Long-range PCR (LR-PCR) was developed to amplify eleven HLA genes (HLA-A, -C, -B, -DRB1, -DQA1, -DQB1, -DPA1, -DPB1, -DRB3, -DRB4 and –DRB5) that are known to be highly polymorphic. PCR primers were designed to avoid annealing to any known polymorphic sites and to amplify regions spanning from the 5’ promoter to the 3’UTR region of the HLA genes in either a single amplification reaction or as two separately divided amplifications that could be easily merged when sequenced.
The PCR primer sets for the HLA-A, -B and -C genes were designed to amplify their sequences from the 5’-enhancer-promoter region to the 3’UTR with an amplicon size of about 5 kb. PCR primer sets were designed for HLA-DQA1, -DQB1, and -DPA1 to amplify sequences from the 5’- enhancer-promoter region to the 3’UTR with amplicon sizes of about 7 to 10 kb bases. Because the gene sizes for HLA-DRB1 and -DPB1 were too long to successfully amplify the whole gene in a single reaction, we divided the amplification of these genes into two parts. One PCR primer pair was designed to amplify the 5’-enhancer-promoter region to exon 2 and the other primer pair was designed to amplify exon 2 to the 3’UTR, resulting in amplicon sizes of about 6 to 11 kb and 5 to 7 kb, respectively (Figure 4). The DRB1 locus produced different amplicon sizes because of allelic differences in the length of its introns.
LR-PCR methods were also designed and tested to amplify HLA-DRB3, -DRB4 and -DRB5 (
Although some LR-PCR methods are known to produce low frequency extra amplification of pseudogenes, as previously reported , our LR-PCR methods generally produced specific amplicons of targeted genes with little or no evidence of extra sequences (Figure 4). In addition, allelic imbalance for all the LR-PCR methods was minimal as judged by the sequencing results with allelic ratios ranging on average between 0.6 and 1.6 in heterozygous samples.
188.8.131.52. Pooling LR-PCR products for library construction
In order to expedite the number of samples and HLA loci required for NGS, the single LR-PCR products were pooled in equimolar amounts to produce a single multiplex (up to 13 PCR products) for the construction of a multiplexed, barcoded sample library. Figure 5 shows an example of an electrophoretogram of the pooled PCR amplicons from the HLA-A, -B, -C and -DRB1 genes.
3.2.2. Roche GS Junior NGS system
184.108.40.206. Library preparations using different gene loci and sample numbers
Initially, we applied an individual tagging per gene amplicon method for SGSR by preparing a tagged library for each of the ten individual samples with the aim of sequencing one HLA locus at a time . Thus, ten libraries were constructed from each PCR amplicon of a particular HLA locus, with each library representing an HLA gene with ten tagged individual samples. The ten individual MID-labelled ssDNA libraries were then combined into a single tube for clonal emPCR and sequencing. In this way, we performed a single sequencing run for each HLA locus represented by ten MID labelled individuals, one locus at a time. For example, Run 1 was HLA-A with samples one to ten; Run 2 was HLA-B with samples one to ten, Run 3 was HLA-B with samples one to ten, and so on until we finished Run 13, our last run, Run 13, representing the 13th LR-PCR amplicon we had amplified from the 11 HLA gene loci that we had chosen to analyze.
Later, we changed from SGSR to an individual tagging and gene-pooling method for MGSR in order to markedly reduce the number of required sequencing runs. In this approach, we first pooled the PCR amplicons obtained from the eleven separate HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DPA1, -DRB1 (part a and b), -DPB1 (part a and b), -DRB3, -DRB4 and -DRB5 of each individual at equimolar concentrations and then prepared the MID-labeled NGS libraries using the pooled amplicons from each individual. In this way, a single sequencing run was performed for all eleven HLA loci and for a multiple number of MID-labelled individuals or DNA samples. This was possible because the sequencing adapters specifically identified and grouped the gene loci, whereas the MID tags identified the individual samples. Figure 5 shows an electrophoretogram of pooled LR-PCR amplicons amplified from samples with known DR sub-types in parenthesis such that lane 1 is a heterozygous sample of DR6 and DR9, lane 2 is heterozygote of DR2 and DR6, lane 3 is a heterozygote of DR2 and DR4, lane 4 is a heterozygote of DR3 and DR8 and lane 5 is a heterozygote of DR1 and DR7. Consequently, the PCR product from the DRB1 gene varies in size depending on the DR sub-type and associated HLA-DRB1 allele so that it is 5.1 kb (orange) and 5.2 kb (blue) in lane 1, 5.2 kb (blue) and 5.6 kb (green) in lane 2, 5.6 kb (green) and 6.2 kb (purple) in lane 3, 5.2 kb (blue) in lane 4 and 5.1 kb (orange) and 5.2 kb (blue) in lane 5.
220.127.116.11. Sequencing multiplex libraries prepared for SGSR
Table 2 shows the number of draft sequences or reads (sequences passing a quality control criteria after base calling), generated by the Roche GS Junior Bench Top system for HLA-C in ten samples using multiplex libraries prepared for SGSR. The results for the sequence information about HLA-C were reported previously by Shiina et al  and Ozaki et al  as part of the data obtained in their NGS analysis of the eleven HLA loci.
Draft read numbers for eight genes, HLA-A, -C, -B, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1, were 1,015,526 sequence reads in total for 10 samples per locus with a range of reads between 49,048 in HLA-DQA1 and 153,610 sequence reads in exon 2 to 3′UTR of HLA-DRB1 . These were high quality reads with QV ≥ 20 and an average QV of 30.4 ranging between 28.6 and 32.4. The draft read bases for eight genes, HLA-A, -C, -B, -DRB1, -DQA1, -DQB1, -DPA1, -DPB1, were 412,956,301 bp in total with a range between 18,077,833 bp in HLA-DQA1 and 65,384,530 bp in exon 2 to 3’UTR of HLA-DRB1 . There was an overall average sequence length of 414.2 bp, ranging from 321.8 bp in intron 1 to 3’UTR of HLA-DPB1 to 441bp in HLA-DPA1. Of the draft reads, excluding error reads such as artificial insertion or deletion (indel) and nucleotide substitutions, 49.7%–75.4% matched perfectly with HLA allele sequences released from the IMGT-HLA database or our newly constructed virtual reference sequences. Therefore, the sequence reads had high quality and sufficient sequence volume for further HLA DNA genotyping and haplotyping analysis.
The HLA-DRB3, -DRB4 and -DRB5 (DRB3/4/5) gene loci were each sequenced separately by the Roche GS Junior Bench Top system using 19 samples and two separate multiplex libraries. After the sequencing run for each gene, a total of 147,269 sequence reads were provided for mapping and allele assignment of DRB3/4/5. The sequence reads were high quality reads (QV≥20) with an average QV of 30.9 ranging between 28.6 and 32.4. The draft read bases in total were 62.3 Mb in DRB3/4/5, with an overall average sequence length of 423.2 bp, ranging from 361 bp to 445.6 bp. This data suggested that the sequence reads had sufficient high quality and sequence volume for further HLA DNA genotyping and haplotyping analysis.
18.104.22.168. HLA SS-SBT DNA typing
HLA typing was applied to 10 genomic DNA samples (TU1-TU10) all of which gave more than one pair of unresolved alleles when defined by the SBT and/or Luminex methods (Table 1 and Table 3). In comparison, the pyrosequencing reads using Roche GS Junior system matched perfectly with the HLA allele sequences previously deposited in the IMGT-HLA database or against the virtual allele sequences constructed by
An average sequencing depth was 387.9 in total, ranging between a depth of 53.5 in TU7 of HLA-DQB1 and 924.0 in TU4 of HLA-C. An example of the minimum and maximum depth and ratio of depth between two allele sequence reads for HLA-C in five DNA samples using the Roche GS Junior is shown in Table 4. The average depth ratio between haplotypes or alleles of HLA-A, -B, -C, -DRB1 (mix 2), -DQB1 ranged from 0.6 to 1.6, suggesting a satisfactory allelic balance was achieved with the PCR reactions. In comparison, the ratio of depth between two alleles in the HLA-DRB1 (mix 1) sequences was more variable exceeding a ratio delta of 1.0, which may be due to the marked difference in the sizes of PCR products among DRB1 groups (6-11 kb). However, this ratio of depth between two alleles did not affect sequence determination from the enhancer–promoter region to the exon 2 region of the DRB1 gene when the number of draft reads was increased.
Overall, in this analysis, 21 HLA allele sequences were newly determined, 17 of them were newly identified alleles and four were alleles extended from 2-digit to 8-digit level sequences. Most of the newly identified SNPs and/or indels for 16 alleles were observed in the introns, whereas a synonymous SNP was identified in one allele of DRB1*04:10:(03):(01)) (parenthesis indicates tentative allele name, not yet officially approved by the WHO Nomenclature Committee). A comparison of the alleles detected for HLA-DRB1 by Sanger-SBT and SS-SBT in ten samples is shown in Table 3.
The HLA-DQA1, -DPA1 and -DPB1 alleles were assigned only at the 6-digit level of exon 2 with no novel alleles discovered in the 10 genomic DNA samples. This lower level assignment was likely due in part to low SNP densities preventing the haplotype genomic segments to be properly aligned and the phases separated from each other.
DRB3/4/5 sequences were identified to the field 4 level in both phases and validated by Sanger sequencing . An average depth of DRB3/4/5 was 277 in total (107 in TU1 to 561.7 in TU11. The number of newly determined allele sequences at the field 4 level of resolution was two for DRB3 (DRB3*01:01:02:(02) and DRB3*02:02:01:(03)), three for DRB4 (DRB4*01:03:01:(04), DRB4*01:03:01:(05) and DRB4*01:03:01:(06)) and one for DRB5 (DRB5*01:01:01:(02)) (the allele names at field 4 are tentative as they have not yet been officially approved by the WHO Nomenclature Committee). Six allele sequences were extended from a field 2 or 3 level to a field 4 level with one for DRB3 (DRB3*03:01:01:(01)), one for DRB4 (DRB4*01:03:02:(01)) and two for DRB5 (DRB5*01:02:(01):(01) and DRB5*02:02:(01):(01)). Field 4 level haplotype structure of DRB1-DRB3/4/5 were also determined by SS-SBT and reported by .
22.214.171.124. Sequencing multiplex LR-PCR generated libraries prepared for MGSR
Using the libraries prepared from pooled LR-PCR for MGSRs (Figure 5), we sequenced 81 Japanese and Caucasian genomic samples by the Roche GS Junior instrument and assigned the aligned alleles across the entire gene regions of the eleven HLA gene loci without any ambiguities at least to the field 3 assignment level. Of the 164 alleles detected at the field 4 assignment level, 78 (47.6%) were newly detected alleles. We were unable to determine the allele sequences of HLA-DQA1, -DPA1 and -DPB1 at field 4 due to a lack of tag SNPs available to identify and separate the two phases within some of the noncoding regions. However, all the HLA exonic alleles were assigned without ambiguity at the field 3 level.
3.2.3. Sequence read information from the Ion PGM system
The amplicons amplified from eight HLA loci (A, B, C, DRB1, DQA1, DQB1, DPA1 and DPB1), representing a total of 60 kb of the entire gene regions (from the enhancer-promoter region to the 3’UT region), were prepared and pooled for the construction of barcoded DNA libraries from four DNA samples (TU1, TU3, TU5 and TU8) that had been previously sequenced by the Roche Junior system (Table 1). The four-barcoded samples were sequenced as a multiplex in a single run on an Ion 316™ chip. Figure 6 shows the workflow of the SS-SBT method performed by the Ion PGM™ sequencer that completed the processes from sample preparation to sequence analysis within three days. Previously, four days were needed for the manual sample preparation and library construction from the time of PCR amplification to HLA allele definition . Now, with the use of automated procedures, such as the AB Library BuilderTM system and Ion OneTouchTM system from Life Technologies, we have shortened the workflow significantly to just two to three days.
The number of reads, number of bases, and median read-length of each sample are shown in Table 5. A total of 522 Mb of sequence data was produced (about 600K to 720K reads per sample) with a range between 123.3 Mb for TU1 and 141.6 Mb for TU5, with an overall average read length of 197.3 bp. The median read length was about 214 bp, which was sufficient to resolve the phase ambiguity in all of the samples that were tested. Draft read numbers in total were 2,646,446 reads with a range of reads from 605,501 for TU1 to 721,499 reads for TU5 that were high quality reads with QV ≥ 10 at an average QV of 19.2 in the high quality reads.
|Total||2,646,446||Total 522.2||Ave. 214|
Alleles of HLA class I genes obtained from the typing method using Ion PGM™ are shown in Table 6. The 1,292,006 draft sequence reads (48.8% of the passed assembly reads) when compared with the reference sequences using the GS Reference Mapper (Ver. 2.5) matched consistently with the HLA alleles that were assigned by Roche GS Junior sequencing. The average depth between two-allele sequence reads spanned from 581 for
A comparison between the Ion PGM and the Roche GS Junior for the depth distribution obtained for HLA-B is shown in Figure 7.
The depth distribution of nucleotide calls across the HLA-B DNA sequence in different samples occurred with numerous peaks and valleys. For HLA-B the depth of nucleotide calls was lowest at a depth of 38 for allele 2 in sample TU01 where a string of polyG was sequenced in intron 7 by the Ion PGM sequencer. However, a depth of > 30 appears to be sufficient for accurate calls and a sequence run of more than 15 identical nucleotides seemed to be accurate enough both by the Roche Junior and Ion PGM sequencers in this study. Overall, the trend in the variation of depth distributions tended to be similar between the samples for both the Roche GS Junior and Ion PGM suggesting that read depth was probably more dependent on the gene sequence and grouping of nucleotides than other factors such as read length, fragmentation of PCR products for the library construction, fragment size selection, emPCR variability and efficiency of sequencing primer locations.
3.3. HLA assignment software comparisons
We developed an in-house software system (SEABASS) for HLA sequence analysis, allele and in phase haplotyping and compared it to two commercial software systems, the Conexio Genomics genotyping software, Assign-MPS v1 , and the Omixon Target HLA Typing software program . The user interface for the Assign-MPS v1 is shown in Figure 8 for (A) accurately genotyping the HLA-B*56:01:01(02) sequence and (B) not detecting DQB1*03:03:02 because of too many low quality sequence reads.
All three software systems imported and converted the.fna files (FASTA format sequence files for each read) generated by the GS Junior or the Ion PGM software into the appropriate files for HLA genotype assignments. All the software programs analysed the various sequence reads, sorted them according to genomic primer and MID tag or barcodes, compared the sequences to the IMGT/HLA sequence database and generated a consensus sequence with allele assignment taking into account the exonic and intronic sequence and phase relationship. The numbers of different reverse and forward reads for each amplicon were indicated, phased and automatically assigned a genotype only if the aligned sequences were perfectly matched with the alleles listed in the HLA sequence database (Figure 9). In cases where the aligned sequence reads were mismatched at one or more bases with the database, manual editing allowed for further investigation and assignment of a genotype.
A comparison of the three software systems for allele assignment in terms of platform, convenience, analysis speed, detection of new alleles, field 2 and field 4 level of typing assignment are shown in Table 7. All three software systems performed excellently for HLA typing at the field 2 level, but the SEABASS method was better than Assign and Omixon for HLA typing at the field 4 level. In addition, the SEABASS and Assign software could detect new alleles whereas the Omixon did not provide this possibility. However, Omixon was best for analysis speed and convenience of use. Overall, there was marginal difference in the efficiency, performance, analysis and outputs between the three output systems, although we favored our in house system over the commercially available Conexio Genomics software on a cost benefits basis.
|Detection of NEW allele||YES||NO||YES|
|Field 2 level typing||***||**||**|
|Field 4 level typing||***||*||*|
Our NGS study focused on developing a suitable SS-BT method for in phase HLA genotyping for research and diagnostic laboratories using two of the commercially available low to medium throughput capacity NGS systems, the Roche GS Junior and the Ion PGM [42,56]. So far, in our best-case scenario, we were able to sequence 11 HLA loci for 5 individuals in a single sequencing run by Roche GS Junior or the Ion PGM. We have not used the full capacity of the Ion Torrents sequencing chip and, therefore, there is a potential to use a greater number of loci or individual samples (at least 57 samples) than we have already used for a single sequencing run. Moreover, both platforms provided high-resolution or super high resolution HLA typing without ambiguities, depending on the LR-PCR design to amplify the HLA gene loci.
High-throughput HLA genotyping methodologies were previously developed using massively parallel sequencing strategies, such as Roche/454 [36,37,40-42,56,71] and Illumina MiSeq sequencing [43,72]. Most of these high-throughput HLA-genotyping studies amplified a few individual exons (usually exons two, three and four) in an exon based strategy and sequenced in a multiplexed manner. LR-PCR was used by a few of the investigators to amplify large genomic regions of each gene including introns and all or most of the exons in a single PCR [40,42,43,56,72]. We also chose to combine LR-PCR with NGS because LR-PCR requires only one or two primer sets and eliminates the need to validate multiple sets of primers to amplify all alleles in the exon-based strategy. In addition, the error rates of the polymerase enzymes used in LR-PCR, because of error repair, are typically two- to six-fold lower than that of Taq polymerase that is used in conventional PCR . We amplified and sequenced from the 5’ promoter to the 3’ UTR including exons 1-7 for HLA class I and class II genes in order to substantially improve the allele resolution for genotyping in comparison with the previous conventional genotyping methods, such as SSOP and SBT, with which allele calling of sequences was largely limited to exons 2 and 3 for HLA class I genes and exon 2 for HLA-DRB1.
We developed and tested LR-PCR for the following 11 class I and class II HLA genes, HLA-A, -B. -C, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1 and -DPB1 using 13 separate LR-PCRs. The entire gene sequence from the enhancer/promoter region to the 3’UTR-region was amplified for all of the HLA gene loci, except for HLA-DRB1, -DPB1 and -DRB3/4/5. PCR of HLA-DRB1 and -DPB1 was divided into two parts with enhancer/promoter to exon 2 and exon 2 to 3’UTR for HLA-DRB1, and enhancer/promoter to intron 2 and intron 1 to 3’UTR for DRB1*15:01:01:03 because of the large size of intron 1 (~ 8kb) and/or the complexity of nucleotide repeat (microsatellite) polymorphisms in intron 2 (Figure 7). The three DRB3/4/5 specific primer sets amplified the gene regions from intron 1 across to exon 6 and into the 3’UTR. The LR-PCR of these HLA genes revealed intronic as well as the exonic polymorphic sites, which extended the HLA allelic resolution phase, but also provided important phasing information to assist with the resolution of combination ambiguities and identifying previously unknown alleles outside the regions of exon 2 and 3. Although the IMGT/HLA database has sequences mostly of exons 2 and 3 [74,75], the promoter/enhancer, intron, and 3’UTR variants should not be ignored for more comprehensive HLA typing now and into the future. The genetic variants in a regulatory element such as promoter or even introns need to be extensively analyzed because autoimmune  and infectious diseases  have been associated with the differential expression levels of the HLA genes in different haplotypes . In addition, null alleles resulting from intronic polymorphisms warrant investigation and resolution to better understand their functional effects [79-82].
After PCR amplicon production, there are four main steps leading to sequencing and HLA data analysis; amplicon library preparation, emulsion PCR, NGS and HLA data analysis (Figure 2). The preparation of the template library is an important step because the PCR amplicons are used either singly or pooled together into a multiplex and then labeled with sequence tags (indexes, barcodes or MIDs) during library construction to facilitate sample multiplexing prior to emulsion PCR and sequencing. The addition of different barcodes to different sample libraries enables the independent detection of sequences in a mixture of different samples. Computing software is used to accurately parse the tagged sample files during the analysis of sequencing data. However, a possible disadvantage of using too many multiple barcoding tags is to lose sequencing depth because when samples are pooled for multiplexing the amount of input DNA for each sample, due to the increased sample number, is reduced.
In previous studies, most investigators using either the Roche or Illumina systems pooled their PCR amplicons at equimolar concentrations for even gene distribution before constructing barcoded libraries [36,37,40,41,43,56,71,72]. The ligation of barcodes to the fragmented DNA templates during library construction allowed a varying number of different samples to be sequenced during a single sequencing run, depending on read lengths, the capacity of the sequencing platform and the number of pooled amplicons that were used. For example, Erlich et al.  reported sequencing a maximum of 760 samples per run after loading a multiplex of 95-96 barcoded samples into a single lane of an 8-region PicoTiter Plate using the 454 Genome Sequencer FLX instrument. Initially, we used only a single gene-sequencing run (SGSR) by sequencing a single PCR amplicon and up to ten barcoded samples per sequencing run with the Roche GS Junior and chose to use only up to four barcoded samples per sequencing run with the Ion PGM . As described in this chapter, we now have changed from SGSR to a multiplex gene-sequencing run (MGSR), for which we pooled all 13 LR-PCR amplicons together into a single sample (Figure 5) prior to constructing libraries for five barcoded samples per sequencing run. This change, from SGSR to MGSR, reduced the number of sequencing runs from 13 to 1, thereby greatly reducing the workload, the cost per sample and the overall cost per individual. Currently, five barcoded samples using 13 LR-PCR amplicons in a single sample is close to Roche GS Junior’s maximum sequencing capacity of 80 Mb of sequence reads per run (assuming an average 100-bp read length) without compromising depth of coverage and increasing sequencing or genotyping errors. A statistically adequate depth of sequence coverage is essential to prevent alignment errors and minimize genotyping errors. In comparison, the Ion PGM 318 sequencing chip has far greater sequencing capacity of 1000 Mb ((assuming an average 200 bp read length) and potentially we could use the Ion PGM for genotyping up to 57 barcoded samples using the 13 LR-PCR amplicons in a single sample, if necessary. Although the sample number per sequencing run is low for the Roche GS Junior compared to high capacity NGS platforms, often a small HLA typing laboratory for transplantation matching would require the typing results from only a small number of samples on a weekly basis. If sequencing is required for a very large number of samples, such as for association studies or population diversity studies, our workflow for the Roche Junior can be easily adapted to the larger capacity platforms such as the Roche 454 Genome Sequencer FLX instrument.
Apart from different sequencing capacities, the Roche Junior and Ion PGM use different sequencing principles and procedures. Roche uses a pyrosequencing fluorescence technology with a light output detected by camera scanning [31, 32], whereas the Ion PGM uses Ion Torrent semiconductor sequencing technology with simple sequencing flow chemistry and no light . Ion PGM is the first commercial sequencing machine that does not require fluorescence and camera scanning, resulting in higher speed, lower cost, and smaller instrument size. Currently, it enables 200 bp reads in 3 hours and the sample preparation time is less than 13 hours for 8 samples in parallel. Because every new sequencing technology introduces unique errors and biases into the resulting DNA sequences, a proper understanding of the NGS specific characteristics that are used to identify and interpret reads is crucial in assessing the accuracy and applications for this new technology. Both manufacturers provide their own unique software to processes the raw acquisition data and produce read files that contain high quality consensus reads and draft assemblies. Roche has the GS analysis software and Ion Torrent has a web browser driving the Torrent Suite Software on computers attached to their respective sequencing instruments. For both platforms, we used the manufacturers software to preprocess the DNA sequencing raw data before transferring the edited data onto our HLA genotyping software. Therefore, we have not directly addressed what errors are introduced into the raw reads by the NGS apparatus. However, others have shown that the major sequencing errors with both sequencing instruments prior to pre-processing are largely related to high frequency indel polymorphisms, homopolymeric regions, replicate bias, and substitution errors that mostly increase in rate with distance from the read start [45,65-67]. In addition, the quality scores that are Phred-based can be only used to detect inserted and substituted bases for both the Roche and Ion Torrent platforms. While the Ion PGM quality scores underestimate the base accuracy, the Roche 454 quality scores tend to over estimate the base accuracy [45,65-67]. Although there are no key studies of raw error reads directly comparing the Roche GS Junior with Ion PGM outputs, a recent statistical study has concluded that the accuracy of the Ion PGM is poorer than that of light-based technologies . On the other hand, direct comparisons of the Roche light-based technologies and Ion PGM of pre-edited sequencing data of bacteria has generally concluded that there is little difference in the accuracy of the sequencing data and that most errors arise with indel polymorphisms and homopolymeric regions [45-47]. Currently, we are conducting our NGS sequencing experiments with 100 DNA samples using the Roche Junior and 200 DNA samples with the Ion PGM, so, at this stage, we have not performed any accurate or meaningful statistical comparisons for the various sequencing variables produced by the two different platforms. In general, however, the indels and homopolymers are only a minor problem for SS-SBT HLA genotyping by NGS using either of the two platforms.
After pre-editing the sequences generated by the two NGS platforms using the manufacturers software, we transferred the edited data to HLA typing software. We compared a software program, called the Suzuki method, which was developed in-house, against two commercial programs, Omixon Target (Omixon) and Assign MPS (Conexio). These stand-alone software programs were assessed and compared for in phase sequence alignment of HLA genes and for allele assignment at different levels including the 8-digit level. Most genotype-calling algorithms select the HLA type candidates based on optimized alignment to cDNA references from the IMGT-HLA database due to the lack of genomic reference sequences. All three software programs consolidated and assessed the various sequence reads, sorted them according to genomic primer and barcodes, compared the sequences to the IMGT/HLA sequence database references and generated a consensus sequence with allele assignment taking into account the exonic and intronic sequence and phase relationship. The barcodes or MID tags identified by the software were used to reveal the reads of the various samples. Scores for sequence measures and quality assurances were provided including sequence depth, allele depth and allelic balance. Reads were aligned to the various loci and regions based on 100% matching between the read sequence and the reference library. Finally, a consensus sequence was generated and allele assignment made taking into account exonic and intronic sequence as well as phase relationship. The numbers of different reverse and forward reads for each amplicon were indicated, phased and automatically assigned a genotype only if the aligned sequences were perfectly matched with the alleles of genomic references or constructed references. A mapping procedure was applied for each candidate allele to verify the accuracy of the HLA typing and to detect novel alleles. In cases of less than 100% match, even at 99%, mis-mapping can occur among the HLA loci. Also, at least an average 30-fold depth was necessary to identify genetic variants with high sensitivity and resolution. When sequencing read numbers were too few, then we could not make exact assignments. If aligned sequence reads were mismatched at one or more bases with the database, manual editing allowed for further investigation and assignment of a genotype. Overall, there was little difference in the efficiency, performance, analysis and outputs between the three software systems, although we favored our in house program over the commercially available software on a cost benefits basis. Because the IMGT/HLA sequence database has relatively few genomic sequences, at less than 6% of the database entries, a major task remains to continue building a suitable reference library for all the known polymorphic HLA genes.
Our main aim in using the NGS technology was to eliminate the ambiguities currently associated with the conventional HLA genotyping methods. So far, we found that the SS-SBT method is superior to other HLA DNA typing methods, especially to efficiently detect new HLA alleles and null alleles at the 8-digit level of DNA typing without ambiguity. Although, at most, only 100 Japanese and Caucasian genomic DNA samples were used in this study, we unequivocally defined the HLA-A, -B, -C, -DRB1 and -DQB1 loci to single HLA alleles at the 8-digit level without any ambiguity. In addition, 17 DRB allele sequences, seven in DRB1, three in DRB3, four in DRB4 and three in DRB5, were newly determined to the field 4 level of allele resolution without phase ambiguity by SS-SBT. However, achieving a complete depth of correct sequence information in most samples of DRB1 and in some samples of DRB3/4/5, such as TU20 in DRB3, was compromised by the presence of microsatellites in a limited number of intronic sites. Major sequence instabilities were encountered in our study with T5~17 and T2~27 mono-stretch sequences and GT7~28 and GA3~23 microsatellite repeats in intron 1 of DRB1, intron 5 of DRB1 and intron 2 of DRB1 and DRB3/4/5 alleles that were obtained in our study and from the IMGT-HLA database. These instabilities within the microsatellite repeats are probably compounded by PCR and sequencing errors , but could be solved by Sanger sequencing of PCR products using high fidelity DNA polymerase.
All the allele sequences, excluding DRB1*09:21, were perfectly matched to at least the previously reported field 1 or field 2 level of allele information . Therefore, the newly designed primers and PCR conditions for HLA-A, -B, -C, -DRB1, -DQA1, -DQB1, -DPA1 and -DPB1  and -DRB3/4/5  are efficient for DNA typing by the SS-SBT method. Of the 164 alleles detected at the field 4 assignment level in 81 Japanese and Caucasian samples, 78 (47.6%) were newly detected alleles. Therefore, simply increasing the sample size in future analyses of HLA polymorphisms by the SS-SBT method could identify new non-synonymous substitutions along with indels that generate a null allele. A new allele DRB1*09:21 that we identified in only one sample so far, in comparison to the 129 DRB1*09:01 positive genomic DNA samples typed by SS-SBT in the Japanese population, suggests that the new allele has extremely low frequency in the Japanese.
We encountered problems with some microsatellite sequences, especially for DRB1*15:01:01:03 with the complexity of microsatellite polymorphisms in intron 2. In addition, the 8-digit level HLA alleles could not be assigned in HLA-DQA1, -DPA1 and -DPB1 because the SNP and indel densities to separate both of the phases were much lower than in the other HLA loci. Resolution was difficult for phase ambiguities in the conserved sequences at HLA loci, such as exon 3 in the HLA class II genes. Therefore, most HLA allele sequences are still unknown at the 8-digit level. In this respect, collection of the 8-digit level reference sequences using HLA homozygous DNA samples, haplotype extraction methods  or third generation sequencers such as one molecule real time DNA sequencer PacBio RS [85,86] (Pacific Biosciences, Menlo Park, CA) that provide a 3 kb read length on average are necessary for solving the current genotyping problems to improve the SS-SBT method.
The collection of HLA allele sequences at the 8-digit level and the development of HLA allele assignment programs are necessary to improve the SS-SBT method. The average depth of HLA-A, -B and -C was stable with a ratio of ~1:1. However, in comparison, the average depth of HLA-DRB1 for LR-PCR mix 2 and -DQB1 was less stable with ratios varying between 1:1 and 1: <2 . Monitoring the depth ratio and potential allele dropout is important to detect PCR bias due to unexpected variations in the primer sites. In cases where the depth ratio is drastically changed such as 1:5 or more, the primer sets and/or PCR condition may have to be modified. Nevertheless, the newly developed HLA DNA typing method SS-SBT  is potentially applicable to the diagnostic laboratory once some of the minor problems described above are solved in future.
Our study is the first direct comparison between the GS Junior and Ion PGM sequencing platforms for HLA genotyping. Although we have as yet to maximize the potential capacity of the Ion PGM to meet our theoretical expectation of sequencing the pooled PCR amplicons for 11 gene loci using 57 barcoded samples, the Ion PGM seems to perform as well as or better than Roche Junior on a number of fronts. However, it is difficult to choose between the two platforms at this stage and further work and comparative analysis will need to be performed before drawing a definite conclusion. Whereas there is little difference in overall performance between the two platforms at this time, the new Ion Torrent 318 chip offers greater capacity for more samples and more accurate reads with read lengths of 400 bp, for which we are presently testing 300 samples. In addition, automation of library preparations and emPCR amplification steps prior to sequencing has improved the overall turn around time from library preparation to allele readouts from four days to two/three days with the sequencing step performed overnight (Figure 2).
Development of better technologies to reduce the complication of the process and running costs is also necessary before the SS-SBT method is introduced into the diagnostic laboratory. In a previous simulated test of the Ion Torrent PGM system in a clinical laboratory setting, we needed four days for the manual sample preparation and library construction from the time of PCR amplification to HLA allele definition . However, with the use of automated procedures such as the AB Library BuilderTM system and Ion OneTouchTM system from Life Technologies we have shortened the workflow significantly to just two days with a running cost of US $17 per locus per sample. In comparison, the workflow for the Roche Junior remains at four days from the time of PCR amplification to HLA allele definition at a running cost of US $40 per locus per sample. Also, a new protocol using the new Ion 318 chip with 32 barcodes for 400 bp-read high sequencing quality has increased sequencing capacity and enabled sequencing reads up to and beyond 400 bp . Although we are at this moment testing the new Ion Torrent protocols with the Ion 318 chip it looks likely to lead to further improvements and further cost reductions. Therefore, a decrease in the running costs of NGS is expected to soon be substantially better than those of the conventional HLA DNA typing methods.
Whereas we have tested the Roche Junior and Ion PGM NGS compact systems, the Illumina MiSeq is another commercially available compact NGS . The Illumina MiSeq sequencer appears to compare well with Roche Junior and Ion PGM for HLA typing [43,72] and sequencing other regions and targets of interest . Other laboratories have favored the Illumina MiSeq system and have published the workflows and results for HLA typing and sequencing haplotypes to high resolution levels [43,72]. The Illumina MiSeq offers some advantages over the Roche and Ion PGM such as low DNA input amount, the pair-end reads (ie., fragments sequenced in both directions) to determine in phase alleles and possibly, a better and more reliable resolution of substitutions and indels . However, the Illumina MiSeq may not perform as well for the middle reads of a 150 bp/200 bp sequence and the generated reads may exhibit rapidly increasing error rates as the read length increases, resulting in lower quality contig assemblies . Also, the HLA DNA data analysis step appears to be a more difficult process [43,72] than that for the Roche Junior and Ion PGM (unpublished data). In cases where the phase of an allele is unresolved by the NGS system and HLA software other approaches may help to resolve the allele ambiguity. For example, one promising approach is to use DNA haplotype-specific extraction of the gene from the sample using a commercially available extraction kit, such as the solid-phase, capture-based EZ1 HaploPrep kit from Qiagen , and then resequencing the unresolved haplotype. The DNA haplotype-specific extraction procedure can extract genomic regions of up to 50 kb of contiguous sequence without amplication or concentration of the extracted DNA, and it has been used successfully to genotype and haplotype HLA class I and class II genes [84, 89-93] including adjoining genes such as HLA-B, MICA and HLA-C .
HLA in-phase genotyping is important for a variety of applications including for infectious disease studies, transplantation, pharmacogenomics, autoimmune diseases, population diversity and human evolution, and treatment of cancer pathology by vaccination. While we used the genome from whole blood or from peripheral blood mononuclear cells for HLA in phase-genotyping the analysis of HLA cDNA from cells and tissue (fresh or formalin fixed) would also be of value [38,39]. In humans, the HLA-A, -B, and -C locus-specific gene expression patterns were reported in the peripheral blood leukocytes, colon mucosa and larynx mucosa by real-time PCR . Recently, we investigated the relationship between haplotypes and gene expression levels of the class I genes by sequencing the genomic DNA of pigs using the Roche Genome Sequencer 454 FLX and found that the sequence read numbers closely reflected the gene expression levels in white blood cells . The use of the NGS sequencing method in human studies, similar to our MHC locus-specific expression analysis in pigs, could also provide informative data in various biomedical studies on HLA gene expression, such as the detection of expression levels among inter- and intra-populations, and among different tissues, both before and after vaccination against pathogens.
High-resolution donor-recipient HLA matching contributes to the success of unrelated donor organ and marrow transplantation [2,96]. HLA typing also plays critical roles in donor and recipient matching for embryonic (ES) or induced pluripotent stem (iPS) cell transplantation therapy. Currently, an iPS cell bank project is under way, led by Kyoto University, with the participation of one of the co-authors (HI) of this chapter. It was reported that when thirty iPS cell lines that have different combinations of homozygous HLA-A, -B, and -DR are generated, these iPS cells will have matched the three loci in 82.2% of the Japanese population. For fifty iPS cell lines, the chance of a match increases to 90.7% in the Japanese population .
In addition to finding new polymorphisms within exons, the SS-SBT method can analyze polymorphisms in introns, promoter, enhancer, and 3’-UTR regions that have largely remained unexplored until now. By analyzing polymorphisms in the entire region of HLA genes, the functional influence of those polymorphisms can be revealed in transplantation, diseases, and adverse effects of medications. For example, it is expected that about one in a thousand people have a null allele (deficient mutant that influences function or expression) in the HLA region. Since null alleles have a profound influence on GVHD or transplant establishment, especially in hematopoietic stem cell transplantation, the detection of null alleles is considered to be of great clinical importance.
The SS-SBT method has been developed to obtain massive and accurate sequencing data easily and cost effectively. Recently, the authors of a critical review of HLA typing by NGS  pointed out that the Roche company will discontinue product support for the 454 sequencing systems in 2016, implying that the Roche Junior sequencing platform will be phased out commercially in the near future. However, almost all the materials and reagents required for the Ion PGM™ system are available as kit products, and one Gb of sequencing data can be obtained within 5 hours of running the protocols and data analysis. Up to fifty-seven samples can be multiplexed and eight HLA gene loci analysed at an eight-digit level in a single run. Compared to the Luminex beads method or the SBT method, the SS-SBT is more cost effective. For these reasons, the Ion PGM™ sequencer is potentially the perfect sequencer for HLA genotyping using the SS-SBT method. With an expected increase in throughput and the development of an automated system in the future, we hope that Ion PGM™ will be even easier to use for routine HLA genotyping.
In conclusion, we have developed procedures for massively parallel sequencing of multiplex products that can be used for several benchtop sequencing platforms. We have obtained sequences of sufficient high quality to permit accurate HLA in phase genotypes across the full-length gene from the 5’ promoter/enhancer region to the 3’ UTR for most of the classical class I and class II genes. The use of sample tags or barcodes allows for optimization of second generation sequencing technologies by pooling samples and sequencing multiple samples in parallel for time- and cost-efficient workflows. We are currently working towards optimizing the Ion PGM/SS-SBT method for HLA in phase genotyping for both the clinical diagnostic and research laboratories.
The software Assign MPS 1.0 used in this study was a beta version and still under development at the time of publishing the chapter and not available commercially from Conexio.