Hop-variety Identification Using First- and Second-generation Sequencing

Twenty-one hop varieties from Europe and the United States were successfully identified by DNA analysis, based on single nucleotide polymorphisms (SNPs; including insertion/ deletion sequences) as identification markers. Several dozen megabases of transcriptome sequencing data were obtained by next-generation sequencing of samples from three hop varieties and compared to search for the regions containing SNPs. Consequently, four SNP-rich regions were selected as candidates for exploring identification markers in the hop varieties. Sequence data from these regions in all the tested varieties were obtained by the normal Sanger method and compared for the SNPs present. Combination of these SNPs could work well for identification of the 21 hop varieties. Moreover, the mixture of two varieties could be correctly evaluated by using this method. Hop pellet samples of two different varieties were mixed in various ratios and DNA sequencing was carried out. As a result, 5% contamination of a different variety could be detected by examining the electropherogram of the SNP positions. More quantitative methods for mixture evalu‐ ation could be expected using DNA techniques, such as quantitative real-time PCR. Be‐ cause this SNP-based identification method utilizes the DNA sequence itself, it could be a reproducible tool for accurate identification of the hop varieties.


Introduction
Accurate identification and use of hop (Humulus lupulus) varieties is very important for the production of quality beer. Hop varieties are usually identified by sensory analysis (taste, appearance, and smell), detection of the differences in the cone structure, and content of different biochemical substances, including alpha acids and essential oils. However, these methods present certain limitations, as accurate identification of the pelletized hop is not possible by observing its external appearance. Moreover, the content of biochemical substances in hop can vary depending on the cultivation conditions.
Analysis of single nucleotide polymorphisms (SNPs; differences of single nucleotide in homologous DNA among different varieties) in genomic DNA might be a better and more reproducible tool for the identification of varieties. SNPs are widely distributed in the genome and could be used as markers for the assessment of genotypes. For example, variation in DNA sequence and expression of valerophenone synthase (VPS) gene, a key gene of the bitter acid biosynthesis pathway, has been investigated in hop, using SNPs [18]. However, a large amount of DNA sequence is needed to obtain sufficient SNPs in order to identify the different varieties. In this context, high throughput next-generation sequencing (NGS) generally provides several hundred thousand-times more sequence data in a single analysis compared to the conventional Sanger method, but the whole genome sequencing by either method is still very expensive and time-consuming. In fact, genome sizes of two representative hop varieties, lupulus and neomexicanus, are 2.74 and 2.97 Gb, respectively [19], which are comparable to that of the human genome [20].
To overcome these problems, transcriptome analysis has been employed for the identification of hop varieties. Transcriptome is the entire mRNA content, transcribed from the genome, and its size ranges from one hundredth-to two hundredth-parts of the genome. Nevertheless, even by a conservative estimate of an average of one SNP per 1000 bp, based on the frequency of SNPs observed in the human genome [20], 13.5K to 30K SNPs could be expected in a relatively short period. Such a high frequency of SNPs would be enough for the identification of hop varieties. The discovery of a large number of SNPs and their specific combinations in each variety could lead to the identification of many hop varieties and detection of contaminants in mixed varieties using these specific SNP-combination-based markers. In the present study, we developed an SNP-based identification method for the hop varieties [21].

SNPs
In order to obtain intravariety DNA polymorphic regions required for developing a hop variety identification technique, we attempted searching for SNPs in a large amount of sequence data obtained using NGS. We focused on the transcriptome because transcriptome analysis requires about 100-times less data processing than whole-genome analysis, thereby reducing the lead time and cost. Besides, according to our calculation, there are already available data for as many as 15000 SNPs in the transcriptome analysis; the data size, however, is smaller than that for the whole-genome analysis. Thus, we performed an intravariety SNP analysis, using this technique, based on the assumption that SNPs required for the identification of many varieties could be obtained, thereby.

Sample collection and storage
The hop varieties to be identified were Saaz, Sládek, and Premiant, which originated in Czech Republic; Tradition, Spalter, Spalter Select, Perle, Tettnang, Brewer's Gold, Northern Brewer, Magnum, Herkules, German Nugget, and Taurus, which originated in Germany; and Cascade, Zeus, Summit, Galena, Super Galena, Nugget, and Columbus/Tomahawk, which originated in the United States (here, as is widely assumed, we considered that Columbus and Tomahawk were genetically identical). Pellets or dried samples of these varieties were obtained from appropriate suppliers. Three varieties (referred to as A, B, and C for convenience) were selected, and fresh leaves were collected from these varieties. The leaves were sampled and stored according to the procedure described below: Tissue: Leaves as young (small, yellow-green, and soft) as possible were collected. Those with white foreign matter on the surface were excluded.
Methodology: To prevent RNase contamination, leaves were collected with gloved hands and were soaked in a reagent (RNA Save; Biological Industries Israel Beit Haemek Ltd., Israel) for preventing RNA degradation.
Storage: Although RNA was stable for at least 1 week even at room temperature, the leaves were stored under refrigeration for as much time as possible until being used for RNA isolation and transcriptome and SNP sequence analysis.

Transcriptome analysis
The collected samples were used for transcriptome analysis. The procedure was subcontracted to Eurofins Genomics (Ebersberg, Germany). Briefly, the total RNA was extracted from the samples using RNeasy Plant Mini Kit (Qiagen Inc., Valencia, CA, USA) according to manufacturer's procedure. The quality of total RNA was evaluated in terms of the degree of degradation of rRNAs with Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Normalized cDNA library for use in Roche GS FLX Titanium sequencing was prepared as follows: poly (A) + RNA was isolated from the total RNA, and the first-strand cDNA synthesis was primed with an N6 randomized primer. Normalization was carried out by one cycle of denaturation and reassociation of the cDNA. Reassociated double-stranded (ds) cDNA was separated from the remaining single-stranded (ss) cDNA by passing the mixture through a hydroxylapatite column. The ss-cDNA was amplified by 9 PCR cycles. The cDNA library in the size range 500-700 bp was eluted from a preparative agarose gel. Emulsion PCR and sequencing were conducted according to standard protocols of Roche and the normalized cDNA library was sequenced in 1/2-plate run of GS FLX Titanium. Library preparations and their sequencing were carried out by Eurofins Genomics.

Preparation and assembly of contigs for SNP searches
Using the transcriptome sequence data obtained as described above, contigs were prepared under a subcontract to Eurofins Genomics. Briefly, the procedure comprised of sequence clustering and assembly for each of the varieties, based on the nucleotide sequences of the DNA fragments. De novo assembling from the unique single-read data was performed by MIRA Assembler Version 2.9.45 x 1 (for sequence assembly; Rheinfelden, Germany). To search for SNPs, contig and singlet data obtained in one of the hop varieties served as a reference for the single-read data obtained from the other two varieties. Specifically, with the nucleotide sequences of the contigs and singlets of variety C being used as reference sequences, single reads of varieties A and B were each applied and mapped to the reference sequences according to whether they shared a common portion. Further, contigs constituted by the mapped single reads were identified from the assembling information deployed on the analysis software. The reference sequences as well as the contigs and/or singlets of the other varieties were aligned to search for SNPs using bioinformatics analysis. Average reads per contig were 7 and 6 in variety A and B, respectively. The detected SNPs were reproducibly present in each variety and were therefore not the artifacts of error. Further, the nucleotide sequences of the contigs and singlets of variety B were used as reference sequences, and single reads of varieties A and C were compared and mapped to the reference sequences to search for SNPs in the same way, as mentioned above. A similar exercise was performed with the nucleotide sequences of the contigs and singlets of variety A being used as reference sequences, and the single reads of varieties B and C were applied and mapped to the reference sequences to search for SNPs.
The NGS performed in the 3 varieties generated a total of 589K to 638K reads with the total number of bases without keys, tags, and bad-quality bases being 191 to 227 Mb and the average read length without keys, tags, and bad-quality bases being 299 to 367. These values were comparable to the equipment spec ( Table 1). Numbers of contigs (part of cDNAs) assembled in each variety were 42K to 45K. Among these contigs, there were about 4500-6700 contigs with a length of 1000 bp or more (Table 2).

Evaluation of SNP detection by NGS RNA method
SNPs were searched in the contigs by comparing among the 3 varieties. As a result, 10.4K to 19.3K SNPs were obtained, as shown in Table 2. The numbers of SNPs were almost compatible with the expected numbers, 13.5K to 30K.

Results for SNP analytical regions
To call variants, mapping analysis was performed among the three hop varieties, which were combined with the contigs (as reference sequences) and single reads of each other. They were mapped and called by GS Mapper Software (Roche Applied Science, Penzberg, Germany).
Numbers and types of SNPs per contig or singlet were obtained. For example, if 16 single reads were selected as candidate DNA fragments containing SNPs, in some cases, all of them had the same nucleotide and were different from the reference, whereas in other cases, some of them had same nucleotide and the others had a nucleotide same as that of the reference. We called the former case "HOMO," in which identification was thought to be done more easily than in the latter case, designated as "HETERO." We selected the contigs that had more "HOMO" SNPs per single contig. For example, a contig (A1 c1675) containing 6 different SNPs was selected when the mapping was performed in the A1 region contig as a reference with single reads of variety B. This contig (A1 c1675) had 4 SNPs when single reads of variety C were used. In the same way, the other SNP-rich regions were selected. Thus, 4 SNP-rich regions, A1, B1, C1, and A1-2, were obtained.
The results are shown in the Table 3.  Primers were designed for each region using DNASIS Pro software (Hitachi Software Engineering Co., Ltd.; Tokyo, Japan), and PCR amplifications of the four regions were performed for further confirmation of the analysis regions.

DNA extraction from hop varieties
DNA was extracted from the pellets or dried cones of the three hop varieties (A, B, and C), used in the transcriptome analysis described above. DNA extraction by the cetyltrimethylammonium bromide (CTAB; Sigma-Aldrich Co. LLC., St. Louis, USA) method was carried out as follows. The CTAB solution comprised of 2% (w/v) CTAB, 100 mM Tris-HCl (pH 8.0), 20 mM EDTA (pH 8.0), 1.4 M NaCl, and 1% (w/v) polyvinylpyrrolidone (PVP). About 10-50 g of hop pellets or 1 g of dried cones were ground in a mortar with or without liquid nitrogen. Next, 650 μL of the CTAB solution and 2 μL 1 mg/mL RNaseA solution were added to the ground material in a 1.5 mL microtube and stirred well. This microtube containing the sample and the CTAB solution was submerged in a constant temperature water bath held at 65°C such that the content of the tube was completely underwater; the mixture was incubated for 1 h to disrupt hop cells. An equal volume (650 μL) of chloroform/isoamyl alcohol (24:1; Sigma-Aldrich Co. LLC.) was subsequently added, and the mixture was manually mixed by inversion for 3 min. The mixture was centrifuged in a Beckman's Allegra 21R centrifuge (with F2402H rotor) at 15,000 rpm (ca. 15,000×g) for 1-5 min and thereby fractionated into an organic solvent layer (lower) and an aqueous layer (upper). The aqueous layer (ca. 400 μL) was removed to a new tube. After the addition of an equal volume of isopropyl alcohol (Wako Pure Chemical Industries, Ltd.), the mixture was mixed by inversion and then centrifuged in the same way as described above. The supernatant was discarded and the remaining sediment was rinsed with about 500 μL of 70% ethanol (Wako Pure Chemical Industries, Ltd.) and dried in an MV-100 MicroVac Mini Vacuum-Centrifugal Evaporator (TOMY Seiko Co., Ltd.) for about 5 min. The residue was dissolved in 20-50 μL of TE buffer (pH 8.0, Nippon Gene Co., Ltd.) to obtain the DNA sample.
The extracted DNAs were subjected to amplification of the respective regions A1, B1, C1, and A1-2 using the primer sets shown in

Confirmation of PCR amplification
Five microliters each of the PCR product obtained, as described above, was subjected to agarose gel electrophoresis, performed using a mini gel electrophoresis system, Mupid (gel: 3% NuSieve 3:1 Agarose; FMC BioProducts or Cambrex Bio Science Rockland). The electrophoresis was carried out at 100 V for 30 min, and the amplification products were visualized under ultraviolet light illumination (Printgraph, Atto Corp.) after staining with ethidium bromide (2 μg/mL) for about 40 min. The presence of the amplified DNA fragment of the intended size was ascertained for each DNA sample. The sizes of DNA fragments amplified using the respective primer sets are described in Table 4.
The purification of the resulting PCR products was performed using the QIAquick PCR Purification Kit (Qiagen Inc., Valencia, CA, USA) according to the protocol recommended by the manufacturer, as described below. Five volumes of PBI buffer (225 μL) was added to the PCR product (45 μL) and mixed. The resulting solution was placed in the QIA quickspin column and centrifuged at 13,000 rpm (ca. 11,000xg) for 1 min. The supernatant was discarded, 750 μL of PE buffer was added, and the mixture was centrifuged at 13,000 rpm for 1 min. The supernatant was again discarded and the remaining solution was centrifuged at 14,000 rpm (ca. 13,000×g) for 1 min. The supernatant was discarded, the column was transferred to a new 1.5 mL Eppendorf tube, and 30 μL of EB buffer was added. After being left to stand for 1 min, the solution was centrifuged at 13,000 rpm for 2 min to elute the purified PCR product.

Confirmation of SNP markers by Sanger DNA sequencing
Cycle sequencing was performed using the BigDye terminator v1.1 cycle sequencing kit (Life Technologies Japan Ltd.), according to the protocol prescribed by the vendor. The purification of the product as the template DNA for sequencing was performed by the Centri-Sep Spin Column (Life Technologies Japan Ltd.) according to the manufacturer's protocol. DNA sequencing was performed on an ABI PRISM 310 genetic analyzer (Life Technologies Japan Ltd.). Sequence data obtained from the 5' and 3' ends were checked, and the correct base sequence was determined. The nucleotide sequences determined from the tested varieties were aligned in each analysis region by using ClustalW (DDBJ; DNA Data Bank of Japan), which is a popular multiple sequence alignment program for DNA.
Thus, the amplified regions were confirmed. Comparison was also made with the data obtained using the next-generation sequencer, to confirm that the identification of the three varieties A, B, and C was possible. The size of the amplified products from the regions, A1, B1, C1, and A1-2, were 651, 729, 640/646, and ca. 1500 bp, respectively.
The PCR product from the B1 region was bigger in size than expected, and on close scrutiny, it was found that this DNA region included a 111-bp insertion sequence, which might be that of an intron.
For reference, the two different sizes of the amplicons obtained from the C1 region represented the size with or without the 6-nucleotide insertion in this region.
Since the fragment of region A1-2 had a length of about 1,500 bp, the sequencing data obtained for this region consisted of sequences from two regions: an analytical region, A1-2-L of 538 bp from the 5' end (primer L), and an analytical region A1-2-R of 516 bp from the 3' end (primer R).
It was determined that the analysis of regions A1, B1, C1, and A1-2 (and the SNPs contained, therein) was also applicable to the identification of other varieties in addition to the three mentioned above. Thus, analysis of a number of varieties was carried out.

Identification markers for 21 hop varieties
Twenty-one varieties of hops were selected for identification based on the nucleotide sequence in region A1.
Using nucleotides at the SNP positions in the fragments of A1-2-L (sequenced from 5' primer) and A1-2-R/R2 (sequenced from 3' primer), Perle and Northern Brewer as well as Premiant, Zeus, and Summit were successfully identified, when 21 and 67 SNPs were found between Perle and Northern Brewer, and 59 SNPs were found among Premiant, Zeus, and Summit.

Characterization of SNP-rich regions and identification of hop varieties
In every SNP-rich region, consensus sequence and SNP positions were detected by the alignment analysis. Nucleotide polymorphisms in each variety were evaluated at the SNPs positions in all the regions. Within the 21 studied hop varieties, 24 SNPs were identified in the A1 region and 14 SNPs (including indel) were observed in the B1 region. H. lupulus, with normally a diploid chromosome, had heterozygous or homozygous SNPs in each DNA region. For example, in the A1 region, European varieties had homozygous SNPs at 77th (C/T) and 103th (A/G) positions. Such homozygous SNPs can be analyzed more easily than the heterozygous SNPs when mixed with another variety, having different nucleotides at the SNP positions. In a case, where the variety Saaz was mixed with the variety Sládek, position 77 SNPs were T and C, respectively, in the A1 region (data not shown), and these nucleotides could be easily distinguished on their electropherograms. On the other hand, position 74 SNPs were W (A and T) and A, respectively, in the same region. In this case, it was difficult to recognize whether the variety Saaz was mixed with the variety Sládek or not. Each variety had a specific combination of SNPs as markers, and 12 and 9 DNA types in the A1 and B1 regions, respectively, were identified among the 21 varieties. Indel was also found in the C1 region. Nucleotides at the SNP positions including indel were observed. Forty-four SNPs were found  in this region, and 5 DNA types were identified among the 21 varieties. It was revealed that Galena had 2 DNA types in the B1 region and Cascade and Super Galena had 2 DNA types in the C1 region.
Additionally, the A1-2 region was searched because there was no difference in the combination of the SNPs in A1, B1, and C1 regions between Perle and Northern Brewer, among Premiant, Zeus, and Summit. Nucleotides at the SNP positions in the fragments of A1-2-L (sequenced from 5' primer) and A1-2-R/R2 (sequenced from 3' primer) were shown, when 21 and 67 SNPs were found between Perle and Northern Brewer, and 59 SNPs were found among Premiant, Zeus, and Summit. Perle and Northern Brewer as well as Premiant, Zeus, and Summit were successfully identified.
For confirmation, we tried to distinguish between Columbus and Tomahawk, considered genetically identical [22][23][24], by analyzing the 4 regions and, additionally, the middle area of the A1-2 region, with newly designed primers, A1-2-M_F and A1-2-M_R (Table 4). No differences in the nucleotide sequences were detected between them. Therefore, we confirmed that these 2 varieties are indeed identical and could be considered a single variety.
Consequently, 21 hop varieties were successfully identified by a combination of these genotypes of SNPs, with differences in the 4 SNP-rich regions. The summary results are shown in Table 6.

Comparison of DNA samples prepared from hop pellets and cones
Comparison was made between the results obtained in the two cases where DNA samples were extracted from either the hop pellets or the dried hop cones, as described in Section 2.2. It was confirmed that DNA extraction and sequencing were possible with both types of DNA samples. Also, analyses using DNA samples of both types yielded the same results. This result further demonstrates that inspection at a processing step (e.g., inspection for contamination at a pelletization step) is technically possible.

Comparison between three Saaz clones
About 1 g each of dried cones of three Saaz clones (Osvald's clones 31, 72, and 114) was ground in a mortar in the presence of liquid nitrogen, and the DNA was extracted from about 50 mg each of the ground materials by the CTAB method described in Section 2.2. Each of the extracted DNA sample was subjected to amplification of the DNA fragments from regions A1, B1, and C1 and followed by sequencing of the amplified fragments, according to the procedures as described above. For each region, the nucleotide sequences of the three clones were aligned for comparison with each other.
The results obtained confirmed that there was no difference among the three Saaz clones in terms of the nucleotide sequences in the analyzed regions. In other words, it was found that any of Saaz clones could be identified through the above-described analysis using the inventive variety identification regions. These results demonstrated that the analysis using regions A1, B1, and C1 as the variety identification regions required no determination by clone.

Identification of variety mixtures using the newly identified SNP markers
Because it is important to detect contamination of other varieties, we developed a method for detecting such contamination by using the SNPs markers. The mixed samples of Saaz and Premiant hops were analyzed (for region A1). To prepare different samples, pellets of these varieties were ground as described above followed by mixing of the ground materials in the relative proportions by weight as mentioned in Table 7.
A representative electropherogram, containing the SNP position 77 in the A1 region (A1_#77), is shown in Figure 7. Peaks in the electropherogram represent fluorescence intensity of 4 different nucleotides at each DNA positions, which are depicted in different colors; A, C, G, and T, are shown in green, blue, black, and red, respectively. Variety I is homozygous (TT) at position 77, and variety II is homozygous (CC) at the same position. In the case of mixing of varieties in 50% proportion, overlap of 2 peaks of T and C was observed at this position, reflecting contamination. Figure 7 also shows peaks in the same region when variety II was mixed at 5 and 10%. Peaks in blue at this position, derived from variety II, were detected even at 5% contamination. It was the same for A1_#199 and A1_#204, in which variety I was heterozygous (GC) and variety II was homozygous (CC), so it was not easy to recognize the contamination by peak height of electropherogram. We estimated more SNPs, with a focus on homozygous SNPs, and calculated an average of the mixing rate in variety II. As a result, we obtained a closer estimate of the mixing rate. We postulate that contamination at levels lower than 5% could be estimated; however, the acceptable contamination level for us might be 5% with reference to the genetically modified organism contamination in Japanese food. Thus, the values below 5% have a slight importance. Each sample was prepared by mixing pellets of two varieties in specified proportions by weight.

Conclusions and future directions
The results obtained in the present study suggest that the SNP-combination method has high reliability as it utilizes the sequence data. Computational handling of SNP data can be easily carried out so that analysis among many varieties could be easily performed with digital information. Other SNPs identified in the future could be used as additional identification markers. Therefore, using this method, enhancement of accuracy and repeatability in the variety identification could be accomplished in a relatively simple manner.
As SNP-rich DNA regions are only 0.6-1.5 kb in size, they are much less likely to be damaged during the processing of hop products. However, upon degradation and fragmentation of the SNP-rich regions, newly designed primers for the shorter fragments generated might be useful for the amplification of the fragments and may contain several useful SNPs which can be used as identification markers.
The SNP-based information could be also used for quantitative determination of the ratio of variety mixture. In the future, more accurate results will be obtained by using quantitative real-time PCR and/or NGS, which may provide a huge amount of sequence data and increase reliability.