DNA Polymorphisms: DNA-Based Molecular Markers and Their Application in Medicine

DNA polymorphisms are differences in DNA sequence among individuals, groups, or populations. Polymorphism at the DNA level spans a wide range of variation, from single base pair changes to changes involving many base pairs and repeated sequences. Genomic variability can be present in many forms, including single nucleotide polymorphisms (SNPs), variable number of tandem repeats (VNTRs, e.g., mini- and microsatellites), transposable elements (e.g., Alu repeats), structural alterations, and copy number variations. Different forms of DNA polymorphisms can be tracked using a variety of techniques, including restriction fragment length polymorphisms (RFLPs) with Southern blots, polymerase chain reactions (PCRs), hybridization techniques using DNA microarray chips, and genome sequencing. In recent years, advances in molecular technologies have revealed many new DNA polymorphisms, and discoveries continue at a rapid rate. Mapping the human genome requires a set of genetic markers. A DNA polymorphism serves as a genetic marker for its own location in the chromosome; such markers are therefore convenient for analysis and are often used in molecular genetic studies.


Introduction
Genetic polymorphism is the existence of at least two variants of a gene sequence, chromosome structure, or phenotype. By convention, gene sequence and chromosomal variants are called polymorphisms when they occur at a frequency of 1% or higher, which distinguishes them from rare variants [1].
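The 1% criterion can be illustrated with a short calculation. The following is a minimal sketch, assuming genotypes are given as pairs of allele labels; the function names and the sample counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as ("A", "G")."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphic(genotypes, threshold=0.01):
    """True if at least two alleles each reach the threshold (1% by default)."""
    freqs = allele_frequencies(genotypes)
    common = [allele for allele, f in freqs.items() if f >= threshold]
    return len(common) >= 2


# Hypothetical sample at one locus: 96 A/A, 3 A/G, and 1 G/G individuals.
sample = [("A", "A")] * 96 + [("A", "G")] * 3 + [("G", "G")]
print(allele_frequencies(sample))  # allele G is carried on 5 of 200 chromosomes
print(is_polymorphic(sample))
```

In this invented sample the G allele has a frequency of 2.5%, so the locus would count as polymorphic under the 1% convention.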
The human genome comprises 6 billion nucleotides of DNA packaged into two sets of 23 chromosomes, one set inherited from each parent. Because the human genome is relatively large, the probability of polymorphic DNA in humans is high. Genomic variability spans a wide range of variation, from single base pair changes to changes involving many base pairs and repeated sequences [2].
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans [3]. Because of their abundance across the human genome, SNPs have become important genetic markers for mapping human diseases, population genetics, and evolutionary studies. Their importance has grown as DNA sequencing technologies have become feasible and widely available, and advances continue at a rapid rate [4].
A major step forward in genome characterization was the discovery that about 30-90% of the genome consists of regions of repetitive DNA, which are highly polymorphic in nature [5]. Polymorphic tandem repeated sequences have emerged as important genetic markers; initially, variable number tandem repeats (VNTRs) were used in DNA fingerprinting. In recent years, evidence has accumulated for the involvement of VNTRs in a wide spectrum of pathological states [6].
For many years, scientists believed that genes were strictly present in two copies in a genome. However, recent advances in molecular technology have revealed substantial segments of DNA, ranging in size from thousands to millions of bases, that vary in copy number. Such copy number variations (CNVs) can encompass genes, and the newly discovered CNVs are important sources of genomic diversity [7,8].
The development and use of DNA-based molecular markers is one of the most significant developments in the field of molecular genetics and facilitates the study of genetic variation in health and disease [5].
This chapter reviews the DNA-based genetic markers and their application in medicine, with a particular emphasis on common DNA-based genetic markers, including single nucleotide polymorphisms and short tandem repeats (STRs).

Polymorphisms at DNA level
Genomic variability at the DNA level can be present in many forms, including single nucleotide polymorphisms, variable number of tandem repeats (e.g., mini- and microsatellites), transposable elements (e.g., Alu repeats), structural alterations, and copy number variations. It can occur in the nuclear or the mitochondrial genome. There are two major sources of variation: (1) mutations, which may arise by chance or be induced by external agents such as radiation, and (2) recombination. Once formed, a variant can be inherited, allowing it to be tracked from parent to child [3].
The human genome may be divided into different parts based on known functional properties: the coding regions and the noncoding regions, which mostly do not code for protein [2,9]. The coding regions contain DNA sequences that primarily determine the amino acid sequences of the proteins for which they code. Noncoding DNA generally contains sequences whose function has not yet been discovered or for which possibly no function exists [10]; such sequences may be single copy or exist as multiple copies, called repetitive DNA [10]. Indeed, regions of DNA that do not code for proteins tend to have more polymorphisms. Recently, there has been substantial progress in understanding genome content. Attention, once centered on protein-coding genes as the functional DNA sequences, has moved toward the discovery of many repeat families and of copy number variations that encompass genes, lead to dosage imbalance, and play an important role in genome structure, evolution, and diversity [11,12]. "The Human Genome Project has revealed that humans have only 20,000-30,000 structural genes (protein-coding genes) (International Human Genome Sequencing Consortium, 2004)" [13].

Single nucleotide polymorphisms
Single base changes are "high-density natural sequence variations in human genome" [14]. SNPs are mostly formed when errors (substitution, insertion, or deletion) occur. SNPs are a prominent source of variation in the human genome and serve as excellent genetic markers. Some regions of the genome are richer in SNPs than others. SNPs may occur within gene sequences or in intergenic sequences. Most SNPs are located in noncoding regions of the genome and have no known direct impact on the phenotype of an individual, although their role remains elusive; depending on where a SNP occurs, it may have different consequences at the phenotypic level [3].

Insertion/deletion polymorphisms
An insertion/deletion polymorphism (indel) is a type of DNA variation in which a specific nucleotide sequence, ranging in length from one to several hundred base pairs, is inserted or deleted. Indels are widely spread across the genome. Some authors classify single-base-pair insertions or deletions as SNPs and repeat insertions/deletions as indels.

Polymorphic repetitive sequences
DNA repeats can be classified as interspersed repeats or tandem repeats. Together they may comprise over two-thirds of the human genome [15]. Interspersed repeats are dispersed across the genome, within gene sequences or intergenic regions, and include retro(pseudo)genes and transposons. Tandem repeats, or variable number tandem repeats, consist of repeat units (≥2 bp in length) that lie adjacent to each other [16] and can involve as few as two copies or many thousands of copies. Centromeres and telomeres largely comprise tandem repeats. Despite increasing evidence for the functionality of DNA repeats, their biological role is still elusive and frequently debated [11]. Tandem repeats are organized in a head-to-tail orientation; based on the size of each repeat unit, satellite repeats can be further divided into macrosatellites, minisatellites, and microsatellites [17]. Macrosatellites, with repeat units longer than 100 bp, are the largest of the tandem DNA repeats and are located on one or multiple chromosomes [11]. Minisatellites are stretches of DNA characterized by repeat units of moderate length, 10-100 bp and usually less than 50 bp [9,18]. Microsatellites, also known as short tandem repeats (STRs), have repeat units of less than 10 bp [3].
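The size classes described above can be written as a simple rule. This is a minimal sketch that uses only the unit-length boundaries given in the text (less than 10 bp, 10-100 bp, and longer than 100 bp); the function name is hypothetical, and other authors may draw the boundaries differently.

```python
def classify_tandem_repeat(unit_length_bp):
    """Classify a tandem repeat by the length of its repeat unit in base pairs."""
    if unit_length_bp < 2:
        # The text defines tandem repeat units as at least 2 bp long.
        raise ValueError("repeat unit must be at least 2 bp")
    if unit_length_bp < 10:
        return "microsatellite"
    if unit_length_bp <= 100:
        return "minisatellite"
    return "macrosatellite"


for length in (3, 33, 3300):
    print(length, classify_tandem_repeat(length))
```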

Structural and copy number variations
Structural and copy number variations (CNVs) are another frequent source of genome variability [6,19,20]. The term CNV encompasses previously introduced terms such as large-scale copy number variants (LCVs) [19], copy number polymorphisms (CNPs) [20], and intermediate-sized variants (ISVs) [21]. Some currently used terms are: structural variation, a genomic alteration (e.g., an inversion) that involves segments of DNA >1 kb; copy number polymorphism, a duplication or deletion event involving >1 kb of DNA [22]; and intermediate-sized structural variant, a structural variant that is ∼8-40 kb in size, which can refer to a CNV or a balanced structural rearrangement (e.g., an inversion) [21].

Common DNA-based molecular markers
The development and use of molecular methods for the detection of DNA molecular markers is one of the most significant advances in the field of molecular genetics. Mapping the human genome requires a set of genetic markers to which the position of genes can be related. Some of these markers are genes; others are SNPs and VNTRs. Molecular markers can be used to mark positions in genomes for various purposes, such as mapping human diseases, pharmacogenetics, and human identification.

Single nucleotide polymorphisms
A single base pair change gives rise to a single nucleotide variant, and such variants probably account for many genetic conditions caused by a single gene or by multiple genes. SNPs represent the major source of human genomic variability. Because the exact number is unknown, it is difficult to give a direct estimate of the number of SNPs in the human genome, but more than 5 million have been recorded in public and private databases and about 4 million have been validated [23]. "The data from the Human Genome project revealed that human nucleotide sequence differs every 1000-1500 bases from one individual to another" [24]. "The SNP Map working group observed that two haploid genomes differ at 1 nucleotide per 1331 bp". Over 60,000 SNPs, however, lie within genes, and some of them are associated with diseases [2].
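The density figures quoted above translate into genome-wide counts by simple division. The sketch below assumes a haploid genome of 3 billion bases, that is, half of the 6 billion nucleotides cited in the Introduction; the constant and function names are hypothetical.

```python
# Assumption: one haploid set is half of the 6 billion nucleotides cited earlier.
HAPLOID_GENOME_BP = 3_000_000_000
# Figure quoted in the text from the SNP Map working group.
BP_PER_DIFFERENCE = 1331


def expected_differences(genome_bp=HAPLOID_GENOME_BP, bp_per_difference=BP_PER_DIFFERENCE):
    """Expected number of single-base differences between two haploid genomes."""
    return genome_bp / bp_per_difference


print(f"{expected_differences():,.0f} differences at 1 per 1331 bp")
print(f"{expected_differences(bp_per_difference=1500):,.0f} to "
      f"{expected_differences(bp_per_difference=1000):,.0f} at 1 per 1000-1500 bp")
```

Under these assumptions, two haploid genomes would differ at roughly 2 to 3 million positions, which is the same order of magnitude as the number of SNPs recorded in databases.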
Single nucleotide polymorphisms within protein-coding regions are either synonymous or nonsynonymous. Synonymous polymorphisms cause no amino acid change in the protein produced (silent mutations); they are said to be selectively silent because they generally have no effect on the organism. Nonsynonymous substitutions change the encoded amino acid sequence: a missense mutation alters a codon so that a different amino acid is incorporated, whereas a nonsense mutation creates a chain termination codon [3].
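The distinction between synonymous, missense, and nonsense changes can be made concrete with the standard genetic code. The following is a minimal sketch, assuming the reference and variant codons are given on the coding strand; the function name is hypothetical, and the "stop-loss" label covers a case the text does not discuss.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a single-base substitution within a codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    differences = sum(1 for r, a in zip(ref_codon, alt_codon) if r != a)
    if len(ref_codon) != 3 or len(alt_codon) != 3 or differences != 1:
        raise ValueError("codons must be 3 bases long and differ at exactly one base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"  # not covered in the text; included for completeness
    return "missense"


print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_snp("GAG", "GTG"))  # Glu -> Val, the sickle cell substitution
print(classify_coding_snp("TAC", "TAA"))  # Tyr -> stop
```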
Single nucleotide polymorphisms within a coding sequence can cause genetic diseases, including sickle cell anemia. SNPs responsible for a disease can also occur in any genetic region that affects the expression of genes, for example, in promoter regions. The effect of SNPs in noncoding regions is still debated. Much of the noncoding genome consists of regulatory elements that control gene expression, but these regions have remained largely unexplored in clinical diagnostics because of the high cost of whole genome sequencing and the challenges of interpretation. Clinical diagnostic sequencing currently focuses on identifying causal mutations in the exome, where most known disease-causing mutations occur.
Another important group of SNPs is the one that alters the primary structure of a protein involved in drug metabolism; these SNPs are targets for pharmacogenetics studies.
However, not all disease-associated SNPs are causative. Some SNPs are in close association with, and therefore segregate with, a disease-causing sequence, so the presence of the SNP correlates with the presence of, or an increased risk of developing, the disease; these SNPs are useful in diagnostics, disease prediction, and other applications [3].
Single nucleotide polymorphisms can be used as genetic markers for constructing high-density genetic maps and for carrying out disease association studies because of their abundance and the availability of high-throughput analysis technologies. SNPs have become important tools in the development and research of genetic markers [14].
Numerous strategies can be used to discover new single nucleotide variants (SNVs). The most common and well-known method is direct sequencing followed by comparison with a public or other sequence database [25,26], or locus-specific amplification of a target genomic region followed by sequence comparison [27,28]; prescreening prior to sequence determination is needed. SNV detection encompasses two broad areas: (1) scanning DNA sequences for previously unknown polymorphisms and (2) screening (genotyping) individuals for known polymorphisms. Scanning for new SNVs can be further classified into two types of approach: the global (or random) approach and the regional (or targeted) approach [14]. Certain methods have been developed for discovering SNVs randomly in the genome, "such as representation shotgun sequencing [14,29], primer-ligation-mediated PCR [14,30] and degenerate oligonucleotide-primed PCR" [14,31].
Haplotypes are groups of SNPs that are generally inherited together. Haplotypes can have stronger correlations with diseases or other phenotypic effects compared with individual SNPs and may therefore provide increased diagnostic accuracy in some cases [32].

Microsatellites (short tandem repeats)
Microsatellites, or short tandem repeats (STRs), have repeat units (motifs) of less than 10 bp. Because of their high variability, microsatellite loci are often used in forensics, population genetics, and genetic genealogy. Significant associations have been demonstrated between microsatellite variants and many diseases [15].
Depending on the search algorithm, there are approximately 700,000-1,000,000 microsatellite loci with repeat units 2-6 bp long in the human reference genome [33,34]. Di- and tetranucleotide repeats constitute about 75% of microsatellites, with the remaining loci containing tri-, penta-, and hexanucleotide repeats. Within genes, STRs are nonrandomly distributed across protein-coding sequences, untranslated regions (UTRs), and introns. STRs containing dinucleotide repeat units are much more abundant in regulatory regions or UTRs than in other genomic regions. In the coding regions of genes, repeats mostly have either trimeric or hexameric repeat units, likely as a result of selection against frameshift mutations [34,35]. "The mutation rates of STRs often lie between 10⁻³ and 10⁻⁶ per cell generation which is 10- to 10⁵-fold higher than the average mutation rates observed in nonrepeated regions of the genome" [36,37].
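Because the locus count depends on the search algorithm, it may help to see how simple such a search can be. The following is a minimal sketch of a regular-expression scan for perfect repeats with 2-6 bp units; the function name, the minimum copy number of four, and the test sequence are assumptions for illustration, and published microsatellite finders use more elaborate criteria (for example, tolerance for interrupted repeats).

```python
import re


def find_microsatellites(sequence, min_unit=2, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats in a DNA string."""
    # Lazy unit match so the shortest repeating unit is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    loci = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        # Skip units that are themselves repeats of a shorter motif (e.g. "AA").
        if unit in (unit + unit)[1:-1]:
            continue
        copies = len(match.group(0)) // len(unit)
        loci.append((match.start(), unit, copies))
    return loci


# Hypothetical sequence with a (CA)6 dinucleotide and a (CAG)5 trinucleotide repeat.
example = "TTG" + "CA" * 6 + "GGT" + "CAG" * 5 + "TT"
print(find_microsatellites(example))
```

Changing `min_copies` or allowing mismatches changes how many loci are reported, which is one reason estimates for the human genome span a range.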
"Polymorphism of tandem repeats within protein-coding regions reveals that tandem repeat variation is an important source of variation in many proteins, many of this variation is of significant impact on protein function. Tandem repeats has been associated with a number of diseases and phenotypic conditions, changes in the protein products of genes, leading to diseases, other tandem repeat polymorphisms in noncoding regions are known to modify function through their impact on gene regulation". "These polymorphisms can arise from events such as unequal crossover, replication slippage or double-strand break repair" [38].
Variations in the STR length play a significant role in modulating gene expression and STRs are likely to be general regulatory elements; regulatory STRs manifest significant polymorphism because of their high intrinsic mutation rate [15].
There are examples of distinctive phenotypic changes and diseases that are directly associated with increases or decreases in microsatellite repeat arrays. In the Huntington disease gene, for example, the disease-causing mutation is a trinucleotide expansion of CAG repeats from the normal range of 11-14 copies to an abnormal range of at least 38 copies. The extra CAG repeats encode additional glutamine residues in the protein [9]. More than 40 neurological diseases in humans, such as the spinocerebellar ataxias with polyglutamine tracts, are caused by motif length changes in trinucleotide arrays [39].
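Repeat expansions of this kind are detected by counting repeat copies. The sketch below counts the longest uninterrupted CAG run in a sequence and compares it with the threshold of 38 copies given in the text; the function names and example sequences are hypothetical, and clinical laboratories apply their own validated cutoffs and sizing methods.

```python
import re


def longest_cag_run(sequence):
    """Return the number of copies in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def is_expanded(sequence, threshold=38):
    """True if the longest CAG run reaches the threshold (38 copies, per the text)."""
    return longest_cag_run(sequence) >= threshold


# Hypothetical alleles: one with 12 CAG copies, one with 40.
short_allele = "AT" + "CAG" * 12 + "CAA" + "CAG" * 3 + "GG"
long_allele = "AT" + "CAG" * 40 + "GG"
print(longest_cag_run(short_allele), is_expanded(short_allele))
print(longest_cag_run(long_allele), is_expanded(long_allele))
```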
Testing candidate genes for polymorphisms in exons, promoters, splice sites, or other regulatory regions is usually done by SNP testing, because SNPs are the most common polymorphisms and are more likely to be responsible for phenotypic variation. For complex phenotypic traits and candidate loci, single-locus SNP analyses provide less information than multi-allelic microsatellites because SNP markers are bi-allelic. However, haplotype frequency analysis may improve the accuracy [40]. Recently, polymorphic tandem repeated sequences and copy number variations have emerged as important sources of genomic diversity that facilitate the study of genetic variation in health and disease.

The major techniques for DNA-based molecular marker detection
Different forms of DNA-based molecular markers can be tracked using a variety of techniques, including RFLPs with Southern blots and polymerase chain reactions (PCRs). Recently, there have been great advances in methods for detecting DNA polymorphisms, including real-time PCR, hybridization techniques using DNA microarray chips, and genome sequencing. Each technique has its own advantages and disadvantages.

Restriction fragment length polymorphism with Southern blot
Restriction endonucleases cut DNA at a specific sequence pattern known as a restriction endonuclease recognition site. RFLPs can arise from a number of genetic events, including point mutations in restriction sites, mutations that create a new restriction site, insertions, deletions, and repeated sequences. As a result, the fragments produced from different alleles differ in length and can be distinguished by gel electrophoresis. The first polymorphic RFLP was described in 1980. RFLPs were the original DNA targets used for human identification, parentage testing, and gene mapping.
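How a point mutation in a restriction site changes the fragment pattern can be simulated in a few lines. This is a minimal sketch, assuming a linear DNA molecule and the EcoRI recognition site GAATTC, which is cut after the first G; the function name and both allele sequences are hypothetical.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site."""
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Hypothetical alleles: allele B carries a point mutation (GAATTC -> GAGTTC)
# that destroys the second recognition site.
allele_a = "A" * 30 + "GAATTC" + "C" * 50 + "GAATTC" + "T" * 20
allele_b = "A" * 30 + "GAATTC" + "C" * 50 + "GAGTTC" + "T" * 20
print(digest(allele_a))  # three fragments
print(digest(allele_b))  # two fragments, one of them longer
```

On a gel, the two alleles would therefore show different band patterns even though they differ at a single base.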
The method of hybridizing DNA with probes is called Southern blotting, after its inventor, Southern [41]. RFLP analysis requires relatively large amounts of DNA. Hence, it cannot be performed on samples degraded by environmental factors, and it also takes longer to obtain results [42,43]. PCR-RFLP is now used instead, which avoids Southern blotting.

Polymerase chain reaction
PCR is the in vitro amplification of particular DNA sequences with the help of specifically chosen primers and a DNA polymerase enzyme. The amplified fragments are separated electrophoretically and detected by different staining methods. Real-time PCR, a useful modification of PCR, can detect polymorphisms using various chemistries, for example, the TaqMan assay or molecular beacons.
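The role of the primers in defining the amplified fragment can be shown with an in-silico version of the reaction. The following is a minimal sketch, assuming exact primer matches on a single-stranded template written 5' to 3'; the function names and sequences are hypothetical, and real primer design must also consider melting temperature, specificity, and mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon(template, forward_primer, reverse_primer):
    """Return the region bounded by the two primers, or None if either fails to bind."""
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Hypothetical template and primers.
template = "TTTT" + "ACGTAC" + "GGGCCCAAAT" + "TTGACC" + "AAAA"
product = amplicon(template, "ACGTAC", "GGTCAA")
print(product, len(product))
```

If the region between the primers contains an STR, alleles with different repeat numbers give amplicons of different lengths, which is what electrophoretic separation resolves.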

Genomic array technology
Genomic array technology is a type of hybridization analysis that allows the simultaneous study of large numbers of targets or samples. In 1987, the macroarray evolved into the microarray, in which tens of thousands of targets can be screened simultaneously in a very small area. Automated depositing systems (arrayers) can place thousands of spots on a glass substrate the size of a microscope slide (chip). By spotting representative sequences of each gene in triplicate, the entire human genome can be screened simultaneously on a single chip. This technique facilitates the identification of specific homozygous and heterozygous alleles by comparing the hybridization of the target DNA with each redundant probe. Microarrays are also used to characterize genetic diversity and drug responses, to identify new drug targets, and to assess the toxicological properties of chemicals and pharmaceuticals [44].
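The idea of calling homozygous and heterozygous genotypes from differences in hybridization can be sketched with a toy rule. The example assumes one probe signal per allele (labeled A and B) and an arbitrary cutoff of 0.8 on the allele A fraction; the function name, the cutoff, and the intensities are hypothetical, and commercial genotyping platforms use calibrated clustering algorithms rather than a fixed threshold.

```python
def call_genotype(intensity_a, intensity_b, hom_cutoff=0.8):
    """Call AA, AB, or BB from two allele-specific probe intensities."""
    total = intensity_a + intensity_b
    if total <= 0:
        return "no call"
    fraction_a = intensity_a / total
    if fraction_a >= hom_cutoff:
        return "AA"  # signal almost entirely from the allele A probe
    if fraction_a <= 1 - hom_cutoff:
        return "BB"  # signal almost entirely from the allele B probe
    return "AB"      # comparable signal from both probes


# Hypothetical probe intensities for three samples.
for a, b in [(950, 50), (480, 520), (30, 900)]:
    print(a, b, call_genotype(a, b))
```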

Sequencing
Technologies for rapid DNA sequencing have become available and are now widely used. The detection of single nucleotide variants (SNVs) by direct sequencing has progressed greatly, but intermediate-sized (from 50 bp to 50 kb) structural variants (SVs) remain a challenge. Such variants are too small to detect with cytogenetic methods but too large to reliably discover with short-read DNA sequencing. Recent high-quality genome assemblies using long-read sequencing have revealed that each human genome has approximately 20,000 structural variants, spanning 10 million base pairs, more than twice the number of bases affected by SNVs. New long-read sequencing approaches are needed to meet this challenge, as short-read sequencing technologies detect only 20% of the SVs present in the human genome [45-48].

The major applications for DNA-based genetic markers
DNA-based molecular markers are powerful tools for mapping human diseases and for studying many multifactorial diseases and disorders.

Mapping human diseases and risk prediction
Genetic mapping and linkage: The mapping of the human genome has made it possible to develop a haplotype map in order to better define human SNV variability. The haplotype map, or HapMap, is a tool for the detection of human genetic variation that can affect health and disease [23]. The HapMap is especially useful because it reduces the number of SNPs required to examine the entire genome for association with a phenotype or disease, from the 10 million SNPs that are expected to exist to approximately 500,000 tag SNPs [38]. The first large-scale effort to produce a human genetic map was performed mainly using RFLPs; several other projects are underway to identify more markers in humans and to make these data publicly available to scientists worldwide. Groups involved in these massive efforts to build DNA polymorphism discovery resources include the SNP Consortium (TSC) http://snp.cshl.org [49,50]. The reason for the current enormous interest in SNPs is the hope that they could be used as markers to identify genes that predispose individuals to common, multifactorial disorders by using linkage disequilibrium (LD) mapping.
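Linkage disequilibrium between two SNPs is commonly summarized by the coefficient D and the squared correlation r², which also explains why a tag SNP can stand in for its neighbors. The following is a minimal sketch, assuming phased haplotypes are available as pairs of alleles; the function name and the haplotype counts are hypothetical.

```python
def linkage_disequilibrium(haplotypes, allele1, allele2):
    """Return (D, r squared) for allele1 at SNP 1 and allele2 at SNP 2."""
    n = len(haplotypes)
    p1 = sum(1 for a, _ in haplotypes if a == allele1) / n
    p2 = sum(1 for _, b in haplotypes if b == allele2) / n
    p12 = sum(1 for a, b in haplotypes if a == allele1 and b == allele2) / n
    d = p12 - p1 * p2  # deviation from independent assortment
    denominator = p1 * (1 - p1) * p2 * (1 - p2)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# Hypothetical phased haplotypes at two biallelic SNPs (A/C and G/T).
haplotypes = (
    [("A", "G")] * 40 + [("A", "T")] * 10 + [("C", "G")] * 10 + [("C", "T")] * 40
)
d, r_squared = linkage_disequilibrium(haplotypes, "A", "G")
print(f"D = {d:.2f}, r^2 = {r_squared:.2f}")
```

When r² approaches 1, genotyping one of the two SNPs is nearly as informative as genotyping both, which is the rationale for reducing a genome-wide panel to a set of tag SNPs.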
"The HapMap Project (http://hapmap.ncbi.nlm.nih.gov/), and other approaches, such as genome wide association studies, have been widely reported for complex polygenic diseases, with some interesting novel genes affecting disease susceptibility now identified. Genome Wide Association; the GWAS has now been used for a large range of traits and diseases e.g. baldness and eye color" [51,52].

Quantitative trait loci mapping, candidate genes, and complex traits
The identification of genes affecting complex traits is a very difficult task. For many complex traits, the observable variation is quantitative, and loci affecting such traits are generally termed quantitative trait loci (QTL). SNVs can be used as genetic markers for constructing high-density genetic maps and for carrying out association studies related to complex traits and diseases [14].

Pharmacogenetics
Individual response to a drug is governed by many factors such as genetics, age, sex, environment, and disease. It is well established that genetic factors influence the response to a drug.
Polymorphic STRs, together with SNPs and CNVs, can explain variability in response to pharmacotherapy because of their prevalence in the human genome and their functional role as regulators of gene expression. Pharmacogenetics is the study of the influence of genetic factors on drug response and metabolism. When applied, pharmacogenetics can be used to avoid adverse drug reactions, predict toxicity and therapeutic failure, refine therapeutic efficiency, and improve clinical outcomes [53].

DNA fingerprinting and human identification
Establishing an individual's identity is one of the uses of DNA sequence information that highlights the uniqueness of a particular sample [5]. DNA fingerprinting, also known as genetic fingerprinting, DNA typing, or DNA profiling, comprises molecular genetic methods that enable the identification of individuals from hair, blood, semen, or other biological samples, based on unique patterns in their DNA. This uniqueness is the basis of human identification at the DNA level, including forensic identification, determination of genetic variation, and determination of family relationships; one important application is identifying good genetic matches for organ or marrow donation. When the technique was first described in 1984 by the British scientist Alec Jeffreys, it was based on minisatellites; these sequences are unique to each individual, with the exception of identical twins. Different DNA fingerprinting methods exist, using restriction fragment length polymorphism (RFLP), PCR, or both. More than 200 RFLP loci have been described in human DNA. Initially, forensic medicine used minisatellite testing; however, this method requires a large amount of material and yields low-quality results, especially when only a small amount of material is available. Nowadays, DNA in most forensic samples is analyzed by microsatellite analysis. The most useful microsatellites for human identification are those with a greater number of alleles, smaller size, a higher frequency of heterozygotes (higher than 90%), and a low frequency of mutations [43]. Microsatellite DNA markers have been the most widely used because they are easily typed by simple PCR followed by denaturing gel electrophoresis [40]. Each person has some STR alleles inherited from the father and some from the mother, which is useful in paternity testing; however, no person has a set of STRs identical to that of either parent.
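The inheritance rule used in paternity testing, one allele per locus from each parent, can be checked mechanically. The following is a minimal sketch, assuming STR profiles are stored as repeat counts per locus; the locus labels, profiles, and function names are hypothetical, and real casework also allows for mutation and reports likelihood ratios rather than a simple yes or no.

```python
def locus_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other from the father."""
    first, second = child
    return (first in mother and second in father) or (second in mother and first in father)


def inconsistent_loci(child, mother, father):
    """Return the loci at which the alleged father cannot have contributed an allele."""
    return [
        locus
        for locus in child
        if not locus_consistent(child[locus], mother[locus], father[locus])
    ]


# Hypothetical profiles: repeat counts at three STR loci.
mother = {"STR_1": (12, 14), "STR_2": (7, 9), "STR_3": (16, 16)}
child = {"STR_1": (12, 15), "STR_2": (9, 6), "STR_3": (16, 18)}
alleged_father_1 = {"STR_1": (15, 15), "STR_2": (6, 8), "STR_3": (18, 20)}
alleged_father_2 = {"STR_1": (13, 15), "STR_2": (7, 8), "STR_3": (17, 18)}

print(inconsistent_loci(child, mother, alleged_father_1))  # no exclusions
print(inconsistent_loci(child, mother, alleged_father_2))  # excluded at one locus
```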
The uniqueness of an individual's STR profile provides the scientific marker of identity and hence is helpful in forensic identification [54]. Two types of DNA are used in forensic science: genomic and mitochondrial. Genomic DNA is found in the cell nucleus and is the DNA source for most forensic applications. Mitochondrial DNA (mtDNA) is another source of material; biological samples such as hair, bones, and teeth that lack nucleated cellular material can be analyzed with mtDNA [43,55].

Sex-chromosome STR testing
"Majority of the length of the human Y chromosome is inherited as a single block in linkage from father to male offspring as a haploid entity. DNA genetic markers on the human Y chromosome are valuable tools for understanding human evolution, migration and for tracing relationships among males" [43,56]. "Chromosome X specific STRs is used in the identification and the genomic studies of different ethnic groups worldwide, because the small size of X-chromosome STR alleles; about 100-350 nucleotides, it is relatively easy to be amplified and detected with high sensitivity" [43].

DNA typing and engraftment monitoring
DNA typing has become the method of choice for engraftment monitoring, in which donor cells are examined by following donor polymorphisms in the recipient's blood and bone marrow. Although RFLP can efficiently differentiate donor and recipient cells, its detection requires Southern blot methods, which are too labor intensive and have limited sensitivity for this application. Small minisatellites or microsatellites are preferred because they are easily detected by PCR amplification, which is faster and achieves a sensitivity of 0.5-1%. Sensitivity can be raised to 0.01% using Y-STRs, but this approach is limited to transplants between sex-mismatched donor-recipient pairs, preferably from a female donor to a male recipient [2].
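Following donor polymorphisms in a recipient sample amounts to comparing the signal from alleles found only in the donor with the signal from alleles found only in the recipient. The sketch below assumes one STR locus at which donor and recipient share no alleles and uses peak areas as the signal; the function names and all values are hypothetical, and laboratories use validated formulas that also handle shared alleles and stutter peaks.

```python
def informative_alleles(donor, recipient):
    """Return (donor-only alleles, recipient-only alleles) at one STR locus."""
    donor_only = sorted(set(donor) - set(recipient))
    recipient_only = sorted(set(recipient) - set(donor))
    return donor_only, recipient_only


def percent_donor(peaks, donor_only, recipient_only):
    """Estimate the donor fraction (%) from peak areas of informative alleles."""
    donor_signal = sum(peaks.get(allele, 0.0) for allele in donor_only)
    recipient_signal = sum(peaks.get(allele, 0.0) for allele in recipient_only)
    total = donor_signal + recipient_signal
    if total == 0:
        raise ValueError("no informative signal at this locus")
    return 100.0 * donor_signal / total


# Hypothetical pre-transplant genotypes and post-transplant peak areas.
donor_only, recipient_only = informative_alleles(donor=(10, 12), recipient=(11, 14))
post_transplant_peaks = {10: 4700, 12: 4800, 11: 260, 14: 240}
print(f"{percent_donor(post_transplant_peaks, donor_only, recipient_only):.1f}% donor")
```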
Nowadays, DNA fingerprinting is used as a tool for designing "personalized" medical treatments for cancer patients.

Conclusion and future perspectives
Single nucleotide polymorphisms (SNPs) have become important tools in the study of genetic diseases and other phenotypic traits. Haplotypes are groups of SNPs that are generally inherited together. Haplotypes can have stronger correlations with diseases or other phenotypic effects compared with individual SNPs and may therefore provide increased diagnostic accuracy in some cases.
Polymorphic tandem repeated sequences have emerged as important genetic markers; initially, variable number tandem repeats (VNTRs) were used in DNA fingerprinting, and in recent years evidence has accumulated for the involvement of VNTRs in a wide spectrum of pathological states.
The new global CNV map will transform medical research in four main areas: detecting genes underlying common diseases; studying familial genetic conditions; excluding variation found in unaffected individuals, which helps researchers to target the region that might be involved; and contributing to a more accurate and complete human genome reference sequence used by all biomedical scientists. Currently, approximately 2000 CNVs have been described, and there could be thousands more in the human population. About 100 CNVs were detected in each genome tested, with an average size of 250,000 bases (an average gene is 60,000 bases). More CNVs will be discovered as molecular technologies advance and as more DNA samples from worldwide populations are examined.
Recently, there has been substantial progress in understanding genome content. Attention, once centered on protein-coding genes as the functional DNA sequences, has moved toward the many repeat families and copy number variations that play an important role in genome structure, evolution, and diversity. Additional efforts are being made to develop strategies that overcome the obstacles in aligning next-generation sequencing data. "Future precision medicine efforts will direct to connect genotypes to phenotypes and distinguish common, from rare or potentially disease linked variants. New long-read sequencing approaches are needed to meet this challenge." Other important applications of knowledge of genetic polymorphisms are improving health care through gene therapy, the discovery of new drugs and drug targets, and the upgrading of discovery processes with advanced technologies.
Advances in molecular technologies, DNA sequencing technology, and microarrays, coupled with novel, efficient computational analysis tools, have made it possible to analyze sequence-based experimental data and have driven further discoveries and developments at a rapid rate.