Table 1. Classification of mitochondrial genome types based on RFLPs using coxI, coxII and atp6 as probes. Sizes of hybridization signals (kb) are shown.
Soybean is the most important crop source of protein and oil for animal nutrition and human consumption, and it is also the most widely planted genetically modified crop. Plant breeders continue to release improved cultivars with enhanced yield, disease resistance, and quality traits. However, the narrow genetic base of current soybean cultivars may lack sufficient allelic diversity to counteract vulnerability to shifts in environmental variables. An investigation of genetic relatedness at a broad level may provide important information about the historical relationships among different genotypes. Such studies are made possible by markers based on variation in organelle DNA (mtDNA or cpDNA).
2. Mitochondrial genome
2.1. Genomes as markers
Typically, any sufficiently variable DNA region can be used in genetic studies of populations and in interspecific studies. Because chloroplasts and mitochondria are mainly inherited uniparentally in seed plants, organelle genomes are often used, as they carry more information than nuclear markers, which are inherited biparentally. The main benefit is that there is only one allele per cell and per organism and, consequently, no recombination between two alleles can occur. Because pollen and seeds disperse over different distances, genomes inherited biparentally, maternally and paternally also reveal significant differences in their genetic variability among populations. In particular, maternally inherited markers resolve population-level diversity much better.
In gymnosperms the situation is somewhat different. Here, chloroplasts are inherited mainly paternally and are therefore transmitted through both pollen and seeds, whereas mitochondria are largely inherited maternally and are therefore transmitted only by seeds. Since pollen is dispersed over far greater distances than seeds, mitochondrial markers show greater population diversity than chloroplast markers and therefore serve as important tools in genetic studies of gymnosperms. Mitochondrial markers are also sometimes used in conjunction with cpDNA markers.
Mitochondrial regions used in interspecific studies of plants, mainly gymnosperms, include, for example, introns of the NADH dehydrogenase gene nad1 [4, 5, 6], nad7 intron 1, nad5 intron 4 and an internal transcribed spacer (ITS) of mitochondrial ribosomal DNA [8, 9].
In addition to the aforementioned organelle markers, microsatellite markers [10, 11], also known as simple sequence repeats (SSRs), are often used in population biology, and sometimes also in phylogeographic studies. Microsatellites are much less common in plants than in animals. However, they are present in both the nuclear and the organelle genomes. Microsatellites may reveal high variability, which is useful in genetic studies of populations where other sequences or methods such as fingerprinting do not detect enough mutations [9, 10, 13]. Because they are inherited uniparentally, organelle markers are particularly well suited to phylogeographic analyses. Since they are haploid, the effective population size for these markers should be smaller than that for nuclear markers [1, 14]. Smaller effective population sizes should bring about faster turnover of newly evolving genotypes, resulting in a clearer picture of past migration history than that obtained using nuclear markers [15-17].
Initially, mitochondrial markers were used mainly in phylogeographic studies of animal species. These studies have provided some interesting data on the origins and evolutionary history of human populations. In contrast to studies of animals, the use of mitochondrial markers in studies of plants, especially angiosperms, has been limited. Presently, cpDNA markers are most commonly used in phylogeographic studies of angiosperms, whereas mitochondrial markers are prevalent in studies of gymnosperms.
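The effect of haploidy and uniparental inheritance on effective population size can be illustrated with a small calculation. The sketch below is illustrative only: it assumes strictly maternal transmission, no heteroplasmy and an idealized population, and the function names are hypothetical. It compares the number of gene copies passed on for an organelle marker with that for a diploid nuclear locus.

```python
def nuclear_effective_size(n_females: float, n_males: float) -> float:
    """Wright's effective size for a population with separate sexes."""
    return 4 * n_females * n_males / (n_females + n_males)


def organelle_to_nuclear_ratio(n_females: float, n_males: float) -> float:
    """Ratio of effective gene copies: maternal organelle vs diploid nuclear locus.

    Assumption: only seed parents transmit the organelle, and each
    transmits a single (haploid) genome type.
    """
    nuclear_copies = 2 * nuclear_effective_size(n_females, n_males)
    organelle_copies = n_females
    return organelle_copies / nuclear_copies


def hermaphrodite_ratio(n_plants: float) -> float:
    """Same ratio when every plant is a seed parent: N copies versus 2N."""
    return n_plants / (2 * n_plants)


if __name__ == "__main__":
    print(organelle_to_nuclear_ratio(50, 50))  # separate sexes, equal sex ratio
    print(hermaphrodite_ratio(100))            # hermaphroditic plants
```

Under these assumptions the organelle marker has one quarter (separate sexes, equal sex ratio) or one half (hermaphrodites) of the nuclear gene copies, which is why drift is expected to sort organelle haplotypes faster.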
2.2. Plant mitochondrial DNA
Mitochondrial genomes of higher plants (208-2000 kbp) are much larger than those of vertebrates (16-17 kbp) or fungi (25-80 kbp) [21, 22]. In addition, there are clear differences in the size and organization of mitochondrial genomes between plant species. Intramolecular recombination in mitochondria leads to complex reorganizations of the genome and, in consequence, to alternative arrangements of genes, even within individual plants; duplications and deletions are common. In addition, the nucleotide substitution rate in plant mitochondria is rather low, so that only minor differences are found at given loci between individuals or even species. The extensively characterized circular mitochondrial genomes of animals are highly conserved within a given species; they do not contain introns and have very few intergenic sequences. Plant mitochondrial DNA (mtDNA) contains introns in multiple genes and several additional expressed genes compared with animal mitochondria, but most of the additional sequences in plants are not expressed and do not seem to be essential. Completely sequenced mitochondrial genomes are available for several plants, including Arabidopsis thaliana and the liverwort Marchantia polymorpha.
Restriction maps of nearly all plant mitochondrial genomes predict a master circle together with circular subgenomic molecules that arise by recombination between large direct repeats (> 1 kbp) [21, 29-36], which are present in most mitochondrial genomes of higher plants. However, such molecules of the predicted sizes are very rare or very difficult to observe. This can be explained if plant mitochondrial genomes are circularly permuted, as in phage T4 [37, 38]. Oldenburg and Bendich reported that the mostly linear molecules of Marchantia mtDNA are circularly permuted with random ends. This suggests that plant mtDNA replicates by a recombination-dependent mechanism similar to that of phage T4.
Many reports in recent years indicate that the mitochondrial genomes of yeasts and of higher plants exist mainly as linear and branched DNA molecules of variable size, much smaller than the predicted size of the genome [39-44]. Pulsed-field gel electrophoresis (PFGE) of in-gel lysed mitochondria from different species revealed that only about 6-12% of the molecules are circular [41, 44]. The observed branched molecules closely resemble those seen in yeast at intermediate stages of mtDNA recombination and in phage T4 DNA replication [37, 38].
In all but one known case (Brassica hirta), plant mitochondrial genomes contain recombination repeats. These sequences, ranging in length from several hundred to several thousand nucleotides (nt), exist at two different loci in the master circle, yet in four mtDNA sequence configurations. These four configurations correspond to the reciprocal exchange of the sequences flanking the repeat on its 5' and 3' sides in the master circle, which suggests that the repeat mediates homologous recombination. Depending on the number and orientation of the repeats, the master circle gives rise to a more or less complex set of subgenomic molecules.
Maternally inherited mutations associated with mitochondria in higher plants most often arise by intra- and intergenic recombination. This is the case in most examples of cytoplasmic male sterility (cms) [41, 49-51], in the chm-induced mutation in Arabidopsis and in non-chromosomal stripe mutations in maize. Recombination activity is therefore assumed to explain the complexity of the variation detected in the mitochondrial genomes of higher plants.
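How a pair of repeats generates subgenomic molecules can be sketched with simple coordinate arithmetic. The following is a minimal model, not a description of any particular genome: the function name and the example coordinates (a 400 kb master circle with repeat copies starting at 50 kb and 170 kb) are hypothetical. It assumes a single reciprocal crossover between two copies of one repeat.

```python
def recombination_products(master_kb, repeat_a_kb, repeat_b_kb, orientation):
    """Sizes (kb) of the molecules left after one crossover between two repeat copies.

    master_kb    -- circumference of the master circle
    repeat_a_kb  -- start coordinate of the first repeat copy
    repeat_b_kb  -- start coordinate of the second repeat copy
    orientation  -- "direct" or "inverted"
    """
    if not 0 <= repeat_a_kb < repeat_b_kb < master_kb:
        raise ValueError("expected 0 <= repeat_a_kb < repeat_b_kb < master_kb")
    arc = repeat_b_kb - repeat_a_kb
    if orientation == "direct":
        # Direct repeats: the circle resolves into two smaller circles,
        # each keeping one copy of the repeat.
        return sorted([arc, master_kb - arc])
    if orientation == "inverted":
        # Inverted repeats: the segment between the copies flips, giving
        # an isomer of unchanged size.
        return [master_kb]
    raise ValueError("orientation must be 'direct' or 'inverted'")


if __name__ == "__main__":
    print(recombination_products(400, 50, 170, "direct"))    # [120, 280]
    print(recombination_products(400, 50, 170, "inverted"))  # [400]
```

The two flanking sequences on each side of each repeat copy can be joined in four ways, which corresponds to the four sequence configurations mentioned above.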
2.3. Mitochondrial genome of soybean
Repeated sequences of 9, 23 and 299 bp have been characterized in soybean mitochondria [58, 59]. Numerous genome rearrangements have also been characterized among different soybean cultivars. It has been demonstrated that they occur through homologous recombination mediated by these repeated sequences [58, 60, 61], or through short elements that are part of the 4.9-kb PstI fragment of soybean mtDNA. The 299 bp repeat has been found in several copies in soybean mtDNA and in several other higher plants, suggesting that this repeat may represent a hot spot for mtDNA recombination in many plant species [59, 62]. Previous results suggested that active homologous recombination of mtDNA occurs in at least some plant species. In 2007, a mitochondrial-targeted homolog of the Escherichia coli recA gene was identified in A. thaliana. However, direct data on recombination activity in plant mitochondria remain scarce. The first data on such an activity in soybean were obtained in 2006. This finding is supported by an analysis of soybean mtDNA using electron microscopy and 2D electrophoresis. The results suggest that only a small portion of mtDNA molecules undergoes recombination at any given time. The question therefore remains whether this recombination is essential to the functioning of mitochondria and to plant growth.
The repeated sequences of the atp6, atp9 and coxII genes have also been characterized, but their recombination activity has not been analysed.
The first data for the restriction map of soybean mtDNA were obtained from the analysis of loci of the atp4 gene. In the vicinity of this gene, two repeated sequences that show characteristics of recombination repeats have been found [47, 48]. Active recombination repeats were also identified in circular molecules smaller than 400 kb [55, 66]. These observations suggest that soybean mtDNA has a multipartite structure similar to that of other plant mitochondrial genomes containing recombination repeats.
In the mitochondrial genome of cultivar Williams 82, recombinationally active repeats of 1 kb and 2 kb have been described. In a different repeat of 10 kb, which surrounds both the 1 kb and 2 kb repeats, two breakpoints have been identified. Recombination across these smaller and larger repeats probably accounts for the complex structure of the genome.
The analysis of restriction fragment length polymorphism (RFLP) of mtDNA seems to be a useful method in studying phylogenetic relationships within species.
Grabau et al. (1992) analyzed the genomes of 138 soybean cultivars. Hybridization with a 2.3 kb HindIII mtDNA probe from the soybean cultivar Williams 82 revealed restriction fragment length polymorphisms (RFLPs) that allowed many soybean cultivars to be divided into four cytoplasmic groups: Bedford, Arksoy, Lincoln and soja-forage.
Subsequent analyses showed variation within, and adjacent to, a 4.8 kb repeat. The Bedford cytoplasm turned out to be the only one that contains copies of the repeat in four different genomic environments, which indicates that the repeat is recombinationally active. The Lincoln and Arksoy cytoplasms contain two copies of the repeat and a unique fragment that appears to result from rare recombination events outside, but near, the repeat. In contrast, the soja-forage cytoplasm contains no complete repeat, only a unique truncated version of it. Sequence analysis revealed that the truncation was caused by recombination at a 9 bp repeat, CCCCTCCCC. The structural rearrangements that occurred in the region around the 4.8 kb repeat may provide a way to analyze relationships between species and evolution within the soybean subgenus.
In order to determine the sources of cytoplasmic variability, Hanlon and Grabau (1995) studied old soybean cultivars with the same 2.3-kb HindIII fragment and with an mtDNA fragment containing the atp6 gene. They showed that mtDNA RFLP analysis with these probes is useful for the classification of soybean mitochondrial genomes. Grabau and Davies (1992) made a general classification of wild soybean using the 2.3-kb HindIII fragment as a probe.
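Short repeats such as the 9 bp element CCCCTCCCC can be located in a sequence with a simple scan of both strands. The sketch below is illustrative: the helper names and the toy sequence are hypothetical and are not taken from soybean mtDNA; only the 9 bp motif comes from the text.

```python
MOTIF = "CCCCTCCCC"  # 9 bp repeat named in the text


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq: str, motif: str) -> list:
    """0-based start positions of every (possibly overlapping) occurrence."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(motif, pos + 1)
    return hits


def find_on_both_strands(seq: str, motif: str = MOTIF) -> dict:
    """Hits on the given strand and, via the reverse complement, on the other."""
    return {
        "forward": find_all(seq, motif),
        "reverse": find_all(seq, reverse_complement(motif)),
    }


if __name__ == "__main__":
    toy = "AAACCCCTCCCCTTTGGGGAGGGGAAA"  # hypothetical sequence
    print(find_on_both_strands(toy))
```

Two or more hits in the same orientation mark candidate sites at which recombination could delete or truncate the intervening sequence.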
|Ik||1.6||5.8||5.0; 5.4; 5.8|||
|IVf||3.5||8.1||2.4; 3.5; 5.0|||
|mt-d||5.0; 6.0; 12.0|||
|mt-f||2.4; 3.5; 5.0|||
|Ic||5.6||0.8; 2.5; 5.0||10.5||1.6||5.8||1.9||5.0||8.2; 12.0|||
|Id||5.6||0.8; 2.5; 5.0||10.5||1.6||5.8||1.9||5.0; 6.0; 12.0||2.8; 6.0; 12.0|||
|Ie||5.6||0.8; 2.5; 5.0||10.5||1.6||5.8||1.9||5.0; 12.0||2.8; 6.0; 12.0|||
|Ik||5.6||0.8; 2.5; 5.0||10.5||1.6||5.8||1.9||5.0; 5.4; 5.8||2.8; 6.0; 12.0|||
|IIg||8.5||0.8; 2.5; 5.0||9.0||1.3||7.0||4.8||1.0; 2.6||2.8; 3.0; 9.5|||
|IIIb||5.6||0.8; 2.5; 5.0||10.5||1.2||8.5||6.2; 6.5||2.9; 5.0||6.0; 8.2; 12.0|||
|IIId||5.6||0.8; 2.5; 5.0||10.5||1.2||8.5||6.2; 6.5||5.0; 6.0; 12.0||3.2; 6.2; 12.0|||
|IVa||5.6||0.8; 2.5; 5.0||10.5||3.5||8.1||5.0||2.4; 5.0||3.0; 6.0; 12.0|||
|IVb||5.6||0.8; 2.5; 5.0||10.5||3.5||5.8||5.0||2.9; 5.0||6.0; 8.2; 12.0|||
|IVc||5.6||0.8; 2.5; 5.0||10.5||3.5||5.8||5.0||5.0||8.2; 12.0|||
|IVf||5.6||0.8; 2.5; 5.0||10.5||3.5||5.8||5.0||2.4; 3.5; 5.0||3.2; 6.2; 12.0|||
|IVh||5.6||0.8; 2.5; 5.0||10.5||3.5||5.8||5.0||2.6; 2.9||3.2; 6.2; 12.0|||
|IVi||5.6||0.8; 2.5; 5.0||10.5||3.5||5.8||5.0||5.2; 12.0||3.2; 6.2; 12.0|||
|Va||5.6||0.8; 2.5; 5.0||10.5||5.8||5.8||12.0||2.4; 5.0||3.0; 6.0; 12.0|||
|Vb||5.6||0.8; 2.5; 5.0||10.5||5.8||5.8||12.0||2.9; 5.0||6.0; 8.2; 12.0|||
|Vc||5.6||0.8; 2.5; 5.0||10.5||5.8||5.8||12.0||5.0||8.2; 12.0|||
|V’j||5.6||0.8; 2.5; 5.0||10.5||5.8||15.0||1.6||5.0; 6.0||2.8; 6.0; 12.0|||
|VIg||5.6; 8.5||0.8; 2.5; 5.0; 5.2||9.0; 10.5||1.7||5.8||4.5||1.0; 2.6||2.8; 3.0; 4.3; 9.5; 12.0|||
|VIIg||8.5||0.8; 5.0; 5.2||9.0||8.5||15.0||1.6||1.0; 2.6||2.8; 3.0; 9.5|||
|VIIIc||5.6||0.8; 2.5; 5.0||10.5||8.5; 10.0||11.0; 15.0||1.6||5.0||8.2; 12.0|||
|Combined chloroplast and mitochondrial genome type|
|cpIII+mtVIIIc||8.5; 10.0||11.0; 15.0||5.0|||
In their research, Tozuka et al. (1998) used two mtDNA fragments as probes: the 0.7-kb HindIII-NcoI fragment containing coxII (the gene encoding mitochondrial cytochrome oxidase subunit II) from wild soybean and the 0.66-kb StyI fragment containing atp6 (the gene encoding mitochondrial ATPase subunit 6) from Oenothera [69, 70] (Table 1).
Based on the RFLPs detected in gel-blot analysis with the coxII and atp6 probes, the collected plants were divided into 18 groups. Five mtDNA types accounted for 94% of the surveyed plants. The geographical distribution of mtDNA types revealed that in many regions soybean growing wild in Japan consisted of a mixture of plants with different types of mtDNA, sometimes even within a single location. Some of these mtDNA types showed marked geographic clines among the regions. In addition, some wild soybeans had mtDNA types that were identical to those described in cultivated soybeans. These results suggest that mtDNA analysis could resolve maternal lineages within the genus Glycine subgenus Soja.
Kanazawa et al. (1998) collected 1097 G. soja plants from all over Japan and analyzed RFLPs of their mitochondrial DNA (mtDNA) using five probes (coxI, coxII, atp6, atp9 and atp1, also known as atpA) (Table 1). Twenty different types of mitochondrial genomes, labeled as combinations of types I to VII and types a to k, were identified and characterized in this study. Nearly all the mtDNA types described for soybean cultivars also occurred in wild soybean.
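The logic of RFLP-based typing, in which plants sharing the same set of hybridization signals for every probe are assigned to the same mtDNA type, can be sketched as follows. This is an illustration of the grouping step only: the plant names and signal sizes are hypothetical, and the function names are not taken from any of the cited studies.

```python
from collections import defaultdict


def pattern_key(signals: dict) -> tuple:
    """Order-independent key built from the signal sizes (kb) seen with each probe."""
    return tuple(sorted((probe, tuple(sorted(sizes))) for probe, sizes in signals.items()))


def group_by_pattern(plants: dict) -> dict:
    """Map each distinct RFLP pattern to the list of plants that show it."""
    groups = defaultdict(list)
    for name, signals in plants.items():
        groups[pattern_key(signals)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical survey data: probe name -> hybridization signal sizes in kb.
    plants = {
        "plant_1": {"coxII": [1.6], "atp6": [5.0, 6.0, 12.0]},
        "plant_2": {"coxII": [1.6], "atp6": [12.0, 5.0, 6.0]},
        "plant_3": {"coxII": [3.5], "atp6": [2.4, 5.0]},
    }
    for pattern, members in group_by_pattern(plants).items():
        print(pattern, members)
```

In practice fragment sizes are estimated from gels, so a real analysis would need a tolerance when comparing sizes; exact matching is used here only to keep the sketch short.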
The mitochondrial atpA gene has also been analysed. In soybean, this gene is 90-97% identical in sequence to the corresponding mitochondrial genes of other plants [71-81]. Sequence similarity is limited to the atpA coding region. An intriguing feature of the soybean atpA open reading frame is that its putative translation termination site overlaps the start of an unidentified 642 nt open reading frame, orf214. The end of the atpA open reading frame contains four tandem UGA codons that overlap four tandem AUG codons initiating orf214. The atpA-orf214 region was found in soybean mtDNA in multiple sequence contexts, which can be attributed to the presence of two recombination repeats.
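The overlap of tandem UGA and AUG codons is easiest to see by reading one short sequence in two frames. The sketch below uses AUGAUGAUGAUGA, the shortest arrangement consistent with the description; it is an illustration, not the published soybean sequence, and the helper name is hypothetical.

```python
def codons(rna: str, frame: int) -> list:
    """Split an RNA string into complete codons starting at the given offset."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


if __name__ == "__main__":
    overlap = "AUGAUGAUGAUGA"  # illustrative 13 nt stretch, an assumption
    print(codons(overlap, 0))  # ['AUG', 'AUG', 'AUG', 'AUG']
    print(codons(overlap, 1))  # ['UGA', 'UGA', 'UGA', 'UGA']
```

Shifting the reading frame by one nucleotide turns four start codons into four stop codons, so one frame can terminate while the other initiates within the same 13 nucleotides.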
The orf214 open reading frame shares 79% nucleotide identity with orf209 of common bean and occupies the same position relative to atpA. Because this organization resembles the overlap of the atpB and atpE reading frames in several chloroplast genomes [83, 84], orf214 might encode another ATPase subunit; this possibility cannot be evaluated, however, because small ATPase subunits are poorly conserved.
So far, a total of 26 mtDNA haplotypes of wild soybean have been identified based on RFLPs with probes from two mitochondrial genes, cox2 and atp6 [69, 86] (Table 1). The three most common haplotypes (Id, IVa and Va) are present in 43 populations. The distribution of mtDNA haplotypes varies among populations. Shimamoto (2001) analyzed the genetic polymorphism of mitochondrial genes in subgenus Soja originating from China and Japan (Table 1). As a result of these studies, 6 types of mitochondrial genomes were distinguished.
3. Chloroplast genome
As a result of the extensive research conducted in the past two decades, cpDNA analysis has brought about fundamental changes in plant systematics. The chloroplast genome is ideal for phylogenetic analyses of plants for several reasons. First, it is abundant in plant cells and taxonomically ubiquitous. Second, since it is well studied, it can easily be assayed in the laboratory and analyzed in a comparative framework. Moreover, it often contains structural features that are cladistically useful as markers and, above all, it exhibits a moderate to low rate of nucleotide substitution. As with the mitochondrial genome, researchers apply two distinct phylogenetic approaches to cpDNA, namely the taxonomic survey of specific molecular features of cpDNA and the sequencing of specific genes or regions.
3.1. Chloroplast genome of soybean
In estimating the phylogeny of plants belonging to Glycine, particular attention has been paid to unusual and specific features of cpDNA. Among the many studies on the variability of the chloroplast genome, a breakthrough came in 1993 with a study assessing the phylogeny of seed plants. The study used a huge database of nucleotide sequences of the rbcL gene, which encodes the large subunit of ribulose-1,5-bisphosphate carboxylase. The accumulation of comparative data on this chloroplast gene made it a frequent object of research. This is because the gene is large (> 1400 bp) and provides many phylogenetically informative characters. The rate of rbcL evolution proved to be appropriate for addressing questions of plant phylogeny, especially at the medium and high taxonomic levels. Over the years, sequences from further species and of many other genes were added, including another chloroplast gene, atpB, which encodes an H+-ATPase subunit [92-95]. The atpB-rbcL spacer varies in length in Glycine as well as in other seed plants. The study by Chiang (1998) shows that the size of the atpB-rbcL spacer in the studied species ranges from 524 bp to 1000 bp; deletions and insertions, as well as numerous nucleotide substitutions, are common in this non-coding region, which can also be observed in Glycine. The chloroplast genome of Glycine max differs from the typical chloroplast gene order by the presence of a single, large inversion of approximately 51 kb in the region between the rbcL gene and the rps16 intron. This inversion is also present in other legumes: it has been reported in Lotus and Medicago. In addition, the non-coding atpB-rbcL region is AT-rich, and most AT-rich non-coding regions appear to have few functions [97, 98]. This predisposes them to faster evolution, and hence to use in molecular systematics.
The summary phylogeny was based on the sequences of several cpDNA genes from hundreds of spermatophytes, including Glycine (Table 2). These genes can be divided into three classes. The first class comprises the genes encoding the structure of the photosynthetic apparatus. The second class includes the rRNA genes and genes encoding the chloroplast genetic apparatus. The last class consists of about 30 tRNA-encoding genes on average, although their number can vary from 20 to 40 [100, 101].
|Genes for the photosynthesis system|
|rbcL||Ribulose-1,5-bisphosphate carboxylase, large subunit|
|psaA, B||Photosystem I, P700 apoproteins A1, A2|
|psbA||Photosystem II, D1 protein|
|psbB||47 kDa chlorophyll a-binding protein|
|psbC||43 kDa chlorophyll a-binding protein|
|psbE||Cytochrome b559 (8 kDa protein)|
|psbF||Cytochrome b559 (4 kDa protein)|
|psbH||10 kDa phosphoprotein|
|psbI, J, K, L, M, N||I-, J-, K-, L-, M-, N-proteins|
|atpA, B, E||H+-ATPase, CF1 subunits α, β, ε|
|atpF, H, I||CF0 subunits I, III, IV|
|petB, D||Cytochrome b6/f complex, subunits b6, IV|
|ndhA-K||NADH dehydrogenase, subunits ND 1, NDI 1|
|Genes for the genetic system|
|16S rRNA||16S rRNA|
|23S rRNA||23S rRNA|
|trnA-UGC||Alanine tRNA (UGC)|
|trnG-UCC||Glycine tRNA (UCC)|
|trnH-GUG||Histidine tRNA (GUG)|
|trnI-GAU||Isoleucine tRNA (GAU)|
|trnK-UUU||Lysine tRNA (UUU)|
|trnL-UAA||Leucine tRNA (UAA)|
|rps2, 7, 12, 16||30S: ribosomal proteins CS2, CS7, CS12, CS16|
|rpl2, 20, 32||50S: ribosomal proteins CL2, CL20, CL32|
|rpoA, B, C1, C2||RNA polymerase, subunits α, β, β’, β’’|
|matK||Maturase-like protein|
|sprA||Small plastid RNA|
|clpP||ATP-dependent protease, proteolytic subunit|
|irf168 (ycf3)||Intron-containing reading frame (168 codons)|
The complete Glycine max chloroplast genome is 152,218 bp in size. It contains a pair of inverted repeats (IRa and IRb) of 25,574 bp each, which are separated by a unique small single copy (SSC) region (17,895 bp). In addition, the genome includes a large single copy (LSC) region of unique sequence with 83,175 bp. The IR extends from the rps19 gene up to ycf1. The Glycine chloroplast genome contains 111 unique genes and 19 duplicate copies in the IR, amounting to a total of 130 genes. The cpDNA analysis has shown the presence of 30 different tRNAs, 7 of which are repeated within the IR regions. The genome consists of 60% coding regions (52% protein-coding genes and 8% RNA genes) and 40% non-coding regions, including both intergenic spacers and introns. The overall GC and AT contents of the Glycine chloroplast genome are 34% and 66%, respectively. A distinctly higher percentage of AT pairs was observed in non-coding regions (70%) than in coding regions (62%).
In comparison with other eukaryotic genomes, cpDNA is highly compact; for example, only 32% of the rice chloroplast genome is non-coding. In Glycine max the proportion is slightly higher, at 40%. Most of the non-coding DNA is found in very short fragments that separate functional genes. Some studies have shown complex patterns of mutational change in the non-coding regions. One of the best-known regions in the chloroplast genome of many legumes is the region downstream of the rbcL gene. This non-coding sequence is flanked by rbcL and psaI (the gene encoding polypeptide I of photosystem I).
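The region sizes quoted above can be checked against the total with a few lines of arithmetic. The sketch below assumes the usual quadripartite layout (LSC, SSC and two IR copies); the total is reproduced only if the 25,574 bp figure refers to each IR copy. The function name is hypothetical.

```python
def quadripartite_length(lsc_bp: int, ssc_bp: int, ir_bp: int) -> int:
    """Total length of a chloroplast genome with two copies of the IR."""
    return lsc_bp + ssc_bp + 2 * ir_bp


if __name__ == "__main__":
    # Figures quoted in the text for Glycine max.
    LSC, SSC, IR = 83_175, 17_895, 25_574
    print(quadripartite_length(LSC, SSC, IR))  # 152218
    print(111 + 19)                            # 130 genes in total
```

The same check confirms the gene count: 111 unique genes plus 19 duplicated copies in the IR give the stated total of 130.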
3.2. Extent of IR in Glycine
Analysis of the inverted repeat (IR) regions in Glycine max has shown that they are separated by a large and a small region of unique sequence. In cpDNA the repeated sequences are usually located asymmetrically, which results in the formation of long and short regions of unique sequence. The IR in Glycine is a region of 25,574 bp containing 19 genes. At the IR/LSC junction, 68 bp of the 5' end of the rps19 gene is duplicated, and at the IR/SSC junction, 478 bp of the 5' end of the ycf1 gene is duplicated. A comparison of the cpDNA IR region in Medicago, Lotus, Glycine and Arabidopsis indicates that there are changes within the IR in the two legumes. Glycine and Lotus have 478 bp and 514 bp of ycf1 duplicated, whereas Arabidopsis has 1,027 bp duplicated in the IR. This contraction of the IR in these legumes accounts for the smaller size of their IR and the larger size of their SSC. In addition to the contraction of the IR boundary in legumes, IRa has been lost in Medicago. This loss has resulted in ndhF (usually located in the SSC) being adjacent to trnH (usually the first gene in the LSC at the LSC/IRa junction). Loss of one copy of the IR in some legumes provides support for the monophyly of six tribes [103-106]. Wolfe (1988) identified duplicated sequences of portions of two genes, 40 bp of psbA and 64 bp of rbcL, in the region of the IR deletion between trnH and ndhF in Pisum sativum, and these duplications were later identified in broad bean (Vicia faba) [104, 107]. According to many researchers, the IR is the most conserved part of the chloroplast genome and is thus responsible for stabilizing the plastid DNA molecule [108, 109]. The loss of the IR can therefore be phylogenetically informative at the local level but misleading at the level of global phylogeny, because IR loss has likely occurred independently in more than one group of plants.
Conifers and some legumes (Pisum sativum, Vicia faba, Medicago sativa), for example, contain only one IR. Perhaps the lack of repeated sequences in these plants is associated with an increased incidence of rearrangement of their chloroplast genomes.
Introns and intergenic sequences in legume chloroplast DNA have become extremely important tools in phylogenetic analyses aimed at the systematics of this group [110, 111]. Moreover, microstructural changes occur with great frequency in these regions of cpDNA. The body of existing research suggests that mutations in the non-coding regions and the relatively fast evolution of organelle coding regions can serve as valuable markers for separating species by their evolutionary origin [110, 111]. Plant systematics generally treats chloroplast indels as phylogenetic markers because of their low prevalence in comparison with nucleotide substitutions.
3.3. CpDNA markers
There are many methods of generating molecular markers that rely on site-specific amplification of a selected DNA fragment by the polymerase chain reaction (PCR) and its further processing (restriction analysis, sequencing). Initially, research on plant genomes (mostly phylogenetic studies) used coding and non-coding sequences of chloroplast DNA. With time, genes and DNA segments located in nuclear, mitochondrial (mtDNA) and chloroplast (cpDNA) DNA all found a prominent place among plant DNA markers. Fully automated DNA sequencing has made it possible to subject ever more regions of plant DNA to comparative sequencing.
One of the most frequently sequenced cpDNA fragments in the phylogenetics of spermatophytes is the rbcL gene, which encodes the large subunit of ribulose bisphosphate carboxylase (RUBISCO). Its length in most plants is 1,428, 1,431 or 1,434 bp, and insertions and deletions within it are extremely rare. For many years this gene has been the subject of many comprehensive phylogenetic analyses of subgenus Glycine [112-114]. rbcL is most commonly used in analyses at the family and genus levels, but there is also research at lower levels, in cultivars and wild soybean [98, 115, 116]. A marker with characteristics very similar to those of rbcL (rate of evolution, length of 1497 bp) is atpB, the gene encoding the ATP synthase β subunit. The matK gene, which encodes a maturase involved in the splicing of group II introns and is 1,550 bp long, is characterized by a rapid rate of evolution that allows it to be used in research at the species and genus levels [117, 118]. Frequent mutations in this gene make it unsuitable for studies at higher taxonomic levels. Other popular cpDNA sequences used in phylogenetic studies of legumes include ndhF (the gene encoding a subunit of NADH dehydrogenase), 16S rDNA, the non-coding atpB-rbcL region, the trnL (UAA) intron and the intergenic spacer between the trnL (UAA) exon and the trnF (GAA) gene [96, 117-119].
It should be noted that the rate of evolution of a specific DNA region used as a marker can vary significantly not only among systematic groups, but also within these groups. Moreover, each DNA fragment within the same group has a different rate of evolution; for example, the ndhF cpDNA sequence in the Solanaceae family provides about 1.5 times more information in terms of parsimony than rbcL. Therefore each gene or other DNA fragment used as a genetic marker has a typical range of "taxonomic" or phylogenetic applications, which can vary significantly within a taxon. For this reason, the rbcL sequence has been widely used in Glycine for many years at the species and genus levels [104, 117, 118].
3.4. The genetic diversity of soybeans
The importance of genetic variation in facilitating plant breeding and/or conservation strategies has long been recognized. Molecular markers are useful tools for assaying genetic variation and provide an efficient means to link phenotypic and genotypic variation. In recent years, progress in the development of DNA-based marker systems has advanced our understanding of genetic resources. These molecular markers are classified as: (i) hybridization-based markers, i.e. restriction fragment length polymorphisms (RFLPs), (ii) PCR-based markers, i.e. random amplification of polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs), inter simple sequence repeats (ISSRs) and microsatellites or simple sequence repeats (SSRs), and (iii) sequence-based markers, i.e. single nucleotide polymorphisms (SNPs) [121, 123]. The majority of these molecular markers have been developed either from genomic DNA libraries (e.g. RFLPs or SSRs), from random PCR amplification of genomic DNA (e.g. RAPDs), or both (e.g. AFLPs). The availability of an array of molecular marker techniques and their modifications has led to comparative studies among them in many crops, including soybean, wheat and barley [124-126]. Among all these, SSR markers have gained considerable importance in plant genetics and breeding owing to many desirable attributes, including hypervariability, multiallelic nature, codominant inheritance, reproducibility, relative abundance, extensive genome coverage (including organellar genomes), chromosome-specific location, and amenability to automation and high-throughput genotyping. In contrast, RAPD assays are not sufficiently reproducible, whereas RFLPs are not readily adaptable to high-throughput sampling. AFLP is complicated because individual bands are often composed of multiple fragments, mainly in large-genome templates. The general features of DNA markers are presented in Table 3.
|Need for sequence data||Essential||Essential||Not required||Not required|
|Level of polymorphism||Low||High||Low||Low-moderate|
|Utility in Marker assisted selection||High||High||Moderate||Low-moderate|
|Cost and labour involved in generation||Low||High||High||Low-moderate|
The genetic diversity of wild and cultivated soybeans has been studied by various techniques including isozymes , RFLP , SSR markers , and cytoplasmic DNA markers [87, 128, 129]. Based on haplotype analysis of chloroplast DNA, cultivated soybean appears to have multiple origins from different wild soybean populations [129, 130].
Using PCR-RFLP method soybean chloroplast DNAs were classified into three main haplotype groups (I, II and III) [113, 130, 131]. Type I is mainly found in the species of cultivated soybean (Glycine max), while types II and III are often found in both the cultivated and wild forms of soybean (Glycine soja). Type III is by far the most dominant in the wild soybean species . In Glycine, these types are widely used in evaluating cpDNA variability and in determining phylogenetic relationships between different types of cpDNA using different marker systems. According to Chen and Hebert (1999)  analysis of cpDNA sequence is not sufficient for when the analysis of population genetics, and so cpDNA polymorphism assessment methods must be constantly complemented with methods such as single-strand conformation polymorphism (SSCP) , or dideoxy fingerprinting (ddF) , and directed termination and polymerase chain reaction (DT-PCR). However, some researchers point out that there are many disadvantages of these methods, mainly because of their high cost and large amount of work necessary for obtaining the results. In their view, a single change in the regions of Glycine chloroplast DNA at the species and genus levels should be located on a local-specific markers, for example, non-coding regions, using PCR and sequencing.
Analyses of non-coding regions of cpDNA have been employed to elucidate the phylogenetic relationships of different taxa. Compared with coding regions, non-coding regions may provide more informative characters in phylogenetic studies at the species level because of their high variability due to the lack of functional constraints. Non-coding regions of cpDNA have been assayed either by direct sequencing [136-141] or by restriction-site analysis of PCR products (PCR-RFLP) [142-146]. According to Small (1998), non-coding regions, which include introns and intergenic sequences, often show greater nucleotide variability than coding regions, which makes them good phylogenetic markers. Mutations in the form of insertions and deletions accumulate in non-coding regions at the same rate as nucleotide substitutions, and such mutations significantly accelerate change in these regions. In many cases, insertions or deletions are associated with short repeat sequences. Therefore, many researchers continue to focus on the analysis of non-coding regions. Using the RFLP method, Close et al. (1989) found six cpDNA haplotypes in cultivated and wild soybeans and designated them groups I to VI. In the course of their research they found that groups I and II diverge from groups III to VI, thus dividing subgenus Soja into two main groups. They hypothesized that group II can be distinguished from group III by two independent mutations. Similar groups of haplotypes in legumes were also obtained by Shimamoto et al. (1992) and Kanazawa et al. (1998), using a combination of EcoRI and ClaI RFLPs. In their classification, Kanazawa et al. (1998) relied on sequence analysis and found that the differences among the three types described by Shimamoto et al. (1992) resulted from two single-base substitutions: one in the non-coding region between rps11 and rpl36, and the other in the 3' part of the coding region of rps3.
Based on the existing reports, Xu et al. (2000) sequenced nine non-coding regions of cpDNA for seven cultivars and 12 wild forms of soybean (Glycine max, Glycine soja, Glycine tabacina, Glycine tomentella, Glycine microphylla, Glycine clandestina) in order to verify the earlier classification of Glycine. In the course of their studies, they located eleven single-base changes (substitutions and deletions) in the 3849 bases sequenced. They located five mutations in haplotypes I and II, and seven mutations in type III. In addition, haplotypes I and II were identical and clearly different from the taxa in type III. This research did not yield conclusive results, because the different types of cpDNA may not have originated monophyletically, but it contributed to identifying a common ancestor in the course of the evolution of Glycine. A neighbor-joining tree constructed from the sequence data revealed that the subgenus Soja grouped with Glycine microphylla, forming a clade distinct from Glycine clandestina and the tetraploid cytotypes of Glycine tabacina and Glycine tomentella. Several informative length mutations of 54 to 202 bases, due to insertions or deletions, were also detected among the species of the genus Glycine.
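Counting single-base changes between aligned sequences, and converting the counts into the pairwise distances on which a neighbor-joining tree is built, can be sketched as follows. This is a minimal illustration under stated assumptions: the three aligned fragments and their labels are invented, the function `compare_aligned` is hypothetical, and the uncorrected p-distance is used for simplicity rather than any particular substitution model from the cited work.

```python
from itertools import combinations


def compare_aligned(a, b):
    """Count substitutions and gap columns between two aligned sequences.

    Returns (substitutions, gap_columns, p_distance), where p_distance is
    the proportion of differing sites among columns without gaps.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    subs = gaps = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue              # shared gap: uninformative column
        if x == "-" or y == "-":
            gaps += 1             # insertion/deletion column
        else:
            compared += 1
            if x != y:
                subs += 1         # single-base substitution
    p_distance = subs / compared if compared else 0.0
    return subs, gaps, p_distance


# Hypothetical aligned fragments of one non-coding region.
alignment = {
    "type_I":   "ATGCCGTTAGCA-TTGAC",
    "type_II":  "ATGCCGTTAGCA-TTGAC",
    "type_III": "ATGCTGTTAGCAATTGGC",
}

for s, t in combinations(alignment, 2):
    print(s, t, compare_aligned(alignment[s], alignment[t]))
```

Gap columns are counted per position here, so a multi-base insertion would be counted as several columns rather than as one indel event; a real analysis would usually code each length mutation as a single character.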
3.5. Non-coding regions of the chloroplast genome as site-specific markers in Glycine
In the chloroplast genomes of legumes, including soybean, there are many non-coding regions, which are characterized by a faster rate of evolution than the coding regions. As mentioned earlier, some chloroplast genes have introns, yet their structure differs from that of introns in nuclear genes: cpDNA introns tend to adopt a secondary structure, which constrains the way in which they evolve. This restriction on mutational change reflects the functional requirements related to the formation of introns [98, 108]. As there are no adequate studies on the evolution of introns, it can be assumed that their evolution is similar to that of the protein-encoding genes. The loss of introns in the course of the evolution of chloroplast DNA is an interesting process. It has been discovered that O. sativa has three fewer introns in its cpDNA than M. polymorpha and N. tabacum. The loss of the intron in the rpl2 gene was investigated in 340 species representing 109 families of angiosperms, including Glycine. The taxonomic distribution of species lacking this intron shows that it was lost at least six times in the evolution of angiosperms. In Glycine 23 introns have been identified, while in Arabidopsis thaliana there are 26 introns, mostly located in the same genes and at the same positions within those genes [98, 102].
Non-coding regions of chloroplast DNA have become a major source of data for phylogenetic studies within the genus Glycine and in many other seed plants. Earlier, the sequences most commonly used in phylogenetics were coding regions, such as the rbcL gene, which were used to determine the phylogenetic relationships among species in major taxonomic groups [113, 136-141]. According to Taberlet et al. (1991), the potential of non-coding regions of cpDNA lies at lower taxonomic levels: non-coding regions, which include introns and intergenic sequences, often show greater nucleotide variability than coding regions, which predisposes them to use in population studies of Glycine and other taxa [139, 142].
In cpDNA analysis of many plants, highly conserved regions flanking areas of high variability are used. The more conserved the flanking regions, the greater the chance that PCR primers designed from them will anneal across a broad taxonomic group [96, 113]. The region between the trnT (UGU) and trnF (GAA) genes lies in the large single-copy region and is suitable because of the conservation of the trn genes and the several hundred base pairs of non-coding sequence between them. The intergenic spacer between trnT (UGU) and the trnL (UAA) 5' exon ranged from 298 bp to about 700 bp in the species studied. In the completely sequenced plant genomes reported by Sugiura, the length of this region differs: 770 bp in rice and 710 bp in tobacco. In Marchantia polymorpha it is 188 bp. This region is located between tRNA genes, as is the non-coding sequence between the trnL (UAA) 3' exon and trnF (GAA). Owing to its catalytic properties and secondary structure, the trnL (UAA) intron, which is a group I intron, is less variable and therefore more useful for evolutionary studies at higher taxonomic levels. Moreover, depending on the species, these regions show a high frequency of insertions or deletions, which makes them potentially useful as genetic markers.
|Region||Primer sequence (5'-3')*||Annealing temperature||Reference|
In most studied species, the trnL (UAA) intron ranges in size from 254 to 767 bp. A shorter fragment of it, the P6 loop, is 10 to 143 bp long. It is commonly applied in DNA barcoding. Its main limitation is its relatively low resolution: 67.3% of the species represented in GenBank can be identified with the whole intron, and only 19.5% with the P6 loop. However, it also has some advantages: highly conserved primer sites and a trouble-free amplification process. The P6 loop can be amplified even from highly degraded DNA. The intron is well characterized and its sequences are used to determine phylogenetic relationships between closely related species or to identify plant species. The first universal primers for this region were designed more than 20 years ago. However, it is not among the most variable non-coding regions of chloroplast DNA. The trnL (UAA) intron is the only group I intron in chloroplast DNA, which means that its secondary structure is highly conserved, with alternating conserved and variable regions [99, 153]. Consequently, comparing trnL intron sequences across taxa makes it possible to design new primers that lie in conserved regions and amplify the short sections between them.
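The principle of placing primers in conserved flanks to amplify a variable region of differing length can be sketched as a simple in-silico PCR. The primer sequences, template sequences and taxon names below are hypothetical toy data (they are not the published universal primers), and the helper `amplicon` assumes exact primer matches on a linear template, which real primer design would relax.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def amplicon(template, forward, reverse):
    """Return the product amplified by exact-match primers, or None.

    The forward primer matches the template strand as written; the reverse
    primer anneals to it, so its reverse complement is searched downstream.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Hypothetical primers sitting in conserved flanking sequence.
FWD = "CAGGTCATCG"
REV = "TACGGTTGCA"  # reverse complement of the conserved site TGCAACCGTA

# Hypothetical templates: same conserved flanks, variable region of
# different length (an insertion in taxon_b).
taxon_a = "TTT" + "CAGGTCATCG" + "ACGT" + "TGCAACCGTA" + "AA"
taxon_b = "TTT" + "CAGGTCATCG" + "ACGTTTTTACGT" + "TGCAACCGTA" + "AA"

for name, seq in (("taxon_a", taxon_a), ("taxon_b", taxon_b)):
    product = amplicon(seq, FWD, REV)
    print(name, len(product), product)
```

Because both primers anneal to conserved sequence, the same pair amplifies both templates, and the product length differs only by the insertion in the variable region between them.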
Thus, in angiosperms, using non-coding regions in research at lower taxonomic levels is routine practice. A large number of non-coding regions of cpDNA have been located in angiosperms, some of which are highly variable, whereas others show relatively little variability. In studying the chloroplast genome, many researchers have looked for universal primers that would allow amplification of many non-coding regions of cpDNA (Table 4) [111, 113, 148, 150].
In phylogenetic and population studies of Glycine, the genetic information contained not only in cpDNA but also in mtDNA is often analysed. Organelle DNA can be used to find species-specific molecular markers. Molecular markers are an important tool in systematics because they allow differences in the genes to be detected directly. The selection of appropriate sequences, which depends on the taxonomic level at which the phylogeny is reconstructed, is very important. The initial selection concerns non-conserved sequences, which evolve rapidly, because the more closely related the specimens are, the more variable the region should be. The relatively slow rate of evolution of certain sequences may preclude statistically significant analyses within families or species, whereas more slowly evolving sequences can be very useful for studying relationships between phylogenetically distant species. Non-coding sequences evolve faster than coding sequences. These regions accumulate more insertions/deletions and substitutions than coding regions, and may therefore be more suitable for research at the inter- or intra-generic level.