Algorithms for CpG Islands Search: New Advantages and Old Problems

CpG islands (CGIs) are regions with high GC and CpG content, whereas mammalian genomes are generally CpG-depleted. CGIs are often located in the promoter regions of genes, mostly housekeeping but also tissue-specific ones. It is widely believed that CpG dinucleotides within promoter CGIs are unmethylated and are targets for specific regulatory protein binding. As a result, CGIs contain special sequence motifs for high-affinity protein binding (transcription factor binding sites, TFBS). Methylation of cytosine in the CpG context within such motifs could decrease the affinity of TF binding, increase the attraction of methyl-binding proteins, affect histone modification and, therefore, lead to repression of gene transcription. The mechanism of local and global transcription repression via CpG methylation is used in many normal (development, differentiation, aging, X-chromosome inactivation, imprinting) and pathological processes (cancer and other diseases). However, it has recently been reported that a class of normally methylated but active promoters exists. Evidence for the biological relevance of methylated CGIs and of CGIs located far from gene promoters has also appeared lately. Such CGIs could act as regulators of pervasive transcription, which seems to be an actual genome feature rather than a side effect of errors in high-throughput techniques. Replication origins are also reported to be associated with CGIs of any location. As a consequence of their specific nucleotide content, CGIs could affect DNA or RNA secondary structures. For example, the G2-3C2-3 motif common within CGIs induces significant local curvature of DNA. Another motif, the G-rich sequence (GRS) in the 3' and 5' regions of RNA, is known to form specific structures, G-quadruplexes, at both ends of the RNA, which play an important role in its stability. This motif corresponds to a C-rich sequence in DNA and is likely to appear in CGIs.
Classical algorithms for CpG island search use a sliding window (SWM) or a running sum (RSM) and several distinct but not independent criteria (GC content, Obs/Exp CpG and length). The thresholds for the criteria are rather arbitrary, do not account for differences between species, and lack biological interpretation. SWM algorithms are rather slow; RSM algorithms are faster but tend to split large CGIs into several smaller ones and to omit CGIs with a nonuniform distribution of CpG dinucleotides along the sequence. Recently, several algorithms based on clustering of CpG dinucleotides were implemented. Those algorithms have fewer parameters and a reasonable mathematical basis. Comparing the algorithms is tricky. Hypermutability of CpG dinucleotides leads to loss of CGI conservation between species, so comparative genomics cannot be applied to estimate the effectiveness of the algorithms.
To validate the results of CGI prediction, authors use different biological and mathematical properties. One of the most popular quality measures is the fraction of CGIs that are located near promoters of protein-coding genes and avoid overlap with Alu-repeats. This measure is inappropriate for at least two reasons. First, as has recently become clear, promoters of protein-coding genes are likely to be a small fraction of all promoters. Second, two classes of promoters (CGI-dependent and CGI-independent) exist and their ratio is unclear. Avoiding repetitive sequences is more or less achievable for many algorithms, but authors now prefer to remove Alu-repeats and other repetitive DNA sequences in advance. Prediction of the methylation profile in different tissues, in normal and cancer cells, is another idea for validation. Algorithms of CGI search per se fail to predict correctly the distribution of methylated cytosine in the genome. To distinguish between methylated and unmethylated CGIs, machine-learning techniques (MLT) are used. Those studies include additional sequence features (di- and trinucleotide distribution, CpG and TpG frequencies, TFBS, repetitive elements and others). Machine-learning techniques are also applicable for collecting promoter CGIs. The point that GC content and CpG frequency, or the density of CpG clusters, are not enough to describe special types of CGIs is highly relevant. The main problem of MLT approaches is that the resulting model usually has many parameters, sometimes without clear biological meaning. The consistency of models built by different authors under similar conditions is rather low, so those features can hardly be used to validate CGI quality in the general case. The verification problem, caused by the lack of universal biological properties of CGIs, results in the absence of a widely accepted definition. It should be mentioned that all algorithms trying to predict CGIs with one particular function (promoter or unmethylated CGIs) demonstrate a high false-positive rate, probably due to the complex network of CGI functions. It is becoming clear that many different functional elements exist within one CGI. Moreover, both methylated and unmethylated, and both promoter and non-promoter, CGIs seem to be functional. So, one can conclude that contemporary algorithms for CGI search based only on GC and CpG content or on CpG clustering determine a chimeric class of objects.

Algorithms for CpG islands search
Nowadays, the most popular algorithms for CpG island search are still based on criteria established more than twenty years ago (Gardiner-Garden & Frommer, 1987). A DNA segment is considered to be a CpG island if it is at least 200 bp long, has a GC content of at least 0.5 and an Obs/Exp CpG ratio (1) of at least 0.6:
$$\mathrm{Obs/Exp\ CpG} = \frac{N_{CpG} \cdot N}{N_C \cdot N_G} \qquad (1)$$
where $N_C$, $N_G$ and $N_{CpG}$ are the numbers of C, G and CpG, respectively, in a region of length $N$. Implementations of the basic idea vary in details, mostly in the methods used to search for segments with the properties mentioned above.
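The three criteria above can be checked for a single candidate segment with a few lines of code. The following is a minimal sketch, not the implementation of any published tool; the function names `cpg_stats` and `is_cpg_island` are hypothetical, and the default thresholds are simply the values quoted above (200 bp, GC content 0.5, Obs/Exp CpG 0.6).

```python
def cpg_stats(seq):
    """Return (gc_content, obs_exp_cpg) for a DNA string."""
    s = seq.upper()
    n = len(s)
    n_c = s.count("C")
    n_g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact CpG count.
    n_cpg = s.count("CG")
    gc_content = (n_c + n_g) / n if n else 0.0
    # Obs/Exp CpG = N_CpG * N / (N_C * N_G); defined as 0 when C or G is absent.
    obs_exp = (n_cpg * n) / (n_c * n_g) if n_c and n_g else 0.0
    return gc_content, obs_exp


def is_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check one segment against the length, GC and Obs/Exp CpG thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp
```

A sliding-window or running-sum searcher would apply such a test to many candidate segments; how the candidates are generated is exactly where the implementations differ.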
UCSC CGI (the algorithm of Mikhlem and Hillier) is based on the RSM but includes an additional check that a CGI fits the traditional criteria (Gardiner-Garden & Frommer, 1987). The total number of CGIs obtained by UCSC is smaller than that obtained by CpGplot, because not every frame is tested against the criteria, but only those with a score above a threshold at the first step. CGIs predicted by the algorithm of Mikhlem and Hillier are often shorter at both ends compared with those predicted by CpGplot, and they also start and end with CpG dinucleotides.

CpG clustering methods
The next logical step in the development of CGI searchers is to implement actual CpG clustering methods (CGCM). Several such algorithms are available: CpGcluster (Hackenberg et al., 2006), CpG clusters (Glass et al., 2007), and CGI HW, an algorithm developed by H. Wu (Irizarry et al., 2009; H. Wu et al., 2010). These algorithms are based on segmentation of the genome into regions with different frequencies of CpG dinucleotides (CGI HW also uses segmentation based on GC content). Unlike the methods described above, this approach to CGI prediction is data-driven and allows CGIs to be found in species with different average GC content and CpG frequency.
CpGcluster has two separate steps: a search for CpG clusters and an estimation of the probability of finding such a cluster by chance. The distance between neighboring CpG dinucleotides in a random sequence is modeled by a geometric distribution with the CpG frequency as its parameter. Hackenberg and colleagues (Hackenberg et al., 2006) assume that within a functional CpG cluster the distance between neighboring CpGs is smaller than expected in a random sequence. The authors show that distances smaller than the median of the theoretical distribution are overrepresented in the human genome. The median distance between neighboring CpGs (23-53 bp depending on the chromosome) is used as a threshold, so each cluster consists of CpGs located no farther apart than the threshold. All resulting CGIs start and end with a CpG dinucleotide. Each cluster has a p-value calculated from the negative binomial distribution. Only clusters with a p-value below 1.0e-5 (1.0e-20 in (Hackenberg et al., 2010a)) are considered CpG islands. The authors find about 200000 CpG islands in the human genome (25000 CpG islands with the p-value threshold of 1.0e-20). Many of these CpG islands are shorter than 200 bp. Yet the authors show that some short CGIs are functional and call them CpG islets (Hackenberg et al., 2010a).
CG clusters annotation also has two steps. The location of every CpG dinucleotide is extracted from genomic DNA sequences. Using these positions, every overlapping sequence fragment containing a fixed number of CpGs and having variable length is identified. For each number of CpGs, the frequency of each fragment length is recorded. The threshold for the maximum fragment length is defined as a local minimum in the fragment length histogram, estimated by identifying zero values of the first derivative of a cubic spline fit.
Mapping the CpG-dense fragments back to the genomic sequence produces an annotation track where each annotated locus is a conglomeration of one or more overlapping fragments of variable length. The number of overlapping fragments at a locus, normalized by the maximum fragment length, is used as the basis for choosing the optimal track. A track with maximal fragment overlap per locus is selected based on genomic averages of this metric for different numbers of CpGs per fragment. This approach allows the authors to choose the species-specific optimal number of CpGs per fragment for the final annotation.
CGI HW relies on two models applied to short genomic segments. One is for GC content to be high or low, with the assumption of a binomial distribution approximated by the normal density for the baseline. The second one is for the CpG number, with the assumption of a Poisson distribution for the baseline. The segment length L=16 was chosen based on the association of CGIs with epigenetic marks. The approach summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates CGI search in different species.
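The distance-based step shared by the clustering methods can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the code of CpGcluster or CG clusters: the function names are hypothetical, the threshold is taken as the observed median gap between neighboring CpGs, and the p-value and spline-based threshold steps are deliberately left out.

```python
from statistics import median


def cpg_positions(seq):
    """0-based start positions of every CpG dinucleotide in a DNA string."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == "CG"]


def median_cpg_distance(positions):
    """Median gap between neighboring CpGs (assumed clustering threshold)."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return median(gaps) if gaps else None


def cluster_cpgs(positions, max_dist, min_cpgs=2):
    """Chain neighboring CpGs whose gap is <= max_dist.

    Returns (start, end, n_cpgs) tuples with an exclusive end, so every
    cluster starts and ends with a CpG dinucleotide.
    """
    clusters = []
    if not positions:
        return clusters
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev <= max_dist:
            count += 1
        else:
            if count >= min_cpgs:
                clusters.append((start, prev + 2, count))
            start, count = pos, 1
        prev = pos
    if count >= min_cpgs:
        clusters.append((start, prev + 2, count))
    return clusters


def fragment_lengths(positions, k):
    """Lengths of all overlapping fragments that span exactly k CpGs."""
    return [positions[i + k - 1] + 2 - positions[i]
            for i in range(len(positions) - k + 1)]
```

In this sketch `cluster_cpgs` corresponds to the first step of CpGcluster, while `fragment_lengths` yields the raw values from which a fragment-length histogram for a fixed number of CpGs could be built.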

Validation problem
Having several methods for CGI prediction, one is still unable to select the best one. The main reason is the lack of validation criteria. Su and colleagues (Su et al., 2009) propose cumulative mutual information of CpG dinucleotides as a measure of CGI quality and show that it is a powerful criterion for avoiding CGIs associated with Alu-repeats. Despite the power of this mathematical criterion, most authors try to use biological features for CGI validation.

Sources for biologically relevant validation: DNA methylation and protein binding
The very first work to mention CG-rich islands (Bird, 1986) considers them as DNA regions where cytosine is unmethylated. Cytosine methylation usually appears in the CpG context and increases the probability of cytosine deamination about 10-fold (Ehrlich & Wang, 1981), leading to enrichment of TpG and depletion of CpG dinucleotides in DNA. The absence (or decreased level) of cytosine methylation within CGIs is usually considered to be the origin of CGIs in mammalian genomes (Cross et al., 1994; Eckhardt et al., 2006). Modern research shows that methylated cytosines within CpNpG are also targets for spontaneous deamination (Cooper et al., 2010).
There is no doubt that cytosine methylation plays an important role in CGI functioning. During early development, waves of methylation and demethylation generate tissue-specific genomic methylation profiles. These profiles are stable across somatic cell generations owing to the replication-dependent maintenance methylation system (Brero et al., 2006). About 70-80% of cytosines in the CpG context are methylated in differentiated cells (Baylin et al., 1998); a recent study shows that cytosine is also methylated within the CpHpN context (where H = C, A or T), especially in embryonic stem cells (Baylin et al., 1998). Cytosine methylation influences DNA structure by facilitating the Z-form conformation (Behe & Felsenfeld, 1981); it also affects protein binding to DNA, so most transcription factors (TFs) usually bind unmethylated DNA.
Another well-known group of methyl-binding proteins consists of Kaiso and ZBTB4/33. They contain a zinc-finger domain and bind DNA in a sequence-specific manner. Data on the Kaiso binding site are controversial. Van Roy and McCrea (van Roy & McCrea, 2005) believe that Kaiso binds 5mCG5mCG. Sasai and colleagues (Sasai et al., 2010) assume that the 5mCG5mCG motif is a place where two Kaiso molecules bind, one on each strand. The motif also has to be in a specific sequence environment. It is also known that Kaiso binds the TNGCAGGA motif containing non-methylated cytosine, but with 1000-fold lower affinity (Daniel et al., 2002).
There is some evidence that Kaiso is a global repressor of methylated genes and is essential for early embryonic development. The ZBTB4 protein binds the CYGCCATC motif as well as M5mCGCYAT (Sasai et al., 2010). It has also been shown that proteins of this group demonstrate affinity for half-methylated DNA (Sasai et al., 2010). Some other proteins also bind methylated DNA. CpG methylation of the CRE motif (TGACGTCA) enhances the DNA binding of C/EBPα (Rishi et al., 2010). UHRF1 and UHRF2 (SET- and RING finger-associated proteins, SRA) bind hemimethylated CpG and the tail of histone H3 in a highly methylation-sensitive manner and help assemble histones and DNA into a nucleosome after replication (Hashimoto et al., 2009).

Sources for biologically relevant validation: DNA methylation and gene expression
Nowadays there are two main hypotheses explaining the origin of DNA methylation during evolution. Some authors believe that the methylation system arose to inactivate viruses and transposons (Walsh et al., 1998). Despite some evidence in favor of this hypothesis, most authors now suppose that the main function of DNA methylation is the control of gene expression during development and cell differentiation, most likely by influencing the binding affinity of different proteins.
Promoter regions of many genes are unmethylated and demonstrate resistance to increasing concentrations of methylating agents (Bestor et al., 1992). Yet if a promoter region becomes methylated, this usually leads to irreversible gene suppression that is stable across cell generations (Razin & Riggs, 1980; Schubeler et al., 2001). However, some genes demonstrate rather high expression independently of the methylation level of their promoters (Shen et al., 2007), and some promoters need methylated cytosine to be activated (Rishi et al., 2010). Cytosine methylation affects transcription both directly, by changing the affinity of TF binding to DNA, and indirectly, by forming inactive chromatin domains. Both 5mC and T change DNA conformation in core positions of TFBS. In some cases a single methylated cytosine is enough for transcription repression; in other cases the level of expression is negatively correlated with the methylation level but is independent of the exact position of the methylated cytosine. Inhibition of transcription caused by partial DNA methylation can be overcome by enhancers (Hug et al., 1996); however, fully methylated promoters cannot be reactivated that way (Schubeler et al., 2001). The possibility of active demethylation is still under discussion (S.C. Wu & Zhang, 2010). Cytidine deaminase AID could play a role in this process in mammals (Fritz & Papavasiliou, 2010). Recently it has been shown that the elongation complex can also participate in demethylation (Okada et al., 2010). Even the DNA methyltransferases DNMT3a/b could force cytosine deamination, leading to repair of the T-G mismatch into a correct C-G pair by the GC-biased repair system (S.C. Wu & Zhang, 2010). Overexpression of MBD3 could also play a role in demethylation (S.E. Brown et al., 2008). Yet active demethylation after implantation of the embryo is a very rare event (S.C. Wu & Zhang, 2010).
Different tissues and cell types demonstrate specific cytosine methylation patterns (Ushijima et al., 2003); those patterns in the same tissue of different individuals are similar (Lister et al., 2009) but not identical (Bock et al., 2008). Many regions with tissue-specific methylation profiles (tDMRs) are now known (Rakyan et al., 2008; Brunner et al., 2009; Straussman et al., 2009; Xin et al., 2010). DMRs are likely to be involved in gene imprinting (Lopes et al., 2003). Differential activity of the imprinted alleles of a gene depends on methylation of the promoters, enhancers or silencers of that gene (Li et al., 1993).
Females have one of the X chromosomes inactivated in somatic cells (Gartler & Riggs, 1983). The process of inactivation starts at an early embryonic stage with Xist activation (S.D. Brown, 1991), which leads to chromatin modification and methylation of the promoters of most (Deobagkar & Chandra, 2003) but not all (Zeschnigk et al., 2009) genes. The methylation and gene repression profile of the inactivated X chromosome is stable across cell generations. Disruption of the normal methylation profile is a distinctive feature of different pathological conditions (Rett syndrome, psychopathologies (Egger et al., 2004), autoimmune diseases (Richardson, 2007), hypertension (Frey, 2005)). Although there is much evidence of epigenetic changes in pathologies, cancer is the best-known disease with epigenetic abnormalities, especially in DNA methylation (Jones & Baylin, 2002; Laird, 2003; Herrera et al., 2008). Tumor cells demonstrate many modifications of epigenetic status: general demethylation of the genome influencing chromatin structure, increased DNA methyltransferase activity, and hypermethylation of the promoter regions of many genes resulting in their repression. The high probability of 5mC mutating into T brings about many cancer-specific mutations. It is important to note that pathological methylation profiles often depend on environmental conditions and are inherited (Liu et al., 2008).

Sources for biologically relevant validation: CpG islands as promoter regions
The RNA polymerase II core promoter contains DNA motifs that direct the transcriptional machinery to the transcription start site (TSS). Four DNA motifs are currently known to be part of the core promoter: the TATA box, the TFIIB recognition element (BRE), the initiator (Inr), and the downstream promoter element (DPE) (Kutach & Kadonaga, 2000). The TATA box is an A/T-rich sequence, located about 20-30 nucleotides upstream of the TSS, that binds the TFIID complex (Burley & Roeder, 1996). The BRE, which has the consensus SSRCGCC, is located immediately upstream of the TATA element in some promoters and increases the affinity of TFIIB binding (Lagrange et al., 1998). The Inr was originally defined as a motif encompassing the TSS that is sufficient to direct accurate initiation in the absence of a TATA element (Smale, 1997). Inr elements are, however, present in both TATA-containing and TATA-less promoters and play a role in TFIID binding (Chalkley & Verrijzer, 1999). In mammalian promoters, the Inr consensus sequence is YYA(+1)NWYY, where A(+1) is the TSS (Bucher, 1990).
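Consensus sequences such as the BRE motif SSRCGCC are written in the IUPAC nucleotide code, so scanning a sequence for them amounts to expanding each ambiguity symbol into a character class. The sketch below illustrates this under simple assumptions: it searches one strand only, treats every position of the consensus as equally strict, and uses hypothetical function names (`iupac_to_regex`, `find_motif`).

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def find_motif(seq, consensus):
    """0-based start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(seq.upper())]
```

For example, `find_motif("TTGGACGCCAA", "SSRCGCC")` reports a single BRE-like match starting at position 2. Real promoter scanners usually rely on position weight matrices rather than strict consensus matching, so this should be read as a first approximation only.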
The DPE acts cooperatively with the Inr, supporting TFIID binding and accurate transcription initiation in TATA-less promoters (Burley & Roeder, 1996). The DPE is located about 30 nucleotides downstream of the TSS and contains a common GWCG sequence motif. Saxonov and colleagues (Saxonov et al., 2006) demonstrate that human genes have two different promoter types: AT-rich and GC-rich (associated with CGIs). They are easily distinguishable not only by AT or GC content, but also by the different motifs overrepresented in each promoter type. One can see that most core promoter elements are GC-rich and could be part of a CGI-associated promoter. CGIs are often located in the 5' regions of genes, mostly overlapping with the TSS (Gardiner-Garden & Frommer, 1987; Davuluri et al., 2001; Ponger et al., 2001), and participate in the regulation of transcription initiation (Rozenberg et al., 2008). Housekeeping genes tend to have a CGI promoter more frequently than tissue-specific genes (Zhu et al., 2008). However, promoters of tissue-specific genes related to development and embryogenesis are usually located in proximity to CGIs (Robinson et al., 2004). Many authors believe that CGIs exist because the CpG dinucleotides inside them are protected from methylation. The mechanism of such protection is assumed to be protein binding at CGI boundaries, as has been shown for Sp1 in the promoter of the mouse aprt gene (Macleod et al., 1994). Later, the role of Sp1 in the formation of CGI boundaries was shown for other genes (Tomatsu et al., 2002). Sp1 is often associated with CGIs as one of their key features (Macleod et al., 1994; Rozenberg et al., 2008). In one of the first works on CGIs (Gardiner-Garden & Frommer, 1987) it was shown that CGIs contain many G/C-boxes (GGGCGG), which act as a core for the Sp1 TFBS (Briggs et al., 1986). Sp1 binds both methylated and unmethylated DNA (Holler et al., 1988). Fan and colleagues (Fan et al., 2007) assume that all proteins with a zinc-finger domain can play a role in the formation of CGI boundaries. Some other proteins, like VEZF1 (Dickson et al., 2010) and CTCF (Filippova et al., 2005; Recillas-Targa et al., 2006), also participate in this process. Naumann and colleagues (Naumann et al., 2009) show that loss of such a boundary (in fragile X-chromosome syndrome) leads to the spread of methylation and gene inactivation. Moreover, CGIs containing CTCF binding sites can themselves act as insulators, forming the boundaries of chromatin domains (Filippova et al., 2005).
Besides TFBS, other DNA motifs are associated with CGI promoters. GC-skew, a feature of all unidirectional promoters, is stronger for genes starting within CGIs than for genes lacking this property (Polak et al., 2010). Tandem or simple repeats are also found within CGIs (Hutter et al., 2006). The sequence motifs G2-3C2-3, typical for CGIs, induce local DNA curvature and form G-quadruplexes at the 5' and 3' ends of the RNA molecule. G-quadruplexes in DNA restrict methylation of CpG dinucleotides genome-wide (Halder et al., 2010).
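GC-skew is not defined explicitly in the text; the sketch below assumes the commonly used definition (G - C) / (G + C), computed on one strand either for a whole segment or in sliding windows. The function names and the window parameters are hypothetical and serve only to illustrate how such a feature could be derived for regions around a TSS.

```python
def gc_skew(seq):
    """GC-skew of a DNA string, assumed here to be (G - C) / (G + C)."""
    s = seq.upper()
    g = s.count("G")
    c = s.count("C")
    # The skew is undefined without G or C; 0.0 is returned by convention.
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc_skew(seq, window, step):
    """GC-skew for consecutive windows of the given size and step."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]
```

A positive value indicates an excess of G over C on the analyzed strand, so a change of sign along a promoter region would point to a strand asymmetry of the kind discussed above.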

Sources for biologically relevant validation: CpG islands located far from promoter regions
At least 25% of CpG islands are located far from gene promoters (Ponger et al., 2001).
Although many such CGIs overlap with repeats (Graff et al., 1997; Ponger et al., 2001), other CGIs do not (Ponger et al., 2001; Hackenberg et al., 2006). They are often located near the 3' region of a gene (Gardiner-Garden & Frommer, 1987) or within the gene (Hackenberg et al., 2006). Such 3' and intragenic CGIs are subject to natural selection not only at the protein level, but also at the level of nucleic acids, which confirms their functional significance (Medvedeva et al., 2010). Many CGIs located far from promoters of protein-coding genes perform important biological functions. For instance, a CGI within intron 10 of KCNQ1 that acts as a promoter of an antisense RNA transcript is involved in the imprinting regulation of the locus (Smilinich et al., 1999). Imprinting of the MAP3K12 gene is caused by differential methylation of a CGI located in its last exon (Takada et al., 2000). Many CGIs around the 3' ends of genes affect their expression in normal tissues (Appanah et al., 2007) and in cancer (Shiraishi et al., 2002).
Recently, several works have shown that CGIs located far from known genes, in intergenic regions, correspond to previously undetected promoters (Carninci et al., 2005; Medvedeva et al., 2010) that play a role during development (Illingworth et al., 2011). The CTCF insulator protein, which forms the boundaries of active chromatin regions (Bell & Felsenfeld, 2000), often binds the CCCTC core motif common within CGIs.

CpG islands and mobile elements.
The human genome contains many repetitive sequences with high GC content, so many algorithms find CGIs overlapping with repeats (Alu-repeats in human (Graff et al., 1997) and B1-repeats in mouse (Yates et al., 1999)).
Cytosines within CGIs associated with Alu-repeats are methylated in normal cells, which in turn represses the expansion of the repeat (Xing et al., 2004). Loss of methylation in Alu-repeats is typical for tumor cells (Xie et al., 2010). Recently, absence of methylation in Alu-repeats was shown for the germ line (Brohede & Rand, 2006). Ullu and Tschudi (Ullu & Tschudi, 1984) believe that Alu-repeats are processed pseudogenes of 7SL RNA, and several Alu families still contain an internal promoter of RNA polymerase III (Britten et al., 1988). One can expect that CGIs in Alu-repeats should have different DNA motifs compared with CGIs in promoters of protein-coding genes transcribed by PolII. Nevertheless, recent studies show that pervasive PolII transcription is also a common feature of pseudogenes and transposons (Frith et al., 2006). Alu-repeats are a source of spreading DNA methylation, so unmethylated CGIs contain TFBS for Sp1 and other proteins to protect themselves from methylation (Caiafa & Zampieri, 2005). Recent studies show that Alu-repeats proximal to CpG islands could themselves form a boundary protecting CpG islands from methylation (Feltus et al., 2003).
Taking into consideration all the facts mentioned above, it is obviously too early to exclude Alu and similar repeats from consideration when discussing CGI functionality. Most authors (Takai & Jones, 2002; H. Wu et al., 2010) try to build an algorithm for CGI search that avoids CGIs around Alu-repeats. There are some differences in GC content, Obs/Exp CpG (Takai & Jones, 2002) or cumulative mutual information of CpG dinucleotides (Su et al., 2009) between CGIs found near Alu-repeats and those around promoters of protein-coding genes. Yet most algorithms exclude all repetitive sequences ab initio, and therefore all of the CGIs located within them, removing more than half of all CGIs in doing so. The question remains why the same sequences are of no use in repetitive elements while being essential in unique segments.

CpG islands and replication origins.
Sequence properties of replication origins in mammals are not well studied. There is some evidence that CpG islands near the 3' region of a gene (Phi-van & Stratling, 1999) or in other genome regions can act as replication origins (Rein et al., 1997; Rein et al., 1999). It is important to note that some CpGs in those regions have to be methylated for successful replication (Rein et al., 1999).

Approaches for validation
Taking into consideration the biological properties mentioned above, DNA methylation is a logically relevant feature for validating CGI prediction. The complicated system of interactions involving CGIs makes it obvious that considering a CGI as merely an unmethylated region is an oversimplification. Since DNA methylation plays an important role in cell differentiation, the same DNA region can be unmethylated at an early stage of development and methylated at later stages (reprogrammed DMR, rDMR), or unmethylated in one tissue and methylated in another (tissue-specific DMR, tDMR), or unmethylated in one allele and methylated in the other (allele-specific DMR, aDMR), as in the case of imprinting or dosage compensation, or demonstrate cross-individual differences in methylation (individual DMR, iDMR). A more appropriate way is to associate CGIs with DMRs that demonstrate the absence (or a decreased level) of cytosine methylation only in one or a few conditions. Nevertheless, even methylated CGIs play a role in transcription regulation; some of them contain the TSS of protein-coding (Shen et al., 2007) or non-coding genes (Medvedeva et al., 2010). Recently, a mechanism of transcription activation by binding of the C/EBPα transcription factor to the methylated CRE motif (TGACGTCA) was demonstrated (Rishi et al., 2010). Thus, the absence of methylation should not be the only criterion for CGI verification.
Recently, many works dedicated to the prediction of DNA methylation status in different normal tissues ((Bock et al., 2008; Zhao & Han, 2009) and references therein) and in cancer (Feltus et al., 2006) have appeared. Various machine-learning techniques (support vector machines (Bhasin et al., 2005; Das et al., 2006), alternative decision trees (Carson et al., 2008), discriminant analysis (Feltus et al., 2003)) were used to distinguish between methylated and unmethylated regions. Authors use GC content, different di- and trinucleotides (Das et al., 2006; Fang et al., 2006), Alu-repeat location (Das et al., 2006; Fang et al., 2006), TpG fraction, TFBS, repeats, predicted DNA structures (Bock et al., 2006) and other DNA patterns and properties (Bhasin et al., 2005; Bock et al., 2007; Oakes et al., 2007; Carson et al., 2008; Ehrich et al., 2008) as parameters for those studies. Results obtained by different authors are not comparable, as in every case the model is built on a distinct set of tissues and usually not in a genome-wide manner. Features that demonstrate high selectivity in one work do not do so in others. The consistency of features is low, so one can conclude that those models are overfitted.
Promoter proximity is another traditional key feature for CGI validation. The most popular criterion is the fraction of predicted CGIs located near promoter regions of protein-coding genes. Alu-repeats are usually used as a negative set. SWM with higher thresholds for length, GC content and Obs/Exp CpG (Takai & Jones, 2003; Han & Zhao, 2009) and clustering algorithms (Glass et al., 2007; Hackenberg et al., 2010a; H. Wu et al., 2010) show the best results. The Takai-Jones algorithm predicts 40% of CGIs to be located near promoters of RefSeq genes; with CpGcluster, up to 50% of all CGIs are near promoter regions (with p-value = 1.0e-20). Wu and colleagues (H. Wu et al., 2010) believe that CGHW predicts more CGIs located near promoters of RefSeq genes than UCSC CGI and CG clusters do. Although about half of CGIs are located near the TSS of protein-coding genes, the rest are not. Various lines of evidence for pervasive transcription have appeared lately (Carninci et al., 2005). New high-throughput techniques (CAGE, SAGE, etc.) identify at least ten times more transcriptionally active regions than there are protein-coding genes. Most of those regions contain TSS for ncRNAs of different types. CGIs located far from the TSS of protein-coding genes can act as their promoters. Nowadays the discovery of new protein-coding genes is a rare event. Nevertheless, our knowledge about ncRNA genes is extremely incomplete. On the other hand, one should not forget that mammalian genomes have not only CGI-dependent promoters, but also TATA-dependent ones (Saxonov et al., 2006). The proportion of the two types is still unclear. Therefore the fraction of CGIs associated with promoters of protein-coding genes is not an appropriate measure.
Other genomic features, like insulators, replication origins and recombination hot-spots, are also co-located with CGIs and make the whole picture more complicated. It is also becoming clear that a CGI is not functionally uniform along its length. A CGI is not only a region with high GC content and CpG frequency. Even in very early works on CGIs the (G/C)-box was mentioned as a structural element. Currently, it is obvious that not only Sp1 but also many different TFs bind DNA within CGIs, so a large fraction of CGIs contain TFBS and their clusters. Also, at least some CGIs have boundary regions containing binding sites for Sp1, CTCF, VEZF1 or other TFs. Recently it was shown that a G-quadruplex could also form a boundary of a CGI. It should be emphasized that the quality of prediction of a biologically relevant feature is higher if the method uses not only CGI prediction but also other sequence properties. Therefore the concept of a complex CGI definition, based not only on GC or CpG content but also on other features like TFBS, repeats or DNA structure elements, looks promising.
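The promoter-proximity criterion discussed above reduces to counting predicted CGIs that lie within some distance of an annotated TSS. The following sketch shows one way to compute that fraction; it is an illustration under assumptions, not the procedure used in the cited works. Intervals are half-open (start, end) pairs on a single chromosome, and the function name and the default distance of 1000 bp are hypothetical.

```python
from bisect import bisect_left


def fraction_near_tss(cgis, tss_positions, max_dist=1000):
    """Fraction of CGI intervals lying within max_dist bp of any TSS.

    cgis: list of (start, end) pairs on one chromosome.
    tss_positions: TSS coordinates on the same chromosome.
    """
    if not cgis:
        return 0.0
    tss = sorted(tss_positions)

    def near(start, end):
        # First TSS at or right of (start - max_dist); it counts as a hit
        # if it does not lie beyond (end + max_dist).
        i = bisect_left(tss, start - max_dist)
        return i < len(tss) and tss[i] <= end + max_dist

    hits = sum(1 for start, end in cgis if near(start, end))
    return hits / len(cgis)
```

The result depends strongly on the chosen distance and on the TSS annotation used, which is one more reason why such a fraction is hard to compare between studies.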

Unsolved problems and perspectives
Despite the huge amount of works in the area commonly accepted definition of CpG islands still doesn't exist.Most likely such situation is a result of difficulties with biological verification of predictions (Segal, 2006).Authors of SWMs and to lower extend of clustering algorithms choose the parameters arbitrarily complicating biological interpretations.Authors of machine-learning techniques usually find too many distinguishing parameters important in their models, which are not important in modeling of similar processes in other cases.Specifically it should be emphasized that all attempts to construct CGI prediction algorithm based on simple DNA sequence properties (GC content, Obs/Exp CpG , distance between neibouring CpG dinucleotides) having in mind prediction of complex biological feature (promoter regions, unmethylated regions and so on) bring about a high level of false positive predictions.For example, in case of promoter CGI prediction at least one third of CGIs are located far from promoters.It admits of no doubt that existing CGI searchers find a chimeric class of DNA segments, which don't have single common function.A collection of DNA motifs relevant to different biological functions could result into more adequate CGI definition.For instance, GC-skew and known core promoter elements could help to find CGI or regions within them related to TSS.Speaking on another feature of CGIs, namely lack of DNA methylation, it should be mentioned that new high-throughput techniques show that not all CpG within CGIs are unmethylated in normal cells, as previously believed.Nowadays it became clear that not only CpGs but also CpNpGs are subject to methylation (Lister et al., 2009).Such a motif also should be included in CGI prediction model (Hackenberg et al., 2010b).The ability of a CGI searcher to predict DMRs but not unmethylated regions seems more appropriate for quality evaluation.(Dai et al., 2008;Rakyan et al., 2008;Previti et al., 2009).
Unfortunately, we still lack high-quality, high-resolution data on genome-wide DNA methylation in different tissues, developmental stages and conditions. High-throughput techniques such as MeDIP, MeDIP-seq (Down et al., 2008), MethylCap-Seq (Brinkman et al., 2010) and bisulphite conversion based methods (RRBS (Eckhardt et al., 2006) and Methyl-seq (Lister et al., 2009)) give hope for a complete map of DMRs in the near future, which will help with CGI validation.
There is a lot of evidence that methylated cytosines can also play an important functional role as sites for methyl-binding proteins. We still do not have enough reliable data on motif preferences for all such proteins, but we expect the ChIP-seq technique (Mardis, 2007) to help with this issue. There is also evidence that it is premature to exclude Alu and other repetitive, mostly methylated sequences from consideration when discussing CGI functions.
To resolve these problems it is necessary to identify as many biological functions associated with CGIs as possible, and either to find the structural elements within CGIs related to those functions or to separate CGIs into several functional groups. Such an approach should result in a more precise and biologically adequate definition of CGIs and, therefore, in the construction of a relevant algorithm with low false positive and false negative rates, which in turn will improve our knowledge of the genetic and epigenetic regulation of genome functioning.

Comparison of different algorithms
Many comparisons between algorithms for CGI search have been performed. This work focuses on the study of various genome features potentially related to CGIs. Three algorithms for CpG island search take part in the comparison: UCSC CGI, CpGcluster (with the p-value threshold for clusters equal to 1.0e-10, 1.0e-15 and 1.0e-20) and CGI HW (the algorithm implemented by Wu and colleagues). I prefer to focus on the "new wave" algorithms and to use UCSC CGI as a reference, because the latter is currently the most widespread. The ENCODE regions of the human genome (version hg18) were used for the study. All annotations were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/. Standard sensitivity (3) and specificity (4) measures were used to assess prediction quality.
$$Sn = \frac{L_{TP}}{L_{TP} + L_{FN}} \qquad (3)$$

$$Sp = \frac{L_{TN}}{L_{TN} + L_{FP}} \qquad (4)$$

where $L_{TP}$ is the total length (bp) of the overlap of CGIs with the tested annotation, $L_{FP}$ is the total length (bp) of CGIs not overlapping the tested annotation, $L_{FN}$ is the total length (bp) of the tested annotation not overlapping CGIs, and $L_{TN}$ is the total length (bp) of the ENCODE regions overlapping neither the tested annotation nor CGIs.
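As an illustration of how such length-based measures can be computed, the following minimal Python sketch takes CGI predictions and a tested annotation as lists of half-open (start, end) intervals on a single region. It assumes the standard definitions Sn = L_TP / (L_TP + L_FN) and Sp = L_TN / (L_TN + L_FP); the function names and the single-region coordinate model are illustrative assumptions and not part of the original analysis.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def total_length(intervals):
    """Total length in bp of a disjoint interval list."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Total bp shared by two sorted, disjoint interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def sn_sp(cgis, annotation, region_length):
    """Length-based sensitivity and specificity.

    Assumes all intervals lie inside one region of `region_length` bp and
    that both the annotation and its complement are non-empty.
    """
    cgis, annotation = merge(cgis), merge(annotation)
    l_tp = overlap_length(cgis, annotation)
    l_fp = total_length(cgis) - l_tp
    l_fn = total_length(annotation) - l_tp
    l_tn = region_length - l_tp - l_fp - l_fn
    return l_tp / (l_tp + l_fn), l_tn / (l_tn + l_fp)
```

For several ENCODE regions, the four lengths would be summed over all regions before the two ratios are taken.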

Basic statistics
As a first step I collected summary statistics for the CGIs predicted by the different algorithms. CGI HW covers more than 2.2% of the total length of all ENCODE regions. CpGcluster (p-value 1.0e-20 as recommended in (Hackenberg et al., 2010a)

Conclusions
In summary, no single algorithm for CGI search predicts all biologically relevant features with appropriate accuracy. In all cases many false positives and false negatives appear.
All the algorithms in the comparison have their strengths. CpGcluster (p-value = 1.0e-15 and p-value = 1.0e-20) demonstrates the highest specificity in TSS prediction. Although such CGIs capture the smallest fraction of CAGE-tags, this may not be a disadvantage, as we do not know for sure the proportion of GC-rich and AT-rich promoters. The largest fraction of CGI length covered by TFBS is found for the CGIs predicted by CpGcluster; on the other hand, the largest part of their adjacent regions is also covered by TFBS. This led me to the conclusion that CpGcluster finds "cropped" promoter CGIs, especially in the case of p-value = 1.0e-20.
In contrast, CGI HW demonstrates the best sensitivity in CTCF binding site and rDMR prediction. CGIs from CGI HW are associated with at least some origins of replication, whereas those of the other algorithms (with recommended parameters) are not. They are also more prone to capture variation between humans. These CGIs also capture the highest fraction of TSS. So, CGI HW finds regions with broad regulatory potential. However, all those features are related to DNA methylation, which allows me to assume that CGI HW finds DMR-associated CGIs.
UCSC CGI demonstrates moderate behavior. This algorithm has intermediate sensitivity in both TSS and rDMR prediction. Its CGIs show the largest decrease of TFBS in the adjacent regions and the highest sensitivity for DNase sensitivity regions. It looks like UCSC CGI finds CGIs around promoters and also includes regulatory regions, so these are promoter-region CGIs.
It is quite clear that a CGI is a complex object that does not correspond to any single biological feature. It seems more appropriate to single out a class of interconnected biological features: differential DNA methylation, active transcription in at least one cell type or developmental stage, and replication. The CGI HW algorithm makes the first step in this direction, whereas CpGcluster (with a stringent p-value threshold) moves in the opposite direction and finds a specific, narrow class of promoters. The traditional UCSC approach still holds its ground, demonstrating comparable or, in some respects, even higher quality. Hence the CpG island problem is still far from a final solution.
The authors use the numbers of C, G and CpG in a segment of length L as parameters for the model. The hidden state Y(s) for a segment is 1 for CGI and 0 for baseline. The authors assume that Y(s) is a stationary first-order Markov chain. The choice of the state is based on two HMMs.
CGI_HW (the algorithm of H. Wu) assumes that each chromosome is divided into three states: Alu repetitive elements, baseline and CGI. Alu repetitive elements are removed in advance. Hence, the authors characterize the problem as that of a semi-HMM with a known state for Alu repetitive elements, and consider the 2-state chain conditional on being in a non-Alu state.
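To make the idea of a two-state hidden chain Y(s) more concrete, a deliberately simplified sketch of Viterbi decoding over per-segment CpG counts is given below. It is not the CGI_HW model: it uses a single chain instead of two HMMs, ignores the C and G counts, and assumes Poisson emissions. The rates, the probability of staying in the same state and the initial probabilities are arbitrary illustrative values.

```python
import math


def poisson_logpmf(k, rate):
    """Log-probability of observing count k under a Poisson(rate) model."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def viterbi_two_state(cpg_counts, rates=(1.0, 6.0), stay=0.95, prior=(0.9, 0.1)):
    """Most likely path of Y(s) in {0: baseline, 1: CGI} for per-segment CpG counts.

    All parameter values are hypothetical; emissions are assumed to be Poisson.
    """
    if not cpg_counts:
        return []
    log_trans = [[math.log(stay), math.log(1 - stay)],
                 [math.log(1 - stay), math.log(stay)]]
    # Best log-score of any path ending in each state at the first segment.
    score = [math.log(prior[y]) + poisson_logpmf(cpg_counts[0], rates[y])
             for y in (0, 1)]
    back = []  # back[t][y]: best previous state if segment t+1 is in state y
    for k in cpg_counts[1:]:
        step, new_score = [], []
        for y in (0, 1):
            prev = max((0, 1), key=lambda x: score[x] + log_trans[x][y])
            step.append(prev)
            new_score.append(score[prev] + log_trans[prev][y]
                             + poisson_logpmf(k, rates[y]))
        back.append(step)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max((0, 1), key=lambda y: score[y])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]
```

Runs of consecutive segments decoded as state 1 would then be reported as candidate CGIs.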

Table 1. Basic statistics for different CGIs.

CpGcluster with a p-value of 1.0e-20 demonstrates the smallest genome coverage, 0.6%. CpGcluster predicts shorter CGIs with higher average GC content and Obs/Exp CpG value compared to the other algorithms. UCSC CGI yields the largest average number of CpGs per CGI. In general, CGI HW finds more "relaxed" CGIs than UCSC CGI (with lower GC content, Obs/Exp CpG value and CpG frequency), whereas CpGcluster finds more "strict" CGIs than UCSC CGI.

It is widely accepted that a large fraction of CGIs is found around the TSS of protein-coding genes. Recent studies show that the total number of TSS is about 10 times higher than the number of protein-coding genes, so it seems more appropriate to test CGI searchers for their ability to find TSS of any type. Several experimental techniques are able to detect TSS of any type. Cap analysis of gene expression (CAGE) is one of the best-known techniques for producing a snapshot of the 5' ends of the total cellular RNA transcribed by PolII. A collection of CAGE-tags (the encodeRikenCagePlus and encodeRikenCageMinus tables from UCSC) was used as a representative set of PolII TSS.
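The per-CGI sequence statistics discussed for Table 1 (GC content, Obs/Exp CpG and CpG frequency) can be computed with a few lines of Python. The sketch below assumes the conventional Obs/Exp CpG ratio, (number of CpG × length) / (number of C × number of G); the function name cgi_stats and the dictionary keys are illustrative.

```python
def cgi_stats(seq):
    """Basic sequence statistics for one candidate CGI given as a DNA string."""
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so this counts every CpG
    return {
        "length": length,
        "gc_content": (n_c + n_g) / length if length else 0.0,
        # Observed CpG count over the count expected from C and G frequencies.
        "obs_exp_cpg": n_cpg * length / (n_c * n_g) if n_c and n_g else 0.0,
        "cpg_frequency": n_cpg / length if length else 0.0,
    }


stats = cgi_stats("CGCGATCG")
print(stats["gc_content"], stats["cpg_frequency"])  # 0.75 0.375
```

Averaging these values over all CGIs predicted by one algorithm would give summary rows of the kind compared in the text.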

Table 2. CAGE-tags clusters within different CGIs.

Table 2 shows that CGI HW has the lowest sensitivity, although its CGIs capture the highest fraction of CAGE-tags clusters. CpGcluster20 demonstrates the highest selectivity and specificity but captures only 39% of CAGE-tags clusters. UCSC CGI has intermediate values of Sn and Sp.

TFBS prediction. Although TFBS prediction is a classical problem in computational molecular biology, the reliable prediction of a single TFBS still remains tricky. I used TFBS conserved in the human/mouse/rat alignment based on the Transfac Matrix Database (the tfbsConsSites and tfbsConsFactors tables from UCSC). Although using conserved TFBS leads to the omission of all types of species-specific regulatory regions, conserved TFBS are more likely to be functional than other predicted TFBS. Table 3 demonstrates that CpGcluster predicts CGIs with fewer different TFs and lower sensitivity compared to UCSC CGI and CGI HW. The highest fraction of total TFBS length is covered by CGI HW; the very same algorithm shows the highest sensitivity and the lowest specificity. It is not obvious what fraction of CGIs one should expect to be covered by TFBS, but CpGcluster20 demonstrates the largest coverage (about 19%).

Table 3. Conserved TFBS within different CGIs.

As it is difficult to estimate the expected coverage by TFBS, I compared the coverage of CGIs with the coverage of their adjacent regions of 100 bp. The results in Table 4 show that all regions adjacent to CGIs contain conserved TFBS.
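The comparison of CGI bodies with their adjacent regions can be sketched as follows. The code computes the fraction of bp covered by TFBS inside the CGIs and inside the flanks on both sides of each CGI. The half-open (start, end) interval model, the function names and the handling of overlapping TFBS are illustrative assumptions; CGIs are assumed not to overlap each other, and flanks are not clipped at the end of the region.

```python
def covered_bp(region, features):
    """bp of a half-open region (start, end) covered by features, which may overlap."""
    start, end = region
    clipped = sorted((max(s, start), min(e, end))
                     for s, e in features if s < end and e > start)
    total, cursor = 0, start
    for s, e in clipped:
        if e > cursor:
            # Count only the part not already covered by earlier features.
            total += e - max(s, cursor)
            cursor = e
    return total


def body_vs_flank_coverage(cgis, tfbs, flank=100):
    """Fraction of bp covered by TFBS in CGI bodies and in their +/- flank regions.

    Requires at least one CGI; raises ZeroDivisionError otherwise.
    """
    body_bp = body_cov = flank_bp = flank_cov = 0
    for start, end in cgis:
        body_bp += end - start
        body_cov += covered_bp((start, end), tfbs)
        for region in ((max(0, start - flank), start), (end, end + flank)):
            flank_bp += region[1] - region[0]
            flank_cov += covered_bp(region, tfbs)
    return body_cov / body_bp, flank_cov / flank_bp


body, flanks = body_vs_flank_coverage(
    [(1000, 1500)], [(1100, 1200), (1150, 1250), (1480, 1520), (900, 910)])
print(body, flanks)
```

The ratio of the two returned values gives the fold reduction of TFBS coverage in the adjacent regions relative to the CGI body.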

Table 4. Conserved TFBS within +/-100 bp around different CGIs.

The last row of Table 4 shows the reduction of coverage in CGI adjacent regions compared to CGI bodies. The adjacent regions of UCSC CGI and CGI HW contain more than 12 and 6 times less TFBS than the CGI body, respectively. One should expect some TFBS around a CGI, which can function as the CGI's boundaries. On the other hand, if we believe that the CGI itself is the regulatory region, the expected amount of TFBS in the adjacent regions should be dramatically lower than in the CGI body, which is not the case for CpGcluster.

Insulators. CTCF is well known as a DNA-binding protein acting both as a transcription factor and as an insulator protein. To test which CGI prediction algorithm finds more CTCF binding sites I used data on CTCF binding (the oreganno and oregannoAttr tables from UCSC). One can see that CGI HW shows the highest sensitivity in CTCF binding prediction. It is also important to mention that CGIs from CGI HW contain more than 25% of all CTCF sites. CpGcluster10 shows the second best result, and the quality of prediction decreases for CpGcluster15 and CpGcluster20.

Table 5. CTCF binding sites within different CGIs.

DNase sensitivity regions are often considered regions of open chromatin, which correspond to regulatory regions of all types. To test which algorithm predicts CGIs more often associated with DNase sensitivity regions I used the joined data for several tissues available in UCSC (table wgEncodeRegDnaseClustered). All CGIs demonstrate rather good association with DNase sensitivity regions: at least one third of their length is located in sensitive areas. UCSC CGI shows the highest sensitivity and rather good specificity. A large fraction of CpGcluster CGIs is also associated with DNase sensitivity regions, although the sensitivity of the algorithm is not very good.

Table 6. DNase sensitivity regions within CGIs predicted by different algorithms and quality of prediction.

Data on regions differentially methylated during development (rDMRs) were downloaded from UCSC (table rdmr). Table 7 shows that CGI HW predicts CGIs located near over 43% of all rDMRs. This algorithm also demonstrates the best sensitivity in this case. It should be mentioned that CpGcluster20 has the lowest sensitivity, and its CGIs are located near only 7% of rDMRs.

Table 7. rDMRs within different CGIs.

To find out whether replication origins are preferentially found by any of the CGI searchers, data from the encodeUvaDnaRepOriginsNSGM table were used. Only CGI HW and CpGcluster10 find replication origins within or around (+/-100 bp) CGIs: 5 and 2, respectively. The other algorithms (and CpGcluster with stricter parameters) are unable to find any replication origins.

Polymorphic loci. Data from SNP130 were used to study polymorphic loci within different CGIs. CGIs from CGI HW contain the highest fraction of SNPs and demonstrate the highest sensitivity, so one should expect more interindividual variants within those CGIs.