Single Nucleotide Polymorphisms and Colorectal Cancer Risk: The First Replication Study in a South American Population

Colorectal cancer (CRC) heritability is determined by the complex interaction between inherited variants and environmental factors. CRC incidence rates have been increasing, especially in developing countries such as Brazil, where CRC is the third most frequent cancer in both genders. Genome-wide association studies (GWAS), based on thousands of cases and controls typed at thousands of single nucleotide polymorphisms (SNPs), have identified several variants that associate with gastrointestinal cancer risk. Less than half of the familial risk has been elucidated through GWAS, which have identified common SNPs almost exclusively in European populations. Replication studies in admixed heterogeneous populations are scarce, and most failed to replicate all the imputed SNPs. Population stratification by ethnic subgroups with different allele frequencies, and therefore different patterns of linkage disequilibrium, may cause spurious associations. Here, we present the first replication study of CRC inherited susceptibility in South America, which aimed to identify known SNPs associated with CRC risk in European populations.


Epidemiology
Colorectal cancer (CRC) is one of the most prevalent cancers in both genders worldwide, responsible for about 10% of all neoplasms, mainly in developed industrialized countries and regions, such as Australia and New Zealand, North America, and Europe [1].
The detection and early removal of premalignant lesions reduce CRC mortality, as confirmed by several studies that screened populations at general risk using the fecal occult blood test. In addition, the use of flexible sigmoidoscopy for screening has shown promising results in randomized trials in the United Kingdom [5] and the United States [6], where significant reductions in both incidence and mortality were observed. Improved survival was also observed with this approach in genetically determined high-risk groups [7]. Therefore, individuals determined to be at high risk could be offered more intensive surveillance, with colonoscopy or flexible sigmoidoscopy performed periodically at shorter intervals. Colonoscopy is already offered to individuals at high risk due to personal or familial history of CRC, as well as to families with Lynch syndrome and intestinal polyposis syndromes, for which more assiduous surveillance is recommended [8]. Stratifying the general population into risk categories would thus allow the individualization of screening and prevention strategies.

Molecular pathogenesis
The classic adenoma-carcinoma sequence has revealed an intricate molecular pathogenesis of CRC, in which tumor suppressor genes are inactivated and proto-oncogenes are activated through several signaling pathways, such as APC-β-catenin, RAS-RAF, PIK3CA-PTEN, and TGF-β [9].
Three main molecular mechanisms are involved in CRC pathogenesis: chromosomal instability, microsatellite instability, and the serrated polyp pathway. The first occurs in most sporadic cancers, where the accumulation of mutations, rearrangements, and aneuploidy drives malignant transformation over decades [10]. The second occurs in about 15% of sporadic CRC and in most hereditary CRC. In sporadic CRC, an epigenetic event (hypermethylation) occurs in CpG islands of mismatch repair (MMR) gene promoters, silencing them and leading to genetic instability in microsatellite regions of the genome [11]. In Lynch syndrome, mutations in MMR genes lead to microsatellite instability and accelerate the adenoma-carcinoma sequence, which is why Lynch syndrome families develop cancer in their 40s or even earlier. The more recently described serrated polyp pathway involves molecular mechanisms other than the classic adenoma-carcinoma sequence but has not yet been fully elucidated [12].

Risk factors
Colorectal carcinoma is a multifactorial disease, in which complex interactions between genetic and environmental factors determine individual risk. Among the latter are diets rich in red meat and animal fat and low in fiber, smoking, alcohol consumption, obesity, sedentary lifestyle, and chronic inflammatory bowel disease [13]. In addition to age, gender, and previous history of polyps, familial history is considered the main risk factor: the relative risk among siblings is two to three times higher than in the general population [14].
Traditionally, CRC has been classified into sporadic and hereditary. The concept of familial CRC reflects one end of a risk spectrum determined by the contribution of genetic susceptibility variants. Most cases are sporadic, with no family history or known genetic susceptibility. Most of the CRC susceptibility genes were identified in families affected by inherited syndromes, which are caused by mutations with high penetrance. These syndromes account for about 6% of CRC cases and can be classified as syndromes with or without gastrointestinal polyposis [15]. Among the main syndromes with polyposis are familial adenomatous polyposis (FAP), caused by mutations in the APC gene; Peutz-Jeghers syndrome, attributable to mutations in the STK11 gene; juvenile polyposis, associated with the BMPR1A gene; and Cowden syndrome, related to the PTEN gene. Among non-polyposis syndromes, the most prevalent is Lynch syndrome, which accounts for about 3% of all CRC cases and is caused by mutations in the genes that repair mismatches arising during DNA replication (MLH1, MSH2, MSH6, PMS2, and EPCAM) [16].
Most of the mutations identified in familial CRC are highly penetrant, that is, carriers have a high chance of developing cancer during their lifetime. However, there are families with CRC clusters that do not have mutations in genes associated with hereditary syndromes. This raises the hypothesis that there are other variants or mutations with low penetrance that make certain individuals more susceptible to CRC development. Studies of siblings with and without CRC, as well as several association studies, have identified regions in the human genome in which single nucleotide polymorphism (SNP) variants are associated with CRC susceptibility [17].
Up to 25% of cases are familial CRC aggregations, whose heritability has been partially uncovered by GWAS SNPs [18]. However, a large proportion of the familial risk remains unexplained, the so-called missing heritability.

Single nucleotide polymorphisms
Single nucleotide polymorphisms (SNPs) are variations of the human genome, where two or occasionally three alternative nucleotides are common in the population. In most cases, an SNP has two alternative forms, termed alleles, for example, A or G at a certain position in the genome.
An estimated 10 million SNPs exist in the human genome, representing, along with other types of polymorphisms (such as copy number variations), about 90% of human genetic variation, including susceptibility to disease. Any two individuals are 99.5% identical in their DNA sequences, and there is about one SNP every 1000 base pairs [19].
Variants that have been deleterious during evolution are particularly rare due to natural selection. In turn, pathogenic variants that are deleterious in the homozygous state may have become neutral or may be maintained by balancing selection because they confer an advantage on asymptomatic heterozygotes. Therefore, alleles of frequent SNPs are not expected to have any significant phenotypic effect, because natural selection would either eliminate a detrimental allele (negative selection) or fix a beneficial one (positive selection). Moreover, most SNPs are located not in coding or regulatory sequences but in intergenic sequences [20].

Genome-wide association studies
Searching for population associations is an attractive option to identify disease susceptibility genes. Association studies are easier to conduct than linkage studies because they do not require multiple family cases segregating the phenotype. However, they depend on linkage disequilibrium (LD), the nonrandom association between alleles at different loci, with a susceptibility factor, which can only be detected by markers located in the same haplotype block (the set of alleles at linked loci on a single chromosome) close to the factor [21].
SNPs are the markers of choice for studying LD for three reasons: (1) they are abundant enough to allow very short chromosome segments to be examined; (2) they have a lower mutation rate than microsatellites; and (3) they are easily genotyped on a large, genome-wide scale [20].
The structure of LD in the human genome was investigated by the HapMap project, whose first result was a list of more than 3 million SNPs that captured most of the common genomic variation in some populations [22].
Genome-wide association studies (GWAS) are based on the LD principle at the population level, which is usually the result of a particular ancestral haplotype common in a population. Usually, loci that are physically close exhibit stronger LD than those that are distant on a chromosome. The genomic distance at which LD decays determines how many genetic markers are required to "tag" a haplotype block, and the number of such markers is much smaller than the total number of segregating variants in the population. For example, the selection of about 500,000 common SNPs in the human genome is sufficient to "tag" the common variants in non-African populations, even though the total number of SNPs is greater than 10 million [22]. These SNPs are called tagSNPs.
Although GWAS are not influenced by prior biological knowledge or by the genomic location of SNPs, they are influenced by LD between genotyped SNPs and non-genotyped causative variants. The strength of the statistical association between the alleles at two loci in the genome depends mainly on their allelic frequencies. Thus, a rare variant (minor allele frequency, MAF, less than 0.01) will be in low LD with a neighboring common variant, even if they lie in the same recombination interval. However, most of the SNPs selected for SNP arrays are common (MAF higher than 0.05), and therefore GWAS have the power to detect associations with variants that are relatively common in the population [21]. On the other hand, it has been suggested that the observed association between a common SNP and a complex trait may result from LD of the SNP with rare variants at the same locus. Since common alleles and rare causal variants are only weakly correlated (low LD), the hypothesis of a "synthetic association" implies that the magnitude of the effect of the causative variants is much greater than that of the common SNPs genotyped by the GWAS. For example, if an SNP explains 0.1% of the phenotypic variance in the population, the causal variant would account for 5-10% [23].
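The dependence of LD strength on allele frequency described above can be made concrete with a minimal sketch. The function name `ld_statistics` and the haplotype frequencies used are illustrative assumptions, not data from this study; the sketch uses the standard definitions of D, D', and r² for two biallelic loci.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # deviation from random association
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical rare variant (frequency 0.01) always found on haplotypes
# that carry a common allele (frequency 0.30).
rare_vs_common = ld_statistics(p_ab=0.01, p_a=0.01, p_b=0.30)

# Hypothetical pair of common alleles (both 0.30) that always occur together.
common_vs_common = ld_statistics(p_ab=0.30, p_a=0.30, p_b=0.30)

print(rare_vs_common)
print(common_vs_common)
```

In both hypothetical cases D' is 1 (no recombinant haplotypes), but r² is only about 0.02 for the rare variant, whereas it is 1 for the pair of common alleles. This is why a common tagSNP is a poor proxy for a rare variant even when the two are completely linked.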

GWAS and CRC
Most of the studies designed to identify low-penetrance alleles for CRC susceptibility were based on a candidate gene approach, using genes whose role in CRC pathogenesis was supposedly known. However, without a real understanding of the biology of predisposition, the choice of genes was problematic. Thus, until the advent of GWAS, few or no association studies based on this approach were able to identify susceptibility alleles unequivocally associated with CRC risk [17].
The number of common variants contributing more than 1% of the inherited risk is very low, and it is very unlikely that there are other SNPs with similar effect sizes (greater than 1.2) for alleles with frequencies greater than 20% in European populations. In fact, GWAS identified on average 80% of the common SNPs in this population but only 12% of SNPs with a minor allele frequency (MAF) between 5 and 10% [17].
However, variants with this profile, if taken collectively, can confer substantial risks due to their multiplicity, and in the case of CRC they explain, to date, about 10% of heritability [33]. In a model built on data from the Scottish GWAS, about 170 common independent variants would explain all the genetic variance of CRC [35]. Therefore, most of the genetic susceptibility to CRC remains to be defined, the so-called "missing heritability". There are other possible causes of this unidentified portion: (1) the effect of rare variants; (2) failure to identify causal variants; and (3) allelic heterogeneity [36].
GWAS strategies designed to identify common alleles of modest risk are not ideal for identifying rare variants (MAF below 1%) with potentially greater effects, or for capturing copy number variants and other structural variants, such as insertions, complex rearrangements, or expansions of microsatellite repeats, which may alter the risk of CRC. As efforts are made to scale up GWAS meta-analyses, both in terms of sample size and SNP coverage, and to increase the number of SNPs considered for large-scale replication, it will become feasible to discover new variants. It is possible that a multilocus approach based on haplotype markers will identify rare alleles. In addition, the use of exome sequencing may provide a more effective strategy for finding such variants [37].

Objectives
The overall objective of the present study was to replicate, in individuals from the Brazilian population, the 10 SNPs associated with CRC risk that were previously described in European populations. The specific objectives were to (1) calculate the allelic and genotype frequencies of the 10 SNPs in cases and controls; (2) analyze the association between the genotypes and alleles of the 10 SNPs and CRC risk; (3) calculate the magnitude of the effect on CRC risk; and (4) correlate the genotypes of the 10 SNPs with clinical-pathological characteristics and with familial history.

Patient selection criteria
This is a retrospective case-control genetic association study, whose sample comprised 727 cases and 740 controls recruited from the Departments of Pelvic Surgery, Clinical Oncology, and Community Medicine at AC Camargo Cancer Center, in São Paulo, Brazil. All patients and controls authorized the present study by signing the informed consent form previously approved by the Research Ethics Committee of the institution under number 1231/09.
The inclusion criteria for cases were a CRC diagnosis before age 75 years or an advanced colorectal adenoma (villous histology and/or greater than 1 cm and/or severe dysplasia) diagnosis before age 60 years. Controls were individuals without CRC who did not have first- or second-degree relatives with CRC. Controls were not matched with the cases in relation to socioeconomic condition, ancestry, or self-referred ethnicity. The exclusion criteria were the presence of hereditary syndromes of predisposition to CRC, immunohistochemistry tests showing absence of proteins encoded by DNA mismatch repair genes, the presence of high-penetrance germline mutations in CRC susceptibility genes, appendix tumors, and/or previous chronic inflammatory bowel disease.

Statistical analysis
All tests were corrected for multiple analyses to avoid type I error. The allelic and genotypic frequencies of each SNP were calculated using the DeFinetti program [38], and the deviations of the genotype frequencies in cases and controls from those predicted by the Hardy-Weinberg equilibrium were tested by the chi-square test with one degree of freedom or by Fisher's exact test if the expected cell count was less than five.
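The Hardy-Weinberg check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DeFinetti program's implementation: the function name `hwe_chi_square` and the genotype counts are hypothetical, and only the chi-square branch (one degree of freedom) is implemented. The function merely flags when an expected count falls below five, the situation in which the text calls for Fisher's exact test instead.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb : observed counts of the three genotypes.
    Returns (chi2, p_value, small_expected), where small_expected is True
    when any expected count is below five.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if not 0 < p < 1:
        raise ValueError("locus is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, min(expected) < 5


# Hypothetical genotype counts with a deficit of heterozygotes.
chi2, p_value, small_expected = hwe_chi_square(30, 40, 30)
print(chi2, p_value, small_expected)
```

With these hypothetical counts the allele frequency is 0.5, so 25, 50, and 25 genotypes are expected, giving a chi-square of 4.0 and a p-value of about 0.046.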
Association analyses between the genotypes found in cases and controls for each SNP were performed with several types of genetic models, using the SNP and Variation Suite Version 7.6.10 program [39]. Multiple analyses were corrected by false discovery rate and Bonferroni.
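The two multiple-testing corrections named above can be illustrated with a short sketch. The text does not state which false discovery rate procedure was applied, so the Benjamini-Hochberg step-up procedure is assumed here; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four SNPs.
raw = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the probability of any false positive and is the more conservative of the two; the FDR approach controls the expected proportion of false positives among the reported associations, which is why both are often reported side by side.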

Clinical characteristics of cases and controls
Of the 727 cases included in this study, 51% were male, with a median age at diagnosis of 56.9 ± 10.1 (SD) years; 30% fulfilled the Bethesda criteria; 3% of tumors were high-risk adenomas; the most common CRC site was the rectum; and about 10% of cases had an extracolonic second primary tumor. The most frequent diagnosis was moderately differentiated tubular adenocarcinoma at clinical stage III. The majority of patients were alive and disease-free at data collection, and about 30% of cases had no familial history of CRC, although almost 20% did not know whether they had affected relatives. Of the 740 controls included in this study, 52% were female, with a median age of 51.9 ± 12.3 (SD) years. Cases and controls were comparable in age and sex (p = 0.126 and 0.193, respectively).

SNP genotyping and association tests
The genotypic frequencies of each SNP in cases and controls and their p-values are depicted in the following graphics. The allelic frequencies, the number of alleles, the genotyping rate, and the Hardy-Weinberg equilibrium (HWE) test are presented in Table 1.
The genetic association tests and their genetic models are shown in Table 2.
Of the 10 SNPs, 5 (06, 09, 16, 82, and 83) were significantly associated (p ≤ 0.05) with the risk of CRC and 2 (26 and 71) showed a trend toward association (p < 0.1). SNP 06 showed the most significant association among all SNPs in all genetic models. SNP 09 was the only predictor of low risk, mainly in the dominant model (25% lower), whereas SNPs 16 and 82 were associated with high risk in the recessive model (45 and 85% higher, respectively). SNP 83 conferred a higher risk principally in the dominant model (almost 50%). SNPs 26 and 71, on the other hand, reached a marginally significant association only in the allelic and additive models, with a trend toward a higher risk with SNP 71 and a lower risk with SNP 26. To sum up, of the five SNPs associated with CRC risk, two (SNPs 16 and 82) conferred a higher risk among rare homozygotes than among heterozygotes and common homozygotes together (recessive model), and three (06, 09, and 83) showed an altered risk among heterozygotes and rare homozygotes together compared with common homozygotes (dominant model), which was higher for SNPs 06 and 83 and lower for SNP 09. Table 3 shows the five SNPs associated with CRC risk with their respective wild-type and variant alleles, their risk allele frequencies in comparison with the European population, the effect size of the variant allele, and their populational attributable risks, that is, the decrease in disease incidence that would occur if the population were not exposed to the risk allele.
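The dominant, recessive, and allelic comparisons summarized above differ only in how the three genotype classes are collapsed into a 2 × 2 table. The following is a minimal sketch of that collapsing step, with hypothetical function names and genotype counts; it is not the SNP and Variation Suite implementation and omits confidence intervals and significance tests.

```python
def odds_ratio(case_exposed, case_unexposed, control_exposed, control_unexposed):
    """Odds ratio of a 2 x 2 table."""
    return (case_exposed * control_unexposed) / (case_unexposed * control_exposed)


def model_odds_ratios(cases, controls):
    """Odds ratios under three genetic models.

    cases, controls : (common homozygotes, heterozygotes, rare homozygotes)
    """
    c0, c1, c2 = cases
    k0, k1, k2 = controls
    return {
        # Dominant: carriers of at least one rare allele vs common homozygotes.
        "dominant": odds_ratio(c1 + c2, c0, k1 + k2, k0),
        # Recessive: rare homozygotes vs all other genotypes.
        "recessive": odds_ratio(c2, c0 + c1, k2, k0 + k1),
        # Allelic: rare allele count vs common allele count.
        "allelic": odds_ratio(c1 + 2 * c2, 2 * c0 + c1, k1 + 2 * k2, 2 * k0 + k1),
    }


# Hypothetical genotype counts, not taken from Table 2.
print(model_odds_ratios(cases=(40, 40, 20), controls=(50, 40, 10)))
```

For these hypothetical counts the odds ratio is 1.5 under the dominant model and 2.25 under the recessive model, showing how the same data can look stronger or weaker depending on the model chosen.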

Discussion
In common diseases, such as CRC, it is estimated that most of the genetic risk is due to multiple inherited loci following a polygenic model, each with a common allele (MAF greater than 5%) whose effect size is small, with odds ratios between 1.0 and 1.5 [17]. Thus, a large sample size is necessary to detect those small effects. This strategy was validated by meta-analyses of GWAS data from European populations, with tens of thousands of individuals genotyped on high-throughput platforms, followed by multiple validation phases with independent series of cases and controls. Even so, only about 20 common SNPs with modest effects have been identified so far, each with a p-value below the threshold corrected for multiple testing (<5.0 × 10⁻⁸). In Table 4, GWAS data from European populations are compared with this study. The populational attributable risk (PAR) was calculated as PAR = RAF(OR − 1) / (1 + RAF(OR − 1)), where RAF is the risk allele frequency and OR is the odds ratio. In this study, there was an association with CRC risk for half of the SNPs (06, 09, 16, 82, and 83), whose risk alleles had frequencies similar to those in European GWAS, except for SNP 06. Effect sizes were modest, as in European GWAS. SNP 06 was the variant with the greatest effect and the most statistically significant association (p trend = 3.49 × 10⁻⁵), and it conferred the highest populational risk, whereas in European GWAS it increased the risk by up to 23%, representing 8.6% of the populational risk [35]. In the original study, the same SNP also showed the strongest association (p trend = 1.0 × 10⁻¹²) [27]. SNP 09 (rs10411210) was associated with a low risk in a dose-dependent way [31]. In this study, however, this effect was detected only in the dominant model. It is noteworthy that in European studies the major allele (C) confers a 15% higher risk and is responsible for 12% of the populational risk [35]. In the present study, there was a trend toward a higher risk, but it was not statistically significant (p = 0.08).
Moreover, SNP 16 was associated with a higher risk, and a higher populational attributable risk, in this study than in European GWAS [35]. Likewise, SNPs 82 and 83 increased CRC risk and populational risk more in the present study than in European GWAS [35].
In the present study, population stratification by ancestry was not investigated. However, the Brazilian population, although greatly admixed, has a high prevalence of individuals of European ancestry, the great majority of whom are located in the South (79.5%) and Southeast (74.2%) [40].
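The populational attributable risk formula quoted in this Discussion, RAF(OR − 1) / (1 + RAF(OR − 1)), can be evaluated directly. The sketch below uses a hypothetical function name and an assumed, illustrative risk allele frequency of 0.90; only the odds ratio of 1.15 comes from the European figure cited in the text.

```python
def populational_attributable_risk(raf, odds_ratio):
    """PAR = RAF * (OR - 1) / (1 + RAF * (OR - 1)).

    raf        : risk allele frequency in the population (0 to 1)
    odds_ratio : effect size of the risk allele
    """
    excess = raf * (odds_ratio - 1)
    return excess / (1 + excess)


# Assumed RAF of 0.90 (illustrative) with an odds ratio of 1.15.
par = populational_attributable_risk(raf=0.90, odds_ratio=1.15)
print(f"{par:.1%}")
```

Under this assumed frequency the result is close to 12%, which shows how a very common allele with a modest odds ratio can still account for a sizeable share of cases at the population level. An odds ratio of 1 gives a PAR of zero, and an odds ratio below 1 (a protective allele) gives a negative value.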

Conclusion
This study partially replicated European GWAS findings in a Southeastern Brazilian population with a predominantly European genetic background. The small sample size and the lack of stratification by ancestry make the study prone to type II and type I errors, respectively. Further studies in admixed populations would help to uncover the missing heritability of CRC and to build the genetic architecture of CRC susceptibility.