Discovering the Genetics of Autism



Introduction
Autism is a complex neurodevelopmental disorder characterized by social isolation, language deficits, and repetitive or stereotyped behaviors. Autism spectrum disorder (ASD) has received a great deal of attention in recent years, not only because of the increasing rate of affected children but also because of the social and economic impact of the disorder on their families. A variety of studies and interventions have been proposed to tackle ASD, and they can be divided into three categories. Behavioral approaches include teaching, speech and language therapy, and social skills therapy. When behavioral treatment fails, medications are often used to treat ASD symptoms. Figure 1 illustrates the interaction among autism spectrum disorder research efforts. Advances in genetic technologies give researchers and scientists the opportunity to explore biological information in depth and to convert it into meaningful biological knowledge through computational models.
In this chapter, we investigate the genetic origins of autism and review the latest techniques and technologies available for diagnosing this complex disorder. We also propose a robust approach for detecting and identifying the targeted disorder that builds on the strengths of publicly available and commercial approaches while avoiding their weaknesses. The proposed approach consists of two steps. The preprocessing step is a feature-extraction method used to map and detect genetic variations and structural rearrangements, followed by a statistical model used as feature selection to evaluate the statistical and biological significance of the predicted variations. The classification step discovers relationships among the tested samples, grouping them into groups and/or subgroups and providing insight into the complex pattern of the genome.
The results suggest that autism is associated with an increased number of alterations in unstable segments of the genome. The experimental results also show that using high-resolution custom-tiled samples improves the accuracy of our proposed approach in identifying previously reported and new genetic contributors that warrant investigation.
This chapter aims to use research to benefit individuals and families affected by autism spectrum disorders and to improve their quality of life. Clearly mapping and identifying the biomarkers associated with ASD in early childhood is essential for providing better treatments and therapies. Finally, the approach presented in this chapter is broadly applicable to case-control studies of genetic diseases beyond ASD.
The chapter is organized as follows. In Section 2, we describe the genetic data-generating techniques, data modeling, and chromosomal variations associated with the targeted disorder, ASD. Section 3 is devoted to the methods used to analyze the genetic data in order to discover variant regions along the genome and classify the tested samples. In Section 4, we apply a molecular test to evaluate the predictive power of the proposed approach. Finally, the discussion and conclusions based on the results are presented in Section 5.

Genomic structural variations and ASD susceptibility
Genetic alterations in the form of chromosomal rearrangements are genomic structural variations that lead to changes in DNA copy number, such as duplications and deletions of DNA segments. Copy number changes do not, however, include other genomic structural variations such as inversions, insertions, and reciprocal translocations. Figure 2 illustrates different types of chromosomal rearrangements. One of the implicated genes encodes a protein whose most important role is in the brain: it is involved in processes crucial for learning and memory and plays an important role in brain development. Its loss, known as 22q13.3 deletion syndrome, is highly associated with autism (Human (Homo sapiens) Genome Browser Gateway, http://genome.ucsc.edu/cgi-bin/hgGateway).
A set of chromosomal regions and genes implicated in ASD is listed in Table 1. Some of these regions are associated with known Mendelian syndromes; in some individuals affected by these syndromes, ASD occurs as a secondary diagnosis. In other regions and genes, the genetic variations causing ASD span a wide range of possibilities, each with very low frequency among cases (rare variants); in some cases a rare variant is found only once in the population. In contrast, in other chromosomal regions and genes, a few common genetic variations (common alleles) account for ASD susceptibility. Figure 3 illustrates the process of generating DNA copy number data using microarray-based comparative genomic hybridization (array CGH) technology.

Data Modeling
As illustrated in Figure 3, aCGH technology is an experimental approach for genome-wide scanning of differences in DNA copy number (DCN) between samples. It provides a high-resolution method to map and measure relative changes in DCN simultaneously at thousands of genomic loci. In a biological experiment, unknown (test) and reference (normal) DNA samples are labeled with the fluorescent dyes Cy3 and Cy5, respectively. They are then combined and competitively cohybridized to an array containing genomic DNA targets spotted on a glass slide. The resulting ratio of the fluorescence intensities, measured on a logarithmic scale for each genomic location, is proportional to the ratio of the copy numbers of DNA sequences in the test and reference genomes. These intensity ratios are informative about DNA copy number changes: we expect a duplication (gain) for a positive ratio, a deletion (loss) for a negative ratio, and the normal state for a neutral ratio. Because of the logarithmic scale and probe performance, the data can be approximated as a piecewise function of short and long intervals with different intensity levels that are not equally spaced along the genome. Moreover, microarray experiments suffer from many sources of error due to human factors, array printer performance, labeling, and hybridization efficiency.
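The log-ratio logic above can be sketched in a few lines. The threshold value below is a hypothetical cutoff chosen purely for illustration; real aCGH pipelines calibrate it per array:

```python
import numpy as np

def call_copy_number(test_intensity, ref_intensity, threshold=0.3):
    """Call gain/loss/normal states from Cy3/Cy5 fluorescence intensities.

    The log2 ratio is positive for duplications (gain), negative for
    deletions (loss), and near zero for the normal state. The threshold
    here is illustrative, not a calibrated value.
    """
    ratio = np.log2(np.asarray(test_intensity, dtype=float) /
                    np.asarray(ref_intensity, dtype=float))
    calls = np.where(ratio > threshold, "gain",
            np.where(ratio < -threshold, "loss", "normal"))
    return ratio, calls
```

For example, a test/reference intensity pair of (2.0, 1.0) yields a log2 ratio of 1.0 and is called a gain.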
y[n] = x[n] + ε[n],   n = 1, 2, ..., N        (1)

where y[n] is the contaminated genetic signal and x[n] is the true value of the genetic variation to be estimated at genomic location n out of N locations. ε[n] is assumed to be additive white Gaussian noise with zero mean and some variance σ².
As described in (1), Figure 4 illustrates genetic data in the form of DNA copy number generated by aCGH technology, where 4 variant segments are present with different intensity levels.
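A minimal simulation of the model in (1) can help make it concrete. The segment boundaries and levels below are hypothetical, chosen only to mimic a profile like the one in Figure 4:

```python
import numpy as np

def simulate_dcn(n=400, seed=0):
    """Simulate a DCN profile as a piecewise-constant signal x[n]
    plus additive white Gaussian noise eps[n], as in model (1).

    Segment positions and log-ratio levels are illustrative:
    0.58 ~ log2(3/2), a single-copy gain; -1.0 ~ log2(1/2), a deletion.
    """
    x = np.zeros(n)
    x[100:150] = 0.58                        # duplication (gain) segment
    x[300:320] = -1.0                        # deletion (loss) segment
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, 0.15, size=n)      # zero-mean noise, sigma = 0.15
    return x + eps, x                        # contaminated signal y, truth x
```

The returned pair corresponds to y[n] (observed) and x[n] (to be estimated by the filtering step described next).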

Data Filtering
Although recent advances in microarray and sequencing technologies now make it easy to measure genetic variations at high resolution across large numbers of samples, small changes, particularly in low copy repeat (LCR) regions, remain difficult to detect under different noise conditions. The challenge is therefore to differentiate the true biological signal from measurement noise.
Various methods have been proposed as preprocessing techniques to tackle this problem, motivated either by well-known signal-processing techniques or by statistical models. In Table 2, we present a comparison of the computational cost of the most recent and successful approaches. The smoothing techniques are well suited to processing very large amounts of data, such as genetic signals, compared with the statistical models. However, these techniques tend to smooth over important features such as the boundaries of variant regions.
Here we present our previously proposed method (Alqallaf et al., 2007), the Sigma filter (SF). It is a nonlinear method used for feature extraction: it detects the edges of variant segments and smooths the rest of the genetic data. The filter is a conceptually simple but effective noise-smoothing algorithm. Given the assumptions of the aCGH data model, the SF algorithm is well suited to denoising the tested samples before further analysis. The SF algorithm is motivated by the sigma probability of the Gaussian distribution: it smooths noise by averaging only those neighboring data points whose intensities lie within a fixed sigma range of the center data point. Consequently, the edges of variant segments are preserved and subtle details are retained.
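A sketch of the sigma-filter idea just described, as a simplified reimplementation rather than the exact published code; the window size and sigma are illustrative parameters:

```python
import numpy as np

def sigma_filter(y, half_window=5, sigma=0.15, k=2.0):
    """Edge-preserving smoothing in the spirit of the Sigma filter.

    Each point is replaced by the average of those neighbors (within
    +/- half_window) whose intensity lies within k*sigma of the center
    value. Points across a variant-segment edge fall outside that range
    and are excluded from the average, so edges are preserved.
    """
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        window = y[lo:hi]
        close = window[np.abs(window - y[i]) <= k * sigma]
        out[i] = close.mean()   # the center point itself always qualifies
    return out
```

On a noisy step signal, the filter lowers the variance of flat regions while keeping the step transition sharp, which is exactly the behavior needed before segment-boundary detection.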

Statistical significance
A few studies in the literature have addressed the power of class discovery based on recurrent copy number variations (CNVs) across multiple samples of genetic data [52, 53]. However, they did not consider denoising the data before applying the statistical analysis.
To reduce the dimensionality of the detected variant regions, we apply a simple statistical approach that measures the significance of candidate genomic regions. The approach is based on the difference in frequency between case and control samples at each genomic location. It is used as a feature-selection algorithm to select a small subset of variant segments as features for classification. Figure 5 illustrates three recurrent CNVs of different sizes in filtered DCN data for multiple samples of normal control (C_i) and autistic (A_i) individuals, respectively. After selecting the informative segments of the genome, we apply comparative classification algorithms to the reduced data.
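The frequency-difference idea can be sketched as follows, operating on hypothetical {-1, 0, +1} aberration calls (loss, normal, gain) per locus; the 0.3 threshold is illustrative, not the value used in the study:

```python
import numpy as np

def select_variant_loci(case_calls, control_calls, threshold=0.3):
    """Feature selection by case-vs-control aberration frequency.

    case_calls, control_calls: 2D integer arrays (samples x loci) of
    {-1, 0, +1} copy-number calls after filtering. Returns indices of
    loci whose aberration frequency differs between the two groups by
    more than `threshold`.
    """
    case_freq = np.mean(np.asarray(case_calls) != 0, axis=0)
    ctrl_freq = np.mean(np.asarray(control_calls) != 0, axis=0)
    return np.flatnonzero(np.abs(case_freq - ctrl_freq) > threshold)
```

The selected loci form the reduced feature matrix passed to the classifiers described in the next section.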

Data classification
Based on the collected and processed genetic data, we apply a system of classifiers to identify autistic individuals from their genetic information. Such a system can improve the detection, identification, and diagnosis of autism, benefiting patients and society in general and enabling earlier diagnosis and treatment.
Classifiers are commonly used by researchers faced with a classification task on given data. They are mathematical models that perform classification or decision making based on previously provided data. A classifier's ability to spot trends and relationships in large data sets makes it well suited to many applications; in medicine, classifiers can be used to accurately classify diseases, genes, tumors, and other medical phenomena [54-60]. Although some attempts have been made to use classifiers in genetics [61], here we use three comparative classifiers, namely the k-Nearest Neighbor, Neural Network, and Support Vector Machine classifiers, to help diagnose patients with ASD.
Leave-one-out cross-validation (LOOCV) is applied to evaluate the proposed classifiers by measuring how accurately they identify the association between the tested samples and the targeted disorder, ASD. LOOCV uses a single variant segment from the original sample as the validation data and the remaining segments as the training data; this is repeated so that each variant segment in the sample is used once as validation data.

k-Nearest Neighbor Classifier
The k-Nearest Neighbor (k-NN) classifier [64] is a well known nonparametric classifier. To classify a new input x, the k nearest neighbors are retrieved from the training data. The input x is then labeled with the majority class label corresponding to the k nearest neighbors.
For the k-NN classifier, we used the Euclidean distance as the distance metric, and the best k between 1 and 10 was found by performing LOOCV on the training data.
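The procedure just described can be sketched in a self-contained way: a Euclidean k-NN classifier and a LOOCV search for the best k in [1, 10]. The toy data in the usage note are hypothetical, not the chapter's dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Label x by majority vote among its k nearest (Euclidean) neighbors."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dist)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy: each sample is held out once as validation."""
    idx = np.arange(len(X))
    hits = sum(knn_predict(X[idx != i], y[idx != i], X[i], k) == y[i]
               for i in idx)
    return hits / len(X)

def best_k(X, y, k_range=range(1, 11)):
    """Pick the k in [1, 10] that maximizes LOOCV accuracy, as in the text."""
    return max(k_range, key=lambda k: loocv_accuracy(X, y, k))
```

On two well-separated clusters, LOOCV with k = 1 already reaches perfect accuracy, and `best_k` simply returns the smallest k achieving the maximum.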

Neural Network
Neural networks are another type of classifier: mathematical models used for classification, regression, or decision making. Their structure is inspired by the human brain and nervous system. A network consists of many neurons interconnected in stages, with information usually flowing from the input stage to the output stage. Each neuron has an input and an output, and an activation function converts a neuron's input into its output. The output of each neuron is connected to the next stage through a weighted connection. A learning function determines the values of the connection weights, which are updated according to a mathematical rule that ties the network together. A neural network is therefore an adaptive system that changes its structure during the learning (training) phase based on mathematical functions relating the input data to the corresponding class labels. The neurons in the different layers and their weighted interconnections make up a complex network that is commonly treated as a black box.
Before it is used to classify a test sample, the neural network is trained on a data set with known classes or labels. During the training phase the weights are updated to minimize the output error, and the chosen minimum acceptable error determines when training stops. For difficult data, where the set minimum error cannot be reached, a maximum number of epochs is used as the stopping criterion.
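A minimal one-hidden-layer network illustrating both stopping criteria (minimum error and maximum epochs), trained by gradient descent on the mean squared output error. The layer size, learning rate, and thresholds are illustrative choices, not the chapter's actual configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_network(X, y, hidden=4, lr=0.5, max_epochs=10000,
                  min_error=1e-3, seed=0):
    """Train a one-hidden-layer sigmoid network; stop when the mean
    squared output error drops below min_error, or after max_epochs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 1, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, (hidden, 1));          b2 = np.zeros(1)
    t = y.reshape(-1, 1).astype(float)
    for _ in range(max_epochs):
        h = sigmoid(X @ W1 + b1)            # hidden-layer activations
        out = sigmoid(h @ W2 + b2)          # network output in (0, 1)
        if np.mean((out - t) ** 2) < min_error:
            break                            # minimum-error stopping criterion
        # backpropagate the error and update the connection weights
        d_out = (out - t) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
    return lambda Xn: (sigmoid(sigmoid(Xn @ W1 + b1) @ W2 + b2) > 0.5
                       ).astype(int).ravel()
```

Trained on a small linearly separable problem (logical AND), the returned prediction function reproduces the training labels.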

Support Vector Machine
The Support Vector Machine (SVM) belongs to a generation of learning systems based on advances in statistical learning theory [65]. A linear SVM, which is used in our system, finds the separating hyperplane with the largest margin, defined as the sum of the distances from the hyperplane (implied by a linear classifier) to the closest positive and negative exemplars. The expectation is that the larger the margin, the better the generalization of the classifier. In the non-separable case, a linear SVM seeks a trade-off between maximizing the margin and minimizing the number of errors.
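The margin/error trade-off can be sketched with the Pegasos subgradient method, a standard way to train a linear soft-margin SVM (not necessarily the solver used in the chapter); labels must be in {-1, +1} and the regularization constant is illustrative:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient training of a linear SVM.

    Minimizes (lam/2)*||w||^2 plus the hinge loss, trading off margin
    maximization against classification errors. y must be in {-1, +1}.
    Returns the weight vector w and bias b.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decreasing step size
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * lam               # regularization shrinkage
            if margin < 1:                     # inside margin or misclassified
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```

On linearly separable data the learned hyperplane classifies every training point with the correct sign.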

Validation of the predicted variant segments
To evaluate the predictive power of our method in detecting and identifying patients with ASD, we use a molecular test, quantitative polymerase chain reaction (qPCR). qPCR is a very sensitive and precise tool for quantifying nucleic acids: it can detect and quantify very small amounts of a specific nucleic acid sequence. It is based on PCR, developed by Kary Mullis in the 1980s, which allows a specific DNA sequence to be amplified more than a billion-fold. qPCR allows scientists to quantify the starting amount of a specific DNA sequence in a sample before amplification by PCR [62].
Quantitative PCR is an indispensable tool for researchers in various fields, including fundamental biology, molecular diagnostics, biotechnology, and forensic science. Critical points and limitations of qPCR-based assays must be considered to increase the reliability of the obtained data. For qPCR detection, four technologies are commonly used, all based on measuring fluorescence during the PCR. One is based on intercalating double-stranded DNA-binding dyes (the simplest and cheapest). The other three are based on introducing an additional fluorescence-labeled oligonucleotide (probe): detectable fluorescence is released only after cleavage of the probe (hydrolysis probes) or during hybridization of one (molecular beacons) or two (hybridization probes) oligonucleotides to the amplicon. The additional probe increases the specificity of the quantified PCR product and allows multiplex reactions to be developed. Other detection technologies have also been described [63].
The qPCR method quickly became the first choice for quantitative analysis of nucleic acids, for several reasons. It is highly sensitive, allowing the detection of fewer than five copies (in some cases a single copy) of a target sequence. It has good reproducibility and a broad dynamic quantification range of at least 5 log units. It is also easy to use and offers reasonably good value for money (low consumable and instrumentation costs).
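Relative quantification from qPCR data is commonly done with the 2^-ΔΔCt method (Livak and Schmittgen), sketched here under the standard assumption of roughly 100% amplification efficiency; the chapter does not specify its exact quantification scheme:

```python
def fold_change_ddct(ct_target_test, ct_ref_test,
                     ct_target_ctrl, ct_ref_ctrl):
    """Relative quantification by the 2^-delta-delta-Ct method.

    Ct is the PCR cycle at which fluorescence crosses the detection
    threshold; each cycle of difference corresponds to a twofold
    difference in starting template (assuming ~100% efficiency).
    """
    delta_test = ct_target_test - ct_ref_test   # normalize to reference gene
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    return 2.0 ** -(delta_test - delta_ctrl)
```

For example, a test sample whose target crosses the threshold two cycles earlier (relative to the reference gene) than in the control carries about four times the starting copies.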
For the purposes of this chapter, we focus on one of the many applications of qPCR that is indispensable for research and diagnostics: genetic variations. Table 3 reports the number of events (CNVs) detected by the circular binary segmentation (CBS) and sigma filtering methods, respectively, for 22 qPCR-confirmed CNVs. The table shows that the number of qPCR-confirmed CNVs detected by the sigma filtering (SF) method is considerably higher than the number detected by circular binary segmentation (CBS), by 4.5% to 36% across 4 different array experiments. The results show that applying an averaging window of 2 kb makes the algorithms well suited to detecting variations in high-density microarray data, especially in LCR-rich regions.

Conclusion
The etiology of Autism spectrum disorders involves genetic and environmental risk factors.
In this chapter, we have discussed the genetic basis of the complex disorder autism. Recent advances in screening technologies that interrogate the entire genome, such as array comparative genomic hybridization (aCGH) and whole-genome sequencing, provide insight into the pattern of genetic variations and reveal their roles in genetic diseases. In this study, we have presented an overview of the analysis of genetic variations in the form of DNA copy number changes and their association with autism susceptibility.
Through mathematical models and computational approaches, we analyzed the genetic data to discover and identify the relationship between structural chromosomal rearrangements along the genome and the targeted disorder, ASD. In conclusion, the results provide strong evidence that these genetic variations contribute to the complex disorder, autism.