Analysis of MLPA Data Using Novel Software Coffalyser.NET by MRC-Holland

Genetic knowledge has increased tremendously in recent years, filling gaps and providing answers that were previously inaccessible. Medical genetics seeks to understand how genetic variation relates to human health and disease (National Center for Biotechnology Information, 2008). Knowledge of the genetic origins of disease has increased our understanding of illnesses caused by abnormalities in genes or chromosomes, offering the potential to improve the diagnosis and treatment of patients. Normally, every person carries two copies of every gene (with the exception of genes related to sex-linked traits), which cells can translate into a functional protein. The presence of mutant forms of genes (mutations, copy number changes, insertions/deletions and chromosomal alterations) may affect several processes involved in the production of these proteins, often resulting in the development of genetic disorders. Genetic disease is either caused by changes in the DNA of somatic cells in the body or it is inherited, e.g. through mutations in the germ cells of the parents. Genetic testing is "the analysis of chromosomes (DNA), proteins, and certain metabolites in order to detect heritable disease-related genotypes, mutations, phenotypes, or karyotypes for clinical purposes" (Holtzman et al., 2002). To make this suitable for routine diagnostics, dedicated, affordable, fast, easy-to-interpret and simple-to-use genetic tests are necessary. These allow scientists to easily access information that can, for instance, be used to confirm or rule out a suspected genetic condition, or to help determine a person's chance of developing or passing on a genetic disorder. Several hundred genetic tests are currently in use, and more are being developed (Sequeiros et al., 2008). Multiplex Ligation-dependent Probe Amplification (MLPA) is a PCR-based technique that allows the detection of copy number changes in DNA or RNA.
MLPA can quantify up to 50 nucleic acid sequences or genes in one simple reaction, with a resolution down to the single-nucleotide level (Schouten et al., 2002), needing only 20 ng of DNA. The MLPA procedure itself requires little hands-on work, allowing up to 96 samples to be handled simultaneously while results can be obtained within 24 hours. These properties make it a very efficient technique for medium-throughput screening of many different diseases in both research and diagnostic settings (Ahn et al., 2007).

Over a million MLPA reactions were performed worldwide last year, but researchers are still concerned with the lack of tools to facilitate and improve MLPA data analysis on large, complex data sets. MLPA kits contain oligonucleotide probes that, through a biochemical reaction, produce signals proportional to the amount of the target sequences present in a sample. These signals are detected and quantified on a capillary electrophoresis device, producing a fragment profile. The signals of an unknown sample need to be compared to a reference in order to assess the copy number. Profile comparison is a matter of professional judgment and expertise. Diverse effects may furthermore systematically bias the probe measurements, such as the quality of DNA extraction, PCR efficiency, label incorporation, exposure, scanning and spot detection, making data analysis even more challenging. To make data more intelligible, the detected probe measurements of different samples need to be normalized, thereby removing the systematic effects and bringing data of different samples onto a common scale. Although several normalization methods have been proposed, they frequently fail to take into account the variability of systematic error within and between MLPA experiments. Each MLPA study is different in design, scope, number of replicates and technical considerations. Data normalization is therefore often context dependent, and a general method that provides reliable results in all situations is hard to define. The most widely used normalization strategy therefore remains in-house analysis spreadsheets, which often cannot provide the reliability required for results with clinical purposes. These sheets furthermore do not provide easy handling of large amounts of data, and file retrieval, storage and archiving need to be handled by simple file management systems.
We therefore set out to develop software that tackles all of these problems and provides users with reliable results that are easy to interpret. In this chapter we present the features and integrated analysis methods of our novel MLPA analysis software, Coffalyser.NET. Our software uses an analysis strategy that can adapt to the researcher's objectives while considering both the biological context and the technical limitations of the overall study. We use statistical parameters appropriate to the situation, and apply the most robust normalization method based on the biology and quality of the data. Most information required for the analysis is extracted directly from the database of MRC-Holland, producer of the MLPA technology, so that only minimal user input about the experimental design is needed to define an optimal analysis strategy. In the next section we review the MLPA technology in more detail and explain the principles of MLPA data normalization. In section 3, we describe the main features of our software and their significance. The database behind our software is reviewed in section 4, and section 5 explains the exact workflow of our program, reviewing the importance and methodology of each analysis step in detail. In the final section, we summarize this chapter and present the future directions of our research.

Background
MLPA data is commonly used for sophisticated genomic studies and for research to develop clinically validated molecular diagnostic tests, which can, for example, provide individualized information on response to certain types of therapy and the likelihood of disease recurrence. The most common application for MLPA is the detection of small genomic aberrations, often accounting for 10 to 30% of all disease-causing mutations (Redeker et al., 2008). In the case of the very long DMD gene, involved in Duchenne muscular dystrophy, exon deletions and duplications even account for 65-70% of all mutations (Janssen et al., 2005). Since MLPA can detect sequences that differ by only a single nucleotide, the technique is also widely used for the analysis of complicated diseases such as congenital adrenal hyperplasia and spinal muscular atrophy, where pseudogenes and gene conversion complicate the analysis (Huang et al., 2007). Methylation-specific MLPA has also proven to be a very useful method for the detection of aberrant methylation patterns in imprinted regions, such as those found in Prader-Willi/Angelman syndrome and Beckwith-Wiedemann syndrome (Scott et al., 2008). The MS-MLPA method can also be used for the analysis of aberrant methylation of CpG islands in tumour samples, using e.g. DNA derived from formalin-fixed, paraffin-embedded tissues. MLPA kits generally contain about 40-50 oligonucleotide probes targeted mainly to the exonic regions of a single gene or of multiple genes. The number of genes that each kit covers depends on the purpose of the kit. Each probe consists of two hemi-probes, which after denaturation of the sample DNA hybridize to adjacent sites of the target sequence during an overnight incubation. For each probe oligonucleotide in an MLPA kit there are about 600,000,000 copies present during the overnight incubation. An average MLPA reaction contains 60 ng of human DNA sample, which corresponds to about 20,000 haploid genomes.
This abundance of probes compared to the sample DNA allows all target sequences in the sample to be covered. After the overnight hybridization, adjacently hybridized hemi-probe oligonucleotides are ligated using a ligase enzyme and the ligase cofactor NAD, at a slightly lower temperature than the hybridization reaction (54 °C instead of 60 °C). The ligase enzyme used, Ligase-65, is heat-inactivated after the ligation reaction. Afterwards the non-ligated probe oligonucleotides do not have to be removed, since the ionic conditions during the ligation reaction resemble those of an ordinary 1x PCR buffer. The PCR reaction can therefore be started directly after the ligation reaction by adding the PCR primers, polymerase and dNTPs. All ligated probes have identical end sequences, permitting simultaneous PCR amplification using only one primer pair. In the PCR reaction, one of the two primers is fluorescently labeled, enabling the detection and quantification of the probe products. The different length of every probe in the MLPA kit then allows these products to be separated and measured using standard capillary fragment electrophoresis. The unique length of every probe in the probe mix is used to associate the detected signals back to the original probe sequences. These probe product measurements are proportional to the amount of the target sequences present in a sample, but cannot simply be translated to copy numbers or methylation percentages. To make the data intelligible, data of a probe originating from an unknown sample needs to be compared with a reference sample, usually a sample that has a normal (diploid) DNA copy number for all target sequences. When the signal strengths of the probes are compared with those obtained from a reference DNA sample known to have two copies of each target, the signals are expected to be 1.5 times the intensities of the respective probes from the reference if an extra copy is present.
If only one copy is present, the proportion is expected to be 0.5; if the sample has two copies, the relative probe strengths are expected to be equal. In some circumstances, reliable results can be obtained by comparing unknown samples to reference samples by visual assessment, simply by overlaying two fragment profiles and comparing the relative intensities of the fragments (figure 1).
Fig. 1. MLPA fragment profile of a patient sample with Duchenne disease (bottom) and that of a reference sample (top). Duchenne muscular dystrophy is the result of a defect in the DMD gene on chromosome Xp21. The fragment profile shows that the probe signals targeted to exons 45-50 of the DMD gene have a 100% decrease as compared to the reference, which may be the result of a homozygous deletion.
It may however not be feasible to obtain reliable results from such a visual comparison if:
1. The DNA quality of the samples and references is incomparable.
2. The MLPA kit contains probes targeted to a number of different genes or different chromosomal regions, resulting in complex fragment profiles.
3. The data set is very large, making visual assessment very laborious.
4. The DNA was isolated from tumor tissue, which often shows DNA profiles with altered reference probes.
To make (complex) MLPA data easier to understand, unknown and reference samples have to be brought onto a common scale. This can be done by normalization: the division of multiple sets of data by a common variable in order to cancel out that variable's effect on the data. MLPA kits usually include so-called reference probes, which may be used in multiple ways to provide such a common variable. Reference probes are usually targeted to chromosomal regions that are assumed to remain normal (diploid) in the DNA of applicable samples. The results of data normalization are probe ratios, which display the balance of the measured signal intensities between sample and reference. In most MLPA studies, the calculated MLPA probe ratios are compared to a set of arbitrary borders to recognize gains and losses (González, 2008). Probe ratios below 0.7 or above 1.3 are for instance regarded as indicative of a heterozygous deletion (copy number change from two to one) or duplication (copy number change from two to three), respectively. A delta value of 0.3 is a commonly accepted, empirically derived threshold value for genetic dosage quotient analysis (Bunyan et al., 2004). To obtain more conclusive results, probes may be arranged according to chromosomal location, as this may reveal more subtle changes such as those observed in mosaic cases.
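The arbitrary-border interpretation described above can be sketched in a few lines of Python. This is an illustrative example, not Coffalyser.NET's actual implementation; the function and variable names are our own, and only the 0.7/1.3 borders come from the text.

```python
# Hypothetical sketch of threshold-based MLPA probe ratio interpretation.
# The 0.7 / 1.3 borders are the commonly used values cited in the text;
# names are illustrative, not Coffalyser.NET's API.

def classify_probe_ratio(ratio, lower=0.7, upper=1.3):
    """Classify a normalized probe ratio against arbitrary borders."""
    if ratio < lower:
        return "loss"    # e.g. heterozygous deletion (copy number 2 -> 1)
    if ratio > upper:
        return "gain"    # e.g. duplication (copy number 2 -> 3)
    return "normal"

# Expected ratios versus a diploid reference: deletion ~0.5, duplication ~1.5
print(classify_probe_ratio(0.48))  # loss
print(classify_probe_ratio(1.52))  # gain
print(classify_probe_ratio(1.02))  # normal
```

As the text notes, this magnitude-only criterion is the simplest possible interpretation; later sections describe how significance estimates refine it.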

Support wide range of file format
Our software is compatible with binary data files produced by all major capillary electrophoresis systems, including: ABIF files (*.FSA, *.AB1, *.ABI) produced by Applied Biosystems devices, SCF and RSD files produced by MegaBACE™ systems (Amersham), and SCF and ESD files produced by CEQ systems (Beckman). We can also import fragment lists in text or comma-separated format, produced by different fragment analysis software programs such as Genescan (Applied Biosystems), Genemapper (Applied Biosystems), CEQ Fragment analysis software (Beckman) and Genetools. Raw data files are however preferred, since they allow more troubleshooting and quality-check options than size-called fragment lists. In addition, raw and analyzed data are then stored in a single database and more advanced reports can be created.

Optimized peak detection / quantification method for MLPA
All algorithms applied in our software are specifically designed to suit MLPA or MLPA-like applications. We designed an algorithm for peak detection and quantification specifically for MLPA peak patterns. Most peak detection algorithms simply identify peaks based on amplitude, ignoring the additional information in the shape of the peaks. In our experience, 'true' peaks have characteristic shapes, and including fluorescence from artifacts may introduce ambiguity into the analysis and interpretation process. Our algorithm is able to differentiate most spurious peaks and artifacts from peaks that originate from a probe product. We distinguish a number of different peak artifacts, such as shoulder peaks, printout spikes, dye artifacts, split peaks, pull-up peaks, stutter peaks and non-template additions. It is often difficult to identify the correct peaks due to the appearance of non-specific peaks in the vicinity of the main allele peak. Our algorithm is therefore optimized to discriminate the different artifacts from the probe signals by using minimum and maximum threshold values on the peak amplitude, area, width and length. Next to this, it can also recognize split and shoulder peaks by means of shape recognition, making correct identification of probe signals even more reliable. Following peak detection, quantification and size calling, our software allows one or more peaks to be linked to the original MLPA probe target sequence. This pattern matching is greatly simplified compared to other genotyping programs and additionally provides a powerful technique for identifying and separating signal due to capillary electrophoresis artifacts. Our software may employ three different metrics to reflect the amount of probe fluorescence: peak height, peak area, and peak area including its siblings. Peak siblings are the peak artifacts that are created during the amplification of the true MLPA products but have received an alternative length.
To determine which metric should be used for data normalization, our program uses an algorithm that compares the signal level of each metric over the reference probes in all samples, and compares this to the amount of noise over the same signals. The metric with the highest signal-to-noise ratio is then used in the subsequent normalization steps.
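The metric selection described above can be approximated as follows. This is a minimal sketch under the assumption that "signal to noise" means the mean of the reference-probe signals divided by their standard deviation; the exact statistic used by Coffalyser.NET is not specified in the text, and all names and sample values below are invented for illustration.

```python
import statistics

def signal_to_noise(values):
    """Mean over standard deviation of reference-probe signals
    (one plausible reading of the signal-to-noise criterion)."""
    sd = statistics.stdev(values)
    return statistics.mean(values) / sd if sd else float("inf")

def pick_metric(measurements):
    """measurements maps each candidate metric (peak height, peak area,
    area including siblings) to its reference-probe signals across samples.
    Returns the metric with the highest signal-to-noise ratio."""
    return max(measurements, key=lambda m: signal_to_noise(measurements[m]))

# Invented example values: peak area varies least relative to its mean
metrics = {
    "peak_height":        [1000, 1100, 950, 1050],
    "peak_area":          [5000, 5200, 4900, 5100],
    "area_with_siblings": [5600, 6400, 4800, 6000],
}
print(pick_metric(metrics))  # peak_area
```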

Performances and throughput
After a user logs in, analysis of a complete experiment can be performed in two simple steps: the processing of raw data and the comparison of different samples. Depending on the analysis setup and type of computer, the complete analysis may be completed in less than a minute for 24 samples. Our software can also make use of extra cores in a computer, nearly doubling the speed of the analysis for each additional core. Because of problems arising from poor sample preparation, the presence of PCR artifacts, irregular stutter bands, and incomplete fragment separations, a typical MLPA project requires manual examination of almost all sample data. Our software was designed to eliminate this bottleneck by substantially minimizing the need to review data. By attaching a series of quality scores to the different processes, users can easily pinpoint the cause of a failed analysis. These scores include quality assessments related to the sample DNA, the MLPA reaction, the capillary separation and the normalization steps (figure 6). The quality of each step falls roughly into three categories:
1. High-quality or green. The results of these analysis steps can be accepted without reviewing.
2. Low-quality or red. These steps represent samples with contamination and other failures, which render the resulting data unsuitable to continue with. This data can quickly be rejected without reviewing; recommendations can be reviewed in Coffalyser.NET and used for troubleshooting.
3. Intermediate-quality or yellow. The results of these steps fall between high- and low-quality. The related data and additional recommendations can be reviewed in Coffalyser.NET and used to optimize the obtained results.
When the analysis is finished, the results can be visualized in a range of different display and reporting options designed to meet the requirements of modern research and diagnostic facilities.
Results can be exported effortlessly to all commonly used medical report formats, such as PDF, XLS, TXT, CSV, JPG, GIF and PNG.

Reliable recognition of aberrant probes
Interpretation of the results of clinically relevant tests can be one of the most difficult aspects of MLPA analysis and is a matter of professional judgment and expertise. In practice, most users only consider the magnitude of a sample's test probe ratio, comparing the ratio against a threshold value. This criterion alone may often not provide the conclusive results required for diagnosing disease. MLPA probes all have their own characteristics, and the level of increase or decrease displayed by the ratio of a probe targeted to a region containing a heterozygous gain or loss may differ for each probe. Interpretation of normalized data may be complicated further by shifts in ratios caused by sample-to-sample variation, such as dissimilarities in PCR efficiency and size-to-signal sloping. Other reasons for fluctuations in probe ratios may be: poor amplification, misinterpretation of an artifact peak/band as a true probe signal, incorrect interpretation of stutter patterns, contamination, mislabeling or data entry errors (Bonin et al., 2004). To make result interpretation more reliable, our software combines effect-size statistics and statistical inference, allowing users to evaluate the magnitude of each probe ratio in combination with its significance in the population. The significance of each ratio can be estimated from the quality of the performed normalization, which can be assessed by two factors: the robustness of the normalization factor and the reproducibility of the sample reactions. During the analysis, our software estimates the reproducibility of each sample type in an experiment by calculating the standard deviation of each probe ratio in that sample type population. Since reference samples are assumed to be genetically equal, the effect of sample-to-sample variation on the probe ratios of test probes is estimated by the reproducibility of these probes in the reference sample population.
These calculations are more accurate when reference samples are randomly distributed across the experiment. Our program therefore provides an option to create an experimental setup following these criteria, producing a worksheet for the wet analysis and a setup file for capillary electrophoresis devices. DNA sample names can be selected from the database and may be typed as a reference or test sample, positive control or negative control. This setup file replaces the need for filling in the sample names in the capillary electrophoresis run software, thereby minimizing data entry errors. To evaluate the robustness of the normalization factor, our algorithm calculates the discrepancies between the probe ratios of the reference probes within each sample. Our normalization uses each reference probe for the normalization of each test probe, thereby producing as many dosage quotients (DQs) as there are reference probes. The median of these DQs is then used as the definitive ratio. The median of the absolute deviations between the computed dosage quotients reflects the mathematical imprecision introduced by the normalization factor. Next, our software calculates the effect of both types of variation on each test sample probe ratio and determines a 95% confidence range. By comparing each sample's test probe ratio and its 95% confidence range to the available data of each sample type population in the experiment, we can conclude whether the found results are significantly different from e.g. the reference sample population, or equal to a positive sample population. The algorithm then completes the analysis by evaluating these results in combination with the familiar set of arbitrary borders used to recognize gains and losses.
A probe signal is concluded to be aberrant with respect to the reference samples if it is significantly different from the reference sample population and the extent of this change meets certain criteria. The results are finally translated into easy-to-understand bar charts (figure 2) and sample reports, allowing users to make a reliable and astute interpretation of the results.
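The per-reference-probe dosage quotient computation described above can be sketched as follows. This is a simplified illustration under the assumption that one dosage quotient takes the form DQ_i = (test/ref_i in the sample) / (test/ref_i in the reference sample); function names and example values are invented, and the real software aggregates over whole reference populations rather than a single reference run.

```python
import statistics

def probe_ratio(test_sig, ref_sigs, test_sig_ref, ref_sigs_ref):
    """One dosage quotient (DQ) per reference probe:
    DQ_i = (test / ref_i in the sample) / (test / ref_i in the reference).
    Returns the median DQ (the definitive ratio) and the median of
    absolute deviations (MAD), reflecting normalization imprecision."""
    dqs = [(test_sig / rs) / (test_sig_ref / rr)
           for rs, rr in zip(ref_sigs, ref_sigs_ref)]
    med = statistics.median(dqs)
    mad = statistics.median([abs(d - med) for d in dqs])
    return med, mad

# Invented example: a test probe at half dose (heterozygous deletion)
# measured against three reference probes
med, mad = probe_ratio(500, [1000, 950, 1050], 1000, [1000, 1000, 1000])
print(round(med, 2), round(mad, 3))  # 0.5 0.024
```

A small MAD relative to the median indicates that the reference probes agree on the normalization factor, which is what makes the final 95% confidence range on the ratio tight.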

Advanced data mining options
The database behind our software is designed in SQL and is based on a relational database management system (RDBMS). In short, this means that both the data and the relationships among the data are stored in the form of tables. Our database setup contains a large number of abstraction levels, not only allowing users to efficiently store and review experimental sample data, but also giving them an integrative view of comprehensive data collections, as well as supplying an integrated platform for comparative genomics and systems biology. While all data normalization occurs per experiment, experiments can be organized in projects, allowing advanced data-mining options that enable users to retrieve and review data in many different ways. Users can for instance review multiple MLPA sample runs from a single patient in a single report view. Results of multiple MLPA mixes may be clustered together, allowing users to gain more confidence in any found results. The database can further handle an almost unlimited number of specimens for each patient, and each specimen can in turn handle an almost unlimited number of MLPA sample runs. To each specimen, additional information can be attached, such as sample type, tissue type, DNA extraction method, and other clinically relevant data, which can be used for a wide range of data mining operations for discovery purposes. Some of these operations include:
1. Segmenting patients accurately into groups with similar health patterns.
2. Evidence-based medicine, where information extracted from the medical literature and the corresponding medical decisions are key information to support the decision made by the professional.
3. Non-parametric (distribution-free) tests used to compare two or more independent groups of data.
4. Classification methods that can be used for predicting medical diagnoses.

About the database
4.1 Client-server database model
Our software uses a SQL client-server database model to store all project- and experiment-related data. The client-server model has one main application (server) that deals with one or several slave applications (clients). Clients may communicate with a server over the network, allowing data sharing within and even beyond their institutions. Even though this system may provide great convenience, e.g. for people who are working on a single project from different locations, client and server may also reside on the same system. Having both client and server on the same system has some advantages over running them separately: the database is better protected and both client and server will always have the same version number. If an older client tries to connect to a server with a newer version number, the client needs to be updated first. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers, which await incoming requests. When a new client is installed on a computer, it will use a discovery protocol to search for a server by means of broadcasting. The server application then answers with its dynamic address, which resolves any issues with dynamic IP addresses.

User access
In addition to serving as a common data archive, the database provides user authentication, robust and scalable data management, and flexible archive capabilities via the utilities provided within the software. Our database model follows a simple permission system, linking users to one or multiple organizations. Each user receives a certain role within each organization, to which certain rights are linked. These rights may for instance include denial of access to certain data, but may also be used to deny access to certain parts of the program. The same levels may also be applied at the project level. Projects have project administrators and project members. The initial project creators are also the project administrators, who are responsible for user management of that project.

Sessions
As soon as a user makes a connection with the server, a session is started with a unique identifier. Subsequent changes made by any user are tied to this identifier, in order to keep track of them. This number is also used to secure experiment data when in use, ensuring that no two users try to edit essential data simultaneously (data concurrency). When a user logs in on a certain system, all previously open sessions of that user are closed. Every user can thus only be active on a single system. On closing a session, either by logout or by a second login, all old user locks disappear.

Data retrieval and updates
Our software is equipped with MLPA sheet manager software, allowing users to obtain information about commercial MLPA kits and size markers directly from the MRC-Holland database. In addition, the sheet manager allows users to create custom MLPA mixes.

The sheet manager software can be used to check whether updates to any of the MLPA mixes are available. The sheet manager can further carry out automatic checks for updates at the frequency you choose, or it can be used to make manual checks whenever you wish. It can display scheduled update checks and can work completely in the background if you choose. With just one click, you can check whether there are new versions of the program or updated MLPA mix sheets. If updates are available, you can download them quickly and easily. In case some MLPA mixes are already in use, users may choose to keep both the older and the updated version of the mix, or to replace the older version. Figure 3 shows a graphical representation of the workflow of our software. After creating an empty solution, users can add new or existing items to it by using the "add new project" or "add new experiment" command from the client software context menu. By creating projects, users can collect data of different experiments in one collection. Next, data files can be imported into the database and linked to an experiment. Users then need to define the contents of each used channel or dye stream of each capillary (sample run). Each detectable dye channel can be set as a sample (MLPA kit) or a size marker. Samples may further be typed as: MLPA test sample, MLPA reference sample, MLPA positive control, or MLPA digested sample. The complete analysis of each MLPA experiment can be divided into two steps: raw data analysis and comparative analysis. Raw data analysis includes all independent sample processes, such as the recognition and signal determination of peaks in the raw data streams of imported data files, the determination of the sizes of these peaks in nucleotides, and the process of linking these peaks to their original probe target sequences.
After raw data analysis is finished, users can evaluate a number of quality scores (figure 6), allowing them to easily assess the quality of the produced fragment data for each sample. Users may now reject, accept or adjust sample types before starting the comparative analysis. During the comparative part of the analysis, several normalization and regression analysis methods are applied in order to isolate and correct the amount of variation that was introduced over the repeatedly measured data. Variation that could not be normalized out of the equation is measured and used to define confidence ranges. The software finally calculates the variation of the probes over samples of the same type, allowing subsequent classification of unknown samples. After the comparative analysis is finished, users may again evaluate a number of quality scores, this time concerning the quality of different properties related to the normalization. The users can finally evaluate the results by means of reporting and visualization methods.

Import / export of capillary data
Importing data is the process of retrieving data from files (for example, an ABIF file) and inserting it into SQL Server™ tables. Importing data from an external data source is likely to be the first step you perform after setting up your database. Our software contains several algorithms to decode binary files from the most commonly used capillary electrophoresis devices (see paragraph 2.1). Capillary devices usually store measurements of relative fluorescent units (RFU), and other related data collected during fragment separation, in computer files encoded in binary form. Binary files are made up of a sequence of bytes, which our program decodes back into lists of the different measurements, the most important measurement being the laser-induced fluorescence of the covalently bound fluorescent tags on the probe products and the size marker. The frequency at which these measurements occur depends on the type of system. A complete scan will always check all filters (or channels) and result in one data point. Almost all capillary systems are able to detect multicolor dyes, permitting the usage of an internal size marker, which provides a more accurate size call than an external size marker. Multicolor dyes may also permit the analysis of loci with overlapping size ranges, thus allowing multiple MLPA mixes to be run simultaneously in different dye colors. After data has been imported into your SQL Server database, users can start the analysis. Users can choose to analyze the currently imported data, data that was imported in the past, or a combination of both. Due to the relative nature of all MLPA data, it is recommended to analyze data within the confines of each experiment. There are circumstances in which better results may be obtained by applying previously collected reference data, but one should use these options with caution. Exporting data is usually a less frequent occurrence.
Coffalyser.NET therefore does not have standard tools to export raw capillary data, but rather depends on the tools and features provided by the SQL server. The data may be exported to a text file and then be read by third-party applications such as Microsoft Access or Excel, which can be used to view or manipulate the data.
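Decoding a binary trace back into a list of measurements, as described above, amounts to unpacking fixed-width integers from a byte sequence. The sketch below assumes the trace is stored as big-endian 16-bit integers (a common representation for ABIF trace data); it is an illustration of the principle, not Coffalyser.NET's actual parser, and the names are invented.

```python
import struct

def decode_trace(raw_bytes, fmt=">h"):
    """Decode a run of big-endian 16-bit integers back into a list of
    RFU measurements. Illustrative sketch only; real ABIF parsing first
    locates the data via the file's directory of tagged entries."""
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, raw_bytes, i)[0]
            for i in range(0, len(raw_bytes), size)]

# Round-trip example with synthetic data points
points = [0, 120, 310, 95]
raw = b"".join(struct.pack(">h", p) for p in points)
print(decode_trace(raw))  # [0, 120, 310, 95]
```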

Raw data analysis

Baseline correction
When detecting fluorescence in capillary electrophoresis devices, spectra can sometimes be contaminated by background fluorescence. Baseline curvature and offset are generally caused by the sample itself, and little can be designed into an instrument to avoid these interferences (Nancy T. Kawai, 2000). Non-specific fluorescence or background autofluorescence should be subtracted from the fluorescence obtained from the probe products to obtain the relative fluorescence resulting from the incorporation of the fluorophore. The baseline wander of the fluorescence signals may cause problems in the detection of peaks and should be removed before peak detection starts. Our software corrects for this baseline by applying a median signal filter twice to the raw signals. First, the signals of the first 200 data points of each dye channel are extracted and their median is calculated. The same procedure is then carried out for every subsequent block of 200 data points until the end of the data stream. These median values are subtracted from the signals of the original data stream to remove the baseline wander, resulting in baseline 1. This corrected baseline 1 is then fed into a filter that calculates the median signal over every 50 subsequent data points. These median values are subtracted from all signals on baseline 1 that are below 300 RFU (for ABI devices), resulting in baseline 2. This second correction is often necessary due to the relatively short distance between peaks deriving from probe products that differ by only a few nucleotides. By applying the second baseline correction solely to signals in the lower range of detection, even peaks that reside close to each other may be brought back to zero signal, without subtracting too much fluorescence originating from the probe products.
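As an illustration, the two-pass blockwise median correction described above can be sketched as follows. The block sizes (200 and 50 data points) and the 300 RFU cut-off are taken from the text; the function itself and its parameter names are hypothetical, not part of Coffalyser.NET:

```python
import numpy as np

def baseline_correct(signal, block1=200, block2=50, low_rfu=300.0):
    """Two-pass blockwise median baseline correction (illustrative sketch)."""
    signal = np.asarray(signal, dtype=float)

    # Pass 1: subtract the median of every 200-point block -> baseline 1.
    baseline1 = signal.copy()
    for start in range(0, len(signal), block1):
        block = signal[start:start + block1]
        baseline1[start:start + block1] = block - np.median(block)

    # Pass 2: subtract 50-point block medians, but only from signals
    # below the low-intensity cut-off (300 RFU for ABI devices) -> baseline 2.
    baseline2 = baseline1.copy()
    for start in range(0, len(baseline1), block2):
        block = baseline1[start:start + block2]
        med = np.median(block)
        low = block < low_rfu
        baseline2[start:start + block2][low] = block[low] - med
    return baseline2
```

A flat input trace is driven to zero by the first pass, while large probe peaks above the cut-off are left untouched by the second pass.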
Program administrators can modify the default baseline correction settings and may store different defaults for each capillary system in use.

Peak detection
In capillary-based MLPA data analysis, peak detection is an essential step for all subsequent analysis. Although various peak detection algorithms for capillary electrophoresis data exist, most of them are designed for detecting peaks in sequencing profiles. While peak detection and peak size calling are very important for sequencing applications, peak quantification is not. Due to the relative nature of MLPA data, peak quantification is particularly important and has a large influence on the final results. Our peak detection algorithm consists of two separate steps: the first detects peaks by comparing fluorescence intensities to set arbitrary thresholds and by shape recognition; the second filters the generated peak list by relative comparison. Program administrators can adjust the peak detection thresholds, which make use of the following criteria: 1. Detection/intensity threshold: this threshold is used to filter out small peaks in flat regions. The minimal and maximal peak amplitudes are arbitrary units, and default values are provided for each capillary system.

Peak area ratio percentage:
Peak area is computed as the area under the curve within the span of a peak candidate. The peak area ratio percentage is computed as the peak area divided by the total amount of fluorescence, times one hundred. The peak area ratio percentage of a peak must be larger than the minimum threshold and smaller than the maximum threshold.

Model-based criterion:
The application of this criterion consists of three to four steps:
- Locate the start point of each peak: a candidate peak is recognized as soon as the signal increases above zero fluorescence.
- Check whether the candidate peak meets the minimal requirements: the peak signal intensity is first expected to increase; if the top of the peak is reached and the candidate peak meets the set thresholds for peak intensity and peak area ratio percentage, the peak is recognized as a true peak.
- Discard peak candidates: a candidate is discarded if the median signal of the previous 20 data points is smaller than the current peak intensity, or if the current peak intensity returns to zero.
- Detect the peak end: the signal is usually expected to drop back to zero, designating the peak end. In some cases the signal does not return to zero; a peak end is therefore also designated if the signal drops below half the intensity of the peak top and the median signal of the last 14 data points is lower than the current signal.
4. Median signal peak filter: this filter works on the percentage of intensity of each peak relative to the median peak intensity of all detected peaks. Since the minimum and maximum thresholds depend on the detected peaks, this filter is applied after an initial peak detection based on criteria 1-3.
5. Peak width filter: after the peak end points have been identified, the peak width is computed as the difference between the right and left end points. The peak width should fall within a given range. This filter is also applied after an initial peak detection.

Peak pattern recognition:
This method is only applied to the size marker channel. It involves calculating the correlation between the data points of the peak tops in the detected peak list (based on criteria 1-5) and the expected lengths of the selected size marker. If the correlation is less than 0.999, the previous thresholds are automatically adapted and peak detection is restarted. These adaptations mainly involve adjusting the minimal and maximal threshold values.
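A greatly simplified sketch of the first detection pass (criteria 1-2 and the start/end logic of criterion 3) is shown below. The thresholds are illustrative defaults, not Coffalyser's, and the median-signal, peak-width and pattern-recognition filters are omitted:

```python
import numpy as np

def detect_peaks(signal, min_intensity=50.0, min_area_pct=0.05):
    """Threshold-based peak detection sketch: a candidate starts where the
    signal rises above zero and ends where it returns to zero; candidates
    are kept if the peak top exceeds the intensity threshold and the peak
    area ratio percentage passes the minimum cut. Returns (top index, top
    intensity) tuples."""
    signal = np.asarray(signal, dtype=float)
    total = signal.sum() or 1.0          # total fluorescence for area ratios
    peaks = []
    start = None
    for i, v in enumerate(signal):
        if v > 0 and start is None:
            start = i                    # candidate peak begins
        elif v <= 0 and start is not None:
            segment = signal[start:i]    # candidate peak ends
            top = segment.max()
            area_pct = 100.0 * segment.sum() / total
            if top >= min_intensity and area_pct >= min_area_pct:
                peaks.append((start + int(segment.argmax()), top))
            start = None
    return peaks
```

In the real algorithm the filters of criteria 4-5 then prune this initial peak list before any size calling takes place.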

Peak size calling
Size calling is a method that compares the detected peaks of an MLPA sample channel against a selected size standard. Lengths of unknown (probe) peaks can then be predicted using a regression curve between the data points and the expected fragment lengths of the size standard, resulting in a fragment profile (figure 4). Coffalyser.NET allows the use of two different size-calling algorithms: 1. local least squares method; 2. 1st, 2nd or 3rd order least squares. The local least squares method is the default size-calling method in our software. It determines the sizes of fragments (in nucleotides) by using the local linear relationship between fragment length and mobility (data points). Local linearity is a property of functions whose graphs appear smooth locally, though they need not be smooth in a strict mathematical sense. Unlike the other methods, the piecewise function used by the local least squares method is not differentiable at the marker points, because the slope of the tangent line is undefined there. To solve the local linear function, our algorithm first calculates the intercept and coefficient for each size marker point of the curve by use of a moving predictor. A local window of 3 points provides three predictions for each point along the curve that is surrounded by at least 2 points; the average intercept and coefficient are then stored for that point. Points at the beginning and the end of the curve receive a single prediction, since they do not have surrounding known values on both sides. The coefficient (β) and intercept (α) are calculated by solving equations 1 and 2.
To calculate the length of an unknown fragment, our algorithm uses the coefficients and intercepts calculated for the surrounding size marker peaks, one above and one below the unknown peak. Each unknown point is thus predicted twice, after which the average value is stored for that peak. If we wish to predict the length (Y) of an unknown fragment (X) whose peak-top data point lies between the data points of known fragments 5 and 6, we need to solve equation 5.
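The idea of averaging the predictions from the bracketing marker segments can be sketched as follows. This is a simplification of the 3-point moving predictor described in the text: here each bracketing segment contributes one line fit, and the two predictions are averaged:

```python
import numpy as np

def local_size_call(marker_points, marker_lengths, unknown_point):
    """Local linear size calling sketch: fit a line through each pair of
    size-marker points bracketing the unknown data point and average the
    resulting length predictions. `marker_points` are peak-top data points
    of the size marker, `marker_lengths` their known fragment lengths."""
    marker_points = np.asarray(marker_points, dtype=float)
    marker_lengths = np.asarray(marker_lengths, dtype=float)
    # Index of the marker interval containing the unknown data point.
    i = int(np.searchsorted(marker_points, unknown_point)) - 1
    i = max(0, min(i, len(marker_points) - 2))
    preds = []
    # Predict from the segment below (j = i) and above (j = i + 1).
    for j in (i, i + 1):
        if 0 < j < len(marker_points):
            slope = (marker_lengths[j] - marker_lengths[j - 1]) / (
                marker_points[j] - marker_points[j - 1])
            intercept = marker_lengths[j - 1] - slope * marker_points[j - 1]
            preds.append(slope * unknown_point + intercept)
    return float(np.mean(preds))
```

For a perfectly linear size marker the interpolated length is exact; in practice the two local fits differ slightly and averaging smooths the size call.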

Peak identification
Once all peaks have been size called, the profiles must be aligned to compare the fluorescence of the different targets across samples, an operation that is perhaps the single most difficult task in raw data analysis. Peaks corresponding to similar lengths of nucleotides may still be reported with slight differences or drifts due to secondary structures or bound dye compounds. These shifts in length make a direct numerical alignment based on the original probe lengths all but impossible. Our software uses an algorithm that automatically determines which peaks correspond to each other across different samples, allowing easy peak-to-probe linkage. This procedure follows a window-based peak binning approach, whereby all peaks within a given window across different samples are considered to be the same peak (figure 5). Our software algorithm follows four steps: reference profile analysis, application and prediction of new probe lengths, reiteration of profile analysis, and data filtering of all samples. The crucial task in data binning is to create a common probe length reference vector (or bin). In the first step, our algorithm applies a bin set that searches for all peaks with a length closely resembling the design length of each probe. Next, the largest peak in each temporary bin is assumed to be the real peak descending from the related probe product. To create a stable bin, we calculate the average length over the real peaks of all reference samples. If no reference samples exist, the median length over all collected real peaks from all samples is used. Since some probes may show a large difference between their original and detected lengths, the previously created results may not suffice. We therefore check whether the length related to each probe is applicable to the sample set, by calculating how much variation exists over the collected peak lengths in each of the previous bins.
If the variation is too large (standard deviation > 0.2), or no peak at all was found in any of the bins, the expected peak length for that probe is estimated by prediction. The expected probe peak lengths may be predicted by using a second-order polynomial regression on the available data of the probes for which reproducible data was found. Even though a full collection of bins is now available, the predicted probe product lengths may not be very accurate. The set of bins for each probe in the selected MLPA mix is therefore improved by iterating the previous steps. The lengths provided for the bins are now based on the previously detected or predicted probe product lengths, allowing a more accurate detection of the real probe peaks. Probes that were not found are again predicted, and a final length reference vector or bin is constructed for each probe. This final bin set can be used directly for data filtering, but may also be edited manually in case the automatically created bin set does not suffice. Data filtering is the actual process in which the detected fragments of each sample are linked, with gene information, to a probe target or control fragment. Our algorithm assumes that peaks within each sample that fall within the same window or bin and have sufficient fluorescence intensity derive from the same probe (figure 4). Our algorithm is also able to link more than one peak to a probe within one sample. The amount of fluorescence of each probe product may then be expressed as the peak height, the peak area of the main peak, or the summarized peak area of all peaks in a bin. An algorithm can then be used to compare these metrics and decide which should optimally be used, as described at 3.2; alternatively, users may set a default metric. The summarized peak area may best reflect the amount of fluorescence if peaks are observed with multiple tops that all originate from the amplification of the same ligation product.
Such peaks may be observed if: 1. too much input DNA was added to the amplification reaction and the polymerase was unable to complete the extension of all amplicons (Clark, J. M. 1988); 2. peaks were discovered that are one base pair longer than the actual target due to non-template addition; 3. the polymerase was unable to complete the adenine addition on all products, resulting in shoulder peaks or +A/-A peaks (Applied Biosystems, 1988).
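The core of the window-based binning step can be sketched as below. The window value is illustrative only; Coffalyser derives its bin centres from reference samples and iterative prediction as described above:

```python
import numpy as np

def assign_bins(peak_lengths, bin_centers, window=1.5):
    """Window-based peak binning sketch: link each detected peak to the
    nearest bin centre, provided it falls within the window. Returns a
    mapping of bin centre -> list of assigned peak lengths, so a bin may
    hold more than one peak from the same sample."""
    assignments = {}
    for length in peak_lengths:
        diffs = [abs(length - c) for c in bin_centers]
        best = int(np.argmin(diffs))
        if diffs[best] <= window:            # inside the bin window
            assignments.setdefault(bin_centers[best], []).append(length)
    return assignments
```

Peaks falling outside every window stay unassigned, mirroring fragments that cannot be linked to any probe target.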

Raw data quality control
In the final step of the raw data analysis, the software performs several quality checks and translates them into simple scores (figure 6). These quality checks result from a comparison of sample-specific properties, such as baseline height, peak signal intensity, signal to size drop and incorporated percentage of primer, to expected standards specific for each capillary system. Several quality checks are furthermore performed using the control fragments, providing information about the used DNA itself as described before (Coffa, 2008). The quality scores allow users to easily trace problems to the fragment separation process, the MLPA reaction, the DNA concentration or the DNA denaturation. Users may then reject, accept or adjust sample types before starting the comparative analysis.

Fig. 6. Coffalyser.NET screenshot. FRSS means fragment run separation score. FMRS means fragment MLPA reaction score. The Probes column displays the number of found signals against the number of expected signals. The last columns display the quality of the DNA concentration and denaturation and the presence of the X- and Y-fragments.

Comparative analysis
During the comparative part of the analysis we aim to isolate the amount of variation introduced over the repeatedly measured data and to provide the user with meaningful data by means of reporting and visualization methods. The program is equipped with several normalization strategies to allow the underlying characteristics of different types of data sets to be compared. During normalization we bring MLPA data (probe peak signals) of unknown and reference samples to a common scale, making the data easier to interpret. In MLPA, normalization refers to the division of multiple sets of data by a common variable, or normalization constant, in order to cancel out that variable's effect on the data. MLPA kits usually contain so-called reference probes, which are targeted to chromosomal regions that are assumed to remain normal (diploid) in the DNA of all used samples. Our algorithm can make use of the reference probes in multiple ways to compose a common variable. If an MLPA kit does not contain any reference probes, the common variable can be composed of probes selected by the user, or the program will make an auto-selection. After normalization, the relative amount of fluorescence related to each probe can be expressed in dosage quotients, which is the usual method of interpreting MLPA data (Yau SC, 1996). This dosage quotient, or ratio, is a measure of the ratio in which the target sequence is present in the sample DNA as compared to the reference DNA, or relative ploidy. To make the normalization more robust, our algorithm uses every MLPA probe signal that is set as a reference probe for normalization to produce an independent ratio (DQ i,h,j,z). The data for each test probe of each sample (DQ i,h,j) is compared to each available reference sample (S h = n), producing as many dosage quotients as there are reference samples. The final ratio (DQ i,j) is then estimated by calculating the average over these dosage quotients. If no reference samples are set, each sample is used as a reference and the median over the ratios is calculated.
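The scheme above can be sketched with the common form of the MLPA dosage quotient, in which each test-probe signal is normalized against every reference probe in both the test sample and every reference sample, and the independent quotients are averaged. Whether Coffalyser composes the quotient in exactly this order is an assumption; the function and its names are illustrative:

```python
import numpy as np

def dosage_quotients(sample, references, ref_probes):
    """Dosage quotient sketch: `sample` and each element of `references`
    map probe name -> peak signal; `ref_probes` names the reference
    probes. Every (reference sample, reference probe) pair yields one
    independent quotient; the final DQ is their average."""
    dqs = {}
    for probe, signal in sample.items():
        ratios = []
        for ref_sample in references:
            for rp in ref_probes:
                # DQ = (test signal / ref-probe signal in test sample)
                #    / (same ratio in the reference sample)
                ratios.append((signal / sample[rp]) /
                              (ref_sample[probe] / ref_sample[rp]))
        dqs[probe] = float(np.mean(ratios))
    return dqs
```

A probe whose relative signal is doubled in the test sample thus receives a DQ near 2.0, while reference probes themselves stay near 1.0.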

Dealing with sample to sample variation
Each MLPA probe is multiplied during the amplification reaction with a probe-specific efficiency that is mainly determined by the sequence of the probe, resulting in a probe-specific bias. Even though the relative difference in signal intensity of these probes between different samples can be determined by normalization or visual assessment (figure 1), the calculated ratio results may not always be easy to interpret using arbitrary thresholds alone. This is mainly due to sample-to-sample variation or, more specifically, a difference in the amplification efficiency of probe targets between reference and sample targets. Chemical remnants from the DNA extraction procedure and other treatments the sample tissue was subjected to may contribute impurities that influence the Taq DNA polymerase fidelity. Alternatively, target DNA sequences may have been modified by external factors, e.g. aggressive chemical reactants and/or UV irradiation, which may result in differences in amplification rate or in extensive secondary structures of the template DNA that prevent the polymerase enzyme from accessing regions of the target DNA (Elizabeth van Pelt-Verkuil, 2008). An effect commonly seen in MLPA data is a drop in signal intensity that is proportional to the length of the MLPA product fragments (figure 7). This signal to size drop is caused by a decreasing amplification efficiency of the larger MLPA probes and may be intensified by sample contaminants or by evaporation during the hybridization reaction. Signal to size drop may further be influenced by injection bias of the capillary system and by diffusion of the MLPA products within the capillaries. In order to minimize the amount of variation in and between reference and sample data and to create a robust normalization strategy, our algorithm follows 7 steps.
By automatically interpreting the results after each step, our algorithm can adjust the parameters used for the next step, thereby minimizing the amount of error that may be introduced by the use of aberrant reference signals. The following 7 steps are performed in a single comparative analysis round: 1. Normalization of all data in population mode. Each sample is applied as a reference sample and each probe as a reference probe. 2. Determination of the significance of the found results by automatic evaluation using effect-size statistics and comparison of samples to the available sample type populations. 3. Measurement of the relative amount of signal to size drop. If the relative drop is less than 12%, a direct normalization will suffice; any larger drop is automatically corrected by means of regression analysis (steps 4-5). 4. Before correction of the actual amount of signal to size drop, samples are corrected for the MLPA-mix-specific probe signal bias. This is done by calculating the extent of this bias in each reference run, by regressing the probe signals against the probe lengths using a local median least squares method. Correction factors for these probe-specific biases are then computed by dividing each actual probe signal by its predicted signal. The final probe-wise correction factor is determined by taking the median of the calculated values over all reference runs. This correction factor is then applied to all runs to reduce the effect of probe bias, due to particular probe properties, on the forthcoming regression normalization.
5. Next, we calculate the amount of signal to size drop for every sample by using a function in which the log-transformed, probe-bias-corrected signals are regressed against the probe lengths using a 2nd-order least squares method. Signals from aberrant targets are left out of this function by applying an outlier detection method that makes use of the results found at step 2, as well as correlation measurements of the predicted line. The signal-to-size-corrected values can then be obtained by calculating the distance of each log-transformed pre-normalized signal to its predicted signal. 6. Normalization of the signal-to-size-corrected data in the user-selected mode and determination of the significance of the found results. 7. Our algorithm then measures the amount of variation that could not be resolved in the final normalization, to aid in results interpretation and automatic sample classification. To measure the imprecision of the normalization constant, each time a sample is normalized against a reference, the median of absolute deviations (MAD i,h,j) is calculated between the final probe ratio (DQ i,h,j) and the independent dosage quotients obtained from each reference probe (DQ i,h,j,z). The collected MAD values are then averaged over the reference samples to estimate the final amount of variation introduced by the imprecision of the reference probes. Our algorithm estimates the final MAD i,j for each probe j in sample i by equation 7.
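The regression correction of step 5 can be sketched as follows. The outlier detection described in the text is omitted for brevity, and the function is an illustrative reconstruction, not Coffalyser's implementation:

```python
import numpy as np

def correct_size_drop(signals, lengths):
    """Signal-to-size-drop correction sketch: regress log-transformed probe
    signals on probe lengths with a 2nd-order polynomial and return each
    probe's residual distance to the fitted curve as its corrected value."""
    logs = np.log(np.asarray(signals, dtype=float))
    lengths = np.asarray(lengths, dtype=float)
    coeffs = np.polyfit(lengths, logs, 2)     # 2nd-order least squares fit
    predicted = np.polyval(coeffs, lengths)
    return logs - predicted                   # distance to the fitted curve
```

When the drop is a smooth function of length, the residuals of normal probes cluster around zero, so a genuinely deleted or duplicated target stands out against the corrected baseline.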
Discrepancies in the dosage quotients estimated from the used reference probes and/or reference samples may increase the width of this confidence range, indicating a poor normalization. A 95% confidence level is commonly taken as a threshold indicating virtual certainty (Zar, J.H., 1984).
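The per-probe imprecision measure described in step 7 can be sketched directly; the function name is illustrative:

```python
import numpy as np

def probe_mad(final_dq, independent_dqs):
    """MAD sketch: median of absolute deviations between the final probe
    ratio and the independent dosage quotients produced by each reference
    probe. A large value signals disagreeing reference probes and thus a
    poor normalization."""
    devs = np.abs(np.asarray(independent_dqs, dtype=float) - final_dq)
    return float(np.median(devs))
```

Averaging these MAD values over the reference samples then yields the overall variation estimate of equation 7.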

Interpretation of the calculated dosage quotients
The previous sections explained how probe ratios are calculated and how our algorithm estimates the amount of introduced variation. In this section, we reflect on what those results mean for the user in practice. To make data interpretation easier, our program allows the use of advanced visualization methods, but it also contains an algorithm for automatic data interpretation. Our algorithm compares the ratio and standard deviation of a test probe from a single sample to the behavior of that probe within a subcollection of samples. This allows the program, for instance, to recognize whether a result from an unknown sample is significantly different from the results found in the reference sample population. Alternatively, it may find that a sample is equal to a sample population, for instance a group of positive control samples. To estimate the behavior of a probe ratio within a sample population, we calculate the average value and standard deviation for each probe over samples of the same sample type. In order to calculate the confidence range of probe j in, for instance, the reference sample population, we need to solve equation 11. N in this case refers to all probe ratio results (DQ i,j) from samples that were defined in the normalization setup with the sample type: reference sample (h).
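A population confidence range of the kind equation 11 describes might be sketched as below. Whether Coffalyser builds the range from the standard deviation or the standard error, and the exact multiplier, are assumptions here:

```python
import numpy as np

def population_range(ratios, z=1.96):
    """Sketch of a 95% confidence range for a probe over a sample-type
    population: mean +/- z * sample standard deviation of the collected
    probe ratios (z = 1.96 for a 95% level)."""
    ratios = np.asarray(ratios, dtype=float)
    m = ratios.mean()
    s = ratios.std(ddof=1)        # sample standard deviation
    return (m - z * s, m + z * s)
```

For the reference population the range is centred near 1.0, so it directly bounds the "normal" behaviour against which unknown samples are compared.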
The probe results of each sample are then classified into three categories by comparison to the confidence ranges of the available sample types. A probe result of a sample is either significantly different from a sample population, equal to a sample population, or ambiguous. To define whether a probe result of an unknown sample is significantly different (>>*) from a sample population, our algorithm employs 2 criteria: 1. The difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples of the same sample type, needs to exceed a delta value of 0.3. When an unknown sample is compared to the reference sample population, the average ratio for each probe always approaches 1. 2. The confidence range of the probe of the unknown sample (equation 10) may not overlap with the confidence range of that probe in a sample population (equation 11). An unknown sample is classified as equal (=) to the population of a certain sample type if: 1. The difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples of the same sample type, is less than 0.3. 2. The probe ratio of the unknown sample falls within the confidence range of that probe in a sample population (equation 11). Ambiguous probe results consequently meet only one of the two criteria required to characterize the result as different or equal. Ambiguous probe results that do show a difference in the magnitude of the probe ratio, as compared to the average of that probe calculated over samples of the same sample type, but have overlapping 95% confidence ranges, are marked with an asterisk (*). If the overlap of the confidence ranges is less than 50%, the probe results are marked with a smaller-than or greater-than symbol plus an asterisk (<* or >*).
Ambiguous probe results that do not show a difference in the magnitude of the probe ratio, but do show a difference in confidence ranges, may be displayed with single or double smaller-than or greater-than symbols, depending on the size of the difference.

Fig. 8. Part of a PDF report from a tumor sample analyzed with the P335 MLPA kit. The report shows clear aberrations at 9p21.3, 9p13.2 and 12q21.33. Less clear is the ratio of RB1, which displays a slight decrease in signal as opposed to the reference population, but does not surpass the threshold value, due to sample mosaicism.
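The two-criterion classification above can be sketched compactly; the delta of 0.3 comes from the text, while the function itself and its signature are illustrative:

```python
def classify_probe(dq, ci, pop_mean, pop_ci, delta=0.3):
    """Classification sketch: a probe result is 'different' if its ratio
    deviates from the population mean by more than delta AND the two
    confidence ranges do not overlap; 'equal' if neither criterion holds;
    otherwise 'ambiguous'. `ci` and `pop_ci` are (low, high) ranges."""
    deviates = abs(dq - pop_mean) > delta
    overlaps = ci[0] <= pop_ci[1] and pop_ci[0] <= ci[1]
    if deviates and not overlaps:
        return "different"
    if not deviates and overlaps:
        return "equal"
    return "ambiguous"
```

A heterozygous deletion (DQ near 0.5) with a tight confidence range is thus flagged as different, while a result meeting only one criterion is left for the user to judge.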

Reporting and visualization
Automatic data interpretation cannot replace the specialist judgment of a researcher. Knowledge about the expected genetic defect of the target DNA and other sample information may be crucial. To assist the user with data interpretation, our software automatically sorts all probe results based on the last updated map view locations of the probes. Chromosomal aberrations often span larger regions (M. Hermsen, 2002), which allows probes targeted to such a region to cluster together after sorting. Our software can then create a single-page PDF report for each sample, containing a summary of all relevant data: probe ratios (figure 8), statistics, quality controls and charts (figure 2 & 4).

Fig. 9. Screenshot of a tumor sample analyzed with the P335 MLPA kit. Probe ratio results of targets estimated as significantly increased as opposed to the reference population are marked green; those estimated as significantly decreased are marked red.