Chromosomal structural changes in human body known as copy number alteration (CNA) are often associated with diseases, such as various forms of cancer. Therefore, accurate estimation of breakpoints of the CNAs is important to understand the genetic basis of many diseases. The high‐resolution comparative genomic hybridization (HR‐CGH) and single‐nucleotide polymorphism (SNP) technologies enable cost‐efficient and high‐throughput CNA detection. However, probing provided using these profiles gives data highly contaminated by intensive Gaussian noise having white properties. We observe the probabilistic properties of CNA in HR‐CGH and SNP measurements and show that jitter in the breakpoints can statistically be described with either the discrete skew Laplace distribution when the segmental signal‐to‐noise ratio (SNR) exceeds unity or modified Bessel function‐based approximation when SNR is <1. Based upon these approaches, the confidence masks can be developed and used to enhance the estimates of the CNAs for the given confidence probability by removing some unlikely existing breakpoints.
- copy number alterations
- confidence masks
It is well known that the deoxyribonucleic acid (DNA) of a genome essential for human life often demonstrates structural changes [1–3] called genome copy number alterations (CNAs) [4–6], which are associated with disease such as cancer . Analysis of the breakpoint locations in the CAN structure is still an important issue because it helps detecting structural alterations, load of alterations in the tumor genome, and absolute segment copy numbers. Thus, efficient estimators are required to extract information about the breakpoints with accuracy acceptable for medical needs. To produce CNA profile, several technologies have been developed such as comparative genomic hybridization (CGH) , high‐resolution CGH (HR‐CGH) , whole genome sequencing , and most recently single‐nucleotide polymorphism (SNP) . The HR‐CGH technology is still used widely in spite of its low resolution . It has been reported in  that the HR‐CGH arrays are accurate to detect structural variations (SVs) at the resolution of 200 bp (base pairs). Most recently, the single‐nucleotide polymorphism technology was developed in the study of Wang et al.  to provide high‐resolution measurements of the CNAs. In spite of their high resolution, the modern methods still demonstrate the inability in obtaining good estimates of the breakpoint locations because of the following factors: (1) the nature of biological material (tumor is contaminated by normal tissue, relative values, and unknown baseline for copy number estimation), (2) technological biases (quality of material and hybridization/sequencing), and (3) intensive random noise. The HR‐CGH and SNP profiles have demonstrated deficiency in detecting the CNAs, but noise in the detected changes still remains at a high level  and accurate estimators are required to extract information about structural changes.
In the HR‐CGH microarray technique, the CNAs are often normalized and plotted as , where R and G are the fluorescent Red and Green intensities, respectively . The CNA measurements using SNP technologies are represented by the Log‐R ratios (LRRs), which are the log‐transformed ratios of experimental and normal reference SNP intensities centered at zero for each sample . From the standpoint of signal processing, the following properties of the CNA function are of importance :
It is piecewise constant (PWC) and sparse with a small number of alterations on a long base‐pair length.
Constant values are integer, although this property is not survived in the log‐R ratio.
The measurement noise in the log‐R ratio is highly intensive and can be modeled as additive white Gaussian.
The CNA estimation problem is thus to predict the breakpoint locations and the segmental levels with a maximum possible accuracy and precision acceptable for medical applications. In this work, we developed our methods to two types of cancer: B‐cell chronic lymphocytic leukemia (B‐CLL) and BLC primary breast carcinoma. Nevertheless, the methods were designed to any samples of cancer with the characteristics described above.
2.1. CNA model and problem statement
Consider a chromosome section observed with some resolution , bp at discrete breakpoints, . An example of the CNA probes with a single breakpoint and two segments is shown in Figure 1. Suppose that the copy numbers change at breakpoints, , united in a vector . The measurement can thus be represented with a vector as
Introduce a vector of segmental levels,, where corresponds to the interval to , and , , to . In such a formulation, obeys the linear regression model
where the regression matrix is sparse,
having a component
in which the th column is filled with unity and all others are zeros. The number of the columns in is exactly . However, the number or the row depends on the interval . Thus, the row‐variant matrix (4) is . Additive noise in Eq. (2) is zero mean, , and white Gaussian with the covariance , where is an identity matrix and is a know variance.
The CNA estimation problem is thus to predict the breakpoint locations and evaluate the segmental changes with a maximum possible accuracy and precision acceptable for medical applications. The problem is complicated by short number of the probes in each neighboring segment and indistinct edges. Therefore, an analysis of the estimation errors caused by the segmental noise and jitter in the breakpoints is required.
2.2. Jitter probability in the breakpoints
Consider a typical genomic measurement of two neighboring CNA segments in white Gaussian noise with different segmental variances as shown in Figure 1. A constant signal changes from level to level around the breakpoint. In the presence of noise, the location of is not clear owing to commonly large segmental variances and . As an example, the Gaussian noise probability density functions (pdfs) and are shown in Figure 1 for . Let us notice that and cross each other in two points, and , provided that .
Now considerer probes in each segment neighboring to with an average resolution. We thus may assign an event meaning that measurement at point belongs to the th segment. Another event means that measurement at belongs to the th segment. In our approach, we think that a measured value belongs to one segment if the probability is larger than if it belongs to another segment. For example, any measurement point in the interval between and (Figure 1) is supposed to belong to the th segment. Following Figure 1 and assuming different noise variances and , the events and can be specified as follows :
Because each point can belong only to one segment, the inverse events are and .
Events and can be united into two blocks: .
If and occur simultaneously with unit probability each, then jitter at will never occur. However, some other events may be found, which do not obligatorily lead to jitter. We ignore such events and define approximately the probability of the jitter‐free breakpoint as
The inverse event can be called the approximate jitter probability .
2.3. Jitter distribution in the breakpoints
To determine the confidence limits for CNAs using high‐resolution genomic arrays, jitter in the breakpoints must be specified statistically for the segmental Gaussian distribution. This can be done approximately if to employ either the discrete skew Laplace distribution or, more accurately, the modified Bessel function of the second kind and zeroth order.
2.3.1. Approximation with discrete skew Laplace distribution
Following the definition of the jitter probability given in Section 2.1 and taking into consideration that all the events are independent in white Gaussian noise, Eq. (7) can be rewritten as: where, following Eqs. (5) and (6), the probabilities and can be specified as, respectively,
where is the Gaussian density.
Suppose that jitter occurs at some point as shown, for example, in Figure 1, and assign two additional blocks of events and . The probability that jitter occurs at the th point to the left from (left jitter) and the probability that jitter occurs at the th point to the right from (right jitter) can thus be written as, respectively,
Further normalization of to have a unit area leads to the pdf , where is the sum of for all ,
where and .
It follows from the approximation admitted that converges with only if . Otherwise, if , the sum is infinite and cannot be transformed to . It has been shown in  that such a situation is practically rare. It can be observed with extremely small and different segmental SNRs when the probabilities are comparable that the measurement point belongs to one of another segment.
Accepting , one concludes that , , and . Next, using a standard relation , where , and after little transformations, Eq. (14) can be brought to
The jitter pdf associated with the lth breakpoint can finally be found to be
where is specified by Eq. (15) and .
where . For we have and transform it to the equation , where a proper solution is
and which gives us
where , , , , is the error function, is the complementary error function, and
2.3.2. Approximation of jitter distribution using the modified Bessel functions
An analysis shows that the discrete skew Laplace pdf (17) gives good results only if SNR is >1. Otherwise, real measurements do not fit well, and a more accurate function is required. Below, we show that better approach to real jitter distribution can be provided using the modified Bessel functions.
184.108.40.206. Modified Bessel function
Figure 2 demonstrates the jitter pdf measured experimentally (dotted) for different SNRs. The breakpoint corresponds here to the peak density and the probability of the breakpoint location diminishes to the left and to the right of this point. Note that the discrete skew Laplace pdf (17) behaves linearly in such scales. Therefore, Eq. (17) cannot be applied when SNR is <1 and a more accurate function is required.
Among available functions demonstrating the pdf properties, the modified Bessel function of the second kind and zeroth order is a most good candidate to fit the experimentally measured densities (Figure 2). The following form of can be used:
in which a variable depends on index which represents a discrete departure from the assumed breakpoint location. Because is a positive‐valued for smooth function decreasing with to zero, it can be used to approximate the probability density.
In order to use Eq. (24) as an approximating function
conditioned on for the one‐sided jitter probability densities shown in Figure 2, we represent a variable via as in a way such that small correspond to large values of and vice versa. Among several candidates, it has been found empirically that the following function fits the histograms with highest accuracy:
if to set for , for , and for , and represent the coefficients and as , and as
where , , , , , , and were found in the mean square error (MSE) sense. These values were found in several iterations until the MSE reached a minimum.
In summary, Figure 3 gives a typical example of a simulated CNA, where the modified Bessel function‐based approximation (depicted as MBA) demonstrates better accuracy than the approximation obtained using the skew Laplace distribution (depicted as SkL).
2.4. Probabilistic masks
It follows from Figure 3 that, in view of large noise, estimates of the CNAs may have low confidence, especially with small SNR . Thus, each estimate requires confidence boundaries within which it may exist with a given probability [20, 21].
Given an estimate of the th segmental level in white Gaussian noise, the probabilistic upper boundary (UB) and lower boundary (LB) can be specified for the given confidence probability in the ‐sigma sense as 
where indicates the boundary wideness in terms of the segmental noise variance on an interval points, from to .
Likewise, detected the th breakpoint location , the jitter probabilistic left boundary and right boundary can be defined, following , as
where and are specified by the jitter distribution in the ‐sigma sense.
By combining Eqs. (30) and (31) with Eqs. (32) and (33), the probabilistic masks can be formed as shown in  to bound the CNA estimates in the ‐sigma sense for the given confidence probability . An important property of these masks is that they can be used not only to bound the estimates and show their possible locations on a probabilistic field [20, 21] but also to remove supposedly wrong breakpoints. Such situations occur each time when the masks reveal double UB and LB uniformities in a gap of three neighboring detected breakpoints. If so, then the unlikely existing intermediate breakpoint ought to be removed.
Noticing that the segmental boundaries (30) and (31) remain the same irrespective of the jitter in the breakpoints, below we specify the masks for the jitter represented with the Laplace distribution (17) and Bessel‐based approximation (25).
2.4.1. Masks for Laplace distribution
The probabilistic UB mask
2.4.2. Masks for Bessel‐based approximation
The UB mask
that yields . Then, define the probabilities at and at as
Finally, define the jitter left boundary and right boundary as, respectively,
and use the algorithm previously designed in the study of Munoz‐Minjares and Shmaliy  for the confidence masks based on the Laplace distribution.
In this section, we test some CNA measurements and estimates by the algorithm developed in  based on the Laplace and Bessel approximations. In order to demonstrate the efficiency of the probabilistic masks and getting practically useful results, we exploit probes obtained by different technologies. First, we employ the results obtained with the HR‐CGH profile and test them by the probabilistic masks using Laplace distribution. We next demonstrate the efficiency of the Bessel‐based probabilistic masks versus the Laplace‐based masks for the probes obtained with the SNP profile.
3.1. HR‐CGH‐based probing
The first test is conducted in the three‐sigma sense suggesting that the CNAs exist between the UB and LB masks with high probability of The tested HR‐CGH array data are available from the representational oligonucleotide microarray analysis (ROMA) . The breakpoint locations are also given in . Voluntarily, we select data associated with potentially large jitter and large segmental errors. For clarity, we first compute some characteristics of the detected CNAs and notice that the segmental estimates found by averaging  are in a good correspondence with . The database processed is a part of the seventh chromosome in archive “159A–vs–159D–cut” of ROMA a sample of B‐cell chronic lymphocytic leukemia (B‐CLL). It is shown to have 14 segments and 13 breakpoints (Figure 4a and b). Below, we shall show that, owing to large detection noise, there is a high probability that some breakpoints do not exist.
It follows from Figure 4a that the only breakpoint which location can be estimated with high accuracy is . Jitter in and is moderate. All other breakpoints have large jitter. It is seen that the UB mask covering second to sixth segments is almost uniform. Thus, there is a probability that the second to fifth breakpoints do not exist. If to follow the LB mask, the locations of the second to fourth breakpoints can be predicted even with large errors. At least they can be supposed to exist. However, nothing definitive can be said about the fifth breakpoint and one may suppose that it does not exist. It is also hard to distinguish a true location of the eighth breakpoint. In Figure 4b, , , and are well detectable owing to large segmental SNRs. The breakpoint has a moderate jitter. In turn, the location of is unclear. Moreover, there is a probability that does not exist.
3.2. SNP‐based probing
Our purpose now is to apply the probabilistic mask with SNP profile that represents the CNA with low levels of SNR. Specifically, we employ the probes of the first chromosome available from “BLC_B1_T45.txt” a sample of primary breast carcinoma.
Inherently, the more accurate Bessel‐based approximation extends the jitter probabilistic boundaries with respect to the Laplace‐based ones, especially for low SNRs. We illustrate it in Figure 5, where the estimates of the first chromosome were tested by
In Figure 6, the masks
A special case can also be noticed when the masks
A conclusion that can be made based on the results illustrated in Figures 5–8 is that the Bessel‐based probabilistic masks can be used to improve estimates of the chromosomal changes for the required probability.
We finally notice that the computation time required by the masks to process the first chromosome from sample “BLC B1 T45.txt” with a length of n = 905215 was 2.634599 s using MATLAB software on a personal computer with a processor Intel Core i5, 2.5 GHz.
We evaluate the breakpoints obtained by the projects representational oligonucleotide microarray analysis  and GAP  with the confidence masks. As has been shown before, not all of the detected chromosomal changes have the same confidence to mean that there is a probability that some breakpoints do not exist. In order to improve the CNA estimates for the required confidence, the following process can be used:
and for the given confidence probability and bound the estimates.
If the masks reveal double uniformities, in and , in a GAP of any three neighboring breakpoints, then remove the intermediate breakpoint and estimate the segmental level between the survived breakpoints by simple averaging. The CNAs estimated in such a way will be valid for the given confidence
Application of this methodology to the CNA structure detected in frames of the Project GAP is shown in Figure 9. Its special feature is a number of hardly recognized small chromosomal changes (Figure 9a). We test them by the proposed masks and . To this end, we first start with equal confidence probabilities of for each estimate to exist or not exist and find out that three breakpoints demonstrate no detectability. We remove these breakpoints and depict their locations with “×”. Reasoning similarly, we remove four breakpoints to retain only probable changes, by nine breakpoints to show a picture combined with almost certain changes, by , and 10 breakpoints in the three‐sigma sense, . Observing the results, we infer that the masks are able to correct only the estimates obtained under the low SNRs. The relevant chromosomal sections S1–S7 are circled in Figure 9. It is not surprising because changes existing with high SNRs are seen visually. An estimator thus can easily detect them with high confidence.
Modern technologies developed to produce the CNA profiles with high resolution still admit intensive white Gaussian noise. Accordingly, not one estimator even ideal is able to provide jitter‐free estimation of segmental changes. Thus, in order to avoid wrong decisions, the estimates must be bounded for the confidence probability. Jitter exists in the CNA's breakpoints fundamentally. When SNR is >1, it can statistically be described using the discrete skew Laplace distribution. Otherwise, if SNR is <1, the Bessel‐based approximation produces more accuracy. By the jitter distribution, it is easy to find a region within which the breakpoint exists for the required probability. Of practical importance are the confidence UB and LB masks, which can be created based on the segmental and jitter distributions for the given confidence probability. The masks can serve as an auxiliary tool for medical experts to make decisions about the CNA structures. Applications to probes obtained using the HR‐CGH and SNP technologies confirm efficiency of the confidence masks.