Open access peer-reviewed chapter

Enhancing Estimates of Breakpoints in Genome Copy Number Alteration using Confidence Masks

By Jorge Muñoz‐Minjares, Yuriy Shmaliy and Oscar Ibarra‐Manzano

Submitted: November 18th 2015Reviewed: April 25th 2016Published: July 21st 2016

DOI: 10.5772/63913

Downloaded: 604

Abstract

Chromosomal structural changes in human body known as copy number alteration (CNA) are often associated with diseases, such as various forms of cancer. Therefore, accurate estimation of breakpoints of the CNAs is important to understand the genetic basis of many diseases. The high‐resolution comparative genomic hybridization (HR‐CGH) and single‐nucleotide polymorphism (SNP) technologies enable cost‐efficient and high‐throughput CNA detection. However, probing provided using these profiles gives data highly contaminated by intensive Gaussian noise having white properties. We observe the probabilistic properties of CNA in HR‐CGH and SNP measurements and show that jitter in the breakpoints can statistically be described with either the discrete skew Laplace distribution when the segmental signal‐to‐noise ratio (SNR) exceeds unity or modified Bessel function‐based approximation when SNR is <1. Based upon these approaches, the confidence masks can be developed and used to enhance the estimates of the CNAs for the given confidence probability by removing some unlikely existing breakpoints.

Keywords

  • copy number alterations
  • HR‐CGH
  • SNP
  • breakpoints
  • confidence masks

1. Introduction

It is well known that the deoxyribonucleic acid (DNA) of a genome essential for human life often demonstrates structural changes [13] called genome copy number alterations (CNAs) [46], which are associated with disease such as cancer [7]. Analysis of the breakpoint locations in the CAN structure is still an important issue because it helps detecting structural alterations, load of alterations in the tumor genome, and absolute segment copy numbers. Thus, efficient estimators are required to extract information about the breakpoints with accuracy acceptable for medical needs. To produce CNA profile, several technologies have been developed such as comparative genomic hybridization (CGH) [8], high‐resolution CGH (HR‐CGH) [9], whole genome sequencing [10], and most recently single‐nucleotide polymorphism (SNP) [11]. The HR‐CGH technology is still used widely in spite of its low resolution [12]. It has been reported in [13] that the HR‐CGH arrays are accurate to detect structural variations (SVs) at the resolution of 200 bp (base pairs). Most recently, the single‐nucleotide polymorphism technology was developed in the study of Wang et al. [11] to provide high‐resolution measurements of the CNAs. In spite of their high resolution, the modern methods still demonstrate the inability in obtaining good estimates of the breakpoint locations because of the following factors: (1) the nature of biological material (tumor is contaminated by normal tissue, relative values, and unknown baseline for copy number estimation), (2) technological biases (quality of material and hybridization/sequencing), and (3) intensive random noise. The HR‐CGH and SNP profiles have demonstrated deficiency in detecting the CNAs, but noise in the detected changes still remains at a high level [14] and accurate estimators are required to extract information about structural changes.

In the HR‐CGH microarray technique, the CNAs are often normalized and plotted as log2R/G=log2 ratio, where R and G are the fluorescent Red and Green intensities, respectively [12]. The CNA measurements using SNP technologies are represented by the Log‐R ratios (LRRs), which are the log‐transformed ratios of experimental and normal reference SNP intensities centered at zero for each sample [14]. From the standpoint of signal processing, the following properties of the CNA function are of importance [15]:

  • It is piecewise constant (PWC) and sparse with a small number of alterations on a long base‐pair length.

  • Constant values are integer, although this property is not survived in the log‐R ratio.

  • The measurement noise in the log‐R ratio is highly intensive and can be modeled as additive white Gaussian.

The CNA estimation problem is thus to predict the breakpoint locations and the segmental levels with a maximum possible accuracy and precision acceptable for medical applications. In this work, we developed our methods to two types of cancer: B‐cell chronic lymphocytic leukemia (B‐CLL) and BLC primary breast carcinoma. Nevertheless, the methods were designed to any samples of cancer with the characteristics described above.

2. Methods

2.1. CNA model and problem statement

Consider a chromosome section observed with some resolution Δ, bp at Mdiscrete breakpoints, n[1,M]. An example of the CNA probes with a single breakpoint and two segments is shown in Figure 1. Suppose that the copy numbers change at Kbreakpoints, 1<i1<<iK<M, united in a vector I= [i1i2iK]T. The measurement can thus be represented with a vector yMas

y= [y1y2yi1yi1+1 yi2yiKyM]T.E1

Figure 1.

Typical CNA measurements with white Gaussian noise with a single breakpoint, between two segments l and l+1 having different segmental variances. The pdf for neighboring segments are depicted as pl(x) and pl+1(x).

Introduce a vector aK+1of segmental levels,a= [a1a2aK+1]T, where a1corresponds to the interval [1,i1],aK+1to [iK,M], and aK, k2, to [ik1,ik]. In such a formulation, yobeys the linear regression model

y= A(I)a+ νE2

where the regression matrix AM×(K+1)is sparse,

A=[A1TA2TAK+1T]TE3

having a component

Ak=[010010010]E4

in which the kth column is filled with unity and all others are zeros. The number of the columns in Akis exactly K+1. However, the number or the row depends on the interval ikik1. Thus, the row‐variant matrix (4) is Ak(ikik1)×(K+1). Additive noise vin Eq. (2) is zero mean, E{v}=0, and white Gaussian with the covariance R=σv2I, where IM×Mis an identity matrix and σv2is a know variance.

The CNA estimation problem is thus to predict the breakpoint locations and evaluate the segmental changes x=A(I)awith a maximum possible accuracy and precision acceptable for medical applications. The problem is complicated by short number of the probes in each neighboring segment and indistinct edges. Therefore, an analysis of the estimation errors caused by the segmental noise and jitter in the breakpoints is required.

2.2. Jitter probability in the breakpoints

Consider a typical genomic measurement of two neighboring CNA segments in white Gaussian noise with different segmental variances as shown in Figure 1. A constant signal changes from level alto level al+1around the breakpoint il. In the presence of noise, the location of ilis not clear owing to commonly large segmental variances σl2and σl+12. As an example, the Gaussian noise probability density functions (pdfs) pl(x)and pl+1(x)are shown in Figure 1 for σl2>σl+12. Let us notice that pl(x)and pl+1(x)cross each other in two points, αland βl, provided that σl2σl+12.

Now considerer Nprobes in each segment neighboring to ilwith an average resolution. We thus may assign an event Aljmeaning that measurement at point ilNj<ilbelongs to the lth segment. Another event Bljmeans that measurement at ilj<il+N1belongs to the (l+1)th segment. In our approach, we think that a measured value belongs to one segment if the probability is larger than if it belongs to another segment. For example, any measurement point in the interval between αland βl(Figure 1) is supposed to belong to the (l+1)th segment. Following Figure 1 and assuming different noise variances σl2and σl+12, the events Aljand Bljcan be specified as follows [16]:

AljilNlj<ilis {(αl<xj)(xj<βl),σl2>σl+12,xj>αl,σl2=σl+12,αl<xj<βl,σl2<σl+12,E5
Bljilj<il+Nl1is {βl<xj<αlσl2<σl+12,xj<αl,σl2=σl+12,(xj<αl)(xj>βl), σl2>σl+12.E6

Because each point can belong only to one segment, the inverse events are A¯=1Aljand B¯=1Blj.

Events Aljand Bljcan be united into two blocks: Al={Al(ilN)Al(ilN+1)Al(il1)} and Bl={Bl(il)Bl(il+1)Bl(il+N1)}.

If Aland Bloccur simultaneously with unit probability each, then jitter at ilwill never occur. However, some other events may be found, which do not obligatorily lead to jitter. We ignore such events and define approximately the probability P(Al,Bl)of the jitter‐free breakpoint as

P(Al,Bl)=P(AilNAil1BilBil+N1)E7

The inverse event P¯(Al,Bl)=1P(Al,Bl)can be called the approximate jitter probability [17].

2.3. Jitter distribution in the breakpoints

To determine the confidence limits for CNAs using high‐resolution genomic arrays, jitter in the breakpoints must be specified statistically for the segmental Gaussian distribution. This can be done approximately if to employ either the discrete skew Laplace distribution or, more accurately, the modified Bessel function of the second kind and zeroth order.

2.3.1. Approximation with discrete skew Laplace distribution

Following the definition of the jitter probability given in Section 2.1 and taking into consideration that all the events are independent in white Gaussian noise, Eq. (7) can be rewritten as: P(Al,Bl)=PN(Al)PN(Bl),where, following Eqs. (5) and (6), the probabilities P(Al)and P(Bl)can be specified as, respectively,

P(Al)={1βlαlpl(x)dx,σl2>σl+12,αlpl(x)dx,σl2=σl+12,αlβlpl(x)dx,σl2<σl+12,E8
P(Bl)={βlαlpl+1(x)dx,σl2>σl+12,αlpl+1(x)dx,σl2=σl+12,1αlβlpl+1(x)dx,σl2<σl+12,E9

where pl(x)=1/2πσl2e((xal)2)/σl2is the Gaussian density.

Suppose that jitter occurs at some point il±k,0kN,as shown, for example, in Figure 1, and assign two additional blocks of events Alk={AilNAil1k}and Blk={Bil+kBil+N1}. The probability PkPk(AlkA¯l(ilk)A¯il1Bl)that jitter occurs at the kth point to the left from il(left jitter) and the probability Pk+Pk+(AlB¯l(il+1)B¯l(il+k1)Bk)that jitter occurs at the kth point to the right from il(right jitter) can thus be written as, respectively,

Pk=PNk(Al)[1P(Al)]kPN(Bl) ,E10
Pk+=PN(Al)[1P(Bl)]kPNk(Bl) ,E11

By normalizing Eqs. (11) and (12) with Eq. (8), one can arrive at a function that turns out to be independent on N:

fl(k)={[P1(Al)1]|k|, k<0,(left)1k=0[P1(Bl1)]k,k<0.(right) . E12

Further normalization of fl(k)to have a unit area leads to the pdf pl(k)=1φlfl(k), where φlis the sum of fl(k)for all k,

ϕl=1+k=1[φlA(k)+φlB(k)] ,E13

where φlA(k)=[P1(Al)1]k and φlB(k)=[P1(Bl)1]k.

It follows from the approximation admitted that fl(k)converges with konly if 0.5<P˜={P(A),P(B)}<1. Otherwise, if P˜<0.5, the sum φlis infinite and fl(k)cannot be transformed to pl(k). It has been shown in [18] that such a situation is practically rare. It can be observed with extremely small and different segmental SNRs when the probabilities are comparable that the measurement point belongs to one of another segment.

Accepting 0.5<P˜={P(A),P(B)}<1, one concludes that P˜<0, ln(1P˜)<0, and ln(1P˜)<ln(P˜). Next, using a standard relation k=1xk=1/(x11), where x<1, and after little transformations, Eq. (14) can be brought to

φl=P(Al)+P(Bl)1[12P(Al)[12P(Bl)]E14

The jitter pdf pl(k)associated with the lth breakpoint can finally be found to be

pl(k)=1φl{[P1(Al)1]|k|,k<0,1,k=0 , [P1(Bl)1]kk>0, E15

where ϕlis specified by Eq. (15) and 0.5<P(Al),P(Bl)<1.

If now to substitute ql=P1(Al)1and dl=P1(Bl)1, find P(Al)=1/(1+ql)and P(Bl)=1/(1+dl), and provide the transformations, then one may arrive at a conclusion that Eq. (16) is the discrete skew Laplace pdf [19].

p(k|dlql)=(1dl)(1ql)1dlql{plk,k0ql|k|k0E16

where dl=e(κl/νl)[0,1]and ql=e(1/κlνl)[0,1]and in which κland νl>0still need to be connected to Eq. (16). With this aim, consider Eqs. (16) and (17) at k=1,k=0, and k=1. By equating Eqs. (16) and (17), first obtain ((1dl)(1ql)dl)/(1dlql)=1/ϕl(1P(Bl))/P(Bl)for k=1and ((1dl)(1ql)ql)/(1dlql)=1/ϕl(1P(Al))/P(Al)for k=1that yields

νl=1κl2κllnμl ,E17

where μl=(P(Al)[1P(Bl)])/(P(Bl)[1P(Al)]). For k=0we have ((1dl)(1ql))/(1dlql)=1/ϕland transform it to the equation xl2(ϕl(1+μl))/(1+ϕl)x(1ϕl)/(1+ϕl)μl=0, where a proper solution is

x=φl(1+μl)2(1+φl)(114μl(1φl2)φl2(1+μl)2)E18

and which xl=μl(κl2)/(1κl2)gives us

κl=ln(xl)ln(xl/μl).E19

By combining Eq. (18) with Eq. (20), one may also get a simpler form for νl, namely νl=κl/lnxl.

Now, introduce the segmental signal‐to‐noise ratios (SNRs): γl=Δl2σl2, and γl+=Δl2σl+12, where Δl=al+1al, substitute the Gaussian pdf to Eqs. (9) and (10), provide the transformations, and rewrite Eqs. (9) and (10) as

P(Al)={1+12[erf(glβ)erf(glα)],γl<γl+12erfc(glα),γl=γl+12[erf(glβ)erf(glα)],γl>γl+,E20
P(Bl)={12[erf(hlα)erf(hlβ)],γl<γl+112erfc(hlα),γl=γl+1+12[erf(hlα)erf(hlβ)],γl>γl+,E21

where glβ=(βlΔl)/|Δl|γl/2, glα=(αlΔl)/|Δl|γl/2, hlβ=βl/|Δl|γl+/2, hlα=αl/|Δl|γl+/2, erf(x)is the error function, erfc(x)is the complementary error function, and

αl,βl=alγlalγl+γlγl+1γlγl+×(alal+1)γlγl++2Δl2(γlγl+)lnγlγl+ .E22

2.3.2. Approximation of jitter distribution using the modified Bessel functions

An analysis shows that the discrete skew Laplace pdf (17) gives good results only if SNR is >1. Otherwise, real measurements do not fit well, and a more accurate function is required. Below, we show that better approach to real jitter distribution can be provided using the modified Bessel functions.

2.3.2.1. Modified Bessel function

Figure 2 demonstrates the jitter pdf measured experimentally (dotted) for different SNRs. The breakpoint corresponds here to the peak density and the probability of the breakpoint location diminishes to the left and to the right of this point. Note that the discrete skew Laplace pdf (17) behaves linearly in such scales. Therefore, Eq. (17) cannot be applied when SNR is <1 and a more accurate function is required.

Figure 2.

Experimentally defined one‐sided jitter probability densities (dotted) of the breakpoint location for equal segmental SNR γ in the range of M=400 points with a true breakpoint at n=200. The experimental density functions were found using the Maximum Likelihood (ML) estimator. The histogram was plotted over 50×103 runs repeated nine times and average. Approximations (continuous) are provided using the proposed Bessel‐based approximation depicted as MBA.

Among available functions demonstrating the pdf properties, the modified Bessel function of the second kind K0(x)and zeroth order is a most good candidate to fit the experimentally measured densities (Figure 2). The following form of K0(x)can be used:

K0[x(k)]=0cos[x(k)sinht]dt=0cos[x(k)t]t2+1dt>0,x(k)>0,E23

in which a variable x(k)depends on index k,which represents a discrete departure from the assumed breakpoint location. Because K0[x(k)]is a positive‐valued for x(k)>0smooth function decreasing with xto zero, it can be used to approximate the probability density.

2.3.2.2. Approximation

In order to use Eq. (24) as an approximating function


B(k|γ)=K0[x(k)]E24

Figure 3.

Simulated CNA with a single breakpoint at n = 200 and segmental standard deviations σl and σl+1 corresponding to SNRs γl−=γl+=1.37: (a) measurement and (b) jitter distribution. Here, ML (circled) is the jitter pdf obtained experimentally using an ML estimator through a histogram over 50×103 runs, SkL (solid) is the Laplace distribution, and MBA (dashed) is the Bessel‐based approximation.

conditioned on γfor the one‐sided jitter probability densities shown in Figure 2, we represent a variable xvia kas x(k,γ)=ln[Φ(k,γ)]in a way such that small k0correspond to large values xof and vice versa. Among several candidates, it has been found empirically that the following function Φ(k,γ)fits the histograms with highest accuracy:

Φ(k,γ)=(|k|+1)ρ+τ|k|[1+γγϵ]E25

if to set γ=γlfor k<0, γ=γl+γl+2for k=0, and γ=γl+for k>0, and represent the coefficients and as τ(γ),ρ(γ), and ϵ(γ)as

τ(γ)=a0γ+a1E26
ρ(γ)= γ(b0γb1+a0)+b2E27
ϵ(γ)=c0γc1+c2E28

where a0=0.02737, a1=4.5×103, b0=0.3674, b1=0.3137, b2=0.8066, c0=0.8865, c1=1.033,and c2=1.233were found in the mean square error (MSE) sense. These values were found in several iterations until the MSE reached a minimum.

In summary, Figure 3 gives a typical example of a simulated CNA, where the modified Bessel function‐based approximation (depicted as MBA) demonstrates better accuracy than the approximation obtained using the skew Laplace distribution (depicted as SkL).

2.4. Probabilistic masks

It follows from Figure 3 that, in view of large noise, estimates of the CNAs may have low confidence, especially with small SNR γ1. Thus, each estimate requires confidence boundaries within which it may exist with a given probability [20, 21].

Given an estimate a^lof the lth segmental level in white Gaussian noise, the probabilistic upper boundary (UB) and lower boundary (LB) can be specified for the given confidence probability P(ϑ)in the ϑ‐sigma sense as [20]

a^lUBa^l+ε=a^l+ϑσj2Nl=a^l+ϑσ^lE29
a^lLBa^lε=a^l+ϑσj2Nl=a^l+ϑσ^lE30

where ϑindicates the boundary wideness in terms of the segmental noise variance σ^lon an interval Nlpoints, from n^l1to n^l1.

Likewise, detected the lth breakpoint location n^l, the jitter probabilistic left boundary JlLand right boundary JlRcan be defined, following [20], as

JlLn^lklR,E31
JlRn^l+klL,E32

where klR(ϑ)and klL(ϑ)are specified by the jitter distribution in the ϑ‐sigma sense.

By combining Eqs. (30) and (31) with Eqs. (32) and (33), the probabilistic masks can be formed as shown in [20] to bound the CNA estimates in the ϑ‐sigma sense for the given confidence probability P(ϑ). An important property of these masks is that they can be used not only to bound the estimates and show their possible locations on a probabilistic field [20, 21] but also to remove supposedly wrong breakpoints. Such situations occur each time when the masks reveal double UB and LB uniformities in a gap of three neighboring detected breakpoints. If so, then the unlikely existing intermediate breakpoint ought to be removed.

Noticing that the segmental boundaries (30) and (31) remain the same irrespective of the jitter in the breakpoints, below we specify the masks for the jitter represented with the Laplace distribution (17) and Bessel‐based approximation (25).

2.4.1. Masks for Laplace distribution

For the Laplace distribution (17), the jitter left boundary JlL(32) and right boundary JlR(33) can be defined in the ϑ–‐sigma sense if to specify klR(ϑ)and klL(ϑ)as shown in [18],

klR=νκln(1dl)(1ql)ξ(1dlql),E33
klL=νκln(1dl)(1ql)ξ(1dlql),E34

where [x] means a maximum integer lower than or equal to x. Note that functions (34) and (35) were obtained in [18] by equating (17) to ξ(Nl)=erfc(ϑ/2)and solving for kl.

The probabilistic UB maskand LB maskfor the Laplace distribution were formed in [17,20,21] by the segmental upper boundary a^lUBand lower boundary a^lLBand by the jitter left boundary JlLand jitter right boundary JlR. The algorithm for computingandmasks has been developed and applied to the CNA probes in [22].

2.4.2. Masks for Bessel‐based approximation

The UB maskand LB maskfor the Bessel‐based approximation can be formed using the same equations as for the Laplace distribution. Suppose that the Laplace pdf (17) is equal to the approximating function Βl(k)at k=0,

p(k=0|dl,ql)=Βl(k=0),E35

that yields Βl(k=0)=1ϕl. Then, define the probabilities PB(Al)at k=1and PΒ(Bl)at k=1as

PΒ(Al)=Bl(k=0)Βl(k=1)+Βl(k=0),E36
PΒ(Bl)=Bl(k=0)Βl(k=1)+Βl(k=0).E37

Next, substitute Eqs. (37) and (38) into Eqs. (19) and (20) to obtain κlBand νlB. The right‐hand jitter klBRand left‐hand jitter klBLcan now be specified by, respectively,

klBR=νlBκlBln1ξBl(k=0)E38
klBL=νlBκlBln1ξBl(k=0)E39

Finally, define the jitter left boundary JlBLand right boundary JlBRas, respectively,

JlBLn^lklBRE40
JlBRn^lklBL E41

and use the algorithm previously designed in the study of Munoz‐Minjares and Shmaliy [22] for the confidence masks based on the Laplace distribution.

3. Results

Figure 4.

and masks for the seventh chromosome taken from “159A–vs–159D–cut” of ROMA: (a) genomic location from 130 to 146 Mb and (b) genomic location from 146 to 156 Mb. Breakpoints î1, î6, î7, î9, î10, î12, and î13 are well detectable because jitter is moderate. Owing to large jitter the breakpoints î2, î3, î4, î5, î8, î9, and î11 cannot be estimated correctly. There is a probability that the breakpoints î2, î3, î4, î5, and î11 do not exist. There is a high probability that breakpoint î5 does not exist.

In this section, we test some CNA measurements and estimates by the algorithm developed in [22] based on the Laplace and Bessel approximations. In order to demonstrate the efficiency of the probabilistic masks and getting practically useful results, we exploit probes obtained by different technologies. First, we employ the results obtained with the HR‐CGH profile and test them by the probabilistic masks using Laplace distribution. We next demonstrate the efficiency of the Bessel‐based probabilistic masks versus the Laplace‐based masks for the probes obtained with the SNP profile.

3.1. HR‐CGH‐based probing

The first test is conducted in the three‐sigma sense suggesting that the CNAs exist between the UB and LB masks with high probability of P=99.73%.The tested HR‐CGH array data are available from the representational oligonucleotide microarray analysis (ROMA) [23]. The breakpoint locations are also given in [23]. Voluntarily, we select data associated with potentially large jitter and large segmental errors. For clarity, we first compute some characteristics of the detected CNAs and notice that the segmental estimates found by averaging [18] are in a good correspondence with [23]. The database processed is a part of the seventh chromosome in archive “159A–vs–159D–cut” of ROMA a sample of B‐cell chronic lymphocytic leukemia (B‐CLL). It is shown to have 14 segments and 13 breakpoints (Figure 4a and b). Below, we shall show that, owing to large detection noise, there is a high probability that some breakpoints do not exist.

It follows from Figure 4a that the only breakpoint which location can be estimated with high accuracy is i1. Jitter in i^6and i^7is moderate. All other breakpoints have large jitter. It is seen that the UB mask covering second to sixth segments is almost uniform. Thus, there is a probability that the second to fifth breakpoints do not exist. If to follow the LB mask, the locations of the second to fourth breakpoints can be predicted even with large errors. At least they can be supposed to exist. However, nothing definitive can be said about the fifth breakpoint and one may suppose that it does not exist. It is also hard to distinguish a true location of the eighth breakpoint. In Figure 4b, i10, i12, and i13are well detectable owing to large segmental SNRs. The breakpoint i9has a moderate jitter. In turn, the location of i11is unclear. Moreover, there is a probability that i11does not exist.

3.2. SNP‐based probing

Our purpose now is to apply the probabilistic mask with SNP profile that represents the CNA with low levels of SNR. Specifically, we employ the probes of the first chromosome available from “BLC_B1_T45.txt” a sample of primary breast carcinoma.

Inherently, the more accurate Bessel‐based approximation extends the jitter probabilistic boundaries with respect to the Laplace‐based ones, especially for low SNRs. We illustrate it in Figure 5, where the estimates of the first chromosome were tested by,,, andfor ϑ=3(confidence probability P=99.73%).

In Figure 6, the masksandare placed in the vicinity of segment a^18for several confidence probabilities: ϑ=0.6745(P=50%), ϑ=1(P=68.27%), ϑ=2(P=95.45%), and ϑ=3(P=99.73%). What the masks suggest here is that the CNA evidently exists with high probability, but the segmental levels and the breakpoint locations cannot be estimated with high accuracy, owing to low SNRs.

Figure 5.

Jitter left boundaries ℬlℬL, JlL and right boundaries JlℬR, JlR for the breakpoint i^2 of first chromosome from sample BLC_B1_T45.txt (primary breast carcinoma). The probabilistic masks detect a breakpoint with a confidence probability ϑ = 3 (P = 99.73%).

Figure 6.

The and masks placed around the segmental level a18 for several confidence probabilities [20]. Here, the CNA exists with high probability, but the segmental levels and the breakpoint locations cannot estimate with high accuracy.

Figure 7.

The confidence masks placed around a10 for ϑ=0.6745 (P=50 %) and ϑ=3 (P=99.73 %). Masks and do not confirm an existence of segmental changes while and indicate a small change.

Figure 8.

The confidence masks , ℒlLB,ℬlUB and . placed around the breakpoint i^20 for confidence probabilities ϑ=0.6745 and ϑ=3 of first chromosome from sample BLC_B1_T45.txt. The confidence masks based on Laplace distribution cannot detect the breakpoint i^20.

A special case can also be noticed when the masksandare not able to confirm or deny an existence of segmental changes with high probability, owing to the inability of computing the Laplace‐based masks for extremely low SNRs. Figures 7 and 8 illustrate such situations. Just on the contrary, masksandcan be computed for any reasonable SNR.

A conclusion that can be made based on the results illustrated in Figures 58 is that the Bessel‐based probabilistic masks can be used to improve estimates of the chromosomal changes for the required probability.

We finally notice that the computation time required by the masks to process the first chromosome from sample “BLC B1 T45.txt” with a length of n = 905215 was 2.634599 s using MATLAB software on a personal computer with a processor Intel Core i5, 2.5 GHz.

4. Discussion

We evaluate the breakpoints obtained by the projects representational oligonucleotide microarray analysis [23] and GAP [14] with the confidence masks. As has been shown before, not all of the detected chromosomal changes have the same confidence to mean that there is a probability that some breakpoints do not exist. In order to improve the CNA estimates for the required confidence, the following process can be used:

  1. Obtain estimates of the CNA using the standard CBS algorithm [24, 25] or any other algorithm.

  2. Compute masksandfor the given confidence probability P,%and bound the estimates.

  3. If the masks reveal double uniformities, in UBand LB, in a GAP of any three neighboring breakpoints, then remove the intermediate breakpoint and estimate the segmental level between the survived breakpoints by simple averaging. The CNAs estimated in such a way will be valid for the given confidence P,%.

Application of this methodology to the CNA structure detected in frames of the Project GAP is shown in Figure 9. Its special feature is a number of hardly recognized small chromosomal changes (Figure 9a). We test them by the proposed masks lUBand lLB. To this end, we first start with equal confidence probabilities of P=50%for each estimate to exist or not exist and find out that three breakpoints demonstrate no detectability. We remove these breakpoints and depict their locations with “×”. Reasoning similarly, we remove four breakpoints to retain only probable changes, by P=75%,nine breakpoints to show a picture combined with almost certain changes, by P=93%, and 10 breakpoints in the three‐sigma sense, P=99.73%. Observing the results, we infer that the masks are able to correct only the estimates obtained under the low SNRs. The relevant chromosomal sections S1–S7 are circled in Figure 9. It is not surprising because changes existing with high SNRs are seen visually. An estimator thus can easily detect them with high confidence.

Figure 9.

Improving estimates of the CNAs obtained in Project GAP [25] by removing some unlikely existing breakpoints: (a) original estimates, (b) even changes, P = 50%, (c) probable changes, P = 75%, (d) almost certain changes, P = 93%, and (e) three‐sigma sense, P = 99.73%.

5. Conclusions

Modern technologies developed to produce the CNA profiles with high resolution still admit intensive white Gaussian noise. Accordingly, not one estimator even ideal is able to provide jitter‐free estimation of segmental changes. Thus, in order to avoid wrong decisions, the estimates must be bounded for the confidence probability. Jitter exists in the CNA's breakpoints fundamentally. When SNR is >1, it can statistically be described using the discrete skew Laplace distribution. Otherwise, if SNR is <1, the Bessel‐based approximation produces more accuracy. By the jitter distribution, it is easy to find a region within which the breakpoint exists for the required probability. Of practical importance are the confidence UB and LB masks, which can be created based on the segmental and jitter distributions for the given confidence probability. The masks can serve as an auxiliary tool for medical experts to make decisions about the CNA structures. Applications to probes obtained using the HR‐CGH and SNP technologies confirm efficiency of the confidence masks.

© 2016 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Jorge Muñoz‐Minjares, Yuriy Shmaliy and Oscar Ibarra‐Manzano (July 21st 2016). Enhancing Estimates of Breakpoints in Genome Copy Number Alteration using Confidence Masks, Advanced Biosignal Processing and Diagnostic Methods, Christoph Hintermüller, IntechOpen, DOI: 10.5772/63913. Available from:

chapter statistics

604total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Classifying and Predicting Respiratory Function Based on Gait Analysis

By Yu Sheng Chan, Wen Te Liu and Ching Te Chiu

Related Book

Frontiers in Guided Wave Optics and Optoelectronics

Edited by Bishnu Pal

First chapter

Frontiers in Guided Wave Optics and Optoelectronics

By Bishnu Pal

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us