Open access peer-reviewed chapter

Enhancing Estimates of Breakpoints in Genome Copy Number Alteration using Confidence Masks

Written By

Jorge Muñoz‐Minjares, Yuriy Shmaliy and Oscar Ibarra‐Manzano

Submitted: 18 November 2015 Reviewed: 25 April 2016 Published: 21 July 2016

DOI: 10.5772/63913

From the Edited Volume

Advanced Biosignal Processing and Diagnostic Methods

Edited by Christoph Hintermüller

Chapter metrics overview

1,330 Chapter Downloads

View Full Metrics

Abstract

Chromosomal structural changes in human body known as copy number alteration (CNA) are often associated with diseases, such as various forms of cancer. Therefore, accurate estimation of breakpoints of the CNAs is important to understand the genetic basis of many diseases. The high‐resolution comparative genomic hybridization (HR‐CGH) and single‐nucleotide polymorphism (SNP) technologies enable cost‐efficient and high‐throughput CNA detection. However, probing provided using these profiles gives data highly contaminated by intensive Gaussian noise having white properties. We observe the probabilistic properties of CNA in HR‐CGH and SNP measurements and show that jitter in the breakpoints can statistically be described with either the discrete skew Laplace distribution when the segmental signal‐to‐noise ratio (SNR) exceeds unity or modified Bessel function‐based approximation when SNR is <1. Based upon these approaches, the confidence masks can be developed and used to enhance the estimates of the CNAs for the given confidence probability by removing some unlikely existing breakpoints.

Keywords

  • copy number alterations
  • HR‐CGH
  • SNP
  • breakpoints
  • confidence masks

1. Introduction

It is well known that the deoxyribonucleic acid (DNA) of a genome essential for human life often demonstrates structural changes [13] called genome copy number alterations (CNAs) [46], which are associated with disease such as cancer [7]. Analysis of the breakpoint locations in the CAN structure is still an important issue because it helps detecting structural alterations, load of alterations in the tumor genome, and absolute segment copy numbers. Thus, efficient estimators are required to extract information about the breakpoints with accuracy acceptable for medical needs. To produce CNA profile, several technologies have been developed such as comparative genomic hybridization (CGH) [8], high‐resolution CGH (HR‐CGH) [9], whole genome sequencing [10], and most recently single‐nucleotide polymorphism (SNP) [11]. The HR‐CGH technology is still used widely in spite of its low resolution [12]. It has been reported in [13] that the HR‐CGH arrays are accurate to detect structural variations (SVs) at the resolution of 200 bp (base pairs). Most recently, the single‐nucleotide polymorphism technology was developed in the study of Wang et al. [11] to provide high‐resolution measurements of the CNAs. In spite of their high resolution, the modern methods still demonstrate the inability in obtaining good estimates of the breakpoint locations because of the following factors: (1) the nature of biological material (tumor is contaminated by normal tissue, relative values, and unknown baseline for copy number estimation), (2) technological biases (quality of material and hybridization/sequencing), and (3) intensive random noise. The HR‐CGH and SNP profiles have demonstrated deficiency in detecting the CNAs, but noise in the detected changes still remains at a high level [14] and accurate estimators are required to extract information about structural changes.

In the HR‐CGH microarray technique, the CNAs are often normalized and plotted as log2R/G=log2 ratio, where R and G are the fluorescent Red and Green intensities, respectively [12]. The CNA measurements using SNP technologies are represented by the Log‐R ratios (LRRs), which are the log‐transformed ratios of experimental and normal reference SNP intensities centered at zero for each sample [14]. From the standpoint of signal processing, the following properties of the CNA function are of importance [15]:

  • It is piecewise constant (PWC) and sparse with a small number of alterations on a long base‐pair length.

  • Constant values are integer, although this property is not survived in the log‐R ratio.

  • The measurement noise in the log‐R ratio is highly intensive and can be modeled as additive white Gaussian.

The CNA estimation problem is thus to predict the breakpoint locations and the segmental levels with a maximum possible accuracy and precision acceptable for medical applications. In this work, we developed our methods to two types of cancer: B‐cell chronic lymphocytic leukemia (B‐CLL) and BLC primary breast carcinoma. Nevertheless, the methods were designed to any samples of cancer with the characteristics described above.

Advertisement

2. Methods

2.1. CNA model and problem statement

Consider a chromosome section observed with some resolution Δ, bp at M discrete breakpoints, n [1,M]. An example of the CNA probes with a single breakpoint and two segments is shown in Figure 1. Suppose that the copy numbers change at K breakpoints, 1<i1 < <iK<M , united in a vector I= [i1i2 iK]T. The measurement can thus be represented with a vector y M as

y= [y1y2yi1yi1+1 yi2yiKyM]T.E1

Figure 1.

Typical CNA measurements with white Gaussian noise with a single breakpoint, between two segments l and l+1 having different segmental variances. The pdf for neighboring segments are depicted as pl(x) and pl+1(x).

Introduce a vector a K+1 of segmental levels,a= [a1a2 aK+1]T , where a1 corresponds to the interval [1, i1],aK+1 to [iK,M], and aK, k2, to [ik1,ik]. In such a formulation, y obeys the linear regression model

y= A(I)a+ νE2

where the regression matrix AM×(K+1) is sparse,

A=[A1TA2T AK+1T]TE3

having a component

Ak=[010010010]E4

in which the kth column is filled with unity and all others are zeros. The number of the columns in Ak is exactly K+1. However, the number or the row depends on the interval ikik1. Thus, the row‐variant matrix (4) is Ak(ikik1)×(K+1). Additive noise v in Eq. (2) is zero mean, E{v}=0, and white Gaussian with the covariance R=σv2I, where IM×M is an identity matrix and σv2 is a know variance.

The CNA estimation problem is thus to predict the breakpoint locations and evaluate the segmental changes x=A(I)a with a maximum possible accuracy and precision acceptable for medical applications. The problem is complicated by short number of the probes in each neighboring segment and indistinct edges. Therefore, an analysis of the estimation errors caused by the segmental noise and jitter in the breakpoints is required.

2.2. Jitter probability in the breakpoints

Consider a typical genomic measurement of two neighboring CNA segments in white Gaussian noise with different segmental variances as shown in Figure 1. A constant signal changes from level al to level al+1 around the breakpoint il. In the presence of noise, the location of il is not clear owing to commonly large segmental variances σl2 and σl+12. As an example, the Gaussian noise probability density functions (pdfs) pl(x) and pl+1(x) are shown in Figure 1 for σl2>σl+12. Let us notice that pl(x) and pl+1(x) cross each other in two points, αl and βl, provided that σl2σl+12.

Now considerer N probes in each segment neighboring to il with an average resolution. We thus may assign an event Alj meaning that measurement at point ilNj<il belongs to the lth segment. Another event Blj means that measurement at ilj<il+N1 belongs to the (l+1)th segment. In our approach, we think that a measured value belongs to one segment if the probability is larger than if it belongs to another segment. For example, any measurement point in the interval between αl and βl (Figure 1) is supposed to belong to the (l+1)th segment. Following Figure 1 and assuming different noise variances σl2 and σl+12, the events Alj and Blj can be specified as follows [16]:

AljilNlj<ilis {(αl<xj)(xj<βl),σl2>σl+12,xj>αl,σl2=σl+12,αl<xj<βl,σl2<σl+12,E5
Bljilj<il+Nl1is {βl<xj<αlσl2<σl+12,xj<αl,σl2=σl+12,(xj<αl)(xj>βl), σl2>σl+12.E6

Because each point can belong only to one segment, the inverse events are A¯=1Alj and B¯=1Blj.

Events Alj and Blj can be united into two blocks: Al={Al(ilN)Al(ilN+1)Al(il1)} and Bl={Bl(il)Bl(il+1)Bl(il+N1)}.

If Al and Bl occur simultaneously with unit probability each, then jitter at il will never occur. However, some other events may be found, which do not obligatorily lead to jitter. We ignore such events and define approximately the probability P(Al,Bl) of the jitter‐free breakpoint as

P(Al,Bl)=P(AilNAil1BilBil+N1)E7

The inverse event P¯(Al,Bl)=1P(Al, Bl) can be called the approximate jitter probability [17].

2.3. Jitter distribution in the breakpoints

To determine the confidence limits for CNAs using high‐resolution genomic arrays, jitter in the breakpoints must be specified statistically for the segmental Gaussian distribution. This can be done approximately if to employ either the discrete skew Laplace distribution or, more accurately, the modified Bessel function of the second kind and zeroth order.

2.3.1. Approximation with discrete skew Laplace distribution

Following the definition of the jitter probability given in Section 2.1 and taking into consideration that all the events are independent in white Gaussian noise, Eq. (7) can be rewritten as: P(Al,Bl)=PN(Al)PN(Bl), where, following Eqs. (5) and (6), the probabilities P (Al) and P (Bl) can be specified as, respectively,

P(Al)={1βlαlpl(x)dx,σl2>σl+12,αlpl(x)dx,σl2=σl+12,αlβlpl(x)dx,σl2<σl+12, E8
P(Bl)={βlαlpl+1(x)dx,σl2>σl+12,αlpl+1(x)dx,σl2=σl+12,1αlβlpl+1(x)dx,σl2<σl+12, E9

where pl(x)=1/2πσl2e((xal)2)/σl2 is the Gaussian density.

Suppose that jitter occurs at some point il±k,0kN, as shown, for example, in Figure 1, and assign two additional blocks of events Alk={AilNAil1k} and Blk={Bil+kBil+N1}. The probability PkPk(AlkA¯l(ilk)A¯il1Bl) that jitter occurs at the kth point to the left from il (left jitter) and the probability Pk+Pk+(AlB¯l(il+1)B¯l(il+k1)Bk) that jitter occurs at the kth point to the right from il (right jitter) can thus be written as, respectively,

Pk=PNk(Al)[1P(Al)]kPN(Bl) ,E10
Pk+=PN(Al)[1P(Bl)]kPNk(Bl) ,E11

By normalizing Eqs. (11) and (12) with Eq. (8), one can arrive at a function that turns out to be independent on N:

fl(k)={[P1(Al)1]|k|, k<0, (left)1k=0[P1(Bl1)]k,k<0. (right) . E12

Further normalization of fl(k) to have a unit area leads to the pdf pl(k)=1φlfl(k), where φl is the sum of fl(k) for all k,

ϕl=1+k=1[φlA(k)+φlB(k)] ,E13

where φlA(k)=[P1(Al)1]k  and φlB(k)=[P1(Bl)1]k.

It follows from the approximation admitted that fl(k) converges with k only if 0.5<P˜={P(A),P(B)}<1. Otherwise, if P˜<0.5, the sum φl is infinite and fl(k) cannot be transformed to pl(k). It has been shown in [18] that such a situation is practically rare. It can be observed with extremely small and different segmental SNRs when the probabilities are comparable that the measurement point belongs to one of another segment.

Accepting 0.5<P˜={P(A),P(B)}<1, one concludes that P˜<0, ln(1P˜)<0, and ln(1P˜)<ln(P˜). Next, using a standard relation k=1xk=1/(x11), where x<1, and after little transformations, Eq. (14) can be brought to

φl=P(Al)+P(Bl)1[12P(Al)[12P(Bl)]E14

The jitter pdf pl(k) associated with the lth breakpoint can finally be found to be

pl(k)=1φl{[P1(Al)1]|k|,k<0 ,1,k=0 , [P1(Bl)1]kk>0 , E15

where ϕl is specified by Eq. (15) and 0.5<P(Al),P(Bl)<1.

If now to substitute ql=P1(Al)1 and dl=P1(Bl )1 , find P(Al)=1/(1+ql) and P(Bl)=1/(1+dl), and provide the transformations, then one may arrive at a conclusion that Eq. (16) is the discrete skew Laplace pdf [19].

p(k|dlql)=(1dl)(1ql)1dlql{plk,k0ql|k|k0E16

where dl=e(κl/νl)[0,1] and ql=e(1/κlνl)[0,1] and in which κl and νl>0 still need to be connected to Eq. (16). With this aim, consider Eqs. (16) and (17) at k=1, k=0, and k=1. By equating Eqs. (16) and (17), first obtain ((1dl)(1ql)dl)/(1dlql)=1/ϕl(1P(Bl))/P(Bl) for k=1 and ((1dl)(1ql)ql)/(1dlql)=1/ϕl(1P(Al))/P(Al) for k=1 that yields

νl=1κl2κllnμl ,E17

where μl=(P(Al)[1P(Bl)])/(P(Bl)[1P(Al)]). For k=0 we have ((1dl)(1ql))/(1dlql)=1/ϕl and transform it to the equation xl2(ϕl(1+μl))/(1+ϕl)x(1ϕl)/(1+ϕl)μl=0, where a proper solution is

x=φl(1+μl)2(1+φl)(114μl(1φl2)φl2(1+μl)2)E18

and which xl=μl(κl2)/(1κl2) gives us

κl=ln(xl)ln(xl/μl) . E19

By combining Eq. (18) with Eq. (20), one may also get a simpler form for νl, namely νl=κl/lnxl.

Now, introduce the segmental signal‐to‐noise ratios (SNRs): γl=Δl2σl2, and γl+=Δl2σl+12 , where Δl=al+1al, substitute the Gaussian pdf to Eqs. (9) and (10), provide the transformations, and rewrite Eqs. (9) and (10) as

P(Al)={1+12[erf(glβ)erf(glα)],γl<γl+12 erfc(glα),γl=γl+12[erf(glβ)erf(glα)],γl>γl+ , E20
P(Bl)={12[erf(hlα)erf(hlβ)],γl<γl+112erfc(hlα),γl=γl+1+12[erf(hlα)erf(hlβ)],γl>γl+ , E21

where glβ=(βlΔl)/|Δl|γl/2, glα=(αlΔl)/|Δl|γl/2, hlβ=βl/|Δl|γl+/2, hlα=αl/|Δl|γl+/2, erf(x) is the error function, erfc(x) is the complementary error function, and

αl,βl=alγlalγl+γlγl+1γlγl+×(alal+1)γlγl++2Δl2(γlγl+)lnγlγl+ .E22

2.3.2. Approximation of jitter distribution using the modified Bessel functions

An analysis shows that the discrete skew Laplace pdf (17) gives good results only if SNR is >1. Otherwise, real measurements do not fit well, and a more accurate function is required. Below, we show that better approach to real jitter distribution can be provided using the modified Bessel functions.

2.3.2.1. Modified Bessel function

Figure 2 demonstrates the jitter pdf measured experimentally (dotted) for different SNRs. The breakpoint corresponds here to the peak density and the probability of the breakpoint location diminishes to the left and to the right of this point. Note that the discrete skew Laplace pdf (17) behaves linearly in such scales. Therefore, Eq. (17) cannot be applied when SNR is <1 and a more accurate function is required.

Figure 2.

Experimentally defined one‐sided jitter probability densities (dotted) of the breakpoint location for equal segmental SNR γ in the range of M=400 points with a true breakpoint at n=200. The experimental density functions were found using the Maximum Likelihood (ML) estimator. The histogram was plotted over 50×103 runs repeated nine times and average. Approximations (continuous) are provided using the proposed Bessel‐based approximation depicted as MBA.

Among available functions demonstrating the pdf properties, the modified Bessel function of the second kind K0(x) and zeroth order is a most good candidate to fit the experimentally measured densities (Figure 2). The following form of K0(x) can be used:

K0[x(k)]=0cos[x(k)sinht]dt=0cos[x(k)t]t2+1dt>0, x(k)>0 ,E23

in which a variable x(k) depends on index k, which represents a discrete departure from the assumed breakpoint location. Because K0[x(k)] is a positive‐valued for x(k)>0 smooth function decreasing with x to zero, it can be used to approximate the probability density.

2.3.2.2. Approximation

In order to use Eq. (24) as an approximating function


B(k|γ)=K0[x(k)]E24

Figure 3.

Simulated CNA with a single breakpoint at n = 200 and segmental standard deviations σl and σl+1 corresponding to SNRs γl=γl+=1.37: (a) measurement and (b) jitter distribution. Here, ML (circled) is the jitter pdf obtained experimentally using an ML estimator through a histogram over 50×103 runs, SkL (solid) is the Laplace distribution, and MBA (dashed) is the Bessel‐based approximation.

conditioned on γ for the one‐sided jitter probability densities shown in Figure 2, we represent a variable x via k as x(k,γ)=ln[Φ(k,γ)] in a way such that small k0 correspond to large values x of and vice versa. Among several candidates, it has been found empirically that the following function Φ(k,γ) fits the histograms with highest accuracy:

Φ(k,γ)=(|k|+1)ρ+τ|k|[1+γγϵ]E25

if to set γ=γl for k<0, γ=γl+γl+2 for k=0, and γ=γl+ for k>0, and represent the coefficients and as τ(γ),ρ(γ), and ϵ(γ) as

τ(γ)=a0γ+a1E26
ρ(γ)= γ(b0γb1+a0)+b2E27
ϵ(γ)=c0γc1+c2E28

where a0=0.02737, a1=4.5×103, b0=0.3674, b1=0.3137, b2=0.8066, c0 = 0.8865, c1=1.033, and c2=1.233 were found in the mean square error (MSE) sense. These values were found in several iterations until the MSE reached a minimum.

In summary, Figure 3 gives a typical example of a simulated CNA, where the modified Bessel function‐based approximation (depicted as MBA) demonstrates better accuracy than the approximation obtained using the skew Laplace distribution (depicted as SkL).

2.4. Probabilistic masks

It follows from Figure 3 that, in view of large noise, estimates of the CNAs may have low confidence, especially with small SNR γ1. Thus, each estimate requires confidence boundaries within which it may exist with a given probability [20, 21].

Given an estimate a^l of the lth segmental level in white Gaussian noise, the probabilistic upper boundary (UB) and lower boundary (LB) can be specified for the given confidence probability P(ϑ) in the ϑ‐sigma sense as [20]

a^lUBa^l+ε=a^l+ϑσj2Nl=a^l+ϑσ^lE29
a^lLBa^lε=a^l+ϑσj2Nl=a^l+ϑσ^lE30

where ϑ indicates the boundary wideness in terms of the segmental noise variance σ^l on an interval Nl points, from n^l1 to n^l1.

Likewise, detected the lth breakpoint location n^l, the jitter probabilistic left boundary JlL and right boundary JlR can be defined, following [20], as

JlLn^lklR,E31
JlRn^l+klL,E32

where klR(ϑ) and klL(ϑ) are specified by the jitter distribution in the ϑ‐sigma sense.

By combining Eqs. (30) and (31) with Eqs. (32) and (33), the probabilistic masks can be formed as shown in [20] to bound the CNA estimates in the ϑ‐sigma sense for the given confidence probability P(ϑ). An important property of these masks is that they can be used not only to bound the estimates and show their possible locations on a probabilistic field [20, 21] but also to remove supposedly wrong breakpoints. Such situations occur each time when the masks reveal double UB and LB uniformities in a gap of three neighboring detected breakpoints. If so, then the unlikely existing intermediate breakpoint ought to be removed.

Noticing that the segmental boundaries (30) and (31) remain the same irrespective of the jitter in the breakpoints, below we specify the masks for the jitter represented with the Laplace distribution (17) and Bessel‐based approximation (25).

2.4.1. Masks for Laplace distribution

For the Laplace distribution (17), the jitter left boundary JlL (32) and right boundary JlR(33) can be defined in the ϑ–‐sigma sense if to specify klR(ϑ) and klL(ϑ) as shown in [18],

klR=νκln(1dl)(1ql)ξ(1dlql),E33
klL=νκln(1dl)(1ql)ξ(1dlql),E34

where [x] means a maximum integer lower than or equal to x. Note that functions (34) and (35) were obtained in [18] by equating (17) to ξ(Nl)=erfc(ϑ/2) and solving for kl.

The probabilistic UB mask

and LB mask
for the Laplace distribution were formed in [17,20,21] by the segmental upper boundary a^lUB and lower boundary a^lLB and by the jitter left boundary JlL and jitter right boundary JlR. The algorithm for computing
and
masks has been developed and applied to the CNA probes in [22].

2.4.2. Masks for Bessel‐based approximation

The UB mask

and LB mask
for the Bessel‐based approximation can be formed using the same equations as for the Laplace distribution. Suppose that the Laplace pdf (17) is equal to the approximating function Βl(k) at k=0,
p(k=0|dl,ql)=Βl(k=0),E35

that yields Βl(k=0)=1ϕl. Then, define the probabilities PB(Al) at k=1 and PΒ(Bl) at k=1 as

PΒ(Al)=Bl(k=0)Βl(k=1)+Βl(k=0),E36
PΒ(Bl)=Bl(k=0)Βl(k=1)+Βl(k=0).E37

Next, substitute Eqs. (37) and (38) into Eqs. (19) and (20) to obtain κlB and νlB. The right‐hand jitter klBR and left‐hand jitter klBL can now be specified by, respectively,

klBR=νlBκlBln1ξBl(k=0)E38
klBL=νlBκlBln1ξBl(k=0)E39

Finally, define the jitter left boundary JlBL and right boundary JlBR as, respectively,

JlBLn^lklBRE40
JlBRn^lklBL E41

and use the algorithm previously designed in the study of Munoz‐Minjares and Shmaliy [22] for the confidence masks based on the Laplace distribution.

Advertisement

3. Results

Figure 4.

and masks for the seventh chromosome taken from “159A–vs–159D–cut” of ROMA: (a) genomic location from 130 to 146 Mb and (b) genomic location from 146 to 156 Mb. Breakpoints î1, î6, î7, î9, î10, î12, and î13 are well detectable because jitter is moderate. Owing to large jitter the breakpoints î2, î3, î4, î5, î8, î9, and î11 cannot be estimated correctly. There is a probability that the breakpoints î2, î3, î4, î5, and î11 do not exist. There is a high probability that breakpoint î5 does not exist.

In this section, we test some CNA measurements and estimates by the algorithm developed in [22] based on the Laplace and Bessel approximations. In order to demonstrate the efficiency of the probabilistic masks and getting practically useful results, we exploit probes obtained by different technologies. First, we employ the results obtained with the HR‐CGH profile and test them by the probabilistic masks using Laplace distribution. We next demonstrate the efficiency of the Bessel‐based probabilistic masks versus the Laplace‐based masks for the probes obtained with the SNP profile.

3.1. HR‐CGH‐based probing

The first test is conducted in the three‐sigma sense suggesting that the CNAs exist between the UB and LB masks with high probability of P = 99.73%. The tested HR‐CGH array data are available from the representational oligonucleotide microarray analysis (ROMA) [23]. The breakpoint locations are also given in [23]. Voluntarily, we select data associated with potentially large jitter and large segmental errors. For clarity, we first compute some characteristics of the detected CNAs and notice that the segmental estimates found by averaging [18] are in a good correspondence with [23]. The database processed is a part of the seventh chromosome in archive “159A–vs–159D–cut” of ROMA a sample of B‐cell chronic lymphocytic leukemia (B‐CLL). It is shown to have 14 segments and 13 breakpoints (Figure 4a and b). Below, we shall show that, owing to large detection noise, there is a high probability that some breakpoints do not exist.

It follows from Figure 4a that the only breakpoint which location can be estimated with high accuracy is i1. Jitter in i^6 and i^7 is moderate. All other breakpoints have large jitter. It is seen that the UB mask covering second to sixth segments is almost uniform. Thus, there is a probability that the second to fifth breakpoints do not exist. If to follow the LB mask, the locations of the second to fourth breakpoints can be predicted even with large errors. At least they can be supposed to exist. However, nothing definitive can be said about the fifth breakpoint and one may suppose that it does not exist. It is also hard to distinguish a true location of the eighth breakpoint. In Figure 4b, i10, i12, and i13 are well detectable owing to large segmental SNRs. The breakpoint i9 has a moderate jitter. In turn, the location of i11 is unclear. Moreover, there is a probability that i11 does not exist.

3.2. SNP‐based probing

Our purpose now is to apply the probabilistic mask with SNP profile that represents the CNA with low levels of SNR. Specifically, we employ the probes of the first chromosome available from “BLC_B1_T45.txt” a sample of primary breast carcinoma.

Inherently, the more accurate Bessel‐based approximation extends the jitter probabilistic boundaries with respect to the Laplace‐based ones, especially for low SNRs. We illustrate it in Figure 5, where the estimates of the first chromosome were tested by

,
,
, and
for ϑ=3 (confidence probability P = 99.73%).

In Figure 6, the masks

and
are placed in the vicinity of segment a^18 for several confidence probabilities: ϑ = 0.6745 (P = 50%), ϑ = 1 (P = 68.27%), ϑ = 2 (P =95.45%), and ϑ = 3 (P = 99.73%). What the masks suggest here is that the CNA evidently exists with high probability, but the segmental levels and the breakpoint locations cannot be estimated with high accuracy, owing to low SNRs.

Figure 5.

Jitter left boundaries lL, JlL and right boundaries JlR, JlR for the breakpoint i^2 of first chromosome from sample BLC_B1_T45.txt (primary breast carcinoma). The probabilistic masks detect a breakpoint with a confidence probability ϑ = 3 (P = 99.73%).

Figure 6.

The and masks placed around the segmental level a18 for several confidence probabilities [20]. Here, the CNA exists with high probability, but the segmental levels and the breakpoint locations cannot estimate with high accuracy.

Figure 7.

The confidence masks placed around a10 for ϑ=0.6745 (P=50 %) and ϑ=3 (P=99.73 %). Masks and do not confirm an existence of segmental changes while and indicate a small change.

Figure 8.

The confidence masks , lLB,lUB and . placed around the breakpoint i^20 for confidence probabilities ϑ=0.6745 and ϑ=3 of first chromosome from sample BLC_B1_T45.txt. The confidence masks based on Laplace distribution cannot detect the breakpoint i^20.

A special case can also be noticed when the masks

and
are not able to confirm or deny an existence of segmental changes with high probability, owing to the inability of computing the Laplace‐based masks for extremely low SNRs. Figures 7 and 8 illustrate such situations. Just on the contrary, masks
and
can be computed for any reasonable SNR.

A conclusion that can be made based on the results illustrated in Figures 58 is that the Bessel‐based probabilistic masks can be used to improve estimates of the chromosomal changes for the required probability.

We finally notice that the computation time required by the masks to process the first chromosome from sample “BLC B1 T45.txt” with a length of n = 905215 was 2.634599 s using MATLAB software on a personal computer with a processor Intel Core i5, 2.5 GHz.

Advertisement

4. Discussion

We evaluate the breakpoints obtained by the projects representational oligonucleotide microarray analysis [23] and GAP [14] with the confidence masks. As has been shown before, not all of the detected chromosomal changes have the same confidence to mean that there is a probability that some breakpoints do not exist. In order to improve the CNA estimates for the required confidence, the following process can be used:

  1. Obtain estimates of the CNA using the standard CBS algorithm [24, 25] or any other algorithm.

  2. Compute masks

    and
    for the given confidence probability P, % and bound the estimates.

  3. If the masks reveal double uniformities, in UB and LB, in a GAP of any three neighboring breakpoints, then remove the intermediate breakpoint and estimate the segmental level between the survived breakpoints by simple averaging. The CNAs estimated in such a way will be valid for the given confidence P,%.

Application of this methodology to the CNA structure detected in frames of the Project GAP is shown in Figure 9. Its special feature is a number of hardly recognized small chromosomal changes (Figure 9a). We test them by the proposed masks lUB and lLB. To this end, we first start with equal confidence probabilities of P = 50% for each estimate to exist or not exist and find out that three breakpoints demonstrate no detectability. We remove these breakpoints and depict their locations with “×”. Reasoning similarly, we remove four breakpoints to retain only probable changes, by P = 75%, nine breakpoints to show a picture combined with almost certain changes, by P = 93%, and 10 breakpoints in the three‐sigma sense, P = 99.73%. Observing the results, we infer that the masks are able to correct only the estimates obtained under the low SNRs. The relevant chromosomal sections S1–S7 are circled in Figure 9. It is not surprising because changes existing with high SNRs are seen visually. An estimator thus can easily detect them with high confidence.

Figure 9.

Improving estimates of the CNAs obtained in Project GAP [25] by removing some unlikely existing breakpoints: (a) original estimates, (b) even changes, P = 50%, (c) probable changes, P = 75%, (d) almost certain changes, P = 93%, and (e) three‐sigma sense, P = 99.73%.

Advertisement

5. Conclusions

Modern technologies developed to produce the CNA profiles with high resolution still admit intensive white Gaussian noise. Accordingly, not one estimator even ideal is able to provide jitter‐free estimation of segmental changes. Thus, in order to avoid wrong decisions, the estimates must be bounded for the confidence probability. Jitter exists in the CNA's breakpoints fundamentally. When SNR is >1, it can statistically be described using the discrete skew Laplace distribution. Otherwise, if SNR is <1, the Bessel‐based approximation produces more accuracy. By the jitter distribution, it is easy to find a region within which the breakpoint exists for the required probability. Of practical importance are the confidence UB and LB masks, which can be created based on the segmental and jitter distributions for the given confidence probability. The masks can serve as an auxiliary tool for medical experts to make decisions about the CNA structures. Applications to probes obtained using the HR‐CGH and SNP technologies confirm efficiency of the confidence masks.

References

  1. 1. Reymond, A., Henrichsen, C. N., Harewood, L., and Merla, G. Side effects of genome structural changes. Curr Opin Genet Dev. 2007;17(5):381–386.
  2. 2. Alkan, C., Coe, B. P., and Eichler, E. E. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–376.
  3. 3. Feuk, L., Carson, A. R., and Scherer, S. W. Structural variation in the human genome. Nat Rev Genet. 2006;7(2):85–97.
  4. 4. Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., et al. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–454.
  5. 5. Hastings, P. J., Lupski, J. R., Rosenberg, S. M. and Ira, G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009; 10 (8): 551–564.
  6. 6. Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O. et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464(7289): 704–712.
  7. 7. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011): 931–945.
  8. 8. Forozan, F., Karhu, R., Kononen, J., Kallioniemi, A., and Kallioniemi, O. P. Genome screening by comparative genomic hybridization. Trends in Genetics. 1997; 13 (10): 405– 409.
  9. 9. Speicher, M. R. and Carter, N. P. The new cytogenetics: blurring the boundaries with molecular biology. Nat Rev Genet.2015;6(10): 782–792.
  10. 10. Ng, P. C. and Kirkness, E. F. Whole genome sequencing. Methods MolBiol. 2010;628: 215–226.
  11. 11. Wang, D. G., Fan, J. B., Siao, C. J., Bermo, A., Young, P., et al. Large‐scale identification, mapping, and genotyping of single‐nucleotide polymorphisms in the human genome. Science. 1998;280 (5366): 1077–1082.
  12. 12. Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002;30(4): 1–10.
  13. 13. Urban, A. E., Korbel, J. O., Selzer, R., Richmond, T., Hacker, A., Popescu, G. V., Cubells, J. F., Green, R., Emanuel, B. S., Gerstein, M.B., Weissman, S. M., Snyder, M. High‐ resolution mapping of DNA copy alterations inhuman chromosome 22 using high‐density tiling oligonucleotide arrays. Proc Natl Acad Sci U S A. 2006;103(12): 4534–4539.
  14. 14. Popova, T., Boeva, V., Manie, E., Rozenholc, Y., Barillot, E., and Stern, M.H. Analysis of Somatic Alterations in Cancer Genome: From SNP Arrays to Next Generation Sequencing. Genomics I Humans, Animals and Plants, 2013. <hal-01108425> Observation to publisher: Reference from https://hal.archives-ouvertes.fr/hal-01108425
  15. 15. Pique‐Regi, R., Ortega, A., Tewfik, A., and Asgharzadeh, S. Detection changes in the DNA copy number. IEEE Signal Process Mag. 2012;29(1): 98–107.
  16. 16. Munoz‐Minjares, J., Cabal‐Aragón, J., and Shmaliy, Y.S. Jitter probability in the break‐points of discrete sparse piecewise‐constant signals. Proc. 21st Eur. Signal Process. Conf. (EUSIPCO‐2013), Marrakech, Marocco. 2013; pp. 1–5.
  17. 17. Munoz‐Minjares, J., Shmaliy, Y. S., and Cabal, A. J. Noise studies in measurements and estimates of stepwise changes in genome DNA chromosomal structures. Adv. App Pure Math. 2014.
  18. 18. Munoz‐Minjares, J., Cabal‐Aragón, J., and Shmaliy, Y. S. Probabilistic bounds for estimates of genome DNA copy number variations using HR‐CGH microarray. Proc. 21st European Signal Process. Conf. (EUSIPCO‐2013). 2013; pp. 1–5.
  19. 19. Kozubowski, T. J. and Inusah, S. A skew Laplace distribution on integers. Ann Inst Stat Math 2006;58(3):555–571.
  20. 20. Munoz‐Minjares, J., Cabal‐Aragon, J., and Shmaliy Y. S. Confidence masks for genome DNA copy number variations in applications to HR–CGH array measurements. Biomed Signal Process Contr. 2014;13:337–344.
  21. 21. Munoz‐Minjares, J., Shmaliy, Y. S., and Cabal‐Aragon, J. Confidence limits for genome DNA copy number variations in HR–CGH array measurements. Biomed Signal Process Contr. 2014;10:166–173.
  22. 22. Munoz‐Minjares, J. and Shmaliy, Y. S.: Bounding error in estimates of genome copy number variations using SNP array. International Journal of Biomedical Engineering. 2015;9:127–132.
  23. 23. Lucito, R., Healy, J., Alexander, J., Reiner, A., Esposito, D., Chi, M., Rodgers, L., Brady, A., Sebat, J., Troge, J., West, J. A., Rostan, S., Nguyen, K. C. Q., Powers, S., Ye, K. Q., Olshen, A., Venkatraman, E., Norton, L., Wigler, M. Representational oligonucleotide microarray analysis: a high‐resolution method to detect genome copy number variation. Genome Res. 2003;13(10): 2291–2305.
  24. 24. Olshen, A. B., Venkatraman, E. S., Lucito, R., and Wigler, M. Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics. 2004;5(4):557–572.
  25. 25. Venkatraman E. S. and Olshen, A. B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657–663.

Written By

Jorge Muñoz‐Minjares, Yuriy Shmaliy and Oscar Ibarra‐Manzano

Submitted: 18 November 2015 Reviewed: 25 April 2016 Published: 21 July 2016