Open access peer-reviewed chapter

Monte Carlo Statistical Tests for Identity of Theoretical and Empirical Distributions of Experimental Data

By Natalia D. Nikolova, Daniela Toneva-Zheynova, Krasimir Kolev and Kiril Tenekedjiev

Submitted: April 24th 2012Reviewed: September 4th 2012Published: March 6th 2013

DOI: 10.5772/53049

Downloaded: 2489

1. Introduction

Often experimental work requires analysis of many datasets derived in a similar way. For each dataset it is possible to find a specific theoretical distribution that describes best the sample. A basic assumption in this type of work is that if the mechanism (experiment) to generate the samples is the same, then the distribution type that describes the datasets will also be the same [1]. In that case, the difference between the sets will be captured not through changing the type of the distribution, but through changes in its parameters. There are some advantages in finding whether a type of theoretical distribution that fits several datasets exists. At first, it improves the fit because the assumptions concerning the mechanism underlying the experiment can be verified against several datasets. Secondly, it is possible to investigate how the variation of the input parameters influences the parameters of the theoretical distribution. In some experiments it might be proven that the differences in the input conditions lead to qualitative change of the fitted distributions (i.e. change of the type of the distribution). In other cases the variation of the input conditions may lead only to quantitative changes in the output (i.e. changes in the parameters of the distribution). Then it is of importance to investigate the statistical significance of the quantitative differences, i.e. to compare the statistical difference of the distribution parameters. In some cases it may not be possible to find a single type of distribution that fits all datasets. A possible option in these cases is to construct empirical distributions according to known techniques [2], and investigate whether the differences are statistically significant. In any case, proving that the observed difference between theoretical, or between empirical distributions, are not statistically significant allows merging datasets and operating on larger amount of data, which is a prerequisite for higher precision of the statistical results. This task is similar to testing for stability in regression analysis [3].

Formulating three separate tasks, this chapter solves the problem of identifying an appropriate distribution type that fits several one-dimensional (1-D) datasets and testing the statistical significance of the observed differences in the empirical and in the fitted distributions for each pair of samples. The first task (Task 1) aims at identifying a type of 1-D theoretical distribution that fits best the samples in several datasets by altering its parameters. The second task (Task 2) is to test the statistical significance of the difference between two empirical distributions of a pair of 1-D datasets. The third task (Task 3) is to test the statistical significance of the difference between two fitted distributions of the same type over two arbitrary datasets.

Task 2 can be performed independently of the existence of a theoretical distribution fit valid for all samples. Therefore, comparing and eventually merging pairs of samples will always be possible. This task requires comparing two independent discontinuous (stair-case) empirical cumulative distribution functions (CDF). It is a standard problem and the approach here is based on a symmetric variant of the Kolmogorov-Smirnov test [4] called the Kuiper two-sample test, which essentially performs an estimate of the closeness of a pair of independent stair-case CDFs by finding the maximum positive and the maximum negative deviation between the two [5]. The distribution of the test statistics is known and the p value of the test can be readily estimated.

Tasks 1 and 3 introduce the novel elements of this chapter. Task 1 searches for a type of theoretical distribution (out of an enumerated list of distributions) which fits best multiple datasets by varying its specific parameter values. The performance of a distribution fit is assessed through four criteria, namely the Akaike Information Criterion (AIC) [6], the Bayesian Information Criterion (BIC) [7], the average and the minimal p value of a distribution fit to all datasets. Since the datasets contain random measurements, the values of the parameters for each acquired fit in Task 1 are random, too. That is why it is necessary to check whether the differences are statistically significant, for each pair of datasets. If not, then both theoretical fits are identical and the samples may be merged. In Task 1 the distribution of the Kuiper statistic cannot be calculated in a closed form, because the problem is to compare an empirical distribution with its own fit and the independence is violated. A distribution of the Kuiper statistic in Task 3 cannot be estimated in close form either, because here one has to compare two analytical distributions, but not two stair-case CDFs. For that reason the distributions of the Kuiper statistic in Tasks 1 and 3 are constructed via a Monte Carlo simulation procedures, which in Tasks 1 is based on Bootstrap [8].

The described approach is illustrated with practical applications for the characterization of the fibrin structure in natural and experimental thrombi evaluated with scanning electron microscopy (SEM).

2. Theoretical setup

The approach considers N 1-D datasets χi=(x1i,x2i,...,xnii), for i=1,2,…,N. The data set χicontains ni>64 sorted positive samples (0<x1ix2i...xnii) of a given random quantity under equal conditions. The datasets contain samples of the same random quantity, but under slightly different conditions.

The procedure assumes that M types of 1-D theoretical distributions are analyzed. Each of them has a probability density function PDFj(x,pj), a cumulative distribution function CDFj(x,pj), and an inverse cumulative distribution function invCDFj(P,pj), for j=1, 2, …, M. Each of these functions depends on njp-dimensional parameter vectors pj(for j=1, 2, …, M), dependent on the type of theoretical distribution.

2.1. Task 1 – Theoretical solution

The empirical cumulative distribution function CDFei(.)is initially linearly approximated over (ni+1) nodes as (ni –1) internal nodes CDFei(xki/2+xk+1i/2)=k/nifor k=1,2,…,ni–1 and two external nodes CDFei(x1iΔdi)=0and CDFei(xnii+Δui)=1, where Δdi=min(x1i,(x16ix1i)/30)and Δui=(xniixni15i)/30are the halves of mean inter-sample intervals in the lower and upper ends of the dataset χi. This is the most frequent case when the sample values are positive and the lower external node will never be with a negative abscissa because (x1iΔdi)0. If both negative and positive sample values are acceptable then Δdi=(x16ix1i)/30and Δui=(xniixni15i)/30. Of course if all the sample values have to be negative then Δdi=(x16ix1i)/30and Δui=min(xnii,(xniixni15i)/30). In that rare case the upper external node will never be with positive abscissa because (xnii+Δui)0.

It is convenient to introduce “before-first” x0i=x1i2Δdiand “after-last” xni+1i=xnii+2Δuisamples. When for some k=1,2,…,ni and for p>1 it is true that xk1i<xki=xk+1i=xk+2i=...=xk+pi<xk+p+1i, then the initial approximation of CDFei(.)contains a vertical segment of p nodes. In that case the p nodes on that segment are replaced by a single node in the middle of the vertical segment CDFei(xki)=(k+p/21/2)/ni. The described two-step procedure [2] results in a strictly increasing function CDFei(.)in the closed interval [x1iΔdi;xnii+Δui]. That is why it is possible to introduce invCDFei(.)with the domain [0; 1] as the inverse function of CDFei(.)in [x1iΔdi;xnii+Δui]. The median and the interquartile range of the empirical distribution can be estimated from invCDFei(.), whereas the mean and the standard deviation are easily estimated directly from the dataset χi:

  • mean: meanei=1nik=1nixki

  • median: medei=invCDFei(0.5)

  • standard deviation: stdei=1ni1k=1ni(xkimeanei)2;

  • inter-quartile range: iqrei=invCDFei(0.75)invCDFei(0.25).

The non-zero part of the empirical density PDFei(.)is determined in the closed interval [x1iΔdi;xnii+Δui]as a histogram with bins of equal area (each bin has equal product of density and span of data). The number of bins bi is selected as the minimal from the Scott [9], Sturges [10] and Freedman-Diaconis [11] suggestions: bi=min{biSc,biSt,biFD}, where biSc=fl(0.2865(xniix1i)ni3/stdei), biSt=fl(1+log2(ni)), and biFD=fl(0.5(xniix1i)ni3/iqrei). In the last three formulae, fl(y) stands for the greatest whole number less or equal to y. The lower and upper margins of the k-th bin md,kiand mu,kiare determined as quantiles (k–1)/bi and k/bi respectively: md,ki=invCDFei(k/bi1/bi)and mu,ki=invCDFei(k/bi). The density of the kth bin is determined as PDFei(x)=bi1/(mu,kimd,ki). The described procedure [2] results in a histogram, where the relative error of the worst PDFei(.)estimate is minimal from all possible splitting of the samples into bi bins. This is so because the PDF estimate of a bin is found as the probability that the random variable would have a value in that bin divided to the bin’s width. This probability is estimated as the relative frequency to have a data point in that bin at the given data set. The closer to zero that frequency is the worse it has been estimated. That is why the worst PDF estimate is at the bin that contains the least number of data points. Since for the proposed distribution each bin contains equal number of data points, any other division to the same number of bins would result in having a bin with less data points. Hence, the relative error of its PDF estimate would be worse.

The improper integral xPDFei(x)dxof the density is a smoothened version of CDFei(.)linearly approximated over (bi+1) nodes: (invCDFei(k/bi);k/bi)for k=0, 1, 2, …, bi.

If the samples are distributed with density PDFj(x,pj), then the likelihood of the dataset χiis Lji(pj)=k=1niPDFj(xki,pj). The maximum likelihood estimates (MLEs) of pjare determined as those pji, which maximize Lji(pj), that is pji=arg{maxpj[Lji(pj)]}. The numerical characteristics of the jth theoretical distribution fitted to the dataset χiare calculated as:

  • mean: meanji=+x.PDFj(x,pji)dx

  • median: medji=invCDFj(0.5,pji)

  • mode: modeji=arg{maxx[PDFj(x,pj)]}

  • standard deviation: stdji=+(xmeanji)2PDFj(x,pji)dx2;

  • inter-quartile range: iqrji=invCDFj(0.75,pji)invCDFj(0.25,pji).

The quality of the fit can be assessed using a statistical hypothesis test. The null hypothesis H0 is that CDFei(x)is equal to CDFj(x,pji), which means that the sample χiis drawn from CDFj(x,pji). The alternative hypothesis H1 is that CDFei(x)is different from CDFj(x,pji), which means that the fit is not good. The Kuiper statistic Vji[12] is a suitable measure for the goodness-of-fit of the theoretical cumulative distribution functions CDFj(x,pji)to the dataset χi:

Vji=maxx{CDFei(x)CDFj(x,pji)}+maxx{CDFj(x,pji)CDFei(x)}.E1

The theoretical Kuiper’s distribution is derived just for the case of two independent staircase distributions, but not for continuous distribution fitted to the data of another [5]. That is why the distribution of V from (1), if H0 is true, should be estimated by a Monte Carlo procedure. The main idea is that if the dataset χi=(x1i,x2i,...,xnii)is distributed in compliance with the 1-D theoretical distributions of type j, then its PDF would be very close to its estimate PDFj(x,pji), and so each synthetic dataset generated from PDFj(x,pji)would produce Kuiper statistics according to (1), which would be close to zero [1].

The algorithm of the proposed procedure is the following:

  1. Construct the empirical cumulative distribution function CDFei(x)describing the data in χi.

  2. Find the MLE of the parameters for the distributions of type j fitting χias pji=arg{maxpj[k=1niPDFj(xki,pj)]}.

  3. Build the fitted cumulative distribution function CDFj(x,pji)describing χi.

  4. Calculate the actual Kuiper statistic Vjiaccording to (1).

  5. Repeat for r=1,2,…, nMC (in fact use nMC simulation cycles):

    1. generate a synthetic dataset χri,syn={x1,ri,syn,x2,ri,syn,...,xni,ri,syn}from the fitted cumulative distribution function CDFj(x,pji). The dataset χri,syncontains ni sorted samples (x1,ri,synx2,ri,syn...xni,ri,syn);

    2. construct the synthetic empirical distribution function CDFe,ri,syn(x)describing the data in χri,syn;

    3. find the MLE of the parameters for the distributions of type j fitting χri,synas

      pj,ri,syn=arg{maxpj[k=1niPDFj(xk,ri,syn,pj)]};

    4. build the theoretical distribution function CDFj,rsyn(x,pj,ri,syn)describing χri,syn;

    5. estimate the rth instance of the synthetic Kuiper statistic as Vj,ri,syn=maxx{CDFe,ri,syn(x)CDFj,rsyn(x,pj,ri,syn)}+maxx{CDFj,rsyn(x,pj,ri,syn)CDFe,ri,syn(x)}.

  1. The p-value Pvalue,jfit,iof the statistical test (the probability to reject a true hypothesis H0 that the jth type theoretical distribution fits well to the samples in dataset χi) is estimated as the frequency of generating synthetic Kuiper statistic greater than the actual Kuiper statistic Vjifrom step 4:

Pvalue,jfit,i=1nmcr=1Vji<Vj,ri,synnmc1E2

In fact, (2) is the sum of the indicator function of the crisp set, defined as all synthetic datasets with a Kuiper statistic greater than Vji.

The performance of each theoretical distribution should be assessed according to its goodness-of-fit measures to the N datasets simultaneously. If a given theoretical distribution cannot be fitted even to one of the datasets, then that theoretical distribution has to be discarded from further consideration. The other theoretical distributions have to be ranked according to their ability to describe all datasets. One basic and three auxiliary criteria are useful in the required ranking.

The basic criterion is the minimal p-value of the theoretical distribution fits to the N datasets:

minPvalue,jfit=min{Pvalue,jfit,1,Pvalue,jfit,2,...,Pvalue,jfit,N}forj=1, 2, ...,M.E3

The first auxiliary criterion is the average of the p-values of the theoretical distribution fits to the N datasets:

meanPvalue,jfit=1Nj=1NPvalue,jfit,i, forj=1, 2, ..,M.E4

The second and the third auxiliary criteria are the AIC-Akaike Information Criterion [6] and the BIC-Bayesian Information Criterion [7], which corrects the negative log-likelihoods with the number of the assessed parameters:

AICj=2i=1Nlog(Lji(pji))+2log(N.njp)==2i=1Nj=1MlogPDFj(xki,pji)+2log(N.njp)E5
BICj=2i=1Nlog(Lji(pji))+2log(N.njp).log(i=1Mni)==2i=1Nj=1MlogPDFj(xki,pji)+2log(N.njp).log(i=1Mni)E6

for j=1,2,..,M. The best theoretical distribution type should have maximal values for minPvalue,jfitand meanPvalue,jfit, whereas its values for AICj and BICj should be minimal. On top, the best theoretical distribution type should have minPvalue,jfit>0.05, otherwise no theoretical distribution from the initial M types fits properly to the N datasets.

That solves the problem for selecting the best theoretical distribution type for fitting the samples in the N datasets.

2.2. Task 2 – Theoretical solution

The second problem is the estimation of the statistical significance of the difference between two datasets. It is equivalent to calculating the p-value of a statistical hypothesis test, where the null hypothesis H0 is that the samples of χi1and χi2are drawn from the same underlying continuous population, and the alternative hypothesis H1 is that the samples of χi1and χi2are drawn from different underlying continuous populations. The two-sample asymptotic Kuiper test is designed exactly for that problem, because χi1and χi2are independently drawn datasets. That is why “staircase” empirical cumulative distribution functions [13] are built from the two datasets χi1and χi2:

CDFscei(x)=k=1xkixni1/ni, fori{i1,i2}.E7

The ”staircase” empirical CDFscei(.)is a discontinuous version of the already defined empirical CDFei(.). The Kuiper statistic Vi1,i2[12] is a measure for the closeness of the two ‘staircase’ empirical cumulative distribution functions CDFscei1(.)and CDFscei2(.):

Vi1,i2=maxx{CDFscei1(x)CDFscei2(x)}+maxx{CDFscei2(x)CDFscei1(x)}E8

The distribution of the test statics Vi1,i2is known and the p-value of the two tail statistical test with null hypothesis H0, that the samples in χi1and in χi2result in the same ‘staircase’ empirical cumulative distribution functions is estimated as a series [5] according to formulae (9) and (10).

The algorithm for the theoretical solution of Task 2 is straightforward:

  1. 1. Construct the ”staircase” empirical cumulative distribution function describing the data in χi1as CDFscei1(x)=k=1xki1xni11/ni1.

  2. 2. Construct the ”staircase” empirical cumulative distribution function describing the data in χi2as CDFscei2(x)=k=1xki2xni21/ni2.

  3. 3. Calculate the actual Kuiper statistic Vi1,i2according to (8).

  4. 4. The p-value of the statistical test (the probability to reject a true null hypothesis H0) is estimated as:

Pvalue,ei1,i2=2 j=1+(4j2λ21)e-2j2λ2 E9

where

λ=1Vi1,i2(ni1ni2ni1+ni2 + 0.155 + 0.24ni1+ni2ni1ni2)E10
If Pvalue,ei1,i2<0.05 the hypothesis H0 is rejected.

2.3. Task 3 – Theoretical solution

The last problem is to test the statistical significance of the difference between two fitted distributions of the same type. This type most often would be the best type of theoretical distribution, which was identified in the first problem, but the test is valid for any type. The problem is equivalent to calculating the p-value of statistical hypothesis test, where the null hypothesis H0 is that the theoretical distribution CDFj(x,pji1)and CDFj(x,pji2)fitted to the datasets χi1and χi2are identical, and the alternative hypothesis H1 is that CDFj(x,pji1)and CDFj(x,pji2)are not identical.

The test statistic again is the Kuiper one Vji1,i2:

Vji1,i2=maxx{CDFj(x,pji1)CDFj(x,pji2)}+maxx{CDFj(x,pji2)CDFj(x,pji1)}.E11

As it has already been mentioned the theoretical Kuiper’s distribution is derived just for the case of two independent staircase distributions, but not for the case of two independent continuous cumulative distribution functions. That is why the distribution of V from (11), if H0 is true, should be estimated by a Monte Carlo procedure. The main idea is that if H0 is true, then CDFj(x,pji1)and CDFj(x,pji2)should be identical to the merged distribution CDFj(x,pji1+i2), fitted to the merged dataset χi1+i2formed by merging the samples of χi1and χi2[1].

The algorithm of the proposed procedure is the following:

  1. Find the MLE of the parameters for the distributions of type j fitting χi1as pji1=arg{maxpj[k=1ni1PDFj(xki1,pj)]}.

  2. Build the fitted cumulative distribution function CDFj(x,pji1)describing χi1.

  3. Find the MLE of the parameters for the distributions of type j fitting χi2as pji2=arg{maxpj[k=1ni2PDFj(xki2,pj)]}.

  4. Build the fitted cumulative distribution function CDFj(x,pji2)describing χi2.

  5. Calculate the actual Kuiper statistic Vji1,i2according to (11).

  6. Merge the samples χi1and χi2, and form the merged data set χi1+i2.

  7. Find the MLE of the parameters for the distributions of type j fitting χi1+i2aspji1+i2=arg{maxpj[k=1ni1PDFj(xki1,pj)k=1ni2PDFj(xki2,pj)]}.

  8. Fit the merged fitted cumulative distribution function CDFj(x,pji1+i2)to χi1+i2.

  9. Repeat for r=1,2,…, nMC (in fact use nMC simulation cycles):

    1. a. generate a synthetic dataset χri1,syn={x1,ri1,syn,x2,ri1,syn,...,xni1,ri1,syn}from the fitted cumulative distribution function CDFj(x,pji1+i2);

    2. b. find the MLE of the parameters for the distributions of type j fitting χri1,synas

      pj,ri1,syn=arg{maxpj[k=1ni1PDFj(xk,ri1,syn,pj)]};

    3. c. build the theoretical distribution function CDFj,rsyn(x,pj,ri1,syn)describing χri1,syn;

    4. d. generate a synthetic dataset χri2,syn={x1,ri2,syn,x2,ri2,syn,...,xni2,ri2,syn}from the fitted cumulative distribution function CDFj(x,pji1+i2);

    5. e. find the MLE of the parameters for the distributions of type j fitting χri2,synas

      pj,ri2,syn=arg{maxpj[k=1ni2PDFj(xk,ri2,syn,pj)]};

    6. f. build the theoretical distribution function CDFj,rsyn(x,pj,ri2,syn)describing χri2,syn;

    7. g. estimate the rth instance of the synthetic Kuiper statistic as:

      Vj,ri1,i2,syn=maxx{CDFj,rsyn(x,pj,ri1,syn)CDFj,rsyn(x,pj,ri2,syn)}++maxx{CDFj,rsyn(x,pj,ri2,syn)CDFj,rsyn(x,pj,ri1,syn)}.

  1. The p-value Pvalue,ji1,i2of the statistical test (the probability to reject a true hypothesis H0 that the jth type theoretical distribution function CDFj(x,pji1)and CDFj(x,pji2)are identical) is estimated as the frequency of generating synthetic Kuiper statistic greater than the actual Kuiper statistic Vji1,i2from step 5:

    Pvalue,ji1,i2=1nmcr=1Vji1,i2<Vj,ri1,i2,synnmc1E12

    Formula (12), similar to (2), is the sum of the indicator function of the crisp set, defined as all synthetic dataset pairs with a Kuiper statistic greater than Vji1,i2.

    If Pvalue,ji1,i2<0.05 the hypothesis H0 is rejected.

3. Software

A platform of program functions, written in MATLAB environment, is created to execute the statistical procedures from the previous section. At present the platform allows users to test the fit of 11 types of distributions on the datasets. A description of the parameters and PDF of the embodied distribution types is given in Table 1 [14, 15]. The platform also permits the user to add optional types of distribution.

The platform contains several main program functions. The function set_distribution contains the information about the 11 distributions, particularly their names, and the links to the functions that operate with the selected distribution type. Also, the function permits the inclusion of new distribution type. In that case, the necessary information the user must provide as input is the procedures to find the CDF, PDF, the maximum likelihood measure, the negative log-likelihood, the mean and variance and the methods of generating random arrays from the given distribution type. The function also determines the screen output for each type of distribution.

Beta distributionLognormal distribution
Parametersα>0,β>0Parametersμ(+),σ>0,
Supportx[0; 1]Supportx[0; +)
PDFf(xαβ)=xα1(1x)β1B(αβ),
whereB(αβ)is a beta function
PDFf(x;μ,σ)=1xσ2πe(ln(x)μ)22σ2
Exponential distributionNormal distribution
Parametersλ>0Parametersμ,σ>0
Supportx[0; +)Supportx(; +)
PDFf(xλ)={λeλxfor x00for x<0PDFf(x;μ,σ)=1σ2πe(xμ)22σ2
Extreme value distributionRayleigh distribution
Parametersα,β0Parametersσ>0
Supportx(; +)Supportx[0; +)
PDFf(xαβ)=e[(αx)/β]e(αx)/ββPDFf(x;σ)=1σ2×[xexp(x22σ2)]
Gamma distributionUniform distribution
Parametersk>0,θ>0Parametersa,b(; +)
Supportx[0; +)Supportaxb
PDFf(xkθ)=xk1ex/θθkΓ(k),
whereΓ(k)is a gamma function
PDFf(x;a,b)={1bafor axb0for x<aorx>b
Generalized extreme value distributionWeibull distribution
Parametersμ(; +),σ(0; +),ξ(; +)Parametersλ>0,k>0
Supportx>μσ/ξ(ξ>0),x<μσ/ξ(ξ<0),x(; +)(ξ=0)Supportx[0; +)
PDF1σ(1+ξz)1/ξ1e(1+ξz)1/ξwherez=x-μσPDFf(xλk)=={kλ(xλ)k1e(x/λ)kforx00forx<0
Generalized Pareto distribution
Parametersxm>0,k>0
Supportx[xm; +)
PDFf(xxmk)=kxmkxk+1

Table 1.

Parameters, support and formula for the PDF of the eleven types of theoretical distributions embodied into the MATLAB platform

The program function kutest2 performs a two-sample Kuiper test to determine if the independent random datasets are drawn from the same underlying continuous population, i.e. it solves Task 2 (see section 2.2) (to check whether two different datasets are drawn from the same general population).

Another key function is fitdata. It constructs the fit of each theoretical distribution over each dataset, evaluates the quality of the fits, and gives their parameters. It also checks whether two distributions of one type fitted to two different arbitrary datasets are identical. In other words, this function is associated with Task 1 and 3 (see sections 2.1 and 2.2). To execute the Kuiper test the function calls kutest. Finally, the program function plot_print_data provides the on-screen results from the statistical analysis and plots figures containing the pair of distributions that are analyzed. The developed software is available free of charge upon request from the authors provided proper citation is done in subsequent publications.

4. Source of experimental data for analysis

The statistical procedures and the program platform introduced in this chapter are implemented in an example focusing on the morphometric evaluation of the effects of thrombin concentration on fibrin structure. Fibrin is a biopolymer formed from the blood-borne fibrinogen by an enzyme (thrombin) activated in the damaged tissue at sites of blood vessel wall injury to prevent bleeding. Following regeneration of the integrity of the blood vessel wall, the fibrin gel is dissolved to restore normal blood flow, but the efficiency of the dissolution strongly depends on the structure of the fibrin clots. The purpose of the evaluation is to establish any differences in the density of the branching points of the fibrin network related to the activity of the clotting enzyme (thrombin), the concentration of which is expected to vary in a broad range under physiological conditions.

For the purpose of the experiment, fibrin is prepared on glass slides in total volume of 100 μlby clotting 2 mg/ml fibrinogen (dissolved in different buffers) by varying concentrations of thrombin for 1 h at 37 °C in moisture chamber. The thrombin concentrations in the experiments vary in the range 0.3 – 10 U/ml, whereas the two buffers used are: 1) buffer1 – 25 mM Na-phosphate pH 7.4 buffer containing 75 mM NaCl; 2) buffer2 - 10 mM N-(2-Hydroxyethyl) piperazine-N’-(2-ethanesulfonic acid) (abbreviated as HEPES) pH 7.4 buffer containing 150 mM NaCl. At the end of the clotting time the fibrins are washed in 3 ml 100 mM Na-cacodilate pH 7.2 buffer and fixated with 1% glutaraldehyde in the same buffer for 10 min. Thereafter the fibrins are dried in a series of ethanol dilutions (20 – 96 %), 1:1 mixture of 96 %(v/v) ethanol/acetone and pure acetone followed by critical point drying with CO2 in E3000 Critical Point Drying Apparatus (Quorum Technologies, Newhaven, UK). The dry samples are examined in Zeiss Evo40 scanning electron microscope (Carl Zeiss, Jena, Germany) and images are taken at an indicated magnification. A total of 12 dry samples of fibrins are elaborated in this fashion, each having a given combination of thrombin concentration and buffer. Electron microscope images are taken for each dry sample (one of the analyzed dry samples of fibrins is presented in Fig. 1). Some main parameters of the 12 collected datasets are given in Table 2.

An automated procedure is elaborated in MATLAB environment (embodied into the program function find_distance.m) to measure lengths of fibrin strands (i.e. sections between two branching points in the fibrin network) from the SEM images. The procedure takes the file name of the fibrin image (see Fig. 1) and the planned number of measurements as input. Each file contains the fibrin image with legend at the bottom part, which gives the scale, the time the image was taken, etc.

The first step requires setting of the scale. A prompt appears, asking the user to type the numerical value of the length of the scale in μm. Then the image appears on screen and a red line has to be moved and resized to fit the scale (Fig. 2a and 2b). The third step requires a red rectangle to be placed over the actual image of the fibrin for selection of the region of interest (Fig. 2c). With this, the preparations of the image are done, and the user can start taking the desired number of measurements for the distances between adjacent nodes (Fig. 2d).

Using this approach 12 datasets containing measurements of lengths between branching points of fibrin have been collected (Table 2) and the three statistical tasks described above are executed over these datasets.

DatasetsNmeanemedestdeiqreThrombinconcentrationBuffer
DS12740.97360.81210.51790.61601.0buffer1
DS2681.0230.93740.57080.761510.0buffer1
DS32001.0480.87480.65900.64694.0buffer1
DS42761.0020.90030.47850.59700.5buffer1
DS52120.68480.63680.31550.40301.0buffer2
DS63000.12200.12650.043990.055601.2buffer2
DS72850.78020.73790.32530.43012.5buffer2
DS82770.98700.93260.43990.57020.6buffer2
DS92000.55750.52840.23280.28300.3buffer1
DS103010.75680.65550.38050.44910.6buffer1
DS113010.78750.75600.34250.47761.2buffer1
DS123070.650000.59620.25900.32502.5buffer1

Table 2.

Distance between branching points of fibrin fibers. Sample size (N), mean (meane in μm), median (mede in μm), standard deviation (stde), inter-quartile range (iqre, in μm) of the empirical distributions over the 12 datasets for different thrombin concentrations (in U/ml) and buffers are presented

Figure 1.

SEM image of fibrin used for morphometric analysis

Figure 2.

Steps of the automated procedure for measuring distances between branching points in fibrin. Panels a and b: scaling. Panel c: selection of region of interest. Panel d: taking a measurement

4.1. Task 1 – Finding a common distribution fit

A total of 11 types of distributions (Table 1) are tested over the datasets, and the criteria (3)-(6) are evaluated. The Kuiper statistic’s distribution is constructed with 1000 Monte Carlo simulation cycles. Table 3 presents the results regarding the distribution fits, where only the maximal values for minPvalue,jfitand meanPvalue,jfit, along with the minimal values for AICj and BICj across the datasets are given. The results allow ruling out the beta and the uniform distributions. The output of the former is NaN (not-a-number) since it does not apply to values of x[0; 1]. The latter has the lowest values of (3) and (4), and the highest of (5) and (6), i.e. it is the worst fit. The types of distributions worth using are mostly the lognormal distribution (having the lowest AIC and BIC), and the generalized extreme value (having the highest possible meanPvalue,jfit). Figure 3 presents 4 of the 11 distribution fits to DS4. Similar graphical output is generated for all other datasets and for all distribution types.

Distribution type123456
AICNaN3.705e+33.035e+38.078e+27.887e+21.633e+3
BICNaN3.873e+33.371e+31.144e+31.293e+32.137e+3
minPvaluefit5.490e–1005.000e–31.020e–10
meanPvaluefitNaN005.914e–16.978e–17.500e–4
Distribution type7891011
AIC7.847e+21.444e+31.288e+33.755e+31.080e+3
BIC1.121e+31.781e+31.457e+34.092e+31.416e+3
minPvaluefit8.200e–20000
meanPvaluefit5.756e–12.592e–28.083e–201.118e–1

Table 3.

Values of the criteria used to evaluate the goodness-of-fit of 11 types of distributions over the datasets with 1000 Monte Carlo simulation cycles. The table contains the maximal values for minPvalue,jfitand meanPvalue,jfit, and the minimal values for AICj and BICj across the datasets for each distribution type. The bold and the italic values are the best one and the worst one achieved for a given criterion, respectively.

Legend: The numbers of the distribution types stand for the following: 1- beta, 2 – exponential, 3 – extreme value, 4- gamma, 5 - generalized extreme value, 6 – generalized Pareto; 7 – lognormal, 8 – normal, 9 – Rayleigh, 10 – uniform, 11 – Weibull


Figure 3.

Graphical results from the fit of the lognormal (a), generalized extreme value (b), exponential (c), and uniform (d) distributions over DS4 (where μ,σ,Xmin,Xmax,k are the parameters of the theoretical distributions from Table 1)

4.2. Task 2 – Identity of empirical distributions

Table 4 contains the p-value calculated according to (9) for all pairs of distributions. The bolded values indicate the pairs, where the null hypothesis fails to be rejected and it is possible to assume that those datasets are drawn from the same general population. The results show that it is possible to merge the following datasets: 1) DS1, DS2, DS3, D4 and DS8; 2) DS7, DS10, and DS11; 3) DS5 and DS12. All other combinations (except DS5 and DS10) are not allowed and may give misleading results in a further statistical analysis, since the samples are not drawn from the same general population. Figure 4a presents the stair-case distributions over DS4 (with meane4=1.002, mede4=0.9003, stde4=0.4785, iqre4=0.5970) and DS9 (with meane9=0.5575, mede9=0.5284, stde9=0.2328, iqre9=0.2830). The Kuiper statistic for identity of the empirical distributions, calculated according to (8), is V4,9=0.5005, whereas according to (9) Pvalue,e4,9=2.024e–24<0.05. Therefore the null hypothesis is rejected, which is also evident from the graphical output. In the same fashion, Figure 4b presents the stair-case distributions over DS1 (with meane1=0.9736, mede1=0.8121, stde1=0.5179, iqre1=0.6160) and DS4. The Kuiper statistic for identity of the empirical distributions, calculated according to (8), is V1,4=0.1242, whereas according to (9) Pvalue,e1,4=0.1957>0.05. Therefore the null hypothesis fails to be rejected, which is also confirmed by the graphical output.

Figure 4.

Comparison of the stair-case empirical distributions over DS4 and DS9 (a) and over DS1 and DS4 (b)

DatasetsDS1DS2DS3DS4DS5DS6DS7DS8DS9DS10DS11DS12
DS11.00e+003.81e-016.18e-011.96e-015.80e-068.88e-1253.46e-035.21e-024.57e-191.73e-041.89e-022.59e-10
DS23.81e-011.00e+006.77e-016.11e-011.94e-055.13e-442.13e-032.92e-011.71e-097.17e-045.34e-033.96e-08
DS36.18e-016.77e-011.00e+002.01e-011.46e-071.84e-1016.94e-051.47e-011.79e-205.05e-061.55e-031.53e-12
DS41.96e-016.11e-012.01e-011.00e+005.47e-111.73e-1235.14e-058.57e-012.02e-249.34e-083.50e-052.02e-17
DS55.80e-061.94e-051.46e-075.47e-111.00e+002.61e-1009.67e-031.59e-116.68e-042.32e-011.65e-021.52e-01
DS68.88e-1255.13e-441.84e-1011.73e-1232.61e-1001.00e+007.45e-1241.69e-1253.14e-947.35e-1259.98e-1261.75e-124
DS73.46e-032.13e-036.94e-055.14e-059.67e-037.45e-1241.00e+009.53e-057.13e-111.64e-014.59e-012.49e-05
DS85.21e-022.92e-011.47e-018.57e-011.59e-111.69e-1259.53e-051.00e+001.04e-251.19e-086.36e-068.47e-19
DS94.57e-191.71e-091.79e-202.02e-246.68e-043.14e-947.13e-111.04e-251.00e+003.48e-066.05e-124.64e-03
DS101.73e-047.17e-045.05e-069.34e-082.32e-017.35e-1251.64e-011.19e-083.48e-061.00e+001.55e-019.18e-03
DS111.89e-035.34e-031.55e-033.50e-051.65e-029.98e-1264.59e-016.36e-066.05e-121.55e-011.00e+002.06e-04
DS122.59e-103.96e-081.53e-122.02e-171.52e-011.75e-1242.49e-058.47e-194.64e-039.18e-032.06e-041.00e+00

Table 4.

P-values of the statistical test for identity of stair-case distributions on pairs of datasets. The values on the main diagonal are shaded. The bold values are those that exceed 0.05, i.e. indicate the pairs of datasets whose stair-case distributions are identical.

4.3. Task 3 – Identity of fitted distributions

As concluded in task 1, the lognormal distribution provides possibly the best fit to the 12 datasets. Table 5 contains the p-values calculated according to (12) for the lognormal distribution fitted to the datasets with 1000 Monte Carlo simulation cycles. The bold values indicate the pairs, where the null hypothesis fails to be rejected and it is possible to assume that the distribution fits are identical. The results show that the lognormal fits to the following datasets are identical: 1) DS1, DS2, DS3, and DS4; 2) DS1, DS4, and DS8; 3) DS7, DS10, and DS11; 4) DS5 and DS10; 5) DS5 and DS12. These results correlate with the identity of the empirical distribution. Figure 5a presents the fitted lognormal distribution over DS4 (with μ= –0.1081, σ=0.4766, mean74=1.005, med74=0.8975, mode74=0.7169, std74=0.5077, iqr74dy=0.5870) and DS9 (with μ= –0.6694, σ=0.4181, mean79=0.5587, med79=0.5120, mode79=0.4322, std79=0.2442, iqr79=0.2926). The Kuiper statistic for identity of the fits, calculated according to (11), is V74,9=0.4671, whereas according to (12), Pvalue,74,9=0<0.05. Therefore the null hypothesis is rejected, which is also evident from the graphical output. In the same fashion, Fig. 5b presents the lognormal distribution fit over DS1 (with μ= –1477, σ=0.4843, mean71=0.9701, med71=0.8627, mode71=0.6758, std71=0.4988, iqr71=0.5737) and DS4. The Kuiper statistic for identity of the fits, calculated according to (11), is V71,4=0.03288, whereas according to (12), Pvalue,71,4=0.5580>0.05. Therefore the null hypothesis fails to be rejected, which is also evident from the graphical output.

Figure 5.

Comparison of the lognormal distribution fits over DS4 and DS9 (a) and over DS1 and DS4 (b)

DatasetsDS1DS2DS3DS4DS5DS6DS7DS8DS9DS10DS11DS12
DS11.001.39e–11.90e–15.58e–10.000.000.003.49e–10.000.000.000.00
DS21.39e–11.006.37e–11.05e–10.000.000.003.40e–20.000.001.00e–30.00
DS31.90e–16.37e–11.002.01e–10.000.000.003.20e–20.000.000.000.00
DS45.58e–11.05e–12.01e–11.000.000.000.006.65e–10.000.000.000.00
DS50.000.000.000.001.000.001.00e–30.000.005.70e–21.00e–35.10e–2
DS60.000.000.000.000.001.000.000.000.000.000.000.00
DS70.000.000.000.001.00e–30.001.000.000.008.70e–27.90e–10.00
DS83.49e–13.40e–23.20e–26.65e–10.000.000.001.000.000.000.000.00
DS90.000.000.000.000.000.000.000.001.000.000.000.00
DS100.000.000.000.005.70e–20.008.70e–20.000.001.001.86e–10.00
DS110.001.00e–30.000.001.00e–30.007.90e–10.000.001.86e–11.000.00
DS120.000.000.000.005.10e–20.000.000.000.000.000.001.00

Table 5.

P-values of the statistical test that the lognormal fitted distributions over two datasets are identical. The values on the main diagonal are shaded. The bold values indicate the distribution fit pairs that may be assumed as identical.

The statistical procedures described above have been successfully applied for the solution of important medical problems [16; 17]. At first we could prove the role of mechanical forces in the organization of the final architecture of the fibrin network. Our ex vivo exploration of the ultrastructure of fibrin at different locations of surgically removed thrombi evidenced gross differences in the fiber diameter and pore area of the fibrin network resulting from shear forces acting in circulation (Fig. 6). In vitro fibrin structures were also generated and their equivalence with the in vivo fibrin architecture was proven using the distribution analysis described in this chapter (Fig. 7). Stretching changed the arrangement of the fibers (Fig. 7A) to a pattern similar to the one observed on the surface of thrombi (Fig. 6A); both the median fiber diameter and the pore area of the fibrins decreased 2-3-fold and the distribution of these morphometric parameters became more homogeneous (Fig. 7B). Thus, following this verification of the experimental model ultrastructure, the in vitro fibrin clots could be used for the convenient evaluation of these structures with respect to their chemical stability and resistance to enzymatic degradation [16].

Figure 6.

Fibrin structure on the surface and in the core of thrombi. A. Following thrombectomy thrombi were washed, fixed and dehydrated. SEM images were taken from the surface and transverse section of the same thrombus sample, scale bar = 2 μm. DG: a thrombus from popliteal artery, SJ: a thrombus from aorto-bifemoral by-pass Dacron graft. B. Fiber diameter (upper graphs) and fibrin pore area (lower graphs) were measured from the SEM images of the DG thrombus shown in A using the algorithms described in this chapter. The graphs present the probability density function (PDF) of the empirical distribution (black histogram) and the fitted theoretical distribution (grey curves). The numbers under the location of the observed fibrin structure show the median, as well as the bottom and the top quartile values (in brackets) of the fitted theoretical distributions (lognormal for fiber diameter and generalized extreme value for area). The figure is reproduced from Ref. [16].

Figure 7.

Changes in fibrin network structure caused by mechanical stretching. A. SEM images of fibrin clots fixed with glutaraldehyde before stretching or following 2-and 3-fold stretching as indicated, scale bar = 2 μm. B. Fiber diameter (upper graphs) and fibrin pore area (lower graphs) were measured from the SEM images illustrated in A using the algorithms described in this chapter. The graphs present the probability density function (PDF) of the empiric distribution (black histogram) and the fitted theoretical distribution (grey curves). The numbers under the fibrin type show the median, as well as the bottom and the top quartile values (in brackets) of the fitted theoretical distributions (lognormal for fiber diameter and generalized extreme value for area). The figure is reproduced from Ref. [16].

Application of the described distribution analysis allowed identification of the effect of red blood cells (RBCs) on the structure of fibrin [17]. The presence of RBCs at the time of fibrin formation causes a decrease in the fiber diameter (Fig. 8) based on a specific interaction between fibrinogen and a cell surface receptor. The specificity of this effect could be proven partially by the sensitivity of the changes in the distribution parameters to the presence of a drug (eptifibatide) that blocks the RBC receptor for fibrinogen (compare the median and interquartile range values for the experimental fibrins in the presence and absence of the drug illustrated in Fig. 8). It is noteworthy that the type of distribution was not changed by the drug, only its parameters were modified. This example underscores the applicability of the designed procedure for testing of statistical hypotheses in situations when subtle quantitative biological and pharmacological effects are at issue.

Figure 8.

Changes. in the fibrin network structure caused by red blood cells and eptifibatide. The SEM images in Panel A illustrate the fibrin structure in clots of identical volume and fibrinogen content in the absence or presence of 20 % RBC. Panel B shows fiber diameter measured from the SEM images for a range of RBC-occupancy in the same clot model. Probability density functions (PDF) of the empirical distribution (black histogram) and the fitted lognormal theoretical distribution (grey curves) are presented with indication of the median and the interquartile range (in brackets) of the fitted theoretical distributions. In the presence of RBC the parameters of the fitted distributions of the eptifibatide-free and eptifibatide-treated fibers differ at p<0.001 level (for the RBC-free fibrins the eptifibatide-related difference is not significant, p>0.05). The figure is reproduced from Ref. [17].

5. Discussion and conclusions

This chapter addressed the problem of identifying a single type of theoretical distribution that fits to different datasets by altering its parameters. The identification of such type of distribution is a prerequisite for comparing the results, performing interpolation and extrapolation over the data, and studying the dependence between the input parameters (e.g. initial conditions of an experiment) and the distribution parameters. Additionally, the procedures included hypothesis tests addressing the identity of empirical (stair-case) and of fitted distributions. In the case of empirical distributions, the failure to reject the null hypothesis proves that samples come from one and the same general population. In the case of fitted distributions, the failure to reject the null hypothesis proves that although parameters are random (as the fits are also based on random data), the differences are not statistically significant. The implementation of the procedures is facilitated by the creation of a platform in MATLAB that executes the necessary calculation and evaluation procedures.

Some parts of the three problems analyzed in this chapter may be solved using similar methods or software tools different from the MATLAB procedures described in section 3. Some software packages solve the task of choosing the best distribution type to fit the data [18, 19]. The appropriateness of the fit is defined by the goodness-of-fit metrics, which may be selected by the user. The Kolmogorov-Smirnov statistics is recommended for the case of samples with continuous variables, but strictly speaking the analytical Kolmogorov-Smirnov distribution should not be used to calculate the p-value in case any of the parameters has been calculated on the basis of the sample as explicitly stated in [19]. Its widespread application, however, is based on the fact that it is the most conservative, i.e. the probability to reject the null hypothesis is lower compared to the other goodness-of-fit criteria. Some available tools [20] also use analytical expressions to calculate the p-value of the Kolmogorov-Smirnov test in the case of a sample that is normally distributed, exponentially distributed or extreme-value distributed [21, 22]. Those formulae are applied in the lillietest MATLAB function from the Statistical toolbox, where Monte-Carlo simulation is conducted for the other distributions. It is recommended to use Monte-Carlo simulation even for the three aforementioned distributions in case any of the parameters has been derived from the sample. Some applications calculate a goodness-of-fit metrics of a single sample as a Kuiper statistics (e.g. in the awkwardly spelled kupiertest MATLAB function of [23]) and the p-value is calculated analytically. The main drawback of that program is that the user must guarantee that the parameters of the theoretical distribution have not been calculated from the sample. Other available applications offer single-sample Kuiper test (e.g. v.test function in [24]) or single- and two-sample Kuiper tests (e.g. KuiperTest function in [25]), which use Monte-Carlo simulation. The results of the functions v.test and KuiperTest are quite similar to those presented in this chapter, the main difference being our better approximation of the empirical distribution with a linear function, rather than with a histogram. Our approach to calculate p-values with Monte-Carlo simulation stems from the previously recognized fact that “…if one or more parameters have to be estimated, the standard tables for the Kuiper test are no longer valid …” [26]. Similar concepts have been proposed by others too [27].

An advantage of the method applied by us is that the Kuiper statistics is very sensitive to discrepancies at the tails of the distribution, unlike the Kolmogorov-Smirnov statistics, whereas at the same time it does not need to distribute the data into bins, as it is for the chi-square statistics. Another advantage is that the method is very suitable for circular probability distributions [23, 24], because it is invariant to the starting point where cyclic variations are observed in the sample. In addition it is easily generalized for multi-dimensional cases [25].

A limitation of our method is that it cannot be used for discrete variables [25], whereas the Kolmogorov-Smirnov test could be easily modified for the discrete case. The second drawback is that if the data are not i.i.d. (independent and identically distributed), then all Bootstrap and Monte-Carlo simulations give wrong results. In that case, the null hypothesis is rejected even if true, but this is an issue with all Monte-Carlo approaches. Some graphical and analytical possibilities to test the i.i.d. assumption are described in [19].

Further extension of the statistical procedures proposed in this chapter may focus on the inclusion of additional statistical tests evaluating the quality of the fits and the identity of the distributions. The simulation procedures in Task 3 may be modified to use Bootstrap, because this method relies on fewer assumptions about the underlying process and the associated measurement error [28]. Other theoretical distribution types could also be included in the program platform, especially those that can interpret different behaviour of the data around the mean and at the tails. Finally, further research could focus on new areas (e.g. economics, finance, management, other natural sciences) to implement the described procedures.

Acknowledgements

This. research is funded by the Hungarian Scientific Research Fund OTKA 83023. The authors wish to thank Imre Varju from the Department of Medical Biochemistry, Semmelweis University, Budapest, Hungary for collecting the datasets with length measurements, and Laszlo Szabo from the Chemical Research Center, Hungarian Academy of Sciences, Budapest, Hungary for taking the SEM images.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Natalia D. Nikolova, Daniela Toneva-Zheynova, Krasimir Kolev and Kiril Tenekedjiev (March 6th 2013). Monte Carlo Statistical Tests for Identity of Theoretical and Empirical Distributions of Experimental Data, Theory and Applications of Monte Carlo Simulations, Victor (Wai Kin) Chan, IntechOpen, DOI: 10.5772/53049. Available from:

Embed this chapter on your site Copy to clipboard

<iframe src="http://www.intechopen.com/embed/theory-and-applications-of-monte-carlo-simulations/monte-carlo-statistical-tests-for-identity-of-theoretical-and-empirical-distributions-of-experimenta" />

Embed this code snippet in the HTML of your website to show this chapter

chapter statistics

2489total chapter downloads

1Crossref citations

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Monte Carlo Simulations Applied to Uncertainty in Measurement

By Paulo Roberto Guimarães Couto, Jailton Carreteiro Damasceno and Sérgio Pinheiro de Oliveira

Related Book

First chapter

Introduction

By Jose Ignacio Huertas

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More about us