Often experimental work requires analysis of many datasets derived in a similar way. For each dataset it is possible to find a specific theoretical distribution that best describes the sample. A basic assumption in this type of work is that if the mechanism (experiment) generating the samples is the same, then the distribution type that describes the datasets will also be the same [1]. In that case, the difference between the sets will be captured not through a change of the type of the distribution, but through changes in its parameters. There are several advantages in establishing whether a single type of theoretical distribution fits several datasets. First, it improves the fit, because the assumptions concerning the mechanism underlying the experiment can be verified against several datasets. Second, it becomes possible to investigate how the variation of the input parameters influences the parameters of the theoretical distribution. In some experiments it may turn out that the differences in the input conditions lead to a qualitative change of the fitted distributions (i.e. a change of the distribution type). In other cases the variation of the input conditions may lead only to quantitative changes in the output (i.e. changes in the parameters of the distribution). Then it is important to investigate the statistical significance of the quantitative differences, i.e. to test the statistical significance of the differences between the distribution parameters. In some cases it may not be possible to find a single type of distribution that fits all datasets. A possible option in such cases is to construct empirical distributions according to known techniques [2] and to investigate whether their differences are statistically significant. In any case, proving that the observed differences between theoretical, or between empirical, distributions are not statistically significant allows merging datasets and operating on a larger amount of data, which is a prerequisite for higher precision of the statistical results. This task is similar to testing for stability in regression analysis [3].
By formulating three separate tasks, this chapter addresses the problem of identifying an appropriate distribution type that fits several one-dimensional (1-D) datasets and of testing the statistical significance of the observed differences, both in the empirical and in the fitted distributions, for each pair of samples. The first task (Task 1) aims at identifying a type of 1-D theoretical distribution that best fits the samples in several datasets by varying its parameters. The second task (Task 2) is to test the statistical significance of the difference between the two empirical distributions of a pair of 1-D datasets. The third task (Task 3) is to test the statistical significance of the difference between two fitted distributions of the same type over two arbitrary datasets.
Task 2 can be performed independently of the existence of a theoretical distribution fit valid for all samples. Therefore, comparing and eventually merging pairs of samples will always be possible. This task requires comparing two independent discontinuous (staircase) empirical cumulative distribution functions (CDFs). This is a standard problem, and the approach here is based on a symmetric variant of the Kolmogorov-Smirnov test [4] called the Kuiper two-sample test, which estimates the closeness of a pair of independent staircase CDFs by finding the maximum positive and the maximum negative deviation between the two [5]. The distribution of the test statistic is known, and the p value of the test can be readily estimated.
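In generic notation (the symbols here are illustrative, not the chapter's own), for two empirical CDFs F_{1,n} and F_{2,m} the Kuiper statistic is the sum of these two one-sided deviations:

```latex
V_{n,m} = D^{+} + D^{-}
        = \max_{x}\left[F_{1,n}(x) - F_{2,m}(x)\right]
        + \max_{x}\left[F_{2,m}(x) - F_{1,n}(x)\right]
```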
Tasks 1 and 3 introduce the novel elements of this chapter. Task 1 searches for a type of theoretical distribution (out of an enumerated list of distributions) which best fits multiple datasets by varying its specific parameter values. The performance of a distribution fit is assessed through four criteria, namely the Akaike Information Criterion (AIC) [6], the Bayesian Information Criterion (BIC) [7], and the average and the minimal p value of the distribution fit over all datasets. Since the datasets contain random measurements, the parameter values of each fit acquired in Task 1 are random, too. That is why it is necessary to check, for each pair of datasets, whether the differences between the fits are statistically significant. If they are not, the two theoretical fits can be treated as identical and the samples may be merged. In Task 1 the distribution of the Kuiper statistic cannot be calculated in closed form, because the problem is to compare an empirical distribution with its own fit, so the independence assumption is violated. The distribution of the Kuiper statistic in Task 3 cannot be derived in closed form either, because there two analytical distributions, rather than two staircase CDFs, have to be compared. For that reason the distributions of the Kuiper statistic in Tasks 1 and 3 are constructed via Monte Carlo simulation procedures, which in Task 1 are based on the bootstrap [8].
The described approach is illustrated with practical applications for the characterization of the fibrin structure in natural and experimental thrombi evaluated with scanning electron microscopy (SEM).
The approach considers
The procedure assumes that
The empirical cumulative distribution function
It is convenient to introduce “before-first”
mean:
median:
standard deviation:
inter-quartile range:
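As a small illustration (the variable names are ours, not the chapter's), these sample characteristics and the staircase empirical CDF of one dataset can be obtained in MATLAB as follows:

```matlab
% x is a column vector holding one dataset of measurements
x  = sort(x(:));      % raw sample, sorted
m  = mean(x);         % sample mean
md = median(x);       % sample median
s  = std(x);          % sample standard deviation
r  = iqr(x);          % inter-quartile range (Statistics Toolbox)
[F, xs] = ecdf(x);    % staircase empirical CDF: values F at points xs
```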
The non-zero part of the empirical density
The improper integral
If the samples are distributed with density
mean:
median:
mode:
standard deviation:
inter-quartile range:
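Correspondingly, once a distribution of a given type has been fitted to the sample by maximum likelihood, the analogous theoretical quantities can be read directly from the fitted distribution object. The lognormal type in the sketch below is used purely as an example; the actual type is whatever the procedure selects:

```matlab
pd    = fitdist(x, 'Lognormal');          % MLE fit of a chosen distribution type
mu_t  = mean(pd);                         % theoretical mean
md_t  = median(pd);                       % theoretical median
sd_t  = std(pd);                          % theoretical standard deviation
iqr_t = icdf(pd, 0.75) - icdf(pd, 0.25);  % theoretical inter-quartile range
nll   = negloglik(pd);                    % negative log-likelihood, reused for AIC/BIC
```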
The quality of the fit can be assessed using a statistical hypothesis test. The null hypothesis is that the sample is drawn from the fitted theoretical distribution.
The theoretical Kuiper distribution is derived only for the case of two independent staircase distributions, not for a continuous distribution fitted to the very data it is compared with [5]. That is why the distribution of
The algorithm of the proposed procedure is the following:
Construct the empirical cumulative distribution function
Find the MLE of the parameters for the distributions of type
Build the fitted cumulative distribution function
Calculate the actual Kuiper statistic
Repeat for
generate a synthetic dataset
construct the synthetic empirical distribution function
find the MLE of the parameters for the distributions of type
build the theoretical distribution function
estimate the
The p-value
In fact, (2) is the sum of the indicator function of the crisp set, defined as all synthetic datasets with a Kuiper statistic greater than
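A compact sketch of this procedure in MATLAB could look as follows (the function and variable names are our own illustrations, not the platform's code). The synthetic datasets are drawn from the fitted distribution (a parametric bootstrap), refitted, and the resulting Kuiper statistics approximate the null distribution from which the p-value is read:

```matlab
function p = kuiperFitPValue(x, distName, nBoot)
% Monte Carlo (parametric bootstrap) p-value of the Kuiper-type
% goodness-of-fit test for distribution type distName fitted to sample x.
x  = sort(x(:));
n  = numel(x);
pd = fitdist(x, distName);            % MLE fit to the real data
V0 = kuiperGoF(x, pd);                % actual Kuiper statistic
Vb = zeros(nBoot, 1);
for b = 1:nBoot
    xb  = sort(random(pd, n, 1));     % synthetic dataset drawn from the fit
    pdb = fitdist(xb, distName);      % refit of the same type on the synthetic data
    Vb(b) = kuiperGoF(xb, pdb);       % synthetic Kuiper statistic
end
p = mean(Vb >= V0);                   % share of synthetic statistics exceeding the actual one
end

function V = kuiperGoF(xSorted, pd)
% Kuiper statistic between a staircase empirical CDF and a fitted CDF.
n  = numel(xSorted);
Ft = cdf(pd, xSorted);                % fitted CDF at the (sorted) data points
Dp = max((1:n)' / n - Ft);            % maximum positive deviation
Dm = max(Ft - (0:n-1)' / n);          % maximum negative deviation
V  = Dp + Dm;
end
```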
The performance of each theoretical distribution should be assessed according to its goodness-of-fit measures to the
The basic criterion is the minimal p-value of the theoretical distribution fits to the
The first auxiliary criterion is the average of the p-values of the theoretical distribution fits to the
The second and the third auxiliary criteria are the Akaike Information Criterion (AIC) [6] and the Bayesian Information Criterion (BIC) [7], which correct the negative log-likelihoods for the number of estimated parameters:
for
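For reference, the standard single-sample forms of these two criteria, expressed through the negative log-likelihood NLL, the number of estimated parameters k and the sample size n, are as follows (the chapter's exact aggregation over all datasets may differ):

```latex
\mathrm{AIC} = 2k + 2\,\mathrm{NLL}, \qquad
\mathrm{BIC} = k\ln n + 2\,\mathrm{NLL}
```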
That solves the problem of selecting the best theoretical distribution type for fitting the samples in the
The second problem is the estimation of the statistical significance of the difference between two datasets. It is equivalent to calculating the
The "staircase" empirical
The distribution of the test statistic
The algorithm for the theoretical solution of Task 2 is straightforward:
1. Construct the "staircase" empirical cumulative distribution function describing the data in
2. Construct the "staircase" empirical cumulative distribution function describing the data in
3. Calculate the actual Kuiper statistic
4. The
where
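A minimal MATLAB sketch of such a two-sample test is given below (an illustration with our own naming, not the platform's code); the asymptotic p-value uses a widely used series approximation of the Kuiper distribution, in which 0.155 and 0.24 are the constants of that approximation:

```matlab
function [p, V] = kuiperTwoSample(x1, x2)
% Two-sample Kuiper test for two independent 1-D samples (Task 2).
x1 = sort(x1(:));  x2 = sort(x2(:));
n1 = numel(x1);    n2 = numel(x2);
xAll = sort([x1; x2]);
F1 = arrayfun(@(t) sum(x1 <= t), xAll) / n1;   % staircase empirical CDF of x1
F2 = arrayfun(@(t) sum(x2 <= t), xAll) / n2;   % staircase empirical CDF of x2
V  = max([F1 - F2; 0]) + max([F2 - F1; 0]);    % Kuiper statistic D+ + D-
Ne = n1 * n2 / (n1 + n2);                      % effective sample size
lambda = (sqrt(Ne) + 0.155 + 0.24 / sqrt(Ne)) * V;
if lambda < 0.4                                % series is effectively 1 in this range
    p = 1;
    return;
end
j = (1:100)';
p = 2 * sum((4 * j.^2 * lambda^2 - 1) .* exp(-2 * j.^2 * lambda^2));
p = min(max(p, 0), 1);                         % guard against numerical overshoot
end
```

With the measurement vectors of two datasets loaded into x1 and x2, a call such as p = kuiperTwoSample(x1, x2) would return the p-value of their comparison.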
The last problem is to test the statistical significance of the difference between two fitted distributions of the same type. Most often this would be the best-fitting distribution type identified in the first problem, but the test is valid for any type. The problem is equivalent to calculating the
The test statistic again is the Kuiper one
As already mentioned, the theoretical Kuiper distribution is derived only for the case of two independent staircase distributions, not for the case of two independent continuous cumulative distribution functions. That is why the distribution of
The algorithm of the proposed procedure is the following:
Find the MLE of the parameters for the distributions of type
Build the fitted cumulative distribution function
Find the MLE of the parameters for the distributions of type
Build the fitted cumulative distribution function
Calculate the actual Kuiper statistic
Merge the samples
Find the MLE of the parameters for the distributions of type
Fit the merged fitted cumulative distribution function
Repeat for
a. generate a synthetic dataset
b. find the MLE of the parameters for the distributions of type
c. build the theoretical distribution function
d. generate a synthetic dataset
e. find the MLE of the parameters for the distributions of type
f. build the theoretical distribution function
g. estimate the
The p-value
Formula (12), similar to (2), is the sum of the indicator function of the crisp set, defined as all synthetic dataset pairs with a Kuiper statistic greater than
If
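One possible MATLAB outline of this procedure is sketched below, again with illustrative names only. Since the two fitted CDFs are continuous, the Kuiper statistic is evaluated on a fine grid spanning both samples, and the null distribution is built by repeatedly drawing a pair of synthetic samples from the fit to the merged data and refitting each of them:

```matlab
function p = kuiperFittedPairPValue(x1, x2, distName, nSim)
% Monte Carlo p-value for the difference between two fitted distributions
% of the same type (Task 3), simulated under the fit to the merged sample.
x1 = x1(:);  x2 = x2(:);
n1 = numel(x1);  n2 = numel(x2);
pd1 = fitdist(x1, distName);
pd2 = fitdist(x2, distName);
pdm = fitdist([x1; x2], distName);              % fit to the merged sample (null model)
grid = linspace(min([x1; x2]), max([x1; x2]), 2000)';
V0 = kuiperCont(pd1, pd2, grid);                % actual Kuiper statistic
Vs = zeros(nSim, 1);
for k = 1:nSim
    q1 = fitdist(random(pdm, n1, 1), distName); % refit of a synthetic pair drawn
    q2 = fitdist(random(pdm, n2, 1), distName); % from the merged fit
    Vs(k) = kuiperCont(q1, q2, grid);
end
p = mean(Vs >= V0);                             % Monte Carlo p-value
end

function V = kuiperCont(pdA, pdB, grid)
% Kuiper statistic between two continuous fitted CDFs, evaluated on a grid.
d = cdf(pdA, grid) - cdf(pdB, grid);
V = max([d; 0]) + max([-d; 0]);
end
```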
A platform of program functions, written in the MATLAB environment, has been created to execute the statistical procedures from the previous section. At present the platform allows users to test the fit of 11 types of distributions to the datasets. A description of the parameters and PDF of the built-in distribution types is given in Table 1 [14, 15]. The platform also permits the user to add further distribution types.
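By way of illustration, a loop over such a candidate list might look as follows; the five distribution names below are merely examples of valid fitdist identifiers, not the chapter's actual list of eleven types (which is given in Table 1), and x stands for one dataset:

```matlab
% Illustrative candidate list; the actual platform embeds the eleven
% types listed in Table 1 and lets the user extend the list.
types = {'Normal', 'Lognormal', 'Gamma', 'Weibull', 'Exponential'};
nTypes = numel(types);
aic = zeros(nTypes, 1);  bic = zeros(nTypes, 1);
for t = 1:nTypes
    pd  = fitdist(x, types{t});            % MLE fit of candidate type t to dataset x
    k   = pd.NumParameters;                % number of estimated parameters
    nll = negloglik(pd);                   % negative log-likelihood of the fit
    aic(t) = 2*k + 2*nll;                  % per-dataset AIC
    bic(t) = k*log(numel(x)) + 2*nll;      % per-dataset BIC
end
[~, best] = min(bic);                      % e.g. pick the type with the smallest BIC
fprintf('Best type by BIC: %s\n', types{best});
```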
The platform contains several main program functions. The function
[Table 1: for each of the eleven built-in distribution types, the rows give its parameters, its support, and the formula of its PDF.]
Parameters, support and formula for the PDF of the eleven types of theoretical distributions built into the MATLAB platform
The program function
Another key function is
The statistical procedures and the program platform introduced in this chapter are implemented in an example focusing on the morphometric evaluation of the effects of thrombin concentration on fibrin structure. Fibrin is a biopolymer formed from the blood-borne fibrinogen by an enzyme (thrombin) activated in the damaged tissue at sites of blood vessel wall injury to prevent bleeding. Following regeneration of the integrity of the blood vessel wall, the fibrin gel is dissolved to restore normal blood flow, but the efficiency of the dissolution strongly depends on the structure of the fibrin clots. The purpose of the evaluation is to establish any differences in the density of the branching points of the fibrin network related to the activity of the clotting enzyme (thrombin), the concentration of which is expected to vary in a broad range under physiological conditions.
For the purpose of the experiment, fibrin is prepared on glass slides in total volume of 100
An automated procedure is implemented in the MATLAB environment (embodied in the program function
The first step requires setting the scale. A prompt appears, asking the user to type the numerical value of the length of the scale in
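A minimal sketch of how such a calibration step could be implemented is shown below; the function name, the prompts and the workflow details are assumptions made for illustration rather than the platform's actual code:

```matlab
function umPerPixel = setImageScale(imageFile)
% Interactive calibration: the user clicks the two ends of the scale bar
% and types its physical length, giving the micrometer-per-pixel factor.
img = imread(imageFile);
imshow(img);                                   % requires Image Processing Toolbox
title('Click the two ends of the scale bar');
[px, py] = ginput(2);                          % two mouse clicks on the scale bar
barPixels = hypot(px(2) - px(1), py(2) - py(1));
barLength = input('Length of the scale bar in micrometers: ');
umPerPixel = barLength / barPixels;
end
```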
Using this approach, 12 datasets containing measurements of the lengths between branching points of fibrin have been collected (Table 2), and the three statistical tasks described above are executed on these datasets.
| Dataset | n | Mean | Median | SD | IQR | Thrombin concentration | Buffer |
|---|---|---|---|---|---|---|---|
| DS1 | 274 | 0.9736 | 0.8121 | 0.5179 | 0.6160 | 1.0 | buffer1 |
| DS2 | 68 | 1.023 | 0.9374 | 0.5708 | 0.7615 | 10.0 | buffer1 |
| DS3 | 200 | 1.048 | 0.8748 | 0.6590 | 0.6469 | 4.0 | buffer1 |
| DS4 | 276 | 1.002 | 0.9003 | 0.4785 | 0.5970 | 0.5 | buffer1 |
| DS5 | 212 | 0.6848 | 0.6368 | 0.3155 | 0.4030 | 1.0 | buffer2 |
| DS6 | 300 | 0.1220 | 0.1265 | 0.04399 | 0.05560 | 1.2 | buffer2 |
| DS7 | 285 | 0.7802 | 0.7379 | 0.3253 | 0.4301 | 2.5 | buffer2 |
| DS8 | 277 | 0.9870 | 0.9326 | 0.4399 | 0.5702 | 0.6 | buffer2 |
| DS9 | 200 | 0.5575 | 0.5284 | 0.2328 | 0.2830 | 0.3 | buffer1 |
| DS10 | 301 | 0.7568 | 0.6555 | 0.3805 | 0.4491 | 0.6 | buffer1 |
| DS11 | 301 | 0.7875 | 0.7560 | 0.3425 | 0.4776 | 1.2 | buffer1 |
| DS12 | 307 | 0.65000 | 0.5962 | 0.2590 | 0.3250 | 2.5 | buffer1 |
Distance between branching points of fibrin fibers. Sample size (n), mean, median, standard deviation (SD), inter-quartile range (IQR), thrombin concentration and buffer for each of the 12 datasets
SEM image of fibrin used for morphometric analysis
Steps of the automated procedure for measuring distances between branching points in fibrin. Panels
A total of 11 types of distributions (Table 1) are tested on the datasets, and the criteria (3)-(6) are evaluated. The distribution of the Kuiper statistic is constructed with 1000 Monte Carlo simulation cycles. Table 3 presents the results regarding the distribution fits, where only the maximal values for