### 2.1. Task 1 – Theoretical solution

The empirical cumulative distribution function $\mathrm{CDF}_e^i(\cdot)$ is initially approximated piecewise linearly over $(n_i+1)$ nodes: $(n_i-1)$ internal nodes $\mathrm{CDF}_e^i\big((x_k^i+x_{k+1}^i)/2\big)=k/n_i$ for $k=1,2,\dots,n_i-1$, and two external nodes $\mathrm{CDF}_e^i(x_1^i-\Delta_d^i)=0$ and $\mathrm{CDF}_e^i(x_{n_i}^i+\Delta_u^i)=1$, where $\Delta_d^i=\min\{x_1^i,\,(x_{16}^i-x_1^i)/30\}$ and $\Delta_u^i=(x_{n_i}^i-x_{n_i-15}^i)/30$ are the halves of the mean inter-sample intervals at the lower and upper ends of the dataset $\chi^i$. This is the most frequent case, when the sample values are positive; the lower external node then never has a negative abscissa, because $x_1^i-\Delta_d^i\ge 0$. If both negative and positive sample values are acceptable, then $\Delta_d^i=(x_{16}^i-x_1^i)/30$ and $\Delta_u^i=(x_{n_i}^i-x_{n_i-15}^i)/30$. If all sample values have to be negative, then $\Delta_d^i=(x_{16}^i-x_1^i)/30$ and $\Delta_u^i=\min\{-x_{n_i}^i,\,(x_{n_i}^i-x_{n_i-15}^i)/30\}$; in that rare case the upper external node never has a positive abscissa, because $x_{n_i}^i+\Delta_u^i\le 0$.

It is convenient to introduce a “before-first” sample $x_0^i=x_1^i-2\Delta_d^i$ and an “after-last” sample $x_{n_i+1}^i=x_{n_i}^i+2\Delta_u^i$. When for some $k=1,2,\dots,n_i$ and some $p>1$ it holds that $x_{k-1}^i<x_k^i=x_{k+1}^i=\dots=x_{k+p}^i<x_{k+p+1}^i$, the initial approximation of $\mathrm{CDF}_e^i(\cdot)$ contains a vertical segment of $p$ nodes. In that case the $p$ nodes on that segment are replaced by a single node in the middle of the vertical segment, $\mathrm{CDF}_e^i(x_k^i)=(k+p/2-1/2)/n_i$. The described two-step procedure [2] results in a strictly increasing function $\mathrm{CDF}_e^i(\cdot)$ on the closed interval $[x_1^i-\Delta_d^i;\,x_{n_i}^i+\Delta_u^i]$. It is therefore possible to introduce $\mathrm{invCDF}_e^i(\cdot)$ with domain $[0;1]$ as the inverse of $\mathrm{CDF}_e^i(\cdot)$ on $[x_1^i-\Delta_d^i;\,x_{n_i}^i+\Delta_u^i]$. The median and the inter-quartile range of the empirical distribution can be estimated from $\mathrm{invCDF}_e^i(\cdot)$, whereas the mean and the standard deviation are easily estimated directly from the dataset $\chi^i$:

mean: $\mathrm{mean}_e^i=\dfrac{1}{n_i}\sum_{k=1}^{n_i}x_k^i$;

median: $\mathrm{med}_e^i=\mathrm{invCDF}_e^i(0.5)$;

standard deviation: $\mathrm{std}_e^i=\sqrt{\dfrac{1}{n_i-1}\sum_{k=1}^{n_i}\big(x_k^i-\mathrm{mean}_e^i\big)^2}$;

inter-quartile range: $\mathrm{iqr}_e^i=\mathrm{invCDF}_e^i(0.75)-\mathrm{invCDF}_e^i(0.25)$.
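The node construction and the empirical characteristics above can be sketched in Python. This is a minimal sketch, assuming a sample of at least 16 untied points: the tie-collapsing step and the negative-data variants of $\Delta_d^i$ and $\Delta_u^i$ are omitted, and the function names are ours, not from a library.

```python
import numpy as np

def empirical_cdf_nodes(x, positive_only=True):
    """Nodes of the piecewise-linear empirical CDF: two external nodes at
    CDF = 0 and CDF = 1 plus (n - 1) internal nodes at the midpoints of
    consecutive sorted samples (ties not handled in this sketch)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    assert n >= 16, "the /30 rule uses the 15 extreme inter-sample gaps"
    d_lo = (x[15] - x[0]) / 30.0    # half mean gap over the 15 lowest intervals
    d_up = (x[-1] - x[-16]) / 30.0  # half mean gap over the 15 highest intervals
    if positive_only:
        d_lo = min(x[0], d_lo)      # keep the lower external node non-negative
    xs = np.concatenate(([x[0] - d_lo], (x[:-1] + x[1:]) / 2.0, [x[-1] + d_up]))
    ps = np.arange(n + 1) / n
    return xs, ps

def inv_cdf(xs, ps, q):
    """Inverse of the piecewise-linear empirical CDF (linear interpolation)."""
    return np.interp(q, ps, xs)
```

With these nodes, `inv_cdf(xs, ps, 0.5)` gives the empirical median and `inv_cdf(xs, ps, 0.75) - inv_cdf(xs, ps, 0.25)` the inter-quartile range, while the mean and standard deviation come straight from the sample.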

The non-zero part of the empirical density $\mathrm{PDF}_e^i(\cdot)$ is determined on the closed interval $[x_1^i-\Delta_d^i;\,x_{n_i}^i+\Delta_u^i]$ as a histogram with bins of equal area (each bin has the same product of density and data span). The number of bins $b_i$ is selected as the minimum of the Scott [9], Sturges [10] and Freedman–Diaconis [11] suggestions: $b_i=\min\{b_i^{Sc},b_i^{St},b_i^{FD}\}$, where $b_i^{Sc}=\mathrm{fl}\big(0.2865\,(x_{n_i}^i-x_1^i)\sqrt[3]{n_i}/\mathrm{std}_e^i\big)$, $b_i^{St}=\mathrm{fl}\big(1+\log_2 n_i\big)$, and $b_i^{FD}=\mathrm{fl}\big(0.5\,(x_{n_i}^i-x_1^i)\sqrt[3]{n_i}/\mathrm{iqr}_e^i\big)$. In the last three formulae, $\mathrm{fl}(y)$ stands for the greatest integer less than or equal to $y$. The lower and upper margins of the $k$-th bin, $m_{d,k}^i$ and $m_{u,k}^i$, are determined as the quantiles $(k-1)/b_i$ and $k/b_i$ respectively: $m_{d,k}^i=\mathrm{invCDF}_e^i\big((k-1)/b_i\big)$ and $m_{u,k}^i=\mathrm{invCDF}_e^i(k/b_i)$. The density in the $k$-th bin is $\mathrm{PDF}_e^i(x)=b_i^{-1}/(m_{u,k}^i-m_{d,k}^i)$. The described procedure [2] results in a histogram in which the relative error of the worst $\mathrm{PDF}_e^i(\cdot)$ estimate is minimal over all possible splittings of the samples into $b_i$ bins. This is so because the PDF estimate of a bin is the probability that the random variable takes a value in that bin, divided by the bin's width. That probability is estimated as the relative frequency of data points falling in the bin at the given dataset. The closer to zero that frequency is, the worse it is estimated, so the worst PDF estimate is at the bin containing the fewest data points. Since in the proposed construction each bin contains an equal number of data points, any other division into the same number of bins would produce a bin with fewer data points, and hence a worse relative error of its PDF estimate.
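The bin-count selection and the equal-area margins can be sketched as follows. A minimal sketch, assuming `inv_cdf` is a callable $q \mapsto \mathrm{invCDF}_e^i(q)$ such as the one built above; the function name and interface are ours.

```python
import numpy as np

def equal_area_bins(x, inv_cdf):
    """b = min(Scott, Sturges, Freedman-Diaconis); bin edges are the
    quantiles k/b of the empirical inverse CDF, so every bin carries
    probability 1/b (equal area)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    span = x[-1] - x[0]
    std = x.std(ddof=1)
    iqr = inv_cdf(0.75) - inv_cdf(0.25)
    b_scott = int(0.2865 * span * n ** (1 / 3) / std)    # fl(...) for positive args
    b_sturges = int(1 + np.log2(n))
    b_fd = int(0.5 * span * n ** (1 / 3) / iqr)
    b = max(1, min(b_scott, b_sturges, b_fd))
    edges = np.array([inv_cdf(k / b) for k in range(b + 1)])
    density = (1.0 / b) / np.diff(edges)   # PDF = (1/b) / bin width in each bin
    return b, edges, density
```

By construction the densities integrate to one: `(density * np.diff(edges)).sum()` equals `1.0`.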

The improper integral $\int_{-\infty}^{x}\mathrm{PDF}_e^i(t)\,dt$ of the density is a smoothed version of $\mathrm{CDF}_e^i(\cdot)$, linearly approximated over $(b_i+1)$ nodes $\big(\mathrm{invCDF}_e^i(k/b_i);\,k/b_i\big)$ for $k=0,1,2,\dots,b_i$.

If the samples are distributed with density $\mathrm{PDF}_j(x,\vec p_j)$, then the likelihood of the dataset $\chi^i$ is $L_j^i(\vec p_j)=\prod_{k=1}^{n_i}\mathrm{PDF}_j(x_k^i,\vec p_j)$. The maximum likelihood estimates (MLEs) of $\vec p_j$ are those $\vec p_j^{\,i}$ which maximize $L_j^i(\vec p_j)$, that is, $\vec p_j^{\,i}=\arg\max_{\vec p_j} L_j^i(\vec p_j)$. The numerical characteristics of the $j$-th theoretical distribution fitted to the dataset $\chi^i$ are calculated as:

mean: $\mathrm{mean}_j^i=\int_{-\infty}^{+\infty}x\,\mathrm{PDF}_j(x,\vec p_j^{\,i})\,dx$;

median: $\mathrm{med}_j^i=\mathrm{invCDF}_j(0.5,\vec p_j^{\,i})$;

mode: $\mathrm{mode}_j^i=\arg\max_x \mathrm{PDF}_j(x,\vec p_j^{\,i})$;

standard deviation: $\mathrm{std}_j^i=\sqrt{\int_{-\infty}^{+\infty}\big(x-\mathrm{mean}_j^i\big)^2\,\mathrm{PDF}_j(x,\vec p_j^{\,i})\,dx}$;

inter-quartile range: $\mathrm{iqr}_j^i=\mathrm{invCDF}_j(0.75,\vec p_j^{\,i})-\mathrm{invCDF}_j(0.25,\vec p_j^{\,i})$.
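For a concrete model these integrals and quantiles can be evaluated numerically. A sketch for a normal distribution with hypothetical parameters $\mu=3$, $\sigma=2$ (chosen only for illustration), using a plain Riemann sum on a truncated support; for the normal the exact answers are known, which makes the sketch easy to check.

```python
import numpy as np
from math import erf, sqrt

mu, sigma = 3.0, 2.0                      # hypothetical fitted parameters
grid = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
dx = grid[1] - grid[0]
pdf = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
cdf = np.array([0.5 * (1 + erf((g - mu) / (sigma * sqrt(2)))) for g in grid])

mean = np.sum(grid * pdf) * dx            # integral of x * PDF
std = np.sqrt(np.sum((grid - mean) ** 2 * pdf) * dx)
median = np.interp(0.5, cdf, grid)        # invCDF(0.5) by inverse interpolation
iqr = np.interp(0.75, cdf, grid) - np.interp(0.25, cdf, grid)
mode = grid[np.argmax(pdf)]               # argmax of the density
```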

The quality of the fit can be assessed using a statistical hypothesis test. The null hypothesis $H_0$ is that $\mathrm{CDF}_e^i(x)$ is equal to $\mathrm{CDF}_j(x,\vec p_j^{\,i})$, which means that the sample $\chi^i$ is drawn from $\mathrm{CDF}_j(x,\vec p_j^{\,i})$. The alternative hypothesis $H_1$ is that $\mathrm{CDF}_e^i(x)$ is different from $\mathrm{CDF}_j(x,\vec p_j^{\,i})$, which means that the fit is not good. The Kuiper statistic $V_j^i$ [12] is a suitable measure for the goodness-of-fit of the theoretical cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i})$ to the dataset $\chi^i$:

$$V_j^i=\max_x\big\{\mathrm{CDF}_e^i(x)-\mathrm{CDF}_j(x,\vec p_j^{\,i})\big\}+\max_x\big\{\mathrm{CDF}_j(x,\vec p_j^{\,i})-\mathrm{CDF}_e^i(x)\big\}\tag{1}$$

The theoretical Kuiper distribution is derived only for the case of two independent staircase distributions, not for a continuous distribution fitted to the data of a staircase one [5]. That is why the distribution of $V$ from (1), if $H_0$ is true, should be estimated by a Monte Carlo procedure. The main idea is that if the dataset $\chi^i=(x_1^i,x_2^i,\dots,x_{n_i}^i)$ is distributed in compliance with the one-dimensional theoretical distribution of type $j$, then its PDF would be very close to its estimate $\mathrm{PDF}_j(x,\vec p_j^{\,i})$, and so each synthetic dataset generated from $\mathrm{PDF}_j(x,\vec p_j^{\,i})$ would produce a Kuiper statistic according to (1) that is close to zero [1].
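The statistic (1) can be sketched directly. A minimal sketch, assuming the empirical CDF is taken in its staircase form and `cdf` is an assumed callable interface for the fitted continuous CDF; evaluating only at the jump points suffices when `cdf` is continuous and monotone.

```python
import numpy as np

def kuiper_statistic(x, cdf):
    """Kuiper statistic V = D+ + D- between the staircase ECDF of the
    sample x and a fitted continuous CDF, checked at every jump point."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    f = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - f)   # ECDF above the fitted CDF
    d_minus = np.max(f - np.arange(0, n) / n)      # fitted CDF above the ECDF
    return d_plus + d_minus
```

Unlike the Kolmogorov–Smirnov statistic, which keeps only the larger of $D^+$ and $D^-$, the Kuiper statistic sums both one-sided deviations, which makes it equally sensitive in the tails and around the median.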

The algorithm of the proposed procedure is the following:

1. Construct the empirical cumulative distribution function $\mathrm{CDF}_e^i(x)$ describing the data in $\chi^i$.

2. Find the MLEs of the parameters of the distribution of type $j$ fitting $\chi^i$ as $\vec p_j^{\,i}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_i}\mathrm{PDF}_j(x_k^i,\vec p_j)\big]$.

3. Build the fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i})$ describing $\chi^i$.

4. Calculate the actual Kuiper statistic $V_j^i$ according to (1).

5. Repeat for $r=1,2,\dots,n^{MC}$ (i.e. use $n^{MC}$ simulation cycles):

a. generate a synthetic dataset $\chi_r^{i,syn}=\{x_{1,r}^{i,syn},x_{2,r}^{i,syn},\dots,x_{n_i,r}^{i,syn}\}$ from the fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i})$; the dataset $\chi_r^{i,syn}$ contains $n_i$ sorted samples ($x_{1,r}^{i,syn}\le x_{2,r}^{i,syn}\le\dots\le x_{n_i,r}^{i,syn}$);

b. construct the synthetic empirical distribution function $\mathrm{CDF}_{e,r}^{i,syn}(x)$ describing the data in $\chi_r^{i,syn}$;

c. find the MLEs of the parameters of the distribution of type $j$ fitting $\chi_r^{i,syn}$ as $\vec p_{j,r}^{\,i,syn}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_i}\mathrm{PDF}_j(x_{k,r}^{i,syn},\vec p_j)\big]$;

d. build the theoretical distribution function $\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i,syn})$ describing $\chi_r^{i,syn}$;

e. estimate the $r$-th instance of the synthetic Kuiper statistic as
$$V_{j,r}^{i,syn}=\max_x\big\{\mathrm{CDF}_{e,r}^{i,syn}(x)-\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i,syn})\big\}+\max_x\big\{\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i,syn})-\mathrm{CDF}_{e,r}^{i,syn}(x)\big\}.$$

The p-value $P_{value,j}^{fit,i}$ of the statistical test (the probability to reject a true hypothesis $H_0$ that the $j$-th type of theoretical distribution fits well to the samples in dataset $\chi^i$) is estimated as the frequency of generating a synthetic Kuiper statistic greater than the actual Kuiper statistic $V_j^i$ from step 4:

$$P_{value,j}^{fit,i}=\frac{1}{n^{MC}}\sum_{\substack{r=1 \\ V_j^i<V_{j,r}^{i,syn}}}^{n^{MC}}1\tag{2}$$

In fact, (2) is the sum of the indicator function of the crisp set defined by all synthetic datasets whose Kuiper statistic is greater than $V_j^i$, normalized by the number of simulation cycles $n^{MC}$.
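Steps 1–5 and formula (2) can be sketched end to end once a concrete model is chosen. A sketch assuming a normal model, whose MLEs are simply the sample mean and the (biased) sample standard deviation; `norm_cdf`, `kuiper` and `mc_fit_pvalue` are our illustrative names, not the chapter's code, and the tie-free staircase ECDF is used in place of the piecewise-linear one.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(t, mu, sigma):
    """Normal CDF, evaluated elementwise over an array of points."""
    return np.array([0.5 * (1 + erf((u - mu) / (sigma * sqrt(2))))
                     for u in np.atleast_1d(t)])

def kuiper(x, cdf):
    """Kuiper statistic (1) between the staircase ECDF of x and cdf."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    f = cdf(x)
    return np.max(np.arange(1, n + 1) / n - f) + np.max(f - np.arange(n) / n)

def mc_fit_pvalue(x, n_mc=200, seed=0):
    """Monte Carlo p-value (2): the frequency of synthetic Kuiper
    statistics exceeding the actual one."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, s = x.mean(), x.std()                           # step 2: normal MLEs
    v_actual = kuiper(x, lambda t: norm_cdf(t, mu, s))  # steps 3-4
    exceed = 0
    for _ in range(n_mc):                               # step 5
        syn = rng.normal(mu, s, n)                      # 5a: synthetic dataset
        m2, s2 = syn.mean(), syn.std()                  # 5c: refit the model
        if kuiper(syn, lambda t: norm_cdf(t, m2, s2)) > v_actual:  # 5e
            exceed += 1
    return exceed / n_mc                                # formula (2)
```

Note that each synthetic dataset is refitted before its Kuiper statistic is computed; skipping the refit would ignore the parameter-estimation noise that $H_0$ testing must account for.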

The performance of each theoretical distribution should be assessed according to its goodness-of-fit measures to the *N* datasets simultaneously. If a given theoretical distribution cannot be fitted even to one of the datasets, then that theoretical distribution has to be discarded from further consideration. The other theoretical distributions have to be ranked according to their ability to describe all datasets. One basic and three auxiliary criteria are useful in the required ranking.

The basic criterion is the minimal p-value of the theoretical distribution fits to the *N* datasets:

$$\min P_{value,j}^{fit}=\min\big\{P_{value,j}^{fit,1},P_{value,j}^{fit,2},\dots,P_{value,j}^{fit,N}\big\},\ \text{for}\ j=1,2,\dots,M\tag{3}$$

The first auxiliary criterion is the average of the p-values of the theoretical distribution fits to the *N* datasets:

$$\mathrm{mean}P_{value,j}^{fit}=\frac{1}{N}\sum_{i=1}^{N}P_{value,j}^{fit,i},\ \text{for}\ j=1,2,\dots,M\tag{4}$$

The second and third auxiliary criteria are the Akaike Information Criterion (AIC) [6] and the Bayesian Information Criterion (BIC) [7], which correct the negative log-likelihood with the number of assessed parameters:

$$AIC_j=-2\sum_{i=1}^{N}\log L_j^i(\vec p_j^{\,i})+2Nn_j^p=-2\sum_{i=1}^{N}\sum_{k=1}^{n_i}\log \mathrm{PDF}_j(x_k^i,\vec p_j^{\,i})+2Nn_j^p\tag{5}$$

$$BIC_j=-2\sum_{i=1}^{N}\log L_j^i(\vec p_j^{\,i})+Nn_j^p\log\Big(\sum_{i=1}^{N}n_i\Big)=-2\sum_{i=1}^{N}\sum_{k=1}^{n_i}\log \mathrm{PDF}_j(x_k^i,\vec p_j^{\,i})+Nn_j^p\log\Big(\sum_{i=1}^{N}n_i\Big)\tag{6}$$

where $n_j^p$ is the number of parameters of the $j$-th distribution type, so $Nn_j^p$ parameters are assessed in total,

for $j=1,2,\dots,M$. The best theoretical distribution type should have maximal values of $\min P_{value,j}^{fit}$ and $\mathrm{mean}P_{value,j}^{fit}$, whereas its values of $AIC_j$ and $BIC_j$ should be minimal. Moreover, the best theoretical distribution type should have $\min P_{value,j}^{fit}>0.05$; otherwise no theoretical distribution from the initial $M$ types fits properly to the $N$ datasets.
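Criteria (5) and (6) reduce to simple arithmetic once the $N$ maximized log-likelihoods are known. A sketch with the usual $2k$ and $k\log n$ penalties, $k=Nn_j^p$; the function name and interface are ours.

```python
import numpy as np

def aic_bic(loglikes, n_params, n_samples):
    """AIC and BIC for one distribution type fitted independently to N
    datasets: `loglikes` holds the N maximized log-likelihoods,
    `n_params` the per-dataset parameter count, `n_samples` the dataset
    sizes; the total parameter count is k = N * n_params."""
    k = len(loglikes) * n_params
    nll2 = -2.0 * float(np.sum(loglikes))      # -2 * total log-likelihood
    aic = nll2 + 2 * k                          # formula (5)
    bic = nll2 + k * np.log(sum(n_samples))     # formula (6)
    return aic, bic
```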

That solves the problem of selecting the best theoretical distribution type for fitting the samples in the $N$ datasets.

### 2.2. Task 2 – Theoretical solution

The second problem is the estimation of the statistical significance of the difference between two datasets. It is equivalent to calculating the $p$-value of a statistical hypothesis test, where the null hypothesis $H_0$ is that the samples of $\chi^{i_1}$ and $\chi^{i_2}$ are drawn from the same underlying continuous population, and the alternative hypothesis $H_1$ is that they are drawn from different underlying continuous populations. The two-sample asymptotic Kuiper test is designed exactly for this problem, because $\chi^{i_1}$ and $\chi^{i_2}$ are independently drawn datasets. That is why “staircase” empirical cumulative distribution functions [13] are built from the two datasets $\chi^{i_1}$ and $\chi^{i_2}$:

$$\mathrm{CDF}_{sce}^i(x)=\frac{1}{n_i}\sum_{\substack{k=1 \\ x_k^i\le x}}^{n_i}1,\ \text{for}\ i\in\{i_1,i_2\}\tag{7}$$

The “staircase” empirical $\mathrm{CDF}_{sce}^i(\cdot)$ is a discontinuous version of the already defined empirical $\mathrm{CDF}_e^i(\cdot)$. The Kuiper statistic $V^{i_1,i_2}$ [12] is a measure for the closeness of the two “staircase” empirical cumulative distribution functions $\mathrm{CDF}_{sce}^{i_1}(\cdot)$ and $\mathrm{CDF}_{sce}^{i_2}(\cdot)$:

$$V^{i_1,i_2}=\max_x\big\{\mathrm{CDF}_{sce}^{i_1}(x)-\mathrm{CDF}_{sce}^{i_2}(x)\big\}+\max_x\big\{\mathrm{CDF}_{sce}^{i_2}(x)-\mathrm{CDF}_{sce}^{i_1}(x)\big\}\tag{8}$$

The distribution of the test statistic $V^{i_1,i_2}$ is known, and the $p$-value of the two-tailed statistical test with null hypothesis $H_0$, that the samples in $\chi^{i_1}$ and in $\chi^{i_2}$ result in the same “staircase” empirical cumulative distribution function, is estimated as a series [5] according to formulae (9) and (10).

The algorithm for the theoretical solution of Task 2 is straightforward:

1. Construct the “staircase” empirical cumulative distribution function describing the data in $\chi^{i_1}$ as $\mathrm{CDF}_{sce}^{i_1}(x)=\frac{1}{n_{i_1}}\sum_{\substack{k=1 \\ x_k^{i_1}\le x}}^{n_{i_1}}1$.

2. Construct the “staircase” empirical cumulative distribution function describing the data in $\chi^{i_2}$ as $\mathrm{CDF}_{sce}^{i_2}(x)=\frac{1}{n_{i_2}}\sum_{\substack{k=1 \\ x_k^{i_2}\le x}}^{n_{i_2}}1$.

3. Calculate the actual Kuiper statistic $V^{i_1,i_2}$ according to (8).

4. The $p$-value of the statistical test (the probability to reject a true null hypothesis $H_0$) is estimated as:

$$P_{value,e}^{i_1,i_2}=2\sum_{j=1}^{+\infty}\big(4j^2\lambda^2-1\big)e^{-2j^2\lambda^2}\tag{9}$$

where

$$\lambda=V^{i_1,i_2}\left(\sqrt{\frac{n_{i_1}n_{i_2}}{n_{i_1}+n_{i_2}}}+0.155+0.24\sqrt{\frac{n_{i_1}+n_{i_2}}{n_{i_1}n_{i_2}}}\right)\tag{10}$$

If $P_{value,e}^{i_1,i_2}<0.05$, the hypothesis $H_0$ is rejected.
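Steps 1–4 can be sketched directly. A minimal sketch: the staircase ECDFs are compared just before and just after every jump, the infinite series (9) is truncated at 100 terms, and very small $\lambda$ is mapped to $p=1$ (the 0.4 cut-off is our numerical guard, not from the chapter).

```python
import numpy as np

def kuiper_two_sample(x1, x2):
    """Two-sample Kuiper statistic (8) between the staircase ECDFs and
    its asymptotic p-value from the series (9) with lambda from (10)."""
    x1 = np.sort(np.asarray(x1, dtype=float))
    x2 = np.sort(np.asarray(x2, dtype=float))
    n1, n2 = len(x1), len(x2)
    xs = np.concatenate((x1, x2))
    diffs = []
    for side in ('left', 'right'):    # ECDF values just before / after each jump
        c1 = np.searchsorted(x1, xs, side=side) / n1
        c2 = np.searchsorted(x2, xs, side=side) / n2
        diffs.append(c1 - c2)
    d = np.concatenate(diffs)
    v = d.max() + (-d).max()          # V = D+ + D-
    ne = n1 * n2 / (n1 + n2)
    lam = v * (np.sqrt(ne) + 0.155 + 0.24 / np.sqrt(ne))  # formula (10)
    if lam < 0.4:                     # the series is degenerate for tiny lambda
        return v, 1.0
    j = np.arange(1, 101)             # truncated version of the series (9)
    p = 2.0 * np.sum((4 * j**2 * lam**2 - 1) * np.exp(-2 * j**2 * lam**2))
    return v, float(min(max(p, 0.0), 1.0))
```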

### 2.3. Task 3 – Theoretical solution

The last problem is to test the statistical significance of the difference between two fitted distributions of the same type. Most often this will be the best theoretical distribution type identified in the first problem, but the test is valid for any type. The problem is equivalent to calculating the $p$-value of a statistical hypothesis test, where the null hypothesis $H_0$ is that the theoretical distributions $\mathrm{CDF}_j(x,\vec p_j^{\,i_1})$ and $\mathrm{CDF}_j(x,\vec p_j^{\,i_2})$ fitted to the datasets $\chi^{i_1}$ and $\chi^{i_2}$ are identical, and the alternative hypothesis $H_1$ is that $\mathrm{CDF}_j(x,\vec p_j^{\,i_1})$ and $\mathrm{CDF}_j(x,\vec p_j^{\,i_2})$ are not identical.

The test statistic is again the Kuiper statistic $V_j^{i_1,i_2}$:

$$V_j^{i_1,i_2}=\max_x\big\{\mathrm{CDF}_j(x,\vec p_j^{\,i_1})-\mathrm{CDF}_j(x,\vec p_j^{\,i_2})\big\}+\max_x\big\{\mathrm{CDF}_j(x,\vec p_j^{\,i_2})-\mathrm{CDF}_j(x,\vec p_j^{\,i_1})\big\}\tag{11}$$

As already mentioned, the theoretical Kuiper distribution is derived only for the case of two independent staircase distributions, not for the case of two independent continuous cumulative distribution functions. That is why the distribution of $V$ from (11), if $H_0$ is true, should be estimated by a Monte Carlo procedure. The main idea is that if $H_0$ is true, then $\mathrm{CDF}_j(x,\vec p_j^{\,i_1})$ and $\mathrm{CDF}_j(x,\vec p_j^{\,i_2})$ should be identical to the merged distribution $\mathrm{CDF}_j(x,\vec p_j^{\,i_1+i_2})$, fitted to the merged dataset $\chi^{i_1+i_2}$ formed by merging the samples of $\chi^{i_1}$ and $\chi^{i_2}$ [1].

The algorithm of the proposed procedure is the following:

1. Find the MLEs of the parameters of the distribution of type $j$ fitting $\chi^{i_1}$ as $\vec p_j^{\,i_1}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_{i_1}}\mathrm{PDF}_j(x_k^{i_1},\vec p_j)\big]$.

2. Build the fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i_1})$ describing $\chi^{i_1}$.

3. Find the MLEs of the parameters of the distribution of type $j$ fitting $\chi^{i_2}$ as $\vec p_j^{\,i_2}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_{i_2}}\mathrm{PDF}_j(x_k^{i_2},\vec p_j)\big]$.

4. Build the fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i_2})$ describing $\chi^{i_2}$.

5. Calculate the actual Kuiper statistic $V_j^{i_1,i_2}$ according to (11).

6. Merge the samples of $\chi^{i_1}$ and $\chi^{i_2}$ to form the merged dataset $\chi^{i_1+i_2}$.

7. Find the MLEs of the parameters of the distribution of type $j$ fitting $\chi^{i_1+i_2}$ as $\vec p_j^{\,i_1+i_2}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_{i_1}}\mathrm{PDF}_j(x_k^{i_1},\vec p_j)\prod_{k=1}^{n_{i_2}}\mathrm{PDF}_j(x_k^{i_2},\vec p_j)\big]$.

8. Fit the merged cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i_1+i_2})$ to $\chi^{i_1+i_2}$.

9. Repeat for $r=1,2,\dots,n^{MC}$ (i.e. use $n^{MC}$ simulation cycles):

a. generate a synthetic dataset $\chi_r^{i_1,syn}=\{x_{1,r}^{i_1,syn},x_{2,r}^{i_1,syn},\dots,x_{n_{i_1},r}^{i_1,syn}\}$ from the merged fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i_1+i_2})$;

b. find the MLEs of the parameters of the distribution of type $j$ fitting $\chi_r^{i_1,syn}$ as $\vec p_{j,r}^{\,i_1,syn}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_{i_1}}\mathrm{PDF}_j(x_{k,r}^{i_1,syn},\vec p_j)\big]$;

c. build the theoretical distribution function $\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_1,syn})$ describing $\chi_r^{i_1,syn}$;

d. generate a synthetic dataset $\chi_r^{i_2,syn}=\{x_{1,r}^{i_2,syn},x_{2,r}^{i_2,syn},\dots,x_{n_{i_2},r}^{i_2,syn}\}$ from the merged fitted cumulative distribution function $\mathrm{CDF}_j(x,\vec p_j^{\,i_1+i_2})$;

e. find the MLEs of the parameters of the distribution of type $j$ fitting $\chi_r^{i_2,syn}$ as $\vec p_{j,r}^{\,i_2,syn}=\arg\max_{\vec p_j}\big[\prod_{k=1}^{n_{i_2}}\mathrm{PDF}_j(x_{k,r}^{i_2,syn},\vec p_j)\big]$;

f. build the theoretical distribution function $\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_2,syn})$ describing $\chi_r^{i_2,syn}$;

g. estimate the $r$-th instance of the synthetic Kuiper statistic as:

$$V_{j,r}^{i_1,i_2,syn}=\max_x\big\{\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_1,syn})-\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_2,syn})\big\}+\max_x\big\{\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_2,syn})-\mathrm{CDF}_{j,r}^{syn}(x,\vec p_{j,r}^{\,i_1,syn})\big\}.$$

The p-value $P_{value,j}^{i_1,i_2}$ of the statistical test (the probability to reject a true hypothesis $H_0$ that the fitted theoretical distribution functions $\mathrm{CDF}_j(x,\vec p_j^{\,i_1})$ and $\mathrm{CDF}_j(x,\vec p_j^{\,i_2})$ are identical) is estimated as the frequency of generating a synthetic Kuiper statistic greater than the actual Kuiper statistic $V_j^{i_1,i_2}$ from step 5:

$$P_{value,j}^{i_1,i_2}=\frac{1}{n^{MC}}\sum_{\substack{r=1 \\ V_j^{i_1,i_2}<V_{j,r}^{i_1,i_2,syn}}}^{n^{MC}}1\tag{12}$$

Similar to (2), formula (12) is the sum of the indicator function of the crisp set defined by all synthetic dataset pairs whose Kuiper statistic is greater than $V_j^{i_1,i_2}$, normalized by $n^{MC}$.

If $P_{value,j}^{i_1,i_2}<0.05$, the hypothesis $H_0$ is rejected.
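The whole procedure can be sketched for a concrete model. A sketch assuming a normal model (MLEs are the sample mean and biased standard deviation) and approximating each $\max_x$ in (11) on a finite grid covering the merged support; all names and the grid choice are our illustration, not the chapter's code.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(t, mu, sigma):
    """Normal CDF, evaluated elementwise over an array of points."""
    return np.array([0.5 * (1 + erf((u - mu) / (sigma * sqrt(2)))) for u in t])

def kuiper_between(mu1, s1, mu2, s2, grid):
    """Kuiper statistic (11) between two normal CDFs, scanned on a grid."""
    d = norm_cdf(grid, mu1, s1) - norm_cdf(grid, mu2, s2)
    return d.max() + (-d).max()

def mc_identity_pvalue(x1, x2, n_mc=100, seed=0):
    """Monte Carlo p-value (12) that two fitted normal models are
    identical: synthetic pairs are redrawn from the merged fit and
    refitted, mirroring steps 6-9."""
    rng = np.random.default_rng(seed)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    merged = np.concatenate((x1, x2))
    grid = np.linspace(merged.min() - 3 * merged.std(),
                       merged.max() + 3 * merged.std(), 201)
    v = kuiper_between(x1.mean(), x1.std(),
                       x2.mean(), x2.std(), grid)   # steps 1-5
    mm, sm = merged.mean(), merged.std()            # steps 6-8: merged fit
    exceed = 0
    for _ in range(n_mc):                           # step 9
        a = rng.normal(mm, sm, len(x1))             # 9a: synthetic pair member 1
        b = rng.normal(mm, sm, len(x2))             # 9d: synthetic pair member 2
        if kuiper_between(a.mean(), a.std(),
                          b.mean(), b.std(), grid) > v:  # 9g vs. step 5
            exceed += 1
    return exceed / n_mc                            # formula (12)
```

Drawing both synthetic datasets from the single merged fit is what encodes $H_0$: under the null, refitting each half should reproduce nearly identical CDFs, so large synthetic Kuiper statistics become rare.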