MLL, AIC, and BIC values for the Danish fire data set.
In the first part of this chapter, we present a sample of the best known and most used classical size distributions with their main statistical properties. In the second part, we introduce the concept of composite models and based on the size distributions of the first part, we describe those which already exist in the literature. In the last part of this chapter, we apply the described statistical size distributions and some of the composite models to two real data examples and compare their goodness-of-fit.
- size distributions
- composite models
In statistical modeling, the continuous aim is to look for the probability law, which best describes the observations arising from a given field and which should represent the underlying data-generating process. The obtained probability distributions should possess desirable properties such as the flexibility of modeling different shapes and remain of a tractable form. This research avenue was initiated in the nineteenth century by famous mathematicians as Adolphe Quetelet, Sir Francis Galton, or Vilfredo Pareto, and since then it has never ceased. Nowadays, it still remains among the highly treated topics in statistics as shown by the large quantity of scientific papers recently published on the subject (see for instance the review papers  and ). The actual appeal for this topic is easily explained by the availability of large data sets in various scientific domains, making it essential and necessary to do further research on this subject.
In this chapter, we concentrate on probability distributions that analyze size-type data. By size distributions, we mean probability laws designed to model data that only take positive values. Positive observations appear naturally in different fields: survival analysis [3, 4], environmental science , network traffic modeling , economics [7, 8], hydrology , and actuarial science . Given the range of various domains of application, there exists a plethora of different size distributions and it is still a very active research area [11, 12].
The structure of this chapter is as follows. In Section 2, we review the most used and well-known size distributions, and state their main statistical properties. Section 3 introduces the notion of composite models and gives a small review of the composite models in the literature, based on the size distributions depicted in Section 2. In Section 4, we apply the described size distributions of Section 2 and some of the composite models of Section 3 to two real data sets, namely, an insurance data set and an Internet traffic data set. Finally, Section 5 concludes.
2. Review of size distributions
We describe here a sample of the best known and most used size distributions. We will state their probability density function (p.d.f) and their cumulative density function (c.d.f), show their moments and their quantile function, and give the estimators obtained via maximum likelihood estimation. More specifically, we take a closer look at the lognormal, Pareto, generalized Lomax, and generalized extreme value distributions.
2.1. The lognormal distribution
The English statistician Sir Francis Galton stated that in some situations it was preferable to measure the location of a distribution with the geometric mean instead of the arithmetic mean . Indeed, laws of nature often behave in multiplicative ways; thus, the geometric mean becomes more appropriate as a measure of central tendency than the arithmetic mean. As a reply to Galton’s request, the Scottish physician Donald McAlister established in 1879 a theory of the exponentiated (or multiplicative) normal distribution , this became to be known as the lognormal distribution.
Let X be a positive random variable (r.v.) such that log X = D Y is normally distributed with parameters μ ∈ and σ > 0. The r.v. X then has a lognormal distribution, X ~LN(μ, σ2), with probability density function (p.d.f.)
The location parameter μ ∈ and the scale parameter σ > 0 are characteristic for the r.v. logX. However, by the exponential transformation, the geometric mean eμ becomes a scale parameter, as depicted in Figure 1, and the multiplicative standard deviation eσ appears as shape parameter impacting the skewness (see Figure 2).
If random variability enjoys multiplicative effects, as stated by Galton, then a lognormal distribution must be the result. This establishes the basis of the multiplicative central limit theorem, which asserts that the geometric means of nonlognormal random variables are approximated by a lognormal distribution.
The cumulative distribution function (c.d.f.) of the lognormal law is related to the c.d.f. of the normal distribution:
where Φ(.) represents the c.d.f. of a standard normal distribution.
The moments of order r are conveniently expressed as Hence, the mean is given by , and the variance by The lognormal is a unimodal distribution and the unique mode is reached at By comparing the mean and the mode, we note that for a fixed μ, an increasing σ shifts the mode toward zero while the mean increases. The quantile function is defined as for 0 < y < 1 and where Φ-1(.) denotes the quantile function of a standard normal distribution.
Thanks to its relationship to the normal distribution, the likelihood function is given by
and hence the log-likelihood function can be expressed as
The maximum likelihood estimators for the mean and the scale are given by and respectively.
The lognormal distribution is widely used to describe natural phenomena. In finance, the Black-Scholes model, which is a mathematical model containing derivative instruments, assumes the underlying derivative price to have a lognormal distribution . In economics, income data are often modeled by a lognormal distribution , which can be easily explained as follows: a very low percentage of earners have very low income. To gain averaged revenue is frequent, whereas an elevated income is rare. In actuarial sciences, the law is assumed to fit well some types of insurance losses [17, 18]. In 1931, the French economist and engineer Robert Pierre Louis Gibrat stated that the firm size follows a lognormal distribution as its proportional growth rate is independent of its absolute size. Other applications can be found in biology [19, 20] or in linguistics to model the number of words in a sentence .
2.2. The Pareto distribution
The Italian economist and engineer Vilfredo Pareto observed in 1896 that in many populations the power law cx-α, for some constant c > 0 and some exponent α > 0, was an appropriate approximation of the number of individuals with income exceeding a given threshold x0 (see for instance [22, 23]). These power laws assume that small values of x are very frequent, while large occurrences are extremely rare. Their form implies that all power laws with a particular scaling exponent are equivalent up to constant factors since each is simply a scaled version of the others. This produces the linear relationship when logarithms are taken of both f(x) and x, which denotes the signature of power laws. Such distributions are known as Pareto-type distributions.
The p.d.f. of a r.v. X having a Pareto (type I) distribution with parameters α > 0 and x0 > 0 is given by
The location parameter x0 represents the lower bound of the data set and the shape parameter α is called the tail index or as well the Pareto index, and hence regulates the tail as can be seen in Figure 3. Note that a decreasing value of α implies a heavier tail.
The c.d.f. of the Pareto law is given by
For α > r, the r-th moment of the Pareto distribution is given by The mean and the variance are then, respectively, for α > 1 and for α > 2. The quantile function is expressed as for 0 < y < 1. Being an unimodal law, the Pareto distribution reaches its peak at xmode = x0. As x0 represents the minimum value of x, its estimation is straightforward: . The likelihood function is given by
and to estimate the parameter α, we maximize the following log-likelihood function
which yields the maximum likelihood estimator . Let us note that the maximum likelihood estimator of the tail index α corresponds to the popular Hill estimator , which is an estimator for the extreme value index in the extreme value theory. For a review on the Hill estimator, we refer the interested reader to reference . Let us note that often the focus lies more on the power law probability distribution, which is a distribution whose density has approximately the form L(x)x-α, where α > 1 and L(x) is a slowly varying function. In many situations, it is convenient to assume a lower bound x0 from which the law holds. Combining those two cases yields the Pareto-type distributions, or as well known in extreme value theory as distributions with regularly varying tails.
A generalization of the Pareto law is the so-called generalized Pareto distribution, and it regroups the Pareto type I, II, III, and IV distributions. The Pareto type IV contains the other types as special cases and hence as well other size distributions belonging to the different types as for instance the Lomax distributions . This latter distribution belongs to the Pareto type II, and its p.d.f. is given by
with shape parameter α > 0 and scale parameter k > 0. It can be interpreted as a shifted Pareto type I distribution.
A generalization of the Pareto type I distribution is the Stoppa distribution , which comes from a power transformation of the Pareto c.d.f. and yields the following p.d.f.
with shape parameters α > 0, δ > 0 and location parameter x0 > 0. If δ = 1, we get the Pareto type I distribution. However, if the shape parameter δ > 1, the Stoppa distribution presents a heavier tail than the classical Pareto law.
The Pareto distribution is often used to model fire losses in actuarial sciences [28, 29] as well as in reinsurance to approximate large losses. Originally, it was used to describe the income distribution and the allocation of wealth , but nowadays it is also used to model, for instance, areas burnt in forest fires or the file sizes of Internet traffic data . Note that, in general, in empirical applications, the Pareto distribution does not fit for all the values but rather is used to fit their upper tail, i.e., large values. Hence, in order to fit a distribution to all the values, one often uses a composite model (see Section 3) which combines two distributions where one of both is the Pareto law.
2.3. The generalized Lomax distribution
The generalized Lomax (GL) distribution, also known as the exponentiated Lomax distribution, was introduced by Abdul-Moniem and Abdel-Hameed in 2012  by powering the c.d.f. of the Lomax distribution to a positive real number.
The p.d.f. of a r.v. X following a generalized Lomax distribution with parameters a > 0, b > 0, and k > 0 corresponds to
The shape parameter a regulates the heaviness of the tail, as can be seen in Figure 4 and the shape parameter b controls the skewness (see Figure 5). The parameter k is a scale parameter as depicted in Figure 6.
The c.d.f. is expressed as
The moments of order r are given by yielding for the mean andfor the variance. The inverse c.d.f. is given by , for 0 < y < 1, and the unique mode is reached at
The likelihood function is given by
and hence the following log-likelihood function is obtained
The calculated score functions are expressed by
which have to be solved numerically by equating them to 0 in order to find the estimated parameters.
The Lomax distribution is used to model income data, wealth allocation, and actuarial claim sizes . The GL distribution is used to measure the breaking stress of carbon fibers , the survival times of patients getting chemotherapy treatment , and the number of successive failure of the air-conditioning system in airplanes .
2.4. The generalized extreme value distribution
The generalized extreme value (GEV) distribution is well known in extreme value theory as it combines the Gumbel, Fréchet, and Weibull distributions, which are also known as type I, II, and III extreme value distributions. The GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables and this result arises from the central limit theorem of Fisher and Tippett . Therefore, the GEV distribution is also known as Fisher-Tippett distribution in extreme value theory.
The GEV distribution has p.d.f.
if, with location parameter μ ∈ (see Figure 7), scale parameter σ > 0 (see Figure 8), and shape parameter k ∈ , which governs the shape and the heaviness of the tail of the distribution, as can be seen in Figure 9.
Its c.d.f. is given by
if . The mean is defined as and the variance as both expressed in terms of the Gamma function. The quantile function is given by for 0 < y < 1 and the unique mode is reached at Depending on the value of the parameter k, the GEV reduces to one of the following special cases: If k = 0, we obtain the Gumbel distribution, if k > 0, we get the Fréchet distribution, and if k < 0, the Weibull distribution is obtained. The parameters of the GEV distributions are estimated using the maximum likelihood approach.
We will now focus more closely on one of the three GEV distributions, namely, the Weibull distribution, which got its name from the Swedish engineer and scientist Waloddi Weibull, who analyzed it in detail in 1951. We take a look at this law as it is often used for size-type data and it is considered as an alternative to the lognormal distribution for the construction of composite models (see Section 3). The Weibull distribution belongs to power laws with an exponential cut-off; this means it is a power law multiplied by an exponential function. In these distributions, the exponential decay term overpowers the power law behavior for very large values.
The p.d.f. of the Weibull distribution is given by
with shape parameter τ > 0 governing the heaviness of the tail and scale parameter σ > 0.
The distribution has c.d.f.
The quantile function is given by , for 0 < y < 1. The r-th moment is given by Hence, the expectation and the variance are expressed as and , respectively. The Weibull distribution is unimodal and for τ > 1 it reaches the mode at and for τ = 1 the mode is reached at 0.
The parameters of the Weibull distribution are estimated via the maximum likelihood method. The corresponding likelihood and log-likelihood functions are given respectively by
The maximum likelihood estimator for the scale parameter σ, given τ, is and the maximum likelihood estimator for the shape parameter τ is given by an implicit function which has to be solved numerically:
In risk management, finance, and insurance, the risk measure “Value at Risk” is assessed by considering the GEV distribution [35, 36]. They are as well used in hydrology [37, 38], telecommunications , and meteorology . In material sciences, the Weibull is widely used thanks to its flexibility . Other examples include wind speed distributions , forecasting technological change , the size of reinsurance claims , hydrology , and areas burnt in forest fires .
3. Composite models
Given the wealth of distinct size distributions, as can be seen from the small sample of size distributions described in the previous section, the practitioner is often confronted to the following question: Which size distribution shall he/she use in which situation? The variation in the shapes, size distributions can take, for instance between the Pareto distribution and the lognormal distribution, renders the choice very complicated in practice.
For example, insurance companies face sometimes losses, which emerge from a combination of moderate and large claims. In order to model those large losses, the Pareto distribution seems to be the size distribution favored by practitioners. However, when losses consist of smaller values with high frequencies and larger losses with low frequencies, the lognormal or the Weibull distributions are preferred . Nevertheless, no classical size distribution provides an acceptable fit for both small and large losses. On one hand, the Pareto fits well the tail, but on the other hand, lognormal and Weibull distributions produce an overall good fit, but fit badly the tail.
A solution to this dilemma comes from the composite parametric models introduced in 2005 by Cooray and Ananda . The idea of the composite models is to join together two weighted distributions at a given threshold value. In statistical terms, let X be a r.v. and denote by f1(.) the p.d.f. of the first distribution and by f2(.) the p.d.f. of the second distribution. Let F1(.) and F2(.) be the corresponding c.d.f., respectively. Scollnik  noticed that the p.d.f. of a composite model can then be expressed as
where c is a normalization constant in [0,1], θ represents the threshold value, for -∞ < x ≤ θ, and for θ < x < ∞. In our setting, the considered composite models piece together two different size distributions with different shapes and tail-weights at a specific threshold. As size distributions are only for positive values, the p.d.f. of a composite model is rewritten as
where 0 ≤ c ≤ 1. The composite model can as well be interpreted as a two-component mixture model with mixing weights c and (1 − c). Hence, it can be seen as a convex sum of two density functions f(x) = c f1*(x) + (1 − c) f2*(x), as noted in .
As we have a threshold that cuts the composite model distribution into two, from a mathematical point of view, we need continuity and differentiability conditions at the threshold to yield a smooth density function. In order to make f (x) continuous, the following condition f (θ-) = f (θ+) is imposed and yields
The differential condition at the threshold value is given by f '(θ-) = f '(θ+)and yields
If we combine the two results for the normalization constant c, we obtain the additional restriction for θ, i.e., . Let us remark that in reference  they use a mode-matching procedure instead and state that it gives much simpler derivation of the model and allows for an easier implementation with any distribution which has a mode that has a closed form expression. Instead of having as threshold value θ, they use the modal value xm. Denote by xm1 and xm2 the modes of the distributions used by the first and second components of the composite model, then the mode-matching conditions are xm1 = xm2 and f*(xm1) = f*(xm2). The latter implies the continuity condition, and the former equality allows dropping the labels 1 and 2, which yields the following condition
Remark that the derivative at the mode is 0, hence the differentiability condition is satisfied.
The c.d.f. of a composite model of size distributions is given by
The moments of the r-th order can be expressed using this formula
Statistical inference for composite models is done using the classical maximum likelihood (ML) estimation approach. The ML estimation for composite models was first presented in  and as well in . In order to apply the ML approach, we have to know the integer value m such that the unknown threshold parameter θ is in between the m-th and m + 1-th observation. If we assume somehow that we know the value of the integer m, we would be able to write out explicitly the likelihood function. However, unfortunately, we do not know the exact value of m and as m changes, the ML estimation changes. Therefore, the following ML estimation algorithm was proposed where we have s parameters for i = 1, … , s. In a first step, for each integer m = 1, … , n - 1 we estimate the parameters as solution of the following ML system
If the inequality xm ≤ ≤ xm+1 holds, then the ML estimators can be denoted as and for i = 1, … , s. However, a second step is needed in case the first step does not provide any satisfying result meaning that we are either in one of the following two settings m = n or m = 0. This implies that the use of f1 and f2 are recommended for the likelihood function, respectively. For the ML procedure, one needs to check n − 1 intervals. Thus, the computing time strongly depends on the magnitude of n. For large n this leads to a complex system of equations that must be solved numerically.
In reference , the authors propose an alternative algorithm based on quantiles and a moment matching approach. In a first step, let us denote by q1 and q3 the first and third empirical quartiles of the data sample. We assume that q1 ≤ θ ≤ q3. Then we use the method of moments to match the first s − 1 empirical moments with their theoretical counterparts, and we add two more equations from matching two quartiles
If no result is obtained, we move to a second step where we assume that the first and third quartiles are smaller than the threshold θ, and proceed like in the first step except using now the following two quartiles’ equations
If we still have no solution, we finally assume that the first and third quartiles are greater than θ and proceed again in a similar fashion as in the first step with the two equations
Let us remark that those equations have to be solved numerically. Note that once we have a solution from this quantile and moment matching procedure, we can use the ML approach explained above to improve the result as now we have some a priori information on the parameter θ and hence on the integer m.
In general, in the area of size distributions, composite models comprise a lognormal or Weibull distribution up to a given threshold value and some form of the Pareto distribution thereafter. The obtained models are close in shape to the lognormal or Weibull law but with a thicker tail due to the Pareto distribution, see Figure 10 and Figure 11.
This research area for size distributions was initiated by Cooray and Ananda in 2005 , who proposed the composite lognormal-Pareto model. They suggested that this composite model may be better suited for insurers when confronted to smaller losses with high frequencies as well as for larger values with lower frequencies. The lognormal-Pareto composite model introduced in reference  has been further enhanced by Scollnik . In that paper, the author noticed that the two-component composite model is very restrictive since it has fixed and a priori known mixing weights. Hence, he improved the model by using unrestricted mixing weights as coefficients in each component. In a similar way, the article  improves the composite Weibull-Pareto model proposed by reference . Those are the composite models that will be described in more detail in the sequel. The papers  and  consider beside the classical Pareto distribution as well the Pareto type II distribution, known also as the Lomax distribution, as an alternative above the threshold value. In 2013, Teodorescu and Vernic  replace the lognormal distribution by any arbitrary continuous distribution, and they analyze in detail the composite Weibull-Pareto and the composite Gamma-Pareto models, and use as well the Lomax distribution as an alternative to the Pareto distribution above the threshold point. The same authors suggested already the composite exponential-Pareto model . More recently, reference  proposes a composite model based on the Stoppa distribution , which is a generalization of the Pareto law. More precisely, they propose the lognormal-Stoppa and Weibull-Stoppa composite models.
Let us now take a closer look at the composite lognormal-Pareto and Weibull-Pareto models. Given the general formulas above we can write the density for the composite lognormal-Pareto as
with 0 ≤ c ≤ 1 and Φ(.) denoting the c.d.f. of a standard normal distribution. In a similar way, the p.d.f. for the composite Weibull-Pareto can be written as
with 0 ≤ c ≤ 1.
By verifying the continuity and differentiability conditions at the threshold point θ, we obtain for the composite lognormal-Pareto model:
These conditions guarantee that the p.d.f. of the composite lognormal-Pareto is continuous and smooth at the threshold value θ. The continuity and differentiability conditions at θ, for the composite Weibull-Pareto, yield:
These conditions guarantee the continuity and smoothness of the p.d.f. of the composite Weibull-Pareto at the threshold point θ.
The c.d.f. of the composite lognormal-Pareto and Weibull-Pareto is given, respectively, by
Finally, the moments of order r of the composite lognormal-Pareto and Weibull-Pareto are given by
for τ > r, respectively.
To estimate the composite lognormal-Pareto and the composite Weibull-Pareto models, the algorithms described above are used.
In this section, we focus on two applications to real data sets, one from actuarial sciences, dealing with fire losses and one on Internet traffic data. We will analyze these two data sets with the size distributions seen in Section 2 and the two composite models, namely, the lognormal-Pareto and the Weibull-Pareto, seen in Section 3. In order to compare the distributions, we used the following three criteria:
The maximum log-likelihood (MLL) value: the larger the value, the better the fit of the distribution to the data set.
The Akaike information criterion (AIC):where p represents the number of parameters to estimate. This criterion represents a measure of the relative quality of a distribution given a set of laws. The distribution with the lowest AIC value is preferred.
The Bayesian information criterion (BIC):where n represents the length of the data set and p the number of parameters to estimate. This criterion is used to choose a distribution among a finite set of laws. The distribution with the lowest BIC is preferred.
The AIC and BIC give a trade-off between a reward for a good goodness-of-fit performance and a penalty for an increasing number of parameters to estimate. The BIC tends to favor more parsimonious models than does the AIC.
We carried out the calculations with Wolfram Mathematica 10. To calculate the MLL, AIC, and BIC values for the size distributions of Section 2, we used the function NMaximize with numerical maximization algorithm Random Search method enhanced with the option InteriorPoint. For the composite lognormal-Pareto and the composite Weibull-Pareto models, we used the estimation algorithms described in Section 3.
4.1. Danish fire losses
In this example, we analyze a classical insurance data set. This is the set of Danish data on 2492 fire insurance losses in Danish Krone (DK) from the years 1980 to 1990 inclusive. The data set can be found in the “SMPracticals” add-on package for R, available from the CRAN website cran.r-project.org.
|Lognormal||= 0.671854||= 0.732317|
|Pareto||= 0.545817||= 0.313404|
|GL||= 2.01251||= 435198||= 0.00227572|
|GEV||= 1.42575||= 0.712043||= 0.545094|
|Lognormal-Pareto||= 1.43633||= 1.38513|
|Weibull-Pareto||= 4.43613||= 1.46597|
With a MLL value of -3877.84 and only two parameters, yielding the values AIC = 7759.68 and BIC = 7771.32, the lognormal-Pareto model provides a better fit than the other models for the given data set. A visual conclusion of the fit can be seen in Figure 12.
As the data present a humped shape behavior for the lower values and tail behavior for the upper values; this example justifies the use and the necessity of the composite lognormal-Pareto model.
This data set has also been analyzed in reference  where the composite lognormal-Pareto model was introduced and reference  applied as well the Weibull-Pareto model to this data set. The results we obtain above coincide with their results.
4.2. Internet traffic data
In the second empirical illustration, we analyze Internet traffic data, which have already been analyzed from a Bayesian point of view in references  and . This data set consists of 3143 transferred bytes/second within consecutive seconds.
Based on the MLL, the AIC, and BIC values represented in Table 3, we conclude that among the considered laws, the lognormal distribution performs the best fit, closely followed by the GL and the GEV distributions. The two considered composite models do not provide good fits for this example. The estimated values for the fitted densities are given in Table 4.
|Lognormal||= 11.6518||= 0.62067|
|Pareto||= 0.353628||= 6795|
|GL||= 13.6735||= 4.08831||= 808429|
|GEV||= 94465||= 54467.5||= 0.204602|
|Lognormal-Pareto||= 1.1.05077||= 85064.5|
|Weibull-Pareto||= 1.12043||= 79366.3|
Figure 13 provides a visual proof of the goodness-of-fit of the lognormal distribution.
To sum up, we review in this chapter the notion of size distributions by presenting the best known and most used ones. We further describe the general concept of composite models based on size distributions and present in more details the composite lognormal-Pareto and the composite Weibull-Pareto models. Besides providing their main statistical properties, we illustrate the size distributions and composite models by applying them to two real application examples to emphasize their use in practice. We compare the goodness-of-fit of the considered distributions using as criteria the MLL, AIC, and BIC. For the first data set dealing with fire losses we find that the composite lognormal-Pareto model performs the best, hinting at the usefulness of composite models in this research area. However, for the second data set on Internet traffic, the simple lognormal distribution outperforms the other size distributions and composite models. This shows how delicate the choice is for a practitioner when confronted with the question which distribution or model he/she should use on a given data set. The composite models are already quite flexible, but given the different shapes a data set can take, there is a quest for even more flexible distributions. In the literature, some families of distributions are proposed, which contain many of the classical size distributions and hence can model very diverse behaviors. The most popular one is the generalized beta distribution presented in reference , and very recently reference  introduces a new flexible distribution called the interpolating family of size distributions. Those distributions are quite flexible as they enable to model very distinct shapes and probably constitute the future avenue of research in the area of size distributions.