
Distributions and Composite Models for Size-Type Data

By Yves Dominicy and Corinne Sinner

Submitted: April 28th 2016. Reviewed: October 20th 2016. Published: April 26th 2017.

DOI: 10.5772/66443


Abstract

In the first part of this chapter, we present a sample of the best known and most used classical size distributions with their main statistical properties. In the second part, we introduce the concept of composite models and based on the size distributions of the first part, we describe those which already exist in the literature. In the last part of this chapter, we apply the described statistical size distributions and some of the composite models to two real data examples and compare their goodness-of-fit.

Keywords

  • size distributions
  • composite models
  • lognormal
  • Pareto
  • Weibull

1. Introduction

In statistical modeling, the continuous aim is to look for the probability law which best describes the observations arising from a given field and which should represent the underlying data-generating process. The obtained probability distributions should possess desirable properties, such as the flexibility to model different shapes, while remaining tractable. This research avenue was initiated in the nineteenth century by famous mathematicians such as Adolphe Quetelet, Sir Francis Galton, and Vilfredo Pareto, and since then it has never ceased. Nowadays, it still remains among the most treated topics in statistics, as shown by the large number of scientific papers recently published on the subject (see for instance the review papers [1] and [2]). The current appeal of this topic is easily explained by the availability of large data sets in various scientific domains, making further research on this subject essential.

In this chapter, we concentrate on probability distributions that analyze size-type data. By size distributions, we mean probability laws designed to model data that only take positive values. Positive observations appear naturally in different fields: survival analysis [3, 4], environmental science [5], network traffic modeling [6], economics [7, 8], hydrology [9], and actuarial science [10]. Given the range of various domains of application, there exists a plethora of different size distributions and it is still a very active research area [11, 12].

The structure of this chapter is as follows. In Section 2, we review the most used and well-known size distributions, and state their main statistical properties. Section 3 introduces the notion of composite models and gives a small review of the composite models in the literature, based on the size distributions depicted in Section 2. In Section 4, we apply the described size distributions of Section 2 and some of the composite models of Section 3 to two real data sets, namely, an insurance data set and an Internet traffic data set. Finally, Section 5 concludes.

2. Review of size distributions

We describe here a sample of the best known and most used size distributions. We state their probability density function (p.d.f.) and their cumulative distribution function (c.d.f.), give their moments and quantile function, and present the estimators obtained via maximum likelihood estimation. More specifically, we take a closer look at the lognormal, Pareto, generalized Lomax, and generalized extreme value distributions.

2.1. The lognormal distribution

The English statistician Sir Francis Galton stated that in some situations it was preferable to measure the location of a distribution with the geometric mean instead of the arithmetic mean [13]. Indeed, laws of nature often behave in multiplicative ways; thus, the geometric mean becomes more appropriate as a measure of central tendency than the arithmetic mean. As a reply to Galton's request, the Scottish physician Donald McAlister established in 1879 a theory of the exponentiated (or multiplicative) normal distribution [14], which came to be known as the lognormal distribution.

Let X be a positive random variable (r.v.) such that Y = log X is normally distributed with parameters μ ∈ R and σ > 0. The r.v. X then has a lognormal distribution, X ~ LN(μ, σ^2), with probability density function (p.d.f.)

f(x; μ, σ^2) = \frac{1}{x\sqrt{2π}\,σ} e^{-\frac{(\log x - μ)^2}{2σ^2}}, \quad x > 0.  (E1)

The location parameter μ ∈ R and the scale parameter σ > 0 are characteristic of the r.v. log X. However, under the exponential transformation, the geometric mean becomes a scale parameter, as depicted in Figure 1, and the multiplicative standard deviation acts as a shape parameter impacting the skewness (see Figure 2).

Figure 1.

Density plots of the lognormal distribution with varying location parameter μ and fixed scale parameter σ.

Figure 2.

Density plots of the lognormal distribution with varying scale parameter σ and fixed location parameter μ.

If random variability acts through multiplicative effects, as stated by Galton, then a lognormal distribution results. This establishes the basis of the multiplicative central limit theorem, which asserts that the geometric means of (not necessarily lognormal) positive random variables are approximately lognormally distributed.

The cumulative distribution function (c.d.f.) of the lognormal law is related to the c.d.f. of the normal distribution:

F(x; μ, σ^2) = Φ\left(\frac{\log x - μ}{σ}\right), \quad x > 0,  (E2)

where Φ(.) represents the c.d.f. of a standard normal distribution.

The moments of order r are conveniently expressed as E(X^r) = e^{rμ + r^2σ^2/2}. Hence, the mean is given by E(X) = e^{μ + σ^2/2} and the variance by V(X) = e^{2μ + σ^2}(e^{σ^2} - 1). The lognormal is a unimodal distribution and the unique mode is reached at x_mode = e^{μ - σ^2}. By comparing the mean and the mode, we note that for a fixed μ, an increasing σ shifts the mode toward zero while the mean increases. The quantile function is defined as F^{-1}(y) = e^{μ + σΦ^{-1}(y)} for 0 < y < 1, where Φ^{-1}(.) denotes the quantile function of a standard normal distribution.

Thanks to its relationship to the normal distribution, the likelihood function is given by

L(x_1, \ldots, x_n \mid μ, σ^2) = \left(\prod_{i=1}^{n} \frac{1}{x_i}\right) \left(\frac{1}{\sqrt{2π}\,σ}\right)^{n} e^{-\sum_{i=1}^{n} \frac{(\log x_i - μ)^2}{2σ^2}},  (E3)

and hence the log-likelihood function can be expressed as

\ell(x_1, \ldots, x_n \mid μ, σ^2) = -\sum_{i=1}^{n} \log x_i - \frac{n}{2}\log 2π - n\logσ - \sum_{i=1}^{n} \frac{(\log x_i - μ)^2}{2σ^2}.  (E4)

The maximum likelihood estimators of the location and scale parameters are given by μ̂ = \frac{1}{n}\sum_{i=1}^{n} \log x_i and σ̂^2 = \frac{1}{n}\sum_{i=1}^{n} (\log x_i - μ̂)^2, respectively.
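Since both estimators are available in closed form, fitting a lognormal sample is a one-pass computation. The following is a minimal stdlib-only Python sketch (the chapter's own computations were carried out in Mathematica; the function names here are illustrative, not the authors' code):

```python
import math
import random

def fit_lognormal(xs):
    """Closed-form ML estimates (mu_hat, sigma2_hat) for a lognormal sample."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n                              # mu_hat: mean of the log-data
    sigma2 = sum((l - mu) ** 2 for l in logs) / n   # sigma2_hat: variance of the log-data
    return mu, sigma2

def lognormal_mean(mu, sigma2):
    """Theoretical mean E(X) = exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma2 / 2)

# Simulate a lognormal sample with mu = 1, sigma = 0.5 and recover the parameters.
random.seed(42)
sample = [math.exp(random.gauss(1.0, 0.5)) for _ in range(20000)]
mu_hat, sigma2_hat = fit_lognormal(sample)
```

With 20,000 draws, the estimates land close to the true values μ = 1 and σ^2 = 0.25, illustrating the consistency of the closed-form estimators.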

The lognormal distribution is widely used to describe natural phenomena. In finance, the Black-Scholes model, a mathematical model for pricing derivative instruments, assumes the price of the underlying asset to be lognormally distributed [15]. In economics, income data are often modeled by a lognormal distribution [16], which can be easily explained as follows: a very low percentage of earners have very low income, average incomes are frequent, and elevated incomes are rare. In actuarial sciences, the law is assumed to fit well some types of insurance losses [17, 18]. In 1931, the French economist and engineer Robert Pierre Louis Gibrat stated that firm size follows a lognormal distribution, as its proportional growth rate is independent of its absolute size. Other applications can be found in biology [19, 20] or in linguistics to model the number of words in a sentence [21].

2.2. The Pareto distribution

The Italian economist and engineer Vilfredo Pareto observed in 1896 that in many populations the power law cx^{-α}, for some constant c > 0 and some exponent α > 0, was an appropriate approximation of the number of individuals with income exceeding a given threshold x_0 (see for instance [22, 23]). These power laws assume that small values of x are very frequent, while large occurrences are extremely rare. Their form implies that all power laws with a particular scaling exponent are equivalent up to constant factors, since each is simply a scaled version of the others. Taking logarithms of both f(x) and x therefore produces a linear relationship, which is the signature of power laws. Such distributions are known as Pareto-type distributions.

The p.d.f. of a r.v. X having a Pareto (type I) distribution with parameters α > 0 and x0 > 0 is given by

f(x; α, x_0) = \frac{α}{x_0}\left(\frac{x_0}{x}\right)^{α+1}, \quad x \ge x_0.  (E5)

The location parameter x_0 represents the lower bound of the data, and the shape parameter α, called the tail index or Pareto index, regulates the tail, as can be seen in Figure 3. Note that a decreasing value of α implies a heavier tail.

Figure 3.

Density plots of the Pareto distribution with varying shape parameter α and fixed location parameter x0 = 1.

The c.d.f. of the Pareto law is given by

F(x; α, x_0) = 1 - \left(\frac{x_0}{x}\right)^{α}, \quad x \ge x_0.  (E6)

For α > r, the r-th moment of the Pareto distribution is given by E(X^r) = \frac{α x_0^r}{α - r}. The mean and the variance are then, respectively, E(X) = \frac{α x_0}{α - 1} for α > 1 and V(X) = \frac{α x_0^2}{(α-1)^2(α-2)} for α > 2. The quantile function is expressed as F^{-1}(y) = x_0 (1-y)^{-1/α} for 0 < y < 1. Being a unimodal law, the Pareto distribution reaches its peak at x_mode = x_0. As x_0 represents the minimum value of x, its estimation is straightforward: x̂_0 = \min_{i=1,\ldots,n} x_i. The likelihood function is given by

L(x_1, \ldots, x_n \mid α, x_0) = α^n x_0^{nα} \prod_{i=1}^{n} \left(\frac{1}{x_i}\right)^{α+1},

and to estimate the parameter α, we maximize the following log-likelihood function

\ell(x_1, \ldots, x_n \mid α, x_0) = n\logα + nα\log x_0 - (α+1)\sum_{i=1}^{n} \log x_i,  (E7)

which yields the maximum likelihood estimator α̂ = n / \sum_{i=1}^{n} \log(x_i / x̂_0). Note that the maximum likelihood estimator of the tail index α corresponds to the popular Hill estimator [24], an estimator of the extreme value index in extreme value theory; for a review of the Hill estimator, we refer the interested reader to reference [25]. Often the focus lies more on the power law probability distribution, whose density has approximately the form L(x) x^{-α}, where α > 1 and L(x) is a slowly varying function. In many situations, it is convenient to assume a lower bound x_0 from which the law holds. Combining those two cases yields the Pareto-type distributions, also known in extreme value theory as distributions with regularly varying tails.
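The quantile function and the ML estimators above are both explicit, so simulation and estimation can be checked against each other. A stdlib-only Python sketch (illustrative; the chosen parameter values are arbitrary):

```python
import math
import random

def pareto_quantile(y, alpha, x0):
    """Quantile function F^{-1}(y) = x0 * (1 - y)^(-1/alpha), 0 < y < 1."""
    return x0 * (1.0 - y) ** (-1.0 / alpha)

def fit_pareto(xs):
    """Closed-form ML estimates: x0_hat = sample minimum,
    alpha_hat = n / sum(log(x_i / x0_hat)) (the Hill estimator)."""
    x0_hat = min(xs)
    alpha_hat = len(xs) / sum(math.log(x / x0_hat) for x in xs)
    return alpha_hat, x0_hat

# Simulate Pareto(alpha = 2.5, x0 = 2) draws by inverse-transform sampling,
# then recover the parameters with the closed-form ML estimators.
rng = random.Random(1)
xs = [pareto_quantile(rng.random(), 2.5, 2.0) for _ in range(20000)]
alpha_hat, x0_hat = fit_pareto(xs)
```

Every draw lies above the lower bound x_0, and with 20,000 observations the tail-index estimate is close to the true α = 2.5.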

A generalization of the Pareto law is the so-called generalized Pareto distribution, which regroups the Pareto type I, II, III, and IV distributions. The Pareto type IV contains the other types as special cases and hence also other size distributions belonging to the different types, for instance the Lomax distribution [26]. This latter distribution belongs to the Pareto type II family, and its p.d.f. is given by

f(x; α, k) = \frac{α}{k}\left(1 + \frac{x}{k}\right)^{-(α+1)}, \quad x > 0,  (E8)

with shape parameter α > 0 and scale parameter k > 0. It can be interpreted as a shifted Pareto type I distribution.

A generalization of the Pareto type I distribution is the Stoppa distribution [27], which comes from a power transformation of the Pareto c.d.f. and yields the following p.d.f.

f(x; α, δ, x_0) = δα x_0^{α} x^{-(α+1)} \left(1 - \left(\frac{x_0}{x}\right)^{α}\right)^{δ-1}, \quad x > x_0,  (E9)

with shape parameters α > 0, δ > 0 and location parameter x0 > 0. If δ = 1, we get the Pareto type I distribution. However, if the shape parameter δ > 1, the Stoppa distribution presents a heavier tail than the classical Pareto law.

The Pareto distribution is often used to model fire losses in actuarial sciences [28, 29] as well as in reinsurance to approximate large losses. Originally, it was used to describe the income distribution and the allocation of wealth [22], but nowadays it is also used to model, for instance, areas burnt in forest fires or the file sizes of Internet traffic data [30]. Note that, in general, in empirical applications, the Pareto distribution does not fit all the values but rather is used to fit their upper tail, i.e., the large values. Hence, in order to fit a distribution to all the values, one often uses a composite model (see Section 3), which combines two distributions, one of which is the Pareto law.

2.3. The generalized Lomax distribution

The generalized Lomax (GL) distribution, also known as the exponentiated Lomax distribution, was introduced by Abdul-Moniem and Abdel-Hameed in 2012 [31] by powering the c.d.f. of the Lomax distribution to a positive real number.

The p.d.f. of a r.v. X following a generalized Lomax distribution with parameters a > 0, b > 0, and k > 0 corresponds to

f(x; a, b, k) = \frac{ab}{k}\left(1+\frac{x}{k}\right)^{-(a+1)}\left(1-\left(1+\frac{x}{k}\right)^{-a}\right)^{b-1}, \quad x > 0.  (E10)

The shape parameter a regulates the heaviness of the tail, as can be seen in Figure 4, and the shape parameter b controls the skewness (see Figure 5). The parameter k is a scale parameter, as depicted in Figure 6.

Figure 4.

Density plots of the GL distribution with varying shape parameter a and fixed shape parameter b and scale parameter k.

Figure 5.

Density plots of the GL distribution with varying shape parameter b and fixed shape parameter a and scale parameter k.

Figure 6.

Density plots of the GL distribution with varying scale parameter k and fixed shape parameters a and b.

The c.d.f. is expressed as

F(x; a, b, k) = \left(1-\left(1+\frac{x}{k}\right)^{-a}\right)^{b}, \quad x > 0.  (E11)

The moments of order r are given by E(X^r) = b k^r \sum_{i=0}^{r} \binom{r}{i} (-1)^i B\left(1 - \frac{r-i}{a}, b\right), where B(·, ·) denotes the Beta function, yielding E(X) = b k B\left(1-\frac{1}{a}, b\right) - k for the mean and V(X) = b k^2 \left(B\left(1-\frac{2}{a}, b\right) - b \left(B\left(1-\frac{1}{a}, b\right)\right)^2\right) for the variance. The inverse c.d.f. is given by F^{-1}(y) = k\left(\left(1-y^{1/b}\right)^{-1/a} - 1\right) for 0 < y < 1, and the unique mode is reached at x_mode = k\left(\left(\frac{ab+1}{a+1}\right)^{1/a} - 1\right).

The likelihood function is given by

L(x_1, \ldots, x_n \mid a, b, k) = \left(\frac{ab}{k}\right)^n \prod_{i=1}^{n}\left(1+\frac{x_i}{k}\right)^{-(a+1)} \prod_{i=1}^{n}\left(1-\left(1+\frac{x_i}{k}\right)^{-a}\right)^{b-1},  (E12)

and hence the following log-likelihood function is obtained

\ell(x_1, \ldots, x_n \mid a, b, k) = n\log\frac{ab}{k} - (a+1)\sum_{i=1}^{n}\log\left(1+\frac{x_i}{k}\right) + (b-1)\sum_{i=1}^{n}\log\left(1-\left(1+\frac{x_i}{k}\right)^{-a}\right).  (E13)

The calculated score functions are expressed by

\frac{\partial \ell}{\partial a} = \frac{n}{a} - \sum_{i=1}^{n}\log\left(1+\frac{x_i}{k}\right) + (b-1)\sum_{i=1}^{n}\frac{\left(1+\frac{x_i}{k}\right)^{-a}\log\left(1+\frac{x_i}{k}\right)}{1-\left(1+\frac{x_i}{k}\right)^{-a}},

\frac{\partial \ell}{\partial b} = \frac{n}{b} + \sum_{i=1}^{n}\log\left(1-\left(1+\frac{x_i}{k}\right)^{-a}\right),  (E14)

and

\frac{\partial \ell}{\partial k} = -\frac{n}{k} + \frac{a+1}{k^2}\sum_{i=1}^{n}\frac{x_i}{1+\frac{x_i}{k}} - \frac{a(b-1)}{k^2}\sum_{i=1}^{n}\frac{x_i\left(1+\frac{x_i}{k}\right)^{-(a+1)}}{1-\left(1+\frac{x_i}{k}\right)^{-a}},  (E15)

which have to be set equal to zero and solved numerically in order to find the estimated parameters.
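The score equations admit no closed-form solution, so any numerical root finder or optimizer can be used on the log-likelihood (E13). A useful sanity check before running an optimizer is to compare an analytic score against a finite-difference derivative of the log-likelihood. A stdlib-only Python sketch (illustrative; the data and parameter values are arbitrary, and this is not the authors' implementation):

```python
import math
import random

def gl_loglik(xs, a, b, k):
    """Log-likelihood of the generalized Lomax distribution (E13)."""
    ll = len(xs) * math.log(a * b / k)
    for x in xs:
        u = 1.0 + x / k
        ll += -(a + 1) * math.log(u) + (b - 1) * math.log(1.0 - u ** (-a))
    return ll

def gl_score_b(xs, a, b, k):
    """Analytic score with respect to b: n/b + sum log(1 - (1 + x/k)^(-a))."""
    return len(xs) / b + sum(math.log(1.0 - (1.0 + x / k) ** (-a)) for x in xs)

rng = random.Random(7)
xs = [rng.expovariate(1.0) for _ in range(200)]  # arbitrary positive data for the check
a, b, k, h = 2.0, 1.5, 1.0, 1e-6
# Central finite difference in b should reproduce the analytic score.
fd = (gl_loglik(xs, a, b + h, k) - gl_loglik(xs, a, b - h, k)) / (2 * h)
```

If the analytic score is typed correctly, `fd` and `gl_score_b(xs, a, b, k)` agree to several decimal places; the same check can be repeated for the a and k scores.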

The Lomax distribution is used to model income data, wealth allocation, and actuarial claim sizes [10]. The GL distribution is used to model the breaking stress of carbon fibers [32], the survival times of patients receiving chemotherapy treatment [33], and the number of successive failures of the air-conditioning systems in airplanes [33].

2.4. The generalized extreme value distribution

The generalized extreme value (GEV) distribution is well known in extreme value theory as it combines the Gumbel, Fréchet, and Weibull distributions, which are also known as the type I, II, and III extreme value distributions. The GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables; this result arises from the Fisher-Tippett theorem [34], the extreme value analogue of the central limit theorem. Therefore, the GEV distribution is also known as the Fisher-Tippett distribution in extreme value theory.

The GEV distribution has p.d.f.

f(x; μ, σ, k) = \frac{1}{σ}\left(1 + k\left(\frac{x-μ}{σ}\right)\right)^{-1-\frac{1}{k}} e^{-\left(1 + k\left(\frac{x-μ}{σ}\right)\right)^{-\frac{1}{k}}},  (E16)

if 1 + k\frac{x-μ}{σ} > 0, with location parameter μ ∈ R (see Figure 7), scale parameter σ > 0 (see Figure 8), and shape parameter k ∈ R, which governs the shape and the heaviness of the tail of the distribution, as can be seen in Figure 9.

Figure 7.

Density plots of the GEV distribution with varying location parameter μ.

Figure 8.

Density plots of the GEV distribution with varying scale parameter σ.

Figure 9.

Density plots of the GEV distribution with varying shape parameter k.

Its c.d.f. is given by

F(x; μ, σ, k) = e^{-\left(1 + k\left(\frac{x-μ}{σ}\right)\right)^{-\frac{1}{k}}},  (E17)

if 1 + k\frac{x-μ}{σ} > 0. The mean is defined as E(X) = μ - \frac{σ}{k} + \frac{σ}{k}Γ(1-k) and the variance as V(X) = \frac{σ^2}{k^2}\left(Γ(1-2k) - Γ(1-k)^2\right), both expressed in terms of the Gamma function. The quantile function is given by F^{-1}(y) = μ + \frac{σ}{k}\left((-\log y)^{-k} - 1\right) for 0 < y < 1, and the unique mode is reached at x_mode = μ + \frac{σ}{k}\left((1+k)^{-k} - 1\right). Depending on the value of the parameter k, the GEV reduces to one of the following special cases: if k = 0, we obtain the Gumbel distribution; if k > 0, we get the Fréchet distribution; and if k < 0, the Weibull distribution is obtained. The parameters of the GEV distribution are estimated using the maximum likelihood approach.
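The quantile function can be checked against the c.d.f. by a round trip, F(F^{-1}(y)) = y. A short stdlib-only Python sketch for the case k ≠ 0 (the parameter values are arbitrary and purely illustrative):

```python
import math

def gev_cdf(x, mu, sigma, k):
    """GEV c.d.f. (E17), valid when 1 + k * (x - mu) / sigma > 0."""
    t = 1.0 + k * (x - mu) / sigma
    return math.exp(-t ** (-1.0 / k))

def gev_quantile(y, mu, sigma, k):
    """Quantile function F^{-1}(y) = mu + (sigma/k) * ((-log y)^(-k) - 1)."""
    return mu + (sigma / k) * ((-math.log(y)) ** (-k) - 1.0)

# Round trip: F(F^{-1}(y)) should return y for any 0 < y < 1 (here with k != 0).
mu, sigma, k = 1.0, 2.0, 0.3
ys = [0.1, 0.5, 0.9]
roundtrip = [gev_cdf(gev_quantile(y, mu, sigma, k), mu, sigma, k) for y in ys]
```

The round trip works because substituting the quantile into the c.d.f. makes the inner term collapse to (−log y)^{−k}, whose −1/k power is exactly −log y.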

We now focus more closely on one of the three GEV distributions, namely, the Weibull distribution, which got its name from the Swedish engineer and scientist Waloddi Weibull, who analyzed it in detail in 1951. We take a look at this law as it is often used for size-type data and is considered an alternative to the lognormal distribution for the construction of composite models (see Section 3). The Weibull distribution belongs to the power laws with an exponential cut-off, i.e., a power law multiplied by an exponential function; for very large values, the exponential decay term overpowers the power law behavior.

The p.d.f. of the Weibull distribution is given by

f(x; σ, τ) = \frac{τ}{σ}\left(\frac{x}{σ}\right)^{τ-1} e^{-\left(\frac{x}{σ}\right)^{τ}}, \quad x \ge 0,  (E18)

with shape parameter τ > 0 governing the heaviness of the tail and scale parameter σ > 0.

The distribution has c.d.f.

F(x; σ, τ) = 1 - e^{-\left(\frac{x}{σ}\right)^{τ}}, \quad x \ge 0.  (E19)

The quantile function is given by F^{-1}(y) = σ(-\log(1-y))^{1/τ} for 0 < y < 1. The r-th moment is given by E(X^r) = σ^r Γ(1 + r/τ). Hence, the expectation and the variance are expressed as E(X) = σΓ(1 + 1/τ) and V(X) = σ^2\left(Γ(1 + 2/τ) - Γ(1 + 1/τ)^2\right), respectively. The Weibull distribution is unimodal: for τ > 1 the mode is reached at x_mode = σ\left(\frac{τ-1}{τ}\right)^{1/τ}, and for τ ≤ 1 the mode is reached at 0.

The parameters of the Weibull distribution are estimated via the maximum likelihood method. The corresponding likelihood and log-likelihood functions are given respectively by

L(x_1, \ldots, x_n \mid σ, τ) = \frac{τ^n}{σ^n}\prod_{i=1}^{n}\left(\frac{x_i}{σ}\right)^{τ-1} e^{-\sum_{i=1}^{n}\left(\frac{x_i}{σ}\right)^{τ}},  (E20)

and

\ell(x_1, \ldots, x_n \mid σ, τ) = n\logτ - nτ\logσ + (τ-1)\sum_{i=1}^{n}\log x_i - \sum_{i=1}^{n}\left(\frac{x_i}{σ}\right)^{τ}.  (E21)

The maximum likelihood estimator for the scale parameter σ, given τ, is σ̂^{τ} = \frac{1}{n}\sum_{i=1}^{n} x_i^{τ}, and the maximum likelihood estimator for the shape parameter τ is given by an implicit equation which has to be solved numerically: τ̂^{-1} = \frac{\sum_{i=1}^{n} x_i^{τ}\log x_i}{\sum_{i=1}^{n} x_i^{τ}} - \frac{1}{n}\sum_{i=1}^{n}\log x_i.
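The implicit equation for τ̂ can be solved with any one-dimensional root finder; the left-hand side minus the right-hand side is increasing in τ for a fixed sample, so simple bisection on a bracketing interval works. A stdlib-only Python sketch (illustrative; the bracketing interval and simulated sample are arbitrary choices, not part of the chapter):

```python
import math
import random

def weibull_tau_equation(tau, xs):
    """Implicit ML equation for tau, rearranged to
    g(tau) = sum(x^tau log x)/sum(x^tau) - mean(log x) - 1/tau, root at tau_hat."""
    s1 = sum(x ** tau * math.log(x) for x in xs)
    s0 = sum(x ** tau for x in xs)
    return s1 / s0 - sum(math.log(x) for x in xs) / len(xs) - 1.0 / tau

def fit_weibull(xs, lo=0.05, hi=50.0, iters=100):
    """Solve g(tau) = 0 by bisection, then plug tau into the closed form for sigma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weibull_tau_equation(mid, xs) < 0.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    sigma = (sum(x ** tau for x in xs) / len(xs)) ** (1.0 / tau)
    return sigma, tau

rng = random.Random(3)
# Weibull(sigma = 2, tau = 1.5) via inverse transform: x = sigma * (-log(1 - u))^(1/tau).
xs = [2.0 * (-math.log(1.0 - rng.random())) ** (1.0 / 1.5) for _ in range(20000)]
sigma_hat, tau_hat = fit_weibull(xs)
```

With 20,000 simulated draws, the bisection recovers τ ≈ 1.5 and σ ≈ 2 to within sampling error.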

In risk management, finance, and insurance, the risk measure Value at Risk is assessed by considering the GEV distribution [35, 36]. GEV distributions are also used in hydrology [37, 38], telecommunications [39], and meteorology [40]. In material sciences, the Weibull distribution is widely used thanks to its flexibility [41]. Other examples include wind speed distributions [42], forecasting technological change [43], the size of reinsurance claims [10], hydrology [9], and areas burnt in forest fires [44].

3. Composite models

Given the wealth of distinct size distributions, as can be seen from the small sample described in the previous section, the practitioner is often confronted with the following question: which size distribution should he or she use in which situation? The variation in the shapes size distributions can take, for instance between the Pareto distribution and the lognormal distribution, renders the choice very complicated in practice.

For example, insurance companies sometimes face losses that emerge from a combination of moderate and large claims. In order to model the large losses, the Pareto distribution seems to be the size distribution favored by practitioners. However, when losses consist of smaller values with high frequencies and larger losses with low frequencies, the lognormal or the Weibull distributions are preferred [45]. Nevertheless, no classical size distribution provides an acceptable fit for both small and large losses: on one hand, the Pareto fits the tail well; on the other hand, the lognormal and Weibull distributions produce a good overall fit but fit the tail badly.

A solution to this dilemma comes from the composite parametric models introduced in 2005 by Cooray and Ananda [46]. The idea of the composite models is to join together two weighted distributions at a given threshold value. In statistical terms, let X be a r.v. and denote by f1(.) the p.d.f. of the first distribution and by f2(.) the p.d.f. of the second distribution. Let F1(.) and F2(.) be the corresponding c.d.f., respectively. Scollnik [47] noticed that the p.d.f. of a composite model can then be expressed as

f(x) = \begin{cases} c\, f_1^{*}(x), & -\infty < x \le θ \\ (1-c)\, f_2^{*}(x), & θ < x < \infty, \end{cases}  (E22)

where c is a normalization constant in [0, 1], θ represents the threshold value, f_1^{*}(x) = \frac{f_1(x)}{F_1(θ)} for -∞ < x ≤ θ, and f_2^{*}(x) = \frac{f_2(x)}{1-F_2(θ)} for θ < x < ∞. In our setting, the considered composite models piece together two different size distributions with different shapes and tail-weights at a specific threshold. As size distributions only concern positive values, the p.d.f. of a composite model is rewritten as

f(x) = \begin{cases} c\, f_1^{*}(x), & 0 < x \le θ \\ (1-c)\, f_2^{*}(x), & θ < x < \infty, \end{cases}  (E23)

where 0 ≤ c ≤ 1. The composite model can also be interpreted as a two-component mixture model with mixing weights c and (1 − c). Hence, it can be seen as a convex sum of two density functions, f(x) = c f_1^{*}(x) + (1 − c) f_2^{*}(x), as noted in [47].

As the threshold cuts the composite distribution into two parts, from a mathematical point of view we need continuity and differentiability conditions at the threshold to yield a smooth density function. In order to make f(x) continuous, the condition f(θ−) = f(θ+) is imposed, which yields

c = \frac{f_2(θ) F_1(θ)}{f_2(θ) F_1(θ) + f_1(θ)\left(1 - F_2(θ)\right)}.  (E24)

The differentiability condition at the threshold value is given by f′(θ−) = f′(θ+) and yields

c = \frac{f_2'(θ) F_1(θ)}{f_2'(θ) F_1(θ) + f_1'(θ)\left(1 - F_2(θ)\right)}.  (E25)

If we combine the two expressions for the normalization constant c, we obtain the additional restriction on θ: \frac{f_1(θ)}{f_2(θ)} = \frac{f_1'(θ)}{f_2'(θ)}. Let us remark that reference [48] uses a mode-matching procedure instead, which gives a much simpler derivation of the model and allows an easier implementation with any distribution whose mode has a closed-form expression. Instead of the threshold value θ, the modal value x_m is used. Denote by x_{m1} and x_{m2} the modes of the distributions used in the first and second components of the composite model; the mode-matching conditions are then x_{m1} = x_{m2} and f^{*}(x_{m1}) = f^{*}(x_{m2}). The latter implies the continuity condition, and the former equality allows dropping the labels 1 and 2, which yields the following condition

c = \frac{f_2(x_m) F_1(x_m)}{f_2(x_m) F_1(x_m) + f_1(x_m)\left(1 - F_2(x_m)\right)}.  (E26)

Remark that the derivative at the mode is 0; hence, the differentiability condition is automatically satisfied.
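The construction above is mechanical once f_1, F_1, f_2, F_2, and θ are fixed: compute c from the continuity condition (E24) and assemble the two truncated components. A stdlib-only Python sketch with an exponential head and a Pareto tail (an arbitrary illustrative pairing; the differentiability condition (E25) is not imposed here, so only the density's continuity and total mass are checked):

```python
import math

def make_composite(f1, F1, f2, F2, theta):
    """Build a composite density (E23) with the continuity weight c of (E24)."""
    c = (f2(theta) * F1(theta)) / (
        f2(theta) * F1(theta) + f1(theta) * (1.0 - F2(theta)))
    def f(x):
        if x <= theta:
            return c * f1(x) / F1(theta)                   # truncated head component
        return (1.0 - c) * f2(x) / (1.0 - F2(theta))       # truncated tail component
    return f, c

# Illustration (not from the chapter): exponential(1) head, Pareto(2, x0 = 1) tail.
lam, alpha, theta = 1.0, 2.0, 1.5
def f1(x): return lam * math.exp(-lam * x)
def F1(x): return 1.0 - math.exp(-lam * x)
def f2(x): return alpha * (1.0 / x) ** (alpha + 1)
def F2(x): return 1.0 - (1.0 / x) ** alpha
f, c = make_composite(f1, F1, f2, F2, theta)

# The density is continuous at theta by construction ...
left, right = f(theta), f(theta + 1e-9)
# ... and integrates to 1 (Riemann sum on (0, 200); the x^-3 tail beyond is negligible).
total = sum(f(i * 0.001) for i in range(1, 200001)) * 0.001
```

Only the continuity condition is enforced, so the resulting density has a kink at θ; imposing (E25) as well, or using the mode-matching procedure of [48], removes it.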

The c.d.f. of a composite model of size distributions is given by

F(x) = \begin{cases} c\, \frac{F_1(x)}{F_1(θ)}, & 0 < x \le θ \\ c + (1-c)\, \frac{F_2(x) - F_2(θ)}{1 - F_2(θ)}, & θ < x < \infty. \end{cases}  (E27)

The moments of order r can be expressed via E_r(f) = c\, E_r(f_1^{*}) + (1-c)\, E_r(f_2^{*}).

Statistical inference for composite models is done using the classical maximum likelihood (ML) estimation approach. The ML estimation for composite models was first presented in [46] and also in [49]. In order to apply the ML approach, one has to know the integer m such that the unknown threshold parameter θ lies between the m-th and (m+1)-th observation. If the value of m were known, the likelihood function could be written out explicitly. Unfortunately, m is unknown, and the ML estimates change with m. Therefore, the following ML estimation algorithm was proposed, where the model has s parameters ρ_i, i = 1, …, s. In a first step, for each integer m = 1, …, n − 1, we estimate the parameters as the solution of the following ML system

\begin{cases} \frac{\partial \log L}{\partial ρ_i} = 0, & i = 1, \ldots, s, \\ \frac{\partial \log L}{\partial θ} = 0. \end{cases}  (E28)

If the inequality x_m ≤ θ̂ ≤ x_{m+1} holds, the ML estimators are denoted θ̂ and ρ̂_i, i = 1, …, s. A second step is needed in case the first step does not provide any satisfying result, meaning that we are in one of the two boundary settings m = n or m = 0; in these cases, the likelihood function is built using only f_1 or only f_2, respectively. For the ML procedure, one needs to check n − 1 intervals, so the computing time strongly depends on the magnitude of n; for large n, this leads to a complex system of equations that must be solved numerically.

In reference [50], the authors propose an alternative algorithm based on quantiles and a moment matching approach. In a first step, denote by q_1 and q_3 the first and third empirical quartiles of the data sample, and assume that q_1 ≤ θ ≤ q_3. We then use the method of moments to match the first s − 1 empirical moments with their theoretical counterparts, and add two more equations obtained from matching the two quartiles:

\begin{cases} c\, \frac{F_1(q_1)}{F_1(θ)} = 0.25, \\ c + (1-c)\, \frac{F_2(q_3) - F_2(θ)}{1 - F_2(θ)} = 0.75. \end{cases}  (E29)

If no result is obtained, we move to a second step where we assume that the first and third quartiles are both smaller than the threshold θ, and proceed as in the first step except that we now use the following two quartile equations

\begin{cases} c\, \frac{F_1(q_1)}{F_1(θ)} = 0.25, \\ c\, \frac{F_1(q_3)}{F_1(θ)} = 0.75. \end{cases}  (E30)

If we still have no solution, we finally assume that the first and third quartiles are greater than θ and proceed again in a similar fashion as in the first step with the two equations

\begin{cases} c + (1-c)\, \frac{F_2(q_1) - F_2(θ)}{1 - F_2(θ)} = 0.25, \\ c + (1-c)\, \frac{F_2(q_3) - F_2(θ)}{1 - F_2(θ)} = 0.75. \end{cases}  (E31)

Let us remark that those equations have to be solved numerically. Note that once we have a solution from this quantile and moment matching procedure, we can use the ML approach explained above to improve the result, as we now have some a priori information on the parameter θ and hence on the integer m.

In general, in the area of size distributions, composite models comprise a lognormal or Weibull distribution up to a given threshold value and some form of the Pareto distribution thereafter. The obtained models are close in shape to the lognormal or Weibull law but with a thicker tail due to the Pareto distribution, see Figure 10 and Figure 11.

Figure 10.

Density plot of the composite lognormal-Pareto model with θ = 0.55 and α = 0.5.

Figure 11.

Density plot of the composite Weibull-Pareto model with θ = 0.55 and τ = 1.42867.

This research area for size distributions was initiated by Cooray and Ananda in 2005 [46], who proposed the composite lognormal-Pareto model. They suggested that this composite model may be better suited for insurers confronted with smaller losses with high frequencies as well as larger values with lower frequencies. The lognormal-Pareto composite model introduced in reference [46] was further enhanced by Scollnik [47]. In that paper, the author noticed that the two-component composite model is very restrictive, since it has fixed and a priori known mixing weights; hence, he improved the model by using unrestricted mixing weights as coefficients in each component. In a similar way, the article [51] improves the composite Weibull-Pareto model proposed by reference [52]. Those are the composite models that will be described in more detail in the sequel. The papers [47] and [51] consider, beside the classical Pareto distribution, the Pareto type II distribution, also known as the Lomax distribution, as an alternative above the threshold value. In 2013, Teodorescu and Vernic [50] replaced the lognormal distribution by an arbitrary continuous distribution, analyzed in detail the composite Weibull-Pareto and composite Gamma-Pareto models, and also used the Lomax distribution as an alternative to the Pareto distribution above the threshold point. The same authors had already suggested the composite exponential-Pareto model [50]. More recently, reference [48] proposed composite models based on the Stoppa distribution [27], a generalization of the Pareto law; more precisely, the lognormal-Stoppa and Weibull-Stoppa composite models.

Let us now take a closer look at the composite lognormal-Pareto and Weibull-Pareto models. Given the general formulas above we can write the density for the composite lognormal-Pareto as

f(x) = \begin{cases} c\, \dfrac{\frac{1}{x\sqrt{2π}\,σ}\, e^{-\frac{(\log x - μ)^2}{2σ^2}}}{Φ\left(\frac{\logθ - μ}{σ}\right)}, & 0 < x \le θ \\ (1-c)\, \dfrac{α}{θ}\left(\dfrac{θ}{x}\right)^{α+1}, & θ < x < \infty, \end{cases}  (E32)

with 0 ≤ c ≤ 1 and Φ(.) denoting the c.d.f. of a standard normal distribution. In a similar way, the p.d.f. for the composite Weibull-Pareto can be written as

f(x) = \begin{cases} c\, \dfrac{\frac{τ}{σ}\left(\frac{x}{σ}\right)^{τ-1} e^{-\left(\frac{x}{σ}\right)^{τ}}}{1 - e^{-\left(\frac{θ}{σ}\right)^{τ}}}, & 0 < x \le θ \\ (1-c)\, \dfrac{α}{θ}\left(\dfrac{θ}{x}\right)^{α+1}, & θ < x < \infty, \end{cases}  (E33)

with 0 ≤ c ≤ 1.

By verifying the continuity and differentiability conditions at the threshold point θ, we obtain for the composite lognormal-Pareto model:

c = \frac{\frac{α}{θ}\, Φ\left(\frac{\logθ - μ}{σ}\right)}{\frac{α}{θ}\, Φ\left(\frac{\logθ - μ}{σ}\right) + \frac{1}{θ\sqrt{2π}\,σ}\, e^{-\frac{(\logθ - μ)^2}{2σ^2}}}  (E34)

and

ασ = \frac{\logθ - μ}{σ}.  (E35)

These conditions guarantee that the p.d.f. of the composite lognormal-Pareto is continuous and smooth at the threshold value θ. The continuity and differentiability conditions at θ, for the composite Weibull-Pareto, yield:

c = \frac{\frac{α}{θ}\left(1 - e^{-\left(\frac{θ}{σ}\right)^{τ}}\right)}{\frac{α}{θ}\left(1 - e^{-\left(\frac{θ}{σ}\right)^{τ}}\right) + \frac{τ}{σ}\left(\frac{θ}{σ}\right)^{τ-1} e^{-\left(\frac{θ}{σ}\right)^{τ}}}  (E36)

and

\left(\frac{θ}{σ}\right)^{τ} = \frac{α}{τ} + 1.  (E37)

These conditions guarantee the continuity and smoothness of the p.d.f. of the composite Weibull-Pareto at the threshold point θ.
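Both sets of conditions can be verified numerically. The sketch below (stdlib-only Python, arbitrary parameter values, not the chapter's code) builds the composite lognormal-Pareto density with c from the continuity condition (E34), picks θ from the smoothness condition ασ^2 = log θ − μ implied by (E35), and checks that value and slope match across the threshold:

```python
import math

def lnorm_pdf(x, mu, sigma):
    """Lognormal density (E1)."""
    return math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2)) \
        / (x * math.sqrt(2 * math.pi) * sigma)

def phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def composite_lnorm_pareto(x, mu, sigma, alpha, theta):
    """Composite lognormal-Pareto density (E32) with c from (E34)."""
    head = lnorm_pdf(theta, mu, sigma)
    tail = alpha / theta
    p = phi((math.log(theta) - mu) / sigma)
    c = tail * p / (tail * p + head)
    if x <= theta:
        return c * lnorm_pdf(x, mu, sigma) / p
    return (1.0 - c) * (alpha / theta) * (theta / x) ** (alpha + 1)

# Choose theta from the smoothness condition alpha * sigma^2 = log(theta) - mu.
mu, sigma, alpha = 0.0, 1.0, 1.5
theta = math.exp(mu + alpha * sigma ** 2)
eps = 1e-6
f_left = composite_lnorm_pareto(theta - eps, mu, sigma, alpha, theta)
f_right = composite_lnorm_pareto(theta + eps, mu, sigma, alpha, theta)
slope_left = (composite_lnorm_pareto(theta - eps, mu, sigma, alpha, theta)
              - composite_lnorm_pareto(theta - 2 * eps, mu, sigma, alpha, theta)) / eps
slope_right = (composite_lnorm_pareto(theta + 2 * eps, mu, sigma, alpha, theta)
               - composite_lnorm_pareto(theta + eps, mu, sigma, alpha, theta)) / eps
```

If θ is instead chosen away from exp(μ + ασ^2), the values still match (continuity is enforced through c) but the one-sided slopes differ, which makes the role of the two conditions visible.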

The c.d.f. of the composite lognormal-Pareto and Weibull-Pareto is given, respectively, by

F(x) = \begin{cases} c\, \dfrac{Φ\left(\frac{\log x - μ}{σ}\right)}{Φ\left(\frac{\logθ - μ}{σ}\right)}, & 0 < x \le θ \\ c + (1-c)\left(1 - \left(\dfrac{θ}{x}\right)^{α}\right), & θ < x < \infty, \end{cases}  (E38)

and

F(x) = \begin{cases} c\, \dfrac{1 - e^{-\left(\frac{x}{σ}\right)^{τ}}}{1 - e^{-\left(\frac{θ}{σ}\right)^{τ}}}, & 0 < x \le θ \\ c + (1-c)\left(1 - \left(\dfrac{θ}{x}\right)^{α}\right), & θ < x < \infty. \end{cases}  (E39)

Finally, the moments of order r of the composite lognormal-Pareto and Weibull-Pareto are given by

E(X^r) = c\, \frac{e^{rμ + \frac{r^2σ^2}{2}}\, Φ\left(\frac{\logθ - μ}{σ} - rσ\right)}{Φ\left(\frac{\logθ - μ}{σ}\right)} + (1-c)\, \frac{α\,θ^{r}}{α - r}  (E40)

and

E(X^r) = c\, \frac{σ^{r}\, γ\left(\frac{r}{τ} + 1;\, \left(\frac{θ}{σ}\right)^{τ}\right)}{1 - e^{-\left(\frac{θ}{σ}\right)^{τ}}} + (1-c)\, \frac{α\,θ^{r}}{α - r},  (E41)
where γ(·; ·) denotes the lower incomplete Gamma function,

for α > r, respectively.

To estimate the composite lognormal-Pareto and the composite Weibull-Pareto models, the algorithms described above are used.

4. Applications

In this section, we focus on two applications to real data sets: one from actuarial sciences, dealing with fire losses, and one on Internet traffic data. We analyze these two data sets with the size distributions of Section 2 and the two composite models of Section 3, namely, the lognormal-Pareto and the Weibull-Pareto. In order to compare the distributions, we use the following three criteria:

  1. The maximum log-likelihood (MLL) value: the larger the value, the better the fit of the distribution to the data set.

  2. The Akaike information criterion (AIC):

    AIC = 2p − 2 MLL,
    where p represents the number of parameters to estimate. This criterion represents a measure of the relative quality of a distribution given a set of laws. The distribution with the lowest AIC value is preferred.

  3. The Bayesian information criterion (BIC):

    BIC = p log n − 2 MLL,
    where n represents the length of the data set and p the number of parameters to estimate. This criterion is used to choose a distribution among a finite set of laws. The distribution with the lowest BIC is preferred.

The AIC and BIC give a trade-off between a reward for a good goodness-of-fit performance and a penalty for an increasing number of parameters to estimate. The BIC tends to favor more parsimonious models than does the AIC.
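Both criteria are one-line computations from the maximized log-likelihood. As a sketch (Python here, although the chapter's computations were done in Mathematica), the lognormal row of Table 1 can be reproduced from p = 2, MLL = −4433.89, and n = 2492:

```python
import math

def aic(p, mll):
    """Akaike information criterion: AIC = 2p - 2 * MLL."""
    return 2 * p - 2 * mll

def bic(p, n, mll):
    """Bayesian information criterion: BIC = p * log(n) - 2 * MLL."""
    return p * math.log(n) - 2 * mll

# Lognormal fit of the Danish fire data: p = 2, MLL = -4433.89, n = 2492.
print(round(aic(2, -4433.89), 2), round(bic(2, 2492, -4433.89), 2))
# prints 8871.78 8883.42
```

The BIC penalty p log n exceeds the AIC penalty 2p as soon as n > e^2 ≈ 7.4, which is why the BIC tends to favor more parsimonious models.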

We carried out the calculations with Wolfram Mathematica 10. To calculate the MLL, AIC, and BIC values for the size distributions of Section 2, we used the function NMaximize with the numerical maximization method "RandomSearch" enhanced with the option "InteriorPoint". For the composite lognormal-Pareto and composite Weibull-Pareto models, we used the estimation algorithms described in Section 3.

4.1. Danish fire losses

In this example, we analyze a classical insurance data set. This is the set of Danish data on 2492 fire insurance losses in Danish Krone (DK) from the years 1980 to 1990 inclusive. The data set can be found in the “SMPracticals” add-on package for R, available from the CRAN website cran.r-project.org.

The comparison of the considered distributions using the three criteria explained above is presented in Table 1. The estimated values for the fitted distributions are given in Table 2.

Distribution        p    MLL        AIC        BIC
Lognormal           2    -4433.89   8871.78    8883.42
Pareto              2    -5675.09   11354.20   11365.80
GL                  3    -3967.18   7940.36    7957.82
GEV                 3    -3955.43   7916.86    7934.32
Lognormal-Pareto    2    -3877.84   7759.68    7771.32
Weibull-Pareto      2    -3959.78   7923.56    7935.20

Table 1.

MLL, AIC, and BIC values for the Danish fire data set.

Distribution        Estimated parameters
Lognormal           μ̂ = 0.671854, σ̂ = 0.732317
Pareto              α̂ = 0.545817, x̂_0 = 0.313404
GL                  â = 2.01251, b̂ = 435198, k̂ = 0.00227572
GEV                 μ̂ = 1.42575, σ̂ = 0.712043, k̂ = 0.545094
Lognormal-Pareto    α̂ = 1.43633, θ̂ = 1.38513
Weibull-Pareto      τ̂ = 4.43613, θ̂ = 1.46597

Table 2.

Estimated values for the fitted distributions for the Danish fire data set.

With an MLL value of -3877.84 and only two parameters, yielding AIC = 7759.68 and BIC = 7771.32, the composite lognormal-Pareto model provides a better fit than the other models for this data set. A visual impression of the fit is given in Figure 12.

Figure 12.

Histogram of the Danish fire data with the fitted density of the composite lognormal-Pareto model.

As the data exhibit a humped shape for the lower values and heavy-tail behavior for the upper values, this example justifies the use and the necessity of the composite lognormal-Pareto model.

This data set was also analyzed in reference [46], where the composite lognormal-Pareto model was introduced, and reference [51] applied the Weibull-Pareto model to it as well. The results we obtain above coincide with theirs.

4.2. Internet traffic data

In the second empirical illustration, we analyze Internet traffic data, which have already been analyzed from a Bayesian point of view in references [53] and [54]. This data set consists of 3143 observations of transferred bytes per second, recorded over consecutive seconds.

Based on the MLL, AIC, and BIC values reported in Table 3, we conclude that among the considered laws, the lognormal distribution provides the best fit, closely followed by the GL and GEV distributions. The two considered composite models do not provide good fits for this example. The estimated values for the fitted densities are given in Table 4.

| Distribution | p | MLL | AIC | BIC |
| --- | --- | --- | --- | --- |
| Lognormal | 2 | -39582.2 | 79168.4 | 79180.5 |
| Pareto | 2 | -43031.7 | 86067.4 | 86079.5 |
| GL | 3 | -39581.7 | 79169.4 | 79187.6 |
| GEV | 3 | -39608.4 | 79222.8 | 79241.0 |
| Lognormal-Pareto | 2 | -40098.4 | 80200.8 | 80212.9 |
| Weibull-Pareto | 2 | -42823.9 | 85651.8 | 85663.9 |

Table 3.

MLL, AIC, and BIC values for the Internet traffic data.

| Distribution | Parameter estimates |
| --- | --- |
| Lognormal | μ̂ = 11.6518, σ̂ = 0.62067 |
| Pareto | α̂ = 0.353628, x̂₀ = 6795 |
| GL | α̂ = 13.6735, b̂ = 4.08831, k̂ = 808429 |
| GEV | μ̂ = 94465, σ̂ = 54467.5, k̂ = 0.204602 |
| Lognormal-Pareto | α̂ = 1.05077, θ̂ = 85064.5 |
| Weibull-Pareto | τ̂ = 1.12043, θ̂ = 79366.3 |

Table 4.

Estimated values for the fitted distributions for the Internet traffic data.

Figure 13 provides a visual confirmation of the goodness-of-fit of the lognormal distribution.

Figure 13.

Histogram of the Internet traffic data with the fitted lognormal density.

5. Conclusion

To sum up, in this chapter we review the notion of size distributions by presenting the best known and most used ones. We further describe the general concept of composite models based on size distributions and present in more detail the composite lognormal-Pareto and composite Weibull-Pareto models. Besides providing their main statistical properties, we illustrate the size distributions and composite models by applying them to two real data examples, comparing the goodness-of-fit of the considered distributions by means of the MLL, AIC, and BIC criteria. For the first data set, dealing with fire losses, we find that the composite lognormal-Pareto model performs best, hinting at the usefulness of composite models in this research area. However, for the second data set, on Internet traffic, the simple lognormal distribution outperforms the other size distributions and both composite models. This shows how delicate the choice is for a practitioner confronted with the question of which distribution or model to use on a given data set.

The composite models are already quite flexible, but given the different shapes a data set can take, there is a quest for even more flexible distributions. In the literature, some families of distributions have been proposed that contain many of the classical size distributions and hence can model very diverse behaviors. The most popular one is the generalized beta distribution presented in reference [55], and very recently reference [56] introduced a new flexible distribution called the interpolating family of size distributions. These distributions are quite flexible, as they can model very distinct shapes, and probably constitute the future avenue of research in the area of size distributions.

© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Yves Dominicy and Corinne Sinner (April 26th 2017). Distributions and Composite Models for Size-Type Data. In: Tsukasa Hokimoto (ed.), Advances in Statistical Methodologies and Their Application to Real Problems. IntechOpen. DOI: 10.5772/66443.
