Information Entropy

This chapter begins with a short description of the concept of entropy, its formula, and the corresponding MATLAB code. The main body presents three approaches to using information entropy in dataset analysis: (i) segmentation of the data into two groups; (ii) filtration of noise in the dataset; (iii) enhancement of the entropy contribution via the point information gain. Finally, the conclusion briefly discusses extended analysis using more generalized entropies and the usability of the described algorithms, with their advantages and disadvantages.


Introduction
The MATLAB environment enables advanced data processing and analysis, especially using its toolboxes for signal processing, image processing, and statistics. Real signals have to be evaluated with numerous methods for filtration, transformation, alignment, comparison, and so on to extract the hidden knowledge. These methods belong to the large group of data processing and analysis techniques, and they originate in statistics, physics, artificial intelligence, and systems theory. Recently, Katajama [1] drew a clear distinction between processing and analysis (Figures 1 and 2).

Processing
• is the necessary step before the analysis.
• transforms the raw data into more transparent format for the analysis.
• includes tasks as calibration, filtering, feature detection, alignment, normalization, modeling, and so on.

Analysis
• is the interpretation of the processed data.
• consists of comparison, classification, clustering, decomposition, pattern recognition, identification, and so on.
One of the most useful plots in signal or image analysis is the signal histogram, an expression of signal abundance, first introduced by Pearson [2]. The estimation of a proper histogram, as a representation of the probability distribution function, suffers from the question of proper binning. However, in the digital era, we live with datasets that are discrete representations of discrete events of the real signal. Thus, the number of bins is usually given by the number of quantization levels of the sampling process (Figures 3 and 4).
In this chapter, the question of image processing is discussed. The exposition opens with the intensity histogram function and continues through statistical parameters, such as the central moments, to the information entropy. Three different methods for using entropy in image processing are introduced: entropy filtration, entropy segmentation, and the point information gain. The description is completed by mathematical equations as well as by commented MATLAB commands, whose results are the plots and figures presented within the text. This chapter aims to serve as a guiding overview for considering entropy as a processing method; the simple examples show the methods' steps and additional features (Figure 5).

Histogram function
In digital image representation, the intensity histogram H(d) of a grayscale image is a function that gives the count of pixels Φ(i, j) whose intensity equals d, independently of the position (i, j). In MATLAB, the grayscale demo image circuit can be loaded and its histogram computed with a few commands. The histogram can then be used for many modifications (contrast enhancement, frequency evaluation, segmentation, etc.).
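As a minimal sketch (assuming the Image Processing Toolbox and its bundled demo image circuit.tif), the image can be loaded and its intensity histogram displayed as follows:

```matlab
% Load the grayscale demo image and show its intensity histogram H(d).
Im = imread('circuit.tif');   % 8-bit grayscale demo image shipped with MATLAB
figure;
imhist(Im);                   % histogram over the 256 quantization levels
```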

Statistical parameters
The distribution function allows us to compute statistical parameters relevant for the further processing. A distribution is well characterized by two parameters: the location parameter and the scaling parameter. The location parameter describes the value around which the other values are concentrated; its classical estimate is the mean value μ, the first moment of the distribution. There is another way to obtain the mean value directly from the intensity levels of all pixels in the image Im.
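To make the histogram route explicit, the mean can be computed from the normalized histogram as the first moment, μ = Σ_d d·h(d). A sketch, assuming the circuit.tif demo image and the Image Processing Toolbox:

```matlab
Im = imread('circuit.tif');
[counts, d] = imhist(Im);      % counts(k): number of pixels with intensity d(k)
h = counts / numel(Im);        % normalized histogram, sum(h) == 1
mu = sum(d .* h)               % first moment of the distribution: the mean
```

This value agrees with computing the mean over all pixel values directly.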
%compute mean value from the image;
mu = mean( reshape( Im, size(Im,1)*size(Im,2), 1 ) )

The image is reshaped into a vector of size M·N × 1. No information is lost; some computations are simply easier to perform (Figures 6 and 7).
There are two other parameters for the distribution location: the median and the mode. The median of the distribution is the d value separating the higher half of the histogram from the lower half. The mode is the value that appears most often in a set of data, the one with the highest probability (the d where H(d) is highest). MATLAB has implemented functions:

%compute median from the image;
Me = median( reshape( Im, size(Im,1)*size(Im,2), 1 ) )
%compute mode from the image;
Mo = mode( reshape( Im, size(Im,1)*size(Im,2), 1 ) )

When plotted, the median is drawn as a dash-dotted red line and the mode as a dotted green line. The median often serves instead of the mean for distributions that are not Gaussian. The mode expresses the most frequent value in distributions that have only one such peak and are thus unimodal (Figures 8 and 9).
The second parameter of the distribution is the scaling parameter. It describes how far the other d values lie from the location parameter μ. The second central moment estimates the variance σ², which measures how far the d values are spread out (dispersed). An equivalent measure is the square root of the variance, the standard deviation σ, which thus also measures the dispersion of the d values.

%compute variance;
sigma2 = var( reshape( Im, size(Im,1)*size(Im,2), 1 ) )
%compute standard deviation;
sigma = std( reshape( Im, size(Im,1)*size(Im,2), 1 ) )

In case the distribution is skewed, and thus has different dispersions on the left and right sides of the plot (as seen from the location parameter), it is recommended to use the interquartile range IQR as a robust measure of scale. Using the IQR also removes the effect of outliers on the distribution dispersion.
%compute inter quartile range;
Q = iqr( reshape( Im, size(Im,1)*size(Im,2), 1 ) )

The value of the interquartile range IQR is usually larger than the standard deviation σ. However, these basic statistical parameters do not cover distributions that have more than one mode (multimodal), and they also cannot describe negative exponential distributions, which lack a location parameter. In such cases, we use a different measure of the distribution, the entropy S. Entropy is a measure of the unpredictability of information content.

Entropy
Let us start with a reminder of the form of the Shannon entropy from information theory, using image analysis terminology. Any normalized discrete probability distribution H = {h_1, h_2, ..., h_D} fulfills the conditions

h_d ≥ 0, Σ_{d=1}^{D} h_d = 1.

Usually, in an intensity image, an approximation of the probability distribution is given by the normalized histogram function H(d). H is an intensity function that gives the count of the pixels Φ(i, j) with intensity equal to d, independent of the image position (i, j) [3,4]. The histogram is normalized by the number of pixels (the 0th central moment) to fulfill these conditions. More conditions are assumed when measuring information. Information must be additive for two independent events a, b:

I(ab) = I(a) + I(b).

The information itself should depend only on the probability distribution, or on the normalized histogram function in our case. These conditions form the well-known modified Cauchy functional equation with the unique solution I(h_d) = −κ × log2(h_d). In statistical thermodynamics, the constant κ refers to the Boltzmann constant [5]. In the Hartley measure of information, κ equals one [6,7]. Let us focus on the Hartley measure. If different amounts of information occur with different probabilities, the total amount of information is the average of the individual information, weighted by the probabilities of their individual occurrences [7,8]. Therefore, the total amount of information is

I = Σ_{d=1}^{D} h_d I(h_d),

which leads us to the definition of the Shannon entropy as a measure of information:

S = −Σ_{d=1}^{D} h_d log2(h_d).

Thus, entropy is the sum of the individual information weighted by the probabilities of their occurrences.

In image analysis, the unknown probability distribution function of the intensity values is approximated via the histogram function H(d), which has to be normalized to the total number of pixels [9,10]:

h_d = H(d) / (M × N).

The Shannon entropy then allows the information content of the whole image, or just of a selected part of it, to be measured (Figures 10 and 11).
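The Shannon entropy of an image can be computed from the normalized histogram in a few lines; a sketch assuming the circuit.tif demo image and the Image Processing Toolbox:

```matlab
Im = imread('circuit.tif');
counts = imhist(Im);
h = counts / numel(Im);        % normalized histogram
h = h(h > 0);                  % empty bins contribute 0*log2(0) = 0 by convention
S = -sum(h .* log2(h))         % Shannon entropy in bits per pixel
```

The result agrees with the built-in entropy(Im).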
The entropy implemented in the MATLAB function S = entropy(Im) is the Shannon entropy.

Entropy filtration
Entropy allows the information content of the entire image to be measured. However, when we restrict the histogram computation to a subset of pixels, we obtain a partial information content that is strictly dependent on the area entering the computation (Figures 12 and 13).

Entropy filtering is based on the replacement of pixel values in the image by values of entropy.
Entropy is computed in a specified area, usually from the pixel's n-by-n symmetric neighborhood in the input image [4,11]. The shape of the neighborhood can also be defined by the user. The computed entropy is

S(i, j) = −Σ_d h_se(d) log2(h_se(d)),

where h_se is the normalized histogram of the pixel's Φ(i, j) neighborhood se(i, j). It is clear that the output image (as computed by entropy filtration) is strongly dependent on the selected area. For a small se, local disturbances are given too much weight, and the output image is noisy. On the other hand, too large an se does not preserve details, and the output image is blurred. Therefore, the key question of the filtration method is how to select a suitable neighborhood: the choice of se is always a compromise between a noisy and a blurry image. Of course, filtration can be very useful for decreasing the area of interest and thus allowing further analysis (Figures 14 and 15).
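MATLAB's Image Processing Toolbox implements this operation as entropyfilt; the neighborhood compromise can be demonstrated by varying se (the circuit.tif demo image is an assumption for the example):

```matlab
Im = imread('circuit.tif');
J_small = entropyfilt(Im, true(5));      % small se: preserves detail, but noisy
J_large = entropyfilt(Im, true(21));     % large se: smooth, but blurred
figure; imshow(J_small, []); title('entropyfilt, 5x5 neighborhood');
figure; imshow(J_large, []); title('entropyfilt, 21x21 neighborhood');
```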

Entropy thresholding and segmentation
Thresholding is a computationally cheap method that searches the intensity histogram H for a point separating the image into objects corresponding to the real objects; it extracts from the image the parts that correspond to the threshold parameter(s). Automatic threshold selection using entropy is based on the maximization of the segmentation entropy. The histogram function H(d) is separated into two parts, A and B, iteratively in d. For a candidate threshold k, the Shannon entropies of both parts are computed:

S_A = −Σ_{d=1}^{k} (h_d / p_A) log2(h_d / p_A), S_B = −Σ_{d=k+1}^{D} (h_d / p_B) log2(h_d / p_B),

where p_A = Σ_{d=1}^{k} h_d and p_B = 1 − p_A. Then, the entropy of parts A and B taken together is computed as

SV_k = S_A + S_B.

The threshold value is set to the d where SV_k is maximized [12,13]. This method uses the global histogram function; therefore, it is not sensitive to random noise contributions and successfully suppresses the noise. However, the use of thresholds also ignores local changes in the background, illumination, and non-uniformity. For images with varying conditions within the scene, thresholds generally produce losses and artifacts. The use of thresholds without any previous preprocessing, for example, light normalization, is applicable only to objects that are well separable from the background. Automatic segmentation techniques [3,4,12,14,15] are very powerful tools under easily separable conditions (Figures 16 and 17). The value d where the entropy SV is maximized represents the threshold for segmentation of the image (Figure 18).
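A minimal sketch of the described maximization for an 8-bit image (the variable names SA, SB, and SV follow the text; the circuit.tif demo image is assumed as input):

```matlab
Im = imread('circuit.tif');
h = imhist(Im) / numel(Im);              % normalized histogram h(d), d = 0..255
SV = -inf(255, 1);
for k = 1:255                            % candidate split between d = k-1 and d = k
    pA = sum(h(1:k));  pB = 1 - pA;
    if pA == 0 || pB == 0, continue; end
    hA = h(1:k)     / pA;  hA = hA(hA > 0);   % part A, renormalized
    hB = h(k+1:end) / pB;  hB = hB(hB > 0);   % part B, renormalized
    SA = -sum(hA .* log2(hA));
    SB = -sum(hB .* log2(hB));
    SV(k) = SA + SB;                     % joint entropy of parts A and B
end
[~, kmax] = max(SV);
BW = Im > (kmax - 1);                    % segmentation at the entropy-optimal d
```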

Grayscale thresholding
The entropy segmentation gives results similar to Otsu thresholding. Otsu gray-level thresholding is a nonparametric method of automatic threshold selection for image segmentation, also computed from the normalized intensity histogram H(d). To separate the histogram into two classes, the between-class variance is maximized:

σ_B²(k) = p_A(k) p_B(k) [μ_A(k) − μ_B(k)]²,

where p_A, p_B are the class probabilities and μ_A, μ_B the class mean intensities for threshold k.
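In MATLAB, Otsu's method is available directly as graythresh; a short sketch (again assuming the circuit.tif demo image):

```matlab
Im = imread('circuit.tif');
level = graythresh(Im);          % Otsu threshold, normalized to [0, 1]
BW = imbinarize(Im, level);      % binary segmentation at the Otsu threshold
figure; imshow(BW);
```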

Point Information Gain
The most interesting method is the point information gain (PIG), which asks the question: How important is one pixel for the whole image or for a selected part of it? In other words, is the occurrence of the value of one single pixel a surprise? It is predictable that background pixels will not carry a lot of information: if we discard one of them, little changes. On the other hand, objects, especially those with complicated structure, will increase the entropy at their positions. The Shannon equation evaluates the total amount of information entropy from the whole histogram. Let us evaluate the normalized image histogram H(d) and compute the Shannon information entropy S. To investigate the contribution of one single pixel with intensity value v to the total entropy, we evaluate a second histogram G(d), created without the investigated pixel:

G(d) = H(d) for d ≠ v; G(d) = H(d) − 1 for d = v.
This time we discard the value v of the investigated pixel Φ(i, j) from the computation, but only once.
Removing one single pixel of intensity value v only decreases the histogram count g(v) at its intensity position. Then, the histogram is again normalized. The probability of the intensity value v is slightly lower than the probability H(v) of the primary normalized histogram (with all pixels). The other probabilities g(d), where d is not the value of the investigated pixel, are slightly higher than the probabilities H(d) of the primary normalized histogram. Then, in the second computation of entropy, computed from the modified normalized histogram G(d),

E = −Σ_d g(d) log2(g(d)),

the individual information log2(g(d)) as well as the weights g(d) differ from those in the computation of the whole-image entropy S. Therefore, we obtain two different entropy values, S and E. Entropy S represents the measure of information in the original image; entropy E represents the measure of information in the image without the investigated pixel. The difference [16]

PIG(i, j) = S − E

refers to the difference between the entropies of the two histograms, and therefore also to the difference between the entropies of the two images (the first of which contains the investigated pixel Φ(i, j) and the second of which does not). Recall that both histograms H and G were normalized; therefore, any difference in the number of pixels in the images is immaterial. The difference PIG represents the entropy contribution of the pixel Φ(i, j), that is, the contribution of the value of the pixel Φ(i, j) to the information content of the whole image. Transforming each image pixel Φ(i, j) to this contribution gives the measure of the information carried by that pixel, the point information gain (PIG). Repeating the computation for every single pixel transforms the original image into an entropy map: an image that shows the contribution of every pixel to the whole information content of the image (Figures 19 and 20).
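A sketch of the entropy-map computation for an 8-bit image. Because removing one pixel of value v changes the histogram in the same way wherever that pixel sits, E needs to be computed only once per occurring intensity level (the image name circuit.tif is an assumption):

```matlab
Im = imread('circuit.tif');
counts = imhist(Im);
n = numel(Im);
h = counts / n;  hp = h(h > 0);
S = -sum(hp .* log2(hp));                % entropy S of the full image
PIGmap = zeros(size(Im));
for v = 0:255
    if counts(v+1) == 0, continue; end
    g = counts;  g(v+1) = g(v+1) - 1;    % histogram G without one pixel of value v
    g = g / (n - 1);                     % renormalize to the remaining pixels
    g = g(g > 0);
    E = -sum(g .* log2(g));              % entropy E without the investigated pixel
    PIGmap(Im == v) = S - E;             % PIG = S - E for every pixel of value v
end
figure; imshow(PIGmap, []);              % the entropy map
```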
It is predictable that the values of background pixels will not carry a lot of information, even if we discard one of them. On the other hand, objects, especially those with complicated structure, increase the entropy in their immediate area. According to information theory, an object occurrence produces a bigger surprise than a background occurrence, and the PIG quantifies this effect. For this reason, the details in the image are preserved: they are the surprise. For the same reason, random noise is removed: we always know it is present, and no surprise occurs (Figures 21 and 22).

Conclusion and discussion
For those who are interested in entropy processing, things are a little bit more complicated.
The PIG approach as described depends only on the histogram of pixel values; it carries no information about the pixel's position. Therefore, the area se over which the histogram function is computed need not be the whole image but can be a selected area around the investigated pixel. For instance, se can be the whole row and whole column in which the pixel is located; the difference PIG = S − E then refers to the difference between the information contents of the two crosses, that is, the contribution of the value of the pixel Φ(i, j) to its cross. Further derivations of the original PIG algorithm have been developed recently [17-19].
There also exist entropies other than Shannon's, namely at least the Tsallis-Havrda-Charvát and Rényi definitions. The Rényi entropy of order α (α > 0, α ≠ 1) is

S_α = 1/(1 − α) × log2( Σ_{d=1}^{D} h_d^α ).

The evaluation of entropy carries a heavy computational burden; therefore, it is recommended to use parallelization on a GPU. For the processing of color images, it is usual to treat each color channel independently, like a grayscale image.
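As a sketch, the Rényi entropy of an image histogram for a chosen order α (here α = 2, an arbitrary example value; circuit.tif is again an assumption):

```matlab
alpha = 2;                               % Rényi order, alpha > 0, alpha ~= 1
Im = imread('circuit.tif');
h = imhist(Im) / numel(Im);
h = h(h > 0);
S_alpha = 1/(1 - alpha) * log2(sum(h.^alpha))
% As alpha -> 1, S_alpha converges to the Shannon entropy.
```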
Overall, entropy is a representative parameter of the image, and there is still a lot of potential in its usage for processing and analysis.