Performance Assessment of Unsupervised Clustering Algorithms Combined MDL Index

A good clustering method should resist the presence of outliers and be insensitive to initialization as well as to the ordering of the input sequence. This chapter compares the performance of three unsupervised clustering algorithms: neural gas (NG), growing neural gas (GNG), and robust growing neural gas (RGNG). The NG and GNG algorithms are explained in full before they are compared with RGNG. A further comparison is based on the minimum description length (MDL) criterion: RGNG, which uses the MDL value as its clustering validity index, is compared with GNG and NG combined with MDL. Statistical measures are applied to interpret the results obtained when these algorithms are run on synthetic 2D datasets. The techniques introduced in this chapter are designed and implemented in a simple software package using a MATLAB-based graphical user interface (GUI) tool, which allows users to interact with the clustering techniques and output data easily.


Introduction
Cluster analysis [1] is a robust tool for exploring the underlying structures in data and grouping similar objects into clusters. Cluster analysis has found applications in different fields, ranging from the main tasks of data mining [2], such as scientific data exploration, spatial database applications, web analysis, marketing, medical diagnostics, and computational biology, to statistical data analysis, which is used in many fields including machine learning, pattern recognition [3], image analysis [4], information retrieval [5], and bioinformatics [6]. There are different clustering algorithms related to neural networks; the most popular are K-means, the self-organizing map (SOM), neural gas (NG), and growing neural gas (GNG) [7,8].
The goal of this work is to present a comparison among the neural gas (NG), growing neural gas (GNG), and robust growing neural gas (RGNG) approaches, which are related to neural networks, as well as to design a new simulation tool for education and scientific research using unsupervised learning methods. Because these algorithms are difficult to introduce from the literature alone, the three techniques are presented using a simple graphical user interface (GUI) model. Alziarjawey et al. [9] introduced an application of a MATLAB GUI in the medical field using the ECG signal for heart rate monitoring and PQRST detection. They introduced another application by developing a GUI-based software package, which consists of two modules using many important methods derived from linear algebra [10]. Aljobouri et al. [11] designed an educational tool for biosignal processing and medical imaging using a GUI package. The user-friendly package explained in this work can be used easily by choosing any method, changing the predefined parameters for each algorithm, and comparing the results. Hence, it can be used without any programming knowledge. The interested reader may find more technical details in our previous reports and publications [12,13].
The current study is organized as follows: Section 2 provides the unsupervised clustering algorithms. Case studies are described in Section 3. Sections 4 and 5 present the experimental implementation on the synthetic dataset and clustering package design, respectively. Finally, Section 6 concludes the paper and introduces future work.

Unsupervised clustering algorithms
In this section, a review of the NG, GNG, and RGNG algorithms is presented. Because of the length and complexity of these algorithms, flowcharts are provided for the three algorithms, along with the mathematical models, in order to make them more understandable and the related code easier to write.

NG algorithm
The NG network algorithm is a simple artificial neural network algorithm for finding optimal data representations based on reference vectors (prototype vectors). It was first introduced in 1991 [14] and is based on Kohonen's SOM [15]. Because of the dynamics of the reference vectors during the adaptation process, which spread through the data space like a gas, the algorithm was called "neural gas." Unlike methods that use the distance to the input (e.g., the Euclidean distance) directly, NG weights the adaptation by the rank of that distance. Prototypes nearer to the input are adapted more strongly, but the strength of the update depends on their rank rather than directly on the distance value.
NG has been successfully applied to clustering [16], speech recognition [17], image processing [18], vector quantization, pattern recognition, topology representation, etc. [19,20], especially where a vector quantization or data compression problem arises. It adapts the reference vectors (prototype vectors) $w_i$ without any fixed topological arrangement within the network. As a single-layered soft competitive learning neural network, NG adapts not only the winner vector for a specific input vector but also the remaining reference vectors, according to their nearness to the input vector, using a soft-max updating rule [21]. The main advantages of the NG network [22] are: (1) lower distortion error than other clustering algorithms (k-means, maximum-entropy, and SOM), (2) faster convergence to low distortion errors, and (3) obedience to a stochastic gradient descent on an explicit energy surface.
The NG algorithm is characterized by the dependence of the updating strengths for the $c$ reference vectors $w_i$ on their position ranking. If the input vector is denoted by $x$, the position ranking $(w_{i_0}, w_{i_1}, \dots, w_{i_{c-1}})$ of the reference vectors is defined by:

$$\|x - w_{i_0}\| \le \|x - w_{i_1}\| \le \dots \le \|x - w_{i_{c-1}}\|$$

so that $k_i(x, w)$, the rank of $w_i$, is the number of reference vectors that are closer to $x$ than $w_i$. The updating step for adjusting $w_i$ according to a Hebb-like learning rule is given by:

$$w_i(t+1) = w_i(t) + \varepsilon(t)\, h_\lambda(k_i(x, w))\, (x - w_i(t))$$

where:
$h(\cdot, \cdot)$: a deterministic function with some regularity conditions imposed on it.
$\varepsilon(t) \in [0, 1]$: the learning rate (step size), which characterizes the overall extent of the modification. It is given by $\varepsilon(t) = \varepsilon_i (\varepsilon_f / \varepsilon_i)^{t / \mathrm{Max\_iter}}$, where $\varepsilon_i$ and $\varepsilon_f$ are the initial and final learning rates, and Max_iter and $t$ denote the maximum number of iterations and the iteration step, respectively.
$h_\lambda(k_i(x, w)) \in [0, 1]$: accounts for the arrangement of the $w_i$ within the input space. For $h_\lambda(k) \in [0, 1]$, the exponential form $h_\lambda(k) = \exp(-k / \lambda)$ was proposed [22], as it gave the best overall result compared to other choices such as the Gaussian function.
$\lambda$: determines the number of reference vectors that significantly change their positions in an updating step; it usually decreases with the iteration step $t$ as:

$$\lambda(t) = \lambda_i (\lambda_f / \lambda_i)^{t / \mathrm{Max\_iter}}$$

The NG algorithm is closely related to the framework of fuzzy clustering methods [23]. NG uses the fuzzy membership value $h_\lambda(k_i(x, w)) / C(\lambda)$ to assign each input vector $x$ to all the reference vectors $w_i$ $(i = 1, 2, \dots, c)$, instead of the $u_{ij}$ $(2 \le i \le c,\ 1 \le j \le N)$ used in the FCM algorithm. The algorithm minimizes a cost function iteratively and builds on familiar optimization methods, essentially the gradient descent method and Newton's method. The NG cost function to optimize [22] is:

$$E_{NG} = \frac{1}{2\, C(\lambda)} \sum_{i=1}^{c} \int P(x)\, h_\lambda(k_i(x, w))\, \|x - w_i\|^2 \, d^D x$$

with

$$C(\lambda) = \sum_{k=0}^{c-1} h_\lambda(k)$$

where $P(x)$ denotes the distribution of the input vectors and $D$ their dimension. Martinetz et al. [22,26] introduced this cost function and proved that the Hebb-like updating rule can be derived from it by stochastic gradient descent. By starting with a large value of $\lambda$ and reducing it slowly, a good set of reference vectors can be obtained.
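The rank-based adaptation described above can be illustrated with a minimal Python sketch. It is not the code of the MATLAB package presented in this chapter: the function names `ng_step` and `decay` are hypothetical, the update uses the exponential form exp(−k/λ) mentioned in the text, and the exponential decay schedule for the learning rate and λ is an assumption based on the usual NG formulation.

```python
import math


def ng_step(prototypes, x, eps, lam):
    """Return the prototypes after one neural gas adaptation step for input x."""
    # Squared Euclidean distance of every prototype to the input vector.
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes]
    # rank[i] is k_i: how many prototypes lie closer to x than prototype i.
    order = sorted(range(len(prototypes)), key=lambda i: dists[i])
    rank = [0] * len(prototypes)
    for k, i in enumerate(order):
        rank[i] = k
    # Rank-based soft-max update: w_i <- w_i + eps * exp(-k_i / lam) * (x - w_i).
    return [
        [wi + eps * math.exp(-rank[i] / lam) * (xi - wi) for xi, wi in zip(x, w)]
        for i, w in enumerate(prototypes)
    ]


def decay(initial, final, t, max_iter):
    """Assumed schedule: exponential decay from `initial` (t = 0) to `final` (t = max_iter)."""
    return initial * (final / initial) ** (t / max_iter)


prototypes = [[0.0, 0.0], [10.0, 10.0]]
updated = ng_step(prototypes, x=[1.0, 1.0], eps=0.5, lam=1.0)
```

In this toy call the winner (rank 0) moves halfway toward the input, while the second prototype moves by a fraction scaled by exp(−1); every prototype is adapted, only with different strengths.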
Because of its sequential learning scheme and its neighborhood cooperation rule, NG is less sensitive to different initializations than other clustering algorithms such as k-means and FCM.
Before running the NG algorithm, its training parameters have to be defined. Figure 1 shows the flowchart of the NG algorithm. Although the NG model has many advantages, as mentioned earlier, it also has some limitations: it depends on decaying parameters that change over time, and it cannot find a network size and structure automatically or continue learning. Hence, based on the NG algorithm, the GNG algorithm was introduced by Fritzke [24,25]. GNG has an advantage over NG in its ability to modify the network topology by removing edges according to their age variable. Moreover, during the growth process associated with the neighborhood updating rule, there is no need for the neighborhood sorting step [24,25]. GNG can find a network size and structure automatically and can continue learning, adding units and connections, until a performance criterion is fulfilled.

GNG algorithm
In the GNG algorithm, Fritzke [24,27] proposed changing the number of units (mostly increasing it) during training, yielding a SOM-like network with a variable topological structure [24,25]. This growth mechanism, inherited from the earlier growing cell structures model [27], is combined with topology formation rules based on competitive Hebbian learning (CHL) [26] to form a new model.
The GNG algorithm needs only constant parameters; the number of prototypes does not have to be set in advance. The main idea behind GNG is to start with a minimal network size and successively insert new neurons and connections into a growing structure, using vector quantization, until the desired characteristics of the model are fulfilled (e.g., net size, time limit, predefined number of neurons inserted, quality, or some performance measure). To determine where to insert new units, local error measures are gathered during the adaptation process.
Each new unit is inserted near the unit that has accumulated the highest error, and a connection between the winner and the second nearest neuron is formed using the competitive Hebbian learning algorithm.
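The insertion step can be sketched as follows. The text only states that the new unit is placed near the unit with the highest accumulated error; the details below (placing it halfway between that unit and its highest-error neighbor, rewiring the edge, and scaling the errors by a factor `alpha`) follow Fritzke's usual formulation and are assumptions here, as are the function name `insert_unit` and the data layout.

```python
def insert_unit(units, errors, edges, alpha=0.5):
    """Insert one unit next to the unit with the highest accumulated error.

    units:  dict unit_id -> position (list of floats)
    errors: dict unit_id -> accumulated local error
    edges:  set of frozenset({a, b}) connections
    """
    q = max(errors, key=errors.get)                 # highest accumulated error
    neighbors = [next(iter(e - {q})) for e in edges if q in e]
    f = max(neighbors, key=errors.get)              # neighbor of q with highest error
    r = max(units) + 1                              # id of the new unit
    # Assumed placement: halfway between q and f.
    units[r] = [(a + b) / 2 for a, b in zip(units[q], units[f])]
    # Replace the edge q-f by the two edges q-r and r-f.
    edges.discard(frozenset((q, f)))
    edges.add(frozenset((q, r)))
    edges.add(frozenset((r, f)))
    # Reduce the errors of q and f; the new unit starts with the reduced error of q.
    errors[q] *= alpha
    errors[f] *= alpha
    errors[r] = errors[q]
    return r


units = {0: [0.0, 0.0], 1: [4.0, 0.0], 2: [0.0, 4.0]}
errors = {0: 5.0, 1: 3.0, 2: 1.0}
edges = {frozenset((0, 1)), frozenset((0, 2))}
new_id = insert_unit(units, errors, edges)
```

The sketch assumes that the highest-error unit has at least one neighbor, which holds in GNG because units without edges are removed.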
Before running the GNG algorithm, some parameters have to be defined, among them $\varepsilon_b$ and $\varepsilon_n$, the constant learning rates for the winner and its topological neighbors, respectively. Prototypes that do not win over long time intervals may be detected by tracing the changes of an age variable associated with each edge. Hence, the GNG algorithm has an advantage over the NG algorithm through its ability to modify the network topology by removing edges according to their age variable (edges not refreshed for a time interval $\alpha_{max}$), together with the resulting nonfunctional prototypes. In the GNG algorithm, the growth process associated with the neighborhood updating rule is somewhat similar to the neighborhood-decreasing procedure in NG. However, unlike the NG algorithm, there is no need for the neighborhood sorting step.
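The edge-aging mechanism can be illustrated with a short sketch. The function name `age_and_prune`, the dictionary layout, and the toy values are hypothetical; the parameter `max_age` plays the role of the interval α_max in the text.

```python
def age_and_prune(edge_age, winner, second, max_age):
    """Age the winner's edges, refresh the winner-second edge, drop stale edges.

    edge_age: dict frozenset({a, b}) -> age of that edge
    Returns the set of units that still have at least one edge.
    """
    # Every edge emanating from the winner gets one step older.
    for edge in edge_age:
        if winner in edge:
            edge_age[edge] += 1
    # Competitive Hebbian learning: connect (or refresh) winner and second nearest.
    edge_age[frozenset((winner, second))] = 0
    # Remove edges that have not been refreshed for more than max_age steps.
    for edge in [e for e, age in edge_age.items() if age > max_age]:
        del edge_age[edge]
    # Units missing from this set have lost all edges and can be removed.
    return set().union(*edge_age)


edge_age = {frozenset((0, 1)): 0, frozenset((0, 3)): 3}
still_connected = age_and_prune(edge_age, winner=0, second=1, max_age=3)
```

In this toy call the edge between units 0 and 3 exceeds the age limit and is removed, so unit 3 no longer appears among the connected units and would be discarded as a nonfunctional prototype.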

RGNG algorithm
Any robust algorithm should have the following features [28]:
1. It should achieve good precision for the given model.
2. Small deviations from the model assumptions should weaken the performance only to a small degree.
3. Large deviations from the model assumptions should not cause a disaster.
When classical prototype-based clustering algorithms are used, the major robustness problems are the sensitivity to initialization, the sensitivity to the order of the input vectors, and the presence of many outliers, although each algorithm performs well with regard to condition 1. Due to the growth scheme associated with the GNG algorithm, the algorithm faces the "dead nodes" problem. This occurs when inappropriate initializations lead to some prototypes that never win during the training process.
Even with initialization-insensitive clustering methods, good clustering results may not be obtained if the order of the input sequence is not chosen properly. Besides the sensitivity to initialization and to the order of the input vectors, another problem arises from the presence of many outliers. The GNG network may fail to differentiate the outliers from the inliers through the original prototype updating rule when many outliers exist in a dataset. These outliers can be regarded as input vectors that differ from the data points belonging to the ordinary clusters (inliers).
To overcome these limitations of the GNG algorithm, a novel robust clustering algorithm was proposed [29] within the GNG structure, namely the robust growing neural gas (RGNG) network. RGNG is more robust than the original GNG algorithm while inheriting its properties. It also incorporates several robust strategies, such as an outlier-resistant scheme, adaptive modulation of learning rates, and a cluster repulsion method.
Therefore, compared to the GNG network, the RGNG network is insensitive to initialization, input sequence ordering, and the presence of outliers, and it can determine the optimal number of clusters. The minimum description length (MDL) value is used with RGNG as the clustering validity index [30,31]. The MDL value is used to find the optimal number of clusters and their center positions, which correspond to the smallest MDL. The optimal number of clusters is thus determined automatically by searching for the extreme value of the MDL measure throughout the network growing process.
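The selection step itself is a simple search for the minimum once an MDL value is available for every network size. The sketch below assumes that these values have already been computed during the growing process; the function name `select_by_mdl` and the numbers in the example are made up for illustration and are not results from this chapter.

```python
def select_by_mdl(mdl_by_size):
    """Return (number of prototypes, MDL value) for the smallest MDL value.

    mdl_by_size: dict mapping a number of prototypes to the MDL value
    recorded when the growing network reached that size.
    """
    best_size = min(mdl_by_size, key=mdl_by_size.get)
    return best_size, mdl_by_size[best_size]


# Illustrative values only (not measured data).
toy_mdl = {2: 3.4, 3: 3.0, 4: 2.6, 5: 2.8, 6: 3.1}
best = select_by_mdl(toy_mdl)
```

Keeping the prototype positions recorded at each size alongside the MDL values would allow the cluster centers for the selected size to be recovered in the same way.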
Before running the RGNG algorithm, its training parameters have to be defined.

Case studies
In the presented work, the performance of the NG, GNG, and RGNG algorithms on synthetic data is described. The case studies are carried out to compare the performance of the three approaches. The experimental results on public synthetic datasets are presented in the next section. The comparison of different neural networks, and the need for performance parameters based on statistical evaluation, has recently been highlighted by a number of researchers.
There are four parameters that are used in this work to evaluate the performance of the proposed clustering techniques: the classification rate (CR), partition quality (PQ), minimum cluster number (MCN), and mean square error (MSE). Each index of the performance measures is explained in the following sections.

Classification rate
This index refers to the classification rate (CR) for the whole dataset, where each data point is classified according to its nearest prototype. CR is based on a majority voting classifier [32], which labels all prototypes using a simple voting mechanism. In the proposed technique, the number of prototypes is small, so the resulting CR will not be high.
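A minimal sketch of this measure is given below, assuming that true class labels are available for the data points and that each prototype receives the majority label of the points it wins. The function names and the toy data are hypothetical.

```python
from collections import Counter


def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, prototypes[i])),
    )


def classification_rate(points, labels, prototypes):
    """Fraction of points whose label matches the voted label of their prototype."""
    assigned = [nearest(prototypes, x) for x in points]
    # Majority vote: each prototype takes the most frequent label among its points.
    votes = {}
    for p, y in zip(assigned, labels):
        votes.setdefault(p, Counter())[y] += 1
    proto_label = {p: c.most_common(1)[0][0] for p, c in votes.items()}
    hits = sum(proto_label[p] == y for p, y in zip(assigned, labels))
    return hits / len(points)


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
labels = ["a", "a", "b", "c", "c"]
cr = classification_rate(points, labels, prototypes=[[0, 0], [10, 10]])
```

With two prototypes for three classes, the single point of class "b" is necessarily misclassified, which illustrates why a small number of prototypes limits the attainable CR.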

Partition quality
This index refers to the average partition quality (PQ) measurement, which is averaged over all the independent runs in the experiments. PQ was defined by Hamerly and Elkan [33] as:

$$PQ = \frac{\sum_{i=1}^{n_{cs}} \sum_{j=1}^{m_{ct}} p(i, j)^2}{\sum_{i=1}^{n_{cs}} p(i)^2}$$

where $n_{cs}$ is the number of classes (natural clusters), $m_{ct}$ is the number of detected clusters, and $p(i)$ is the true probability of class $i$. The $p(i, j)$ term represents the frequency-based probability that a data point is labeled $i$ by the true class labels and $j$ by the cluster labels. The quality is normalized by the sum of the squared true probabilities. This statistic is related to the Rand statistic for comparing partitions [34]. The PQ index is maximized when the number of clusters $m_{ct}$ is correctly detected and induces the same partition as the $n_{cs}$ classes, i.e., $m_{ct} = n_{cs}$, so that all points in each cluster are the same as those in one of the natural clusters.
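As an illustration, PQ can be computed from two label lists. The sketch assumes the usual form of the Hamerly and Elkan measure: the sum of the squared joint frequencies p(i, j) divided by the sum of the squared true class frequencies p(i). The function name `partition_quality` and the toy labels are hypothetical.

```python
from collections import Counter


def partition_quality(true_labels, cluster_labels):
    """Sum of squared joint frequencies over sum of squared class frequencies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))   # counts for p(i, j)
    marginal = Counter(true_labels)                      # counts for p(i)
    numerator = sum((count / n) ** 2 for count in joint.values())
    denominator = sum((count / n) ** 2 for count in marginal.values())
    return numerator / denominator


# Clusters reproduce the natural classes exactly.
pq_perfect = partition_quality([0, 0, 1, 1], ["x", "x", "y", "y"])
# Class 0 is split across two clusters.
pq_split = partition_quality([0, 0, 1, 1], ["x", "z", "y", "y"])
```

The first call gives the maximum value of 1, while splitting a natural cluster across two detected clusters lowers the value, in line with the behavior described above.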

Minimum cluster number
The minimum cluster number (MCN) is the average number of clusters detected by the techniques. The MCN indicates the ability of the techniques to find the underlying natural clusters. During the training of the techniques, and according to the MCN value, only the proposed RGNG approach can find the actual number of clusters successfully.
During the growing process, this value is defined as the number of natural clusters in which the algorithm places at least one prototype when the number of prototypes in the network reaches the actual number of clusters. Cluster numbers detected by NG and GNG during the growing process deviate from the actual value of clusters when the number of prototypes is the same as the actual number of clusters.
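One possible way to compute this index is sketched below. It assumes that the actual cluster centers of the synthetic data are known and that a prototype counts as placed in the natural cluster whose center is nearest to it; this assignment rule, the function names, and the toy coordinates are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def covered_clusters(prototypes, centers):
    """Number of natural clusters that contain at least one prototype."""
    covered = {
        min(range(len(centers)), key=lambda j: sq_dist(w, centers[j]))
        for w in prototypes
    }
    return len(covered)


def mean_cluster_number(runs, centers):
    """Average of covered_clusters over several independent runs."""
    return sum(covered_clusters(p, centers) for p in runs) / len(runs)


centers = [[0, 0], [10, 0], [0, 10]]
run_a = [[0.1, 0.0], [9.0, 1.0], [1.0, 9.0]]   # one prototype per natural cluster
run_b = [[0.0, 0.0], [1.0, 1.0], [9.0, 0.0]]   # two prototypes share one cluster
mcn = mean_cluster_number([run_a, run_b], centers)
```

In the second run one natural cluster receives no prototype even though the number of prototypes equals the number of clusters, which is the kind of deviation the index is meant to expose.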

Mean square error
Mean square error (MSE) is another criterion used for evaluating the performance of the proposed clustering technique. The MSE value represents the mean distance between the current nearest prototypes' positions resulting from the application of the techniques and the actual cluster centers.
The average MSE value in this experiment is higher for NG and GNG techniques than the RGNG technique. This indicates that the RGNG approach achieves the best accuracy with the strongest stability among the three approaches.
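A minimal sketch of this criterion follows. It assumes that the actual cluster centers are known and takes the MSE to be the mean squared Euclidean distance between each actual center and its nearest prototype; the exact definition used in the package may differ, and the function names and toy values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def prototype_mse(prototypes, centers):
    """Mean squared distance from each actual center to its nearest prototype."""
    return sum(min(sq_dist(c, w) for w in prototypes) for c in centers) / len(centers)


centers = [[0.0, 0.0], [10.0, 0.0]]
mse_close = prototype_mse([[0.0, 1.0], [10.0, 2.0]], centers)
# Both prototypes sit in the first cluster, so the second center is far from any prototype.
mse_missed = prototype_mse([[0.0, 1.0], [1.0, 0.0]], centers)
```

A technique that leaves a natural cluster without a prototype is penalized heavily by this measure, which is consistent with the higher MSE values reported for techniques that miss clusters.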

Experimental results with synthetic data
Six different types of 2D synthetic datasets [29,35] are used in this work: the snail, screw, ring, set3, set5, and set25 datasets. Figures 4-6 show the plots of NG, GNG, and RGNG clustering with three of these 2D synthetic datasets (screw, set5, and snail) as an example. The number of neurons is selected randomly as N = 7, 10, and 12.
These figures alone do not clearly differentiate between the methods. Hence, four parameters are used in this work to evaluate the performance of the proposed clustering techniques: CR, PQ, MCN, and MSE, introduced in the previous section. For a fair comparison with RGNG, the MDL criterion is added to the NG and GNG techniques. The training results of these techniques with synthetic data are shown in Table 1, where the number of neurons is chosen randomly as N = 7, 10, and 12.
In agreement with the literature [29,36], the clustering results in Table 1 show that the RGNG approach is insensitive to different initializations and to the presence of outliers. In these techniques, the number of neurons used is small, so the CR values registered in the table are not high. In all three clustering techniques, the number of neurons was equal to the actual cluster number. RGNG can effectively locate the actual number of clusters compared to the other two methods; NG and GNG fail with higher cluster numbers in the synthetic case.
The registered MCN values show that the number of detected prototypes or clusters in the RGNG technique is lower than in the others, which means that its ability to group data into the actual number of clusters is better than that of the other two techniques. For example, when N is set to 10, the MCN value for RGNG is 10, which is less than the values for NG and GNG. The MCN value for RGNG equals the number of neurons, 10, and the same holds for the other N values, whereas the MCN values for NG and GNG deviate from the actual cluster number.
Regarding the PQ value, the RGNG approach possesses higher PQ values than the NG and GNG techniques. For example, when N is set to 12, the PQ value for RGNG is 0.9807, which is higher than the values for NG and GNG. These high PQ values indicate that the RGNG technique has a better partitioning quality than the others and finds more representative clusters.
Moreover, the RGNG method can find all the natural clusters during the growing stage with the correct number of prototypes. Hence, its MSE values are lower, which indicates that the RGNG technique has better robustness. For example, when N is set to 7, the MSE value for RGNG is 2.6493e+004, which is lower than the values for NG and GNG. The NG and GNG techniques may not detect all the actual clusters; hence, they yield higher MSE values.
The MDL value is one of the popular information-theoretic evaluation measures used as clustering validity indexes [37]. The MDL criterion makes it possible to find the optimal number of clusters and their center positions, which correspond to the smallest MDL value. The average MDL values during the growth stages are plotted versus the number of clusters or prototypes. Figure 7 shows the curves for the NG and GNG techniques combined with the MDL criterion, as well as for the RGNG approach, on a synthetic dataset for different numbers of neurons, which are selected randomly as N = 7, 10, and 12. For each technique, the detected cluster number is the one corresponding to its smallest MDL value.
On average, RGNG recorded the smallest MDL value compared with NG and GNG combined with the MDL principle. For example, in Figure 7(b), the smallest MDL value, 2.65, is obtained from running RGNG when N is equal to 4, whereas for the same N = 4, a higher MDL value of 2.77 is recorded from running NG and GNG. From the presented figures, it is concluded that the proposed RGNG approach is insensitive to different initializations and the presence of outliers and can successfully find the actual number of clusters.

Prototype-based clustering package
The techniques introduced in this work are designed and implemented in a simple software package tool that allows users to interact with the clustering techniques and output data easily [13]. Figure 8 shows the main window with the most important features of the designed prototype-based clustering software package.

Selection data:
The user can select any one of the synthetic 2D datasets from the pop-up menu. The Ring dataset, a 2D synthetic dataset, is selected as an example in Figure 8.

Load data:
The selected data are loaded, and all information related to the selected data ("Dimension," "Name," and "Type of Data") appears in the "info" window. The dimension of the selected "Ring" data is 400 × 2 double. The selected data are plotted in Sketch1 inside the main clustering window of Figure 8. Figure 9 shows some of the 2D synthetic datasets that were used in this work. The information related to each plot is shown in the "info" window on its left side.

Selection technique:
The user can select one of the clustering techniques NG, GNG, or RGNG. The RGNG technique is selected as an example for the training in Figure 8 with Ring data and N = 18, which is selected randomly.
Before clicking on the "Apply NG," "Apply GNG," or "Apply RGNG" button, the training parameters related to each technique must be defined. As explained in Section 3, the training parameters must be set carefully within the limited range. The number of neurons (N) as well as the other parameters related to the selected technique must be defined. Another example of using the RGNG technique with the Set3 dataset is shown in Figure 10. Once the training has started, the program plots the output of the running technique in Sketch1.
In Sketch1, the Set3 data are shown together with solid red circles, which represent the actual cluster centers.

MDL plot:
This panel is related to plotting MDL values versus the number of neurons (N) running the RGNG, GNG, and NG combined with MDL criterion. This panel includes three main buttons: "No. of neurons (N)," "Technique selection for MDL value," and "Apply MDL versus N" buttons, as shown in Figure 11.
After defining the number of neurons (N), one, two, or three of the training techniques have to be selected for comparing the MDL results. The "Technique selection for MDL value" pop-up menu offers seven selections: the result of each technique alone, of any two of them, or of all three, for easy comparison. After clicking on the "Apply MDL versus N" button, the MDL values are plotted with respect to the number of neurons (N) in Sketch2.
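The seven menu selections correspond to all non-empty subsets of the three techniques, which can be checked with a short sketch. The function name `comparison_choices` is hypothetical, and the ordering of the entries is not taken from the package.

```python
from itertools import combinations


def comparison_choices(techniques):
    """All non-empty combinations of the given techniques, smallest first."""
    return [
        combo
        for size in range(1, len(techniques) + 1)
        for combo in combinations(techniques, size)
    ]


choices = comparison_choices(["NG", "GNG", "RGNG"])
```

Three single techniques, three pairs, and one triple give the seven entries mentioned above.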

Conclusions
A simple, user-friendly software package is designed and implemented as an automatic clustering model for any dataset, to be used as part of a neural network course. The NG, GNG, and RGNG algorithms are run in the same package using a MATLAB-based graphical user interface (GUI) tool. This visual tool lets students and researchers visualize the desired results using plots obtained with the click of a few buttons. The performance of these algorithms on 2D synthetic datasets is reported using statistical measures to explain the meaning of the output results. These results show that RGNG is better than NG and GNG with respect to insensitivity to initialization as well as to the presence of outliers. RGNG enhances GNG to be more robust toward noisy input datasets by using the MDL criterion. Hence, compared with NG and GNG, RGNG solves the problem of finding the optimal number of clusters.
For future research directions, other unsupervised or supervised clustering algorithms may be used in the laboratory experiments. Another research direction is to apply the comparison among the three clustering algorithms to real multimodal datasets in medical applications.
The package results could also be shared on websites using ASP.NET, so that users can run the applications without installing MATLAB or any special program, using only a Web browser.