Clustering Algorithms for Incomplete Datasets

Many real-world datasets suffer from the problem of missing values. Several methods have been developed to deal with this problem, many of which fill the missing values with a fixed value based on a statistical computation. In this research, we developed new versions of the k-means and the mean shift clustering algorithms that deal with datasets with missing values without filling in those values. We developed a new distance function that is able to compute distances over incomplete datasets. The distance is computed based only on the mean and variance of the data for each attribute; as a result, the runtime complexity of our computation is O(1). We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms, using our distance and the suggested mean computations, to three other basic methods. Our experiments show that the developed algorithms using our distance function outperform the existing k-means and mean shift using the other methods for dealing with missing values.


Introduction
Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on.
In this research, we adapted two popular clustering algorithms to run over incomplete datasets: (1) the k-means clustering algorithm [1] and (2) the mean shift clustering algorithm [2].
Based on [3][4][5][6], there are three main types of missing data:

1. Missing completely at random (MCAR): the probability that a value is missing does not depend on any of the data values;
2. Missing at random (MAR): the probability that a value is missing may depend on some known values, but does not depend on the missing values themselves;
3. Not missing at random (NMAR): the probability that a value is missing depends on the value that would have been observed.
There are two basic types of methods for dealing with incomplete datasets. (1) Deletion: methods from this category ignore all the incomplete instances; they may change the distribution of the data by decreasing the volume of the dataset [7]. (2) Imputation: in these methods, the missing values are replaced with known values according to a statistical computation. These methods convert the incomplete data to complete data, and as a result, existing machine learning algorithms can be run as if they were dealing with complete data.
One of the most common approaches in this domain is the mean imputation (MI) method, which replaces each incomplete data point with the mean of the data. This method has several obvious disadvantages: (a) using a fixed instance to replace all the incomplete instances changes the distribution of the original dataset, and (b) ignoring the relationships among attributes biases the performance of subsequent data mining algorithms. These problems arise because all the incomplete instances are replaced with a single fixed one. A variant of this method replaces the missing values based only on the distribution of each attribute: the algorithm replaces each missing value with the mean of its attribute (MA) rather than of the whole instance [8]. In the case of discrete values, the missing value is replaced by the most common value in the attribute (MCA) [9] (i.e., filling the unknown values of the attribute with the value that occurs most often for that attribute). All these methods ignore the other possible values of the attribute and their distribution, and represent the missing value with a single value, which is wrong in real-world datasets.
Finally, the k-nearest neighbor imputation method [10,11] estimates the values that should be imputed based on the k nearest neighbors, computed using only the known values. The main obstacle of this method is its runtime complexity.
We can summarize the main drawbacks of the suggested methods as: (1) inability to approximate the missing value well and (2) inefficiency in computing the suggested value. Our suggested distance [12] between two points, which may include missing values, is not only efficient but also takes into account the distribution of each attribute.
To do that, the computation procedure takes into account all the possible values of the missing entry together with their probabilities, which are derived from the attribute's distribution. This is in contrast to the MCA and MA methods, which replace each missing value only with the mode or the mean of its attribute.
There are three possible cases for a pair of values: (a) both values are known: in this case, the distance is computed as the Euclidean distance; (b) both values are missing; and (c) one value is missing. In the last two cases, the distance is computed based only on the mean and the variance of the attribute. As a result, the runtime of the developed distance is O(1), the same as that of the Euclidean distance.
In this research, we integrated this distance function in order to develop the k-means and the mean shift clustering algorithms. To this end, we derived two more formulas: one to compute the mean (for the k-means algorithm) and one to compute the gradient of the locally estimated density (for the mean shift clustering algorithm).
The developed algorithms yield better results than the other methods and preserve the runtime of the algorithms that deal with complete data, as can be seen in the experiments. We experimented on six standard numerical datasets from different fields from the Speech and Image Processing Unit [13]. Our experiments show that the performance of the developed algorithms using our distance function was superior to that of the other methods.
This chapter is organized as follows. A review of our distance function (MD_E) is given in Section 2. The mean computation is presented in Section 3. Section 4 describes several directions for integrating the MD_E distance and the computed mean within the k-means clustering algorithm. The mean shift clustering algorithm is presented in Section 5, and Section 5.1 describes how to integrate the MD_E distance and the derived mean shift vector within it. Experimental results of running the developed clustering algorithms are presented in Section 6. Finally, our conclusions and future work are presented in Section 7.

Our distance measure
Firstly, we give a short overview of the basic distance function, developed in [2], that is able to compute distances between points with missing values.
Let A ⊆ R^K be a set of points. For the i-th attribute A_i, the conditional probability for A_i is computed according to the known values of this attribute from A (i.e., P(A_i) ~ χ_i), where χ_i is the distribution of the i-th coordinate.
Given two sample points X, Y ∈ R^K, the goal is to compute the distance between them. Let x_i and y_i be the i-th coordinate values of points X and Y, respectively. There are three possible cases for the values of x_i and y_i:

1. Both values are known: the distance between them is defined as the Euclidean distance.

2. One value is missing: Suppose that x_i is missing and the value y_i is given. Since the value of x_i is unknown, we cannot compute the distance using the Euclidean distance equation. Instead, we compute the expectation of all the distances between the given value y_i and all the possible values of attribute i according to its distribution χ_i.
Therefore, we approximate the mean Euclidean distance (MD_E) between y_i and the missing value m_i as:

MD_E(y_i, m_i) = E[(x − y_i)²] = ∫ (x − y_i)² p(x) dx     (1)

That is, to measure the distance between the known value y_i and an unknown value, the algorithm computes the expected distance between y_i and all the possible values of the missing entry. This computation does not take into account possible correlations between the missing values and the other known values (missing completely at random, MCAR), and the probability is computed according to the whole dataset. The resulting mean Euclidean distance is:

MD_E(y_i, m_i) = (y_i − μ_i)² + (σ_i)²     (2)

where μ_i and (σ_i)² are the mean and the variance of all the known values of the attribute.
3. Both values are missing: In this case, in order to measure the distance, we should compute the distances between all possible pairs of values, one for each missing value x_i and y_i, where both values are drawn from the distribution χ_i.

Then, we compute the expectation of the Euclidean distance over the selected values, as we did for the one-missing-value case. As a result, the distance is:

MD_E(m_i, m_i) = E[(x − y)²] = (E[x] − E[y])² + Var(x) + Var(y)

As x and y belong to the same attribute, E[x] = E[y] ≕ μ_i and σ_x = σ_y ≕ σ_i. Thus:

MD_E(m_i, m_i) = 2(σ_i)²

As we mentioned, all these computations assume that the missing data are MCAR. However, in real-world datasets, the missing data are often MAR. In this case, the probability p(x) depends on the other observed values, and the distance is computed as:

MD_E(y_i, m_i) = (y_i − μ_i(x | x_obs))² + (σ_i(x | x_obs))²

where x_obs denotes the observed attributes of point X, and μ_i(x | x_obs) and (σ_i(x | x_obs))² are the conditional mean and variance, respectively.
On the other hand, in the case that the missing values are NMAR, the probability p(x) used in Eq. (1) is computed based on this information, and the distance becomes:

MD_E(y_i, m_i) = ∫ (x − y_i)² p(x | m_i) dx

where p(x | m_i) is the distribution of x when x is missing.
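The three cases above can be sketched in a few lines of Python. This is a minimal illustration under the MCAR assumption; the function and variable names are ours, not those of the original implementation, and NaN marks a missing value:

```python
import math

def md_e(x, y, mean, var):
    """Squared MD_E distance between two points with possible missing
    values (NaN). mean[i] and var[i] are the mean and variance of the
    known values of attribute i (MCAR assumption)."""
    total = 0.0
    for xi, yi, mu, s2 in zip(x, y, mean, var):
        x_missing, y_missing = math.isnan(xi), math.isnan(yi)
        if not x_missing and not y_missing:   # both known: Euclidean term
            total += (xi - yi) ** 2
        elif x_missing and y_missing:         # both missing: 2 * sigma^2
            total += 2 * s2
        else:                                 # one missing: (v - mu)^2 + sigma^2
            v = yi if x_missing else xi
            total += (v - mu) ** 2 + s2
    return total
```

Each coordinate costs O(1), so the full distance costs the same as the Euclidean distance.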

Mean computation
Since one of our goals is to develop a k-means clustering algorithm over incomplete datasets, we need to derive a formula to compute the mean of a given set that may contain incomplete points. We derive this formula based on our distance function MD_E.
Let A ⊆ R^K be a set of n points that may contain points with missing values. Then, the mean of this set is defined as:

mean(A) = argmin_{x ∈ R^K} Σ_{i=1}^{n} distance(x, p_i)

where p_i ∈ A denotes each point of the set A, and distance(·, ·) is a distance function.
Let f(x) be the multidimensional function f : R^K → R defined as:

f(x) = Σ_{i=1}^{n} Σ_{j=1}^{K} (x^j − p_i^j)²

where x^j is coordinate j of x and p_i^j is coordinate j of point p_i. Since each point p_i may contain missing attributes, according to the definition of the MD_E distance in the previous section, f(x) becomes:

f(x) = Σ_{j=1}^{K} [ Σ_{i : p_i^j known} (x^j − p_i^j)² + Σ_{i : p_i^j missing} ((x^j − μ_j)² + (σ_j)²) ]

The mean x̄ is the solution of f′(x) = 0; in the multidimensional case, ∇f(x) = 0, where ∇f is the gradient of f. Firstly, we deal with one coordinate, and then we generalize to the other coordinates.
Setting the derivative with respect to coordinate l to zero, we simply get:

x̄^l = (1/n) ( Σ_{i : p_i^l known} p_i^l + m_l · μ_l ) = μ_l

where m_l is the number of points whose l-th coordinate is missing and μ_l is the mean of the known values of coordinate l. Repeating this for all the coordinates yields x̄ = (μ_1, …, μ_K). In other words, each coordinate of the mean is the mean of the known values of that coordinate.
In the same way, we derive a formula for computing the weighted mean for each coordinate l, yielding:

x̄^l = ( Σ_{i : x_i^l known} w_i · x_i^l + Σ_{i : x_i^l missing} w_i · μ_l ) / Σ_{i=1}^{n} w_i

where w_i is the weight of point x_i. That is, in order to compute the weighted mean of a set of numbers, some of which are unknown, we must distinguish between known and unknown values: if the value is known, we multiply it by its weight; if it is missing, we replace it with the mean of the known values and then multiply it by the matching weight.
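The mean and weighted mean of this section can be sketched directly in Python (a minimal sketch with our own function names; NaN marks a missing value):

```python
import math

def incomplete_mean(points):
    """Coordinate-wise mean of a set of points with NaN for missing
    values: each coordinate is the mean of its known values."""
    dim = len(points[0])
    out = []
    for l in range(dim):
        known = [p[l] for p in points if not math.isnan(p[l])]
        out.append(sum(known) / len(known))
    return out

def weighted_incomplete_mean(points, weights):
    """Weighted mean: a missing value is replaced by the mean of the
    known values of its coordinate before weighting."""
    mu = incomplete_mean(points)
    dim = len(points[0])
    total_w = sum(weights)
    out = []
    for l in range(dim):
        acc = sum(w * (mu[l] if math.isnan(p[l]) else p[l])
                  for p, w in zip(points, weights))
        out.append(acc / total_w)
    return out
```

With unit weights, the weighted mean reduces to the coordinate-wise mean of the known values, as derived above.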

k-Means clustering using the MD_E distance
Based on the derived formulas, the MD_E distance and the mean, our aim in this research is to develop k-means clustering algorithms for incomplete datasets [1].
The MD_E distance and the mean are general and can be integrated within any algorithm that computes distances or means. In this section, we describe our proposed method for integrating those formulas within the framework of the k-means clustering algorithm.
We developed three different versions of k-means. For simplicity, we assume that all the points are from R^2. There are two ways to treat incomplete points. The first considers each point as a single point; this version is similar to the GMM algorithm described in [14,15]. The second way is to replace each incomplete point with a set of points according to the data distribution (this is the basis of the other two methods). As will be shown in our experiments, the latter methods outperform the first algorithm.
The k-means clustering algorithm is constructed from two basic steps: (1) associate each point with its closest centroid, and then (2) update each centroid based on the new association. Let D be a dataset that may contain points with missing values. In the first step, the MD_E distance is used to compute the distances between each data point and the centroids in order to associate each point with the closest centroid. This association is common to all three versions. However, there are several possible ways to then compute the new centroids of the clusters. We use Figure 1(a) to illustrate those possibilities. In this example, we see two clusters (C1 is the yellow cluster and C2 is the brown cluster). Our goal is to calculate the center of each cluster; as an example, we deal only with C1. If none of the instances contain missing values, the centroid is computed using the Euclidean mean formula, resulting in the magenta star.
However, when the points associated with a given cluster include incomplete points, it is not clear how to compute the mean. In the given example, let (x_0, ?) (i.e., the red star) be a point with a missing y value and x = x_0. This point was associated with C1's cluster using the MD_E distance. It is important to note that, because we use the MD_E distance, we are able to associate incomplete points with the closest centroid even though their exact geometric locations are unknown.
On the other hand, using the MD_E distance is similar to using the MA method with the Euclidean distance, where the point (x_0, ?) is replaced with (x_0, μ_y). It is clear that the difference between the two methods is only the variance of the known values of coordinate y, a fixed value that does not influence the association result.
The naïve method computes the new centroid by replacing the point with the missing value with the set of all possible points that satisfy x = x_0, where Y_possible denotes all the possible values for attribute Y, and then computing the mean of these points (C1_real and (x_0)_possible), where each point from C1_real has weight one and each point from (x_0)_possible has weight 1/|Y_possible|. Let C1_real be the set of all the data points without missing values that are associated with the C1 cluster. As a result, the weighted mean of C1 is identical to the Euclidean mean when the missing point is replaced with (x_0, μ_y), and is equivalent to the MA method when (x_0, μ_y) is associated with C1. As a result, the real centroid of the cluster (the magenta star) moves to the green star, as described in Figure 1(b), where not all the blue "+" marks belong to C1.
As a result, the mean computation must distinguish between two possible methods. The first method (which we call k-means-MD_E) takes into account all the possible points whose y coordinates are the y coordinates of the real data points from the yellow cluster, in addition to the real points within the yellow circle. As a result, the mean of this set is computed based on all the real points C1_real and C1_possible^(x_0), where

C1_possible^(x_0) = { (x_0, y) | (x, y) ∈ C1_real }, each point with weight 1/|C1_real|.

Computing the new centroid as

mean(C1) = ( Σ_{p ∈ C1_real} p + (x_0, μ_y^C1) ) / (|C1_real| + 1)     (3)

where μ_y^C1 is the mean of the y coordinates of C1_real, yields not only the same centroid as using the Euclidean distance, but also preserves the runtime of the standard k-means using the Euclidean distance.
The second method (which we call k-means-HistMD_E): In this case, we first associate each of the points from (x_0)_possible with its nearest center, and only after that compute a weighted mean. That is, to compute the mean we take into account all the real points C1_real, in addition to PC1_possible, where

PC1_possible = { (x_0, y) | y ∈ Y_possible and (x_0, y) is associated with cluster C1 }.

According to this method, we use all the points from (x_0)_possible that are associated with the C1 cluster, and not only the points from (x_0)_possible whose y coordinates come from the real points associated with that cluster. Since the weights are computed using the entire dataset, we cannot use Eq. (3).
To this end, our suggested method for implementing the mean computation is simply to replace each point with a missing value by the |Y_possible| points, each with weight 1/|Y_possible|, and run weighted k-means on the new dataset. This method, on the one hand, is simple to implement, but on the other hand, its runtime is high, since each point with, for example, a missing y value is replaced by all |Y_possible| points. As a result, the size of the dataset becomes |D_real| + |Y_possible| · (number of incomplete points), where D_real is the set of data points that do not contain missing values. In order to reduce the runtime complexity, we turn to the Voronoi diagram. Based on the Voronoi diagram, the data space is partitioned into k subspaces (as can be seen in Figure 1(b)), and each point is associated with the subspace of the cluster in which it lies.
The third possibility is to divide the y value space into several disjoint intervals, where each interval is represented by its mean, and the weight of each interval is the ratio between the number of points in the interval and the number of all possible points. We call this method k-means-HistMD_E; it approximates the two methods mentioned before that compute the weighted mean.
In conclusion, we have three methods:

• The naïve method, which is equivalent to the MA method;
• k-means-MD_E;
• k-means-HistMD_E.

These methods differ in their performance, efficiency, and the way they work.
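The two k-means steps combined with the MD_E distance and the mean of Section 3 can be sketched as follows. This is a minimal illustrative sketch under the MCAR assumption (global per-attribute statistics); the function names and the simplifications are ours, not the authors' implementation:

```python
import math
import random

def attribute_stats(data):
    """Per-attribute mean and variance of the known (non-NaN) values."""
    dim = len(data[0])
    mean, var = [], []
    for l in range(dim):
        known = [p[l] for p in data if not math.isnan(p[l])]
        m = sum(known) / len(known)
        mean.append(m)
        var.append(sum((v - m) ** 2 for v in known) / len(known))
    return mean, var

def md_e(point, centroid, mean, var):
    """Squared MD_E distance between a possibly incomplete point and a
    complete centroid: a missing coordinate l contributes
    (centroid_l - mu_l)^2 + var_l."""
    total = 0.0
    for l, (pl, cl) in enumerate(zip(point, centroid)):
        if math.isnan(pl):
            total += (cl - mean[l]) ** 2 + var[l]
        else:
            total += (pl - cl) ** 2
    return total

def kmeans_mde(data, k, iters=20, seed=0):
    mean, var = attribute_stats(data)
    rng = random.Random(seed)
    # initial centroids: random points, with missing coordinates imputed
    # by the attribute mean so that centroids are always complete
    centroids = [[mean[l] if math.isnan(p[l]) else p[l]
                  for l in range(len(p))] for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iters):
        # step 1: associate each point with its closest centroid
        labels = [min(range(k), key=lambda j: md_e(p, centroids[j], mean, var))
                  for p in data]
        # step 2: each centroid coordinate is the mean of the known
        # values of that coordinate over the cluster's members
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            for l in range(len(centroids[j])):
                known = [p[l] for p in members if not math.isnan(p[l])]
                if known:
                    centroids[j][l] = sum(known) / len(known)
    return labels, centroids
```

Note that incomplete points are associated with centroids without ever being imputed; only the centroid update falls back to the known values of each coordinate, as the derivation in Section 3 prescribes.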

Mean shift algorithm
In this section, we describe another use case that integrates the derived distance function MD_E within the framework of the mean shift clustering algorithm. First, we give a short overview of the mean shift algorithm, and then we describe how the MD_E distance is used in this algorithm. Here, we only review some of the results described in [16,17], which should be consulted for the details.

Let x_i ∈ R^d, i = 1, …, n, be a set of points, associated with a bandwidth value h > 0. The sample point density estimator at point x is

f̂(x) = (1 / (n h^d)) Σ_{i=1}^{n} K((x − x_i) / h)     (5)

Based on a symmetric kernel K with bounded support satisfying

K(x) = c_{k,d} k(‖x‖²)     (6)

it is a nonparametric estimator of the density at x in the feature space, where k(x), 0 ≤ x ≤ 1, is the profile of the kernel and the normalization constant c_{k,d} assures that K(x) integrates to one. As a result, the density estimator in Eq. (5) can be rewritten as

f̂_{h,k}(x) = (c_{k,d} / (n h^d)) Σ_{i=1}^{n} k(‖(x − x_i) / h‖²)     (7)

The first step in the analysis of a feature space with underlying density f(x) is to find the modes of the density, which are located among the zeros of the gradient, ∇f(x) = 0; the mean shift procedure is a way to find these zeros without estimating the density itself. Therefore, the density gradient estimator is obtained as the gradient of the density estimator by capitalizing on the linearity of Eq. (7):

∇f̂_{h,k}(x) = (2 c_{k,d} / (n h^{d+2})) Σ_{i=1}^{n} (x − x_i) k′(‖(x − x_i) / h‖²)     (8)
Define g(x) = −k′(x); then the kernel G(x) is defined as

G(x) = c_{g,d} g(‖x‖²)

Introducing g(x) into Eq. (8) yields

∇f̂_{h,k}(x) = (2 c_{k,d} / (n h^{d+2})) [ Σ_{i=1}^{n} g(‖(x − x_i) / h‖²) ] · [ ( Σ_{i=1}^{n} x_i g(‖(x − x_i) / h‖²) / Σ_{i=1}^{n} g(‖(x − x_i) / h‖²) ) − x ]     (9)

where Σ_{i=1}^{n} g(‖(x − x_i) / h‖²) is assumed to be a positive number. Both terms of the product in Eq. (9) have special significance. The first term is proportional to the density estimate at x computed with the kernel G. The second term is called the mean shift vector; it points toward the direction of maximum increase in the density. The implication of the mean shift property is the iterative procedure

y_{j+1} = Σ_{i=1}^{n} x_i g(‖(y_j − x_i) / h‖²) / Σ_{i=1}^{n} g(‖(y_j − x_i) / h‖²),  j = 1, 2, …

In practice, the convergence points of this iterative procedure are the local maxima (modes) of the density. All the points that share the same mode are clustered within the same cluster; therefore, we obtain as many clusters as modes.
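For the Epanechnikov kernel, whose profile has g(u) = 1 for 0 ≤ u ≤ 1 and 0 otherwise, the iterative procedure reduces to repeatedly moving a point to the mean of its neighbors within radius h. A minimal one-dimensional sketch of our own (not the authors' code) illustrates the mode-seeking behavior:

```python
def mean_shift_mode(x, data, h, iters=100):
    """Run the mean shift iteration from start point x over 1-D data
    with bandwidth h, using the Epanechnikov kernel (g = 1 inside the
    window, 0 outside), i.e., move to the mean of the window."""
    for _ in range(iters):
        window = [xi for xi in data if abs(xi - x) <= h]
        new_x = sum(window) / len(window)
        if abs(new_x - x) < 1e-9:   # converged to a mode
            break
        x = new_x
    return x
```

Starting points that converge to the same mode belong to the same cluster, so the number of clusters equals the number of modes found.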

Mean shift computation using the MD_E distance
This section describes the way to integrate the MD_E distance within the framework of the mean shift clustering algorithm. To this end, we first compute the mean shift vector using the MD_E distance, and then integrate the MD_E distance and the derived mean shift vector within the mean shift algorithm.
Using the derived MD_E distance, the density estimator in Eq. (7) is written as:

f̂_{h,k}(x) = (c_{k,d} / (n h^d)) Σ_{i=1}^{n} k( MD_E(x, x_i) / h² )

Since each point x_i may contain missing attributes, according to the definition of the MD_E distance, we obtain:

f̂_{h,k}(x) = (c_{k,d} / (n h^d)) Σ_{i=1}^{n} k( (1/h²) Σ_{l=1}^{d} d_l(x, x_i) )     (13)

where d_l(x, x_i) = (x^l − x_i^l)² when x_i^l is known, and d_l(x, x_i) = (x^l − μ_l)² + (σ_l)² when x_i^l is missing. Now, we compute the gradient of the density estimator in Eq. (13).
In our computation, we first deal with one coordinate l, and then we generalize the computation to all the other coordinates.
Differentiating Eq. (13) with respect to coordinate l gives

∂f̂_{h,k}(x) / ∂x^l = (2 c_{k,d} / (n h^{d+2})) [ Σ_{i : x_i^l known} (x^l − x_i^l) k′(u_i) + Σ_{i : x_i^l missing} (x^l − μ_l) k′(u_i) ]

where u_i = MD_E(x, x_i) / h², and there are n_l points for which the x^l coordinate is known and m_l points for which it is missing. As a result, coordinate l of the mean shift vector using the MD_E distance is defined as:

m^l_{h,G}(x) = ( Σ_{i : x_i^l known} x_i^l g(u_i) + Σ_{i : x_i^l missing} μ_l g(u_i) ) / Σ_{i=1}^{n} g(u_i) − x^l

Now, we can use this equation to run the mean shift procedure over datasets with missing values.
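One step of this procedure can be sketched in Python. This is our own minimal sketch, assuming MCAR per-attribute statistics and a Gaussian profile k(u) = exp(−u/2), so that g(u) = −k′(u) = exp(−u/2)/2; NaN marks a missing value:

```python
import math

def mde_mean_shift_step(x, data, h, mean, var):
    """One step of the mean shift iteration over data with NaN missing
    values. A missing coordinate l of a data point contributes
    (x_l - mu_l)^2 + var_l to the distance and mu_l to the weighted
    average, following the derived mean shift vector."""
    dim = len(x)
    num = [0.0] * dim
    g_sum = 0.0
    for p in data:
        d2 = sum(((x[l] - mean[l]) ** 2 + var[l]) if math.isnan(p[l])
                 else (x[l] - p[l]) ** 2 for l in range(dim))
        g = 0.5 * math.exp(-d2 / (2.0 * h * h))   # g(u) for u = d2 / h^2
        g_sum += g
        for l in range(dim):
            num[l] += g * (mean[l] if math.isnan(p[l]) else p[l])
    return [num[l] / g_sum for l in range(dim)]
```

Iterating this step from each data point until convergence and grouping points that reach the same mode yields the clustering.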

Experiments on numerical datasets
In order to measure the performance of the developed clustering algorithms (i.e., k-means and mean shift), we compare their performance on complete datasets to their performance on incomplete data using the suggested distance function, and then again using the existing methods (MCA, MA, and MI) within the standard algorithms.
To measure the similarity between two data clusterings, we use the Rand index [18]. We use it to compare the results of the original clustering algorithms to the results of the derived algorithms for incomplete datasets.
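The Rand index counts, over all pairs of items, how often two clusterings agree on whether the pair belongs together. A short illustrative sketch (our own code, not from the chapter):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index between two clusterings of the same items: the
    fraction of item pairs on which the clusterings agree (both put
    the pair in one cluster, or both separate it)."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

The index is 1 when the two clusterings induce the same partition (cluster labels themselves may differ) and decreases as pairwise disagreements accumulate.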
Our experiments use six standard numerical datasets from the Speech and Image Processing Unit [13]; dataset characteristics are shown in Table 1.
We produced the missing data by randomly drawing a set consisting of 10-40% of the points of each dataset. These sets are used as samples of incomplete data, where one attribute of each selected point is randomly chosen to be marked as missing. For each dataset, we average the results over 10 different runs.
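This simulation protocol can be sketched as follows (an illustrative sketch of our own; the original experiments may differ in details such as the random source):

```python
import math
import random

def inject_missing(data, fraction, seed=0):
    """Simulate MCAR missingness as in the experiments: randomly pick
    `fraction` of the points and mark one randomly chosen attribute of
    each picked point as missing (NaN). Returns a new dataset and
    leaves the input untouched."""
    rng = random.Random(seed)
    out = [list(p) for p in data]
    n_incomplete = int(round(fraction * len(out)))
    for idx in rng.sample(range(len(out)), n_incomplete):
        out[idx][rng.randrange(len(out[idx]))] = math.nan
    return out
```

Running this with fractions 0.1 through 0.4 over several seeds reproduces the kind of incomplete samples used in the evaluation.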

k-Means experiments
For the k-means algorithm, we developed two versions, k-means-MD_E and k-means-HistMD_E, to cluster the incomplete datasets. We compare the performance of the k-means clustering algorithm (k is fixed for each dataset) on complete data (i.e., without missing values) to its performance on data with missing values, using the MD_E distance measure (k-means-MD_E and k-means-HistMD_E) and then again using k-means-(MCA, MA, and MI).
As can be seen in Figure 2, the new algorithms that are based on the MD_E distance outperformed the existing algorithms on all the datasets. This occurs because, in the MA and MCA methods, the whole distribution of values is replaced by the mean or the mode of the distribution of known values, that is, by a fixed value. In our two developed algorithms, we use the distribution of the observed values in all the computation stages. This additional information, taking into account not only the mean of the attribute but also its variance, is probably the reason for the improved performance of our methods compared to the known heuristics.

Mean shift experiments
The mean shift clustering algorithm was tested using bandwidth h = 4 (since the standard mean shift worked well for this value). As can be seen in Figure 3, for all the datasets except the Jain dataset, the curves show that the new mean shift algorithm was superior and outperformed the other compared methods for all missing-value percentages, while for the Jain dataset its superiority became apparent only when the percentage of missing values was larger than 25%, as can be seen in Figure 3(b). In addition, we can see that the MS-MCA method outperforms the MS-MA method for the flame and path-based datasets, while MS-MA outperforms MS-MCA for the other datasets. As a result, we cannot decide unequivocally which of those methods is better. On the other hand, we can clearly state that MS-MD_E outperforms the other methods, especially when the percentage of missing values increases.

Conclusions
Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on. Several methods have been developed to deal with this problem, such as filling the missing values with fixed values, ignoring samples with missing values, or dealing with the missing values by defining a suitable distance function.
In this work, we have proposed a new mean shift clustering algorithm and two versions of the k-means clustering algorithm over incomplete datasets, based on the developed MD_E distance that was presented in [1,2,12].
The computational complexities of all the developed algorithms are preserved: they are the same as those of the standard algorithms using the Euclidean distance. The distance is computed based only on the mean and variance of the data for each attribute.
We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms using our distance and the suggested mean computations to other three basic methods.
From our experiments, we conclude that the developed methods are more appropriate for measuring the mean, mean shift vector, and weighted mean for objects with missing values, especially when the percent of missing values is large.