Speech and Image Processing Unit Dataset properties.
Many real-world dataset suffers from the problem of missing values. Several methods were developed to deal with this problem. Many of them filled the missing values within fixed value based on statistical computation. In this research, we developed a new versions of the k-means and the mean shift clustering algorithms that deal with datasets with missing values without filling their values. We developed a new distance function that is able to compute distances over incomplete datasets. The distance was computed based only on the mean and variance of the data for each attribute. As a result, the runtime complexity of our computation was O1. We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms using our distance and the suggested mean computations to other three basic methods. Our experiments show that the developed algorithms using our distance function outperform the existing k-means and mean shift using other methods for dealing with missing values.
- missing values
- distance metric
- weighted Euclidean distance
- mean shift
Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on.
Missing completely at random (MCAR): when the missing value is not related to any other sample;
Missing at random (MAR): when the probability that a value is missing may depend on some known values but it does not depend on the other missing values;
Not missing at random (NMAR): when the probability that a known value is missing depends on the value that would have been observed.
There are two basic types of methods to deal with the problem of incomplete datasets. (1) Deletion: methods from this category ignore all the incomplete instances. These methods may change the distribution of the data by decreasing the volume of the dataset . (2) Imputation: in these methods, the missing values were replaced with known value according to statistical computation. Based on these methods, we convert then incomplete data to complete data, and as a result, the existing machine learning algorithms can be run as they deal with complete data.
One of the most common approaches in this domain is the mean imputation (MI) method that replaces each incomplete data point with the mean of the data. There are several obvious disadvantages to this method: (a) using a fixed instance to replace all the incomplete instances will change the distribution of the original dataset and (b) ignoring the relationship among attributes will bias the performance of subsequent data mining algorithms. These problems were caused since we replace all the incomplete instances with a fixed one. On the other hand, a variant of this method is to replace the missing values only based on the distribution of the attributes. It means that the algorithm will replace each missing value with the mean of the of its attribute (MA) and the whole instance . And in a case that the values were discrete, the missing value will be replaced by the most common (MCA) value in the attribute  (i.e., filling the unknown values of the attribute with the value that occurs most often for the same attribute). All those methods ignore the other possible values of the attribute and their distribution and represent the missing value with one value, that is, wrong in real-world datasets.
We can summarize the main drawbacks of each suggested method as: (1) inability to approximate the missing value and (2) inefficiency to compute the suggested value. Based on our suggested method , the distance between two points, that they may include missing value, is not only efficient but also takes into account the distribution of each attribute.
To do that in the computation procedure, we take into account all the possible values of the missing value with their probabilities, which are derived from the attribute’s distribution. This is in contrast to the MCA and the MA methods, which replace each missing value only with the mode or the mean of each attribute.
There are three possible cases between the values: (a) both of them are known: in this case, the distance will be computed as the Euclidean distance; (b) both of them are missing; and (c) one value is missing. In the last two cases, the distance will be computed based only on the and the of the attribute. As a result, the runtime of the developed distance is as the Euclidean distance.
In this research, we integrated this distance function in order to develop the -means and the mean shift clustering algorithms. To this end, we derived more two formulas to compute the mean (for -means algorithm) and for computing the gradient function of the local estimated density (for mean shift clustering algorithm).
The developed algorithms yield better results than the other methods and preserve the runtime of the algorithms which deals with complete data as can be seen in the experiments. We experimented on six standard numerical datasets from different fields from the Speech and Image Processing Unit . Our experiments show that the performance of the developed algorithms using our distance function was superior to using other methods.
This chapter is organized as follows. A review of our distance function () is described in Section 2. The mean computation is presented in Section 3. Section 3 describes several directions for integrating the () distance and the computed within the k-means clustering algorithm. The mean shift clustering algorithm is presented in Section 4. Section 4.1 describes how to integrate the () distance and the derived mean shift vector within the mean shift clustering algorithm. Experimental results of running the developed clustering algorithms are presented in Section 5. Finally, our conclusions and future work are presented in Section 6.
2. Our distance measure
Firstly, we will give a short preview to basic distance function that is able to compute distances between points with missing values developed by .
Let be a set of points. For the th attribute , the conditional probability for will be computed according to the known values for this attribute from (i.e., ), where is the distribution of the th coordinate.
Given two sample points , the goal is to compute the distance between them. Let and be the th coordinate values from points , respectively. There are three possible cases for the values of and :
Two values are known: the distance between them will be defined as the Euclidean distance.
One value is missing: Suppose that is missing and the value is given. Since the value of is unknown, we cannot compute the distance using the Euclidean distance equation. Instead, we compute the expectation of all the distances between the given value and all the possible values from attribute according to its distribution .
Therefore, we approximate the mean Euclidean distance () between and the missing value as:
That means, to measure the distance between known value and unknown value, the algorithm will compute the expectation distance for all the distances between and all the possible values of the missing value. These computations did not take into account the possible correlations between the missing values and the other known values (missing completely at random—MCAR) and the probability was computed according to the whole dataset. The resulting mean Euclidean distance will be:
where and are the and the for all the known values of the attribute.
4. Both values are missing: In this case, in order to measure the distance, we should compute all the distances between each possible pair of values one for each missing value and . Both these values are selected from distribution .
Then, we compute the expectation of the Euclidean distance between each selected value as we did for the one missing value problem. As a result, the distance is:
As and belong to the same attribute, and . Thus:
As we mentioned, all these computations assume that the missing data is MCAR. However, in real-world datasets, the missing data are MAR. In this case, the probability depends on the other observed values, and then, the distance will be computed as:
where denotes the observed attributes of point , and and are the conditional and , respectively.
On the other hand, in the case that the missing values are NMAR, the probability that was used in Eq. (1) will be computed based on this information, and then, the distance will be:
where is the distribution of when is missing.
3. Mean computation
Since one of our goals is developing a k-means clustering algorithm over incomplete datasets, we need to derive a formula to compute the mean of a given set that may contain incomplete points. We decide to derive this equation based on our distance function .
Let be a set of points that may contain points with missing values. Then, the of this dataset is defined as:
for any , where denotes each point from the set , and is a distance function.
Let be a multidimensional function: which is defined as:
In our case, the . Thus,
where is the coordinate and is the coordinate in point . Since each point may contain missing attributes, and according to the definition of the distance in the previous section, will be:
is the solution of , and in a multidimensional case: is the solution of , where
is the gradient of function . Firstly, we will deal with one coordinate, and then, we will generalize it for the other coordinates.
Thus, we simply get:
Repeating this for all the coordinates yields . In other words, each coordinate of the mean is the mean of the known values of that coordinate.
In the same way, we derive a formula for computing the weighted mean for each coordinate , yielding:
where is the weight of point . It means, in order to compute the weighted mean of a set of numbers that some of them are unknown, we must distinguish between known and unknown values. If the value is known, we multiply it with its weight. On the other hand, if the value is missing, we replace it with the mean of the known values and then multiply it by the matching weight.
4. k-Means clustering using the distance
Based on the derived formulas, the distance and the , our aim in this research is to develop -means clustering algorithms for incomplete datasets .
The distance and the are general and can be integrated within any algorithm that computes distances or mean computation. In this section, we describe our proposed method to integrate those formulas within the framework of the -means clustering algorithm.
We developed three different versions for k-means. For simplicity, we assume that all the points are from . We have two way to look about incomplete points. The first one considers each point as a single point, this version is similar to the GMM algorithm described in [14, 15]. On the other hand, the second way is to replace each incomplete point with a set of points according to the data distribution (these are the other two methods). As will be shown in our experiments, they outperform the first algorithm.
The k-means clustering algorithm is constructed from two basic steps: (1) associate each point with its closest centroid, and then, (2) update the centroid based on the new association from Eq. (1). Given dataset that may contain points with missing values. In the first step, the distance is used to compute the distances between each data point and the centroids in order to associate each point with the closest centroid. This association is general for all the three versions. However, there are several possible ways to then compute the new centroids of the clusters. We use Figure 1(a) in order to illustrate those possibilities. In this example, we see two clusters (i.e., was assigned to be the yellow cluster and was assigned to be the brown cluster). Our goal is to calculate the centers of each cluster. As an example, we will deal only with . If all the instances do not contain missing values, the centroid will be computed based on the Euclidean formula, resulting in the magenta star.
However, when the associated points for a given cluster contain incomplete points, it is not clear how to compute the mean. In the given example, let (i.e., the red star) be a point with a missing value and . This point was associated with ’s cluster using the distance. It is important to note that we are able to associate incomplete points with closest centroid even though their geometric locations are unknown since we use the distance.
On the other hand, using the distance is similar to use the MA-method based on the Euclidean distance, the point will be replaced with . It is clear that the difference between the two methods is only the variance of known values in coordinate , a fixed value that does not influence the association result.
the set of all the possible points that satisfy . And
denote all the possible values for attribute . And then computing the according to these points (and ), where each point from has weight one and each point from has weight . Where
be the set of all the data points without missing values that are associated with the cluster. As a result, the weighted of is:
This is identical to the Euclidean when the missing point is replaced with and is equivalent to the MA method when is associated with . As a result, the real centroid of the cluster (the magenta star) moves to the green star as described in Figure 1(b), where not all the blue “+” marks are belonging to .
As a result, the mean computation must distinguish between two possible methods. The first method (which we call
Computing the new centroid using Eq. (3) yields not only the same centroid as using the Euclidean distance, but also preserves the runtime of the standard k-means using the Euclidean distance.
The second method (which we called
According to this method, use all the points from that are associated with the cluster and not only the points from whose coordinates are from the real points associated with that cluster. Since the weights are computed using the entire dataset, we cannot use Eq. (3). To this end, our suggested method for implementing the computation is simply to replace each point with a missing value with the points, each with a weight , and run weighted k-means on the new dataset. This method, in one hand, is simple to implement, but in the other hand, its runtime is high, since each point with, for example, a missing value will be replaced with all points. As a result, the size of the dataset will be:
where is the set of each data points that do not contain missing values. In order to reduce the runtime complexity, we turn to use Voronoi diagram. Based on Voronoi diagram, the data space is partitioned to subspaces (as can be seen in Figure 1(b)). Each point is associated with the subspace of the cluster in which it lies.
The third possibility is to divide the value space to several disjoint intervals. Where, each interval will be represented by its mean, and the weight of each interval will be the ratio between the number of points in the interval to the number of all possible points. This method we called
In conclusion, we have three methods:
The naïve method which is equivalent to the MA method.
These methods differ in their performance, efficiency, and the way they work.
5. Mean shift algorithm
In this section, we will describe another use case that integrates the derived distance function within the framework of mean shift clustering algorithm. Firstly, we will give a short overview of the mean shift algorithm, and then, we will describe how we use distance in this algorithm. Here, we only review some of the results described in [16, 17] which should be consulted for the details. Let is associated with a bandwidth value . The
Based on a symmetric kernel with bounded support satisfying
is a nonparametric estimator of the density at in the feature space. Where is the
As a first step in the analysis is to find the modes of the density which are located among the zeros of the gradient , of a feature space with the underlying density , and the mean shift procedure is a way to find these zeros without the need to estimate the density.
Therefore, the density gradient estimator is obtained as the gradient of the density estimator by capitalizing on the linearity of Eq. (7).
Define , then the kernel is defined as:
Introducing into Eq. (8) yields
where is assumed to be a positive number. Both terms of the product in Eq. (9) have special significance. The first term is proportional to the density estimate at computed with the kernel . The second term
is called the
In real world, most often the convergence points of this iterative procedure are the local maxima (modes) of the density. All the points that share the same mode are clustered within the same cluster. Therefore, we get clusters as the number of modes.
5.1. Mean shift computing using the distance
This section describes the way to integrate the distance within the framework of the mean shift clustering algorithm. To achieve this mission, we will first compute the mean shift vector using the distance. And then, we will integrate the and the derived mean shift vector within the mean shift algorithm.
Using the derived distance the density estimator in Eq. (7) will be written as:
Since each point may contain missing attributes, will be:
According to the definition of the distance, we obtain:
Now, we will compute the gradient of the density estimator in Eq. (13).
In our computation, we will first deal with one coordinate , and then, we will generate the computation for all the other coordinates.
where there are points for which the coordinate is known, and there are points where it is missing.
As a result, the mean shift vector using the distance is defined as:
Now, we can use this equation to run the mean shift procedure over datasets with missing values.
6. Experiments on numerical datasets
In order to measure performance of the developed clustering algorithm (i.e., k-means and mean shift), we compare their performance on complete datasets to its performance on incomplete data using the suggested distance function and then again using the existing methods (MCA, MA, and MI) within the standard algorithms.
To measure the similarity between two data clusterings, we decide to use the Rand index . We use it in order to compare the results of the original clustering algorithms to the results of the other derived algorithms for incomplete datasets.
We produced the missing data by drawing randomly a set consisting of 10–40% of the data from each dataset. These sets are used as samples of incomplete data, where one attribute from each point was randomly selected to be assigned as missing value. For each dataset, we average the results over 10 different runs.
6.1. k-Means experiments
In the k-means algorithm, we developed two versions, k-means-and k-means-; to cluster the incomplete datasets, we compare the performance of the k-means (is fixed for each dataset) clustering algorithm on complete data (i.e., without missing values) to its performance on data with missing values, using the distance measure (k-means-and k-means-) and then again using k-means-(MCA, MA, and MI).
As can be seen in Figure 2, the new algorithms that is based on the distance outperformed the other existing algorithms on all the datasets. It occurred because in the MA MCA methods, the whole distribution of values is replaced by the mean or the mode of the distribution of known values, that is a fixed value. In our two developed algorithms, we use the distribution of the observed values in all the computation stages. This additional information, taking into account not only the mean of the attribute but also the variance, is probably the reason for the improved performance of our methods compared to the known heuristics.
6.2. Mean shift experiments
Mean shift clustering algorithm was tested using bandwidth (because we saw that the standard mean shift worked well for this value).
A resulting curve for the Rand index values was constructed for each dataset to evaluate how well the algorithm performed.
As can be seen in Figure 3, for all the datasets except the Jain dataset, the curves show that the new mean shift algorithm was superior and outperformed the other compared methods for all missing value percentages, while for the Jain dataset, its superiority became apparent only when the percent of the missing values was larger than 25%, as can be seen in Figure 3(b). In addition, we can see that the method outperforms the method for the flame and path-based datasets, and the outperforms for the other datasets. As a result, we cannot decide unequivocally which algorithm is better. On the other hand, we obviously can state that the outperforms the other methods especially when the percentage of the missing values increases.
Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on. Several methods were developed to deal with this problem such as: filling the missing values with fixed values, ignoring sample with missing values, or dealing with the missing values by defining a distance function.
In this work, we have proposed a new mean shift clustering algorithm and two versions of the k-means clustering algorithm over incomplete datasets based on the developed distance that was presented in [1, 2, 12].
The computational complexities of all the developed algorithms were preserved and they are the same as that of the standard algorithms using the Euclidean distance. The distance was computed based only on the and of the data for each attribute.
We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms using our distance and the suggested mean computations to other three basic methods.
From our experiments, we conclude that the developed methods are more appropriate for measuring the mean, mean shift vector, and weighted mean for objects with missing values, especially when the percent of missing values is large.