Open access peer-reviewed chapter

Clustering Algorithms for Incomplete Datasets

By Loai AbdAllah and Ilan Shimshoni

Submitted: October 24th 2017. Reviewed: May 2nd 2018. Published: August 1st 2018.

DOI: 10.5772/intechopen.78272


Abstract

Many real-world datasets suffer from the problem of missing values. Several methods have been developed to deal with this problem; many of them fill the missing values with a fixed value derived from a statistical computation. In this research, we developed new versions of the k-means and mean shift clustering algorithms that deal with datasets with missing values without filling in those values. We developed a new distance function that is able to compute distances over incomplete datasets. The distance is computed based only on the mean and variance of the data for each attribute. As a result, the runtime complexity of our computation is O(1), as for the Euclidean distance. We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms, using our distance and the suggested mean computations, to three other basic methods. Our experiments show that the developed algorithms using our distance function outperform the existing k-means and mean shift algorithms using the other methods for dealing with missing values.

Keywords

  • missing values
  • distance metric
  • weighted Euclidean distance
  • clustering
  • mean shift
  • k-means

1. Introduction

Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on.

In this research, we developed versions of two popular clustering algorithms that run over incomplete datasets: (1) the k-means clustering algorithm [1] and (2) the mean shift clustering algorithm [2].

Based on [3, 4, 5, 6], there are three main types of missing data:

  1. Missing completely at random (MCAR): when the probability that a value is missing does not depend on any of the values in the data, known or missing;

  2. Missing at random (MAR): when the probability that a value is missing may depend on some known values but it does not depend on the other missing values;

  3. Not missing at random (NMAR): when the probability that a value is missing depends on the value that would have been observed.

There are two basic types of methods to deal with the problem of incomplete datasets. (1) Deletion: methods from this category ignore all the incomplete instances. These methods may change the distribution of the data by decreasing the volume of the dataset [7]. (2) Imputation: in these methods, the missing values are replaced with known values according to a statistical computation. These methods convert the incomplete data to complete data, and as a result, existing machine learning algorithms can be run as if they were dealing with complete data.

One of the most common approaches in this domain is the mean imputation (MI) method, which replaces each incomplete data point with the mean of the data. There are several obvious disadvantages to this method: (a) using a fixed instance to replace all the incomplete instances will change the distribution of the original dataset, and (b) ignoring the relationships among attributes will bias the performance of subsequent data mining algorithms. These problems are caused by replacing all the incomplete instances with a single fixed one. A variant of this method therefore replaces the missing values based only on the distribution of each attribute: the algorithm replaces each missing value with the mean of its attribute (MA), rather than replacing the whole instance [8]. In the case that the values are discrete, the missing value is replaced by the most common value in the attribute (MCA) [9] (i.e., filling the unknown values of the attribute with the value that occurs most often for that attribute). All these methods ignore the other possible values of the attribute and their distribution, and represent the missing value with a single value, which is wrong in real-world datasets.

Finally, the k-nearest neighbor imputation method [10, 11] estimates the values that should be imputed based on the k nearest neighbors, using only the known values. The main obstacle of this method is its runtime complexity.

We can summarize the main drawbacks of the methods above as: (1) inability to approximate the missing value well and (2) inefficiency in computing the suggested value. With our suggested method [12], the distance between two points, which may include missing values, is not only efficient to compute but also takes into account the distribution of each attribute.

To do that, the computation procedure takes into account all the possible values of the missing value together with their probabilities, which are derived from the attribute's distribution. This is in contrast to the MCA and MA methods, which replace each missing value only with the mode or the mean of the attribute.

There are three possible cases for a pair of attribute values: (a) both of them are known: in this case, the distance is computed as the Euclidean distance; (b) both of them are missing; and (c) one value is missing. In the last two cases, the distance is computed based only on the mean and the variance of the attribute. As a result, the runtime of the developed distance is O(1), as for the Euclidean distance.

In this research, we integrated this distance function in order to develop versions of the k-means and the mean shift clustering algorithms. To this end, we derived two more formulas: one to compute the mean (for the k-means algorithm) and one to compute the gradient of the locally estimated density (for the mean shift clustering algorithm).

The developed algorithms yield better results than the other methods and preserve the runtime of the algorithms that deal with complete data, as can be seen in the experiments. We experimented on six standard numerical datasets from different fields, taken from the Speech and Image Processing Unit [13]. Our experiments show that the performance of the developed algorithms using our distance function was superior to that of the other methods.

This chapter is organized as follows. A review of our distance function (MDE) is given in Section 2. The mean computation is presented in Section 3. Section 4 describes several directions for integrating the MDE distance and the computed mean within the k-means clustering algorithm. The mean shift clustering algorithm is presented in Section 5, and Section 5.1 describes how to integrate the MDE distance and the derived mean shift vector within it. Experimental results of running the developed clustering algorithms are presented in Section 6. Finally, our conclusions and future work are presented in Section 7.

2. Our distance measure

Firstly, we will give a short overview of the basic distance function, developed in [12], that is able to compute distances between points with missing values.

Let $A \subseteq \mathbb{R}^K$ be a set of points. For the $i$th attribute $A_i$, the probability distribution of $A_i$ is computed according to the known values of this attribute in $A$ (i.e., $P(A_i) \sim \chi_i$), where $\chi_i$ is the distribution of the $i$th coordinate.

Given two sample points $X, Y \in \mathbb{R}^K$, the goal is to compute the distance between them. Let $x_i$ and $y_i$ be the $i$th coordinate values of points $X$ and $Y$, respectively. There are three possible cases for the values of $x_i$ and $y_i$:

  1. Both values are known: the distance between them is defined as the Euclidean distance.

  2. One value is missing: Suppose that $x_i$ is missing and the value $y_i$ is given. Since the value of $x_i$ is unknown, we cannot compute the distance using the Euclidean distance equation. Instead, we compute the expectation of the distances between the given value $y_i$ and all the possible values of attribute $i$ according to its distribution $\chi_i$.

    Therefore, we approximate the mean Euclidean distance (MDE) between $y_i$ and the missing value $m_i$ as:

$$MDE(m_i, y_i) = E\left[(x - y_i)^2\right] = \int p(x)\,(x - y_i)^2\,dx = (y_i - \mu_i)^2 + \sigma_i^2.$$

That is, to measure the distance between the known value $y_i$ and an unknown value, the algorithm computes the expectation of the distances between $y_i$ and all the possible values of the missing value. These computations do not take into account possible correlations between the missing values and the other known values (missing completely at random, MCAR), and the probability is computed according to the whole dataset. The resulting mean Euclidean distance is:

$$MDE(m_i, y_i) = (y_i - \mu_i)^2 + \sigma_i^2, \tag{1}$$

where $\mu_i$ and $\sigma_i^2$ are the mean and the variance of all the known values of the attribute.

  3. Both values are missing: In this case, in order to measure the distance, we should compute the distances between each possible pair of values, one for each missing value $x_i$ and $y_i$. Both values are drawn from the distribution $\chi_i$.

    Then, we compute the expectation of the Euclidean distance over each selected pair, as we did in the one-missing-value case. As a result, the distance is:

$$MDE(x_i, y_i) = \int\!\!\int p(x)\,p(y)\,(x - y)^2\,dx\,dy = (E[x] - E[y])^2 + \sigma_x^2 + \sigma_y^2.$$

As $x$ and $y$ belong to the same attribute, $E[x] = E[y] = \mu_i$ and $\sigma_x = \sigma_y = \sigma_i$. Thus:

$$MDE(x_i, y_i) = 2\sigma_i^2. \tag{2}$$

As we mentioned, all these computations assume that the missing data are MCAR. However, in real-world datasets, the missing data are often MAR. In this case, the probability $p(x)$ depends on the other observed values, and the distance is computed as:

$$MDE(m_i, y_i) = \int p(x \mid x_{obs})\,(x - y_i)^2\,dx = \left(y_i - \mu^i_{x \mid x_{obs}}\right)^2 + \left(\sigma^i_{x \mid x_{obs}}\right)^2,$$

where $x_{obs}$ denotes the observed attributes of point $X$, and $\mu^i_{x \mid x_{obs}}$ and $\left(\sigma^i_{x \mid x_{obs}}\right)^2$ are the conditional mean and variance, respectively.

On the other hand, in the case that the missing values are NMAR, the probability $p(x)$ that was used in Eq. (1) is computed based on this information, and the distance becomes:

$$MDE(m_i, y_i) = \int p(x \mid m_i)\,(x - y_i)^2\,dx = \left(y_i - \mu^i_{x \mid m_i}\right)^2 + \left(\sigma^i_{x \mid m_i}\right)^2,$$

where $p(x \mid m_i)$ is the distribution of $x$ when $x$ is missing.
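To make these cases concrete, the following minimal Python sketch computes the squared MDE distance under the MCAR assumption, with NaN marking a missing value. The function and variable names are our illustrative choices, not taken from the authors' code.

```python
import numpy as np

def mde_sq_distance(x, y, mu, var):
    """Squared MDE distance between two points (1D arrays; NaN = missing).

    mu[i] and var[i] are the mean and variance of the known values of
    attribute i, precomputed once over the whole dataset, so each
    per-attribute term below costs O(1), as in the text.
    """
    d = 0.0
    for i in range(len(x)):
        x_known = not np.isnan(x[i])
        y_known = not np.isnan(y[i])
        if x_known and y_known:            # both known: squared difference
            d += (x[i] - y[i]) ** 2
        elif x_known or y_known:           # one missing: Eq. (1)
            known = x[i] if x_known else y[i]
            d += (known - mu[i]) ** 2 + var[i]
        else:                              # both missing: Eq. (2)
            d += 2.0 * var[i]
    return d

# Per-attribute statistics are computed from the known values only.
data = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 4.0]])
mu, var = np.nanmean(data, axis=0), np.nanvar(data, axis=0)
print(mde_sq_distance(data[1], data[2], mu, var))
```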

3. Mean computation

Since one of our goals is developing a k-means clustering algorithm over incomplete datasets, we need to derive a formula to compute the mean of a given set that may contain incomplete points. We decided to derive this formula based on our distance function MDE.

Let $A \subseteq \mathbb{R}^K$ be a set of $n$ points that may contain points with missing values. Then, the mean of this dataset is defined as:

$$\bar{x} = \arg\min_{x \in \mathbb{R}^K} \sum_{i=1}^{n} distance(x, p_i)^2,$$

where $p_i \in A$ denotes each point of the set $A$, and $distance$ is a distance function.

Let $f(x)$ be the multidimensional function $f: \mathbb{R}^K \to \mathbb{R}$ defined as:

$$f(x) = \sum_{i=1}^{n} distance(x, p_i)^2.$$

In our case, $distance = MDE$. Thus,

$$f(x) = \sum_{i=1}^{n} distance(x, p_i)^2 = \sum_{i=1}^{n}\Biggl(\underbrace{\sqrt{\sum_{j=1}^{K} MDE(x_j, p_{ij})}}_{\text{the } MDE \text{ distance}}\Biggr)^{2} = \sum_{i=1}^{n}\sum_{j=1}^{K} MDE(x_j, p_{ij}),$$

where $x_j$ is coordinate $j$ of $x$ and $p_{ij}$ is coordinate $j$ of point $p_i$. Since each point $p_i$ may contain missing attributes, according to the definition of the MDE distance in the previous section, $f(x)$ becomes:

$$f(x) = \sum_{j=1}^{K}\Biggl[\underbrace{\sum_{i=1}^{n_j}(x_j - p_{ij})^2}_{n_j\ \text{known coordinates}} + \underbrace{\sum_{i=1}^{m_j}\left((x_j - \mu_j)^2 + \sigma_j^2\right)}_{m_j\ \text{missing coordinates}}\Biggr].$$

$\bar{x}$ is the solution of $f'(x) = 0$; in the multidimensional case, $\bar{x}$ is the solution of $\nabla f = 0$, where

$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_K}\right) = 0$$

is the gradient of the function $f$. Firstly, we deal with one coordinate, and then we generalize to the other coordinates.

$$\frac{\partial f}{\partial x_l} = 2\sum_{i=1}^{n_l}(x_l - p_{il}) + 2\sum_{i=1}^{m_l}(x_l - \mu_l) = 0 \;\Rightarrow\; n\,x_l = \sum_{i=1}^{n_l} p_{il} + m_l\,\mu_l \;\Rightarrow\; x_l = \frac{\sum_{i=1}^{n_l} p_{il}}{n} + \frac{m_l\,\mu_l}{n} = \frac{n_l}{n}\cdot\frac{\sum_{i=1}^{n_l} p_{il}}{n_l} + \frac{n - n_l}{n}\,\mu_l = \mu_l.$$

Thus, we simply get:

$$x_l = \mu_l. \tag{3}$$

Repeating this for all the coordinates yields $\bar{x} = (\mu_1, \mu_2, \ldots, \mu_K)$. In other words, each coordinate of the mean is the mean of the known values of that coordinate.

In the same way, we derive a formula for computing the weighted mean for each coordinate $l$, yielding:

$$\bar{x}_w^{\,l} = \frac{\sum_{i=1}^{n_l} w_i\,x_{il} + \sum_{i=1}^{m_l} w_i\,\mu_l}{\sum_{i=1}^{n} w_i},$$

where $w_i$ is the weight of point $x_i$. That is, in order to compute the weighted mean of a set of numbers, some of which are unknown, we must distinguish between known and unknown values. If a value is known, we multiply it by its weight. If a value is missing, we replace it with the mean of the known values and then multiply it by the matching weight.
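Both computations reduce to simple column-wise operations. A minimal Python sketch, again marking missing values with NaN (the names are ours):

```python
import numpy as np

def mde_mean(points):
    """Eq. (3): each coordinate of the mean is the mean of the known
    values of that coordinate."""
    return np.nanmean(points, axis=0)

def mde_weighted_mean(points, weights):
    """Weighted mean with missing values: a missing value is replaced by
    the column mean of the known values, then weighted as usual."""
    mu = np.nanmean(points, axis=0)
    filled = np.where(np.isnan(points), mu, points)
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * filled).sum(axis=0) / w.sum()

points = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
print(mde_mean(points))                      # -> [3. 4.]
print(mde_weighted_mean(points, [1, 1, 2]))  # -> [3.5 4.5]
```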

4. k-Means clustering using the MDE distance

Based on the derived formulas, the MDE distance and the mean, our aim in this research is to develop a k-means clustering algorithm [1] for incomplete datasets.

The MDE distance and the mean are general and can be integrated within any algorithm that computes distances or means. In this section, we describe our proposed method to integrate those formulas within the framework of the k-means clustering algorithm.

We developed three different versions of k-means. For simplicity, we assume that all the points are from $\mathbb{R}^2$. There are two ways to view incomplete points. The first considers each incomplete point as a single point; this version is similar to the GMM algorithm described in [14, 15]. The second way is to replace each incomplete point with a set of points according to the data distribution (these are the other two methods). As will be shown in our experiments, they outperform the first algorithm.

The k-means clustering algorithm is constructed from two basic steps: (1) associate each point with its closest centroid, and then (2) update the centroids based on the new association. Given a dataset $D$ that may contain points with missing values, in the first step the MDE distance is used to compute the distances between each data point and the centroids in order to associate each point with the closest centroid. This association is common to all three versions. However, there are several possible ways to then compute the new centroids of the clusters. We use Figure 1(a) in order to illustrate those possibilities. In this example, we see two clusters (i.e., $C_1$ was assigned to be the yellow cluster and $C_2$ was assigned to be the brown cluster). Our goal is to calculate the center of each cluster. As an example, we will deal only with $C_1$. If none of the instances contain missing values, the centroid is computed based on the Euclidean mean formula, resulting in the magenta star.

Figure 1.

An example for computing the centroids for two clusters in a dataset with missing values. (a) shows the results of the different methods of computing the mean. (b) shows the Voronoi diagram.

However, when the points associated with a given cluster include incomplete points, it is not clear how to compute the mean. In the given example, let $(x_0, ?)$ (i.e., the red star) be a point whose $y$ value is missing and whose $x$ coordinate equals $x_0$. This point was associated with cluster $C_1$ using the MDE distance. It is important to note that we are able to associate incomplete points with the closest centroid even though their geometric locations are unknown, since we use the MDE distance.

For the association step, using the MDE distance is similar to using the MA method with the Euclidean distance, where the point $(x_0, ?)$ is replaced with $(x_0, \mu_y)$. The difference between the two methods is only the variance of the known values in coordinate $y$, a fixed value that does not influence the association result.

The naïve method to compute the new centroid is to replace the point with the missing value with all the possible points

$$x_0^{possible} = \left\{(x_0, y_p) \mid y_p \in Y^{possible}\right\},$$

the set of all the possible points that satisfy $x = x_0$, and

$$Y^{possible} = \left\{y \in \mathbb{R} \mid (x, y) \in D\right\}$$

denotes all the possible values of attribute $Y$. The mean is then computed according to these points ($C_1^{real}$ and $x_0^{possible}$), where each point from $C_1^{real}$ has weight one and each point from $x_0^{possible}$ has weight $1/|Y^{possible}|$. Let

$$C_1^{real} = \left\{(x, y) \in D \mid (x, y) \in C_1\right\}$$

be the set of all the data points without missing values that are associated with cluster $C_1$. As a result, the weighted mean of $C_1$ is:

$$mean(C_1) = \frac{\sum_{(x,y) \in C_1^{real}} (x, y) + (x_0, \mu_y)}{\left|C_1^{real}\right| + 1}. \tag{4}$$

This is identical to the Euclidean mean when the missing point is replaced with $(x_0, \mu_y)$ and is equivalent to the MA method when $(x_0, \mu_y)$ is associated with $C_1$. As a result, the real centroid of the cluster (the magenta star) moves to the green star, as described in Figure 1(b), where not all the blue "+" marks belong to $C_1$.

As a result, the mean computation must distinguish between two possible methods. The first method (which we call k-means-MDE) takes into account all the possible points whose $y$ coordinates are the $y$ coordinates of the real data points from the yellow cluster, in addition to the real points within the yellow circle. As a result, the mean of this set is computed based on all the real points $C_1^{real}$ and $C_1^{x_0^{possible}}$, where

$$C_1^{x_0^{possible}} = \left\{(x_0, y_p) \in x_0^{possible} \mid \exists (x, y) \in C_1^{real} \text{ such that } y = y_p\right\}.$$

Computing the new centroid using Eq. (3) yields not only the same centroid as using the Euclidean distance, but also preserves the runtime of the standard k-means using the Euclidean distance.

In the second method, we first associate each of the points from $x_0^{possible}$ with its nearest center, and after that compute a weighted mean. That is, to compute the mean, we take into account all the real points $C_1^{real}$, in addition to $P_{C_1}^{possible}$, where

$$P_{C_1}^{possible} = \left\{(x_0, y_p) \in x_0^{possible} \mid (x_0, y_p) \in C_1\right\}.$$

According to this method, we use all the points from $x_0^{possible}$ that are associated with cluster $C_1$, and not only the points from $x_0^{possible}$ whose $y$ coordinates come from the real points associated with that cluster. Since the weights are computed using the entire dataset, we cannot use Eq. (3). To this end, our suggested method for implementing the mean computation is simply to replace each point with a missing value by the $Y^{possible}$ points, each with weight $1/|Y^{possible}|$, and run weighted k-means on the new dataset. This method, on the one hand, is simple to implement; on the other hand, its runtime is high, since each point with, for example, a missing $y$ value is replaced with all $Y^{possible}$ points. As a result, the size of the dataset will be:

$$\left|D^{real}\right| + \left(|D| - \left|D^{real}\right|\right) \cdot \left|Att^{possible}\right|,$$

where $D^{real}$ is the set of data points that do not contain missing values. In order to reduce the runtime complexity, we turn to the Voronoi diagram. Based on the Voronoi diagram, the data space is partitioned into $k$ subspaces (as can be seen in Figure 1(b)), and each point is associated with the subspace of the cluster in which it lies.

The third possibility is to divide the $y$ value space into several disjoint intervals, where each interval is represented by its mean, and the weight of each interval is the ratio between the number of points in the interval and the number of all possible points. We call this method k-means-HistMDE; it approximates the two methods mentioned before that compute the weighted mean.

In conclusion, we have three methods:

  • The naïve method, which is equivalent to the MA method.

  • k-means-MDE

  • k-means-HistMDE

These methods differ in their performance, efficiency, and the way they work.
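To show how the pieces fit together, here is a sketch of the k-means-MDE version: the association step uses the MDE distance, and the centroid update applies Eq. (3) per cluster. It assumes the mde_sq_distance helper sketched in Section 2 and omits the HistMDE weighting; it is an illustration, not the authors' implementation.

```python
import numpy as np

def kmeans_mde(data, k, n_iter=50, seed=0):
    """k-means-MDE sketch: MDE association + Eq. (3) centroid update."""
    rng = np.random.default_rng(seed)
    mu = np.nanmean(data, axis=0)    # global statistics of the known values
    var = np.nanvar(data, axis=0)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    centers = np.where(np.isnan(centers), mu, centers)   # complete centers
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        # Step 1: associate every (possibly incomplete) point with the
        # closest centroid under the MDE distance.
        dists = np.array([[mde_sq_distance(p, c, mu, var) for c in centers]
                          for p in data])
        labels = dists.argmin(axis=1)
        # Step 2: each centroid coordinate is the mean of the known values
        # of that coordinate in the cluster (Eq. (3)).
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                m = np.nanmean(members, axis=0)
                centers[j] = np.where(np.isnan(m), mu, m)  # all-missing fallback
    return labels, centers
```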

5. Mean shift algorithm

In this section, we describe another use case that integrates the derived distance function MDE, this time within the framework of the mean shift clustering algorithm. Firstly, we give a short overview of the mean shift algorithm, and then we describe how we use the MDE distance in this algorithm. Here, we only review some of the results described in [16, 17], which should be consulted for the details. Let $x_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, be the data points, each associated with a bandwidth value $h > 0$. The sample point density estimator at point $x$ is

$$\hat{f}(x) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right). \tag{5}$$

For a symmetric kernel $K$ with bounded support satisfying

$$K(x) = c_{k,d}\, k\!\left(\|x\|^2\right), \qquad \|x\| \le 1, \tag{6}$$

Eq. (5) is a nonparametric estimator of the density at $x$ in the feature space, where $k(x)$, $0 \le x \le 1$, is the profile of the kernel and the normalization constant $c_{k,d}$ assures that $K(x)$ integrates to one. As a result, the density estimator in Eq. (5) can be rewritten as

$$\hat{f}_{h,k}(x) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right). \tag{7}$$

The first step in the analysis of a feature space with underlying density $f(x)$ is to find the modes of the density, which are located among the zeros of the gradient, $\nabla f(x) = 0$; the mean shift procedure is a way to find these zeros without estimating the density itself.

Therefore, the density gradient estimator is obtained as the gradient of the density estimator by capitalizing on the linearity of Eq. (7).

$$\nabla \hat{f}_{h,K}(x) = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n} (x - x_i)\, k'\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right). \tag{8}$$

Define $g(x) = -k'(x)$; the kernel $G(x)$ is then defined as:

$$G(x) = c_{g,d}\, g\!\left(\|x\|^2\right).$$

Introducing $g(x)$ into Eq. (8) yields

$$\nabla \hat{f}_{h,K}(x) = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right) = \frac{2c_{k,d}}{nh^{d+2}} \left[\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)\right] \left[\frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x\right], \tag{9}$$

where $\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)$ is assumed to be a positive number. Both terms of the product in Eq. (9) have special significance. The first term is proportional to the density estimate at $x$ computed with the kernel $G$. The second term

$$m_G(x) = \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x \tag{10}$$

is called the mean shift vector. The mean shift vector thus points toward the direction of maximum increase in the density. The implication of the mean shift property is that the iterative procedure

$$y_{j+1} = \frac{\sum_{i=1}^{n} x_i\, g\!\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}, \qquad j = 1, 2, \ldots \tag{11}$$

converges, and in practice the convergence points of this iterative procedure are most often the local maxima (modes) of the density. All the points that share the same mode are clustered within the same cluster; therefore, we get as many clusters as there are modes.
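For reference, a compact sketch of the procedure in Eq. (11) for complete data, using the Epanechnikov profile, for which $g$ reduces to the flat kernel (1 inside the bandwidth ball, 0 outside); the tolerance and names are our choices:

```python
import numpy as np

def mean_shift_mode(x, data, h, tol=1e-6, max_iter=100):
    """Iterate Eq. (11) from x until the shift is smaller than tol."""
    y = x.astype(float).copy()
    for _ in range(max_iter):
        # Flat g: weight 1 for points whose scaled squared distance <= 1.
        inside = np.sum((data - y) ** 2, axis=1) / h**2 <= 1.0
        if not inside.any():
            break
        y_next = data[inside].mean(axis=0)   # Eq. (11) with flat g
        if np.sum((y_next - y) ** 2) < tol**2:
            return y_next
        y = y_next
    return y

# To cluster: run the procedure from every data point and group points
# whose modes coincide up to a small merge radius.
```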

5.1. Mean shift computation using the MDE distance

This section describes the way to integrate the MDE distance within the framework of the mean shift clustering algorithm. To this end, we first compute the mean shift vector using the MDE distance, and then we integrate the MDE distance and the derived mean shift vector within the mean shift algorithm.

Using the derived MDE distance, the density estimator in Eq. (7) is written as:

$$\hat{f}_{h,k}(x) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\left\|\frac{x - x_i}{h}\right\|^2\right) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\frac{\sum_{j=1}^{d} MDE(x_j, x_{ij})}{h^2}\right). \tag{12}$$

Since each point $x_i$ may contain missing attributes, $\hat{f}_{h,k}(x)$ becomes:

$$\hat{f}_{h,k}(x) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\underbrace{\sum_{j=1}^{kn_i} \frac{MDE(x_j, x_{ij})}{h^2}}_{kn_i\ \text{known attributes of}\ x_i} + \underbrace{\sum_{j=1}^{unkn_i} \frac{MDE(x_j, x_{ij})}{h^2}}_{unkn_i\ \text{missing attributes of}\ x_i}\right).$$

According to the definition of the MDEdistance, we obtain:

$$\hat{f}_{h,k}(x) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\sum_{j=1}^{kn_i} \frac{(x_j - x_{ij})^2}{h^2} + \sum_{j=1}^{unkn_i} \frac{(x_j - \mu_j)^2 + \sigma_j^2}{h^2}\right). \tag{13}$$

Now, we will compute the gradient of the density estimator in Eq. (13).

For brevity, let $u_i = \sum_{j=1}^{kn_i} \frac{(x_j - x_{ij})^2}{h^2} + \sum_{j=1}^{unkn_i} \frac{(x_j - \mu_j)^2 + \sigma_j^2}{h^2}$ denote the argument of $k$ in Eq. (13). Differentiating term by term, each known attribute $j$ of $x_i$ contributes $(x_j - x_{ij})$ and each missing attribute contributes $(x_j - \mu_j)$:

$$\nabla \hat{f}_{h,k}(x) = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n} k'(u_i)\, v_i, \qquad (v_i)_j = \begin{cases} x_j - x_{ij}, & \text{attribute } j \text{ of } x_i \text{ known}, \\ x_j - \mu_j, & \text{attribute } j \text{ of } x_i \text{ missing}. \end{cases}$$

In our computation, we first deal with one coordinate $l$, and then we generalize the computation to all the other coordinates.

$$\frac{\partial \hat{f}}{\partial x_l} = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n_l} (x_l - x_{il})\, k'(u_i) + \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{m_l} (x_l - \mu_l)\, k'(u_i) = \frac{2c_{k,d}}{nh^{d+2}} \left[x_l \sum_{i=1}^{n} k'(u_i) - \sum_{i=1}^{n_l} x_{il}\, k'(u_i) - \sum_{i=1}^{m_l} \mu_l\, k'(u_i)\right],$$

where there are $n_l$ points for which the $x_l$ coordinate is known and $m_l$ points for which it is missing.

Substituting $g = -k'$ and writing $u_i = \sum_{j=1}^{d} MDE(x_j, x_{ij}) / h^2$ yields

$$\frac{\partial \hat{f}}{\partial x_l} = \frac{2c_{k,d}}{nh^{d+2}} \left[\sum_{i=1}^{n} g(u_i)\right] \left[\frac{\sum_{i=1}^{n_l} x_{il}\, g(u_i) + \sum_{i=1}^{m_l} \mu_l\, g(u_i)}{\sum_{i=1}^{n} g(u_i)} - x_l\right].$$

As a result, coordinate $l$ of the mean shift vector using the MDE distance is defined as:

$$m_{MDE,G}(x)_l = \frac{\sum_{i=1}^{n_l} x_{il}\, g(u_i) + \sum_{i=1}^{m_l} \mu_l\, g(u_i)}{\sum_{i=1}^{n} g(u_i)} - x_l, \qquad u_i = \frac{\sum_{j=1}^{d} MDE(x_j, x_{ij})}{h^2}. \tag{14}$$

Now, we can use this equation to run the mean shift procedure over datasets with missing values.
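A sketch of one iteration built directly on Eq. (14): per coordinate, known values contribute $x_{il}$ and missing values contribute $\mu_l$, so after replacing NaNs by the attribute means, the step is an ordinary weighted average. It reuses the hypothetical mde_sq_distance helper from Section 2 and the flat profile $g$ from above:

```python
import numpy as np

def mde_mean_shift_step(x, data, h, mu, var):
    """One mean shift step over incomplete data, following Eq. (14).

    x is a complete point; data rows may contain NaNs; mu/var are the
    per-attribute mean and variance of the known values.
    """
    # Flat g: 1 if the MDE distance from x to the point is within the
    # bandwidth, 0 otherwise.
    g = np.array([1.0 if mde_sq_distance(x, p, mu, var) / h**2 <= 1.0 else 0.0
                  for p in data])
    if g.sum() == 0:
        return x
    # Eq. (14): a missing coordinate of a data point contributes mu_l.
    filled = np.where(np.isnan(data), mu, data)
    return (g[:, None] * filled).sum(axis=0) / g.sum()
```

Iterating this step from every data point and grouping the points that converge to the same mode yields the MS-MDE clustering.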

6. Experiments on numerical datasets

In order to measure the performance of the developed clustering algorithms (i.e., k-means and mean shift), we compare their performance on complete datasets to their performance on incomplete data using the suggested distance function, and then again using the existing methods (MCA, MA, and MI) within the standard algorithms.

To measure the similarity between two data clusterings, we decided to use the Rand index [18]. We use it in order to compare the results of the original clustering algorithms to the results of the derived algorithms for incomplete datasets.
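The Rand index is the fraction of point pairs on which two clusterings agree (both place the pair in the same cluster, or both separate it). A straightforward O(n²) sketch of the computation:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
```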

Our experiments use six standard numerical datasets from the Speech and Image Processing Unit [13]; dataset characteristics are shown in Table 1.

Dataset        Dataset size    Clusters
Flame          240 × 2         2
Jain           373 × 2         2
Path-based     300 × 2         3
Spiral         312 × 2         3
Compound       399 × 2         6
Aggregation    788 × 2         7

Table 1.

Properties of the datasets from the Speech and Image Processing Unit.

We produced the missing data by randomly drawing a set consisting of 10–40% of the points from each dataset. These sets are used as samples of incomplete data: one attribute of each drawn point was randomly selected and marked as a missing value. For each dataset, we average the results over 10 different runs.
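This protocol is easy to reproduce; a sketch with NaN marking the simulated missing entries (the seed is arbitrary):

```python
import numpy as np

def simulate_missing(data, frac, seed=42):
    """Randomly pick frac of the points and mark one randomly chosen
    attribute of each as missing (NaN)."""
    rng = np.random.default_rng(seed)
    corrupted = data.astype(float).copy()
    idx = rng.choice(len(data), size=int(frac * len(data)), replace=False)
    cols = rng.integers(0, data.shape[1], size=len(idx))
    corrupted[idx, cols] = np.nan
    return corrupted

# Example: 10-40% missing, averaged over 10 runs as in the text.
# corrupted = simulate_missing(dataset, frac=0.25)
```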

6.1. k-Means experiments

For the k-means algorithm, we developed two versions, k-means-MDE and k-means-HistMDE, to cluster the incomplete datasets. We compare the performance of the k-means clustering algorithm ($k$ is fixed for each dataset) on complete data (i.e., without missing values) to its performance on data with missing values, using the MDE distance measure (k-means-MDE and k-means-HistMDE) and then again using k-means with MCA, MA, and MI.

As can be seen in Figure 2, the new algorithms that are based on the MDE distance outperformed the other existing algorithms on all the datasets. This occurs because in the MA and MCA methods, the whole distribution of values is replaced by the mean or the mode of the distribution of known values, that is, by a fixed value. In our two developed algorithms, we use the distribution of the observed values in all the computation stages. This additional information, taking into account not only the mean of the attribute but also its variance, is probably the reason for the improved performance of our methods compared to the known heuristics.

Figure 2.

Results of k-means clustering algorithm using the different distance functions on the six datasets from the Speech and Image Processing Unit.

6.2. Mean shift experiments

The mean shift clustering algorithm was tested using bandwidth $h = 4$ (because we saw that the standard mean shift worked well for this value).

A resulting curve for the Rand index values was constructed for each dataset to evaluate how well the algorithm performed.

As can be seen in Figure 3, for all the datasets except the Jain dataset, the curves show that the new mean shift algorithm was superior and outperformed the other compared methods for all missing value percentages, while for the Jain dataset, its superiority became apparent only when the percentage of missing values was larger than 25%, as can be seen in Figure 3(b). In addition, we can see that the MS-MC method outperforms the MS-MA method for the flame and path-based datasets, while MS-MA outperforms MS-MC for the other datasets. As a result, we cannot decide unequivocally which of these two algorithms is better. On the other hand, we can clearly state that MS-MDE outperforms the other methods, especially when the percentage of missing values increases.

Figure 3.

Results of mean shift clustering algorithm using the different distance functions on the six datasets from the Speech and Image Processing Unit.

7. Conclusions

Missing values in data are common in real-world applications. They can be caused by human error, equipment failure, system-generated errors, and so on. Several methods have been developed to deal with this problem, such as filling the missing values with fixed values, ignoring samples with missing values, or handling the missing values by defining a suitable distance function.

In this work, we have proposed a new mean shift clustering algorithm and two versions of the k-means clustering algorithm over incomplete datasets, based on the developed MDE distance that was presented in [1, 2, 12].

The computational complexities of all the developed algorithms are preserved: they are the same as those of the standard algorithms using the Euclidean distance. The distance is computed based only on the mean and variance of the data for each attribute.

We experimented on six standard numerical datasets from different fields. On these datasets, we simulated missing values and compared the performance of the developed algorithms, using our distance and the suggested mean computations, to three other basic methods.

From our experiments, we conclude that the developed methods are more appropriate for computing the mean, the mean shift vector, and the weighted mean for objects with missing values, especially when the percentage of missing values is large.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
