Density-based Clustering and Anomaly Detection

As of 1996, when the density-based clustering algorithm DBSCAN was published (Ester et al., 1996), existing clustering techniques fell into two categories: partitioning methods and hierarchical methods. Partitioning clustering attempts to break a data set into K clusters such that the partition optimizes a given criterion. Besides the difficulty of choosing a proper parameter K and the inability to discover clusters with arbitrary shape, partitioning clustering techniques are very sensitive to outliers. Although the k-medoids method (Kaufman & Rousseeuw, 1990) is more robust than k-means (MacQueen, 1967) in the presence of outliers, neither method can discover outliers. Hierarchical clustering algorithms produce a nested sequence of clusters, with a single all-inclusive cluster at the top and single-point clusters at the bottom. CURE (Guha et al., 1998) is capable of finding clusters of arbitrary shapes and reduces the effect of outliers; however, it only considers cluster proximity and ignores cluster interconnectivity, and an outlier is still assigned to the cluster whose representative point is closest to it.


Introduction
As of 1996, when the density-based clustering algorithm DBSCAN was published (Ester et al., 1996), existing clustering techniques fell into two categories: partitioning methods and hierarchical methods. Partitioning clustering attempts to break a data set into K clusters such that the partition optimizes a given criterion. Besides the difficulty of choosing a proper parameter K and the inability to discover clusters with arbitrary shape, partitioning clustering techniques are very sensitive to outliers. Although the k-medoids method (Kaufman & Rousseeuw, 1990) is more robust than k-means (MacQueen, 1967) in the presence of outliers, neither method can discover outliers. Hierarchical clustering algorithms produce a nested sequence of clusters, with a single all-inclusive cluster at the top and single-point clusters at the bottom. CURE (Guha et al., 1998) is capable of finding clusters of arbitrary shapes and reduces the effect of outliers; however, it only considers cluster proximity and ignores cluster interconnectivity, and an outlier is still assigned to the cluster whose representative point is closest to it.
To discover clusters with arbitrary shape and outliers, density-based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing outliers or noise). DBSCAN grows clusters according to a density-based connectivity analysis. OPTICS (Ankerst et al., 1999) extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings. DENCLUE (Hinneburg & Keim, 1998) clusters objects based on a set of density distribution functions. LOF (Breunig et al., 2000) assigns to each object a degree of being an outlier, which is more meaningful than treating outlierness as a binary property. LDBSCAN (Duan et al., 2007) combines the concepts of DBSCAN and LOF to discover clusters and outliers. There are two potential benefits of combining clustering and outlier detection: increasing precision and facilitating data understanding. The goal of this chapter is to survey the core concepts and techniques of density-based clustering and outlier detection (Duan et al., 2009), which have their roots in data mining, statistics, machine learning and other communities.

OPTICS can solve this problem; however, it only creates an augmented ordering of the database representing its density-based clustering structure instead of producing the clusters of a data set explicitly. In addition, it might not appropriately generate clusters that reside in other clusters; this will be discussed in the experimental part. Therefore, an algorithm which can detect A, B, C_1, C_2, and C_3 explicitly is needed.

Definition of LRD and LOF
The LOF of each object represents the degree to which the object is outlying, and the LRD of each object represents its local density. The formal definitions of LOF and LRD are briefly introduced in the following. More details are provided in (Breunig et al., 2000).
Definition 1 (k-distance of an object p): For any positive integer k, the k-distance of object p, denoted as k-distance(p), is defined as the distance d(p,o) between p and an object o ∈ D such that:
1. for at least k objects o' ∈ D\{p} it holds that d(p,o') ≤ d(p,o), and
2. for at most k-1 objects o' ∈ D\{p} it holds that d(p,o') < d(p,o).
Definition 2 (k-distance neighborhood of an object p): Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance.

The LOF of object p is the average of the ratios of the LRDs of p's MinPts-nearest neighbors to the LRD of p. It captures the degree to which p is called an outlier. It is easy to see that the lower the LRD of p is relative to those of p's MinPts-nearest neighbors, the farther away the point p is from its nearest cluster, and the higher the LOF value of p is. Since the LOF represents the degree to which the object is outlying, and the LOF of most objects in a cluster is approximately equal to 1, we regard object p as belonging to a certain cluster if LOF(p) is lower than a threshold we set.
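The quantities described above can be made concrete with a small sketch. The following is a minimal pure-Python illustration of k-distance, reachability distance, LRD and LOF as defined in (Breunig et al., 2000), assuming Euclidean distance and a data set small enough for a full pairwise distance matrix; the function name `lrd_and_lof` is hypothetical and not part of the chapter.

```python
import math

def lrd_and_lof(points, min_pts):
    """Return (lrd, lof, neighbors) for a list of coordinate tuples.

    Assumes len(points) > min_pts and that no point has min_pts or more
    exact duplicates (otherwise the LRD would be infinite).
    """
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    k_dist, neighbors = [], []
    for p in range(n):
        others = sorted(dist[p][o] for o in range(n) if o != p)
        kd = others[min_pts - 1]          # k-distance(p)
        k_dist.append(kd)
        # k-distance neighborhood: ties can yield more than min_pts objects
        neighbors.append([o for o in range(n) if o != p and dist[p][o] <= kd])

    lrd = []
    for p in range(n):
        # reach-dist(p, o) = max(k-distance(o), d(p, o))
        reach = [max(k_dist[o], dist[p][o]) for o in neighbors[p]]
        lrd.append(len(reach) / sum(reach))   # inverse of the average

    # LOF(p): average ratio of the neighbors' LRD to the LRD of p
    lof = [sum(lrd[o] for o in neighbors[p]) / (len(neighbors[p]) * lrd[p])
           for p in range(n)]
    return lrd, lof, neighbors
```

On a regular grid with one distant extra point, the grid's interior points obtain LOF values close to 1 while the distant point obtains a much larger value, which is the behaviour the threshold on LOF relies on.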

A local-density based notion of clusters
When looking at the sample set of points depicted in Figure 2, we can easily and unambiguously detect clusters of points and noise points not belonging to any of those clusters. The main reason is that within each cluster the local density of points is different from that outside the cluster.
In the following, these intuitive notions of "clusters" and "noise" are formalized. Note that both the notion of clusters and the algorithm LDBSCAN apply to 2D Euclidean space as well as to higher-dimensional feature spaces. The approach works with any distance function, so an appropriate function can be chosen for a given application. In this chapter, for the purpose of proper visualization, the examples are in 2D space using the Euclidean distance. The key idea is that for any point p satisfying LOF(p) ≤ LOFUB, i.e., point p is not an outlier and belongs to a certain cluster C, if point q is a MinPts-nearest neighbour of p and has an LRD similar to that of p, then q belongs to the same cluster C as p. If LOF(p) is small enough, point p is not an outlier and must belong to some cluster. Therefore it can be regarded as a core point.
Definition 7 (directly local-density-reachable): A point p is directly local-density-reachable from a point q w.r.t. pct and MinPts if
1. p ∈ N_MinPts(q), and
2. LRD(q)/(1+pct) < LRD(p) < LRD(q)*(1+pct).
Here, the parameter pct is used to control the fluctuation of local density. In general, this relation is not symmetric: it fails to be symmetric if q is not among the MinPts-nearest neighbours of p. Figure 3 shows the asymmetric case. Let MinPts=3 and pct=0.3; we calculate that LRD(p)/LRD(q)=1.27. It shows that p is directly local-density-reachable from q, but q is not directly local-density-reachable from p.
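The two conditions of Definition 7 translate directly into a predicate. The sketch below assumes that the LRD values and the MinPts-nearest neighbour sets have already been computed; the function name and the toy values (chosen only to mirror the ratio 1.27 of the example, with hypothetical neighbour names a, b, c) are illustrative and not taken from Figure 3.

```python
def directly_reachable(p, q, lrd, neighbors, pct):
    """True if p is directly local-density-reachable from q (Definition 7)."""
    in_neighborhood = p in neighbors[q]                               # condition 1
    similar_density = lrd[q] / (1 + pct) < lrd[p] < lrd[q] * (1 + pct)  # condition 2
    return in_neighborhood and similar_density

# Hypothetical toy values: LRD(p)/LRD(q) = 1.27, p is a neighbour of q,
# but q is not among the neighbours of p.
lrd = {"p": 1.27, "q": 1.0}
neighbors = {"q": {"p", "a", "b"}, "p": {"a", "b", "c"}}

p_from_q = directly_reachable("p", "q", lrd, neighbors, pct=0.3)
q_from_p = directly_reachable("q", "p", lrd, neighbors, pct=0.3)
```

With these values the density condition holds in both directions, so the asymmetry comes only from the neighbourhood condition, as in the case described above.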

Fig. 3. Directly local-density-reachability
Definition 8 (local-density-reachable): A point p is local-density-reachable from the point q w.r.t. pct and MinPts if there is a chain of points p_1, p_2, ..., p_n, with p_1=q and p_n=p, such that p_{i+1} is directly local-density-reachable from p_i.
Local-density-reachability is a canonical extension of direct local-density-reachability. This relation is transitive, but it is not symmetric. Figure 4 depicts the relations of some sample points and an asymmetric case. Let MinPts=3 and pct=0.3. According to the above definitions, LRD(p)/LRD(o)=1.27 and LRD(o)/LRD(q)=0.95. Here, q is local-density-reachable from p, but p is not local-density-reachable from q.

From the definition, local-density-connectivity is a symmetric relation, as shown in Figure 5. Now we can make use of the above definitions to define the local-density-based cluster. Intuitively, a cluster is defined as a set of local-density-connected points which is maximal w.r.t. local-density-reachability. Noise is defined relative to a given set of clusters: it is simply the set of points in the dataset not belonging to any of its clusters.

The algorithm
In this section, we present the algorithm LDBSCAN, which is designed to discover the clusters and the noise in a spatial database according to Definitions 10 and 11. First, the appropriate parameters LOFUB, pct, and MinPts of clusters and one core point of the respective cluster are selected. Then all points that are local-density-reachable from the given core point using the correct parameters are retrieved. Since all the parameters are relative, and not absolute as those in DBSCAN, they are easy to choose and fall in a certain range, as presented in the experimental part.
To find a cluster, LDBSCAN starts with an arbitrary point p and retrieves all points local-density-reachable from p w.r.t. LOFUB, pct, and MinPts. The LDBSCAN algorithm randomly selects one core point which has not been clustered, and then retrieves all points that are local-density-reachable from the chosen core point to form a cluster. It does not stop until there is no unclustered core point.

Cluster-Based Outliers
In this section, we give the definition of cluster-based outliers and conduct a detailed analysis of their properties. The goal is to show how to discover cluster-based outliers and how the definition of the cluster-based outlier factor (CBOF) captures the spirit of cluster-based outliers. The higher the CBOF is, the more abnormal the cluster-based outlier is.

Definition of Cluster-Based Outliers
Intuitively, most data points in the data set should not be outliers; therefore, only the clusters that hold a small portion of data points are candidates for cluster-based outliers. Considering the different and complicated situations, it is impossible to provide a definite number as the upper bound of the number of the objects contained in a cluster-based outlier (UBCBO). Here, only a guideline is provided to find a reasonable upper bound.
Definition 12 (Upper Bound of the Cluster-Based Outlier): Let C_1, ..., C_k be the clusters of the database D discovered by LDBSCAN in the sequence that |C_1| ≥ |C_2| ≥ … ≥ |C_k|. Given the parameter α, the number of the objects in the cluster C_i is the UBCBO if
\[ |C_1| + |C_2| + \dots + |C_{i-1}| \ge \alpha\,|D| \quad\text{and}\quad |C_1| + |C_2| + \dots + |C_{i-2}| < \alpha\,|D|. \]
Definition 12 gives a quantitative measure of UBCBO. Since most data points in the dataset are not outliers, clusters that hold a large portion of data points should not be considered as outliers. For example, if α is set to 90%, we intend to regard the clusters which together contain 90% of the data points as normal clusters.
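As an illustration of this guideline, the sketch below computes UBCBO under one reading of Definition 12: clusters are sorted by decreasing size, the largest clusters that together cover at least α·|D| objects are treated as normal, and UBCBO is the size of the next cluster. The function name and the cluster sizes in the usage line are hypothetical.

```python
from itertools import accumulate

def ubcbo(cluster_sizes, n_objects, alpha):
    """Upper bound on the size of a cluster-based outlier.

    cluster_sizes: sizes of the clusters found by the clustering step
    n_objects:     |D|, the size of the whole database (noise included)
    alpha:         fraction of D that the normal clusters should cover
    Returns 0 if no cluster qualifies as a cluster-based outlier.
    """
    sizes = sorted(cluster_sizes, reverse=True)
    for i, covered in enumerate(accumulate(sizes)):
        if covered >= alpha * n_objects:
            # sizes[i] is the last normal cluster; the next one sets the bound
            return sizes[i + 1] if i + 1 < len(sizes) else 0
    return 0

# Hypothetical example: 500 objects, alpha = 0.9
bound = ubcbo([300, 120, 50, 12, 9, 6], n_objects=500, alpha=0.9)
outlier_sizes = [s for s in [300, 120, 50, 12, 9, 6] if s <= bound]
```

In this hypothetical example the three largest clusters cover 470 of 500 objects, so the clusters with at most 12 objects become candidates for cluster-based outliers.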
Definition 13 (Cluster-based outlier): Let C_1, ..., C_k be the clusters of the database D discovered by LDBSCAN. Cluster-based outliers are the clusters in which the number of the objects is no more than UBCBO.
Note that this guideline is not always appropriate. For example, in some cases an abnormal cluster that deviates from a large cluster might contain more points than a certain small normal cluster. In fact, due to spatial and temporal locality, it would be more proper to choose clusters that have a small spatial or temporal span as cluster-based outliers than clusters that contain few objects. The notion of cluster-based outliers depends on the situation.

The lower bound of the number of objects contained in a cluster
Compared with single point outliers, cluster-based outliers are more interesting. Many single point outliers are related to occasional trivial events, while cluster-based outliers concern important, lasting abnormal events. Generally speaking, it is reckless to form a cluster with only 2 or 3 objects, so the lower bound of the number of the objects contained in a cluster generated by LDBSCAN is discussed in the following.
Definition 14 (distance between two clusters): Let C_1, C_2 be clusters of the database D. The distance between C_1 and C_2 is defined as
\[ \mathrm{dist}(C_1, C_2) = \min\{\, d(p,q) \mid p \in C_1,\ q \in C_2 \,\} \]
Theorem 1: Let C_1 be the smallest cluster discovered by LDBSCAN w.r.t. appropriate parameters LOFUB, pct and MinPts, and let C_2, which is large enough, be the closest normal cluster to C_1. Let LRD(C_1) denote the minimum LRD of all the objects in C_1, i.e., LRD(C_1)=min{LRD(p) | p ∈ C_1}. Similarly, let LRD(C_2) denote the minimum LRD of all the objects in C_2. Then LBC, the lower bound of the number of the objects contained in a cluster, satisfies: [equation missing]
Proof (Sketch): Let p_i denote the i-th object in C_1 and q_{i,j} be the j-th closest object to p_i in C_2. Let k be the number of the objects in C_1. To simplify the proof, we only consider the situation in which each point has exactly k k-nearest neighbors and the density within a cluster fluctuates slightly.
(a) If k ≥ MinPts+1, according to the definition of LOF, the LOF of any object in C_1 is approximately equal to 1. That is, LOF(p_i) < LOFUB and each object in C_1 is a core point. In addition, each object in C_1 has an LRD similar to that of its neighbors which belong to the same cluster. According to the definition of the cluster, the cluster C_1 would be discovered by LDBSCAN. Thus, LBC is no more than MinPts+1.
(b) If k ≤ MinPts, the MinPts-distance neighbors of p_i contain the remaining k-1 objects in C_1 and another MinPts-k+1 neighbors in C_2, as shown in Figure 6. Obviously, the MinPts-distance of each fixed object p_j in C_1 is greater than the distance between any object p_i in C_1 and p_j, so reachdist(p_i,p_j) = MinPts-distance(p_j). Furthermore, MinPts-distance(q_{i,j}) << dist(C_1,C_2) ≤ d(p_i,q_{i,j}).
[...] MinPts-dist(p)+ε_i. Similarly, let d(p,q)=min{d(p_i,q_{i,j}) | p_i ∈ C_1, q_{i,j} ∈ C_2 and q_{i,j} is a MinPts-neighbor of p_i} and d(p_i,q_{i,j})=d(p,q)+ε_{i,j}. Because we assume that the density within a cluster fluctuates slightly, MinPts-dist(p) >> ε_i and d(p,q) >> ε_{i,j}.
Compare the LRD of object p_i with that of its neighbor [...] Thus, the objects in C_1 have similar LRDs.
Now consider the ratio of the LRD of the object p_i to that of its neighbor q_j in C_2. Let reachdist-max be the maximum reachability distance of the object q_j which is the object in [...]
Since the LOF of objects deep in a cluster is approximately equal to 1, the LOFUB must be greater than 1. Then [...] In other words, LBC satisfies the inequality LBC ≤ MinPts+1 discussed in part (a). Let us consider another extreme situation. The LOFUB is so large that (LOFUB*MinPts+1)*LRD(p) is larger than (MinPts+1)*LRD(q), and in this case LBC is less than 1. As a matter of fact, it is impossible for LBC to be less than 1. When LOFUB is large enough, an object p which is a single point outlier still satisfies the core point condition, LOF(p) ≤ LOFUB; therefore, the object p is deemed a core point that should belong to a certain cluster. In this case, it forms a cluster which contains only one object by itself.

The Cluster-Based Outlier Factor
Since being an outlier is more than a binary property (Breunig et al., 2000), a cluster-based outlier also needs a value that expresses its degree of being an outlier. In the following we give the definition of the cluster-based outlier factor.

Definition 15 (Cluster-based outlier factor): Let C_1 be a cluster-based outlier and C_2 be the nearest non-outlier cluster of C_1. The cluster-based outlier factor of C_1 is defined as
\[ \mathrm{CBOF}(C_1) = |C_1| \cdot \mathrm{dist}(C_1, C_2) \cdot \frac{\sum_{p_i \in C_2} \mathrm{LRD}(p_i)}{|C_2|} \]
The cluster-based outlier factor of the cluster C_1 is the result of multiplying the number of the objects in C_1 by the product of the distance between C_1 and its nearest normal cluster C_2 and the average local reachability density of C_2. The outlier factor of cluster C_1 captures the degree to which we call C_1 an outlier. Assume that C_1, as a cluster-based outlier, deviates from its nearest normal cluster C_2. It is easy to see that the more objects C_1 contains, the farther away C_1 is from C_2, and the denser C_2 is, the higher the CBOF of C_1 is and the more abnormal C_1 is.
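The verbal description of CBOF can be sketched in a few lines. The example below assumes that the distance between two clusters is the minimum pairwise Euclidean distance between their objects and that the LRD values of the objects in C_2 are already available; the function names and the toy clusters are hypothetical.

```python
import math

def cluster_distance(c1, c2):
    """Minimum pairwise Euclidean distance between two clusters (assumption)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def cbof(c1, c2, lrd_c2):
    """|C1| * dist(C1, C2) * average LRD of the objects in C2."""
    avg_lrd = sum(lrd_c2) / len(lrd_c2)
    return len(c1) * cluster_distance(c1, c2) * avg_lrd

# Hypothetical toy data: a 2-object outlier cluster next to a 3-object cluster
c1 = [(10.0, 0.0), (11.0, 0.0)]
c2 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
factor = cbof(c1, c2, lrd_c2=[1.0, 1.0, 1.0])
```

Doubling the number of objects in C_1, its distance to C_2, or the average density of C_2 doubles the factor, which matches the monotonicity stated above.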

Experiments
A comprehensive performance study has been conducted to evaluate our algorithm. In this section, we describe those experiments and their results. The algorithm was run on both real-life datasets obtained from the UCI Machine Learning Repository and synthetic datasets.

LDBSCAN
In this section, we demonstrate how the proposed LDBSCAN can generate meaningful clusters that other methods cannot generate.

A synthetic dataset with clusters resided in other clusters
In order to test the effectiveness of the algorithm, both LDBSCAN and OPTICS are applied to a data set with 473 points, as shown in Figure 7. Both LDBSCAN and OPTICS can generate the magenta cluster D, the cyan cluster E, and the green cluster F. But OPTICS can only generate the cluster G, which contains all the magenta, cyan, green, and pink points.
It is more reasonable to generate a cluster that contains only the pink points, because of their similarity in local density. LDBSCAN therefore produces clusters of similar local density, whereas OPTICS produces clusters whose local density exceeds certain thresholds.
The result of LDBSCAN can be influenced by the choice of the parameters. There are two different MinPts parameters: one is for the calculation of LOF and the other is for the clustering algorithm. For most datasets, it seems to work well when MinPts for LOF is between 10 and 20; more details can be found in (Breunig et al., 2000). For convenience of presentation, MinPts_LOF is used as a shorthand for MinPts for LOF and MinPts_LDBSCAN as a shorthand for MinPts for the clustering algorithm.
For objects deep inside a cluster, the LOFs are approximately equal to 1. The greater the LOF is, the more likely the object is to be an outlier. If the value selected for LOFUB is too small, some core points may be mistakenly considered as outliers; if the value is too large, some outliers may be mistakenly considered as core points. For most of the datasets that have been experimented with, picking a value from 1.5 to 2.5 appears to work well. However, the appropriate value depends on the data. For example, in one dataset we identified multiple clusters, e.g., a cluster of pictures from a tennis match, and a reasonable LOFUB is up to 7. In Figure 8, the red points are those whose LOF exceeds the LOFUB when MinPts_LOF=15.
Parameter MinPts_LDBSCAN determines the stand-by objects belonging to the same cluster as the core point. Clearly MinPts_LDBSCAN can be as small as 1. However, if MinPts_LDBSCAN is too small, some reasonable objects may be missed. Thus we suggest that MinPts_LDBSCAN be at least 5 in order to take enough reasonable objects into account. The upper bound of MinPts_LDBSCAN is based on a more subtle observation. Let p ∈ C_1 and q ∈ C_2, where C_1 has a density similar to that of C_2, and let p and q be the nearest objects between C_1 and C_2. Consider the simple situation in which distance(C_1, C_2) is small enough, as shown in Figure 10. As the MinPts_LDBSCAN value increases, MinPts-distance(p) increases monotonically. Once MinPts-distance(p) is greater than distance(C_1, C_2), C_1 and C_2 will be merged into one cluster. In Figure 10, clustering starts with any core point in C_1. When MinPts_LDBSCAN reaches 10, C_1 and C_2 will be merged into one cluster C. Therefore, the value for MinPts_LDBSCAN should not be too large.
When MinPts_LDBSCAN reaches 15, enough candidates will be considered. A value from 5 to 15 can be chosen for MinPts_LDBSCAN.

Comet-like clusters
In order to demonstrate the accuracy of the clustering results of LDBSCAN, both LDBSCAN and OPTICS are applied to the 2-dimensional dataset shown in Figure 11. LDBSCAN discovers the cluster C_1 consisting of small rectangle points, the cluster C_2 consisting of small circle points, and the outliers P_1, P_2, P_3 denoted by hollow rectangle points. OPTICS discovers the clusters whose reachability-distance falls into the dents and assigns a point to a cluster according to its reachability-distance, regardless of its neighborhood density. Because the reachability-distance of the point P_3 is similar to that of the points on the right side of the cluster C_2, the side whose density is relatively low, OPTICS would assign the point P_3 to the cluster C_2, while LDBSCAN identifies the point P_3 as an outlier because its local density differs from that of its neighbors. Although both OPTICS and LDBSCAN can identify the points P_1 and P_2 as outliers, the clustering result of OPTICS is not accurate, especially when the border density of a cluster varies, as in the comet-like cluster.

Cluster-based outliers
The performance of cluster-based outliers is tested in this section.

Wisconsin breast cancer data
The second dataset used is the Wisconsin breast cancer data set, which has 699 instances with nine attributes; each record is labeled as benign (458 or 65.5%) or malignant (241 or 34.5%). The local density can be ∞ if there are more than MinPts objects that are different from each other but share the same spatial coordinates. To avoid this situation, only 3 duplicates of any given spatial coordinates are retained and the rest are removed. In addition, the 16 records with missing values are also removed. The resultant dataset therefore has 327 (57.8%) benign records and 239 (42.2%) malignant records.
The algorithm processed the dataset with pct=0.5, LOFUB=3, MinPts=10, and α=0.95. Both LOF and our algorithm find the following 4 noise records, which are single point outliers, shown in Table 1. Understandably, our algorithm builds on the result of LOF, and thus both find the same single point outliers.

Boston housing data
The Boston housing dataset, which is taken from the StatLib library, concerns housing values in suburbs of Boston. It contains 506 instances with 14 attributes. Before clustering, the data need to be standardized in order to assign each variable an equal weight. Here the z-score is computed with the mean absolute deviation, because the mean absolute deviation is more robust than the standard deviation (Han & Kamber, 2006). The algorithm processed the dataset with pct=0.5, LOFUB=2, MinPts=10, and α=0.9. One single point outlier, 3 normal clusters and 6 cluster-based outliers are discovered. There are few single point outliers in this dataset. The maximum LOF, the value of the 381st record, is 2.624, which indicates that there is no significant deviation. In addition, the 381st record is assigned to the 9th cluster, which is a cluster-based outlier. Its LOF exceeds LOFUB because of the small number of objects contained in the 9th cluster to which it belongs. This small number, which is less than MinPts, affects the accuracy of LOF. Eight of the nine records whose LOF exceeds LOFUB are assigned to a certain cluster, and the LOF of the only single point outlier, the 215th record, is 2.116. The 215th record has a smaller proportion of owner-occupied units built prior to 1940, the 7th attribute, than its neighbors.
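The standardization step mentioned above can be sketched as follows. This is a minimal illustration of a z-score that divides by the mean absolute deviation instead of the standard deviation, in the spirit of (Han & Kamber, 2006); the function name is hypothetical and the sample column is made up for illustration.

```python
def standardize(column):
    """Z-score of one attribute using the mean absolute deviation.

    Assumes the column is not constant (the deviation would be zero).
    """
    n = len(column)
    mean = sum(column) / n
    # Mean absolute deviation: average distance of the values from the mean
    mad = sum(abs(x - mean) for x in column) / n
    return [(x - mean) / mad for x in column]

# Made-up attribute values
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because deviations are not squared, a single extreme value inflates the scaling factor less than it would inflate the standard deviation, so outliers remain more visible after standardization.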
However, the 6 cluster-based outliers are more interesting than the only single point outlier. Table 3 gives the information on all 9 clusters, and the additional information on the cluster-based outliers is shown in Table 4. The 3rd cluster, which is a cluster-based outlier and has the maximum CBOF, deviates from the 1st cluster. Its 12th attribute, 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town, is much lower than that of the 1st cluster. Both the 9th cluster and the 6th cluster deviate from the 1st cluster. Although the 6th cluster contains more objects than the 9th cluster, the CBOF of the 6th cluster is less than that of the 9th cluster because the 9th cluster is farther away from the 1st cluster than the 6th cluster is. The records in the 9th cluster have a significantly higher per capita crime rate by town than those of the 1st cluster. However, it is not easy to differentiate the records in the 6th cluster from those of the 1st cluster. Moreover, the relationship between the 4th cluster and the 8th cluster is also notable.

Record numbers (table residue; cluster membership not recoverable): 412, 416, 417, 420, 424, 425, 426, 427, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 446, 451, 455, 456, 457, 458.

Of the records whose tract bounds the Charles River, 34 are discovered in the 4th cluster. The only exceptional record, the 284th record, has a slightly high proportion of residential land zoned for lots over 25,000 square feet, the 2nd attribute, and a relatively low proportion of non-retail business acres per town, the 3rd attribute. The area denoted by the 284th record is more like a residential area than the other areas along the Charles River.

www.intechopen.com

Conclusion
In this chapter, we have examined various density-based techniques (DBSCAN, OPTICS, LOF, LDBSCAN and cluster-based outlier detection) and have described several applications of these techniques. Clustering is a process of grouping data based on a measure of similarity, and outlier detection is a process of discovering the data objects which are grossly different from or inconsistent with the remaining set of data. Both clustering and outlier detection are subjective processes; the same set of data often needs to be processed differently for different applications. This subjectivity makes clustering and outlier detection hard, and it is why a single algorithm or approach is not adequate to solve all the problems.
The most challenging step is feature extraction and pattern representation. In this chapter, the step of pattern representation is conveniently avoided by assuming that the pattern representations are available as input to the clustering and outlier detection algorithm. Especially in the case of large data sets, it is difficult for the user to keep track of the importance of each feature. Compared with partitioning and hierarchical methods, density-based methods stand out both in discovering clusters with arbitrary shape and in outlier detection. Among them, OPTICS and LDBSCAN are the most successfully used due to their accuracy. They can effectively discover clusters with different local densities. In summary, clustering and outlier detection is an interesting, useful and challenging problem. Density-based techniques offer good accuracy; however, their potential can only be exploited after making several design choices carefully.

Fig. 1. Clusters with respect to different global density parameters

Definition 2 (continued): The k-distance neighborhood of p is N_{k-distance(p)}(p) = { q ∈ D\{p} | d(p,q) ≤ k-distance(p) }. These objects q are called the k-nearest neighbors of p. When no confusion arises, the notation is simplified to N_k(p) as a shorthand for N_{k-distance(p)}(p).

Definition 3 (reachability distance of an object p w.r.t. object o): Let k be a natural number. The reachability distance of object p with respect to object o is defined as
reach-dist_k(p,o) = max{ k-distance(o), d(p,o) }

Definition 4 (local reachability density of an object p): The local reachability density of p is defined as
\[ \mathrm{LRD}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p,o)}{|N_{MinPts}(p)|} \right) \]
Intuitively, the local reachability density of an object p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p.

Definition 5 (local outlier factor of an object p): The local outlier factor of p is defined as
\[ \mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \dfrac{\mathrm{LRD}_{MinPts}(o)}{\mathrm{LRD}_{MinPts}(p)}}{|N_{MinPts}(p)|} \]

Fig. 4. Local-density-reachability

Definition 9 (local-density-connected): A point p is local-density-connected to a point q from o w.r.t. pct and MinPts if there is a point o such that both p and q are local-density-reachable from o w.r.t. pct and MinPts.

Fig. 5. Local-density-connectivity

Definition 10 (cluster): Let D be a database of points, and let point o be a selected core point of C, i.e., o ∈ C and LOF(o) ≤ LOFUB. A cluster C w.r.t. LOFUB, pct and MinPts is a non-empty subset of D satisfying the following conditions:
1. ∀p: if p is local-density-reachable from o w.r.t. pct and MinPts, then p ∈ C. (Maximality)
2. ∀p,q ∈ C: p is local-density-connected to q by o w.r.t. LOFUB, pct and MinPts. (Connectivity)

Definition 11 (noise): Let C_1, ..., C_k be the clusters of the database D w.r.t. parameters LOFUB, pct and MinPts. Then we define the noise as the set of points in the database D not belonging to any cluster C_i, i.e., noise = { p ∈ D | ∀i: p ∉ C_i }.

Fig. 7. Reachability-plot for a data set with hierarchical clusters of different sizes, densities and shapes

Fig. 8. Core points and outliers

Parameter pct controls the accepted fluctuation of local density. The value of pct depends on the fluctuation within the cluster. Generally speaking, it is between 0.2 and 0.5. Of course, in some particular situations, values outside this range can be chosen. Let MinPts_LOF=15, MinPts_LDBSCAN=10, and LOFUB=2.0. Figure 9 shows the clustering results for different values of pct.
Retrieval is performed w.r.t. LOFUB, pct, and MinPts. If p is a core point, this procedure yields a cluster w.r.t. LOFUB, pct, and MinPts. If p is not a core point, LDBSCAN will check the next point of the dataset. In the following, we present a basic version of LDBSCAN without details of data types and generation of additional information about clusters.

SetOfPoints is the set of the whole database. LOFUB, pct and MinPts are the carefully chosen parameters. The function SetOfPoints.get(i) returns the i-th element of SetOfPoints. Points which have been marked as NOISE may be changed later if they are local-density-reachable from some core point of the database. The most important function used by LDBSCAN is ExpandCluster, which is presented in the following:

ExpandCluster(SetOfPoints, Point, ClusterID, pct, MinPts)
    SetOfPoint.changeClId(Point, ClusterID);
    FOR i FROM 1 TO MinPts DO
        currentP := Point.Neighbor(i);
        IF currentP.ClId IN {UNCLASSIFIED, NOISE} and DirectReachability(currentP, Point) THEN
            TempVector.add(currentP);
            SetOfPoint.changeClId(currentP, ClusterID);
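Since the pseudocode above is incomplete, the following Python sketch fills in the remaining control flow under stated assumptions: points are visited in index order rather than randomly, a point whose LOF exceeds LOFUB is provisionally marked as noise, and every point added to a cluster is expanded in turn through a queue (playing the role of TempVector). The LRD values, LOF values and neighbour lists are assumed to be precomputed, and all names are illustrative rather than the authors' implementation.

```python
from collections import deque

UNCLASSIFIED, NOISE = -2, -1

def ldbscan(lrd, lof, neighbors, lofub, pct):
    """Return one cluster label per point (NOISE for unclustered points)."""
    n = len(lrd)
    labels = [UNCLASSIFIED] * n
    cluster_id = 0
    for start in range(n):
        if labels[start] != UNCLASSIFIED:
            continue
        if lof[start] > lofub:            # not a core point: check the next point
            labels[start] = NOISE
            continue
        labels[start] = cluster_id        # start a new cluster from a core point
        queue = deque([start])
        while queue:
            q = queue.popleft()
            for p in neighbors[q]:        # condition 1: p is a neighbour of q
                similar = lrd[q] / (1 + pct) < lrd[p] < lrd[q] * (1 + pct)
                if labels[p] in (UNCLASSIFIED, NOISE) and similar:
                    labels[p] = cluster_id   # noise may be relabelled here
                    queue.append(p)
        cluster_id += 1
    return labels

# Hypothetical toy input: two groups of similar density and one outlier
lrd = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 0.1]
lof = [1.0, 1.0, 1.1, 1.0, 1.0, 1.0, 6.0]
neighbors = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4], [2, 3]]
labels = ldbscan(lrd, lof, neighbors, lofub=2.0, pct=0.3)
```

In this toy input the first three points form one cluster, the next three form another, and the last point stays noise because its LOF exceeds LOFUB and its LRD is not similar to that of any clustered point that has it as a neighbour.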

Table 1. Single point outliers in Wisconsin breast cancer dataset

Besides the single point outliers, our algorithm discovers 3 clusters, shown in Table 2, among which there are 2 big clusters and 1 small cluster. One big cluster, A, contains 296 benign records and 6 malignant records, and the other, B, contains 26 benign records and 233 malignant records. The small cluster C contains only 1 record p. Among the MinPts-nearest neighbors of this record, six belong to the cluster A and the other four belong to the cluster B. The record p lies between clusters A and B, and LOF(p)=1.795. It is closer to A than to B, but its local reachability density is similar to that of B rather than that of A. Thus, it forms a cluster by itself. This kind of special record cannot be easily discovered by LOF when its MinPts-nearest neighborhood overlaps with more than one cluster.

Table 2. Clusters in Wisconsin breast cancer dataset

In the whole dataset there are 35 records whose tract bounds the Charles River, as indicated by the 4th attribute.

Table 4. Cluster-based outliers in Boston housing dataset