Open access peer-reviewed chapter

Perspective Chapter: Matching-Based Clustering Algorithm for Categorical Data

Written By

Ruben Gevorgyan and Yenok Hakobyan

Submitted: 20 February 2022 Reviewed: 15 December 2022 Published: 26 July 2023

DOI: 10.5772/intechopen.109548

From the Edited Volume

Blockchain Applications - Transforming Industries, Enhancing Security, and Addressing Ethical Considerations

Edited by Vsevolod Chernyshenko and Vardan Mkrttchian


Abstract

Blockchain technology allows confidential data to remain strictly confidential while still being used for machine learning with external researchers. Blockchain enables valuable datasets to be processed reliably and speeds up the development of valid data mining applications. It can also make it much easier to share datasets, machine learning models, decentralized intelligence, and trustworthy decision-making, which is very important in anomaly and fraud detection. This chapter presents a new framework for partitioning categorical data that does not rely on a distance measure as its key concept. While algorithms for continuous data are relatively well studied in the literature, there are still challenges to address in the case of categorical data. The matching-based clustering algorithm is designed around the similarity matrix and a novel framework for updating it using feature importance criteria. The experimental results show that this algorithm can serve as an alternative to existing ones and can be an efficient knowledge discovery tool, especially for anomaly detection using blockchain technologies.

Keywords

  • categorical data
  • clustering
  • similarity matrix
  • feature importance
  • anomaly detection

1. Introduction

In the academic literature, one can find many publications in which data mining methods have been applied to solve problems in relation to blockchain systems. In their detailed review article, Liu et al. [1] divide these tasks into the following three categories: cryptocurrency price prediction, blockchain address deanonymization, and anomaly detection. This chapter presents a new clustering method for anomaly detection.

Blockchain platforms are often subject to a variety of malicious attacks [2]. Such actions can potentially be detected by analyzing patterns in transactions. Since the number of anomalous transactions is small, this problem was solved using either empirically derived rules or cluster analysis [3, 4]. As noted by Liu et al. [1], more research is required in this area, since most of the models obtained were characterized by a low proportion of identified anomalies.

Clustering is one of the “super problems” in data mining. Generally speaking, clustering is the partitioning of data points into intuitively similar groups [5]. This definition is simple and does not consider the challenges that occur when applying cluster analysis to real-world datasets. Nevertheless, this type of analysis is common in areas such as text mining, marketing research, customer behavior analysis, financial market exploration, and so on. Various clustering algorithms have been introduced in the literature, each with its advantages and disadvantages. Moreover, as data come in different forms such as text, numeric, categorical, and image, these algorithms perform differently in different scenarios. In other words, the performance of a particular clustering algorithm depends on the structure of the data under consideration.

Cluster analysis of numeric data is relatively well studied in the literature. Various approaches have been implemented, such as representative-based, hierarchical, density-based, graph-based, probabilistic, and so on [6]. Recently, increasing attention has been paid to clustering non-numeric types of data, and an important topic is the clustering of categorical data. The problem is that the algorithms for categorical data clustering are mainly modifications of the ones introduced for numeric data. For instance, the most common algorithm is K-modes [7], which is an adaptation of the K-means [8] algorithm. Several researchers have developed algorithms specifically for categorical data, but there is still much room for new approaches.

The main problem with partitioning categorical data is that the standard operations used in clustering algorithms are not applicable. For instance, the definition of distance between two objects with categorical attributes is not as straightforward as with numeric attributes: categorical data takes only discrete values, which, unlike continuous data, have no order. Thus, the definition of distance for categorical data is ambiguous. Therefore, researchers have developed and used similarity measures [9, 10, 11] or have applied different types of transformation [12]. Another problem is the assessment of cluster representatives, because many mathematical operations are not applicable to categorical data; for instance, it is impossible to compute the mean of a categorical feature. Taking into account the limitations of existing algorithms, one may consider developing an algorithm that neither uses predefined distance/similarity measures as a key concept nor relies on representatives for assigning data points to clusters.

This idea motivated us to develop the matching-based clustering algorithm. In this paper, we are not interested in improving a similarity measure or modifying existing algorithms. The key concept of the introduced algorithm is that two objects with categorical features are similar only if all the features under consideration match. Thus, the algorithm is based on the similarity matrix. In addition, we employ a feature importance framework to choose which features to drop on each iteration until all the objects are clustered. Tests on the soybean disease dataset show that the algorithm is highly accurate and compares favorably with existing approaches.

The rest of the paper is organized as follows. We briefly review the common categorical data clustering algorithms in Section 2. In Section 3, we discuss categorical data and its limitations. In Section 4, we introduce the general framework of the matching-based clustering algorithm. Section 5 presents the experimental results on the soybean disease dataset. Finally, we summarize our work and describe our future plans in Section 6.


2. Categorical data clustering literature review

Researchers have proposed various methods and algorithms for clustering categorical data. The most common approach is the transformation of the data into a binary dataset and then the application of standard algorithms with some modification if required. Nevertheless, scholars have developed a wide variety of algorithms specifically for clustering categorical data in recent years. These algorithms can be grouped into five main classes: model-based, partition-based, density-based, hierarchical, and projection-based [12]. The main difference between these algorithms is how similarity or distance is defined for the data points and according to which criteria the clusters are formed.

Model-based clustering is based on the notion that data come from a mixture model. The most common models used are statistical distributions. Prior models are estimated based on user-specified parameters, and the algorithm then aims at recovering the latent model by updating it on each iteration. The main disadvantage of this type of clustering is that it requires user-specified parameters; hence, if the assumptions are false, the results will be inaccurate. At the same time, the models may oversimplify the actual structure of the data. Another disadvantage of model-based clustering is that it can be slow on large datasets. Some model-based clustering algorithms are AutoClass [13], SVM clustering [14], BILCOM Empirical Bayesian [15], etc.

Partition-based clustering algorithms are the most common ones. Their main advantage is fast processing time on large datasets. The main concept is defining representatives of each cluster, allocating objects to the clusters, redefining the representatives, and reassigning the objects based on dissimilarity measurements; this is repeated until the algorithm converges. The main drawback of this type of algorithm is that the number of clusters must be predefined by the user. Another disadvantage is that several algorithms of this type produce locally optimal solutions and depend on the structure of the dataset. Several partition-based algorithms are K-modes, Fuzzy K-modes [16], Squeezer [17], COOLCAT [18], etc.

Density-based algorithms define clusters as subspaces where the objects are dense and are separated by subspaces of low density [19]. The implementation of density-based algorithms for categorical data is challenging, as the attribute values are unordered. Even though they can be fast, they may sometimes fail to cluster data with varying densities [20].

Hierarchical algorithms represent the data as a tree of nodes, where each node is a possible grouping of the data. There are two ways of clustering categorical data hierarchically: in an agglomerative (bottom-up) or divisive (top-down) fashion; the latter is less common. The main concept of the agglomerative algorithm is using a similarity measure to gradually allocate the objects to the nodes of the tree. The main disadvantage of hierarchical clustering is its slow speed. Another problem is that clusters may merge, so these algorithms might lead to information distortion. Several categorical data hierarchical clustering algorithms are ROCK [21], LIMBO [22], COBWEB [23], etc.

Projected clustering algorithms are based on the fact that in high-dimensional datasets clusters form over specific attribute subsets. In other words, each cluster is a subspace of the high-dimensional dataset defined by a subset of attributes relevant only to that cluster. The main issue with projected clustering algorithms is that they require user-specified parameters; if the specified parameters are inaccurate, the clustering will be poor. Projected clustering algorithms include CACTUS [24], CLICKS [25], STIRR [26], CLOPE [27], HIERDENC [28], MULIC [29], etc.

More detailed presentations and comparisons of the existing algorithms can be found in [30, 31, 32]. Summarizing the existing algorithms, we can conclude that most of them find some tradeoff between accuracy and speed. However, considering the growing interest in analyzing categorical data in the social, behavioral, and biomedical sciences, we are more interested in highly accurate algorithms. Furthermore, as one can see, the majority of the algorithms use some distance/similarity metric and define cluster representatives as a subroutine. At the same time, they also require user-specified parameters. These factors can be seen as limitations when clustering categorical data. We therefore propose another approach to partitioning categorical data, which tries to avoid these features. In the next section, we discuss the main characteristics of categorical data.


3. Categorical data overview

Data comes in various forms such as numeric, categorical, mixed, spatial, and so on. The analysis of each type of data poses unique challenges, and categorical data is no exception. This type of data is widely used in political, social, and biomedical science. For instance, measures of attitudes and opinions can be assessed with categorical data, and the outcome of a medical treatment can also be categorical. Even though the mentioned areas of science have had the largest influence on the development of methods for categorical data, this type of data commonly occurs in other areas such as marketing, behavioral science, education, psychology, public health, engineering, and so on.

Generally speaking, categorical data is data in which objects are described by categorical features. Categorical features can have two types of scales: nominal and ordinal. A nominal scale has unordered categories, whereas an ordinal scale has ordered categories, but the intervals between categories are unknown. In this paper, we focus only on categorical features with nominal scales.

For the sake of notation, consider a multidimensional dataset D containing n objects, such that each object is described by m categorical features, where feature $l$ has $k_l$ unique categories ($l = 1, 2, \ldots, m$). Thus, the dataset D can be viewed as the matrix below:

$$D_{n,m}=\begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,m}\\ a_{2,1} & a_{2,2} & \cdots & a_{2,m}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n,1} & a_{n,2} & \cdots & a_{n,m} \end{pmatrix}\tag{1}$$

where each object is described by a set of categories $O_i = (a_{i,1}, a_{i,2}, a_{i,3}, \ldots, a_{i,m})$. As categorical attributes take discrete values with no inherent order, the application of distance measures such as the $l_p$ norm will produce inaccurate results. The most common approach to overcome this limitation is the use of data transformation techniques; for instance, one can use binarization to transform the data into binary form and then apply a distance measure. On the other hand, the traditional way of comparing two objects with categorical features is to simply check whether the categories coincide. If the categories of all the features under consideration match, the objects can be viewed as similar. This does not mean they are the same, because they may be distinguished by other features. Thus, researchers have proposed various similarity measures instead of requiring all the features to match. The most popular approach is the overlap. According to it, the similarity between two objects $O_x = (a_{x,1}, a_{x,2}, a_{x,3}, \ldots, a_{x,m})$ and $O_y = (a_{y,1}, a_{y,2}, a_{y,3}, \ldots, a_{y,m})$ is assessed by:

$$Ov(O_x,O_y)=\frac{1}{m}\sum_{i=1}^{m}\gamma_i, \quad \text{where } \gamma_i=\begin{cases}1 & \text{if } a_{x,i}=a_{y,i}\\ 0 & \text{otherwise}\end{cases}\tag{2}$$

It takes values in [0, 1]; the closer the value is to one, the higher the similarity between the objects.
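To make the overlap measure concrete, here is a minimal sketch in Python; the function name and the list-of-categories representation are our own illustrative choices, not part of the original chapter:

```python
from typing import Sequence

def overlap(x: Sequence, y: Sequence) -> float:
    """Overlap similarity (Eq. (2)): fraction of features whose categories coincide."""
    if len(x) != len(y):
        raise ValueError("objects must be described by the same features")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

# Two objects described by three categorical features agree on two of them.
print(overlap(["red", "small", "round"], ["red", "large", "round"]))  # 0.666...
```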

While implementing overlap, one can notice that the probability of finding another object with the same categories rapidly decreases as the number of features and the number of unique categories of each feature increases. To illustrate this, one can calculate the probability of finding another object with the same categories as object Ox using the formula below:

$$P=\prod_{i=1}^{m}\frac{f(a_{x,i})-1}{n-1}\tag{3}$$

where $f(a_{x,i})$ is the frequency of category $a_{x,i}$ in the dataset.

If we keep the number of objects constant, the probability of finding another similar object depends on the number of features and the number of unique categories of each. It can be seen from the formula above that as the number of attributes or categories increases, the probability of finding another matching object rapidly decreases. The problem is that the overlap measure gives equal weight to the features and does not take into account the importance of each feature in partitioning the data. However, researchers have proposed more efficient ways of assessing similarity, which take into account the frequency of each category in the dataset. There are various similarity measures based on this concept, for instance, Goodall [33], Lin [34], and so on:

$$Goodall(O_x,O_y)=\frac{1}{m}\sum_{i=1}^{m}S_i, \quad \text{where } S_i=\begin{cases}1-\dfrac{f(a_{x,i})\,(f(a_{x,i})-1)}{n(n-1)} & \text{if } a_{x,i}=a_{y,i}\\ 0 & \text{otherwise}\end{cases}\tag{4}$$
$$Lin(O_x,O_y)=\frac{1}{m}\sum_{i=1}^{m}S_i, \quad \text{where } S_i=\begin{cases}2\log f(a_{x,i}) & \text{if } a_{x,i}=a_{y,i}\\ 2\log\!\left(\dfrac{f(a_{x,i})}{n}+\dfrac{f(a_{y,i})}{n}\right) & \text{otherwise}\end{cases}\tag{5}$$
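As a rough illustration, the following sketch implements the two frequency-based measures exactly as written in Eqs. (4) and (5); the helper names are our own, and the small dataset anticipates the four-object example of Eq. (6) below:

```python
import math
from collections import Counter
from typing import List, Sequence

def column_frequencies(data: List[Sequence]) -> List[Counter]:
    """Frequency of each category, per feature (column)."""
    return [Counter(col) for col in zip(*data)]

def goodall(x: Sequence, y: Sequence, freqs: List[Counter], n: int) -> float:
    """Goodall similarity as written in Eq. (4): matches on rare categories weigh more."""
    s = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        if a == b:
            f = freqs[i][a]
            s += 1.0 - f * (f - 1) / (n * (n - 1))
    return s / len(x)

def lin(x: Sequence, y: Sequence, freqs: List[Counter], n: int) -> float:
    """Lin similarity as written in Eq. (5)."""
    s = 0.0
    for i, (a, b) in enumerate(zip(x, y)):
        fa, fb = freqs[i][a], freqs[i][b]
        s += 2 * math.log(fa) if a == b else 2 * math.log(fa / n + fb / n)
    return s / len(x)

# Four objects with two categorical features, as in Eq. (6) below.
D = [["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]
freqs, n = column_frequencies(D), len(D)
print(round(lin(D[0], D[2], freqs, n), 2))      # 0.69, as in the comparison table below
print(round(goodall(D[0], D[2], freqs, n), 2))  # Goodall score per Eq. (4)
```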

Nevertheless, there are still cases where the use of similarity measures can be misleading. For instance, consider the dataset below with four objects and two categorical attributes with $\{c_1, c_2\}$ and $\{b_1, b_2\}$ categories, respectively:

$$D_{4,2}=\begin{pmatrix} c_1 & b_1\\ c_2 & b_2\\ c_1 & b_2\\ c_2 & b_1 \end{pmatrix}\tag{6}$$

According to the measures presented above, the similarity between each unique pair of these objects will be:

Object pair    Overlap   Lin    Goodall
(O1, O2)       0.00      0.00   0.00
(O1, O3)       0.50      0.69   0.33
(O1, O4)       0.50      0.69   0.33
(O2, O3)       0.50      0.69   0.33
(O2, O4)       0.50      0.69   0.33
(O3, O4)       0.00      0.00   0.00

As one can see, these measures can be misleading: for instance, $O_3$ and $O_4$ could each be grouped with either $O_1$ or $O_2$, as the similarity values are identical. Therefore, similarity measures are powerful tools, but they should be used with caution. In this regard, one may consider using a quantitative measure to compare the features and choose the relatively important ones. Then two objects are considered similar if the categories of the selected features match. This is the main motivation for our approach.

Therefore, we employ several feature importance measures. We define the partial grouping power of a feature in dataset D as the number of unique matching pairs on the feature divided by the total number of unique matching pairs in the dataset. This is based on the notion that if a feature has a relatively higher number of matching pairs than the others, it is more likely to group objects. The partial grouping power index of feature $l$, $PGPI_l$, can be assessed by:

$$PGPI_l=\frac{\sum_{s=1}^{k_l}\frac{f(c_s)\,(f(c_s)-1)}{2}}{\sum_{i=1}^{m}\sum_{j=1}^{k_i}\frac{f(c_j)\,(f(c_j)-1)}{2}}\tag{7}$$

where $c_s$ is a unique category of feature $l$, $f(c_s)$ is the frequency of that category in the dataset, and $c_j$ in the denominator runs over the unique categories of feature $i$. This measure takes values in [0, 1]; the closer the value is to one, the higher the importance of the feature in aggregating the objects.
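A minimal sketch of the PGPI computation of Eq. (7), assuming objects are stored as lists of category labels (the function name is hypothetical):

```python
from collections import Counter
from typing import List, Sequence

def pgpi(data: List[Sequence]) -> List[float]:
    """Partial grouping power (Eq. (7)): matching pairs per feature / all matching pairs."""
    matching_pairs = []
    for column in zip(*data):                      # iterate over features
        freqs = Counter(column)                    # category frequencies f(c_s)
        matching_pairs.append(sum(f * (f - 1) // 2 for f in freqs.values()))
    total = sum(matching_pairs)
    return [p / total for p in matching_pairs] if total else [0.0] * len(matching_pairs)

# On the balanced toy data of Eq. (6) both features score 0.5.
print(pgpi([["c1", "b1"], ["c2", "b2"], ["c1", "b2"], ["c2", "b1"]]))
```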

We also define a measure of the partitioning power of a feature. We define the partial partitioning power of a feature in dataset D as the number of unique mismatching pairs on the feature divided by the total number of unique mismatching pairs in the dataset. The partial partitioning power index of feature $l$, $PPPI_l$, can be assessed by:

$$PPPI_l=\frac{\frac{n(n-1)}{2}-\sum_{s=1}^{k_l}\frac{f(c_s)\,(f(c_s)-1)}{2}}{\sum_{i=1}^{m}\left(\frac{n(n-1)}{2}-\sum_{j=1}^{k_i}\frac{f(c_j)\,(f(c_j)-1)}{2}\right)}\tag{8}$$

This measure takes values in [0, 1]; the closer the value is to one, the higher the importance of the feature in partitioning the objects. Both measures can be used in any analysis. However, as the objects can be relatively grouped or separated depending on the features under consideration, one of the measures may perform better than the other.
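The companion sketch for the PPPI of Eq. (8), under the same assumed data representation as the PGPI sketch above:

```python
from collections import Counter
from typing import List, Sequence

def pppi(data: List[Sequence]) -> List[float]:
    """Partial partitioning power (Eq. (8)): mismatching pairs per feature / all mismatching pairs."""
    n = len(data)
    total_pairs = n * (n - 1) // 2
    mismatching_pairs = []
    for column in zip(*data):
        freqs = Counter(column)
        matching = sum(f * (f - 1) // 2 for f in freqs.values())
        mismatching_pairs.append(total_pairs - matching)
    total = sum(mismatching_pairs)
    return [p / total for p in mismatching_pairs] if total else [0.0] * len(mismatching_pairs)
```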

We also present another measure for assessing feature importance, this one based on the similarity matrix. The similarity matrix is defined as the matrix below:

$$SM_{n,n}=\begin{pmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,n}\\ m_{2,1} & m_{2,2} & \cdots & m_{2,n}\\ \vdots & \vdots & \ddots & \vdots\\ m_{n,1} & m_{n,2} & \cdots & m_{n,n} \end{pmatrix}\tag{9}$$

where $m_{i,j}$ is a similarity measure between objects $i$ and $j$, such as Overlap, Lin, or Goodall. Throughout this paper, we will use the count of matches between two objects as the similarity measure:

$$m_{i,j}=\sum_{l=1}^{m}\gamma_l, \quad \text{where } \gamma_l=\begin{cases}1 & \text{if } a_{i,l}=a_{j,l}\\ 0 & \text{otherwise}\end{cases}\tag{10}$$

This measure is the complement of the Hamming distance: it counts the matching positions rather than the mismatching ones. The similarity matrix is symmetric, so only the upper triangular part is used in the calculations; the diagonal is also ignored. For another measure of feature importance based on the similarity matrix, we define the general influence matrix as:

$$IM_{n,n}=\begin{pmatrix} I_{1,1} & I_{1,2} & \cdots & I_{1,n}\\ I_{2,1} & I_{2,2} & \cdots & I_{2,n}\\ \vdots & \vdots & \ddots & \vdots\\ I_{n,1} & I_{n,2} & \cdots & I_{n,n} \end{pmatrix}, \quad \text{where } I_{i,j}=\begin{cases}1 & \text{if } m_{i,j}>\alpha\\ 0 & \text{otherwise}\end{cases}\tag{11}$$

where $\alpha$ is a threshold bounded by the values the similarity measure can take. In this chapter, we set $\alpha$ to 0. After construction, the feature or subset of features under consideration is dropped, and the influence matrix is updated. The matrix after the drop is defined as the partial influence matrix of the corresponding feature or subset of features $l$. In this case, the partial grouping power of a feature or subset of features is assessed by dividing the count of ones in the partial influence matrix by the count of ones in the general influence matrix:

$$PGPI2_l=\frac{\eta(PIM_l)}{\eta(GIM)}\tag{12}$$

where $\eta(PIM_l)$ is the count of ones in $PIM_l$ and $\eta(GIM)$ is the count of ones in the GIM. One can notice that these measures of feature importance depend only on the number of unique matches in the dataset; the number of categories of each feature does not influence them. In the next section, we present the matching-based clustering algorithm, which combines the importance measures of the features and the similarity matrix to partition categorical data into homogeneous groups.
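Before moving on, the following sketch ties these pieces together: the match-count similarity matrix of Eq. (10), the influence matrix of Eq. (11), and the PGPI2 ratio of Eq. (12). The function names, the NumPy representation, and the treatment of the dropped subset as a list of feature indices are our own assumptions:

```python
import numpy as np
from typing import List, Sequence

def similarity_matrix(data: List[Sequence]) -> np.ndarray:
    """Match-count similarity (Eq. (10)): SM[i, j] = number of features where i and j coincide."""
    arr = np.array(data, dtype=object)
    n = len(arr)
    sm = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):            # only the upper triangle is used
            sm[i, j] = int(np.sum(arr[i] == arr[j]))
    return sm

def influence_matrix(sm: np.ndarray, alpha: int = 0) -> np.ndarray:
    """General influence matrix (Eq. (11)): 1 where similarity exceeds the threshold alpha."""
    return np.triu((sm > alpha).astype(int), k=1)   # ignore diagonal and lower triangle

def pgpi2(data: List[Sequence], drop: List[int], alpha: int = 0) -> float:
    """PGPI2 (Eq. (12)): ones in the partial influence matrix over ones in the general one."""
    gim = influence_matrix(similarity_matrix(data), alpha)
    reduced = [[v for k, v in enumerate(row) if k not in drop] for row in data]
    pim = influence_matrix(similarity_matrix(reduced), alpha)
    return pim.sum() / gim.sum()
```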


4. Matching-based clustering algorithm

Similar to any clustering algorithm, the main objective of matching-based clustering is partitioning the data into relatively similar groups. The algorithm is defined for categorical data only; one could modify it to work with other types as well, but that is out of the scope of this paper. The main idea is that, while there are still objects without clusters, the algorithm chooses features to drop based on their importance, updates the similarity matrix, and tries to cluster the objects based on the new SM. It uses the similarity matrix in which the similarity between two objects is defined by formula (10). We also use either the PGPI or the PPPI measure to choose the features to drop on each iteration. For the sake of notation, we define $\theta_p$ as the count of the remaining features on iteration $p$, with initial value $\theta_0 = m$. We consider two objects to belong to the same cluster if $m_{i,j} = \theta_p$; in other words, they are grouped if their categories coincide for all the features remaining on iteration $p$.

The algorithm consists of the following steps (a compact code sketch is given after the list):

  1. Construct the similarity matrix $SM_{n,n}$.

  2. Calculate the PGPI or PPPI of each feature.

  3. Allocate the objects to clusters based on the similarity matrix. In other words, group two objects $i$ and $j$ if $m_{i,j} = \theta_p$. If one of them is already allocated to a cluster, assign the second one to the same cluster.

  4. Check if there are still objects not assigned to any cluster; if yes, continue to the next step, otherwise terminate.

  5. Remove the features with the lowest PGPI or the highest PPPI. If more than one feature qualifies, one may consider either dropping all of them or using PGPI2 to choose which one to drop.

  6. Update the similarity matrix. Furthermore, to avoid merging existing clusters, additionally update the SM as follows: for existing clusters $i$ and $j$, if $m_{i,j} = \theta_p$, then the values equal to $\theta_p$ in rows and columns $i$ and $j$ are set to zero.

  7. Return to step 3.
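
A compact sketch of steps 1-7 in Python is given below. It works at the level of individual objects rather than merged cluster rows, uses the PGPI of Eq. (7) in step 5, and drops all features tied for the lowest score; the function names and these simplifications are our own reading of the steps above, not the authors' reference implementation:

```python
import numpy as np
from collections import Counter
from typing import Dict, List, Sequence

def pgpi_scores(data: np.ndarray, features: List[int]) -> Dict[int, float]:
    """PGPI (Eq. (7)) restricted to the features still in play."""
    pairs = {f: sum(c * (c - 1) // 2 for c in Counter(data[:, f]).values()) for f in features}
    total = sum(pairs.values()) or 1
    return {f: p / total for f, p in pairs.items()}

def mbc_cluster(records: List[Sequence]) -> List[int]:
    """Object-level sketch of the matching-based clustering algorithm (steps 1-7)."""
    data = np.array(records, dtype=object)
    n, m = data.shape
    features = list(range(m))
    labels = [-1] * n                      # -1 means "not yet clustered"
    next_label = 0
    # Step 1: similarity matrix = count of matching features per pair (upper triangle only).
    sm = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            sm[i, j] = int(np.sum(data[i] == data[j]))
    while True:
        theta = len(features)
        # Step 3: group objects whose similarity equals the number of remaining features.
        for i in range(n):
            for j in range(i + 1, n):
                if sm[i, j] == theta:
                    if labels[i] == -1 and labels[j] == -1:
                        labels[i] = labels[j] = next_label
                        next_label += 1
                    elif labels[i] == -1:
                        labels[i] = labels[j]
                    elif labels[j] == -1:
                        labels[j] = labels[i]
        # Step 4: stop once every object is clustered.
        if all(lab != -1 for lab in labels):
            break
        # Steps 2 and 5: drop all features tied for the lowest PGPI; stop if all scores are equal.
        scores = pgpi_scores(data, features)
        if len(set(scores.values())) <= 1:
            break
        worst = min(scores.values())
        features = [f for f in features if scores[f] > worst]
        # Step 6: recompute the SM on the remaining features ...
        for i in range(n):
            for j in range(i + 1, n):
                sm[i, j] = int(np.sum(data[i, features] == data[j, features]))
        theta = len(features)
        # ... and block merges between existing clusters by zeroing entries equal to theta.
        clusters: Dict[int, List[int]] = {}
        for idx, lab in enumerate(labels):
            if lab != -1:
                clusters.setdefault(lab, []).append(idx)
        for a in clusters:
            for b in clusters:
                if a < b and any(sm[min(i, j), max(i, j)] == theta
                                 for i in clusters[a] for j in clusters[b]):
                    for i in clusters[a] + clusters[b]:
                        sm[i, sm[i] == theta] = 0
                        sm[sm[:, i] == theta, i] = 0
        # Step 7: return to step 3.
    # Objects left unassigned each form their own cluster.
    for i in range(n):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
    return labels
```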

The algorithm stops when all the objects are clustered or the importance of the remaining features is the same. To illustrate how the algorithm works, we will apply it to the dataset below:

Objects   A    B    C    D    E
O1        a2   b1   c2   d3   e2
O2        a2   b1   c2   d3   e2
O3        a2   b1   c2   d3   e1
O4        a2   b1   c2   d3   e4
O5        a1   b2   c4   d2   e3
O6        a1   b2   c3   d4   e4
O7        a1   b2   c4   d2   e2
O8        a1   b2   c3   d4   e1
O9        a1   b2   c1   d1   e3
O10       a1   b2   c4   d2   e2

In this dataset, 10 objects are described by five categorical features A, B, C, D, and E with $\{a_1, a_2\}$, $\{b_1, b_2\}$, $\{c_1, c_2, c_3, c_4\}$, $\{d_1, d_2, d_3, d_4\}$, and $\{e_1, e_2, e_3, e_4\}$ unique categories, respectively. Thus, we initialize the algorithm by constructing the similarity matrix:

$$S_{10,10}=\begin{pmatrix}
\cdot & 5 & 4 & 4 & 0 & 0 & 1 & 0 & 0 & 1\\
 & \cdot & 4 & 4 & 0 & 0 & 1 & 0 & 0 & 1\\
 & & \cdot & 4 & 0 & 0 & 0 & 1 & 0 & 0\\
 & & & \cdot & 0 & 1 & 0 & 0 & 0 & 0\\
 & & & & \cdot & 2 & 4 & 2 & 3 & 4\\
 & & & & & \cdot & 2 & 4 & 2 & 2\\
 & & & & & & \cdot & 2 & 2 & 5\\
 & & & & & & & \cdot & 2 & 2\\
 & & & & & & & & \cdot & 2\\
 & & & & & & & & & \cdot
\end{pmatrix}\tag{13}$$

Then, the importance of each feature is assessed. In this example, we will use the PGPI measure. For instance, $PGPI_A$ will be:

$$PGPI_A=\frac{21}{21+21+10+10+9}=0.296\tag{14}$$

and, respectively, $PGPI_B = 0.296$, $PGPI_C = 0.14$, $PGPI_D = 0.14$, and $PGPI_E = 0.127$. Then, as $\theta_0 = m = 5$, all object pairs $i, j$ with $m_{i,j} = 5$ are grouped. As we can see, we obtain two clusters, $\{O_1, O_2\}$ and $\{O_7, O_{10}\}$. As we still have some objects left without cluster allocation, we continue to the next step. In particular, as feature E has the lowest PGPI, we drop it and update the similarity matrix. Also, to avoid merging already existing clusters, we additionally update the SM according to step 6. As the similarity between clusters $\{O_1, O_2\}$ and $\{O_7, O_{10}\}$ is not equal to $\theta_1 = 4$, no additional changes are made, and the new data view and corresponding similarity matrix will be:

Objects      A    B    C    D
{O1, O2}     a2   b1   c2   d3
O3           a2   b1   c2   d3
O4           a2   b1   c2   d3
O5           a1   b2   c4   d2
O6           a1   b2   c3   d4
{O7, O10}    a1   b2   c4   d2
O8           a1   b2   c3   d4
O9           a1   b2   c1   d1
$$S_{8,8}=\begin{pmatrix}
\cdot & 4 & 4 & 0 & 0 & 0 & 0 & 0\\
 & \cdot & 4 & 0 & 0 & 0 & 0 & 0\\
 & & \cdot & 0 & 0 & 0 & 0 & 0\\
 & & & \cdot & 2 & 4 & 2 & 2\\
 & & & & \cdot & 2 & 4 & 2\\
 & & & & & \cdot & 2 & 2\\
 & & & & & & \cdot & 2\\
 & & & & & & & \cdot
\end{pmatrix}\tag{16}$$

As $\theta_1 = 4$, we obtain the clusters $\{O_1, O_2, O_3, O_4\}$, $\{O_5, O_7, O_{10}\}$, and $\{O_6, O_8\}$. However, we still have one more object to assign to a cluster; thus, we drop C and D, which share the lowest PGPI, and update the similarity matrix:

Objects              A    B
{O1, O2, O3, O4}     a2   b1
{O5, O7, O10}        a1   b2
{O6, O8}             a1   b2
O9                   a1   b2
$$S_{4,4}=\begin{pmatrix}
\cdot & 0 & 0 & 0\\
 & \cdot & 2 & 2\\
 & & \cdot & 2\\
 & & & \cdot
\end{pmatrix}$$

We also check the condition in step 6 between every pair of existing clusters. As $\theta_2 = 2$, the condition holds for $\{O_5, O_7, O_{10}\}$ and $\{O_6, O_8\}$. Thus, the values in the corresponding rows and columns that are equal to $\theta_2 = 2$ are set to zero. The purpose of this modification is that, as we drop features with low grouping power, the clusters become more likely to merge, and we may lose important local partitioning of the data points. Thus, the finally updated similarity matrix will be:

$$S_{4,4}=\begin{pmatrix}
\cdot & 0 & 0 & 0\\
 & \cdot & 0 & 0\\
 & & \cdot & 0\\
 & & & \cdot
\end{pmatrix}\tag{17}$$

However, this iteration does not group $O_9$. At the same time, as the importance of the remaining features A and B is the same, the algorithm terminates and $O_9$ forms the fourth cluster. The final form of the clustering is presented in Table 1.

Objects   A    B    C    D    E    Cluster
O1        a2   b1   c2   d3   e2   1
O2        a2   b1   c2   d3   e2   1
O3        a2   b1   c2   d3   e1   1
O4        a2   b1   c2   d3   e4   1
O5        a1   b2   c4   d2   e3   2
O6        a1   b2   c3   d4   e4   3
O7        a1   b2   c4   d2   e2   2
O8        a1   b2   c3   d4   e1   3
O9        a1   b2   c1   d1   e3   4
O10       a1   b2   c4   d2   e2   2

Table 1.

Final form of the clustering.
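
For reference, the worked example above could be reproduced with the hypothetical mbc_cluster sketch from earlier in this section:

```python
# Toy dataset from the beginning of Section 4 (objects O1-O10, features A-E).
records = [
    ["a2", "b1", "c2", "d3", "e2"],  # O1
    ["a2", "b1", "c2", "d3", "e2"],  # O2
    ["a2", "b1", "c2", "d3", "e1"],  # O3
    ["a2", "b1", "c2", "d3", "e4"],  # O4
    ["a1", "b2", "c4", "d2", "e3"],  # O5
    ["a1", "b2", "c3", "d4", "e4"],  # O6
    ["a1", "b2", "c4", "d2", "e2"],  # O7
    ["a1", "b2", "c3", "d4", "e1"],  # O8
    ["a1", "b2", "c1", "d1", "e3"],  # O9
    ["a1", "b2", "c4", "d2", "e2"],  # O10
]
print(mbc_cluster(records))
# Expected grouping: {O1..O4}, {O5, O7, O10}, {O6, O8}, and {O9} on its own.
```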

The algorithm has some unique characteristics worth mentioning. First, for better performance, one can notice that all the changes required in each step are applied only to the similarity matrix, and there is no need to update the dataset. Second, there is no need for user-defined parameters, although one may introduce one, for instance, the required number of clusters to be created. Third, even though we introduced step 6 to avoid the merging of clusters and achieve higher accuracy, one can skip this step. In this case, the algorithm will create a tree where each leaf is a possible cluster, as in hierarchical clustering. Furthermore, based on a user-defined parameter, the algorithm can then create the required number of clusters. For instance, for our example above the dendrogram will be as shown in Figure 1.

Figure 1.

Dendrogram, based on the user-defined number of clusters.

Fourth, as the algorithm is based on either the grouping or the partitioning power of features, this information can be used to understand the data better. For instance, the algorithm can serve as a subroutine for selecting features for other clustering algorithms. The main disadvantage is that it may create too many clusters.


5. Experimental results and discussion

To test how the matching-based algorithm performs on a real-world dataset, we applied it to the soybean disease dataset [23]. It is one of the standard test datasets used in the machine learning community and has often been used to test conceptual clustering algorithms. We chose this dataset because it is publicly available and because all its attributes can be treated as categorical without discretization. The soybean dataset has 47 observations, each described by 35 attributes. Each observation is labeled with one of four diseases (Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot), which are used as indicators of the efficiency of the algorithm.

After applying the MBC algorithm to the soybean disease dataset, we obtained 18 different clusters.

Disease type                     Objects per cluster (clusters 1-18)
D1 (Diaporthe Stem Canker)       3, 2, 2, 2, 1
D2 (Charcoal Rot)                3, 2, 2, 3
D3 (Rhizoctonia Root Rot)        2, 5, 2, 1
D4 (Phytophthora Rot)            5, 3, 2, 3, 2, 2

However, as we can see from the table above, all the clusters except one belong entirely to a single disease group. In other words, we have only one possible misclassification. As already mentioned, one may require a specific number of clusters; in this case, one can use the dendrogram (Figure 2).

Figure 2.

Dendrogram for the soybean disease dataset after applying the MBC algorithm.

Furthermore, we can compare the performance of the algorithm with K-modes [35]. In that paper, the algorithm was also applied to the soybean disease dataset. The author emphasized that K-modes depends on the order of the data and that the user must also specify the number of clusters. In the case of MBC, we do not have these limitations; thus, applying the MBC algorithm may result in a more accurate outcome. However, we should still consider how to reduce the number of clusters it produces.


6. Conclusion

The vital issue in clustering categorical data is the notion of distance/similarity between the observations. The best-practice approaches are limited to numeric values; hence, specific models are being developed for categorical data. The algorithm introduced in this paper can serve as an alternative to existing ones. It is based on the main characteristics of categorical data and presents a new framework for clustering categorical data that relies neither on a distance nor on a similarity measure. The main concept of the algorithm is the assessment of the similarity matrix, updating the latter based on the importance criteria of each feature or subset of features, and grouping objects only if their categories match entirely. This allows clustering categorical data without conversion. Another advantage is the description of the features, as the algorithm allows identifying the features that drive the partitioning of the data; this can be very important in interpreting clustering results. The main advantages of the algorithm are high accuracy and the small number of initial parameters.

Our future work plan is to develop and implement a modification of the algorithm to cluster mixed data. Furthermore, we plan to overcome its limitations and adapt it to clustering big datasets. Such an algorithm is required in a number of data mining applications, such as partitioning very large sets of objects into smaller and more manageable subsets that can be more easily modeled and analyzed.


Mathematics subject classification (2010):

62H30, 62H17, 62H20

References

  1. Liu XF, et al. Knowledge discovery in cryptocurrency transactions: A survey. IEEE Access. 2021;9:37229-37254
  2. Rahouti M, Xiong K, Ghani N. Bitcoin concepts, threats, and machine-learning security solutions. IEEE Access. 2018;6:67189-67205
  3. Camino RD, State R, Montero L, Valtchev P. Finding suspicious activities in financial transactions and distributed ledgers. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW). New Orleans: IEEE; 2017. pp. 787-796
  4. Pham T, Lee S. Anomaly detection in the bitcoin system - a network perspective. 2016. arXiv preprint arXiv:1611.03942
  5. Jain A, Murty M, Flynn P. Data clustering: A review. ACM Computing Surveys (CSUR). 1999;31(3):264-323
  6. Aggarwal CC. Data Mining: The Textbook. Switzerland: Springer International Publishing; 2015. pp. 154-259
  7. Chaturvedi A, Green P, Carroll JD. K-modes clustering. Journal of Classification. 2001;18(1):35-55
  8. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley and Los Angeles: University of California Press; 1967. pp. 281-297
  9. Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: A comparative evaluation. In: SIAM Conference on Data Mining. Atlanta: SIAM; 2008
  10. Gouda K, Zaki MJ. GenMax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery. 2005;11(3):223-242
  11. Burnaby T. On a method for character weighting a similarity coefficient employing the concept of information. Mathematical Geology. 1970;2(1):25-38
  12. Hubert M, Rousseeuw PJ, Segaert P. Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification. 2017;11:445-466
  13. Stutz J, Cheeseman P. Bayesian classification (AutoClass): Theory and results. In: Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press; 1995. pp. 153-180
  14. Winters-Hilt S, Merat S. SVM clustering. BMC Bioinformatics. 2007;8(7):S18
  15. Andreopoulos B, An A, Wang X. Bi-level clustering of mixed categorical and numerical biomedical data. International Journal of Data Mining and Bioinformatics (IJDMB). 2006;1(1):19-56
  16. Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Knowledge Discovery and Data Mining: Techniques and Applications. Singapore: World Scientific; 1997
  17. He Z, Xu X, Deng S, et al. Squeezer: An efficient algorithm for clustering categorical data. Journal of Computer Science and Technology. 2002;17(5):611-624
  18. Barbara D, Li Y, Couto J. COOLCAT: An entropy-based algorithm for categorical clustering. In: Proceedings of CIKM 2002. McLean, VA, USA: ACM Press; 2002. pp. 582-589
  19. Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P. Dimension induced clustering. In: Proceedings of KDD'05. Chicago: Association for Computing Machinery; 2005. pp. 51-60
  20. Grambeier J, Rudolph A. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery. 2002;6:303-360
  21. Guha S, Rastogi R, Shim K. ROCK: A robust clustering algorithm for categorical attributes. Information Systems. 2000;25(5):345-366
  22. Andritsos P, Tsaparas P, Miller R, et al. LIMBO: Scalable clustering of categorical data. In: Proceedings of the 9th International Conference on Extending Database Technology (EDBT'04). Heraklion, Greece: Springer; 14-18 March 2004. pp. 123-146
  23. Fisher D. Knowledge acquisition via incremental conceptual clustering. Machine Learning. 1987;2:139-172
  24. Ganti V, Gehrke J, Ramakrishnan R. CACTUS - clustering categorical data using summaries. In: Proceedings of KDD'99. San Diego, CA, USA: SIGMOD; 1999. pp. 73-83
  25. Zaki MJ, Peters M. CLICKS: Mining subspace clusters in categorical data via K-partite maximal cliques. In: Proceedings of the 21st International Conference on Data Engineering (ICDE). Tokyo, Japan: IEEE Computer Society; 2005. pp. 355-356
  26. Gibson D, Kleinberg J, Raghavan P. Clustering categorical data: An approach based on dynamical systems. In: Proceedings of the 24th International Conference on Very Large Databases (VLDB'98). New York City, USA: IEEE Computer Society; 24-27 August 1998. pp. 311-323
  27. Yang Y, Guan S, You J. CLOPE: A fast and effective clustering algorithm for transactional data. In: Proceedings of KDD 2002. Edmonton, Alberta, Canada: Association for Computing Machinery; 2002. pp. 682-687
  28. Andreopoulos B, An A, Wang X. Hierarchical density-based clustering of categorical data and a simplification. In: Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2007), Springer LNCS 4426. Nanjing, China: Springer; 22-25 May 2007. pp. 11-22
  29. Andreopoulos B. Clustering Algorithms for Categorical Data [PhD thesis]. Toronto, Canada: Department of Computer Science and Engineering, York University; 2006
  30. Anderberg MR. Cluster Analysis for Applications. New York: Academic Press; 1973
  31. Jain AK, Dubes RC. Algorithms for Clustering Data. New Jersey: Prentice-Hall, Inc.; 1988
  32. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Brussels: Wiley; 2009
  33. Goodall DW. A new similarity index based on probability. Biometrics. 1966;22(4):882-907
  34. Lin D. An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning (ICML 1998). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1998. pp. 296-304
  35. Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. In: Research Issues on Data Mining and Knowledge Discovery. Arizona, USA: SIGMOD; 1997. pp. 1-8
