Open access peer-reviewed chapter

Enhanced Hybrid Privacy Preserving Data Mining Technique

Written By

Naga Prasanthi Kundeti, Chandra Sekhara Rao MVP, Sudha Sree Chekuri and Seshu Babu Pallapothu

Submitted: 15 September 2022 Reviewed: 21 October 2022 Published: 27 November 2022

DOI: 10.5772/intechopen.108707

From the Edited Volume

Information Security and Privacy in the Digital World - Some Selected Topics

Edited by Jaydip Sen and Joceli Mayer


Abstract

At present, almost every domain handles large volumes of data, even as storage device capacities grow. Amid these huge data volumes, data mining applications help find useful patterns that can drive business growth, improved services, better health care facilities, etc. However, the accumulated data can also be exploited for identity theft, fraudulent credit/debit card transactions, and the like. In such scenarios, data mining techniques that provide privacy are helpful. Although privacy-preserving data mining techniques like randomization, perturbation, and anonymization do provide privacy, they fail to be effective when applied separately. Hence, this chapter suggests an Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM) technique that combines them. The proposed technique provides more data privacy than existing techniques while also achieving better classification accuracy, as evidenced by our experimental results.

Keywords

  • privacy
  • privacy preserving data mining
  • k-anonymization
  • geometric data perturbation
  • l-diversity

1. Introduction

Modern machine learning models are applied to large volumes of data accumulated over time. The data used for training or building models may contain personal data, which data owners may not want to share. To safeguard the privacy of personal data, this chapter seeks to perform data analysis without revealing users' sensitive personal information.

Privacy has been defined in many different ways. Westin (1968) defined privacy as “the assertion of individuals, groups or institutions to specify when, how and to what extent their information can be shared to others”. Bertino et al. [1] defined privacy as “the security of data about an individual contained in an electronic repository from unauthorized disclosure”.

Privacy threats can be categorized into three types, namely (a) Membership Disclosure, (b) Attribute Disclosure and (c) Identity Disclosure.

Membership Disclosure: Such threats occur when an attacker manages to determine whether a specific user's data is present in a data set and infers meta-information about that user from its presence.

Attribute Disclosure: In this type of attack, sensitive user data can be deduced by the attacker by linking data entries with data from other sources.

Identity Disclosure: Here, an attacker links a particular record in a data set to a specific person, thereby revealing that person's identity, exposing all of his or her sensitive data and threatening his or her safety.

Privacy preservation methods protect against data leakage by altering the original data, minimizing exposure as described in the literature [2, 3]. Prominent techniques include randomization, perturbation, suppression, generalization, etc. To ensure that the data remains useful after alteration, various data utility metrics, such as the discernibility metric, KL-divergence, and entropy-based information loss, are applied, as discussed in the literature.
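As an illustration of such a utility metric, the following is a minimal Python sketch (the helper names and count vectors are hypothetical, not from the chapter) showing how KL-divergence can compare an attribute's value distribution before and after a privacy transformation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(P || Q) between two discrete
    distributions given as (unnormalized) count vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Value counts of one attribute before vs. after a transformation:
# a small divergence means the transformed data preserves the original
# distribution well (high utility); a large one means heavy information loss.
print(kl_divergence([30, 50, 20], [35, 45, 20]))  # mild distortion
print(kl_divergence([30, 50, 20], [90, 5, 5]))    # severe distortion
```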

Data is processed in tabular form, with each row representing a real-world entity. The attributes of a data table can be categorized into four types, viz., Identifier Attributes (IDs), Quasi-identifier Attributes (QIDs), Non-Sensitive Attributes (NSAs) and Sensitive Attributes (SAs). Identifier attributes are those that identify a person directly from the data, for example, SSN or Aadhaar ID. Such attributes are generally removed before the data is shared for analysis, in order to protect personal identity. Sensitive attributes contain delicate personal information such as health condition or financial status. These attributes cannot simply be removed, since they often carry the very information the analysis needs; the sensitive data is therefore retained, but the identities it relates to must be protected. Quasi-identifiers are attributes that attackers can use to disclose an individual's identity when combined with background knowledge, so they need to be modified to prevent identity disclosure. Finally, non-sensitive attributes do not disclose any information about individuals and can be retained intact when the data is shared for analysis.

Accordingly, several privacy preservation methods, such as randomization and perturbation, have been proposed to protect privacy when data is shared for analysis [4, 5]. Although such data transformations provide privacy, they may lead to inaccurate data mining results, thereby reducing the data's utility. Hence, Privacy Preserving Data Mining (PPDM) techniques are applied to balance privacy preservation against the accuracy of data mining results. In the process, the divergence from the original data is minimized, and the privacy level and data utility of different PPDM techniques are evaluated through suitable metrics [1, 6, 7].


2. Review of PPDM techniques

Data present in various data sources can be made privacy-enabled by applying different privacy preserving techniques, among them Generalization, Suppression, Anatomization and Perturbation.

  • Generalization: In this method, a data field value is swapped with a more general value. For numerical attributes, the value is exchanged for a range of values. For categorical attributes, generalization is performed based on a value generalization hierarchy (see the sketch after this list).

  • Suppression: This method averts information disclosure by removing some attribute values. Original data field values are replaced with a masking character (“*”).

  • Anatomization [7]: This method works by separating quasi-identifiers and sensitive attributes into two different tables, so that connecting QIDs to sensitive attributes becomes very difficult.

  • Perturbation: Here, original data field values are substituted with artificial values that have similar statistical properties.
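The following minimal Python sketch illustrates generalization and suppression; the numeric range width and the city hierarchy are illustrative assumptions, not taken from any specific dataset:

```python
def generalize_numeric(value, width=10):
    """Replace a numeric value with the range it falls in,
    e.g. 27 -> '20-30' for width 10."""
    low = (value // width) * width
    return f"{low}-{low + width}"

# Illustrative value generalization hierarchy for a categorical attribute.
CITY_HIERARCHY = {"Mumbai": "Maharashtra", "Pune": "Maharashtra",
                  "Delhi": "North India"}

def generalize_categorical(value, hierarchy=CITY_HIERARCHY):
    """Replace a categorical value with its parent in the hierarchy."""
    return hierarchy.get(value, "*")  # unknown values are suppressed

def suppress(value):
    """Suppression discards the original value entirely."""
    return "*"

print(generalize_numeric(27))           # '20-30'
print(generalize_categorical("Pune"))   # 'Maharashtra'
print(suppress("123-45-6789"))          # '*'
```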

Samarati and Sweeney [8, 9] proposed a popular privacy model, k-anonymization. k-anonymity for a table is defined as follows [10]:

“Let T(A1,…,An) be a table.

Let QI be the set of quasi-identifiers corresponding to table T.

T fulfills k-anonymity property with respect to QI if and only if each sequence of values in T[QI] appears at least with k occurrences in T[QI]”.

Generalization and suppression techniques are applied to quasi-identifiers (QIDs) as part of k-anonymization. All the QIDs within a group of size k share identical values, which ensures that confidential data about individual users is not revealed when the data is shared for analysis. Thus, k-anonymized data provides privacy. However, an attacker can still infer sensitive information about individuals from a k-anonymized table and some background knowledge if the sensitive attribute takes the same value for all individuals in a given k-group. Consider the k-anonymized table shown below in Table 1.

QI: Age    QI: City    Sensitive attribute: Disease
20–30      Mumbai      Flu
20–30      Mumbai      Flu
20–30      Mumbai      Flu
30–40      Delhi       Cancer
30–40      Delhi       Cancer
30–40      Delhi       Cancer

Table 1.

3-anonymized table.
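The k-anonymity definition above can be checked mechanically. Below is a minimal Python sketch (a hypothetical helper, written here purely for illustration) that verifies the 3-anonymity of Table 1:

```python
from collections import Counter

def is_k_anonymous(rows, qid_indices, k):
    """A table is k-anonymous w.r.t. its QIDs if every combination of
    QID values occurs at least k times."""
    counts = Counter(tuple(row[i] for i in qid_indices) for row in rows)
    return all(c >= k for c in counts.values())

# The rows of Table 1: (age range, city, disease)
table1 = [("20-30", "Mumbai", "Flu")] * 3 + [("30-40", "Delhi", "Cancer")] * 3
print(is_k_anonymous(table1, qid_indices=(0, 1), k=3))  # True
```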

While k-anonymity is a promising approach to group-based anonymization, thanks to its ease of use and the wide array of algorithms that implement it, it is vulnerable to many attacks. When attackers have access to background information, they can cause serious damage to sensitive data, including through the following:

  • Homogeneity Attack: These attacks leverage cases in which the values of a sensitive attribute are identical within a set of k records. In such cases, despite k-anonymization, the sensitive attribute value for the group of k records can be precisely predicted.

  • Background Knowledge Attack: Such attacks leverage an association between one or more sensitive attributes and quasi-identifier (QID) attributes to narrow down the set of possible values of the sensitive attribute. Machanavajjhala et al. [11] showed that knowledge of heart attacks occurring at a reduced rate among Japanese patients could be used to narrow the range of values of a sensitive attribute recording a patient's illness.

An attacker who has access to the 3-anonymous table above can use background knowledge from other data sources to identify all patients in Mumbai as having the disease ‘Flu’; sensitive information about individuals residing in Mumbai is thus revealed. To overcome this breach, the l-diversity principle is applied to the sensitive attribute.

Aggarwal and Yu [4] define l-diversity as, “Let a q*-block be a set of tuples such that its non-sensitive values generalize to q*. A q*-block is l-diverse if it contains l ‘well represented’ values for the sensitive attribute S. A table is l-diverse, if every q*-block in it is l-diverse.”

Li et al. [12] defined l-diversity as “an equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if every equivalence class of the table has l-diversity”.

Aggarwal and Yu [4] also showed that when there is more than one sensitive field, the l-diversity problem becomes more difficult because of the added dimensionality.
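In the simplest instantiation of l-diversity, “well-represented” is read as “distinct”. Under that assumption, a minimal Python sketch of an l-diversity check, applied to the rows of Table 1, could look as follows:

```python
from collections import defaultdict

def is_l_diverse(rows, qid_indices, sensitive_index, l):
    """Each equivalence class (group sharing QID values) must contain
    at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in qid_indices)
        groups[key].add(row[sensitive_index])
    return all(len(values) >= l for values in groups.values())

# Table 1 is 3-anonymous but NOT 2-diverse: each group has one disease.
table1 = [("20-30", "Mumbai", "Flu")] * 3 + [("30-40", "Delhi", "Cancer")] * 3
print(is_l_diverse(table1, (0, 1), 2, l=2))  # False
```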


3. Methodology

Kundeti et al. [13] introduced a hybrid privacy preserving data mining (HPPDM) technique that provides privacy and reduces attacks. It can, however, be extended to provide even more privacy by applying the l-diversity principle, since l-diversity offers stronger protection against various background-knowledge attacks.

Algorithm: Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM).

Input: Adult Dataset D.

Output: Privacy-enabled Adult Data set D′.

Step 1: Categorize the attributes of the Adult Data set into Identifiers, Quasi-Identifiers, Sensitive and Non-Sensitive Attributes.

Step 2: Consider the Quasi-Identifiers and create value generalization hierarchies for them.

Step 3: Apply the geometric perturbation technique to the numerical quasi-identifiers to obtain perturbed numerical quasi-identifiers.

Step 4: Create generalization hierarchies for the categorical quasi-identifiers and choose levels in each generalization hierarchy based on the k-value chosen for anonymization.

Step 5: Apply l-diversity to the sensitive attributes based on the number of distinct values present for the class attribute.

Step 6: Obtain the privacy-preserved Adult data set D′.
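To make the flow of these steps concrete, the following is a minimal, self-contained Python sketch of the pipeline on toy (age, education, class) records. The attribute roles, the hierarchy, and the noise-based stand-in for Step 3 are illustrative assumptions, not the chapter's actual R/ARX implementation:

```python
import random

def perturb_numeric(values, scale=1.0, seed=42):
    """Stand-in for Step 3: add random noise to numeric QIDs (the
    geometric perturbation itself is sketched in Section 4)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

def generalize(value, hierarchy):
    """Step 4: replace a categorical QID value with its parent in the
    value generalization hierarchy ('*' means fully suppressed)."""
    return hierarchy.get(value, "*")

def ehppdm(records, hierarchy):
    """Sketch of Steps 1-6; the attribute roles of Step 1 are
    hard-coded for illustration."""
    ages = perturb_numeric([age for age, _, _ in records])        # Step 3
    out = [(round(a, 1), generalize(edu, hierarchy), cls)         # Step 4
           for a, (_, edu, cls) in zip(ages, records)]
    # Step 5 (global simplification): require >= 2 distinct class values.
    assert len({cls for _, _, cls in out}) >= 2, "l=2 diversity not met"
    return out                                                    # Step 6: D'

hierarchy = {"Bachelors": "Higher-ed", "Masters": "Higher-ed",
             "HS-grad": "School"}
print(ehppdm([(39, "Bachelors", "<=50K"), (28, "HS-grad", ">50K")],
             hierarchy))
```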


4. Implementation

The Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM) technique is implemented using the R language, and the ARX anonymization tool is used for performing k-anonymization.

The Adult dataset from the UCI machine learning repository is used for evaluating the EHPPDM technique. The dataset consists of 15 attributes, including the class attribute: age (numerical), work-class (categorical), fnlwgt (numerical), education (categorical), education-num (numerical), marital-status (categorical), occupation (categorical), relationship (categorical), race (categorical), sex (categorical), capital-gain (numerical), capital-loss (numerical), hours-per-week (numerical), native-country (categorical) and the class variable. These attributes can be divided into quasi-identifiers, sensitive attributes and non-sensitive attributes. The quasi-identifiers in this data set are age, work-class, education and native-country; the class attribute is the sensitive attribute, while the remaining attributes are classified as non-sensitive.

Among the quasi-identifiers, age is the only numerical attribute, and the geometric data perturbation technique [14] is applied to it. Value generalization hierarchies are created for the categorical quasi-identifier attributes, and the k-anonymization algorithm is applied to them. Different values of k yield different anonymization levels, which provide privacy at different levels. The k-values considered are 50, 100, 150, 200, 250, 300, 350, 400, 450 and 500. After anonymization, classification algorithms such as Naive Bayes, J48 and decision tree are applied to the anonymized data sets, and the classification accuracies are recorded.
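For illustration, here is a minimal Python sketch in the spirit of geometric data perturbation [14]: a random rotation, a random translation, and small additive noise applied to the numeric QID. The function and variable names are hypothetical, and the chapter's actual implementation is in R:

```python
import numpy as np

def geometric_perturbation(X, noise_scale=0.1, seed=0):
    """Sketch of geometric data perturbation: Y = X @ R + t + noise,
    with a random rotation (orthogonal) matrix R, a random translation
    vector t, and small additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d = X.shape[1]
    R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
    t = rng.normal(size=d)                        # random translation
    return X @ R + t + rng.normal(scale=noise_scale, size=X.shape)

ages = np.array([[39.0], [50.0], [28.0]])  # the numeric QID 'age'
print(geometric_perturbation(ages))
```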

To further enhance the privacy of the data, l-diversity is applied to the sensitive attribute, i.e., the class attribute, in order to reduce background and linkage attacks. Since l-diversity ensures that the class attribute does not take a single value within a given anonymized group, the attacker cannot identify an individual's sensitive attribute value. After the anonymized, l-diversity-applied dataset is obtained, classification algorithms are applied to it and their accuracies are tabulated. The risk analysis for various types of attacks is then presented in the figures below.
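The shape of this accuracy comparison is straightforward to reproduce in outline. Below is a minimal Python sketch using scikit-learn's GaussianNB on synthetic data, as a stand-in for the chapter's R experiments; the rounding step is only a crude proxy for anonymization, used to show the privacy/utility comparison:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # synthetic stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic class attribute

# "Anonymized" view: coarsen the features (a crude stand-in for
# generalization) and compare classification accuracy on both views.
X_anon = np.round(X)

for name, data in [("original", X), ("anonymized", X_anon)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    acc = accuracy_score(y_te, GaussianNB().fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: accuracy = {acc:.3f}")
```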

Figure 1 depicts the classification accuracies for the Adult data set when k-anonymization is applied for different values of k. Figure 2 displays the classification accuracies for the Adult data set when the l-diversity principle is additionally applied to reduce background attacks and further increase privacy.

Figure 1.

Classification accuracies for adult K-anonymized data for different k-values.

Figure 2.

Classification accuracies for the Adult data set after applying k-anonymization and l-diversity (l-value = 2).

The classification accuracies of Hybrid Privacy Preserving Data Mining (HPPDM) [13] technique for adult data set are shown in Figure 3. The experimental results depict better classification accuracies with HPPDM technique as compared to k-anonymization.

Figure 3.

Classification accuracies for the Adult data set after applying the hybrid privacy preserving data mining technique.

Figure 4 illustrates the classification accuracies for adult data set when Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM) is applied.

Figure 4.

Classification accuracies for enhanced hybrid privacy preserving data mining technique.

Figures 5–8 illustrate the risk analysis for the Adult data set.

Figure 5.

Risk analysis for various types of attacks after applying k-anonymization (k-value = 100).

Figure 6.

Risk analysis for various types of attacks after applying k-anonymization (k-value = 100) and l-diversity (l-value = 2).

Figure 7.

Risk analysis for various types of attacks after applying the hybrid privacy preserving data mining (HPPDM) technique for k-value = 100.

Figure 8.

Risk analysis for various types of attacks after applying the enhanced hybrid privacy preserving data mining (EHPPDM) technique for k-value = 100 and l-diversity (l-value = 2).

Figure 5 demonstrates the risk analysis against various types of attacks when k-anonymization is applied to the Adult data set. Figure 6 displays the corresponding risk analysis when both k-anonymization and l-diversity are applied. Figure 7 illustrates the risk analysis when the Hybrid Privacy Preserving Data Mining (HPPDM) technique [13] is applied, and Figure 8 depicts the risk analysis when the Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM) technique is applied. The experimental results confirm that the risks are reduced to negligible levels when the HPPDM and EHPPDM techniques are applied.


5. Conclusion

The proposed Enhanced Hybrid Privacy Preserving Data Mining (EHPPDM) technique is applied to the Adult dataset from the UCI machine learning repository. The EHPPDM technique combines two privacy preservation techniques, viz., perturbation and k-anonymization: geometric data perturbation is applied to the numerical quasi-identifiers, whereas the k-anonymization technique is applied to the categorical quasi-identifiers. To enhance privacy and reduce attacks, l-diversity (l-value = 2) is applied to the sensitive attributes. The experimental results showed that classification accuracy increased considerably when the proposed EHPPDM technique was applied. In future work, the EHPPDM technique can be extended by including the t-closeness property.

References

  1. Bertino E, Lin D, Jiang W. A survey of quantification of privacy preserving data mining algorithms. In: Privacy-Preserving Data Mining. New York, NY, USA: Springer; 2008. pp. 183-205. DOI: 10.1007/978-0-387-70992-5_8
  2. Langheinrich M. Privacy in Ubiquitous Computing. In: Ubiquitous Computing Fundamentals. Chapman and Hall/CRC; 2009. p. 66. ISBN: 9781315145792
  3. Prasanthi KN, Chandra Sekhara Rao MVP. A comprehensive assessment of privacy preserving data mining techniques. Lecture Notes in Networks and Systems. 2022;351:833-842. DOI: 10.1007/978-981-16-7657-4_67
  4. Aggarwal CC, Yu PS. A general survey of privacy-preserving data mining models and algorithms. In: Privacy-Preserving Data Mining. New York, NY, USA: Springer; 2008. pp. 11-52. DOI: 10.1007/978-0-387-70992-5_2
  5. Aggarwal CC. Data Mining: The Textbook. New York, NY, USA: Springer; 2015
  6. Bertino E, Fovino IN. Information driven evaluation of data hiding algorithms. In: Proc. Int. Conf. on Data Warehousing and Knowledge Discovery. Berlin: Springer; 2005. pp. 418-427. DOI: 10.1007/11546849_41
  7. Fletcher S, Islam MZ. Measuring information quality for privacy-preserving data mining. International Journal of Computer Theory and Engineering. 2015;7(1):21-28. DOI: 10.7763/IJCTE.2015.V7.924
  8. Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In: Proc. of the IEEE Symposium on Research in Security and Privacy. 1998. pp. 384-393. DOI: 10.1184/R1/6625469.v1
  9. Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information. PODS. 1998;98:188. DOI: 10.1145/275487.275508
  10. Samarati P. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering. 2001;13(6):1010-1027. DOI: 10.1109/69.971193
  11. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data. 2007;1(1):3-es. DOI: 10.1145/1217299.1217302
  12. Ninghui L, Tiancheng L, Venkatasubramanian S. T-closeness: Privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering (ICDE). 2007. pp. 106-115. DOI: 10.1109/ICDE.2007.367856
  13. Prasanthi KN, Chandra Sekhara Rao MVP. Accuracy and utility balanced privacy preserving classification mining by improving K-anonymization. International Journal of Simulation: Systems, Science & Technology. 2019;19:6. DOI: 10.5013/IJSSST.a.19.06.51
  14. Chen K, Liu L. Geometric data perturbation for privacy preserving outsourced data mining. Knowledge and Information Systems. 2011;29:657-695. DOI: 10.1007/s10115-010-0362-4
