Privacy Preserving Data Mining

Data mining techniques provide benefits in many areas such as medicine, sports, marketing and signal processing, as well as data and network security. However, although data mining techniques are used in security applications such as intrusion detection, biometric authentication, and fraud and malware classification, privacy has become a serious problem, especially in data mining applications that involve the collection and sharing of personal data. For these reasons, the problem of protecting privacy in the context of data mining differs from traditional data privacy protection, as data mining can act as both friend and foe. This chapter covers previously developed privacy preserving data mining techniques in two parts: (i) techniques proposed for input data that will be subject to data mining and (ii) techniques suggested for processed data (the output of data mining algorithms). It also presents attacks against the privacy of data mining applications. The chapter concludes with a discussion of next-generation privacy-preserving data mining applications at both the individual and organizational levels.


Introduction
Especially since the 2019 pandemic, business and education have increasingly been conducted electronically over the internet, data is shared quickly and in large volumes under the undeniable influence of social media, and, unfortunately, technology works against privacy. The rapid and widespread adoption of data mining techniques in areas such as medicine, sports, marketing and signal processing has also increased interest in privacy. The important point here is to define the boundaries of the concept of privacy and to provide a clear definition. Individuals define privacy with the phrase "keep information about me from being available to others". However, when their personal data is used in a study that is considered well intentioned, individuals are not disturbed by this and do not think that their privacy is violated [1]. What is overlooked here is the difficulty of preventing abuse once the information is released.
Personal data is information that relates to an identified or identifiable individual. The concept has two components: the data pertain to a person, and that person can be identified. Personal data belongs to the "ego" and covers a wide range, from names to preferences, feelings and thoughts. An identifiable person is someone who can be identified directly or indirectly, in particular by reference to an identification number or to one or more factors specific to their physical, physiological, mental, economic, cultural or social identity. For this reason, when individuals lose control over these data, they also lose their freedom, autonomy and privacy, in short, the quality of being themselves. The main way to use these data without harming the privacy of individuals is to remove the identifiability of the person.
Data analysis methods, including data mining, commodify data and turn it into economic value. Apart from the ethical debates about this, it is an undeniable fact that the digital environment increases the risk that individuals lose control of information about their intellectual, emotional and situational state, in short, that they lose their autonomy and that their informational privacy is violated. The main dilemma here is between the free flow of information enabled by technology, together with the relationships of interest and the benefits it brings, and the power of control that the concept of being an individual requires [2].
In addition, governments issue legal regulations that aim to protect personal data, covering the purpose for which data is used (historical, statistical, commercial, scientific), how it is collected and how it should be stored. For example, the US HIPAA rules aim to protect individually identifiable health information, a subset of health information that includes demographic information collected from an individual [3]. Directive 95/46/EC [4] of the European Parliament and of the Council allows the use of personal data (i) if the data subject has explicitly given permission, or (ii) if it is needed to deliver a result requested by the individual. Individual privacy concerns bring corporate privacy concerns with them, and the two are not very different from each other. The disclosure of information about an organization can also be considered a potential privacy breach. Both views can be covered by generalizing to the disclosure of information about a subset of the data.
The point to note here is that, while focusing on disclosure about data subjects, the secrets of the organization providing the data should also be taken into account. For example, consider an academic study in which data mining is carried out on student data from more than one university. Although the methods used protect the privacy of the students, certain information that is specific to a university and that it wants to keep confidential may be revealed. Although the personal data held by organizations are secured by contracts and legal regulations, information about a subset of the combined data set may reveal the identity of a data subject. An organization that owns a data set should take part in a distributed data mining process only as long as it can prevent the disclosure of both the data subjects it provides and its own trade secrets.
In the literature, data mining solutions that take data privacy into account have been proposed. A solution that ensures that no individual data is exposed can still publish information that describes the collection as a whole. This type of corporate information is often the purpose of data mining. However, because some results can still be traced to individuals, various data hiding and suppression techniques have been developed to ensure that the data cannot be individually identified.
The concept of privacy can be examined under three headings: physical, mental-communicative and data privacy [5]. The main subject of this study is data privacy.

Data privacy
Data privacy can be defined as the protection, in accordance with the law and ethical rules, of the real persons, institutions and organizations concerned (data subjects) throughout the life cycle of data (collecting, processing and analyzing, publishing and sharing, preserving, and re-using data) [6]. In this process, important requirements of data privacy are that the purpose for which the data will be processed, with whom it will be shared and where it will be transferred are transparent to, and controllable by, the data subject. On the other hand, there is no exact definition of privacy; the definition can be made specific to the application.
Data controllers, who need to take privacy precautions in order to prevent data breaches, are assumed to be reliable and to have legal obligations: they store and use the data collected through digital applications with appropriate methods, and share it after anonymization when necessary. Collected data are classified into four groups [7]:
• Identifiers (ID): Attributes that uniquely and directly identify individuals, such as full name and social security number.
• Quasi-identifiers (QID): Attributes that, combined with external data, lead to the indirect identification of an individual. They are non-unique data such as gender, age and postal code.
• Sensitive attributes (SA): Data that is private and sensitive to individuals, such as illness and salary.
• Insensitive attributes: General, non-risky data that is not covered by the other groups.

Privacy metrics
It is not sufficient to measure privacy with a single metric, because different definitions can be made for different applications and multiple parameters must be evaluated. The metrics proposed for PPDM [8,9] can be examined as privacy level metrics and data quality metrics, depending on which aspect of privacy is measured. Each of these can be measured in two subgroups: the level of privacy/data quality of the input data (data criteria) and of the data mining results (result criteria). How secure the data is in terms of disclosure is measured by the privacy level metrics [10]:
Bounded knowledge: The purpose here is to restrict the data with certain rules and prevent the disclosure of information that should remain confidential. Data can be bounded by adding noise to it or by generalizing it.
Need to know: With this metric, keeping unnecessary data out of the system prevents privacy problems that would otherwise arise. It also ensures access control (access reason and access authorization) over the data.
Protected from disclosure: In order to keep confidential data from emerging as a result of data mining, some operations (such as checking the queries) can be performed on the results. Using classification to prevent the disclosure of data, which is one of the criteria for ensuring privacy, is an effective method [11].
Data quality metrics: These quantify the loss of information/utility; complexity criteria that measure the efficiency and scalability of different techniques are also evaluated within this scope.

Data mining with privacy
Privacy Preserving Data Mining (PPDM) techniques have been developed to allow the extraction of information from data sets while preventing the disclosure of data subjects' identities or sensitive information. In addition, PPDM allows more than one researcher to collaborate on a dataset [11,12]. PPDM can also be defined as performing data mining, in a multi-party environment, on data sets obtained from databases containing sensitive and confidential information without disclosing the data of each party to the other parties [13].
In order to protect privacy in data mining, statistical and cryptographic approaches have been proposed. The vast majority of these approaches modify the original data to protect privacy, which gives rise to a natural trade-off between data quality and privacy level.
PPDM methods are being studied in order to perform effective data mining while guaranteeing a certain level of privacy. Several different taxonomies have been proposed for these methods. In the literature, they are classified either by data life cycle stage (data collection, data publishing, data distribution and output of data mining) [10] or by the method used (anonymization based, perturbation based, randomization based, condensation based and cryptography based) [14].
In this study, PPDM approaches are examined with a simple taxonomy: methods applied to the input data that is subject to data mining, and methods applied to processed data (output information).

Methods applied to input data
This section includes the methods suggested for collecting, cleaning, integration, selection and transformation phases of input data that will be subject to data mining.
Although it varies according to the application or the degree of trust in the institution collecting the data, it is recommended that the original values not be stored and be used only in the transformation process, in order to prevent privacy disclosure. For example, data collected with sensors, which are now widely used in the Internet of Things, can be transformed at the point of collection by randomizing the obtained values, so that the raw data is transformed before being used in data mining.
In this section, data perturbation, randomization, suppression, data swapping, anonymity, cryptography and differential privacy methods are discussed.

Data perturbation
Data that is resistant to privacy attacks can be created by perturbation while largely preserving the statistical integrity of the data [15,16]. Randomization of the original data is widely used in data perturbation [17,18,19]. Another approach is the microaggregation method [20].
In the randomization method, noise drawn from a known statistical distribution is added to the data, so that when data mining methods are applied, the original data distribution can be reconstructed without access to the original data. For this, data providers first randomize their data and then transmit it to the data recipient. The data recipient then estimates the original distribution from the randomized data using distribution reconstruction methods.
During the data collection phase, the randomization can be applied independently to each record, and once the original distribution is reconstructed, the statistical properties of the data are preserved. For example, let A be the original data distribution and B a publicly known noise distribution independent of A; the result of randomizing A with B is C = A + B. Then A may be reconstructed as A = C − B. However, this reconstruction may not be successful if B has a large variance and the sample size of C is not large enough. As a solution, approaches based on the Bayes formula [21] or the EM algorithm [22] can be used. While the randomization method limits data usage to the distribution of C, it requires a lot of noise to hide outliers, because outliers are more vulnerable to attacks than values in denser regions of the data. To protect them, it may be necessary to add so much noise to all records that information is lost, which reduces the usefulness of the data for mining purposes [7].
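The relationship C = A + B can be illustrated with a minimal sketch. It assumes zero-mean Gaussian noise and recovers only the mean and variance of A from C, using the independence of A and B; it is not the Bayes or EM reconstruction cited above. The function names, the synthetic `ages` data and the noise level are hypothetical choices for illustration.

```python
import random
import statistics

def randomize(values, noise_sd, rng):
    """Return C = A + B, with B ~ Normal(0, noise_sd) drawn independently per record."""
    return [a + rng.gauss(0.0, noise_sd) for a in values]

def reconstruct_moments(randomized, noise_sd):
    """Estimate the mean and variance of A from C and the known noise distribution.

    Since A and B are independent: E[A] = E[C] - E[B] and Var[A] = Var[C] - Var[B].
    """
    mean_a = statistics.fmean(randomized)  # E[B] is 0 by assumption
    var_a = max(statistics.pvariance(randomized) - noise_sd ** 2, 0.0)
    return mean_a, var_a

rng = random.Random(0)
ages = [rng.gauss(40.0, 10.0) for _ in range(20000)]  # hypothetical original data A
noisy_ages = randomize(ages, noise_sd=15.0, rng=rng)  # what the data recipient sees
est_mean, est_var = reconstruct_moments(noisy_ages, noise_sd=15.0)
```

With a large sample the estimates come close to the true mean and variance even though no individual noisy value is reliable; with few records or a very large noise variance the estimates degrade, which is the limitation noted above.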
Randomly generated values can be combined with the original data by an additive or a multiplicative method [23]. The aim is to ensure that the noise added to individual records for privacy cannot be removed. Multiplicative noise is considered more effective than additive noise because the original values are more difficult to estimate.
In the microaggregation method, all records in the data set are first arranged in a meaningful order and then the whole set is divided into a certain number of subsets. Then, for the specified attribute, the average over each subset is computed and the value of that attribute in every record of the subset is replaced with this average. Thus, the average value of that attribute over the entire data set does not change.
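The microaggregation steps can be sketched for a single numeric attribute as follows. Sorting by the attribute itself, using fixed-size groups and merging a short remainder into the previous group are assumptions of this sketch; the function name and the sample `salaries` are hypothetical.

```python
def microaggregate(values, group_size):
    """Replace each value by the mean of its group of sorted neighbours.

    Groups are consecutive runs of `group_size` sorted values; a final short
    remainder is merged into the previous group.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result

salaries = [30, 52, 31, 48, 90, 33, 50, 95]  # hypothetical attribute values
masked = microaggregate(salaries, group_size=3)
```

Each record keeps its position but carries only its group mean, so the overall sum and average of the attribute are unchanged.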
Since data perturbation approaches have a negative impact on data utility and are not resistant to attacks, they are often not preferred in utility-based data models.

Suppression
Data suppression is a technique that tries to prevent the disclosure of confidential information by replacing some values with a special value. In some cases, cell values or entire records are deleted [24]. In this way, confidential data can be changed, rounded, generalized or mixed and then made available to data mining applications [25].
Examples include changing the age attribute in a record from 28 to 35 or the city attribute from Glasgow to Edinburgh, or generalizing the age 28 to the range 25-30 and Glasgow to Scotland. Using these methods on big data can reduce data quality and change general statistics, which may make the data unusable [26]. Another problem is that suppression deliberately distorts information: with the reported values, data providers can produce artificial inferences that are inaccurate and serve a particular purpose [27].
On the other hand, suppression should not be used when data mining requires full access to sensitive values. For sensitive information in a record, the method of limiting the identity link of a record may be preferred instead.
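As an illustration of the masking operations described in this section, the following sketch suppresses an identifier and generalizes two attributes of a single record. The record layout, the `*` placeholder, the five-year non-overlapping age bands and the city-to-region mapping are all assumptions made for the example.

```python
SUPPRESSED = "*"

def generalize_age(age, width=5):
    """Map an exact age to a non-overlapping range such as '25-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask_record(record, city_to_region):
    """Suppress the identifier and generalize age and city in one record."""
    return {
        "name": SUPPRESSED,
        "age": generalize_age(record["age"]),
        # Cities missing from the hierarchy are suppressed rather than guessed.
        "region": city_to_region.get(record["city"], SUPPRESSED),
        "diagnosis": record["diagnosis"],
    }

city_to_region = {"Glasgow": "Scotland", "Edinburgh": "Scotland"}  # hypothetical hierarchy
record = {"name": "A. Smith", "age": 28, "city": "Glasgow", "diagnosis": "flu"}
masked = mask_record(record, city_to_region)
```

The sensitive value is left intact here, which reflects the case where mining needs the sensitive attribute and only the link to an identity is weakened.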

Data swapping
Data swapping is a technique that tries to prevent the disclosure of private information by swapping values between different records.
Where there is more than one data provider, data swapping can be described as each provider scrambling its data by exchanging values with other providers. The advantage of the technique is that swapping does not affect lower-order totals, so accurate and complete aggregate calculations remain possible.
However, private data can easily be exposed within the system as a result of the exchanges, so the technique is recommended only for safe environments. It can be used in conjunction with other methods such as k-anonymity without violating their privacy definitions.
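A minimal single-table sketch of swapping is shown below: the values of one attribute are permuted across records, so per-attribute totals and frequency counts are unchanged while the link between a value and its record is broken. The record layout and the function name are hypothetical, and a random permutation is only one of several possible swapping strategies.

```python
import random

def swap_attribute(records, attribute, rng):
    """Return copies of the records with one attribute permuted across them."""
    values = [r[attribute] for r in records]
    rng.shuffle(values)
    return [{**r, attribute: v} for r, v in zip(records, values)]

records = [
    {"id": 1, "dept": "IT", "salary": 70},
    {"id": 2, "dept": "HR", "salary": 42},
    {"id": 3, "dept": "Sales", "salary": 50},
    {"id": 4, "dept": "Sales", "salary": 45},
]
swapped = swap_attribute(records, "salary", random.Random(1))
```

Statistics that combine the swapped attribute with other attributes, such as the average salary per department, are generally not preserved by this simple variant.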

Cryptography
Cryptography is a technique that converts plain text to cipher text using various encryption algorithms, encoding messages so that they cannot be read by others. It is a method of storing and transmitting data in a specific form so that only the intended persons can read and process it.
In data mining applications, cryptography-based techniques are used to protect privacy during data collection and data storage [25,28], and they guarantee a very high level of data privacy [23]. Encryption is generally costly in terms of time and computational complexity. Hence, as the volume of data increases, the time needed to process encrypted data increases and creates a potential barrier to real-time analysis [29].
Secure multiparty computation (SMC) is a class of cryptographic protocols in which, when more than one party participates, the parties learn nothing but the result [30,31]. The SMC calculation must be carried out carefully so that it does not reveal sensitive data; even so, the computed result itself may enable the parties to estimate sensitive values.
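One simple instance of this idea is a secure sum based on additive secret sharing, sketched below as a single-process simulation. The modulus, the function names and the example counts are assumptions; a real deployment would also need secure channels between the parties and an assumption about how many parties may collude.

```python
import secrets

MODULUS = 2 ** 61 - 1  # assumed public modulus, larger than any possible total

def make_shares(value, n_parties):
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_values):
    """Simulate the protocol: party i sends its j-th share to party j."""
    n = len(private_values)
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party adds the shares it received; one partial sum alone looks random.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partial_sums) % MODULUS

counts = [120, 75, 310]  # hypothetical private counts held by three parties
total = secure_sum(counts)
```

The caveat in the text is visible here: with only two parties, each one can subtract its own input from the total and learn the other's value exactly, even though the protocol itself leaked nothing.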

Group-based anonymization
Many privacy transformations work by forming groups of anonymized records that are transformed in a group-specific manner. A number of techniques have been proposed for group anonymity in different studies, such as the k-anonymity, l-diversity and t-closeness methods. The comparison of group anonymity methods is given in Table 1.

k-anonymity
The k-anonymity method, proposed by Samarati and Sweeney, is the privacy method most commonly used to protect the identity of data subjects when data is published [32].
The method ensures that, after the ID attributes are removed, each combination of QID values in the table to be published is shared by at least k records.
Since the QID attributes of each record in the published table are the same as those of at least k-1 other records, identity disclosure is hindered.
To reduce the level of detail of the data representation, some attributes can be replaced with more general values (generalization), and some data points can be eliminated or descriptive data deleted (suppression). However, while k-anonymity provides protection against identity disclosure attacks, it does not protect against attribute disclosure attacks. It is also better suited to individual data than to restricting privacy-preserving data mining results directly. Moreover, k-anonymity cannot fully protect the privacy of users when the sensitive values within a group are homogeneous. Providing optimal k-anonymity is an NP-hard problem, and approximate solutions have been proposed to avoid the computational difficulty [33].
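The k-anonymity condition on a published table can be checked with a short sketch. It only verifies the property on an already generalized table and does not search for an optimal generalization; the table, attribute names and function name are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every QID combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

table = [
    {"age": "25-29", "zip": "G1*", "disease": "flu"},
    {"age": "25-29", "zip": "G1*", "disease": "flu"},
    {"age": "30-34", "zip": "EH*", "disease": "asthma"},
    {"age": "30-34", "zip": "EH*", "disease": "diabetes"},
]
two_anonymous = is_k_anonymous(table, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(table, ["age", "zip"], k=3)
```

The example table is 2-anonymous, yet everyone in the first group has the same disease, so knowing that a person belongs to that group reveals the sensitive value. This is the attribute disclosure weakness described above.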
In the literature, different approaches derived from k-anonymity, such as k-neighbor anonymity, k-degree anonymity, k-automorphism anonymity, k-candidate anonymity and l-grouping, have been proposed according to the structural features of the data.

l-diversity
The l-diversity approach was proposed by Ashwin Machanavajjhala et al. in 2007 to address a weakness of the k-anonymity model, the homogeneity attack [34].
This method aims to prevent the indirect disclosure of confidential information by ensuring that each QID group has at least l well-represented sensitive values.
However, l-diversity only guarantees the diversity of sensitive values within each QID group; it does not solve the problem that different values may belong to the same category.
In other words, it is not resistant to attacks based on semantic similarity between values.
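A check for the simplest reading of "well represented", namely at least l distinct sensitive values per QID group, can be sketched as follows. Stricter variants exist (for example entropy-based ones) and are not covered; the table and names are hypothetical.

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if every QID group contains at least l distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

table = [
    {"age": "25-29", "zip": "G1*", "disease": "gastric ulcer"},
    {"age": "25-29", "zip": "G1*", "disease": "gastritis"},
    {"age": "30-34", "zip": "EH*", "disease": "asthma"},
    {"age": "30-34", "zip": "EH*", "disease": "diabetes"},
]
diverse = is_distinct_l_diverse(table, ["age", "zip"], "disease", l=2)
```

The table passes the check for l = 2, but both values in the first group are stomach conditions, so an observer still learns the category of illness. The check cannot see this kind of semantic similarity.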

t-closeness
To overcome the limitations of the l-diversity approach with respect to the semantic similarity of SA values within each QID group, the t-closeness method has been proposed [35].
In the t-closeness method, the distance between the distribution of a sensitive attribute in any equivalence class and the distribution of that attribute in the whole table must not exceed a threshold value t. While the t-closeness approach provides protection against attribute disclosure, it cannot protect against identity disclosure. In addition, it limits the usefulness of the disclosed information; however, the threshold t can be adjusted in applications to trade off utility against privacy.
In the protection of privacy, the t-closeness and k-anonymity methods are used together to protect against both identity disclosure and attribute disclosure attacks [36].
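The threshold test can be sketched by computing, for each equivalence class, the distance between its sensitive-value distribution and the table-wide one. The distance measure is an assumption of this sketch: total variation distance is used because it is easy to compute for categorical values, and the cited work may define the distance differently. The table and names are hypothetical.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def variation_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any equivalence class and the whole table."""
    overall = distribution([r[sensitive] for r in records])
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(variation_distance(distribution(v), overall) for v in classes.values())

table = [
    {"age": "25-29", "disease": "flu"},
    {"age": "25-29", "disease": "flu"},
    {"age": "30-34", "disease": "asthma"},
    {"age": "30-34", "disease": "flu"},
]
worst = max_class_distance(table, ["age"], "disease")
```

Under this distance the table satisfies t-closeness for any threshold t that is at least `worst`; lowering t forces the classes to look more like the overall table, at the cost of utility.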

Methods applied to processed data
The outputs of data mining algorithms can disclose information without open access to the original data set. Sensitive information can be accessed through studies on the results. For this reason, data mining output must also protect privacy.

Query auditing and inference control
This method is examined under two headings: query inference control and query auditing. In query inference control, the input data or the output of the query is controlled. In query auditing, the queries made on the outputs obtained by data mining are audited. If the audited query would enable the disclosure of confidential data, the query request is denied. Although this limits data mining, it plays an active role in ensuring privacy. Query auditing can be done online or offline. In offline auditing, since the queries and their results are already known, it is evaluated whether the results violate privacy. In online auditing, since the queries are not known in advance, privacy checks are carried out during the execution of the query. This method is examined within the scope of statistical database security.
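A very simple form of online control is sketched below: a SUM query is denied when it selects too few records, or so many that the excluded records could be isolated by subtraction. This query-set-size rule is a technique swapped in for illustration rather than the specific auditing schemes of the cited literature, and it does not by itself stop inference from combinations of overlapping queries. The data and names are hypothetical.

```python
def audit_and_answer(records, predicate, attribute, min_size):
    """Answer a SUM query only if its query set is neither too small nor too large.

    Returns None when the query is denied.
    """
    selected = [r for r in records if predicate(r)]
    n = len(records)
    if len(selected) < min_size or len(selected) > n - min_size:
        return None
    return sum(r[attribute] for r in selected)

employees = [
    {"dept": "IT", "salary": 70},
    {"dept": "Sales", "salary": 40},
    {"dept": "Sales", "salary": 45},
    {"dept": "Sales", "salary": 50},
    {"dept": "HR", "salary": 42},
    {"dept": "HR", "salary": 44},
]
it_total = audit_and_answer(employees, lambda r: r["dept"] == "IT", "salary", 2)
sales_total = audit_and_answer(employees, lambda r: r["dept"] == "Sales", "salary", 2)
```

The query about the single IT employee is denied, and so is the complementary query over everyone except IT, since its answer could be subtracted from a known grand total.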

Differential privacy
The k-anonymity, l-diversity and t-closeness approaches are holistic approaches that try to protect the privacy of the data set as a whole. In some cases, there is a need to protect the privacy of data at the record level. For this reason, the differential privacy approach was proposed by Dwork to protect the privacy of database query results [37].
This model targets attacks that may occur between the submission of a database query and the response to it. If it is not possible to distinguish which of two databases, differing in a single record, returned the answer to the same query, then the presence or absence of that single record is not disclosed.
In addition, when output data is queried, the query results can be made to return approximate rather than exact values. It is also recommended to keep the data in the system perturbed during the execution of queries, just as in the data collection phase, to protect data privacy.
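Returning approximate query results is commonly realized with the Laplace mechanism, sketched here for a counting query. Adding or removing one record changes a count by at most 1, so noise with scale 1/epsilon is used. The privacy parameter `epsilon`, the function names and the synthetic records are assumptions of the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def is_smoker(record):
    return record["smoker"]

records = [{"smoker": i < 30} for i in range(100)]  # hypothetical data, 30 smokers
rng = random.Random(0)
noisy_answer = private_count(records, is_smoker, epsilon=0.5, rng=rng)
```

A smaller epsilon gives stronger privacy and noisier answers. Repeating the same query and averaging the answers removes the noise, so in practice the total privacy budget spent across queries has to be tracked.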

Association rule hiding
Association rule mining is one of the most frequently used data mining methods for revealing interesting associations between binary variables. During data mining, some rules may explicitly disclose private information about the data subject (an individual or a group).
Unnecessary and information-leaking rules may arise from some relationships. The aim of the association rule hiding technique, first proposed by Atallah [38], is to protect privacy by hiding all sensitive rules. The weakness of this technique is that a significant number of insensitive rules may also be hidden unintentionally [39].
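One way to hide a rule is to lower its confidence below the mining threshold by removing a consequent item from some supporting transactions. The sketch below is a naive greedy version written for illustration, not the algorithm of [38]; the transactions, the threshold and the function names are hypothetical.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    base = support_count(transactions, antecedent)
    return support_count(transactions, antecedent | consequent) / base if base else 0.0

def hide_rule(transactions, antecedent, consequent, min_confidence):
    """Return a sanitized copy in which the rule falls below min_confidence."""
    sanitized = [set(t) for t in transactions]
    victim = sorted(consequent)[0]  # item removed from supporting transactions
    for t in sanitized:
        if confidence(sanitized, antecedent, consequent) < min_confidence:
            break
        if (antecedent | consequent) <= t:
            t.discard(victim)
    return sanitized

baskets = [{"bread", "milk"}, {"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
before = confidence(baskets, {"bread"}, {"milk"})
sanitized = hide_rule(baskets, {"bread"}, {"milk"}, min_confidence=0.6)
after = confidence(sanitized, {"bread"}, {"milk"})
```

Removing an item also lowers the support of every other itemset that contained it, which is how insensitive rules can end up hidden as a side effect.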

Attacks against privacy
This section summarizes the common types of attacks that cause privacy violations and that motivated the development of the methods given above [6].

Semantic similarity attacks
These attacks exploit the intuitive similarity of sensitive attribute values within anonymous groups.
In this case, it is not sufficient for the sensitive attribute values merely to differ from each other in order to protect privacy [40]. The attack can be prevented by calculating the similarity of sensitive attribute values within the same anonymous group and by placing similar sensitive values in different groups.

Background knowledge attacks
Background knowledge is non-sensitive information that can be obtained from data published by different organizations, from social networks and media, or even by using social engineering methods. Background knowledge obtained by attackers leads to privacy attacks and breaches.
The data subject's privacy is violated when background knowledge is associated with other records using record linkage methods [41].
In addition, when information obtained from data owners through requests such as promotions, campaigns and surveys is associated with background knowledge, a violation of privacy is almost inevitable.
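The linking step can be sketched as a join between a published table without names and an external list with names, on shared quasi-identifiers. The two tables, the attribute names and the rule of reporting only unique matches are hypothetical choices for illustration.

```python
def link_records(published, external, quasi_identifiers):
    """Return (name, disease) pairs for external persons matching exactly one record."""
    matches = []
    for person in external:
        key = tuple(person[q] for q in quasi_identifiers)
        candidates = [
            r for r in published
            if tuple(r[q] for q in quasi_identifiers) == key
        ]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]["disease"]))
    return matches

published = [  # released without names
    {"age": 28, "zip": "G12", "disease": "flu"},
    {"age": 41, "zip": "EH1", "disease": "asthma"},
    {"age": 41, "zip": "EH1", "disease": "diabetes"},
]
external = [  # background knowledge, e.g. a public register
    {"name": "Alice", "age": 28, "zip": "G12"},
    {"name": "Bob", "age": 41, "zip": "EH1"},
]
reidentified = link_records(published, external, ["age", "zip"])
```

Only the person whose quasi-identifier values are unique in the published table is re-identified; the other matches two records, which is the protection that group-based anonymization aims for.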

Homogeneity attacks
When all or most of the sensitive attribute values in a group of an anonymized table are similar, the privacy of data owners is at risk of violation.
To prevent homogeneity attacks, it is necessary either to keep similar sensitive values from falling into the same group of the anonymized table, or to produce heterogeneous groups by diluting the homogeneous values with a record duplication approach [34].

Skewness attacks
The statistical distribution of sensitive attribute values in published or shared anonymous data sets can make skewness attacks against privacy successful. When some values are too dominant, the overall distribution of the sensitive attribute is skewed and anonymous data sets become vulnerable [35].

De Finetti attacks
It has been shown, with theoretical and experimental methods, that inferences about private data can be made using the concept of exchangeability and de Finetti's theorem [42]. The fact that the attacker does not need extensive background knowledge makes this attack attractive. An attacker can carry it out by applying machine-learning techniques to the non-sensitive attributes in the dataset.

Minimality attacks
The fact that the anonymization algorithm used in a data mining application is publicly known is also considered a privacy vulnerability [43]. The attack exploits the principle that anonymization should change the data as little as possible and should not over-anonymize it.

Temporal attacks
Republishing previously published generalized data over time, as the data changes, enables this attack. For this reason, previously published tables should be taken into account, and new records that may cause data disclosure should not be shared [44].

Discussion
The Covid-19 pandemic made digitalization mandatory all over the world and accelerated the flow of data. It has become even more important to collect the necessary data, analyze it correctly and extract reliable information. This situation has triggered the use of data mining methods to increase productivity and provide high quality products and services in almost all sectors. It is clear that if privacy is not taken into consideration throughout the data life cycle when data mining methods are applied, irreversible damage will be done to individuals, institutions and organizations.
In order to increase the reach and benefits of data mining technology, "privacy" should be defined precisely before PPDM techniques are applied, measurement metrics should be determined, and the results obtained should be evaluated with these metrics. For this reason, this study primarily focused on the definition of privacy. The term privacy is quite broad and does not have a standard definition, which makes measuring privacy challenging. Some measurement metrics are mentioned in this chapter, but metrics are usually determined by the application. The lack of a standard privacy metric also makes it challenging to compare and evaluate the PPDM techniques that have been developed.
In the age of digital and online business, privacy protection is needed at both the individual and organizational levels. Privacy protection at the individual level depends on the person, who is influenced by religious beliefs, community norms and culture. For this reason, the concept of personalized privacy, which allows individuals to have a certain level of control over their data, has been proposed. However, personalized privacy has proved difficult to implement, as people believe that compromising their privacy for applications they consider well intentioned will do no harm. Therefore, in the context of personalized privacy, new solutions are required for the trade-off between privacy and utility.
To effectively protect data privacy at the organizational level [7], policy makers in organizations should support privacy-enhancing technical architectures and models to securely collect, analyze and share data. Organizations should analyze the laws, regulations and fundamental principles regarding privacy. They need to include data owners in their assessment of privacy and security practices. Data owners should be involved in the whole process: what data is collected, how it is analyzed and for what purpose it is used. In addition, they should have the right to correct personal data in order to avoid the negative consequences of incorrect data. Organizations should employ data privacy analysts, data security scientists and data privacy architects who can develop data mining applications securely.
From a technical point of view, methods that protect confidentiality in data analytics are still in their infancy. Although different scientific communities such as cryptography, database management and data mining continue to study the problem, PPDM calls for interdisciplinary work. For example, the difficulties encountered in this process should also be addressed from a legal perspective. In this way, academic researchers and industrial practitioners can develop a better roadmap for next-generation privacy-preserving data mining design.

Conclusion
Businesses and even governments collect data through the many digital platforms (social media, e-health, e-commerce, entertainment, e-government, etc.) they use to serve their customers and citizens. The data collected can be sensitive, and it can be stored, analyzed and, most probably, anonymized and shared with others. In studies where data is used at any stage of the life cycle, regardless of the purpose, it is necessary to obtain privacy consent and to explain why the data needs to be accessed. Privacy Preserving Data Mining (PPDM) techniques are being developed to allow information to be extracted from data without disclosing sensitive information.
There is no single optimal PPDM technique for any stage of the data life cycle. The PPDM technique to be applied varies according to the application requirements, such as the desired privacy level, data size and volume, tolerable level of information loss and transaction complexity, because different application areas have different rules, assumptions and requirements regarding privacy.
In this chapter, previously proposed PPDM techniques are examined in two sections. The first section covers the methods suggested for the collection, cleaning, integration, selection and transformation phases of input data that will be subject to data mining, and the second section covers methods applied to processed data. Finally, attacks against the privacy of data mining applications are presented.