Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism Data Privacy for Big Data Publishing Using Newly Enhanced PASS Data Mining Mechanism

Anonymization is one of the main techniques that is being used in recent times to prevent privacy breaches on the published data; one such anonymization technique is k-anonymiz-ation technique. The anonymization is a parametric anonymization technique used for data anonymization. The aim of the k-anonymization is to generalize the tuples in a way that it cannot be identified using quasi-identifiers. In the past few years, we saw a tremendous growth in data that ultimately led to the concept of the big data. The growth in data made anonymization using conventional processing methods inefficient. To make the anonymi- zation more efficient, we used the proposed PASS mechanism in Hadoop framework to reduce the processing time of anonymization. In this work, we have divided the whole program into the map and reduce part. Moreover, the data types used in Hadoop provide better serialization and transport of data. We performed our experiments on the large dataset. The results proved the best efficiency of our implementation.


What is big data?
"Big data" is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.

Big data significance in industry and challenges
While understanding the estimation of big information keeps on residual a test, other down to practical challenges including financing and rate of return and aptitudes keep on remaining at the front line for various distinctive ventures that are embracing huge information. All things considered, a Gartner survey for 2015 demonstrates that over 75% of organizations are putting or are intending to put resources into enormous information in the following 2 years. These discoveries speak to a critical increment from a comparable study done in 2012, which showed that 58% of organizations put or were wanting to put resources into enormous information inside the following 2 years [1].
By and large, most associations have a few objectives for receiving enormous information ventures. While the essential objective of most associations is to upgrade client encounter, different objectives incorporate cost diminishment, better focused on promoting and making existing procedures more effective. Lately, information ruptures have additionally made upgraded security a critical objective that has huge information.

Data stream
Big data associated with the time stamp is called big data stream [2].

Constraints associated with data streams
Privacy protection: i.e., the data streams are extracted from various sources which consist of many individuals' information; hence, the sensitive data of any individuals must not be leaked.
Computer security: Access control and verification guarantee that opportune individual has a right expert to the correct protest at the perfect time and the ideal place. That is not what we need here. A general precept of information security is to discharge all the data as much as the personalities of the subjects (individuals) are ensured.
Real-time processing: Since the data is not static in nature, real-time processing is required, and at present, not many algorithms are there to process the dynamic data.

What is MapReduce?
MapReduce, as shown in Figure 1, is a preparing method and a program that demonstrates for circulated figuring in light of java. The structure deals with every one of the points of interest of information passing, for example, issuing errands, confirming assignment culmination, and duplicating information around the group between the hubs [1]: • Most of the registering happens on hubs with information on neighborhood circles that lessens the system activity.
• After consummation of the given errands, the bunch gathers and diminishes the information to shape a suitable outcome and sends it back to the Hadoop server.

Anonymization
Generally, the fundamental theme data anonymization [2][3][4][5] is the use of one or more techniques designed to make it impossible or at least more difficult to identify a particular individual from stored data related to them. Purposes of data anonymization are: 1. To prevent the privacy of individuals who shared data for various surveys

To implement effective techniques to prevent a security breach
It is privacy preservation techniques used for static data. Techniques implemented in anonymization are: 1. Encryption

K-anonymity
A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from a least k-1 individuals whose information also appear in the release. For example, if you try to identify a person from a release dataset but you only have information of his/her birth date and gender. There are k people that meet the requirement. This is k-anonymity [6,7].

Classification of attributes
Key attribute is name, address, and cell phone, which can uniquely identify an individual directly. It is always removed before release.
Quasi-identifier is a zip code, birth date, and gender, a set of attributes that can be potentially linked with external information to re-identify entities. Eighty-seven percent of the population in the USA can be uniquely identified based on these attributes, according to the census summary data in 1991. There are two tables shown below: Table 1 is hospital dataset and From above tables, we can conclude that Andre has heart disease; here the heart disease is the sensitive attribute. It is known as linking attack by combining two different tables. The solution is to consider all of the released tables before releasing the new one and trying to avoid linking. And k-anonymity does not provide privacy if sensitive values in an equivalence class lack diversity [8,9].

Algorithm
When the numbers of tuples in the processing window reach μ, one round of the clustering algorithm is started to slide again in order to accumulate more tuples in each round [10].
Parameters used in the algorithm are k, u, d: k defines the parameter for cluster anonymization.
d defines the number of clusters which can be used later.
u defines the processing window size.

Drawback
The main drawback of FANNST is that some tuples may remain in the system for more than allowable time constraint. In addition, the time and space complexity of the algorithm is O(S*S) and not efficient for a data streaming algorithm. Another weakness of FANNST is that it does not support categorical data.

FADS algorithm
The algorithm considers a set as a buffer and saves at most δ tuples in it [11,12]. Also, another set (setkc) is considered to hold the newly created cluster for later reuse. Each k-anonymized cluster will be remained in setkc up to the reuse constraint Tkc, and after that, the cluster is removed.

Drawbacks
The main drawback of the FADS is that the algorithm does not check the remaining time of tuples that hold in the buffer in each round and give their result when they might be considered to have expired. The other important weakness of FADS is that it is not parallel and cannot handle a large number of data streams in tolerable time. A sequence of tuples is defined as <sn>n∈N where N is the natural number set. The kth term of <sn> is order pair (t, tk) where k is a number and tk is a tuple.
A data stream S is a potentially infinite sequence of tuples, depicted by <t i >, where all tuples t i follow the schema t i = <ID,a 1 ,a m ,q 1 ,q n ,TS>. ID is an identifier attribute; q 1 to q n are quasiidentifiers, and TS is the time stamp.

Cluster
The cluster is a set of tuples in a stream [12]. Suppose that PS is a set of tuples in stream cluster C which can be defined as follow:

K-anonymized cluster
If a cluster C is built from the data stream and the number of the unique tuple in the cluster is greater than k, the cluster is called a k-anonymized cluster.

Generalization
Generalization is a function that maps a cluster into a tuple. More formally, generalization function G is defined as G: PowerSet(TUPLE) ! TUPLE where TUPLE is the set of all possible tuples.

Numerical value generalization
Numerical values are generalized in between maximum and minimum value, i.e., they are generalized in their domain.

Categorical value generalization
Categorical values are generalized to their lowest common ancestors as shown in Figure 2.
The above tuple can be generalized as follows: gc = < * ,staff,[39-46]>. Since we do not want to disclose the name, we kept * in the first column; here profession is categorical value, and age is numerical value; age is generalized as [max, min], and profession is generalized to lowest common ancestor of academic and nonacademic.

Distance
Distance is used to calculate the similarity or dissimilarity between two tuples. This function is the heart of the clustering. Generally, clustering is done based on distance calculation; the tuples with the closest distance are placed the same cluster.

The distance between two categorical values
If all the categorical values are arranged in the form of a tree where the root is the most generalized value of all the values and lowest most level containing more specialized values of the categorical values, e.g., of a categorical tree as shown in Figure 3 Country taxonomy tree and Figure 4 Occupation taxonomy tree.
Distance between two categorical values v 1 ,v 2 = d(v 1 ,v 2 ) = (height of the subtree roots at lowest common ancestor of (v 1 ,v 2 ))/(height of tree): For example, distance between India and Egypt (considering the tree from the above picture).
=Height of subrooted tree of a lowest common ancestor of India and Egypt/height of the tree.
=Height of the tree with east as root/height of tree = 1/3 = 0.33. The distance d(r1,r2) (i.e., the distance between two tuples r1, r2) is defined as: d(r1,r2) = sum of distances between numerical attributes of two tuples + sum of distances between categorical attributes of two tuples.
Information loss: generalization leads to information loss, but we have to group clusters in such a way that the information loss is minimum.
Information loss of a single cluster is calculated as: Total information loss = sum of information loss of all the clusters.
Information loss of the cluster = info loss of all the tuples in the cluster.
Information loss of the tuple = information loss of all the attributes (categorical attributes and numerical attributes).
Information loss of numerical attribute = (value of attribute)/(domain of the attribute).  The algorithm reads $ tuples continuously and inserts them into the SetTp. At First, for each tuple in SetTp procedure finds t's K-1 nearest tuples in SetTp, with the help of tuple t and its K-1 nearest tuples, generate a new set called as Snew and generalize it into Gs. Then a set with minimum information loss (Sk-best) that covers tuple t is chosen from SetKc if Sk-best exists and has smaller information loss than Gs; then tuple t is published Sk-best generalization.
If tuple t does not match with any set of SetKc which has less information loss compared to Gs, then tuple t is published with Snew generalization, i.e., Gs. Then Gs is inserted in SetKc.
In the following, a simple example is illustrated for better understanding. Table 3 is a portion of a university person data stream, in which quasi-identifiers are age and job. Also $ and K are assumed as $ = 3 and K = 2. Suppose that in thread n, the value of variables is as follows: In this stage, information loss of Sk-best is compared with Gs information loss. As the information loss of Sk-best is less than Gs, a tuple with idn is published with Sk-best generalization.

Dataset description
In this experiment, we evaluated the performance of the proposed algorithm on the adult dataset from UCI [13]. The dataset was widely used for the privacy-preserving purpose. The taxonomy tree is defined as per Figure 5. The sensitive attribute in the dataset is age (numerical) and profession (categorical).

Results and discussions
The total number of records in the dataset used for the experiment purpose is 32,599 tuples. The efficiency of proposed algorithm is verified by parameter information loss. The average information loss of the proposed PASS algorithm, FADS and FAST, is presented in Figure 6.
The proposed PASS algorithm publishes data with less information loss, because the SetKc in the proposed approach as shown in Figure 7 has more entities so that the data tuple has more  options to select, and this decreases the information loss as shown in Figure 8, and hence the results of an algorithm show improvement. The average execution time drastically decreases as MapReduce-based newly enhanced PASS mechanism is used.

Conclusion
All the algorithms which are present for data stream processing are not capable of processing big data, i.e., data with high capacity and volume. The data which is processed using data anonymization (nonparallel) algorithms use old languages (JAVA, SQL) and old techniques, which are not very effective means because they take a lot of time for computation and sometimes provide tuples, which are expired; this lead to loss of accuracy as well as loss of privacy which is very dangerous. Static algorithms need all the computations to be performed on a single node due to which the data and the processing requirements are very high and the computers used are prone to failure which is very expensive to recover.
In this paper, we have proposed PASS algorithm, which uses Hadoop framework to process the data. Using Hadoop, the computer's resources are used to the maximum extent by which time required for computation is reduced which in turn prevents the publishing of expired tuples. Other advantages of this algorithm are that computations can be performed on nodes which have less computation and less storage capacity than that of computers which perform nonparallel data processing. The proposed PASS algorithm publishes data with less information loss. Using Hadoop, the failures in both data and processors can be recovered. These features drastically reduce the maintenance cost and the initial setup cost.