Open access peer-reviewed chapter

Multiplicative Data Perturbation Using Random Rotation Method

Written By

Thanveer Jahan

Submitted: 08 April 2022 Reviewed: 16 May 2022 Published: 03 September 2022

DOI: 10.5772/intechopen.105415

From the Edited Volume

Data Integrity and Data Governance

Edited by B. Santhosh Kumar

Abstract

Today’s applications rely on large volumes of personal data that are collected and processed regularly, and many unauthorized users try to access these private data. Data perturbation methods are one family of Privacy Preserving Data Mining (PPDM) techniques and play a key role in perturbing confidential data. This research focuses on developing an efficient data perturbation method for multivariate datasets that preserves privacy in a centralized environment and allows the data to be published. To perturb a multivariate dataset, a Multiplicative Data Perturbation (MDP) method using random rotation is proposed. The results show that the proposed multiplicative data perturbation is efficient on multivariate datasets, is resilient to attacks or threats, and preserves privacy in a centralized environment.

Keywords

  • privacy
  • multiplicative data perturbation
  • random rotation method

1. Introduction

This chapter proposes a Multiplicative Data Perturbation method. Multivariate datasets are first perturbed using a geometric data perturbation method. The perturbed data are then passed through a Discrete Cosine Transformation, and the Euclidean distance between pairs of data values is considered. The proposal is elaborated in the following sections.

1.1 Background

Hybrid transformations are used to maintain the statistical properties of data as well as their mining utility [1, 2, 3]. The statistical properties in question are the mean and the variance (or standard deviation), which are maintained without any loss of data. A feasible solution [4] optimizes the data transformations by maximizing the privacy of sensitive attributes. It combines randomization and geometric transformation to protect sensitive data. The randomized technique is represented as D = X + R, where R is additive noise, X is the original data and D is the perturbed data. The geometric transformation is a 2D rotation represented as D′ = Rθ × D, where D is the column vector containing the original co-ordinates and D′ is a column vector whose co-ordinates are rotated clockwise. This method treats only a single attribute as sensitive and the rest as non-sensitive. A data perturbation method using fuzzy logic and random rotation is proposed in [5, 6].

The original data are first perturbed using a fuzzy-based approach, and random rotation perturbation is then applied to selected confidential numerical attributes to obtain the transformed data P = M*R, where M is the dataset transformed using the fuzzy-based approach and R is the generated random dataset. The distorted data P are released for clustering analysis, and the resulting accuracy is measured. The approach compromises between privacy and accuracy. A hybrid method using SVD and shearing-based data perturbation [7] is also proposed to obtain perturbed data. The approach removes the identifying attributes from the dataset. The remaining attributes are normalized using Z-score normalization to bring them to the same scale. The dataset is then perturbed using the SVD transformation. Each record of the perturbed dataset is further distorted using a shear-based data perturbation method represented as D′ = D + ShD × D, where ShD is the random noise and D is the perturbed dataset obtained after the SVD transformation.

The results show that hybrid methods attain higher privacy than single data perturbation methods. A hybrid technique [7, 8] based on the Walsh-Hadamard Transformation (WHT) and rotation is also proposed. The Euclidean distance preserving Walsh-Hadamard transformation (Hn) given below generates an orthogonal matrix that preserves the statistical properties of the original dataset.

$$H_n = \bigotimes_{i=1}^{n} H_2 = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{n} \tag{1}$$

where $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ is a matrix and $\otimes$ denotes the tensor or Kronecker product. A rotation transformation is then applied to preserve the distance between the data points. The perturbed data preserve the distance between data records and maintain accuracy when used with classifiers. The method is limited to numerical attributes and can be extended to categorical attributes. A hybrid approach for data transformation is proposed by Manikandan et al. [9] to sanitize the data and normalize them using min-max normalization [10]. The approach transforms the original data while maintaining the inter-relative distance among the data. Clustering analysis shows that the number of clusters in the original data is similar to that in the modified data. Another approach that modifies the original data to preserve privacy with the help of inter-relative distance on categorical data is proposed in [2].
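The Kronecker construction of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the cited works: the helper names `kron`, `hadamard` and `gram` are hypothetical, and the matrix is left unnormalised, so its rows are mutually orthogonal with squared length 2^n rather than 1.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


H2 = [[1, 1], [1, -1]]


def hadamard(n):
    """Return H_n as the n-fold Kronecker product of H2 (size 2**n)."""
    h = [[1]]
    for _ in range(n):
        h = kron(h, H2)
    return h


def gram(m):
    """Return M * M^T, used here to check that the rows are orthogonal."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in m] for r1 in m]


H = hadamard(2)   # 4 x 4 Walsh-Hadamard matrix
G = gram(H)       # equals 4 * identity, so H / 2 is orthogonal
```

Dividing H_n by the square root of 2^n would give an orthogonal matrix, which is the property that makes the transformation distance preserving.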

The categorical data are converted into binary data and transformed using a geometric transformation. A clustering algorithm is then used for analysis, and the results show better data utilization as well as privacy preservation. Multiplicative noise is generated using random numbers with a mean of 1 and is multiplied by the original data value. Alternatively, a random number is drawn from a Gaussian distribution with a mean of 0 and a small variance. Geetha Mary A et al. [11] proposed a non-additive method of perturbation by randomization, in which data are generated from intervals based on the level of privacy specified by a user. A random number is generated and either added to or multiplied with the data to produce randomly modified data. The perturbed data are classified and measured using metrics.
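The two noise models mentioned above, additive noise with mean 0 and multiplicative noise with mean 1, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the noise level `sigma = 0.1` and the fixed seed are hypothetical choices, and the sample values are the A2 column of Table 2.

```python
import random


def additive_perturbation(values, sigma, rng):
    """D = X + R, with R drawn from a Gaussian with mean 0."""
    return [x + rng.gauss(0.0, sigma) for x in values]


def multiplicative_perturbation(values, sigma, rng):
    """D = X * R, with R drawn from a Gaussian with mean 1."""
    return [x * rng.gauss(1.0, sigma) for x in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = [22.08, 22.67, 29.58, 21.67, 20.17]
added = additive_perturbation(original, 0.1, rng)
scaled = multiplicative_perturbation(original, 0.1, rng)
```

With additive noise the distortion has the same magnitude for every value, whereas with multiplicative noise it grows with the value being perturbed.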

The condensation approach presented by Aggarwal and Yu [12] is a multidimensional perturbation technique that provides privacy for multiple columns using the covariance matrix. The approach was weak in protecting data privacy. Rotation perturbation was used for privacy preserving data classification [13]. Rotation perturbations are task specific and aim for a better balance between loss of information and loss of privacy. Multiplicative data perturbation includes three types of techniques: rotation perturbation, projection perturbation and geometric data perturbation.

A rotation perturbation framework was adopted in privacy preserving data classification [14]. It is defined as G(X) = RX, where R is a randomly generated rotation matrix and X is the original data. Distance preservation is both the benefit and the weakness of this method, because it makes the method prone to distance inference attacks. These attacks are addressed in [15, 16, 17]. Chen et al. [14] proposed an improved version that is more resilient to attacks. Oliveira et al. [17] proposed a scaling transformation along with random rotation for privacy preserving clustering.
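The distance-preserving property of G(X) = RX can be checked numerically. The sketch below is illustrative only: it uses a 2D rotation with an arbitrarily chosen angle of 37°, hypothetical helper names, and two records built from the A2 and A3 values of the first two rows of Table 2.

```python
import math


def rotation_matrix(theta):
    """2D rotation matrix for an angle theta given in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(matrix, point):
    """Multiply a matrix by a column vector."""
    return [sum(m * x for m, x in zip(row, point)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


R = rotation_matrix(math.radians(37.0))   # arbitrary angle for illustration
x1, x2 = [22.08, 11.46], [22.67, 7.0]     # two records (A2, A3)
g1, g2 = apply(R, x1), apply(R, x2)       # perturbed records G(X) = RX

before = distance(x1, x2)
after = distance(g1, g2)                  # equal to `before` up to rounding
```

The individual values change, but the pairwise distance does not. This is why distance-based classifiers and clustering keep their accuracy, and also why an attacker who knows a few original points can exploit the preserved distances.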

A random projection perturbation is proposed in [13, 18] to project a set of points from the original multidimensional space to another, randomly chosen space. This yields only approximate model quality. A random projection matrix is used in privacy preserving data mining to enable individuals to choose their privacy levels.

An ideal data perturbation [19] aims for a balanced tradeoff that minimizes both information loss and privacy loss. However, these are not balanced in the existing algorithms. Compared with the existing approaches in privacy preserving data mining, geometric data perturbation has significantly reduced these shortcomings [20].

A geometric data perturbation is a sequence of random geometric transformations comprising a multiplicative transformation (R), a translation transformation (T) and a distance perturbation (DP) [21, 22].

$$X' = RX + T + DP \tag{2}$$

The approach has two unique characteristics. The first is to perturb the original data with geometric rotation and translation and to identify rotation-invariant classifiers, as given in Eq. (2). The second is to build a privacy model by evaluating the privacy quality of the perturbation method. The resulting privacy model is used to analyze attacks such as naive inference and ICA-based reconstruction. The quality of a data perturbation approach is determined by the quality of the privacy it preserves, that is, by how difficult it is to estimate the original data from the perturbed data. Such estimations are called inference attacks and fall into three categories: naive inference, reconstruction-based inference and distance-based inference.

A naive inference attack, proposed in [23], uses statistical methods to estimate the original data from the perturbed data. It is represented as O = P, where O is the observed data and P is the perturbed data. Reconstruction-based attacks reconstruct the data from the perturbed data and from released information about the data; they are also referred to as Independent Component Analysis (ICA) attacks [24, 25]. They are represented as $O = E^{-1}P$, where $E^{-1}$ is the estimate obtained from released information about the data and P is the perturbed data. Distance-based attacks identify the images of known points and other relevant information about the data, using outliers, to discover the perturbation. They are represented as $O = E^{-1}P$, where $E^{-1}$ is the estimated mapping and P is the perturbed data.

The harder the inference, the better the original data are protected, so that an attacker cannot break the perturbation. These attacks were analyzed with a privacy model offering a privacy guarantee [26], which nevertheless failed to prevent the outlier attack. Existing data perturbation techniques exhibit a conflict between the data privacy metric and mining utility [27, 28]. Multiplicative data perturbation aims to maximize both data privacy and mining utility. The challenge is to improve data privacy during the mining process while preserving the model-specific information.

This chapter has so far surveyed privacy preserving data mining methods for protecting confidential data. The drawbacks of the existing data perturbation methods described above motivated us to resolve these issues by balancing data privacy and data utility. This research study gives a new direction to the challenges of preserving privacy using multiplicative data perturbation.

2. Proposed method

The proposed Multiplicative Data Perturbation (MDP) is shown as a block diagram in Figure 1.

Figure 1.

Block diagram for multiplicative data perturbation using random rotation method.

The block diagram takes the original dataset and processes it in two stages. In the first stage, the original dataset is perturbed using geometric data perturbation, which generates a distorted dataset. In the second stage, this distorted dataset is further perturbed using the Discrete Cosine Transformation to generate the final distorted dataset. Generating a distorted dataset with geometric data perturbation comprises three steps. In the first step, a random dataset of the same size as the original dataset is created, rotated counter-clockwise and multiplied with the original dataset. In the second step, the translation transformation, the transpose of the random dataset is added to the result of the first step, as in Step 4 of the algorithm in Section 3. In the third step, additive noise is added to obtain the distorted dataset. The proposal is stated as an algorithm in the next section.

3. Proposed multiplicative data perturbation using random rotation algorithm

A proposal for multiplicative data perturbation is given in this section. The pseudo code of the proposed algorithm is listed below.

Algorithm:

Input: A data matrix $D_{p \times q}$.

Output: Distorted data matrices D4 and D5.

Begin

Step 1: Create a random data matrix R with p rows and q columns, and rotate it to obtain $R_{q \times p}$ // counter-clockwise rotation by 90°

Step 2: Construct the data matrix X from the data matrices $R_{q \times p}$ and $D_{p \times q}$ as $X_{q \times q} = R_{q \times p} D_{p \times q}$ // multiplicative transformation

Step 3: Create another random data matrix $X1_{p \times q}$ with p rows and q columns, with mean 0 and standard deviation 1.

Step 4: Construct the distorted data matrix $D4_{p \times q}$ from X, the transpose of R and $X1_{p \times q}$ as:

$D4 = X + R^{T} + X1$ // geometric data perturbation

Step 5: Call function DCT($D4_{p \times q}$ : $D5_{p \times q}$) // discrete cosine transformation

Step 6: Output the resultant distorted data matrix $D5_{p \times q}$.

End

Function DCT($D4_{p \times q}$ : $D5_{p \times q}$) // function for discrete cosine transformation

Input: A data matrix $D4_{p \times q}$. Output: A data matrix $D5_{p \times q}$.

Begin

Step 1: Copy the data matrix D4 to a data matrix D5 // alias

Step 2: For i = 1 to q

  For k = 1 to q

    If k = 1 then

      $D5_i = \sqrt{1/q} \times D4_i \times \cos(3.142 \times (2k + 1)/(2q))$

    Else

      $D5_i = \sqrt{2/q} \times D4_i \times \cos(3.142 \times (2k + 1)/(2q))$

    End if

  End For

End For

Step 3: Construct the D5 data matrix and return it as a parameter.

End

The algorithm accepts the data matrix $D_{p \times q}$ with p rows and q columns as input. It creates a random data matrix R with p rows and q columns whose elements are random values. This random data matrix R is rotated counter-clockwise by 90° and then multiplied with the data matrix $D_{p \times q}$. The resulting data matrix is named X. Another random data matrix X1 with p rows and q columns is created such that its mean is 0 and its standard deviation is 1. The distorted data matrix D4 is then constructed by adding the data matrices X, $R^{T}$ and X1. The data matrix D4 is passed as a parameter to the function DCT(). The predefined conditions are checked and the data matrix D5 is updated. Once completely updated, the data matrix D5 is the output of the algorithm. The time complexity of the proposed MDP algorithm is found to be O(n), where n is the dimension of the dataset.
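The steps above can be sketched in Python using only the standard library. This is not the chapter's MATLAB source but an illustrative sketch under loudly stated assumptions: the data matrix is taken to be square (p = q) so that X, the transpose of R and X1 all have the same shape; the DCT is the standard orthonormal DCT-II applied to each row; and all function names are hypothetical.

```python
import math
import random


def rot90(m):
    """Rotate a matrix counter-clockwise by 90 degrees."""
    rows, cols = len(m), len(m[0])
    return [[m[i][cols - 1 - j] for i in range(rows)] for j in range(cols)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add(*matrices):
    """Element-wise sum of matrices that share one shape."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


def dct_row(x):
    """Orthonormal DCT-II of one row (assumed stand-in for MATLAB dct())."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                    for j in range(n))
        out.append(scale * total)
    return out


def mdp(d, rng):
    """Return (D4, D5) for a square data matrix d. Assumption: p == q."""
    p, q = len(d), len(d[0])
    if p != q:
        raise ValueError("this sketch assumes a square data matrix (p == q)")
    r = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(p)]   # Step 1
    x = matmul(rot90(r), d)                                           # Step 2
    x1 = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(p)]  # Step 3
    d4 = add(x, transpose(r), x1)                                     # Step 4
    d5 = [dct_row(row) for row in d4]                                 # Step 5
    return d4, d5
```

A call such as `mdp([[4.0, 2.0, 2.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0]], random.Random(0))` returns the two distorted matrices; the third row of that input is a hypothetical addition that makes the example matrix square. Because the DCT used here is orthonormal, each row of D5 has the same Euclidean length as the corresponding row of D4.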

The process of updating D5 is explained with the help of an example stated below:

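Example 1.1: Consider a data matrix $D_{p \times q} = \begin{bmatrix} 4 & 2 & 2 \\ 1 & 1 & 1 \end{bmatrix}$, where p = 2 and q = 3.

At step 1, create a random data matrix $R_{2 \times 3}$ as given below:

$R = \begin{bmatrix} 0.3034 & 0.7873 & 1.1471 \\ 0.2939 & 0.8884 & 1.0689 \end{bmatrix}$ and rotate R counter-clockwise by 90° as given below:

$R_{3 \times 2} = \begin{bmatrix} 1.1471 & 1.0689 \\ 0.7873 & 0.8884 \\ 0.3034 & 0.2939 \end{bmatrix}$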
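The rotation in step 1 can be reproduced with a short Python sketch. The helper `rot90` is a hypothetical stand-in for MATLAB's rot90() and is written here only to show how the entries of R move.

```python
def rot90(m):
    """Rotate a matrix counter-clockwise by 90 degrees."""
    rows, cols = len(m), len(m[0])
    return [[m[i][cols - 1 - j] for i in range(rows)] for j in range(cols)]


R = [[0.3034, 0.7873, 1.1471],
     [0.2939, 0.8884, 1.0689]]

R_rotated = rot90(R)
# R_rotated == [[1.1471, 1.0689], [0.7873, 0.8884], [0.3034, 0.2939]]
```

The last column of R becomes the first row of the rotated matrix, which matches the 3 × 2 matrix shown in the example.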

At step 2, the data matrix $X = D_{2 \times 3} * R_{3 \times 2}$ is constructed as given below:

$X = \begin{bmatrix} 6.4664 & 2.2049 \\ 2.2378 & 0.1134 \end{bmatrix}$

At step 3, create another random data matrix X1 with 2 rows and 3 columns such that the mean is 0 and the standard deviation is 1.

$X1 = \begin{bmatrix} 6.4664 & 1.4384 & 0.7549 \\ 2.9443 & 0.3252 & 1.3703 \end{bmatrix}$

At step 4, construct the distorted data matrix $D4 = X + R^{T} + X1$ as given below:

$D4 = \begin{bmatrix} 3.1036 & 0.1362 & 1.3618 \\ 5.0820 & 2.1020 & 1.980 \end{bmatrix}$

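At step 5, the function DCT(D4:D5) is called, where

$$DCT(k) = f(k)\sum_{k=1}^{q} D4_{q}\cos\left(\frac{(2k+1)\pi}{2q}\right), \quad k = 1, 2, \ldots, q;\ i = 1, \ldots, p \tag{3}$$

where

$$f(k) = \begin{cases} \sqrt{1/q}, & k = 1 \\ \sqrt{2/q}, & 2 \le k \le q \end{cases}$$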
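For comparison, the standard orthonormal DCT-II, which uses the same two scale factors f(k), can be sketched as follows. This is an assumption-laden illustration rather than the chapter's code: the zero-based indexing and the cosine argument follow the usual textbook DCT-II and may differ from Eq. (3) as printed, and the input is the first row of D4 from Example 1.1.

```python
import math


def scale(k, q):
    """Scale factor f(k): sqrt(1/q) for the first coefficient, sqrt(2/q) otherwise."""
    return math.sqrt(1.0 / q) if k == 0 else math.sqrt(2.0 / q)


def dct(x):
    """Orthonormal DCT-II of a sequence x (zero-based indices)."""
    q = len(x)
    return [
        scale(k, q) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * q))
                          for n in range(q))
        for k in range(q)
    ]


row = [3.1036, 0.1362, 1.3618]            # first row of D4 in Example 1.1
coeffs = dct(row)

norm_in = math.sqrt(sum(v * v for v in row))
norm_out = math.sqrt(sum(v * v for v in coeffs))   # equal to norm_in up to rounding
```

Because this transform is orthonormal, it preserves Euclidean lengths and therefore Euclidean distances between records, which is the property relied on when distance-based mining is run on the transformed data.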

Let k = 1 and q = 1, so that $f(k) = \sqrt{1/q}$ and f(1) = 1. Substituting these values in Eq. (3): $DCT(1) = 1 \times 3.1036 \times \cos(3 \times 3.14/2) = 5.7881$

Let k = 2 and q = 1, so that $f(2) = \sqrt{2/q}$ and f(2) = 1. Substituting these values in Eq. (3): $DCT(2) = 1 \times 0.1362 \times \cos(2 \times 2 \times 3.14/(2 \times 2)) = 1.3900$

Similarly, the remaining data values of D4 are calculated to form the D5 data matrix given below:

$D5 = \begin{bmatrix} 5.7881 & 1.3900 & 0.4371 \\ 1.3989 & 1.5826 & 2.3630 \end{bmatrix}$

The constructed data matrix D5 is the output.

4. Implementation

The algorithm proposed in the previous section is implemented in MatLab. Its source code is included. The details of the implementation are given in this section.

The implementation uses the built-in MatLab functions load(), size(), randn(), rot90(), dct() and normrnd(). First, the load() function reads the data into a data matrix D. The size() function retrieves the number of rows and columns. The randn() function generates a random matrix R of the same size as the data matrix D. The data matrix R is rotated using the built-in function rot90(). The data matrix R is then multiplied by the data matrix D to form the data matrix X. Next, normrnd() is called to generate a data matrix X1 with mean 0, standard deviation 1 and the same size as the data matrix D. The distorted data matrix D4 is constructed by adding three data matrices: X, $R^{T}$ and X1. Finally, the function dct() is applied to the distorted data matrix D4 to obtain the resultant distorted data matrix D5.

5. Experimentation

The experimentation was conducted on a desktop computer running the Windows XP operating system, MatLab and the Tanagra data mining tool. The experimental details are elaborated in this section. The experimentation begins with the original dataset D, which is given as input to the proposed MDP algorithm to obtain the distorted datasets D4 and D5. The original dataset D and the distorted datasets D4 and D5 are then uploaded into the Tanagra data mining tool after a class attribute is appended. The uploaded datasets are classified using the classification utility available within Tanagra, and the classification results are analyzed thereafter.

Similarly, the datasets are clustered using the clustering utilities available in Tanagra. The clustering results are also analyzed and presented in Section 7 under Results and Analysis. A unified column privacy metric for analyzing the possibility of attacks is also discussed, and its calculation is shown in Section 7. The Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets are used in this experimentation. The details of the Credit Approval dataset are given here.

A real multivariate dataset, namely Credit Approval, is downloaded from the UCI Machine Learning Repository website. Its details are shown in Table 1. The original dataset used in the experimentation is therefore the Credit Approval dataset. It comprises 690 rows (tuples) and 15 columns (attributes), including one target (class) attribute.

Dataset | Size | Description
Credit Approval | 690 rows and 15 columns | Customer details related to credit card applications

Table 1.

Details of credit approval dataset.

A sample list of the original dataset D with 5 rows and 14 attributes is shown at Table 2.

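A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14
1 | 22.08 | 11.46 | 2 | 4 | 4 | 1.585 | 0 | 0 | 0 | 1 | 2 | 100 | 1213
0 | 22.67 | 7 | 2 | 8 | 4 | 0.165 | 0 | 0 | 0 | 0 | 2 | 160 | 1
0 | 29.58 | 1.75 | 1 | 4 | 4 | 1.25 | 0 | 0 | 0 | 1 | 2 | 280 | 1
0 | 21.67 | 11.5 | 1 | 5 | 3 | 0 | 1 | 1 | 11 | 1 | 2 | 0 | 1
1 | 20.17 | 8.17 | 2 | 6 | 4 | 1.96 | 1 | 1 | 14 | 0 | 2 | 60 | 159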

Table 2.

A credit approval original dataset D.

The process in the experiment is explained as below:

First, a dataset named creditapproval.txt is loaded into a data matrix with the help of the load() method. Next, the size() method determines the number of rows p as 690 and the number of columns q as 14. The data matrix is now named $D_{p \times q}$. Then the built-in function randn(p, q) is used to create a random data matrix R. The random data matrix R is rotated with the help of the built-in function rot90(). The data matrix X is constructed by multiplying the data matrix R by the data matrix D. The built-in function normrnd(0, 1, p, q) is used to create another random data matrix X1 with p rows and q columns, such that its mean is 0 and its standard deviation is 1. The distorted data matrix D4 is constructed by adding three data matrices: X, $R^{T}$ (the transpose of R) and X1. The distorted data matrix D4 is given as a parameter to the function DCT(D4), which returns the final distorted data matrix D5 as output. When this process is executed in the experimentation, it outputs the distorted datasets D4 and D5.

6. Results and analysis

The original dataset D and the distorted datasets D4 and D5 are each appended with a class attribute, YES or NO. The original dataset D with the class attribute appended is shown in Table 3.

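A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class
1 | 22.08 | 11.46 | 2 | 4 | 4 | 1.585 | 0 | 0 | 0 | 1 | 2 | 100 | 1213 | NO
0 | 22.67 | 7 | 2 | 8 | 4 | 0.165 | 0 | 0 | 0 | 0 | 2 | 160 | 1 | NO
0 | 29.58 | 1.75 | 1 | 4 | 4 | 1.25 | 0 | 0 | 0 | 1 | 2 | 280 | 1 | NO
0 | 21.67 | 11.5 | 1 | 5 | 3 | 0 | 1 | 1 | 11 | 1 | 2 | 0 | 1 | YES
1 | 20.17 | 8.17 | 2 | 6 | 4 | 1.96 | 1 | 1 | 14 | 0 | 2 | 60 | 159 | YES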

Table 3.

A credit approval original dataset with class attribute.

Similarly, the distorted datasets D4 and D5 are also appended with a class attribute and presented in Section 7 as part of Results and Analysis. The datasets D, D4 and D5 are uploaded into the Tanagra data mining tool. First, the classification utility is used on the dataset D and the distorted datasets D4 and D5. It divides the attributes into two categories, non-class attributes and the class attribute. These two categories are the two inputs to the classifier chosen from those available.

Suppose SVM (Support Vector Machine) is selected as the classifier. It then classifies the datasets D, D4 and D5, based on the class attribute, into credit card approved or rejected. These results are presented in Section 7 under Results and Analysis. The experimentation is repeated with the Iterative Dichotomizer 3 (ID3), C4.5 (the successor of ID3), KNN (k-Nearest Neighbor) and MLP (Multi Layer Perceptron) classifiers.

The results of those experiments are presented in Section 7. The clustering utility available in the Tanagra data mining tool is used to cluster the original dataset D and the distorted datasets D4 and D5. The non-class attributes are given as input to the k-means clustering method. As a result, clusters are formed.

A unified column metric, the Root Mean Square Error (RMSE), is used to evaluate inference attacks. It is calculated using Eq. (4) as given below:

$$\text{RMSE } r = \sqrt{\frac{1}{q}\sum_{i=1}^{q}(D - P)^{2}} \tag{4}$$

where $D = d_1, d_2, \ldots, d_q$ are the original dataset values, $P = p_1, p_2, \ldots, p_q$ are the perturbed dataset values and q is the number of columns.

Then, the privacy is $\text{Privacy}(D, P) = \frac{r}{\sqrt{4\sigma^{2}}} = \frac{r}{2}$ (if the standard deviation σ = 1). The attacks used are:

  • Naive inference is calculated as given in Eq. (4), where D is the original data and P = E (E is the estimated, or random, dataset).
  • Reconstruction inference is calculated as given in Eq. (4), where D is the original dataset and the perturbed dataset is $P = E^{-1}P$ (5).
  • Distance-based inference is calculated as given in Eq. (4), where D is the original dataset and P = P′ (P′ is the mapped set of points of the perturbed dataset P).

The calculations of these metrics are furnished at Section 7 under Results and Analysis.
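The metric can be sketched in Python as follows. This is a minimal illustration, assuming σ = 1 so that the privacy value is r/2 as in the worked calculation in Section 7; the function names are hypothetical, and the sample vectors are the first rows of the matrices D and P used in that calculation.

```python
import math


def rmse(original, estimate):
    """Root mean square error between original and estimated values."""
    q = len(original)
    return math.sqrt(sum((d - p) ** 2 for d, p in zip(original, estimate)) / q)


def privacy(original, estimate):
    """Privacy value r / 2, assuming a standard deviation of 1."""
    return rmse(original, estimate) / 2.0


d_row = [4.0, 2.0, 2.0]            # first row of D in the worked example
p_row = [5.7881, 1.3900, 0.4371]   # first row of the distorted matrix P

r = rmse(d_row, p_row)
level = privacy(d_row, p_row)
```

For each attack, the second argument is replaced by the attacker's estimate (E, D″ or P′); a larger r means that the estimate is further from the original data.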

7. Results and analysis

The results obtained in the above experiment are presented in this section. The original dataset D is given as input to the proposed MDP and output distorted dataset D4 and D5 are presented below at Table 4 and Table 5, respectively.

A1A2A3A4A5A6A7A8A9A10A11A12A13A14
7.215.6914.655.318.79.82.57−11.51.9721.410.188.4494.111.22
2.731.434.3130.714.24.04.991.4−6.694.77−8.259.66157.63.98
1.330.06−13.517.83.1322.0−6.14−4.235.28−6.392.146.61259.3−2.37
9.726.019.224−4.47.82−10.3−7.8216.13−10.21.7813.12−2.167.827.38
2.12.033−9.22−0.522.26.95−0.102.54−6.2812.60.055.96457.78159.9

Table 4.

A credit approval distorted dataset D4.

A1A2A3A4A5A6A7A8A9A10A11A12A13A14
26.9822.111.643.719.6108.46.030.187.9973.3124.0757.164.852.67
8.8−1.610.90−6.311.2−4.446.466.85−0.08−0.08−7.90−5.33−67.91594.2
−4.8−12.19.15−12.12.4−13.16.023.03−10.18−10.1−2.99−12.4−229.7−1.56
−2.32.865.8014.623.555.805.89−1.11−11.90−11.92.7515.73−12.281.87
22.9−8.8−11.61−1.21.794.63−8.728.00−3.50−3.502.75−8.2449.84−1.27

Table 5.

A credit approval distorted dataset D5.

When SVM classifier is used on D, D4 and D5 datasets, the following observations are made and the same are presented at Table 6.

DatasetTotal Number of TuplesNumber of Training Tuples Classified as Approved(YES)Number of Support VectorsError RateComputation Time (ms)
Original (D)6905893920.141562 ms
Distorted (D4)6905826210.4462172 ms
Distorted (D5)6905876320.3151969 ms

Table 6.

A credit approval dataset classified using SVM.

In Table 6 above, the first column presents the original dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for the experiment can be seen in the second column. The third column displays the number of training tuples classified as credit card approved (YES). The number of support vectors is given in the fourth column. The fifth column shows the error rate of the SVM classifier. The computation time is tabulated in the last column.

Similarly, when ID3 and C4.5 classifiers are used on D, D4 and D5 datasets the results are tabulated at Tables 7 and 8.

DatasetTotal Number of TuplesNumber of Training Tuples Classified as Approved(YES)Tree having number of nodes and leavesError RateComputation Time(ms)
Original (D)6905847 node,4 leaves0.146416 ms
Distorted(D4)6904763 node, 2 leaves0.310131 ms
Distorted(D5)6905801 node, 1 leaf0.446416 ms

Table 7.

A credit approval dataset classified using ID3.

DatasetTotal Number of TuplesNumber of Training Tuples classified as Approved(YES)Tree having number of nodes and leavesError RateComputation Time(ms)
Original (D)69064467 node, 34 leaves0.06647 ms
Distorted(D4)690621137 nodes, 69 leaves0.101172 ms
Distorted(D5)690634157 nodes, 79 leaves0.1246234 ms

Table 8.

A credit approval dataset classified using C4.5.

In the above Tables 7 and 8, the first column presents the dataset D and distorted dataset D4 and D5. The number of tuples in the datasets considered for experimenting can be seen in the second column. The third column displays the number of training tuples belonging to credit card approved as YES. A tree having number of nodes and leaves is furnished in the fourth column. Fifth column reveals the error rate of the ID3 and C4.5 classifiers, respectively. The computation time is tabulated at last column. When KNN classifier is used on D, D4 and D5 datasets the following observations are made and presented at Table 9.

DatasetTotal number of tuplesNumber of Training Tuples Classified as Approved (YES)NeighborsError RateComputation Time(ms)
Original (D)69053750.2217313 ms
Distorted(D4)69048550.297422 ms
Distorted(D5)69067350.3145391 ms

Table 9.

A credit approval dataset classified using KNN.

In Table 9 above, the first column presents the original dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for the experiment can be seen in the second column. The third column displays the number of training tuples classified as credit card approved (YES) by the KNN classifier. The fourth column displays the number of neighbors. The fifth column shows the error rate of the KNN classifier. The computation time is tabulated in the last column.

Similarly, the results are tabulated at Table 10 when MLP classifier is used on D, D4 and D5 datasets.

DatasetTotal Number of TuplesNumber of tuples Classified as Approved(YES)Max IterationTrain Error RateComputation Time(ms)
Original (D)6906201000.0924578 ms
Distorted(D4)6905521000.168562 ms
Distorted(D5)6905891000.347625 ms

Table 10.

A credit approval dataset classified using MLP.

In Table 10 above, the first column presents the original dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for the experiment can be seen in the second column. The third column displays the number of tuples classified as credit card approved (YES). The maximum number of iterations for the MLP classifier is given in the fourth column. The fifth column shows the training error rate of the MLP classifier. The computation time is tabulated in the last column. Based on the results presented above, the accuracy of classification of the datasets is presented in Table 11. The accuracy is the percentage of tuples that were correctly classified by a classifier.

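Dataset | Credit Approval (SVM / ID3 / C4.5 / KNN / MLP) | Haber Man (SVM / ID3 / C4.5 / KNN / MLP) | Tic-Tac-Toe (SVM / ID3 / C4.5 / KNN / MLP) | Diabetes (SVM / ID3 / C4.5 / KNN / MLP)
Original (D) | 85 / 84 / 93 / 98 / 89 | 75 / 86 / 88 / 89 / 90 | 97 / 89 / 87 / 98 / 97 | 89 / 86 / 82 / 89 / 90
Distorted (D4) | 86 / 70 / 90 / 88 / 80 | 68 / 78 / 79 / 75 / 76 | 98 / 73 / 77 / 96 / 89 | 80 / 83 / 79 / 79 / 85
Distorted (D5) | 88 / 84 / 91 / 97 / 87 | 72 / 85 / 85 / 89 / 90 | 99 / 88 / 87 / 98 / 97 | 83 / 85 / 81 / 89 / 90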

Table 11.

Accuracy of classifiers (%).
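The accuracy figures in Table 11 are percentages of correctly classified tuples. As a small illustration of that definition, the sketch below compares the class column of Table 3 with a hypothetical classifier output; the function name and the predicted labels are assumptions made only for this example.

```python
def accuracy_percent(true_labels, predicted_labels):
    """Percentage of tuples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


truth = ["NO", "NO", "NO", "YES", "YES"]       # class column of Table 3
predicted = ["NO", "NO", "YES", "YES", "YES"]  # hypothetical classifier output

acc = accuracy_percent(truth, predicted)       # 4 of 5 correct -> 80.0
error_rate = 1.0 - acc / 100.0                 # corresponding error rate
```

The error rates reported in Tables 6 to 10 are the complementary quantity, that is, the fraction of tuples that were classified incorrectly.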

Table 11 above presents the accuracy of the classifiers for the Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets. The first column presents the dataset D and the distorted datasets D4 and D5. The second, third, fourth and fifth columns present the accuracy of classification obtained on the Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets, respectively, using the SVM, ID3, C4.5, KNN and MLP classifiers.

It is observed that the accuracy of the C4.5, KNN and MLP classifiers is better than that of the other classifiers, and that accuracy is better for the distorted dataset D5 than for the distorted dataset D4.

Table 12 below presents the comparison of accuracy. The first column presents the distorted datasets D4 and D5. The columns under the proposed MDP present the accuracy obtained on the Credit Approval, Tic-Tac-Toe and Diabetes datasets for the SVM and KNN classifiers. The columns under the existing geometric data perturbation method present the accuracy reported for that method on the same datasets and classifiers. It is observed that the accuracy obtained with the proposed MDP is better than that of the existing geometric data perturbation method. Moreover, the accuracy of the existing method is available only for the SVM and KNN classifiers on the Credit Approval, Tic-Tac-Toe and Diabetes datasets.

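Dataset | Proposed MDP: Credit Approval (SVM / KNN) | Proposed MDP: Tic-Tac-Toe (SVM / KNN) | Proposed MDP: Diabetes (SVM / KNN) | Existing geometric data perturbation: Credit Approval (SVM / KNN) | Existing geometric data perturbation: Tic-Tac-Toe (SVM / KNN) | Existing geometric data perturbation: Diabetes (SVM / KNN)
Distorted (D4) | 86 / 88 | 98.7 / 98 | 80 / 79 | 86.5 / 82.9 | 98 / 99.5 | 77 / 73.5
Distorted (D5) | 88 / 97 | 99 / 98.5 | 83.4 / 89 | - | - | -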

Table 12.

Comparison of accuracy.

The proposed MDP gives better accuracy for the distorted dataset D5 than for the distorted dataset D4, whereas the literature does not report any accuracy for the distorted dataset D5.

The results of k-means clustering are shown below at Table 13, when k = 2 (form two clusters).

Dataset | Number of objects | Number of objects in cluster 1 | Number of objects in cluster 2 | Computation time (ms)
Original (D) | 690 | 259 | 431 | 94 ms
Distorted (D4) | 690 | 391 | 299 | 109 ms
Distorted (D5) | 690 | 336 | 354 | 125 ms

Table 13.

Clustering on credit approval dataset for k = 2.

In the above Table 13, the first column presents the dataset D, D4, and D5. The number of objects in the dataset considered for the experiment can be seen in the second column. The third column displays the number of objects belonging to cluster1. The fourth column reveals the number of objects belonging to cluster 2. The computational time is presented in the last column. Based on the results presented above the misclassification error rate of datasets is presented at Table 14.

Dataset | Proposed MDP: Credit Approval | Proposed MDP: Haber Man | Proposed MDP: Tic-Tac-Toe | Proposed MDP: Diabetes
Distorted (D4) | 0.389 | 0.189 | 0.035 | 0.03
Distorted (D5) | 0.22 | 0.100 | 0.031 | 0.02

Table 14.

Comparison of misclassification error-rate.

The above Table 14 presents the misclassification error rate. The first column presents the distorted dataset D4 and D5. The second column presents the error rate obtained on the proposed MDP using Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets.

In the privacy metric mentioned in Section 1.5 in Eq. 1.2, the detailed calculation of privacy quality to analyze attacks is shown below:

Consider the data matrix D=422111 the corresponding distorted data matrix using the proposed MDP is given below:

P=5.78811.39000.43711.39891.58262.3630, E is the estimated values (Random) as given below:

E=3.54411.39000.32110.93212.45676.7860 and calculating D″ = R−1*P is given below

D"=2.14610.28000.32111.84214.67674.6130 and calculating P′ is given below

P=1.92610.68001.32113.68211.68214.5920

Then, substitute the above data matrices into Eq. 1.2 to analyze the following attacks:

Naive inference attack: The RMSE r is calculated by substituting the data matrices D and E. The result obtained is given below:

$$r = \sqrt{\frac{1}{3}\sum_{i=1}^{2}(D - E)^{2}} = 1.9221, \qquad \text{Privacy}_{D,P} = r/2 = 0.6796$$

Reconstruction-based inference attack: The RMSE r is calculated by substituting the data matrices D and D″. The result obtained is given below:

$$r = \sqrt{\frac{1}{3}\sum_{i=1}^{2}(D - D'')^{2}} = 1.6794, \qquad \text{Privacy}_{D,D''} = r/2 = 0.839$$

Distance-based attack: The RMSE r is calculated by substituting the data matrices D and P′. The result obtained is given below:

$$r = \sqrt{\frac{1}{3}\sum_{i=1}^{2}(D - P')^{2}} = 1.70261, \qquad \text{Privacy}_{D,P'} = 0.851$$
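The calculation pattern used for the three attacks (compute the RMSE r between the original data and the attacker's estimate, then halve it) can be sketched in code. This is an illustrative sketch under stated assumptions, not the chapter's implementation: it averages the squared error over all matrix entries, whereas Eq. 1.2 may normalize differently, so its output is not expected to reproduce the figures above. The names `rmse` and `privacy_score` and the toy matrices are hypothetical.

```python
import math


def rmse(original, estimate):
    """Root mean square error over all entries of two equally shaped matrices.

    Assumption: the squared error is averaged over every entry; the
    chapter's Eq. 1.2 may use a different normalization.
    """
    if len(original) != len(estimate) or any(
        len(row_o) != len(row_e) for row_o, row_e in zip(original, estimate)
    ):
        raise ValueError("matrices must have the same shape")
    squared = [
        (a - b) ** 2
        for row_o, row_e in zip(original, estimate)
        for a, b in zip(row_o, row_e)
    ]
    return math.sqrt(sum(squared) / len(squared))


def privacy_score(original, estimate):
    """Privacy value taken as r / 2, following the worked example."""
    return rmse(original, estimate) / 2


# Toy data (hypothetical): an attacker's estimate that misses one entry by 2.
original = [[1.0, 2.0], [3.0, 4.0]]
attacker_estimate = [[1.0, 2.0], [3.0, 6.0]]
print(rmse(original, attacker_estimate))           # 1.0
print(privacy_score(original, attacker_estimate))  # 0.5
```

A larger r means the attacker's estimate lies further from the original data, which is why a higher value is read as better privacy.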

Similarly, the RMSE r is calculated for the original dataset D and the distorted datasets D4 and D5, and the results are given in Table 15.

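| Attacks | Proposed MDP: Credit Approval | Proposed MDP: Haber Man | Proposed MDP: Tic-Tac-Toe | Proposed MDP: Diabetes | Existing geometric: Credit Approval | Existing geometric: Tic-Tac-Toe | Existing geometric: Diabetes |
|---|---|---|---|---|---|---|---|
| Naive | 1.743 | 1.129 | 1.564 | 1.512 | 1.345 | 1.234 | 1.456 |
| Reconstruction | 1.467 | 1.841 | 1.489 | 1.893 | 1.287 | 1.450 | 1.921 |
| Distance | 1.527 | 1.980 | 1.901 | 1.452 | 1.556 | 1.784 | 1.356 |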

Table 15.

Analysis on attacks.

In Table 15, the first column lists the naive, reconstruction-based and distance-based attacks. The next four columns give the RMSE (Root Mean Square Error) r calculated for the proposed MDP method on the Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets. The last three columns give the RMSE calculated for the existing geometric data perturbation method on the Credit Approval, Tic-Tac-Toe and Diabetes datasets. For the distance-based attack, the RMSE r of the proposed MDP method is higher than that of the existing geometric data perturbation method on the Tic-Tac-Toe and Diabetes datasets (1.901 vs. 1.784 and 1.452 vs. 1.356) and slightly lower on Credit Approval (1.527 vs. 1.556). The metric indicates that the proposed MDP preserves confidential data well and leaves high uncertainty in reconstructing the original data.

8. Conclusion

This chapter proposes a Multiplicative Data Perturbation algorithm that combines a Geometric Data Perturbation method with the Discrete Cosine Transformation. The proposed MDP was implemented successfully on the multivariate datasets mentioned above.

The experiments on these datasets produced accurate classification and the correct number of clusters. Based on the result analysis, we conclude that the proposed MDP algorithm preserves confidential data efficiently during perturbation and ensures privacy while remaining resilient against possible attacks. Previously proposed methods considered univariate datasets (for example, Terrorist). Here, a multivariate dataset is considered, and multiplicative data perturbation (MDP) was explored to perturb the data effectively in a centralized environment. The method perturbs the data effectively and is resilient to attacks or threats while preserving privacy.

Future research can explore privacy issues in Big Data in the following directions:

Improving data analytics techniques: gather all the data, filter it under given constraints and use the result to make confident decisions.

Algorithms for data visualization: powerful algorithms are crucial for accurately visualizing the required information from a pool of random data.

Future research can also explore additional perturbation methods, which may yield different results.

References

  1. Li L, Zhang Q. A privacy preserving clustering technique using hybrid data transformation method. In: 2009 IEEE International Conference on Grey Systems and Intelligent Services (GSIS 2009). Vol. 2010. Nanjing: IEEE; 2009. pp. 1502-1506. DOI: 10.1109/GSIS.2009.5408151
  2. Natarajan AM, Rajalaxmi RR, Uma N, Kirubhkar G. A hybrid transformation approach for privacy preserving clustering of categorical data. In: Innovations and Advanced Techniques in Computer and Information Sciences and Engineering. Dordrecht: Springer; 2007. pp. 403-408. DOI: 10.1007/978-1-4020-6268-1_72
  3. Selva Rathnam S, Karthikeyan T. A survey on recent algorithms for privacy preserving data mining. International Journal of Computer Science and Information Technologies. 2015;6(2):1835-1840
  4. Patel A, Patel K. A hybrid approach in privacy preserving data mining. In: 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA). Vol. 2. Ahmedabad, Gujarat, India: IEEE; 2016. p. 3
  5. Naga Lakshmi M, Sandhya Rani K. A privacy preserving clustering method based on fuzzy approach and random rotation perturbation. Publications of Problems & Application in Engineering Research-Paper. 2013;4(1):174-177
  6. Mary AG. Fuzzy-based random perturbation for real world medical datasets. International Journal of Telemedicine and Clinical Practices. 2015;1(2):111-124. DOI: 10.1504/IJTMCP.2015.069749
  7. Naga Lakshmi M, Sandhya Rani K. Privacy preserving hybrid data transformation based on SVD. International Journal of Advanced Research in Computer and Communication Engineering. 2013;2(8). ISSN: 2278-1021
  8. Jalla HR, Girija PN. An efficient algorithm for privacy preserving data mining using hybrid transformation. International Journal of Data Mining & Knowledge Management Process. 2014;4(4):45-53. DOI: 10.5121/ijdkp.2014.4404
  9. Manikandan G, Sairam N, Saranya C, Jayashree S. A hybrid privacy preserving approach in data mining. Middle-East Journal of Scientific Research. 2013;15(4):581-585. DOI: 10.5829/idosi.mejsr.2013.15.4.1.991
  10. Saranya C, Manikandan G. Study on normalization techniques for privacy preserving data mining. International Journal of Engineering and Technology (IJET). 2013;5(3):2701-2704
  11. Geetha Mary AN, Iyenger NSC. Non-additive random data perturbation for real world data. Procedia Technology. 2012;4:350-354. DOI: 10.1016/j.protcy.2012.05.053
  12. Aggarwal CC, Yu PS. A condensation approach to privacy preserving data mining. In: Proceedings of International Conference on Extending Database Technology (EDBT). Vol. 2992. Heraklion, Crete, Greece: Springer; 2004. pp. 183-199
  13. Liu K, Kargupta H, Ryan J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE). 2006;18(1):92-106
  14. Chen K, Liu L. A Random Rotation Perturbation Based Approach to Privacy Preserving Data Classification. CC-Technical Report GIT-CC-05-12. USA: Georgia Institute of Technology; 2005
  15. Liu K, Giannella C, Kargupta H. An Attacker’s view of distance preserving maps for privacy preserving data mining. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06). Berlin, Heidelberg: Springer-Verlag; 2006
  16. Xu H, Guo S, Chen K. Building confidential and efficient query services in the cloud with RASP data perturbation. IEEE Transactions on Knowledge and Data Engineering. 2014;26(2):322-335
  17. Oliveira SR, Zaiane OR. Privacy preserving clustering by data transformation. Journal of Information and Data Management (JIDM). 2010;1(1):37-51
  18. Guo S, Wu X. Deriving private information from arbitrarily projected data. In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD07). Warsaw, Poland; 2007
  19. Balasubramaniam S, Kavitha V. A survey on data retrieval techniques in cloud computing. Journal of Convergence Information Technology. 2013;8(16):15-24
  20. Liu J, Yifeng XU. Privacy preserving clustering by random response method of geometric transformation. Harbin, Heilongjiang, China: IEEE; 2010. pp. 181-188. DOI: 10.1109/ICICSE.2009.31
  21. Balasubramaniam S, Kavitha V. Geometric data perturbation-based personal health record transactions in cloud computing. The Scientific World Journal. 2015;2015:927867. DOI: 10.1155/2015/927867
  22. Chen K, Liu L. Geometric Data Perturbation for Privacy Preserving Outsourced Data Mining. London: Springer-Verlag Limited; 2010
  23. Hyvarinen A, Karhunen J, Oja E. Independent Component Analysis. New York/Chichester/Weinheim/Brisbane/Singapore/Toronto: Wiley-Interscience; 2001
  24. Brankovic L, Estivill-Castro V. Privacy issues in knowledge discovery and data mining. In: Proceedings of Australian Institute of Computer Ethics Conference (AICEC99). Melbourne, Victoria, Australia: Lecture Notes in Computer Science. 1999;4213:297-308. DOI: 10.1007/11871637_30
  25. Liu K, Giannella C, Kargupta H. An Attacker’s view of distance preserving maps for privacy preserving data mining. In: European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Berlin, Germany; 2006
  26. Li L, Zhang Q. A privacy preserving clustering technique using hybrid data transformation method. In: Grey Systems and Intelligent Services, 2009 GSIS 2009, IEEE International Conference. Nanjing, China: IEEE; 2010. DOI: 10.1109/GSIS.2009.5408151
  27. Rajesh N, Sujatha K, Kumar AALS. Survey on privacy preserving data mining techniques using recent algorithms. International Journal of Computer Applications Foundation of Computer Science (FCS). 2016;133(7):30-33
  28. Patel L, Gupta R. A survey of perturbation technique for privacy-preserving of data. International Journal of Emerging Technology and Advanced Engineering Website. 2013;3(6):162-166
