Patterns and Privacy Preservation with Prior Knowledge for Classification by Shaofeng Bu B.Eng., Xi’an Jiaotong University China, 1998 M.Eng., Xi’an Jiaotong University China, 2001 M.Sc., The University of British Columbia, 2004 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Computer Science) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) September 2010 c Shaofeng Bu 2010 Abstract Privacy preservation is a key issue in outsourcing of data mining. When we seek approaches to protect the sensitive information contained in the original data, it is also important to preserve the mining outcome. We study the problem of privacy preservation in outsourcing of classifications, including decision tree classification, support vector machine (SVM), and linear classifications. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach using two central ideas of breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for pieces satisfy the global (anti-)monotonicity. We empirically show that the proposed piecewise transformation approach can deliver a secured level of privacy and reduce disclosure risk substantially. We then propose two transformation approaches, (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation, for outsourcing SVM. We show that POT always guarantees no-outcomechange for both linear and non-linear SVM. The TNP approach gives the same guarantee when the data set is linearly separable. For linearly nonseparable data sets, we show that no-outcome-change is not always possible and propose a variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results show that the two approaches are effective to counter powerful attack models. In the last part, we extend the POT approach to linear classification models and propose to combine POT and random perturbation. We conduct a detailed set of experiments and show that the proposed combination approach could reduce the change on the mining outcome while still providing high level of protection on privacy by adding less noise. We further investigate the POT approach and propose a heuristic to break down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach could significantly improve the protection level on privacy in the worst cases. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables vi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Statement of Co-Authorship . . . . . . . . . . . . . . . . . . . . . xi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 The Data-Mining-As-a-Service Model . . . . . . . . . 1.1.1 Three Pillars to Privacy Preservation . . . . . 1.1.2 Data Custodian Scenario . . . . . . . . . . . . 1.2 Attack Models . . . . . . . . . . . . . . . . . . . . . . 1.2.1 General Attack Models . . . . . . . . . . . . . 1.2.2 Attack Models with Prior Knowledge . . . . . 1.3 Research Problems and Contributions . . . . . . . . . 1.3.1 Decision Tree . . . . . . . . . . . . . . . . . . 1.3.2 Support Vector Machine . . . . . . . . . . . . 1.3.3 Combining POT and Random Perturbation . 1.4 Literature Review . . . . . . . . . . . . . . . . . . . . 1.4.1 Privacy Preserving Data Mining . . . . . . . . 1.4.2 Secure Data Mining On Distributed Data Sets 1.4.3 K-anonymity . . . . . . . . . . . . . . . . . . . 1.5 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Bibliography . . . . . . . . . . . . . . . . . . . . . . . 1 1 2 3 4 5 5 6 6 9 11 13 13 16 17 17 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents 2 Preservation Of Patterns and Input-Output Privacy . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Related Studies . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Monotone and Anti-monotone Functions . . . . . . . 2.3.2 Domain, Subspace Association and Pattern Disclosure 2.3.3 Attack Models Possibly with Prior Knowledge . . . . 2.4 The No-Outcome-Change Guarantee . . . . . . . . . . . . . . 2.5 A Piecewise Framework . . . . . . . . . . . . . . . . . . . . . 2.5.1 ChooseBP: Choosing Breakpoint Locations Randomly 2.5.2 ChooseMaxMP: Exploiting Monochromatic Pieces . . 2.5.3 Selecting Transformations Randomly . . . . . . . . . 2.5.4 Defense Against Sorting Attacks . . . . . . . . . . . . 2.6 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . 2.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . 2.6.2 Domain Disclosure Risk . . . . . . . . . . . . . . . . . 2.6.3 Subspace Association Risk . . . . . . . . . . . . . . . 2.6.4 Output Privacy: Pattern Disclosure Risk . . . . . . . 2.7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Patterns and Privacy Preservation with Prior Knowledge for SVM Classification . . . . . . . . . . . . . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Knowledge Points and Attack Models . . . . . . . . . . . . . 3.3.1 Knowledge Points . . . . . . . . . . . . . . . . . . . . 3.3.2 Attack Models with Prior Knowledge . . . . . . . . . 3.4 POT: Principled Orthogonal Transformation . . . . . . . . . 3.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Limitations of Random Transformations . . . . . . . 3.4.3 Reducing Correlations by POT . . . . . . . . . . . . . 3.4.4 Technical Details and Equations . . . . . . . . . . . . 3.4.5 Properties of the Algorithm . . . . . . . . . . . . . . 3.5 Providing Extra Protection for Linear SVM: True Negative Point (TNP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.1 Linearly Separable Data Sets . . . . . . . . . . . . . . 3.5.2 Linearly Non-Separable Data Sets . . . . . . . . . . . 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.1 ICA Attack . . . . . . . . . . . . . . . . . . . . . . . 25 25 28 28 28 29 30 31 33 33 35 38 40 41 41 42 45 47 47 48 49 49 52 53 53 54 57 57 58 59 61 63 64 66 70 71 72 iv Table of Contents 3.6.2 3.6.3 3.6.4 3.6.5 3.7 3.8 Knowledge Points . . . . . . . . . . . . . . . . . . . . Effectiveness of POT Against Curve Fitting Attacks . Effectiveness of TNP Against Global Attacks . . . . . Minimizing the Outcome Change Against Global Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.6 Intended Usage of POT and TNP . . . . . . . . . . . Conclusion and Future Work . . . . . . . . . . . . . . . . . . Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 74 77 79 81 82 84 4 Extension to POT . . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . 4.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Attack Models with Prior Knowledge . . . . . . 4.3.2 POT : Principled Orthogonal Transformation . 4.4 Optimizing POT : POT for Linear Sub-groups . . . . . 4.4.1 A Motivating Example . . . . . . . . . . . . . . 4.4.2 A Heuristic : POT for Linear Sub-Groups . . . 4.4.3 Properties of Algorithm POT-LS . . . . . . . . 4.5 POT for Other Classification Methods . . . . . . . . . . 4.6 Combination of POT and Random Perturbation . . . . 4.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 4.7.1 Optimizing POT : POT-LS . . . . . . . . . . . . 4.7.2 Combination Of POT and Random Perturbation 4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 87 88 89 89 90 91 92 92 100 101 105 106 107 112 120 123 5 Conclusions and Future Works . . 5.1 Conclusions . . . . . . . . . . . . . 5.2 Future Works . . . . . . . . . . . 5.3 Bibliography . . . . . . . . . . . . . . . . . . . . 125 125 127 128 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables 1.1 Three Pillars of Privacy Preservation . . . . . . . . . . . . . . 4 2.1 2.2 2.3 Statistics of Attributes . . . . . . . . . . . . . . . . . . . . . . Sorting Attack: Worst Case . . . . . . . . . . . . . . . . . . . Output Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . 41 45 47 3.1 3.2 3.3 3.4 3.5 3.6 Three Pillars of Privacy Preservation Data Sets . . . . . . . . . . . . . . . Correlations . . . . . . . . . . . . . . Cost of POT-GenMatrix . . . . . . . Global Attacks on All Data Sets . . Measurement of Classification . . . . 50 71 75 77 78 80 4.1 4.2 4.3 The NOC Guarantee for Linear Classification . . . . . . . . . 105 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Evaluation of Lga-Posterior . . . . . . . . . . . . . . . . . . . 112 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures 1.1 1.2 1.3 The Data Custodian Model . . . . . . . . . . . . . . . . . . . An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . Combination of POT and Random Perturbation for SVM . . A Training Data Set before and after transformation and their corresponding Decision Trees . . . . . . . . . . . . . . . . . . 2.2 A Piecewise Framework . . . . . . . . . . . . . . . . . . . . . 2.3 Failing the Global-Monotone Invariant . . . . . . . . . . . . . 2.4 Satisfying the Global-Monotone Invariant . . . . . . . . . . . 2.5 A Skeleton for ChooseBP . . . . . . . . . . . . . . . . . . . . 2.6 A Skeleton for ChooseMaxMP . . . . . . . . . . . . . . . . . . 2.7 Maximizing Monochromatic Pieces . . . . . . . . . . . . . . . 2.8 A Skeleton for Randomly Choosing Transformations Piecewise 2.9 Impact of Breakpoints and Monochromatic Pieces on Domain Disclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.10 The Venn Diagram of Cracks: the Combination Attack . . . . 2.11 Subspace Association Disclosure Risk . . . . . . . . . . . . . . 3 10 12 2.1 . . . . . . . . . . . . 27 33 34 35 35 37 37 39 43 44 46 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 Generating Orthogonal Matrix Q . . . . . . . . . . . . . . Finding True Negative Points . . . . . . . . . . . . . . . . A Skeleton for the TNP Perturbation Approach . . . . . . Linearly Non-Separable Data Sets . . . . . . . . . . . . . ICA Attack . . . . . . . . . . . . . . . . . . . . . . . . . . Sources of Knowledge Points (Kps) . . . . . . . . . . . . . Privacy Analysis of POT vs. Rotation . . . . . . . . . . . Impact of the Number of Knowledge Points . . . . . . . . WDBC : Radius δ . . . . . . . . . . . . . . . . . . . . . . Global Attack on WDBC . . . . . . . . . . . . . . . . . . Impact of Poor Knowledge Points (WDBC) . . . . . . . . Classification Accuracy : TNP v.s. Random Perturbation . . . . . . . . . . . . 60 67 68 70 72 73 74 75 76 77 79 81 4.1 Examples of Transformations . . . . . . . . . . . . . . . . . . 92 vii List of Figures 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 Algorithm POT-LS : POT for Linear Sub-Groups . . . . . . Examples of Transformations : Algorithm POT-LS . . . . . Lga-Posterior : Identifying More Linear Sub-Groups . . . . Finding Linear Sub-Groups : Lga-Posterior . . . . . . . . . The Combination of POT and Random Perturbation . . . . POT-LS (in the Worst Cases) : Data Set WDBC . . . . . . POT-LS (in the Worst Cases) : Data Set Housing . . . . . POT-LS (in the Worst Cases) : Data Set Credit . . . . . . POT-LS (in the Worst Cases) : IBM Synthetic Data Set . . The Combination Approach for SVM : WDBC . . . . . . . The Combination Approach for SVM : Housing . . . . . . . The Combination Approach for SVM : IBM Synthetic Data The Combination Approach for SVM : Credit . . . . . . . . The Combination Approach for Linear Models : WDBC . . The Combination Approach for Linear Models : Housing . . The Combination Approach for Lasso : WDBC . . . . . . . The Combination Approach for Elastic Net : WDBC . . . . The Combination Approach for Lasso : Housing . . . . . . The Combination Approach for Linear Models : Housing . . . . . . . . . . . . . . . . . . . . . 96 98 99 99 106 108 110 110 111 113 114 115 115 116 117 118 119 120 121 viii Acknowledgements I would like to express my deep thanks to my supervisors Dr. Raymond T. Ng and Dr. Laks V.S. Lakshmanan for their valuable advices and constant encouragement through the long journey of the doctoral program. I sincerely appreciate their continuous guidance, inspiration, and friendship for years. I would like to thank Dr. Chen Greif, who is in my supervisory committee. Without his excellent advice, this research work would not be done in time. I would also like to thank the university examiners, Dr. Alan Wagner and Dr. Ruben H. Zamar, and the external examiner, Dr. Flip Korn, to review my thesis. I appreciate the helpful advice from Dr. Rachel Pottinger. I would also like to thank all the members of the Data Management and Mining Lab, including Ed Knorr, Ganesh Ramesh, Hongrae Lee, Michael Lawrence, Jian Xu, Min Xie, and all others who made my research in DMM lab an enjoyable experience. I would also like to thank all my friends who made my studies at UBC a wonderful experience. I owe my special thanks to Xiaodong Zhou, Fahong Li, Ying Su, Juan Li, Hui Wang, and Kan Cai, for their generous help and support during my hard time. Finally, I would like to thank my parents, parents-in-law, and other family members for their endless support. I owe my deepest gratitude to my wife, Ming Huo, for her selfless love. They encouraged me to overcome the difficulties and helped me to walk through the darkest time in my life. ix Dedication To my families. x Statement of Co-Authorship This thesis contains materials that are work and results of joint research under the supervision of Dr. Raymond T. Ng, Dr. Laks V.S. Lakshmanan (chapter two, chapter three, and chapter four), under the instruction of Dr. Chen Greif (chapter three), and in collaboration with Dr. Ganesh Ramesh (chapter two). As the first author of the three manuscript chapters (chapter two, chapter three and chapter four), I took the main responsibility to design the approaches, finished the proofs, implemented the experiments, and conducted the data analysis under the instruction of my supervisors (Dr. Raymond T. Ng and Dr. Laks V.S. Lakshmanan) and Dr. Chen Greif. I also prepared the manuscripts, which are always reviewed by the three supervisory committee members. xi Chapter 1 Introduction Data mining aims to extract predictive information and patterns from the huge amount of data stored in databases or data warehouses to predict future trends and support decision making [22]. In recent years, more and more companies, institutes, organizations would like to conduct data mining tasks on their data to gain useful knowledge. Instead of conducting data mining tasks by data owners themselves, data-mining-as-a-service model has appeared and is becoming more and more popular [29]. There are two main reasons to motivate data owners to outsource data mining tasks to data mining service providers in the data-mining-as-a-service model. First, data owners might not have the required expertise, including parameter tuning and model selection, to conduct data mining tasks [22]. Outsourcing the data mining tasks can save data owners the need to acquire the required expertise or to hire extra data mining experts for mining tasks [45]. Second, data mining tasks might incur heavy computation cost. Instead of purchasing expensive computational resources, data owners might want to outsource data mining tasks to outside mining service providers [55, 64]. Especially with the emergence of cloud computing [59], data mining tasks could be submitted to high performance cloud computing environments [21]. It will become more and more popular to outsource data mining tasks since datamining-as-a-service could be also provided by cloud computing services [61], e.g., Oracle Data Mining on cloud [68] and IBM Cloud Computing [69]. 1.1 The Data-Mining-As-a-Service Model There are at least two scenarios in the data-mining-as-a-service model. They are the data collector scenario and the data custodian scenario. In the data collector scenario, a data collector collects data from individual data owners and conducts data mining tasks. For example, all individual users submit their data to a collector for data analysis in an online survey project. The collector might not be trusted by individual data owners. The privacy of the input data should be protected before it is submitted to the data collector. 1 1.1. The Data-Mining-As-a-Service Model In the data custodian scenario, the data custodian is trusted. A data custodian either owns the data or is responsible to protect the data. For example, a hospital or medical research institute keeps the patient data and has the responsibility to protect the patients’ private information. 1.1.1 Three Pillars to Privacy Preservation Privacy preserving data mining is the problem of protecting sensitive information inherent in data while conducting data mining tasks [13, 54]. In order to protect the private information, a perturbation or transformation mechanism is used to disguise the sensitive information. As discussed in [9], there are three pillars to the privacy preservation and they are: • input privacy [46], which aims to protect the privacy of the input data to data mining tasks. • output privacy [46], which means to protect the data mining outcome. There are cases that the mined patterns should be protected. For example, a certain disease prediction pattern mined from a medical data set might need to be protected since malicious users could use the pattern to make a prediction on other individuals. • minimization of the outcome change. In order to protect the private information, the original data is always perturbed or transformed [6]. The data mining tasks will not be applied on the original data anymore. Instead, the data mining tasks are applied on the transformed or perturbed data. The discrepancy between the data mining patterns from the original data and the patterns from the perturbed data is called outcome change. We say a transformation or perturbation approach provides no outcome change (NOC) guarantee if the patterns mined from the original data and from the perturbed data are identical. There are circumstances under which it may be impossible to guarantee NOC. In this case, a desirable property is minimization of outcome change. In the data collector scenario, a data collector gathers individual data from different owners and conducts the data mining tasks. The assumption is that the data collector might not be trusted. The individual data, i.e., the input data to the data mining tasks, should be protected. The dominant approach is random perturbation [4, 16], which transforms the original values by adding random noise in a principled way. There is a trade off between 2 1.1. The Data-Mining-As-a-Service Model privacy protection and outcome change [4]. In order to provide more protection we need more perturbation on the original data. Meanwhile, more protection means more change to the data mining outcome. The change of data mining outcome is inevitable. In the data custodian scenario, the data custodian owns the data or is authorized to keep the data. The data custodian is trusted. However, when the data custodian wants to outsource data mining tasks to outside data mining service providers, she needs to transform the original data before she sends the data to the data mining service providers. 1.1.2 Data Custodian Scenario Figure 1.1 shows the framework of the data custodian model. When a data custodian wants to outsource data mining tasks, she transforms the original data set D to the transformed data set D ′ by a transformation f . The transformed data set D ′ is submitted to the data mining service provider, who will conduct the data mining tasks on the transformed data set D ′ . After the service provider gets the data mining pattern P ′ , she will send ′ P back to the data custodian. So far the data mining pattern P ′ is from the transformed data set D ′ , which means P ′ is in the encrypted form. The data custodian needs to decode the encrypted pattern P ′ to get pattern P , which is the final mining outcome the data custodian could obtain. Let us suppose the pattern P0 be the pattern we could discover directly from the original data set D. The question is whether the decoded pattern P equals to the original pattern P0 . If they are different, then the change between P and P0 should be minimized. This is the third pillar of privacy preserving data mining, i.e., minimizing the outcome change. Data Mining Data Custodian Pattern P0 ?≡ Data Set D Pattern P Decoding − → Transformation f Data Mining ′ Service Provider Transformed Data Set D Pattern P ′ Sending Back Data Mining Pattern P ′ Figure 1.1: The Data Custodian Model 3 1.2. Attack Models The data mining service provider only has the transformed data set D ′ and cannot access the original data set D. Therefore, the original data D is protected by the transformation f . This is how we achieve the input-privacy. Meanwhile, the data mining service provider only has the pattern P ′ , which is in encrypted form and is different from the true pattern P0 . Thus the pattern is also protected. This is output-privacy. We could also use random perturbation approach to perturb the original data in the data custodian scenario. Since the outcome change is inevitable for the random perturbation approach, the question is whether we could design any other transformation that could provide the NOC guarantee while not sacrificing the input and output privacy. Several recent studies show that it is possible to provide the no-outcome-change guarantee while also providing input and output privacy in the data custodian scenario. For example, Wong et al. studied the problem of secure outsourcing of association rule mining in [55] and proposed an algorithm to give the NOC guarantee while also providing input and output privacy1 . Table 1.1 highlights the differences between the data collector scenario and the data custodian scenario regarding the three pillars of privacy preservation. Scenarios Data Collector Data Custodian KAnonymity Studies [2, 4, 5, 16] [23, 30], etc. [10, 11, 35, 55] [9],[42], etc. [25, 39, 49] [19, 33, 58], etc. Input Privacy √ Output Privacy NOC √ √ √ √ Table 1.1: Three Pillars of Privacy Preservation 1.2 Attack Models Privacy protection is always discussed together with attack models, which define how much information a hacker could have to crack the encrypted data [18]. 1 Molly et al. show that the proposed approach in [55] could not counter attacks with known frequency knowledge in a recent paper [42]. 4 1.2. Attack Models 1.2.1 General Attack Models Generally, there are four types of attack models [18], and they are: • Ciphertext-only attack : which means a hacker could only access the ciphertext; • Known-plaintext attack : which means a hacker could have samples of the plaintext and the corresponding encrypted version of those samples, i.e., the ciphertext. She could use those samples to further learn other secret information; • Chosen-plaintext attack : which means a hacker could freely select plaintexts to encrypt and get the encrypted version of the plaintexts; • Chosen-ciphertext attack : which means a hacker could choose a ciphertext and get the corresponding decryption version (i.e., the plaintext) of this ciphertext. In the data custodian model, as shown in Figure 1.1, a data custodian only sends the transformed data set D ′ to data mining service providers. Both the mining service providers or other outside hackers could only access the transformed data set D ′ . Thus the third attack model and the forth attack model, i.e., chosen-plaintext attack and chosen-ciphertext attack, are not suitable for the data custodian model. Moreover, the ciphertext-only attack is the simplest attack model. It is difficult to attack a transformation with only transformed data set [6, 48]. In this thesis, we mainly consider the second attack model, i.e., the knownplaintext attack, since it is practical for the data custodian model. 1.2.2 Attack Models with Prior Knowledge In the known-plaintext attack model, a hacker knows samples of the plaintext and the corresponding ciphertext. In the data custodian scenario, it means that a hacker knows a group of pairs of the original data point and the corresponding transformed data point. Those pairs of data points are regarded as prior knowledge owned by the hacker. A hacker could gain this kind of prior knowledge from various sources. For example, a hacker could know statistic values (e.g., the minimum, the maximum, or the mean, etc) from published statistics or she could get some samples from similar data sets (e.g., a rival company targeting the same market). In order to capture all these forms of prior knowledge, in this thesis 5 1.3. Research Problems and Contributions we define an abstract notion knowledge point that is a pair of an original data point and the corresponding transformed data point. This attack model is also called input-output point based attack in [6, 11]. However, in practice, it might be difficult for a hacker to know the exact values in the original data set D. More practically, a hacker’s knowledge about the original data might fall in a small range around the true values. For example, given the minimum of the age attribute is 19, a hacker might think the minimum is 18 from published information, which is close enough to the true minimum value to be regarded as a threat. This definition does not weaken the abilities of a hacker to crack the transformed data. Indeed, it makes the attack more practical and more meaningful. In this thesis, we evaluate our proposed transformations in the context of attack models with prior knowledge (i.e., knowledge points). A common attack method is to use the knowledge points to fit a curve to crack the transformed data, which is called curve fitting attack. Such methods include linear fitting, polyline fitting, spline fitting, etc. Another possible attack method with knowledge points is to crack the transformation matrix [10, 35] if the original data is transformed by a matrix. 1.3 Research Problems and Contributions Classification is a data mining task that builds classifiers from a training data set for future prediction [22]. In this thesis we focus our study on privacy protection under attack models with prior knowledge, for outsourcing of classification, in the data custodian scenario. Our goal is to design transformation methods to transform the original data to provide the input and output privacy while minimizing the outcome change. 1.3.1 Decision Tree We start our study on investigating in privacy preserving decision tree classification. We propose monotone and anti-monotone functions to transform the original data set D. Given a monotone or anti-monotone function f and an attribute A in D, the transformed data is A′ = f (A). Because (anti-)monotone functions are invertible, decoding the mining outcome (i.e., the pattern P ′ in Figure 1.1) is straightforward. The decoded pattern is P = f −1 (P ′ ). Moreover, (anti-)monotone functions preserve the decision tree in an exact sense (shown in Chapter 2), which means the decoded pattern P is identical to the pattern P0 discovered from the original data set D directly. 6 1.3. Research Problems and Contributions For example, let us suppose D contains an attribute ‘age’, whose domain is from 18 to 80. We transform the attribute ‘age’ by a monotone function f (x) = 0.5 × x + 30. Suppose there is a pattern P0 from D as following: P0 = {(age ≤ 30) ∨ (age > 50), Class = H; (30 < age ≤ 50), Class = L}. After transformation, the mined pattern will be P ′ = {(age ≤ 45) ∨ (age > 55), Class = H; (45 < age ≤ 55), Class = L}. The values have been transformed by the function f . Once the data custodian gets this pattern, she decodes it by the invert function f −1 (x) = 2 × (x − 30) and obtains the decoded pattern P = {(age ≤ 30) ∨ (age > 50), Class = H; (30 < age ≤ 50), Class = L}, which is identical to the true original pattern P0 . Meanwhile, the data service provider only have the transformed data set ′ D and the encrypted pattern P ′ . The input and output privacy are also achieved. There are several key advantages that these (anti-)monotone functions enjoy over random perturbations. • First, both the original data and the decision tree (mining outcome) are expressed in the transformed domain. In other words, both the data and the mining outcome are protected, which means the two pillars (i.e., input and output privacy) are provided. • Second, with monotone and anti-monotone functions, the decision tree is exactly preserved. Also, decoding of the mining outcome using the reverse transformations by the data owner is easy. Undoubtedly, the data owner would like to minimize the effort required for the decoding process. • Third, with the proposed transformations, every data value is transformed. In contrast, with random perturbations, there is a chance that a data value is not changed and the true value is revealed. For example, in [9, 17], many situations examined leave a significant percentage (e.g., 25%) of items unchanged. 7 1.3. Research Problems and Contributions A key advantage of (anti-)monotone functions is that they provide nooutcome-change while also providing input and output privacy. However, if we only use one (anti-)monotone function to transform an attribute of a data set, the input or output privacy might not be effectively protected, especially when a hacker has prior knowledge. For example, for the above transformation function f = 0.5 × x + 30, if a hacker happens to know that the transformed value 40 corresponds to 22 and 50 corresponds to 42, then she could make a guess on the transformation function and guesses g = 0.5×x+29, which is too close to the true transformation f = 0.5×x+30. In order to enhance the protection on privacy, we propose a transformation schema that generalizes from (anti-)monotone functions to piecewise (anti-)monotone functions. The new transformation schema provides two extra levels of protection against attacks with prior knowledge. First, we use breakpoints to break the domain of an attribute into several pieces. Each piece of the domain has its own (anti-)monotone function. Second, we identify the monochromatic pieces in a domain. A monochromatic piece is a piece whose elements are associated with the same class label. The monochromatic piece gives us more freedom to choose transformation functions. We show that we could even use any arbitrary function to transform a monochromatic piece while still guaranteeing no-outcome-change in chapter two. The new transformation framework with piecewise (anti-)monotone functions and monochromatic pieces injects three extra levels of uncertainty to enhance the protection on privacy: • The number of breakpoints : The data custodian can arbitrary select the number of breakpoints to break an attribute into pieces. • The location of breakpoints : How to break an attribute into pieces is not determined. That is the location of breakpoints could also be randomly selected. • The function used in each piece : After breaking an attribute into pieces, the data custodian could select different (anti-)monotone function for each piece. Finally, even if a hacker can crack one piece, he may not be able to crack the remaining pieces. However, while breakpoints enhance input and output privacy, the challenge is to add breakpoints in a way that the no-outcomechange guarantee is provided, which is presented in chapter two. 8 1.3. Research Problems and Contributions 1.3.2 Support Vector Machine In chapter three, we study the problem of outsourcing support vector machine(SVM) classification. SVM aims to find the optimal classifier that minimizes the classification errors and separates the rest of the elements with maximal margin on the train data set in the feature space [8]. For decision tree classification, we start investigating (anti-)monotone functions because those functions provide the NOC guarantee. Same to the SVM classification, we begin with transformations that could preserve support vectors, which determines the optimal classifier. Figure 1.2 is a simple example of the transformation that can preserve the support vectors. The table shown in Figure 1.2(a) is the original dataset D. There are two attributes #A and #B. pi is tuple ID. Each tuple is associated with a class label. We represent D as a 2 × 6 matrix, i.e., each tuple is a column vector that can be regarded as a point in two-dimensional space. Figure 1.2(c) shows the data points. h is the optimal classifier and points p3 , p4 and p5 are support vectors. We transform the original data D with an orthogonal matrix Q = 0.8660 0.5000 , which means D ′ = Q × D. Q is also a rotation ma−0.5000 0.8660 trix and the geometric meaning is to rotate the whole dataset π/6 clockwise. The transformed dataset is shown in Figure 1.2(b). Figure 1.2(d) shows the transformed data points and the mining result found by the mining service provider. h′ is the optimal plane and p′3 , p′4 and p′5 are support vectors in D ′ . To decode each support vectors in D ′ , the data custodian uses the inverse transformation pi = Q−1 × p′i . By comparing Figure 1.2(c) and Figure 1.2(d), we can see that a transformed data point is a support vector in the transformed data set D ′ if and only if its corresponding original data point is a support vector in the original data set D. In other words, the support vectors are preserved by the orthogonal transformation Q. Orthogonal transformation includes rotation and reflection. Both rotation and reflection preserve the geometric shape of the data set and preserve the distance between any two points. Intuitively, orthogonal transformation will preserve the support vectors. In previous work, Chen et al. propose to use random rotation to transform the original data and show that rotation provides the NOC guarantee for outsourcing SVM [10, 11]. However, random rotation could not provide enough protection on privacy especially when a hacker has prior knowledge. The reason is that the original values and the transformed values may 9 1.3. Research Problems and Contributions id p1 p2 p3 p4 p5 p6 #A 1 5 2 3 4 4 #B 1 4 3 2 4 5 class 0 1 0 0 1 1 id p′1 p′2 p′3 p′4 p′5 p′6 (a) Original Data D 6 h p Attribute #B Attribute #B class 0 1 0 0 1 1 7 6 6 p 4 p2 5 3 p 3 2 p 4 1 0 0 #B 0.366 0.964 1.598 0.232 1.464 2.330 (b) Transformed Data D ′ 7 5 #A 1.366 6.330 3.232 3.598 5.464 5.964 1 2 3 4 5 Attribute #A (c) Original SV 6 7 h’ 4 3 p’ 2 1 p 1 5 0 0 6 p’3 p’ 1 p’ p’5 p’ 2 4 1 2 3 4 5 Attribute #A 6 7 (d) Transformed SV Figure 1.2: An Example have high correlation after rotation transformation. Intuitively, the higher the correlation between the original values and the transformed values, the smaller the fitting error for curve fitting methods (e.g., linear regression, polyline, spline fitting, etc.), which means the transformation is more vulnerable to curve fitting attacks. In order to enhance the privacy protection level, we propose an approach, i.e., principled orthogonal transformation (POT), to generate orthogonal transformations in a principled way. POT is based on the Gram-Schmidt procedure [50] and could iteratively and monotonically reduce the linear correlation between the original values and the transformed values. We will provide empirical results to show that POT can effectively counter curve fitting attacks. Besides curve fitting attacks, a hacker could conduct a more powerful 10 1.3. Research Problems and Contributions attack on the transformation matrix if she has enough knowledge points. Given a d-dimensional data set, the orthogonal transformation is a d × d matrix. If a hacker has d knowledge points, then she could make a guess on the transformation matrix by matrix computation. We call this type of attack global attack. In order to further enhance the protection on privacy to counter global attacks, we propose an algorithm that identifies the data points that are guaranteed not to be support vectors. We call those data points True Negative Points (TNP). In support vector machine classification, support vectors determine the optimal classifier. Perturbation on the true negative points will not affect the support vectors and thus will not change the optimal classifier. Therefore, we can could enhance the privacy level while not sacrificing the classification accuracy by perturbing the TNP points. In chapter three, we show that the NOC guarantee is provided for a linearly separable data set. However, for a linearly non-separable data set, because it is not clearly about the geometric meaning of the support vectors and the optimal classifier [8], the points identified by the algorithm might be false negative points. Thus, the NOC guarantee may be violated but the change is minimized. 1.3.3 Combining POT and Random Perturbation Random perturbation is a well studied method for privacy preserving data mining. There is always trade off between the privacy protection level and data mining outcome change for random perturbation, i.e., the more perturbation on the data, the more protection on the privacy, the more change on the data mining outcome [4]. On the other hand, as discussed above, the POT approach has two advantages in privacy preservation of SVM classification. First, POT could enhance protection on privacy by reducing the correlation between the original values and the transformed values. Second, POT provides no-outcome-change guarantee. Therefore, the combination of the POT approach and random perturbation should be able to increase the protection level while minimizing the outcome change. The last part of this thesis proposes a transformation schema that transforms the original data sets in two steps, which is shown in Figure 1.3. First, the random perturbation approach is applied. Random noise N is added to the original data set D. The perturbed data set is D ′ = D + N . Second, we use POT to generate an orthogonal transformation Q to transform the perturbed data D ′ and get the transformed data set D” = Q × D ′ . Then the data mining service provider will conduct the data mining 11 1.3. Research Problems and Contributions Classification Data Set D Patterns P0 Random Perturbation: Adding Noise ? D′ = D + N Patterns P1 ≡ Applying POT D” = Q × D′ Patterns P ′ Patterns P2 Decoding by P2 = Q−1 × P ′ Figure 1.3: Combination of POT and Random Perturbation for SVM tasks on the transformed data set D” and discover pattern P ′ . Finally, after decoding, we could get pattern P2 , which is identical to the pattern P1 mined from the perturbed data set D ′ . Let us still use P0 to represent the pattern mined directly from the original data set D. The goal of the proposed transformation schema is to enhance the protection on privacy while minimize the difference between the pattern P2 (equals to P1 ) and the pattern P0 . In order to further enhance the protection on privacy, we execute a worst case analysis on the POT approach. We notice that even though the POT approach provide high level protection in average cases, there are cases in which a hacker could still guess out a substantial number of the original values based on her prior knowledge. We found that although POT has reduced the correlation for an attribute to be below a given threshold, there might exist a subset of values that still have high correlation with the corresponding transformed values. When a hacker’s knowledge points happen to be located in this vulnerable subset, this hacker might be able to have a good guess on the original values in this subset. In order to enhance the protection level in the worst cases, we propose a heuristic approach to break down the correlation between the original values and the transformed values among those vulnerable subsets. We will proivide empirical results to show that the proposed transformation scheme, i.e., the combination of the POT approach and the random perturbation, could effectively improve the protection level on privacy and minimize the outcome change. 12 1.4. Literature Review 1.4 1.4.1 Literature Review Privacy Preserving Data Mining In order to protect private information about individuals while answering queries in statistical databases, several approaches have proposed, including access restriction, query set restriction, micro-aggregation, data perturbation, output perturbation, auditing, random sampling, etc [1]. In the data-mining-as-a-service model, privacy also needs to be protected since the service provider, who conducts the data mining tasks, might not be trusted by the data owner [13, 54]. Random Perturbation Random perturbation is a widely studied approach to transform the original values by adding random noise in a principle way [2, 4, 15, 16, 17, 23, 30, 67]. As discussed in Section 1.1.2, the random perturbation approach is designed for the data collector scenario and mainly focus on input privacy (Table 1.1). Agrawal and Srikant propose a random perturbation approach for privacy preserving decision tree classification [4]. Random noise drawn from a given distribution is added to the original data. The data miner could build the classifier by reconstructing the distribution of the original data from the perturbed data and the distribution of the noise. Privacy level corresponds to the width (x2 − x1 ) if the value x can be estimated within the interval [x1 , x2 ]. In [2], Agrawal and Aggarwal consider a different privacy measure based on differential entropy. They propose an EM algorithm for computing the reconstructed distribution which converges to the maximum-likelihood estimate of the original distribution. Random perturbation has also been applied to association rule mining by Evfimievski et al. in [16, 17]. In [46], Rizvi and Haritsa propose a variant perturbation approach based on probabilistic distortion. Instead of adding random noise on the original values, they apply the Bernoulli principle of keeping an item in a transaction with probability p, but deleting the item with probability (1 − p). It is shown that random perturbation could be vulnerable to matrixbased attacks. In [30], Kargupta et al. show that the added random noise could be separated from the real values by using spectral analysis based matrix decomposition techniques. In [23], Huang et al. present a method based on principal component analysis that could reconstruct the original data by exploiting correlations among attributes. 13 1.4. Literature Review Aggarwal and Yu [5] propose a condensation approach, which uses a anonymized data set to replace the original data set, to protect the private information. The anonymized data set has similar statistic characteristics to the original data set. The data mining task is conducted on the anonymized data set. Same as random perturbation, the mining outcome is changed. In [66], Zhang et al. present a new scheme based on algebraic-technique to build a more accurate classifier with less disclosure. This study is also based on the data collector scenario. Like the studies based on random perturbation, the mining results are changed by the perturbation. In summary, the random perturbation approach is designed for the data collector scenario. Its goal is to protect the private information inherent in the original data, i.e., the input privacy. It is shown that it is inevitable that there is a tradeoff of privacy and classification accuracy. Generally, more noise added, more protection on the private information, but meanwhile less accuracy of the classifier. Meanwhile, the data miner and the data owner have the same forms of the data mining outcome. The output privacy is not a stated design objective. In the data custodian scenario, we aim to propose transformation approaches that could deliver the three pillars of privacy preservation, whereas the perturbation approach cannot. Geometric Data Transformations Chen et al. propose a rotation-based transformation approach for SVM [10]. In order to enhance the protection level, the proposed approach swaps the rows of a rotation matrix. The approach selects the permutation that gives the highest protection level, which is defined by a unified column privacy metric. For a d dimensional data set, the rotation matrix is a d × d matrix. Therefore, the complexity to select the best permutation of the rows is d!. Meanwhile, the selection of the best permutation could only find the best solution for the current rotation matrix, which is called local optimization in the paper. There is no guarantee on the privacy protection level. The independent component analysis (ICA) is used as attack model to evaluate the protection level. However, ICA assumes that the hacker has no prior knowledge. Therefore, ICA belongs to the ciphertext-only attack [18]. In their following work [11], except the ICA attack model, they introduce more powerful attack models, such as distance-inference attacks, in which a hacker has prior knowledge to crack the rotation matrix. Their proposed approach amounts to adding noise to the transformed data; but this sacrifices classification accuracy. Liu et al. study random orthogonal transformations from an attacker’s 14 1.4. Literature Review view for k-means and k-NN clustering [35]. However, like random rotation, random orthogonal transformation is vulnerable to attacks with prior knowledge. In [44], Oliveira et al. use affine transformations, such as scaling, rotation and translation, to transform the original values for privacy preserving clustering. However, different attributes are subjected to different transformations. The geometric properties of the data set, e.g., the distance between two points, are not preserved. Therefore, the data mining outcome, i.e., the clusters, are changed. Meanwhile, like random rotation, random affine transformation is vulnerable to attacks with prior knowledge as well. Comparing to the random rotation transformation, the POT approach proposed in this thesis could iteratively and monotonically reduce the correlation between the original values and the transformed values in a principled way. Empirical study shows that POT could effectively counter attacks with prior knowledge. Other Methods Wong et al. study the problem of secure outsourcing of association rule mining [55]. The propose an encryption approach to protect the original values. The approach is based on a one-to-n mapping and ensures correct decryption. They share with us the goal of providing the NOC guarantee. However, Molly et al. show that the encryption approach in [55] is not secure and can be broken if a hacker knows the frequency knowledge [42]. In [56], Wong et al. propose an audit approach for secure outsourcing frequent item mining. The approach protects the true item values by inserting false items. Wong et al. also study the problem of privacy protection about k-nearest neighbor (kNN) computation on encrypted databases in [57]. Their approach preserves the orders of the distance between data points instead of the exact the distance. Therefore, the proposed approach is not suitable for classification methods, such as the support vector machine classification. Mozafari and Zaniolo propose an algorithm for secure view publishing that could preserve naive Bayesian classifiers while protecting privacy[43]. All the above work have little overlap with this thesis. We mainly focus our study on privacy preservation in outsourcing of classifications, including decision tree classification and the SVM classification. Lin and Chen study how to secure publish the SVM classifier while not breaching sensitive information [34]. Their approach modifies the SVM classification algorithm so that the mining outcome is different from but 15 1.4. Literature Review close to the true support vectors. This work assumes the data owner is able to conduct the SVM computation and wants to publish the SVM classifier, which is different from the goal of this thesis. 1.4.2 Secure Data Mining On Distributed Data Sets When multi-parties have their own copies of data and would like to conduct data mining tasks on the integrated data set, secure mechanism should be designed to make sure that each site cannot gain additional information about the data via the collaboration [12]. The dominant approaches is to design security protocols based on secure multi-party computation (SMC) [20, 52, 62, 63] Kantarcioglu and Clifton propose security protocol for association rule mining on horizontally partitioned data [28]. Secure association rule mining on vertically partitioned data is studied and security protocol is proposed in [51] by Vaidya and Clifton. Security protocols for privacy preserving decision tree classification have been proposed for vertically partitioned data by Vaidya and Clifton in [53] and for horizontally partitioned data by Samet and Miri in [47]. Vaidya and Clifton propose security protocol for k-means clusters mining over vertically partitioned data [52]. Inan et al. propose an approach to compute dissimilarity matrix of objects for privacy preserving cluster over horizontally partitioned data [24]. Jagannathan and Wright propose a new security protocol for arbitrarily partitioned data, which is a generalization of vertically and horizontally partitioned data [26]. Yu et al. study the problem of secure SVM computation on distributed sites. They propose secure and scalable protocols to train SVM classifier on horizontally partitioned data in [62] and on vertically partitioned data in [63]. Laur et al. [31], Mangasarian et al. [37, 38], and Zhan et al. [65] also propose different security protocols to build support vector machine on distributed sites. Yang et al. use a SMC-based cryptographic approach to protect the customer’s privacy [60] for decision tree classification or frequent item mining. The proposed approach preserves the data mining outcome (i.e., the NOC guarantee) by preserving the frequency of items. The security protocols designed for data mining on distributed data sets are not suitable for outsourcing data mining in the data custodian model. The relationship between the data custodian and the data mining service provider is not a collaboration. Meanwhile, the data custodian is the only site which has the data. 16 1.5. Roadmap 1.4.3 K-anonymity Privacy is also needed to be protected while publishing data for public benefit or research. The notion of k-anonymity is designed to protect the sensitive information inherent in the original data for data publishing [49]. The idea is to use suppression or generalization to anonymize items so that a tuple could not be distinguished from a group of size k. There are variants of k-anonymity, e.g., l-diversity [39], t-closes [36]. Meyerson and Wiliams prove that k-anonymity is NP-hard in general and propose a couple of approximations [41]. Bayardo and Agrawal study the complexity of finding optimal k-anonymity in [7] and propose an algorithm to reach the optimal anonymization in certain circumstance. LeFevre et al. propose a set of more efficient algorithms that reach k-anonymity by full-domain generalization [32]. Full domain generalization maps the entire domain of the original attributes to a more general domain. Jiang and Clifiton study the k-anonymity problem for vertically partitioned data and propose a security protocol to ensure each site could not learn anything that violates k-anonymity about other data sites [27]. Iyengar uses generalization and suppression to anonymize original values to protect the sensitive information in the context of data mining [25]. While protecting the privacy, the proposed approach aims to preserve the usefulness of the anonymized data in the context of classification and regression. Wang et al. propose a bottom-up method for generalization [58] and Fung et al. propose a top-down specialization approach [19] to anonymize the original data set while trying to remain the classification models. Lefevre et al. [33] also propose anonymity approaches while trying to provide good data mining results. The proposed methods provide different anonymization for different data mining tasks. Similar to the perturbation approach in the data collector scenario, those notions and approaches in k-anonymity are mainly designed to protect input privacy (Table 1.1). The mining outcome changes when data mining methods are applied on the anonymized data. 1.5 Roadmap The rest of this thesis is organized as follows. Chapter two presents the transformation schema of piece-wise (anti-)monotone functions for secure outsourcing of decision tree classification. Chapter three shows our work on how to provide the three pillars of privacy preservation for the SVM classi17 1.5. Roadmap fication. The transformation framework that combines the POT approach and random perturbation is shown in chapter four. Finally, this thesis is concluded in chapter five. 18 1.6. Bibliography 1.6 Bibliography [1] Adam, N.R. and Wortman, J. C.: Security-control methods for statistical databases. ACM Computing Surveys, 21, 4, pp515–556, 1989. [2] Agrawal D. and Aggarwal C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. Proc. of PODS 2001. [3] Agrawal, R., Kiernan, J., Srikant, R., Xu, Y.: Order Preserving encryption for numeric data. Proc. 2004 SIGMOD, pp. 563–574. [4] Agrawal R. and Srikant,R.: Privacy preserving data mining. Proc. of SIGMOD 2000, pp. 439–450. [5] Aggarwal, C.C. and Yu, P.S.: A Condensation Approach to Privacy Preserving Data Mining. Proc. of EDBT 2004. [6] Aggarwal, C.C. and Yu, P.S.: Privacy-Preserving Data Mining: Models and Algorithms. Published by Springer, 2008 [7] Bayardo, R.J. and Agrawal, R.: Data Privacy through Optimal kAnonymization. Proc. of ICDE 2005, pp. 217–228. [8] Bennett, K.P. and Bredensteiner, E.J.: Duality and Geometry in SVM Classifiers. Proc. of 17th ICML, 2000,pp. 57–64. [9] Bu, S., Lakshmanan, L.V.S., Ng, R. T. and Ramesh,G.: Preservation of Patterns and Input-Output Privacy. Proc. of ICDE 2007, pp. 696– 705. [10] Chen, K. and Liu, L.: Privacy Preserving Data Classification with Rotation Perturbation. Proc. of ICDM 2005,pp. 589–592. [11] Chen, K., Sun, G. and Liu, L.: Toward Attack-Resilient Geometric Data Perturbation. SIAM Data Mining(SDM) 2007. [12] Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X. and Zhu, M.: Tools for Privacy Preserving Distributed Data Mining. ACM SIGKDD Explorations, vol. 4-2, 2003. [13] Clifton, C. and Marks, D.: Security and Privacy Implications of Data Mining. Workshop on Data Mining and Knowledge Discovery, 1996. 19 1.6. Bibliography [14] Du, W., Teng, Z. and Zhu, Z.: Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification. Proc. of. SIGMOD, 2008, pp. 459–472. [15] Evfimievski A.: Randomization in Privacy-Preserving Data Mining. ACM SIGKDD Explorations, vol. 4, 2003. [16] Evfimievski, A., Gehrke, J. and Srikant, R.: Limiting privacy breaches in privacy preserving data mining. Proc. of PODS 2003, pp. 211–222. [17] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J.: Privacy preserving mining of association rules. Proc. of SIGKDD 2002, pp. 217–228. [18] Ferguson, N. and Schneier, B.: Practical Cryptography. Wiley Publishing Inc. [19] Fung, B.C.M., Wang, K. and Yu, P.S.: Top-Down Specialization for Information and Privacy Preservation. Proc. of ICDE 2005, pp. 205– 216. [20] Goldreich, O.: Foundations of Cryptography, Vol II. Cambridge University Press, 2004. [21] Grossman, R. and Gu, Y.: Data mining using high performance data clouds: experimental studies using sector and sphere. Proc. of the 14th ACM SIGKDD. 2008, pp. 920-927. [22] Han, J. and Kamber, M.: Data Mining : Concepts and Techniques, second ed. Morgan Kaufmann Publishers, 2006. [23] Z. Huang, W. Du and B. Chen.: Deriving private information from randomized data. Proc. of SIGMOD 2005, pp. 37–48. [24] Inan A., Kaya S., Saygin Y., Savas E., Hintoglu A. and Levi A.: Privacy-Preserving Clustering on Horizontally Partitioned Data. Data Engineering Workshops, 2006. [25] Iyengar, V.S.: Transforming Data to Satisfy Privacy Constraints. Proc. of SIGKDD 2002. [26] Jagannathan G. and Wright R.: Privacy-Preserving Distributed kmeans clustering over arbitrarily partitioned data. Proc. of SIGKDD 2005. 20 1.6. Bibliography [27] Jiang W. and Clifton C.: Privacy-preserving distributed k-Anonymity. Proc. of the IFIP 11.3 Working Conference on Data and Applications Security, 2005. [28] Kantarcioglu M. and Clifton C.: Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. IEEE TKDE Journal, 16(9), 2004. [29] Kaplan, J.: Data mining as a service: the prediction is not in the box. Data Mining Review, July 1,2007. [30] Kargupta, H., Datta, S., Wang, Q. and Sivakumar, K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. Proc. of ICDM 2003, pp. 99-106. [31] Laur, S., Lipmaa, H., and Mielikainen, T.: Cryptographically private support vector machines. 12th Proc of SIGKDD 2006, pp. 618-624. [32] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R.: Incognito: efficient full-domain K-anonymity. Proc. of SIGMOD 2005, pp. 49–60. [33] Lefevre, K., DeWitt, D.J. and Ramakrishnan, R.: Workload-Aware Anonymization. Proc. of SIGKDD 2006. [34] Lin, K. and Chen, M.: Releasing the SVM Classifier with PrivacyPreservation. Proc. of ICDM 2008. [35] Liu, K., Giannella, C. and Kargupta, H.: An Attacker’s View of Distance Preserving Maps for Privacy Preserving Data Mining. Proc. of PKDD 2006. [36] Li, N., Li, T. and Venkatasubramanian, S.: t-Closeness:Privacy Beyond k-Anonymity and l-Diversity. Proc of ICDE 2007, pp. [37] Mangasarian, O.L., and Wild, E.W.: Privacy-Preserving Classification of Horizontally Partitioned Data via Random Kernels. Proc. of ICDM 2008, pp. 473-479. [38] Mangasarian, O.L., Wild, E.W., and Fung, G.M. : Privacy-Preserving Classification of Vertically Partitioned Data via Random Kernels. ACM Transactions on Knowledge Discovery from Data (TKDD) 2008, Vol. 3, Num. 2. [39] Machanavajjhala, A., Gehrke, J., Kifer, D. and Venkitasubramaniam, M.: L-Diversity: Privacy Beyond K-Anonymity. Proc. of ICDE 2006. 21 1.6. Bibliography [40] Martin, D.J., Kifer, D. Machanavajjhala, A., Gehrke, J. and Halpern, J.Y.: Worst-Case Background Knowledge in Privacy. Proc. of ICDE 2007, pp. 126–135. [41] Meyerson, A. and Williams, R.: On the complexity of optimal Kanonymity. Proc. of PODS, 2004. pp. 223–228. [42] Molloy, I., Li, N., and Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. Proc. of ICDM 2009, pp. 872–877. [43] Mozafari, B. and Zaniolo, C.: Publishing naive Bayesian classifiers: privacy without accuracy loss. Proc. VLDB Endow. 2, 1 (Aug. 2009), 1174-1185. [44] Oliveira, S. and Zaı̈ane, O.R.: Privacy Preserving Clustering By Data Transformation. Proc. 2003 Brazilian Symposium on Databases, pp.304-318. [45] Qiu, L., Li, Y., and Wu, X.: Protecting business intelligence and customer privacy while outsourcing data mining tasks. Knowl. Inf. Syst. 17, 1 (Oct. 2008), 99-120. [46] Rizvi, S. and Haritsa, J.: Maintaining data privacy in association rule mining. Proc. 2002 VLDB, pp. 682–693 [47] Samet, S. and Miri, A.: Privacy preserving ID3 using Gini Index over horizontally partitioned data. Proc. of the 2008 IEEE/ACS international Conference on Computer Systems and Applications, pp. 645651. [48] Schneier, B.: Secrets & Lies: Digital Security in a Networked World. Wiley Computer Publishing Inc. 2000. [49] Sweeney, L.: K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 5, p.557-570, 2002. [50] Trefethen, L.N. and Bau, D.: Numerical Linear Algebra. SIAM, Philadelphia, 1997 [51] Vaidya, J. and Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. Proc. of SIGKDD, 2002, pp. 639–644. 22 1.6. Bibliography [52] Vaidya, J. and Clifton, C.: Privacy-Preserving K-Means Clustering over Vertically Partitioned Data. Proc. 2003 SIGKDD, pp. 206–215. [53] Vaidya, J. and Clifton, C.: Privacy-Preserving Decision Trees over vertically partitioned data. Lecture Notes in Computer Science, Vol 3654, 2005. [54] Verykios V. S., Bertino E., Fovino I. N., Provenza L. P., Saygin Y. and Theodoridis Y.: State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, v.33 n.1, 2004. [55] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: Security in outsourcing of association rule mining. Proc. of VLDB. 2007, pp. 111-122. [56] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: An audit environment for outsourcing of frequent itemset mining. Proc. VLDB Endow. vol. 2-1 (Aug. 2009), 1162-1173. [57] Wong, W., Cheung, D., Kao, B., and Mamoulis, N.: Secure kNN computation on encrypted databases. Proc. of SIGMOD 2009 [58] Wang, K., Yu, P.S. and Chakraborty, S.: Bottom-Up Generalization: A data Mining Solution to Privacy Protection. Proc. of ICDM 2004, pp. 249–256. [59] Weiss, A.: Computing in The Clouds. ACM NetWorker, December 2007, pp. 16-25 [60] Yang, Z., Zhong, S. and Wright, R.N.: Privacy-Preserving Classification of Customer Data without Loss of Accuracy. Proc. of the 2005 SIAM International Conference on Data Mining (SDM)2005 [61] Yara, P., Ramachandran, R., Balasubramanian, G., Muthuswamy, K., and Chandrasekar, D.: Global Software Development with Cloud Platforms Proc. of 2009 Software Engineering Approaches for Offshore and Outsourced Development. [62] Yu, H., Jiang, X. and Vaidya, J.: Privacy-Preserving SVM using Nonlinear Kernels on Horizontally Partitioned Data. Proc. of ACM SAC Data Mining Track, 2006, pp. 603 –610. [63] Yu, H., Vaidya, J. and Jiang, X.: Privacy-Preserving SVM Classification on Vertically Partitioned Data. Proc. of PAKDD 2006, pp. 647–656. 23 1.6. Bibliography [64] Yu, H., Yang, J. and Han, J.: Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Proc. of SIGKDD, 2003, pp. 305–315. [65] Zhan, J. and Matwin, S.: Privacy-preserving support vector machine classification. International Journal of Intelligent Information and Database Systems (IJIIDS). Vol. 1, Issue 3/4, 2007, pp. 356–385. [66] Zhang, N., Wang, S. and Zhao, W.: A New Scheme on PrivacyPreserving Data Classification. Proc. of SIGKDD, 2005, pp. 374 – 383. [67] Zhu Y. and Liu L. Optimal Randomization for Privacy- Preserving Data Mining. Proc. of SIGKDD, 2004. [68] http://www.oracle.com/technology/products/bi/odm/ odm on the cloud detail.html [69] http://www.ibm.com/ibm/cloud/cloudburst/ 24 Chapter 2 Preservation Of Patterns and Input-Output Privacy2 2.1 Introduction One of the key motivations for privacy preserving data mining is the miningas-a-service model, in which there are at least two different scenarios. In [2], Agarwal and Srikant consider the data collector scenario where individual data owners selectively submit their data to a data collector, who may not be trusted, in exchange for the value coming from mining the data collection. Many existing studies center on this. In this chapter, we consider a different, yet equally prevalent, scenario of a data custodian which either owns the data (e.g., a company owns its data) or is explicitly given the trust and responsibility of protecting the privacy of the individuals. As an example of the latter, a medical research group (i.e., the custodian) obtains consent forms from patients to participate in a biomarker study. The data custodian consider hiring a company to help mine the data. It does not implicitly trust the company and wants to guard against possible privacy breaches. To do so, the data custodian needs to transform its data. To determine the appropriate transformation, there are two critical considerations: minimizing disclosure versus minimizing outcome change (i.e., the deviation between the outcome obtained by mining the original data and the outcome of mining the transformed data). Among the transformations studied in the literature, random perturbation is a dominant approach, i.e., transforming data values by adding random noises in a principled way [2]. The more noises are added, the more disclosure is minimized – but the more the mining outcome is changed. It is implicitly assumed that to minimize disclosure, it is inevitable that there be outcome change. This may be inevitable for the data collector model. 2 A version of this chapter has been published. Bu, S., Lakshmanan, L.V.S., Ng, R. T. and Ramesh,G.: Preservation of Patterns and Input-Output Privacy. ICDE 2007, pp. 696–705. 25 2.1. Introduction In this chapter, we show that it is possible to minimize disclosure while guaranteeing no outcome change for the data custodian model. We conduct our investigation in the context of decision tree classifiers. We propose classes of monotone and anti-monotone functions, that preserve the decision tree in an exact sense. For privacy preserving data mining, we claim there are three equally important pillars. The no-outcome-change guarantee is the first pillar. The second pillar is the protection of the input data – input privacy in [9]. The third pillar is the protection of the mining outcome – output privacy. Existing studies focus on input privacy, but ignore the other two pillars. We show that using (anti-)monotone transformations can satisfy all three pillars at the same time. Moreover, when (anti-) monotone transformations are used, decoding the mining outcome is straightforward. Finally, with the proposed transformations, every data value is transformed. In contrast, with random perturbations, there is a chance that a data value is not changed and the true value is revealed. Figure 2.1 shows a simple example of a monotone transformation. The original data D is shown in (a). The attributes age and salary are transformed with the linear monotone functions: age′ = 0.9 ∗ age + 10, and salary ′ = 0.5 ∗ salary. The transformed data D ′ is now given to the mining service provider (Figure 2.1(b)). The classifier T ′ based on D ′ is shown in (c). Note that D ′ provides input privacy, whereas T ′ provides output privacy. To decode T ′ to get the real decision tree, the data custodian uses the inverse function per attribute to obtain T , shown in (d). The inverse transformations are age = (age′ − 10)/0.9, and salary = salary ′ /0.5. Notice that this is exactly the same outcome if the decision tree algorithm were to be applied to D directly without the transformation. Also note that T ′ looks realistic enough that a hacker may not even know that it is encoded. A key analytical result of this chapter is the theorem guaranteeing that there is no outcome change when using the proposed functions. But what about the other two pillars? It should be obvious that for many situations, simply using (anti-)monotone functions may not be too effective for privacy protection. One of the key technical contributions here is the generalization from (anti-)monotone functions to piecewise (anti-)monotone functions. The two central ideas are to introduce breakpoints and to exploit monochromatic pieces. By going piecewise, three levels of uncertainty are added: (i) the uncertainty of the number of breakpoints, (ii) the uncertainty of the locations of the breakpoints; and (iii) the uncertainty of the function used in each piece. We analyze disclosure risk performance (second and third pillars) with a comprehensive framework. First, we explore different notions of input 26 2.1. Introduction Age 23 17 43 68 32 20 Salary 50K 30K 40K 50K 70K 20K Risk High High High Low Low High Age 31 25 49 71 39 28 (a) Original Data D Salary 25K 15K 20K 25K 35K 10K Risk High High High Low Low High (b) Transformed Data D′ Age < 27.5 Age < 35 Salary < 45K Salary < 22.5K High High (1’) (1) (2’) High (2) (3’) Low (c) Transformed Decision Tree T ′ High (3) Low (d) Original Decision Tree T Figure 2.1: A Training Data Set before and after transformation and their corresponding Decision Trees and output disclosure. For input, we consider domain disclosure and subspace association disclosure. For output, we consider outcome disclosure, which protects paths of a decision tree. Second, we consider different attack models for the hacker, such as sorting, linear regression, and curve fitting techniques. Third, we consider a hacker’s prior knowledge, modeled as knowledge points in the attribute domain. Based on benchmark data sets, we provide empirical results showing the effectiveness of the proposed framework. 27 2.2. Related Studies 2.2 Related Studies Extensive research has been done in statistical databases to provide statistical answers without breaching sensitive information about individuals. A commonly used technique is random perturbation [1]. The study by Agrawal and Srikant shows how a decision tree can be built on data perturbed with random noise added in a principled way [2]. However, preserving the exact decision trees is not their goal, and they show that there is a tradeoff between classification accuracy and privacy level. Recall the distinction between the data collector and data custodian scenarios. The random perturbation approach is designed to deal with the former, but can be applied to the latter as well. The piecewise monotone framework proposed here is designed for the data custodian scenario. Under this scenario, the proposed model delivers the three pillars of privacy preservation, whereas the perturbation approach cannot. Evfimievski et al. [5]apply the random perturbation approach to association rule mining. Rizvi and Haritsa propose a variant based on probabilistic distortion [9]. The mining outcome is changed. Output privacy is not a stated design objective. Recently, Kargupta et al. showed that a hacker can use spectral analysis based matrix decomposition techniques to separate the random noises from the real values [7]. Huang et al. [6] use a principal component analysis based method to exploit possible correlations among attributes to reconstruct original data, demonstrating that more accurate individual data can be revealed than originally thought! Clifton and others consider privacy preservation with vertically partitioned data [11]. The focus is on developing protocols to make sure that each site cannot gain additional information about the data via the collaboration. There are many studies on various other mining tasks with privacy preservation as the main goal, including clustering [11], order-preserving comparisons [3], and data exchange [10]. For data exchange, the notion of k-anonymity is designed for input privacy. If the transformed data were mined directly, the mining outcome could be significantly affected. 2.3 2.3.1 Preliminaries Monotone and Anti-monotone Functions The training data set is a relation instance D with m attributes A1 , . . . , Am , and a categorical class label attribute C. Throughout this chapter, we are 28 2.3. Preliminaries interested in active domains of attributes, which contain values of attributes appearing in a given data set. We denote the active domain of an attribute A by δ(A). Often we will transform it into another active domain δ′ (A). Let A be a numeric attribute with a linear ordering <. Let f : δ(A) → ′ δ (A) be a function. Then f is monotone (resp., anti-monotone) if for every x, y ∈ δ(A), x < y implies f (x) < f (y) (resp., f (x) > f (y)). For a (anti)monotone function f , the inverse f −1 is always well-defined. For the m attributes A1 , . . . , Am , let there be m (anti-)monotone functions f1 , . . . , fm . Given a tuple ht, ci (c being the class label) in D, the tuple is transformed from ht.A1 , . . . , t.Am , ci to hf1 (t.A1 ), . . . , fm (t.Am ), ci. We use f~ to denote the vector of the m transformations. Let D ′ = {hf~(t), ci|ht, ci ∈ D} be the data set consisting of these transformed tuples. We refer to a tuple ht.A, ci as an A-projected tuple, i.e., it retains the A-value and the class label. Whenever there is no confusion, we refer to an A-projected tuple as simply a tuple. 2.3.2 Domain, Subspace Association and Pattern Disclosure Input privacy [9] refers to the protection of D and is measured using various metrics [2, 5], domain disclosure being the popular one. Definition 2.1. Let f : δ(A) → δ′ (A) be a transformation. A domain crack function g : δ′ (A) → δ(A) represents the guess the hacker makes on each transformed value. For a value v ′ ∈ δ′ (A) that appears in D ′ , a guess is a crack if the guess falls within a radius ρ from the actual value, i.e., |g(v ′ ) − f −1 (v ′ )| ≤ ρ. The domain disclosure risk is the fraction of the number of cracks to the number of distinct values of A′ appearing in D ′ . However, for many applications, the data custodian might care more about the association between domain values rather than domain disclosure. For example, for an insurance application, the company cares more about protecting Bob of age 45 earning 50K, rather than the individual values of age or salary. We refer to this as subspace association disclosure. Definition 2.2. Let S ⊆ {A1 , . . . , Am } be a subset of attributes of the input training data. For simplicity, let S = {A1 , . . . , As }. A subspace crack function ~g : (δ′ (A1 ) × . . . × δ′ (As )) → (δ(A1 ) × . . . × δ(As )) represents the guess the hacker makes on each transformed S-tuple. A guess is a (S-tuple) crack if ∧si=1 |~gi (vi′ ) − f~i−1 (vi′ )| ≤ ρi (radius for Ai ). The subspace association disclosure risk is the fraction of the number of cracks to the number of S-tuples in D ′ . 29 2.3. Preliminaries In output privacy, the focus is on protecting the data mining outcome from being disclosed. For decision trees, the paths of the tree are to be protected. Most studies adopting the random perturbation paradigm only focus on input privacy. The mining outcome (e.g., association rules, decision trees) are not encoded. Since the hacker only has access to the perturbed data, one may argue that in a twisted sense, the exact identity of the pattern is protected inasmuch as perturbation changes mining outcome. But then the custodian suffers the same fate: she cannot fully recover the exact pattern! In contrast, for (anti-)monotone transformations, the true classifier is revealed to the data custodian (given the no-outcome-change guarantee to be shown in Section 2.4), whereas the hacker has to guess the true identities of the paths of the decision tree, out of many possibilities. Definition 2.3. Let a path in the decision tree T ′ be of the form: ∧hi=1 Ai θi vi′ . A path crack function ~g : (δ′ (A1 ) × . . . × δ′ (Ah )) → (δ(A1 ) × . . . × δ(Ah )) represents the guess the hacker makes on a transformed path. A guess is a (path) crack if ∧hi=1 |~gi (vi′ ) − f~i−1 (vi′ )| ≤ ρi . The pattern disclosure risk is the fraction of the number of cracks to the number of paths in T ′ . 2.3.3 Attack Models Possibly with Prior Knowledge Sources of prior knowledge may be from published statistics (e.g., the minimum age being 17, the median salary being 35K), from samples of similar data (e.g., a rival company having data similar to D), or the top k modal classes (e.g., the mode of employee age being 34). Alternatively, the hacker may know that a given attribute follows a certain distribution (e.g., Zipf, Gaussian), even though the parameters may not be known exactly. While it is impossible to enumerate the many forms of prior knowledge, we abstract all these situations in the form of a single notion of knowledge points. Definition 2.4. Let v ′ be a value of attribute A in D ′ . Let the hacker guess that v ′ ∈ δ′ (A) corresponds to v ∈ δ(A). We say that (v, v ′ ) is a knowledge point if |v − fA−1 (v ′ )| ≤ ρ. As encapsulations of various forms of prior knowledge, knowledge points are parameterized by how close they are to their true values. Following Definition 2.1, the radius ρ defined above is the same as the radius used for a crack. Moreover, the hacker can use these knowledge points to form the basis of a curve fitting attack. Definition 2.5 (Curve Fitting Attack). Let (x1 , y1 ), . . . , (xm , ym ) be m knowledge points. Apply a curve fitting method to fit the m points into a crack function g. 30 2.4. The No-Outcome-Change Guarantee We consider three curve fitting methods: (i) a linear regression line that minimizes the residuals of the m knowledge points; (ii) a polyline that connects the m knowledge points; and (iii) a spline that fits the m points. Clearly, curve fitting can also be adversely affected by wrong knowledge points, i.e., points that the hacker thought were accurate but that turn out to be way off. In our empirical evaluation, we call a knowledge point (v, v ′ ) to be good if it satisfies Definition 2.4; but call it bad if |v−fA−1 (v ′ )| > 5ρ. In Section 2.6, we will examine the impact of the knowledge points and various types of curve fitting attacks. Another attack model is sorting. Consider the age attribute again. Even though the defined range of age is within [0, 100], for a practical training data set with say 10, 000 tuples on employees, it is almost guaranteed that every age in the interval [20, 65] occurs in D at least once. Then the hacker takes the transformed values of age and sorts them in increasing order. Knowing the nature of a domain such as age, the hacker then maps the transformed values in increasing order to consecutive values starting with the (guessed) minimum value all the way to the (guessed) maximum. In the rest of this chapter, we show how to safeguard against these attack models, and simultaneously provide the no-outcome-change guarantee. 2.4 The No-Outcome-Change Guarantee In this section, we show that the decision tree constructed using D is identical to the tree constructed using D ′ when either the gini index or entropy is used to select the split-points. These two selection criteria are the most widely used [8]. Definition 2.6. Let ht1 , . . . , tn i be the ordered sequence of A-projected tuples in D such that ti .A ≤ tj .A whenever i < j. Equal values are in some canonical order. The class string σA,D for attribute A in data set D is defined as the string obtained by concatenating the class labels in the ordered sequence of tuples. Whenever the data set D is obvious, we denote the class string simply as σA . For the data set D in Figure 2.1(a), consider A being the age attribute. Sorting the tuples on age gives the class string σage = HHHLHL, where H and L denote the class label High and Low respectively. For the attribute salary, the class string σsalary = HHHHLL. Let us compare the corresponding class strings in D ′ shown in Figure 2.1(b). Observe that the class string for each attribute is unchanged. 31 2.4. The No-Outcome-Change Guarantee Proposition 2.1. Let a monotone function be applied to transform attribute A from D to D ′ . Then the class string is preserved, i.e., σA,D′ = σA,D . Similarly, for an anti-monotone transformation, σA,D′ = (σA,D )R , where σ R denotes the reverse of string σ. Definition 2.7. Given a class string σ for attribute A of a data set, we decompose σ into multiple non-overlapping substrings r1 , . . . , rm such that: (i) σ = r1 ◦ . . . ◦ rm ; (ii) for all 1 ≤ i ≤ m, the substring ri consists of a single class label; and (iii) for all 1 ≤ i ≤ (m − 1), the substrings ri and ri+1 are of different class labels. Each piece ri is a label run. For the class string σage = HHHLHL in Figure 2.1, there are four label runs HHH, L, H and L. From Proposition 2.1, monotone functions preserve all the label runs of each attribute. Based on the definitions of the gini index and entropy, we have the following proposition. Proposition 2.2. For each attribute, the split-point that optimizes either the gini index or entropy must not occur within a label run; it must be at the end points between two successive label runs. Consider the first label run HHH in age in Figure 2.1, corresponding to the age values 17, 20 and 23 in D. The above proposition says that the best split point for age cannot occur at age = 20. The candidate locations of the best split point for age must correspond to the end points of successive label runs – in this case 23, 32 and 43. Note that the same proposition applies to the transformed data D ′ . From Figure 2.1(b), the candidate locations of the best split point for age’ are again the end points of successive label runs – in this case 31, 39, and 49. As formalized below, the candidate locations of the best split point are identical in D and D ′ . Theorem 2.1. The winning attribute and the split point location in D ′ – relative to the sequential ordering of the label runs – are identical to that in D. Proof Sketch: The theorem follows from the fact that in D ′ , the label runs are preserved as in D. Consequently, all the candidate split point locations are preserved as well. Finally, even though the data values change from D to D ′ , the gini index and entropy calculations remain unchanged as they are based on relative frequencies. Hence, the winning attribute and the split point locations are preserved. Note the difference between split points and split point locations. E.g., consider σage = HHHLHL in Figure 2.1. The split point in Figure 2.1(a) is 32 2.5. A Piecewise Framework age = (23+32)/2 and is located between the first and second label runs. In D ′ , as shown in Figure 2.1(b), the actual value of the split point is changed to age’ = (31+39)/2 = 35, providing output privacy. However, the split point is still at the same location with respect to the sequential ordering of the label runs. Let T ′ be the decision tree obtained by mining the transformed database D ′ . Construct another decision tree S as follows. Start from the root of T ′ going top-down: for every node x′ in T ′ of the form Aθv ′ (where θ is a comparison operator and v ′ is a value in δ′ (A)), create a corresponding node x in S of the form AθfA−1 (v ′ ), where fA : δ(A) → δ′ (A) is the data transformation used for A. We have the following result from Theorem 2.1 Corollary 2.1. Let D, D′ , T ′ , S be as above. Let T be the decision tree obtained by mining D directly. Then: S = T . 2.5 A Piecewise Framework To defend against the various attack models, the central idea of the proposed solution framework is to introduce breakpoints to break up the domain into multiple pieces, each of which is encoded by a different transformation function. Figure 2.2 gives a skeleton algorithm for implementing the piecewise framework. It consists of three main phases: (i) choosing breakpoints, (ii) choosing a transformation for each piece, and (iii) setting them up to satisfy a global constraint. We elaborate on all these aspects below. Algorithm PieceTransform Input: a sorted sequence of A− projected tuples 1. Call Procedure ChooseBP or ChooseMaxMP to decompose A into w pieces r1 , . . . , rw 2. For each piece ri , call ChooseFunction to choose fi 3. For 1 ≤ i ≤ w, initialize fi such that they satisfy the global-monotone invariant, and apply fi to δi (A) Figure 2.2: A Piecewise Framework 2.5.1 ChooseBP: Choosing Breakpoint Locations Randomly It is the primary objective that when breakpoints are added, the no-outcomechange guarantee is preserved. To do so, let us return to the concept of label 33 2.5. A Piecewise Framework runs in D : H H H H L L L L H H H H H values in D : 1 2 15 15 27 28 29 29 29 29 42 43 44 values in D’: 20 17 16 16 2 sorted values in D’: runs in D’: 2 L 4 4 5 5 5 5 L L L H H 5 5 5 5 12 13 14 12 13 14 16 16 17 20 H H H H H H H Figure 2.3: Failing the Global-Monotone Invariant runs introduced in Definition 2.7. Figure 2.3 shows a simple example when there are three label runs: r1 ≡ HHHH, r2 ≡ LLLL, r3 ≡ HHHHH. Suppose a breakpoint is introduced exactly between two successive label runs to create three pieces, each of which uses its own (anti-)monotone transformation (i.e., third row of Figure 2.3). In the transformed data, however, the label runs are different (i.e., fifth row), and create a new class string, thus violating Theorem 2.1. The key problem for the situation shown in Figure 2.3 is that the three pieces, after their individual transformations, no longer follow the original ordering among themselves. Specifically, as shown in the first row of Figure 2.3, the three label runs are in the ordering of r1 ≡ HHHH, followed by r2 ≡ LLLL, then by r3 ≡ HHHHH. However, after the transformations as shown in row 4, the runs r2 and r3 have moved ahead of r1 , thereby causing the class string to change. The solution to this problem is to ensure that the pieces satisfy a global constraint to preserve their relative ordering after transformation. Definition 2.8. Let the original domain be broken up into w pieces δ1 (A), . . . , δw (A) with w transformation functions f1 , f2 , ..., fw . This set of transformations is said to satisfy the global-monotone invariant iff for all 1 ≤ i < j ≤ w, ∀v ∈ δi (A), ∀u ∈ δj (A), it is necessary that fi (v) < fj (u). Similarly, the set is said to satisfy the global-anti-monotone invariant if the latter inequality is changed to fi (v) > fj (u). Figure 2.4 shows how the global-monotone invariant is satisfied. Because the largest transformed value of r1 is 20, the invariant requires that all the transformed values in r2 be strictly greater than 20, which is the case in Figure 2.4. Notice that this invariant is satisfied even though an antimonotone function has been applied to r1 , whereas a monotone function has been applied to r2 . Similarly, all the transformed values in r3 are greater 34 2.5. A Piecewise Framework runs in D : H H H H L L L L H H H H H values in D : 1 2 15 15 27 28 29 29 29 29 42 43 44 values in D’: 20 17 16 16 23 24 25 25 25 25 102 101 100 sorted values in D’: 16 16 17 20 23 24 25 25 25 25 100 101 102 runs in D’: H H H H L L L L H H H H H Figure 2.4: Satisfying the Global-Monotone Invariant than 25, irrespective of whether a monotone or an anti-monotone function is applied to r3 . In this way, as shown in row 5 of Figure 2.4, the label runs in D ′ are identical to those in D. While both Figure 2.3 and 2.4 consider putting breakpoints right at the boundaries between label runs, this is only a simplification for illustration purposes. In fact, any value can be a breakpoint location, as long as the global-monotone invariant is obeyed. This is essentially the nature of Procedure ChooseBP shown in Figure 2.5. For an attribute A, ChooseBP decomposes the domain of A into w pieces for some w > 0, i.e., δ(A) = δ1 (A) ∪ . . . ∪ δw (A) and δi (A) ∩ δj (A) = ∅, i 6= j. The w breakpoints are randomly selected from the set of distinct values of A. Even though ChooseBP is simple, its privacy protection power arises from the fact that the number w and the exact w locations are not known to the hacker. Specifically, if the cardinality of CBP is N , then there are O(2N ) combinations for the hacker to ponder over. Section 2.6 will assess the effectiveness of ChooseBP empirically. Procedure ChooseBP Input: DA,C , the number w of breakpoints to be returned 1. Set CBP to be {t.A|t.A ∈ DA,C }, the set of A-values 2. Randomly pick w values from CBP as the breakpoints 3. Return BP /* the set of selected breakpoints */ Figure 2.5: A Skeleton for ChooseBP 2.5.2 ChooseMaxMP: Exploiting Monochromatic Pieces Definition 2.9. A value v ∈ δ(A) in D is monochromatic if all the tuples with v as the A attribute value agree on the label, i.e., 6 ∃t1 , t2 ∈ D such that 35 2.5. A Piecewise Framework t1 .A = t2 .A = v and t1 .C 6= t2 .C, where C is the class label. Furthermore, if all the tuples in a piece r contain monochromatic values and the same label, then r is called a monochromatic piece. In Figure 2.4, 29 is a non-monochromatic value; every other value, including 15, is monochromatic. The piece r1 is a monochromatic piece. The key benefit offered by a monochromatic piece is that there is no requirement to apply either a monotone or an anti-monotone function. In Figure 2.4, the values in r1 : 1, 2, 15 are transformed anti-monotonically to 20, 17 and 16 respectively. Note that the label runs do not change if the transformation f is any bijective function. For example, 1, 2 and 15 can be transformed so that f (1) < f (15) < f (2) (e.g., see row 3 of Figure 2.7). When such a bijective function can be used, a sorting attack is blocked. Furthermore, the space of bijective functions strictly contains the space of monotone or anti-monotone functions. Thus, while ChooseBP uses randomized breakpoints to confuse the hacker, the use of a bijective function here creates an even more serious combinatorial problem for the hacker. Specifically, if N is the total number of values in all the monochromatic pieces, there are O(N !) combinations for the hacker to ponder over. To maximize the benefit of monochromatic pieces, Procedure ChooseMaxMP, shown in Figure 2.6, finds all the monochromatic values and grows them into monochromatic pieces of the largest size. Given a sorted sequence of A-values, this task can be achieved by a simple scan from the smallest to the largest value. To illustrate the procedure, consider Figure 2.7, which has the same original label runs as in Figures 2.3 and 2.4. Let us begin from the smallest value 1. Because the value 1 is monochromatic, 1 is added to BP in line (11). The subsequent monochromatic values 2 and 15 are skipped according to line (16) effectively growing r1 . The next value is 27. While it is monochromatic, the label is different. Thus, a new monochromatic piece begins in line (13) and 27 is added to BP . The next value 28 remains in the same monochromatic piece containing 27. The next value 29 is non-monochromatic. This marks the end of the previous monochromatic piece and starts a non-monochromatic piece in line (4); 29 is added to BP . To complete the example, the next and last breakpoint is 42. In sum, ChooseMaxMP creates 4 pieces: r1 ≡ (1, H)(2, H)(15, H)(15, H), r2 ≡ (27, L)(28, L), r3 ≡ (29, L)(29, L)(29, H)(29, H) and r4 ≡ (42, H)(43, H)(44, H). Because r1 , r2 and r4 are monochromatic, any bijective function can be used. Notice that the way that breakpoints are selected by ChooseMaxMP maximizes the number of values for which a bijective function can be applied. 36 2.5. A Piecewise Framework Procedure ChooseMaxMP Input: DA,C sorted on attribute A, the desired number w of breakpoints to be returned 1. Set BP to be empty, flag M P to f alse, curLabel to null 2. For each value v starting from the smallest { 3. If v is not monochromatic { 4. If M P is true { /* end of a monochromatic piece */ 5. Add v to BP /* a new non-monochromatic piece */ 6. Set M P to f alse, and curLabel to null 7. } /* else simply skip to the next value */ 8. } 9. else { /* v is monochromatic */ 10. If M P is f alse { /* end of a non-mono. piece */ 11. Add v to BP /* a new monochromatic piece */ 12. Set M P to true and curLabel to v.C. 13. } else if curLabel 6= v.C { /* different label */ 14. Add v to BP /* a different monochromatic piece */ 15. Set curLabel to v.C 16. } /* else continues the current monochromatic piece */ 17. } } /* end for-loop */ 18. If the size of BP is h and is less than w, 19. Randomly pick (w − h) breakpoints, if any, from the set of 20. non-monochromatic values as in ChooseBP 21. Return BP Figure 2.6: A Skeleton for ChooseMaxMP runs in D : H H H H L L L L H H H H H values in D : 1 2 15 15 27 28 29 29 29 29 42 43 44 values in D’: 6 10 8 pieces obtained: r1 sorted values in D’: runs in D’: 8 19 18 27 27 27 27 35 31 33 r2 r3 r4 6 8 8 10 18 19 27 27 27 27 31 33 35 H H H H L L L L H H H H H Figure 2.7: Maximizing Monochromatic Pieces 37 2.5. A Piecewise Framework Row 5 in Figure 2.7 shows the transformed values. Notice that the globalmonotone invariant is maintained between successive pieces. Notice that there is no uncertainty in the positions of the monochromatic pieces as their sizes are maximized by ChooseMaxMP. In other words, the hacker knows exactly what the monochromatic and non-monochromatic pieces are in D ′ . However, knowing the pieces in D ′ does not help the hacker at all in cracking the corresponding values in D. For example, in Figure 2.7, knowing rows 5 and 6 do not help the hacker in cracking row 2. In the simple example in Figure 2.7, three important cases have not been illustrated. First, if the number of monochromatic pieces is less than w, ChooseMaxMP still operates by essentially reverting back to ChooseBP in selecting the remaining breakpoints randomly from among the non-monochromatic values. Experimental results later will show that even in this rare case, the proposed framework offers adequate domain disclosure protection. And whenever there are monochromatic pieces, the framework exploits them to give enhanced protection. Second, if there are not enough non-monochromatic values to choose from, ChooseMaxMP returns all the non-monochromatic values, together with the starting values of the monochromatic pieces. For real data, this is a very good situation for the data custodian as it implies that the total number of values in the monochromatic pieces dominates. Finally, in our simple example, the shortest piece is of length 2. We include short pieces in our example for simplicity. In practice, ChooseMaxMP may impose a minimum width threshold (e.g., width ≥ 5). 2.5.3 Selecting Transformations Randomly After breakpoints are selected by using ChooseBP or ChooseMaxMP, the next step is to choose a transformation for each piece from a family of functions. Figure 2.8 shows the procedure of how to randomly select a function for a piece r. Let Fbi denote a set of bijective functions including non-monotone ones, e.g., any permutation functions. This set Fbi is applicable to monochromatic pieces (step 3 and step 4). For non-monochromatic pieces, we are restricted to the family of (anti-)monotone functions, denoted by Fmono . This set can contain polynomials of degree ≥ k, logarithmic functions, parabolic functions and so on. Note that Fmono is closed under composition. That is to say, for any monotone functions f, g ∈ Fmono , their composition f ◦ g is also monotone and is in Fmono . For example, given f (x) = x1/2 and g(x) = log(x), f ◦ g(x) = (logx)1/2 is monotone. Thus, again, the size of Fmono can be arbitrarily large. As in the case for 38 2.5. A Piecewise Framework a monochromatic piece, a randomization step is used to select the transformation for the non-monochromatic piece. We choose a function for r in the following way. If the transformations for the whole domain is globalmonotone, then we randomly choose a monotone function for r; otherwise, we randomly choose an anti-monotone function for r (step 5 and step 6). ChooseFunction guarantees the globle monotone/anti-monotone constraint is satisfied. Procedure ChooseFunction Input: r : a sorted sequence of values from A with class labels 1. If r is monochromatic, choose f randomly from Fbi . 2. Else /* a piece with multiple labels */ { 3. If the set of transformations is global-monotone 4. Select a monotone function f randomly from Fmono 5. Else if the set of transformations is global-anti-monotone 6. Select an anti-monotone function f randomly from Fmono } 7. Return f /* a function to transform the values in r */ Figure 2.8: A Skeleton for Randomly Choosing Transformations Piecewise Agrawal et al. propose an order preserving encryption approach in [3]. They map the original values into another domain while preserving the order of the values. Since the orders of the mapped values are the same to the orders of the original values, the label runs are preserved. Therefore, the order preserving encryption approach also preserves the decision tree. However, our piecewise transformation framework has two advantages over the order preserving encryption regarding the privacy protection. First, monochromatic pieces could proivde extra protection on privacy becasue we could use any function in Fbi to transform these pieces. Second, even when a domain does not contain any monochromatic piece, the piecewise framework can still provide more protection. We could regard the order preserving encryption as a kind of monotone function and include it in Fmono . Unlike the order preserving encryption approach, by selecting different transformations, which could include a mix of the order preserving encryption and other functions in Fmono , for different pieces, we could provide enhanced e protection on privacy. 39 2.5. A Piecewise Framework 2.5.4 Defense Against Sorting Attacks Next we turn to sorting attacks. As specified in Procedure ChooseFunction, if a piece r is monochromatic, fr is chosen from Fbi , and a sorting attack is blocked. What about a non-monochromatic piece? Note that a nonmonochromatic piece r with a monotone transformation can already have its own inherent protection against a sorting attack – provided that there are enough “discontinuities” within r. A value v ∈ δr (A) is a discontinuity if there is no tuple t in r with t.A = v, where δr (A) denotes the dynamic range [minA , maxA ] of A in r, and minA , maxA denote the least and greatest Avalue occurring in r. To be more precise, if δr (A) contains some discontinuity values, then the attacker can only crack a value v ′ ∈ δr′ (A) to within a range, say Rg = [v1 , v2 ] ⊆ δr (A). Recall from Definition 2.1 that a guess is a crack if |g(v ′ ) − f −1 (v ′ )| ≤ ρ. Let Rρ = [v − ρ, v + ρ], where v = f −1 (v ′ ). Then the probability that a value v ′ is cracked under sorting attack can be redefined |Rg ∩Rρ | . as: P (|(g(v ′ ) − f −1 (v ′ )| ≤ ρ) = |R g| Let us consider the entire sorted sequence in D ′ in Figure 2.7 (i.e., row 5). Consider v ′ = 27. There are 5 values ranked ahead of 27 and 3 values ranked after 27. Thus, given that the original domain is [1,44], g(v ′ ) can range from Rg = [6,41]. As shown, the original value is f −1 (v ′ ) = 29. Let us say that the width of a crack is 2, giving Rρ = [27, 31]. Thus, the probability that v ′ is cracked is 5/36. From the above formula, it is clear that the larger the number of discontinuities, the wider the range Rg and the smaller the crack probability. In sum, the data custodian can follow the “recipe” below to determine if an attribute A is safe for disclosure. If A has many monochromatic pieces, or if the non-monochromatic pieces contain many discontinuities, then A is safe with low crack percentage against a sorting attack. The only situation that is unsafe is when A has few monochromatic values and simultaneously few discontinuities. However, the data custodian must decide whether domain disclosure risk for A is a primary concern or not. As discussed before, perhaps the more important issue for the custodian is the association of the A-values with the values of other attributes, not just the A-values themselves. Empirical results in Section 2.6 will examine this perspective. A final point concerns the amount of information the data custodian needs to keep in order to decode the decision tree T ′ to get T . While this is discussed in [4], it suffices to say that the information required is rather minimal (i.e., the locations of breakpoints and the transformations used). 40 2.6. Empirical Evaluation 2.6 2.6.1 Empirical Evaluation Experimental Setup Data Sets: We conducted our experimental evaluation on many benchmark data sets, including the forest covertype, census income and WDBC data sets from the UC Irvine collection. The results reported below are based on the forest covertype data set. These results are representative of those that we do not have space to show. attr. #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 dynamic range width 2000 361 67 1398 775 7118 255 255 255 7174 # of distinct values 1978 361 67 551 700 5785 207 185 255 5827 # of mono. piece 9 0 1 22 14 202 2 8 3 229 avg Length of mono. piece 163 0 15 10 24 18 41 6 8 17 total % of mono. values 74.2% 0.0% 22.4% 40.0% 48.0% 62.9% 39.6% 25.9% 9.4% 66.8% Table 2.1: Statistics of Attributes The forest covertype data set consists of 581,012 tuples and 10 numeric attributes (and other non-numeric attributes). The table in Table 2.1 gives the key statistics of the 10 attributes. There are attributes with many discontinuities; the number of discontinuities is the difference between the second column and the third column of the table. But attributes 2, 3 and 9 have no discontinuity. There are also attributes with several monochromatic pieces. The last column gives the percentage of values that are contained in the monochromatic pieces. For attribute 1, the percentage is the highest at 74%. But for attribute 2, the percentage is 0% as there is no monochromatic piece. This attribute represents the worst case, as it also does not contain any discontinuity. Transformations: Fbi consists of transformations for monochromatic pieces. We used a random permutation function. Fmono is for non-monochromatic 1 pieces. We used linear and higher order polynomials, log, and log 2 (denoted sqrt(log)). 41 2.6. Empirical Evaluation Prior Knowledge: Curve-fitting attacks – linear regression, spline, polyline – require some number of knowledge points (denoted as KP below) to work with. We simulated the ignorant hacker who has no prior knowledge. The width ρ for a KP varied from 1%, 2% and 5% of the width of the dynamic range. Given the definition of ρ’s, we feel that each good KP represents a considerable amount of prior knowledge. Thus, a hacker with 2 or 4 good KPs is referred to as a knowledgeable and an expert hacker respectively. The locations of KPs are selected randomly. Furthermore, given the randomization involved, each disclosure risk figure reported below is based on the median of 500 random trials. The experiments were implemented in a MATLAB environment. We ran the experiments on an Intel Pentium PC with 3GHz CPU and 2GB RAM. 2.6.2 Domain Disclosure Risk Impact of Breakpoints, Monochromatic Pieces and KPs Figure 2.9 shows the domain disclosure risks (i.e., the y-axis) for all 10 attributes of the forest covertype data set using polyline as the curve fitting model. For each attribute, the first bar gives the baseline when no breakpoint is used but with an expert hacker (i.e., 4 KPs). The second and third bar correspond to the cases when ChooseBP and ChooseMaxMP are used, with an expert hacker. Note that to make the comparisons between ChooseBP and ChooseMaxMP fair, ChooseBP uses the same number of breakpoints as ChooseMaxMP, which is determined by the number of monochromatic pieces for the attribute. (The minimum number of breakpoints in each case is set to w = 20.) The last bar corresponds to the case when ChooseMaxMP is used with a knowledgeable hacker. The difference between the first two bars indicates the reduction of crack percentage with breakpoints. For instance, for attribute 1, the crack percentage drops from over 65% to 30%. Reduction is achieved for every attribute. Even for the “worst case” attribute 2 as shown in Table 2.1, breakpoints manage to keep the crack percentage below 25%. This shows that the use of breakpoints alone is effective, even when there is no monochromatic piece. The difference between the second and third bar in Figure 2.9 is due entirely to the monochromatic pieces. Depending on the percentage of monochromatic values, the reduction can be very significant. For the first attribute, the crack percentage drops from 30% to below 10%. This shows the effectiveness of ChooseMaxMP in exploiting monochromatic pieces, if present. 42 2.6. Empirical Evaluation As a reference point, when random perturbations are used, there is a chance that a data value is not changed and the true value is revealed. For example, in [9], many situations examined leave a significant percentage (e.g., 30%) of values unchanged, even with no prior knowledge used. Figure 2.9: Impact of Breakpoints and Monochromatic Pieces on Domain Disclosure The third bar of each attribute in Figure 2.9 corresponds to an expert hacker (i.e., 4 KPs). For all 10 attributes, the crack percentage is significantly lower when the hacker has less domain knowledge. The fourth bar of each attribute shows the crack percentage when the hacker is only a knowledgeable hacker (i.e., 2 KPs). In all cases, the crack percentage falls below 15%. And for an ignorant hacker, the crack percentage falls below 5%. Finally, the crack percentage is sensitive to the presence of even a single bad KP; for lack of space, details are omitted. A combination of Attacks So far all the results presented here are based on the polyline curve fitting model and the sqrt(log) transformation function. The table below shows the corresponding values for alternative situations. All the figures are based on attribute 10 of the forest covertype data set with ChooseMaxMP and an expert hacker. It is clear that results shown above represents a “worst case” analysis. A natural extension to the various attacks analyzed so far is to consider when the hacker applies all of these attack models. The central question here is whether the different attack models crack similar or different items. Figure 2.10 shows a Venn diagram of the cracks for linear regression, spline 43 2.6. Empirical Evaluation Linear regression Spline attack Polyline attack polynomial 10.39% 14.51 % 15.55 % log 11.53% 14.8 % 18.05 % sqrt(log) 10.85% 15.28% 18.03% fitting and polyline fitting on attribute 10 using sqrt(log) as the transformation. For instance, 3% of the domains are cracked solely by polyline, whereas an additional 9% are cracked by polyline and spline but not by linear regression. Spline Fitting Linear Regression 2% 1% 4% 3% 9% 3% 3% PolyLine Fitting Figure 2.10: The Venn Diagram of Cracks: the Combination Attack The question to ask here is: what the disclosure risk for this combination attack is. One simple answer is to add up all the percentages, which gives rise to 25%. 3 However, for most situations, this is an over-estimation of risk. To illustrate, suppose that a specific item a is identified to be a by polyline, b by spline and c by linear regression. While it is true that one of the three attacks correctly reveals the identity of item a, the hacker does not know which one. Thus, he cannot distinguish whether a is really a, b or c. One way to calculate the percentage of cracks is to use expected values. Assuming the hacker trusts the three crack models equally, this gives an expected crack percentage of 12.5%. Another way is to count only those items as cracks if two or more methods agree on them. From the Venn diagram shown in Figure 2.10, this gives a total combined crack percentage of 16%. 3 This is similar in effectiveness as the 30% of unchanged values left unprotected by random perturbation [9]. In our case, however, the crack percentage includes the hacker’s prior knowledge. If the hacker has no prior knowledge, the crack percentage is significantly lower. 44 2.6. Empirical Evaluation attr. #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 # of discontinuities 22 0 0 847 75 1333 48 70 0 1347 total % of mono. values 74.2% 0.0% 22.4% 40.0% 48.0% 62.9% 39.6% 25.9% 9.4% 66.8% worst case crack % by sorting 26% 100% 78% 4% 22% 8% 13% 11% 90% 7% Table 2.2: Sorting Attack: Worst Case Sorting Attack: the Worst Case Analysis Finally, we measure the risk of a sorting attack on all 10 attributes. The table in Table 2.2 gives the worst case results because we assume that the hacker has the knowledge of the true minimum and maximum values of the dynamic range. If these two pieces of prior knowledge is not known, the crack percentage is significantly lower. The second column of the table in Table 2.2 gives the number of discontinuities within the dynamic range (cf: Table 2.1). The third column is directly copied from the last column of Table 2.1. As expected, the three attributes 2, 3 and 9 are the most vulnerable with no discontinuity, and a small percentage of monochromatic values (hardly surprising given that these attribute domains are small and the data set contains over 500,000 rows). However, as long as an attribute has a fair number of discontinuities or monochromatic values, the attribute is safe against a sorting attack even in the worst case. 2.6.3 Subspace Association Risk Next we turn to subspace association disclosure risk. For the 10 attributes of the forest covertype data set, there are over 1,000 subspaces containing two or more attributes. Figure 2.11 shows the results of a few selected subspaces. All subspaces shown contain either 2 or 3 attributes. For the ones selected, they are divided into two categories. The first category are attributes for which the curve fitting attack is 45 2.6. Empirical Evaluation more serious than the sorting attack. Specifically, we consider the set of attributes {4, 7, 10} and their subspaces. The first three bars in Figure 2.11 show their individual domain disclosure risks from Figure 2.9 with an expert hacker. The next four bars show the subspace association risks for all the combinations. For example, the individual risks of attribute 4 and 7 are 16% and 25% respectively. The association disclosure risk of attributes 4 and 7 together, however, drops significantly to 4%. And the association disclosure risk among attributes 4, 7 and 10 drops to 0.2%. In general, as the subspace becomes larger, the disclosure risk drops significantly. Subspace {#4,7,10} Domain 25% Percentage of Crack 25% 20% Subspace {#2,7,10} 24% 18% 16% 15% 15% 10% 5% 3.9% 3.7% 2.9% 1.7% 0 0.2% #4 #7 #10 #4,7 #4,10 #7,10 #4,7,10 #2,7 #2,10 #2,7,10 Attributes and Subspace Figure 2.11: Subspace Association Disclosure Risk The second category are attributes for which the sorting attack is more serious. For a worst case discussion, we return to attribute 2. Even though attribute 2 is 100% cracked by sorting in the worst case, note that the association disclosure risks of attribute 2 with other attributes can be acceptable (i.e., last three bars of Figure 2.11). Thus, for the data custodian to decide whether attribute 2 can be safely released, the data custodian needs to first determine which is the primary concern: attribute 2 alone versus the association of attribute 2 with other attributes. For many situations, it is the association disclosure that is more critical. It is also interesting to compare the association disclosure risk with and without attribute 2 in Figure 2.11. While the domain disclosure risk of attribute 10 alone is 18%, the association disclosure risk of attributes 2 and 10 together is 15%. This is due to how values in attribute 2 are associated with values in attribute 10. This skew leads to the following observation: risk(A, B) < risk(A) ∗ risk(B). This is good news for the data custodian if the primary concern is subspace association risk, rather than domain risk. 46 2.7. Future Work 2.6.4 Output Privacy: Pattern Disclosure Risk Finally, to measure pattern disclosure risk, we applied the C4.5 decision tree algorithm on the 10 attributes of the forest covertype data set. The decision tree constructed contains 1707 paths from the root. The maximum length of these paths is 40. The following table shows the frequency of path lengths and the corresponding cracks. Length of Decision Paths # of Paths # of Cracks 1 2 3 4 5 6 >6 2 1 4 0 5 0 9 0 15 0 24 0 1648 0 Table 2.3: Output Privacy Among all the 1707 paths from the root, there is only 1 path of length 2 that is cracked. In fact, this result is based on an insider hacker (i.e., 8 KPs) and a 5% width. For a less powerful hacker, or a smaller radius, all the paths are protected showing impeccable output privacy protection. 2.7 Future Work In ongoing work, we study how to generalize the piecewise framework from decision trees to SVM and other kernel methods. The difference is that the dividing planes can have arbitrary orientations. In next chapter, we show how the no-outcome-change guarantee and the other two pillars of privacy can be supported for SVM using a set of transformations different from the ones analyzed here. 47 2.8. Bibliography 2.8 Bibliography [1] N.R. Adam and J. C. Wortman. Security-control methods for statistical databases. ACM Computing Surveys, 21, 4, pp515–556, 1989. [2] R. Agrawal and R. Srikant. Privacy preserving data mining. Proc. 2000 SIGMOD, pp. 439–450. [3] R. Agrawal et al. Order Preserving encryption for numeric data. Proc. 2004 SIGMOD, pp. 563–574. [4] S. Bu et al. Preservation Of Patterns and Input-Output Privacy. Full version of this submission: ftp://ftp.cs.ubc.ca/local/laks/papers/ decisionTreeIcde2007ExtendedVersion.pdf. [5] A. Evfimievski, J. Gehrke and R. Srikant. Limiting privacy breaches in privacy preserving data mining. Proc. 2003 PODS, pp. 211–222. [6] Z. Huang, W. Du and B. Chen. Deriving private information from randomized data. Proc. 2005 SIGMOD, pp. 37–48. [7] H. Kargupta et al. On the privacy preserving properties of random data perturbation techniques. Proc. 2003 ICDM, pp. 99–106. [8] P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996. [9] S. Rizvi and J. Haritsa. Maintaining Data Privacy in Association Rule Mining. Proc. 2002 VLDB, pp. 682–693. [10] L. Sweeney. K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 5, p.557-570, 2002. [11] J. Vaidya and C. Clifton. Privacy-Preserving K-Means Clustering over Vertically Partitioned Data. Proc. 2003 SIGKDD, pp. 206–215. 48 Chapter 3 Patterns and Privacy Preservation with Prior Knowledge for SVM Classification4 3.1 Introduction In outsourcing data mining, data owners provide their data to the service provider. To preserve privacy, there are at least two prevalent scenarios considered in the literature. In the data collector scenario, individual data owners send the data to a collector who will do the data mining task. The assumption is that the collector might not be trusted. Thus, the primary concern is to protect the privacy of the input data. The dominant approach is random perturbation, i.e., transforming data values by adding random noise in a principled way [3, 12, 19]. The second scenario is the data custodian model. A data custodian is a person or an organization who either owns the data or has the responsibility to protect the privacy information inherent in the data. For example, a medical research group or hospital (i.e., the custodian) may consider outsourcing data mining using patient data. They need to find a way to protect the patient’s private information. As discussed in [7], there are three ”pillars” to privacy preservation. The first pillar, which has received the most attention, is input privacy, which aims to protect the privacy of the input data to the data mining algorithm [25]. The second pillar is output privacy, which aims to protect the true identity of the mining outcome. For example, a disease prediction model learned from patient data should be protected and not be released to the 4 A version of this chapter will be submitted for publication. Bu, S., Greif, C., Lakshmanan, L. V.S., and Ng, R. T.: Patterns and Privacy Preservation with Prior Knowledge for SVM Classification 49 3.1. Introduction public to prevent malicious utilization. The third pillar is the minimization of outcome change. Existing studies (e.g., [3, 12, 19]) from the data collector scenario provide input privacy by random perturbation. However, the more the data is perturbed, the more the outcome has been changed (even though input privacy is enhanced). For the data custodian scenario, several recent studies show situations when it is possible to provide the No-Outcome-Change (NOC) guarantee without compromising input and output privacy. In [31], Wong et al. propose an algorithm to give the NOC guarantee for outsourcing of association rule mining. In [7], we propose an algorithm to give the NOC guarantee for building a decision tree, while simultaneously providing input and output privacy. Table 3.1 highlights the differences between the data collector scenario and the data custodian scenario. Scenarios Data Collector Data Custodian KAnonymity Studies [1, 2, 3, 12] [18, 19], etc. [8, 9, 21, 31] [7], this thesis, etc. [16, 22, 27] [13, 20, 30], etc. Input Privacy √ Output Privacy NOC √ √ √ √ Table 3.1: Three Pillars of Privacy Preservation In this chapter, we address the problem of privacy protection in outsourcing of support vector machines (SVM). Classification is one of the most popular data mining operations, and SVM is one of the most widely used classification methods. Instead of executing the SVM classification by herself, a data custodian might want to outsource the SVM to outside service providers in the following two cases. Firstly, the data custodian might not have data mining expertise. Even though there is an increasing number of off-the-shelf mining software packages, they still call for substantial expertise on the part of the custodian to train the SVM classifier, including selecting a better kernel function, tuning parameters, interpreting the intermediate results during the training process, etc. Thus, a data custodian might want to outsource the SVM classification to a data mining service provider who has the required expertise. Secondly, instead of buying expensive computing facilities, a data custodian might want to submit the heavy data mining computation to the data mining service provider. This may become even more popular with the recent advent of cloud computing. The data custodian needs to protect the private information in both cases. 50 3.1. Introduction Chen et al. propose using random rotation to provide the NOC guarantee for outsourcing SVM [8]. In a follow-up study, they consider how to deal with hackers/attacks with background or prior knowledge [9]. Sources of prior knowledge may include published statistics (e.g., the minimum age being 17, the median salary being $35K), from samples of similar data (e.g., a rival company targeting the same market), or from knowledge that a certain attribute follows a specific distribution (e.g., Zipf). Since it is impossible to consider all possible attack models with prior knowledge in one thesis, we consider two expressive and “realistic” attack models: one based on curve-fitting with correlation, and the other one based on linear algebraic manipulation which we call a global attack. Both attack models are realistic in the sense that they only require a modest amount of prior knowledge; yet they can be very damaging. While these two models are not exhaustive, they are two broad types of attacks. The objective of this chapter, then, is to explore how to provide the NOC guarantee for the SVM classification and be robust against these attack models. Specifically, we make the following contributions: • We first observe that the random rotation approach proposed by Chen et al. ([8, 9]) is vulnerable to the two attack models. To address this problem, we develop an approach, Principled Orthogonal Transformation (POT), which iteratively and monotonically reduces the correlation by generating orthogonal transformations in a principled way based on the Gram-Schmidt procedure [28]. The generated transformation provides the NOC guarantee for both linear and non-linear SVMs, while providing much stronger input and output privacy protection when compared with the random rotation approach. • To guard against a ”global” attack, which will be formally introduced in Section 3, we propose an algorithm that identifies data points called True Negative Points (TNP), which are guaranteed not to be support vectors. The algorithm then perturbs these true negative points to enhance privacy. It is shown that for a linearly separable data set, the NOC guarantee is preserved. However, for a linearly non-separable data set, as we shall show, the NOC guarantee may be violated but the change is minimized. For the latter case, we present empirical results, using several benchmark data sets, to show that the quality of the SVM classifier is not significantly affected. Roadmap: The rest of this chapter is structured as follows. Section 3.2 describes the related work in privacy preservation data mining. Section 3.3 51 3.2. Related Work introduces the notion of knowledge points and the associated attack models. The POT approach to counter curve fitting attacks is proposed in Section 3.4 and the TNP approach to counter global attacks is studied in Section 3.5. Section 3.6 shows the experiments that validate the effectiveness of the proposed approaches. The way in which POT and the TNP perturbation are meant to be used by a data custodian to outsource her data is discussed at the end of this section. Finally, conclusion and future work are discussed in Section 3.7. 3.2 Related Work Chen and Liu propose a rotation-based transformation approach for SVM [8]; the transformation gives the NOC guarantee. They propose a unified column privacy metric to select a rotation matrix, which could provide a higher privacy level, by swapping the rows of the rotation matrix. They use Independent Component Analysis (ICA) as the attack model to evaluate the protection level. However, ICA assumes that the hacker has no prior knowledge. In their extended work [9], they continue to study random rotation to transform the original data. They investigated attack models, such as distance-inference attacks, in which a hacker might have enough data points to crack the transformation matrix. Their approach amounts to adding noise to the transformed data; but this sacrifices classification accuracy. Liu et al. study random orthogonal transformations from an attacker’s view for k-means and k-NN clustering [21]. However, like random rotation, random orthogonal transformation is vulnerable to attacks with prior knowledge. In [7], we study how to provide the NOC guarantee for building decision tree classifiers. Decision trees are based on recursive partitioning, while SVMs are matrix-based. Thus, the study here has no overlap with the approach proposed in [7]. In [31], Wong et al. propose an algorithm to give the NOC guarantee for outsourcing of association rule mining. They share with us the goal of providing the NOC guarantee. Random perturbation is a popular approach to transform the original data by adding noise [1, 2, 3, 12, 18, 19]. As discussed earlier, this approach is designed for the data collector scenario and mainly focus on input privacy (Table 3.1). In [24], Oliveira and Zaiane study privacy preserving clustering problem and apply affine transformations on the data set. However, different attributes are subjected to different transformations. The geometric prop52 3.3. Knowledge Points and Attack Models erties (e.g., the distance between two points) of the data points are not preserved. Therefore, the NOC guarantee is violated. Meanwhile, like random rotation, random affine transformation is vulnerable to attack with prior knowledge as well. In [32], Yang et al. propose a cryptographic privacy-preserving method to protect the customer’s private information while guaranteeing the accuracy of the classification. The proposed approach is designed for decision tree classification and frequent item mining, by preserving the frequency of items. Their approach cannot be applied to SVM classification, which builds classifiers based on the distance between points in a multi-dimensional space. In [33] and [34], Yu et al. study the problem of privacy-preserving SVM on horizontally partitioned data and vertically partitioned data respectively. Secure and scalable protocols are developed for multiple parties. Besides, they do not consider prior knowledge of the hacker. Finally, there are the variants of k-anonymity for data publishing [22, 27]. Iyengar [16], Wang et al. [30], Fung et al. [13], and Lefevre et al. [20] propose anonymity approaches while trying to provide good data mining results. Similar to the perturbation approach in the data collector scenario, those notions and approaches in k-anonymity are mainly designed to protect input privacy (Table 3.1) and prior knowledge or background knowledge is not considered. The mining outcome still changes. Martin et al. in [23] and Du et al in [11] study hacker’s background knowledge for k-anonymity. Their work also focus on input privacy and the mining outcome could be significantly affected when data mining methods are applied on the transformed data. 3.3 Knowledge Points and Attack Models In this chapter, given a data set D of d attributes containing n tuples, each tuple xi is represented as a point in d-dimensional space, i.e., xi = {x1 , ..., xd }T , which is a column vector in d-dimensional space. Therefore D can be represented as a d × n matrix. We can apply a transformation τ on D. The transformed data set is D ′ = τ (D). For a point x in D, the corresponding transformed point is x′ = τ (x). 3.3.1 Knowledge Points There could be many sources of prior knowledge (e.g., published statistics, samples of similar data, etc.). We abstract all of them into a uniform notion of knowledge points. 53 3.3. Knowledge Points and Attack Models Both Chen et al. [9] and Liu et al. [21] define prior knowledge as pairs of original points and the corresponding transformed points. However, in real life, a hacker might only know approximate values but not the exact original values. In this chapter, we propose a more realistic and practical notion of knowledge points. The key idea is that if the hacker’s guessed value for a given transformed value falls within a certain radius of the exact original value, we regard that pair as a knowledge point. Definition 3.1 formalizes this notion. Definition 3.1. (Domain Knowledge Point) Given a transformed value u′ in attribute i of D ′ , suppose a hacker believes that the corresponding original value of u′ is w from her prior knowledge. We say that (w, u′ ) is a domain knowledge point if |w − u| ≤ δi , where u is the true original value of u′ and δi is a given parameter for attribute i. Here, δi measures how close the guessed original value w is to the real value u and is defined as a fraction of the domain range of attribute i. Thus, for a knowledge point as defined above, the guessed original value is within a radius of δi from the true original value. A hacker could gain prior knowledge from various sources of published statistics, samples of similar data, or knowledge that a certain attribute follows a specific distribution. For example, if a hacker has a similar data set from a rival company and knows that for the Age attribute, the maximum is 60 and the minimum is 18. When she gets the transformed data set, and finds the maximum of the transformed value to be 125 and the minimum to be 6, she can simply make a guess that 125 is transformed from 60 and 6 is from 18. Even if the true original value for 125 is 58 and the true original value for 6 is 19, the guessed values are very close. Meanwhile, if a hacker knows the mean, median, or other statistics of the original domain, she could simply map those values to the corresponding transformed values to form more knowledge points. Section 3.6.2 shows that even this kind of simple statistic value mapping between original domain and transformed domain could breach considerable privacy if the original values and transformed values are highly correlated. 3.3.2 Attack Models with Prior Knowledge In Definition 3.1, δi measures how close the guessed original value w is to the real value u. We next define a crack, using the same radius δi . Definition 3.2. (Crack) Given a data set D and the corresponding transformed data D ′ = τ (D), for each original value u in D, we have a trans54 3.3. Knowledge Points and Attack Models formed value u′ in D ′ . A domain crack function g : D ′ → D represents a guess made by a hacker. For a value u′ in D ′ , a guess is a crack if the guess falls within a radius δi from the actual value, i.e., |g(u′ ) − u| ≤ δi . The domain disclosure risk is the fraction of the number of cracks over the number of values in D ′ . We consider two possible attack models in this chapter. The first one is called curve fitting attack. Definition 3.3. (Curve Fitting Attack) Suppose a hacker has k domain knowledge points, i.e., (w1 , u′1 ), ..., (wk , u′k ), for an attribute. The hacker can apply a curve fitting approach to fit these k knowledge points into a crack function g, which is used by the hacker to crack the other transformed values in the same attribute. The second attack model is global attack and is defined for matrix transformations. Recall that a d dimensional data set D containing n tuples is represented as a d × n matrix. Each column is a tuple (data point) and each row is an attribute of the data set. Suppose D is transformed by a d × d matrix Q, the transformed data set D ′ = Q × D. For the custodian to decode the support vectors, she needs to use an invertible matrix. We have D = Q−1 × D ′ . Global attacks refer to the situation when a hacker could make a guess about the transformation matrix Q with her prior knowledge and further to crack the original data set D. Before we go to details of the definition of global attack, let us study an example first. Example 3.1. Given a three dimensional data set containing six data points 13 5 6 40 22 33 D = 18 28 27 14 15 30 36 3 31 27 34 2 and a transformation matrix 0.78 0.62 0.08 Q = 0.29 −0.47 0.84 , −0.55 0.63 0.54 the transformed data set D ′ = Q × D. Suppose a hacker has some prior knowledge about the first three values in the first attribute of D and guesses w ~ = (w1 , w2 , w3 ) = (13.5, 4.8, 6.5), which are very close to the true values D(1,1:3) = (13, 5, 6). Let us use Dk to represent the first three data points 55 3.3. Knowledge Points and Attack Models containing D(1,1:3) in D and use Dk′ to represent the corresponding three transformed data points in D ′ . Since the hacker has the transformed data set D ′ = Q × D, she can make a guess on the first row of the inverse matrix of Q, i.e., (Q−1 )(1,:) , by using D ′ and w ~ in the following way. Let us use Q̂ to represent the matrix guessed by the hacker. She believes that D ′ = Q̂ × D and Q̂−1 × D ′ = D. Therefore, she reasons that (Q̂−1 ) × Dk′ = Dk . This means (Q̂−1 )(1,:) × Dk′ = D(1,1:3) . She does not know the original values of D(1,1:3) . Instead, she has guessed values w ~ and believes −1 ~ which that w ~ is close to D(1,1:3) . Thus, she guesses (Q̂ )(1,:) × Dk′ = w, −1 contains three linear equations of the three elements in (Q̂ )(1,:) . By solving these three linear equations, the hacker can get the first row of matrix (Q̂−1 ), i.e., (Q̂−1 )(1,:) = w ~ × (Dk′ )−1 = (0.76, 0.31, −0.53). Since Q is an orthogonal matrix, Q−1 equals to the transpose of Q, i.e., QT . Therefore, the first row of the inverse of Q is (Q−1 )(1,:) = (0.78, 0.29, −0.55). We can see that the values of the guessed row (Q̂−1 )(1,:) are close to the actual values in the first row of the inverse of Q, i.e., (Q−1 )(1,:) . Once the hacker has (Q̂−1 )(1,:) , she can use it to crack the first attribute of D by the equation D̂(1,:) = (Q̂−1 )(1,:) × D ′ and guesses that D̂(1,:) = (13.5, 4.8, 6.5, 39.6, 22.2, 32.1). Here, the first three values are the hacker’s knowledge, and the remaining three values are close to the actual values D(1,4:6) = (40, 22, 33) in the original data set. From the above example, we have the definition for global attack as follows: Definition 3.4. (Global Attack). Suppose a hacker has d domain knowledge points in the ith attribute of D, i.e., (w1 , u′1 ), ... , (wd , u′d ). Let w ~ = (w1 , w2 , · · · , wd ), where wi is a guessed value. Dk′ contains the tuples containing the d transformed values (u′1 , · · · , u′d ). She can guess out the ith row (Q̂−1 )(i,:) of matrix (Q̂−1 ) by solving (Q̂−1 )(i,:) × Dk′ = w, ~ i.e., (Q̂−1 )(i,:) = w ~ × (Dk′ )−1 . (Q̂−1 )(i,:) is used to crack other values in the ith attribute of D by the hacker. Remark: Our definition of global attack is less strict, and thus riskier, than the matrix crack (e.g., distance inference attack) discussed in [9, 21]. In their work, a matrix attack requires d d-dimensional knowledge points. A d-dimensional knowledge point is an exact mapping between a transformed data point and the corresponding original data point across all attributes. They do not consider the situation when a hacker could use d 1-dimensional 56 3.4. POT: Principled Orthogonal Transformation knowledge points (i.e., domain knowledge points) in an attribute to make a guess on the transformation matrix and further to crack the original data set. 3.4 3.4.1 POT: Principled Orthogonal Transformation Preliminaries SVM Overview Given a data set D containing n data points, each point xi is associated with a class label yi ∈ {−1, 1}. SVM uses the function f (x) = n X αi yi K(x, xi ) + b i to form the classifiers from the training data set, where αi and b are parameters tuned by the training process and K is a kernel function [10]. Geometrically, SVM classifiers minimize errors on the training set and separates the rest of the elements with maximal margin in the feature space [5]. Specifically, for linear SVM, if the two classes of the data set D are linearly separable, SVM classification finds the optimal hyperplane that separates those two classes and maximizes the margin between the two classes. If the two classes are linearly non-separable, the optimal hyperplane should minimize the classification errors while maximizing the margin between the two classes. SVM-Preserving Transformations A data custodian sends the transformed data to the data mining service provider, who conducts data mining tasks on the transformed data. After the data custodian gets back the mining outcome (support vectors), the transformation used allows the data custodian to easily decode the support vectors. We have the following definition for the NOC guarantee for SVM. Definition 3.5. (SVM-preserving Transformation) Given a data set D, a transformation τ , and the transformed data set D ′ = τ (D), let V = {v1 , ..., vm } be the support vectors in D. The transformation τ is called an SVM-preserving transformation iff V ′ = τ (V ) = {τ (v1 ), ..., τ (vm )} are also the support vectors in D ′ . 57 3.4. POT: Principled Orthogonal Transformation Rotation is a transformation that transforms D by applying a rotation matrix M , i.e., D ′ = M × D. Rotation has been shown to preserve SVM in [8]. As our primary goal is to transform the data while having the NOC guarantee, we are interested in a class of transformations broader than random rotation. The POT method introduced in Section 3.4.3 transforms the original data set D by an orthogonal matrix Q: D ′ = Q × D. A transformation matrix Q is orthogonal if QT Q = QQT = I, where I is the identity matrix. The proof in [8] mainly relies on the fact that a rotation matrix M satisfies the property M T M = M M T = I. An orthogonal matrix Q, which is more general, also has the property QT Q = QQT = I. Therefore, orthogonal transformations are also SVM-preserving and give the NOC guarantee. SVM classification has broad applications and can be applied on both categorical and numerical data. However, in this thesis, we mainly focus on numerical attributes. 3.4.2 Limitations of Random Transformations In order to evaluate the privacy protection level of random transformations, i.e., random rotation [8] and random orthogonal transformation [21], we have conducted a set of experiments on benchmark data sets from the UC Irvine collection [29]. The details are shown in Section 3.6. Empirical studies show that random transformations are vulnerable to curve fitting attack even when a hacker has a small number of knowledge points. In the worst case, a hacker can crack more than half of the values in an attribute if the hacker only has 4 domain knowledge points in that attribute. For those “easy-tocrack” attributes, we found that the correlations between the original values and the corresponding transformed values are very high. The correlation coefficient ρ between two random variables X and Y is defined as ρ= cov(X, Y ) std(X)std(Y ) (3.1) where cov(X, Y ) denotes the covariance between X and Y [17]. Intuitively, if the correlation is high, the fitting error is small when we fit a linear line between the original values and the transformed values. Therefore, when a hacker applies the curve fitting attack with her knowledge points to crack the transformed values, the percentage of the values to be cracked will be high too. In order to break down the strong linear correlations between original values and transformed values, we propose principled 58 3.4. POT: Principled Orthogonal Transformation orthogonal transformation (POT), which can enhance the protection of both input and output privacy. 3.4.3 Reducing Correlations by POT Our technical objective here is to construct an orthogonal transformation while reducing correlation simultaneously. Figure 3.1 gives the outline of Algorithm 3.1 (POT-GenMatrix) which does exactly that. In this subsection, we give a general overview of the algorithm, using an example for illustration. Technical details for each step are given in the next subsection. Example 3.2. Let us again use the data set 13 5 6 40 D = 18 28 27 14 36 3 31 27 D discussed before, i.e., 22 33 15 30 34 2 . Step 1 of the algorithm picks a random d × d matrix A, say, A = 20 6 1 16 1 2 , which is orthogonalized to Q by the Gram-Schmidt pro2 5 12 cess. The details of orthogonalization are explained in Section 3.4.4. Data set D is then transformed to D ′ = QT × D. We use QT to transform D because we want to use the j th column of Q to transform the j th attribute of D. Step 3 computes the correlations between respective rows of D and D ′ . These correlations are sorted in descending order. We orthogonalize A to get Q, and we compute the correlation of each attribute between the original values in D and the transformed values in D ′ and get ρ1 = 0.95, ρ2 = −0.89 and ρ3 = 0.48. Suppose the correlation threshold is ρ0 = 0.6. For attribute #1, Step 5.2 checks whether ρ21 > ρ20 . Now ρ1 = 0.95, then Step 5.2.1 generates ~v = (0, 0, −14)T by solving an appropriate inequality. A new vector A(:,1) = A(:,1) + ~v is then computed in Step 5.2.3 and A(:,1) = (20, 16, −12)T . Based on this new A(:,1) , the new correlation ρ1 becomes 0.68 (Step 5.3), a drop from the original 0.95. However, ρ21 is still greater than ρ20 = 0.36. Therefore, the loop in Step 5.2 continues. It generates another vector ~v = (−10, 0, 0)T and A(:,1) = A(:,1) +~v = (10, 16, −12)T . The new correlation ρ1 is 0.37 (Equation 3.5). At this point, the algorithm has found a vector A(:,1) that can produce a smaller correlation than the threshold. New orthogonal vector Q(:,1) is computed in Step 5.4, using Equation 3.2. 59 3.4. POT: Principled Orthogonal Transformation Algorithm 3.1: POT-GenMatrix Input: D: d-dimensional training data set ρ0 : correlation coefficient threshold 1. Randomly generate A and compute Q by Equation 3.2 (the Gram-Schmidt process); 2. Get D ′ = QT × D; ′ 3. For each i : 1, ..., d, compute ρi between D(i,:) and D(i,:) ; 4. Sort |ρi | in descending order and store this order in idx; Reorder attributes of D by this order, i.e., D = D(idx,:) ; 5. For j = 1 to d-1 { 5.1 Compute ρj by Equation 3.5; 5.2 While ρ2j > ρ20 { 5.2.1 Get a ~v by solving Inequality 3.6; 5.2.2 Set A(:,j) = A(:,j) + ~v ; 5.3 Compute ρj by Equation 3.5; } 5.4 Compute Q(:,j) by Equation 3.2; } 6. Compute Q(:,d) by Equation 3.2; 7. Q(idx,:) = Q; 8. Return Q; Figure 3.1: Generating Orthogonal Matrix Q For attribute #2, given the orthogonal vector Q(:,1) for the first attribute and A(:,2) , Step 5.1 recomputes ρ2 = −0.82, and Step 5.2 finds that ρ22 is bigger than ρ20 . Similar to attribute #1, Step 5.2.1 generates a vector ~v = (0, 0, −16)T , and Step 5.2.3 updates A(:,2) = A(:,2) + ~v = (6, 1, −11)T . New correlation ρ2 is computed in Step 5.3 and ρ2 = 0.22. We can see the adjustment of A(:,2) does not change ρ1 and Q(:,1) , as will be shown in Lemma 3.1. Finally, for attribute #3, the orthogonal vector Q(:,3) is computed in Step 6 and ρ3 = 0.48. In Step 7, the newly generated orthogonal vectors Q(:, j) are reordered into Q based on idx, which is generated in Step 4. 60 3.4. POT: Principled Orthogonal Transformation 3.4.4 Technical Details and Equations Orthogonalization Orthogonalization is done by the usual Gram-Schmidt process on a linearly independent set of vectors A, i.e., A = {A(:,1) , ..., A(:,d) } [28]. Specifically, let us use Qj−1 to represent the first j − 1 generated vectors in Q when we are going to compute the j th vector Q(:,j) , i.e., Qj−1 = {Q(:,1) , · · · , Q(:,j−1) }. The Gram-Schmidt process to compute Q(:,j) can be expressed as [28, pp.57]: T(:,j) = (I − Qj−1 QTj−1 ) × A(:,j), Q(:,j) = T(:,j) ||T(:,j) || (3.2) To simplify the above equation, let Pj denote (I − Qj−1 QTj−1 ). Thus, T(:,j) = Pj × A(:,j) . The Gram-Schmidt process requires that vectors in A be linearly independent, which is not an issue for Algorithm 3.1 (POT-GenMatrix): if the vectors in A are generated randomly (Step 1), the probability to have linearly dependent vectors in A is negligible. For an example, if the elements of A are uniformly selected from the integers between 1 and 100, this probability is 1001d−1 , which is very small. Let us now consider the correlations between the original data and the transformed data. The transpose of Q (i.e., QT ) is also an orthogonal matrix. We use QT to transform D to D ′ , that is D ′ = QT × D. In this way, the j th attribute of D is transformed by the j th column of Q. Therefore, values ′ in the j th attribute of D ′ can be computed as D(j,:) = (Q(:,j) )T × D = Q(1,j) D(1,:) + ... + Q(d,j) D(d,:) . The covariance between the j th attribute of ′ D and D ′ and the standard deviation of D(j,:) can now be expressed as: ′ cov(D(j,:) , D(j,:) ) = cov(D(j,:) , (Q(:,j) )T × D) = (Q(:,j) )T cov(D(j,:) , D) 2 std ′ (D(j,:) ) = (Q(:,j) )T × C(:,j) = ′ var(D(j,:) ) (3.3) T = (Q(:,j) ) × C × Q(:,j) (3.4) where the matrix C is the covariance matrix of the original D, i.e., C(i,j) = cov(D(i,:) , D(j,:) ), 1 ≤ i, j ≤ d. Based on Equations 3.2, 3.3 and 3.4, we can now rewrite the square of correlation coefficient ρj as: 61 3.4. POT: Principled Orthogonal Transformation ρ2j = ′ cov 2 (D(j,:) , D(j,:) ) ′ std2 (D(j,:) )std2 (D(j,:) ) = (A(:,j) )T × S1 × A(:,j) (A(:,j) )T × S2 × A(:,j) (3.5) where S1 = PjT ×C(:,j) ×(C(:,j) )T ×Pj and S2 = std2 (D(j,:) )×PjT ×C ×Pj . Lemma 3.1. For 1 ≤ i < j, Q(:,i) and ρi are independent of A(:,j). Proof Sketch : For 1 ≤ i < j, Q(:,i) is computed from A(:,i) and Q(:,1) , ..., Q(:,i−1) (Equation 3.2). ρi is computed from A(:,i) , Q(:,1) , ..., Q(:,i−1) and covariance matrix C (Equation 3.5). Therefore, both Q(:,i) and ρi are independent of A(:,j) . A(:,j) does not change any Q(:,i) and ρi , for 1 ≤ i < j. The above lemma states that the computation of the j th column of A in Step 5.2 of Algorithm 3.1 does not change the adjusted correlations of the previous columns. Correlation Reduction So far, we have explained how the orthogonalization can be done columnby-column incrementally. Below we explain how the correlations can be reduced. Suppose we have a vector A(:,j) and the corresponding coefficient based on A(:,j) is ρj . We are looking for a new vector Ã(:,j) = A(:,j) + ~v that can produce a smaller correlation, which means ρ̃2j = (Ã(:,j) )T ×S1 ×Ã(:,j) (Ã(:,j) )T ×S2 ×Ã(:,j) (A(:,j) )T ×S1 ×A(:,j) (Equation 3.5). (Ã(:,j) )T × S2 × Ã(:,j) has the (A(:,j) )T ×S2 ×A(:,j) ′ std2 (D(j,:) )std2 (D(j,:) ) and is a non-negative value. We have: AT(:,j) S1 A(:,j) + 2S1 A(:,j)~v + ~v T S1~v AT(:,j) S2 A(:,j) + 2S2 A(:,j)~v + ~v T S2~v < ρ2j = same sign as < ρ2j 2S1 A(:,j)~v + ~v T S1~v < 2ρ2j S2 A(:,j)~v + ρ2j ~v T S2~v (3.6) Inequality 3.6 is a quadratic function containing d unknown elements of ~v . There are standard procedures for solving a quadratic inequality [14]. By solving Inequality 3.6, we can get a vector ~v and a new vector Ã(:,j) = A(:,j) + ~v that can produce a smaller value of ρ̃2j . Step 5.2 of the algorithm is designed to monotonically reduce |ρj | to be below a given threshold ρ0 . 62 3.4. POT: Principled Orthogonal Transformation 3.4.5 Properties of the Algorithm In the example discussed in Section 3.4.3, the correlation ρ1 is monotonically reduced by adjusting A(:,1) = A(:,1) +~v . The reason this works is that ~v comes from a solution of Inequality 3.6, which ensures that the new vector A(:,1) can produce smaller correlation |ρ1 |. From this property of monotonicity, we have the following theorem: Theorem 3.1. Given a threshold ρ0 , for attributes j : 1, ..., d − 1, algorithm POT-GenMatrix always finds a vector A(:,j) that produces correlation ρj s.t. ρ2j ≤ ρ20 . Proof : Firstly, algorithm POT-GenMatrix monotonically reduces ρ2j because ~v comes from a solution of Inequality 3.6. Secondly, we need to show that there is always a solution ~v for Inequality 3.6 if ρ2j > ρ20 (Step 5.2.2). Inequality 3.6 is not solvable only when ρ2j has reached the minimum value, which is actually 0. The reason is as follows: ′ cov(D(j,:) ,D(j,:) ) ′ std(D(j,:) )std(D(j,:) ) (Equation 3.5) and = (Q(:,j) )T × C(:,j) (Equation 3.3), where We have ρj = ′ cov(D(j,:) , D(j,:) ) C(:,j) is the j th column of covariance matrix C and is a non-zero vector. In d dimensional space there must be a non-zero vector q that is orthogonal to C(:,j) and Q(:,i) , 1 ≤ i < j, which means q T × C(:,j) = 0. Let Q(:,j) = q, ′ ′ then cov(D(j,:) , D(j,:) ) = 0. The denominator of ρj is std(D(j,:) )std(D(j,:) ), 2 which is not 0 because Q(:,j) is a non-zero vector. Therefore, ρj equals 0. In conclusion, POT-GenMatrix can always iteratively find a new vector A(:,j) to monotonically reduce ρ2j until ρ2j ≤ ρ20 . Note that Step 5.2 of Algorithm 3.1 (Figure 3.1) loops until ρ2j ≤ ρ20 . Theorem 3.1 guarantees that the correlation reduces monotonically until ρ2j ≤ ρ20 . In Section 3.6.3, experimental results show that the number of loops is small (around 10), but it might converge slowly for some data sets5 . If the computational cost of the procedure is high, a data custodian could set a maximum number of loops for this step. In this case, however, the correlation may be higher than the threshold. In algorithm POT-GenMatrix, Step 6 computes the dth vector Q(:,d) directly from the preceding d − 1 vectors Q(:,1) , . . . , Q(:,d−1) . Because Q is an orthogonal matrix, the last vector Q(:,d) is orthogonal to all the preceding d − 1 vectors. Therefore, in a d dimensional space, if the preceding d − 1 5 A rigorous convergence analysis of this algorithm may be rather challenging, and requires further assumptions on various characteristics of the data, such as its degree of variation. 63 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) vectors of Q have been fixed, the last vector Q(:,d) will be determined by those vectors. It is not able to change the last vector Q(:,d) by adjusting the vector A(:,d) . This is why we reorder D in descending order of |ρj | in Step 3 of algorithm POT-GenMatrix. The attributes that have high correlation coefficients with the transformed attributes are processed early. The complexity of algorithm GenOrthMatrix contains two parts. First, the cost to transform D and to compute the correlations in Steps 2-4 of Algorithm 3.1 is O(nd2 ), where n is the number of points and d is the number of attributes. Second, the cost to generate A and Q in Steps 1, 5, and 6 is O(c×d×dm ), where c is the iteration times for Step 5.2. O(dm ), 2 < m ≤ 3 is the cost of matrix multiplication. Fast matrix multiplication is a well studied problem. There are several practical optimizations that have been developed for large d [26]. In summary, the complexity of algorithm GenOrthMatrix is O(nd2 + cdm+1 ). Experimental results in Section 3.6 show that POT can significantly reduce the crack percentage by reducing the correlation between the original values and the transformed values with low computational overhead. 3.5 Providing Extra Protection for Linear SVM: True Negative Point (TNP) Section 3.4 shows that the POT approach can break the correlations between the original data and the transformed data. Experimental results shown later will demonstrate that the POT approach is effective in reducing the crack percentage against curve fitting attacks for both linear and non-linear SVM. However, if a hacker has enough knowledge points to conduct a global attack (Definition 3.4), POT might not be sufficient to protect the data. The following lemma shows the surprising result that the crack percentage is independent of transformations if a hacker has d knowledge points in an attribute for a d dimensional data set and she conducts a global attack. Lemma 3.2. The crack percentage of global attack is independent of Q. Proof Sketch : Given a data set D and the transformed data D ′ = Q × D, suppose a hacker has d domain knowledge points in the ith attribute, i.e., (w1 , u′1 ), ... , (wd , u′d ). From Definition 3.4, the hacker can guess out a row of matrix Q̂−1 , i.e., (Q̂−1 )(i,:) = w ~ × Dk′−1 , where w ~ = (w1 w2 ... wd ) and ′ , 1 ≤ k ≤ d that are associated the matrix Dk′ contains the d tuples D(:,j k) ′ with the transformed values uk (jk is the index of the transformed value ′ |1 ≤ k ≤ d). Because u′k ). Therefore, Dk′ is a d × d matrix and Dk′ = (D(:,j k) 64 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) D ′ = Q × D, we have Dk′ = Q × Dk . Let us use D ′′ to represent the data guessed by the hacker, and we have: ′′ D(i,:) = (Q̂−1 )(i,:) × D ′ = w ~ × Dk′−1 × D ′ = w ~ × (Q × Dk )−1 × Q × D = w ~ × Dk−1 × D (3.7) The transformation matrix Q does not appear in the above expression of D ′′ . Cracks indicate how close D ′′ is to D (Definition 3.2). Therefore, the crack percentage of global attack is independent of Q. Lemma 3.2 shows that we are not able to reduce the crack percentage by adjusting the transformation matrix Q whenever the hacker has d domain knowledge points in an attribute and conducts a global attack. Notice that Lemma 3.2 also covers the matrix attack [9, 21], which is a special case of the global attack. In order to counter global attacks, we propose the true negative point (TNP) approach to enhance protection. Recall that a hacker needs d domain knowledge points to execute a global attack (Definition 3.4). When d becomes larger (i.e., as dimensionality increases), it becomes harder for a hacker to have enough knowledge points [9] and the likelihood for a hacker to have ’poor’ knowledge points, which are defined below, becomes higher too. Definition 3.6. (Poor Knowledge Point) A knowledge point is poor means that the distance between the guessed value w and the actual value u is greater than the defined radius δi (Definition 3.1), i.e., |w − u| > δi . Experimental results (Section 3.6.2) show that even a single ’poor’ knowledge point could significantly bring down a hacker’s crack ability. Thus, the TNP approach is primarily needed for low dimensional data sets when global attacks are possible. The POT approach proposed in Section 3.4 is designed for both linear and non-linear SVM. However, for the TNP approach, we only focus on linear SVM in this chapter. We will extend the TNP approach against global attacks for non-linear SVM in our future work. The general idea of the TNP perturbation is as follows. Support vectors are the points that determine the optimal separating hyperplane and form the classifier [10]. We call all the other points true negative points (i.e., TNPs). The NOC guarantee requires the support vectors be preserved. For all the TNPs, we can arbitrarily perturb them as long as they do not change the optimal hyperplane and hence the support vectors. To exploit this, we need a method for identifying TNPs. Below we propose a method to find a 65 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) subset of TNPs and design a way to perturb them while providing the NOC guarantee. In Section 3.6, experimental results will show that even a small subset of TNPs can help significantly. 3.5.1 Linearly Separable Data Sets The optimal hyperplane separates the two classes and maximizes the margin between the two classes for a linearly separable data set [10]. Geometrically, suppose we move the optimal hyperplane parallel to itself in the space until it hits any data point. Then the hitting points are the support vectors. The hyperplane passing through a support vector and parallel to the optimal hyperplane also separates the two classes. We have the following necessary condition for support vectors: Proposition 3.1. A necessary condition for a point p to be a support vector is that there is a hyperplane h such that h passes through point p and separates the two classes. This proposition helps us to identify a specific subset of TNPs. Let us illustrate the idea for 2-dimensional space. For any given two points p, q in 2-dimensional space, we denote the line connecting p, q as fpq , which partitions the space into two half-spaces. Lemma 3.3. (TNPs in 2-dimensional space) For any two points p1 , p2 from class I and a point w from II, let m be the center of line segment p1 p2 . A class I point q is a true negative point if the following conditions are satisfied: 1. q and w are from different half-spaces generated by fp1 p2 ; 2. q and m are from the same half-space generated by fp1,w ; and 3. q and m are from the same half-space generated by fp2,w . Proof Sketch : Figure 3.2 shows a set of points from Class I and Class II. Points p1 , p2 are from class I (marked as crosses) and w is from class II (marked as circles). m is the center of line segment p1 p2 . Note that only points in the area Ω can satisfy all of these conditions; this is the area “below” fp1p2 and bounded by fp1 ,w and fp2,w in Figure 3.2. For example, only points p3 and p4 satisfy these conditions. For any points q in the area Ω, it is impossible to draw a line through q that can separate both p1 and p2 from w, which means that we cannot find a line passing through q that can separate class I and class II. From 66 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) w a b p1 m12 p3 c p2 p4 w Ω (a) Finding TNP (b) Skyline Query Figure 3.2: Finding True Negative Points Proposition 3.1, we know that q cannot be a support vector and q is a true negative point. For d-dimensional data sets (d > 2), Lemma 3.3 is still applicable. The only difference is that we need to select at least d points pi instead of 2. Based on Lemma 3.3, we propose Algorithm 3.2 (PerturbTNP) to find TNPs and perturb them. The basic idea is to form an “umbrella” as in Figure 3.2(a) to find TNPs. Hereafter, the points pi are called base points as they form the base of the umbrella, and point w is called the top point. In algorithm PerturbTNP, the input parameter tr is the ratio of TNPs to points in D and indicates the minimum number of true negative points to be chosen. k ≥ d is a positive integer which specifies how many base points p are to be selected. maxCnt specifies the maximum number of loops of Step 2. Step 1 initializes TNP set T . Setp 2 to Step 8 find TNPs iteratively. S is a temporary variable to record the TNPs found in current loop and is initialized in Step 3. Step 4 selects a top point w and Step 5 selects k base points p. All base points p are from the opposite class of w. Step 6 forms an umbrella shape as what we did in Figure 3.2(a). Each point s that satisfies Lemma 3.3 is a TNP and is found in Step 7. Step 7.1 randomly perturbs s to s′ in the area Ω. Point s is deleted from D. Point s′ is put into set S and is finally put into TNP set T in Step 8. The returned data set D contains all left over points and the perturbed TNPs in T . Based on Lemma 3.3, we have the following theorem for algorithm PerturbTNP: Theorem 3.2. (The NOC Guarantee) PerturbTNP preserves support vec67 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) Algorithm 3.2: PerturbTNP Input: D: data set; tr : ratio of TNPs to points in D; k: a positive integer; maxCnt : a positive integer; 1. T = Empty; cnt = 0; NP | < tr ) and ( cnt ≤ maxCnt ){ 2. While ( |T|D| 3. Set S = Empty; 4. Select a top point w from D; 5. Select k base points p = {pi |1 ≤ i ≤ k} from the opposite class of w; 6. Build an umbrella shape as shown in Figure 3.2(a) based on points w ∪ p; 7. For each point s that satisfies Lemma 3.3 { 7.2 Randomly move s to s′ in area Ω(Figure 3.2(a)); 7.2 Delete s from D; 7.3 Put s′ to S; } 8. T =T ∪S cnt + +; } 9. Return D = D ∪ T ; Figure 3.3: A Skeleton for the TNP Perturbation Approach tors. Proof Sketch : All points s found in Step 7 of PerturbTNP are true negative points (Theorem 3.3). Deleting s from D does not change support vectors of SVM. Because s′ is a point generated by moving s in the area Ω, s′ also satisfies Lemma 3.3 and is also a true negative point in the perturbed data set. Therefore, adding s′ to D also does not change support vectors of SVM. That is, PerturbTNP preserves support vectors. The loop in Step 2 stops when either the requested percentage tr of TNPs are found or there maximum number of iterations is exceeded. Parameter tr is expected to be small (e.g., 10%, 30%). Experimental results given in Section 3.6 will show that only a small number of iterations are required for 68 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) the benchmark data sets and even with a small value of tr a high degree of protection is achieved. In algorithm PerturbTNP, Step 5 selects k base points. Intuitively, we can randomly select these base points. However, the number of TNPs captured in each loop is small for random selection. In order to improve the efficiency and convergence rate of algorithm PerturbTNP, especially for very large data sets, we propose another approach based on Skyline queries. Skyline Point: A skyline point is a point that is not dominated by any other point [6]. Given two points p and q, p is said to dominate q if ∀1 ≤ i ≤ d, p(i) ≤ q(i). This means p has smaller value than q in each dimension. Figure 3.2(b) shows an example of skyline point. Points a, b and c all are skyline points. If we select w as the top point, we can use a, c and w to form an umbrella shape like the one formed by p1 , p2 , w in Figure 3.2(a). In algorithm PerturbTNP, we can use the skyline operator to select k base points (Step 5). Instead of finding all skyline points, we only need k skyline points. The advantage of umbrella based on the skyline points is that the umbrella might be wider than random selection, which means more true negative points can be captured in Step 7 in each round. Complexity: The complexity of PerturbTNP is the combination of three parts, i.e., finding skyline points (Step 5), building umbrella (Step 6), and checking containment (Step 7). The complexity to find k skyline points is O(nk). Building umbrella is O(k⌊d/2⌋ ). Containment checking is O(nf ) where f is the number of faces of the umbrella which is exponential in k. For cs rounds selection, the total complexity is O(cs × (nk + k⌊d/2⌋ + nf )). The most significant part of the complexity of PerturbTNP if k⌊d/2⌋ . Let us discuss d and k respectively. First, the dimensionality d is small because the TNP approach is purposely designed to counter global attacks, and recall that global attacks are mainly effective for lower dimensional data sets (Section 3.3.2 and paper [9]). Second, the number of skyline points k if much less than the number of data points n. Bennett et al. find support vectors geometrically in [5]. Their approach needs to build convex hull over the whole data set and the complexity is O(n⌊d/2⌋ ). In contrast, PerturbTNP finds a subset of TNPs instead of support vectors. The number of base points (i.e., k) is significantly smaller than the number of points in D (i.e., n). For a data set that contains hundreds of thousands of data points, we only need to set k less than 100. Therefore, PerturbTNP is practical and scalable even though there is a exponential part k⌊d/2⌋ in its complexity. In practice, we can use the algorithm PerturbTNP to find and perturb TNPs first and then use POT to transform the perturbed data with orthogonal transformation. Because both the TNP perturbation and POT preserve 69 3.5. Providing Extra Protection for Linear SVM: True Negative Point (TNP) support vectors, we have the following corollary: Corollary 3.1. Given a linearly separable data set D, the combination of the TNP perturbation and POT provides the NOC guarantee. 3.5.2 Linearly Non-Separable Data Sets For linearly non-separable data sets, the two classes of data points intersect. No hyperplane can separate the two classes. Therefore, Proposition 3.1 does not hold for linearly non-separable data sets. Algorithm 3.2 (PerturbTNP) cannot be applied on such data sets directly. Since it is hard to derive purely geometric definition for support vectors for linearly non-separable data sets [10], we propose a heuristic to find negative points, which might include true negative and false negative points. From the original data set, our heuristic tries to extract a linearly separable subset, on which the algorithm PerturbTNP can then be applied. We define a shifting factor to control the selection of this subset. We explain the details via the following example. Figure 3.4 shows an example of a linearly non-separable data set. Given two classes I and II, we first find the centre of each class (e.g., A is the center of class I and B is the centre of class II). By connecting A and B, we can get a line AB. L is perpendicular to line AB and passes through the centre of L3 Class I A L4 d B L0 Class II L2 d1 L1 L Figure 3.4: Linearly Non-Separable Data Sets line segment AB. To create a linearly separable subset, we can first remove all class I points below L and class II points above L. However, we would need to increase the “gap” between the two classes. This is achieved by moving L parallel towards A or B and getting L1 and L2 respectively. The shifting factor, defined below, controls how far L moves. 70 3.6. Experiments Definition 3.7. (Shifting Factor sf) Let L0 be parallel to L and be the tangent to Class I. Let d be the distance between L and L0 and d1 be the distance between L1 and L. We define the shifting factor sf = dd1 . Instead of using the distance d1 , we use the shifting factor to control the position of L1 and L2 to extract linearly separable subsets. We use the same shifting factor sf for both classes. By deleting all Class I points below L1 and all Class II points above L2 in Figure 3.4, we get a linearly separable subset Ds of D (with no points in between L1 and L2 ). Algorithm PerturbTNP is then applied to Ds directly. Once the identified TNPs from Ds have been perturbed, we add back all the points deleted by the above heuristic and then apply POT to the resultant data set. Even though the NOC guarantee is not provided, the experimental results below will show the effectiveness of this heuristic in minimizing the outcome change. 3.6 Experiments We run our experiments on data sets from the UCI collection [29], including Forest CoverType, Boston Housing, Credit and WDBC (i.e., Wisconsin Breast Cancer (Diagnostic)). In this chapter, we deal with numerical attributes for all data sets. For WDBC, we use the first 10 numerical attributes that are the mean of the real-valued feature for the cell nucleus. Table 3.2 shows basic information about each data set. # of Numeric Attrs # of Points WDBC 10 569 Housing 10 509 Credit 4 690 CoverType 10 581012 Table 3.2: Data Sets All these data sets are linearly non-separable. In our experiments, we use the ratio of the number of data points located between line L3 and L4 (Figure 3.4) to the number of data points in a data set to measure the degree of overlap between the two classes. Lines L3 and L4 are parallel to line L and are tangent to Class II and Class I respectively. We observed that all data sets, except WDBC, are heavily overlapped (i.e., the ratio discussed above is greater than 0.9). We evaluate the POT approach and the TNP perturbation with the curve fitting attack and the global attack. The radius δ (Definition 3.1) of the knowledge point is 2% of the 1% trimmed range of each domain, i.e., the range without the 0.5% biggest values and the 0.5% smallest values. 71 3.6. Experiments In our experiments, we used linear regression, spline fitting and polyline fitting as curve fitting attacks. All of them showed similar results. In this section, we mainly show results for linear regression. We varied the radius δ for knowledge points, and observed that the trend of the results were the same as for δ is 2% of the trimmed range. The experiments were implemented in MATLAB. We ran the experiments on an Intel Pentium PC with 3GHz CPU and 2GB RAM. 3.6.1 ICA Attack Independent component analysis (ICA) [15] is considered as the attack model to reconstruct the original data from the transformed data while a hacker has no prior knowledge in [8]. The standard deviation of the difference between the reconstructed data and the original data is used to measure the protection level. Figure 3.5(a) shows the results for the ICA (a) WDBC (b) Housing Figure 3.5: ICA Attack attack for the WDBC data set. In the figure, there are two groups of bars and each group contains four bars. The first group of bars show the minimum standard deviation of the difference between the reconstructed data and the original data for all attributes. The second group of bars show the average standard deviation of the difference. For each group of bars, the first bar shows the result for rotation and the other three bars show the results for POT with correlation threshold as 0.6, 0.3, 0.1 respectively. The results show that there is no significant difference between the rotation and POT regarding the protection against the ICA attack. The POT approach actually yield a little high difference than the random rotation approach, which means the POT approach provides more protection against the ICA 72 3.6. Experiments attack than the approach proposed in [8]. Figure 3.5(b) shows the results for the Housing data set and we can see results similar to those in Figure 3.5(a). 3.6.2 Knowledge Points In this sub-section, we show a simple way for a hacker to gain knowledge points defined in Definition 3.1 and present the experimental results showing that how a hacker could use those knowledge points to crack the random transformations. Figure 3.6 shows experimental results about the crack ability if a hacker maps statistics (such as the maximum, the minimum, the median, and the quantiles) of the transformed domain to the original domain. The data sets are transformed by random rotation transformations 6 . (a) WDBC (b) Credit Figure 3.6: Sources of Knowledge Points (Kps) We can see from Figure 3.6(a) that the effectiveness of knowledge points (w.r.t. crack percentage) varies across the attributes. For attribute #4 of WDBC, even if the hacker only maps the maximum and the minimum of the transformed domain to the original domain, she can still crack about 30% of the values. If she also has the knowledge of the quantiles of the original domain, the crack percentage can go up close to 38% for attribute #4. Similar results for the Credit data set are shown in Figure 3.6(b). 6 Experimental results show that random rotations [8] and random orthogonal transformations [21] have the similar behaviour on the crack analysis. Due to space requirements, we only show experiments for random rotations in the following parts of Section 3.6 when we talk about random transformations. 73 3.6. Experiments 3.6.3 Effectiveness of POT Against Curve Fitting Attacks Figure 3.7(a) shows the crack percentage of the attribute that has the highest crack percentage in each data set. We set the number of knowledge points to 4. The threshold of correlation coefficient is set to 0.6. The correlations for different transformations are shown in Table 3.3. We can see that the crack percentage is very high for random rotations. The reason is that the correlation ρ is high (e.g., the crack percentage for WDBC is about 60% and the correlation is 0.99). POT can significantly reduce both the correlation and the crack percentage. All of the crack percentages are less than 10% for POT. (a) Highest Crack Percentage Attributes (c) CoverType (b) WDBC (d) CoverType:#10 Figure 3.7: Privacy Analysis of POT vs. Rotation Figure 3.7(b) shows the crack percentage for each attribute of WDBC and Figure 3.7(c) shows the crack percentage for each attribute of Cover74 3.6. Experiments Random Rotation POT WDBC 0.99 0.36 Housing 0.74 0.47 Credit 0.98 0.32 CoverType 0.71 0.11 Table 3.3: Correlations Type. Generally, we can see that POT provides significantly more protection than rotation. For all attributes with high crack percentage in rotation transformations (e.g., attributes #1,#3, #4 in WDBC and #6 and #10 in CoverType), the crack percentage for POT drops significantly. For other attributes, both the correlation coefficient and the crack percentage are already very small even for random rotations, which means there is not much room for improvement. Figure 3.7(d) shows the result for attribute#10 of CoverType with different correlation thresholds. Generally, the smaller the correlation threshold ρ0 , the smaller the crack percentage. Notice that when ρ0 is below 0.6, the resulting drop in crack percentage becomes insignificant. We also get similar results for other data sets. Impact of Knowledge Points Figure 3.8 shows the crack percentage for attribute #4 of WDBC against the number of knowledge points the hacker has. The graph shows that if we only transform the data with random rotations, the crack percentage increases rapidly with the increasing of the number of knowledge points. However, POT makes it harder for the hacker to guess out more data even when she has more knowledge points. Figure 3.8: Impact of the Number of Knowledge Points Figure 3.8 also confirms the effectiveness of the combination of POT 75 3.6. Experiments and the TNP perturbation. The privacy protection provided by POT will not be hurt by combining the TNP perturbation. Actually, POT and the TNP perturbation together could provide more protection than POT only to counter curve fitting attacks. Recall that the TNP perturbation is designed to counter global attacks. Experimental results about how the TNP perturbation counters global attacks will be shown in the next sub-section. Impact of Radius δ We introduce a radius δ in the definition of knowledge points (Definition 3.1). Figure 3.9 shows the crack percentage w.r.t. different radius δ for attribute #4 of WDBC with 4 knowledge points. All above results are based on the radius δ to be 2% of the domain range. In Figure 3.9, we vary the radius from 2% to 10% of the domain range. For each radius, we always have lower crack percentage (higher protection level) for the POT approach than the random rotation approach. Figure 3.9: WDBC : Radius δ Runtime Table 3.4 shows the experimental results of the cost of POT-GenMatrix on the large data set (i.e., CoverType) in terms of running time and the number of iterations. From empirical studies, even for very small correlation threshold (e.g., ρ = 0), the running time of POT-GenMatrix is less than 2 seconds. The total number of iterations of Step 5 is 120. For any single attribute, Step 5 loops less than 20 times. From the crack analysis in Figure 3.7 and the cost of algorithm POTGenMatrix shown in Table 3.4, we can conclude that POT-GenMatrix significantly enhances the protection levels on the original data for a very small cost. 76 3.6. Experiments ρ0 running time(secs) total number of iterations 0.9 1.7 2 0.6 1.75 4 0.3 1.78 15 0.1 1.84 36 0 1.9 120 Table 3.4: Cost of POT-GenMatrix 3.6.4 Effectiveness of TNP Against Global Attacks Global Attacks Figure 3.10 shows the results of a global attack on WDBC transformed using POT with the TNP approach incorporated for various values of the TNP ratio tr . We set shifting factor sf to be 0.1 and the percentage of TNPs (i.e., tr ) to be 0%, 10% and 20%. We can see that the crack percentage drops quickly for all the attributes with TNPs. Even for tr = 10%, there is a substantial reduction in crack percentage. However, the difference of the crack percentage between tr = 10% and tr = 20% is not much. Therefore, for Algorithm 3.2 (PerturbTNP ), we do not need large values of the parameter tr for reducing the crack percentage significantly. Figure 3.10: Global Attack on WDBC Table 3.5 shows the crack percentage under global attack for all the data sets. Based on the results shown in Table 3.5, we can see the TNP perturbation is very effective for countering global attacks relative to random rotations. The more the true negative points, the smaller the crack percentage. In all data sets, the inclusion of TNPs yields significant reduction in the crack percentage. For Credit, the crack percentage is still 24% even when there are 20% TNPs. The reason is that Credit is a heavily overlapped data set, i.e, the 77 3.6. Experiments Data Set WDBC Housing Credit CoverType Random Rotation 39% 34% 48% 36% POT without TNP 39% 34% 48% 36% POT with TNP tr = 10% tr = 20% 7.2% 7.1% 7.9% 6.5% 35% 24% 17% 11% Table 3.5: Global Attacks on All Data Sets overlapped part of the two classes is big in Figure 3.4. For sf = 0.1, there are only about 200 points in Ds . Therefore, 20% TNPs means we only have 40 of them, which is too few with respect to the whole Credit data set containing 690 data points. In our experiments, we observed that if sf is 0 and tr is 40%, we can get 150 TNPs and the crack percentage drops to 12%. We also evaluate the cost of algorithm PerturbTNP. For small data sets (e.g., WDBC), it takes about 10 seconds to find tr = 20% TNPs. For a large data set like CoverType, if the shifting factor sf = 0.1 and tr = 10%, it takes about 200 seconds to find TNPs. Given the effectiveness of the TNP perturbation against global attacks, we feel that the time taken is more than justified. Impact of Poor Knowledge Points on Global Attacks Recall that A hacker needs d domain knowledge points to conduct a global attack on the transformation matrix (Definition 3.4). However, there is no guarantee on the quality of the hacker’s knowledge points. If a knowledge point is poor (Definition 3.6), it can dramatically limit the crack power of the hacker. Figure 3.11 shows the experimental results of the impact of poor knowledge points for WDBC. In the experiments, we chose poor knowledge points at a distance between 2% and 10% of the trimmed range from the actual values. In the real word, a poor knowledge point might be far more away from the actual value. Figure 3.11 shows that even a single poor knowledge point can make the crack percentage drop almost half on all attributes. The crack percentage further drops to around 10% if there are two poor knowledge points. And further, if 20% TNPs are used to perturb the data, the crack percentage drops to around 5%. From experimental results we can know that even a single or two poor knowledge points can dramatically bring down a hacker’s crack ability. Meanwhile, the higher the dimensionality of the data sets, the more chance that 78 3.6. Experiments Figure 3.11: Impact of Poor Knowledge Points (WDBC) a hacker might get poor knowledge points. Therefore, global attacks and the TNP approach are mainly useful for low dimensional data. For high dimensional data, POT by itself can be very effective. 3.6.5 Minimizing the Outcome Change Against Global Attacks The other part of experiments aims at evaluating the outcome change caused by the TNP perturbation in linearly non-separable data sets (Recall that the POT approach provides the NOC guarantee for both linear and non-linear SVMs, and on linearly separable data sets the TNP approach provides the NOC guarantee). After the data custodian gets the transformed support vectors from the mining service provider, she decodes the transformed support vectors to original values and builds a classifier, which is called the classifier after the TNP perturbation. The classifier directly built on the original data set is called the classifier before the TNP perturbation. For each data set, we measure the classification accuracy (rac ) before and after the TNP perturbation. We also measure the disagreement between the classification results before and after the TNP perturbation. Disagreement (rdg ) is the ratio of the number of points that have been classified into different classes by the classifiers built before and after the TNP perturbation. Recall that tr denotes the TNP ratio and sf denotes the shifting factor (Definition 3.7). Table 3.6 shows classification accuracy and disagreement before and after the TNP perturbation. From previous studies ([4, 35]), it is computationally prohibitive to train SVM on CoverType. Therefore, we use data sets WDBC, Housing and Credit for this set of experiments. 79 3.6. Experiments Without TNP : original rac tr With TNP 10% Perturbation 20% WDBC 91.9% rac rdg 91.5% 0.8% 91.2% 1% Housing 86.3% rac rdg 85.6% 2% 84.2% 3% Credit 82.1% rac rdg 80.5% 4% 77.6% 8% Table 3.6: Measurement of Classification The first part of Table 3.6 shows the accuracy rac of the classifiers before the TNP perturbation. The second part of Table 3.6 shows the accuracy and disagreement of the classifiers after the TNP perturbation. The shifting factor sf is set to 0.1 as for experiments shown in Table 3.5. For the WDBC data set, even with 20% TNPs, the accuracy (rac ) is 91.2%, which is still very close to the original classification accuracy 91.9% with 1% disagreement (rdg ). There are more disagreements for Housing and Credit data sets than WDBC. The reason is that the two classes are heavily overlapped in both Housing and Credit data sets. However, even for the Credit data set, the accuracy changes from 82.1% to 80.5% with 10% TNPs. The outcome is not significantly affected. The overlap part of a data set should be small for linear SVM to be effective [5], which our TNP perturbation approach is designed for. Using non-linear SVM on heavily overlapped data sets might call for new perturbation techniques for the TNP. This is an interesting direction for future work. TNP v.s. Random Perturbation Even though the TNP approach could provide the NOC guarantee for linear SVM on linearly separable data sets, in the case of linearly non-separable data sets, it changes the optimal classifiers. Recall that random perturbation typically changes the optimal classifiers. We compare the TNP approach and random perturbation with respect to classification accuracy for linearly non-separable data sets and show the experimental results in Figure 3.12. The x-axis shows the standard deviation of the added noise, which varies from 10% to 30% of the domain range. The y-axis shows the classification accuracy. The top solid line shows the accuracy of the classifier on the original data without adding noise. The two dashed lines show the classification accuracy for the TNP approach. The dot line with the cross marker shows the result for the random perturbation approach. We can see that the TNP approach always has significantly higher classification accuracy than 80 3.6. Experiments Figure 3.12: Classification Accuracy : TNP v.s. Random Perturbation the random perturbation approach. 3.6.6 Intended Usage of POT and TNP One of the motivations to outsource data mining in the data custodian model is to leverage not only the computational resources but also the data mining expertise offered by the service provider. Given the POT approach and the TNP perturbation techniques developed in this chapter, a natural question to ask is how they are meant to be used by a data custodian. In particular, while the custodian does not need to possess sophisticated expertise related to SVM, are these transformations likely to demand some other expertise from the custodian and if so how complicated can that get? In this section, we address this question. Recall that both POT and TNP require certain parameters to be set for their application (e.g., correlation coefficient ρ for POT and ratio tr for the TNP perturbation)7 . Clearly, choosing these values without any care might result in a transformation whose privacy protection could be poor. Does the custodian have to do any complicated analysis of the data in order to choose the “right” values ρ and tr ? We conduct a comprehensive experimental analysis of both POT and TNP and discuss our findings in the preceding sub-sections. Based on that, we recommend that the custodian can choose the default values of ρ = 0.6 and tr = 10%. Our empirical analysis shows that with this setting, the crack percentage will be less than 10% for all the data sets we tested. Should the 7 Other parameters for the algorithm PerturbTNP are internal parameters and they will not affect the privacy protection. 81 3.7. Conclusion and Future Work custodian wish to adjust these parameters, doing so is far simpler than choosing the right values of parameters for SVM learning. For example, choosing a lower value of ρ will have the effect of less correlation between the original data and the transformed data and thus more protection. Similarly, choosing a higher value of tr will have the effect of more perturbation and more protection. 3.7 Conclusion and Future Work In this chapter, we study privacy preservation in outsourcing SVM classification. Our goal is to protect input and output privacy while minimizing the outcome change. We propose two approaches, viz., POT and TNP, that accomplish this goal. We first show that POT provides the no-outcome-change (NOC) guarantee for both linear and non-linear SVM. Thus, the data mining service provider could train the SVM classifier on the transformed data (including kernel selection, parameter tuning, etc.), which saves the data custodian the need to acquire substantial expertise. The algorithm POT-GenMatrix monotonically reduces the correlation between the original values and the transformed values to a given threshold. Empirical studies show that POT significantly reduces the crack percentage by reducing the correlation at small computational cost. We show the surprising result that if a hacker has d domain knowledge points for an attribute, the crack percentage achievable by global attacks is independent of the transformation. To counter global attacks, we develop the TNP approach. The TNP approach (i.e., the algorithm PerturbTNP) finds and perturbs (a subset of) true negative points. We prove that the TNP approach provides the NOC guarantee for linear SVM on linearly separable data set. For linearly nonseparable data set, we propose a heuristic to extract a linearly separable subset, to which PerturbTNP can be applied. A shifting factor sf is defined to control the size of the linearly separable subset that is extracted from the original data set. Experimental results show that the TNP approach works very well to counter global attacks and significantly reduces the crack percentage without compromising too much classification accuracy. In summary, we propose two approaches, i.e., POT and TNP, to provide the three pillars of privacy preservation in outsourcing of SVM classification. POT provides the NOC guarantee for both linear and non-linear SVM. The TNP approach is designed for linear SVM when a global attack is possi82 3.7. Conclusion and Future Work ble, which is typically true for low dimensional data. Extending the TNP approach to deal with non-linear kernel functions is an interesting open problem. 83 3.8. Bibliography 3.8 Bibliography [1] Aggarwal, C.C., Yu, P.S.: A Condensation Approach to Privacy Preserving Data Mining. Proc. of EDBT 2004. [2] Agrawal, D., Aggarwal, C.C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. Proc. of PODS 2001, pp. 247 – 255. [3] Agrawal, R., Srikant, R.: Privacy preserving data mining. Proc. of SIGMOD 2000, pp. 439–450. [4] Bakir, G.H., Bottou, L., Weston,J.: Breaking SVM Complexity with Cross-Training. Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, USA (2005) , pp. 81–88. [5] Bennett, K.P., Bredensteiner, E.J.: Duality and Geometry in SVM Classifiers. Proc. of 17th ICML, 2000,pp. 57–64. [6] Borzsonyi, S., Kossmann,D., Stocker, K.: The Skyline Operator. Proc. of ICDE 2001, pp. 421 – 430. [7] Bu, S., Lakshmanan, L.V.S., Ng, R. T., Ramesh,G.: Preservation of Patterns and Input-Output Privacy. Proc. of ICDE 2007, pp. 696–705. [8] Chen, K., Liu, L.: Privacy Preserving Data Classification with Rotation Perturbation. Proc. of ICDM 2005,pp. 589–592. [9] Chen, K., Sun, G., Liu, L.: Toward Attack-Resilient Geometric Data Perturbation. SIAM Data Mining(SDM) 2007. [10] Cortes, C., Vapnik,V.: Support-Vector Networks. Machine Learning, v.20,1995, pp. 273–297. [11] Du, W., Teng, Z., Zhu, Z.: Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification. Proc. of. SIGMOD, 2008, pp. 459–472. [12] Evfimievski, A., Gehrke, J., Srikant, R.: Limiting Privacy Breaches in Privacy Preserving Data Mining. Proc. of PODs, 2003, pp. 211–222. [13] Fung, B.C.M., Wang, K., Yu, P.S.: Top-Down Specialization for Information and Privacy Preservation. Proc. of ICDE 2005, pp. 205–216. 84 3.8. Bibliography [14] Harshbarger, R.J., Yocco, L.S.: College Algebra in Context. AddisonWesley, 2006. [15] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley-Interscience, 2001 [16] Iyengar, V.S.: Transforming Data to Satisfy Privacy Constraints. KDD 02. [17] Hays, W.L., Winkler, R.L.: Statistics: Probability, Inference, and Decision. Holt, Rinehart & Winston,1970. [18] Huang, Z., Du, W., Chen, B.: Deriving Private Information from Randomized Data. Proc. of SIGMOD 2005, pp. 37–48. [19] Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. Proc. of ICDM 2003, pp. 99-106. [20] Lefevre, K., DeWitt, D.J., Ramakrishnan, R.: Anonymization. KDD 06. Workload-Aware [21] Liu, K., Giannella, C., Kargupta, H.: An Attacker’s View of Distance Preserving Maps for Privacy Preserving Data Mining. PKDD 2006. [22] Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: L-Diversity: Privacy Beyond K-Anonymity. Proc. of ICDE 2006. [23] Martin, D.J., Kifer, D. Machanavajjhala, A., Gehrke, J., Halpern, J.Y.: Worst-Case Background Knowledge in Privacy. Proc. of ICDE 2007, pp. 126–135. [24] Oliveira, S.R.M., Zaı̈ane, O.R.: Privacy Preserving Clustering by Data Transformation. Proc. of the 18th Brazilian Symposium on Databases 2003, pp. 304–318. [25] Rizvi, S., Haritsa, J.: Maintaining Data Privacy in Association Rule Mining. Prov. of VLDB 2002, pp. 682–693. [26] Strassen, V.: Gaussian Elimination Is Not Optimal. Numerische Mathematik, vol.13, 1969, pp. 354–356, [27] Sweeney, L.: K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 5, p.557-570, 2002. 85 3.8. Bibliography [28] Trefethen, L.N., Bau, D.: Numerical Linear Algebra. SIAM, Philadelphia, 1997 [29] UC Irvine Data Set Collection. http://archive.ics.uci.edu/ml/datasets.html [30] Wang, K., Yu, P.S., Chakraborty, S.: Bottom-Up Generalization: A data Mining Solution to Privacy Protection. Proc. of ICDM 2004, pp. 249–256. [31] Wong, W.K., Cheung, D.W., Hung, E., Kao, B., Mamoulis, N.: Security in Outsourcing of Association Rule Mining. Proc. of VLDB 2007, pp. 111–222. [32] Yang, Z., Zhong, S., Wright, R.N.: Privacy-Preserving Classification of Customer Data without Loss of Accuracy. In Proc. of the 2005 SIAM International Conference on Data Mining (SDM)2005 [33] Yu, H., Jiang, X., Vaidya, J.: Privacy-Preserving SVM using Nonlinear Kernels on Horizontally Partitioned Data. Proc. of ACM SAC Data Mining Track, 2006, pp. 603 –610. [34] Yu, H., Vaidya, J., Jiang, X.: Privacy-Preserving SVM Classification on Vertically Partitioned Data. Proc. of PAKDD 2006, pp. 647–656. [35] Yu, H., Yang, J., Han, J.: Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Proc. of ACM KDD 2003, pp. 305–315. 86 Chapter 4 Extension to POT8 4.1 Introduction Privacy preservation is a key issue in the data-mining-as-a-service model [6, 11]. In this model, there are two scenarios, i.e., the data collector scenario and the data custodian scenario [3]. In the data collector scenario, individual data owners submit their data to a data collector to conduct data mining tasks [2]. In the other scenario, a data custodian, who has the responsibility to protect the data, outsources data mining tasks to an outside data miner [3, 18]. For both scenarios, a perturbation or transformation approach needs to be applied on the original data to protect the inherent private information. Random perturbation is the dominant approach to protect the privacy in the data collector scenario [1, 2, 8]. Random noise drawn from a given distribution is added to the original data. In order to provide more protection, the approach needs more noise, and at the same time there will be more change on the data mining outcome [2]. In the data custodian scenario, several studies show that it is possible to preserve the mined patterns in outsourcing of association rule mining [18], kNN computation [20], etc. There are three pillars to privacy preservation, i.e., the input privacy, the output privacy, and minimizing the outcome change [3]. The input privacy aims to protect the private information inherent in the original data. The output privacy is to disguise the mined patterns to avoid malicious utilization. While protecting the privacy, we also want to derive meaningful data mining results, which means to minimize the data mining outcome change after transformation. After we apply a transformation on the original data, if we could still get the same data mining patterns as the patterns mined from the original data, we say that this transformation provides the no-outcome-change (NOC) guarantee. 8 A version of this chapter will be submitted for publication: Bu, S., Lakshmanan, L. V.S., and Ng, R. T. : Enhancing Privacy Preservation with Prior Knowledge for Classification 87 4.2. Related Works In this chapter, we focus our study on privacy preservation for classification, especially for the SVM classification and linear classification models, which include linear regression, logistic regression, etc. The orthogonal transformation has been shown to provide the NOC guarantee for the SVM classification and a principled orthogonal transformation (POT) has been proposed to generate orthogonal transformations in chapter three. In this chapter, we extend POT to other linear classification models and make the following contributions. Firstly, we investigate the POT approach to enhance the protection on privacy in the worst cases. Generally, the POT approach could provide high level protection by reducing the correlation between the original values and the transformed values. However, it is possible that a hacker could still guess out a significant number of values based on her knowledge. The reason is that even when the correlation between the original values and the transformed values in a domain is low there might exist subsets of the original values that still have high correlation with the corresponding transformed values. We propose a heuristic to further break down the correlations between the original values and the transformed values among those subsets. Secondly, we extend the POT approach to linear classification models. We show that the orthogonal transformation also provides the NOC guarantee for linear classification models including linear regression, ridge regression and logistic regression. Thirdly, we propose to combine the POT approach and the random perturbation approach to transform the original data. After we use random perturbation to add noise on the original data, we use POT to generate an orthogonal transformation to transform the perturbed data. For a given desired level of privacy protection, the combination of POT and random perturbation needs less noise than random perturbation alone and thus leads to a reduction in the outcome change. Finally, we conduct a comprehensive set of experiments to evaluate the proposed approaches. 4.2 Related Works Private information might be breached if the original data is not protected while data owners submit their data to an outside data mining service provider [6, 17]. In order to protect privacy, researchers have proposed different approaches to transform the data. Several studies show that it is possible to generate a transformation that could preserve both the privacy 88 4.3. Preliminaries and the data mining patterns. Wong et al. propose an approach based on one-to-n mapping for secure outsourcing of association rule in [18]. Molly et al. show that the one-to-n mapping can be reduced to a one-to-one mapping and can be broke under the attacks with known frequency knowledge in [13]. Wong et al. propose an audit approach for frequent item mining in [19], and an approach that preserves the orders of the distance between data points for kNN computation in [20]. We propose piece-wise transformation for decision tree classification in [3]. All these methods provide the NOC guarantee while preserving privacy. However, these methods are not suitable for the SVM classification and linear classifications. Chen et al. use random rotation to transform the original data for the SVM classification [4, 5]. Random rotation is vulnerable to curve fitting attacks while a hacker has prior knowledge. The POT approach proposed in chapter three generates an orthogonal transformation to protect privacy in outscoring of the SVM classification. POT is effective to counter attacks with prior knowledge. Random perturbation is a widely studied approach to perturb the original data by adding random noise [1, 2, 7, 8, 10, 12, 21]. Data owners only submit the perturbed data to an outside data miner. The original data is protected by the added noise. The more noise added, the more protection on the private information, meanwhile the more change on the data mining outcome. Prior knowledge is not counted in the privacy analysis for random perturbation in previous works. In this chapter, we evaluate the random perturbation approach against attack models with prior knowledge and apply POT on random perturbation to enhance protection and minimize the outcome change. 4.3 4.3.1 Preliminaries Attack Models with Prior Knowledge We introduce attack models with prior knowledge in this section. As discussed in the previous chapters, there are various sources for a hacker to gain prior knowledge about the original data. We abstract all these forms of the prior knowledge to an abstract notion of knowledge points as defined below. Definition 4.1. (Domain Knowledge Point) Given a value u in attribute i of the original data set D and the corresponding value u′ in the transformed data set D ′ , a hacker believes that u′ is transformed from w. We say that 89 4.3. Preliminaries (w, u′ ) is a domain knowledge point if |w − u| ≤ δi , where δi is a given parameter for attribute i. The parameter δi defines the distance between the value w and the real value u. In other words, the hacker’s knowledge (i.e., the value w) is within a radius of δi from the true original value. This radius, δi , is defined as a fraction of the domain range of attribute i. For example, we can define δi to be two percent of the domain range. A hacker could make a guess on the original data based on her prior knowledge. We define a crack using the same radius δi as follows: Definition 4.2. (Crack) For a value u′ in the transformed data set D ′ , if the value guessed by a hacker’s crack function g falls in a radius δi from the actual value u, i.e., |g(u′ ) − u| ≤ δi , we call the guess g(u′ ) is a crack. The domain disclosure risk of attribute i is the fraction of the number of cracks over the number of values in attribute i of D. In this chapter, we consider the curve fitting attack as the attack model with prior knowledge. Definition 4.3. (Curve Fitting Attack) Suppose a hacker has k domain knowledge points, i.e., (w1 , u′1 ), ..., (wk , u′k ), for an attribute. The hacker can fit these knowledge points into a crack function g by a curve fitting approach to crack other transformed values in the same attribute 4.3.2 POT : Principled Orthogonal Transformation As shown in the previous chapter, the POT approach could enhance the protection level on private information. POT is a method based on the Gram-Schmidt procedure ([15]) to iteratively and monotonically reduce the correlation between the original values and the corresponding transformed values. In the Gram-Schmidt procedure, a d × d orthogonal matrix Q is computed from a linearly independent set of vectors A = {A(:,1) , ..., A(:,d) } [15]. Let us use Qj−1 to represent the first j − 1 vectors in Q, i.e., Qj−1 = {Q(:,1) , · · · , Q(:,j−1) }. Let us use Pj to denote (I − Qj−1 QTj−1 ). The method to compute the j th vector of Q is: T(:,j) = (I − Qj−1 QTj−1 ) × A(:,j), Q(:,j) = T(:,j) ||T(:,j) || (4.1) 90 4.4. Optimizing POT : POT for Linear Sub-groups Given a data set D and its j th attribute D(j,:) , the transformed data set9 ′ is D ′ = QT × D and the transformed j th attribute is D(j,:) = (Q(j,:) )T × D. The correlation between the original attribute D(j,:) and the transformed ′ attribute D(j,:) is: ρ2j = ′ cov 2 (D(j,:) , D(j,:) ) ′ std2 (D(j,:) )std2 (D(j,:) ) = (A(:,j) )T × S1 × A(:,j) (A(:,j) )T × S2 × A(:,j) (4.2) where S1 = PjT × C(:,j) × (C(:,j) )T × Pj (4.3) S2 = std2 (D(j,:) ) × PjT × C × Pj (4.4) The matrix C is the covariance matrix of the original data D, i.e., C(i,j) = cov(D(i,:) , D(j,:) ), 1 ≤ i, j ≤ d. The POT approach looks for a new vector Ã(:,j) = A(:,j) + ~v that can produce a smaller correlation by solving the following inequality, which has been shown in chapter three: AT(:,j) S1 A(:,j) + 2S1 A(:,j)~v + ~v T S1~v AT(:,j) S2 A(:,j) + 2S2 A(:,j)~v + ~v T S2~v < ρ2j 2S1 A(:,j)~v + ~v T S1~v < 2ρ2j S2 A(:,j)~v + ρ2j ~v T S2~v 4.4 (4.5) Optimizing POT : POT for Linear Sub-groups In chapter three, experimental results have shown that POT is effective to counter curve fitting attacks with prior knowledge, which means the average crack percentage has been reduced substantially to a low level. However, we found that the crack percentage might be still high in particular settings of knowledge points. The reason is that there exist subsets of the values that still have high correlation with the corresponding transformed values even when the correlation of the whole domain is low. Let us study an example first. 9 The vector generated by the Gram-Schmidt procedure is a column vector. In order to use a generated column vector to transform data, we use QT as the transformation matrix. Notice that QT is also an orthogonal matrix. 91 4.4. Optimizing POT : POT for Linear Sub-groups 4.4.1 A Motivating Example Transformed Values Transformed Values In Figure 4.1 (a) and (b), the x-axis shows the original values and the y-axis shows the transformed values. Each point in the figure is a pair of an original value and the corresponding transformed value. Thus the distribution of the points in the space shows the relationship between the original values and the transformed values. The two transformations shown in Figure 4.1(a) and (b) yield the same correlations (|ρ| = 0.6) between the original values and the corresponding transformed values. Original Values (a) Without Linear Sub-groups ρ = −0.91 ρ = 0.92 Original Values (b) With Linear Sub-groups Figure 4.1: Examples of Transformations From Figure 4.1(a) we can see that all the points are scattered in the space. However, in Figure 4.1(b), we find two linear sub-groups. The correlation coefficient for one sub-group (marked as ‘dot’) is 0.92 and the correlation coefficient for the other sub-group (marked as ‘circle’) is -0.91. Both of the two sub-groups are highly correlated. If a hacker’s knowledge points all fall in the ‘dot’ group, or all fall in the ‘circle’ group, she might be able to crack a significant number of values. On the contrast, this situation will not happen for the transformation in Figure 4.1(a). 4.4.2 A Heuristic : POT for Linear Sub-Groups We do not want to use an orthogonal transformation that will yield highly correlated linear sub-groups like the one shown in Figure 4.1(b). In order to generate transformations like the one shown Figure 4.1(a), we propose an approach to break down the correlations for linear sub-groups. We use the linear grouping algorithm (the Lga method [14]) to check whether a 92 4.4. Optimizing POT : POT for Linear Sub-groups linear sub-group exists. The Lga method assigns values into linear groups by minimizing the squared orthogonal residuals within each group [14]. The Lga method needs to know the number of groups (i.e., clusters). The GAP statistics is used to estimate the number of clusters by checking the decrease of the ‘within clusters dispersion’ while increasing the number of clusters K [16]. The proposed procedure is described as follows: • POT : generating orthogonal transformation. The POT approach generates a vector, which is a column of the generated orthogonal transformation, for each attribute respectively. Once a vector has been generated for an attribute, we could form a two dimensional data set containing the original values and the transformed values for this attribute. • GAP : estimating the number of clusters. We use the GAP statistics to estimate the number of clusters in the formed two dimensional data set for this attribute. • Lga : finding linear groups. If there is more than one cluster found by GAP, we apply the Lga method to form linear sub-groups. • Breaking down correlations for linear sub-groups: We check the correlation for each sub-group. If there exists a subgroup that has higher correlation than the predefined threshold, then this sub-group does not satisfy the privacy requirement. We propose a heuristic to break down the high correlations in linear sub-groups to further enhance the protection on the original data. Let us assume that there are K linear sub-groups found by Lga for the j th attribute after transformation. For the lth (1 ≤ l ≤ K) linear sub-group, let us use idxl to represent the indices of the values that the lth sub-group contains in the j th attribute. For example, suppose an attribute contains 10 values. After transformation, we find two linear sub-groups. The first subgroup contains the first four values and the second sub-group contains the left over six values. Then the indices for the first group is idx1 = {1, 2, 3, 4} and the indices for the second group is idx2 = {5, 6, 7, 8, 9, 10}. The original values of the lth linear sub-group in the j th attribute can be represented by D(j,idxl ) . The correlation matrix for this sub-group can 93 4.4. Optimizing POT : POT for Linear Sub-groups be calculated by C l = Cov(D(i,idxl ) , D(j,idxl ) ), 1 ≤ i, j ≤ d. We have similar expressions to Equation 4.3 and Equation 4.4 for the lth linear sub-group and they are: l l S1l = PjT × C(:,j) × (C(:,j) )T × Pj S2l = std2 (D(j,idxl ) ) × PjT × C l × Pj (4.6) (4.7) The POT approach generates a vector by repeatedly solving Inequality 4.5 until the correlation is less than the given threshold. Similar to Inequality 4.5 for the whole attribute, we could have an inequality for the lth linear subgroup as follows: 2S1l A(:,j)~v + ~v T S1l ~v < 2ρ2j S2l A(:,j)~v + ρ2j ~v T S2l ~v (4.8) We have the above inequality for each linear sub-group. Therefore, we will have K inequalities for all the K linear sub-groups. The question is how many inequalities and which inequalities we need to solve. For a domain containing n values and a sub-group in this domain containing nl values, if a hacker could crack c% of those nl values in this sub-group, then this sub-group contributes nnl × c% to the total cracks for the whole attribute. For example, given a domain with 1000 data values and a linear sub-group containing 10 values, if all the 10 values in this sub-group have been cracked (i.e., c% = 100%), its contribution to the crack percentage 10 for the whole attribute is 1000 × 100% = 1%. We only need to select the sub-groups that have significant contribution to the attribute crack, which is determined by the crack percentage c% and the size of a linear sub-group nl . Therefore, we could select sub-groups by two predefined thresholds in the following way: • Correlation Threshold : the crack percentage c% is positive correlated to the correlation between the original values and the transformed values. Therefore, we only care about the linear sub-groups that have higher correlation than the given threshold between the original values and the transformed values. We call those linear sub-groups vulnerable sub-groups. • Size Threshold : the size of a vulnerable sub-group is defined as the number of values it contains. If the size of a vulnerable sub-group is too 94 4.4. Optimizing POT : POT for Linear Sub-groups small (i.e., nl is small), it might be not necessary to take extra effort to break down the correlation for this sub-group. We only need to pick up the sub-groups that have greater size than a given threshold10 . Among those K linear sub-groups, suppose there are k vulnerable subgroups that have greater sizes than a given threshold, then we have k + 1 inequalities in total, and they are: 2S1 A(:,j)~v + ~v T S1~v < 2ρ2j S2 A(:,j)~v + ρ2j ~v T S2~v 2S11 A(:,j)~v + ~v T S11~v < 2ρ2j S21 A(:,j)~v + ρ2j ~v T S21~v (4.9) . . . . . . . . . 2S1k A(:,j)~v + ~v T S1k~v < 2ρ2j S2k A(:,j)~v + ρ2j ~v T S2k~v The first inequality is for the whole attribute and the other k inequalities are for those k vulnerable sub-groups. By solving this set of inequalities, we could get a solution v and a new vector Ã(:,j) = A(:,j) + ~v , which yields a lower correlation. Algorithm POT-LS We summarize the above procedure in Algorithm POT-LS shown in Figure 4.2. The variable cnt counts how many times the algorithm has checked vulnerable sub-groups and is set to be zero at the beginning of the algorithm. The vector q is generated by the POT approach for attribute j. Attribute j is transformed by q T × D in Step 1. In Step 2, we use the GAP statistics and the Lga method to find linear sub-groups. Step 3 checks whether there exist sub-groups that have greater size than the size threshold S and higher correlation than the correlation threshold ρ0 . If such vulnerable sub-group exists, the algorithm will generate a new vector q for attribute j from Step 3.0 to Step 3.4. In Step 3.0, the counter cnt is increased by 1. In Step 3.1, we select the top k vulnerable subgroups by the descending order of correlation and generate Inequality Set 4.9. We introduce the input parameter k, the number of selected vulnerable sub-groups, to evaluate the efficiency of the algorithm. For example, if there are three vulnerable sub-groups satisfy the selection 10 Data owners or custodians can also choose to select all vulnerable sub-groups, regardless of size. However, in this case, they need to pay the additional effort required to solve the inequalities for all sub-groups 95 4.4. Optimizing POT : POT for Linear Sub-groups Algorithm 4.1: POT-LS Input: D: d-dimensional training data set q: a vector generated by the POT method ρ0 : the correlation coefficient threshold S: the size threshold k: the number linear sub-groups to be broke down L: the total number of loops 0 cnt = 0; ′ 1 Get D(j,:) = qT × D 2 Use the GAP statistics and the Lga method to find linear sub-groups 3 If there exist groups such that 1) sizes are greater than S and 2) correlations are higher than ρ0 3.0 { cnt = cnt + 1; 3.1 Select the top k vulnerable subgroups and generate the inequality set 4.9 3.2 do { Get a ~v by solving Inequality Set 4.9; Set A(:,j) = A(:,j) + ~v ; Compute ρj and ρlj , 1 ≤ l ≤ k by Equation 4.2; } until ρj ≤ ρ0 and ρlj ≤ ρ0 , 1 ≤ l ≤ k 3.3 Compute q by Equation 4.1; 3.4 If cnt ≤ L Goto Step 1 } Figure 4.2: Algorithm POT-LS : POT for Linear Sub-Groups condition, i.e., sizes are greater than S and correlations are higher than ρ0 , the range of k is from 1 to 3. If k = 1, we only select the vulnerable sub-group with the highest correlation. In experiments, we will evaluate the efficiency and the protection level for different values of k. Step 3.2 is a loop to generate a new vector A(:,j) . First, we get a vector ~v by solving Inequality Set 4.9 and update the vector A(:,j) . The correlation is computed by Equation 4.2. Step 3.2 will repeat until both the correlation for the whole attribute (ρj ) and the correlations for the selected k vulnerable subgroups (ρlj , 1 ≤ l ≤ k) are less than the threshold ρ0 . The new vector q is generated in Step 3.3. 96 4.4. Optimizing POT : POT for Linear Sub-groups Step 3.4 checks how many times the algorithm has run. L is an input parameter to define the total number of checks. Each time when we generate a new vector q in Step 3.2, we could only guarantee that the correlations of the selected k vulnerable sub-groups have been reduced to be lower than the given threshold ρ0 . However, there is no guarantee that the new vector q will not yield other vulnerable sub-groups. Therefore, Step 3.4 of Algorithm 4.1 will go back to Step 1 to check vulnerable sub-groups again. The algorithm will stop when there is no vulnerable sub-group or the algorithm has run L times. Let us study an example to show how the algorithm works. We set the correlation threshold to be 0.6 and the size threshold to be 30% of the domain size. Figure 4.3 shows the relationship between the original values and the corresponding transformed values for different transformations. Figure 4.3(a) is the same as Figure 4.1(b), which transforms the original values with an orthogonal matrix generated by POT. The correlation for the whole domain is 0.6, but there are two highly correlated linear sub-groups with correlations as 0.92 and −0.91 respectively. Figure 4.3(b) shows the results for POT-LS when we set the parameter k to be 2. After generating a new matrix, the algorithm goes back to Step 1 and identifies two new linear sub-groups with correlations as −0.71 and 0.62 respectively. Note that the two groups in Figure 4.3(b) and the two groups Figure 4.3(a) contain different original values and transformed values. Even though the PO-LS breaks down the correlations for the two sub-groups shown in Figure 4.3(a) to be lower than the correlation threshold, POT-LS produces another two new linear sub-groups. The conditions in Step 3 are still true and POT-LS will generate another orthogonal matrix. Figure 4.3(c) shows the results for the new generated transformation. The correlation for the whole attribute is 0.15. There is no more linear sub-group identified and the algorithm will stop. Beyond Lga In Algorithm POT-LS, we use the GAP statistics to determine the number of linear sub-groups and use the Lga method to assign values into these sub-groups. In Step 3 of Algorithm POT-LS, we check the correlation and the size of each sub-group. If all sub-groups have lower correlation than the given threshold, the algorithm stops. However, even though a sub-group has lower correlation than the threshold ρ0 , we might be able to find a subset of values in this sub-group that still has higher correlation than ρ0 . Algorithm 4.2 (Lga-Posterior) in Figure 4.4 shows how to find a vulnerable subset in a 97 Transformed Values 4.4. Optimizing POT : POT for Linear Sub-groups ρ = −0.91 ρ = 0.92 Original Values (a) POT ρ = −0.71 Original Values (b) POT-LS : First Round Transformed Values Transformed Values ρ = 0.62 ρ = 0.15 Original Values (c) POT-LS : Second Round Figure 4.3: Examples of Transformations : Algorithm POT-LS linear sub-group. G is a linear sub-group found by the Lga method in Algorithm POT-LS (Figure 4.2). Same to Algorithm POT-LS, ρ0 is the correlation threshold and S is the size threshold. A new variable H is set to equal to G in Step 1. Step 2 is a while loop and the conditions are whether the size of H is greater than S and the correlation is lower than ρ0 . Step 2 repeatedly deletes the point that is farthest from the linear fitting line. The line is formed in Step 2.1 and the farthest point p is located in Step 2.2. We update the variable H by deleting point p in Step 2.3. The last step checks whether the correlation between the original values and the transformed values for H is higher than ρ0 . If H has higher correlation than ρ0 , the algorithm returns 98 4.4. Optimizing POT : POT for Linear Sub-groups Algorithm 4.2: Lga-Posterior Input: G : a sub-group found by Lga ρ0 : correlation coefficient threshold S : the size threshold Output: H : vulnerable sub-group 1 H = G; 2 while size of H is greater than S and the correlation is lower than ρ0 2.1 { Fit a line for H by linear regression; 2.2 find the farthest point p from the fitting line; 2.3 H=H-p } 3 If correlation ≥ ρ0 return H Else return Empty Figure 4.4: Lga-Posterior : Identifying More Linear Sub-Groups ρ = 0.07 Original Values (a) Lga ρ = 0.85 Transformed Values Transformed Values H. Otherwise, the algorithm returns Empty. Figure 4.5 shows an example for Lga-Posterior. After transformed by an orthogonal matrix, we show the original values and the transformed values in Figure 4.5(a). The correlation for the whole domain is 0.07 and no linear sub-group is identified by Lga. We then use Lga-Posterior on the data. The algorithm iteratively removes the points that are far away from the Original Values (b) Lga-Posterior Figure 4.5: Finding Linear Sub-Groups : Lga-Posterior 99 4.4. Optimizing POT : POT for Linear Sub-groups regression line and finally identifies a sub-group of values that has correlation as 0.85. The identified sub-group is marked by circles and locates between the two dashed lines in Figure 4.5(b). The algorithm Lga-Posterior can be inserted after Step 2 of Algorithm POT-LS when Lga has formed sub-groups. Lga-Posterior helps us to further find more vulnerable sub-groups even though the sub-groups found by Lga have lower correlations than the correlation threshold. In the experiments, we will evaluate the effectiveness of Algorithm Lga-Posterior. 4.4.3 Properties of Algorithm POT-LS The first question is whether there is always a solution for Inequality Set 4.9. In other words, whether Step 3.2 could find a solution to make both ρj and ρlj , 1 ≤ l ≤ k to be less than the threshold ρ0 . Since we select the top k vulnerable sub-groups, the total number of inequalities is k + 1 in Inequality Set 4.9. For attribute j, there is always a solution for Inequality Set 4.9 if j + (k + 1) ≤ d. We have the following theorem for Step 3.2 of Algorithm POT-LS. Theorem 4.1. Given a threshold ρ0 , for attributes j : 1, ..., d − 2, if the number of selected sub-groups k satisfies j+(k+1) ≤ d, Step 3.2 of Algorithm POT-LS can always find a vector A(:,j) such that both the correlation for the whole attribute and the correlation for each sub-group are less than the threshold ρ0 , i.e., ρ2j ≤ ρ20 and (ρlj )2 ≤ ρ20 , 1 ≤ l ≤ k. Proof : Firstly, Step 3.2 of Algorithm POT-LS monotonically reduces ρ2j and (ρlj )2 . The new vector ~v comes from a solution of Inequality Set 4.9, which ensures that the new vector A(:,j) can produce smaller correlations. Secondly, we need to show that there is always a solution ~v for Inequality Set 4.9 if ρ2j > ρ20 or ∃l, 1 ≤ l ≤ k, (ρlj )2 > ρ20 . Inequality Set 4.9 is not solvable only when ρ2j and all (ρlj )2 have reached the minimum values, which are actually 0. The reason is as follows: ′ The correlation between D(j,:) ) and D(j,:) is ρj = ′ cov(D(j,:) , D(j,:) ) ′ std(D(j,:) )std(D(j,:) ) and the covariance ′ cov(D(j,:) , D(j,:) ) = cov(D(j,:) , (Q(:,j) )T × D = (Q(:,j) )T × C(:,j) 100 4.5. POT for Other Classification Methods where C(:,j) is the j th column of covariance matrix C and is a non-zero vector. Similarly, for each sub-group, we have ρlj = ′ cov(D(j,idxl :) , D(j,idx ) l) ′ std(D(j,idxl ) )std(D(j,idx ) l) ′ l cov(D(j,idxl ) , D(j,idx ) = (Q(:,j) )T × C(:,j) l) l where C(:,j) is the j th column of covariance matrix C l and is a non-zero vector. In d dimensional space, because j + (k + 1) ≤ d, there must exist a nonl zero vector q that is orthogonal to C(:,j) , C(:,j) , 1 ≤ l ≤ k and Q(:,i) , 1 ≤ i < j, T T l which means q × C(:,j) = 0 and q × C(:,j) = 0. Let Q(:,j) = q, then ′ ′ cov(D(j,:) , D(j,:) ) = 0. The denominator of ρj is std(D(j,:) )std(D(j,:) ), which 2 is not 0 because Q(:,j) is a non-zero vector. Therefore, ρj equals 0. For the same reason, (ρlj )2 equals 0. In conclusion, Step 3.2 of Algorithm POT-LS can always iteratively find a new vector A(:,j) to monotonically reduce ρ2j and (ρlj )2 till they reach the minimum value 0. In Algorithm POT-LS, we use a counter ‘cnt’ to count how many times the algorithm have checked vulnerable sub-groups. Theorem 4.1 can guarantee that the correlation for the whole attribute and the correlations for the top k linear sub-groups are less than the threshold. However, the new generated orthogonal vector q might introduce other vulnerable sub-groups. The algorithm stops after checking vulnerable sub-groups for L times. The value of L is an input parameter. However, experiments show that the algorithm will not repeat too many times (> 10) to reach a solution. 4.5 POT for Other Classification Methods The POT approach is proposed for the SVM classification. There are five other widely used linear classification methods, i.e., linear regression, ridge regression, lasso regression, Elastic Net, and logistic regression [9]. In this section, we extend the POT approach to these linear classification methods. In particular, we examine whether the NOC guarantee is provided by POT for the linear models. Given a data set D with n data points, each point xi is associated with a class label yi . The goal of the linear model is to find the optimal parameters 101 4.5. POT for Other Classification Methods (b, w) such that the linear regression function f (x, w, b) = b+xT w minimizes the error function. We have the following definition for the NOC guarantee provided by POT for linear classification models. Definition 4.4. Given a data set D, an orthogonal matrix Q, and the transformed data set D ′ = Q×D, let (b0 , w0 ) be the optimal parameters of a linear model for D. We say that the orthogonal transformation provides the NOC guarantee for this linear model if and only if the parameters (b0 , Q × w0 ) are also the optimal parameters for the transformed data set D ′ . Based on this definition, we examine each linear classification model as follows: • Linear regression: P The error function to be minimized is Jn (w, b) = n1 ni=1 (yi −f (xi , w, b))2 , which aims to minimize the errors between the class labels and the predicted class labels. We have the following theorem for linear regression: Theorem 4.2. The POT approach provides the NOC guarantee for linear regression. Proof : Let us use (b0 , w0 ) to represent the optimal parameters of the linear regression, which means the value Jn (w0 , b0 ) is the minimal error for D and we have: n Jn (w0 , b0 ) = 1X (yi − (b0 + xTi w0 )) n i=1 For the transformed data set D ′ = Q × D, given a point xi in D, the corresponding transformed point is in the form of x′i = Q × xi . The parameters (b0 , w0′ ) = (b0 , Q × w0 ) yields the error for D ′ : n Jn (w0′ , b0 ) = 1X (yi − (b0 + (Q × xi )T × Q × w0 )) n = n 1X (yi − (b0 + xTi × QT × Q × w0 )) n = n 1X (yi − (b0 + xTi w0 )) = Jn (w0 , b0 ) n i=1 i=1 i=1 102 4.5. POT for Other Classification Methods If (b0 , w0′ ) are not the optimal parameters, then there exists a pair (b′ , w1′ ) such that Jn (w1′ , b′ ) < Jn (w0′ , b) for D ′ . The error for (b′ , w1′ ) is as follows: n Jn (w1′ , b′ ) = 1X (yi − (b′ + (Q × xi )T × w1′ ))2 n i=1 = n 1X (yi − (b′ + xTi × QT × w1′ ))2 n i=1 Then for data set D, we can have a pair of parameters (b′ , w1 ) = (b′ , QT × w1′ ) such that the error of classification is: n ′ Jn (w1 , b ) = 1X (yi − (b′ + xTi × w1 ))2 n i=1 = n 1X (yi − (b′ + xTi × QT × w1′ ))2 = Jn (w1′ , b′ ) n i=1 Then we will have Jn (w1 , b′ ) < Jn (w0 , b) for dataset D, which is contradict with the assumption that (b0 , w0 ) are the optimal parameters for D. Therefore, (b0 , Q × w0 ) are the optimal parameters for the transformed data set D ′ . For the same reason, if (b0 , Q × w0 ) are the optimal parameters for the transformed data set D ′ , then (b0 , w0 ) will be the optimal parameters for the data set D too. Thus, the POT approach provides the NOC guarantee for linear regression. The orthogonal transformation contains rotation and reflection. Geometrically, both rotation and reflection preserve the geometric shape of a data set and preserve the distance between any two points. Therefore, if a line is the optimal classifier in the original space, the corresponding line in the transformed space will be also the optimal classifier for the transformed data set, which is smiler to the reason that POT preserves the SVM classification discussed in chapter three. • Ridge regression: P P The optimization function is n1 ni=1 (yi − f (xi , w))2 + λ dj=1 wj2 . DifP ferent from the linear regression, there is a penalty, λ dj=1 wj2 , which 103 4.5. POT for Other Classification Methods P equals to wT w. The geometric meaning of dj=1 wj2 is the square of the length of the vector w. The orthogonal transformation will not change the length of a vector. Therefore, we could have the following corollary from Theorem 4.2: Corollary 4.1. The POT approach provides the NOC guarantee for ridge regression. • Lasso regression: P P The optimization function is n1 ni=1 (yi −f (xi , w))2 +λ dj=1 |wj |. The P penalty is λ dj=1 |wj |, which means the sum of the absolute projections on the axes of vector w. The orthogonal transformation contains rotation and reflection. The reflection preserves the sum of axisprojections of a vector. However, the rotation will not preserve this sum generally. Therefore, the orthogonal transformation will not preserve this sum. Thus, the orthogonal transformation does not provide the NOC guarantee for lasso regression. • Elastic Net: P P The optimization function is n1 ni=1 (yi − f (xi , w))2 + λ dj=1 ((1 − P α)wj2 + α|wj |). The penalty is λ dj=1 ((1 − α)wj2 + α|wj |). If α equals 0, the Elastic Net is actually ridge regression. If α equals 1, the Elastic Net is lasso regression. For other values of α, the penalty is the combination of length of the vector w and the sum of the axis-projections of the vector w. We already know that the orthogonal transformation does not preserve the sum of the axis-projection of the vector w. Therefore, the orthogonal transformation does not provide the NOC guarantee for the Elastic Net. • Logistic regression The optimization function is f (z) = 1+e1−z , z = β0 + xT × β. From Theorem 4.2, we could know that the function z = β0 + xT × β is preserved by the orthogonal transformation Q. Therefore, f (z) is also preserved by the orthogonal transformation. We have the following corollary from Theorem 4.2: Corollary 4.2. The POT approach provides the NOC guarantee for logistic regression. We summarize the NOC guarantee for the linear classifications in Table 4.1. 104 4.6. Combination of POT and Random Perturbation Models Linear Regression Ridge Regression Lasso Regression Elastic Net Logistic Regression Optimization Functions 1 Pn 2 n Pi=1 (yi − f (xi , w)) n 1 2 i − f (xi , w)) i=1 (yP n d +λ j=1 wj2 1 Pn 2 i − f (xi , w)) i=1 (y n P +λ dj=1 |wj | 1 Pn (y − f (xi , w))2 n Pd i=1 i +λ j=1 ((1 − α)wj2 + α|wj |) f (z) = 1+e1−z z = β0 + xT × β NOC from POT? Yes Yes No No Yes Table 4.1: The NOC Guarantee for Linear Classification 4.6 Combination of POT and Random Perturbation Random perturbation has been widely studied and is a popular approach for privacy preserving classification [1, 2]. Random perturbation adds random noise to the original data in order to hide the sensitive information. The more noise added, the more protection it provides, and meanwhile the more change it produces on the classification patterns. It will be advantageous to provide an approach that could reduce the outcome change while not sacrificing the privacy. The POT approach preserves the classification patterns while providing high level protection. Therefore, we could apply POT on the perturbed data to provide extra protection. With the added extra protection, we might need less noise to be added on the original data and thus yield less change on the classification patterns. Figure 4.6 shows how to apply POT to random perturbation and shows how to compare the combined approach and random perturbation 11 . In the random perturbation approach, we add random noise N to the original data set D and generate a new data set D ′ = D + N . Then the classification is applied on the perturbed data D ′ and the pattern P1 is produced. In the combination of POT and random perturbation, we apply an orthogonal transformation Q on the perturbed data D ′ and create a new data set D” = Q × D ′ = Q × (D + N ). The classification is applied on the data 11 Note that Figure 4.6 is a little different from Figure 1.3. The reason is that Figure 1.3 shows the framework for SVM only and the pattern P1 and the pattern P2 are identical for the SVM classification 105 4.7. Experiments Classification Data Set D Patterns P0 Random Perturbation: Adding Noise ?Discrepancy ′ D =D+N Patterns P1 ?≡ Applying POT D” = Q × D′ ?Discrepancy Patterns P ′ Patterns P2 Decoding by P2 = Q−1 × P ′ Figure 4.6: The Combination of POT and Random Perturbation set D” and the classification pattern P2 is generated. For the classification models that are preserved by POT (e.g., SVM, linear regression, ridge regression, logistic regression, etc.), the pattern P1 is identical to the pattern P2 . For lasso and Elastic Net classifications, there is also discrepancy between P1 and P2 . Both patterns P1 and P2 are different from the pattern P0 mined from the original data set D. The question is how big is the difference. In the experimental section, we will evaluate the discrepancy between P1 and P0 and the discrepancy between P2 and P0 . We also compare the discrepancy between P1 ,P2 and P0 with respect to the classification accuracy and the protection level. We will present empirical results to show that the combination of POT and random perturbation could provide more protection with less noise added, which means the outcome change is reduced. 4.7 Experiments In this section, we show experimental results about the POT-LS approach and the effectiveness of the combination of POT and random perturbation. In the experiments, we use various data sets from the UCI collection [22], including Boston Housing, Credit and WDBC (i.e., Wisconsin Breast Cancer (Diagnostic)) and the IBM synthetic data set used in paper [2]. We deal with numerical attributes for all data sets. We show the descriptions of the data sets in Table 4.2. We evaluate the POT-LS approach and the combination of POT and random perturbation with the curve fitting attack, which is defined in Definition 4.3. The radius δ (Definition 4.1) of the knowledge point is set to 106 4.7. Experiments # of Numeric Attrs # of Points WDBC 10 569 Housing 10 509 Credit 4 690 IBM 8 1000 Table 4.2: Data Sets be 2% of the 1% trimmed range of each domain, i.e., the range without the 0.5% biggest values and the 0.5% smallest values. We applied different curve fitting attack models, e.g., linear fitting, spline fitting and polyline fitting. Experiments show that all of them produce similar results. In this section, we mainly show the results for linear fitting. We implemented the experiments in MATLAB environment and ran the experiments on an Intel Pentium PC with 3GHz CPU and 2GB RAM. 4.7.1 Optimizing POT : POT-LS The first part of the experiments is to evaluate the proposed heuristic (i.e., Algorithm POT-LS) that breaks down the correlations among the linear sub-groups and enhances the protection in the worst cases. As discussed in Section 4.4.1, the worst case means that there exist at least one linear sub-group and all of a hacker’s knowledge points falling within this linear sub-group. In the experiments, we set the number of the knowledge points to be four. In the experiments, the correlation threshold (ρ0 ) is set to be 0.6. The size threshold (S) varies from 10% to 50% of the domain size and all of them show similar results. In this section, we only show the results for S = 30%. Another input parameter to Algorithm POT-LS, k, is the number of selected vulnerable sub-groups. In the experiments, in order to evaluate the effectiveness and the efficiency of the algorithm, we vary k from 1 to the total number of vulnerable sub-groups. Figure 4.7 shows the experimental results for the WDBC data set. We show the results for five attributes, e.g., attributes #1,#3,#4,#8,and #9. No linear sub-group is identified for the other five attributes, even when the size threshold is set to be 10%. For each figure in Figure 4.7, the x-axis is the parameter k, the number of linear sub-groups to be handled. The primary y-axis shows the crack percentage and the second y-axis shows the runtime in seconds. There are three lines in each figure. The solid line with the triangle marker shows the sub-group improvements with respect to the level of protection. The crack percentage for a sub-group is defined as the fraction of 107 4.7. Experiments (a) Attr #1 (b) Attr #3 (c) Attr #4 (d) Attr #8 (e) Attr #9 Figure 4.7: POT-LS (in the Worst Cases) : Data Set WDBC 108 4.7. Experiments the number of cracks in this group over the number of values in this group, which is similar to the domain disclosure risk. The dashed line with the dot marker shows the overall attribute improvements. The dotted line with the diamond marker shows the runtime. Figure 4.7(a,b,c) shows the results for attributes #1,#3,and #4. We can see that the crack percentages for these three attributes could be around 50% for the whole attributes in the worst cases if the data is transformed by POT. The crack percentages for the linear sub-groups (the solid lines) are even higher. When k is set to be 1, the POT-LS algorithm breaks down the correlation for the most correlated sub-group. We could see from the figure that the crack percentage drops quickly for both the linear sub-groups and the overall attributes. For example, Figure 4.7(c) shows that the sub-group improvement is about 32% (the crack percentage drops from 60% to 28%), which contributes to about 25% overall improvements for the whole attribute. Furthermore, if k is 2, the crack percentage will be about 20%. For attributes #8 and #9, Figure 4.7(d,e) show that the crack percentage is not as high as attributes #1,#3,#4, but the crack percentage also drops while we introduce k to break down the correlations among the linear subgroups. In Figure 4.7(d,e), only the results of k = 1 are shown because the algorithm only finds one linear sub-group for attribute #8 and attribute #9 respectively. The runtime is the time needed to solve the inequalities. The POT approach only solves one inequality for the whole attribute. The POT-LS approach solves a set of inequalities, i.e., Inequality Set 4.9. The runtime only shows the extra time needed to solve this set of inequalities12 . The experimental results show that the total runtime to solve the set of inequalities is less than 0.1 seconds even though the runtime has increased a lot by comparing with the POT approach. Figure 4.8 shows the experimental results for the Housing data set. This data set also contains ten numerical attributes. POT-LS found two attributes that contains linear sub-groups after applying the orthogonal transformation. Figure 4.8(b) shows the results for attribute #5. From the figure we can know that the crack percentage in the worst cases is more than 40% in the linear sub-groups and about 35% for the whole attribute. POT-LS reduces the crack percentage to below 20%. When k = 2, the crack percentage will 12 We do not show the runtime for the GAP approach and the Lga method. Different linear sub-group methods might have different runtime properties. 109 4.7. Experiments (a) Attr #6 (b) Attr #9 Figure 4.8: POT-LS (in the Worst Cases) : Data Set Housing be about 10%. Figure 4.8(a) shows the results for attribute #4. In Figure 4.8(b), when k = 2, the crack percentage for the linear subgroup is lower than the whole attribute, which is different from Figure 4.7 and Figure 4.8(a). This is because the cracked values counted in the crack percentage for the whole attribute contain other values which are not in the identified linear sub-groups. (a) Attr #2 (b) Attr #3 Figure 4.9: POT-LS (in the Worst Cases) : Data Set Credit Figure 4.9 shows the results for the Credit data set. Credit contains four numerical attributes. Attributes #2 and #3 contains linear sub-groups after applying the orthogonal transformation. Figure 4.10 shows the results for the IBM synthetic data set. All of them show similar results to the WDBC 110 4.7. Experiments and the Housing data sets. (a) Attr #6 (b) Attr #8 Figure 4.10: POT-LS (in the Worst Cases) : IBM Synthetic Data Set Evaluation of Lga-Posterior So far we have evaluated the algorithm POT-LS and we use the GAP statistics and the Lga method to identify linear sub-groups. If the linear subgroups found by Lga have lower correlations than the threshold, we propose to use Algorithm 4.2 (Lga-Posterior) to find more vulnerable sub-groups. In this part, we show experimental results about Algorithm Lga-Posterior in Table 4.3. We compare the Lga method and the heuristic Lga-Posterior in three aspects, i.e., the number of identified linear sub-groups, the sub-group improvements on the crack percentage, and runtime. For the WDBC data set, we can see Lga-Posterior finds two linear subgroups and yields 20% sub-group improvements for Attribute #8. In contrast, if we only use the Lga method, only one linear sub-group is found with 12% improvements. Lga-Posterior also finds one vulnerable sub-group for Attribute #7 with 10% sub-group improvements. For the Housing data set, we can see Lga-Posterior could also find more vulnerable sub-groups and yield more improvements on the crack percentage. Lga-Posterior finds one more linear sub-groups for both attribute #6 and #9. For attribute #10, Lga-Posterior finds two linear sub-groups with 20% improvements on the crack percentage. Table 4.3 also shows the results for Attribute #3 of the Credit data set and the results for Attribute #5 of the IBM synthetic data. In both cases, 111 4.7. Experiments Data WDBC Housing Credit IBM Attr #8 #7 #6 #9 #10 #3 #5 k : the number of linear sub-groups Lga Lga-Posterior 1 2 1 1 2 2 3 2 1 2 1 improvements on crack percentage Lga Lga-Posterior 12% 20% 10% 28% 33% 29% 40% 20% 12% 24% 15% runtime (milliseconds) Lga Lga-Posterior 14 25 6 12 35 30 60 35 15 33 20 Table 4.3: Evaluation of Lga-Posterior Lga-Posterior could find more vulnerable sub-groups and yield lower crack percentage. Experimental results show that Lga-Posterior could provide more protection in the worst cases. The last two columns in Table 4.3 show runtimes for Lga and LgaPosterior. We can see Lga-Posterior needs less than 0.1 seconds to identify more linear sub-groups and yields more improvements on the crack percentage even though Lga-Posterior needs more time to run. 4.7.2 Combination Of POT and Random Perturbation In this section, we evaluate the effectiveness of the combination of POT and random perturbation with regard to the classification accuracy and the protection level. 4.7.2.1 Classification Model : SVM Figure 4.11- 4.14 show the experimental results of the SVM classification for datasets WDBC, Housing, Credit, and the IBM synthetic data respectively. In each figure, (a) shows the crack percentage and (b) shows the classification accuracy and the disagreements of the support vectors with respect to the corresponding standard deviation. Figure 4.11 shows the results for WDBC. In Figure 4.11(a), the x-axis shows the standard deviation of the added noise. The noise is randomly drawn from a normal distribution with standard deviation as a given percentage (c%) of the domain range. The experiments vary the standard deviation from 1% to 50% of the domain range. The y-axis shows the crack percentage. The solid line with diamond marker shows the crack percentage for random perturbation. The crack percentage is more than 20% when the standard deviation of the noise is 5% of the domain range. Especially, when 112 4.7. Experiments c% is less than 2%, the crack percentage could be about 50%. Only when c% is more than 20%, the crack percentage drops to about 10%. The dashed line with the triangle marker shows the crack percentage for the combination of POT and perturbation. From Figure 4.11(a) we can see that the dashed line is much flatter than the solid line. Even for c% = 1%, the crack percentage is as low as 10%. (a) Crack Percentage (b) Classification Accuracy & SV Disagreements Figure 4.11: The Combination Approach for SVM : WDBC Figure 4.11(b) shows the classification accuracy and the support vector disagreements. As shown in chapter three, since POT preserves the support vectors and thus preserves the classification patterns, the patterns derived from the POT transformed data and the patterns mined from the perturbed data are identical, i.e., P1 ≡ P2 in Figure 4.6. Therefore, in the experiments, we only need to measure the discrepancy of the classification patterns mined before and after the random perturbation. The solid line shows the classification accuracy. The first value is the classification accuracy for the original data set without adding noise. The figure shows that the classification accuracy drops while increasing the standard deviation of the added noise. Besides the classification accuracy, the other way is to directly compare the patterns mined before and after perturbation. For the SVM classification, it is to compare the found support vectors before and after the perturbation. Let us use Vo to represent the set of support vectors in the original data set and Vp to represent the set of support vectors in the perturbed data set. The disagreement between Vo and Vp is defined as V ∩V disag(Vo , Vp ) = 1 − Voo ∪Vpp . In Figure 4.11(b), the dashed line shows that 113 4.7. Experiments the disagreements of the support vectors consistently increase while adding more noise. Finally, let us look at the Figure 4.11(a) and (b) together. Figure 4.11(a) shows that only when the standard deviation of the noise is more than 20% of the domain range, the random perturbation will have the same level of protection (i.e., the crack percentage is less then 10%) as the combination approach. Meanwhile, Figure 4.11(b) shows that the classification accuracy drops a lot from the original classification models and the support vector disagreements are more than 60% when the standard deviation of the noise is more than 20%. In the random perturbation approach, if we would like to derive a classification model with less discrepancy from the original model, we might want to limit the standard deviation of the added noise to be less than 10% or even less than 5%, which provide poor protection (i.e., high crack percentage). However, the combination of POT and random perturbation could significantly improve the protection on the data even when the noise is small. Figure 4.11(a) shows that the crack percentage is about 10% when the noise is 1% and further drops to about 7% when the noise is 5%. Figure 4.12 shows the results for the Housing data set. Figure 4.13 shows the results for the IBM synthetic data set. Figure 4.14 shows the results for the Credit data set. All of them show similar results to Figure 4.11. We (a) Crack Percentage (b) Classification Accuracy & SV Disagreements Figure 4.12: The Combination Approach for SVM : Housing can derive more accurate classification models for SVM while not sacrificing the privacy protection by applying POT on random perturbation. 114 4.7. Experiments (a) Crack Percentage (b) Classification Accuracy & SV Disagreements Figure 4.13: The Combination Approach for SVM : IBM Synthetic Data (a) Crack Percentage (b) Classification Accuracy & SV Disagreements Figure 4.14: The Combination Approach for SVM : Credit 4.7.2.2 Linear, Ridge, and Logistic Regression The WDBC Data Set Figure 4.15 shows experimental results for linear models, i.e., linear regression, logistic regression and ridge regression, for the WDBC data set. Results for lasso and Elastic Net are shown in Figure 4.17 and Figure 4.18. Figure 4.15(a) shows the crack percentage, which is the same as Figure 4.11(a); Figure 4.15(b,c,d) show the classification measurement for different classification models. For each figure, the solid line with the triangle marker shows the classification accuracy. The first value shows the classification accuracy on the original data set without adding noise. 115 4.7. Experiments (a) Crack Percentage (b) Linear Regression (c) Logistic Regression (d) Ridge Regression Figure 4.15: The Combination Approach for Linear Models : WDBC The dashed line with the diamond marker shows the angles between the classifier (i.e., the optimal hyperplane) mined from the original data set and the classifier mined from the data set transformed by the combination of POT and random perturbation. Similar to the SVM classification (Figure 4.11), for all these three linear models, the classification accuracy drops quickly and the angles between the two classifiers increase fast while adding more noise (i.e., increasing the standard deviation of the added noise). With POT applied on the perturbed data, we need less noise to achieve high level of protection while not sacrificing classification accuracy. The Housing Data Set Figure 4.16 shows the experimental results for the Housing data set. The results show similar trends to Figure 4.15. Experiments for the Credit and the IBM synthetic data sets also show similar results. All the experiments confirm that the combination of POT and random perturbation can produce more accuracy classifiers while not sacrificing privacy by adding less noise. 116 4.7. Experiments (a) Crack Percentage (b) Linear Regression (c) Logistic Regression (d) Ridge Regression Figure 4.16: The Combination Approach for Linear Models : Housing 4.7.2.3 Lasso and Elastic Net The WDBC Data Set We show experimental results for lasso and Elastic Net separately in Figure 4.17 and Figure 4.18. We do not repeat to show the crack percentage in this part since it is the same as Figure 4.15(a). In Figure 4.17 and Figure 4.18, (a) shows the classification measurement for random perturbation. Similar to Figure 4.15, the solid line with the triangle marker shows the classification accuracy and the dashed line with the diamond marker shows the angles between the two classifiers. Figure(b) shows the classification measurement for the combination of POT and random perturbation. Figure(c) compares the classification accuracy and Figure(d) compares the angles between the two classifiers for the two approaches. The black line with the cross marker shows the results for random perturbation and the gray line with the dot marker shows the resutls 117 4.7. Experiments for the combination of POT and random perturbation. (a) Perturbation Only (b) Perturbation + POT (c) Accuracy (d) Angles Figure 4.17: The Combination Approach for Lasso : WDBC From Figure 4.17(a) and (b), we can see the accuracy drops and the angles increases while adding more noise. Because POT does not preserve the lasso classification, the combination approach yields a little less accurate classifier than the random perturbation in Figure 4.17(c). Similarly, the combination approach also introduce wider angles between the two classifiers than the random perturbation approach in Figure 4.17(d) for most cases. However, the difference of the classification accuracy between the random perturbation and the combination approach is less than 2%. The introduction of POT could significantly improve the protection level (Figure 4.15(a)), we only need a small amount of noise while achieving enough protection with high classification accuracy. For example, for the combination of POT and perturbation, the crack percentage is about 7% (Figure 4.15(a)) and the classification accuracy is about 79% when the standard deviation of the 118 4.7. Experiments added noise is 5% of the domain range. In contrast, for random perturbation alone, in order to limit the crack percentage to be around 7%, the standard deviation of the noise needs to about 30% and the classification accuracy is round 70%. Figure 4.18 shows the results for Elastic Net. The penalty part for (a) Perturbation Only (b) Perturbation + POT (c) Accuracy (d) Angles Figure 4.18: The Combination Approach for Elastic Net : WDBC P Elastic Net ( dj=1 ((1 − α)wj2 + α|wj |)) combines the penalty parts for the ridge regression and the lasso regression. In the experiments, we set α to be 0.5. We can find similar trends for Elastic Net to lasso regression. For both Elastic Net and lasso, the combination approach provides high level of protection on the private information by adding less noise. The Housing Data Set For the housing Data Set, Figure 4.19 shows the experimental results of lasso and Figure 4.20 shows the results of Elastic Net. Both of them show similar trends to the WDBC data set (Figure 4.17 119 4.8. Conclusions (a) Perturbation Only (b) Perturbation + POT (c) Accuracy (d) Angles Figure 4.19: The Combination Approach for Lasso : Housing and Figure 4.18). For other data sets (e.g., the Credit data set and the IBM synthetic data set), empirical studies also show similar results. All these experiments show that the the combination of POT and random perturbation provides more protection and yields less outcome change. 4.8 Conclusions In this chapter, we extend the POT approach for other linear classifications, propose a transformation approach by combining POT and random perturbation, and provide a heuristic to enhance the level of privacy protection of POT in the worst cases. We use POT for privacy preserving linear classifications, including linear regression, ridge regression, lasso regression, Elastic Net, and logistic regres120 4.8. Conclusions (a) Perturbation Only (b) Perturbation + POT (c) Accuracy (d) Angles Figure 4.20: The Combination Approach for Linear Models : Housing sion. We show that POT provides the NOC guarantee for linear, ridge and logistic classification models by preserving the classification error functions. In order to take advantage of the POT approach in privacy persevering, i.e., providing high level protection and preserving classification patterns, we propose a transformation approach that combines POT and random perturbation. After the original data has been perturbed by adding random noise, we further use POT to generate an orthogonal matrix to transform the perturbed data. Experimental results confirm that the combination approach needs less noise to reach high level of privacy protection while minimizing the outcome change. In order to provide more protection on privacy, we investigate the privacy protection properties of the POT approach in the worst cases. We find that there might exist linear sub-groups that still have high correlations between the original values and the corresponding transformed values in an attribute 121 4.8. Conclusions after the orthogonal transformation. Furthermore, those high correlated linear sub-groups are vulnerable to curve fitting attacks and undermine the protection ability of the POT approach. We propose a heuristic to be applied on the POT approach to break down the correlations for vulnerable sub-groups. We present empirical results to show that the proposed heuristic could generate a more robust orthogonal transformation to significantly improve the protection on the private information. 122 4.9. Bibliography 4.9 Bibliography [1] Agrawal D. and Aggarwal C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. Proc. of PODS 2001. [2] Agrawal R. and Srikant,R.: Privacy preserving data mining. Proc. of SIGMOD 2000, pp. 439–450. [3] Bu, S., Lakshmanan, L.V.S., Ng, R. T. and Ramesh,G.: Preservation of Patterns and Input-Output Privacy. Proc. of ICDE 2007, pp. 696– 705. [4] Chen, K. and Liu, L.: Privacy Preserving Data Classification with Rotation Perturbation. Proc. of ICDM 2005,pp. 589–592. [5] Chen, K., Sun, G. and Liu, L.: Toward Attack-Resilient Geometric Data Perturbation. SIAM Data Mining(SDM) 2007. [6] Clifton, C. and Marks, D.: Security and Privacy Implications of Data Mining. Workshop on Data Mining and Knowledge Discovery, 1996. [7] Evfimievski, A., Gehrke, J. and Srikant, R.: Limiting privacy breaches in privacy preserving data mining. Proc. of PODS 2003, pp. 211–222. [8] Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J.: Privacy preserving mining of association rules. Proc. of SIGKDD 2002, pp. 217–228. [9] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. New York: Springer,2009. [10] Z. Huang, W. Du and B. Chen.: Deriving private information from randomized data. Proc. of SIGMOD 2005, pp. 37–48. [11] Kaplan, J: Data mining as a service: the prediction is not in the box. Data Mining Review, July 1,2007. [12] Kargupta, H., Datta, S., Wang, Q. and Sivakumar, K.: On the Privacy Preserving Properties of Random Data Perturbation Techniques. Proc. of ICDM 2003, pp. 99-106. [13] Molloy, I., Li, N., and Li, T.: On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining. Proc. of ICDM 2009, pp. 872–877. 123 4.9. Bibliography [14] Pison, G., Van Aelst, S., and Zamar, R. H.: A Robust Linear Grouping Algorithm. Proc. in Computational Statistics, 2006. [15] Trefethen, L.N., Bau, D.: Numerical Linear Algebra. SIAM, Philadelphia, 1997 [16] Tibshirani, R., Walther, G., and Hastie, T.: Estimating the Number of Clusters in a Dataset via the Gap Statistic. J. R. Statist. Soc. B(2001), v 63, Part 2, pp. 411-423. [17] Verykios V. S., Bertino E., Fovino I. N., Provenza L. P., Saygin Y. and Theodoridis Y.: State-of-the-art in privacy preserving data mining. ACM SIGMOD Record, v.33 n.1, 2004. [18] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: Security in outsourcing of association rule mining. Proc. of VLDB. 2007, pp. 111-122. [19] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: An audit environment for outsourcing of frequent itemset mining. Proc. VLDB Endow. vol. 2-1 (Aug. 2009), 1162-1173. [20] Wong, W., Cheung, D., Kao, B., and Mamoulis, N.: Secure kNN computation on encrypted databases. Proc. of SIGMOD 2009 [21] Zhu Y. and Liu L.: Optimal Randomization for Privacy-Preserving Data Mining. Proc. of SIGKDD, 2004. [22] UC Irvine Data Set Collection. http://archive.ics.uci.edu/ml/datasets.html 124 Chapter 5 Conclusions and Future Works 5.1 Conclusions In this thesis, we study the problem of privacy preservation in the datamining-as-a-service model [8]. Before the data is submitted to a service provider, transformation approaches should be applied on the original data to protect the private information while minimizing the change on the mining outcome [1, 2, 11, 12, 13]. We focus our study in the context of classification and propose transformation approaches to counter attack models with prior knowledge [3, 6]. The contributions of this thesis can be concluded as follows. • We propose a piecewise transformation approach for privacy preservation in outsourcing of the decision tree classification. We show that a substantially rich class of functions – (anti-)monotone functions – can provide the no-outcome-change guarantee. Since it might not be safe enough to transform a whole domain with only one (anti-)monotone function, we further propose a piecewise transformation based on two methods to enhance the protection. The first method is to break a domain into pieces and apply different (anti-)monotone functions on different pieces. The second method is to identify the monochromatic pieces in which all values are associated with the same class label. We show that we could use an arbitrary function to transform a monochromatic piece. We prove that the proposed piecewise transformation still provides the NOC guarantee if the functions applied on pieces satisfy the global (anti-)monotone invariant. We empirically evaluate the effectiveness of the proposed framework using an extensive battery of tests and provided a comprehensive analysis of the experimental results. Our results show that with the introduction of breakpoints and monochromatic pieces, the piecewise framework is effective in reducing the disclosure risk. 125 5.1. Conclusions • We propose two approaches, i.e., the principled orthogonal transformation (POT) and the true negative points (TNP) perturbation, for privacy preservation in outsourcing of the SVM classification. We first propose to use the POT approach to generate an orthogonal matrix to transform the original data. We prove that the POT approach provides the NOC guarantee for both linear and non-linear SVM and show that the POT approach could break down the correlation between the original values and the corresponding transformed values. We present empirical results to show that POT could effectively counter curve fitting attacks with prior knowledge and significantly reduce the crack percentage by reducing the correlation. Also, the smaller the correlation, the smaller the crack percentage. Another attack model with prior knowledge is the global attack, which aims to crack the encrypted values by making a guess on the transformation matrix. The global attack is similar to the matrix attack defined in [4, 5, 9]. We prove that the crack ability of the global attacks is independent of the transformation matrix. In order to counter global attacks, we propose the TNP approach, which finds and perturbs true negative points. We show that the TNP approach provides the NOC guarantee for linear SVM on linearly separable data set. Experimental results show that the TNP approach could effectively counter global attacks while minimizing the outcome change. • In the last part of this thesis, we extend the POT approach to linear classification models and show that POT also provides the NOC guarantee for linear models including linear regression, ridge regression and logistic regression. We propose a combination approach that applies POT on random perturbation in order to minimize the outcome change without sacrificing privacy. In order to further improve the protection on privacy in the worst cases, we propose a heuristic (Algorithm POT-LS) to break down the correlations between the original values and the corresponding transformed values of subsets. We conduct a comprehensive set of experiments to evaluate both the proposed heuristic POT-LS and the combination approach. The experimental results show that the proposed heuristic, i.e., POT-LS, could substantially reduce the crack percentage and thus enhance the protection on the original data. Moreover, the results also show that the combination of POT and random perturbation could reach high level of privacy protection by adding less noise, which means the outcome 126 5.2. Future Works change is also reduced. 5.2 Future Works Besides the problems studied and the approaches proposed in this thesis, there are several open problems in privacy preservation in outsourcing of data mining, and we would like to further investigate the following problems in future: • Optimizing the POT approach: We propose the heuristic POT-LS to further improve the privacy protection by breaking down the correlations between the original values and the corresponding transformed values of subsets. However, as we have discussed in chapter four, POT-LS cannot guarantee that the new generated vector will not produce new highly correlated subsets of values. Although experimental results show that the proposed heuristic is effective to reduce the crack percentage, it will be helpful to design an approach that could generate an orthogonal matrix that will not produce any highly correlated subsets of values. • Beyond linear SVM on linear separable data sets: The TNP approach proposed in this thesis is to counter the global attack. However, the TNP approach only provides the NOC guarantee for linear SVM on linearly separable data sets. For linearly nonseparable data sets, there is no guarantee on preserving the mining outcome. Meanwhile, it is still an open problem to design a transformation approach against global attacks with the NOC guarantee for SVM with non-linear kernels. • Privacy preservation in clouds: More and more computing tasks will be conducted in cloud computing environment [7, 10]. Even though the proposed approaches could be used in cloud computing, it might be necessary to design specific transformation schema for submitting data to clouds. For example, a data owner might want to use clouds to both store data and run data mining tasks. If the data is updated frequently, the piecewise transformation frame work and the TNP approach might not work. It is interesting to propose new transformations for data updating while providing both privacy and pattern preservation in cloud computing. 127 5.3. Bibliography 5.3 Bibliography [1] Agrawal D. and Aggarwal C.: On the Design and Quantification of Privacy Preserving Data Mining Algorithms. [2] Agrawal R. and Srikant,R.: Privacy preserving data mining. Proc. of SIGMOD 2000, pp. 439–450. [3] Bu, S., Lakshmanan, L.V.S., Ng, R. T. and Ramesh,G.: Preservation of Patterns and Input-Output Privacy. Proc. of ICDE 2007, pp. 696–705. [4] Chen, K. and Liu, L.: Privacy Preserving Data Classification with Rotation Perturbation. Proc. of ICDM 2005,pp. 589–592. [5] Chen, K., Sun, G. and Liu, L.: Toward Attack-Resilient Geometric Data Perturbation. SIAM Data Mining(SDM) 2007. [6] Ferguson, N. and Schneier, B.: Practical Cryptography. Wiley Publishing Inc. [7] Grossman, R. and Gu, Y.: Data mining using high performance data clouds: experimental studies using sector and sphere. Proc. of the 14th ACM SIGKDD. 2008, pp. 920-927. [8] Kaplan, J.: Data mining as a service: the prediction is not in the box. Data Mining Review, July 1,2007. [9] Liu, K., Giannella, C. and Kargupta, H.: An Attacker’s View of Distance Preserving Maps for Privacy Preserving Data Mining. Proc. of PKDD 2006. [10] Weiss, A.: Computing in The Clouds. ACM NetWorker, December 2007, pp. 16-25 [11] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: Security in outsourcing of association rule mining. Proc. of VLDB. 2007, pp. 111-122. [12] Wong, W. K., Cheung, D. W., Hung, E., Kao, B., and Mamoulis, N.: An audit environment for outsourcing of frequent itemset mining. Proc. VLDB Endow. vol. 2-1 (Aug. 2009), 1162-1173. [13] Wong, W., Cheung, D., Kao, B., and Mamoulis, N.: Secure kNN computation on encrypted databases. Proc. of SIGMOD 2009 128
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Patterns and privacy preservation with prior knowledge...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Patterns and privacy preservation with prior knowledge for classification Bu, Shaofeng 2010
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Patterns and privacy preservation with prior knowledge for classification |
Creator |
Bu, Shaofeng |
Publisher | University of British Columbia |
Date Issued | 2010 |
Description | Privacy preservation is a key issue in outsourcing of data mining. When we seek approaches to protect the sensitive information contained in the original data, it is also important to preserve the mining outcome. We study the problem of privacy preservation in outsourcing of classifications, including decision tree classification, support vector machine (SVM), and linear classifications. We investigate the possibility of guaranteeing no-outcome-change (NOC) and consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach using two central ideas of breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for pieces satisfy the global (anti-)monotonicity. We empirically show that the proposed piecewise transformation approach can deliver a secured level of privacy and reduce disclosure risk substantially. We then propose two transformation approaches, (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation, for outsourcing SVM. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guarantee when the data set is linearly separable. For linearly non-separable data sets, we show that no-outcome-change is not always possible and propose a variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results show that the two approaches are effective to counter powerful attack models. In the last part, we extend the POT approach to linear classification models and propose to combine POT and random perturbation. We conduct a detailed set of experiments and show that the proposed combination approach could reduce the change on the mining outcome while still providing high level of protection on privacy by adding less noise. We further investigate the POT approach and propose a heuristic to break down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach could significantly improve the protection level on privacy in the worst cases. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-09-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0051744 |
URI | http://hdl.handle.net/2429/28749 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2010-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2010_fall_bu_shaofeng.pdf [ 2.66MB ]
- Metadata
- JSON: 24-1.0051744.json
- JSON-LD: 24-1.0051744-ld.json
- RDF/XML (Pretty): 24-1.0051744-rdf.xml
- RDF/JSON: 24-1.0051744-rdf.json
- Turtle: 24-1.0051744-turtle.txt
- N-Triples: 24-1.0051744-rdf-ntriples.txt
- Original Record: 24-1.0051744-source.json
- Full Text
- 24-1.0051744-fulltext.txt
- Citation
- 24-1.0051744.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0051744/manifest