UBC Theses and Dissertations

Patterns and privacy preservation with prior knowledge for classification
Bu, Shaofeng

Abstract

Privacy preservation is a key issue in the outsourcing of data mining. When seeking approaches that protect the sensitive information contained in the original data, it is equally important to preserve the mining outcome. We study the problem of privacy preservation in the outsourcing of classification, including decision tree classification, support vector machines (SVM), and linear classification. We investigate whether no-outcome-change (NOC) can be guaranteed, and we consider attack models with prior knowledge. We first conduct our investigation in the context of building decision trees. We propose a piecewise transformation approach built on two central ideas: breakpoints and monochromatic pieces. We show that the decision tree is preserved if the transformation functions used for the pieces satisfy global (anti-)monotonicity. We show empirically that the proposed piecewise transformation approach delivers a secure level of privacy and substantially reduces disclosure risk. We then propose two transformation approaches for outsourcing SVM: (i) principled orthogonal transformation (POT) and (ii) true negative point (TNP) perturbation. We show that POT always guarantees no-outcome-change for both linear and non-linear SVM. The TNP approach gives the same guarantee when the data set is linearly separable. For data sets that are not linearly separable, we show that no-outcome-change is not always possible, and we propose a variant of the TNP perturbation that aims to minimize the change to the SVM classifier. Experimental results show that the two approaches are effective at countering powerful attack models. In the last part, we extend the POT approach to linear classification models and propose combining POT with random perturbation. We conduct a detailed set of experiments and show that the proposed combination can reduce the change to the mining outcome, while still providing a high level of privacy protection, by adding less noise.
We further investigate the POT approach and propose a heuristic that breaks down the correlations between the original values and the corresponding transformed values of subsets. We show that the proposed approach can significantly improve the level of privacy protection in the worst cases.
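
As a minimal illustration of why an orthogonal transformation can leave an SVM unchanged, the sketch below rotates a few 2-D points and checks that all pairwise dot products and squared distances are preserved. Common SVM kernels (linear, polynomial, RBF) depend on the data only through these quantities. This is not the thesis's POT construction: the plain rotation, the toy points, the angle `theta`, and every function name here are assumptions made purely for illustration.

```python
import math


def rotate(point, theta):
    """Apply a 2-D rotation (an orthogonal map) to a point."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def dot(p, q):
    """Dot product of two 2-D points."""
    return p[0] * q[0] + p[1] * q[1]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def max_kernel_input_change(original, mapped):
    """Largest absolute change in any pairwise dot product or squared distance."""
    worst = 0.0
    for i in range(len(original)):
        for j in range(len(original)):
            d_dot = abs(dot(original[i], original[j]) - dot(mapped[i], mapped[j]))
            d_dist = abs(sq_dist(original[i], original[j]) - sq_dist(mapped[i], mapped[j]))
            worst = max(worst, d_dot, d_dist)
    return worst


# Hypothetical toy data; not taken from the thesis.
points = [(1.0, 2.0), (-0.5, 3.0), (4.0, -1.5), (0.0, 0.0)]
theta = 0.7  # arbitrary angle, standing in for a secret transformation key
transformed = [rotate(p, theta) for p in points]

# The coordinates change, but the quantities a kernel sees do not
# (up to floating-point rounding).
print(max_kernel_input_change(points, transformed) < 1e-9)
```

Because the kernel matrix computed on the rotated points matches the one computed on the original points, an SVM trained on either data set solves the same dual problem. How the thesis selects its orthogonal transformation, and how well that choice resists attacks with prior knowledge, is not captured by this sketch.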

License

Attribution-NonCommercial-NoDerivatives 4.0 International