UBC Theses and Dissertations

Large scale federated analytics and differential privacy budget preservation

Ulhoa Avelar Stolet, Matheus

Abstract

This thesis presents two contributions. The first addresses the problem of siloed data collection and prohibitive data acquisition costs, which limit the size and diversity of datasets used in health research. Access to larger and more diverse datasets improves the understanding of disease heterogeneity and facilitates inference of relationships linking surgical and pathological findings to symptomatic indicators and outcomes. Unfortunately, freely enabling access to these datasets can leak private information, such as medical records, even when the datasets have been stripped of personally identifiable information. In the first part of this thesis, we present LEAP, a data analytics platform with support for federated learning. LEAP allows users to analyze data distributed across multiple institutions in a private and secure manner, without leaking sensitive patient information. LEAP achieves this through an infrastructure that maintains privacy by design and brings the computation to the data, instead of bringing the data to the computation. When training ResNet-18 with 15 participating sites, LEAP adds an overhead of up to 2.5x compared to a centralized model. Despite this overhead, LEAP achieves convergence of the model's accuracy within 20% of the time taken for the centralized model to converge.

One of the techniques LEAP uses to preserve the privacy of sensitive queries is differential privacy (DP). Successive DP queries to a dataset deplete its privacy budget. When the privacy budget is depleted, data curators must block access to the underlying dataset to prevent private information from leaking. In the second part of this thesis, we present a system called the SmartCache. The SmartCache optimizes the use of the privacy budget by interpolating old query results to help answer new queries using a synthetic dataset. Queries answered from the synthetic dataset have a smaller privacy cost, so more queries can be answered before the budget runs out. For statistical queries, the SmartCache saved 30% to 50% of the budget for threshold values of 0.99 and 0.999, and for gradient queries it consumed 70% less of the privacy budget when training a fully connected model.
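The way successive DP queries deplete a privacy budget can be illustrated with a minimal Python sketch. It is not taken from the thesis and does not model LEAP or the SmartCache. It assumes pure epsilon-DP with basic sequential composition (the epsilons of successive queries add up) and a Laplace-noised count query. The abstract does not state which accounting method the thesis uses, and the names `PrivacyBudget`, `laplace_noise`, and `dp_count` are hypothetical.

```python
import random


class PrivacyBudget:
    """Tracks epsilon spent, assuming basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, epsilon):
        """Deduct epsilon, or refuse the query if the budget cannot cover it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining() + 1e-12:  # small tolerance for float error
            raise RuntimeError("privacy budget depleted; query blocked")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, budget, rng):
    """Noisy count. A count has sensitivity 1, so the noise scale is 1/epsilon."""
    budget.charge(epsilon)  # charge first, so a blocked query never touches the data
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=1.0)
    ages = [34, 51, 67, 72, 45]  # made-up illustrative data
    for _ in range(4):
        dp_count(ages, lambda a: a >= 50, 0.25, budget, rng)
    print("remaining budget:", budget.remaining())
    try:
        dp_count(ages, lambda a: a >= 50, 0.25, budget, rng)
    except RuntimeError as err:
        print(err)
```

With a total budget of 1.0 and queries costing 0.25 each, the fifth query is refused. Under this accounting, any query that can be answered at a lower epsilon cost leaves more of the budget for later queries, which is the effect the abstract attributes to answering from a synthetic dataset.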

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International