A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data

Xie, Min-ge

Description

What should we do when data are extraordinarily large: too large to fit on a single computer, or too expensive to analyze with a computationally intensive method? To deal with this problem, we propose in this paper a "split-and-conquer" approach and illustrate it using several computationally intensive penalized regression methods, along with theoretical support. Consider a regression setting of generalized linear models with n observations and p covariates, in which n is extraordinarily large and p is either bounded or goes to ∞ at a certain rate relative to n. We propose to randomly split the data of size n into K subsets of size O(n/K). For each subset, we perform a penalized regression analysis, and the results from the K subsets are then combined to obtain an overall result. We show that, under mild conditions, the combined result retains the desired properties of many commonly used penalized estimators, such as model selection consistency and asymptotic normality. When K is well controlled, we also show that the combined result is asymptotically equivalent to the result of analyzing the entire data set all at once (assuming that a supercomputer could carry out such an analysis). In addition, when a computationally intensive algorithm is used, in the sense that its computing cost is of order O(n^a p^b) with a > 1 and b ≥ 0, we show that the split-and-conquer approach can substantially reduce computing time and memory requirements. Furthermore, we demonstrate that the approach has an inherent advantage of being more resistant to false model selections caused by spurious correlations. Similar to what has been reported in the literature, we establish an upper bound for the expected number of falsely selected variables and a lower bound for the expected number of truly selected variables. The proposed methodology is illustrated numerically using both simulated and real data examples.
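
The split, fit, and combine steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method presented in the talk: it uses a linear model (one special case of a generalized linear model) with a lasso penalty fitted by coordinate descent, and it combines the K subset fits with a simple majority vote followed by a plain average. The abstract does not specify the combination rule, so the vote threshold, the averaging, and the names `lasso_cd`, `split_and_conquer`, `vote_frac`, and `lam` are all hypothetical choices made for this example.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1.
    Illustrative stand-in for "a penalized regression analysis".
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # x_j' x_j / n for each column
    resid = y.astype(float).copy()         # residual y - X @ beta (beta = 0)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * (new - beta[j])
            beta[j] = new
    return beta


def split_and_conquer(X, y, K, lam, vote_frac=0.5, seed=0):
    """Randomly split the rows into K subsets, fit each, then combine.

    Combination rule (an assumption of this sketch): keep a variable if it
    is selected in more than vote_frac of the subsets, and average its
    coefficient over all K subset fits.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    parts = np.array_split(rng.permutation(n), K)   # K subsets of size ~ n/K
    betas = np.vstack([lasso_cd(X[idx], y[idx], lam) for idx in parts])
    votes = (betas != 0).mean(axis=0)               # selection frequency
    selected = votes > vote_frac
    combined = np.where(selected, betas.mean(axis=0), 0.0)
    return combined, selected


# Small synthetic example: 3 true signals among 20 covariates.
rng = np.random.default_rng(0)
n, p = 6000, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

combined, selected = split_and_conquer(X, y, K=6, lam=0.1)
print("selected variables:", np.flatnonzero(selected))
print("combined estimate :", np.round(combined[:3], 3))
```

Each subset fit only ever touches about n/K rows, which is what makes the approach usable when the full data set does not fit on one machine. Under the abstract's cost model, fitting K subsets of size n/K costs on the order of K (n/K)^a p^b = n^a p^b / K^(a-1), so for a > 1 the total work shrinks by a factor of roughly K^(a-1), before counting any gain from running the subset fits in parallel.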

Item Metadata

Title	A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data
Creator	Xie, Min-ge
Publisher	Banff International Research Station for Mathematical Innovation and Discovery
Date Issued	2014-02-10
Extent	40 minutes
Subject	Mathematics; Statistics; Biology and other natural sciences; Applied statistics
Type	Moving Image
File Format	video/mp4
Language	eng
Notes	Author affiliation: Rutgers University
Series	BIRS Workshop Lecture Videos (Banff, Alta)
Date Available	2014-08-06
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivs 2.5 Canada
DOI	10.14288/1.0043878
URI	http://hdl.handle.net/2429/49832
Affiliation	Non UBC
Peer Review Status	Unreviewed
Scholarly Level	Faculty
Rights URI	http://creativecommons.org/licenses/by-nc-nd/2.5/ca/
Aggregated Source Repository	DSpace

Item Media

201402101451-Xie_lrv.mp4 -- 81.08MB
