UBC Theses and Dissertations
Robust and sparse regression in the presence of cellwise and casewise contamination with application in data quality modelling
McGuinness, Glenn
This thesis considers the problem of robust and sparse estimation of linear regression parameters in data with structural (casewise) and independent (cellwise) contamination. Independent outliers can propagate in data with relatively large numbers of dimensions, resulting in a high fraction of observations with at least one outlying cell. Recent work has shown that traditional robust regression methods are not highly robust to such outliers. We investigate the application of Robust Least Angle Regression (RLARS) to data with independent contamination. We also propose two modified versions of RLARS to further improve its performance. The first method applies RLARS to data from which independent outliers have been filtered. The second method performs RLARS with the Lasso modification for Least Angle Regression (LARS). Extensive simulations show that RLARS is resilient to structural and independent contamination. Compared with RLARS, the first modified version shows significantly improved robustness to independent contamination in simulations, and the second modified version shows improved robustness when there are a large number of predictors.

We also consider the application of the proposed methods to data quality modelling in a case study for MineSense Technologies Ltd. (MineSense). MineSense develops sensor packages for use in the harsh conditions of an active mine. To maintain high system availability and performance, data must be monitored for a deterioration in sensor health or a change in the data generating process, such as a change in ore body, either of which can manifest as outliers. We pose the problem of contamination detection, the identification of whether a dataset contains outliers, as a distinct problem from outlier detection, the identification of which cases or cells are outliers. We propose a contamination detection method based on the comparison of robust and non-robust linear regression estimates. When outliers are present, the robust and non-robust estimates differ significantly, indicating the presence of contamination. Simulation results and analysis of real sensor data provided by MineSense suggest that our method can effectively detect the presence of contamination with a low false detection rate.
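The idea of detecting contamination by comparing robust and non-robust fits can be illustrated with a minimal Python sketch. It is not the method developed in the thesis: a simple Huber M-estimator (fitted by iteratively reweighted least squares) stands in for RLARS, the score is just the Euclidean distance between the two coefficient vectors, and the function names, the simulated data, and the value of `THRESHOLD` are all hypothetical choices made for illustration. Note also that a Huber fit only resists outliers in the response, so this sketch says nothing about cellwise contamination in the predictors.

```python
import numpy as np


def ols_fit(X, y):
    """Ordinary least squares with an intercept (non-robust estimate)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def huber_fit(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares.

    Stand-in for a robust regression estimator; NOT the RLARS method.
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta = ols_fit(X, y)  # start from the non-robust fit
    for _ in range(n_iter):
        r = y - A @ beta
        # Robust residual scale based on the median absolute residual.
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, c/|u| for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta


def contamination_score(X, y):
    """Distance between robust and non-robust coefficient estimates."""
    return float(np.linalg.norm(huber_fit(X, y) - ols_fit(X, y)))


# Simulated example (hypothetical data, not the MineSense sensor data).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

y_bad = y.copy()
y_bad[:20] += 30.0  # shift 10% of the responses to create outliers

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for this illustration
score_clean = contamination_score(X, y)
score_bad = contamination_score(X, y_bad)
print("clean flagged:", score_clean > THRESHOLD)
print("contaminated flagged:", score_bad > THRESHOLD)
```

On clean data the two fits nearly coincide, so the score stays small; when a block of responses is shifted, the least squares fit is pulled toward the outliers while the Huber fit is not, and the score grows. In practice the cut-off would need to be calibrated, for example by simulation, to control the false detection rate.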
Attribution-NonCommercial-NoDerivatives 4.0 International