UBC Theses and Dissertations


Improving prediction of water main failures using statistical and machine learning algorithms

Vaags, Eric Henry


Models that estimate the likelihood of water main failure are widely used to support the repair and replacement strategies of water utilities. Advances in statistics and machine learning have introduced a wide range of models, and improvements in data management have made increasingly complex models more feasible. The datasets used to develop these models change frequently as strategies for the operation and renewal of water distribution systems evolve, an issue potentially exacerbated by the nonstationary processes affecting these systems. For water main failure prediction models to be useful in this dynamic context, utilities may need to periodically evaluate several models on their dataset, or researchers may need to examine the performance of one or more models across multiple datasets. This work presents a framework for the selection and analysis of water main failure prediction models, intended to enable efficient development of a range of models for a single dataset or investigation of model performance across several datasets. Each step of the framework is described, and recommendations are given for researchers and asset managers implementing these processes. The framework is investigated using data from four utilities, each with a highly censored dataset. Through the application of the framework, four models are selected and refined: the Cox Proportional Hazards model, the Neural Multi-Task Logistic Regression model, the XGBoost Survival Embeddings model, and the Random Survival Forests model. These models are trained on each utility dataset, and the outputs are compared to assess the efficacy of the framework. Results show that the framework can be used to identify models robust enough to achieve high performance on datasets from four different utilities.
Among the final models developed through the framework, the lowest C-index across all four datasets is 0.780. The framework also yields at least one strong model for each utility: the best model per utility achieves a C-index between 0.880 and 0.913.
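For readers unfamiliar with the metric, the C-index (concordance index) measures how often a model ranks pairs of assets in the correct order of failure, with 0.5 corresponding to random ranking and 1.0 to perfect ranking. The abstract does not state which variant was used; the sketch below assumes Harrell's formulation for right-censored data, and the function name, argument names, and toy values are illustrative only, not taken from the thesis.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times       -- observed time for each pipe (failure or censoring time)
    events      -- 1 if a failure was observed, 0 if the record is censored
    risk_scores -- model output; a higher score means earlier predicted failure
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs tied in time are skipped
        # a is the pipe with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier record is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # model ranked the earlier failure as riskier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four pipes, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.1, 0.5, 0.7]
toy_c_index = concordance_index(toy_times, toy_events, toy_scores)
```

Because a pair is only comparable when the earlier record is an observed failure, heavily censored datasets such as those described above contribute relatively few comparable pairs, which is one reason the metric is reported per dataset.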



Attribution-NonCommercial-NoDerivatives 4.0 International