Prediction And Anomaly Detection In Water Quality With Explainable Hierarchical Learning Through Parameter Sharing

by

Ali Mohammad Mehr

B.Sc., Sharif University of Technology, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

September 2020

© Ali Mohammad Mehr, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Prediction And Anomaly Detection In Water Quality With Explainable Hierarchical Learning Through Parameter Sharing

submitted by Ali Mohammad Mehr in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Examining Committee:

David Poole, Computer Science
Supervisor

Giuseppe Carenini, Computer Science
Supervisory Committee Member

Abstract

Decisions made on water quality have high implications for diverse industries and the general population. In a 2020 study, Guo et al. report that the current literature on modeling spatiotemporal variabilities in surface water quality at large scales across multiple catchments is very poor. In this thesis, we introduce a simple, explainable, and transparent machine learning model that is derived from linear regression with hierarchical features for efficient prediction and for anomaly detection on large-scale spatiotemporal datasets. Our model learns offsets for various features in the dataset while utilizing a hierarchy among the features. These offsets can enable generalization and be used in anomaly detection. We show some interesting theoretical results on such hierarchical models. We built a water pollution platform for exploratory data analysis of water quality data at large scales. We evaluate the predictions of our model on the Waterbase - Water Quality dataset by the European Environment Agency.
We also investigate the explainability of our model. Finally, we investigate the performance of our model in classification tasks while analyzing its ability to do regularization and smoothing as the number of observations grows in the dataset.

Lay Summary

Predicting water pollution in large catchment areas is a difficult task. Some versions of recent artificial intelligence models used to do prediction in water pollution are black box models in that the scientists provide some data to the model and the model outputs some predictions without any explanation provided. We introduce an explainable model to make predictions about, and detect anomalies in, water pollution data. Our model is built on linear regression, one of the simplest prediction methods, studied since the 19th century; however, we exploit hierarchical structures among the features. We built a water pollution platform for exploratory data analysis on large-scale water pollution data. We evaluate the performance of our model compared to other models which can be used in such prediction tasks.

Preface

This thesis is based on a project done with Wan Shing Martin Wang under the supervision of Dr. David Poole. The ideas for the parameter sharing model and its variants were developed by Dr. David Poole. The initial tests on the model were done by the author and Martin Wang. Theorem 2.1 was developed in collaboration by Dr. David Poole, Wan Shing Martin Wang, and the author. Theorem 2.2 was developed by the author. The implementation of the models on the Waterbase - Water Quality dataset was done by the author. The water pollution platform was designed with the help of Minerva Intelligence Inc., especially Clinton Smyth and Jake McGregor, and it was implemented by the author. Anomaly detection using the hierarchical parameter sharing model was conjectured by Dr. David Poole, improved by Dr. David Poole with the help of the author, and implemented by the author. Chapter 4 was conducted by the author.
Wan Shing Martin Wang evaluated variants of the hierarchical parameter sharing model on the movie recommendation task, especially the cold start problem on the Movielens dataset, and on fashion datasets, specifically the ModCloth and Renttherunway datasets. Writing of the material, creation of the figures, and all code implementations in this thesis were conducted by the author.

A briefer version of chapters 2 and 3 will be submitted to a 2021 conference. This paper also includes Wan Shing Martin Wang's work on the movie recommendation task.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Problem Definition and Scope
  1.2 Literature review
  1.3 Thesis Organization
2 Parameter Sharing Model
  2.1 Hierarchical Parameter Sharing Model
  2.2 Prediction and Anomaly Detection in Parameter Sharing Models
  2.3 Tree DAGs as a Simple Structure of DAG Hierarchy
  2.4 Learning Hierarchical Parameter Sharing Models
    2.4.1 L2-Regularized Baseline Parameter Sharing Model
    2.4.2 Top-Down Parameter Sharing Model
  2.5 Explainability in parameter sharing models
3 Testing on Water-Quality Dataset
  3.1 Water Quality Dataset
  3.2 Water Pollution Platform
    3.2.1 Stations with Measurements
    3.2.2 Pollutant Added to River Section
    3.2.3 Plots of Measurements in a Station
    3.2.4 Saving and Restoring Favourite States
    3.2.5 Finding Peaks in the Plots
    3.2.6 Visualizing Anomalies
  3.3 Hierarchical Parameter Sharing on Water Pollution
    3.3.1 Dataset Generation
    3.3.2 Model Training
    3.3.3 Anomaly Detection
    3.3.4 Model Evaluation
4 Parameter Sharing in Binary Classification
  4.1 Overconfidence in Hierarchical Parameter Sharing Models
  4.2 Experiment Setup
    4.2.1 Preparing Datasets
    4.2.2 Model Training
  4.3 Experiment Results
5 Conclusion and future work
  5.1 Future directions
  5.2 Conclusion
Bibliography

List of Tables

Table 2.1 A summary of symbols and expressions in parameter sharing models

List of Figures

Figure 2.1 The hierarchical relationship between classes in Example 2.1 is shown in a DAG. This figure assumes that all observations are done in 2016 or 2017, in January or February, and in 3 locations $\{L_1, L_2, L_3\}$. Seven observations can be seen in this figure.
Figure 2.2 A simple DAG model on a small dataset with four observations: 6, 6, 5, and 2. In (a) all the observations are explained as noise using the singleton offsets. Based on (a) we can see that three of the observations are in class A and two of the observations are in class B. Classes A and B share an observation. In (b) a value of 5 is pushed up to A and a value of 0.5 is pushed up to B.
Figure 2.3 An example of a hierarchical parameter sharing model modeling a chemical observations dataset. The values of observations 1 to 5 are: 1.5 mg/L, 3 mg/L, 2.5 mg/L, 2 mg/L, and 0.5 mg/L. The vertices are offsets and all values are in mg/L. The observations and the edges touching them are colored.
Figure 2.4 A simple tree DAG model on a simple dataset with five observations. Based on (a) we can see that three of the observations have a value of 6, and the other two observations have values of 3 and 2.
Figure 3.1 Annotated main page of the water pollution platform. This page is mainly used for exploratory data analysis.
Figure 3.2 The map on the main page when check boxes for chemicals "Nitrate" and "Lead and its components" were selected in marker #6. The black stations are the stations that have measurements of "Nitrate" or "Lead and its components" in 2017. Note that "union" was selected in marker #8 and 2017 was selected in marker #11. The yellow stations do not have any measurements of "Nitrate" or "Lead and its components" in 2017.
Figure 3.3 The map on the main page of the water pollution platform, when Nitrate is selected in #5 and year 2015 is selected in #10. Red segments of the river show the segments where Nitrate was added substantially in 2015.
Figure 3.4 In this figure, the station pointed at by pointer #2 in Figure 3.1 is selected. The map is zoomed in on the selected station. The stations upstream of the selected station (colored with different shades of orange) and the selected station itself (colored red) can be seen. Two plots show the measurements for two chemicals done in the selected station in 2017. The rest of the plots are drawn underneath these two plots.
Figure 3.5 Residual errors for four training observations after initial training. The residual errors for observations 1, 2, 11, and 12 are 10, 8, 10, and -4 respectively.
Figure 3.6 An example of train and test root mean squared errors (RMSE) during initial training (first 400 steps) and learning constructed anomalies for an L2-regularized baseline parameter sharing model.
Figure 3.7 Test mean squared errors (MSE) for the four compared models.
Figure 3.8 Comparing the performance of five models in the interpolation (t < 2016/1/1) and extrapolation (t > 2016/1/1) settings.
Figure 4.1 An L2-regularized baseline parameter sharing model or a top-down hierarchical parameter sharing model learned on a binary task with 3 samples under class A, all of which were observed to be zero.
Figure 4.2 (a) Initial state of a hierarchy with three binary observations, all of which are observed to be 0. A dummy observation with value 0.5 is added under the global parameter to facilitate regularization of the global parameter towards 0.5. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).
Figure 4.3 (a) Initial state of a hierarchy with three binary observations, all of which are observed to be 0. Two dummy observations with values 0 and 1 are added under the global parameter to facilitate regularization of the global parameter towards 0.5. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).
Figure 4.4 (a) A dataset with a similar hierarchy to the one in Figure 4.3 except that a new class B is added under class A. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).
Figure 4.5 Bayesian network used as ground truth to create the Synthetic Dataset.
Figure 4.6 Test loss for multiple models compared on the promoters dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.
Figure 4.7 Test loss for multiple models compared on the WPBC dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.
Figure 4.8 Test loss for multiple models compared on the WDBC dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.
Figure 4.9 Test loss for multiple models compared on the breast cancer dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.
Figure 4.10 Test loss for multiple models compared on the synthetic dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.

Acknowledgments

I would like to express my gratitude to those who have helped me throughout this journey. My sincere gratitude goes to my supervisor, David Poole, who guided me at every step of the way. With his patience and invaluable expertise, he supported me in developing an understanding of artificial intelligence. My thanks go to Minerva Intelligence Inc., especially Clinton Smyth and Jake McGregor, who advised me and inspired me throughout this project. Next in line is my friendly colleague, Martin Wang, who helped me develop this project.

I would like to thank my parents, who have supported and encouraged me all my life. It is thanks to them that I was able to take this incredible path towards my dreams.
I would also like to thank my dear sister who has always listened to what I had to say.

This research is supported by Minerva Intelligence Inc., MITACS and NSERC; and it was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca).

Chapter 1
Introduction

Despite the outstanding momentum that machine learning has seen in recent years, the entire community stands in front of the barrier of explainability [1]. With the increasing use of AI in diverse fields, the implications of decisions made based on AI are growing. This has led to more concerns regarding potential bias in machine learning. Such concerns regarding AI stretch to ethical domains as well. Model explainability and interpretability can improve public trust in AI [9].

Water quality in lakes and rivers can be an important issue in the economy, public health, and the biological variability of key natural resources [25]. Rivers are shared among diverse industries including tourism, fishing, farming, steelworks, and maritime transport. Such a diversity causes decisions regarding water quality to have huge impacts. Note that decisions regarding water quality are mainly made at government levels with multiple stakeholders and are sometimes entangled with politics. Therefore, transparency and explainability are important factors for an AI model that is to be trusted in water quality tasks.

When a water contamination event occurs, it is important to detect and warn of such events. Extrapolation of water quality datasets into the future can potentially help early warning systems for water quality. Although the use of automatic sensors for water quality is increasing, water quality monitoring in major parts of the world is still done through manual sampling of water [6], which is analyzed in laboratories. The resulting data of such monitoring is usually sparse in space and time [16], sampled irregularly in time, with high correlations in space.
This means that many recent machine learning models which need temporally regular data or make stationarity assumptions in space cannot be directly applied to surface water quality datasets.

There is currently a lack of capacity to model spatiotemporal variabilities in surface water quality at large scales across multiple catchments [10]. The objective of this research is to find a machine learning model for prediction and anomaly detection tasks in such large-scale surface water quality datasets. A model that is able to learn the spatiotemporal variability of surface water pollution in large regions and over long periods of time can potentially be informative in large-scale catchment management and policy making [10]. Our model is designed to be simple and explainable while also eliminating the need for preprocessing of noisy data. The model can also deal with missing feature values in observations.

Our model works based on commonalities and deviations in the dataset. It first extracts the commonality among all the data points. Then it identifies how groups of related observations deviate from the extracted commonality in a hierarchical manner. This method can help us explain the data in terms of commonalities plus deviations. Anomalies can be extracted as the deviations with high values or observations with high deviation from the model's prediction. We develop some theorems about how the deviations interact with each other.
We analyze the performance of the proposed model in a regression task using a water quality dataset and in classification tasks using multiple real-world and synthetic datasets.

1.1 Problem Definition and Scope

We are mainly interested in supervised prediction tasks or unsupervised anomaly detection tasks on a dependent random variable $y$ with $n$ measurements which depends on multiple feature variables $X$. Examples of such tasks include prediction and anomaly detection on water pollution datasets where concentrations $y$ of multiple chemicals are dependent on the time and location of the measurements, $X$. The dataset can have missing feature values in $X$.

For our model to have explainable results, we also exploit any hierarchical structure among the features. An example of a hierarchical structure among features is the parent-child relationship one can imagine between the set of the instances measured in January 2019 and the set of instances measured in 2019. In this example, the set of instances measured in January 2019 is a subset of the set of instances measured in 2019.

Our model can only work with discrete feature variables, $X$. In the case of datasets with continuous features, we can discretize the features either using expert knowledge, e.g. discretizing time by month, or using existing discretization methods, e.g. methods introduced by Liu et al. [18] such as binning with equal frequency.

1.2 Literature review

There are two groups of methods that closely relate to our method: the multiple linear regression model, and matrix factorization. The baseline form of our model is a special case of a multiple linear regression model [14]: a multiple linear regression model can be constructed with discrete features that makes the same predictions as our model. From another point of view, our model is similar to matrix factorization models. Koren et al. [13] motivate the matrix factorization model in terms of an average and learned offsets.
A matrix factorization model with some fixed features can be designed to have the same predictions as our model.

The closest body of work to our model in the water pollution task is the work done by Guo et al. [10]. Guo et al. [10] report that most existing studies on surface water pollution either study the spatial variations of time-aggregated water quality data, e.g. Tramblay et al. [24], or use regression for prediction of water pollution from some other features in a single location, e.g. Kisi and Parmar [12]. In a survey done by Tiyasha et al. [23], the authors reviewed more than 200 research articles which have addressed river water quality modelling using machine learning models to predict some water quality feature directly from other water quality features or contaminants.

There are diverse models that are applied in spatiotemporal tasks. Jin et al. [11] apply Bayesian spatiotemporal models on air pollution datasets. Blangiardo and Cameletti [4] train Generalized Linear Models (GLM) on spatiotemporal datasets. For predicting particulate matter (PM10) or rainfall, they use the stochastic partial differential equation (SPDE) approach [17] to model spatial effects and a random walk to model time effects. They also model different types of interactions between the spatial and temporal effects. When using SPDE for spatial effects, it is assumed that the covariance between every two points in space is only dependent on the distance between the points. This type of stationarity is not applicable to surface water pollution because the covariance is high along a water stream while it is low between streams that have not joined. Blangiardo and Cameletti [4] also use conditional autoregressive (CAR) models to account for spatial effects in disease mapping. A similar approach might be suitable for surface water datasets because of its ability to model spatial dependencies using a graph.
Yet, the method they use only models bidirectional spatial dependencies, whereas in rivers water flows only in one direction. We compare the performance of our model with this approach.

Ruybal et al. [22] use spatiotemporal regression kriging for groundwater-level predictions, which they recommend for datasets that do not contain consistent spatially located data over all relevant temporal periods. Note that kriging is not really applicable to surface water datasets where the water flows in streams.

Our model utilizes a hierarchy of features to learn explainable offsets. Taxonomies and ontologies can be used in structuring such a hierarchy among features.

1.3 Thesis Organization

In Chapter 2, we introduce our model and develop some theory for it. In Chapter 3, we introduce a platform we built for exploratory data analysis on a large-scale water pollution dataset and evaluate our model on that dataset. In Chapter 4, we investigate our model in binary classification tasks.

Chapter 2
Parameter Sharing Model

We assume we have a number of discrete features, e.g. month, year, location. An instance is an assignment to some of the features. An observation is an instance whose value is known. Suppose we have a dataset of $n$ observations $y_1, \ldots, y_i, \ldots, y_n$, for example, measurements of Phosphate in river waters. A class is a set of instances. There are two types of classes: one type is described by a boolean combination of features (the set of instances for which the formula is true) and the other type is a singleton class that contains a single observation. For example, all instances measured in 2017, all instances measured in May, all instances measured in May 2017, and all instances measured in station $s_1$ in May are four classes, in each of which we might have multiple observations. We assume we have $m+n+1$ classes in total ($C_0, \ldots, C_{m+n}$): a universal class that includes all instances ($C_0$), $m$ classes defined as boolean combinations of features ($C_1, \ldots, C_m$), and $n$ singleton classes, one for each observation ($C_{m+1}, \ldots, C_{m+n}$). A parameter sharing model assumes the following:

• Offsets for classes: There exists one parameter for each of the $m+n+1$ classes. $\sigma_j$ is the offset for class $C_j$. Note that we use the words parameter and offset interchangeably in this context.
• Model constraint: The value of each observation is equal to the sum of the offsets of the classes to which the observation belongs: $y_i = \sum_{i \in C_j} \sigma_j$, where the sum ranges over all classes $C_j$ to which observation $i$ belongs.

The offsets for singleton classes are called singleton offsets or noise parameters. In the next section, we motivate this naming.

2.1 Hierarchical Parameter Sharing Model

A hierarchical representation of classes can be used to demonstrate how learning happens in a parameter sharing model. It also allows for a structured learning of offsets based on the hierarchies among their respective classes. In the hierarchy of classes, if class A is a subset of class B, meaning that all instances in class A exist in class B, class A is considered to be under class B in the hierarchy. This subset hierarchy is a subset lattice that can be represented as a directed acyclic graph (DAG) in which class A is a child, descendant, or subclass of class B if $A \subset B$. The DAG is constructed using only the classes for which we have assigned an offset.

Table 2.1: A summary of symbols and expressions in parameter sharing models

Symbol or Expression | Definition
$n$ | Number of observations
$\{y_1, \ldots, y_i, \ldots, y_n\}$ | Set of all observations
$i$ | Index for observations: $1 \le i \le n$
$\{C_0, \ldots, C_j, \ldots, C_{m+n}\}$ | Set of all classes
$\{C_{m+1}, \ldots, C_j, \ldots, C_{m+n}\}$ | Set of all singleton classes (classes with a single observation)
$m+n+1$ | Number of all classes: one universal class, $m$ classes defined as boolean combinations of features, and $n$ singleton classes, one for each observation
$j$ | Index for classes: $0 \le j \le m+n$
$\{\sigma_0, \ldots, \sigma_j, \ldots, \sigma_{m+n}\}$ | Set of all offsets (or parameters) in a parameter sharing model
$\sigma_0$ | Parameter for the universal class (universal parameter)
$\{\sigma_{m+1}, \ldots, \sigma_{m+i}, \ldots, \sigma_{m+n}\}$ | Set of singleton offsets (noise parameters)
$y_i = \sum_{i \in C_j} \sigma_j$ | Model constraint equation for observation $i \in \{1, \ldots, n\}$

Every node in the DAG hierarchy is representative of a class and its associated offset. The singleton classes, which are leaves in the DAG, are representative of the noise parameters $\sigma_{m+i}$ for each observation. Explainability is a main focus in the parameter sharing model; therefore, the goal is to set the offsets $\sigma_j$ such that they have meaningful values. In this thesis, the goal is to set the value of each $\sigma_j$ to be the deviation that best fits all of the observations.

Example 2.1. Suppose we have a dataset of $n$ phosphate measurements. Measurement $i$ is done in a specific year $Y_i$, month $M_i$, and location $L_i$. A parameter sharing model will define offsets $\{\sigma_0, \ldots, \sigma_{m+n}\}$ based on the different classes the observations can fall into. An example of a parameter sharing model for this dataset is as follows:

$$y_i = \sigma_0 + \sigma_{Y_i} + \sigma_{M_i} + \sigma_{L_i} + \sigma_{Y_i M_i} + \sigma_{Y_i L_i} + \sigma_{Y_i M_i L_i} + \sigma_{m+i}, \tag{2.1}$$

where $y_i$ is the measurement of phosphate. $\sigma_0$ is the parameter for the class of all observations (universal class). We will see that the model will predict $\sigma_0$ for an instance for which we do not have any features. Since the most reasonable prediction for such an instance is the mean of all the observations, $\sigma_0$ can represent the universal mean. Note that parameter $\sigma_0$ is shared among all observations. $\sigma_{Y_i}$ is the offset for year $Y_i$. Note that offset $\sigma_{Y_i}$ is shared between all observations done in year $Y_i$. Figure 2.1 shows these classes and their hierarchy in a DAG. $\sigma_{Y_i M_i}$ is the offset for the class of observations done in year $Y_i$ and month $M_i$. $\sigma_{m+i}$ is the offset for the singleton class of observation $i$ and is unique to observation $i$, meaning it is not shared with any other observation.
$\sigma_{m+i}$ represents the noise in observation $i$ which could not be explained using the classes that the observation is in. For example, $\sigma_{m+i}$ might be a high positive value for a sample taken on a day when there was a spill of phosphate close by that caused a one-day increase in phosphate in that location. The existing classes, which are based on the year, month, and location of the measurement, cannot model the one-day increase, so the one-day increase will be modeled in the noise parameter $\sigma_{m+i}$. Nonetheless, all offsets used for this observation will inevitably be influenced by this spill and they will experience a small increase.

Figure 2.1: The hierarchical relationship between classes in Example 2.1 is shown in a DAG. This figure assumes that all observations are done in 2016 or 2017, in January or February, and in 3 locations $\{L_1, L_2, L_3\}$. Seven observations can be seen in this figure.

In Figure 2.1, a hierarchical relationship between classes is shown in a DAG when all observations are done in 2016 or 2017, in January or February, and in 3 locations $\{L_1, L_2, L_3\}$. Seven observations can be seen in this figure. The structure of the DAG allows us to see what classes every observation belongs to. Observations $y_3$ and $y_4$ were measured in January 2016 in location $L_1$. Observations $y_1$ and $y_2$ are only connected to the (2017, Jan) class, which means the location of these measurements is missing in the dataset.

Note that in this example, we did not define any classes for observations done in month $M_i$ and location $L_i$, which resulted in not having $\sigma_{M_i L_i}$ in Equation (2.1). This is allowed in a parameter sharing model. Such decisions can be made using expert knowledge.

A parameter sharing model is realized in a DAG if the constraint for the parameter sharing model holds for all observations: $y_i = \sum_{i \in C_j} \sigma_j$, where the $\sigma_j$ are all the offsets reachable from the node for observation $y_i$ in the reverse DAG.

There are infinitely many models that can explain the data in this fashion.
For instance, in Example 2.1, if we have high observations in July 2017, it is not obvious what part of the high values should be explained through $\sigma_{\text{July}}$, $\sigma_{2017}$, or $\sigma_{2017,\text{July}}$. Therefore, we need bias and regularization that go beyond the data to be able to learn the parameters. For instance, in the previous example, a naive model is to set $\sigma_{m+i} = y_i$ and set all other offsets to zero. This would mean that every observation is simply some unexplained noise.

Learning in a hierarchical parameter sharing model happens when we explain the noise $\sigma_{m+i}$ using other offsets. This can happen through regularization. The following is an example of how learning can happen.

Figure 2.2: (a) All data is noise. (b) Some data is pushed up. A simple DAG model on a small dataset with four observations: 6, 6, 5, and 2. In (a) all the observations are explained as noise using the singleton offsets. Based on (a) we can see that three of the observations are in class A and two of the observations are in class B. Classes A and B share an observation. In (b) a value of 5 is pushed up to A and a value of 0.5 is pushed up to B.

Example 2.2. In Figure 2.2, we can see an example of a DAG with four observations: 6, 6, 5, and 2. The first three observations are in class A, and the last two observations are in class B. Classes A and B share an observation. In Figure 2.2a all the observations are explained as noise using the singleton offsets. In Figure 2.2b a value of 5 is pushed up to A and a value of 0.5 is pushed up to B.

2.2 Prediction and Anomaly Detection in Parameter Sharing Models

As seen above, the model is over-parameterized and its prediction for an observation is equal to the value of that observation. In the case of an unobserved instance, the singleton offset (noise parameter) for that instance is assumed to be zero. With this assumption, the prediction of the model for instance $i$, $\hat{y}_i$, is defined to be the sum of the offsets of all classes that the instance belongs to: $\hat{y}_i = \sum_{i \in C_j} \sigma_j$.
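The constraint and the prediction rule can be illustrated on the four observations of Example 2.2. The following is a minimal sketch, not the thesis implementation: the names `membership`, `predict`, `noise_offsets`, and `reconstruct` are hypothetical, no universal class is modeled, and the singleton offsets for case (b) are derived here from the model constraint rather than read from Figure 2.2.

```python
# Observations of Example 2.2 and the non-singleton classes each one belongs to.
# The third observation is the one shared by classes A and B.
observations = [6.0, 6.0, 5.0, 2.0]
membership = [{"A"}, {"A"}, {"A", "B"}, {"B"}]


def predict(classes, class_offsets):
    """Prediction for an unobserved instance: the sum of the offsets of its
    classes, with the singleton (noise) offset assumed to be zero."""
    return sum(class_offsets[c] for c in classes)


def noise_offsets(class_offsets):
    """Singleton offsets implied by the model constraint:
    noise_i = y_i - (sum of the class offsets of observation i)."""
    return [y - predict(classes, class_offsets)
            for y, classes in zip(observations, membership)]


def reconstruct(class_offsets, noise):
    """Right-hand side of the model constraint for every observation."""
    return [predict(classes, class_offsets) + s
            for classes, s in zip(membership, noise)]


# (a) everything is noise; (b) 5 is pushed up to A and 0.5 is pushed up to B.
all_noise = {"A": 0.0, "B": 0.0}
pushed_up = {"A": 5.0, "B": 0.5}

noise_a = noise_offsets(all_noise)
noise_b = noise_offsets(pushed_up)

# Both assignments satisfy the constraint exactly, so the data alone
# cannot choose between them.
assert reconstruct(all_noise, noise_a) == observations
assert reconstruct(pushed_up, noise_b) == observations

print(noise_a)                          # [6.0, 6.0, 5.0, 2.0]
print(noise_b)                          # [1.0, 1.0, -0.5, 1.5]
print(predict({"A"}, pushed_up))        # 5.0
print(predict({"A", "B"}, pushed_up))   # 5.5
```

Under assignment (a) the prediction for any unobserved instance would be 0, whereas under (b) an unobserved instance in class A is predicted as 5; this is one way to see why pushing values up from the noise parameters is what lets the model generalize.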
Table 2.1 shows a summary of the different symbols and expressions in a parameter sharing model.

Using a trained hierarchical parameter sharing model, we can do unsupervised anomaly detection in the following three fashions:

• Class parameters with large absolute values: If a class offset has a large absolute value relative to other class offsets, knowing that an observation belongs to this class raises the alarm that this sample will have an anomalously large or small value. As an example, if we have a class of observations done in summer for phosphate, we might observe that the offset for summer is high. This type of anomaly does not necessarily suggest an alarming event. For example, the high offset for summer could be due to the fact that farming activity in summer causes an increase in the phosphate level.

• Noise parameters with large absolute values: If the noise parameter for an observation has a large absolute value, it can mean that an anomalous event led to a high or low value for that observation, but this event was not explained or captured by the classes we derived from the dataset. For instance, the noise parameter could be high for an observation of phosphate done close to a spill that resulted in a one-day increase in phosphate levels, but we did not have a class capturing this and the high value for the observation remained unexplained in the noise parameter.

• Group of observations with similar noise parameters: If a group of observations have similar values for their noise parameters, we might be able to define a new class which includes those observations. To learn a new class which includes such observations, we need extra information about the observations beyond the current classes derived from the dataset.
For example, if some observations of phosphate that were measured in locations close to each other have a similar noise parameter, we can make a new class for these spatially close observations. Making a new class for these observations will cause the noise parameter in all these observations to decrease. Note that creation of this class would not be possible if we did not have information about the spatial closeness of the observations. This spatial closeness may not have been captured through the existing classes.

The above three methods of anomaly detection look for anomalously large absolute values in the list of class or noise parameters. Since largeness is relative, we can sort the parameters based on absolute value and look for anomalies among the first parameters in the sorted list.

Example 2.3. Figure 2.3 shows a hierarchical parameter sharing model represented using a DAG. The leaves of the DAG model correspond to five singleton offsets for five observations (at various times and locations) of phosphate: 1.5 mg/L, 3 mg/L, 2.5 mg/L, 2 mg/L, and 0.5 mg/L. Sample 5 is from March 2018 in region B, and therefore it is under these two classes. At every node, the value of the offset for that class is written. For example, the offset for region B (which includes samples 4 and 5) is −0.62 mg/L. The prediction for any unobserved instance is the sum of the offsets of the classes that the instance belongs to. For example, the observed value for observation 5 is the sum of all offsets for sample 5: noise parameter for sample 5, 2018-March, 2018, March, region B, and the universal parameter: −0.76 − 0.15 − 0.15 − 0.01 − 0.62 + 2.19 = 0.5 mg/L.

Figure 2.3: An example of a hierarchical parameter sharing model modeling a chemical observations dataset. The values for observations 1 to 5 are: 1.5 mg/L, 3 mg/L, 2.5 mg/L, 2 mg/L, and 0.5 mg/L. The vertices are offsets and all values are in mg/L. The observations and the edges touching them are colored.

The model can also predict values for unobserved instances. As an example, although the model in Figure 2.3 has not seen any observation from region B in February 2017, it can give a prediction for such an instance as −0.21 − 0.21 + 0.15 − 0.62 + 2.19 = 1.3 mg/L.

2.3 Tree DAGs as a Simple Structure of DAG Hierarchy

In general, the DAG representing the subset relationship among classes can be any subset lattice, but this thesis considers tree DAGs as a special case of subset lattices and builds up the analysis of the properties of hierarchical parameter sharing models by first studying their workings on tree DAGs. A DAG is considered to be a tree when each node except the top node has exactly one parent. A tree DAG is simpler than a general DAG because in a tree DAG moving the information up to the parents is simpler; every node has a single parent, so all the commonality between the siblings has to be explained by a single parent. A general DAG is more complex in that it is not trivial how much of the commonality among siblings has to be explained by each of their parents.

For instance, one of the properties of a tree DAG is that starting from any initial state, for the model constraints to hold, all the siblings have to change by the same value.

2.4 Learning Hierarchical Parameter Sharing Models

In the following sections, different methods for learning the $\sigma_j$ parameters will be explored.

2.4.1 L2-Regularized Baseline Parameter Sharing Model

To learn the offsets in an L2-regularized baseline parameter sharing model, we minimize the L2 norm of all the offsets except the parameter for the universal class (universal parameter). In this context, the L2 norm of some elements is defined to be the square root of the sum of the squares of the values.
Note that the model constraints ($y_i = \sum_{i \in C_j} \sigma_j$) have to hold while we minimize the L2 norm of the offsets. In other words, to learn an L2-regularized baseline parameter sharing model, we minimize the following loss function subject to the condition that the model constraints are not violated. In the following, as mentioned in Table 2.1, $\sigma_0$ is the universal parameter, which is not regularized because the universal average can be any value and there is no reason to believe it is close to zero:

$$\text{Loss} = \sum_{j \in \{1,\dots,m+n\}} (\sigma_j)^2 \tag{2.2}$$

One way of looking at this is that we start with all non-singleton offsets set to zero and the singleton parameters set to the observed values. In every layer of the hierarchy, if the sum of squares of the children is greater than that of their parents, then the model will push the signal from the children to the parent by subtracting a value from the children and adding some value to the parents. In the end, the mean square of the children will be closer to zero while they are pushed to their parents because this is a state with a smaller L2 norm. It is possible to add an informed prior by regularizing the offsets towards a default value instead of zero, especially if the dataset has very few observations.

We do not regularize the universal parameter because the universal parameter is supposed to capture the universal mean in the dataset, which can be any value. All other offsets are regularized. In Equation (2.2), we can separate the sum into the class offsets $\sigma_j$ ($1 \le j \le m$) and the observation (noise) parameters $\sigma_{m+i}$.
Then, we can use the model constraint to reach a different formula for the loss:

\begin{align}
\text{Loss} &= \sum_{j \in \{1,\dots,m+n\}} (\sigma_j)^2 \tag{2.3} \\
&= \Big( \sum_{i \in \{1,\dots,n\}} (\sigma_{m+i})^2 \Big) + \Big( \sum_{j \in \{1,\dots,m\}} (\sigma_j)^2 \Big) \tag{2.4} \\
&= \Big( \sum_{i \in \{1,\dots,n\}} \big( y_i - \sum_{i \in C_k,\, k \neq m+i} \sigma_k \big)^2 \Big) + \Big( \sum_{j \in \{1,\dots,m\}} (\sigma_j)^2 \Big) \tag{2.5}
\end{align}

In Equation (2.5), the first term is the error term denoting the difference between the predictions of the parameter sharing model (excluding the noise parameter) and the observed values, while the second term is the sum of the squared class offsets, excluding the universal parameter and the noise parameters. This equation can be used to learn an L2-regularized baseline parameter sharing model on a dataset. It shows that this model is equivalent to an L2-regularized linear regression model $y_i = \sigma_0 + XB$, where the universal parameter $\sigma_0$ is the intercept and does not get regularized. The class offsets $\{\sigma_1, \dots, \sigma_m\}$ are the coefficients $B$ in the linear regression model, which are regularized, and the independent variables $X$ are 0 or 1, denoting the classes that the observation belongs to.

Model Properties

To get more insight into how this model learns the class offsets in a tree, we can assume that we initialize all class offsets to zero except the noise parameters $\sigma_{m+i}$, which are set to $\sigma_{m+i} = y_i$. Starting from this initial state and using some iterative method that converges to the global minimum in Equation (2.5), we can get insight into how parents are learned from the values of their children in a tree. The following example illustrates this.

Figure 2.4: (a) Initial state. (b) L2-regularized. A simple tree DAG model on a simple dataset with five observations. Based on (a) we can see that three of the observations have a value of 6, and the other two observations have values of 3 and 2.

Example 2.4. Figure 2.4a demonstrates the initial state of a hierarchical parameter sharing model as described above, and Figure 2.4b shows the L2-regularized version of the model where the loss is minimized.
The prediction of the model for an instance in class A is 4.4 + 1.1 = 5.5 while all the observations in A are 6. This is due to the effect of regularization, which causes the predictions of siblings A and B to be closer to each other. The prediction of the model for an instance in class B is 4.4 − 1.1 = 3.3 while there were two observations of 3 and 2 in class B.

In the following two theorems, we analyze some properties of an L2-regularized baseline parameter sharing model in trees and DAGs:

Theorem 2.1. In a trained L2-regularized baseline parameter sharing model on a tree hierarchy, at every layer except the top layer (with the universal parameter), the sum of the children is equal to the parent.

Proof. Consider a parameter sharing model with the minimum loss defined in Equation (2.2) in which there is a parent with value $p$ with $c$ children with values $v_1, \dots, v_c$, where $S = \sum_{t=1}^{c} v_t$. Assume $\sum_{t=1}^{c} v_t \neq p$. This means:

$$\varepsilon = \frac{p - \sum_{t=1}^{c} v_t}{c+1} \neq 0 \tag{2.6}$$

The argument is that if we add $\varepsilon$ to all the children and subtract $\varepsilon$ from the parent, we will arrive at a lower loss while the model constraints still hold. Since only the values of $p$ and $v_t$ change through this operation, we only look at the part of the loss function that is composed of these offsets:

\begin{align}
\text{loss}_{\text{old}} &= p^2 + \sum_{t=1}^{c} (v_t)^2 \tag{2.7} \\
\text{loss}_{\text{new}} &= (p - \varepsilon)^2 + \sum_{t=1}^{c} (v_t + \varepsilon)^2 \tag{2.8} \\
&= (p - \varepsilon)^2 + \sum_{t=1}^{c} \big( (v_t)^2 + \varepsilon^2 + 2 v_t \varepsilon \big) \tag{2.9} \\
&= p^2 + \varepsilon^2 - 2p\varepsilon + c\,\varepsilon^2 + 2\varepsilon \Big( \sum_{t=1}^{c} v_t \Big) + \sum_{t=1}^{c} (v_t)^2 \tag{2.10} \\
&= p^2 + \sum_{t=1}^{c} (v_t)^2 + (c+1)\varepsilon^2 - 2\varepsilon \Big( p - \sum_{t=1}^{c} v_t \Big) \tag{2.11} \\
&= p^2 + \sum_{t=1}^{c} (v_t)^2 + (c+1)\varepsilon^2 - 2\varepsilon (c+1) \varepsilon \tag{2.12} \\
&= p^2 + \sum_{t=1}^{c} (v_t)^2 - (c+1)\varepsilon^2 < \text{loss}_{\text{old}} \tag{2.13}
\end{align}

This contradicts the assumption that the loss was minimal, so the sum of the children must equal the parent.

Notice in Figure 2.4b, offset B is equal to the sum of its children; this holds in general.

Theorem 2.2. In an arbitrary DAG, if there exists a set of disjoint classes $\{C_1, \dots, C_l\}$ with offsets $\{\sigma_1, \dots, \sigma_l\}$ whose union is all the observations (i.e., the classes are a partition of all of the data), then in the L2-regularized baseline parameter sharing model, the following holds: $\sum_{t=1}^{l} \sigma_t = 0$.

Proof. Assume we have a solution with minimum loss, but $\sum_{t=1}^{l} \sigma_t \neq 0$.
This means that $\zeta$, the average of the $\sigma_t$, is not zero. We show that if we subtract $\zeta$ from each $\sigma_t$ and add $\zeta$ to the universal parameter, the loss defined in Equation (2.2) will decrease. The model constraints will still hold because for every observation the increase in the universal parameter is compensated for by the decrease in $\sigma_t$. Note that the universal parameter does not appear in the loss expression. We only look at the part of the loss function that includes the $\sigma_t$ offsets:

\begin{align}
\text{loss}_{\text{old}} &= \sum_{t=1}^{l} (\sigma_t)^2 \tag{2.14} \\
\text{loss}_{\text{new}} &= \sum_{t=1}^{l} (\sigma_t - \zeta)^2 \tag{2.15} \\
&= \sum_{t=1}^{l} (\sigma_t)^2 - 2\zeta \sum_{t=1}^{l} \sigma_t + l\zeta^2 \tag{2.16} \\
&= \sum_{t=1}^{l} (\sigma_t)^2 - 2\zeta\, l\zeta + l\zeta^2 = \sum_{t=1}^{l} (\sigma_t)^2 - l\zeta^2 < \text{loss}_{\text{old}} \tag{2.17}
\end{align}

This contradicts the assumption that the loss was minimal.

Note that the above case happens frequently in real datasets. For example, in Figure 2.3, which is already L2 minimized (and rounded to 2 decimal places), you can see that the sum of the offsets for January, February, and March; the sum of the offsets for years 2017 and 2018; and the sum of the offsets for 2017-January, 2017-February, 2017-March, and 2018-January are 0 because all of them partition the observations. Even the sum of 2017 and 2018-March or the sum of 2017-March, 2018-March, sample 1, and sample 2 is zero.

2.4.2 Top-Down Parameter Sharing Model

An extension to the L2-regularized baseline parameter sharing model would be to learn the L2-regularized baseline model layer by layer from top to bottom until the noise parameters are learned. This allows the training of each class offset to have direct access to the observation residuals under that class. In addition, this allows explicit use of a hierarchy during training. For example, the parents, siblings, and children of a node can explicitly be used to learn it.

To learn a top-down parameter sharing model, first we initialize all offsets to zero. Then, we go down from the top of the hierarchy (universal parameter) to the leaves, layer by layer. At every layer, we learn the offsets in that layer while keeping the offsets on other layers fixed.
At every layer, we minimize the L2-regularized mean squared error of the predictions of the observations. In other words, assuming offsets $\sigma_{l,t}$ are the offsets in layer $l$ of the hierarchy ($1 \le t \le c_l$), at layer $l$ we minimize the following loss function:

$$\min_{\sigma_{l,1},\dots,\sigma_{l,c_l}} \Big( \sum_{i \in \{1,\dots,n\}} \big( y_i - \sum_{i \in C_k,\, k \neq m+i} \sigma_k \big)^2 \Big) + \lambda \sum_{t=1}^{c_l} (\sigma_{l,t})^2 \tag{2.18}$$

where $\lambda$ is the regularization rate. Note that the minimization above is only over the parameters in layer $l$. When layers 1 to $l$ are learned, the parameters in layers $> l$ are zero.

Model Properties

In a tree hierarchy, assume the observations under class $C$ are $o_1, \dots, o_c$, the parent of $C$ is $P$, and the sum of the offsets of all the ancestors of $C$ (i.e. the prediction for the class of $P$) is $\hat{P}$. Since the hierarchy is a tree, the parameter for $C$, $\sigma_C$, will be calculated by minimizing the part of the loss function relevant to $\sigma_C$:

\begin{align}
\sigma_C &= \arg\min_{\sigma_C} \Big( \sum_{t=1}^{c} (\hat{P} + \sigma_C - o_t)^2 \Big) + \lambda (\sigma_C)^2 \tag{2.19} \\
&\implies \frac{d}{d\sigma_C} \Big( \Big( \sum_{t=1}^{c} (\hat{P} + \sigma_C - o_t)^2 \Big) + \lambda (\sigma_C)^2 \Big) = 0 \tag{2.20} \\
&\implies \sigma_C = \frac{\sum_{t=1}^{c} (o_t - \hat{P})}{c + \lambda} = \frac{\big( \sum_{t=1}^{c} o_t \big) - c\hat{P}}{c + \lambda} = \frac{\big( \sum_{t=1}^{c} o_t \big) + \lambda \hat{P}}{c + \lambda} - \hat{P} \tag{2.21}
\end{align}

This means that the prediction for the class of $C$ (i.e. $\hat{C}$) is:

$$\hat{C} = \hat{P} + \sigma_C = \frac{\big( \sum_{t=1}^{c} o_t \big) + \lambda \hat{P}}{c + \lambda} \tag{2.22}$$

This means that in a tree, the prediction for the class of $C$, $\hat{C}$, is a weighted average of $\hat{P}$ (with weight $\lambda$) and the observations in $C$ (with total weight $c$).

2.5 Explainability in Parameter Sharing Models

Barredo Arrieta et al. [1] consider a model to be transparent if it is understandable by itself. They also give three levels of transparency for models, namely simulatability, decomposability, and algorithmic transparency. Both versions of our model explained above are transparent: they are simulatable because the interactions between offsets are through summation and each offset corresponds to a well-defined class, e.g. the offset for a month; they are decomposable because all of the offsets have the same unit of measure as the observations; and they are algorithmically transparent because they use the sum of the relevant (shared) offsets to make predictions.
This transparency can be utilized for explainability in the following fashions:

• Observation explainability: We can explain every observation and every prediction in terms of a linear sum of weights, all of which have a semantic meaning. For example, we can use the model to answer the question of whether an extraordinarily high phosphate observation can be explained through the month it was sampled in or not.

• Gestalt explainability: We can explain the whole model in terms of anomalies, which are offsets with a high (positive or negative) value.

Chapter 3

Testing on Water-Quality Dataset

3.1 Water Quality Dataset

Waterbase - Water Quality is an open dataset [8] by the European Environment Agency (EEA) which includes over 33 million pollutant readings all over Europe from 1985 to 2018. Each reading measures a chemical at a specific time and at a specific water pollution monitoring station. Some stations monitor pollutants in ground water and some monitor surface water, e.g. rivers and lakes. We are interested in surface water pollution. The dataset does not include how surface water monitoring stations are connected to each other by rivers. To determine this, we downloaded the river data from OpenStreetMap [19] and matched stations with rivers. A station is matched to a river if the station is less than 50 meters away from the thalweg of the river extracted from OpenStreetMap. With this data, we can determine which surface monitoring stations are upstream and downstream of each other. We filtered out stations that were not matched with any river (further than 50 meters from any river center line), assuming that they were ground water monitoring stations.

3.2 Water Pollution Platform

In order to analyze the water-quality dataset and gain insight into the data, we built an exploratory data analysis platform for the dataset. This platform allows the user to visualize different aspects of the data.
The user can also use it to visualize the models trained on the dataset as well as to visualize anomalies detected on the dataset. The preprocessing of data for the platform is done using Python, while a web-based front-end built with HTML and JavaScript accesses the processed data and generates the visualizations. The main page of the platform allows the user to visualize the dataset along with some extra information extracted from the dataset. Figure 3.1 shows an annotated image of this page. The main component of this page is the map pointed at by marker #1. The user can see surface water pollution stations as yellow dots on the map. Every measurement is done in a water pollution station. The rivers are shown as blue lines. In the following, the different functionalities available to the user on this page are introduced.

Figure 3.1: Annotated main page of the water pollution platform. This page is mainly used for exploratory data analysis.

3.2.1 Stations with Measurements

Since not all stations have measurements of all chemicals, the first step for a user interested in certain chemicals is to identify the stations that have measurements of those chemicals. The user can select a number of chemicals using the check boxes in marker #6. They also need to select a number of years using the check boxes in marker #11. Selecting chemicals will change the color of the stations that have measurements of those chemicals in the selected years to black. Figure 3.2 shows the map on the main page when the check boxes for chemicals "Nitrate" and "Lead and its components" were selected in marker #6, and year 2017 was selected in marker #11. In that figure, the black stations are the stations that have measurements of "Nitrate" or "Lead and its components" in 2017.

Figure 3.2: The map on the main page when the check boxes for chemicals "Nitrate" and "Lead and its components" were selected in marker #6. The black stations are the stations that have measurements of "Nitrate" or "Lead and its components" in 2017. Note that "union" was selected in marker #8 and 2017 was selected in marker #11. The yellow stations do not have any measurements of "Nitrate" or "Lead and its components" in 2017.

Marker #8 shows a slider for selecting either "union" or "intersect." The choice is used when coloring the stations black. If union is selected, the stations that have measurements of any of the selected chemicals will turn black. If intersect is selected, only the stations that have measurements of all the selected chemicals will turn black.

3.2.2 Pollutant Added to River Section

Figure 3.3: The map on the main page of the water pollution platform when Nitrate is selected in #5 and year 2015 is selected in #10. Red segments of the river show the segments where Nitrate was added substantially in 2015.

Radio buttons in marker #5 allow the user to select a single chemical. Based on the chemical selected in #5 and the year selected in #10, the river segments are colored on a scale from red to blue, with red meaning that more of the chemical was added to the water in that segment than in the other segments of the rivers. Blue means that the selected chemical was not added substantially in that segment of the river in that year. To do this, we first calculate the average concentration of every chemical for each year for each station. We do not have access to water discharge in rivers; we only have access to chemical concentrations in water as amount per litre. In a segment of river with no confluence, the difference between the downstream and upstream concentrations is a measure of how much chemical was added in that segment. For a river segment with confluences, i.e. with multiple upstream stations on different upstream branches and a single downstream station, we estimate the amount of chemical added as the downstream concentration minus the maximum of the upstream concentrations.
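The estimate described above can be sketched as follows, assuming the yearly mean concentration per station has already been computed. The function name and the example concentrations are hypothetical and only illustrate the arithmetic; they are not taken from the platform's code or from the dataset.

```python
def added_concentration(downstream_mean, upstream_means):
    """Estimate of the chemical added along one river segment.

    downstream_mean: yearly mean concentration at the downstream station.
    upstream_means:  yearly means at the closest upstream stations, one per
                     upstream branch (a single value if there is no confluence).
    """
    return downstream_mean - max(upstream_means)


# Hypothetical yearly mean concentrations (amount per litre).
no_confluence = added_concentration(12.0, [9.5])           # one upstream station
with_confluence = added_concentration(12.0, [9.5, 11.0])   # two upstream branches
```

With these made-up numbers the differences are 2.5 and 1.0. Because the maximum is subtracted, the difference is positive only when the downstream concentration exceeds every upstream concentration.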
A difference above zero means that a nonzero amount of the chemical was added to the water in that segment. Figure 3.3 shows how the colors of the river segments change when Nitrate is selected in the radio buttons in #5 and the radio button for year 2015 is selected in #10. Note that it is not trivial to extract this information from the water-quality dataset since it does not contain data about which stations are connected by water. Only after matching the stations with their corresponding rivers could we extract this data.

The button in marker #9 allows the user to reset the colors of the rivers to their original blue color.

3.2.3 Plots of Measurements in a Station

The user can select a water pollution station on the map by clicking on it. This causes the station to change color to red, and the closest stations upstream of the selected station are assigned different colors ranging from orange to green. In addition, at the bottom of the page, plots of measurements for all chemicals in the selected station appear. In Figure 3.4, we have selected the station pointed at by marker #2. The figure shows the map zoomed in on the selected station and the plots for two chemicals shown at the bottom of the page. The rest of the plots are drawn underneath the two plots.

If a set of chemicals is selected using the check boxes in marker #6, those chemicals will appear at the top of the plots.

The plots only show the measurements in the years selected in marker #11. Marker #15 allows selection of a number between zero and seven. This number is used to filter out uninteresting plots that have fewer than that number of distinct values. For example, some chemicals, e.g. Carbendazim, are always measured to be 0 in some stations.
Since such measurements are of no interest in visualization, using a large value in this selector will filter out such plots with only a few distinct measurements.

3.2.4 Saving and Restoring Favourite States

The section pointed at by marker #16 allows the user to store favourite selections, e.g. a selection of station, chemicals, and year. The user can return to the stored selection at a later time.

3.2.5 Finding Peaks in the Plots

This section is used to find a peak in the chemical plots for each measurement. Analyzing peaks in the measurements is part of the exploratory data analysis process.

Figure 3.4: In this figure, the station pointed at by marker #2 in Figure 3.1 is selected. The map is zoomed in on the selected station. The stations upstream (colored with different shades of orange) of the selected station (colored red) can be seen. Two plots show the measurements for two chemicals done in the selected station in 2017. The rest of the plots are drawn underneath these two plots.

3.2.6 Visualizing Anomalies

Hyperlinks pointed at by markers #3 and #4 direct the user to two other pages where the user can visualize the detected anomalies. The method used for learning anomaly classes and the content of these pages are explained in Section 3.3.3.

3.3 Hierarchical Parameter Sharing on Water Pollution

3.3.1 Dataset Generation

In order to prepare a dataset for training and testing different models, we only kept the readings from 2013 to 2017 for phosphate for the stations in the Loire basin in France shown in Figures 3.1, 3.2, and 3.3. This basin was chosen because the dataset includes many samples in this basin. This basin includes 295 stations and 9051 phosphate readings. Note that the readings in the dataset are done irregularly and at intervals usually longer than a month.
Some stations in the Loire basin do not have any phosphate readings in 2013 to 2017.

We do two sets of tests on this dataset to analyze the performance of different hierarchical parameter sharing models in interpolation and future extrapolation. For the interpolation test, we split the dataset into 1000 test samples and 8051 training samples. For the extrapolation test, we use the measurements from year 2015 and before as the training set and the measurements from 2016 and later as the test set.

3.3.2 Model Training

We train various models on the interpolation and future extrapolation datasets:

Mean Predictor

This is a very simple model which calculates the mean of the training measurements and predicts that value for all the test samples.

L2-Regularized Baseline Parameter Sharing Model

To train this model on the dataset, we need to decide on some pre-defined classes. We chose eight sets of classes. We have classes for:

• each station: 295 classes
• each year: 5 classes
• each season: 4 classes
• each combination of every year and season: 5×4 = 20 classes
• each month: 12 classes
• each combination of every station and year: 295×5 = 1475 classes
• each combination of every station and season: 295×4 = 1180 classes
• each combination of every station, year, and season: 295×5×4 = 5900 classes

In total we have 8891 possible classes, but some of these classes are empty; therefore, we actually end up with 7291 nonempty classes. Note that many of these classes have nonempty intersections and most of them have more than one sample.

Top-Down Parameter Sharing Model

We use the same sets of classes chosen for the L2-regularized baseline parameter sharing model. We set $\lambda = 1$. The results are not very sensitive to different values of $\lambda$.

Spatial BYM Model with a Temporal Random Walk

This model is an adaptation of the model introduced by Blangiardo and Cameletti [5].
It is a generalized linear model from the Gaussian family with an identity link function and a linear predictor as follows:

$$y_{lt} \sim \text{Normal}(\mu = \eta_{lt},\ \sigma^2 = 1/\tau) \tag{3.1}$$

$$\eta_{lt} = b_0 + \gamma_t + \phi_t + v_l + u_l \tag{3.2}$$

with $y_{lt}$ being the measurement at location $l$ at time step $t$. The parameters in this model are explained in the following:

• $b_0$ quantifies the average of the measurements in all the data.

• $\gamma_t$ represents a random walk of order 2 defined as:

$$\gamma_t \mid \gamma_{t-1}, \gamma_{t-2} \sim \text{Normal}(2\gamma_{t-1} - \gamma_{t-2},\ s_\gamma^2) \tag{3.3}$$

• $\phi_t$ is represented by independent and identically distributed (iid) Gaussian variables for each time step as follows: $\phi_t \sim \text{Normal}(0, 1/\tau_\phi)$.

• $v_l$ is an unstructured residual modeled using independent and identically distributed (iid) Gaussian variables for the different locations as $v_l \sim \text{Normal}(0, s_v^2)$.

• $u_l$ is a spatially structured residual modeled with a conditional autoregressive (CAR) model used to model spatial interactions [2], namely the Besag-York-Mollié (BYM) model introduced by Besag et al. [3]. This model allows us to specify that the variables $u_l$ for locations close to each other are correlated with each other. Assuming there are $L$ locations with variables $\{u_1, \dots, u_L\}$, the closeness is specified with a graph of these locations. Defining $N_l$ to be the number of neighbors of location $l$ and $u_{-l}$ as the list of the variables of all locations except location $l$, the BYM model is specified as follows:

$$u_l \mid u_{-l} \sim \text{Normal}\Big( \mu_l + \frac{1}{N_l} \sum_{k=1}^{L} a_{lk} (u_k - \mu_k),\ s_l^2 \Big) \tag{3.4}$$

where $a_{lk}$ is 1 if locations $l$ and $k$ are neighbors and 0 otherwise. In this model, the constraints $\sum_{k=1}^{L} u_k = 0$ and $\mu_k = 0$ are typically set (for example, Lee [15] introduces this model as the simplest CAR prior) to form the following final distribution over $u_l$:

$$u_l \mid u_{-l} \sim \text{Normal}\Big( \frac{1}{N_l} \sum_{k=1}^{L} a_{lk} u_k,\ s_l^2 \Big) \tag{3.5}$$

This model was trained using R-INLA [21]. The R-INLA default priors were used for parameters $\tau$, $b_0$, $s_\gamma^2$, $\tau_\phi$, $s_v^2$, and $s_l^2$.
The random walk in this model can only work on a dataset with temporally regular samples; therefore, we only test this model on the future extrapolation task, where we interpolate the training set and take weekly samples of the interpolated training set. In addition, R-INLA cannot handle the large number of samples generated from the weekly interpolated dataset, so we train this model only on the interpolated measurements from 2014/11/1 to 2016/1/1.

3.3.3 Anomaly Detection

We do unsupervised anomaly detection on the water-quality dataset using the hierarchical parameter sharing models. In Section 2.2, we discussed three types of anomalies that can be extracted using a parameter sharing model. In this section, we discuss how those anomaly types can be extracted from our phosphate dataset. To identify the three types of anomalies in the dataset, we first need to train a hierarchical parameter sharing model on the dataset. For this, we use the L2-regularized baseline parameter sharing model trained on the dataset (explained in Section 3.3.2), which we will call the initially trained L2-regularized model.

Anomalies from class parameters and noise parameters with large absolute values can be easily extracted from the initially trained L2-regularized model by sorting the class parameters and noise parameters and reporting the ones with the largest absolute values. For groups of observations with similar noise parameters, we learn such groups step by step by picking an observation with a large noise parameter and expanding the group in time and space to incorporate close samples until the group becomes as populated as possible. We can learn as many groups as possible in this fashion. In the following, we explain in detail how an anomaly group can be learned starting with one sample with a large noise parameter.

Learning groups of observations with similar noise parameters

To learn anomaly classes, we start with the sample that has the largest absolute noise.
We want to expand this anomaly class greedily in space and time to make it as populated as possible while achieving the lowest possible training loss. In our implementation, every anomaly class includes a set of stations and has a time interval $[t_1, t_2]$ where $t_1$ is the first day of some month, e.g. Oct 1st, 2016, and $t_2$ is the last day of some month, e.g. April 30th, 2017; we assume the time interval of an anomaly cannot be shorter than a month.

Having picked the sample that has the largest absolute residual error $e_j$, we create a new anomaly class $A$ with the station that the sample is in and the month that the sample is in. As a result, one new offset parameter $\sigma_A$ is introduced for this anomaly class. Note that this anomaly class might start with more than one sample if there is any other sample in the same month and station as the initially picked sample, but for now assume the anomaly starts with only one sample. We initialize $\sigma_A$ using Theorem 2.1, so $\sigma_A = \frac{e_j}{2}$ and $e_j$ changes to $\frac{e_j}{2}$. Having introduced this new anomaly class, we can observe that we have already decreased the training loss by $e_j^2 - \big( (\frac{e_j}{2})^2 + (\frac{e_j}{2})^2 \big) = \frac{e_j^2}{2}$. Note that the loss is computed using Equation (2.2). Now, we try to expand the anomaly class greedily, step by step, by expanding it either temporally or spatially. At each step, we accept the expansion that gives us the lowest final loss. There are 6 candidates for expanding the anomaly class temporally: expanding the time interval by 1 to 3 months forward or backward. The candidates for expanding the anomaly class spatially are as follows:

• Expand the set of stations by adding one of the upstream stations of one of the existing stations in the anomaly class.

• Expand the set of stations by adding the 1 to 3 closest stations to one of the existing stations in the anomaly class.

When expanding the anomaly using the candidates mentioned above, we assume that Theorem 2.1 holds locally and we use it to compute the new value for $\sigma_A$ after expansion.
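One way to read this step, under the assumption that the anomaly class acts as a parent whose children are the residuals of its member samples: let $e_1, \dots, e_n$ be the original residual errors of the $n$ samples in the class (the symbols $e_t$ and $n$ are introduced here for illustration). After the class is added, each residual becomes $e_t - \sigma_A$, and applying Theorem 2.1 locally gives:

```latex
\sigma_A = \sum_{t=1}^{n} (e_t - \sigma_A)
\quad\Longrightarrow\quad
\sigma_A = \frac{1}{n+1} \sum_{t=1}^{n} e_t
```

For $n = 1$ this reduces to the initialization $\sigma_A = \frac{e_j}{2}$ used above. The same value is the minimizer of $\sigma_A^2 + \sum_{t=1}^{n} (e_t - \sigma_A)^2$, which is the part of the loss in Equation (2.2) that involves the anomaly class when all other offsets are held fixed.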
We also approximate the new errors of the samples in the anomaly class as their original values minus the new $\sigma_A$. These approximations are refined by training the model for 150 gradient descent steps every time we learn 10 anomaly classes. It turns out that simply placing the sample with the maximum absolute residual error alone in one anomaly class yields a large decrease in loss. This is a trivial anomaly with only one sample, which we do not want. Therefore, we filter out the candidates that cannot achieve a training loss lower than the loss obtained by placing their sample with the maximum absolute residual error alone in its own anomaly class.

We stop learning anomaly classes after we have learned 300 of them. There is a special case not covered in the previous paragraph: if the sample that has the largest absolute residual error could not be expanded and ended up being the single sample in the new anomaly class, we destroy the anomaly class and we store the sample so that in the next iteration we can skip this sample and pick the next sample that has the largest absolute residual error. The set of stored samples is reinitialized every 100 anomaly classes learned. The following example illustrates in more detail how the anomaly classes were learned.

Example 3.1. In Figure 3.5a, we show the training errors for each observation after initial training. The initial training loss is $10^2 + 8^2$ plus the sum of squares of all existing class parameters and all errors for other observations (not shown in Figure 3.5a). From now on, we will not mention the "sum of squares of all existing class parameters and all errors for other observations" when we compute the training loss; therefore, for the purpose of this example, the initial training loss is $10^2 + 8^2 = 164$. Observation 1 has the largest absolute training error, which is 10. We create a new class including only this observation.
This adds a new class parameter and, as we said, we assume its value can be computed using Theorem 2.1; therefore, the new class parameter will be $\frac{10}{2} = 5$, and having this class will change the error of observation 1 to 5. Simply adding this class changes the training loss to $5^2 + 8^2 + 5^2 = 114$. Now, we see if we can expand this anomaly to new samples to reduce the training loss as much as possible. Expanding the anomaly one month forward is one of the candidates in our search space, so we compute the new training loss if we expand the anomaly to the next month. This adds observation 2 to our anomaly. The parameter for this anomaly will become $\frac{10+8}{3} = 6$. The new error values for observations 1 and 2 will be 4 and 2 respectively. The new training loss will be $4^2 + 2^2 + 6^2 = 56$. This is less than 114, so this expansion will be greedily accepted if there is no other expansion that results in a training loss lower than 56.

Figure 3.5 (panels a and b): Residual errors for four training observations after initial training. The residual errors for observations 1, 2, 11, and 12 are 10, 8, 10, and -4 respectively.

Now, consider the example in Figure 3.5b. In this example, the initial training loss is $10^2 + (-4)^2 = 116$. Assume we construct a new anomaly with observation 11. The new class parameter will be $\frac{10}{2} = 5$ and the new training loss will be $5^2 + (-4)^2 + 5^2 = 66$. Now we compute the training loss if we expand the anomaly one month forward. The new class parameter for the anomaly will be $\frac{10-4}{3} = 2$. The error values for observations 11 and 12 will be 8 and -6 respectively. The new training loss will be $8^2 + (-6)^2 + 2^2 = 104$. This is greater than 66, so this expansion candidate will never be accepted.
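
The loss values in the two worked cases can be reproduced with the short sketch below. It assumes that Theorem 2.1, as it is applied in the text, sets the offset of a class to the sum of the residuals of its samples divided by the number of samples plus one, and that the offset is penalized with unit weight; both assumptions are inferred from the numbers in this example. The helper names `class_offset` and `loss_with_class` are hypothetical.

```python
def class_offset(residuals):
    """Offset of an anomaly class: sum of residuals over (n + 1).

    Assumed form of Theorem 2.1 as it is used in the worked example.
    """
    return sum(residuals) / (len(residuals) + 1)


def loss_with_class(in_class, outside):
    """Training loss restricted to the listed observations.

    in_class: residuals of the samples placed in the anomaly class.
    outside:  residuals of the listed samples left outside the class.
    The new error of each in-class sample is approximated as its
    residual minus the class offset.
    """
    sigma = class_offset(in_class)
    inside = sum((r - sigma) ** 2 for r in in_class)
    return inside + sigma ** 2 + sum(r ** 2 for r in outside)


if __name__ == "__main__":
    # Figure 3.5a: observation 1 alone, then observations 1 and 2 together.
    print(loss_with_class([10], [8]), loss_with_class([10, 8], []))
    # Figure 3.5b: observation 11 alone, then observations 11 and 12 together.
    print(loss_with_class([10], [-4]), loss_with_class([10, -4], []))
```

In the first case the two-sample class lowers the loss, while in the second case it raises it above the loss of the single-sample class.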
If the other candidates for expanding this anomaly class also lead to training losses greater than 66, observation 11 will be put in a set of skipped observations so that, in the next iteration, the observation with the next largest absolute training error will be picked to start a new anomaly class.

Continuing the example for Figure 3.5b, assume that after a few iterations, observation 12 is the observation with the largest absolute training error. The initial training loss is $10^2 + (-4)^2 = 116$. Now, we create a new anomaly class with observation 12. The class parameter will be $\frac{-4}{2} = -2$ and the new error for observation 12 will be -2. The new training loss will be $(-2)^2 + 10^2 + (-2)^2 = 108$. Now, let us examine what the new training loss will be if we expand this anomaly one month backward. As seen in the previous paragraph, having observations 11 and 12 in an anomaly class leads to a training loss of $8^2 + (-6)^2 + 2^2 = 104$. This is smaller than 108, so this expansion would be accepted. But we do not want to accept this expansion because we want the expansion to be symmetrical: starting from observation 11 and starting from observation 12 should lead to the same result. We therefore add a new constraint on when expansion candidates are accepted greedily. In addition to the requirement of a reduction in training loss, an expansion candidate is only accepted if its training loss is less than the loss we could have achieved by putting the worst-predicted observation in the anomaly into a class by itself. For example, we saw that having observation 11 in its own class results in a loss of 66.
We do not accept having observations 11 and 12 in a class because that yields a loss of 104, which is greater than 66.

In Figure 3.6, the first 400 steps are for initial training of an L2-regularized baseline parameter sharing model. The training and test loss drop sharply each time we train for 150 steps after learning 10 newly constructed anomalies.

Figure 3.6: An example of train and test root mean squared errors (RMSE) during initial training (first 400 steps) and learning constructed anomalies for an L2-regularized baseline parameter sharing model.

Visualizing Process of Learning Anomaly Groups
In our water pollution platform, it is possible to visualize the anomaly groups while they are being learned. On this page of the platform, every step of learning anomalies is listed. When the user selects a step, the group that is being processed is visualized alongside information about the current loss value and how the group will be expanded in the next step.

Visualizing Learned Anomaly Groups
In our water pollution platform, it is also possible to visualize the final learned anomaly groups. A list of all learned anomaly groups, ordered by parameter value, is shown to the user. The user can select an anomaly group, in which case the time period of the anomaly and the stations where the anomaly is located are highlighted.

3.3.4 Model Evaluation

Interpolation
In Figure 3.7, we compare four different models as explained in Section 3.3.2:
• the baseline model, which predicts the training mean for the test set,
• the L2-regularized baseline parameter sharing model after initial training,
• the L2-regularized baseline parameter sharing model after learning constructed anomalies, as explained in Section 3.3.3,
• the top-down hierarchical parameter sharing model.

Figure 3.7: Test mean squared errors (MSE) for the four compared models.

We do this comparison for 100 different random train-test splits with 8051 and 1000 samples, respectively.
We can see that the initially trained model on average has a 28% lower test error than the baseline model. The anomalies-learned model has a 30.5% lower test error compared to the baseline and a 3.3% lower test error compared to the initially trained model. The comparison between the anomalies-learned and initially trained models shows that the anomalies-learned model is learning meaningful signals via learning constructed anomalies. The top-down model has a 12% higher test error compared to the initially trained model.

Future Extrapolation
Figure 3.8 compares the performance of five different models in interpolation and future extrapolation (predicting future observations). The mean predictor model, the L2-regularized baseline parameter sharing model after initial training and after learning anomalies, and the top-down model are the same models as in the previous section. In this section we also compare the spatial BYM model with a temporal random walk explained in Section 3.3.2, referred to here as the BYM model. For t > 2016/1/1, the training set is all observations in the time range [2013/1/1, 2016/1/1] and the test set is the observations in [2016/1/1, t]; this is when our model is predicting samples in the future. For t < 2016/1/1, the test set is 30% randomly selected observations in [t, 2016/1/1] and the training set is the observations in [2013/1/1, 2016/1/1] excluding the test samples. For t > 2016/1/1, we did the experiments only once because the train and test sets are deterministic. Our training is also deterministic: it uses gradient descent (GD) steps with all parameters initialized to 0, except the global parameter, which is initialized to the training mean. For t < 2016/1/1, the average of 10 runs with different random samples is shown.

Figure 3.8: Comparing the performance of five models in the interpolation (t < 2016/1/1) and extrapolation (t > 2016/1/1) settings.

We can see that learning constructed anomalies has caused overfitting for extrapolation.
Comparing the model for t before and after 2016/1/1 shows that it outperforms the baseline by the same margin in interpolation and future extrapolation. We can see that the BYM model performed poorly because the data was interpolated for the BYM model, which might have made the data harder to learn.

Chapter 4
Parameter Sharing in Binary Classification

In the previous chapter, we applied the hierarchical parameter sharing models introduced in Chapter 2 to the Waterbase - Water Quality dataset [8], which was a regression task. In this chapter, we analyze the performance of the hierarchical parameter sharing models in classification tasks. We use model confidence to gain insight into how the models work and to understand their differences. Model confidence in this context refers to the general ability of a model to make smoothed predictions. A hierarchical parameter sharing model is considered to be overconfident when its prediction for a class is more extreme than what is warranted by the observations in that class; e.g. only one negative example in a class should not make its probability very close to zero. A non-overconfident model will utilize samples in other classes and prior knowledge alongside the samples in the class to make a prediction about it. Overconfidence becomes an important issue in datasets with a small number of samples. In small datasets, the model should not rely solely on the samples in a class to make a prediction for that class. In such cases, the samples in other classes can act as a guide to help smooth the model's prediction for a class. Smoothing can also be towards a default value; for example, Laplace smoothing smooths binary predictions towards 50%.

4.1 Overconfidence in Hierarchical Parameter Sharing Models
The following example shows that the original hierarchical parameter sharing models introduced in Chapter 2 can be overconfident in some classification tasks.

Example 4.1.
Consider the case of a binary classification task where we have observed 3 samples in class A, all of which were observed to be zero. Figure 4.1 shows the result of learning an L2-regularized baseline parameter sharing model on this dataset. A top-down model learned on the same dataset would result in the same learned parameters. As shown in this figure, all parameters will be learned to be 0; therefore, the model's prediction for an unobserved sample under class A is 0. We consider this prediction to be overconfident.

Figure 4.1: An L2-regularized baseline parameter sharing model or a top-down hierarchical parameter sharing model learned on a binary task with 3 samples under class A, all of which were observed to be zero.

To improve both of the hierarchical parameter sharing models above, we can regularize the global parameter towards 0.5 by adding the term $(\sigma - 0.5)^2$ to the loss functions in expressions (2.5) and (2.18), where $\sigma$ is the global parameter. Note that the same effect could have been achieved by adding a dummy observation with a value of 0.5 under the global parameter [20]. This change is investigated in the following example:

Figure 4.2: (a) Initial state of a hierarchy with three binary observations, all of which are observed to be 0. A dummy observation with value 0.5 is added under the global parameter to facilitate regularization of the global parameter towards 0.5. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).

Example 4.2. To mitigate the overconfidence issue shown in Example 4.1, we add a regularization term to the loss function of that model that regularizes the global parameter towards 0.5. This can be achieved by adding the term $(\sigma - 0.5)^2$ to its loss function or, alternatively, by adding a dummy observation under the global parameter with a value of 0.5.
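
As a numeric check on this example, the sketch below fits both models to three zero observations under class A with a dummy observation of 0.5 under the global parameter. It rests on assumptions inferred from the parameter values 0.29, -0.21, 0.1, and -0.075 reported for Figure 4.2, not on the definitions in Chapter 2: the L2-regularized model is assumed to minimize $(0.5 - g)^2 + \sum_i (y_i - g - a)^2 + a^2$ jointly over the global parameter $g$ and the class offset $a$, and the top-down model is assumed to fit $g$ first and then $a$ on the remaining residuals, each with a unit penalty. The function names are hypothetical.

```python
def fit_l2_joint(observations, dummy=0.5):
    """Jointly fit the global parameter g and the class offset a.

    Assumed loss: (dummy - g)^2 + sum_i (y_i - g - a)^2 + a^2.
    Setting both partial derivatives to zero gives the linear system
        (n + 1) g + n a       = s + dummy
        n g       + (n + 1) a = s
    with n = number of observations and s = their sum, which is
    solved here with Cramer's rule.
    """
    n = len(observations)
    s = sum(observations)
    det = (n + 1) ** 2 - n ** 2
    g = ((s + dummy) * (n + 1) - n * s) / det
    a = ((n + 1) * s - n * (s + dummy)) / det
    return g, a


def fit_top_down(observations, dummy=0.5):
    """Fit g first, then fit a on the residuals left by g.

    Assumed rule at each level: sum of targets over (count + 1),
    where the dummy counts as one extra target for g.
    """
    n = len(observations)
    g = (sum(observations) + dummy) / (n + 2)
    residuals = [y - g for y in observations]
    a = sum(residuals) / (n + 1)
    return g, a


if __name__ == "__main__":
    for name, fit in (("L2-regularized", fit_l2_joint), ("top-down", fit_top_down)):
        g, a = fit([0, 0, 0])
        print(name, round(g, 3), round(a, 3), round(g + a, 3))
```

Under these assumptions the unrounded prediction of the L2-regularized model for class A is $\frac{1}{14} \approx 0.07$; subtracting the rounded parameters gives 0.29 - 0.21 = 0.08.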
The second method has been used in drawing the hierarchy shown in Figure 4.2. This figure shows that the prediction of the L2-regularized model for class A is now 0.29 − 0.21 = 0.08, while the prediction of the top-down model for class A is now 0.1 − 0.075 = 0.025.

Figure 4.3: (a) Initial state of a hierarchy with three binary observations, all of which are observed to be 0. Two dummy observations with values 0 and 1 are added under the global parameter to facilitate regularization of the global parameter towards 0.5. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).

Instead of adding only a single dummy observation under the global parameter, we could have added one positive dummy and one negative dummy observation under the global parameter. This is analyzed in the following example:

Example 4.3. In this example, to mitigate the overconfidence issue shown in Example 4.1, we add two dummy observations under the global parameter: one positive and one negative observation. The resulting models are shown in Figure 4.3. Note that the universal parameter in Figure 4.3c is exactly what would be produced by Laplace smoothing.

In the next example, we see that in the top-down model the smoothing vanishes as the hierarchy becomes deeper.

Example 4.4. In Figure 4.4, we have altered the hierarchy in the previous example by simply adding a new class B under class A. This figure shows the result of learning the two parameter sharing models on this new hierarchy. In the L2-regularized baseline model, the universal parameter has increased compared to the previous example. For both of the models, predictions for class B are more confident (closer to zero) compared to the previous example. This shows that the effect of the smoothing introduced in the previous example diminishes as the hierarchy becomes deeper.
This vanishing is stronger in the top-down model than in the L2-regularized baseline model. With just three examples, predictions for instances in class B are 0.06 and 0.01 for the two models. This is arguably more confident than one should be with just three observations.

Figure 4.4: (a) A dataset with a hierarchy similar to the one in Figure 4.3, except that a new class B is added under class A. (b) An L2-regularized model learned on the hierarchy shown in (a). (c) A top-down model learned on the hierarchy shown in (a).

4.2 Experiment Setup
In this section we introduce multiple real-world and synthetic datasets, with a focus on small datasets because smoothing and regularization are most important in small datasets. We also introduce multiple models that we train on the datasets. The models will be trained on 1000 random train-test splits for different numbers of training samples. The goal is to compare the performance of the models as the number of training samples increases.

4.2.1 Preparing Datasets
We evaluate our models on the following datasets. For each dataset, we also introduce the set of classes we use to train a hierarchical parameter sharing model on the dataset.

Promoters
The promoters dataset from the UCI machine learning repository [7].
• Title of Database: E. coli promoter gene sequences (DNA) with associated imperfect domain theory
• Number of Instances: 106
• Used Attributes:
– Label: positive or negative
– 57 sequential nucleotide ("base-pair") positions: These 57 features are used to predict the label.
Each feature has one of the values A, T, C, or G.
• Classes Defined for Training Hierarchical Parameter Sharing Model:
– 57 × 4 = 228 classes, one for each value of each feature in the dataset.
– $\binom{57}{2} \times 4 \times 4 = 25536$ classes, one for each combination of values of each pair of features in the dataset.

Wisconsin Prognostic Breast Cancer (WPBC)
The Wisconsin prognostic breast cancer (WPBC) dataset from the UCI machine learning repository [7].
• Title of Database: Wisconsin Prognostic Breast Cancer (WPBC)
• Number of Instances: 198
• Used Attributes:
– Label: R = recur, N = nonrecur
– 30 real-valued features used to predict the label.
• Classes Defined for Training Hierarchical Parameter Sharing Model:
– 30 × 10 = 300 classes derived by discretizing each feature in the dataset. We discretize each of the real-valued features into 10 discrete levels: the first level includes all the values smaller than the 10th percentile; the second level includes all the values between the 10th and 20th percentiles; and so on.
– $\binom{30}{2} \times 10 \times 10 = 43500$ classes, one for each combination of levels of each pair of the 30 discretized features.

Wisconsin Diagnostic Breast Cancer (WDBC)
The Wisconsin diagnostic breast cancer (WDBC) dataset from the UCI machine learning repository [7].
• Title of Database: Wisconsin Diagnostic Breast Cancer (WDBC)
• Number of Instances: 569
• Used Attributes:
– Label: M = malignant, B = benign
– 30 real-valued features used to predict the label.
• Classes Defined for Training Hierarchical Parameter Sharing Model:
– 30 × 10 = 300 classes derived by discretizing each feature in the dataset.
This was done in the same way as for the WPBC dataset.
– $\binom{30}{2} \times 10 \times 10 = 43500$ classes, one for each combination of levels of each pair of the 30 discretized features.

Breast Cancer
The Breast Cancer dataset from the UCI machine learning repository [7].
• Title of Database: Breast cancer data
• Number of Instances: 286
• Used Attributes:
– Label: no-recurrence-events or recurrence-events
– 9 discrete features used to predict the label.
• Classes Defined for Training Hierarchical Parameter Sharing Model:
– Each level of the discrete features at the first layer of the hierarchy
– Combinations of pairs of the discrete features at the second layer of the hierarchy

Synthetic Dataset
This dataset was created using the Bayesian network shown in Figure 4.5 as ground truth. All the nodes have binomial distribution. Nodes $X_1, \ldots, X_5$ are five boolean features which determine the value of $y$. Nodes $ObsX_1, \ldots, ObsX_5$ model the missingness in the measurements. They determine whether their respective feature was observed or is unknown. For example, if $ObsX_1$ is true, the value of $X_1$ is known in that measurement, but if $ObsX_1$ is false, the value of $X_1$ is unknown in that measurement. 25000 measurements were sampled from this ground truth. The prior probabilities for $X_i$ and the conditional probabilities $p(y \mid X_1, \ldots, X_5)$ were chosen uniformly from [0.1, 0.9] and fixed across the 25000 measurements. The conditional probabilities $p(ObsX_i = T \mid X_i = T)$ and $p(ObsX_i = T \mid X_i = F)$ were chosen uniformly from [0.6, 0.9] and fixed across the 25000 measurements.

Figure 4.5: Bayesian network used as ground truth to create the Synthetic Dataset.

4.2.2 Model Training
In this section, we introduce the models we train on the datasets of Section 4.2.1.

Naive Bayes
A Naive Bayes model for datasets with continuous and discrete features. $L_k$ are the $K$ different outcomes (classes or labels). For continuous features $C_1, \ldots, C_T$ and discrete features $D_1, \ldots, D_R$:

$$p(C_t \mid L_k) \sim \mathrm{Normal}(\mu_{t,k}, s_{t,k}) \tag{4.1}$$

$$p(D_r \mid L_k) \sim \mathrm{Multinomial}(p_{r,k}) \tag{4.2}$$

$$p(L_k \mid c_1, \ldots, c_T, d_1, \ldots, d_R) \propto \left(\prod_{t=1}^{T} p(c_t \mid L_k)\right)\left(\prod_{r=1}^{R} p(d_r \mid L_k)\right) \tag{4.3}$$

where the parameters $\mu_{t,k}$, $s_{t,k}$, and $p_{r,k}$ are estimated using maximum likelihood.

Logistic Regression
An L2-regularized logistic regression model minimizing the following cost function:

$$\min_{w,c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right) \tag{4.4}$$

where $y_i$ are the observations, $w$ is a one-dimensional array of all the weights, and $X$ is a two-dimensional array of all the features for each observation.

L2-Regularized Baseline Parameter Sharing Model
The L2-regularized baseline parameter sharing model as explained in Section 2.4.1. The classes chosen for each dataset are explained in Section 4.2.1.

Top-Down Hierarchical Parameter Sharing Model
The top-down hierarchical parameter sharing model as explained in Section 2.4.2. The same set of classes as for the L2-regularized baseline parameter sharing model is used for each dataset.

L2-Regularized Baseline Parameter Sharing Model - T/F smoothed
The L2-regularized baseline parameter sharing model smoothed with two dummy samples of 0 and 1 under the global parameter, as explained in Example 4.3. To train this model, the same set of classes as for the L2-regularized baseline parameter sharing model is used.

Top-Down Hierarchical Parameter Sharing Model - T/F smoothed
The top-down hierarchical parameter sharing model smoothed with two dummy samples of 0 and 1 under the global parameter, as explained in Example 4.3. To train this model, the same set of classes as for the L2-regularized baseline parameter sharing model is used.

4.3 Experiment Results
The models were trained on 1000 random train-test splits for different numbers of training samples, k. We use log loss for classification datasets and root mean squared error for regression datasets. In Figures 4.6, 4.7, 4.8, 4.9, and 4.10, the test errors for each model were averaged over 1000 random train-test splits.

Figure 4.6: Test loss for multiple models compared on the promoters dataset with different numbers of training samples, k.
For each k, we average test loss over 1000 random train-test splits. Smaller is better.

Figure 4.7: Test loss for multiple models compared on the WPBC dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.

According to these figures, it is not possible to draw a conclusion about the performance of the L2-regularized baseline model compared to the top-down model. More studies are needed to improve these models in binary classification tasks. The plots also show that the smoothed versions of the models work better than the non-smoothed versions when the size of the training set is small.

Figure 4.8: Test loss for multiple models compared on the WDBC dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.

Figure 4.9: Test loss for multiple models compared on the breast cancer dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.

Figure 4.10: Test loss for multiple models compared on the synthetic dataset with different numbers of training samples, k. For each k, we average test loss over 1000 random train-test splits. Smaller is better.

Chapter 5
Conclusion and future work

5.1 Future directions
Based on the examples given in Chapter 4, it seems that both the top-down and baseline parameter sharing models suffer from overconfidence, and the suggested method does not completely solve this issue. In addition, both models suffer from other issues that render the offsets difficult to explain. For instance, in Figure 4.4b, the learned offsets for classes A and B are equal, while it would be more reasonable for the offset for class B to be zero because class A has already learned all the information about the samples underneath. Knowing that a sample is under class B rather than class A should not change the prediction of the model.
We have tried improving the top-down model by bounding the amount of information that is passed down from the parent to the child. We have also tried to modify the top-down model so that learning the offset for a class depends on the ratio of the number of children underneath it to the number of children under its siblings. Our engineered designs did not improve the test loss, but future studies might be able to solve these issues using similar perspectives on the workings of the models.

Future studies can investigate the relationship between our models and hierarchical Bayesian models. Our method can make predictions about a sample under one class or multiple classes. For example, our model can make different predictions about a sample in 2017, a sample in location l, or a sample in 2017 and l. Note that in this case class l and the class of 2017 are not disjoint, but neither of them is a subset of the other. It is not clear how this type of setting can be modeled through hierarchical Bayesian models. For example, in the famous example of patient mortality in different hospitals modeled through a hierarchical Bayesian model, it is not clear how one is supposed to model a patient that was hospitalized in two hospitals. In a tree DAG, where classes are either disjoint or subsets of each other, a hierarchical Bayesian model with a network similar to the DAG hierarchy can be fit to the data. Future studies can investigate the difference between the predictions of our model and the predictions of the hierarchical Bayesian model for the data.

In a general DAG, classes can have multiple parents. In such cases, it is not trivial how the signal should be split between the parents. Future studies can investigate how the proposed hierarchical models split the signal between multiple parents.

Future studies can investigate the relationship between our models and a feedforward artificial neural network.
One can imagine that an extension to our models is to set the values of each layer of the hierarchy as a function of the values of the offsets in previous layers. This will result in a model similar to neural networks.

5.2 Conclusion
To summarize, we propose and investigate hierarchical parameter sharing models as an explainable model. The explainability of the proposed model comes from the fact that the parameters, or offsets, in the model all have the same unit of measure and correspond to well-defined classes. The model can be utilized to provide two different types of explainability: observation explainability and gestalt explainability. We train the model on the water-quality dataset and we are able to show that the model has a better performance than a baseline model and the BYM model, which is a type of generalized linear model. In this process, we also created a water pollution platform for exploratory data analysis on the water-quality dataset and anomaly detection using the proposed model. We learned that our model can learn explainable offsets for classes that can improve extrapolation of data in the near future. Finally, we investigated the performance of the parameter sharing model on binary classification tasks, where smoothing and regularization were deemed important, especially in small datasets. The suggested smoothing method was able to improve the predictions of both variants of our model in the small-dataset setting.

Bibliography
[1] A. Barredo Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcia, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, and F. Herrera. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020. ISSN 1566-2535. doi:10.1016/j.inffus.2019.12.012. URL http://www.sciencedirect.com/science/article/pii/S1566253519308103.
[2] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236, 1974. ISSN 00359246. URL http://www.jstor.org/stable/2984812.
[3] J. Besag, J. York, and A. Mollié. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics, 43:1–20, 03 1991. doi:10.1007/BF00116466. URL https://doi.org/10.1007/BF00116466.
[4] M. Blangiardo and M. Cameletti. Spatial and Spatio-temporal Bayesian Models with R - INLA. Wiley, 2015. ISBN 9781118326558. doi:10.1002/9781118950203. URL http://dx.doi.org/10.1002/9781118950203.
[5] M. Blangiardo and M. Cameletti. Spatial and Spatio-temporal Bayesian Models with R - INLA, pages 238–240. Wiley, 2015. ISBN 9781118326558. doi:10.1002/9781118950203. URL http://dx.doi.org/10.1002/9781118950203.
[6] E. M. Dogo, N. I. Nwulu, B. Twala, and C. Aigbavboa. A survey of machine learning methods applied to anomaly detection on drinking-water quality data. Urban Water Journal, 16(3):235–248, 2019. doi:10.1080/1573062X.2019.1637002. URL https://doi.org/10.1080/1573062X.2019.1637002.
[7] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[8] European Environmental Agency. Waterbase - water quality, Apr 2019. URL https://www.eea.europa.eu/data-and-maps/data/waterbase-water-quality-2. Prod-ID: DAT-163-en, Created: 05 Apr 2019, Published: 09 Apr 2019, Accessed: 20 Jun 2019.
[9] K. Gade, S. Geyik, K. Kenthapadi, V. Mithal, and A. Taly. Explainable ai in industry. pages 3203–3204, 07 2019. ISBN 978-1-4503-6201-6. doi:10.1145/3292500.3332281.
[10] D. Guo, A. Lintern, J. Webb, D. Ryu, U. Bende-Michl, S. Liu, and A. Western. A data-based predictive model for spatiotemporal variability in stream water quality. Hydrology and Earth System Sciences, 24:827–847, 02 2020. doi:10.5194/hess-24-827-2020.
[11] J.-Q. Jin, Y. Du, L.-J. Xu, Z.-Y. Chen, J.-J. Chen, Y. Wu, and C.-Q. Ou. Using bayesian spatio-temporal model to determine the socio-economic and meteorological factors influencing ambient pm2.5 levels in 109 chinese cities. Environmental Pollution, 254:113023, 2019. ISSN 0269-7491. doi:10.1016/j.envpol.2019.113023. URL http://www.sciencedirect.com/science/article/pii/S0269749119322298.
[12] O. Kisi and K. S. Parmar. Application of least square support vector machine and multivariate adaptive regression spline models in long term prediction of river water pollution. Journal of Hydrology, 534:104–112, 2016. ISSN 0022-1694. doi:10.1016/j.jhydrol.2015.12.014. URL http://www.sciencedirect.com/science/article/pii/S0022169415009622.
[13] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009. ISSN 0018-9162. doi:10.1109/MC.2009.263. URL http://dx.doi.org/10.1109/MC.2009.263.
[14] T. Lai, H. Robbins, and C. Wei. Strong consistency of least squares estimates in multiple regression ii. J. Multivariate Anal., 9:343–361, 01 1985. doi:10.1007/978-1-4612-5110-1_47.
[15] D. Lee. A comparison of conditional autoregressive models used in bayesian disease mapping. Spatial and Spatio-temporal Epidemiology, 2(2):79–89, 2011. ISSN 1877-5845. doi:10.1016/j.sste.2011.03.001. URL http://www.sciencedirect.com/science/article/pii/S1877584511000049.
[16] C. Leigh, O. Alsibai, R. Hyndman, S. Kandanaarachchi, O. King, J. McGree, C. Neelamraju, J. Strauss, P. Talagala, R. Turner, K. Mengersen, and E. Peterson. A framework for automated anomaly detection in high frequency water-quality data from in situ sensors. Science of the Total Environment, 664:885–898, May 2019. ISSN 0048-9697. doi:10.1016/j.scitotenv.2019.02.085.
[17] F. Lindgren, H. Rue, and J. Lindström. An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011. doi:10.1111/j.1467-9868.2011.00777.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2011.00777.x.
[18] H. Liu, F. Hussain, C. L. Tan, and M. Dash. Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4):393–423, Oct 2002. ISSN 1573-756X. doi:10.1023/A:1016304305535. URL https://doi.org/10.1023/A:1016304305535.
[19] OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org. https://www.openstreetmap.org, 2017.
[20] D. L. Poole and A. K. Mackworth. Artificial Intelligence: Foundations of Computational Agents, page 307. Cambridge University Press, second edition, 2010. ISBN 978-1-107-19539-4. doi:10.1017/9781107195394.
[21] H. Rue, S. Martino, and N. Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the Royal Statistical Society Series B, 71:319–392, 04 2009. doi:10.1111/j.1467-9868.2008.00700.x.
[22] C. J. Ruybal, T. S. Hogue, and J. E. McCray. Evaluation of Groundwater Levels in the Arapahoe Aquifer Using Spatiotemporal Regression Kriging. Water Resources Research, 55(4):2820–2837, Apr. 2019. doi:10.1029/2018WR023437.
[23] Tiyasha, T. M. Tung, and Z. M. Yaseen. A survey on river water quality modelling using artificial intelligence models: 2000–2020. Journal of Hydrology, 585:124670, 2020. ISSN 0022-1694. doi:10.1016/j.jhydrol.2020.124670. URL http://www.sciencedirect.com/science/article/pii/S002216942030130X.
[24] Y. Tramblay, T. Ouarda, A. St-Hilaire, and J. Poulin. Regional estimation of extreme suspended sediment concentrations using watershed characteristics. Journal of Hydrology, 380:305–317, 01 2010. doi:10.1016/j.jhydrol.2009.11.006.
[25] C. A. Varotsos, V. F. Krapivin, F. A. Mkrtchyan, S. A. Gevorkyan, and T. Cui. A novel approach to monitoring the quality of lakes water by optical and modeling tools: Lake sevan as a case study. Water, Air, & Soil Pollution, 231(8):435, Aug 2020. ISSN 1573-2932. doi:10.1007/s11270-020-04792-8. URL https://doi.org/10.1007/s11270-020-04792-8.
Item Metadata
Title | Prediction and anomaly detection in water quality with explainable hierarchical learning through parameter sharing |
Creator | Mohammad Mehr, Ali |
Publisher | University of British Columbia |
Date Issued | 2020 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2020-09-08 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NoDerivatives 4.0 International |
DOI | 10.14288/1.0394253 |
URI | http://hdl.handle.net/2429/75910 |
Degree | Master of Science - MSc |
Program | Computer Science |
Affiliation | Science, Faculty of; Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nd/4.0/ |
AggregatedSourceRepository | DSpace |