N O N P A R A M E T R I C L E A R N I N G F R O M E X A M P L E S IN V E R Y H I G H D I M E N S I O N A L SPACES By G R E G O R Y Z L A T K O GRUDIC B. A . Sc. (Engineering Physics) University of British Columbia, 1987 M . A. Sc. (Electrical Engineering) University of British Columbia, 1990 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF D O C T O R OF P H I L O S O P H Y in T H E F A C U L T Y OF G R A D U A T E STUDIES ( D E P A R T M E N T OF E L E C T R I C A L A N D C O M P U T E R ENGINEERING) We accept this thesis as conforming toJbhs required standard T H E U N I V E R S I T Y OF BRITISH C O L U M B I A August 1997 © Gregory Zlatko Grudic, 1997 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Electrical and Computer Engineering The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V6T 1W5 Date: Abstract Constructing predictive models or mappings from sample data is an important goal in many areas of science. Most of the research in this area has been directed towards relatively low dimensional models; however, many real world problems can be large and very high dimensional. Such problems require general learning methods which are fundamentally different from those used to construct low dimensional models. In this thesis a new nonparametric regression method-ology (SPORE) is proposed for very large and very high dimensional learning problems: problems with more than 10,000 learning examples of regression data having 100 or more inputs. The SPORE regression model is constructed incrementally by adding small, low dimensional parametric building blocks one at a time, using the outputs of previously added blocks as inputs to the new ones. This process forms stable regression functions from relatively few training examples, even when there are a large number of input variables. Furthermore, SPORE demands little computational effort to choose between candidate building blocks or inputs, making it computationaly feasible in very high dimensional spaces. SPORE challenges two basic mainstream notions found in contemporary learning algorithms. First, it questions the need to simultaneously fitting large high dimensional structures to model complex high dimensional interactions. Second, it rejects the need for "greedy", computationaly expensive searches used to finding the next "best" building block to add to a regression function. SPORE also allows for the subdivision of the domain of the input space to make incremental construction both computation-aly and theoretically feasible. Conditions under which the rate of convergence of the method is independent of the dimension of the data are established. It is also shown that the computational complexity of constructing a SPORE-type regression model is linear with respect to dimension within each domain subdivision. In addition, conditions are given under which no domain subdivision is necessary. The most basic version of this regression methodology (SPORE-1) is implemented and empirically evaluated on four types of data sets. The SPORE-1 learning algorithm is completely automatic and requires no manual intervention. First, SPORE-1 is applied to 10 regression problems found in the literature and is shown to produce regression functions which are as good or better, with respect to mean squared approximation error, than published results on 9 of these data sets. Second, SPORE-1 is applied to 15 new, synthetic, large, very high dimensional data sets ii (40,000 learning examples of 100, 200, 400, 800, and 1600 inputs) and is shown to construct effective regression functions in the presence of both input and output noise. Third, SPORE-1 is used to build mappings from input/output learning examples generated by a human using teleop-eration to execute an 'object locate and approach' task sequence. SPORE-1 effectively builds this mapping and directs a robot to autonomously execute the task demonstrated by the human operator, even in a visually cluttered environment with an unpredictable background. Finally, SPORE-1 is successfully applied to the 10-bit parity problem to demonstrate its efficacy on problems which have flat low dimensional projections, thus showing that it is not subject to the same limitations as other algorithms that build regression functions using low dimensional parametric building blocks, added one at a time. iii Table of Contents Abstract ii List of Tables vii List of Figures viii Acknowledgment ix Chapter 1: Introduction 1 1.1 Constructing Models from Data 2 1.2 The Curse of Dimensionality 5 1.2.1 Rate of Convergence of Nonparametric Regression 5 1.2.2 Computational Complexity of Nonparametric Regression 6 1.3 The SPORE Approximation 8 1.3.1 A Conceptual Summary of SPORE 11 1.3.2 Theoretical Results 12 1.3.3 The SPORE-1 Regression Function 13 1.3.4 Why Does SPORE-1 Work and When Does It Work Best? 14 1.3.5 Experimental Evaluation of the SPORE-1 Approximation 15 1.4 Plan of Presentation 17 Chapter 2: Review of Nonparametric Regression 18 2.1 Regression Algorithms: Addressing the Curse of Dimensionality 18 2.1.1 Dimensionality Reducing Algorithms 19 2.1.2 Space Partitioning Algorithms 24 2.2 Regression Model Variance Reduction Techniques 26 2.3 Convergence Results in Nonparametric Regression 29 iv 2.4 Comparing Existing Algorithms to the SPORE 30 Chapter 3: The SPORE-1 Regression Function 33 3.1 SPORE-1: A Cascade of Two Dimensional Functions 33 3.1.1 A Cascade of Bounded Continuous 2 Dimensional Functions 34 3.1.2 A Cascade of Finite Order 2 Dimensional Polynomials 36 3.2 Learning Algorithm for the SPORE-1 Regression Function 38 3.2.1 The SPORE-1 Learning Algorithm 38 3.3 Theoretical Analysis of the SPORE-1 Learning Algorithm 41 3.3.1 Convergence Results 42 3.3.2 Complexity Results 56 3.3.3 Internal Variance Reduction 57 3.3.4 Virtual Space Partitioning 58 3.4 Summary 59 Chapter 4: Experimental Results 60 4.1 Implementation Details of the SPORE-1 Regression Function 60 4.2 Evaluating SPORE-1 on Standard Regression Problems 61 4.2.1 Regression Data Used by Breiman 62 4.2.2 Regression Data Used by Rasmussen 63 4.2.3 Regression Data Used by Jordan and Jacobs 64 4.2.4 Experimental Results 65 4.3 Evaluating SPORE-1 on Synthetic High Dimensional Problems 68 4.4 Human-to-Robot Skill Transfer Experiments 72 4.4.1 The Experimental Testbed 74 4.4.2 Experimental Results 75 4.5 The 10 Bit Parity Problem 78 4.6 Summary 79 Chapter 5: The General SPORE Methodology 81 5.1 The Complete SPORE Definition 81 5.2 Theoretical Analysis of Rates of Convergence For Two Specific SPORE Structures.. 83 5.2.1 The SPORE-2 Regression Function 84 5.2.2 The SPORE-1 Regression Function with Domain Partitioning 90 5.3 Rate of Convergence and Computational Complexity 92 5.4 Summary 93 Chapter 6: Conclusion 95 6.1 Future Work 97 Bibliography 99 Appendix A. Approximating 3 Dimensional Functions as a Cascade 106 Appendix B. Constructing The SPORE-2 Structure 115 Appendix C. Rate of Convergence 130 Appendix D. The Main Theoretical Constructions 137 List of Tables Table 4.1 Performance of SPORE-1 of Published Data 67 Table 4.2 Evaluation of SPORE-1 on High Dimensional Data Sets 71 vii List of Figures Figure 1.1 The Mapping Problem 3 Figure 1.2 The SPORE-1 Structure 13 Figure 2.1 Model Variance 27 Figure 3.1 The SPORE-1 Regression Function , 34 Figure 3.2 A Cascade Representation of a Polynomial Term 36 Figure 4.1 Increase in Approximation Size with Data Dimension 72 Figure 4.2 The Human to Robot Skill Transfer Test Bed 74 Figure 4.3 A "Tall Spruce Tree" 76 Figure 4.4 A Red and Blue Stripped Tower 77 Figure 4.5 A Red and Blue Stripped Tower with Holes 78 Figure 5.1 The SPORE Structure 82 Figure 5.2 A Tree-Like Subdivision of Input Space 85 Figure 5.3 A 4 Dimensional Example of SPORE-2 87 V l l l Acknowledgment Many thanks to my supervisor Peter Lawrence for his support, patience, and above all, for never showing any doubt. His perspective on my research is invaluable to me and I could not have hoped for a better supervisor. I am eternally grateful to my wife, Jane Mulligan, for being my best friend and most valued colleague. Much of the work presented here is shaped by her insights, and she has spent many long hours and reading and re-reading various versions of this thesis. This thesis is dedicated to my mother Julijana and my father Josip. ix Chapter 1: Introduction A l l scientific predictions are based on models of the phenomena being predicted. This is equally true when one is predicting the path of a projectile as when one is trying to predict tomorrow's weather. A l l of these models are constructed starting from empirical observations of the phenomena of interest. These empirical observations, in whatever form they occur, are in essence input and output examples of what the model is intended to predict. The resulting predic-tive models can be of causal relationships between input and output or visa versa (such as predict-ing the path of a projectile given initial velocity, or conversely, predicting the initial velocity given its path), or they can be models reflecting some correlation between variables arbitrarily labelled as inputs and outputs, that have other, as yet unidentified, causal factors. Whatever the nature of the phenomenon in question, the goal is always the same: to convert empirical observations into a model structure which allows accurate prediction of future observations. Building models from data often has two specific goals: the first is to predict output values given future inputs; the second is to use the resulting model to interpret or analyze the data. Interpretation of models involves analyzing which inputs most affect the output, as well as how these inputs interact with one another. Such analysis can reveal important information about the phenomenon that generated the data. This is especially true when relatively few input variables can effectively predict output values. However, the specific focus of this thesis is model prediction accuracy, and thus the topic of data analysis is largely ignored. One justification for this is that the models we are interested in are very high dimensional, making analysis difficult when many inputs combine in complicated nonlinear ways. Most major theoretical and practical work on general methods for building models has been done in mathematics, in the form of function approximation, and statistics in the form of regression. More recently, the neural network and machine learning communities have addressed this general modeling problem, mainly using variations of tools which have their origins in mathematics and statistics. Many of these endeavors have met with much success, and many diverse fields, ranging from the physical to social sciences, have benefited from the modeling tools that have emerged. l Chapter I: Introduction 2 In studying the literature on general modeling methods, it is apparent that most of the theoretical and practical work in this area has been applied to problems which are relatively small and low dimensional. Relatively little effort has been specifically directed towards general modeling methods where there are many potential inputs, and the model is to be built based on a very large data set. Numerous examples of large high dimensional regression problems can be found in such diverse fields as meteorology, economics, and robotics. In meteorology, one can imagine predict-ing tomorrow's weather given temperature, pressure, humidity, and cloud covering readings from hundreds of weather stations surrounding the area of interest. In economics, predicting the major economic indicators based on a detailed time sampling of various economic indices can be a profitable venture. In robotics, building mappings between a robot's sensor inputs and actuator outputs can be considered to be a problem in regression, if, as in human-to-robot skill transfer, one has access to examples of the mapping (see Chapter 4 and [Grudic and Lawrence, 1995] [Grudic and Lawrence, 1996] for details). All of these regression problems have one characteristic in common: a large number of factors influence the output, and therefore, it is not easy to choose a small set of inputs which effectively explain the output. The goal of this thesis is the theoretical and experimental study of building very high dimensional models from very large data sets. The type of modeling problem addressed typically has at least 100 inputs, and there are 10,000 or more empirical input-to-output examples of the phenomenon of interest. To this end, a methodology, termed SPORE, for building such regression models is presented and theoretically and experimentally analyzed. From a practical standpoint the end result of this thesis is a regression algorithm and structure, called SPORE-1, which can be used to build such large models. A key characteristic of SPORE-1 is that no a priori knowledge of the phenomenon of interest is required for model construction: all that the user must provide is a file containing the learning data. 1.1 Constructing Models from Data The problem being addressed in this thesis is deceptively simple to state. There is a data M set containing M input/output examples (*-, y,) for / = 1, M , symbolized by {xi,yi}l . The inputs JC = (xx, ...,xN) are N dimensional and the outputs y = (y 1 ; ...,yk) are k dimensional. The goal is to find some mapping function, / , such that: Chapter I: Introduction 3 X u Figure 1.1 The Mapping Problem y = /(*) (i-i) where y is some estimate of y given x . Once constructed, / can then be used to predict y given some future input x. In the general case, the data {xt, yi}i is generated by some unknown phenomenon or function /(%), where % = ( « , v) is a P dimensional input vector. % can have 2 M types of inputs: inputs u which do not appear in the data set {xt, y,}^ , and inputs v which do appear in the data {xt, yt}^ . Therefore, one is often required to construct / using an incomplete set of inputs (i.e. we may not have access to the inputs u). A further difficulty arises when the accessible inputs x include some unknown set of inputs which do not affect the output vector y (i.e. not all inputs found in x will also be found in v) . This is diagrammatically represented in Figure 1.1. Not knowing which inputs actually have some definable relationship to the output is just one of the potential difficulties encountered in the general model construction problem. The process of constructing models from data has been studied in a variety of fields, and, not surprisingly has been given different names in each field. In applied mathematics this process is termed function approximation, and one speaks of approximating the dependent variable y as some function of the independent variable x . In statistics, the process of model building is called regression and the terminology used is as follows: estimate the responses y given the explanatory or predictor variables x. More recently, the machine learning and neural network communities have termed this type of model building supervised learning and typically refer to learning the mapping between input variables x and output variables y . In this thesis, the process of building models from data is studied from the point of view of applied mathematics (function approximation) and statistics (regression). Referring to Figure 1.1, function approximation typically deals with the case of u = 0 (i.e. the empty set) and v = x. Thus, function approximation assumes that the independent variables x exactly define the dependent variables y , and the goal is to determine how well one can model / given some defined Chapter I: Introduction 4 structure for the approximating function / . Regression on the other hand, deals with the case when H ^ 0 and/or v * x. Typically, this is represented in the following form: y = f(x) + t (1.2) where, e is a noise term or, more formally, a random variable obeying some probability distribu-tion. In most instances, 8 is assumed to be normally distributed with mean zero and fixed standard deviation. Hence, from the perspective of regression, not knowing whether the given inputs completely define the output is treated as estimation noise. There are two basic approaches to constructing the function / : parametric and nonpara-metric. In the parametric approach, knowledge about the nature of the problem is used to define the structure and size of the function / . Typically / will contain a set of parameters that need to be assigned values using some numerical algorithm such as gradient descent, least squares, or simulated annealing (to mention just a few). For example, if / is a polynomial with a finite number of terms, then the parameters which are "learned" using sample data are the polynomial coefficients. A key characteristic of the parametric method is that because both structure and size are fixed, the space or class of functions representable by any single model / is inherently constrained. It is because of this characteristic that parametric models are often fast to compute and extremely effective when they are representative of the phenomenon being studied. Paramet-ric models are in fact the best form of model to use when the nature of the problem is well understood. In contrast, nonparametric models are most useful when the nature of the problem is poorly understood. In nonparametric modeling, the basic structure of the model is usually defined, but not the number of parameters in the final constructed / function. One example of a nonparametric model is a polynomial which does not have a fixed number of coefficients. A nonparametric model grows in size in order to effectively model the desired data. Thus, in the case of a polynomial, new terms are added until a sufficient model is constructed. In theory, a nonparametric model is able to represent a greater space of functions than a parametric model. However, because nonparametric regression does not have a fixed parameter space, the task of constructing models becomes considerably more difficult than with parametric models. As a result, nonparametric models are typically more time consuming to construct. The specific problem being addressed in this thesis is model construction when little or no Chapter!: Introduction 5 information about the phenomenon being studied is available. Thus, the main focus of this thesis is nonparametric model construction, or nonparametric regression. As indicated above, there are difficulties associated with nonparametric regression. One of the most pervasive of these difficul-ties is the curse of dimensionality [Bellman, 1961], which is briefly addressed below. 1.2 The Curse of Dimensionality From a practical standpoint, the curse of dimensionality has two major implications for high dimensional nonparametric regression. The first results from the rate of convergence of nonparametric regression, and the second from the computational complexity of nonparametric regression. If we are to build regression algorithms which work well in high dimensional spaces, we must find ways of addressing both of these issues. 1.2.1 Rate of Convergence of Nonparametric Regression There are a number of theoretical results associated with the rate of convergence of nonparametric regression (see Section 2.3 of Chapter 2 for details). In this thesis, we are mainly concerned with rate of convergence results which measure the decrease in approximation error as the number of sample points used to construct the regression function increases. For example, Stone [Stone, 1982] has shown that in general, and under appropriate regularity conditions, the number of sample points required to approximate a bounded continuous N dimensional function which is at least once differentiable and defined over a closed domain, grows exponentially with dimension N. This means that it is not feasible to densely sample high dimensional space (e.g. 10 or more input variables), and therefore, in to order approximate high dimensional functions, one must impose some type of smoothness constraint on the approximation [Friedman, 1994b]. In fact, all learning methods impose such a constraint. Nonparametric methods differ from paramet-ric methods in that they impose fewer such constraints, and are thus able to form more flexible approximations. Imposing a smoothness constraint on approximation means that fewer sample points are required to determine the parameters of the approximation. However, as is argued throughout this thesis and elsewhere [Friedman, 1994b], the problem associated with approximating high dimensional functions is not strictly due to the number of input variables, but rather due to the underlying complexity of the function being Chapter I: Introduction 6 approximated. We measure complexity as the number of samples necessary to build an effective nonparametric approximation (see Section 4.3 of Chapter 4 for an example). Clearly it is possible to have a very complicated 1 dimensional function, or a very simple high dimensional function. Thus, one can argue that a regression problem is difficult because it requires large numbers of learning data, which is not by necessity related to the dimension of the problem. Our goal then, as model builders, should be to find regression algorithms which perform well in high dimensional spaces, when only relatively few learning examples (i.e. tens of thousands) are sufficient to build adequate models. The goal then, when formulating structures for very high dimensional nonparametric regression is two-fold. First, we must ensure that the rate of convergence of the regression function is not overly dependent on the dimension of the underlying function which generated the learning data. In other words, the rate at which the nonparametric regression function converges to some stable function should not depend adversely on the dimension of the problem. Second, one must have an understanding of what class of functions are representable by a given nonparametric regression technique. If the data being used to construct the regression function is being generated by a function which belongs to this class, then the regression function will converge to it. In this case the approximation error converges to zero. However, if the function generating the sample points does not belong to this class of functions, then the approximation error will converge to some finite non-zero value. From a practical standpoint, if this approximation error is greater than the maximum error that is appropriate for the regression problem at hand, then the regression function being used is inadequate for the task. Recently, much theoretical work has been done to define classes of functions which have corresponding regression functions with rate of convergence independent of dimension (see Section 2.3 of Chapter 2 for details). One of the main theoretical contributions of this thesis is that we establish a rate of convergence result which is independent of dimension on function spaces which have not been previously studied in this way. 1.2.2 Computational Complexity of Nonparametric Regression The second aspect of the curse of dimensionality is the computational complexity of constructing high dimensional regression functions. Generally speaking, there are two commonly used methods for constructing nonparametric regression functions. The first builds regression Chapter I: Introduction 7 functions by including all input variables simultaneously, thus building one or more large component functions which may be combined to form the final regression function. The second uses some form of variable selection to build up the regression function using smaller functional subunits. Both of these approaches can be effective for relatively low dimensional problems. However, consider the problem of approximating a function of 1000 variables. If one uses the first approach and attempts to fit all the variables simultaneously, then even if the regression function is a simple second order polynomial (i.e. one which has relatively few nonlinear terms) of 1000 variables, the regression algorithm needs to solve simultaneously for over 500,000 model parame-ters (calculated using the identity m = (s + v)!/(.?! • v!), where m is the number of polynomial terms and hence the number of model parameters, s is the order of the polynomial and v is the number of input variables). Whether solved via gradient descent or least squares, the computa-tional cost of simultaneously fitting 500,000 parameters is not insignificant [Press et al., 1988]. Thus even a simple simultaneous fit of a second order polynomial of 1000 variables is computa-tionaly significant, and adding more nonlinear terms makes the problem correspondingly more expensive (for example a third order polynomial of 1000 variables has more than 1.6x10 model parameters). The computational cost becomes even more significant when the appropriate model structure is unknown (thus making the problem nonparametric), and many different types of large nonlinear models need to be constructed before an adequate one is found. An additional difficulty associated with attempting to simultaneously fit many variables is due to the large number of training examples needed to form a stable fit. This results because a simultaneous fit of a large number of model inputs implies that many model parameters need to be fit at the same time (as discussed in the previous paragraph), which in turn requires many training samples for the model to converge to a stable solution [Friedman, 1994b] [Breiman, 1996b]. In terms of computational complexity, large numbers of learning data result in an increase in the computational complexity of the learning algorithm. Next consider the 1000 variable regression problem using some method of variable selection. In the literature, variable selection is typically achieved by choosing the best q of possible N inputs. Thus, in order to determine the optimal set of q input variables, the number of (N\ AT! model constructions and evaluations required is = —TTTT-—r: • As q and N become large, \qj q\{N-q)\ this search for an optimal solution quickly becomes impractical. For example, choosing the best 500 out of a possible 1000 variables requires the evaluation of ^QQ^^QQ;^ m o ^ constructions Chapter!: Introduction 8 (usually nonlinear models are needed and hence each individual construction can itself be computationaly expensive); such an exhaustive search is not currently feasible, and is unlikely to be in the near future. Even if one chooses the suboptimal solution of removing one variable at a time from the model, one would require ^(N(N + \ )-{q- 1 )(<?)) model constructions and evaluations. Hence, using this method to find the "best" 500 out of 1000 inputs requires 375,750 model constructions and evaluations. Even this suboptimal search strategy is computationaly difficult given the computational complexity of constructing and evaluating nonlinear models of more than 500 variables. Similar computational complexity arguments can be made for any method which attempts to choose the "best" (by whatever measure one uses) low dimensional projections of high dimensional data (see Section 2.1.1 of Chapter 2). Any practical nonparametric algorithm required to work on high dimensional regression functions must deal with the above complexity issues. In the next section, the SPORE approxima-tion is introduced as a methodology designed to address both the computational complexity and rate of convergence problems associated with high dimensional nonparametric regression. 1.3 The SPORE Approximation When we speak of the difficulties inherent in high dimensional nonparametric regression (i.e. the curse on dimensionality), we are referring to the difficulties of applying existing approxi-mation techniques to high dimensional problems. Any approximation problem is potentially difficult. In fact, one dimensional problems are potentially just as difficult as one thousand dimensional problems, and the intrinsic difficulty of any approximation problem is directly related to the number of learning examples required to build a sufficient approximation. Thus, the curse of dimensionality is only a curse because most algorithms have been designed to work well in low dimensional spaces, and there is no theoretical reason why a one dimensional function requiring 50,000 training examples to build a sufficient model, should be any more easier to approximate than a one thousand dimensional function requiring the same number of training samples. We propose the following methodology to address high dimensional nonparametric regression problems. First, in order to ensure that the number of learning samples required to build a sufficient model depends on the complexity of the regression problem and not on how many input variables there are, we build approximations by projecting the high dimensional Chapter 1: Introduction 9 learning data onto low dimensional parametric building blocks. Thus, the number of learning samples required to form stable approximations depends on the number of parameters in our parametric building blocks, and not on the dimension of the learning data. The result is that the rate of convergence of the algorithm to some fixed error, is in fact independent of dimension. This use of low dimensional parametric projections has been extensively used in the past (see Chapter 2, Section 2.1.1.1 for details), however, it has invariably involved a systematic search for building blocks and associated inputs for the next best (assuming some measure of best) building block to add. As we have argued in Section 1.2.2, these types of search procedures are not feasible when we have very high dimensional input spaces. Therefore, such search strategies are, of necessity, not used in the algorithms proposed in this thesis. The second key aspect of our methodology is that the algorithmic procedure we use to determine which building block to add next, or what its inputs will be, is not dependent on the number of input variables. This may appear to be an impossible or impractical algorithmic charac-teristic, however, from a theoretical standpoint, there is no reason why computationaly expensive variable or building block selection is necessary. This is illustrated in the Appendices of this thesis (see Section 5.2.1 of Chapter 5) where we present an approximation algorithm which is theoreti-cally guaranteed to construct uniformly converging approximations to any bounded continuous function of arbitrary dimension, while always using the same building blocks and a random ordering of input variables. The arguments in these Appendices demonstrate that computationaly expensive searches for building blocks and their inputs is not a theoretical necessity. A practical example of this aspect of our methodology is the SPORE-1 algorithm (see Section 1.3.3), where the same parametric building block is used throughout the algorithm, and the order in which input variables are added to the building blocks is random. The third and final key component of our methodology is that the input space of the regression function is sub-divided in order to avoid flat or zero projections from high dimensional learning data onto our low dimensional parametric building blocks. This is a necessary algorith-mic component because it gives convergence to zero error when other algorithms which use low dimensional projections do not (see Section 4.5 of Chapter 4). In fact, one of the main criticisms of algorithms which use lower dimensional building blocks is that they do not work well on high dimensional data. In the appendices of this thesis we demonstrate a systematic method of space subdivision which gives theoretically proven convergence (see Section 5.2.1 of Chapter 5). In Chapter 1: Introduction 10 addition, we show that the benefits of space subdivision can be obtained with minimal computa-tional effort and without actually subdividing the input space (see Section 3.3.4 of Chapter 3). We refer to this as Virtual Space Partitioning, and show that it is a direct consequence of using bootstrap sampling techniques [Efron, 1983] to randomly subdivide the input space when adding parametric building blocks to the regression function. Science often works by pushing a thesis to an extreme point of view in order to establish conditions under which this view is valid, as well as conditions under which it breaks down. This document is an example of this form of scientific endeavor. The thesis being put forward, and upon which the SPORE approximation is based, is the following: Very high dimensional nonparametric regression models can be effectively constructed by using a sufficient number of small, low dimensional parametric building blocks which are added to the model one at a time, using the outputs of previously added blocks as inputs to the new ones. Furthermore, one can use a random selection process to choose candidate building blocks (picked from some small finite set) and their inputs, thus little or no computational effort is required to determine what to fit next. Our objective is to both experimentally and theoretically establish under which conditions this thesis is valid, and to determine how restrictive these conditions are. It is argued throughout this document that the principal implication of the above thesis is that very high dimensional nonpara-metric regression is feasible, both in terms of rate of convergence and computational complexity. The surprising conclusion is that it is both feasible and effective on a large class of functions. The Space Partitioning, self-Organizing, and dimensionality REducing, or S P O R E , algorithmic philosophy defines our basic methodology for nonparametric regression. As the name suggests, the input space of the regression data is partitioned or divided into regions where it is possible to reduce the dimension of the problem by constructing the approximation only a few variables at a time. SPORE approximations are self-organizing, or equivalently nonparametric, in that the size of the resulting regression function is defined by the learning or regression data, and is not defined a priori. In Chapter 5, the general SPORE methodology is defined and theoretically analyzed. In the following, a brief description of the basic concepts behind SPORE are presented. Chapter 1: Introduction 11 1.3.1 A Conceptual Summary of SPORE We view SPORE as a methodology for designing learning algorithms which construct SPORE-type regression functions or approximations. The SPORE methodology has five basic characteristics which can be summarized as follows: CI. The SPORE approximation consists of a finite number of functional units. These functional units are either structurally all identical (for example 2 dimensional, third order polynomials as in Chapter 4) or of different types (for example some may belong to a class of polynomials of a given range of dimension and order, while others may belong to one or more classes of radial basis functions of some range of dimension and number of basis units). From a practical perspective, these functional units should have the following prop-erty: the process of individually fitting these functional units to input/output data should be relatively easy when compared to the complete regression problem. C2. The functional units are added to the approximation one unit at a time, and are not modified after they are added. Input variables, as well as the outputs of previously added functional units, can serve as inputs to new units. The addition of functional units stops when a desired residual error is reached. C3. Little or no computational effort is spent on selecting inputs to structural units. C4. Little or no computational effort is spent on selecting which type of structural unit is added next. C5. The input space is partitioned in order to reduce approximation error. Thus, dimensionality reduction is implied by characteristics CI and C2, self-organization is implied by C2, and space partitioning is implied by C5 (thus the name SPORE as defined above). One should note that characteristics CI and C2 are common to a number of other nonparametric regression architectures. In Chapter 2, these architectures are discussed in detail. Two notable examples are the Group Method of Data Handling (GMDH) [Ivankhnenko, 1971] (as well as various other related algorithms), and Cascade Correlation [Fahlman and Lebiere, 1990]. A l l of these algorithms build successively on structures, using the outputs of previously constructed structural units as inputs to new units. However, it is the further incorporation of characteristics C3 and C4 which make SPORE different from all other approaches to nonparametric regression. As described in Section 2.4 of Chapter 2, these two characteristics lead to learning algorithms Chapter 1: Introduction 12 which are significantly different from those studied previously. They are also a key to formulating a practical algorithm for very high dimensional regression. As argued in Section 1.2.2, the computational effort required for variable selection makes it difficult, if not impossible, to do in very high dimensional spaces. The same argument applies to structural unit selection in high dimensional spaces. The purpose of the space partitioning characteristic, C5, is to ensure that the SPORE regression function is a universal approximator, or in other words is (theoretically) able to approx-imate any bounded continuous function defined over a finite domain in N dimensional space (see Chapter 5). As is briefly outlined in the next section, it is also used to avoid problems with rate of convergence in very high dimensional regression. 1.3.2 Theoretical Results As stated previously, Stone [Stone, 1982] has shown that, in general and under appropriate regularity conditions, the number of sample points required to approximate a bounded continuous N dimensional function which is at least once differentiable and defined over a closed domain, grows exponentially with dimension N. The main theoretical result of this thesis is the following: We prove that such functions have a finite number of sub-domains where there exists at least one type of regression function (e.g. one of the SPORE structures define in Chapter 5) which has a rate of convergence independent of dimension. The SPORE approximation is in fact built by partitioning the input space into subdomains where the rate of convergence to zero approximation error is independent of dimension. This result is significant from both a theoretical and practical point of view. Theoretically it demonstrates that much broader function classes can have rates of convergence independent of dimension (see Section 2.3 of Chapter 2 for details). In practice it points the way to practical nonparametric regression algorithms which can work in very high dimensional spaces. A second theoretical result given in this thesis concerns the computational complexity required to build a SPORE structure. As stated above, SPORE subdivides the input space of the function being approximated into a finite number of subdomains where the rate of convergence is independent of dimension. In Chapter 5, it is further shown that the computational complexity of constructing SPORE in each of these subdomains is on the order of 0(N x C), where N is the dimension of the data and C is the average complexity of adding a functional unit (C is indepen-Chapter 1: Introduction 13 fix) Figure 1.2 The S P O R E - 1 Structure dent of N). This indicates that, at least within each siibdomain, the computational complexity of constructing an approximation is at most linear with dimension. This is a promising result from the point of view of practical nonparametric regression. One theoretical question which this thesis does not address is how many such subdomains are necessary. Having desirable theoretical properties in each subdomain may not mean much if there are exponentially many such domains in practice. Although this remains an open theoretical question, in Chapter 5 some theoretical justification is given which suggests that functions requir-ing many subdomains are unlikely. Furthermore, in Chapter 3 and Chapter 5 we establish conditions under which subdivision is not necessary. It is interesting to observe that none of the empirical evaluations of the SPORE approximation required domain subdivision (see Chapter 4 for details). Thus, for at least the regression problems studied in this thesis, one domain is sufficient (although virtual space subdivision was necessary to obtain these results: see Section 3.3.4 of Chapter 3). 1.3.3 The SPORE-1 Regression Function In Chapter 3, the most basic form of SPORE, termed SPORE-1, is defined and a construc-tion or learning algorithm is given. A l l of the experimental evaluations presented in this thesis (see Chapter 4) are done using SPORE-1. A brief description of SPORE-1 is give next. As shown in Figure 1.2, SPORE-1 consists of a cascade of 2 dimensional functions gL(-), Chapter I: Introduction 14 which are scaled (o^) and summed to produce the regression function f(x): the subscripts kQ, ...,kLG {1, TV} serve to identify the input variables (xx, xN). The exact algorithm used to construct the SPORE-1 approximation is fully defined in Chapter 5. The points of note in the learning algorithm are as follows: 1. the functions (building blocks) g ;(-) are 2-dimensional polynomials, all having the same number of coefficients; . 2. the functions gt(-) are added one at a time in order gj(-), g2(') >•••, ( m e learning data determines how many levels are constructed); 3. the order of inputs (JCJ, xN) is random, and any single input may enter the cascade at many levels; and, 4. the input space is randomly subdivided (using bootstrap) whenever a new gt(-) is added. Hence, the SPORE-1 learning algorithm conforms to the general SPORE methodology given in Section 1.3.1. Consider the simple nature of SPORE-1. The algorithm does not use a complicated variable selection technique (it is in fact random variable selection), and simple parametric functions (all identical) are incrementally added to the regression function. In fact, SPORE-1 is an experiment in pushing the lower bounds in algorithmic and structural simplicity, the goal being to test the efficacy of the general SPORE methodology at this most basic level. However, as summarized in Section 1.3.5, this surprisingly simple SPORE-1 structure can indeed form effective nonparametric regression functions. 1.3.4 Why Does SPORE-1 Work and When Does It Work Best? Given the simple structure of SPORE-1, it is not unreasonable to ask why SPORE-1 is able to form effective regression functions. Roughly speaking there are three basic reasons for this (see Chapter 3 for details). These are briefly addressed below. First, that the addition of each new building block to the structure defined in Figure 1.2 can have, on-average, only one of 2 consequences: 1) it can reduce the model error because of the addition of the new input variable (chosen randomly) to the regression function; or 2) it can leave the model error unchanged, because if the addition of the new input variable does not contribute to a better model, then the new building block g ;(-) simply passes the output of the previous level Chapter 1: Introduction 15 £/ _ i (•) to the input of the next level gt + x (•) (i.e. gt(•) = gt_ x (•)). The result is that the order in which input variables are added to the regression functions does not significantly affect model error, as long as the relevant variables are eventually incorporated. The second reason why SPORE-1 is effective is because it forms stable high dimension regression functions using relatively few learning examples. This is because 1) high dimensional data is projected onto low dimensional parametric building blocks gt{-) which implies that relatively few training examples are required to form stable model parameters [Friedman, 1994b], and 2) the outputs from these building blocks are combined using a weighted average (in the form of the scaling factors a ; ) which tends to further stabilize the regression function [Breiman, 1996b] (see Section 3.3.3 of Chapter 3). The third reason why SPORE-1 is effective is because the low dimensional projections (parametric building blocks) gt{-) tend to have un-correlated residual errors and are unlikely to be flat (i.e. g^ (-) = g/_ j(-)) if the input xk to g[(-) has an effect on the output. The reason for this is due to the random subdivision of the input space resulting from the use of bootstrap samples of the training data whenever a new level gt(-) is added (see Section 3.3.3 and Section 3.3.4 of Chapter 3). The result of this is that adding new levels (building blocks) to the regression function tends to improve the approximation, as long as the input variables continue to effect the output. The next question one might ask is the following: what types of problems is SPORE-1 best suited for? Given that the SPORE-1 structure consists of a long cascade of parametric building blocks (see Figure 1.2), with each level in the cascade incorporating additional input variables into the regression function, it is reasonable to assume that SPORE-1 is best suited to regression problems where many input variables are required to produce an effective model, and the relevant input variables have approximately equally significant effects on the output. In addition, if all of the parametric building blocks (g/(-)) are continuous (which they are in the current implementation; see Chapter 3 for details), in the ideal case, the target function would also be continuous. Although this is a compelling argument for defining an ideal class of target functions, there is significant experimental evidence that, in practice, SPORE-1 is not limited to only these types of regression problems. This is briefly discussed in the next section. 1.3.5 Experimental Evaluation of the SPORE-1 Approximation The SPORE-1 approximation is evaluated experimentally in Chapter 4. First, SPORE-1 is Chapter 1: Introduction 16 compared with other algorithms by applying it to 10 regression problems found in the literature [Grudic and Lawrence, 1997]. This initial evaluation shows that SPORE-1, on average, does at least as well or better (with respect to mean squared approximation error) than published results on 9 of these data sets. A l l of these published regression problems are relatively small and low dimensional, and thus not the type of problems SPORE-1 was designed for. Hence it is interesting to observe SPORE-l 's effectiveness on such problems. Next, SPORE is evaluated on 15 synthetic, large, high dimensional data sets (40,000 learning examples of data having 100, 200, 400, 800 and 1,600 inputs) [Grudic and Lawrence, 1997]. This synthetic data is generated specifically to evaluate SPORE-1 on the type of data sets it was designed for. Some of these data sets have output noise (i.e. the 8 in equation (1.2)), while others include input variables which do not affect the output (i.e. v in Section 1.1). Synthetic data allows us to test theoretical limits with full knowledge of what the best attain-able approximation errors are. The results from these experiments indicate that SPORE-1 closely approaches these best possible approximation errors. As no data sets of sufficient size and dimensionality were found in the literature, this is the first use of this data for evaluation of high dimensional regression. Next, SPORE-1 is applied to 3 high dimensional (1024 pixel inputs) human-to-robot skill transfer problems. These experiments involve having a human operator demonstrate a mobile robot "locate object and approach object" motion sequence using video input. During this demonstration, the human uses only the robot's sensors as inputs, and controls the robot's actuator as output. The demonstration generates a learning data set via recording the image pixels seen by the human, as well as the corresponding output commands, during each time step of the demonstration. This data set is used to construct a mapping between sensor inputs and actuator outputs, which when constructed, is used to autonomously control the robot. Experimental results indicate that the robot is able to autonomously accomplish the desired task. Although, as indicated in Chapter 4, the task being transferred from human to robot is relatively simple, these experi-ments demonstrate the ability of SPORE-1 to form very high dimensional regression functions, based on real world learning data. Finally, SPORE-1 is applied to the 10-bit parity problem [Grudic and Lawrence, 1997]. We chose this problem to demonstrated that SPORE-1 does not have the same limitations as other systems which build regression functions using small dimensional units which are added one at a Chapter 1: Introduction 17 time. 1.4 Plan of Presentation In Chapter 2 a review of nonparametric regression is presented. In Chapter 3 the simplest version of SPORE (SPORE-1) is described, and theoretical results are given. In Chapter 4 an empirical evaluation of SPORE-1 is presented. In Chapter 5 the general SPORE methodology is defined and the basic theory behind it is described. Chapter 6 contains a summary of the conclu-sions and major technical contributions of this thesis, and includes a proposal for future theoreti-cal and empirical evaluation of SPORE. Chapter 2: Review of Nonparametric Regression This Chapter gives a brief review of nonparametric regression. In Section 2.1, a discussion of various nonparametric regression algorithms is given. In Section 2.2, the problem of over-fitting in nonparametric regression is discussed and various published solutions are mentioned. In Section 2.3, convergence results of various regression functions are discussed. Finally, in Section 2.4, a comparison is made between published regression algorithms and the regression methodol-ogy proposed in this thesis (SPORE). 2.1 Regression Algorithms: Addressing the Curse of Dimensionality As discussed in Chapter 1, the curse of dimensionality has two serious detrimental consequences for practical high dimensional nonparametric regression. The first is the potential necessity for a large learning data set which is impractical, and the second is the computational complexity associated with a large number of input variables. Many algorithms have been presented in the literature which are designed to deal with these two difficulties. Good reviews and descriptions of many types of regression and approximation algorithms can be found in the literature [Kwok and Yeung, 1997] [Haykin, 1994] [Friedman, 1994b] [Hastie and Tibshirani, 1994] [Alfeld, 1989]. In the following, we discuss some representative examples of these algorithms. A l l of the algorithms discussed below have been demonstrated to be useful on a variety of different types of regression problems. Each has its own particular strengths and weaknesses, however, none of these algorithms have been demonstrated to be effective on the large, very high dimension nonparametric regression problems studied in this thesis. For the purposes of this thesis, we view nonparametric learning algorithms as broadly grouped into two major categories. The first is termed dimensionality reducing algorithms, and the second is called space partitioning algorithms. 18 Chapter 2: Review of Nonparametric Regression 19 2.1.1 Dimensionality Reducing Algorithms The goal in dimensionality reduction is to break a large regression problem into a number of smaller regression problems. The end result of this is a reduction in the number of parameters which are being fitted to the learning data. A reduced parameter set implies that algorithms which fit learning data to these parameters can potentially be more efficient. In addition, fewer parame-ters generally require fewer learning samples in order to achieve stable parameter values. In the following, we will briefly discuss some of the better known examples of dimensionality reducing algorithms. For organizational purposes, we divide these algorithms into two types. 2.1.1.1 Sum of Approximated Lower Dimensional Projections Let f(x), where x = ( J C 1 ; xN) e D qz 3iN, be an N dimensional function* and let /(x) be some approximation of /(x). Then in one form of dimensionality reduction, /(x) has the following structure: i,e{\,...,N} i „ i 2 G {1.....JV} X fi, i (*/,> •••»*!• ) 1 I •••lm M lm („ im e {1, N} where, for all m e { 1, AO , / . • are m dimensional functions defined over the domain D of /(x). Generally speaking, the goal is to have m much smaller than N, and to find some finite set of functions / t- • which sum to effectively approximate the target function. The hope is that effective approximations are possible using functions which are either lower dimensional or have some fixed finite parametric structure. Examples of this type of dimensionality reduction include Generalized Additive Models [Hastie and Tibshirani, 1986], M A R S [Friedman, 1991], G M D H [Ivankhnenko, 1971], A S P N [Elder and Brown, 1995], and S O N N [Tenorio and Lee, 1990]. These are briefly discussed below. In General ized Addi t ive Models [Hastie and Tibshi rani , 1986] [Hastie and Tibshirani, 1990] regression functions are constructed using a sum of smooth 1 dimensional functions X/^C*,-,) • The 1 dimensional functions / • are nonparametrically constructed, one at a time, using a scatterplot smoother. One advantage of this technique is that the construction of the functions / • is completely automated, requiring no interventions in the learning process. In Chapter 2: Review of Nonparametric Regression 20 addition, the resulting regression model can often be effectively interpreted, allowing a statistician to analyze significant factors involved in model generation. One limitation of this method is that higher dimensional interactions are difficult to model, however, in many lower dimensional regression problems this is not a significant concern. In Multidimensional Adaptive Regression Splines (MARS) [Friedman, 1991], regression functions are constructed using a sum of products of spline functions. These functions are built in a forward stepwise hierarchical manner which is based on recursive partitioning (see Section 2.1.2). M A R S is able to automatically account for interactions between variables, in a controlled manner, allowing for effective model interpretation. A limitation of M A R S is that its computa-tional complexity increases rapidly with dimension, making large regression problems (i.e. thousands of learning samples) of more than about 20 input variables, impractical when high dimensional interactions occur. However, in many applications, low dimensional interactions are sufficient to form effective approximations. Polynomials have been widely used to build regression models. Polynomial models have the following form (note that these types of multidimensional polynomials are often termed multinomials): i'i + ... + iN < M /(x) = £ ",•,...«„ •*i ' -"*!v (2.2) «i. -.«*e {0 M} where M is the order of the polynomial. These models are built by fitting the parameters at • to the learning data. One difficulty with this general polynomial formulation is that the number of parameters • increases rapidly with polynomial order and dimension, thus making it difficult to determine which subset of parameters are significant. Many iterative schemes have been developed which attempt to chose the most relevant parameters, and construct the approximation a few terms at a time. One such algorithm, originated by Ivakhnenko [Ivankhnenko, 1971], is the Group Method of Data Handling (GMDH). The original algorithm builds the global approximat-ing polynomial using layers of 2 dimensional quadratic polynomials. The first layer is constructed using enough quadratic polynomials to exhaustively account for all possible input combinations. The outputs of this first layer are then used as inputs to the next layer, where they are once more exhaustively combined two inputs at a time. This process of adding layers continues until further additions to the polynomial do not reduce the approximation error. There have been many Chapter 2: Review of Nonparametric Regression 21 enhancements to this original algorithm since it was introduced by Ivakhnenko [Farlow, 1984]. Some specific examples include the Algorithm for Synthesis of Polynomial Networks (ASPN) [Elder and Brown, 1995], which extends the paradigm to include 3 dimensional polynomials, and the Self Organizing Neural Network (SONN) [Tenorio and Lee, 1990], which allows for basis functions other than polynomials and uses a Simulated Annealing technique to search for an optimal structure. Other variations of this include the method by Sanger [Sanger, 1991] where subnetworks of polynomials (or other basis functions) are grown below the previously added terms which exhibit maximum variance in approximation error. A l l of these algorithms have demonstrated both good generalization and interpretation properties, on many types of regression problems. One common characteristic of these methods is that they all use some form of search in variable space (and often basis function space as well) to define the structure of the resulting polynomial. Ultimately, due to the computational complexity of such searches (see Section 1.2.2 of Chapter 1), these algorithms are not practical for the very high dimensional regression problems studied in this thesis. 2.1.1.2 S u m m e d H i g h D i m e n s i o n a l Bas is F u n c t i o n s A second general type of dimensionality reducing algorithm constructs regression functions that have the following structure: f{x) = X Sq(hq(x)) (2.3) q = 0 where gq are 1 dimensional functions, the basis functions h are functions of a known type chosen a priori, and the number of terms M is chosen large enough to give effective approxima-tions. Examples of these regression functions include Projection Pursuit [Friedman and Stuetzle, 1981], Radial Basis Functions [Hardy, 1971], Hinging Hyperplanes [Breiman, 1993], Wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995] [Zhang, 1997], Cascade Correlation [Fahlman and Lebiere, 1990], and other adaptive Neural Network type algorithms [Reed, 1993] [Igelnik and Pao, 1995] [Yao and Lui, 1997]. These are briefly discussed below. In Projection Pursuit [Friedman and Stuetzle, 1981], the functions hq are linear, having the form hq(x) = } a(-• xt, where the parameters a{ are chosen to maximize error reduction, given that the functions g are restricted to be smooth and nonparametric. This Chapter 2: Review of Nonparametric Regression 22 minimization of error process can be accomplished using Newton-Raphson [Press et al., 1988] on an appropriate cost function. The residual errors are updated after the construction of each hq and g function, and the construction of new functions stops when the approximation error is sufficiently small. Projection Pursuit is often used as an analysis tool because determining which projection axis (defined by function hq) gives the most error reduction, can shed light on which independent variables are important and how they interact. A limitation of this method is that all approximations are made along N dimensional linear projection axes (defined by functions hq), making the construction of these axes computationally difficult for very high dimensional regres-sion problems (N > 100). Cascade-Correlation [Fahlman and Lebiere, 1990] is an example of a nonparametric Artificial Neural Network regression function. Neural networks are generally implemented using sigmoidal basis functions (artificial neuron) of the form s(z) = 1/(1 + e z ) , where z = ot0 + Y!i- i ai' ui' w ' t n ui D e ' n g t n e inputs to the basis function and the parameters a ; being adjusted to fit the training data. Neural networks often consist of layers of these sigmoidal functions, with each layer containing a potentially large number of neurons or sigmoidal units, and the output of lower layers serving as inputs to higher layers (each unit in the first layer having as its inputs all of the predictor variables). The Cascade-Correlation algorithm is designed to automatically determine the network structure based on the training data. Cascade-Correlation adds hidden units (i.e. units not directly connected to the output) to the network one at a time, with each hidden unit having an input connection from all the previously added hidden units, as well as all of the predictor variables. Thus the number of inputs into hidden units grows linearly with the number of levels added to the network, with the first unit having N inputs (one from each of the predictor variables). In addition, each time a hidden unit is added, many potential candidate units are trained using the Widrow-Hoff delta rule [Widrow et al., 1976] starting from randomly chosen weights oc(. Only the one which best correlates with the current residual error is added as the next hidden unit. The efficacy of Cascade-Correlation on high dimensional data has been widely demonstrated in the literature [Michie et al., 1994]. However, because each hidden unit accepts inputs from all predictor variables, as well as the outputs of all existing hidden units, the method exhibits high computational complexity and potential numerical instability (due to a large number of parameters), when the regression problem is very high dimensional. As a result, Cascade-Chapter 2: Review of Nonparametric Regression 23 Correlation is unsuitable when large numbers of hidden units are necessary for effective approxi-mation, and thus is not ideally suited for the very high dimensional regression problems studied in this thesis. Other nonparametric artificial neural network algorithms work by either starting with a large network and pruning until an effective approximation has been achieved (pruning algorithms [Reed, 1993]), or by starting with a small network and adding interconnections and basis functions (artificial neurons) until no further error reduction is possible (constructive algorithms [Kwok and Yeung, 1997]). Both constructive and pruning type neural network algorithm have the same goal: to find a neural network structure which is large enough to effectively model the learning data, while small enough to ensure that over-fitting does not occur (see Section 2.2 for a discussion of variance reduction techniques and the problem of over-fitting). Other learning algorithms use genetic programming techniques [Yao and Lui , 1997] and stochas-tic basis function generation techniques [Igelnik and Pao, 1995] to achieve the same goal. These algorithms have been shown to perform well when relatively small numbers of artificial neurons effectively model the learning data. However, when large numbers of artificial neurons are necessary, the learning process can become unstable (small changes in learning data leading to large changes in model structure [Breiman, 1996b]) because of the large number of parameters in the neural network. This makes these types of neural network learning algorithms unsuitable for the large high dimensional problems studied in this thesis. In Radial Basis Function methods [Jackson, 1988] [Hardy, 1971] [Poggio and Girosi, 1989] [Broomhead and Lowe, 1988] [Casdagli, 1989] [Renals andRohwer, 1989] [Poggio and Girosi, 1990] [Edelman and Poggio, 1990] [Mel and Kock, 1990] [Saha and Keeler, 1990] [Girosi, 1994], the functions hq have the form hq(x) = <p(||x - x ||), where <j) is afunction which varies according to some distance metric from a center point xq. The functions gq take the form gq(x) = aq • hq(x) where the parameters a, are defined by the training data. Radial Basis Functions become nonparametric when the number and type of basis functions hq, and the location of the centers xq is determined by the learning data. A limitation of Radial Basis Functions is that they all use some measure of distance between points in space, and in high dimensional space, almost every learning sample point is closer to the sample boundary than to another learning point [Friedman, 1994b]. This makes radial basis functions not ideally suited for very high dimensional nonparametric regression. Chapter 2: Review of Nonparametric Regression 24 Hinging Hyperplanes [Breiman, 1993] represent regression functions as a linear superpo-sition of basis functions which consist of 2 hinging hyperplanes. These hyperplanes are hinged in the sense that two planes in TV dimensional space must always intersect at a line (or hinge) in N dimensional space. The learning algorithm works by finding the hinge between 2 hyperplanes which best fit the current residual error, thus producing a basis function. The next basis function is then produced in a similar manner using the new residual errors resulting because of the addition of the previous basis function. If N is small, the hyperplanes are chosen to be N dimensional, otherwise the best M <N dimensions are chosen for the hyperplanes. As is argued in Chapter 1, such searches make very high dimensional regression computationally difficult. Thus Hinging Hyperplanes are not an ideal candidate for the type of regression problems studied in this thesis. However, the computational efficiency of Hinging Hyperplanes on certain high dimensional problems (see [Breiman, 1993] for examples), make it an important potential candidate for many regression problems. Wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995] have also been effectively used in nonparametric regression. Wavelets have the attractive property that they are spatially adaptive, and thus they are effective at approximating functions which are locally spiky (i.e. are smooth throughout most of the input space, but have certain regions where they are rapidly changing). Wavelet shrinking algorithms have demonstrated good performance on relatively low dimensional problems (3 inputs or less) [Sjoberg et al., 1995]. However, wavelet algorithms which demonstrate good performance on very high dimensional problems have not yet been proposed in the literature. 2.1.2 Space Partitioning Algorithms In space partitioning algorithms the domain D of the function being approximated is divided into a finite number of disjoint subdomains Dt such that LJZ)- = D. Then, in each subdomain £>•, a regression function is constructed such that: V X G D ; , y = ; , (x) (2.4) Often, the functions '/,-(x) are simply constant over the domain Di. There are two main types of space partitioning algorithms: recursive partitioning or decision tree algorithms, and nearest neighbor algorithms. Chapter 2: Review of Nonparametric Regression 25 Examples of recursive partitioning algorithms or Decision Trees [Michie et al., 1994], include Classification and Regression Trees (CART) [Breiman et al. , 1984], and C4.5 [Quinlan, 1993]. Recursive partitioning works as follows: starting with the entire domain D, subdivide it into two daughter subregions such that a parametric regression function (usually chosen to be a single constant value) fitted in each region, reduces the overall approximation error. This process of binary subdivision continues with each daughter region (creating new daughter subregions) until some predefined cost trade-off between approximation error and number of subregions is met. Recursive partitioning algorithms have proven to be very useful, especially for classification problems. In addition, decision trees can be easily interpreted, allowing analysis of how various predictor variables interact with one another. However, because functions are approx-imated by many discontinuous subregions, recursive partitioning methods tend not to do well on regression problems which are continuous [Friedman, 1991]. This is especially true when there are many predictor variables. Another notable example of a tree-structured space partitioning architecture is the Hierar-chical Mixtures of Experts (HMM) model which uses the Expectation-Maximization (EM) learning algorithm [Jordan and Jacobs, 1994]. This is a parametric approach that uses a fixed number of experts (or parametric building blocks) which are hierarchically combined using gating networks to form the final regression function. In general, each "expert" is trained to work in a particular region of the input space, and the gating networks combine these regions to form a global approximation. This H M M model has been successfully applied to many learning problems, however, because it uses a parametric learning algorithm, it is not directly applicable to the nonparametric regression problems studied in this thesis. In the simplest implementation, AT-Nearest Neighbor algorithms work as follows: given a point x , find the K points in the learning set which are nearest to this point, according to some distance metric. Then the regression estimate at x becomes the average (perhaps a weighted average) response value of these ^-Nearest Neighbors [Michie et al., 1994]. A criticism of Nearest Neighbor algorithms has been that they often exhibit poor generalization [Lowe, 1995]. This problem is mainly due to a poor choice of distance metrics (or kernels). Recently, algorith-mic improvements have been suggested which allow the learning data to define the most appropri-ate distance metric [Lowe, 1995] [Friedman, 1994a] [Hastie and Tibshirani, 1996], the result being much better generalization, especially for classification problems. However, as with regres-Chapter 2: Review of Nonparametric Regression 26 sion trees, because functions are approximated by many discontinuous subregions, Nearest Neighbor algorithms have yet to demonstrate good performance on regression problems which are very high dimensional (greater than 100 inputs) and continuous. 2.2 Regression Model Variance Reduction Techniques For any given set of learning data, there are infinitely many regression models which fit this data exactly. An example of this is given in Figure 2.1, where the 5 data points are fitted exactly with 3 different model types: the piece-wise linear fit model, the smooth fit model (smoothness here refers to how rapidly the model varies within its domain rather than how many times it is differentiable) and the high variance fit model. These 3 models all describe the learning data exactly, however, they interpolate between the data points in very different ways. Most likely, at most one of these models most appropriately describes the target function that generated the data. In fact, only a small subset of all models which fit any given training data accurately, also effectively describe the target function that actually generated the data. Thus, one of the more difficult problems in nonparametric regression is choosing the most appropriate regression model. The problem becomes even more difficult when the training data is noisy, in which case, exactly modeling the learning data can lead to greater approximation errors. This makes the regression model search space even larger because, in addition to models which exactly fit the training data, the search space is extended to include those models which closely fit the training data. As an example of this, the smooth approximation model in Figure 2.1 would be just one model which can be used to model the noisy sample points. Any number of small variations to this smooth approximation of the sample points could equally well fit the data points. A model with high variance is one which is not stable with respect to which set of training samples are used to construct it. The more variance a model has, the more sensitive it is to small changes in learning data (this notion of model variance is related to model stability as define in Breiman, 1996b). If a model has high variance then two different sets of learning data randomly generated by the same target function, can produce two very different regression models. Often, as with the high variance fit model in Figure 2.1, both models exactly fit their respective learning data, however neither interpolates effectively between data points (this is often referred to as model over-fitting). Thus, an important goal in regression is to reduce model variance. For most regression problems, failing other evidence the model with least variance is often the one which Chapter 2: Review of Nonparametric Regression 27 Figure 2.1 Model Variance most smoothly interpolates through data points (for example the smooth fit model in Figure 2.1). The variance reduction problem is related to the bias-variance trade-off [Friedman, 1994b], which divides the approximation error into a bias term and a variance term. The bias term defines how closely the regression model fits the learning data, and the variance term defines how much variance there is in the regression model. For example, the three models in Figure 2.1 that fit the data exactly, all have a zero bias error, but have varying degrees of variance. Conversely, the smooth approximation model in Figure 2.1 has a nonzero bias error because it doesn't pass exactly through sample points, however, if the data is assumed to be noisy, its variance error may be the smallest of all three models. Thus, the smooth approximation model may be the most appropriate model when data is noisy. A variety of techniques have been developed for nonparametric model variance reduction. In this thesis we group these methods into two categories: penalty based variance reduction and averaging based variance reduction. These are briefly discussed below. Methods which use penalties for variance reduction work by giving a specific cost to an increase in model complexity. The basic reason for this is that an increase in model complexity (i.e. an increase in the number of model parameters) often means an increase in model variance. Thus, the trade-off between the error reduction obtained by the addition of new model terms, and the increase in complexity due to the addition of these new terms, must be analyzed to determine if this increase in model complexity is justified. Examples of these types of penalty based variance reduction techniques include the generalized cross validation criteria used in M A R S Chapter 2: Review of Nonparametric Regression 28 [Friedman, 1991], and the minimum description length criteria used in S O N N [Tenorio and Lee, 1990]. Generalized cross validation works by multiplying the average squared residual error of the training data by a penalty factor which increases as terms are added to the M A R S model. Similarly, minimum description length attempts to find the smallest model which effectively fits the training data. Other examples of penalty based variance reduction techniques can be found in [Friedman, 1994b]. One key advantage of variance reduction techniques which use penalties, is that a single regression model is constructed, thus allowing for potentially useful model interpre-tation. However, one of the key problems associated with penalty based variance reduction techniques is that they are inherently unstable. Breiman [Breiman, 1996b] has shown that this form of model selection is unstable because small changes in learning data can lead to large changes in model structure, making it difficult to choose the most appropriate model. In contrast to penalty based variance reduction techniques, averaging methods work by forming ensembles of many different regression functions, and modeling a final predictor value as a weighted average of all the regression functions. One of the first noted examples of this is Stacked Generalization [Wolpert, 1992] where potentially nonlinear combinations of regression functions, each trained on some subset of the total data, are used to form the overall regression function. This idea is simplified by Breiman in Stacked Regression [Breiman, 1996c], where the stacked regression functions are simply weighted by non-negative real values. Theoretical justifi-cation for such weighted averaging is given by Perrone [Perrone, 1993], where derivations are given for the optimal weighted average of a fixed number of regression functions. In addition, Breiman has shown that averaging can effectively stabilize predictors which are inherently unstable [Breiman, 1996b]. One example of an averaging variance reduction technique is Breiman's bootstrap aggregation or Bagging [Breiman, 1996a]. Bagging uses the average value of many regression functions, each trained using a bootstrap sample of the learning data, to represent the final regres-sion function. The key idea here being that the bootstrap samples tend to create regression functions which have uncorrelated error residuals, thus averaging their values tends to cancel out approximation errors, giving potentially better regression functions when compared with a single regression. Bagging has proven to be an effective variance reduction technique, and variations of it have been successfully applied to many types of regression problems [Parmanto et al., 1996]. However, one significant drawback of all averaging methods is that they do not easily allow for Chapter 2: Review of Nonparametric Regression 29 interpretation of the resulting regression model. 2.3 Convergence Results in Nonparametric Regression In this section, we consider three main types of known convergence results for nonpara-metric regression. First we briefly discuss some known results in representability, or what function class a given regression function is theoretically able to converge to. Second, we look at two types of rate of convergence results. The first measures how, given an infinite number of learning sample points, the approximation error decreases with the size of the regression model. The second measures how the approximation error decreases as the number of sample points of the target function increases. Examples of these types of convergence results are given below. Given a particular regression function, one of the first theoretical questions which is asked is what class of functions can it represent! Indeed, the most common question posed is whether the regression model is a universal approximator; i.e. whether it is able to form arbitrarily good approximations of bounded continuous functions defined over a finite domain. Examples of universal approximators include polynomials [Lorentz, 1986], radial basis functions [Girosi, 1994], artificial neural networks [Cybenko, 1989] [Hornik et al., 1989], space partition-ing techniques [Breiman et al., 1984], projection pursuit [Huber, 1985], and wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995]. These representation results are important because they establish what the respective regression functions are capable of representing. However, represen-tation results in themselves do not establish the rate at which these regression functions converge to the target function. The first type of rate of convergence result we consider is one which measures the decrease in approximation error as the number of parameters in the regression function increases. From a practical standpoint, this rate of convergence measure tells how the size of the regression model depends on the dimension of the target function. In particular, from the standpoint of high dimensional regression, we can ask the following question: "under what conditions is the rate of convergence of the regression model independent of dimension?" A general answer to this question is the following: by limiting the class of target functions which one is attempting to approximate, many regression functions have a rate of convergence which is independent of dimension [Girosi, 1994]. This includes radial basis functions [Girosi, 1994], artificial neural networks [Barron, 1993], hinging hyperplanes [Breiman, 1993], and projection pursuit Chapter 2: Review of Nonparametric Regression 30 [Jones, 1992]. A l l of these theoretical results demonstrate rates of convergence independent of dimension are based on a lemma which can be jointly attributed to Muarey [Barron, 1993], Jones [Jones, 1992] and Barron [Barron, 1993] [Girosi, 1994]. In all these results, the class of target functions is limited to those functions for which the magnitude of the first moment of the Fourier transform is finite and independent of dimension. In addition, all of these results are based on greedy algorithms which search an infinite function space for the globally optimal regression function. Because of this large search problem, it is not clear how computationally practical these convergence results are [Barron, 1993] [Breiman, 1993] [Juditsky et al., 1995]. The second rate of convergence result we discuss here is how approximation error decreases with the number of sample points of the target function. Stone [Stone, 1982] showed that, in the general case of a target function which is at least once differentiable, the optimal rate of convergence as a function of the number of sample points, does depend on dimension. In fact, this dependence is significant enough that general very high dimensional nonparametric regres-sion requires an impractical number of sample points (i.e. the curse of dimensionality). Once more, by putting more limits on the target function, it is possible to improve on these rates. There are examples in the literature which demonstrate function spaces, and corresponding regression functions, which have rates of convergence that are independent of dimension. One example of this is the projection pursuit regression function when the target function is limited to the class of projection pursuit regression functions with specific constraints [Chen, 1991a]. A second example is an interaction spline model, where the target function is sufficiently differentiable [Chen, 1991b]. Such theoretical results are useful because they can lead to practical implementa-tions which are effective in very high dimensional spaces. For further discussions on rate of convergence results, see Section 5.3 of Chapter 5. 2.4 Compar ing Exist ing Algori thms to the S P O R E In this chapter, many different approaches to nonparametric regression have been discussed. In addition, we have discussed the concept of model variance reduction and how it is done in the literature. Finally, we have outlined some of the basic theoretical results in nonpara-metric regression. In this section, we compare SPORE to other published regression methodolo-gies. First we outline which existing learning algorithms most closely .resemble SPORE-type learning algorithms, and how they are fundamentally different. Then we discuss how the variance Chapter 2: Review of Nonparametric Regression 31 reduction methodology of SPORE relates to those of existing variance reduction techniques. Finally, this section concludes with a brief description of how the theoretical properties of SPORE differ from those of other published regression functions. The two types of regression functions which are most closely related to the regression functions studied in this thesis (SPORE), are Cascade-Correlation [Fahlman and Lebiere, 1990], and G M D H [Ivankhnenko, 1971] type algorithms such as A S P N [Elder and Brown, 1995] and SONN [Tenorio and Lee, 1990]. A l l of these algorithms build successively on structures, using the outputs of previously constructed structural units as inputs to new units. However, from the conceptual standpoint of how units are added, and how their inputs are chosen, SPORE differs significantly from these regression functions. These regression functions attempt to build an "optimal" structure by searching for the "best" set of functional units, and/or the best set of inputs to each structural unit. As argued in Section 1.2.2 of Chapter 1, such searches are not computa-tionally feasible in very high dimensional spaces. In the SPORE methodology, this philosophical perspective of searching for optimal (or close to optimal) structures is abandoned because our aim is specifically to address the very high dimensional regression problem. Instead, SPORE algorithms spend minimal computational effort on searches in function space and input space, concentrating more on constructing structures which produce robust approximations without the need for costly search strategies. From the standpoint of how variance reduction is achieved in the SPORE methodology, we refer the reader to the specific example of SPORE-1 in Chapter 3, and the related discussion in Section 3.3.3 of that chapter. In comparison to other algorithms, SPORE-1 achieves effective variance reduction using a methodology which is similar to Bagging [Breiman, 1996a], but does so with only a single regression function, instead of an aggregate ensemble of regression functions. This is accomplished by constructing each structural unit using a bootstrap sample of the training data. The output of SPORE-1 is the weighted average (each unit is weighted propor-tionally to the inverse of the squared residual error it has on the training data) of each of these simple structural units. This produces an effective variance reduction methodology (here termed internal variance reduction) which differs significantly from other proposed averaging methods. In Section 5.3 of Chapter 5, the rate of convergence (as a function of the number of sample points) results for the general SPORE-type learning algorithms are given. The rate of convergence is shown to be independent of dimension, but on different function spaces from those studied in Chapter 2: Review of Nonparametric Regression 32 the past. Instead of putting greater limits on the function type [Chen, 1991a] or requiring it to be more differentiable [Chen, 1991b], we assume a more general class of bounded continuous functions, which are at least once differentiable. We obtain a rate of convergence results indepen-dent of dimension by limiting the target function's domain of definition in specific ways. Because of this, regression functions based on SPORE have very different convergence properties when compared with other published results. Chapter 3: The SPORE-1 Regression Function The general SPORE methodology outlined in Chapter 1 can be implemented in a variety of ways. In this chapter, the simplest implementation of SPORE, termed SPORE-1, is defined and analyzed. In Section 1.3 of Chapter 1, an intuitive justification for the SPORE-1 structure was presented. In this chapter, we present a theoretical analysis of SPORE-1, with particular emphasis on rate of convergence and complexity properties. In Section 3.1 we examine the representation capabilities of the SPORE-1 regression function. In Section 3.2, the SPORE-1 learning algorithm, which is used to construct the regression function, is defined. In Section 3.3 an analysis of the theoretical properties of the SPORE-1 learning algorithm is given. Finally, a summary of the main results of this Chapter is given in Section 3.4. 3.1 SPORE-1: A Cascade of Two Dimensional Functions The SPORE-1 structure is a cascade of two dimensional functions, with the output of each level of the cascade feeding directly to one of the inputs of the next level. This is diagrammati-cally represented in Figure 3.1. In the following, a complete description of this basic structure is given, followed by an analysis of the class of functions which can be represented by this SPORE-1 structure. Let RL(xp xN) be an N dimensional SPORE-1 regression function constructed using an L level cascade of 2 dimensional functions. Thus, y = RL(x) estimates the dependent variable y given the independent variables x = (JCJ, ..., xN), and has the following form: L RL{xx, ...,xN) = a, •gi(xko,xki) + a2-g2(g1,xk2)+ £ a, • gl(gl_ „ xk) (3.1) 1 = 3 where L defines the number of levels in the cascade or, equivalently, the number of 2 dimensional functions £/(•); OC/ are real valued scaling factors; and the subscripts k0,...,kLs {l,...,N} serve to identify the input variables (JCJ, xN). 33 Chapter 3: The SPORE-J Regression Function 34 a 1 — xk2 ^ a2 8LM RL(x) Figure 3.1 The S P O R E - 1 Regression Given the above definition, the question we pose is what class of functions can be represented by the simple cascade given in (3.1)? The answer to this question depends on what class of 2 dimensional functions g^-) are used in the cascade. In the following, two classes of 2 dimensional functions are considered. First we consider the case of gt{-) belonging to the class of bounded continuous 2 dimensional functions defined on a closed domain in 9t .In the second case gt{-) is limited to the class of 2 dimensional polynomials of finite order greater than 1. The interesting conclusion is that in both instances, the cascade in (3.1) is shown to be a universal approximator: i.e. able of approximating any bounded continuous multidimensional function defined over a finite domain. 3.1.1 A Cascade of Bounded Continuous 2 Dimensional Functions In the following theorem, it is assumed that the 2 dimensional functions gL{-) belong to the class of 2 dimensional bounded continuous functions. Under these conditions, it is shown that any continuous, bounded N dimensional function, defined over a closed domain, can be 2 represented using a cascade of 2N + 3N g ;(-) functions. Since this is the broadest class of bounded continuous gt(-) functions, this result defines a bound on the number of cascade levels necessary to exactly represent a bounded continuous multidimensional function. Other smaller classes of bounded continuous functions, such as polynomials, will require more functions. T H E O R E M 1. Let f(x\, xN) be a bounded continuous function defined on a closed Chapter 3: The SPORE-J Regression Function 35 domain in 3iN, N > 2. Then, there exist 2N2 + 3N bounded continuous gi(-) functions which 2 are at most 2 dimensional, and 2N +3N real valued numbers a.[, such that: 2N2 + 3N-2 f(xi,...,xN) = a r g l ( x k o , x k i ) + a2-g2(g],xk2)+ £ <*f8i(8i-i>xk) (3-2) /= 3 Proof: The proof is based on Kolmogorov's superposition theorem [Kolmogorov, 1957], which states that any bounded continuous TV dimensional function, f(xx, xN), defined on a closed domain, can be represented as: 2N f N \ q-=0 V = 1 (3.3) where the 1 dimensional functions \|/ and are bounded and continuous (for more recent versions of this theorem see [Lorentz, 1986]). Thus, for our proof we must show that (3.2) and (3.3) are equivalent for N>2. We show this by equating all gt(-) functions in (3.2) to the appropriate functions in (3.3). First make the following assignment: 8\ = <t>0l(xl) + (t)02(x2)' ex, = 0 (3.4) Then, for Ze {2, ...,N- 1} and p = I + 1, let £/ = S/-1 + M V ' ai = 0 ( 3 - 5 ) Thus, completing the definitions to level / = TV, let 8N = Yo(fftf-i)> A N = 1 (3-6) Therefore, TV levels are required to build the q = 0 part of (3.3). What remains are the q = 1, 2TV parts to be constructed. Consider the q = 1 part. For level / = N +1 , let 8 N + i = <hi(*i)> aN + \ = 0 (3-7) Then, for le {N + 2, ...,2N} and p = / - A T , let */ = S / - i + ( M V ' ai = 0 (3"8) Thus, completing the definitions to level / = 2N + 1 , let 82N+1 = Vl(82N)> a 2 N + \ = 1 (3-9) Chapter 3: The SPORE-1 Regression Function 36 Sl-\ xl Si xl Si+i 0 + xl+(0-gl_1) + (0-xl •£/_,) + at = 0 0 + xx + (0-gl) + (l -zj •g / ) + ... 0 + x, + ( O g / + 1 ) + ( l - g / + ] ) + ... «/ + 2 = 1 Figure 3.2 A Cascade Representation of a Polynomial Term Therefore, 2N + 1 levels are required to build the q = 0 and q = 1 parts of (3.3). What remains are the q = 2, 2N parts to be constructed. The construction of these parts is identical to that of q = 1, which took N + 1 levels of (3.2). Thus, the total number of levels required is 2N(N +l)+N = 2N2 + 3N. This completes the proof of T H E O R E M 1. Q.E.D. T H E O R E M 1 establishes a bound on the sufficient number of levels required to exactly represent any bounded continuous function defined over a finite domain in 3 or more dimensions. However, the only restriction on the functions g ;(-) is that they be some 2 dimensional function. In addition, the proof of T H E O R E M 1 is based on Kolmogorov's superposition theorem, and it is a known result that the 1 dimensional functions used in Kolmogorov's theorem are often highly non-smooth [Vitushkin, 1954] [Vitushkin and Henkin, 1967] and therefore not easy to represent in practice. 3.1.2 A Cascade of Finite Order 2 Dimensional Polynomials In this section the functions gt{-) are further restricted to belong to the class of finite order polynomials. Thus T H E O R E M 2 deals with a more practical scenario in that the functions g l {-) are of a class which makes them easier to use in practice. T H E O R E M 2. Let f(xx, xN) be a continuous function defined on a compact subset A cz 3iN. Let gl(-) be a 2 dimensional polynomial of finite order greater than 1. Then, for each Chapter 3: The SPORE-J Regression Function 37 £ > 0, there exists a finite number, L, of functions £/(•) and real numbers OC / ( such that \f{xx, ...,xN)-RL{xx, ...,xN)\<z (3.10) where / ^ (x) is defined in (3.1). Proof: It is a known result that a continuous function / ( * , , ...,xN) defined over a compact subspace can be approximated by polynomials in x u ...,xN (see Theorem 6 on page 10 in [Lorentz, 1986] for a statement of the appropriate theorem). This representation has the form f(xv...,xN)= £ b i i t _ t i i t - x i - x % (3.11) ...,iNe {0, 1,2, ...} where b^ t- are polynomial coefficients. Thus, we shall prove T H E O R E M 2 by showing that any polynomial in xx, xN can be represented by the cascade structure defined in (3.1), where the g [ (-) are 2 dimensional polynomials of finite order greater than 1. By this constraint, the simplest form of functions used in the cascade is: 2 2 gl(u,v) = a 0 0 + a ] 0 u + a 0 1 v + alxuv + a20M + an2 v (3-12) This and every other 2 dimensional polynomial of order greater than 1 must at least have the terms a 1 0w and alxuv. Given these two terms, it is simple to show that any term in (3.11) can be equated to the cascade structure (3.1). As an example, consider the polynomial term 3 3 5 3 3^35 ' xix2x3 • We first construct the JC , part of this term as outlined in Figure 3.2. Starting at any arbitrary level / , set at = 0, xk = x , , a 1 0 = 1, and all other parameters of g [ (-) to zero. Then, for levels p = / + 1 and p = / + 2 set a p = 0, xk = xx, au = 1 , and all other parameters of 3 P 3 3 5 3 gp(-) to zero. This gives us the x{ part of b 3 3 5 • x{x2x3. Next, we construct the x2 part as follows. For levels p = / + 3 to p = / + 5 set a p = 0, xk = x2, au = 1, and all other parameters of g p(-) to zero. Next, for the x^ part, for levels p = / + 6 to p = 1 + 9 set a p = 0, xk = x 3 , a n = 1, and all other parameters of g p(-) to zero. Finally, for level p = /+10,set ocp = b 3 3 5 , xk = x 3 , a n = 1, and all other parameters of g p(-) to zero. This gives us the P 3 3 5 completed term b 3 3 5 • xxx2x3 . In a similar manner, we can construct any other term in the polyno-mial (3.11). Hence, given that we can construct any polynomial term using sections of the cascade structure in (3.1), and because (3.1) is a sum of such sections, it follows that any polynomial structure can be represented by the cascade structure defined in (3.1). This completes the proof of Chapter 3: The SPORE-1 Regression Function 38 T H E O R E M 2. Q.E.D. 3.2 Learning Algorithm for the SPORE-1 Regression Function In the previous section we have shown that the SPORE-1 regression function is a universal approximator. However, we have not shown how to construct the functions g ;(-) and real numbers at. One such algorithm is presented next. 3.2.1 The SPORE-1 Learning Algorithm We assume that ML learning input/output examples (xq, y ) e TL have been divided into 2 sets: a training set TT containing MT input/output pairs, and a validation set Tv containing Mv input/output pairs. In Chapter 4, various methodologies are described for dividing learning data into training and validation sets. There are 3 preset learning parameters, symbolized by depth j , depth2, and e. The function of each of these parameters is explained in the following algorithmic description, and is further elaborated on below. A single cascade of functions is constructed in a series of sections. A section of a cascade between levels i and symbolized by lRj, is defined as follows: j l = i When the section being constructed can no longer reduce mean squared error, the learning outputs y are replaced by residual errors due to lRj and a new section is begun starting from the last error reducing function gj(-). Throughout the remainder of the chapter we assume that the functions are 2 dimensional polynomials of finite order: I + j < K gt{u,v) = 2 fly-MV (3.14) i, ye {0, 1,2, ...'} where K is the order of the polynomial, and a- are its coefficients. Stated in detail, the construc-tion of the regression functions, RL(x), proceeds as follows (note that each SPORE-1 regression function has only one output, so for problems with multiple outputs, one regression function must be constructed for each output): Chapter 3: The SPORE-J Regression Function 39 Algorithm 1: SPORE-1 STEP 1: Initialize algorithm: Initialize the 2 counters i = 1 and p = 1, where / and p are used as follows: / indexes the function g-(-) at the start of the current section; p is a subscript indicating that section p of the cascade wil l be constructed in step 2. Throughout the algorithm the subscripts k0, kL (used to identify the independent variables) are randomly selected, one at a time, from the set {1, N} , without replacement. When the set is empty, it is re-initialized to {1, N} , and the process of selecting these subscripts continues. STEP 2: Construct new section: Construct, in order, the functions £,•(•)>£,• + ](•)> • • •, gj( •), where j - i> depth j, as follows: 1. Obtain a bootstrap sample set TT (i.e. MT randomly chosen samples with replacement) from the training set TT. 2. Given r r , using singular value decomposition, determine the model parameters {apa} of the function g(-) which minimize the following least squared error criteria: Ho-} = a r g M I N H 0 } (3.15) The construction stops with function gy(-) when the change in cascade error on the validation set Tv, over the past depth} + 1 level additions, is less than some small, positive number £. This is defined as follows: 1 -depth j MSE(lRj)rv I \q = j - 1 - depth/ MSE{lRq)Tv <e (3.16) JJ where lR = ( a / ' £/(') ^ s t n e constructed cascade between levels i and q; and MSE(Rq)Tv is the mean squared error of this cascade on the validation set Tv. With the addition of each new level, the scaling factors al, for / = j, are recomputed as follows: Chapter 3: The SPORE-1 Regression Function 40 (MSE(8l(-))r )"' a, = - 1 (3.17) ZWSEigq(-))rT)-\ q = i where MSE(g[(-))r is the mean squared error of the single function gt(-) on the training set TT. Hence, at is simply the inverse of the normalized mean squared error of level / , and its purpose is to give more weight to levels which contribute more significantly to error reduction. STEP 3: Prune section: Next prune the current section back to the level which has the best mean squared error as follows. Find the minimum level s e {/- 1, i, j} which gives the least mean squared error of the cascade on the validation set Tv, in the following sense: choose^<r s.t. V( re {i,...,j}), MSE{lRs)rv<MSE(Rt)rv (3.18) Delete all functions g J + ] ( • ) , g i + 2(0> •••> g / ' ) • Set j - s and, using (3.17), recalculate the - scaling factors CLT, for / = i, j . The current state of the cascade is now given by: /= l Note that if s = / - 1 then all of the functions just constructed in step 2 are deleted. STEP 4: Update learning outputs: Update the output values, y •, of both the validation set Tv and the training set TT as follows: s V ( ( x , y ) 6 r v u r » = > y = y - X < * / • * / ( • ) (3.20) l = i The sets Tv and TT now contain the residual approximation errors after the cascade has been constructed to pruned level s. Let be the variance of the output values, y •, of the valida-tion set Tv (as defined in step 1, the index p refers to the section just constructed). Thus, pp is equivalent to the mean squared error of the approximation Rs(x) on the validation set Tv. pp is used in STEP 5 to establish a stopping condition under which the construction of the Chapter 3: The SPORE-1 Regression Function 41 cascade stops. STEP 5: Check stopping condition: Go back to STEP 2, constructing at least depth2 + 1 sections, until the following condition is met: f \ 1 - p - i \y = p - 1 - depth2 depth* < £ (3.21) Therefore, the algorithm will terminate when the mean squared error fails to improve by a factor of e, over the last depth2 + 1 sections. Before executing STEP 2, set / = s + 1 and increment the section counter p to initialize the next section. Upon termination, set L = s, the total number of levels constructed. This completes the description of the algorithm. As indicated above, the learning parameters depth j , depth2, and £, are all used to define conditions under which the construction of the cascade stops. In particular, £ defines the relative decrease in error required to justify the addition of more gt(-) functions to the cascade. The learning parameter depth} defines how many levels of the cascade are used to determine the termination conditions of STEP 2. Similarly, depth2 and £ are used in STEP 5 to define how many times STEP 2 is executed before the algorithm is executed. The theoretical significance of these learning parameters is discussed in Section 3.3.1. In Chapter 4, the actual setting of these values is discussed. 3.3 Theoretical Analysis of the SPORE-1 Learning Algorithm In this section we analyze some theoretical properties of the SPORE-1 learning algorithm. In Section 3.3.1 we give convergence and rate of convergence results for the algorithm. In Section 3.3.2 we give complexity results. In Section 3.3.3, we discuss the internal variance reduction property of SPORE-1. And in Section 3.3.4 we define the virtual space partitioning property of SPORE-1. Chapter 3: The SPORE-J Regression Function 42 3.3.1 Convergence Results We begin the theoretical analysis of SPORE-1 by analyzing its convergence properties. In T H E O R E M 3 we prove that, under appropriate regularity assumptions, the SPORE-1 algorithm monotonically converges to some finite approximation error. In T H E O R E M 4 we give sufficient conditions for the approximation error to converge to zero. In T H E O R E M 5 we extend this result by showing that the rate (as a function of the number of learning sample points) at which the learning algorithm converges to this approximation error, is independent of the dimension of the dimension of the target function. The proofs of T H E O R E M 3, T H E O R E M 4 and T H E O R E M 5 are based on 4 lemmas which we state and prove below. The reader should note that in the following, the learning data F L always refers to the updated learning data or residuals as defined in STEP 4 of the learning algorithm. In the first lemma, we show that the approximation error on any bootstrap sample used in the construction of a function £/(•)> must either decrease or stay the same; it can never increase. L E M M A A. For all / > 1, the following is true (note that the subscript q serves as an index for the summation V ): Proof: From S T E P 2 of the SPORE-1 learning algorithm, the parameters of g[ + ,(•) are chosen as follows: Given that u = gl and v = xk we could set coefficients of all terms with xk/ ^ to zero if they increase error. Therefore, the error can never increase. Similarly, error will decrease if and only if L E M M A A is true by definition. Q.E.D. The implication of L E M M A A is that the mean squared error of any new level MSE(gl + j(-)) „, on any bootstrap sample TT, is as least as small as the mean squared error of the previous level. Thus, even if v = xk does not contribute to the lowering of error at level I + 1 , the parametric form of g[+ , (•) ensures that for any bootstrap TT, the previous best error is (3.22) iaij} = a r g M I L W ^Z(8i+\(u,v)-yq) lrBT (3.23) v = xk contributes to error reduction within the parametric structure of g / + 1 ( - ) - Thus, Chapter 3: The SPORE-1 Regression Function 43 maintained. Next we examine how the expected error, on the entire learning data set Y L , behaves with the addition of a new function gt + , (•) to the cascade. This requires us to make some assumptions about the function which originally generated the learning data. Assumption A: We will assume that all the learning data Y L , and therefore all of the data in the training set YT and the validation set Yy, are independently generated by some unknown function: y = / ( x ) + e(x) (3.24) where f(x) is bounded, continuous and defined over some closed domain D cz 3lN, and e(x) is some noise term with the property that V(x £ D), E(e(x)) = 0. Furthermore, we assume that the probability density function which defines the distribution o/e(x) is continuous, and has finite variance, V ( x e D). Note that one can equivalently define fix) as follows: oo / ( x ) = E[y\x) = \ yh(y\x)iy(dy) (3.25) —oo where, h(y\x) is a continuous probability density function with bounded variance, and q>(dy) is some measure on . In the following lemma, we use Assumption A to show that the expected approximation error on the learning data either decreases or stays the same. L E M M A B. Given Assumption A, and for sufficiently large sample sizes MT and My, then for all l> 1, the following is true: E[£(8i+i(8i, xklJ - y / ] < £ [ ] • > / . ( • ) -yq)2] 0-26) Proof: From L E M M A A, we know that L E M M A B is true if the learning set YL is equiva-lent to the bootstrap training set: i.e. YT = YL. Hence, it is reasonable to assume that it will also be true if r r is a "sufficiently good" representation of the distribution of points in the training set YT and the validation set Yv. Thus, we prove this lemma by showing, given Assumption A, YT becomes sufficiently close (in the probability distribution sense) to YT and Yv for large MT and Mv. Consider any arbitrary point x, e D. From the Central Limit Theorem, we know the follow-ing: if y,, yn are independent random observations from a population with probability Chapter 3: The SPORE-1 Regression Function 44 2 1 function h(y\xx) for which the variance o [y] is finite, the sample mean y = ~2"- l s approximately normally distributed when the sample size is reasonably large, with mean E[y] 2 and variance (a [ y ] ) / « [Nester et al., 1990]. Therefore, if we limit all learning sample points to be generated from any single point X j e D and we make MT and Mv sufficiently large, the following is true due to Assumption A and the Central Limit Theorem; ^8i+\isi,xklJ-yq)2 = E\^L(8i(-)-yq) • r —' I— -i— (3.27) This is true because, at any single point x, e D, the construction of the function gl + x(gt, xk ) as defined in (3.23) is equivalent to finding the mean value of the bias coefficient a 0 0 ; i.e. 8i+i(8i> xic! ,) reduces to a single mean value which is equivalent to the mean of a one dimensional distribution. In addition, assuming that the variable xk ^ does not contribute to reduction of error, then because gt(-) and f(x) are both bounded and continuous over a finite domain, and because the distribution of e(x) is also continuous, it must be the case that (3.27) is also true for any finite local region (within the domain D) about any arbitrary point x, 6 D . This is true because the continuity constraints on gl+x{-) imply that as any point (w,, V j ) approaches another point (u2, v 2 ) , then gl + l(ul, v,) must approach gl + , ( M 2 , V 2 ) ; hence, this is analogous to taking an elastic surface and smoothly bending it according to the constraints of (3.23). Furthermore, under the same conditions, because this is true for any arbitrary point X j e D, (3.27) must be true over the entire domain D (i.e. simply stretch the above elastic surface to include all of D). Thus, at minimum, the expected approximation error will remain unchanged. Similarly, due to the construction of g l + x (g ;, xk^ ) as defined in (3.23), the condition E\T(Si+i(8i, xk ) - y/1 < 4Z(S/(0 - v/1 (3.28) is true if and only if the variable | serves to reduce the approximation error within the context of the parametric structure of g l + x (gt, xk | ) . This completes the proof of L E M M A B. Q.E.D. Next we extend the result of the previous lemma to show that the expected approximation error, over the entire domain D of the target function, either decreases or stays the same. L E M M A C. Let fu(x) represent the updated error function after the computation of residuals in STEP 4. Thus before STEP 4 is first executed fu(x) = /(x), and the execution of STEP 4, computes: Chapter 3: The SPORE-1 Regression Function 45 fu(x) = / ( x ) - £ a z •*,(•), (3.29) / = i and thereafter fu(x) is updated as follows: s fu(x)^fu(x)-^ar§l(-) (3.30) / = i Then, given Assumption A, and for sufficiently large sample sizes MT and Mv, for every level I > 1, the following is true: E\\(8i+ ,(*/, xkiJ-fu{x))2dD\ < E\\{gl{-)-fu{x))2dD\ (3.31) D D Proof: From Assumption A we know that the samples yq have mean f u(x). Further-more, because f(x) and 2/- iai ' £ / ( ' ) a r e bounded and continuous, then so is fu(x). The the proof of L E M M A C is identical to the proof of L E M M A B, with the exception that the approxi-mation errors are now integrated over the entire domain D and not simply summed over the learning set TL. Q.E.D. Next, using the above lemmas, we show that the SPORE-1 learning algorithm must converge to some finite approximation error. It is important to establish that SPORE-1 converges to some error function because this allows us to establish conditions under which the algorithm terminates. Algorithms which do not terminate are rarely useful in regression. T H E O R E M 3. Given Assumption A, and for sufficiently large sample sizes MT and Mv, then for every £ > 0 fee there exists a level I such that the following is true: EU(Rl + p(x)-Rt(x))2dD D < £ (3.32) where p > 1 is any positive integer (note that this theorem implies a monotonically converging sequence). Proof: Consider the construction of the function RL(x) during the first execution of STEP 2. Let Chapter 3: The SPORE-1 Regression Function 46 I Rt(x) = X <*P-8P(-) (3-33) p= i and / + 1 = 5>VM-> ( 3 - 3 4 ) P = \ Note that we have put a prime on a' to distinguish that they are different from ap . Therefore, /+1 / i? /+1(x)-/?,(*) = 2 <V M ' ) - I a, •*„(•) (3.35) Simplifying we get: */+i(*)-*/(x) = - S / + i ( - J - E ( a ' p - V - M - ) ( 3 3 6 ) where, in the limit of MT —> oo, JUp(-)-/(x))2^D a „ -> = -r-e- (3.37) 2 V I X(J(*,(•)-/(x))2dDj X % q = i D q = i and .2 , „ V ' « ' P "> ITT ; = TTT^ (3-38> ,2 1 q = i D q = i X (/(*,(•)-/(x))2dDJ £ 0 where, for all /? > 0 ®p = {\(gp(-)-Hx))dDJ (3.39) D Therefore, we can write oc' - ap as Chapter 3: The SPORE-1 Regression Function 47 a p - a p / + 1 / q = i q = i (1 ^ / / + I \ p - O 2*, \q = i J ^ = i J Y ' I*, I*, Simplifying, we get f V *\ -q> /+ l /+1 5>« because a' /+ I Therefore, we can simplify (3.36) as follows: substituting cc' - a p with -oc' / + jOC^ Ri+iW-R[(x) = a' / + 1 -g / + 1 ( - ) -a' / + 1 X < V M ' > P = I Substituting in Equation (3.33) we get: /?/+1(x)-/?z(x) = a' z + 1 -g^O -a^j / tyx) which simplifies to Rl+l(x)-Rt(x) = a'l+l(gl+l(-)-RL(x)) (3.40) (3.41) (3.42) (3.43) (3.44) (3.45) Finally, from L E M M A C, and assuming that > 0 for all / (note that if this assumption is false, then T H E O R E M 3 is tme by default due to L E M M A C), as / -» « , E [ O / + 1 - < D z ] - > 0 (3.46) therefore, we have an infinite sum of a constant, giving infinity, or /+ l ct <7 (3.47) Chapter 3: The SPORE-1 Regression Function 48 and as a result /+1 - - » 0 . (3.48) V X * , J q = l Therefore, as / —> °o, Equation (3.45) becomes: £ ( / ? z + 1 ( x ) - / e z ( x ) ) - > 0 . (3.49) Thus, T H E O R E M 3 is true during the first execution of execution of STEP 2, and because of the execution of STEP 2. This completes the proof of T H E O R E M 3. Q.E.D. T H E O R E M 3 establishes that the SPORE-1 Learning Algorithm converges to some function ^ ( x ) as / —> ~ . Thus, for any fixed sequence kQ, , the approximation error converges to the following: In addition, T H E O R E M 3 allows us to define theoretically optimal values for the learning parameters depth j, depth2, and e (see STEP 2 and STEP 5 of the SPORE-1 Learning Algorithm). First consider e, which defines a cut-off on the minimum allowable decrease in approximation error. From the proof of T H E O R E M 3, we know that this rate of decrease monotonically converges to zero as the number of levels grows. Hence, by stopping the algorithm when some e rate of decrease is reached, we know that any future expected decrease will not exceed this amount. Thus, in order to achieve the error given in (3.50), we must let e —> 0. In addition, the true rate of potential error reduction is measured only when p —> °° in (3.32). Therefore, the parameters depth 1 and depth2, which are related to p , should also be large in order to achieve (3.50). In the next theorem, we establish sufficient conditions under which approximation error is reduced when a new level is added to the regression function. In order to do this, we need to make a further assumption as follows. Assumption B: For simplicity, we define the following: continuity arguments in L E M M A C, the same argument can be made during each subsequent (3.50) D Chapter 3: The SPORE-1 Regression Function 49 E[\(gl+l(.)-fu(x))2dDJ D (3.51) y = E J ( g / + | ( . ) - / M ( x ) ) ( / ? / ( x ) - / ( x ) ) J Z ) (3.52) D Q = E J ( ^ ( x ) - / ( x ) ) 2 ^ D (3.53) D Given the above definitions, the following 3 conditions are assumed to be true. First, the approxi-mation error of i?;(x) is greater than the correlation between the errors due to the new level function g[+](-) and the approximation error due to R[(x). This can be stated as follows (fu(x) is defined in L E M M A C): In other words, this condition is satisfied when the errors due to g^ j (•) are sufficiently uncorre-cted with those due to Rpix). The second condition requires that the correlation between the errors due to the new level function gj + j (•) and the approximation error due to i?/(x) satisfy the following: Once again, this condition is satisfied when the errors due to g/ + j (•) are sufficiently uncorrelated with those due to i?/(x). The final condition which we need to meet is that the scaling factor oc ' / + 1 satisfies the following: where e = 1 -8 , and 5 e 31 « an arbitrarily small positive real number defining the amount of error reduction due to the addition of a new level to the regression function. If we let I —> °°, and thus a' [ + , —> 0 (see proof of THEOREM 3), this condition is also satisfied when the errors due to 8i+\(') a r e sufficiently uncorrelated with those due to i?/(x). Hence all 3 conditions require that the approximation errors of a new level are uncorrelated with those of previous levels. As discussed in Section 3.3.3, the SPORE-1 learning algorithm encourages this condition to occur through the use of bootstrap training samples. Q>x¥ (3.54) 0 > 2 V F - Q (3.55) ( Y - f l ) - 7( Y - fl)2 - (O - 2 Y + - e)Q ( 0 - 2 v F + fl) (3.56) Chapter 3: The SPORE-1 Regression Function 50 Using Assumption B, we now state the following theorem. T H E O R E M 4. Given Assumption A and Assumption B, and for sufficiently large sample sizes Mj and Mv, then for every I > 1 the following is true: J ( /? / + 1 (x)- / (x)) 2 JD]<e^rj ( /? / (x)- / (x)) 2 JD 'D D (3.57) where e is defined in Assumption A. Proof: From (3.44) we can write Rl+ j (x) as follows: * / + i ( x ) = <x' / + 1 • g z + 1 ( - ) - o c ' / + 1 / ? / ( x ) + /? ;(x) This can be written as: #, + i(x) = a ' / + 1 -gl+ ,(•) + (!-a ' [ + l )R t (x) (3.58) (3.59) In addition, we can write (R[+ t (x) - f(x)) as: ( / ? / + 1 ( x ) - / ( x ) ) = a ' / + ] ( g / + 1 ( - ) - / M ( x ) ) + ( l - a ' / + ) ) ( i ? / ( x ) - / ( x ) ) (3.60) and (R[+ j(x) - f(x)) becomes: (Rl+l(x)-f(x))2 = (a'l+l)2(gl+l(-)-fu(x))2 + 2 a ' / + 1 ( l - a ' / + 1 ) ( g / + 1 ( - ) - / M ( x ) ) ( / ? / ( x ) - / ( x ) ) + ( l - a ' / + 1 ) 2 ( / ? / ( x ) - / ( x ) ) 2 (3.61) Using the above identities and the definitions given in (3.51), (3.52) and (3.53), rewriting (3.57) we get: ( a ' / + 1 ) 2 0 + 2 a ' / + 1 ( l - a ' / + 1 ) ^ + ( l - a ' / + 1 ) 2 Q < e Q (3.62) Solving for oc'z + j , we get a' /+1 ^ - ( Y - f l ) - V ( x P - Q ) 2 - ( Q - 2 t P + n ) ( l - e ) Q ( 0 - 2 x F + Q) (3.63) The above is true given Assumption B is satisfied. This completes the proof of T H E O R E M 4. Q.E.D. In the last two theorems, we have established conditions under which the SPORE-1 learning algorithm converges. Next, we quantify the rate of convergence of this algorithm. More Chapter 3: The SPORE-1 Regression Function 51 precisely, we measure how the approximation error decreases as the number of sample points used during the construction process increases. As discussed, in Section 2.3 of Chapter 2, this type of rate of convergence analysis is an important indication of an algorithms usefulness in high dimensional regression. In the following lemma and theorem, we establish the rate of conver-gence, to some finite approximation error given by (3.50), of the SPORE-1 Learning Algorithm as independent of the dimension of the function being approximated. L E M M A D. Let the optimal g;(-) function be defined as °ptg[(u,v) = " % • „ V (3.64) i,je {0, 1,2, ...} where the coefficients {°P'aij} are chosen using an infinite number of training sample points as follows: 2 1 r Opt -. { au} = argmin , J l aijS jW)-/«(*))dD ~D (3.65) and fu(x) is the residual error function defined in LEMMA C. Similarly, let the approximated g/( •) function be defined as i + j<K g[(u, v) = ^ ay • w V (3.66) i,je {0, 1,2, ...} where the coefficients {a^} ctre chosen using a finite number of training sample points as follows: {au} = argmin { a ; . } X ( g / K v)-yqY •r? (3.67) Given Assumption A, and for sufficiently large sample size, the rate of convergence of the function gl(-) to the optimal function °p'gl(u, v), as a function of the number of sample points Mj, is independent of dimension N of the target function f(x) (we assume N>2). More precisely, there exists a constant K, independent of N, such that the rate of convergence is given by: Chapter 3: The SPORE-1 Regression Function 52 E\\(gl(-)-[op'g [(•)])-dD <K (3.68) where, T ^ is the number of terms in the polynomial g/(-). Proof: We first prove this theorem for the first level / = 1. The coefficients of °ptg[(u, v) are given, using an infinite sample size, by the following: { a-} = argmin , J ( ° p ' g l ( x v xki)-f(x))2dD D (3.69) Similarly, the coefficients for g ;(-) are given using a finite samples size MT, as follows: ,2 {atj} = argmin{ } X ^ i < * * « , ' xk)-yqy (3.70) Thus, given that gt(-) converges to op'gi(-) as MT —> °° (this is true because gt(-) is continuous and so is / ( x ) ) , then we can view g^(-) as being an approximation of °P'gi(-). Given this view, the coefficients {aif} are chosen based on sample outputs yq which are generated using the following density function: \(°ptg,(xkjXk)-f(x))dV hgl(y\(xkii, xk[)) = D' jdD' (3.71) D' where f(x) is defined in (3.25) of Assumption A, and D' is the domain D given that (xk, xk) is fixed (hence D' is N-2 dimensional space, spanning the N-2 dimensions of D which do not include the axes (xk, xk)). Within this context, and by Assumption A, hg(<y\(xk^, xk)) must be continuous with finite variance, and an equivalent representation for °ptg\{xkn, *kx) is: opt g\(xk{)>xk) = E[y\(xko, xk{)] = j yhgi(y\(xko, xki))tp(dy) (3.72) Thus, we view the sample outputs y as being generated as follows: opt , . , x yq= sMK,xk) + z(xkn,xk) (3.73) where the density function of the error term z(xk ,xk) is continuous, and has finite variance Chapter 3: The SPORE-J Regression Function 53 2 2 Ge(xko,-xk) dependent on (xk^,xk). Note that the variance GE(xk , xk ) is independent of dimension N because it is calculated using the distribution given in (3.71), which integrates and 2 2 2 divides out N-2 dimensions of f(\). Now, using the equality G (y) = E(y ) - (E(y)) , we can write E[(gl(-)-opt8l(-))2] = a2[gl(-)-optgl(-)] + (E[gl(-)-optgi(-)])2 (3.74) From the Central Limit Theorem we know that E[gx(-)-°P'gx (•)] - 0 for each point (xk, xk). In addition, because op'gl(xk ,xk) is constant for each fixed point (xk ,xk),we can write (3.74) 2 2 as follows (note that we are using the equality a (y + c) = G (y) for c constant): £ [ ( g , ( - ) - % , ( - ) ) 2 ] = o - 2 ^ - ) - 0 ' ^ - ) ] = o-2(g,(-)) (3.75) Next, if we symbolize the terms of g j (•) as 8i(-) = I 9* (3-76) k = l then, using the identity 0" 2(^"_ x c-9.) = ^ ^ ] c-c;a(9-, 0^), where G(0-, 0 ; ) is the covariance between 0, and 0,, we obtain £[(*,(•)-"'WO) 1 = * I 9* =11 °( e ;> e , ) (3.77) Vjt = 1 / Now, for all terms / = j = k where k e { 1, T ^} , the covariance is given by the Central 2 Limit Theorem as (i.e. variance of a distribution decreases as o (y)/n for n samples, if the 2 original distribution had variance G (y)): 2, . 9 o~P {xk , xk ) G(Qk, Qk) = G2(Qk) = ^ 1 (3.78) 2 where ^ ( x ^ , x^) is the variance of the noise term defined in (3.73). Similarly, for all terms i * j the covariance is: Gu{xk , xk ) 0 ( e e) = l} *• ( 3 .79) 2 2 where o\-,(x,,, xb ) is some initial covariance between 0,- and 0 ,• (note that o\,(x,, , xk ) measures Chapter 3: The SPORE-1 Regression Function 54 the covariance between terms of the 2 dimensional polynomial #](•), and is therefore indepen-2 2 dent of N). Thus, if we let GMax(xk ,xk) be the maximum covariance of all of a ij(xk , xk ) and 2 Ge(xk , xk ) , we can write (3.77) as: TSi() TSiV) Therefore, integrating ever D , we get ^Max(xk{}' xk^ (3.80) D r°Pl < \vMax(Xkn>Xk)dD) = K D ' Si(-) Mr (3.81) C 2 where = CT^^^CJC^, xk^)dD, is independent of Af. Thus, this proves L E M M A D for the function gj(-) • The proof for all other levels / > 1 follows the same logic and is not given here. This completes the proof of L E M M A D. Q.E.D. In L E M M A D we proved that the rate of convergence of each function gt(-) is indepen-dent of dimension N of the target function. In the following theorem, we extend this to show that the rate of convergence of -fy(x) is also independent of N. T H E O R E M 5. Let °ptRL(x) = Yu- \ 0 / " a / " °P'gi(') be the optimal constructed cascade (to level L , ) after the first time that STEP 2 of the SPORE-] learning algorithm is executed; thus °ptRL^(x) is constructed using an infinite number of samples points, and °ptOLl is defined as: opt _ _D al ~ L, jCP'gl(-)-f(x))2dD (3.82) = 1 7) ,-1 where °P'gl(-) is defined in L E M M A D. Similarly, let R^^x) = Y^- \ ^i(') be the constructed cascade (using the usual finite number of samples points) after the first time that STEP 2 of the SPORE-1 learning algorithm is executed. Then, given Assumption A and for sufficiently large sample sizes MT, the rate of convergence, as a function of the number of sample points MT, of the function RL](x) to the optimal function °p,RL(Kx), is independent of dimension N of the target function f(x), (we assume N > 2). More precisely, there exists a constant K, independent of N, such that the rate of convergence is given by: Chapter 3: The SPORE-J Regression Function 55 j(RLi(x)-[op'RLi(x)))2dD D <K (3.83) where, T g ^ is the number of terms in the polynomial g/(-) • 2 2 2 Proof: Using the identity a (y) = E(y )-(E(y)) , we write E[(RL(x) - [op'RL(x)])-] = G*(RL(x) - [""X (x)]) + (E(RL(x) - [""7?, (x)])M3.84) opt .opt. From the Central Limit Theorem we know that E(RL (x) - [°ptRL ( X ) ] ) = 0 for each point x . In addition, because °ptRL (X) is constant for each fixed point x , we can write (3.74) as follows 2 2 (note that we are using the equality a (y + c) = a (y) for c constant): E[(RLi(x)-[optRLi(x)]y] = a(RLi(x)-[u'"RLi(x)]) = c'(RLi(x)) Next, given that RL[(x) = , oil • gt(-), we can write the above as: V = i ) i = i y = i op / , £[(/?,(x)-[•""/?, (x)]) ] = a (3.85) (3.86) For each point x , let pMax(x) be the maximum covariance over all o(g{, gj). Then, given that j oc, = 1 by definition, we can write the above as: E[(RL(x)-[optRL(x)])-]< ( L | Y L | ^ v,- =i A,- = i ) (3.87) From L E M M A D, we know that for a(g-, g () = a (#•), the variance has the form: a 2(g ()<(r g ; ( ,) 2 Mn (3.88) Thus, given that the functions gt(-) are constructed one at a time using a least squared error fit, all other covariances 0"(g(, gf) must have the same form given in (3.88). Therefore, substituting pMax(x) with the maximum c(gt, gj), (3.87) becomes: E[(RL(x)-[optRL(x)])-]< f(T YA MT Max (0 (3.89) Chapter 3: The SPORE-] Regression Function 56 Therefore, integrating ever D, we get E\\(RL(x)-[optRLi(.x)])2dD (3.90) 'D J where K = jDoMax(-)dD, is independent of N (see proof of L E M M A D). This completes the proof of T H E O R E M 5. Q.E.D. Therefore, T H E O R E M 5 shows that the rate of convergence of the error function, to the optimal error function is independent of dimension of the function being approximated. This result is important because it establishes that, for the SPORE-1 Learning Algorithm, the rate at which the constructed cascade approaches the best regression function is independent of the dimension of the function being approximated. Thus the number of sample points required to build an optimal SPORE-1 regression function, does not explode with dimension. 3.3.2 Complexity Results An important property of a learning algorithm for the construction of high dimensional regression functions, is how its complexity grows with the number of input variables. Given that the number of parameters of each function g^-) is fixed, the computational cost of adding a level / is the same for all levels. In the following, we symbolize the computational complexity of adding a new level to the regression function, as C . The cost C is dominated by the computations required to determine the polynomial coefficients of g [ (-) (see STEP 2 of the SPORE-1 Learning Algorithm). In the following theorem, we calculate the computational cost of the SPORE-1 learning algorithm. T H E O R E M 6. For any fixed sequence kQ, ...,kL of indices defining the order of indepen-dent variables (see STEP 1 of the SPORE-1 Learning Algorithm), and any e > 0, the computa-tional cost to achieve the following convergence D is u / V C , where N is the dimension of f(x), C is the complexity of adding a level to the cascade, and p > 0 ( [ 1 6 3i) is independent of dimension N. Proof: From T H E O R E M 3, we know that a level / satisfying the above condition must (3.91) Chapter 3: The SPORE-1 Regression Function 57 exist. Furthermore, due to T H E O R E M 3, there must be a sequence of positive real numbers e0, . . . , £ k such that £• ^ e, + j and ek < e. Thus k is the number of times one must cycle through the N independent variables before the desired convergence is achieved. Note that k depends on how the function f(x) projects, at each specific level, onto the 2 dimensional functions £/(•)• Thus k depends on the distributions hg(y\-) (see L E M M A D), and not on the dimension of f(x). Therefore we complete the proof of T H E O R E M 6 by choosing p > k. Q.E.D. Thus, T H E O R E M 6 shows that the computational cost of building a SPORE-1 regression function is linear with respect to the number of predictor variables. Next consider the effect of the number of sample points MT used in the construction of the regression function, on computational complexity. The main effect of sample size is on the construction of each level of the regression function. Thus, the cost C of adding a single level is some function of the number of sample points: i.e. C(MT). The actual relationship will depend on how the coefficients of gt{-) are calculated. In the SPORE-1 algorithm the preferred method is Singular Value Decomposition (SVD), and therefore C{MT) is the cost of doing 1 SVD calcula-tion (see [Golub and Van Loan, 1993] for details). 3.3.3 Internal Variance Reduction As outlined in Section 2.2 of Chapter 2, variance reduction is fundamental to constructing regression functions which have low approximation errors. One of the more effective means of variance reduction is to use many different regression functions to create a single prediction using a weighted average of all their outputs. The resulting regression function is referred to as an ensemble regression function. Thus, given P different N dimensional regression functions symbolized by /,-(x), the ensemble regression function f(x) is given by: where the P ( are some positive scaling coefficients. Perrone [Perrone, 1993] has shown that D (3.92) p f{x) = Y P,-£(x) (3.93) i = 1 Chapter 3: The SPORE-1 Regression Function 58 optimal variance reduction is achieved when the (3- are calculated as follows: P< = (3-94) L k L j L k j where C- ; is the correlation between the /,(x) and fj(x) regression functions. Furthermore, if residual errors of any two different regression functions /,-(x) and //(x) (i.e. / ^ j) are uncorre-lated, (3.94) becomes: X J a (/*-/)] This calculation of (3,- is equivalent to the calculation of the scaling factors CL1 are computed in STEP 2 of the SPORE-1 learning algorithm. Thus, the SPORE-1 regression function will exhibit optimal variance reduction if all gt(-) have residual approximation errors which are uncorrelated. Although the learning algorithm does not guarantee that the approximation errors of all of the functions g^-) are uncorrelated, this condition is implicitly encouraged through use of different bootstrap samples and predictor variable inputs for the construction of each level. In T H E O R E M 4, we show that error reduction (or equivalently variance reduction) can occur even if two functions g-(-) and. #•(•) have some correlated residual errors, as long as the conditions in Assumption B are satisfied. Therefore, SPORE-1 exhibits variance reduction which is internal to its structure. It does not rely on a large ensemble of different high dimensional regression functions, using instead a weighted average of 2 dimensional functions within it's structure. We refer to this as the internal variance reduction property of the SPORE-1 learning algorithm. In Chapter 4, we experimentally verify that this internal variance reduction property is effective in significantly reducing approximation errors. 3.3.4 Virtual Space Partitioning Consider the pathological situation where { a..} = argmin- , J i aijt 'jCPl8i(-)-f(x))2dD D = {0, ...,0} (3.96) Chapter 3: The SPORE-1 Regression Function 59 Given this condition, we cannot expect reduction in approximation error to occur because the optimal solution is gt(-) = 0. In Chapter 5 this condition is avoided by partitioning the space D into a finite number of subdomains where this cannot happen (note that in Chapter 5 it is shown that these subdomains must exist). However, even without space partitioning, this pathological condition is not likely because the actual calculation of the coefficients of #,(•) is done using a bootstrap sample of the training data. Due to replacement during sampling, on average each bootstrap sample contains roughly 63% of the training data [Efron, 1983]. Therefore, a bootstrap sample is essentially a random partitioning of the input space. We term this phenomenon virtual space partitioning. In Chapter 4, we experimentally verify, that virtual space partitioning, causes the SPORE-1 regression function to converge to zero approximation error, even when condition (3.96) is valid. We verify this by applying SPORE-1 to the 10 bit parity problem. As outlined in Section 4.5 of Chapter 4, a major deficiency of most learning algorithms which build a regression function using lower dimensional functional units, is that they cannot represent parity functions that are of higher dimension than their largest dimensional function unit. In being able to approximate the 10 bit parity problem arbitrarily well using only 2 dimensional functional units, the virtual space partitioning property of the SPORE-1 algorithm serves to overcome the deficiencies present in other dimensionality reducing regression functions. 3.4 Summary This chapter constitutes our first attempt at a theoretical verification of our thesis presented in Section 1.3 of Chapter 1. First, we have shown that the SPORE-1 structure is a universal approximator. Second, we have shown that under appropriate conditions, the rate of convergence of the SPORE-1 learning algorithm to some finite approximation error, is indepen-dent of the dimension of the target function. Thirdly, we have given sufficient conditions under which the approximation error converges to zero. And finally, we have shown that the complexity of constructing the SPORE-1 regression function is at most linear with the dimension of the target function. Chapter 4: Experimental Results In this chapter, the SPORE-1 learning algorithm is experimentally evaluated. In Section 4.1, the implementation details of SPORE-1 are described. In Section 4.2, the algorithm is evaluated on ten well known regression problems. In Section 4.3, the algorithm is evaluated on very high dimensional regression data. In Section 4.4, SPORE-1 is applied to 3 very high dimensional human-to-robot skill transfer problems, In Section 4.5, SPORE-1 is applied to the 10-bit parity problem. Finally, Section 4.6 concludes the chapter with a summary of experimental results. 4.1 Implementation Details of the SPORE-1 Regression Function In this section we give the implementation details of the SPORE-1 algorithm described in Section 3.2 of Chapter 3. In all of the experiments described in this chapter, the same version of SPORE-1 is used. Thus, the same class of two dimensional functions g [ (-) are used, with identi-cal settings for the learning parameters depth 1 , depth2 , and e. Even though the regression data presented in this chapter vary widely in dimension and number of learning samples, no fine tuning of the SPORE-1 algorithm was required. The SPORE-1 algorithm is completely automated, requiring no manual intervention. The functions g^-) are modeled as third order 2 dimensional polynomials as follows: g[(u, v) = a Q 0 + a 0 ) v + a ] 0 M + fl|jMV + a 0 2 v 2 + (4.1) 2 2 2 3 3 CI2QU + al2uv + a 2 i u v + aQ3v + a3Qu As indicated in Chapter 3, this choice for gt(-) is not unique. In T H E O R E M 2 of Chapter 3 (Section 3.1.2) we showed that g ;(-) must belong to the class of 2 dimensional polynomials of order greater than one. Our original reason for not choosing second order polynomials was that third order polynomials are the smallest order polynomials which incorporate asymmetric terms 2 (i.e. terms such as uv which we assumed would allow for faster convergence) which second order polynomials do not. However, preliminary experiments (these results are not reported here) 60 Chapter 4: Experimental Results 61 with second order polynomials suggest that they produce approximations which are comparable (with respect to mean squared error) to those produced by third order polynomials. In addition, second order polynomials often produced regression functions which were smaller than those produced by third order polynomials (however, this may not prove true if we exclude coefficients which are insignificantly small; we have not attempted such an analysis here). Our original reason for not choosing higher order polynomials is due to our goal of testing the SPORE methodology under the simplest implementation: i.e. very simply polynomial building blocks (only 10 coeffi-cients). As with second order polynomials, preliminary experiments (these are not reported here) indicated that fourth, fifth and sixth order polynomials produce regression functions which are comparable (with respect to mean squared error) with those produced with third order polynomi-als. The main difference being that higher order polynomials produce regression functions which are larger (once again, this may not prove true if we exclude coefficients which are insignificantly small). Thus, even from a practical perspective, it appears that the choice of g^(-) may be arbitrary, as long as they belong to the class of 2 dimensional polynomials of order greater than one. The three learning parameters were assigned the following values: depthj = 25, depth2 = 6, and e = 0.0001 (see Section 3.2 of Chapter 3). These values were consistently found to be sufficiently close to the theoretically motivated values given in Section 3.3.1 of Chapter 3; i.e. theory suggests that large positive values should be assigned to depthj and depth2 , while e > 0 should be made as small as possible (see S T E P 2 and S T E P 5 of the SPORE-1 Learning Algorithm in Section 3.2 of Chapter 3). 4.2 Evaluating SPORE-1 on Standard Regression Problems In this section we evaluate the efficacy of the SPORE-1 algorithm by applying it to 10 well known regression problems from the literature (See Table 4.1) [Grudic and Lawrence, 1997]. These problems were chosen because they have been extensively studied with many different learning algorithms being applied to them. A l l of these regression examples are comparatively small, ranging in dimension from 4 input variables to 16, with learning sample sizes ranging from 80 to 15,000. Although these regression problems are not the type of large, very high dimensional problems that SPORE-1 was designed for (i.e. more than 100 input variables with about 40,000 learning examples), they are among the largest regression problems available in the literature. In Chapter 4: Experimental Results 62 addition, they allow us to directly compare the performance of SPORE-1 to many different learning algorithms. Our comparisons are based on 3 different publications: Breiman [Breiman, 1996a], Rasmussen [Rasmussen, 1996], and Jordan and Jacobs [Jordan and Jacobs, 1994]. In the follow-ing, we give a brief description of the regression data used in each of these publications. 4.2.1 Regression Data Used by Breiman Breiman studied 5 different regression problems in [Breiman, 1996a]. His goal was to determine how Bagging (see Section 2.2 of Chapter 2) affects the approximation error of CART (see Section 2.1.2 of Chapter 2). Since Bagging is a method of improving predictor accuracy through variance reduction, comparing SPORE-1 to the Bagged CART algorithm allows us to directly study the efficacy of the internal variance reduction property of SPORE-1 (see Section 3.3.3 of Chapter 3). Of the 5 regression data sets studied in [Breiman, 1996a], we used the Boston Housing, Friedman #1, Friedman #2, and Friedman #3 data. A brief description of these data sets is given below. • Boston Hosing Data: This data has 12 predictor variables, which constitute various so-cio-economic indicators in the Boston area, and 1 output variable which is the median housing price in the corresponding area. There are a total of 506 examples in this data set, 51 of which are randomly selected to be used as a test set. • Friedman #1: This is a simulated data set which appeared in [Friedman, 1991]. It has 10 independent predictor variables which are uniformly distributed over 0 < xl< 1, for all i = 1, ..., 10. The output is defined by the following equation: y = 10sin(7UXjjc2)+ 20 (x 3 -0 .5 ) 2 + \0x4 + 5x5 + z (4.2) The noise term e is Normally distributed with mean zero and a variance of 1. Half of the 10 prediction variables (x6, x 1 0 ) do not affect the output y , so this data is a good test of the ability of the SPORE-1 learning algorithm to ignore variables which do not contrib-ute to reduced approximation error. This data has a total of 200 learning examples and 1000 test examples, all randomly generated using (4.2). Chapter 4: Experimental Results 63 • Friedman #2 and Friedman #3: This is simulated data having 4 inputs and 1 output. The data was used in [Friedman, 1991] and simulates the impedance and phase shift in an alternating current circuit. The Friedman #2 is generated using the following: 2 2 1 / 2 y2 = (*! + (x2x3-(l/(x2x4))) ) +e 2 • (4.3) The Friedman #3 is generated using the following: fx2x3-(\/(x2x4))\ y3 = a t a n J - = - ^ ^ - ^ J + e 3 (4.4) The variables x ] ; x2, x3, x4 are uniformly distributed as follows: 0 < X ! < 1 0 0 20<fJ2-l<280 \2K) (4.5) 0 < x3 < 1 1 <x4< 11 2 The noise terms e 2 and e 3 are Normally distributed with mean zero and a variance of o"2 2 and a 3 respectively. The variance is chosen so that the signal to noise ratios are 3 to 1; i.e. 2 2 2 2 2 the ratios of variances are chosen to be (<y0/o\, ) = (o%/o\, ) = 1/3 , where o\, and z y2' v o yy y2 2 a are the variance of the outputs of (4.3) and (4.4) respectively. The learning examples for these data sets had 200 randomly generated samples, while the test data had 1000 randomly generated examples. In Table 4.1 we compare the performance of SPORE-1 on these data sets to the Bagged CART algorithm. We chose these data sets because the manner in which they were used was fully specified, allowing us to directly compare prediction error results (full specification of the use of the fifth data set, Ozone, was not available in [Breiman, 1996a]). 4.2.2 Regression Data Used by Rasmussen Rasmussen [Rasmussen, 1996] studied five data sets from real-world regression examples. These included Auto Price data, CPU data, Boston Housing data, M P G data, and Servo Data. We give a brief description of these data sets below. Chapter 4: Experimental Results 64 • Auto Price Data: This data set relates the list price of a 1985 automobile to various at-tributes such as it's size, weight and engine capacity. The data set has 10 real inputs and 1 output. There are 80 learning examples and 79 test examples. • CPU Data: This data set relates CPU performance to various attributes such as maxi-mum and minimum memory size. The data set has 6 real inputs and 1 output. There are 104 learning examples and 105 test examples. • Boston Housing Data: This data set is a variation of the Boston Housing data de-scribed in the previous section. The data set has 12 real inputs, 1 binary input, and 1 out-put. There are 80 learning examples and 79 test examples. • M P G Data: This data relates automobile fuel consumption to such attributes as car weight, model year, and horsepower. The data set has 3 real inputs, 6 binary inputs, and 1 output. There are 256 learning examples and 250 test examples. • Servo Data: This data relates rise time of a servomechanism to gain settings and choice of mechanical linkages. The data set has 2 real inputs, 10 binary inputs, and 1 output. There are 80 learning examples and 79 test examples. Rasmussen applied 5 different learning algorithms to this data: Monte Carlo, Gaussian Evidence, Backpropagation, M A R S , and Gaussian Process. In his paper, Rasmussen was testing the efficacy of these algorithms as the number of learning data is increased, hence he tested them on various sizes of learning data. Since we are not concerned with data-limited regression in this thesis, we limited our comparisons to those results obtained on the largest learning data sizes. In Table 4.1, SPORE-1 is always compared to the algorithm which achieved the lowest errors on the corresponding data (this is symbolized by the brackets around the published error results). This gives a direct comparison of SPORE-1 with the best algorithm applied to this regression data. 4.2.3 Regression Data Used by Jordan and Jacobs Finally, we compared SPORE-1 to the forward dynamics data studied by Jordan and Jacobs [Jordan and Jacobs, 1994]. This data was generated by a simulation of a 4 joint robot arm moving in 3 dimensional space. The goal is to predict the acceleration of the 4 joints, given as input the position, velocity, and applied torque of each joint (i.e. a total of 12 predictor variables). There are 15,000 training examples and 5,000 validation examples. This data is highly nonlinear and is a good test of a learning algorithm's ability to predict nonlinear phenomena. Jordan and Chapter 4: Experimental Results 65 Jacobs applied 7 learning algorithms to this data: linear, Backpropagation, two versions of the Hierarchical Mixtures of Experts (HME) algorithm, two versions of CART, and the M A R S algorithm. In Table 4.1 we compare SPORE-1 to the Backporpagation algorithm, which achieved the lowest approximation error of algorithms studied by Jordan and Jacobs. 4.2.4 Experimental Results Next, we detail our experimental results on the above regression data. A l l experimental results reported in Table 4.1 are obtained under the exact experimental conditions reported in the literature. This allows a direct comparison of the performance of SPORE-1 to other algorithms. A l l learning times reported in this section are for SPORE-1 programmed in C and running on a Pentium Pro 150 using L I N U X . The data sets studied by Breiman [Breiman, 1996a] and Rasmussen [Rasmussen, 1996] have relatively few training examples, and therefore we constructed the SPORE-1 approximations using 10 fold cross-validation (10 fold cross-validation is one effective method of getting good approximations using a small number of data points). The 10 fold cross-validation procedure used is the following: the learning data set was divided into 10 approximately equally distributed sets, and then 10 regression functions were constructed using, in turn, 9 of these sets as training sets, and the remaining set as a validation set (see Section 3.2.1 of Chapter 3 for definitions of learning, training and validation data). For each test data point {test data was not used during learning), the outputs of the 10 regression functions were averaged to produce the final approxi-mation output, for which error results are reported. To test reproducibility, 100 independent approximations (independent with respect to random sequences of input variables and bootstrap samples as defined in Section 3.2 of Chapter 3) were generated using the Learning Data: Table 4.1 reports the best test set error, along with the average and standard deviation (s.d.) over the 100 independent runs. For the data sets studied by Breiman, we followed the same experimental procedure used by Breiman [Breiman, 1996a]. Briefly, this procedure is summarized as follows. For each of the 4 data sets, 100 trials were set up using 100 learning and 100 test sets created randomly according to the guidelines outlined in Section 4.2.1 and in [Breiman, 1996a]. For each of the 100 learning sets a regression function was constructed (using 10 fold cross-validation as described above) and evaluated on the corresponding test set. Experimental results for the 100 experiments are given in Chapter 4: Experimental Results 66 Table 4.1. The previously published error for the Breiman [Breiman, 1996a] data refers to the average bagged error reported. From Table 4.1 it is evident that the internal variance reduction property of SPORE-1 is at least as effective as Bagged regression trees in reducing approximation error (see Section 3.3.3 of Chapter 3). Note that the large standard deviation (s.d.) in the errors is due to the fact that each of the 100 learning and training sets were different. For all 4 data sets, the average SPORE-1 learning time ranged from about 1 to 10 minutes per approximation (this is the time required to generate one 10-fold approximation, or equivalently, ten SPORE-1 regression functions). The average size of a single cascade was about 90 K-bytes (i.e. about 8500 model parameters). For the data sets studied by Rasmussen [Rasmussen, 1996], we report results for the largest learning sets only (we used 80 learning examples of auto price data, 104 of cpu data, 256 of housing data, 192 of mpg data, and 88 of servo data as outlined in Section 4.2.2). For each of these 5 fixed data sets, the SPORE-1 approximation was constructed using 10 fold cross valida-tion as described above. The previously published error given in Table 4.1 under the Rasmussen [Rasmussen, 1996] data sets is the best reported error (indicated by brackets) of the 5 algorithms evaluated, and was obtained from the graphs presented in the paper. We generated 100 trials (independent approximations) for each data set. Because all learning and test data were the same in each trial, this was done to determine what effect the stochastic aspect of the SPORE-1 algorithm has on its performance. As defined in Section 3.2 of Chapter 3, the ordering of the independent variables is random, and the construction of the 2 dimensional functions gt(-) is done using a (random) bootstrap sample of the training data. The best and average relative mean squared test set error, and its standard deviation, are reported in Table 4.1. From Table 4.1, it is evident that although there is some variation in error performance from run to run, the stochastic effect of the algorithm is mostly negligible. As with the Breiman data, the average SPORE-1 learning time ranged from about 1 to 10 minutes per approximation. The average size of a single cascade was about 90 K-bytes (i.e. about 8500 model parameters). Finally, for the forward dynamics data studied by Jordon and Jacobs [Jordan and Jacobs, 1994], our evaluation was done using the experimental setup described in their paper for the on-line back-propagation algorithm: learning was done using 15,000 training examples, and learning stopped when the error could no longer be reduced on 5000 validation examples. The SPORE-1 approximation for this data consisted of a single cascade per output: due to the large Chapter 4: Experimental Results Table 4.1 Performance of SPORE-1 of Published Data 67 Published Source Data Pub. Lrror SPORE-1 Error (100/10 Runs) best a\e. s.d. Ave. Size of Cascade Breiman. IWda Housing 11.6 1.95 9.4 8.4 90 K-byte (8500 model parameters) Friedman 1 6.1 1.77 2.88 0.43 Friedman 2 22,100 16,733 19,603 1,413 Friedman 3 0.0242 0.014 0.0189 0.0024 Rasmussen, 1996 Auto Price (0.29) 0.21 0.29 0.03 CPU (0.17) 0.086 0.14 0.025 House (0.15) 0.128 0.145 0.007 MPG (0.10) 0.008 0.1 0.004 Servo (0.15) 0.198 0.25 0.02 .lord.in and Jacobs. |<W For. Dyn. (0.09) 0.0378 0.041 0.009 1 M-byte (94,000 parameters) data size, 10 fold cross validation was not necessary. In order to estimate the reproducibility of SPORE-1 on this data, we constructed 10 independent approximations. The best and average relative error on the validation set (in accordance with [Jordan and Jacobs, 1994]), and the standard deviation over these 10 independent experiments (time constraints did not allow us to do 100 experiments), is reported in Table 4.1. The previously published error, shown in Table 4.1, is the best relative error (on the validation data) of the 7 algorithms studied in [Jordan and Jacobs, 1994]. The forward dynamics data consists of 12 inputs and 4 outputs, which requires 4 SPORE-1 cascades (1 for each output). The algorithm required about 10 hours of computation to build all 4 cascades, and the average size of each cascade was about 1 M-byte (i.e. about 94,000 model parameters). From Table 4.1 we conclude that the SPORE-1 errors on all 10 runs are consistently low, thus the stochastic effect of the learning algorithm had negligible effect on approximation error. From Table 4.1, it is evident that the SPORE-1 algorithm demonstrated as good or better error results on all but 1 (the servo data) of the data sets. However, this result should be interpreted with caution. The specific goal of SPORE-1 is to build a regression function which best represents Chapter 4: Experimental Results 68 the learning data in the mean squared error sense. The SPORE-1 learning algorithm continues to add levels until the mean squared error can no longer be reduced: this is beneficial if one wants the "best" approximation, but detrimental if one wants a representation of fixed size. Most algorithms referred to in Table 4.1 are parametric and therefore of fixed size. Two exceptions are M A R S and CART. However, even comparing these to SPORE-1 should be done with caution: both M A R S and CART can be used to determine the significance of various inputs and how they interact with one another. Currently SPORE-1 has no such data interpretation features. Nevertheless, we cannot ignore the fact that SPORE-1 consistently produced regression functions which had low prediction errors. Thus, even though SPORE-1 was not designed for small, low dimensional problems, it should be considered a strong candidate for such regression problems if prediction error is important. 4.3 Evaluating SPORE-1 on Synthetic High Dimensional Problems In this section, our goal is to study the SPORE-1 algorithm when applied to very high dimensional regression data, under various noise conditions [Grudic and Lawrence, 1997]. A l l learning times reported in this section are for SPORE-1 running on a Pentium Pro 150 using LINUX. Since no large, high dimensional regression data examples were found in the literature, we applied SPORE-1 to fifteen artificial data sets ranging from 100 to 1,600 input dimensions. For five of these data sets no noise was present, while the remaining ten data sets had noise. For the first five of these "noisy" data sets, noise was present only in the output, while the remaining five data sets had both input and output noise. Input noise was generated by simply allowing only half of all inputs to contribute to the output (thus half of the inputs used in constructing the approxima-tion had no effect on output). This allows us to determine how effectively the SPORE-1 learning algorithm is able to account for variables which have no predictive power. Output noise was generated using a normal distribution with the standard deviation selected to give a signal to noise ratio of 3 to 1, thus implying that the true underlying function accounts for 90% of the variance in learning and test data. Specifically, we generated the regres-sion data as follows: y = f(xx, ...,xN) + e (4.6) Chapter 4: Experimental Results 69 where / ( x , , xN) is the target function, and e is a Gaussian noise term with mean zero and 2 2 2 2 2 2 variance c>e . Under output noise conditions, we chose 0"E such that (o~y/cTe) = (3) , where Gj-is the variance of f{xx, xN). Thus, because e and f{xx, xN) are independent, we know that 2 2 0"P 0\> 1 1 — = — = - —— = 0.1 (4.7) 2 2 2 , 2 . 2 , , 9 + 1 v y 2 where <jy is the variance of the output y. Therefore, given that relative error is defined as ^ £ (4.8) where / is the constructed regression function, theoretically the best relative error achievable by any approximation is 0.1 . The target function / ( x j , xN) is generated as follows: f(xlt...,xN) = — (4.9) 1 + c o s ( A 7 z r = i c ° s ( ^ ( x ' ) ) where N is the input dimension of the data, while r^xA and 5((x-) are continuous 1 dimensional functions which are nonlinear and all different from one another. Equation (4.9) allows us to generate highly nonlinear data of any dimension, while at the same time controlling the complex-ity of the data via the appropriate selection of functions r^x^ and ^ (x , ) . By data complexity, we are referring to the number of random sample points required to build sufficient non-parametric models of the data, with respect to mean squared error. The more complex the generating function, the more data points are required for non-parametric modeling. For the purposes of this thesis, we have chosen functions /*,(x-) and ^,(x-) such that 40,000 random sample points are sufficient in order to effectively model the data. Our choice of functions r ; (x ; ) and .s'-(x-) was done as follows. The functions r-(x-) and ^-(x-) are modeled using linear splines. The spline knot locations were randomly chosen using a uniform random distribution. The number of knots was the same for each function. In addition, each of the predic-tor variables x- were modeled as 2 dimensional parametric functions of the form Chapter 4: Experimental Results 70 x;- = i|/i(cOj, co 2), where the functions ^(co, , co2) are different for each predictor variable x;-. The parametric functions V|/-(co,, co2) are modeled as 2 dimensional linear splines, with knot positions randomly chosen using a uniform random number generator (the number of knots is the same in each of the functions Vj/((tO], co 2), but their locations are all different). Hence, the complexity of the target function f{x{, xN) in (4.9) is directly controlled by the number of knots in functions r-(jc-), s^xA and \|/((co,, co 2), and increasing the number of knots increases the number of points required to achieve effective approximations of the target function. One should note that, even though the target function is essentially 2 dimensional in the parameter space (C0j, co 2), this information is not used by SPORE-1; i.e. regression data is provided in the form of N dimensional predictor variables x-, and one output variable y . Therefore a learning algorithm sees the resulting regression problem as N dimensional. In order to pick a target function complexity such that 40,000 sample points are required in order to generate an effective regression model, we followed the following procedure. The number of knots used to represent the functions /"-(x-), 5-(x-) and \j/(-(c0j, co 2), was increased until a nearest neighbor approximation (implemented using a simple AT-Nearest Neighbor algorithm with K = 1 ; see Chapter 2 for details), in the 2 dimensional parameter space (GL>J, (0 2) exceeded a relative mean squared error of 0.01 (relative mean squared error is defined, in the usual sense, as the mean squared error on the test data, divided by the variance of the test data). In building this AT-Nearest Neighbor approximation, 40,000 random training samples were used, and the relative error was calculated using 5000 randomly picked points. A nearest neighbor approximation gives a good indication of maximum approximation error, because the 40,000 learning points densely sample the 2 dimensional parameter space (cflj, co 2), where the nearest neighbor calculations are done (a nearest neighbor approximation done in the N dimensional predictor variables x-, would not be nearly as effective, for reasons outlined in Chapter 2). This procedure gives us a good estimate of the maximum approximation error that the SPORE-1 algorithm should get; the relative mean squared error of a SPORE-1 approximation, or any effective approximation method, should not exceed 0.01 on any of the high dimensional data presented in this section. The simulation results for the 15 high dimensional data sets are presented in Table 4.2. For each data set there are 40,000 learning examples and 40,000 testing examples. Each learning set is further divided into 30,000 training examples and 10,000 validation examples: these are used to Chapter 4: Experimental Results 71 construct one cascade which forms the approximation for that data set. The three rightmost columns of Table 4.2 contain the test set average relative mean squared error and corresponding standard deviations over 10 independent runs. From Table 4.2 it is evident that the relative error is small when no noise is present, and falls well below the expected maximum error value of 0.01. With the addition of noise the relative error approaches the theoretical limit (due to the 3 to 1 signal to noise ratio) of 0.1 (see explanation given above). Learning time for each data set with no noise was approximately 7 hours and produced cascades which were between 2 and 6 M-bytes in size (i.e. between approximately 190,000 and 570,000 model parameters). In contrast, learning time for each data set containing noise was about 1 hour and produced approximations of between 200 and 600 K-bytes in size (i.e. between approximately 19,000 and 57,000 model parameters). The drop in learning time and approximation size results because the SPORE-1 learning algorithm stops adding levels when error on the validation set can no longer be reduced. When the data has output noise, this condition is reached sooner than when there is no noise. The reason for this is that interpolation between sample points is no longer possible when noise is present, once a given level of approximation error (i.e. 0.1) is reached. Table 4.2 Evaluation of SPORE-1 on High Dimensional Data Sets Dimension No Noise Data Output Noise Data Input/Output Noise Data 100 0.00139 s.d. 0.00006 0.1049 s.d. 0.0003 0.1048 s.d. 0.0004 200 0.0018 s.d. 0.0002 0.1064 s.d. 0.0003 0.1055 s.d. 0.0002 400 0.0009 s.d. 0.0002 0.1041 s.d. 0.0002 0.1064 s.d. 0.0003 SOO 0.0044 s.d. 0.0006 0.1111 s.d. 0.0003 0.1038 s.d. 0.0003 1.600 0.0011 s.d. 0.0003 0.1087 s.d. 0.0003 0.1042 s.d. 0.0003 The results in Table 4.2 demonstrate that SPORE-1 can potentially be used on large, high dimensional regression problems. It is interesting to observe that the dimension of the data did not adversely affect either learning time or the size of the approximation. This is made evident in Figure 4.1 where we plot approximation size as a function data dimension. The "error bars" indicate the variation in approximation size for each specific data type. These experimental results support the theory developed in Chapter 3, where we showed that the rate of convergence of SPORE-1 is independent of dimension, and that complexity is at most linear with dimension. A Chapter 4: Experimental Results 72 700000 700000 J - 200000 100000 100000 0 200 400 600 800 1000120014001600 Data Dimension a) Output Noise 0 200 400 600 800 1000120014001600 Data Dimension b) Input/Output Noise 0 200 400 600 800 1000120014001600 Data Dimension c) No Noise Figure 4.1 Increase in Approximat ion Size with Data Dimension second interesting observation is that SPORE-1 is effective in dealing with both input and output noise, even on regression functions which are very high dimensional. 4.4 Human-to-Robot Skill Transfer Experiments We chose human-to-robot skill transfer as a real world test of the SPORE-1 learning algorithm. Our motivation for this is that human-to-robot skill transfer is an example of a regres-sion problem which is very high dimensional (1024 inputs), and exhibits both input and output noise characteristics [Grudic and Lawrence, 1995] [Grudic and Lawrence, 1996]. We approach the problem of programming robots that can work in unstructured environ-ments from the point of view of human-to-robot skill transfer. In this paradigm, the human transfers his or her skills to a robot by giving the robot a finite number of examples of how the desired manipulation task is done. The key to the success of this type of robot programming is that the robot must be able to generalize the desired task, given only the human experts examples, even when the unstructured environments of interest change after learning has occurred. Other researchers have had success with similar approaches to the robot programming Chapter 4: Experimental Results 73 problem. We mention only a few examples here. Perhaps the most related work is that of Pomerleau [Pomerleau, 1993b] [Pomerleau, 1993a] where he successfully uses a 3 layer percep-tron network to control the C M U A L V I N N autonomous driving system as it drives along a road. Pomerleau's approach differs from that presented here in that 1) we use a nonparametric approxi-mation approach and 2) the human expert supplying the training data uses only the sensor inputs being fed to the approximation (Pomerleau does not use this restriction). However, as with the approach presented here, Pomerleau uses a large number of sensor inputs (960) to effectively generalize the desired task. Other related work includes that of Kosuge et. al. [Kosuge et al., 1991] where a priori knowledge of the task is incorporated with human generated examples to transfer a skill. Other researchers use neural networks to encode the human's skills: for the deburring task Shimokura and Liu [Shimokura and Liu, 1991], and for compliance control Asada [Asada, 1990]. Lui and Asada [Lui and Asada, 1992] use process dynamics for transferring manipulative skills. Yang et. al. [Yang et al., 1994] use a multidimensional Hidden Markov Model to represent the transferred skill. Rouse et. al. [Rouse et al., 1989] also consider symbol processing models as useful tools for capturing human skills and knowledge. The approach to human-to-robot skill transfer proposed in this section assumes the follow-ing: The human operator can accomplish the desired manipulation task by using only the robot's sensors to obtain information about the world, and using only the robot's actuators to manipulate world objects. Thus, the human has only the robot's perspective on gathering world information and manipulating world objects, and no other. This implies that if the human can accomplish the task, then the robot should also be able to accomplish it. This assumption has at least one important implication for the choice of a learning system, or approximation framework, for generating the sensor to actuator mappings. Humans are usually only experts at manipulation tasks which they can anthropomorphize to their natural frame of reference. In other words, we are accustomed to particular human sensory and action modalities, and as the information for the task becomes more sparse, the task becomes more difficult. As a result, when a human expert is demonstrating a task which is sufficiently complex, it is likely that he or she is using a large number of sensor inputs of various types, including both visual and Chapter 4: Experimental Results 74 Figure 4.2 The Human to Robot Skill Transfer Test Bed tactile. Thus, the type of human-to-robot skill transfer proposed here requires an approximation framework which is highly flexible with respect to the number and variety of sensor inputs, and yet is practical to implement. In this section, we describe a human-to-robot skill transfer testbed which we have implemented, as well as outline our first skill transfer experiments done in an unstructured environment using the SPORE-1 algorithm. The skill which we have chosen to demonstrate is loosely based on a typical "object locate and approach task", and was chosen due to its inherently unstructured characteristics. However, the framework we are in the process of formulating should be generalizable to a large variety of human skills. 4.4.1 The Experimental Testbed A small scale proof-of-concept testbed for human-to-robot ski l l transfer has been implemented (see Figure 4.2), consisting of a tracked mobile robot which is approximately 35 centimeters long and 23 centimeters wide. The robot is equipped with 1) left and right tracks, each with independent proportional forward/backward motion similar to that found in forestry excava-tors; 2) a gripper (not shown in Figure 4.2) to allow it to grasp objects; and 3) an NTSC color camera. The leftmost image in Figure 4.2 shows the robot's workspace and the rightmost image shows a close-up of the mobile robot. Initially this robot is intended to operate in a flat 2.5 meter by 1.8 meter workspace which has randomly placed objects. The intended task of the robot is to manipulate the target objects in a manner defined by a human operator who first accomplishes the desired task a number of times via teleoperation. The human operator observes images received Chapter 4: Experimental Results 75 from the camera located on the robot (at a frame rate of approximately 20 to 30 frames per second), and operates two joysticks which control the robot's left/right forward/backward motion, as well as the open/close gripper motion. A SPARCstation 20, equipped with a frame grabber, is used to display the on-board camera image to the human operator as he or she demonstrates the desired task. The learning and subsequent autonomous control of the robot are also performed by the SPARCstation 20. The testbed currently uses various small objects including lego models and model "trees", as objects which must be located and approached. There are 2 lego objects, a red and blue stripped tower, and a red and blue stripped tower with holes. The "trees" are currently of three different types (tall spruce, short spruce, and short birch), and our goal is to teach the robot to differentiate between them. This testbed is considered unstructured because 1) although there are only three different types of trees and two lego models, no two objects are identical either in height, color distribution, or in texture, making manual encoding of a model difficult (especially for the model trees); and 2) the image background of the objects is a robotics lab which is continually changing as equipment is moved around and people move throughout the lab. 4.4.2 Experimental Results The first skill which has been successfully transferred from human to robot is described in Figure 4.3. This skill is loosely related to the "tree tending skill," which requires the selection of a certain size and species of tree for inspection. The skill involves turning the robot counterclock-wise in place until at least one tall spruce is within the camera's field of view, and then maneuver-ing the robot into a position where the tree can be inspected. Figure 4.3 shows several typical camera views which the operator sees, as the task of locating and moving towards a tall spruce is performed. Starting at the leftmost image of Figure 4.3, the robot is turning counterclockwise searching for a tree; in the third image a tall spruce has been found and the robot begins to move towards it; in the sixth image the spruce is within inspecting distance and the robot turns off to the left in search of another tall spruce. Note that this first task does not involve avoiding obstacles or actually stopping to inspect the tree, it simply requires the robot to find and move towards a tall Chapter 4: Experimental Results 76 Figure 4.3 A "Tall Spruce Tree" spruce tree in the visually cluttered environment of the robotics lab. The image which the human operator, as well as the robot, observes while executing the desired task, consists of 1024 pixels derived by sampling a 32 pixel by 32 pixel region in the center of the original colour NTSC image (see Figure 4.3 for gray-scale examples). The sampled pixel resolution is 8 bits, 3 bits for red, 3 bits for green, and 2 bits for blue. There is no image processing done prior to feeding these pixel values to the SPORE approximation; what the human sees is exactly what is passed to the approximator. In order to generalize the desired task the robot must approximate two mappings, one to control the robot's forward and backward motion, and one to control the robot's left and right motion. Both of these mappings have 1024 inputs and 1 output. As usual, two SPORE-1 cascades are generated, one for each dependent variable. The training data generated by the human operator consisted of demonstrating the above skill a total of 4 times, from 4 random starting positions of the robot, and 4 random positions of the spruce. Due to image resolution limits, the distance between the robot and the "tree" was always 1.5 meters or less. It took approximately 2 minutes (equivalent to 2180 camera frames or input/output examples) for the human operator to do these 4 demonstrations. The SPORE-1 regression function was constructed by randomly subdividng these 2180 samples into 218 valida-tion samples and 1962 training samples. The resulting approximation took about 5 minutes of SPARCstation 20 C P U time to generate, and took about 250 K-bytes of memory space (i.e. approximately 24,000 model parameters). The length of time to evaluate the resulting approxima-tion for one instance of sensor inputs is 191 milliseconds. Next we asked the following question: given only a few examples of the desired skill, and the raw sensor inputs used by the human, can the SPORE-1 regression algorithm effectively generalize the desired skill. In the case of the above experiment, the human operator generated only 4 examples of the desired task, and the SPORE approximation was fed raw pixel data with no intensity compensation or any other type of signal preprocessing. In addition, one can see from Figure 4.3 that the image generated by the frame grabber inside the SPARCstation 20 is extremely noisy. Nonetheless, when we ran 30 experiments where the robot used the SPORE approximation Chapter 4: Experimental Results 77 Figure 4.4 A Red and Blue Stripped Tower to autonomously accomplish the transferred skill, the robot never failed to locate and move towards the tree. These 30 experiments were done with random placements of tree and robot (as with the training set, the spruce was always within 1.5 meters of the robot). It is interesting to note that the robot does not approach, for instance, upside-down spruce trees, and does not seem to be solely color sensitive; the shape of the tree and its texture are significant as well. We may therefore conclude that to the extent that the robot acquired the desired skill, it was able to execute it within an unstructured environment. Next, we repeated the same experiments with the two lego objects. Figure 4.4 and Figure 4.5 show grey-scale camera frame examples of approach sequences to two objects we have experimented with. The object in Figure 4.4 is a red and blue striped tower, and the object in Figure 4.5 is also a red and blue striped tower but has a number of holes through it, which allow background scenes to be observed. For the object in Figure 4.4, the teleoperator generated 4632 camera images during 4 approaches. Similarly, for the object in Figure 4.5, the teleoperator generated 4805 images during 4 example approaches. The variation in camera frame counts is simply a result of the random placement of the vehicle and object. The SPORE-1 approximation was generated for both objects, using a random selection of 10% of the learning points as valida-tion set, and rest as a training set. The generation of the 4 approach examples took the teleoperator approximately 5 minutes to generate for each object, while the off-line learning of SPORE-1 required approximately 4 additional minutes on the SPARCstation 20, for each object. The size of the approximation generated for the object in Figure 4.4 is 489 K-bytes (i.e. approximately 44,000 model parameters), while the size of the approximation generated for the object in Figure 4.5 is 552 K-bytes (i.e. approximately 52,000 model parameters). The time required to evaluate either approximation (once it has been constructed) for an observed image is less than 120 milliseconds Chapter 4: Experimental Results 78 Figure 4.5 A Red and Blue Stripped Tower with Holes for either object (run on a SPARCstation 20). Once more, in order to evaluate the ability of the robot to autonomously execute an approach sequence toward an object, we ran 30 experiments for each of the two objects described above. Each experiment involved placing the vehicle and object in random positions, and then seeing whether the robot would autonomously be able to located and approach the object. For the object shown in Figure 4.4 the robot successfully executed the task 29 out of 30 trials, while for the object shown in Figure 4.5, the robot was successful in approaching the object 27 out of 30 trials (only one object was present at each trial). Note that the initial distance between the vehicle and target object was always less than 1.5 meters, because at distances greater than this, the object is not easily distinguishable, even by a human operator. From these experimental results, we can conclude that the SPORE-1 approximation was successful in generalizing all of the above locate and approach tasks, given relatively few noisy examples and raw pixel inputs. This is promising for the many and difficult robot programming problem facing robotics researchers. 4.5 The 10 Bit Parity Problem One of the main criticisms of algorithms which construct a high dimensional regression functions using the smaller dimensional functional units, is that they cannot model interactions which have an intrinsic dimension that is greater than the dimension of the highest dimensional functional unit. For example, the classic implementation G M D H [Ivankhnenko, 1971] has 2 dimensional functional units, and therefore can at most model the 2-bit parity problem G M D H [Elder IV and Brown, 1995]. Similarly, ASPN can at most model the 3-bit parity problem because its functional units are at most 3 dimensional [Elder IV and Brown, 1995]. Note that the N-b'ti parity problem is one of determining whether an odd or even number of the N bits in a word is high: the target function being, for example, set to 1 if an even number of bits are high, and -1 otherwise. Chapter 4: Experimental Results 79 Using this same train of thought, one may conclude that because SPORE-1 uses 2 dimensional functional units, it should not be able to model parity problems for words of 3 or more bits. This, however, is not the case. As discussed in Section 3.3.4 of Chapter 3, the virtual space partitioning property of SPORE-1 can allow the algorithm to converge to zero error, even when the following pathological condition occurs; The above condition occurs in a parity problem of more than 2 bits because the target function value is equally to be likely -1 or 1, for each point (u, v) in the training set. Thus, given that the training set consists of an exhaustive set of parity points, and the training data is applied without modification (i.e. a bootstrap sample is not used), the function °p''gt(u, v) which always gives the best squared error fit to the training data is the zero function gt(u, v) - 0. In order to demonstrate the ability of SPORE-1 to represent 3 or more bit parity functions, we applied it to the 10-bit parity problem [Grudic and Lawrence, 1997]. The training data and the validation data were made to be identical, with each containing the exhaustive 1024 examples of the 10-bit parity function. Our goal was to determine how well SPORE-1 could fit these learning examples. The results achieved were that the relative mean squared error on the learning data was _9 9.8x10 , indicating that, to within the floating point precision of the C P U , the SPORE-1 algorithm was able to exactly model the 10-bit parity function. However, the rate at which SPORE-1 converged to this exact solution was rather slow: it took about 2 hours and the final approximation was over a megabyte in size (i.e. approximately 94,000 model parameters). If, on the other hand we stopped the learning algorithm when the parity problem was actually solved (i.e. when the maximum approximation error is less than 1, meaning that every instance of the learning data has been appropriately mapped), the learning time is only 10 minutes, and the size of the approximation is about 40 K-bytes (i.e. approximately 3,800 model parameters). 4.6 Summary { a-} = argmin { aij} jC'gl(u,v)-f(x)) dD = {0, ...,0} (4.10) D This chapter is our first attempt at an experimental verification of our thesis presented in Section 1.3 of Chapter 1. The SPORE-1 algorithm is our most extreme example of the general SPORE methodology. The structural units are small and all the same (i.e third order two Chapter 4: Experimental Results 80 dimensional polynomials), and no input selection is ever done. We have demonstrated its success-ful performance on problems ranging from relatively small 6 dimensional regression problems having only 104 learning examples, to 1,600 dimensional regression problems with 40,000 learning examples. Its performance compared favorably to published results on well known data sets in the literature. We introduced new very high dimensional data and demonstrated that SPORE-l 's construction times were independent of dimension. A real world example of human-to-robot skill transfer was also successful despite the raw image pixels used as inputs. Finally, we demonstrated that SPORE-1 does not have the same limitations on parity type problems, as other systems which build regression functions using small dimensional functions which are added one at a time. Chapter 5: The General SPORE Methodology In Chapter 3 the SPORE-1 learning algorithm was presented, and its basic theoretical characteristics were explored. In this chapter the most general form of the SPORE methodology is presented and the theoretical characteristics of learning algorithms which construct SPORE-type regression functions are presented. In particular, this chapter deals with the rate of convergence and computational complexity properties of the general SPORE learning algorithms, with particu-lar emphasis on how these depend on the dimension of the function being approximated. 5.1 The Complete SPORE Definition The most general form of the SPORE structure (see Figure 5.1) has C cascades of D dimensional functions. In Figure 5.1, the D dimensional functions are symbolized by g i j, where i serves to identify one of the C cascades, and j identifies the level of the cascade where the function g • . is located. In the general case, there are C > 1 cascades of D > 2 dimensional functions, and the dimension of the functions g • • need not all be the same. As a specific example, the SPORE-1 structure in Chapter 3 has C = 1 and D = 2 for all g- • functions. The columns in Figure 5.1 symbolize cascades of functions, and the rows symbolize levels in each of the cascades. Thus, the SPORE structure depicted in Figure 5.1 has C cascades, with the first cascade having Lj levels, and the final cascade having Lc levels. There are 2 types of inputs to the functions gt •. The first type are symbolized by X; ,• = (jc t, ..., xk ), where (xk , ...,xb ) are some subset (perhaps randomly selected) of all of the model input variables (or independent variables). The second type are symbolized by Gitj = (8ri,qi> — S r ^ J ' W h e r e (Sr,,,,. • • • ' ^ , , J ) , <?,,,) ^ S ° m e S U b S e t ° f ° U t P U t S ° f functions which have already been added to the structure (note that gr< q^ refers to the output of the function from level qx of the r, cascade). The final regression function, symbolized by / in Figure 5.1, is some weighted sum of all 81 Chapter 5: The General SPORE Methodology 82 Cascades D > <U Figure 5.1 The SPORE Structure constructed functions g • •. This is symbolized by the vector: W = ( w u r g u l , w l t 2 - g l t 2 , . . . , W C , L c - 8 C , L C ) (5.1) Given this definition of the general SPORE structure outlined in Figure 5.1, we next give a description of how the a general implementation of a SPORE learning algorithm would construct this structure. The specific structure of the functions g(- • is either 1) fixed throughout SPORE structure as with SPORE-1; 2) chosen randomly, either with or without replacement, from a set of possible candidate functions (if without replacement, then reinitializing the list after all have been chosen), or 3) chosen in a specific order from a predefined sequence. Similarly, the inputs to each g- • function is either 1) chosen in a specific order (going back to the first one in the list after all have been chosen); or 2) chosen randomly, with or without replacement, from a list of all possible inputs (if without replacement, then reinitializing the list after all have been chosen). This methodology for selecting inputs and function types implies that no time consuming search occurs during the process of constructing the structure. In all instances of SPORE, the function g, , is constructed first, its inputs being some Dj , > 2 dimensional projection of the N dimensional learning data. The parameters of the Chapter 5: The General SPORE Methodology 83 function g, x are determined using some subset (perhaps even all) of points in a training set containing examples of inputs and outputs. Once the initial function j is constructed, the remaining functions are constructed, one at a time, with the only stipulation being that no function gt j is constructed until all of its inputs have been constructed. As with the initial function gx x , the parameters of each function g- • are determined using some subset (or perhaps all) of the points in a training set, projected onto the inputs (X • G ( •) of g • •. The weights in the vector W are either all the same, or chosen using some criteria such as W:. being proportional to the inverse of the approximation error of g- • on the training data (note ij ' J that this is how the weights are chosen in the SPORE-1 learning algorithm). During construction the desired learning outputs are periodically replaced by the residual errors resulting from subtracting out the approximation due to the structure already constructed. This usually happens when the approximation errors can no longer be reduced by fitting the parameters of the current gi • function to the current residual errors in the data set. Further additions to the structure are then added to reduce these residual errors. In the general case, the SPORE construction terminates when the approximation error, on some validation set, has been reduced to some acceptable level. An exception to this rule is when the algorithm allows for the subdivision of the input space of the target function into disjoint sub-domains. In which case new SPORE structures are then constructed in each resulting subdomain. The reader should note that the above description conforms with the general SPORE characteristics outlined in Section 1.3.1 of Chapter 1. 5.2 Theoretical Analysis of Rates of Convergence For Two Specific SPORE Structures In this chapter we study the rate of convergence properties of two specific examples of SPORE: the SPORE-2 algorithm and an extension of the SPORE-1 algorithm defined in Chapter 3. As discussed in Section 2.3 of Chapter 2, there are two types of rate of convergence results. The first assumes an infinite number of learning examples and measures how approximation error decreases as new structures (or parameters) are added to the regression model. The second assumes a finite number of learning samples and measures how approximation error decreases as the learning sample size increases. In this section we address the second type of convergence result. Chapter 5: The General SPORE Methodology 84 In Section 5.2.1, we describe and theoretically analyze the SPORE-2 regression function. Specifically, our goal is to define function spaces for which the rate of convergence of SPORE-2 is independent of dimension. In such function spaces, it is potentially possible to avoid both the computational complexity and large sample size aspects of the curse of dimensionality. Our aim is to analyze theoretically motivated regression functions which point the way to real implementa-tions that are feasible. The theoretical constructions presented should be viewed with this in mind. In Section 5.2.2, we extend the SPORE-1 regression function described in Chapter 3 to include the partitioning of the input space into disjoint subdomains. This subdivision of the input space guarantees that the approximation error to converges to zero, while simultaneously preserv-ing the rate of convergence and complexity properties of the SPORE-1 learning algorithm within each of these subdomains. 5.2.1 The SPORE-2 Regression Function As the primary goal of this thesis is to study very high dimensional nonparametric regres-sion, in this section we explore function spaces where such large problems are potentially tracta-ble. In the previous two chapters we have demonstrated the value of building cascades of lower dimensional functions as approximations to high dimensional functions. In this section we extend this basic idea to more complex cascaded structures, and include space partitioning as a method to ensure that the approximation error always converges zero. As is outlined below, an additional benefit of space partitioning is that the rate of convergence of the regression function is indepen-dent of dimension, within each resulting subdomain. The SPORE-2 regression model consists of a partitioner and a structure. The structure consists of a cascade of 2 dimensional functions, and the partitioner is a method of subdividing the domain of the target function into regions were the cascade is an effective regression function. The algorithm for building the SPORE-2 regression function is based first on an algorithm which decomposes a continuous 3 dimensional function U(x, y, z) into a cascaded 2 dimensional function having the structural form cW{'U(x, y), z). Such a decomposition of 3 dimensional function into two 2 dimensional functions is possible because we partition the domain of Zl(x, y, z) into a finite number of regions where it can be represented to arbitrary accuracy by W( V(x, y), z) • This partitioning is the subject of the following theorem. T H E O R E M I: Let U(x, y, z) be a bounded continuous 3 dimensional function defined Chapter 5: The General SPORE Methodology 85 Level 1 ! Level 2 Figure 5.2 A Tree-Like Subdivision of Input Space 3 over a finite dense domain in U czSi . Then for every arbitrary small real £ > 0, there exist a finite number of dense disjoint domains J7,-, where L J Ut = U, and corresponding bounded continuous 2 dimensional functions and <Vi, such that for any (x, y, z) £ U t the following is true: \W-AVAx, y), z) - U(x, y, z)\ < £ (5.2) The proof of T H E O R E M I is given in Appendix A, where it is restated in a more precise, but equivalent, way. The proof is based on four algorithmic constructions which are presented in Appendix D. These constructions are titled Construction I through Construction IV, and they collectively serve to decompose the function 1l(x, y, z) into the functions rWAVi(x, y), z), as well as define and build the disjoint domains Ui. Using the 3 dimensional decomposition and domain partitioning outlined in T H E O R E M I, we can begin to define the domain partitioning nature of the complete N dimensional SPORE-2 algorithm. The SPORE-2 partitioner works by subdividing the input space of the target function into a tree-like subdivision. This subdivision is represented using the tree notation P j O ^ . . . p ^ , where / represents the level of the tree, and the symbols p and a refer to branch numbers at Chapter 5: The General SPORE Methodology 86 various levels of the tree. Thus, each branch of the tree (or equivalently path through a tree), defines a subdomain of the target function. An example of such a subdivision of space is given in Figure 5.2. Because each tree branch defines a unique subdomain, we speak of a point (x0, xN_ j) being an element of the subdomain corresponding to a tree branch with index P I < * I - P J < V Given this description of domain subdivision, we can define the structure built by the SPORE-2 regression algorithm within each of these subdomains. The partitioner works by subdividing the domain of the function being estimated into a set of disjoint subdomains, such that in each subdomain, one can estimate the function using a superposition, or summation, of a cascade of 2 dimensional functions. The 2 dimensional functions are of two types symbolized by £p,a, p,a, a n c * ^p,o, p,a,' where Pi°~i •••P/°~/ is the tree subdivision of the input space defined above. Thus, we can view the functions en „ n„ and hn „ n n as acting on the subdomain ° P i ° i - P / 0 / P i ° i - P ; ° / ° defined by tree branch pial...plo[. The SPORE-2 structure is defined as follows. Let 0„(JC O , ...,xN_l) be the SPORE-2 estimate of some unknown function Q(xQ, xN_ j ) , which is based on n random samples of 0(x o, xN_ j ) . Then, if (x0, xN_ j) is an element of the subdomain corresponding to a tree branch with index p , a , ...p, o~, , the SPORE-2 regression function Qn(xa, xN_ ,) has the lmax '•max ^ i v 1 following expansion: e„(*0> ...,xN_l) = g P l 0 l ( fc p i 0 l ( ( *o> xl)> xl)) + S p . o . p ^ p . o . p ^ ^ p . o , ' *2)> * 3)) + '«« (5.3) X ^p1a,...a/(^p1ol...o/(^P|0|...o,_1' x(l mod #))> X((l+ 1) modN))) 1 = 3 where, 1. A mod N is the integer remainder of A -s- N, and is used to cycle through the indices of the independent variables (x 0, xN_ j) (hence the order of input variables is predefined); 2. the subscript notation p,G,...pl a, is described above (see Appendix B for a more 1 1 1 max lmax complete specification) and serves to identify the two dimensional functions g P | ( J | 0 / and ^p,0, o w n i c n a c t on the specific subdomain of 0(x o , ...,xN_x) referred to by branch PifJi---P/0-/; 3. and, lmax is an integer defining the maximum number of terms (or levels of the tree structure) required in order to achieve some approximation error bound Chapter 5: The General SPORE Methodology 87 level 1 level L X —• 9,,.,, I —» Figure 5.3 A 4 Dimensional Example of SPORE-2 maxxeD\Qn(x)-Q(x)\ <e (5.4) where e is an arbitrarily small (for sufficiently large n) positive real number, and D is the domain of the unknown function 6(x 0 , ..., xN_ j ) . A 4 dimensional example of a SPORE-2 structure is shown in Figure 5.3, where the tree branch is defined by pt = 1 and at = 1, for all / = 1, lmax. Note that this branch is identified by the top most branch in Figure 5.2. Given the above description of SPORE-2, we now state the second theoretical result of this section. In T H E O R E M II, we prove that the SPORE-2 algorithm produces an approximation which is a universal approximator. T H E O R E M II: Let 9(x 0 , ...,xN_l) be a bounded continuous N dimensional function defined over a finite dense domain in D cz^.N. Then, for every arbitrarily small real £ > 0, there Chapter 5: The General SPORE Methodology 88 exist a finite number of dense disjoint domains Sk, where {JkSk = D, and each domain corresponds to some branch p . O i . . . p i <5, , and corresponding bounded continuous 2 * ' lmax ''max dimensional functions gpjCi] 0 and ^ P | C | 0 , such that, for any ( x 0 , x N _ j) e Sk the follow-ing is true: |0„(JC)-0(JC)| <e (5.5) where Qn(x) is defined in (5.3), and we let n —» °°. The proof of T H E O R E M II is given in Appendix B, where it is restated in a more precise, but equivalent, way. The proof uses the result of T H E O R E M I, and is based on five algorithmic constructions which are presented in Appendix D. These constructions are titled Construction I through Construction V, and they collectively serve to decompose the function Q(x) into the functions Qn(x), as well as define and build the disjoint domains Sk. Note that Construction I through Construction IV are constructions which are also used in the proof of T H E O R E M I. The proof of T H E O R E M II defines a sufficient algorithm for constructing the universal approximation 0„(JC) , given that the number of samples points, n, is infinite. In T H E O R E M III we extend this result by assuming a finite learning sample size n, analyzing how the approxima-tion error decrease as the sample size increases. We state this theorem next. T H E O R E M III: Let Qn(x), 0(x), and Sk be as defined in THEOREM II, with the further assumptions that Q(x) is p -times differentiable (p > 1), and is the mean value of some probability function as follows: Q(x) = \yh(y\xMdy) (5.6) where h(y\x) is some probability function, and §(dy) is some measure on SK. Assume the regularity conditions 1-3 (establishing optimal rates of convergence), defined by Stone [Stone, 1982], are true for the target function d(x). Let n be the number of samples of the target function Q(x) within a specific subdomain Sk. Then, as n increases, the rate of convergence of the regression function Qn(x) to the target function Q(x), on the subdomain Sk, is given by: ||en(jc) - 0(x)|| o o < 0((nYX log(n)) r (5.7) where r = p/(2p + 3), and IHI^ is the L°° norm. The proof of T H E O R E M III is given in Appendix C, where it is restated in a more precise, Chapter 5: The General SPORE Methodology 89 but equivalent, way. The three regularity conditions of Stone [Stone, 1982] which we use in the statement of the theorem, simply impose certain restrictions on the probability distributions associated with the target function 6(JC) (see [Stone, 1982] for details). The first condition imposes constraints on the density function h(y\x), and is required to verify that the sequence —l r (in) log(n)) is a lower convergence sequence (i.e. the best convergence sequence). The second condition imposes further constraints on h(y\x) to ensure that the rate defined by the - l r sequence ((n) log(rc)) is achievable. Examples of density functions which satisfy these two conditions are the Normal distribution and the Bernoulli distribution. The final condition used by Stone puts constraints on the asymptotic distribution of the independent variables, x, of the training samples. This condition is required to ensure that the rate of convergence is both optimal an achievable. An example domain D a SiN which satisfies this condition would be any polyhe-dral domain in 3iN where the distribution of points x is uniform, where the learning samples are independent and identically distributed. A more general example of a sufficient domain D cz 3iN can be found in [Stone, 1982]. T H E O R E M III establishes that the closed domain of any function, which is bounded and at least once differentiable (and under appropriate regularity conditions of Stone [Stone, 1982]), is densely covered by a finite number of subdomains where the rate of convergence of our regression model is independent of dimension. This is an important result because it is not intuitively obvious. Previously, Stone [Stone, 1982] showed that the optimal rate of convergence, over the entire domain of such functions, does depend on dimension. Specifically, for an N dimensional function, the rate of convergence established by Stone is ||9„(JC) - 0(x)| | o o < 0((n) 1 log(n)) , where v = p/(2p + N). Knowing that every such function has a finite number of subdomains where there exists at least one type of regression function (for example the SPORE-2 regression function studied in this section) which has a rate of convergence independent of dimension, has important practical implications if these domains can be identified. For example, suppose that we have a regression problem which has a rate of convergence that depends adversely on dimension over its whole domain. Furthermore, suppose that we can divide the domain of this regression problem into two regions, with each region having a rate of convergence independent of dimension. Then our original "hard" regression problem has been converted into two "easier" regression problems. The constructions given in Appendix D of this thesis provide a sufficient algorithm for Chapter 5: The General SPORE Methodology 90 finding domains which have rates of convergence independent of dimension. These theoretical constructions can serve as a basis for practical algorithms for very high dimensional regression, and in fact are the motivation behind the implementation of the SPORE-1 algorithm. A further implication of T H E O R E M III of Appendix C is that it demonstrates that it is possible to measure the complexity of regression in terms of the minimum number of subdomains which have rate of convergence independent of dimension, rather than the dimension of the regression problem. The more subdomains, the more complex the regression problem. This measure of complexity may have important implications for the practical implementation of very high dimensional regression. This is especially true if we can establish the complexity of a regres-sion problem using only the given training examples. Our hope is that the constructions contained in Appendix D of this thesis can serve as the basis for such a data based complexity measure. As described above, these constructions collectively serve to decompose the function 0(x) into the functions 0„ (x) , while at the same time building the disjoint domains Sk which have the desired rate of convergence properties. Since these constructions are based on learning examples of the target function 0(x) (see proof of T H E O R E M III in Appendix C), they constitute a sufficient algorithm for such a data based complexity measure. However, the practicality of such a complex-ity measure has yet to be demonstrated. 5.2.2 The SPORE-1 Regression Function with Domain Partitioning The rate of convergence result for the SPORE-2 algorithm is based on two key observa-tions. First, the SPORE-2 algorithm works by constructing the approximation one functional unit at a time, and the rate of convergence of each of these individual functional units is independent of dimension. Secondly, the domain of the function being approximated is subdivided to ensure that as each functional unit is constructed, the approximation error over the particular subdivision acted on by the functional unit, must decrease. The result is that within each subdomain, the approximation error converges to zero, and the rate at which it does so is independent of dimension. Given the definition of the SPORE-1 algorithm in Chapter 3, it is not difficult to envision a modification to the SPORE-1 algorithm which will also give it a rate of convergence to zero error which is independent of dimension. First, in T H E O R E M 5 of Chapter 3, we showed that the rate of convergence, to a fixed approximation error, Of SPORE-1 is independent of the dimension of Chapter 5: The General SPORE Methodology 91 the function being approximated. Therefore, if we designate each single execution of the SPORE-1 algorithm to be a single equivalent functional unit, then we can achieve a rate of convergence to zero error, by subdividing the input space such that the next time SPORE-1 is run on any subdivi-sion, the resulting approximation error over that subdivision wil l decrease. One method of choosing such a subdivision is to do the following binary split of the input space. Let RL(x) be the constructed SPORE-1 regression function which acts on the domain D. Then, create 2 subdomains, Dj and D2, such that Z)j u D 2 = D,as follows: x e D , iff ( ( i ? L (x ) - / ( x ) )>0 ) xeD2 iff ( ( * L ( x ) - / ( x ) ) < 0 ) (5.8) (Note that Dx and D2 may not be topologically connected, whereas the subdivisions of most implementations of regression trees [Breiman et al., 1984] are. This may or may not be a signifi-cant advantage depending on how hard it is to construct such domains in practice. This is a topic for future research) The subdivision in (5.8) ensures that the maximum expected approximation error, within each subdomain Di (i e { 1, 2}), is given by j" «RL(x) - f(x)) - AfdD^ < E\\ (Rl(X) - f(x))2dDi where A,- e 3i is a constant defined as follows: '-D, (5.9) A, = argminA % \({RL(x)-f{x))-Ai)2dDi •D, (5.10) (Note that A • is simply the average value of (RL(x) - f(x)) over each region Di). Hence, such a subdivision ensures the reduction of error after each execution of the SPORE-1 algorithm. Using T H E O R E M 5 of Chapter 3, this gives us the following rate of convergence (assuming we subdivide after each execution of SPORE-1), j ( ^ L | ( - ) - / ( x ) ) 2 ^ DT\(m) (m) <K Mn + e (5.11) where 1. e is any arbitrarily small positive real number; 2. -D^^) is any resulting subdomain after m domain subdivisions; Chapter 5: The General SPORE Methodology 92 3. and MT is the number of training sample points in domain £ ) r ) ( m ) • Thus, the rate of convergence to any arbitrarily small error e is independent of the dimension of 5.3 Rate of Convergence and Computational Complexity In this chapter we have given two specific examples of the SPORE structure. We have shown that both examples have a rate of convergence which is independent of the dimension of the function being approximated as a result of two important characteristics. The first is that the regression function is constructed one functional unit at a time, and the rate of convergence of each of these individual functional units is independent of dimension. The second is that the domain of the function being approximated is subdivided to ensure that as each functional unit is constructed, the overall approximation error decreases. Thus, as long as these two characteristics are maintained, any resulting SPORE implementation will have a rate of convergence independent of dimension. In T H E O R E M 6 of Chapter 3, we showed that the computational complexity of the SPORE-1 algorithm is linear with respect to the dimension of the function being approximated. This is a powerful practical result because it makes regression functions which have thousands of inputs, at least potentially feasible from a computational standpoint. This is experimentally supported by the high dimensional regression examples given in Chapter 4. The SPORE-1 algorithm achieves a computational complexity which is linear with dimension because the drop in approximation error, due to the addition of a new level, does not depend on the dimension of the function being approximated. Instead, it depends on the probabil-ity distribution of the function space of f{x) around the 2 dimensional space that the error function is being projected on; i.e. the 2 dimensional space defined by the independent variables of the function g((-) (see proof of T H E O R E M 6 in Chapter 3). Thus, if the variance of this probability distribution is high, this is an indication that f(x) maps poorly onto g[(-) and many levels in the SPORE-1 structure may be required to get a good fit. Thus, the error drop achieved by cycling through all independent variables of / ( x ) is not a dependent on the dimension of / ( x ) , but rather on the way it projects onto the 2 dimensional input space of g^(-) • Hence the linear growth in computational complexity. Following this argument, we can show that, within each subdomain defined in Section Chapter 5: The General SPORE Methodology 93 5.2.1, the computational complexity of the SPORE-2 algorithm is also linear with dimension. Thus, the total computational cost of the SPORE-2 algorithm is dependent on the number of subdomains required (which may in fact depend on dimension). Similar arguments are valid for the computational cost analysis of any SPORE structure. Thus, we conclude that within each subdivision of a general SPORE-type structure, the computational cost of the algorithm is linear with respect to the number of input variables. Finally, we ask the following question: under what conditions is the subdivisions of space not necessary in order to achieve convergence to zero error? For the SPORE-1 algorithm, this question is answered in T H E O R E M 4 of Chapter 3, where we showed that as long as the approxi-mation errors due to the addition of a level to the regression function are sufficiently uncorrelated with respect to current approximation errors, approximation errors will continue to converge to zero. Another interpretation of this is simply that the new level is contributing to overall error reduction. An analysis of the SPORE-2 constructions gives an equally trivial interpretation of what constitutes continued error reduction without subdivision. In order for SPORE-2 to reduce error without space subdivision, the projection of the current error function onto its two dimensional functions must always contribute to overall error reduction. The only condition under which error reduction will not occur, is if the projection of the current error function onto the appropriate 2 dimensional space produces functions g P i 0 i 0 and ^ P | 0 i a which are identically zero (see Appendix B for details). Similarly, the SPORE-1 learning algorithm will also fail to converge if gt(-) = 0. Therefore, for both algorithms, subdivision is required when the residual error produce two dimensional functions which are identically zero. As described in Section 3.3.4 of Chapter 3, SPORE-1 uses bootstrap samples to make this pathological condition unlikely. Perhaps it is also possible to use bootstrap samples to achieve similar results with the SPORE-2 algorithm, and more generally with other SPORE-type algorithms. This is an open theoretical question. 5.4 Summary In this chapter we have shown that the general SPORE structure addresses the two major difficulties associated with high dimensional nonparametric regression. Both of these can be attributed to the curse of dimensionality [Bellman, 1961]). The first difficulty is associated with the potentially intractable number of sample points required to approximate high dimensional Chapter 5: The General SPORE Methodology 94 functions. The second is associated with the computational complexity inherent in building very high dimensional regression functions. Both difficulties are addressed in the SPORE structure via the sequential addition of small functional units in a way which decreases the residual approxima-tion error. We have shown under which regularity conditions (Stone [Stone, 1982]) this leads to a rate of convergence (as a function of the number of training sample points) independent of dimension. In addition, we have demonstrated that under appropriate conditions, the computa-tional complexity of the general SPORE structure, is linear with respect to the dimension of the function being approximated. Thus, this chapter, building upon our results in Chapter 3, constitutes our second argument for a theoretical verification of our thesis presented in Section 1.3 of Chapter 1. The theoretical analysis presented in this chapter is an attempt to fully understand the fundamental characteristics of a learning algorithm. Many practical implementations of learning algorithms are not based on a firm theoretical understanding of their theoretical characteristics, often resulting in ad hoc heuris-tics which have no firm foundations. Perhaps one of the most significant differences between the SPORE methodology and the other learning methodologies which it loosely resembles (see Section 2.4 of Chapter 2), results from the fact that we have specifically used a theoretical analysis of the characteristics of very high dimensional regression to define our methodology. Chapter 6: Conclusion The focus of this thesis has been the construction of models (nonparametric regression functions) which have a large number of inputs, and where little a priori knowledge is available about what constitutes a good choice for model structure. Examples of large high dimensional nonparametric regression problems can be found in such diverse fields as meteorology, econom-ics, and robotics. However, general practical methods for constructing these high dimensional models have yet to be widely proposed in the literature. It has been shown in this thesis that very high dimensional nonparametric regression can be effectively accomplished using a sufficient number of small, low dimensional parametric building blocks, which are fitted to the regression data one at a time. Furthermore, we have demonstrated that both variables and building blocks can be added in a random order to the approximation, thus little or no computational effort is required to determine what to fit next. Projection onto low dimensional parametric building blocks was shown by proof and example to produce stable regression functions from relatively few sample points. It was demonstrated that the random selection of building blocks and their inputs leads to algorithms that are computation-ally feasible in very high dimensional spaces. This document is a theoretical and experimental evaluation of a general methodology (SPORE) which exhibits the above characteristics. The most significant result of this thesis is that this simple algorithmic structure leads to regression functions which are effective approximators in very high dimensional spaces. The successful application of nonparametric regression to high dimensional problems such as the 1,600 dimensional problem defined in Section 4.3 of Chapter 4, has not been demonstrated in the past, without using some form of dimensionality reduction requiring a priori knowledge. The main reason for this is that previously published techniques have been designed with lower dimensional problems in mind, resulting in algorithms which utilize search strategies in input space, thus making them computationally infeasible in very high dimensional spaces. The main contributions of this thesis are the following: I. A new methodology is proposed (SPORE) for constructing very high dimensional 95 Chapter 6: Conclusion 96 regression functions. The SPORE approach to nonparametric regression differs from conven-tional approaches. Emphasis is shifted from searching through a large parameter space for an optimal structure, to an algorithmic approach which projects learning data onto low dimen-sional structures, without resorting to computationally expensive searches. This makes SPORE particularly suited to the large regression problems considered in this thesis. II. A theoretical analysis of SPORE demonstrated that it is a universal approximator. Fur-thermore, we showed that domain of the target function can be subdivided into a finite number of regions where the rate of convergence (as a function of the number of learning examples) to any bounded continuous target function (at least once differentiable) is independent of the number of input variables (i.e. dimension). This theoretical result is significant because it shows a rate of convergence which has not previously been demonstrated for high dimen-sional nonparametric regression problems. In addition, we showed that the SPORE structure has a computational complexity which is linear with dimension in each of these subdomains. This is also a significant theoretical result because previously proposed multidimensional non-parametric regression algorithms are not computationally practical in higher dimensional spaces. III. The simplest example of SPORE, called SPORE-1, has been implemented and evalu-ated. The SPORE-1 learning algorithm is completely automated, requiring no user interven-tion. Theoretical analysis of the SPORE-1 learning algorithm gives conditions under which the approximation error converges to zero (without domain partitioning), with a rate of con-vergence which is independent of the number of input variables (i.e. dimension). Experimen-tal evaluation of SPORE-1 on regression problems which have been previously studied in the literature, shows that SPORE-1 compares favorably, and often outperforms, other published regression algorithms. Out of ten regression examples, SPORE-1 did as well or better than published results on all but one of these examples. Evaluation of SPORE-1 on the 10-bit par-ity problem showed that it can be applied to problems which have flat low dimensional projec-tions, and therefore is not subject to the same limitations as other methods which construct approximations using low dimensional structural units added sequentially. Evaluation of SPORE-1 on very high dimensional data (100 to 1600 inputs) showed that SPORE-1 performs well under noise conditions in both input variables and output values, and that the dimension of the data had no direct effect on learning times or the size of the approximation. Application Chapter 6: Conclusion 97 of SPORE-1 to real-world human-to-robot skill transfer problems, using raw image pixels, showed that SPORE-1 is effective on very high dimensional (1024 inputs), noisy, real world problems. These human-to-robot skill transfer experiments demonstrated a promising (and theoretically understood) solution for a wide variety of complicated learning/programming tasks, which could not previously be addressed in this manner. 6.1 Future Work SPORE defines a general methodology for constructing regression functions. The SPORE-1 algorithm, having only a simple 2 dimensional cascaded structure with only one type of functional unit, is the most basic example of this general methodology. Our intention is to implement and empirically evaluate other examples of SPORE. In particular, we intend to implement a SPORE algorithm which has many types of functional units, where each functional unit can accept inputs from more than one existing unit. Such a structure would exploit more aspects of the general SPORE methodology, while still maintaining the computational capabilities of the SPORE-1 algorithm. In addition, the SPORE structures presented in this thesis have been designed as long, cascaded chains which can only be evaluated in a serial manner. This may not be a suitable computational structure for applications which require fast calculations (e.g. real-time control systems). For such applications, computationally parallel SPORE structures may be more appropriate. The theoretical and empirical evaluation of a parallel SPORE algorithm is a good topic for further investigation. The main focus of this thesis has been building nonparametric regression functions for the purpose of predicting future outputs given new inputs. However, traditionally many nonparamet-ric regression techniques have been used as analytical tools to determine which inputs are important and how they interact. It would be interesting to investigate the possibility of using SPORE as such an analytical tool. If this were possible, it would allow analysis of input spaces for problems which have many input variables. An additional topic for further study is a more complete analysis of the theoretical proper-ties of the SPORE structure. This thesis gives rate of convergence results as a function of the number of learning variables, and gives conditions under which these rates are valid. Although empirical and theoretical evidence suggests that these conditions are not overly restrictive, there Chapter 6: Conclusion 98 are as yet no theoretical results which analyze how useful or interesting this class of functions is. In addition, no rate of convergence results, as a function of the number of parameters in the SPORE regression function, are given in this thesis. Both of these theoretical questions are topics for future research. A specific theoretical investigation which this thesis has not addressed is the full computa-tional cost of a SPORE learning algorithm which uses domain subdivision. We know that the computational cost of a SPORE construction is linear with dimension, in each of the resulting subdivisions. However, we don't know how many subdivisions are required for a specific class of target functions. Empirical evidence presented in this thesis suggests that subdivision of the input space is not necessary (at least for the regression examples studied here), however, this may not be true in the general case. In fact, regression examples may exist where an exponential growth in the number of subdomains may be needed, as the dimension of the target function increases. This is clearly a topic for future investigation. Finally, in this thesis, we demonstrated that it is possible to do very high dimensional human-to-robot skill transfer if the operator demonstrates the task using only the robot's sensors and actuators. This is a promising result which we have only started to explore. Many more experiments need to be done in order to fully evaluate the efficacy of this approach to robot programming. Bibliography Alfeld, P. (1989). Scattered data interpolation in three of more variables. In Lyche, T. and Schumaker, L . L. , editors, Mathematical Methods in Computer Aided Geometric Design, pages 1-33. Academic Press, Boston, M A . Asada, H. (1990). Teaching and learning of compliance using neural nets: Representation and generation of nonlinear compliance. In Proceedings of the 1990 IEEE International Conference on Robotics and Automation, pages 1237-1244. IEEE. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39(3):930-945. Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133. Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press. Breiman, L . (1993). Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Inform. Theory, 39(3):999-1013. Breiman, L . (1996a). Bagging predictors. Machine Learning, 24:123. Breiman, L . (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist., 24(6):2350-2383. Breiman, L . (1996c). Stacked regressions. Machine Learning, 24:49. Breiman, L. , Friedman, J. H , Olshen, R. A. , and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California. Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355. Casdagli, M . (1989). Nonlinear prediction of chaotic time-series. Physica D, 35:335-356. Chen, H. (1991a). Estimation of a projection-pursuit type regression model. Ann. Statist., 19(1):142-157. 99 Bibliography 100 Chen, Z. (1991b). Interactive spline models and their convergence rates. Ann. Statist., 19(4): 1855-1868. Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Math. Contr. Sygnals, Syst., 2:303-314. Edelman, S. and Poggio, T. (1990). Bringing the grandmother back into the picture: A memory-based view of object recognition. MIT A.I. Memo No. 1181. Efron, B. (1983). Estimating the error rate of a prediction rule: Some improvements on cross-validation. Journal of the American Statistical Association, 78:316-331. Elder IV, J. F. and Brown, D. E. (1995). Induction and polynomial networks. In IEEE Int. Conf. on Sys. Man and Cyb., pages 874-879. Fahlman, S. E. and Lebiere, C. (1990). The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pages 524-532. Farlow, S. J. (1984). The gmdh algorithm. In Farlow, S. J., editor, Self-Organizing Methods in Modeling: GMDH Type Algorithms, pages 1-24. Marcel Dekker, Inc., New York, NY. Friedman, J. H . (1991). Multivariate adaptive regression splines. Ann. Statist., 19:1-141. Friedman, J. H . (1994a). Flexible metric nearest neighbor classification. Technical Report, Stanford University, Department of Statistics. Friedman, J. H . (1994b). An overview of predictive learning and function approximation. In Cherkassky, V , Friedman, J. H. , and Wechsler, FL, editors, From Statistics to Neural Networks, pages 1-61. Springer-Verlag. Friedman, J. H . and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc., 76(376): 1-141. Girosi, F. (1994). Regularization theory, radial basis functions and networks. In Cherkassky, V , Friedman, J. H. , and Wechsler, H. , editors, From Statistics to Neural Networks, pages 167-187. Springer-Verlag. Golub, G. H . and Van Loan, C. F. (1993). Matrix Computations. The John Hopkins University Press, Baltimore and London. Bibliography 101 Grudic, G. Z. and Lawrence, P. D. (1995). Human-to-robot skill transfer via teleoperation. In IEEE Int. Conf. on Sys. Man and Cyb.. Grudic, G. Z. and Lawrence, P. D. (1996). Human-to-robot skill transfer using the spore approximation. In IEEE Int. Conf. on Rob. andAut.. Gmdic, G. Z. and Lawrence, P. D. (1997). Is nonparametric learning practical in very high dimensional spaces? In To Appear: Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), NAGOYA, Aichi, Japan. Hardy, R. L . (1971). Multiquadric equations of topology and other irregular surfaces. J. Geophys. Res.,16:1905-1915. Hastie, T.J . and Tibshirani, R . J . (1986). Generalized additive models. Statistical Science, 1(3):297-318. Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall. Hastie, T. J. and Tibshirani, R. J. (1994). Nonparametric regression and classification. In Cherkassky, V., Friedman, J. H , and Wechsler, H. , editors, From Statistics to Neural Networks, pages 1-61. Springer-Verlag. Hastie, T. J. and Tibshirani, R. J. (1996). Discriminant adaptive nearest neighbor classification and regression. In Advances in Neural Information Processing Systems 8, pages 409^-15. MIT Press, Cambridge M A . Haykin, S. S. (1994). Neural networks : a comprehensive foundation. Macmillan, New York. Hornik, K., Stinchcombe, M . , and White, H. (1989). Multi-layer feedforward networks are universal approximators. Neural Networks, 2:359-366. Huber, P. J. (1985). Projection pursuit. Ann. Statist., 13:435^175. Igelnik, B. and Pao, Y.-H. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6(6): 1320-1329. Ivankhnenko, A . G. (1971). Polynomial theory of complex systems. IEEE Trans. Systems, Man, and Cybernetics, SMC-12:364-378. Bibliography 102 Jackson, I. R . H . (1988). Convergence properties of radial basis functions. Constructive Approximation, 4:243-264. Jones, L . K. (1992). A simple lemma on greedy approximation in hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608-613. Jordan, M . I. and Jacobs, R. A . (1994). Hierarchical mixtures of experts and the em algorithm. Neural Computation, 6:181-214. Juditsky, A. , Hjalmarsson, H. , Benveniste, A. , Delyon, B., Ljung, L . , Sjoberg, J., and Zhang, Q. (1995). Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica: Special Issue on Trends in System Identification, 31(12): 1725— 1750. Kolmogorov, A . N . (1957). On the representation of continuous functions of many variables by superposition of functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114:953-956. Kosuge, K. , Fukuda, T., and Asada, H. (1991). Acquisition of human skills for robotic systems. In Proceedings of the 1991 IEEE International Symposium on Intelligent Control, pages 469-474. IEEE. Kwok, T.-Y. and Yeung, D.-Y. (1997). Constructive algorithms for structural learning in feedforward neural networks for regression problems. IEEE Transactions oh Neural Networks, 8(3):630-645. Lorentz, G. G. (1986). Approximation of Functions. Chelsea Publishing Co., New York. Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(l):72-85. Lui, S. and Asada, H . (1992). Transferring manipulation skills to robots: Representation and acquisition of manipulation skills using process dynamics. J. Dynamic Syst. Measur. Contr., 114(2):220-228. Malitz, J. (1987). Introduction to Mathematical Logic. Springer-Verlag, New York, NY. Bibliography 103 Mel, B. W. and Kock, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems 2, pages 474-481. Michie, D., Siegelhalter, D. J., and Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, NY. Nester, J., Wasserman, W., and Kutner, M . H. (1990). Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Design. IRWIN. Parmanto, B. , Munro, P. W., and Doyle, H. R. (1996). Improving committee diagnosis with resampling techniques. In Touretzky, D. S., Mozer, M . C , and Hasselmo, M . E., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge M A . Perrone, M . P. (1993). Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Department of Physics, Brown University. Poggio, T. and Girosi, F. (1989). A theory of networks for approximation and learning. MIT A.I. Memo No. 1140. Poggio, T. and Girosi, F. (1990). Extensions of a theory of networks for approximation and learning: Dimensionality reduction and clustering. MIT A.I. Memo No. 1167. Pomerleau, D. A . (1993a). Input reconstruction reliability estimation. In NIPS5, pages 279-286. Morgan Kaufmann Pub. Pomerleau, D. A. (1993b). Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishers, Boston/Dordrecht/London. Press, W. H. , Flannery, B. P., Teukolsky, S. A. , and Vetterling, W. T. (1988). Numerical Recipes in C. Cambridge University Press, New York NY. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, C A . Rasmussen, C. A. (1996). A practical monte carlo implementation of bayesian learning. In Touretzky, D. S., Mozer, M . C , and Hasselmo, M . E., editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge M A . Bibliography 104 Reed, R. (1993). Paining algorithms - a survey. IEEE Transactions on Neural Networks, 4(5):740-747. Renals, S. and Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. In Proceedings of the International Joint Conference on Neural Networks, pages 1-461-1-467. Rouse, W. B., Hammer, J. M . , and Lewis, C. M . (1989). On capturing human skills and knowledge: Algorithmic approaches to model identification. IEEE Trans. Syst. Man Cybernet., 19(3): 558-573. Saha, A . and Keeler, J .D. (1990). Algorithms for better representation and faster learning in radial basis function networks. In Advances in Neural Information Processing Systems 2, pages 482-489. Sanger, T. D. (1991). A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Transactions on Neural Networks, 2(2). Shimokura, K. and Liu, S. (1991). Programming deburring robots based on human demonstration with direct burr size measurements. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation, pages 469^474. IEEE. Sjoberg, J., Zhang, Q., Ljung, L. , Benveniste, A. , Delyon, B., Glorennec, P. Y., Hjalmarsson, H , and Juditsky, A. (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica: Special Issue on Trends in System Identification, 31(12) :1691-1724. Stone, C . J . (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10:1040-1053. Tenorio, M . F. and Lee, W.-T. (1990). Self-organizing network for optimum supervised learning. IEEE Trans, on Neur. Net., 1 (1): 100-110. Vitushkin, A . G. (1954). On hilbert's thirteenth problem. Dokl. Akad. Nauk. SSSR, 95:701-704. Vitushkin, A . G. and Henkin, G. M . (1967). Linear superposition of functions. Russian Math. Surveys, 22:77-125. Bibliography 105 Widrow, B., McCool, J. M . , Larimore, M . G., and Johnson, C. R. (1976). Stationary and non-stationary learning characteristics of the L M S adaptive filter. Proc. IEEE, 64(8): 1151— 1162. Wolpert, D. (1992). Stacked generalization. Neural Networks, 5. Yang, J., Xu, Y , and Chen, C. S. (1994). Hidden markov model approach to skill learning and its application to telerobotics. IEEE Trans. Robotics and Automat., 10(5):421-631. Yao, X . and Lui , Y. (1997). A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks, 8(3):694-713. Zhang, Q. (1997). Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8(2):227-236. Appendix A. Approximating 3 Dimensional Functions as a Cascade THEOREM I: Let U(x,y,z) be a bounded continuous 3 dimensional function defined on the dense domain (x,y,z) £ U, where U C 9£3 and U has finite non-zero volume. Then there exists, for all b £ {1 ,2 , . . . , 6 m a x } where 6 m a x > 0 ( 6 m a x £ \), a. a set of bounded continuous 2 dimensional functions Vi(x,y) defined on (x,y) £ Sb, where Sb C 3?2 and V i , j £ {1 ,2 , . . . , ^ j) => 5; D S'; = 0]; and, b. a second set of bounded continuous 2 dimensional functions Wfc(Vj, 2) defined on (Vj, 2) £ c 3t2, SMC/I ?/ia£, /or a// e > 0 (e £ 3t), the following is true: \/(x,y,z)£U [36 £ { l , 2 , . . . , 6 m a x } [((x,y) £ Sb) A (|W f c(V 6(.r,y),z) - U(x,y,z)\ < e)] . The following are some basic definitions which are used throughout the proof of THEOREM I. Let 8 > 0 be an arbitrarily small real number. Create a uniformly spaced grid of points in -ft3 at all points (/ • 6,m • 6,n • 8), such that l,m,n £ I. Define the set U to be: U = {(l,m,n)\(l,m,n £ I) A ((/ • 8, m • 8,n • 8) £ £/)}, (2A) where the set U is defined in T H E O R E M I. Let l\, m\ £ 1 be fixed such that 3n[(l\, m i , rc) £ U]. Then, U(l\ • 8,m\- 8, n • 8) (the function U(x,y,z) is defined in T H E O R E M I) is a 1 dimensional function defined for all n such that ( / i ,mi ,n) £ U. Let / i , m i , / 2 , « ^ 2 G 1 be fixed such that 106 3n [ ( ( / i , m i , n) £ U ) A ((h,rn,2,n) £ U ) ] . Now one can define a norm, symbolized by [|• jl^ -between the two 1 dimensional functions U{1\ • 8,m\ • 8,n • 8) and £/(/ 2 • 8, ra2 • 8,n • 8) as: \\U(li • 8,mi • S,n • 8) — Ufa • 8,rn,2 • 8,n • 8) ||max = m a x • 8, mi • 8,n • 6) — U{l2 • 8, m,2 • 8,n • 8)\, (3A) where the set N is defined as: = {n\(n 6 1) A ((/ i ,mi,re) £ U ) A ( ( / 2 , m 2 , n ) £ U)}. (4A) Note that the norm \\U(h • 6,171! -8,n-8)- U{h • 8, m 2 • 6, n • <5)||max (5A) is defined if and only if 3n[{{l\,m\,n) G U ) A {(h,m2,n) £ U ) ] . The concept of an S-ordering of the 1 dimensional functions U(l • 8,m • 8,n • 8) (defined V n such that (/, m, n) £ U) with respect to the norm | | - | | m a x , is defined next (note that an S-ordering is similar to a well ordering defined in [Malitz, 1987, pp. 21-23], with the addition of some other conditions given below). Define the set T to be: T = {(l,m)\(l,m £ 1) A (3n[(l,m,n) £ U] )} . (6A) Let S C T represent a set of 1 dimensional functions U{1 • 8,m • 8,n • 8), where (l,m) £ S, which are S-ordered. Let the set V 1 C S be defined such that V ( / 1 , m 1 ) £ V 1 the 1 dimensional functions U (ll • <5, m 1 • 6,n • 8) are first in the S-ordering. Let the set Vp C S be defined such that V(lp,mp) £ Vp the 1 dimensional functions U(lp • 8,mp • 8,n • 8) are in order p (where 107 p £ I [p > 1]) of the S-ordering. Let the set VPm C S (where pm £ 1 \pm > 1]) be denned such that V(lPm,mPm) £ V P m the 1 dimensional functions U(lPm • 8, mPm • 8,n • 8) are last in the S-ordering (because T is a set which has a finite number of elements, S C T also has a finite number of elements and therefore there must exist a set VPm C S which is last in the S-ordering). Then the S-ordering of the functions U(l • 8,m • 8,n • 8), where (l,m) £ S, is defined such that the set S, and the sets Vp (\/p £ { 1 , . . . ,pm}), must satisfy the following conditions: Cond a .V( / i ,mi) , ( / 2 ,m 2 ) £ S[3n £ l[((/i, m i , n) £ U) A {{l2,m2,n) £ U)]]. Cond b.Vp £ {1,2,...pm}, if ( / i ,mi) £ V* and (/ 2 ,m 2 ) £ V? then \\U(h-6,m1-8,n-6)-U(l2-8,7n2-8,n-6)\\m^ = 0. (7 A) Also, Vp £ {l,2,...Pm}, if (h,mi) £ V and (Z 3,m 3) £ S [(/ 3 ,m 3) g Vp] then \\U{h • 8,mi • 8,n-8)- U{13 • 8,m3 • 8,n • <5)||max ^ 0. (8A) Condc .Vpi £ { l , 2 , . . . , p m - 2 } , Vp 2 £ {Pl + 1,... , P m - 1}, Vp 3 £ {p2 + 1,... ,Pm}, if ( / i ,mi) £ V * \ (/ 2 ,m 2 ) £ Vp\ and (Z 3 ,m 3) £ V ' 3 then ||W(/3 • £,??^3 • 8,n • 8) -K(h- 8,m1 • 8,n- <J)||max > (9A) ||W(/2 • 8,m2 • 8,n- 8) - U(l\- 8,mx • 8,n • <5)||max. Cond d.Choose 8 > 0 (8 £ §ff) such that V(x i , y i , 21), (x 2 , j / 2 , z 2 ) £ /7, if 11(^1,2/1,-21) - ( « 2 , y 2 , - Z 2 ) | | < 8 • s/2, then (note that e is defined in T H E O R E M I) \U(xuyu Zl) - U(x2,y2, z2)\ < (10A) 108 It is always possible to choose such a 8 because U(x,y,z) is bounded and uniformly continuous on U. Given the above 8, V p £ { 1 , 2 , . . . , p T O — 1 } , if ( / i , m i ) £ V p , and (h,rn2) £ Vp+1 then the following condition must hold: \\U(h • 6,mi • 8,n • 8) - U{12 • 8,m2 • 6,n • 8)||max < (11 A) Proof of T H E O R E M I: The proof of T H E O R E M I is based on constructions of the functions Vb(x,y) and Wb(Vb,z), and the domains Sb (for all b £ { 1 , 2 , . . . , 6 m a x } ) . These constructions are given in CONSTRUCTION I, CONSTRUCTION II, CONSTRUCTION III, and CONSTRUCTION IV of Appendix D. In CONSTRUCTION I the discrete functions V 6(x, y) and W6(Vfc, z), along with the discrete sets S 6 (for all b £ { 1 , 2 , . . . , 6 m a x } ) are constructed. In CONSTRUCTION II the discrete sets S 6 (for all b £ { 1 , 2 , . . . , 6 m a x } ) are used to construct the domains Sb. In CONSTRUCTION III the discrete functions V&(.T, y) (for all b £ { 1 , 2 , . . . , 6 m a x } ) are used to construct the continuous functions Vb(x,y), and finally, in CONSTRUCTION IV the discrete functions W 6(V 6,z) (for all b £ { 1 , 2 , . . . , 6 m a x } ) a r e used to construct the continuous functions W&(V&(a:, y),z). One should first note that a maximum value for b £ { 1 , 2 , . . . , 6 m a x } must exist because, in STEP 3 of CONSTRUCTION I, the number of elements in the set T i is finite, and because in STEP 9 of CONSTRUCTION I, Jb is updated using the equation Jb = T 6 _ i - S 6 _ i , where S f t _ ! is always a non-empty S-ordering set (non-empty because, by the definition of an S-ordering given on page 108, if T j _ i is non-empty, then there is always at least one element in T&_i which satisfies the conditions of an S-ordering and is therefore a member of the set Sj_i) . 109 Because CONSTRUCTION II chooses a value for b £ {1, 2 , . . . , 6 m a x } which minimizes the function P = ,,min \\(x,y) - (/ • S,m • 8)\\, (12A) it must be the case that V(x,y,z) £ U, 3b £ {1 ,2 , . . . , 6 m a x } such that (x,y) £ 5j. Thus the constructed approximation spans the domain of definition of the function being approximated. Also, because the value of b chosen is one which is minimum in magnitude, it must also be true that Vi , j £ {1 ,2 , . . . , bmax}[(i ^ j) St D Sj = 0] (i.e. V(x, y, z) £ U, b is unique and therefore Sb is also unique). Given that (x,y,z) £ U and (x,y) £ Sb, in CONSTRUCTION IV the function Wb(Vb(x, y), z) is calculated using the equation: Wb(Vb(x,y),z) =(W 6 (V 6 l > m • <5) - W 6 ( V 6 o , n 0 • 6)) • sv(Vb(xy)-Vbo) + s ( z - n 6 ) ( 1 3 A ) where the function Vb(x,y) is calculated, from CONSTRUCTION III, using the equation: Vb(x, y) =(V 6(/i • tf, mi • *) - V 6 ( / 0 • <5, m 0 • *)) • a * ( x ~ l * ' 8 ) + (14A) (V 6 (/ 2 • 5, m 2 • 5) - V 6 ( / 0 • .5, m 0 • 6)) • S y { y ~ ™° ' 6 ) + V 6 ( / 0 • m 0 • 5). One should note that the function Vt(.x,?/) is continuous and bounded on (x,y) £ Sb, and the function Wb(Vb(x,y), z) is continuous and bounded on (Vj,z) £ $6, because: 1. U[x,y,z) is continuous and bounded on (z,?/,z) £ (7; 2. Condition d of an S-ordering given on page 108; 110 3. STEP 6 of CONSTRUCTION I; 4. - The tessellations grids used in CONSTRUCTION III and CONSTRUCTION IV; and, 5. STEP 2.a and STEP 2.b of both CONSTRUCTION III and CONSTRUCTION IV. Inserting equation (14A) into equation (13A), one gets: Wb(Vb(x,y),z) = ( W 6 ( V 6 l , n i •6)-Ub(Vbo,n0-6))-sv (Vfc(/i • 8, mi • 8) - V 6 ( / 0 • 8, m0 • 8)) sx(x - l0 • 6) 8V 6 ( V 6 ( / 2 • 8, m2 • 8) - V f c ( / 0 • 8, m0 • 8)) sy(y - m0 • 8) + 8V 8 ( V i ( / 0 • 8,m0 • 8) - Vbo] 8v (Ub{Vb2,n2-8)-Wb(Vbo,n0-8)) • ^ ' ^ + W 6 ( V f e o , n 0 • 8). (15A) Therefore, \m(Vb(x,y),z)-U(x,y,z) ( W 6 ( V 6 l , ? i i • 8) - W j ( V 6 o , n o • 6)) • sy (Vb(h • 8, mi • 8) - V 6 ( / 0 • 8,m0 • 8)) sx(x - l0 • 8) 8V 8 ( V f e ( / 2 • 8, m2 • 8) - V f e ( / 0 • 8, mp • 8)) sy(y - ?np • 8) 8y 8 ( V 6 ( / 0 -6,7710 -8) - V 6 o ) + (Ub(Vb2,n2-6)-Ub{Vbo,n0-8)) W/j(Vfeo,?2o • 8) -U(x,y,z) sz{z — no • 8) (16A) Using the Schwartz inequality and taking out the sy, sx, sy, and sz terms (this can be done because, from CONSTRUCTION III and CONSTRUCTION IV, \sv \ = \ss \ = Is, 1), one gets: \Wb(Vb{x,y),z)-U(x,y,z)\ < ( W 6 ( V 6 l , n i -8) - W 6 ( V 6 o , n 0 •<$))• (V f e ( / i • 6, m i • 8) - V f e ( / 0 • 8, mp • 6)) (x - lp • 8) 8V ' 8 ( W 6 ( V 6 l , n i - < S ) - W 6 ( V f e o , n 0 - 6 ) ) -(V f t(/ 2 • S, m2 • 8) - Vb(lp • 8, mp • 8)) (y - nip- 8) 8y ' 8 (Ub(Vbl,n1-8)-Ub(Vb0,no-8))-(Vb(lp-8,m0-8)-Vbo) + + 8v + ( W 6 ( V 6 2 , n 2 - -6) - W 6 ( V 6 o , n 0 -6)) W & ( V f e o , n 0 • 8) -U(x,y,z) (z - np • 8) + (17A) Because of: 1. Condition d of an S-ordering given on page 108; 2. STEP 6 of CONSTRUCTION I; and, 3. STEP 2.a of both CONSTRUCTION III and CONSTRUCTION IV, the following inequalities are true: | W & ( V 6 l , r i i • 8) - W 6 (V 6 O ,72 0 • < $ ) ! < { , 5 Because of: (18A) 1. STEP 2 of CONSTRUCTION I; and, 112 2. STEP 1 of both CONSTRUCTION III and CONSTRUCTION IV, the following inequalities are true: (x -lo- 6) 8 (V - m0 •8) 8 - n0 8 Because of: 1. STEP 1 of CONSTRUCTION I; and, 2. STEP 2 of CONSTRUCTION I; and, 3. STEP 1 of both CONSTRUCTION III and CONSTRUCTION IV, the following inequality is true: \Vb{Vbo,n0-6)-U{x,y,z)\ < ^. And finally, because of: 1. Condition d of an S-ordering given on page 108; 2. STEP 6.iii of CONSTRUCTION I; 3. STEP 2.a of both CONSTRUCTION III and CONSTRUCTION IV; and, 4. STEP 2 of CONSTRUCTION IV, 113 the following inequalities are true: (V 6(/i 8, mi 6)-Mk • 8, m 0 S)) (Mh 8, 7712 8)-Vb(lo • 8, mo S)) Sv (Vj(/0 8, mo * ) - v i o ) < 1. Sv Inserting the above inequalities into equation (17A), one gets: \Wb(Vb(x,y),z)-U(x,y,z)\ < ^ + ^ + J + ^ + This completes the proof of T H E O R E M I • . 114 Appendix B. Constructing the SPORE-2 Structure In the proposed structure the approximation of f(x0,..., ZJV-I) is given by the following expansion (see Figure 1 for a 4 dimensional example of this): If (xrj, • • •, xN-i) is an element of the subdomain corresponding to tree branch P\o~i • • • Pi cri , then f(x0,.. .,XN-I) = g P l a 1 ( h P l a 1 ( x o , xi), x2) + 9pia1p2(T2 (jlpl<TlP2<T2 VlPl&l 1 X^)l ^ ( 3 ) niOd ( J V ) ) ) ^ 9pi(Ti...a, \ hpifTl...ai \ h p i < r 1 . . . < T ( l _ 1 ) , X ( l m o c j J , m 0 f j {N)\J , 1=3 (23A) where, a. A mod TV = integer remainder of (A -f- TV), and is used as an index to cycle through the independent variables of f(xo, • • •, £_/v_i); b. the subscript notation /?i <TI . . . /9;<r; (/ e { 1 , . . . , l m a x } ) is standard tree notation (see complete definition below) and serves to identify the two dimensional functions g P i a i . . . p i a i and hp^ai-.-piui which act on the specific subdomain of / (xo, . . . , XJV_I) referred to by branch P\(T\ • • • picrr, c. and, lmax is an integer defining the maximum number of terms (or levels of the tree structure) required in order, to achieve some maximum approximation error max /(x) - /(x) < e (24A) 115 where e is an arbitrary small positive real number, andV is the domain of the function / (x ) . The most basic structural characteristic of the proposed iV dimensional model is its tree structure. This structure is here referred to as the approximation tree. An example of the proposed model approximation tree structure is given in Figure 2. Two things are represented via this tree structure: 1) variables of the function being approximated are selected and acted upon by functions at each tree node; and, 2) the domains spanned by these variables are being divided up at each node in such a way that each point in the domain of the function being approximated represents one and only one path through the tree. These characteristics are explained in detail below. Tree Notation: Each level of the proposed model has 2 types of branches: p branches and a branches. These two types of branches, as is described later, are differentiated from one another by the type of discriminant functions (which are functions that dictate which path through the tree is taken) associated with them. Note that the length of a branch path and the number of branches emanating from a particular tree node is not fixed a priori, and is determined based on the mapping of the function being approximated, as the model is being built. The branches of the tree are labeled as follows: let i\, i2, ..., iii be positive integers, then level 1 branches are identified by the labeling pi — i\ for p branches and p\o\ = ii,%2 for a branches; level 2 branches are identified by the labeling p\(j\p2 = u,«2,«3 for p branches and p\o\p2(J2 = ii,i2,h,H for a branches; similarly, level / branches are identified by the labeling piai... p\ = %i,%2,..., i2i-i for p branches and pio-y... p\o\ = ii,i.2, • • - ,121-1,121 for a branches. 1 1 6 The above is an example of how a 4 dimensional function f(xQ, xx, x2,x3) is approximated in the proposed structure. The approximation is symbolized above by / . The above example shows the calculation of the approximation / along one path of the approximation tree. This path passes through the branches p\ = 1, p\(T1 = 1,1, p\0-1p2 — 1,1,1, and Picr1p2a2 = 1,1,1,1 (see Figure 2), and continues on through to branch p j C i . . - P L ^ L = 1,1, - - -, 1,1, where the subscript L defines the number of levels of the tree required to get a sufficiently good approximation along this path of the tree. As explained in the body, the proposed approximation approximates functions 2 dimensions at a time. These 2 dimensional functions are symbolized as (JP1<T1,,,PI<TI and h p i a i p i a i (the integer / defines the level of the tree). At tree branch p 1 a 1 = 1,1 of level / = 1, the functions are symbolized by <h ^ and h1:1; at tree branch Picr1p2o-2 = 1,1,1,1 of level / = 2, the functions are symbolized by <7i,i,i,i and ^ 1 , 1 , 1 , 1 ; and so on through to branch P1V1 • • - P L & L = 1,1,..., 1,1 at level / = L , where the functions are symbolized by <7i,i,...,i,i and • Figure 1 4-D Example of the Proposed Approximation 1 1 7 Rr P,= 2 D. P ° M 1 R p a , = 1 , 1 / P,a =1,2^ p a =2,1 1 I Branches o Branches Level 1 P , a l P 2 = 1,1,1 P,°,P, = 1,1,2 p , c f 2 = 2,1,1 ,P ,a l P 2 = 2.1.2 P i 0 i P f ; P f f f 2 = 1 . 1 . 1 . 1 ^ P l a l P p 2 = 1,1,2,1 ^ P 1 a , P f 2 = X \ 1,1,2,2 \ P . ° l P f \ = 1,1,2,3 \ P l ° l P f 2 = 2 . 1 ' 1 > 1 Branches Level 2 ppp?2= 2.1 A l , , < Pl° lPf 2 = 2.1.2.2 Branches The above is an example of the tree structure of the proposed model which partitions the input space of the function being approximated into subdomains that allow the function to be approximated using a superposition of a cascade of 2 dimensional functions. The example tree given shows only 2 levels. The notation used in the diagram is explained in detail within the text of the paper. Figure 2 The Space Partitioning Characteristic of the Proposed Model Each branch of the approximation tree represents a specific subdomain of the function being 118 approximated. Thus, one can define the notion of (XQ, . . . , . T J V - I ) £ tree branch p\a\ ... p\o\ (25A) to mean that the point (XQ, . . . , x-yv-i) (which is within the domain of the dimensional function being approximated) is contained within the subdomain represented by branch p\o\... p\o\. For / = 2, this subdomain is defined as follows: (xo, • • • i ^ A f - i ) £ tree branch p\o\p2(J2 ^ ( ( x 3 , x 4 , . . . , . T / Y _ I ) £ RPl A (x0,xi) £ Dpiai)A (26A) ((x0,xi,X4,x5,... , X J V _ I ) £ i2p 1 ( T l p 2 A (hPlffl(xo,xi),X2) £ D P l < T l P 2 a 2 ) , where the domains RPl, - D P l C T l , RPl<jip2-, Dp1<T1p2CT2 and the functions hPl<Tl are defined in THEO-R E M II. For / > 2, the domain represented by branch p\a\... p\o\ is defined as follows: (XQ, . . . , x /y- i ) £ tree branch p\o\ ... p\o\ ((x0, . . . , x / v _ i ) £ tree branch p\a\... / ? ; _ i c r / _ i ) A (27A) { ( X i l J • • • ) XiN-2) ^ • ^ p i ( T 1 . . . p | A ( / l p l ( T l ...pi-KTl-l 5 ^ j l ) ^ D pi<T\...pi(Tl)l where the domains RPl(Ji...pnDpiai...Pl(Tl, the functions / i p l C T l . . . p , _ 1 ( r , _ 1 , and the subscripts z i , . . . ,iN-2,ji a r e defined in T H E O R E M II. Note that the integer subscripts z i , . . . , I N - 2 ,J I are used to keep track of which variables of the function being approximated are used at which node of the approximation tree. Also, note that the modulo function used in equation (23A) and in the statement of T H E O R E M II, is used to cycle through all possible variables as often as is necessary to produce an approximation which is within some maximum allowable error. T H E O R E M II: Let f(xo,xi,... , .r/y_i) be a bounded continuous N dimensional function (N > 3) defined on the dense domain V, where V C and V has finite non-zero size. For 119 all I G 1 such that I > 1, let i\, io, • • •, i N-2 £ { 0 , l , . . . , i V — 1} and j\, ji G { 0 , 1 , . . . , N — 1 swcn £na£, ' a. ji = (/ mod N), and j2 = ((/ + 1) mod N); b. Vm,n G {1, 2,. . . , TV — 2} [(m ^ n => im ^ in) A (m > n => im > in)]; a n d c. Vm G {1 ,2 , . . . , T V - 2 } A Vn G {1, 2}[im + jn]. Then there exists, 1. V / 9 i G {1,2,3,. . .}, a set of domains RPl C K ^ - 3 where a. V i , j G { l , 2 , 3 , . . . } [ i + j => Rt n Rj = 0], and b. 3 ? G {l,2,3,...}[(Pl>i)=>(RPl =<&)]; 2. V /O I ,<T I G {1,2,3, . . .}, a set of bounded continuous 2 dimensional functions hPl(Tl(xQ,xi defined on (xo,x\) G DPl(Tl where a. DPiai C 3?2, b. V/>i,i,j G { l , 2 , . . . } [ i ^ j =4> DPlinDPlj = 0], and, c. 3i G { l , 2 , 3 , . . . } [ ( < 7 1 > O ^ ( £ p 1 « r 1 = 0 ) ] ; 3. V / ? I , < T I G {1,2, ,3,...}, a set of bounded continuous 2 dimensional functions g P i a i ( h p i a i , x2 defined on (hPlC7l,xz) £ wnere a. Q p i a i C 9£2, and b. 3 ? G { l ^ , . . . } ^ ! > 0 ( Q P l < 7 l = 0 ) ] ; 120 4. V/ 6 {2,3,4,.. .} andVpi,ai,...,pi £ {1,2,. . .}, a set of domains RPiai...Pl C ^ 2 wnere a. V/>i,CTi,...,/»j_i,<7i_i,i,.7 e {1,2,....} {} 7^ i) ^ (-Rpi<Ti...pi_1<T|_ii l~l Rp1ai...pi-1at-1j = 0) b. 3? e {i,2,3,. . .}[( / 0 />i)=>(i2 P la 1 . . . .p I = 0)]; , and, 5. VI £ {2,3,4,.. .} and V/3i, <7i, />2, cr2, • • • ,Y>/,cr; £ {1,2,3,. ..}, a set of bounded continuous 2 dimensional functions hpl(Tl...Piai(hpl<Tl:..Pl_iai_1,xJ1) defined on (hPlffl...Pl_l(ri_1, xjf) £ Dp\o\ . . . p\o-\ where 1 &• -Dpi (Ti ...p((T; C K , b. V / J i , c r i , . . . , / J i , i , j £ { l , 2 , . . . } [ ( i ^ j ) (£>pi <Ti...pii H DPlfTl,,,plj — 0)], and c. 3i G { 1 , 2 , 3 , . . . > i ) =^(Z>Pl(Tl.;.plffl = 0)]; ana", , 6. V/ £ {2,3,4,. . .} and \/pi, ai, p2, cr2, ...,pi,at £ {1,2,3,. . .}, a set of bounded continuous 2 dimensional functions gPla1...pial(hPla1...Plal,Xj2) defined on (hPiai„.Pl„nxj2) £ Qp^.-piai, where 3- Qp\ct\...picri CL 9?", and b. 3? £ {l,2,3,...}[(a />i)=>(Qp 1<r 1 . . . P l f f | = 0)]; such that (given all of the above definitions), for any arbitrarily small emax £ St where emax > 0, .121 the following is true: V(x0,xi,... ,XN-I) G T>, 3lm G {1,2,3,. . .} (V/ G {1,2, 3 , . . . , lm}(3(pi G { l , 2, 3,...} A a, G {1, 2, 3,...}))) f(x0,xi,:.. ,xN_i) - \ (28A) where fl\ is defined as follows: i . far I = 1, ((x0,xi) G D P l ( T l A ( x 3 , . . . € RPl) =>• fi\ - gPla1(hpl<Tl(x0,x1),x2), i i . V/ G {2,3, 4 , . . . , lm}, if(xQ,.... ,x JV_I ) G tree branch p\o~\... p\o-\ (note that the concept of "(x0,..., xTV-I) an element of tree branch p\o~\ ... p\a\" is defined on page 119), then — 9pi<jj...pi<Ti(ll P\(J\ ...pi<Tl V'Pl (Tl...p(_lCT(_l , Xjl ) 5 XJ2 ) • (29k) Proof: The proof of T H E O R E M II is based on constructions of the functions hPlCru_,Pl(Tl and g P i a i . . . p m , and the domains RPl(7l...Pl ' and DplCTl,„piCTl (for all / G {1,2,3,.. .} and V/9i, a i , . . . , pi, ai G {1,2,3,. . .}). These constructions are given in CONSTRUCTION V of Appendix D. L E M M A I: Let (x\, x\,..., x1N_1) G $Nr3 be a single point which satisfies equation (83A). Then there exists a domain RPl C $tN~3 such that (x\, x\,..., x\f_i) G RPl and the conditions of STEP 3.ii of CONSTRUCTION V are satisfied. 122 Proof of L E M M A I: Let 7 G 3? such that 7 > 0. Then, define the domain RPl as follows: RP1 = {(xi,...,xN_1)\((x3,...,xN_1) G ^ " ^ A (3(x0,xi,x2)[(x0,... G 2>])A (30A) ( | | (x 3 , • • - , X N - l ) - (xl, • • • ,XN-l)\\ < 7 ) A (Vi G { 0 , . . . , / 9 i - 1 } [ ( X 3 , . . . , X J V _ I ) Note that one can always choose a domain RPl which satisfies the above condition because the domain T> of the function being approximated, is dense, and because equation (83A) is true. Note that the above RPl, by definition, satisfies the condition of STEP 3.ii.a and the condition of STEP 3.ii.b of CONSTRUCTION V. Given the above definition of RPl, for 7 = 0 equation (85A) becomes: VPl(xo,xi,x2) = f(x0,xi,X2,xl,x\ . . . ,x1N_1). (31A) Because f(xo,x\,..., XJV-I) is continuous within the dense domain T>, it follows then that as 7 approaches zero, r n , ax |7 ? P l (a ;o ,x i ,X2) - /(a; 0, x i . • . , XN-I)\ 0. (32A) By the definition of continuous functions, this implies that VPl(xa,xi,x2) is also continuous. Thus, because the domain V is dense, there must exist a 7 > 0 such that equation (88A) is satisfied. Therefore, there must exist a domain RPl which satisfies the condition of STEP 3.ii.d of CONSTRUCTION V. Finally, if one chooses any value of 7 > 0 which satisfies equation (88A), then the condition of STEP 3.ii.c of CONSTRUCTION V is also satisfied because RPl is dense and because the 123 domain V of the function being approximated, f(xo, x \ , . . . ,XN-I), is also dense and finite in size. This completes the proof of L E M M A I. As shown in the proof of L E M M A I, the function VPl (XQ,X\, X2) is continuous and therefore, by T H E O R E M I, the construction in STEP 3.iii of CONSTRUCTION V is valid. Also, by T H E O R E M I, the functions hpi\,(xQ,xi) and gPlb(hPlb,x2) a r e continuous and therefore the function fPltTl(x0,xAr_1), constructed in STEP 3.iv of CONSTRUCTION V as / p i ( T i ( » 0 , • • • ,xN-l) = f(x0, • • • ,xN-l) ~ 9p1aAflPi(rAx0,xl),x2), (33A) is also continuous within the domain D o m P l ( T l . Note that, because the domains DPl(Tl,RPl,U and V are all dense and of finite size, then by STEP 3.iv of CONSTRUCTION V, the domain Dom,,^, is also dense and of finite size. Also, by T H E O R E M I and STEP 3.iv of CONSTRUCTION V, the union of the domains D o m P l ( 7 l span the domain V; i.e. V= | J D o m i ; . (34A) i € { l , 2 , . . . } j G { l , 2 , . . . } Next, by T H E O R E M I and STEP 3.iii of CONSTRUCTION V, the following is true: max \VPl(x0,x1,x2) - gPla1(hpi<Tl(x0,x1),x2)\ < ep. (35A) Thus, given that equation (88A) of CONSTRUCTION V is satisfied, one can rewrite equation (88A) as follows: max | / ( . T 0 , • • • . Z J V - I ) - VPl{x0,xi,X2)\ + max \VPl(x0,xi,X2) - gPl<,1(hPl<Tl(xo,xi),X2)\ < P • ( m a x \f(x0,...,xN_i)\ (36A) 124 Using the Schwartz inequality, the above equation becomes: f(x0,. . . ,XN_I) - VPl(x0,xi,X2) + VPl(x0,x1,X2)-max Dom„, 9Pla1(hPla1(xo,xi),x2) < f3 • ( max \f(x0,..., aw_i) | I, \ D o m w a i J (37A) or, simplifying the equation, one gets f(xo, • • • ,XN_i) - gPla1{hpia1{xo,Xi),X2) max Donip l C T l < (3 • ( max \f(x0,. . . , 2 . ^ - 1 ) 1 , , ,Dom P 1 CTj (38A) which can be expressed as max Dom 0 . a fPlCTl(xQ,...,xN-i) < / ? • ( max \f(x0,. . . , X J V - I ) | ) • Dom„, CT1 (39A) Now consider the statement and proof of L E M M A II. L E M M A II: Let (x^,..., x\N_^j £ $tN~2 be a single point which satisfies equation (94A). Then there exists a domain RPl(J-i...pt_lUl_lPl C such that (x^,..., x}N_^J £ RPlai...Pl-1ai-lPl and conditions o / S T E P 7.iiL2 of CONSTRUCTION V are satisfied. Proof of L E M M A II: The lemma is first proven for the case / = 2. Let 7 £ 9£ such that 7 > 0. Then, define the domain R P l ( T l P 2 as follows: R j (xil,. . •, XiN_2) I ( 2 ^ ,. . ., X{N_2) £ 3? ^ (3(xjl,Xj2)[(x0,.\. .,xN_i) £ D])A ( ( X Q , . . . , XN-I) £ tree branch pia\)A A (40A) . ) - ( 4 . - ^ L ) | < 7 ) A (Vi £ {0,...,/)2 - l}^,-,,...,^^) ^ i2pi f f l i ] )}-125 Note that one can always choose a domain RPlfflP2 which satisfies the above condition because, as shown above, the domain, D o m P l ( r i , of the function being approximated, fPlUl (^0, • • •, xN-i), is dense and because equation (94A) is true. Note that the above RPiaiP2, by definition, satisfies the condition of STEP 7.iii.2.a and the condition of STEP 7.iii.2.b of CONSTRUCTION V. Given the above definition of RPl(TlP2, for 7 = 0 equation (97A) becomes: ^Pi<TlP2{hpL!TL,X2,X3) = fPl„1(xl,x\,x2,X3,x\,...,x1N_1). (41A) Because the function being approximated, fpuTi(xo, • • • ^ ' J V - i ) . is continuous within the dense domain D o m P l ( T l , and because the function hpj(Ti is continuous (as proven in T H E O R E M I), it follows then that as 7 approaches zero, max \VPllXlp2(hpLLTL,X2,X3) - fPl(Tl (an, a i , • • •, z / v - i ) | -» 0. (42A) PI "\ P 2 By the definition of continuous functions, this implies that VPl<Tlp2(hPl(Tl, X2, £ 3 ) is also contin-uous. Thus, because the domain DomPl(Ti is dense, there must exist a 7 > 0 such that equation (100A) is satisfied. Therefore, there must exist a domain -Rp 1 ( T ] / 9 2 which satisfies the condition of STEP 7.iii.2.d of CONSTRUCTION V. Finally, if one chooses any value of 7 > 0 which satisfies equation (100A), then the condition of STEP 7.iii.2.c of CONSTRUCTION V is also satisfied because RPl(7lP2 is dense and the domain D o m P l J ] of the function being approximated, fpl<Ti (xrj, • • •, xN-i), is also dense and finite in size. This completes the proof of L E M M A II for the case / = 2. The proof of L E M M A II for the cases / 6 {3,4,...} is given next. 126 As shown above, the function VPl(jlP2(hPiai,x2,£3) is continuous and therefore, by THE-O R E M I, the construction in STEP 7.iii.3 of CONSTRUCTION V is valid. Also, by THE-O R E M I, the functions hPiaiP2(J2 and gPlalP2a2 are continuous and therefore the function fpi<TlP2a2(xo, S J V - I ) , constructed in STEP 7.iii.4 of CONSTRUCTION V as fpi<Tip2cr2 (XQ, • • • , a-'Af — l ) = jPl <7] (XQ, . . . , Xyyr_ 1 ) —- gplu\P2a2 Olpi<Tip2cr2 5 ^ 3 ) , (43A) is also continuous within the domain DomPlCJlP2CT2. Note that, because the domains Dpi<TlP2cr2, R P l a i P 2 , U and V are all dense and of finite size, then by STEP 7.iii.4 of CON-STRUCTION V, the domain Dom P l o\p2o2 is also dense and of finite size. Also, by T H E O R E M I and STEP 7.iii.4 of CONSTRUCTION V, the union of the domains D o m P l ( T l P 2 ( T 2 span the domain D o m P l ( T l ; i.e. D o m p i ( T l = [ J D o m P l ( T l i r (44 A) ^ { 1 , 2 , . . } ./ G {1-2....} Therefore, given that fPl<jlP2cr2(xo,... , xyy_i) is continuous within the dense finite domain Dom P l ( T l p 2 ( T 2 , the proof of L E M M A II for the cases / £ {3,4,...} follows by induction the proof for the case / = 2. Note that for / £ {3,4,...} the domains R P i a i . . . P l are constructed as follows: Rpl<T1...Pi — {(^ii? • • • , xix-i)\{x%\i • • •) xiN_2) E 2 A ( 3 ( X J I , . T J 2 ) [ ( X 0 , . . . ,ia?AT_i) £ D]) ((xo, • • •, XN-I) £ tree branch p\(?\ ... /0/_i<7/_i)A (45A) ||(x,-,,..., xtN_2) - (xl,..., x\N_2) || < 7A Vi £ {0,. . . ,pi - l } [ (x . t l , . . . , x J N _ 2 ) i RPl<j1...Pl„1cTl-1i] }• 127 Also, note that, by induction, and by T H E O R E M I and STEP 7.iii.4 of CONSTRUCTION V, the union of the domains DomPl(Tl...p((T( span the domain Domp 1 < 7 l... ( 0 (_ l O- l_ 1; i.e. Dom P l ( r i...p I_ 1 ( r i_ ] = ( J Dom / ? l 0 . 1 . . . w _ l £ r , _ 1 i r (46A) z G { l , 2 , . . . } j G { l , 2 , . . . } This completes the proof of L E M M A II. Next, by T H E O R E M I, and STEP 7.iii.3 of CONSTRUCTION V, the following is true for the case 1=2: x3)\<ep. (47A) Uom CTJ P2 a~2 Thus, given that equation (lOOA) of CONSTRUCTION V is satisfied, one can rewrite equation (100A) as follows: max \fp1<T1(xo,...,xN^i)-/Pp1a1p2(hp1„1,x2,xs)\ + UOHlp^ tjy P2&2 max \/Pp1<rip2 (x0, xl, X2) — Qp^ai P2O2 i]lPi<JiP2(T2 {hpitri, x2), xz) I 5i (A<2\\ P - I max |/p l C T l(a y J J O I T l p ^ cr^ p2 o"2 [ . T 0 , • • • , z / y _ i , Using the Schwartz inequality, the above equation becomes: /PI<TI(ZO, • • • ,xN-l) - Vpl(Tlp2{hpiUl1X2,xz)+ max DomPlCTlP2 CT2 ~Ppi<J\P2 {xG-> xli x2^ 9p\o\p2<72 ( ^ p i T 1 P 2 U 2 {hpi&i, x2~) 1 xs) P-[ max \fPiai(x0,... ,xN_i)\], < (49A) or, simplifying the equation, one gets max Dom P i a 1 P2 "2 fpiUl {X0l • • • 5 XN — l ) 9pi<TlP2<72 0lPl<riP2&2 ( ^ P l « T l , X2)l XZ) < P-\ max \fp1<Tl(xo,. . . ,xN-i] . L/OITlrti <T 1 Do (TO 128 (50A) Therefore, for the cases / £ {3,4, . . .} , it follows by induction that jp\G\...px-xUl-x (X0l • • • ) XN —1)~ Dom max Qp\<J\...plVl {]lpi(Jl...pi<Tl (hpi<Tl...pi-lCTl-l ) Xjl)-I XJ2) (3 • I max \fpi<JI••.pi-i(Ti-i{xQi • • • •> x N ) \ D o m P 1 C T 1 . . . p 1 ( r ( / (51 A) or, simply, Dom max fp < Ip\<Ti...pi<jXxQi • • • ixN—l) (3 • [ max \fPl<T1...Pl-1cTl-.1(x0,-••,xN-l)\)' \DomPiai...Pl<Tl J Thus, given equation (39A), it follows that (52A) Dor max 11>\ "i • • • PI fpi(Tl...pi(Tl{X0-> • • • 1 X N — l) </? ' • ( max | / ( s 0 , • • -,XN-l] Dom, (53A) Therefore, because 0 < (3 < 1 and because the function being approximated, f(xo,. • . , Z J V - I ) is bounded, it follows that there exists an / E {1,2,...} such that, for any arbitrarily small > 0, (3 • max |/(x0,...,a;jv-i)| <£r \ Dom c (54A) Thus, because equation (34A) and equation (46A) are true, it must be the case that equation (28A) of T H E O R E M II is satisfied. This completes the proof of T H E O R E M II • . 129 Appendix C. Rate of Convergence The approximations developed in the previous two appendices are here be applied to nonparametric regression. Let (X, Y) be a pair of random variables such that X is of dimension TV, where iV > 3, and Y is of dimension one. Let 6(X) = E(Y\X) be the regression function of Y on the measurement variable X. Let 9n, n > 1, denote an estimation of 9 given the n random samples (Xu Yi),..., (Xn,Yn) of the distribution (X, Y). Let the estimators 9n be based on the nonparametric model given in equation (23A) of Appendix B. Let the two dimensional functions gPl<r1...pl<rl and hPl(Tl,,,Pl(Tl of 9n be constructed as defined in CONSTRUCTION V of Appendix D with 0(x) = / (x ) , x = (x0, xi,..., xN_i) G V. Assume that the function 9 = f, and the domain V, are as defined in T H E O R E M II, with the additional constraint that 9 is p-times differentiable (p > 1) on V. As defined in [Stone, 1982], assume that the distribution of Y depends on the value of the measurement variable x and takes the form h{y\x.,t)4>(dy), where <f> is a measure on 9t and t = 0(x) is the mean of the distribution as defined by: T H E O R E M III: Assume that conditions 1-3 defined in [Stone, 1982] are true. Assume that 9 and 9n are as defined above. Then, there exist a finite set of dense domains Sk C V, where k G { 1 , . . . , K} (K finite), and | J Sk = T>, such that, given n random samples of 9 from any (55A) K k=l 130 9 — 9n < 0 ^ ( n ) 1 log (n)^ where domain S}., the rate of convergence of 6n to 9 on Sk is r = p/(2p + 3). Proof: Because the dense domains D o m p i ( T l P i ( T i are finite in number and span V (see equation (46A)), in order to prove this theorem, it is sufficient to establish the rate of convergence of the error functions fpitJi pl(Ti(xo-> • • • > xN-i)> where n is the number of random samples from the corresponding domain 5& = D o m p i ( T l . . . P l f f r As defined in equations (91A) and (103A) of CONSTRUCTION V in Appendix D, the error function after the construction of / levels is given by: fp1*1...pl*l(xo, • • -,xN-i) =9(x0,... , X J V - I ) - 9pla1{hp1ff1(xo,x1),x2) + 9p\<J\P2ff2\'lp\<JiP2'J2 (hpiai,X2),Xi) + I ^^9pi(Ti...ai(hp1<T1...tTi(hp1<Ti...(T{i^1),xri mod w ) ) ' x ( ( ; + i ) mod w ) ) > (56A) We prove T H E O R E M III by showing that \\f^1<ri...pi<ri(x0, • . . , xN-i)\\OQ < O^n)'1 log ( n ) ) " given n random samples from the corresponding domain Sk = Dom P l ( T l . . . P i C T r As defined in CONSTRUCTION V of Appendix D, the functions hPiai,,.Pl(Ti and g P l o 1 . . . p l ( j l in (56A) are constructed from the 3 dimensional functions VPl and VPiai...Pl_1(xl_lPl (see equations (85A) and (97A)) which are in turn constructed from the function 0(x) = / (x) being approximated. However, 0(x) is unknown, and therefore, VPl and VPlUl...Pl_xUl_lPl must be approximated from n random samples from the domain TPl I) D o m P l ( T l and TPl(Tl...Pl D Dom ( 0 l f f l . . . P j ( T i respectively (see equations (87A) and (98A)). We denote this by defining Vpi and ^>p1<T1...pi-1ai-1pi t 0 D e approximations of VPl and VPl<Tl...Pl_1<Tl_ipl, respectively, based on n from 131 there respective domains. The following lemma establishes the optimal rate of convergence for the estimation of \\VPl - ^ " J ^ and \\VPiai...Pl_1(ri_lPl - V^iai_pl_1<Tl_lPl W^, as a function of n random samples. L E M M A III: Let the functions VPl andV pitJl^.Pl_1(Tl_lPl, for all I £ { 2 , . . . , lm}, be estimated using the local average estimator of [Stone, 1982]. Then the optimal L°° rate of convergence of each estimator Vp\ and VpiCTl...Pl_l(Tl_lPl *s C ( n _ 1 logn) r , where r = p/(2p + 3), and n is the number of points used in constructing the estimator. Proof of L E M M A III: This lemma is proven by showing that, given the assumption in Theorem III, all functions VPl and Vpi(7l,,,pi_1(7l_ipi meet the conditions defined in [Stone, 1982]. First consider the estimation of VPl. By equation (85A) and equation (55A), the expected value VPl takes the following form: / 6(x0,xi,..., VPl(x0,xi,x2) R, •PI pi R, •PI (57A) Therefore, the distribution of VPl is of the form / -h(y\-x.,t)dRPl <f>(dy). (58A) 132 Because (58A) is an appropriately scaled version of h(y\x.,t)</>(dy), it follows that given that h(y\x.,t)<f>(dy) meets conditions 1-3 defined in [Stone, 1982], then the distribution in (58A) must also meet these conditions. Also, note that because 6 is p-times differentiable, then given the above calculation, it must be the case that VPl is also p-times differentiable. Therefore, given that VPl is a p-times differentiable 3 dimensional regression function, then by Theorem 1 of [Stone, 1982], the optimal L°° rate of convergence is \\VPl — ^ " J l < O ( n _ 1 l o g n ) r , where r = p/(2p + 3). Next consider the construction of the 2 dimensional functions hPl<Tl and gPl(Jl in STEP 3-iii of CONSTRUCTION V, and functions hPltTl...PltTl and g P l t T l . . . P l t r i , for all / £ { 2 , . . . , lm}, in STEP 7.iii.3 of CONSTRUCTION V. As defined in CONSTRUCTION III and CONSTRUCTION IV, these functions are constructed using linear interpolation based on VPl. Because of equations (91 A), (97A), and (103A), in order to complete the proof of L E M M A III, we must ensure that the 2 dimensional functions h P l ( T l , g P l t T l , hPl/Jl...Plljl and gPxcx...Pla{ are at least p-times differentiable (that being the number of times that the regression function 9 is differentiable). Without loss of generality, this condition can be satisfied by representing h P l C T l , gPlCTl, hPl(Tl.„PllTl and gPlCj1...Pl(jl, for all / £ { 2 , . . . , lm}, using, for example, 2 dimensional spline approximations, of differentiability p or higher (i.e. instead of the linear interpolation scheme defined in CONSTRUCTION III and CONSTRUCTION IV). Given that the functions f 1 p l l C J l . . . P l _ 1 c T l _ 1 of equation (97A) are at least p-times differentiable, the proof of the lemma for the functions VPiai...pl_iai_lPl, for all / £ { 2 , . . . , l m } , follows the same steps outlined above for VPl, and is therefore not be presented here. This completes the 133 proof of L E M M A III. Given L E M M A III, we can establish the rate of convergence of the error functions fp1<T1...plai(xo-,---->xN-i)- Consider the rate of convergence of the first function fPl<Jl. Us-ing an analysis similar to that found in [Barron, 1994] and referring to CONSTRUCTION V of Appendix D, the norm of /™1(Tl is given by (unless otherwise stated, the Loo norm is over the domain of the respective function fPl(Tl,,,P[ai(xo, • • • ,xN-i), for a H / = 1, 2,.. .): \\r;^ L ^ II / - ^ " o o + \\vP1 - vnpi L (59A) Letting ep = \\VPl — ^ " J l < 0(ji~l \ogn)r and using equations (88A) and (39A), equation (59A) can be written as: n P iM o o < m\oo + *v (60A) In order to guarantee convergence we choose an a 6 $ such that 0 < (3 < a < 1 and equation (60A) satisfies: 1 1 / ^ 1 1 0 0 ^ 1 1 ^ 0 0 + ^ < « l l * l l o o ( 6 1 A ) Similarly, the Loo norm of f P l ( T l P 2 ( T 2 is given by: ||/pi<7lP2<T2 lloo — 11/piO-l ~~ ^ V ^ ^ H o O + H^PlO-lPS ~~ ^ P i <7i p2 \ | QQ (62A) Once again, letting ep = \\VPl(7lP2 — ' P p i a l P 2 ^ O ( n _ 1 l o g n ) r and using equations (100A) and (50A), equation (62A) can be written as: 11/pUp, J o o ^ J L + **< "WK* lloo ^ « 2 | ^ l l o o (63A) 134 Thus, by induction, the norm of /p 1 < T l... P ( ( T i is given by: \\fn I I < l l f ™ —v I I 4-Wv -Vn I I || J p1a1...pi<ri\\00 — \\J pi<Ti...pi-1cn-i ' picri—pi IIQQ ~ \\' p\<J\...pi r p 1 a i . . . p i \ \ 0 0 <P\\fp\,1...Pl-1*l-1\L + h (64A) < « l l / ; 1 . 1 . . . p I _ 1 . I _ 1 I L < « ' P i i 0 0 Therefore, construction stops at level / when, at level / + 1, equation (64A) is no longer true; this is defined by the following condition: | | i p i < 7 i . . . p i < T i H o o — l ^ \ \ f p 1 c r 1 . . . p i - i a i - 1 IIQO + eP — a l l ^ l l o o A N D (65A) \\fpiCT1...pi+1al+1 IQQ ^ a Halloo The stopping condition given in equation (65A) is met if and only if the number of points used in constructing Vpir7l...piaipl+1 is insufficient for error reduction; in other words, the following is true: I K . , . . P , + 1 - ^ ; U . . . P J L < o(n-Hogn)r i H I / P % , . . P J L -0\\&w,\L (66A) Using the inequality \\fp,fTl...pl_lCTl_1\\oo < a1 l\\9\\OQ, then for all fp1(7l„.pl(7l satisfying equation (64A), the following is true: / ^ ' - ^ I L + ^ a ^ l L (67A) Solving equation (67A) for a^PHoo, we get: 1 3 5 Using the inequality ||/™1<Tl...P(_ : 1 (T(_11|^ < of 1\\0\\OQ, and inserting equation (68A) into equation (64A), we get the final result: H/piaL.-Pif f lHoo ^ P a V_ p + 6P < (9 (n _ 1 log n) This completes the proof of T H E O R E M III • . 136 Appendix D. The Main Theoretical Constructions CONSTRUCTION I: The Construction of V 6 (x ,y) , W 6(V 6 ,z), and S 6 STEP 1. Choose 6. > 0 (8 £ 9fc) such that V(xx,yi, zi),(x2,y2, z2) £ U, if IK^i,Vi,zi) - (x2,y'2,z2)\\ < 8 • y/2, then \U(x1,y1,z1)-U(x2,y2,z2)\ < | . (70A) STEP 2. Create a uniformly spaced grid of points in 3?3 at all points (/ • 8,m • S,n • 8), such that / ,m ,n £ 1. Define the set U to be U =•{(*, m,ri)|(Z,m,ra £ 1) A ((/ • 8, m • 8,n • 8) £ U)}. (71A) [Thus the set U contains all points which are within the set U and which lie on the uniformly spaced grid of points in 3?3 defined above. (Note that the italic text within the square brackets indicates comments.)] STEP 3. Define the set T i to be T i = {( / i ,mi ) | ( / i ,mi £ 1) A (3n £ mun) £ U])}. (72A) [The set T i represents the set of all one dimensional functions U{1\ • 8, mi • 8,n • 8), where (l\,m{) £ ~X\.] STEP 4. Let 6 = 1. [The index b indicates that discrete functions Vft(x,y) and W J ( V J , Z ) , along with the discrete set S& are constructed.] 137 STEP 5. Choose an arbitrary point (!o,mo) G T j . Construct the set Sj such that: a. S 6 C T f c. b. (k,m0) G Si; c. V ( / i , ? ? 2 i ) E Sj, the 1 dimensional functions U(l\ • 8,mi • 8,n • 8) form an S-ordering as defined above; d. V( /2 ,m2) G T fc [ ( / 2 , « i2 ) 0 the 1 dimensional functions U(h • 8, rn2 • 8, n • 8) cannot be inserted within the S-ordering of the 1 dimensional functions U{li • 8, mi • 8,n • 8), where m i ) G S;,. [Thus the set Sb represents an S-ordering of the functions U(li • 8, mi • 8,n • 8) (V(?i,mi)"e Sb).] STEP 6. Let C S 6 be the set of all points (J1,?™1) G Sfe which are first in the S-ordering of the 1 dimensional functions U(l • 8,m • 8,n • 8) (V(/, m) E Sb). Similarly, Vp G { 2 , 3 , . . . , p m } , let Vf C Sfc be the set of all points (lp,mp) G Sfc which are in position p of the S-ordering of the 1 dimensional functions U(l • 8,m • 8,n • 8) (V(/, m) G S 6). E X E C U T E the following steps: i . V ( / 1 , m 1 ) G set the value of the 2 dimensional function Vj to be V ^ / 1 -8,ml -8) = 0. i i . Let p = 2. i i i . WHILE p < pm E X E C U T E the following steps: 138 1. Let (p-^mP-1) G V p 1 . V(lp}mp) G VPb set the value of the 2 dimensional function Vb to be: Vb(lp -8,mp-8) = V^P-1 • W 1 • 8) + a, (73A) where a = \\U(lp-6,mp -8,n- 8)-U(lp~1 • 8,mp-1 • 6,n- 8)\\ 2. Let p = p + 1. (74A) Vp G { l , 2 , . . . , p m } , V(lp,mp) G Vf and Vn such that G U, set the value of the 2 dimensional function Wj to be: W F E ( V 6 ( / P • 6, mp • 8), n-8)= U{lp • 8, mp -8,n-8). (75A) [Therefore, at all points (/ • 8,m • 8,n • 8), where (/, m) G Sb and n G I [(/, ra, n) G U], the function WJ(VJ(Z • 6, m • <5), n • <5)« egua/ to the function U(l • 6,m • 8,n • 8).[ Define.the set Qb as: Q& = j(<*,/3)|Vp G {1 ,2 , . . . ,pm} j v ( / p ,m p ) G V£ [Vn such that (lp, mp', n) G U (76A) [(a = V 6 ( P - 6 , m p • 8)) A (ft = n • 8)]\ [Therefore, Qb contains the set of all possible independent variable values of the function Wb(Vb,n • 8).[ 1 3 9 STEP 7. Let 6 m a x = 6. [bma.x represents the current maximum number ofVb(x, y) and W J ( V J , z) functions, and Sb sets.] STEP 8. Let 6 = 6 + 1 . [Thus the next set ofVb(x,y) and l/lb(yb,z) functions, and. the next set Sb will be constructed next.] STEP 9. Let Tb = Tb_i - S 4 _ i . /T/ie se? Tj represents the set of all one dimensional functions U(lb • S,mb • 6,n • 6), where (h,mb) £ T i , which have not been used in the construction of the functions Vft(x, y) and W J ' ( V J , 2 ) , and sets Sb. W / ien Tj = 0 //ze construction is complete.] STEP 10. IF T f c ^ 0, GOTO STEP 5 ELSE END CONSTRUCTION I. /"STEP 5 through STEP 9 are repeated until all one dimensional functions U(l\ • 6,mi • <5, n • S), where (h,mi) £ Ti , have been used in the construction of the functions Vb{x, y) and W J ( V J , 2 ) , and sete Sb.] C O N S T R U C T I O N II: The Calculation of the Domains Sb The function of CONSTRUCTION II is, given a point (x, y) £ 5t2 such that 3z[(x, y, z) £ U], determine which domain Sb, b £ {1 ,2 , . . . , b m a x ] , the point (x, y) belongs to. STEP 1. Let / ? m i n = . r m i n , [,Mnc IK^^) - ( /-.^m-^)ll )• ( 7 7 A ) iG{i,...,omax} y^rajes, / 140 [Therefore, ', Pmin is the distance from the point (x,y) £ 9ft2 to the closest point (lb.8,mb • 8) where (h,mb) 6 Sb and be {1, 2 , . . . , b m a x } . ] STEP 2. Let ^min be the minimum value of i £ {1 ,2 , . . . , &max} such that (/,m)gSv (78A) — Pmin-[Thus, im[n is the first i which satisfies the above condition p = pm[n. Therefore, for each point (x, y), there is one and only one assigned value for i m in-7 STEP 3. Let b = imin. The point (x,y) is therefore in domain Sb.. [Therefore, each point (x,y) belongs to one and only one domain Sb.] STEP 4. END CONSTRUCTION II. C O N S T R U C T I O N III: The Calculation of the functions Vb{x,y) The function of CONSTRUCTION III is, given a point (x,y) G Sb calculate the function STEP 1. Given the triangular tessellations grid in Figure 3 find the triangle which contains the point (x, y) (if the point lies at the intersection of 2 or more triangles, arbitrarily choose any one of these triangles). Let (IQ • 8,mo • 8), (li • 8, m\ • 8), and (l2 • 8,m2 • 8) be the vertices, as defined in Figure 4, of the triangle containing the point (x,r/). Let Vj(/o • 8, mo • 8), Vfc(/i • 8, mi • 8), and Vb(l2 • 8, m2 • 8) be the values of the function at these vertices. Vb(x,y). 141 STEP 2. Then Vb(x,y).is calculated as follows: Vb(x, y) =(V 6(/i • 6, mx • «S) - Vj(/ 0 • 8,mQ • 8)) • S * ( X - J ° ' 6> + (79A) (V 6 (/ 2 • 8, m 2 • 8) - V 6 ( / 0 • 8, m 0 • 6)) • * * ( y ~ ™° ' ^ + V 6 ( / 0 • 8, m0 • 8), where for an upper triangle (as defined in Figure 4) sx = sy = — 1 and for a lower triangle sx = sy = +1. Note that if any of the points (/o,mo), ( I i ,mi) , or (h,^) are not elements of then the value of Vj at that point must be chosen such that: a. Vz,j G {0,1,2} \\Vb(li-6,mi-6)-Vb(lj • f>,mj • 6)\\ < 8 , and b. the resulting surface Vb(x,y) ((x,y) E Sb) is continuous. [Therefore the function Vb(x,y) is calculated as a linear interpolation between the vertices of the triangle which contains the point (x,y).] STEP 3. END CONSTRUCTION III. C O N S T R U C T I O N IV: The Calculation of the functions Wb(Vb(x,y),z) The function of CONSTRUCTION IV is, given Vb(x,y) and z, where (x,y,z) E U, calculate Wb(Vb(x,y),z). STEP 1. Given the triangular tessellations grid in Figure 5 find the triangle which contains the point (Vb(x, y), z) (if the point lies at the intersection of 2 or more triangles, arbitrarily choose any one of these triangles). Let (Vj0,nn • 8), [Vbl,ni • 8), and (Nb2,n2 • 8) be the vertices, as defined in Figure 6, of the triangle containing the point (xo,yo). Let Wj(V60,no • 8), Wj(V6j,ni • 8), and Wfc(Vj2,n2 • 8) be the values of the function Wj at these vertices. 142 STEP 2. Then Wb(yb(x,y), z) is calculated as follows: sv(Vb(x,y) - V f c o ) Wb(Vb(x,y),z) = ( W J ( V 6 l , m - 5 ) - W 6 ( V 6 o , n 6 - 5 ) ) <5y (W 6 (V f c 2 , n 2 • 5) - W 6 ( V 6 o , 77 0 -6))- S z < y Z + W 6 ( V f c o , n 0 • 5), (80A) where for an upper triangle (as defined in Figure 6) sy = sz = — 1 and for a Zower triangle sy = sz = +1, and min ||V&0 — C K | | , for sy = —1 mm — V j 0 | | , tor sy = +1 v y ( a , ^ ) e Q 6 , Q ' > V 6 o Note that if any of the points (Vbo, no • <*>), (V^, rai • 6), or (Vfe2 , n 2 • 6) are not elements of Qj , then the value of Wj at that point must be chosen such that: a. G {0,l ,2}[ | |w 6 (V 6 i ,n f - -6) - W 6 ( v 6 y , n j < f ] , and b. the resulting surface Wb(Vb(x,y), z), V(x, y, z)[((x, y) G Sj) A (x,y,z) G (7], is continuous. Also, if the ^ does not exist as calculated in (81 A), then choose 6y such that: $v = |W 6(V 6 l,rai -6)- W6(Vfco, n0-6) |. (82A) [Therefore the function Wb(Vb(x, y), z) is calculated as a linear interpolation between the vertices of the triangle which contains the point (Vb(x,y), z).] STEP 3. END CONSTRUCTION IV. 1 4 3 Triangular tessellations grid used to construct the function Vj(x,y). Note that there are two types of triangles formed using this tessellations gird: a lower triangle and an upper triangle (see Figure 4). Figure 3 Triangular Tessellation Grid for Vb(x,y) C O N S T R U C T I O N V: Approximating Functions of N Dimensions Using Functions of 2 Dimensions (JV > 3) STEP 1. Let Ro = 0. [The first domain RQ is initialized to be an empty set.] STEP 2. Let P l = 0. [The integer p\ is used as a pointer to the domains RPL.] STEP 3. WHILE it is true that 3(aJ3, • • -,XN-I) pi ; 3 , . . . ,XN-I) £ [j Rt I A ( 3 ( x 0 , x 1 , x 2 ) [ ( x 0 , . . . , £ J V - I ) € V]) i=0 (83A) 144 2 0 0 2 lower triangle upper triangle The point £ Sfc can be located in either a lower triangle or an upper triangle of the tessellations grid defined in Figure 3. For both types of triangles, vertex "0" is located at point (fa • 8, rnQ • 8) £ IR2 and the value of the function Vj at this point is V&(7rj • 6, mo • 6); vertex "1" is located at point (fa • 8,m\- 8) £ !Jc2 and the value of the function at this point is Vj(/i • 8, mi -8); and finally, vertex "2" is located at point (fa • 8, m2 • 8) £ 5t2 and the value of the function V& at this point is Vb(h~ 8,m2 • 8). E X E C U T E the following steps: [Therefore, at the completion o /STEP 3, the entire domain of definition of the function being approximated will be spanned, along the axis (x$, x±,..., X J V - I ) , by the union of all the domains Ri, for all i £ { 0 , 1 , . . . , pi}.] i . Let pi = pi + 1. [This initiates the construction of the next domain RPl.] i i . Choose RPl C $t,N~3 such that Figure 4 Lower A n d Upper Triangles of Sb a. V ? £ {l,...,(pi-l)}[RlnRl •pi ~ [Thus the domains RPl are all disjoint, satisfying definition La of THEOREM II.] 145 Triangular tessellations grid used to construct the function VVj,(Vb(x,y),z). Note that there are two types of triangles formed using this tessellations gird: a lower triangle and an upper triangle (see Figure 6). Note also that the vertical lines are not regularly spaced: the vertical line spacing is defined by the value of a in STEP 6.iii of CONSTRUCTION I. Figure 5 Triangular Tessellation Grid for Wb(Vb(x, y) , z) b. 3 ( x 0 , x i , X 2 ) [ ( ( x o , x i , . . . , X J V - I ) E T>) A ( ( x 3 , x 4 , . . . ,XN-I) e RPl)], and> [Thus the domains RPl contain points which correspond to the domain of definition of the function being approximated.] c. The domain Rpi is dense. d. The boundaries of the domains RPl are chosen such that a 3 dimensional approximation, along the axes (x 0 , X I , x 2 ) , of the function / ( x 0 , . . . , Z J V - I ) within the domain RPl, is at least better than no approximation at all. Or, put formally, this condition is satisfied as follows. Define the 3 dimensional 0 1 lower triangle upper triangle The point (Vb(x,y),z) can be located in either a lower triangle or an upper triangle of the tessellations grid defined in Figure 5. For both types of triangles, vertex "0" is located at point (Vfc0, no • 8) and the value of the function at this point is W& (V&0,720 • 8); vertex "1" is located at point (Vjj ,n\ -8) and the value of the function at this point is W^V^, n\ • 8); and finally, vertex "2" is located at point (Vj 2 , n2 • 8) and the value of the function at this point is W&(V&2, n2 • 8). Figure 6 Lower A n d Upper Triangles of (Vb(x,y),z) function VPl(xo,xi,x2), defined on the domain UPl = {(x0,xi,x2)\((xo,xi,... ,XN-I) G £>) A ((x3,... ,xN_x) E RPl)}, (84A) to be J f{xo,xi,.. .,XN-i)dRi pi VPi(XQ,X1,X2) = R0 I dRpi (85A) Ro Let P G 3? such that 0 < P < 1, and let tp G 3ft such that 0 < £p < P • ( max \f(xo, • • • ,XN-I)\ ), where the domain TPl is defined as (86A) TPi = {{XQ,XI,. . . ,XN-i)\{[xQ,Xi,.. . ,XN-\) G V) A ((x3,£4, • • • € RPl)}-(87A) 147 Then, the domains RPl must be chosen such that max \f(x0,... , X J V _ I ) - VPl(x0, x1} x2)\ + ep < 0 - ( max | / ( x 0 , • • . , x ^ - i ) | ) • (88A) [In LEMMA I it is shown that such a domain Rp^ must exist. The proof of LEMMA I is based on the construction of a domain RPl which satisfies all of the above conditions.] i i i . Let W(x, y , z) = VPi(XQ,XI,X2) and U = { ( . T , y , 2 r ) | ( ( x 0 , . . . , x 7 v - i ) G Z>) A ( ( x 3 , . . . , X J V - I ) G # P l )A (89A) ( x = x 0 ) A (y = X i ) A (z = x2)}. Use CONSTRUCTION I, CONSTRUCTION II, CONSTRUCTION III and CON-STRUCTION IV to construct the functions Vb(x, y) and W&(V&, z), and the domain Sb (V6 G {1 ,2 , . . . , 6 m a x } ) . Note that the e used in CONSTRUCTION I must be chosen to satisfy equation (86A) of this CONSTRUCTION; i.e. t < tp. For all be {1 ,2 , . . . , 6 m a x } , let hpib(xQ,xi) = Vb(x,y), g P l b ( h p i b , x 2 ) = Wb{Vb,z), and D P l b = Sb. Also, V6 € { ( 6 m a x + 1), (femax + 2),...} let D P l b = 0. [Thus the domains D p i b and the functions hpib(xo, x-y) satisfy definition 2 of THE-OREM II, and the functions gPlb(jlplb-> x2) satisfy definition 3 of THEOREM II (for all b G { l , 2 , . . . , 6 m a x } | 7 iv. For all a\ G {1 ,2 , . . . , 6 m a x } define the N dimensional function fPl(Tl (x0,..., xN_i), V ( x 0 , • • • , Z J V - I ) G Dom p i ( 7 ] where Dom P l £ r i = { ( x o , . . . ,XN-I)\((XQ, . . .,xN-x) G V) A ( ( x 3 , x 4 , . . . ,XN-I) G -R P l)A (90A) ( ( x 0 , x i ) G I>p1(Ti) A ( ( x 0 , x i , x 2 ) G UP1)}, 148 as fp1(ri{xo,. • • ,xN_x) = f(x0,... ,XAf - i ) - gPla1(hp1<7l(xo,xi),X2). (91A) [Thus, the function fPiai (xo, • • •, £ J V - I ) represents the approximation error along branch p\G\ for the case lm = 1 /n equation (28A) of THEOREM II. Note that if max l / p i o - i (aJo, - - • , ^ J V - i ) | < ^ m a z , (92AJ V ( x o , . . . , x N - i ) 6 D o m p i [ r i r/ien r/ie approximation along branch pi&i is complete and the branch need not be extended.] STEP 4. Vz G {(pi + 1), (pi +2),...}, let Rt = 0. [This satisfies definition Lb of THEOREM II] STEP 5. Let / = 2. [This initiates the construction of level I = 2 of the approximation tree.] STEP 6. Leu ' i , i 2 , . . . , * ' jv-2 G {0,1, • • •,-/V - 1} and j i , j 2 6 { 0 , 1 , . . . , TV - 1} such that, a. j i = (/ mod (TV - 1)), and j2 = ((I + 1) mod (TV - 1)); b. Vm, n G {1 ,2 , . . . , TV - 2} [(m ^ ra = M m / i „ ) A ( m > n im > in )]; and, c. Vm G {1, 2,..-., TV — 2} A Vn G {1, 2}[im + Jn]. STEP 7. For all P i , c ^ , . . . , / > / _ ! , c r ^ G {1,2,3,.. .} and (note that for the case / = 2, Pi,°~i,.. - , p i - i is read as P l , and /3 I ,<JI , ... , / > j _ i , <T/_X is read as / > i , c r i ) V(xo, • • • , £ j V - i ) G tree branch p \ o \ . . . p\-\Oi_\, such that 149 a. max \fp1a1...Pl_1ai-1{xo,---,XN-l)\> tmax, (93A) ( i o , . . - , W - i ) € D o m w „ ] . . . P i _ 1 j | _ 1 where the domain Dom P l< 7 l... P (_ 1 ( 7 (_ 1 is defined in STEP 3.iv and STEP 7.iii.4 of this CONSTRUCTION, and, b. ^ 0 and •D P l < r i . . .p,_ l f f |_ 1 7^ 0, E X E C U T E the following steps: [The goal of the following steps is to continue the construction of the tree until each branch has an approximation error which is less than the maximum allowable error. The steps executed are similar in function to the first 4 steps of this CONSTRUCTION, with the exception that the function being approximated at branch pi, a\,..., p / _ i , c / _ i is denoted by fp1<Jl...pl_iai-l [xo, • • • ,xN-i)J i . Let i? P l ( T l . . .p ,_ 1<Ti_iO = 0-[The first domain RPla1...Pl_lai_lo = 0 is initialized to be an empty set.] i i . Let pi = 0. [The integer p\ is used as a pointer to the domains RPla1...Pi-1<Tl-iPr] i i i . WHILE it is true that, ^(Xil !••••> XlN-2 ) (3(xn, xJ2)[(x0,..., X J V _ I ) G DA (94A) (.TO, • • • , XN-l) £ i r c c b r a n c h p i (Tl . . . / 9 / _ l C T ^ _ l J ' Pi (.Tjj, . . . , X{N_f) Rpia1...pi_1(rl-ii I A 150 E X E C U T E the following steps: [Therefore, at the completion of STEP 7.iii, the entire domain of definition of the function fPla1...Pl_1(Tl_1 (so, • • •, xN-l) be spanned, along the axis (xix,xl2,...,XiN_2), by the union of all the domains R P l t T 1 . . . P l _ l t j , _ 1 i , for all i G {0 ,1 , . . . ,pi}. Note that the goal of this step is similar to goal of STEP 3.7 1. Let pi = pi + 1. [This initiates the construction of the next domain RPltJl...Pl_1(Jl_lPl-] 2. Choose R P l ( T l . . . P l _ 1 c T l - l P i C $lN~2 such that a. \/i G {1, ...,(/>; — 1)} [Rpicr1...Pi_1cri-1i H Rpiai...Pi_1ai-lPi = 0] * [Thus the domains R p u T l . . . p l _ 1 c T l _ l P l are all disjoint, satisfying definition 4.a of THEOREM II] b. 3(xn,xj2)[((x1,.. .,xN) G D)A {(XQ, xN_x) G tree branch p\o-\ ... cr/_i)A ( ( X i l , • • • , X i N - i ) ^ 7?p1(7l ...Pl_y(Tl_lPl)\ and, [Thus the domains R P x l T l . . . P l _ 1 t T l _ l P l contain points which correspond to the domain of definition of the function being approximated.] c. The domain R P l ( T l . . . P t _ 1 ( T l _ l P l is dense. d. The boundaries of the domains RPl<j1...Pl_1<jl-1pi are chosen such that a 3 di-151 mensional approximation, along the axes (hPl(Tl...pl_iai_1,xji:xj2), of the function / P l ( T l . . .p 1_ 1 ( T l_ 1 (xo, • • • ,XN-I) within a domain Rp1<T1...pl_1<rl_lPl, is at least better than no approximation at all. Or, put formally, this condition is satisfied as follows. Define the 3 dimensional function 'Pp\o\...pi-\oi-\pi (hpiai...pi-iai-i 5 xji i XJ2)i (95A) defined on the domain UPl(Tl...pi_l(Tl_lpi = { (hpiai...pi-1ai-i 5 xjn x j 2 ) \ ( ( x 0 i • • • , XN — l ) £ T>)A ((x0, • • •, x^-i) G tree branch pia\... p i ^ a ^ A (96A) { ( X i l 1 • • • 5 XlN-2) G - R p 1 f J 1 . . . p , _ 1 f J i _ 1 p ( ) } , to be J / p io - i . . . p i - i o - i - ! dYi ^Pp1(T1...pi-i(Ti-1pi(hpia1...pi-1ai-n xjn xj2) = j ^JJ (97 A) n where II = TPiai...Pl_1(Tl_lPl n UPl<Tl...Pl_1(Tl_lPl and the domain ?p 1 ( T l . . . p i _ 1 (T ,_ 1 p i is defined as Tpiai...pi_1<Tl_xPi = |(x 0, • • • ,xyv_i) | ((x0,... ,XN-I) G D)A ((x0,. •.-, XAr_!) G tree branch p\G\ ... p/_i<7/_i)A (98A) ((xjj , . . . , XiN_2) G Rpxoi ...pi^ai-iPi) j* • Let /? G 5t such that 0 < /? < 1, and let e# G 3t such that 0 max |/p1ffi...p,_iff,_i(zo,---,zw-i)| )• (99A) [lij ,...,XiN JGTp 1 < rj...p (_ 1o- i_ 1/> i 152 max T Then, the domains RPl<T1...p,_1<Tl_1pl must be chosen such that fp\(j\ ...p\-\<ji-\ (^0, • • • j xN — l)~ 1 ~ > p i c r l . . . p l _ l c T l _ l p l ( h p l C r l . . . p l _ l c T l _ l , X j l , X j i > ) < (100A) P ' \ T m a x \fpi<Tl...pi-1<n-AxQ,...,xN-i)\y \J- P1a1.-P{_1al_1Pl J [In LEMMA II it is shown that such a domain R P l t T l . . . p l _ l t T , _ l P l must ex-ist. The proof of LEMMA II is based on the construction of a domain Rpi<Ti...pi-i<xi-ipi which satisfies the all of the above conditions.] 3. Let lA(x, y , z) = T J p i a - 1 . . . p l _ 1 a i _ 1 p i ( h p i a i „ . p l _ 1 ( r i _ 1 , xjt, XJ2^ and U = { ( x , y , z ) | ( ( x 0 , . . . , x 7 v - i ) G £>)A ((x0,zjv-i) G tree branch pxux ... px_xox_{)N (101A) 5 • • • 1 XlN-2) G i?p 1 0'i...pi_i (7 l_iPl) A (s = ' V K T L - W - I ^ - I ) A (y = xy j A (2 = x j 2 ) } . Use CONSTRUCTION I, CONSTRUCTION II, CONSTRUCTION III and CONSTRUCTION IV to construct the functions Vb(x,y) and Wb(Vb,z), and the domain Sb (Vfe G {1 ,2 , . . . , 6 m a x } ) . Note that the e used in CONSTRUCTION I must be chosen to satisfy equation (99A) of this CONSTRUCTION i.e. t < tp. For all ^ G {1 ,2 , . . . , 6 m a x } , let hpi<j\...pib(]ipia\...pi-i<Ti-iixji) — ^bi,xiy)i (Jpicr1...plb(hp1a1...plb,x32) = Wb(Vb,z), and DpiCTl^.pib = Sb. Also, V6 G {(femax + 1), (&max + 2), . . .} let Dpi<Tl...pib = 0. 153 [Thus the domains D p i ( T l pii = 0 and the functions hpi<Ti...pib{hPjai...Pl_iai_1,Xj1) satisfy definition 5 of THEOREM II, and the functions gPla1...pib(hpi(Tl...plb, x n ) sansfy definition 6 of THEOREM II (for all b G {1, 2 , . . . , 6 m a x } ; j 4. For all a\ G {1 ,2 , . . . , 6 m a x } define the TV dimensional function fpi<ri...pi(T,(xo, • • . , XN-I), V ( x 0 , . . . ,a:j\r-i) G Dom P l ( T l...p i ( T, where D o m f t f f i . . . p i o i = {(xo,---,xN_1)\((x0,...,xN-i) G V)A ((xi1,... ,XiN_2) G RPl Tl...pi )A ( ( x i , . . . , XN) G tree branch pio- i . . . / O / _ I < J / _ 1 ) A (102A) ((hp, <J\...pi-\Gl-\ ) ) G Up1(Tl..,pl_lCrt_1pl)A as P\U\...pi<Tl ((hp1<Tl...pl_1al_l,Xjf) G L)p 1 o-i . . .p(<7() 15 >o, • • • , X J V - I ) =/p1<Ti...p,_i<T,_1(a;o, • • • , X J V - I ) -(103 A) 9p1ai...pi<Ti(hp1(T1...pian x j 2 ) -[Thus, the function fp1a1...piai(xo,..., XjV - l ) represents the approximation er-ror along branch p\o\ ... p\cri for the case lm = / in equation (28A) of THE-OREM II Note that if max |/p 1 ( T l . . .p ( ( T i(xo,..., x jv- i ) | < V(x0,...,XN-i)eDomp i ( J i m < r t (•max, then the approximation along branch p\o\ ... p\o\ is complete and the branch need not be extended.] Vi G { ( M + l ) , ( ^ + 2), . . .}, let Rp [This satisfies definition 4.b of THEOREM II] 154 STEP 8. IF 3p1,a1,...,pt,o-i 6 {1,2,3,...} such that, V(x 0 , • • • , X J V - I ) G tree branch p\o\... p\o\, [The goal of this step is to search out branches pi,o-\,..., pi,o~i which still have an approximation error which is greater than the maximum allowable error.] a. . max \fPi<T1...picri(xo,..., XN-I)\ > emax, (104A) (xo, • • •, x N-1 ) € D o m P l CT1.. p , at where the domain DompifTl...piai is defined in STEP 3.iv and STEP 7.iii.4 of this CONSTRUCTION, and, b. RPiax...Pl ^ 0 and DPl(7l...Piai ^ 0, T H E N E X E C U T E the following steps: i. Let / = / + 1. [This initiates the construction of level I of the approximation tree.] i i . GOTO STEP 6 ELSE END CONSTRUCTION V. 155
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Nonparametric learning from examples in very high dimensional...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Nonparametric learning from examples in very high dimensional spaces Grudic, Gregory Zlatko 1997
pdf
Page Metadata
Item Metadata
Title | Nonparametric learning from examples in very high dimensional spaces |
Creator |
Grudic, Gregory Zlatko |
Date Issued | 1997 |
Description | Constructing predictive models or mappings from sample data is an important goal in many areas of science. Most of the research in this area has been directed towards relatively low dimensional models; however, many real world problems can be large and very high dimensional. Such problems require general learning methods which are fundamentally different from those used to construct low dimensional models. In this thesis a new nonparametric regression methodology (SPORE) is proposed for very large and very high dimensional learning problems: problems with more than 10,000 learning examples of regression data having 100 or more inputs. The SPORE regression model is constructed incrementally by adding small, low dimensional parametric building blocks one at a time, using the outputs of previously added blocks as inputs to the new ones. This process forms stable regression functions from relatively few training examples, even when there are a large number of input variables. Furthermore, SPORE demands little computational effort to choose between candidate building blocks or inputs, making it computationaly feasible in very high dimensional spaces. SPORE challenges two basic mainstream notions found in contemporary learning algorithms. First, it questions the need to simultaneously fitting large high dimensional structures to model complex high dimensional interactions. Second, it rejects the need for "greedy", computationaly expensive searches used to finding the next "best" building block to add to a regression function. SPORE also allows for the subdivision of the domain of the input space to make incremental construction both computationaly and theoretically feasible. Conditions under which the rate of convergence of the method is independent of the dimension of the data are established. It is also shown that the computational complexity of constructing a SPORE-type regression model is linear with respect to dimension within each domain subdivision. In addition, conditions are given under which no domain subdivision is necessary. The most basic version of this regression methodology (SPORE-1) is implemented and empirically evaluated on four types of data sets. The SPORE-1 learning algorithm is completely automatic and requires no manual intervention. First, SPORE-1 is applied to 10 regression problems found in the literature and is shown to produce regression functions which are as good or better, with respect to mean squared approximation error, than published results on 9 of these data sets. Second, SPORE-1 is applied to 15 new, synthetic, large, very high dimensional data sets (40,000 learning examples of 100, 200, 400, 800, and 1600 inputs) and is shown to construct effective regression functions in the presence of both input and output noise. Third, SPORE-1 is used to build mappings from input/output learning examples generated by a human using tele-operation to execute an 'object locate and approach' task sequence. SPORE-1 effectively builds this mapping and directs a robot to autonomously execute the task demonstrated by the human operator, even in a visually cluttered environment with an unpredictable background. Finally, SPORE-1 is successfully applied to the 10-bit parity problem to demonstrate its efficacy on problems which have flat low dimensional projections, thus showing that it is not subject to the same limitations as other algorithms that build regression functions using low dimensional parametric building blocks, added one at a time. |
Extent | 9489394 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-03-31 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065207 |
URI | http://hdl.handle.net/2429/6679 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1997-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1997-250636.pdf [ 9.05MB ]
- Metadata
- JSON: 831-1.0065207.json
- JSON-LD: 831-1.0065207-ld.json
- RDF/XML (Pretty): 831-1.0065207-rdf.xml
- RDF/JSON: 831-1.0065207-rdf.json
- Turtle: 831-1.0065207-turtle.txt
- N-Triples: 831-1.0065207-rdf-ntriples.txt
- Original Record: 831-1.0065207-source.json
- Full Text
- 831-1.0065207-fulltext.txt
- Citation
- 831-1.0065207.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0065207/manifest