NONPARAMETRIC LEARNING FROM EXAMPLES IN VERY HIGH DIMENSIONAL SPACES

By

GREGORY ZLATKO GRUDIC
B.A.Sc. (Engineering Physics), University of British Columbia, 1987
M.A.Sc. (Electrical Engineering), University of British Columbia, 1990

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August 1997
(c) Gregory Zlatko Grudic, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical and Computer Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Date:

Abstract

Constructing predictive models or mappings from sample data is an important goal in many areas of science. Most of the research in this area has been directed towards relatively low dimensional models; however, many real world problems can be large and very high dimensional. Such problems require general learning methods which are fundamentally different from those used to construct low dimensional models. In this thesis a new nonparametric regression methodology (SPORE) is proposed for very large and very high dimensional learning problems: problems with more than 10,000 learning examples of regression data having 100 or more inputs. The SPORE regression model is constructed incrementally by adding small, low dimensional parametric building blocks one at a time, using the outputs of previously added blocks as inputs to the new ones. This process forms stable regression functions from relatively few training examples, even when there are a large number of input variables. Furthermore, SPORE demands little computational effort to choose between candidate building blocks or inputs, making it computationally feasible in very high dimensional spaces.

SPORE challenges two basic mainstream notions found in contemporary learning algorithms. First, it questions the need to simultaneously fit large high dimensional structures to model complex high dimensional interactions. Second, it rejects the need for "greedy", computationally expensive searches used to find the next "best" building block to add to a regression function. SPORE also allows for the subdivision of the domain of the input space to make incremental construction both computationally and theoretically feasible. Conditions under which the rate of convergence of the method is independent of the dimension of the data are established. It is also shown that the computational complexity of constructing a SPORE-type regression model is linear with respect to dimension within each domain subdivision. In addition, conditions are given under which no domain subdivision is necessary.
The most basic version of this regression methodology (SPORE-1) is implemented and empirically evaluated on four types of data sets. The SPORE-1 learning algorithm is completely automatic and requires no manual intervention. First, SPORE-1 is applied to 10 regression problems found in the literature and is shown to produce regression functions which are as good or better, with respect to mean squared approximation error, than published results on 9 of these data sets. Second, SPORE-1 is applied to 15 new, synthetic, large, very high dimensional data sets (40,000 learning examples of 100, 200, 400, 800, and 1600 inputs) and is shown to construct effective regression functions in the presence of both input and output noise. Third, SPORE-1 is used to build mappings from input/output learning examples generated by a human using teleoperation to execute an 'object locate and approach' task sequence. SPORE-1 effectively builds this mapping and directs a robot to autonomously execute the task demonstrated by the human operator, even in a visually cluttered environment with an unpredictable background. Finally, SPORE-1 is successfully applied to the 10-bit parity problem to demonstrate its efficacy on problems which have flat low dimensional projections, thus showing that it is not subject to the same limitations as other algorithms that build regression functions using low dimensional parametric building blocks, added one at a time.

Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgment
Chapter 1: Introduction
  1.1 Constructing Models from Data
  1.2 The Curse of Dimensionality
    1.2.1 Rate of Convergence of Nonparametric Regression
    1.2.2 Computational Complexity of Nonparametric Regression
  1.3 The SPORE Approximation
    1.3.1 A Conceptual Summary of SPORE
    1.3.2 Theoretical Results
    1.3.3 The SPORE-1 Regression Function
    1.3.4 Why Does SPORE-1 Work and When Does It Work Best?
    1.3.5 Experimental Evaluation of the SPORE-1 Approximation
  1.4 Plan of Presentation
Chapter 2: Review of Nonparametric Regression
  2.1 Regression Algorithms: Addressing the Curse of Dimensionality
    2.1.1 Dimensionality Reducing Algorithms
    2.1.2 Space Partitioning Algorithms
  2.2 Regression Model Variance Reduction Techniques
  2.3 Convergence Results in Nonparametric Regression
  2.4 Comparing Existing Algorithms to the SPORE
Chapter 3: The SPORE-1 Regression Function
  3.1 SPORE-1: A Cascade of Two Dimensional Functions
    3.1.1 A Cascade of Bounded Continuous 2 Dimensional Functions
    3.1.2 A Cascade of Finite Order 2 Dimensional Polynomials
  3.2 Learning Algorithm for the SPORE-1 Regression Function
    3.2.1 The SPORE-1 Learning Algorithm
  3.3 Theoretical Analysis of the SPORE-1 Learning Algorithm
    3.3.1 Convergence Results
    3.3.2 Complexity Results
    3.3.3 Internal Variance Reduction
    3.3.4 Virtual Space Partitioning
  3.4 Summary
Chapter 4: Experimental Results
  4.1 Implementation Details of the SPORE-1 Regression Function
  4.2 Evaluating SPORE-1 on Standard Regression Problems
    4.2.1 Regression Data Used by Breiman
    4.2.2 Regression Data Used by Rasmussen
    4.2.3 Regression Data Used by Jordan and Jacobs
    4.2.4 Experimental Results
  4.3 Evaluating SPORE-1 on Synthetic High Dimensional Problems
  4.4 Human-to-Robot Skill Transfer Experiments
    4.4.1 The Experimental Testbed
    4.4.2 Experimental Results
  4.5 The 10 Bit Parity Problem
  4.6 Summary
Chapter 5: The General SPORE Methodology
  5.1 The Complete SPORE Definition
  5.2 Theoretical Analysis of Rates of Convergence For Two Specific SPORE Structures
    5.2.1 The SPORE-2 Regression Function
    5.2.2 The SPORE-1 Regression Function with Domain Partitioning
  5.3 Rate of Convergence and Computational Complexity
  5.4 Summary
Chapter 6: Conclusion
  6.1 Future Work
Bibliography
Appendix A. Approximating 3 Dimensional Functions as a Cascade
Appendix B. Constructing The SPORE-2 Structure
Appendix C. Rate of Convergence
Appendix D. The Main Theoretical Constructions

List of Tables

Table 4.1 Performance of SPORE-1 on Published Data
Table 4.2 Evaluation of SPORE-1 on High Dimensional Data Sets

List of Figures

Figure 1.1 The Mapping Problem
Figure 1.2 The SPORE-1 Structure
Figure 2.1 Model Variance
Figure 3.1 The SPORE-1 Regression Function
Figure 3.2 A Cascade Representation of a Polynomial Term
Figure 4.1 Increase in Approximation Size with Data Dimension
Figure 4.2 The Human to Robot Skill Transfer Test Bed
Figure 4.3 A "Tall Spruce Tree"
Figure 4.4 A Red and Blue Striped Tower
Figure 4.5 A Red and Blue Striped Tower with Holes
Figure 5.1 The SPORE Structure
Figure 5.2 A Tree-Like Subdivision of Input Space
Figure 5.3 A 4 Dimensional Example of SPORE-2

Acknowledgment

Many thanks to my supervisor Peter Lawrence for his support, patience, and above all, for never showing any doubt. His perspective on my research is invaluable to me and I could not have hoped for a better supervisor.

I am eternally grateful to my wife, Jane Mulligan, for being my best friend and most valued colleague.
Much of the work presented here is shaped by her insights, and she has spent many long hours reading and re-reading various versions of this thesis.

This thesis is dedicated to my mother Julijana and my father Josip.

Chapter 1: Introduction

All scientific predictions are based on models of the phenomena being predicted. This is equally true when one is predicting the path of a projectile as when one is trying to predict tomorrow's weather. All of these models are constructed starting from empirical observations of the phenomena of interest. These empirical observations, in whatever form they occur, are in essence input and output examples of what the model is intended to predict. The resulting predictive models can be of causal relationships between input and output or vice versa (such as predicting the path of a projectile given initial velocity, or conversely, predicting the initial velocity given its path), or they can be models reflecting some correlation between variables arbitrarily labelled as inputs and outputs, that have other, as yet unidentified, causal factors. Whatever the nature of the phenomenon in question, the goal is always the same: to convert empirical observations into a model structure which allows accurate prediction of future observations.

Building models from data often has two specific goals: the first is to predict output values given future inputs; the second is to use the resulting model to interpret or analyze the data. Interpretation of models involves analyzing which inputs most affect the output, as well as how these inputs interact with one another. Such analysis can reveal important information about the phenomenon that generated the data. This is especially true when relatively few input variables can effectively predict output values. However, the specific focus of this thesis is model prediction accuracy, and thus the topic of data analysis is largely ignored. One justification for this is that the models we are interested in are very high dimensional, making analysis difficult when many inputs combine in complicated nonlinear ways.

Most major theoretical and practical work on general methods for building models has been done in mathematics, in the form of function approximation, and in statistics, in the form of regression. More recently, the neural network and machine learning communities have addressed this general modeling problem, mainly using variations of tools which have their origins in mathematics and statistics. Many of these endeavors have met with much success, and many diverse fields, ranging from the physical to social sciences, have benefited from the modeling tools that have emerged.

In studying the literature on general modeling methods, it is apparent that most of the theoretical and practical work in this area has been applied to problems which are relatively small and low dimensional. Relatively little effort has been specifically directed towards general modeling methods where there are many potential inputs, and the model is to be built based on a very large data set. Numerous examples of large high dimensional regression problems can be found in such diverse fields as meteorology, economics, and robotics. In meteorology, one can imagine predicting tomorrow's weather given temperature, pressure, humidity, and cloud covering readings from hundreds of weather stations surrounding the area of interest.
In economics, predicting the major economic indicators based on a detailed time sampling of various economic indices can be a profitable venture. In robotics, building mappings between a robot's sensor inputs and actuator outputs can be considered to be a problem in regression, if, as in human-to-robot skill transfer, one has access to examples of the mapping (see Chapter 4 and [Grudic and Lawrence, 1995] [Grudic and Lawrence, 1996] for details). All of these regression problems have one characteristic in common: a large number of factors influence the output, and therefore, it is not easy to choose a small set of inputs which effectively explain the output.

The goal of this thesis is the theoretical and experimental study of building very high dimensional models from very large data sets. The type of modeling problem addressed typically has at least 100 inputs, and there are 10,000 or more empirical input-to-output examples of the phenomenon of interest. To this end, a methodology, termed SPORE, for building such regression models is presented and theoretically and experimentally analyzed. From a practical standpoint the end result of this thesis is a regression algorithm and structure, called SPORE-1, which can be used to build such large models. A key characteristic of SPORE-1 is that no a priori knowledge of the phenomenon of interest is required for model construction: all that the user must provide is a file containing the learning data.

1.1 Constructing Models from Data

The problem being addressed in this thesis is deceptively simple to state. There is a data set containing M input/output examples (x_i, y_i) for i = 1, ..., M, symbolized by {x_i, y_i}_{i=1}^M. The inputs x = (x_1, ..., x_N) are N dimensional and the outputs y = (y_1, ..., y_k) are k dimensional. The goal is to find some mapping function f̂ such that:

\hat{y} = \hat{f}(x)     (1.1)

where ŷ is some estimate of y given x. Once constructed, f̂ can then be used to predict y given some future input x. In the general case, the data {x_i, y_i}_{i=1}^M is generated by some unknown phenomenon or function f(χ), where χ = (u, v) is a P dimensional input vector. χ can have 2 types of inputs: inputs u which do not appear in the data set {x_i, y_i}_{i=1}^M, and inputs v which do appear in the data {x_i, y_i}_{i=1}^M. Therefore, one is often required to construct f̂ using an incomplete set of inputs (i.e. we may not have access to the inputs u). A further difficulty arises when the accessible inputs x include some unknown set of inputs which do not affect the output vector y (i.e. not all inputs found in x will also be found in v). This is diagrammatically represented in Figure 1.1. Not knowing which inputs actually have some definable relationship to the output is just one of the potential difficulties encountered in the general model construction problem.

[Figure 1.1: The Mapping Problem]
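To make the setup of Figure 1.1 concrete, the short Python sketch below is an illustrative example only (the toy target function and all names are invented here, not taken from the thesis): the recorded inputs x contain the relevant inputs v together with irrelevant ones, while an unobserved input u also drives the output and therefore looks like noise to any model built from {x_i, y_i} alone.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 1000                                      # number of input/output examples

# Unknown phenomenon f(chi), chi = (u, v): u is never recorded, v is.
u = rng.normal(size=M)                        # hidden input (not in the data set)
v = rng.uniform(-1, 1, size=(M, 3))           # recorded inputs that affect the output
irrelevant = rng.uniform(-1, 1, size=(M, 2))  # recorded inputs with no effect on y

# A toy target function, invented purely for illustration.
y = np.sin(v[:, 0]) + v[:, 1] * v[:, 2] + 0.3 * u

# The learner only ever sees (x_i, y_i): x mixes relevant and irrelevant inputs,
# and the contribution of the hidden input u is indistinguishable from noise.
x = np.hstack([v, irrelevant])
print(x.shape, y.shape)                       # (1000, 5) (1000,)
```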
The process of constructing models from data has been studied in a variety of fields and, not surprisingly, has been given a different name in each field. In applied mathematics this process is termed function approximation, and one speaks of approximating the dependent variable y as some function of the independent variable x. In statistics, the process of model building is called regression, and the terminology used is as follows: estimate the responses y given the explanatory or predictor variables x. More recently, the machine learning and neural network communities have termed this type of model building supervised learning, and typically refer to learning the mapping between input variables x and output variables y. In this thesis, the process of building models from data is studied from the point of view of applied mathematics (function approximation) and statistics (regression).

Referring to Figure 1.1, function approximation typically deals with the case of u = ∅ (i.e. the empty set) and v = x. Thus, function approximation assumes that the independent variables x exactly define the dependent variables y, and the goal is to determine how well one can model f given some defined structure for the approximating function f̂. Regression, on the other hand, deals with the case when u ≠ ∅ and/or v ≠ x. Typically, this is represented in the following form:

y = f(x) + \epsilon     (1.2)

where ε is a noise term or, more formally, a random variable obeying some probability distribution. In most instances, ε is assumed to be normally distributed with mean zero and fixed standard deviation. Hence, from the perspective of regression, not knowing whether the given inputs completely define the output is treated as estimation noise.

There are two basic approaches to constructing the function f̂: parametric and nonparametric. In the parametric approach, knowledge about the nature of the problem is used to define the structure and size of the function f̂. Typically f̂ will contain a set of parameters that need to be assigned values using some numerical algorithm such as gradient descent, least squares, or simulated annealing (to mention just a few). For example, if f̂ is a polynomial with a finite number of terms, then the parameters which are "learned" using sample data are the polynomial coefficients. A key characteristic of the parametric method is that because both structure and size are fixed, the space or class of functions representable by any single model f̂ is inherently constrained. It is because of this characteristic that parametric models are often fast to compute and extremely effective when they are representative of the phenomenon being studied. Parametric models are in fact the best form of model to use when the nature of the problem is well understood.

In contrast, nonparametric models are most useful when the nature of the problem is poorly understood. In nonparametric modeling, the basic structure of the model is usually defined, but not the number of parameters in the final constructed function f̂. One example of a nonparametric model is a polynomial which does not have a fixed number of coefficients. A nonparametric model grows in size in order to effectively model the desired data. Thus, in the case of a polynomial, new terms are added until a sufficient model is constructed. In theory, a nonparametric model is able to represent a greater space of functions than a parametric model. However, because nonparametric regression does not have a fixed parameter space, the task of constructing models becomes considerably more difficult than with parametric models. As a result, nonparametric models are typically more time consuming to construct. The specific problem being addressed in this thesis is model construction when little or no information about the phenomenon being studied is available. Thus, the main focus of this thesis is nonparametric model construction, or nonparametric regression.
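The following sketch is an illustration of the parametric/nonparametric contrast on a single input variable, not an algorithm from the thesis: the parametric fit fixes the polynomial degree in advance, while the simple nonparametric variant keeps growing the polynomial until the training error falls below a tolerance (both the tolerance and the toy data are assumptions made for this example).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=200)   # y = f(x) + noise

# Parametric: the structure and size (degree 3) are fixed in advance.
parametric_coeffs = np.polyfit(x, y, deg=3)

# Nonparametric (in spirit): grow the model, adding terms until it suffices.
tolerance = 0.02
for degree in range(1, 20):
    coeffs = np.polyfit(x, y, deg=degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    if mse < tolerance:        # stop growing once the fit is "sufficient"
        break

print("parametric degree: 3, nonparametric degree chosen:", degree)
```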
As indicated above, there are difficulties associated with nonparametric regression. One of the most pervasive of these difficulties is the curse of dimensionality [Bellman, 1961], which is briefly addressed below.

1.2 The Curse of Dimensionality

From a practical standpoint, the curse of dimensionality has two major implications for high dimensional nonparametric regression. The first results from the rate of convergence of nonparametric regression, and the second from the computational complexity of nonparametric regression. If we are to build regression algorithms which work well in high dimensional spaces, we must find ways of addressing both of these issues.

1.2.1 Rate of Convergence of Nonparametric Regression

There are a number of theoretical results associated with the rate of convergence of nonparametric regression (see Section 2.3 of Chapter 2 for details). In this thesis, we are mainly concerned with rate of convergence results which measure the decrease in approximation error as the number of sample points used to construct the regression function increases. For example, Stone [Stone, 1982] has shown that in general, and under appropriate regularity conditions, the number of sample points required to approximate a bounded continuous N dimensional function which is at least once differentiable and defined over a closed domain grows exponentially with dimension N. This means that it is not feasible to densely sample high dimensional space (e.g. 10 or more input variables), and therefore, in order to approximate high dimensional functions, one must impose some type of smoothness constraint on the approximation [Friedman, 1994b]. In fact, all learning methods impose such a constraint. Nonparametric methods differ from parametric methods in that they impose fewer such constraints, and are thus able to form more flexible approximations.

Imposing a smoothness constraint on the approximation means that fewer sample points are required to determine the parameters of the approximation. However, as is argued throughout this thesis and elsewhere [Friedman, 1994b], the problem associated with approximating high dimensional functions is not strictly due to the number of input variables, but rather due to the underlying complexity of the function being approximated. We measure complexity as the number of samples necessary to build an effective nonparametric approximation (see Section 4.3 of Chapter 4 for an example). Clearly it is possible to have a very complicated 1 dimensional function, or a very simple high dimensional function. Thus, one can argue that a regression problem is difficult because it requires a large number of learning examples, which is not by necessity related to the dimension of the problem. Our goal, as model builders, should therefore be to find regression algorithms which perform well in high dimensional spaces when relatively few learning examples (i.e. tens of thousands) are sufficient to build adequate models.

The goal, when formulating structures for very high dimensional nonparametric regression, is two-fold. First, we must ensure that the rate of convergence of the regression function is not overly dependent on the dimension of the underlying function which generated the learning data. In other words, the rate at which the nonparametric regression function converges to some stable function should not depend adversely on the dimension of the problem.
Second, one must have an understanding of what class of functions is representable by a given nonparametric regression technique. If the data being used to construct the regression function is generated by a function which belongs to this class, then the regression function will converge to it. In this case the approximation error converges to zero. However, if the function generating the sample points does not belong to this class of functions, then the approximation error will converge to some finite non-zero value. From a practical standpoint, if this approximation error is greater than the maximum error that is appropriate for the regression problem at hand, then the regression function being used is inadequate for the task. Recently, much theoretical work has been done to define classes of functions which have corresponding regression functions with rates of convergence independent of dimension (see Section 2.3 of Chapter 2 for details). One of the main theoretical contributions of this thesis is that we establish a rate of convergence result which is independent of dimension on function spaces which have not been previously studied in this way.

1.2.2 Computational Complexity of Nonparametric Regression

The second aspect of the curse of dimensionality is the computational complexity of constructing high dimensional regression functions. Generally speaking, there are two commonly used methods for constructing nonparametric regression functions. The first builds regression functions by including all input variables simultaneously, thus building one or more large component functions which may be combined to form the final regression function. The second uses some form of variable selection to build up the regression function using smaller functional subunits. Both of these approaches can be effective for relatively low dimensional problems.

However, consider the problem of approximating a function of 1000 variables. If one uses the first approach and attempts to fit all the variables simultaneously, then even if the regression function is a simple second order polynomial (i.e. one which has relatively few nonlinear terms) of 1000 variables, the regression algorithm needs to solve simultaneously for over 500,000 model parameters (calculated using the identity m = (s + v)! / (s! v!), where m is the number of polynomial terms and hence the number of model parameters, s is the order of the polynomial, and v is the number of input variables). Whether solved via gradient descent or least squares, the computational cost of simultaneously fitting 500,000 parameters is not insignificant [Press et al., 1988]. Thus even a simple simultaneous fit of a second order polynomial of 1000 variables is computationally significant, and adding more nonlinear terms makes the problem correspondingly more expensive (for example, a third order polynomial of 1000 variables has more than 1.6 x 10^8 model parameters). The computational cost becomes even more significant when the appropriate model structure is unknown (thus making the problem nonparametric), and many different types of large nonlinear models need to be constructed before an adequate one is found.
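The counts quoted in this section are easy to verify with a few lines of Python (a simple arithmetic check, not part of the thesis; it also includes the variable-selection counts used in the argument that follows):

```python
from math import comb, factorial

def n_poly_terms(order, n_vars):
    """Number of multinomial terms: m = (s + v)! / (s! * v!)."""
    return factorial(order + n_vars) // (factorial(order) * factorial(n_vars))

print(n_poly_terms(2, 1000))   # 501501    -> "over 500,000 model parameters"
print(n_poly_terms(3, 1000))   # 167668501 -> "more than 1.6 x 10^8 parameters"

# Exhaustive selection of the best q = 500 of N = 1000 inputs:
print(comb(1000, 500))         # an astronomically large number of model fits

# Suboptimal backward elimination: (1/2)(N(N+1) - (q-1)q) model constructions.
N, q = 1000, 500
print((N * (N + 1) - (q - 1) * q) // 2)   # 375750
```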
An additional difficulty associated with attempting to simultaneously fit many variables is due to the large number of training examples needed to form a stable fit. This results because a simultaneous fit of a large number of model inputs implies that many model parameters need to be fit at the same time (as discussed in the previous paragraph), which in turn requires many training samples for the model to converge to a stable solution [Friedman, 1994b] [Breiman, 1996b]. In terms of computational complexity, a large number of learning examples results in an increase in the computational complexity of the learning algorithm.

Next consider the 1000 variable regression problem using some method of variable selection. In the literature, variable selection is typically achieved by choosing the best q of N possible inputs. Thus, in order to determine the optimal set of q input variables, the number of model constructions and evaluations required is the binomial coefficient N!/(q!(N-q)!) ("N choose q"). As q and N become large, this search for an optimal solution quickly becomes impractical. For example, choosing the best 500 out of a possible 1000 variables requires the evaluation of "1000 choose 500" model constructions (usually nonlinear models are needed, and hence each individual construction can itself be computationally expensive); such an exhaustive search is not currently feasible, and is unlikely to be in the near future. Even if one chooses the suboptimal solution of removing one variable at a time from the model, one would require (1/2)(N(N+1) - (q-1)q) model constructions and evaluations. Hence, using this method to find the "best" 500 out of 1000 inputs requires 375,750 model constructions and evaluations. Even this suboptimal search strategy is computationally difficult given the computational complexity of constructing and evaluating nonlinear models of more than 500 variables. Similar computational complexity arguments can be made for any method which attempts to choose the "best" (by whatever measure one uses) low dimensional projections of high dimensional data (see Section 2.1.1 of Chapter 2).

Any practical nonparametric algorithm required to work on high dimensional regression functions must deal with the above complexity issues. In the next section, the SPORE approximation is introduced as a methodology designed to address both the computational complexity and rate of convergence problems associated with high dimensional nonparametric regression.

1.3 The SPORE Approximation

When we speak of the difficulties inherent in high dimensional nonparametric regression (i.e. the curse of dimensionality), we are referring to the difficulties of applying existing approximation techniques to high dimensional problems. Any approximation problem is potentially difficult. In fact, one dimensional problems are potentially just as difficult as one thousand dimensional problems, and the intrinsic difficulty of any approximation problem is directly related to the number of learning examples required to build a sufficient approximation. Thus, the curse of dimensionality is only a curse because most algorithms have been designed to work well in low dimensional spaces; there is no theoretical reason why a one dimensional function requiring 50,000 training examples to build a sufficient model should be any easier to approximate than a one thousand dimensional function requiring the same number of training samples.

We propose the following methodology to address high dimensional nonparametric regression problems.
First, in order to ensure that the number of learning samples required to build a sufficient model depends on the complexity of the regression problem and not on how many input variables there are, we build approximations by projecting the high dimensional learning data onto low dimensional parametric building blocks. Thus, the number of learning samples required to form stable approximations depends on the number of parameters in our parametric building blocks, and not on the dimension of the learning data. The result is that the rate of convergence of the algorithm to some fixed error is in fact independent of dimension. This use of low dimensional parametric projections has been extensively used in the past (see Chapter 2, Section 2.1.1.1 for details); however, it has invariably involved a systematic search for building blocks and associated inputs for the next best (assuming some measure of best) building block to add. As we have argued in Section 1.2.2, these types of search procedures are not feasible when we have very high dimensional input spaces. Therefore, such search strategies are, of necessity, not used in the algorithms proposed in this thesis.

The second key aspect of our methodology is that the algorithmic procedure we use to determine which building block to add next, or what its inputs will be, is not dependent on the number of input variables. This may appear to be an impossible or impractical algorithmic characteristic; however, from a theoretical standpoint, there is no reason why computationally expensive variable or building block selection is necessary. This is illustrated in the Appendices of this thesis (see Section 5.2.1 of Chapter 5), where we present an approximation algorithm which is theoretically guaranteed to construct uniformly converging approximations to any bounded continuous function of arbitrary dimension, while always using the same building blocks and a random ordering of input variables. The arguments in these Appendices demonstrate that computationally expensive searches for building blocks and their inputs are not a theoretical necessity. A practical example of this aspect of our methodology is the SPORE-1 algorithm (see Section 1.3.3), where the same parametric building block is used throughout the algorithm, and the order in which input variables are added to the building blocks is random.

The third and final key component of our methodology is that the input space of the regression function is sub-divided in order to avoid flat or zero projections from high dimensional learning data onto our low dimensional parametric building blocks. This is a necessary algorithmic component because it gives convergence to zero error when other algorithms which use low dimensional projections do not (see Section 4.5 of Chapter 4). In fact, one of the main criticisms of algorithms which use lower dimensional building blocks is that they do not work well on high dimensional data. In the appendices of this thesis we demonstrate a systematic method of space subdivision which gives theoretically proven convergence (see Section 5.2.1 of Chapter 5). In addition, we show that the benefits of space subdivision can be obtained with minimal computational effort and without actually subdividing the input space (see Section 3.3.4 of Chapter 3). We refer to this as Virtual Space Partitioning, and show that it is a direct consequence of using bootstrap sampling techniques [Efron, 1983] to randomly subdivide the input space when adding parametric building blocks to the regression function.
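Bootstrap resampling of the kind referred to here is simple to state in code. The sketch below is a generic illustration of bootstrap sampling [Efron, 1983], not the SPORE-1 implementation: each new building block is fitted on a different random resample of the training set, which is what produces the "virtual" random subdivision of the input space.

```python
import numpy as np

def bootstrap_sample(x, y, rng):
    """Draw M examples with replacement.  If each new building block is fit on
    a different resample, each block effectively sees a (randomly) different
    subset of the original training points."""
    M = len(y)
    idx = rng.integers(0, M, size=M)
    return x[idx], y[idx]

rng = np.random.default_rng(2)
x = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10).astype(float)
xb, yb = bootstrap_sample(x, y, rng)
print(sorted(set(int(v) for v in yb)))   # some points repeat, others are left out
```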
Science often works by pushing a thesis to an extreme point of view in order to establish conditions under which this view is valid, as well as conditions under which it breaks down. This document is an example of this form of scientific endeavor. The thesis being put forward, and upon which the SPORE approximation is based, is the following:

Very high dimensional nonparametric regression models can be effectively constructed by using a sufficient number of small, low dimensional parametric building blocks which are added to the model one at a time, using the outputs of previously added blocks as inputs to the new ones. Furthermore, one can use a random selection process to choose candidate building blocks (picked from some small finite set) and their inputs, thus little or no computational effort is required to determine what to fit next.

Our objective is to both experimentally and theoretically establish under which conditions this thesis is valid, and to determine how restrictive these conditions are. It is argued throughout this document that the principal implication of the above thesis is that very high dimensional nonparametric regression is feasible, both in terms of rate of convergence and computational complexity. The surprising conclusion is that it is both feasible and effective on a large class of functions.

The Space Partitioning, self-Organizing, and dimensionality REducing, or SPORE, algorithmic philosophy defines our basic methodology for nonparametric regression. As the name suggests, the input space of the regression data is partitioned or divided into regions where it is possible to reduce the dimension of the problem by constructing the approximation only a few variables at a time. SPORE approximations are self-organizing, or equivalently nonparametric, in that the size of the resulting regression function is defined by the learning or regression data, and is not defined a priori. In Chapter 5, the general SPORE methodology is defined and theoretically analyzed. In the following, a brief description of the basic concepts behind SPORE is presented.

1.3.1 A Conceptual Summary of SPORE

We view SPORE as a methodology for designing learning algorithms which construct SPORE-type regression functions or approximations. The SPORE methodology has five basic characteristics which can be summarized as follows:

C1. The SPORE approximation consists of a finite number of functional units. These functional units are either structurally all identical (for example 2 dimensional, third order polynomials as in Chapter 4) or of different types (for example some may belong to a class of polynomials of a given range of dimension and order, while others may belong to one or more classes of radial basis functions of some range of dimension and number of basis units). From a practical perspective, these functional units should have the following property: the process of individually fitting these functional units to input/output data should be relatively easy when compared to the complete regression problem.

C2. The functional units are added to the approximation one unit at a time, and are not modified after they are added.
Input variables, as well as the outputs of previously added functional units, can serve as inputs to new units. The addition of functional units stops when a desired residual error is reached.

C3. Little or no computational effort is spent on selecting inputs to structural units.

C4. Little or no computational effort is spent on selecting which type of structural unit is added next.

C5. The input space is partitioned in order to reduce approximation error.

Thus, dimensionality reduction is implied by characteristics C1 and C2, self-organization is implied by C2, and space partitioning is implied by C5 (thus the name SPORE as defined above). One should note that characteristics C1 and C2 are common to a number of other nonparametric regression architectures. In Chapter 2, these architectures are discussed in detail. Two notable examples are the Group Method of Data Handling (GMDH) [Ivankhnenko, 1971] (as well as various other related algorithms), and Cascade Correlation [Fahlman and Lebiere, 1990]. All of these algorithms build successively on structures, using the outputs of previously constructed structural units as inputs to new units. However, it is the further incorporation of characteristics C3 and C4 which makes SPORE different from all other approaches to nonparametric regression. As described in Section 2.4 of Chapter 2, these two characteristics lead to learning algorithms which are significantly different from those studied previously. They are also a key to formulating a practical algorithm for very high dimensional regression. As argued in Section 1.2.2, the computational effort required for variable selection makes it difficult, if not impossible, to do in very high dimensional spaces. The same argument applies to structural unit selection in high dimensional spaces. The purpose of the space partitioning characteristic, C5, is to ensure that the SPORE regression function is a universal approximator, or in other words is (theoretically) able to approximate any bounded continuous function defined over a finite domain in N dimensional space (see Chapter 5). As is briefly outlined in the next section, it is also used to avoid problems with rate of convergence in very high dimensional regression.

1.3.2 Theoretical Results

As stated previously, Stone [Stone, 1982] has shown that, in general and under appropriate regularity conditions, the number of sample points required to approximate a bounded continuous N dimensional function which is at least once differentiable and defined over a closed domain grows exponentially with dimension N. The main theoretical result of this thesis is the following: we prove that such functions have a finite number of sub-domains where there exists at least one type of regression function (e.g. one of the SPORE structures defined in Chapter 5) which has a rate of convergence independent of dimension. The SPORE approximation is in fact built by partitioning the input space into subdomains where the rate of convergence to zero approximation error is independent of dimension. This result is significant from both a theoretical and practical point of view. Theoretically it demonstrates that much broader function classes can have rates of convergence independent of dimension (see Section 2.3 of Chapter 2 for details). In practice it points the way to practical nonparametric regression algorithms which can work in very high dimensional spaces.
A second theoretical result given in this thesis concerns the computational complexity required to build a SPORE structure. As stated above, SPORE subdivides the input space of the function being approximated into a finite number of subdomains where the rate of convergence is independent of dimension. In Chapter 5, it is further shown that the computational complexity of constructing SPORE in each of these subdomains is on the order of O(N x C), where N is the dimension of the data and C is the average complexity of adding a functional unit (C is independent of N). This indicates that, at least within each subdomain, the computational complexity of constructing an approximation is at most linear with dimension. This is a promising result from the point of view of practical nonparametric regression. One theoretical question which this thesis does not address is how many such subdomains are necessary. Having desirable theoretical properties in each subdomain may not mean much if there are exponentially many such domains in practice. Although this remains an open theoretical question, in Chapter 5 some theoretical justification is given which suggests that functions requiring many subdomains are unlikely. Furthermore, in Chapter 3 and Chapter 5 we establish conditions under which subdivision is not necessary. It is interesting to observe that none of the empirical evaluations of the SPORE approximation required domain subdivision (see Chapter 4 for details). Thus, for at least the regression problems studied in this thesis, one domain is sufficient (although virtual space subdivision was necessary to obtain these results: see Section 3.3.4 of Chapter 3).

1.3.3 The SPORE-1 Regression Function

In Chapter 3, the most basic form of SPORE, termed SPORE-1, is defined and a construction or learning algorithm is given. All of the experimental evaluations presented in this thesis (see Chapter 4) are done using SPORE-1. A brief description of SPORE-1 is given next.

[Figure 1.2: The SPORE-1 Structure]

As shown in Figure 1.2, SPORE-1 consists of a cascade of 2 dimensional functions g_l(·), which are scaled (by α_l) and summed to produce the regression function f̂(x); the subscripts k_0, k_1, ..., k_L ∈ {1, ..., N} serve to identify the input variables (x_1, ..., x_N). The exact algorithm used to construct the SPORE-1 approximation is fully defined in Chapter 5. The points of note in the learning algorithm are as follows:

1. the functions (building blocks) g_l(·) are 2-dimensional polynomials, all having the same number of coefficients;
2. the functions g_l(·) are added one at a time in order g_1(·), g_2(·), ..., g_L(·) (the learning data determines how many levels are constructed);
3. the order of inputs (x_1, ..., x_N) is random, and any single input may enter the cascade at many levels; and,
4. the input space is randomly subdivided (using bootstrap) whenever a new g_l(·) is added.

Hence, the SPORE-1 learning algorithm conforms to the general SPORE methodology given in Section 1.3.1. Consider the simple nature of SPORE-1: the algorithm does not use a complicated variable selection technique (it is in fact random variable selection), and simple parametric functions (all identical) are incrementally added to the regression function.
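The cascade of Figure 1.2 can be sketched in a few lines of code. The sketch below is a simplified reading of that figure and not the thesis's implementation (Chapter 3 gives the actual algorithm): it assumes each level is a two-input cubic polynomial taking the previous level's output and one randomly ordered input variable, with the level outputs scaled by weights α_l and summed; all coefficients and weights here are made up for illustration.

```python
import numpy as np

def cubic_poly2(a, b, coeffs):
    """A two-input cubic polynomial: all terms a^i * b^j with i + j <= 3."""
    terms = [a**i * b**j for i in range(4) for j in range(4) if i + j <= 3]
    return sum(c * t for c, t in zip(coeffs, terms))

def spore1_predict(x, levels, alphas):
    """Evaluate a SPORE-1-style cascade on one input vector x.

    `levels[0]` pairs two input variables; every later level pairs the previous
    level's output with one more (randomly ordered) input variable.  The level
    outputs are scaled by the weights alpha_l and summed, as in Figure 1.2."""
    (k0, k1), coeffs0 = levels[0]
    z = cubic_poly2(x[k0], x[k1], coeffs0)
    outputs = [z]
    for k, coeffs in levels[1:]:
        z = cubic_poly2(z, x[k], coeffs)      # previous output cascades forward
        outputs.append(z)
    return float(np.dot(alphas, outputs))

# Toy usage with made-up coefficients (10 terms per 2-D cubic polynomial).
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=8)
levels = [((0, 5), rng.normal(size=10))] + \
         [(int(k), rng.normal(size=10)) for k in rng.permutation(8)]
alphas = np.ones(len(levels)) / len(levels)   # a simple weighted average
print(spore1_predict(x, levels, alphas))
```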
In fact, SPORE-1 is an experiment in pushing the lower bounds of algorithmic and structural simplicity, the goal being to test the efficacy of the general SPORE methodology at this most basic level. However, as summarized in Section 1.3.5, this surprisingly simple SPORE-1 structure can indeed form effective nonparametric regression functions.

1.3.4 Why Does SPORE-1 Work and When Does It Work Best?

Given the simple structure of SPORE-1, it is not unreasonable to ask why SPORE-1 is able to form effective regression functions. Roughly speaking there are three basic reasons for this (see Chapter 3 for details). These are briefly addressed below.

First, the addition of each new building block to the structure defined in Figure 1.2 can have, on average, only one of 2 consequences: 1) it can reduce the model error because of the addition of the new input variable (chosen randomly) to the regression function; or 2) it can leave the model error unchanged, because if the addition of the new input variable does not contribute to a better model, then the new building block g_l(·) simply passes the output of the previous level g_{l-1}(·) to the input of the next level g_{l+1}(·) (i.e. g_l(·) = g_{l-1}(·)). The result is that the order in which input variables are added to the regression function does not significantly affect model error, as long as the relevant variables are eventually incorporated.

The second reason why SPORE-1 is effective is that it forms stable high dimensional regression functions using relatively few learning examples. This is because 1) high dimensional data is projected onto low dimensional parametric building blocks g_l(·), which implies that relatively few training examples are required to form stable model parameters [Friedman, 1994b], and 2) the outputs from these building blocks are combined using a weighted average (in the form of the scaling factors α_l), which tends to further stabilize the regression function [Breiman, 1996b] (see Section 3.3.3 of Chapter 3).

The third reason why SPORE-1 is effective is that the low dimensional projections (parametric building blocks) g_l(·) tend to have un-correlated residual errors and are unlikely to be flat (i.e. g_l(·) = g_{l-1}(·)) if the input x_{k_l} to g_l(·) has an effect on the output. The reason for this is the random subdivision of the input space resulting from the use of bootstrap samples of the training data whenever a new level g_l(·) is added (see Section 3.3.3 and Section 3.3.4 of Chapter 3). The result is that adding new levels (building blocks) to the regression function tends to improve the approximation, as long as the input variables continue to affect the output.

The next question one might ask is the following: what types of problems is SPORE-1 best suited for? Given that the SPORE-1 structure consists of a long cascade of parametric building blocks (see Figure 1.2), with each level in the cascade incorporating additional input variables into the regression function, it is reasonable to assume that SPORE-1 is best suited to regression problems where many input variables are required to produce an effective model, and the relevant input variables have approximately equally significant effects on the output. In addition, if all of the parametric building blocks g_l(·) are continuous (which they are in the current implementation; see Chapter 3 for details), in the ideal case, the target function would also be continuous.
Although this is a compelling argument for defining an ideal class of target functions, there is significant experimental evidence that, in practice, SPORE-1 is not limited to only these types of regression problems. This is briefly discussed in the next section.

1.3.5 Experimental Evaluation of the SPORE-1 Approximation

The SPORE-1 approximation is evaluated experimentally in Chapter 4. First, SPORE-1 is compared with other algorithms by applying it to 10 regression problems found in the literature [Grudic and Lawrence, 1997]. This initial evaluation shows that SPORE-1, on average, does at least as well or better (with respect to mean squared approximation error) than published results on 9 of these data sets. All of these published regression problems are relatively small and low dimensional, and thus not the type of problems SPORE-1 was designed for. Hence it is interesting to observe SPORE-1's effectiveness on such problems.

Next, SPORE is evaluated on 15 synthetic, large, high dimensional data sets (40,000 learning examples of data having 100, 200, 400, 800 and 1,600 inputs) [Grudic and Lawrence, 1997]. This synthetic data is generated specifically to evaluate SPORE-1 on the type of data sets it was designed for. Some of these data sets have output noise (i.e. the ε in equation (1.2)), while others include input variables which do not affect the output (i.e. inputs in x which are not in v in Section 1.1). Synthetic data allows us to test theoretical limits with full knowledge of what the best attainable approximation errors are. The results from these experiments indicate that SPORE-1 closely approaches these best possible approximation errors. As no data sets of sufficient size and dimensionality were found in the literature, this is the first use of this data for evaluation of high dimensional regression.

Next, SPORE-1 is applied to 3 high dimensional (1024 pixel inputs) human-to-robot skill transfer problems. These experiments involve having a human operator demonstrate a mobile robot "locate object and approach object" motion sequence using video input. During this demonstration, the human uses only the robot's sensors as inputs, and controls the robot's actuator as output. The demonstration generates a learning data set by recording the image pixels seen by the human, as well as the corresponding output commands, during each time step of the demonstration. This data set is used to construct a mapping between sensor inputs and actuator outputs, which, when constructed, is used to autonomously control the robot. Experimental results indicate that the robot is able to autonomously accomplish the desired task. Although, as indicated in Chapter 4, the task being transferred from human to robot is relatively simple, these experiments demonstrate the ability of SPORE-1 to form very high dimensional regression functions based on real world learning data.

Finally, SPORE-1 is applied to the 10-bit parity problem [Grudic and Lawrence, 1997]. We chose this problem to demonstrate that SPORE-1 does not have the same limitations as other systems which build regression functions using small dimensional units which are added one at a time.

1.4 Plan of Presentation

In Chapter 2 a review of nonparametric regression is presented. In Chapter 3 the simplest version of SPORE (SPORE-1) is described, and theoretical results are given. In Chapter 4 an empirical evaluation of SPORE-1 is presented.
In Chapter 5 the general SPORE methodology is defined and the basic theory behind it is described. Chapter 6 contains a summary of the conclusions and major technical contributions of this thesis, and includes a proposal for future theoretical and empirical evaluation of SPORE.

Chapter 2: Review of Nonparametric Regression

This chapter gives a brief review of nonparametric regression. In Section 2.1, a discussion of various nonparametric regression algorithms is given. In Section 2.2, the problem of overfitting in nonparametric regression is discussed and various published solutions are mentioned. In Section 2.3, convergence results for various regression functions are discussed. Finally, in Section 2.4, a comparison is made between published regression algorithms and the regression methodology proposed in this thesis (SPORE).

2.1 Regression Algorithms: Addressing the Curse of Dimensionality

As discussed in Chapter 1, the curse of dimensionality has two serious detrimental consequences for practical high dimensional nonparametric regression. The first is the potential necessity for an impractically large learning data set, and the second is the computational complexity associated with a large number of input variables. Many algorithms have been presented in the literature which are designed to deal with these two difficulties. Good reviews and descriptions of many types of regression and approximation algorithms can be found in the literature [Kwok and Yeung, 1997] [Haykin, 1994] [Friedman, 1994b] [Hastie and Tibshirani, 1994] [Alfeld, 1989]. In the following, we discuss some representative examples of these algorithms. All of the algorithms discussed below have been demonstrated to be useful on a variety of different types of regression problems. Each has its own particular strengths and weaknesses; however, none of these algorithms have been demonstrated to be effective on the large, very high dimensional nonparametric regression problems studied in this thesis.

For the purposes of this thesis, we view nonparametric learning algorithms as broadly grouped into two major categories. The first is termed dimensionality reducing algorithms, and the second is called space partitioning algorithms.

2.1.1 Dimensionality Reducing Algorithms

The goal in dimensionality reduction is to break a large regression problem into a number of smaller regression problems. The end result of this is a reduction in the number of parameters which are being fitted to the learning data. A reduced parameter set implies that algorithms which fit learning data to these parameters can potentially be more efficient. In addition, fewer parameters generally require fewer learning samples in order to achieve stable parameter values. In the following, we briefly discuss some of the better known examples of dimensionality reducing algorithms. For organizational purposes, we divide these algorithms into two types.

2.1.1.1 Sum of Approximated Lower Dimensional Projections

Let f(x), where x = (x_1, ..., x_N) ∈ D ⊆ R^N, be an N dimensional function, and let f̂(x) be some approximation of f(x). Then in one form of dimensionality reduction, f̂(x) has the following structure:

\hat{f}(x) = \sum_{i_1 \in \{1,\ldots,N\}} \hat{f}_{i_1}(x_{i_1}) + \sum_{i_1, i_2 \in \{1,\ldots,N\}} \hat{f}_{i_1 i_2}(x_{i_1}, x_{i_2}) + \ldots + \sum_{i_1, \ldots, i_m \in \{1,\ldots,N\}} \hat{f}_{i_1 \ldots i_m}(x_{i_1}, \ldots, x_{i_m})     (2.1)

where, for all m ∈ {1, ..., N}, the functions f̂_{i_1...i_m} are m dimensional functions defined over the domain D of f(x).
Generally speaking, the goal is to have m much smaller than N, and to find some finite set of functions f̂_{i_1...i_m} which sum to effectively approximate the target function. The hope is that effective approximations are possible using functions which are either lower dimensional or have some fixed finite parametric structure. Examples of this type of dimensionality reduction include Generalized Additive Models [Hastie and Tibshirani, 1986], MARS [Friedman, 1991], GMDH [Ivankhnenko, 1971], ASPN [Elder and Brown, 1995], and SONN [Tenorio and Lee, 1990]. These are briefly discussed below.

In Generalized Additive Models [Hastie and Tibshirani, 1986] [Hastie and Tibshirani, 1990], regression functions are constructed using a sum of smooth 1 dimensional functions, i.e. the first sum in (2.1). The 1 dimensional functions f̂_{i_1} are nonparametrically constructed, one at a time, using a scatterplot smoother. One advantage of this technique is that the construction of the functions f̂_{i_1} is completely automated, requiring no intervention in the learning process. In addition, the resulting regression model can often be effectively interpreted, allowing a statistician to analyze significant factors involved in model generation. One limitation of this method is that higher dimensional interactions are difficult to model; however, in many lower dimensional regression problems this is not a significant concern.

In Multivariate Adaptive Regression Splines (MARS) [Friedman, 1991], regression functions are constructed using a sum of products of spline functions. These functions are built in a forward stepwise hierarchical manner which is based on recursive partitioning (see Section 2.1.2). MARS is able to automatically account for interactions between variables, in a controlled manner, allowing for effective model interpretation. A limitation of MARS is that its computational complexity increases rapidly with dimension, making large regression problems (i.e. thousands of learning samples) of more than about 20 input variables impractical when high dimensional interactions occur. However, in many applications, low dimensional interactions are sufficient to form effective approximations.

Polynomials have been widely used to build regression models. Polynomial models have the following form (note that these types of multidimensional polynomials are often termed multinomials):

\hat{f}(x) = \sum_{\substack{i_1, \ldots, i_N \in \{0, \ldots, M\} \\ i_1 + \ldots + i_N \le M}} a_{i_1 \ldots i_N} \, x_1^{i_1} \cdots x_N^{i_N}     (2.2)

where M is the order of the polynomial. These models are built by fitting the parameters a_{i_1...i_N} to the learning data. One difficulty with this general polynomial formulation is that the number of parameters a_{i_1...i_N} increases rapidly with polynomial order and dimension, thus making it difficult to determine which subset of parameters is significant. Many iterative schemes have been developed which attempt to choose the most relevant parameters, and construct the approximation a few terms at a time.
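A generic example of such an iterative scheme is forward stepwise selection of multinomial terms (this sketch illustrates the general idea only; it is not any specific published algorithm, and the toy data, term budget, and helper names are invented here): at each step, the candidate term that most reduces the residual error is added to the model.

```python
import numpy as np
from itertools import combinations_with_replacement

def forward_stepwise_poly(x, y, max_order=2, n_terms=5):
    """Greedily add multinomial terms x_{i1}*...*x_{ik} one at a time,
    always choosing the candidate that most reduces squared residual error."""
    M, N = x.shape
    # Candidate terms: products of up to `max_order` input variables.
    candidates = [c for k in range(1, max_order + 1)
                  for c in combinations_with_replacement(range(N), k)]
    columns = [np.ones(M)]                     # start from a constant term
    chosen = []
    for _ in range(n_terms):
        best = None
        for c in candidates:
            if c in chosen:
                continue
            trial = np.column_stack(columns + [np.prod(x[:, c], axis=1)])
            coeffs, *_ = np.linalg.lstsq(trial, y, rcond=None)
            err = np.mean((trial @ coeffs - y) ** 2)
            if best is None or err < best[0]:
                best = (err, c)
        chosen.append(best[1])
        columns.append(np.prod(x[:, best[1]], axis=1))
    return chosen

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=(300, 6))
y = 2 * x[:, 0] * x[:, 1] + x[:, 3] + 0.05 * rng.normal(size=300)
print(forward_stepwise_poly(x, y))   # should pick terms like (0, 1) and (3,)
```

Note how the inner loop re-fits a model for every remaining candidate term at every step; this is exactly the kind of search whose cost grows quickly with dimension, as discussed in Section 1.2.2.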
The outputs of this first layer are then used as inputs to the next layer, where they are once more exhaustively combined two inputs at a time. This process of adding layers continues until further additions to the polynomial do not reduce the approximation error. There have been many  Chapter 2: Review of Nonparametric Regression  21  enhancements to this original algorithm since it was introduced by Ivakhnenko [Farlow, 1984]. Some specific examples include the Algorithm for Synthesis of Polynomial Networks (ASPN) [Elder and Brown, 1995], which extends the paradigm to include 3 dimensional polynomials, and the Self Organizing Neural Network (SONN) [Tenorio and Lee, 1990], which allows for basis functions other than polynomials and uses a Simulated Annealing technique to search for an optimal structure. Other variations of this include the method by Sanger [Sanger, 1991] where subnetworks of polynomials (or other basis functions) are grown below the previously added terms which exhibit maximum variance in approximation error. A l l of these algorithms have demonstrated both good generalization and interpretation properties, on many types of regression problems. One common characteristic of these methods is that they all use some form of search in variable space (and often basis function space as well) to define the structure of the resulting polynomial. Ultimately, due to the computational complexity of such searches (see Section 1.2.2 of Chapter 1), these algorithms are not practical for the very high dimensional regression problems studied in this thesis. 2.1.1.2 S u m m e d H i g h D i m e n s i o n a l B a s i s F u n c t i o n s  A second general type of dimensionality reducing algorithm constructs regression functions that have the following structure: f{x)  = X S (h (x)) q  (2.3)  q  q=0  where g are 1 dimensional functions, the basis functions h are functions of a known type q  chosen a priori, and the number of terms M is chosen large enough to give effective approximations. Examples of these regression functions include Projection Pursuit [Friedman and Stuetzle, 1981], Radial Basis Functions [Hardy, 1971], Hinging Hyperplanes [Breiman, 1993], Wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995] [Zhang, 1997], Cascade Correlation [Fahlman and Lebiere, 1990], and other adaptive Neural Network type algorithms [Reed, 1993] [Igelnik and Pao, 1995] [Yao and Lui, 1997]. These are briefly discussed below. In Projection Pursuit [Friedman and Stuetzle, 1981], the functions h are linear, having q  the form h (x) = q  }  a -• x , where the parameters a are chosen to maximize error (  t  {  reduction, given that the functions g are restricted to be smooth and nonparametric. This  Chapter 2: Review of Nonparametric Regression  22  minimization of error process can be accomplished using Newton-Raphson [Press et al., 1988] on an appropriate cost function. The residual errors are updated after the construction of each h and q  g  function, and the construction of new functions stops when the approximation error is  sufficiently small. Projection Pursuit is often used as an analysis tool because determining which projection axis (defined by function h ) gives the most error reduction, can shed light on which q  independent variables are important and how they interact. 
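The following is a hedged sketch of fitting a single projection pursuit term to a residual. The crude random search over directions stands in for the Newton-Raphson optimization mentioned above, and the low-order polynomial stands in for a nonparametric scatterplot smoother; both are simplifying assumptions.

```python
import numpy as np

def fit_pp_term(X, r, n_directions=200, degree=3, rng=None):
    """Fit one projection pursuit term g(a . x) to the residual r."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(n_directions):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)                     # candidate projection axis
        z = X @ a
        Z = np.vander(z, degree + 1)               # polynomial "smoother" in z
        coef, *_ = np.linalg.lstsq(Z, r, rcond=None)
        sse = np.sum((Z @ coef - r) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, coef)
    return best                                    # (error, direction, smoother coefficients)
```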
A limitation of this method is that all approximations are made along N dimensional linear projection axes (defined by functions  h ), q  making the construction of these axes computationally difficult for very high dimensional regression problems (N > 100). Cascade-Correlation [Fahlman and Lebiere, 1990] is an example of a nonparametric Artificial Neural Network regression function. Neural networks are generally implemented using sigmoidal basis functions (artificial neuron) of the form s(z) = 1/(1 + e ) , where z  z = ot + Y!i- i i' a  0  i'  u  w  '  t n  i  u  D e  ' g n  t n e  inputs to the basis function and the parameters a  ;  being adjusted to fit the training data. Neural networks often consist of layers of these sigmoidal functions, with each layer containing a potentially large number of neurons or sigmoidal units, and the output of lower layers serving as inputs to higher layers (each unit in the first layer having as its inputs all of the predictor variables). The Cascade-Correlation algorithm is designed to automatically determine the network structure based on the training data. Cascade-Correlation adds hidden units (i.e. units not directly connected to the output) to the network one at a time, with each hidden unit having an input connection from all the previously added hidden units, as well as all of the predictor variables. Thus the number of inputs into hidden units grows linearly with the number of levels added to the network, with the first unit having N inputs (one from each of the predictor variables). In addition, each time a hidden unit is added, many potential candidate units are trained using the Widrow-Hoff delta rule [Widrow et al., 1976] starting from randomly chosen weights oc . Only (  the one which best correlates with the current residual error is added as the next hidden unit. The efficacy of Cascade-Correlation on high dimensional data has been widely demonstrated in the literature [Michie et al., 1994]. However, because each hidden unit accepts inputs from all predictor variables, as well as the outputs of all existing hidden units, the method exhibits high computational complexity and potential numerical instability (due to a large number of parameters), when the regression problem is very high dimensional. As a result, Cascade-  Chapter 2: Review of Nonparametric Regression  23  Correlation is unsuitable when large numbers of hidden units are necessary for effective approximation, and thus is not ideally suited for the very high dimensional regression problems studied in this thesis. Other nonparametric artificial neural network algorithms work by either starting with a large network and pruning until an effective approximation has been achieved  (pruning  algorithms [Reed, 1993]), or by starting with a small network and adding interconnections and basis functions (artificial neurons) until no further error reduction is possible  (constructive  algorithms [Kwok and Yeung, 1997]). Both constructive and pruning type neural network algorithm have the same goal: to find a neural network structure which is large enough to effectively model the learning data, while small enough to ensure that over-fitting does not occur (see Section 2.2 for a discussion of variance reduction techniques and the problem of over-fitting). Other learning algorithms use genetic programming techniques [Yao and Lui, 1997] and stochastic basis function generation techniques [Igelnik and Pao, 1995] to achieve the same goal. 
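As an illustration of the constructive approach, the sketch below grows a network one hidden unit at a time in the spirit of Cascade-Correlation. The pool of randomly weighted candidate units (rather than candidates trained with the Widrow-Hoff rule), the plain least-squares output layer, and the pool size are all simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grow_cascade(X, y, n_hidden=5, n_candidates=10, rng=None):
    """Add hidden units one at a time; each new unit sees the raw inputs plus all
    previously added hidden units, and the candidate whose output correlates best
    with the current residual is installed."""
    rng = rng or np.random.default_rng(0)
    H = np.empty((len(X), 0))                          # outputs of installed hidden units
    for _ in range(n_hidden):
        Z = np.column_stack([np.ones(len(X)), X, H])   # bias + inputs + hidden outputs
        w_out, *_ = np.linalg.lstsq(Z, y, rcond=None)  # linear output layer
        resid = y - Z @ w_out
        best_h, best_c = None, -np.inf
        for _ in range(n_candidates):
            w = rng.normal(scale=0.5, size=Z.shape[1])     # random candidate weights
            h = sigmoid(Z @ w)
            c = abs(np.corrcoef(h, resid)[0, 1])           # correlation with residual
            if c > best_c:
                best_h, best_c = h, c
        H = np.column_stack([H, best_h])                   # install the winning unit
    Z = np.column_stack([np.ones(len(X)), X, H])
    w_out, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w_out, H
```

In a faithful implementation each candidate's input weights would be trained to maximize this correlation before the comparison is made.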
These algorithms have been shown to perform well when relatively small numbers of artificial neurons effectively model the learning data. However, when large numbers of artificial neurons are necessary, the learning process can become unstable (small changes in learning data leading to large changes in model structure [Breiman, 1996b]) because of the large number of parameters in the neural network. This makes these types of neural network learning algorithms unsuitable for the large high dimensional problems studied in this thesis. In Radial Basis Function methods [Jackson, 1988] [Hardy, 1971] [Poggio and Girosi, 1989] [Broomhead and Lowe, 1988] [Casdagli, 1989] [Renals andRohwer, 1989] [Poggio and Girosi, 1990] [Edelman and Poggio, 1990] [Mel and Kock, 1990] [Saha and Keeler, 1990] [Girosi, 1994], the functions h have the form h (x) q  q  = <p(||x - x ||), where <j) is afunction which  varies according to some distance metric from a center point x . The functions g take the form q  g (x) q  = a • h (x) q  q  q  where the parameters a, are defined by the training data. Radial Basis  Functions become nonparametric when the number and type of basis functions h , and the q  location of the centers x  q  is determined by the learning data. A limitation of Radial Basis  Functions is that they all use some measure of distance between points in space, and in high dimensional space, almost every learning sample point is closer to the sample boundary than to another learning point [Friedman, 1994b]. This makes radial basis functions not ideally suited for very high dimensional nonparametric regression.  Chapter 2: Review of Nonparametric Regression  24  Hinging Hyperplanes [Breiman, 1993] represent regression functions as a linear superposition of basis functions which consist of 2 hinging hyperplanes. These hyperplanes are hinged in the sense that two planes in TV dimensional space must always intersect at a line (or hinge) in N dimensional space. The learning algorithm works by finding the hinge between 2 hyperplanes which best fit the current residual error, thus producing a basis function. The next basis function is then produced in a similar manner using the new residual errors resulting because of the addition of the previous basis function. If N is small, the hyperplanes are chosen to be N dimensional, otherwise the best M <N dimensions are chosen for the hyperplanes. As is argued in Chapter 1, such searches make very high dimensional regression computationally difficult. Thus Hinging Hyperplanes are not an ideal candidate for the type of regression problems studied in this thesis. However, the computational efficiency of Hinging Hyperplanes on certain high dimensional problems (see [Breiman, 1993] for examples), make it an important potential candidate for many regression problems. Wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995] have also been effectively used in nonparametric regression. Wavelets have the attractive property that they are spatially adaptive, and thus they are effective at approximating functions which are locally spiky (i.e. are smooth throughout most of the input space, but have certain regions where they are rapidly changing). Wavelet shrinking algorithms have demonstrated good performance on relatively low dimensional problems (3 inputs or less) [Sjoberg et al., 1995]. However, wavelet algorithms which demonstrate good performance on very high dimensional problems have not yet been proposed in the literature.  
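As a concrete instance of the summed-basis-function form (2.3), the following sketch fits Gaussian radial basis functions with centers chosen from the training points. The Gaussian kernel, the fixed width, and the small ridge term are illustrative assumptions, not prescriptions from the cited work.

```python
import numpy as np

def rbf_design(X, centers, width):
    """phi(||x - x_q||) for a Gaussian radial basis, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf(X, y, centers, width, ridge=1e-6):
    """Solve for the linear weights a_q in sum_q a_q * h_q(x)."""
    Phi = rbf_design(X, centers, width)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=300)
centers = X[rng.choice(len(X), size=30, replace=False)]   # centers taken from the data
w = fit_rbf(X, y, centers, width=0.5)
y_hat = rbf_design(X, centers, 0.5) @ w
```

With the centers taken from the learning data, the number and placement of the basis functions is itself determined by the data, which is what makes the method nonparametric in the sense used above.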
2.1.2 Space Partitioning Algorithms

In space partitioning algorithms the domain D of the function being approximated is divided into a finite number of disjoint subdomains D_i such that ∪_i D_i = D. Then, in each subdomain D_i, a regression function is constructed such that:

$$\forall \mathbf{x} \in D_i, \qquad \hat{y} = \hat{f}_i(\mathbf{x}) \qquad (2.4)$$

Often, the functions f̂_i(x) are simply constant over the domain D_i. There are two main types of space partitioning algorithms: recursive partitioning or decision tree algorithms, and nearest neighbor algorithms.

Examples of recursive partitioning algorithms, or Decision Trees [Michie et al., 1994], include Classification and Regression Trees (CART) [Breiman et al., 1984] and C4.5 [Quinlan, 1993]. Recursive partitioning works as follows: starting with the entire domain D, subdivide it into two daughter subregions such that a parametric regression function (usually chosen to be a single constant value) fitted in each region reduces the overall approximation error. This process of binary subdivision continues with each daughter region (creating new daughter subregions) until some predefined cost trade-off between approximation error and number of subregions is met. Recursive partitioning algorithms have proven to be very useful, especially for classification problems. In addition, decision trees can be easily interpreted, allowing analysis of how various predictor variables interact with one another. However, because functions are approximated by many discontinuous subregions, recursive partitioning methods tend not to do well on regression problems which are continuous [Friedman, 1991]. This is especially true when there are many predictor variables.

Another notable example of a tree-structured space partitioning architecture is the Hierarchical Mixtures of Experts (HME) model, which uses the Expectation-Maximization (EM) learning algorithm [Jordan and Jacobs, 1994]. This is a parametric approach that uses a fixed number of experts (or parametric building blocks) which are hierarchically combined using gating networks to form the final regression function. In general, each "expert" is trained to work in a particular region of the input space, and the gating networks combine these regions to form a global approximation. The HME model has been successfully applied to many learning problems; however, because it uses a parametric learning algorithm, it is not directly applicable to the nonparametric regression problems studied in this thesis.

In the simplest implementation, K-Nearest Neighbor algorithms work as follows: given a point x, find the K points in the learning set which are nearest to this point, according to some distance metric. Then the regression estimate at x becomes the average (perhaps a weighted average) response value of these K nearest neighbors [Michie et al., 1994]. A criticism of Nearest Neighbor algorithms has been that they often exhibit poor generalization [Lowe, 1995]. This problem is mainly due to a poor choice of distance metrics (or kernels). Recently, algorithmic improvements have been suggested which allow the learning data to define the most appropriate distance metric [Lowe, 1995] [Friedman, 1994a] [Hastie and Tibshirani, 1996], the result being much better generalization, especially for classification problems.
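A minimal sketch of this simplest form of K-Nearest Neighbor regression follows; the Euclidean metric and unweighted average used here are exactly the kind of fixed, data-independent choices that the improved methods cited above replace.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=5):
    """K-Nearest Neighbor regression estimate at a single query point x."""
    dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distance to every sample
    nearest = np.argsort(dist)[:k]                     # indices of the k closest samples
    return y_train[nearest].mean()                     # unweighted average response

rng = np.random.default_rng(2)
X_train = rng.uniform(-1, 1, size=(500, 3))
y_train = X_train[:, 0] ** 2 + np.sin(X_train[:, 1]) + 0.1 * rng.normal(size=500)
print(knn_predict(np.array([0.2, -0.4, 0.9]), X_train, y_train, k=7))
```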
However, as with regression trees, because functions are approximated by many discontinuous subregions, Nearest Neighbor algorithms have yet to demonstrate good performance on regression problems which are very high dimensional (greater than 100 inputs) and continuous.

2.2 Regression Model Variance Reduction Techniques

For any given set of learning data, there are infinitely many regression models which fit this data exactly. An example of this is given in Figure 2.1, where the 5 data points are fitted exactly with 3 different model types: the piece-wise linear fit model, the smooth fit model (smoothness here refers to how rapidly the model varies within its domain rather than how many times it is differentiable), and the high variance fit model. These 3 models all describe the learning data exactly; however, they interpolate between the data points in very different ways. Most likely, at most one of these models appropriately describes the target function that generated the data. In fact, only a small subset of all models which fit any given training data accurately also effectively describe the target function that actually generated the data. Thus, one of the more difficult problems in nonparametric regression is choosing the most appropriate regression model. The problem becomes even more difficult when the training data is noisy, in which case exactly modeling the learning data can lead to greater approximation errors. This makes the regression model search space even larger because, in addition to models which exactly fit the training data, the search space is extended to include those models which closely fit the training data. As an example of this, the smooth approximation model in Figure 2.1 would be just one model which can be used to model the noisy sample points. Any number of small variations to this smooth approximation of the sample points could equally well fit the data points. A model with high variance is one which is not stable with respect to which set of training samples is used to construct it. The more variance a model has, the more sensitive it is to small changes in learning data (this notion of model variance is related to model stability as defined in [Breiman, 1996b]). If a model has high variance then two different sets of learning data randomly generated by the same target function can produce two very different regression models. Often, as with the high variance fit model in Figure 2.1, both models exactly fit their respective learning data, yet neither interpolates effectively between data points (this is often referred to as model over-fitting). Thus, an important goal in regression is to reduce model variance. For most regression problems, failing other evidence, the model with least variance is often the one which most smoothly interpolates through data points (for example the smooth fit model in Figure 2.1).

Figure 2.1: Model Variance

The variance reduction problem is related to the bias-variance trade-off [Friedman, 1994b], which divides the approximation error into a bias term and a variance term. The bias term defines how closely the regression model fits the learning data, and the variance term defines how much variance there is in the regression model. For example, the three models in Figure 2.1 that fit the data exactly all have a zero bias error, but have varying degrees of variance.
Conversely, the smooth approximation model in Figure 2.1 has a nonzero bias error because it doesn't pass exactly through sample points, however, if the data is assumed to be noisy, its variance error may be the smallest of all three models. Thus, the smooth approximation model may be the most appropriate model when data is noisy. A variety of techniques have been developed for nonparametric model variance reduction. In this thesis we group these methods into two categories: penalty based variance averaging  based variance  reduction.  reduction  and  These are briefly discussed below.  Methods which use penalties for variance reduction work by giving a specific cost to an increase in model complexity. The basic reason for this is that an increase in model complexity (i.e. an increase in the number of model parameters) often means an increase in model variance. Thus, the trade-off between the error reduction obtained by the addition of new model terms, and the increase in complexity due to the addition of these new terms, must be analyzed to determine if this increase in model complexity is justified. Examples of these types of penalty based variance reduction techniques include the generalized  cross  validation  criteria used in M A R S  Chapter 2: Review of Nonparametric Regression  [Friedman, 1991], and the minimum  description  28  length  criteria used in S O N N [Tenorio and  Lee, 1990]. Generalized cross validation works by multiplying the average squared residual error of the training data by a penalty factor which increases as terms are added to the M A R S model. Similarly, minimum description length attempts to find the smallest model which effectively fits the training data. Other examples of penalty based variance reduction techniques can be found in [Friedman, 1994b]. One key advantage of variance reduction techniques which use penalties, is that a single regression model is constructed, thus allowing for potentially useful model interpretation. However, one of the key problems associated with penalty based variance reduction techniques is that they are inherently unstable. Breiman [Breiman, 1996b] has shown that this form of model selection is unstable because small changes in learning data can lead to large changes in model structure, making it difficult to choose the most appropriate model. In contrast to penalty based variance reduction techniques, averaging methods work by forming ensembles of many different regression functions, and modeling a final predictor value as a weighted average of all the regression functions. One of the first noted examples of this is Stacked Generalization [Wolpert, 1992] where potentially nonlinear combinations of regression functions, each trained on some subset of the total data, are used to form the overall regression function. This idea is simplified by Breiman in Stacked Regression [Breiman, 1996c], where the stacked regression functions are simply weighted by non-negative real values. Theoretical justification for such weighted averaging is given by Perrone [Perrone, 1993], where derivations are given for the optimal weighted average of a fixed number of regression functions. In addition, Breiman has shown that averaging can effectively stabilize predictors which are inherently unstable [Breiman, 1996b]. One example of an averaging variance reduction technique is Breiman's bootstrap aggregation or Bagging [Breiman, 1996a]. 
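To make the averaging idea concrete, the following sketch fits many copies of an arbitrary base learner to bootstrap resamples of the training set and averages their predictions, anticipating the bagging procedure described next. The quadratic base learner, the equal weights, and the ensemble size are illustrative assumptions.

```python
import numpy as np

def fit_quadratic(X, y):
    """An arbitrary base learner: least-squares fit of a quadratic in each input."""
    A = np.column_stack([np.ones(len(X)), X, X ** 2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_quadratic(coef, X):
    A = np.column_stack([np.ones(len(X)), X, X ** 2])
    return A @ coef

def bagged_predict(X_train, y_train, X_test, n_models=25, rng=None):
    """Average the predictions of base learners fit to bootstrap resamples."""
    rng = rng or np.random.default_rng(0)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))   # bootstrap sample
        coef = fit_quadratic(X_train[idx], y_train[idx])
        preds.append(predict_quadratic(coef, X_test))
    return np.mean(preds, axis=0)
```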
Bagging uses the average value of many regression functions, each trained using a bootstrap sample of the learning data, to represent the final regression function. The key idea here being that the bootstrap samples tend to create regression functions which have uncorrelated error residuals, thus averaging their values tends to cancel out approximation errors, giving potentially better regression functions when compared with a single regression. Bagging has proven to be an effective variance reduction technique, and variations of it have been successfully applied to many types of regression problems [Parmanto et al., 1996]. However, one significant drawback of all averaging methods is that they do not easily allow for  Chapter 2: Review of Nonparametric Regression  29  interpretation of the resulting regression model.  2.3 Convergence Results in Nonparametric Regression In this section, we consider three main types of known convergence results for nonparametric regression. First we briefly discuss some known results in representability, or what function class a given regression function is theoretically able to converge to. Second, we look at two types of rate of convergence results. The first measures how, given an infinite number of learning sample points, the approximation error decreases with the size of the regression model. The second measures how the approximation error decreases as the number of sample points of the target function increases. Examples of these types of convergence results are given below. Given a particular regression function, one of the first theoretical questions which is asked is what class of functions  can it represent!  Indeed, the most common question posed is whether  the regression model is a universal approximator; i.e. whether it is able to form arbitrarily good approximations of bounded continuous functions defined over a finite domain. Examples of universal approximators include polynomials [Lorentz, 1986], radial basis functions [Girosi, 1994], artificial neural networks [Cybenko, 1989] [Hornik et al., 1989], space partitioning techniques [Breiman et al., 1984], projection pursuit [Huber, 1985], and wavelets [Sjoberg et al., 1995] [Juditsky et al., 1995]. These representation results are important because they establish what the respective regression functions are capable of representing. However, representation results in themselves do not establish the rate at which these regression functions converge to the target function. The first type of rate of convergence in approximation  result we consider  error as the number of parameters  is one which measures the  in the regression  function  increases.  decrease  From a  practical standpoint, this rate of convergence measure tells how the size of the regression model depends on the dimension of the target function. In particular, from the standpoint of high dimensional regression, we can ask the following question: "under what conditions is the rate of convergence of the regression model independent of dimension?" A general answer to this question is the following: by limiting the class of target functions which one is attempting to approximate, many regression functions have a rate of convergence which is independent of dimension [Girosi, 1994]. This includes radial basis functions [Girosi, 1994], artificial neural networks [Barron, 1993], hinging hyperplanes [Breiman, 1993], and projection pursuit  Chapter 2: Review of Nonparametric Regression  30  [Jones, 1992]. 
A l l of these theoretical results demonstrate rates of convergence independent of dimension are based on a lemma which can be jointly attributed to Muarey [Barron, 1993], Jones [Jones, 1992] and Barron [Barron, 1993] [Girosi, 1994]. In all these results, the class of target functions is limited to those functions for which the magnitude of the first moment of the Fourier transform is finite and independent of dimension. In addition, all of these results are based on greedy  algorithms which search an infinite function space for the globally optimal regression  function. Because of this large search problem, it is not clear how computationally practical these convergence results are [Barron, 1993] [Breiman, 1993] [Juditsky et al., 1995]. The second decreases  rate of convergence  with the number  of sample  result  we discuss  here is how approximation  points of the target function.  error  Stone [Stone, 1982] showed  that, in the general case of a target function which is at least once differentiable, the optimal rate of convergence as a function of the number of sample points, does depend on dimension. In fact, this dependence is significant enough that general very high dimensional nonparametric regression requires an impractical number of sample points (i.e. the curse of dimensionality). Once more, by putting more limits on the target function, it is possible to improve on these rates. There are examples in the literature which demonstrate function spaces, and corresponding regression functions, which have rates of convergence that are independent of dimension. One example of this is the projection pursuit regression function when the target function is limited to the class of projection pursuit regression functions with specific constraints [Chen, 1991a]. A second example is an interaction spline model, where the target function is sufficiently differentiable [Chen, 1991b]. Such theoretical results are useful because they can lead to practical implementations which are effective in very high dimensional spaces. For further discussions on rate of convergence results, see Section 5.3 of Chapter 5.  2.4 C o m p a r i n g Existing Algorithms to the S P O R E In this chapter, many different approaches to nonparametric regression have been discussed. In addition, we have discussed the concept of model variance reduction and how it is done in the literature. Finally, we have outlined some of the basic theoretical results in nonparametric regression. In this section, we compare SPORE to other published regression methodologies. First we outline which existing learning algorithms most closely .resemble SPORE-type learning algorithms, and how they are fundamentally different. Then we discuss how the variance  Chapter 2: Review of Nonparametric Regression  31  reduction methodology of SPORE relates to those of existing variance reduction techniques. Finally, this section concludes with a brief description of how the theoretical properties of SPORE differ from those of other published regression functions. The two types of regression functions which are most closely related to the regression functions studied in this thesis (SPORE), are Cascade-Correlation [Fahlman and Lebiere, 1990], and G M D H [Ivankhnenko, 1971] type algorithms such as A S P N [Elder and Brown, 1995] and S O N N [Tenorio and Lee, 1990]. A l l of these algorithms build successively on structures, using the outputs of previously constructed structural units as inputs to new units. 
However, from the conceptual standpoint of how units are added, and how their inputs are chosen, SPORE differs significantly from these regression functions. These regression functions attempt to build an "optimal" structure by searching for the "best" set of functional units, and/or the best set of inputs to each structural unit. As argued in Section 1.2.2 of Chapter 1, such searches are not computationally feasible in very high dimensional spaces. In the SPORE methodology, this philosophical perspective of searching for optimal (or close to optimal) structures is abandoned because our aim is specifically to address the very high dimensional regression problem. Instead, S P O R E algorithms spend minimal computational effort on searches in function space and input space, concentrating more on constructing structures which produce robust approximations without the need for costly search strategies. From the standpoint of how variance reduction is achieved in the SPORE methodology, we refer the reader to the specific example of SPORE-1 in Chapter 3, and the related discussion in Section 3.3.3 of that chapter. In comparison to other algorithms, SPORE-1 achieves effective variance reduction using a methodology which is similar to Bagging [Breiman, 1996a], but does so with only a single regression function, instead of an aggregate ensemble of regression functions. This is accomplished by constructing each structural unit using a bootstrap sample of the training data. The output of SPORE-1 is the weighted average (each unit is weighted proportionally to the inverse of the squared residual error it has on the training data) of each of these simple structural units. This produces an effective variance reduction methodology (here termed internal variance reduction) which differs significantly from other proposed averaging methods. In Section 5.3 of Chapter 5, the rate of convergence (as a function of the number of sample points) results for the general SPORE-type learning algorithms are given. The rate of convergence is shown to be independent of dimension, but on different function spaces from those studied in  Chapter 2: Review of Nonparametric Regression  32  the past. Instead of putting greater limits on the function type [Chen, 1991a] or requiring it to be more differentiable [Chen, 1991b], we assume a more general class of bounded continuous functions, which are at least once differentiable. We obtain a rate of convergence results independent of dimension by limiting the target function's domain of definition in specific ways. Because of this, regression functions based on SPORE have very different convergence properties when compared with other published results.  Chapter 3: The SPORE-1 Regression Function  The general SPORE methodology outlined in Chapter 1 can be implemented in a variety of ways. In this chapter, the simplest implementation of SPORE, termed SPORE-1, is defined and analyzed. In Section 1.3 of Chapter 1, an intuitive justification for the SPORE-1 structure was presented. In this chapter, we present a theoretical analysis of SPORE-1, with particular emphasis on rate of convergence and complexity properties. In Section 3.1 we examine the representation capabilities of the SPORE-1 regression function. In Section 3.2, the SPORE-1 learning algorithm, which is used to construct the regression function, is defined. In Section 3.3 an analysis of the theoretical properties of the SPORE-1 learning algorithm is given. 
Finally, a summary of the main results of this Chapter is given in Section 3.4.

3.1 SPORE-1: A Cascade of Two Dimensional Functions

The SPORE-1 structure is a cascade of two dimensional functions, with the output of each level of the cascade feeding directly to one of the inputs of the next level. This is diagrammatically represented in Figure 3.1. In the following, a complete description of this basic structure is given, followed by an analysis of the class of functions which can be represented by this SPORE-1 structure.

Let R_L(x_1, ..., x_N) be an N dimensional SPORE-1 regression function constructed using an L level cascade of 2 dimensional functions. Thus, ŷ = R_L(x) estimates the dependent variable y given the independent variables x = (x_1, ..., x_N), and has the following form:

$$R_L(x_1, \ldots, x_N) \;=\; \alpha_1 \cdot g_1(x_{k_0}, x_{k_1}) \;+\; \alpha_2 \cdot g_2(g_1, x_{k_2}) \;+\; \sum_{l=3}^{L} \alpha_l \cdot g_l(g_{l-1}, x_{k_l}) \qquad (3.1)$$

where L defines the number of levels in the cascade or, equivalently, the number of 2 dimensional functions g_l(·); the α_l are real valued scaling factors; and the subscripts k_0, ..., k_L ∈ {1, ..., N} serve to identify the input variables (x_1, ..., x_N).

Figure 3.1: The SPORE-1 Regression

Given the above definition, the question we pose is: what class of functions can be represented by the simple cascade given in (3.1)? The answer to this question depends on what class of 2 dimensional functions g_l(·) are used in the cascade. In the following, two classes of 2 dimensional functions are considered. First we consider the case of g_l(·) belonging to the class of bounded continuous 2 dimensional functions defined on a closed domain in ℝ². In the second case g_l(·) is limited to the class of 2 dimensional polynomials of finite order greater than 1. The interesting conclusion is that in both instances the cascade in (3.1) is shown to be a universal approximator, i.e. capable of approximating any bounded continuous multidimensional function defined over a finite domain.

3.1.1 A Cascade of Bounded Continuous 2 Dimensional Functions

In the following theorem, it is assumed that the 2 dimensional functions g_l(·) belong to the class of 2 dimensional bounded continuous functions. Under these conditions, it is shown that any continuous, bounded N dimensional function, defined over a closed domain, can be represented using a cascade of 2N² + 3N g_l(·) functions. Since this is the broadest class of bounded continuous g_l(·) functions, this result defines a bound on the number of cascade levels necessary to exactly represent a bounded continuous multidimensional function. Other smaller classes of bounded continuous functions, such as polynomials, will require more functions.

THEOREM 1. Let f(x_1, ..., x_N) be a bounded continuous function defined on a closed
Then, there exist 2N  + 3N bounded continuous  2  gi(-) functions  which  2 are at most 2 dimensional,  and 2N  +3N  real valued numbers a.[, such that:  2N + 3N-2 2  f(x ,...,x ) i  N  = a g (x r  l  k o  ,x  k i  )  + a -g (g ,x )+ 2  2  ]  £  k2  <*f8i(8i-i> k)  (-)  x  3  2  /= 3  Proof: The proof is based on Kolmogorov's superposition theorem [Kolmogorov, 1957], which states that any bounded continuous TV dimensional function, f(x , x  x ), defined on a N  closed domain, can be represented as: 2N  f N  \  (3.3) q-=0  where the 1 dimensional functions \|/ and  V =1  are bounded and continuous (for more recent  versions of this theorem see [Lorentz, 1986]). Thus, for our proof we must show that (3.2) and (3.3) are equivalent for N>2. We show this by equating all g (-) functions in (3.2) to the t  appropriate functions in (3.3). First make the following assignment: 8\ = <t>0l( l) t 02( 2)' x  Then, for Ze {2, ...,N-  +  ( )  ex, = 0  x  (3.4)  1} and p = I + 1, let £/ = S/-1 + M V '  =  i  a  0  ( 3  -  5 )  Thus, completing the definitions to level / = TV, let 8N = Yo(fftf-i)>  N=  A  (-)  1  3 6  Therefore, TV levels are required to build the q = 0 part of (3.3). What remains are the q = 1,  2TV parts to be constructed. Consider the q = 1 part. For level / = N +1 , let 8  Then, for le {N + 2, ...,2N}  N +  i  =  <hi(*i)>  N +\ =  a  (-) 3  0  7  and p = / - A T , let  */ = S / - i  + (  MV'  i  a  =  "  0  (3  8)  Thus, completing the definitions to level / = 2N + 1 , let 82N+1  = Vl(82N)>  a  2N \ +  =  1  (-) 3  9  Chapter 3: The SPORE-1 Regression Function  Sl-\  0 + x +(0-g _ ) l  l  + (0-x  36  1  l  •£/_,)  a =0 t  +  l  x  Si 0 + x + (0- ) x  gl  + (l - z j • g ) + ... /  l  x  Si+i  0 + x, + ( O g  / + 1  ) + (l  -g  / + ]  ) + ... «/ 2 =  1  +  Figure 3.2  A Cascade Representation of a Polynomial T e r m  Therefore, 2N + 1 levels are required to build the q = 0 and q = 1 parts of (3.3). What remains are the q = 2,  2N parts to be constructed. The construction of these parts is identical to that  of q = 1, which took N + 1 levels of (3.2). Thus, the total number of levels required is 2N(N  +l)+N  = 2N + 3N. 2  This completes the proof of T H E O R E M 1. Q.E.D.  T H E O R E M 1 establishes a bound on the sufficient number of levels required to exactly represent any bounded continuous function defined over a finite domain in 3 or more dimensions. However, the only restriction on the functions g (-) is that they be some 2 dimensional function. ;  In addition, the proof of T H E O R E M 1 is based on Kolmogorov's superposition theorem, and it is a known result that the 1 dimensional functions used in Kolmogorov's theorem are often highly non-smooth [Vitushkin, 1954] [Vitushkin and Henkin, 1967] and therefore not easy to represent in practice.  3.1.2 A Cascade of Finite Order 2 Dimensional Polynomials In this section the functions g {-) are further restricted to belong to the class of finite order t  polynomials. Thus T H E O R E M 2 deals with a more practical scenario in that the functions { - ) gl  are of a class which makes them easier to use in practice. T H E O R E M 2. Let f(x , x  A cz 3i . Let (-) N  gl  be a 2 dimensional  x ) be a continuous N  polynomial  function  defined on a compact  of finite order greater  subset  than 1. 
Then, for each  Chapter 3: The SPORE-J Regression Function  37  £ > 0, there exists a finite number, L, of functions  \f{x ,  £/(•) and real numbers  ...,x )-R {x ,  x  N  L  OC such that /(  (3.10)  ...,x )\<z  x  N  where / ^ ( x ) is defined in (3.1).  Proof: It is a known result that a continuous function / ( * , , ...,x ) N  compact subspace can be approximated by polynomials in x  (see Theorem 6 on page 10  ...,x  u  defined over a  N  in [Lorentz, 1986] for a statement of the appropriate theorem). This representation has the form £  f(x ...,x )= v  N  ...,i e N  where b^  b _ {0, 1,2, ...} i i t  t i i t  (3.11)  -xi-x%  - are polynomial coefficients. Thus, we shall prove T H E O R E M 2 by showing that t  any polynomial in x ,  x  x  N  can be represented by the cascade structure defined in (3.1), where  the ( - ) are 2 dimensional polynomials of finite order greater than 1. B y this constraint, the g[  simplest form of functions used in the cascade is: 2 = a 0 0 + a ] 0 u + a 0 1 v + a uv  gl(u,v)  2  + a M +n2 a  (3-12)  v  20  lx  This and every other 2 dimensional polynomial of order greater than 1 must at least have the terms a w and a uv. 10  Given these two terms, it is simple to show that any term in (3.11) can be  lx  equated to the cascade structure (3.1). A s an example, consider the polynomial term 3 35  3  • We first construct the JC, part of this term as outlined in Figure 3.2. Starting at any  ^335 ' i 2 3 x x  x  arbitrary level / , set a = 0 , x t  k  = x,, a  for levels p = / + 1 and p = / + 2 set a  p  1 0  = 1, and all other parameters of ( - ) to zero. Then, g[  = 0, x  gp(-) to zero. This gives us the x  {  P  part of b  =x, a  k  3 3 3 5  x  u  = 1 , and all other parameters of  3 35  • xxx. {  follows. For levels p = / + 3 to p = / + 5 set a  p  2  3  3  Next, we construct the x part as  = 0, x  2  k  =x, a 2  u  = 1, and all other  parameters of g (-) to zero. Next, for the x^ part, for levels p = / + 6 to p = 1 + 9 set a p  x  =x , a  k  3  oc = b p  3 3 5  p  = 0,  = 1, and all other parameters of g (-) to zero. Finally, for level p = /+10,set  n  p  , x  k  = x , a = 1, and all other parameters of g p (-) to zero. This gives us the 3 3 5 3  P  completed term b  3 3 5  n  •xxx x  2  3  . In a similar manner, we can construct any other term in the polyno-  mial (3.11). Hence, given that we can construct any polynomial term using sections of the cascade structure in (3.1), and because (3.1) is a sum of such sections, it follows that any polynomial structure can be represented by the cascade structure defined in (3.1). This completes the proof of  Chapter 3: The SPORE-1 Regression Function  38  T H E O R E M 2. Q.E.D.  3.2 Learning Algorithm for the SPORE-1 Regression Function In the previous section we have shown that the SPORE-1 regression function is a universal approximator. However, we have not shown how to construct the functions g (-) and real ;  numbers a . One such algorithm is presented next. t  3.2.1 The SPORE-1 Learning Algorithm We assume that M  L  2 sets: a training set T  T  M  v  learning input/output examples (x , y ) e T have been divided into q  containing M  L  input/output pairs, and a validation set T containing  T  v  input/output pairs. In Chapter 4, various methodologies are described for dividing learning  data into training and validation sets. There are 3 preset learning parameters, symbolized by depth j , depth , 2  and e. 
The function of each of these parameters is explained in the following  algorithmic description, and is further elaborated on below. A single cascade of  functions is  constructed in a series of sections. A section of a cascade between levels i and  symbolized by  Rj, is defined as follows:  l  j  l= i  When the section being constructed can no longer reduce mean squared error, the learning outputs y are replaced by residual errors due to Rj and a new section is begun starting from the last error l  reducing function gj(-).  Throughout the remainder of the chapter we assume that the functions  are 2 dimensional polynomials of finite order: I +  g {u,v) t  =  j<  K  2  fly-MV  (3.14)  i, ye {0, 1,2, ...'} where K is the order of the polynomial, and a- are its coefficients. Stated in detail, the construction of the regression functions, R (x), L  proceeds as follows (note that each SPORE-1 regression  function has only one output, so for problems with multiple outputs, one regression function must be constructed for each output):  Chapter 3: The SPORE-J Regression Function  39  Algorithm 1: SPORE-1 S T E P 1: Initialize algorithm:  Initialize the 2 counters i = 1 and p = 1, where / and p are used  as follows: / indexes the function g-(-) at the start of the current section; p is a subscript indicating that section p of the cascade will be constructed in step 2. Throughout the algorithm the subscripts k ,  k (used to identify the independent variables) are randomly  0  L  selected, one at a time, from the set {1,  N} , without  replacement. When the set is empty,  N} , and the process of selecting these subscripts continues.  it is re-initialized to {1,  S T E P 2: Construct new section: Construct, in order, the functions £,•(•)>£,• + ](•)> • • •, gj( •), where j - i> depth j, as follows:  1. Obtain a bootstrap sample set T  (i.e. M  T  randomly chosen samples with replacement)  T  from the training set T . T  2. Given r , using singular value decomposition, determine the model parameters r  {a } pa  of the function g(-) which minimize the following least squared error criteria:  Ho-} =  a r  g  M I N  (3.15)  H } 0  The construction stops with function gy(-) when the change in cascade error on the validation set T ,  over the past depth + 1 level additions, is less than some small, positive number £.  v  }  This is defined as follows:  MSE( Rj)r l  1 depth j  where R = l  MSE(R )  q Tv  a (  / ' £/(') ^  s t  I  MSE{ R ) l  q Tv  \q = j - 1 - depth/ n  e  (3.16)  <e  v  JJ  constructed cascade between levels i and q; and  is the mean squared error of this cascade on the validation set T . v  addition of each new level, the scaling factors a , for / = l  With the  j, are recomputed as follows:  Chapter 3: The SPORE-1 Regression Function  40  (MSE( (-)) 8l  )"'  r  a, =  -  (3.17)  1  ZWSEig (-)) )-\ q  rT  q= i  where MSE( (-)) g[  is the mean squared error of the single function g (-)  r  t  on the training set  T . Hence, a is simply the inverse of the normalized mean squared error of level / , and its T  t  purpose is to give more weight to levels which contribute more significantly to error reduction. STEP 3: Prune section: Next prune the current section back to the level which has the best mean squared error as follows. Find the minimum  level s e {/- 1, i,  j}  which gives the least  mean squared error of the cascade on the validation set T , in the following sense: v  choose^<r s.t. 
V ( r e {i,...,j}),  Delete all functions g ] ( • ) , g J+  i+ 2  - scaling factors CL , for / = i, T  MSE{ R )r <MSE(R )r  (3.18)  l  s  v  t  v  (0> •••> g / ' ) • Set j - s and, using (3.17), recalculate the j . The current state of the cascade is now given by:  /= l  Note that if s = / - 1 then all of the functions just constructed in step 2 are deleted. STEP 4: Update learning outputs: Update the output values, y •, of both the validation set T  v  the training set T  and  as follows:  T  s V((x,y)6 r  v  u r » = > y = y-X<*/•*/(•)  (3.20)  l= i  The sets T  v  and T  T  now contain the residual approximation errors after the cascade has been  constructed to pruned level s. Let tion set T  v  be the variance of the output values, y •, of the valida-  (as defined in step 1, the index p refers to the section just constructed). Thus, p  p  is equivalent to the mean squared error of the approximation R (x) on the validation set s  p  p  T. v  is used in STEP 5 to establish a stopping condition under which the construction of the  Chapter 3: The SPORE-1 Regression Function  41  cascade stops. STEP 5: Check  stopping  condition:  Go back to STEP 2, constructing at least depth  2  +1  sections, until the following condition is met: f  \ 1 depth*  (3.21)  <£  p-i \y = p - 1 - depth  2  Therefore, the algorithm will terminate when the mean squared error fails to improve by a factor of e, over the last depth  2  + 1 sections. Before executing STEP 2, set / = s + 1 and  increment the section counter p to initialize the next section. Upon termination, set L = s, the total number of levels constructed. This completes the description of the algorithm. As indicated above, the learning parameters depth j , depth , 2  and £, are all used to define  conditions under which the construction of the cascade stops. In particular, £ defines the relative decrease in error required to justify the addition of more g (-) functions to the cascade. The t  learning parameter depth  }  defines how many levels of the cascade are used to determine the  termination conditions of STEP 2. Similarly, depth  2  and £ are used in STEP 5 to define how  many times STEP 2 is executed before the algorithm is executed. The theoretical significance of these learning parameters is discussed in Section 3.3.1. In Chapter 4, the actual setting of these values is discussed.  3.3 Theoretical Analysis of the SPORE-1 Learning Algorithm In this section we analyze some theoretical properties of the SPORE-1 learning algorithm. In Section 3.3.1 we give convergence and rate of convergence results for the algorithm. In Section 3.3.2 we give complexity results. In Section 3.3.3, we discuss the internal variance reduction property of SPORE-1. And in Section 3.3.4 we define the virtual space partitioning property of SPORE-1.  Chapter 3: The SPORE-J Regression Function  42  3.3.1 Convergence Results We begin the theoretical analysis of SPORE-1 by analyzing its convergence properties. In T H E O R E M 3 we prove that, under appropriate regularity assumptions, the SPORE-1 algorithm monotonically converges to some finite approximation error. In T H E O R E M 4 we give sufficient conditions for the approximation error to converge to zero. In T H E O R E M 5 we extend this result by showing that the rate (as a function of the number of learning sample points) at which the learning algorithm converges to this approximation error, is independent of the dimension of the dimension of the target function. 
The proofs of T H E O R E M 3, T H E O R E M 4 and T H E O R E M 5 are based on 4 lemmas which we state and prove below. The reader should note that in the following, the learning data F updated learning  data  L  always refers to the  or residuals as defined in S T E P 4 of the learning algorithm.  In the first lemma, we show that the approximation error on any bootstrap sample used in the construction of a function £/(•)> must either decrease or stay the same; it can never increase. L E M M A A . For all / > 1, the following index for the summation  V  is true (note that the subscript  q serves as an  ):  (3.22)  Proof: From S T E P 2 of the SPORE-1 learning algorithm, the parameters of g ,(•) are chosen [+  as follows: =  i ij} a  a r  g  M I L  W  (3.23)  ^Z(8i \(u,v)-y ) +  lr  q  B  Given that u = g and v = x l  T  we could set coefficients of all terms with x  k  k/  ^ to zero if they  increase error. Therefore, the error can never increase. Similarly, error will decrease if and only if v = x  contributes to error reduction within the parametric structure of g  k  / + 1  ( - ) - Thus,  L E M M A A is true by definition. Q.E.D. The implication of L E M M A A is that the mean squared error of any new level MSE(g  l  +  j(-)) „, on any bootstrap sample T , is as least as small as the mean squared error of T  the previous level. Thus, even if v = x  k  I + 1 , the parametric form of g  [+  does not contribute to the lowering of error at level  , (•) ensures that for any bootstrap T , the previous best error is T  Chapter 3: The SPORE-1 Regression Function  43  maintained. Next we examine how the expected error, on the entire learning data set Y , behaves with L  the addition of a new function g  t +  , (•) to the cascade. This requires us to make some assumptions  about the function which originally generated the learning data. Assumption A: We will assume that all the learning data Y , and therefore all of the data L  in the training  set Y  T  and the validation  set Y ,  are independently  y  generated  by some  unknown  function: y = /(x) where f(x)  is bounded,  continuous  some noise term with the property the probability variance,  density function D).  V(xe  (3.24)  + e(x)  and defined over some closed domain that V(x £ D),  E(e(x))  N  = 0. Furthermore,  define fix)  as  and e(x)  is  we assume that  o/e(x) is continuous,  which defines the distribution  Note that one can equivalently  D cz 3l ,  and has finite  follows:  oo  /(x)  = E[y\x)  =  \ yh(y\x)iy(dy)  (3.25)  —oo  where, h(y\x)  is a continuous  some measure on  probability  density function  with bounded  variance,  and q>(dy) is  .  In the following lemma, we use Assumption A to show that the expected approximation error on the learning data either decreases or stays the same. L E M M A B. Given Assumption A, and for sufficiently then for all l>  1, the following E[£(8i i(8i, +  large sample sizes M  T  and  My,  is true:  xJ kl  -y/]  < £ [ ] • > / . ( • ) -y ) ]  0-26)  2  q  Proof: From L E M M A A , we know that L E M M A B is true if the learning set Y is equivaL  lent to the bootstrap training set: i.e. Y = Y . Hence, it is reasonable to assume that it will also be T  true if r Y  T  r  L  is a "sufficiently good" representation of the distribution of points in the training set  and the validation set Y . Thus, we prove this lemma by showing, given Assumption A, Y v  becomes sufficiently close (in the probability distribution sense) to Y  T  T  M. 
v  and Y  v  for large M  T  and  Consider any arbitrary point x, e D. From the Central Limit Theorem, we know the follow-  ing: if y,,  y  n  are independent random observations from a population with probability  Chapter 3: The SPORE-1 Regression Function  44  1  2  function h(y\x ) for which the variance o [y] is finite, the sample mean y = ~2"-  l s  x  approximately normally distributed when the sample size is reasonably large, with mean E[y] 2  and variance (a [ y ] ) / « [Nester et al., 1990]. Therefore, if we limit all learning sample points to be generated from any single point X j e D and we make M  and M  sufficiently large, the  following is true due to Assumption A and the CentralE Limit Theorem;  (3.27)  T  = \^L(8i(-)-y )  ^8i \isi,x J-y )  2  +  kl  v  q  q  •r  —'  — I -— i  This is true because, at any single point x, e D, the construction of the function g (g , x l+ x  t  as defined in (3.23) is equivalent to finding the mean value of the bias coefficient a  )  k  0 0  ; i.e.  8i+i(8i> ic! ,) reduces to a single mean value which is equivalent to the mean of a one x  dimensional distribution. In addition, assuming that the variable x  k  ^ does not contribute to  reduction of error, then because g (-) and f(x) are both bounded and continuous over a finite t  domain, and because the distribution of e(x) is also continuous, it must be the case that (3.27) is also true for any finite local region (within the domain D) about any arbitrary point x, 6 D . This is true because the continuity constraints on g {-) l+x  imply that as any point (w,, V j ) approaches  another point (u , v ) , then g (u , v,) must approach g , ( M , V ) ; hence, this is analogous 2  l+ l  2  l  l+  2  2  to taking an elastic surface and smoothly bending it according to the constraints of (3.23). Furthermore, under the same conditions, because this is true for any arbitrary point X j e D, (3.27) must be true over the entire domain D (i.e. simply stretch the above elastic surface to include all of D). Thus, at minimum, the expected approximation error will remain unchanged. Similarly, due to the construction of g  E\T(Si (8i,  x  +i  is true if and only if the variable of the parametric structure of g  (g , x ^ ) as defined in (3.23), the condition ;  )-  k  k  y/1 < 4Z(S/(0 - v/1  (3.28)  serves to reduce the approximation error within the context  |  l + x  l + x  (g , x t  k  |  ) . This completes the proof of L E M M A B. Q.E.D.  Next we extend the result of the previous lemma to show that the expected approximation error, over the entire domain D of the target function, either decreases or stays the same. L E M M A C. Let f (x) u  residuals STEP 4,  represent  the updated error function  in STEP 4. Thus before STEP 4 is first executed f (x) u  computes:  after the computation  = /(x),  and the execution  of of  Chapter 3: The SPORE-1 Regression Function  45  = /(x)-£a  f (x) u  (3.29)  •*,(•),  z  /= i and thereafter f (x)  is updated as follows:  u  s  f (x)^f (x)-^a (-) u  u  (3.30)  r§l  /= i Then, given Assumption A, and for sufficiently large sample sizes M  T  and M , for every level v  I > 1, the following is true:  E\\(8i D  ,(*/, x J-f {x)) dD\  < E\\{ {-)-f {x)) dD\ D  2  +  ki  (3.31)  2  u  gl  u  Proof: From Assumption A we know that the samples y have mean f (x). Furtherq  more, because f(x)  and 2/-  i i '£/(') a  a  r  u  bounded and continuous, then so is f (x).  
e  u  The the  proof of L E M M A C is identical to the proof of L E M M A B , with the exception that the approximation errors are now integrated over the entire domain D and not simply summed over the learning set T . Q.E.D. L  Next, using the above lemmas, we show that the SPORE-1 learning algorithm must converge to some finite approximation error. It is important to establish that SPORE-1 converges to some error function because this allows us to establish conditions under which the algorithm terminates. Algorithms which do not terminate are rarely useful in regression. T H E O R E M 3. Given Assumption A, and for sufficiently large sample sizes M  T  then for every £ > 0 f e e  and M , v  there exists a level I such that the following is true:  EU(R D  <£  (x)-R (x)) dD 2  l  +  p  t  (3.32)  where p > 1 is any positive integer (note that this theorem implies a monotonically converging sequence).  Proof: Consider the construction of the function R (x) during the first execution of S T E P L  2. Let  Chapter 3: The SPORE-1 Regression Function  46  I  R (x)  = X <* -8 (-)  t  P  (3-33)  P  p= i  and /+1  = 5>VM-> P  ( 3  = \  -  3 4 )  Note that we have put a prime on a' to distinguish that they are different from a . Therefore, p  /+1  /  i? (x)-/?,(*) = 2 <V M ' ) - I /+1  a , •*„(•)  (3.35)  Simplifying we get: */ i(*)-*/(x) =  -S/ i(-J-  +  +  E (a'p-V-M ) -  (  3  3  6  )  where, in the limit of M —> oo, T  JU (-)-/(x)) ^D 2  p  a „ ->  = -r-e2  V  X(J(*,(•)-/(x)) dDj 2  q= i D  I  (3.37)  X % q= i  and ,„V'  .2  « ' P "> ITT  ; = TTT^ ,2  X (/(*,(•)-/(x)) dDJ 2  q= i D  (- > 3  38  £ 01  q= i  where, for all /? > 0 ®  p  = {\(g (-)-Hx))dDJ p  D  Therefore, we can write oc' - a as p  (3.39)  Chapter 3: The SPORE-1 Regression Function  47  (  ^  \q = i  J  1  p a  p -  a  p  /+1  /  q= i  q=i  //+ I  2*,J  -O  ^ =i  Y  I*,  \  '  (3.40)  I*,  Simplifying, we get V  f -q>  *\  /+ l  (3.41)  /+1  5>« because /+ I  a'  (3.42)  Therefore, we can simplify (3.36) as follows: substituting cc' - a Ri iW-R[(x)  =  +  a'  /+1  -g (-)-a' /+1  p  X  /+1 P  with -oc'  / +  <VM'>  jOC^ (3.43)  = I  Substituting in Equation (3.33) we get:  /? (x)-/? (x) = a ' /+1  z  z+1  -g^O-a^j/tyx)  (3.44)  which simplifies to R (x)-R (x) l+l  =  t  a' (g (-)-RL(x)) l+l  Finally, from L E M M A C, and assuming that  l+l  (3.45)  > 0 for all / (note that if this assumption is  false, then T H E O R E M 3 is tme by default due to L E M M A C), as / - » « , E[O  / + 1  -<D ]->0 z  (3.46)  therefore, we have an infinite sum of a constant, giving infinity, or /+ l  ct  <7  (3.47)  Chapter 3: The SPORE-1 Regression Function  48  and as a result  /+1  (3.48)  -»0.  -  V  X*, q=  J  l  Therefore, as / —> °o, Equation (3.45) becomes: £(/?  z + 1  (3.49)  (x)-/e (x))->0. z  Thus, T H E O R E M 3 is true during the first execution of execution of STEP 2, and because of the continuity arguments in L E M M A C, the same argument can be made during each subsequent execution of STEP 2. This completes the proof of T H E O R E M 3. Q.E.D. T H E O R E M 3 establishes that the SPORE-1 Learning Algorithm converges to some function ^ ( x ) as / —> ~ . Thus, for any fixed sequence k , Q  , the approximation error  converges to the following: (3.50) D  In addition, T H E O R E M 3 allows us to define theoretically optimal values for the learning parameters depth j, depth , and e (see STEP 2 and STEP 5 of the SPORE-1 Learning 2  Algorithm). 
First consider $\varepsilon$, which defines a cut-off on the minimum allowable decrease in approximation error. From the proof of THEOREM 3, we know that this rate of decrease monotonically converges to zero as the number of levels grows. Hence, by stopping the algorithm when some $\varepsilon$ rate of decrease is reached, we know that any future expected decrease will not exceed this amount. Thus, in order to achieve the error given in (3.50), we must let $\varepsilon \rightarrow 0$. In addition, the true rate of potential error reduction is measured only when $p \rightarrow \infty$ in (3.32). Therefore, the parameters $depth_1$ and $depth_2$, which are related to $p$, should also be large in order to achieve (3.50).

In the next theorem, we establish sufficient conditions under which the approximation error is reduced when a new level is added to the regression function. In order to do this, we need to make a further assumption, as follows.

Assumption B: For simplicity, we define the following:

$$\Psi = E\Big[\int_D \big(g_{l+1}(\cdot) - f_u(x)\big)^2 \, dD\Big] \qquad (3.51)$$

$$\Upsilon = E\Big[\int_D \big(g_{l+1}(\cdot) - f_u(x)\big)\big(R_l(x) - f(x)\big) \, dD\Big] \qquad (3.52)$$

$$\Omega = E\Big[\int_D \big(R_l(x) - f(x)\big)^2 \, dD\Big] \qquad (3.53)$$

Given the above definitions, the following 3 conditions are assumed to be true. First, the approximation error of $R_l(x)$ is greater than the correlation between the errors due to the new level function $g_{l+1}(\cdot)$ and the approximation error due to $R_l(x)$. This can be stated as follows ($f_u(x)$ is defined in LEMMA C):

$$\Omega > \Upsilon \qquad (3.54)$$

In other words, this condition is satisfied when the errors due to $g_{l+1}(\cdot)$ are sufficiently uncorrelated with those due to $R_l(x)$. The second condition requires that the correlation between the errors due to the new level function $g_{l+1}(\cdot)$ and the approximation error due to $R_l(x)$ satisfy the following:

$$\Psi > 2\Upsilon - \Omega \qquad (3.55)$$

Once again, this condition is satisfied when the errors due to $g_{l+1}(\cdot)$ are sufficiently uncorrelated with those due to $R_l(x)$. The final condition which we need to meet is that the scaling factor $\alpha'_{l+1}$ satisfies the following:

$$\alpha'_{l+1} \;\le\; \frac{-(\Upsilon - \Omega) - \sqrt{(\Upsilon - \Omega)^2 - (\Psi - 2\Upsilon + \Omega)(1 - e)\,\Omega}}{\Psi - 2\Upsilon + \Omega} \qquad (3.56)$$

where $e = 1 - \delta$, and $\delta \in \mathbb{R}$ is an arbitrarily small positive real number defining the amount of error reduction due to the addition of a new level to the regression function. If we let $l \rightarrow \infty$, and thus $\alpha'_{l+1} \rightarrow 0$ (see proof of THEOREM 3), this condition is also satisfied when the errors due to $g_{l+1}(\cdot)$ are sufficiently uncorrelated with those due to $R_l(x)$. Hence all 3 conditions require that the approximation errors of a new level are uncorrelated with those of previous levels. As discussed in Section 3.3.3, the SPORE-1 learning algorithm encourages this condition to occur through the use of bootstrap training samples.

Using Assumption B, we now state the following theorem.

THEOREM 4. Given Assumption A and Assumption B, and for sufficiently large sample sizes $M_T$ and $M_v$, for every $l \ge 1$ the following is true:

$$E\Big[\int_D \big(R_{l+1}(x) - f(x)\big)^2 \, dD\Big] \;\le\; e \cdot E\Big[\int_D \big(R_l(x) - f(x)\big)^2 \, dD\Big] \qquad (3.57)$$

where $e = 1 - \delta$ is defined in Assumption B.

Proof: From (3.44) we can write $R_{l+1}(x)$ as follows:

$$R_{l+1}(x) = \alpha'_{l+1} \cdot g_{l+1}(\cdot) - \alpha'_{l+1} R_l(x) + R_l(x) \qquad (3.58)$$

This can be written as:

$$R_{l+1}(x) = \alpha'_{l+1} \cdot g_{l+1}(\cdot) + (1 - \alpha'_{l+1}) R_l(x) \qquad (3.59)$$

In addition, we can write $(R_{l+1}(x) - f(x))$ as:

$$\big(R_{l+1}(x) - f(x)\big) = \alpha'_{l+1}\big(g_{l+1}(\cdot) - f_u(x)\big) + (1 - \alpha'_{l+1})\big(R_l(x) - f(x)\big) \qquad (3.60)$$
and $(R_{l+1}(x) - f(x))^2$ becomes:

$$\big(R_{l+1}(x) - f(x)\big)^2 = (\alpha'_{l+1})^2\big(g_{l+1}(\cdot) - f_u(x)\big)^2 + 2\alpha'_{l+1}(1 - \alpha'_{l+1})\big(g_{l+1}(\cdot) - f_u(x)\big)\big(R_l(x) - f(x)\big) + (1 - \alpha'_{l+1})^2\big(R_l(x) - f(x)\big)^2 \qquad (3.61)$$

Using the above identities and the definitions given in (3.51), (3.52) and (3.53), rewriting (3.57) we get:

$$(\alpha'_{l+1})^2\,\Psi + 2\alpha'_{l+1}(1 - \alpha'_{l+1})\,\Upsilon + (1 - \alpha'_{l+1})^2\,\Omega \;\le\; e\,\Omega \qquad (3.62)$$

Solving for $\alpha'_{l+1}$, we get

$$\alpha'_{l+1} \;\le\; \frac{-(\Upsilon - \Omega) - \sqrt{(\Upsilon - \Omega)^2 - (\Psi - 2\Upsilon + \Omega)(1 - e)\,\Omega}}{\Psi - 2\Upsilon + \Omega} \qquad (3.63)$$

The above is true given that Assumption B is satisfied. This completes the proof of THEOREM 4. Q.E.D.

In the last two theorems, we have established conditions under which the SPORE-1 learning algorithm converges. Next, we quantify the rate of convergence of this algorithm. More precisely, we measure how the approximation error decreases as the number of sample points used during the construction process increases. As discussed in Section 2.3 of Chapter 2, this type of rate of convergence analysis is an important indication of an algorithm's usefulness in high dimensional regression. In the following lemma and theorem, we establish that the rate of convergence of the SPORE-1 Learning Algorithm, to the finite approximation error given by (3.50), is independent of the dimension of the function being approximated.

LEMMA D. Let the optimal $g_l(\cdot)$ function be defined as

$${}^{opt}g_l(u, v) = \sum_{\substack{i + j \le K \\ i, j \in \{0, 1, 2, \ldots\}}} {}^{opt}a_{ij} \cdot u^i v^j \qquad (3.64)$$

where the coefficients $\{{}^{opt}a_{ij}\}$ are chosen using an infinite number of training sample points as follows:

$$\{{}^{opt}a_{ij}\} = \underset{\{a_{ij}\}}{\operatorname{argmin}} \int_D \big({}^{opt}g_l(u, v) - f_u(x)\big)^2 \, dD \qquad (3.65)$$

and $f_u(x)$ is the residual error function defined in LEMMA C. Similarly, let the approximated $g_l(\cdot)$ function be defined as

$$g_l(u, v) = \sum_{\substack{i + j \le K \\ i, j \in \{0, 1, 2, \ldots\}}} a_{ij} \cdot u^i v^j \qquad (3.66)$$

where the coefficients $\{a_{ij}\}$ are chosen using a finite number of training sample points as follows:

$$\{a_{ij}\} = \underset{\{a_{ij}\}}{\operatorname{argmin}} \sum_{T_L} \big(g_l(u, v) - y_q\big)^2 \qquad (3.67)$$

Given Assumption A, and for a sufficiently large sample size, the rate of convergence of the function $g_l(\cdot)$ to the optimal function ${}^{opt}g_l(u, v)$, as a function of the number of sample points $M_T$, is independent of the dimension $N$ of the target function $f(x)$ (we assume $N \ge 2$). More precisely, there exists a constant $K$, independent of $N$, such that the rate of convergence is given by:

$$E\Big[\int_D \big(g_l(\cdot) - {}^{opt}g_l(\cdot)\big)^2 \, dD\Big] \;\le\; \frac{K \cdot (T_{g_l(\cdot)})^2}{M_T} \qquad (3.68)$$

where $T_{g_l(\cdot)}$ is the number of terms in the polynomial $g_l(\cdot)$.

Proof: We first prove this lemma for the first level $l = 1$. The coefficients of ${}^{opt}g_1(u, v)$ are given, using an infinite sample size, by the following:

$$\{{}^{opt}a_{ij}\} = \underset{\{a_{ij}\}}{\operatorname{argmin}} \int_D \big({}^{opt}g_1(x_{k_0}, x_{k_1}) - f(x)\big)^2 \, dD \qquad (3.69)$$

Similarly, the coefficients for $g_1(\cdot)$ are given using a finite sample size $M_T$, as follows:

$$\{a_{ij}\} = \underset{\{a_{ij}\}}{\operatorname{argmin}} \sum_{T_L} \big(g_1(x_{k_0}, x_{k_1}) - y_q\big)^2 \qquad (3.70)$$

Thus, given that $g_1(\cdot)$ converges to ${}^{opt}g_1(\cdot)$ as $M_T \rightarrow \infty$ (this is true because $g_1(\cdot)$ is continuous and so is $f(x)$), we can view $g_1(\cdot)$ as being an approximation of ${}^{opt}g_1(\cdot)$.
Given this view, the coefficients $\{a_{ij}\}$ are chosen based on sample outputs $y_q$ which are generated using the following density function:

$$h_{g_1}\big(y \,\big|\, (x_{k_0}, x_{k_1})\big) = \frac{\int_{D'} \big({}^{opt}g_1(x_{k_0}, x_{k_1}) - f(x)\big) \, dD'}{\int_{D'} dD'} \qquad (3.71)$$

where $f(x)$ is defined in (3.25) of Assumption A, and $D'$ is the domain $D$ given that $(x_{k_0}, x_{k_1})$ is fixed (hence $D'$ is an $(N-2)$ dimensional space, spanning the $N-2$ dimensions of $D$ which do not include the axes $(x_{k_0}, x_{k_1})$). Within this context, and by Assumption A, $h_{g_1}(y|(x_{k_0}, x_{k_1}))$ must be continuous with finite variance, and an equivalent representation for ${}^{opt}g_1(x_{k_0}, x_{k_1})$ is:

$${}^{opt}g_1(x_{k_0}, x_{k_1}) = E\big[y \,\big|\, (x_{k_0}, x_{k_1})\big] = \int y \, h_{g_1}\big(y \,\big|\, (x_{k_0}, x_{k_1})\big) \, \varphi(dy) \qquad (3.72)$$

Thus, we view the sample outputs $y_q$ as being generated as follows:

$$y = {}^{opt}g_1(x_{k_0}, x_{k_1}) + \varepsilon(x_{k_0}, x_{k_1}) \qquad (3.73)$$

where the density function of the error term $\varepsilon(x_{k_0}, x_{k_1})$ is continuous and has finite variance $\sigma^2_\varepsilon(x_{k_0}, x_{k_1})$ dependent on $(x_{k_0}, x_{k_1})$. Note that the variance $\sigma^2_\varepsilon(x_{k_0}, x_{k_1})$ is independent of the dimension $N$ because it is calculated using the distribution given in (3.71), which integrates and divides out $N-2$ dimensions of $f(x)$. Now, using the equality $\sigma^2(y) = E(y^2) - (E(y))^2$, we can write

$$E\big[(g_1(\cdot) - {}^{opt}g_1(\cdot))^2\big] = \sigma^2\big[g_1(\cdot) - {}^{opt}g_1(\cdot)\big] + \big(E\big[g_1(\cdot) - {}^{opt}g_1(\cdot)\big]\big)^2 \qquad (3.74)$$

From the Central Limit Theorem we know that $E[g_1(\cdot) - {}^{opt}g_1(\cdot)] = 0$ for each point $(x_{k_0}, x_{k_1})$. In addition, because ${}^{opt}g_1(x_{k_0}, x_{k_1})$ is constant for each fixed point $(x_{k_0}, x_{k_1})$, we can write (3.74) as follows (note that we are using the equality $\sigma^2(y + c) = \sigma^2(y)$ for $c$ constant):

$$E\big[(g_1(\cdot) - {}^{opt}g_1(\cdot))^2\big] = \sigma^2\big[g_1(\cdot) - {}^{opt}g_1(\cdot)\big] = \sigma^2\big(g_1(\cdot)\big) \qquad (3.75)$$

Next, if we symbolize the terms of $g_1(\cdot)$ as

$$g_1(\cdot) = \sum_{k=1}^{T_{g_1(\cdot)}} \theta_k \qquad (3.76)$$

then, using the identity $\sigma^2\big(\sum_i c_i \theta_i\big) = \sum_i \sum_j c_i c_j \sigma(\theta_i, \theta_j)$, where $\sigma(\theta_i, \theta_j)$ is the covariance between $\theta_i$ and $\theta_j$, we obtain

$$E\big[(g_1(\cdot) - {}^{opt}g_1(\cdot))^2\big] = \sigma^2\Big(\sum_{k=1}^{T_{g_1(\cdot)}} \theta_k\Big) = \sum_{k=1}^{T_{g_1(\cdot)}} \sum_{j=1}^{T_{g_1(\cdot)}} \sigma(\theta_k, \theta_j) \qquad (3.77)$$

Now, for all terms $i = j = k$, where $k \in \{1, \ldots, T_{g_1(\cdot)}\}$, the covariance is given by the Central Limit Theorem as (i.e. the variance of a distribution decreases as $\sigma^2(y)/n$ for $n$ samples, if the original distribution had variance $\sigma^2(y)$):

$$\sigma(\theta_k, \theta_k) = \sigma^2(\theta_k) = \frac{\sigma^2_\varepsilon(x_{k_0}, x_{k_1})}{M_T} \qquad (3.78)$$

where $\sigma^2_\varepsilon(x_{k_0}, x_{k_1})$ is the variance of the noise term defined in (3.73). Similarly, for all terms $i \ne j$ the covariance is:

$$\sigma(\theta_i, \theta_j) = \frac{\sigma^2_{ij}(x_{k_0}, x_{k_1})}{M_T} \qquad (3.79)$$

where $\sigma^2_{ij}(x_{k_0}, x_{k_1})$ is some initial covariance between $\theta_i$ and $\theta_j$ (note that $\sigma^2_{ij}(x_{k_0}, x_{k_1})$ measures the covariance between terms of the 2 dimensional polynomial $g_1(\cdot)$, and is therefore independent of $N$). Thus, if we let $\sigma^2_{Max}(x_{k_0}, x_{k_1})$ be the maximum covariance over all of $\sigma^2_{ij}(x_{k_0}, x_{k_1})$ and $\sigma^2_\varepsilon(x_{k_0}, x_{k_1})$, we can write (3.77) as:

$$E\big[(g_1(\cdot) - {}^{opt}g_1(\cdot))^2\big] \;\le\; \frac{(T_{g_1(\cdot)})^2 \, \sigma^2_{Max}(x_{k_0}, x_{k_1})}{M_T} \qquad (3.80)$$

Therefore, integrating over $D$, we get

$$E\Big[\int_D \big(g_1(\cdot) - {}^{opt}g_1(\cdot)\big)^2 \, dD\Big] \;\le\; \frac{K \cdot (T_{g_1(\cdot)})^2}{M_T} \qquad (3.81)$$

where $K = \int_D \sigma^2_{Max}(x_{k_0}, x_{k_1}) \, dD$ is independent of $N$. Thus, this proves LEMMA D for the function $g_1(\cdot)$. The proof for all other levels $l > 1$ follows the same logic and is not given here. This completes the proof of LEMMA D. Q.E.D.
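The dimension-independence argument of LEMMA D can also be checked numerically: only the two inputs $(u, v)$ ever enter the least squares fit, so the estimation error of the 10 polynomial coefficients shrinks roughly as $1/M_T$ no matter how large $N$ is. The following Monte Carlo sketch is ours (the noise level, coefficient values and trial counts are arbitrary) and is not an experiment reported in this thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def design(u, v):
    # the 10 monomials u^i v^j with i + j <= 3
    return np.column_stack([u**i * v**j for i in range(4) for j in range(4) if i + j <= 3])

def coef_error(M_T, sigma=0.5, trials=200):
    """Average squared coefficient-estimation error of the 2-D cubic fit from M_T noisy samples."""
    true_coef = rng.normal(size=10)
    errs = []
    for _ in range(trials):
        u, v = rng.uniform(-1, 1, M_T), rng.uniform(-1, 1, M_T)
        A = design(u, v)
        y = A @ true_coef + rng.normal(scale=sigma, size=M_T)   # noise plays the role of epsilon
        est, *_ = np.linalg.lstsq(A, y, rcond=None)
        errs.append(np.mean((est - true_coef) ** 2))
    return np.mean(errs)

for M_T in (100, 400, 1600):
    print(M_T, coef_error(M_T))   # the error shrinks roughly like 1/M_T; N never enters the fit
```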
In LEMMA D we proved that the rate of convergence of each function $g_l(\cdot)$ is independent of the dimension $N$ of the target function. In the following theorem, we extend this to show that the rate of convergence of $R_l(x)$ is also independent of $N$.

THEOREM 5. Let ${}^{opt}R_{L_1}(x) = \sum_{l=1}^{L_1} {}^{opt}\alpha_l \cdot {}^{opt}g_l(\cdot)$ be the optimal constructed cascade (to level $L_1$) after the first time that STEP 2 of the SPORE-1 learning algorithm is executed; thus ${}^{opt}R_{L_1}(x)$ is constructed using an infinite number of sample points, and ${}^{opt}\alpha_l$ is defined as:

$${}^{opt}\alpha_l = \frac{\Big[\int_D \big({}^{opt}g_l(\cdot) - f(x)\big)^2 dD\Big]^{-1}}{\sum_{j=1}^{L_1} \Big[\int_D \big({}^{opt}g_j(\cdot) - f(x)\big)^2 dD\Big]^{-1}} \qquad (3.82)$$

where ${}^{opt}g_l(\cdot)$ is defined in LEMMA D. Similarly, let $R_{L_1}(x) = \sum_{l=1}^{L_1} \alpha_l \cdot g_l(\cdot)$ be the constructed cascade (using the usual finite number of sample points) after the first time that STEP 2 of the SPORE-1 learning algorithm is executed. Then, given Assumption A and for sufficiently large sample sizes $M_T$, the rate of convergence, as a function of the number of sample points $M_T$, of the function $R_{L_1}(x)$ to the optimal function ${}^{opt}R_{L_1}(x)$ is independent of the dimension $N$ of the target function $f(x)$ (we assume $N \ge 2$). More precisely, there exists a constant $K$, independent of $N$, such that the rate of convergence is given by:

$$E\Big[\int_D \big(R_{L_1}(x) - {}^{opt}R_{L_1}(x)\big)^2 \, dD\Big] \;\le\; \frac{K \cdot (T_{g_l(\cdot)})^2}{M_T} \qquad (3.83)$$

where $T_{g_l(\cdot)}$ is the number of terms in the polynomial $g_l(\cdot)$.

Proof: Using the identity $\sigma^2(y) = E(y^2) - (E(y))^2$, we write

$$E\big[(R_{L_1}(x) - {}^{opt}R_{L_1}(x))^2\big] = \sigma^2\big(R_{L_1}(x) - {}^{opt}R_{L_1}(x)\big) + \big(E\big[R_{L_1}(x) - {}^{opt}R_{L_1}(x)\big]\big)^2 \qquad (3.84)$$

From the Central Limit Theorem we know that $E[R_{L_1}(x) - {}^{opt}R_{L_1}(x)] = 0$ for each point $x$. In addition, because ${}^{opt}R_{L_1}(x)$ is constant for each fixed point $x$, we can write (3.84) as follows (note that we are using the equality $\sigma^2(y + c) = \sigma^2(y)$ for $c$ constant):

$$E\big[(R_{L_1}(x) - {}^{opt}R_{L_1}(x))^2\big] = \sigma^2\big(R_{L_1}(x)\big) \qquad (3.85)$$

Next, given that $R_{L_1}(x) = \sum_{l=1}^{L_1} \alpha_l \cdot g_l(\cdot)$, we can write the above as:

$$E\big[(R_{L_1}(x) - {}^{opt}R_{L_1}(x))^2\big] = \sigma^2\Big(\sum_{l=1}^{L_1} \alpha_l\, g_l(\cdot)\Big) = \sum_{i=1}^{L_1} \sum_{j=1}^{L_1} \alpha_i \alpha_j\, \sigma(g_i, g_j) \qquad (3.86)$$

For each point $x$, let $\rho_{Max}(x)$ be the maximum covariance over all $\sigma(g_i, g_j)$. Then, given that $\sum_l \alpha_l = 1$ by definition, we can write the above as:

$$E\big[(R_{L_1}(x) - {}^{opt}R_{L_1}(x))^2\big] \;\le\; \Big(\sum_{i=1}^{L_1} \alpha_i\Big)\Big(\sum_{j=1}^{L_1} \alpha_j\Big)\rho_{Max}(x) = \rho_{Max}(x) \qquad (3.87)$$

From LEMMA D, we know that for $\sigma(g_i, g_i) = \sigma^2(g_i)$, the variance has the form:

$$\sigma^2(g_i) \;\le\; \frac{(T_{g_i(\cdot)})^2\, \sigma^2_{Max}(\cdot)}{M_T} \qquad (3.88)$$

Thus, given that the functions $g_l(\cdot)$ are constructed one at a time using a least squared error fit, all other covariances $\sigma(g_i, g_j)$ must have the same form given in (3.88). Therefore, substituting $\rho_{Max}(x)$ with the maximum $\sigma(g_i, g_j)$, (3.87) becomes:

$$E\big[(R_{L_1}(x) - {}^{opt}R_{L_1}(x))^2\big] \;\le\; \frac{(T_{g(\cdot)})^2\, \sigma^2_{Max}(\cdot)}{M_T} \qquad (3.89)$$

Therefore, integrating over $D$, we get

$$E\Big[\int_D \big(R_{L_1}(x) - {}^{opt}R_{L_1}(x)\big)^2 \, dD\Big] \;\le\; \frac{K \cdot (T_{g(\cdot)})^2}{M_T} \qquad (3.90)$$

where $K = \int_D \sigma^2_{Max}(\cdot)\, dD$ is independent of $N$ (see proof of LEMMA D). This completes the proof of THEOREM 5. Q.E.D.

THEOREM 5 therefore shows that the rate of convergence of the error function to the optimal error function is independent of the dimension of the function being approximated. This result is important because it establishes that, for the SPORE-1 Learning Algorithm, the rate at which the constructed cascade approaches the best regression function is independent of the dimension of the function being approximated. Thus the number of sample points required to build an optimal SPORE-1 regression function does not explode with dimension.
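For illustration, the sketch below computes cascade weights of the normalized inverse-error form used above for ${}^{opt}\alpha_l$. The per-level error values in the example are made up, and the function is a minimal example of the weighting rule rather than the SPORE-1 implementation.

```python
import numpy as np

def level_weights(level_errors):
    """Weights alpha_l proportional to the inverse of each level's integrated squared error,
    normalized to sum to one (the form assumed for the optimal cascade weights)."""
    inv = 1.0 / np.asarray(level_errors, dtype=float)
    return inv / inv.sum()

# example: three levels with decreasing residual error on a validation set
print(level_weights([0.30, 0.12, 0.05]))   # later, more accurate levels receive larger weight
```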
3.3.2 Complexity Results

An important property of a learning algorithm for the construction of high dimensional regression functions is how its complexity grows with the number of input variables. Given that the number of parameters of each function $g_l(\cdot)$ is fixed, the computational cost of adding a level $l$ is the same for all levels. In the following, we symbolize the computational complexity of adding a new level to the regression function as $C$. The cost $C$ is dominated by the computations required to determine the polynomial coefficients of $g_l(\cdot)$ (see STEP 2 of the SPORE-1 Learning Algorithm). In the following theorem, we calculate the computational cost of the SPORE-1 learning algorithm.

THEOREM 6. For any fixed sequence $k_0, \ldots, k_L$ of indices defining the order of the independent variables (see STEP 1 of the SPORE-1 Learning Algorithm), and any $\varepsilon > 0$, the computational cost to achieve the convergence

$$E\Big[\int_D \big(R_{l+1}(x) - R_l(x)\big)^2 \, dD\Big] < \varepsilon \qquad (3.91)$$

is $\mu N C$, where $N$ is the dimension of $f(x)$, $C$ is the complexity of adding a level to the cascade, and $\mu > 0$ ($\mu \in \mathbb{R}$) is independent of the dimension $N$.

Proof: From THEOREM 3, we know that a level $l$ satisfying the above condition must exist. Furthermore, due to THEOREM 3, there must be a sequence of positive real numbers $\varepsilon_0, \ldots, \varepsilon_k$, where each $\varepsilon_j$ is the expected value of $\int_D (R_{l+1}(x) - R_l(x))^2\, dD$ after the $j$-th cycle through the independent variables, (3.92) such that $\varepsilon_{j+1} \le \varepsilon_j$ and $\varepsilon_k < \varepsilon$. Thus $k$ is the number of times one must cycle through the $N$ independent variables before the desired convergence is achieved. Note that $k$ depends on how the function $f(x)$ projects, at each specific level, onto the 2 dimensional functions $g_l(\cdot)$. Thus $k$ depends on the distributions $h_g(y|\cdot)$ (see LEMMA D), and not on the dimension of $f(x)$. Therefore we complete the proof of THEOREM 6 by choosing $\mu > k$. Q.E.D.

Thus, THEOREM 6 shows that the computational cost of building a SPORE-1 regression function is linear with respect to the number of predictor variables.

Next consider the effect of the number of sample points $M_T$ used in the construction of the regression function on computational complexity. The main effect of sample size is on the construction of each level of the regression function. Thus, the cost $C$ of adding a single level is some function of the number of sample points, i.e. $C(M_T)$. The actual relationship depends on how the coefficients of $g_l(\cdot)$ are calculated. In the SPORE-1 algorithm the preferred method is Singular Value Decomposition (SVD), and therefore $C(M_T)$ is the cost of doing one SVD calculation (see [Golub and Van Loan, 1993] for details).

3.3.3 Internal Variance Reduction

As outlined in Section 2.2 of Chapter 2, variance reduction is fundamental to constructing regression functions which have low approximation errors. One of the more effective means of variance reduction is to use many different regression functions to create a single prediction using a weighted average of all their outputs. The resulting regression function is referred to as an ensemble regression function. Thus, given $P$ different $N$ dimensional regression functions symbolized by $f_i(x)$, the ensemble regression function $\bar{f}(x)$ is given by:

$$\bar{f}(x) = \sum_{i=1}^{P} \beta_i\, f_i(x) \qquad (3.93)$$

where the $\beta_i$ are some positive scaling coefficients.
Perrone [Perrone, 1993] has shown that optimal variance reduction is achieved when the $\beta_i$ are calculated as follows:

$$\beta_i = \frac{\sum_j C^{-1}_{ij}}{\sum_k \sum_j C^{-1}_{kj}} \qquad (3.94)$$

where $C_{ij}$ is the correlation between the errors of the $f_i(x)$ and $f_j(x)$ regression functions. Furthermore, if the residual errors of any two different regression functions $f_i(x)$ and $f_j(x)$ (i.e. $i \ne j$) are uncorrelated, (3.94) becomes:

$$\beta_i = \frac{\big[E\big((f_i - f)^2\big)\big]^{-1}}{\sum_j \big[E\big((f_j - f)^2\big)\big]^{-1}} \qquad (3.95)$$

This calculation of $\beta_i$ is equivalent to the calculation of the scaling factors $\alpha_l$ computed in STEP 2 of the SPORE-1 learning algorithm. Thus, the SPORE-1 regression function will exhibit optimal variance reduction if all $g_l(\cdot)$ have residual approximation errors which are uncorrelated. Although the learning algorithm does not guarantee that the approximation errors of all of the functions $g_l(\cdot)$ are uncorrelated, this condition is implicitly encouraged through the use of different bootstrap samples and predictor variable inputs for the construction of each level. In THEOREM 4, we showed that error reduction (or equivalently variance reduction) can occur even if two functions $g_i(\cdot)$ and $g_j(\cdot)$ have some correlated residual errors, as long as the conditions in Assumption B are satisfied.

Therefore, SPORE-1 exhibits variance reduction which is internal to its structure. It does not rely on a large ensemble of different high dimensional regression functions, using instead a weighted average of the 2 dimensional functions within its structure. We refer to this as the internal variance reduction property of the SPORE-1 learning algorithm. In Chapter 4, we experimentally verify that this internal variance reduction property is effective in significantly reducing approximation errors.

3.3.4 Virtual Space Partitioning

Consider the pathological situation where

$$\{{}^{opt}a_{ij}\} = \underset{\{a_{ij}\}}{\operatorname{argmin}} \int_D \big({}^{opt}g_l(\cdot) - f(x)\big)^2 \, dD = \{0, \ldots, 0\} \qquad (3.96)$$

Given this condition, we cannot expect a reduction in approximation error to occur, because the optimal solution is $g_l(\cdot) = 0$. In Chapter 5 this condition is avoided by partitioning the space $D$ into a finite number of subdomains where this cannot happen (note that in Chapter 5 it is shown that these subdomains must exist). However, even without space partitioning, this pathological condition is unlikely, because the actual calculation of the coefficients of $g_l(\cdot)$ is done using a bootstrap sample of the training data. Due to replacement during sampling, on average each bootstrap sample contains roughly 63% of the training data [Efron, 1983]. Therefore, a bootstrap sample is essentially a random partitioning of the input space. We term this phenomenon virtual space partitioning.

In Chapter 4, we experimentally verify that virtual space partitioning causes the SPORE-1 regression function to converge to zero approximation error, even when condition (3.96) holds. We verify this by applying SPORE-1 to the 10 bit parity problem. As outlined in Section 4.5 of Chapter 4, a major deficiency of most learning algorithms which build a regression function using lower dimensional functional units is that they cannot represent parity functions that are of higher dimension than their largest dimensional functional unit.
In being able to approximate the 10 bit parity problem arbitrarily well using only 2 dimensional functional units, the virtual space partitioning property of the SPORE-1 algorithm serves to overcome the deficiencies present in other dimensionality-reducing regression functions.

3.4 Summary

This chapter constitutes our first attempt at a theoretical verification of the thesis presented in Section 1.3 of Chapter 1. First, we have shown that the SPORE-1 structure is a universal approximator. Second, we have shown that, under appropriate conditions, the rate of convergence of the SPORE-1 learning algorithm to some finite approximation error is independent of the dimension of the target function. Third, we have given sufficient conditions under which the approximation error converges to zero. Finally, we have shown that the complexity of constructing the SPORE-1 regression function is at most linear in the dimension of the target function.

Chapter 4: Experimental Results

In this chapter, the SPORE-1 learning algorithm is experimentally evaluated. In Section 4.1, the implementation details of SPORE-1 are described. In Section 4.2, the algorithm is evaluated on ten well known regression problems. In Section 4.3, the algorithm is evaluated on very high dimensional regression data. In Section 4.4, SPORE-1 is applied to 3 very high dimensional human-to-robot skill transfer problems. In Section 4.5, SPORE-1 is applied to the 10-bit parity problem. Finally, Section 4.6 concludes the chapter with a summary of experimental results.

4.1 Implementation Details of the SPORE-1 Regression Function

In this section we give the implementation details of the SPORE-1 algorithm described in Section 3.2 of Chapter 3. In all of the experiments described in this chapter, the same version of SPORE-1 is used. Thus, the same class of two dimensional functions $g_l(\cdot)$ is used, with identical settings for the learning parameters $depth_1$, $depth_2$, and $\varepsilon$. Even though the regression data presented in this chapter vary widely in dimension and number of learning samples, no fine tuning of the SPORE-1 algorithm was required. The SPORE-1 algorithm is completely automated, requiring no manual intervention.

The functions $g_l(\cdot)$ are modeled as third order 2 dimensional polynomials as follows:

$$g_l(u, v) = a_{00} + a_{10}u + a_{01}v + a_{11}uv + a_{20}u^2 + a_{02}v^2 + a_{21}u^2v + a_{12}uv^2 + a_{30}u^3 + a_{03}v^3 \qquad (4.1)$$

As indicated in Chapter 3, this choice for $g_l(\cdot)$ is not unique. In THEOREM 2 of Chapter 3 (Section 3.1.2) we showed that $g_l(\cdot)$ must belong to the class of 2 dimensional polynomials of order greater than one. Our original reason for not choosing second order polynomials was that third order polynomials are the smallest order polynomials which incorporate asymmetric terms (i.e. terms such as $uv^2$, which we assumed would allow for faster convergence), which second order polynomials do not. However, preliminary experiments (these results are not reported here) with second order polynomials suggest that they produce approximations which are comparable (with respect to mean squared error) to those produced by third order polynomials. In addition, second order polynomials often produced regression functions which were smaller than those produced by third order polynomials (however, this may not prove true if we exclude coefficients which are insignificantly small; we have not attempted such an analysis here).
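A minimal way to represent and evaluate the building block of equation (4.1) is sketched below; the coefficient values in the example are arbitrary and purely illustrative.

```python
def g(u, v, a):
    """Evaluate the third-order two-dimensional polynomial of equation (4.1).
    `a` maps exponent pairs (i, j), with i + j <= 3, to the coefficients a_ij."""
    return sum(coef * (u ** i) * (v ** j) for (i, j), coef in a.items())

# example with arbitrary (illustrative) coefficient values
coeffs = {(0, 0): 0.1, (1, 0): -0.4, (0, 1): 0.7, (1, 1): 0.2, (2, 0): 0.05,
          (0, 2): -0.3, (2, 1): 0.0, (1, 2): 0.15, (3, 0): 0.01, (0, 3): -0.02}
print(g(0.5, -1.0, coeffs))
```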
Our original reason for not choosing higher order polynomials is our goal of testing the SPORE methodology under the simplest implementation, i.e. very simple polynomial building blocks (only 10 coefficients). As with second order polynomials, preliminary experiments (these are not reported here) indicated that fourth, fifth and sixth order polynomials produce regression functions which are comparable (with respect to mean squared error) with those produced with third order polynomials. The main difference is that higher order polynomials produce regression functions which are larger (once again, this may not prove true if we exclude coefficients which are insignificantly small). Thus, even from a practical perspective, it appears that the choice of $g_l(\cdot)$ may be arbitrary, as long as it belongs to the class of 2 dimensional polynomials of order greater than one.

The three learning parameters were assigned the following values: $depth_1 = 25$, $depth_2 = 6$, and $\varepsilon = 0.0001$ (see Section 3.2 of Chapter 3). These values were consistently found to be sufficiently close to the theoretically motivated values given in Section 3.3.1 of Chapter 3; i.e. theory suggests that large positive values should be assigned to $depth_1$ and $depth_2$, while $\varepsilon > 0$ should be made as small as possible (see STEP 2 and STEP 5 of the SPORE-1 Learning Algorithm in Section 3.2 of Chapter 3).

4.2 Evaluating SPORE-1 on Standard Regression Problems

In this section we evaluate the efficacy of the SPORE-1 algorithm by applying it to 10 well known regression problems from the literature (see Table 4.1) [Grudic and Lawrence, 1997]. These problems were chosen because they have been extensively studied, with many different learning algorithms applied to them. All of these regression examples are comparatively small, ranging in dimension from 4 input variables to 16, with learning sample sizes ranging from 80 to 15,000. Although these regression problems are not the type of large, very high dimensional problems that SPORE-1 was designed for (i.e. more than 100 input variables with about 40,000 learning examples), they are among the largest regression problems available in the literature. In addition, they allow us to directly compare the performance of SPORE-1 to many different learning algorithms. Our comparisons are based on 3 different publications: Breiman [Breiman, 1996a], Rasmussen [Rasmussen, 1996], and Jordan and Jacobs [Jordan and Jacobs, 1994]. In the following, we give a brief description of the regression data used in each of these publications.

4.2.1 Regression Data Used by Breiman

Breiman studied 5 different regression problems in [Breiman, 1996a]. His goal was to determine how Bagging (see Section 2.2 of Chapter 2) affects the approximation error of CART (see Section 2.1.2 of Chapter 2). Since Bagging is a method of improving predictor accuracy through variance reduction, comparing SPORE-1 to the Bagged CART algorithm allows us to directly study the efficacy of the internal variance reduction property of SPORE-1 (see Section 3.3.3 of Chapter 3). Of the 5 regression data sets studied in [Breiman, 1996a], we used the Boston Housing, Friedman #1, Friedman #2, and Friedman #3 data. A brief description of these data sets is given below.

• Boston Housing Data: This data has 12 predictor variables, which constitute various socio-economic indicators in the Boston area, and 1 output variable, which is the median housing price in the corresponding area.
There are a total of 506 examples in this data set, 51 of which are randomly selected to be used as a test set.

• Friedman #1: This is a simulated data set which appeared in [Friedman, 1991]. It has 10 independent predictor variables which are uniformly distributed over $0 \le x_i \le 1$, for all $i = 1, \ldots, 10$. The output is defined by the following equation:

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon \qquad (4.2)$$

The noise term $\varepsilon$ is Normally distributed with mean zero and a variance of 1. Half of the 10 prediction variables ($x_6, \ldots, x_{10}$) do not affect the output $y$, so this data is a good test of the ability of the SPORE-1 learning algorithm to ignore variables which do not contribute to reduced approximation error. This data has a total of 200 learning examples and 1000 test examples, all randomly generated using (4.2).

• Friedman #2 and Friedman #3: This is simulated data having 4 inputs and 1 output. The data was used in [Friedman, 1991] and simulates the impedance and phase shift in an alternating current circuit. The Friedman #2 data is generated using the following:

$$y_2 = \Big(x_1^2 + \big(x_2 x_3 - (1/(x_2 x_4))\big)^2\Big)^{1/2} + \varepsilon_2 \qquad (4.3)$$

The Friedman #3 data is generated using the following:

$$y_3 = \operatorname{atan}\!\left(\frac{x_2 x_3 - (1/(x_2 x_4))}{x_1}\right) + \varepsilon_3 \qquad (4.4)$$

The variables $x_1, x_2, x_3, x_4$ are uniformly distributed as follows:

$$0 \le x_1 \le 100, \quad 20 \le \frac{x_2}{2\pi} \le 280, \quad 0 \le x_3 \le 1, \quad 1 \le x_4 \le 11 \qquad (4.5)$$

The noise terms $\varepsilon_2$ and $\varepsilon_3$ are Normally distributed with mean zero and variances $\sigma_2^2$ and $\sigma_3^2$ respectively. The variances are chosen so that the signal to noise ratios are 3 to 1; i.e. the ratios of variances are chosen to be $(\sigma_2^2/\sigma_{y_2}^2) = (\sigma_3^2/\sigma_{y_3}^2) = 1/3$, where $\sigma_{y_2}^2$ and $\sigma_{y_3}^2$ are the variances of the outputs of (4.3) and (4.4) respectively. The learning sets for these data sets had 200 randomly generated samples, while the test data had 1000 randomly generated examples.

In Table 4.1 we compare the performance of SPORE-1 on these data sets to the Bagged CART algorithm. We chose these data sets because the manner in which they were used was fully specified, allowing us to directly compare prediction error results (full specification of the use of the fifth data set, Ozone, was not available in [Breiman, 1996a]).

4.2.2 Regression Data Used by Rasmussen

Rasmussen [Rasmussen, 1996] studied five data sets from real-world regression examples. These included Auto Price data, CPU data, Boston Housing data, MPG data, and Servo data. We give a brief description of these data sets below.

• Auto Price Data: This data set relates the list price of a 1985 automobile to various attributes such as its size, weight and engine capacity. The data set has 10 real inputs and 1 output. There are 80 learning examples and 79 test examples.

• CPU Data: This data set relates CPU performance to various attributes such as maximum and minimum memory size. The data set has 6 real inputs and 1 output. There are 104 learning examples and 105 test examples.

• Boston Housing Data: This data set is a variation of the Boston Housing data described in the previous section. The data set has 12 real inputs, 1 binary input, and 1 output. There are 80 learning examples and 79 test examples.

• MPG Data: This data relates automobile fuel consumption to such attributes as car weight, model year, and horsepower.
The data set has 3 real inputs, 6 binary inputs, and 1 output. There are 256 learning examples and 250 test examples.

• Servo Data: This data relates the rise time of a servomechanism to gain settings and the choice of mechanical linkages. The data set has 2 real inputs, 10 binary inputs, and 1 output. There are 80 learning examples and 79 test examples.

Rasmussen applied 5 different learning algorithms to this data: Monte Carlo, Gaussian Evidence, Backpropagation, MARS, and Gaussian Process. In his paper, Rasmussen was testing the efficacy of these algorithms as the amount of learning data is increased, hence he tested them on various sizes of learning data. Since we are not concerned with data-limited regression in this thesis, we limited our comparisons to the results obtained on the largest learning data sizes. In Table 4.1, SPORE-1 is always compared to the algorithm which achieved the lowest errors on the corresponding data (this is symbolized by the brackets around the published error results). This gives a direct comparison of SPORE-1 with the best algorithm applied to this regression data.

4.2.3 Regression Data Used by Jordan and Jacobs

Finally, we compared SPORE-1 to the forward dynamics data studied by Jordan and Jacobs [Jordan and Jacobs, 1994]. This data was generated by a simulation of a 4 joint robot arm moving in 3 dimensional space. The goal is to predict the acceleration of the 4 joints, given as input the position, velocity, and applied torque of each joint (i.e. a total of 12 predictor variables). There are 15,000 training examples and 5,000 validation examples. This data is highly nonlinear and is a good test of a learning algorithm's ability to predict nonlinear phenomena. Jordan and Jacobs applied 7 learning algorithms to this data: linear regression, Backpropagation, two versions of the Hierarchical Mixtures of Experts (HME) algorithm, two versions of CART, and the MARS algorithm. In Table 4.1 we compare SPORE-1 to the Backpropagation algorithm, which achieved the lowest approximation error of the algorithms studied by Jordan and Jacobs.

4.2.4 Experimental Results

Next, we detail our experimental results on the above regression data. All experimental results reported in Table 4.1 are obtained under the exact experimental conditions reported in the literature. This allows a direct comparison of the performance of SPORE-1 to other algorithms. All learning times reported in this section are for SPORE-1 programmed in C and running on a Pentium Pro 150 using LINUX.

The data sets studied by Breiman [Breiman, 1996a] and Rasmussen [Rasmussen, 1996] have relatively few training examples, and therefore we constructed the SPORE-1 approximations using 10 fold cross-validation (10 fold cross-validation is one effective method of getting good approximations from a small number of data points). The 10 fold cross-validation procedure used is the following: the learning data set was divided into 10 approximately equally distributed sets, and then 10 regression functions were constructed using, in turn, 9 of these sets as a training set and the remaining set as a validation set (see Section 3.2.1 of Chapter 3 for definitions of learning, training and validation data). For each test data point (test data was not used during learning), the outputs of the 10 regression functions were averaged to produce the final approximation output, for which error results are reported.
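The following sketch illustrates this 10 fold cross-validation procedure. Here `fit` is a hypothetical placeholder standing in for the SPORE-1 learner (it is assumed to take training and validation sets and return a callable model); it is not part of the actual implementation.

```python
import numpy as np

def ten_fold_ensemble(X, y, fit, rng=np.random.default_rng(0)):
    """Build 10 regression functions, each trained on 9 folds with the remaining fold
    held out as a validation set, and return a predictor that averages their outputs."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 10)
    models = []
    for i in range(10):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(10) if j != i])
        models.append(fit(X[train], y[train], X[val], y[val]))
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)
```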
To test reproducibility, 100 independent approximations (independent with respect to the random sequences of input variables and bootstrap samples, as defined in Section 3.2 of Chapter 3) were generated using the learning data: Table 4.1 reports the best test set error, along with the average and standard deviation (s.d.) over the 100 independent runs.

For the data sets studied by Breiman, we followed the same experimental procedure used by Breiman [Breiman, 1996a]. Briefly, this procedure is summarized as follows. For each of the 4 data sets, 100 trials were set up using 100 learning and 100 test sets created randomly according to the guidelines outlined in Section 4.2.1 and in [Breiman, 1996a]. For each of the 100 learning sets a regression function was constructed (using 10 fold cross-validation as described above) and evaluated on the corresponding test set. Experimental results for the 100 experiments are given in Table 4.1. The previously published error for the Breiman [Breiman, 1996a] data refers to the reported average bagged error.
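As an illustration of how one such trial can be set up, the sketch below generates Friedman #1 learning and test sets according to equation (4.2); the random seed and structure are ours, and the sketch is not the code used in this thesis.

```python
import numpy as np

def friedman1(n, rng):
    """Generate n samples of the Friedman #1 data of equation (4.2); inputs x6..x10 are pure noise."""
    X = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, n))
    return X, y

rng = np.random.default_rng(0)
# one of the 100 trials: a fresh 200-example learning set and 1000-example test set
X_learn, y_learn = friedman1(200, rng)
X_test, y_test = friedman1(1000, rng)
```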
Finally, for the forward dynamics data studied by Jordon and Jacobs [Jordan and Jacobs, 1994], our evaluation was done using the experimental setup described in their paper for the on-line back-propagation algorithm: learning was done using 15,000 training examples, and learning stopped when the error could no longer be reduced on 5000 validation examples. The SPORE-1 approximation for this data consisted of a single cascade per output: due to the large  Chapter 4: Experimental Results  67  Table 4.1 Performance of SPORE-1 of Published Data SPORE-1 Error (100/10 Runs) Published Source  Breiman. IWda  Rasmussen, 1996  .lord.in and Jacobs. |<W  Data  Pub. Lrror  best  a\e.  s.d.  Housing  11.6  1.95  9.4  8.4  Friedman 1  6.1  1.77  2.88  0.43  Friedman 2  22,100  16,733  19,603  1,413  Friedman 3  0.0242  0.014  0.0189  0.0024  Auto Price  (0.29)  0.21  0.29  0.03  CPU  (0.17)  0.086  0.14  0.025  House  (0.15)  0.128  0.145  0.007  MPG  (0.10)  0.008  0.1  0.004  Servo  (0.15)  0.198  0.25  0.02  For. Dyn.  (0.09)  0.0378  0.041  0.009  Ave. Size of Cascade  90 K-byte (8500 model parameters)  1 M-byte (94,000 parameters)  data size, 10 fold cross validation was not necessary. In order to estimate the reproducibility of SPORE-1 on this data, we constructed 10 independent approximations. The best and average relative error on the validation set (in accordance with [Jordan and Jacobs, 1994]), and the standard deviation over these 10 independent experiments (time constraints did not allow us to do 100 experiments), is reported in Table 4.1. The previously published error, shown in Table 4.1, is the best relative error (on the validation data) of the 7 algorithms studied in [Jordan and Jacobs, 1994]. The forward dynamics data consists of 12 inputs and 4 outputs, which requires 4 SPORE-1 cascades (1 for each output). The algorithm required about 10 hours of computation to build all 4 cascades, and the average size of each cascade was about 1 M-byte (i.e. about 94,000 model parameters). From Table 4.1 we conclude that the SPORE-1 errors on all 10 runs are consistently low, thus the stochastic effect of the learning algorithm had negligible effect on approximation error. From Table 4.1, it is evident that the SPORE-1 algorithm demonstrated as good or better error results on all but 1 (the servo data) of the data sets. However, this result should be interpreted with caution. The specific goal of SPORE-1 is to build a regression function which best represents  Chapter 4: Experimental Results  68  the learning data in the mean squared error sense. The SPORE-1 learning algorithm continues to add levels until the mean squared error can no longer be reduced: this is beneficial if one wants the "best" approximation, but detrimental if one wants a representation of fixed size. Most algorithms referred to in Table 4.1 are parametric and therefore of fixed size. Two exceptions are M A R S and CART. However, even comparing these to SPORE-1 should be done with caution: both M A R S and CART can be used to determine the significance of various inputs and how they interact with one another. Currently SPORE-1 has no such data interpretation features. Nevertheless, we cannot ignore the fact that SPORE-1 consistently produced regression functions which had low prediction errors. Thus, even though SPORE-1 was not designed for small, low dimensional problems, it should be considered a strong candidate for such regression problems if prediction error is important.  
4.3 Evaluating SPORE-1 on Synthetic High Dimensional Problems

In this section, our goal is to study the SPORE-1 algorithm when applied to very high dimensional regression data, under various noise conditions [Grudic and Lawrence, 1997]. All learning times reported in this section are for SPORE-1 running on a Pentium Pro 150 using LINUX. Since no large, high dimensional regression data examples were found in the literature, we applied SPORE-1 to fifteen artificial data sets ranging from 100 to 1,600 input dimensions. For five of these data sets no noise was present, while the remaining ten data sets had noise. For the first five of these "noisy" data sets, noise was present only in the output, while the remaining five data sets had both input and output noise. Input noise was generated by simply allowing only half of all inputs to contribute to the output (thus half of the inputs used in constructing the approximation had no effect on the output). This allows us to determine how effectively the SPORE-1 learning algorithm is able to account for variables which have no predictive power. Output noise was generated using a normal distribution with the standard deviation selected to give a signal to noise ratio of 3 to 1, thus implying that the true underlying function accounts for 90% of the variance in the learning and test data. Specifically, we generated the regression data as follows:

$$y = f(x_1, \ldots, x_N) + \varepsilon \qquad (4.6)$$

where $f(x_1, \ldots, x_N)$ is the target function, and $\varepsilon$ is a Gaussian noise term with mean zero and variance $\sigma_\varepsilon^2$. Under output noise conditions, we chose $\sigma_\varepsilon^2$ such that $(\sigma_f^2 / \sigma_\varepsilon^2) = (3)^2$, where $\sigma_f^2$ is the variance of $f(x_1, \ldots, x_N)$. Thus, because $\varepsilon$ and $f(x_1, \ldots, x_N)$ are independent, we know that

$$\frac{\sigma_\varepsilon^2}{\sigma_y^2} = \frac{\sigma_\varepsilon^2}{\sigma_f^2 + \sigma_\varepsilon^2} = \frac{1}{9 + 1} = 0.1 \qquad (4.7)$$

where $\sigma_y^2$ is the variance of the output $y$. Therefore, given that relative error is defined as

$$\frac{E\big[(f - \hat{f})^2\big]}{\sigma_y^2} \qquad (4.8)$$

where $\hat{f}$ is the constructed regression function, theoretically the best relative error achievable by any approximation is 0.1. The target function $f(x_1, \ldots, x_N)$ is generated as follows:

$$f(x_1, \ldots, x_N) = \frac{1}{1 + \cos\Big(\sum_{i=1}^{N} r_i(x_i)\cos\big(s_i(x_i)\big)\Big)} \qquad (4.9)$$

where $N$ is the input dimension of the data, while $r_i(x_i)$ and $s_i(x_i)$ are continuous 1 dimensional functions which are nonlinear and all different from one another. Equation (4.9) allows us to generate highly nonlinear data of any dimension, while at the same time controlling the complexity of the data via the appropriate selection of the functions $r_i(x_i)$ and $s_i(x_i)$. By data complexity, we are referring to the number of random sample points required to build sufficient nonparametric models of the data, with respect to mean squared error. The more complex the generating function, the more data points are required for nonparametric modeling. For the purposes of this thesis, we have chosen the functions $r_i(x_i)$ and $s_i(x_i)$ such that 40,000 random sample points are sufficient to effectively model the data.

Our choice of the functions $r_i(x_i)$ and $s_i(x_i)$ was done as follows. The functions $r_i(x_i)$ and $s_i(x_i)$ are modeled using linear splines. The spline knot locations were randomly chosen using a uniform random distribution. The number of knots was the same for each function.
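A minimal sketch of how such a family of random piecewise-linear functions can be built is given below. The knot counts and knot values here are arbitrary choices of ours; the thesis specifies only that knot locations are uniformly random and that the number of knots controls the complexity of the data.

```python
import numpy as np

def make_spline_functions(num_funcs, num_knots, rng):
    """Build a family of random piecewise-linear functions on [0, 1], one per predictor:
    knot locations drawn uniformly, with a random (illustrative) value at each knot."""
    funcs = []
    for _ in range(num_funcs):
        knots = np.sort(rng.uniform(0.0, 1.0, num_knots))
        knots[0], knots[-1] = 0.0, 1.0            # cover the whole interval
        values = rng.uniform(-1.0, 1.0, num_knots)
        funcs.append(lambda x, k=knots, v=values: np.interp(x, k, v))
    return funcs

rng = np.random.default_rng(0)
r = make_spline_functions(num_funcs=100, num_knots=20, rng=rng)   # e.g. one r_i per input
s = make_spline_functions(num_funcs=100, num_knots=20, rng=rng)   # and one s_i per input
```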
In addition, each of the predictor variables $x_i$ was modeled as a 2 dimensional parametric function of the form $x_i = \psi_i(\omega_1, \omega_2)$, where the functions $\psi_i(\omega_1, \omega_2)$ are different for each predictor variable $x_i$. The parametric functions $\psi_i(\omega_1, \omega_2)$ are modeled as 2 dimensional linear splines, with knot positions randomly chosen using a uniform random number generator (the number of knots is the same in each of the functions $\psi_i(\omega_1, \omega_2)$, but their locations are all different). Hence, the complexity of the target function $f(x_1, \ldots, x_N)$ in (4.9) is directly controlled by the number of knots in the functions $r_i(x_i)$, $s_i(x_i)$ and $\psi_i(\omega_1, \omega_2)$, and increasing the number of knots increases the number of points required to achieve effective approximations of the target function. One should note that, even though the target function is essentially 2 dimensional in the parameter space $(\omega_1, \omega_2)$, this information is not used by SPORE-1; i.e. the regression data is provided in the form of $N$ dimensional predictor variables $x_i$ and one output variable $y$. Therefore a learning algorithm sees the resulting regression problem as $N$ dimensional.

In order to pick a target function complexity such that 40,000 sample points are required to generate an effective regression model, we followed this procedure. The number of knots used to represent the functions $r_i(x_i)$, $s_i(x_i)$ and $\psi_i(\omega_1, \omega_2)$ was increased until a nearest neighbor approximation (implemented using a simple K-Nearest Neighbor algorithm with K = 1; see Chapter 2 for details), built in the 2 dimensional $(\omega_1, \omega_2)$ parameter space, exceeded a relative mean squared error of 0.01 (relative mean squared error is defined, in the usual sense, as the mean squared error on the test data divided by the variance of the test data). In building this K-Nearest Neighbor approximation, 40,000 random training samples were used, and the relative error was calculated using 5000 randomly picked points. A nearest neighbor approximation gives a good indication of the maximum approximation error, because the 40,000 learning points densely sample the 2 dimensional parameter space $(\omega_1, \omega_2)$, where the nearest neighbor calculations are done (a nearest neighbor approximation done in the $N$ dimensional predictor variables $x_i$ would not be nearly as effective, for reasons outlined in Chapter 2). This procedure gives us a good estimate of the maximum approximation error that the SPORE-1 algorithm should attain; the relative mean squared error of a SPORE-1 approximation, or any effective approximation method, should not exceed 0.01 on any of the high dimensional data presented in this section.

The simulation results for the 15 high dimensional data sets are presented in Table 4.2. For each data set there are 40,000 learning examples and 40,000 testing examples. Each learning set is further divided into 30,000 training examples and 10,000 validation examples: these are used to construct one cascade which forms the approximation for that data set. The three rightmost columns of Table 4.2 contain the test set average relative mean squared error and corresponding standard deviations over 10 independent runs. From Table 4.2 it is evident that the relative error is small when no noise is present, and falls well below the expected maximum error value of 0.01.
With the addition of noise, the relative error approaches the theoretical limit (due to the 3 to 1 signal to noise ratio) of 0.1 (see the explanation given above). Learning time for each data set with no noise was approximately 7 hours, and produced cascades which were between 2 and 6 M-bytes in size (i.e. between approximately 190,000 and 570,000 model parameters). In contrast, learning time for each data set containing noise was about 1 hour, and produced approximations of between 200 and 600 K-bytes in size (i.e. between approximately 19,000 and 57,000 model parameters). The drop in learning time and approximation size occurs because the SPORE-1 learning algorithm stops adding levels when the error on the validation set can no longer be reduced. When the data has output noise, this condition is reached sooner than when there is no noise. The reason for this is that interpolation between sample points is no longer possible when noise is present, once a given level of approximation error (i.e. 0.1) is reached.

Table 4.2 Evaluation of SPORE-1 on High Dimensional Data Sets

  Dimension   No Noise Data             Output Noise Data       Input/Output Noise Data
  100         0.00139 (s.d. 0.00006)    0.1049 (s.d. 0.0003)    0.1048 (s.d. 0.0004)
  200         0.0018  (s.d. 0.0002)     0.1064 (s.d. 0.0003)    0.1055 (s.d. 0.0002)
  400         0.0009  (s.d. 0.0002)     0.1041 (s.d. 0.0002)    0.1064 (s.d. 0.0003)
  800         0.0044  (s.d. 0.0006)     0.1111 (s.d. 0.0003)    0.1038 (s.d. 0.0003)
  1,600       0.0011  (s.d. 0.0003)     0.1087 (s.d. 0.0003)    0.1042 (s.d. 0.0003)

The results in Table 4.2 demonstrate that SPORE-1 can potentially be used on large, high dimensional regression problems. It is interesting to observe that the dimension of the data did not adversely affect either learning time or the size of the approximation. This is made evident in Figure 4.1, where we plot approximation size as a function of data dimension. The "error bars" indicate the variation in approximation size for each specific data type. These experimental results support the theory developed in Chapter 3, where we showed that the rate of convergence of SPORE-1 is independent of dimension, and that complexity is at most linear with dimension.

Figure 4.1 Increase in Approximation Size with Data Dimension (panels: a) Output Noise, b) Input/Output Noise, c) No Noise; horizontal axis: Data Dimension)

A second interesting observation is that SPORE-1 is effective in dealing with both input and output noise, even on regression functions which are very high dimensional.

4.4 Human-to-Robot Skill Transfer Experiments

We chose human-to-robot skill transfer as a real world test of the SPORE-1 learning algorithm. Our motivation for this is that human-to-robot skill transfer is an example of a regression problem which is very high dimensional (1024 inputs), and exhibits both input and output noise characteristics [Grudic and Lawrence, 1995] [Grudic and Lawrence, 1996]. We approach the problem of programming robots that can work in unstructured environments from the point of view of human-to-robot skill transfer. In this paradigm, the human transfers his or her skills to a robot by giving the robot a finite number of examples of how the desired manipulation task is done.
The key to the success of this type of robot programming is that the robot must be able to generalize the desired task, given only the human expert's examples, even when the unstructured environments of interest change after learning has occurred.

Other researchers have had success with similar approaches to the robot programming problem. We mention only a few examples here. Perhaps the most related work is that of Pomerleau [Pomerleau, 1993b] [Pomerleau, 1993a], where he successfully uses a 3 layer perceptron network to control the CMU ALVINN autonomous driving system as it drives along a road. Pomerleau's approach differs from that presented here in that 1) we use a nonparametric approximation approach and 2) the human expert supplying the training data uses only the sensor inputs being fed to the approximation (Pomerleau does not use this restriction). However, as with the approach presented here, Pomerleau uses a large number of sensor inputs (960) to effectively generalize the desired task. Other related work includes that of Kosuge et al. [Kosuge et al., 1991], where a priori knowledge of the task is incorporated with human generated examples to transfer a skill. Other researchers use neural networks to encode the human's skills: for the deburring task, Shimokura and Liu [Shimokura and Liu, 1991], and for compliance control, Asada [Asada, 1990]. Lui and Asada [Lui and Asada, 1992] use process dynamics for transferring manipulative skills. Yang et al. [Yang et al., 1994] use a multidimensional Hidden Markov Model to represent the transferred skill. Rouse et al. [Rouse et al., 1989] also consider symbol processing models as useful tools for capturing human skills and knowledge.

The approach to human-to-robot skill transfer proposed in this section assumes the following: The human operator can accomplish the desired manipulation task by using only the robot's sensors to obtain information about the world, and using only the robot's actuators to manipulate world objects. Thus, the human has only the robot's perspective on gathering world information and manipulating world objects, and no other. This implies that if the human can accomplish the task, then the robot should also be able to accomplish it.

This assumption has at least one important implication for the choice of a learning system, or approximation framework, for generating the sensor to actuator mappings. Humans are usually only experts at manipulation tasks which they can anthropomorphize to their natural frame of reference. In other words, we are accustomed to particular human sensory and action modalities, and as the information for the task becomes more sparse, the task becomes more difficult. As a result, when a human expert is demonstrating a task which is sufficiently complex, it is likely that he or she is using a large number of sensor inputs of various types, including both visual and tactile. Thus, the type of human-to-robot skill transfer proposed here requires an approximation framework which is highly flexible with respect to the number and variety of sensor inputs, and yet is practical to implement.

Figure 4.2 The Human to Robot Skill Transfer Test Bed

In this section, we describe a human-to-robot skill transfer testbed which we have implemented, as well as outline our first skill transfer experiments done in an unstructured environment using the SPORE-1 algorithm.
The skill which we have chosen to demonstrate is loosely based on a typical "object locate and approach" task, and was chosen due to its inherently unstructured characteristics. However, the framework we are in the process of formulating should be generalizable to a large variety of human skills.

4.4.1 The Experimental Testbed

A small scale proof-of-concept testbed for human-to-robot skill transfer has been implemented (see Figure 4.2), consisting of a tracked mobile robot which is approximately 35 centimeters long and 23 centimeters wide. The robot is equipped with 1) left and right tracks, each with independent proportional forward/backward motion similar to that found in forestry excavators; 2) a gripper (not shown in Figure 4.2) to allow it to grasp objects; and 3) an NTSC color camera. The leftmost image in Figure 4.2 shows the robot's workspace and the rightmost image shows a close-up of the mobile robot. Initially this robot is intended to operate in a flat 2.5 meter by 1.8 meter workspace which has randomly placed objects. The intended task of the robot is to manipulate the target objects in a manner defined by a human operator who first accomplishes the desired task a number of times via teleoperation. The human operator observes images received from the camera located on the robot (at a frame rate of approximately 20 to 30 frames per second), and operates two joysticks which control the robot's left/right and forward/backward motion, as well as the open/close gripper motion. A SPARCstation 20, equipped with a frame grabber, is used to display the on-board camera image to the human operator as he or she demonstrates the desired task. The learning and subsequent autonomous control of the robot are also performed by the SPARCstation 20.

The testbed currently uses various small objects, including lego models and model "trees", as objects which must be located and approached. There are 2 lego objects: a red and blue striped tower, and a red and blue striped tower with holes. The "trees" are currently of three different types (tall spruce, short spruce, and short birch), and our goal is to teach the robot to differentiate between them. This testbed is considered unstructured because 1) although there are only three different types of trees and two lego models, no two objects are identical either in height, color distribution, or in texture, making manual encoding of a model difficult (especially for the model trees); and 2) the image background of the objects is a robotics lab which is continually changing as equipment is moved around and people move throughout the lab.

4.4.2 Experimental Results

The first skill which has been successfully transferred from human to robot is described in Figure 4.3. This skill is loosely related to the "tree tending skill," which requires the selection of a certain size and species of tree for inspection. The skill involves turning the robot counterclockwise in place until at least one tall spruce is within the camera's field of view, and then maneuvering the robot into a position where the tree can be inspected. Figure 4.3 shows several typical camera views which the operator sees as the task of locating and moving towards a tall spruce is performed.
Starting at the leftmost image of Figure 4.3, the robot is turning counterclockwise searching for a tree; in the third image a tall spruce has been found and the robot begins to move towards it; in the sixth image the spruce is within inspecting distance and the robot turns off to the left in search of another tall spruce. Note that this first task does not involve avoiding obstacles or actually stopping to inspect the tree; it simply requires the robot to find and move towards a tall spruce tree in the visually cluttered environment of the robotics lab.

Figure 4.3 A "Tall Spruce Tree"

The image which the human operator, as well as the robot, observes while executing the desired task, consists of 1024 pixels derived by sampling a 32 pixel by 32 pixel region in the center of the original colour NTSC image (see Figure 4.3 for gray-scale examples). The sampled pixel resolution is 8 bits: 3 bits for red, 3 bits for green, and 2 bits for blue. There is no image processing done prior to feeding these pixel values to the SPORE approximation; what the human sees is exactly what is passed to the approximator. In order to generalize the desired task the robot must approximate two mappings, one to control the robot's forward and backward motion, and one to control the robot's left and right motion. Both of these mappings have 1024 inputs and 1 output. As usual, two SPORE-1 cascades are generated, one for each dependent variable.

The training data generated by the human operator consisted of demonstrating the above skill a total of 4 times, from 4 random starting positions of the robot, and 4 random positions of the spruce. Due to image resolution limits, the distance between the robot and the "tree" was always 1.5 meters or less. It took approximately 2 minutes (equivalent to 2180 camera frames or input/output examples) for the human operator to do these 4 demonstrations. The SPORE-1 regression function was constructed by randomly subdividing these 2180 samples into 218 validation samples and 1962 training samples. The resulting approximation took about 5 minutes of SPARCstation 20 CPU time to generate, and took about 250 K-bytes of memory space (i.e. approximately 24,000 model parameters). The length of time to evaluate the resulting approximation for one instance of sensor inputs is 191 milliseconds.

Next we asked the following question: given only a few examples of the desired skill, and the raw sensor inputs used by the human, can the SPORE-1 regression algorithm effectively generalize the desired skill? In the case of the above experiment, the human operator generated only 4 examples of the desired task, and the SPORE approximation was fed raw pixel data with no intensity compensation or any other type of signal preprocessing. In addition, one can see from Figure 4.3 that the image generated by the frame grabber inside the SPARCstation 20 is extremely noisy. Nonetheless, when we ran 30 experiments where the robot used the SPORE approximation to autonomously accomplish the transferred skill, the robot never failed to locate and move towards the tree. These 30 experiments were done with random placements of tree and robot (as with the training set, the spruce was always within 1.5 meters of the robot).

Figure 4.4 A Red and Blue Striped Tower
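For concreteness, the following minimal sketch shows one way the 1024-dimensional input vector described above (a 32 pixel by 32 pixel region with 3-3-2 bit colour sampling) could be formed from a camera frame. The function name pack_332 and the use of NumPy are assumptions made for illustration only; this is not the code used in the SPORE-1 experiments.

```python
import numpy as np

def pack_332(rgb_patch):
    """Pack a 32x32x3 uint8 RGB patch into a 1024-dimensional input vector.

    Each pixel is reduced to 8 bits: the 3 most significant bits of red,
    the 3 most significant bits of green, and the 2 most significant bits
    of blue (an assumed reading of the 3-3-2 encoding described above).
    """
    r = rgb_patch[:, :, 0] >> 5           # keep top 3 bits of red
    g = rgb_patch[:, :, 1] >> 5           # keep top 3 bits of green
    b = rgb_patch[:, :, 2] >> 6           # keep top 2 bits of blue
    packed = (r << 5) | (g << 2) | b      # one byte per pixel
    return packed.astype(np.float64).ravel()

# Example: a random frame standing in for the centre region of a camera image.
frame = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
x = pack_332(frame)
assert x.shape == (1024,)
```

Each of the two SPORE-1 cascades described above would then be trained on pairs of such 1024-dimensional vectors and the corresponding joystick value.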
It is interesting to note that the robot does not approach, for instance, upside-down spruce trees, and does not seem to be solely color sensitive; the shape of the tree and its texture are significant as well. We may therefore conclude that, to the extent that the robot acquired the desired skill, it was able to execute it within an unstructured environment.

Next, we repeated the same experiments with the two lego objects. Figure 4.4 and Figure 4.5 show grey-scale camera frame examples of approach sequences to two objects we have experimented with. The object in Figure 4.4 is a red and blue striped tower, and the object in Figure 4.5 is also a red and blue striped tower but has a number of holes through it, which allow background scenes to be observed. For the object in Figure 4.4, the teleoperator generated 4632 camera images during 4 approaches. Similarly, for the object in Figure 4.5, the teleoperator generated 4805 images during 4 example approaches. The variation in camera frame counts is simply a result of the random placement of the vehicle and object. The SPORE-1 approximation was generated for both objects, using a random selection of 10% of the learning points as a validation set, and the rest as a training set. The generation of the 4 approach examples took the teleoperator approximately 5 minutes for each object, while the off-line learning of SPORE-1 required approximately 4 additional minutes on the SPARCstation 20, for each object. The size of the approximation generated for the object in Figure 4.4 is 489 K-bytes (i.e. approximately 44,000 model parameters), while the size of the approximation generated for the object in Figure 4.5 is 552 K-bytes (i.e. approximately 52,000 model parameters). The time required to evaluate either approximation (once it has been constructed) for an observed image is less than 120 milliseconds for either object (run on a SPARCstation 20).

Figure 4.5 A Red and Blue Striped Tower with Holes

Once more, in order to evaluate the ability of the robot to autonomously execute an approach sequence toward an object, we ran 30 experiments for each of the two objects described above. Each experiment involved placing the vehicle and object in random positions, and then seeing whether the robot would autonomously be able to locate and approach the object. For the object shown in Figure 4.4 the robot successfully executed the task in 29 out of 30 trials, while for the object shown in Figure 4.5, the robot was successful in approaching the object in 27 out of 30 trials (only one object was present at each trial). Note that the initial distance between the vehicle and target object was always less than 1.5 meters, because at distances greater than this, the object is not easily distinguishable, even by a human operator. From these experimental results, we can conclude that the SPORE-1 approximation was successful in generalizing all of the above locate and approach tasks, given relatively few noisy examples and raw pixel inputs. This is promising for the many difficult robot programming problems facing robotics researchers.

4.5 The 10 Bit Parity Problem

One of the main criticisms of algorithms which construct high dimensional regression functions using smaller dimensional functional units is that they cannot model interactions which have an intrinsic dimension that is greater than the dimension of the highest dimensional functional unit.
For example, the classic implementation GMDH [Ivankhnenko, 1971] has 2 dimensional functional units, and therefore can at most model the 2-bit parity problem [Elder IV and Brown, 1995]. Similarly, ASPN can at most model the 3-bit parity problem because its functional units are at most 3 dimensional [Elder IV and Brown, 1995]. Note that the N-bit parity problem is one of determining whether an odd or even number of the N bits in a word is high: the target function being, for example, set to 1 if an even number of bits are high, and -1 otherwise.

Using this same train of thought, one may conclude that because SPORE-1 uses 2 dimensional functional units, it should not be able to model parity problems for words of 3 or more bits. This, however, is not the case. As discussed in Section 3.3.4 of Chapter 3, the virtual space partitioning property of SPORE-1 can allow the algorithm to converge to zero error, even when the following pathological condition occurs:

{a_ij} = argmin_{a_ij} ∫_D (g_l(u, v) - f(x))^2 dD = {0, ..., 0}   (4.10)

The above condition occurs in a parity problem of more than 2 bits because the target function value is equally likely to be -1 or 1, for each point (u, v) in the training set. Thus, given that the training set consists of an exhaustive set of parity points, and the training data is applied without modification (i.e. a bootstrap sample is not used), the function g_l(u, v) which always gives the best squared error fit to the training data is the zero function g_l(u, v) = 0.

In order to demonstrate the ability of SPORE-1 to represent 3 or more bit parity functions, we applied it to the 10-bit parity problem [Grudic and Lawrence, 1997]. The training data and the validation data were made to be identical, with each containing the exhaustive 1024 examples of the 10-bit parity function. Our goal was to determine how well SPORE-1 could fit these learning examples. The results achieved were that the relative mean squared error on the learning data was 9.8 × 10^{-9}, indicating that, to within the floating point precision of the CPU, the SPORE-1 algorithm was able to exactly model the 10-bit parity function. However, the rate at which SPORE-1 converged to this exact solution was rather slow: it took about 2 hours and the final approximation was over a megabyte in size (i.e. approximately 94,000 model parameters). If, on the other hand, we stopped the learning algorithm when the parity problem was actually solved (i.e. when the maximum approximation error is less than 1, meaning that every instance of the learning data has been appropriately mapped), the learning time is only 10 minutes, and the size of the approximation is about 40 K-bytes (i.e. approximately 3,800 model parameters).

4.6 Summary

This chapter is our first attempt at an experimental verification of our thesis presented in Section 1.3 of Chapter 1. The SPORE-1 algorithm is our most extreme example of the general SPORE methodology. The structural units are small and all the same (i.e. third order two dimensional polynomials), and no input selection is ever done. We have demonstrated its successful performance on problems ranging from relatively small 6 dimensional regression problems having only 104 learning examples, to 1,600 dimensional regression problems with 40,000 learning examples. Its performance compared favorably to published results on well known data sets in the literature.
We introduced new very high dimensional data and demonstrated that SPORE-1's construction times were independent of dimension. A real world example of human-to-robot skill transfer was also successful despite the raw image pixels used as inputs. Finally, we demonstrated that SPORE-1 does not have the same limitations on parity type problems as other systems which build regression functions using small dimensional functions which are added one at a time.

Chapter 5: The General SPORE Methodology

In Chapter 3 the SPORE-1 learning algorithm was presented, and its basic theoretical characteristics were explored. In this chapter the most general form of the SPORE methodology is presented and the theoretical characteristics of learning algorithms which construct SPORE-type regression functions are presented. In particular, this chapter deals with the rate of convergence and computational complexity properties of the general SPORE learning algorithms, with particular emphasis on how these depend on the dimension of the function being approximated.

5.1 The Complete SPORE Definition

The most general form of the SPORE structure (see Figure 5.1) has C cascades of D dimensional functions. In Figure 5.1, the D dimensional functions are symbolized by g_{i,j}, where i serves to identify one of the C cascades, and j identifies the level of the cascade where the function g_{i,j} is located. In the general case, there are C ≥ 1 cascades of D ≥ 2 dimensional functions, and the dimensions of the functions g_{i,j} need not all be the same. As a specific example, the SPORE-1 structure in Chapter 3 has C = 1 and D = 2 for all g_{i,j} functions. The columns in Figure 5.1 symbolize cascades of functions, and the rows symbolize levels in each of the cascades. Thus, the SPORE structure depicted in Figure 5.1 has C cascades, with the first cascade having L_1 levels, and the final cascade having L_C levels.

Figure 5.1 The SPORE Structure

There are 2 types of inputs to the functions g_{i,j}. The first type are symbolized by X_{i,j} = (x_{k_1}, ..., x_{k_b}), where (x_{k_1}, ..., x_{k_b}) are some subset (perhaps randomly selected) of all of the model input variables (or independent variables). The second type are symbolized by G_{i,j} = (g_{r_1,q_1}, ..., g_{r_J,q_J}), where (g_{r_1,q_1}, ..., g_{r_J,q_J}) is some subset of outputs of functions which have already been added to the structure (note that g_{r,q} refers to the output of the function from level q of cascade r).

The final regression function, symbolized by f in Figure 5.1, is some weighted sum of all constructed functions g_{i,j}. This is symbolized by the vector:

W = (w_{1,1} · g_{1,1}, w_{1,2} · g_{1,2}, ..., w_{C,L_C} · g_{C,L_C})   (5.1)

Given this definition of the general SPORE structure outlined in Figure 5.1, we next give a description of how a general implementation of a SPORE learning algorithm would construct this structure.

The specific structure of the functions g_{i,j} is either 1) fixed throughout the SPORE structure (as with SPORE-1); 2) chosen randomly, either with or without replacement, from a set of possible candidate functions (if without replacement, then reinitializing the list after all have been chosen); or 3) chosen in a specific order from a predefined sequence.
Similarly, the inputs to each g_{i,j} function are either 1) chosen in a specific order (going back to the first one in the list after all have been chosen); or 2) chosen randomly, with or without replacement, from a list of all possible inputs (if without replacement, then reinitializing the list after all have been chosen). This methodology for selecting inputs and function types implies that no time consuming search occurs during the process of constructing the structure.

In all instances of SPORE, the function g_{1,1} is constructed first, its inputs being some D_{1,1} ≥ 2 dimensional projection of the N dimensional learning data. The parameters of the function g_{1,1} are determined using some subset (perhaps even all) of points in a training set containing examples of inputs and outputs. Once the initial function g_{1,1} is constructed, the remaining functions are constructed, one at a time, with the only stipulation being that no function g_{i,j} is constructed until all of its inputs have been constructed. As with the initial function g_{1,1}, the parameters of each function g_{i,j} are determined using some subset (or perhaps all) of the points in a training set, projected onto the inputs (X_{i,j}, G_{i,j}) of g_{i,j}.

The weights in the vector W are either all the same, or chosen using some criteria such as w_{i,j} being proportional to the inverse of the approximation error of g_{i,j} on the training data (note that this is how the weights are chosen in the SPORE-1 learning algorithm).

During construction the desired learning outputs are periodically replaced by the residual errors resulting from subtracting out the approximation due to the structure already constructed. This usually happens when the approximation errors can no longer be reduced by fitting the parameters of the current g_{i,j} function to the current residual errors in the data set. Further additions to the structure are then added to reduce these residual errors.

In the general case, the SPORE construction terminates when the approximation error, on some validation set, has been reduced to some acceptable level. An exception to this rule is when the algorithm allows for the subdivision of the input space of the target function into disjoint subdomains, in which case new SPORE structures are then constructed in each resulting subdomain. The reader should note that the above description conforms with the general SPORE characteristics outlined in Section 1.3.1 of Chapter 1.

5.2 Theoretical Analysis of Rates of Convergence For Two Specific SPORE Structures

In this chapter we study the rate of convergence properties of two specific examples of SPORE: the SPORE-2 algorithm and an extension of the SPORE-1 algorithm defined in Chapter 3. As discussed in Section 2.3 of Chapter 2, there are two types of rate of convergence results. The first assumes an infinite number of learning examples and measures how approximation error decreases as new structures (or parameters) are added to the regression model. The second assumes a finite number of learning samples and measures how approximation error decreases as the learning sample size increases. In this section we address the second type of convergence result.

In Section 5.2.1, we describe and theoretically analyze the SPORE-2 regression function. Specifically, our goal is to define function spaces for which the rate of convergence of SPORE-2 is independent of dimension.
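To make the construction procedure of Section 5.1 concrete before turning to this analysis, the sketch below shows one possible, deliberately simplified reading of the generic loop: third order two dimensional units are fitted one at a time, each new unit takes the previous unit's output together with a randomly chosen raw input (so no search is performed), later units are fitted to the residual errors left by earlier ones, and construction stops when the validation error no longer improves. The function names are hypothetical, and the sketch treats every level as a residual-fitting step, which simplifies the weighting and residual-replacement schedule described above; it is not the SPORE-1 or SPORE-2 implementation itself.

```python
import numpy as np

def fit_poly2d(u, v, target):
    """Least-squares fit of a third order bivariate polynomial to `target`."""
    def terms(a, b):
        return np.stack([a**i * b**j for i in range(4) for j in range(4 - i)], axis=1)
    coef, *_ = np.linalg.lstsq(terms(u, v), target, rcond=None)
    return lambda a, b: terms(a, b) @ coef

def build_cascade(X, y, X_val, y_val, max_levels=100, seed=0):
    """Fit a cascade of 2-D units, choosing one raw input at random per level."""
    rng = np.random.default_rng(seed)
    units = []
    resid, resid_val = y.astype(float), y_val.astype(float)
    carry, carry_val = X[:, 0].astype(float), X_val[:, 0].astype(float)  # seed input
    best_val = np.mean(resid_val**2)
    for _ in range(max_levels):
        k = int(rng.integers(X.shape[1]))              # random input, no search
        g = fit_poly2d(carry, X[:, k], resid)
        out, out_val = g(carry, X[:, k]), g(carry_val, X_val[:, k])
        if np.mean((resid_val - out_val)**2) >= best_val:
            break                                      # validation error stopped improving
        resid, resid_val = resid - out, resid_val - out_val
        carry, carry_val = out, out_val
        best_val = np.mean(resid_val**2)
        units.append((g, k))
    return units
```

A prediction for a new input vector would then be obtained by running it through the same chain of units and summing the level outputs.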
In function spaces where such a dimension-independent rate of convergence holds, it is potentially possible to avoid both the computational complexity and large sample size aspects of the curse of dimensionality. Our aim is to analyze theoretically motivated regression functions which point the way to real implementations that are feasible. The theoretical constructions presented should be viewed with this in mind. In Section 5.2.2, we extend the SPORE-1 regression function described in Chapter 3 to include the partitioning of the input space into disjoint subdomains. This subdivision of the input space guarantees that the approximation error converges to zero, while simultaneously preserving the rate of convergence and complexity properties of the SPORE-1 learning algorithm within each of these subdomains.

5.2.1 The SPORE-2 Regression Function

As the primary goal of this thesis is to study very high dimensional nonparametric regression, in this section we explore function spaces where such large problems are potentially tractable. In the previous two chapters we have demonstrated the value of building cascades of lower dimensional functions as approximations to high dimensional functions. In this section we extend this basic idea to more complex cascaded structures, and include space partitioning as a method to ensure that the approximation error always converges to zero. As is outlined below, an additional benefit of space partitioning is that the rate of convergence of the regression function is independent of dimension, within each resulting subdomain.

The SPORE-2 regression model consists of a partitioner and a structure. The structure consists of a cascade of 2 dimensional functions, and the partitioner is a method of subdividing the domain of the target function into regions where the cascade is an effective regression function. The algorithm for building the SPORE-2 regression function is based first on an algorithm which decomposes a continuous 3 dimensional function U(x, y, z) into a cascaded 2 dimensional function having the structural form W(V(x, y), z). Such a decomposition of a 3 dimensional function into two 2 dimensional functions is possible because we partition the domain of U(x, y, z) into a finite number of regions where it can be represented to arbitrary accuracy by W(V(x, y), z). This partitioning is the subject of the following theorem.

THEOREM I: Let U(x, y, z) be a bounded continuous 3 dimensional function defined over a finite dense domain U ⊂ ℜ^3. Then for every arbitrary small real ε > 0, there exist a finite number of dense disjoint domains U_i, where ∪_i U_i = U, and corresponding bounded continuous 2 dimensional functions W_i and V_i, such that for any (x, y, z) ∈ U_i the following is true:

|W_i(V_i(x, y), z) - U(x, y, z)| < ε   (5.2)

The proof of THEOREM I is given in Appendix A, where it is restated in a more precise, but equivalent, way. The proof is based on four algorithmic constructions which are presented in Appendix D. These constructions are titled Construction I through Construction IV, and they collectively serve to decompose the function U(x, y, z) into the functions W_i(V_i(x, y), z), as well as define and build the disjoint domains U_i.

Figure 5.2 A Tree-Like Subdivision of Input Space

Using the 3 dimensional decomposition and domain partitioning outlined in THEOREM I, we can begin to define the domain partitioning nature of the complete N dimensional SPORE-2 algorithm.
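Before moving to the N dimensional case, the following short check illustrates the cascaded form W(V(x, y), z) of THEOREM I for a particularly easy target, U(x, y, z) = (x + y)z, where a single domain suffices; the names are illustrative only, and the general case of the theorem is precisely what requires the partition into subdomains U_i.

```python
import numpy as np

U = lambda x, y, z: (x + y) * z   # a target for which the cascaded form is exact
V = lambda x, y: x + y            # inner 2 dimensional function
W = lambda v, z: v * z            # outer 2 dimensional function

rng = np.random.default_rng(1)
x, y, z = rng.uniform(-1.0, 1.0, size=(3, 1000))
# |W(V(x, y), z) - U(x, y, z)| is zero everywhere for this particular U.
assert np.allclose(W(V(x, y), z), U(x, y, z))
```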
The SPORE-2 partitioner works by subdividing the input space of the target function into a tree-like subdivision. This subdivision is represented using the tree notation ρ_1σ_1...ρ_lσ_l, where l represents the level of the tree, and the symbols ρ and σ refer to branch numbers at various levels of the tree. Thus, each branch of the tree (or equivalently, each path through the tree) defines a subdomain of the target function. An example of such a subdivision of space is given in Figure 5.2. Because each tree branch defines a unique subdomain, we speak of a point (x_0, ..., x_{N-1}) being an element of the subdomain corresponding to a tree branch with index ρ_1σ_1...ρ_lσ_l.

Given this description of domain subdivision, we can define the structure built by the SPORE-2 regression algorithm within each of these subdomains. The partitioner works by subdividing the domain of the function being estimated into a set of disjoint subdomains, such that in each subdomain, one can estimate the function using a superposition, or summation, of a cascade of 2 dimensional functions. The 2 dimensional functions are of two types, symbolized by g_{ρ_1σ_1...ρ_lσ_l} and h_{ρ_1σ_1...ρ_lσ_l}, where ρ_1σ_1...ρ_lσ_l is the tree subdivision of the input space defined above. Thus, we can view the functions g_{ρ_1σ_1...ρ_lσ_l} and h_{ρ_1σ_1...ρ_lσ_l} as acting on the subdomain of θ(x_0, ..., x_{N-1}) defined by tree branch ρ_1σ_1...ρ_lσ_l.

The SPORE-2 structure is defined as follows. Let θ_n(x_0, ..., x_{N-1}) be the SPORE-2 estimate of some unknown function θ(x_0, ..., x_{N-1}), which is based on n random samples of θ(x_0, ..., x_{N-1}). Then, if (x_0, ..., x_{N-1}) is an element of the subdomain corresponding to a tree branch with index ρ_1σ_1...ρ_{l_max}σ_{l_max}, the SPORE-2 regression function θ_n(x_0, ..., x_{N-1}) has the following expansion:

θ_n(x_0, ..., x_{N-1}) = g_{ρ_1σ_1}(h_{ρ_1σ_1}(x_0, x_1), x_1) + g_{ρ_1σ_1ρ_2σ_2}(h_{ρ_1σ_1ρ_2σ_2}(g_{ρ_1σ_1}, x_2), x_3) + Σ_{l=3}^{l_max} g_{ρ_1σ_1...ρ_lσ_l}(h_{ρ_1σ_1...ρ_lσ_l}(g_{ρ_1σ_1...ρ_{l-1}σ_{l-1}}, x_{l mod N}), x_{(l+1) mod N})   (5.3)

where,
1. A mod N is the integer remainder of A ÷ N, and is used to cycle through the indices of the independent variables (x_0, ..., x_{N-1}) (hence the order of input variables is predefined);
2. the subscript notation ρ_1σ_1...ρ_{l_max}σ_{l_max} is described above (see Appendix B for a more complete specification) and serves to identify the two dimensional functions g_{ρ_1σ_1...ρ_lσ_l} and h_{ρ_1σ_1...ρ_lσ_l} which act on the specific subdomain of θ(x_0, ..., x_{N-1}) referred to by branch ρ_1σ_1...ρ_lσ_l;
3. and l_max is an integer defining the maximum number of terms (or levels of the tree structure) required in order to achieve some approximation error bound

max_{x ∈ D} |θ_n(x) - θ(x)| < ε   (5.4)

where ε is an arbitrarily small (for sufficiently large n) positive real number, and D is the domain of the unknown function θ(x_0, ..., x_{N-1}).

Figure 5.3 A 4 Dimensional Example of SPORE-2

A 4 dimensional example of a SPORE-2 structure is shown in Figure 5.3, where the tree branch is defined by ρ_l = 1 and σ_l = 1, for all l = 1, ..., l_max. Note that this branch is identified by the topmost branch in Figure 5.2. Given the above description of SPORE-2 (a minimal evaluation sketch of the expansion (5.3) is given below), we now state the second theoretical result of this section.
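The following minimal sketch shows how the expansion (5.3) might be evaluated for a single input vector, assuming the branch containing the point has already been identified and its 2 dimensional functions have already been constructed. The list-based representation, the handling of the first level, and the toy functions at the end are illustrative assumptions only.

```python
def evaluate_branch(x, h_funcs, g_funcs):
    """Evaluate the expansion (5.3) for one input vector x = (x_0, ..., x_{N-1}).

    h_funcs[l-1] and g_funcs[l-1] stand in for the fitted 2-D functions of level l
    on the branch containing x (an assumed representation, one (h, g) pair per level).
    """
    N, L = len(x), len(g_funcs)
    # Level 1: h acts on (x_0, x_1) and g on (h, x_1), following the reading of (5.3) above.
    h_out = h_funcs[0](x[0], x[1 % N])
    g_out = g_funcs[0](h_out, x[1 % N])
    total = g_out
    # Levels l = 2, ..., l_max: h takes the previous level's g output and x_{l mod N},
    # g takes h's output and x_{(l+1) mod N}; the level outputs are summed.
    for l in range(2, L + 1):
        h_out = h_funcs[l - 1](g_out, x[l % N])
        g_out = g_funcs[l - 1](h_out, x[(l + 1) % N])
        total += g_out
    return total

# Toy 2-D functions standing in for fitted units on one branch:
add = lambda a, b: a + b
mul = lambda a, b: a * b
print(evaluate_branch([0.2, 0.5, 1.0, -0.3], [add, mul], [mul, add]))
```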
In THEOREM II, we prove that the SPORE-2 algorithm produces an approximation which is a universal approximator.

THEOREM II: Let θ(x_0, ..., x_{N-1}) be a bounded continuous N dimensional function defined over a finite dense domain D ⊂ ℜ^N. Then, for every arbitrarily small real ε > 0, there exist a finite number of dense disjoint domains S_k, where ∪_k S_k = D, each of which corresponds to some branch ρ_1σ_1...ρ_{l_max}σ_{l_max}, and corresponding bounded continuous 2 dimensional functions g_{ρ_1σ_1...ρ_lσ_l} and h_{ρ_1σ_1...ρ_lσ_l}, such that, for any (x_0, ..., x_{N-1}) ∈ S_k the following is true:

|θ_n(x) - θ(x)| < ε   (5.5)

where θ_n(x) is defined in (5.3), and we let n → ∞.

The proof of THEOREM II is given in Appendix B, where it is restated in a more precise, but equivalent, way. The proof uses the result of THEOREM I, and is based on five algorithmic constructions which are presented in Appendix D. These constructions are titled Construction I through Construction V, and they collectively serve to decompose the function θ(x) into the functions θ_n(x), as well as define and build the disjoint domains S_k. Note that Construction I through Construction IV are constructions which are also used in the proof of THEOREM I.

The proof of THEOREM II defines a sufficient algorithm for constructing the universal approximation θ_n(x), given that the number of sample points, n, is infinite. In THEOREM III we extend this result by assuming a finite learning sample size n, analyzing how the approximation error decreases as the sample size increases. We state this theorem next.

THEOREM III: Let θ_n(x), θ(x), and S_k be as defined in THEOREM II, with the further assumptions that θ(x) is p-times differentiable (p ≥ 1), and is the mean value of some probability function as follows:

θ(x) = ∫ y h(y|x) φ(dy)   (5.6)

where h(y|x) is some probability function, and φ(dy) is some measure defined on ℜ. Assume regularity conditions 1-3 (establishing optimal rates of convergence), defined by Stone [Stone, 1982], are true for the target function θ(x) within a specific subdomain S_k. Let n be the number of samples of the target function θ(x) on S_k. Then, as n increases, the rate of convergence of the regression function θ_n(x) to the target function θ(x), on the subdomain S_k, is given by:

||θ_n(x) - θ(x)||_∞ ≤ O((n^{-1} log(n))^r)   (5.7)

where r = p/(2p + 3), and ||·||_∞ is the L^∞ norm.

The proof of THEOREM III is given in Appendix C, where it is restated in a more precise, but equivalent, way. The three regularity conditions of Stone [Stone, 1982] which we use in the statement of the theorem simply impose certain restrictions on the probability distributions associated with the target function θ(x) (see [Stone, 1982] for details). The first condition imposes constraints on the density function h(y|x), and is required to verify that the sequence (n^{-1} log(n))^r is a lower convergence sequence (i.e. the best convergence sequence). The second condition imposes further constraints on h(y|x) to ensure that the rate defined by the sequence (n^{-1} log(n))^r is achievable. Examples of density functions which satisfy these two conditions are the Normal distribution and the Bernoulli distribution.
The final condition used by Stone puts constraints on the asymptotic distribution of the independent variables, x, of the training samples. This condition is required to ensure that the rate of convergence is both optimal and achievable. An example domain D ⊂ ℜ^N which satisfies this condition would be any polyhedral domain in ℜ^N where the distribution of points x is uniform, and where the learning samples are independent and identically distributed. A more general example of a sufficient domain D ⊂ ℜ^N can be found in [Stone, 1982].

THEOREM III establishes that the closed domain of any function which is bounded and at least once differentiable (and under appropriate regularity conditions of Stone [Stone, 1982]) is densely covered by a finite number of subdomains where the rate of convergence of our regression model is independent of dimension. This is an important result because it is not intuitively obvious. Previously, Stone [Stone, 1982] showed that the optimal rate of convergence, over the entire domain of such functions, does depend on dimension. Specifically, for an N dimensional function, the rate of convergence established by Stone is ||θ_n(x) - θ(x)||_∞ ≤ O((n^{-1} log(n))^v), where v = p/(2p + N). Knowing that every such function has a finite number of subdomains where there exists at least one type of regression function (for example the SPORE-2 regression function studied in this section) which has a rate of convergence independent of dimension has important practical implications if these domains can be identified. For example, suppose that we have a regression problem which has a rate of convergence that depends adversely on dimension over its whole domain. Furthermore, suppose that we can divide the domain of this regression problem into two regions, with each region having a rate of convergence independent of dimension. Then our original "hard" regression problem has been converted into two "easier" regression problems.

The constructions given in Appendix D of this thesis provide a sufficient algorithm for finding domains which have rates of convergence independent of dimension. These theoretical constructions can serve as a basis for practical algorithms for very high dimensional regression, and in fact are the motivation behind the implementation of the SPORE-1 algorithm.

A further implication of THEOREM III of Appendix C is that it demonstrates that it is possible to measure the complexity of regression in terms of the minimum number of subdomains which have a rate of convergence independent of dimension, rather than the dimension of the regression problem. The more subdomains, the more complex the regression problem. This measure of complexity may have important implications for the practical implementation of very high dimensional regression. This is especially true if we can establish the complexity of a regression problem using only the given training examples. Our hope is that the constructions contained in Appendix D of this thesis can serve as the basis for such a data based complexity measure. As described above, these constructions collectively serve to decompose the function θ(x) into the functions θ_n(x), while at the same time building the disjoint domains S_k which have the desired rate of convergence properties.
Since these constructions are based on learning examples of the target function θ(x) (see proof of THEOREM III in Appendix C), they constitute a sufficient algorithm for such a data based complexity measure. However, the practicality of such a complexity measure has yet to be demonstrated.

5.2.2 The SPORE-1 Regression Function with Domain Partitioning

The rate of convergence result for the SPORE-2 algorithm is based on two key observations. First, the SPORE-2 algorithm works by constructing the approximation one functional unit at a time, and the rate of convergence of each of these individual functional units is independent of dimension. Secondly, the domain of the function being approximated is subdivided to ensure that, as each functional unit is constructed, the approximation error over the particular subdivision acted on by the functional unit must decrease. The result is that within each subdomain, the approximation error converges to zero, and the rate at which it does so is independent of dimension.

Given the definition of the SPORE-1 algorithm in Chapter 3, it is not difficult to envision a modification to the SPORE-1 algorithm which will also give it a rate of convergence to zero error which is independent of dimension. First, in THEOREM 5 of Chapter 3, we showed that the rate of convergence, to a fixed approximation error, of SPORE-1 is independent of the dimension of the function being approximated. Therefore, if we designate each single execution of the SPORE-1 algorithm to be a single equivalent functional unit, then we can achieve a rate of convergence to zero error by subdividing the input space such that, the next time SPORE-1 is run on any subdivision, the resulting approximation error over that subdivision will decrease. One method of choosing such a subdivision is to do the following binary split of the input space. Let R_L(x) be the constructed SPORE-1 regression function which acts on the domain D. Then, create 2 subdomains, D_1 and D_2, such that D_1 ∪ D_2 = D, as follows:

x ∈ D_1 iff ((R_L(x) - f(x)) ≥ 0)
x ∈ D_2 iff ((R_L(x) - f(x)) < 0)   (5.8)

(Note that D_1 and D_2 may not be topologically connected, whereas the subdivisions of most implementations of regression trees [Breiman et al., 1984] are. This may or may not be a significant advantage depending on how hard it is to construct such domains in practice. This is a topic for future research.)

The subdivision in (5.8) ensures that the maximum expected approximation error, within each subdomain D_i (i ∈ {1, 2}), is given by:

∫_{D_i} ((R_L(x) - f(x)) - A_i)^2 dD_i ≤ ∫_{D_i} (R_L(x) - f(x))^2 dD_i   (5.9)

where A_i ∈ ℜ is a constant defined as follows:

A_i = argmin_{A_i} ∫_{D_i} ((R_L(x) - f(x)) - A_i)^2 dD_i   (5.10)

(Note that A_i is simply the average value of (R_L(x) - f(x)) over each region D_i.) Hence, such a subdivision ensures the reduction of error after each execution of the SPORE-1 algorithm. Using THEOREM 5 of Chapter 3, this gives us the following rate of convergence (assuming we subdivide after each execution of SPORE-1):

∫_{D_{η(m)}} (R_L(x) - f(x))^2 dD_{η(m)} ≤ K / M_{η(m)} + ε   (5.11)

where
1. ε is any arbitrarily small positive real number;
2. D_{η(m)} is any resulting subdomain after m domain subdivisions;
3. M_{η(m)} is the number of training sample points in domain D_{η(m)}.
Thus, the rate of convergence to any arbitrarily small error ε is independent of the dimension of the function being approximated.

5.3 Rate of Convergence and Computational Complexity

In this chapter we have given two specific examples of the SPORE structure. We have shown that both examples have a rate of convergence which is independent of the dimension of the function being approximated as a result of two important characteristics. The first is that the regression function is constructed one functional unit at a time, and the rate of convergence of each of these individual functional units is independent of dimension. The second is that the domain of the function being approximated is subdivided to ensure that as each functional unit is constructed, the overall approximation error decreases. Thus, as long as these two characteristics are maintained, any resulting SPORE implementation will have a rate of convergence independent of dimension.

In THEOREM 6 of Chapter 3, we showed that the computational complexity of the SPORE-1 algorithm is linear with respect to the dimension of the function being approximated. This is a powerful practical result because it makes regression functions which have thousands of inputs at least potentially feasible from a computational standpoint. This is experimentally supported by the high dimensional regression examples given in Chapter 4. The SPORE-1 algorithm achieves a computational complexity which is linear with dimension because the drop in approximation error, due to the addition of a new level, does not depend on the dimension of the function being approximated. Instead, it depends on the probability distribution of the function space of f(x) around the 2 dimensional space that the error function is being projected on; i.e. the 2 dimensional space defined by the independent variables of the function g_l(·) (see proof of THEOREM 6 in Chapter 3). Thus, if the variance of this probability distribution is high, this is an indication that f(x) maps poorly onto g_l(·) and many levels in the SPORE-1 structure may be required to get a good fit. Thus, the error drop achieved by cycling through all independent variables of f(x) is not dependent on the dimension of f(x), but rather on the way it projects onto the 2 dimensional input space of g_l(·). Hence the linear growth in computational complexity.

Following this argument, we can show that, within each subdomain defined in Section 5.2.1, the computational complexity of the SPORE-2 algorithm is also linear with dimension. Thus, the total computational cost of the SPORE-2 algorithm is dependent on the number of subdomains required (which may in fact depend on dimension). Similar arguments are valid for the computational cost analysis of any SPORE structure. Thus, we conclude that within each subdivision of a general SPORE-type structure, the computational cost of the algorithm is linear with respect to the number of input variables.

Finally, we ask the following question: under what conditions is the subdivision of space not necessary in order to achieve convergence to zero error?
For the SPORE-1 algorithm, this question is answered in THEOREM 4 of Chapter 3, where we showed that as long as the approximation errors due to the addition of a level to the regression function are sufficiently uncorrelated with respect to current approximation errors, approximation errors will continue to converge to zero. Another interpretation of this is simply that the new level is contributing to overall error reduction. An analysis of the SPORE-2 constructions gives an equally simple interpretation of what constitutes continued error reduction without subdivision. In order for SPORE-2 to reduce error without space subdivision, the projection of the current error function onto its two dimensional functions must always contribute to overall error reduction. The only condition under which error reduction will not occur is if the projection of the current error function onto the appropriate 2 dimensional space produces functions g_{ρ_1σ_1...ρ_lσ_l} and h_{ρ_1σ_1...ρ_lσ_l} which are identically zero (see Appendix B for details). Similarly, the SPORE-1 learning algorithm will also fail to converge if g_l(·) = 0. Therefore, for both algorithms, subdivision is required when the residual errors produce two dimensional functions which are identically zero. As described in Section 3.3.4 of Chapter 3, SPORE-1 uses bootstrap samples to make this pathological condition unlikely. Perhaps it is also possible to use bootstrap samples to achieve similar results with the SPORE-2 algorithm, and more generally with other SPORE-type algorithms. This is an open theoretical question.

5.4 Summary

In this chapter we have shown that the general SPORE structure addresses the two major difficulties associated with high dimensional nonparametric regression. Both of these can be attributed to the curse of dimensionality [Bellman, 1961]. The first difficulty is associated with the potentially intractable number of sample points required to approximate high dimensional functions. The second is associated with the computational complexity inherent in building very high dimensional regression functions. Both difficulties are addressed in the SPORE structure via the sequential addition of small functional units in a way which decreases the residual approximation error. We have shown under which regularity conditions (Stone [Stone, 1982]) this leads to a rate of convergence (as a function of the number of training sample points) independent of dimension. In addition, we have demonstrated that under appropriate conditions, the computational complexity of the general SPORE structure is linear with respect to the dimension of the function being approximated. Thus, this chapter, building upon our results in Chapter 3, constitutes our second argument for a theoretical verification of our thesis presented in Section 1.3 of Chapter 1.

The theoretical analysis presented in this chapter is an attempt to fully understand the fundamental characteristics of a learning algorithm. Many practical implementations of learning algorithms are not based on a firm understanding of their theoretical characteristics, often resulting in ad hoc heuristics which have no firm foundations.
Perhaps one of the most significant differences between the SPORE methodology and the other learning methodologies which it loosely resembles (see Section 2.4 of Chapter 2) results from the fact that we have specifically used a theoretical analysis of the characteristics of very high dimensional regression to define our methodology.

Chapter 6: Conclusion

The focus of this thesis has been the construction of models (nonparametric regression functions) which have a large number of inputs, and where little a priori knowledge is available about what constitutes a good choice for model structure. Examples of large high dimensional nonparametric regression problems can be found in such diverse fields as meteorology, economics, and robotics. However, general practical methods for constructing these high dimensional models have yet to be widely proposed in the literature.

It has been shown in this thesis that very high dimensional nonparametric regression can be effectively accomplished using a sufficient number of small, low dimensional parametric building blocks, which are fitted to the regression data one at a time. Furthermore, we have demonstrated that both variables and building blocks can be added in a random order to the approximation, thus little or no computational effort is required to determine what to fit next. Projection onto low dimensional parametric building blocks was shown by proof and example to produce stable regression functions from relatively few sample points. It was demonstrated that the random selection of building blocks and their inputs leads to algorithms that are computationally feasible in very high dimensional spaces.

This document is a theoretical and experimental evaluation of a general methodology (SPORE) which exhibits the above characteristics. The most significant result of this thesis is that this simple algorithmic structure leads to regression functions which are effective approximators in very high dimensional spaces. The successful application of nonparametric regression to high dimensional problems, such as the 1,600 dimensional problem defined in Section 4.3 of Chapter 4, has not been demonstrated in the past without using some form of dimensionality reduction requiring a priori knowledge. The main reason for this is that previously published techniques have been designed with lower dimensional problems in mind, resulting in algorithms which utilize search strategies in input space, thus making them computationally infeasible in very high dimensional spaces.

The main contributions of this thesis are the following:

I. A new methodology is proposed (SPORE) for constructing very high dimensional regression functions. The SPORE approach to nonparametric regression differs from conventional approaches. Emphasis is shifted from searching through a large parameter space for an optimal structure, to an algorithmic approach which projects learning data onto low dimensional structures, without resorting to computationally expensive searches. This makes SPORE particularly suited to the large regression problems considered in this thesis.

II. A theoretical analysis of SPORE demonstrated that it is a universal approximator.
Furthermore, we showed that the domain of the target function can be subdivided into a finite number of regions where the rate of convergence (as a function of the number of learning examples) to any bounded continuous target function (at least once differentiable) is independent of the number of input variables (i.e. dimension). This theoretical result is significant because it shows a rate of convergence which has not previously been demonstrated for high dimensional nonparametric regression problems. In addition, we showed that the SPORE structure has a computational complexity which is linear with dimension in each of these subdomains. This is also a significant theoretical result because previously proposed multidimensional nonparametric regression algorithms are not computationally practical in higher dimensional spaces.

III. The simplest example of SPORE, called SPORE-1, has been implemented and evaluated. The SPORE-1 learning algorithm is completely automated, requiring no user intervention. Theoretical analysis of the SPORE-1 learning algorithm gives conditions under which the approximation error converges to zero (without domain partitioning), with a rate of convergence which is independent of the number of input variables (i.e. dimension). Experimental evaluation of SPORE-1 on regression problems which have been previously studied in the literature shows that SPORE-1 compares favorably with, and often outperforms, other published regression algorithms. Out of ten regression examples, SPORE-1 did as well or better than published results on all but one of these examples. Evaluation of SPORE-1 on the 10-bit parity problem showed that it can be applied to problems which have flat low dimensional projections, and therefore is not subject to the same limitations as other methods which construct approximations using low dimensional structural units added sequentially. Evaluation of SPORE-1 on very high dimensional data (100 to 1600 inputs) showed that SPORE-1 performs well under noise conditions in both input variables and output values, and that the dimension of the data had no direct effect on learning times or the size of the approximation. Application of SPORE-1 to real-world human-to-robot skill transfer problems, using raw image pixels, showed that SPORE-1 is effective on very high dimensional (1024 inputs), noisy, real world problems. These human-to-robot skill transfer experiments demonstrated a promising (and theoretically understood) solution for a wide variety of complicated learning/programming tasks, which could not previously be addressed in this manner.

6.1 Future Work

SPORE defines a general methodology for constructing regression functions. The SPORE-1 algorithm, having only a simple 2 dimensional cascaded structure with only one type of functional unit, is the most basic example of this general methodology. Our intention is to implement and empirically evaluate other examples of SPORE. In particular, we intend to implement a SPORE algorithm which has many types of functional units, where each functional unit can accept inputs from more than one existing unit. Such a structure would exploit more aspects of the general SPORE methodology, while still maintaining the computational capabilities of the SPORE-1 algorithm.

In addition, the SPORE structures presented in this thesis have been designed as long, cascaded chains which can only be evaluated in a serial manner.
This may not be a suitable computational structure for applications which require fast calculations (e.g. real-time control systems). For such applications, computationally parallel SPORE structures may be more appropriate. The theoretical and empirical evaluation of a parallel SPORE algorithm is a good topic for further investigation.

The main focus of this thesis has been building nonparametric regression functions for the purpose of predicting future outputs given new inputs. However, traditionally many nonparametric regression techniques have been used as analytical tools to determine which inputs are important and how they interact. It would be interesting to investigate the possibility of using SPORE as such an analytical tool. If this were possible, it would allow analysis of input spaces for problems which have many input variables.

An additional topic for further study is a more complete analysis of the theoretical properties of the SPORE structure. This thesis gives rate of convergence results as a function of the number of learning examples, and gives conditions under which these rates are valid. Although empirical and theoretical evidence suggests that these conditions are not overly restrictive, there are as yet no theoretical results which analyze how useful or interesting this class of functions is. In addition, no rate of convergence results, as a function of the number of parameters in the SPORE regression function, are given in this thesis. Both of these theoretical questions are topics for future research.

A specific theoretical investigation which this thesis has not addressed is the full computational cost of a SPORE learning algorithm which uses domain subdivision. We know that the computational cost of a SPORE construction is linear with dimension in each of the resulting subdivisions. However, we don't know how many subdivisions are required for a specific class of target functions. Empirical evidence presented in this thesis suggests that subdivision of the input space is not necessary (at least for the regression examples studied here); however, this may not be true in the general case. In fact, regression examples may exist where an exponential growth in the number of subdomains may be needed as the dimension of the target function increases. This is clearly a topic for future investigation.

Finally, in this thesis, we demonstrated that it is possible to do very high dimensional human-to-robot skill transfer if the operator demonstrates the task using only the robot's sensors and actuators. This is a promising result which we have only started to explore. Many more experiments need to be done in order to fully evaluate the efficacy of this approach to robot programming.

Bibliography

Alfeld, P. (1989). Scattered data interpolation in three or more variables. In Lyche, T. and Schumaker, L. L., editors, Mathematical Methods in Computer Aided Geometric Design, pages 1-33. Academic Press, Boston, MA.

Asada, H. (1990). Teaching and learning of compliance using neural nets: Representation and generation of nonlinear compliance. In Proceedings of the 1990 IEEE International Conference on Robotics and Automation, pages 1237-1244. IEEE.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39(3):930-945.

Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133.

Bellman, R. E. (1961).
Adaptive Control Processes. Princeton University Press.

Breiman, L. (1993). Hinging hyperplanes for regression, classification, and function approximation. IEEE Trans. Inform. Theory, 39(3):999-1013.

Breiman, L. (1996a). Bagging predictors. Machine Learning, 24:123-140.

Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. Ann. Statist., 24(6):2350-2383.

Breiman, L. (1996c). Stacked regressions. Machine Learning, 24:49-64.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.

Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321-355.

Casdagli, M. (1989). Nonlinear prediction of chaotic time-series. Physica D, 35:335-356.

Chen, H. (1991a). Estimation of a projection-pursuit type regression model. Ann. Statist., 19(1):142-157.

Chen, Z. (1991b). Interactive spline models and their convergence rates. Ann. Statist., 19(4):1855-1868.

Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Math. Contr. Signals Syst., 2:303-314.

Edelman, S. and Poggio, T. (1990). Bringing the grandmother back into the picture: A memory-based view of object recognition. MIT A.I. Memo No. 1181.

Efron, B. (1983). Estimating the error rate of a prediction rule: Some improvements on cross-validation. Journal of the American Statistical Association, 78:316-331.

Elder IV, J. F. and Brown, D. E. (1995). Induction and polynomial networks. In IEEE Int. Conf. on Sys. Man and Cyb., pages 874-879.

Fahlman, S. E. and Lebiere, C. (1990). The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, pages 524-532.

Farlow, S. J. (1984). The gmdh algorithm. In Farlow, S. J., editor, Self-Organizing Methods in Modeling: GMDH Type Algorithms, pages 1-24. Marcel Dekker, Inc., New York, NY.

Friedman, J. H. (1991). Multivariate adaptive regression splines. Ann. Statist., 19:1-141.

Friedman, J. H. (1994a). Flexible metric nearest neighbor classification. Technical Report, Stanford University, Department of Statistics.

Friedman, J. H. (1994b). An overview of predictive learning and function approximation. In Cherkassky, V., Friedman, J. H., and Wechsler, H., editors, From Statistics to Neural Networks, pages 1-61. Springer-Verlag.

Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc., 76(376):817-823.

Girosi, F. (1994). Regularization theory, radial basis functions and networks. In Cherkassky, V., Friedman, J. H., and Wechsler, H., editors, From Statistics to Neural Networks, pages 167-187. Springer-Verlag.

Golub, G. H. and Van Loan, C. F. (1993). Matrix Computations. The Johns Hopkins University Press, Baltimore and London.

Grudic, G. Z. and Lawrence, P. D. (1995). Human-to-robot skill transfer via teleoperation. In IEEE Int. Conf. on Sys. Man and Cyb.

Grudic, G. Z. and Lawrence, P. D. (1996). Human-to-robot skill transfer using the spore approximation. In IEEE Int. Conf. on Rob. and Aut.

Grudic, G. Z. and Lawrence, P. D. (1997). Is nonparametric learning practical in very high dimensional spaces? In To Appear: Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Aichi, Japan.

Hardy, R. L. (1971). Multiquadric equations of topology and other irregular surfaces. J. Geophys.
Res., 76:1905-1915.

Hastie, T. J. and Tibshirani, R. J. (1986). Generalized additive models. Statistical Science, 1(3):297-318.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1994). Nonparametric regression and classification. In Cherkassky, V., Friedman, J. H., and Wechsler, H., editors, From Statistics to Neural Networks, pages 1-61. Springer-Verlag.

Hastie, T. J. and Tibshirani, R. J. (1996). Discriminant adaptive nearest neighbor classification and regression. In Advances in Neural Information Processing Systems 8, pages 409-415. MIT Press, Cambridge MA.

Haykin, S. S. (1994). Neural networks: a comprehensive foundation. Macmillan, New York.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multi-layer feedforward networks are universal approximators. Neural Networks, 2:359-366.

Huber, P. J. (1985). Projection pursuit. Ann. Statist., 13:435-475.

Igelnik, B. and Pao, Y.-H. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6(6):1320-1329.

Ivankhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Trans. Systems, Man, and Cybernetics, SMC-12:364-378.

Jackson, I. R. H. (1988). Convergence properties of radial basis functions. Constructive Approximation, 4:243-264.

Jones, L. K. (1992). A simple lemma on greedy approximation in hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608-613.

Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the em algorithm. Neural Computation, 6:181-214.

Juditsky, A., Hjalmarsson, H., Benveniste, A., Delyon, B., Ljung, L., Sjoberg, J., and Zhang, Q. (1995). Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica: Special Issue on Trends in System Identification, 31(12):1725-1750.

Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114:953-956.

Kosuge, K., Fukuda, T., and Asada, H. (1991). Acquisition of human skills for robotic systems. In Proceedings of the 1991 IEEE International Symposium on Intelligent Control, pages 469-474. IEEE.

Kwok, T.-Y. and Yeung, D.-Y. (1997). Constructive algorithms for structural learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630-645.

Lorentz, G. G. (1986). Approximation of Functions. Chelsea Publishing Co., New York.

Lowe, D. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7(1):72-85.

Lui, S. and Asada, H. (1992). Transferring manipulation skills to robots: Representation and acquisition of manipulation skills using process dynamics. J. Dynamic Syst. Measur. Contr., 114(2):220-228.

Malitz, J. (1987). Introduction to Mathematical Logic. Springer-Verlag, New York, NY.

Mel, B. W. and Koch, C. (1990). Sigma-pi learning: On radial basis functions and cortical associative learning. In Advances in Neural Information Processing Systems 2, pages 474-481.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, NY.

Neter, J., Wasserman, W., and Kutner, M. H. (1990).
Applied Regression,  Neural  Analysis  of Variance,  and Experimental  Design.  Linear  Statistical  Models:  IRWIN.  Parmanto, B . , Munro, P. W., and Doyle, H . R. (1996). Improving committee diagnosis with resampling techniques. In Touretzky, D. S., Mozer, M . C , and Hasselmo, M . E., editors, Advances  in Neural  Information  Perrone, M . P. (1993). Improving Reduction  with  Extensions  Processing  Regression  Systems 8. Estimation:  to General  Convex  MIT Press, Cambridge M A . Averaging  Measure  Methods  for  Optimization.  Variance  PhD thesis,  Department of Physics, Brown University. Poggio, T. and Girosi, F. (1989). A theory of networks for approximation and learning. MIT A.I. Memo No. 1140.  Poggio, T. and Girosi, F. (1990). Extensions of a theory of networks for approximation and learning: Dimensionality reduction and clustering. MIT A.I. Memo No. 1167. Pomerleau, D. A . (1993a). Input reconstruction reliability estimation. In NIPS5, pages 279-286. Morgan Kaufmann Pub. Pomerleau, D. A. (1993b). Neural  Network  Perception  for  Mobile  Robot  Kluwer  Guidance.  Academic Publishers, Boston/Dordrecht/London. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1988). Numerical  Recipes  in  C. Cambridge University Press, New York NY. Quinlan, J. R. (1993). C4.5: Programs  for Machine  Learning.  Morgan Kaufmann, San Mateo,  CA. Rasmussen, C. A. (1996). A practical monte carlo implementation of bayesian learning. In Touretzky, D. S., Mozer, M . C , and Hasselmo, M . E., editors, Advances Information  Processing  Systems 8. MIT Press, Cambridge M A .  in  Neural  Bibliography  104  Reed, R. (1993). Paining algorithms - a survey. IEEE  Transactions  on Neural  Networks,  4(5):740-747. Renals, S. and Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. In Proceedings  of the International  Joint  Conference  on Neural  Networks,  pages 1-461-1-467. Rouse, W. B., Hammer, J. M . , and Lewis, C. M . (1989). On capturing human skills and knowledge: Algorithmic approaches to model identification. IEEE Cybernet.,  Trans.  Syst. Man  19(3): 558-573.  Saha, A . and Keeler, J . D . (1990). Algorithms for better representation and faster learning in radial basis function networks. In Advances  in Neural  Information  Processing  Systems 2,  pages 482-489. Sanger, T. D. (1991). A tree-structured adaptive network for function approximation in highdimensional spaces. IEEE Transactions  on Neural  Networks,  2(2).  Shimokura, K . and Liu, S. (1991). Programming deburring robots based on human demonstration with direct burr size measurements. In Proceedings Conference  on Robotics  and Automation,  of the 1994 IEEE  International  pages 469^474. IEEE.  Sjoberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P. Y., Hjalmarsson, H , and Juditsky, A . (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica:  Special  Issue on Trends in System Identification,  31(12) :1691-  1724. Stone, C . J . (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist.,  10:1040-1053.  Tenorio, M . F. and Lee, W.-T. (1990). Self-organizing network for optimum supervised learning. IEEE  Trans, on Neur. Net.,  1 (1): 100-110.  Vitushkin, A . G . (1954). On hilbert's thirteenth problem. Dokl. Akad. Nauk. SSSR, 95:701-704. Vitushkin, A . G. and Henkin, G. M . (1967). Linear superposition of functions. Russian Surveys,  22:77-125.  Math.  Bibliography  105  Widrow, B., McCool, J. M . 
Widrow, B., McCool, J. M., Larimore, M. G., and Johnson, C. R. (1976). Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proc. IEEE, 64(8):1151-1162.
Wolpert, D. (1992). Stacked generalization. Neural Networks, 5.
Yang, J., Xu, Y., and Chen, C. S. (1994). Hidden Markov model approach to skill learning and its application to telerobotics. IEEE Trans. Robotics and Automat., 10(5):421-631.
Yao, X. and Lui, Y. (1997). A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks, 8(3):694-713.
Zhang, Q. (1997). Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8(2):227-236.

Appendix A. Approximating 3 Dimensional Functions as a Cascade

THEOREM I: Let $U(x,y,z)$ be a bounded continuous 3 dimensional function defined on the dense domain $(x,y,z) \in U$, where $U \subset \Re^3$ and $U$ has finite non-zero volume. Then there exists, for all $b \in \{1,2,\ldots,b_{max}\}$ where $b_{max} > 0$ ($b_{max} \in \mathbf{I}$),

a. a set of bounded continuous 2 dimensional functions $V_b(x,y)$ defined on $(x,y) \in S_b$, where $S_b \subset \Re^2$ and $\forall i,j \in \{1,2,\ldots,b_{max}\}\ [(i \neq j) \Rightarrow S_i \cap S_j = \emptyset]$; and,

b. a second set of bounded continuous 2 dimensional functions $W_b(V_b,z)$ defined on $(V_b,z) \in Q_b$, where $Q_b \subset \Re^2$,

such that, for all $\epsilon > 0$ ($\epsilon \in \Re$), the following is true:

$$\forall (x,y,z) \in U\ \big[\exists b \in \{1,2,\ldots,b_{max}\}\ \big[((x,y) \in S_b) \wedge (|W_b(V_b(x,y),z) - U(x,y,z)| < \epsilon)\big]\big].$$

The following are some basic definitions which are used throughout the proof of THEOREM I.

Let $\delta > 0$ be an arbitrarily small real number. Create a uniformly spaced grid of points in $\Re^3$ at all points $(l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta)$, such that $l,m,n \in \mathbf{I}$. Define the set $\mathbf{U}$ to be:

$$\mathbf{U} = \{(l,m,n) \mid (l,m,n \in \mathbf{I}) \wedge ((l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta) \in U)\}, \qquad (2A)$$

where the set $U$ is defined in THEOREM I. Let $l_1, m_1 \in \mathbf{I}$ be fixed such that $\exists n[(l_1,m_1,n) \in \mathbf{U}]$. Then $U(l_1\cdot\delta,\, m_1\cdot\delta,\, n\cdot\delta)$ (the function $U(x,y,z)$ is defined in THEOREM I) is a 1 dimensional function defined for all $n$ such that $(l_1,m_1,n) \in \mathbf{U}$. Let $l_1, m_1, l_2, m_2 \in \mathbf{I}$ be fixed such that $\exists n[((l_1,m_1,n) \in \mathbf{U}) \wedge ((l_2,m_2,n) \in \mathbf{U})]$. Now one can define a norm, symbolized by $\|\cdot\|_{max}$, between the two 1 dimensional functions $U(l_1\cdot\delta,\, m_1\cdot\delta,\, n\cdot\delta)$ and $U(l_2\cdot\delta,\, m_2\cdot\delta,\, n\cdot\delta)$ as:

$$\|U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta)\|_{max} = \max_{n \in \mathbf{N}} |U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta)|, \qquad (3A)$$

where the set $\mathbf{N}$ is defined as:

$$\mathbf{N} = \{n \mid (n \in \mathbf{I}) \wedge ((l_1,m_1,n) \in \mathbf{U}) \wedge ((l_2,m_2,n) \in \mathbf{U})\}. \qquad (4A)$$

Note that the norm

$$\|U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta)\|_{max} \qquad (5A)$$

is defined if and only if $\exists n[((l_1,m_1,n) \in \mathbf{U}) \wedge ((l_2,m_2,n) \in \mathbf{U})]$.

The concept of an S-ordering of the 1 dimensional functions $U(l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta)$ (defined for all $n$ such that $(l,m,n) \in \mathbf{U}$) with respect to the norm $\|\cdot\|_{max}$ is defined next (note that an S-ordering is similar to a well ordering defined in [Malitz, 1987, pp. 21-23], with the addition of some other conditions given below). Define the set $\mathbf{T}$ to be:

$$\mathbf{T} = \{(l,m) \mid (l,m \in \mathbf{I}) \wedge (\exists n[(l,m,n) \in \mathbf{U}])\}. \qquad (6A)$$

Let $\mathbf{S} \subseteq \mathbf{T}$ represent a set of 1 dimensional functions $U(l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta)$, where $(l,m) \in \mathbf{S}$, which are S-ordered. Let the set $\mathbf{V}^1 \subseteq \mathbf{S}$ be defined such that $\forall (l,m) \in \mathbf{V}^1$ the 1 dimensional functions $U(l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta)$ are first in the S-ordering.
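As a concrete aside before the remaining sets $\mathbf{V}^p$ of the ordering are defined, the following sketch shows how the norm of (3A) can be evaluated on sampled data. It is an illustration only, not part of the thesis; the dictionary U_grid, keyed by integer grid indices (l, m, n), and the function name are assumptions made here for readability.

    # Illustration only (assumed data layout): the grid norm of equation (3A).
    # U_grid maps integer grid indices (l, m, n) to samples of U(l*delta, m*delta, n*delta).
    # The norm between the 1-D functions at grid columns (l1, m1) and (l2, m2) is taken
    # over the shared n indices, i.e. the set N of equation (4A).

    def max_norm(U_grid, col1, col2):
        """Return the (3A) distance between two grid columns, or None if undefined (5A)."""
        shared_n = [n for (l, m, n) in U_grid if (l, m) == col1 and col2 + (n,) in U_grid]
        if not shared_n:
            return None  # no common n index: the norm does not exist, see (5A)
        return max(abs(U_grid[col1 + (n,)] - U_grid[col2 + (n,)]) for n in shared_n)

Grid columns whose pairwise distance under this norm is zero are exactly the ones that the S-ordering places in the same set $\mathbf{V}^p$ (condition b below).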
Let the set $\mathbf{V}^p \subseteq \mathbf{S}$ be defined such that $\forall (l^p,m^p) \in \mathbf{V}^p$ the 1 dimensional functions $U(l^p\cdot\delta,\, m^p\cdot\delta,\, n\cdot\delta)$ are in order $p$ (where $p \in \mathbf{I}$, $p > 1$) of the S-ordering. Let the set $\mathbf{V}^{p_m} \subseteq \mathbf{S}$ (where $p_m \in \mathbf{I}$, $p_m > 1$) be defined such that $\forall (l^{p_m},m^{p_m}) \in \mathbf{V}^{p_m}$ the 1 dimensional functions $U(l^{p_m}\cdot\delta,\, m^{p_m}\cdot\delta,\, n\cdot\delta)$ are last in the S-ordering (because $\mathbf{T}$ is a set which has a finite number of elements, $\mathbf{S} \subseteq \mathbf{T}$ also has a finite number of elements and therefore there must exist a set $\mathbf{V}^{p_m} \subseteq \mathbf{S}$ which is last in the S-ordering). Then the S-ordering of the functions $U(l\cdot\delta,\, m\cdot\delta,\, n\cdot\delta)$, where $(l,m) \in \mathbf{S}$, is defined such that the set $\mathbf{S}$, and the sets $\mathbf{V}^p$ ($\forall p \in \{1,\ldots,p_m\}$), must satisfy the following conditions:

Cond a. $\forall (l_1,m_1),(l_2,m_2) \in \mathbf{S}\ [\exists n \in \mathbf{I}\ [((l_1,m_1,n) \in \mathbf{U}) \wedge ((l_2,m_2,n) \in \mathbf{U})]]$.

Cond b. $\forall p \in \{1,2,\ldots,p_m\}$, if $(l_1,m_1) \in \mathbf{V}^p$ and $(l_2,m_2) \in \mathbf{V}^p$ then

$$\|U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta)\|_{max} = 0. \qquad (7A)$$

Also, $\forall p \in \{1,2,\ldots,p_m\}$, if $(l_1,m_1) \in \mathbf{V}^p$ and $(l_3,m_3) \in \mathbf{S}$ with $(l_3,m_3) \notin \mathbf{V}^p$, then

$$\|U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_3\cdot\delta, m_3\cdot\delta, n\cdot\delta)\|_{max} \neq 0. \qquad (8A)$$

Cond c. $\forall p_1 \in \{1,2,\ldots,p_m-2\}$, $\forall p_2 \in \{p_1+1,\ldots,p_m-1\}$, $\forall p_3 \in \{p_2+1,\ldots,p_m\}$, if $(l_1,m_1) \in \mathbf{V}^{p_1}$, $(l_2,m_2) \in \mathbf{V}^{p_2}$, and $(l_3,m_3) \in \mathbf{V}^{p_3}$ then

$$\|U(l_3\cdot\delta, m_3\cdot\delta, n\cdot\delta) - U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta)\|_{max} \geq \|U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta) - U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta)\|_{max}. \qquad (9A)$$

Cond d. Choose $\delta > 0$ ($\delta \in \Re$) such that $\forall (x_1,y_1,z_1),(x_2,y_2,z_2) \in U$, if $\|(x_1,y_1,z_1) - (x_2,y_2,z_2)\| < \delta\cdot\sqrt{2}$, then (note that $\epsilon$ is defined in THEOREM I)

$$|U(x_1,y_1,z_1) - U(x_2,y_2,z_2)| < \frac{\epsilon}{5}. \qquad (10A)$$

It is always possible to choose such a $\delta$ because $U(x,y,z)$ is bounded and uniformly continuous on $U$. Given the above $\delta$, $\forall p \in \{1,2,\ldots,p_m-1\}$, if $(l_1,m_1) \in \mathbf{V}^p$ and $(l_2,m_2) \in \mathbf{V}^{p+1}$ then the following condition must hold:

$$\|U(l_1\cdot\delta, m_1\cdot\delta, n\cdot\delta) - U(l_2\cdot\delta, m_2\cdot\delta, n\cdot\delta)\|_{max} < \frac{\epsilon}{5}. \qquad (11A)$$

Proof of THEOREM I: The proof of THEOREM I is based on constructions of the functions $V_b(x,y)$ and $W_b(V_b,z)$, and the domains $S_b$ (for all $b \in \{1,2,\ldots,b_{max}\}$). These constructions are given in CONSTRUCTION I, CONSTRUCTION II, CONSTRUCTION III, and CONSTRUCTION IV of Appendix D.

In CONSTRUCTION I the discrete functions $V_b(x,y)$ and $W_b(V_b,z)$, along with the discrete sets $\mathbf{S}_b$ (for all $b \in \{1,2,\ldots,b_{max}\}$), are constructed. In CONSTRUCTION II the discrete sets $\mathbf{S}_b$ are used to construct the domains $S_b$. In CONSTRUCTION III the discrete functions $V_b(x,y)$ (for all $b \in \{1,2,\ldots,b_{max}\}$) are used to construct the continuous functions $V_b(x,y)$, and finally, in CONSTRUCTION IV the discrete functions $W_b(V_b,z)$ are used to construct the continuous functions $W_b(V_b(x,y),z)$.

One should first note that a maximum value for $b \in \{1,2,\ldots,b_{max}\}$ must exist because, in STEP 3 of CONSTRUCTION I, the number of elements in the set $\mathbf{T}_1$ is finite, and because in STEP 9 of CONSTRUCTION I, $\mathbf{T}_b$ is updated using the equation $\mathbf{T}_b = \mathbf{T}_{b-1} - \mathbf{S}_{b-1}$, where
6  6  ft  is always a non-empty S-ordering set (non-empty because, by the definition of an S-ordering given on page 108, if T j _ i is non-empty, then there is always at least one element in T&_i which satisfies the conditions of an S-ordering and is therefore a member of the set Sj_i). 109  Because CONSTRUCTION II chooses a value for b £ {1, 2 , . . . , 6  } which minimizes  m a x  the function  P = ,,min  \\(x,y)  - (/ • S,m • 8)\\,  it must be the case that V(x,y,z) £ U, 3b £ { 1 , 2 , . . . , 6  max  (12A)  } such that (x,y) £ 5 j . Thus the  constructed approximation spans the domain of definition of the function being approximated. Also, because the value of b chosen is one which is minimum in magnitude, it must also be true that Vi, j £ { 1 , 2 , . . . , b }[(i  ^ j)  max  S  t  D Sj = 0] (i.e. V(x, y, z) £ U, b is unique and  therefore S is also unique). b  Given that (x,y,z) W (V (x, b  y), z)  b  U  and (x,y)  £  S, b  in CONSTRUCTION IV the function  is calculated using the equation: =(W (V m • <5) - W ( V , n • 6)) • v(Vb(xy)-V )  W (V (x,y),z) b  £  s  6  b  6l>  6  6o  0  bo  +  s ( z - n 6 )  where the function V (x,y)  (  1  3  A  )  is calculated, from CONSTRUCTION III, using the equation:  b  V (x, y) =(V (/i •tf,m i • *) - V ( / • <5, m • *)) • * ( ~ * ' ) + a  b  6  6  0  x  l  8  0  (14A) (V (/ • 5, m • 5) - V ( / • .5, m • 6)) • 6  2  2  6  0  S y { y  0  ~ ™° '  6 )  + V (/ • 6  0  m • 5). 0  One should note that the function Vt(.x,?/) is continuous and bounded on (x,y) £ S , and the b  function W (V (x,y), b  1. U[x,y,z) 2.  b  z)  is continuous and bounded on (Vj,z) £ $6, because:  is continuous and bounded on (z,?/,z) £  Condition d of an S-ordering given on page 108; 110  (7;  3.  STEP 6 of CONSTRUCTION I;  4. - The tessellations grids used in CONSTRUCTION III and CONSTRUCTION IV; and, 5.  STEP 2.a and STEP 2.b of both CONSTRUCTION III and CONSTRUCTION IV. Inserting equation (14A) into equation (13A), one gets: W (V (x,y),z) b  b  •6)-U (V ,n -6))-sv  =(W (V ,ni 6  6 l  b  bo  0  (Vfc(/i • 8, mi • 8) - V ( / • 8, m 6  0  0  • 8))  s (x x  - l • 6) 0  6  8  V  (V (/ 6  • 8, m  2  2  • 8) - V ( / • 8, m f c  0  0  • 8))  s (y - m y  0  8  8  V  ( V i ( / • 8,m 0  0  • 8)  +  (15A)  • 8) - V ] bo  8v (U {V ,n -8)-W (V ,n -8)) b  b2  2  b  bo  •^  0  ' ^ + W (V 6  f e o  ,n  • 8).  0  Therefore,  \m(V (x,y),z)-U(x,y,z)  ( W ( V , ? i i • 8) - W j ( V , n o • 6)) • sy 6  6 l  6o  b  (Vb(h • 8, mi • 8) - V ( / • 8,m 6  0  • 8))  0  s (x x  8 fe  2  2  • 8) - V ( / f e  • 8, mp • 8))  0  s (y - ?np • 8) y  8y (V (/ 6  8  -6,7710 -8) - V  0  6 o  )  + s {z z  (U (V ,n -6)-U {V ,n -8)) b  b2  2  b  W/j(V ,?2o • 8) feo  bo  0  8  V  ( V ( / • 8, m  - l • 8)  — no • 8)  0  -U(x,y,z)  (16A) Using the Schwartz inequality and taking out the sy, s , s , and s terms (this can be done x  y  z  because, from CONSTRUCTION III and CONSTRUCTION IV, \s \ = \s \ = Is, v  s  1),  one gets: \W (V {x,y),z)-U(x,y,z)\ b  b  <  ( W ( V , n i -8) - W ( V 6  6 l  6  6 o  ,n  0  •<$))•  ( V ( / i • 6, m i • 8) - V ( / • 8, mp • 6)) fe  f e  8  '  V  (W (V ,ni-<S)-W (V 6  6 l  6  (V (/ • S, m ft  2  (x - lp • 8)  0  f e o  ,n -6))0  • 8) - V (lp • 8, mp • 8))  2  +  8  (y - nip- 8)  b  8y  '  8  +  (U (V ,n -8)-U (Vb ,no-8))b  bl  1  b  0  (V (lp-8,m -8)-V ) 8v b  0  +  bo  ( W ( V , n - -6) - W ( V 6  62  2  6  ,n  0  -6))  (z - np • 8)  +  -U(x,y,z)  W & ( V , n • 8) feo  6 o  0  (17A) Because of: 1. Condition d of an S-ordering given on page 108; 2.  S T E P 6 of CONSTRUCTION I; and,  3.  
S T E P 2.a of both CONSTRUCTION III and CONSTRUCTION IV,  the following inequalities are true: | W & ( V , r i i • 8) 6 l  W (V ,72 6  6O  0  •<$)!<{,  5  Because of: 1. S T E P 2 of CONSTRUCTION I; and, 112  (18A)  2.  S T E P 1 of both CONSTRUCTION III and CONSTRUCTION IV,  the following inequalities are true: (x -lo-  6)  8 (V -  m •8) 0  8 -  n  0  8  Because of: 1. S T E P 1 of CONSTRUCTION I; and, 2.  S T E P 2 of CONSTRUCTION I; and,  3.  S T E P 1 of both CONSTRUCTION III and CONSTRUCTION IV,  the following inequality is true:  \V {V ,n -6)-U{x,y,z)\ b  bo  0  <  ^.  And finally, because of: 1.  Condition d of an S-ordering given on page 108;  2.  S T E P 6.iii of C O N S T R U C T I O N I;  3.  S T E P 2.a of both CONSTRUCTION III and CONSTRUCTION IV; and,  4.  S T E P 2 of CONSTRUCTION IV,  113  the following inequalities are true: (V (/i  8, mi  6)-Mk  (Mh  8, 7712  8)-V ( o  6  • 8,  m  0  • 8, mo  l  b  S)) S))  Sv  (Vj(/  0  8, mo  *)-v ) i o  Sv  < 1.  Inserting the above inequalities into equation (17A), one gets: \W (V (x,y),z)-U(x,y,z)\ b  <^ + ^+ J+ ^ +  b  This completes the proof of T H E O R E M I • .  114  Appendix B. Constructing the SPORE-2 Structure In the proposed structure the approximation of f(x ,..., ZJV-I) is given by the following 0  expansion (see Figure 1 for a 4 dimensional example of this): If  is an element  (xrj, • • •, N-i) x  P\o~i • • • Pi f(x ,.. 0  .,XN-I)  of the subdomain  corresponding  to tree branch  cri , then =g  P l  a (h 1  9pia p (T2 1  ^  2  xi), x ) +  a ( o, x  P l  2  1  (j pl<TlP2<T2 V Pl&l l  9pi(Ti...a,  l  \h  ...ai  pifTl  ^  1 ^)l X  \ h <r ... p i  1  (  < T ( l  3  )  niOd  _ ,X( 1 )  l  ( J V ) ) )  m  o  c  j  J ,  m  0  f  j  1=3  \J  {N)  ,  (23A)  where, a.  A mod TV = integer remainder of (A -f- TV), and is used as an index to cycle through the independent variables of f(xo, • • •, £_/v_i);  b.  the subscript notation  /?i  <TI . . .  /9;<r;  (/ e { 1 , . . . ,  l  m a x  })  is standard tree notation (see complete  definition below) and serves to identify the two dimensional functions hp^ai-.-piui  g  P i a i  ...piai  and  which act on the specific subdomain o f / ( x o , . . . , XJV_I) referred to by branch  P\(T\ • • • picrr,  c.  and,  lmax  is an integer defining the maximum number of terms (or levels of the tree structure)  required in order, to achieve some maximum approximation error max /(x) - /(x) < e  115  (24A)  where e is an arbitrary small positive real number, andV is the domain of the function / ( x ) . The most basic structural characteristic of the proposed iV dimensional model is its tree structure.  This structure is here referred to as the approximation  tree.  A n example of the  proposed model approximation tree structure is given in Figure 2. Two things are represented via this tree structure: 1) variables of the function being approximated are selected and acted upon by functions at each tree node; and, 2) the domains spanned by these variables are being divided up at each node in such a way that each point in the domain of the function being approximated represents one and only one path through the tree.  These characteristics are  explained in detail below. Tree Notation: Each level of the proposed model has 2 types of branches: p branches and a branches. These two types of branches, as is described later, are differentiated from one another by the type of discriminant  functions  (which are functions that dictate which path through the  tree is taken) associated with them. 
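Before the branch labeling is made precise, it may help to see how the expansion (23A) is evaluated once a branch has been selected and its two dimensional building blocks are in hand. The sketch below is an illustration under assumptions, not the thesis's implementation: spore2_predict and the list-of-callables interface are invented names, the plain l mod N cycling stands in for the index bookkeeping of (23A), and choosing which branch (and hence which h, g pairs) applies to a given input is the job of the discriminant functions just described.

    # Illustrative sketch (assumed names): evaluating the cascade of equation (23A)
    # along one branch of the approximation tree.  Each level l contributes two fitted
    # 2-D functions: h_l, which takes the previous h output and one cycled input
    # variable, and g_l, which produces that level's additive correction.

    def spore2_predict(x, h_funcs, g_funcs):
        """x: input vector of length N >= 3; h_funcs, g_funcs: one pair of
        2-argument callables per level, assumed to belong to the branch containing x."""
        N = len(x)
        h_prev = h_funcs[0](x[0], x[1])        # level 1: h(x0, x1)
        y_hat = g_funcs[0](h_prev, x[2 % N])   # level 1 term: g(h, x2)
        for l in range(2, len(h_funcs) + 1):   # levels 2, 3, ..., cycling through inputs
            h_prev = h_funcs[l - 1](h_prev, x[l % N])
            y_hat = y_hat + g_funcs[l - 1](h_prev, x[(l + 1) % N])
        return y_hat

The point of the structure is visible in the sketch: every additional level only ever fits a two dimensional function of (previous output, one input variable), regardless of how large N is.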
Note that the length of a branch path and the number of branches emanating from a particular tree node is not fixed a priori, and is determined based on the mapping of the function being approximated, as the model is being built. The branches of the tree are labeled as follows: let i\, i , ..., iii be positive integers, then 2  level 1 branches are identified by the labeling pi — i\ for p branches and p\o\ = ii,%2 for a branches; level 2 branches are identified by the labeling p\(j\p  2  and p\o\p (J 2  2  labeling piai...  = u,«2,«3 for p branches  for a branches; similarly, level / branches are identified by the  = ii,i2,h,H  p\ = %i,% ,..., i i-i 2  2  for p branches and pio-y...  a branches. 1 1 6  p\o\ = ii,i.2,  • • - ,121-1,121 for  The above is an example of how a 4 dimensional function f(x , x , x ,x ) Q  x  2  3  is approximated in the proposed structure.  The approximation is symbolized above by / . The above example shows the calculation of the approximation / along one path of the approximation tree. This path passes through the branches p\ = 1, p\(T = 1,1, p\0- p 1  and Picr p a2 1  2  = 1,1,1,1 (see Figure 2), and continues on through to branch p j C i  .. - P L ^ L  1  —  2  1,1,1,  = 1,1, - - -, 1,1, where  the subscript L defines the number of levels of the tree required to get a sufficiently good approximation along this path of the tree. As explained in the body, the proposed approximation approximates functions 2 dimensions at a time. These 2 dimensional functions are symbolized as (J  ,,,  P1<T1  PI<TI  and h  p i a i  p i a i  (the integer / defines the level of  the tree). At tree branch p a = 1,1 of level / = 1, the functions are symbolized by <h ^ and h ; at tree branch 1  Picr p o1  P1V1  2  2  1  1:1  = 1,1,1,1 of level / = 2, the functions are symbolized by <7i,i,i,i and • • - P L & L  = 1,1,..., 1,1 at level / =  L ,  ^1,1,1,1;  and so on through to branch  where the functions are symbolized by <7i,i,...,i,i and  Figure 1 4-D Example of the Proposed Approximation  1 1 7  •  D.P  Rr  M  °1  R  P,a  Pi iPf; 0  Pfff  = 1,1,1  l P 2  P l  a  2  l P p  2  =1.1.1.1^  = 1,1,2,1 ^  pa, = 1 , 1 / P,°,P, = 1,1,2  P ,Pf2 = a  1  X  \  1,1,2,2  \ P,a  P.°lPf\ = 1,1,2,3  =1,2^  \  P,= 2 P  p , c f = 2,1,1  l°l f 2 P  =  2  . ' > 1  1  1  2  pa 1  I  =2,1  ppp? = 2.1 A l , , 2  ,P,a  lP2  = 2.1.2  <  Pl°lPf 2 = 2.1.2.2  Branches  o Branches  Branches  Branches  Level 2  Level 1  The above is an example of the tree structure of the proposed model which partitions the input space of the function being approximated into subdomains that allow the function to be approximated using a superposition of a cascade of 2 dimensional functions. The example tree given shows only 2 levels. The notation used in the diagram is explained in detail within the text of the paper. Figure 2 The Space Partitioning Characteristic of the Proposed M o d e l  Each branch of the approximation tree represents a specific subdomain of the function being 118  approximated. Thus, one can define the notion of £ tree branch p\a\ ... p\o\  . . . , .TJV-I)  (XQ,  (25A)  to mean that the point (XQ, . . . , x-yv-i) (which is within the domain of the  dimensional function  being approximated) is contained within the subdomain represented by branch p\o\...  p\o\. For  / = 2, this subdomain is defined as follows: (xo, •  • •  i ^ A f - i ) £ tree branch p\o\p (J2  ((x ,x ,... 3  4  ((x ,xi,X4,x ,... 
0  where the domains R , - D Pl  ,.T/Y_I)  ,XJV_I)  5  P l C T l  ^  2  £  i2p  A (x ,xi) £ D )A  R  £  Pl  1 ( T l  p  A (h  1  2  1  £ D  (xo,xi),X2)  Plffl  2  P l < T l P 2 a 2  and the functions h  , R <jip -, Dp <T p Pl  (26A)  piai  0  are defined in THEO-  Pl<Tl  2CT2  R E M II. For / > 2, the domain represented by branch p\a\...  ),  p\o\ is defined as follows:  (XQ, . . . , x/y-i) £ tree branch p\o\ ... p\o\ ((x , . . . , x / v _ i ) £ tree branch p\a\... 0  {( il X  J • • • ) iN-2)  ^ •^pi(T ...p|  X  1  where the domains R ...p Dp ... , Pl(Ji  z i , . . . ,iN-2,ji  ar  n  iai  Pl(Tl  A ( / l p T l ...pi-KTl-l l (  the functions  (27A)  /?;_icr/_i)A 5^jl ) /ip  l C T l  ^  pi<T\...pi(Tl)l  D  ...p,_  1 (  r,_ , 1  and the subscripts  e defined in T H E O R E M II. Note that the integer subscripts z i , . . . , I N - 2 , J I  are used to keep track of which variables of the function being approximated are used at which node of the approximation tree. Also, note that the modulo function used in equation (23A) and in the statement of T H E O R E M II, is used to cycle through all possible variables as often as is necessary to produce an approximation which is within some maximum allowable error. T H E O R E M II: Let f(xo,xi,... , .r/y_i) be a bounded continuous (N  > 3) defined on the dense domain  V, where V C 119  N dimensional  and V has finite  non-zero  function size.  For  all I G 1 such that I > 1, let i\, io, • • •, i N-2 £ { 0 , l , . . . , i V — 1} and j\, ji G { 0 , 1 , . . . , N — 1 swcn £na£,  '  a.  ji  = (/ mod N), and j  b.  V m , n G {1, 2,. . . , TV — 2} [(m ^ n => i  c.  Vm G { 1 , 2 , . . . , T V - 2 } A Vn G {1, 2}[i  2.  3.  ^ i ) A (m > n => i  m  n  V/9i  m  > i )];  a n  n  d  + jn].  m  Then there  1.  = ((/ + 1) mod N);  2  exists,  G {1,2,3,. . .}, a set of domains R  C  Pl  K ^  -  3  where  a.  V i , j G { l , 2 , 3 , . . . } [ i + j => R n Rj = 0], and  b.  3 G {l,2,3,...}[( >i)=>(R  t  ?  Pl  =<&)];  Pl  V/OI,<TI  G {1,2,3,...}, a set of bounded continuous  defined  on (xo,x\)  GD  2 dimensional  functions  h (xQ,xi Pl(Tl  where  Pl(Tl  a.  D  C 3? ,  b.  V/>i,i,j G { l , 2 , . . . } [ i ^ j =4> D nD j  c.  3i G { l , 2 , 3 , . . . } [ ( < 7 > O ^ ( £ p « r = 0 ) ] ;  2  Piai  Pli  1  V/?I,<TI  G {1,2, ,3,...},  defined  on (h  a.  Qpiai  b.  3 G{l^,...}^! > 0 ?  C 9£ , 2  1  a set of bounded  £  ,xz)  PlC7l  = 0], and,  Pl  1  continuous  wnere  and (Q  P l < 7 l  =0)];  120  2 dimensional  functions  g  P i a i  (h  p i a i  ,  x  2  4.  V/ 6 {2,3,4,...} andVpi,ai,...,pi  a.  5.  Piai  C^  Pl  2  wnere  V/>i,CTi,...,/»j_i,<7i_i,i,.7 e {1,2,....} {} 7^ i) ^  b.  £ {1,2,...}, a set of domains R ...  (-Rpi<Ti...pi_ <T|_ii l~l Rp ai...pi- a - j 1  1  1  t  = 0)  1  , and,  3 e {i,2,3,...}[( >i)=>(i2 a ....p = 0)]; ?  /0/  Pl  VI £ {2,3,4,...} and V/3i, 2 dimensional Dp\o\  functions  <7i,  1  I  /> , cr , • • • ,Y>/,cr; £ {1,2,3,. ..}, a set of bounded 2  2  h ... (h < :.. _ _ ,x ) pl(Tl  Piai  pl  Tl  Pl  iai  1  . . . p\o-\ where  defined  J1  continuous  on (h ... _ _ , Plffl  Pl  l(ri  xjf) £  1  1  &•  -Dpi (Ti ...p((T; C K ,  b.  V/Ji,cri,...,/Ji,i,j  c.  3i G { 1 , 2 , 3 , . . . > i ) =^(Z>  £ {l,2,...}[(i^j) Pl(Tl  (£>pi <Ti...pii H D ,,, j PlfTl  .;.p  ana",  lffl  —  pl  0)], and  = 0)];  ,  6. V/ £ {2,3,4,...} and \/pi, ai, p , cr , ...,pi,a 2  2 dimensional  functions  2  t  g a ...pia (h a ... a ,Xj ) Pl  1  l  Pl  1  Pl  l  £ {1,2,3,...}, a set of bounded defined  2  on (h „. 
„ xj ) Piai  Pl  n  2  £  continuous Qp^.-piai,  where  3-  Qp\ct\...picri CL 9?", and  b.  3 £ {l,2,3,...}[(a >i)=>(Qp <r ... ?  /  1  such that (given all of the above definitions),  1  Plff|  = 0)];  for any arbitrarily  .121  small e  max  £ St where e  max  > 0,  the following  is true:  V(x ,xi,...  G T>, 3l  ,XN-I)  0  (V/ G {1,2, 3 , . . . , l }(3(pi  G { l , 2, 3,...} A a, G {1, 2, 3,...})))  m  f(x ,xi,:..  ,x _i)  0  where fl\ is defined  as  i. far I = 1, ((x ,xi) ii.  V/ G {2,3, 4 , . . . , l }, m  "(x ,...,  xTV-I)  0  -  N  (28A)  \  follows:  GD  0  G {1,2,3,...}  m  P l ( T l  A (x ,...  € R ) =>• fi\ -  3  if(xQ,....  ,x  JV_I  g a (h (x ,x ),x ),  Pl  Pl  ) G tree branch p\o~\...  an element of tree branch  — 9pi<jj...pi<Ti(l P\(J\ l  1  pl<Tl  0  1  p\o-\ (note that the concept of  p\o~\ ... p\a\" is defined on page 119), then  ...pi<Tl V'Pl (Tl...p(_lCT(_l , jl X  (29k)  ) 5 J2 ) • X  Proof: The proof of T H E O R E M II is based on constructions of the functions and g  P i a i  ...  p m  ,  and the domains R ... Pl(7l  '  Pl  2  and D ,„ plCTl  h  _,  PlCru  Pl(Tl  (for all / G {1,2,3,...} and  piCTl  V/9i, a i , . . . , pi, ai G {1,2,3,...}). These constructions are given in CONSTRUCTION V of Appendix D . L E M M A I: Let (x\, x\,..., Then there exists a domain of  R  N  S T E P 3.ii of CONSTRUCTION  r  N  1  1  C $t ~ N  Pl  G$  x _) 3  3  be a single point which satisfies equation  such that (x\, x\,...,  V are satisfied.  122  \f_i)  x  GR  Pl  and the  (83A).  conditions  Proof of L E M M A I: Let 7 G 3? such that 7 > 0. Then, define the domain R  Pl  R  P1  = {(x ,...,x _ )\((x ,...,x _ ) i  N  1  3  N  as follows:  G ^ " ^ A  1  G 2>])A  (3(x ,xi,x )[(x ,... 0  2  0  (30A) (||(x , • • - , N - l ) X  3  (Vi  G {0,...,/9i  Note that one can always choose a domain R  Pl  -  ( l,  < 7) A  • • • , N-l)\\  x  X  -1}[(X ,...,XJV_I) 3  which satisfies the above condition because the  domain T> of the function being approximated, is dense, and because equation (83A) is true. Note that the above R , by definition, satisfies the condition of S T E P 3.ii.a and the condition Pl  of S T E P 3.ii.b of CONSTRUCTION V . Given the above definition of R , for 7 = 0 equation (85A) becomes: Pl  V (xo,xi,x ) Pl  Because f(xo,x\,...,  2  = f(x ,xi,X2,xl,x\  . . . ,x _ ).  (31A)  1  0  N  1  XJV-I) is continuous within the dense domain T>, it follows then that as  7 approaches zero, rn  , a x | 7 ( a ; o , x i , X 2 ) - /(a; , i . • . , ?  x  Pl  0  0.  N-I)\  X  By the definition of continuous functions, this implies that V (xa,xi,x ) Pl  2  (32A) is also continuous.  Thus, because the domain V is dense, there must exist a 7 > 0 such that equation (88A) is satisfied. Therefore, there must exist a domain R  Pl  which satisfies the condition of S T E P 3.ii.d  of CONSTRUCTION V . Finally, if one chooses any value of 7 > 0 which satisfies equation (88A), then the condition of S T E P 3.ii.c of CONSTRUCTION V is also satisfied because R  Pl  123  is dense and because the  domain V of the function being approximated, f(xo,  is also dense and finite in  x \ , . . . ,XN-I),  size. This completes the proof of L E M M A I. As shown in the proof of L E M M A I, the function V  Pl  (XQ,X\,  X) 2  is continuous and therefore,  by T H E O R E M I, the construction in STEP 3.iii of CONSTRUCTION V is valid. 
Also, by T H E O R E M I, the functions h \,(xQ,xi)  and g b(h b, 2)  0  A  Pl  Pl  continuous and therefore the  constructed in STEP 3.iv of CONSTRUCTION V as  function f (x ,x r_ ), PltTl  a r e  x  pi  1  /pi(Ti(»0,  •••  , N-l)  = f( 0,  x  x  is also continuous within the domain D o m  ~ 9p aA Pi(rA 0, l), 2),  • • • , N-l) x  Pl(Tl  fl  x  x  (33A)  x  1  . Note that, because the domains  D ,R ,U Pl(Tl  Pl  and V are all dense and of finite size, then by STEP 3.iv of CONSTRUCTION V , the domain Dom,,^, is also dense and of finite size. Also, by T H E O R E M I and STEP 3.iv of CONSTRUCTION V , the union of the domains D o m V=  | J  Pl(7l  span the domain V; i.e.  Dom .  (34A)  i;  i€{l,2,...} jG{l,2,...} Next, by T H E O R E M I and STEP 3.iii of CONSTRUCTION V , the following is true: max  \V (x ,x ,x ) Pl  0  1  - g a (h < (x ,x ),x )\  2  Pl  1  pi  Tl  0  1  < ep.  2  (35A)  Thus, given that equation (88A) of CONSTRUCTION V is satisfied, one can rewrite equation (88A) as follows: max max  \V (x ,xi,X2) Pl  0  |/(.T , 0  • • • .ZJV-I)  - g , (h (xo,xi),X2)\ Pl< 1  Pl<Tl  -  V {x ,xi,X2)\ Pl  0  < P • (max  + \f(x ,...,x _i)\ 0  N  (36A) 124  Using the Schwartz inequality, the above equation becomes: f(x ,.  max Dom„, 1  +  V (x ,x ,X2)-  < f3 • ( max \Dom  \f(x ,...,  a w _ i ) | I, J  Pl  0  Pl  1  0  (37A)  9 a (h a (xo,xi),x ) Pl  - V (x ,xi,X2)  . . ,XN_I)  0  Pl  1  2  0  w a i  or, simplifying the equation, one gets max Donip  • • • ,X _i)  f(xo,  -  N  <  g a {h a {xo,Xi),X ) Pl  1  pi  1  2  lCTl  (38A) (3 • ( max \f(x ,. ,Dom j  . . ,2.^-1)1  0  , ,  P 1 CT  which can be expressed as max Dom . 0  f (xQ,...,x -i) PlCTl  </?•(  N  max Dom„,  \f(x ,. 0  ..,XJV-I)|  (39A)  )•  CT1  a  Now consider the statement and proof of L E M M A II. L E M M A II: Let (x^,..., (94A).  Then  R ai... - aiPl  Pl 1  lPl  there  exists  x\ _^j  £  N  a domain  $t ~ N  be a single  2  C  R - ... _ _ Pl(J  i  pt  lUl  lPl  point  which  such  that  o / S T E P 7.iiL2 of CONSTRUCTION  and conditions  V are  satisfies  (x^,...,  equation x} _^J N  £  satisfied.  Proof of L E M M A II: The lemma is first proven for the case / = 2. Let 7 £ 9£ such that 7  > 0. Then, define the domain R R  as follows:  P l ( T l P 2  . •, Xi _ ) I ( 2 ^ ,. . ., X{ _ ) £ 3?  j (x ,.  N  il  2  (3(x ,Xj )[(x ,.\. jl  2  ((XQ, . . . ,  0  XN-I)  N  .,x _i) N  2  ^A  £ D])A  £ tree branch  (40A)  pia\)A  .)-(4.-^L)|<7)A (Vi £ {0,...,/)2 - l } ^ , - , , . . . , ^ ^ ) ^i2pi i])}ffl  125  Note that one can always choose a domain R  which satisfies the above condition because,  PlfflP2  as shown above, the domain, D o m  P l ( r i  , of the function being approximated, f  (^0, • • •, N-i), x  PlUl  is dense and because equation (94A) is true. Note that the above R  ,  by definition, satisfies  PiaiP2  the condition of S T E P 7.iii.2.a and the condition of S T E P 7.iii.2.b of CONSTRUCTION V . Given the above definition of R  ,  for 7 = 0 equation (97A) becomes:  Pl(TlP2  = f „ (xl,x\,x ,X3,x\,...,x _ ).  ^ i<T {hp ,X2,X3) P  lP2  Pl  1  2  Because the function being approximated, f i( o, Pl(Tl  N  ••  x  puT  domain D o m  (41A)  1  L!TL  , and because the function h  1  ^'JV-i).  •  is continuous within the dense  is continuous (as proven in T H E O R E M I), it  pj(Ti  follows then that as 7 approaches zero, max \V p (hp ,X2,X3) PllXl  PI  "\  P  2  -f  Pl(Tl  LLTL  (an, a i , • • •, z / v - i ) | -» 0.  
(42A)  2  X2, £ 3 ) is also contin-  By the definition of continuous functions, this implies that V p (h , Pl<Tl  uous. Thus, because the domain Dom  Pl(Ti  2  Pl(Tl  is dense, there must exist a 7 > 0 such that equation  (100A) is satisfied. Therefore, there must exist a domain -Rp  1(T]/92  which satisfies the condition  of S T E P 7.iii.2.d of CONSTRUCTION V . Finally, if one chooses any value of 7 > 0 which satisfies equation (100A), then the condition of S T E P 7.iii.2.c of CONSTRUCTION V is also satisfied because R  Pl(7lP2  domain D o m  P l J ]  of the function being approximated, f  (xrj, • • •, N-i), x  pl<Ti  is dense and the is also dense and  finite in size. This completes the proof of L E M M A II for the case / = 2. The proof of L E M M A II for the cases / 6 {3,4,...} is given next. 126  As shown above, the function V j (h ,x ,£3) Pl( lP2  Piai  is continuous and therefore, by THE-  2  O R E M I, the construction in STEP 7.iii.3 of CONSTRUCTION V is valid. Also, by THEO R E M I, the functions h  and g a a  PiaiP2(J2  fpi<T a (xo, lP2  SJV-I),  2  fpi<Tip cr 2  ( Q, • • • X  2  Pl  lP2  are continuous and therefore the function  2  constructed in STEP 7.iii.4 of CONSTRUCTION V as , a-'Af — l )  j  =  Pl  <7]  (XQ, . . . , X y y r _ 1 ) —-  is also continuous within the domain Dom  g u\ a  lP2  R  2  P  l  ai  P  2  P2  O pi<Tip cr l  2  2  2  5^3),  and V are all dense and of finite size, then by STEP 7.iii.4 of C O N -  ,U  STRUCTION V , the domain D o m  Pl  o\p o 2  2  is also dense and of finite size. Also, by T H E O R E M  I and STEP 7.iii.4 of CONSTRUCTION V , the union of the domains D o m domain D o m  P l ( T l  pi(Tl  =  Therefore, given that f <j cr (xo,... Pl  Pl(Tl  p  2(T2  P l ( T l P 2 ( T 2  span the  ; i.e. Dom  Dom  (43A)  Note that, because the domains  .  PlCJlP2CT2  Dpi<T cr ,  pl  lP2  [ J Dom ^{1,2,..} ./ G {1-2....}  (44 A)  P l ( T l i r  , xyy_i) is continuous within the dense finite domain  2  , the proof of L E M M A II for the cases / £ {3,4,...} follows by induction the proof  for the case / = 2. Note that for / £ {3,4,...} the domains R <T ... i pl  1  P  —  {(^ii? • • • , ix-i)\{x%\i x  (3(X  J I  ,.T  J 2  )[(X , 0  ••  R  •) i _ ) E  ...  2  x  N  P i a i  2  are constructed as follows:  P l  A  . . . ,ia?AT_i) £ D])  ((xo, • • •, XN-I) £ tree branch p\(?\ ... 0/_i<7/_i)A  (45A)  /  ||(x,-,,..., xtN_2) - (xl,..., x\N_2) || Vi £ {0,. . . ,pi - l } [ ( x . , . . . , x _ ) i tl  127  J N  2  <  7A  R <j ... „ cT - i] Pl  1  Pl  1  l  1  }•  Also, note that, by induction, and by T H E O R E M I and S T E P 7.iii.4 of CONSTRUCTION V , the union of the domains Dom  Dom  Pl(ri  Pl(Tl  ...p  ...p _ I  1(ri  _  ((T(  ]  span the domain Domp ... _ - _ ; i.e. 1<7l  =  Dom  ( J  /?l0  (0(  lO  l  . ... _ ,_ i 1  w  l£r  1  1  (46A)  r  zG{l,2,...} jG{l,2,...} This completes the proof of L E M M A II. Next, by T H E O R E M I, and S T E P 7.iii.3 of CONSTRUCTION V , the following is true for the case 1=2: x )\<e . 
3  p  (47A)  U o mCTJP2 a~2  Thus, given that equation (lOOA) of CONSTRUCTION V is satisfied, one can rewrite equation (100A) as follows: \fp T (xo,...,x ^i)- Pp a p (hp „ ,x ,xs)\  max  +  /  1<  1  1  N  1  2  1  1  2  UOHlp^ tjy P2&2  max  \ Pp <rip2 /  1  ( 0, l, X ) — Qp^ai P2O2 x  x  2  P - I  max  |/p  lCTl  i] Pi<JiP2(T2 l  {hpitri, 2), z) I 5i x  x  (A<2\\  (a [.T , • • • , z / y _ i , 0  y J J O I T l p ^ cr^ p2 o"2  Using the Schwartz inequality, the above equation becomes: max Dom  /PI<TI(ZO,  • • • , N-l) x  -  Vp p {hp X2,xz)+ l(Tl  2  iUl1  PlCTlP2 CT2  { G-> li 2^ x  ~Ppi<J\P2  x  x  P-[  9p\o\p2<72  (^piT1P2U2  max \f (x ,... Piai  {hpi&i, 2~) x  1  s)<  x  (49A)  ,x _i)\],  0  N  or, simplifying the equation, one gets max Dom P i a 1 P2 "2  fpiUl { 0l X  • ••5  N —l )  X  9pi<TlP2<72  0 Pl<riP2&2 ( ^ P l « T l l  ,  2)l  X  Z)<  X  (50A) P-\  max \fp (xo,. . . ,x -i] 1<Tl  N  . L/OITlrti <T 1 Do (TO 128  Therefore, for the cases / £ {3,4,...}, it follows by induction that max  jp\G\...px-xUl-x  Dom  Qp\<J\...plVl  (3 • I  ( 0l  X  {] pi(Jl...pi<T l  l  P1CT1  —1)~  (hpi<Tl...pi-lCTl-l  ) jl)-I  X  • • • •>  x  ...p  (51 A)  J2)  X  \fpi<JI••.pi-i(Ti-i{ Qi  max  \Dom  •••)N  X  x  N  )  /  1(r(  or, simply, max  Ip\<Ti...pi<jX Qi fp x  Dom (3 • [  i N—l)  •••  x  < (52A)  \f <T ... - cT -. ( 0,-••, N-l)\)'  max  Pl  1  Pl  1  l  1  x  x  \Dom ... Piai  J  Pl<Tl  Thus, given equation (39A), it follows that  Dor  max 1>\  1  fpi(Tl...pi(Tl{ 0-> X  • • • 1 N — l) X  </?'•(  max  |/(s , •• 0  (53A)  -,XN-l]  Dom,  "i • • • PI  Therefore, because 0 < (3 < 1 and because the function being approximated, f( o,. x  •.,  ZJV-I)  is bounded, it follows that there exists an / E {1,2,...} such that, for any arbitrarily small > 0,  (3 •  max \ Dom  |/(x ,...,a;jv-i)| < £ r 0  (54A)  c  Thus, because equation (34A) and equation (46A) are true, it must be the case that equation (28A) of T H E O R E M II is satisfied. This completes the proof of T H E O R E M II • .  129  Appendix C. Rate of Convergence The approximations developed in the previous two appendices are here be applied to nonparametric regression. Let (X, Y) be a pair of random variables such that X is of dimension TV, where iV > 3, and Y is of dimension one. Let 6(X) = E(Y\X)  be the regression function of Y on the  measurement variable X. Let 9 , n > 1, denote an estimation of 9 given the n random samples n  (X  u  Yi),...,  (X ,Y ) n  n  of the distribution (X, Y).  Let the estimators 9 be based on the nonparametric model given in equation (23A) of n  Appendix B. Let the two dimensional functions g <r ...p <r and h ,,, Pl  1  l  l  Pl(Tl  Pl(Tl  of 9 be constructed as n  defined in CONSTRUCTION V of Appendix D with 0(x) = / ( x ) , x = (x , xi,...,  x _i)  0  N  G V.  Assume that the function 9 = f, and the domain V, are as defined in T H E O R E M II, with the additional constraint that 9 is p-times differentiable (p > 1) on V. As defined in [Stone, 1982], assume that the distribution of Y depends on the value of the measurement variable x and takes the form h{y\x.,t)4>(dy), where <f> is a measure on 9t and t = 0(x) is the mean of the distribution as defined by: (55A) T H E O R E M III: Assume that conditions 1-3 defined in [Stone, 1982] are true. Assume that 9 and 9  n  are as defined above.  Then, there exist a finite set of dense domains S  k  C V, where  K  k  G { 1 , . . . 
, K} (K finite), and | J S = T>, such that, given n random samples of 9 from any k  k=l 130  domain r =  S}., the rate of convergence  of 6 to 9 on S is 9 — 9 n  <0^(n)  n  k  where  log ( n ) ^  1  p/(2p + 3).  Proof: Because the dense domains D o m  pi(Tl  are finite in number and span V (see  Pi(Ti  equation (46A)), in order to prove this theorem, it is sufficient to establish the rate of convergence of the error functions  f  i( o-> • • • > N-i)> where n is the number of random samples from x  pitJi  x  pl(T  the corresponding domain 5& = D o m  ...  pi(Tl  As defined in equations (91A) and (103A) of  Plffr  CONSTRUCTION V in Appendix D, the error function after the construction of / levels is given by: fp * ...p * ( o, • • -, N-i) =9(x ,... x  1  1  l  x  l  , X J V - I )  0  9p a {hp ff (xo,x ),x )  -  l  1  1  1  1  +  2  9p\<J\P2ff2\' p\<JiP2'J2 (hp , 2), i) I X  l  +  X  iai  ^^9pi(Ti...ai(hp <T ...tTi(hp <Ti...(T ^ , r 1  1  {i  1)  w ) ) '  mod  x  1  i  x  ((;+i)  mod  w ) )  >  (56A) We prove T H E O R E M III by showing that  • . . , N-i)\\  \\f^ ... (x , 1<ri  pi<ri  < O^n)'  x  log  1  OQ  0  given n random samples from the corresponding domain Sk = D o m  Pl(Tl  ...  PiCTr  As defined in CONSTRUCTION V of Appendix D, the functions h ,,. Piai  in (56A) are constructed from the 3 dimensional functions  V  Pl  and  (n))"  and  Pl(Ti  g  Pl  1(  l  o ...p j 1  l (  l  (see  V ... _ x _ Piai  P l  lPl  equations (85A) and (97A)) which are in turn constructed from the function 0(x) = / ( x ) being approximated. However, 0(x) is unknown, and therefore, be approximated from Dom  (0lffl  ...  Pj(Ti  ^ p <T ...pi- ai- pi >  1  1  1  1  n  random samples from the domain  T  Pl  V  Pl  and  V  I) D o m  Pl(Tl  must  ... _ _  PlUl  Pl  xUl  and  lPl  T  ...  Pl(Tl  Pl  D  respectively (see equations (87A) and (98A)). We denote this by defining V and pi  t  0D  e  approximations of V  Pl  and  V  131  ... _ < _ip ,  Pl<Tl  Pl  1  Tl  l  respectively, based on  n from  there respective domains. The following lemma establishes the optimal rate of convergence for the estimation of \\VPl  -^"J^  and \\VPiai...Pl_1(ri_lPl - V^iai_pl_1<Tl_lPl W^, as a function of  n random samples. L E M M A III: Let the functions using the local average each estimator number  Pl  estimator  V \ and V  of points  V  andV  *  used in constructing  the  piCTl  Pl  1(Tl  of [Stone, 1982].  ... _ _  p  ^. _ _ ,  pitJl  Pl  l(Tl  s  lPl  C(n  _ 1  for all I  lPl  Then the optimal  £ { 2 , . . . , l }, be estimated m  L°° rate of convergence  of  l o g n ) , where r = p/(2p + 3), and n is the r  estimator.  Proof of L E M M A III: This lemma is proven by showing that, given the assumption in Theorem III, all functions V  Pl  and V ,,, _ _ pi(7l  First consider the estimation of V . Pl  V  Pl  pi  1(7l  ipi  meet the conditions defined in [Stone, 1982].  By equation (85A) and equation (55A), the expected value  takes the following form: / R, •PI  V (x ,xi,x ) Pl  0  6(x ,xi,..., 0  2  pi  R, •PI  Therefore, the distribution of V  Pl  (57A)  is of the form /  -h(y\-x.,t)dR  Pl  <f>(dy).  132  (58A)  Because (58A) is an appropriately scaled version of h(y\x.,t)</>(dy), it follows that given that h(y\x.,t)<f>(dy)  meets conditions 1-3 defined in [Stone, 1982], then the distribution in (58A)  must also meet these conditions. Also, note that because 6 is p-times differentiable, then given the above calculation, it must be the case that V  Pl  given that V  Pl  is also p-times differentiable.  
Therefore,  is a p-times differentiable 3 dimensional regression function, then by Theorem  1 of [Stone, 1982], the optimal L°° rate of convergence is \\V — ^ " J l  < O(n logn) , _ 1  Pl  r  where r = p/(2p + 3). Next consider the construction of the 2 dimensional functions h  and g  Pl<Tl  CONSTRUCTION V , and functions h ... PltTl  and g  PltTl  PltTl  ...  Pltri  ,  Pl(Jl  in S T E P 3-iii of  for all / £ { 2 , . . . , l }, in S T E P m  7.iii.3 of CONSTRUCTION V . As defined in CONSTRUCTION III and CONSTRUCTION IV, these functions are constructed using linear interpolation based on V . Because of equations Pl  (91 A), (97A), and (103A), in order to complete the proof of L E M M A III, we must ensure that the 2 dimensional functions h  , g  P l ( T l  P l t T l  , h ... j Pl/Jl  Pll  l  and g cx... a Px  Pl  {  are at least p-times  differentiable (that being the number of times that the regression function 9 is differentiable). Without loss of generality, this condition can be satisfied by representing h and g j ... (j , PlC 1  Pl  l  P l C T l  , g  ,  PlCTl  h .„ Pl(Tl  PllTl  for all / £ { 2 , . . . , l }, using, for example, 2 dimensional spline approximations, m  of differentiability p or higher (i.e. instead of the linear interpolation scheme defined in CONSTRUCTION III and CONSTRUCTION IV). Given that the functions  f  1  l p  l C J l  ...  P l  _ cT _ 1  l  1  of equation (97A) are at least p-times differentiable,  the proof of the lemma for the functions V ... _ _ , Piai  pl  iai  lPl  for all / £ { 2 , . . . , l } , follows the m  same steps outlined above for V , and is therefore not be presented here. This completes the Pl  133  proof of L E M M A III. Given L E M M A III, we can establish the rate of convergence of the error functions Consider the rate of convergence of the first function  fp <T ...p ai( o-,----> N-i)x  1  1  x  l  f  Us-  .  Pl<Jl  ing an analysis similar to that found in [Barron, 1994] and referring to CONSTRUCTION V of Appendix D, the  norm of /™  is given by (unless otherwise stated, the Loo norm is over  1(Tl  the domain of the respective function \\r;^  Letting  —^"Jl  e = \\V p  Pl  <  f ,,, ( o,  0(ji~  l  a  x  P[ai  ^ II / - ^  L  for H / = 1, 2,...):  • • • , N-i),  x  Pl(Tl  " o o + \\v  - v  L  n  P1  pi  (  59A)  and using equations (88A) and (39A), equation  \ogn)  r  (59A) can be written as:  < m\oo  n PiMoo  + *v  (60A)  In order to guarantee convergence we choose an a 6 $ such that 0 < (3 < a < 1 and equation (60A)  satisfies:  1 1 / ^ 1 1 0 0 ^ 1 1 ^ 0 0  Similarly, the Loo norm of  ||/pi<7lP2<T2  Once again, letting  f  Pl(TlP2(T2  ^  <«ll*lloo  —' P  Pl(7lP2  (  6  1  A  )  is given by:  lloo — 11/piO-l ~~ ^ V ^ ^ H o O + H^PlO-lPS  e = \\V p  +  p  i  a  l  P  2  ^  O(n logn) _ 1  ~~  r  ^  P i  <7i  p \ | QQ 2  (62A)  and using equations (100A)  and (50A), equation (62A) can be written as:  11/pUp, J o o  ^  J  L + **< "WK* lloo ^ « | ^ l l o o 2  134  (63A)  Thus, by induction, the  \\f  — \\J pi<Ti...pi- cn-i  || J p a ...pi<ri\\ 1  1<Tl  00  ...  P((Ti  is given by:  —v  II <llf™  n  1  norm of /p  <P\\fp\, ... - * - \L Pl  1  IIQQ ~  ' picri—pi  1  1  I I 4-Wv  l  1  I  1  II  n  p\<J\...pi  r  p a ...pi\\ 1  i  +h  1  I  0 0  (64A)  <«ll/; . ...p _ . 
_ IL<«'Pii 1  \\'  -V  1  0 0  Therefore, construction stops at level / when, at level / + 1, equation (64A) is no longer true; this is defined by the following condition: ||ipi<7i...pi<Ti  H o o —  l^\\fp cr ...pi-iai1  1  1  IIQO +  P  e  —  l l ^ l l o o  a  AND  IQQ ^  \\fpiCT ...pi a 1  +1  (65A)  l+1  a  Halloo  The stopping condition given in equation (65A) is met if and only if the number of points used in constructing Vp ...p p ir7l  IK.,..P,  iai  is insufficient for error reduction; in other words, the following is true:  l+1  - ^ ; U . . . P J L < o(n-Ho n)  i  r  + 1  g  Using the inequality \\fp, ... _ _ \\ fTl  pl  lCTl  1  < a  1  oo  \\9\\ ,  l  OQ  -0\\&w,\L  H I / P % , . . P J L  then for all fp „.p 1(7l  l(7l  ( ) 66A  satisfying equation  (64A), the following is true:  /^'-^IL +^ a ^ l L Solving equation (67A) for a ^ P H o o , we get:  1 3 5  (67A)  Using the inequality ||/™ ... _ 1<Tl  P(  _ 1|^ < of \\0\\ , and inserting equation (68A) into equation 1  :1(T(  1  OQ  (64A), we get the final result: H/piaL.-PifflHoo  ^  P  _  V a  < (9(n This completes the proof of T H E O R E M III • .  136  p  _1  +  P  6  log n)  Appendix D. The Main Theoretical Constructions CONSTRUCTION I: The Construction of V ( x , y ) , W (V ,z), and S 6  6  STEP 1. Choose 6. > 0 (8 £ 9fc) such that V(x ,yi, zi),(x ,y , x  IK^i,Vi,zi)  - (x ,y' ,z )\\ 2  2  2  2  6  6  z ) £ U, if 2  < 8 • y/2, then  2  \U(x ,y ,z )-U(x ,y ,z )\ 1  1  1  2  2  < | .  2  (70A)  STEP 2. Create a uniformly spaced grid of points in 3? at all points (/ • 8,m • S,n • 8), such 3  that / , m , n £ 1. Define the set U to be  U =•{(*, m,ri)|(Z,m,ra £ 1) A ((/ • 8, m • 8,n • 8) £ U)}.  [Thus  the set U contains  uniformly  all points  spaced grid of points  the square brackets  indicates  which are within  in 3? defined above. 3  (71A)  the set U and which (Note that the italic  lie on the text within  comments.)]  STEP 3. Define the set T i to be  T i = { ( / i , m i ) | ( / i , m i £ 1) A (3n £  [The set T i represents  the set of all one dimensional  m n) u  £ U])}.  functions  (72A)  U{1\ • 8, mi • 8,n • 8),  where (l\,m{) £ ~X\.] STEP 4. Let 6 = 1. [The  index b indicates  discrete  set S& are  that discrete functions  constructed.] 137  Vft(x,y) and W J ( V J , Z ) , along  with the  S T E P 5. Choose an arbitrary point (!o,mo) G T j . Construct the set Sj such that: a.  S  b.  (k,m )  c.  V(/i,??2i)  6  C T . fc  G Si;  0  E Sj, the 1 dimensional functions U(l\ • 8,mi • 8,n • 8) form an S-  ordering as defined above; d.  G Tfc[(/ ,«i2) 0  V(/2,m2)  the 1 dimensional functions U(h • 8, rn • 8, n • 8) 2  2  cannot be inserted within the S-ordering of the U{li • 8, mi • 8,n • 8), where  [Thus  the  set  S  represents  b  1 dimensional  functions  m i ) G S;,.  an  S-ordering  of  the functions  U(li  • 8, mi  • 8,n • 8)  (V(?i,mi)"e S ).] b  S T E P 6. Let  C S  be the set of all points (J ,?™ ) G S 1  6  1  fe  which are first in the S-  ordering of the 1 dimensional functions U(l • 8,m • 8,n • 8) (V(/, m ) E S ). Similarly, b  Vp G { 2 , 3 , . . . , p } , let Vf C S be the set of all points (l ,m ) p  m  fc  p  G S which are fc  in position p of the S-ordering of the 1 dimensional functions U(l • 8,m • 8,n • 8) (V(/, m ) G S ). E X E C U T E the following steps: 6  i.  G  V(/ ,m ) 1  1  V ^ / -8,m 1  ii.  l  set the value of the 2 dimensional function Vj to be  -8)  = 0.  Let p = 2.  iii. WHILE p < p  m  E X E C U T E the following steps: 138  1.  Let (p-^mP- ) 1  G  V p . 
V(l m ) 1  p  p  V  set the value of the 2 dimensional  P  G  }  b  function V to be: b  V (l  -8,m -8)  p  •W  = V^P-  1  p  b  (73A)  • 8) + a,  1  where a = \\U(l -6,m p  8)-U(l ~  -8,n-  p  2.  Let p = p + 1.  Vp  G  p  { l , 2 , . . . , p } , V(l ,m ) p  p  G  m  • 8,m -  1  p  • 6,n-  1  (74A)  8)\\  Vf and Vn such that  G  U , set the  value of the 2 dimensional function Wj to be: W (V (/ F E  [Therefore, I  [(/, ra, n)  U(l  at G  6  P  • 6,  all  m  • 8), n-8)=  p  points  U{l  • 8, m  p  (/ • 8,m • 8,n • 8),  U], the function  WJ(VJ(Z •  (75A)  -8,n-8).  p  where  (/, m)  G  S  b  and  6, m • <5), n • <5)« egua/ to the  n  G  function  • 6,m • 8,n • 8).[  Define.the set Q  b  as:  Q& = j(<*,/3)|Vp G { 1 , 2 , . . . ,p } m  j v ( / , m ) G V£ p  p  [Vn such that (l , m ', n) G U p  (76A)  p  [(a = V ( P - 6 , m • 8)) A (ft = n • 8)]\ p  6  [Therefore, function  Q  b  contains  the set of all possible  W (V ,n • 8).[ b  b  1 3 9  independent  variable  values of the  S T E P 7. Let 6  m a x  = 6.  [b a.x represents  the current  m  S  b  S T E P 8. Let  maximum  number  ofV (x,  y) and  b  z) functions,  WJ(VJ,  and  sets.] 6 =  [Thus  6+1.  the next set ofV (x,y)  constructed S T E P 9. Let T  =  b  -  T _i b  where (h,m )  Vft(x, y) and ^  functions,  b  and. the next set S  b  will  be  S _i. 4  the set of all one dimensional  £ T i , which  b  fc  b  next.]  /T/ie se? Tj represents  S T E P 10. IF T  and l/l (y ,z)  b  WJ'(VJ, 2),  functions  U(l  • S,m  b  b  have not been used in the construction  and sets S . b  W/ien  Tj =  0 //ze construction  • 6,n • 6),  of the  is  functions  complete.]  0, GOTO S T E P 5  E L S E E N D CONSTRUCTION I. /"STEP 5 through U(l\  • 6,mi  functions  S T E P 9 are repeated until all one dimensional  • <5, n • S), where (h,mi) V {x,  y) and  b  WJ(VJ, 2),  functions  £ T i , have been used in the construction  of the  and sete S .] b  C O N S T R U C T I O N II: The Calculation of the Domains S  b  The function determine  which  S T E P 1.  Let  of CONSTRUCTION domain  II is, given a point (x, y) £ 5t  S , b £ {1,2,..., b b  m a x  such that 3z[(x, y, z) £  2  ] , the point (x, y) belongs  to.  / m i n =iG{i,...,o . , [,M IK^^) - ( -.^ -^)ll /)• } y^rajes, ?  m i n  /  n  r  c  max  140  m  U],  (  77A  )  [Therefore, ', Pmin is the distance (l .8,m b  6  • 8) where (h,m )  b  b  S  b  from  the point  {1, 2 , . . . ,  and be  £ 9ft  2  (x,y) b  m a x  to the closest  point  }.]  S T E P 2. Let ^min be the minimum value of i £ { 1 , 2 , . . . , &max} such that  (/,m)gSv  (78A)  — Pmin-  [Thus,  i[  is the first i which satisfies  m n  each point  m  n  The point  each point  is therefore in domain  (x,y)  (x,y)  belongs  Therefore,  p = p [. m  (x, y), there is one and only one assigned value for  S T E P 3. Let b = i i . [Therefore,  the above condition  n  for  i in-7 m  S .. b  to one and only one domain  S .] b  STEP 4. E N D CONSTRUCTION II.  C O N S T R U C T I O N III: The Calculation of the functions  The function  of CONSTRUCTION  III  is, given  a point  (x,y)  G S  b  V {x,y) b  calculate  the  function  V (x,y). b  S T E P 1. Given the triangular tessellations grid in Figure 3 find the triangle which contains the point (x, y) (if the point lies at the intersection of 2 or more triangles, arbitrarily choose any one of these triangles). Let (IQ • 8,mo • 8), (li • 8, m\ • 8), and (l • 8,m • 8) be 2  2  the vertices, as defined in Figure 4, of the triangle containing the point (x,r/). 
Let Vj(/o • 8, mo • 8), Vfc(/i • 8, mi • 8), and V (l • 8, m • 8) be the values of the function b  at these vertices. 141  2  2  S T E P 2. Then V (x,y).is  calculated as follows:  b  V (x,  y)  b  =(V (/i • 6, m • «S) - Vj(/ • 8,m • 8)) 6  0  x  Q  • *( -J° S  X  ' > 6  +  (79A) (V (/ • 8, m • 8) - V ( / • 8, m • 6)) • * * 6  2  2  6  where for an upper triangle s  triangle  x  = s  y  0  0  6  0  0  (as defined in Figure 4) s = s = — 1 and for a lower x  y  (h,^)  then the value of Vj at that point must be chosen such that:  a.  V z , j G {0,1,2} \\V (li-6,mi-6)-Vb(lj  b.  the resulting surface V (x,y)  • f>,mj • 6)\\ < 8 , and  b  b  vertices  ~ ™° ' ^ + V ( / • 8, m • 8),  = +1. Note that if any of the points (/o,mo), ( I i , m i ) , or  are not elements of  [Therefore  ( y  the function  of the triangle  Vb(x,y)  ((x,y)  E Sb) is continuous.  is calculated  which contains  the point  as a linear  interpolation  between the  (x,y).]  STEP 3. E N D CONSTRUCTION III. C O N S T R U C T I O N IV: The Calculation of the functions  The function  of CONSTRUCTION  IV is, given V (x,y) b  W (V (x,y),z) b  and z, where (x,y,z)  b  E U,  calculate  W (V (x,y),z). b  b  S T E P 1. Given the triangular tessellations grid in Figure 5 find the triangle which contains the point (V (x, y), z) (if the point lies at the intersection of 2 or more triangles, arbitrarily b  choose any one of these triangles). Let (Vj ,nn • 8), [V ,ni 0  bl  • 8), and (N ,n b2  2  • 8) be  the vertices, as defined in Figure 6, of the triangle containing the point (xo,yo).  Let  Wj(V6 ,no • 8), Wj(V6j,ni • 8), and Wfc(Vj ,n2 • 8) be the values of the function Wj at 0  2  these vertices. 142  S T E P 2. Then W (y (x,y), z) is calculated as follows: b  b  W (V (x,y),z) b  sv(Vb(x,y)  =(W (V ,m -5)-W (V ,n6-5))  b  J  6  6 l  (W (V ,n 6  fc2  2  • 5)  6 o  W (V ,77  -  6 o  6  0  - V ) fco  <5y  -6))-  S z < y Z  W ( V , n • 5),  +  6  fco  0  (80A) where for an upper triangle triangle  sy = s  z  (as defined in Figure 6) sy = s = — 1 and for a Zower z  = +1, and  min  ||V&  CK||,  for sy = —1  —Vj ||,  tor sy = +1  0  mm  —  0  (a,^)eQ6,Q'>V  v  y  6 o  Note that if any of the points (V , no • <*>), (V^,rai• 6), or (Vfe , n • 6) are not elements bo  2  2  of Q j , then the value of Wj at that point must be chosen such that: G {0,l,2}[||w (V ,n - -6) - W ( v , n j  a. b.  6  6i  f  6  < f ] , and  6 y  the resulting surface W (V (x,y), z), V(x, y, z)[((x, y) G Sj) A (x,y,z) G (7], is b  b  continuous. Also, if the ^ does not exist as calculated in (81 A), then choose 6y such that:  $v = |W (V ,rai -6)- W (V , n -6) |. 6  [Therefore the vertices  the function  6l  6  fco  W (V (x, y), z) is calculated  of the triangle  b  b  which contains  S T E P 3. E N D CONSTRUCTION IV. 1 4 3  the point  0  as a linear interpolation (V (x,y), z).] b  (82A)  between  Triangular tessellations grid used to construct the function Vj(x,y).  Note that there are two  types of triangles formed using this tessellations gird: a lower triangle and an upper triangle (see Figure 4).  Figure 3 Triangular Tessellation Grid for  Vb(x,y)  C O N S T R U C T I O N V : Approximating Functions of N Dimensions Using Functions of 2 Dimensions (JV > 3)  S T E P 1. Let Ro = 0. [The first domain  S T E P 2. Let  P  l  RQ is initialized  to be an empty set.]  = 0.  [The integer  p\ is used as a pointer  to the domains  R .] PL  S T E P 3. W H I L E it is true that pi  3(aJ , • • 3  -,XN-I)  ; , . . . ,XN-I) 3  £ [j  Rt  I A (3(x ,x ,x )[(x ,... 
0  1  2  0  ,£JV-I) €  V])  i=0  (83A) 144  2  0  0  2  upper triangle  lower triangle  The point  £ Sfc can be located in either a lower triangle or an upper triangle of the  tessellations grid defined in Figure 3. For both types of triangles, vertex "0" is located at point (fa • 8, rnQ • 8) £ IR and the value of the function Vj at this point is V&(7rj • 6, mo • 6); vertex 2  "1" is located at point (fa • 8,m\- 8) £ !Jc and the value of the function  at this point is  2  Vj(/i • 8, mi -8); and finally, vertex "2" is located at point (fa • 8, m • 8) £ 5t and the value of 2  2  the function V& at this point is Vb(h~ 8,m  2  • 8).  Figure 4 L o w e r A n d Upper Triangles of  S  b  E X E C U T E the following steps: [Therefore,  being approximated  will be spanned,  of all the domains  i.  Let  pi  [This  ii.  =  V  ?  R  Pl  the entire domain  along  XJV-I),  of the  function  by the  union  pi}.]  1.  the construction  C  of definition  the axis (x$, x±,...,  Ri, for all i £ { 0 , 1 , . . . ,  pi +  initiates  Choose a.  o / S T E P 3,  at the completion  $t, ~ N  3  £  [Thus the domains  of the next domain  R .] Pl  such that {l,...,(pi-l)}[R nR•pi l  R  Pl  l  ~  are all disjoint,  II.] 145  satisfying  definition  La of  THEOREM  Triangular tessellations grid used to construct the function  VVj,(Vb(x,y),z).  Note that there are  two types of triangles formed using this tessellations gird: a lower triangle and an upper triangle (see Figure 6). Note also that the vertical lines are not regularly spaced: the vertical line spacing is defined by the value of a in S T E P 6.iii of CONSTRUCTION I.  Figure 5 Triangular Tessellation Grid for Wb(Vb(x, b.  y),  3(x ,xi,X2)[((xo,xi,...  , X J V - I )  E T>) A ( ( x , x , . . . ,XN-I)  [Thus  contain  points  0  the domains  definition  R  Pl  of the function  being  3  correspond  e  R )], Pl  >  and  to the domain  of  approximated.]  c.  The domain R  d.  The boundaries of the domains R  pi  which  4  z)  is dense. Pl  approximation, along the axes  (x , 0  are chosen such that a 3 dimensional  XI,  x ), 2  of the function  /(x ,..., 0  ZJV-I)  within the domain R , is at least better than no approximation at all. Or, Pl  put formally, this condition is satisfied as follows. Define the 3 dimensional  0  1  lower triangle The point (Vb(x,y),z)  upper triangle  can be located in either a lower triangle or an upper triangle of the  tessellations grid defined in Figure 5. For both types of triangles, vertex "0" is located at point (Vfc , no • 8) and the value of the function  at this point is W& (V& ,720 • 8); vertex "1" is located at  0  0  point (Vjj ,n\ -8) and the value of the function  at this point is W ^ V ^ , n\ • 8); and finally, vertex  "2" is located at point (Vj , n • 8) and the value of the function 2  at this point is W&(V&, n • 8).  2  2  (Vb(x,y),z)  Figure 6 L o w e r A n d U p p e r Triangles of function V (xo,xi,x ), Pl  U  = {(x ,xi,x )\(( o,xi,... 0  defined on the domain  2  x  Pl  ,XN-I)  2  2  N  3  (84A)  E R )},  G £>) A ((x ,... ,x _ )  Pl  x  to be J f{xo,xi,.. V (XQ,X ,X ) Pi  1  =  2  .,XN-i)dR pi i  R  0  I  (85A)  dR  pi  Ro  Let P G 3? such that 0 < P < 1, and let tp G 3ft such that  0 < £p < P • ( where the domain T  Pl  Pi  T  =  {{XQ,XI,.  . . ,XN-i)\{[x ,Xi,.. Q  max  \f(xo,  • • • ,XN-I)\  (86A)  ),  is defined as  . ,XN-\)  G V) A ((x ,£4, • • • 3  €  R )}Pl  (87A) 147  Then, the domains R  must be chosen such that  Pl  max \f(x ,...  
STEP 4. $\forall i \in \{(\rho_1 + 1), (\rho_1 + 2), \ldots\}$, let $R_i = \emptyset$.
[This satisfies definition 1.b of THEOREM II.]

STEP 5. Let $l = 2$.
[This initiates the construction of level $l = 2$ of the approximation tree.]

STEP 6. Let $i_1, i_2, \ldots, i_{N-2} \in \{0, 1, \ldots, N-1\}$ and $j_1, j_2 \in \{0, 1, \ldots, N-1\}$ such that,

a. $j_1 = (l \bmod (N-1))$, and $j_2 = ((l+1) \bmod (N-1))$;

b. $\forall m, n \in \{1, 2, \ldots, N-2\}\ \big[(m \neq n \Rightarrow i_m \neq i_n) \wedge (m > n \Rightarrow i_m > i_n)\big]$; and,

c. $\forall m \in \{1, 2, \ldots, N-2\}$ and $\forall n \in \{1, 2\}\ \big[i_m \neq j_n\big]$.
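STEP 6 above simply rotates the pair of coordinate axes that the next layer of 2-dimensional building blocks acts on, and lists the remaining axes in increasing order. A minimal sketch of this index bookkeeping follows; the return format is an assumption made for illustration.

```python
def select_axes(l, N):
    """STEP 6: choose the two active axes (j1, j2) for tree level l and the
    remaining N-2 axes that index the domains R at this level."""
    j1 = l % (N - 1)                            # condition (a)
    j2 = (l + 1) % (N - 1)                      # condition (a)
    i_axes = sorted(set(range(N)) - {j1, j2})   # conditions (b) and (c)
    return j1, j2, i_axes
```

For example, `select_axes(2, 5)` gives `(2, 3, [0, 1, 4])`: the level-2 blocks act along $x_2$ and $x_3$, while the domains $R$ at that level are subsets of the $(x_0, x_1, x_4)$ axes.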
STEP 7. For all $\rho_1, \sigma_1, \ldots, \rho_{l-1}, \sigma_{l-1} \in \{1, 2, 3, \ldots\}$ and $\forall (x_0, \ldots, x_{N-1}) \in$ tree branch $\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}$ (note that for the case $l = 2$, $\rho_1, \sigma_1, \ldots, \rho_{l-1}, \sigma_{l-1}$ is read as $\rho_1, \sigma_1$, and tree branch $\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}$ is read as tree branch $\rho_1 \sigma_1$), such that

a. $$\max_{(x_0, \ldots, x_{N-1}) \in \mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}} |f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1})| > \epsilon_{max}, \tag{93A}$$

where the domain $\mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}$ is defined in STEP 3.iv and STEP 7.iii.4 of this CONSTRUCTION, and,

b. $R_{\rho_1 \sigma_1 \ldots \rho_{l-1}} \neq \emptyset$ and $D_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}} \neq \emptyset$,

EXECUTE the following steps:
[The goal of the following steps is to continue the construction of the tree until the approximation error in each branch $\rho_1, \sigma_1, \ldots, \rho_{l-1}, \sigma_{l-1}$ is less than the maximum allowable error. The steps executed are similar to the first 4 steps of this CONSTRUCTION, with the exception that the function being approximated is denoted by $f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1})$.]

i. Let $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} 0} = \emptyset$.
[The first domain $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} 0}$ is initialized to be an empty set.]

ii. Let $\rho_l = 0$.
[The integer $\rho_l$ is used as a pointer to the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$.]

iii. WHILE it is true that,

$$\exists (x_{i_1}, \ldots, x_{i_{N-2}})\left[\left((x_{i_1}, \ldots, x_{i_{N-2}}) \notin \bigcup_{i=0}^{\rho_l} R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} i}\right) \wedge \Big(\exists (x_{j_1}, x_{j_2})\big[((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1})\big]\Big)\right] \tag{94A}$$

EXECUTE the following steps:
[Therefore, at the completion of STEP 7.iii, the entire domain of definition of the function $f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1})$ will be spanned, along the axes $(x_{i_1}, x_{i_2}, \ldots, x_{i_{N-2}})$, by the union of all the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} i}$, for all $i \in \{0, 1, \ldots, \rho_l\}$. Note that the goal of this step is similar to the goal of STEP 3.]

1. Let $\rho_l = \rho_l + 1$.
[This initiates the construction of the next domain $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$.]

2. Choose $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l} \subset \Re^{N-2}$ such that

a. $\forall i \in \{1, \ldots, (\rho_l - 1)\}\ \big[R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} i} \cap R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l} = \emptyset\big]$, satisfying definition 4.a of THEOREM II.
[Thus the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ are all disjoint.]

b. $\exists (x_{j_1}, x_{j_2})\ \big[((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}) \wedge ((x_{i_1}, \ldots, x_{i_{N-2}}) \in R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l})\big]$, and
[Thus the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ contain points which correspond to the domain of definition of the function being approximated.]

c. The domain $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ is dense.

d. The boundaries of the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ are chosen such that a 3 dimensional approximation, along the axes $(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2})$, of the function $f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1})$ within a domain $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$, is at least better than no approximation at all. Or, put formally, this condition is satisfied as follows. Define the 3 dimensional function

$$\Psi_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2}), \tag{95A}$$

defined on the domain

$$U_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l} = \big\{(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2}) \mid ((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}) \wedge ((x_{i_1}, \ldots, x_{i_{N-2}}) \in R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l})\big\}, \tag{96A}$$

to be

$$\Psi_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2}) = \frac{\displaystyle\int_{\Pi} f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}\, d\Pi}{\displaystyle\int_{\Pi} d\Pi}, \tag{97A}$$

where $\Pi = T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l} \cap U_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ and the domain $T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ is defined as

$$T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l} = \big\{(x_0, \ldots, x_{N-1}) \mid ((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}) \wedge ((x_{i_1}, \ldots, x_{i_{N-2}}) \in R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l})\big\}. \tag{98A}$$
Let $\beta \in \Re$ such that $0 < \beta < 1$, and let $\epsilon_\beta \in \Re$ such that

$$0 < \epsilon_\beta < \beta \cdot \left( \max_{(x_{i_1}, \ldots, x_{i_{N-2}}) \in T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}} |f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1})| \right). \tag{99A}$$

Then, the domains $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ must be chosen such that

$$\max_{T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}} \big| f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1}) - \Psi_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2}) \big| + \epsilon_\beta \leq \max_{T_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}} \big| f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1}) \big|. \tag{100A}$$

[In LEMMA II it is shown that such a domain $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}$ must exist. The proof of LEMMA II is based on the construction of a domain which satisfies all of the above conditions.]

3. Let $W(x, y, z) = \Psi_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2})$ and

$$U = \big\{(x, y, z) \mid ((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}) \wedge ((x_{i_1}, \ldots, x_{i_{N-2}}) \in R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}) \wedge (x = h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}) \wedge (y = x_{j_1}) \wedge (z = x_{j_2})\big\}. \tag{101A}$$

Use CONSTRUCTION I, CONSTRUCTION II, CONSTRUCTION III and CONSTRUCTION IV to construct the functions $V_b(x, y)$ and $W_b(V_b, z)$, and the domain $S_b$ ($\forall b \in \{1, 2, \ldots, b_{max}\}$). Note that the $\epsilon$ used in CONSTRUCTION I must be chosen to satisfy equation (99A) of this CONSTRUCTION; i.e. $\epsilon \leq \epsilon_\beta$. For all $b \in \{1, 2, \ldots, b_{max}\}$, let $h_{\rho_1 \sigma_1 \ldots \rho_l b}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}) = V_b(x, y)$, $g_{\rho_1 \sigma_1 \ldots \rho_l b}(h_{\rho_1 \sigma_1 \ldots \rho_l b}, x_{j_2}) = W_b(V_b, z)$, and $D_{\rho_1 \sigma_1 \ldots \rho_l b} = S_b$. Also, $\forall b \in \{(b_{max}+1), (b_{max}+2), \ldots\}$ let $D_{\rho_1 \sigma_1 \ldots \rho_l b} = \emptyset$.
[Thus the domains $D_{\rho_1 \sigma_1 \ldots \rho_l b}$ and the functions $h_{\rho_1 \sigma_1 \ldots \rho_l b}(h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1})$ satisfy definition 5 of THEOREM II, and the functions $g_{\rho_1 \sigma_1 \ldots \rho_l b}(h_{\rho_1 \sigma_1 \ldots \rho_l b}, x_{j_2})$ satisfy definition 6 of THEOREM II (for all $b \in \{1, 2, \ldots, b_{max}\}$).]

4. For all $\sigma_l \in \{1, 2, \ldots, b_{max}\}$ define the $N$ dimensional function $f_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(x_0, \ldots, x_{N-1})$, $\forall (x_0, \ldots, x_{N-1}) \in \mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}$, where

$$\mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l} = \big\{(x_0, \ldots, x_{N-1}) \mid ((x_0, \ldots, x_{N-1}) \in \mathcal{D}) \wedge ((x_{i_1}, \ldots, x_{i_{N-2}}) \in R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}) \wedge ((x_0, \ldots, x_{N-1}) \in \text{tree branch } \rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}) \wedge ((h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}, x_{j_2}) \in U_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} \rho_l}) \wedge ((h_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}, x_{j_1}) \in D_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l})\big\}, \tag{102A}$$

as

$$f_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(x_0, \ldots, x_{N-1}) = f_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1}}(x_0, \ldots, x_{N-1}) - g_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(h_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}, x_{j_2}). \tag{103A}$$

[Thus, the function $f_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(x_0, \ldots, x_{N-1})$ represents the approximation error along branch $\rho_1 \sigma_1 \ldots \rho_l \sigma_l$ for level $l$ in equation (28A) of THEOREM II. Note that if

$$\max_{(x_0, \ldots, x_{N-1}) \in \mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}} |f_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(x_0, \ldots, x_{N-1})| \leq \epsilon_{max},$$

then the approximation along branch $\rho_1 \sigma_1 \ldots \rho_l \sigma_l$ is complete and the branch need not be extended (a finite-sample sketch of this update and test follows STEP 7).]

iv. $\forall i \in \{(\rho_l + 1), (\rho_l + 2), \ldots\}$, let $R_{\rho_1 \sigma_1 \ldots \rho_{l-1} \sigma_{l-1} i} = \emptyset$.
[This satisfies definition 4.b of THEOREM II.]
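On sampled data, the level-$l$ residual update (103A) and the completion test are the only operations needed to decide whether a branch must be extended further. The following is a minimal sketch under the assumption that the level-$l$ interpolants are available as Python callables `h` and `g` (standing for $h_{\rho_1\sigma_1\ldots\rho_l\sigma_l}$ and $g_{\rho_1\sigma_1\ldots\rho_l\sigma_l}$ as built by CONSTRUCTIONS I to IV); the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def extend_branch(prev_residual, prev_h_vals, X, h, g, j1, j2, eps_max):
    """Update a branch residual via equation (103A) and test completion.

    prev_residual[k] : f_{rho1 sigma1 ... rho_{l-1} sigma_{l-1}} at sample X[k]
    prev_h_vals[k]   : h_{rho1 sigma1 ... rho_{l-1} sigma_{l-1}} at sample X[k]
    """
    h_vals = np.array([h(hp, x[j1]) for hp, x in zip(prev_h_vals, X)])
    g_vals = np.array([g(hv, x[j2]) for hv, x in zip(h_vals, X)])
    residual = prev_residual - g_vals                 # equation (103A)
    complete = np.max(np.abs(residual)) <= eps_max    # branch need not be extended
    return residual, h_vals, complete
```

At level 1 the same update applies with `prev_residual` replaced by samples of $f$ itself and `h(x[0], x[1])` in place of `h(hp, x[j1])`, per equation (91A).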
STEP 8. IF $\exists \rho_1, \sigma_1, \ldots, \rho_l, \sigma_l \in \{1, 2, 3, \ldots\}$ such that, $\forall (x_0, \ldots, x_{N-1}) \in$ tree branch $\rho_1 \sigma_1 \ldots \rho_l \sigma_l$,
[The goal of this step is to search out branches $\rho_1, \sigma_1, \ldots, \rho_l, \sigma_l$ which still have an approximation error which is greater than the maximum allowable error.]

a. $$\max_{(x_0, \ldots, x_{N-1}) \in \mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}} |f_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}(x_0, \ldots, x_{N-1})| > \epsilon_{max}, \tag{104A}$$

where the domain $\mathrm{Dom}_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l}$ is defined in STEP 3.iv and STEP 7.iii.4 of this CONSTRUCTION, and,

b. $R_{\rho_1 \sigma_1 \ldots \rho_l} \neq \emptyset$ and $D_{\rho_1 \sigma_1 \ldots \rho_l \sigma_l} \neq \emptyset$,

THEN EXECUTE the following steps:

i. Let $l = l + 1$.
[This initiates the construction of level $l$ of the approximation tree.]

ii. GOTO STEP 6.

ELSE END CONSTRUCTION V.
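Once CONSTRUCTION V terminates, equations (91A) and (103A) telescope: along any branch, $f$ equals the sum of the $g$-terms added at each level plus a final residual of magnitude at most $\epsilon_{max}$. The sketch below shows how such a finished branch would be evaluated at a point; the list-of-tuples representation of a branch is an assumption made for illustration, not the thesis's data structure.

```python
def evaluate_branch(x, levels):
    """Evaluate the approximation of f(x) carried by one branch of the tree.

    levels : list of (h, g, j1, j2) tuples, one per level, where h and g are
             the 2-D interpolants of that level and (j1, j2) are its active axes.
    """
    total, h_prev = 0.0, None
    for level, (h, g, j1, j2) in enumerate(levels, start=1):
        if level == 1:
            h_val = h(x[0], x[1])        # h_{rho1 sigma1}(x0, x1)
            total += g(h_val, x[2])      # g-term of equation (91A)
        else:
            h_val = h(h_prev, x[j1])     # h takes the previous level's output
            total += g(h_val, x[j2])     # g-term of equation (103A)
        h_prev = h_val
    return total
```

Evaluating a deeper level requires only the previous level's $h$ value and two coordinates, which is what keeps every fitted block 2-dimensional regardless of $N$.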
