UBC Theses and Dissertations
Applications of Machine Learning in Sensorimotor Control. Bradley, Susanne. 2015.



Applications of Machine Learning in Sensorimotor Control

by

Susanne Bradley

B.Sc., Queen's University, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

August 2015

© Susanne Bradley, 2015

Abstract

There have been many recent advances in the simulation of biologically realistic systems, but controlling these systems remains a challenge. In this thesis, we focus on methods for learning to control these systems without prior knowledge of the dynamics of the system or its environment.

We present two algorithms. The first, designed for quasistatic systems, combines Gaussian process regression and stochastic gradient descent. By testing on a model of the human mid-face, we show that this combined method gives better control accuracy than either regression or gradient descent alone, and improves the efficiency of the optimization routine.

The second addresses the trajectory-tracking problem for dynamical systems. Our method automatically learns the relationship between muscle activations and resulting movements. We also incorporate passive dynamics compensation and propose a novel gain-scheduling algorithm. Experiments performed on a model of the human index finger demonstrate that each component we add to the control formulation improves performance of fingertip precision tasks.

Preface

The methods described in Chapter 5, with the exception of the gain-scheduling algorithm of Section 5.4, formed part of the following publication:

Prashant Sachdeva, Shinjiro Sueda, Susanne Bradley, Mikhail Fain, and Dinesh K. Pai. Biomechanical simulation and control of hands and tendinous systems. ACM Transactions on Graphics, 34(4):42:1-42:10, July 2015.

For the control algorithm in this paper, we began with the methods described in Mikhail Fain's M.Sc.
thesis, to which I added configuration-dependent AVMs (see Section 5.2), activation smoothing (Section 5.3), and passive dynamics compensation (Section 5.5). The index finger simulator described in the paper and used in this thesis for all control algorithms in Chapters 4 and 5 was developed by Prashant Sachdeva, Shinjiro Sueda and Duo Li, under the supervision of Dinesh Pai. The software interface between the controller and the simulator I developed jointly with Prashant Sachdeva.

The model of the skin of the human mid-face described in Chapter 3 was developed by Debanga Raj Neog under the supervision of Dinesh Pai. It is part of his upcoming doctoral thesis and is used here with his permission.

All algorithms, experiments and analysis in this thesis were developed by me, with guidance from my supervisor, Dinesh Pai.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction

2 Background and Related Work
   2.1 Properties of Biological Motor Systems
      2.1.1 Components of Biological Learning
      2.1.2 Representations of Biomechanical Systems
   2.2 Control of Dynamical Systems
      2.2.1 Reaching Control
      2.2.2 Feedback Control
      2.2.3 Hierarchical Control Methods
      2.2.4 Control of Stochastic Systems
      2.2.5 Control Methods for Sensorimotor Tasks
   2.3 Machine Learning Approaches in Motor Control
      2.3.1 Supervised Learning
      2.3.2 Experimental Design
      2.3.3 Reinforcement Learning

3 Control of Elastodynamic Skin
   3.1 System Description
   3.2 A Supervised Learning Approach
      3.2.1 Regression Methods
      3.2.2 Forward Prediction
      3.2.3 Inverse Prediction
   3.3 Combining Regression and Local Search
      3.3.1 Method
      3.3.2 Results

4 Reaching Control
   4.1 Plant Description: Human Index Finger
   4.2 Task Description
   4.3 k-Nearest-Neighbours
      4.3.1 Method
      4.3.2 Results
   4.4 Neural Networks
      4.4.1 Method
      4.4.2 Results
   4.5 Random Forests
      4.5.1 Method
      4.5.2 Results
   4.6 Methods Comparison/Discussion

5 Trajectory Tracking with a Human Index Finger
   5.1 Existing Implementation: Baseline Hierarchical Control
      5.1.1 Method
      5.1.2 Results
   5.2 AVM Estimation
      5.2.1 Method
      5.2.2 Results
   5.3 Smoothing
      5.3.1 Method
      5.3.2 Results
   5.4 Gain Scheduling
      5.4.1 Method
      5.4.2 Results
   5.5 Passive Dynamics Compensation
      5.5.1 Method
      5.5.2 Results
   5.6 Summary

6 Conclusions and Future Work
   6.1 Conclusions
   6.2 Future Work

Bibliography

List of Tables

Table 2.1: PID tuning with the Ziegler-Nichols method.

Table 3.1: Table of mean pose errors over test set for forward prediction with manual and MLE correlation parameters. Reported errors represent the average and maximum relative errors per mesh point in the testing set.
Table 3.2: Table of mean pose errors over 28 boundary points for regression method, policy gradient (PG) with random starts, and regression combined with policy gradients. Reported errors represent the average and maximum relative errors per mesh point in the testing set.

Table 4.1: Table of summary statistics for reaching task using k-nearest-neighbours with locality-sensitive hashing. Error denotes the average distance (in centimetres) of final configurations from target positions; query time gives the time (in seconds) to compute the predicted activation û for a target q_r; and activation size is the average L2-norm of the predicted activations û.

Table 4.2: Table of summary statistics for reaching task using neural networks. Error denotes the average distance (in centimetres) of final configurations from target positions; build time gives the neural network build time (in seconds); query time gives the time (in seconds) to compute the predicted activation û for a target q_r; and activation size is the average L2-norm of the predicted activations û.

Table 4.3: Table of summary statistics for reaching task using random forests. Error denotes the average distance (in centimetres) of final configurations from target positions; build time gives the time to build the six random forests (in seconds); query time gives the time (in seconds) to compute the predicted activation û for a target q_r; and activation size is the average L2-norm of the predicted activations û.

Table 4.4: Comparison of best predictors from each of the k-nearest-neighbours, neural network, and random forest experiments – parameter values for best predictors from each class are 1 neighbour, 10 hidden layers, and 100 trees, respectively.
Error denotes the average distance (in centimetres) of final configurations from the target positions; build time gives the time (in seconds) to build the data structure; and query time gives the time (in seconds) to compute the predicted activation û for a target q_r.

Table 5.1: Summary of controllers' performance on the 2D circle tracing task. Average error denotes the average positional error between the target and actual plant position at each point on the trajectory, in centimetres. Max error denotes the maximum positional error at any single point on the trajectory, in centimetres.

Table 5.2: Summary of controllers' performance on the 3D circle tracing task. Average error denotes the average positional error between the target and actual plant position at each point on the trajectory, in centimetres. Max error denotes the maximum positional error at any single point on the trajectory, in centimetres.

List of Figures

Figure 2.1: Illustration of a Hill-type muscle model.

Figure 2.2: A basic feedback system, where: r is the reference command; ε is the sensor output, which gives an estimate of the control error; u is the actuating signal; d is the external disturbance acting on the plant; y is the plant output and measured signal; and n is sensor noise.

Figure 2.3: Kriging in 1D for f(x) = exp(−3x) · cos(5πx), given by the solid line. Design points are shown by asterisks and the dashed line gives the predicted values obtained with a kriging estimator using an exponential correlation function [61] with parameters θ = 4, ρ = 2.

Figure 2.4: Illustration of different space-filling design techniques for 16 sample points in 2 dimensions. (a) Simple random sample. (b) Stratified random sample, where each stratum is a square comprising one-sixteenth of the design space.
(c) Latin hypercube sampling: the projection of the points onto either the x or y axes is spaced uniformly.

Figure 3.1: (a) Low-resolution skin mesh with location of muscle sheets labelled. (b) High-resolution skin mesh.

Figure 3.2: Example of passive and active muscle force-length (FL) curves. Left: passive force. Right: active force, given by shifting passive curve leftward according to the muscle activation u.

Figure 3.3: Depiction of the 49 low-resolution mesh points for which r(p_i) > 0.05. The black circles denote the points, with larger circles representing points with proportionally larger ranges of movement. The coloured areas of the mesh represent the locations of the different muscle sheets.

Figure 3.4: An example of forward prediction with manually selected correlation parameters (a) and MLE correlation parameters (b). The true pose is shown in black and the predicted poses are overlaid in red. (a) The average relative pose error is 0.062 and the maximum error is 0.3152; notice the large deviations in the bottom right. (b) The average pose error is 0.0107, and the maximum is 0.0570.

Figure 3.5: Inverse prediction on pose generated by full activation of the corrugator and right lower OOM. (a) Target position. (b) Position obtained by predictor. (c) Overlay of the two, with the target in black and predicted position in red.

Figure 3.6: Example of local search updates on initial policy obtained by regression. (a) Plot of relative activation errors. (b) Plot of the mean of the average and maximum per-point pose errors (the negative of the value function).

Figure 4.1: Biomechanical model of the human index finger.

Figure 4.2: Passive and active muscle force-length (FL) curves. Left: passive force, given by a piecewise linear curve.
Right: active force, given by shifting the piecewise linear curve leftward according to the muscle activation u.

Figure 4.3: 3-dimensional scatter plot of plant configurations (i.e. fingertip positions) in the training data set for the reaching control problem. Axis units are in centimetres.

Figure 4.4: Different configurations of the index finger, with the plant's trajectory from the initial configuration shown in red. (a) Full activation of the DI. (b) Full activation of the EDC. (c) Full activation of the FDS. (d) Full activation of the EDC and half activation of the lumbrical.

Figure 5.1: Screenshots for baseline controller. (a) Tracing a circle in the plane. (b) Tracing a circle in free space. (c) Close-up of the planar tracing task: the red dot shows the target position and the blue line represents the high-level controller's desired change in position, given by Δt(q̇_t + v_b).

Figure 5.2: Depiction of circle-tracing results for baseline controller. (a) Trajectory tracked for 2D experiments (as viewed from above): target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) Muscle activations for 2D trajectory tracking. (c-d) Trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Figure 5.3: 3-dimensional scatter plot of plant configurations (i.e. fingertip positions) at which we compute the AVM. Axis units are in centimetres.

Figure 5.4: Screenshots for the variable AVM controller. (a) Tracing a circle in the plane. (b) Tracing a circle in free space.
Figure 5.5: Depiction of circle-tracing results for variable AVM controller. (a) Trajectory tracked for 2D experiments (as viewed from above): target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) Muscle activations for 2D trajectory tracking. (c-d) Trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Figure 5.6: Screenshots for the variable AVM controller with smoothing. (a) Tracing a circle in the plane. (b) Tracing a circle in free space.

Figure 5.7: Depiction of circle-tracing results for the variable AVM controller with smoothing. (a) Trajectory tracked for 2D experiments (as viewed from above): target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) Muscle activations for 2D trajectory tracking. (c-d) Trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Figure 5.8: 3-dimensional scatter plot of plant positions at which we compute the gains K_P and K_D. Axis units are in centimetres.

Figure 5.9: Screenshots for the gain scheduled controller. (a) Tracing a circle in the plane. (b) Tracing a circle in free space.

Figure 5.10: Depiction of circle-tracing results for the gain scheduled controller. (a) Trajectory tracked for 2D experiments (as viewed from above): target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) Muscle activations for 2D trajectory tracking. (c-d) Trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Figure 5.11: Polar plots of the values of the controller gains K_P (a) and K_D (b) along the 3D circle trajectory. Gain values are mean-shifted and scaled by their ranges so that the variations can be seen clearly on a polar plot.
Figure 5.12: Screenshots for the gain scheduled controller with passive dynamics compensation. (a) Tracing a circle in the plane. (b) Tracing a circle in free space.

Figure 5.13: Depiction of circle-tracing results for the gain scheduled controller with passive dynamics compensation. (a) Trajectory tracked for 2D experiments (as viewed from above): target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) Muscle activations for 2D trajectory tracking. (c-d) Trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Glossary

ACT  Anatomically Correct Testbed
CE  Contractile Element
SE  Series Element
PE  Parallel Element
PID  Proportional-Integral-Derivative
PD  Proportional-Derivative
PI  Proportional-Integral
LPV  Linear Parameter-Varying
QP  Quadratic Program
MDP  Markov Decision Process
MSPE  Mean Squared Prediction Error
BLUP  Best Linear Unbiased Predictor
LSH  Locality-Sensitive Hashing
LHD  Latin Hypercube Design
IMSE  Integral Mean-Squared Error
MMSE  Maximum Mean-Squared Error
OOM  Orbicularis Oculi Muscle
MLE  Maximum Likelihood Estimate
AVM  Activation-Velocity Matrix
LPS  Levator Palpebrae Superioris

Acknowledgments

First and foremost, I would like to express my gratitude to my supervisor, Dr. Dinesh Pai, for guiding me through my initiation into the realm of academic research. Whenever I was wrestling with a particularly difficult problem, Dinesh proved to be an invaluable source of help. He would be able to identify – usually within about 30 seconds – the next approach to try, and it always turned out to be a great idea. I was deeply impressed by his knowledge and insight, and this thesis would not have been possible without his leadership. I want to thank Dr. Mark Schmidt for lending his expertise on different learning algorithms and serving as the second reader of this thesis. Many thanks also go to my co-author, Dr.
Shinjiro Sueda, and my supervisor-to-be, Dr. Chen Greif, for all their help and guidance.

I'm very grateful to all my current and former labmates in the Sensorimotor Systems Lab. Special thanks goes to my SIGGRAPH buddy, Prashant Sachdeva, for all his help with the simulation and control of the index finger model. In particular, Prashant built the software platform I used for developing and testing control algorithms; without his skill and hard work our project would have been impossible. I also want to thank Debanga Raj Neog for developing the mid-face simulator I used for my work in Chapter 3 and allowing me to use it here. Lastly, huge thanks to Mikhail Fain, who gave me lots of great information about control algorithms and whose Master's thesis served as the inspiration for this one.

I want to thank my parents for all they have done to get me to where I am today. For all their love and support (both financial and otherwise) over the years, I am deeply grateful. Finally, very special thanks to my husband, Nick, for his unwavering love and encouragement, and for all the times he helped me with computer issues.

Chapter 1

Introduction

The human musculoskeletal system is a complex, intricate structure – a network of bones, muscles, tendons, and other connective tissues that supports the body and allows us to perform a huge variety of complex movements. Though it has been widely researched, how the brain learns to control these movements is not fully understood. The movement of each joint in our body is constrained by the muscles and tissues surrounding it: as a result, the relationship between muscle activation signals and resulting motor movements is complicated, non-convex and non-linear.
From a computational perspective, even a minor task like holding your arm in a particular position is a challenging problem.

The goal of this thesis is to develop learning algorithms for motor control. We focus on simulated, biologically motivated systems, with particular emphasis on tendon-driven systems. In these systems, tendons are modelled as stiff elastic strands connecting bones and muscles. When the muscles are activated, they contract, which causes the tendon to pull on the bones and rotate the joints.

Most work in robotics focuses on torque-driven systems, in which motors manipulate each joint directly. Because their dynamics can be computed analytically, these systems are easier to control [80]. By comparison, several factors complicate the control of tendon-driven (and real biological) systems. Unlike the torque-driven approach, there is no one-to-one mapping between muscle activations and joint movements: a single muscle may span multiple joints and a single joint can be spanned by multiple muscles [22]. These mappings are complicated further by non-linear interactions between muscles, tendons, bones and joints [17]. Tendon-driven systems also encounter issues with redundancy: there are multiple muscle activation patterns that result in the same movement [39].

These challenges beg the question: why not just work with torque-driven systems, thereby avoiding all the difficulties arising from the tendon-driven paradigm? One reason is that mimicking human biomechanics may enable us to develop robotic systems that approach human levels of dexterity. Another is that, in the process of developing control methods capable of handling the challenges inherent in biological systems, we may be able to gain insight into how the brain controls the body.

A key feature of human motor control is adaptability. The brain needs to be able to provide appropriate motor commands for different contexts we experience [84].
For example, if you’re helping a friend move, you need to interpret contextualinformation and generate the correct motor signals to lift light boxes, lift heavyboxes, and not fling your arm behind your head and topple over backwards when abox you think will be heavy turns out to be light. These kinds of context changescan be brought about by external factors – as with the differently weighted boxes– or by internal ones, such as changes in muscle strength caused by aging or bymuscle fatigue from repeatedly lifting heavy boxes.If we want robots that are similarly adaptable, we need control algorithms thatare:1. Capable of learning complex, non-convex and non-linear mappings betweenmuscle activations and resulting movements2. Flexible enough to learn and perform motor skills in the presence of differentexternal dynamics and internal constitutive modelsTo this end, our goal is to develop control algorithms that learn how to per-form movements with no previous knowledge of the dynamics of the system or itsenvironment. In the control of tendon-driven systems, much of what has been im-plemented relies on prior knowledge of internal dynamics (see [72] and the “whitebox” controller of Sachdeva et al. [60]) and manual tuning of controller parameters(see [18, 46, 88]).2The contributions of this thesis are:• A novel framework combining supervised learning and local search for thecontrol of an anatomy-based quasistatic system (Chapter 3)• A data-driven method to learn control of tendon-driven dynamical systems,including passive dynamics compensation (Chapters 4 and 5.5), configuration-dependent mappings between muscle activations and motor movements (Sec-tion 5.2), and automatic selection of feedback parameters (Section 5.4)3Chapter 2Background and Related WorkThe goal of this thesis is to use machine learning techniques to improve existingmethods for the control of biomechanical systems. 
We begin by considering the control of a dynamical system (also referred to as a plant), which is defined by:

    x_{t+1} = f(x_t, u_t)    (2.1)

where f is the function defining the forward dynamics of the system, x_t is the state of the system at time t, and u_t is the control signal. In practice, f is often a complicated non-linear function that is not known in advance.

The problem in control is to select the activations u_t to achieve some desired behaviour of the plant. This thesis focuses mainly on trajectory-tracking problems: at each timestep t we have a specified target state x_r, and the goal is to make the plant follow the target points as closely as possible. An important special case that we explore in detail is reaching tasks, which can be formulated as a trajectory-tracking problem by keeping x_r constant for all timesteps.

Section 2.1 deals with some properties of biological motor systems, including how motor learning occurs in the brain and how these systems are represented in computer science and robotics. Section 2.2 overviews existing methods for control of dynamical systems, and Section 2.3 covers some concepts in machine learning that can be used for this task.

2.1 Properties of Biological Motor Systems

The control literature is replete with work on motor tasks: examples include locomotion [22, 38, 83, 87], swimming [70], and movements of the head [39], eyes [36] and hands [18, 20, 43, 46, 50]. This section examines some properties of biological learning and discusses how biomechanical systems are represented in the literature.

2.1.1 Components of Biological Learning

Current approaches in biological motor learning focus mainly on information extraction and decision-making [85]. Learning to perform a task requires effective gathering and processing of sensory information relevant to action.
Humans are highly skilled in this regard: Najemnik and Geisler [51] showed that human saccades (rapid movements of the eye between fixation points) are near-optimal in the Bayesian sense at extracting task-relevant information. Meanwhile, performing a task consists of decisions and strategies: we decide when to take what action as information is obtained. Research suggests the existence of simultaneous processes in the brain that specify potential motor actions and select between them [11].

Dimensionality reduction. The human body has approximately 650 skeletal muscles [55]. One can imagine the difficulty of learning to control a system with 650 degrees of freedom. However, human movement consists of many muscles moving together in coordinated actions, rather than each muscle working individually. This means that the dimensionality of the control problem is reduced.

This concept is referred to as muscle synergies: neural control modules that can be flexibly combined to generate a large repertoire of behaviours [84]. The central nervous system controls large numbers of degrees of freedom by joint-level and muscle-level synergies [78]. Support for the synergy hypothesis is given by Berger et al. [7], who investigated reaching tasks in a virtual environment where the normal muscle-to-force mappings could be changed. After identifying muscle synergies, they found that the subjects' adaptation was faster when a full range of movements could be achieved by recombining synergies than when new synergies were required.

From a computational perspective, this means that we can simplify control tasks using standard dimensionality reduction techniques [26]. For example, Ingram et al.
[35] found that 80% of the variance of hand movements was accounted for by two principal components.

Challenges of biological control. Motor control suffers several inherent challenges. One is the need for adaptability: our brain/controller must be able to provide appropriate motor commands for different contexts likely to be experienced [84]. Error correction is difficult as sensory streams are temporally delayed and corrupted by noise [36, 85]. In addition, the mappings between muscle activations and resulting movements are difficult to learn. Muscles, tendons, bones and joints interact non-linearly [17], and there is an issue of redundancy: for example, one musculoskeletal configuration can be achieved in many ways by co-activating agonist and antagonist muscles [39].

2.1.2 Representations of Biomechanical Systems

Biomechanical systems are represented in control literature in one of two ways: torque- or tendon-driven. Torque-driven systems are used by Liu [43], Yin et al. [87] and Wang et al. [83], among others.

Tendon-driven systems are simulated using thin strands for tendons and rigid bodies for bones. A simulated tendon-driven system is used by Sueda et al. [72], and a robotic example is the Anatomically Correct Testbed (ACT) hand [17]. Tendon-driven systems have the advantage of matching real anatomy: the fibers of the strand contract, which transmits contractile force directly along the fiber. They also allow for accurate imitation of physiological phenomena – such as joint coupling [60] – which are difficult to model with a torque-driven approach. They do, however, pose difficulties for control: tendon-driven muscles do not provide direct control over degrees of freedom (as is typically assumed with computed torque methods) because a single muscle may span multiple joints, and a single joint may be spanned by multiple muscles [22].

Muscles are commonly simulated using Hill-type muscle models (for example, in [22, 39, 70]). The model consists of a Contractile Element (CE) and two non-linear spring elements: the Series Element (SE) and Parallel Element (PE). The PE represents the passive force of the connective tissues surrounding the muscle, while the SE represents the tendons and provides an energy-storing mechanism [47].

The force produced by the CE, F_CE, is given by:

    F_CE = u · F_max · f_L(L_CE) · f_V(V_CE)    (2.2)

where u is the muscle activation level, F_max is the muscle's maximum isometric force, f_L describes the relationship between force and muscle length (L_CE), and f_V gives the relationship between force and current contraction velocity (V_CE).

The passive forces F_PE and F_SE are modelled as non-linear springs based on their length:

    F_SE = f_SE(L − L_CE)    (2.3)
    F_PE = f_PE(L_CE)    (2.4)

where f_SE, f_PE are non-linear force-length relations and L is the total muscle length.

The total muscle force F and the forces in the CE, PE and SE satisfy

    F = F_PE + F_SE    (2.5)
    F_CE = F_SE    (2.6)

For an illustration of the model, see Figure 2.1.

Figure 2.1: Illustration of a Hill-type muscle model.

2.2 Control of Dynamical Systems

Now that we have covered some dynamical system representations of biomechanical systems, we discuss methods for controlling dynamical systems. A key issue in control is the choice of parametrization of the desired task. This is given in terms of a configuration variable, denoted by q_t. The subtle but important distinction between the configuration variable q_t and the state x_t is that q_t describes the position of the system, while x_t describes both the position and the velocity.

Section 2.2.1 deals with asymptotic reaching control, which is the problem of guiding a plant to a desired configuration. Section 2.2.2 introduces feedback control, and Section 2.2.3 discusses hierarchical control methods. Section 2.2.4 then overviews the control of stochastic systems and Section 2.2.5 explores issues relating specifically to the control of sensorimotor tasks.

2.2.1 Reaching Control

Consider a dynamical system governed by Equation 2.1.
The reaching control problem consists of computing the activation u_t at each timestep so that the system state x_t converges to some target x_r.

Equilibrium control
A conceptually straightforward approach to this problem is equilibrium control, reviewed in [66]. It relies on the concept of equilibrium points: these are state-activation pairs {x_eq, u_eq} such that x_eq = f(x_eq, u_eq). That is, the system at state x_eq does not move when given the activation u_eq. The main idea behind equilibrium control is that if we wish to drive the plant to the state x_eq, it may suffice to apply the controls u_eq that keep the system in equilibrium at that point.

The advantage of equilibrium control is its simplicity. The existence of a mapping between activations and states means that the problem can be handled with standard supervised learning methods. However, this simplicity comes at a price. The activation-to-state mapping means the effects of starting configuration and intermediate dynamics are ignored. There is no guarantee that the mappings are bijective: multiple activations could stabilize the system at a single point, or one activation could stabilize the system at multiple points. Convergence tends to be slow, though this can be overcome with the pulse-step paradigm, in which a stronger "pulse" activation precedes the equilibrium activation. The pulse-step pattern is seen in the control of saccadic eye movements (see [36, 71]).

Because of its drawbacks, equilibrium control is rarely used by itself in practice, but it is often used as part of a more sophisticated controller scheme. Deshpande et al. [18] and others have used equilibrium control to account for passive muscle forces at a given plant configuration. Liu [43] used it as a preprocessing step for object manipulation from a grasping pose: at an input pose, equilibrium controls were used to maintain the pose against gravity.

Task parametrization
Reaching tasks can be parametrized either by spline-based trajectories or with dynamics-based approaches.
The spline-based approach is easier to understand, but does not easily generalize to new behavioural situations without complete recomputation of the spline. It also cannot easily be co-ordinated with other events in the environment (for example, catching a ball) [52]. The alternative is to use a dynamical system to specify an attractor landscape. This allows for an extension of equilibrium control-type methods beyond reaching for a single point: with this parametrization, the target state can also be a periodic orbit or manifold [33, 70]. Peters and Schaal [52] used parametrized non-linear dynamical systems as motor primitives, where the attractor properties of the systems defined the desired behaviour.

Figure 2.2: A basic feedback system, where: r is the reference command; ε is the sensor output, which gives an estimate of the control error; u is the actuating signal; d is the external disturbance acting on the plant; y is the plant output and measured signal; and n is sensor noise.

2.2.2 Feedback Control

Equilibrium control is an example of feedforward control: it is based purely on sensory signals that are selected prior to movement [84]. Purely feedforward methods are not the best approach in practice: errors inevitably occur during movement, so we need a way to correct errors as we perceive them.

Feedback control basics
This brings us to feedback control, which is based on the outcome of a movement. A block diagram of a basic feedback system [19] is given in Figure 2.2. It consists of three components – the controller, the plant, and the sensor – each having two inputs and one output.

PID control
A common method for control of feedback systems is the Proportional-Integral-Derivative (PID) controller. Let ε_t denote the sensor's error signal at time t.
The PID controller gives the control signal:

u(t) = K_P ε_t + K_I ∫_0^t ε_τ dτ + K_D (d/dt) ε_t    (2.7)

where K_P, K_I and K_D are scalar tuning parameters referred to as gains. The first term of this equation is the proportional term, which produces an output proportional to the current error. The integral term (second term) contributes an output proportional to the accumulated error over time; this helps eliminate steady-state error. Lastly, the derivative term (third term) produces an output proportional to the slope of the error; this helps improve control by predicting the behaviour of the system [19].

Special cases of PID control include Proportional-Integral (PI) control (same as Equation 2.7 but with the derivative term omitted) and Proportional-Derivative (PD) control (integral term omitted). PID control is used by [18, 46, 88] and PD control by [2, 14, 22, 39, 70, 83, 87]. In practice, the gain parameters are mostly tuned by hand; this is the case in all the papers just mentioned. This tuning can be complicated, as gains may need to be chosen per controller state [88], per joint [18], or both [46].

Loop tuning
One method for tuning a PID controller is the Ziegler-Nichols method [90]. In this method, the integral and derivative gains K_I and K_D are set to 0. The proportional gain K_P is set to 0 and gradually increased until it reaches the ultimate gain, K_U, defined as the point at which the output of the control loop oscillates with a constant oscillation period T_U. The gain values are then set according to the type of controller used – see Table 2.1.

Table 2.1: PID tuning with the Ziegler-Nichols method.

Control type | K_P      | K_I         | K_D
PI           | 0.45 K_U | 1.2 K_P/T_U | -
PD           | 0.8 K_U  | -           | K_P T_U/8
PID          | 0.6 K_U  | 2 K_P/T_U   | K_P T_U/8

There are other, similar tuning methods – for example, Haugen's [30] – that do not require the loop to get into oscillations during tuning.

Gain scheduling
In gain scheduling, the gain parameters K_P, K_I and K_D depend on the current system configuration.
The general idea in classical gain scheduling is to divide the configuration space into small areas that are assumed to be linear, create a PID controller for each section of the space, and use blending or interpolation at other points to ensure that changes in controller parameters are smooth [40]. In other words, we model the dynamical system as a Linear Parameter-Varying (LPV) system (mentioned explicitly in [64, 69] and used implicitly in [18, 88]). This is a class of non-linear systems that can be modelled as parametrized linear systems whose parameters change with their state.

The drawback of classical gain scheduling is that adequate performance and stability are not guaranteed at points other than the design points [68]. In particular, we are restricted to slow variations in the scheduling variable (some functional of which determines the gains at a given configuration), so we may encounter problems if there are fast parameter variations.

For detailed descriptions of specific gain scheduling algorithms, we refer the reader to [28, 69, 77].

2.2.3 Hierarchical Control Methods

Hierarchical control is a divide-and-conquer approach that separates a control task into multiple (usually two) simpler tasks. Typically, one level controls high-level dynamics and a second level computes muscle activations to satisfy the high-level commands.

These methods are biologically plausible – evidence suggests that the central nervous system works hierarchically [78] – and confer several computational advantages. Hou et al. [32] found that a hierarchical control framework greatly improves the efficiency of neural network calculations. By including a formulation for converting high-level commands into something that is solvable by a low-level method, this control framework allows the user to provide commands parametrized at the high level: hence, it becomes easier and more intuitive to specify control goals. For example, the method of Mordatch et al.
[50] for object manipulation employs high-level trajectory optimization to synthesize motion and contact events from only a few high-level goals. This allows the user to specify a goal in terms of movement of the object, while the hierarchical control algorithm uses an inverse dynamics Quadratic Program (QP) to ensure that forces, controls and contact variables are consistent with each other.

Hierarchical methods enjoy great popularity in the control literature. Si et al. [70] developed a swimming controller in which neural networks produced rhythmic motor patterns based on high-level commands (such as amplitude, frequency and phase) to give desired muscle lengths, while a low-level PD controller produced activations to achieve those muscle lengths. In the work of Deshpande et al. [18], a high-level controller determined the desired muscle length for a target joint angle in the index finger, and a low-level PID controller computed the corresponding muscle forces. Deisenroth and Rasmussen [16] employed a hierarchical formulation to decompose the motor learning problem into a hierarchy of three sub-problems.

2.2.4 Control of Stochastic Systems

Control tasks are often subject to some degree of stochasticity: noise corrupts observations of the environment, the dynamics of the system's movement, or the activation signals computed by the controller. In biological systems, this is always the case [85].

Several control algorithms have been designed for stochastic systems. Jones [36] gave an optimal control law for performing saccades where the cost function depended on the statistical distribution of the eye's position, obtained with Monte Carlo sampling. Geijtenbeek et al. [22] incorporated noise in the form of temporal delay into their feedback control formulation. In Huh and Todorov's paper [33], both controller and body were assumed to be stochastic dynamical systems, and activations were computed using a risk-sensitive objective function.
A reinforcement learning framework (see Section 2.3.3) can handle Markov Decision Processes (MDPs) with stochastic rewards, observations, and/or state transitions [74].

2.2.5 Control Methods for Sensorimotor Tasks

Subsections 2.2.1 - 2.2.4 deal with general control principles. In this section, we discuss some considerations that arise specifically in the control of motor tasks.

Managing multiple phases of movement
Practically speaking, many motor tasks – such as locomotion and object manipulation – involve several stages of movement. For complicated, multi-stage tasks, researchers commonly use finite state machine models. This involves building different controllers at each stage of movement. For example, locomotion controllers are often divided into four states, corresponding to the ground contact and swing phases for each foot [83, 87]. Generating the desired behaviour in different states is accomplished either by building conceptually different controllers for each, as in de Lasa et al. [14], or by having one controller framework and changing the parameters for each state. Wang et al. [83], Yin et al. [87] and Geijtenbeek et al. [22] accomplished the latter by changing the weights of an objective function, which was optimized to yield the control signal. Zhang et al. [88] and Malhotra et al. [46] changed the per-joint controller gains at each state, while Andrews and Kry [2] changed the controller's joint stiffness and damping matrices.

Enforcing realism of movements
A common concern in control problems – particularly in computer graphics – is ensuring that movements look natural. Animation techniques generally have a trade-off between motion naturalness and the control that can be exerted over the motion [79].

Several techniques can be used to encourage natural motions.
For methods that compute activation signals by optimizing an objective function, one can add to this function terms that penalize unnatural-looking behaviour – or, more generally, anything in the movement (besides control error) that is undesirable. For example, the locomotion controller of Wang et al. [83] included penalty terms to reduce head movement and penalize unnatural – i.e. significantly different from human – joint torque ratios between the hips, knees and ankles.

Another approach, which has tie-ins to biological learning, is to parametrize the control task in terms of muscle synergies or, more broadly, high-level features that result in coordinated muscle movements. Compared to activating each muscle individually, this results in more natural, controlled movements. For examples, see [14] for walking, [70] for swimming, and [88] for playing the piano.

When the motion's appearance is the primary concern, it is possible to forego physical fidelity to achieve the desired plausible motions – for example, by blending kinematic and dynamic motions, as in [91]. Another approach is the direct imitation of motion capture data. Grochow et al. [24] and Wang et al. [82] both fit dynamical models to motion capture frames to learn the most likely poses and motions satisfying a set of constraints.

2.3 Machine Learning Approaches in Motor Control

Learning is a crucial component in the performance of motor tasks. Because of the high dimensions of the actuation and movement spaces and the complexity of the mappings between them, there is great potential for techniques from statistics and machine learning. In this section, we focus on the control of simulated – rather than robotic – systems. We also focus more on problems where the output is deterministic given the input, rather than on stochastic problems. Hence, our learning process can be categorized as a computer experiment: as the name suggests, these are experiments involving simulated processes.
These differ from physical experiments in that they are assumed to have no random error: the system's output is a deterministic function of its input. According to Sacks et al. [61], the role of statistics in computer experiments is:

• Data analysis
• Quantification of uncertainty associated with prediction of fitted models
• Experimental design

Section 2.3.1 describes supervised learning methods that can be used in control, while Section 2.3.2 discusses design principles for computer experiments. Section 2.3.3 concludes with an overview of reinforcement learning.

2.3.1 Supervised Learning

This section deals with supervised learning, specifically regression – namely, learning a mapping from inputs to continuous outputs. Regression for computer experiments involves modelling a complicated deterministic function (such as the simulation of the dynamical system in Equation 2.1) as the realization of a stochastic process. This allows the use of statistical techniques [61].

Gaussian process regression
A Gaussian process is a family of statistical distributions where every finite collection of random variables within an input space follows a multivariate normal distribution. Because of properties inherited from the normal distribution, the distributions of various derived quantities can be obtained analytically [56]. The model is defined by its mean and covariance functions.

Suppose we have inputs {x^(1), ..., x^(n)} and corresponding responses {Y^(1), ..., Y^(n)}. We focus here on the case in which the response is modelled as a linear function of the inputs:

Y(x) = Σ_{j=1}^k β_j φ_j(x) + Z(x)    (2.8)

where φ are previously specified regression functions, k is the number of regression functions φ_j, β is a vector of regression coefficients, and Z(·) is a random process assumed to have mean 0 and covariance V(w, x) = σ² R(w, x) for some correlation function R and variance σ².
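The stochastic-process model of Equation 2.8 can be made concrete by drawing sample paths of Y(x) = βx + Z(x) on a 1D grid. The sketch below is illustrative only: the function names, parameter values, and choice of ρ = 1 (for numerical stability of the Cholesky factorization) are assumptions, not values from the thesis.

```python
import numpy as np

def exp_corr(x, theta=4.0, rho=1.0):
    # Pairwise exponential correlation R(w, x) = exp(-theta * |w - x|^rho)
    d = np.abs(x[:, None] - x[None, :])
    return np.exp(-theta * d ** rho)

def sample_prior(x, beta=1.0, sigma2=0.1, n_draws=3, seed=0):
    # Draw realizations of Y(x) = beta * x + Z(x), with Z ~ GP(0, sigma2 * R)
    rng = np.random.default_rng(seed)
    K = sigma2 * exp_corr(x) + 1e-9 * np.eye(len(x))  # jitter for Cholesky
    L = np.linalg.cholesky(K)
    Z = L @ rng.standard_normal((len(x), n_draws))
    return beta * x[:, None] + Z

x = np.linspace(0.0, 1.0, 50)
paths = sample_prior(x)   # one column per sampled response function
```

Each column of `paths` is one draw of the "Bayesian prior on the true response functions" interpretation discussed in the text.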
There are two ways to interpret Equation 2.8. One is to view the responses as departures from a simple regression model [61]. The second is that the Y(·) are a Bayesian prior on the true response functions, with the β's either specified a priori or given a prior distribution [12].

Suppose now that, given a design S = {s^(1), ..., s^(n)} and data Y_S = {Y(s^(1)), ..., Y(s^(n))}′, we wish to construct a linear predictor Ŷ(x) = c′(x) Y_S of Y(x) at untried x. In kriging, we select the Best Linear Unbiased Predictor (BLUP): in other words, we want to choose a vector c(x) minimizing the Mean Squared Prediction Error (MSPE):

MSPE(Ŷ) ≡ E{(Ŷ − Y)²}    (2.9)

subject to the unbiasedness constraint

E{c′(x) Y_S} = E{Y(x)}    (2.10)

We now introduce some notation. Let φ(x) = [φ_1(x), ..., φ_k(x)]′ denote the vector of the k functions in the regression; Φ the n × k expanded design matrix whose i-th row is φ′(s^(i)); R = {R(s^(i), s^(j))}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, the n × n matrix of stochastic-process correlations between Z values at the design sites; and r(x) = [R(s^(1), x), ..., R(s^(n), x)]′ the vector of correlations between Z values at the design sites and at an untried input x.

The BLUP is given by:

Ŷ(x) = φ′(x) β̂ + r′(x) R⁻¹ (Y_S − Φ β̂)    (2.11)

where

β̂ = (Φ′ R⁻¹ Φ)⁻¹ Φ′ R⁻¹ Y_S    (2.12)

is the generalized least-squares estimate of β. For proof that Equation 2.11 is in fact the BLUP, we refer the reader to [63]. The fit occurs in two stages: we first obtain the generalized least-squares predictor (the first term on the right-hand side of Equation 2.11) and then interpolate the residuals as if there were no regression model (the second term). Hence, the kriging regression model differs from the standard regression model in that the model interpolates the values at the design points. This is obviously desirable for computer experiments, where the output is assumed to be noise-free.
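Equations 2.11 and 2.12 translate almost directly into code. The following is a minimal sketch with a single constant regression function φ(x) = 1 (ordinary kriging, a simplification of the linear φ used elsewhere in this thesis); the correlation function, parameter values, and function names are illustrative assumptions.

```python
import numpy as np

def exp_corr(a, b, theta=4.0):
    # Exponential correlation exp(-theta * |a - b|) between two 1D point sets
    return np.exp(-theta * np.abs(a[:, None] - b[None, :]))

def krige(s, ys, x):
    # BLUP of Eqs. 2.11-2.12 with one regression function phi(x) = 1
    Phi = np.ones((len(s), 1))          # expanded design matrix
    R = exp_corr(s, s)                  # correlations among design sites
    Rinv = np.linalg.inv(R)
    # generalized least-squares estimate of beta (Eq. 2.12)
    beta = np.linalg.solve(Phi.T @ Rinv @ Phi, Phi.T @ Rinv @ ys)
    r = exp_corr(x, s)                  # correlations to the new inputs
    # GLS predictor plus interpolated residuals (Eq. 2.11)
    return beta[0] + r @ Rinv @ (ys - Phi @ beta)

s = np.array([0.0, 0.25, 0.5, 0.75, 1.0])        # design sites
ys = np.exp(-3 * s) * np.cos(5 * np.pi * s)       # noise-free responses
yhat = krige(s, ys, s)                            # interpolates the data
```

At the design sites, r′(x) R⁻¹ cancels the regression residual exactly, so the predictor reproduces the observed responses, which is the interpolation property noted in the text.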
For a visual depiction of kriging, see Figure 2.3.

Figure 2.3: Kriging in 1D for f(x) = exp(−3x) · cos(5πx), given by the solid line. Design points are shown by asterisks and the dashed line gives the predicted values obtained with a kriging estimator using an exponential correlation function [61] with parameters θ = 4, ρ =

Gaussian process regression is rather time-consuming (O(n³)), but can be sped up with matrix inverse CUDA parallelization [15] or sparsification methods [9]. Because of the need to calculate R⁻¹, the method may become unstable when R is poorly conditioned.

Gaussian process methods are common in the control literature [44, 86]. They were used in Wang et al. [82] to fit dynamical models to human motion, and in Grochow et al. [24] for inverse kinematics. Deshpande et al. [18] used Gaussian processes for learning the moment-arm matrix (which gives the relationship between muscles and joints) of the ACT hand based on motion capture data.

Nearest-neighbours
The idea behind nearest-neighbours prediction is to predict the value at a new input site x based on some function of the values of the k nearest neighbours. Some choices for this function are weighted averages and local regression models. k-nearest-neighbours with an inverse distance-weighted average was used for policy value estimation in Andrews and Kry [2].

Most standard nearest-neighbours methods, for example kd-trees [89], have either space or query time that is exponential in the number of points and/or the dimension of the data. However, Locality-Sensitive Hashing (LSH) algorithms can perform approximate k-nearest-neighbours computations with both space and query time that is linear in the number of points to search [1].

Neural networks
A neural network is represented by a series of interconnected units whose behaviour is inspired by that of neurons. Learning occurs by adjusting the weightings between these neurons.
This is analogous to learning in the brain, which is effected by creating new synapses and strengthening existing ones [59].

Neural networks are ubiquitous in the control literature. For example, Grzeszczuk et al. [25] performed approximations of physics-based animations by replacing the entire dynamical system with a neural network, which could be easily controlled by gradient-based methods (as the gradient of a neural network can be computed analytically). Recurrent neural networks were used for equilibrium control in Huh and Todorov [33], and the swimming controller of Si et al. [70] produced rhythmic outputs using a special class of neural networks called Central Pattern Generators. Deep learning methods are often used (e.g. in Mnih et al. [49]) because some functions cannot be efficiently represented by architectures that are too shallow: they may need exponentially many more units if the network depth is even 1 too low [6].

2.3.2 Experimental Design

The experimental design problem addresses how we should sample from our input space to obtain training data for our learning algorithms. In computer experiments, uncertainty arises only from not knowing the exact function between inputs and response variables. Therefore, designs should take no more than one observation at a set of inputs, allow for a variety of models, and provide information about all portions of the output space [61].

Space-filling designs
Methods like kriging are interpolators; so intuitively, we want space-filling designs. A simple strategy is to select points according to a regular grid pattern superimposed on the experimental region. However, in this case, the number of points required grows exponentially with the dimension of the experimental region, making it infeasible for high-dimensional problems.

Another straightforward option is a simple random sample (fig. 2.4a). These are eventually space-filling, but don't work well for small sample sizes.
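The sense in which a small random sample fails to fill space can be quantified with a maximin criterion: the smallest pairwise distance in the design, where larger is better. A minimal sketch, assuming numpy; the regular grid here is only a reference design for the same budget of 16 points, and the seed is arbitrary.

```python
import numpy as np

def min_distance(pts):
    # Smallest pairwise Euclidean distance in a point set (maximin criterion)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min()

rng = np.random.default_rng(0)
random_pts = rng.random((16, 2))                 # simple random sample
g = (np.arange(4) + 0.5) / 4                     # 4x4 grid cell centers
grid_pts = np.array([(a, b) for a in g for b in g])
```

For a small budget like this, the random sample's minimum inter-point distance is typically far below the grid's 0.25, reflecting the clustering that makes small simple random samples poor space-filling designs.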
A more sophisticated approach is a stratified random sample [63], in which we obtain n points by dividing the region into n evenly spread strata and randomly selecting a single point from each (fig. 2.4b).

A popular approach is the Latin Hypercube Design (LHD) [48], in which we spread observations evenly over the range of each input separately (fig. 2.4c). These designs possess desirable marginal properties, but only a subset of them are truly "space-filling" [63]. A variation of LHD is the cascading LHD: essentially, we generate an LHD, consider a small region around each point of the design, and generate a second LHD in that region. This allows the exploration of both the local and global correlation structure of the model for the response. Other variations of LHDs are based on orthogonal arrays [31].

Criterion-based designs
In space-filling designs, we use geometric criteria measuring the amount of "spread" of the data points. More classical methods of choosing an experimental design are based on some statistical criterion [63]. One such approach is called D-optimality, in which we wish to maximize the determinant |X^T X| (equivalently, minimize |(X^T X)⁻¹|), where X is the design matrix [13]. This is equivalent to minimizing the generalized variance of the least-squares estimates of the regression parameters.

Another strategy is maximum entropy design, which seeks to maximize the expected Shannon entropy of observed responses at the design points [65]. Assume that our model for the response function is parametrized by θ, and let [θ|D] denote the posterior distribution of θ given the data obtained with the design D. Before the experiment, the amount of information about θ (given by its prior distribution, [θ]) is:

I = ∫ [θ] log [θ] dθ    (2.13)

The amount of information about θ after the experiment using the design D is:

I_D = ∫ [θ|D] log([θ|D]) dθ    (2.14)

The change in information after using D is given by I − I_D.
The expected change in information is maximized by the design that maximizes the entropy of the observed responses at points in the design: this is what is meant by maximum entropy design. An algorithm for finding such designs is described by Malakar and Knuth [45].

Lastly, mean-squared prediction error designs are based on minimizing some functional of the MSPE [63]. The Integral Mean-Squared Error (IMSE) criterion minimizes the average MSPE over the experimental domain X, and the Maximum Mean-Squared Error (MMSE) criterion minimizes the worst-case prediction error over X. Both the IMSE and MMSE tend to locate points away from the boundaries of the design space.

Figure 2.4: Illustration of different space-filling design techniques for 16 sample points in 2 dimensions. (a) Simple random sample. (b) Stratified random sample, where each stratum is a square comprising one-sixteenth of the design space. (c) Latin hypercube sampling – the projection of the points onto either the x or y axes is spaced uniformly.

Motor babbling
The motivation of experimental design methods in the statistical literature is similar to that of goal babbling in control theory. This approach differs from random motor babbling – in which we gather data by supplying random actuation signals – in that we sample points using goal-directed exploration. This is similar to how infants learn motor skills [57]. Rolf [58] presented a goal babbling method for exploring an unknown space. It is based on successive attempts to move as far as possible in a particular target direction until there are large deviations between the current state and the target – a hint that the target is out of range and we have gone as far as possible in that direction.

2.3.3 Reinforcement Learning

Sections 2.3.1 and 2.3.2 have dealt with supervised learning.
However, because the solution space in control problems is frequently non-convex, we may encounter problems if we employ supervised learning methods without further consideration [54].

We now introduce the reinforcement learning framework to address some of these problems. Reinforcement learning involves mapping situations to actions so as to maximize reward. The most important features of reinforcement learning are trial-and-error search and delayed reward [74]. It differs from supervised learning in that correct input/output pairs are never presented, and "incorrect" actions are never explicitly corrected. Reinforcement learning is mainly used to compute optimal controls for problems that are framed as a Markov Decision Process (MDP), as in Wampler et al. [81].

Value functions
The key computational problem for most reinforcement learning algorithms is estimating value functions: these are functions of states (or state-action pairs) that estimate how good it is to be in a particular state (or how good it is to perform a given action in a particular state) [74]. To quantify this, we define the notion of a return: this is some function of the sequence of rewards experienced at each timestep, which we wish to maximize. The return R_T is generally a sum (R_T = Σ_{k=0}^T r_k) or a discounted sum (R_T = Σ_{k=0}^T γ^k r_k, 0 < γ < 1) of the instantaneous reward r_k at each timestep k. We also define a policy π as a function mapping states to actions.

The value of a state s under policy π, denoted V^π(s), is the expected return when starting in s and following π thereafter. For MDPs:

V^π(s) = E_π{R_t | s_t = s} = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }    (2.15)

We call V^π the value function for the policy π.
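The discounted return, and the estimate of V^π(s) obtained by averaging returns of trajectories started from s, can be sketched in a few lines. The reward sequences below are made-up toy values, and the discount factor is an arbitrary choice.

```python
def discounted_return(rewards, gamma=0.9):
    # R_T = sum_{k=0}^{T} gamma^k * r_k for one sampled reward sequence
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def value_estimate(reward_sequences, gamma=0.9):
    # Estimate V^pi(s) by averaging the returns of trajectories from s
    returns = [discounted_return(seq, gamma) for seq in reward_sequences]
    return sum(returns) / len(returns)

# e.g. two sampled trajectories from the same start state
v = value_estimate([[1.0, 0.0], [0.0, 1.0]], gamma=0.5)   # (1.0 + 0.5) / 2
```

Averaging observed returns in this way is exactly the Monte Carlo estimator discussed later in this section.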
The value of taking an action a in state s under policy π is defined by the action-value function Q^π(s, a):

Q^π(s, a) = E_π{R_t | s_t = s, a_t = a} = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }    (2.16)

For any policy π and state s, the following consistency condition holds between the value of s and the value of its successor states:

V^π(s) = E_π{R_t | s_t = s} = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]    (2.17)

where P^a_{ss′} is the probability of transition to state s′ given that action a was taken in state s, and R^a_{ss′} is the expected value of the next reward r given a current state s, action a and next state s′. Equation 2.17 is called the Bellman equation for V^π. V^π is the unique solution to its Bellman equation [5].

The value function provides a method for comparing different policies. We say that a policy π is optimal if its expected return V^π(s) is greater than or equal to that of any other policy π′ for all states. All optimal policies π* share an optimal state-value function V*(s) ≡ max_π V^π(s) for all s ∈ S. They also share an optimal action-value function, denoted Q* and defined by Q*(s, a) ≡ max_π Q^π(s, a). We can write Q* in terms of V* as:

Q*(s, a) = E{r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a}    (2.18)

Policy iteration
Assume we have a given policy π with a corresponding (known) value function V^π. Suppose that in state s, we want to know whether we should change the policy to choose an action a ≠ π(s). We can determine how good it would be to follow a and then return to the policy π with the value

Q^π(s, a) = E_π{r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a} = Σ_{s′} P^a_{ss′} [R^a_{ss′} + γ V^π(s′)]    (2.19)

If this value is greater than V^π(s), this means it is better to select a once in s than to follow π all the time. As such, we would expect that if we select a every time we encounter s, the new policy would be better overall.
The process of making a new policy that improves on the original policy by making it greedy with respect to the original policy's value function is called policy improvement [74].

Repeating this process leads to policy iteration [75] – once π has been improved using V^π to yield a better policy π′, we can then compute V^π′ and use it to improve π′, and so on. By alternating policy evaluation and policy improvement steps, we obtain a sequence of monotonically improving policies and value functions [74].

Monte Carlo methods
We now turn to the problem of estimating value functions. Since the value of a state is its expected return starting from that state, an obvious way to estimate it is from experience: simply average the returns observed after visits to that state. This is the Monte Carlo method. Unlike dynamic programming approaches, which compute the value function by solving the Bellman equation (Equation 2.17), Monte Carlo methods require no knowledge of the transition probabilities P^a_{ss′} or expected rewards R^a_{ss′}.

However, these methods use samples inefficiently because we use a long trajectory to improve the estimate of just the single state-action pair that started the trajectory. This means that the methods converge slowly, and generally work only for small, finite MDPs [74].

Temporal difference methods
Temporal difference methods [74] are similar to Monte Carlo methods, but they update value estimates based in part on other learned estimates, rather than waiting for the final outcome.

The simplest method, TD(0), updates the value function after each step with:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]    (2.20)
In other words, we improve ourestimate of the value of st using both the instantaneous reward and the expectedvalue of the next state.Some more sophisticated temporal difference methods include SARSA [23],Q-learning [49], and actor-critic methods [53].Policy gradient methods Policy gradient methods frame policy optimization as astochastic optimization problem and perform a local search. These methods arequite versatile: Kohl and Stone [38] performed policy gradient estimation for anon-MDP with no notion of a state, but which instead has a direct mapping frompolicy parameters to expected return. Fast convergence is seen with vanilla andnatural policy gradients [52], and methods like natural actor-critic [53] achievegood results by combining policy gradients with temporal difference learning.25Chapter 3Control of Elastodynamic SkinThis chapter describes the control of a quasistatic elastodynamic model of the skinof the human mid-face. This model was developed by Debanga Raj Neog underthe supervision of Dinesh Pai for Neog’s forthcoming doctoral thesis, and is usedhere with permission. Section 3.1 provides a high-level overview of the simulator,while Section 3.2 and Section 3.3 present two methods for its control.3.1 System DescriptionThe model (Figure 3.1) is a triangular mesh of the human midface, which is con-trolled by specifying the activation levels of seven facial muscles: right and leftfrontalis; corrugator; right and left levator palpebrae superioris (LPS); and rightand left lower orbicularis oculi (OOM).The skin is modelled as a two-dimensional hyperelastic membrane and repre-sented with an Eulerian discretization [42]. Unlike other representations, this ap-proach avoids the need to constrain the skin to lie on the body surface and allowsthe use of a single mesh to represent both the body and the skin.The face has two major types of muscles: linear and sphincter [76]. 
In linear muscles, the fibers lie in a line, and the muscle produces force along the fiber direction. Sphincter muscles' fibers contract tangentially, producing the net force along the radial direction and closing the orifice in the muscle. Since both types of muscles are sheet-like, the 2D formulation is suitable for simulation.

Figure 3.1: (a) low-resolution skin mesh with location of muscle sheets labelled. (b) high-resolution skin mesh.

The muscle model is anisotropic and hyperelastic: the passive muscle stress-strain profile is shifted along the muscle fiber direction by an amount proportional to the muscle's activation level (see Figure 3.2).

Figure 3.2: Example of passive and active muscle force-length (FL) curves. Left: passive force. Right: active force, given by shifting the passive curve leftward according to the muscle activation u.

The model is quasistatic: that is, given a set of muscle activations, it generates a simulated pose without the need of a full dynamic simulation. If we supply a constant muscle activation, the model will reach a steady-state configuration when the stress due to the muscle activations is balanced by the external forces from the attached tendons and muscles. Steady-state configurations are computed in the 2-dimensional skin atlas space, with corresponding 3-dimensional mesh configurations obtained by a barycentric mapping.

3.2 A Supervised Learning Approach

Because of the model's quasistatic formulation, we can fully describe its behaviour with a mapping from muscle activations to resulting mesh configurations. This means we can address the control problem with supervised learning methods. The muscle activation is given by a 7-dimensional vector u, with each u_i ∈ [0, 1] representing the activation level of one of the seven facial muscles. Equilibrium configurations are computed in the 2-dimensional skin atlas space; this is the space we will use for learning and prediction.
This will be faster to predict than 3-dimensional mesh positions and, because the skin is modelled as a 2-dimensional surface, there is no loss of information in the 2D configuration compared to the 3D one. The skin atlas consists of 163 points, which means the dimension of our output is 326. We refer to the mesh configuration as the pose, denoted by the variable P.

3.2.1 Regression Methods

Training data
We perform our sampling in the 7-dimensional activation space using a Latin hypercube design. Recall from Section 2.3.2 that a Latin hypercube design is one that distributes points evenly over each input dimension separately. Designs are generated with the maximin criterion [37], which aims to maximize the minimum distance between points. In all the demonstrations in this section, we use 350 training points.

Kriging prediction
We model the mapping u \to P using kriging, which was discussed in Section 2.3.1. Because kriging methods assume scalar output variables, we fit a separate predictor to each component of P. We assume that each component P_i of our response P is of the form

P_i(u) = \sum_{j=1}^{k} \beta_{i,j} \phi_j(u) + Z(u),

where the \phi_j are previously specified regression functions, k is the number of regression functions, and Z(\cdot) is a random process with mean 0 and covariance function \sigma^2 R(w,u).
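As an aside on the sampling step above: a maximin Latin hypercube can be generated by simple random restarts. The thesis implementation is in MATLAB; the following Python sketch is illustrative only, and the function name is not from the thesis code.

```python
import numpy as np

def latin_hypercube(n, d, n_restarts=50, rng=None):
    """Maximin Latin hypercube design: n points in [0, 1]^d.

    Each column is a random permutation of the n strata, so every input
    dimension is covered evenly; among `n_restarts` random designs we keep
    the one that maximizes the minimum pairwise distance (maximin criterion).
    """
    rng = np.random.default_rng(rng)
    best, best_score = None, -np.inf
    for _ in range(n_restarts):
        # One stratified sample per dimension: stratum centres, permuted.
        X = np.empty((n, d))
        for j in range(d):
            X[:, j] = (rng.permutation(n) + 0.5) / n
        # Minimum pairwise distance of this candidate design.
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(-1))
        np.fill_diagonal(dist, np.inf)
        score = dist.min()
        if score > best_score:
            best, best_score = X, score
    return best

# e.g. 350 training activations for the 7 facial muscles:
U = latin_hypercube(350, 7, rng=0)
```

More restarts (or a local exchange heuristic) would give a better maximin score; 50 restarts is merely a cheap approximation.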
For each component P_i we use the same regression and correlation functions, but we fit different least-squares parameters \beta.

The predicted value \hat{P}_i at u_{new}, given inputs U = u^{(1)}, \ldots, u^{(n)} and corresponding outputs P_{U,i} = P_i(u^{(1)}), \ldots, P_i(u^{(n)}), is given by:

\hat{P}_i(u_{new}) = \phi'(u_{new})\hat{\beta}_i + r'(u_{new}) R^{-1} (P_{U,i} - \Phi\hat{\beta}_i)   (3.1)

where

\hat{\beta}_i = (\Phi' R^{-1} \Phi)^{-1} \Phi' R^{-1} P_{U,i}   (3.2)

In Equation 3.1, \phi(u_{new}) = [\phi_1(u_{new}), \ldots, \phi_k(u_{new})]' denotes the vector of the k functions in the regression; \Phi = [\phi'(u^{(1)}); \ldots; \phi'(u^{(n)})] the n \times k expanded design matrix; R = \{R(u^{(i)}, u^{(j)})\}, 1 \le i \le n, 1 \le j \le n, the n \times n matrix of stochastic-process correlations between Z values at the design sites; and r(u_{new}) = [R(u^{(1)}, u_{new}), \ldots, R(u^{(n)}, u_{new})]' the vector of correlations between Z values at the design sites and an untried input u_{new}. For our regression functions, we use \phi(u) = [u_1, u_2, \ldots, u_k], where k is the dimension of u. In other words, we model our response P as a simple linear function of u.

Choice of correlation function
A question we have not yet addressed is the choice of the correlation function R(u^{(i)}, u^{(j)}). These functions generally take the form R(w,u) = \prod_{j=1}^{m} g(w_j - u_j) for some function g, where m is the dimension of the input data [61]. A common choice is the exponential correlation function, of the form:

R(w,u) = \prod_{j=1}^{m} \exp(-\theta_j |w_j - u_j|^\rho), \quad 0 < \rho \le 2   (3.3)

where \theta, \rho are correlation parameters.
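To make Equations 3.1-3.3 concrete, here is a minimal Python sketch of a kriging predictor for one scalar output, using the exponential correlation function and the linear regression functions \phi(u) = u. The thesis implementation is in MATLAB; these helper names are illustrative, and a tiny diagonal jitter is added for numerical stability.

```python
import numpy as np

def exp_corr(W, U, theta, rho):
    """Exponential correlation R(w, u) = prod_j exp(-theta_j |w_j - u_j|^rho)."""
    d = np.abs(W[:, None, :] - U[None, :, :])          # (nW, nU, m)
    return np.exp(-(theta * d ** rho).sum(axis=-1))

def kriging_fit(U, y, theta, rho):
    """Fit one scalar output: GLS coefficients (Eq 3.2) plus residual weights."""
    Phi = U                                            # phi(u) = u (linear trend)
    R = exp_corr(U, U, theta, rho) + 1e-10 * np.eye(len(U))  # jitter for stability
    Rinv_y = np.linalg.solve(R, y)
    Rinv_Phi = np.linalg.solve(R, Phi)
    beta = np.linalg.solve(Phi.T @ Rinv_Phi, Phi.T @ Rinv_y)
    gamma = np.linalg.solve(R, y - Phi @ beta)         # R^{-1}(y - Phi beta)
    return beta, gamma

def kriging_predict(u_new, U, beta, gamma, theta, rho):
    """BLUP of Eq 3.1: phi'(u) beta + r'(u) R^{-1}(y - Phi beta)."""
    r = exp_corr(np.atleast_2d(u_new), U, theta, rho)[0]
    return u_new @ beta + r @ gamma
```

A defining property of this predictor, which the sketch preserves, is that it interpolates: at a training input the correlation vector r equals the corresponding row of R, so the prediction reproduces the training output.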
Another option is the linear correlation function, which results in a linear spline:

R(w,u) = \prod_{j=1}^{m} (1 - \theta_j |w_j - u_j|)_+   (3.4)

Lastly, the correlation function

R(w,u) = \prod_{j=1}^{m} \left[ 1 - a_j (w_j - u_j)^2 + b_j |w_j - u_j|^3 \right]   (3.5)

gives a cubic spline for certain values of a_j and b_j.

For all problems in this chapter, we use the exponential correlation function. In initial experiments, we will assume known values for the parameters \theta, \rho (practically speaking, this means trying different values until we obtain good results). Afterwards, we will compare the accuracy of prediction with hand-chosen parameters to prediction with the Maximum Likelihood Estimate (MLE) parameters.

Maximum Likelihood Estimation of correlation parameters
MLE methods [63] assume that the correlation function is of the form R = R(\cdot \,|\, \psi), where \psi is a vector of parameters. In the case of the exponential correlation function, \psi \equiv \{\theta_1, \ldots, \theta_m, \rho\}. The MLE chooses the parameter vector \hat{\psi} that minimizes

n \log \hat{\sigma}_z^2(\psi) + \log(\det(R(\psi)))   (3.6)

where R(\psi) is the correlation matrix built from the existing training data with parameters \psi, and

\hat{\sigma}_z^2 = \hat{\sigma}_z^2(\psi) = \frac{1}{n} (P_{U,i} - \Phi\hat{\beta})^T R^{-1}(\psi) (P_{U,i} - \Phi\hat{\beta})   (3.7)

For large n, it would be impractical to compute the gradient of Equation 3.6, so we perform the optimization for \hat{\psi} using a quasi-Newton method [3].

Measuring prediction error
The next two subsections deal with forward and inverse prediction. Forward prediction computes the map from muscle activations to poses; inverse prediction maps from poses to the muscle activations that yield them. Here we describe how we assess the quality of predictions. For inverse prediction, we have a pose that was generated by an activation u_{true}; if our inverse predictor returns \hat{u} as the estimate of u_{true}, we define the activation error by:

Err_{act}(\hat{u}, u_{true}) = \frac{||\hat{u} - u_{true}||}{||u_{true}||}   (3.8)

Note that when we omit the subscript from vector norms, we are referring to the L2-norm.

For both forward and inverse prediction, it is useful to have a measure of the error between poses.
This is obviously necessary to assess the quality of forward predictions, which predict a pose given a set of muscle activations. However, it will also be a useful metric for inverse prediction. Recall that for control problems, we want the plant to exhibit some desired behaviour. In this case, we want to be able to use inverse predictions to mimic poses. So, if the simulator maps u_{true} \to P_{true}, and our inverse predictor estimates an activation \hat{u} corresponding to P_{true}, we can compute the pose that results from applying \hat{u} as a simulator input: \hat{u} \to \hat{P}. We then assess the quality of the choice \hat{u} by computing the error between P_{true} and \hat{P}. We denote this quantity by Err_{pose}(\hat{P}, P_{true}).

To describe how we compute Err_{pose}, recall that we perform all pose predictions on the 163 points in the (2D) physical atlas space. Since we would like a notion of relative error, rather than absolute error, we ran the simulator with 500,000 random activations and approximate the range of each point as follows:

• Define point i on the mesh by p_i, for 1 \le i \le 163, with coordinates [p_{i,(1)}, p_{i,(2)}]
• Over the 500,000 trials, obtain minimum and maximum observed values for each coordinate: \{p^{min}_{i,(1)}, p^{max}_{i,(1)}, p^{min}_{i,(2)}, p^{max}_{i,(2)}\}
• Define r(p_i) = ||p^{max}_{i,(1)} - p^{min}_{i,(1)},\; p^{max}_{i,(2)} - p^{min}_{i,(2)}||

114 of the 163 points were near-stationary (defined as having r(p_i) < 0.05). To avoid artificially low error estimates, we only compute the pose error over the 49 points with a higher range of movement.
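The per-point range normalization just described, together with the average and maximum relative errors over the non-stationary points, can be sketched as follows (illustrative Python; the thesis code is in MATLAB, and these function names are assumptions):

```python
import numpy as np

def point_ranges(poses):
    """poses: (n_trials, n_points, 2) array of atlas-space configurations.
    r(p_i) is the norm of the per-coordinate (max - min) movement range."""
    span = poses.max(axis=0) - poses.min(axis=0)        # (n_points, 2)
    return np.linalg.norm(span, axis=1)                 # (n_points,)

def pose_errors(P, P_hat, r, r_min=0.05):
    """Average and maximum per-point error relative to each point's range,
    computed only over points with r(p_i) >= r_min (the non-stationary ones)."""
    keep = r >= r_min
    rel = np.linalg.norm(P_hat[keep] - P[keep], axis=1) / r[keep]
    return rel.mean(), rel.max()
```

Excluding the near-stationary points in `pose_errors` mirrors the motivation above: a point that barely moves would otherwise contribute a deceptively small relative error.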
These points are shown in Figure 3.3.

Given two poses P = [P_1; \ldots; P_{163}] and \hat{P} = [\hat{P}_1; \ldots; \hat{P}_{163}], let \tilde{P} = [\tilde{P}_1; \ldots; \tilde{P}_{49}] and \tilde{\hat{P}} = [\tilde{\hat{P}}_1; \ldots; \tilde{\hat{P}}_{49}] denote the respective reduced matrices consisting of the 49 non-stationary points.

We define the average relative per-point error by:

Err^{avg}_{pose}(P, \hat{P}) = \frac{1}{49} \sum_{i=1}^{49} \frac{||\tilde{\hat{P}}_i - \tilde{P}_i||}{r(\tilde{p}_i)}   (3.9)

And the maximum relative per-point error by:

Err^{max}_{pose}(P, \hat{P}) = \max_{1 \le i \le 49} \frac{||\tilde{\hat{P}}_i - \tilde{P}_i||}{r(\tilde{p}_i)}   (3.10)

3.2.2 Forward Prediction

Here we attempt to learn the mapping from input activations to resulting poses. Since kriging assumes a scalar output, we build a separate predictor for each dimension of the pose. This is a valid method for multidimensional regression, but it ignores relationships between different dimensions of P. We could probably achieve better results with methods that learn in parallel for different dimensions of vector output [10], but such methods are beyond the scope of this thesis.

Figure 3.3: Depiction of the 49 low-resolution mesh points for which r(p_i) > 0.05. The black circles denote the points, with larger circles representing points with proportionally larger ranges of movement. The coloured areas of the mesh represent the locations of the different muscle sheets.

Manual correlation parameters – results
We use an exponential correlation function with parameters \theta_i = 2 for all i, and \rho = 2. Using 350 training points in a Latin hypercube design and evaluating on 50 testing points, the mean value of the average pose error over the testing points was 0.1093 and the mean value of the maximum pose error was 0.2478. This means that, for the average pose in the test set, each point of the predicted pose deviated by 10.93% (relative to its full movement range) on average, and the worst point deviated by 24.78%. A graphical depiction of a specific example from the testing set is given in Figure 3.4a.

Figure 3.4: An example of forward prediction with manually selected correlation parameters (a) and MLE correlation parameters (b). The true pose is shown in black and the predicted poses are overlaid in red. (a) The average relative pose error is 0.062 and the maximum error is 0.3152 – notice the large deviations in the bottom right. (b) The average pose error is 0.0107, and the maximum is 0.0570.

Table 3.1: Mean pose errors over the test set for forward prediction with manual and MLE correlation parameters. Reported errors represent the average and maximum relative errors per mesh point in the testing set.

                          Pose Error
                       Average   Maximum
  Manual parameters     0.1093    0.2478
  MLE parameters        0.0246    0.060

MLE correlation parameters – results
We use the same training and testing data here as for the previous example. The MLE gives correlation parameters \theta_i = 0.669 for all i, and \rho = 1.95. Unsurprisingly, these parameters result in better prediction accuracy than the manual correlation parameters: average and maximum per-point errors of 0.0243 and 0.06, respectively. For an illustration, see Figure 3.4b.

3.2.3 Inverse Prediction

Here we deal with the control (or inverse prediction) problem: we wish to learn the mapping from poses P to activations u. As in the previous section, we gather our training data in a Latin hypercube over the activation space.

We use only the 49 non-stationary mesh points for prediction, and we standardize each dimension so that all observations cover the range [0,1]. As before, the quantity we are predicting is a vector (in this case, the 7-dimensional activation), so we fit a separate kriging predictor to each dimension of the activation. Because we have 98 input dimensions (49 points times 2 dimensions per point), this could make it challenging to find the MLE correlation parameters.
So, we make the simplifying assumption that, for each of our seven predictors, \theta_i = \theta for all i – in other words, we assume that the vector \theta is constant across all 98 dimensions of the pose.

This means we have seven predictors (one for each dimension u_i), each with two correlation parameters \theta and \rho. The respective MLE values of \theta for the seven predictors are [2.02, 2.02, 9.89, 3.03, 2.63, 10.30, 2.02] and the corresponding values of \rho are [0.74, 1.1, 2, 0.84, 0.84, 2, 1.18].

To test our predictors, we draw 50 activations from a random uniform distribution and apply them to the simulator to obtain the corresponding poses. We apply the predictor to the poses to obtain predicted activations and clamp the results so they are all in the legal range [0,1]. We then measure both how close the predicted activations are to the true activations (activation error), as well as how closely the poses resulting from applying the predicted activations match the target poses in the testing set (pose error). The results are:

• Average relative activation error: 0.1423
• Average relative per-point pose error (mean over points in test set): 0.0183
• Maximum relative per-point pose error (mean over points in test set): 0.1132

At first glance, these results seem satisfactory. When we mimic a target pose, the average point on the mesh deviates by 1.8%, and in typical cases, no single point is off by more than 11.3%. There is, however, one class of pose on which this method performs quite poorly: poses achieved with mostly saturated activations – i.e. with most or all values of u very close to 0 or 1. This is perhaps to be expected, as these points are at the boundaries of the input space, and therefore at or beyond the boundaries of the training set.

To illustrate this point, let us consider the same inverse predictor obtained earlier, but instead of testing it on poses resulting from random activations, we will test it on:

1. the 7 possible ways to fully activate one muscle while setting the others to 0, and
2. the 21 possible ways to fully activate two muscles while setting the others to 0.

In this case, the results are:

• Average relative activation error: 0.4856
• Average relative per-point pose error (mean over points in test set): 0.2862
• Maximum relative per-point pose error (mean over points in test set): 0.7394

Clearly, performance is much worse than in the random case. Consider the example given in Figure 3.5, in which the target is obtained by fully activating the corrugator and right lower OOM. The correct activation to generate this pose is [0,0,1,0,0,0,1], while the inverse predictor generates the activation [0,1,1,1,0.39,0,1]. The average and maximum per-point errors in the pose are 0.6363 and 0.9999, respectively. That second number merits some attention: recall that the maximum per-point error is a measure of how far off-target the worst point on the mesh is, relative to its range of motion. The fact that this number is 0.9999 tells us that there is at least one point in the mesh that is as far away from its target position as it could possibly be.

3.3 Combining Regression and Local Search

The poor performance of our control method on poses generated by boundary cases in the activation space tells us that we need to improve the prediction method of Section 3.2. However, with any experimental design we select, points near the boundary of the input space will always have less information for an interpolation method (such as kriging) to use to make a prediction. If we want adequate boundary performance without covering the entire boundary during training, we need to move beyond standard supervised learning methods.
Hence, this section presents a method for adding local optimization to the regression method of Section 3.2 to improve control performance.

3.3.1 Method

Problem formulation
To borrow a term from reinforcement learning, we define our policy, denoted by U, by the choice of activations: U = \{u_1, \ldots, u_7\}. Let P_{target} denote our target pose, and P_U denote the pose obtained by running the simulator with activation U. We then define our value function, which we seek to maximize, by:

V_U = -\frac{1}{2} \left( Err^{avg}_{pose}(P_U, P_{target}) + Err^{max}_{pose}(P_U, P_{target}) \right)   (3.11)

with Err^{avg}_{pose}(P_U, P_{target}) and Err^{max}_{pose}(P_U, P_{target}) denoting the average and maximum per-point errors, defined in Equation 3.9 and Equation 3.10, respectively.

Figure 3.5: Inverse prediction on the pose generated by full activation of the corrugator and right lower OOM. (a) Target position. (b) Position obtained by the predictor. (c) Overlay of the two, with the target in black and the predicted position in red.

Gradient descent
To compute a policy U that maximizes the value function, we adapt Kohl and Stone's [38] method to estimate the policy gradient. This is the gradient of the value function with respect to the parameters of U; once it is computed, we can use gradient descent to improve our estimate.
We begin with an initial activation U = \{u_1, \ldots, u_7\}, obtained by kriging. We then estimate the partial derivative of V_U with respect to each parameter. We do this by first evaluating t randomly generated policies \{W_1, W_2, \ldots, W_t\} near U – specifically, each W_i = \{u_1 + \Delta_1, \ldots, u_7 + \Delta_7\}, where each \Delta_j is chosen randomly to be either -\epsilon_j, 0, or +\epsilon_j. We require all u_j to be in the range [0,1], so if u_j < \epsilon_j, we choose \Delta_j from \{0, +\epsilon_j, -u_j\}, and if u_j > 1 - \epsilon_j, we choose \Delta_j from \{-\epsilon_j, 0, 1 - u_j\}. This ensures that all dimensions of U remain in the legal range (note that, since our kriging predictor clamps all values to lie in [0,1], we are guaranteed to begin our iterations with legal values of U).

After evaluating the value V_{W_i} of each W_i, we group each W_i into one of three sets for each dimension (n = 1, \ldots, 7):

W_i \in \begin{cases} S_{+\epsilon,n} & \text{if the } n\text{th parameter perturbation of } W_i \text{ is positive} \\ S_{+0,n} & \text{if the } n\text{th parameter perturbation of } W_i \text{ is } 0 \\ S_{-\epsilon,n} & \text{if the } n\text{th parameter perturbation of } W_i \text{ is negative} \end{cases}   (3.12)

We then compute the average scores Avg_{+\epsilon,n}, Avg_{+0,n} and Avg_{-\epsilon,n} of S_{+\epsilon,n}, S_{+0,n} and S_{-\epsilon,n}, respectively, to give us an estimate of the value of altering the nth parameter by \epsilon. We then use the average scores to compute a 7-dimensional adjustment vector D where:

D_n = \begin{cases} 0 & \text{if } Avg_{+0,n} > Avg_{-\epsilon,n} \text{ and } Avg_{+0,n} > Avg_{+\epsilon,n} \\ Avg_{+\epsilon,n} - Avg_{-\epsilon,n} & \text{otherwise} \end{cases}   (3.13)

In Kohl and Stone [38], the adjustment vector was normalized, multiplied by a scalar stepsize parameter \eta, and then added to the policy. However, we found that this often led to a policy with a lower value V_U than the previous one and sometimes prevented the iterations from converging to a suitable value of U. So, we added a check to see whether the value of U + \eta \frac{D}{||D||} is higher than the value of U. If so, we accept the update, set U \leftarrow U + \eta \frac{D}{||D||}, and continue with the iterations.

Table 3.2: Mean pose errors over 28 boundary points for the regression method, policy gradient (PG) with random starts, and regression combined with policy gradients. Reported errors represent the average and maximum relative errors per mesh point in the testing set.

                                          Pose Error
                                       Average   Maximum
  Regression only                       0.2862    0.7394
  Local search only (random starts)     0.0156    0.0316
  Regression + Local search             0.0003    0.0028
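The perturbation-and-averaging gradient estimate of Equations 3.12-3.13 can be sketched as follows (illustrative Python, not the thesis's MATLAB code; `estimate_adjustment` and its arguments are assumed names, and boundary handling is done by clipping):

```python
import numpy as np

def estimate_adjustment(U, value_fn, eps, t=30, rng=None):
    """Perturbation-based policy-gradient step direction (after Kohl & Stone).

    Evaluates t random policies W_i = U + Delta, with each Delta_j drawn from
    {-eps_j, 0, +eps_j} (clipped so W_i stays in [0, 1]), then builds the
    adjustment vector D from the per-dimension average scores (Eq 3.13).
    """
    rng = np.random.default_rng(rng)
    n = len(U)
    signs = rng.integers(-1, 2, size=(t, n))            # entries in {-1, 0, +1}
    W = np.clip(U + signs * eps, 0.0, 1.0)              # keep policies legal
    V = np.array([value_fn(w) for w in W])
    D = np.zeros(n)
    for j in range(n):
        plus, zero, minus = signs[:, j] > 0, signs[:, j] == 0, signs[:, j] < 0
        avg_plus = V[plus].mean() if plus.any() else -np.inf
        avg_zero = V[zero].mean() if zero.any() else -np.inf
        avg_minus = V[minus].mean() if minus.any() else -np.inf
        if avg_zero > avg_plus and avg_zero > avg_minus:
            D[j] = 0.0                                  # no clear improvement
        else:
            D[j] = avg_plus - avg_minus
    return D
```

With enough perturbations, the contributions of the other dimensions average out within each group, so D points uphill in value; the larger t is, the less noisy the estimate.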
However, if the value function decreases after adding the displacement, we reject the proposed update – in other words, we implement greedy policy iterations by choosing the action (acceptance or rejection of a proposed update) that leads to the higher value V_U. We attempt another update after decreasing \eta by a factor of \sqrt{2}, to account for the possibility that the stepsize is too large. Once we find a D that improves the results, we reset \eta to its original value. If we have 10 consecutive failed updates, we stop the algorithm. For pseudocode, see Algorithm 1.

Algorithm 1 Gradient descent for inverse prediction
 1: U ← initial activation
 2: nIter ← 0
 3: η ← η₀                       ▷ Set to initial stepsize value
 4: nFails ← 0                   ▷ Number of consecutive rejected updates
 5: while nIter < maxIter and nFails < maxFails do
 6:   {W₁, W₂, ..., W_t} ← t random perturbations of U
 7:   for n = 1 to dim(U) do
 8:     Avg₊ε,n ← average score for the W_i with a positive perturbation in dimension n
 9:     Avg₋ε,n ← average score for the W_i with a negative perturbation in dimension n
10:     Avg₊₀,n ← average score for the W_i with a 0 perturbation in dimension n
11:     if Avg₊₀,n > Avg₊ε,n and Avg₊₀,n > Avg₋ε,n then
12:       Dₙ ← 0
13:     else
14:       Dₙ ← Avg₊ε,n − Avg₋ε,n
15:     end if
16:   end for
17:   D ← (D / ||D||) ∗ η
18:   if V_{U+D} < V_U then       ▷ Update worsens policy
19:     nFails ← nFails + 1
20:     η ← η / √2
21:   else
22:     U ← U + D                 ▷ Accept update
23:     nFails ← 0                ▷ Reset failure counter
24:     η ← η₀                    ▷ Reset stepsize
25:   end if
26:   nIter ← nIter + 1
27: end while

3.3.2 Results

We perform gradient descent updates on the 28 boundary cases discussed in Section 3.2.3, with parameters t = 30, η₀ = 0.1, and ε_i = 0.01 for all i. Prediction accuracy increases considerably (see rows 1 and 3 of Table 3.2); in fact, for 20 of the 28 cases, the local search converges exactly to the correct activations. This is the case with the example given in Figure 3.5, which reaches the correct activation (and therefore zero mesh error) after 114 iterations (see Figure 3.6).

Do we need the regression step?
Given the effectiveness of the local optimization, a natural question to ask is whether we even need to perform the regression step first. To investigate, we compare the performance of the local search method with random initial policies to its performance when starting with the values obtained by kriging. We compare on the same set of 28 test points used earlier.

We start by considering the pose errors for both methods. Results are shown in Table 3.2. As we can see, local search with random starts performs reasonably well – still better than regression without further optimization – but the error is higher than when we combine both methods.

Another consideration is time to convergence. The simulator and learning algorithms were implemented in MATLAB and run on a PC with an AMD A10-6800K 4.10 GHz processor and 10 GB of memory. Each gradient descent update requires computing the objective function for 30 different activations, which involves computing the mesh position for a given activation 30 times. This calculation takes between 1 and 2 seconds, so the updates are somewhat expensive. With random starts, the gradient descent method required an average of 128.61 iterations, taking 35.68 minutes; with the kriging starts, it required 73.57 iterations and 19.98 minutes. So, we gain a significant efficiency boost with the kriging starts. Given that it only takes 75 seconds to build a predictor and 0.6 seconds to predict the value for a new point, the kriging starts are certainly worth our time.

Figure 3.6: Example of local search updates on an initial policy obtained by regression. (a) Plot of relative activation errors.
(b) Plot of the mean of the average and maximum per-point pose errors (the negative of the value function).

Chapter 4
Reaching Control

This chapter deals with reaching control for a model of the human index finger. We frame this as an equilibrium control problem, which we discussed in Section 2.2.1. To recap, an equilibrium point is a state-activation pair \{x_{eq}, u_{eq}\} with the property that f(x_{eq}, u_{eq}) = x_{eq}. In other words, the activation u_{eq} keeps the system in equilibrium at the point x_{eq}. With equilibrium control, we attempt to reach the state x_{eq} by applying the constant activation u_{eq}, with the rationale that an activation that stabilizes the plant at a given point may be all we need to drive the plant to that point from an arbitrary starting state.

This chapter is similar to Chapter 3 in that we obtain a mapping from activations to final (equilibrium) configurations, and we don't consider the dynamics at intermediate points. In this chapter, we deal with supervised learning approaches for reaching control. We will deal with general-purpose methods for more complicated control tasks in Chapter 5.

4.1 Plant Description: Human Index Finger

The plant we use in this chapter is a biomechanical simulation of a human index finger, described in detail in [41]. It was originally created by Duo Li and Shinjiro Sueda under the supervision of Dinesh Pai, and developed further by Prashant Sachdeva [60]. It is a tendon-based simulator built using the Strands framework [72], in which bones are rigid bodies controlled by tendons, which are represented by thin strands. The motion of the tendons is constrained by the network of sheaths and pulleys present in the hand. These constraints are handled using the approach of Sueda et al.
[73].

Model description
The model (shown in Figure 4.1) has four bones: the metacarpal bone; the proximal phalanx, which is connected to the metacarpal bone via the metacarpophalangeal (MCP) joint; the middle phalanx, which is connected to the proximal phalanx by the proximal interphalangeal (PIP) joint; and the distal phalanx, connected to the middle phalanx by the distal interphalangeal (DIP) joint. The plant has four degrees of freedom: flexion-extension for all joints, and abduction-adduction for the MCP. The finger is controlled by the activation of six muscles: the distal interosseous (DI), extensor digitorum communis (EDC), flexor digitorum profundus (FDP), flexor digitorum superficialis (FDS), lumbrical, and palmar interosseous (PI). Thus, the activation for the plant is given by a six-dimensional vector u, where 0 \le u_i \le 1 for i = 1, \ldots, 6.

Figure 4.1: Biomechanical model of the human index finger.

Muscle model
The force exerted by a muscle is assumed to be a sum of the passive and active forces: f = f_P + f_A. The muscle model relates the force at the origin of the tendon to the tendon excursion (i.e. the displacement of the tendon origin). Passive and active forces are modelled as piecewise linear functions [60], as shown in Figure 4.2. The passive force consists of a horizontal line concatenated with a positive-sloping line beginning at 0 force, and the active force is the piecewise linear curve shifted by an amount proportional to the muscle's activation. Thus, the muscle model is equivalent to that of the skin simulator of Chapter 3, but with the simplifying assumption of piecewise linearity.

Figure 4.2: Passive and active muscle force-length (FL) curves. Left: passive force, given by a piecewise linear curve.
Right: active force, given by shifting the piecewise linear curve leftward according to the muscle activation u.

4.2 Task Description

Configuration variable
In all index finger control examples in this and the following chapter, our configuration variable q is chosen to be the 3-dimensional fingertip position, as in the work of Sueda et al. [72] and Fain [20]. Unlike joint angles or tendon excursions, which were used for control of the ACT hand in Deshpande et al. [18] and Malhotra et al. [46], respectively, the fingertip position doesn't fully describe the finger's configuration. There is, however, some physiological basis for this choice of configuration variable, as evidence suggests that humans do use fingertip position directly for reaching movements [4]. This choice of configuration variable has the added advantage of making it more intuitive to specify controller goals.

Training and testing sets
In the reaching control problem, we learn the mapping from configurations q to activations u. That is, for a target fingertip position, we estimate the activation that, when applied constantly to the plant, will take us there.

Our training data for this problem consists of 39,000 activation-configuration pairs \{u^{(i)}, q^{(i)}\}. We obtained these by sampling from a uniform grid in the activation space, and we applied each activation u^{(i)} to the plant until it reached a stable configuration q^{(i)}. A scatter plot of the fingertip positions in the training data set is shown in Figure 4.3. Note the clear non-convexity of the configuration space: this could cause problems for some supervised learning algorithms. An illustration of different plant configurations is shown in Figure 4.4.

Figure 4.3: 3-dimensional scatter plot of plant configurations (i.e. fingertip positions) in the training data set for the reaching control problem.
Axis units are in centimetres.

Figure 4.4: Different configurations of the index finger, with the plant's trajectory from the initial configuration shown in red. (a) full activation of the DI. (b) full activation of the EDC. (c) full activation of the FDS. (d) full activation of the EDC and half activation of the lumbrical.

To assess the quality of a predicted activation \hat{u} for a target configuration q_r, we apply \hat{u} to the plant until stabilization and measure the Euclidean distance (in centimetres) of the plant configuration from q_r.

For our test set, we would like plant configurations that are (a) not in the training set, and (b) in the space of reachable configurations. To achieve a high probability of a testing point meeting both criteria, we first choose a random configuration in the training set. We then take the average of this configuration and the next three configurations in the training data. Because of the order in which we performed the sampling, consecutive points in the training data are generally close together in activation space. This implies that the corresponding configurations are probably close together as well, so the average of the four points is likely to be a reachable configuration. Because of the non-convexity of the configuration space, we would have no such guarantee if we took the average of completely random points (with consecutive points, we still don't have a guarantee, but we have a higher probability).

We gather 500 such points to create the test set. The training and testing data are the same for all experiments conducted in this chapter.

For all methods, we compare the build time for the data structure and the query time to predict the activation for a single point.
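The test-set construction just described can be sketched as follows (illustrative Python; the thesis code is in MATLAB, and the array and parameter names are assumptions):

```python
import numpy as np

def make_test_targets(Q_train, n_targets=500, window=4, rng=None):
    """Build test configurations likely to be reachable: average each randomly
    chosen training configuration with the next `window - 1` samples, which
    are close in activation space because of the grid ordering of the data."""
    rng = np.random.default_rng(rng)
    starts = rng.integers(0, len(Q_train) - window + 1, size=n_targets)
    return np.stack([Q_train[s:s + window].mean(axis=0) for s in starts])
```

The averaging window stays inside the array bounds by construction, so every target is a mean of exactly `window` consecutive training configurations.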
All learning algorithms were implemented in MATLAB and run on a commodity PC with an Intel Core i5 3570 processor and 16 GB of memory.

4.3 k-Nearest-Neighbours

4.3.1 Method

To estimate the activations leading to a target point q_r, we begin by computing q_r's k nearest neighbours in the training set, \{q^{(1)}_{nn}, \ldots, q^{(k)}_{nn}\}. We predict the \hat{u} that leads the plant to q_r with a weighted average of the activations \{u^{(1)}_{nn}, \ldots, u^{(k)}_{nn}\} corresponding to the configurations \{q^{(1)}_{nn}, \ldots, q^{(k)}_{nn}\}. The weight of each activation u^{(i)}_{nn} is given by the reciprocal of the distance between q_r and q^{(i)}_{nn}. In the case where the distance between q_r and its nearest neighbour q^{(1)}_{nn} is 0 – i.e. when q_r is in the training data – we simply set \hat{u} = u^{(1)}_{nn}. Hence, this method interpolates the training data.

Locality-sensitive hashing
We improve the efficiency of our k-nearest-neighbour search using Locality-Sensitive Hashing (LSH) algorithms [34, 67]. This approach has space and query time that are both linear in the number of points to search.

These algorithms are based on locality-sensitive hash functions, which possess the property that, for any two points p, r \in \mathbb{R}^d:

1. If ||p - r|| \le D then Pr[h(r) = h(p)] \ge P_1
2. If ||p - r|| \ge cD then Pr[h(r) = h(p)] \le P_2

where P_1 > P_2 and c \ge 1. That is, nearby points collide with high probability, while distant points collide with low probability.

We concatenate several LSH functions to amplify the gap between P_1 and P_2. For parameters m and L, we choose L functions

g_j(r) = (h_{1,j}(r), \ldots, h_{m,j}(r))   (4.1)

where the hash functions h_{l,j} (1 \le l \le m, 1 \le j \le L) are chosen at random from a family of LSH functions.

We then construct a hash table structure by placing each point q^{(i)} from the input set into a bucket g_j(q^{(i)}), for j = 1, \ldots, L. To process a query point q_r, we scan through the buckets g_1(q_r), \ldots, g_L(q_r) and retrieve the points stored in them.
For each retrieved point, we compute the distance between it and q_r, and we report the k closest points.

For our hash functions, we project a point q onto a random 1-dimensional line in \mathbb{R}^d, which is then partitioned into uniform segments. The bucket into which a particular point is hashed corresponds to the index of the segment containing it. For Euclidean spaces, these functions are locality-sensitive; for proof, we refer the reader to [1]. The algorithm parameters we use are m = 20, L = 30, and varying values of k.

4.3.2 Results

We examined the performance of k-nearest-neighbours for k = 1, 4, 7, \ldots, 34 on the training and testing datasets described in Section 4.2. We build a single LSH table that is used for all the predictors; the build time was 2.20 seconds. Summary statistics of the k-nearest-neighbours predictor performance are shown in Table 4.1. Surprisingly, prediction accuracy is highest with a single nearest neighbour; we discuss why this may be the case in a later section.

4.4 Neural Networks

4.4.1 Method

We now attempt our reaching task using a feedforward neural network with Levenberg-Marquardt backpropagation [27]. This method is designed to approach second-order training speed without having to compute the Hessian matrix. The Hessian is approximated as H = J^T J, and the network gradient is computed as g = J^T e.

Table 4.1: Summary statistics for the reaching task using k-nearest-neighbours with locality-sensitive hashing.

  Neighbours  Error (cm)  Query time (s)  Activation size
       1         0.19         0.0058          0.948
       4         0.41         0.0059          0.939
       7         0.40         0.0059          0.942
      10         0.39         0.0058          0.942
      13         0.41         0.0059          0.940
      16         0.40         0.0059          0.941
      19         0.36         0.0061          0.940
      22         0.36         0.0061          0.940
      25         0.36         0.0060          0.940
      28         0.36         0.0062          0.940
      31         0.34         0.0062          0.940
      34         0.36         0.0065          0.939
In Table 4.1, error denotes the average distance (in centimetres) of final configurations from target positions; query time gives the time (in seconds) to compute the predicted activation \hat{u} for a target q_r; and activation size is the average L2-norm of the predicted activations \hat{u}.

Here J is the Jacobian matrix (much simpler to compute than the Hessian), which contains the first derivatives of the network errors with respect to the weights and biases, and e is the vector of network errors.

Let \Theta_k denote the network weight and bias variables at training iteration k. We then use the approximation to the Hessian matrix in the update rule:

\Theta_{k+1} = \Theta_k - [J^T J + \mu I]^{-1} J^T e   (4.2)

The scalar parameter \mu is set to an initial value of 0.001. After a successful step (i.e. one where the sum of the squared network errors decreases), \mu is multiplied by 0.1; after an unsuccessful step, \mu is multiplied by 10.

Training is done in parallel on the GPU. Note that each time the network is trained, it can result in different solutions (and different build times) due to different initial weight and bias values, as well as different divisions of the data into training, validation and test sets.

Table 4.2: Summary statistics for the reaching task using neural networks.

  Hidden layers  Error (cm)  Build time (s)  Query time (s)  Activation size
        1           4.64          2035            0.012           0.820
        2           2.26          3631            0.016           0.921
        3           2.28          3066            0.015           0.988
        4           1.91         10802            0.016           1.011
        5           1.53          1449            0.014           1.045
        6           1.53         10802            0.014           1.059
        7           1.38          5328            0.015           1.103
        8           1.54         10802            0.015           1.093
        9           1.17         10802            0.014           1.116
       10           1.05          3266            0.016           1.121
       11           1.41          7053            0.020           1.155
       12           1.24          7572            0.019           1.160
Error denotes the average distance (in centimetres) of final configurations from target positions; build time gives the neural network build time (in seconds); query time gives the time (in seconds) to compute the predicted activation û for a target q_r; and activation size is the average L2-norm of the predicted activations û.

4.4.2 Results

We tested the performance of neural networks with n = 1, 2, ..., 12 hidden layers. Results are given in Table 4.2. Several of the networks have a build time of 10,802 seconds: this is because we specified one of our stopping criteria for training to be a maximum of 3 hours' (10,800 seconds') training time.

4.5 Random Forests

4.5.1 Method

We now turn our attention to random forest predictors. In a survey of different classification methods, Fernandez et al. [21] found them to be the most effective class of predictor over a wide variety of datasets.

Our focus here is on random forests for regression, which consist of an ensemble of regression trees [29]. Suppose our data consists of n observations {q^(i), u^(i)}, each with three input dimensions q^(i) = {q_1^(i), q_2^(i), q_3^(i)}, and an output consisting of one dimension of the activation, u_k^(i). (Because regression trees assume a scalar output, we fit predictors separately to each dimension of u.)

Regression trees partition the input space into M regions and model the response as a constant, c_m, in each region, i.e.:

    f(q) = Σ_{m=1}^{M} c_m I(q ∈ R_m)        (4.3)

where I is an indicator function for q ∈ R_m.

To minimize the mean-squared prediction error, the best ĉ_m is the average of the responses u_k^(i) corresponding to the observations q^(i) in the region R_m:

    ĉ_m = ave(u_k^(i) | q^(i) ∈ R_m)        (4.4)

Finding the best binary partition in terms of the minimum sum of squares is generally infeasible, so we use a greedy algorithm.
Starting with all the data, we consider a splitting variable j and split point s, and define a pair of half-planes:

    R_1(j, s) = {Q | Q_j ≤ s} and R_2(j, s) = {Q | Q_j > s}        (4.5)

We then seek the splitting variable j and split point s that solve

    min_{j,s} [ min_{c_1} Σ_{q^(i) ∈ R_1(j,s)} (u_k^(i) − c_1)^2 + min_{c_2} Σ_{q^(i) ∈ R_2(j,s)} (u_k^(i) − c_2)^2 ]        (4.6)

For any choice of j and s, the inner minimization is solved by setting ĉ_1 and ĉ_2 according to Equation 4.4. For each splitting variable j, the split point s can be determined easily; so, choosing the optimal j and s at each iteration is computationally feasible.

Next, we must determine how large to grow the tree. A tree that is too large may overfit the data, while one that is too small may be too simple to effectively capture the relationship between inputs and outputs. The most common approach is to stop when some minimum node size is reached [29]. In our case, we choose this value to be 5: in other words, there will be no fewer than 5 observations q^(i) in any region R_j.

Trees   Error (cm)   Build time (s)   Query time (s)   Activation size
10      0.55         4.39             0.11             0.736
20      0.50         7.30             0.18             0.737
30      0.49         11.19            0.28             0.736
40      0.46         14.57            0.37             0.737
50      0.47         18.34            0.47             0.735
60      0.46         22.18            0.58             0.735
70      0.45         25.76            0.68             0.736
80      0.46         29.19            0.77             0.734
90      0.45         33.13            0.88             0.736
100     0.44         36.87            1.01             0.736
110     0.45         40.24            1.07             0.734
120     0.45         46.89            1.21             0.734

Table 4.3: Summary statistics for the reaching task using random forests. Error denotes the average distance (in centimetres) of final configurations from target positions; build time gives the time to build the six random forests (in seconds); query time gives the time (in seconds) to compute the predicted activation û for a target q_r; and activation size is the average L2-norm of the predicted activations û.

A random forest is simply an ensemble of regression trees: the forest's prediction is the average of the trees'.
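As an illustration, the split search of Equations 4.5-4.6 can be sketched as follows (a minimal numpy sketch; `best_split` and the toy data are our own, and for brevity we score single splits with region means rather than growing a full tree to the minimum node size):

```python
import numpy as np

def best_split(q, u):
    """Greedy search of Equation 4.6: scan every splitting variable j and
    candidate split point s, scoring each split by the summed squared error
    when each half-plane predicts its own mean (Equation 4.4)."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(q.shape[1]):                 # candidate splitting variable
        for s in np.unique(q[:, j])[:-1]:       # candidate split points
            left, right = u[q[:, j] <= s], u[q[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s

# toy scalar-output data: the response depends only on the first input
rng = np.random.default_rng(0)
q = rng.uniform(size=(200, 3))
u = (q[:, 0] > 0.5).astype(float)
j, s = best_split(q, u)   # recovers j = 0 with s just below 0.5
```

In a full implementation this search is applied recursively to each resulting region, and the forest averages the outputs of many such trees.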
This will result in estimates with a lower variance than those generated by a single tree.

4.5.2 Results

We tested the random forest predictor for n = 10, 20, ..., 120 regression trees. Numerical results are given in Table 4.3.

4.6 Methods Comparison/Discussion

We compare the best predictors from each of Sections 4.3-4.5 in Table 4.4. In a rather striking illustration of the fact that more sophisticated is not always better, we find that single-nearest-neighbour prediction outperforms all other predictors in terms of both accuracy and efficiency.

What accounts for this surprising fact? Notice that 1-nearest-neighbour is the only method in which the predicted activation u isn't based on some combination of activations for multiple values of q. This could be important for this particular problem: due to the non-convexity of the configuration space, two points q may be located close together in Euclidean space but, if points on the line segment joining them aren't in the legal configuration space, require very different paths to reach. So, two points that are close together may be quite dissimilar in how we can reach them.

Similarly, by basing our prediction on a single observation, we can mitigate the effects of redundant mappings: since we are considering a relationship between 6 muscle activations and a 3-dimensional position, we can assume that there are multiple activations u that map to the same or similar configurations q. (If nothing else, we can achieve this by co-activating agonist-antagonist muscle pairs.)
If two very different activations u^(1), u^(2) result in nearly identical configurations q, but a value u^(3) on the path joining u^(1) and u^(2) maps to a different configuration, this could cause problems for any method that bases its predictions on more than one activation-configuration pair.

The computation of equilibrium controls, which is what we have dealt with in these experiments, forms an important component of the controller discussed in Chapter 5. If prediction accuracy were our only concern, random forests would still be a competitive option. While their average error is consistently higher than for k-nearest-neighbours, the random forests' performance seems to be less affected by the parameter choice (i.e., the number of trees) than k-nearest-neighbours'. However, in the next chapter, we will examine problems where equilibrium controls need to be computed every 3 milliseconds of controlled simulation: in this case, the random forests' 1-second query time is unacceptably slow. Thus, for the remainder of this thesis, nearest-neighbour will be our prediction method of choice.

Predictor   Error (cm)   Build time (s)   Query time (s)
k-NN        0.19         2.20             0.0058
NN          1.05         3266             0.027
RF          0.44         36.87            1.01

Table 4.4: Comparison of the best predictors from each of the k-nearest-neighbours, neural network, and random forest experiments; parameter values for the best predictor in each class are 1 neighbour, 10 hidden layers, and 100 trees, respectively. Error denotes the average distance (in centimetres) of final configurations from the target positions; build time gives the time (in seconds) to build the data structure; and query time gives the time (in seconds) to compute the predicted activation û for a target q_r.

Chapter 5

Trajectory Tracking with a Human Index Finger

This chapter deals with more sophisticated control mechanisms for the index finger model introduced in Chapter 4.

Section 5.1 begins with an implementation of an existing controller, developed by Malhotra et al. [46].
Sections 5.2-5.5 each present an additional component to improve the performance of this controller, and Section 5.6 ends with a discussion of the results.

We focus in this chapter on two fingertip precision tasks. The simpler of the two involves tracing a circle with the fingertip constrained to lie in the q_1 q_2 plane: this is similar to drawing on a flat horizontal surface, such as a tablet. The more difficult task involves tracing a circle in free space. In this case, the target is a circle in the q_1 q_3 plane but, unlike the planar writing task, the position of the fingertip is unconstrained. The performance of each control method is given in its corresponding subsection, and tracking errors for all methods are summarized in Table 5.1 (for 2D tracking) and Table 5.2 (3D tracking).

5.1 Existing Implementation: Baseline Hierarchical Control

5.1.1 Method

The controller in this section – which we will call the baseline controller – was developed by Malhotra et al. [46]. It is a hand-tuned feedback controller with a hierarchical formulation. The high-level controller computes the desired change in plant velocity based on the instantaneous movement errors, and the low-level controller estimates activations to yield the velocity change sought by the high-level controller.

High-level controller. Suppose that, at time t, our plant has configuration (q_t, q̇_t) – comprised of fingertip position and velocity – and a target (or reference) configuration (q_r, q̇_r). The desired change in plant velocity is given by the feedback command, defined as:

    v_b = K_P (q_r − q_t) + K_D (q̇_r − q̇_t)        (5.1)

where K_P and K_D are manually chosen scalar gains. For this problem, we found that controller performance was sensitive to the choice of gains and that the selection process was non-trivial. We implemented a procedure to automatically select K_P and K_D off-line from a user-specified set of possibilities, based on which values gave the best performance on a trial task.
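This off-line selection can be sketched as a simple grid search (a hypothetical sketch: `trial_error` stands in for running the controller on the trial task and reporting its tracking error, and the candidate grids are user-specified):

```python
import numpy as np

def feedback_command(q_t, qdot_t, q_r, qdot_r, KP, KD):
    """Velocity command of Equation 5.1."""
    return KP * (q_r - q_t) + KD * (qdot_r - qdot_t)

def select_gains(trial_error, KP_grid, KD_grid):
    """Off-line gain selection: run the trial task for every candidate
    (KP, KD) pair and keep the pair with the lowest tracking error."""
    scores = {(kp, kd): trial_error(kp, kd) for kp in KP_grid for kd in KD_grid}
    return min(scores, key=scores.get)

# stand-in for the trial task: pretend the error surface bottoms out at (2.0, 8.0)
trial = lambda kp, kd: (kp - 2.0) ** 2 + (kd - 8.0) ** 2
best = select_gains(trial, KP_grid=[1.0, 1.9, 2.0], KD_grid=[6.0, 7.6, 8.0])
```

The feedback command itself (Equation 5.1) is the one-line `feedback_command` above; only the gains are tuned off-line.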
The chosen values were K_P = 1.9 and K_D = 7.6 · Δt/2, where Δt is the controller timestep. The Δt/2 factor brings the position and velocity terms to a common scale: for a body uniformly accelerated from rest for time Δt, the magnitude of the instantaneous velocity is 2/Δt times the magnitude of the change in position.

Low-level controller. Given a velocity command v_b computed by the high-level controller, the low-level controller computes an activation u_t to produce the desired change in plant velocity.

To describe how this is done, we first define the notion of the Activation-Velocity Matrix (AVM), denoted by R. This is a dim(q) × dim(u) matrix that maps activations to corresponding changes in plant velocity: R · u_t → Δq̇_t.

The controller computes u_t by the constrained least-squares equation:

    u_t = argmin_u ||R · u − v_b||^2
    such that 0 ≤ u ≤ 1        (5.2)

AVM computation. We compute R using the identification method of [46]. Starting from the plant's initial configuration, we compute the change in velocity resulting from fully activating one muscle at a time. The change in velocity given by fully activating u_i (the ith dimension of u) becomes the ith column of R.

Algorithm 2 Identification method for AVM estimation
1: f ← function such that f((q_t, q̇_t), u_t) → (q_{t+1}, q̇_{t+1})
2: q_init ← arbitrary initial configuration
3: (q^(0), q̇^(0)) ← f((q_init, q̇_init), 0)        ▷ Apply 0 muscle activation
4: for i = 1 to dim(u) do
5:     u^(i) ← vector with value 1 at index i, 0 elsewhere
6:     (q^(i), q̇^(i)) ← f((q_init, q̇_init), u^(i))
7:     R[i] ← q̇^(i) − q̇^(0)        ▷ ith column of R gets difference in velocities
8: end for

5.1.2 Results

2D tracing. We begin with trajectory tracking constrained to a plane. A screenshot of this is shown in Figure 5.1a. A plot of the trajectory followed compared to the target trajectory is given in Figure 5.2a, with corresponding muscle activation levels shown in Figure 5.2b.

There is a striking translational error in the 2D tracking experiment.
This error could come from one of two sources: the feedback command (see Equation 5.1), which would suggest a poor choice of gain parameters; or the activation computation (Equation 5.2), which would suggest a poor choice of AVM. To investigate this, see the close-up screenshot in Figure 5.1c: the red ball represents the target point and the blue line represents the high-level controller's desired change in position for the next timestep, computed as follows.

Figure 5.1: Screenshots for the baseline controller. (a) tracing a circle in the plane. (b) tracing a circle in free space. (c) close-up of the planar tracing task: the red dot shows the target position and the blue line represents the high-level controller's desired change in position, given by Δt(q̇_t + v_b).

The controller's desired plant velocity over the next timestep is equal to the current velocity q̇_t plus the velocity command v_b. Note that this desired plant velocity is determined by the current error in the plant configuration (see Equation 5.1), and as such is not generally equal to the reference velocity q̇_r. We multiply the desired velocity q̇_t + v_b by Δt to get the position to which the high-level controller is attempting to guide the plant, and illustrate this attempted correction with the blue line.

From the screenshot, we see that the feedback command is trying to compensate for the positional error: the blue line pointing to the (plant's) right is present for the duration of the experiment.
In other words, the high-level controller is (appropriately) instructing the low-level controller to produce activations that bring the plant to the right, but the low-level controller is producing the wrong activations. The AVM we've computed is therefore not a good choice for this region of the configuration space; this is the motivation for Section 5.2, in which we make the AVM configuration-dependent.

The average tracking error per trajectory point in this experiment was 0.13 centimetres, and the maximum error at any single point was 0.21 centimetres.

3D tracing. Next, we considered tracking a circular trajectory in 3-dimensional space, without constraining the position of the fingertip. Unsurprisingly, the results are much less smooth than for the planar example: a screenshot is shown in Figure 5.1b, and plots of the followed trajectory and muscle activation are shown in Figures 5.2c and 5.2d, respectively. The controller gave an average per-point tracking error of 0.17 centimetres and a maximum error of 0.55 centimetres.

5.2 AVM Estimation

5.2.1 Method

In this section, we make the AVM configuration-dependent: in other words, instead of having a matrix R, we have a matrix R(q_t). As before, we estimate the AVM at a given configuration using the identification method (Algorithm 2). But, unlike the previous section, in which we estimated R at one point and used this as the AVM over the entire configuration space, we now estimate R(q_t) at multiple design points and interpolate the value at non-design points. We will call this controller

Figure 5.2: Depiction of circle-tracing results for the baseline controller. (a) trajectory tracked for 2D experiments (as viewed from above): the target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red.
(b) muscle activations for 2D trajectory tracking. (c-d) trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

the variable AVM controller.

Specifically, we estimate R(q^(i)) for 154 stationary plant configurations Q = {q^(1), q^(2), ..., q^(154)} (see Figure 5.3). To approximate the AVM at a new point q, we find q's three nearest neighbours in Q, {q_nn^(1), q_nn^(2), q_nn^(3)}, and compute a distance-weighted average of the values R(q_nn^(i)), i = 1, 2, 3, with the weights given by the inverse of the distance between q and q_nn^(i). We could use any number of neighbours k; in practice, there is a trade-off between smoothness of activation patterns and tracking accuracy. For smaller k, the controller parameters change more drastically with configuration changes, which can lead to better tracking accuracy but tends to cause jerkiness due to large parameter shifts. We found that three neighbours worked well. The controller can now be interpreted as a linearly blended composition of controllers with different matrices R, each valid in the neighbourhood of its corresponding configuration q [8].

Figure 5.3: 3-dimensional scatter plot of plant configurations (i.e. fingertip positions) at which we compute the AVM. Axis units are in centimetres.

Compared to the control formulation of Section 5.1, the computation of the high-level velocity command is the same as before, but the low-level controller now solves for u_t by:

    u_t = argmin_u ||R(q_t) · u − v_b||^2
    such that 0 ≤ u ≤ 1        (5.3)

5.2.2 Results

2D tracing. A screenshot for the 2D tracing task is given in Figure 5.4a, while Figures 5.5a and 5.5b plot the trajectory and muscle activation levels. Compared to the baseline controller, we have jerkier patterns of movement and muscle activations – which we address in the next section – but improved tracking accuracy, as we have overcome the persistent translational error given by the baseline controller.
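For concreteness, the identification step (Algorithm 2) and the inverse-distance interpolation described above can be sketched as follows (a minimal numpy sketch with a toy linear plant; the function names and the plant are our own, not the thesis implementation):

```python
import numpy as np

def estimate_avm(f, q0, qd0, n_muscles):
    """Algorithm 2: column i of R is the one-step change in plant velocity
    caused by fully activating muscle i, relative to a zero-activation step."""
    _, qd_rest = f(q0, qd0, np.zeros(n_muscles))
    cols = []
    for i in range(n_muscles):
        u = np.zeros(n_muscles); u[i] = 1.0
        _, qd = f(q0, qd0, u)
        cols.append(qd - qd_rest)
    return np.column_stack(cols)

def interpolate_avm(q, design_qs, design_Rs, k=3, eps=1e-9):
    """Distance-weighted average of the AVMs at q's k nearest design points."""
    d = np.linalg.norm(design_qs - q, axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / (d[nn] + eps)          # inverse-distance weights
    w /= w.sum()
    return sum(wi * design_Rs[i] for wi, i in zip(w, nn))

# toy linear plant: one step adds A @ u plus a constant passive drift
A = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])
drift = np.array([0.0, 0.0, -0.0098])
f = lambda q, qd, u: (q, qd + A @ u + drift)

R = estimate_avm(f, np.zeros(3), np.zeros(3), 2)   # recovers A exactly
```

Because the drift affects the zero-activation step and the probe steps equally, it cancels in the column differences; on a real plant, R is only a local linear model, which is exactly why it is re-estimated at many design points.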
The variable AVM controller gives an average trajectory tracking error of 0.042 centimetres and a maximum of 0.14 centimetres.

Figure 5.4: Screenshots for the variable AVM controller. (a) tracing a circle in the plane. (b) tracing a circle in free space.

3D tracing. As we can see from the screenshot in Figure 5.4b and the trajectory plot in Figure 5.5c, the variable AVM controller still encounters difficulty with tracing a circle in free space. It does, however, improve over the baseline controller: the average tracking error was 0.085 centimetres and the maximum was 0.28 centimetres.

5.3 Smoothing

5.3.1 Method

Configuration-dependent AVMs improve the tracking accuracy of the baseline controller, but lead to more oscillatory activation patterns. In this section, we try to remedy this by introducing activation regularization to the low-level controller.

Figure 5.5: Depiction of circle-tracing results for the variable AVM controller. (a) trajectory tracked for 2D experiments (as viewed from above): the target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) muscle activations for 2D trajectory tracking. (c-d) trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

To that end, we replace the constrained least-squares formulation of Equation 5.3 with the formulation used in [72]:

    u_t = argmin_u α ||R(q_t) · u − v_b||^2 + β ||u||^2 + γ ||u − u_{t−1}||^2
    such that 0 ≤ u ≤ 1        (5.4)

where α, β and γ are blending weights. The first term encourages the computed activations to follow the feedback commands. The second term penalizes large activations, and the third penalizes large changes in activations between timesteps. We scale the weights such that α + β + γ = 1. With α = 1, β = γ = 0, this formulation reduces to that of the unsmoothed variable AVM controller.

The smoothing parameters are chosen empirically, using the same automatic selection procedure we used to select the feedback gains in Section 5.1.
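The smoothed low-level solve of Equation 5.4 can be sketched via its normal equations (a hedged sketch: `smoothed_activation` is our own name, the default γ is inferred from the stated constraint α + β + γ = 1, and clipping to [0, 1] is a crude stand-in for the box constraint, which the thesis handles exactly):

```python
import numpy as np

def smoothed_activation(R, v, u_prev, alpha=0.984, beta=0.014, gamma=0.002):
    """Regularized low-level solve of Equation 5.4 via its normal equations:
    (alpha R^T R + (beta + gamma) I) u = alpha R^T v + gamma u_prev.
    Clipping afterwards only approximates the constraint 0 <= u <= 1."""
    n = R.shape[1]
    A = alpha * R.T @ R + (beta + gamma) * np.eye(n)
    b = alpha * R.T @ v + gamma * u_prev
    return np.clip(np.linalg.solve(A, b), 0.0, 1.0)

# with R = I the solve is directly readable: u = 0.984 * v, then clipped
u = smoothed_activation(np.eye(2), v=np.array([0.5, 2.0]), u_prev=np.zeros(2))
```

With α = 1 and β = γ = 0, this reduces to the unsmoothed solve of Equation 5.3.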
For the remaining demonstrations in this chapter, we use the values α = 0.984, β = 0.014, γ = 0.002.

5.3.2 Results

2D tracing. Here we notice a marked improvement over the baseline and unsmoothed variable AVM controllers. Figures 5.6a and 5.7a show that, apart from a deviation in the bottom right quadrant, the controller is able to follow the circle very closely. The muscle activations produced by this controller are much lower than for the unsmoothed controller: Figure 5.7b shows that only the palmar interosseous (PI) muscle ever has an activation level above 0.4.

The average tracking error for this task was 0.011 centimetres, and the maximum was 0.16 centimetres.

3D tracing. Figures 5.6b and 5.7c illustrate that, while far from perfect, the 3D circle tracing is noticeably improved over that of the previous controllers. Despite the smoothing, the muscle activation patterns (shown in Figure 5.7d) are still fairly saturated, though not to the degree of the unsmoothed controller. The average tracking error was 0.062 centimetres, and the maximum error was 0.31 centimetres.

Figure 5.6: Screenshots for the variable AVM controller with smoothing. (a) tracing a circle in the plane. (b) tracing a circle in free space.

5.4 Gain Scheduling

5.4.1 Method

In this section we implement gain scheduling for the variable AVM controller. Recall that gain scheduling automatically selects configuration-dependent values for the controller gains K_P and K_D in Equation 5.1. The general idea behind the method is similar to the one we used for AVM estimation: we move the plant to different configurations, and at each configuration we compute the optimal set of gain parameters (more on what we mean by "optimal" shortly). Then, during control operations, we compute the gains at each timestep by interpolating the data we obtained during training.
Our approach – analytically computing optimal gains at design points and interpolating them throughout the configuration space – is similar to the methods used in [28, 64, 69], though as far as we know nobody has used our specific method for computing gains at the design points.

Defining optimal gains. In defining optimal gain parameters, we assume that we are computing the feedback command v_b according to Equation 5.1 and the motor command u_t according to Equation 5.3 (in other words, we assume a variable AVM and no smoothing of the activations).

Intuitively, the purpose of the feedback term is to correct deviations of the plant from its desired configuration; thus, the ideal v_b is one that will result in an activation u_t that brings the plant back to its target configuration (q_r, q̇_r).

Figure 5.7: Depiction of circle-tracing results for the variable AVM controller with smoothing. (a) trajectory tracked for 2D experiments (as viewed from above): the target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) muscle activations for 2D trajectory tracking. (c-d) trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

So, for a given (q, q̇), how do we find a K_P and K_D that will give us such a v_b? To approach this problem, we begin with a known, arbitrary activation vector ũ. If we apply that activation to the system for one controller timestep, we obtain the mapping:

    f((q, q̇), ũ) = (q̃, q̇̃)        (5.5)

where f is the function that governs the dynamical system and q̃ is the configuration that results from applying ũ to the system in configuration q for one timestep. Since we now know that ũ will bring the system from (q, q̇) to (q̃, q̇̃), we know that an ideal feedback controller, given a current state (q, q̇) and a target state (q̃, q̇̃), would compute ũ as an activation.
Thus, in this particular case, our ideal feedback function v̂_b (parametrized by the ideal gains K̂_P and K̂_D) will return the value:

    v̂_b := K̂_P (q̃ − q) + K̂_D (q̇̃ − q̇) = R(q) · ũ        (5.6)

Note that we have not accounted for the possibility that there may be more than one activation u that will lead the plant to the target configuration – in particular, by co-activating agonist and antagonist muscles, there could be many so-called "optimal" activations u. However, we would argue that, in such instances, a truly ideal controller would give the solution that activates as few muscles as possible (so as to minimize biomechanical cost). Hence, when we estimate the optimal gains at configurations throughout the state space (a procedure we will describe shortly), we only use values of ũ that activate a single muscle.

Estimating optimal gains for a given configuration. We estimate the optimal gains at a given configuration by starting at that configuration and fully activating one muscle at a time for one timestep (this is similar to our procedure for estimating the AVM). We obtain the next states corresponding to each of these activations and, treating these new states as the target configurations in the feedback parametrization, solve a non-negative least-squares problem for K̂_P and K̂_D. Pseudocode is given in Algorithm 3.

Algorithm 3 Estimation of PD gains
1: Given: current configuration (q, q̇)
2: N ← dim(u)
3: for i = 1 to N do
4:     u^(i) ← vector with value 1 at index i, 0 elsewhere
5:     v̂_b^(i) ← R(q) · u^(i)        ▷ Ideal feedback command for q, u^(i)
6:     (q̃^(i), q̇̃^(i)) ← f((q, q̇), u^(i))        ▷ Result of applying u^(i) for one timestep
7:     Δq^(i) ← q̃^(i) − q        ▷ Store resulting position change
8:     Δq̇^(i) ← q̇̃^(i) − q̇        ▷ Store resulting velocity change
9: end for
10: V̂ ← [v̂_b^(1)T, v̂_b^(2)T, ..., v̂_b^(N)T]^T        ▷ Column vector of concatenated v̂_b^(i)
11: ΔQ ← [Δq^(1)T, Δq^(2)T, ..., Δq^(N)T]^T        ▷ Column vector of concatenated Δq^(i)
12: ΔQ̇ ← [Δq̇^(1)T, Δq̇^(2)T, ..., Δq̇^(N)T]^T
        ▷ Column vector of concatenated Δq̇^(i)
13: K̂_P, K̂_D ← solution of the non-negative least-squares problem V̂ = [ΔQ  ΔQ̇] [K_P; K_D]

Gathering and interpolating gains data. As with AVM estimation, we compute K̂_P, K̂_D at various configurations q and, to estimate these values at a new configuration, we use an inverse distance-weighted average of the k nearest neighbours. However, the crucial difference is that, for AVM estimation, we considered only the position of the plant and ignored the velocity; plant velocity is crucial to PD control, so ignoring it for gain estimation would be unwise. Hence, we interpolate with respect to velocity as well as position.

This requires a slightly different scheme for gathering training data than we used for the AVM problem. For the AVM, we specified different activation vectors, and the configurations at which the AVM was sampled were those obtained by applying these activations until the plant stabilized. For the gains, we first sampled 500 activations in a Latin hypercube and 100 activations activating either the EDC alone, or the EDC and lumbrical. The reason for the 100 EDC and lumbrical activations was to ensure adequate sampling of points with a high vertical coordinate, as we found that randomly chosen activations tended to oversample the lower points in the configuration space. We reach the configurations by applying the specified activations for 50, 100, 200, 300, 400, 600, 800, 1000, 1300, 1600, and 2000 milliseconds. If the plant stabilized in less than 2 seconds (for most activations it did so in less than 1), we stopped the iterations early and moved on to the next activation. This resulted in a gains lookup table of 3,272 distinct locations in the combined position-velocity space (a scatter plot of the positions is shown in Figure 5.8).

Figure 5.8: 3-dimensional scatter plot of plant positions at which we compute the gains K_P and K_D.
Axis units are in centimetres.

As with AVM interpolation, in selecting k there is a trade-off between tracking accuracy and smoothness of movement, as the controller parameters change more rapidly for smaller values of k. We found that k = 30 worked well, and this is the value used for the remaining experiments in this chapter.

5.4.2 Results

2D tracing. A screenshot of the 2D tracing is shown in Figure 5.9a, and plots of the tracked trajectory and muscle activations are given in Figures 5.10a and 5.10b, respectively. In terms of tracking accuracy, performance is slightly improved over the variable AVM controller with smoothing: there is noticeably decreased deviation in the bottom right corner of the circle. However, there is some increase in rapid oscillations in the muscle activations, and this controller activates the muscles more strongly for this task.

The average tracking error for this task was 0.028 centimetres and the maximum was 0.13 centimetres.

Figure 5.9: Screenshots for the gain-scheduled controller. (a) tracing a circle in the plane. (b) tracing a circle in free space.

3D tracing. From Figures 5.9b and 5.10c, we see that there is still a problem with the plant dragging away from the target trajectory in the bottom left quadrant of the circle, though the error is smaller for the gain-scheduled controller than for the smoothed variable AVM controller. The average tracking error is 0.038 centimetres, and the maximum is 0.22 centimetres.

Interestingly, if we look at the activations shown in Figure 5.10d and compare them to the activations produced by the smoothed variable AVM controller for the same task, we notice more of a distinct pattern for this controller – the activations for all muscles display a noisy but distinct curve, while the non-gain-scheduled controller tends to oscillate frequently between very high and very low activations (particularly for the EDC and PI).
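The gain estimation of Algorithm 3 (Section 5.4.1) can be sketched in code as follows (a minimal numpy sketch with a contrived plant whose probe responses are constructed to be exactly reproducible by gains K_P = 2, K_D = 3; we take the unconstrained least-squares solution and clamp at zero, whereas the thesis solves a true non-negative least-squares problem):

```python
import numpy as np

def estimate_gains(f, R, q, qdot, n_muscles):
    """Algorithm 3 sketch: fully activate one muscle at a time for one
    timestep, record the position/velocity changes, and fit gains so that
    KP * dq + KD * dqd reproduces the ideal feedback command R @ u."""
    V, dQ, dQd = [], [], []
    for i in range(n_muscles):
        u = np.zeros(n_muscles); u[i] = 1.0
        q1, qd1 = f(q, qdot, u)
        V.append(R @ u)              # ideal feedback command for this probe
        dQ.append(q1 - q)            # resulting position change
        dQd.append(qd1 - qdot)       # resulting velocity change
    V, dQ, dQd = map(np.concatenate, (V, dQ, dQd))
    K, *_ = np.linalg.lstsq(np.column_stack([dQ, dQd]), V, rcond=None)
    return np.maximum(K, 0.0)        # clamp: crude non-negativity

# contrived one-step plant and matching AVM (R = 2*DQ + 3*DQD)
DQ  = np.array([[0.1, 0.0], [0.0, 0.3], [0.0, 0.0]])   # per-muscle position change
DQD = np.array([[0.2, 0.0], [0.0, 0.1], [0.0, 0.0]])   # per-muscle velocity change
f = lambda q, qd, u: (q + DQ @ u, qd + DQD @ u)
R = 2.0 * DQ + 3.0 * DQD

KP, KD = estimate_gains(f, R, np.zeros(3), np.zeros(3), 2)
```

On real probe data the system is overdetermined and only approximately consistent, which is why the thesis poses it as a (non-negative) least-squares fit rather than a linear solve.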
By contrast, on the 2D tracing task the non-gain-scheduled controller produced smaller, smoother activations. The fact that gain scheduling yields greater improvements for 3D tracing than for 2D should perhaps not come as a surprise: since unconstrained tracing is bound to explore a larger portion of the configuration space, we expect the importance of variable gains to be greater.

Figure 5.10: Depiction of circle-tracing results for the gain-scheduled controller. (a) trajectory tracked for 2D experiments (as viewed from above): the target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) muscle activations for 2D trajectory tracking. (c-d) trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

Gains values. The values of the computed gains along the 3D circle trajectory are shown in Figure 5.11. The computed values for K_P ranged between 1.651 and 2.19, while the values of K_D (before being multiplied by Δt/2) ranged from 11.324 to 14.062. In the plots, both K_P and K_D are mean-shifted and divided by their respective ranges so that the variation in the values can be seen clearly.

Figure 5.11: Polar plots of the values of the controller gains K_P (a) and K_D (b) along the 3D circle trajectory. Gains values are mean-shifted and scaled by their ranges so that the variations can be seen clearly on a polar plot.

5.5 Passive Dynamics Compensation

5.5.1 Method

In the previous subsections, we used only feedback control to compute the velocity command, which we denoted by v_b.
In this section, we update the velocity command to v_t := v_p + v_b, where v_p = v_p(q_t, q_r) is a 3×1 passive dynamics term designed to compensate for passive forces acting on the plant and capture some non-linearities in the dynamics.

The low-level controller computes activations in the same way as before, but now aims to follow v_t at each timestep instead of v_b:

    u_t = argmin_u α ||R(q_t) · u − v_t||^2 + β ||u||^2 + γ ||u − u_{t−1}||^2
    such that 0 ≤ u ≤ 1        (5.7)

We compensate for passive dynamics using equilibrium controls, the computation of which was discussed at length in Chapter 4. The rationale is that equilibrium controls will hold the plant's position against gravity and other passive forces; as such, they cancel out the passive forces acting on the plant at a given configuration. The use of equilibrium controls for this purpose is seen in [18, 20, 43].

Specifically, for a target position q_r we estimate the activation u_eq that will lead the plant there by nearest-neighbour prediction on the 39,000 activation/equilibrium-position pairs discussed in Chapter 4. In that chapter, we found nearest-neighbours to be the most effective method for predicting an activation that would lead to a target configuration; hence, that is the method we use here. For a detailed description of our methodology, see Section 4.3.

Once we have computed u_eq corresponding to the target q_r, the passive dynamics term is set to v_p = R(q_t) · u_eq, where q_t is the current configuration of the plant. Note that, in computing v_p, we consider only the positional component of the target and ignore the target velocity.

5.5.2 Results

2D tracing. As we can see from Figures 5.12a and 5.13a, the controller traces a nearly perfect circle. The average tracking error is 0.022 centimetres and the maximum is 0.032 centimetres.
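The passive dynamics term described in Section 5.5.1 can be sketched as follows (a minimal numpy sketch; the toy equilibrium table and `passive_dynamics_term` are our own, standing in for the 39,000-pair nearest-neighbour lookup):

```python
import numpy as np

def passive_dynamics_term(q_target, R_t, eq_positions, eq_activations):
    """Look up the equilibrium activation whose resting position is nearest
    the target (1-nearest-neighbour over the activation/equilibrium-position
    table) and map it through the current AVM: v_p = R(q_t) @ u_eq."""
    i = np.argmin(np.linalg.norm(eq_positions - q_target, axis=1))
    u_eq = eq_activations[i]
    return R_t @ u_eq

# toy table of two equilibrium pairs and an identity stand-in for R(q_t)
eq_positions = np.array([[0.0, 0.0], [1.0, 1.0]])
eq_activations = np.array([[0.2, 0.0], [0.0, 0.7]])
vp = passive_dynamics_term(np.array([0.9, 1.1]), np.eye(2), eq_positions, eq_activations)
```

The velocity command then becomes v_t = v_p + v_b before the low-level solve of Equation 5.7.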
Interestingly, adding the passive dynamics term has also had the effect of producing much smaller muscle activations for this task as compared to the gain-scheduled controller without the passive dynamics term.

3D tracing. Figures 5.12b and 5.13c show that, after adding five new features to our original controller, we are finally able to track a circle in free space without any large deviations. The activation patterns (Figure 5.13d) are similar to those produced by the original gain-scheduled controller, though with somewhat fewer oscillations. Activation of the FDS is also significantly reduced.

Figure 5.12: Screenshots for the gain-scheduled controller with passive dynamics compensation. (a) tracing a circle in the plane. (b) tracing a circle in free space.

The average tracking error for this task was 0.030 centimetres, and the maximum instantaneous tracking error was 0.16 centimetres.

5.6 Summary

In this section, we compare the performance of the different control methods. Tables 5.1 and 5.2 summarize controller tracking errors on the 2D and 3D circle tracing tasks, respectively.

For both tasks, going from the baseline to the variable AVM controller cuts tracking errors by about half. In going from the unsmoothed to the smoothed variable AVM controller, there is a sizeable decrease in the average error (about 30 to 50%), but a slight increase in the maximum error. However, an advantage of the smoothed controller is more natural-looking movement: there is reduced oscillation in the activation patterns, which results in a less jittery trajectory.

In going from constant to scheduled gains, there is little difference for the 2D task: the maximum error is decreased by reducing the kink to the right of the circle, but there is an increase in average error caused by added drift to the left of the circle.
The more significant difference (a 30–40% error reduction) is seen for the 3D task. As mentioned earlier, this makes sense, as the 3D task explores a larger portion of the configuration space due to being unconstrained in its movement, so configuration-dependent gains are more likely to be useful. Another consideration is that our gain-scheduling algorithm is performed without constraining the plant to a plane, so the higher average error for the 2D task could be due to the fact that the ideal gain parameters are different when the fingertip is constrained to a plane.

Figure 5.13: Depiction of circle-tracing results for the gain-scheduled controller with passive dynamics compensation. (a) trajectory tracked for the 2D experiments (as viewed from above): the target trajectory is shown by the dashed blue line and the actual trajectory taken is given in solid red. (b) muscle activations for 2D trajectory tracking. (c-d) trajectory (as viewed from the left) and muscle activations for 3D (unconstrained) tracking.

2D tracking errors

Controller                             Average error (cm)   Max error (cm)
Baseline                               0.133                0.208
Variable AVM                           0.0420               0.142
Variable AVM + Smoothing               0.0221               0.160
Gain-scheduled                         0.0282               0.131
Gain-scheduled + Passive Dynamics      0.0218               0.0321

Table 5.1: Summary of controllers' performance on the 2D circle-tracing task. Average error denotes the average positional error between the target and actual plant position at each point on the trajectory, in centimetres. Max error denotes the maximum positional error at any single point on the trajectory, in centimetres.

3D tracking errors

Controller                             Average error (cm)   Max error (cm)
Baseline                               0.168                0.549
Variable AVM                           0.0849               0.280
Variable AVM + Smoothing               0.0621               0.306
Gain-scheduled                         0.0378               0.215
Gain-scheduled + Passive Dynamics      0.0296               0.160

Table 5.2: Summary of controllers' performance on the 3D circle-tracing task. Average error denotes the average positional error between the target and actual plant position at each point on the trajectory, in centimetres. Max error denotes the maximum positional error at any single point on the trajectory, in centimetres.

When we adapt our gain-scheduled controller to include passive dynamics compensation, there is a considerable reduction in errors for both the 2D and 3D tasks. Because the passive dynamics term gives the activation required to hold our target pose against passive forces, it removes some non-linearities in the dynamics. Our feedback control formulation assumes that system dynamics are locally linear: hence, passive dynamics compensation simplifies the problem in such a way that our linear control methods are more suitable.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

Summary  This thesis aimed to develop novel learning algorithms for motor control. Chapter 2 reviewed the necessary background concepts and related work in the control literature. Section 2.1 reviewed the properties of biological motor systems, including how learning occurs in the brain and how biomechanical systems are represented as dynamical systems (either simulated or robotic). Section 2.2 covered some basic methods for controlling these systems, and Section 2.3 went over concepts in machine learning with discussions of where these concepts are used in the control literature.

In Chapter 3 we developed a learning method for control of a quasistatic model of the elastodynamic skin of the human mid-face. Section 3.1 gave properties of the model, and Section 3.2 introduced regression methods for its control. In Section 3.3, we developed a stochastic gradient descent formulation to improve the performance of the supervised learning methods, and showed that the use of regression as a pre-processing step significantly improves the accuracy and convergence time of the optimization.

Chapter 4 investigated reaching control for the human index finger. Section 4.1 described the index finger model.
Section 4.2 framed the task as a supervised learning problem, which was then addressed using k-nearest-neighbours (Section 4.3), neural networks (Section 4.4), and random forests (Section 4.5). Section 4.6 compared the performance of the methods studied and concluded that 1-nearest-neighbour was the best prediction method for this problem.

Finally, in Chapter 5 we incrementally developed a control algorithm for trajectory tracking with the index finger model. Section 5.1 reviewed the controller of [46], which was then improved upon in the succeeding sections by introducing configuration-dependent AVMs (Section 5.2), activation smoothing (Section 5.3), gain scheduling (Section 5.4), and passive dynamics compensation (Section 5.5).

Contributions  Our first contribution is a novel combination of regression and local search for control of quasistatic systems. We improve on the policy gradient method of Kohl and Stone [38] by implementing greedy policy updates and incorporating regression as a pre-processing step. With the combined regression/policy-gradient method, we achieve significantly better control performance than with either method alone, and we are able to improve the convergence time of the gradient descent optimization by nearly 50%.

Next, Chapter 5 develops several techniques for trajectory-tracking problems, using the controller formulation of Malhotra et al. [46] as a starting point. By framing the model as a Linear Parameter-Varying (LPV) system with respect to the AVM, we obtain a controller with significantly improved tracking abilities. While the LPV assumption is fairly common in the control literature, our particular method for estimating the AVM has (as far as we know) not been implemented elsewhere.

By adapting Malhotra et al.'s [46] low-level controller to include the activation-smoothing method of Sueda et al.
[72], we obtain less oscillatory muscle activations, which results in fewer sudden, jerky movements in trajectory tracking.

We then propose a novel gain-scheduling method for PD control that is both simple to implement and applicable to a broad class of feedback problems. To date, the vast majority of control methods for tendon-based systems rely on hand-tuned gains, so our method could be of use. Lastly, we introduce the use of equilibrium controls for passive dynamics compensation. This is not a new idea – see, for example, [20, 66] – but by using nearest-neighbour prediction we are able to avoid issues with redundant and non-convex mappings that cause problems for other prediction methods, and the use of locality-sensitive hashing algorithms makes our method extremely efficient.

6.2 Future Work

Online learning  Our current controller implementation performs all learning offline: we determine in advance where to gather the training data, we gather it, and then we learn the model parameters accordingly. This is different from how humans learn: we learn as we perform movements, and our performance at a task improves the more time we spend doing it.

If we want our controller to be adaptable to changes in its environment, we need an online learning method so that parameters can be changed as new situations are encountered. Online learning and data collection could also improve the efficacy of training by allowing the controller to identify portions of the plant's configuration space that are worth exploring further – for example, areas of the space we expect to occupy often, or regions of rapid change in plant configuration with respect to muscle activations.

Noise and sensory delay  Our feedback controller assumes noiseless, instantaneous feedback on the state of the system.
This is not at all what is seen in biological systems, in which sensory streams are affected by noise and temporal delay. We also assume that motor commands work deterministically: we can compute a certain activation and, when we apply that command to the plant in a certain configuration, we will always see the same result. Again, this is not what we see in real biological systems: in these cases, motor signals are generated imprecisely and corrupted as they travel to the muscle. For example, if you hit a hole-in-one on a particular hole of a golf course, that doesn't mean that you have now "figured out" how to do it and can replicate your earlier action perfectly every time: variance in movements is unavoidable even for highly practiced individuals.

It remains to be seen how (or if) our controller could be used to control systems with some kind of noise or delay involved.

Scaling  We need to consider how our method would scale to a larger dynamical system – for example, learning to control the entire hand or performing full-body tasks like locomotion. Due to our use of k-nearest-neighbours prediction with locality-sensitive hashing, the computation of controller parameters is efficient and robust (i.e., not prone to numerical instability), so the determination of these values given a set of training data should scale reasonably well.

A potential problem arises in solving for low-level commands: with a large number of muscles, the QP in Equation 5.7 could become difficult to solve – specifically, if the matrix R(qt) is indefinite, the problem becomes NP-hard [62].

However, the real bottleneck of our learning process is data collection. The sizes of our problem's input and configuration spaces grow exponentially with the system's degrees of freedom.
This could be partially addressed with more efficient experimental design and online learning techniques, but ensuring sufficient coverage of the entire problem space poses a challenging computational problem.

Object manipulation  Interaction with the environment is a crucial function of motor movements, particularly of the hands. For our method to be useful for real-world tasks, we need to ensure that it is capable of handling contact with external objects. This will likely require some modification of our existing framework, particularly in how we parametrize our control goals. Our method for dynamical system control works with fully specified, spline-based trajectories: this framework makes it difficult to incorporate elements such as force modulation and reaction to environmental cues, both of which are critical to object interaction.

Imitating human movements  All experiments performed for this thesis were formulated using simulated data: for the mid-face, we assessed prediction quality by attempting to reach certain simulated poses, and for the index finger we attempted to trace geometric trajectories. We have not addressed the problem of imitating or learning from the movements of human subjects. We are currently working on testing the controller of Chapter 3 on real facial expressions obtained from video capture data.

Bibliography

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, Jan. 2008. doi:10.1145/1327452.1327494.

[2] S. Andrews and P. Kry. Goal directed multi-finger manipulation: Control policies and analysis. Computers and Graphics, 37(7):830–839, 2013.

[3] U. Ascher and C. Greif. A First Course on Numerical Methods. Computational Science and Engineering. Society for Industrial and Applied Mathematics, 2011.

[4] P. Bédard and J. Sanes.
Gaze and hand position effects on finger-movement-related human brain activation. J. Neurophysiol., 101(2):834–842, Feb 2009.

[5] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957.

[6] Y. Bengio. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, Jan. 2009. doi:10.1561/2200000006.

[7] D. J. Berger, R. Gentner, T. Edmunds, D. K. Pai, and A. d'Avella. Differences in adaptation rates after virtual surgeries provide direct evidence for modularity. The Journal of Neuroscience, 33(30):12384–12394, 2013. doi:10.1523/JNEUROSCI.0122-13.2013.

[8] R. R. Burridge, A. A. Rizzi, and D. E. Koditschek. Sequential composition of dynamically dexterous robot behaviors. Int. J. Robot. Res., 18(6):534–555, 1999.

[9] Y. Cao, M. A. Brubaker, D. J. Fleet, and A. Hertzmann. Efficient optimization for sparse Gaussian process regression. CoRR, abs/1310.6007, 2013.

[10] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. doi:10.1023/A:1007379606734.

[11] P. Cisek and J. Kalaska. Neural mechanisms for interacting with a world full of action choices. Annual Review of Neuroscience, 33(1):269–298, 2010. doi:10.1146/annurev.neuro.051508.135409.

[12] C. Currin, T. Mitchell, M. Morris, and D. Ylvisaker. A Bayesian approach to the design and analysis of computer experiments. Technical Report ORNL–6498, Oak Ridge Laboratory, 1988.

[13] P. de Aguiar, B. Bourguignon, M. Khots, D. Massart, and R. Phan-Than-Luu. D-optimal designs. Chemometrics and Intelligent Laboratory Systems, 30(2):199–210, 1995.

[14] M. de Lasa, I. Mordatch, and A. Hertzmann. Feature-based locomotion controllers. ACM Transactions on Graphics, 29(3), 2010.

[15] E. G. de Ravé, F. Jiménez-Hornero, A. Ariza-Villaverde, and J. Gómez-López.
Using general-purpose computing on graphics processing units (GPGPU) to accelerate the ordinary kriging algorithm. Computers and Geosciences, 64(0):1–6, 2014.

[16] M. Deisenroth and C. Rasmussen. Efficient reinforcement learning for motor control. In Proceedings of the 10th International Workshop on Systems and Control, 2009.

[17] A. Deshpande, Z. Xu, M. Weghe, B. Brown, J. Ko, L. Chang, D. Wilkinson, S. Bidic, and Y. Matsuoka. Mechanisms of the Anatomically Correct Testbed hand. Mechatronics, IEEE/ASME Transactions on, 18(1):238–250, Feb 2013. doi:10.1109/TMECH.2011.2166801.

[18] A. D. Deshpande, J. Ko, D. Fox, and Y. Matsuoka. Control strategies for the index finger of a tendon-driven hand. Int. J. Rob. Res., 32(1):115–128, Jan. 2013. doi:10.1177/0278364912466925.

[19] J. Doyle, B. Francis, and A. Tannenbaum. Feedback control theory, 1990.

[20] M. Fain. Control of complex biomechanical systems. Master's thesis, University of British Columbia, 2013.

[21] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

[22] T. Geijtenbeek, M. van de Panne, and A. F. van der Stappen. Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on Graphics, 32(6), 2013.

[23] A. Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192, 2009. doi:10.1287/ijoc.1080.0305.

[24] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics. ACM Trans. Graph., 23(3):522–531, Aug. 2004. doi:10.1145/1015706.1015755.

[25] R. Grzeszczuk, D. Terzopoulos, and G. Hinton.
Neuroanimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pages 9–20, New York, NY, USA, 1998. ACM. doi:10.1145/280814.280816.

[26] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, Mar. 2003.

[27] M. Hagan and M. Menhaj. Training feedforward networks with the Marquardt algorithm. Neural Networks, IEEE Transactions on, 5(6):989–993, Nov 1994. doi:10.1109/72.329697.

[28] H. Hannoun, M. Hilairet, and C. Marchand. High performance current control of a switched reluctance machine based on a gain-scheduling PI controller. Control Engineering Practice, 19(11):1377–1386, 2011.

[29] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer, 2009.

[30] F. Haugen. The good gain method for simple experimental tuning of PI controllers. Modeling, Identification and Control, 33(4):141–152, 2012.

[31] A. Hedayat, N. Sloane, and J. Stufken. Orthogonal Arrays: Theory and Applications. Springer Series in Statistics. Springer New York, 1999.

[32] Z. Hou, M. M. Gupta, P. N. Nikiforuk, M. Tan, and L. Cheng. A recurrent neural network for hierarchical control of interconnected dynamic systems. IEEE Transactions on Neural Networks, 18(2):466–481, 2007. doi:10.1109/TNN.2006.885040.

[33] D. Huh and E. Todorov. Real-time motor control using recurrent neural networks. In Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL '09. IEEE Symposium on, pages 42–49, March 2009. doi:10.1109/ADPRL.2009.4927524.

[34] P. Indyk and R. Motwani.
Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 604–613, New York, NY, USA, 1998. ACM. doi:10.1145/276698.276876.

[35] J. N. Ingram, K. P. Kording, I. S. Howard, and D. M. Wolpert. The statistics of natural hand movements. Experimental Brain Research, 188(2):223–236, 2008. doi:10.1007/s00221-008-1355-3.

[36] G. L. Jones. Noisy optimal control strategies for modelling saccades. Master's thesis, University of British Columbia, 2011.

[37] V. R. Joseph and Y. Hung. Orthogonal-maximin Latin hypercube designs, 2008.

[38] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2619–2624, May 2004.

[39] S.-H. Lee and D. Terzopoulos. Heads up!: Biomechanical modeling and neuromuscular control of the neck. ACM Trans. Graph., 25(3):1188–1198, July 2006. doi:10.1145/1141911.1142013.

[40] W. Leithead. Survey of gain-scheduling analysis design. International Journal of Control, 73:1001–1025, 1999.

[41] D. Li. Biomechanical simulation of the hand musculoskeletal system and skin. Master's thesis, University of British Columbia, 2013.

[42] D. Li, S. Sueda, D. R. Neog, and D. K. Pai. Thin skin elastodynamics. ACM Trans. Graph., 32(4):49:1–49:10, July 2013. doi:10.1145/2461912.2462008.

[43] C. K. Liu. Dextrous manipulation from a grasping pose. ACM Trans. Graph., 28(3):59:1–59:6, July 2009. doi:10.1145/1531326.1531365.

[44] X. Liu, C. Ma, M. Li, and M. Xu. A kriging assisted direct torque control of brushless DC motor for electric vehicles.
In Natural Computation (ICNC), 2011 Seventh International Conference on, volume 3, pages 1705–1710, July 2011. doi:10.1109/ICNC.2011.6022349.

[45] N. K. Malakar and K. H. Knuth. Entropy-based search algorithm for experimental design. AIP Conference Proceedings, 1305(1):157–164, 2011. doi:10.1063/1.3573612.

[46] M. Malhotra, E. Rombokas, E. Theodorou, E. Todorov, and Y. Matsuoka. Reduced dimensionality control for the ACT hand. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 5117–5122, May 2012. doi:10.1109/ICRA.2012.6224651.

[47] J. Martins, E. Pires, R. Salvado, and P. Dinis. A numerical model of passive and active behavior of skeletal muscles. Computer Methods in Applied Mechanics and Engineering, 151(3–4):419–433, 1998. Papers presented at the Symposium on Advances in Computational Mechanics.

[48] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.

[49] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

[50] I. Mordatch, Z. Popović, and E. Todorov. Contact-invariant optimization for hand manipulation. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '12, pages 137–144, Aire-la-Ville, Switzerland, 2012. Eurographics Association.

[51] J. Najemnik and W. Geisler. Optimal eye movement strategies in visual search. Nature, 434:387–391, 2005.

[52] J. Peters and S. Schaal. 2008 special issue: Reinforcement learning of motor skills with policy gradients. Neural Netw., 21(4):682–697, May 2008. doi:10.1016/j.neunet.2008.02.003.
[53] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7–9):1180–1190, 2008. Progress in Modeling, Theory, and Application of Computational Intelligence (15th European Symposium on Artificial Neural Networks 2007).

[54] J. R. Peters. Machine Learning of Motor Skills for Robotics. PhD thesis, University of Southern California, 2007.

[55] R. Poole and N. G. S. (U.S.). The Incredible Machine. National Geographic Society, 1995.

[56] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[57] M. Rolf. Goal Babbling for an Efficient Bootstrapping of Inverse Models in High Dimensions. PhD thesis, Bielefeld University, 2012.

[58] M. Rolf. Goal babbling with unknown ranges: A direction-sampling approach. In Development and Learning and Epigenetic Robotics (ICDL), 2013 IEEE Third Joint International Conference on, pages 1–7, Aug 2013. doi:10.1109/DevLrn.2013.6652526.

[59] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, Nov. 1958.

[60] P. Sachdeva, S. Sueda, S. Bradley, M. Fain, and D. K. Pai. Biomechanical simulation and control of hands and tendinous systems. ACM Trans. Graph., 34(4):42:1–42:10, July 2015. doi:10.1145/2766987.

[61] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statist. Sci., 4(4):409–423, 1989. doi:10.1214/ss/1177012413.

[62] S. Sahni. Computationally related problems. SIAM Journal on Computing, 3(4):262–279, 1974. doi:10.1137/0203021.

[63] T. J. Santner, W. B., and N. W. The Design and Analysis of Computer Experiments.
Springer-Verlag, 2003.

[64] I. Sardellitti, G. Medrano-Cerda, N. Tsagarakis, A. Jafari, and D. Caldwell. Gain scheduling control for a class of variable stiffness actuators based on lever mechanisms. Robotics, IEEE Transactions on, 29(3):791–798, June 2013. doi:10.1109/TRO.2013.2244787.

[65] P. Sebastiani and H. P. Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000. doi:10.1111/1467-9868.00225.

[66] R. Shadmehr. Equilibrium point hypothesis. In The Handbook of Brain Theory and Neural Networks, pages 370–372. MIT Press, 1998.

[67] G. Shakhnarovich. Locality sensitive hashing (LSH). ~gregory/download.html, 2009.

[68] J. Shamma and M. Athans. Gain scheduling: potential hazards and possible remedies. Control Systems, IEEE, 12(3):101–107, June 1992. doi:10.1109/37.165527.

[69] X. Shu, P. Ballesteros, W. Heins, and C. Bohn. Design of structured discrete-time LPV gain-scheduling controllers through state augmentation and partial state feedback. In American Control Conference (ACC), 2013, pages 6090–6095, June 2013. doi:10.1109/ACC.2013.6580793.

[70] W. Si, S.-H. Lee, E. Sifakis, and D. Terzopoulos. Realistic biomechanical simulation and control of human swimming. ACM Trans. Graph., 34(1):10:1–10:15, Dec. 2014. doi:10.1145/2626346.

[71] D. L. Sparks. The brainstem control of saccadic eye movements. Nature Reviews Neuroscience, 3:952–964, Dec 2002. doi:10.1038/nrn986.

[72] S. Sueda, A. Kaufman, and D. K. Pai. Musculotendon simulation for hand animation. ACM Trans. Graph. (Proc. SIGGRAPH), 27(3), 2008.

[73] S. Sueda, G. L. Jones, D. I. W. Levin, and D. K. Pai. Large-scale dynamic simulation of highly constrained strands. ACM Trans.
Graph., 30(4):39:1–39:10, July 2011. doi:10.1145/2010324.1964934.

[74] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[75] C. Szepesvari and M. L. Littman. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. Technical report, University of Alberta, 1996.

[76] D. Thalmann and M. van de Panne. Computer Animation and Simulation '97: Proceedings of the Eurographics Workshop in Budapest, Hungary, September 2–3, 1997. Eurographics. Springer Vienna, 2012.

[77] E. Theodorou. Iterative Path Integral Stochastic Optimal Control: Theory and Applications to Motor Control. PhD thesis, University of Southern California, 2011.

[78] L. H. Ting and J. L. McKay. Neuromechanics of muscle synergies for posture and movement. Current Opinion in Neurobiology, 17(6):622–628, 2007.

[79] H. Van Welbergen, B. J. H. Van Basten, A. Egges, Z. M. Ruttkay, and M. H. Overmars. Real time animation of virtual humans: A trade-off between naturalness and control. Computer Graphics Forum, 29(8):2530–2554, 2010. doi:10.1111/j.1467-8659.2010.01822.x.

[80] N. Vitiello, T. Lenzi, S. M. M. D. Rossi, S. Roccella, and M. C. Carrozza. A sensorless torque control for antagonistic driven compliant joints. Mechatronics, 20(3):355–367, 2010.

[81] K. Wampler, J. Popović, and Z. Popović. Animal locomotion controllers from scratch. In EUROGRAPHICS, volume 32, 2013.

[82] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(2):283–298, Feb 2008. doi:10.1109/TPAMI.2007.1167.

[83] J. M. Wang, D. J. Fleet, and A. Hertzmann.
Optimizing walking controllers. ACM Trans. Graph., 28(5):168:1–168:8, Dec. 2009. doi:10.1145/1618452.1618514.

[84] D. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998.

[85] D. M. Wolpert, J. Diedrichsen, and J. R. Flanagan. Principles of sensorimotor learning. Nature Reviews Neuroscience, 12(12):739–751, 2011.

[86] H. Wu and F. Sun. Adaptive kriging control of discrete-time nonlinear systems. Control Theory Applications, IET, 1(3):646–656, May 2007.

[87] K. Yin, K. Loken, and M. van de Panne. SIMBICON: Simple biped locomotion control. ACM Trans. Graph., 26(3):Article 105, 2007.

[88] A. Zhang, M. Malhotra, and Y. Matsuoka. Musical piano performance by the ACT hand. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 3536–3541, May 2011. doi:10.1109/ICRA.2011.5980342.

[89] K. Zhou, Q. Hou, R. Wang, and B. Guo. Real-time kd-tree construction on graphics hardware. ACM Transactions on Graphics (TOG), 27(5):126, 2008.

[90] J. G. Ziegler and N. B. Nichols. Optimum Settings for Automatic Controllers. Transactions of ASME, 64:759–768, 1942.

[91] V. B. Zordan, A. Majkowska, B. Chiu, and M. Fast. Dynamic response for motion capture animation. ACM Trans. Graph., 24(3):697–701, July 2005. doi:10.1145/1073204.1073249.

