UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Control of complex biomechanical systems Fain, Mikhail 2013

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2013_fall_fain_mikhail.pdf [ 1.61MB ]
JSON: 24-1.0052179.json
JSON-LD: 24-1.0052179-ld.json
RDF/XML (Pretty): 24-1.0052179-rdf.xml
RDF/JSON: 24-1.0052179-rdf.json
Turtle: 24-1.0052179-turtle.txt
N-Triples: 24-1.0052179-rdf-ntriples.txt
Original Record: 24-1.0052179-source.json
Full Text

Full Text

Control of Complex Biomechanical SystemsbyMikhail FainBSc., University of British Columbia, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University Of British Columbia(Vancouver)October 2013c?Mikhail Fain, 2013AbstractHumans show stunning performance on a variety of manipulation tasks. However,little is known about the computation that the human brain performs to accomplishthese tasks. Recently, anatomically correct tendon-driven models of the humanhand have been developed, but controlling them remains an issue. In this thesis,we present a computationally efficient feedback controller, capable of dealing withthe complexity of these models. We demonstrate its abilities by successfully per-forming tracking and reaching tasks for an elaborated model of the human indexfinger.The controller, called One-Step-Ahead controller, is designed in a hierarchi-cal fashion, with the high-level controller determining the desired trajectory andthe low-level controller transforming it into muscle activations by solving a con-strained linear least squares problem. It was proposed to use equilibrium controlsas a feedforward command, and learn the controller?s parameters online by stabi-lizing the plant at various configurations. The conducted experiments suggest thefeasibility of the proposed learning approach for the index finger model.iiPrefaceThe algorithms, experiments and analysis described in this thesis are original andunpublished, and the related work is appropriately cited. I designed and imple-mented them with the help of my supervisor, Dr. Dinesh Pai. All the text in thisthesis is original and unpublished.The human index finger model, described in Chapter 5, was developed by Dr.Shinjiro Sueda and Duo Li under the supervision of Dr. Dinesh Pai, and was de-scribed in the following unpublished paper:Shinjiro Sueda, Duo Li, Dinesh K. Pai, Anatomically-based Dynamic Simu-lation of the Hand, 2012 (in preparation)The toy plant described in Chapter 4 was developed using the same code baseby me.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 32.1 Approximation of Dynamical Systems . . . . . . . . . . . . . . . 42.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . 42.1.2 Learning the Dynamical System Equation . . . . . . . . . 52.1.3 Partially Observable Dynamical Systems . . . . . . . . . 62.1.4 Dimensionality Reduction . . . . . . . . . . . . . . . . . 82.2 Control of Dynamical Systems . . . . . . . . . . . . . . . . . . . 92.2.1 Asymptotic Reaching Control . . . . . . . . . . . . . . . 102.2.2 Optimal Control . . . . . . . . . . . . . . . . . . . . . . 132.2.3 Hierarchical Control . . . . . . . . . . . . . . . . . . . . 17iv2.2.4 Adaptive Control . . . . . . . . . . . . . . . . . . . . . . 192.3 Biomechanical Tendon-Driven Systems . . . . . . . . . . . . . . 213 One-Step-Ahead controller . . . . . . . . . . . . . . . . . . . . . . . 253.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2 One-Step-Ahead controller (OSA) Design . . . . . . . . . . . . . 263.3 Learning Parameters of One-Step-Ahead controller (OSA) . . . . . 323.3.1 Learning Feedforward Command . . . . . . . . . . . . . 323.3.2 Learning Activation-Position Matrix (APM) . . . . . . . . 353.3.3 Determining Other Parameters . . . . . . . . . . . . . . . 373.4 Proposed Approach for Controlling Biomechanical Systems . . . 383.5 Using Learned System Dynamics with One-Step-Ahead controller(OSA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 Control of Simplified Human Dynamics . . . . . . . . . . . . . . . . 414.1 Simplified Human Dynamics (F1T2) . . . . . . . . . . . . . . . . 424.2 Experiment 1: Learning Controller Parameters for a Reaching Task 464.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 464.2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.3 Experiment 2: Reaching Task with Feedforward Command andVariable APM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.4 Experiment 3: Using End-Point Coordinates . . . . . . . . . . . . 604.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 Control of a Human Index Finger Model . . . . . . . . . . . . . . . 665.1 Human Index Finger Model . . . . . . . . . . . . . . . . . . . . . 665.2 Reaching Control for the Index Finger . . . . . . . . . . . . . . . 676 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78v6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81viList of TablesTable 4.1 Scaled Performance Error for the Baseline and the Variable APMcontrollers for reaching the target ?r = pi6 rad, cr = Lb cospi6 . . . 63viiList of FiguresFigure 2.1 An example of a tendon-driven system (or plant). This plantconsists of a single moving bone (the upper one) and two inex-tensible tendons (green-blue) attaching it to the muscles ? theflexor (yellow) and the extensor (red). . . . . . . . . . . . . . 22Figure 3.1 The One-Step-Ahead controller (OSA) algorithm. Given the(unknown to the controller) plant equation xt+1 = f (xt ,ut), thegoal of the controller is to find a control signal 0? ut ? 1, suchthat the system state, xt , drives to the reference state, xr. . . . 31Figure 3.2 A schematic representation of the One-Step-Ahead controller(OSA) algorithm. . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 4.1 A representation of the F1T2 plant. The plant consists of asingle moving bone (the upper one) and two inextensible ten-dons attaching it to the muscles ? the flexor (yellow) and theextensor (red). . . . . . . . . . . . . . . . . . . . . . . . . . 42Figure 4.2 A set of N = 11 runs that the Baseline controller performed.x-axis represents the time, y-axis shows the joint angle ? . Thered line shows the target configurations ? (i)r = q(i)r , while theblue line shows the actual trajectories ? (i)t = q(i)t , t = 1, ...,50during the execution. . . . . . . . . . . . . . . . . . . . . . . 49viiiFigure 4.3 Comparison of empirically computed APM, R(i)T , with scaledanalytically computed active torques at the configuration ? (i)T =q(i)T . (left) the first entry of R corresponding to the exten-sor muscle, (right) the second entry of R corresponding to theflexor muscle. . . . . . . . . . . . . . . . . . . . . . . . . . . 51Figure 4.4 Muscle activations during N reaching runs performed in Ex-periment 1 for the extensor (left) and the flexor (right). Note,that the graphs look similar, as they depict all N runs, illustrat-ing both flexion and extension. . . . . . . . . . . . . . . . . . 52Figure 4.5 Details for the run to reach the location ?r = qr = 5pi6 rad. (A)Red line shows the target position, blue line shows the actualtrajectory. (B) Red line shows the target velocity, blue lineshows the actual velocity profile. (C) Activation level for theextensor muscle during the run. (D) Activation level for theflexor muscle during the run. . . . . . . . . . . . . . . . . . . 54Figure 4.6 Cubic polynomial fit of empirically computed APM. (Left) thefirst entry of R(?) = R(q) corresponding to the extensor mus-cle, (right) the second entry of R(?) = R(q) corresponding tothe flexor muscle. . . . . . . . . . . . . . . . . . . . . . . . . 56Figure 4.7 Trajectories for the set of N = 11 runs that the Equilibrium(green) or the Baseline (blue) controller performed. x-axis rep-resents the time, y-axis shows the joint angle ? . . . . . . . . . 57Figure 4.8 Angular velocity profile for N = 11 runs that the Equilibrium(green) or the Baseline (blue) controllers performed. x-axisrepresents the time, y-axis shows the velocities ?? . . . . . . . . 58Figure 4.9 Muscle activations during N reaching runs performed in Ex-periment 2 for the extensor (left) and the flexor (right). Greenlines show activations produced by the Equilibrium controller,blue lines show activations produced by the Baseline controller.Note, that the graphs look similar, as they depict all N runs,some of which involve flexion and some ? extension of the joint. 59ixFigure 4.10 Trajectories for the set of N = 11 runs that the Equilibrium(green), the Variable APM (magenta), or the Baseline (blue)controllers performed. x-axis represents the time, y-axis showsthe end-point coordinate c. . . . . . . . . . . . . . . . . . . . 61Figure 4.11 Velocities for the set of N = 11 runs that the Equilibrium (green),the Variable APM (magenta), or the Baseline (blue) controllersperformed. x-axis represents the time, y-axis shows the end-point velocity c?. . . . . . . . . . . . . . . . . . . . . . . . . . 62Figure 4.12 Muscle activations for reaching the target qr = cr = 0.4 m thatthe Equilibrium (green), the Variable APM (magenta), or theBaseline (blue) controllers performed. x-axis represents thetime, y-axis shows the activation level for the extensor muscle(left graph), and the flexor muscle (right graph). . . . . . . . . 63Figure 5.1 A visual representation of the human index finger model [47].The plant consists of four bones (distal, middle, proximal andmetacarpal) which are tied to 6 muscles by 10 interconnectedtendons. The other four fingers are displayed only for conve-nience. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67Figure 5.2 Trajectories for the set of N = 6 runs that the Variable APM(magenta), the Equilibrium (green), or the Baseline (blue) con-trollers performed. x-axis represents the time, y-axis shows thethird end-point coordinate q1 = z. . . . . . . . . . . . . . . . 70Figure 5.3 Muscle activations for reaching the target q1 = z = ?1.2 cmthat the Baseline (blue) controller performed. x-axis representsthe time, y-axis shows the activation level for all the six indexfinger muscles. . . . . . . . . . . . . . . . . . . . . . . . . . 71Figure 5.4 The results of the Baseline controller (blue) tracking the circu-lar trajectory for the fingertip (red). The fingertip starts at therest configuration (blue star) off the circle, but quickly catchesup, and does three circles around, with a total duration of 6seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72xFigure 5.5 Muscle activations for the circular trajectory tracking that theBaseline (blue) controller performed. x-axis represents thetime, y-axis shows the activation level for all six index fingermuscles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73Figure 5.6 The results of the Baseline controller with (green) or without(blue) regularization tracking the circular trajectory for the fin-gertip (red). Note, that the controller without regularizationproduces jerky movement, while the one with regularizationmakes it smooth. . . . . . . . . . . . . . . . . . . . . . . . . 73Figure 5.7 Muscle activations for the circular trajectory tracking that theBaseline controller with (green) or without (blue) regulariza-tion performed. x-axis represents the time, y-axis shows theactivation level for all six index finger muscles. . . . . . . . . 74Figure 5.8 The results of the Baseline controller tracking the non-smoothtrajectories. . . . . . . . . . . . . . . . . . . . . . . . . . . . 75xiGlossaryOSA One-Step-Ahead controllerAPM Activation-Position MatrixESN Echo State NetworkACT Anatomically Correct TestbedLTI Linear Time-InvariantPID Proportional-Integral-DerivativePD Proportional-DerivativeHJB Hamilton-Jacobi-BellmanLQR Linear-Quadratic RegulatorILQR iterative Linear-Quadratic RegulatorLQG Linear-Quadratic GaussianILQG iterative Linear-Quadratic GaussianPI2 Policy Improvement with Path IntegralsGP Gaussian ProcessesRNN Recurrent Neural NetworkDMP Dynamic Movement PrimitivesCMA Covariance Matrix AdaptationxiiAcknowledgmentsForemost, I would like to express my gratitude to my supervisor, Prof. DineshPai, who patiently guided me through my degree and gave me the opportunities,knowledge and advice to do this research. He tremendously influenced the way Isee and understand biomechanical systems (now I cannot eat meat without exam-ining muscle and tendon fibers), and the world of research in general. I hope tocontinue to build upon that knowledge later in my life.I would also like to thank the second reader of this thesis, Prof. Michiel van dePanne, for his valuable comments on my work.I thank all the current and former members of the Sensorimotor Systems Lab,whom I have an honor to know. They helped me a lot with insights, discussions andadvices, and I am especially grateful to Shin and Duo, whose work I was buildingupon.Last but not least, I would like to thank the people who are special to me ?my parents who made me the person I am now, my caring sisters, my wonderfulgirlfriend and my dear friends (especially comrades Dasha, Gosha and Lina). Theirsupport is the reason I was able to complete this journey!xiiiChapter 1IntroductionMotor control is the task of making some physical object (named plant) tobehave in a desired way by applying control signals to it. A beautiful example ofsuch control is human hand movement. Humans show stunning performance ona huge variety of tasks, from knitting to writing. While for some specific settingsrobotic systems may outperform humans in manipulation tasks, these settings arevery restrictive, and overall the performance of robotic systems is much worse.What are the reasons for the human?s amazing abilities in motor control?The first possible reason is the special system that the human operates, that is,the human hand. As well as other animal limbs, the human hand is tendon-driven.A tendon is a very stiff elastic strand that connects bones and muscles. Whenactivated muscle contracts, it rotates the joints. This architecture is drasticallydifferent from the majority of robotic plants of similar purpose. Robotic arms aretypically joint torque driven, that is, the joints are moved by motors that directlyrotate them.Torque-driven systems are used in robotics partly because it is possible to writetheir dynamics analytically. On the other hand, tendon-driven systems are hard toimplement, and the correspondence between the controls and the movement thatthe plant produces is rather complicated. In addition, biomechanical systems areinput-constrained since the muscles can pull only in one direction, and redundant,1since there are usually multiple activation patterns that lead to the same movementof the plant.Thus, even simple tasks like reaching the desired location become hard fortendon-driven systems [71], yet humans are able to do it with ease. This brings us tothe second possible reason of why humans are so good at manipulation tasks ? thealgorithms implemented in the human brain. Unfortunately, the information aboutthese algorithms is very scarce. However, the recent advances in anatomicallycorrect modeling of the human hand [47, 71, 78] make it easier to test algorithmsthat can potentially control complex biomechanical systems.This thesis investigates the problem of designing an efficient and robust con-troller, which is able to control a complex biomechanical system, and learn in realtime. It focuses on the trajectory tracking problem, and one reason for this is theexistence of algorithms that can learn kinematic trajectories for rather complextasks [50, 56, 62].Chapter 2 gives background on the problem of controlling dynamical systems,and reviews the relevant work. Chapter 3 derives the algorithm of the One-Step-Ahead controller (OSA). In Chapter 4, OSA performance is analyzed by applyingit to a toy plant. Chapter 5 successfully applies OSA to the complex model ofthe human index finger. Finally, Chapter 6 summarizes our findings and discussespossible future work.2Chapter 2Background and Related WorkThis thesis intends to develop methods for controlling complex biomechanicalsystems. This chapter gives some background on existing methods of controllingdynamical systems, and the typical properties of biomechanical systems.First of all, a notion of a dynamical system (or a plant) should be defined. A dy-namical system consists of a state at time t, xt , and an equation (Equation 2.1) thatdefines how the state evolves in time depending on a control function u(x, t). Thecontrol function is also referred to as a policy in reinforcement learning literature.xt+1 = f (xt ,ut)ut = u(xt , t)(2.1)The general goal in any control task is to find such a policy u(x, t) that the stateevolves in a way that maximizes some criteria.It should be mentioned, that Equation 2.1 can be thought of as the solution ofcontinuous system dynamics x?(t) = F(x(t),u(t)), integrated over a fixed timestep?t, assuming that u(t) = ut is constant during this timestep.First, methods for approximating the function f (x,u), which can be unknown,are described in Section 2.1. Then, Section 2.2 describes various control problemsand methods available for solving them. These two sections do not outline everyavailable method in detail as there is a vast amount of them ? instead, they try to3organize the methods in a coherent fashion. In Section 2.2.3, an important conceptof Hierarchical Control is described, as we believe it to be the key to the successfulapplication of control methods. Finally, Section 2.3 discusses the specific proper-ties of biomechanical tendon-driven systems that can shed light on which methodscan be used for controlling them.Throughout this chapter it is assumed that xt and ut are real scalars unless oth-erwise stated. This simplification was made since the generalization to the multidi-mensional case is straightforward, but introduces more notational burden. We alsopoint out the computational methods that can be useful in biomechanical controland if there exist biological data suggesting that a method can be implemented inthe human brain.2.1 Approximation of Dynamical SystemsThe dynamical system equation (Equation 2.1) is often unknown, and hence ap-proximating it is crucial for any controller. This section reviews various methodsavailable for learning this approximation. It should be emphasized that the meth-ods described here are also applicable for learning other functions, e.g., equilibriumcontrols.2.1.1 Data CollectionIn order to learn the system approximation, data from the environment should beobtained. Let D denote such a dataset, consisting of triples {xt ,ut} ? xt+1, suchthat xt+1 = f (xt ,ut). However, the question remains how to get such a dataset fromexperience.The simplest idea is to get the data by randomly generating controls ut in agiven state xt , and recording the state at the next timestep, xt+1. This idea is com-monly referred in the literature as motor babbling [65]. However, the problem withthis approach is that randomly choosing controls in such a way would yield a lotof samples which are useless for performing some specific control tasks, becausevery often the regions we would like to explore are very unlikely to be reached byrandomly generated control signals.Consequently, the idea of goal-directed exploration, sometimes referred to as4goal babbling, comes into place [61]. In this case, the controller performs abootstrapping procedure involving acquiring the data from goal-directed reachingmovements. This idea originated from observing infants? motor learning - evenvery early in development they try to produce reaching movements and learn fromthem right away. Rolf et al. [61] claim that this technique is very efficient especiallyin high dimensional parameter space.The distinction between motor and goal babbling could be thought of in termsof the classic problem of exploration vs. exploitation [18], as motor babbling cor-responds to an act of exploration, while goal babbling and related schemes areexploiting the existing knowledge to generate a new one. Thus, we believe thatideally one would incorporate both approaches in the learning procedure.In addition, sometimes assumptions about the structure of the function f? sug-gest some specific ways of getting the training data. Consider the case wheref? (xt ,ut) = xt + But . Then, to learn B, one just needs to set ut = 1, collect thenext state xt+1, and set B = xt+1?xt . Such an approach for data collection was em-ployed by Malhotra et al. [53], which the authors referred to as self-identification.2.1.2 Learning the Dynamical System EquationThis section reviews the methods available for approximating the function f (xt ,ut),given the dataset D= {xt ,ut}? xt+1. This problem is exactly the standard functionapproximation problem in a supervised learning setting, and hence the methodsmentioned here are well-explored.Linear Models. The most straightforward approach is to assume that the functionf is affine, that isxt+1 = Axt +But +G, (2.2)where A,B and G are constants.Learning the system from a dataset D, then, reduces to a linear regression prob-lem, which can be solved using various methods. Despite the fact that the systemsto approximate are rarely linear, they can be thought of as being linear locally, andthus this approach can still be useful if the data collected are confined to a small5enough neighborhood around some initial state x0. For example, Malhotra et al.[53] use recursive least squares algorithm to update the linear model of a complextendon-driven plant, the Anatomically Correct Testbed (ACT) hand [78]. Linearmodels were also used in the other work on this robotic hand [62, 63, 82].Non-Linear Models. However, in a general case the problem of dynamical sys-tem approximation is equivalent the problem of non-linear function approximation.A variety of supervised learning algorithms can be used to solve it ? e.g., Lesmanaand Pai [46] used Radial Basis Functions, Deshpande et al. [20, 21], Nguyen-Tuongand Peters [55] applied Gaussian Processes (GP), and Neural Networks were uti-lized by Grzeszczuk et al. [26], Saegusa et al. [65].Locally Linear Models. Often the dynamical system can be assumed to be linearin the neighborhood of a current state xt , that isxt+1 = A(xt)+B(xt)ut , (2.3)where A,B are functions of xt .One way to find the functions A(x),B(x) is to approximate the system locallyaround various states and then interpolate the values of A,B by using any functionapproximator including the ones mentioned above. Another way is to ignore thedynamics of the system by assuming A(x) = x, and use the kinematic JacobianTranspose method derived using the principle of virtual work to find B(x) [20].Finally, there is yet another one way to approximate Equation 2.3, which uses theidea of reverberating loops ? that is, it keeps the system model stored as signals,not function parameters [23].2.1.3 Partially Observable Dynamical SystemsIt is often the case that the actual state of the system, xt , is unobservable ? more-over, even the dimension of the state may be unknown. Instead, the only informa-tion available at time t is yt = h(xt), where h is an arbitrary function. In this case,the problem of dynamical system approximation includes the problem of state es-timation. That is, the state xt needs to be estimated from the previously observed6values values [y1, ...,yt ] and applied controls [u1, ...,ut?1].One way to estimate the state is to assume xt = [yt?N , ...,yt ], for some fixed N.In the Neural Network literature, such an approach is referred to as Time-DelayedNeural Networks [30], while in estimation theory it is known as Finite ImpulseResponse [14].Another way to deal with partially observable dynamical systems is to keeptrack of the current estimate of a state, x?t , and update this estimate as the newcontrols ut are applied and observations yt+1 arrive. In this way, the problem ofstate estimation couples with the function approximation problem, since one needssome model of the system f? (xt ,ut) available to keep track of the state. The rest ofthe section describes the methods dealing with the problem of partial observabilityin this fashion.Recurrent Neural Network. One solution to the problem of dynamical systemapproximation with partially observable state is so-called Recurrent Neural Net-work (RNN) [79]. Assuming that the reader is familiar with a standard feedforwardNeural Network architecture, it should be mentioned that RNN differs in that thehidden layer units are interconnected. That is, the current values of hidden layerunits influence the values of hidden layer units on the next timestep. That naturallycreates a notion of a state estimate encoded by the current values of hidden layerunits, as this estimate gets updated based on new values ut ,yt arriving.However, the training of these networks using standard gradient-descent algo-rithms was proven to be ineffective [30, 31]. Thus, many other approaches wereintroduced to deal with the problem of training RNN, including algorithms likeHessian-Free optimization [54] and network architectures like Long Short-TermMemory [31] and others [34, 51, 73]. RNNs were successfully employed for build-ing models of dynamical systems by [7, 12, 13, 43] and others.Echo State Network. An interesting extension of the idea of RNN is so-calledEcho State Network (ESN) [33]. The magic of ESN is a special type of a hiddenlayer that exhibits echo-state property ? the influence of initial conditions on thenetwork state vanishes as time goes by [81]. In short, these features make the hid-7den recurrent layer to work as a reservoir ? when it gets a particular history ofinputs, it shows a specific spatio-temporal pattern of activation (consider the anal-ogy of listening to echo in a large hall ? where words that were said just now andearlier are heard simultaneously - with the earlier sounds slowly subsiding). Sincethe hidden layer connections do not change in ESN, the training of such networksreduces to a linear regression problem which can be solved very efficiently.ESNs were shown to perform very well for a task of learning temporal depen-dencies [52], thus it could be a nice framework for building dynamical models,e.g., [27, 66].There is also a biological motivation in using ESN for biomechanical modeling.Indeed, it is widely accepted that the human cerebellum may hold various types ofmodels of the body and the environment [19, 38]. While historically it is believedthat the cerebellum is a huge feedforward neural network [64], some researcherssuggest that the cerebellum actually resembles a Liquid State Machine [80], whichis a concept very closely related to ESN.Bayesian Inference Methods. Another way to keep track of the state estimate x?t isto use standard Bayesian algorithms such as the Kalman Filter for linear systems,the Extended Kalman Filter for non-linear systems, and other methods includingParticle Filtering. However, the standard formulation of these methods implieshaving a model of the system in the first place (which may not be available). Thus,these methods are typically used to remove uncertainty in the estimate, in combi-nation with other methods providing the system model f? (xt ,ut) (see, e.g., [53]).2.1.4 Dimensionality ReductionThis section briefly switches from the one-dimensional systems to multidimen-sional ones, that is, x and u are no longer scalars, but vectors of size n and mrespectively.In many cases, even when the state x is fully observable, the system dynamicsf (xt ,ut) are too complicated and hard to learn by any method. This happens es-pecially often when the dimension of the state and controls are large. Thus, onemight want to find a low-dimensional approximation of the function f (xt ,ut).8More precisely, a transformation x? = S(x) which transforms an n-dimensionalstate vector x to a lower-dimensional vector x? should be found. In addition, onemay seek to reduce dimensionality of the controls u, that is, to find a transformationu? =W (u) with u? having a smaller dimension than u. Then, the dynamical systemapproximation problem can be solved in terms of these reduced state and controls.The question, of course, remains in how to find such transformations S(x),W (u)that make solving a system approximation problem easier but at the same timemake sure that the approximation captures the important behaviors of the system,however they are defined.One way, described extensively by Berniker [9] in his thesis, is to use a methodfrom linear systems theory called balanced truncation [4, 22]. The resulting low-level approximation was also shown to express much less non-linearities than thereal system for some plants. Thus, this method may help to build a good linear ap-proximation of some systems, which is extremely useful for control (Section 2.2).Other methods for dimensionality reduction, such as Principal Component Anal-ysis, were also successfully used by a number of researchers to reduce the dimen-sion of the state and controls in high-dimensional systems [53, 82].It should also be mentioned that the idea of dimensionality reduction is wellgrounded in physiology. Indeed, reduced control vector u? is also referred to asmuscle synergies, and there is a reason to believe that humans do use these syner-gies to control their limbs [8, 17].2.2 Control of Dynamical SystemsIn Section 2.1, the issues of approximating dynamical systems were discussed.This section describes various methods for controlling dynamical systems, that is,using system models to make it behave in a desired way. Clearly, the declaration ofwhat is desired is crucial, since the methods applied for control could vary basedon the goals the controller has.The most standard problem in control theory, asymptotic reaching, is describedin Section 2.2.1. Section 2.2.2 discusses the methods for minimizing the cost ofmovement (equivalently, maximizing the reward). Finally, Section 2.2.3 showsthe role of hierarchy in motor control, and the methods for adapting the existing9controller to new circumstances are described in Section Asymptotic Reaching ControlConsider the dynamical system, Equation 2.1. The asymptotic reaching controltask is defined as finding a policy, u(xt , t) such that xtt????? xr, for some target statexr. This task is related the problem of tracking the predefined trajectory, which istypically done by changing the target state xr at every timestep.Control of Linear Systems. The asymptotic reaching control of Linear Time-Invariant systems is one of the most developed problems in classical control the-ory. It can be reduced to finding controls which lead the system to the state xr = 0asymptotically, for the systemxt+1 = Axt +ButThe most common controllers used for solving this problem are so-called state-feedback controllers, which set u(xt) = Kxt , where K is a constant, or a matrixin a multidimensional case. The standard method for picking this matrix K iscalled pole placement [22] which, under some conditions, guarantees the desiredconvergence xtt????? 0.A special case of state feedback control is when the state xt is of the form[qtq?t], where qt is a configuration variable, and K =[KP KD], where KP,KD aresquare diagonal matrices. Such a controller is named as Proportional-Derivative(PD) controller, and the diagonal matrices KP,KD are denoted as gains which areoften picked manually. Even though it can be applied only for a certain type oflinear systems, PD control is extremely popular in robotics literature due to itssimplicity, e.g., [46].The case when the system dynamics are affine, that is, have an additional forc-ing term G (Equation 2.2), can be handled by using the gravity compensation ap-proach. According to this method, one can find ug, such that Bug = ?G, and addthis term to the controls computed by a state-feedback controller u(xt) = Kxt . Theforcing term G in robotic systems is usually a gravity force, hence the name of the10method. The same idea can be used when G is state-dependent, in which case thismethod is a special case of inverse dynamics (or feedforward control) [59].In addition, this case can be handled by an extension of a PD-controller ? aProportional-Integral-Derivative (PID) controller. A PID controller represents a PDcontroller augmented by a term that takes into account the error accumulated overtime:u(xt , t) =[KP KD]xt +[KI 0] t?1??=1x? , (2.4)where KI is a square diagonal matrix, usually tuned manually.In this way, a PID controller helps to eliminate the steady-state error due to theforcing term G. However, the additional integral component comes with a price ?it is very sensitive to the values of KI , and can easily make the system oscillate. APID controller was used in numerous applications, e.g., [53, 62].Finally, it is mentioned briefly how the linear systems theory deals with non-zero targets xr. In this case, the values xt ? xr can be used as a new state variable,thus reducing the problem to the one described above. In addition, the above meth-ods are also applicable in the case when the linear model parameters A and B aretime-dependent.Control of Non-Linear Systems. The problem of asymptotic reaching control inthe case of arbitrary dynamics f (xt ,ut) is a hard one, and there are no methods thatguarantee the correctness of the solution in a general case. However, there existmethods that work in some special cases.The first group of methods operate on the assumption that the non-linear dy-namical system is locally linear w.r.t. the state. One of these methods is namedfeedback linearization [29]. However, this approach requires a fairly precise modelof the system to work properly. Another method is known as gain scheduling, andessentially represents a PID controller with state dependent gains, KP(x),KD(x),KI(x).However, the problem of figuring out how the gains depend on the state is not aneasy one. It can be approached, for instance, by using reinforcement learning al-gorithms as in [74]. Finally, a technique named backstepping can be used on topof the above methods to handle more complicated systems [44].11The second group of methods represents so-called hybrid controllers, which arecharacterized by switching between several existing controllers based on context.For example, sliding mode control implies keeping the system on a specific surfaceof the state space, by activating separate controllers that pull the system towardsthis surface from both sides. This type of control can be very useful for many tasks,e.g., obstacle avoidance. It should be mentioned that there also exist methods basedon potential functions for solving the obstacle avoidance problem [59]. Anotherexample of a hybrid controller is described by Burridge et al. [10], who introducethe technique for stacking the controllers which are valid only locally to reachtargets across the whole state-space.Finally, the third group of methods for non-linear asymptotic control is relatedto the equilibrium point control methods [42, 68]. These methods are based on theidea of equilibrium points. An equilibrium point is a pair of a state and a control{xeq,ueq}, such that xeq = f (xeq,ueq), that is, the system does not move. The idea ofequilibrium control is that to reach a target state xr, it may be sufficient to find thecontrols ueq, such that {xr,ueq} is a stable equilibrium point, and set u(xt , t) = ueq.The beauty of equilibrium point control is its simplicity, as there is no needfor feedback and the only thing that needs to be learned is a map from the statesxr to the corresponding equilibrium controls ueq. However, there are a number ofproblems with this idea. First of all, there could be multiple controls that set thesystem in the equilibrium at a given point xr (the redundancy problem). Second,there could be multiple states that the given control ueq stabilizes, based on theinitial state, in which case one will not be able to solve a control problem in theentire state space. Third, the system could be very sensitive to the exact values ofequilibrium controls applied, which may cause errors due to the lack of feedback.Finally, even if everything else works out nicely, the system tends to converge tothe equilibrium point very slowly.However, there are methods based on equilibrium point control that are ableto overcome its slow convergence, e.g., [42, 68]. One notable idea is a pulse-stepcontrol paradigm described, for example, in [5, 60]. It is based on the fact that bio-logical control signals in reaching tasks for eye saccades do have a typical structure- first there is a strong ?pulse? that drives the system to the desired configuration,and then there is a ?step? signal that keeps the system in equilibrium at the desired12location ? that is, the step signal is exactly the equilibrium control ueq. This typeof control was successfully implemented by Lesmana and Pai [46] in their roboticsystem modeling human saccades.Control of Input-Constrained Systems. The most common problem that biologi-cal systems face is the fact that the controls ut are typically constrained ? musclescan pull only in one direction and have a maximum feasible activation. That is,mathematically, one can say that, after normalization, 0? ut ? 1.It turns out that this slight complication makes the methods described abovefail for the task of controlling the system if applied directly, with the exception ofthe equilibrium-point control methods (which is an interesting observation giventhat equilibrium control is exactly the one that has been proposed to be used inbiological systems).However, these methods can be used to control input-constrained systems in-directly, through the framework named Hierarchical Control (Section 2.2.3).2.2.2 Optimal ControlIn Section 2.2.1 the performance measure for the controller was the asymptoticbehavior of the system. In the case of the optimal control problem discussed in thissection, the goal of the controller is to minimize the cost of movement:C(u(x, t)) =T??=0L(x? ,u? ,?),where u? = u(x? , t), and x?+1 = f (x? ,u?).Here it is assumed that T is some fixed finite number, while the problems withinfinite horizon T = ? are not discussed. The instantaneous cost L(x,u, t) is afunction of state and controls. It is also assumed that L(x,u,T ) = ?(x), that is atthe final timestep the cost depends solely on the final state xT .The cost of movement C is a function of a policy u(x, t), since the policy de-termines the states x? the system is going to be at during the movement (that is,the system trajectory). Thus, the optimal control problem is to find u?(x, t) thatminimizes the cost C across all the policies u(x, t):13u?(x, t) = argminu(x,t)T??=0L(x? ,u(x? ,?),?), t = 0, ...,Tx?+1 = f (x? ,u?)(2.5)Hamilton-Jacobi-Bellman Equation. The problem of optimal control can also bethought of in terms of an optimal cost-to-go function J(x, t) = minu(x,t)C(u(x, t)).This function represents the lowest cost that can be possibly paid by being at a statex at time t, that is J(x, t) =C(u?(x, t)). In other words,J(x, t) = argminu(x,t)T??=tL(x? ,u(x? ,?),?), t = 0, ...,Twhich it turn yields a recursive Hamilton-Jacobi-Bellman (HJB) equation:J(x, t) = minu(x,t)(L(xt ,u(xt , t), t)+ J( f (xt ,u(xt , t)), t +1)), t = 0, ...,T ?1J(x,T ) = ?(x)(2.6)Then, the optimal controls can be computed asu?(x, t) = argminu(x,t)(L(xt ,u(xt , t), t)+ J( f (xt ,u(xt , t)), t +1)), t = 0, ...,T ?1 (2.7)A direct method for solving Equation 2.6 is named dynamic programming andinvolves computing the values J(x, t) backwards in time, starting with t = T whenthe boundary condition J(x,T ) = ?(x) holds. However, this methods suffers fromthe curse of dimensionality and thus can be applied only for small discrete stateand control spaces.Linear-Quadratic Regulator. As in Section 2.2.1, the linearity of the system dy-namics simplifies the problem greatly. A Linear-Quadratic Regulator (LQR) givesus the analytic solution to HJB equation in a familiar state feedback form, u?(x, t) =Kx, if14xt+1 = Axt +ButL(x,u, t) =12Qx2 +12Ru2for some constants A,B,R,Q.LQR method can be extended to the case when system is injected with Gaussiannoise ? such a method is known as Linear-Quadratic Gaussian (LQG).The problem with this approach is that LQR is not a robust method, that is itdoes not perform well if the model of the system is not precise [22], which presentsa problem in many practical applications.Linear Hamilton-Jacobi-Bellman Equation. Another group of optimal controlmethods are related to systems with dynamics xt = A(xt)+But , and the cost func-tion L(x,u, t) = ?(x)+Ru2, where A,? are functions of a state and B,R are con-stants. Somewhat counterintuitively, injecting Gaussian noise into the system dy-namics allows to transform Equation 2.6 to a linear partial differential equation[37].The linearity of this transformed HJB equation gives rise to a family of optimalcontrol methods [76], including path integral control [36, 37]. In addition, thislinearity enables a new method for combining optimal controllers with differentgoals, Linear Bellman Combination [16, 77].It should be pointed out that in the case of constrained controls, 0? u? 1, evenLQR is not guaranteed to produce an optimal policy for linear systems. However,the problem can still be solved approximately with the help of hierarchical control(Section 2.2.3).Locally Optimal Control. While the optimal control problem can be solved in alinear case with quadratic cost function, even in the case of linear HJB equation theoptimal solution is still hard to compute, and in many systems one gets more com-plicated problems. Thus, many algorithms were developed to find locally optimalcontrols ? that is, the controls u(x, t) that are optimal only in some neighborhood ofthe given trajectory x? ,? = 0, ...,T , but not necessarily optimal for all the possiblestates x.15All of these methods have a similar structure. The controller starts at someinitial state x0 with an initial guess of the optimal policy, u0(x, t). This currentpolicy generates a trajectory x? , with the cost of movement C(u0(x, t)). Then, thenew policy u1(x, t) is computed by applying the described above optimal controlmethods locally around the trajectory, for instance, by linearizing the dynamics inthe case of iterative Linear-Quadratic Regulator (ILQR) [48] and iterative Linear-Quadratic Gaussian (ILQG) [75], or by sampling the state space locally in PolicyImprovement with Path Integrals (PI2) [74]. The new policy u1(x, t) is then used togenerate a new trajectory x? , and the process is repeated till convergence.In addition, if the controller has a good model of the system dynamics, it ispossible to estimate what would be the movement cost C for a given policy u(x, t).Then, one could directly apply iterative non-linear solvers (e.g., gradient descent)to Equation 2.5 and find the optimal policy u?(x, t). Most efficient non-linearsolvers are local thus the policy found is typically only locally optimal, e.g., see[26]. There exist, however, powerful evolutionary algorithms like Covariance Ma-trix Adaptation (CMA) that can find a more global solution to Equation 2.5 [24, 50].Finally, it should be mentioned that locally optimal control methods imply pre-computing the policy u(x, t) and then running it without further change. However,one might also think about recomputing optimal controls after each timestep asfeedback information arrives. This method is named model-predictive control, andalthough it typically improves the solution, it could be computationally expensive,unless applied for efficient enough methods with a good initial guess of the locallyoptimal policy.Reinforcement Learning. Reinforcement learning in general includes any type ofmethods that find the optimal policy minimizing the cost of movement. This chap-ter, however, defines the reinforcement learning problem as finding the optimalpolicy when the cost L(x,u, t) is not available to the controller directly ? instead,the controller receives rewards and punishments at every timestep during the exe-cution.There are number of methods to find u?(x, t), e.g., temporal-difference meth-ods, and biological data that supports claims that some of those methods can beimplemented in the human brain [28, 40]. However, the problem itself is a very16hard one to solve in a general case [18].There have been attempts to adapt function approximators described in Sec-tion 2.1 for learning the optimal controls u?(x, t) in a reinforcement learning set-ting, for example, the use of RNN was explored in [3, 6, 25, 41, 43, 67, 69], and theuse of ESN in [11].In addition, another interesting branch of reinforcement learning methods isthe apprenticeship learning developed by Abbeel and Ng [1]. In this method, thesystem learns how to imitate the expert by guessing what the cost of the movementis, based on what the expert is doing.It should be mentioned that among optimal control methods described above,only PI2 [74] and CMA-based methods [24, 50] were designed to work in the re-inforcement learning setting. For instance, PI2 shows state-of-the-art performancefor controlling the Anatomically Correct Testbed (ACT) hand [62, 63].2.2.3 Hierarchical ControlIn Sections 2.2.1 and 2.2.2 it is observed that there exist a limited number of meth-ods for controlling complex systems, and they can be applied in a very restrictedsetting. However, the methods designed for restricted setting can be used for con-trolling more complex plants by introducing the hierarchy in the controller in thefollowing way.The controller has a high-level model of the system, which is typically lin-ear or simple enough, xt+1 = f?v(xt ,vt), where vt are high-level controls denotedby kinematic controls. This high-level system is controlled using a policy v(x, t),computed by any of the methods described above. Then, the kinematic controlsvt computed at time t are sent to the low-level controller, which transforms themto the motor controls using the function g, that is ut = g(xt ,vt). For example, vtcould represent a state to reach at the next timestep, and g(xt ,vt) would computethe controls necessary to reach that state.Such an architecture is used in the vast majority of control theory applications,while some of the papers mention it explicitly [32, 49, 82] and others do it implic-itly [23, 24, 53].Consider an example of such an architecture implemented by Deshpande et al.17[20]. In this case, the controller has a locally-linear model of system dynamicswith constrained input:xt+1 = xt +B(xt)ut0? ut ? 1Then, the kinematic controls are set to be vt = xt+1? xt . Hence, the high-leveldynamics are now described asxt+1 = xt + vtThe above dynamical system is then controlled by a PID controller. The lower-level function g(xt ,vt) returns the controls ut , such thatvt = B(xt)ut0? ut ? 1That is, the lower-level control g is implemented by a standard constrainedlinear least squares solver.The example above demonstrates why the hierarchical control architecture presentsa nice way of dealing with non-linearities and constraints ? the high-level systemcan often be set up to be linear and unconstrained which allows the applicationof a variety of methods, and the constraints could be handled independently by alow-level function g. However, the trouble is that the low-level controller cannotalways execute the higher level commands exactly, e.g., if kinematic controls de-scribe the system?s desired state at the next timestep, it is typically impossible tofind the controls that move the system exactly in this way. In addition, low-levelmodels themselves are often only crude approximations of real system dynamics.Thus, the high-level controller has to be tuned with the specific low-level controllerin order for the whole system to work properly.The idea of hierarchical control can be easily extended further by introducingmore levels of hierarchy. As an example, consider the system developed by Rom-bokas et al. [62] for controlling Anatomically Correct Testbed (ACT) hand. In thiswork, a desired kinematic trajectory is generated by using Dynamic MovementPrimitives (DMP) [56, 57]. The parameters of this trajectory are learned by Policy18Improvement with Path Integrals (PI2) method. Then, a high-level PID controllergenerates kinematic controls to track the trajectory, and a low-level constrainedlinear least squares solver finds the motor controls necessary for moving the arm.Another example was implemented by Geijtenbeek et al. [24], where the desiredtrajectory is computed with CMA, a high-level PD controller generates kinematiccontrols and a low-level controller transforms them into muscle activations. Itshould be mentioned that in the latter example the objective function, optimized byCMA, included not only the desired trajectory, but also other system parameters,such as controller gains and muscle physiology parameters.2.2.4 Adaptive ControlThis section deals with the problem of adaptive control, that is, adjusting the ex-isting controllers to new environments. Consider a simplified dynamical system,x = f (u) where f is not known by the controller. In this case, the problem of reach-ing a particular target state xr reduces to finding the inverse model of the dynamics,that is such a function u = f??1(x), that f (u) = f ( f??1(x)) = x. Assume further-more, that there exists a black-box controller, denoted as the brainstem, that findsapproximate controls uB = B(x). The goal of an adaptive controller, denoted as thecerebellum, is to adjust the controls uB to the value uA, such that f (uA) = x.This task can be solved in two ways ? by using the forward architecture or therecurrent architecture.Forward Architecture. According to the forward architecture framework, to ad-just the controls uB one adds an additional controller that learns online to com-pute the adjustment uC = C(x), such that uA = uB + uC [39]. In other words,f??1(x) = B(x)+C(x), where B is a fixed but unknown function, and C is the func-tion to learn.Thus, uC =C(x) is a solution to Equation 2.8. One obvious way to implementC is by approximating the system dynamics f (u), black-box controller B(x), andsolving the non-linear equation Equation 2.8 to find uC. However, this sectionfocuses on learning a parameterized function C(x) by using non-linear functionapproximators, e.g., described in Section 2.1.19f (uC +B(x)) = x (2.8)First, let us understand what data is needed to adjust the parameters of C(x).At every step, the controller gets a target value xr, and the cerebellum computesthe adjustment uC = C(xr). Assume that there exists a solution to Equation 2.8,u?C, that is f (u?C +B(xr)) = xr. Then, the motor error is defined as ?u = u?C? uC.Intuitively, to figure out how to adjust the parameters of C(x), one needs to knowthe motor error [58].However, the controller faces the motor error problem ? the true solution u?Cis unknown. Instead, the controller can get the data about what has happened tothe system, that is the value x = f (uC +B(xr)). As a consequence, the sensoryerror, ?x = xr? x can be computed, and the relationship between the motor andthe sensory error becomes?x = R(x,uA)?u, (2.9)where the function R(x,uA) is named sensitivity derivatives [2], and has to be esti-mated to enable learning of the cerebellum function, C(x).There are a number of methods for sensitivity derivatives estimation ? distalteacher by Jordan and Rumelhart [35], implicit supervision by Abdelghani et al.[2] and weightless learning with reverberating loops by Fortney and Tweed [23].However, there is another problem that the forward architecture faces ? the re-dundancy problem. It turns out that in most biological systems Equation 2.8 hasmultiple solutions [58], and hence the inverse model f??1(x) is not well-defined,which forces the controller to either make a choice between various solutions oraverage them in some way.Recurrent Architecture. The existence of the motor error and redundancy prob-lems inspired Porrill and Dean [58] to develop a recurrent architecture, which pro-poses the adjustment of the input xr to the brainstem B(x) instead of adjusting itsoutput uB. That is, in the recurrent architecture, the cerebellum C(u) is a functionof controls, that solves Equation 2.10.20uB = B(xr +C(uB))f (B(xr +C(uB))) = x(2.10)Porrill and Dean [58] propose a way to learn the function C(u) from this recur-rent equation by applying gradient descent for the objective function L(xr) = ?x2,which utilizes the sensory error ?x directly, without the need for the explicit knowl-edge of sensitivity derivatives. However, even though the authors were able to showthe convergence of this method to the solution of Equation 2.10 under some con-ditions, it is clear that the success of the recurrent architecture depends highly onthe properties of the brainstem controller B(x). For instance, Equation 2.10 can-not be solved if the image of the function B(x) does not include the solution u?A tothe problem xr = f (u). Nevertheless, if the brainstem B(x) provides a good initialguess for controls, the recurrent architecture typically presents a neat way to adaptthe system [58].Biological Data Support. The problem of adaptation to the new environment isthe typical task humans must solve during limb movements. It is widely believedthat for such tasks the black-box controller B(x) is located in the part of the braincalled brainstem (as the reader may guess from the naming convention) [19]. It isalso believed that the adjusting controller, C(x), is located in the human cerebellum? there is a lot of physiological data showing that damage to the cerebellum disruptsthe adaptation process [70].There also exists physiological support for the existence of forward [38?40]and recurrent [19] architectures in the cerebellum ? it is quite possible that both ofthem can be implemented in the human brain despite their dissimilarities.2.3 Biomechanical Tendon-Driven SystemsIn the final section of this chapter, the typical properties of tendon-driven systemsin biomechanics are reviewed.A tendon-driven system is built upon a system of rigid bodies (bones) con-nected to each other by joints. A tendon is a very stiff elastic strand that attachesto the bone from one side, and to the muscle (motor) on the other. The muscle is21Figure 2.1: An example of a tendon-driven system (or plant). This plant con-sists of a single moving bone (the upper one) and two inextensible ten-dons (green-blue) attaching it to the muscles ? the flexor (yellow) andthe extensor (red).activated by a controller, and applies force to the tendon, which in turn transfersit to the bone, rotating it around the joint. An example of such a system is shownin Figure 2.1. One of the most notable (and complex) examples of tendon-drivensystems in human anatomy is the hand.Tendon-driven system are often presented as a biological alternative to thetorque-driven systems. In torque-driven systems, the joint torques are directly ap-plied by a controller and the dynamics of the joints are thus decoupled, whichmakes it straightforward to use a PID controller (Section 2.2.1) to control the sys-tem. On the other hand, tendon-driven systems usually have their joints coupled byinterconnected tendons, and the effect of controls is highly non-linear ? the torquesthat the muscle applies to the joints depend on many variables such as the jointsconfiguration, angular velocities, tendon elasticity and muscle behavior. In addi-tion, nearly all tendon-driven systems are redundant and input-constrained, in a22sense described in Section 2.2. All of the above makes controlling a tendon-drivensystem a hard task.However, the biological tendon-driven systems tend to show some nice prop-erties. By making an assumption that the tendons are inextensible, it can be shownthat the muscle force transforms to joint torques linearly with the transformationmatrix (e.g., moment arms) being configuration-dependent [9]. If the influenceof velocity on muscle force is ignored, then the system can be approximated as alocally linear model for a small enough timestep:xt+1 = A(xt)+B(qt)ut ,where ut are the muscle activations identified with forces, xt =[qtq?t]is the state,and qt is the configuration of the system.Of course, this approximation comes with a price. In addition to the assump-tions made above, this model cannot capture highly non-linear behaviors of theplant, which may be induced by static friction, joint limits or too large a timestep.However, this approximation is usually good enough to enable control of complexrobotic tendon-driven systems [20].There is also some physiological evidence that biological systems utilize theidea of model reduction (Section 2.1) for controlling the limbs [9, 17], and it couldbe that muscle synergies allow the brain to make a better linear approximation ofthe system [9]. It also appears that the non-linear spring-like passive behavior ofthe muscles greatly simplifies the control [21].Humans show stunning performance on a variety of hand movement taskswhich far exceeds the abilities of modern robots, despite the complexity of thehand. However, the above information suggests that a big part of this performancemay be due to the physical properties of the human hand. Thus, there are attemptsto build precise models of the hand to understand its properties and behavior better.Probably the most elaborate robotic system that models the hand as a tendon-driven plant is the Anatomically Correct Testbed (ACT) hand developed by Y. Mat-suoka?s group [78], and a number of different approaches for controlling it wereexplored, e.g., [20, 53]. Among the simulated models, we point out the strand-23based model of the hand created by Sueda et al. [71], and its updated version [47],developed using the Eulerian-Lagrangian simulator Strands++ [72].Finally, the choice of a configuration variable q should be discussed. The sen-sory information about the most obvious candidate, joint angles, is not directlyavailable to the human brain [20]. Thus, there were attempts to use tendon excur-sions [53] and fingertip coordinates [71] to describe the state. These representa-tions, however, suffer from other problems ? tendon excursions are highly coupledand thus overspecify the state, and fingertip coordinates cannot fully describe thetrue state of the system.24Chapter 3One-Step-Ahead controllerThis chapter derives the One-Step-Ahead controller (OSA) for solving asymptoticreaching and tracking problems for complex non-linear input-constrained systems,which is the major contribution of this thesis.Section 3.1 discusses the motivation and preferred design of OSA. Section 3.2derives the OSA algorithm using the idea of hierarchical control. Section 3.3 de-scribes the methods for learning OSA parameters. Section 3.4 presents a high-leveldescription of an agent that uses OSA for reaching movements, and learns its pa-rameters online by stabilizing the plant at various configurations. The ideas de-scribed in this section are tested in Chapter 4. Finally, Section 3.5 outlines animportant extension of OSA which involves learning a precise model of systemdynamics.The next chapters of this thesis are devoted to assessing the performance ofOSA controller.3.1 MotivationConsider the task of finding a control policy ut = u(xt , t) for an input-constrainedbiomechanical system described by Equation 3.1.25xt+1 = f (xt ,ut)ut = u(xt , t)0? ut ? 1(3.1)Recall from Section 2.2, that control of non-linear complex input-constraineddynamical systems is a very hard problem for any type of control, and there areno general-purpose methods that can solve this task efficiently. In addition, thefunction f is unknown to the controller so its dynamics have to be approximated.Thus, the controller would benefit from exploiting the typical structure of biome-chanical system dynamics. To further simplify the system, the idea of hierarchicalcontrol can be utilized (Section 2.2.3).For hierarchical control, it is reasonable to suggest that the high-level systemshould provide a valid kinematic trajectory vt for a plant to follow, while a low-level controller ut = g(xt ,vt) should deal with finding motor controls that executethe high-level goal and satisfy the constraints.One can envision two general ways to learn the low-level controller g. Thefirst way involves learning this non-linear function directly from the data and post-process its output so that the controls are within the constraints. However, thismethod faces the redundancy and motor error problem as described in Section 2.2.4,and constraints are satisfied by an ad-hoc method which may not yield good re-sults in practice. The second way to implement the low-level controller is model-predictive control (Section 2.2.2), which involves solving a constrained non-linearequation. The latter is usually a computationally expensive task to solve, and oneway to make this type of control feasible is to assume that the equation to solve islinear, as was done in [20, 53].Thus, the goal of this chapter is to design a hierarchical controller which ex-ploits the specifics of biomechanical systems, and solves the low-level constrained-input problem with a help of constrained linear least squares solvers.3.2 One-Step-Ahead controller (OSA) DesignSection 2.3 mentions that the dynamics of biomechanical systems xt+1 = f (xt ,ut)could be approximated as being locally linear w.r.t. controls, that is26xt+1 = A(xt)+B(xt)ut (3.2)In addition, it is assumed that the state of the system can be approximated asx =[qq?], where q is the configuration of the plant, e.g., its joint angles. Furthersimplification of Equation 3.2 includes ignoring active force-velocity relationship? that is, the influence of velocity q? on a function B(xt) in the system dynamics(Equation 3.3).xt+1 = A(xt)+B(qt)ut (3.3)Activation-Position Matrix (APM). While the algorithms developed here are eas-ily adjusted for any structure of the matrix B(q), in the related work it was oftenassumed that B(qt) =[R(qt)2?t R(qt)], where ?t is the timestep, and we chose to keep itfor consistency [20, 53].It is a fairly reasonable assumption if the controls directly influence accelera-tion q? of the plant. Indeed, consider the continuous systemq? = uIntegrating this system over timestep ?t yieldsq(t +?t) = q(t)+?tq?(t)+?t22uBy taking x =[qq?], this solution can be rewritten in a state-space form asx(t +?t) =[1 ?t0 1]x(t)+[?t22?t]uThe above solution generates a discrete state-space systemxt+1 =[1 ?t0 1]xt +[?t22?t]ut27Comparing this system with Equation 3.3, it is easy to see that B(qt)=[R(qt)2?t R(qt)],with R(qt) = ?t22 .In this work the matrix R(qt) is denoted by Activation-Position Matrix (APM),and its learning is one of the central issues for the success of OSA, since APMcompletely determines the controller?s knowledge about the relationship betweenmotor controls applied to the plant and the resulting state change.Designing a Hierarchical Controller. Recall from Section 2.2.3 that a hierarchi-cal controller finds a control policy u(x, t) by first finding a high-level policy v(x, t),and then transforming it to the motor controls ut (Equation 3.4).xt+1 = fv(xt ,vt)vt = v(xt , t)ut = g(xt ,vt)(3.4)where vt are the high-level controls, fv(xt ,vt) are the high-level dynamics andg(xt ,vt) represents the low-level controller.As was discussed in Section 3.1, finding ut through the constrained linear leastsquares solver is a highly desirable feature of the controller, and hence we choosevt = R(qt)ut . Thus, to find ut , one needs to solve a constrained linear least squaresproblem:vt = R(qt)ut0? ut ? 1(3.5)With this choice of kinematic controls vt , the high-level dynamical system islocally linear and has no constraints on vt :xt+1 = A(xt)+[12?t]vt (3.6)Controlling the High-Level System. Consider the task of asymptotic reachingcontrol for Equation 3.6, in which case the goal is to reach some target state28xr =[qrq?r], where qr is a target configuration and q?r = 0. One way to controlthis system is to use the feedforward control method (Section 2.2.1). In this case,vt = v f (xt , t)+ vb(xt , t), (3.7)where v f (x, t) is a precomputed policy, which simplifies the dynamics enoughso that feedback policy vb(x, t) can finish the job.The feedback policy can be implemented with a PD controller (Section 2.2.1),which can be written asvb(xt , t) = KP(qt ?qr)+KD(q?t ? q?r), (3.8)where KP,KD are square diagonal matrices.This high-level controller can also be used for a purpose of tracking a prede-fined trajectory. Although the PD-controller is not very powerful, its advantage isrobustness as it can be effective for a wide range of unknown dynamics providedgains are carefully picked.Combining Equations 3.6, 3.7 and 3.8, the high-level system becomes:xt+1 = A(xt)+[12?t]v f +[12?t]vbv f = v f (xt , t)vb = KP(qt ?qr)+KD(q?t ? q?r)(3.9)However, a reasonable question arises ? how to pick such a feedforward con-trol policy such that Equation 3.9 can be controlled by a simple PD-controller?One possible solution is to use v f for canceling non-linearities of A(xt). Anothersolution is to find a policy that drives the system to the goal (so that even if the PD-controller fails to do well, the system will remain stable). Section 3.3 will showthat there is a possibility of finding such a feedforward policy through the use ofequilibrium point control described in Section the Low-Level System. To find low-level controls ut , one just needsto solve a constained linear least squares problem (Equation 3.5), which can berewritten asvb + v f = R(qt)ut0? ut ? 1(3.10)However, a note should be made about the way the controller stores the kine-matic control policy v f (x, t). Indeed, the estimate of R(qt) may change duringlearning, in which case the policy v f (x, t) needs to be adjusted. To avoid this prob-lem, the controller can keep track of feedforward commands on the lower level,that is store the feedforward motor controls u f (x, t), and set v f (x, t) = R(x)u f (x, t).This allows the controller to change the estimate of the system dynamics withoutworrying about adjusting the feedforward policy. On the other hand, direct learn-ing of v f (x, t) helps to avoid the problem of redundancy, thus the choice of how tostore feedforward command should be carefully made by an algorithm designer.In addition, one may introduce a penalty for the large activations, ??u?, intothe least squares problem to encourage smooth movements [71]. There exist otherregularization methods which could be used separately or together for further smooth-ing of the movements. For example, the difference between the current and theprevious control vectors could be penalized [71], or the activation dynamics couldbe introduced by blending the current and the previous controls [75].Algorithm. Figure 3.1 summarizes the theory described above and presents aOne-Step-Ahead controller (OSA) algorithm for finding motor controls at a giventimestep for a reaching (or tracking) task. In this algorithm, the feedforward com-mand is stored on a lower level as u f (x, t). In addition, note how the least squaresproblem from Equation 3.10 was rewritten in the form of Equation 3.11:ut = argminu?(vb + v f )?R(qt)u?2 +??u?2s.t. 0? u? 1(3.11)The OSA algorithm is also schematically represented in Figure 3.2.30Figure 3.1: The One-Step-Ahead controller (OSA) algorithm. Given the (un-known to the controller) plant equation xt+1 = f (xt ,ut), the goal of thecontroller is to find a control signal 0 ? ut ? 1, such that the systemstate, xt , drives to the reference state, xr.31Figure 3.2: A schematic representation of the One-Step-Ahead controller(OSA) algorithm.3.3 Learning Parameters of One-Step-Aheadcontroller (OSA)3.3.1 Learning Feedforward CommandThis section turns to the question of learning the feedforward policy u f (x, t) forOSA controller. It is generally not an easy task since the policy is exactly what OSAcontroller computes itself. Thus, feedforward policy should describe some goodinitial guess of controls that is not too hard to obtain.One solution suggested for an asymptotic reaching control task is to use equi-librium point control as a feedforward policy (Section 2.2.1) [42]. According tothe equilibrium point control method, in order to drive the system to the stationarypoint xr =[qr0], one just needs to use the policy u f (x, t) = ueq(qr), where ueq(q)32are the controls that keep the system at the equilibrium at the configuration q. Thatis, to reach a particular target, constant controls are applied. Even though, as wasmentioned in Section 2.2.1, equilibrium point control is very slow, it often drivesthe system to the desired location even for complex systems, and thus it couldbe a good candidate for implementing a feedforward policy for a reaching task[15, 42, 68].To understand how exactly the use of equilibrium controls transforms the sys-tem dynamics, Equation 3.3 can be rewritten in a more convenient form. Assumingthat A(x) is an analytic function, it can be expanded in Taylor series w.r.t. q? aroundq? = 0:A(x) = A(q, q?)= A(q,0)+??i=11i!?A(q, q?)? q?????q?=0? q?i= A(q,0)+ q???i=11i!?A(q, q?)? q?????q?=0? q?i?1= A1(q)+A2(q, q?)q?,where A1(q) = A(q,0) and A1(q, q?) = ??i=11i!?A(q,q?)? q????q?=0? q?i?1.Using the above expansion and taking A1(q) =[A11(q)A21(q)], A2(q, q?) = A2(x) =[A12(x)A22(x)]and B(q) =[B1(q)B2(q)], Equation 3.3 becomes[qt+1q?t+1]=[A11(qt)A21(qt)]+[A12(xt)A22(xt)]q?t +[B1(qt)B2(qt)]utAfter defining G1(q) = A11(q)? q and G2(q) = A21(q), the above transformsinto[qt+1q?t+1]=[1n?n A12(xt)0n?n A22(xt)][qtq?t]+[B1(qt)B2(qt)]ut +[G1(qt)G2(qt)], (3.12)where n,m are the dimensions of configuration and control vectors, respec-33tively, B1(qt),B2(qt) are n?m configuration-dependent matrices, G1(qt),G2(qt)are n? 1 configuration-dependent vectors, and A12(xt),A22(xt) are square n? nstate-dependent matrices. If A(xt) is polynomial w.r.t. q? (which is a reasonableassumption), then Equation 3.12 is valid for all velocities, otherwise it is valid onlyfor small enough velocities.Let us now return to the equilibrium controls. By definition, ueq(qt) are thecontrols that ensure that if the system is at the state xt =[qt0], then xt+1 = xt . Thatis, applying it to Equation 3.12:[B1(q)B2(q)]ueq(q) =?[G1(q)G2(q)](3.13)Equation 3.13 implies that knowing the matrix B(q) and equilibrium controlsueq(q), one automatically gains the knowledge about a good chunk of the systemnon-linearity, more specifically, about all of the velocity-independent non-linearterms G(qt) =[G1(qt)G2(qt)]in passive dynamics A(xt). This implies that using a feed-forward control policy u f (x, t) = ueq(q) could be promising as it makes the high-level system Equation 3.9 close to linear. However, using the equilibrium pointcontrol policy u f (x, t) = ueq(qr) could be more beneficial for a reaching task, be-cause in this case the remaining non-linearities are useful as they pull the systemto the target configuration. In this way, equilibrium point control could serve bothas a feedforward command, and a tool to linearize the system dynamics (if thefunctions A12(xt),A22(xt) are close to being constant).Finally, the issue of learning equilibrium controls should be discussed. Oneobvious solution is to stabilize the plant at some configuration q by using any avail-able controller and store the current controls ut , which give a good estimate of theequilibrium controls ueq(q) (Algorithm 1). However, in case the controller pro-vides oscillating controls, some improvements like averaging the controls over aprolonged period of time may be needed to estimate the equilibrium controls bet-ter.34input : qt ? current stable configuration of the plantut ? current controls, computed by some controller whichstabilizes the plant at qt .output: ueq(qt) ? equilibrium controls at the current configurationueq(qt)? utAlgorithm 1: The procedure for collecting equilibrium controls when thesystem is stabilized by some controller at a given configuration qt .3.3.2 Learning Activation-Position Matrix (APM)Two methods of interest have been proposed to estimate Activation-Position Matrixat a given configuration q. Deshpande et al. [20] used the principle of virtual workto estimate APM as a jacobian transpose of the coordinate transformation. How-ever, this method is purely kinematic and does not take the system dynamics intoaccount. Malhotra et al. [53] proposed a method they named self-identification toestimate APM, but this method was applied only for one configuration, and it doesnot take into account the passive movement of the system (that is, it presumes thatG1(q) = 0, G2(q) = 0 in Equation 3.12).This section proposes a method for learning APM at a given configuration q(Algorithm 2), based on the self-identification method with an addition of takingpassive dynamics of the system into account. In short, the matrix R(q) learned inthis way shows how much the system moves if one of the muscles is fully acti-vated, relative to the passive movement of the system. In other words, Algorithm 2estimates the jacobian B(q) of the system dynamics f (xt ,ut) w.r.t. ut , under theassumptions that f is locally linear w.r.t. ut and B(q) =[R(q)2?t R(q)].Algorithm 2 shares the advantages of self-identification method in that it allowsaquiring samples for each of the entries Ri j(q) of APM directly, thus making iteasier to learn functions for each of those entries separately from each other.However, a cautious reader would note that this algorithm is not well suited foronline execution in a non-simulated environment, as the system has to be put intothe same state x again and again. Fortunately, this turns out not to be a hard con-straint, since the entries of APM could be fitted independently of each other. Thus,35input : Current state of the plant, x =[qq?]output: R(q) ? APM at the current configuration (n?m matrix)R? 0n?m/* Apply no activations to the system for onetimestep */u(0)? 0m?1[q(0)q?(0)]? f (x,u(0))for i? 1 to m do/* Move the system back to the state x andfully activate ith muscle for one timestep*/u(i)?[u(i)1 ... u(i)m]T, whereu(i)j ={0 if i 6= j1 if i = j[q(i)q?(i)]? f (x,u(i))(ith column of R)?(q(i)?q(0))endR(q)? RAlgorithm 2: Learning APM in a configuration q for the system withdynamics xt+1 = f (xt ,ut). Note, that some adjustments are needed forlearning APM in non-simulated environments, as the system has to be putinto the same state x again and again.36the data sample could be collected only for a single randomly picked coordinate ofthe control vector at a given configuration. A trick is still needed to estimate thepassive dynamics, that is x(0), at the same time. This can be done if Algorithm 2 isapplied at stationary states, in which case it has to be adjusted to account for non-zero equilibrium controls that keep the system still. The above, of course, requiresthe controller to do m times more reaching trials to get the same amount of data,but it can be safely said that if the goal is to learn to control the plant in a lifetime,a linear factor in running time is not a hard restriction to impose on the algorithm.Thus, Algorithm 2 appears to be an effective way of learning APM and can beadjusted for the online learning procedure. In this thesis, however, Algorithm 2 isapplied directly since it deals with simulated plants.3.3.3 Determining Other ParametersTimestep. One of the issues that a controller designer faces is the choice of timestepfor feedback control. In all the discussions above it was assumed that the systemdynamics are given as xt+1 = f (xt ,ut). However, biological systems have contin-uous dynamics, and often the amount of time passing between the current and thenext timestep can be determined by the controller.Generally, the longer the timestep, the more non-linear the dynamics xt+1 =f (xt ,ut) become, and the worse is the locally-linear approximation (Equation 3.2)of the true system dynamics. However, a longer timestep also means a more robustcontroller as it estimates the dynamics further in future.These factors, in addition to the biomechanical constraints such as sensory andmotor delays, should be taken into account while choosing the timestep for theoperation of OSA Controller.PD-gains. The PD-controller plays a crucial part in finding high-level policy forOSA controller, and it needs to handle the unknown, potentially non-linear dynam-ics (Equation 3.9). Thus picking the gains KP, KD properly is very important.One way is to pick them manually, as was done by, e.g., [20, 53]. In addition, toaccount for the non-linearities of the dynamics, one may think of applying gainscheduling approach (Section 2.2.1). However, it shall be mentioned that kine-37matic control gains are not that hard to pick, as kinematic controls vt directly relateto the desired change in the state position and velocity through Equation 3.9, andthe non-linearity of APM does not affect their performance.Choice of Configuration Space. Recall that the choice of configuration space (thatis, defining what q represents) can make a big difference in the properties of thesystem dynamics if the state is assumed to be x =[qq?], as discussed in Section 2.3.This problem is discussed further in Section 4.4.Determining Reference State. OSA controller attempts to follow the prespecifiedtrajectory xr, however, this trajectory itself can be determined by a yet higher-level controller, e.g., PI2 or any other optimal control algorithm discussed in Sec-tion 2.2.2. In this way, OSA can serve as a low-level trajectory tracker for morecomplex control architectures.3.4 Proposed Approach for Controlling BiomechanicalSystemsThis section describes one possible implementation of an agent which uses OSAfor doing the reaching (or tracking) tasks, and learns its parameters online. Herewe describe a rather high-level view of such an agent, while further in the thesis itis defined more rigorously for testing the ideas presented here.OSA has four parameters ? kinematic control gains, regularization parame-ter, feedforward command and APM (Section 3.2). Here it is assumed that gainsand regularization parameters are easy to pick manually, and the question of find-ing them is not discussed. However, feedforward command and APM have to belearned. We propose using equilibrium controls, ueq(q), as feedforward command,as suggested by Section 3.3.1. Note, that Section 3.3 argues that learning equilib-rium controls and APM is easy when the system is stabilized. Following the ideaof goal babbling [61], described in Section 2.1, the parameters could be learned bygoal-directed movements of the controller. Thus, the generic procedure for learn-ing OSA parameters online goes as follows.38Let the agent have a so-called rest configuration qs ? a position to which itcan always move the plant to. In the beginning, the agent has absolutely no priorinformation about the plant, except the dimensions of control and state vectors,manually tuned PD gains, KP and KD, and the regularization parameter ? . Then,the agent learns APM matrix at the rest configuration, R(qs), and sets R(q) =R(qs), u f (x, t) = 0 for OSA.Then, the agent generates reaching movement to some target location q?. Assoon as the controller stabilizes the plant at some location q? (not necessarily equalto q?, as the parameters are not perfect), the agent samples the values R(q?), ueq(q?)by using Algorithms 1 and 2. Then, the estimate of OSA parameters is updated. Thenew reaching target q? is chosen, and the process is repeated with the updated con-troller parameters. In this way, the agent collects sample values of OSA parametersat various configurations.The choice of target positions q? is determined by a specific online learningalgorithm used. For example, according to the goal babbling approach [61], reach-ing targets should lie on the path towards desired locations, thus the parameter?svalues are sampled in the area which is important for the current task.3.5 Using Learned System Dynamics withOne-Step-Ahead controller (OSA)The final section of this chapter is related to the problem of system identification.Consider the system dynamics in Equation 3.12. In OSA architecture, one needs theknowledge of B1(q) and B2(q) to solve low-level control problem (Equation 3.10).One can also get the knowledge about G1(q) and G2(q) by learning equilibriumcontrols (Section 3.3.1). However, A12(x) and A22(x) remain unknown, and aresupposed to be handled by a PD-controller. Knowing these non-linear functionsallows to build an approximation of the system dynamics (a forward model), whichshould potentially help to control the system better. For example, the forwardmodel can help to solve the problem of sensory and motor delays [46], and theinverse dynamics approach can be used for finding the feedforward command forOSA [59].If there exists a forward model f? (x,u) of the system dynamics f (x,u) which is39learned independently of OSA, one might as well use it for determining the func-tion B(x) from Equation 3.2, by setting B(x) = ? f??u , which can be used in OSAarchitecture.However, the forward model f? (x,u) can also be obtained by learning A12(x)and A22(x) separately after B(q) and equilibrium controls, using any of the methodspresented in Section 2.1. This may be an easier task than learning a full systemapproximation, but an extra care should be taken if all of the parts of Equation 3.12are updated simultaneously.We believe it to be an important direction to explore what is the efficient wayto learn and use the forward model of the system dynamics, however, this questionis outside of the scope of this thesis.40Chapter 4Control of Simplified HumanDynamicsThis chapter is devoted to the application of One-Step-Ahead controller (OSA) con-troller to the control of simplified human dynamics in an attempt to understandits potential in controlling complex biomechanical systems. Here, the approach tolearning and controlling biomechanical systems outlined in Section 3.4 is assessed.That section suggested an online learning method for collecting data, which can beused for improving the performance of OSA. According to this approach, the con-troller starts with a restricted local model of the plant, and gathers the data aboutthe plant behavior in other configurations by stabilizing the plant at those configu-rations. Thus, a reaching task was chosen to be a benchmark performance task inthe experiments described in this chapter.Section 4.1 describes a toy plant utilized for the experiments in this chapter.Section 4.2 designs the experiment which runs OSA without much prior knowledgeof the plant, and checks the validity of the collected data. Section 4.3 uses thesecollected data for fitting the parameters of OSA and shows that the performance ona reaching task improves significantly. Section 4.4 addresses the question of choos-ing different configuration variables. The findings are summarized and discussedin Section 4.5.41Figure 4.1: A representation of the F1T2 plant. The plant consists of a singlemoving bone (the upper one) and two inextensible tendons attaching itto the muscles ? the flexor (yellow) and the extensor (red).4.1 Simplified Human Dynamics (F1T2)In this section, the 2-dimensional toy plant used for the experiments is describedin detail. The plant is further denoted as F1T2 (stands for 1 joint, 2 tendons), andis intended to mimic biological musculoskeletal systems.The graphical depiction of F1T2 plant can be seen on Figure 4.1. It consistsof a planar 1-link arm, with the single moving bone. The bone is attached to theantagonist muscles (flexor and extensor) via inextensible tendons, and gravity pullsit upwards. The input to the plant is a 2-dimensional vector of muscle activations,and the output is the plant state (in this case represented by a joint angle and angularvelocity).Bone Model. The bone is represented by a uniform density rigid body of lengthLb = 0.5 m, which is rotated about one of its ends (the joint) by the muscles. Themass of the rigid body is m = 0.2 kg, the damping factor is b = 10 N?srad . The42configuration of the bone is thus represented by the joint angle, ? .Muscle Force. Both flexor and extensor muscles have identical parameters. Eachof the muscles has initial length L0 = 0.3 m, and produces the force F which isequal to the sum of the passive force, Fp, and active force, Fa. The active andpassive forces depend on the instantaneous length of the muscle, L:F = Fa +FpFp = fp ?10?Fa = u fa(?+1)(4.1)where 0? u? 1 is the activation level set by a controller, fp = 10 N, fa = 20 N,and ? = L?L0L0 is the muscle strain.Muscle Length. The length of the muscles, L, depends solely on a configurationof the rigid body i.e., the joint angle ? . In all the equations below, the variablesreferring to the extensor muscle are denoted by a subscript e, and variables referThe length of the flexor and the extensor muscles can be described asLe = L0 +?(2d cos? +d)2 +(2d sin?)2? l0L f = L0 +?(2d cos? ?d)2 +(2d sin?)2? l0(4.2)In the above formulas, the parameter d is equal to 0.05 m, and l0 = d?5 .Moment Arms. The moment arms for the two muscles are computed in the fol-lowing way:Me =?2d cos?(2d sin? +d)+2d sin? ?2d cos?M f =?2d cos?(2d sin? ?d)+2d sin? ?2d cos?(4.3)Muscle Torques. Let us denote by Ff and Fe the forces that the flexor and theextensor muscles apply at a given moment. Then the torques that the muscles exerton the joint become43?e = Me ?Fe? f = M f ?Ff(4.4)Gravity. The gravity exerts the torque?g = mgLb3cos? , (4.5)where g= 9.8 ms2 , that is according to Figure 4.1 the gravity is pointing upwards.Plant Dynamics. The dynamics of the system are given bym(Lb3)2?? +b?? = ? (4.6)where the net torque? = ? f + ?e + ?g (4.7)Constraints and Initial Configuration. The joint angle ? is constrained to be inthe interval [0.3,pi?0.3], and the initial configuration is ? = pi2 radLocally-Linear Representation. To illustrate the connection of the above plantequations with the theory outlined in Section 3.2, we now show how these contin-uous formulas could be discretized and transformed to a form of Equation 3.3.First, let q = ? . Using Equation 4.6 and setting x1 = q,x2 = q?, it can be seenthat[x?1x?2]=??0 10 ? bm(Lb3 )2??[x1x2]+??0?m(Lb3 )2?? (4.8)Now, ? itself depends on the configuration q = x1. From Equations 4.7, 4.4,? = Me ?Fe +M f ?Ff + ?gUsing Equation 4.1, this transforms into44? = Me(Fpe +ue fa(?e +1)+M f (Fp f +u f fa(? f +1)+ ?g,where ue,u f are activations, ?e,? f are strains, and Fpe,Fp f are the passive forcesof the extensor and the flexor muscles, respectively. From Equations 4.1,4.3 and4.5, it can be seen that ?e,? f ,Fpe,Fp f ,Me,M f ,?g are non-linear functions of x1.The above equation can be rewritten in a matrix form:? =[Me fa(?e +1) M f fa(? f +1)][ueu f]+MeFpe +M f Fp f + ?gInserting this expression for torque into Equation 4.8, we get[x?1x?2]= Ac(x1)[x1x2]+Bc(x1)[ueu f], (4.9)whereAc(x1) =??0 1MeFpe+M f Fp f +?gx1m(Lb3 )2?bm(Lb3 )2?? (4.10)Bc(x1) =??0 0Me fa(?e+1)m(Lb3 )2M f fa(? f +1)m(Lb3 )2?? (4.11)By setting x =[x1x2]=[qq?]and u =[ueu f], Equation 4.9 can be rewritten as acontinuous dynamical system equation in a state-space form:x? = Ac(q)x+Bc(q)u (4.12)To get a discrete equation for the timestep of size ?t, Equation 4.12 must be in-tegrated, which yields a non-linear solution of a form xt+1 = f (xt ,ut). Even for thistoy plant, this solution is hard to obtain analytically. However, with the assumptionthat for a small timestep ?t the matrices Ac(q),Bc(q) are constant, Equation 4.12can be integrated in a closed form:45xt+1 = Ad(qt)xt +Bd(qt)ut , (4.13)whereAd(qt) = eAc(qt)?t (4.14)Bd(qt) = (? ?t0eAc(qt)?d?)Bc(qt) (4.15)Finally, by setting A(xt) = Ad(qt)xt and B(qt) = Bd(qt), Equation 3.3 is re-covered. It should be stressed, however, that Equation 4.13 was obtained using asimplifying assumption about local linearity of Equation 4.12, and hence it is onlyan approximation to the true solution xt+1 = f (xt ,ut).Implementation. F1T2 plant was implemented using the Strands++ software pack-age [72]. In the implementation, the system input consists of a 2-dimensional vec-tor u, where the first coordinate represents the activation of the extensor muscle ue,and the second coordinate represents the activation of the flexor muscle u f .4.2 Experiment 1: Learning Controller Parameters for aReaching TaskThis section describes the procedure for learning the parameters u f (x, t),R(q) forOSA (Chapter 3), and assessment of the success of this learning. Following Sec-tion 3.4, the controller task is to reach a given configuration and stabilize the plantthere.4.2.1 MotivationOur goal is to develop a strategy that does not require prior knowledge of the plantmodel, but at the same time is easy to implement and run. As suggested in Sec-tion 3.4, the online learning approach is desired to work in the following way. Thecontroller starts with some basic configuration which is easy to obtain. Then, thecontroller reaches various targets and learns its configuration-dependent parame-ters u f (x, t),R(q) in the process of getting more experience. APM is learned us-46ing Algorithm 2, and the equilibrium controls serve as the feedforward commandu f (x, t) as suggested in Section 3.3.1.The problem of identifying the usefulness of this learning procedure for OSA isnaturally divided into two simpler ones. The first problem, addressed by this sec-tion, is to understand whether the initial configuration of OSA can help to collect thedata for estimating equilibrium controls and APM. The second problem, addressedby Section 4.3 is whether or not these collected data improve the performance ofOSA.4.2.2 MethodIn this experiment, the local value of APM at the initial configuration, qs, is learned,and the parameters for OSA are set as R(q) = R(qs) and u f (x, t) = 0. This versionof OSA is further denoted by the Baseline controller. The plant F1T2 described inSection 4.1 is utilized for the experiment.Despite the fact that the Baseline controller is not supposed to be precise withthe initial parameters, it allows to get the information about the equilibrium con-trols and APM at various configurations if it can stabilize the plant. In turn, equilib-rium controls can be thought of as the feedforward command u f (x, t) for the reach-ing task, and APM can be learned as a function of the configuration q. This experi-ment also validates the hypothesis that our method indeed gets proper equilibriumcontrols ueq(q) for a given configuration, and useful APMs R(q). Section 4.3 usesthe learned controller parameters for the same task, and checks whether or not theyimprove the performance of OSA.Task Description. To test the performance of the controller on a reaching prob-lem, the controller task is defined to be reaching N = 11 various configurationsq(i)r = pi6 +(i? 1)pi15 rad, for i = 1, ...,N for the F1T2 plant ? where q = ? is thejoint angle. Each run starts from the rest state qs = ?s = pi2 rad, q?s = ??s = 0rads .The duration of the run is T = 200 timesteps of size ?t = 0.02 s. Given the actualtrajectory q(i)t during the run (i), for t = 1, ...,T , the error for this run is defined asE(i) = 2T ?Tt= T2 +1?q(i)t ?q(i)r ?maxj=1,...,N?q( j)r ?qs?? that is, it shows how close is the plant to the targetin the second half of the run, normalized by the largest reaching distance.47The kinematic control gains were manually tuned to be KP = ?0.1, KD =??t2 0.9, and the regularization parameter was simply set to ? = 0. APM at theinitial configuration, R(qs), is learned using Algorithm 2, and the controller runswith this fixed APM R(q) = R(qs) and u f (x, t) = 0 for each of the reference targets,q(i)r , for T timesteps.In addition, the data for learning the controller parameters, u f (x, t) and R(q), iscollected. At the end of each run, the controller stores the final state x(i)T , the finalconfiguration q(i)T , the last activation vector u(i)T applied (which approximates anequilibrium control vector as in Algorithm 1), and APM R(i)T , computed by runningAlgorithm 2 for the state x(i)T .Validation. This section also validates that the stored values are indeed what wethink they are ? that is, u(i)T are the equilibrium controls for configuration q(i)T , andcolumns of R(i)T do describe correctly how controls influence the configuration ofthe system.For equilibrium controls, the validation is straightforward ? equilibrium controlvector should make the net torque ? equal to zero in Equation 4.7 (equilibrium con-dition). Thus, the equilibrium control samples u(i)T are inserted into Equation 4.7,and the computed torque value is then compared to 0.To validate the samples for the matrix R(q), one needs to compare it to the tophalf of the matrix Bd(q) from Equation 4.15, which follows from the definition ofAPM (Section 3.2). From Equations 4.11 and 4.15 it is easy to see that the top twoentries of Bd(q) are proportional to the quantities Me fa(?e + 1) and M f fa(?e + 1)respectively. In turn, the quantities M j fa(? j +1), for each muscle j, represent thedifference between the net torque with jth muscle being fully activated, and thenet torque with no muscles activated. Hence, to check the correctness of the sam-ples R(i)T , this difference, further denoted by active torque, is first computed usingEquation 4.7. Then, the obtained active torques are scaled to fit the correspondingentries of R(i)T across different configurations, and the results are reported. A goodfit would mean that the computed APMs describe the action of the muscles of theplant in a sensible way.48Figure 4.2: A set of N = 11 runs that the Baseline controller performed. x-axis represents the time, y-axis shows the joint angle ? . The red lineshows the target configurations ? (i)r = q(i)r , while the blue line shows theactual trajectories ? (i)t = q(i)t , t = 1, ...,50 during the execution.4.2.3 ResultsOn Figure 4.2 the results of all the N runs are depicted. From these graphs it isclear that the Baseline controller does a good job at stabilizing the plant, but thefarther the destination, the harder it is for the controller to reach the target.Next the analysis of the estimated equilibria, u(i)T , is presented. For each i =491, ...,N, the vector u(i)T was used to compute the net torque at the configurationq(i)T according to Equation 4.7. Across N samples, the maximum value of the nettorque was 0.0014 N ?m, which is a very small number given that the magnitudeof the gravity torque ?g can reach the value of 0.4 N ?m for some configurations.That shows that u(i)T is a good estimate of equilibrium controls, as the controls docompensate for passive dynamics of the plant and gravity.However, an attentive reader would notice that the described method (Algo-rithm 1) picks the controls at a single instance (specifically, at the step T ) in hopethat a stable equilibrium is achieved. This approach was shown above to be work-ing for the given task and plant, but the scenario in which the given controls wouldmove the plant out of the equilibrium is not unlikely. Thus, some improvementsof the above method can be utilized to account for this problem. For example,an averaged control signal over some prolonged period of time can be used as anequilibrium control, or a safety check can be added to detect the situations whenthe applied controls move the plant away from the equilibrium.Let us now return to the computed APMs, R(i)T . In this case, R(i)T is a 1?2 matrix[R(i)T,1 R(i)T,2]. On Figure 4.3 these two entries are compared with the correspondingscaled active torques at configurations q(i)T . From these graphs it is clear that APMis indeed proportional to active torques, implying that the collected APM samplesdescribe the effect of controls on the F1T2 plant well enough.Finally, the activations computed by the Baseline controller during reachingruns are displayed on Figure 4.4. From this plot it can be observed that, ini-tially, when a plant is far away from the destination, the controls are saturated.Then, there is a slight dip, which indicates the braking phase, and then the controlsroughly stabilize.This picture has a resemblance to a famous pulse-step control paradigm, de-scribed in Section 2.2.1. The plots in Figure 4.4 differ from the pulse-step signalonly in the existence of a dip indicating a braking phase. Figure 4.5 shows that thedip occurs at the point when the (scaled) velocity error becomes comparable to the(scaled) position error, which results in the braking behavior. Thus, it appears thatslightly reducing the derivative gain will remove the dip and make the transition tothe step signal smoother.50Figure 4.3: Comparison of empirically computed APM, R(i)T , with scaled an-alytically computed active torques at the configuration ? (i)T = q(i)T . (left)the first entry of R corresponding to the extensor muscle, (right) thesecond entry of R corresponding to the flexor muscle.51Figure 4.4: Muscle activations during N reaching runs performed in Experi-ment 1 for the extensor (left) and the flexor (right). Note, that the graphslook similar, as they depict all N runs, illustrating both flexion and ex-tension.52Assuming the above explanation is correct, it can be hypothesized that thepulse-step signal can be produced by OSA. However, more experiments are neededto understand this problem better ? and at the very least it should be mentioned thatfor some biomechanical reaching tasks like eye saccades the control has no feed-back component as eye movements are too fast for the feedback signal to reach thebrain in time [46].4.3 Experiment 2: Reaching Task with FeedforwardCommand and Variable APMExperiment 2 continues exploring the approach for online learning of OSA pa-rameters, u f (x, t) and R(q). It addresses the question of how the data samples ofequilibrium controls u(i)T , and APM R(i)T influence the performance of the controlleron a reaching task.4.3.1 MethodIn this experiment, the same reaching task with the same target locations as in Sec-tion 4.2 is used, with the goal of testing the performance of the Baseline controllerwith the use of feedforward control (in which case the controller is denoted by theEquilibrium controller) or the variable APM (in which case the controller is namedas the Variable APM controller).First, the feedforward command u f (x, t) is added to the Baseline controller. Tocompute the feedforward command from the data samples {q(i)T ,u(i)T } collected inSection 4.2, one needs to fit the samples {q(i)T ,u(i)T } to some function ueq(q), andthen set u f (x, t) = ueq(qr) for reaching the target qr. For convenience, variableswith a subscript eq relate to the equilibrium point control. In this experiment, asimple Nearest-Neighbor model is used to fit N = 11 data samples collected, thatisueq(q) = u(i)T , i = argminj=1,...,N?q?q( j)T ?Using this feedforward term u f (x, t) in the Equilibrium controller, N reachingruns with the same targets as in Section 4.2 are completed, and the performance of53Figure 4.5: Details for the run to reach the location ?r = qr = 5pi6 rad. (A) Redline shows the target position, blue line shows the actual trajectory. (B)Red line shows the target velocity, blue line shows the actual velocityprofile. (C) Activation level for the extensor muscle during the run. (D)Activation level for the flexor muscle during the run.54the controller is reported.Second, the effects of varying APM separately from the feedforward controlsare explored. Thus, for this part of Experiment 2, u f (x, t) = 0, but instead APMvaries based on the current configuration qt . The function R(q) = [R1(q) R2(q)]is learned by fitting each of the APM components to the data samples {R(i)T =[R(i)T,1 R(i)T,2],q(i)T }. To approximate the function R(q), cubic polynomials were usedfor fitting samples for each coordinate of R(q) (Figure 4.6). A smoother functionapproximation compared to Nearest-Neighbor fit was chosen due to the necessityof continuously changing APM during a single run.Using these values of APM in the Variable APM controller, the reaching runsdescribed in Section 4.2 are completed, and the performance of the controller isreported.4.3.2 ResultsUsing equilibrium controls as the feedforward command in the Equilibrium con-troller gives an average error of 0.0021 compared to 0.0296 computed in Sec-tion 4.2 with the Baseline controller. The recorded trajectories are shown in Fig-ure 4.7, and it can be observed that the use of feedforward controls eliminated thesteady state error, even though the approximation of equilibrium controls is not per-fect as Nearest-Neighbour fit is not good for a small sample set. Figure 4.8 showsvelocity profiles for each of the runs. Figure 4.9 shows the activations applied tothe plant during reaching, which display the same pattern as in Figure 4.4.Thus, it can be concluded that using the estimation of equilibrium controls doesimprove the performance of OSA significantly, even though the estimation is farfrom perfect. However, the results of using variable APM are not as encouraging.The average error in the case of the Variable APM is 0.0274, which is an im-provement over the Baseline controller. However, a closer look reveals that signif-icant improvement is observed only when the reaching destination is far enoughfrom the original configuration, where the entries of R(q) differ significantly fromthe original ones, R(qs). This says that the use of variable APM is justified onlywhen the movement goes far enough, and there are other ways to deal with thesechanges of APM beside interpolation. For instance, one could utilize sequential55Figure 4.6: Cubic polynomial fit of empirically computed APM. (Left) thefirst entry of R(?) = R(q) corresponding to the extensor muscle, (right)the second entry of R(?) = R(q) corresponding to the flexor muscle.56Figure 4.7: Trajectories for the set of N = 11 runs that the Equilibrium(green) or the Baseline (blue) controller performed. x-axis representsthe time, y-axis shows the joint angle ? .57Figure 4.8: Angular velocity profile for N = 11 runs that the Equilibrium(green) or the Baseline (blue) controllers performed. x-axis representsthe time, y-axis shows the velocities ?? .58Figure 4.9: Muscle activations during N reaching runs performed in Experi-ment 2 for the extensor (left) and the flexor (right). Green lines showactivations produced by the Equilibrium controller, blue lines show ac-tivations produced by the Baseline controller. Note, that the graphs looksimilar, as they depict all N runs, some of which involve flexion andsome ? extension of the joint.59feedback controllers as described by Burridge et al. [10].An important remark has to be made regarding the application of Algorithm 2in this case. As was mentioned in Section 3.3.2, if Algorithm 2 is applied at sta-tionary positions, it can be extended to its online version in non-simulated environ-ments, although this extension is not used here.4.4 Experiment 3: Using End-Point CoordinatesThis section turns to the problem of choosing a state representation. In Section 4.2and Section 4.3, it was assumed that the configuration q represents the joint angle? . However, it was noted in Section 2.3 that often one might want to work withother coordinate systems, e.g., end-point coordinates. Thus, this experiment?s goalis to check whether OSA with goal babbling can handle the case when the valuesc = Lb cos? are used as a configuration q.For this new coordinate system, exactly the same experiments as in Section 4.2and Section 4.3 are performed, with the configuration q = c instead of q = ? , andthe results are reported. Thus, the rest configuration qs is equal to cs = Lb cos?r =0 m. The only difference from earlier experiments is the choice of reaching targetsqr, as the magnitude of the targets needs to be scaled to account for the coordinatechange. Thus, the targets were chosen to be q(i)r = c(i)r =?0.4+0.08 ? (i?1) m.This experiment produced the same qualitative results as in Section 4.2 andSection 4.3. The trajectories produced by OSA are shown in Figure 4.10, the ve-locities are shown in Figure 4.11, and a sample muscle activation profile is shownin Figure 4.12.One thing to note here is that the change of variables made the plant equa-tion more non-linear, and the effects of using configuration-dependent APM areexpected to be greater for end-point coordinates. To check whether this is indeedtrue, another mini-experiment was designed ? the system runs with the Baselineand the Variable APM controllers for the joint angle coordinates with the targetqr = ?r = pi6 rad, and also for the endpoint coordinates with the corresponding tar-get qr = cr = Lb cos pi6 . To compare the performance error for different coordinatesystems, the error for joint angles is set to be E?r =2T ?Tt= T2 +1??t??r???r??s? , and for theend-point coordinates Ecr =2T ?Tt= T2 +1?ct?cr??cr?cs?.60Figure 4.10: Trajectories for the set of N = 11 runs that the Equilibrium(green), the Variable APM (magenta), or the Baseline (blue) controllersperformed. x-axis represents the time, y-axis shows the end-point co-ordinate c.61Figure 4.11: Velocities for the set of N = 11 runs that the Equilibrium(green), the Variable APM (magenta), or the Baseline (blue) controllersperformed. x-axis represents the time, y-axis shows the end-point ve-locity c?.62Figure 4.12: Muscle activations for reaching the target qr = cr = 0.4 m thatthe Equilibrium (green), the Variable APM (magenta), or the Baseline(blue) controllers performed. x-axis represents the time, y-axis showsthe activation level for the extensor muscle (left graph), and the flexormuscle (right graph).Table 4.1: Scaled Performance Error for the Baseline and the Variable APMcontrollers for reaching the target ?r = pi6 rad, cr = Lb cospi6 .Joint Angles End-Point coordinatesFixed APM 0.0576 0.0648Variable APM 0.0469 0.0327The hypothesis is that variable APM should be more beneficial in the case ofend-point coordinates than angles, as in the former case the system equation be-comes more non-linear. The results of this experiment are shown in Table 4.1, andit is clear that they support the hypothesis. However, it shall be mentioned thatthe fact that variable APM is more beneficial for the more non-linear system in thisparticular case does not prove that this is a true statement in general.634.5 ConclusionsThis chapter analyzed the behavior of One-Step-Ahead controller (OSA) on a reach-ing task for biomechanical systems. For that purpose, a toy plant named F1T2 wasdeveloped and described in Section 4.1.OSA, described in Chapter 3, has four open parameters for a given task setting ?that is, kinematic control gains KP, KD, a regularization parameter ? , feedforwardcontrols u f (x, t) and Activation-Position Matrix (APM) R(q). In this work it isassumed that gains and a regularization parameter are manually tuned to somefixed values, but u f (x, t) and R(q) could change, thus raising a question of how toassign these changing parameters.This chapter tests the approach proposed in Section 3.4, which involves us-ing equilibrium controls as a feedforward command in a reaching task, and Algo-rithm 2 for learning APM locally. That section hypothesized that estimating APMat various configurations and interpolating them could improve the performanceof the controller. Section 3.4 also proposed learning the equilibrium controls andAPM online by stabilizing the plant at different positions.Section 4.2 addressed the problem of collecting the samples for equilibriumcontrols and APM. First, the controller is initialized with a local estimate of APMand u f (x, t) = 0. Then, the controller reaches to various locations, and estimatesthe equilibrium controls and APM in these configurations. The correctness of theestimated values is checked in Section 4.2 by utilizing the analytical model of theplant.Then, in Section 4.3, the collected values are used to set u f (x, t) and R(q) forOSA. The performance analysis showed that for a reaching task the use of equi-librium controls eliminates steady-state error even if the estimation is not precise.The use of variable APM showed to be effective only when the reaching destina-tion is far enough from the initial position of the plant so that APM entries differsignificantly from their original values.An important note should be made on how to choose the configuration variable? indeed, the choice of joint angles ? is not well justified in this case (Section 2.3,and one might want the controller to work in terms of the end-point coordinates.Thus, Section 4.4 repeated the experiments described in this chapter with the use of64c = Lb cos? as a configuration variable q, and got qualitative results similar to thejoint angles case. In addition, the effects of using configuration-dependent APMwas greater for the end-point coordinates, as the plant equations are less linearthan for joint angles. Thus, OSA in combination with the online learning methodappears to do well in controlling systems with various degrees of non-linearity.Based on this information, it can be concluded that learning the equilibriumcontrols and APM online as described in Section 3.4 is probably a useful approachthat significantly improves the performance of OSA for a reaching task.65Chapter 5Control of a Human Index FingerModelThis chapter extends the work described in Chapter 4, and applies OSA to a com-plex model of a biomechanical system ? the human index finger.Section 5.1 describes the model of the human index finger [47]. Section 5.2uses the same experiments as in Chapter 4 adjusted for the finger model to test theperformance of OSA, and reports the results.5.1 Human Index Finger ModelThe human index finger model utilized in this chapter was developed in the Sen-sorimotor Systems Laboratory by Dr. Shinjiro Sueda and Duo Li using Strands++simulator [72], and is described in detail in [47]. This section presents only a high-level description of the plant, which can be seen on Figure 5.1.The finger model consists of four bones ? the metacarpal bone, which positionand orientation is fixed, the proximal phalanx which is connected to the metacarpalbone via the MCP joint, the middle phalanx connected to the proximal phalanxvia the PIP joint, and the distal phalanx connected to the middle phalanx via theDIP joint. The three joints allow the plant to have 4 degrees of freedom (flexion-extension for all the joints, and abduction-adduction for the MCP joint).The joint angles can be changed by the activation of 6 different muscles, which66Figure 5.1: A visual representation of the human index finger model [47].The plant consists of four bones (distal, middle, proximal andmetacarpal) which are tied to 6 muscles by 10 interconnected tendons.The other four fingers are displayed only for convenience.are attached to the bones through 10 interconnected tendons. That is, the plant iscontrolled by a six-dimensional vector of activations u, such that 0? u? 1.In addition, there is a gravity force that pulls the bones and tendons down (thatis, perpendicular to the wrist plane in Figure 5.1).5.2 Reaching Control for the Index FingerThis section explains how to set up OSA to control the finger model, and conductsthe experiments described in Sections 4.2 and 4.3 for testing the performance ofthe controller on a reaching task. Recall that, according to the approach outlined inSection 3.4, the controller has no prior information about the plant except the PDgains KP, KD, and the regularization parameter ? .67Choice of the Configuration Variable. In the work on controlling the robotic ACThand, the configuration variable q was chosen to be the tendon excursions or jointangles [20, 53]. There is, however, some evidence that humans do use the fingertipposition directly for reaching movements. Thus, the configuration variable q waschosen to be a 3-dimensional position of the fingertip. Note, that this is not acomplete description of the 4D configuration space of the skeleton.Unlike F1T2 plant, the configuration of the finger model is multidimensional.To keep the analysis simple and easily comparable with the results reported inChapter 4, the reaching locations were chosen to lie at a position z on the q1 axis.That is, each reaching target is of a form q =[z 0 0]T. Assuming that themovement of the fingertip is close to q1-axis, the reaching task can be representedin 1-dimensional space, and equilibrium controls and APM become functions thatdepend on a scalar z, that is, ueq(q) = ueq(z) and R(q) = R(z). However, it shouldbe emphasised that the movement still occurs in 3-dimensions, and the use of zsimplifies only the form of the target locations and fitting the equilibrium controland APM functions.Controller Parameters. The rest configuration for the plant is qs =[0 0 0]T,and the the Baseline controller is initialized with u f (x, t) = 0, R(q) = R(qs), whereR(qs) is computer using Algorithm 2 at the rest configuration. Note, that APM is a3? 6 matrix in this case. The regularization parameter ? was set to zero, and PDgains were chosen to beKP =?diag([0.2 0.2 0.2]), KD =?diag([?t2 0.5?t2 0.5?t2 0.5])Next, the Experiments 1 and 2 described in Chapter 4 are repeated here withappropriate adjustments.Task Parameters. In this experiment, the N = 6 targets were chosen to be q(i)r =[z(i)r 0 0]T, where z(i)r =?1.2 ? (i?1) cm, i = 1, ...,N. The duration of the runis T = 100 timesteps of size ?t = 10 ms, and the error definition was adjusted toreflect the last quarter of the run:68E(i) =2TT?t= T2 +1?z(i)t ? z(i)r ?maxj=1,...,N?z( j)? zs?,where zs = q1,s, and zt is the value of the first configuration coordinate at timet, that is qt =[zt q2,t q3,t]T.Results. The experiment produced similar qualitative results to the ones fromSection 4.2 and Section 4.3. Using equilibrium controls as feedforward commandin the Equilibrium controller gives an average error of 0.0087 compared to 0.0227computed with the Baseline controller and 0.0130 with the Variable APM controller.The trajectories produced by OSA are shown in Figure 5.2, and a sample muscleactivation profile is shown in Figure 5.3.The application of equilibrium controls clearly improved the performance tonearly perfect when the equilibriums used where close enough to the target loca-tion. However, as the equilibrium point is farther from the target (bottom part ofFigure 5.2), the Equilibrium controller performance goes down. On the other hand,Variable APM did not improve the performance of the Baseline controller for smallreaches (since APM does not change much), but made a significant difference forlarge movements. Thus, it could be said that the above experiment supports theconclusions made in Section 4.5 about the usefulness of variable APM and equilib-rium controls as a feedforward commands.An important remark should be made regarding the use of only one coordinate,q1 = z, for learning APM and equilibrium controls. It turns out that the reachingmovement that OSA performs deviates a bit from the z-axis during the execution(other dimensions are not shown on figures). Although this deviation is compa-rably small, it creates some trouble for the Variable APM controller, as it has toupdate the APM on every timestep, and the APM turned out to be sensitive to thosedeviations from the z-axis. Thus, to fully understand the performance of OSA, aset of experiments has to be devised where APM depends on the full configurationspace, q, although this is outside the scope of the thesis.69Figure 5.2: Trajectories for the set of N = 6 runs that the Variable APM (ma-genta), the Equilibrium (green), or the Baseline (blue) controllers per-formed. x-axis represents the time, y-axis shows the third end-pointcoordinate q1 = z.70Figure 5.3: Muscle activations for reaching the target q1 = z =?1.2 cm thatthe Baseline (blue) controller performed. x-axis represents the time, y-axis shows the activation level for all the six index finger muscles.Tracking Desired Trajectory with the Baseline Controller. The reader may won-der what are the capabilities of OSA for doing more complex tasks than reaching.Thus, the Baseline controller was informally tested on a task of following the circle2 cm in diameter with the fingertip on the q1q3-plane. The results could be seenat Figure 5.4 ? the Baseline controller was able to track the trajectory well for 6seconds. This example illustrates that even the basic version of OSA is able to solvea fairly elaborate tracking problem for the index finger model.However, small oscillations are noticeable on Figure 5.4, and a closer lookat the activation profile (Figure 5.5) shows that the controller produces very non-smooth control signals. While this problem could be a side effect of modeling[21], changing the regularizing term ? = 0.01 to OSA eliminates the oscillations(Figure 5.6) and produces smooth controls (Figure 5.7).71Figure 5.4: The results of the Baseline controller (blue) tracking the circulartrajectory for the fingertip (red). The fingertip starts at the rest config-uration (blue star) off the circle, but quickly catches up, and does threecircles around, with a total duration of 6 seconds.This regularized controller can also perform tracking of non-smooth and notproperly timed trajectories (Figure 5.8). The controller was required to followtrajectories in q1q3 and q1q2 planes, which consist of straight lines in the form ofthe letter ?A?. Each line segment consists of target locations qr, evenly distributedon the corresponding line, and the target velocity q?r = 0 cms . That is, the controlleris presented with the unrealistic task of moving the fingertip with (unspecified)uniform speed without any acceleration or braking during sharp turns.Conclusions. This chapter illustrated how to use OSA with a complex model ofthe human index finger, and showed that the results of the experiments on F1T2plant described in Chapter 4 also hold for the much more complex system. That is,equilibrium controls provide a useful feedforward command to OSA, and variableAPM helps significantly for large enough movements. In addition, it appears thateven the Baseline controller can perform quite elaborate tracking tasks. Finally,it was shown that using regularization parameter in OSA removes the oscillations72Figure 5.5: Muscle activations for the circular trajectory tracking that theBaseline (blue) controller performed. x-axis represents the time, y-axisshows the activation level for all six index finger muscles.Figure 5.6: The results of the Baseline controller with (green) or without(blue) regularization tracking the circular trajectory for the fingertip(red). Note, that the controller without regularization produces jerkymovement, while the one with regularization makes it smooth.73Figure 5.7: Muscle activations for the circular trajectory tracking that theBaseline controller with (green) or without (blue) regularization per-formed. x-axis represents the time, y-axis shows the activation level forall six index finger muscles.of control signals and creates a smoother trajectory, as mentioned by Sueda et al.[71].74Figure 5.8: The results of the Baseline controller tracking the non-smoothtrajectories.75Chapter 6Conclusion6.1 SummaryThis thesis investigated how to design a robust and efficient controller for complexbiomechanical systems. The controller algorithm was derived and tested on a toyplant, and a complex model of the human index finger. In addition, an approachfor learning the parameters of the controller without much prior knowledge of theplant online was proposed, and the initial analysis of this approach was conductedfor both the toy plant and the finger model. The results of the experiments showedthat the proposed controller performs well for a wide range of motions, and theproposed learning approach is promising.Chapter 2 presented the necessary background to the reader and reviewed therelated work. It mathematically defined a dynamical system in terms of its state xand applied control signals u. Section 2.1 described the methods for approximatingthe dynamical system equations. Section 2.2 outlined the existing methods for find-ing the control signals u such that the system state evolves in a desired fashion. Sec-tion 2.2.3 described the key idea named Hierarchical Control, which allows the ap-plication of control methods not suitable for complex non-linear input-constrainedsystems. According to this approach, the dynamical system is divided into twosimpler parts ? the high level system that finds high-level controls v (kinematiccontrols), and the low-level system which transforms kinematic controls into motorcontrols u. Section 2.3 described the specific features of tendon-driven biomechan-76ical systems, with the most notable one being the box constraint on the controls,that is 0? u? 1.Chapter 3 derived the controller algorithm (denoted by the One-Step-Aheadcontroller (OSA)), based on the Hierarchical Control architecture. The high-levelsystem is controlled by a PD-controller with manually tuned gains, and the high-level controls v in this case represent the desired trajectory to follow. The low-levelsystem was designed to be a constrained linear least squares problem which canbe solved efficiently, and the solution to this problem gives the motor controls u.OSA algorithm has two main parameters aside from PD-gains and regularizationparameter ? . One of them is a feedforward policy u f (x, t) which simplifies thehigh-level dynamics enough for the PD-controller to handle it. Section 3.3.1 arguedthat equlibrium point control is a neat candidate for such a policy. The secondimportant parameter of OSA is Activation-Position Matrix (APM) which definesthe linear relationship between the controls and the resulting change in state. Thisconfiguration-dependent matrix R(q) can be learned locally, for example, by usingAlgorithm 2. Finally, Section 3.4 suggested a high-level approach for learning APMand equilibrium point controls online by stabilizing the plant at various locations.The next two chapters dealt with assessing the performance of OSA and thefeasibility of the approach described in Section 3.4 for learning OSA parameters.Chapter 4 analyzed the performance of the controller on a reaching task for the sim-plified human dynamics model, F1T2, described in Section 4.1. The analysis wasdone in the following way. First, Section 4.2 set up a basic version of OSA denotedby the Baseline controller to assess if it can stabilize the plant and collect valid datafor learning OSA parameters. This controller has u f (x, t) = 0, and constant APM,which is learned locally. In Experiment 1, the Baseline controller reached variouslocations q and collected the data for learning equilibrium controls ueq(q) and APM.The collected data was validated using the knowledge of the analytical model ofthe plant. Then, Section 4.3 used the collected data to test whether the use of equi-librium controls and configuration-dependent APM improves the performance ofthe controller. It was shown that these data indeed help ? the equilibrium controlseliminate the steady state error, and changing APM makes the difference when thereaching location is far enough from the start configuration. Finally, Section 4.4explored the effect of choosing different configuration variables to represent the77state, and found that OSA is flexible enough to deal with end-point coordinates, aswell as joint angles, in the case of F1T2 plant.Chapter 5 extended the experiments described in Chapter 4 to the much morecomplex case of the human index finger model (Section 5.1), and confirmed theconclusions drawn for the toy plant ? namely, that the basic version of OSA con-troller is capable of stabilizing the plant, that equilibrium point control providesa good feedforward command for OSA, and that OSA benefits from configuration-dependent APM if the target is far enough from the start location. Section 5.2 alsoshowed that the finger model can be controlled directly via observing only the fin-gertip position, and even the basic model of OSA can complete a task of trackingnon-trivial trajectories.6.2 ContributionsThe main contribution of this thesis is the development of OSA, which is capable ofcontrolling complex tendon-driven systems online. It was inspired by controllersused by Malhotra et al. [53] and Deshpande et al. [20], who utilized the linear leastsquares problem to deal with input constraints. However, their work does not takepassive dynamics of the system into account, unlike OSA, and does not use anyfeedforward command or penalties for controls. In addition, the OSA algorithmand the use of equilibrium controls as a feedforward command was explicitly ana-lyzed in the Hierarchical Control framework. Finally, the online learning method,inspired by goal babbling [61], was proposed to be used to learn OSA parametersby stabilizing the plant at various configurations.It should be emphasized, that the idea of using equilibrium controls as a feed-forward command for feedback control is not new [42], the method for learningAPM (Algorithm 2) stems from the self-identification method [53] applied at vari-ous configurations, and penalizing the controls in least squares problems is a stan-dard technique used, for example, by Sueda et al. [71]. In addition, although thefeasibility of using online learning for OSA parameters was assessed (confirmingthat this idea is promising), the full-scale experiment involving bootstrapping al-gorithms like goal babbling [61] was left for the future work.OSA was successfully utilized for controlling the complex human index finger78model with only a few samples taken for learning the initial set of parameters ?which repeats the results of Malhotra et al. [53], thus further validating the powerof this approach. In addition, the finger model was controller by observing onlythe fingertip positions, which, to our knowledge, has not been done before forelaborated tendon-driven models of the human hand.6.3 Future WorkThere are a number of unaddressed issues related to the application of OSA to thecomplex biomechanical plant. First of all, as was mentioned above, the onlinelearning approach was not used in a full-scale mode in this thesis. There are anumber of problems that has to be resolved for that purpose ? most importantly,which function approximator to use to learn the parameters R(q), u f (x, t), and howto choose goals properly. In addition, the effectiveness of this learning could becompared to learning parameters of some other general-purpose function approxi-mator for the same dataset, for example, an Echo State Network.Another huge issue unaddressed by the current work is the problem of motordelay. In biological systems, the commands that the controller sends out to theplant reach it only after some considerable amount of time (e.g., 100 ms). It has tobe determined whether or not (and under what conditions) OSA is able to deal withthis delay. There are ways to avoid this problem by using the forward model of theplant [46], which brings a question of how to learn this forward model itself, andwhether or not the strategy of learning parts of the model separately as describedin Section 3.5 could be useful.The goal of this work was to develop a controller for complex biomechani-cal systems. However, it should be stressed that the model used here, althoughgood, is still far from the real finger. The issues of noise, friction, and joint stiff-ness are completely ignored in this thesis, and it is crucial to understand how thesedisturbances change the behavior of OSA. However, there is hope that, counterintu-itively, controlling the real system could make things easier for the controller [21],and even the noise level could be of some use as a model regularizer [33]. Never-theless, until the controller is applied to a better model, these accounts remain merespeculations. In addition, the problem of handling contact was also not addressed79here. One way of handling contact is to use a high-level planner that specifies thetrajectory for OSA to follow, which compensates for the contact forces [45, 56].A more subtle issue is the way to learn the feedforward command of OSA.As mentioned in Section 3.2, there exists a choice between learning a high-levelfeedforward policy v f (x, t) directly, or through a low-level policy u f (x, t). Whilehere we chose to use the low-level policy as it is more consistent with the existingwork on equilibrium controls, it could be beneficial to learn the high-level policydirectly to get rid of redundancy problem. We believe, that this direction should beexplored carefully, as it could greatly benefit the learning procedure for OSA.Finally, it would be interesting to see how well OSA algorithm scales, and ifit can handle more complicated cases of biomechanical systems, for example, thecomplete model of the human hand [47] or the full human body.80Bibliography[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcementlearning. In Twenty-first international conference on Machine learning -ICML ?04, page 1, New York, New York, USA, 2004. ACM Press. ISBN1581138285. doi:10.1145/1015330.1015430. URLhttp://dl.acm.org/citation.cfm?id=1015430http://portal.acm.org/citation.cfm?doid=1015330.1015430. ? pages 17[2] M. N. Abdelghani, T. P. Lillicrap, and D. B. Tweed. Sensitivity Derivativesfor Flexible Sensorimotor Learning. Neural Computation, 20:2085?2111,2008. ? pages 20[3] C. W. Anderson, P. M. Young, M. R. Buehner, J. N. Knight, K. A. Bush, andD. C. Hittle. Robust Reinforcement Learning Control Using IntegralQuadratic Constraints for Recurrent Neural Networks. IEEE Transactionson Neural Networks, 18(4):993?1002, 2007. ? pages 17[4] A. C. Antoulas. Approximation of Large-Scale Dynamical Systems. Societyfor Industrial and Applied Mathematics, Jan. 2005. ISBN978-0-89871-529-3. doi:10.1137/1.9780898718713. URLhttp://epubs.siam.org/doi/book/10.1137/1.9780898718713. ? pages 9[5] A. T. Bahill, M. R. Clark, and L. Stark. Glissades - eye movementsgenerated by mismatched components of the saccadic motoneuronal controlsignal. Mathematical Biosciences, 26(3-4):303?318, 1975.doi:10.10116/0025-5564(75)90018-8. URLhttp://www.sciencedirect.com/science/article/pii/0025556475900188. ?pages 12[6] P. B. Bakker. The state of mind: Reinforcement learning with recurrentneural networks. PhD thesis, University of Leiden, 2004. ? pages 17[7] I. S. Baruch, P. Georgieva, J. Barrera-Cortes, and S. F. de Azevedo. Adaptiverecurrent neural network control of biological wastewater treatment.81International journal of intelligent systems - soft computing for modeling,simulation and control of nonlinear dynamical systems, 20(2):173?193,2005. doi:10.1002/int.v20:2. ? pages 7[8] D. J. Berger, R. Gentner, T. Edmunds, D. K. Pai, and A. D?Avella.Differences in adaptation rates after virtual surgeries provide direct evidencefor modularity. The Journal of neuroscience : the official journal of theSociety for Neuroscience, 33(30):12384?94, July 2013. ISSN 1529-2401.doi:10.1523/JNEUROSCI.0122-13.2013. URLhttp://www.ncbi.nlm.nih.gov/pubmed/23884944. ? pages 9[9] M. S. Berniker. Linearity, motor primitives and low-dimensionality in thespinal organization of motor control. PhD thesis, Massachusetts Institute ofTechnology, 2005. URL ? pages9, 23[10] R. R. Burridge, A. A. Rizzi, and D. E. Koditschek. Sequential Compositionof Dynamically Dexterous Robot Behaviors. The International Journal ofRobotics Research, 18(6):534?555, June 1999. ISSN 0278-3649.doi:10.1177/02783649922066385. URLhttp://ijr.sagepub.com/cgi/doi/10.1177/02783649922066385. ? pages 12,60[11] K. Bush and C. Anderson. Modeling reward functions for incomplete staterepresentations via echo state networks. In IEEE International JointConference on Neural Networks, pages 2995?3000. IEEE, 2005. ISBN0-7803-9048-2. doi:10.1109/IJCNN.2005.1556402. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1556402.? pages 17[12] T. Catfolis. Generating adaptive models of dynamic systems with recurrentneural networks. In IEEE International Conference on Neural Networks(ICNN?94), volume 5, pages 3238?3243. IEEE, 1994. ISBN0-7803-1901-X. doi:10.1109/ICNN.1994.374754. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=374754.? pages 7[13] T. Catfolis and K. Meert. Adaptive Process Control using Recurrent NeuralNetworks: a feasibility study. In IEEE Mediterranean symposium on newdirections in control and automation, pages 225?231, 1995. ? pages 7[14] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions,regularizations and Gaussian processes - Revisited. Automatica, 48(8):821525?1535, Aug. 2012. ISSN 00051098.doi:10.1016/j.automatica.2012.05.026. URLhttp://linkinghub.elsevier.com/retrieve/pii/S0005109812001999. ? pages 7[15] P. Cordo, R. S. Dow, and S. Harnad. Movement Control. CambridgeUniversity Press, 1994. ISBN 9780511529788.doi:http://dx.doi.org/10.1017/CBO9780511529788. ? pages 33[16] M. da Silva, F. Durand, and J. Popovic?. Linear Bellman combination forcontrol of character animation. ACM Transactions on Graphics, 28(3):1,July 2009. ISSN 07300301. doi:10.1145/1531326.1531388. URLhttp://portal.acm.org/citation.cfm?doid=1531326.1531388. ? pages 15[17] A. D?Avella and D. K. Pai. Modularity for sensorimotor control: evidenceand a new prediction. Journal of motor behavior, 42(6):361?369, Dec. 2010.ISSN 1940-1027. doi:10.1080/00222895.2010.526453. URLhttp://www.ncbi.nlm.nih.gov/pubmed/21184354. ? pages 9, 23[18] P. Dayan and N. D. Daw. Decision theory, reinforcement learning, and thebrain. Cognitive, affective & behavioral neuroscience, 8(4):429?53, Dec.2008. ISSN 1530-7026. doi:10.3758/CABN.8.4.429. URLhttp://www.ncbi.nlm.nih.gov/pubmed/19033240. ? pages 5, 17[19] P. Dean, J. Porrill, C.-F. Ekerot, and H. Jo?rntell. The cerebellar microcircuitas an adaptive filter: experimental and computational evidence. Naturereviews. Neuroscience, 11(1):30?43, Jan. 2010. ISSN 1471-0048.doi:10.1038/nrn2756. URL http://www.ncbi.nlm.nih.gov/pubmed/19997115.? pages 8, 21[20] A. D. Deshpande, J. Ko, D. Fox, and Y. Matsuoka. Anatomically CorrectTestbed Hand Control: Muscle and Joint Control Strategies. In IEEEInternational Conference on Robotics and Automation, pages 2287?2293.IEEE Press Piscataway, NJ, USA, 2009. ? pages 6, 18, 23, 24, 26, 27, 35,37, 68, 78[21] A. D. Deshpande, J. Ko, D. Fox, and Y. Matsuoka. Control strategies for theindex finger of a tendon-driven hand. The International Journal of RoboticsResearch, 32(1):115?128, Jan. 2013. ISSN 0278-3649.doi:10.1177/0278364912466925. URLhttp://ijr.sagepub.com/cgi/doi/10.1177/0278364912466925. ? pages 6, 23,71, 7983[22] G. E. Dullerud and F. Paganini. A Course in Robust Control Theory,volume 36 of Texts in Applied Mathematics. Springer New York, New York,NY, 2000. ISBN 978-1-4419-3189-4. doi:10.1007/978-1-4757-3290-0.URL http://www.springerlink.com/index/10.1007/978-1-4757-3290-0. ?pages 9, 10, 15[23] K. Fortney and D. B. Tweed. Computational advantages of reverberatingloops for sensorimotor learning. Neural computation, 24(3):611?34, Mar.2012. ISSN 1530-888X. doi:10.1162/NECO\ a\ 00237. URLhttp://www.mitpressjournals.org/doi/pdf/10.1162/NECO a 00237http://www.ncbi.nlm.nih.gov/pubmed/22091669. ? pages 6, 17, 20[24] T. Geijtenbeek, M. van de Panne, and A. F. van der Stappen. FlexibleMuscle-Based Locomotion for Bipedal Creatures. ACM Transactions onGraphics, 32(6), 2013. ? pages 16, 17, 19[25] K. Goto and K. Shibata. Emergence of Prediction by ReinforcementLearning Using a Recurrent Neural Network. Journal of Robotics, 2010:1?9, 2010. ISSN 1687-9600. doi:10.1155/2010/437654. URLhttp://www.hindawi.com/journals/jr/2010/437654/. ? pages 17[26] R. Grzeszczuk, D. Terzopoulos, and G. Hinton. NeuroAnimator: fast neuralnetwork emulation and control of physics-based models. In Proceedings ofthe 25th annual conference on Computer graphics and interactivetechniques - SIGGRAPH ?98, pages 9?20, New York, New York, USA,1998. ACM Press. ISBN 0897919998. doi:10.1145/280814.280816. URLhttp://portal.acm.org/citation.cfm?doid=280814.280816. ? pages 6, 16[27] A. Gunduz, M. C. Ozturk, J. C. Sanchez, and J. C. Principe. Echo StateNetworks for Motor Control of Human ECoG Neuroprosthetics. In 3rdInternational IEEE/EMBS Conference on Neural Engineering, pages514?517. IEEE, May 2007. ISBN 1-4244-0791-5.doi:10.1109/CNE.2007.369722. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4227327.? pages 8[28] M. Haruno and M. Kawato. Heterarchical reinforcement-learning model forintegration of multiple cortico-striatal loops: fMRI examination instimulus-action-reward association learning. Neural networks : the officialjournal of the International Neural Network Society, 19(8):1242?54, Oct.2006. ISSN 0893-6080. doi:10.1016/j.neunet.2006.06.007. URLhttp://www.ncbi.nlm.nih.gov/pubmed/16987637. ? pages 1684[29] M. A. Henson and D. E. Seborg. Feedback linearizing control. In Nonlinearprocess control. Prentice-Hall, Inc., Uper Saddle River, HJ, USA, 1997.ISBN 0-13-625179-X. URLhttp://www.cse.sc.edu/?gatzke/cache/npc-Chapter4-nofigs.pdf. ? pages 11[30] S. Hochreiter. Recurrent Neural Net Learning and Vanishing Gradient.International Journal of Uncertainty, Fuzziness and Knowledge-BasedSystems, 6(2):107?116, Nov. 1998. ISSN 0899-7667. URLhttp://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. ?pages 7[31] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. NeuralComputation, 9(8):1735?1780, Nov. 1997. ISSN 0899-7667.doi:10.1162/neco.1997.9.8.1735. URL http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735?journalCode=necohttp://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735. ? pages7[32] Z.-G. Hou, M. M. Gupta, P. N. Nikiforuk, M. Tan, and L. Cheng. ARecurrent Neural Network for Hierarchical Control of InterconnectedDynamic Systems. IEEE Transactions on Neural Networks, 18(2):466?481,Mar. 2007. ISSN 1045-9227. doi:10.1109/TNN.2006.885040. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4118262.? pages 17[33] H. Jaeger. The ?echo state? approach to analysing and training recurrentneural networks. Technical report, Fraunhofer Institute for AutonomousIntelligent Systems, 2001. ? pages 7, 79[34] X. D. Ji and B. O. Familoni. A diagonal recurrent neural network-basedhybrid direct adaptive SPSA control system. IEEE Transactions onAutomatic Control, 44(7):1469?1473, July 1999. ISSN 00189286.doi:10.1109/9.774125. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=774125.? pages 7[35] M. I. Jordan and D. E. Rumelhart. Forward models: Supervised learningwith a distal teacher. Cognitive Science, 16(3):307?354, Sept. 1992. ISSN03640213. doi:10.1016/0364-0213(92)90036-T. URLhttp://doi.wiley.com/10.1016/0364-0213(92)90036-T. ? pages 20[36] H. Kappen. Linear Theory for Control of Nonlinear Stochastic Systems.Physical Review Letters, 95(20):200201, Nov. 2005. ISSN 0031-9007.85doi:10.1103/PhysRevLett.95.200201. URLhttp://link.aps.org/doi/10.1103/PhysRevLett.95.200201. ? pages 15[37] H. J. Kappen. An introduction to stochastic control theory, path integrals andreinforcement learning. In AIP Conference Proceedings, volume 887, pages149?181. AIP, 2007. doi:10.1063/1.2709596. URLhttp://link.aip.org/link/APCPCS/v887/i1/p149/s1&Agg=doi. ? pages 15[38] M. Kawato. Cerebellum: Models. In Encyclopedia of Neuroscience, pages757?767. Oxford: Academic Press, 2009. ? pages 8, 21[39] M. Kawato and H. Gomi. The cerebellum and VOR/OKR learning models.Trends in Neurosciences, 15(11):445?453, Nov. 1992. ISSN 01662236.doi:10.1016/0166-2236(92)90008-V. URLhttp://linkinghub.elsevier.com/retrieve/pii/016622369290008V. ? pages 19[40] M. Kawato and K. Samejima. Efficient reinforcement learning:computational theories, neuroscience and robotics. Current opinion inneurobiology, 17(2):205?12, Apr. 2007. ISSN 0959-4388.doi:10.1016/j.conb.2007.03.004. URLhttp://www.ncbi.nlm.nih.gov/pubmed/17374483. ? pages 16, 21[41] D. Kimura and Y. Hayakawa. Reinforcement learning of recurrent neuralnetwork for temporal coding. Neurocomputing, 71(16):17, Jan. 2006. URLhttp://arxiv.org/abs/nlin/0601005. ? pages 17[42] D. a. Kistemaker, A. J. Van Soest, and M. F. Bobbert. Is equilibrium pointcontrol feasible for fast goal-directed single-joint movements? Journal ofneurophysiology, 95(5):2898?2912, May 2006. ISSN 0022-3077.doi:10.1152/jn.00983.2005. URLhttp://www.ncbi.nlm.nih.gov/pubmed/16436480. ? pages 12, 32, 33, 78[43] J. N. Knight and C. Anderson. Stable reinforcement learning with recurrentneural networks. Journal of Control Theory and Applications, 9(3):410?420,July 2011. ISSN 1672-6340. doi:10.1007/s11768-011-0177-1. URLhttp://link.springer.com/10.1007/s11768-011-0177-1. ? pages 7, 17[44] P. Kokotovic. The joy of feedback: nonlinear and adaptive. IEEE ControlSystems, 12(3):7?17, June 1992. ISSN 1066-033X. doi:10.1109/37.165507.URL http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=165507http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=165507. ?pages 1186[45] P. G. Kry and D. K. Pai. Interaction capture and synthesis. ACMTransactions on Graphics, 25(3):872?880, July 2006. ISSN 07300301.doi:10.1145/1141911.1141969. URLhttp://portal.acm.org/citation.cfm?doid=1141911.1141969. ? pages 80[46] M. Lesmana and D. K. Pai. A biologically inspired controller for fast eyemovements. In IEEE International Conference on Robotics and Automation,pages 3670?3675. IEEE, May 2011. ISBN 978-1-61284-386-5.doi:10.1109/ICRA.2011.5980432. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5980432http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=5980432. ?pages 6, 10, 13, 39, 53, 79[47] D. Li. Biomechanical Simulation of the Hand Musculoskeletal System andSkin. Master thesis, University of Bitish Columbia, 2013. ? pages x, 2, 24,66, 67, 80[48] W. Li and E. Todorov. Iterative Linear Quadratic Regulator Design forNonlinear Biological Movement Systems. In ICINCO (1), pages 222?229,2004. URL ilqg biological.pdf. ? pages 16[49] D. Liu and E. Todorov. Hierarchical optimal control of a 7-DOF arm model.In IEEE Symposium on Adaptive Dynamic Programming and ReinforcementLearning, pages 50?57, Nashville, TN, USA, Mar. 2009. IEEE. ISBN978-1-4244-2761-1. doi:10.1109/ADPRL.2009.4927525. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4927525.? pages 17[50] L. Liu, K. Yin, M. van de Panne, and B. Guo. Terrain runner: control,parameterization, composition, and planning for highly dynamic motions.ACM Transactions on Graphics, 31(6):154, Nov. 2012. ISSN 07300301.doi:10.1145/2366145.2366173. URLhttp://dl.acm.org/citation.cfm?doid=2366145.2366173. ? pages 2, 16, 17[51] J. T.-H. Lo. Unsupervised Hebbian learning by recurrent multilayer neuralnetworks for temporal hierarchical pattern recognition. In 44th AnnualConference on Information Sciences and Systems (CISS), pages 1?6. IEEE,Mar. 2010. ISBN 978-1-4244-7416-5. doi:10.1109/CISS.2010.5464925.URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5464925.? pages 787[52] M. Lukos?evic?ius and H. Jaeger. Reservoir computing approaches torecurrent neural network training. Computer Science Review, 3(3):127?149,Aug. 2009. ISSN 15740137. doi:10.1016/j.cosrev.2009.03.005. URLhttp://linkinghub.elsevier.com/retrieve/pii/S1574013709000173. ? pages 8[53] M. Malhotra, E. Rombokas, E. Theodorou, E. Todorov, and Y. Matsuoka.Reduced Dimensionality Control for the ACT Hand. In IEEE InternationalConference on Robotics and Automation, pages 5117?5122. IEEE, 2012. ?pages 5, 6, 8, 9, 11, 17, 23, 24, 26, 27, 35, 37, 68, 78, 79[54] J. Martens and I. Sutskever. Learning recurrent neural networks withhessian-free optimization. In 28th International Conference on MachineLearning, Bellevue, WA, USA, 2011. ? pages 7[55] D. Nguyen-Tuong and J. Peters. Learning Robot Dynamics for ComputedTorque Control Using Local Gaussian Processes Regression. In ECSISSymposium on Learning and Adaptive Behaviors for Robotic Systems(LAB-RS), pages 59?64, Edinburgh, UK, Aug. 2008. IEEE. ISBN978-0-7695-3272-1. doi:10.1109/LAB-RS.2008.16. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4599428.? pages 6[56] P. Pastor, L. Righetti, M. Kalakrishnan, and S. Schaal. Online movementadaptation based on previous sensor experiences. In 2011 IEEE/RSJInternational Conference on Intelligent Robots and Systems, pages 365?371,San Francisco, CA, USA, Sept. 2011. IEEE. ISBN 978-1-61284-456-5.doi:10.1109/IROS.2011.6095059. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6095059.? pages 2, 18, 80[57] P. Pastor, M. Kalakrishnan, F. Meier, F. Stulp, J. Buchli, E. Theodorou, andS. Schaal. From dynamic movement primitives to associative skillmemories. Robotics and Autonomous Systems, 61(4):351?361, Apr. 2013.ISSN 09218890. doi:10.1016/j.robot.2012.09.017. URLhttp://dx.doi.org/10.1016/j.robot.2012.09.017http://linkinghub.elsevier.com/retrieve/pii/S0921889012001716. ? pages 18[58] J. Porrill and P. Dean. Recurrent cerebellar loops simplify adaptive controlof redundant and nonlinear motor systems. Neural computation, 19(1):170?193, Jan. 2007. ISSN 0899-7667. doi:10.1162/neco.2007.19.1.170.URL http://www.ncbi.nlm.nih.gov/pubmed/17134321. ? pages 20, 2188[59] Rimon and D. E. Koditschek. Exact robot navigation using artificialpotential functions. IEEE Transactions on Robotics and Automation, 8(5):501?518, 1992. ISSN 1042296X. doi:10.1109/70.163777. URLhttp://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=163777http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=163777. ?pages 11, 12, 39[60] D. Robinson. Oculomotor control signals. In Basic mechanisms of ocularmotility and their clinical implications, pages 337?374. 1975. ? pages 12[61] M. Rolf, J. J. Steil, and M. Gienger. Online Goal Babbling for rapidbootstrapping of inverse models in high dimensions. In IEEE InternationalConference on Development and Learning (ICDL), pages 1?8, Frankfurt amMain, Aug. 2011. IEEE. ISBN 978-1-61284-989-8.doi:10.1109/DEVLRN.2011.6037368. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6037368.? pages 5, 38, 39, 78[62] E. Rombokas, M. Malhotra, E. Theodorou, E. Todorov, and Y. Matsuoka.Tendon-Driven Variable Impedance Control Using Reinforcement Learning.In Robotics: Science and Systems, Sydney, NSW, Australia, 2012. ? pages2, 6, 11, 17, 18[63] E. Rombokas, E. Theodorou, M. Malhotra, E. Todorov, and Y. Matsuoka.Tendon-driven control of biomechanical and robotic systems: A pathintegral reinforcement learning approach. In IEEE International Conferenceon Robotics and Automation, pages 208?214. IEEE, May 2012. ISBN978-1-4673-1405-3. doi:10.1109/ICRA.2012.6224650. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6224650.? pages 6, 17[64] F. Rosenblatt. The perceptron: A probabilistic model for information storageand organization in the brain. Psychological Review, 65(6):386?408, 1958.ISSN 0033-295X. doi:10.1037/h0042519. URLhttp://content.apa.org/journals/rev/65/6/386. ? pages 8[65] R. Saegusa, G. Metta, G. Sandini, and S. Sakka. Active motor babbling forsensorimotor learning. In IEEE International Conference on Robotics andBiomimetics, pages 794?799. IEEE, Feb. 2009. ISBN 978-1-4244-2678-2.doi:10.1109/ROBIO.2009.4913101. URLhttp://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=4913101http:89//ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4913101. ?pages 4, 6[66] M. Salmen and P. Ploger. Echo State Networks used for Motor Control. InIEEE International Conference on Robotics and Automation, pages1953?1958. IEEE, 2005. ISBN 0-7803-8914-X.doi:10.1109/ROBOT.2005.1570399. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1570399.? pages 8[67] A. M. Schaefer. Reinforcement Learning with Recurrent Neural Networks.PhD thesis, University of Osnabruck, 2008. ? pages 17[68] R. Shadmehr. The equilibrium point hypothesis for control of movement.Technical report, Johns Hopkins School of Medicine, Baltimore, MD, USA,1998. URL http://jhu.edu/shadmehr/Reprints/eqpoint.pdf. ? pages 12, 33[69] J. Shmidhuber. Reinforcement Learning in Markovian and Non-MarkovianEnvironments. In Advances in Neural Information Processing Systems,pages 500?506, San Mateo, CA, USA, 1991. ? pages 17[70] J. F. Stein. Role of the cerebellum in the visual guidance of movement.Nature, 323(6085):217?221, 1986. ISSN 0028-0836.doi:10.1038/323217a0. URLhttp://www.ncbi.nlm.nih.gov/pubmed/3762673. ? pages 21[71] S. Sueda, A. Kaufman, and D. K. Pai. Musculotendon simulation for handanimation. ACM Transactions on Graphics, 27(3):1, Aug. 2008. ISSN07300301. doi:10.1145/1360612.1360682. URLhttp://portal.acm.org/citation.cfm?doid=1360612.1360682. ? pages 2, 24,30, 74, 78[72] S. Sueda, G. L. Jones, D. I. W. Levin, and D. K. Pai. Large-scale dynamicsimulation of highly constrained strands. ACM Transactions on Graphics,30(4):1, July 2011. ISSN 07300301. doi:10.1145/2010324.1964934. URLhttp://portal.acm.org/citation.cfm?doid=2010324.1964934. ? pages 24, 46,66[73] I. Sutskever and G. Hinton. Temporal-kernel recurrent neural networks.Neural networks : the official journal of the International Neural NetworkSociety, 23(2):239?243, Mar. 2010. ISSN 1879-2782.doi:10.1016/j.neunet.2009.10.009. URLhttp://www.ncbi.nlm.nih.gov/pubmed/19932002. ? pages 790[74] E. Theodorou. Iterative path integral stochastic optimal control: Theory andapplications to motor control. PhD thesis, Viterbi School of Engineering,2011. URL http://dl.acm.org/citation.cfm?id=2338290. ? pages 11, 16, 17[75] E. Todorov. A generalized iterative LQG method for locally-optimalfeedback control of constrained nonlinear stochastic systems. In AmericanControl Conference, pages 300?306. IEEE, 2005. ISBN 0-7803-9098-9.doi:10.1109/ACC.2005.1469949. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1469949http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=1469949. ?pages 16, 30[76] E. Todorov. Efficient computation of optimal actions. Proceedings of theNational Academy of Sciences of the United States of America, 106(28):11478?83, July 2009. ISSN 1091-6490. doi:10.1073/pnas.0710743106.URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2705278&tool=pmcentrez&rendertype=abstract. ? pages 15[77] E. Todorov. Compositionality of Optimal Control Laws. In Advances inNeural Information Processing Systems, pages 1856?1864, Vancouver, BC,Canada, 2009. ? pages 15[78] J. M. Vandeweghe, M. Rogers, M. Weissert, and Y. Matsuoka. The ACThand: design of the skeletal structure. In IEEE International Conference onRobotics and Automation, pages 3375?3379, New Orleans, LA, USA, 2004.IEEE. ISBN 0780382323. URLhttp://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:ACT+hand+finger+control:+Muscle+and+joint+torque+control+strategies#2http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=1308775. ? pages 2, 6, 23[79] R. J. Williams and D. Zipser. A Learning Algorithm for ContinuallyRunning Fully Recurrent Neural Networks. Neural Computation, 1(2):270?280, June 1989. ISSN 0899-7667. doi:10.1162/neco.1989.1.2.270.URL http://www.mitpressjournals.org/doi/abs/10.1162/neco.1989.1.2.270.? pages 7[80] T. Yamazaki and S. Tanaka. The cerebellum as a liquid state machine.Neural networks : the official journal of the International Neural NetworkSociety, 20(3):290?297, Apr. 2007. ISSN 0893-6080.doi:10.1016/j.neunet.2007.04.004. URLhttp://www.ncbi.nlm.nih.gov/pubmed/17517494. ? pages 891[81] I. B. Yildiz, H. Jaeger, and S. J. Kiebel. Re-visiting the echo state property.Neural networks: the official journal of the International Neural NetworkSociety, 35:1?9, Nov. 2012. ISSN 1879-2782.doi:10.1016/j.neunet.2012.07.005. URLhttp://www.ncbi.nlm.nih.gov/pubmed/22885243. ? pages 7[82] A. Zhang, M. Malhotra, and Y. Matsuoka. Musical piano performance bythe ACT Hand. In IEEE International Conference on Robotics andAutomation, pages 3536?3541, Shanghai, May 2011. IEEE. ISBN978-1-61284-386-5. doi:10.1109/ICRA.2011.5980342. URLhttp://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5980342.? pages 6, 9, 1792


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items