UBC Theses and Dissertations

Data driven auto-completion for keyframe animation, Xinyi Zhang, 2018

Data Driven Auto-completion for Keyframe Animation

by

Xinyi Zhang

B.Sc., Massachusetts Institute of Technology, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2018

© Xinyi Zhang, 2018

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Data Driven Auto-completion for Keyframe Animation

submitted by Xinyi Zhang in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Michiel van de Panne, Computer Science
Supervisor

Leonid Sigal, Computer Science
Second Reader

Abstract

Keyframing is the main method used by animators to choreograph appealing motions, but the process is tedious and labor-intensive. In this thesis, we present a data-driven autocompletion method for synthesizing animated motions from input keyframes. Our model uses an autoregressive two-layer recurrent neural network that is conditioned on target keyframes. Given a set of desired keys, the trained model is capable of generating an interpolating motion sequence that follows the style of the examples observed in the training corpus.

We apply our approach to the task of animating a hopping lamp character and produce a rich and varied set of novel hopping motions using a diverse set of hops from a physics-based model as training data. We discuss the strengths and weaknesses of this type of approach in some detail.

Lay Summary

Computer animators today use a tedious process called keyframing to make animations. In this process, animators must carefully define a large number of guiding poses, also known as keyframes, for a character at different times in an action. The computer then generates smooth transitions between these poses to create the final animation.
In this thesis, we develop a smart animation auto-completion system to speed up the keyframing process by making it possible for animators to define fewer keyframes. Using statistical models, our system learns a character’s movement patterns from previous animation examples and then incorporates this knowledge to generate longer intervals between keyframes.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science Degree in Computer Science. The entire work presented here is original work done by the author, Xinyi Zhang, performed under the supervision of Dr. Michiel van de Panne, with code contributions from Ben Ling on the 3D visualizer for displaying results. A version of this work is currently in review for the 2018 Motion, Interaction, and Games conference.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
    1.1 Problems and Challenges
    1.2 Our approach
    1.3 Thesis Overview
2 Related Work
    2.1 Hand-drawn In-Betweening
    2.2 Computer-based In-Betweening
    2.3 Physics-based Methods for Animation
    2.4 Data-driven Motion Synthesis and Animation
    2.5 Deep Learning for Motion Synthesis
3 Method Overview
    3.1 The Animation Database
        3.1.1 Physics-based Method for Generating Animations
        3.1.2 The Articulated Lamp Model
        3.1.3 Control Scheme for Jumping
        3.1.4 Finding Motions with Simulated Annealing
    3.2 Data Preprocessing
    3.3 ARNN Network
    3.4 Training
        3.4.1 Loss Function
        3.4.2 Curriculum Learning
        3.4.3 Training Details
4 Results
    4.1 Comparison to Other Architectures
5 Conclusions
Bibliography

List of Tables

Table 3.1 The autoregressive recurrent neural network (ARNN) trained with vs without curriculum on a smaller sample set of 80 jump sequences for 20000 epochs. Curriculum learning results in lower loss values.

Table 4.1 The test losses from other network architectures vs the ARNN. The ARNN produces the best overall losses.

List of Figures

Figure 2.1 Visualization of the Vector Animation Complex [11] data structure (top) for a 2D bird animation (bottom). (Reproduced from Figure 10 of [11] with permission.)
Figure 2.2 (a) The neural network architecture developed by Holden et al. [16] for motion synthesis. The feed-forward network maps high-level control parameters to motion in the hidden space of a convolutional autoencoder. (b) Character motion generated from an input trajectory using the method developed in [16]. (Reproduced from Figures 1 and 2 of [16] with permission.)

Figure 3.1 The three stages of our system: creation of the animation database, training of the ARNN network, and generation of believable motions at runtime.

Figure 3.2 The mechanical configuration of Luxo.

Figure 3.3 Pose control graph for jumping Luxo. The control graph is parameterized by the target poses for the 4 states and the transition durations t1, t2, t3, t4.

Figure 3.4 The energy, acceptance probability, and temperature values for a single run of the simulated annealing algorithm over 1000 search iterations.

Figure 3.5 Preprocessed training data with extracted keyframes for each degree of freedom. There are 25 frames of animation for each full jump, punctuated by 6 frames of still transition between jumps. The bold open circles indicate extracted key frames.

Figure 3.6 Architecture of the ARNN. The ARNN is composed of a recurrent portion and a feed-forward portion.

Figure 3.7 Curriculum learning schedules used to train the ARNN network during 60000 training epochs: teacher forcing ratio decay (green curve) and key importance weight ω annealing (blue curve).

Figure 4.1 Motion reconstruction of four jumps taken from the test set. The animation variables (AVARS) from the original simulated motion trajectory from the test set are plotted in light green.
Extracted keyframes are circled in dark green. The resulting motion generated using our network is displayed in dark blue.

Figure 4.2 Height edits. Keyframes extracted from the original test jump are shown in green. The trajectory of the generated motion smoothly tracks the new apex keyframes edited to have base heights of 0.7 (top), 1.5 (middle), and 1.8 (bottom) times the original base height.

Figure 4.3 Timing edits. Input keyframes extracted from the original test jump are shown in green. The predicted poses at key locations are shown in dark gray. The top figure shows the prediction using unmodified keyframes; the keyframes for the pose at the top of the 3 jumps occur at t=13, 46, 79. In the middle figure, the jumps are keyed to have a faster takeoff. The new keyframes with the same pose are relocated to t=7, 40, 73. The bottom figure shows the jumps with a slower takeoff, with the jump-top keyframes shifted to t=19, 52, 85.

Figure 4.4 Motion generation with sparse keyframes. The apex and landing keyframes for the above jumps have been removed from the input.

Figure 4.5 Coupled nature of the motion synthesis model. Edits to a single degree of freedom (top graph, base y position) lead to different warping functions for the other degrees of freedom.

Figure 4.6 Motion synthesis from novel keyframe inputs. We created new keyframes from randomly sampled and perturbed keys taken from the test set (green). The output motion from the network is shown with predicted poses at input key locations shown in dark gray.

Figure 4.7 Architecture of the segregated network, which combines an RNN-only prediction produced by the Motion Pattern Network with a keyframe-conditioned correction produced by a feed-forward Interpolation Network.
Figure 4.8 Qualitative comparison between results for (a) a feed-forward net; (b) a segregated net; and (c) the ARNN. The ARNN and segregated nets produce smoother motions at key transitions with the use of memory.

Glossary

ARNN  autoregressive recurrent neural network
AVAR  animation variable
FSM   finite state machine
GRU   gated recurrent unit
RNN   recurrent neural network
SELU  scaled exponential linear unit

Acknowledgments

Firstly, I would like to thank my supervisor Dr. Michiel van de Panne for helping me turn my fanciful ideas about animation into reality. His optimism and good cheer are truly inspiring. Whenever I encountered moments of difficulty, Michiel would always patiently work with me to find ways to move forward. I am incredibly grateful to have had the opportunity to learn from his wisdom and knowledge.

I wish to thank my lab mates Glen Berseth, Boris Dalstein, Xue Bin Peng, Jacob Chen, Zhaoming Xie, and Chenxi Liu for helping me feel at home in graduate school with their companionship and support. Additionally, I extend my thanks to Dr. Leonid Sigal for his suggestions and comments on my thesis and Dr. Daniel Holden for his helpful advice on troubleshooting neural networks. I also wish to thank all of my colleagues and friends from the computer graphics community who have supported me throughout this journey.

To my parents, Hongbing Zhang and Manhua Song, and the best brother one could ever have, Daniel Zhang, thank you for always being there for me, for supporting me through the tough times, and for helping me discover the good in life. I’m forever grateful for all that you’ve given me and all that you continue to give me.

My time in Vancouver has been magical because of my wonderful friends. I’d never imagined I’d be snow camping, having bear adventures (I forgive you Sabnam), or wearing yoga leggings everywhere, but here I am, kind of missing the rain.
A special thanks to Sabnam, Jessica Yu, and my Vancouver friends for showing me the beauty of this city. I would also like to thank my friends at home and abroad for their messages and visits, which made my days so much brighter.

Finally, all of this is really the fault of Charles Badiller and Omar Elhindi, two very talented animators I met during my internship at Disney in 2014. If I hadn’t sat next to them that summer witnessing their frustrations, none of this work would have been possible. I wish to thank Charles, Omar, and artists everywhere for their dedication and passion. Your work has been the inspiration for all that I’ve done here.

Chapter 1
Introduction

Animation is a beautiful art form that has evolved artistically and technically over many years. However, despite advancements in tools and technologies, one of the main methods used to produce animation, keyframing, remains a labor-intensive and challenging process. In this process, sets of poses, or keyframes, are placed at critical moments in an action, and the animation system generates smoothly transitioning inbetweens between keys to complete the motion. For artists, keyframing offers a low-level degree of timing and positional control that enables them to create highly expressive animations. However, the keyframing process is slow and labor-intensive. The spline curves generally used for interpolation are not aware of the type of motion, nor its style or physics. Stated another way, the motion inbetweening process is agnostic to what is being animated. Consequently, animators must often do extra work to achieve a desired result, either by manually adjusting the interpolation curve parameters or by inserting many additional keyframes. In a production environment where characters often have hundreds of animation variables (AVARS) controlling different degrees of freedom, this way of working quickly becomes unmanageable [35].
It is not uncommon for artists to spend many hours posing and defining key frames or poses to choreograph motions, with a typical animator at Pixar producing only around 4 seconds of animation every one or two weeks.

1.1 Problems and Challenges

Over the years, many researchers have sought ways to automate the animation process, but this is a challenging task. To be helpful to animators, a good system would need to generate motions comparable to those produced by professionals. However, crafting high-quality motions requires tremendous expertise, and animators take into account many factors in the creation of animations, including advanced knowledge of physical laws, acting, and visual appeal. In order to achieve non-trivial levels of automation, parts of this expertise must somehow be emulated. Further, being able to generate quality motions is not enough. In order to be practically useful, an automation system should afford artists a high degree of creative freedom and facilitate artistic control. Previous approaches to the problem using physics-based or data-driven techniques to augment the animation process are seldom used in practice because they fail to support low-level timing or positional control of motions. An ideal automation system should support and accelerate the workflow while still allowing for precise art-direction of motions by artists.

1.2 Our approach

In this thesis, we address the issues discussed above by introducing a novel data-driven method for automation that supports low-level keyframe control. The core component of our method is a new autoregressive recurrent neural network (ARNN) architecture that is conditioned on the target keyframes. The network is trained using a specialized loss function that places extra importance on the accuracy of the pose predictions made at the designated keyframe times.
Given a set of example motions of a character to be animated, the ARNN learns the motion characteristics of the character, along with a blending function for key frame interpolation. As a consequence, motions produced by the ARNN match the style and movement characteristics of the character and conform to the art-direction provided by the artist through keyframing. Additionally, the flexibility of a recurrent neural network (RNN) model allows our method to naturally support the ability for artists to control the level of automation by applying the method to either tightly spaced or loosely spaced keyframes.

We train our network on a database of physics-based motion samples of a planar articulated hopping Luxo character, and generate novel motions of the character choreographed using keyframing. The results demonstrate the expressiveness and flexibility of our system.

1.3 Thesis Overview

The remainder of this thesis is organized as follows. In Chapter 2, we review previous approaches for automatic inbetweening and related work in motion synthesis and deep learning. Chapter 3 describes our process for constructing the animation data set we use to train the ARNN network (Sections 3.1 and 3.2) and details the network architecture and training procedure (Sections 3.3 and 3.4). Chapter 4 presents and discusses the results produced using our method. Lastly, in Chapter 5, we discuss the limitations of our system and possible future directions for the work.

Chapter 2
Related Work

There is an abundance of related work that serves as inspiration and as building blocks for our work, including inbetweening, physics-based methods, data-driven animation methods, and machine learning methods as applied to computer animation. Below we note the most relevant work in each of these areas in turn.

2.1 Hand-drawn In-Betweening

Methods for automatically inbetweening hand-drawn key frames have a long history, dating back to Burtnyk and Wein [6].
Kort introduced an automatic system for inbetweening 2D drawings which identifies stroke correspondences between drawings and then interpolates between the matched strokes [20]. Whited et al. [34] focus on the case in which two key frames are very similar in shape. They use a graph-based representation of strokes to find correspondences, and then perform a geometric interpolation to create natural, arc-shaped motion between keys. More recently, Dalstein et al. [11] introduced a novel data structure, the Vector Animation Complex (VAC), shown in Figure 2.1, to enable continuous interpolation in both space and time for 2D vector drawings. The VAC handles complex changes in drawing topology and supports a keyframing paradigm similar to that used by animators. For our work, we focus on the problem of inbetweening for characters that are modeled as articulated rigid bodies.

Figure 2.1: Visualization of the Vector Animation Complex [11] data structure (top) for a 2D bird animation (bottom). (Reproduced from Figure 10 of [11] with permission.)

2.2 Computer-based In-Betweening

Inbetweening in computer animation is performed semi-automatically. Animators use control points to precisely define interpolating splines which blend between parameters (i.e., AVARS) at keyframes to generate smooth animations. As noted in Chapter 1, animators typically touch hundreds of AVARS when animating, and defining spline curves for individual AVARS is tedious. Methods enabling faster edits of animation curves include motion warping [37], which preserves local movement patterns during edits by shifting and scaling original animation curves to satisfy new keyframe constraints, reducing the work required to adjust individual splines. Staggered poses [9] encode timing relationships between coordinated AVARS and preserve inter-AVAR movement patterns during edits of correlated movements. There has been less work on the generation of inbetweening splines. Shen et al.
[29] developed a procedural method for automatically generating detailed coordinated motions using a minimum number of AVARS. However, their method is limited to mostly cyclical motions. Nebel et al. [25] generated interpolating splines from keyframes with the objective of avoiding self-collisions between the limbs of characters. Our objective in this thesis of generating appealing, style- and motion-aware inbetweens is more complex.

2.3 Physics-based Methods for Animation

Another important class of methods relies on simulation to automatically generate motions. However, although physics-based methods excel at producing realistic and plausible motions, they are inherently difficult to control. This is because the motions are defined by an initial state and the equations of motion, which are then integrated forward in time. However, keyframes require motions to reach pose objectives at specific instants of time in the future. One approach to the control problem for rigid bodies generates collections of likely trajectories for objects by varying simulation parameters and leaves users to select the most desirable trajectory out of the possibilities, e.g., [7, 31]. Space-time constraint methods [3, 15, 36] take another approach, treating animation as a trajectory optimization problem. These methods can treat the keyframe poses and timing as hard constraints and are capable of producing smooth and realistic interpolating motions. However, these methods are generally complex to implement, can be problematic to use whenever collisions are involved, and require explicitly defined objective functions to achieve desired styles. The recent work of Bai et al. [2] combines interpolation with simulation to enable more expressive animation of non-physical actions for keyframed 2D cartoon animations. Their method focuses on secondary animation of shape deformations and is agnostic to the style and context of what is being animated, a drawback shared by physics-based methods.
Our work focuses on the animation of articulated figures and the desire to achieve context-aware motion completion for primary animation.

2.4 Data-driven Motion Synthesis and Animation

Techniques to synthesize new motions from an existing motion database include motion blending and motion graphs. Motion graphs organize large datasets of motion clips into a graph structure in which edges mark transitions between motion segments located at nodes. New motions can be generated by traversing walks on the graph, which can follow high-level constraints placed by the user, e.g., [22]. Such methods cannot generalize beyond the space of motions present in the database. To create larger variations in what can be generated, motion blending may be used. Motion blending techniques can generate new motions satisfying high-level control parameters by interpolating between motion examples, e.g., [13, 21, 27]. Relatedly, motion edits can also be performed using motion warping to meet specific offset requirements [38]. The parameterizations of most motion blending methods do not support precise art-direction.

Subspace methods identify a subspace of motions, usually linear, in which to perform trajectory optimization. The work of Safonova et al. [28] was among the first to propose this approach. Min et al. [24] construct a space-time parameterized model which enables users to make sequential edits to the timing or kinematics. Their system models the timing and kinematic components of the motions separately and fails to capture spatio-temporal correlations. Additionally, their system is based on a linear analysis of the motions that have been put into correspondence with each other, whereas our proposed method builds on an autoregressive predictive model that can in principle be more flexible and therefore more general. In related work, Wei et al. [33] develop a generative model for human motion that exploits both physics-based priors and statistical priors based on GPLVMs.
As with [24], this model uses a trajectory optimization framework.

2.5 Deep Learning for Motion Synthesis

Recently, researchers have begun to exploit deep learning algorithms for motion synthesis. One such class of methods uses recurrent neural networks (RNNs), which can model temporal dependencies between data points. Fragkiadaki et al. [12] used an Encoder-Recurrent-Decoder network for motion prediction and mocap data generation. Crnkovic-Friis [10] use an RNN to generate dance choreography with globally consistent style and composition. However, the above methods provide no way of exerting control over the generated motions. Additionally, since RNNs can only generate data points similar to those seen in the training data set in a forward manner, without modification, they cannot be used for in-filling. Recent work by Holden et al. [16, 17] takes a step towards controllable motion generation using deep neural networks. In one approach [16], illustrated in Figure 2.2, they construct a manifold of human motion using a convolutional autoencoder trained on a motion capture database, and then train another neural network to map high-level control parameters to motions on the manifold. In [17], the same authors take a more direct approach for real-time controllable motion generation and train a phase-functioned neural network to directly map keyboard controls to output motions. Since [16] synthesizes entire motion sequences in parallel using a CNN, their method is better suited for motion editing, and does not support more precise control through keyframing. The method proposed in [17] generates motion in the forward direction only and requires the desired global trajectory for every time step. The promise (and challenge) of developing stable autoregressive models of motion is outlined in a number of recent papers, e.g., [16, 23]. The early NeuroAnimator work [14] was one of the first to explore learning control policies for physics-based animation.
Recent methods also demonstrate the feasibility of reinforcement learning as applied to physics-based models, e.g., [26]. These do not provide the flexibility and control of keyframing, however, and they currently work strictly within the space of physics-based simulations.

Figure 2.2: (a) The neural network architecture developed by Holden et al. [16] for motion synthesis. The feed-forward network maps high-level control parameters to motion in the hidden space of a convolutional autoencoder. (b) Character motion generated from an input trajectory using the method developed in [16]. (Reproduced from Figures 1 and 2 of [16] with permission.)

Chapter 3
Method Overview

In this chapter, we describe the details of our system, which is shown in Figure 3.1. First, to create our dataset for training, we use simulation to generate a set of jumping motions for our linkage-based Luxo lamp character (see Section 3.1) and preprocess the simulation data (see Section 3.2) to extract sequences of animation frames along with "virtual" keyframes and timing information for each sequence. During training, we feed the ARNN network the keyframes of each sequence and drive the network to learn to reproduce the corresponding animation sequence using a custom loss function (see Sections 3.3 and 3.4). Once the network is trained, users can synthesize new animations by providing the network with a sequence of input keyframes.

The structure of our ARNN neural network is illustrated in Figure 3.6 and described in greater detail in Section 3.3. The ARNN is a neural network composed of a recurrent portion and a feedforward portion. The recurrent portion, in conjunction with the feedforward portion, helps the net learn both the motion characteristics of the training data and the keyframe constraints.
The ARNN takes as input a sequence of n key frames $X = \{X_0, X_1, \ldots, X_n\}$, along with timing information describing the temporal location of keys, $T = \{T_0, T_1, \ldots, T_n\}$, and outputs a final interpolating sequence of frames $Y = \{Y_0, Y_1, \ldots, Y_m\}$ of length $m = T_n - T_0 + 1$. Frames are pose representations, where each component $X^i$ or $Y^i$ describes the scalar value of a degree of freedom of Luxo.

[Figure 3.1 image: plots of the base y position, base x position, base orientation, and link angles over time, showing in-between frames and keys, under three panel titles: 1. Create Animation Database, 2. Train Neural Network, 3. Motion Generation at Runtime.]

Figure 3.1: The three stages of our system: creation of the animation database, training of the ARNN network, and generation of believable motions at runtime.

3.1 The Animation Database

In this section, we describe our physics-based method for generating motion samples for Luxo and our procedure for transforming the motion data into a format suitable for training. Our animation database for training consists of hopping motions of a 2D Luxo character.

3.1.1 Physics-based Method for Generating Animations

In order to efficiently train an expressive network, we need a sufficiently sized motion database containing a variety of jumping motions. Creating such a dataset of jumps by hand would be impractical, so we developed a physics-based solution to generate our motion samples. We build a physics-based model of the Luxo character in Bullet with actuated joints, and use simulated annealing to search for control policies that make Luxo hop off the ground.

3.1.2 The Articulated Lamp Model

Figure 3.2 shows the mechanical configuration of Luxo which we use for simulation.
The model is composed of 4 links and has 6 degrees of freedom: the x position of the base link (L1), the y position of the base link, the orientation of the base link θ1, the joint angle θ2 between the base link and the leg link (L2), the joint angle θ3 between the leg link and the neck link (L3), and the joint angle θ4 at the lamp head (L4). Despite its simple construction, the primitive Luxo is expressive and capable of a rich range of movements. To drive the motion of the character, we equip each joint with a proportional-derivative (PD) controller. The PD controller computes an actuating torque τ that moves the link towards a given target pose $\theta_d$ according to $\tau = k_p(\theta_d - \theta) - k_d\,\omega$, where θ and ω denote the current link position and velocity, and $k_p$ and $k_d$ are the stiffness and damping parameters for the controller. By finding suitable pose targets for the PD controllers over time, we can drive the lamp to jump. However, searching for admissible control policies can be intractable due to the large space of possible solutions. Thus, we restrict the policy search space by developing a simple control scheme for hops which makes the optimization problem easier.

[Figure 3.2 image: diagram of the lamp with axes X and Y, links L1 to L4, joints A1 to A3, and angles θ1 to θ4.]

Figure 3.2: The mechanical configuration of Luxo.

Figure 3.3: Pose control graph for jumping Luxo. The control graph is parameterized by the target poses for the 4 states and the transition durations t1, t2, t3, t4.

3.1.3 Control Scheme for Jumping

Our control scheme for Luxo, shown in Figure 3.3, is based on the periodic controller synthesis technique presented in [32].
It consists of a simple finite state machine (FSM) with a cyclic sequence of timed state transitions. The FSM behavior is governed by a set of transition duration parameters and the target poses for each state, which then dictate the amount of torque applied to Luxo’s joints, via PD controllers, when in a given state.

Given this control scheme, we can find parameter values for the transition durations and pose targets that propel Luxo forward in a potentially rich variety of hop styles. To this end, we use simulated annealing, which finds good motions by iteratively searching the parameter space using forward simulation.

3.1.4 Finding Motions with Simulated Annealing

Algorithm 1 gives the pseudocode for the simulated annealing algorithm. The algorithm starts with an initial guess for the transition durations and target poses. At each step, the algorithm seeks to improve upon the current best guess s for the control parameters by selecting a candidate solution s′ and comparing the quality, or energy, of the resulting motions. If s′ produces a better jump with lower energy, the algorithm updates the current best guess. Otherwise, the algorithm probabilistically decides whether to keep the current guess according to a temperature parameter, T. At the beginning, T is set to be high to encourage more exploration of unfavorable configurations, and it is slowly tapered off until only strictly better solutions are accepted.

Algorithm 1 Simulated Annealing

function SIMULATEDANNEALING()
    T ← Tmax
    K ← 0
    s ← INIT()
    while K < Kmax do
        s′ ← NEIGHBOUR(s)
        ∆E ← E(s′) − E(s)
        if RANDOM() < ACCEPT(T, ∆E) then
            s ← s′
        end if
        K ← K + 1
        T ← COOLING(K)
    end while
    return s
end function

The Energy Function

The quality of motions is measured by the energy function

$E(s) = -(1.0 - w_e)\,D - w_e H_{max}.$
(3.1)This function is a weighted sum of the total distance D and maximum height14Hmax reached by Luxo after cycling through the pose control graph three timesusing a particular policy, corresponding to three hops. To obtain a greater varietyof hops, we select a random value for we between 0.0 and 1.0 for each run of thesimulated annealing algorithm and biasing the search towards different trajectories.After Kmax iterations of search, we record the trajectory of the best jump found ifit is satisfactory, i.e reaches a certain distance and max height, and contains threehops.Picking Neighboring Candidate Solutions s′At each iteration of the simulated annealing algorithm, we choose a new candidatesolution by slightly perturbing the current best state s. We select one of the 4transition durations to perturb by 7 ms and one joint angle for each of the 4 targetposes to perturb by 0.5 radians.Temperature And Acceptance ProbabilityThe probability of transitioning from the current state s to the new candidate states′ for each iteration of search is specified by the following function:ACCEPT(T,E(s),E(s′)) =1, if E(s′)< E(s′)exp(−(E(s′)−E(s))T ), otherwise (3.2)Thus, we transition to the new candidate state if it has strictly lower energythan the current state s′. Otherwise, we decide probabilistically whether we shouldexplore the candidate state. From (3.2), we see that this probability is higher forstates with lower energy and for higher temperature values T . As T cools downto 0, the algorithm increasingly favors transitions that move towards lower energystates.Implementation DetailsWe use the Bullet physics engine [1] to control and simulate the Luxo charac-ter. We use a temperature cooling schedule of T (K) = 2 ∗ 0.999307K and search15for Kmax = 1000 iterations for our implementation of simulated annealing. Afteraround 10000 total runs of the algorithm, we obtained 300 different successful ex-amples of repeatable hops for our final motion dataset. 
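The search loop described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the cooling constants, Equation (3.1) and Equation (3.2) are taken from the text, while the function names, the `neighbour` and `evaluate` callables, and the tuple returned by `simulated_annealing` are assumptions made for the example. In the real system, `evaluate` would run a forward physics simulation of three hops and return the energy of the resulting motion.

```python
import math
import random

T0 = 2.0          # initial temperature, from T(K) = 2 * 0.999307^K
DECAY = 0.999307  # per-iteration cooling factor quoted in the text
K_MAX = 1000      # number of search iterations quoted in the text


def cooling(k):
    """Temperature after k iterations of search."""
    return T0 * DECAY ** k


def accept(temperature, e_current, e_candidate):
    """Acceptance probability of Equation (3.2)."""
    if e_candidate < e_current:
        return 1.0
    return math.exp(-(e_candidate - e_current) / temperature)


def energy(distance, max_height, w_e):
    """Equation (3.1): lower energy means a longer and/or higher hop."""
    return -(1.0 - w_e) * distance - w_e * max_height


def simulated_annealing(init, neighbour, evaluate, k_max=K_MAX, rng=random):
    """Generic annealing loop.

    `neighbour(s, rng)` proposes a perturbed parameter set and
    `evaluate(s)` returns its energy; both are hypothetical callables
    standing in for the perturbation rule and the physics simulation.
    """
    s = init
    e_s = evaluate(s)
    for k in range(k_max):
        candidate = neighbour(s, rng)
        e_c = evaluate(candidate)
        if rng.random() < accept(cooling(k), e_s, e_c):
            s, e_s = candidate, e_c
    return s, e_s
```

With the quoted constants, the temperature falls from 2 to roughly 1 over 1000 iterations, so a sketch like this still accepts some uphill moves near the end of a run.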
For each type we use three successive hops for training.

Figure 3.4: The energy, acceptance probability, and temperature values for a single run of the simulated annealing algorithm over 1000 search iterations.

3.2 Data Preprocessing

The raw simulated data generated from the above procedure consists of densely sampled pose information for jumps and must be preprocessed into a suitable format that includes plausible keyframes for training. We note that in most animation systems, a keyframe also allows a user to define the incoming and outgoing tangents, so that incoming and outgoing velocities, as well as motion discontinuities, can be specified; this is not the case for our system. For hops, we use the liftoff pose, the landing pose, and any pose with Luxo in the air as the 3 key poses for each jump action. To create a larger variety of motions in the training set, we also randomly delete 0-2 arbitrary keys from each jump sequence every time it is fed into the network during training, so the actual training data set is much larger.

In order to extract the above key poses along with inbetween frames, we first identify events corresponding to jump liftoffs and landings in the raw data. Once the individual jump segments are located, we evenly sample each jump segment, beginning at the liftoff point and ending at the landing point, to obtain 25 frames for each hop. We then take one of those 25 frames with Luxo in the air to be another keyframe, along with its timing relative to the previous keyframe. Although individual jumps may have different durations, we choose to work with the relative timing of poses within jumps rather than the absolute timing of poses for our task, and thus sample evenly within each hop segment. In the pause between landings and liftoffs, we take another 6 samples as inbetween frames to be included in the final sequence.
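The even resampling and random key deletion described above might look as follows. This is a sketch under stated assumptions: the count of 25 frames per hop and the deletion of 0-2 keys come from the text, but the use of linear interpolation, the array layout (one row per sample, one column per degree of freedom), and the function names are illustrative choices, not details confirmed by the thesis.

```python
import numpy as np

FRAMES_PER_HOP = 25  # frames per hop, from the text


def resample_segment(poses, n_frames=FRAMES_PER_HOP):
    """Evenly resample a (num_samples, num_dofs) pose array to n_frames rows.

    Linear interpolation over normalized time is an assumption; it makes
    the result independent of the absolute duration of the jump.
    """
    poses = np.asarray(poses, dtype=float)
    src_t = np.linspace(0.0, 1.0, len(poses))
    dst_t = np.linspace(0.0, 1.0, n_frames)
    columns = [np.interp(dst_t, src_t, poses[:, d]) for d in range(poses.shape[1])]
    return np.stack(columns, axis=1)


def drop_random_keys(key_indices, rng, max_drop=2):
    """Randomly delete between 0 and max_drop keys from a list of key indices."""
    keys = list(key_indices)
    n_drop = min(int(rng.integers(0, max_drop + 1)), len(keys))
    dropped = set(rng.choice(len(keys), size=n_drop, replace=False).tolist())
    return [k for i, k in enumerate(keys) if i not in dropped]
```

Calling `drop_random_keys` with a fresh draw each time a sequence is fed to the network is what makes the effective training set larger than the number of stored sequences.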
The fully processed data consists of a list of key frames $X = \{X_0, X_1, ..., X_n\}$ and their temporal locations $T = \{T_0, T_1, ..., T_n\}$, together with the full sequence including key frames and inbetween frames $Y = \{Y_0, Y_1, ..., Y_m\}$, where $m = T_n - T_0 + 1$. Figure 3.5 shows a preprocessed jump sequence from our training set.

Figure 3.5: Preprocessed training data with extracted keyframes for each degree of freedom. There are 25 frames of animation for each full jump, punctuated by 6 frames of still transition between jumps. The bold open circles indicate extracted key frames. [Plot panels: base y pos, base x pos, base orientation, and three link angles, each against time; legend: In-between Frames, Keys.]

3.3 ARNN Network

Given a sequence of key poses $X$ and timing information $T$, the task of the ARNN is to progressively predict the full sequence of frames $Y$ that interpolate between keys. The network makes its predictions sequentially, one frame at a time. To make a pose prediction for a frame $t$, temporally located in the interval between key poses $X_K$ and $X_{K+1}$, the network takes as input the keys $X_K$ and $X_{K+1}$, the previous predicted pose $Y'_{t-1}$, the previous recurrent hidden state $H_{t-1}$, and the relative temporal location of $t$, which is defined as $t_{rel} = \frac{t - T_K}{T_{K+1} - T_K}$, where $T_K$ and $T_{K+1}$ are the temporal locations of $X_K$ and $X_{K+1}$ respectively. Note that we use absolute positions for the x and y locations of Luxo’s base, rather than the relative $\Delta x$ and $\Delta y$ translations with respect to the base position in the previous frame. This does not yield the desired invariance with respect to horizontal or vertical positions. However, in our experiments, absolute positions produced smoother and more desirable results than relative positions.

Figure 3.6: Architecture of the ARNN. The ARNN is composed of a recurrent portion and a feed-forward portion. [Diagram: pose feature preprocessing of $Y'_{t-1}$, $X_K$ and $X_{K+1}$, concatenated with $t_{rel}$, feeds a recurrent network of two GRU layers ($H_{t-1}$ to $H_t$), followed by a feedforward network of two FC layers and an output layer whose residual is added to $Y'_{t-1}$ to give $Y'_t$.]

We now further describe the details of the ARNN network structure, which is composed of a recurrent and a feedforward portion, and then discuss the loss function and procedure we use to train the network to accomplish our objectives.

In the first stage of prediction, the network takes in all the relevant inputs, including the previous predicted frame $Y'_{t-1}$, the previous hidden state $H_{t-1}$, the preceding key frame $X_K$, the next key frame $X_{K+1}$, and the relative temporal location $t_{rel}$ of $t$, to produce a new hidden state $H_t$.

$Y'_{t-1}$, $X_K$, and $X_{K+1}$ are first individually preprocessed by a feedforward network composed of two linear layers as a pose feature extraction step, before they are concatenated along with $t_{rel}$ to form the full input feature vector at time $t$, $x_t$. This concatenated feature vector, along with the hidden state $H_{t-1}$, is fed into the recurrent portion of the network, composed of two layers of gated recurrent units (GRUs) [8]. We use scaled exponential linear unit (SELU) activations [19] between intermediate outputs. The GRU cells have 100 hidden units each, with architectures described by the following equations:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} H_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} H_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} H_{t-1} + b_{hn})) \\
H_t &= (1 - z_t) \odot n_t + z_t \odot H_{t-1}
\end{aligned}
\tag{3.3}
$$

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes the element-wise product.

For the second stage of the prediction, the network takes the hidden output $H_t$ from the previous stage as input into a feedforward network composed of two linear layers containing 10 hidden units each.
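To make Equation (3.3) concrete, the following NumPy sketch implements one GRU update and the relative time $t_{rel}$. It is an illustration only: the parameter dictionary, its key names, the random initialization, and the input dimension used in the test are assumptions for the example, and the sketch covers a single GRU layer rather than the full two-layer ARNN.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid, the sigma of Equation (3.3)."""
    return 1.0 / (1.0 + np.exp(-a))


def relative_time(t, t_k, t_k1):
    """Relative temporal location of frame t between keys at t_k and t_k1."""
    return (t - t_k) / (t_k1 - t_k)


def gru_step(x_t, h_prev, p):
    """One GRU update following Equation (3.3).

    `p` is a hypothetical dict holding the weight matrices and biases.
    """
    r = sigmoid(p["W_ir"] @ x_t + p["b_ir"] + p["W_hr"] @ h_prev + p["b_hr"])
    z = sigmoid(p["W_iz"] @ x_t + p["b_iz"] + p["W_hz"] @ h_prev + p["b_hz"])
    n = np.tanh(p["W_in"] @ x_t + p["b_in"] + r * (p["W_hn"] @ h_prev + p["b_hn"]))
    return (1.0 - z) * n + z * h_prev


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    p = {}
    for gate in ("r", "z", "n"):
        p[f"W_i{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"W_h{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_i{gate}"] = np.zeros(hidden_dim)
        p[f"b_h{gate}"] = np.zeros(hidden_dim)
    return p
```

A quick sanity check of the update rule: with all weights and biases set to zero, both gates equal 0.5 and the candidate state is zero, so the new hidden state is exactly half of the previous one.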
The output of the second-stage feedforward network is then mapped to a 6-dimensional residual pose vector, which is added to the previous pose prediction $Y'_{t-1}$ to produce the final output $Y'_t$.

The recurrent and feedforward portions of the network work together to accumulate knowledge about keyframe constraints while learning to make predictions that are compatible with the history of observed outputs. This dual-network structure arose from experiments showing that an RNN-only network or a feedforward-only network is insufficient for meeting the competing requirements of self-consistency with the motion history and consistency with the keyframe constraints. Our experiments with other architectures are detailed in Section 4.1.

3.4 Training

During the training process, we use backpropagation to optimize the weights of the ARNN network so that the net can reproduce the full sequence of frames $Y$ given the input keyframe information $X$ and $T$. In this section, we describe the details of our training process, including the loss function and the curriculum learning strategy we use to train the network.

3.4.1 Loss Function

To drive the network to learn to interpolate between keyframe constraints, we developed a custom loss function for the task,

$$
L_{ARNN} = 100\,\omega \sum_{K=1}^{n} (X_K - Y'_{T_K})^2 + \mathrm{MSE}(Y, Y'), \qquad
\mathrm{MSE}(W, Z) = \frac{1}{N} \sum_{i=1}^{N} (W_i - Z_i)^2.
\tag{3.4}
$$

This custom loss function is composed of two parts: the frame prediction loss and the key loss. The frame prediction loss, $\mathrm{MSE}(Y, Y')$, is the vanilla mean-squared error loss, which calculates the cumulative pose error between poses in the final predicted sequence and the ground truth sequence. This loss helps the network encode movement information about the character during training, as it is coerced to reproduce the original motion samples. By itself, this frame loss is insufficient for our inbetweening task because it fails to model the influence of keyframe constraints on the final motion sequence.
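Equation (3.4) can be written out directly in NumPy. The sketch below is illustrative and makes a few assumptions that the text does not settle: the mean in the MSE term is taken over all frames and degrees of freedom, `key_times` holds the row indices of the keyframes within the predicted sequence, and the function name and return tuple are invented for the example.

```python
import numpy as np


def arnn_loss(y_true, y_pred, keys, key_times, omega, key_weight=100.0):
    """Loss of Equation (3.4): weighted key loss plus frame prediction loss.

    y_true, y_pred: (num_frames, num_dofs) ground truth and predicted poses.
    keys:           (num_keys, num_dofs) input keyframe poses X_K.
    key_times:      row index of each key within the sequence (assumed layout).
    omega:          Key Importance Weight, annealed during training.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    keys = np.asarray(keys, dtype=float)
    key_times = np.asarray(key_times, dtype=int)

    # Frame prediction loss: mean squared error over every pose entry.
    frame_loss = float(np.mean((y_true - y_pred) ** 2))
    # Key loss: squared error between each key and the prediction at its time.
    key_loss = float(np.sum((keys - y_pred[key_times]) ** 2))
    total = key_weight * omega * key_loss + frame_loss
    return total, key_loss, frame_loss
```

Setting `omega` to 0 reduces the expression to the plain frame loss, which corresponds to the early phase of the curriculum in which the network only learns the motion pattern.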
In order to be consistent with the keyframing paradigm, inbetween frames generated by the network should be informed by the art-direction of the input keys and interpolate between them. Consequently, we introduce an additional term to the total loss, the key loss $\sum_{K=1}^{n} (X_K - Y'_{T_K})^2$, to penalize discrepancies between predicted and ground truth keys, forcing the network to pay attention to the input keyframe constraints. Amplifying the weight of this loss term simulates placing a hard constraint on the network to always produce frame predictions that hit the original input keyframes. In experiments, we found that a weight of 100 was sufficient. However, because the network must incorporate the contrasting demands of learning both the motion pattern of the data and keyframe interpolation during training, setting the weight of the key loss to 100 from the beginning is not optimal. Consequently, we introduce $\omega$, the Key Importance Weight, which we anneal during the training process as part of a curriculum learning method [5] to help the network learn better during training.

3.4.2 Curriculum Learning

In our curriculum learning method, we set $\omega$ to 0 in the beginning stages of training and slowly increase $\omega$ during the latter half of the training process. Thus, for the first part of the training process, the network primarily focuses on learning the motion characteristics of the data. Once the first stage of learning has stabilized, we increase $\omega$ so the network can begin to consider the keyframe constraints. During the first phase of the training, we apply scheduled sampling [4] as another curriculum learning regime to help the recurrent portion of the net learn movement patterns.
In this scheme, we feed the recurrent network the ground truth input for the previous pose $Y_{t-1}$ instead of the recurrent prediction $Y'_{t-1}$ at the start of training, and gradually change the training process to fully use generated predictions $Y'_{t-1}$, which corresponds to the inference situation at test time. The probability of using the ground truth input at each training epoch is controlled by the teacher forcing ratio, which is annealed using an inverse sigmoid schedule shown by the green curve of Figure 3.7. Once the learning has stabilized, we gradually increase $\omega$ and push the network to learn the interpolation aspect of the prediction until the network is able to successfully make predictions for the training samples under the new loss function. The sigmoid schedule for $\omega$ is shown by the blue curve in Figure 3.7.

Without curriculum learning, the network has a harder time learning during the training process, and we observe a large error spike in the learning curve at the beginning of training. In contrast, using the above curriculum learning method produces a smoother learning curve and superior qualitative and quantitative results, as shown in Table 3.1.

    Training Procedure    Key Loss      Frame Loss    Total Loss
    With ω annealing      0.00100392    0.00395658    0.00496051
    No ω annealing        0.00120846    0.00670011    0.00790857

Table 3.1: The ARNN trained with vs. without curriculum on a smaller sample set of 80 jump sequences for 20000 epochs. Curriculum learning results in lower loss values.

Figure 3.7: Curriculum learning schedules used to train the ARNN network during 60000 training epochs: teacher forcing ratio decay (green curve) and key importance weight $\omega$ annealing (blue curve). [Plot: schedule values against epochs, 0 to 60000.]

3.4.3 Training Details

The final model we use to produce motions in the results section is trained on 240 jump sequences, with 3 jumps per sequence.
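For reference, the two curriculum schedules of Section 3.4.2 can be sketched as simple functions of the epoch. The inverse sigmoid form $k / (k + e^{i/k})$ is the one proposed in the scheduled sampling paper [4]; whether the thesis uses exactly this form, and every constant below (`k`, `midpoint`, `width`), are assumptions chosen only so that the curves change over a 60000-epoch run.

```python
import math
import random


def teacher_forcing_ratio(epoch, k=1000.0):
    """Inverse sigmoid decay: probability of feeding the ground truth pose.

    The constant k is hypothetical; larger k delays the decay.
    """
    return k / (k + math.exp(min(epoch / k, 700.0)))


def key_importance_weight(epoch, midpoint=45000.0, width=2500.0):
    """Sigmoid ramp for omega, near 0 early and near 1 late in training.

    midpoint and width are hypothetical values, not taken from the thesis.
    """
    a = max(min((epoch - midpoint) / width, 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(-a))


def use_ground_truth(epoch, rng=random):
    """Decide whether this step is fed Y_{t-1} (True) or Y'_{t-1} (False)."""
    return rng.random() < teacher_forcing_ratio(epoch)
```

In a training loop, `use_ground_truth` would be queried when choosing the previous pose fed to the recurrent network, and `key_importance_weight` would supply $\omega$ for the loss at the current epoch.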
As noted above, from each jump sequence we obtain many more sequences by randomly removing 0-2 arbitrary keys from the sequence each time it is fed into the network. The model is optimized for 60000 epochs using Adam [18] with an initial learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$, and regularized using dropout [30] with probabilities 0.9 and 0.95 for the first and second GRU layers in the recurrent portion of the network and probabilities 0.8 and 0.9 for the first and second linear layers in the feedforward portion. The current training process takes 130 hours on an NVIDIA GTX 1080 GPU, although we expect that significant further optimizations are possible.

Chapter 4

Results

We demonstrate the autocompletion method by choreographing novel jumping motions using our system. When given keyframe constraints that are similar to those in the training set, our system reproduces the motions accurately. For keyframe constraints that deviate from those seen in the training dataset, our system generalizes, synthesizing smooth motions even with keyframe inputs that are physically infeasible. Results are best seen in the accompanying video.

We first test our system’s ability to reproduce motions from the test dataset, i.e., motions that are excluded from the training data, based on keyframes derived from those motions. The AVARs for two synthesized jumps from the test set are plotted in Figure 4.1. More motions can be seen in the video. The trajectories generated using our system follow the original motions closely, and accurately track almost all of the keyframe AVARs. The motions output by our system are slightly smoother than the original jumps, possibly due to predictions regressing to an average or to the residual nature of the network predictions.

Figure 4.1: Motion reconstruction of four jumps taken from the test set. AVARs from the original simulated motion trajectory from the test set are plotted in light green. Extracted keyframes are circled in dark green. The resulting motion generated using our network is displayed in dark blue. [Plot panels for each jump: base y pos, base x pos, base orientation, and the link angles (including neck and head), each against time; legend: In-between Frames, Prediction, Keys.]

We next demonstrate generalization by applying edits of increasing amplitudes to motions in the test set. Our system produces plausible interpolated motions as we modify keyframes to demand motions that are increasingly non-physical and divergent from the training set. Figure 4.2 shows a sequence of height edits to a jump taken from the test set. As we decrease or increase the height of the keyframe at the apex, the generated motion follows the movements of the original jump and tracks the new keyframe constraints with a smooth trajectory. The generated motion also appears to be natural and physically correct. The character accelerates during liftoff and decelerates nearing the apex of the jump before accelerating again at landing. Next, we modify the timing of another sequence from the test set. Timing edits are shown in Figure 4.3. Here, our system generates reasonable results. The spacing of the generated trajectory again appears to follow the laws of physics, and the character accelerates and decelerates at the right moments in the jump. However, we do note that the generated motions do not track the keyframe constraints as precisely. In the second figure (Figure 4.3b), the new timing requires that Luxo cover the first portion of the jump in half the original amount of time.
This constraint deviates far from motions of the training set, and the motion generated by the network is biased toward a more physically plausible result.

(a) Jump apex keyframed at 0.7x original height
(b) Jump apex keyframed at 1.5x original height
(c) Jump apex keyframed at 1.8x original height

Figure 4.2: Height edits. Keyframes extracted from the original test jump are shown in green. The trajectory of the generated motion smoothly tracks the new apex keyframes edited to have base heights of 0.7 (top), 1.5 (middle), and 1.8 (bottom) times the original base height.

(a) Prediction using original keyframes extracted from a test jump.
(b) Prediction with 2x faster jump takeoffs.
(c) Prediction with 1.5x slower jump takeoffs.

Figure 4.3: Timing edits. Input keyframes extracted from the original test jump are shown in green. The predicted poses at key locations are shown in dark gray. The top figure shows the prediction using unmodified keyframes; the keyframes of the pose at the top of the 3 jumps occur at t = 13, 46, 79. In the middle figure, the jumps are keyed to have faster takeoff. The new keyframes with the same pose are relocated to t = 7, 40, 73. The bottom figure shows the jumps with slower takeoff, with the jump top keyframes shifted to t = 19, 52, 85.

We next show that in the absence of keyframe direction, i.e., with sparse key inputs, our network is still able to output believable trajectories, demonstrating that the predictions output by the network are style-and-physics aware. In Figure 4.4, we have removed the apex keyframe and the landing keyframe of the second jump in the sequence. Our network creates smooth jump trajectories following the movement characteristics of the Luxo character despite the lack of information. This is enabled by the memory embodied in the recurrent model, which allows the network to build an internal motion model of the character to make rich predictions.

Figure 4.4: Motion generation with sparse keyframes. The apex and landing keyframes for the above jumps have been removed from the input. [Plot panels: base y pos, base x pos, base orientation, and the link angles (including neck and head), each against time; legend: In-between Frames, Prediction, Keys.]

The non-linearity and complexity of the motions output by the network can also be seen in Figure 4.5. It shows the changes of individual degrees of freedom resulting from an edit to the height of the keyframe, as seen in the top graph. If this edit were made with a simple motion warping model, applied individually to each degree of freedom, then we would expect to see an offset in the top graph that is slowly blended in and then blended out. Indeed, for the base position, the motion deformation follows a profile of approximately that nature. However, the curves for the remaining degrees of freedom are also impacted by the edit, revealing the coupled nature of the motion synthesis model.

Figure 4.5: Coupled nature of the motion synthesis model. Edits to a single degree of freedom (top graph, base y position) lead to different warping functions for the other degrees of freedom. [Plot panels: base y pos, base x pos, base orientation, and the link angles, each against time; legend: Ground Truth, Keys, Prediction, Deformation.]

A number of results are further demonstrated by randomly sampling, mixing, and perturbing keyframes from the test dataset, as shown in Figure 4.6.

Figure 4.6: Motion synthesis from novel keyframe inputs. We created new keyframes from randomly sampled and perturbed keys taken from the test set (green).
The output motion from the network is shown with predicted poses at input key locations shown in dark gray.

4.1 Comparison to Other Architectures

We evaluate other network architectures for the task of keyframe auto-completion, including a keyframe conditioned feed-forward network with no memory and a segregated network which separates the motion pattern and interpolation portions of the task. The architecture of the segregated network is shown in Figure 4.7. This produces a pure RNN prediction with no keyframe conditioning that is then corrected with a residual produced by a keyframe conditioned feed-forward network. The number of layers and hidden units for all the networks are adjusted to produce a fair comparison, and all networks are trained on 80 sample jump sequences until convergence.

Figure 4.7: Architecture of the segregated network, which combines an RNN-only prediction produced by the Motion Pattern Network with a keyframe conditioned correction produced by a feed-forward Interpolation Network. [Diagram: the Motion Pattern Network (a recurrent net taking $Y'_{t-1}$ and $H_{t-1}$, followed by a linear map and a residual connection) outputs $H_t$ and the recurrent prediction $\hat{Y}_t$; the Interpolation Network (a feed-forward net taking $\frac{t - t_K}{t_{K+1} - t_K}$, $X_K$ and $X_{K+1}$) outputs a residual that is added to give the final prediction $Y'_t$.]

Quantitatively, the ARNN produces the lowest total loss out of the three architectures (Table 4.1), and the segregated net produces the worst loss.

    Architecture      Key Loss       Frame Loss    Total Loss
    ARNN              0.00100392     0.00395658    0.00496051
    No Memory Net     0.000183804    0.0081695     0.00835331
    Segregated Net    0.00124264     0.0160917     0.0189122

Table 4.1: The test losses from other network architectures vs. the ARNN. The ARNN produces the best overall losses.

The ARNN also produces the best results qualitatively, while the feed-forward only network produces the least desirable results due to motion discontinuities. Figure 4.8 shows the failure case when a network has no memory component.
Although the feed-forward only network generates interpolating inbetweens that are consistent with one another locally within individual motion segments, globally across multiple keys the inbetweens are not coordinated, leading to motion discontinuities. This is most evident in the predictions for the base x AVAR. The segregated net and the ARNN both have memory, which allows these nets to produce smoother predictions with global consistency. Ultimately, however, the combined memory and keyframe conditioning structure of the ARNN produces better results than the segregated net, which separates the predictions.

Figure 4.8: Qualitative comparison between results for (a) a feed-forward net; (b) a segregated net; and (c) the ARNN. The ARNN and segregated nets produce smoother motions at key transitions with the use of memory. [Plot panels for each network: base y pos, base x pos, base orientation, and the link angles (including neck and head), each against time; legend: In-between Frames, Prediction, Keys.]

Chapter 5

Conclusions

In this thesis, we explored a conditional autoregressive method for motion-aware keyframe completion. Our method synthesizes motions adhering to the art-direction of input keyframes while following the style of samples from the training data, combining intelligent automation with flexible controllability to support and accelerate the animation process. In our examples, the training data comes from physics-based simulations, and the model produces plausible reconstructions when given physically-plausible keyframes.
For motions that are non-physical, our model is capable of generalizing to produce smooth motions that adapt to the given keyframe constraints.

The construction of autoregressive models allows for a single model to be learned for a large variety of character movements. Endowed with memory, our network can learn an internal model of the movement patterns of the character and use this knowledge to intelligently extrapolate frames in the absence of keyframe guidance. The recurrent nature of our model allows it to operate in a fashion that is more akin to a simulation, i.e., it makes forward predictions based on a current state. This has advantages, e.g., simplicity and usability in online situations, and disadvantages, e.g., lack of consideration of more than one keyframe in advance, as compared to motion trajectory optimization methods. As noted in previous work, e.g., [23], the construction of predictive autoregressive models can be challenging, and thus the proposed conditional model is a further proof of feasibility for this class of model and its applications.

Trajectory optimization methods are different in nature from our work, as they require an explicit model of the physics and the motion style objectives. In contrast, autoregressive models such as ours make use of a data-driven implicit model of the dynamics that encompasses both the physics and style of the example motions. These differences make a meaningful direct comparison difficult. The implicit modeling embodied by the data-driven approach offers convenience and simplicity, although this comes at the expense of needing a sufficient number (and coverage) of motion examples.

The method we present, together with its evaluation, still has numerous limitations. If target keyframes go well beyond what was seen in the training set, the motion quality may suffer.
We wish to further improve the motion coverage of the method via data augmentation methods, e.g., collecting physics-based motions in reduced gravity environments. While the current results represent an initial validation of our approach, we wish to apply our model to more complex motions and characters. A last general problem with the type of learning method we employ is that of reversion to the mean when there are multiple valid possibilities for reaching a given keyframe. In future work, we wish to develop methods that can sample from trajectory distributions.

Currently, a primary extrapolation artifact is an apparent loss of motion continuity in the vicinity of the keyframes, which can happen when the model generates inbetweens that fail to interpolate the keyframes closely. This artifact could likely be ameliorated with additional training that further weights the frame prediction loss after $\omega$ has been annealed. This could help the network consolidate knowledge and produce smoother results. Discontinuities in motions caused by collision impulses are also not fully resolved in our method. These are modeled implicitly in the learned model, and the resulting motion quality suffers slightly as a result. An alternative approach would be to add explicit structure to the learned model in support of modeling collision events.

In terms of production usability, a limitation of the current model we have developed is its lack of support for partial keyframe control; the full set of AVARs for the character must be specified per keyframe for our system. However, animators working in production environments almost never specify the full set of AVARs when keyframing. This limitation is in part due to the nature of the artificial training data set we’ve created for this thesis, which does not include partial keyframes.
In future work, we would like to add support for partial keyframing by using more realistic animation datasets.

Lastly, there is still other significant work to be done before our system can be incorporated into production tools, in terms of required training data and tackling issues that may only be observable with deployment at scale. The quality of results produced by our system is dependent on the training data we use to train the ARNN. If there is not enough training data or motion variation in the training data, the quality of the output may deteriorate. Additionally, the Luxo character we created for this thesis only has 6 degrees of freedom, but in production settings, animators often work with characters controlled by hundreds of AVARs. We hope to test the applicability of our system for production use on real animation data sets and in real production settings in future work.

Bibliography

[1] Y. Bai and E. Coumans. A python module for physics simulation in robotics, games and machine learning, 2016-2017. http://pybullet.org/.

[2] Y. Bai, D. M. Kaufman, C. K. Liu, and J. Popović. Artist-directed dynamics for 2d animation. ACM Trans. Graph., 35(4):145:1–145:10, July 2016. ISSN 0730-0301. doi:10.1145/2897824.2925884. URL http://doi.acm.org/10.1145/2897824.2925884.

[3] J. Barbič, M. da Silva, and J. Popović. Deformable object animation using reduced optimal control. In ACM SIGGRAPH 2009 Papers, SIGGRAPH ’09, pages 53:1–53:9, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-726-4. doi:10.1145/1576246.1531359. URL http://doi.acm.org/10.1145/1576246.1531359.

[4] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1171–1179, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969370.

[5] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi:10.1145/1553374.1553380. URL http://doi.acm.org/10.1145/1553374.1553380.

[6] N. Burtnyk and M. Wein. Computer animation of free form images. In ACM SIGGRAPH Computer Graphics, volume 9, pages 78–80. ACM, 1975.

[7] S. Chenney and D. A. Forsyth. Sampling plausible solutions to multi-body constraint problems. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 219–228, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co. ISBN 1-58113-208-5. doi:10.1145/344779.344882. URL http://dx.doi.org/10.1145/344779.344882.

[8] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014. URL http://arxiv.org/abs/1409.1259.

[9] P. Coleman, J. Bibliowicz, K. Singh, and M. Gleicher. Staggered poses: A character motion representation for detail-preserving editing of pose and coordinated timing. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’08, pages 137–146, Aire-la-Ville, Switzerland, 2008. Eurographics Association. ISBN 978-3-905674-10-1. URL http://dl.acm.org/citation.cfm?id=1632592.1632612.

[10] L. Crnkovic-Friis and L. Crnkovic-Friis. Generative choreography using deep learning. CoRR, abs/1605.06921, 2016. URL http://arxiv.org/abs/1605.06921.

[11] B. Dalstein, R. Ronfard, and M. van de Panne. Vector graphics animation with time-varying topology. ACM Trans. Graph., 34(4):145:1–145:12, July 2015. ISSN 0730-0301. doi:10.1145/2766913. URL http://doi.acm.org/10.1145/2766913.

[12] K. Fragkiadaki, S. Levine, and J. Malik. Recurrent network models for kinematic tracking. CoRR, abs/1508.00271, 2015. URL http://arxiv.org/abs/1508.00271.

[13] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović. Style-based inverse kinematics. ACM Trans. Graph., 23(3):522–531, Aug. 2004. ISSN 0730-0301. doi:10.1145/1015706.1015755. URL http://doi.acm.org/10.1145/1015706.1015755.

[14] R. Grzeszczuk, D. Terzopoulos, and G. Hinton. Neuroanimator: Fast neural network emulation and control of physics-based models. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 9–20. ACM, 1998.

[15] K. Hildebrandt, C. Schulz, C. von Tycowicz, and K. Polthier. Interactive spacetime control of deformable objects. ACM Trans. Graph., 31(4):71:1–71:8, July 2012. ISSN 0730-0301. doi:10.1145/2185520.2185567. URL http://doi.acm.org/10.1145/2185520.2185567.

[16] D. Holden, J. Saito, and T. Komura. A deep learning framework for character motion synthesis and editing. ACM Trans. Graph., 35(4):138:1–138:11, July 2016. ISSN 0730-0301. doi:10.1145/2897824.2925975. URL http://doi.acm.org/10.1145/2897824.2925975.

[17] D. Holden, T. Komura, and J. Saito. Phase-functioned neural networks for character control. ACM Trans. Graph., 36(4):42:1–42:13, July 2017. ISSN 0730-0301. doi:10.1145/3072959.3073663. URL http://doi.acm.org/10.1145/3072959.3073663.

[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.

[19] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. CoRR, abs/1706.02515, 2017. URL http://arxiv.org/abs/1706.02515.

[20] A. Kort. Computer aided inbetweening. In Proceedings of the 2nd International Symposium on Non-photorealistic Animation and Rendering, NPAR ’02, pages 125–132, New York, NY, USA, 2002. ACM. ISBN 1-58113-494-0. doi:10.1145/508530.508552. URL http://doi.acm.org/10.1145/508530.508552.

[21] L. Kovar and M. Gleicher. Automated extraction and parameterization of motions in large data sets. ACM Trans. Graph., 23(3):559–568, Aug. 2004. ISSN 0730-0301. doi:10.1145/1015706.1015760. URL http://doi.acm.org/10.1145/1015706.1015760.

[22] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. ACM Trans. Graph., 21(3):473–482, July 2002. ISSN 0730-0301. doi:10.1145/566654.566605. URL http://doi.acm.org/10.1145/566654.566605.

[23] Z. Li, Y. Zhou, S. Xiao, C. He, and H. Li. Auto-conditioned lstm network for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363, 2017.

[24] J. Min, Y.-L. Chen, and J. Chai. Interactive generation of human animation with deformable motion models. ACM Trans. Graph., 29(1):9:1–9:12, Dec. 2009. ISSN 0730-0301. doi:10.1145/1640443.1640452. URL http://doi.acm.org/10.1145/1640443.1640452.

[25] J.-C. Nebel. Keyframe interpolation with self-collision avoidance. In N. Magnenat-Thalmann and D. Thalmann, editors, Computer Animation and Simulation ’99, pages 77–86, Vienna, 1999. Springer Vienna. ISBN 978-3-7091-6423-5.

[26] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017), 36(4), 2017.

[27] C. Rose, M. F. Cohen, and B. Bodenheimer. Verbs and adverbs: Multidimensional motion interpolation. IEEE Comput. Graph. Appl., 18(5):32–40, Sept. 1998. ISSN 0272-1716. doi:10.1109/38.708559. URL http://dx.doi.org/10.1109/38.708559.

[28] A. Safonova, J. K. Hodgins, and N. S. Pollard. Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM Transactions on Graphics (ToG), 23(3):514–521, 2004.

[29] C. Shen, T. Hahn, B. Parker, and S. Shen. Animation recipes: Turning an animator’s trick into an automatic animation system. In ACM SIGGRAPH 2015 Talks, SIGGRAPH ’15, pages 29:1–29:1, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3636-9. doi:10.1145/2775280.2792531. URL http://doi.acm.org/10.1145/2775280.2792531.

[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670313.

[31] C. D. Twigg and D. L. James. Many-worlds browsing for control of multibody dynamics. In ACM SIGGRAPH 2007 Papers, SIGGRAPH ’07, New York, NY, USA, 2007. ACM. doi:10.1145/1275808.1276395. URL http://doi.acm.org/10.1145/1275808.1276395.

[32] M. van de Panne, R. Kim, and E. Fiume. Virtual wind-up toys for animation. In Proceedings of Graphics Interface ’94, pages 208–215, 1994.

[33] X. Wei, J. Min, and J. Chai. Physically valid statistical models for human motion generation. ACM Transactions on Graphics (TOG), 30(3):19, 2011.

[34] B. Whited, G. Noris, M. Simmons, R. Sumner, M. Gross, and J. Rossignac. Betweenit: An interactive tool for tight inbetweening. Comput. Graphics Forum (Proc. Eurographics), 29(2):605–614, 2010.

[35] Wikipedia. Avar (animation variable) - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Avar%20(animation%20variable)&oldid=815219789, 2018. [Online; accessed 18-April-2018].

[36] A. Witkin and M. Kass. Spacetime constraints. In Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’88, pages 159–168, New York, NY, USA, 1988. ACM. ISBN 0-89791-275-6. doi:10.1145/54852.378507. URL http://doi.acm.org/10.1145/54852.378507.

[37] A. Witkin and Z. Popovic. Motion warping. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’95, pages 105–108, New York, NY, USA, 1995. ACM. ISBN 0-89791-701-4. doi:10.1145/218380.218422. URL http://doi.acm.org/10.1145/218380.218422.

[38] A. Witkin and Z. Popovic. Motion warping.
In Proceedings of the 22ndannual conference on Computer graphics and interactive techniques, pages105–108. ACM, 1995. → page 743




