Towards Human Pose Estimation in Video Sequences

by

Georgii Oleinikov

B.Sc., V. N. Karazin Kharkiv National University, 2010
M.Sc., V. N. Karazin Kharkiv National University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
The Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)
January 2014

© Georgii Oleinikov 2014

Abstract

Recent advancements in human pose estimation from single images have attracted wide scientific interest of the Computer Vision community to the problem domain. However, the problem of pose estimation from monocular video sequences is largely under-represented in the literature despite the wide range of its applications, such as action recognition and human-computer interaction. In this thesis we present two novel algorithms for video pose estimation that demonstrate how one could improve the performance of a state-of-the-art single-image articulated human detection algorithm on realistic video sequences. Furthermore, we release the UCF Sports Pose dataset, containing full-body pose annotations of people performing various actions in realistic videos, together with a novel pose evaluation metric that better reflects the performance of the current state of the art. We also release the Video Pose Annotation tool, a highly customizable application that we used to construct the dataset. Finally, we introduce a task-based abstraction for human pose estimation, which selects the "best" algorithm for every specific instance based on a task description defined using an application programming interface covering a large volume of the human pose estimation domain.

Preface

The work on the contents of Chapter 6 was done in collaboration with Gregor Miller. In this section of the thesis we would like to highlight the parts of the research project performed by Gregor Miller and by the author. Gregor Miller's contributions towards the project are as follows:

- An idea of a task-based abstraction targeting non-expert users, together with a task-to-algorithm mapping
- Formulation of the target description and organization of the condition matrix (Table 6.1)
- Part of the experiments for the evaluation of pose estimation algorithms

The author of this thesis made the following contributions to the project:

- Formulated the input type and output requirements that are included in the task description
- Surveyed the pose estimation literature to make sure the task description covers a sufficiently large volume of the problem space
- Selected the algorithms that were included in the framework
- Manually annotated training and test images with descriptions such as the amount of clutter and occlusion
- Performed part of the experiments for the evaluation of pose estimation algorithms
- Analyzed the results of the experiments and determined the contents of the condition matrix (Table 6.1)
- Suggested the task-to-algorithm mapping procedure

Also, Kevin Woo helped us to create the UCF Sports Pose dataset by making high-quality annotations using the Video Pose Annotation tool (Chapter 3).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Motivation
  1.2 Outline
  1.3 Organization
2 Related Work
  2.1 Literature Overview
    2.1.1 Pose Estimation Algorithms
    2.1.2 Pose Estimation in Video
    2.1.3 Datasets
    2.1.4 Abstractions over Computer Vision
  2.2 Relevant Algorithms
    2.2.1 Flexible Mixture of Parts
    2.2.2 Dynamic Programming
    2.2.3 Distance Transform of Sampled Functions
    2.2.4 Optical Flow
3 Data Preparation
  3.1 Video Pose Annotation Tool
    3.1.1 Application Features
    3.1.2 Graphical User Interface
    3.1.3 Design
    3.1.4 Implementation
  3.2 Dataset
    3.2.1 Evaluation Metric
  3.3 Discussion
4 Pose Estimation in Video: a Shortest Path Approach
  4.1 Model
  4.2 Inference
  4.3 Experiments
  4.4 Discussion and Future Work
5 Pose Estimation in Video: a Detection Approach
  5.1 Model
  5.2 Inference
    5.2.1 Message Passing
    5.2.2 An Approximate Distance Transform
    5.2.3 The Inference Procedure
  5.3 Experiments
  5.4 Discussion and Future Work
6 Abstracting Human Pose Estimation
  6.1 Task Description
    6.1.1 Input Description
    6.1.2 Output Requirement
    6.1.3 Target Description
  6.2 Task to Algorithm Mapping
    6.2.1 Algorithm Selection
    6.2.2 Closest Algorithm Search
    6.2.3 Parameter Derivation
  6.3 Algorithm Selection Evaluation
  6.4 Discussion and Future Work
7 Conclusion
Bibliography

List of Tables

4.1 Results of our shortest path approach (SPA)
5.1 Results of our detection approach (DA)
6.1 Abstraction condition matrix

List of Figures

1.1 Pose estimation in video sequence example
2.1 The Viterbi algorithm
2.2 Lower envelope of parabolas for the distance transform
3.1 Several examples of 14-joint pose annotations
3.2 Annotation using tracking in a cluttered scene
3.3 Annotation using interpolation in a cluttered scene
3.4 A screenshot of the Video Pose Annotation tool GUI
3.5 Video Pose Annotation tool GUI functionality example
3.6 The UML class diagram for skeletal body models
3.7 Annotation examples from two different datasets
3.8 Example of video pose annotation with alternating limbs
3.9 Examples of the UCF Sports Pose dataset
4.1 Examples of pose estimates of the shortest path approach
4.2 Examples of pose estimates of the shortest path approach
4.3 Examples of pose estimates of the shortest path approach
5.1 Spatio-temporal tree structure of the model
5.2 Examples of pose estimates of the detection approach
5.3 Examples of pose estimates of the detection approach
6.1 Examples of scene conditions and algorithms output
6.2 Graph representing input types
6.3 Graphs representing output requirements
6.4 Algorithm mapping evaluation for full-body pose estimation
6.5 Algorithm mapping evaluation for head yaw estimation
6.6 Algorithm mapping evaluation for upper-body pose estimation

Acknowledgements

First and foremost, I would like to thank my supervisor Prof. Little for his invaluable support and encouragement throughout my Master's thesis, for multiple prolific and stimulating discussions, important suggestions and insights, as well as for his constant involvement and readiness to help. I want to thank Gregor Miller for the interesting and productive collaboration on the pose estimation abstraction, and for multiple inspiring and motivating ideas and suggestions. I would also like to thank Prof. Woodham for thoroughly reading my thesis and providing many valuable comments.

I would like to thank Ankur Gupta for his help as well as many thoughtful discussions and insightful comments on my ideas throughout my Master's thesis. I want to thank Kevin Woo for making the high-quality video annotations as well as for his important feedback and suggestions regarding the Video Pose Annotation tool. I would like to thank Daria Bondareva for her invaluable continuous support throughout my thesis that helped me to overcome multiple challenges, for her readiness to help, and for the work she did on the video annotations.

Dedication

I dedicate this thesis to my Mom, Nina Oliinykova, who made it possible for me, without whom I would not be here. Everything that she gave me from my earliest childhood played a role in who I am now. The English lessons that she taught me when I was a small kid were exciting and inspiring, and I will never forget the verb flower garden that she drew for me. The Math classes that we went through together determined my future career path.
And I will always rememberthe door with a picture of a smiling computer on it, behind which I firstexperienced the exciting world of personal computers.For an amazing childhood, for all the love she gave to me I am infinitelygrateful to my Mom.1Chapter 1IntroductionHuman pose estimation is an area in Computer Vision that aims to identifythe correct pose of a person, often defined by the position of body joints andlimbs, given an image, video sequence or depth data. Unconstrained, real-world human pose estimation is a challenging problem that has been widelystudied in Computer Vision. It has great potential to assist a wide range ofapplications, such as indexing of images and videos, activity understandingand action recognition, automatic surveillance and human-computer inter-action [37].In this thesis we address the problem of 2D human pose estimation inmonocular video sequences. In particular, we are interested in developingalgorithms that determine the 2D coordinates of 14 body joints in everyframe of a given video sequence. We assume that there is only one per-son in the recording and the full body is visible in all frames. Figure 1.1demonstrates an example of the desired output of a video pose estimationalgorithm. Although in this thesis we consider full-body poses only, thetechniques developed in this work can be applied towards upper-body orother kinds of poses.Furthermore, we also consider the pose estimation problem from the op-posite perspective. Namely, how one should select the right pose estimationalgorithm for a particular problem, and what semantic language would oneuse to describe it. We are seeking to develop an abstraction that would allowone to describe any pose estimation problem for which the solving algorithmsexist in the literature or may potentially emerge in future. The problem de-scription includes specification of the input data and requirements for thepose estimation results as well as relevant task conditions. In addition, weare interested in a system that would accept the problem description andreturn the results of a pose estimation method that suits the problem best.1.1 MotivationMost of the pose estimation applications in real-world scenarios provide onewith video sequences. Video cameras and other sensors often work in real-Chapter 1. Introduction 2Figure 1.1: Pose estimation in video sequence example. The pose in eachframe is represented by a body graph with more than 14 joints, with con-nections between joints denoting parts of limbs, torso and head.time, producing streaming data. In fact, there are relatively few applicationswhere pose estimation in images is required while video data is not available.Surprisingly, most of the current pose estimation algorithms focus on esti-mating the pose from single images. However, it seems that pose estimationover time is critical for improving the estimations [25], and the temporalinformation should be utilized for improved performance. This motivatesus to develop two video pose estimation algorithms, attempting to fill inthe gap between the wide range of applications of pose estimation in videoand the lack of algorithms that utilize the full available data by focusing onvideo sequences. We develop these algorithms with an application to sportsvideo analysis in mind.In order to evaluate and train video pose estimation algorithms onewould need the ground truth. To the best of our knowledge, there are noChapter 1. 
Introduction 3publicly available datasets with unconstrained real-world video sequencesand fully annotated human poses (see Section 2.1.3). We firmly believe thatthe shortage of data is one of the reasons for the lack of video pose estima-tion algorithms. Therefore, in this thesis we are also determined to create avideo dataset with annotated full-body poses.In our opinion, the lack of data is mainly due to the fact that it istime-consuming and generally difficult to annotate all frames of a videosequence with human poses. It is much harder than to annotate boundingboxes or provide action labeling, since various pose representations usuallyconsist of 10 to 20 body joints, each of which must be manually adjustedfor every frame. Furthermore, providing one dataset with fully annotatedposes would not solve the problem of the shortage of data, simply becauseit cannot fit all the potential requirements in the degree of data complexityand type. Certain algorithms may require high-quality data with annotatedupper bodies, while other would need full-body annotated videos with heavyocclusions and action labelings. Therefore, we decide to take a thoroughapproach and create a tool, that would allow one to annotate skeletal posesof humans in video sequences.The challenge to develop accurate pose estimation algorithms is not theonly obstacles that prevent them from being effectively embraced in real-world applications. For most software engineers who are not experts inComputer Vision it is challenging to utilize the best pose estimation algo-rithm for their needs. With the state of the art advancing fast it is hardto track down the best algorithm for every specific case of input conditionsand output requirements. Furthermore, it is hard for non-experts to imple-ment these algorithms and keep them up-to-date with the state of the art.In order to address these problems we develop an abstraction over humanpose estimation together with a task-to-algorithm mapping, which selectsthe best pose estimation algorithm based on task description.1.2 OutlineData Preparation. In order to help solve the data shortage problem weintroduce a Video Pose Annotation tool, which allows one to annotate skele-tal poses people in video sequences. The skeletal representations may havevarious forms, enabling the tool to be adjusted for the specific needs of theuser, such as annotation of upper bodies or hands only. The tool features asimple yet powerful graphic user interface, and the annotation process is as-sisted by automatic detection, tracking and interpolation. Furthermore, weChapter 1. Introduction 4introduce the UCF Sports Pose dataset, which is a UCF Sports dataset withannotated full-body poses. It features more than 150 sports video sequenceswith high-quality full-body annotations made with the aforementioned an-notation tool. In addition, we propose a new pose estimation evaluationmetric that in our opinion better reflects the performance of the commonstate-of-the-art algorithms for pose estimation. To the best of our knowl-edge, until most recently both the tool and the dataset were the only onesof their kind1.Pose Estimation. Most of the recent single-image pose estimation al-gorithms build on top of the pictorial structures model [2]. It represents thehuman body as a tractable tree graph, making full search possible via dy-namic programming. However, the direct application of the same techniqueto video sequences is not possible in practice. 
The single-frame tree graphturns into a loopy intractable model, where inference complexity would growexponentially with the number of frames.The most natural way to overcome the above is to make an algorithmthat works on top of the detections returned by a single-image pose esti-mator. It would not search through all possible combination of poses, butit may give a good approximation, depending on the quality of the single-image pose detector. Wang et al. [51] uses dynamic programming to searchthrough multiple single-image detections of a state-of-the-art pose estima-tor, utilizing color information to score pose configurations. Zuffi et al. [58]build their work on the idea of using the optical flow to integrate image ev-idence from multiple frames. They iteratively propagate best single-imagedetections from every frame to the adjacent ones and then refine and shrinkthe set of poses for every frame.The first video pose estimation method that we introduce in this thesisbuilds on the two ideas described above2. We first run a state-of-the-artsingle-image pose detector on every frame of a video sequence, selectingseveral best detections. Then we propagate the poses in every frame toneighbouring frames in order to expand the set of poses for every frame.Finally, we select the best-overall combination of poses throughout the ob-tained sets of poses. We use Flexible Mixture of Parts (FMP) [55] as astate-of-the-art single-image pose estimation algorithm. We refer to thismethod as the shortest path approach, since it involves minimizing the costof transferring from a pose in the first frame to a pose in the last frame.1See Section 2.1.3.2Note that current algorithm was developed independently from [51] and [58]. SeeSection 4.4 for comparison of these approaches.Chapter 1. Introduction 5One of the alternatives to the above is to look at some of the ways to dealwith the temporal dimension from the action recognition approaches liter-ature. Most of the common methods utilizing local features involve eithercomputation of optical flow or trajectories [52] [54] or utilize local volumet-ric space-time features such as SIFT3D [45] or HOG3D [28]. Fragkiadakiet al. [19] use optical flow to segment body parts and propagate segmenta-tions over time. Tian et al. [49] extend the Deformable Parts Model [14]to the temporal dimension by replacing the 2D HOG filters [9] with theirvolumetric versions HOG3D and use them for action detection.The second video pose estimation method that we develop resorts to adifferent approach. Instead of using 3D spatio-temporal features or opticalflow only, we look at how the 2D features change over time along the pathsof the optical flow, thus combining the appearance and flow information ina single framework. Furthermore, in contrast to the approaches above weare able to do inference in the current frame and several previous frames atonce. The latter is possible because we relax the connections between alljoints in every frame but the current one, only leaving the temporal connec-tions between joints and their instances in the past. The resulting structureis a tree, which enables full search via dynamic programming (see Figure5.1). The relaxation of the body edges in previous frames does not have asmuch impact as one may think, because we set their expected positions inaccordance with the backward optical flow around the body joints in thecurrent frame. 
The idea of conversion of the intractable model spanningmore than one frame to a tree structure is the most similar to the work bySapp et al. [44], who decompose the model into an ensemble of several tree-structured models that cover the edge relationships of the full model. Werefer to our method as the detection approach, since it estimates the posein every current frame independently, taking into account several previousframes.Abstraction. In this thesis we also develop an abstraction over humanpose estimation together with a task-to-algorithm mapping. The abstrac-tion features an interface, allowing the user to describe the pose estimationproblem, which includes input conditions and output requirements. Theinterface is flexible enough to describe most of the possible variations in aproblem description. The task to algorithm mapping encompasses expertknowledge about performance of several pose estimation methods and usesit to select the best one according to the given problem definition. We de-sign this system as a part of OpenVL, an abstraction over Computer Vision,which currently encompasses tasks such as segmentation, image registration,correspondence, detection and tracking.Chapter 1. Introduction 6To summarize, in this thesis we make the following five key contributions:? Video Pose Annotation tool, which allows one to annotate skeletalposes of humans in video sequences? UCF Sports Pose dataset, containing realistic videos with full-bodyannotations, together with a new pose estimation evaluation metricPCP2D? Video pose estimation method, demonstrating a way to improve a poseestimation algorithm for video sequences? Novel video pose estimation method, embracing both temporal andappearance information in a single framework? An abstraction over human pose estimation together with a task-to-algorithm mapping, which selects the best algorithm according to thegiven problem description.1.3 OrganizationThis thesis is organized as follows. In Chapter 2 we discuss related work aswell as briefly give necessary background on algorithms that are essential forunderstanding this thesis. We present the Video Pose Annotation tool andthe UCF Sports Pose dataset in Chapter 3. Afterwards we focus on poseestimation algorithms for video sequences and introduce the shortest pathapproach in Chapter 4. Then we proceed to the detection approach for poseestimation in Chapter 5. We describe the abstraction over pose estimationin Chapter 6 and finally finish with conclusions in Chapter 7.7Chapter 2Related WorkIn this chapter we survey the relevant literature in Section 2.1 and thengive necessary background required for understanding of the material inthis thesis in Section 2.2.2.1 Literature OverviewIn this section we first survey related work on pose estimation algorithmsfor various forms of input data and output results from the task-result per-spective, which is be needed in Chapter 6 in order to justify the design ofthe abstraction interface and the selection of the algorithms in the frame-work. Then we proceed to the overview of the existing methods targeting 2Dhuman pose estimation in monocular videos. We further describe relevantdatasets and annotation tools, and then proceed to existing abstractionsover Computer Vision and briefly describe OpenVL.2.1.1 Pose Estimation Algorithms3D human pose estimation is a hard problem that has been researched mostsuccessfully in the setting of depth images. 
A method for super-realtimeestimation of 3D positions of body joints and pixel-wise body-part labelingsbased on randomized decision forests was introduced by Shotton et al.in [46],which was a technology behind the initial release of KinectTM. Fanelli etal. [13] tackled a problem of real-time head pose estimation from depth datausing random regression forests.Another class of methods considered the problem of 3D human poseestimation using sources of data other than depth images. Yu et al. [56] in-troduced a method for monocular 3D pose estimation from video sequencesusing action detection on top of 2D deformable part models. Amin et al. [1]presented a method for 3D pose estimation from multiple calibrated cam-eras, incorporating evidence from every camera obtained with 2D pictorialstructures. The problem of determining 3D shape of the human body to-gether with its pose was considered by Guan et al. [20]. Although estimatingChapter 2. Related Work 8a 3D pose solely from a 2D image is an under-constrained problem, it hasbeen tackled by Simo-Serra et al. [48] by jointly solving 2D detection and3D inference problems.The problem of 2D body pose estimation has traditionally been ap-proached with variations of pictorial structures framework [2]. Recently,Yang and Ramanan introduced a flexible mixture of parts model [55], whichextended the deformable parts model [14] for articulated 2D human detec-tion with considerable improvement to the state of the art. The state of theart was further improved among others by Rothrock et al., who used a com-positional and-or graph grammar model together with segmentation [43].Also, Kinect-style body-part labelings were obtained by Ladicky et al. [31],combining part-based and pixel-based approaches in a single optimizationframework. Hara and Chellappa introduced a super-realtime 2D pose esti-mator with the help of multidimensional output regressors along the bodypart dependency paths [21]. The problem of head and face orientation es-timation from images was tackled by Maji et al. [34] and Zhu and Ra-manan [57].It is easy to see that with such an abundance of algorithms performingvarious tasks the development of an abstraction to select the best algorithmfor every specific case would be beneficial.2.1.2 Pose Estimation in VideoThere is a large literature on 2D human pose estimation in single images.Similarly, many methods were devoted to the pose tracking problem, whichoften assumes correct manual initialization in at least one of the framesof the video sequence. However, general 2D pose estimation in monocularvideo sequences is largely underrepresented in the literature.Nevertheless, several recent papers focused on pose estimation in videowithout any requirement for supervision. Some papers exploit the idea of re-lying on confident detections. Ramanan et al. [41] require that the video se-quence contains an easily detectable canonical pose. They find the pose withan accurate canonical pose detector and use it for instance-specific appear-ance training, which is subsequently utilized to find poses in all frames inde-pendently. Buehler et al. [6] use a similar approach by identifying keyframeswith reliable detections and filling in the intermediate frames taking into ac-count temporal consistency. Ferrari et al. [16] first reduce the search space byhighlighting the foreground with segmentation applied on top of the resultsof a human detection algorithm. 
Then they do single-frame pose detectionsand refine them with a spatio-temporal instance-specific model trained onChapter 2. Related Work 9reliable detections. Wang et al. [51] searches for a best-overall combinationof poses obtained from a single-image pose detector, taking into accounttemporal and appearance coherence.Other methods use optical flow to exploit coherence of the informationfrom consecutive frames. Fragkiadaki et al. [19] use segmented body partsand propagate segmentations over time with the help of optical flow. Zuffiet al. [58] exploit optical flow to propagate best single-image detections tothe adjacent frames and refine and shrink the poses for every frame in aniterative process.Simultaneous inference over more than one frame presents a challenge todeal with loopy intractable models, which necessitates approximate inferenceas in [53]. Alternatively, one may attempt to convert the intractable modelinto one where exact inference is possible. Sapp et al. [44] decompose theloopy model into an ensemble of several tree-structured models that coverthe edge relationships of the full model.2.1.3 DatasetsThere exist a variety of single-image datasets with annotated poses [40] [10][12] [11]. However, there are few video pose annotated datasets mostly dueto the difficulties in manual annotation.HumanEva [47] is a motion capture dataset providing both motion cap-ture and video data of 4 subjects performing a set of 6 actions two timeseach. However, the environment is not realistic and the videos have staticbackgrounds, well centered persons and high contrast clothing, while the setof actions is limited. VideoPose 2.0 [44] is a video dataset with annotatedarm joints every other frame. The dataset consists of 44 short clips, 2-3seconds in length each, 1,286 video frames in total. Most recently, Jhuanget al.released J-HMDB [25], a video dataset with annotated full-body jointpositions and human silhouettes derived from joints. It contains 21 actionclasses, 36-55 clips per action class 15-40 frames each, 31,838 video framesin total. Together with the dataset Jhuang et al.announce an annotationtool1 that helped them to build J-HMDB. Its current web demo allows oneto drag joints over the body and propagate annotations to the next frame.2.1.4 Abstractions over Computer VisionThe idea of developing an abstraction for Computer Vision tasks is not new,and there have been numerous attempts towards it. Matsuyama and Hwang1http://files.is.tue.mpg.de/hjhuang/pose_annotation/html/avalidator.htmlChapter 2. Related Work 10introduced SIGMA [35], an expert system performing detection based on alearned appearance model, which is selected based on geometric context-dependent reasoning. Kohl and Mundy developed the Image Understand-ing Environment, an abstraction providing high-level access to vision meth-ods, although requiring the understanding of all the algorithms underneathit [29]. Firschein and Strat introduced RADIUS [18], which helped the userchoose best image processing algorithms based on geometric models, definedby the user. Konstantinides and Rasure developed a visual programminglanguage in Khoros, which allowed its users to create vision applicationsby connecting components in a data flow [30]. 
However, it also required a thorough understanding of the vision algorithms, as the components it included were relatively low-level, featuring color conversions, spatial filtering and feature extraction.

More recently, declarative programming languages such as ShapeLogic and FVision [39] were introduced, which provided functionality as small low-level units, requiring expert knowledge about vision methods. Chiu and Raskar introduced Vision on Tap, a web-based tool featuring a high-level abstraction targeted at web developers [8], although its usage is limited due to its web interface.

Several openly available libraries, such as OpenCV [3], FastCV, OpenTL [38] and the Vision Toolbox, provide common Computer Vision functionality. These frameworks provide direct access to specific vision components and algorithms, but the context of usage and the tuning of the parameters are essential, which requires expert Computer Vision knowledge.

Most recently, Miller and Fels introduced OpenVL [36], an abstraction targeting a variety of Computer Vision problems from the task perspective. Currently, OpenVL works with segmentation, correspondence and registration, while certain steps have been made towards tracking and detection. Human pose estimation can be considered as an articulated human detection problem and thus fits well into the OpenVL paradigm, which allows us to extend OpenVL with pose estimation.

2 http://www.shapelogic.org
3 http://developer.qualcomm.com/mobile-development/mobile-technologies/computer-vision-fastcv
4 http://www.mathworks.com/products/computer-vision

2.2 Relevant Algorithms

In this section we go over some algorithms that are essential for understanding this thesis. We start with a brief description of the Flexible Mixture of Parts model (FMP) [55]. We further go over dynamic programming and the Viterbi algorithm and then proceed to the distance transform of sampled functions [15]. Also, we cover the basics of optical flow and explain the notion of median optical flow, which we use as a tracking algorithm.

2.2.1 Flexible Mixture of Parts

Our work largely builds on top of FMP, so we briefly describe it in this section. It is a human pose estimation method for single images, based on a mixture of non-oriented pictorial structures.

Model. The model is a tree graph (V, E) covering the human body, where each node i is located at pixel p_i = (x_i, y_i) and is assigned a filter type f_i. Every filter type f_i is associated with a particular HOG filter [9] representing a specific mode of the appearance of part i. Every body part has several appearance modes, which cover the most common cases of the part's appearance. The score of a configuration of body part positions p = \{p_i\}_{i=1}^K and part types f = \{f_i\}_{i=1}^K in an image I is defined as follows:

S(I, p, f) = \sum_{i \in V} b_i^{f_i} + \sum_{(i,j) \in E} b_{ij}^{f_i f_j} + \sum_{i \in V} \alpha_i^{f_i} \cdot \phi(I, p_i) + \sum_{(i,j) \in E} \beta_{ij}^{f_i f_j} \cdot \psi(p_i - p_j),    (2.1)

where \psi(dx, dy) = [dx^2 \; dy^2 \; dx \; dy]^T is a deformation spring model and \phi(I, p_i) is an image feature vector extracted at location p_i. The first two terms of (2.1) represent the appearance compatibility score, the third term defines the appearance score, while the last term is a quadratic-cost deformation score. Note that, in practice, in order to reduce computation during inference the assumption on \beta_{ij}^{f_i f_j} is relaxed, stating that the deformation spring models depend only on the filter type of the child i:

\beta_{ij}^{f_i f_j} = \beta_{ij}^{f_i}.    (2.2)
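To make the structure of (2.1) concrete, the sketch below scores a single candidate configuration on a toy tree model. It is a minimal illustration in Python rather than the authors' Matlab implementation of FMP; the data layout (per-part bias and weight tables, a callable returning image features) and all variable names are assumptions made only for this example.

```python
import numpy as np

def fmp_score(tree, p, f, b, b_pair, alpha, beta, phi):
    """Score one configuration as in eq. (2.1): pixel locations p, part types f,
    on a tree given as a list of (child, parent) edges."""
    score = 0.0
    # Per-part terms: appearance-mode bias plus appearance score.
    for i in range(len(p)):
        score += b[i][f[i]] + np.dot(alpha[i][f[i]], phi(i, p[i]))
    # Per-edge terms: type co-occurrence bias plus quadratic deformation cost.
    for (i, j) in tree:                       # i is the child, j is its parent
        dx, dy = p[i][0] - p[j][0], p[i][1] - p[j][1]
        psi = np.array([dx * dx, dy * dy, dx, dy])
        # beta indexed by the child type only, following the relaxation (2.2).
        score += b_pair[(i, j)][f[i], f[j]] + np.dot(beta[(i, j)][f[i]], psi)
    return score
```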
Inference. The inference procedure corresponds to maximizing (2.1) with respect to p and f. This can be done efficiently using dynamic programming (see Section 2.2.2) with message passing of the form

score_i(f_i, p_i) = b_i^{f_i} + \alpha_i^{f_i} \cdot \phi(I, p_i) + \sum_{j \in kids(i)} m_j(f_i, p_i),    (2.3)

m_j(f_i, p_i) = \max_{f_j} \Big[ b_{ji}^{f_j f_i} + \max_{p_j} \big( score_j(f_j, p_j) + \beta_{ji}^{f_j f_i} \cdot \psi(p_j - p_i) \big) \Big].    (2.4)

The computational cost of (2.4) for each body part is O(L^2 H^2), where L is the total number of body part pixel locations and H is the number of HOG filters per part. This can be reduced to O(L H^2) with the help of the distance transform [15], described in Section 2.2.3. Assumption (2.2) reduces the cost further to O(L H).

Learning. A supervised learning paradigm with negative \{I^{(n)}\}_{n \in N} and positive labeled \{(I^{(n)}, p^{(n)}, f^{(n)})\}_{n \in P} training examples is employed. The scoring function is linear in the model parameters \theta = (b, \alpha, \beta) and can be rewritten as S(I, z) = \theta \cdot \Phi(I, z), where z^{(n)} = (p^{(n)}, f^{(n)}). Therefore, the model is learned in the form

\arg\min_{\theta, \, \xi_n \ge 0} \; \tfrac{1}{2} \|\theta\|^2 + C \sum_n \xi_n,    (2.5)
s.t. \; \forall n \in P: \; \theta \cdot \Phi(I^{(n)}, z^{(n)}) \ge 1 - \xi_n,
\quad \forall n \in N, \forall z: \; \theta \cdot \Phi(I^{(n)}, z) \le -1 + \xi_n.

The latter is a quadratic programming problem, which can be optimized with an out-of-the-box solver such as the cutting plane solver in [17] or stochastic gradient descent in [14].

2.2.2 Dynamic Programming

We extensively use dynamic programming throughout this thesis. In this section we discuss the relevant dynamic programming ideas.

The Viterbi algorithm. Suppose we are given a sequence of variables X = (X_1, \ldots, X_n), each of which can take one of the k values \{s_j\}_{j=1}^k. Furthermore, there are scores S_i(X_i) associated with the choice of a particular assignment of values to variable X_i, as well as scores S_{i-1,i}(X_{i-1}, X_i) for particular co-assignments of the values in adjacent variables X_{i-1}, X_i. The Viterbi algorithm solves the following problem:

\arg\max_X S(X),    (2.6)

S(X) = \sum_{i=1}^{n} S_i(X_i) + \sum_{i=2}^{n} S_{i-1,i}(X_{i-1}, X_i).    (2.7)

Figure 2.1: The Viterbi algorithm. For every value of X_2 the best value of X_1 is computed, then the process continues up to X_n.

In order to do this, for every X_i starting with X_2 we can compute the best candidate value for X_{i-1} (see Figure 2.1):

score_1(X_1) = S_1(X_1),    (2.8)
score_i(X_i) = S_i(X_i) + \max_{X_{i-1}} m_{i-1}(X_{i-1}, X_i), \quad i > 1,    (2.9)
m_{i-1}(X_{i-1}, X_i) = score_{i-1}(X_{i-1}) + S_{i-1,i}(X_{i-1}, X_i),    (2.10)
ind_i(X_i) = \arg\max_{X_{i-1}} m_{i-1}(X_{i-1}, X_i), \quad i > 1.    (2.11)

Here score_i(X_i) stores the total accumulated score (2.7) up to X_i, and ind_i(X_i) is the index of the best assignment to X_{i-1} for every assignment of X_i. This process is generally referred to as message passing from X_i to X_{i+1}.

After finishing the message passing procedure, score_n(X_n) contains the final configuration scores. The best combination of assignments (2.6) can then be obtained by taking the maximum value of score_n(X_n) and applying backtracking, a process of consecutively recovering the best assignment of each X_i:

X = \prod_{i=1}^{n} (ind_{i+1} \circ \cdots \circ ind_n \circ id)\big(\arg\max_{X_n} score_n(X_n)\big),    (2.12)

where \prod denotes the Cartesian product, \circ denotes function composition, and id is the identity function.

Tree structures. The Viterbi algorithm can also be applied to other cases where the graph connecting the variables X_i forms a tree. The whole inference procedure stays the same, with the only difference that the message passing (2.9) has to take into account all children kids(i) of node i:

score_i(X_i) = S_i(X_i) + \sum_{j \in kids(i)} \max_{X_j} m_j(X_j, X_i).    (2.13)
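As a concrete illustration of (2.6)-(2.12), the following sketch implements the chain version of the Viterbi algorithm with message passing and backtracking. It is a small Python/NumPy sketch written for this chapter (the thesis code itself is in Matlab); the array-based encoding of S_i and S_{i-1,i} is an assumption made for the example.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Maximize sum_i S_i(X_i) + sum_i S_{i-1,i}(X_{i-1}, X_i) over a chain.

    unary:    list of n vectors, unary[i][a] = S_i(X_i = s_a)
    pairwise: list of n-1 matrices, pairwise[i][a, b] = S_{i,i+1}(X_i = s_a, X_{i+1} = s_b)
    Returns the best assignment (as value indices) and its score.
    """
    n = len(unary)
    score = [np.asarray(unary[0], dtype=float)]   # accumulated scores, eqs. (2.8)-(2.9)
    ind = [None]                                   # backpointers, eq. (2.11)
    for i in range(1, n):
        # message m_{i-1}(X_{i-1}, X_i), eq. (2.10)
        m = score[i - 1][:, None] + pairwise[i - 1]
        ind.append(np.argmax(m, axis=0))
        score.append(np.asarray(unary[i], dtype=float) + np.max(m, axis=0))
    # backtracking, eq. (2.12)
    best = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        best.append(int(ind[i][best[-1]]))
    best.reverse()
    return best, float(np.max(score[-1]))

# Tiny usage example with three variables taking two values each.
if __name__ == "__main__":
    unary = [np.array([0.0, 1.0]), np.array([0.5, 0.0]), np.array([0.0, 2.0])]
    pairwise = [np.array([[1.0, 0.0], [0.0, 1.0]])] * 2
    print(viterbi(unary, pairwise))
```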
2.2.3 Distance Transform of Sampled Functions

In this section we describe the distance transform of sampled functions [15], as its understanding is essential in Chapter 5.

One dimension. Let G = \{g_1, \ldots, g_n\} be a one-dimensional grid. The goal is to compute D_f(p) and E_f(p) for every p in a grid H = \{h_1, \ldots, h_m\}:

D_f(p) = \min_{q \in G} P_2(p, q),    (2.14)
E_f(p) = \arg\min_{q \in G} P_2(p, q),    (2.15)
P_2(p, q) = a(p - q)^2 + b(p - q) + f(q).    (2.16)

This can be done via full search in O(nm) time. The distance transform, however, is able to compute this in O(n + m) in the following way. The first step is the computation of the lower envelope of the parabolas a(p - q)^2 + b(p - q) + f(q). This can be done in linear time using simple algebra. During the second step the values of D_f(p) are filled in for all p in the grid H by selecting the appropriate parabolas of the lower envelope (see Figure 2.2).

Figure 2.2: Lower envelope of parabolas for the distance transform. In this example b = 0 and the parabolas are centered at points in grid G = {2, 4, 6, 8}. The blue contour corresponds to the lower envelope of the parabolas, while the dotted parts of the parabolas represent their parts that do not constitute a part of it. Dotted red vertical lines correspond to grid H = {3, 5, 7}, in which the values of D_f(p) will be filled. In this example E_f(3) = 4, E_f(5) = 6, E_f(7) = 6.

Two dimensions. Let G = \{g_{11}, \ldots, g_{1n}\} \times \{g_{11}, \ldots, g_{k1}\} be a two-dimensional grid with an arbitrary function f : G \to \mathbb{R} defined on it. We are aiming to compute D_f(x, y) for every (x, y) in a grid H = \{h_{11}, \ldots, h_{1m}\} \times \{h_{11}, \ldots, h_{l1}\}:

D_f(x, y) = \min_{(x', y') \in G} P_2((x, y), (x', y')),    (2.17)
P_2((x, y), (x', y')) = a_x(x - x')^2 + b_x(x - x') + a_y(y - y')^2 + b_y(y - y') + f(x', y').    (2.18)

Since the first two terms in (2.18) do not depend on y', the equation above can be rewritten as

D_f(x, y) = \min_{x'} \big[ a_x(x - x')^2 + b_x(x - x') + D_{f|_{x'}}(y) \big].    (2.19)

Thus, we can use the one-dimensional distance transform along the y axis and then use it again along the x axis on the result.
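To make the two-step procedure concrete, here is a sketch of the one-dimensional case in Python/NumPy (the thesis itself builds on the authors' Matlab code). For simplicity it assumes a = 1, b = 0 and H = G, i.e. the classical squared-distance setting of [15]; the generalization to arbitrary a, b and a separate output grid follows the same lower-envelope construction.

```python
import numpy as np

def dt1d(f):
    """1D distance transform of a sampled function f on the grid 0..n-1.
    Returns D[p] = min_q (p - q)^2 + f(q) and the minimizer E[p]."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    v = np.zeros(n, dtype=int)      # grid locations of parabolas in the lower envelope
    z = np.zeros(n + 1)             # boundaries between consecutive envelope parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    # Step 1: build the lower envelope of the parabolas (p - q)^2 + f(q).
    for q in range(1, n):
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    # Step 2: fill in D and E by reading off the envelope at every output point.
    D = np.zeros(n)
    E = np.zeros(n, dtype=int)
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        D[p] = (p - v[k]) ** 2 + f[v[k]]
        E[p] = v[k]
    return D, E

if __name__ == "__main__":
    print(dt1d(np.array([4.0, 0.0, 3.0, 2.0, 5.0])))
```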
2.2.4 Optical Flow

In this thesis we frequently resort to optical flow. By optical flow we mean a class of methods attempting to calculate the motion between two consecutive video frames taken at times t and t + \Delta t for small values of \Delta t. The brightness constancy equation is often utilized:

I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t),    (2.20)

where I(x, y, t) is the brightness of the pixel (x, y) at time t. The equation above states that the brightness of pixels potentially belonging to the same object in the video sequence should stay the same. Optical flow algorithms usually encompass additional assumptions and constraints to improve the optical flow accuracy. They can be roughly divided into local and global approaches. Local methods such as Lucas-Kanade [33] are often more robust to noise, while global methods such as Horn-Schunck [22] produce a dense flow field.

In this thesis we use an optical flow algorithm that combines local and global approaches, attempting to yield a dense flow field that is robust to noise. It uses the brightness constancy assumption, the gradient constancy assumption and a discontinuity-preserving spatio-temporal smoothness constraint. It is based on [4] and [5], and we use its Matlab implementation by Liu [32]. However, the choice of a particular optical flow algorithm is not important for us, and it can be replaced with any other method.

We also use the notion of median optical flow throughout this thesis. By the median optical flow in an image area we mean the median values of the flow coordinates \Delta x and \Delta y in that image region. It can be used as a simple tracking algorithm.

Chapter 3

Data Preparation

There are multiple applications of human pose estimation in video sequences, such as human-computer interaction, entertainment, surveillance and sports video analysis. Surprisingly, there are very few methods that focus on 2D pose estimation in video in comparison to the large number of single-image algorithms being published every year (see Section 2.1.2). We strongly believe that one of the main reasons for that is the lack of video datasets with annotated poses (see Section 2.1.3), on which these algorithms could be trained and/or evaluated. The availability of such datasets would immediately benefit the research community working on pose estimation and may potentially attract more research into this field.

While there are many video datasets with annotated people locations (for tracking) and actions (for action recognition), there are few fully annotated realistic pose video datasets, and until very recently none of them included full-body annotations (see Section 2.1.3). We think this is mainly due to the fact that it is very hard and time consuming to annotate poses for all frames of a video sequence. For instance, in contrast to annotating a bounding box or an action label, for a full-body pose annotation of a 3-second video sequence one would have to provide the locations of 14 body joints for every one of the 3 × 30 frames, resulting in more than 1000 mouse clicks. Furthermore, annotation of a pose requires much more precision than annotation of a box, and it is difficult to make the annotations consistent throughout the video sequence. Evidently, without any annotation tool the whole annotation process becomes impractical.

In this chapter we make two key contributions, motivated by the arguments above:

- Introduce the Video Pose Annotation tool, allowing one to make fast and easy pose annotations in video sequences, featuring a user-friendly graphical interface and a flexible design.
- Introduce the UCF Sports Pose dataset, consisting of full-body annotations for the UCF Sports Action dataset [42]. The annotations were produced with the above annotation tool.

To the best of our knowledge, until most recently the Video Pose Annotation tool was the only application aiding the task of manual annotation of poses in images or video sequences. Likewise, UCF Sports Pose was the only video dataset providing full-body pose annotations in realistic environments. Recently Jhuang et al. released their J-HMDB dataset [25] together with an annotation tool. We compare it with our tool in Section 3.3.

This chapter is organized as follows. We introduce the Video Pose Annotation tool in Section 3.1 and then describe the annotated dataset that we obtain with its help in Section 3.2.

3.1 Video Pose Annotation Tool

The Video Pose Annotation tool enables fast and accurate annotation of human poses in video sequences, featuring a finely tuned Graphical User Interface (GUI). The annotation process is aided by automatic pose initialization, tracking and per-joint interpolation.

The annotations that we use are defined by the 2D locations of all body joints that represent the structure of the body. The model of the human pose, as well as the way it is aligned with the image data, depends on the required level of detail. We follow Ramanan [40] and use the 14-joint skeleton body model defining the positions of arms, legs, hips, shoulders, neck and head. See Figure 3.1 for some examples of annotations that we expect for our model.
Note that since we are interested in poses in videos, the idealannotation would contain smooth movement of every joint throughout thesequence, preserving the length of body parts and their placements relativeto the body.3.1.1 Application FeaturesWhen working on a video sequence the user has the following options:? Automatically estimate the pose in the current frame using a posedetector? Automatically translate the current pose to any other frame usingtracking? Manually adjust the joints of the current poseAutomatic pose initialization is helpful because it often puts a numberof joints at their desired locations. However, the current version of theChapter 3. Data Preparation 19Figure 3.1: Several examples of 14-joint pose annotations. Pink and cyanlines cover right and left hands of a person, red and blue lines cover right andleft leg correspondingly. Note that in contrast to Ramanan [40] we mark inred the actual right leg of the person, as opposed to the leg that is the mostleft in the image, assuming that the person always looks in the direction ofthe camera. This also makes difference when the limbs alternate, e.g. whenthe person is running sideways (best viewed in color).pose detector often misses body parts or places them inaccurately. Usuallymanual adjustment is used afterwards to correct the pose. Furthermore,our current pose detector returns independent poses for every frame, andthey often have slight differences in placement of head/shoulders/hips. Asa result, detections in consecutive frames may have quite different poses,resulting in a very jittery annotation of poses overall. In order to increasethe accuracy of annotations the tool utilizes tracking. Our experience revealsthat tracking of a correct pose to the next frame produces substantially moreaccurate result than independent estimation of the pose.Despite the good annotation results provided with workflow based onpose detections and tracking, it does have its disadvantages. The mainChapter 3. Data Preparation 20Figure 3.2: Annotation using tracking in a cluttered scene. The correctpose in a frame before occlusion was tracked forward, and the pose from aframe after occlusion was tracked back. It is very hard to guess the correctpositions of occluded joints and maintain the right motion pattern.downside is that one has to repeat the whole process for every frame, po-tentially adjusting every joint, which takes a lot of time. Furthermore,whenever occlusions or self-occlusions take place, it becomes very hard tocorrectly identify the positions of the missing parts in all frames, maintain-ing the right motion pattern (see Figure 3.2). In addition, we found thathard-to-notice subtle differences in consecutive frames may result in largedisplacements overall. For instance, the width of the hips in a video se-quence may be changing all the time. It is hard to control such long-termdeviations because one would have to go through all frames and separatelyadjust the incorrectly positioned joints.In order to overcome the above difficulties interpolation between anno-tated poses is essential. We use a notion of a keypoint, which extends thecommon understanding of a keyframe. Every joint in every frame is eitherChapter 3. Data Preparation 21Figure 3.3: Annotation using interpolation in a cluttered scene. The inter-polation on per-joint bases successfully resolves the occlusion problem. 
Itallows one to specify only certain positions of a joint when it is visible, whileall other positions get their values automatically.marked as a keypoint or regular joint. The position of every regular jointis linearly interpolated in time between the closest left and right keypoints,and is adjusted accordingly when the position of any of the two keypointschanges. Every regular joint becomes a keypoint whenever it is manuallyadjusted or modified with detection or tracking. The user also has an optionto remove any keypoint, making it a regular joint.The latter interpolation procedure helps to solve the problems statedabove. Instead of automatically estimating or tracking every pose to the nextframe, one can do this every 5 or 10 frames, and interpolation would take careof the annotations in between. This not only saves time adjusting most ofthe joints in every frame, but also helps to recover unstable hips/shouldersand most importantly deal with occlusions (see Figure 3.3). Because ofthe complexity of human motion we find it particularly important that theChapter 3. Data Preparation 22interpolation is done on per-joint bases. If one would resort to keyframesinstead of keypoints, one would soon find out that most of the frames haveat least one manually modified joint, which would turn every frame to akeyframe, and no interpolation would be performed.We experiment with linear interpolation in two ways. In the first one,the joint position is linearly interpolated in image coordinates between itspositions (x1, y1) and (x2, y2). While this interpolation keeps hips, shouldersand head more stable, we found out that it does not work very well on jointsthat cover hands, elbows, feet and knees mostly because human motion oftenproduces swings that follow round trajectories. For example, feet in Figure3.3 rotate relatively to knees, while knees rotate relatively to hips. Therefore,we use interpolation in polar coordinates for limb joints from ?1, ?1 to ?2, ?2,where ?1, ?2 are the distances from joint to its parent in the first and lastframe of interpolation and ?1, ?2 are the angles relative to parent. Thishelps us to reduce the number of manual adjustments of joints. Note thatthe interpolation procedure described here could be replaced with any otheralgorithm, e.g. incorporate human motion models [53].3.1.2 Graphical User InterfaceThe Video Annotation Tool was developed in a continuous usage-feedback-improvement loop. As a result we were able to develop a powerful yet simpleGUI that suits the user?s needs the best.The main application window is shown in Figure 3.4. It consists ofthe image area, navigation bar, input/output panels and annotation andmiscellaneous panel. Almost all of the functionality of the tool is hotkeyed,so that frequently repeated actions can be performed fast. All the changesmade to the interface, such as last loaded video sequence or states of thecheck boxes are saved in the configuration file and loaded during subsequentruns.Input/Output. The input panel determines the input video sequence,which could be loaded either by selecting a video file or an image sequencein a Load dialog window or by entering the path in the edit box. The outputpanel specifies the output .mat file, containing resulting annotations. When-ever a video sequence is loaded, the tool loads the corresponding annotationsif they are found.Annotation. The main functionality of the tool is gathered in the an-notation panel. 
The Detect button performs automatic estimation of thepose in the current frame, while Detect Fast does local pose search basedon the position and speed of the person in previous frames for the purposeChapter 3. Data Preparation 23Figure 3.4: A screenshot of the Video Pose Annotation tool GUI. Anno-tations are displayed on top of the images as a colored stickman figure.Brighter colors for body joints represent keypoints, while darker correspondto regular joints. Hovering mouse over a joint pops up a transparent circle,identifying which joint is going to be affected. The left mouse button al-lows one to drag joints, while the right mouse button is used to remove thekeypoint from the highlighted joint.of reducing the computation time. The arrow buttons <= and => performtracking of the current pose back and forward correspondingly, while num-bers in boxes nearby specify how many frames the pose should be tracked.The copy radio button enables the direct copy functionality, which maycome in handy if tracking fails.View and Navigation. The navigation bar allows one to browse thevideo frames back and forth, jump to a frame by number, play the videosequence with adjustable speed, etc. The corresponding annotations are dis-played on top of the video frames in the image area. The user may manuallyadjust the annotations by dragging the joints around the image using the leftChapter 3. Data Preparation 24(a) (b)Figure 3.5: Video Pose Annotation tool GUI functionality example. (a) Itis hard to see where the limbs of the person are, while dragging the joints.(b) When annotations are hidden, only joints are shown when dragging, andlines do not occlude the limbs.mouse button. Holding Shift results in groups of the joints being movedtogether, which is helpful when dragging the whole arm/leg/body together.Clicking the right mouse button releases the keypoint associated with theselected joint. Keypoints are highlighted with brighter color compared toregular joints, which can be turned off by unchecking the Show keypointscheck box. Furthermore, one may want to uncheck Show annotations inorder to hide the annotated pose. We found this useful when annotatingvideos of low quality/high motion noise, when it gets particularly hard tosee what the right pose of the human is, with the annotations displayed ontop (see Figure 3.5). Also, unchecking Show frames hides the images incase one wants to see how realistic the resulting motion of a stickman is.Finally, it is possible to take snapshots of the current image area with thehelp of Screenshot button.3.1.3 DesignOne of the objectives of the Video Pose Annotation tool is to be flexibleenough to be applied in various scenarios. The dataset described in SectionChapter 3. Data Preparation 253.2 includes full-body pose annotations, which might be useful for sportsanalysis applications. However, in other domains different pose representa-tions might be required, such as upper body or hands only. In order for thistool to encompass potential changes in the body pose, we designed it withthe principles of Object-Oriented Programming and flexibility in mind.Figure 3.6 demonstrates the class hierarchy of the part of the application,responsible for body pose representation. AbstractSkeleton is the baseclass for all body part representations. Skeleton2D is the base abstractclass for all ?stickman? representations, which consist of 2D joint locations,sizes and connections between them. 
The distinction between the two ismade in order to embrace potential classes that have information beyondthe standard 2D information, such as 3D orientation or depth.If one wants to annotate 2D poses with a different skeleton structure,they have to inherit the Skeleton2D class and provide implementation formethods representing the body graph: skeletonSize, getParentIndexes,getPartConnections, getConnectionColors, getJointColors andgetDragAdjacentJoints. Also, one may modify the SkeletonFactoryclass, which creates the appropriate instance of the AbstractSkeleton classbased on the number of joints.We provide implementations for four body pose classes. FullBody rep-resents a 14-joint body skeleton, MidpointFullBody expands the latter posewith joints in the middle of each limb and two additional joints on each sideof the torso, resulting in a 26-joint body structure. Likewise, UpperBody isa 10-joint skeleton covering the upper body and MidpointUpperBody is its18-joint expanded version. An example of a MidpointFullBody skeleton canbe seen in Figure 1.1.Every instance of a subclass of AbstractSkeleton also has a createFrommethod, which serves the role of a constructor accepting instances of otherclasses inherited from AbstractSkeleton. This enables conversions betweendifferent pose classes, which may come handy since many pose representa-tions share the same body parts. For instance, it is possible to convertMidpointFullBody to FullBody and back, FullBody can be converted toUpperBody etc.We make two assumptions regarding the classes inherited from the baseclass AbstractSkeleton. First, the graph representing the body structuremust be connected. Second, the number of joints in the graph should bedifferent for every subclass of AbstractSkeleton. However, the applicationprovides an easy way to overcome the assumptions above. If one wants todefine a disjoint skeleton such as two arms, one can connect two disjointsubgraphs with an edge E in order to obtain a tree model and then specifyChapter 3. Data Preparation 26Figure 3.6: The UML class diagram for skeletal body models.FullBody, MidpointFullBody, UpperBody, MidpointUpperBody extend theSkeleton2D class, which extends the most abstract AbstractSkeletonclass. SkeletonFactory is used to create appropriate AbstractSkeletonobjects based on the joint information.the color for the edge E to be transparent. Furthermore, if it happens thattwo different classes have the same number of joints, one may want to modifythe SkeletonFactory class by introducing one more optional parameter,further distinguishing the classes between each other.The current tool was designed such that every video sequence acceptsonly one annotation, thus not foreseeing simultaneous annotations of severalpeople in one frame. A simple workaround for multiple-person sequences isto make a separate annotation file for every person in the video. However,the simultaneous annotations of multiple persons is made possible by thedesign of the application. This could be enabled by defining a single graphcovering several skeleton models in the way described above and providinga multiple-person detection algorithm for initialization.Chapter 3. Data Preparation 273.1.4 ImplementationWe implemented the tool with Matlab and tested it on version R2011b.We use a Matlab implementation of Flexible Mixture of Parts [55] as astate-of-the-art pose detector (see Section 2.2.1) and median optical flow(see Section 2.2.4) as a tracking algorithm, based on Liu?s Matlab opticalflow implementation [32]. 
However, the detection and tracking algorithmscan be easily changed based on the user?s need. Such replacements may benecessary when changing the skeletal representation of the body.3.2 DatasetThe research of this thesis was done mostly with applications to sports videoanalysis in mind. Therefore, we are most interested in datasets containingfull body annotations in unconstrained real-world videos. Furthermore, ac-tion labeling might be potentially useful for the applications of pose estima-tion to action recognition.The UCF Sports Action dataset [42] fits the description above and thussuits our needs. It contains more than 150 video sequences falling in oneof the 9 action classes: diving, golf-swinging, kicking, lifting, riding-horse,running, skating, swinging and walking. The actions were collected fromvarious sport recordings, typically featured on broadcast TV channels. Mostsequences contain one or more people performing similar action.In this work we release annotations for human poses in selected videosequences of the UCF Sports Action dataset. We limited ourselves to thefollowing 7 action classes due to time constraints: golf-swinging, kicking, lift-ing, riding-horse, running, skating and walking. The people in these videosare roughly upright, which is in line with a some existing image datasetswith annotated poses [16] [40] [12]. If a video sequence contains more thanone person, we create annotation files for each one of them if they performthe action of their action class and are sufficiently unoccluded. We used theVideo Pose Annotation tool (Section 3.1) to create these annotations. Seesome examples of the annotations in Figure 3.9. By releasing the datasettogether with the annotation tool we hope to encourage more research intohuman pose estimation in video sequences and to lessen the gap between theabundance of its real-world applications and the lack of targeted algorithms.Chapter 3. Data Preparation 283.2.1 Evaluation MetricIn this section we consider the question of the definition of the correct pose.We show that authors of different datasets and algorithms understand itdifferently and propose our own definition that in our opinion better suits thecurrent state-of-the-art algorithms. We follow the tradition for the datasetsin defining evaluation metrics for the consistency of the results, and proposea PCP2D evaluation metric that we suggest be used when reporting resultson our dataset.The most common evaluation metric for human pose estimation is thepercentage of correct parts (PCP), reflecting the number of body parts esti-mated withing a certain distance threshold to their ground truth positions.Body parts are usually defined by the edges in the body graph. The mostwidely used version of PCP labels a body part as correct if the average dis-tance of its joints to their ground truth positions is less than a threshold,which is defined by a fraction of the size of the ground truth body part [16].The stricter version of PCP used by Ramanan [40] requires that both jointsare within a threshold distance to their ground truth locations. The twoversions of the PCP measure are the consequence of an ambiguous verbaldefinition of PCP by Ferrari [16]. In order to avoid such confusions in futurewe think it is important to address the question of how to define what isground truth, which has not been addressed in the literature yet.Let us consider the problem of pose estimation as the task of fitting acolor skeleton in the image. 
The skeleton?s right leg is red, left leg is blue,right arm is pink, left arm is cyan, torso is yellow and head is green. Johnsonand Everingham [26] provide annotations for the Leeds dataset, such thatthe skeleton position always corresponds to the actual position of a person inthe image (Figure 3.7 (a)). However, Ramanan [40] always fits the skeletonin the image so that the skeleton?s red leg and pink arm are roughly on theleft from its blue leg and cyan arm (Figure 3.7 (b)). Such labeling may bebeneficial for various pose estimation algorithms such as FMP [55], becauseit allows them to build different appearance models for right and left partsof the body, which improves the pose estimation performance.From the examples above one may see that the authors of different papersunderstand the notion of a correct pose differently. Johnson and Evering-ham require that a pose estimation algorithm evaluated on their datasetis capable of telling which side of the body is left and which is right. Ra-manan, to the contrary, requires an algorithm to label everything on the leftas red/pink, and on the right as blue/cyan. Therefore, an algorithm giv-ing perfect results on the Leeds dataset would often confuse the limbs andChapter 3. Data Preparation 29(a) (b)Figure 3.7: Annotation examples from two different datasets. (a) Leedsdataset [26]. (b) People datset [26] (best viewed in color).give lower performance on the People dataset and, conversely, the perfectalgorithm for the People dataset will often fail on the Leeds dataset. Fur-thermore, such discrepancies become even more important when one dealswith video sequences, where the relative horizontal placements of body partschange in one video sequence (Figure 3.8).In order to address the above issues, we suggest two evaluation schemes.The first scheme requires an algorithm to be able to distinguish the left andright sides of a body, and uses standard PCP (either strict or loose) forevaluation. This scheme could be applied to the datasets that themselvesdistinguish the left and right sides such as Leeds dataset. The second schemedoes not require an algorithm to have any knowledge about which side iswhich and allows it to freely confuse the left/right body parts. Although thefirst scheme describes the image best, we argue that the state of the art for2D pose estimation is not able to differentiate the actual right and left bodyparts, and is not intended for this purpose [55]. Therefore, we introduce amodified version of PCP that we entitle PCP2D in order to elaborate thesecond scheme.PCP2D is a metric for evaluating the percentage of correct body parts,allowing an algorithm to switch the left and right body parts. It operateson pairs of larger body parts that are defined by several edges in a bodygraph, which represent the arms, legs and two sides of the torso. There aretwo possible assignments of every pair of the left (L) and right (R) groundtruth body parts to the left (L) and right (R) instances in the detection:L? L,R? R and L? R,R? L. We compute the standard PCP (eitherChapter 3. Data Preparation 30Figure 3.8: Example of video pose annotation with alternating limbs.Left/right positioning of left/right leg changes with time (best viewed incolor).strict or loose) for every assignment and take the highest one. 
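The left/right-swap step of PCP2D can be sketched as follows. This is a minimal Python illustration assuming poses are stored as dictionaries from joint names (e.g. 'l_elbow') to (x, y) coordinates and a larger body part is a list of edges; the evaluation code released with the dataset may organize joints and thresholds differently.

import numpy as np

def edge_correct(edge, det, gt, thresh=0.5):
    # Loose PCP criterion: the mean error of the two joints must be within
    # thresh times the length of the ground-truth part.
    a, b = edge
    length = np.linalg.norm(np.subtract(gt[a], gt[b]))
    err = 0.5 * (np.linalg.norm(np.subtract(det[a], gt[a])) +
                 np.linalg.norm(np.subtract(det[b], gt[b])))
    return err <= thresh * length

def mirror(joint):
    # Swap the left/right prefix of a joint name ('l_elbow' <-> 'r_elbow').
    return 'r_' + joint[2:] if joint.startswith('l_') else 'l_' + joint[2:] if joint.startswith('r_') else joint

def pcp2d_pair(pair_edges, det, gt, thresh=0.5):
    # Evaluate a left/right pair of parts under both assignments
    # (L -> L, R -> R and L -> R, R -> L) and keep the better one.
    det_swapped = {mirror(j): p for j, p in det.items()}
    straight = sum(edge_correct(e, det, gt, thresh) for e in pair_edges)
    swapped = sum(edge_correct(e, det_swapped, gt, thresh) for e in pair_edges)
    return max(straight, swapped)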
The finalPCP2D measure incorporates the highest PCP from every pair and alsoincludes non-paired body parts such as the head.We suggest using the PCP2D evaluation metric on the UCF Sports Posedataset for algorithms that do not distinguish between the left and right sidesof the body, as we think that using the standard PCP measure in such casesdoes not reflect the actual performance of an algorithm. However, standardPCP is an option for methods that are able to predict the left/right sidelabeling.3.3 DiscussionRecently Jhuang et al.released an annotation tool, aiding manual annotationof human poses in video sequences [25]. In this section we briefly contrastit to the tool introduced in this section. The advantages of JHuang?s toolin comparison to our application are as follows:? Annotations come together with a direction-specific human silhouette? The tool has web interface, which does not require any proprietarysoftware such as MatlabChapter 3. Data Preparation 31It is worth noting that the tool allows one to chose the silhouette basedon direction, however its shape is defined by a set of joints and cannot bemodified. Therefore, one may reconstruct similar silhouettes from anno-tations made with our tool. The drawbacks of the Jhuang?s applicationaccording to its online demo are the following:? Although the tool propagates the body poses to the next frame usingoptical flow, it does not have any interpolation functionality, and thepose must be adjusted for every frame? Without interpolation there is no easy way one could deal with occlu-sions and self-occlusions when using the tool? There is no easy way one could make temporally smooth annotations.The tool also does not support the video playback functionality, whichwould allow one to check the temporal consistency of annotations? In contrast to our highly configurable application Jhuang?s tool sup-ports only one type of annotation which is the pre-defined full-bodymodel? At the time of writing the tool only features a web demo for a specificvideo sequences, and cannot be used for the user?s data, doesn?t sup-port saving and loading annotations. Our tool in contrast can be usedto browse annotations in a convenient way? The source code of the tool is hidden behind the web interface, andthus cannot be modified. Our tool is available together with the sourcecode and features flexible design for easy changes in algorithms under-neath itAt the time of writing we have no access to J-HMDB annotations [25]and cannot directly compare the quality of the data. However, given theabove considerations we believe that our annotations are more accurate andsmooth, even when the exact location of body joints is unknown due to theocclusions.Chapter 3. Data Preparation 32Figure 3.9: Examples of the UCF Sports Pose dataset.33Chapter 4Pose Estimation in Video: aShortest Path ApproachIn this chapter we focus on human pose estimation in video sequences. Inparticular, we are aiming to improve the state-of-the-art human pose estima-tion method in single images entitled Flexible Mixture of Parts (FMP) [55].The key observations that motivate the work of this chapter are as follows:? The FMP pose estimations are very noisy, often giving substantiallydifferent results in consecutive frames of a video sequence, even whenthe subject is almost static? Often the pose estimation with the highest score obtained with FMP isnot the best one, and there are better estimates among the top-scoringcandidates? 
Frequently the best estimation of pose is not present in the set ofresults, returned by FMP, while a similar pose is present among thebest results in the adjacent frames. This happens mostly due to thedouble-counting problem, when two body parts cover the same imageregionThe above observations imply that it is possible to combine FMP withmotion information to obtain better estimations of pose. The main con-tribution of this chapter is a method for human pose estimation in videosequences, improving the state-of-the art for pose estimation in single im-ages.This chapter is organized as follows. In Section 4.1 we describe themodel, in Section 4.2 we explain the inference method. We discuss resultsin Section 4.3.Chapter 4. Pose Estimation in Video: a Shortest Path Approach 344.1 ModelThe main idea of this method is to collect the best n outputs of FMP forevery frame and expand it with additional examples that were missed byFMP using tracking, then find the best combination of poses throughoutthe whole video sequence with respect to a certain measure. We assume anoffline setup, when the whole video sequence is given at once. The measurethat we use when computing the best set of poses is the combination oflocal and pairwise scores. The local score of a pose in an image determineshow well the pose matches the image, while the pairwise score between twoconsecutive pair of poses measure how well the poses are aligned with eachother. We refer to this method as the shortest path approach.Suppose we are given a sequence of video frames I = {It}Tt=1. Let pt ={pti}Ki=1 denote body pose in frame It, where pti = (xti, yti) is the pixel locationof body part i, and p = (p1, . . . pT ) denote the total spatial configuration ofbody parts in T frames. Our goal is to find the best combination of posesthroughout the T video frames:p = arg maxp?PS(I, p), (4.1)S(I, p) =T?t=1Sloc(It, pt) +T?t=2Spair(It?1, It, pt?1, pt). (4.2)Features. Although it is possible to use additional information such ascolor for the computation of local scores, we use only HOG features [9] inorder to make the comparison to FMP fair. For the computation of pairwisescores we resort to tracking methods capturing motion information. Namely,we use optical flow (see Section 2.2.4).Tracking. In the current model we use tracking extensively. We chosethe median optical flow because it shares information between its separateinstances when tracking different image areas in the same video sequence,which makes it relatively fast (see Section 2.2.4). However, usage of differenttracking algorithms is possible. We write (x?, y?) = Ft1t2(x, y) for the resultof a tracking algorithm from frame t1 to frame t2 applied to the image regioncentered at (x, y) with the size of a body part. Furthermore, F?t1t2(pt1) ={Ft1t2(pt1i )}Ki=1 is the pose obtained by tracking pose pt1 to frame t2.Poses search set. P determines the set of poses considered in (4.1). LetFMP(It) denote a set of n best-scoring poses, returned by flexible mixtureof parts for frame t. We first populate P with FMP(It) and then expand itChapter 4. Pose Estimation in Video: a Shortest Path Approach 35with tracking FMP(It) with median optical flow ? frames back and forward.We find expansion necessary as it usually fills in the correct poses, missingfrom FMP(It):P =T?t=1(FMP(It) ??2??=?1{F??t(p? )|p? ? FMP(I? )}), (4.3)where ?1 = max(1, t ? ?), ?2 = min(T, t + ?) and?denotes Cartesianproduct.Local scores. Sloc(It, pt) define the local score of pose pt in imageIt. 
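Before turning to the local and pairwise scores in detail, the construction of the candidate set in equation (4.3) can be sketched as follows. The functions fmp_top_n and track below are placeholders for the FMP detector and the median optical flow tracker and do not correspond to named functions in any released code; frames are indexed from zero for convenience.

def build_candidates(frames, n=10, delta=2):
    # Candidate poses per frame: the n best FMP detections of the frame itself,
    # expanded with the top detections of up to delta neighbouring frames on each
    # side, propagated into the current frame by median optical flow.
    T = len(frames)
    fmp = [fmp_top_n(frames[t], n) for t in range(T)]   # FMP(I_t), hypothetical helper
    candidates = [list(fmp[t]) for t in range(T)]
    for t in range(T):
        for tau in range(max(0, t - delta), min(T - 1, t + delta) + 1):
            if tau == t:
                continue
            for pose in fmp[tau]:
                candidates[t].append(track(pose, src=tau, dst=t))  # hypothetical tracker call
    return candidates

With the candidate sets in place, each pose needs a local score.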
Since we are determined to use HOG features only, the best score ofthe pose would be the actual score returned by FMP, as it contains boththe appearance and deformation parts. However, there is no score assignedto many poses in P , as they were obtained by tracking, as opposed to bedirectly returned by FMP. We reconstruct the scoring function of FMP bycomputing the best combination of filters given current position of bodyparts:SFMP(It, pt) = maxf tS(It, pt, f t), (4.4)where S(I, p, f) is defined as in (2.2.1). This can be done using dynamicprogramming with the following message passing:scoreti(fti , pt) = bf tii + ?f tii ? ?(It, pti) +?j?kids(i)mtj(fti , pt), (4.5)mtj(fti , pt) = maxf tj(bf tjftiji + scoretj(ftj , ptj) + ?f tjftiji ? ?(ptj ? pti)). (4.6)FMP is far from being perfect, and it frequently happens that low-scoringbody configurations estimate the pose better than high-scoring ones. Withthe above scoring function correct detections that were obtained by trackingand not selected by FMP will have low scores and will often be rejected bythe dynamic programming algorithm selecting the best-overall combinationof poses p. However, the tracking origins of such poses as direct outputs ofthe FMP will score higher. This motivates us to alter the local scores oftracked poses to capture both the score of the tracking origin and the actualChapter 4. Pose Estimation in Video: a Shortest Path Approach 36score by blending them together:Sloc(It, pt) =?????SFMP(It, pt) if pt ? FMP(It),| ?? |SFMP(It, pt)+(1? | ?? |)SFMP(It+? , F??1t+?,t(pt)) if pt ? F?t+?,t(FMP(It+? )).(4.7)Pairwise scores. Spair(It?1, It, pt?1, pt) represents score between posespt?1, pt in two adjacent frames. We use squared euclidean distance betweenthe pose obtained by tracking of pt?1 and pose pt:Spair(It?1, It, pt?1, pt) = Cd(F?t?1,t(pt?1), pt), (4.8)d(pt1 , pt2) =K?i=1(xt1i ? xt2i )2 + (yt1i ? yt2i )2, (4.9)where C is a normalizing constant utilized in order to make the local andpairwise scores comparable.4.2 InferenceInference corresponds to maximizing (4.2) with respect to the combinationof poses p. This can be done efficiently using dynamic programming withthe following message passing:scoret(pt) = Sloc(It, pt) + maxpt?1?Pt?1Spair(It?1, It, pt?1, pt), (4.10)where Pt = {pt|(p1, . . . , pt, . . . , pT ) ? T} is a set of poses in frame t. Afterpassing messages throughout the whole chain of poses scoreT (pT ) wouldcontain the total scores of pose configurations. The final set of poses can beobtained by taking the maximum-scoring pose from scoreT (pT ) and applyingbacktracking.The inference procedure can be summarized as follows:Chapter 4. Pose Estimation in Video: a Shortest Path Approach 37Input: Set of images I = {It}Tt=1, constants n, ?Output: Set of poses p = (p1, . . . , pT )for each frame It doP?t ? the set of n best poses returned by FMP;if t > 1 thenft?1,t(x, y)? optical flow for all (x, y);endendfor each frame It doPt ? P?t;?1 ? max(1, t? ?);?2 ? min(T, t+ ?);for ? = ?1, . . . , ?2, ? 6= t dofor each p? ? P?t doF?t(p?i )? median of optical flow around pti;pt ? {F?t(p?i )}Ki=1;S1 ? reconstructed score of FMP;S2 ? score of p? returned by FMP;?? |t? ? |/?;Sloc(It, pt)? ?S1 + (1? ?)S2;Pt ? Pt ? pt;endendendfor each frame It, t > 1 dofor each pt?1 ? Pt?1 dofor each pt ? Pt doSpair(It?1, It, pt?1, pt)? d(F?t?1,t(pt?1), pt)endendend(score(p), ind(p))? dynamic programming on Sloc, Spair;(p1, . . . , pT )? backtracking with ind(arg max(score));Algorithm 1: Pose estimation in video procedure.Chapter 4. 
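To complement Algorithm 1, a minimal sketch of its final dynamic-programming step (equation 4.10) is given below. It assumes that the local scores and the pairwise scores between candidates in consecutive frames have already been computed and stored in arrays; it is an illustration of the Viterbi-style recursion, not the released implementation.

import numpy as np

def best_pose_sequence(S_loc, S_pair):
    # S_loc[t][i]: local score of candidate i in frame t.
    # S_pair[t][i][j]: pairwise score between candidate j in frame t-1 and
    # candidate i in frame t, for t >= 1 (S_pair[0] is unused).
    # Returns the index of the selected candidate in every frame.
    T = len(S_loc)
    score = [np.asarray(S_loc[0], dtype=float)]
    backptr = []
    for t in range(1, T):
        trans = np.asarray(S_pair[t], dtype=float) + score[t - 1][None, :]
        backptr.append(trans.argmax(axis=1))                # best predecessor for every candidate
        score.append(np.asarray(S_loc[t], dtype=float) + trans.max(axis=1))
    path = [int(score[-1].argmax())]                        # best candidate in the last frame
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t - 1][path[-1]]))
    return path[::-1]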
Pose Estimation in Video: a Shortest Path Approach 384.3 ExperimentsParameter Adjustment. The major parameters in the algorithm to beset are the number of top FMP detections in each frame n and the number offrames each original detection is tracked back and forward to ?. The majorfactor that we take into account when adjusting parameters for the modelis the time of inference. We would like the computation time of our methodto be of the same order of magnitude as FMP. The latter often runs in 5-20seconds for a single image on a conventional machine. We limit ourselves to60 seconds per image frame, which puts certain constraints on n and ?.The computation time during detection is mostly consumed by twostages: computing n FMP detections and tracking detection back and for-ward ? frames. The FMP computation time does not depend on n, thereforewe have to minimize the tracking time, which in our experiments takes onaverage 80% of the inference time. In every frame our algorithm performsarticulated tracking 2n? times, which involves tracking each of the 26 bodyparts in the model. We use median optical flow, which on average takes0.06 seconds per body part, taking into account pre-computation of opticalflow for every consecutive pair of images. Therefore we impose a constraintn? < 20 in order to satisfy 60 seconds per frame computation time require-ment.Although the median optical flow is fast, it is not the best tracking algo-rithm. In our experiments we observe that tracking for more than 2 framesoften drifts away and picks the background regions especially when peoplein the videos move fast. Therefore we set ? = 2 to maximize the number ofFMP detections, and set n = 10. Our early experiments demonstrated thebenefit of this approach in comparison with ? = 5, n = 4. We also set thetransition cost weighting constant C = 1, as it does not seem to significantlyaffect the results.Evaluation. We evaluate our algorithm using PCP2D, a definition ofpercentage of correct parts proposed in Section 3.2.1. As discussed earlier,evaluation of an algorithm that does not distinguish between left and rightsides of a body on a dataset that does differentiate them using standard PCPmeasure does not necessarily reflect the performance of the algorithm. Weuse PCP2D based on a more common (loose) version of PCP (see Section3.2.1). The threshold for the PCP measure often varies depending on thedataset. Since our detections are 26-part body skeletons, the parts them-selves are smaller than in a 14-joint skeleton, therefore we set the thresholdto a relatively large value 0.6.Many videos in the UCF Sports Pose dataset contain multiple peopleChapter 4. Pose Estimation in Video: a Shortest Path Approach 39Table 4.1: Results of our shortest path approach (SPA). Results are com-pared to FMP [55] for different action classes and overall.Action Class Golf Kick Lift Ride Run Skate Walk AllSwing HorseFMP [55] 58% 57% 72% 52% 52% 58% 68% 60%SPA 64% 60% 82% 60% 60% 64% 79% 68%either in the background or performing similar actions together. FMP isa detection approach, which may often detect people other than the targetperson. In order to make the comparison to FMP fair we crop the ini-tial video sequences such that they contain single person only. Due to thetime constraints we evaluate our algorithm on a subset of UCF Sports Posedataset, containing 4 to 10 videos per action class, totaling from 230 to 405video frames per action class. It has 39 video sequences with 2305 videoframes in total. 
The comparison of the algorithm introduced in this chapterand FMP for each action class and overall is presented in Table 4.1. For theexamples refer to Figures 4.1-4.3. The author?s website1 provides severalvideo examples comparing FMP to our approach.4.4 Discussion and Future WorkComparison to other methods. The approach described in this chapteris close in some of its ideas to both [51] and [58], although it was devel-oped independently, as its development started before these papers werepublished. Here we would like to briefly contrast these methods with ourapproach.Both our approach and [51] take the top n outputs of the FMP. Wang etal. [51] find the best-overall combination of poses using dynamic program-ming. They score every pose according to a pre-learned color model, andscore co-occurrences of poses in adjacent frames using color similarity. Wefind the best combination of poses in the same way, however we chose notto use any features other than HOG. We want to make the comparison toFMP fair, and we are interested in seeing how the addition of only flowinformation improves the detections. Instead, we use the score returned bythe FMP itself. In order to obtain the pairwise scores we propagate the1http://www.cs.ubc.ca/nest/lci/thesis/olgeorge/index.htmlChapter 4. Pose Estimation in Video: a Shortest Path Approach 40body poses to the adjacent frames and compute their displacements, whileWang et al.do not use optical flow.We further decide to incorporate knowledge from adjacent frames inevery frame. We propagate the poses from adjacent frames to the currentframe and add them to the pool of poses for dynamic programming. In thiswe are similar to Zuffi et al. [58] who propagate poses from the neighbouringframes in order to aggregate more information for further processing.Future work. Our algorithm ?fails? most often in the presence of fastmotion or motion blur. The tracking of poses loses the body parts, result-ing in incorrect propagation of information, which often causes wrong poseestimates. In order to improve our algorithm one may replace the medianoptical flow with a more accurate tracker. As the tracker is required tobe relatively fast, we foresee two potential candidates. The first one calledMedian Flow [27] combines forward-backward error filtering and normalizedcross-correlation. It is based on optical flow and may satisfy the speed re-quirement. The second alternative to median optical flow is the utilizationof trajectories, such as dense trajectories [52]. One may compute the tra-jectories for the current frame and then use the median value in a box fortracking. This procedure should not be time consuming as well. In orderto improve the pairwise scoring of poses in adjacent frames one may uti-lize learning of the co-occurrence patterns of the appearance features of themodel. An alternative direction of the future work is to make the algorithmonline, as many applications of pose estimation such as human-robot inter-action require the processing of the information on-the-fly. This could beachieved by utilizing a hidden Markov model instead of the Viterbi algorithmwhen looking for the best-overall combination of poses.Our approach may be considered as a sampling method that first chosesa subset of all possible points representing the space of its model and thendoes the full search on the results. The performance of a sampling methoddepends on the sampling technique, which in our case is heavily based onFMP. 
Therefore, it has very strict limitations to the set of poses it canproduce, which is determined by the output of FMP. Thus, if on a certainsequence FMP fails, our method would fail as well. Furthermore, our methodwould make the largest improvement on the sequences where FMP workssufficiently well to detect the right pose among its top candidates, but notwell enough to pick the best one. In other words, our approach allows FMPto fix itself based on what it already knows, filtering out incorrect detections.Also, our method knows nothing about the dynamics of human motion,performing only spatial reasoning about the discrepancies in consecutiveframes of a video sequence. In the next chapter we aim to address the aboveChapter 4. Pose Estimation in Video: a Shortest Path Approach 41issues by introducing a method that does full search over several consecutiveframes while taking into account the change of the appearance in time.Chapter 4. Pose Estimation in Video: a Shortest Path Approach 42Figure 4.1: Examples of pose estimates of the shortest path approach. Theresults are compared to the results of FMP [55]. The first and third rowscontain the results of FMP, the second and fourth rows contain the resultsof our method on the same images.Chapter 4. Pose Estimation in Video: a Shortest Path Approach 43Figure 4.2: Examples of pose estimates of the shortest path approach. Theresults are compared to the results of FMP [55]. The first and third rowscontain the results of FMP, the second and fourth rows contain the resultsof our method on the same images.Chapter 4. Pose Estimation in Video: a Shortest Path Approach 44Figure 4.3: Examples of pose estimates of the shortest path approach. Theresults are compared to the results of FMP [55]. The first and third rowscontain the results of FMP, the second and fourth rows contain the resultsof our method on the same images.45Chapter 5Pose Estimation in Video: aDetection ApproachThe approach described in Chapter 4 gives better results than the originalFlexible Mixture of Parts (FMP). Acting as a smoothing filter, it helps toget rid of sporadic incorrect detections, giving an overall better estimation ofpose throughout the video sequence. However, it is fundamentally a filteringapproach, and as was discussed in Section 4.4 it can give better results onlywhen the original approach succeeds more often than fails.In order to address the above limitation in this chapter we introduce anovel articulated human detection algorithm in video sequences. In contrastto the previous approach that searches only among the best FMP results, itis a detection algorithm that enables full search in several consecutive framesat once, which internally takes into account both appearance and motioninformation. Like FMP, it utilizes a tree model, allowing fast and tractableinference using dynamic programming and a modified distance transform.In addition it is an online method, in the sense that at every point in timeit does not require any future information in order to detect a pose in thecurrent frame. The latter makes it possible to run it in real-time, givenenough computational power.This chapter is organized as follows. In Section 5.1 we define the model,in Section 5.2 we describe inference algorithm together with our modificationof the distance transform of sampled functions. We present results in Section5.3.5.1 ModelThe main idea behind the current method is the way to incorporate motionand appearance in a single model, such that it captures how the appearancechanges with time. 
This can be achieved by learning the co-occurrencepatterns of filter types, corresponding to the same body part in consecutiveframes. Thus, the model fully connecting several tree models covering theChapter 5. Pose Estimation in Video: a Detection Approach 46human body in adjacent frames may be utilized. However, inference insuch model becomes intractable. In order to restore the tree property ofthe model graph, we drop the limb connections in all frames except thefirst one, leaving a ?trail? of positions in the past several frames for everypart (see Figure 5.1). Although the dropped connections would distort thepositions of body parts in previous frames, this may be compensated bythe temporal connections, aligned with the optical flow. Furthermore, theinference problems that arise when utilizing this model can be solved by amodified distance transform.Let us write I = {It}?t=0 for a sequence of ? + 1 video frames, where I0is the frame where we want to detect a pose and I? . . . I1 are ? precedingframes. We use descending enumeration for convenience. Our model utilizesa tree graph (V,E) = (??t=0 Vt,??t=0Et) spanning ?+1 frames, such that inframe I0 graph (V0, E0) represents a K-node tree model of the human body,while nodes V? . . . V1 correspond to locations of body parts in ? precedingframes and edges E? . . . E1 connect body parts to their instances in theprevious frame (See Figure 5.1). Formally, we use double indexing for nodesin the graph, such that V = {(i, t)}K,?i=1,t=0 where t denotes the frame number,and i represents the body part index. Then E0 = {((i, 0), (j, 0))} and Et ={((i, t? 1), (i, t))}Ki=1. For convenience we use the following notation: V?t ={i}Ki=1, E?0 = {(i, j)}, E?t = {i}Ki=1.Furthermore, let pti = (xti, yti) be the pixel location of body part i inframe It. Then pt = {pti}Ki=1 defines all body part locations in frame It, andp = (p0, . . . , p?) is the total spatial configuration of body parts in ? + 1frames. Likewise, let f ti = {1, . . . , R} determine the filter type for body parti in frame t, then f t = {f ti }Ki=1 and f = (f0, . . . , f?) represent the filterconfiguration in frame t and the total filter configuration correspondingly.Also, similarly to Section 4.1 let (x?, y?) = Ft(x, y), t = {1, . . . ,?} denotethe median optical flow frame t? 1 to frame t in the image region centeredat (x, y) with the size of a body part (see Section 2.2.4).The score of a specific configuration of body part locations p and filtertypes f in ? + 1 video frames I has the following form:S(I, p, f) =??t=0St(I, p, f), (5.1)Chapter 5. Pose Estimation in Video: a Detection Approach 47(a) (b)Figure 5.1: Spatio-temporal tree structure of the model. Graph (V0, E0)represents body structure in the frame where detection is being performed.Each set of nodes Vt correspond to locations of body parts t frames backin time, each set of edges Et connects nodes in Vt?1 to their correspond-ing nodes in Vt. (a) An example of the model structure for ? = 2. (b)Corresponding poses for frames I2, I1, I0.S0(I, p, f) =?i?V ?0bf0ii +?(i,j)?E?0bf0i f0jij + . . .?i?V ?0?f0ii ? ?(I0, p0i )+?(i,j)?E?0?f0i f0jij ? ?(p0i ? p0j ),(5.2)St(I, p, f) =?i?V ?tbf tii +?i?E?tbf ti ft?1ii + . . .?i?V ?t?f tii ? ?(It, pti)+?i?E?t?f ti ft?1ii ? ?(pti ? Ft(pt?1i )), t = {1, . . . ,?}.(5.3)In the above equation ?(It, pti) is a feature vector extracted from imageIt at location pti. This could be a HOG descriptor [9] or any other feature.We also write ?(dx, dy) = [d2x d2y dx dy]?.Chapter 5. 
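The spatio-temporal graph of Figure 5.1 can be built as in the following sketch, which assumes the frame-0 body tree is given as a parent list over K parts (parent[i] < 0 marking the root) and labels nodes by (part, frames-back) pairs. It only illustrates the structure of (V, E) and is not code from the detector.

def build_spatio_temporal_tree(parent, delta):
    # Nodes (i, t): body part i at t frames in the past, t = 0..delta.
    # Spatial edges connect parts within frame 0 according to the body tree;
    # temporal edges connect every part to its own instance in the previous frame.
    K = len(parent)
    nodes = [(i, t) for t in range(delta + 1) for i in range(K)]
    edges = [((parent[i], 0), (i, 0)) for i in range(K) if parent[i] >= 0]   # E_0
    for t in range(1, delta + 1):
        edges += [((i, t - 1), (i, t)) for i in range(K)]                    # E_t
    return nodes, edges

# Toy example: a 3-part body (root with two children) tracked one frame back
# nodes, edges = build_spatio_temporal_tree(parent=[-1, 0, 0], delta=1)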
Pose Estimation in Video: a Detection Approach 48Given this notation, our model has the following form:M = (B,?) , (5.4)B =({bmi }, {bmnij }, {bmni }), (5.5)? =({?mi }, {?mnij }, {?mni }). (5.6)Here bmi favours assignment of filter type m to body part i, bmnij favoursco-occurrence of filter types m and n in body parts i and j, bmni favoursswitching from filter m to filter n in body part i in two consecutive frames.Furthermore, ?mi is the filter of type m of body part i. The quadraticdeformation spring model between filters m and n of body parts i and jis determined by ?mnij . Also, ?mni defines the deformation model for theswitching from filter n to filter m of body part i in two consecutive frames.Note that S0(I, p, f) is exactly the cost function of the Flexible Mixtureof Parts [55] (see Section 2.2.1). Thus, our model turns into FMP in thecase when ? = 0.5.2 InferenceInference corresponds to maximizing the model?s score function (5.1) overparameters (p, f) given a sequence of video frames I = {It}?t=0. Since ourrelational graph (V,E) is a tree, dynamic programming enables full searchover all possible locations p and filter types f , similarly to FMP.5.2.1 Message PassingIn order to perform dynamic programming, we set up a message passingmechanism from child to parent nodes (see Section 2.2.2). Let kids(i, t) bethe set of children of node (i, t), let kt(i, t) denote the temporal child of node(i, t) and ks(i, t) denote its spatial children:ks(i, t) = {(j, ?) ? kids(i, t)|? = t},kt(i, t) = {(j, ?) ? kids(i, t)|? = t+ 1, j = i}.The message that child (i, t) passes to its parent has the following form:scoreti(fti , pti) = bf tii +?f tii ? ?(It, pti) + . . .+?(j,?)?ks(i)mstj(fti , pti)+?(j,?)?kt(i)mtt+1i (fti , pti),(5.7)Chapter 5. Pose Estimation in Video: a Detection Approach 49where mstj and mtti are defined as follows:mstj(fti , pti) = maxf tj[bf tjftiji + maxptj(scoretj(ftj , ptj) + ?f tjftiji ? ?(ptj ? pti))],(5.8)mtti(ft?1i , pt?1i ) =maxf ti[bf ti ft?1ii + maxpti(scoreti(fti , pti) + ?f ti ft?1ii ? ?(pti ? Ft(pt?1i )))](5.9)Note that our inference procedure is obtained from the one in FMP byadding the second sum in equation (5.7), which has 0 and 1 terms for leavesand internal nodes correspondingly.The message passing starts from leaves, and proceeds until all the nodesexcept the root have passed messages to their parents. Then, score01(f01 , p01)contains the final scores for detection, and similar to FMP we obtain multipledetections by thresholding the score and applying non-maximum suppression(see Section 2.2.1) on the result to remove the detections covering the samehuman. We use backtracking to restore the detected poses (see Section2.2.2).5.2.2 An Approximate Distance TransformThe computationally expensive portion of message passing is computing(5.8) and (5.9). It requires looping over L?R possible locations and typesof the parent and L?R potential locations and types of the child, making thecomplexity of the whole procedure O(L2R2). The relaxation (2.2) reducesthe complexity to O(L2R). However, given that the total number of possiblelocations L is often very large, the quadratic complexity makes inferenceprocedure too slow, almost impractical. In our experiments it took morethan an hour to find a pose in a small image on a conventional PC.Therefore, utilization of methods reducing computation time is essen-tial. 
Yang and Ramanan [55] use the distance transform developed byFelzenszwalb and Huttenlocher [15], which reduces computation of (5.8)to O(LR2) in the case when ?(dx, dy) is a quadratic function (see Section2.2.3). However, direct usage of the aforementioned distance transform forcomputation of (5.9) is not possible. In this section we will describe howone can modify it for our case.Chapter 5. Pose Estimation in Video: a Detection Approach 50One dimension. Consider the following problem. Let G = {g1, . . . , gn}and H = {h1, . . . , hm} be one-dimensional grids, f : G? R and d : H ? Rbe arbitrary functions. For every p in grid H find Df (p), defined as:Df (p) = minq?GP2(p+ d(p), q), (5.10)P2(p, q) = a(p? q)2 + b(p? q) + f(q) (5.11)This problem can be reduced to Felzenszwalb?s distance transform ofsampled functions [15] in the following way. First, we perform the compu-tation of the lower envelope of parabolas P2(p, q) for all q in G. Then wefill in the values of Df (p), but we replace p with p + d(p) when computingvalues of the lower envelope.Two dimensions. Let f : G ? R be an arbitrary function defined ontwo-dimensional grid G = {g11 . . . g1n} ? {g11 . . . gk1}, and let d : H ? R be anarbitrary function defined on grid H = {h11 . . . h1m} ? {h11 . . . hl1}. The goalis to find Df (x, y) for every (x, y) in H:Df (x, y) = min(x?,y?)?GP2((x, y) + d(x, y), (x?, y?)), (5.12)P2((x, y), (x?, y?)) = ax(x? x?)2 + bx(x? x?)+ay(y ? y?)2 + by(y ? y?) + f(x?, y?). (5.13)Df (x, y) = min(x?,y?)?Gax(x+ d1(x, y)? x?)2 + ay(y + d2(x, y)? y?)2 + f(x?, y?)(5.14)In the case when d(x, y) ? 0 the problem reduces to (2.17)-(2.18), whichcan be formulated as (2.19). The latter can be solved by performing Felzen-szwalb?s one-dimensional distance transform along each column of the gridG and then computing the distance transform along each row of the result.Similar reduction to the one-dimensional case (5.10) is possible when thefunction d(x, y) = (d1(x, y), d2(x, y)) satisfies the constraintsd1(x, y1) = d1(x, y2) = d1(x), ?x ? {h11 . . . h1n}, (5.15)d2(x1, y) = d2(x2, y) = d2(y), ?y ? {h11 . . . hk1}, (5.16)Chapter 5. Pose Estimation in Video: a Detection Approach 51because in this case the first two terms of (5.13) do not depend on y:Df (x, y) = minx?,y?(ax(x+ d1(x, y)? x?)2 + bx(x+ d1(x, y)? x?)+ay(y + d2(x, y)? y?)2 + by(y + d2(x, y)? y?) + f(x?, y?)) = (5.17)minx?[ax(x+ d1(x)? x?)2 + bx(x+ d1(x)? x?)+miny?(ay(y + d2(y)? y?)2 + by(y + d2(y)? y?) + f(x?, y?))] = (5.18)minx?[ax(x+ d1(x)? x?)2 + bx(x+ d1(x)? x?) +Df |x?(y)]. (5.19)Intuitively, the aforementioned reduction is possible because the set ofpoints (x, y) + d(x, y) forms a grid:R ={(x, y) + d(x, y)|(x, y) ? H} ? {r11 . . . r1m}? {r11 . . . rl1}. (5.20)However, in the general case when equalities 5.15-5.16 do not hold theprocedure outlined above does not provide the solution to the problem 5.10-5.11. One of the possible ways to deal with it is to form a grid from theset R as defined in (5.20) by taking the Cartesian product of its projectionson the X and Y axes. This however may expand the set from L to L2points and although in this case the above procedure could be utilized, thecomputation will not be performed in linear time, thus the benefit of thedistance transform will be lost.An alternative solution to this problem would be a distance transformworking directly in two dimensions by utilizing a two-dimensional lowerenvelope of elliptic paraboloids. 
However, the computation of the lowerenvelope in two dimensions is a much more complicated procedure, as onehas to find intersections of every elliptic paraboloid with all its neighbours,and potentially neighbours of the neighbours, forming a complex partitionof the two-dimensional space, consisting of polygons of potentially arbitraryshape.We take a different approach. We quantize the set R as defined in (5.20)into grid H, by obtaining d?(x, y) such that (x, y) + d?(x, y) is the closestpoint to (x, y) + d?(x, y) in grid H. Then we use the distance transform inthe conventional way, obtaining Df (x, y) as defined in (2.17). Finally wecompute the approximation to the distance transform D?f (x, y) as defined in(5.14):D?f (x, y) = Df (x+ d?1(x, y), y + d?2(x, y)). (5.21)Chapter 5. Pose Estimation in Video: a Detection Approach 52The above is equivalent to quantizing the optical flow information to thenearest HOG cell. Since body part filters are often represented by 4? 4 to6 ? 6 HOG grids, this loss of information is not dramatic, which may beeffectively mitigated for a more accurate tracking algorithm. Furthermore,when the quadratic coefficients ax, ay are sufficiently small as in our exper-iments, the paraboloids are wide, and the difference in the computation ofthe score is minimal.In order to apply the distance transform to message passing (5.9), onehas to define [ax, ay, bx, by] ? ?f ti ft?1ii , f(x, y) = scoreti(ftj , (x, y)), d(x, y) =Ft(x, y). Then utilization of the approximate distance transform is possible,reducing the corresponding portion of message passing to linear time.5.2.3 The Inference ProcedureAs mentioned in the beginning of this chapter, the approach described aboveis a detection algorithm, requiring information only about previous frames.Given a video sequence V = {V1, . . . , VT } the inference corresponds to find-ing a pose in frame t for all t ? {?? + 1, . . . , T}, taking into account onlyvideo frames V1, . . . , Vt. In order to do this, for every frame t we select ?previous frames with temporal step ?:I ={I?}T?=0, (5.22)I? =Vt???, ? = {0, . . . ,?}. (5.23)Then, for every t the inference procedure described in this section canbe performed independently, obtaining a full temporal tree, representingthe human body in ? frames. The temporal part can be then disregarded,leaving only the spatial part in the current frame, which represents the finalresult of the detection.Input: Sequence of video frames V = {V1, . . . , VT }, constants ?,?Output: Set of poses p = (p1, . . . , pT )for each t ? {?? + 1, . . . , T} dofor each ? ? {0, . . . ,?} doI? ? Vt???;endp? arg maxp maxf S({I?}??=0, p, f);pt ? (p)1;endChapter 5. Pose Estimation in Video: a Detection Approach 535.3 ExperimentsParameter Adjustment. Our model consists of a spatial and a temporalpart. The spatial part defines how well the model fits the current frame,while the temporal part identifies how consistent the appearance and loca-tion of every body part in time is. We think that bf tii and ?f tii from equalities5.1-5.1 that determine appearance bias and filter can be learned indepen-dently of the temporal dimension, because the detection is performed inevey frame independently taking into account several previous frames, andthese parameters may not depend on t. Therefore we use the correspondingparameters from the FMP model trained on single images: bf tii = bf0ii and?f tii = ?f0ii . 
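As a concrete summary of the approximation adopted in Section 5.2.2, the sketch below shifts the child score map by the flow offset rounded to the nearest cell and then applies a standard separable quadratic max-transform, as in equation (5.21). For clarity the one-dimensional transform is written as a brute-force loop; a real implementation would use Felzenszwalb's linear-time lower-envelope algorithm, and the function names below are not taken from any released code.

import numpy as np

def dt_1d(f, a, b):
    # Brute-force D(p) = max_q [ f(q) - a*(p - q)^2 - b*(p - q) ] over a 1-D grid.
    n = len(f)
    q = np.arange(n)
    out = np.empty(n)
    for p in range(n):
        out[p] = np.max(f - a * (p - q) ** 2 - b * (p - q))
    return out

def approx_flow_dt(score, flow, ax, ay, bx=0.0, by=0.0):
    # score: child score map (H x W); flow[y, x] = (dy, dx) median optical flow offsets.
    H, W = score.shape
    cols = np.stack([dt_1d(score[:, x], ay, by) for x in range(W)], axis=1)   # transform columns
    full = np.stack([dt_1d(cols[y, :], ax, bx) for y in range(H)], axis=0)    # then rows
    ys, xs = np.mgrid[0:H, 0:W]
    yq = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, H - 1)            # quantised offsets
    xq = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, W - 1)
    return full[yq, xq]                                                        # equation (5.21)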
We follow Yang and Ramanan [55] and resort to relaxation?f0i f0jij = ?f0iij , which reduces the computation cost during inference.We explore two simple methods for learning bf ti ft?1ii . The first methodfinds the score of every ground truth pose in the training video sequenceby fixing the position and maximizing over all possible filter types for allbody parts using equations 4.5-4.6. Then for each body part it counts co-occurrences of different filter types in two consecutive frames. The secondmethod acts similarly, but it takes into account all filter responses at everytime instance, as opposed to only counting the filter types that were selectedin the score reconstruction process. If in the current frame filter type ihas score si, and in the next frame filter type j has score sj , then the co-occurrence bias of filter types i and j is increased by s1s2. When both iand j either score high or negatively low this favours co-occurrence of thesefilter types. It penalizes the co-occurrence when one of them is high and theother one is low. We also explored other counting functions such as s1 + s2,max(s1, 0) max(s2, 0) but found no difference in several early experiments.Although the above learning schemes demonstrated efficiency in earlyexperiments, we found that they do not always improve the pose estimationperformance even on the videos they were trained on, and do not trans-late well between video sequences. Therefore in our experiments we setbf ti ft?1ii = 0 and explore the framework in the absence of the appearanceswitch bias. We set ?f ti ft?1ii = ?f0ii as it does not significantly alter the re-sults. Such an approach does not require any training video data, as we usethe model trained on single images. We set the temporal step ? = 2 toincrease the discrepancy between two consecutive frames and set the num-ber of simultaneously considered frames in the past as ? = 1. Increasing? above 2 results in the growth of the number of tracking failures of ourChapter 5. Pose Estimation in Video: a Detection Approach 54Table 5.1: Results of our detection approach (DA). Results are compared toFMP [55] for different action classes and overall.Action Class Golf Kick Lift Ride Run Skate Walk AllSwing HorseFMP [55] 58% 57% 72% 52% 52% 58% 68% 60%DA 57% 53% 73% 54% 58% 58% 67% 62%median optical flow algorithm. This decreases the algorithm performance,as the model is constrained to search for the pose using incorrect spatio-temporal tree configuration.Evaluation. Similarly to Section 4.3 we evaluate the algorithm usingPCP2D on the same subset of videos from the UCF Sports Pose dataset,consisting of 29 video sequences totaling 2305 video frames. The comparisonof the algorithm to FMP for each action class and overall is demonstratedin Table 5.1. See Figures 5.2-5.3 for some pose estimation examples.5.4 Discussion and Future WorkAs one may see from Figures 5.2-5.3 our algorithm gives results very closeto the results of FMP. The optical flow does add new information whichleads to overall marginal improvement. However, on certain video sequencesour method works worse than FMP because inaccurate tracking imposesincorrect priors on a pose in the previous frame.We see several steps that may address the issues above. The first oneutilizes a joint spatio-temporal training scheme similarly to Yang and Ra-manan [55]. It may help define the biases bf ti ft?1ii more optimally, whichmay play an important role in the detection process by penalizing unlikelyfilter switches. 
For instance, the horizontal forearm in the current frame isunlikely to be vertical in the next frame as the change is too abrupt. Suchrelations may be captured by a jointly learned spatio-temporal model.However, providing a better learning scheme may not be enough to makea substantial improvement in accuracy. Since we relax the spatial relation-ships between joints in previous frames we need a good prior on the locationof these joints. The second step that we suggest for future work is to seek abetter tracking mechanism. As discussed in Section 4.4 the potential algo-rithms are Median Flow [27] and dense trajectories [52].Chapter 5. Pose Estimation in Video: a Detection Approach 55An alternative way to improve the performance may exploit richer spatio-temporal models. For instance one may utilize multiple tree structuresproviding additional coverage of the edges in the original loopy graphicalmodel. A similar idea was exploited by Sapp et al. [44] who decompose theintractable model into a tree ensemble with the full coverage of the edgerelationships of the original model. We do not know how the above im-provements might increase the accuracy of the algorithm, and we leave it asan interesting problem for future research.Chapter 5. Pose Estimation in Video: a Detection Approach 56Figure 5.2: Examples of pose estimates of the detection approach. Theresults are compared to the results of FMP [55]. The first and third rowscontain the results of FMP, the second and fourth rows contain the resultsof our method on the same images.Chapter 5. Pose Estimation in Video: a Detection Approach 57Figure 5.3: Examples of pose estimates of the detection approach. Theresults are compared to the results of FMP [55]. The first and third rowscontain the results of FMP, the second and fourth rows contain the resultsof our method on the same images.58Chapter 6Abstracting Human PoseEstimationHuman pose estimation is a challenging problem and an active researchfield, motivated by many applications of pose detection, such as human-computer interaction, surveillance and gaming. Recent advancements inpose estimation [46] are able to give results sufficiently good for the usein industrial and/or commercial applications, such as Microsoft KinectTM.However, using a state-of-the-art algorithms in real-world applications hasnumerous challenges. The majority of software engineers are non-expertsin Computer Vision and pose estimation and they may encounter manyproblems, including the following:? The state of the art advances fast and it is hard to track it down,as there is no regularly updated list of benchmarked and evaluatedpose estimation algorithms. Furthermore, there is no one method thatwould work best in all circumstances.? Given a state-of-the-art pose estimation academic paper, it is not triv-ial for non-experts to implement it. Furthermore, it is very hard toconstantly reimplement the pose estimation algorithm for a specificapplication to keep up with the state of the art.? The interface to most ready-to-use algorithms requires understandingof the parameters.We believe that the solution to these problems may be addressed withthe notion of the task, which we define as a combination of input descrip-tion and output requirement, as well as parameters that can affect the result.This provides enough information to select the appropriate algorithm, whilehiding the implementation details behind the abstraction. 
If the abstractioncovers enough of the problem space, new algorithms can be seamlessly inte-grated without any changes to the interface, which would provide users withcontinuous updates to the state of the art. Furthermore, the requirementsChapter 6. Abstracting Human Pose Estimation 59of a specific platform may be taken into account, e.g. by utilizing low-poweralgorithms for mobile devices.The purpose of this chapter is to introduce a task-based human poseestimation control system. We focus on the problem of 2D pose estimationin our selection of algorithms used in the system. However, the design of theabstraction is sufficiently general to accept other types of algorithms, such as3D pose estimation and pose estimation in stereo. The system was designedas part of OpenVL [36], a framework that abstracts some Computer Visionproblems such as segmentation, matching and image registration.The two key contributions of this chapter are:? A task-based abstraction for human pose estimation, which hides im-plementations of various pose estimation algorithms behind a singlesimple yet powerful application programming interface (API)? A method for mapping from task to algorithm that automatically se-lects method most likely to succeed and adjusts its parameters basedon the task descriptionThis chapter is organized as follows. We first discuss task descriptionin Section 6.1, and then outline the mapping from the task description tothe algorithm which produces final pose estimates. Section 6.3 present theexperimental evaluation of the algorithm mapping, Section 6.4 is devoted todiscussion and future work.6.1 Task DescriptionThe task description is based on three categories: input, output, and target.6.1.1 Input DescriptionInput type. In our definition of abstraction we would like to capture asmany combinations of input data as possible. With this in mind, we formatthe input data as a temporal sequence of spatial arrangements. Every spatialarrangement is determined by a set of images coupled with poses of thecorresponding cameras at the current moment. The poses may be undefined,while images contain information about color, depth or both. Temporalsequences may also contain data such as whether the video sequence is beingstreamed or is available at once. We refer to this as the input type.The above definition of Input Type is flexible enough to cover most ofthe common combinations of cameras in time and space. For instance, itChapter 6. Abstracting Human Pose Estimation 60naturally represents the setting of multiple calibrated cameras, where eachspatial arrangement captures the position of the cameras at every time in-stance. The stereo vision system may be described by spatial arrangements,each containing two color images. The setup when the single camera is mov-ing with an unknown trajectory is described by spatial arrangements withsingle image and undefined camera position each. The common case of asingle depth camera is handled by one spatial arrangement with single depthimage.Image description. In addition to input type we include an imagedescription, which encompasses the user?s prior knowledge about the inputimage data. We define two types of image description: amount of occlusionand clutter. We think they are the most relevant to the general descriptionof an image or video sequence in the setting of 2D pose estimation. 
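A possible encoding of the input description is sketched below; the type and field names are illustrative only and do not correspond to the actual OpenVL API.

from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Level(Enum):
    LOW = 1; MEDIUM = 2; HIGH = 3

@dataclass
class Image:
    has_color: bool = True
    has_depth: bool = False

@dataclass
class SpatialArrangement:
    # The images captured at one time instance, with optional camera poses.
    images: List[Image] = field(default_factory=list)
    camera_poses: Optional[list] = None     # None means the poses are undefined

@dataclass
class InputDescription:
    arrangements: List[SpatialArrangement] = field(default_factory=list)
    streamed: bool = False                   # True if the sequence arrives frame by frame
    clutter: Level = Level.MEDIUM
    occlusion: Level = Level.MEDIUM

# e.g. a single moving colour camera with unknown trajectory, streamed, high clutter:
# InputDescription([SpatialArrangement([Image()]) for _ in range(num_frames)],
#                  streamed=True, clutter=Level.HIGH)

Both clutter and occlusion are exposed as fields of the description above.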
However,in other cases such as ones utilizing depth images there might be differentrelevant conditions, which may be added as part of future work.We loosely define the clutter as how many features would likely be foundin regions of the image not belonging to a person. For instance, an imagewith a person standing on a field would possess low clutter, while an imageof a person in a city setting with many cars and buildings in the backgroundwould be considered as high clutter. The occlusion condition reflects howlikely human bodies or their parts are to be covered by elements in the scene,such as a person standing behind of a desk. Both clutter and occlusion aredefined in the range of (0, 1) and we quantize them into Low, Medium, High,as most of the problems do not require an in-depth description.Our input description consists of the input type and image description asdefined above, which we use to select the algorithm most likely to succeedin the current case. The input type constrains the set of algorithms ourframework may select, because most of the algorithms strictly define whatkind of data they work on, such as color images or or stereo vision pairs.We use image description as a factor to find the best algorithm for the giveninput data.6.1.2 Output RequirementSimilarly to input description, we would like the output requirement to covermost of the common body representations that one might want to infer fromthe image or video data. With this in mind we first define a set of all bodyparts that we include in the framework: Head, Neck, Chest, L/R Shoulder,L/R UpperArm, L/R Elbow, L/R LowerArm, L/R Hand, Abdomen, L/RHip, L/R UpperLeg, L/R Knee, L/R LowerLeg, L/R Foot. Although finer-Chapter 6. Abstracting Human Pose Estimation 61grained representations such as one including fingers may fit well into ourframework, we are leaving this as a future work. We further include bodypart composition, which is a set of body parts with a description of therequirement regarding the included parts. The composition requirementmay include one or several of the following:? 2D or 3D location relative to camera? Orientation as roll, pitch and yaw relative to camera? Pixel-wise mask? Bounding box or cubeFinally, we define the output requirement for the task as a set of bodypart compositions. We also predefine a set of common compositions includ-ing Full-Body, Upper-Body, Head+Torso, Head. Note that if the systemdetects more than one person, it returns the required information abouteach of them. We also include the speed/accuracy requirement in the rangeof (0, 1), determining how much the accuracy could be sacrificed for thespeed.The above representation captures the results one may get from themajority of pose estimation algorithms. For instance, kinect-style body partlabelings together with 3D positions of body joints may be described by aset of compositions, each of which requires a 3D location of the joint and apixel-wise mask. A set of 2D body joint positions together with the person?ssilhouette may be represented as a set of single-part compositions with therequirement of a 2D position, together with a composition of all body parts,requiring a pixel-wise mask. A simple face detector may be described as asingle composition with a single body part Head with the only requirementof the bounding box.Note that the above description may encompass a set of other algorithmsin Computer Vision, generally not associated with pose estimation suchas face or head detection and pose orientation estimation. 
However, webelieve that they share common features and algorithms with the field ofpose estimation and person detection and should be considered together (seeSection 2.1.1).6.1.3 Target DescriptionThe target description allows users to encode their prior knowledge aboutpeople in the image, which consists of one or more of the following:Chapter 6. Abstracting Human Pose Estimation 62? The population? The set of compositions with defined priors that include visibility, lo-cation, size and orientation in 2D or 3D? The distinctiveness from background in terms of color, texture, ormotion, in the range (0, 1)We define the population as the number of people in each image. Thesize and location of body part compositions are defined in the range (0, 1)relatively to the size of the input image. The above conditions may affectthe selection of the algorithm. For example, there may exist algorithms thatspecifically target multiple-people scenarios or work well on low-resolutionimages. Conversely, there may be algorithms that fail on images when cer-tain body parts are invisible, e.g. lower body. The distinctiveness frombackground may be given when the user knows something about the ap-pearance or motion of the person in the video. High color distinctivenessmay favour methods that rely on color while high texture may be importantfor certain gradient-based algorithms. High motion distinctiveness tells thesystem that the person is moving fast compared to the background, andcertain methods involving motion-based segmentation may come into play.Alternatively, certain pre-processing based on color, gradient or motion fea-tures may be applied.Furthermore, the user may also have prior knowledge of the person?spose, such as visibility, location, size or orientation of certain body parts,which may play a role in the algorithm selection process. For instance, priorknowledge of legs being hidden behind the desk may trigger the selection ofan upper-body pose estimation algorithm, or the prior that the person is fac-ing the camera may help select the face orientation method instead of headorientation estimation algorithm. The prior knowledge of the above condi-tions may be utilized by algorithms that employ instance-specific learningapplied on top of reliable detections of canonical poses [41]. Alternatively,this may be directly incorporated by certain algorithms. For instance, lo-cal features involving invisible body parts may be weighted low in the poseestimation procedure while a location prior may increase scores for certainparts in the image. Prior knowledge of size of body parts may weight cer-tain scales higher, and orientation prior may be used to increase weight fororientation-specific features in an algorithm.In addition, we provide several pre-defined body poses that may be usedinstead of body compositions: Regular, Unusual and Front-Facing. TheUnusual pose is a non-vertical or highly articulated body configuration andChapter 6. Abstracting Human Pose Estimation 63(a) (b) (c) (d) (e)Figure 6.1: Examples of scene conditions and algorithms output. The resultsfrom three algorithms are presented, from top to bottom: GBM [43], F-FMP [55], U-FMP [55]. (a) Regular Pose (b) Unusual Pose (c) Lower Body isinvisible (d) High Clutter (e) Low Clutter and Large Size. These algorithmsare described in Section 6.2.Regular is any other pose. Front-Facing is a pose when torso of the personis roughly facing the camera. 
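The output requirement and target description admit a similar encoding, sketched below under the same caveat: these names are one possible illustration, not the actual OpenVL API.

from dataclasses import dataclass, field
from enum import Flag, auto
from typing import List, Optional, Tuple

class Require(Flag):
    LOCATION_2D = auto(); LOCATION_3D = auto(); ORIENTATION = auto(); MASK = auto(); BOX = auto()

@dataclass
class Composition:
    # A set of body parts plus what is required of it (output) or known about it (target prior).
    parts: List[str]                               # e.g. ["Head"] or a predefined Full-Body set
    required: Require = Require.LOCATION_2D
    visible: Optional[bool] = None                 # optional priors, in relative units
    location: Optional[Tuple[float, float]] = None
    size: Optional[float] = None

@dataclass
class OutputRequirement:
    compositions: List[Composition]
    speed_over_accuracy: float = 0.5               # in (0, 1)

@dataclass
class TargetDescription:
    population: Optional[int] = None               # people per image, if known
    priors: List[Composition] = field(default_factory=list)
    color_distinctiveness: float = 0.0             # each distinctiveness value in (0, 1)
    texture_distinctiveness: float = 0.0
    motion_distinctiveness: float = 0.0

# e.g. 2D full-body joints of a single fast-moving person:
# OutputRequirement([Composition(parts=full_body_parts)]) together with
# TargetDescription(population=1, motion_distinctiveness=0.8)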
6.2 Task to Algorithm Mapping

Based on the abstraction outlined in the previous section, we now present a proof-of-concept framework designed to demonstrate the utilization of the abstraction. In this work we consider four algorithms for 2D body pose estimation: Rothrock's grammar-based model (GBM) [43], the Flexible Mixture of Parts [55] for upper body (U-FMP) and full body (F-FMP), and our shortest path approach from Chapter 5 (SPA). Furthermore, we include two algorithms for head and face orientation estimation in order to demonstrate the utility of the part-wise requirement formulation in the abstraction: face orientation estimation by Zhu and Ramanan [57] (FO) and the head/torso orientation prediction algorithm by Maji et al. [34] (HTO). We selected the algorithms for the framework based on their problem space coverage, performance and code availability on the web. Methods for pose estimation return 2D locations of body joints, while head/face orientation estimation algorithms return the yaw angle in degrees. All methods operate on color images, or an image sequence in the case of SPA.

Table 6.1: Abstraction condition matrix. The task controls are presented in the first two columns, followed by the algorithms used in our proof-of-concept abstraction. The level of satisfaction for each control per algorithm forms the basis for algorithm selection based on the user-supplied description. Note: FB = Full Body, UB = Upper Body, LB = Lower Body, FF = Front-Facing, H+T = Head+Torso; L = Low, M = Medium, H = High; px = pixels; ✓ = supported, ✗ = not supported.

  Controls                          GBM      F-FMP    U-FMP    SPA      HTO     FO
                                    [43]     [55]     [55]              [34]    [57]
  Input          Image              ✓        ✓        ✓        ✗        ✓       ✓
  Type:          Video              ✓        ✓        ✓        ✓        ✓       ✓
  Image          Clutter            L        M–H      M–H      M–H      L–H     L–H
  Description:   Occlusion          L        M–H      M–H      M–H      H       L–M
  Target         Population         1        >= 1     >= 1     1        >= 1    >= 1
  Description:   Size (px)          80–300   50–500   > 300    50–500   > 20    > 80
                 Pose               All      Regular  FF       Regular  All     FF
                 Invisibility       ✗        LB       LB       LB       ✗       ✗
  Output         Joint Locations    FB,UB    FB,UB    UB       FB,UB    ✗       Head
  Requirements:  Joint Orientation  ✗        ✗        ✗        ✗        H+T     Head
                 Accuracy/Speed     ✓        ✗        ✗        ✓        ✓       ✓
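To illustrate how Table 6.1 might be consumed programmatically, the fragment below transcribes a subset of its rows into per-algorithm records; the dictionary layout and field names are our own illustration, not part of a released interface.

```python
# Literal transcription of part of Table 6.1 (values copied, not interpreted).
# "single_image" states whether the method can run on a single frame; "size_px"
# is the supported pixel size of the relevant composition; "joint_locations"
# lists the body compositions for which 2D joint locations can be returned.
CONDITION_MATRIX = {
    "GBM":   {"single_image": True,  "clutter": "L",   "occlusion": "L",
              "size_px": "80-300",  "population": "1",   "pose": "All",
              "joint_locations": ("FB", "UB")},
    "F-FMP": {"single_image": True,  "clutter": "M-H", "occlusion": "M-H",
              "size_px": "50-500",  "population": ">=1", "pose": "Regular",
              "joint_locations": ("FB", "UB")},
    "U-FMP": {"single_image": True,  "clutter": "M-H", "occlusion": "M-H",
              "size_px": ">300",    "population": ">=1", "pose": "FF",
              "joint_locations": ("UB",)},
    "SPA":   {"single_image": False, "clutter": "M-H", "occlusion": "M-H",
              "size_px": "50-500",  "population": "1",   "pose": "Regular",
              "joint_locations": ("FB", "UB")},
    "HTO":   {"single_image": True,  "clutter": "L-H", "occlusion": "H",
              "size_px": ">20",     "population": ">=1", "pose": "All",
              "joint_locations": ()},
    "FO":    {"single_image": True,  "clutter": "L-H", "occlusion": "L-M",
              "size_px": ">80",     "population": ">=1", "pose": "FF",
              "joint_locations": ("Head",)},
}
```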
6.2.1 Algorithm Selection

We use the task description presented in the previous section to select the appropriate algorithm. Table 6.1 shows the condition matrix that we use for the algorithm selection. The Input Type row reflects that only the SPA algorithm does not work on single images. The image description specifies the level of occlusion and clutter different algorithms can tolerate.

The target description identifies that only GBM and SPA require a single person in the image. Size describes the pixel sizes of the relevant compositions on which the algorithm works best. For the pose estimation algorithms GBM, F-FMP, U-FMP and SPA the size of the body is reflected in the table, while we use the size of the face or head for HTO and FO. Pose specifies the poses for which the algorithm has relatively high performance. GBM works relatively well on all poses, F-FMP and SPA require a roughly vertical pose, and U-FMP works best when the person is facing the camera. Likewise, FO works on front-facing people, while HTO works in all circumstances. Invisibility reflects that GBM does not work as well as FMP when parts of the body are not visible.

The output requirements specify what kind of output each algorithm is able to produce. We can see that among the pose estimation algorithms only U-FMP cannot return a full body pose. In contrast to FO, HTO is not able to give any information about the location of the head. Furthermore, among all algorithms only HTO and FO can return the orientation of body parts, Head+Torso and Head correspondingly.

In order to select the appropriate algorithm for a specific task, the system performs the following steps (a sketch of this selection loop is given after the list):

1. Searches the task condition matrix (Table 6.1) for all methods that would satisfy the input description and output requirements. If no algorithm covers the provided specification, the closest algorithm is chosen (see Section 6.2.2).

2. From the chosen algorithms it selects the ones that satisfy the conditions of the target description, sequentially for each condition in the following order: support for multiple people; support for invisibility, size, orientation and location priors; support for color, texture or motion priors.

3. Chooses the fastest algorithm among the ones obtained in the previous step and adjusts its parameters according to the speed/accuracy requirement.
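The following is a minimal sketch of this three-step selection loop, written against the illustrative CONDITION_MATRIX above; the task encoding and the single target-description filter shown are simplified stand-ins for the full set of checks described in the list.

```python
def select_algorithm(task, matrix, runtimes):
    """Illustrative version of the three-step selection loop above.

    `task` is a plain dict with keys "is_video" (bool), "needs_full_body"
    (bool), "population" ("1" or ">=1") and "speed_over_accuracy" (float);
    `runtimes` maps algorithm names to a rough relative cost used only for
    the final speed-based choice.
    """
    # Step 1: the input type and the output requirement must be satisfiable.
    candidates = [n for n, c in matrix.items()
                  if (task["is_video"] or c["single_image"])
                  and (not task["needs_full_body"] or "FB" in c["joint_locations"])]
    if not candidates:
        return None  # would fall back to the closest-algorithm search (Section 6.2.2)

    # Step 2: apply the target-description filters sequentially; only the
    # population filter is shown here -- visibility, size, orientation,
    # location and distinctiveness priors would be chained the same way.
    narrowed = [n for n in candidates
                if matrix[n]["population"] in (">=1", task["population"])]
    candidates = narrowed or candidates

    # Step 3: take the fastest remaining algorithm; its parameters would then
    # be adjusted from task["speed_over_accuracy"] (Section 6.2.3).
    return min(candidates, key=lambda n: runtimes[n])
```

For example, a single-image, full-body task with one person would be narrowed to GBM and F-FMP in step 1, and step 3 would pick whichever of the two is cheaper according to the supplied runtimes.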
(b) Graph GDO identifiesderiving additional data during the conversion process.does not find any algorithm, the framework repeats it, but uses GDO insteadof GO. In contrast to the previous attempt, the returned algorithm inferscertain parts of the data, and it is labeled as Inferred. For example, thealgorithm may return 2D joint positions, but the conversion procedure willassume the Z axes to be 1 for all joints. The latter procedure will alwaysreturn an algorithm, because the bottom of the chain is a 2D pose estima-tion algorithm included in the framework that works on grayscale images,which will be selected in the worst case.Chapter 6. Abstracting Human Pose Estimation 68Input: Input type I in a graph GI , output requirement O in a graphGOOutput: Pose estimation algorithmAlgorithm ? ?;Queue ? ?; Queue ?? O;for Queue 6= ? doO? ?? Queue; Queue ?? parents(O?)Queue2 ? ?; Queue2 ?? I;for Queue2 6= ? doI? ?? Queue2; Queue2 ?? children(I?)if Exist algorithm A for I? and O? thenAlgorithm ? A;exit;endendendAlgorithm 2: Closest algorithm search. The procedure finds the closestalgorithm that matches input type I and output requirement O, assuminginput type graph GI and output requirement graph GO. By ?? we denotethe operation of taking and putting an element into a set.6.2.3 Parameter DerivationThe task description is also used to derive the appropriate parameters for thechosen algorithm. Often many of the parameters of an algorithm are learnedand included with the model. However, usually there are certain parametersone has to tune according to one?s needs. In our current set of algorithmswe only adjust parameters that affect the speed/accuracy tradeoff. As canbe seen in Table 6.1 all algorithms but FMP can be tuned in accordancewith the user?s requirement for speed. GBM requires scale parameters tobe specified, which are set based on the prior knowledge of target size andspeed constraints. SPA has controls that specify likelihood of guessing thecorrect pose and the amount of smoothing, which is tuned for the requiredlevel of speed/accuracy. FO comes with three pre-trained models of differentlevels of detail and different inference speeds, which are set automatically.HTO directly provides the parameter that affects speed and accuracy.Chapter 6. Abstracting Human Pose Estimation 696.3 Algorithm Selection EvaluationEvery algorithm has a concrete input type that it can accept and outputrequirements it is able to satisfy, which allows us to fill Input Type andOutput Requirements rows of the condition matrix 6.1. However, fillingthe Image Description and Target Description rows requires an insight intothe performance of the algorithms under various task conditions, which wedetermine with the help of experiments. We selected 120 images from fivepose estimation datasets: Buffy Stickmen [16], Image Parse [40], Leeds PoseDataset [26] and Synchronic Activities Stickmen [12]. We selected the imagesbased on the maximum coverage of the task description problem space, andmanually annotated them with the following labels:? The amount of clutter? The amount of occlusion? Lower body visibility flag? Target size? Pose labelWe measure clutter as the distinctiveness from the background in termsof occlusion and clutter. 
6.3 Algorithm Selection Evaluation

Every algorithm has a concrete input type that it can accept and output requirements it is able to satisfy, which allows us to fill the Input Type and Output Requirements rows of the condition matrix (Table 6.1). However, filling the Image Description and Target Description rows requires insight into the performance of the algorithms under various task conditions, which we determine with the help of experiments. We selected 120 images from five pose estimation datasets: Buffy Stickmen [16], Image Parse [40], Leeds Pose Dataset [26] and Synchronic Activities Stickmen [12]. We selected the images based on the maximum coverage of the task description problem space, and manually annotated them with the following labels:

• The amount of clutter
• The amount of occlusion
• Lower body visibility flag
• Target size
• Pose label

We measure clutter as the distinctiveness from the background in terms of occlusion and clutter. Pose is labeled to be either Regular or Unusual and may additionally be Front-Facing (in the abstraction, All is equal to Regular+Unusual). Non-vertical or highly articulated body configurations were labeled as Unusual, those roughly facing the camera as Front-Facing, and all others as Regular. Furthermore, we cropped the images such that each contained only a single person in order to maintain consistency, as GBM works only on single-person images.

We ran GBM, F-FMP and U-FMP on 70 of the 120 selected images and filled the task condition matrix based on the obtained evidence of the performance of the algorithms, leaving the remaining 50 images for testing. We found that F-FMP is the preferable algorithm for the task of full-body pose estimation in the presence of clutter, as GBM is more likely to pick suitable clutter as a body part. At the same time GBM works better in the absence of clutter, while FMP is more likely to miss a body part in an uncluttered environment. We think this can be explained by the fact that GBM utilizes segmentation to better distinguish the foreground from the background. Furthermore, F-FMP produces limb double counting more frequently than GBM, which may be explained by the fact that it is a tree model which does not share knowledge between its limbs. F-FMP also produces subtly less accurate results in the case of non-vertical or highly articulated poses, as it seems to have stronger priors towards such poses in its training dataset. We use SPA in the cases when a video sequence is available.

For upper body pose estimation we found that U-FMP is the preferred method when the image is sufficiently large or the person is facing the camera. It also works better than the other algorithms when prior knowledge of the lower body being occluded is given, although F-FMP is more likely than GBM to correctly detect the hands of the person even when there is no evidence of legs in the image. For the task of head yaw estimation we found that the main control that affects the selection between HTO and FO is the prior knowledge of the size of the head or its orientation.

In order to evaluate the task-based algorithm mapping, we ran the three pose estimators GBM, F-FMP and U-FMP on the remaining 50 images and determined the system success rate, which was defined by how often the best algorithm for a specific task is selected. The success rate was 76%, and in cases where the system picked a non-optimal method the difference in accuracy between the best and the selected algorithms was approximately 15%, so the result would still be relatively close to optimal. This supports our insights on the performance of the algorithms and shows that the system selects pose estimation algorithms reasonably well. Figures 6.4–6.6 illustrate some examples of the results of our algorithm selection process. Every row corresponds to a single algorithm, while every column represents an image with a task description. The algorithm selected by our system is marked in green, the correct one is marked in blue.

6.4 Discussion and Future Work

This chapter is intended to demonstrate that a wide-coverage abstraction over human pose estimation targeted at non-expert users is possible, and that a task-based approach provides a reasonable level of control while still hiding the complexity of algorithmic details and parameters under a single flexible application programming interface.
Our abstraction utilizes various kinds of input data together with an image description, a target prior knowledge definition and output requirements to cover a large volume of the pose detection problem space. The condition matrix, together with the closest algorithm search procedure (Algorithm 2), maps the task description to a suitable pose estimation algorithm and automatically derives the necessary parameters. Our results demonstrate the advantages of the current approach.

Figure 6.4: Algorithm mapping evaluation for full-body pose estimation. Images with various descriptions are taken: (a) Regular Pose, High Clutter, Low Occlusion, Small Size; (b) Regular Pose, Medium Clutter, Low Occlusion, Large Size; (c) Unusual Pose, High Clutter, Low Occlusion, Small Size. The algorithm selected by the abstraction is marked in green, the correct one is marked in blue. Algorithms from top to bottom: GBM [43], F-FMP [55]. Best viewed in color.

The main flaw of the system in its current state is the fact that it requires expert knowledge to be encoded into the condition matrix. If one wants to add an algorithm to the current framework, one has to run it on the same set of images we used to test the other algorithms and expand the condition matrix in accordance with the results. However, there is no guarantee that the parameters we chose in our framework, such as clutter and occlusion, give the best prediction of the efficiency of the algorithms.

A solution to the above might be provided by an automated run-time algorithm selection process based on features extracted from the input images or video sequences. This may be guided by a multi-class classification procedure, where every class corresponds to a certain algorithm that fits the given task conditions. During inference the system would select the best algorithm according to the output of the classifier. The classifier may be trained on features extracted from a set of training images with annotated poses, which would be coupled with class labels specifying the best-performing pose estimation algorithm.

Figure 6.5: Algorithm mapping evaluation for head yaw estimation. Images with various descriptions are taken: (a) Large Size, Non-Front-Facing; (b) Small Size, Front-Facing; (c) Large Size. The algorithm selected by the abstraction is marked in green, the correct one is marked in blue. Algorithms from top to bottom: HTO [34], FO [57]. Empty annotations reflect algorithm failure. Best viewed in color.

In our opinion a broad set of various features may be beneficial. Together with a weighting classifier such as an SVM, one may find out which features have the most influence in the algorithm selection process. Furthermore, in practice feature computation should not be time-consuming, so features like BRIEF [7] or Haar-like features [50] may come into play. With this setup new algorithms can be seamlessly added to the framework, without any prior expert knowledge about their performance, as the training process outlined above would automatically re-train the classifier. Furthermore, the set of spatial features may be changed or expanded at any time with the same effect. Note that similar approaches have been tried in other areas of Computer Science, utilizing machine learning for aiding the algorithm selection process [23] and runtime prediction [24]. In addition, one may explore ways of automatically adjusting the parameters of each algorithm in the system. We leave such an automated classification procedure to future work.
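To make this direction concrete, below is a minimal sketch of such a classifier-based selector, using scikit-learn's SVC as the multi-class classifier and a deliberately crude global feature vector in place of the BRIEF or Haar-like features mentioned above; the function names and the training labels are placeholders, not part of the presented framework.

```python
import numpy as np
from sklearn.svm import SVC

def scene_features(image):
    """Cheap global features standing in for richer descriptors
    (e.g. BRIEF or Haar-like features) suggested in the text."""
    gray = image.mean(axis=2) if image.ndim == 3 else image
    gy, gx = np.gradient(gray.astype(float))
    grad = np.hypot(gx, gy)
    return np.array([gray.mean(), gray.std(), grad.mean(), grad.std()])

def train_selector(images, best_algorithm_labels):
    """Fit a multi-class classifier whose classes are algorithm names.

    `best_algorithm_labels[i]` names the algorithm that performed best on
    `images[i]` in an offline evaluation; producing these labels is the
    annotation effort described above.
    """
    X = np.stack([scene_features(img) for img in images])
    clf = SVC(kernel="rbf")          # any multi-class classifier would do
    return clf.fit(X, best_algorithm_labels)

def select_at_runtime(clf, image):
    """Predict which algorithm to run on a new image."""
    return clf.predict(scene_features(image)[None, :])[0]
```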
Figure 6.6: Algorithm mapping evaluation for upper-body pose estimation. Images with various descriptions are taken: (a) Medium Clutter, Low Occlusion, Small Size; (b) Low Clutter, Low Occlusion, Small Size; (c) Low Clutter, Large Size, Lower Body Invisible. The algorithm selected by the abstraction is marked in green, the correct one is marked in blue. Algorithms from top to bottom: GBM [43], F-FMP [55], U-FMP [55]. Best viewed in color.

Chapter 7

Conclusion

In this thesis we consider the problem of human pose estimation. We present two novel algorithms for monocular 2D pose estimation from video sequences. The first one aggregates information from adjacent frames and then searches for a shortest path of pose estimates from the output of a single-image pose estimator throughout the whole video sequence, significantly outperforming the state of the art for single-image pose estimation. The second algorithm utilizes a spatio-temporal tree model and for every video frame performs articulated human detection, taking into account several previous frames, demonstrating state-of-the-art pose estimation performance.

Furthermore, we release the UCF Sports Pose dataset, which consists of full-body human pose annotations for a subset of videos from the UCF Sports Action dataset that contain people in roughly vertical positions. We also propose a new metric for the evaluation of pose estimation results, which better reflects the performance of the current state-of-the-art algorithms for 2D human pose estimation. In addition, we release a highly configurable Video Pose Annotation tool that greatly simplifies the manual process of annotating poses in video sequences.

Finally, we present a novel abstraction over human pose estimation that captures a large volume of the pose estimation problem space. The abstraction comes with a notion of a task description, which includes the description of the input data and the output requirements for the pose estimation problem. It also includes a meta-algorithm that maps a task description to a pose estimation algorithm that is expected to give the best results on the specific problem, based on the expert knowledge about the algorithms in the framework.

Future work for each part of this thesis is discussed in detail at the end of each chapter. The future work on the first video pose estimation algorithm may focus on the utilization of a better and faster method of tracking the poses. In addition, learning of the patterns of temporal co-occurrence of appearance may be employed. An alternative research direction is to investigate ways to make the algorithm work in an on-line setting. Future work on the second algorithm encompasses joint spatio-temporal learning, improved tracking of poses and exploration of different tree models.

The abstraction of human pose estimation may benefit from the utilization of a machine learning technique in place of the expert knowledge directly encoded into the system. The technique may work on image features and consider the task as a classification problem, where each class corresponds to a specific pose estimation algorithm. Furthermore, one may want to explore ways to automatically tune the parameters of each algorithm.

Bibliography

[1] S. Amin, M. Andriluka, M. Rohrbach, and B. Schiele. Multi-view pictorial structures for 3D human pose estimation. In British Machine Vision Conference, 2013.
[2] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[3] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. 2008.

[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In IEEE European Conference on Computer Vision, volume 3024, pages 25–36, 2004.

[5] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision, 61:211–231, 2005.

[6] P. Buehler, M. Everingham, D. P. Huttenlocher, and A. Zisserman. Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision, 95:180–197, 2011.

[7] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In IEEE European Conference on Computer Vision, pages 778–792, 2010.

[8] K. Chiu and R. Raskar. Computer vision on tap. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 31–38, 2009.

[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893, 2005.

[10] M. Eichner and V. Ferrari. Better appearance models for pictorial structures. In British Machine Vision Conference, 2009.

[11] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In IEEE European Conference on Computer Vision, 2010.

[12] M. Eichner and V. Ferrari. Human pose co-estimation and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 2282–2288, 2012.

[13] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In IEEE Conference on Computer Vision and Pattern Recognition, pages 617–624, 2011.

[14] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645, 2010.

[15] P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. Technical report, Cornell Computing and Information Science, 2004.

[16] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[17] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In International Conference on Machine Learning, pages 304–311, 2008.

[18] O. Firschein and T. M. Strat. RADIUS: Image Understanding for Imagery Intelligence. Morgan Kaufmann, 1st edition, 1997.

[19] K. Fragkiadaki, H. Hu, and J. Shi. Pose from flow and flow from pose. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[20] P. Guan, A. Weiss, A. O. Balan, and M. J. Black. Estimating human shape and pose from a single image. In IEEE International Conference on Computer Vision, 2009.

[21] K. Hara and R. Chellappa. Computationally efficient regression on a dependency graph for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[22] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
[23] F. Hutter, D. Babić, H. H. Hoos, and A. J. Hu. Boosting verification by automatic tuning of decision procedures. In Formal Methods in Computer Aided Design, pages 27–34, 2007.

[24] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algorithm runtime prediction: The state of the art. Artificial Intelligence Journal, abs/1211.0906, 2012.

[25] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In IEEE International Conference on Computer Vision, 2013.

[26] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference, 2010.

[27] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In IEEE International Conference on Pattern Recognition, pages 2756–2759, 2010.

[28] A. Kläser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In British Machine Vision Conference, pages 995–1004, 2008.

[29] C. Kohl and J. Mundy. The development of the image understanding environment. In IEEE Conference on Computer Vision and Pattern Recognition, pages 443–447, 1994.

[30] K. Konstantinides and J. R. Rasure. The Khoros software development environment for image and signal processing. IEEE Transactions on Image Processing, 3:243–252, 1994.

[31] L. Ladický, P. H. S. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[32] C. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. PhD thesis, 2009.

[33] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674–679, 1981.

[34] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3177–3184, 2011.

[35] T. Matsuyama and V. Hwang. SIGMA: A framework for image understanding integration of bottom-up and top-down analyses. In International Joint Conference on Artificial Intelligence, volume 2, pages 908–915, 1985.

[36] G. Miller and S. Fels. OpenVL: A task-based abstraction for developer-friendly computer vision. In IEEE Winter Application and Computer Vision Conference, pages 288–295, 2013.

[37] T. B. Moeslund, A. Hilton, V. Krüger, and L. Sigal, editors. Visual Analysis of Humans – Looking at People. 2011.

[38] G. Panin. Model-based Visual Tracking: The OpenTL Framework. John Wiley and Sons, 2011.

[39] J. Peterson, P. Hudak, A. Reid, and G. D. Hager. FVision: A declarative language for visual tracking. In Third International Symposium on Practical Aspects of Declarative Languages, pages 304–321, 2001.

[40] D. Ramanan. Learning to parse images of articulated bodies. In Advances in Neural Information Processing Systems, 2006.

[41] D. Ramanan, D. A. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 271–278, 2005.

[42] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[43] B. Rothrock, S. Park, and S.-C. Zhu. Integrating grammar and segmentation for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[44] B. Sapp, D. Weiss, and B. Taskar. Parsing human motion with stretchable models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1281–1288, 2011.

[45] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In International Conference on Multimedia, 2007.

[46] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[47] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87:4–27, 2010.

[48] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2D and 3D pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[49] Y. Tian, R. Sukthankar, and M. Shah. Spatiotemporal deformable part models for action detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2642–2649, 2013.

[50] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 511–518, 2001.

[51] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[52] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3169–3176, 2011.

[53] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:283–298, 2008.

[54] S. Wu, O. Oreifej, and M. Shah. Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In IEEE International Conference on Computer Vision, pages 1419–1426, 2011.

[55] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[56] T.-H. Yu, T.-K. Kim, and R. Cipolla. Unconstrained monocular 3D human pose estimation by action detection and cross-modality regression forest. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[57] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.

[58] S. Zuffi, J. Romero, C. Schmid, and M. J. Black. Estimating human pose with flowing puppets. In IEEE International Conference on Computer Vision, 2013.