UBC Theses and Dissertations


Multiview depth-based pose estimation Shafaei, Alireza 2015


Multiview Depth-based Pose Estimation

by

Alireza Shafaei

B.Sc., Amirkabir University of Technology, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2015

© Alireza Shafaei, 2015

Abstract

Commonly used human motion capture systems require intrusive attachment of markers that are visually tracked with multiple cameras. In this work we present an efficient and inexpensive solution to markerless motion capture using only a few Kinect sensors. We use our system to design a smart home platform with a network of Kinects that are installed inside the house.

Our first contribution is a multiview pose estimation system. Unlike the previous work on 3d pose estimation using a single depth camera, we relax constraints on the camera location and do not assume a co-operative user. We apply recent image segmentation techniques with convolutional neural networks to depth images and use curriculum learning to train our system on purely synthetic data. Our method accurately localizes body parts without requiring an explicit shape model. The body joint locations are then recovered by combining evidence from multiple views in real-time.

Our second contribution is a dataset of 6 million synthetic depth frames for pose estimation from multiple cameras with varying levels of complexity to make curriculum learning possible. We show the efficacy and applicability of our data generation process through various evaluations. Our final system exceeds the state-of-the-art results on multiview pose estimation on the Berkeley MHAD dataset.

Our third contribution is a scalable software platform to coordinate Kinect devices in real-time over a network. We use various compression techniques and develop software services that allow communication with multiple Kinects through TCP/IP.
The flexibility of our system allows real-time orchestration of up to 10 Kinect devices over Ethernet.

Preface

The entire work presented here has been done by the author, Alireza Shafaei, with the collaboration and supervision of James J. Little. A manuscript describing the core of our work and our results has been submitted to the IEEE Conference on Computer Vision and Pattern Recognition (2016) and is under anonymous review at the moment of thesis submission.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
    1.1 Kinect Sensor
    1.2 Our Scenario
    1.3 Datasets
    1.4 Pose Estimation
    1.5 Outline
2 Related Work
    2.1 Pose Estimation
        2.1.1 Single-view Pose Estimation
        2.1.2 Multiview Depth-based Pose Estimation
    2.2 Dense Image Segmentation
    2.3 Curriculum Learning
3 System Overview
    3.1 The General Context
    3.2 High-Level Framework Specification
    3.3 Internal Structure and Data Flow
        3.3.1 Camera Registration and Data Aggregation
        3.3.2 Pose Estimation
4 Synthetic Data Generation
    4.1 Sampling Human Pose
    4.2 Building Realistic 3d Models
    4.3 Setting Camera Location
    4.4 Sampling Data
    4.5 Datasets
    4.6 Discussion
5 Multiview Pose Estimation
    5.1 Human Segmentation
    5.2 Pixel-wise Classification
        5.2.1 Preprocessing the Depth Image
        5.2.2 Dense Segmentation with Deep Convolutional Networks
        5.2.3 Designing the Deep Convolutional Network
    5.3 Classification Aggregation
    5.4 Pose Estimation
    5.5 Discussion
6 Evaluation
    6.1 Training the Dense Depth Classifier
    6.2 Evaluation on UBC3V Synthetic
        6.2.1 Dense Classification
        6.2.2 Pose Estimation
    6.3 Evaluation on Berkeley MHAD
        6.3.1 Dense Classification
        6.3.2 Pose Estimation
    6.4 Evaluation on EVAL
7 Discussion and Conclusion
Bibliography

List of Tables

Table 4.1: Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter and D refers to the distance parameter as described in Figure 4.5. The simple set is the subset of postures that have the label ‘walk’ or ‘run’. Going from the first dataset to the second would require pose adaptation, while going from the second to the third dataset requires shape adaptation.

Table 6.1: The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2 respectively.

Table 6.2: Mean and standard deviation of the prediction error by testing on subjects and actions with the joint definitions of Michel et al. [28]. We also report and compare the accuracy at the 10cm threshold.

List of Figures

Figure 1.1: The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.

Figure 1.2: A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.

Figure 1.3: An overview of our pipeline.
In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.

Figure 3.1: The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.

Figure 3.2: The high-level representation of data flow within our pipeline. The pose estimation block operates independently from the number of the active Kinects.

Figure 3.3: An example to demonstrate the output result of camera calibration. The blue and the red points are coming from two different Kinects facing each other but they are presented in a unified coordinate space.

Figure 3.4: The pose estimation pipeline in our platform.

Figure 4.1: The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera location to generate realistic training data.

Figure 4.2: Random samples from MotionCK as described in Section 4.1.

Figure 4.3: Regions of interest in our humanoid model. There are a total of 43 different body regions color-coded as above. (a) The frontal view and (b) the dorsal view.

Figure 4.4: All the 16 characters we made for synthetic data generation. Subjects vary in age, weight, height, and gender.

Figure 4.5: An overview of the extrinsic camera parameters inside our data generation pipeline.

Figure 4.6: Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.

Figure 4.7: Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.

Figure 4.8: Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are corresponding depth images.

Figure 5.1: Our framework consists of four stages through which we gradually build higher level abstractions. The final output is an estimate of human posture.

Figure 5.2: Sample human segmentation in the first stage of our pose estimation pipeline.

Figure 5.3: Sample input and output of the normalization process. (a,b) the input from two views, (c,d) the corresponding foreground mask, (e,f) the normalized image output. The output is rescaled to 250×250 pixels. The depth data is from the Berkeley MHAD [31] dataset.

Figure 5.4: Our CNN architecture. The input is a 250×250 normalized depth image. The first row of the network generates a 44×14×14 coarsely classified depth with a high stride. Then it learns deconvolution kernels that are fused with the information from lower layers to generate finely classified depth. Like [26] we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in the parenthesis within each block is the number of the corresponding channels.

Figure 6.1: Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.

Figure 6.2: Front depth camera samples of all the subjects in the EVAL [12] dataset.

Figure 6.3: The reference groundtruth classes of UBC3V synthetic data.

Figure 6.4: The confusion matrix of Net 3 estimates on the Test set of Hard-Pose.

Figure 6.5: The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right). The images are in their original size.

Figure 6.6: The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).

Figure 6.7: Mean average joint prediction error on the groundtruth and the Net 3 classification output. The error bar is one standard deviation. The average error on the groundtruth is 2.44cm, and on Net 3 is 5.64cm.

Figure 6.8: Mean average precision of the groundtruth dense labels and the Net 3 dense classification output with accuracy at threshold 10cm of 99.1% and 88.7% respectively.

Figure 6.9: Dense classification result of Net 3 together with the original depth image on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.

Figure 6.10: Blue color is the motion capture groundtruth on the Berkeley MHAD [31] and the red color is the linear regression pose estimate.

Figure 6.11: Pose estimate mean average error per joint on the Berkeley MHAD [31] dataset.

Figure 6.12: Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.

Figure 6.13: Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has been only trained on synthetic data.

Acknowledgments

I would like to express my sincerest gratitude to my supervisor and mentor, Professor James J. Little, who has always been considerate, supportive, and most importantly, patient with me. His guidance and support in academia and life have been truly invaluable.

I also would like to thank Professor Robert J.
Woodham for the intellectually stimulating conversations in almost every encounter. My appreciations go to Prof. David Kirkpatrick, Prof. Nick Harvey, and Prof. Mark Schmidt, from whom I learned innumerable lessons. I am thankful to the Computer Science Department staff who have helped me in various circumstances in an outstandingly professional and courteous manner.

I would like to thank Ankur Gupta and Bita Nejat for their friendship and support throughout the difficult times, and for continually going out of their way to offer solace. I am thankful to the other fellow graduate students whom I had the pleasure of knowing and working with.

My greatest gratitude is reserved for my parents, Asghar Shafaei and Zahra Baghaei, for teaching me what is truly valuable. I am eternally indebted to their never-ending support and comfort.

Dedication

Chapter 1

Introduction

Pose estimation in computer vision is the problem of determining an approximate skeletal configuration of people in the environment through visual sensor readings (see Figure 1.1). Postural information provides a high level abstraction over the visual input which serves as a foundation for other computer vision tasks such as activity understanding, automatic surveillance, and gesture recognition, to name a few.

However, possible applications of pose estimates are not limited only to computer vision. For instance, postural information is used in human motion capture for computer generated imagery. Within this context, a 3d character is visualized while moving like a real human model, used mostly in movie and gaming products. Currently the most reliable and accurate existing method is to use markered motion capture, that is, a collection of retroreflective markers are attached to the subjects, usually in the form of specialized clothing, and then tracked by infrared cameras.
A reliable pose estimation method can virtually replace the existing motion capture systems in any context.

Similarly, in human computer interaction one can use this abstract information to design user interfaces that actively collaborate with a person. Interactive augmented reality simulations, for instance, can benefit from real time human interaction to provide an immersive, exciting, and educative experience.

In medical care, systems that are capable of pose estimation can be used to facilitate early diagnosis of cognitive decline in patients. Pose estimation also opens the possibility for doctors to perform remote physiotherapy and to ensure the patient is performing the activities by examining the exact movements. In e-health we can also use postural information to monitor the elderly who prefer to live alone. By careful analysis one can generate emergency notifications in case of an accident.

Figure 1.1: The goal of pose estimation is to learn to represent the postural information of the left image abstractly as shown in the right image.

1.1 Kinect Sensor

The Kinect sensor provides a high resolution RGB video stream, as well as a specialized depth stream, where for each pixel instead of color we get a measure of distance from the camera. This depth image can then be used to reconstruct a volumetric model of the observed space. The depth image is usually referred to as 2.5d data because we get one depth reading along each ray through an image pixel.

This sensor was popularized by Microsoft to facilitate game development with human interactions. For example, in one game the player can learn to perform dance moves correctly; or in another, a player can interact with virtual animals through augmented reality. In the research community people have been using this sensor extensively in robotics to perform tasks more accurately.
Such a sensor can simplify important indoor problems that otherwise would be difficult to deal with through mere color data, such as obstacle avoidance, or determining the correct way to grasp objects.

A topic of current interest with the Kinect sensor is human pose estimation. Using a depth image has a few attractive properties that make pose estimation easier than using data from the color domain. With the additional depth information, we no longer need to worry about the scale ambiguity that is an ever-present problem with single color images (i.e., is the object small and close, or big and distant?). Furthermore, the depth image primarily captures the shape of the objects rather than their visual pattern (see Figure 1.2). This brings in invariance to complex visual patterns for free; we can focus solely on the shape of the observation, which is greatly beneficial for the pose estimation problem. In domains such as pose estimation with RGB images, the high clothing variation, which alters the visual pattern, is itself a major source of complication in learning usable filters.

As part of the software development kit released for Kinect, Microsoft also provides APIs to automatically perform pose estimation. The original software developed by Microsoft makes the assumption that there is a cooperative user facing the camera and interacting with the system. While this assumption is valid in the home entertainment setting, the resulting pose estimation algorithm does not generalize well in broader scenarios in which the user is not necessarily facing the camera or is not cooperative.

This limitation has led researchers to work on more robust pose estimation algorithms.
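To make the 2.5d nature of the depth stream concrete, the following sketch back-projects every valid depth pixel into a 3d point cloud with a pinhole camera model. This is a minimal illustration, not Kinect SDK code; the image size and the intrinsic parameters (fx, fy, cx, cy) are hypothetical placeholders rather than the calibration of an actual sensor.

```python
import numpy as np

def backproject(depth_mm, fx, fy, cx, cy):
    """Convert a depth image (millimeters) into an (N, 3) point cloud in meters.

    Pixels with depth 0 (no reading) are dropped. This is plain pinhole
    back-projection; a real pipeline would also undistort the image.
    """
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float64) / 1000.0          # millimeters -> meters
    x = (u - cx) * z / fx                             # one ray per pixel
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                   # keep valid readings only

# Hypothetical 512x424 depth frame with a single 2 m reading at the
# principal point; everything else is "no reading".
depth = np.zeros((424, 512), dtype=np.uint16)
depth[212, 256] = 2000
cloud = backproject(depth, fx=365.0, fy=365.0, cx=256.0, cy=212.0)
```

The resulting `cloud` contains one point at roughly (0, 0, 2) in camera coordinates, which is the volumetric reconstruction the text refers to, one point per depth ray.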
Ever since the original publication [36] there have been numerous attempts at improving the empirical results of pose estimation [e.g., 1, 13, 17, 43, 44]. However, much of this research has been focused on performing pose estimation with only a single-view depth image.

One of the common obstacles with using multiple Kinect 1 devices in the past has been the interference issue between Kinects that are aimed in the same direction. The early versions of this sensor determined depth by emitting a patterned infrared light and analyzing its reflection. When these patterns collide, the Kinect is either unable to determine depth, or the result of the computation is highly erroneous.

Figure 1.2: A sample depth image. Each shade of gray visualizes a different depth value. The closer the point, the darker the corresponding pixel. The white region is too distant or too noisy, making the sensor readings unreliable.

With the recent release of Kinect 2 with a new technology, using multiple Kinects has finally become practical; but presently there is a relatively small literature on pose estimation with multiple Kinects.

1.2 Our Scenario

While many of the current methods focus on a single depth camera, using just a single camera has one inherent limitation that is not possible to deal with: when there is occlusion in the scene, there is no way to be sure of the hidden body part's pose. But if a second camera is present there is a higher chance of observing the occluded region from a different viewpoint, and hence, having a more reliable output.

In this thesis our work is focused on generating better pose estimates with multiple Kinect sensors. By doing so we hope to perform better pose estimation in human monitoring scenarios. The basic idea is to install a set of Kinects inside our target environment and non-intrusively perform pose estimation on a not necessarily cooperative user.
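Combining evidence from several cameras presupposes that every camera reports points in one shared coordinate frame. A minimal sketch of that registration step, assuming each Kinect's extrinsics (a rotation R and translation t mapping camera coordinates to a common world frame) have already been calibrated; the specific numbers below are hypothetical, chosen so that two cameras facing each other agree on the same world point:

```python
import numpy as np

def to_world(points_cam, R, t):
    """Map an (N, 3) point cloud from camera coordinates to world coordinates."""
    return points_cam @ R.T + t

# Camera 1 sits at the world origin looking along +z.
R1, t1 = np.eye(3), np.array([0.0, 0.0, 0.0])

# Camera 2 sits 4 m away at world (0, 0, 4), rotated 180 degrees about
# the y-axis so that it faces camera 1.
R2 = np.array([[-1.0, 0.0,  0.0],
               [ 0.0, 1.0,  0.0],
               [ 0.0, 0.0, -1.0]])
t2 = np.array([0.0, 0.0, 4.0])

# A point 2 m in front of each camera lands at the same world location,
# so the two clouds can simply be stacked after the transform.
p1 = to_world(np.array([[0.0, 0.0, 2.0]]), R1, t1)
p2 = to_world(np.array([[0.0, 0.0, 2.0]]), R2, t2)
merged = np.vstack([p1, p2])
```

Once all clouds live in the same frame, downstream processing can be independent of how many cameras contributed points.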
Our first contribution is a generalized framework for multiview depth-based pose estimation that is inspired by the existing literature. We split the problem into several subproblems that we solve separately. Our approach enables using the state-of-the-art for each subproblem independently. We also instantiate the said framework with simple, yet effective, methods and discuss various design decisions and their effects on the final result.

The targeted application of such a system is to facilitate e-health solutions within home environments. As part of our work we also develop a lightweight, flexible, platform independent, and distributed software framework to coordinate the orchestration and data communication of several Kinect sensors in real-time. We use this system as our infrastructure for experiments and evaluations.

We present our software framework in Chapter 3. In Chapter 5 we describe our abstract pose estimation framework and instantiate it with state-of-the-art image segmentation and a simple, yet effective, pose estimation algorithm.

1.3 Datasets

One of the main challenges in this field is the absence of datasets for training or even a standard benchmark for evaluation. At the time of writing the only dataset that provides depth from more than one viewpoint is the Berkeley Multimodal Human Action Database (MHAD) [31], with only two synchronized Kinect readings.

Collecting useful data is a lengthy and expensive process with unique challenges. For instance, reliably annotating posture on multiview data is by itself an expensive task for humans to do, let alone dealing with the likely event of annotators not agreeing on the same exact label.
One can also resort to using external motion capture systems to accurately capture joint configurations while recording the data; however, a reliable motion capture requires wearing marker sensors that alter the visual appearance.

At the same time, with the advances in the graphics community, it is not difficult to solve the forward problem; that is, given a 3d model and a pose, generate a realistic depth image from multiple viewpoints. In contrast, we are interested in the inverse problem, that is, to infer the 3d pose from multiview depth images. One way to benefit from the advances in graphics is to synthesize data. Using synthetic depth data for pose estimation has previously been proposed by Shotton et al. [36]. Interestingly, they show that their synthetic data provides a challenging baseline for the pose estimation problem. Unfortunately this pipeline, or even their data, is not publicly available, and we believe the technical difficulties of building such a pipeline may have discouraged many researchers from taking this particular direction.

As part of our contribution we adopt and implement a data generation pipeline to render realistic training data for our work. In Chapter 4 we thoroughly discuss the challenges, details, and differences with the previous work. Note that our contribution here removes a huge obstacle and makes further research within this realm possible. The developed pipeline, together with a compilation of datasets with varying levels of complexity, has been prepared and will be released to encourage further research in this direction.

1.4 Pose Estimation

The main focus of this thesis is on the problem of estimating human postures given multiview depth information. Our work differs from the previous work in the following perspectives.

Multiview Depth. Most prior work has focused on single view depth information.
While there is a substantial amount of work on multiview RGB-based pose estimation, there are relatively few publications on multiview depth-based pose estimation. Furthermore, the absence of datasets further complicates our work. As part of our contribution we also release a dataset for multiview pose estimation.

Context Assumptions. The major focus of pose estimation papers has been on a limited context. For instance, the method that powers the Kinect SDK assumes that the camera is installed for home entertainment scenarios and a cooperative user is using the system. In contrast, our focus is on home monitoring, where cameras are more likely to be installed on the walls rather than in front of the television or the user. Furthermore, the user is not necessarily cooperative, or even aware of the existence of such a system in the environment. Our goal is to improve upon single-view pose estimation techniques by analyzing the aggregated information of multiple views. Such a system will enable the application of pose estimation in broader contexts.

Figure 1.3: An overview of our pipeline. In this hypothetical setting three Kinect 2 devices are communicating with a main hub where the depth information is processed to generate a pose estimate.

At a high level our system connects to n Kinects that are installed within an environment. We then retrieve depth images from all of these Kinects to find and estimate the posture of the individuals who are present and observable. Each person may be visible from one or more Kinects. The output of our system is the location of each predefined skeletal joint in R^3. These steps are visually illustrated in Figure 1.3. Further discussion will be presented in Chapter 5.

1.5 Outline

In Chapter 2 we explore the literature of pose estimation, and for the unfamiliar reader we also present some of the fundamental methods that will facilitate comprehension of the subsequent chapters.
Chapter 3 is dedicated to the high-level abstraction of our scenario and the infrastructure that we have designed for our experiments. It is the requirement engineering portion of our project, where we describe all the steps that are required to prepare the environment.

In Chapter 4 we describe the data synthesis procedure and its challenges. At the end of this chapter we describe the properties of the datasets that we use for our work. Our abstract pose estimation framework is defined and motivated in Chapter 5. As we develop the framework we discuss the possible approaches that one can take, which serve as possible future work for the interested reader. We also describe the chosen elements that we later experiment on in the subsequent chapter.

Chapter 6 includes all the experiments that we have conducted together with a discussion of the results. We conclude in Chapter 7 and provide possible future directions that may be of interest.

Chapter 2

Related Work

In this chapter we present relevant background material for the methods in this thesis. In Section 2.1 we precisely define the pose estimation problem and present a summary of recent work. We then look at the image segmentation problem in Section 2.2 and discuss recent progress that underlies the foundation of our pipeline. To train our models we apply curriculum learning, which is presented in Section 2.3.

2.1 Pose Estimation

Previous work on real-time pose estimation can be categorized into top-down and bottom-up methods. Top-down or generative methods define a parametric shape model based on the kinematic properties of the human body. These models generally require expensive optimization procedures to be initialized with parameters that accurately explain the presented evidence. After the initialization step the parameter estimates are used as a prior for tracking the subject [11, 12, 18, 43].

Top-down methods require an accurate shape model in order to generate a reasonably precise pose estimate.
A common practice for shape estimation is to a priori adapt the basic model to fit the physical properties of the test subjects. The shape estimation process usually requires a co-operative user taking a neutral pose such as the T-pose (i.e., standing erect with hands stretched) at the beginning, which makes it difficult to apply top-down methods in non-cooperative scenarios.

Bottom-up discriminative models, the second category of approaches, directly focus on the current input to identify individual body parts, usually down to pixel-level classification. These estimates are then used to generate hypotheses about the configuration of the body that usually neglect higher-level kinematic properties and may give unlikely or impossible results. However, bottom-up methods are fast enough to be combined with a separate tracking algorithm that ensures the labeling is consistent and correct throughout the subsequent frames. Random-forest-based techniques have been shown to be an efficient approach for real-time performance in this domain [14, 34, 37, 44].

Pose estimation is the problem of determining skeletal joint locations of human subjects within a given image. More formally, given an image I the problem is to detect all human subjects s ∈ S_I in the image I and determine the joint configuration of each subject s as P_s = {p_1^s, ..., p_n^s}, where p_i^s corresponds to the location of the i-th joint for subject s, and n is the total number of predefined joints.

There are two variations of pose estimation: 3d pose estimation and 2d pose estimation. In 3d pose estimation we are interested in finding the 3d real-world joint locations (i.e., p_i^s ∈ R^3), while in the 2d setting we only want to label particular pixel coordinates on the spatial input such as an image.

While 2d images are primarily used to generate 2d pose estimates, it is also possible to infer 3d pose from a single or multiple 2d images.
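The joint configuration P_s defined above maps naturally onto an (n, 3) array of coordinates. As an illustrative sketch (not taken from the thesis), the same representation directly yields the standard evaluation quantities, per-joint Euclidean error and accuracy at a distance threshold; the 3-joint pose below is a made-up toy example.

```python
import numpy as np

def joint_errors(pred, gt):
    """Euclidean distance per joint between predicted and groundtruth poses.

    pred, gt: (n, 3) arrays, one row per joint p_i in R^3, in meters.
    """
    return np.linalg.norm(pred - gt, axis=1)

# Toy 3-joint pose: prediction offset by 5 cm on one joint, exact elsewhere.
gt = np.array([[0.0, 0.0, 2.0],
               [0.1, 0.5, 2.0],
               [0.2, 1.0, 2.1]])
pred = gt.copy()
pred[1, 0] += 0.05

errs = joint_errors(pred, gt)
mean_error = errs.mean()
# Accuracy at a 10 cm threshold: fraction of joints within 0.1 m of groundtruth.
acc_at_10cm = (errs <= 0.10).mean()
```

These are the kinds of quantities (mean joint prediction error, accuracy at the 10cm threshold) reported in the evaluation chapter.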
The multiple images used for 3d pose estimates can either come from a coherent sequence, such as a video, or simultaneously from different viewpoints. Commercial motion capture systems such as Vicon use up to 8 cameras/viewpoints for real-time joint tracking. However, the cheapest, fastest, and reasonably accurate pose estimates come from 2.5d depth images.

Pose estimates are useful for providing a higher level abstraction of the scene in problems such as action understanding, surveillance, and human computer interaction. An accurate pose estimate could be crucial for reliable action understanding [21]. One of the benefits of having a reliable pose estimate is the possibility of defining view-invariant features that can significantly decrease our dependence on training data with multiple viewpoints.

We expect a pose estimation process to operate under various constraints such as time, memory, and processing power. This is important because pose estimation is potentially the beginning of a longer pipeline that is resource demanding, whether it is human computer interaction or action recognition. Furthermore, if we wish to apply pose estimation tasks in embedded systems, or even mobile devices, satisfying the resource constraints is even more critical.

Two types of challenges arise in pose estimation: appearance and structural. The appearance problem refers to the way the human body is captured under lighting conditions, varied clothing, and different views, which makes it difficult to recognize the body in random settings.
The structural problem refers to the exponentially large number of possible configurations of the human body and the resulting ambiguities; an exhaustive search through all possible joint configurations is simply not feasible, so an effective system must capture all configurations in a way that lets it choose a sensible pose within a reasonable time.

2.1.1 Single-view Pose Estimation

RGB Based

Bourdev and Malik [4] propose a method to group neighboring body joints based on the groundtruth configuration. They randomly select a window containing at least a few joints and collect training images that have the exact same configuration. Each chosen pattern is called a poselet. It is typical to randomly choose thousands of poselets and learn appropriate filters to detect them later. During evaluation these filters are run on an image and the highest responses vote for the true joint location.

The time complexity of learned poselet models grows linearly with the number of poselets, which can be an obstacle to scalability. Chen et al. [6] present a hierarchical evaluation method to make poselet-based approaches scalable in practice. Bourdev et al. [5] explore an alternative approach by learning generalized deep convolutional networks that generate pose-discriminative features. Gkioxari et al. [15] use poselets in the context of a deformable parts model to accurately estimate pose. The main idea of [15] is to reason based on the relative positions of poselets to remove erroneous results and improve accuracy.

Yang and Ramanan [42] formulate a structural-SVM model to infer the possible body part configurations. The unary terms in their formulation correspond to the local evidence from a learned mixture of HOG filters, and the binary terms are quadratic functions of the relative joint locations that score the determined structure.
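The quadratic pairwise terms are what make exact inference on such tree-structured models fast: the inner minimization min_q (cost(q) + w·(p − q)²) is a generalized distance transform. A hypothetical O(n²) sketch of the quantity being computed (Felzenszwalb and Huttenlocher's lower-envelope algorithm evaluates the same function in O(n)):

```python
def distance_transform(costs, weight=1.0):
    """Naive generalized distance transform over a 1d grid:
    D[p] = min over q of (costs[q] + weight * (p - q)**2).
    This O(n^2) version only illustrates the computed quantity;
    Felzenszwalb and Huttenlocher compute it in O(n)."""
    n = len(costs)
    return [min(costs[q] + weight * (p - q) ** 2 for q in range(n))
            for p in range(n)]

# A part with low cost at position 0 "spreads" its influence quadratically:
# distance_transform([0.0, 10.0, 10.0, 10.0]) -> [0.0, 1.0, 4.0, 9.0]
result = distance_transform([0.0, 10.0, 10.0, 10.0])
```

Running one such transform per edge of the kinematic tree, instead of enumerating all pairs of placements, is what makes the dynamic-programming inference tractable.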
At test time they efficiently pick the maximum-scoring configuration by applying dynamic programming and distance transform operators [8].

Toshev and Szegedy [40] directly learn a deep network regression model from images to body joint locations. Tompson et al. [39] learn a convolutional network to jointly identify body parts and perform belief-propagation-like inference on a graphical model.

Depth Based

Most of the application-oriented approaches to pose estimation rely on depth data or a combination of depth and RGB data. The common preference for depth sensors is due to their capability of operating in low-light or even no-light conditions, providing color-invariant data, and resolving the scale ambiguity of the RGB domain.

One of the successful applications of pose estimation is the Microsoft Kinect and the work of Shotton et al. [36]. Shotton et al. use synthetic depth data and learn random-forest-based pixel classifiers on single depth images. The joint estimates are then derived from the densely classified depth image by applying a mean-shift-based mode-seeking method. Shotton et al. also propose an alternative method that directly learns regression forests to estimate the joint locations.

The use of random forests in [36] allows 200fps performance. Furthermore, the accuracy of a single-frame pose estimate is high enough that no temporal constraint (e.g., tracking) on the input depth stream is necessary. However, since the algorithm of Shotton et al. uses a vast number of decision trees, it is a resource-demanding algorithm for real-time applications. Notably, Shotton et al. assume a home entertainment scenario with a co-operative user, which limits the applicability of their solution.

Baak et al. [1] describe a data-driven approach to pose estimation that runs at 60fps. Their method depends on a good initialization of a realistic 3d model while the co-operative subject takes a neutral pose.
Furthermore, the initialization step depends on a few hyper-parameters that are manually tuned for each dataset.

Ye and Yang [43] describe a probabilistic framework to simultaneously perform pose and shape estimation of an articulated object. They model the input point cloud with a Gaussian Mixture Model whose centroids are defined by an articulated deformable model. Ye and Yang describe an Expectation-Maximization approach to estimate the deformation parameters of a 3d model that best explain the observations. It is possible to run this computationally intensive algorithm in real-time if it is implemented on the GPU.

Ge and Fan [13] introduced a non-rigid registration method called Global-Local Topology Preservation (GLTP). Their method combines two preexisting approaches, Coherent Point Drift [30] and articulated ICP [32], into a complementary hybrid. They first initialize a realistic 3d model, assuming the person holds a neutral pose, and then track each joint similarly to Pellegrini et al. [32]. Their method relies heavily on the target person starting with a neutral pose, which is generally not the case in a monitoring setting. Furthermore, this system is computationally expensive and does not offer real-time performance.

Yub Jung et al. [44] demonstrate a Random-Tree-Walk-based method that achieves 1000fps pose estimation. They learn a regression forest for each joint to guide a walk over the depth image from a random starting point. After a predefined number of steps they use the average location as the joint position estimate. The speed improvement in their method comes from learning random forests per joint rather than per pixel (as opposed to Shotton et al. [36]). This method does not model the structural constraints of a human body; rather, it uses the depth image as a guide to search its spatial space.

2.1.2 Multiview Depth-based Pose Estimation

Michel et al. [28] use multiple depth sensors to generate a single point cloud of the environment.
This point cloud is then used to configure the postural parameters of a cylindrical shape model by running particle swarm optimization offline. The physical description of each subject is manually tuned before each experiment. Phan and Ferrie [33] use optical flow in the RGB domain together with depth information to perform multiview pose estimation in a human-robot collaboration context at a rate of 8fps. Phan and Ferrie report a median joint prediction error of approximately 15cm on a T-pose sequence. Zhang et al. [45] combine the depth data of multiple Kinects with wearable pressure sensors to estimate shape and track human subjects at 6fps.

To the best of our knowledge, these are the only published methods for multiview pose estimation from depth.

2.2 Dense Image Segmentation

Dense image segmentation is the problem of generating per-pixel classification estimates. This particular area of computer vision has been progressing rapidly during the past few months. All of the top competing methods make use of convolutional networks in one way or another.

One of the major obstacles with commonly used convolutional networks for dense classification is that the output of each layer shrinks spatially as we progress towards the end of the pipeline. While this is a desirable property for whole-image classification, in dense classification it effectively leads to high strides in the output space, which gives coarse region predictions.

Long et al. [26] propose a specific type of architecture for image segmentation that uses deconvolution layers to scale up the outputs of individual layers in a deep structure. These deconvolutional layers act as spatially large dictionaries that are combined in proportion to the outputs of the previous layer. Long et al. also fuse information from lower layers with higher layers through summation.

Hu and Ramanan [20] further motivate the use of lower layers by looking at these networks from another perspective.
They show that having top-down propagation of information, as opposed to a purely bottom-up or feed-forward pass, is an essential part of reasoning for a variety of computer vision tasks – something that is motivated by empirical neuroscientific results. The interesting aspect of this work is that we can simulate the top-down propagation by unrolling the network backwards and treating the whole architecture as a feed-forward structure.

Chen et al. [7] show that it is possible to further improve image segmentation by adding a fully connected CRF on top of the deep convolutional network. This approach essentially treats the output of the deep network as features for the CRF, except that these features are also learned automatically by the back-propagation algorithm.

More recently, Zheng et al. [46] present a specialized deep architecture that integrates a CRF within itself. They show that this architecture is capable of performing mean-field-like inference on a CRF with Gaussian pairwise potentials while doing the feed-forward operation. They successfully train this architecture end-to-end and achieve state-of-the-art performance in dense image segmentation.

All of the recent work suggests that it is possible to incorporate the local dependency of the output domain into the deep architecture itself. Building on this observation, we also take advantage of deep architectures as part of our work.

2.3 Curriculum Learning

Bengio et al. [2] describe curriculum learning as a possible approach to training models that involve non-convex optimization. The idea is to rank the training instances by their difficulty. This ranking is then used to train the system by starting with simple instances and gradually increasing the complexity of the instances during the training procedure. This strategy is hypothesized to improve both the convergence speed and the quality of the final local minima [2].

Kumar et al.
[25] later introduced the idea of self-paced learning, where the system decides which instance is more important to learn next – in contrast to the earlier approach, where an oracle had to define the curriculum before training starts. More recently, Jiang et al. [23] combined these two methods into an adaptive approach to curriculum learning that takes the classifier's feedback into consideration while following the original curriculum guidelines.

Our experiments suggest that a controlled approach to training deep convolutional networks can be crucial for training a better model, providing an example of curriculum learning in practice.

Chapter 3
System Overview

In this chapter we present an overview of our environment and the developed system. In Section 3.1 we describe the general context of our problem and highlight a few important details. We then present the high-level specification of our system in Section 3.2. Section 3.3 is dedicated to the internal structure and the data flow within our system.

3.1 The General Context

The vision of our project is to non-obtrusively collect activity information within home environments. The target application is e-health care and automated monitoring of people who may suffer from physical disabilities and require immediate attention in case of an accident. A collection of sensors is installed inside the house and a central server processes all the information. The specific sensor that we will be using is the Microsoft Kinect 2; however, the developed framework is capable of incorporating more sources of information.

The Kinect 2 sensor is, abstractly, a consolidated set of microphones, infrared sensors, a depth camera, and an HD RGB camera. The relatively cheap price and availability of this sensor have generated immense interest in developing systems with multiple Kinect sensors.
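The oracle-defined schedule itself is simple to express. A toy sketch of the training loop, where `train_step` and the numeric "model" are stand-ins for a real learner and its parameters:

```python
def train_with_curriculum(stages, train_step, epochs_per_stage=2):
    """Present datasets in order of increasing difficulty; within each
    stage, run a fixed number of epochs before moving on to the next."""
    model = 0.0  # stand-in for real model parameters
    for stage in stages:                    # easy -> intermediate -> hard
        for _ in range(epochs_per_stage):
            for example in stage:
                model = train_step(model, example)
    return model

# Toy check with an additive "update": every example in every stage is
# visited epochs_per_stage times before harder stages begin.
total = train_with_curriculum([[1, 2], [3, 4]], lambda m, x: m + x)
```

Self-paced variants would replace the fixed stage order with a data-driven choice of which examples to present next.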
Since this version of the device does not cause interference with the other Kinect 2.0 sensors, it has become more attractive than ever.

In our scenario the Kinect 2 sensors are installed within an indoor space such as a house or a room. A central server located inside the house processes all the incoming data. The privacy concerns add the limitation that no raw information, such as the RGB video feed or the depth data, can be stored on disk. Therefore, we are limited to real-time methods that analyze the data as observations take place. The system should also recognize people automatically to profile the activities of individuals and to invoke notifications in case of an emergency.

A technical challenge is to organize communication with the Kinects. Each Kinect 2 requires the full bandwidth of a USB 3 controller, and the connecting cable cannot be longer than 5m – the underlying USB 3 protocol has a maximum communication latency limit. Additionally, the Microsoft Kinect 2 SDK does not support multiple Kinects at the same time. Our solution to the above technical challenges is to deploy the system on multiple computers that communicate over the network.

3.2 High-Level Framework Specification

At the highest abstraction level our system is a collection of small software packages that communicate with each other on a network through message passing. The main component is a singleton Smart Home Core running on the server. For each Kinect involved we run a separate Kinect Service that processes and transmits sensor readings to the Smart Home Core software through a network with TCP/IP support (see Figure 3.1). Decoupling the individual components has the added benefit of easier scalability.
For instance, the software operates independently of the total number of active Kinects in the system – if we later wish to add more Kinects for better accuracy or more coverage, we can do so without altering the software.

The foundation of this platform is implemented in C# under the .NET Framework 4. We use OpenCV (www.opencv.org) and the Point Cloud Library (PCL, www.pointclouds.org) at the lower levels for efficiency in tasks such as visualization or image processing.

Figure 3.1: The high-level overview of the components in our system. Each Kinect is connected to a local Kinect Service. At the Smart Home Core we communicate with each Kinect Service to gather data. The Kinect Clients are the interfaces to the Kinect Service and can be implemented in any programming language.

Messages used in inter-process communication are serialized using Google Protocol Buffers (developers.google.com/protocol-buffers/) to achieve fast and efficient transmission. The language neutrality of Google Protocol Buffers allows interfacing with multiple languages. For example, we can communicate with a Kinect Service to gather data inside Matlab.

To minimize network overhead we further compress the message payload with lossless LZ4 (code.google.com/p/lz4/) compression and a lossy JPEG scheme. The final system transmits a 720p video stream and depth data at a frame rate of 30fps while consuming only 5.3MB of bandwidth, making simultaneous communication with up to ten Kinects feasible over wireless networks. While we could still optimize the communication costs for even higher scalability, we found the described system efficient enough to proceed with our experiments.

3.3 Internal Structure and Data Flow

The internal structure of our system is best described by walking through the data flow path within our pipeline.
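The payload path – serialize, then losslessly compress – can be sketched with the standard library alone. Here JSON stands in for the Protocol Buffers serialization and zlib stands in for LZ4 (both substitutions are assumptions for the sake of a self-contained example; the real system additionally JPEG-encodes the color stream):

```python
import json
import zlib

def pack_frame(depth_row):
    """Serialize one message payload and compress it losslessly.
    JSON + zlib stand in for the Protocol Buffers + LZ4 pair used in
    the actual platform."""
    raw = json.dumps(depth_row).encode("utf-8")
    return zlib.compress(raw)

def unpack_frame(blob):
    """Invert pack_frame: decompress, then deserialize."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Depth rows contain long runs of similar values, so they compress well.
row = [1500] * 512
blob = pack_frame(row)
```

Because the compression is lossless, the round trip reproduces the depth values exactly, which matters for the later geometric stages; only the color stream can tolerate a lossy codec.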
The highest abstraction level of the data flow is shown in Figure 3.2.

Figure 3.2: The high-level representation of data flow within our pipeline. The pose estimation block operates independently of the number of active Kinects.

3.3.1 Camera Registration and Data Aggregation

Each Kinect sensor makes measurements in its own coordinate system. This coordinate system is a function of the camera's location and orientation in our environment. The problem of finding the relative transformations that unify the measurements into a single coordinate system is called camera calibration or extrinsic camera parameter estimation. Camera calibration has been studied extensively in the computer vision and robotics communities [10, 16, 19].

Within our problem context we assume the cameras are installed in a fixed location. Therefore, we only need to calibrate the cameras once. As long as we can do this in a reasonable time, we can resort to simple procedures. In our pipeline we perform feature matching in the RGB space to obtain reasonably accurate transformation parameters, and then run the Iterative Closest Point (ICP) [3] algorithm to fine-tune the estimates. Our implementation uses SIFT [27] to match features and then estimates a transformation matrix T̂ by minimizing an ℓ2 loss over the corresponding matched depth locations within a RANSAC [9] pipeline. To find a locally optimal transformation with respect to the entire point clouds, we then initialize the ICP method with T̂ using the implementation of PCL.

After generating a transformation estimate for each sensor pair we can unify the coordinate spaces and merge all the measurements into the same domain.
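The least-squares rigid transform that the ℓ2 step solves for, given a set of matched 3d points, has a closed-form SVD solution (the Kabsch/Procrustes construction). A sketch with NumPy, assuming the correspondences are already given (in the pipeline they come from SIFT matches inside the RANSAC loop):

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ R @ src + t,
    computed with the SVD-based Kabsch solution. `src` and `dst` are
    (N, 3) arrays of matched 3d points."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t
```

In practice this closed-form fit is run inside RANSAC to reject bad SIFT matches, and its output seeds ICP for refinement against the full point clouds.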
Figure 3.3 demonstrates a real output after determining a unified coordinate system. By adding more cameras we can increase the observable space to the entire house.

Figure 3.3: An example demonstrating the output of camera calibration. The blue and the red points come from two different Kinects facing each other, but they are presented in a unified coordinate space.

By constructing a larger observable space we no longer need to know about the individual Kinects in our pipeline. If we later decide to add more Kinects, we can simply estimate the relative transformation of the newly added Kinect to at least one of the existing cameras – at this point the unification is straightforward due to transitivity. After merging the new data we will have a more accurately observed space or a larger observable area, and either way the rest of the pipeline remains agnostic to the number of Kinects involved. After unifying the measurements the data flow path proceeds to the pose estimation stage. In this thesis we focus only on the development of the pose estimation subsystem and leave the other potential stages to future work.

3.3.2 Pose Estimation

At the pose estimation stage the target is to identify the posture of every person in the observable space. For each individual we perform 3d pose estimation based on the depth and the point cloud data. This stage of the pipeline can further be separated into different parts, as shown in Figure 3.4.

Figure 3.4: The pose estimation pipeline in our platform.

The main focus of this thesis is on this particular stage of the developed framework.
Our pose estimation pipeline consists of four stages.

Human segmentation: At this stage we perform background subtraction to separate the evidence of each person from the background so that the rest of the pipeline can examine the data in isolation.

Pixel-wise classification: After separating each person we perform classification on each pixel of the depth image.

Classification aggregation: At this stage we merge the classification results of all the cameras.

Pose estimation: Given the merged evidence of the previous step, we solve the pose estimation problem and derive the actual joint locations.

Further details of our pose estimation methodology are presented in Chapter 5.

Chapter 4
Synthetic Data Generation

To address the absence of appropriate datasets we use computer-generated imagery to synthesize realistic training data. The original problem of interest is extracting human posture from input, but the inverse direction, that is, generating output from human posture, is reasonably well solved in computer graphics. The main theme of this chapter is to simulate the inverse process to generate data.

Commercial 3d rendering engines such as Autodesk Maya (www.autodesk.com/products/maya) have simplified modeling a human body with human inverse kinematic algorithms. A human inverse kinematic algorithm calculates human joint angles under the constraints of human anatomy to achieve the desired posture. The HumanIK middleware (gameware.autodesk.com/humanik) that underlies the aforementioned 3d engine has also been widely adopted in game development to create real-time characters that interact with the environment of the game. While inverse kinematic algorithms facilitate body shape manipulation in a credible way, we also require realistic 3d body shapes to begin with. Luckily, there are numerous commercial and non-commercial tools to create realistic 3d human characters with desired physical attributes. From this point on we will be using the term 'character' to refer to body shapes.

Shotton et al.
[36] demonstrate the efficacy of using synthetic data for depth-based pose estimation and argue that synthesized data tends to be more difficult for pose estimation than real-world data. This behavior is attributed to the high variation of possible postures in synthetic data, while real-world data tends to exhibit a distribution biased towards common postures.

In the remainder of this chapter we discuss the specifics of the data generation process. Our pipeline is an adaptation of the previous work presented in Shotton et al. [36]. An overview of the data generation process is shown in Figure 4.1. We use a collection of real human postures and synthesize data with realistic 3d models and random camera locations. The output of this process has been carefully tuned to generate usable data for our task.

Figure 4.1: The synthetic data generation pipeline. We use realistic 3d models with real human pose configurations and random camera locations to generate realistic training data.

4.1 Sampling Human Pose

At this stage of the data generation pipeline we are interested in collecting a set of real human postures. With the powerful HumanIK and a carefully defined space of postures, it would be possible to simply enumerate over all the possible configurations. However, we take the simpler path of using data that has already been collected from human subjects and leave spontaneous pose generation to future work.

To collect real human postures we chose the publicly available motion capture dataset of CMU (mocap.cs.cmu.edu). This dataset consists of over four million motion capture frames of human subjects performing a variety of tasks, ranging from day-to-day conversation with other people to activities such as playing basketball.
Each sequence of this dataset records the joint rotation parameters of the subject's skeleton. Using the orientation of each bone rather than the XYZ coordinates of the joint locations has the benefit of being invariant to the skeleton's physical properties. By defining the physical properties of the skeleton ourselves, which is merely the length of a few bones, we can convert this rotational information to absolute XYZ joint locations.

One way to use this dataset is to simply pick random frame samples from the entire pool of frames. However, doing so would heavily bias our pose space because of the redundant nature of the data. For instance, consecutive frames of a 120fps sequence are highly similar to each other. Moreover, activities such as walking tend to show up frequently in the dataset across different sequences. Therefore, we need to build a uniform space of postures to make sure the data skew does not bias our subsequent models.

To build an unbiased dataset we collect a representative set of 100,000 human postures. To achieve this goal we first define a basic fixed skeleton, convert the rotational information to Cartesian space, and then run the K-means clustering algorithm on this dataset with 100K centers. We use the Fast Library for Approximate Nearest Neighbours (FLANN) [29] to speed up the nearest-neighbor look-ups. After finding the association of each posture with the cluster centers, we identify the median within each cluster and pick the corresponding rotational data as a representative for that cluster.

After selecting a set of 100K human postures we split the data into three sets of 60K, 20K, and 20K for train, validation, and test respectively. We refer to this pose set as MotionCK.
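The representative-selection step can be sketched as follows. Distances are plain squared Euclidean over the Cartesian pose vectors, and a nearest-to-center medoid stands in for the per-cluster median used above (FLANN only accelerates the same nearest-neighbor queries at scale):

```python
def pick_representatives(poses, centers):
    """Assign each pose (a tuple of coordinates) to its nearest cluster
    center, and return, per cluster index, the member closest to that
    center. This medoid is a stand-in for the per-cluster median."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = {}
    for pose in poses:
        k = min(range(len(centers)), key=lambda i: d2(pose, centers[i]))
        if k not in best or d2(pose, centers[k]) < d2(best[k], centers[k]):
            best[k] = pose
    return best
```

Keeping one representative per cluster is what flattens the heavy bias towards frequent activities such as walking.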
In Figure 4.2 you can see some examples of our postures. Note that at this stage our sets are merely a description of pose and do not include any 3d characters.

Figure 4.2: Random samples from MotionCK as described in Section 4.1.

4.2 Building Realistic 3d Models

The next stage in our pipeline is to create realistic human 3d models. We use the open-source MakeHuman project (www.makehuman.org), which allows creation of human-like models with varying physical and clothing attributes. Since we are only interested in generating synthetic depth data, applying a human-like skin is irrelevant to the depth information. Hence, we create our own special skin that reflects our target regions of interest. Our model has 43 different body regions with distinguishing labels for left and right body parts (see Figure 4.3). We purposefully chose to oversegment the body parts so that we can merge them later if necessary, without regenerating data.

To make sure our data includes variety in shape, we create 16 characters with varying parameters in age, gender, height, and weight (2 of each). The MakeHuman project allows a higher degree of freedom in making a model; however, we found that varying other parameters does not substantially affect the apparent physical attributes. All of our characters can be seen in Figure 4.4. We plan to release our models for public use.

4.3 Setting Camera Location

In order to render a 3d model we also require a camera location. The camera parameters control the viewpoint from which we collect data.

Figure 4.3: Regions of interest in our humanoid model. There are a total of 43 different body regions, color-coded as above. (a) The frontal view and (b) the dorsal view.

Figure 4.4: All the 16 characters we made for synthetic data generation.
Subjects vary in age, weight, height, and gender.

Figure 4.5: An overview of the extrinsic camera parameters inside our data generation pipeline.

Recall that in our problem we do not require cooperation, and we would like to estimate pose from distances of up to seven meters and heights of up to three meters. Therefore we should define the camera location with respect to these assumptions. We chose the following possible configurations for the camera. The height of the camera is assumed to be between one and three meters from the ground. We assume the person is at most four meters away from the sensor, and the relative azimuthal angle between the person and the camera spans the entire 2π range. See Figure 4.5 for a visualization of the defined camera parameters. The chosen parameters are for data generation purposes only. In Chapter 5 we describe how our method handles cases where the person is farther away.

The intrinsic camera parameters, such as focal length and output size, are carefully chosen to match the intrinsic parameters of the Kinect 2 depth camera, to ensure the synthetic data is as comparable to real data as possible.

4.4 Sampling Data

To generate data we follow the sampling process described in Algorithm 1. The first input to our algorithm is C, the pool of target 3d characters (e.g., the characters of Section 4.2). The second input is the range of camera locations L (e.g., the definition in Section 4.3). The third input is the pool of postures P (e.g., MotionCK). Finally, the last parameter is the total number of viewpoints n. The output sample is a set S = {(D_i, G_i)}_{i=1}^n, where D_i and G_i are the depth and the groundtruth image as seen from the i-th camera.

Algorithm 1 Sample data
  C: Pool of characters.
  L: Range of camera locations.
  P: Pool of postures.
  n: Number of cameras.
  1: procedure SAMPLE(C, L, P, n)
  2:   c ~ Unif(C)        ▷ select a random character
  3:   l_1:n ~ Unif(L)    ▷ select n random locations
  4:   p ~ Unif(P)        ▷ select a posture
  5:   S ← Render depth image and groundtruth
  6:   return S
Separating the inputs to our function allows full control over the data generation pipeline. In Section 4.5 we generate multiple datasets with different inputs to this function. To implement Algorithm 1 we use Python for scripting and Maya for rendering. We generate over 2 million samples within our pipeline for training, validation, and test.

4.5 Datasets

In order to apply curriculum learning we create datasets of different complexity to train our models. We first start training on the simplest dataset and then gradually increase the complexity of the data to adapt our models. Further details on our training procedure are given in Chapter 5.

The datasets that we have created are briefly described in Table 4.1. Our first dataset is Easy-Pose, which is the simplest of the three. To generate Easy-Pose we select the subset of postures in MotionCK that are labeled with 'Walk' or 'Run', and pick only one 3d character from our models. The second dataset, Inter-Pose, extends Easy-Pose by adding more variation to the possible postures. If a hypothetical model is making a transition from Easy-Pose to Inter-Pose, it will be required to learn more postures for the same character. The final dataset, Hard-Pose, includes all the 3d characters of Section 4.2.

Table 4.1: Dataset complexity table. Θ is the relative camera angle, H refers to the height parameter, and D refers to the distance parameter as described in Figure 4.5. The 'simple' set is the subset of postures that have the label 'walk' or 'run'. Going from the first dataset to the second requires pose adaptation, while going from the second to the third requires shape adaptation.

Dataset | Postures | # Characters | Camera Parameters |
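Algorithm 1 amounts to a few independent uniform draws followed by a render per camera. A runnable sketch, where `render` is a stub standing in for the Maya rendering step:

```python
import random

def sample(characters, location_ranges, postures, n, render):
    """Sketch of Algorithm 1: draw one character, n camera locations, and
    one posture, then return one (depth, groundtruth) pair per camera."""
    c = random.choice(characters)                       # c ~ Unif(C)
    locations = [tuple(random.uniform(lo, hi) for lo, hi in location_ranges)
                 for _ in range(n)]                     # l_1:n ~ Unif(L)
    p = random.choice(postures)                         # p ~ Unif(P)
    return [render(c, loc, p) for loc in locations]     # S = {(D_i, G_i)}

# Hard-Pose-style ranges for (theta, height, distance): the azimuth spans
# the full circle, heights 1-3m, distances 1.5-4m.
ranges = [(-3.14159, 3.14159), (1.0, 3.0), (1.5, 4.0)]
S = sample(["char01"], ranges, ["pose42"], 3,
           render=lambda c, loc, p: ("depth", "groundtruth"))  # stub
```

Varying the character pool, the posture pool, and the location ranges is exactly how the datasets of different complexity in Section 4.5 are produced from the same procedure.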
Samples
Easy-Pose | simple (~10K) | 1 | Θ ~ U(−π, π), H ~ U(1, 1.5)m, D ~ U(1.5, 4)m | 1M
Inter-Pose | MotionCK (100K) | 1 | Θ ~ U(−π, π), H ~ U(1, 1.5)m, D ~ U(1.5, 4)m | 1.3M
Hard-Pose | MotionCK (100K) | 16 | Θ ~ U(−π, π), H ~ U(1, 3)m, D ~ U(1.5, 4)m | 300K

Each dataset of Table 4.1 has a train, test, and validation set with mutually exclusive sets of postures.

We generate all the data with n = 3 cameras. In Figure 4.6, Figure 4.7, and Figure 4.8 you can find sample training data from Easy-Pose, Inter-Pose, and Hard-Pose respectively.

Figure 4.6: Three random samples from Easy-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.

Figure 4.7: Three random samples from Inter-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.

Figure 4.8: Three random samples from Hard-Pose. (a,c,e) are groundtruth images and (b,d,f) are the corresponding depth images.

4.6 Discussion

In Section 4.1 we extracted a set of representative postures from the publicly available CMU Mocap dataset. Even though this dataset has over four million frames of real human motion capture, it does not cover the entire set of possible postures. Shotton et al. [36] further collect real data within their problem context to improve generalization, but they do not release it for public use. To create a better representation of human posture, two potential approaches are suggested: collecting more data, and using human inverse kinematic algorithms. While collecting more motion capture data can be prohibitively expensive, it guarantees the validity of the collected postures. Alternatively, it is also possible to enumerate over a space of body configurations and rely on HumanIK to calculate appropriate joint angles. However, a majority of the postures generated by this procedure could be unrealistic and useless without an effective pruning strategy.

In Section 4.2 we created 16 characters to generate data with.
The chosen regions of interest are defined heuristically while considering the previous work. Since we were able to achieve satisfactory results (presented in Chapter 6), we did not experiment with other region selection schemes. One potential future direction is to experiment with different region definitions to gauge the room for improvement.

In Section 4.3 we defined a subspace from which to sample the camera location parameters uniformly. An idea that we did not explore is to generate datasets at different elevations. Since at test time we know the camera location, it would be possible to use only models trained for that specific elevation.

Chapter 5
Multiview Pose Estimation

In this chapter we present a general framework for multiview depth-based pose estimation. Our approach is to define a sequence of four tasks that must be addressed in order to predict the pose. Each task is a specific problem to which we can apply a multitude of approaches. The four stages of our framework are depicted in Figure 5.1.

In the first stage we perform background subtraction on the input depth image to separate out the human pixels. For each set of identified human pixels we then generate a pixel-wise classification where each pixel is labeled according to the body regions in Figure 4.3. Recall that each Kinect has an independent machine running an instance of Kinect Service (see Figure 3.2). The first two stages of our pipeline can be run on each machine in a distributed fashion, or run centrally within Smart Home Core. The next step is to aggregate information from all the cameras into a single unified coordinate space. This aggregation results in a labeled point cloud of the human body. The final step is to perform pose estimation on the labeled point cloud of the previous stage.

In the following sections we go through each step of this pipeline to provide an in-depth description of each task.
We then discuss the potential design choices and present the motivation behind our chosen methods. We end this chapter in Section 5.5 by discussing alternative design choices and potential future research directions.

Figure 5.1: Our framework consists of four stages through which we gradually build higher level abstractions. The final output is an estimate of human posture.

5.1 Human Segmentation

Human segmentation is a binary background/foreground classification task that assigns a label y ∈ {0, 1} to each pixel. The purpose of this task is to separate people from the background so that the individuals can be processed in isolation. While devising an exact solution to this problem is arguably a challenge in the RGB domain, it is possible to build sufficiently accurate methods in the depth domain. With depth data it is possible to mark the boundary pixels of a human body by simply examining the discontinuities; in the RGB domain this cue is unreliable.

Generating a pixel mask for each input is generally treated as a pre-processing step for which efficient tools already exist, and in the pose estimation literature it is commonly assumed to be given [36, 43, 44]. While theoretically it is possible to use any classification model for this task, random forests have been shown to be particularly efficient.

Given that the rest of the pipeline is likely to require more sophisticated models to yield a good result, as commonly practiced in the literature, we also use random decision forest classifiers. More specifically, we use the implementation of the Kinect SDK to execute this step of the pipeline. A sample output of this step is shown in Figure 5.2.
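The depth-discontinuity cue mentioned above is easy to illustrate. Note that the pipeline itself uses the Kinect SDK's random forest segmenter; the sketch below, including the jump threshold value, is only a minimal demonstration of why silhouette boundaries are easy to find in depth but not in RGB.

```python
import numpy as np

def depth_discontinuities(depth, jump_cm=15.0):
    """Flag pixels whose depth differs sharply from a left/upper neighbor;
    in depth images such jumps trace body boundaries."""
    dy = np.abs(np.diff(depth, axis=0, prepend=depth[:1, :]))
    dx = np.abs(np.diff(depth, axis=1, prepend=depth[:, :1]))
    return (dx > jump_cm) | (dy > jump_cm)

# A subject at 200 cm in front of a 400 cm wall yields edges at the outline.
scene = np.full((6, 6), 400.0)
scene[2:5, 2:4] = 200.0
edges = depth_discontinuities(scene)
```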
Note that after this stage of the pipeline we will be looking at individual human subjects.

Figure 5.2: Sample human segmentation in the first stage of our pose estimation pipeline.

5.2 Pixel-wise Classification

Given a masked depth image of a human subject our task is to assign a class label y ∈ Y to each pixel, where Y is the set of our body classes as defined in Figure 4.3. This formulation of the subproblem has previously been motivated by Shotton et al. [36], but we have a few differences from the previous work.

Shotton et al. [36] assume the user is facing the camera, leading to a simplified classification problem because it is no longer necessary to distinguish between the left and right side of a body. Furthermore, Shotton et al. use body labels with only 21 regions which wrap around the person. In this work we extend this to 43 regions to distinguish the right and left sides of the body (see Figure 4.3). In our context, unlike the case of Shotton et al., it is not possible to make classification decisions locally. A left or right hand label, for instance, depends on the existence of a frontal or dorsal label for the head at a different spatial location; and even then, there are still special cases that must be taken into account.

The other assumption that we are relaxing is the relative distance between the camera and the user. The home entertainment scenario of [36] is no longer valid in our context, and thus our system should be able to handle a wider variety of viewpoints. Furthermore, in our context the user is not necessarily a co-operative agent, which further complicates our task. We may be able to assume co-operation by focusing on use cases such as distant physiotherapy, but the general monitoring problem does not admit this simplification. We approach the classification problem from a new perspective that allows embedding of higher level inter-class spatial dependencies.

In the subsequent sections we fully describe our pixel-wise classifier.
In Section 5.2.1 we describe how background/foreground masks and depth images are used for normalization. In Section 5.2.2 we describe the CNN architecture that takes the normalized image as input and generates a densely classified output image. Further discussion of our architecture is presented in Section 5.2.3.

5.2.1 Preprocessing the Depth Image

The first step in our image classification pipeline is to normalize the input image to make it consistent across all possible inputs. The input of this part is a depth image with a foreground mask of the target subject whose pose needs to be estimated.

We first quantize and linearly map the depth value range [50, 800] cm to the range [0, 255]. We then crop the depth image using the foreground mask and scale the image to fit in a 190 × 190 pixel window while preserving the aspect ratio. Finally we translate all the depth values so that the average depth is approximately 160 cm. After adding a 30 pixel margin we have a 250 × 250 pixel image which is used in the next stage. See the examples in Figure 5.3.

5.2.2 Dense Segmentation with Deep Convolutional Networks

We use Convolutional Neural Networks (CNNs) to generate a densely classified output from the normalized input depth image. Our network architecture is inspired by the work of Long et al. [26] in image segmentation. We use deconvolution outputs and fuse the result with the information from the lower layers to generate a densely classified depth image. Our approach takes advantage of the information in neighboring pixels; this is in contrast to random forest based methods such as [37] where each pixel is evaluated independently.

The particular architecture that we have chosen is presented in Figure 5.4. The input to this network is a single channel of normalized depth data.

Figure 5.3: Sample input and output of the normalization process.
(a,b) the input from two views, (c,d) the corresponding foreground masks, (e,f) the normalized image output. The output is rescaled to 250 × 250 pixels. The depth data is from the Berkeley MHAD [31] dataset.

Figure 5.4: Our CNN architecture. The input is a 250 × 250 normalized depth image. The first row of the network generates a 44-channel 14 × 14 coarsely classified depth map with a high stride. It then learns deconvolution kernels that are fused with the information from lower layers to generate finely classified depth. Like [26] we use summation and crop alignment to fuse information. The input and the output blocks are not drawn to preserve the scale of the image. The number in parentheses within each block is the number of the corresponding channels.

In the first row of operations in Figure 5.4 our network generates a 14 × 14 output with 44 channels. The remainder of the network is responsible for learning deconvolution kernels to generate a dense classification output. After each deconvolution we fuse the output with the lower layer features through summation. The final deconvolution operation with the kernel size 19 × 19 enforces the spatial dependency of adjacent pixels within a 19 pixel neighborhood.
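The normalization of Section 5.2.1 can be sketched as below. The ordering of steps, the nearest-neighbor resampling, and the centering of the crop are our assumptions; the thesis does not specify the exact interpolation scheme.

```python
import numpy as np

def normalize_depth(depth, mask, near=50.0, far=800.0,
                    inner=190, margin=30, target_mean=160.0):
    """Sketch of Section 5.2.1: shift the subject to a ~160 cm average depth,
    map [50, 800] cm to [0, 255], crop to the mask, fit in a 190x190 box
    (aspect preserved), and pad by a 30 px margin to 250x250."""
    d = depth.astype(float).copy()
    d[mask] += target_mean - d[mask].mean()          # recenter average depth
    d = np.clip((d - near) / (far - near) * 255.0, 0.0, 255.0)
    ys, xs = np.nonzero(mask)
    crop = np.where(mask, d, 0.0)[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    scale = inner / max(crop.shape)                  # preserve aspect ratio
    h = max(1, int(round(crop.shape[0] * scale)))
    w = max(1, int(round(crop.shape[1] * scale)))
    rows = np.linspace(0, crop.shape[0] - 1, h).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, w).astype(int)
    small = crop[np.ix_(rows, cols)]                 # nearest-neighbor resize
    side = inner + 2 * margin                        # 190 + 2*30 = 250
    out = np.zeros((side, side))
    y0, x0 = (side - h) // 2, (side - w) // 2
    out[y0:y0 + h, x0:x0 + w] = small
    return out

img = normalize_depth(np.full((120, 90), 300.0),
                      np.ones((120, 90), dtype=bool))
```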
At the end, our network gives a 250 × 250 output with 44 channels: one per class and one for the background label. This stage of our pipeline can either run independently within each Kinect Service, or be executed on all the data at once within Smart Home Core.

5.2.3 Designing the Deep Convolutional Network

To the best of our knowledge, the general approach to designing deep convolutional networks is trial and error. Most of the current applications of CNNs simply reuse previously successful architectures such as VGGNet [38] or AlexNet [24]. Such applications often fine-tune an architecture pretrained on datasets such as ImageNet [35] on the specific target data.

We initially started our experiments by training window classifiers that label each pixel locally by only observing a 50 × 50 pixel window. The final architecture for our window classifier is the first row of Figure 5.4, where a 14 × 14 output with 44 channels is generated. In the window classification setting, the output is simply 1 × 1 with 44 channels: one per class and one for the background. After training the initial window classifier we fix the parameters and extend the network with deconvolution layers to get the final architecture of Figure 5.4. We learn the parameters of the newly added layers by training the entire network on densely labeled input data. Using the deconvolution approach of Long et al. [26] to generate a densely classified output was a particularly attractive choice because of its compatibility with our window classifier.

During our experiments we learned that a separate step for window classification was unnecessary, and thus we abandoned the initial two-step approach and trained the entire network in an end-to-end fashion.

5.3 Classification Aggregation

After generating a densely classified depth image we use the extrinsic camera parameters of our set-up to reconstruct a merged point cloud in a reference coordinate system.
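The aggregation and feature extraction of Section 5.3 can be sketched as follows, assuming a calibrated pinhole model per camera. The function names are ours; the per-class statistics come to 3 + 9 + 3 + 3 + 6 = 24 numbers per class, so 43 classes give f ∈ R^1032. Zero-filling empty classes is an assumption.

```python
import numpy as np

def backproject_labeled(depth, labels, K, T):
    """Lift one densely classified depth image into a labeled 3-D cloud in
    the reference frame. K: 3x3 pinhole intrinsics, T: 4x4 camera-to-
    reference extrinsics (both assumed calibrated, as in our set-up)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.ravel().astype(float)
    keep = z > 0                                   # drop background pixels
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)])[:, keep]
    return (T @ pts)[:3].T, labels.ravel()[keep]

def class_stats(pts):
    """24 statistics for one body-part class: median (3), covariance (9),
    its eigenvalues (3), per-axis std (3), and per-axis min/max (6)."""
    cov = np.cov(pts.T) if len(pts) > 1 else np.zeros((3, 3))
    return np.concatenate([np.median(pts, axis=0), cov.ravel(),
                           np.linalg.eigvalsh(cov), pts.std(axis=0),
                           pts.min(axis=0), pts.max(axis=0)])

def feature_vector(views, n_classes=43):
    """Merge the per-camera labeled clouds and build f in R^1032
    (43 classes x 24 statistics; empty classes are zero-filled)."""
    parts = [backproject_labeled(*view) for view in views]
    cloud = np.vstack([p for p, _ in parts])
    labels = np.concatenate([l for _, l in parts])
    feats = [class_stats(cloud[labels == c]) if np.any(labels == c)
             else np.zeros(24) for c in range(n_classes)]
    return np.concatenate(feats)
```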
It is possible to apply various filtering and aggregation techniques; however, we have found this simple and fast approach to be sufficiently effective. At this point we have a labeled point cloud which is the final result after combining all the views. Now we extract a feature f from our merged point cloud to be used in the next stage of our pose estimation pipeline.

For each class in our point cloud we extract the following features:

• The median location.
• The covariance matrix.
• The eigenvalues of the covariance matrix.
• The standard deviation within each dimension.
• The minimum and maximum values in each dimension.

The final feature f is the concatenation of all the above features into a single feature vector f ∈ R^1032. We chose the above features to maintain real-time performance in feature extraction. It is also possible to apply more computationally expensive data summarization techniques to generate possibly better features for the next stage.

5.4 Pose Estimation

We treat the problem of pose estimation as regression. For each joint j in our skeleton we would like to learn a function F_j(·) to predict the location of joint j given the feature vector f. After examining a few real-time design choices such as linear regression and neural networks we found that simple linear regression gives the best trade-off between complexity and performance. Our linear regression is a least squares formulation with an \ell_2 regularizer, also known as ridge regression (Equation 5.1):

\min_{W_j \in \mathbb{R}^{1032 \times 3},\, b_j \in \mathbb{R}^{3}} \; \frac{1}{2n} \sum_{i=1}^{n} \left\| W_j^T f_i + b_j - y_j^i \right\|_2^2 + \frac{\lambda_j}{2} \left( \mathrm{Tr}(W_j^T W_j) + \|b_j\|_2^2 \right)    (5.1)

Our regression problem for each joint j is defined in Equation 5.1. For each joint j we would like to learn a matrix W_j ∈ R^{1032×3} and a bias term b_j ∈ R^3 using the regularization parameter λ_j.
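The ridge objective of Equation 5.1 admits a closed-form solution once the bias is absorbed by a constant-1 feature. A minimal numpy sketch, with our own function names, of the fit, prediction, and the temporal smoothing described later in this section:

```python
import numpy as np

def fit_joint(F, Y, lam):
    """Closed-form ridge solution: append a constant-1 column to absorb the
    bias, then solve (F^T F + n*lam*I) W = F^T Y for joint coordinates Y."""
    n = len(F)
    Fb = np.hstack([F, np.ones((n, 1))])
    A = Fb.T @ Fb + n * lam * np.eye(Fb.shape[1])
    return np.linalg.solve(A, Fb.T @ Y)

def predict_joint(W, F):
    """Predict joint locations for the feature rows of F."""
    return np.hstack([F, np.ones((len(F), 1))]) @ W

def smooth(prev_estimate, estimate, eta):
    """Temporal smoothing: weighted average with the previous estimate."""
    return (1.0 - eta) * prev_estimate + eta * estimate
```

The closed form makes the per-λ refit cheap, which is what enables the fast hyperparameter search by cross-validation.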
If we append a constant 1 to each feature vector f (a column of ones in the design matrix) we can absorb the bias term b_j and arrive at the closed-form solution shown in Equation 5.2:

W_j = (F^T F + n \lambda_j I)^{-1} F^T Y_j    (5.2)

Here F ∈ R^{n×1033} is the design matrix of all the training features and Y_j ∈ R^{n×3} contains the corresponding coordinates of the j-th joint. Having a closed-form solution allows fast hyperparameter optimization of λ_j. We also experimented with the LASSO counterpart to obtain sparser solutions, but the improvements were negligible while the optimization took substantially more time. If the input data is a sequence, we further temporally smooth the predictions by calculating a weighted average with the previous estimate:

\hat{Y}_t^s = (1 - \eta)\, \hat{Y}_{t-1} + \eta\, \hat{Y}_t, \qquad 0 \le \eta \le 1    (5.3)

where \hat{Y}_t^s is the smoothed estimate at time t, and \hat{Y}_t is the original estimate at time t. The regularizer hyperparameters and the optimal smoothing weights are chosen automatically by cross-validation on the training data.

For pose estimation it is also possible to apply more complex methods such as structural SVMs [42] or Gaussian processes [41]. However, more complicated methods also require more data and computational power. Since we need to evaluate on real data and each dataset comes with its own definition of the skeleton, we prefer the simplest approach that simultaneously maintains real-time performance and robustness against overfitting. Because datasets come with their own definitions of joints, we need to train this part of our pipeline separately for each dataset.

5.5 Discussion

In Section 5.2 we presented a CNN architecture to generate densely classified outputs. Our architecture uses the ideas of Long et al. [26]; however, there has been a recent surge of architectures, such as the ones presented by Zheng et al. [46] and Hu and Ramanan [20], that directly model CRFs inside the CNNs.
One future direction is to investigate more recent CNN architectures to improve the classification.

In Section 5.3 we aggregated the classification results of all the views through simple algebraic operations. By using temporal information and applying filtering strategies we could eliminate some noise in the final labeled point cloud, which is likely to lead to minor improvements in the pipeline. Another possible direction is to explore various shape summarization techniques to generate better features.

The final step of pose estimation in Section 5.4 is a simple linear regression which ignores the kinematic constraints. While the presented CNN architecture has control over kinematic constraints, one potential direction is to also add kinematic constraints in the final prediction stage while maintaining real-time performance. A kinematic model that also incorporates temporal consistency is likely to significantly improve the results. Collecting real data with multiple Kinect sensors is also a valuable future direction that would greatly benefit the community and help with further development of multiview pose estimation methods.

Chapter 6
Evaluation

In this chapter we provide our evaluation results on three datasets: (i) UBC3V Synthetic Data, (ii) Berkeley MHAD [31], and (iii) EVAL [12]. To train and evaluate our deep network model we use Caffe [22]. Since each dataset has a specific definition of joint locations, we only need to train the regression part of our pipeline (see Section 5.4) on each dataset.

Evaluation Metrics. There are two common evaluation metrics for the pose estimation task: (i) mean joint prediction error and (ii) mean average precision at threshold. Mean joint prediction error is the measure of average error incurred in predicting each joint location. A mean average error of 10 cm for a joint simply indicates that we incur an average error of 10 cm per estimate.
Mean average precision at threshold is the fraction of predictions that are within the threshold distance of the groundtruth. A mean average precision of 70% at a threshold of 5 cm means that our estimates are within a 5 cm radius of the groundtruth 70% of the time. Because of the errors in groundtruth annotations it is also common to report only the mean average precision at a 10 cm threshold [12, 43, 44].

Datasets. The only publicly available dataset for multiview depth at the moment of writing is the Berkeley MHAD [31]. Note that our target is pose estimation with multiple depth cameras; therefore we only qualitatively evaluate our CNN body part classifier on single camera datasets, as our pose estimation technique is not applicable to single depth camera settings. Our evaluation also includes the results on our synthetic dataset with three depth cameras.

UBC3V Synthetic. For evaluation we use the Test set of Hard-Pose. This dataset consists of 19000 body postures with 16 characters captured from three cameras at random locations. Note that these 19000 postures are not present in the training set of our dataset and have not been observed before. The groundtruth and the extrinsic camera parameters come directly from the synthetic data. For more information and sample data see Chapter 4.

Berkeley MHAD [31]. This dataset includes 12 subjects performing 11 actions while being recorded by 12 cameras, two Kinect 1 devices, an Impulse motion capture system, four microphones, and six accelerometers. The mocap sequence generated by the Impulse motion capture system is the groundtruth for pose estimation on this dataset. A sample frame from all 12 subjects is shown in Figure 6.1. Note that we only use the depth information from the two Kinects for pose estimation and ignore all other sources of information.

EVAL [12]. This dataset consists of 24 sequences of three different characters performing eight activities each.
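The two metrics above are straightforward to compute; a sketch (function names are ours) with predictions and groundtruth stored as (frames, joints, 3) arrays of centimeters:

```python
import numpy as np

def joint_errors(pred, gt):
    """Euclidean error of every joint prediction; inputs are
    (frames, joints, 3) arrays in centimeters."""
    return np.linalg.norm(pred - gt, axis=-1)

def mean_joint_error(pred, gt):
    """Mean joint prediction error: average error per estimate."""
    return joint_errors(pred, gt).mean()

def precision_at(pred, gt, threshold_cm=10.0):
    """Mean average precision at threshold: fraction of predictions within
    the threshold distance of the groundtruth."""
    return (joint_errors(pred, gt) <= threshold_cm).mean()
```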
Since this dataset is created for single view pose estimation we only qualitatively evaluate our body part classifier of Section 5.2 on this data. See Figure 6.2 for three random frames from this dataset.

6.1 Training the Dense Depth Classifier

Our initial attempts to train the deep network presented in Section 5.2.2 on the Hard-Pose dataset did not yield satisfying results. We experimented with various optimization techniques and configurations, but the accuracy of the network on dense classification did not go beyond 50%. Resorting to the curriculum learning idea of Bengio et al. [2] (see Section 2.2), we simplified the problem by defining easier datasets that we call Easy-Pose and Inter-Pose (see Table 4.1).

We start training the network on the Easy-Pose dataset. Each iteration consists of eight densely classified images and we stop at 250k iterations, reaching a dense classification accuracy of 87.8%. We then fine-tune the resulting network on Inter-Pose, initially starting at an accuracy of 78% and terminating at iteration 150k with an accuracy of 82%. Interestingly, the performance on Easy-Pose is preserved throughout this fine-tuning stage. Finally, we start fine-tuning on the Hard-Pose dataset and stop after 88k iterations. Initially this network evaluates to 73% and by the termination point we have an accuracy of 81%. The evolution of our three networks is shown in Table 6.1.

Figure 6.1: Front camera samples of all the subjects in the Berkeley MHAD [31] dataset.

Notice how the final accuracy improved from 50% to 81% by controlling the difficulty of the training instances that our network sees. Our experiments demonstrate a real application of curriculum learning [2] in practice.

All of our networks are trained with Stochastic Gradient Descent (SGD) with a momentum of 0.99.
The initial learning rate is set to 0.01 and multiplied by 10^-1 every 30k iterations. The weight decay parameter is set to 5 · 10^-5. In hindsight, all the above stages of training can be run in approximately 5 days on a Tesla K40 GPU. Given that we experimented with multiple architectures simultaneously, the exact time required to train these networks is not available.

Figure 6.2: Front depth camera samples of all the subjects in the EVAL [12] dataset.

Table 6.1: The dense classification accuracy of the trained networks on the validation sets of the corresponding datasets. Net 2 and Net 3 are initialized with the learned parameters of Net 1 and Net 2 respectively.

         Easy-Pose      Inter-Pose     Hard-Pose
         Start   End    Start   End    Start   End
Net 1    0%      87%    –       –      –       –
Net 2    87%     87%    78%     82%    –       –
Net 3    87%     85%    82%     79%    73%     81%

If we only extract the most likely class from the output, the CNN takes only 6 ms to process each image on the GPU. However, calculating the exponent and normalizing the measures to get the full probability distribution for each pixel can cost up to an extra 40 ms. To maintain real-time 30 fps performance on multiple Kinects, we discard the full probability output and only use the most likely class for each pixel.

6.2 Evaluation on UBC3V Synthetic

This dataset includes the groundtruth body part classification and pose annotation. Since the annotations come from synthetic data there are no errors associated with the annotation. For real-world data, however, the annotations are likely to be erroneous to a small extent.

Having groundtruth body part classes and postures allows the separation of evaluation for the dense classifiers and the pose estimates.
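The step learning-rate schedule used for training in Section 6.1 is simple to state in code:

```python
def learning_rate(iteration, base=0.01, gamma=0.1, step=30000):
    """Step schedule used with SGD: start at 0.01 and multiply the rate
    by 10^-1 every 30k iterations."""
    return base * gamma ** (iteration // step)
```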
That is, we can evaluate the pose estimates assuming a perfect dense classification is available, and then compare the results with the densely classified depth images generated by our CNN. This separation gives us insight into how improvement in dense classification is likely to affect the pose estimates, and whether we should spend time on improving the dense depth classifier or the pose estimation algorithm.

For training we use the multi-step fine-tuning procedure described in Section 6.1. We first train the network on the Train set of Easy-Pose. We then successively fine-tune on the Train sets of Inter-Pose and Hard-Pose. We refer to the third fine-tuned network as Net 3, which we will use throughout the remainder of the thesis.

6.2.1 Dense Classification

The Test set of Hard-Pose includes 57057 depth frames with class annotations that are synthetically generated. This dataset is generated from a pool of 19000 postures that have not been seen by our classifier at any point. Furthermore, each frame of this dataset is generated from a random viewpoint.

The reference class numbers are shown in Figure 6.3. The confusion matrix of our classifier is shown in Figure 6.4. Figure 6.5 displays a few sample classification outputs and the corresponding groundtruth images in their original size. Figure 6.6 shows a few enlarged sample classification outputs and the groundtruth. Note that for visualization we only use the most likely class at each pixel.

Figure 6.3: The reference groundtruth classes of UBC3V synthetic data.

The accuracy of Net 3 on the Test set is 80.6%, similar to the reported accuracy on the Validation set in Table 6.1. As evident in Figure 6.5, the network correctly identifies the direction of the human body and assigns appropriate left/right classes.
However, the network seems to ignore sudden depth discontinuities in the classification (see the last row of Figure 6.5).

6.2.2 Pose Estimation

We evaluate our linear regression on the groundtruth classes and on the classification output of our CNN. The estimates derived from the groundtruth serve as a lower bound on the error of the pose estimation algorithm.

The mean average joint prediction error is shown in Figure 6.7. Our system achieves an average pose estimation error of 2.44 cm on the groundtruth, and 5.64 cm on the Net 3 output. The gap between the two results is due to dense classification errors. This difference is smaller on easy-to-recognize body parts and larger on hard-to-recognize classes such as hands or feet. It is possible to reduce this gap by using more sophisticated pose estimation methods at the cost of more computation. In Figure 6.8 we compare the precision at threshold. The accuracy at 10 cm for the groundtruth and Net 3 is 99.1% and 88.7% respectively.

Figure 6.4: The confusion matrix of Net 3 estimates on the Test set of Hard-Pose.

6.3 Evaluation on Berkeley MHAD

This dataset has a total of 659 sequences of 12 actors over 11 actions with 5 repetitions (one sequence is missing). There are two Kinect 1 devices at opposite sides of the subjects capturing the depth information. This dataset defines 35 joints over the entire body (for a full list see Figure 6.11). The groundtruth pose here is the motion capture data.

At the moment of writing there is no protocol for the evaluation of pose estimation techniques on this dataset. The leave-one-out approach is a common practice for single view pose estimation.

Figure 6.5: The output of the Net 3 classifier on the Test set of Hard-Pose (left) versus the groundtruth body part classes (right).
The images are in their original size.

Figure 6.6: The groundtruth body part classes (top) versus the output of the Net 3 classifier on the Test set of Hard-Pose (bottom).

However, each action has five repetitions, and we can argue in general that leave-one-out may not be a fair indicator of performance, because the method can adapt to the shape of the test subject on the other sequences to get a better result. Furthermore, we are no longer restricted to only a few sequences of data as in previous datasets.

To evaluate the performance on this dataset we take the harder leave-one-subject-out approach; that is, for evaluation on each subject we train our system on all the other subjects. This protocol ensures that no extra physical information is leaked during training and provides a measure of robustness to shape variation.

Figure 6.7: Mean average joint prediction error on the groundtruth and the Net 3 classification output. The error bar is one standard deviation. The average error on the groundtruth is 2.44 cm, and on Net 3 it is 5.64 cm.

The Kinect depth images of this dataset are captured with Kinect 1 sensors, which have different intrinsic camera parameters than Kinect 2. The differences in focal length and principal point offset can be eliminated by a simple scale and translation of the depth image. To make the depth images of this dataset compatible with our pipeline we resize and translate the provided depth images to match the intrinsic camera parameters of a Kinect 2 sensor. To verify the correctness of our procedure we generate a point cloud from the final output using Kinect 2 intrinsic camera parameters and compare the output cloud with the original point cloud generated from the Kinect 1 depth images.
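The scale-and-translate intrinsics adjustment can be sketched as a pixel remapping; this is our illustration with nearest-neighbor sampling, not the thesis' exact resampling code, and the intrinsic matrices are device calibration inputs.

```python
import numpy as np

def match_intrinsics(depth, K_src, K_dst, out_shape):
    """Resample a depth image recorded with intrinsics K_src so that it
    back-projects as if recorded with K_dst: a scale by the focal-length
    ratio plus a principal-point translation."""
    h, w = out_shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    # Map destination pixels to source pixels via normalized coordinates.
    us = (u - K_dst[0, 2]) * K_src[0, 0] / K_dst[0, 0] + K_src[0, 2]
    vs = (v - K_dst[1, 2]) * K_src[1, 1] / K_dst[1, 1] + K_src[1, 2]
    ui, vi = np.round(us).astype(int), np.round(vs).astype(int)
    ok = (ui >= 0) & (ui < depth.shape[1]) & (vi >= 0) & (vi < depth.shape[0])
    out = np.zeros(out_shape)
    out[ok] = depth[vi[ok], ui[ok]]
    return out
```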
To measure the discrepancy between the two point clouds we run ICP for one iteration and calculate the objective value; we simply pick the translation and scale parameters that minimize the error objective between the two clouds.

Figure 6.8: Mean average precision of the groundtruth dense labels and the Net 3 dense classification output, with accuracy at a 10 cm threshold of 99.1% and 88.7% respectively.

6.3.1 Dense Classification

To reuse the CNN that we trained on the synthetic data in Section 6.2.1, we adjust the depth images using the procedure described earlier. After this step, we simply feed the depth image to the CNN to get dense classification results. Figure 6.9 shows the output of our dense classifier from the two Kinects on a few random frames. Even though the network has been trained only on synthetic data, it generalizes well to the real test data. As demonstrated in Figure 6.9, the network has also successfully captured the long distance spatial relationships to correctly classify pixels based on the orientation of the body.

The right column of Figure 6.9 shows an instance of high partial classification error due to occlusion. On the back image, the network mistakenly believes that the chair legs are the subject's hands. However, once the back data is merged with the front data we get a reasonable estimate (see Figure 6.10).

Figure 6.9: Dense classification result of Net 3 together with the original depth image on the Berkeley MHAD [31] dataset. Net 3 has been trained only on synthetic data.

6.3.2 Pose Estimation

We use the groundtruth motion capture joint locations to train our system. For each test subject we train our system on the other subjects' sequences. The final result is an average over all the test subjects.

Figure 6.11 shows the mean average joint prediction error. The total average joint prediction error is 5.01 cm.
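The one-iteration ICP check above amounts to evaluating the nearest-neighbor objective between the two clouds for each candidate scale/translation. A brute-force sketch (the function name is ours; a k-d tree would replace the pairwise distances at realistic cloud sizes):

```python
import numpy as np

def icp_objective(cloud_a, cloud_b):
    """One ICP association step: mean squared distance from each point of
    cloud_a to its nearest neighbor in cloud_b."""
    d2 = ((cloud_a[:, None, :] - cloud_b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()
```

The scale and translation pair minimizing this objective is the one kept for the depth-image conversion.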
The torso joints are easier for our system to localize than the hand joints, a behavior similar to the synthetic data results. However, it must be noted that even the groundtruth motion capture on smaller body parts such as hands or feet is biased with a high variance. During visual inspection of Berkeley MHAD we noticed that, on some frames, especially when the subject bends over, the hands' location is outside of the body point cloud or even outside the frame, and clearly erroneous. The overall average precision at 10 cm is 93%.

Figure 6.10: Blue is the motion capture groundtruth on the Berkeley MHAD [31] and red is the linear regression pose estimate.

Figure 6.11: Pose estimate mean average error per joint on the Berkeley MHAD [31] dataset.

An interesting observation is the similarity of performance on the Berkeley MHAD data and the synthetic data in Figure 6.7. This suggests, at least for the applied methods, that the synthetic data is a reasonable proxy for evaluating performance, which has also been suggested by Shotton et al. [37]. Figure 6.12 shows the accuracy at threshold for joint location predictions.

We also compare our performance with Michel et al. [28] in Table 6.2. Since they use an alternative definition of the skeleton derived from their shape model, we only evaluate over the subset of joints that are closest to the locations presented in Michel et al. [28]. Note that the method of [28] uses predefined shape parameters that are optimized for each subject a priori and does not operate in real-time. In contrast, our method does not depend on shape attributes and operates in real-time. Following the procedure of [28] we evaluate our performance by testing on subjects and testing on actions. Our method improves the previous mean joint prediction error from 3.93 cm to 3.39 cm (13%) when tested on subjects and from 4.18 cm to 2.78 cm (33%) when tested on actions.

Figure 6.12: Accuracy at threshold for the entire skeleton on the Berkeley MHAD [31] dataset.

Table 6.2: Mean and standard deviation of the prediction error by testing on subjects and actions with the joint definitions of Michel et al. [28]. We also report and compare the accuracy at a 10 cm threshold.

                     Subjects                  Actions
                     Mean    Std    Acc (%)    Mean    Std    Acc (%)
OpenNI [28]          5.45    4.62   86.3       5.29    4.95   87.3
Michel et al. [28]   3.93    2.73   96.3       4.18    3.31   94.4
Ours                 3.39    1.12   96.8       2.78    1.5    98.1

6.4 Evaluation on EVAL

There are a total of 24 sequences of 3 subjects with 16 joints. To generate a depth image from this dataset we must project the provided point cloud onto the original camera surface and then rescale the image to resemble a Kinect 2 output. To verify the correctness of our depth images, we generate a point cloud from each image with Kinect 2 parameters and compare against the original point cloud provided in the dataset. Three sample outputs of our procedure are presented in Figure 6.2.

Figure 6.13 shows four random dense classification outputs from this dataset. The first column of Figure 6.13 shows an instance of the network failing to confidently label the data with front or back classes, but the general location of the torso, head, feet, and hands is correctly determined.

Figure 6.13: Dense classification result of Net 3 and the original depth image on the EVAL [12] dataset. Net 3 has been trained only on synthetic data.
The accuracy of our preliminary results suggests that single depth pose estimation techniques can benefit from using the output of our dense classifier.

Chapter 7
Discussion and Conclusion

We presented an efficient and inexpensive markerless pose estimation system that uses only a few Kinect sensors. Our system only assumes the availability of calibrated depth cameras and is capable of real-time performance without requiring an explicit shape model of the subject or co-operation by the subject. While our main goal is to estimate the posture in real-time for smart homes, our system can also be used as a reasonably accurate and inexpensive replacement for commercial motion capture solutions in applications that do not require precise measurements. The non-intrusive nature of our system can also facilitate the development of easy-to-use virtual reality or augmented reality platforms.

The subproblems of our pose estimation pipeline as described in Chapter 5 are all open to further improvement. Our results in Chapter 6 suggest that improving the dense depth classifier of Section 5.2 is a worthwhile path to explore. For a thorough discussion on this topic we refer the reader to Section 5.5.

The supporting infrastructure of our pose estimation is a scalable and modular software framework for smart homes that orchestrates multiple Kinect devices in real-time. By tackling the technical challenges, our platform enables research on multiview depth-based pose estimation. The modular structure of our system simplifies the integration of more sources of information for the smart home application. Our platform is only developed to the extent that provides the necessities of our research on pose estimation. Adding more features, such as analysis of auditory signals to support voice activated commands, is one of the many exciting research directions that our platform supports.

The training of our system was made possible by generating a dataset of 6 million synthetic depth frames.
Our data generation process depended on a set of human postures collected from real data. The 100k set of postures (see Section 4.1) that we used is arguably not representative of every possible human posture. An interesting future research direction is to build automated systems that generate random, but plausible, body configurations.

Our experiments demonstrated an application of curriculum learning in practice, and our system exceeded the state-of-the-art multiview pose estimation performance on the Berkeley MHAD [31] dataset.

Bibliography

[1] A. Baak, M. Müller, G. Bharaj, H. P. Seidel, and C. Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, 2013.

[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, 2009.

[3] P. J. Besl and H. D. McKay. A method for registration of 3-D shapes. Transactions on Pattern Analysis and Machine Intelligence, 1992.

[4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In International Conference on Computer Vision, 2009.

[5] L. Bourdev, F. Yang, and R. Fergus. Deep poselets for human detection. arXiv preprint arXiv:1407.0717, 2014.

[6] B. Chen, P. Perona, and L. Bourdev. Hierarchical cascade of classifiers for efficient poselet evaluation. In British Machine Vision Conference, 2014.

[7] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations, 2015.

[8] P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. Theory of Computing, 2012.

[9] M. A. Fischler and R. C. Bolles.
Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.

[10] Y. Furukawa and J. Ponce. Accurate camera calibration from multi-view stereo and bundle adjustment. In Computer Vision and Pattern Recognition, 2008.

[11] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real time motion capture using a single time-of-flight camera. In Computer Vision and Pattern Recognition, 2010.

[12] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Real-time human pose tracking from range data. In European Conference on Computer Vision, 2012.

[13] S. Ge and G. Fan. Non-rigid articulated point set registration for human pose estimation. In Winter Applications of Computer Vision, 2015.

[14] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In International Conference on Computer Vision, 2011.

[15] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In Computer Vision and Pattern Recognition, 2014.

[16] D. F. Glas, D. Brščić, T. Miyashita, and N. Hagita. SNAPCAT-3D: Calibrating networks of 3D range sensors for pedestrian tracking. In International Conference on Robotics and Automation, 2015.

[17] L. He, G. Wang, Q. Liao, and J. H. Xue. Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing, 2015.

[18] T. Helten, A. Baak, G. Bharaj, M. Müller, H. P. Seidel, and C. Theobalt. Personalization and evaluation of a real-time depth-based full body tracker. In International Conference on 3D Vision, 2013.

[19] L. Heng, G. H. Lee, and M. Pollefeys. Self-calibration and visual SLAM with a multi-camera system on a micro aerial vehicle. In Robotics: Science and Systems (RSS), 2014.
[20] P. Hu and D. Ramanan. Bottom-up and top-down reasoning with convolutional latent-variable models. arXiv preprint arXiv:1507.05699, 2015.

[21] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In International Conference on Computer Vision, 2013.

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In International Conference on Multimedia, 2014.

[23] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In Association for the Advancement of Artificial Intelligence, 2015.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[25] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010.

[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.

[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.

[28] D. Michel, C. Panagiotakis, and A. A. Argyros. Tracking the articulated motion of the human body with two RGBD cameras. Machine Vision and Applications, 2014.

[29] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.

[30] A. Myronenko and X. Song. Point set registration: Coherent point drift. Transactions on Pattern Analysis and Machine Intelligence, 2010.

[31] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy.
Berkeley MHAD: A comprehensive multimodal human action database. In Winter Applications of Computer Vision, 2013.

[32] S. Pellegrini, K. Schindler, and D. Nardi. A generalisation of the ICP algorithm for articulated bodies. In British Machine Vision Conference, 2008.

[33] A. Phan and F. P. Ferrie. Towards 3D human posture estimation using multiple Kinects despite self-contacts. In IAPR International Conference on Machine Vision Applications, 2015.

[34] G. Pons-Moll, J. Taylor, J. Shotton, A. Hertzmann, and A. Fitzgibbon. Metric regression forests for human pose estimation. In British Machine Vision Conference, 2013.

[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[36] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, and A. Kipman. Efficient human pose estimation from single depth images. Transactions on Pattern Analysis and Machine Intelligence, 2013.

[37] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013.

[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[39] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, 2014.

[40] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition, 2014.

[41] R. Urtasun and T. Darrell.
Sparse probabilistic regression for activity-independent human pose inference. In Computer Vision and Pattern Recognition, 2008.

[42] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. Transactions on Pattern Analysis and Machine Intelligence, 2013.

[43] M. Ye and R. Yang. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In Computer Vision and Pattern Recognition, 2014.

[44] H. Yub Jung, S. Lee, Y. Seok Heo, and I. Dong Yun. Random tree walk toward instantaneous 3D human pose estimation. In Computer Vision and Pattern Recognition, 2015.

[45] P. Zhang, K. Siu, J. Zhang, C. K. Liu, and J. Chai. Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture. Transactions on Graphics, 2014.

[46] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision, 2015.

