UBC Theses and Dissertations
Understanding the sources of error for 3D human pose estimation from monocular images and videos
Hossain, Mir Rayat Imtiaz
With the success of deep learning in computer vision, most state-of-the-art approaches to estimating 3D human pose from images or videos train a network end-to-end to regress 3D joint locations or heatmaps from an RGB image. Although most of these approaches produce good results, their major sources of error are often difficult to understand. Errors may come either from incorrect 2D pose estimation or from an incorrect mapping of 2D features to 3D. In this work, we aim to understand the sources of error in estimating 3D pose from images and videos. To that end, we built three different systems. The first takes the 2D joint locations of each frame individually as input and predicts 3D joint positions. To our surprise, we found that a simple feed-forward fully connected network with residual connections can map ground-truth 2D joint locations to 3D space at a remarkably low error rate, outperforming the best previously reported result by almost 30% on the Human3.6M dataset, the largest publicly available motion capture dataset. Furthermore, training this network on the outputs of an off-the-shelf 2D pose detector gives us state-of-the-art results compared with a vast array of systems trained end-to-end. To validate the efficacy of this network, we also trained an end-to-end system that takes an image as input and regresses 3D pose directly. We found that training the network end-to-end is harder than decoupling the task. To examine whether temporal information over a sequence improves results, we built a sequence-to-sequence network that takes a sequence of 2D poses as input and predicts a sequence of 3D poses as output. We found that temporal information improves on the results of our first system. We argue that a large portion of the error in 3D pose estimation systems results from errors in 2D pose estimation.
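The 2D-to-3D lifting network described above can be sketched as a forward pass: flattened 2D joint coordinates are projected into a hidden space, passed through fully connected residual blocks, and projected out to 3D joint locations. The sketch below is a minimal NumPy illustration, not the thesis implementation; the joint count (16), hidden width (1024), number of blocks (2), and omission of batch normalization and dropout are all assumptions made for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, W1, b1, W2, b2):
    # Two fully connected layers with ReLU plus a skip connection,
    # mirroring the residual blocks mentioned in the abstract
    # (batch norm and dropout omitted in this sketch).
    h = relu(x @ W1 + b1)
    h = relu(h @ W2 + b2)
    return x + h

rng = np.random.default_rng(0)
n_joints, hidden = 16, 1024          # assumed sizes, for illustration only
x2d = rng.standard_normal((1, n_joints * 2))   # flattened 2D joint coordinates

# Input projection into the hidden space
W_in = rng.standard_normal((n_joints * 2, hidden)) * 0.01
h = relu(x2d @ W_in)

# Two residual blocks
for _ in range(2):
    W1 = rng.standard_normal((hidden, hidden)) * 0.01
    W2 = rng.standard_normal((hidden, hidden)) * 0.01
    zeros = np.zeros(hidden)
    h = residual_block(h, W1, zeros, W2, zeros)

# Output projection to 3D joint locations
W_out = rng.standard_normal((hidden, n_joints * 3)) * 0.01
pose3d = (h @ W_out).reshape(1, n_joints, 3)
print(pose3d.shape)  # (1, 16, 3)
```

In a trained system the weights would be learned by minimizing the distance between predicted and ground-truth 3D joints; here random weights simply demonstrate the data flow and shapes.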
Attribution-NonCommercial-NoDerivatives 4.0 International