
Open Collections

UBC Theses and Dissertations
Generative adversarial networks for pose-guided human video generation Zablotskaia, Polina 2020


Full Text

Generative Adversarial Networks for Pose-guided Human Video Generation

by

Polina Zablotskaia

BSc, Peter the Great St. Petersburg Polytechnic University, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

March 2020

© Polina Zablotskaia, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Generative Adversarial Networks for Pose-guided Human Video Generation

submitted by Polina Zablotskaia in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Leonid Sigal, Computer Science
Supervisor

James Little, Computer Science
Second Reader

Abstract

Generation of realistic high-resolution videos of human subjects is a challenging and important task in computer vision. In this thesis, we focus on human motion transfer: generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video. Our GAN-based architecture, DwNet, leverages a dense intermediate pose-guided representation and a refinement process to warp the required subject appearance, in the form of the texture, from a source image into a desired pose. Temporal consistency is maintained by further conditioning the decoding process within a GAN on the previously generated frame. In this way a video is generated in an iterative and recurrent fashion. We illustrate the efficacy of our approach by showing state-of-the-art quantitative and qualitative performance on two benchmark datasets: Tai-Chi and Fashion Modeling. The latter was collected by us and is made publicly available to the community.
We also show how our proposed method can be further improved by using a recent segmentation-mask-based architecture, such as SPADE, and how to combat temporal inconsistency in video synthesis using a temporal discriminator.

Lay Summary

Human motion transfer is a novel task of generating a video of a single person, whose appearance is taken from an image (source) and whose actions are extracted from a driving video of someone else. This is a very challenging task. First, it has to be fully automated, meaning that the model receives as input only an image and a video. Second, the model has to be able to "guess" the missing parts of the person's appearance that are not visible in the source image. Third, the resulting video has to be temporally consistent across generated frames. Solving this problem in an automated way will be especially useful for visual artists working on movies and game animations. The task also arises in the augmented reality industry, for example in virtual try-on apps. We propose an architecture that overcomes the above challenges and shows state-of-the-art quantitative and qualitative performance on two benchmark datasets.

Preface

The work presented here is original work done by the first author, Polina Zablotskaia, with the help and advice of the second and third authors, Aliaksandr Siarohin and Dr. Bo Zhao, respectively. The work was done under the supervision of Prof. Leonid Sigal. The main content of this thesis is based on a version that was accepted at the British Machine Vision Conference (BMVC), 2019.

• P. Zablotskaia, A. Siarohin, B. Zhao and L. Sigal. DwNet: Dense warp-based network for pose-guided human video generation. In British Machine Vision Conference (BMVC), 2019.

The first author is completely responsible for the design and implementation of the model and the experiments. Aliaksandr Siarohin kindly shared his advice on architecture nuances and provided his expertise on writing a conference paper, and Dr.
Zhao proposed a way to collect the Fashion Modeling dataset. Prof. Sigal provided feedback during each step and also helped with refining the conference publication.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Videos
1 Introduction
  1.1 Contributions
2 Related work
  2.1 Image Generation
  2.2 Pose guided image generation
  2.3 Video Generation
3 Method
  3.1 Warp module
    3.1.1 Coarse warp grid estimate
    3.1.2 Refined warp grid estimate
  3.2 Discriminator
    3.2.1 Temporal discriminator
  3.3 Training
  3.4 SPADE-based Generator
4 Experiments
  4.1 The Fashion Dataset
  4.2 Setup
    4.2.1 Datasets
    4.2.2 Evaluation metrics
    4.2.3 Implementation details
  4.3 Comparison with the state-of-the-art
  4.4 Ablation
  4.5 Temporal consistency
5 Conclusion
  5.1 Limitations and future work
Bibliography
A Additional Visualizations

List of Tables

Table 4.1: Quantitative Comparison with the State-of-the-Art. Performance on the Fashion and Tai-Chi datasets is reported in terms of the Perceptual Loss, Fréchet Inception Distance (FID) and Average Keypoint Distance (AKD).
Table 4.2: User study. Percentage of the time our method is preferred over one of the competing approaches.
Table 4.3: Quantitative evaluations for ablations on the Fashion dataset. See text for details.
Table 4.4: Quantitative evaluation of the Temporal DwNet. Comparison between DwNet and its temporal version on the Tai-Chi dataset.

List of Figures

Figure 1.1: Human Motion Transfer. Our pose-guided approach allows generation of a video for the source subject, depicted in an image, that follows the motion in the driving video.
Figure 3.1: Proposed Architecture. Overview of the generator. The generator consists of three main parts: Pose Encoder (E), Warp Module (W) and Decoder (D). The encoder E takes as input the next driving video pose P(d_{i+1}) and produces its feature representation.
On the other hand, the source image s, the source image pose P(s) and the next frame's pose P(d_{i+1}) are used by W to produce a deformed feature representation of the source image s. Similarly, W produces a deformed feature representation of the previously generated frame t̂_i. Then these representations are combined and passed to a set of Residual blocks and afterwards to the Decoder D, which generates a new frame t̂_{i+1}.
Figure 3.2: Proposed Architecture. Detailed overview of the warp module W. In the beginning, we compute a coarse warp grid from P(d_{i+1}) and P(s). The coarse warp grid is used to deform the source image s; this deformed image, along with the driving pose P(d_{i+1}) and the coarse warp coordinates, is used by the Refine branch R to predict a correction to the coarse warp grid. We employ this refined estimate to deform the feature representation of the source image s. Note that the deformation of t̂_i takes a similar form.
Figure 3.3: Visualisation of the warp module for the Tai-Chi (a) and Fashion (b) datasets. The first row is a source image s, the second row is a target pose P(d_i), the third row is the coarse warp Ŵ[P(s) → P(d_i)] applied to the source image and the last row is the refined warp W[P(s) → P(d_i)] applied to the source image.
Figure 3.4: Discriminators. Overview of the model discriminator (a). The discriminator takes two channel-wise concatenated inputs: the real image t_i or synthesized image t̂_i and a corresponding pose P(d_i). The discriminator then downsamples the input into a probability map of patches. Afterwards those patches are independently classified as fake or real and the final patch critique loss is the average over the patches. Overview of the temporal discriminator (b). Everything except the input stays the same: instead we pass either three real frames t_i, t_{i+1} and t_{i+2} or three fake frames t̂_i, t̂_{i+1} and t̂_{i+2} concatenated with each other. In that case we backpropagate only once for all three frames.
Figure 3.5: SPADE-based Generator Architecture. Overview of the generator.
Figure 4.1: Qualitative Results on the Tai-Chi Dataset. The first row illustrates the driving videos; the second row shows results of our method; the third row shows results obtained with Monkey-Net [31].
Figure 4.2: Qualitative Results on the Fashion Dataset. The first row illustrates the driving videos; the second row shows results of our method; the third row shows results obtained with Monkey-Net [31]; the fourth row shows results of Coordinate Inpainting [10]. Please zoom in for detail.
Figure 4.3: Ablations on the Fashion Dataset. Qualitative results from the ablated methods. The first column shows two random frames from the real video, the second column shows corresponding frames generated by the full model, the third column displays the results produced by the model trained without previous frames, and the fourth and fifth columns show results with just refined warp and without any warp, respectively.
Figure A.1: DwNet results on the Fashion dataset, first example. The first row is a driving video. The second row is a set of the extracted DensePoses. The last three rows are results of the motion transfer applied to three source images.
Figure A.2: DwNet results on the Fashion dataset, second example. The first row is a driving video. The second row is a set of the extracted DensePoses. The last three rows are results of the motion transfer applied to three source images.
Figure A.3: DwNet results on the Tai-Chi dataset, first example. The first row is a driving video. The second row is a set of the extracted DensePoses. The last two rows are results of the motion transfer applied to two source images.
Figure A.4: DwNet results on the Tai-Chi dataset, second example. The first row is a driving video. The second row is a set of the extracted DensePoses.
The last two rows are results of the motion transfer applied to two source images.

List of Videos

Video A.1: The demo video.

Chapter 1

Introduction

Generative models, both conditional and unconditional, have been at the core of the computer vision field from its inception. In recent years, approaches such as GANs [9] and VAEs [18] have achieved impressive results in a variety of image-based generative tasks. Progress on the video side, on the other hand, has been much more timid. Of particular challenge is the generation of videos containing high-resolution moving human subjects. In addition to the need to ensure that each frame is realistic and the video is overall temporally coherent, an additional challenge is contending with coherent appearance and motion realism of the human subject itself. Notably, visual artifacts exhibited on human subjects tend to be most glaring to observers.

In this thesis, we address the problem of human motion transfer. Namely, given a single image depicting a (source) human subject, we propose a method to generate a high-resolution video of this subject, conditioned on the (driving) motion expressed in an auxiliary video. The task is illustrated in Figure 1.1. Similar to recent methods that focus on pose-guided image generation [2, 6, 7, 10, 20, 21, 23, 29, 31, 43], we leverage an intermediate pose-centric representation of the subject. However, unlike those methods that tend to focus on sparse keypoints [2, 6, 20, 21, 29] or skeleton [7] representations, or intermediate dense optical flow obtained from those impoverished sources [31], we utilize a more detailed dense intermediate representation [11] and a texture transfer approach to define a fine-grained warping from the (source) human subject image to the target poses.

Figure 1.1: Human Motion Transfer.
Our pose-guided approach allows generation of a video for the source subject, depicted in an image, that follows the motion in the driving video.

This texture warping allows us to more explicitly preserve the appearance of the subject. Further, we focus on temporal consistency, which ensures that the transfer is not done independently for each generated frame but is rather sequentially conditioned on previously generated frames. We also note that, unlike [5, 39], we rely only on a single (source) image of the subject and not a video, making the problem that much more challenging.

1.1 Contributions

Our contributions are several. First, we propose a dense warp-based architecture designed to account for, and correct, dense pose errors produced by our intermediate human representation. Second, we formulate our generative framework as a conditional model, where each frame is generated conditioned not only on the source image and the target pose, but also on the previous frame generated by the model. This enables our framework to produce a much more temporally coherent output. Third, we describe two additional extensions of the framework that achieve even better temporal consistency and accuracy of the human motion transfer. Fourth, we illustrate the efficacy of our approach by showing improved performance with respect to recent state-of-the-art methods. Finally, we collect and make available a new high-resolution dataset of fashion videos.

Chapter 2

Related work

2.1 Image Generation

Image generation has become an increasingly popular task in recent years. The goal is to generate realistic images, mimicking samples from the true visual data distribution. Variational Autoencoders (VAEs) [18] and Generative Adversarial Networks (GANs) [9] are powerful tools for image generation that have shown promising results. In the case of unconstrained image generation, the resulting images are synthesized from random noise vectors. However, this paradigm can be extended
However, this paradigm can be extendedto conditional image generation [15, 38], where apart from the noise vector thenetwork input includes conditional information, which can be, for example, a styleimage [15, 17], a descriptive sentence [13] or an object layout [44] designatingaspects of desired image output. Multi-view synthesis [25, 34, 46] is one of thelargest topics in the conditional generation and it is the the one that mostly relatedto our proposal. The task of multi-view synthesis is to generate unseen view givenone or more known views.2.2 Pose guided image generationIn the pioneering work of Ma et al. [20] pose guided image generation using GANshas been proposed. Ma et al. [20] proposed modeling human poses as a set of key-points and use standard image-to-image translation networks (e.g., UNET [27]).3Later, it has been found [2, 29] that for UNET-based architectures it is difficultto process inputs that are not spatially aligned. In case of pose-guided models,keypoints of the target are not spatially aligned with the source image. As a conse-quence, [2, 29] propose new generator architectures that try to first spatially alignthese two inputs and then generate target images using image-to-image translationparadigms. Neverova et al. [23] suggest exploiting SMPL [19] representation of aperson, which they estimate using DensePose [11], in order to improve pose-guidedgeneration. Compared to keypoint representations, DensePose [11] results providemuch more information about the human pose, thus using it as a condition allowsmuch better generation results. Grigorev et al. [10] propose coordinate based in-painting to recover missing parts in the DensePose [11] estimation. Coordinatebased inpainting explicitly predicts from where to copy the missing texture, whileregular inpainting predicts the RGB values of the missing texture itself. UnlikeGrigorev et al. 
[10], our work can perform both standard inpainting and coordinate-based inpainting.

2.3 Video Generation

The field of video generation is much less explored compared to image generation. Initial works adopt 3D convolutions and recurrent architectures for video generation [28, 35, 37]. Later works focus on conditional video generation. The most well-studied task in conditional video generation is future frame prediction [1, 8, 24, 33]. Recent works step away from the fully unsupervised paradigm and suggest exploiting an intermediate representation, in the form of learned keypoints, for future frame prediction [36, 41, 45]. However, the most realistic video results are obtained by conditioning video generation on another video. This task is often called video-to-video translation. Two recent works [5, 39] suggest pose-guided video generation, which is a special case of video-to-video translation. The main drawback of these models is that they need to train a separate network for each person. In contrast, we suggest generating a video based on only a single image of a person. Recently, this task was addressed by Siarohin et al. [31][30], but they try to learn a representation of the subject in an unsupervised manner, which leads to sub-optimal results. In other words, the pose keypoints are learned during training and are not rich enough to describe complex motions. Another recent work by Wang et al. [40] uses a few images of the person, rather than a single one as done in our case. In that paper the authors propose a more general model, i.e., one that does not use specific human body priors and does not solely focus on human video generation. Conversely, in this thesis, we exploit and refine the richer structure and representation from DensePose [11] as an intermediate guide for video generation.

Chapter 3

Method

The objective of this thesis is to generate a video, consisting of N frames [t_1, ..., t_N], containing the person from a source image s, conditioned on a sequence of driving video frames of another person [d_0, ..., d_N], such that the person from the source image replicates the motions of the person in the driving video. Our model is based on a standard image-to-image [15] translation framework. However, standard convolutional networks are not well suited for tasks where the condition and the result are not well aligned. Hence, as advocated in recent works [2, 29], the key to precise human subject video generation lies in leveraging motion from the estimated poses. Moreover, the perceptual quality of the video is highly dependent on the temporal consistency between nearby frames. We design our model with these goals and intuitions in mind (see Figure 3.1).

First, unlike standard pose-guided image generation frameworks [2, 20, 29], in order to produce temporally consistent videos we add a Markovian assumption to our model. If we generate each frame of the video independently, the result can be temporally inconsistent and contain many flickering artifacts. To this end, we condition the generation of each frame t_i on the previously generated frame t_{i-1}, where i ∈ [2, ..., N].

Second, we use the DensePose [11] architecture to estimate correspondences between pixels and parts of the human body, in 3D. We apply DensePose [11] to the initial image, obtaining P(s), and to every frame of the driving video, obtaining [P(d_0), ..., P(d_N)], where the function P(·) denotes the output of DensePose.

Figure 3.1: Proposed Architecture. Overview of the generator. The generator consists of three main parts: Pose Encoder (E), Warp Module (W) and Decoder (D). The encoder E takes as input the next driving video pose P(d_{i+1}) and produces its feature representation. In parallel, the source image s, the source image pose P(s) and the next frame's pose P(d_{i+1}) are used by W to produce a deformed feature representation of the source image s. Similarly, W produces a deformed feature representation of the previously generated frame t̂_i. These representations are then combined, passed through a set of Residual blocks and finally through the Decoder D, which generates the new frame t̂_{i+1}.

Using this information we obtain a partial correspondence between pixels of any two human images. This correspondence allows us to analytically compute the coarse warp grid estimates Ŵ[P(s) → P(d_i)] and Ŵ[P(d_i) → P(d_{i+1})], where i ∈ [1, ..., N−1]. The coarse warp grid, in turn, allows us to perform texture transfer and estimate motion flow. Even though DensePose produces high-quality estimates, it is not perfect and sometimes suffers from artifacts such as false human detections and missing body parts. Another important drawback of using pose estimators is the lack of information regarding clothing. This information is very important to us, since we are trying to map a given person onto a video while preserving their body shape, facial features, hair and clothing. This motivates us to compute refined warp grid estimates W[P(s) → P(d_i)] and W[P(d_i) → P(d_{i+1})], where i ∈ [0, ..., N−1] (see Figure 3.2). We train this component end-to-end using standard image generation losses.

Figure 3.2: Proposed Architecture. Detailed overview of the warp module W. First, we compute a coarse warp grid from P(d_{i+1}) and P(s). The coarse warp grid is used to deform the source image s; this deformed image, along with the driving pose P(d_{i+1}) and the coarse warp coordinates, is used by the Refine branch R to predict a correction to the coarse warp grid. We employ this refined estimate to deform the feature representation of the source image s. Note that the deformation of t̂_i takes a similar form.

To sum up, our generator G consists of three blocks: the pose encoder E, the warp module W and the decoder D (see Figure 3.1). G generates video frames iteratively, one by one. First, the encoder E produces a representation of the driving pose, E(P(d_{i+1})).
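To make the data flow of this three-block design concrete, here is a minimal, illustrative Python sketch of the iterative generation loop. The functions `encode_pose`, `warp` and `decode` are trivial stand-ins for the learned modules E, W and D (which in DwNet are convolutional networks); only the recurrent conditioning structure matches the description above.

```python
import numpy as np

# Hypothetical stand-ins for the three generator blocks. Real DwNet uses
# learned networks; these trivial functions only demonstrate the data flow.
def encode_pose(pose):                 # E: pose -> feature representation
    return pose * 0.5

def warp(image, src_pose, tgt_pose):   # W: deform features toward tgt_pose
    return image + (tgt_pose - src_pose)

def decode(parts):                     # D: fuse representations into a frame
    return np.mean(parts, axis=0)

def generate_video(source, source_pose, driving_poses):
    """Markovian frame-by-frame generation: each new frame is conditioned
    on the source image AND on the previously generated frame."""
    frames = []
    prev, prev_pose = source, source_pose   # first step reuses the source
    for drive_pose in driving_poses:
        e = encode_pose(drive_pose)
        warped_src = warp(source, source_pose, drive_pose)
        warped_prev = warp(prev, prev_pose, drive_pose)
        frame = decode([e, warped_src, warped_prev])
        frames.append(frame)
        prev, prev_pose = frame, drive_pose  # recurrent conditioning
    return frames
```

The loop mirrors the text: the source branch keeps appearance anchored to s, while the previous-frame branch supplies temporal context.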
Then, given the source image s, the source pose P(s) and the driving pose P(d_{i+1}), the warp module W estimates a grid W[P(s) → P(d_{i+1})], encodes the source image s and warps this encoded image according to the grid. This gives us a representation W(s, P(s), P(d_{i+1})). The previous frame is processed in the same way, i.e., after transformation we obtain a representation W(t̂_i, P(d_i), P(d_{i+1})). These two representations, together with E(P(d_{i+1})), are concatenated and then processed by the decoder D to produce an output frame t̂_{i+1}.

3.1 Warp module

Figure 3.3: Visualisation of the warp module for the Tai-Chi (a) and Fashion (b) datasets. The first row is a source image s, the second row is a target pose P(d_i), the third row is the coarse warp Ŵ[P(s) → P(d_i)] applied to the source image and the last row is the refined warp W[P(s) → P(d_i)] applied to the source image.

3.1.1 Coarse warp grid estimate

As described in Section 3, we use DensePose [11] for coarse estimation of warp grids. For simplicity, we describe only how to obtain the warp grid estimate Ŵ[P(s) → P(d_{i+1})]; the procedure for Ŵ[P(d_i) → P(d_{i+1})] is similar. For each body part of the SMPL model [19], DensePose [11] estimates UV coordinates. The UV-mapping unfolds the body surface onto a 2D image such that every pixel corresponds to a 3D point on the body surface. However, the UV pixel coordinates in the source image s may not exactly match the UV pixel coordinates in the driving frame d_{i+1}, so in order to obtain a correspondence we use nearest neighbour interpolation. In more detail, for each pixel in the driving frame we find the nearest neighbour in the UV space of the source image that belongs to the same body part. In order to perform efficient nearest neighbour search we make use of KD-Trees [3]. The third row of Figures 3.3 (a) and (b) shows the results of applying the coarse warp grid to the images of the first row.
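The per-body-part nearest-neighbour lookup in UV space can be sketched as follows. This is a brute-force numpy illustration (the thesis uses KD-Trees [3] for the same query, which is faster); the flat array layout is an assumption made for the example, not the actual DwNet data format.

```python
import numpy as np

def coarse_warp(src_uv, src_parts, drv_uv, drv_parts, src_xy):
    """For every driving-frame pixel, find the source pixel whose DensePose
    UV coordinate is closest within the same body part, and return the
    source (x, y) location it should copy texture from.

    src_uv, drv_uv: (N, 2) UV coordinates; src_parts, drv_parts: (N,)
    body-part ids; src_xy: (N, 2) pixel coordinates in the source image."""
    warp = np.zeros_like(drv_uv)
    for p in np.unique(drv_parts):
        s_mask, d_mask = src_parts == p, drv_parts == p
        if not s_mask.any():
            continue  # body part missing in the source estimate
        # pairwise squared UV distances between driving and source pixels of part p
        d2 = ((drv_uv[d_mask, None, :] - src_uv[None, s_mask, :]) ** 2).sum(-1)
        nn = d2.argmin(axis=1)
        warp[d_mask] = src_xy[s_mask][nn]
    return warp
```

Restricting the search to the same body part is what makes the DensePose correspondence partial but reliable: a hand pixel can never be matched to a torso pixel.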
It is clear from the images that the coarse warp leaves a lot of artefacts; however, it is still a very powerful part of the model considering that it does not involve any learning.

3.1.2 Refined warp grid estimate

While the coarse warp grid estimate preserves general human body movement, it contains many errors caused by self-occlusions and imprecise DensePose [11] estimates. Moreover, the movement of the person's outfit is not modeled. This motivates us to add an additional correction branch R. The goal of this branch is, given the source image s, the coarse warp grid Ŵ[P(s) → P(d_{i+1})] and the target pose P(d_{i+1}), to predict the refined warp grid W[P(s) → P(d_{i+1})]. This refinement branch is trained end-to-end with the entire framework. Differentiable warping is implemented using the bilinear kernel [16]. The main problem of the bilinear kernel is its limited gradient flow, e.g., from each spatial location gradients are propagated to only 4 nearby spatial locations. This makes the module highly vulnerable to local minima. One way to address the local minimum problem is good initialization. With this in mind, we adopt a residual architecture for our module, i.e., the refinement branch predicts only the correction R(s, Ŵ[P(s) → P(d_{i+1})], P(d_{i+1})), which is later added to the coarse warp grid:

W[P(s) → P(d_{i+1})] = Ŵ[P(s) → P(d_{i+1})] + R(s, Ŵ[P(s) → P(d_{i+1})], P(d_{i+1})).    (3.1)

Note that since we transform intermediate representations of the source image s, the spatial size of the warp grid should be equal to the spatial size of the representation. In our case the spatial size of the representation is 64×64. Because of this, and to save computational resources, R predicts corrections of size 64×64; moreover, the coarse warp grid Ŵ[P(s) → P(d_{i+1})] is downsampled to size 64×64. Also, since convolutional layers are translation equivariant, they cannot efficiently process the absolute coordinates of the coarse warp grid Ŵ[P(s) → P(d_{i+1})].
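The residual composition of Eq. 3.1 rests on differentiable bilinear sampling. The small numpy sketch below spells out bilinear warping by hand and composes a (here zero) correction with an identity coarse grid; in DwNet the sampling is the differentiable bilinear kernel of [16] inside a network, and the correction comes from the learned branch R.

```python
import numpy as np

def bilinear_warp(img, grid):
    """Sample img (H, W) at float coordinates grid (H, W, 2) holding
    (row, col) positions, using bilinear interpolation."""
    H, W = img.shape
    r = np.clip(grid[..., 0], 0, H - 1)
    c = np.clip(grid[..., 1], 0, W - 1)
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, H - 1), np.minimum(c0 + 1, W - 1)
    wr, wc = r - r0, c - c0  # fractional weights
    return (img[r0, c0] * (1 - wr) * (1 - wc) + img[r1, c0] * wr * (1 - wc)
            + img[r0, c1] * (1 - wr) * wc + img[r1, c1] * wr * wc)

H = W = 4
# identity grid: each output pixel samples itself
identity = np.stack(
    np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1).astype(float)
coarse = identity.copy()          # stand-in for the DensePose-derived grid
residual = np.zeros_like(coarse)  # stand-in for R's predicted correction
refined = coarse + residual       # Eq. 3.1: refined = coarse + correction
img = np.arange(16, dtype=float).reshape(4, 4)
out = bilinear_warp(img, refined)
```

With a zero correction and identity coarse grid the warp reproduces the input, which is exactly the "good initialization" property the residual design buys: training starts from the coarse warp rather than from random deformation.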
To alleviate this issue we input to the network relative shifts, i.e., Ŵ[P(s) → P(d_{i+1})] − I, where I is the identity warp grid. The fourth row of Figures 3.3 (a) and (b) displays the effect of the refine branch on the final grid. We can clearly observe how many of the artifacts left by the coarse warp grid are removed by the refined warp.

3.2 Discriminator

Figure 3.4: Discriminators. Overview of the model discriminator (a). The discriminator takes two channel-wise concatenated inputs: the real image t_i or the synthesized image t̂_i, and the corresponding pose P(d_i). The discriminator then downsamples the input into a probability map of patches. Those patches are independently classified as fake or real, and the final patch critique loss is the average over the patches. Overview of the temporal discriminator (b). Everything except the input stays the same: instead, we pass either three real frames t_i, t_{i+1} and t_{i+2} or three fake frames t̂_i, t̂_{i+1} and t̂_{i+2} concatenated with each other. In that case we backpropagate only once for all three frames.

We use two discriminators, following the method proposed in [38]. They receive the same inputs but at different scales: the second input is downsampled by a factor of 2. This provides two different receptive fields, one global, capturing more abstract features, and one focused on details. Since both discriminators have the same architecture and inputs, we describe only one of them; further, when describing losses we assume that we always sum across the discriminators. Figure 3.4 (a) shows an overview of the discriminator model. For a real sample it takes as input a real image t_i concatenated with the corresponding pose P(d_i), and for a fake sample it takes a synthesised image t̂_i with the same corresponding pose P(d_i). Then follows a sequence of convolutional downsampling layers.
Further, instead of producing a single probability indicating whether the image is fake or real, we employ the patch critique approach, also known as the "PatchGAN" classifier [14]. In this approach the output of the model is a probability map, where each probability corresponds to a certain patch of the original image. A cross-entropy loss is computed for each of the patches, and the final patch-based critique C of the image is the average of those losses.

3.2.1 Temporal discriminator

In Figure 3.4 (b) we show another version of the discriminator, which takes into account the temporal aspect of the video generation problem. Instead of discriminating single frames, it observes triplets of synthesised frames and compares them with triplets of real frames. The only differences between the two discriminators are the input and the number of channels; the architecture and the output stay the same.

3.3 Training

Our training procedure is similar to [2], but specifically adapted to take the Markovian assumption into account. At training time we sample four frames from the same training video (three of which are consecutive); the indices of these frames are i, j, j+1, j+2, where i ∈ [1, ..., N], j ∈ [1, ..., N−2] and N is the total number of frames in the video. Experimentally, we observe that using four frames is best in terms of temporal consistency and computational efficiency. We treat frame i as the source image s, while the rest are treated as both the driving frames d_i, d_{i+1}, d_{i+2} and the ground truth target frames t_i, t_{i+1}, t_{i+2}. We generate the three frames as follows:

t̂_i = G(s, P(s), s, P(s), P(d_i)),    (3.2)

for the first frame, where the source frame is treated as the "previous" frame, and for the rest:

t̂_{i+1} = G(s, P(s), t̂_i, P(d_i), P(d_{i+1})),
t̂_{i+2} = G(s, P(s), t̂_{i+1}, P(d_{i+1}), P(d_{i+2})).    (3.3)

This formulation has low memory consumption, but at the same time allows standard pose-guided image generation, which is needed to produce the first target output frame.
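The three-step generation schedule of Eqs. 3.2 and 3.3 can be written schematically as below. `G` and `P` are placeholders for the generator and the DensePose operator, so only the conditioning pattern is meaningful here.

```python
def generate_training_triplet(G, P, s, d_i, d_i1, d_i2):
    """Recurrent generation of the three training frames.
    G: generator stand-in taking (source, source pose, previous frame,
    previous pose, target pose); P: pose-extraction stand-in."""
    t_i  = G(s, P(s), s,    P(s),    P(d_i))   # Eq. 3.2: source doubles as "previous"
    t_i1 = G(s, P(s), t_i,  P(d_i),  P(d_i1))  # Eq. 3.3, first step
    t_i2 = G(s, P(s), t_i1, P(d_i1), P(d_i2))  # Eq. 3.3, second step
    return t_i, t_i1, t_i2
```

Feeding each generated frame (not the real one) back in as the previous frame is the point: it trains the generator under the same recurrent conditions it faces at inference time.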
Note that if in Eq. 3.3 we used the real previous frame d_i, the generator would ignore the source image s, since d_i is much more similar to d_{i+1} than s is.

Losses. We use a combination of losses from the pix2pixHD [38] model. We employ the least-squares GAN [22] for the adversarial loss:

L^D_gan(t_i, t̂_i) = (C(t_i) − 1)^2 + C(t̂_i)^2,    (3.4)

L^G_gan(t̂_i) = (C(t̂_i) − 1)^2,    (3.5)

where C is the patch-based critique described in Section 3.2. To drive image reconstruction we also employ feature matching [38] and perceptual [17] losses:

L_rec(t_i, t̂_i) = ‖N_k(t̂_i) − N_k(t_i)‖_1,    (3.6)

where N_k is the feature representation from the k-th layer of a network; for the perceptual loss this network is VGG-19 [32] and for feature matching it is the critique C. The total loss is given by:

L = Σ_i L^G_gan(t̂_i) + λ L_rec(t_i, t̂_i),    (3.7)

where, following [38], λ = 10.

3.4 SPADE-based Generator

In order to further improve the quality of the proposed framework, we employ the idea used in the recently released SPADE model [26], which solves image-to-image translation from a semantic layout to a realistic image. This model is based on a simple yet effective spatially-adaptive normalization. The idea behind this normalization layer is to learn the affine parameters from semantic segmentation maps. Conveniently, in our problem setting, segmentation maps of human bodies are already available, since DensePose estimates include accurate body-part segmentations as one of the output channels. Hence, we replace the regular Residual blocks with SPADE Residual blocks and change the Decoder to use spatially adaptive normalization as well. Each of the new normalization layers receives as input the segmentations obtained from the target pose P(d_i).
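A minimal numpy sketch of spatially-adaptive normalization follows. For illustration, the small networks that SPADE learns for the scale and shift maps are replaced by simple per-part weight matrices `gamma_w` and `beta_w`; this is an assumption of the sketch, not the actual SPADE parameterization.

```python
import numpy as np

def spade(x, seg, gamma_w, beta_w, eps=1e-5):
    """Spatially-adaptive normalization sketch. Normalize the feature map
    x (C, H, W) per channel, then modulate it with scale and shift maps
    predicted from the body-part segmentation seg (S, H, W).
    gamma_w, beta_w: (C, S) stand-ins for SPADE's learned conv nets."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    gamma = np.einsum("cs,shw->chw", gamma_w, seg)  # spatially varying scale
    beta = np.einsum("cs,shw->chw", beta_w, seg)    # spatially varying shift
    return (1 + gamma) * x_norm + beta
```

The key property is that the affine modulation varies per pixel according to which body part occupies that pixel, which is what lets the segmentation steer the decoded layout.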
Figure 3.5 illustrates the above changes in the architecture and the visual representations of the obtained masks. Unfortunately, we did not observe any improvements in the quantitative evaluations for this architecture modification; however, we believe that this approach has potential. For example, segmentation masks could be used to manipulate the final output by making small changes to the body layout.

Chapter 4

Experiments

We have conducted an extensive set of experiments to evaluate the proposed DwNet. We first describe our newly collected dataset, then compare our method with previous state-of-the-art models for pose-guided human video synthesis and for pose-guided image generation. We show the superiority of our model in terms of realism and temporal coherency. We then evaluate the contribution of each architecture choice and show that each part of the model contributes positively to the results. Finally, we investigate how to further improve the temporal coherency of the model, and show that the temporal discriminator demonstrates superior performance on the Tai-Chi dataset.

4.1 The Fashion Dataset

We introduce a new Fashion dataset containing 500 training and 100 test videos, each containing roughly 350 frames. Videos in our dataset show a single human subject and are characterized by high resolution and a static camera angle and position. Most importantly, clothing and textures are diverse and cover a large space of possible appearances. The dataset is publicly released.

4.2 Setup

4.2.1 Datasets

We conduct our experiments on the proposed Fashion and Tai-Chi [30] datasets. The latter is composed of more than 3000 tai-chi video clips downloaded from YouTube. All previous works [31, 35] used the smaller 64×64 pixel resolution version of this dataset; for this thesis we use 256×256 pixel resolution. The length varies from 128 to 1024 frames per video.
The train and test sets contain 3049 and 285 videos, respectively.

4.2.2 Evaluation metrics

There is no consensus in the community on a single criterion for measuring the quality of generated videos from the perspective of realism, texture similarity and temporal coherence. We therefore choose a set of widely accepted evaluation metrics.

Perceptual loss. Texture similarity is measured using a perceptual loss. Similarly to our training procedure, we use the VGG-19 [32] network as a feature extractor and compute the L1 loss between the features extracted from real and generated frames.

FID. We use the Frechet Inception Distance [12] (FID) to measure realism of the individual frames. FID is a widely used metric for comparing GAN-based methods.

AKD. We evaluate whether the motion is correctly transferred by means of the Average Keypoint Distance (AKD) metric [31]. Similarly to [31], we employ a human pose estimator [4] and compute the average distance between ground truth and detected keypoints. Intuitively, this metric measures whether the person moves in the same way as in the driving video. We do not report the Missing Keypoint Rate (MKR) because it is similar and close to zero for all methods.

User study. We conduct a user study to measure the overall temporal coherency and quality of the synthesised videos, using Amazon Mechanical Turk (AMT). On AMT we show users two videos (one produced by DwNet and another by a competing method) in random order and ask them to choose the one with higher realism and consistency. We follow the protocol introduced in Siarohin et al. [31].

Figure 4.1: Qualitative Results on Tai-Chi Dataset. The first row illustrates the driving videos; the second row shows results of our method; the third row shows results obtained with Monkey-Net [31].

Temporal discriminator. We evaluate the temporal consistency of the results by using the pretrained temporal discriminator from Sec.
3.2 and the GAN loss L^G_gan from Eq. 3.5. We use this metric only to compare results between our original DwNet model [42] and its temporal modification. The discriminator used for this metric is the one trained in the Temporal DwNet.

4.2.3 Implementation details

All of our models are trained for two epochs. In our case an epoch denotes a full pass through the whole set of video frames, where each sample from the dataset is a set of four frames, as explained in Section 3.3. We train our model starting with a learning rate of 0.0002 and decay it to zero over the course of training. Overall, our model is similar to Johnson et al. [17]. The novel parts of the architecture, the pose encoder E and the appearance encoder A, each contain 2 downsampling Conv layers. The warp module's refine branch R is also based on 2 Conv layers plus 2 additional ResNet blocks. Our decoder D consists of 9 ResNet blocks and 2 upsampling Conv layers. We perform all training procedures on a single GPU (Nvidia GeForce GTX 1080). Our code will be released.

4.3 Comparison with the state-of-the-art

We compare our framework with MonkeyNet [31], the current state-of-the-art method for motion transfer, which solves a similar problem for human synthesis. The first main advantage of our method over MonkeyNet is the ability to generate frames at a higher resolution. Originally, MonkeyNet was trained on 64×64 frames; to conduct fair experiments, we re-train MonkeyNet from scratch to produce images of the same size as our method. Our task is quite novel and there is a limited number of baselines. To this end, we also compare with the Coordinate Inpainting framework [10], which is state-of-the-art for image (not video) synthesis, i.e., synthesis of a new image of a person based on a single image. Even though this framework solves a slightly different problem, we still choose to compare with it, since it similarly leverages DensePose [11].
This approach does not have any explicit background handling mechanism, therefore no experimental results are reported on the Tai-Chi dataset. Note that since the authors of the paper have not released the code for the method, we were only able to run our experiments on a pre-trained network.

The quantitative comparison is reported in Table 4.1. Our approach outperforms MonkeyNet and Coordinate Inpainting on both datasets and according to all metrics. With respect to MonkeyNet, this can be explained by its inability to grasp complex human poses; hence it completely fails on the Tai-Chi dataset, which contains a large variety of non-trivial motions. This can be further observed in Figure 4.1: MonkeyNet simply remaps the person from the source image without modifying the pose. In Figure 4.2 we can still observe a large difference in terms of human motion consistency and realism. Unlike our model, MonkeyNet produces images with missing body parts. For Coordinate Inpainting, the poor performance can be explained by the lack of temporal consistency, since (unlike our method) it generates each frame independently and hence lacks consistency in clothing texture and appearance. Coordinate Inpainting is heavily based on the output of DensePose and does not correct the resulting artifacts, as is done in our model using the refined warp grid estimate. As one can see from Figure 4.2, the resulting frames are blurry and inconsistent in small details. This also explains why such a small percentage of users prefer the results of Coordinate Inpainting.

                            Fashion                        Tai-Chi
                            Perceptual (↓)  FID (↓)  AKD (↓)  Perceptual (↓)  FID (↓)  AKD (↓)
Monkey-Net [31]             0.3726          19.74    2.47     0.6432          94.97    10.4
Coordinate Inpainting [10]  0.6434          66.50    4.20     -               -        -
Ours                        0.2811          13.09    1.36     0.5960          75.44    3.77

Table 4.1: Quantitative Comparison with the State-of-the-Art.
Performance on the Fashion and Tai-Chi datasets is reported in terms of the Perceptual Loss, Frechet Inception Distance (FID) and Average Keypoint Distance (AKD).

The user study comparison is reported in Table 4.2, where we can observe that videos produced by our model were significantly more often preferred by users in comparison to the videos from the competing models.

Figure 4.2: Qualitative Results on Fashion Dataset. The first row illustrates the driving videos; the second row shows results of our method; the third row shows results obtained with Monkey-Net [31]; the fourth row shows results of Coordinate Inpainting [10]. Please zoom in for detail.

4.4 Ablation

The table in Figure 4.3 (right) shows the contribution of each of the major architecture choices, i.e., the Markovian assumption (No di), the refined warp grid estimate (No di, W) and the coarse warp grid estimate (No di, W, W).

                            Fashion              Tai-Chi
                            User-Preference (↑)  User-Preference (↑)
Monkey-Net [31]             60.40%               85.20%
Coordinate Inpainting [10]  99.60%               -

Table 4.2: User study. Percentage of the time our method is preferred over one of the competing approaches.

              Fashion
              Perceptual (↓)  FID (↓)
No di, W, W   0.29            17.18
No di, W      0.29            15.37
No di         0.29            15.05
Full          0.28            13.09

Table 4.3: Quantitative evaluations for ablations on Fashion Dataset. See text for details.

For these experiments we remove the mentioned parts from DwNet and train the resulting model architectures. As expected, removing the Markovian assumption, i.e., not conditioning on the previous frame, leads to worse realism and lower similarity with the features of a real image, mainly because it leads to a loss of temporal coherence. Further removal of both warp grid estimators from the generation pipeline results in a worse FID score. The perceptual loss is not affected by this change, which can be explained by the fact that the warp module mostly improves the realism of an image rather than its perceptual features.
In Figure 4.3 (left) we see the qualitative reflection of our quantitative results. The full model produces the best results; the third column shows misalignment between the textures of two frames. The architecture without the refined warp produces less realistic results, with a distorted face. Lastly, the architecture without any warp produces blurry, unrealistic results with an inconsistent texture.

Figure 4.3: Ablations on Fashion Dataset. Qualitative results from the ablated methods. The first column shows two random frames from the real video, the second column shows the corresponding frames generated by the full model, the third column displays the results produced by the model trained without previous frames, and the fourth and fifth columns show results with just the refined warp and without any warp, respectively.

4.5 Temporal consistency

We have conducted extensive experiments to evaluate the quantitative performance of the temporal DwNet. Performance on the Fashion dataset has not shown any score improvements, and the qualitative analysis confirmed this. However, the temporal model trained on the Tai-Chi dataset outperforms the original DwNet model in the Perceptual and L^G_gan measures, as shown in Table 4.4. We suspect that the reason the Temporal DwNet is not useful on the Fashion dataset is the already high quality of the original results. This high accuracy can be attributed to the more consistent and slow motions across the Fashion dataset, as well as its high resolution and static white background.

                Perceptual (↓)  L^G_gan (↑)
DwNet           0.5960          1.9399
Temporal DwNet  0.5621          2.0274

Table 4.4: Quantitative evaluation of the Temporal DwNet. Comparison between DwNet and its temporal version on the Tai-Chi dataset.

Chapter 5

Conclusion

In this thesis we present DwNet, a generative architecture for pose-guided video generation. Our model can produce high quality videos based on a single source image depicting a human appearance and a driving video of another person moving.
We propose a novel Markovian model to address the temporal inconsistency that typically arises in video generation frameworks. Moreover, we suggest a novel warp module that is able to correct warping errors. Additionally, by using the DensePose [11] estimations we significantly improve upon existing works that rely on learned pose keypoints. We validate our method on two video datasets, Tai-Chi and the new high-resolution Fashion Modeling dataset. The latter was collected by us and made publicly available. We show the superiority of our method over the baselines. We also investigate how to further improve temporal consistency using a different type of discriminator, and show that for the Tai-Chi dataset we achieve slight improvements with the new discriminator. Additionally, we propose an updated architecture based on the SPADE network [26] and run comparison experiments. We have not observed any quantitative improvements in the SPADE-based architecture; however, we believe that this updated architecture has potential for future work.

5.1 Limitations and future work

The proposed method is not without limitations. We have observed that the model is strongly dependent on the DensePose outputs, and since the pose estimations are not designed to be consistent across frames, this naturally leads to inconsistency in our own results. One approach to resolve this issue is a joint training procedure for DensePose and our framework, so that DensePose would be fine-tuned to our particular task. Another possible way to improve the body estimates is to modify the Warp Module so that it also learns how to correct the DensePose artefacts.

An alternative way to improve the overall performance is to learn from multiple images of the source person instead of one (when those images are available), as has been done by Wang et al. [40].
Multiple images of a person from different angles significantly simplify the problem of "guessing" the appearance of unseen body parts.

Finally, we could step in the direction of a slightly different task: making the model learn how to separate faces and physique from clothes and accessories, which leads to the task of virtual try-on. This modification would allow swapping the clothing style of a source person with that of a person from a target video, meaning that the person in the video is "trying on" the clothes of the person in the source image.

Bibliography

[1] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In ICLR, 2017. → page 4
[2] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018. → pages 1, 4, 6, 12
[3] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975. → page 9
[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017. → page 16
[5] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. In ECCV, 2018. → pages 2, 4
[6] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and J. Yin. Soft-gated warping-gan for pose-guided person image synthesis. In NIPS, 2018. → page 1
[7] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation. In CVPR, 2018. → page 1
[8] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016. → page 4
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. → pages 1, 3
[10] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and V. Lempitsky. Coordinate-based texture inpainting for pose-guided image generation. In CVPR, 2019. → pages x, 1, 4, 18, 19
[11] R. A. Guler, N.
Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018. → pages 1, 4, 5, 6, 9, 10, 18, 23
[12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. → page 16
[13] S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In CVPR, 2018. → page 3
[14] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. → page 12
[15] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. → pages 3, 6
[16] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015. → page 10
[17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. → pages 3, 13, 17
[18] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014. → pages 1, 3
[19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015. → pages 4, 9
[20] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017. → pages 1, 3, 6
[21] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz. Disentangled person image generation. In CVPR, 2018. → page 1
[22] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017. → page 13
[23] N. Neverova, R. Alp Guler, and I. Kokkinos. Dense pose transfer. In ECCV, 2018. → pages 1, 4
[24] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015. → page 4
[25] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C.
Berg. Transformation-grounded image generation network for novel 3d view synthesis. In CVPR, 2017. → page 3
[26] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019. → pages 13, 23
[27] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. → page 3
[28] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In ICCV, 2017. → page 4
[29] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe. Deformable gans for pose-based human image generation. In CVPR, 2018. → pages 1, 4, 6
[30] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. First order motion model for image animation. In NeurIPS, 2019. → pages 4, 16
[31] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. Animating arbitrary objects via deep motion transfer. In CVPR, 2019. → pages x, 1, 4, 16, 17, 18, 19
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. → pages 13, 16
[33] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, 2015. → page 4
[34] S.-H. Sun, M. Huh, Y.-H. Liao, N. Zhang, and J. J. Lim. Multi-view to novel view: Synthesizing novel views with self-learned confidence. In ECCV, 2018. → page 3
[35] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018. → pages 4, 16
[36] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017. → page 4
[37] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016. → page 4
[38] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B.
Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, 2018. → pages 3, 11, 13
[39] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In NIPS, 2018. → pages 2, 4
[40] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro. Few-shot video-to-video synthesis. In NeurIPS, 2019. → pages 5, 24
[41] W. Wang, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, and N. Sebe. Every smile is unique: Landmark-guided diverse smile generation. In CVPR, 2018. → page 4
[42] P. Zablotskaia, A. Siarohin, B. Zhao, and L. Sigal. Dwnet: Dense warp-based network for pose-guided human video generation, 2019. → page 17
[43] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu. Human appearance transfer. In CVPR, 2018. → page 1
[44] B. Zhao, L. Meng, W. Yin, and L. Sigal. Image generation from layout. In CVPR, 2019. → page 3
[45] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. Metaxas. Learning to forecast and refine residual motion for image-to-video generation. In ECCV, 2018. → page 4
[46] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In ECCV, 2016. → page 3

Appendix A

Additional Visualizations

Additional static and video visualisations of our results. Please zoom into the images. The demo video can be found under the following url or as a supplement to this thesis.

Figure A.1: DwNet results on Fashion dataset. First example. First row is a driving video. Second row is a set of the extracted Dense Poses. The last three rows are results of the motion transfer applied to three source images.

Figure A.2: DwNet results on Fashion dataset. Second example. First row is a driving video. Second row is a set of the extracted Dense Poses. The last three rows are results of the motion transfer applied to three source images.

Figure A.3: DwNet results on Tai-Chi dataset. First example. First row is a driving video. Second row is a set of the extracted Dense Poses.
The last two rows are results of the motion transfer applied to two source images.

Figure A.4: DwNet results on Tai-Chi dataset. Second example. First row is a driving video. Second row is a set of the extracted Dense Poses. The last two rows are results of the motion transfer applied to two source images.

