Deep Kernel Mean Embeddings for Generative Modeling and Feedforward Style Transfer

by

Tian Qi Chen

B.Sc., The University of British Columbia, 2015

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2017

© Tian Qi Chen 2017

Abstract

The generation of data has traditionally been specified using hand-crafted algorithms. However, oftentimes the exact generative process is unknown while only a limited number of samples are observed. One such case is generating images that look visually similar to an exemplar image or as if coming from a distribution of images. We look into learning the generating process by constructing a similarity function that measures how close the generated image is to the target image. We discuss a framework in which the similarity function is specified by a pre-trained neural network without fine-tuning, as is the case for neural texture synthesis, and a framework where the similarity function is learned along with the generative process in an adversarial setting, as is the case for generative adversarial networks. The main point of discussion is the combined use of neural networks and maximum mean discrepancy as a versatile similarity function.

Additionally, we describe an improvement to state-of-the-art style transfer that allows faster computations while maintaining generality of the generating process. The proposed objective has desirable properties such as a simpler optimization landscape, intuitive parameter tuning, and consistent frame-by-frame performance on video. We use 80,000 natural images and 80,000 paintings to train a procedure for artistic style transfer that is efficient but also allows arbitrary content and style images.

Lay Summary

While physical actions generate data in the real world, this thesis discusses the problem of simulating this generation procedure with a computer.
The quality of the simulation can be determined by designing a suitable similarity measure between a data set of real examples and the simulated data. We contribute to this line of work by proposing and analyzing the performance of a certain similarity measure that can be arbitrarily complex while still being easy to compute. Furthermore, we contribute to the problem of generating stylistic images by proposing an efficient approximation to the existing state-of-the-art method.

Preface

Chapter 3 of this thesis is original, unpublished, independent work by the author, Tian Qi Chen.

The work described in Chapter 4 of this thesis was performed in collaboration with the author’s supervisor Mark Schmidt. All code was written by the author, Tian Qi, while Mark helped analyze the results of the experiments. This work has not been published, but a preprint for the proposed method described in Chapter 4 is available online as “Fast Patch-based Style Transfer of Arbitrary Style”. This preprint was written by both Tian Qi and Mark. A shorter version of this preprint was accepted as an oral presentation at the Constructive Machine Learning (CML2016) workshop.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
    1.1 Generative modeling
    1.2 Deep neural networks
    1.3 Image synthesis tasks
2 Background
    2.1 Deep Learning
        2.1.1 Modularity of neural networks
        2.1.2 Learning by gradient descent
        2.1.3 Choice of network modules
    2.2 Loss functions
        2.2.1 Divergences
        2.2.2 Integral probability metrics
        2.2.3 Maximum mean discrepancy
    2.3 Generative adversarial networks
    2.4 Texture synthesis and style transfer
        2.4.1 Neural style transfer
3 Kernelized GAN
    3.1 Motivation
    3.2 MMD with a trained feature extractor
        3.2.1 Fixing instability with asymmetric constraints
        3.2.2 Empirical estimates
        3.2.3 Kernels and hyperparameters
        3.2.4 Intuitions behind the squared MMD objective
    3.3 Experiments
        3.3.1 Toy distributions
        3.3.2 Qualitative samples
        3.3.3 Latent space interpolation
        3.3.4 Conditional generation
4 Feedforward style transfer
    4.1 The need for faster algorithms
    4.2 Style transfer as one-shot distribution alignment
        4.2.1 Style Swap
        4.2.2 Comparison with neural texture synthesis
        4.2.3 Parallelizable implementation
        4.2.4 Optimization formulation
    4.3 Inverse network
        4.3.1 Training the inverse network
        4.3.2 Feedforward style transfer procedure
    4.4 Experiments
        4.4.1 Style swap results
        4.4.2 CNN inversion
        4.4.3 Computation time
5 Conclusion
Bibliography
Appendices
    A Linear time and space complexity kernel sums
    B Inverse network architecture

List of Tables

4.1 Mean computation times of style transfer methods that can handle arbitrary style images. Times are taken for images of resolution 300 × 500 on a GeForce GTX 980 Ti. Note that the number of iterations for optimization-based approaches should only be viewed as a very rough estimate.

B.1 Truncated VGG-19 network from the input layer to “relu3_1” (last layer in the table).

B.2 Inverse network architecture used for inverting activations from the truncated VGG-19 network.

List of Figures

1.1 (Left) A latent variables model, where the observed data is assumed to be a function of some hidden unobserved variables. (Right) A generative model that approximates the generative process by first sampling z then passing it through a learned function; the synthetic data is then compared with real data to train the generative model.
2.1 Illustration of neural texture synthesis [22] and style transfer [24]. Multiple losses are defined at different layers of a convolutional neural network. The synthetic image must minimize the L2 norm between its features and those of the content image, while also minimizing the MMD (2.18) between its features and those of the style image.

2.2 Neural texture synthesis views patches of the exemplar texture as samples and synthesizes a new image that minimizes the MMD between the synthetic patches and the exemplar patches.

2.3 Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. (Part I)

2.4 Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. (Part II)

2.5 Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. (Part III)

3.1 (a) Depiction of training instability when using the symmetric IPM. (b) Plots of the values of $\mathbb{E} f(x)$ and $\mathbb{E} f(G(z))$ during the course of training WGAN on CIFAR-10. For the rightmost figure, we add an asymmetric regularization described in (3.5) to stabilize training.

3.2 Random samples after 100 epochs of training WGAN with weight clipping ($\mathcal{F} = \{f : \|f\|_L \le 1\}$) on CIFAR-10. The specific loss function used to train the discriminator is shown below each figure.

3.3 IPM contours based on (a) approximate WGAN with weight clipping [2], (b) approximate WGAN with gradient penalty [28], (c) maximum mean discrepancy with Gaussian kernel, and (d) maximum mean discrepancy with trained discriminator and Gaussian kernel. The contour lines for kernelized GAN appear very close to the samples.

3.4 Random samples from the MNIST dataset. GMMN requires a much higher minibatch size to produce quality samples.

3.5 Random samples from the LFW dataset. GMMN is trained using a batch size of 1024 whereas kernelized GAN uses a batch size of 64.

3.6 Random samples from the LSUN dataset. Kernelized GAN can be trained using different kernels, with the discriminator appropriately adapting to the specific manifold required to use the kernel.

3.7 Interpolation within latent space for Kernelized GAN trained on LFW.

3.8 Interpolation within latent space for Kernelized GAN trained on LSUN bedrooms.

3.9 Conditional generation of MNIST digits. Each row corresponds to a different label from 0 to 9.

4.1 Illustration of existing feedforward methods [14, 41, 87, 88] that simply try to minimize (2.17) by training a separate neural network for each style image, or a limited number of style images.

4.2 We propose a one-shot concatenation method based on a simple nearest neighbour alignment to combine the content and style activations. The combined activations are then inverted back into an image by a trained inverse network.

4.3 Style swap performs a forced distribution alignment by replacing each content patch by its nearest neighbour style patch.

4.4 Illustration of a style swap operation. The 2D convolution extracts patches of size 3 × 3 and stride 1, and computes the normalized cross-correlations. There are $n_c = 9$ spatial locations and $n_s = 4$ feature channels immediately before and after the channel-wise argmax operation. The 2D transposed convolution reconstructs the complete activations by placing each best matching style patch at the corresponding spatial location.

4.5 The inverse network takes the style swapped activations and produces an image. This network is trained by minimizing the L2 norm (orange line) between the style swapped activations and the activations of the image after being passed through the pretrained network again.

4.6 We propose the first feedforward method for style transfer that can be used for arbitrary style images. We formulate style transfer using a constructive procedure (Style Swap) and train an inverse network to generate the image.

4.7 The effect of style swapping in different layers of VGG-19 [82], and also in RGB space. Due to the naming convention of VGG-19, “reluX_1” refers to the first ReLU layer after the (X − 1)-th maxpooling layer. The style swap operation uses patches of size 3 × 3 and stride 1, and then the RGB image is constructed using optimization.

4.8 Our method achieves consistent results compared to existing optimization formulations. We see that Gatys et al.’s formulation [23] has multiple local optima while we are able to consistently achieve the same style transfer effect with random initializations. Figure 4.9 shows this quantitatively.

4.9 Standard deviation of the RGB pixels over the course of optimization is shown for 40 random initializations. The lines show the mean value and the shaded regions are within one standard deviation of the mean. The vertical dashed lines indicate the end of optimization. Figure 4.8 shows examples of optimization results.

4.10 We can trade off between content structure and style texture by tuning the patch size. The style images, Starry Night (top) and Small Worlds I (bottom), are shown in Figure 4.13.

4.11 We compare the average loss (4.7) achieved by optimization and our inverse networks on 2000 variable-sized validation images and 6 variable-sized style images, using patch sizes of 3 × 3. Style images that appear in the paintings dataset were removed during training.

4.12 Compute times as (a,b) style image size increases and (c,d) as content image size increases. The non-variable image size is kept at 500 × 500. As shown in (a,c), most of the computation is spent in the style swap procedure.

4.13 Qualitative examples of our method compared with Gatys et al.’s formulation of artistic style transfer.

Acknowledgements

First and foremost, I am deeply thankful to my supervisor Mark Schmidt for not just the technical insights and guidance, but also for his unending support and encouragement throughout my graduate studies. Mark has given me ample opportunities to explore research areas I find interesting, all the while providing much needed redirections when I hit unforeseen roadblocks.

I would like to express my utmost gratitude to other faculty members who have given me valuable advice before and during the course of my program. Thank you Kevin Leyton-Brown for driving me on the path to research and critical thinking.
Thank you Alexandre Bouchard-Côté, Nicholas Harvey, and Michael Gelbart for being awesome professional role models.

I must also thank Issam Laradji, Alireza Shafaei, Mohamed Ahmed, Reza Babanezhad, and everyone in the machine learning lab at UBC for their lighthearted conversations during stressful workdays.

I thank my friends Brendan Shillingford and Yi Pin Cao for their friendship and general existence.

Last but not least, I am also grateful to all the participants of the machine learning and machine learning theory reading groups for spending our time together to learn new topics and explore interesting directions.

Dedication

To my parents for their unconditional love and sacrifices.

Chapter 1

Introduction

Augmenting real data with synthetic data can improve classification results, especially when data is scarce [78]. Conditionally generated data can be used in applications such as fraud detection, where only a small amount of customer data is available, or spam filtering, where there is a large imbalance between positive and negative samples. It is also possible to simply generate images that are aesthetically pleasing and entertaining. The generation process can be used as an advanced image filter, such as one that creates haunted buildings and monster faces from normal photographs [92]. This kind of learned or mimicked creativity allows fast production of artistic textures in creative applications.

This thesis discusses methods for the synthesis or generation of data. The act of creating new data requires sufficient knowledge of the underlying properties of the real data. Though perfectly mimicking the data generating process would require complete understanding of the real world, it is possible to approximate the generating process with a simple model and few assumptions. For image data, it is possible to mimic the contours of a face, or the artistic style of a painting.
By using domain knowledge about the structure of image patterns, recent methods have been able to synthesize compelling new paintings of existing artistic styles. Methods that can create a seemingly infinite number of new images from just a single exemplar painting can be re-interpreted as mimicking particular patterns of the exemplar painting. By only slightly perturbing the location of these patterns, new structure can be injected into the painting. Existing methods for this task are slow or limited, and in this thesis we propose a method to speed up the generating process without sacrificing generality.

1.1 Generative modeling

A generative process describes how data are sampled and which factors lead to specific changes in the samples. If the real underlying generative process is known, then statistical properties and even causal relations can be extracted. In an ideal setting, discriminative methods will no longer be hindered by a limited number of samples, as samples can be generated freely. Many detectors can become more robust to small changes by augmenting the classifier with synthetic data. For instance, face detectors should not be fixated on the specific shape or texture details of facial features. Such overfitting can be avoided by having more data, but while real data is scarce, synthetic data can be produced easily.

Figure 1.1: (Left) A latent variables model, where the observed data is assumed to be a function of some hidden unobserved variables. (Right) A generative model that approximates the generative process by first sampling z then passing it through a learned function; the synthetic data is then compared with real data to train the generative model.

Typically the generative process is not known and only a finite number of samples are present. When the generative process is not known or intractable to infer, as is often the case for real world data, it is possible to approximate the underlying process with a generative model.
This generative model can be trained using the finite samples that are available. Obviously the model could be imperfect and may only replicate simple aspects of the generative process. However, an approximate model can still be used to increase the effectiveness of discriminative classifiers [78] and perform complex image processing such as inpainting [93] or de-occlusion [96]. The generative model can be further augmented to produce an approximate density estimation [13] or extract cross-domain properties between pairs of datasets [42, 97]. This is by no means an exhaustive list, and newer applications of generative models are still being discovered.

Constructing a generative model and replicating the real generative process can range from easy to incredibly difficult. Generating the outcome of a coin toss or a dice roll is much simpler than generating an image of a person’s face or a piece of artwork. However, it may be possible that a complicated image is composed of simpler components. For example, the structure of a person’s face is fixed, with a set of facial features ranging in size and shape. It may be possible to recreate, or at least approximate, the generative process of a face image by flipping numerous coins. This is the idea that high-dimensional data like images lie on a simpler lower-dimensional manifold. In other words, the data is caused or described by a number of simpler variables. An approximate model can then be constructed by first sampling these simpler variables, then mapping or relating these samples to a point in the high-dimensional space. This kind of generative model is often referred to as a latent variable model, where an assumption is made that the data we want to generate can be represented using a set of latent (hidden; unobservable) random variables.

1.2 Deep neural networks

A model that maps a set of simple random variables to samples from the data distribution can be a very complex function.
The state-of-the-art model for learning arbitrary relations is a deep neural network. The family of models described by deep neural networks is extremely large, including simpler models such as linear or logistic regression. Recent works have shown neural networks significantly outperforming other models in computer vision, natural language processing, and generative modeling. Researchers have only begun to understand the intricacies of the properties that a neural network learns. It is known that neural nets can be used to infer the latent variables of a dataset [6] or be used to extract important properties of how humans classify images [61]. For a history of the development of neural networks, refer to a recent survey by Schmidhuber [80].

In simple terms, a neural network is a differentiable function. Certain differentiable functions work better than others, or make different assumptions on the input. A neural network $f(\theta)$ often has a fixed structure or architecture $f$ but tunable parameters $\theta$. The output of $f(\theta)$ changes depending on the values of $\theta$. Training a neural net refers to finding suitable parameters such that $f(\theta)$ has desirable properties. The large number of parameters available to a neural net gives it the versatility to mimic highly complex functions. For the task of approximating a generative process, a neural network’s output should be samples from the data distribution.

In practice, certain neural network modules (functions) work better than others. Certain modules make explicit assumptions on the input structure. For example, a convolutional neural net extracts only translation-invariant information from an image, as it makes the assumption that objects in an image can appear in any spatial location and have the same meaning.
A convolutional layer can be seen as a trainable filter that slides across an image. Prior to convolutional neural networks, computer vision researchers used hand-designed algorithms that extract certain features (e.g. faces) from an image. However, human-engineered algorithms often do not perform as well as a trained convolutional network when the true function is very complex. In early research on texture synthesis, researchers defined texture using a set of constraints such as periodicity and spatial homogeneity [17, 18, 72]. More recent methods based on convolutional networks [22] have been shown to work with a larger set of images, especially colour images, though they still implicitly make a spatial homogeneity assumption.

1.3 Image synthesis tasks

In this thesis, we discuss two related tasks, both regarding the generation of images. The first task is to simply learn an approximate model of the generative process of images, given a large set of examples. The examples can be a dataset of natural images, faces, etc.

The second task is to learn a generative model of a certain type of artistic style given only a single exemplar image. The first task is rather generic and the methods can be applied to other forms of data, but the second task requires more assumptions about the data, such as that the painting style is consistent throughout the image. A simpler extension of this task is style transfer, where the artistic style of an image is transferred to the structural content of another image.

Learning such generative processes is difficult. The most popular method for training a neural net is gradient descent on a loss function $L(f(\theta))$. However, what is the correct loss function for these tasks? We look into a very versatile and generic loss function called maximum mean discrepancy. In simple terms, maximum mean discrepancy is a measure of difference between two probability distributions, i.e. the true data distribution and the generative model distribution.
This metric can be used for both tasks, makes minimal assumptions about the data, and is easy to compute.

We combine maximum mean discrepancy with a neural network for both tasks, though the first task is harder and requires an extra regularization trick. We also discuss how a part of a recent popular style transfer algorithm can be seen as minimizing the maximum mean discrepancy. We propose a branch of fast style transfer algorithms by approximating the process with a trained neural network.

The contributions of this thesis include:

• A generalization of generative moment matching networks, a certain type of generative model, to make use of the representational power of neural networks. We develop a regularization term to stabilize training of the model, which shows improvement in generating colored images compared to baseline generative moment matching networks.

• A new take on style transfer where a forced distributional alignment is performed inside the feature space of a neural network, then inverted back into image space. This process generalizes to arbitrary artistic images, which prior feedforward style transfer methods do not, while being faster than methods that are based on optimization.

Chapter 2

Background

2.1 Deep Learning

The works described in this thesis use deep neural networks as modeling tools. These are flexible models that can approximate any multivariate continuous function as long as the network has enough free parameters and enough data to learn from [80]. Due to their unparalleled expressiveness, deep neural networks are used extensively in computer vision, natural language processing, and probabilistic modeling.

2.1.1 Modularity of neural networks

Neural networks are compositions of linear and non-linear functions, where the non-linearities are typically applied element-wise and are differentiable. The linear functions of the neural network introduce tunable weights, or parameters, that change the output of the function.
The composition of functions lends itself to a high degree of modularity, as larger networks can be created by composing smaller networks. In this context, the small networks are often referred to as layers or modules. Intermediate outputs of each layer are called features or activations. In more complex applications, network modules can have separate use cases and be trained either independently or all together in an end-to-end fashion. For example, a variational autoencoder [44] is a neural network where one module encodes data samples to samples from a simple (e.g. Gaussian) prior distribution and another module decodes samples from the prior distribution to data samples. Trained in an end-to-end fashion, this network learns a latent variables model.

An important characteristic of neural nets is the differentiability of the entire network by application of the chain rule. In deep learning, the execution of the function is called forward propagation and computing the gradient (first derivatives) with respect to its input and weights is called backpropagation, as gradients are propagated from the end of the network to the input layer.

2.1.2 Learning by gradient descent

As the entire network is differentiable, the most popular method of finding the best values for the weights of the network is through gradient descent, a first-order optimization technique. Gradient descent in its simplest form finds local minima of continuous functions

$$\operatorname*{argmin}_{w} L(X, w) \tag{2.1}$$

where $X$ represents the data, $w$ are the weights of the network, and $L$ is a loss function. The loss function can contain a neural network itself, whereby after the optimal $w$ is found, only the neural network is used for inference. Gradient descent is a simple iterative method where at each iteration the weights $w$ are perturbed slightly to improve the loss function.
The update equation is

$$w_{\text{next}} = w - \alpha \nabla_w L(X, w) \tag{2.2}$$

where $\nabla_w L(X, w)$ is the gradient, denoting the multivariate first-order derivatives of $L(X, w)$ with respect to $w$, and $\alpha$ is the step size.

However, when the amount of data is large, even computing $L(X, w)$ can be intractable. As gradient descent is an iterative procedure, a few thousand iterations are needed depending on the dimension of the weight vector $w$, i.e. the degrees of freedom in the neural network.

To alleviate the problem of the runtime depending on the amount of data, stochastic gradient descent is used in practice. Assuming the loss function can be decomposed as

$$\operatorname*{argmin}_{w} \sum_{i=1}^{N} L_i(x_i, w) \tag{2.3}$$

then stochastic gradient descent looks only at a single sample $x_i$ per iteration:

$$w_{\text{next}} = w - \alpha \nabla_w L_i(x_i, w) \tag{2.4}$$

While the number of iterations required to find the best solution to (2.3) may be higher than solving (2.1), stochastic optimization will typically find decent solutions quickly, as the weights can be updated before even observing the entire dataset. In practice, a small number of samples, called a minibatch, is used per iteration.

One downside of stochastic gradient descent is that its performance near the optimum is quite noisy. As the optimization only sees a handful of samples each iteration, convergence relies on reducing the step size to zero; otherwise, the optimization never converges. It is possible to have both the speed of stochastic gradient descent and the convergence properties of gradient descent by a branch of methods called stochastic average gradient [81]. Other improvements on stochastic gradient descent exist, such as momentum and per-element estimation of curvature.
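The two update rules (2.2) and (2.4) can be sketched on a small least-squares problem. This is only an illustration under stated assumptions: the synthetic data, the step sizes, the minibatch size of 8, and all function names are hypothetical choices made here, not settings used in this thesis.

```python
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the mean squared error (1/n) * ||X w - y||^2 with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, step=0.1, iters=500):
    """Full-batch gradient descent: every update (2.2) uses all of the data."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * loss_grad(w, X, y)
    return w

def minibatch_sgd(X, y, step=0.05, epochs=50, batch=8, seed=0):
    """Stochastic variant: every update (2.4) uses only a small minibatch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))          # reshuffle the samples each epoch
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            w = w - step * loss_grad(w, X[idx], y[idx])
    return w

# Hypothetical synthetic regression data with a little label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w_exact = np.linalg.lstsq(X, y, rcond=None)[0]   # closed-form minimizer, for reference
w_gd = gradient_descent(X, y)
w_sgd = minibatch_sgd(X, y)
```

With a constant step size, the full-batch iterate settles on the least-squares solution, whereas the minibatch iterate keeps fluctuating in a small neighbourhood of it, which is the near-optimum noise described above.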
For a survey on numerical optimization algorithms in the context of machine learning applications, see [4].

2.1.3 Choice of network modules

Though many functions are differentiable, not all are useful in practice. Certain non-linear functions are more amenable to gradient descent optimization, and linear functions can contain structural information that allows training to progress faster. The problem of vanishing and exploding gradients arises during backpropagation when certain modules significantly decrease or increase the magnitude of the gradient from one module to another. As a neural network becomes deeper (composed of more and more modules) the gradient can entirely vanish to zero or explode to large enough values to cause numerical instability. Architectural changes to neural networks have been proposed to counter this problem, such as the use of batch normalization [38], residual layers [30], and many others. These methods allow deeper neural networks to be trained using just gradient descent.

Imposing structure in neural networks is another interesting direction. The most popular structured linear layer is the convolutional layer, which is widely used in computer vision and more recently in natural language processing. Generic convolutional layers impose the constraints of locality and translation-invariance. Each output neuron of a convolutional layer only depends on a local region of the input, and the same function is applied to each region. This formulation is particularly intuitive for images as the convolutional layer learns filters that slide across the image and compute a cross-correlation score at each region of the input, indicating whether a particular pattern exists in that region of the input.
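The sliding-filter view can be sketched in a few lines. This is a minimal illustration assuming a single-channel image, a single filter, stride 1 and no padding; the names `cross_correlate2d`, `pattern` and `image` are hypothetical and chosen only for this example.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` and score each local region (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Locality: each output depends only on a kh x kw window of the input.
            # Weight sharing: the same kernel is applied at every location.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Place a small pattern inside an otherwise empty image.
pattern = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])
image = np.zeros((5, 5))
image[1:3, 2:4] = pattern

response = cross_correlate2d(image, pattern)
peak = np.unravel_index(np.argmax(response), response.shape)
```

The response is largest at the window that exactly covers the pattern, so `peak` recovers the location where the pattern was placed. A trained convolutional layer learns many such filters and adds a bias and a non-linearity, which this sketch omits.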
Convolutional neural networks are the standard in image processing [45, 82], where the winning entries in ImageNet competitions [77] use cleverly designed convolutional layers stacked with non-linear activation functions.

Transposed convolutional layers, a variant where the forward propagation and backpropagation algorithms are reversed, are used to upsample image activations. Transposed convolutions are used in applications for semantic segmentation [59], super-resolution [49], image synthesis [74], and other problems where activations of a small spatial resolution must be upsampled to a larger spatial resolution. See [15] for illustrations.

2.2 Loss functions

Similar to the choice of network architecture, choosing an appropriate loss function is essential to solving any problem with gradient descent. In a supervised learning problem setting, each input $x_i$ is paired with an output value $y_i$. One can train a neural network $f(x, w)$ to learn the relation between each $x_i$ and $y_i$. A common loss function is the squared L2 norm between the neural network output and the desired target $y_i$

$$L(x_i, y_i, w) = \| f(x_i, w) - y_i \|^2. \tag{2.5}$$

Variants include using a different norm or thresholding the loss if $y_i$ is categorical.

The supervised learning setting typically uses loss functions defined on real-valued input/output vectors. However, in certain settings of generative modeling, the set of inputs and outputs are not necessarily paired. When the objective is to train the neural net such that $f(x, w)$ follows a specific distribution (where either one or both of $x$ and $w$ may be random), simple loss functions used in supervised learning do not suffice. Instead, one must define loss functions that differentiate between probability distributions rather than real-valued samples.

It should be noted that using the mean squared error (2.5) to train a supervised learning problem can be perceived as assuming elements of the network output $f(x, w)$ follow independent Normal distributions.
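To spell out this connection (a short sketch; the output dimension $d$ and the fixed variance $\sigma^2$ are notation introduced here for illustration), suppose each target is distributed as $y_i \mid x_i \sim \mathcal{N}(f(x_i, w), \sigma^2 I)$. The negative log-likelihood of a single pair is then

$$-\log p(y_i \mid x_i, w) = \frac{1}{2\sigma^2} \| f(x_i, w) - y_i \|^2 + \frac{d}{2} \log(2\pi\sigma^2),$$

so for a fixed $\sigma$ the second term does not depend on $w$, and minimizing (2.5) over $w$ is the same as maximizing this Gaussian likelihood.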
Other loss functions have similar implications, such as the L1 loss function and the Laplace distribution. However, the difference between a loss function defined on real-valued vectors and a loss function defined on probability distributions is that the particular sample of $x$ is assumed to be random as well and there is no corresponding value of $y$. Instead, $f(x, w)$ is seen as a random vector defined by the particular distribution of $x$. The neural network should map the distribution of $x$ to the distribution of $y$, rather than learning the distribution of $y$ conditioned on single samples of $x$. The two types of modeling are referred to as generative and discriminative, which model the distributions $p(x, y)$ and $p(y|x)$ respectively.

The latent variable model tries to learn how samples $z$ from a simple distribution can be used to generate samples $x$ from a complex data distribution. This problem definition, contrary to a supervised learning setting, does not explicitly assume any particular pairing between individual samples $z$ and $x$, only that any reasonable pairing will suffice. Thus in this unpaired setting, a type of unsupervised learning, only loss functions defined on the distributions of $z$ and $x$ can be used.

2.2.1 Divergences

Measuring the difference between probability distributions is not easy. Some measures make the assumption that the distributions are known, some are not proper metrics, and some are simply intractable to compute.

Let $p_x$ and $p_y$ be two continuous distributions while also representing their density functions. One family of measures consists of f-divergences of the form

\[ D_f(p_x \| p_y) = \mathbb{E}_{s \sim p_y}\left[ f\!\left( \frac{p_x(s)}{p_y(s)} \right) \right] \tag{2.6} \]

where $f$ must be a convex function such that $f(1) = 0$. Different divergences can be constructed by the choice of the function $f$ [71]. Intuitively, this measure of difference computes the average of the density ratio $p_x/p_y$ weighted by the function $f$.
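As a small worked example of (2.6), the sketch below evaluates an f-divergence for discrete distributions, where the expectation becomes a finite sum. The discrete setting, the function names, and the two toy distributions are assumptions made for illustration. The generator $f(t) = t \log t$ is convex with $f(1) = 0$ and recovers the Kullback-Leibler divergence $\mathrm{KL}(p_x \| p_y)$, while $f(t) = \frac{1}{2}|t - 1|$ gives the total variation distance.

```python
import math


def f_divergence(p_x, p_y, f):
    """D_f(p_x || p_y) = sum_s p_y(s) * f(p_x(s) / p_y(s)) for discrete distributions."""
    return sum(q * f(p / q) for p, q in zip(p_x, p_y) if q > 0)


def kl_generator(t):
    # f(t) = t log t, convex with f(1) = 0; the limit 0 log 0 is taken as 0.
    return t * math.log(t) if t > 0 else 0.0


def tv_generator(t):
    # f(t) = |t - 1| / 2, convex with f(1) = 0.
    return 0.5 * abs(t - 1.0)


# Two toy distributions over three outcomes (hypothetical values).
p_x = [0.5, 0.3, 0.2]
p_y = [0.2, 0.3, 0.5]

kl = f_divergence(p_x, p_y, kl_generator)
direct_kl = sum(p * math.log(p / q) for p, q in zip(p_x, p_y))
tv = f_divergence(p_x, p_y, tv_generator)
```

Note that both the ratio $p_x/p_y$ and the weights $p_y$ are needed explicitly here, which is exactly the requirement that makes f-divergences hard to use when only samples are available.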
When $f(t) = t \log t$, (2.6) is the popular Kullback-Leibler divergence $\mathrm{KL}(p_x \| p_y)$, seen in variational inference.

The biggest disadvantage of using f-divergences is that the ratio $p_x/p_y$ must be computable, and the expectation over $p_y$ must be known. For the latter reason, the reverse divergence $D_f(p_y \| p_x)$ is sometimes used if the expectation under $p_x$ is easier to compute. However, f-divergences are typically asymmetric and the choice of $D_f(p_x \| p_y)$ or $D_f(p_y \| p_x)$ can have implications that are not fully understood.

2.2.2 Integral probability metrics

We describe a more general class of difference measures on probability distributions called integral probability metrics (IPMs). Given two distributions $p_x, p_y$ on the same support, an IPM can be used to provide a proper metric defined using the supremum over a function class $\mathcal{F}$,

\[ \mathrm{IPM}(\mathcal{F}, p_x, p_y) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim p_x}[f(x)] - \mathbb{E}_{y \sim p_y}[f(y)] \right|. \tag{2.7} \]

Intuitively the function $f$ extracts meaningful statistics that can be used to discriminate between the two distributions. Conversely, if $p_x$ and $p_y$ are the same distribution, then the mean of any function under $p_x$ and $p_y$ is the same. The function class $\mathcal{F}$ must be chosen to be rich enough but also tractable to compute.

Depending on the choice of $\mathcal{F}$, different named metrics can be constructed.

• $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$ results in the total variation distance. These functions are bounded between -1 and 1.
• $\mathcal{F} = \{f : \|f\|_L \le 1\}$ results in the Wasserstein distance. These functions are 1-Lipschitz continuous.
• $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$ results in the maximum mean discrepancy (MMD). These functions are within the unit ball of a reproducing kernel Hilbert space (RKHS) defined by a well-behaved kernel function.

Note that the constants 1 in the above definitions are merely convention and can be any positive number. Using a different constant simply scales the metric by the same amount.
Note also that this is a non-exhaustive list.

A nice property of integral probability metrics is that they are proper metrics, as opposed to f-divergences which do not satisfy all properties of a metric. Importantly, the definition of IPMs does not require the computation of $p_x$ or $p_y$, only expectations, which can be approximated using only samples. However, this comes at the cost of increased computational complexity, as finding the supremum over a class of functions is often intractable. So far the only IPM that can be tractably computed is maximum mean discrepancy.

2.2.3 Maximum mean discrepancy

When the set $\mathcal{F}$ is the unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the IPM (2.7) is known as maximum mean discrepancy (MMD). For every positive definite kernel, the RKHS is uniquely defined and there exists a feature mapping $\phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. This means that while $\mathcal{H}$ may be an infinite-dimensional function space, the inner product can still be tractably computed using only the kernel function; this is known as the kernel trick.

The idea of feature maps is extended to probability distributions by defining the kernel mean embedding of $p$ as

\[ \phi(p) := \mu_p = \int k(x, \cdot)\, p(x)\, dx = \mathbb{E}_{x \sim p}\, k(x, \cdot). \tag{2.8} \]

An amazing property is that the function attaining the supremum is known up to a constant [26] as

\[ f^*(\cdot) \propto \mathbb{E}_{x \sim p_x} k(x, \cdot) - \mathbb{E}_{y \sim p_y} k(y, \cdot). \tag{2.9} \]

This is referred to as the witness function because it witnesses the difference in distributions. Intuitively, the witness function puts high values on samples from $p_x$ and low values on samples from $p_y$. For samples from regions where $p_x$ and $p_y$ have similar densities, the witness function is near zero.
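The behaviour of the witness function can be checked with a sample-based estimate of (2.9), replacing each expectation by a mean over samples. The sketch below is illustrative only: the Gaussian kernel with $\gamma = 1$, the two one-dimensional Gaussian stand-ins for $p_x$ and $p_y$, and all function names are assumptions rather than anything specified in the text.

```python
import numpy as np


def gaussian_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in a and b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)


def empirical_witness(points, x_samples, y_samples, gamma=1.0):
    """Sample estimate of (2.9): mean_i k(x_i, t) - mean_j k(y_j, t) for each t in points."""
    k_x = gaussian_kernel(x_samples, points, gamma).mean(axis=0)
    k_y = gaussian_kernel(y_samples, points, gamma).mean(axis=0)
    return k_x - k_y


rng = np.random.default_rng(0)
x_samples = rng.normal(loc=-2.0, scale=0.5, size=(500, 1))  # stand-in for p_x
y_samples = rng.normal(loc=2.0, scale=0.5, size=(500, 1))   # stand-in for p_y

# Evaluate near the mass of p_x, halfway between, and near the mass of p_y.
points = np.array([[-2.0], [0.0], [2.0]])
witness = empirical_witness(points, x_samples, y_samples)
```

With these stand-ins the estimate is clearly positive near the samples of $p_x$, clearly negative near the samples of $p_y$, and close to zero in between, matching the description above.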
By plugging (2.9) into (2.7), the squared MMD is computable in closed form [26],

\[ \mathrm{MMD}^2(k, p_x, p_y) = \mathbb{E}_{x, x' \sim p_x}[k(x, x')] + \mathbb{E}_{y, y' \sim p_y}[k(y, y')] - 2\, \mathbb{E}_{x \sim p_x,\, y \sim p_y}[k(x, y)]. \tag{2.10} \]

Maximum mean discrepancy as a measure of difference on probability distributions can be extremely powerful as it is both tractable to compute and satisfies the properties of a metric. Specifically, for certain choices of kernels known as characteristic kernels, MMD is zero if and only if $p_x = p_y$. However, this often does not hold in practice as only a finite number of samples is observed. Moreover, empirical uses of MMD typically involve manually finding the "right" kernel function that works well for the specific data distribution. In high-dimensional spaces, choosing the kernel function often requires heuristics and intuition. (For instance, the popular Gaussian kernel relies on Euclidean distance being meaningful, but the distance between images represented in RGB does not accurately reflect semantic differences.) As a result, MMD has not shown much progress in modeling complex high-dimensional distributions. See the recent survey by Muandet et al. [67] for more information.

Instead of manually choosing a kernel function from a handful of functions, it is possible to learn a kernel function that optimizes the discrepancy measure for a specific task. The kernel function can be parameterized by a neural network, then trained to optimize a specific objective. The neural net can also be viewed as a feature extractor that maps the input space to a low-dimensional manifold where a simpler kernel function can be used more effectively. This is discussed further in Chapter 3.

2.3 Generative adversarial networks

A generative adversarial network in its simplest form consists of a generator network $G$ and an auxiliary discriminator network $D$. The generator network takes random samples from a prior distribution $p_z$ and outputs samples in a single forward pass.
The goal is to train the generator network to be a powerful generative model that can directly learn the generating process of the target data distribution $p_x$.

Original GAN. The pioneering approach proposed by Goodfellow et al. [25] trains the discriminator network to compare the generated samples with real samples from the data distribution in a binary classification task, while the generator network is trained to fool the discriminator in the following objective function:

\[ \min_G \max_D \; \mathbb{E}_{x \sim p_x}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{2.11} \]

Ironically, in practice if the discriminator is able to perfectly distinguish generated and real samples, then no information is passed to the generator network and training stagnates. This is due in part to $D$ having to squash its output to be in (0, 1) with a sigmoid layer, which results in vanishing gradients if the output is close to 0 or 1. When the generator is far from the optimum, the discriminator has a much easier time discriminating between generated and real samples. This leads to the output of $D$ being very close to 0 or 1. This introduces instability during training and is only remedied by careful tuning of network architectures and balancing between training $D$ and $G$.

Generative Moment Matching Networks. As an alternative to the GAN two-player objective function, Dziugaite et al. [16] and Li et al. [55] propose to remove the adversarial aspect and use the MMD objective,

\[ \min_G \; \mathbb{E}_{x, x' \sim p_x}[k(x, x')] + \mathbb{E}_{z, z' \sim p_z}[k(G(z), G(z'))] - 2\, \mathbb{E}_{x \sim p_x,\, z \sim p_z}[k(x, G(z))]. \tag{2.12} \]

The biggest advantage of MMD is that the supremum is implicit, so an adversary is not necessary. The proposed algorithm minimizes an empirical estimate of (2.10) with an additive sum of fixed Gaussian kernels. An alternative version proposed by [86] instead trains the discriminator to maximize the power of the statistical test associated with MMD.
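The empirical MMD estimate with a sum of fixed Gaussian kernels mentioned above can be sketched as follows. This is a simplified illustration, not the implementation of [16] or [55]: it uses the biased estimate (all pairs, including $i = j$), the bandwidth values are hypothetical, and random Gaussian samples stand in for real data and generator outputs.

```python
import numpy as np


def mixture_gaussian_kernel(a, b, gammas=(0.1, 1.0, 10.0)):
    """Sum of Gaussian kernels exp(-gamma * ||a - b||^2) over fixed, hand-picked gammas."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return sum(np.exp(-gamma * sq_dists) for gamma in gammas)


def mmd2_biased(x, y, kernel=mixture_gaussian_kernel):
    """Biased sample estimate of (2.10): means over all pairs, including i == j."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))   # stand-in for data samples
close = rng.normal(0.0, 1.0, size=(200, 2))  # stand-in for a well-trained generator
far = rng.normal(3.0, 1.0, size=(200, 2))    # stand-in for a poorly trained generator

mmd_close = mmd2_biased(real, close)
mmd_far = mmd2_biased(real, far)
```

In a moment matching network the second argument would be $G(z)$ and the quantity would be minimized with respect to the generator weights; no discriminator appears because the kernel is fixed in advance.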
While they can also fine-tune the specific bandwidths of the Gaussian kernel to criticize a trained generative network, existing works are unable to simultaneously train a generative model and optimize the kernel.

Approximate IPM GANs. Recently, Arjovsky et al. [2] proposed to use the Wasserstein distance to train the generative network. This is equivalent to minimizing an IPM with $\mathcal{F}$ the set of all 1-Lipschitz functions. An auxiliary discriminator network (also referred to as a critic) is then used to approximate the supremum over $\mathcal{F}$. Note that if $f \in \mathcal{F}$ implies $-f \in \mathcal{F}$, then the absolute value sign in (2.7) can be removed. This results in the following objective function:

\[ \min_G \max_{D : \|D\|_L \le 1} \; \mathbb{E}_{x \sim p_x}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))] \tag{2.13} \]

Another recent formulation called mcGAN proposes to use functions $f$ that are finite but multidimensional and match both the mean and covariance of the output of $f$. This formulation is a special case of our proposed neural MMD with a polynomial kernel of degree 2 and bias 1.

In practice, the resulting GAN algorithms no longer use log and sigmoid functions, so problematic gradients due to asymptotes are gone. Additionally, both WGAN and mcGAN seem to have reduced or even solved the problem of mode collapse often seen in the original GAN formulation. This may be attributed to IPMs being proper metrics rather than divergences.

However, the current IPM-based GAN algorithms do have some downsides. The proposed WGAN algorithm clips the weights of the discriminator to be within a pre-specified interval to enforce a Lipschitz constraint while the mcGAN algorithm does the same to enforce a bounded constraint. Weight clipping is a simple operation but using it can lead to failure in training as existing stochastic optimization methods can conflict with such a constraint [28].

Moreover, mcGAN performs QR decomposition at every update iteration to ensure orthogonality of weights.
An improved version of WGAN [28] instead adds a regularization term that encourages the discriminator to be 1-Lipschitz along the path between the generated samples and real samples. Both involve additional computation that is more expensive than the original GAN algorithm. Additionally, due to the need to approximate the unknown supremum, the discriminator is trained for more iterations than the generator network. This encourages stability during training but convergence is slower than the original GAN [28].

2.4 Texture synthesis and style transfer

The generation of images started with synthesizing textures. These are images that exhibit a homogeneous property, such that small patches of the image look similar and often display the same pattern. Methods designed for this task [18, 47, 56, 89] often specify an algorithmic approach that generates the image one pixel or small patch at a time. To generate a new pixel, the neighbouring region is typically matched with patches from the exemplar texture image, and the new pixel is picked based on the best matching texture patch. This often works well for simple patterns but does not work as well for complicated artistic images. Even the choice of similarity function between patches is important. For example, simple pointwise differences in color images typically do not have any meaningful interpretation. That is, it is difficult to create a similarity function that conveys visual perception.

More recent methods for texture synthesis operate by specifying a complex similarity function between the generated texture and the real texture. This is often done by taking texture patches as data points, and defining a similarity function based on patches. Methods have been proposed for using principal components [50], wavelet transformations [72], and neural networks [22] that aid in defining such similarity functions by essentially shifting the representation of images to a different domain.
The new domain allows conventional distance metrics to become more visually meaningful.

While it is harder to realize the generative process using the second approach, better results are often achieved by these methods as they only implicitly define the generative process. The key is in choosing a good similarity function, which arguably is more adaptable to different textures and styles, rather than designing the generative process directly, which explicitly assumes certain properties of the texture image such as periodicity or size of its patterns.

2.4.1 Neural style transfer

We describe the work of [22] and [24] in more detail, as their method has achieved surprising results and renewed interest in the area. The significant increase in visual quality comes from the use of convolutional neural networks (CNNs) for feature extraction [19, 21, 23, 51]. The success of these methods has even created a market for mobile applications that can stylize user-provided images on demand [37, 73, 85].

For texture synthesis, we are given a style image $S$ and we want to synthesize an output image $Y$. These images are passed through a CNN, resulting in intermediate activations for some layer $l$ which we denote by $S_l \in \mathbb{R}^{M_l \times D_l}$ and $Y_l \in \mathbb{R}^{N_l \times D_l}$. Here $D_l$ is the number of feature maps for layer $l$ and $N_l, M_l$ are the sizes (height times width) of each feature map for the output and style images respectively. The style reconstruction loss for layer $l$ as defined by [24] is as follows:

\[ L^{(l)}_{\text{style}} = \frac{1}{4 N_l M_l D_l^2} \, \| G(S_l) - G(Y_l) \|_F^2 \tag{2.14} \]

where $G(F) := F^T F \in \mathbb{R}^{D_l \times D_l}$ is the Gram matrix. The complete style reconstruction loss is a sum over multiple layers with differing sizes of receptive fields,

\[ L_{\text{style}} = \sum_{l \in L_s} \alpha_l L^{(l)}_{\text{style}} \tag{2.15} \]

where $L_s$ is a set of layers and $\alpha_l$ is an additional tuning parameter, though typically $\alpha_l$ is simply set to 1. To adapt this loss for style transfer, an additional content image $C$ is provided.
With $C_l \in \mathbb{R}^{N_l \times D_l}$ denoting the content activations, the output image is asked to minimize the following content reconstruction loss

\[ L_{\text{content}} = \frac{1}{2} \, \| C_l - Y_l \|_F^2 \tag{2.16} \]

for some layer $l$. Note that while the style reconstruction requires the use of multiple layers, in practice only a single layer is used for the content reconstruction. The complete loss for neural style transfer is then formulated as a weighted sum of the style and content reconstruction losses,

\[ L_{\text{total}} = \lambda L_{\text{content}} + (1 - \lambda) L_{\text{style}}. \tag{2.17} \]

It is clear that the texture synthesis formulation is matching some statistics of the style image, while using a CNN to map images to a more meaningful manifold. However, this had not been rigorously investigated until recently by [54], who showed that the style reconstruction loss is equivalent to a biased estimate of maximum mean discrepancy, a family of (pseudo-)metrics defined on probability distributions.

Patch-based MMD Minimization

The loss function (2.14) may appear unintuitive, but the underlying similarity function is maximum mean discrepancy, as was shown by Li et al. [54]. The CNN is applied to patches of the input image, and for each patch a sample vector representation is extracted. The loss function (2.14) assumes these vector samples are from a certain distribution and compares them between the generated image and the exemplar image.

Figure 2.1: Illustration of neural texture synthesis [22] and style transfer [24]. Multiple losses are defined at different layers of a convolutional neural network. The synthetic image must minimize the L2 norm between its features and those of the content image, while also minimizing the MMD (2.18) between its features and those of the style image.

With $k(x, y) = (x^T y)^2$, the squared MMD defined on patch samples $S = \{s_i\}$ and $Y = \{y_j\}$ is proportional to the style reconstruction loss.

\[
\begin{aligned}
\mathrm{MMD}_b^2[S, Y]
&= \frac{1}{N^2} \left[ \sum_{i,i'=1}^{N} k(s_i, s_{i'}) + \sum_{j,j'=1}^{N} k(y_j, y_{j'}) - 2 \sum_{i,j=1}^{N} k(s_i, y_j) \right] \\
&= \frac{1}{N^2} \left[ \sum_{i,i'=1}^{N} (s_i^T s_{i'})^2 + \sum_{j,j'=1}^{N} (y_j^T y_{j'})^2 - 2 \sum_{i,j=1}^{N} (s_i^T y_j)^2 \right] \\
&= \frac{1}{N^2} \left[ \| S S^T \|_F^2 + \| Y Y^T \|_F^2 - 2 \| S Y^T \|_F^2 \right] \\
&= \frac{1}{N^2} \, \| S^T S - Y^T Y \|_F^2
\end{aligned}
\tag{2.18}
\]

The last line is proportional to (2.14), indicating that texture synthesis is taking activations from an image and viewing these activations as a dataset. The neural texture synthesis method then creates a new synthetic set of activations by minimizing the MMD between the original and synthesized activations. Note that the activations are not independent, as the convolutional layers of the CNN create local dependencies in the outputs. In fact, the method works precisely because the activations are dependent on each other due to overlapping regions between the receptive fields. This allows the method to create varying textures while remaining visually similar to the original texture, assuming homogeneity.

It is possible to interpret the neural style transfer algorithm as using a specific kernel on the pixels of the image. This kernel is a composition of the polynomial kernel of degree 2 and the CNN. Specifically, small patches of the image are passed through the CNN to obtain single activations, which are then passed to a kernel as a similarity function. This interpretation is in line with interpreting neural nets as learnable complex feature extractors. In this case, the CNN is trained on a fairly challenging dataset, then used as a mapping between arbitrary images and an activation space where the polynomial kernel is more semantically meaningful than when used directly on the pixel space.

With this interpretation, the effectiveness of neural texture synthesis is easy to understand.
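The equality between the kernel form and the Gram-matrix form in (2.18) can be verified numerically. The sketch below is only a sanity check under stated assumptions: random matrices stand in for the CNN activations, both sets have the same number of samples $N$, and the function names are hypothetical.

```python
import numpy as np


def poly2_kernel(a, b):
    """Degree-2 polynomial kernel k(x, y) = (x^T y)^2 for every pair of rows."""
    return (a @ b.T) ** 2


def mmd2_biased(s, y, kernel):
    """First line of (2.18): biased MMD^2 over all pairs of samples."""
    n = s.shape[0]
    return (kernel(s, s).sum() + kernel(y, y).sum() - 2.0 * kernel(s, y).sum()) / n**2


def gram(features):
    """Gram matrix G(F) = F^T F, with one row of F per spatial location."""
    return features.T @ features


rng = np.random.default_rng(0)
n, d = 50, 8                 # N spatial locations, D feature maps (toy sizes)
s = rng.normal(size=(n, d))  # stand-in for style activations S_l
y = rng.normal(size=(n, d))  # stand-in for output activations Y_l

mmd_value = mmd2_biased(s, y, poly2_kernel)
gram_value = np.linalg.norm(gram(s) - gram(y), ord="fro") ** 2 / n**2
```

The two values agree up to floating-point error, and the Gram form only needs two $D \times D$ matrices rather than three $N \times N$ kernel matrices, which is cheaper whenever the number of spatial locations exceeds the number of feature maps.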
It assumes that the image has homogeneous patterns and tries to generate a new image that contains similar homogeneous patterns. Figure 2.2 illustrates this idea, though the actual distribution matching is done at the feature level inside multiple layers of a convolutional neural network.

Figure 2.2: Neural texture synthesis views patches of the exemplar texture as samples and synthesizes a new image that minimizes the MMD between the synthetic patches and the exemplar patches.

The constraints of a convolutional neural network act in an interesting manner for texture synthesis. If the patches of an image could be changed without any constraints, then the MMD could be driven to zero by mimicking the exact patches of the exemplar texture. However, different random noise can result in different textures because the convolutional neural network constrains patches that overlap with each other to have the same values in the overlapping region. This is also due to gradient descent being able to only find local optima. The use of a content reconstruction loss for style transfer acts similarly to constrain specific textures in certain regions of the image such that the generated image looks similar in structure to the content image.

Different choices of kernel and neural network

The authors of [22] recommend using the VGG-19 network architecture [82]. We try out different network architectures and kernel functions and find that the choice of combination can lead to drastic changes in the generated result. Generated samples are shown in Figures 2.3, 2.4, and 2.5. The convolutional neural networks are trained on ImageNet. For each architecture, every downsampling layer (pooling or strided convolution) is used in a style reconstruction loss.

It seems that out of the four architectures tested, VGG19 with the polynomial kernel of degree 2 does perform best. Perhaps the way that the CNN was trained, or the architecture of the CNN itself, inhibits the use of the RBF kernel.
A similarity metric based on the dot product between CNN activations may align more closely with our perception of visual features than a similarity metric based on the Euclidean distance between CNN activations. It is as yet unclear how to construct the optimal network architecture and kernel function for this method.

Figure 2.3: Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. Columns: ResNet18, SqueezeNet, VGG16, VGG19, Original. Rows alternate between the Poly2 kernel and the RBF kernel. (Part I)

Figure 2.4: Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. Columns: ResNet18, SqueezeNet, VGG16, VGG19, Original. Rows alternate between the Poly2 kernel and the RBF kernel. (Part II)

Figure 2.5: Comparison of texture synthesis results using different MMD kernels defined by the composition of a classical kernel and a convolutional neural network. Columns: ResNet18, SqueezeNet, VGG16, VGG19, Original. Rows alternate between the Poly2 kernel and the RBF kernel. (Part III)

Chapter 3

Kernelized GAN

This chapter describes the use of maximum mean discrepancy as a criterion for training a generative model. Similar to generative adversarial networks (GANs), the kernelized GAN uses a neural network to discriminate between the real distribution and a distribution of outputs from a simultaneously-trained generator network. Once trained, the distribution of generator samples should be similar to the data generating distribution.

3.1 Motivation

Existing formulations of the GAN adversarial game are theoretically equivalent to minimizing some divergence or probability metric between the real distribution and the generated distribution (Section 2.3).
The generator network minimizes this criterion while a discriminator tries to maximize it. A powerful discriminator should ideally be able to find discriminative features that separate the real data and the generated data, while also providing meaningful gradients to the generator.

This isn't always the case, however, when the discriminator is a fixed-size neural network. The network is usually not powerful enough to exactly discriminate between the real and generated samples (even for finite samples). We propose the use of kernels to efficiently improve the complexity of the discriminator without requiring extra layers. This approach is theoretically motivated by the use of maximum mean discrepancy with a trained feature extractor, which is referred to as the discriminator in our approach.

Figure 3.3 shows the use of GMMN and our approach with a trained discriminator and Gaussian kernel. GMMN does not use a trained discriminator and already achieves superior performance compared to WGAN. For the 25-Gaussians problem, our approach does not visibly show a significant improvement upon GMMN, but one neat advantage is that our approach is much less sensitive to the kernel parameter, whereas we had to carefully tune the bandwidth of the Gaussian kernel used in the toy GMMN illustration to obtain good results.

Near the completion of this thesis, [53] independently proposed a similar method to combine MMD and adversarial training. However, their work requires an additional autoencoder whereas we show here that having a regularization term is sufficient to stabilize training.
We also emphasize the existence of training instability when this regularization term is omitted.

3.2 MMD with a trained feature extractor

First, we define the deep MMD objective by appending a discriminator to the MMD criterion,

\[ L(G, D) = \mathbb{E}_{x, x' \sim p_x}[k(D(x), D(x'))] + \mathbb{E}_{z, z' \sim p_z}[k(D(G(z)), D(G(z')))] - 2\, \mathbb{E}_{x \sim p_x,\, z \sim p_z}[k(D(x), D(G(z)))] \tag{3.1} \]

where $k$ is any kernel function, $p_x$ is the data distribution, and $p_z$ is a fixed prior distribution. However, naively appending a discriminator does not work. To see why, first note that integral probability metrics are typically defined as

\[ \mathrm{IPM}(\mathcal{F}, p_x, p_y) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim p_x} f(x) - \mathbb{E}_{y \sim p_y} f(y) \right|. \tag{3.2} \]

If $f \in \mathcal{F}$ implies $-f \in \mathcal{F}$, then the absolute value sign can be removed. The simplified asymmetric¹ version

\[ \mathrm{IPM}(\mathcal{F}, p_x, p_y) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p_x} f(x) - \mathbb{E}_{y \sim p_y} f(y) \tag{3.3} \]

is used in existing IPM-based GAN frameworks [2, 28, 66]. Intuitively understanding the use of the asymmetric version (3.3) in a two-player GAN framework is straightforward. The discriminator's role is to approximate the supremum, so we refer to it as $f$ in the following. The generator and discriminator both attempt to increase the value of $f$ applied to the generated and real samples, respectively. However, using (3.2) does not lend itself to such a simplification. The value of $f$ for generated samples must "chase" the real samples. This allows the discriminator to skirt around the generator at every iteration, without necessarily providing any meaningful gradients to the generator.

¹ We refer to this equation as the asymmetric version because $\mathrm{IPM}(p_x, p_y)$ is not necessarily equal to $\mathrm{IPM}(p_y, p_x)$ for a fixed $f$.
Figure 3.1: (a) Depiction of training instability when using the symmetric IPM. (b) Plots of the values of $\mathbb{E}f(x)$ and $\mathbb{E}f(G(z))$ during the course of training WGAN on CIFAR-10, with panels for the asymmetric IPM loss, the symmetric IPM loss, and the regularized symmetric IPM loss (horizontal axis: iteration, 0 to 80000; vertical axis: $f$ value). For the rightmost panel, we add an asymmetric regularization described in (3.5) to stabilize training.

We show this instability in Figure 3.1. It is clear that when the generator is nearing the optimum, the discriminator simply switches the sign of $f$ and this causes the generator to become unstable. Let $S_f$ be the subset of $\mathcal{F}$ that achieves a higher value than $f$ in either (3.2) or (3.3) depending on context. We suspect that when the generator network shifts its outputs to minimize the IPM defined by the current discriminator, $|S_f|$ may actually increase instead of decrease, whereas when the asymmetric IPM is used, $|S_f|$ should clearly decrease if the generator improves.

Visual samples are shown in Figure 3.2. While the generator still learns the general shapes and colors of the real data, there are visible large dark spots in the generated samples when either the absolute IPM or squared IPM is used.

Figure 3.2: Random samples after 100 epochs of training WGAN with weight clipping ($\mathcal{F} = \{f : \|f\|_L \le 1\}$) on CIFAR-10. The specific loss function used to train the discriminator is shown below each figure: (a) $\sup_{f \in \mathcal{F}} \mathbb{E}f(x) - \mathbb{E}f(G(z))$; (b) $\sup_{f \in \mathcal{F}} |\mathbb{E}f(x) - \mathbb{E}f(G(z))|$; (c) $\sup_{f \in \mathcal{F}} (\mathbb{E}f(x) - \mathbb{E}f(G(z)))^2$; (d) $\sup_{f \in \mathcal{F}} (\mathbb{E}f(x) - \mathbb{E}f(G(z)))^2 + \min(\mathbb{E}f(x) - \mathbb{E}f(G(z)), 0)$.

For the same reason, naively adding a trained discriminator to the GMMN framework creates instability during training as only the squared MMD is computable in closed form,

\[ \mathrm{MMD}^2(k, p_x, p_y) = \mathbb{E}_{x, x' \sim p_x} k(x, x') + \mathbb{E}_{y, y' \sim p_y} k(y, y') - 2\, \mathbb{E}_{x \sim p_x,\, y \sim p_y} k(x, y). \tag{3.4} \]

3.2.1 Fixing instability with asymmetric constraints

The aforementioned instability problem occurs when the discriminator has multiple directions that can increase the IPM criterion. To fix this, we propose adding an asymmetric regularization term to the criterion that forces the discriminator to move in a single direction. This results in the following two-player GAN framework,

\[
\begin{aligned}
&\min_G \; L(G, D) \\
&\max_D \; L(G, D) - \lambda \left\| \min\left\{ \mathbb{E}_{x \sim p_x} D(x) - \mathbb{E}_{z \sim p_z} D(G(z)),\, 0 \right\} \right\|^2
\end{aligned}
\tag{3.5}
\]

where $L(G, D)$ is as defined in (3.1). The added regularization term is negative whenever the mean of $D(x)$ falls below the mean of $D(G(z))$, and zero otherwise. In order to maximize this expression, $D$ is constrained such that $\mathbb{E} D(x) \ge \mathbb{E} D(G(z))$. We use the L2 norm to enact a larger penalty for large deviations so that the discriminator can quickly move back into the constrained region, while only a smaller penalty is given for small deviations so the discriminator does not focus on this regularization term too much. We find that larger values of $\lambda$ work better, especially at the beginning of optimization as we want to force the discriminator to be in the right direction as early as possible.

3.2.2 Empirical estimates

We use an unbiased estimate of MMD [26]. For each minibatch of real samples $\{x_i\}_{i=1}^n$ and prior samples $\{z_i\}_{i=1}^n$, the empirical loss function is defined as

\[
\begin{aligned}
\hat{L}(G, D) ={}& \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} k(D(x_i), D(x_j)) \\
&+ \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \ne i} k(D(G(z_i)), D(G(z_j))) \\
&- \frac{2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(D(x_i), D(G(z_j)))
\end{aligned}
\tag{3.6}
\]

The generator and discriminator networks are then trained using

\[
\begin{aligned}
&\min_G \; \hat{L}(G, D) \\
&\max_D \; \hat{L}(G, D) - \lambda \left\| \min\left\{ \frac{1}{n} \sum_{i} D(x_i) - \frac{1}{n} \sum_{j} D(G(z_j)),\, 0 \right\} \right\|^2
\end{aligned}
\tag{3.7}
\]

3.2.3 Kernels and hyperparameters

We experiment with the linear and Gaussian kernels. The linear kernel is the simplest and can be computed directly as a dot product,

\[ k_L(x, y) = x^T y. \tag{3.8} \]

The Gaussian kernel is a popular radial-basis kernel with a tunable parameter $\gamma$,

\[ k_G(x, y) = \exp(-\gamma \|x - y\|^2). \tag{3.9} \]

Note that one big advantage of combining a kernel with a deep neural network is that the network can adapt to the kernel being used.
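The unbiased estimate (3.6), the two kernels (3.8) and (3.9), and the penalty inside (3.7) can be sketched together as below. This is a minimal illustration rather than the training code used in the thesis: the discriminator and generator are not modeled, so random arrays stand in for the discriminator outputs $D(x_i)$ and $D(G(z_j))$, and the function names are hypothetical.

```python
import numpy as np


def linear_kernel(a, b):
    """(3.8): k_L(x, y) = x^T y for every pair of rows."""
    return a @ b.T


def gaussian_kernel(a, b, gamma=1.0):
    """(3.9): k_G(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)


def mmd2_unbiased(real_feats, fake_feats, kernel):
    """(3.6): the within-set sums skip the i == j terms; the cross term uses all pairs."""
    n = real_feats.shape[0]
    k_xx = kernel(real_feats, real_feats)
    k_yy = kernel(fake_feats, fake_feats)
    k_xy = kernel(real_feats, fake_feats)
    sum_xx = k_xx.sum() - np.trace(k_xx)
    sum_yy = k_yy.sum() - np.trace(k_yy)
    return (sum_xx + sum_yy) / (n * (n - 1)) - 2.0 * k_xy.mean()


def asymmetric_penalty(real_feats, fake_feats):
    """Squared norm of min{mean D(x) - mean D(G(z)), 0}, taken elementwise, as in (3.7)."""
    gap = real_feats.mean(axis=0) - fake_feats.mean(axis=0)
    return float(np.sum(np.minimum(gap, 0.0) ** 2))


rng = np.random.default_rng(0)
real = rng.normal(1.0, 1.0, size=(256, 4))    # stand-in for D(x_i)
real_b = rng.normal(1.0, 1.0, size=(256, 4))  # a second batch from the same distribution
fake = rng.normal(-1.0, 1.0, size=(256, 4))   # stand-in for D(G(z_j))

loss_gaussian = mmd2_unbiased(real, fake, gaussian_kernel)
loss_linear = mmd2_unbiased(real, fake, linear_kernel)
penalty_ok = asymmetric_penalty(real, fake)        # real mean above fake mean
penalty_violated = asymmetric_penalty(fake, real)  # roles swapped
```

In this sketch the discriminator objective of (3.7) would be the MMD estimate minus $\lambda$ times the penalty, so the penalty only becomes active when the mean discriminator output on generated samples overtakes the mean on real samples.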
This implies that the parameter $\gamma$ need not be manually specified, as the weights of the neural network can simply learn to scale themselves to effectively use any $\gamma$.

We additionally experiment with approximating the Gaussian kernel using random kitchen sinks (RKS) [75] features. Random kitchen sinks constructs approximate explicit features $\hat{\Phi}(\cdot)$ such that $\hat{\Phi}(x)^T \hat{\Phi}(y) \approx k_G(x, y)$. This approximate computation allows MMD to be computed in linear time, allowing larger minibatch sizes to be used in practice (see Appendix A for details). In experiments, we show that this approximation does not lead to obvious sample degradation. The RKS function $\hat{\Phi} : \mathbb{R}^d \to \mathbb{R}^s$ is defined as

\[ \hat{\Phi}(x) = \sqrt{\frac{2}{s}} \cos\left( \sqrt{2\gamma}\, W x + b \right) \tag{3.10} \]

where the entries of $W \in \mathbb{R}^{s \times d}$ are drawn from $\mathrm{Normal}(0, 1)$ and the entries of $b \in \mathbb{R}^s$ from $\mathrm{Unif}(0, 2\pi)$. The approximation quality increases with $s$ but is independent of $d$ [75]. We fix $s$ at 300 in our experiments.

3.2.4 Intuitions behind the squared MMD objective

Let $k_D(\cdot, \cdot) = k(D(\cdot), D(\cdot))$. A kernel function can be intuitively understood as a similarity measure between its two inputs. The squared MMD objective contains three terms,

\[ \mathrm{MMD}^2(k_D, p_x, p_y) = \underbrace{\mathbb{E}_{x, x' \sim p_x}[k_D(x, x')]}_{k_{xx}} + \underbrace{\mathbb{E}_{y, y' \sim p_y}[k_D(y, y')]}_{k_{yy}} - 2 \underbrace{\mathbb{E}_{x \sim p_x,\, y \sim p_y}[k_D(x, y)]}_{k_{xy}} \]

Since the discriminator maximizes this objective, it tries to maximize $k_{xx}$ and $k_{yy}$. Maximizing these within-distribution similarities ensures that the discriminator places the data from each distribution in a similar area of the activation space. The discriminator also tries to minimize $k_{xy}$. This term contains the similarity between real and generated data, so the discriminator will try to place them in different areas of the activation space. The discriminator can be seen as clustering the two data distributions based on the choice of kernel function.
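The random kitchen sinks features (3.10) from Section 3.2.3 can be sketched as below, together with one way explicit features give a linear-time MMD: with a finite feature map, the biased squared MMD is the squared distance between the two feature means. This is an illustrative sketch, not the procedure of Appendix A; the function names are hypothetical, and the large value $s = 20000$ is chosen here only to make the approximation easy to check, whereas the thesis fixes $s = 300$.

```python
import numpy as np


def make_rks_features(d, s, gamma, rng):
    """Draw W (s x d, standard normal) and b (uniform on [0, 2*pi)) once, as in (3.10)."""
    w = rng.normal(0.0, 1.0, size=(s, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=s)

    def features(x):
        # Rows of x are inputs; the output has one s-dimensional feature row per input.
        return np.sqrt(2.0 / s) * np.cos(np.sqrt(2.0 * gamma) * x @ w.T + b)

    return features


def mmd2_linear_time(x, y, features):
    """Biased MMD^2 as the squared distance between mean feature vectors, O(n * s)."""
    diff = features(x).mean(axis=0) - features(y).mean(axis=0)
    return float(diff @ diff)


rng = np.random.default_rng(0)
gamma, d = 0.5, 3
phi = make_rks_features(d, s=20000, gamma=gamma, rng=rng)

x = rng.normal(size=(5, d))
y = rng.normal(size=(5, d))

approx = phi(x) @ phi(y).T  # approximate kernel matrix from explicit features
exact = np.exp(-gamma * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1))
max_error = float(np.abs(approx - exact).max())
```

Because the feature means are computed independently for each batch, the cost grows linearly with the minibatch size instead of quadratically as with the full kernel matrix.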
Figure 3.3: IPM contours based on (a) approximate WGAN with weight clipping [2], (b) approximate WGAN with gradient penalty [28], (c) maximum mean discrepancy with Gaussian kernel, and (d) maximum mean discrepancy with trained discriminator and Gaussian kernel (kernelized GAN), shown for the 8 Gaussians, 25 Gaussians, and Swiss Roll toy distributions. The contour lines for kernelized GAN appear very close to the samples.

On the other hand, the generator minimizes the squared MMD, and only the terms $k_{yy}$ and $k_{xy}$ depend on the generated samples, so the generator tries to maximize $k_{xy}$ while minimizing $k_{yy}$. The maximization of $k_{xy}$ is the main objective that drives the generator to produce samples similar to real data. The minimization of $k_{yy}$ is interesting: as the kernel function can be interpreted as a similarity function, this implies that the generator should generate dissimilar samples. If the generator always creates the same samples, or a limited number of samples, then $k_{yy}$ will be high, whereas if the generator creates different samples all the time, then $k_{yy}$ will be low. This term explicitly encourages the generator distribution to have high diversity.

3.3 Experiments

3.3.1 Toy distributions

We first show that the use of kernels does in fact help discriminate between distributions when only a finite function approximator is used. Though this property is difficult to observe in higher dimensions, it can be easily shown for synthetic toy distributions.

Figure 3.3 shows the resulting IPM contours based on different IPM estimates. The discriminator, when applicable, is a small multilayer perceptron with 4 ReLU layers and 64 hidden units. The contour lines show the values of $f(\cdot)$, which are either the discriminator output for WGAN methods or the witness function (2.9) for MMD and kernelized GAN. The orange points are samples from the real distribution while the "fake" distribution (not shown) is fixed to be the real distribution plus a small Gaussian noise.
This implies that the boundary separating the real and fake distributions needs to be extremely tight in order to correctly discriminate between them. However, only the MMD and kernelized GAN methods are able to discriminate between different clusters, due to the use of a radial-basis kernel function.

This experiment shows that for the same discriminator architecture, kernelized GAN can provide a much more complex discriminative surface compared to WGAN. This is because kernelized GAN uses an exact maximum mean discrepancy with an optimized kernel, whereas WGAN uses an approximation to the Wasserstein distance. Note that in order to obtain nice contours for MMD, we had to manually tune $\gamma$ for the Gaussian kernel (3.9), whereas for kernelized GAN, we kept $\gamma = 1$ and only tuned the discriminator.

3.3.2 Qualitative samples

We show some random qualitative samples from our trained generators. By default, we use a minibatch size of 64 during training and the Gaussian kernel. All generators and discriminators use the DCGAN architecture [74].

MNIST. This dataset contains centered handwritten digits at a resolution of 28 × 28. Figure 3.4 shows random samples from trained generative moment matching networks (GMMN) and a kernelized GAN. Note that the GMMN algorithm requires using multiple Gaussian kernels with different values of $\gamma$ and a large number of samples per iteration. In contrast, the kernelized GAN discriminator can adapt to any value of $\gamma$, and so only a single Gaussian kernel is needed. The kernelized GAN can be successfully trained with the typical minibatch size of 64, owing to the effective representational power of the discriminator.

[Figure 3.4 panels: Real, GMMN (64), GMMN (1024), Kernelized GAN.]

Figure 3.4: Random samples from the MNIST dataset. GMMN requires a much higher minibatch size to produce quality samples.

LFW. The Labeled Faces in the Wild (LFW) dataset contains faces of celebrities. We perform a center crop and resize the images to 64 × 64 resolution.
Figure 3.5 shows random samples from the training dataset, along with samples from generators trained using the GMMN algorithm and kernelized GAN. Generated images are mapped to values in (0, 1). As GMMN does not use a neural network discriminator, it is unable to extract important features of the dataset. For example, some samples have large smudges, a third eye, or a disfigured mouth. The discriminator in kernelized GAN is able to transform the image into a more meaningful space where the Gaussian kernel is able to distinguish samples more easily, leading to a better trained generator.

LSUN Bedrooms. This dataset contains 3 million images of bedrooms, which we resize to a resolution of 64 × 64. Figure 3.6 shows samples from kernelized GAN with different kernels. We show that the model trains adequately with kernel functions other than the typical Gaussian kernel. While the linear kernel is not characteristic, it still produces a good generative model with no obvious deficiencies. An advantage of using random kitchen sinks to approximate the Gaussian kernel is that it can be computed in linear time, whereas the exact Gaussian kernel is quadratic with respect to the minibatch size. This shows that kernelized GAN can potentially scale up to large minibatch sizes, unlike GMMN.

[Figure 3.5 panels: Real, GMMN, Kernelized GAN.]

Figure 3.5: Random samples from the LFW dataset. GMMN is trained using a batch size of 1024 whereas kernelized GAN uses a batch size of 64.

[Figure 3.6 panels: Real, Gaussian kernel, RKS kernel, Linear kernel.]

Figure 3.6: Random samples from the LSUN dataset. Kernelized GAN can be trained using different kernels, with the discriminator appropriately adapting to the specific manifold required to use the kernel.

3.3.3 Latent space interpolation

A degenerate trained generative model is one that only memorizes and only generates samples from the training dataset. For latent variable models, we can visualize how changing the latent state produces diverse samples. In Figures 3.7 and 3.8, we show random samples on the leftmost and rightmost columns with interpolated samples in-between. We see that in most cases, there is a smooth transition, showing that the model learns to generalize instead of memorize. In Figure 3.7, there are signs of a smooth transition from faces with glasses to faces without glasses, and from neutral expressions to happier expressions. In Figure 3.8, we see transitions where windows turn into cabinets or drapes, and chairs turn into beds.

3.3.4 Conditional generation

The simplest version of a conditional GAN [65] involves providing image labels to both the generator and discriminator networks. As there is no discriminator network in GMMN, this scheme cannot be applied directly to obtain a conditional generative moment matching network. However, with kernelized GAN, we can augment the generator and discriminator as described in [65] to create conditional samples. Figure 3.9 shows conditional samples from a kernelized GAN. It should also be possible to augment kernelized GAN with more complicated conditional generation such as [6].

Figure 3.7: Interpolation within latent space for Kernelized GAN trained on LFW.

Figure 3.8: Interpolation within latent space for Kernelized GAN trained on LSUN bedrooms.

Figure 3.9: Conditional generation of MNIST digits. Each row corresponds to a different label from 0 to 9.

Chapter 4

Feedforward style transfer

Famous artists are typically renowned for a particular artistic style, which takes years to develop. Even once perfected, a single piece of art can take days or even months to create. This motivates us to explore efficient computational strategies for creating artistic images. While there is a large classical literature on texture synthesis methods that create artwork from a blank canvas [18, 47, 56, 89], several recent approaches study the problem of transferring the desired style from one image onto the structural content of another image.
This approach is known as artistic style transfer.

Methods for texture synthesis often create generative processes where carefully chosen insertions of random variables create diverse textures. The neural texture synthesis algorithm described in Chapter 2 is an example where the generative process involves sampling a noise image, then iteratively refining the image to create a new texture sample. This method can be adapted to style transfer by conditioning on a content image in the generative process. The neural style transfer algorithm simply adds an appropriate content reconstruction loss (2.14) to redirect the iterative process. See Chapter 2 for a more thorough description of neural texture synthesis and style transfer.

4.1 The need for faster algorithms

Despite renewed interest in the domain, the actual process of style transfer is based on solving a complex optimization problem, which can take minutes on today’s hardware. A typical speedup solution is to train another neural network that approximates the optimum of the optimization in a single feed-forward pass [14, 41, 87, 88]. While much faster, existing works that use this approach sacrifice the versatility of being able to perform style transfer with any given style image, as the feed-forward network cannot generalize beyond its trained set of images. Due to this limitation, existing applications are either time-consuming or limited in the number of provided styles, depending on the method of style transfer.

Figure 4.1: Illustration of existing feedforward methods [14, 41, 87, 88] that simply try to minimize (2.17) by training a separate neural network for each style image, or a limited number of style images.

In this chapter we propose a method that addresses these limitations: a new method for artistic style transfer that is efficient but is not limited to a finite set of styles.
To accomplish this, we define a new optimization objective for style transfer that notably depends on only one layer of the CNN (as opposed to existing methods that use multiple layers). The new objective leads to visually appealing results, while this simple restriction allows us to use an “inverse network” to deterministically invert the activations from the stylized layer to yield the stylized image.

It is possible to train a neural network that approximates the optimum of Gatys et al.’s loss function (see Section 2.4) for one or more fixed styles [14, 41, 87, 88]. This yields a much faster method, but these methods need to be re-trained for each new style. Figure 4.1 highlights this limitation, as a new neural network must be trained for each new style. It should be noted that these works all minimize the objective defined by [24].

Figure 4.2: We propose a one-shot concatenation method based on a simple nearest neighbour alignment to combine the content and style activations. The combined activations are then inverted back into an image by a trained inverse network.

4.2 Style transfer as one-shot distribution alignment

The main component of our style transfer method is a patch-based operation for constructing the target activations in a single layer, given the style and content images. We refer to this procedure as “swapping the style” of an image, as the content image is replaced patch-by-patch by the style image. We first present this operation at a high level, followed by more details on our implementation.

4.2.1 Style Swap

Let $C$ and $S$ denote the RGB representations of the content and style images (respectively), and let $\Phi(\cdot)$ be the function represented by a fully convolutional part of a pretrained CNN that maps an image from RGB to some intermediate activation space. After computing the activations, $\Phi(C)$ and $\Phi(S)$, the style swap procedure is as follows:

1. Extract a set of patches for both content and style activations, denoted by $\{\phi_i(C)\}_{i = 1, \dots, n_c}$ and $\{\phi_j(S)\}_{j = 1, \dots, n_s}$, where $n_c$ and $n_s$ are the number of extracted patches. The extracted patches should have sufficient overlap, and contain all channels of the activations.

2. For each content activation patch, determine a closest-matching style patch based on the normalized cross-correlation measure,

$$\phi_i^{ss}(C, S) := \underset{\phi_j(S),\ j = 1, \dots, n_s}{\arg\max}\ \frac{\langle \phi_i(C), \phi_j(S) \rangle}{\|\phi_i(C)\| \cdot \|\phi_j(S)\|}. \tag{4.1}$$

3. Swap each content activation patch $\phi_i(C)$ with its closest-matching style patch $\phi_i^{ss}(C, S)$.

4. Reconstruct the complete content activations, which we denote by $\Phi^{ss}(C, S)$, by averaging overlapping areas that may have different values due to step 3.

This operation results in hidden activations corresponding to a single image with the structure of the content image, but with textures taken from the style image.

4.2.2 Comparison with neural texture synthesis

The neural texture synthesis algorithm minimizes the maximum mean discrepancy between the patches $\{\phi_i(S)\}$ and $\{\phi_j(N)\}$, where $S$ is the style image and $N$ is a white noise image. The MMD kernel used in neural texture synthesis is a polynomial kernel of degree 2. In contrast, style swap takes two sets of patches $\{\phi_i(S)\}$ and $\{\phi_j(N)\}$ and performs a distribution alignment by replacing each patch by its closest neighbour (Figure 4.3), with a similarity metric related to the MMD kernel.

4.2.3 Parallelizable implementation

To give an efficient implementation, we show that the entire style swap operation can be implemented as a network with three operations: (i) a 2D convolutional layer, (ii) a channel-wise argmax, and (iii) a 2D transposed convolutional layer. Implementation of style swap is then as simple as using existing efficient implementations of 2D convolutions and transposed convolutions (footnote 2).

To concisely describe the implementation, we re-index the content activation patches to explicitly denote spatial structure.
In particular, we’ll let $d$ be the number of feature channels of $\Phi(C)$, and let $\phi_{a,b}(C)$ denote the patch $\Phi(C)_{a:a+s,\ b:b+s,\ 1:d}$ where $s$ is the patch size.

Footnote 2: The transposed convolution is also often referred to as a “fractionally-strided” convolution, a “backward” convolution, an “upconvolution”, or a “deconvolution”.

Figure 4.3: Style swap performs a forced distribution alignment by replacing each content patch by its nearest neighbour style patch.

[Figure 4.4 diagram: content activations, 2D convolution with normalized style patches as filters, channel-wise argmax, 2D transposed convolution with style patches as filters, target activations.]

Figure 4.4: Illustration of a style swap operation. The 2D convolution extracts patches of size 3 × 3 and stride 1, and computes the normalized cross-correlations. There are $n_c = 9$ spatial locations and $n_s = 4$ feature channels immediately before and after the channel-wise argmax operation. The 2D transposed convolution reconstructs the complete activations by placing each best matching style patch at the corresponding spatial location.

Notice that the normalization term for content activation patches $\phi_i(C)$ is constant with respect to the argmax operation, so (4.1) can be rewritten as

$$K_{a,b,j} = \left\langle \phi_{a,b}(C), \frac{\phi_j(S)}{\|\phi_j(S)\|} \right\rangle, \qquad \phi^{ss}_{a,b}(C, S) = \underset{\phi_j(S),\ j = 1, \dots, n_s}{\arg\max}\ \{K_{a,b,j}\} \tag{4.2}$$

The lack of a normalization for the content activation patches simplifies computation and allows our use of 2D convolutional layers. The following three steps describe our implementation and are illustrated in Figure 4.4:

• The tensor $K$ can be computed by a single 2D convolution by using the normalized style activation patches $\{\phi_j(S)/\|\phi_j(S)\|\}$ as convolution filters and $\Phi(C)$ as input. The computed $K$ has $n_c$ spatial locations and $n_s$ feature channels.
At each spatial location, $K_{a,b}$ is a vector of cross-correlations between a content activation patch and all style activation patches.

• To prepare for the 2D transposed convolution, we replace each vector $K_{a,b}$ by a one-hot vector $\overline{K}_{a,b}$ corresponding to the best matching style activation patch:

$$\overline{K}_{a,b,j} = \begin{cases} 1 & \text{if } j = \arg\max_{j'} \{K_{a,b,j'}\} \\ 0 & \text{otherwise} \end{cases} \tag{4.3}$$

• The last operation for constructing $\Phi^{ss}(C, S)$ is a 2D transposed convolution with $\overline{K}$ as input and unnormalized style activation patches $\{\phi_j(S)\}$ as filters. At each spatial location, only the best matching style activation patch is in the output, as the other patches are multiplied by zero.

Note that a transposed convolution will sum up the values from overlapping patches. In order to average these values, we perform an element-wise division on each spatial location of the output by the number of overlapping patches. Consequently, we do not need to impose that the argmax in (4.3) has a unique solution, as multiple argmax solutions can simply be interpreted as adding more overlapping patches.

4.2.4 Optimization formulation

The pixel representation of the stylized image can be computed by placing a loss function on the activation space with target activations $\Phi^{ss}(C, S)$. Similar to prior works on style transfer [23, 51], we use the squared-error loss and define our optimization objective as

$$I_{\text{stylized}}(C, S) = \underset{I \in \mathbb{R}^{h \times w \times d}}{\arg\min}\ \|\Phi(I) - \Phi^{ss}(C, S)\|_F^2 + \lambda \ell_{TV}(I) \tag{4.4}$$

where we’ll say that the synthesized image is of dimension $h$ by $w$ by $d$, $\|\cdot\|_F$ is the Frobenius norm, and $\ell_{TV}(\cdot)$ is the total variation regularization term widely used in image generation methods [1, 41, 61]. Because $\Phi(\cdot)$ contains multiple maxpooling operations that downsample the image, we use this regularization as a natural image prior, obtaining spatially smoother results for the re-upsampled image.
The total variation regularization is as follows:

$$\ell_{TV}(I) = \sum_{i=1}^{h-1} \sum_{j=1}^{w} \sum_{k=1}^{d} (I_{i+1,j,k} - I_{i,j,k})^2 + \sum_{i=1}^{h} \sum_{j=1}^{w-1} \sum_{k=1}^{d} (I_{i,j+1,k} - I_{i,j,k})^2 \tag{4.5}$$

Since the function $\Phi(\cdot)$ is part of a pretrained CNN and is at least once subdifferentiable, (4.4) can be solved using standard subgradient-based optimization methods.

4.3 Inverse network

Unfortunately, the cost of solving the optimization problem to compute the stylized image might be too high in applications such as video stylization. We can improve optimization speed by approximating the optimum using another neural network. Once trained, this network can then be used to produce stylized images much faster, and we will in particular train this network to have the versatility of being able to use new content and new style images.

The main purpose of our inverse network is to approximate an optimum of the loss function in (4.4) for any target activations. We therefore define the optimal inverse function as:

$$\underset{f}{\arg\inf}\ \mathbb{E}_H\left[\|\Phi(f(H)) - H\|_F^2 + \lambda \ell_{TV}(f(H))\right] \tag{4.6}$$

where $f$ represents a deterministic function and $H$ is a random variable representing target activations. The total variation regularization term is added as a natural image prior similar to (4.4).

Figure 4.5: The inverse network takes the style swapped activations and produces an image. This network is trained by minimizing the L2 norm (orange line) between the style swapped activations and the activations of the image after being passed through the pretrained network again.

4.3.1 Training the inverse network

A couple of problems arise due to the properties of the pretrained convolutional neural network.

Non-injective. The CNN defining $\Phi(\cdot)$ contains convolutional, maxpooling, and ReLU layers. These functions are many-to-one, and thus do not have well-defined inverse functions.
Akin to existing works that use inverse networks [11, 59, 95], we instead train an approximation to the inverse relation by a parametric neural network,

$$\min_\theta\ \frac{1}{n} \sum_{i=1}^{n} \|\Phi(f(H_i; \theta)) - H_i\|_F^2 + \lambda \ell_{TV}(f(H_i; \theta)) \tag{4.7}$$

where $\theta$ denotes the parameters of the neural network $f$ and $H_i$ are activation features from a dataset of size $n$. This objective function leads to unsupervised training of the neural network, as the optimum of (4.4) does not need to be known. We place the description of our inverse network architecture in the appendix.

Figure 4.6: We propose the first feedforward method for style transfer that can be used for arbitrary style images. We formulate style transfer using a constructive procedure (Style Swap) and train an inverse network to generate the image.

Non-surjective. The style swap operation produces target activations that may be outside the range of $\Phi(\cdot)$ due to the interpolation. This would mean that if the inverse network is only trained with real images then the inverse network may only be able to invert activations in the range of $\Phi(\cdot)$. Since we would like the inverse network to invert style swapped activations, we augment the training set to include these activations. More precisely, given a set of training images (and their corresponding activations), we augment this training set with style-swapped activations based on pairs of images.

4.3.2 Feedforward style transfer procedure

Once trained, the inverse network can be used to replace the optimization procedure. Thus our proposed feedforward procedure consists of the following steps:

1. Compute $\Phi(C)$ and $\Phi(S)$.
2. Obtain $\Phi^{ss}(C, S)$ by style swapping.
3. Feed $\Phi^{ss}(C, S)$ into a trained inverse network.

This procedure is illustrated in Figure 4.6. As described in Section 4.2.3, style swapping can be implemented as a (non-differentiable) convolutional neural network. As such, the entire feedforward procedure can be seen as a neural net with individually trained parts.
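Step 2 of the procedure above, the style swap of Section 4.2.3, can be sketched with plain nested lists as follows. This is only an illustrative loop-based version under stated assumptions, not the Torch7 implementation: activations are assumed to be stored as height × width × channels lists, patches are extracted at stride 1, and the names `extract_patches` and `style_swap` are hypothetical. The inner product with normalized style patches plays the role of the 2D convolution in (4.2), the `max` over scores is the channel-wise argmax in (4.3), and the paste-and-average loop stands in for the transposed convolution followed by the element-wise division.

```python
import math


def extract_patches(act, size):
    """Return the top-left positions and flattened size x size x d patches
    of an activation map (nested lists, height x width x channels), stride 1."""
    h, w = len(act), len(act[0])
    positions, patches = [], []
    for a in range(h - size + 1):
        for b in range(w - size + 1):
            patch = [v for i in range(size) for j in range(size)
                     for v in act[a + i][b + j]]
            positions.append((a, b))
            patches.append(patch)
    return positions, patches


def style_swap(content, style, size=3):
    """Replace every content patch by its best matching style patch and
    average the overlapping areas."""
    h, w, d = len(content), len(content[0]), len(content[0][0])
    _, style_patches = extract_patches(style, size)
    # Norms of the style patches; an all-zero patch gets norm 1.0 to avoid
    # division by zero (its score is then simply 0).
    norms = [math.sqrt(sum(v * v for v in p)) or 1.0 for p in style_patches]
    positions, content_patches = extract_patches(content, size)

    out = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for (a, b), cp in zip(positions, content_patches):
        # K[a, b, j]: cross-correlation with each normalized style patch.
        scores = [sum(x * y for x, y in zip(cp, sp)) / n
                  for sp, n in zip(style_patches, norms)]
        best = max(range(len(scores)), key=scores.__getitem__)
        sp = style_patches[best]
        # Paste the unnormalized best patch, summing overlapping values.
        idx = 0
        for i in range(size):
            for j in range(size):
                for k in range(d):
                    out[a + i][b + j][k] += sp[idx]
                    idx += 1
                count[a + i][b + j] += 1
    # Divide by the number of overlapping patches at each location.
    for i in range(h):
        for j in range(w):
            for k in range(d):
                out[i][j][k] /= count[i][j]
    return out
```

One sanity check that follows from the Cauchy-Schwarz inequality: swapping an activation map with itself should return the same map, since each patch is its own best match unless another patch is an exact positive multiple of it.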
Compared to existing feedforward approaches [14, 41, 87, 88], the biggest advantage of our feedforward procedure is the ability to use new style images with only a single trained inverse network.

4.4 Experiments

In this section, we analyze properties of the proposed style transfer and inversion methods. We use the Torch7 framework [7] to implement our method (footnote 3), and use existing open source implementations of prior works [40, 51, 76] for comparison.

4.4.1 Style swap results

Target Layer. The effects of style swapping in different layers of the VGG-19 network are shown in Figure 4.7. In this figure the RGB images are computed by optimization as described in Section 4.2. We see that while we can style swap directly in the RGB space, the result is nothing more than a recolor. As we choose a target layer that is deeper in the network, textures of the style image are more pronounced. We find that style swapping on the “relu3_1” layer provides the most visually pleasing results, while staying structurally consistent with the content. We restrict our method to the “relu3_1” layer in the following experiments and in the inverse network training. Qualitative results are shown in Figure 4.13.

Consistency. Our style swapping approach concatenates the content and style information into a single target feature vector, resulting in an easier optimization formulation compared to other approaches. As a result, we find that the optimization algorithm is able to reach the optimum of our formulation in fewer iterations than existing formulations while consistently reaching the same optimum. Figures 4.8 and 4.9 show the difference in optimization between our formulation and existing works under random initializations. Here we see that random initializations have almost no effect on the stylized result, indicating that we have far fewer local optima than other style transfer objectives.

Straightforward Adaptation to Video. This consistency property is advantageous when stylizing videos frame by frame.
Frames that are the same will result in the same stylized result, while consecutive frames will be stylized in similar ways. As a result, our method is able to adapt to video without any explicit gluing procedure like optical flow [76]. We place stylized videos in the code repository.

Footnote 3: Code available at https://github.com/rtqichen/style-swap

[Figure 4.7 panels: Content Image, Style Image, RGB, relu1_1, relu2_1, relu3_1, relu4_1, relu5_1.]

Figure 4.7: The effect of style swapping in different layers of VGG-19 [82], and also in RGB space. Due to the naming convention of VGG-19, “reluX_1” refers to the first ReLU layer after the (X − 1)-th maxpooling layer. The style swap operation uses patches of size 3 × 3 and stride 1, and then the RGB image is constructed using optimization.

[Figure 4.8 panels: Content Image, Style Image, Gatys et al. with random initializations, our method with random initializations.]

Figure 4.8: Our method achieves consistent results compared to existing optimization formulations. We see that Gatys et al.’s formulation [23] has multiple local optima while we are able to consistently achieve the same style transfer effect with random initializations. Figure 4.9 shows this quantitatively.

[Figure 4.9 plot: “Standard Deviation of Pixels”; x-axis: optimization iteration; y-axis: standard deviation; series: Gatys et al., Li and Wand, Style Swap.]

Figure 4.9: Standard deviation of the RGB pixels over the course of optimization is shown for 40 random initializations. The lines show the mean value and the shaded regions are within one standard deviation of the mean. The vertical dashed lines indicate the end of optimization. Figure 4.8 shows examples of optimization results.

Simple Intuitive Tuning. A natural way to tune the degree of stylization (compared to preserving the content) in the proposed approach is to modify the patch size. Figure 4.10 qualitatively shows the relationship between patch size and the style-swapped result.
As the patch size increases, more of the structure of the content image is lost and replaced by textures in the style image.

[Figure 4.10 panels: patch size 3 × 3, patch size 7 × 7, patch size 12 × 12.]

Figure 4.10: We can trade off between content structure and style texture by tuning the patch size. The style images, Starry Night (top) and Small Worlds I (bottom), are shown in Figure 4.13.

4.4.2 CNN inversion

Here we describe our training of an inverse network that computes an approximate inverse function of the pretrained VGG-19 network [82]. More specifically, we invert the truncated network from the input layer up to layer “relu3_1”. The network architecture is placed in the appendix.

Dataset. We train using the Microsoft COCO (MSCOCO) dataset [57] and a dataset of paintings sourced from wikiart.org and hosted by Kaggle [12]. The two datasets contain roughly 80,000 natural images and 80,000 paintings, respectively. Since typically the content images are natural images and style images are paintings, we combine the two datasets so that the network can learn to recreate the structure and texture of both categories of images. Additionally, the explicit categorization of natural image and painting gives respective content and style candidates for the augmentation described in Section 4.3.1.

Training. We resize each image to 256 × 256 pixels (corresponding to activations of size 64 × 64) and train for approximately 2 epochs over each dataset. Note that even though we restrict the size of our training images (and corresponding activations), the inverse network is fully convolutional and can be applied to arbitrary-sized activations after training.

We construct each minibatch by taking 2 activation samples from natural images and 2 samples from paintings. We augment the minibatch with 4 style-swapped activations using all pairs of natural images and paintings in the minibatch.
We calculate subgradients using backpropagation on (4.7) with the total variation regularization coefficient $\lambda = 10^{-6}$ (the method is not particularly sensitive to this choice), and we update parameters of the network using the Adam optimizer [43] with a fixed learning rate of $10^{-3}$.

Result. Figure 4.11 shows quantitative approximation results using 2000 full-sized validation images from MSCOCO and 6 full-sized style images. Though only trained on images of size 256 × 256, we achieve reasonable results for arbitrary full-sized images. We additionally compare against an inverse network that has the same architecture but was not trained with the augmentation. As expected, the network that never sees style-swapped activations during training performs worse than the network with the augmented training set.

[Figure 4.11 plot: “Validation Loss”; x-axis: iteration (0 to 100); y-axis: loss function (scale $10^4$); series: Optimization, InvNet-NoAug, InvNet-Aug.]

Figure 4.11: We compare the average loss (4.7) achieved by optimization and our inverse networks on 2000 variable-sized validation images and 6 variable-sized style images, using patch sizes of 3 × 3. Style images that appear in the paintings dataset were removed during training.

4.4.3 Computation time

Computation times of existing style transfer methods are listed in Table 4.1. Compared to optimization-based methods, our optimization formula is easier to solve and requires less time per iteration, likely due to only using one layer of the pretrained VGG-19 network. Other methods use multiple layers and also deeper layers than we do.

We show the percentage of computation time spent by different parts of our feedforward procedure in Figures 4.12a and 4.12c. For any nontrivial image sizes, the style swap procedure requires much more time than the other neural networks. This is due to the style swap procedure containing two convolutional layers where the number of filters is the number of style patches.
The number of patches increases linearly with the number of pixels of the image, with a constant that depends on the number of pooling layers and the stride at which the patches are extracted. Therefore, it is no surprise that style image size has the most effect on computation time (as shown in Figures 4.12a and 4.12b).

Method                 N. Iters.   Time/Iter. (s)   Total (s)
Gatys et al. [23]      500         0.1004           50.20
Li and Wand [51]       200         0.6293           125.86
Style Swap (Optim)     100         0.0466           4.66
Style Swap (InvNet)    1           1.2483           1.25

Table 4.1: Mean computation times of style transfer methods that can handle arbitrary style images. Times are taken for images of resolution 300 × 500 on a GeForce GTX 980 Ti. Note that the number of iterations for optimization-based approaches should only be viewed as a very rough estimate.

Interestingly, it seems that the computation time stops increasing at some point even when the content image size increases (Figure 4.12d), likely due to parallelism afforded by the implementation. This suggests that our procedure can handle large image sizes as long as the number of style patches is kept manageable. It may be desirable to perform clustering on the style patches to reduce the number of patches, or use alternative implementations such as fast approximate nearest neighbour search methods [29, 68].
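To make the growth in the number of style patches concrete, here is a small worked calculation. It assumes that the activations at “relu3_1” sit after two maxpooling layers that each halve the spatial size (consistent with 256 × 256 images giving 64 × 64 activations, as stated in Section 4.4.2), that the convolutions preserve spatial size, and that patches are 3 × 3 with stride 1. The helper `num_patches` is hypothetical and only for illustration.

```python
def num_patches(height, width, num_pools=2, patch_size=3, stride=1):
    """Count the patches extracted from the activation map of an image,
    assuming each maxpooling layer halves the spatial size."""
    act_h = height // (2 ** num_pools)
    act_w = width // (2 ** num_pools)
    rows = (act_h - patch_size) // stride + 1
    cols = (act_w - patch_size) // stride + 1
    return rows * cols


for side in (256, 500, 800):
    print(side, "x", side, "image:", num_patches(side, side), "patches")
# 256 x 256 image: 3844 patches
# 500 x 500 image: 15129 patches
# 800 x 800 image: 39204 patches
```

Under these assumptions, an 800 × 800 style image yields roughly ten times as many convolution filters in the style swap as a 256 × 256 one, which is in line with the observation that style image size dominates the computation time.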
[Figure 4.12 plots: (a) “Break Down of Time Spent”, time spent (%) by inverse network and style swap versus number of pixels of style image ($100^2$ to $800^2$); (b) “Time Spent by Style Swap”, time spent (s) versus number of pixels of style image; (c) “Break Down of Time Spent”, time spent (%) versus number of pixels of content image; (d) “Time Spent by Style Swap”, time spent (s) versus number of pixels of content image.]

Figure 4.12: Compute times as (a,b) style image size increases and (c,d) as content image size increases. The non-variable image size is kept at 500 × 500. As shown in (a,c), most of the computation is spent in the style swap procedure.

[Figure 4.13 panels: style images Small Worlds I (Wassily Kandinsky, 1922), The Starry Night (Vincent Van Gogh, 1889), Composition X (Wassily Kandinsky, 1939), Mountainprism (Renee Nemerov, 2007), Butterfly Drawing, and La Muse (Pablo Picasso, 1935); columns: Content, Ours, Gatys et al.]

Figure 4.13: Qualitative examples of our method compared with Gatys et al.’s formulation of artistic style transfer.

Chapter 5

Conclusion

We have discussed the use of maximum mean discrepancy in learning generative models in two different settings. The first setting covers a general use case where the generative process for a training dataset is approximated. The second setting covers a more specific use case of combining content and style images in a generative process that produces a new artistic image.
We show generalization of our style transfer method to arbitrary style images, whereas prior work focused on speed has not been successful at generalization.

For future work, it would be ideal to increase the quality or speed of the style transfer without sacrificing generalization, as the proposed method contains a three-way trade-off between speed, quality, and generalization with no outstanding performance in either speed or quality. Speed can be improved by replacing the patch-based distributional alignment with a faster one, while quality can be increased by using the existing neural style transfer loss [24] instead of the autoencoder loss we use. This has been partly explored by Huang and Belongie [35] by replacing the style swap operation with a simpler mean and variance alignment.

The kernelized GAN is a straightforward replacement for the original GAN objective, though it is difficult to show whether discriminative power is increased by using maximum mean discrepancy as opposed to the arguably simpler Wasserstein distance for real applications. It would be helpful to generative modeling research if robust testing methods could be created specifically for generative models.

Bibliography

[1] Hussein A Aly and Eric Dubois. Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing, 14(10):1647–1659, 2005.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] Guillaume Berger and Roland Memisevic. Incorporating long-range consistency in CNN-based texture generation. arXiv preprint arXiv:1606.01286, 2016.
[4] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[5] Alex J Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.
[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.
InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[7] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[8] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
[9] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
[10] Alexey Dosovitskiy and Thomas Brox. Inverting convolutional networks with convolutional networks. CoRR, abs/1506.02753, 2015.
[11] Alexey Dosovitskiy, Jost Springenberg, Maxim Tatarchenko, and Thomas Brox. Learning to generate chairs, tables and cars with convolutional networks.
[12] Small Yellow Duck. Painter by numbers, wikiart.org. https://www.kaggle.com/c/painter-by-numbers, 2016.
[13] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[14] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. CoRR, abs/1610.07629, 2016.
[15] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
[16] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
[17] Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer.
In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 341–346. ACM, 2001.
[18] Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1033–1038. IEEE, 1999.
[19] Michael Elad and Peyman Milanfar. Style-transfer via texture-synthesis. arXiv preprint arXiv:1609.03057, 2016.
[20] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
[21] Oriel Frigo, Neus Sabater, Julie Delon, and Pierre Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. 2016.
[22] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[23] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[24] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[26] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[27] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In F. Pereira, C. J. C. Burges, L.
Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1205–1213. Curran Associates, Inc., 2012.
[28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
[29] Kiana Hajebi, Yasin Abbasi-Yadkori, Hossein Shahbazi, and Hong Zhang. Fast approximate nearest-neighbor search with k-nearest neighbor graph.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[31] Kun He, Yan Wang, and John Hopcroft. A Powerful Generative Model Using Random Weights for the Deep Image Representation. 2016.
[32] Kun He, Yan Wang, and John E. Hopcroft. A powerful generative model using random weights for the deep image representation. CoRR, abs/1606.04801, 2016.
[33] Aaron Hertzmann. Paint By Relaxation. Proceedings Computer Graphics International (CGI), pages 47–54, 2001.
[34] Aaron Hertzmann, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. Image analogies. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 327–340. ACM, 2001.
[35] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. arXiv preprint arXiv:1703.06868, 2017.
[36] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.
[37] Artify Inc. Artify, 2016.
[38] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[40] Justin Johnson.
neural-style. https://github.com/jcjohnson/neural-style, 2015.
[41] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv, 2016.
[42] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. CoRR, abs/1703.05192, 2017.
[43] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[44] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[45] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[46] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
[47] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. Texture optimization for example-based synthesis. ACM Transactions on Graphics (ToG), 24(3):795–802, 2005.
[48] Quoc Viet Le, Tamás Sarlós, and Alexander Johannes Smola. Fastfood: Approximate kernel expansions in loglinear time. CoRR, abs/1408.3060, 2014.
[49] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[50] Sylvain Lefebvre and Hugues Hoppe. Appearance-space texture synthesis. ACM Transactions on Graphics (TOG), 25(3):541–548, 2006.
[51] Chuan Li and Michael Wand. Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. CVPR 2016, page 9, 2016.
[52] Chuan Li and Michael Wand.
Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[53] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584, 2017.
[54] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. CoRR, abs/1701.01036, 2017.
[55] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
[56] Lin Liang, Ce Liu, Ying-Qing Xu, Baining Guo, and Heung-Yeung Shum. Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics (ToG), 20(3):127–150, 2001.
[57] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
[58] Peter Litwinowicz. Processing Images and Video for an Impressionist Effect. Proc. SIGGRAPH, pages 407–414, 1997.
[59] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[60] Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pages 1601–1609, 2014.
[61] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In 2015 IEEE conference on computer vision and pattern recognition (CVPR), pages 5188–5196.
IEEE, 2015.
[62] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks, pages 52–59. Springer, 2011.
[63] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
[64] Barbara J Meier. Painterly rendering for animation. Proceedings of the 23rd annual conference on Computer graphics and interactive techniques SIGGRAPH 96, 30(Annual Conference Series):477–484, 1996.
[65] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[66] Youssef Mroueh, Tom Sercu, and Vaibhava Goel. McGan: Mean and covariance feature matching GAN. arXiv preprint arXiv:1702.08398, 2017.
[67] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
[68] Marius Muja and David G Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014.
[69] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
[70] Roman Novak and Yaroslav Nikulin. Improving the neural algorithm of artistic style. arXiv preprint arXiv:1605.04603, 2016.
[71] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
[72] Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients.
International journal of computer vision, 40(1):49–70, 2000.
[73] Inc. Prisma Labs. Prisma. http://prisma-ai.com/, 2016.
[74] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[75] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
[76] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. pages 1–14, 2016.
[77] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[78] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
[79] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. CoRR, abs/1703.05921, 2017.
[80] Jürgen Schmidhuber. Deep learning in neural networks: An overview. CoRR, abs/1404.7828, 2014.
[81] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[82] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[83] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár.
Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
[84] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
[85] PicsArt Photo Studio. Picsart. https://picsart.com/, 2016.
[86] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488, 2016.
[87] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. CoRR, 2016.
[88] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
[89] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 479–488. ACM Press/Addison-Wesley Publishing Co., 2000.
[90] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[91] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
[92] Pinar Yanardag, Manuel Cebrian, Nick Obradovich, and Iyad Rahwan. Nightmare machine, 2016.
[93] Raymond Yeh, Chen Chen, Teck-Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. CoRR, abs/1607.07539, 2016.
[94] Wojciech Zaremba, Arthur Gretton, and Matthew B.
Blaschko. B-test: A non-parametric, low variance kernel two-sample test. CoRR, abs/1307.1954, 2013.
[95] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010.
[96] Fang Zhao, Jiashi Feng, Jian Zhao, Wenhan Yang, and Shuicheng Yan. Robust LSTM-autoencoders for face de-occlusion in the wild. CoRR, abs/1612.08534, 2016.
[97] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

Appendices

Appendix A
Linear time and space complexity kernel sums

It is often believed that the computation of MMD is prohibitive, as the kernel matrix requires quadratic time and space with respect to the number of samples. However, this is not always the case, because the explicit kernel matrix is not always required. In maximum mean discrepancy, only the sum of all elements of the kernel matrix is needed. We describe a few simple tricks that allow this sum to be computed in linear time and space. With parallel computation on graphics processing units (GPUs), linear time may not affect the computation time much, but the linear space complexity allows much larger minibatch sizes given the restricted memory of current GPUs.

Let $X$ and $Y$ denote two $n \times d$ datasets in $\mathbb{R}^d$, with rows $x_i^T$ and $y_j^T$. For simplicity, we assume throughout that the two datasets have the same number of samples.

Firstly, the linear kernel sum can be computed in $O(nd)$ space and time. This is done by replacing summation with multiplication by a vector of ones. Let $e$ denote a vector of ones of length $n$. Then

$$\sum_{i,j}^{n} x_i^T y_j = e^T X Y^T e = \underbrace{(e^T X)}_{1 \times d}\,\underbrace{(Y^T e)}_{d \times 1} \tag{A.1}$$

Note that $e^T X$ can also be computed as $\sum_i x_i^T$, which takes linear time.
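As a minimal sketch of identity (A.1), assuming NumPy and sample matrices stored row-wise, the following compares the column-sum form against the naive sum over the full kernel matrix. The function names and the random test data are illustrative and are not taken from the thesis code; `linear_mmd2` additionally assumes the biased (V-statistic) form of the MMD estimate.

```python
import numpy as np


def linear_kernel_sum(X, Y):
    """Sum of x_i^T y_j over all pairs (i, j), in O(nd) time and O(d) extra space."""
    # e^T X and Y^T e are just the column sums of X and Y.
    return float(X.sum(axis=0) @ Y.sum(axis=0))


def linear_kernel_sum_naive(X, Y):
    """Reference implementation that forms the full kernel matrix X Y^T."""
    return float((X @ Y.T).sum())


def linear_mmd2(X, Y):
    """Biased MMD^2 estimate with a linear kernel, built from three kernel sums."""
    n, m = X.shape[0], Y.shape[0]
    return (linear_kernel_sum(X, X) / n**2
            - 2.0 * linear_kernel_sum(X, Y) / (n * m)
            + linear_kernel_sum(Y, Y) / m**2)


# Illustrative data: two Gaussian samples with shifted means.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
Y = rng.normal(size=(500, 8)) + 0.5

fast = linear_kernel_sum(X, Y)
slow = linear_kernel_sum_naive(X, Y)
mmd2 = linear_mmd2(X, Y)
```

With a linear kernel the biased estimate collapses to the squared distance between the two sample means, which gives a second way to sanity-check the result.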
The polynomial kernel of degree 2 with bias $b$ can be handled in the same way. The bias contributes $b$ to every entry of the kernel matrix, that is, the rank-one matrix $b\,ee^T$:

$$\begin{aligned}
\sum_{i,j}^{n} (x_i^T y_j + b)^2 &= \|XY^T + b\,ee^T\|_F^2 \\
&= \operatorname{trace}\left[(XY^T + b\,ee^T)^T (XY^T + b\,ee^T)\right] \\
&= \operatorname{trace}\left[(XY^T)^T (XY^T)\right] + 2b\,\operatorname{trace}\left[ee^T XY^T\right] + b^2 \operatorname{trace}\left[ee^T ee^T\right] \\
&= \operatorname{trace}\left[(X^T X)(Y^T Y)\right] + 2b\,(e^T X)(Y^T e) + n^2 b^2
\end{aligned} \tag{A.2}$$

The last line uses the cyclic property of the trace, and only involves the $d \times d$ matrices $X^T X$ and $Y^T Y$ together with the column sums from (A.1). This changes the $O(n^2 d)$ operation into an $O(nd^2)$ operation, and the $n \times n$ kernel matrix never needs to be stored. For the purposes of our method, $d$ can be made very small due to the use of a parametric neural net. (This is in contrast to computing the sum directly on images, where the dimension of the image is $3HW$. Even for medium-sized images, the $d^2$ cost can be intractable.)

For shift-invariant kernels such as the Gaussian RBF, Random Kitchen Sinks (RKS) or the Fastfood method can be used to construct approximate feature maps $\hat{\phi}(x_i)^T \hat{\phi}(y_j) \approx k(x_i, y_j)$. For other kernels, the Nyström method can be used, though it is unknown how well it would perform in an adversarial setting. Since these methods approximate the kernel by a linear kernel on the approximate features, the space and time complexity are the same as for linear kernels.

Appendix B
Inverse network architecture

The architecture of the truncated VGG-19 network used in the experiments is shown in Table B.1, and the inverse network architecture is shown in Table B.2. It is possible that better architectures achieve better results, as we did not try many different types of convolutional neural network architectures.

– Convolutional layers use filter sizes of 3 × 3, padding of 1, and stride of 1.
– The rectified linear unit (ReLU) layer is an elementwise function ReLU(x) = max{x, 0}.
– The instance norm (IN) layer standardizes each feature channel independently to have 0 mean and a standard deviation of 1. This layer has shown impressive performance in image generation networks [88].
– Maxpooling layers downsample by a factor of 2 by using filter sizes of 2 × 2 and stride of 2.
– Nearest neighbor (NN) upsampling layers upsample by a factor of 2 by using filter sizes of 2 × 2 and stride of 2.

Layer Type        Activation Dimensions
Input             H × W × 3
Conv-ReLU         H × W × 64
Conv-ReLU         H × W × 64
MaxPooling        H/2 × W/2 × 64
Conv-ReLU         H/2 × W/2 × 128
Conv-ReLU         H/2 × W/2 × 128
MaxPooling        H/4 × W/4 × 128
Conv-ReLU         H/4 × W/4 × 256

Table B.1: Truncated VGG-19 network from the input layer to “relu3_1” (last layer in the table).

Layer Type        Activation Dimensions
Input             H/4 × W/4 × 256
Conv-IN-ReLU      H/4 × W/4 × 128
NN-Upsampling     H/2 × W/2 × 128
Conv-IN-ReLU      H/2 × W/2 × 128
Conv-IN-ReLU      H/2 × W/2 × 64
NN-Upsampling     H × W × 64
Conv-IN-ReLU      H × W × 64
Conv              H × W × 3

Table B.2: Inverse network architecture used for inverting activations from the truncated VGG-19 network.
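The activation dimensions listed in Tables B.1 and B.2 follow mechanically from the layer definitions above. The sketch below traces them in plain Python; it is only a shape calculator under the assumption that H and W are divisible by 4, and the names `trace_shapes`, `ENCODER`, and `DECODER` are hypothetical rather than part of the thesis code. ReLU and instance norm are omitted because they leave shapes unchanged.

```python
def trace_shapes(layers, shape):
    """Return the activation shape (H, W, C) after the input and after each layer."""
    h, w, c = shape
    shapes = [(h, w, c)]
    for kind, channels in layers:
        if kind == "conv":
            # 3x3 filters, padding 1, stride 1: spatial size is preserved.
            c = channels
        elif kind == "maxpool":
            # 2x2 filters, stride 2: halves height and width.
            h, w = h // 2, w // 2
        elif kind == "nn_upsample":
            # Nearest-neighbor upsampling: doubles height and width.
            h, w = 2 * h, 2 * w
        else:
            raise ValueError(f"unknown layer type: {kind}")
        shapes.append((h, w, c))
    return shapes


# Truncated VGG-19 up to relu3_1 (Table B.1).
ENCODER = [
    ("conv", 64), ("conv", 64), ("maxpool", None),
    ("conv", 128), ("conv", 128), ("maxpool", None),
    ("conv", 256),
]

# Inverse network (Table B.2).
DECODER = [
    ("conv", 128), ("nn_upsample", None),
    ("conv", 128), ("conv", 64), ("nn_upsample", None),
    ("conv", 64), ("conv", 3),
]

encoder_shapes = trace_shapes(ENCODER, (256, 256, 3))
decoder_shapes = trace_shapes(DECODER, encoder_shapes[-1])
```

Feeding the encoder's final shape into the decoder recovers the input resolution with 3 channels, which is the property the inverse network relies on.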
Deep kernel mean embeddings for generative modeling and feedforward style transfer Chen, Tian Qi 2017
Item Metadata
Title | Deep kernel mean embeddings for generative modeling and feedforward style transfer |
Creator | Chen, Tian Qi |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | The generation of data has traditionally been specified using hand-crafted algorithms. However, oftentimes the exact generative process is unknown while only a limited number of samples are observed. One such case is generating images that look visually similar to an exemplar image or as if coming from a distribution of images. We look into learning the generating process by constructing a similarity function that measures how close the generated image is to the target image. We discuss a framework in which the similarity function is specified by a pre-trained neural network without fine-tuning, as is the case for neural texture synthesis, and a framework where the similarity function is learned along with the generative process in an adversarial setting, as is the case for generative adversarial networks. The main point of discussion is the combined use of neural networks and maximum mean discrepancy as a versatile similarity function. Additionally, we describe an improvement to state-of-the-art style transfer that allows faster computations while maintaining generality of the generating process. The proposed objective has desirable properties such as a simpler optimization landscape, intuitive parameter tuning, and consistent frame-by-frame performance on video. We use 80,000 natural images and 80,000 paintings to train a procedure for artistic style transfer that is efficient but also allows arbitrary content and style images. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2017-08-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0354397 |
URI | http://hdl.handle.net/2429/62668 |
Degree | Master of Science - MSc |
Program | Computer Science |
Affiliation | Science, Faculty of; Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2017-09 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |