Investigating the Impact of Normalizing Flows on Latent Variable Machine Translation

by

Michael Przystupa

B.Sc., The University of British Columbia, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

February 2020

© Michael Przystupa, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Investigating the Impact of Normalizing Flows on Latent Variable Machine Translation

submitted by Michael Przystupa in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Muhammad Abdul-Mageed, Linguistics and Information Science (Co-supervisor)
Mark Schmidt, Computer Science (Co-supervisor)

Additional Supervisory Committee Members:

Leonid Sigal, Computer Science (Supervisory Committee Member)

Abstract

Natural language processing (NLP) has pervasive applications in everyday life, and has recently witnessed rapid progress. Incorporating latent variables in NLP systems can allow for explicit representations of certain types of information. In neural machine translation systems, for example, latent variables have the potential to enhance semantic representations, which could help improve general translation quality. Previous work has focused on using variational inference with diagonal-covariance Gaussian distributions, which we hypothesize cannot sufficiently encode latent factors of language that may exhibit multi-modal distributional behavior. Normalizing flows are an approach that enables more flexible posterior distribution estimates by introducing a change of variables with invertible functions. They have previously been used successfully in computer vision to enable more flexible posterior distributions of image data.
In this work, we investigate the impact of normalizing flows in autoregressive neural machine translation systems. We do so in the context of two currently successful approaches: attention mechanisms and language models. Our results suggest that normalizing flows can improve translation quality in some scenarios, but require certain modelling assumptions to achieve such improvements.

Lay Summary

In this work, we consider the question of how to better encode complex information shared between languages (such as the hidden semantic meaning of sentences) in order to improve machine translation systems. In previous work this is accomplished by introducing continuous random variables which are assumed to have simple probability distributions. We extend these works by making the distributions more flexible through the addition of "normalizing flows": invertible functions that can transform simple distributions into complex ones. Normalizing flows have previously been quite helpful in other areas of artificial intelligence, including computer vision. Our results suggest normalizing flows can benefit existing machine translation systems, but require particular modelling design choices to be utilized well.

Preface

All of the work presented henceforth was conducted jointly by the Machine Learning lab in the Department of Computer Science and the Deep Learning and Natural Language Processing lab in the School of Information, both located at the University of British Columbia, Point Grey Campus. All work was performed under the supervision of Dr. Muhammad Abdul-Mageed and Dr. Mark Schmidt. Aaron Mischkin provided code snippets to help visualize the latent variables. Preliminary results were presented at the Invertible Neural Networks and Normalizing Flows workshop at the 2019 International Conference on Machine Learning.
The remainder of this thesis, including writing, figures, experiments, and implementation, was completed by the author of this thesis.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
1 Introduction
2 Background
  2.1 Neural Machine Translation
  2.2 Variational Autoencoders
  2.3 Normalizing Flows
3 Latent Variable Neural Machine Translation
  3.1 Neural Architecture
    3.1.1 Encoder
    3.1.2 Global Attention
    3.1.3 Decoder
  3.2 Representation of Latent Variables
  3.3 Discriminative Translation Model
  3.4 Generative Translation Model
    3.4.1 Latent Variable in Language and Translation Model
4 Normalizing Flows in Machine Translation
  4.1 Applying Flows to Latent Variables
  4.2 Considered Flows for Analysis
    4.2.1 Planar Flows
    4.2.2 Inverse Autoregressive Flows
  4.3 Regularization Tricks
5 Experiments
  5.1 General Translation
  5.2 Importance of Attention
  5.3 Understanding Latent Variable
  5.4 Language Modelling Performance
6 Conclusion
Bibliography
A Supporting Materials
  A.1 Discussion on other latent dimensions

List of Tables

Table 5.1  BLEU scores for our models with normalizing flows for German-English (De-En) translation. The best performances are in bold, compared across differing numbers of flows and the baseline. Yellow rows represent our variational neural machine translation (VNMT) results, and red rows our generative neural machine translation (GNMT) results, for differing numbers and types of flows.

Table 5.2  Results for translation systems without an attention mechanism. Baselines include models with normalizing flows and a deterministic version of the model excluding latent variables. VNMT results are in yellow and GNMT results in red; the best models across number and type of flow are in bold. No Latent (NL) models are trained without z included.

Table 5.3  Average KL divergence for the test set. For VNMT (yellow) the KL term should be smaller, meaning the distributions encode similar information. For GNMT (red) the KL terms should be higher, as this suggests more informative latent spaces.
Table 5.4  Change in BLEU score when z is set to the 0 vector at decode time. Negative numbers indicate our models do better without z included during translation.

Table 5.5  Average KL divergence for the test set for models without attention. VNMT should generally be lower, whereas GNMT should generally have non-zero KL terms.

Table 5.6  Change in BLEU score for models without attention. Positive values indicate the latent variable z is important to the translation system. All models seem to depend on z when attention is unavailable.

Table 5.7  BLEU scores for GNMT without language model training. Bold entries are the best performing models. We include previous GNMT results in red to compare with GNMT without language model training (blue).

Table 5.8  GNMT training results without attention or language model training. In several instances, GNMT without language model training outperforms models which optimize the language model.

Table 5.9  KL divergence of GNMT models without language model training. Typically, a near-0 KL divergence with stationary priors indicates posterior collapse has occurred.

Table A.1  List of hyperparameters used for experiments.

Table A.2  IWSLT sentence counts for the De-En language pair. Counts represent the actual number of sentences we use in our analysis when limiting maximum sentence length to 50. Values in parentheses represent the full sentence counts of each dataset.

Table A.3  BLEU scores for our models with normalizing flows for De-En translation. The best performing models are in bold for each type and number of flows. This table is similar to Table 5.1 in the main thesis, but includes results for latent dimensions set to 2.

Table A.4  Average KL divergence for the test set. For VNMT the KL term should be smaller, meaning the distributions encode similar information.
For GNMT they should be higher, as this suggests more informative latent spaces.

Table A.5  Change in BLEU score when z is set to the 0 vector at decode time. Negative numbers indicate our models do better without z included during translation.

Table A.6  BLEU score difference of GNMT with attention when the language model is not optimized during training. We include GNMT with language model training for reference. The comparison shows that without the language model, GNMT does not incorporate information at all from the latent variable z.

Table A.7  KL divergence for GNMT models trained without attention or language model optimization. Results suggest posterior collapse has occurred, as the values are almost all near 0.

Table A.8  Measure of performance drop when z is removed from the trained system for GNMT without attention or language model optimization.

List of Figures

Figure 2.1  Simplest formulation of a sequence to sequence (SEQ2SEQ) model in neural machine translation (NMT) [32].

Figure 3.1  Graphical representation of latent variable neural machine translation (LVNMT) systems we consider. Left is the discriminative model [42]; right is the joint or generative model [6]. Dashed lines represent the variational distribution. Solid lines are the model.

Figure 3.2  General approach to encoding the parameters of the latent variable z in both models considered. The encoder is part of the inference network in the generative case.

Figure A.1  One of our GNMT models with a 2-dimensional latent space and 8 planar flows. Each row is a sentence pair with an increasing number of flows applied, from 0 through 8, ordered with 0 flows on the far left and 8 flows on the far right.
Figure A.2  One of our 2-dimensional latent variable GNMT models trained with 8 inverse autoregressive flow (IAF) flows. Each row is a sentence pair with an increasing number of flows applied to samples from the base distribution, from 0 through 8, ordered with 0 flows on the far left and 8 flows on the far right.

Figure A.3  Two-dimensional latent variable VNMT with 8 planar flows. Captions are the sentence pairs. Each row is a sentence pair with an increasing number of flows applied, from 0 through 8, ordered with 0 flows on the far left and 8 flows on the far right.

Figure A.4  Visualization of the latent spaces of several sentence pairs with the VNMT model and 8 IAF flows. Solid blue images mean the distribution has been flattened into a line. Each row is a sentence pair with 0 flows on the far left and 8 flows applied on the far right.

Glossary

BPE  byte-pair encoding
ELBO  evidence lower bound
GRU  gated recurrent unit
GNMT  generative neural machine translation
IAF  inverse autoregressive flow
LVNMT  latent variable neural machine translation
MC  Monte Carlo
MLP  multi-layer perceptron
NLM  neural language model
NMT  neural machine translation
RNN  recurrent neural network
SEQ2SEQ  sequence to sequence
SOTA  state of the art
SMT  statistical machine translation
VAE  variational autoencoder
VI  variational inference
VNMT  variational neural machine translation

Acknowledgements

Thank you to my supervisors, Muhammad Abdul-Mageed and Mark Schmidt, for their support throughout this whole project. Thank you to Frank Wood for several informative discussions on latent variable models for machine translation. Thank you to William Harvey, who helped with understanding the Pyro probabilistic programming language in which all models were implemented.
Thank you to my parents, who have loved and supported me through the years, and to my sister, who helped improve my writing over the years. Thank you to the University of British Columbia for all the memories.

Chapter 1

Introduction

Incorporating latent variables to explicitly capture aspects of language, such as semantics, has previously been shown to improve neural machine translation (NMT) quality. This includes difficult scenarios in machine translation, such as translating longer sentences [28, 31, 42], demonstrating robustness to domain mismatch between training and test data [6], and enabling word-level imputation for noisy sentences [28].

Another utility of latent variable neural machine translation (LVNMT) systems is encoding lexical variation. This is achieved by sampling from the latent variables and using beam search to find semantically similar sentences [25, 29]. Generating semantically meaningful sentences is a useful property, because research has shown that synthetically generated bi-text can improve translation system quality [5, 26]. In the machine translation literature, bi-text generally refers to paired sentences from a source language and their translations into a target language. Depending on the model formulation, LVNMT systems could thus help build even better machine translation systems by generating synthetic bi-text of sufficiently good quality.

To our knowledge, much of the research in LVNMT applies amortized variational inference to learn the posterior distribution of paired language data. Authors have generally focused on creating variational autoencoder type models which optimize the evidence lower bound (ELBO) [14, 24]. In the context of translation, this involves maximizing the log-likelihood of the conditional distribution p(y | x, z), where y is the target language sentence, x is the source language sentence, and z is the introduced latent variable.
Authors have assumed the variational posterior distribution is a Gaussian with diagonal covariance, and learn a variational distribution q_φ(z | ·) conditioned on different combinations of the available paired sentences.¹

The primary focus of this work is to investigate the choice of variational distribution used to encode information about translation data. A criticism of variational inference is that it offers limited guarantees on approximating, even asymptotically, the true posterior distribution. Several empirical findings suggest that choosing the Gaussian as the variational distribution family may not truly represent latent aspects of language. One simple example is the power-law behaviour that word frequencies exhibit in large corpora of text [16]. Previous work on language models presented experimental results demonstrating multi-modal distributional behavior even at the character level [44]. These results suggest that assuming the latent factors follow an isotropic Gaussian distribution is not representative of the true distributional behavior of languages. If latent variables are to be utilized more effectively for machine translation, one needs to consider more flexible variational distributions.

Normalizing flows represent one variational inference approach towards producing more accurate posterior distribution estimates. They accomplish this by transforming a base distribution into a more complex, possibly multi-modal, distribution [33, 34]. This change of variables is achieved with invertible functions which transform samples from a chosen base distribution [23]. The approach has the added benefit of empirical findings showing more accurate approximations of target posterior distributions in settings where such distributions are known [23].

In the literature, normalizing flows have seen a number of successes in computer vision, and more recently in natural language processing.
In the task of image generation in particular, normalizing flows have been successful in producing high resolution images [13, 15, 35, 38]. Schulz et al. [25] proposed normalizing flows as a potential improvement in their work on LVNMT systems but, to our knowledge, never pursued this direction. Recent works have also considered flows on discrete distributions with modulus operations [12, 36]. Continuous normalizing flows have been used for non-autoregressive language modelling as one means of producing meaningful sentences with faster decoding [44]. Most closely related to our work is that of Ma et al. [19], who created a non-autoregressive normalizing flow machine translation system. Our work differs from Ma et al. [19] in that we consider incorporating normalizing flows into variations of existing autoregressive LVNMT systems.

¹ Some condition on both the target and source sentence [6, 28, 42], just the target sentence [25], or even just the source [6].

We conjecture that normalizing flows are capable of helping achieve better posterior approximations of language factors, and that these improved estimates can increase the expressiveness of latent codes in machine translation. Overall, we make the following contributions:

1. We investigate the use of normalizing flows in LVNMT and discuss related considerations and challenges.

2. Our experiments suggest that performance improvements due to the introduction of normalizing flows are small compared to baseline models.

3. We find the direct contribution of the latent variable to final translation performance to be marginal overall for systems trained to include such variables.

Chapter 2

Background

In this chapter, we provide background information on several subjects related to the work in this thesis. These include a general description of NMT as well as discussions of variational autoencoders and normalizing flows.
Further details about the particular NMT architecture considered in this work are discussed in Chapter 3.

2.1 Neural Machine Translation

Statistical machine translation (SMT) is a field which applies statistical methods to train computer systems that perform language translation. Historically, these methods have typically involved learning word-level mappings, phrase tables, and language models to perform translation [16]. These components are learned with statistical methods, such as expectation maximization, by analyzing the bi-text to extract relevant phrases and derive their likelihood under the provided bi-text. In the SMT literature, NMT refers specifically to SMT systems which incorporate neural networks as the primary model.

NMT systems have achieved state of the art (SOTA) results across a variety of language pairs compared with alternative approaches in the SMT literature [1, 17, 39]. For human understanding, the goal is to correctly translate from a source language x (for example, English) to a target language y (for example, German) such that the sentence is coherent and captures the meaning of the source sentence.¹ Capturing the subjective quality of a translation makes defining an optimization objective a difficult problem. Instead, similar to historical SMT, NMT systems learn to maximize the log-likelihood of the conditional distribution p(y | x):

max_θ log p_θ(y | x) = ∑_{i=1}^{T} log p_θ(y_i | x, y_{<i}).   (2.1)

Here, θ represents the parameters of the NMT model, y_{<i} refers to conditioning on all previous words excluding word i, and T is the sequence length.

¹ There are other criteria for defining quality in translations, including concepts such as adequacy, fidelity, and fluency [16, 21].

There are a variety of hyperparameter choices when building an NMT system. One core component of many SOTA systems is the auto-regressive neural network, which conditions on its own previous outputs when modelling data with sequential relations.
One category of such networks is the recurrent neural network (RNN), which maintains an internal hidden representation. This hidden representation can be viewed as the network's memory of a sequence [9]. In the above objective, the hidden state is generally interpreted as allowing the system to condition on the whole source sentence x and all previous target words y_{<i}. Unfortunately, RNNs can suffer from long-term dependency problems, which can lead to gradients either vanishing or exploding during training. Researchers have developed a number of architectures to address this problem, such as the long short-term memory cell [9].

As there are several types of RNNs, we only describe the gated recurrent unit (GRU), because it is the one used in our work [4]. The GRU can be viewed as a simplification of a long short-term memory cell which helps mitigate the exploding or vanishing gradient problem caused by long sequence dependencies [9]. The network can be described with the following equations:

z_t = sigmoid(W_z [x_t; h_{t−1}]),   (2.2)
r_t = sigmoid(W_r [x_t; h_{t−1}]) ∘ h_{t−1},   (2.3)
h_t = (1 − z_t) ∘ h_{t−1} + z_t ∘ tanh(W_h [x_t; r_t]).   (2.4)

In these equations, W_z, W_r, and W_h represent the learned weight matrices (which can include a bias term), [a; b] refers to a concatenation operation, and ∘ is the element-wise product. Intuitively, the GRU works as a soft logic gate where the update gate z_t controls the 'relevance' of the previous and current state in the memory of the network. The reset gate r_t decides the importance of previous information in conjunction with the new input.

Whether choosing the GRU or an alternative, NMT models generally follow the sequence to sequence (SEQ2SEQ) framework [40]. In SEQ2SEQ models, the source sentence x is encoded as a series of latent representations capturing word-in-context information. A decoder utilizes these hidden states, for example to initialize its own hidden state, to help inform the decoding process for the target sentence y.
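The GRU update in Eqs. (2.2)–(2.4) can be sketched in a few lines of NumPy. This is an illustrative toy rather than the thesis implementation: the weight matrices, dimensions, and random inputs are assumptions, and a tanh nonlinearity is applied to the candidate state as in the standard GRU formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU update in the spirit of Eqs. (2.2)-(2.4); [a; b] is concatenation."""
    xh = np.concatenate([x_t, h_prev])
    z_t = sigmoid(W_z @ xh)                           # update gate, Eq. (2.2)
    r_t = sigmoid(W_r @ xh) * h_prev                  # gated previous state, Eq. (2.3)
    cand = np.tanh(W_h @ np.concatenate([x_t, r_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * cand          # convex combination, Eq. (2.4)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                      # toy dimensions
W_z = rng.normal(size=(d_h, d_in + d_h))
W_r = rng.normal(size=(d_h, d_in + d_h))
W_h = rng.normal(size=(d_h, d_in + d_h))

h = np.zeros(d_h)
for t in range(5):                                    # run over a toy input sequence
    h = gru_step(rng.normal(size=d_in), h, W_z, W_r, W_h)
```

Because each update is a convex combination of the previous state and a tanh output, starting from h_0 = 0 the hidden state stays bounded in [−1, 1], which is part of what keeps the recurrence stable over long sequences.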
The decoder can be interpreted as a conditional neural language model [17].²

The other major consideration for NMT is picking the best translation. A naive solution is to take the most likely word at each step of decoding. This can lead to the garden-path, or label bias, problem, in which the most likely words at the current time step lead to a series of unlikely words later in the translation [17, 18]. Instead, beam search is typically employed to maintain n possible beams (translation hypotheses) [17]. At each step, after committing to the top n beams from the previous time step, a score function is calculated and the top n new beams are selected. The score function is the partial probability of the hypothesis up to the current time step i, ∏_{j=1}^{i} p(y_j | x, y_{<j}). When a beam encounters an end-of-sentence token, or the designated maximum length, it is held out and the number of active beams is reduced. Once all beams are inactive, the best translation is chosen from the remaining beams based on their probability normalized by length, (1/K) log p(y_1, ..., y_K) for a hypothesis of length K.

² This simply means a neural network is chosen to model the joint probability of a sentence p(y_1, y_2, ..., y_T) with continuous representations.

Figure 2.1: Simplest formulation of a SEQ2SEQ model in NMT [32].

2.2 Variational Autoencoders

The variational autoencoder (VAE) is a class of generative model which represents the joint distribution p(X, z), where X is the data set and z is an introduced random variable. The joint distribution is typically factorized as p(X, z) = p(X | z) p(z). This is interpreted as assuming the data set X was generated by a latent process z.

An important aspect of VAEs is amortized inference, which involves introducing functions f(x) that produce distribution parameters, in our case for each sentence pair.³ The motivation for employing amortized inference is to eliminate the need to learn separate distribution parameters for each sentence pair.
Otherwise, each sentence pair would require its own parameters, requiring much more computation. In VAEs this function is typically represented with a neural network. Throughout this thesis, when we describe any distribution as p_θ(·), the chosen Greek symbol (it will vary beyond θ) represents the parameters of the neural network which handles the amortization of p(·).

To learn a good representation of p(X, z), the objective is to maximize the log-likelihood of the marginal distribution p(X). To calculate p(X), one would need to integrate over the random variable z:

log p(X) = log ∫ p(X | z) p(z) dz.   (2.5)

Unfortunately, this integral is generally considered intractable, often because a neural network is chosen to handle the amortization of the distribution parameters. This could be addressed by Monte Carlo sampling from p(z), but the latent variable is unknown and the posterior p(z | X) is also difficult to calculate.

³ More generally, the amortization function learns distribution parameters per datum, which could be of any modality or appropriate representation depending on the application.

Instead, researchers often solve this problem with optimization. This involves applying variational inference (VI) to learn an approximation of the true posterior p(z | X). A variational distribution q_φ(z | X), parametrized by a neural network φ, is introduced and optimized along with the model. This can be interpreted as introducing an inference network which performs the amortization of the posterior parameters, as discussed previously. The objective in VI then involves maximizing the evidence lower bound (ELBO), or variational free energy, on the log-likelihood:

log p_θ(X) ≥ E_{q_φ(z | X)}[log p_θ(X | z)] − KL(q_φ(z | X) || p(z)).   (2.6)

An important thing to note in the above equation is that the prior p(z) is typically chosen to be stationary. This means the ELBO can be interpreted as optimizing two conflicting objectives.
The expectation term seeks to maximize the reconstruction of the data from z, whereas the KL term constrains the variational distribution to remain close to the prior over the latent space.

Despite this conflict, we have gained some important properties. As the KL divergence is non-negative, and 0 only when the distributions are identical, it is theoretically possible to recover the true log-likelihood of the data. We can also sample from q_φ(z | X) to approximate the expectation term, which was not possible before.

Unfortunately, directly sampling from q_φ(z | X) is a non-differentiable operation, which means we cannot calculate gradients end-to-end in the model. One approach to mitigate this is the re-parametrization trick, in which the variational distribution is rewritten as a function and sampling is done from a surrogate distribution [14, 24]. Here, we show this approach for the Gaussian with mean μ and diagonal covariance σ conditioned on x:

f_φ(x, ε) = μ_φ(x) + ε ∘ σ_φ(x),   ε ∼ N(0, I).   (2.7)

Here, ∘ is an element-wise operation between each dimension of σ_φ(x) and ε. With this, the expectation in the ELBO can be rewritten with respect to p(ε), which enables end-to-end optimization:

log p_θ(X) ≥ E_{p(ε)}[log p_θ(X | f_φ(x, ε))] − KL(q_φ(z | X) || p(z)).   (2.8)

2.3 Normalizing Flows

Normalizing flows are an application of the change of variables theorem in machine learning. They introduce a series of K invertible functions f_{1:K} which transform a base distribution q_0(z_0) into another distribution q_K(z_K). A sample z_0 from the base distribution is transformed into a sample z_K by the composition

z_K = f_K ∘ f_{K−1} ∘ ... ∘ f_1(z_0),   (2.9)

where ∘ here is shorthand for the nested function calls. The density of the transformed sample follows from the change of variables theorem:

q_K(z_K) = q_0(z_0) ∏_{k=1}^{K} |det(∂f_k / ∂z_{k−1})|^{−1}.   (2.10)

Alternatively, it becomes possible to do density estimation of samples by applying the inverses f_k^{−1} to samples z_K.
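The change of variables computation can be checked numerically with the simplest possible invertible map, an affine flow f(z) = az + b, for which the transformed density has a closed form. This toy example is ours for illustration only; it is not one of the flows studied later in the thesis.

```python
import numpy as np

def flow(z0, a, b):
    """Invertible affine flow f(z) = a*z + b with scalar Jacobian df/dz = a."""
    return a * z0 + b

def log_q0(z):
    """Log-density of the standard normal base distribution q0."""
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

a, b = 2.0, 1.0
z0 = 0.3                                 # a draw from the base distribution
z1 = flow(z0, a, b)                      # a single flow step

# Change of variables in log form: log q1(z1) = log q0(z0) - log|det df/dz|
log_q1 = log_q0(z0) - np.log(abs(a))

# For this flow, q1 is exactly Normal(b, a^2), so the result can be
# verified against the closed-form Gaussian density.
log_q1_exact = -0.5 * ((z1 - b) / a) ** 2 - 0.5 * np.log(2.0 * np.pi * a ** 2)
```

Stretching the distribution by a factor of a lowers the density by log|a| in log space, exactly the Jacobian correction: the transformed log-density computed via the change of variables matches the closed-form Normal(b, a²) log-density.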
To our knowledge, there are two main ways normalizing flow models are optimized in the literature.

The first approach incorporates normalizing flows into the ELBO, which introduces an expectation of the log absolute Jacobian determinants with respect to the variational distribution:

E_{q_φ(z_0 | X)}[log p_θ(X | z_K)] − KL(q_φ(z_0 | X) || p(z_K)) + E_{q_φ(z_0 | X)}[ ∑_{k=1}^{K} log |det(∂f_k / ∂z_{k−1})| ].   (2.11)

Previous works typically employ hyper-networks [10] to amortize the parameters of the normalizing flows instead of optimizing them directly [23, 35, 37]. In the VI setting, this enables the variational distribution q to capture possible posterior representations beyond those achievable with the base distribution.

Alternatively, normalizing flows enable direct optimization of the log probability of the data distribution:

log p(x) = log p(z_0) − ∑_{k=1}^{K} log |det(∂f_k / ∂z_{k−1})|.   (2.12)

This approach is quite useful, as it eliminates the need for the variational distribution in VI. The choice of flow becomes particularly important, though, because of the dependence on the log absolute Jacobian determinant.

The predominant focus of normalizing flows research has been on defining invertible functions which have computationally efficient Jacobians. This has led to a variety of flows with different Jacobian formulations: autoregressive flows, which lead to triangular Jacobians [15, 20]; flows which by construction have analytic Jacobians [23, 37]; and volume-preserving flows, whose Jacobian determinant equals one [12, 35, 36]. Note that these groupings do not necessarily reflect the full range of approaches to normalizing flows.

Chapter 3

Latent Variable Neural Machine Translation

In this chapter we describe the LVNMT models we considered as part of our analysis. We begin by describing the underlying NMT architecture common to both approaches. We then describe our discriminative model p(y | x) and our generative model p(x, y). These models are represented abstractly as graphical models in Figure 3.1.
The final section discusses the practical details of representing the latent variables and defining the inputs from the appropriate components in each architecture. Throughout this chapter we will often write vectors as belonging to R^n; this does not mean all such vectors share the same dimensionality, and we specify actual dimensions wherever appropriate.

3.1 Neural Architecture

In this section we explain the base neural architecture from which our LVNMT models are built. There is nothing inherently unique to this architecture for incorporating latent variables: the ideas in each LVNMT model considered are applicable to alternative neural architectures, such as the Transformer [39], which may benefit from introduced latent variables as well.

The NMT architecture we consider is the SEQ2SEQ model proposed in the work of Bahdanau et al. [1]. The core components include source and target word embeddings, an encoder, an attention mechanism, and a decoder. We describe all the layers except the word embeddings, which are projections of the source and target vocabularies into a continuous space R^n. When we refer to words x_i or y_i, these correspond to each word's associated word embedding as input to the model.

3.1.1 Encoder

The encoder is a bi-directional RNN which generates hidden states by reading the source sequence x both forwards and backwards. Formally, the RNN produces forward hidden states h_i^f ∈ R^n for each input word x_i, where each h_i^f is conditioned on all previous words x_{<i} through h_{i−1}^f. The sequence is then read backwards by the RNN to produce hidden states h_i^b ∈ R^n for each word x_i, where each h_i^b is conditioned on all subsequent words in the sentence x_{>i}. The final hidden states are a concatenation of these complementary embeddings, h_i = [h_i^f; h_i^b] for all i ∈ [1, ..., T], where T is the sentence length. Intuitively, each h_i can be viewed as a contextual embedding of each source word in the sentence.
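The bi-directional encoding described above can be sketched as follows. For brevity, a vanilla tanh RNN stands in for the GRU used in the thesis, and all weights and dimensions are illustrative assumptions rather than the actual configuration.

```python
import numpy as np

def rnn_pass(xs, W, reverse=False):
    """Run a vanilla tanh RNN over xs in the given direction and return
    the hidden state produced at every position."""
    d_h = W.shape[0]
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    h, states = np.zeros(d_h), {}
    for i in order:
        h = np.tanh(W @ np.concatenate([xs[i], h]))
        states[i] = h
    return [states[i] for i in range(len(xs))]

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 5
xs = [rng.normal(size=d_in) for _ in range(T)]   # word embeddings x_1..x_T
W_f = rng.normal(size=(d_h, d_in + d_h))         # forward RNN weights
W_b = rng.normal(size=(d_h, d_in + d_h))         # backward RNN weights

h_f = rnn_pass(xs, W_f)                   # h_i^f conditions on x_{<=i}
h_b = rnn_pass(xs, W_b, reverse=True)     # h_i^b conditions on x_{>=i}
H = [np.concatenate(pair) for pair in zip(h_f, h_b)]  # h_i = [h_i^f; h_i^b]
```

Each concatenated state h_i thus summarizes the words to the left of position i (through the forward pass) and the words to its right (through the backward pass), which is what makes it a contextual embedding of the source word.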
We next describe how these embeddings are utilized for decoding via the global attention mechanism.

Global Attention

In the context of SEQ2SEQ models, global attention mechanisms combine the encoder hidden states to inform the decoding process. This is achieved by a function of the current decoder state s_j and the encoder states H \in R^{T \times n} which outputs an energy vector e \in R^T. In the work of Bahdanau et al. [1] the authors propose a multi-layer perceptron (MLP) attention function

e_i = \mathrm{MLP}(h_i, s_j), \quad \forall i \in [1 \ldots T]. (3.1)

These energy values are usually normalized to provide weights \alpha_i per hidden state. Bahdanau et al. [1] choose the softmax function for this operation to produce a context vector c_j:

\alpha_i = \frac{\exp(e_i)}{\sum_{t=1}^{T} \exp(e_t)}, (3.2)

c_j = \sum_{i=1}^{T} \alpha_i h_i. (3.3)

Note that \alpha_i is a scalar while h_i is a vector. The intuition for c_j is that it captures alignment information between the source sequence and the current position in the translation (in other words, target) sequence [1].

Decoder

The decoder is a unidirectional (forward) RNN which generates the translated target sentence. It reads the sequence forward, producing hidden states s_j \in R^n for each target word y_j in the sequence of length K. In the literature, it can be viewed as a conditional language model [17] whose hidden state is initialized as s_0 = \tanh(\mathrm{affine}(h_T)). Affine refers to a linear layer with a learned weight matrix and bias term. The decoder has three inputs: the previous word y_{j-1}, the previous decoder hidden state s_{j-1}, and the context vector c_j mentioned in our discussion of the global attention mechanism.

To generate the probabilities for each word j in the target sentence, the decoder includes an MLP which uses the maxout activation over the hidden state values [8]:

\mathrm{maxout}(h_j) = \left[\max\left(h_j^{2k-1}, h_j^{2k}\right)\right]_{k=1}^{K/2}. (3.4)

For clarity, the maxout operation takes the maximum value between h_j^{2k-1} and h_j^{2k}, which are positions in the input vector h_j of length K.
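The maxout operation of Equation 3.4 amounts to taking a maximum over consecutive pairs of entries, halving the vector's length. A minimal numpy sketch (the function name and example values are our own):

```python
import numpy as np

def maxout(h):
    """Pairwise maxout: max over consecutive pairs (h_1,h_2), (h_3,h_4), ..."""
    h = np.asarray(h)
    assert h.size % 2 == 0, "maxout expects an even-length input"
    return h.reshape(-1, 2).max(axis=1)

out = maxout([0.2, -1.0, 3.0, 0.5])
assert np.allclose(out, [0.2, 3.0])  # dimension halves from 4 to 2
```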
These maxout values are then fed to a final layer which represents the conditional distribution:

t_j = \mathrm{maxout}(\mathrm{affine}([y_{j-1}; c_j; s_j])), (3.5)

p(y_j \mid x, y_{<j}) = \mathrm{affine}(t_j). (3.6)

Figure 3.1: Graphical representation of LVNMT systems we consider. Left is the discriminative model [42]. Right is the joint or generative model [6]. Dashed lines represent the variational distribution. Solid lines are the model.

Representation of Latent Variables

Both models we consider follow the same process to encode the latent variable z, which is visually shown in Figure 3.2. There are slight variations between the two models in how z is incorporated during decoding, and in the exact inputs to the inference networks. Those details are described in each respective model section that follows.

In either case, the inputs to our models are the hidden representations produced by an encoder as described in the previous section. These hidden states are averaged to produce a single vector representation of the sentence:

h_{\mathrm{mean}} = \frac{1}{T}\sum_{i=1}^{T} h_i. (3.7)

These h_mean vectors are passed through a single-hidden-layer MLP that produces the distribution parameters \mu and \log\sigma for an isotropic Gaussian distribution. We use a softplus operation on \log\sigma when sampling from the distribution.

Another model-agnostic decision is setting the value of z at decode time. During training, we sample the latent variable as usual in VAE-style models. However, at decode time this sampling procedure means different translations could be produced by our LVNMT. Instead, we set z = \mu at evaluation time, which has been previously considered in the literature [6, 42].

Discriminative Translation Model

The discriminative LVNMT model we considered is a variation of the work by Zhang et al. [42]. It models the conditional distribution p(y | x), which is the typical distribution considered in NMT:

p(y \mid x) = \int p(y \mid x, z)\, p(z \mid x)\, dz. (3.8)

As previously discussed, the integral over the latent variable z is often intractable.
This requires optimizing the ELBO and introducing a variational distribution q(z | x, y). In our version of the model, we optimize the ELBO by generating samples from q_\phi(z | x, y), whereas Zhang et al. [42] sample from the prior p_\theta(z | x):

\mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p(y \mid x, z)\right] - \mathrm{KL}\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right). (3.9)

In either formulation, our objectives deviate from typical VAE models, as the prior is a parametrized distribution p_\theta(z | x). Often, the prior is chosen to be a stationary distribution such as N(0, I). A parametrized prior could potentially cause degeneracy in the latent space, because the distributions are not anchored to a fixed target. The model can push each sentence pair's latent distributions apart arbitrarily in order to maximize the translation objective term.

Despite this drawback, the parametrized prior can be beneficial in the translation setting. To translate novel sentences with our variational distribution, we would need the target sentence y, which is not available at decode time. As we minimize the KL divergence between q_\phi and p_\theta, they should encode similar information. This means p_\theta can replace the variational distribution when translating sentences [42].

As the posterior distribution conditions on both source and target sentences, we also need to encode the target sentence y during training. We accomplish this by simply passing the target sentence through our encoder as well and averaging over these target encoder states. The posterior's input is then the concatenation of h_mean^x and h_mean^y. The samples from these distributions are then included as additional inputs to the decoder at each time-step in the decoding process:

y_j = \mathrm{decoder}(s_j, y_{j-1}, c_j, z). (3.10)

Figure 3.2: General approach to encoding the parameters of the latent variable z in both models considered. The encoder is part of the inference network in the generative case.

Generative Translation Model

In the generative model, we learn the joint distribution p(x, y), which has otherwise been considered in the work of Eikema and Aziz [6] and Shah and Barber [28]. In their work, the latent variable is included in the joint distribution and marginalized out during translation:

p(x, y) = \int p(y \mid x, z)\, p(x \mid z)\, p(z)\, dz. (3.11)

In this framework, the latent variable z represents shared aspects of language between the source and target languages. We introduce a variational distribution q(z | x), which was shown to be as effective as conditioning on both target and source [6]. We expand the ELBO of this objective to explicitly show each model optimized:

\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(y \mid x, z)\right] + \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right). (3.12)

Our primary goal is learning a translation system through the distribution p(y | x, z), but as a by-product we also train a neural language model (NLM) on the distribution p(x | z).

Latent Variable in Language and Translation Model

Our system largely follows the architecture described in Eikema and Aziz [6], with the exception of sharing a projection layer from the latent variable to the translation and language model components. For both systems, z initializes the hidden state of an RNN. In particular, for the translation system z initializes the hidden state of the encoder to provide a global semantic context during translation. Otherwise, this model behaves the same way as the baseline NMT system. In the language model, this corresponds to initializing the language model with the latent variable z.
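The Gaussian machinery both models share — a softplus-parametrized standard deviation, reparameterized sampling during training, and (for the N(0, I) prior of the generative objective) an analytic KL term — can be sketched as follows. The helper names are ours and the numbers are illustrative:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sample_z(mu, log_sigma, rng):
    """Reparameterized draw z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + softplus(log_sigma) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_sigma):
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    sigma2 = softplus(log_sigma) ** 2
    return 0.5 * np.sum(sigma2 + mu ** 2 - 1.0 - np.log(sigma2))

mu, log_sigma = np.zeros(4), np.full(4, np.log(np.e - 1.0))  # softplus -> sigma = 1
z = sample_z(mu, log_sigma, np.random.default_rng(0))
assert z.shape == (4,)
assert np.isclose(kl_to_standard_normal(mu, log_sigma), 0.0)  # matches the prior
assert kl_to_standard_normal(np.ones(4), log_sigma) > 0.0     # any mismatch is penalized
```

At decode time the sampling step is skipped entirely and z is set to mu, as described above.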
The most notable difference in our generative model is the inclusion of optimizing p(x | z). The language model behaves similarly to the decoder in our base translation system, with the exclusion of (1) the attention mechanism, and (2) initialization by the latent variable z instead of an encoder network. The motivation for optimizing this component is to help improve translation quality by adding an additional inductive bias for our latent space to encode shared information between the source and target sentences.

Chapter 4

Normalizing Flows in Machine Translation

In this chapter we discuss our approach to incorporating normalizing flows into the LVNMT systems discussed in Chapter 3. This includes descriptions of the normalizing flows we considered, and regularization techniques to improve the utilization of the latent variable with auto-regressive models.

Applying Flows to Latent Variables

As a reminder, normalizing flows transform samples from a base distribution p(z_0) into samples from a more complex distribution by applying k invertible functions f_i sequentially:

z_k = f_k \circ f_{k-1} \circ \ldots \circ f_2 \circ f_1(z_0), \quad z_0 \sim p(z_0). (4.1)

In our LVNMT models, p(z_0) refers to our variational posterior distributions. Each f_i can be viewed as an additional network layer between the base Gaussian distribution and the part of the translation model that takes the latent variable z as input. We visualize this process in Figure 3.2 in orange.

We follow previous research which makes flows data dependent [15, 23, 35, 38]. Generally speaking, this means each input sentence x will have unique transforms f_i, enabling more flexible latent distributions per sentence pair. We leave further details on this to the next section, as each flow handles this differently.
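The composition in Equation 4.1, together with the running sum of log absolute Jacobians needed in this chapter's objectives, can be sketched generically. The elementwise affine flows below are hypothetical stand-ins for the planar or IAF transforms described next:

```python
import numpy as np

def apply_flows(z0, flows):
    """z_k = f_k o ... o f_1(z_0), accumulating sum_i log|det df_i/dz_{i-1}|."""
    z, log_det = z0, 0.0
    for f, f_log_det in flows:
        log_det += f_log_det(z)  # evaluated at the *input* of flow i
        z = f(z)
    return z, log_det

d = 3
flows = [
    # elementwise z -> a*z + b has log|det| = d * log|a|
    (lambda z: 2.0 * z + 1.0, lambda z: d * np.log(2.0)),
    (lambda z: 0.5 * z - 1.0, lambda z: d * np.log(0.5)),
]
zk, log_det = apply_flows(np.zeros(d), flows)
assert np.allclose(zk, -0.5)     # f2(f1(0)) = 0.5 * 1 - 1
assert np.isclose(log_det, 0.0)  # the two scalings cancel exactly
```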
Agnostic to the flow choice, we condition on the source sentence via h_mean^x.1 We made this choice because in either model we can only condition on x at decode time.

During training, we optimize the ELBO, which we present again formulated more specifically for the machine translation case of the discriminative model, based on the derivation from Rezende and Mohamed [23], Section 4.2:

\mathbb{E}_{q_\phi(z_0 \mid x, y)}\left[\sum_{j=1}^{p} \log p_\theta(y_j \mid z_k, x, y_{<j})\right] - \mathrm{KL}\left(q_\phi(z_0 \mid x, y) \,\|\, p_\theta(z_k \mid x)\right) + \mathbb{E}_{q_\phi(z_0 \mid x, y)}\left[\sum_{k=1}^{K} \log\left|\det\frac{\delta f_k}{\delta z_{k-1}}\right|\right]. (4.2)

This simply introduces maximizing the sum of log absolute Jacobians from our flows.2 Unfortunately, with normalizing flows we cannot analytically derive the KL divergence, and instead we must perform Monte Carlo sampling to optimize this objective. When evaluating translation quality, we set the latent variable to the expected value of the Gaussian distribution, z_0 = \mu_\theta(x), and apply our flows to this value.

Specific to our discriminative LVNMT system, we choose to share the flow parameters between the prior and the variational distribution. At decoding time, we replace the variational distribution with the prior, which has been optimized to generate distributions similar to the variational posterior. Learning separate flows for each distribution would otherwise be unnecessary computational overhead, as the two base distributions should ideally match each other. This also changes the above equation to be quite similar to the original ELBO without normalizing flows, with the exception that we use z_k instead of z_0.

1 As a reminder, this is the averaged hidden representation of the source sentence produced by our inference network.
2 In the joint distribution case, there would also be the inclusion of optimizing a language model.
Note, however, that there are a variety of choices, as research at the time of writing is quite active. The Jacobian is different for each flow we consider, offering alternative influences on the training procedure for our LVNMT architectures.

Planar Flows

Planar flows were proposed in the work of Rezende and Mohamed [23]. They can be viewed as a scale-shift operation of the following form:

f_i(z) = z + u_i h(w_i^T z + b_i). (4.3)

Here, u_i, w_i \in R^d and b_i \in R are the parameters of planar flow i, and h is a non-linear activation. For our experiments we use tanh, but the authors note that alternative activations are permissible. A convenient aspect of these flows is that they provide an analytical term for the Jacobian:

\left|\det\left(\frac{\delta f_i}{\delta z}\right)\right| = \left|1 + u_i^T h'(w_i^T z + b_i) w_i\right|. (4.4)

Here, h' is the derivative of the nonlinear activation. For clarity, the second term on the right-hand side of the equation is a dot product. Intuitively, this transformation can be seen as contracting or expanding the samples z along a single plane in space. This is a simple transformation and requires many planar flows to represent complicated distributions.

As previously mentioned, the parameters of flows are generally made to be data dependent. In the case of planar flows this is achieved by utilizing a hyper-network [10] which outputs the parameters of the flow:

\mathrm{MLP}(h_x^{\mathrm{mean}}) = (u, w, b). (4.5)

In our experiments each of our flows has a separate hyper-network with a single hidden layer with tanh activations, to give the flows sufficient flexibility to transform distributions.

Inverse Autoregressive Flows

Inverse autoregressive flows were proposed by Kingma et al. [15] to enable parallel sampling by defining an invertible function as a sequentially dependent inverse scale-and-shift operation over the dimensions of a random variable:

z_{k+1}^i = \frac{z_k^i - \mu(z_k^{1:i-1}, h)}{\sigma(z_k^{1:i-1}, h)}. (4.6)

Here, z^i is dimension i of the vector latent variable z, and \sigma(\cdot) and \mu(\cdot) are the outputs of an auto-regressive network as proposed in the work of Germain et al. [7]. The vector h is referred to as the context input, for which we use h = h_x^{mean}. It represents the data conditioning of the normalizing flow; otherwise the parameters of the auto-regressive network are shared between sentence pairs. Defining a normalizing flow in this way provides a lower triangular Jacobian, which means the absolute Jacobian determinant is the product of the diagonal terms:

\left|\det\frac{\delta f}{\delta z}\right| = \prod_{i} \frac{1}{\sigma(z_k^{1:i-1}, h_x^{\mathrm{mean}})}. (4.7)

Regularization Tricks

An often-cited challenge of including latent variables in auto-regressive models is posterior collapse [11]. In order to maximize the ELBO, the variational distribution parameters, for all the training data, are pushed to more closely match the prior. This typically leads to uninformative latent variables when choosing the Gaussian as the prior. Part of this behaviour has been attributed to strong decoders, like auto-regressive models, which are flexible enough to model the output even while ignoring z. We recommend Chen et al. [3] or Zhao et al. [43], who provide more thorough discussions of the subject. For this work, we address this potential problem with previously proposed approaches referred to as KL-annealing [2, 30] and word dropout [2].

KL-annealing typically involves annealing the weight \beta of the divergence term in the ELBO:

\mathbb{E}_{q_\phi(z_0 \mid x, y)}\left[\sum_{j=1}^{p} \log p_\theta(y_j \mid z_k, x, y_{<j})\right] - \beta\, \mathrm{KL}\left(q_\phi(z_0 \mid x, y) \,\|\, p_\theta(z_k \mid x)\right) + \beta\, \mathbb{E}_{q_\phi(z_0 \mid x, y)}\left[\sum_{k=1}^{K} \log\left|\det\frac{\delta f_k}{\delta z_{k-1}}\right|\right]. (4.8)

In normalizing flows research, it is often reported that KL-annealing helps improve performance even without strong decoders [15, 23, 35, 38, 44].
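The linear annealing variant of Equation 4.8 that we adopt simply ramps beta from 0 to 1 over a fixed number of mini-batch updates (80,000 in our experiments); a sketch:

```python
def kl_anneal_beta(step, total_steps=80_000):
    """Linear KL-annealing: beta grows from 0 to 1, then stays at 1."""
    return min(1.0, step / total_steps)

betas = [kl_anneal_beta(s) for s in (0, 40_000, 80_000, 120_000)]
assert betas == [0.0, 0.5, 1.0, 1.0]
```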
In our experiments, we use a linear schedule which increases the weight of our regularization terms after each mini-batch update.

Word dropout is a procedure applied at training time where, with some probability \rho, the current word embedding for x_i is replaced with the word embedding of the unknown token.3 The intuition for its effectiveness is that it encourages the model to depend more on the latent variable for information when decoding. This approach has been particularly important for improving the performance of generative LVNMT models [6, 28].

3 The unknown token is used to handle words not included in the vocabulary during training.

Chapter 5

Experiments

In this chapter we discuss our experiments to evaluate the performance of LVNMT models with and without normalizing flows. We consider settings to empirically evaluate the impact of latent variables on LVNMT systems with attention and language models. We include hyperparameter values in supplementary material Table A.1, basing them mostly on Eikema and Aziz [6]. These hyperparameters were picked primarily with the generative LVNMT system in mind, not the discriminative model. However, our main interest is not in the exact gains between models, and empirically these hyperparameters were sufficient. As part of this work, we release our code.1

To help prevent posterior collapse, we perform KL-annealing linearly over the first 80,000 mini-batch updates. We choose a word-dropout rate of 0.1 for both models. We note that previous work suggested word dropout is not necessarily helpful in the discriminative NMT case [28].

We conducted our experiments with the IWSLT 2016 data sets available in the torchtext library.2 We evaluated our models on the German–English (De–En) language pair. We chose this language direction because we could more naturally evaluate the translations.3 This dataset consists of 233,213 training, 2052 validation, and 9773 test sentences.
We measure performance with the raw BLEU score implementation available in sacreBLEU [22]. For all experiments, we keep the random seed fixed to a single value to control for the stochastic components of our models.

1 Experiment code: https://github.com/gamerDecathlete/NormalizingFlowsNMT
2 https://github.com/pytorch/text
3 We did not do any formal human evaluation of translation quality.

We represented our vocabulary for each language with byte-pair encoding (BPE) [27]. We used vocabularies of 10,000 BPEs per language. We found that larger vocabularies resulted in sub-word units that occurred infrequently enough to be uninformative from a practical learning perspective. We performed BPE using the SentencePiece library.4 We trained our NMT models on sequences of maximum length 50. We used beam search with a beam width of 10, and length normalization set to 1.0. Throughout our experiments we will refer to our discriminative models as variational neural machine translation (VNMT) and our joint system as generative neural machine translation (GNMT).

General Translation

In this section we report our results when including normalizing flows on top of LVNMT systems. Our hypothesis was that, given the success of normalizing flows in computer vision, similar gains can be achieved by including normalizing flows in previously considered LVNMT systems.

Our baselines include the LVNMT models we described in Chapter 3 with just the diagonal Gaussian for the variational distribution. Our baselines optimize the ELBO with an equal number of Monte Carlo (MC) samples as our normalizing flows models to provide a fair comparison. This corresponds to better approximations of the negative log-likelihood of sequence predictions. The KL divergence is analytic in the Gaussian case, which does not require sampling [14, 24]. Although we do not report the numbers, we did find that simply increasing the number of MC samples improved translation quality on the validation set.

Table 5.1 shows the BLEU score of our results on the test set.
The best model was picked based on validation BLEU score from checkpoints after every epoch of training. Each model was trained for 47 epochs.5 The boldfaced results represent the best performing version of a model for the given latent dimensions and flow type.

4 https://github.com/google/sentencepiece
5 An epoch here refers to training on all mini-batches before reshuffling the sentence pairs.

Table 5.1: BLEU score for our models with normalizing flows for German-English (De–En) translation. The best performances are in bold as compared to differing numbers of flows and the baseline. Yellow rows represent our VNMT model results, and red our GNMT results, for differing number and type of flows.

Latent Dimension: 128
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar    18.84  18.84  19.14  19.02  19.18
IAF       18.89  18.96  19.29  18.81  18.92     18.76        VNMT
Planar    20.59  20.60  20.48  20.55  20.66
IAF       20.64  20.64  20.51  20.65  20.50     20.73        GNMT

Latent Dimension: 256
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar    18.93  19.26  19.02  18.80  18.82
IAF       18.95  19.17  19.02  18.99  18.77     18.76        VNMT
Planar    20.55  20.67  20.54  20.51  20.67
IAF       20.85  20.86  20.64  20.62  20.66     20.66        GNMT

Regardless of flow type, we found our generative model to perform better for any number of flows or latent space size. Even our worst performing GNMT model (z=128, with 4 planar flows) provides at least 1.20 BLEU above the best performing discriminative model (z=128, with 4 inverse autoregressive flows (IAF)). This result seems congruent with previous research suggesting joint modelling can be more effective than the simpler discriminative representation [6].

We found overall that VNMT benefited more from the inclusion of normalizing flows than GNMT. For VNMT, even a single normalizing flow results in improvements. Between flows, we do not note any clear distinctions for the number or type of flow in the VNMT case, although the best performance included more than just one flow.
In contrast, GNMT only saw performance gains with a latent space of 256, and even then only a few of the flow models outperformed the baseline.

Overall, these results suggest that flows can indeed be added to existing models and provide benefit to the final translation quality. In several instances our flow models outperform baseline results. Here, it seems the input-feeding approach of VNMT may benefit more than the initialization approach in our GNMT model.

Importance of Attention

In this section, to investigate the utility of including latent variables, we train simplified versions of our LVNMT systems which do not include the attention mechanism. Our motivation for this experiment is to tease apart the impact of our latent variable as the only additional information available to the decoding process. We do not expect these systems to outperform the models with attention, given attention's success in SOTA models [1, 39]. However, we hypothesize that if the latent variable can encode useful information for translation, then NMT models will still benefit from this global latent information. As an extension, normalizing flows should then make these variables more beneficial by making the latent variable distribution more flexible.

We include the latent variable in the generator network as a substitute for attention, mirroring the design choice of Bahdanau et al. [1] for including attention. In our previous experiment we did not consider this, as the focus was on incorporating normalizing flows into variations of existing models. We compare our modified discriminative model against a version of Bahdanau et al. [1] with attention simply removed. In the joint modelling case, we compare to a baseline similar to the joint model of Eikema and Aziz [6], except without attention. Their model optimizes the language model and translation systems separately, except for sharing the source language word embeddings. All other hyperparameters are the same.
In our tables these two modified models are the No Latent (NL) baselines we compare our latent variable models against. Table 5.2 shows results for our modified models and the baselines when evaluated with BLEU score. The best performances are bold. They were chosen by comparing models with differing numbers of the same flow, and the baselines.

Table 5.2: Results for translation systems without the attention mechanism. Baselines include models with normalizing flows and a deterministic version of each model excluding latent variables. VNMT results are in yellow, and GNMT results are in red. Best models across number and type of flow are in bold. No Latent (NL) are models trained without z included.

Latent Dimension: 128
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    6.61  6.64  6.63  6.93  6.78
IAF       6.42  6.32  6.28  5.98  5.92      6.25          6.38      VNMT
Planar    7.2   7.22  7.19  7.11  7.25
IAF       7.31  7.37  7.33  7.18  7.41      7.37          7.16      GNMT

Latent Dimension: 256
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    6.31  6.81  6.91  7.37  7.38
IAF       6.45  6.47  6.4   6.23  6.71      6.43          6.38      VNMT
Planar    7.4   7.22  7.33  7.26  7.07
IAF       7.35  7.34  7.28  7.04  7.36      7.51          7.16      GNMT

Considering only the latent dimension size, we find an increase in performance simply by doubling the dimensions of the latent variable. This would suggest that bigger latent spaces become more important when the model depends more on the latent variable. The combination of planar flows and VNMT seems to benefit most from the latent variable, achieving performance close to our best GNMT baseline. Unfortunately, the IAF VNMT models did not improve as much. There could be several possible reasons. One interpretation we suspect is related to the data conditioning approach of planar flows compared to IAF. As our planar flows provide per-sentence flow parameters, this enables them to provide more unique distributions per sentence. In comparison, the IAF is unable to capture this more nuanced information by simply conditioning on a context vector.
This argument has been pointed out in previous research for other types of flow [38].

Unfortunately, our GNMT model did not benefit from the inclusion of normalizing flows and was actually hindered in performance for most choices of flows. We reason this is due to the formulation of GNMT. The latent variable initializes the translation system beginning in the encoder network, as compared to being passed as input during decoding. Based on our results, it seems this design is less effective at utilizing latent variables compared to the input-feeding approach of VNMT. We conjecture that, without explicitly passing z to the decoder, GNMT relies on the GRUs to implicitly carry this information from beginning to end.

Understanding the Latent Variable

In previous works, the utility of latent variables is typically justified by training a similar system without the latent variable included, or by optimizing the model differently. In cases where the prior is stationary, authors then typically report the KL divergence to justify the latent variable's usage. This metric is inapplicable to VNMT, where the prior is learned. In the ideal VNMT scenario, a KL divergence of 0 suggests both distributions encode the same information.

As an alternative approach to measure the value of our latent variable during translation, we set z to the 0 vector. We measure the difference in BLEU score with and without z during the decoding process. This can give us direct insight into the importance of z when translating sentences, whether the prior is learned or not. Table 5.3 provides the KL divergence, and Table 5.4 shows the difference in BLEU score when z is the 0 vector for our models with attention.

Interestingly, for many of our VNMT models with planar flows we see that including z often negatively impacts performance of the translation system. This could suggest that the latent variable itself is not helping the translation, but improves representations in the encoder or source word embeddings.
Another explanation is that our substitution of p_\theta for q_\phi at decode time provides information that is too dissimilar to the variational distribution. This is plausible given that in many cases the KL divergence is non-zero, yet in several instances a lower KL divergence still leads to a loss of performance with z (see VNMT, z=256 results for 8 flows vs. 16 flows). These substitution approaches have also previously been shown to provide minimal or mixed results in terms of comparative performance [6].

The exception to the above observations, of course, is our VNMT with IAF models in higher dimensions, which seem to depend heavily on the latent variable. One possible reason for this has to do with the amortization of IAF flows. As these flows are composed of neural networks shared across data points, they can more easily be viewed as additional layers in the network.6 This could also be a more volatile result which occurs due to different choices of initialization. It has been noted that KL annealing schemes can be affected by initialization in terms of final performance [41].

6 In a similar setting, one bug we found in our implementation excluded the projection layer, and we observed performance drops similar to the results without IAF flows.

Table 5.3: Average KL divergence for the test set. For VNMT (yellow) the KL term should be smaller, meaning the distributions encode similar information.
For GNMT (red) the KL terms should be higher, as this suggests more informative latent spaces.

Latent Dimension: 128 (KL Divergence)
Flows      1     2     4     8     16   0 (Baseline)   Model
Planar    2.16  1.27  1.18  1.04  1.63
IAF       0.68  1.35  1.32  1.6   1.06      1.04       VNMT
Planar    3.23  3.15  2.36  2.82  3.64
IAF       4.31  4.36  4.19  4.43  4.14      4.39       GNMT

Latent Dimension: 256 (KL Divergence)
Flows      1     2     4     8     16   0 (Baseline)   Model
Planar    0.92  2.35  2.37  1.11  1.27
IAF       2.41  1.94  1.14  1.1   1.1       2.0        VNMT
Planar    4.34  3.93  3.36  2.75  3.62
IAF       3.87  3.74  4.0   3.99  3.98      3.84       GNMT

In comparison, GNMT generally shows at least minute dependence on the latent variable regardless of the number of flows. There do not seem to be any clear patterns between choices about flows and the information encoded by z. Likewise, a higher KL divergence does not necessarily correspond to more utility in translation performance.

When we perform the same experiment with our LVNMT systems without attention, we see more dependence on the latent variable for all models. These results are in Tables 5.5 and 5.6. In most cases with planar flows, we see our models depend more on the latent variable than the baseline without planar flows. Our IAF flow models are more mixed in terms of performance changes; in GNMT, our models depend more on the flows with a smaller latent dimension space as opposed to a larger latent space.

Language Modelling Performance

For most of our previous experiments, we have found that GNMT largely shows only slight gains from normalizing flows.
As previously discussed in Chapter 3, GNMT trains a language model as part of the system.

Table 5.4: Change in BLEU score when z is set to the 0 vector at decode time. Negative numbers indicate our models do better without z included during translation.

Latent Dimension: 128 (BLEU Difference)
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar    -0.29  -0.37  -0.06  -0.14   0.1
IAF       18.64  12.17  15.52  17.4   17.56      0.04        VNMT
Planar     0.14   0.08   0.02   0.06   0.1
IAF        0.12   0.08  -0.01   0.1   -0.01      0.08        GNMT

Latent Dimension: 256 (BLEU Difference)
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar     0.02  -0.07  -0.03  -0.24   0.16
IAF       12.22  17.27  18.5   10.84  17.42     -0.11        VNMT
Planar     0.1    0      0.11   0.15   0.04
IAF        0.04   0.09   0.04   0.06   0.19      0.09        GNMT

Table 5.5: Average KL divergence for the test set for models without attention. VNMT should generally be lower, whereas GNMT should generally have non-zero KL terms.

Latent Dimension: 128 (KL Divergence)
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    0.22  0.56  0.76  0.39  0.05
IAF       0.59  0.39  0.2   0.18  0.23      0.6           N/A       VNMT
Planar    4.28  4.48  3.84  4.12  4.58
IAF       3.38  2.97  3.89  4.57  4.53      2.23          N/A       GNMT

Latent Dimension: 256 (KL Divergence)
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    0.35  0.04  0.3   0.16  0.07
IAF       0.74  0.08  0.7   0.25  0.22      0.72          N/A       VNMT
Planar    4.1   4.26  3.57  3.54  3.65
IAF       3.99  3.19  3.99  3.12  3.45      4.19          N/A       GNMT

Table 5.6: Change in BLEU score for models without attention. Positive values indicate the latent variable z is important for the translation system. All models seem to depend on z when attention is unavailable.

Latent Dimension: 128 (BLEU Difference)
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    3.02  3.08  2.73  3.51  3.5
IAF       2.33  1.75  1.82  1.59  0.88      0.6           N/A       VNMT
Planar    0.91  0.93  0.7   0.49  0.72
IAF       0.88  0.59  0.88  0.79  1.1       0.46          N/A       GNMT

Latent Dimension: 256 (BLEU Difference)
Flows      1     2     4     8     16   0 (Baseline)  NL (Baseline)  Model
Planar    3.15  4.09  4.51  4.88  4.98
IAF       3.03  3.79  3.98  2.29  3.09      1.79          N/A       VNMT
Planar    0.76  0.98  0.29  0.48  0.51
IAF       0.66  0.5   0.63  0.41  0.49      0.68          N/A       GNMT
In this section, we train GNMT models that do not optimize this language model as part of the training procedure. The goal of this experiment is to test the impact of normalizing flows in the absence of the language model for GNMT.

Table 5.7 shows our results with attention but no language model (in blue) compared against our GNMT results from Table 5.1. Our results show similar performance to training our GNMT model with the language model. At a 128-dimensional latent space our baseline still outperforms all flow models, as well as the GNMT models trained with a language model. At a 256-dimensional latent space, we see results close to our language model training, but with slight decreases in performance. Given how close these numbers are, these results suggest that z is not the key factor for performance gains in our translation system for GNMT. In addition, when we measure the KL divergence (see Table 5.9) we find that the latent variable has largely collapsed to the prior in many cases. Measurements of the change in BLEU score are in supplementary material Tables A.4, A.5, and A.6, as these show largely the same behaviour.

Table 5.7: BLEU scores for GNMT without language model training. Bold entries are the best performing models. We include previous GNMT results in red to compare with GNMT without language model training (blue).

Latent Dimension: 128
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar    20.46  20.42  20.5   20.65  20.41
IAF       20.48  20.45  20.71  20.45  20.38     20.77        GNMT (No LM)
Planar    20.59  20.60  20.48  20.55  20.66
IAF       20.64  20.64  20.51  20.65  20.50     20.73        GNMT

Latent Dimension: 256
Flows       1      2      4      8      16    0 (Baseline)   Model
Planar    20.32  20.36  20.47  20.21  20.65
IAF       20.56  20.64  20.51  20.54  20.41     20.59        GNMT (No LM)
Planar    20.55  20.67  20.54  20.51  20.67
IAF       20.85  20.86  20.64  20.62  20.66     20.66        GNMT

Table 5.8 shows results of GNMT with no language model training along with attention removed. Here, we see that our flow models show some improvement over baselines, unlike the previous GNMT case.
The language modelling seems to be more important, as GNMT does outperform our GNMT without language modelling in several cases. Overall, though, the best performance in this table comes from GNMT without language modelling and a single planar flow. We did observe posterior collapse (included in the supplementary material), which might suggest these gains are not directly because of the latent variable z. They could simply be a by-product of the stochasticity injected by the training procedure. It has been previously suggested that the stochastic behaviour injected by latent variables does help with training performance [42].

Table 5.8: GNMT training results without attention or language model training. In several instances, GNMT without language model training outperforms models which optimize the language model.

Latent Dimension: 128
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     7.23  7.03  7.25  7.32  7.27
IAF        7.5   7.42  7.37  7.31  7.16   7.31          GNMT (No LM)
Planar     7.2   7.22  7.19  7.11  7.25
IAF        7.31  7.37  7.33  7.18  7.41   7.36          GNMT

Latent Dimension: 256
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     7.58  7.45  7.28  7.44  7.33
IAF        7.25  7.35  7.32  7.08  7.47   7.21          GNMT (No LM)
Planar     7.4   7.22  7.33  7.26  7.07
IAF        7.35  7.34  7.28  7.04  7.36   7.51          GNMT

Table 5.9: KL divergence of GNMT models without language model training. Typically, a near-0 KL divergence with stationary priors indicates posterior collapse has occurred.

Latent Dimension: 128 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.01  0.0   0.0   0.01  0.01
IAF        0.0   0.0   0.0   0.01  0.01   0.0           GNMT (No LM)
Planar     3.23  3.15  2.36  2.82  3.64
IAF        4.31  4.36  4.19  4.43  4.14   4.39          GNMT

Latent Dimension: 256 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.01  0.01  0.01  0.01  0.01
IAF        0.0   0.0   0.0   0.0   0.01   0.0           GNMT (No LM)
Planar     4.34  3.93  3.36  2.75  3.62
IAF        3.87  3.74  4.0   3.99  3.98   3.84          GNMT

Chapter 6

Conclusion

In this work we considered the inclusion of normalizing flows in existing LVNMT models.
Whether with or without attention, our VNMT model seemed to benefit from the inclusion of normalizing flows. Particularly when attention was removed, our probing of the latent variable found that models depended more on z for translation quality. In contrast, GNMT did not benefit as much in any setting in which we included normalizing flows. Although we saw more dependence on the latent z when removing attention for GNMT, our baseline models often had better final translation performance. One explanation came to light when we removed the optimization of the language model, in which case our GNMT model completely ignored the latent z for translation. This might suggest that the initialization approach to including z in our GNMT implementation is less effective than the input-feeding approach of VNMT for enabling normalizing flows to improve translation performance.

We caution against over-interpreting our results given the breadth of hyper-parameters to consider, such as the amount of word dropout, the KL-annealing schedule, or even the number of MC samples used for approximating the ELBO. Given these other factors, it can be difficult to fully credit normalizing flows themselves for performance gains. Understanding how different annealing schedules might impact normalizing flows is itself a promising future research direction, particularly given that KL annealing is an active area of research [11, 41].

Despite these considerations, we believe normalizing flows have much future potential in machine translation systems, particularly given previous success in non-autoregressive translation [19]. However, normalizing flows are not something that can be added for a guaranteed performance benefit. To properly benefit from normalizing flows, additional considerations must be taken into account. One potential future work is extending our findings to sequential latent variable NMT systems.
These have previously been introduced to help with longer sentence translation as well as with diversifying translations. Another direction could be joint modelling with flow-based models to generate synthetic sentence-paired data. Synthetically generated data has been studied in machine translation as one approach to improve existing systems' performance [26]. Most recently, discrete flows have been proposed, showing improvement on existing flow-based language modelling results [36]. There, the authors cite vocabulary size as a limiting factor, but given our results with a small BPE vocabulary it may be plausible to still utilize these flows effectively for translation.

Bibliography

[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473. → pages 4, 11, 12, 13, 27

[2] S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/K16-1002. URL https://www.aclweb.org/anthology/K16-1002. → page 22

[3] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=BysvGP5ee. → page 22

[4] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics. doi:10.3115/v1/D14-1179. URL https://www.aclweb.org/anthology/D14-1179. → page 5

[5] S. Edunov, M. Ott, M. Auli, and D. Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi:10.18653/v1/D18-1045. URL https://www.aclweb.org/anthology/D18-1045. → page 1

[6] B. Eikema and W. Aziz. Auto-encoding variational neural machine translation. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 124–141, Florence, Italy, Aug. 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-4315. URL https://www.aclweb.org/anthology/W19-4315. → pages xi, 1, 2, 14, 15, 17, 23, 24, 26, 27, 29

[7] M. Germain, K. Gregor, I. Murray, and H. Larochelle. MADE: Masked autoencoder for distribution estimation. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 881–889, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/germain15.html. → page 22

[8] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1319–1327, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/goodfellow13.html. → page 13

[9] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. 2011. → page 5

[10] D. Ha, A. Dai, and Q. Le. Hypernetworks. 2016. → pages 9, 21

[11] J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick.
Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rylDfnCqF7. → pages 22, 35

[12] E. Hoogeboom, J. W. T. Peters, R. van den Berg, and M. Welling. Integer discrete flows and lossless compression. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 12134–12144, 2019. URL http://papers.nips.cc/paper/9383-integer-discrete-flows-and-lossless-compression. → pages 2, 10

[13] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10215–10224. Curran Associates, Inc., 2018. → page 2

[14] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114. → pages 1, 8, 25

[15] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4743–4751, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. → pages 2, 10, 19, 22, 23

[16] P. Koehn. Statistical Machine Translation. Cambridge University Press, New York, NY, USA, 1st edition, 2010. ISBN 0521874157, 9780521874151. → pages 2, 4

[17] P. Koehn. Neural machine translation. CoRR, abs/1709.07809, 2017. → pages 4, 6, 13

[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira.
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558607781. → page 6

[19] X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow, 09 2019. → pages 3, 36

[20] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2338–2347. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6828-masked-autoregressive-flow-for-density-estimation.pdf. → page 10

[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135. → page 4

[22] M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/W18-6319. → page 24

[23] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/rezende15.html. → pages 2, 9, 10, 19, 20, 21, 23

[24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T.
Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Bejing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/rezende14.html. → pages 1, 8, 25

[25] P. Schulz, W. Aziz, and T. Cohn. A stochastic decoder for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1243–1252, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P18-1115. → pages 1, 2, 44

[26] R. Sennrich, B. Haddow, and A. Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/P16-1009. URL https://www.aclweb.org/anthology/P16-1009. → pages 1, 36

[27] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi:10.18653/v1/P16-1162. URL https://www.aclweb.org/anthology/P16-1162. → page 25

[28] H. Shah and D. Barber. Generative neural machine translation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1346–1355. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7409-generative-neural-machine-translation.pdf. → pages 1, 2, 17, 23, 24

[29] T. Shen, M. Ott, M. Auli, and M. Ranzato. Diverse machine translation with a single multinomial latent variable, 2019. URL https://openreview.net/forum?id=BJgnmhA5KQ. → page 1

[30] C. K. Sønderby, T. Raiko, L. Maaløe, S. K.
Sønderby, and O. Winther. Ladder variational autoencoders. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3738–3746. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6275-ladder-variational-autoencoders.pdf. → page 22

[31] J. Su, S. Wu, D. Xiong, Y. Lu, X. Han, and B. Zhang. Variational recurrent neural machine translation. In S. A. McIlraith and K. Q. Weinberger, editors, AAAI, pages 5488–5495. AAAI Press, 2018. URL http://dblp.uni-trier.de/db/conf/aaai/aaai2018.html#SuWXLHZ18. → page 1

[32] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf. → pages xi, 7

[33] E. Tabak and E. Vanden Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010. ISSN 1539-6746. → page 2

[34] E. G. Tabak and C. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013. doi:10.1002/cpa.21423. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21423. → page 2

[35] J. M. Tomczak and M. Welling. Improving variational auto-encoders using householder flow. CoRR, abs/1611.09630, 2016. URL http://arxiv.org/abs/1611.09630. → pages 2, 9, 10, 19, 23

[36] D. Tran, K. Vafa, K. K. Agrawal, L. Dinh, and B. Poole. Discrete flows: Invertible generative models of discrete data. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R.
Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 14692–14701, 2019. URL http://papers.nips.cc/paper/9612-discrete-flows-invertible-generative-models-of-discrete-data. → pages 2, 10, 36

[37] R. van den Berg, L. Hasenclever, J. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2018. → pages 9, 10

[38] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester normalizing flows for variational inference. In UAI, 2018. → pages 2, 19, 23, 28

[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. pages 5998–6008, 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. → pages 4, 11, 27

[40] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 2773–2781, Cambridge, MA, USA, 2015. MIT Press. → page 6

[41] J. Xu and G. Durrett. Spherical latent spaces for stable variational autoencoders. 2018. → pages 29, 35

[42] B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1050. URL https://www.aclweb.org/anthology/D16-1050. → pages xi, 1, 2, 14, 15, 33

[43] S. Zhao, J. Song, and S. Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. pages 5885–5892, 2019. doi:10.1609/aaai.v33i01.33015885. URL https://doi.org/10.1609/aaai.v33i01.33015885. → page 22

[44] Z. Ziegler and A. Rush.
Latent normalizing flows for discrete sequences. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7673–7682, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/ziegler19a.html. → pages 2, 3, 23

Appendix A

Supporting Materials

Discussion on other latent dimensions

In this section we include our results when the latent space was set to 2. These results are available in Table A.3. We originally considered such a small latent space for plotting purposes, but found it to be quite effective. Our findings would seem congruent with previous results on the value of smaller latent spaces [25].¹ Interestingly, our conclusions about flow complexity are reversed for this latent space, where mostly planar flows perform better than IAF. This would suggest that for especially small latent spaces, flexible transformations are more of a hindrance than a benefit to translation performance. We chose to exclude this result from the main thesis because, to our knowledge, there is no precedent for such performance gains, and further future work is necessary.

We also considered the same drop-in-performance scenario discussed in the main thesis. The KL divergence results are in Table A.4. The BLEU differences are in Table A.5. Here, we see similar trends as in the original models, where performance drops are quite small with the latent variable z. For most VNMT models we again see that the latent variable seems to negatively impact performance with z = 2.

Additionally, we visualized the latent space of several sentence pairs, as the latent dimensions were quite small. Our VNMT results are available in Figures A.3 and A.4. Our GNMT results are available in Figures A.1 and A.2.
¹ We do not know if other researchers have also considered such small latent spaces.

Table A.1: List of hyperparameters used for experiments

Optimization Parameters
  Optimizer: Adam
  Learning Rate: 0.0003
  KL Annealing Schedule: 80,000 steps
  Word Dropout Rate: 0.1
  Clip Norm: 1.0
  Mini Batch Size: 64
  Number of Samples (ELBO): 10

Model Parameters
  Source Embedding Size: 256
  Target Embedding Size: 256
  Encoder Hidden Dimensions: 256
  Number of Encoder Layers: 1
  Decoder Hidden Dimensions: 256
  Number of Decoder Layers: 1
  Dropout: 0.5
  Z dim (latent variable): 2, 128, 256

Global Attention Mechanism
  Key Size: 512
  Query Size: 256

IAF Details
  Autoregressive NN hidden sizes: 320, 320

Planar Flows Details
  Hidden layer dimensions: 150

Table A.2: IWSLT sentence counts for the De–En language pair. Counts represent the actual number of sentences we use in our analysis when limiting the maximum sentence length to 50. Values in parentheses represent full sentence counts of each dataset.

  Train: 182,999 (233,213)
  Dev:   1,873 (2,052)
  Test:  9,101 (9,773)

Table A.3: BLEU scores for our models with normalizing flows for De–En translation. The best performing models are in bold in the original for each type and number of flows. This table is similar to Table 5.1 in the main thesis, but includes results for the latent dimension set to 2.

Latent Dimension: 2
Flows       1      2      4       8      16     0 (Baseline)  Model
Planar     19.06  18.95  18.94   19.01  19.05
IAF        18.89  18.48  18.803  18.96  18.67   18.92         VNMT
Planar     20.77  20.70  20.81   20.86  20.37
IAF        20.61  20.52  20.58   20.55  20.46   21.02         GNMT

Latent Dimension: 128
Flows       1      2      4      8      16     0 (Baseline)  Model
Planar     18.84  18.84  19.14  19.02  19.18
IAF        18.89  18.96  19.29  18.81  18.92   18.768        VNMT
Planar     20.59  20.60  20.48  20.55  20.66
IAF        20.64  20.64  20.51  20.65  20.50   20.73         GNMT

Latent Dimension: 256
Flows       1      2      4      8      16     0 (Baseline)  Model
Planar     18.93  19.26  19.02  18.80  18.82
IAF        18.95  19.17  19.02  18.99  18.77   18.76         VNMT
Planar     20.55  20.67  20.54  20.51  20.67
IAF        20.85  20.86  20.64  20.62  20.66   20.66         GNMT

These plots were generated by first encoding each sentence pair to produce the variational posterior parameters.
With these parameters, we generated 10,000 samples from the distributions. Each plot is then a Gaussian kernel density estimate of the samples after being transformed by K normalizing flows, with the base distribution on the far left and k = 8 flows on the far right. These plots largely show us the expected behaviours of the transformations. In the planar flow cases we mostly see the distribution being shifted around in the latent space but otherwise remaining uni-modal. In the IAF cases we see, interestingly, that the normalizing flows are collapsing our 2D latent spaces into 1D. This may be an indication that the stochastic nature of the latent variable is actually negatively impacting the training, as the IAF flows attempt to flatten the distribution.

Table A.4: Average KL divergence for the test set. For VNMT the KL term should be smaller, meaning the distributions encode similar information. For GNMT it should be higher, as this suggests more informative latent spaces.

Latent Dimension: 2 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.94  1.45  0.68  0.0   0.0
IAF        1.15  1.56  1.4   0.85  1.56   1.15          VNMT
Planar     2.48  2.32  2.01  0.10  3.64
IAF        0.0   2.36  3.91  2.09  2.63   2.99          GNMT

Latent Dimension: 128 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     2.16  1.27  1.18  1.04  1.63
IAF        0.68  1.35  1.32  1.6   1.06   1.04          VNMT
Planar     3.23  3.15  2.36  2.82  3.64
IAF        4.31  4.36  4.19  4.43  4.14   4.39          GNMT

Latent Dimension: 256 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.92  2.35  2.37  1.11  1.27
IAF        2.41  1.94  1.14  1.1   1.1    2.0           VNMT
Planar     4.34  3.93  3.36  2.75  3.62
IAF        3.87  3.74  4.0   3.99  3.98   3.84          GNMT

Table A.5: Change in BLEU score when z is set to the 0 vector at decode time. Negative numbers indicate our models do better without z included during translation.

Latent Dimension: 2 (BLEU Difference)
Flows       1      2      4      8     16     0 (Baseline)  Model
Planar     0     -0.09  -0.06   0.04  0
IAF       -0.01   0.06  -0.1   -0.1  -0.11   -0.1           VNMT
Planar     0.25  -0.03   0.11   0.04  0.30
IAF        0      0.11   0.01   0.1  -0.02    0.24          GNMT

Latent Dimension: 128 (BLEU Difference)
Flows       1      2      4      8      16     0 (Baseline)  Model
Planar    -0.29  -0.37  -0.06  -0.14   0.1
IAF       18.64  12.17  15.52  17.4   17.56    0.04          VNMT
Planar     0.14   0.08   0.02   0.06   0.1
IAF        0.12   0.08  -0.01   0.1   -0.01    0.08          GNMT

Latent Dimension: 256 (BLEU Difference)
Flows       1      2      4      8      16     0 (Baseline)  Model
Planar     0.02  -0.07  -0.03  -0.24   0.16
IAF       12.22  17.27  18.5   10.84  17.42   -0.11          VNMT
Planar     0.1    0      0.11   0.15   0.04
IAF        0.04   0.09   0.04   0.06   0.19    0.09          GNMT

(De) Als ich in meinen 20ern war, hatte ich meine erste Psychotherapie-Patientin.
(En) When I was in my 20s, I saw my very first psychotherapy client.
(De) Ich war Doktorandin und studierte Klinische Psychologie in Berkeley.
(En) I was a Ph.D. student in clinical psychology at Berkeley.
(De) Sie war eine 26-jährige Frau namens Alex.
(En) She was a 26-year-old woman named Alex.
(De) Und als ich das hörte, war ich erleichtert.
(En) Now when I heard this, I was so relieved.

Figure A.1: One of our GNMT models with a 2-dimensional latent space and 8 planar flows. Each row is a sentence pair with an increasing number of flows applied, from 0 through 8. The ordering is 0 flows on the far left, and 8 flows on the far right.

(De) Als ich in meinen 20ern war, hatte ich meine erste Psychotherapie-Patientin.
(En) When I was in my 20s, I saw my very first psychotherapy client.
(De) Ich war Doktorandin und studierte Klinische Psychologie in Berkeley.
(En) I was a Ph.D. student in clinical psychology at Berkeley.
(De) Sie war eine 26-jährige Frau namens Alex.
(En) She was a 26-year-old woman named Alex.
(De) Und als ich das hörte, war ich erleichtert.
(En) Now when I heard this, I was so relieved.

Figure A.2: One of our 2-dimensional latent variable GNMT models trained with 8 IAF flows. Each row is a sentence pair with an increasing number of flows applied to samples from the base distribution, from 0 through 8. The ordering is 0 flows on the far left, and 8 flows on the
The ordering is 0 flows on the far left, and 8 flows on thefar right.50(De) Als ich in meinen 20ern war, hatte ich meine erste Psychotherapie-Patientin.(En) When I was in my 20s, I saw my very first psychotherapy client.(De) Ich war Doktorandin und studierte Klinische Psychologie in Berkeley.(En) I was a Ph.D. student in clinical psychology at Berkeley.(De) Sie war eine 26-jährige Frau namens Alex.(En) She was a 26-year-old woman named Alex.(De) Und als ich das hörte, war ich erleichtert.(En) Now when I heard this, I was so relieved.Figure A.3: Two dimensional latent variable VNMT with 8 planar flows. Cap-tions are the sentence pairs. Each row is a sentence pair with an increas-ing number of flows applied from 0 through 8. The ordering is 0 flowson the far left, and 8 flows on the far right.51(De) Als ich in meinen 20ern war, hatte ich meine erste Psychotherapie-Patientin.(En) When I was in my 20s, I saw my very first psychotherapy client.(De) Ich war Doktorandin und studierte Klinische Psychologie in Berkeley.(En) I was a Ph.D. student in clinical psychology at Berkeley.(De) Sie war eine 26-jährige Frau namens Alex.(En) She was a 26-year-old woman named Alex.(De) Und als ich das hörte, war ich erleichtert.(En) Now when I heard this, I was so relieved.Figure A.4: Visualization of several sentence pairs latent spaces with VNMTmodel and 8 IAF flows. Solid blue images mean the distribution hasbeen flattened into a line. Each row is a sentence pair with 0 flows onthe far left and 8 flows applied on the right.52Table A.6: BLEU score difference of GNMT with attention when languagemodel is not optimized during training. We include GNMT with languagemodel training for reference. 
The comparison shows that without the language model, GNMT does not incorporate information from the latent variable z at all.

Latent Dimension: 128 (BLEU Difference)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.0   0.0   0.0   0.0   0.0
IAF        0.0   0.0   0.0   0.0   0.0    0.0           GNMT (No LM)
Planar     0.14  0.08  0.02  0.06  0.1
IAF        0.12  0.08 -0.01  0.1  -0.01   0.08          GNMT

Latent Dimension: 256 (BLEU Difference)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.0   0.0   0.0   0.0   0.0
IAF        0.0   0.0   0.0   0.0   0.0    0.0           GNMT (No LM)
Planar     0.1   0     0.11  0.15  0.04
IAF        0.04  0.09  0.04  0.06  0.19   0.09          GNMT

Table A.7: KL divergence for GNMT models trained without attention or language model optimization. Results suggest posterior collapse has occurred, as the values are almost all near 0.

Latent Dimension: 128 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.0   0.0   0.0   0.01  0.01
IAF        0.0   0.0   0.01  0.0   0.0    0.0           GNMT (No LM)
Planar     4.28  4.48  3.84  4.12  4.58
IAF        3.38  2.97  3.89  4.57  4.53   2.23          GNMT

Latent Dimension: 256 (KL Divergence)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.01  0.01  0.01  0.01  0.02
IAF        0.01  0.01  0.01  0.01  0.01   0.0           GNMT (No LM)
Planar     4.1   4.26  3.57  3.54  3.65
IAF        3.99  3.19  3.99  3.12  3.45   4.19          GNMT

Table A.8: Measure of the performance drop when z is removed from the trained system, for GNMT without attention or language model optimization.

Latent Dimension: 128 (BLEU Difference)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.0   0.0   0.0   0.0   0.0
IAF        0.0   0.0   0.0   0.0   0.0    0.0           GNMT (No LM)
Planar     0.91  0.93  0.7   0.49  0.72
IAF        0.88  0.59  0.88  0.79  1.1    0.46          GNMT

Latent Dimension: 256 (BLEU Difference)
Flows       1     2     4     8     16    0 (Baseline)  Model
Planar     0.0   0.0   0.0   0.0   0.0
IAF        0.0   0.0   0.0   0.0   0.0    0.0           GNMT (No LM)
Planar     0.76  0.98  0.29  0.48  0.51
IAF        0.66  0.5   0.63  0.41  0.49   0.68          GNMT
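The visualization procedure behind Figures A.1 through A.4 (sample from the base distribution, push the samples through K flows, then density-estimate the result) can be sketched as below. The flow parameters here are toy values chosen to satisfy the planar-flow invertibility condition u·w ≥ -1; the thesis models learn them, and the final Gaussian kernel density estimate (e.g. scipy.stats.gaussian_kde) is replaced by a shape check to keep the sketch dependency-free:

```python
import math
import random

def planar_flow(z, u, w, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b); invertible when u.w >= -1."""
    a = math.tanh(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return [zi + ui * a for zi, ui in zip(z, u)]

random.seed(0)
# 10,000 samples from a 2-D standard-normal base distribution (in the real
# pipeline, the encoder's variational posterior parameters define this).
samples = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(10_000)]

# Push the samples through K = 8 planar flows (toy, shared parameters).
u, w, b = [0.9, 0.0], [1.0, 0.0], 0.0
for _ in range(8):
    samples = [planar_flow(z, u, w, b) for z in samples]

# Each panel in Figures A.1-A.4 is a Gaussian KDE of `samples` after k flows.
print(len(samples), len(samples[0]))  # 10000 2
```

As the plots above suggest, a planar step can only shift and warp mass along one direction per flow, which is consistent with the planar cases staying uni-modal while the more flexible IAF transformations can flatten the 2D density into a line.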