{"Affiliation":[{"label":"Affiliation","value":"Science, Faculty of","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","classmap":"vivo:EducationalProcess","property":"vivo:departmentOrSchool"},"iri":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","explain":"VIVO-ISF Ontology V1.6 Property; The department or school name within institution; Not intended to be an institution name."},{"label":"Affiliation","value":"Statistics, Department of","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","classmap":"vivo:EducationalProcess","property":"vivo:departmentOrSchool"},"iri":"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool","explain":"VIVO-ISF Ontology V1.6 Property; The department or school name within institution; Not intended to be an institution name."}],"AggregatedSourceRepository":[{"label":"Aggregated Source Repository","value":"DSpace","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","classmap":"ore:Aggregation","property":"edm:dataProvider"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider","explain":"A Europeana Data Model Property; The name or identifier of the organization who contributes data indirectly to an aggregation service (e.g. 
Europeana)"}],"Campus":[{"label":"Campus","value":"UBCV","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","classmap":"oc:ThesisDescription","property":"oc:degreeCampus"},"iri":"https:\/\/open.library.ubc.ca\/terms#degreeCampus","explain":"UBC Open Collections Metadata Components; Local Field; Identifies the name of the campus from which the graduate completed their degree."}],"Creator":[{"label":"Creator","value":"Xi, Quanhan","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/creator","classmap":"dpla:SourceResource","property":"dcterms:creator"},"iri":"http:\/\/purl.org\/dc\/terms\/creator","explain":"A Dublin Core Terms Property; An entity primarily responsible for making the resource.; Examples of a Contributor include a person, an organization, or a service."}],"DateAvailable":[{"label":"Date Available","value":"2022-08-23T17:04:07Z","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/issued","classmap":"edm:WebResource","property":"dcterms:issued"},"iri":"http:\/\/purl.org\/dc\/terms\/issued","explain":"A Dublin Core Terms Property; Date of formal issuance (e.g., publication) of the resource."}],"DateIssued":[{"label":"Date Issued","value":"2022","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/issued","classmap":"oc:SourceResource","property":"dcterms:issued"},"iri":"http:\/\/purl.org\/dc\/terms\/issued","explain":"A Dublin Core Terms Property; Date of formal issuance (e.g., publication) of the resource."}],"Degree":[{"label":"Degree (Theses)","value":"Master of Science - MSc","attrs":{"lang":"en","ns":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","classmap":"vivo:ThesisDegree","property":"vivo:relatedDegree"},"iri":"http:\/\/vivoweb.org\/ontology\/core#relatedDegree","explain":"VIVO-ISF Ontology V1.6 Property; The thesis degree; Extended Property specified by UBC, as per https:\/\/wiki.duraspace.org\/display\/VIVO\/Ontology+Editor%27s+Guide"}],"DegreeGrantor":[{"label":"Degree 
Grantor","value":"University of British Columbia","attrs":{"lang":"en","ns":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","classmap":"oc:ThesisDescription","property":"oc:degreeGrantor"},"iri":"https:\/\/open.library.ubc.ca\/terms#degreeGrantor","explain":"UBC Open Collections Metadata Components; Local Field; Indicates the institution where thesis was granted."},"Description":[{"label":"Description","value":"Latent variable models posit that an unobserved, or latent, set of variables describes the statistical properties of the observed data. The inferential goal is to recover the unobserved values, which can then be used for a variety of downstream tasks. Recently, generative models, which attempt to learn a deterministic mapping (the generator) from the latent to observed variables, have become popular for a variety of applications. However, arbitrarily different latent values may give rise to the same dataset, especially in modern non-linear models, an issue known as latent variable indeterminacy. In the presence of indeterminacy, many scientific problems which generative models aim to solve become ill-defined. In this thesis, we develop a mathematical framework to analyze the indeterminacies of a wide range of generative models by framing them as a special type of statistical identifiability. By doing so, we unify existing model-specific derivations from various corners of the diverse literature on identifiability in latent variable models. Using our framework, we also derive conditions to eliminate indeterminacies completely while maintaining the flexibility of modern methods. 
Using these conditions, we are able to target precisely the sources of indeterminacy to derive novel results on the weak and strong identifiability of popular generative models, and variations thereof.","attrs":{"lang":"en","ns":"http:\/\/purl.org\/dc\/terms\/description","classmap":"dpla:SourceResource","property":"dcterms:description"},"iri":"http:\/\/purl.org\/dc\/terms\/description","explain":"A Dublin Core Terms Property; An account of the resource.; Description may include but is not limited to: an abstract, a table of contents, a graphical representation, or a free-text account of the resource."}],"DigitalResourceOriginalRecord":[{"label":"Digital Resource Original Record","value":"https:\/\/circle.library.ubc.ca\/rest\/handle\/2429\/82455?expand=metadata","attrs":{"lang":"en","ns":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","classmap":"ore:Aggregation","property":"edm:aggregatedCHO"},"iri":"http:\/\/www.europeana.eu\/schemas\/edm\/aggregatedCHO","explain":"A Europeana Data Model Property; The identifier of the source object, e.g. the Mona Lisa itself. This could be a full linked open data URI or an internal identifier"}],"FullText":[{"label":"Full Text","value":"Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability
by Quanhan Xi
B.Sc., University of Ottawa, 2020
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Statistics)
The University of British Columbia (Vancouver)
August 2022
\u00a9 Quanhan Xi, 2022
The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled: Indeterminacy in Latent Variable Models: Characterization and Strong Identifiability, submitted by Quanhan Xi in partial fulfillment of the requirements for the degree of Master of Science in Statistics.
Examining Committee: Benjamin Bloem-Reddy, Assistant Professor, Statistics, UBC (Supervisor); Daniel J. McDonald, Associate Professor, Statistics, UBC (Supervisory Committee Member)
Abstract
Latent variable models posit that an unobserved, or latent, set of variables describes the statistical properties of the observed data. The inferential goal is to recover the unobserved values, which can then be used for a variety of downstream tasks. Recently, generative models, which attempt to learn a deterministic mapping (the generator) from the latent to observed variables, have become popular for a variety of applications. However, arbitrarily different latent values may give rise to the same dataset, especially in modern non-linear models, an issue known as latent variable indeterminacy. In the presence of indeterminacy, many scientific problems which generative models aim to solve become ill-defined. In this thesis, we develop a mathematical framework to analyze the indeterminacies of a wide range of generative models by framing them as a special type of statistical identifiability. By doing so, we unify existing model-specific derivations from various corners of the diverse literature on identifiability in latent variable models. Using our framework, we also derive conditions to eliminate indeterminacies completely while maintaining the flexibility of modern methods. 
Using these conditions, we are able to target precisely the sources of indeterminacy to derive novel results on the weak and strong identifiability of popular generative models, and variations thereof.
Lay Summary
Latent variable models attempt to uncover unobserved, scientifically significant factors that help us explain and learn from observed data. For example, a value representing disease stage is not directly observable from the pixels of medical imaging, but describes the statistical variations in a population. To this end, modern innovations allow scientists to build accurate models, but these models are unstable in the sense that multiple users may obtain arbitrarily different results even with perfect data. The research presented in this thesis builds a theory to relate these possible different values to each other. Using insights derived from our theory, we are then able to suggest modifications to existing models which resolve this issue, so that multiple users will recover the same results, given enough data from a population. These results represent a first step in the ultimate goal of latent variable models to automatically discover underlying properties from complex data formats.
Preface
This thesis is solely authored work by the author, Quanhan Xi, under the guidance of Prof. Benjamin Bloem-Reddy. The research topic was proposed by Prof. Bloem-Reddy. All novel technical results are contributed by the author. Certain results pertaining to Section 5.1.1 have appeared as a peer-reviewed (non-archival) article at the 2021 NeurIPS Causal Inference and Machine Learning workshop, authored by the author and Prof. Bloem-Reddy. The technical content in Sections 3, 4, and 5 is based on a manuscript (Xi and Bloem-Reddy, 2022) currently under peer review.
Table of Contents
Abstract
Lay Summary
Preface
Table of Contents
List of Figures
Acknowledgments
1 Introduction
2 Technical Background
2.1 Basic Definitions
2.2 Random Variables, Pushforward Measures
2.3 Null Sets and Absolute Continuity
2.4 Bijective Mappings, Borel Isomorphisms
3 Generative Models and Identifiability
3.1 Motivation
3.2 Generic Generative Modelling
3.3 Related Works and Historical Notes
3.4 Generative Models as Statistical Models
3.5 Identifiability
3.6 Identifiability and Indeterminacies
4 Characterizing Indeterminacies
4.1 Example: Linear Generative Models
4.2 Indeterminacy Maps
4.2.1 Analyzing A_{a,b}
4.3 Indeterminacy Sets from Indeterminacy Maps
5 Modelling Choices for Strong Identifiability
5.1 Multiple Environments
5.1.1 Generative Modelling in Multiple Environments
5.1.2 Examples
5.1.3 Fixing Latent Distributions
5.1.4 Strongly Identifiable VAE
5.2 Groups of (Optimal) Transport Maps
5.2.1 Optimal Transport Generators
5.2.2 Triangular Monotone Maps
6 Conclusion and Future Work
Bibliography
A Supporting Materials
A.1 Detailed Linear Examples
A.1.1 Example: Factor Analysis
A.1.2 Linear, non-Gaussian ICA
A.2 Detailed Proofs
A.2.1 Proof of Lemma 4.2
A.2.2 Proof of Proposition 5.2
A.2.3 Proof of Proposition 5.3
A.2.4 Proof of Proposition 5.7
A.3 Discrete Observations
List of Figures
Figure 5.1: Schematic representation of Lemma 4.2 for (a) measure isomorphisms (when P_z contains multiple elements, as in iVAE); and (b) measure automorphisms (when P_z is a singleton, as in a standard VAE).
Figure 5.2: a) Proposition 5.9 and b) Proposition 5.6, with \u00b5_0 = \u03b7_0 = 0. The orthogonal complement of a plane in R^3 is the perpendicular line through the origin.
Acknowledgments
First and foremost, I would like to acknowledge my supervisor, Professor Ben Bloem-Reddy, for all his advice and encouragement during my M.Sc. He has provided me with both research and life mentorship whenever I needed it most. I feel deeply fortunate that we will continue to work together during my Ph.D. studies!
I also wish to acknowledge my second reader, Professor Daniel J. 
McDonald. Daniel acted as a secondary mentor to me throughout my M.Sc., and in particular provided valuable feedback that helped improve this thesis. I am grateful for all of my fellow classmates at the Department of Statistics. In particular, I wish to acknowledge Naitong and Nikola for all the tennis practice and good memories. I also wish to thank Gian Carlo and Kenny for their tireless work organizing various events around the department. Finally, I am thankful for my family for their constant support and encouragement. My mother always encouraged me despite being thousands of kilometers away. Amanda has been the most accommodating and supportive partner I could have asked for, supporting me through the highs and the lows, the busy times and the slow times.
Chapter 1: Introduction
Latent variable models assume an underlying collection of realized, but unobserved (latent), random variables that are used to describe the statistical properties of the observed data. Typically, each observation X_i is assumed to arise from a single realization Z_i of the latent random variable. The first step in most applications is then to infer the unobserved realizations for each datum. These applications are highly diverse, but a useful classification is as follows:
\u2022 Inference as uncovering hidden variables: latent variables are often included in well-defined physical or societal models (\u201cstructural theories\u201d (Mulaik, 2009)), which are not measurable or not observable, but more relevant to the theory than the observational format collected. In this case, inference is simply to estimate this hidden value, given the observed quantities.
\u2022 Inference for dimensionality reduction and exploration: latent variables can be used as lower-dimensional \u201crepresentations\u201d of high-dimensional observations. Beyond dimensionality reduction, how the data vary along different values of their inferred latent variables is also of interest (\u201clatent space exploration\u201d). Crucially, the data analyst does not attach meaning to the latent variables a priori, but may assess their semantics during exploration, e.g., by interpolating along specific dimensions (Higgins et al., 2017; Kim and Mnih, 2018).
Example 1.1 (Uncovering hidden variables). Human intelligence is sometimes theorized to be explained by a single dimension\u2014the univariate \u201cg\u201d factor (Spearman, 1904). Here, the observational quantities are test results, and interest directly lies in recovering the g-value of an individual (for example, via a standardized intelligence quotient (IQ) test).
Example 1.2 (Dimensionality reduction and exploration). Suppose a high-dimensional single-cell sequencing dataset (often with thousands of expression values, even after pre-selection) with multiple phenotypic states is modelled with 10 latent factors, compressing the data onto a 10-dimensional space. The inferred individual factors can be used to cluster or classify observations into known states (dimensionality reduction). It may be observed that the latent dimension Z_1 roughly corresponds to cell size, whence the inferred model can be used to generate synthetic cells of different, intermediate sizes (exploration).
For some purposes, such as using the estimated lower-dimensional factors in a predictive model, it can be sufficient to find any latent values that represent the data well. However, most tasks that aim to advance understanding using these models are of questionable integrity when the latent values for a dataset are indeterminate under the model. Specifically, latent variable indeterminacy refers to when the observed data do not provide enough information to distinguish between two or more possible latent values, i.e., the problem is underdetermined. 
Our two examples above are ill-defined in the indeterminate setting\u2014an individual cannot possess two \u201cg\u201d-values under the psychometric theory, and a cell cannot simultaneously be two different sizes.
Though related to statistical identifiability, which has a general mathematical definition, indeterminacy is more often given model-specific definitions. For example, factor analysis is commonly related to identifiability of the entries of the factor loading matrix, but this does not suggest how it may generalize to non-linear models. Independent component analysis (ICA) (Comon, 1994) has successfully analyzed indeterminacies in non-linear models (Hyv\u00e4rinen and Pajunen, 1999), but is unable to eliminate them completely, which makes it inappropriate for the tasks described above.
The goal of this thesis is to carefully characterize the indeterminacies of a popular class of latent variable models. By doing so, we can better understand limitations when interpreting and using these models, particularly recent non-linear iterations. We also propose modifications to existing methods that eliminate their indeterminacies, taking a first step in building modern interpretable latent variable models.
Specifically, we propose a novel framework for characterizing and eliminating indeterminacies in a general class of latent variable models\u2014a key step is to define a special form of identifiability such that it is a dual concept to latent variable indeterminacy. Our framework abstracts the analysis of indeterminacies to a mostly measure-theoretic exercise free of the specific model at hand, and entails several practical contributions to the current literature:
\u2022 Unifies existing results from seemingly unrelated models in disparate fields, such as factor analysis (psychometrics), ICA (signal processing), and VAEs (unsupervised learning).
\u2022 Adds minimal additional constraints to strengthen existing results, in particular eliminating indeterminacies completely (strong identifiability) in modern non-linear models previously only known to be weakly identifiable.
\u2022 Proves strong identifiability for a class of triangular normalizing flows, a popular model for density estimation but not previously known to be useful for interpretable latent variable modelling.
\u2022 Encourages future application of results from measure theory to eliminate indeterminacies, with possible applications to a wide variety of fields.
The organization of this thesis is as follows. In Chapter 2, we review the required technical background, especially in measure theory, to develop the framework in the remainder of the thesis. In Chapter 3, we introduce generative models, a popular type of latent variable model that will be the main object of analysis in this thesis. In particular, we define strong identifiability, relate it to indeterminacy, and motivate why it is important for scientific applications of these models. In Chapter 4, we develop our mathematical framework and obtain characterizations of the indeterminacy of generic generative models. Finally, in Chapter 5, we discuss modelling choices that can result in strongly identifiable generative models, taking inspiration from the current literature as well as developing a novel class of models.
Chapter 2: Technical Background
The rest of this thesis takes place in the mathematical setting of measure-theoretic probability. This section reviews some relevant definitions. 
Our standardreference here will be C\u00b8inlar (2011), though we will sometimes refer to other textsdepending on the specific topic (Bogachev, 2007; Kechris, 1995; Schilling, 2005).2.1 Basic DefinitionsLet (E,\u03c4) be a topological space with E a set and \u03c4 its collection of open sets.Definition 2.1 (\u03c3 -algebras, C\u00b8inlar, 2011, Eq. 1.3). A collection E of subsets of Eis called a \u03c3 -algebra on E if it is closed under complements and countable unions:B \u2208 E =\u21d2 E \\B \u2208 E , B1,B2, \u00b7 \u00b7 \u00b7 \u2208 E =\u21d2 \u222anAn \u2208 E . (2.1)Note that a \u03c3 -algebra always contains the empty set and E itself.The pair (E,E ) defines a measurable space. The elements of E in this context arecalled measurable sets. When the \u03c3 -algebra is insignificant, or obvious by context,we will simply refer to the space by E.5Definition 2.2 (Generated \u03c3 -algebras, C\u00b8inlar, 2011, Sec. 1). The \u03c3 -algebragenerated by a collection of subsets E \u2032, denoted \u03c3(E \u2032) is the smallest \u03c3 -algebrathat contains E \u2032.In this thesis, we will always work with what is known as the Borel \u03c3 -algebra.Definition 2.3 (Borel \u03c3 -algebras, C\u00b8inlar, 2011, Sec. 1). Let (E,\u03c4) be a topologicalspace. The Borel \u03c3 -algebra of E is generated by the collection of open sets, \u03c3(\u03c4).We denote it byB(E).An element B \u2208B(E) is then said to be a Borel set.Let (E,E ), (F,F ) be two measurable spaces, and f : E \u2192 F a mapping betweenthem. The image of A\u2282 E is defined asf (A) = { f (a) | a \u2208 A} \u2282 F. (2.2)Similarly, the preimage of B\u2282 F is defined asf\u22121(B) = {x \u2208 E | f (x) \u2208 B} \u2282 E. (2.3)Most mappings we will be concerned with will be assumed to be measurable.Definition 2.4 (Measurable Mappings, C\u00b8inlar, 2011, Sec. 2). 
A mapping f : E \u2192 Fis said to be (E ,F )-measurable if f\u22121(B) \u2208 E for each B \u2208F .If (E,E ) = (F,F ), we will refer to f as simply E -measurable.Definition 2.5 (Measures, C\u00b8inlar, 2011, Sec. 3). Given a measurable space (E,E ),a mapping \u00b5 : E \u2192 [0,\u221e] is a measure if it satisfies\u00b5( \/0) = 0, \u00b5(\u222anAn) =\u2211n\u00b5(An), (An) a disjoint sequence in E . (2.4)In particular, if \u00b5(E) = 1, then \u00b5 is said to be a probability measure.6The triplet (E,E ,\u00b5) is known as a measure space, and, if \u00b5 is a probability measure,as a probability space. When there is no risk for confusion, we simply identify ameasure space by its measure \u00b5 . Two measures \u00b5 and \u03bd on the same measurablespace are equal whenever\u00b5(B) = \u03bd(B), for all B \u2208 E . (2.5)The main application of measure theory is to compute integrals of a measurablefunction f with respect a measure \u00b5 and possibly over a measurable set A. We willdenote this integral by \u222bx\u2208Af (x)\u00b5(dx). (2.6)We will not pursue precise definitions of this integral in this thesis\u2014see the resource(C\u00b8inlar, 2011, Sec. 4).2.2 Random Variables, Pushforward measures(C\u00b8inlar, 2011, Ch. 2) covers probability spaces in depth. Here, we only reviewthe relevant notion pertaining to random variables and their distributions. Arandom variable X1 on (E,E ) is associated to a probability measure \u00b5 , called itsdistribution, defined as\u00b5(B) = P(X \u2208 B), for all B \u2208 E . 
(2.7)Random variables X , Y defined on the same measurable space with distributions\u00b5 , \u03bd are said to be equal in distribution if \u00b5 = \u03bd as probability measures, denotedX d= Y .1Technically, we require a background measure space (\u2126, calF,P), and a random variable isdefined as a measurable function X :\u2126\u2192 E.7Any (E ,F )-measurable function f : E \u2192 F applied to X , denoted f (X), definesa random variable on (F,F ) with distribution \u00b5 \u25e6 f\u22121, where f\u22121 denotes thepreimage of f as a set function F \u2192 E . Sometimes, we will use the morestreamlined pushforward notation for the distribution of f (X), as follows:f#\u00b5 = \u00b5 \u25e6 f\u22121. (2.8)Finally, we define the notion of a pushforward \u03c3 -algebra.Definition 2.6 (Pushforward \u03c3 -algebras, C\u00b8inlar, 2011, Sec. 2). Let (E,E ) be ameasurable space, F be a set, and f be a mapping f : E \u2192 F . The pushforward\u03c3 -algebra of f is defined as\u03c3( f ) = {B\u2282 F ; f\u22121(B) \u2208 E } (2.9)It is easily shown that \u03c3( f ) is a \u03c3 -algebra on F , and that f is measurable withrespect to \u03c3( f ) (C\u00b8inlar, 2011, Exercise 2.20). In fact, it is the smallest \u03c3 -algebrathat makes f measurable.2.3 Null sets and absolute continuityDefinition 2.7 (Null Sets, C\u00b8inlar, 2011, Sec. 3). Given a measurable space (E,E ),a measurable set A \u2208 E is said to be null with respect to a measure \u00b5 , or \u00b5-null, if\u00b5(A) = 0.The empty set is always a null set by the definition of a measure, but there can bemany more null sets, depending on the measure. Any countable union of null setsis again null. 
For example, the Lebesgue measure \u03bb on R assigns null measureto a singleton {x} for x \u2208 R, and so both the set of natural numbers and rationalnumbers are \u03bb -null sets.For a measure space (E,E ,\u00b5), most properties that are consequences of measure-8theoretic manipulations can only hold \u00b5-almost everywhere. This is a weakernotion than a property holding pointwise, and the strength of a result can dependon the measure \u00b5 .Definition 2.8 (Almost Everywhere, C\u00b8inlar, 2011, Sec. 3). Given a measure space,a property that is stated for x \u2208 E is said to hold \u00b5-almost everywhere if thereexists a measurable set N with \u00b5(N) = 0 such that P holds for all x \u2208 E \\N.For example, the following is an elementary, but very useful result:Lemma 2.1 (C\u00b8inlar, 2011, Exercise 4.25). Let f ,g be measurable functions on ameasure space (E,E ,\u00b5). Then, if\u222bx\u2208Af (x)\u00b5(dx) =\u222bx\u2208Ag(x)\u00b5(dx) for all A \u2208 E , (2.10)we have f (x) = g(x) \u00b5-almost everywhere.Furthermore, two measures are often compared with respect to their null sets.Definition 2.9 (Absolute Continuity, C\u00b8inlar, 2011, p. 31). A measure \u00b5 is said tobe absolutely continuous with respect to \u03bd defined on the same measurable space,denoted \u00b5 \u226a \u03bd , if for any A \u2208 E such that \u03bd(A) = 0, we also have \u00b5(A) = 0.An equivalent property is given by the Radon-Nikodym theorem.Theorem 2.2 (Radon-Nikodym, C\u00b8inlar, 2011, Thm. 5.11). \u00b5\u226a \u03bd on (E,E ) if andonly if there exists a measurable function p : E \u2192 [0,\u221e), uniquely defined \u03bd-almosteverywhere, such that for any measurable set A \u2208 E and measurable function f ,we have \u222bx\u2208Af (x)\u00b5(dx) =\u222bx\u2208Ap(x) f (x)\u03bd(dx). 
(2.11)We call p the density of \u00b5 with respect to \u03bd .Typically, when discussing a probability measure P on Rd , p is the density of9P with respect to the Lebesgue measure \u03bb , implicitly assuming that P \u226a \u03bb . Inthis thesis, we will also be working in this context, referring to p as a probabilitydensity, unless stated otherwise.Definition 2.10 (Equivalence, Schilling, 2005, Problem 19.5). Two measures \u00b5 , \u03bdare said to be equivalent if \u00b5 \u226a \u03bd and \u03bd \u226a \u00b5 .Clearly, equivalent measures assign the exact same null sets, and imply the same\u201calmost everywhere\u201d statements. There is again an analogous definition in terms ofdensities.Lemma 2.3 (Schilling, 2005, Exercise 19.5). If \u00b5 and \u03bd are two measures definedon the same measurable space, then any density p of one with respect to the othersatisfies p(x)> 0 \u00b5-almost everywhere (equiv. \u03bd-almost everywhere) if and only ifthey are equivalent.We will often use the fact that if p, a probability density, is strictly positive, thenP-almost everywhere is equivalent to \u03bb -almost everywhere. Finally, when workingon Euclidean spaces, we will simply write almost everywhere (or a.e.) to mean\u03bb -almost everywhere.2.4 Bijective Mappings, Borel IsomorphismsThis section reviews relevant facts about Borel isomorphisms, the main referenceis (Kechris, 1995, Ch. 15).Bijective mappings f : E \u2192 F can enjoy some additionally nice properties. First,it can be easily shown thatf ( f\u22121(A)) = A, for all A\u2282 E. (2.12)Furthermore, the image f (A) defines the pre-image of the inverse mapping f\u22121.This is not necessarily true for non-bijective mappings.10We now define the notion of isomorphism between measurable spaces.Definition 2.11 (Borel Isomorphism, Kechris, 1995, Ch. 10.B). Let f : E \u2192 Fbe a bijective mapping. If f and f\u22121 are both measurable, it is known as anisomorphism, and E, F are said to be isomorphic. 
If E = F , f is called anautomorphism. In particular, if E, F are Borel spaces, we refer to such an f as aBorel auto\/isomorphism.Borel-measurable bijections are particularly nice to work with\u2014they are automati-cally Borel isomorphisms.Lemma 2.4. Let E, F be Borel spaces and f : E \u2192 F be bijective. Then, f ismeasurable if and only if it is a Borel isomorphism.Proof. We only prove the forward direction, i.e., showing f\u22121 is measurable iff is measurable. The reverse direction is identically proved. By (Kechris, 1995,Theorem 15.1), for f Borel-measurable and injective, f (B) is a Borel set of F forany Borel set B of E. Since f (B) defines the pre-image of f\u22121 for any Borel set Bof E, it immediately follows that f\u22121 is also Borel-measurable.Finally, we define the notion of a measure auto\/isomorphism.Definition 2.12 (Measure Isomorphism, Bogachev, 2007, Sec 9.2). Let (E,E ,\u00b5)and (F,F ,\u03bd) be two measure spaces. An isomorphism f : E \u2192 F is called a(\u00b5,\u03bd)-measure isomorphism iff#\u00b5 = \u03bd , f\u22121# \u03bd = \u00b5. (2.13)It is called a \u00b5-measure-preserving automorphism if (E,E ,\u00b5) = (F,F ,\u03bd).11Chapter 3Generative Models and IdentifiabilityThe applications outlined in Chapter 1 are commonly achieved using generativemodels. Building such a model involves specifying a latent Borel space Z, which ispaired with the data (or observation) Borel space X. The model is then parametrizedvia a measurable mapping f : Z\u2192 X (the generator) and probability distributionPz on Z. In words, the generative model specifies that each observation Xi \u2208 X isa noisy observation of f (Zi), where Zi is an independent draw from Pz. 
To fit themodel is then to fit a parametrized form of f , and in some cases, a parametric formof Pz, via maximum likelihood (or some proxy thereof).3.1 MotivationTo recall why it is important to eliminate indeterminacies in latent variable models,and in particular generative models, we dedicate this section to describing indetail a popular scientific application, and how indeterminacies undermine modelinterpretability to the end-user.One popular application of generative modelling is to analyze high-dimensionalgenomics data. In particular, single cell RNA sequencing (scRNA-seq) provides12the state of the art for measuring individual cell properties. A specific example isin predicting cellular response to a perturbation or intervention such as infectionor drug treatment (Lotfollahi et al., 2019) using a VAE (with Pz a multivariateGaussian). In particular, suppose that experimental data for some collection of celltypes was collected, where some cells are measured in an unperturbed (p = 0) andothers in a perturbed state (p = 1).The latent values for each cell are estimated under the model, and aggregated forp = 0 and p = 1. Denote their mean values by Z\u00af(p=0) and Z\u00af(p=1) respectively.Then, an average perturbation response in the latent space is computed as \u03b4 =Z\u00af(p=1)\u2212 Z\u00af(p=0).Suppose now that a new sample X (p=0)i is collected only in an unperturbed state,and we wish to study the cell-specific response to the perturbation. We first estimateits latent value via the inference algorithm, producing Z\u02c6(p=0)i . Then, we estimateZ\u02c6(p=1)i = Z\u02c6(p=0)i +\u03b4 . Finally, we generate a synthetic \u201ccounterfactual\u201d observationX (p=1)i = f (Z\u02c6(p=1)i ) via the model generator f .The model proposed in the above procedure is highly indeterminate. 
In particular, if the inference algorithm were trained a second time (even approaching asymptotic convergence), each estimated Ẑ_i would almost certainly be a rotated version of itself, which would result in different Z̄ and δ values.

In practice, Lotfollahi et al. (2019) show that this procedure performs well in their experimental validation, in the sense that synthetic samples correlate well with the corresponding held-out perturbed samples. Note that even in the presence of indeterminacies, since the two indeterminate models produce an identical fit, we suspect that these results are indeed reproducible, in the sense that rotated latent values would yield similarly positive results under the same evaluation criteria.

However, issues arise when we try to interpret the output of these models. What insight have we gained by estimating δ and producing synthetic counterfactual observations? Does δ generalize to a study conducted at a different lab? Can we report δ₁, the first entry of the vector, as a quantitative estimate of the causal effect of p = 1 on the corresponding latent dimension?

In indeterminate models, δ cannot possibly generalize to different labs, given that it would not even generalize to a second run of the inference algorithm on the existing dataset. This also makes typical statistical considerations difficult: the robustness/stability of δ is arbitrarily poor in this scenario.

Eliminating indeterminacies does not necessarily enable us to endow δ, or any other latent estimand, with a scientific interpretation.
However, we believe that this is an important first step towards such a goal, and at the very least it allows principled assessment of reproducibility and stability, which are key to the final goal of interpretability.

3.2 Generic Generative Modelling

Generically, a generative model can be stated mathematically as follows:

    Z_i ∼ P_z ,  ε_i ∼ P_ε ,  Z_i ⊥⊥ ε_i ,  and  X_i = g(f(Z_i), ε_i) ,   (3.1)

where ε_i is a noise component assumed to be sampled i.i.d. from a known distribution P_ε, and g is a known noise mechanism. In this thesis, we will not be concerned with inferring the noise ε, and instead treat it as a nuisance variable. Hence, we work with the following global assumption that ε has a null effect on the probabilistic properties of the model.

Assumption 3.1. Assume that g and P_ε are such that g(f(Z_a), ε_a) =ᵈ g(f(Z_b), ε_b), with ε_a =ᵈ ε_b, if and only if f(Z_a) =ᵈ f(Z_b).[1]

This assumption includes, for example, the noiseless case, and additive noise for a suitable noise distribution on X (see Hälvä et al. (2021), for example). Practically, this means that the distribution of X under the model is determined completely by the distribution of f(Z), that is, the pushforward measure f_#P_z. In other words, the statistical model is parametrized by the tuple θ = (f, P_z).

[1] We note that this assumption rules out the possibility of discrete observations except in very limited cases; see Appendix A.3 for a brief discussion of this point.

We also make the following global assumption:

Assumption 3.2. Assume that any generator f : Z → X is injective, and that all generators have the same image: for any f_a, f_b ∈ 𝓕, f_a(Z) = f_b(Z) := 𝓕(Z) ⊆ X.

This assumption removes degenerate cases of indeterminacy: if f were not injective, we could have f(z₁) = f(z₂), i.e., identical noiseless observations under two different latent values under the model.
The assumption of sharing the same image is technical in nature, but note that the identification problem is trivial otherwise. Suppose f_a, f_b have different image sets. Then, if P_z is fully supported, we cannot have (f_a)_#P_z = (f_b)_#P_z, and by Assumption 3.1, the resulting models cannot produce the same fit. In other words, given a fully supported latent distribution, any possible indeterminacies arise from generators satisfying Assumption 3.2.

We will assume additive noise when discussing (3.1) for the remainder of the thesis (i.e., g(f(Z_i), ε_i) = f(Z_i) + ε_i), though the results developed in later chapters apply so long as Assumption 3.1 is satisfied. We are not aware of other noise mechanisms such that Assumption 3.1 holds for a wide class of P_ε; however, we note that additive noise models are standard in most instances. The following two examples detail two popular cases.

Example 3.3 (Factor Analysis; Lawley and Maxwell, 1962). When f(Z_i) = α + F Z_i, with α a vector, F a full-rank matrix, and Z, X Euclidean spaces, (3.1) specifies a factor model, a popular classical generative model.

Example 3.4 (Variational Auto-encoder; Kingma and Welling, 2014). When f (the "decoder") is a (possibly) non-linear function and inference on Z_i is via a variational approximation of f⁻¹ (the "encoder"), (3.1) specifies a variational auto-encoder (VAE).

Remark. In more classical machine learning contexts, a generative model refers to one that learns the joint probability distribution of observations X_i and their labels Y_i. The definition here is different, and more closely resembles that of a deep generative model (DGM), which uses deep neural networks (DNNs) to parametrize the generator in (3.1).
However, the results presented in this thesis are not specific to DGMs, and apply more generally, including to classical linear models (not typically referred to as generative models) such as factor analysis, and to polynomial or spline-based generators.

3.3 Related Works and Historical Notes

Before presenting our framework, this section provides a brief review of the existing literature for context. As this is the setting for all closely related works, we will assume Euclidean spaces, i.e., Z = R^dz and X = R^dx. In theory, our framework can be seen as having a technical advantage, as it applies to any Borel spaces, though we remark that currently (as will be seen in Chapter 5, where we derive strong identifiability), practical analyses still rely on these Euclidean spaces. Extending these results to obtain strong identifiability on more exotic spaces (e.g., Ding and Regev, 2021; Mathieu et al., 2019) represents future work.

Indeterminacies in Factor Analysis

Recall that factor analysis (Example 3.3) can be understood as a generative model with a linear generator f(x) = Fx. Indeterminacy in factor analysis, henceforth "factor indeterminacy", is perhaps the most well-known and well-studied example of latent variable indeterminacy. We cannot hope to cover the entire history of factor indeterminacy here, and only review some relevant progressions that have led to developments towards solving the indeterminacy problem. See Steiger (1979) and (Mulaik, 2009, Chapter 13) for a more in-depth account and discussion.

The long-standing controversy of factor indeterminacy stems from Spearman (1904) proposing the first instance of such a factor model, which was also the first effort to identify the factor as being scientifically meaningful.
Wilson (1929) was the first to point out that there could be multiple solutions to Spearman's model, and thus that the g-factor of an individual (see Example 1.1) was not unique under the model. This was a serious issue, as Spearman had previously justified his model with an erroneous proof that the g-factor was indeed unique (Steiger, 1979).

Eventually, Spearman offered what appears to be the first resolution for indeterminacy: if an observed test score were perfectly correlated with the g-factor, then it would be unique under the model. Of course, this theory implies that we have access to a perfect instrument of intelligence, to which Wilson (1929) replied that we could then simply "throw away our scaffolding", referring to the factor model itself.

Despite the indeterminacy problem, multiple factor analysis, a multivariate extension of Spearman's univariate model, became popular for quantitative research in psychology (Steiger, 1979). Multivariate statisticians (Anderson and Rubin, 1956) soon became interested (and skeptical), and characterized the indeterminacy problem in terms of an orthogonal matrix that leaves the covariance matrix invariant. Our work in the following chapter can be seen as a framework that generalizes this type of result to a richer class of models; see Section 4.1 for a framing of the above problem in our notation.

Remark. Factor analysis, even once statisticians first became involved (Anderson and Rubin, 1956), was not formulated with distributional assumptions on the latent variables. In particular, to fit the model was to satisfy certain second-order (i.e., correlational) criteria, equivalent to maximum likelihood learning of a generative model with a Gaussian latent distribution. For a discussion of the identifiability of this model under a standard Gaussian latent distribution, see Section 4.1.
However, some perspectives on factor analysis in psychology do not treat it as a statistical model at all, but rather purely as the solutions to a set of algebraic equations derived from a dataset.

We refrain from further discussion of the historical aspects of factor indeterminacy, and now discuss what methods are used in practice to resolve this problem. We note here that in a factor model where each entry of the matrix F is a parameter, strong identifiability of the model is equivalent to our Definition 3.7, and hence identification in factor models typically refers to strong identifiability (or, at most, identifiability up to a permutation of the entries and sign flips). A first common assumption is for F to be full-rank, to avoid trivial indeterminacies in the model (i.e., ensuring that f(x) = Fx is injective), which corresponds to our Assumption 3.2. Most identification strategies go on to constrain the matrix F to eliminate the rotation indeterminacy. However, the literature on identification constraints in factor models is vast, and strategies can differ depending on the modelling framework.

For example, a triangular F with unit diagonal is generally sufficient for identification. However, this implies that the first dimension is a noisy observation of the first factor, and that the remaining factors are autoregressively dependent (Aguilar and West, 2000), which involves non-trivial user input to best reorder the observed data.[2] In another example, Bai and Ng (2013) assume that both F, the generator matrix, and Z, the factors, are random, and derive three asymptotic identification conditions in this setting.
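The effect of a triangular constraint can be illustrated with the closely related positive-diagonal (Cholesky) variant: any rotation of a full-rank square loading matrix leaves F F^⊤ unchanged, yet the lower-triangular positive-diagonal representative of that covariance is unique. A minimal numpy sketch, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Any full-rank loading matrix F is rotation-equivalent to others with the same fit:
F = rng.standard_normal((3, 3))
Q = np.linalg.qr(F.T)[0]     # an orthogonal matrix (the rotation indeterminacy)
F_rot = F @ Q                # same covariance: F F^T == F_rot F_rot^T

# But the lower-triangular, positive-diagonal representative of each is the
# same, by uniqueness of the Cholesky factorization of F F^T:
L1 = np.linalg.cholesky(F @ F.T)
L2 = np.linalg.cholesky(F_rot @ F_rot.T)
```

Constraining F to this triangular form therefore pins down a single member of each rotation-equivalence class.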
Rohe and Zeng (2020) use the characterization of the Gaussian distribution as the only rotationally invariant distribution to show identifiability of factor models with a non-Gaussian latent distribution, and their model is free of assumptions on the noise mechanism.

Independent Component Analysis (ICA)

Similarly to factor analysis, linear ICA can also be framed within (3.1) with a linear generator, f(x) = Fx, where F is known as the mixing matrix. However, ICA was originally formulated for signal processing, in which the latent distribution is an unknown "source" signal distribution with independent coordinates to be inferred, from which a mixed signal is observed. We say that a pair (F, P_z) solves the ICA problem if it is a solution to the maximum likelihood problem for a given dataset and P_z has independent coordinates.

[2] This assumption about the first factor is similar to Spearman's original suggestion; furthermore, the autoregressive/triangular map is formally justified to be strongly identifiable in Section 5.2, where we provide a proof of a more general, non-linear case without requiring identity diagonal elements.

Immediately, it is clear that for any solution (F, P_z), there is another solution (F P⁻¹ D⁻¹, D_# P_# P_z), where P is a permutation of the indices of Z and D is a diagonal scaling matrix. That is to say, we can reorder and rescale the coordinates of Z, as they remain independent, and this can then be undone via pre-multiplication of the mixing matrix.

The ICA literature is interesting, in that the introduction of ICA (Comon, 1994) was immediately concerned with identifiability, but also in that the above indeterminacies were considered fundamental and irreducible.
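These permutation-and-scaling indeterminacies can be verified numerically; the sources, mixing matrix, and dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_z = 5000, 3
Z = rng.laplace(size=(n, d_z))           # non-Gaussian, independent sources
F = rng.standard_normal((4, d_z))        # mixing matrix
X = Z @ F.T                              # observed mixed signals

P = np.eye(d_z)[[2, 0, 1]]               # permutation of the source indices
D = np.diag([2.0, 0.5, -1.0])            # diagonal rescaling

Z_alt = Z @ (D @ P).T                    # permuted, rescaled sources: still independent
F_alt = F @ np.linalg.inv(D @ P)         # compensating mixing matrix F P^{-1} D^{-1}
X_alt = Z_alt @ F_alt.T                  # identical observations
```

Both (F, P_z) and (F_alt, D_#P_#P_z) reproduce the same X exactly, so no criterion on the fit can distinguish them.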
Specifically, (Comon, 1994, Corollary 13) proved identifiability up to permutations and scalings, which was simply referred to as "identifiability".[3] Due to this, strong identifiability in ICA (both linear and non-linear) is understood to mean identifiability up to permutations and scalings, while weak identifiability refers to an even larger set of indeterminacies, usually arbitrary coordinate-wise transformations (Hälvä et al., 2021; Hyvärinen and Morioka, 2016, 2017; Hyvärinen et al., 2018), understood to be a "fully unmixed" solution. There have been some advancements in non-linear ICA-type models characterizing different types of indeterminacies, such as Klindt et al. (2021) (up to permutations and sign flips) and Ahuja et al. (2022) (up to equivariances of a latent "mechanism").

Non-linear ICA and Identifiable VAE

Non-linear ICA aims to fit the same model as linear ICA, replacing the linear generator with a non-linear, more flexible f. Note that since these non-linear generators are often parametrized as deep neural networks, which contain linear maps as special cases (Vidal et al., 2017), non-linear ICA inherits the fundamental scaling and permutation indeterminacies of linear ICA.

[3] Comon (1994) used similar arguments as Rohe and Zeng (2020) in showing that using non-Gaussian latent distributions eliminates the rotational indeterminacy.

On the other hand, non-linear ICA typically has many more indeterminacies, owing to the complexity of the generator class (we formalize why in Theorem 4.3): the identifiability results in this literature are typically up to arbitrary coordinate-wise transformations. Furthermore, many such results rely on non-i.i.d., typically time-dependent, non-stationary observations (Hyvärinen and Morioka, 2016, 2017; Hyvärinen et al., 2018); we formalize why non-i.i.d. data reduces indeterminacies in Section 5.1.1.

Recently, Khemakhem et al.
(2020) used insights from non-linear ICA to derive the first identifiability results for VAEs (Example 3.4). In particular, they formulate the generative model of the VAE identically to non-linear ICA, where P_z is parametrized as an exponential family with independent components. Khemakhem et al. (2020) obtain an indeterminacy set of linear transformations (under the sufficient statistics of the exponential family), corresponding to the fact that exponential families are preserved under such transformations (again, we make this precise in Theorem 4.3 and Section 5.1.1).

VAEs, in both form and purpose, can be seen as the natural non-linear extension of factor analysis. In addition to the example outlined in Section 3.1, the latent dimensions of a VAE are commonly given semantic meaning when analyzing natural image data (Higgins et al., 2017; Kim and Mnih, 2018), and have also been used to endow the latent space with causal interpretations (Lu et al., 2022). In view of the history behind factor indeterminacy, it is clear that identifiability should be an important topic for VAEs.

Khemakhem et al. (2020), and slight extensions such as (Lu et al., 2022; Zhou and Wei, 2020), represent the current state of the art in analyzing identifiability in VAEs. However, besides the variational inference, the ICA-inspired generative modelling set-up differs from the original VAE (Kingma and Welling, 2014), which fixes the latent "prior" distribution (see Section 5.1.3 for further discussion on this point). As we show in Theorem 4.3 and Section 5.1.4, this connection to ICA introduces, amongst others, the fundamental indeterminacies discussed in Comon (1994).
Using our results, we show that the more natural framing of VAEs as non-linear factor analysis, rather than non-linear ICA, can help achieve strong identifiability (Section 5.1.4).

Indeterminacies in Generic Generative Models

All of the discussions of indeterminacies and similar notions of identifiability above are in terms of pre-compositions of the generator. For example, the rotation, permutation, or element-wise transformations all influence f as a pre-composition f′ = f ∘ A. This is the fundamental connection between these two otherwise different lines of work in factor-analysis-type and ICA-type generative models.

In light of this connection, and to resolve possible confusion between indeterminacy and identifiability, we set up the generative model (3.1) as a statistical model, in which indeterminacy can be reconciled with identifiability. We dedicate the remainder of this section to describing this approach.

3.4 Generative Models as Statistical Models

We now begin to build our framework to analyze indeterminacies in the model defined by (3.1). Recall that our generative model is parametrized by the tuple θ = (f, P_z). In particular, we denote their respective parameter spaces as 𝓕, 𝓟_z. Within our framework, varying 𝓕, 𝓟_z recovers specific models such as factor analysis (𝓕 is linear) or ICA (𝓟_z has independent components), and for this reason we sometimes refer to these parameter spaces as the model design. Denote the resulting marginal distribution of X as P_θ. The model hence induces a statistical model on X, denoted as follows:

    𝓜(𝓕, 𝓟_z) = { P_θ on X | θ = (f, P_z) , f ∈ 𝓕 , P_z ∈ 𝓟_z } .   (3.2)

Remark. When we make the assumption of additive noise, the model can easily be written in terms of the marginal density.
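This remark can be illustrated numerically: under additive Gaussian noise, the marginal density at x is the average of p_ε(x − f(z)) over latent draws, which a Monte Carlo sample approximates (the one-dimensional generator and noise scale below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

def f(z):
    # Hypothetical injective generator on R (illustrative only).
    return np.tanh(z)

def p_eps(u, sigma=0.1):
    # Density of additive N(0, sigma^2) observation noise.
    return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

z_samples = rng.standard_normal(100_000)   # draws from P_z = N(0, 1)

def p_theta(x):
    # Monte Carlo estimate of the marginal density of X at x.
    return p_eps(x - f(z_samples)).mean()
```

Since f here is bounded, the estimate is appreciable only near the image of f and vanishes far from it.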
Assuming P_ε has a density p_ε defined with respect to some reference measure μ_x on X, then P_θ, the marginal distribution on X, has a density p_θ that can be written as follows:

    p_θ(x) = ∫_Z p_ε(x − f(z)) dP_z .   (3.3)

Note that p_θ is again a density with respect to μ_x, even if f_#P_z does not have a density with respect to it.

3.5 Identifiability

A statistical model is said to be (strongly) identifiable if the mapping θ → P_θ is one-to-one (van der Vaart, 1998, p. 62). In other words, in an identifiable model, whenever two parameter values θ_a, θ_b are such that the resulting X is equal in distribution, we must have θ_a = θ_b. For the sake of analyzing indeterminacies, we are only concerned with the identifiability of the functional parameter f. Furthermore, there are many acceptable notions of equality on a function space of parameters. We will hence work with an alternative definition, specifically tailored to latent variable indeterminacy. First, we define a notion of functional equivalence up to pre-composition with maps from a set.

Definition 3.5. Two functions f_a, f_b : Z → X are said to be equivalent up to a set of Borel automorphisms 𝓐 ∋ A : Z → Z if

    f_a(z) = f_b(A(z))  for all z ∈ Z, for some A ∈ 𝓐 .   (3.4)

We denote this equivalence relation f_a =_𝓐 f_b.

It is easily seen that this defines an equivalence relation, given the bijectivity of A ∈ 𝓐. Notably, other typical notions of functional equivalence are special cases of the above. Let id_z denote the identity function on Z; then f_a =_𝓐 f_b is pointwise equality when 𝓐 = {id_z}. Let ĩd_z denote the set of transformations equal to id_z almost everywhere; then f_a =_𝓐 f_b is almost-everywhere equality when 𝓐 = ĩd_z. Armed with this definition, we are now ready to define the notion of an indeterminacy set.

Definition 3.6.
The indeterminacy set of a model 𝓜(𝓕, 𝓟_z), denoted 𝓐(𝓜), is the smallest set of Borel automorphisms of Z such that

    P_θa = P_θb  ⟹  f_a =_𝓐(𝓜) f_b .   (3.5)

Next, for the purposes of strong identifiability, we will consider two functions equivalent whenever they are equal almost everywhere.

Definition 3.7. A generative model 𝓜(𝓕, 𝓟_z) is weakly identifiable up to 𝓐(𝓜), a set of Borel automorphisms (or 𝓐(𝓜)-identifiable), if its indeterminacy set is 𝓐(𝓜). If 𝓐(𝓜) = ĩd_z, then the model is strongly identifiable.

This definition of strong identifiability is hence equivalent to the statistical definition[4] when the parameter space of f consists of the equivalence classes of P_z-almost-everywhere equivalent functions.

[4] Strictly speaking, this is equivalent to a notion of weak/partial identifiability (or set identifiability) up to arbitrary distributions P_z, i.e., identifiability up to the set {(f, P_z) | f ∈ ĩd_z, P_z ∈ 𝓟_z}. In this thesis, we are not concerned with inference on P_z.

Remark. Any notion of model identifiability is a property of statistical inference not achieved in practice when fitting a model to finite data. Even if there exists a unique latent variable that is in theory recoverable, there is no guarantee that a training algorithm will reach it with finite data. However, model identifiability is an important quality for statistical inference, and in particular is necessary for the typical theoretical guarantees (van der Vaart, 1998). Furthermore, we remark that there is empirical evidence that even weak identifiability can recover latent structures that faithfully represent the ground truth (Khemakhem et al., 2020; Lu et al., 2022; Sorrenson et al., 2020).

3.6 Identifiability and Indeterminacies

In practice, particularly when considering deep generative models, strong identifiability of f as in Definition 3.7 does not necessarily imply the same for the parameters that are being optimized.
This is because the set of learned parameters, say the weights and biases of a neural network, can be many-to-one with the almost-everywhere equivalence classes. However, the definition above makes a connection to statistical identifiability, and also allows us to make the notion of latent variable indeterminacy precise; special cases of the above are standard in recent works in this area (Ahuja et al., 2022; Khemakhem et al., 2020).

In particular, because f ∈ 𝓕 and A ∈ 𝓐(𝓜) are bijective by definition, if two latent values Z_a, Z_b generate the same observation X in an 𝓐(𝓜)-identifiable model, then we must have Z_a = A(Z_b). Similarly, Z_b = A⁻¹(Z_a). In particular, if the model is strongly identifiable, then we have Z_a = Z_b outside of a null set, and the generating latent variable is unique.

Henceforth, we will treat indeterminacy and identifiability as analogous terms. To say that a model is strongly identifiable is equivalent to saying the model is free of indeterminacies. To say that a model is weakly identifiable up to 𝓐(𝓜) is equivalent to saying the indeterminacies of the model are 𝓐(𝓜).

Chapter 4
Characterizing Indeterminacies

As a first step towards developing a framework to eliminate indeterminacies, in this chapter we prove results that characterize the indeterminacy set, as defined in Definition 3.6 and Definition 3.7, for generic generative models. In particular, we will show that for any generative model of the form (3.1), the indeterminacies are determined by the model design, i.e., the parameter spaces 𝓕 and 𝓟_z. This characterization gives not only a framework for recovering the indeterminacies of a given model, but also a framework for designing strongly identifiable models, which we apply in later sections.

4.1 Example: Linear Generative Models

We first review two well-known linear models to act as points of reference throughout the rest of the thesis.
The theory we develop in the rest of this chapter is largely a generalization of the intuitive notions of unidentifiability in these simpler models. For more details, see Appendix A.1.

For observations modeled as random vectors X taking values in R^dx, factor analysis (Lawley and Maxwell, 1962) aims to infer a low-dimensional latent variable Z taking values in R^dz, d_z < d_x, via the model

    Z_i ∼ N(0, I_{dz×dz}) ,  ε_i ∼ N(μ, I_{dx×dx}) ,  X_i = F Z_i + ε_i ,   (4.1)

where ε_i is independent of Z_i, and F is a full-rank d_x × d_z matrix of so-called factor loadings, to be learned from data. We can think of this as structurally identical to a linear VAE. It is well known that the parameter F and the latent variables corresponding to the observations are unidentifiable: a single marginal distribution on X may correspond to (at least) two different factor loading matrices, say F_a and F_b.

To see this, note that Gaussian additive noise can be deconvolved (Maritz and Lwin, 1989), so the distribution of X is entirely determined by the distribution of FZ. Let F_b = F_a A_{a,b}, with A_{a,b} some d_z × d_z matrix. Since Z is a standard multivariate Gaussian, F_a Z ∼ N(0, F_a F_a^⊤) and F_a A_{a,b} Z ∼ N(0, F_a A_{a,b} A_{a,b}^⊤ F_a^⊤), and F_a Z =ᵈ F_a A_{a,b} Z =ᵈ F_b Z if and only if A_{a,b} A_{a,b}^⊤ = I_{dz×dz}. That is, the corresponding indeterminacy set consists of the set of d_z × d_z orthogonal matrices.

Another way of characterizing the indeterminacies, consistent with the framework we develop in the rest of this chapter, is to note that when A_{a,b} is an orthogonal matrix, A_{a,b} Z =ᵈ Z. That is, it preserves the latent variable distribution. Moreover, it can be constructed as A_{a,b} = F_a⁻¹ F_b, where F_a⁻¹ is the left-inverse of F_a, for two factor loading matrices F_a and F_b in the model class.
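Both characterizations can be checked numerically: an orthogonal A_{a,b} leaves the covariance of FZ unchanged, and it is recoverable from the two loading matrices via the left-inverse (dimensions and matrices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d_z, d_x = 2, 6
Fa = rng.standard_normal((d_x, d_z))             # one loading matrix
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # an orthogonal A_{a,b}
Fb = Fa @ A                                      # an observationally equivalent loading

# Same fit: the covariance of FZ (hence the law of X) is unchanged.
same_cov = np.allclose(Fa @ Fa.T, Fb @ Fb.T)

# The indeterminacy map is recoverable as A_{a,b} = Fa^+ Fb,
# with the pseudo-inverse acting as the left-inverse of Fa:
A_recovered = np.linalg.pinv(Fa) @ Fb
```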
As we will show, generalizations of these are precisely the two conditions that characterize indeterminacies in more general models.

Linear ICA (Comon, 1994) relaxes the assumption that Z has a normal distribution, and only requires that the components of Z are independent. Typically, an ICA model is specified with a class of distributions on Z, denoted 𝓟_z. The so-called mixing function, F, and P_z ∈ 𝓟_z are then inferred simultaneously from data. The indeterminacies of this model arise as any full-rank matrix A_{a,b} such that F_a A_{a,b} Z_a =ᵈ F_b Z_b, where the indices a and b represent latent variables with different distributions in 𝓟_z. The indeterminacy set now includes certain unresolvable ambiguities that arise from A_{a,b} transporting the distribution of Z_a to that of Z_b, in addition to the measure-preserving transformations in the previous example. In particular, Z_a and Z_b may be a permutation and scaling of each other. If 𝓟_z excludes Gaussian distributions, it is known that these are the only two types of transformations in the indeterminacy set for linear ICA (Comon, 1994). On the other hand, in non-linear ICA, many more indeterminacies may arise because A_{a,b} does not need to be linear.

These examples illustrate that model indeterminacies come not only from the distribution(s) on Z, but also from how they interact with the class of functions mapping the latent variables into the observation space. We make this precise in the next two sections.

4.2 Indeterminacy Maps

Suppose that Z_a and Z_b are two possible latent values generating a de-noised observation X. By our global assumption of injectivity, these two latent values must have been passed through different generators f_a and f_b. We have

    f_a(Z_a) = f_b(Z_b) = X .   (4.2)

We also assumed that f_a and f_b share an image, 𝓕(Z). Hence, their inverses f_a⁻¹, f_b⁻¹ are well defined as maps 𝓕(Z) → Z. The above then implies the following relation between Z_a and Z_b:

    Z_b = f_b⁻¹(f_a(Z_a)) .   (4.3)
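For a concrete non-linear instance of this relation, take two hypothetical injective generators on R with a shared image (0, ∞); composing the inverse of one with the other yields the map relating the two latent values:

```python
import numpy as np

# Two hypothetical injective generators R -> R with the same image (0, inf):
def fa(z):
    return np.exp(z)

def fb(z):
    return np.exp(2.0 * z)

def fb_inv(x):
    return 0.5 * np.log(x)

def A_ab(z):
    # The map A_{a,b} = fb^{-1} o fa; here it works out to A_{a,b}(z) = z / 2.
    return fb_inv(fa(z))

z = np.linspace(-3.0, 3.0, 7)
# f_a(z) and f_b(A_{a,b}(z)) produce the same de-noised observation:
same_obs = np.allclose(fa(z), fb(A_ab(z)))
```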
We now define the indeterminacy map between any two generators f_a, f_b.

Definition 4.1. For f_a, f_b ∈ 𝓕, their indeterminacy map A_{a,b} : Z → Z is defined by

    A_{a,b} = f_b⁻¹ ∘ f_a .   (4.4)

From the above definition, (4.3) becomes

    Z_b = A_{a,b}(Z_a) ,  Z_a = A_{a,b}⁻¹(Z_b) .   (4.5)

4.2.1 Analyzing A_{a,b}

We now examine some elementary properties of indeterminacy maps, given our global assumptions on 𝓕.

Lemma 4.1. For any f_a, f_b ∈ 𝓕, their indeterminacy map A_{a,b} is a Borel automorphism of Z.

Proof. Recall that A_{a,b} is a Borel automorphism when A_{a,b}⁻¹ is also measurable. The inverse clearly exists: A_{a,b}⁻¹ = f_a⁻¹ ∘ f_b. By Lemma 2.4, all of f_a, f_a⁻¹, f_b, f_b⁻¹ are measurable. Since compositions of measurable functions are again measurable, A_{a,b} is a Borel automorphism of Z.

The above lemma is a technical prerequisite to the following result, which represents the first fundamental result of our framework.

Lemma 4.2. Let θ_a = (P_{z,a}, f_a) and θ_b = (P_{z,b}, f_b) be two parametrizations of a generative model with resulting marginal distributions P_θa and P_θb. Then, P_θa = P_θb if and only if A_{a,b}, the corresponding indeterminacy map defined in (4.4), is a (P_{z,a}, P_{z,b})-measure isomorphism. In particular, if P_{z,a} = P_{z,b} := P_z, then A_{a,b} is a P_z-measure-preserving automorphism.

Sketch of proof, forward direction. By Assumption 3.1, P_θa = P_θb if and only if P_{z,a} ∘ f_a⁻¹ = P_{z,b} ∘ f_b⁻¹. Let B ∈ 𝓑(Z). Then,

    P_{z,a}(A_{a,b}⁻¹(B)) = P_{z,a}(f_a⁻¹(f_b(B))) = P_{z,b}(f_b⁻¹(f_b(B))) = P_{z,b}(B) ,   (4.6)

where the first equality is by definition of A_{a,b}, the second equality is by hypothesis, and the third equality is due to injectivity. Since B was arbitrary, this shows that P_{z,a} ∘ A_{a,b}⁻¹ = P_{z,b}. To see that P_{z,a} = P_{z,b} ∘ A_{a,b}, simply swap the roles of the indices a and b.

To make the argument in the above proof fully rigorous, some care is required to construct a σ-algebra on the image 𝓕(Z).
The arguments can be found in the Appendix, as well as the proof of the "only if" direction.

4.3 Indeterminacy Sets from Indeterminacy Maps

Lemma 4.2 gives necessary and sufficient conditions for two parametrizations of a generative model to correspond to the same fit. Though the "only if" direction may seem irrelevant at first glance, it also documents the cases in which the parameter spaces are such that P_θa cannot be equal to P_θb, an alternative path towards strong identifiability.

Using this result, we now show how to characterize exactly the set of model indeterminacies 𝓐(𝓜) given the parameter spaces 𝓕, 𝓟_z. To that end, define the following sets of latent space Borel automorphisms induced by the generative model parameter spaces:

    𝓐(𝓕) = { A : Z → Z | A = f_b⁻¹ ∘ f_a for some f_a, f_b ∈ 𝓕 }
    𝓐(𝓟_z) = { A : Z → Z | P_b = P_a ∘ A⁻¹ , P_a = P_b ∘ A for some P_a, P_b ∈ 𝓟_z } .

𝓐(𝓕) consists of all possible indeterminacy maps constructed from 𝓕, and 𝓐(𝓟_z) consists of all possible isomorphisms between measures in 𝓟_z. Both sets always include the identity function, by taking f_a = f_b and P_a = P_b.

Note that the "size" of the above sets depends on the complexity of the underlying parameter spaces 𝓕, 𝓟_z. Consider the edge case 𝓕 = {f}, i.e., the generator is fixed to be f; then it must be that 𝓐(𝓕) = {id_z}. On the other hand, if 𝓕 is a highly flexible class of generators, e.g., all measurable functions, then 𝓐(𝓕) may very well be all Borel automorphisms of Z. Similarly, suppose that 𝓟_z = {N([−1, 1], I_{2×2})}, i.e., a fixed Gaussian latent distribution. Then, 𝓐(𝓟_z) consists of all N([−1, 1], I_{2×2})-measure-preserving automorphisms.
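One concrete consequence of fixing this latent distribution: swapping z₁ and z₂ moves the mean from [−1, 1] to [1, −1], so the swap cannot be measure-preserving for N([−1, 1], I). A sampling sketch:

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([-1.0, 1.0])
Z = mu + rng.standard_normal((200_000, 2))   # draws from P_z = N([-1, 1], I)

Z_swapped = Z[:, ::-1]                       # A permutes z1 and z2
# The permuted sample has mean approximately [1, -1] rather than [-1, 1],
# so this A does not preserve P_z and lies outside A(P_z).
mean_swapped = Z_swapped.mean(axis=0)
```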
Though we are unable to give a precise characterization of these automorphisms, we do know, for example, that 𝓐(𝓟_z) excludes permutations of z₁ and z₂.

With the above insights, we now provide a precise characterization of the indeterminacy set of a generative model in terms of its parameter spaces.

Theorem 4.3. The generative model 𝓜(𝓕, 𝓟_z) is identifiable up to 𝓐(𝓜) = 𝓐(𝓕) ∩ 𝓐(𝓟_z). In particular, 𝓜(𝓕, 𝓟_z) is strongly identifiable if and only if 𝓐(𝓕) ∩ 𝓐(𝓟_z) = ĩd_z.

Proof. Recall that for the generative model to be identifiable up to a set of measurable functions 𝓐(𝓜) is to say that, for all (f_a, P_{z,a}), (f_b, P_{z,b}) ∈ 𝓕 × 𝓟_z such that P_θa = P_θb, we have A = f_b⁻¹ ∘ f_a ∈ 𝓐(𝓜).

We first show that, for any parameter spaces 𝓕 and 𝓟_z, we have 𝓐(𝓜) ⊆ 𝓐(𝓕) ∩ 𝓐(𝓟_z). Suppose A ∈ 𝓐(𝓜). That is, A = f_b⁻¹ ∘ f_a such that there exist P_{z,a}, P_{z,b} for which the parametrizations θ_a = (f_a, P_{z,a}), θ_b = (f_b, P_{z,b}) have P_θa = P_θb. By definition of A, we have A ∈ 𝓐(𝓕). By Lemma 4.2, we must also have A ∈ 𝓐(𝓟_z).

We now show that 𝓐(𝓕) ∩ 𝓐(𝓟_z) ⊆ 𝓐(𝓜). Suppose A ∈ 𝓐(𝓕) ∩ 𝓐(𝓟_z). We can write A = f_b⁻¹ ∘ f_a for some f_a, f_b ∈ 𝓕. Furthermore, there exist P_{z,a} and P_{z,b} such that P_{z,b} = P_{z,a} ∘ A⁻¹ and P_{z,a} = P_{z,b} ∘ A. By Lemma 4.2, θ_a = (f_a, P_{z,a}), θ_b = (f_b, P_{z,b}) are such that P_θa = P_θb, and hence A ∈ 𝓐(𝓜).

This expresses the identifiability of a generative model in terms of the indeterminacy maps induced by its parameter spaces. In particular, all model indeterminacies must be transports between distributions in 𝓟_z that can be constructed by pushing and pulling along generators from 𝓕 as f_b⁻¹ ∘ f_a.
It suggests that model identifiability strengthens as we increase the number of constraints on F, P_z, or both, until the intersection A(F) ∩ A(P_z) contains only the identity and strong identifiability is obtained.

A particularly important implication of Theorem 4.3 is that it cleanly partitions the two sources of indeterminacy. This means that, to eliminate indeterminacies, we can constrain F, P_z, or both. The more constraints we place on these parameter spaces, the smaller the indeterminacy set becomes.

In the past, in applications where identifiability was deemed important, the strategies to eliminate indeterminacies often constrained both F and P_z in synergy. This approach to designing identifiable models is most clearly demonstrated in linear, non-Gaussian ICA: the linear constraint on F reduces indeterminacy maps to linear maps, and the non-Gaussianity in P_z is designed precisely to eliminate linearly isomorphic measures.

However, as we have mentioned in this thesis, a more pressing issue today concerns the interpretability of modern, flexible generative models. For this task, Theorem 4.3 indicates that we should primarily target the latent distribution space P_z, placing only minimal constraints on F and hence on A(F). The next chapter details some approaches for strong identifiability in this setting.

Chapter 5
Modelling Choices for Strong Identifiability

Beyond a simple characterization, Theorem 4.3 also provides a path forward for strong identifiability of flexible, non-linear generative models. This has been a highly difficult problem in the past; our results indicate that this is because A(F) can very well be all Borel automorphisms of Z. However, Theorem 4.3 exposes the structure of unidentifiability and indicates that there may be approaches to specifying strongly identifiable latent variable models without restricting ourselves to linear generators.
This chapter is dedicated to exploring two such approaches. The first assumes data from “multiple environments” to impose additional constraints on the model. This is the approach currently used in related models (Ahuja et al., 2022; Khemakhem et al., 2020). To display the generality of our results in Chapter 4, we also recover these previous results within our framework. The second approach is novel, and specifies F to be a family of maps satisfying certain group-like properties; in particular, triangular monotone maps make up a class of generators that satisfy our restrictions in practice (Irons et al., 2021; Wehenkel and Louppe, 2019).

5.1 Multiple Environments

It is common in statistical inference to make the assumption of independent and identically distributed (i.i.d.) observations. Models designed specifically for non-i.i.d. observations are typically less popular due to compromises along other criteria, such as accuracy and efficiency.

It is generally true in most scenarios that i.i.d. models are more efficient and easier to analyze theoretically when considering the usual criteria for statistical methodology (e.g., asymptotic normality or convergence rates). However, these criteria are trivially invalidated when the model is non-identifiable, as is the case for the generative models in this thesis.

In this section, we describe how a specific type of non-i.i.d. data (multiple environments) makes identifiability analyses easier, which is perhaps unexpected given the above discussion.
This idea has been used extensively in the non-linear ICA identifiability literature, where multiple environments are indexed by time or some other “auxiliary information” (Ahuja et al., 2022; Hyvärinen and Morioka, 2016, 2017; Hyvärinen et al., 2018; Khemakhem et al., 2020; Klindt et al., 2021). We show in this section that this can be seen as an instance of constraining P_z while leaving F fully flexible, as suggested by our framework.

5.1.1 Generative Modelling in Multiple Environments

Suppose data arise from environments indexed by e ∈ E, where the environment label is assumed to be deterministic (i.e., known, or observed without noise). Each environment corresponds to a different observation random variable X^e ∼ P_x^e on a shared observation space X. This is reflected in the generative model as |E| distinct distributions on latent variables, Z^e ∼ P_z^e, on a shared latent space Z. Crucially, each environment shares the same generator f. The parameter space is (F, {P_z^e}_{e∈E}), and the generative model is specified as, for each e ∈ E,

  Z_i^e ∼ P_z^e,  ε_i ∼ P_ε,  Z_i^e ⊥⊥ ε_i,  and  X_i^e = g(f(Z_i^e), ε_i).  (5.1)

We denote the corresponding statistical model as M(F, {P_z^e}_{e∈E}).

Remark. In addition to identifiability purposes, these models are used as a causal inference method, where environments correspond to observational and interventional distributions (e.g., Bühlmann, 2020; Peters et al., 2016). There, relationships that are modular, or invariant across environments, are interpreted as more likely to be causal. Similar models have also been considered for out-of-distribution generalization, particularly under covariate shift (Arjovsky et al., 2019; Lu et al., 2022). There has been recent interest in “causal representation learning” (Schölkopf et al., 2021; Wang et al., 2021), i.e., recovering latent variables with causal interpretations.
Though we are unaware of a precise definition of causal representation learning, we believe that understanding the indeterminacies of models of the form (5.1) can be of great importance for causal machine learning and interpretability moving forward.

It is easy to see why multiple-environment models, which model a distribution X^e for each environment, make it easier to identify a shared parameter f. Recall the generic generative model given by (3.1). We will refer to it as the “single environment” model, noting that it corresponds to the case |E| = 1 in (5.1). In the single environment model, recall that A(M)-identifiability is defined as:

  P_{θ_a} = P_{θ_b}  ⟹  f_a ≡_{A(M)} f_b.  (5.2)

For the multiple environments model given by (5.1) with n environments, A(M)-identifiability (of the shared parameter f) is given by:

  P^{e_1}_{θ_a} = P^{e_1}_{θ_b},  P^{e_2}_{θ_a} = P^{e_2}_{θ_b},  …,  P^{e_n}_{θ_a} = P^{e_n}_{θ_b}  ⟹  f_a ≡_{A(M)} f_b.  (5.3)

By making f a shared parameter to model multiple observed distributions, we have increased the number of constraints on the left-hand side of the identifiability criterion, which increases the number of constraints on the indeterminacy set A(M). We formalize this as a corollary to Theorem 4.3.

Corollary 5.1. The generative model M(F, {P_z^e}_{e∈E}) is identifiable up to

  A(F) ∩ (∩_{e∈E} A(P_z^e)).  (5.4)

Proof. For each environment e ∈ E, the characterization in Theorem 4.3 holds, and if P^e_{θ_a} = P^e_{θ_b}, we must have f_a ≡_{A_e} f_b, where A_e = A(F) ∩ A(P_z^e). Recalling Definition 3.5 for the equivalence relationship, we see that

  f_a ≡_{A_e} f_b for all e ∈ E  ⟺  f_a ≡_{∩_e A_e} f_b,  (5.5)

and hence M(F, {P_z^e}_{e∈E}) is identifiable up to

  ∩_e (A(F) ∩ A(P_z^e)) = A(F) ∩ (∩_e A(P_z^e)).  (5.6)

The above corollary implies that an indeterminacy of the multiple environments model with |E| > 1 must be a simultaneous (P_z^{a,e}, P_z^{b,e})-measure isomorphism for each e ∈ E, and in particular the intersection in (5.4) implies that the indeterminacy set is no larger than in the single environment model. In practice, as will be clear when discussing specific models, the multiple environments indeterminacy set is in fact much smaller than in the single environment case.

5.1.2 Examples

As mentioned, non-linear identifiability should be obtained by targeting the latent distributions of the model. Examining the second term in (5.4) reveals that multiple environments are a way to do exactly that.

To display the relevance of Corollary 5.1, in this section we show how two recent works in non-linear identifiability, seemingly unrelated at first glance, can be seen as specific instances of our proposed framework. Additionally, we note that these recent works have been unable to obtain strong identifiability. Our framework makes clear where the remaining indeterminacy stems from and, in the second case, suggests a simple modification to eliminate it.

Equivariant Stochastic Mechanisms

In Ahuja et al. (2022), weak identifiability of a temporal generative model is established. Adapted to our notation (note that the time indices represent environments), the model can be described as:

  X_t = f(Z_t),  Z_{t+1} = m_t(Z_t, U_t),  Z_t ⊥⊥ U_t,  t = 1, 2, …,  (5.7)

where m_t ∈ M : Z × [0,1] → Z are unknown mechanisms and U_t are auxiliary noise variables. Note that this is in fact a noiseless generative model at the level of the observations, but one where the underlying latent variable evolves according to a deterministic but unknown mechanism m_t and random noise U_t. To be clear, this means that F is fully flexible, while P_z^t is parametrized by an initial condition P_1, the distributions of U_t, and the mechanisms m_t.
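A model of this form is easy to simulate. The sketch below is purely illustrative (the mechanism m, the generator f, and the initial condition are our own hypothetical choices, not those of Ahuja et al.): the latent variable drifts through a deterministic mechanism driven by uniform noise, and only X_t = f(Z_t) is observed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 50_000, 3

def m(z, u):
    # deterministic-but-unknown mechanism: contraction plus noise injected via U
    return 0.9 * z + (u - 0.5)

def f(z):
    # an injective non-linear generator (illustrative choice)
    return np.sinh(z)

z = np.full(n, 0.1)              # initial condition P_1 = point mass at 0.1
xs = []
for t in range(T):
    xs.append(f(z))              # observed X_t = f(Z_t)
    u = rng.uniform(size=n)      # U_t ~ U[0, 1]
    z = m(z, u)                  # Z_{t+1} = m(Z_t, U_t)

print([float(x.mean()) for x in xs])
```

Each time index t plays the role of one environment: the marginal of Z_t changes with t while the generator f is shared, exactly the structure exploited by Corollary 5.1.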
In what follows, we will assume a fixed P_1 and U_t ∼ U[0,1] as in Ahuja et al. (2022), and leave P_z^t parametrized purely by the mechanisms m_t.

Denote the marginal distribution of Z_t as P_t. In Ahuja et al. (2022), identifiability of the generator f is established up to pre-composition with some transformation A such that A ∘ m_a(z, U) =_d m_b(A(z), U) for U ∼ U[0,1], for all possible values of z and m_a, m_b ∈ M. Using our framework, we are able to show the following stronger identifiability result, using only observations from two time points t = 1, 2.

Proposition 5.2. The model described by (5.7) is identifiable up to A ∈ A(M) satisfying

  A(m_a(Z, U)) =_d m_b(A(Z), U),  (5.8)

for any m_a, m_b ∈ M, U ∼ U[0,1], and any random variable Z independent of U.

Compared to the original proof, we are able to strengthen the result while weakening the assumptions, thanks to our measure-theoretic framework, as follows:

• Letting P_1 be any point mass recovers the original identifiability result in Ahuja et al. (2022).
• Our proof structure, which can be found in the Appendix, follows the intuition originally laid out in Ahuja et al. (2022), but we do not assume a diffeomorphic generator.

Identifiable VAE (iVAE)

In what follows, assume Euclidean spaces for the latent variable and observations, Z = R^{d_z}, X = R^{d_x}. The identifiable VAE model (iVAE; Khemakhem et al., 2020) specifies the latent variable distribution via an auxiliary variable u, which we take here to index the environment. Note that Khemakhem et al. (2020) do not model the distribution of u, and so here we will assume that u is deterministic, e.g., a time index as suggested in (Hyvärinen et al., 2018; Khemakhem et al., 2020). The latent distribution is parameterized as a K-dimensional exponential family distribution on Z = R^{d_z},

  p(z; η(u)) = m(z) exp(η(u)⊤ T(z) − a(η(u))),  (5.9)

with functional parameters η, T taking values in R^K.
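As a concrete instance of (5.9), a univariate Gaussian N(µ, σ²) is a K = 2 exponential family with T(z) = (z, z²), η = (µ/σ², −1/(2σ²)), base measure m(z) = (2π)^{−1/2}, and log-normalizer a(η) = µ²/(2σ²) + log σ. A quick numerical check of this identity (our own, not from the thesis):

```python
import numpy as np

def expfam_gauss_pdf(z, mu, sigma):
    # N(mu, sigma^2) written in the exponential family form (5.9):
    # p(z; eta) = m(z) * exp(eta^T T(z) - a(eta))
    eta = np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])
    T = np.array([z, z**2])
    m_z = 1.0 / np.sqrt(2.0 * np.pi)                  # base measure
    a_eta = mu**2 / (2.0 * sigma**2) + np.log(sigma)  # log-normalizer
    return m_z * np.exp(eta @ T - a_eta)

mu, sigma = 0.5, 1.3
zs = np.linspace(-3, 3, 7)
ours = np.array([expfam_gauss_pdf(z, mu, sigma) for z in zs])
# reference: the usual Gaussian density, written out directly
ref = np.exp(-(zs - mu) ** 2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))
print(np.abs(ours - ref).max())
```

Expanding the exponent shows the two expressions agree term by term, which is why the discrepancy is at floating-point precision.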
The remainder of the model design follows (5.1) with additive noise. We note that in Khemakhem et al. (2020) the distribution is assumed to factorize over the dimensions of Z (as in ICA), but that is not necessary, an observation also made recently in Lu et al. (2022).

In proving identifiability, iVAE requires the existence of points u_0, u_1, … such that the differences η(u_i) − η(u_0) are linearly independent and span R^K. They are able to obtain that, for parametrizations (f_a, T_a, η_a) and (f_b, T_b, η_b) resulting in the same observational distribution, there exists an invertible matrix L and an offset vector c such that for all x,

  T_a(f_a⁻¹(x)) = L⊤ T_b(f_b⁻¹(x)) + c.  (5.10)

The above is the content of (Khemakhem et al., 2020, Thm. 1). Using our framework, we arrive at a similar result.

Although we use similar arguments to those originally presented in Khemakhem et al. (2020), our framework also sheds light on why the iVAE indeterminacies are of the particular form (5.10). In particular, it is due to a purely probabilistic result. As a preliminary, we denote an exponential family distribution (5.9) parametrized by η(u) and T by E_m(η(u), T).

Proposition 5.3. Let A : R^{d_z} → R^{d_z} be a measure isomorphism for two sets of K+1 exponential family distributions, in the sense that

  E_m(η_a(u_i), T_a) = E_m(η_b(u_i), T_b) ∘ A⁻¹,  (5.11)
  E_m(η_b(u_i), T_b) = E_m(η_a(u_i), T_a) ∘ A,  ∀ i = 0, 1, …, K.  (5.12)

Suppose that, for the first K u_i’s, both {η_a(u_i)} and {η_b(u_i)} are linearly independent. Then,

  T_b(A(z)) = L⊤ T_a(z) + d,  (5.13)

almost everywhere, where L is a K × K invertible matrix and d is a K-dimensional vector not depending on x.

The simultaneous isomorphisms above are the indeterminacy maps, due to Corollary 5.1. The following result is a slightly stronger version of (Khemakhem et al., 2020, Thm. 1). It does not require differentiability for any dimension of the exponential family latent distribution, while Khemakhem et al.
(2020) requires the existence of Jacobians for their analysis.

Proposition 5.4. Suppose a generative model M is described by (5.1), with latent distributions described by (5.9), and that m is strictly positive. Suppose we observe at least K+1 distinct values of u_i such that the corresponding natural parameters {η(u_i)}_{i=0}^K are linearly independent. Then, the indeterminacy map A ∈ A(M) satisfies

  T_b(A_{a,b}(x)) = L⊤ T_a(x) + d,  (5.14)

almost everywhere, where L is an invertible K × K matrix and d is a K-dimensional vector.

Proof. By Corollary 5.1, and since we do not constrain F, the generator is identifiable up to the transformations described in Proposition 5.3.

The key observation to make in view of this result is that the identifiability result is in direct correspondence with a result on the exponential families (Proposition 5.3). Our framework logically reduces the model-specific analysis to a probabilistic exercise; this is true in general for completely unconstrained F. That is, building a strongly identifiable, fully flexible model is precisely to find latent distributions such that there are no non-trivial isomorphisms between them. This is exactly the approach we will take to adapt the iVAE to be strongly identifiable.

5.1.3 Fixing Latent Distributions

Many generative models, including factor analysis and the original VAE (Example 3.3 and Example 3.4), fix the latent distribution. In other words, P_z is a singleton {P_z}, and Theorem 4.3 implies that such models with fully flexible generators are essentially identifiable up to the class of P_z-measure-preserving automorphisms. However, notice that the iVAE latent distributions (5.9) were not fixed, and its functional parameters were learned alongside the generator.
This is due to a connection with non-linear ICA, which generally has different goals compared to those outlined in the introduction of this thesis.

ICA was originally conceived as an algorithm for signal processing, and in particular signal unmixing, rather than interpretable latent variable modelling. Given observations of mixed signals (e.g., music played by a jazz band), the problem of signal unmixing is to find a statistically independent source distribution (the output of each individual band member) and the mechanism that mixes them (in linear ICA, this is called the mixing matrix).

Non-linear ICA today is more commonly used for some form of generative modelling, but it inherits the “source distribution” interpretation of the latent variable distribution from its original purpose. In doing so, it assumes that there is a “ground-truth” latent distribution that should be preferred, and hence attempts to estimate this distribution simultaneously with the generator. As mentioned in Section 3.3, this approach generally results in unresolvable indeterminacies.

For most generative models, however, nothing in the data privileges one set of latent distributions over another if they lead to the same distribution on the observations. The latent space has little meaning in general before the generator is learned, aside from possible prior beliefs about its dimension (e.g., the g-factor was initially hypothesized to be univariate). As Lemma 4.2 shows, models with fixed latent distributions have fewer indeterminacies (see Fig. 5.1 for a visual comparison).

Figure 5.1: Schematic representation of Lemma 4.2 for (a) measure isomorphisms (when P_z contains multiple elements, as in iVAE); and (b) measure automorphisms (when P_z is a singleton, as in a standard VAE).

It may be tempting to assume that allowing the latent distributions to be flexible improves the expressiveness of the model.
However, this is seldom true when F is flexible enough, at least in theory. Consider model parameters θ_a = (P_z^a, f_a) versus θ_b = (P_z^b, f_b), and their respective fits. So long as P_z^a and P_z^b make for standard probability spaces, there exists a measure isomorphism between them (on Euclidean spaces, one such isomorphism is the inverse CDF transform, for example). Then, denoting some such isomorphism by A_{a,b}, the fit induced by θ_b can also be achieved via θ_a* = (P_z^a, f_a ∘ A_{a,b}), assuming F is flexible enough to include f_a ∘ A_{a,b}. Hence, fixing P_z = {P_z^a} is sufficient to obtain the same amount of flexibility as the fully flexible case.

One benefit of flexibility in P_z is specific to the multiple environments model. Given informative metadata about the environments e, one can use the optimization of P_z^e as a method of differentiating between environments (since the generator is shared across environments), in order to better fit the data. To this end, we will simply note that we can still choose latent distributions in an informative manner, so long as they are fixed before training f (i.e., optimizing them as hyperparameters). In our view, this is a necessary compromise for interpretability; as we will now see, the iVAE model with fixed latent distributions is strongly identifiable.

5.1.4 Strongly Identifiable VAE

In this section, we show a specific example of how fixed latent distributions, i.e., non-linear factor analysis-type models, can lead to strong identifiability. Constructing distributions such that the only measure-preserving automorphism is the identity is in general very difficult: non-trivial measure-preserving automorphisms, such as those given by the Darmois construction, can almost always be constructed (see Gresele et al. (2021) for an example). On the other hand, the class of simultaneous measure-preserving automorphisms is generally much smaller when considering multiple environments.
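Both halves of this claim can be illustrated numerically. In the sketch below (our own toy example; the maps and environments are illustrative choices, not from the thesis), a rotation is a non-trivial automorphism of a single fixed Gaussian N(0, I), but it fails to preserve a second, shifted environment, so it drops out of the intersection over environments:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Environment 0: N(0, I_2).  Environment 1: N(mu, I_2), a shifted copy.
mu = np.array([1.0, 0.0])
z0 = rng.standard_normal((n, 2))
z1 = rng.standard_normal((n, 2)) + mu

# A non-trivial automorphism of N(0, I_2): rotation by 90 degrees.
R = np.array([[0.0, -1.0], [1.0, 0.0]])

# The rotation preserves environment 0 (the mean stays near 0) ...
err0 = np.linalg.norm((z0 @ R.T).mean(axis=0))
# ... but moves the mean of environment 1 from mu to R @ mu.
err1 = np.linalg.norm((z1 @ R.T).mean(axis=0) - mu)
print(err0, err1)
```

The mean comparison is only a necessary condition for being measure-preserving, but it already shows the simultaneous-automorphism class shrinking once a second environment is added.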
We show a specific example here by fixing the latent distributions in iVAE.

Recall the setting in iVAE, with latent space R^{d_z} and natural parameter space R^K. We denote a fixed set of full-rank exponential family distributions parameterized by η by

  E_{m,T} = {P on Z | p(z; η) = m(z) exp(η⊤ T(z) − a(η)); η ∈ R^K}.  (5.15)

We refer to a particular distribution in a fixed family as E_{m,T}(η). The only simultaneous measure-preserving automorphisms of a “basis” of these distributions are ĩd_z; this is the result of a series of appealing, simple proofs, which we detail below.

Lemma 5.5. Suppose A is simultaneously a P_1-measure automorphism and a P_2-measure automorphism for P_1, P_2 from a fixed exponential family parametrized by η_1, η_2. Suppose that m is strictly positive for this family. Then, we have

  (η_1 − η_2)⊤ T(z) = (η_1 − η_2)⊤ T(A(z))  a.e.  (5.16)

We can immediately apply this to obtain a characterization of exponential family automorphisms. Recall that the dimension of E_{m,T} is the dimension of the natural parameter vector, and equivalently the dimension of T(x).

Proposition 5.6. Let E_{m,T} be a fixed exponential family, with dimension K, such that m is strictly positive. Let η_i ∈ R^K for i = 0, 1, …, K, and suppose A : R^d → R^d is simultaneously an E_{m,T}(η_i)-measure automorphism for each η_i. Then, (T(z) − T(A(z))) ∈ span{η_i − η_0}⊥, almost everywhere.

Proof. Lemma 5.5 applies to the K contrast vectors (η_i − η_0), so we have:

  (η_i − η_0)⊤ (T(z) − T(A(z))) = 0  a.e.  (5.17)

Any vector v ∈ span{η_i − η_0} is of the form v = Σ_{i=1}^K a_i (η_i − η_0). Clearly,

  v⊤ (T(z) − T(A(z))) = 0  a.e.,  (5.18)

and hence (T(z) − T(A(z))) ∈ span{η_i − η_0}⊥ almost everywhere.

More useful for strong identifiability is the following result.

Proposition 5.7.
Let E_{m,T} be a fixed exponential family, with dimension K, such that m is strictly positive and T is injective. Suppose that the η_i ∈ R^K, i = 0, 1, …, K, span R^K. Suppose A : R^d → R^d is simultaneously an E_{m,T}(η_i)-measure automorphism for each η_i. Then, A(z) = z, almost everywhere.

Proof. Without loss of generality, assume that η_0 is such that {η_i − η_0} forms a basis of R^K. By Proposition 5.6, (T(z) − T(A(z))) ∈ span{η_i − η_0}⊥ = (R^K)⊥ = {0}. This shows that T(A(z)) = T(z) almost everywhere. Since T is injective, we have

  A(z) = z  a.e.  (5.19)

The above states that the only shared automorphisms for each distribution from a suitably fixed exponential family, whose parameters form a basis of R^K, are ĩd_z. Strong identifiability follows when these are fixed as the latent distributions.

Theorem 5.8. Let M(F, {P_z^e}_{e∈E}) be the multiple environments model described in (5.1), with Z = R^{d_z}. For a subset of environments E* ⊂ E with |E*| = K+1, let P_z^e = E_{m,T}(η_e), with m strictly positive, T injective in at least one dimension, and such that the corresponding parameters {η_e}_{e∈E*} span R^K. Then M(F, {P_z^e}_{e∈E}) is strongly identifiable.

Proof. In this model, we have P_z^e = {E_{m,T}(η_e)}. By Corollary 5.1, the generator is identifiable up to A(F) ∩ (∩_e A({E_{m,T}(η_e)})). By Proposition 5.7, ∩_e A({E_{m,T}(η_e)}) contains only functions that are equal to the identity almost everywhere, and hence the model is strongly identifiable.

The conditions of the theorem are met by d_z + 1 Gaussian distributions with standard covariance whose means µ_i, i = 0, 1, …, d_z, span R^{d_z}. More generally, injectivity of T is guaranteed so long as one of its dimensions is the identity (corresponding to a “Gaussian” dimension).
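The spanning condition in the Gaussian case reduces to a rank check on mean contrasts. A minimal verification (our own illustration, taking K = d_z and T(z) = z, so η_i = µ_i): the origin plus the standard basis meets the condition, while dropping a shifted environment leaves a rank-deficient contrast set, the regime of Proposition 5.9 below.

```python
import numpy as np

d_z = 3

# d_z + 1 Gaussian means: the origin plus the standard basis of R^{d_z}.
mus = np.vstack([np.zeros(d_z), np.eye(d_z)])

# Theorem 5.8's condition: the contrasts mu_i - mu_0 must span R^{d_z},
# i.e. have full rank d_z.
rank = np.linalg.matrix_rank(mus[1:] - mus[0])

# Dropping one shifted environment leaves only a 2-dimensional span,
# so full strong identifiability is lost on the unspanned dimension.
rank_deficient = np.linalg.matrix_rank(mus[1:3] - mus[0])
print(rank, rank_deficient)
```

With a rank-deficient contrast set, Proposition 5.6 only pins A down on the spanned coordinates, which is exactly the partial identifiability quantified next.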
Note that the observation space remains an arbitrary Borel space of the same cardinality as R; in particular, it may be bounded, such as [0,1]^m. We remark that the distributions on E \ E* remain unconstrained and may be learned as in iVAE.

Geometric Characterization of VAE Indeterminacies

In fact, we do not require a spanning set for a useful characterization of the indeterminacy, especially in such a Gaussian setting, where the model is (strongly) identifiable up to arbitrary transformations on the dimensions not spanned by the µ_i, due to Proposition 5.6. The content of the following proposition is graphically displayed in Fig. 5.2.

Proposition 5.9. In the multiple environments model described in Theorem 5.8 (with Z = R^d), fix a base environment with distribution N(0, Σ) and a subset E* of environments with |E*| = d′ ≤ d, with distributions N(µ_e, Σ). Suppose {µ_e}_{e∈E*} are linearly independent, and (µ_e)_i = 0 for each e and each i ∉ d*, for some collection of dimensions d*. Then, for any indeterminacy map A_{a,b} it holds that

  (A_{a,b}(x))_{i∈d*} = (x)_{i∈d*}  a.e.  (5.20)

Proof. Similarly to the previous identifiability proofs, we appeal to Corollary 5.1 and analyze the shared automorphisms. For Gaussian distributions with a fixed covariance matrix, varying by their means, we have T(x) = x and η_e = µ_e. Using the base environment, we have µ_0 = 0. By Proposition 5.6, we have (z − A(z)) ∈ span{µ_i}⊥.

Arranging the {µ_i} into the rows of a d′ × d matrix M, it is an elementary fact that span{µ_i}⊥ = Ker(M), the kernel or null space of M. Furthermore, M has columns of 0 corresponding to the indices i ∉ d*, and linearly independent rows.
Together, standard Gaussian elimination reveals that the reduced row echelon form of M has its pivots in columns j ∈ d* while the columns j ∉ d* are identically zero:

  RREF(M) = [ pivot (identity-like) block in columns j ∈ d* | 0 in columns j ∉ d* ].  (5.21)

The corresponding null space is the space of vectors with 0 entries for the indices in d*. Hence, (z − A(z)) ∈ span{µ_i}⊥ implies (A(z))_{i∈d*} = (z)_{i∈d*}.

Figure 5.2: (a) Proposition 5.9: the indeterminacy set with Normal means µ_1 = (1,0,0), µ_2 = (0,1,0), satisfying A(x) − x ∈ span(µ_1, µ_2)⊥. (b) Proposition 5.6: the indeterminacy set (under T) for parameter vectors η_1, η_2, satisfying T(A(z)) − T(z) ∈ span(η_1, η_2)⊥. In both panels, µ_0 = η_0 = 0. The orthogonal complement of a plane in R³ is the perpendicular line through the origin.

5.2 Groups of (Optimal) Transport Maps

We now move on to a completely novel approach to specifying strongly identifiable generative models. In Section 5.1.1, F was left unconstrained, and strong identifiability was achieved through multiple environments and restricting the environment latent variable distributions to be fixed members of an exponential family. In this section, we construct models in which the latent distributions can be any fixed distributions with strictly positive density, and without requiring observations from multiple environments. This is achieved by restricting the class of generators, with the additional condition that Z = X = R^d.

We approach the identifiability problem indirectly, by considering what properties of A(F) would yield strong identifiability.
We aim to specify F such that, when A_{a,b} = f_b⁻¹ ∘ f_a transports one latent distribution to another, it is unique in a suitable sense, so that, in particular, when it transports a distribution to itself, it must be (equivalent to) the identity map. Note that although we continue to follow the multiple environments notation from Section 5.1.1, all results also hold in the i.i.d. case corresponding to |E| = 1.

Specifically, the following properties of F can guarantee strong identifiability for a fixed latent distribution P_z:

Proposition 5.10. The generative model M(F, {P_z^e}_{e∈E}) with Z = X = R^d, where the following properties hold:

1. f ∈ F is invertible on R^d, and f⁻¹ ∈ F also (closure under inverses).
2. If f_a, f_b ∈ F, then f_a ∘ f_b ∈ F also (closure under function composition).
3. If f ∈ F is a measure-preserving automorphism for any P_z^e, then f(z) = z almost everywhere,

is strongly identifiable for any number of environments |E|.

Proof. Our framework makes it clear why these are the requirements for strong identifiability. Note that since Z = X, the indeterminacy maps A_{a,b} are mappings on the same space as f_a and f_b. Properties 1) and 2) ensure that the indeterminacy map remains within F (that is, F forms a group under function composition), and hence A(F) ⊂ F. Property 3) ensures that A(F) ∩ A(P_z) ⊂ F ∩ A(P_z) = ĩd_z, and thus we have strong identifiability.

At first glance, 1) and 2) appear to be reasonably simple constraints to enforce. Assumption 3), on the other hand, may appear to be an impossible assumption, and one that simply “assumes away” the problem of identifiability.
However, this is not necessarily the case; the remainder of this section will describe two examples of non-trivial generator classes for which 3) is guaranteed to hold.

5.2.1 Optimal Transport Generators

Given two probability measures P_a and P_b on R^d, the Monge formulation of optimal transport (Santambrogio, 2015) with respect to a cost function c : Z × Z → R⁺ is to find a map T : Z → Z such that P_a ∘ T⁻¹ = P_b and that minimizes the total cost

  ∫_{R^d} c(z, T(z)) P_a(dz).  (5.22)

We call T an optimal transport (OT) map with respect to c if it minimizes (5.22) for transporting between some pair¹ of probability distributions on R^d. Clearly, if c is such that c(z_1, z_2) = 0 ⟺ z_1 = z_2, then the unique OT map from a distribution P_z to itself is equal to the identity map P_z-almost everywhere. Proposition 5.10 then implies strong identifiability so long as the group properties 1) and 2) are satisfied.

Theorem 5.11. Let M(F, {P_z^e}_{e∈E}) be a generative model, where F consists of optimal transport maps for any fixed cost c such that c(z_1, z_2) = 0 ⟺ z_1 = z_2. Suppose at least one of the P_z^e is equivalent to the Lebesgue measure λ on R^d. Then the model is strongly identifiable for any |E| ≥ 1.

Proof. Fix an environment e ∈ E such that P_z^e is fully supported. By Proposition 5.10, we simply need to show that if f ∈ F is a measure-preserving automorphism for P_z^e, then f(z) = z almost everywhere. Let f ∈ F be a measure-preserving automorphism for P_z^e.

Clearly, any f* ∈ ĩd_z makes c(z, f*(z)) = 0 P_z^e-almost everywhere, and hence the total cost is 0. This implies that f* is an OT map from P_z^e to itself for any f* ∈ ĩd_z. Since f ∈ F is also an optimal transport map, it must also make the total cost 0, i.e.,

  ∫_{R^d} c(z, f(z)) P_z^e(dz) = 0.  (5.23)

Since c(z, f(z)) ≥ 0, we have that c(z, f(z)) = 0 P_z^e-almost everywhere.
Since c(z, f(z)) = 0 ⟺ f(z) = z, and since P_z^e is equivalent to λ, we conclude that f(z) = z almost everywhere, and the model is strongly identifiable.

¹ Or multiple pairs, or even between the same distribution.

Specifying a class of optimal transport maps such that compositions and inverses are still optimal is far from trivial. For example, Brenier maps have appeared recently in a number of machine learning contexts (Amos et al., 2017; Huang et al., 2021; Wang et al., 2021). They are the unique solutions to the OT problem with c(z_1, z_2) = ||z_1 − z_2||², and are characterized by the property that a Brenier map A is the gradient of a convex function, A = ∇φ, for convex φ : R^d → R.

Unfortunately, the set of Brenier maps is not closed under composition. To our knowledge, the only way to remedy this is for every f_a, f_b ∈ F to be cyclically co-monotone, a property deduced recently in (Torous et al., 2021). That is, f_a, f_b must satisfy, for each m ∈ N and each sequence z_1, z_2, …, z_{m+1} = z_1,

  Σ_{i=1}^m ⟨f_b(z_i), f_a(z_{i+1}) − f_a(z_i)⟩ ≥ 0.

We do not currently know of any general (and flexible) function classes that satisfy this property.

5.2.2 Triangular Monotone Maps

Triangular monotone increasing (TMI) maps are of growing interest in generative modelling (Irons et al., 2021; Jaini et al., 2019; Kingma et al., 2016; Papamakarios et al., 2017; Wehenkel and Louppe, 2019). More generally, many normalizing flow models (Papamakarios et al., 2021) have triangular, even monotonic, layers, but due to alternating between lower- and upper-triangular layers (Dinh et al., 2015, 2017; Sorrenson et al., 2020), the final generator may fail to be triangular monotone.

Let f : R^d → R^d be a monotone increasing triangular map. This means that:

  f(x) = ( f_1(x_1),
           f_2(x_1, x_2),
           ⋮
           f_d(x_1, …, x_d) ),

where each x_d ↦ f_d(x_{1:d−1}, x_d) is monotone increasing (hence invertible) for any x_{1:d−1}.
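A minimal 2-D instance of such a map (the components below are hypothetical choices of our own, for illustration), with a numerical check that each component is monotone increasing in its last argument:

```python
import numpy as np

def f(x):
    # a 2-D triangular monotone increasing map: component k depends only
    # on x_1, ..., x_k and is monotone increasing in x_k
    x1, x2 = x[..., 0], x[..., 1]
    y1 = x1 + x1**3                   # monotone increasing in x1
    y2 = np.sin(x1) + np.exp(x2)      # monotone increasing in x2 for fixed x1
    return np.stack([y1, y2], axis=-1)

# check monotonicity along each coordinate on a grid
t = np.linspace(-2.0, 2.0, 101)
y1 = f(np.stack([t, np.zeros_like(t)], axis=-1))[:, 0]
y2 = f(np.stack([np.zeros_like(t), t], axis=-1))[:, 1]
print(np.all(np.diff(y1) > 0), np.all(np.diff(y2) > 0))
```

The Jacobian of such a map is lower-triangular with positive diagonal wherever it exists, which is what makes the componentwise inversion below possible.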
The inverse of f is as follows:

  f⁻¹(x) = ( f_1⁻¹(x_1),
             f_2⁻¹(f_1⁻¹(x_1), x_2),
             ⋮
             f_d⁻¹(f_1⁻¹(x_1), f_2⁻¹(f_1⁻¹(x_1), x_2), …, x_d) ).

This is also a TMI map: the inverses of monotone increasing maps are again monotone increasing. Furthermore, it is clear that compositions of TMI maps are again TMI. Thus, the class satisfies properties 1) and 2) as outlined in Proposition 5.10. Note that the map described above is lower-triangular; upper-triangular maps are analogously defined. For the purposes of this section, a triangular map refers to a lower-triangular map. As long as all maps considered are either all lower- or all upper-triangular, the same closure properties apply.

Each TMI map that is also a (µ, ν)-measure isomorphism (for µ and ν equivalent to λ) has an explicit construction, unique almost everywhere, as the Knöthe–Rosenblatt (KR) transport (Carlier et al., 2010). The KR transport is described recursively as follows. Let F_µ(x_m | x_{1:m−1}) be the conditional CDF of the m-th component of µ given the preceding components. Because µ has strictly positive density, F_µ is monotone increasing. Then, the m-th component of the KR transport is:

  K_m(x_{1:m−1}, x_m) = F_ν⁻¹{ F_µ(x_m | x_{1:m−1}) | K_1(x_1), …, K_{m−1}(x_{1:m−1}) }.

That is, K_m sends x_m through the conditional CDF of µ given x_{1:m−1}, and back through the inverse conditional CDF of ν given y_{1:m−1} = (K_1(x_1), …, K_{m−1}(x_{1:m−1})). This CDF transform is the unique (almost everywhere) monotone increasing transport map between the unique (almost everywhere) one-dimensional regular conditional probabilities.

It is clear that the map K defined by its components K_m is a TMI map.
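For Gaussians the components above have closed forms, which gives a cheap sanity check (our own toy example, not from the thesis): transporting µ = N(0, [[1, ρ], [ρ, 1]]) to ν = N(0, I₂), the first component matches marginals (the identity here) and the second standardises the conditional law x₂ | x₁ ∼ N(ρx₁, 1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.7

def K(x):
    # KR transport from N(0, [[1, rho], [rho, 1]]) to N(0, I_2):
    # K_1 matches the first marginals (identity); K_2 standardises the
    # conditional x2 | x1 ~ N(rho * x1, 1 - rho^2).  Both components are
    # monotone increasing in their last argument, so K is TMI.
    k1 = x[:, 0]
    k2 = (x[:, 1] - rho * x[:, 0]) / np.sqrt(1.0 - rho**2)
    return np.stack([k1, k2], axis=1)

# sample from the correlated source measure mu via its Cholesky factor
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
x = rng.standard_normal((200_000, 2)) @ L.T
y = K(x)
print(np.cov(y.T).round(3))
```

The pushforward's empirical covariance is close to the identity, confirming that this triangular map transports µ to ν as claimed.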
Though it is not known to be an optimal transport map itself, it is the limit of optimal transport maps for a sequence of appropriately weighted quadratic losses (Santambrogio, 2015, Ch. 2.4). In view of this, it is perhaps not surprising that it inherits some of the properties discussed in the previous section. In particular, it also results in strongly identifiable generative models.

Theorem 5.12. Let $\mathcal{M}(\mathcal{F}, \{P_z^e\}_{e \in \mathcal{E}})$ be a generative model, where $\mathcal{F}$ are TMI mappings, and $P_z^e$ is equivalent to $\lambda$ for at least one $e \in \mathcal{E}$. Then, the model is strongly identifiable for any $|\mathcal{E}|$.

Proof. The class of TMI generators corresponds to the class of KR transports. Since the KR transport is the unique TMI map transporting $\mu$ to $\nu$, and triangular monotone increasing maps are closed under inverses and compositions, it must be that:

• For $K$ the KR transport from $\mu$ to $\nu$, $K^{-1}$ is the KR transport from $\nu$ to $\mu$.
• For measures $\mu, \nu, \pi$, if $K_1$ is the KR transport from $\mu$ to $\nu$ and $K_2$ is the KR transport from $\nu$ to $\pi$, then $K_2 \circ K_1$ is the KR transport from $\mu$ to $\pi$.

Now, it is clear that if $f_a, f_b$ are KR transports, their indeterminacy map $A_{a,b}$ is also a KR transport. Furthermore, by construction, each univariate CDF transformation of the KR transport from a distribution $\mu$ to itself is the identity, and hence any KR transport that is also a $\mu$-preserving measure automorphism must be in $\widetilde{\mathrm{id}}_z$. By Proposition 5.10, the model is strongly identifiable. □

The properties of KR transports also enable us to provide an ICA-type identifiability result when the latent distributions are not fixed, but constrained to have independent components.

Proposition 5.13. The ICA model $\mathcal{M}(\mathcal{F}, \{P_z^e\}_{e \in \mathcal{E}})$, with $\mathcal{F}$ (the mixing functions) consisting entirely of TMI maps and the $P_z^e$ fully supported distributions with independent components, is identifiable up to invertible, component-wise transformations for any $|\mathcal{E}|$.

Proof.
By Theorem 4.3, $\mathcal{A}(\mathcal{M})$ is the set of functions equal almost everywhere to KR transports between measures in $\mathcal{P}_z^e$ for any $e \in \mathcal{E}$. Let $P_{z,a}^e$ and $P_{z,b}^e$ be two such measures, which by assumption have independent components. By the construction of the KR transport, it is clear that $K_m$ depends only on $x_m$ in any KR transport $K$ between $P_{z,a}$ and $P_{z,b}$. Such a map is monotone increasing and diagonal, and hence an invertible, component-wise transformation. □

Finally, we note that unlike the optimal transport maps in the previous section, TMI generators have already been used for generative modelling in the literature, and can have highly flexible parametrizations (Wehenkel and Louppe, 2019). Our result here merely supports their interpretability. In particular, since the KR transport is able to move between any two distributions equivalent to $\lambda$, the class of TMI maps is able to fit any such distribution, given any fixed latent distribution.

Chapter 6

Conclusion and Future Work

Generative models, particularly those with flexible generators, are increasingly being applied to new areas in hopes of better understanding complex, high-dimensional, and often uninformative data formats. However, these generative models are highly indeterminate: infinitely many possible latent values can generate a given observation. This indeterminacy makes the task of interpreting latent variables, for example by computing estimates or examining the effects of perturbations, ill-defined.

Understanding latent variable indeterminacies appears to be of varying importance across fields. Discussions of indeterminacies, and attempts to characterize them, date back over 100 years in factor analysis (Spearman, 1904; Wilson, 1929).
Similarly, ICA was originally developed with certain identifiability properties in mind, and has inspired the majority of contemporary work examining indeterminacies in generative models.

The importance of understanding indeterminacies in modern generative models is increasingly being recognized in the literature, particularly in unsupervised learning. In this thesis, we contribute to this study by reconciling seemingly unrelated theories, from modern analyses tailored to deep learning to classical linear factor analytic approaches, in a unifying mathematical framework based on statistical identifiability. To do this, we formulate the problem in the language of measure-theoretic probability, a degree of abstraction beyond contemporary analyses based on manipulating densities.

Specifically, our framework obtains basic characterizations of indeterminacy in a broad class of generative models, including most classical and modern iterations as special cases. We cleanly partition two sources of indeterminacy (Theorem 4.3), separating the contributions of the complexity of the generator class from those of the complexity of the latent distribution class.

Using this basic characterization, we are able to provide guidance on designing generative models depending on the requirements for flexibility and identifiability. In particular, we provide concrete suggestions for modern expressive models that do not constrain the generator, such that no indeterminacies remain, i.e., strongly identifiable models. These models are either existing methodology (in the case of TMI flows), or small modifications thereof (in the case of the strongly identifiable VAE).

We believe that the framework and results developed in this thesis represent a first step towards modern, interpretable generative modelling. However, substantially more work needs to be done to truly understand when and how latent variables can be safely interpreted and used to advance scientific inquiry.
In what follows, we outline a non-exhaustive list of follow-up work that can contribute to this effort.

Perturbations in Latent Space

A key part of interpreting latent variables is understanding the effects of a perturbation in the latent space. One such attempt is the example given in Section 3.1, which tries to quantify the effect of some physical intervention $p = 1$ on the data via an offset vector $\delta$ in the latent space.

In Section 3.1, we mentioned that $\delta$ is not a well-defined target in the presence of indeterminacies. However, what else is required to make such a $\delta$ interpretable as the effect of an intervention? Furthermore, is there a path forward where we can discover new, scientifically meaningful interventions simply by manipulating the latent space of a trained generative model?

Assessing Stability and Robustness of Generative Models

Stability and robustness to small changes, either in the training dataset or in the distribution underlying it, are important assessments to make when designing interpretable models. In non-identifiable models, these assessments are difficult, if not impossible, to make, given that the model can be arbitrarily different even under the same dataset or training distribution. Strongly identifiable models eliminate this issue and isolate the effects of the small changes, which can be useful for understanding the stability and robustness of generative models more generally.

Compressing Latent Compressions

Though we have focused on strongly identifiable models in this thesis, our results can also be useful for characterizing the indeterminacy sets of new generative models. Specifically, our definition of $\mathcal{A}$-identifiability implies that a weakly identifiable model is not yet "fully compressed", in the following sense.

The indeterminacy set $\mathcal{A}$ represents symmetries of the latent space under the model.
For example, it is well known that factor analysis in a single environment yields a rotation symmetry in the latent space. This means that, for each noiseless observation $X$, we may associate a latent value $Z$, along with any "rotated" value of it, giving the equivalence class induced by the orthogonal matrices:

\[
[Z] = \{ RZ \mid R \in O(d_z) \}. \tag{6.1}
\]

Equivalently, we can describe the possible latent values generating any single observation $X$ by the orbits of the group action of $O(d_z)$ on $\mathbb{R}^{d_z}$ (i.e., the quotient space $\mathbb{R}^{d_z}/O(d_z)$), which can be identified with $[0, \infty)$.

A fitted factor model corresponds to some rotation $R^* \in O(d_z)$. Assuming a well-specified model and perfect estimation for two observations $x_1$, $x_2$, the true latent values $z_1$, $z_2$ are estimated as $\hat{z}_1 = R^* z_1$, $\hat{z}_2 = R^* z_2$. In general, $\hat{z}_1$, $\hat{z}_2$ can be arbitrarily different from $z_1$, $z_2$, except for the fact that $\|\hat{z}_i\|_2 = \|z_i\|_2$. In essence, our knowledge about the true latent variables is summarized by the 2-norm, which identifies an orbit $[Z]$.

The above example reveals that fitting a $d_z$-dimensional factor model without any identifiability modifications, even if perfectly specified, only allows us to recover a 1-dimensional summary of the true factors. The same type of claim holds for any model with a Gaussian latent distribution, such as the VAE (though due to additional symmetries that are not necessarily linear, the compressed space is even smaller than $[0, \infty)$).

This begs the question: is there any theoretical benefit to fitting larger factor models, even if we had access to perfect inference? Empirically, larger dimensions $d_z$ in VAEs, which in theory share the same problem, appear to better capture higher dimensional datasets. Is there a way to characterize this gain in terms of symmetry? Or are these benefits purely in terms of statistical efficiency?

Bibliography

Aguilar, O. and West, M. (2000). Bayesian dynamic factor models and portfolio allocation.
Journal of Business & Economic Statistics, 18(3).

Ahuja, K., Hartford, J., and Bengio, Y. (2022). Properties from mechanisms: An equivariance perspective on identifiable representation learning. In ICLR 2022.

Amos, B., Xu, L., and Kolter, J. (2017). Input convex neural networks. In ICML 2017.

Anderson, T. W. and Rubin, H. (1956). Statistical inference in factor analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1. Univ of California Press.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

Bai, J. and Ng, S. (2013). Principal components estimation and identification of static factors. Journal of Econometrics, 176(1).

Bogachev, V. I. (2007). Measure Theory. Springer.

Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Science, 35(3):404–426.

Carlier, G., Galichon, A., and Santambrogio, F. (2010). From Knothe's transport to Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6).

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3):287–314.

Ding, J. and Regev, A. (2021). Deep generative model embedding of single-cell RNA-seq profiles on hyperspheres and hyperbolic spaces. Nature Communications, 12(1).

Dinh, L., Krueger, D., and Bengio, Y. (2015). NICE: Non-linear independent components estimation. In Workshop at ICLR 2015.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In ICLR 2017.

Gresele, L., Kügelgen, J. V., Stimper, V., Schölkopf, B., and Besserve, M. (2021). Independent mechanism analysis, a new concept? In NeurIPS 2021.
\u2192 page 42Ha\u00a8lva\u00a8, H., Corff, S. L., Lehe\u00b4ricy, L., So, J., Zhu, Y., Gassiat, E., and Hyva\u00a8rinen, A.(2021). Disentangling identifiable features from noisy data with structurednonlinear ICA. In NeurIPS 2021. \u2192 pages 14, 19Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed,S., and Lerchner, A. (2017). beta-vae: Learning basic visual concepts with aconstrained variational framework. In ICLR 2017. \u2192 pages 2, 20Huang, C.-W., Chen, R. T. Q., Tsirigotis, C., and Courville, A. (2021). Convexpotential flows: Universal probability distributions with optimal transport andconvex optimization. In ICLR 2021. \u2192 page 49Hyva\u00a8rinen, A. and Morioka, H. (2016). Unsupervised feature extraction bytime-contrastive learning and nonlinear ICA. In NeurIPS 2016. \u2192 pages19, 20, 33Hyva\u00a8rinen, A. and Morioka, H. (2017). Nonlinear ICA of temporally dependentstationary sources. In AISTATS 2017. \u2192 pages 19, 20, 33Hyva\u00a8rinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis:Existence and uniqueness results. Neural Networks, 12(3):429\u2013439. \u2192 page 3Hyva\u00a8rinen, A., Sasaki, H., and Turner, R. E. (2018). Nonlinear ICA usingauxiliary variables and generalized contrastive learning. In AISTATS 2019,pages 859\u2013868. \u2192 pages 19, 20, 33, 3758Irons, N. J., Scetbon, M., Pal, S., and Harchaoui, Z. (2021). Triangular flows forgenerative modeling: Statistical consistency, smoothness classes, and fast rates.arXiv preprint arXiv:2112.15595. \u2192 pages 32, 49Jaini, P., Selby, K. A., and Yu, Y. (2019). Sum-of-squares polynomial flow. InICML 2019. \u2192 page 49Kechris, A. S. (1995). Classical Descriptive Set Theory. Springer. \u2192 pages5, 10, 11Khemakhem, I., Kingma, D. P., Monti, R. P., and Hyva\u00a8rinen, A. (2020).Variational autoencoders and nonlinear ICA: A unifying framework. InAISTATS 2020. \u2192 pages 20, 23, 24, 32, 33, 37, 38, 39, 74, 75Kim, H. and Mnih, A. (2018). 
Disentangling by factorising. In ICML 2018, pages 2649–2658.

Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. In NeurIPS 2016.

Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. M. (2021). Towards nonlinear disentanglement in natural data with temporal sparse coding. In ICLR 2021.

Lawley, D. N. and Maxwell, A. E. (1962). Factor analysis as a statistical method. Journal of the Royal Statistical Society, Series D (The Statistician), 12(3):209–229.

Lotfollahi, M., Wolf, F. A., and Theis, F. J. (2019). scGen predicts single-cell perturbation responses. Nature Methods, 16(8):715–721.

Lu, C., Wu, Y., Hernández-Lobato, J. M., and Schölkopf, B. (2022). Invariant causal representation learning for out-of-distribution generalization. In ICLR 2022.

Maritz, J. and Lwin, T. (1989). Empirical Bayes Methods. CRC Press.

Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. (2019). Continuous hierarchical representations with Poincaré variational auto-encoders. In NeurIPS 2019.

Mulaik, S. A. (2009). Foundations of Factor Analysis. Chapman and Hall/CRC.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64.

Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. In NeurIPS 2017.

Peters, J., Bühlmann, P., and Meinshausen, N. (2016).
Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012.

Rohe, K. and Zeng, M. (2020). Vintage factor analysis with varimax performs statistical inference. arXiv preprint arXiv:2004.05387.

Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser Cham.

Schilling, R. L. (2005). Measures, Integrals and Martingales. Cambridge University Press.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.

Sorrenson, P., Rother, C., and Köthe, U. (2020). Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). In ICLR 2020.

Spearman, C. (1904). "General intelligence", objectively determined and measured. The American Journal of Psychology, 15(2).

Steiger, J. H. (1979). Factor indeterminacy in the 1930's and the 1970's: Some interesting parallels. Psychometrika, 44(2).

Torous, W., Gunsilius, F., and Rigollet, P. (2021). An optimal transport approach to causal inference. arXiv preprint arXiv:2108.05858.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.

Vidal, R., Bruna, J., Giryes, R., and Soatto, S. (2017). Mathematics of deep learning. arXiv preprint arXiv:1712.04741.

Wang, Y., Blei, D., and Cunningham, J. P. (2021). Posterior collapse and latent variable non-identifiability. In NeurIPS 2021.

Wehenkel, A. and Louppe, G. (2019). Unconstrained monotonic neural networks. In NeurIPS 2019.

Wilson, E. B. (1929). Comment on Professor Spearman's note. Journal of Educational Psychology, 20(3).
\u2192 pages 17, 53Xi, Q. and Bloem-Reddy, B. (2022). Indeterminacy in latent variable models:Characterization and strong identifiability. arXiv preprint arXiv:2206.00801. \u2192page vZhou, D. and Wei, X.-X. (2020). Learning identifiable and interpretable latentmodels of high-dimensional neural activity using pi-VAE. In NeurIPS 2020. \u2192pages 20, 75C\u00b8inlar, E. (2011). Probability and stochastics, volume 261. Springer. \u2192 pages5, 6, 7, 8, 961Appendix ASupporting MaterialsA.1 Detailed Linear ExamplesA.1.1 Example: Factor AnalysisWe present a simple example of using multiple environments, and a basis of latentdistributions, to obtain identifiability, using only linear algebra concepts. Thisexample also provides intuition for the minimality of the number of environments.That is, for this 2-d latent space, three environments is enough to obtain strongidentifiability, while two environments is insufficient.Suppose two competing linear generative models for a random vector x \u2208 R10with latent vector z \u2208 R2, for data arising from three environments indexed bye = 1,2,3:z(e) \u223c N(\u00b5, I2\u00d72)\u03b5 \u223c N(\u00b5, I10\u00d710)y(e) = \u03b1+Fz(e)+ \u03b5 (A.1)z(e) \u223c N(\u00b5e, I2\u00d72)\u03b5 \u223c N(\u00b5, I10\u00d710)x(e) = \u03b1+Fz(e)+ \u03b5. (A.2)62The left model is a single environment model, while in the right model, two ofthe \u00b5e are linearly independent, i.e., a multiple environment model. Note that thegenerator function here isg(z) = \u03b1+Fz, (A.3)where F is a full rank 10\u00d72 matrix, and \u03b1 is an offset vector in data space, fixedfor all environments. 
For each environment, we have the marginal distribution under the multiple environment model:

\[
x^{(e)} \sim N(\alpha + F\mu_e,\; FF^\top + I_{10\times 10}). \tag{A.4}
\]

Recall that the Gaussian distribution is characterized entirely by its mean and covariance. That is, for marginal distributions parametrized by $\theta_1 = (\alpha_1, F_1)$ and $\theta_2 = (\alpha_2, F_2)$:

\[
P_{\theta_1, e} = P_{\theta_2, e} \iff \alpha_1 + F_1\mu_e = \alpha_2 + F_2\mu_e, \quad F_1 F_1^\top = F_2 F_2^\top. \tag{A.5}
\]

To say that this model is strongly identifiable means that the right-hand side equalities for each $e$ imply $\alpha_1 = \alpha_2$ and $F_1 = F_2$.

In the single environment model, there are the following constraints:

\[
\alpha_1 + F_1\mu = \alpha_2 + F_2\mu \tag{A.6}
\]
\[
F_1 F_1^\top = F_2 F_2^\top. \tag{A.7}
\]

The single environment model is not identifiable. For example, let $R$ be an orthogonal (rotation) matrix, and let $F_2 = F_1 R$ and $\alpha_2 = \alpha_1 - F_1 R\mu + F_1\mu$. We have

\[
\alpha_2 + F_2\mu = \alpha_1 - F_1 R\mu + F_1\mu + F_1 R\mu = \alpha_1 + F_1\mu \tag{A.8}
\]
\[
F_2 F_2^\top = F_1 R R^\top F_1^\top = F_1 F_1^\top, \tag{A.9}
\]

where the last equality is due to $R$ being an orthogonal matrix. This is a classical case of exploiting the rotational invariance of the Gaussian to construct a non-identifiable example.

Now, we analyze the multiple environment model. To be explicit, the three environments impose the following constraints:

\[
\alpha_1 + F_1\mu_1 = \alpha_2 + F_2\mu_1 \tag{A.10}
\]
\[
\alpha_1 + F_1\mu_2 = \alpha_2 + F_2\mu_2 \tag{A.11}
\]
\[
\alpha_1 + F_1\mu_3 = \alpha_2 + F_2\mu_3 \tag{A.12}
\]
\[
F_1 F_1^\top = F_2 F_2^\top. \tag{A.13}
\]

We can show directly that these constraints imply that $\alpha_1 + F_1 z = \alpha_2 + F_2 z$. First, assume that $\mu_1$ and $\mu_2$ are the linearly independent pair. Then, taking differences,

\[
F_1(\mu_1 - \mu_3) = F_2(\mu_1 - \mu_3) \tag{A.14}
\]
\[
F_1(\mu_2 - \mu_3) = F_2(\mu_2 - \mu_3) \tag{A.15}
\]
\[
F_1 F_1^\top = F_2 F_2^\top. \tag{A.16}
\]
Written in matrix form, with $M = [\mu_1 - \mu_3 \;\; \mu_2 - \mu_3]$, the first two constraints read

\[
F_1 M = F_2 M \implies F_1 = F_2, \tag{A.17}
\]

since $\mu_1 - \mu_3$ and $\mu_2 - \mu_3$ remain linearly independent, and hence $M$ is invertible. It immediately follows from the original constraints that $\alpha_1 = \alpha_2$ also.

The above analysis showed that, for identifiability, a single environment was insufficient, while three environments were adequate. This begs the question: what about two environments? In other words, is the three-environment constraint minimal?

Consider a model with two environments with means $\mu_1$, $\mu_2$. By the arguments above, it imposes the following constraints:

\[
\alpha_1 + F_1\mu_1 = \alpha_2 + F_2\mu_1 \tag{A.18}
\]
\[
\alpha_1 + F_1\mu_2 = \alpha_2 + F_2\mu_2 \tag{A.19}
\]
\[
F_1 F_1^\top = F_2 F_2^\top. \tag{A.20}
\]

Can we construct a non-identifiable example? Let $F_2 = F_1 R$ and $\alpha_2 = \alpha_1 - F_1 R\mu_1 + F_1\mu_1$, as in the single-environment case. Clearly, these satisfy the first and third constraints for any orthogonal matrix $R$. We aim to find a specific orthogonal matrix that also satisfies the second constraint. Observe that:

\[
\alpha_2 + F_2\mu_2 = \alpha_1 - F_1 R\mu_1 + F_1\mu_1 + F_1 R\mu_2 \tag{A.21}
\]
\[
= \alpha_1 + F_1\mu_1 + F_1 R(\mu_2 - \mu_1). \tag{A.22}
\]

Let $x$ be a vector orthogonal to $\mu_2 - \mu_1$, standardized such that $\|x\|_2 = \|\mu_2 - \mu_1\|_2$. Consider

\[
R = \frac{1}{\|\mu_2 - \mu_1\|_2^2}
\begin{bmatrix} | & | \\ \mu_2 - \mu_1 & x \\ | & | \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\begin{bmatrix} | & | \\ \mu_2 - \mu_1 & x \\ | & | \end{bmatrix}^\top. \tag{A.23}
\]

This is the eigendecomposition of an orthogonal matrix (it is a product of orthogonal matrices) with eigenvalues $1$ and $-1$, and corresponding eigenvectors $\mu_2 - \mu_1$ and $x$. Since $\mu_2 - \mu_1$ is an eigenvector with eigenvalue $1$, we have $R(\mu_2 - \mu_1) = \mu_2 - \mu_1$.¹ Then, we have

\[
\alpha_2 + F_2\mu_2 = \alpha_1 + F_1\mu_1 + F_1\mu_2 - F_1\mu_1 = \alpha_1 + F_1\mu_2, \tag{A.24}
\]

which satisfies the second constraint as desired.
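The construction above can be checked numerically. The sketch below (illustrative only; the particular $F_1$, $\alpha_1$, $\mu_1$, $\mu_2$ are arbitrary choices) builds the matrix $R$ with eigenvectors $\mu_2 - \mu_1$ and an orthogonal vector of the same length, and verifies that both environments' marginal means and the shared covariance match even though $(\alpha_1, F_1) \neq (\alpha_2, F_2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
F1 = rng.normal(size=(10, 2))
a1 = rng.normal(size=10)
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])

# R has eigenvector mu2 - mu1 with eigenvalue 1, and an orthogonal
# vector of the same 2-norm with eigenvalue -1.
d = mu2 - mu1
x = np.array([-d[1], d[0]])                 # orthogonal to d, same norm
U = np.column_stack([d, x]) / np.linalg.norm(d)
R = U @ np.diag([1.0, -1.0]) @ U.T

F2 = F1 @ R
a2 = a1 - F1 @ R @ mu1 + F1 @ mu1

assert np.allclose(R @ R.T, np.eye(2))          # R is orthogonal
assert np.allclose(a1 + F1 @ mu1, a2 + F2 @ mu1)  # first-environment mean
assert np.allclose(a1 + F1 @ mu2, a2 + F2 @ mu2)  # second-environment mean
assert np.allclose(F1 @ F1.T, F2 @ F2.T)          # shared covariance
assert not np.allclose(F1, F2)                    # yet the parameters differ
```

Both parametrizations produce identical Gaussian marginals in both environments, so two environments indeed fail to identify $(\alpha, F)$.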
This shows that two environments are insufficient; three environments are therefore required, and hence minimal, for strong identifiability of this model.

¹ $R$ is essentially a reflection about the axis $\mu_2 - \mu_1$.

Note that such a construction will not work for the three-environment model. For three environments, the matrix $R$ would have to satisfy both

\[
R(\mu_3 - \mu_1) = \mu_3 - \mu_1 \tag{A.25}
\]
\[
R(\mu_2 - \mu_1) = \mu_2 - \mu_1, \tag{A.26}
\]

that is, the eigenspace of $R$ associated with the eigenvalue $1$ spans $\mathbb{R}^2$, i.e., $R$ is the identity.

A.1.2 Linear, non-Gaussian ICA

Consider a generative model (Equation (3.1)) with $Z = \mathbb{R}^{d_z}$ and $X = \mathbb{R}^{d_x}$. Assume $d_x \geq d_z$. Let the generator parameter space be $\mathcal{F} = \{A \in \mathbb{R}^{d_x \times d_z} : \mathrm{rank}(A) = d_z\}$. That is, the generators are full-rank linear transformations, and hence injective. Let the prior parameter space be

\[
\mathcal{P}_z = \Big\{ p(z) = \prod_{i=1}^{d_z} p_i(z_i) \;:\; p_i \text{ are non-Gaussian, and not a point mass} \Big\}, \tag{A.27}
\]

i.e., probability distributions on $\mathbb{R}^{d_z}$ with a density, where the density factorizes into independent, non-Gaussian components.

The identifiability of this problem was first studied in (Comon, 1994).² In their analysis, identifiability is established up to a diagonal scaling and a permutation. That is, for generators $F_a, F_b \in \mathcal{F}$ with $P_{z,a}, P_{z,b} \in \mathcal{P}_z$, if the marginal distributions of $X$ match, then $F_a = F_b \Lambda P$, where $\Lambda$ is an invertible diagonal matrix and $P$ is a permutation matrix.

Under our framework, i.e., Lemma 4.2, we must have that $P_{z,b} = P_{z,a} \circ A_{a,b}^{-1}$, where $F_a = F_b \circ A_{a,b}$. Using our framework, we now show that $A_{a,b} = \Lambda P$ as above.

² In the original analysis, the model is fit according to a criterion maximizing the independence between components, and one component of the latent is also allowed to be Gaussian.
For simplicity, we study only the implications of matching observational marginal distributions (i.e., maximum likelihood), in the case where all latent components are non-Gaussian.

The identifiability result obtained in (Comon, 1994) rests on the following result (restated and re-proved to match our notation):

Theorem A.1 (Theorem 10, (Comon, 1994)). Let $z$ be a random vector with factorized density. Let $x = Cz$, such that $x$ also has a factorized density. Then, if $z_j$ is non-Gaussian, the $j$-th column of $C$ has at most one non-zero entry.

Proof. We require Theorem 19 from (Comon, 1994).

Lemma A.2 ((Comon, 1994), Darmois' Theorem). Define two random variables $Z_1$ and $Z_2$ as

\[
Z_1 = \sum_i a_i z_i, \qquad Z_2 = \sum_i b_i z_i, \tag{A.28}
\]

where the $z_i$ are independent random variables, i.e., their joint distribution factorizes. Then, if $Z_1$ and $Z_2$ are independent, all variables $z_j$ for which $a_j b_j \neq 0$ are Gaussian.

Now, let $z$ be a random vector with factorized density and $x = Cz$, where $x$ also has a factorized density. Note that this implies that any $x_i$, $x_k$ are independent for $i \neq k$. We have that

\[
x_i = \sum_j C_{ij} z_j, \qquad x_k = \sum_j C_{kj} z_j, \tag{A.29}
\]

and hence by Lemma A.2, if $z_j$ is non-Gaussian, it must be that $C_{ij} C_{kj} = 0$. This holds for each $i \neq k$, and hence the $j$-th column has at most one non-zero entry. □

Recall that the definition of $\mathcal{P}_z$ is such that any prior must factorize and be non-Gaussian. Then, Theorem A.1 implies that

\[
\mathcal{A}(\mathcal{P}_z) \cap \mathbb{R}^{d_z \times d_z} = \{ A \in \mathbb{R}^{d_z \times d_z} \mid A \text{ has no column with more than one nonzero entry} \}. \tag{A.30}
\]

That is, any linear isomorphism between two priors must have no column with more than one nonzero element. Now, note that for any $A_{a,b} \in \mathcal{A}(\mathcal{F})$, we have

\[
A_{a,b} = f_b^{-1} \circ F_a, \tag{A.31}
\]

where $f_b^{-1}$ is the restriction of the linear map represented by the pseudoinverse $F_b^\dagger$ to the range of $\mathcal{F}$. By Lemma 4.1, $A_{a,b}$ is an invertible linear map and hence full rank. Finally, we conclude that for any $A_{a,b} \in \mathcal{A}(\mathcal{F}) \cap \mathcal{A}(\mathcal{P})$, $A_{a,b}$ must have exactly one nonzero element in each column.
We can then apply a permutation $P$ such that $P A_{a,b} = \Lambda$, where $\Lambda$ is diagonal. Finally, we obtain $A_{a,b} = P^\top \Lambda$, where $P^\top$ is a permutation matrix and $\Lambda$ is diagonal and invertible.

A.2 Detailed Proofs

A.2.1 Proof of Lemma 4.2

To make the argument in the proof of Lemma 4.2 precise, we need to construct the measurable space on which the inverse $f^{-1} : \mathcal{F}(Z) \to Z$ is defined. That is, we need to attach a $\sigma$-algebra to $\mathcal{F}(Z)$.

We recall that $\mathcal{F}$ is a set of Borel isomorphisms (in particular, measurable functions) between $Z$ and $X$. Now, recall that the pushforward $\sigma$-algebra of the bijection $f : Z \to \mathcal{F}(Z)$ is the following collection of subsets of $\mathcal{F}(Z)$:

\[
\sigma(f) = \{ B \subset \mathcal{F}(Z) \mid f^{-1}(B) \in \mathcal{B}(Z) \}. \tag{A.32}
\]

Recall that this is the smallest $\sigma$-algebra that makes $f$ measurable.

Lemma A.3. Suppose $f \in \mathcal{F}$. Then, $\sigma(f)$ contains only Borel sets. In other words, $\sigma(f) \subset \mathcal{B}(X)$.

Proof. Let $C \in \sigma(f)$. By definition, $f^{-1}(C)$ is Borel. Since $f$ is a Borel isomorphism when defined on $\mathcal{F}(Z)$ (Lemma 2.4), $f(f^{-1}(C)) = C$ must be Borel. □

Further, all generators in $\mathcal{F}$ induce the same pushforward $\sigma$-algebra.

Lemma A.4. For $f_a, f_b \in \mathcal{F}$, $\sigma(f_a) = \sigma(f_b)$.

Proof. To see that $\sigma(f_a) \subset \sigma(f_b)$, suppose $C \in \sigma(f_a)$. By Lemma A.3, $C$ is Borel, which means that $f_b^{-1}(C)$ is Borel by measurability. Hence, $C \in \sigma(f_b)$. We have $\sigma(f_b) \subset \sigma(f_a)$ by the exact same argument, which implies that $\sigma(f_a) = \sigma(f_b)$. □

We denote this shared $\sigma$-algebra by $\sigma(\mathcal{F})$. The reason we work with $\sigma(\mathcal{F})$ is to construct the measurable space $(\mathcal{F}(Z), \sigma(\mathcal{F}))$.
Note the following facts about $(\mathcal{F}(Z), \sigma(\mathcal{F}))$:

• For any $f \in \mathcal{F}$, $f : Z \to \mathcal{F}(Z)$ is bijective, and $f^{-1} : \mathcal{F}(Z) \to Z$ is well defined.
• For any $f \in \mathcal{F}$ and any Borel set $B \in \mathcal{B}(Z)$, its image $f(B) \subset \mathcal{F}(Z)$ is also the pre-image of $B$ under $f^{-1}$; that is, $(f^{-1})^{-1}(B) = f(B)$.
• Since $\sigma(\mathcal{F}) \subset \mathcal{B}(X)$, if any measures are equal on $\mathcal{B}(X)$, then they are also equal on $\sigma(\mathcal{F})$.

We are now ready to present the rigorous proof of Lemma 4.2.

Proof of Lemma 4.2. ($\implies$): Recall that $A_{a,b}$ and $A_{a,b}^{-1}$ are measurable. By Assumption 3.1 in the main text, $P_{\theta_a} = P_{\theta_b}$ implies that $P_{z,a} \circ f_a^{-1} = P_{z,b} \circ f_b^{-1}$ on $\mathcal{B}(X)$, which implies equality also on $\sigma(\mathcal{F})$. Let $B \in \mathcal{B}(Z)$. Then,

\[
P_{z,a}(A_{a,b}^{-1}(B)) = P_{z,a}(f_a^{-1}(f_b(B))) = P_{z,b}(f_b^{-1}(f_b(B))) = P_{z,b}(B), \tag{A.33}
\]

where the first equality is by definition (working on $\sigma(\mathcal{F})$), the second equality is due to $P_{z,a} \circ f_a^{-1} = P_{z,b} \circ f_b^{-1}$, and the third equality is due to injectivity. Since $B$ was arbitrary, this shows that $P_{z,a} \circ A_{a,b}^{-1} = P_{z,b}$. To see that $P_{z,a} = P_{z,b} \circ A_{a,b}$, simply swap the roles of the indices $a$ and $b$.

($\impliedby$): We show the contrapositive statement, i.e., that at least one of the following implications holds:

\[
P_{z,b} \neq P_{z,a} \circ A_{a,b}^{-1} \implies P_{\theta_a} \neq P_{\theta_b} \tag{A.34}
\]
\[
P_{z,a} \neq P_{z,b} \circ A_{a,b} \implies P_{\theta_a} \neq P_{\theta_b}. \tag{A.35}
\]

Without loss of generality, suppose that $P_{z,a} \neq P_{z,b} \circ A_{a,b}$ (the same argument works for $P_{z,b} \neq P_{z,a} \circ A_{a,b}^{-1}$ by swapping arguments). Note that by Assumption 3.1, it is equivalent to show that $P_{z,a} \circ f_a^{-1} \neq P_{z,b} \circ f_b^{-1}$. That is, we aim to find some $B \in \mathcal{B}(X)$ such that

\[
P_{z,a}(f_a^{-1}(B)) \neq P_{z,b}(f_b^{-1}(B)). \tag{A.36}
\]

To construct such a $B$, let $B^* \in \mathcal{B}(Z)$ be a set witnessing the hypothesis, i.e.,

\[
P_{z,a}(B^*) \neq P_{z,b}(A_{a,b}(B^*)) = P_{z,b}(f_b^{-1}(f_a(B^*))). \tag{A.37}
\]
We have $f_a(B^*) \subset X$, which is a Borel set by Lemma 2.4. Then,

\[
P_{z,b}(f_b^{-1}(f_a(B^*))) \neq P_{z,a}(B^*) = P_{z,a}(f_a^{-1}(f_a(B^*))), \tag{A.38}
\]

and hence taking $B = f_a(B^*)$ shows that $P_{z,a} \circ f_a^{-1} \neq P_{z,b} \circ f_b^{-1}$. □

A.2.2 Proof of Proposition 5.2

This section proves identifiability for the equivariant stochastic mechanisms model.

Proof of Proposition 5.2. We can analyze this model in our framework using just two time points, $t = 1, 2$. We work on an augmented latent space $\tilde{Z} = Z \times [0,1]$ and treat the random variables $U_t$ as additional latent variables (i.e., as a "noiseless" case under our framework). For a generator $f : Z \to X$, we extend it to $\tilde{f} : \tilde{Z} \to X \times [0,1]$, $\tilde{f}(z,u) = (f(z), u)$. The identity extension ensures that $\tilde{f}$ is still injective, and is unique to $f$. Now suppose $f_a$ and $f_b$ are such that the distributions of $X_1$ and $X_2 \mid X_1$ match. Note that the marginal and conditional uniquely determine the joint, and hence we simply assume that the joint, and hence the marginal, distributions of $X_1$ and $X_2$ match.

Let the joint distribution of $Z_1$ and $U$ be denoted $\pi_{Z_1,U}$. Since they are independent, we have that $\pi_{Z_1,U} = P_1 \otimes U[0,1]$.³ We also extend the mechanism $m$ as $\tilde{m}(z,u) = (m(z,u), u)$, implying that $\tilde{m}^{-1}(B_z \times B_u) = m^{-1}(B_z) \times B_u$. Since $Z_2 = m(Z_1, U_1)$, this then implies that $P_2 = \pi_{Z_1,U} \circ \tilde{m}^{-1} = (P_1 \circ m^{-1}) \otimes U[0,1]$ (note the standard $m$ on the right-hand side). The same applies to an extended indeterminacy map, i.e., $P_2 \circ \tilde{A}_{a,b}^{-1} = (P_1 \circ A_{a,b}^{-1}) \otimes U[0,1]$.

We now apply Lemma 4.2 to $t = 1$, where $Z_1$ has fixed distribution $P_1$ (i.e., it is a singleton), and to $t = 2$, where the latent distribution may vary with the mechanism $m_a$ or $m_b$, denoted $P_{2,a}$, $P_{2,b}$. As a result, we obtain

\[
P_1 = P_1 \circ A_{a,b}^{-1}, \qquad P_{2,b} = P_{2,a} \circ \tilde{A}_{a,b}^{-1}. \tag{A.39}
\]
(A.39)3This means that for a Borel product Bz\u00d7Bu, where Bz, Bu are Borel sets in their respectivedomains, we have \u03c0X1,U (Bz\u00d7Bu) = (P1(Bz))(U [0,1](Bu)).71Applying these identities simultaneously to P2,b givesP2,b = (P1 \u25e6m\u22121b )\u2297 (U [0,1]) = (P1 \u25e6A\u22121a,b \u25e6m\u22121b )\u2297 (U [0,1]) (A.40)P2,b = P2,a \u25e6 A\u02dc\u22121a,b = (P1 \u25e6m\u22121a \u25e6A\u22121a,b)\u2297 (U [0,1]), (A.41)which by the properties of a product measure, means that(P1 \u25e6A\u22121a,b \u25e6m\u22121b ) = (P1 \u25e6m\u22121a \u25e6A\u22121a,b). (A.42)Writing the above in terms of their random variables, we have mb(Aa,b(Z),U)d=Aa,b(ma(Z,U)) for Z with any fixed distribution P1 independent of U .A.2.3 Proof of Proposition 5.3This section details the series implications that prove Proposition 5.3. We firstprove some general results about densities under measure isomorphisms.Lemma A.5. Suppose probability measures Pz,a,Pz,b admits strictly positive densi-ties pa, pb. Suppose A is a (Pz,a, Pz,b)-measure isomorphism. Then,pb(A(x))kA(x) = pa(x) a.e., (A.43)where kA depends only on A and is strictly positive a.e..Proof. Since A is a (Pz,a, Pz,b)-measure isomorphism and Pz,a,Pz,b are equivalentto \u03bbz, we have that for a Borel set B,\u03bbz(B) = 0 \u21d0\u21d2 Pz,a(B) = 0 \u21d0\u21d2 Pz,b(A(B)) = 0 \u21d0\u21d2 \u03bbz(A(B)) = 0, (A.44)where the first and third equivalences are because Pz,a and Pz,b are equivalent to \u03bbz.This shows that \u03bbz \u25e6A is equivalent to \u03bbz, and hence it has an a.e.-strictly positivedensity kA. Then, by the definition of the density (Theorem 2.2), we have for a72Borel set B,Pz,a(B) = Pz,b(A(B)) (A.45)\u21d0\u21d2\u222bBpa(x)\u03bbz(dx) =\u222bA(B)pb(x)\u03bbz(dx) =\u222bBpb(A(x))\u03bbz(A(dx)), (A.46)where the last equality is by the standard change of variables formula, noting thatB = A\u22121(A(B)) since A is invertible. 
Now, we have that
$$\int_B p_a(x)\, \lambda_z(dx) = \int_B p_b(A(x))\, k_A(x)\, \lambda_z(dx), \tag{A.47}$$
by invoking the definition of the density again. Since the above holds for any $B$, we have (Lemma 2.1):
$$p_b(A(x))\, k_A(x) = p_a(x) \quad \text{a.e.}, \tag{A.48}$$
where $k_A(x)$ is strictly positive a.e. $\square$

Corollary A.6. Suppose four probability measures $P_{1,a}, P_{2,a}, P_{1,b}, P_{2,b}$ have strictly positive densities $p_{1,a}, p_{2,a}, p_{1,b}, p_{2,b}$. For $A$ both a $(P_{1,a}, P_{1,b})$-measure isomorphism and a $(P_{2,a}, P_{2,b})$-measure isomorphism, we have
$$\frac{p_{1,a}}{p_{2,a}}(x) = \frac{p_{1,b}}{p_{2,b}}(A(x)) \quad \text{a.e.} \tag{A.49}$$

Proof. This follows immediately from Lemma A.5 and the fact that $k_A$ is strictly positive a.e. and depends only on $A$. $\square$

We are now ready to prove Proposition 5.3.

Proof of Proposition 5.3. Fix $j = K + 1$, i.e., such that $u_j$ is not in the linearly independent subset. From Corollary A.6 and by taking logarithms, we have for each $i \neq j$,
$$\eta_a(u_i)^\top T_a(z) - a(\eta_a(u_i)) - \big(\eta_a(u_j)^\top T_a(z) - a(\eta_a(u_j))\big) \tag{A.50}$$
$$= \eta_b(u_i)^\top T_b(A(z)) - a(\eta_b(u_i)) - \big(\eta_b(u_j)^\top T_b(A(z)) - a(\eta_b(u_j))\big), \tag{A.51}$$
almost everywhere, which simplifies to
$$(\eta_a(u_i) - \eta_a(u_j))^\top T_a(z) - c_a(u_i) = (\eta_b(u_i) - \eta_b(u_j))^\top T_b(A(z)) - c_b(u_i), \tag{A.52}$$
almost everywhere. Here $c_a, c_b$ are differences of the normalizing constants $a(\eta_a)$, and do not depend on $z$; we suppress the dependency on $u$ for convenience. Written in matrix form, we have
$$\begin{bmatrix} \eta_a(u_0) - \eta_a(u_j) \\ \vdots \\ \eta_a(u_K) - \eta_a(u_j) \end{bmatrix}^\top T_a(z) = \begin{bmatrix} \eta_b(u_0) - \eta_b(u_j) \\ \vdots \\ \eta_b(u_K) - \eta_b(u_j) \end{bmatrix}^\top T_b(A(z)) + c, \tag{A.53}$$
almost everywhere, where $c$ is the vector of differences $c_a - c_b$. Following Khemakhem et al. (2020), we will call these two matrices $L_a$ and $L_b$, noting that they are invertible since their rows are linearly independent by assumption.
Then, we obtain
$$L_a^\top T_a(z) = L_b^\top T_b(A(z)) + c \tag{A.54}$$
$$\implies T_b(A(z)) = (L_a L_b^{-1})^\top T_a(z) - (L_b^{-1})^\top c \tag{A.55}$$
$$\implies T_b(A(z)) = L^\top T_a(z) + d, \tag{A.56}$$
almost everywhere, where $L = L_a L_b^{-1}$ is invertible and $d = -(L_b^{-1})^\top c$. $\square$

A.2.4 Proof of Proposition 5.7

Proposition 5.7 is in essence a special case of Proposition 5.3. Its proof is also a straightforward application of Corollary A.6.

Proof of Proposition 5.7. The expression is a direct consequence of Corollary A.6, obtained by plugging in the exponential family densities $p_{1,a} = p_{1,b} = p_1$ and likewise for $p_2$. Taking logarithms on both sides, we have
$$\eta_1^\top T(z) - \eta_2^\top T(z) - a(\eta_1) + a(\eta_2) = \eta_1^\top T(A(z)) - \eta_2^\top T(A(z)) - a(\eta_1) + a(\eta_2) \tag{A.57}$$
$$\implies (\eta_1 - \eta_2)^\top T(z) = (\eta_1 - \eta_2)^\top T(A(z)) \quad \text{a.e.} \tag{A.58}$$
$\square$

A.3 Discrete Observations

In this section, we briefly discuss models with discrete observations, e.g., Bernoulli with probability parameter given by $f(z)$, or Poisson with mean parameter $f(z)$. Such models were briefly discussed in iVAE (Khemakhem et al., 2020), as well as in follow-up work such as the pi-VAE (Zhou and Wei, 2020). In short, the framework developed in this thesis rests on bijective generators, which enable the recovery of unique latent codes for each observed value. As noted in a correction to Khemakhem et al. (2020), this task seems fundamentally impossible when, for example, the latent space is uncountable and the outcome is discrete, due to the lack of a bijective map between spaces of different cardinality. However, generative models do not typically send a latent variable to the outcome, but rather to a parameter value of a conditional distribution. This allows us to reformulate the assumptions required for our theory, although, as we will see shortly, most discrete outcome models do not satisfy these assumptions.

Formally, suppose $X$ is either finite or countable.
Let $X$ be a random variable on $X$ and denote by $P_x := P(X = x) : X \to [0,1]$ the probability mass function, which satisfies $\sum_{x \in X} P_x(x) = 1$. Two random variables $X_a$, $X_b$ are said to be equal in distribution if and only if their respective PMFs satisfy $P_{x,a}(x) = P_{x,b}(x)$ for all $x \in X$.

The observation model is then described by a conditional PMF $P(X = x \mid z)$. We assume this model has a topological parameter space $\Theta$, which we also pair with the Borel $\sigma$-algebra, e.g., $\Theta = [0,1]^n$ for an $n$-dimensional Bernoulli. Let $f : Z \to \Theta$ be an injective generator (note this implies $\Theta$ has cardinality at least that of $Z$). Then the generative model is as follows:
$$Z \sim P_z, \qquad P(X = x \mid z) = g_x(f(z)), \tag{A.59}$$
where $g_x(\theta)$ is the PMF of the observation model with parameter $\theta$, evaluated at $x$.

Recall Assumption 3.1. We now introduce its discrete analogue, which would be required for the theory developed in this thesis. First, note that the marginal PMF of $X$ is given as follows:
$$P_x(x) = \int_Z g_x(f(z))\, P_z(dz) = \int_\Theta g_x(\theta)\, P_z(f^{-1}(d\theta)) = E_\theta[g_x(\theta)], \tag{A.60}$$
where, as a random variable, $\theta = f(Z)$. The assumption is then as follows:

Assumption A.1 (Discrete analogue to Assumption 3.1). Assume that $E_{\theta_a}[g_x(\theta_a)] = E_{\theta_b}[g_x(\theta_b)]$ for each $x \in X$ if and only if $\theta_a \stackrel{d}{=} \theta_b$.

In other words, the distribution of $\theta = f(Z)$ must be characterized by the moments $E[g_x(\theta_a)]$, for each $x \in X$. Indeed, for observational equivalence to imply anything about the latent spaces, such an assumption would be needed. However, it appears that this assumption is rarely satisfied by any reasonable model. For example, the Bernoulli observation model with $P(X = 1 \mid z) = g_1(\theta) = \theta$, $\Theta = [0,1]$, requires that the distribution of $\theta$ be characterized by just its first moment, $E[\theta]$.
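To make this failure concrete, the following is a small numerical sketch (the two latent distributions here are hypothetical, chosen purely for illustration): two distinct distributions for $\theta$ on $[0,1]$ with the same mean induce exactly the same observed Bernoulli PMF, so the "if" direction of Assumption A.1 cannot hold.

```python
# Two different distributions for theta = f(Z) on Theta = [0, 1] with equal
# first moment: theta_a ~ Uniform(0, 1), and theta_b uniform on {1/4, 3/4}.
# Both have E[theta] = 1/2, so the Bernoulli observation model
# P(X = 1 | z) = g_1(theta) = theta yields the identical marginal PMF
# (1/2, 1/2), even though theta_a and theta_b differ in distribution.

import random

random.seed(0)
n = 200_000

# Monte Carlo estimates of P(X = 1) = E[g_1(theta)] = E[theta].
p1_a = sum(random.random() for _ in range(n)) / n
p1_b = sum(random.choice((0.25, 0.75)) for _ in range(n)) / n

# Both observed distributions agree (up to Monte Carlo error), so
# observational equivalence says nothing about the law of theta here.
assert abs(p1_a - 0.5) < 0.01 and abs(p1_b - 0.5) < 0.01
print(p1_a, p1_b)
```

Any pair of laws on $[0,1]$ with matching means would serve equally well, which is precisely why the first moment alone cannot characterize the distribution of $\theta$.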
Of course, this is highly unlikely unless very strict restrictions are placed on $f$ and $P_z$.

At the core of the issue remains the cardinality mismatch between $Z$ and $X$. One necessary condition to show that $\theta_a \stackrel{d}{=} \theta_b$ is that $E[g(\theta_a)] = E[g(\theta_b)]$ for all bounded continuous test functions $g : \Theta \to \mathbb{R}$. For $\Theta$ uncountable, there are clearly uncountably many test functions, while in our discrete assumption above, there are countably many test functions at best. Though we do not make this notion precise here, we believe this makes the discrete assumption above unlikely to be satisfied, and hence identifiability, at least under our framework (which we believe to be reasonably general), is highly unlikely for discrete outcomes with uncountable latent spaces.
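As a closing illustration of the density results in Section A.2.3, the identity of Lemma A.5 can be checked numerically in a simple one-dimensional example (hypothetical, not from the main text): take $\lambda_z$ to be Lebesgue measure on $(0,1)$, $P_{z,a}$ uniform with density $p_a \equiv 1$, and $P_{z,b}$ with density $p_b(y) = 2y$. The increasing map $A = F_b^{-1} \circ F_a$, i.e., $A(x) = \sqrt{x}$, satisfies $P_{z,a}(B) = P_{z,b}(A(B))$ for every Borel $B$, so it is a $(P_{z,a}, P_{z,b})$-measure isomorphism, and for a smooth increasing $A$ the density $k_A$ of $\lambda_z \circ A$ is just $A'$.

```python
# Numerical check of the Lemma A.5 identity p_b(A(x)) * k_A(x) = p_a(x)
# for a hypothetical 1-D example on (0, 1).
import math

def p_a(x: float) -> float:
    return 1.0                    # Uniform(0,1) density

def p_b(y: float) -> float:
    return 2.0 * y                # density of P_{z,b} (Beta(2,1))

def A(x: float) -> float:
    return math.sqrt(x)           # A = F_b^{-1} o F_a, since F_b(y) = y^2

def k_A(x: float) -> float:
    return 0.5 / math.sqrt(x)     # A'(x), the density of lambda_z o A

# Verify the identity (A.43) on a grid of interior points.
for i in range(1, 100):
    x = i / 100
    assert abs(p_b(A(x)) * k_A(x) - p_a(x)) < 1e-12
print("Lemma A.5 identity verified on the grid")
```

Here $p_b(A(x))\,k_A(x) = 2\sqrt{x} \cdot \tfrac{1}{2\sqrt{x}} = 1 = p_a(x)$, matching (A.43) exactly.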