UBC Theses and Dissertations

Clustering and modelling of phase variation for functional data Fu, Shing 2019


Clustering and modelling of phase variation for functional data

by

Shing Fu

B.Sc., The University of Hong Kong, 2008
M.Sc., The University of British Columbia, 2010

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

February 2019

© Shing Fu, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Clustering and modelling of phase variation for functional data

submitted by Shing Fu in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Nancy Heckman, Statistics (Supervisor)
Matías Salibián-Barrera, Statistics (Supervisory Committee Member)
Alexandre Bouchard-Côté, Statistics (Supervisory Committee Member)
Jane Wang, Electrical and Computer Engineering (University Examiner)
James Zidek, Statistics (University Examiner)

Abstract

Our work is motivated by an analysis of elephant seal dive profiles, which we view as functional data: specifically, as depth as a function of time, with data recorded almost continuously by sensors attached to the animal. The objective is to group profiles by shape to better understand the corresponding behavioural states of the seals. Most existing approaches rely on multivariate clustering methods applied to ad hoc summaries of the dive profile. Instead, we view each profile as arising from a function that is a deformation of a base shape. The deformation is regarded as phase variation and is represented by a latent warping function with a finite mixture distribution.

We first propose a curve registration model to explicitly model amplitude and phase variations of functional data, with phase variation represented by smooth time transformations called warping functions.
Inference is conducted via the stochastic approximation expectation-maximization (SAEM) algorithm. Our simulation study shows that the SAEM algorithm is computationally more stable and efficient than existing approaches in the literature for inference in this class of curve registration models with flexible warping.

We then propose two clustering approaches based on our curve registration model for functional data: 1) a simultaneous approach that smooths the noisy raw profiles and estimates the base shape, the warping functions and the cluster memberships via SAEM algorithms; and 2) a two-step approach that applies clustering algorithms to the estimated warping functions. In contrast to generic clustering algorithms in the literature, our methods treat the clustering structure as heterogeneity in phase variation. The proposed methods are applied to the analysis of elephant seal dive profiles and an analysis of human growth curves. We are able to obtain more intuitive clusters by focusing the clustering effort on phase variation.

Lay Summary

Technological advancement has allowed researchers to study the behaviour of deep-diving animals by attaching miniature sensors. These sensors generate high-volume data that require efficient analyses to provide initial insight. This thesis studies automatic data analytic methods that group dive depth trajectories, also called dive profiles, of a southern elephant seal based on shape. These methods will help researchers understand how the seal hunts and rests under water. The developed methods can be applied to data that track motions or physical processes in general.

Preface

This dissertation is original, independent work by the author, Shing (Eric) Fu, under the supervision of Professor Nancy Heckman.

A version of Chapter 2, Sections 1.1 and 1.2 and Appendix A.1 was accepted for publication by Computational Statistics & Data Analysis on June 25, 2018.
The manuscript, entitled "Model-based curve registration via stochastic approximation EM algorithm" by Eric Fu and Nancy Heckman, is available online. The southern elephant seal data were provided by Dr. Christophe Guinet of Centre d'Etudes Biologiques de Chizé, UMR 7372 CNRS-ULR. The data were collected as part of the Système d'Observation MEMO. I was responsible for model formulation, derivation of the algorithms, computational work, data analysis and the majority of the writing. Dr. Heckman was the supervisory author on this project, was involved throughout in concept formation and manuscript editing, and contributed to the proof in Appendix A.1.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Data and motivation
  1.2 Existing work on dive profile clustering
  1.3 Overview of the proposed method
  1.4 Review of curve registration
  1.5 Tables and figures

2 Modelling amplitude and phase variations
  2.1 Curve registration model
  2.2 Model identifiability
  2.3 Inference via a stochastic approximation EM algorithm
    2.3.1 M-step
    2.3.2 E-step
  2.4 Simulation study
  2.5 Remarks on using a fixed template
  2.6 Tables and figures

3 Clustering on phase variation
  3.1 Review of functional data clustering
  3.2 Mixture of warpings model
    3.2.1 Simultaneous registration and clustering with known and unknown templates
    3.2.2 Two-step approach
  3.3 Simulation study
  3.4 Tables and figures

4 Applications
  4.1 Dive profiles of a southern elephant seal
  4.2 Berkeley Growth Study data
  4.3 Tables and figures

5 Discussion and future work
  5.1 Future work

Bibliography

Appendices

A Appendix to Chapter 2
  A.1 Proof of Theorem 1
  A.2 Derivations of the SAEM algorithms
    A.2.1 Unknown template
    A.2.2 Known template
    A.2.3 Metropolis-Hastings algorithm for simulating warping effects

B Appendix to Chapter 3
  B.1 Derivations of SAEM for mixture of warping model

List of Tables

2.1 Summary of simulation computations: computing times of completed runs and the percentage of failed runs for varying number of sampled points. We allocate a maximum computing time of six hours for each run. (Shape 1)
2.2 Summary of simulation computations: computing times of completed runs and the percentage of failed runs for varying number of sampled points. We allocate a maximum computing time of six hours for each run. (Shape 2)
2.3 Average IMSE of the estimated common shape and IMSPE of the predicted warping function.
3.1 Adjusted Rand index and average number of non-empty clusters (Eff. K). The true number of clusters, K, is assumed known.
3.2 Adjusted Rand index. True template: f(t) = −2√(t(1−t)). Misspecified fixed template: f(t) = −4t(1−t). The number of clusters is assumed known.
3.3 Adjusted Rand index. True template: f(t) = −(27/4)t(1−t)². Misspecified fixed template: f(t) = −4t(1−t). The number of clusters is assumed known.
4.1 Cross-tabulation of MXWP clusters and gender of the subjects in the Berkeley Growth Study.

List of Figures

1.1 Dive depth of a southern elephant seal recorded per second from Oct. 28 to Dec. 29, 2010.
1.2 Swim track of a southern elephant seal near the Kerguelen Islands between October 28 and December 29, 2010, with annotations of the number of days since the beginning of the excursion.
1.3 Daily time series of dive depth of a southern elephant seal from November 11 (top) to 26 (bottom), 2010.
1.4 Some typical shapes of dive profiles of a southern elephant seal. Top: flat-bottom dives.
Bottom left: V-shape dive. Bottom right: drift dive.
1.5 Illustration of amplitude and phase variations in simulated data. The gray curve indicates the base shape. Top left: warping functions, i.e. smooth, strictly increasing bijections on [0, 1]. Top right: phase variation induced by the warping functions. Bottom left: amplitude variation in terms of vertical shifting and scaling. Bottom right: curves with both phase and amplitude variations.
2.1 Twenty simulated curves for each of shape 1 (top row) and shape 2 (bottom row). Number of observations per curve is n = 100 for shape 1 and n = 1000 for shape 2.
2.2 Output of SAEM iterations from several simulated data sets (top plots for shape 1, bottom plots for shape 2, for n = 100, 1000 and 2000 observations per curve). Estimated base shapes (left) for 1, 2000, 7000 and 12000 iterations. Trace plots for the error variance (right) by SAEM iterations. Here, the first 2000 iterations are burn-in and calibration period. In the six error variance plots, the first 500 iterations are excluded from the plots for better scaling.
2.3 Simulation results for shapes 1 and 2 when the number of observations per curve is 100 (first column), 1000 (second column) or 2000 (third column). Estimated common shape and pointwise 95% confidence band based on the empirical standard error. Top panel: average estimated common shape. Middle panel: average bias. Bottom panel: root mean squared error.
2.4 Average prediction error of the warping functions and pointwise 95% confidence band based on the empirical standard error, for n = 100, 1000 and 2000 observations per curve.
3.1 Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f(t) = −4t(1−t). True model parameters: π_i = 1/4 for i = 1, ..., 4, κ1 = (0, 1, 1, 0)ᵀ, κ2 = (1.2, 1, 1, 1.2)ᵀ, κ3 = (0, 1, 2, 1)ᵀ, κ4 = (1, 2, 1, 0)ᵀ, µ = (−25, 500)ᵀ, Σ = diag(10², 50²) and σ² = 10.
3.2 Within-cluster sum-of-squares (for kCFC) and BIC for simulation study with true template f(t) = −4t(1−t). True number of clusters is 2.
3.3 Within-cluster sum-of-squares (for kCFC) and BIC for simulation study with true template f(t) = −4t(1−t). True number of clusters is 3.
3.4 Within-cluster sum-of-squares (for kCFC) and BIC for simulation study with true template f(t) = −4t(1−t). True number of clusters is 4.
3.5 Adjusted Rand index when the true number of clusters is known. True template f(t) = −4t(1−t).
3.6 Example clusters formed by funclust for a simulation replicate with four true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same funclust cluster. Colored lines show the estimated cluster means.
3.7 Example clusters formed by funHDDC for a simulation replicate with four true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same non-empty funHDDC cluster. Colored lines show the estimated cluster means. Only two non-empty clusters are formed by the method.
3.8 Example clusters formed by waveclust for a simulation replicate with three true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same waveclust cluster. Colored lines show the estimated cluster means.
3.9 Convergence diagnosis via traceplots of estimated parameters for MXWP-FxT, K = 4.
3.10 Convergence diagnosis via traceplots of estimated parameters for MXWP, K = 4.
3.11 Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f1(t) = −2√(t(1−t)). True model parameters: π_i = 1/4 for i = 1, ..., 4, κ1 = (0, 1, 1, 0)ᵀ, κ2 = (1.2, 1, 1, 1.2)ᵀ, κ3 = (0, 1, 2, 1)ᵀ, κ4 = (1, 2, 1, 0)ᵀ, µ = (−25, 500)ᵀ, Σ = diag(10², 50²) and σ² = 10.
3.12 BIC for simulation study with true template f(t) = −2√(t(1−t)). True number of clusters is 2.
3.13 BIC for simulation study with true template f(t) = −2√(t(1−t)). True number of clusters is 3.
3.14 BIC for simulation study with true template f(t) = −2√(t(1−t)). True number of clusters is 4.
3.15 Adjusted Rand index when the true number of clusters is known. True template f(t) = −2√(t(1−t)).
3.16 Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f2(t) = −(27/4)t(1−t)². True model parameters: π_i = 1/4 for i = 1, ..., 4, κ1 = (0, 1, 1, 0)ᵀ, κ2 = (1.2, 1, 1, 1.2)ᵀ, κ3 = (0, 1, 2, 1)ᵀ, κ4 = (1, 2, 1, 0)ᵀ, µ = (−25, 500)ᵀ, Σ = diag(10², 50²) and σ² = 10.
3.17 BIC for simulation study with true template f(t) = −(27/4)t(1−t)². True number of clusters is 2.
3.18 BIC for simulation study with true template f(t) = −(27/4)t(1−t)². True number of clusters is 3.
3.19 BIC for simulation study with true template f(t) = −(27/4)t(1−t)². True number of clusters is 4.
3.20 Adjusted Rand index when the true number of clusters is known. True template f(t) = −(27/4)t(1−t)².
4.1 A random sample of 200 dive profiles.
4.2 The same random sample of 200 dive profiles as in Figure 4.1 but with standardized durations.
4.3 BIC for clustering of southern elephant seal dive profiles.
4.4 Estimated templates. Estimated B-spline, f̂(t) = Σ_{k=1}^{K_f} α̂_k B_k^f(t), for MXWP and STEP. Scaled known template, f̂(t) = µ̂_sh + µ̂_sc · (−4t(1−t)), for MXWP-FxT and STEP-FxT.
4.5 Traceplots of estimated error variance, σ̂², by SAEM iterations. The first 500 iterations are excluded for better scaling of the plots.
4.6 Eight clusters of dives determined by GMM. The dark blue lines are the cluster means.
4.7 Eight clusters of dives determined by MXWP-FxT. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
4.8 Eight clusters of dives determined by STEP-FxT. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
4.9 Eight clusters of dives determined by MXWP. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
4.10 Eight clusters of dives determined by STEP. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
4.11 Estimated warping functions corresponding to the eight MXWP-FxT clusters in Figure 4.7.
4.12 Estimated warping functions corresponding to the eight STEP-FxT clusters in Figure 4.8.
4.13 Estimated warping functions corresponding to the eight MXWP clusters in Figure 4.9.
4.14 Estimated warping functions corresponding to the eight STEP clusters in Figure 4.10.
4.15 Dive profiles cross-tabulated by GMM (rows) and MXWP-FxT (columns) clusters. Curves in the same row belong to the same GMM cluster. Curves in the same column belong to the same MXWP-FxT cluster.
4.16 Dive profiles cross-tabulated by GMM (rows) and STEP-FxT (columns) clusters.
4.17 Dive profiles cross-tabulated by GMM (rows) and MXWP (columns) clusters.
4.18 Dive profiles cross-tabulated by GMM (rows) and STEP (columns) clusters.
4.19 Growth curves of 39 boys and 54 girls from the Berkeley Growth Study data. The vertical grid lines indicate the actual measurement times.
4.20 BIC for GMM and MXWP, and within-cluster sum of squares (WSS) for kCFC, for the Berkeley Growth Study data.
4.21 Cluster analysis by GMM, kCFC and MXWP for the Berkeley Growth Study data, with eight clusters for GMM and kCFC and two clusters for MXWP. The clustering methods are applied to the height data. The resulting clusters are identified by color and linetype. The velocity (second row) and acceleration (last row) panels show the corresponding derivatives of the growth curves estimated by smoothing splines and labelled by cluster.
4.22 Cluster analysis by GMM, kCFC and MXWP for the Berkeley Growth Study data with the number of clusters fixed at two. The clustering methods are applied to the height data. The resulting clusters are identified by color and linetype. The velocity (second row) and acceleration (last row) panels show the corresponding derivatives of the growth curves estimated by smoothing splines and labelled by cluster.
4.23 Fitted growth curves, their first two derivatives and the warping functions by MXWP. The two derivative functions are derived from the fitted splines for the amplitude and warping functions. Cluster memberships are indicated by colour (red vs. blue) and linetype (solid vs. dashed). In the top left panel, the gold line shows the estimated template for the growth curves.
4.24 Traceplot of estimated error variance by MXWP for the Berkeley Growth Study data.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my research supervisor, Professor Nancy Heckman, for her continuous guidance, encouragement and support in the past five years. Nancy helps me see the value in my work, which keeps me motivated during tougher times. I greatly appreciate her dedicating ample time, out of her busy schedule as the department head, for our regular research meetings. I can never ask for a more caring and supportive mentor.

I wish to thank Dr. Christophe Guinet for his generosity in sharing his elephant seal data, which started my research. I am also thankful to the members of my supervisory committee, Professors Matías Salibián-Barrera and Alexandre Bouchard-Côté, for their insightful comments and feedback on my research proposal, manuscript and dissertation. I also owe thanks to Professors James Zidek and Jane Wang, my university examiners, for their careful reading of this dissertation and their valuable comments.

In my first year of the PhD program, I faced some unexpected turbulence and I am grateful to Professor John Petkau for his advice and words of wisdom that helped me to weather the storm. John is also a great mentor in statistical consulting.
His door is always open and I have learnt a lot from our discussions of my consulting cases as well as from the projects we worked on together.

It is my privilege to be involved in various consulting projects through the Applied Statistics and Data Science Group (ASDa) that broadened my exposure beyond my research topic. I must thank Ms. Carolyn Taylor, the managing consultant, for putting her trust in me to take charge of some of the cases on my own. Thank you to Professors Harry Joe and Rollin Brant, and Drs. Biljana Jonoska Stojkova and Davor Cubranic. It has been a great pleasure and rewarding experience working with them. I wish to offer my special thanks to Mr. Rick White. I am lucky to have taken part in many consulting meetings with him and witnessed his creative and effective ways of explaining statistical concepts and ideas to a general audience. It is a big loss that he is no longer with us and I will forever miss the enlightening conversations in statistics and history with him in the department corridor.

Thank you to the staff members of the department office, Andrea Sollberger and Ali Hauschildt, for keeping things in order and making our life easy. Special thanks to Peggy Ng, our administrator, and her husband Rick for hosting Christmas and New Year parties for those who are away from their family during the festive times.

The graduate student body in our department is a closely knit community and I am blessed to be a part of it. I wish to thank all the wonderful people I have crossed paths with. To Andy Leung, David Lee, Derek Chiu and Hoyin Ho for all the dim-sum and mahjong gatherings. To Alexi Rodriguez-Arelis, my long-time office mate, with whom I have shared many good laughs. To David Kepplinger and Gong Zhang for taking the time to come and show support at my oral defence.
To Seong-Hwan Jun, Adrian Jones, Jonathan Agyeman and everyone in the short-term consulting service for the comradeship in rebooting and stewarding the program.

Thank you to my cheerleader Joanna Zhao for her mental and emotional support and for keeping me sane with her reality checks.

Finally, thank you to my family, who let me be so far away from home for so long and support me in every way they can. To my father, my best listener, whose love, laughter and wise words I will miss dearly. I hope I have done him proud.

Dedication

To my family.

Chapter 1

Introduction

In this thesis, we study the problem of clustering functional data with substantial phase variation and noise. The work is motivated by an analysis of marine mammal dive profiles, which we view as functions of depth over time. We propose a model to simultaneously smooth the curves and estimate their amplitude and phase components. We then discuss approaches to cluster the curves into groups of similar shapes, either by extending the proposed model or via a two-step approach. In this chapter, we describe the data and the purpose of clustering dive profiles. We then briefly review existing approaches in the literature and provide an overview of our proposed methods.

1.1 Data and motivation

The development of various tracking devices has greatly facilitated ecological studies of diving animals, where direct observation of their underwater activities is challenging. Biologists often attach these devices to large diving animals, such as seals and whales, to collect data on their foraging behaviours. Modern sensors are capable of tracking the movement of the tagged animal with high temporal resolution while collecting environmental data at the same time. The fine-scale movement data facilitate the study of the diving activity of marine mammals, in order to understand the foraging areas and behaviour of the species (Dragon et al., 2012; Bailleul et al., 2008; Hindell et al., 1991).
Studying diving activity can also help biologists assess the impact of environmental changes and human activities on the animals (Guinet et al., 2014; Langrock et al., 2014; Walker et al., 2011a) to inform the development of conservation policy.

One popular and widely deployed tracking device is the time-depth recorder (TDR), which has been in continuous use since the 1960s (Womble et al., 2013). A TDR records water pressure, which is then converted to depth, at fixed intervals of time. Also, geolocation data are often collected when the mounted devices are carried back to the surface level. The TDR is recovered at the end of deployment with up to a few months of data stored on board. Researchers then retrieve the time series of depth to study the diving and foraging behaviour of the animal.

Figures 1.1 and 1.2 show the dive depths of a southern elephant seal and its course of movement in the southern Indian Ocean. Most analyses of this type of bio-logging data consider the dive depth in the vertical dimension separately from the geolocation data in the horizontal dimensions, although recently Bestley et al. (2015) attempt to combine the two dimensions. In this work, we choose to focus on the movement of the seal in the vertical dimension. For our southern elephant seal data, depth is measured per second over a period of about two months. Figure 1.3 shows a subset of the dive depth data as daily series. Throughout the two months, the southern elephant seal engages in continuous dives lasting 20 to 30 minutes, with short recesses of about three minutes at the surface between dives.

For many air-breathing marine animals, a dive is commonly adopted as the unit of behaviour (Womble et al., 2013) and we take that approach here. Therefore, one of the common preprocessing steps for TDR data is to extract dives by identifying their end points. For deep-diving animals such as southern elephant seals, extraction of dives is relatively easy.
For our data, a dive is defined as the interval over which the entire sequence of recorded depths is deeper than a threshold of 15 metres. The time series of depth as a function of time for a single dive is referred to as a dive profile.

In this work, we model dives of a single seal and consider a dive profile as a sampling unit. The dives are often grouped into different dive types based on the shape of the dive profile. Biologists then deduce physiological functions and behavioural states for each type of dive. For example, Figure 1.4 shows four dive profiles with some typical shapes. The V-shape dive might be an exploratory dive, the flat-bottom dives might be foraging dives and, during the drift dive, the elephant seal is believed to be sleeping. Since the time series of depth is often measured at high frequency over a long period of time, the grouping of dives into biologically meaningful classes also reduces the data to a more manageable size (Schreer and Testa, 1995). Hypotheses on the foraging strategies and efficiency of the species are then developed based on the identified shapes and the sequence of dive types (Brillinger and Stewart, 1997).

1.2 Existing work on dive profile clustering

The earliest work on dive type classification was performed by visual inspection, as noted by Brillinger and Stewart (1997) and Schreer et al. (1998). However, this is not possible when the number of dives recorded is large, which is typical with TDR recordings. In our example, 3300 dives are identified over the two-month excursion of the southern elephant seal. Therefore, clustering of dives by quantitative and automated methods is desirable.
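As a concrete illustration, the 15-metre dive-extraction rule described above can be sketched in a few lines. This is a hypothetical sketch, not the thesis's preprocessing code: the function name and the assumption of one depth sample per second are ours; only the 15 m threshold comes from the text.

```python
import numpy as np

def extract_dives(depth, threshold=15.0):
    """Split a depth series (metres, one sample per second) into dives.

    A dive is a maximal run of consecutive samples deeper than `threshold`.
    Returns a list of (start_index, end_index) pairs, end exclusive.
    """
    below = depth > threshold                  # True while deeper than 15 m
    # Indices i where below[i] != below[i+1], i.e. dive/surface transitions.
    edges = np.flatnonzero(np.diff(below.astype(int)))
    starts = edges[~below[edges]] + 1          # surface -> dive transitions
    ends = edges[below[edges]] + 1             # dive -> surface transitions
    if below[0]:                               # series begins mid-dive
        starts = np.r_[0, starts]
    if below[-1]:                              # series ends mid-dive
        ends = np.r_[ends, len(depth)]
    return list(zip(starts, ends))
```

Each extracted segment, with time rescaled to [0, 1], would then be one duration-standardized dive profile.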
An automated method also has the potential to be implemented on the tracking devices to provide on-board identification of dive types as a form of data compression.

Gaussian mixture model on raw data

Brillinger and Stewart (1997) apply Gaussian mixture model (GMM) clustering directly to the data vectors of depths, each of which represents a dive profile. Since the method requires the dive profile data to be observed on a common set of time points, the varying duration across dives presents a problem. In the original paper, the authors appended the data vectors of shorter dives with zero depths. Alternatively, we can scale time by dive duration so that each curve is a function defined on the unit interval. A pre-processing step is then employed to interpolate the data to construct continuous curves. The curves are then evaluated on a common grid of time points to form new vectors of depths. Finally, GMM is applied to these data vectors of the same length.

Stepwise approach with feature extraction or dimension reduction

Another class of dive clustering approaches involves feature extraction or some form of dimension reduction, followed by multivariate clustering techniques. For example, Schreer et al. (1998) convert the dive profiles to crude abstractions by scaling both time and depth and averaging depths within each profile. A dive profile is first interpolated and evaluated at 100 equally spaced time points spanning the dive duration, with the dive depths scaled such that the maximum is 1. For each dive profile, averages of the scaled depths are then taken over every 10 points, yielding a 10-dimensional vector that represents the shape of the dive profile, regardless of the maximum depth and dive duration. Clustering methods such as neural networks, k-means and other distance-based algorithms are then applied to these vectors.

Dragon et al. (2012) partition each dive profile into descent, bottom and ascent stages.
Stage transitions are determined when the vertical speed ex-ceeds some thresholds. Physiologically meaningful summaries such as max-imum depth, stage durations, rates of descent and ascent and wigglinessof the dive profile in the bottom stage are then computed for each dive.A deterministic rule is used to screen out drift dives, with their relativelyconstant and slower descent rate (see bottom right panel in Figure 1.4).The remaining dives are clustered by the k-means algorithm applied to theprincipal component scores of the dive profile summaries.41.3. Overview of the proposed methodWalker et al. (2011) reduce the dimension of the data by subsamplingthe recorded depths for each dive profile as follows. They first scale boththe times and depths of each dive profile to between 0 and 1. Free-knotsplines with a fixed number of knots are then fitted to the scaled data. Theoriginal data for each dive profile is then condensed by keeping only depthsat the fitted knot locations. The data are further summarized by conductinga principal component analysis on the condensed data vectors. In the end,the dive profiles are grouped by hierarchical clustering using the first fewprincipal component scores.1.3 Overview of the proposed methodIn our proposed method, we view the dive profiles as functional data. Weonly consider the duration-standardized profiles so that all curves have acommon domain of time. We also consider phase variation in addition toamplitude variation, which is a unique feature of functional data. By phasevariation, we refer to variation in the lateral direction of time, whereas byamplitude variation, we refer to variation in the vertical direction of depth.For our data, sources of phase variation are the timings of stages such asdescending, drifting and ascending, which influence the shape of the diveprofiles. Phase variation is also introduced artificially due to the linear timescaling from standardization of dive durations. 
On the other hand, the main form of amplitude variation is in the maximum depth.

While the methods of Schreer et al. (1998) and Brillinger and Stewart (1997) do not differentiate the two modes of variation, the more recent methods of Walker et al. (2011) and Dragon et al. (2012) do take phase variation into account with the use of free-knot splines and the partition of each dive profile into three stages. We anticipate that a more intuitive clustering structure in the data can be revealed via a model for the curves that explicitly models the two modes of variation. Specifically, we model the dive profile as

y_i(t) = f_i ∘ h_i(t) + ε_i(t),  t ∈ [0, 1],  (1.1)

where f_i ∘ h_i represents the noiseless dive profile, while ε_i(t) represents residual variation, i.e. the erratic movement of the seal that deviates from the smooth trajectory modelled by f_i ∘ h_i. We assume that

f_i(t) = a_{i,sh} + a_{i,sc} f(t)  (1.2)

is a vertical shifting and scaling of some base shape f on [0, 1], whereas h_i is a strictly increasing function from [0, 1] onto [0, 1] that deforms f_i to yield a smoothed version of the observed curve. We call f_i an amplitude function and h_i a warping function as they represent amplitude and phase variation respectively. Figure 1.5 illustrates the two modes of variation in the random functions generated by model (1.1)-(1.2). With a common template in f, variation in shape is captured by the warping functions. Our method estimates the latent functions, the f_i's and the h_i's, and clusters the curves into groups of homogeneous shapes via the warping functions, the h_i's.

1.4 Review of curve registration

The process of estimating the warping functions is often called curve registration. The term arises from one branch of functional data analysis which treats phase variation as a nuisance effect to be eliminated in a pre-processing step of curve alignment.
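The generative model (1.1)-(1.2) can be sketched as follows. A toy base shape and piecewise-linear warps stand in for the B-spline versions developed in Chapter 2, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 200)

def f(t):
    """Toy base shape: a smooth V-like dive on [0, 1], depth negative."""
    return -500.0 * np.sin(np.pi * t)

def random_warp(t, k=8, tau=50.0):
    """Piecewise-linear warping function: increments drawn from a Dirichlet,
    so h(0) = 0, h(1) = 1 and h is strictly increasing."""
    w = rng.dirichlet(np.full(k, tau / k))
    return np.interp(t, np.linspace(0.0, 1.0, k + 1),
                     np.concatenate([[0.0], np.cumsum(w)]))

def simulate_curve(sigma=5.0):
    """One draw from y_i(t) = a_sh + a_sc * f(h_i(t)) + eps_i(t)."""
    a_sh = rng.normal(0.0, 20.0)    # shifting effect, mean 0
    a_sc = rng.normal(1.0, 0.05)    # scaling effect, mean 1
    h = random_warp(grid)
    return a_sh + a_sc * f(h) + rng.normal(0.0, sigma, grid.size)
```

Each simulated curve shares the template f; its shape differs from the others only through the random warp, which is exactly the structure exploited for clustering.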
Smooth functions, the ŷ_i(·)'s, are first reconstructed from the discrete-time data, either by interpolation if the observations are exact, or by estimation with smoothing methods such as local polynomial regression, kernel smoothing or spline smoothing. Viewing the warping function as a transformation of the observed time, t, to the so-called system time, h_i(t), in which the functional data are considered synchronized or aligned, each smooth function can therefore be aligned as ŷ_i ∘ h_i^{-1}(·).

Landmark registration

If the curves possess a common set of landmarks, such as local maxima, minima and zero-crossings, they can be registered by aligning these feature points at some reference locations on the abscissa. The warping function is formed by interpolating through the shifted times. This is referred to as landmark registration (Kneip and Gasser, 1992; Ramsay and Silverman, 2006). The main task is then to identify and locate the features or landmarks, either manually or numerically by locating the zero-crossings of the curves or of their estimated derivatives from the smoothing step. Gasser and Kneip (1995) established some asymptotic properties of landmark identification via kernel estimation of the smooth curves.

Template registration

An alternative to landmark registration is to find warping functions that minimize misalignment among the registered curves. The method uses a misalignment measure between two curves, such as the L2 distance or the Fisher-Rao metric (Srivastava et al., 2011). Typically, the misalignment is measured between each sampled curve and a template. Wang and Gasser (1997) suggest choosing one of the sampled curves as the template. Wang and Gasser (1999) use the cross-sectional average of the unaligned curves as an initial template. Their algorithm then iterates between 1) aligning the curves to the template by dynamic programming and 2) updating the template to the cross-sectional average of the aligned curves.
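A minimal sketch of the landmark-registration warp with a single interior landmark per curve, using piecewise-linear interpolation through the shifted times (the function name and the convention of mapping the curve's landmark onto the reference location are ours):

```python
import numpy as np

def landmark_warp(t, landmarks, ref_landmarks):
    """Warping function that sends this curve's landmark times onto common
    reference locations while fixing the endpoints 0 and 1. The registered
    curve is then the original curve composed with the inverse of this map."""
    x = np.concatenate([[0.0], np.atleast_1d(landmarks), [1.0]])
    y = np.concatenate([[0.0], np.atleast_1d(ref_landmarks), [1.0]])
    return np.interp(t, x, y)
```

With sorted interior landmarks the interpolant is automatically continuous, strictly increasing and endpoint-preserving, i.e. a valid warping function.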
Ramsay and Li (1998) combine this type of iterative Procrustes method with a smooth representation of each warping function to preserve differentiability of the aligned curves.

Instead of aligning all curves to one common template, Kneip and Ramsay (2008) determine warping functions that make each registered curve close to its truncated Karhunen-Loève (KL) expansion. Thus, they tackle both phase and amplitude variation. Their algorithm iterates across 1) performing a functional principal component analysis on the tentatively aligned curves, 2) updating templates using the first few principal components, and 3) updating the warping functions by aligning the curves to the new templates.

Tang and Müller (2008) propose a pairwise registration process which does not require an explicit template, but rather focuses on estimation of the warping functions. The authors consider a stochastic model,

Y_i(t) = f ∘ h_i(t) + δ Z_i ∘ h_i(t) + σ ε_i(t),  for i = 1, . . . , N,  (1.3)

where f(·) is the population mean of the curves in system time, the Z_i(·)'s are independent random functions with zero mean and finite variance, and the ε_i(·)'s are independent white noise processes. The scale of amplitude variation, δ, is assumed small, whereas the warping functions are assumed random with E[h_i^{-1}(t)] = t, so that the observation time is equal to the system time on average. To obtain an estimate of h_i, first, they smooth the data from the i-th curve by weighted local linear regression. Treating curve i as a reference, for each j, a piecewise linear warping function h_{j→i} is computed to align the smoothed curve ŷ_j to the smoothed curve ŷ_i. The authors' estimate of h_i is

ĥ_i(t) = (1/N) ∑_{j=1}^{N} h_{j→i}^{-1}(t).

The authors prove that ĥ_i has desirable properties. Here, we present a heuristic justification of ĥ_i. Firstly,

y_j ∘ h_{j→i}^{-1} ≈ y_i ≈ f ∘ h_i ≈ (y_j ∘ h_j^{-1}) ∘ h_i = y_j ∘ (h_j^{-1} ∘ h_i)

implies that

h_{j→i}^{-1} ≈ h_j^{-1} ∘ h_i

and so

ĥ_i(t) ≈ (1/N) ∑_{j=1}^{N} h_j^{-1} ∘ h_i(t).  (1.4)

Also, since E[h_j^{-1}(s)] = s for any s,

(1/N) ∑_{j=1}^{N} h_j^{-1}(s) ≈ s.

Hence, setting s = h_i(t) in the right side of (1.4) yields ĥ_i(t) ≈ h_i(t).

Model-based registration

The sequential approach to curve registration works well when the observed data arise from smooth curves possessing a common set of clearly recognizable features. However, as pointed out by Raket et al. (2014), the pre-smoothing step can be problematic for noisy data. In this case, an individual curve's data might be overfitted, resulting in a "bumpy" curve with fictitious features being used in the subsequent alignment. Another problem with the sequential approach is with inference on population-level parameters, since the sequential approach does not take into account the uncertainty of the alignment.

An alternative approach is to perform curve fitting and registration simultaneously, via a probabilistic model of the form (1.1). Lawton et al. (1972) propose the original shape invariant model (SIM),

y_ij = a_{i,sh} + a_{i,sc} f(w_{i,sh} + w_{i,sc} t_ij) + ε_ij,

where the observed curve y_i is a scaling and translation of a template, f, in both the vertical and horizontal directions. The warping function is limited to a linear transformation of time. In Chapter 2, we describe in detail our proposed model, which can be considered an extension of SIM to more flexible warping functions. We discuss a stochastic approximation version of the EM algorithm for estimation of the model parameters, the warping and the amplitude functions as defined in (1.1) and (1.2). We compare the computational and statistical performance of our method to similar model-based registration methods in the literature via a simulation study.

The rest of this thesis is organized as follows. In Chapter 3, we discuss clustering methods based on model (1.1). Specifically, we propose to cluster via the warping functions when phase variation is the dominating mode of variation.
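The pairwise-average estimate can be sketched numerically for monotone warps given on a grid (grid-based inversion via `np.interp`; the function names are ours):

```python
import numpy as np

def invert_warp(t, h_vals):
    """Invert a monotone warp given by its values h_vals on the grid t."""
    return np.interp(t, h_vals, t)

def pairwise_estimate(t, warps_to_i):
    """Estimate h_i as the average of the inverted pairwise warps h_{j->i},
    following the Tang and Mueller (2008) construction."""
    return np.mean([invert_warp(t, h) for h in warps_to_i], axis=0)
```

The averaging step is what makes the estimate robust: individual pairwise alignments are noisy, but their inverses average to the identity in system time.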
Through simulation, we also compare the performance of our proposed method to some generic clustering algorithms for functional data that do not differentiate the two modes of variation. In Chapter 4, we apply our proposed method to the analysis of dive profiles of the southern elephant seal. For each dive, we estimate both the amplitude function, f_i, and the warping function, h_i. The warping functions are used to cluster the dive profiles into groups of similar shape. We find that clustering based on the warping function is superior to clustering based on the original data, yielding more interpretable clusters. In Chapter 5, we summarize the work and discuss possible directions of future research.

1.5 Tables and figures

Figure 1.1: Dive depth of a southern elephant seal recorded per second from Oct. 28 to Dec. 29, 2010.

Figure 1.2: Swim track of a southern elephant seal near the Kerguelen Islands between October 28 and December 29, 2010, with annotations of the number of days since the beginning of the excursion.

Figure 1.3: Daily time series of dive depth of a southern elephant seal from November 11 (top) to 26 (bottom), 2010.

Figure 1.4: Some typical shapes of dive profiles of a southern elephant seal. Top: flat-bottom dives. Bottom left: V-shape dive. Bottom right: drift dive.

Figure 1.5: Illustration of amplitude and phase variations in simulated data. The gray curve indicates the base shape. Top left: warping functions, i.e. smooth, strictly increasing bijections on [0, 1]. Top right: phase variation induced by the warping functions. Bottom left: amplitude variation in terms of vertical shifting and scaling. Bottom right: curves with both phase and amplitude variations.

Chapter 2: Modelling amplitude and phase variations

Functional data often exhibit variation not only in amplitude but also in horizontal scaling, or phase.
For example, in the classical Berkeley Growth Study, which considers the height of a child as a function of age (Ramsay and Silverman, 2006), both the magnitude of growth spurts and their age of occurrence vary across subjects.

Early approaches to the analysis of functional data either do not differentiate the two modes of variation or transform time to eliminate phase variation in a pre-processing step of curve registration, as reviewed in Chapter 1. The data, (t_ij, y_ij), for individual curves i = 1, . . . , N, and sampled points, j = 1, . . . , n_i, are modelled as

y_ij = f_i(t_ij) + ε_ij  (2.1)

with f_i a smooth function and ε_ij a noise term. The function f_i is typically modelled flexibly, in terms of spline functions, either via minimizing a penalized likelihood (Brumback and Rice, 1998) or by a linear mixed-effects model (Rice and Wu, 2001; Durbán et al., 2005; Heckman et al., 2013). Indeed, under general conditions, one can show that penalized likelihood estimates are the same as linear mixed-effects model estimates (Wand, 2003). Yao et al. (2005) took a different approach by simply assuming that the functions, f_i, i = 1, . . . , N, were sampled from a smooth stochastic process which can be approximated parsimoniously by the principal functions of the estimated covariance function.

Often, however, the original data do not follow model (2.1) due to the presence of phase variation. The phase variation might actually reflect interesting features in the data, and a pre-processing registration step that aligns the curves and removes these features might not be sensible. A more appropriate model accounts for phase variation explicitly via a time transformation, h_i, also called a warping function:

y_ij = (f_i ∘ h_i)(t_ij) + ε_ij.  (2.2)

The warping function h_i is continuous, strictly increasing, and maps the sampling time, t, to the system time, h_i(t), in such a way that, in system time, the N curves are in synchrony.
Ignoring phase variation and analyzing unaligned curves, that is, using model (2.1) instead of model (2.2), can lead to incorrect conclusions. For instance, estimating a population mean curve by the point-wise average of non-aligned individual curves may lead to dampened or even completely masked peaks and valleys.

In this chapter, we discuss our proposed probabilistic model for (2.2), estimating the amplitude and phase variation simultaneously via a new expectation-maximization (EM) algorithm. We consider warping functions in the class

H = { h : (H1) h is continuous in t; (H2) h is strictly increasing in t; (H3) h(0) = 0 and h(1) = 1 },

and smooth amplitude functions on [0, 1]. In Section 2.1, we discuss our proposed model and contrast it with the similar models of Brumback and Lindstrom (2004), Telesca and Inoue (2008) and Raket et al. (2014). In general, model (2.2) is not identifiable. In Section 2.2, we discuss this issue and provide an easy-to-apply theorem, similar to that in Chakraborty and Panaretos (2017). The proof of the theorem is deferred to Appendix A.1. In Section 2.3, we discuss a stochastic approximation version of the EM algorithm for parameter estimation, curve fitting and prediction of the warping functions and random effects. Some details of the derivations are relegated to Appendix A.2. In Section 2.4, we compare all four approaches reviewed in Section 2.1 through a simulation study and find our EM-based method and the Bayesian hierarchical curve registration (BHCR) of Telesca and Inoue (2008) computationally more stable, especially when the curves are densely observed.
In terms of statistical efficiency, our simulation study shows that the estimated base shape and predicted warping functions using our proposed method also have the lowest mean squared errors in general.

2.1 Curve registration model

In this section, we discuss our proposed model for (2.2) and contrast it with similar models suggested by Brumback and Lindstrom (2004) and Telesca and Inoue (2008). Following that, we also discuss the slightly different approach of Raket et al. (2014).

The first three methods extend the shape-invariant model (SIM) of Lawton et al. (1972), where f_i is a vertical shifting and scaling of a common base shape f, as in (1.2) but repeated here for convenience:

f_i(t) = a_{i,sh} + a_{i,sc} f(t).  (2.3)

Here the random amplitude effect, a_i = (a_{i,sh}, a_{i,sc})^T, follows a bivariate normal distribution with mean μ ≡ (μ_sh, μ_sc)^T and variance-covariance matrix Σ. On the other hand, f is modelled as a linear combination of a set of K_f B-spline basis functions, as

f(t) = ∑_{k=1}^{K_f} α_k B^f_k(t).  (2.4)

In our method and that of Brumback and Lindstrom (2004), the coefficients α_1, . . . , α_{K_f} are fixed. Telesca and Inoue (2008) work in a hierarchical Bayesian framework where the coefficients are random.

In these three methods, the warping function h_i is modelled as a random element of H using a set of K_h B-spline basis functions:

h_i(t) = ∑_{k=1}^{K_h} β_{i,k} B^h_k(t),  (2.5)

with the coefficients β_{i,1}, . . . , β_{i,K_h} random and

E[h_i(t)] = t

for all t. To ensure that h_i is a proper warping function that maps onto [0, 1], the models use the fact that h_i is strictly increasing if β_{i,1} < β_{i,2} < · · · < β_{i,K_h} (Kelly and Rice, 1990) and force h_i(0) = 0 and h_i(1) = 1 by setting β_{i,1} = 0 and β_{i,K_h} = 1.
While Telesca and Inoue explicitly impose this ordering constraint on the coefficients, which are otherwise multivariate Gaussian, Brumback and Lindstrom link the basis coefficients to Gaussian warping effects by the Jupp transformation (Jupp, 1978).

We take a more straightforward but non-Gaussian approach. We model the first differences of the coefficients by a Dirichlet distribution. Specifically, we express the coefficients of the warping functions as

β_{i,k} = ∑_{j=1}^{k} w_{i,j},  (2.6)

with

w_{i,1} ≡ 0,  w_i ≡ (w_{i,2}, . . . , w_{i,K_h})^T ~ Dirichlet(κ̄, τ),  (2.7)

with the probability density function of the Dirichlet distribution parameterized as

[ Γ(τ) / ∏_{k=2}^{K_h} Γ(τ κ̄_k) ] ∏_{k=2}^{K_h} w_{i,k}^{τ κ̄_k − 1},  (2.8)

where κ̄ = (κ̄_2, . . . , κ̄_{K_h})^T is the mean vector and τ is the precision parameter. The Dirichlet distribution ensures that w_{i,1} < w_{i,1} + w_{i,2} < · · · < w_{i,1} + · · · + w_{i,K_h} = 1. Since h_i(t) is linear in w_{i,1}, . . . , w_{i,K_h}, our model allows easy and exact control of E[h_i(t)] by choosing the appropriate constant for the mean, κ̄. The most useful choice is the constant κ̄_identity for a given set of B-spline basis functions such that

E[h_i(t); κ̄ = κ̄_identity] = t.

In the case where κ̄ is fixed, there is only one unknown parameter, τ, to be estimated, regardless of the number of coefficients in the spline in (2.5). Although the correlation structure of this Dirichlet distribution is fixed, we have seen in simulation studies that the conditional expectation of the warping function given the data provides a good estimate of the true warping function when the number of observations per curve is large.

Raket et al. (2014) proposed a different type of model for warped curves. The data are modelled as y_ij = f(h_i(t_ij)) + x_i(t_ij) + ε_ij.
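This construction is easy to sketch with `scipy`. The knot layout and the Greville-abscissae computation of κ̄_identity below are our reading of the construction; for six basis functions on a clamped cubic basis it reproduces the κ̄ = (1, 2, 3, 2, 1)/9 used in the simulation study of Section 2.4:

```python
import numpy as np
from scipy.interpolate import BSpline

def clamped_knots(K_h, degree=3):
    """Clamped knot vector on [0, 1] giving K_h B-spline basis functions."""
    interior = np.linspace(0.0, 1.0, K_h - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def kappa_identity(knots, K_h, degree=3):
    """Dirichlet mean giving E[h_i(t)] = t: increments of the Greville
    abscissae, since a spline whose coefficients are the Greville sites
    reproduces the identity function."""
    greville = np.array([knots[k + 1:k + 1 + degree].mean() for k in range(K_h)])
    return np.diff(greville)

def sample_warp(rng, knots, kbar, tau, degree=3):
    """One random warping function: cumulative Dirichlet weights as increasing
    B-spline coefficients, so h(0) = 0, h(1) = 1 and h is increasing."""
    w = rng.dirichlet(tau * kbar)
    beta = np.concatenate([[0.0], np.cumsum(w)])
    return BSpline(knots, beta, degree)
```

Because the coefficients are a cumulative sum of positive Dirichlet weights, every sampled spline satisfies (H1)-(H3) by construction.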
The common base shape, f, is non-random and subject to warping, whereas x_i is a zero-mean Gaussian process representing idiosyncratic features that will not be aligned. The base shape, f, is represented by an interpolating spline, interpolating over all distinct values of t_ij, while the warping function, h_i, is a realization of a Brownian bridge. Although no constraint for monotonicity of h_i is discussed in their original paper, their actual implementation allows setting h_i to an interpolating spline through the realized Brownian bridge, with monotonicity ensured by Hyman filtering.

2.2 Model identifiability

As mentioned earlier, model (2.2) is not identifiable without further assumptions. Since (H, ∘) is a group on H, that is, closed under functional composition, ∘, and inverses, we can warp the amplitude functions by a fixed h ∈ H and transform the warping functions, the h_i's, accordingly, such that

f̃_i = f_i ∘ h_i = (f_i ∘ h) ∘ (h^{-1} ∘ h_i) = f*_i ∘ h*_i,

where h*_i = (h^{-1} ∘ h_i) ∈ H. If h is not the identity function, then h*_i ≠ h_i and both (f_i, h_i) and (f*_i, h*_i) generate f̃_i. However, we can show that if the amplitude function, f_i, is a random shifting and scaling of a base shape, as in (2.3), identifiability of the probabilistic model of f̃_i ≡ f_i ∘ h_i can be guaranteed under some additional conditions. We use the notation X_1 =_d X_2 to mean that X_1 and X_2 have the same distribution.

Theorem 1. Suppose that, for i = 1, 2,

1) f_i(t) = a_{i,sh} + a_{i,sc} ξ_i(t), where a_{i,sh} and a_{i,sc} are random with P(a_{i,sc} = 0) = 0 and ξ_i, defined on [0, 1], is a real-valued non-random function with a continuous first derivative which vanishes at most on a countable subset of [0, 1]; and

2) with probability one, h_i is continuous with strictly positive and continuous first derivative, with h_i(0) = 0 and h_i(1) = 1; and

3) E(h_i^{-1}(t)) = t for all t.

Then f̃_1 =_d f̃_2 if and only if (f_1, h_1) =_d (f_2, h_2).

Corollary 1. Assume that the conditions of Theorem 1 hold.
Suppose also that, for i = 1, 2, (a_{i,sh}, a_{i,sc}) has known mean (μ_sh, μ_sc) with μ_sc ≠ 0. If f̃_1 =_d f̃_2, then h_1 =_d h_2, ξ_1 = ξ_2 and (a_{1,sh}, a_{1,sc}) =_d (a_{2,sh}, a_{2,sc}).

Corollary 2. Assume that the conditions of Theorem 1 hold. Suppose also that, for i = 1, 2, a_{i,sh} ≡ 0 and ∫ ξ_i²(t) dt = 1. If f̃_1 =_d f̃_2, then h_1 =_d h_2, ξ_1(t) = ±ξ_2(t) and a_{1,sc} =_d ±a_{2,sc}.

The theorem and corollaries are useful to ensure identifiability of a parametric model for the stochastic process, f̃, with f̃(t) = a_sh + a_sc f(h(t)). If the model satisfies the conditions of the theorem for all parameters, then clearly f̃ is identifiable if and only if (a_sh + a_sc f, h) is identifiable. Thus if a_sh + a_sc f and h are independent processes, each with an identifiable parameterization, then f̃ is identifiable with respect to the parameterization. Specifying identifiable parameterizations for a_sh + a_sc f and h is typically easy.

The proof of the theorem essentially follows the proof of Theorem 1 of Chakraborty and Panaretos (2017), who introduce the use of local variation. These authors restrict themselves to the case that a_{i,sh} = 0, and directly obtain the result given in Corollary 2. Given their result, our theorem is not surprising since adding the constant a_{i,sh} does not change the local variation functions used in the proof. We sketch our proof of the theorem in Appendix A, including more details of the topology of the function space considered and, in some cases, with modified and shortened arguments. The proofs of the corollaries are straightforward, with the proof of Corollary 2 following the ending arguments of Chakraborty and Panaretos (2017).

Note that the theorem's restriction on h_i is E(h_i^{-1}(t)) = t. Unfortunately, implementation of this restriction is computationally intensive, unless the model for h_i is extremely simple. Therefore, we have followed the easier path, as other researchers have, of assuming that E(h_i(t)) = t for all t.
We then use Corollary 1 to motivate the conditions we place on (a_{i,sh}, a_{i,sc}). With this alternate assumption on h_i, our method still provides good estimates of the base and warping functions, as indicated in our simulation studies.

While the assumption that E(h^{-1}(t)) = t for all t is not the same as the assumption that E(h(t)) = t for all t, we note that relating the warping function to the identity function, I(t) = t, is a common approach to attempt to enforce identifiability. For instance, Brumback and Lindstrom (2004) force the N estimated warping functions in a sample of size N to average to I, which is a sort of empirical analogue of requiring E(h(t)) = t for all t. They state but do not prove that this leads to identifiability of the model. In their Bayesian approach, Telesca and Inoue (2008) assume that E(h(t)) = t for all t, although the concept of identifiability is not as well defined in the Bayesian context. In their rigorous treatment, Chakraborty and Panaretos require h to target I by assuming that E[h^{-1}(t)] = t for all t. While it seems intuitive that the condition E(h(t)) = t for all t would also suffice to guarantee identifiability, since both h and its inverse are warping functions, proving this seems to be an open problem. Clearly, Chakraborty and Panaretos' method of proof, using a measure of the local variation in f ∘ h, does not hold for the constraint E(h(t)) = t for all t.

The additional conditions for identifiability of model (2.2) relate to model (2.3).
We have used Corollary 1 to suggest appropriate conditions, that is, we have chosen conditions so that the distribution of the coefficients (a_{i,sh}, a_{i,sc}) is identifiable, namely, assuming normality and fixing their expected values to μ ≡ (μ_sh, μ_sc)^T = (0, 1)^T.

In our data context, where the base shape represents a typical depth trajectory of a dive, the scaling effect, a_{i,sc}, accounts for the variability in the total depth of the dive, while the shifting effect, a_{i,sh}, accounts for any variability of depth at the beginning and end of the dive. This latter variability may be introduced in the pre-processing stage when the dive profiles are extracted from the original time series. Our assumptions on the scale and shift random effects, that E(a_{i,sc}) = 1 and E(a_{i,sh}) = 0, mean that the base shape is a point-wise average of all of the individual, non-warped curves in the population, that is, in (2.3), E(f_i(t)) = f(t).

We note that Brumback and Lindstrom (2004) use a similar restriction on the scale and shift random effects but with a sample-dependent approach, restricting the coefficient vectors (a_{1,sh}, . . . , a_{N,sh}) and (a_{1,sc}, . . . , a_{N,sc}) by using rank-deficient variance-covariance structures to guarantee that the means of the vectors' components are equal to their known population means (μ_sh = 0 and μ_sc = 1). In their Bayesian context, to improve the computational stability of the posterior inference, Telesca and Inoue (2008) adjust the sampled amplitude random effects in each MCMC step such that the sample means of the shifting and scaling factors are equal to zero and one respectively.

To estimate the model parameters, Brumback and Lindstrom (2004) and Raket et al. (2014) modify the algorithm of Lindstrom and Bates (1990) for maximum likelihood estimation. Telesca and Inoue (2008) take a Bayesian approach with MCMC-based posterior inference.
In the following section, we propose an EM algorithm with stochastic approximation for parameter estimation. The algorithm also conveniently yields the fitted curves and the predicted warping functions as a by-product of the stochastic approximation.

2.3 Inference via a stochastic approximation EM algorithm

Recall that our model (2.3) has i) a non-random base function f, parameterized by the basis coefficients α = (α_1, . . . , α_{K_f}) as in (2.4), ii) random shifting and scaling effects a_i ≡ (a_{i,sh}, a_{i,sc}) which are bivariate normal with fixed mean μ ≡ (μ_sh, μ_sc)^T = (0, 1)^T and unknown covariance Σ, and iii) random warping functions with h_i based on w_i, which is Dirichlet with parameters κ̄ and τ as in (2.7). We use the following EM algorithm for maximum likelihood estimation of α, Σ, τ and the error variance σ² of the proposed model. For the i-th curve, where i = 1, . . . , N, denote the observed data by y_i and the complete data by z_i = (y_i, a_i, w_i). Let θ = (α, σ², Σ, τ) be the collection of all unknown parameters and ℓ_c(θ; z) the complete-data log-likelihood function. To find the maximum likelihood estimate of θ, the EM algorithm (Dempster et al., 1977) maximizes the observed-data likelihood by creating a sequence, {θ^(k), k ≥ 1}, via iterations between 1) an E-step that computes Q(θ; θ^(k)) = E[ℓ_c(θ; z) | y; θ^(k)], treating θ^(k) as the true parameter value, and 2) an M-step that sets θ^(k+1) = argmax_θ Q(θ; θ^(k)). Under general conditions, the sequence, {θ^(k), k ≥ 1}, is guaranteed to converge to a local maximum of the observed-data likelihood (Wu, 1983).

The complete-data log-likelihood consists of three components,

ℓ_c(θ; z) = ℓ^a_c(Σ; a) + ℓ^w_c(τ; w) + ℓ^y_c(α, σ²; z),  (2.9)

where the log-likelihoods of the random amplitude effects a = (a_1, . . . , a_N) and random warping effects w = (w_1, . . . , w_N) are

ℓ^a_c(Σ; a) = −N log(2π) − (N/2) log det Σ − (1/2) ∑_{i=1}^{N} (a_i − μ)^T Σ^{-1} (a_i − μ)
            = −N log(2π) − (N/2) log det Σ − (1/2) tr( [ ∑_{i=1}^{N} (a_i − μ)(a_i − μ)^T ] Σ^{-1} ),  (2.10)

and

ℓ^w_c(τ; w) = ∑_{k=2}^{K_h} (τ κ̄_k − 1) ∑_{i=1}^{N} log(w_{i,k}) − N [ ∑_{k=2}^{K_h} log Γ(τ κ̄_k) − log Γ(τ) ],  (2.11)

and the conditional normal log-likelihood of the observed curves given the random effects is

ℓ^y_c(α, σ²; z) = −(n_tot/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^{N} ||y_i − a_{i,sh} 1 − a_{i,sc} B_i(w_i) α||²
              = −(n_tot/2) log(2πσ²) − (1/(2σ²)) { ∑_{i=1}^{N} ||y_i − a_{i,sh} 1||² − 2 ∑_{i=1}^{N} (y_i − a_{i,sh} 1)^T (a_{i,sc} B_i(w_i)) α + tr( [ ∑_{i=1}^{N} a_{i,sc}² B_i(w_i)^T B_i(w_i) ] α α^T ) },  (2.12)

with n_tot the total number of observed values and B_i(w_i) the n_i × K_f basis evaluation matrix for the shape function at the warped times of the i-th curve, with (j, k) entry

[B_i(w_i)]_{jk} = B^f_k(h(t_{i,j}; w_i)),  j = 1, . . . , n_i,  k = 1, . . . , K_f.  (2.13)

For our model, we can see from (2.10)-(2.12) that the complete-data log-likelihood is linear in the following sufficient statistics:

S_yy = ∑_{i=1}^{N} S_{yy,i} ≡ ∑_{i=1}^{N} (y_i − a_{i,sh} 1)^T (y_i − a_{i,sh} 1),
S_By = ∑_{i=1}^{N} S_{By,i} ≡ ∑_{i=1}^{N} a_{i,sc} B_i(w_i)^T (y_i − a_{i,sh} 1),
S_BB = ∑_{i=1}^{N} S_{BB,i} ≡ ∑_{i=1}^{N} a_{i,sc}² B_i(w_i)^T B_i(w_i),
S_a = ∑_{i=1}^{N} S_{a,i} ≡ ∑_{i=1}^{N} (a_i − μ)(a_i − μ)^T, and
S_{ω_j} = ∑_{i=1}^{N} S_{ω_j,i} ≡ ∑_{i=1}^{N} log(w_{i,j}) for j = 2, . . . , K_h.

Therefore Q(θ; θ^(k)) depends on the observed data only through the conditional expectations of these sufficient statistics. Let S be a generic notation for a sufficient statistic and S̃^(k) its conditional expectation given the observed data under the "true" parameter θ^(k).

2.3.1 M-step

Given the conditional expectations of the sufficient statistics, the M-step is relatively straightforward.
Closed-form solutions exist for updating the estimates of α, Σ and σ²:

α^(k+1) = (S̃^(k)_BB)^{-1} S̃^(k)_By,
Σ^(k+1) = (1/N) S̃^(k)_a,
σ²^(k+1) = (1/n_tot) ( S̃^(k)_yy − 2 S̃^(k)T_By α^(k+1) + α^(k+1)T S̃^(k)_BB α^(k+1) ).

For the precision parameter of the Dirichlet warping effects, the corresponding maximizer,

τ^(k+1) = argmax_{τ>0} ∑_{j=2}^{K_h} (τ κ̄_j − 1) S̃^(k)_{ω_j} − N [ ∑_{j=2}^{K_h} log Γ(τ κ̄_j) − log Γ(τ) ],

can be found numerically by Newton's method.

2.3.2 E-step

Calculating S̃^(k), the conditional expectation of a sufficient statistic, is difficult, making the E-step challenging. Explicit calculation of these conditional expectations is practically impossible because of the model's non-linearity, caused by the warping functions. Thus sampling-based methods are often employed to approximate the expectations. Wei and Tanner (1990) considered Monte Carlo approximations with direct sampling from the conditional distribution of (a_i, w_i) | y_i; θ^(k). However, in our case, the conditional distribution is intractable. We experimented with the importance sampling approach of Walker (1996), but we found the efficiency of the importance sampler low for generating samples of the warping effects, the w_i's. The importance weights were concentrated on only a few points, resulting in a small effective sample size. This is not surprising given the high sampling frequency of our functional data: the conditional density of the warping functions is likely to be concentrated in a small region.

We also consider approximating S̃^(k) by Markov chain Monte Carlo (MCMC), where at the k-th step of the EM algorithm, for each curve, we sample a sequence of random effects, {(a^(k)_{i,[r]}, w^(k)_{i,[r]}); r = 1, . . . , R_k}, by a Metropolis-Hastings algorithm with (a_i, w_i) | y_i; θ^(k) as the stationary distribution.
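Returning to the M-step update for τ above: the Newton iteration only needs the digamma and trigamma functions. A sketch, with a hypothetical helper that assumes κ̄ is fixed and that S_ω collects the statistics ∑_i log w_{i,j}:

```python
import numpy as np
from scipy.special import psi, polygamma

def update_tau(kbar, S_omega, N, tau0=5.0, iters=100, tol=1e-10):
    """Newton's method for the M-step in tau: maximize
    sum_j (tau*kbar_j - 1)*S_omega_j - N*(sum_j logGamma(tau*kbar_j) - logGamma(tau))."""
    tau = tau0
    for _ in range(iters):
        # first and second derivatives of the objective in tau
        g1 = kbar @ S_omega - N * (kbar @ psi(tau * kbar) - psi(tau))
        g2 = -N * ((kbar ** 2) @ polygamma(1, tau * kbar) - polygamma(1, tau))
        step = g1 / g2
        tau = max(tau - step, 1e-8)   # keep tau in the feasible region
        if abs(step) < tol:
            break
    return tau
```

With κ̄ fixed the objective is a function of the single scalar τ, so each Newton step is cheap regardless of the number of warping coefficients.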
We then approximate the conditional expectations by ergodic averages,

Ŝ^(k)_{MC,i} = (1/R_k) ∑_{r=1}^{R_k} S(a^(k)_{i,[r]}, w^(k)_{i,[r]}, y_i).

However, this method is computationally too intensive, largely due to the evaluation of a B-spline basis matrix in (2.13) for each MCMC sample of the warping effect, w^(k)_{i,[r]}. In addition, our experience shows that R_k must be large for accurate approximation of the expectations. Indeed, in theory, the MCMC sample size, R_k, must increase with k so that the approximation error does not dominate and the EM algorithm can converge (Wei and Tanner, 1990).

Stochastic approximation for the E-step

Kuhn and Lavielle (2005) modify the algorithm to avoid generating a large MCMC sample afresh at each EM iteration, and we apply their modification here. We set

Ŝ^(k)_{SA,i} = Ŝ^(k−1)_{SA,i} + γ_k ( Ŝ^(k)_{MC,i} − Ŝ^(k−1)_{SA,i} ),

which, instead of discarding all samples from previous E-steps, updates Ŝ^(k−1)_{SA,i}, the conditional expectation approximation from the preceding E-step, using Ŝ^(k)_{MC,i} calculated from the MCMC sample generated in the k-th EM iteration. Kuhn and Lavielle showed that if the step size, γ_k, satisfies 1) 0 ≤ γ_k ≤ 1, 2) ∑ γ_k = ∞ and 3) ∑ γ_k² < ∞, and the Markov chains are uniformly ergodic, then their stochastic approximation EM (SAEM) algorithm converges almost surely to a local maximum of the observed-data log-likelihood under some general conditions. In addition, the convergence of SAEM does not depend on R_k, which implies that a precise MCMC approximation of Ŝ^(k)_{MC,i} for each k is not necessary.

We use the SAEM, choosing the step size to be

γ_k = 1 for k ≤ B, and γ_k = (k − B)^{−α} for k > B,

for 0.5 < α ≤ 1, so that the first B steps are burn-in steps in which the SAEM algorithm can move to the vicinity of the maximum.
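The step-size schedule and the stochastic-approximation update are simple to state in code (a sketch; here B and the exponent a are tuning constants, and a plays the role of the α in the schedule above, not the basis coefficients):

```python
import numpy as np

def saem_gamma(k, B=50, a=0.7):
    """SAEM step size: gamma_k = 1 during the burn-in, (k - B)^(-a) after."""
    return 1.0 if k <= B else float(k - B) ** (-a)

def saem_update(S_sa, S_mc, gamma):
    """Robbins-Monro style update of the running sufficient-statistic estimate."""
    return S_sa + gamma * (S_mc - S_sa)
```

With gamma = 1 the update simply replaces the running estimate by the latest MCMC value (pure Monte Carlo EM); the decaying steps after burn-in average the noisy single-sample statistics over iterations.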
For the MCMC updates, we choose R_k ≡ 1 to minimize the number of B-spline basis evaluations at each step.

Simulation step by Metropolis-Hastings-within-Gibbs

We generate the Markov chains by a Metropolis-Hastings-within-Gibbs algorithm to sample a_i and w_i in turn. Sampling from the conditional distribution of a_i is straightforward, as follows. Since the distribution of the observations given the random effects is

y_i | a_i, w_i ~ N(F_i(w_i) a_i, σ² I),

where F_i(w_i) is the n_i × 2 matrix whose j-th row is (1, f(h(t_{i,j}; w_i))), and a_i is normally distributed with mean μ and covariance Σ, the conditional distribution of a_i given y_i and w_i is also normal with covariance matrix

Σ_i = ( σ^{-2} F_i(w_i)^T F_i(w_i) + Σ^{-1} )^{-1}  (2.14)

and mean

μ_i = Σ_i ( σ^{-2} F_i(w_i)^T y_i + Σ^{-1} μ ).  (2.15)

Therefore, the MCMC sample of a_i can be generated directly by a Gibbs sampler.

On the other hand, the conditional distribution of w_i | a_i, y_i does not belong to a common family, and direct sampling is difficult. Therefore we replace the Gibbs step by the following Metropolis-Hastings step. First, we transform w_i from the (K_h − 2)-simplex, S, to the space P = {x ∈ R^{K_h−1} : x^T 1 = 0} by the centered-log-ratio transform (Aitchison, 1982),

[G(w_i)]_k = log(w_{i,k+1}) − (1/(K_h − 1)) ∑_{j=2}^{K_h} log(w_{i,j}).

A proposal is drawn by performing a random walk on P where the random step follows a (K_h − 1)-dimensional normal distribution with zero mean and a rank K_h − 2 covariance matrix with diagonal elements equal to σ²_q (K_h − 2)/(K_h − 1) and off-diagonal elements equal to −σ²_q/(K_h − 1) to enforce the sum-to-zero constraint. We choose the centered-log-ratio transform and an exchangeable covariance matrix for a symmetric treatment of the warping coefficients.
The proposal is mapped from P back to S by the softmax transform,

\[ [G^{-1}(x)]_{j+1} = e^{x_j} \Big/ \sum_{k=1}^{K_h - 1} e^{x_k}. \]

The proposal for the warping effect, denoted by w^*, is then accepted with probability

\[ \min\left\{ 1,\; \frac{ f_{y|a,w}\big(a^{(k)}_i, w^* \,\big|\, y_i; \theta^{(k)}\big) }{ f_{y|a,w}\big(a^{(k)}_i, w^{(k-1)}_i \,\big|\, y_i; \theta^{(k)}\big) } \cdot \prod_{l=2}^{K_h} \frac{[w^*]_l}{[w^{(k-1)}_i]_l} \right\}; \]

otherwise, the sampled state from the previous iteration is taken as the new state. In practice, we run each chain in the E-step for 5 to 10 iterations to encourage a better approximation to the stationary distribution. This is in keeping with Kuhn and Lavielle (2005), who found that a small number of burn-in iterations for this inner MCMC step usually suffices and that a longer burn-in does not improve the convergence of the SAEM algorithm much.

Stochastic approximation for predicted random effects

To predict the warping function and the warped curves from the fitted model, we use the conditional means of the desired functions:

\[ \hat{y}_i(t) = E\big(a_{i,sh} \,\big|\, y_i; \hat\theta\big) + \sum_{k=1}^{K_f} E\big(a_{i,sc}\, B^f_k(h(t; w_i)) \,\big|\, y_i; \hat\theta\big)\, \hat\alpha_k, \]

and

\[ \hat{h}_i(t) = \sum_{k=1}^{K_h} \sum_{j=1}^{k} E\big(w_{i,j} \,\big|\, y_i; \hat\theta\big)\, B^h_k(t). \qquad (2.16) \]

Similarly, the amplitude effects can be predicted by the conditional means

\[ \hat{a}_{i,sh} = E\big(a_{i,sh} \,\big|\, y_i; \hat\theta\big) \quad \text{and} \quad \hat{a}_{i,sc} = E\big(a_{i,sc} \,\big|\, y_i; \hat\theta\big). \qquad (2.17) \]

Stochastic approximations to the required conditional expectations are easily obtained as a by-product of the SAEM algorithm.

2.4 Simulation study

In a simulation study, we compare the statistical efficiency of our proposed model for estimating the base shape and predicting the warping functions, and the computational feasibility of the SAEM algorithm, to existing model-based registration methods in the literature.

Data are simulated from model (2.3)–(2.7) discussed in Section 2.1, where the cubic splines for the base shape and the warping functions have equally spaced knots over the interval [0, 1].

Two scenarios are considered.
In the first case, the base shape uses five B-spline basis functions with coefficients equal to 0, -200, -500, -200 and 0, while the warping functions use six B-spline basis functions. In the second case, the base shape uses 11 B-spline basis functions with coefficients equal to -350, -300, -700, -100, 400, -100, -700, 100, -800, 400 and -450, while the warping functions use nine B-spline basis functions. The Dirichlet mean is set to

\[ \bar\kappa = \tfrac{1}{9}(1, 2, 3, 2, 1)^\top \]

for the first scenario, and

\[ \bar\kappa = \tfrac{1}{18}(1, 2, 3, 3, 3, 3, 2, 1)^\top \]

for the second scenario, such that E[h_i(t)] = t for t \in [0, 1]. For the random effect distributions, in both cases, the precision parameter of the Dirichlet warping effects is \tau = 10 and the covariance matrix, \Sigma, of the normal amplitude effects is diagonal, with variances of the shifting and scaling effects equal to 20^2 and 0.05^2 respectively. The error process is white noise, with \epsilon_{ij} normally distributed with variance \sigma^2 = 5^2. Observations for each curve are sampled at equally spaced time points. Three sampling frequencies, n, are considered: 100, 1000 and 2000 points per curve. We simulate 20 curves for each simulation run, with 200 runs for each setting. Figure 2.1 shows the simulated curves of shape 1 with 100 points per curve and shape 2 with 1000 points per curve respectively.

We compare the flexible SIM approach of Brumback and Lindstrom (2004), which we refer to as BL2004, the phase and amplitude varying population pattern (PAVPOP) model of Raket et al. (2014), the Bayesian hierarchical curve registration (BHCR) of Telesca and Inoue (2008), and our proposed model and SAEM algorithm. We implement our method in R and C++ and provide the code in an online repository. R packages and source code for BL2004 and BHCR were obtained from the authors, whereas the R package for PAVPOP is available in an online repository. While
the four methods model the random effects differently, in our simulations the basis functions used to model the base shape and the warping functions are correctly specified for all methods.

The R code for BL2004 and PAVPOP has built-in monitoring of convergence. However, no such functionality is implemented for SAEM and BHCR; determining convergence at successive iterations appears to be an open question for these methods. In our simulation study, for these two methods, we run a fixed schedule of 10000 iterations after 2000 burn-in iterations. During the first half of the burn-in stage for SAEM, we fix the step size of the stochastic approximation at \gamma_k = 1 to encourage the algorithm to move quickly to the region of high likelihood. Choosing a good proposal variance up front for the Metropolis-Hastings algorithm is difficult: a proposal variance that is too large or too small leads to impractically low or high acceptance rates, causing poor mixing of the Markov chains and hence slow convergence of the SAEM algorithm. In order to find a reasonable proposal variance, we tune the variance adaptively during the burn-in stage such that the acceptance rate is between 17% and 33%. This range of acceptance rates is chosen with reference to the optimality results in Roberts and Rosenthal (2001) for random walk Metropolis algorithms.

For model identifiability, we assume that the mean of the amplitude effect, \mu, is (0, 1)^\top. However, the empirical mean of the estimated amplitude effects can still drift away from the population mean, resulting in an estimated template with an inappropriate scale. We avoid this issue by adjusting, at each EM iteration, the MCMC sample of amplitude effects with the transformation

\[ a_{i,adj} = \begin{pmatrix} 1 & -\bar{a}_{sh}/\bar{a}_{sc} \\ 0 & 1/\bar{a}_{sc} \end{pmatrix} a_i, \]

for each curve i, before the stochastic approximation E-step, thus ensuring that the empirical mean of the approximations of E[a_i \mid y_i; \hat\theta] is always equal to (0, 1)^\top.
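The adjustment just described is a deterministic affine map of the sampled amplitude effects. A minimal sketch with illustrative names and toy values:

```python
import numpy as np

def recenter_amplitudes(a):
    """Affine adjustment forcing the empirical mean of the sampled
    amplitude effects (shift, scale) back to (0, 1):
    a_adj = [[1, -mean_sh/mean_sc], [0, 1/mean_sc]] @ a_i for each row."""
    a_sh_bar, a_sc_bar = a.mean(axis=0)
    T = np.array([[1.0, -a_sh_bar / a_sc_bar],
                  [0.0, 1.0 / a_sc_bar]])
    return a @ T.T

# Toy MCMC sample of (shift, scale) effects for 20 curves that has
# drifted away from the assumed population mean (0, 1).
rng = np.random.default_rng(2)
a = rng.normal([-25.0, 500.0], [10.0, 50.0], size=(20, 2))
a_adj = recenter_amplitudes(a)
```

After the map, the adjusted shifts average exactly 0 and the adjusted scales average exactly 1, so the estimated template keeps an interpretable scale.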
The simulations are run on the Cedar cluster of Compute Canada. Tables 2.1 and 2.2 report the computing times for completed simulation runs and show the percentage of runs that were aborted when the maximum allocated time was exceeded or that failed due to numerical degeneracy. In terms of stability, the SAEM and BHCR approaches always return a fitted model, whereas BL2004 and PAVPOP, which both rely on the approximation method of Lindstrom and Bates (1990) for a nonlinear mixed-effects model, can run into numerical errors. In most cases, the numerical errors are caused by degeneracy of matrices that are required to be positive definite. For these two methods, the numerical problem is usually more prevalent for shape 2, possibly because the base shape is more complicated and the splines used in fitting the model are more flexible. As the sampling frequency, n, increases, the percentage of runs with numerical errors tends to decrease for PAVPOP but tends to increase for BL2004.

Comparing the computing time of completed runs, SAEM is the fastest, while BHCR takes two to three times longer. This is likely due to the difference in the MCMC samplers of the two approaches. While BHCR updates the warping coefficients one at a time, our approach updates the entire vector of warping coefficients in a single Metropolis-Hastings step. The simultaneous updating minimizes the number of times we need to re-evaluate the basis functions of the common shape at the new warped times. Our method is therefore more scalable in the sampling frequency and the number of knots for the warping functions. On the other hand, both methods have low variability in their computing times, since they are set up to run the same number of iterations. We do not consider BL2004 in this comparison because only one completed run converged, for shape 1 with n = 2000.
Among the remaining three methods, PAVPOP is the slowest because of the need to invert correlation matrices of the same dimension as the sampling frequency, due to the Gaussian process assumption on the error process. The computing time is prohibitively long at high sampling frequency: in the case of shape 1 with n = 2000, 92% of the runs exceeded the maximum allocated time of six hours.

For convergence diagnostics of SAEM, Figure 2.2 shows the estimated base shape and error variance across the iterations for one simulation replicate from each setting. There is no practical change in the estimated base shape after 2000 iterations, while the estimated error variances have stabilized around the true value by the time the algorithm is stopped. Additional plots for convergence diagnostics for both SAEM and BHCR appear in the supplementary materials of Fu and Heckman (2018). These plots show that, for BHCR, the estimated base shape changes little after the burn-in period, although the rate of convergence seems to be slightly slower than SAEM for shape 2 and n \ge 1000. Hence, for both algorithms, our choice of the fixed number of iterations seems sufficient for the simulation study.

For estimating the base shape, all but one method recover both shape 1 and shape 2 faithfully on average, as seen in the top panel of Figure 2.3. The exception is BL2004, which fails to reveal the smaller peaks and valleys of shape 2. An alternative visualization in terms of bias is shown in the middle panel of Figure 2.3. The bottom panel of Figure 2.3 shows the root mean squared errors (RMSEs), while Table 2.3 lists the integrated mean squared errors (IMSEs). Our proposed method has the smallest IMSE in all cases, while BHCR has the second lowest IMSE for shape 1.
For BL2004, although convergence is not declared in many cases, the IMSEs are comparable to those of the other methods, especially for shape 2, except in the case with n = 100, where the bias is also substantial.

We also compare the predicted warping functions in Table 2.3 and in Figure 2.4. For SAEM and BHCR, the prediction bias and variability are modest in all cases. The bias for PAVPOP is also small when it returns a fitted model. For BL2004, the average prediction bias is much higher than for the other methods for both shapes. In general, our method has the smallest integrated mean squared prediction error (IMSPE) of the predicted warping functions in the case of shape 1, while BHCR has the smallest IMSPE in the case of shape 2, as shown in Table 2.3.

2.5 Remarks on using a fixed template

The evaluation of the template function at warped time points requires evaluation of the B-spline basis functions, the B^f_k's. This is a computational bottleneck for our SAEM algorithm, accounting for up to 80% of the run time in our analyses. A substantial gain in computational efficiency can be achieved if we replace the unknown base shape by a known template that is easy to evaluate. For example, if the noiseless curves are believed to be unimodal, a specified parabola might be an adequate template, as long as the warping functions are flexible enough to warp the template and fit the data well. The use of a fixed template also avoids the identifiability issues. With known f, the mean parameters \mu and \bar\kappa in our curve registration model become free parameters that need to be estimated.
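To illustrate, the sketch below warps a fixed parabola template with a simple monotone time transformation built from simplex increments. The piecewise-linear warping is an illustrative stand-in for the B-spline warping functions of the model, chosen only because it needs no spline machinery.

```python
import numpy as np

def parabola_template(t):
    """Fixed unimodal template f(t) = -4 t (1 - t); cheap to evaluate,
    avoiding repeated B-spline basis evaluations at warped times."""
    return -4.0 * t * (1.0 - t)

def warp(t, w):
    """Monotone warping with h(0) = 0 and h(1) = 1: cumulative sums of
    the nonnegative increments w (w[0] = 0, remaining entries summing to
    one) give increasing coefficients, interpolated piecewise linearly."""
    beta = np.cumsum(w)
    knots = np.linspace(0.0, 1.0, len(beta))
    return np.interp(t, knots, beta)

t = np.linspace(0.0, 1.0, 201)
w = np.array([0.0, 0.1, 0.5, 0.3, 0.1])        # w_1 = 0, rest on the simplex
h = warp(t, w)
curve = -25.0 + 500.0 * parabola_template(h)   # shift + scale * f(h(t))
```

All of the curve's individuality comes from the shift, the scale, and the warping; the template itself never changes, which is exactly what makes the fixed-template model cheap.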
We refer the reader to Appendix A.2.2 for the derivation of an SAEM algorithm for a model with a known template.

2.6 Tables and figures

Figure 2.1: Twenty simulated curves for each of shape 1 (top row) and shape 2 (bottom row); panels show the amplitude functions, the warping functions and the warped curves with noise. The number of observations per curve is n = 100 for shape 1 and n = 1000 for shape 2.

Table 2.1: Summary of simulation computations: computing times of completed runs and the percentage of failed runs for varying numbers of sampled points. We allocate a maximum computing time of six hours for each run. (Shape 1)

n     Method  Median  IQR     Max. time  Numerical error  Nonconvergence  Successful
              (sec)   (sec)   exceeded   encountered      declared        runs
100   SAEM    19.5    0.9     0.0%       0.0%             -               100%
100   BHCR    46.0    4.8     0.0%       0.0%             -               100%
100   BL2004  65.8    81.1    0.0%       21.5%            78.5%           0.0%
100   PAVPOP  169.6   53.1    0.0%       40.5%            3.5%            56.0%
1000  SAEM    151.0   1.2     0.0%       0.0%             -               100%
1000  BHCR    421.3   7.5     0.0%       0.0%             -               100%
1000  BL2004  235.4   255.8   0.0%       41.0%            59.0%           0.0%
1000  PAVPOP  7541.1  2399.7  0.0%       11.5%            7.5%            81.0%
2000  SAEM    295.3   26.7    0.0%       0.0%             -               100%
2000  BHCR    881.6   15.6    0.0%       0.0%             -               100%
2000  BL2004  271.3   360.2   0.0%       46.0%            53.5%           0.5%
2000  PAVPOP  19480.6 2517.8  92.0%      6.5%             0.0%            1.5%

Table 2.2: Summary of simulation computations: computing times of completed runs and the percentage of failed runs for varying numbers of sampled points. We allocate a maximum computing time of six hours for each run. (Shape 2)

n     Method  Median  IQR     Max. time  Numerical error  Nonconvergence  Successful
              (sec)   (sec)   exceeded   encountered      declared        runs
100   SAEM    30.0    0.8     0.0%       0.0%             -               100%
100   BHCR    71.4    2.6     0.0%       0.0%             -               100%
100   BL2004  386.0   24.3    0.0%       63.5%            36.5%           0.0%
100   PAVPOP  612.9   247.1   0.0%       72.5%            3.5%            24.0%
1000  SAEM    242.8   3.9     0.0%       0.0%             -               100%
1000  BHCR    711.9   28.7    0.0%       0.0%             -               100%
1000  BL2004  288.6   137.0   0.0%       98.0%            2.0%            0.0%
1000  PAVPOP  12668.4 3248.6  0.0%       14.5%            3.5%            82.0%
2000  SAEM    499.2   14.7    0.0%       0.0%             -               100%
2000  BHCR    1756.4  79.5    0.0%       0.0%             -               100%
2000  BL2004  490.2   207.8   0.0%       97.5%            2.5%            0.0%
2000  PAVPOP  -       -       73.5%      26.5%            0.0%            0.0%

Figure 2.2: Output of SAEM iterations from several simulated data sets (top plots for shape 1, bottom plots for shape 2, for n = 100, 1000 and 2000 observations per curve). Estimated base shapes (left) for 1, 2000, 7000 and 12000 iterations. Trace plots for the error variance (right) by SAEM iteration. Here, the first 2000 iterations are the burn-in and calibration period. In the six error variance plots, the first 500 iterations are excluded for better scaling.

Figure 2.3: Simulation results for shapes 1 and 2 when the number of observations per curve is 100 (first column), 1000 (second column) or 2000 (third column), with pointwise 95% confidence bands based on the empirical standard error. Top panel: Average estimated common shape. Middle panel: Average bias. Bottom panel: Root mean squared error.

Table 2.3: Average IMSE of the estimated common shape and IMSPE of the predicted warping functions.

Shape  n     Method  Average IMSE of  Average IMSPE of
                     common shape     warping functions
1      100   SAEM    79               0.14 x 10^-3
1      100   BHCR    114              0.20 x 10^-3
1      100   BL2004  379              2.75 x 10^-3
1      100   PAVPOP  222              0.52 x 10^-3
1      1000  SAEM    68               0.11 x 10^-3
1      1000  BHCR    198              0.35 x 10^-3
1      1000  BL2004  281              2.22 x 10^-3
1      1000  PAVPOP  250              0.49 x 10^-3
1      2000  SAEM    68               0.11 x 10^-3
1      2000  BHCR    231              0.39 x 10^-3
1      2000  BL2004  266              2.16 x 10^-3
1      2000  PAVPOP  268              0.52 x 10^-3
2      100   SAEM    4807             3.85 x 10^-3
2      100   BHCR    10677            3.74 x 10^-3
2      100   BL2004  60333            181.82 x 10^-3
2      100   PAVPOP  9415             6.12 x 10^-3
2      1000  SAEM    5322             5.35 x 10^-3
2      1000  BHCR    10088            3.83 x 10^-3
2      1000  BL2004  8812             15.45 x 10^-3
2      1000  PAVPOP  10062            4.68 x 10^-3
2      2000  SAEM    5256             5.39 x 10^-3
2      2000  BHCR    10638            3.42 x 10^-3
2      2000  BL2004  8640             15.02 x 10^-3
2      2000  PAVPOP  -                -

Figure 2.4: Average prediction error of the warping functions and pointwise 95% confidence band based on the empirical standard error, for n = 100, 1000 and 2000 observations per curve.

Chapter 3

Clustering on phase variation

3.1 Review of functional data clustering

Classical clustering methods for functional data can be viewed as extensions of multivariate clustering methods to data objects of infinite dimension. Three main approaches exist in the literature.

Distance-based clustering

The first is distance-based, relying on some measure of dissimilarity between two curves. For example, Ferraty and Vieu (2006) proposed a hierarchical clustering algorithm based on semi-metrics of the form

\[ d_m(y_1, y_2) = \left[ \int_T \big| y_1^{(m)}(t) - y_2^{(m)}(t) \big|^2 \, dt \right]^{1/2}, \qquad (3.1) \]

where d_0 is the L^2 distance. To estimate the semi-metrics, the functions, y_1 and y_2, are often reconstructed from the vectors of discrete time data, y_1 and y_2.
For densely observed data and smooth functions, y_1 and y_2, each function can be approximated by a basis expansion with analytic form,

\[ y_i(t) \approx \sum_{j=1}^{D} B_j(t)\, \alpha_{i,j}, \qquad (3.2) \]

where the vector of basis coefficients \alpha_i = (\alpha_{i,1}, \ldots, \alpha_{i,D})^\top can be estimated by least squares, yielding \hat\alpha_i. The estimate of the m-th derivative of the function, y_i^{(m)}, is simply the derivative of the right-hand side of (3.2) with \hat\alpha_{i,j} replacing \alpha_{i,j}, yielding

\[ \hat{y}_i^{(m)}(t) = \frac{\partial^m}{\partial t^m} \hat{y}_i(t) = \sum_{j=1}^{D} B_j^{(m)}(t)\, \hat\alpha_{i,j}. \]

Ferraty and Vieu (2006) suggest using B-spline basis expansions in (3.2), which are numerically stable for estimating derivatives. The integral in the semi-metric (3.1) is subsequently estimated by replacing the smooth functions by the spline approximations. Since the splines are piecewise polynomials, the integrals can be computed explicitly. For sparsely observed curves, Peng and Müller (2008) do not reconstruct the functions explicitly. Instead, they rely on the conditional expectation of the L^2 distance given the data,

\[ \tilde{d}_0(y_1, y_2) = E\big[ d_0(y_1, y_2)^2 \,\big|\, y_1, y_2 \big]^{1/2}. \]

The conditional expectation can be estimated using the principal analysis by conditional expectation framework (Yao et al., 2005). The authors note that \tilde{d}_0 is a metric that can be used with distance-based clustering methods.

Clustering via estimated basis coefficients

An alternative to the distance-based method is the filtering approach, which applies methods for multivariate data directly to the estimated basis coefficients, \hat\alpha_i, from model (3.2). Abraham et al. (2003) and Serban and Wasserman (2005) apply k-means clustering to B-spline and Fourier basis coefficients respectively. Garcia-Escudero and Gordaliza (2005) propose a robust method of trimmed k-means clustering for B-spline basis coefficients. Jank (2006) treats the estimated coefficients as data and models their distribution as a finite mixture of Gaussian distributions with cluster-specific means and variance-covariance matrices. Bouveyron and Jacques (2011) apply the high dimensional data clustering (HDDC) of Bouveyron et al. (2007) to basis coefficients when the number of basis functions is large.

Simultaneous curve fitting and clustering

A third approach, motivated by sparsely observed and noisy curves, is to perform curve fitting and clustering simultaneously. Often a probabilistic model for (3.2) is employed for the functional data, where the basis coefficients, the \alpha_i's, are treated as random effects that follow a finite mixture of Gaussian distributions (James and Sugar, 2003; Giacofci et al., 2013). Clustering and curve fitting via these mixture models are then performed iteratively with an EM algorithm, either for soft clustering, where the probability of a curve belonging to a specific cluster is returned, or for hard clustering, which assigns each curve to one cluster. More recently, researchers have considered a definition of a cluster in terms of a mean function and a set of basis functions. Cluster k then has possible paths in a functional subspace,

\[ C_k = \Big\{ y : y(t) = \mu_k(t) + \sum_{j=1}^{D_k} B_{j,k}(t)\, \alpha_j, \; \alpha \in \mathbb{R}^{D_k} \Big\}, \]

with cluster-specific dimension, D_k, and directions of variation, the B_{j,k}'s, for j = 1, \ldots, D_k. For example, Chiou and Li (2007) propose the k-centres functional clustering (kCFC) algorithm, which alternates between identifying the functional subspaces and assigning curves to the closest subspace. Jacques and Preda (2013) consider model-based methods based on Gaussian distributions and discuss EM-like algorithms for clustering.

Methods treating phase variation as within-cluster variation

As we pointed out in Chapter 2, functional data often exhibit both amplitude and phase variation. The clustering methods reviewed above do not explicitly model the two modes of variation, although some recent work treats phase variation as within-cluster variation.
The aim of this recent work is to identify a different typical shape for each cluster and register the curves to these typical shapes.

Sangalli et al. (2010) discuss a k-means algorithm for clustering smooth curves of the form

\[ y_i(t) = f_i \circ h_i(t), \]

where h_i \in H_{lin} = \{h : h(t) = w_{sh} + w_{sc} t,\; w_{sh} \in \mathbb{R},\; w_{sc} \in \mathbb{R}^+\} is a linear time transformation. Given a similarity index \rho between two curves and a cluster centroid \varphi_k, the authors define the phase variation between y_i and \varphi_k by the optimal warping,

\[ h_{i \to \varphi_k} = \arg\max_{h \in H_{lin}} \rho(y_i \circ h, \varphi_k), \qquad (3.3) \]

and the amplitude variation between y_i and \varphi_k by

\[ \rho(y_i \circ h_{i \to \varphi_k}, \varphi_k). \qquad (3.4) \]

For the definitions of phase and amplitude variation in (3.3) and (3.4) to be sensible, the authors point out that the similarity index needs to satisfy the following two conditions, in addition to the usual properties of reflexivity, symmetry and transitivity. The first condition requires that, for any two curves y_i and y_j,

\[ \rho(y_i \circ h, y_j) = \rho(y_i, y_j \circ h^{-1}) \quad \forall h \in H_{lin}, \]

so that aligning y_i to y_j results in the same similarity as aligning y_j to y_i. This condition leads to a definition of amplitude variation that is independent of how the curves are aligned, from one to another or to a common template. The second condition requires that

\[ \rho(y_i, y_j) = \rho(y_i \circ h, y_j \circ h) \quad \forall h \in H_{lin}, \]

which implies that

\[ \rho(y_i \circ h_{i \to \varphi_k}, \varphi_k) = \rho(y_i \circ h_{i \to \varphi_k} \circ h, \varphi_k \circ h) \quad \forall h \in H_{lin}, \qquad (3.5) \]

that is, the amplitude variation in (3.4) does not change when the curve and the template are warped simultaneously. The choice of \rho depends on the class of warping functions. For the linear transformations, H_{lin}, a similarity index that satisfies (3.1) and (3.5) is

\[ \rho(y_1, y_2) = \frac{ \int_{\mathbb{R}} y_1'(s)\, y_2'(s)\, ds }{ \sqrt{ \int_{\mathbb{R}} [y_1'(s)]^2\, ds \cdot \int_{\mathbb{R}} [y_2'(s)]^2\, ds } }. \]

The main task of the k-means algorithm is to alternate between an alignment and cluster-assignment step and a template-updating step.
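A minimal numerical sketch of this derivative-based similarity index on a common grid, using crude finite differences (illustrative only; the index is just the uncentered correlation of the two derivatives):

```python
import numpy as np

def similarity(y1, y2, dt):
    """Derivative-based similarity index:
    rho = <y1', y2'> / (||y1'|| * ||y2'||).
    The grid spacing dt cancels in the ratio, so simple finite
    differences are enough for this sketch."""
    d1 = np.diff(y1) / dt
    d2 = np.diff(y2) / dt
    return (d1 * d2).sum() / np.sqrt((d1 * d1).sum() * (d2 * d2).sum())

t = np.linspace(0.0, 1.0, 500)
y = np.sin(2.0 * np.pi * t)
rho_self = similarity(y, y, t[1] - t[0])
rho_affine = similarity(y, 3.0 * y + 2.0, t[1] - t[0])
```

Note that the index is unchanged by vertical shifting and positive scaling of either curve, which is what makes it a measure of pure shape similarity rather than of amplitude.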
In the alignment and cluster-assignment step, the warping functions are computed according to (3.3) for each combination of curve y_i and template \varphi_k. A curve is then assigned to the cluster with the highest similarity (3.4). Given the warping functions and the cluster assignments, the template for cluster k is updated to

\[ \arg\max_{\varphi} \sum_{i \in C_k} \rho(y_i \circ h_{i \to \varphi_k}, \varphi), \]

where C_k is the set of indices of curves in cluster k.

Liu and Yang (2009) propose model-based clustering using a finite mixture of simple shape-invariant models with cluster-specific base shapes. The warping functions, h_i(t) = w_{i,sh} + t, are limited to simple lateral shifts in time. The k-th cluster can be understood as

\[ C_k = \{ y : y(t) = a_{sh} + a_{sc} \cdot f_k(w_{sh} + t),\; (a_{sh}, a_{sc}, w_{sh})^\top \in \mathbb{R}^3 \}, \]

with cluster-specific base shape, f_k. For phase variation with more flexible time warpings, Zhang and Telesca (2014) extend the single-cluster Bayesian hierarchical curve registration model of Telesca and Inoue (2008), discussed in Chapter 2, by imposing a Dirichlet process prior over the base shapes. Because of the nonlinearity introduced by the flexible warping functions, model-based methods with cluster-specific templates are often computationally highly intensive. Intuitively, this is because, at each iteration, the EM-based algorithms must register each curve to each template. In an attempt to address this computational challenge, Maire et al. (2017) propose a relatively more efficient online-EM algorithm, although the focus is on recovering the base shapes rather than predicting the cluster labels for all curves.

Motivated by the analysis of elephant seal dive profiles, we consider the scenario where heterogeneity in phase variation is the only clustering structure. Assuming that the observed curves can be adequately approximated by warping a common template, we propose a mixture of warpings model for warped curves with between-cluster variation in phase.
We describe the model in Section 3.2 and discuss simultaneous curve fitting and clustering via stochastic approximation EM (SAEM) algorithms in Subsection 3.2.1. Our use of a common template across clusters reduces the computational complexity of simultaneous curve fitting and clustering compared to models with cluster-specific templates. We also propose a two-step approach in Subsection 3.2.2 to address the sensitivity of clustering results to initial values. In Section 3.3, we conduct a simulation study to compare our proposed methods to clustering algorithms in the literature that do not explicitly account for the two modes of variation.

3.2 Mixture of warpings model

In general, phase variation can be an interesting characteristic, along with amplitude variation, for defining clusters. In that case, eliminating phase variation by curve registration prior to clustering is not sensible.

For simultaneous curve fitting and clustering on phase variation, we propose a mixture of warpings model that extends the curve registration model of Chapter 2. Recall the basic framework of the curve registration model: the discrete time observations are modelled as

\[ y_{ij} = (f_i \circ h_i)(t_{ij}) + \epsilon_{ij} \equiv a_{i,sh} + a_{i,sc}\, f \circ h_i(t_{ij}) + \epsilon_{ij}, \qquad (3.6) \]

as in (2.2), with Gaussian white noise, \epsilon_{ij}, and with the amplitude function, f_i, modelled as vertical shifting and scaling of a base shape, f, as in (2.3). This common base shape is either a known template with a closed-form expression or an unknown function modelled with B-splines as in (2.4). The warping function, h_i, is also modelled with B-splines, with parameterization

\[ h_i(t) = \sum_{k=1}^{K_h} \beta_{i,k}\, B^h_k(t), \qquad \beta_{i,k} = \sum_{j=1}^{k} w_{i,j}, \qquad (3.7) \]

as in (2.5) and (2.6), where w_{i,1} \equiv 0 and (w_{i,2}, \ldots, w_{i,K_h})^\top is a vector on the (K_h - 2)-standard-simplex. In contrast to Chapter 2, we must define clusters, and we take a probabilistic approach to identifying the clustering structure.
Specifically, we propose a distribution on m_i, the latent cluster label, and a finite mixture of Dirichlet distributions on the coefficients w_i \equiv (w_{i,2}, \ldots, w_{i,K_h})^\top:

\[ m_i \sim \text{Categorical}(\pi_1, \ldots, \pi_M), \qquad w_i \mid m_i = m \sim \text{Dirichlet}(\bar\kappa_m, \tau_m), \qquad (3.8) \]

where \bar\kappa_m is the cluster-specific Dirichlet mean vector and \tau_m is the cluster-specific precision parameter, as defined in (2.8) in Chapter 2. The mean and precision parameters can be linked to the usual Dirichlet concentration parameters through \kappa_m = \bar\kappa_m \cdot \tau_m and \tau_m = \kappa_m^\top 1. Depending on the context, we use the two parameterizations interchangeably for simpler notation.

To justify the use of a common template, consider the alternative setup with cluster-specific templates f_1, \ldots, f_M and a common within-cluster distribution of warping functions, h_i \sim \mathcal{F}_0, representing a modest level of phase variation within each cluster. For simplicity, in our explanation, we assume no amplitude shifting or scaling across curves. Curves in the m-th cluster thus follow the probabilistic model

\[ y_{m,i} = (f_m \circ h_{m,i})(t) \quad \text{for } h_{m,i} \sim \mathcal{F}_0. \]

If the cluster-specific templates are homeomorphic, i.e. if, for any j and k, we can find a warping function h_{j \to k} such that f_j \circ h_{j \to k} = f_k, then the clusters can be equivalently modelled in terms of a common template, f_1:

\[ y_{m,i} = (f_m \circ h_{m \to 1} \circ h_{1 \to m} \circ h_{m,i})(t) = (f_1 \circ h^*_{m,i})(t). \]

The new warping function, h^*_{m,i} = h_{1 \to m} \circ h_{m,i}, with respect to f_1 now follows a transformed distribution, which we denote by \mathcal{F}_m. The curves can then be grouped and the clusters identified by estimating \mathcal{F}_1, \ldots, \mathcal{F}_M instead of the templates.
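As an illustration, the generative mechanism in (3.8) can be sketched as follows: a categorical draw of the cluster label, then a Dirichlet draw of the free warping increments, then cumulative sums to form monotone warping coefficients. The parameter values are toy choices, not those of the thesis.

```python
import numpy as np

# Toy mixture: M = 2 clusters, K_h = 5 warping basis functions,
# equal mixing proportions, cluster-specific Dirichlet means and precisions.
rng = np.random.default_rng(3)
pis = np.array([0.5, 0.5])
kappa_bar = [np.array([0.1, 0.4, 0.4, 0.1]),
             np.array([0.4, 0.1, 0.1, 0.4])]
tau = [10.0, 10.0]

def draw_warping_coefficients(n):
    """Draw n curves' worth of latent labels and warping effects:
    m_i ~ Categorical(pi), w_i | m_i = m ~ Dirichlet(kappa_bar_m * tau_m),
    then beta_{i,k} = sum_{j<=k} w_{i,j} with w_{i,1} = 0."""
    labels = rng.choice(len(pis), size=n, p=pis)
    W = np.stack([rng.dirichlet(kappa_bar[m] * tau[m]) for m in labels])
    beta = np.cumsum(np.hstack([np.zeros((n, 1)), W]), axis=1)
    return labels, W, beta

labels, W, beta = draw_warping_coefficients(200)
```

Every row of `beta` increases from 0 to 1, so each draw defines a valid monotone time transformation; the clusters differ only through the distribution of those increments, which is exactly the clustering structure the model targets.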
For our model, we ignore the complicated impact of the transformations on the distributions and instead assume one common template, with within-cluster distributions of warping functions as in (3.8).

3.2.1 Simultaneous registration and clustering with known and unknown templates

Under the same framework as (2.2), we treat the warping functions as latent random functions that must be estimated from the raw data. One approach to clustering with model (3.6)–(3.8) is to estimate the warping functions and cluster the curves simultaneously. As in Chapter 2, we assume that

\[ (a_{i,sh}, a_{i,sc})^\top \sim N(\mu, \Sigma), \qquad \epsilon_{ij} \sim N(0, \sigma^2), \qquad (3.9) \]

and refer to the full model (3.6)–(3.9) as a mixture of warpings model.

If the base shape is unknown and modelled with B-splines as in (2.4), we again fix the mean of the random amplitude effects at \mu = (0, 1)^\top for model identifiability. For phase variation, the overall mean of the warping functions across the M clusters could be controlled via the constraint \bar\kappa_{overall} \equiv \sum_{m=1}^{M} \pi_m \bar\kappa_m = \bar\kappa_{identity}, such that

\[ E[h_i(t)] = t \quad \text{for } t \in [0, 1]. \]

However, this constraint complicates the updating of the parameters of the mixture distribution in the M-step. Instead, we fix the Dirichlet mean of the first component at \bar\kappa_1 = \bar\kappa_{identity}, so that the average warping of this reference group is the identity function, that is,

\[ E[h_i(t) \mid m_i = 1] = t \quad \text{for } t \in [0, 1]. \]

For this unknown-template model, the base shape, f, corresponds to a typical shape for the reference cluster. We have not analyzed the identifiability of this model, but note that it is unidentifiable with respect to permutations of the group labels. Hence, we expect the likelihood function to be highly multi-modal, since any group can be chosen as the reference group and yields a comparable fit, given that the B-spline approximations to the latent functions are flexible enough. Nevertheless, for the purpose of clustering,
Nevertheless, for the purpose of clustering,we anticipate that any of these local maxima would still return reasonableresults.Alternatively, we can employ a known template, f , if some basic proper-ties of the common shape are known or desired. For example, if the smoothedcurves are convex functions with similar values on the boundaries of t = 0and t = 1, like the elephant seal dive profiles, we might consider the parabolaf(t) = −4t(1− t) as a fixed template. The fixed reference of a known tem-plate also yields an identifiable model where the means of the amplitudeeffects and the Dirichlet concentration parameters of each mixture compo-nent are free parameters to be estimated.Whether the template, f , is known or unknown, estimation of modelparameters, warping functions and predictive probabilities of cluster mem-berships,pˆi,m = P(mi = m∣∣∣ yi; θˆ) ,523.2. Mixture of warpings modelcan be performed by maximum likelihood via an SAEM algorithm. Weassign the i-th curve to the cluster with the highest predictive probabilities,pˆi,1, . . . , pˆi,M . The derivation of the SAEM algorithm is similar to that forthe curve registration model in Chapter 2, with the hierarchical formulationof the mixture of Dirichlet distribution (3.8) allowing for an easy extensionof the original SAEM. We defer all derivations of the SAEM algorithm toAppendix B.For many iterative clustering algorithms, the resulting clusters are of-ten sensitive to the choice of an initial grouping. A popular strategy is torepeat the algorithm multiple times with different random initial groupingsand choose the result with the best clusters, for example with the largestlikelihood or the smallest within-cluster sum of squares. 
Unfortunately, for the mixture of warpings model, when each curve is densely observed across time, the SAEM algorithm can take considerable computing time, rendering the multi-start strategy infeasible.

3.2.2 Two-step approach

We also consider a two-step approach that is computationally much simpler than the simultaneous approach. In this two-step approach, we first ignore the clustering structure and estimate the warping functions by (2.16), based on the curve registration model of Chapter 2. The estimated time warpings are then treated as data, and the corresponding original curves are clustered by estimating the parameters of the mixture of Dirichlet distributions model (3.8) applied to the estimated warping coefficients.

Clustering and estimation in the mixture model are computed by an EM algorithm. Since the only latent variable is the cluster label, exact calculation of the E-step is possible by (B.1). For the M-step, the sufficient statistics for the mixing proportions and Dirichlet components are identical to those for the mixture of warpings model. Hence, the parameter estimates can be updated by the same Newton's method as described in Appendix B.1.

The two-step method is a reasonable approach when the functional data are densely observed across time with low noise relative to the scale of the curves. As discussed in Chapter 2, we expect the predictive distribution of the warping function given the data to be dominated by the likelihood and robust to mis-specification of the random effect distribution.
If the within-curve variability or the observation error is relatively small, we also expect the predictive density to have the majority of its mass concentrated in a small region because of the high precision in predicting the warping functions. This justifies treating the predicted warping effects as data and ignoring the uncertainty of the prediction in our two-step clustering approach.

3.3 Simulation study

We conduct a simulation study to compare the performance of our mixture of warpings model and two-step approach with clustering methods that ignore registration. We consider four versions of our methods:

• simultaneous approach with known and fixed template (MXWP-FxT), and
• simultaneous approach with unknown template (MXWP)

discussed in Section 3.2.1, and

• two-step approach with known template (STEP-FxT), and
• two-step approach with unknown template (STEP)

discussed in Section 3.2.2. The five clustering methods for functional data that we compare our methods to are:

• Gaussian mixture model (GMM) of Brillinger and Stewart (1997),
• funclust method of Jacques and Preda (2013),
• funHDDC method of Bouveyron and Jacques (2011),
• kCFC method of Chiou and Li (2007), and
• waveclust method of Giacofci et al. (2013).

We simulate data with the clustering structure lying in the distribution of the warping functions only.

Simulated data

We simulate data from the model specified in (3.6)-(3.9), with base shape f(t) = −4t(1 − t), and µ = (−25, 500)⊤, Σ = diag(10², 50²), σ² = 10². The warping functions are modelled as cubic B-splines with an interior knot at t = 0.5, and so K_h = 5 basis functions. We generate mixtures of Dirichlet distributions with two, three or four components with equal mixing proportions. The concentration parameters are κ₁ = (0, 1, 1, 0)⊤, κ₂ = (1.2, 1, 1, 1.2)⊤, κ₃ = (0, 1, 2, 1)⊤ and κ₄ = (1, 2, 1, 0)⊤, with the first m used in an m-component mixture. For each scenario, we replicate the simulation 200 times.
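To make the data-generating mechanism concrete, the sketch below simulates curves of this flavour. It is a simplified stand-in, not the thesis code: warping increments are drawn from a Dirichlet distribution, cumulated into a monotone warp of [0, 1] through piecewise-linear interpolation rather than the cubic B-spline warps of the model, and the Dirichlet parameters passed in are assumed positive:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_curve(kappa, t, mu=(-25.0, 500.0), sd=(10.0, 50.0), sigma=10.0):
    """One noisy warped curve: Dirichlet increments are cumulated into a
    monotone warp of [0, 1], the base parabola f(t) = -4t(1 - t) is
    evaluated at the warped times, and Gaussian amplitude effects and
    observation noise are added."""
    increments = rng.dirichlet(np.asarray(kappa, dtype=float))
    knots_y = np.concatenate(([0.0], np.cumsum(increments)))
    knots_x = np.linspace(0.0, 1.0, knots_y.size)
    h = np.interp(t, knots_x, knots_y)          # warped time h(t) in [0, 1]
    a_sh = rng.normal(mu[0], sd[0])             # amplitude shift effect
    a_sc = rng.normal(mu[1], sd[1])             # amplitude scale effect
    f = -4.0 * h * (1.0 - h)                    # base shape at warped time
    return a_sh + a_sc * f + rng.normal(0.0, sigma, size=t.size)

t = np.linspace(0.0, 1.0, 1000)
sample = np.array([simulate_curve((2.0, 1.0, 1.0, 2.0), t) for _ in range(200)])
```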
For each simulation run, we generate a sample of 200 curves, each observed on a grid of 1000 equally spaced time points between 0 and 1. Figure 3.1 shows one set of the simulated data.

Algorithm and model configuration

The four methods funclust, funHDDC, kCFC and waveclust are implemented in the R package funcy (Yassouridis et al., 2018). We use 20 B-spline basis functions in funclust, funHDDC and kCFC, and 20 wavelet functions in waveclust. We implement both the GMM method of Brillinger and Stewart (1997) and the four proposed mixture of warpings methods in our R package mixedWarpedCurves2, which is available at eric-f/mixedWarpedCurves2. In the case of MXWP-FxT and STEP-FxT, the known model template f is correctly specified, whereas the warping functions are modelled with cubic B-splines with the same knot as the true simulated functions at t = 0.50. In the case of MXWP and STEP, the unknown template f is modelled with a cubic B-spline with one interior knot at t = 0.50, and the warping functions are modelled with cubic B-splines with interior knots at t = 0.25, 0.50 and 0.75. In all four methods, we run the SAEM algorithms for 12000 iterations, where the first 2000 iterations form the burn-in and calibration stage, in which we tune the proposal variance of the Metropolis-Hastings sampler such that the acceptance rate is between 17% and 33%.

Starting values

For all nine methods, initial values for the cluster labels are required for the clustering algorithm. In all except the two simultaneous approaches, we consider multiple starting values. For each data set, the fitted model with the largest likelihood, or the lowest within-cluster sum-of-squares for kCFC, is retained as the final output.

For GMM, the algorithm is repeated 30 times with random starting values for the cluster labels. For funclust, funHDDC, kCFC and waveclust, in addition to the 30 random starts, we also include results from k-means and hierarchical clustering as starting values.
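The calibration stage described above can be sketched generically: run the random-walk sampler in batches and rescale the proposal standard deviation until the empirical acceptance rate falls in the 17%-33% window. The batch sizes and the 1.1 scaling factor below are arbitrary illustrative choices, not the thesis's:

```python
import numpy as np

rng = np.random.default_rng(2)

def calibrate_proposal(log_target, x0, step0=1.0, batches=50, batch_size=100,
                       lo=0.17, hi=0.33):
    """Random-walk Metropolis with batchwise tuning of the proposal s.d.:
    after each batch, the step is scaled up if the acceptance rate exceeds
    `hi` and scaled down if it falls below `lo`."""
    x, step = x0, step0
    logp = log_target(x)
    for _ in range(batches):
        accepted = 0
        for _ in range(batch_size):
            prop = x + step * rng.normal()
            logp_prop = log_target(prop)
            if np.log(rng.random()) < logp_prop - logp:
                x, logp = prop, logp_prop
                accepted += 1
        rate = accepted / batch_size
        if rate > hi:
            step *= 1.1
        elif rate < lo:
            step /= 1.1
    return x, step

# e.g., sampling a standard normal target
x, step = calibrate_proposal(lambda z: -0.5 * z * z, x0=0.0)
```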
The best fitted model among the 32 outputs is retained as the final result. We note that the implementations of these four methods in the funcy package do not allow user-defined starting values. For STEP-FxT and STEP, we use the GMM clusters in addition to 30 random starts. For MXWP-FxT and MXWP, we only use the GMM clusters as the initial clustering configuration.

For our four methods based on the mixture of warpings model, we also initialize the unknown parameters as follows. We set the starting values for the mixing proportions to equal weights, π_m = 1/M; the means of the random effects to their null values (µ = (0, 1)⊤ and κ̄₁ = · · · = κ̄_M = κ̄_identity); the variance-covariance matrix, Σ, to a diagonal matrix with large values on the diagonal; and the precision parameters, the τ_m's, to some small values. For MXWP and STEP, where the model template is unknown, the base shape and the error variance, σ², are initialized to their ordinary least squares estimates from the unaligned curves.

BIC and within-cluster sum of squares

Yassouridis et al. (2018) claim that AIC and BIC are not reliable for selecting the number of clusters for funclust, funHDDC and waveclust. The question of how to determine the number of clusters remains open. To examine the issue, we assess the fits of one to eight clusters for each simulation run. We use the BIC provided by the funcy package for funclust, funHDDC and waveclust. We also use the BIC implemented in our mixedWarpedCurves2 package for the mixture of Dirichlet distributions for STEP-FxT and STEP, and for the mixture of Gaussian distributions of the GMM approach. For kCFC, since the method is not based on maximum likelihood, we report the within-cluster sum of squares of the fitted basis coefficients instead. For MXWP-FxT and MXWP, we report the BIC based only on the likelihood of the mixture of Dirichlet distributions component, treating the predicted warping coefficients as data.
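For reference, the BIC used throughout is the usual Schwarz criterion; the log-likelihoods and parameter counts below are hypothetical, purely to show the computation:

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """Schwarz's criterion: BIC = -2 * log-likelihood
    + (number of free parameters) * log(number of observations);
    lower values indicate a better trade-off of fit and complexity."""
    return -2.0 * loglik + n_params * np.log(n_obs)

# e.g., comparing hypothetical 2- and 3-component fits to n = 200 curves:
bic2 = bic(-512.3, n_params=11, n_obs=200)
bic3 = bic(-498.7, n_params=17, n_obs=200)
# here the extra component does not improve the likelihood enough to
# offset its penalty, so the 2-component fit has the lower BIC
```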
This is because we find the integral of the marginal likelihood difficult to evaluate or approximate with high accuracy.

Results

Figures 3.2 to 3.4 show the BIC and the within-cluster sum-of-squares. For MXWP-FxT, STEP-FxT and STEP, the BIC values show a generally decreasing trend with the number of components. The decreasing trend clearly levels off after the correct number of clusters. We note that the jumps in BIC for a few cases with large numbers of model clusters are due to some of the resulting clusters actually being empty. While the minimum BIC might not correspond to the model with the true number of clusters, the number of clusters can be identified by locating the elbow in the plot of BIC. A similar pattern is observed for funHDDC and kCFC, although for funHDDC the levelling off can occur earlier, at a lower number of model clusters, especially in the scenario of four simulated clusters.

For MXWP, the BICs appear to be noisier and less reliable for choosing the number of clusters. For GMM, funclust and waveclust, BIC is ineffective for choosing the number of clusters. The reason is clear for GMM: no smoothness assumption is imposed on the data and each observation is treated as a 1000-dimensional vector. Hence, an extra mixture component adds 1000 parameters to the model, and the resulting BIC is dominated by the penalty on model complexity. For waveclust, we see an increasing trend, which is counter-intuitive; perhaps the wavelet basis is unsuitable for our simulated data. For funclust, we observe a rather linear trend in BIC with no levelling off, consistently suggesting that the model with the largest number of clusters is most appropriate.

To assess the accuracy of the methods in recovering the clusters, we compute the adjusted Rand index (Rand, 1971) for the clustering results when the true number of clusters is known. An adjusted Rand index close to one indicates that the clustering result is close to the true grouping of the simulated data.
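A self-contained sketch of the adjusted Rand index via the standard contingency-table formula (scikit-learn's `adjusted_rand_score` computes the same quantity):

```python
import numpy as np
from scipy.special import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from the contingency table of two partitions:
    1 means identical groupings, values near 0 mean chance agreement."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    classes, ti = np.unique(t, return_inverse=True)
    clusters, pi = np.unique(p, return_inverse=True)
    table = np.zeros((classes.size, clusters.size), dtype=np.int64)
    np.add.at(table, (ti, pi), 1)
    sum_comb = comb(table, 2).sum()            # pairs together in both
    sum_a = comb(table.sum(axis=1), 2).sum()   # pairs together in truth
    sum_b = comb(table.sum(axis=0), 2).sum()   # pairs together in result
    expected = sum_a * sum_b / comb(table.sum(), 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_comb - expected) / (max_index - expected)

# identical partitions up to a relabelling of the clusters score 1
print(adjusted_rand_index([1, 1, 2, 2], [2, 2, 1, 1]))  # -> 1.0
```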
Figure 3.5 and Table 3.1 summarize the adjusted Rand indices. Across all three scenarios of K = 2, 3 and 4 simulated clusters, MXWP-FxT, STEP-FxT and STEP have the best performance, with the highest median adjusted Rand indices. MXWP does not perform as well as the other three methods that focus on phase variation. Since the corresponding two-step approach of STEP performs well, we expect the warping functions to be accurately predicted, and hence not to be the source of the low accuracy. The lower adjusted Rand index of the simultaneous approach is likely because of the sensitivity to starting values and the possibility of the parameter estimates for the mixture of Dirichlet distributions component being trapped at a local maximum.

Among the generic clustering methods that do not explicitly account for phase variation, GMM and kCFC perform well. In particular, GMM can outperform the mixture of warpings model with unknown template in terms of adjusted Rand index, especially when the true number of clusters is low. Among the more sophisticated model-based approaches of funclust, funHDDC and waveclust, funclust performs poorly across all scenarios. Figure 3.6 shows a typical funclust output in our simulation: the estimated cluster means are similar across groups and the clusters lack visual intuition. For funHDDC, we notice that the number of non-empty clusters is often lower than the number of specified clusters, as shown in Table 3.1. Figure 3.7 shows an extreme case where only two clusters are identified when there are four true clusters. All but 52 curves in cluster 4 and two curves in cluster 2 are grouped together. While these 52 of the 53 curves in cluster 4 are correctly grouped, clearly, this method is not working well.
For waveclust, the estimated cluster means have peculiar shapes which fit the data well over the first half of the domain, [0, 0.5], but are unreasonable in the second half, (0.5, 1].

Since we use a fixed number of iterations for the SAEM algorithm, we briefly examine the convergence of the algorithm via traceplots of the estimated parameters. Figures 3.9 and 3.10 show the sequences of estimated parameters across iterations for MXWP-FxT and MXWP for a simulation run with four clusters. The estimated parameters appear to move quickly to the proximity of a maximum. For some parameters, the estimates continue to change slowly after about 5000 iterations, while others appear to have converged. The slow drift is likely the effect of the stochastic approximation averaging out earlier MCMC states that are further from the optimal value. Judging from the estimated error variance, σ̂², which is close to the true value at the end of the algorithm, we believe that the algorithm has converged to a reasonable extent, such that the predicted latent functions fit the data well.

Overall, our four methods that cluster on phase variation recover the groupings most faithfully, although MXWP has trouble choosing the number of clusters. The relatively simpler models of GMM and kCFC perform reasonably well, whereas funclust, funHDDC and waveclust often fail to identify the clusters in the scenarios we considered. While MXWP-FxT has the best performance in our simulation study, the assumption that the fixed template is correctly specified is often unrealistic.
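The averaging effect behind the slow drift noted above can be seen in a minimal sketch of a stochastic-approximation update. This uses a generic SAEM-style schedule, step size 1 during burn-in and 1/k afterwards; the thesis's exact step sizes may differ:

```python
import numpy as np

def sa_average(draws, burn_in=2000):
    """Stochastic-approximation smoothing of a stream of (MCMC) draws:
    step size 1 during burn-in, so early states are forgotten, then
    gamma_k = 1/k afterwards, so the estimate becomes a running average
    of all post-burn-in states and drifts ever more slowly."""
    s, k = 0.0, 0
    path = np.empty(len(draws))
    for i, x in enumerate(draws):
        if i < burn_in:
            gamma = 1.0
        else:
            k += 1
            gamma = 1.0 / k
        s += gamma * (x - s)   # s_k = s_{k-1} + gamma_k (x_k - s_{k-1})
        path[i] = s
    return path
```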
In the next subsection, we expand our simulation study to examine the impact of fitting the data with a misspecified template.

Sensitivity to misspecified fixed template

To investigate the performance of our methods when the fixed template is misspecified, we simulate data in exactly the same way except with two different base shapes: f(t) = −2√(t(1 − t)) and f(t) = −(27/4) t(1 − t)². Figures 3.11 and 3.16 show one set of the simulated data for each base shape.

We compare the four methods MXWP, MXWP-FxT, STEP and STEP-FxT with GMM clustering. The cubic splines for our four methods have the same knots as in the previous simulation. For MXWP-FxT and STEP-FxT, we fix the template at f(t) = −4t(1 − t) for fitting the data.

Figures 3.12 to 3.14 show the dependence of BIC on the true number of clusters when the true template is f(t) = −2√(t(1 − t)). We omit the impractical BIC for GMM, since its value is dominated by the penalty term, as seen in the previous simulation study. For STEP and STEP-FxT, the decreasing trends of their BIC level off at the correct number of clusters in most runs, especially for the unknown template case, STEP. For MXWP-FxT, the decreasing trend in BIC sometimes levels off at a lower number of clusters than the true number of groups. As in the previous simulation, MXWP shows a similar pattern of noisier BIC values with a less prominent elbow in the trend. Overall, BIC does not appear to be effective for choosing the number of clusters, except for STEP and STEP-FxT.

Figure 3.15 and Table 3.2 summarize the adjusted Rand indices for the clustering results when the true number of clusters is known. With two simulated clusters, all four methods of MXWP, MXWP-FxT, STEP and STEP-FxT perform better than generic GMM clustering. In particular, STEP almost always recovers the exact group partitions, with a minimum adjusted Rand index of 0.94.
With more than two simulated clusters, the accuracy of MXWP deteriorates most substantially, as in the previous simulation. However, STEP remains highly accurate, with an adjusted Rand index higher than 0.96 for 75% of the simulation runs. With the misspecified template, MXWP-FxT and STEP-FxT still slightly outperform GMM, with STEP-FxT yielding a higher median adjusted Rand index than MXWP-FxT.

Figures 3.17 to 3.19 plot the BIC by true number of clusters when the true template is f(t) = −(27/4) t(1 − t)². A similar pattern is observed for MXWP and STEP. For the models with the misspecified template, when the number of simulated clusters is more than two, the trends of BIC for STEP-FxT are flat, providing little insight into the number of clusters. For MXWP-FxT, the line plots of BIC seem to be too erratic to be of any use.

Figure 3.20 and Table 3.3 summarize the adjusted Rand indices for the simulated data with template f(t) = −(27/4) t(1 − t)². Overall, STEP has the higher adjusted Rand index on average. In this particular simulation with more skewed curves, MXWP still performs well as the number of simulated clusters, K, increases. Among the methods with the misspecified template, MXWP-FxT is the least accurate when the number of simulated clusters is more than two. STEP-FxT is similar to, but slightly better than, GMM clustering.

Conclusion

Overall, MXWP, MXWP-FxT, STEP and STEP-FxT perform well when the main source of variation between clusters is phase. In the scenarios we considered, MXWP has trouble determining the number of clusters. When the template is correctly specified, the simultaneous approach with known template recovers the partition most accurately on average. However, the unknown template model is more robust to misspecification of the base shape. When the true template is f(t) = −2√(t(1 − t)) but the model template is −4t(1 − t), fitting with the fixed but misspecified template by MXWP-FxT or STEP-FxT is still better than MXWP.
However, with a skewed true template, f(t) = −(27/4) t(1 − t)², using the misspecified fixed template of −4t(1 − t) is unwise; the template is not appropriate and not representative of the simulated curves. This explains the less informative BIC for choosing the number of clusters and the lower accuracy of the resulting clusters.

The clustering algorithms via SAEM can be sensitive to the initial values. While trying multiple starting values might be computationally infeasible for the simultaneous curve fitting and clustering approach, the combination of a two-step clustering approach with multiple random starting values appears to provide a better alternative to the simultaneous approach.

Based on our simulation study, we conclude that STEP has the best performance in the scenarios we considered. With 200 curves at a sampling frequency of 1000 points per curve, STEP takes about 30 minutes for each simulation run. In comparison, MXWP-FxT and STEP-FxT take four to seven minutes for the same number of iterations. For a larger data set, such as our elephant seal dive profiles, MXWP-FxT and STEP-FxT might be preferable for their higher computational efficiency and similar performance to STEP. Regarding the choice of initial cluster labels, the simultaneous methods work well with GMM providing the starting values. However, methods using multiple starting values are more robust to sensitivity issues in the initial cluster labels. Thus, for large data sets, where the computational cost of multiple starting values is high, methods like STEP and STEP-FxT may be preferred.

3.4 Tables and figures

Figure 3.1: Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f(t) = −4t(1 − t). True model parameters: πi = 1/4 for i = 1, . . .
, 4, κ₁ = (0, 1, 1, 0)⊤, κ₂ = (1.2, 1, 1, 1.2)⊤, κ₃ = (0, 1, 2, 1)⊤, κ₄ = (1, 2, 1, 0)⊤, µ = (−25, 500)⊤, Σ = diag(10², 50²) and σ² = 10.

Figure 3.2: Within-cluster sum-of-squares (for kCFC) and BIC for the simulation study with true template f(t) = −4t(1 − t). True number of clusters is 2.

Figure 3.3: Within-cluster sum-of-squares (for kCFC) and BIC for the simulation study with true template f(t) = −4t(1 − t). True number of clusters is 3.

Figure 3.4: Within-cluster sum-of-squares (for kCFC) and BIC for the simulation study with true template f(t) = −4t(1 − t). True number of clusters is 4.

Figure 3.5: Adjusted Rand index when the true number of clusters is known. True template f(t) = −4t(1 − t).

Table 3.1: Adjusted Rand index and average number of non-empty clusters (Eff. K). The true number of clusters, K, is assumed known.

                             Adjusted Rand Index
Method      Eff. K   Mean (SD)      Min     Q1   Median    Q3    Max
K = 2
STEP          2.00   1.00 (0.00)   0.98   1.00    1.00   1.00   1.00
MXWP          2.00   0.88 (0.05)   0.70   0.85    0.88   0.92   1.00
STEP-FxT      2.00   1.00 (0.00)   0.96   1.00    1.00   1.00   1.00
MXWP-FxT      2.00   1.00 (0.00)   0.98   1.00    1.00   1.00   1.00
GMM           2.00   0.94 (0.04)   0.79   0.92    0.94   0.98   1.00
kCFC          2.00   0.96 (0.05)   0.76   0.92    0.98   1.00   1.00
funclust      1.99   0.02 (0.06)  -0.01  -0.00    0.00   0.01   0.50
funHDDC       2.00   0.76 (0.24)   0.03   0.70    0.85   0.94   1.00
waveclust     2.00   0.96 (0.09)   0.39   0.96    1.00   1.00   1.00
K = 3
STEP          3.00   0.97 (0.03)   0.86   0.95    0.97   0.99   1.00
MXWP          3.00   0.84 (0.05)   0.64   0.81    0.85   0.88   0.94
STEP-FxT      3.00   0.97 (0.02)   0.87   0.95    0.97   0.99   1.00
MXWP-FxT      3.00   0.98 (0.02)   0.94   0.97    0.99   1.00   1.00
GMM           3.00   0.88 (0.05)   0.75   0.86    0.88   0.91   0.97
kCFC          2.82   0.70 (0.23)   0.00   0.71    0.76   0.81   0.94
funclust      2.92   0.02 (0.06)  -0.01  -0.00    0.00   0.01   0.73
funHDDC       2.97   0.69 (0.18)   0.00   0.57    0.73   0.82   0.96
waveclust     3.00   0.27 (0.08)   0.05   0.22    0.27   0.32   0.55
K = 4
STEP          4.00   0.96 (0.03)   0.81   0.95    0.96   0.99   1.00
MXWP          4.00   0.86 (0.06)   0.66   0.82    0.86   0.90   0.99
STEP-FxT      4.00   0.96 (0.03)   0.82   0.94    0.96   0.97   1.00
MXWP-FxT      4.00   0.97 (0.02)   0.89   0.96    0.97   0.99   1.00
GMM           4.00   0.86 (0.05)   0.68   0.84    0.86   0.89   0.96
kCFC          4.00   0.73 (0.06)   0.51   0.70    0.74   0.77   0.89
funclust      3.81   0.02 (0.03)  -0.01  -0.00    0.01   0.03   0.28
funHDDC       3.11   0.45 (0.19)   0.08   0.31    0.38   0.52   0.95
waveclust     4.00   0.24 (0.09)   0.02   0.18    0.25   0.30   0.46

Figure 3.6: Example clusters formed by funclust for a simulation replicate with four true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same funclust cluster. Colored lines show the estimated cluster means.

Figure 3.7: Example clusters formed by funHDDC for a simulation replicate with four true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same non-empty funHDDC cluster. Colored lines show the estimated cluster means. Only two non-empty clusters are formed by the method.

Figure 3.8: Example clusters formed by waveclust for a simulation replicate with three true clusters and correctly specified number of clusters by the method. Curves in the same row belong to the same simulated cluster. Curves in the same column belong to the same waveclust cluster.
Colored lines show the estimated cluster means.

Figure 3.9: Convergence diagnosis via traceplots of estimated parameters for MXWP-FxT, K = 4.

Figure 3.10: Convergence diagnosis via traceplots of estimated parameters for MXWP, K = 4.

Figure 3.11: Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f₁(t) = −2√(t(1 − t)). True model parameters: πi = 1/4 for i = 1, . . . , 4, κ₁ = (0, 1, 1, 0)⊤, κ₂ = (1.2, 1, 1, 1.2)⊤, κ₃ = (0, 1, 2, 1)⊤, κ₄ = (1, 2, 1, 0)⊤, µ = (−25, 500)⊤, Σ = diag(10², 50²) and σ² = 10.

Figure 3.12: BIC for the simulation study with true template f(t) = −2√(t(1 − t)). True number of clusters is 2.

Figure 3.13: BIC for the simulation study with true template f(t) = −2√(t(1 − t)). True number of clusters is 3.

Figure 3.14: BIC for the simulation study with true template f(t) = −2√(t(1 − t)). True number of clusters is 4.

Figure 3.15: Adjusted Rand index when the true number of clusters is known. True template f(t) = −2√(t(1 − t)).

Table 3.2: Adjusted Rand index. True template: f(t) = −2√(t(1 − t)).
Misspecified fixed template: f(t) = −4t(1 − t). The number of clusters is assumed known.

                      Adjusted Rand Index
K   Method      Mean (SD)      Min     Q1   Median    Q3    Max
2   STEP        1.00 (0.01)   0.94   1.00    1.00   1.00   1.00
    MXWP        0.94 (0.11)   0.58   0.96    0.98   1.00   1.00
    STEP-FxT    0.99 (0.03)   0.69   0.98    1.00   1.00   1.00
    MXWP-FxT    0.98 (0.02)   0.88   0.98    1.00   1.00   1.00
    GMM         0.91 (0.07)   0.53   0.88    0.92   0.96   1.00
3   STEP        0.97 (0.02)   0.90   0.96    0.97   0.99   1.00
    MXWP        0.80 (0.07)   0.62   0.75    0.80   0.85   0.93
    STEP-FxT    0.93 (0.05)   0.70   0.91    0.94   0.97   1.00
    MXWP-FxT    0.90 (0.07)   0.69   0.86    0.92   0.95   1.00
    GMM         0.91 (0.05)   0.73   0.88    0.91   0.94   1.00
4   STEP        0.97 (0.02)   0.85   0.96    0.97   0.99   1.00
    MXWP        0.85 (0.09)   0.64   0.78    0.84   0.93   1.00
    STEP-FxT    0.91 (0.05)   0.66   0.88    0.92   0.95   1.00
    MXWP-FxT    0.89 (0.06)   0.69   0.86    0.89   0.93   1.00
    GMM         0.89 (0.04)   0.72   0.86    0.89   0.92   1.00

Figure 3.16: Two hundred simulated curves from the mixture of warpings model with 1000 observations per curve. Base shape: f₂(t) = −(27/4) t(1 − t)². True model parameters: πi = 1/4 for i = 1, . . . , 4, κ₁ = (0, 1, 1, 0)⊤, κ₂ = (1.2, 1, 1, 1.2)⊤, κ₃ = (0, 1, 2, 1)⊤, κ₄ = (1, 2, 1, 0)⊤, µ = (−25, 500)⊤, Σ = diag(10², 50²) and σ² = 10.

Figure 3.17: BIC for the simulation study with true template f(t) = −(27/4) t(1 − t)². True number of clusters is 2.

Figure 3.18: BIC for the simulation study with true template f(t) = −(27/4) t(1 − t)². True number of clusters is 3.

Figure 3.19: BIC for the simulation study with true template f(t) = −(27/4) t(1 − t)². True number of clusters is 4.

Figure 3.20: Adjusted Rand index when the true number of clusters is known. True template f(t) = −(27/4) t(1 − t)².

Table 3.3: Adjusted Rand index. True template: f(t) = −(27/4) t(1 − t)².
Misspecified fixed template: f(t) = −4t(1 − t). The number of clusters is assumed known.

                      Adjusted Rand Index
K   Method      Mean (SD)      Min     Q1   Median    Q3    Max
2   STEP        1.00 (0.03)   0.70   1.00    1.00   1.00   1.00
    MXWP        0.94 (0.11)   0.58   0.96    0.98   1.00   1.00
    STEP-FxT    0.98 (0.02)   0.90   0.96    0.98   1.00   1.00
    MXWP-FxT    0.98 (0.02)   0.92   0.98    0.98   1.00   1.00
    GMM         0.95 (0.07)   0.41   0.94    0.96   0.98   1.00
3   STEP        0.92 (0.04)   0.78   0.90    0.93   0.95   1.00
    MXWP        0.80 (0.07)   0.62   0.75    0.80   0.85   0.93
    STEP-FxT    0.84 (0.05)   0.69   0.81    0.84   0.88   0.97
    MXWP-FxT    0.77 (0.09)   0.45   0.72    0.78   0.83   0.93
    GMM         0.83 (0.06)   0.68   0.80    0.84   0.87   0.99
4   STEP        0.84 (0.06)   0.64   0.81    0.84   0.88   0.99
    MXWP        0.85 (0.09)   0.64   0.78    0.84   0.93   1.00
    STEP-FxT    0.73 (0.17)   0.01   0.71    0.78   0.82   0.91
    MXWP-FxT    0.50 (0.09)   0.30   0.44    0.49   0.55   0.75
    GMM         0.72 (0.05)   0.55   0.69    0.72   0.75   0.90

Chapter 4

Applications

In this chapter, we apply the clustering methods proposed in Chapter 3 to the data set that motivates our work, the dive profile data of a southern elephant seal. We also apply our methods to the growth curve data from the Berkeley Growth Study, a classical benchmark data set for functional data analysis. In both analyses, we compare the clusters revealed by our methods, which focus on between-group heterogeneity in phase variation, to those identified by the generic clustering method of GMM originally considered by Brillinger and Stewart (1997) for dive profiles.

4.1 Dive profiles of a southern elephant seal

The dive profile data of the southern elephant seal are provided by Dr. Christophe Guinet of Centre d'Etudes Biologiques de Chizé (CEBC) UMR 7372 CNRS-ULR. The data are collected as part of the Système d'Observation MEMO. A time depth recorder is glued to an elephant seal on Kerguelen Island. The swim track of the seal is shown in Figure 1.2 in Chapter 1. Dive depth data are recorded from October 28 to December 29, 2010. A total of 3300 dives are identified in the original data set.
For our analysis, we exclude the first 150 and the last 150 dives, since these dives are much shallower, occurring when the seal is near the island, where the seabed is shallow. Figure 4.1 shows a random sample of 200 dive profiles from the 3000 dives. Figure 4.2 shows the same 200 dive profiles with time scaled to between 0 and 1.

We apply GMM and the four methods MXWP, STEP, MXWP-FxT and STEP-FxT to the duration-standardized dive profiles. For GMM, since the data vectors for individual dive profiles need to have the same dimension, we prepare the data as follows: each scaled dive profile is first interpolated by a cubic spline, and a data vector is then formed by evaluating the interpolating spline at 1001 equally spaced time points from 0 to 1. For MXWP-FxT and STEP-FxT, the known template is set to the parabola f(t) = 4t(1 − t). For MXWP and STEP, the unknown template is modelled with a cubic spline on [0, 1] with interior knots at 0.25, 0.50 and 0.75. For all four of our proposed methods, the warping functions are also modelled with cubic splines on [0, 1] with interior knots at 0.25, 0.50 and 0.75. Where the SAEM algorithm is employed for MXWP-FxT and STEP-FxT, we use a fixed number of 10000 iterations after 2000 steps of burn-in and calibration. MXWP and STEP are more flexible and complex with an unknown template; hence, we try to ensure convergence by increasing the total number of iterations to 52000.

For GMM, STEP-FxT and STEP, each method is repeated 100 times with random starting values for group membership. For the two-step methods, we also use the GMM result as starting values. In each case, we treat the model with the lowest BIC as the final output. For the simultaneous approaches of MXWP-FxT and MXWP, we initialize the clustering configuration and parameter estimates as described in the simulation study in Section 3.3.

Number of clusters

We consider up to K = 8 clusters. Figure 4.3 shows the dependence of BIC on the number of clusters for each method.
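The data preparation for GMM described above, interpolating each dive with a cubic spline and resampling on a common grid, can be sketched as follows (the dive below is hypothetical):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def to_common_grid(times, depths, n_points=1001):
    """Interpolate one dive profile with a cubic spline and evaluate it at
    n_points equally spaced scaled times in [0, 1], so that every profile
    becomes a vector of the same dimension, as GMM clustering requires."""
    t = np.asarray(times, dtype=float)
    t = (t - t[0]) / (t[-1] - t[0])   # standardize the duration to [0, 1]
    spline = CubicSpline(t, depths)
    grid = np.linspace(0.0, 1.0, n_points)
    return spline(grid)

# e.g., a hypothetical dive sampled irregularly over 20 minutes
times = np.array([0, 2, 5, 9, 12, 16, 20], dtype=float)
depths = np.array([0, 150, 420, 500, 480, 200, 5], dtype=float)
vec = to_common_grid(times, depths)
```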
The case of K = 1 is not included for kCFC, since the computer program implemented in the funcy package does not support this option. For all five methods, the lowest BIC corresponds to the model with eight clusters, which is the largest number of clusters we explore. There is no clear pattern of levelling off of the BIC, except perhaps for MXWP and STEP-FxT at K = 4 clusters. For all five methods, we only examine the results with eight clusters for the rest of the analysis.

Diagnostics

Figure 4.4 shows the estimated template for MXWP and STEP and the scaled known template, µ̂_sh + µ̂_sc[4t(1 − t)], for MXWP-FxT and STEP-FxT. The scaled templates for MXWP-FxT and STEP-FxT are almost identical. The same can be said of the estimated unknown templates for MXWP and STEP. The two estimated templates resemble a pinched parabola, suggesting that our choice of the fixed template for MXWP-FxT and STEP-FxT is reasonable. All four templates have the same numbers of maxima and minima, and we can easily warp one template to the other. Figure 4.5 shows the sequences of estimated error variance, σ̂², for MXWP, STEP, MXWP-FxT and STEP-FxT as a basic convergence diagnosis of the SAEM algorithms. There is no obvious sign of non-convergence in any of the four cases. In terms of computing time, the SAEM algorithms for STEP-FxT and MXWP-FxT take 2.0 and 2.7 hours, respectively. On the other hand, using the unknown templates modelled with splines in STEP and MXWP increases the computing time by 3.5 to 4.2 times.

Results

Each of Figures 4.6 to 4.10 shows the dive profile clusters produced by the corresponding method and the typical shape of each cluster. The typical shape is calculated as the scaled template subject to the average warping of the cluster:

$$\bar{a}_{m,sh} + \bar{a}_{m,sc}\,\hat{f}\big(\bar{h}_m(t)\big) \qquad (4.1)$$
where ā_{m,sh} and ā_{m,sc} are the empirical means of the estimated amplitude effects for the curves in cluster m, and

$$\bar{h}_m(t) = \sum_{k=2}^{K_h} \Big( \sum_{j=2}^{k} \hat{\bar{\kappa}}_{m,j} \Big) B^h_k(t) \qquad (4.2)$$

is the mean warping function of cluster m, with the κ̂̄_{m,j}'s the estimated Dirichlet means. Figures 4.11 to 4.14 show the corresponding estimated warping functions by cluster, for the clusters produced by our four methods. The GMM clusters mostly differ in terms of the maximum depth of the dives. On the other hand, MXWP-FxT and STEP-FxT discover clusters with more distinct shapes. For example, many of the drift dives are grouped into cluster 7, while the flat bottom dives and the V-shape dives are better separated and represented by the typical shapes. MXWP and STEP also distinguish the V-shape dives from the flat bottom dives but fail to separate the drift dives from the V-shape dives.

To better visualize and compare the shape homogeneity within clusters, we cross-tabulate the dive profiles by GMM cluster and the clustering label returned by each of our methods in Figures 4.15 to 4.18. For example, in Figure 4.15, dive profiles in the same row belong to the same GMM cluster, while dive profiles in the same column belong to the same MXWP-FxT cluster. In general, the dive profiles within the same column appear to have more similar shapes than those across the same row. Based on visual inspection, we conclude that the clusters yielded by our methods, in particular MXWP-FxT and STEP-FxT, are more intuitive.

4.2 Berkeley Growth Study data

The Berkeley Growth Study is a benchmark data set in functional data analysis (Tuddenham, 1954; Ramsay and Silverman, 2007). The heights of 39 boys and 54 girls are recorded repeatedly from age 1 to 18 years at 31 common ages. Figure 4.19 shows the growth curves. These curves are relatively smooth, especially compared to the elephant seal dive profiles.
As noted by Ramsay and Silverman (2006), growth in height is fastest right after birth, with the rate decreasing over time until the pubertal growth spurt, which happens at around age 9 to 15, when growth accelerates again.

We compare the clusters produced by GMM, kCFC and MXWP. We do not consider the two-step approach because we can use multiple starting values for MXWP with this small data set. For MXWP, the cubic spline for the unknown template has an interior knot at 0.5, while the cubic splines for the warping functions have interior knots at 0.25, 0.5 and 0.75. Since this data set has relatively few individuals and few observations per individual, we run the SAEM algorithm with only 1000 iterations after 200 burn-in and calibration steps. Each of the three methods is repeated 30 times with random starting values for cluster membership. As usual, for GMM and MXWP, the fitted model with the lowest BIC is retained as the final output. For kCFC, we replace the BIC with the within-cluster sum-of-squares (WSS) of the distances between the curves and their cluster centers.

Figure 4.20 shows the BIC and WSS for one to eight clusters. While the scree plot of WSS for kCFC shows no clear sign of an elbow, and the BIC for GMM decreases steadily from K = 1 cluster to K = 8 clusters, the BIC for MXWP attains its minimum at K = 2, suggesting two clusters among the growth curves.

Figure 4.21 shows the growth curves labelled by cluster, with eight clusters for GMM and kCFC and two clusters for MXWP. The corresponding velocity and acceleration functions are estimated individually by smoothing splines. Figure 4.22 shows the same plots but with the number of clusters fixed at two for all three methods. As seen in the figures, GMM groups the curves by their overall heights and reveals little structure in terms of phase variation. In contrast, applying MXWP to the height data, we are able to identify two groups with different timing of pubertal growth spurts.
As seen from the velocity functions, one group has the growth spurt at around age 9 while the other group has the spurt at around age 12 or older. Table 4.1 lists the gender of the subjects in each MXWP cluster. The 33 subjects in cluster 2, whose growth spurts started earlier, are all girls, whereas cluster 1, with later growth spurts, consists of 21 girls and all 39 boys. Although kCFC is able to identify more interesting groups than GMM, the resulting clusters are less intuitive. If we fix the number of clusters at K = 2, the kCFC clusters show some connection to the timing of the growth spurt, but the separation of the two groups is not as clear as by MXWP.

Figure 4.23 shows the estimated template, the fitted curves, the warping functions and the derived velocity and acceleration functions,

\hat y_i'(t) = \hat a_{i,sc} \big[ \hat f'(\hat h_i(t)) \cdot \hat h_i'(t) \big]

and

\hat y_i''(t) = \hat a_{i,sc} \big[ \hat f''(\hat h_i(t)) \cdot (\hat h_i'(t))^2 + \hat f'(\hat h_i(t)) \cdot \hat h_i''(t) \big],

where \hat a_{i,sc} is the estimated scaling effect as in (2.17), \hat f' and \hat f'' are the derivatives of the fitted spline template, and \hat h_i, \hat h_i' and \hat h_i'' are the spline estimate of the warping function and its derivatives. The curves fit the data well, and the SAEM algorithm appears to have converged, as suggested by the traceplot in Figure 4.24. The derived velocity and acceleration functions also resemble the smoothing spline estimates, although slightly oversmoothed in comparison. There might be more peaks and valleys in the derivative functions than our choice of spline functions can represent. Also, the mixed-effects model might over-penalize the roughness of the curves. All in all, the plots, especially those of the warping functions, support our interpretation of the clusters in terms of the timing of growth spurts.

In conclusion, we apply our clustering methods, which explicitly model phase variation, to a novel data set of elephant seal dive profiles and the classic Berkeley growth curve data.
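The chain-rule computation of the velocity and acceleration of a warped curve can be sketched numerically. In this hypothetical illustration, the template, warping function and scaling effect are made-up stand-ins for the fitted splines in the text, chosen so that the exact answer is known:

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.linspace(0.0, 1.0, 401)

# Hypothetical fitted template f-hat and warping h-hat (illustrative choices,
# not the thesis estimates; any smooth f and increasing h would do).
f_hat = CubicSpline(t, t ** 2)   # template
h_hat = CubicSpline(t, t ** 3)   # warping, increasing on (0, 1]
a_sc = 2.0                       # scaling effect

f1, f2 = f_hat.derivative(1), f_hat.derivative(2)
h1, h2 = h_hat.derivative(1), h_hat.derivative(2)

# Velocity and acceleration of y(t) = a_sc * f(h(t)), per the chain rule.
def velocity(s):
    return a_sc * f1(h_hat(s)) * h1(s)

def acceleration(s):
    return a_sc * (f2(h_hat(s)) * h1(s) ** 2 + f1(h_hat(s)) * h2(s))

# With f(u) = u^2 and h(s) = s^3, y(s) = a_sc * s^6,
# so y'(s) = 6 a_sc s^5 and y''(s) = 30 a_sc s^4.
s = np.linspace(0.1, 0.9, 9)
print(np.max(np.abs(velocity(s) - 6 * a_sc * s ** 5)))
print(np.max(np.abs(acceleration(s) - 30 * a_sc * s ** 4)))
```

Because the spline estimates are polynomial between knots, their derivatives are available exactly, which is what makes the derived velocity and acceleration functions above cheap to compute alongside the fitted curves.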
In both cases, our methods are able to reveal clustering structure, in terms of deformation or phase variation, that is not identified by generic clustering algorithms for functional data. In the two examples we examine, such between-group differences in phase variation yield more intuitive clusters.

4.3 Tables and figures

Figure 4.1: A random sample of 200 dive profiles.
Figure 4.2: The same random sample of 200 dive profiles as in Figure 4.1 but with standardized durations.
Figure 4.3: BIC for clustering of southern elephant seal dive profiles.
Figure 4.4: Estimated templates. Estimated B-spline, \hat f(t) = \sum_{k=1}^{K_f} \hat\alpha_k B_k^f(t), for MXWP and STEP. Scaled known template, \hat f(t) = \hat\mu_{sh} + \hat\mu_{sc} \cdot (-4t(1-t)), for MXWP-FxT and STEP-FxT.
Figure 4.5: Traceplots of the estimated error variance, \hat\sigma^2, by SAEM iteration. The first 500 iterations are excluded for better scaling of the plots.
Figure 4.6: Eight clusters of dives determined by GMM. The dark blue lines are the cluster means.
Figure 4.7: Eight clusters of dives determined by MXWP-FxT. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
Figure 4.8: Eight clusters of dives determined by STEP-FxT. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
Figure 4.9: Eight clusters of dives determined by MXWP. The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
Figure 4.10: Eight clusters of dives determined by STEP.
The dark blue lines are the typical shapes of the clusters as defined in (4.1)-(4.2).
Figure 4.11: Estimated warping functions corresponding to the eight MXWP-FxT clusters in Figure 4.7.
Figure 4.12: Estimated warping functions corresponding to the eight STEP-FxT clusters in Figure 4.8.
Figure 4.13: Estimated warping functions corresponding to the eight MXWP clusters in Figure 4.9.
Figure 4.14: Estimated warping functions corresponding to the eight STEP clusters in Figure 4.10.
Figure 4.15: Dive profiles cross-tabulated by GMM (rows) and MXWP-FxT (columns) clusters. Curves in the same row belong to the same GMM cluster. Curves in the same column belong to the same MXWP-FxT cluster.
Figure 4.16: Dive profiles cross-tabulated by GMM (rows) and STEP-FxT (columns) clusters.
Figure 4.17: Dive profiles cross-tabulated by GMM (rows) and MXWP (columns) clusters.
Figure 4.18: Dive profiles cross-tabulated by GMM (rows) and STEP (columns) clusters.
Figure 4.19: Growth curves of 39 boys and 54 girls from the Berkeley Growth Study data. The vertical grid lines indicate the actual measurement times.
Figure 4.20: BIC for GMM and MXWP and within-cluster sum of squares (WSS) for kCFC for the Berkeley Growth Study data.
Figure 4.21: Cluster analysis by GMM, kCFC and MXWP for the Berkeley Growth Study data, with eight clusters for GMM and kCFC and two clusters for MXWP. The clustering methods are applied to the height data. The resulting clusters are identified by colour and linetype. The velocity (second row) and acceleration (last row) panels show the corresponding derivatives of the growth curves, estimated by smoothing splines and labelled by cluster.
Figure 4.22: Cluster analysis by GMM, kCFC and MXWP for the Berkeley Growth Study data with the number of clusters fixed at two. The clustering methods are applied to the height data. The resulting clusters are identified by colour and linetype. The velocity (second row) and acceleration (last row) panels show the corresponding derivatives of the growth curves, estimated by smoothing splines and labelled by cluster.
Figure 4.23: Fitted growth curves, their first two derivatives and the warping functions by MXWP. The two derivative functions are derived from the fitted splines for the amplitude and warping functions. Cluster memberships are indicated by colour (red vs. blue) and linetype (solid vs. dashed). In the top left panel, the gold line shows the estimated template for the growth curves.

Table 4.1: Cross-tabulation of MXWP clusters and gender of the subjects in the Berkeley Growth Study.

              Mixture of warping clusters
    Gender         1        2
    Boy           39        0
    Girl          21       33

Figure 4.24: Traceplot of the estimated error variance by MXWP for the Berkeley Growth Study data.

Chapter 5
Discussion and future work

In this thesis, we first propose a model-based curve registration method that explicitly models the amplitude and phase variations of functional data. In Chapter 2, we discuss our model and derive a stochastic approximation EM (SAEM) algorithm for estimation of the model parameters and prediction of the amplitude random effects and the warping functions. We show that the SAEM algorithm is computationally more stable and efficient than existing methods in the literature when the functional data are densely observed across time.

We then propose two clustering methods that explore clustering structure in phase variation: a two-step approach that clusters the warping functions estimated by our curve registration model, and a simultaneous curve fitting and clustering method based on the mixture of warpings model.
In Chapter 4, we apply our methods to a cluster analysis of 3000 dive profiles, where each curve is observed at its original frequency of more than 1000 time points. We also apply our methods to the smaller data set of the Berkeley Growth Study. Together with the simulation study in Chapter 3, the data analyses in Chapter 4 show that our proposed methods can reveal more interesting clusters of curves than generic clustering methods for functional data that do not explicitly model amplitude and phase variations.

In this thesis, we only consider phase variation represented by the class of warping functions,

H = { h : (H1) h is continuous in t; (H2) h is strictly increasing in t; (H3) h(0) = 0 and h(1) = 1 }.

Alternative types of phase variation not represented by the class H but possibly of practical interest are shifting and scaling in the temporal direction, f_i(t) = f(w_{sh} + w_{sc} t). Note that f then needs to be defined on the whole real line, or on an interval dictated by the range of possible values of w_{sh} and w_{sc}. In that case, the simple shape-invariant class of warping functions,

{ h : h(t) = w_{sh} + w_{sc} t },    (5.1)

first proposed by Lawton et al. (1972), is more appropriate.

Another limitation of H is the assumption that h(0) = 0 and h(1) = 1. While the time domain can easily be generalized from [0, 1] to [t_0, t_1] by rescaling, the assumptions of a common start time at which all curves are in synchrony, i.e. h(t_0) = t_0, and of no overall speeding up or slowing down, i.e. h(t_1) = t_1, can be restrictive. For example, in the Berkeley Growth Study data, with a time domain of age 1 to 18, it might not be reasonable to assume that the growth progress of all the children is identical at age 1 and again at age 18. In that case, it might be desirable to expand the time domain slightly beyond the range of the observed data, to [t_0 - \Delta t, t_1 + \Delta t]. The original constraint at the boundary is then replaced by t_0 - \Delta t < h(t_0) < h(t_1) < t_1 + \Delta t.
In practice, we would transform the original observed time, t_obs, by

t = ( t_obs - (t_0 - \Delta t) ) / ( t_1 - t_0 + 2\Delta t ),

before modelling the curves in terms of functions of t over the unit interval.

5.1 Future work

In this section, we discuss some possible directions for future work.

Standard error for parameter estimates of the curve registration model

The standard errors of the estimated parameters of the curve registration model (2.2)-(2.7) can be computed using the observed-data Fisher information matrix, which can be estimated by the observed information,

I(\theta; y) = -H(\theta; y) = -(\partial^2/\partial\theta^2) \ell(\theta; y),

where H(\theta; y) is the Hessian of the observed-data log-likelihood. Computation of I(\theta; y) can be incorporated into the EM algorithm via the following relationship between H(\theta; y) and the derivatives of the complete-data log-likelihood (Louis, 1982):

H(\theta; y) = E[ H_c(\theta; z) | y; \theta ] - Cov[ \nabla\ell_c(\theta; z) | y; \theta ]    (5.2)

where H_c(\theta; z) and \nabla\ell_c(\theta; z) are the Hessian and gradient of the complete-data log-likelihood (2.9). The covariance term in (5.2) can be decomposed as

Cov[ \nabla\ell_c(\theta; z) | y; \theta ] = E[ \nabla\ell_c(\theta; z) \nabla\ell_c(\theta; z)^T | y; \theta ] - E[ \nabla\ell_c(\theta; z) | y; \theta ] E[ \nabla\ell_c(\theta; z) | y; \theta ]^T.    (5.3)

Delyon et al. (1999) suggest approximating the conditional expectations in (5.2) and (5.3) by stochastic approximation alongside the SAEM algorithm, with the general scheme

H^(k) = H^(k-1) + \gamma_k [ H_c(\theta^(k); z^(k)) - H^(k-1) ],

where \theta^(k) is the estimated parameter and z^(k) is the complete data of observations and simulated random effects at the k-th iteration of the SAEM algorithm. The feasibility and performance of the above approximation for our models remain to be assessed.

Identifiability

In Section 2.2, we discuss the identifiability issue for curve registration models of the form (2.2)-(2.3). Most curve registration methods based on probabilistic models, including ours, assume that E[h_i(t)] = t. However, Panaretos and Zemel's proof of identifiability, based on local variation functions, requires E[h_i^{-1}(t)] = t.
In our simulation study, our assumption, E[h_i(t)] = t, does not seem to cause any problems in estimating the template. However, the identifiability of the model under this assumption remains to be established or disproved. The identifiability of our mixture of warpings model with unknown template is also an open question.

Template-invariant clustering

For our two-step clustering method with known template, the warping functions, and therefore the clustering results, depend on the choice of the fixed template. While the features of an appropriate template might be relatively easy to identify from an initial data visualization, the exact form of the fixed template can be difficult to choose. For a more consistent definition of clusters that is robust to the choice of the template, we might require that a warping of a fixed template serve as well as the original fixed template. That is, we might require that the clustering result be invariant to arbitrary warping of the fixed template.

Here, we consider warping functions in H, the class of strictly increasing functions defined in Chapter 2. Note that H is closed under composition and, if h is in H, so is h^{-1}.

Suppose that, for a fixed template f, for each i, y_i = f \circ h_i for some h_i \in H. Now suppose, instead of f, we use the fixed template f \circ h \equiv f^*, where h is an arbitrary warping function in H, and so we assume that y_i = f^* \circ h_i^* for some warping function h_i^* \in H. Since f \circ h_i = (f \circ h) \circ (h^{-1} \circ h_i) = f^* \circ (h^{-1} \circ h_i), we have h_i^* = h^{-1} \circ h_i.

For clustering to be invariant to warping of the fixed template, we want our clustering results to be the same whether we use the h_i's or the h_i^*'s. For distance-based clustering methods, this means that we want the distance between h_i and h_j to be the same as the distance between h_i^* and h_j^*. Thus, it suffices that we use a distance d that satisfies

d(\varphi_1, \varphi_2) = d(\varphi^{-1} \circ \varphi_1, \varphi^{-1} \circ \varphi_2)

for all \varphi, \varphi_1 and \varphi_2 in H; that is, it suffices that

d(\varphi_1, \varphi_2) = d( (\varphi_1^{-1} \circ \varphi)^{-1}, (\varphi_2^{-1} \circ \varphi)^{-1} ).    (5.4)
For convenience, we reformulate this, defining the distance d^* on H in terms of the distance between inverse functions:

d^*(\psi_1, \psi_2) \equiv d(\psi_1^{-1}, \psi_2^{-1}).

Thus, (5.4) becomes

d^*(\varphi_1^{-1}, \varphi_2^{-1}) = d^*(\varphi_1^{-1} \circ \varphi, \varphi_2^{-1} \circ \varphi)

for all \varphi, \varphi_1 and \varphi_2 in H. The task is then to find a d^* that is invariant to simultaneous warping of any two curves in H. One such distance is the Fisher-Rao metric advocated by Srivastava et al. (2011).

Non-homeomorphic cluster-specific templates

Our mixture of warpings model in Section 3.2 focuses on the scenario in which the curves are different warpings of a common template. When the clusters are non-homeomorphic, for example with the number of local maxima varying across groups, our model is unsuitable and might not be able to identify the clustering structure. In that case, any phase variation might be better viewed as subordinate to the differences in the basic structure of the curves. This is the approach taken by Sangalli et al. (2010), who use a k-means method for simultaneous alignment and clustering.

A more appropriate extension of our curve registration model for this alternative scenario is to impose a finite mixture distribution over the base shape, f, instead of over the warping distributions. The stochastic approximation EM algorithm should still be applicable in theory. Intuitively, we might want to warp a curve to each of the estimated templates for good mixing of the MCMC chains within the EM algorithm, at a higher computational cost. The actual performance of the algorithm remains to be examined.

Performance with sparse designs

While our work is motivated by noisy functional data with dense designs, i.e. curves that are frequently observed across time, the mixed-effects model approach is often applied to functional data with sparse designs.
The performance of our proposed methods for curve registration and for clustering could be studied in that setting.

Incorporating temporal and spatial dependence in dive type clustering

In our analysis of dive profiles we treat the observed curves as independent realizations of a random functional process. The daily time series in Figure 1.3, however, exhibits a strong pattern of serial correlation in dive types. We anticipate that this temporal dependence might not influence the estimation of the warping functions in general, given the high sampling frequency within each curve. However, in the two-step clustering approach, incorporating the temporal dependence structure in the clustering step, for example via a hidden Markov model, might improve the grouping of the dive types.

On the other hand, dependence in dive depth and geolocation across dives might be a strong behavioural indicator. We can imagine the possibility of a seal diving to similar depths around the same area while foraging but travelling a greater distance, and to varying depths, while exploring. Avenues for incorporating information from both the vertical dimension of depth and the horizontal dimensions of longitude and latitude for classification of diving behaviours remain to be explored.

Bibliography

Abraham, C., Cornillon, P.-A., Matzner-Løber, E., and Molinari, N. (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics, 30(3):581–595.
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2):139–177.
Bailleul, F., Pinaud, D., Hindell, M., Charrassin, J.-B., and Guinet, C. (2008). Assessment of scale-dependent foraging behaviour in southern elephant seals incorporating the vertical dimension: a development of the first passage time method. Journal of Animal Ecology, 77(5):948–957.
Bestley, S., Jonsen, I. D., Hindell, M. A., Harcourt, R. G., and Gales, N. J. (2015).
Taking animal tracking to new depths: synthesizing horizontal–vertical movement relationships for four marine predators. Ecology, 96(2):417–427.
Bouveyron, C., Girard, S., and Schmid, C. (2007). High-dimensional data clustering. Computational Statistics & Data Analysis, 52(1):502–519.
Bouveyron, C. and Jacques, J. (2011). Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification, 5(4):281–300.
Brillinger, D. R. and Stewart, B. S. (1997). Elephant seal movements: dive types and their sequences. In Modelling Longitudinal and Spatially Correlated Data, pages 275–288. Springer.
Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples of curves. Journal of the American Statistical Association, 93(443):961–976.
Brumback, L. C. and Lindstrom, M. J. (2004). Self modeling with flexible, random time transformations. Biometrics, 60(2):461–470.
Chakraborty, A. and Panaretos, V. M. (2017). Functional registration and local variations: identifiability, rank, and tuning. ArXiv e-prints, 1702.03556.
Chiou, J.-M. and Li, P.-L. (2007). Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):679–699.
Delyon, B., Lavielle, M., and Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics, 27(1):94–128.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.
Dragon, A.-C., Bar-Hen, A., Monestiez, P. P., and Guinet, C. (2012). Horizontal and vertical movements as predictors of foraging success in a marine predator. Marine Ecology Progress Series, 447:243–257.
Durbán, M., Harezlak, J., Wand, M., and Carroll, R. (2005).
Simple fitting of subject-specific curves for longitudinal data. Statistics in Medicine, 24(8):1153–1167.
Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis: Theory and Practice. Springer Science & Business Media.
Fu, E. and Heckman, N. (2018). Model-based curve registration via stochastic approximation EM algorithm. Computational Statistics & Data Analysis.
Garcia-Escudero, L. A. and Gordaliza, A. (2005). A proposal for robust curve clustering. Journal of Classification, 22(2):185–201.
Gasser, T. and Kneip, A. (1995). Searching for structure in curve samples. Journal of the American Statistical Association, 90(432):1179–1188.
Giacofci, M., Lambert-Lacroix, S., Marot, G., and Picard, F. (2013). Wavelet-based clustering for mixed-effects functional models in high dimension. Biometrics, 69(1):31–40.
Guinet, C., Vacquié-Garcia, J., Picard, B., Bessigneul, G., Lebras, Y., Dragon, A.-C., Viviant, M., Arnould, J. P., and Bailleul, F. (2014). Southern elephant seal foraging success in relation to temperature and light conditions: insight into prey distribution. Marine Ecology Progress Series, 499:285–301.
Heckman, N., Lockhart, R., and Nielsen, J. D. (2013). Penalized regression, mixed effects models and appropriate modelling. Electronic Journal of Statistics, 7:1517–1552.
Hindell, M., Slip, D., and Burton, H. (1991). The diving behavior of adult male and female southern elephant seals, Mirounga leonina (Pinnipedia, Phocidae). Australian Journal of Zoology, 39(5):595–619.
Jacques, J. and Preda, C. (2013). Funclust: a curves clustering method using functional random variables density approximation. Neurocomputing, 112:164–171.
James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462):397–408.
Jank, W. (2006). Ascent EM for fast and global solutions to finite mixtures: an application to curve-clustering of online auctions.
Computational Statistics & Data Analysis, 51(2):747–761.
Jupp, D. L. (1978). Approximation to data by splines with free knots. SIAM Journal on Numerical Analysis, 15(2):328–343.
Kelly, C. and Rice, J. (1990). Monotone smoothing with application to dose-response curves and the assessment of synergism. Biometrics, 46(4):1071–1085.
Kneip, A. and Gasser, T. (1992). Statistical tools to analyze data representing a sample of curves. The Annals of Statistics, 20(3):1266–1305.
Kneip, A. and Ramsay, J. O. (2008). Combining registration and fitting for functional models. Journal of the American Statistical Association, 103(483):1155–1165.
Kuhn, E. and Lavielle, M. (2004). Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131.
Kuhn, E. and Lavielle, M. (2005). Maximum likelihood estimation in nonlinear mixed effects models. Computational Statistics & Data Analysis, 49(4):1020–1038.
Langrock, R., Marques, T. A., Baird, R. W., and Thomas, L. (2014). Modeling the diving behavior of whales: a latent-variable approach with feedback and semi-Markovian components. Journal of Agricultural, Biological, and Environmental Statistics, 19(1):82–100.
Lawton, W., Sylvestre, E., and Maggio, M. (1972). Self modeling nonlinear regression. Technometrics, 14(3):513–532.
Lindstrom, M. J. and Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3):673–687.
Liu, X. and Yang, M. C. (2009). Simultaneous curve registration and clustering for functional data. Computational Statistics & Data Analysis, 53(4):1361–1376.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 226–233.
Maire, F., Moulines, E., and Lefebvre, S. (2017). Online EM for functional data. Computational Statistics & Data Analysis, 111:27–47.
Minka, T. (2000).
Estimating a Dirichlet distribution.
Panaretos, V. M. and Zemel, Y. (2018). Statistical aspects of Wasserstein distances. ArXiv e-prints, 1806.05500.
Peng, J. and Müller, H.-G. (2008). Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions. The Annals of Applied Statistics, 2(3):1056–1077.
Raket, L. L., Sommer, S., and Markussen, B. (2014). A nonlinear mixed-effects model for simultaneous smoothing and registration of functional data. Pattern Recognition Letters, 38:1–7.
Ramsay, J. O. and Li, X. (1998). Curve registration. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 60(2):351–363.
Ramsay, J. O. and Silverman, B. W. (2006). Functional Data Analysis. Springer, New York, 2nd edition.
Ramsay, J. O. and Silverman, B. W. (2007). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves. Biometrics, 57(1):253–259.
Roberts, G. O. and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367.
Sangalli, L. M., Secchi, P., Vantini, S., and Vitelli, V. (2010). K-mean alignment for curve clustering. Computational Statistics & Data Analysis, 54(5):1219–1233.
Schreer, J. F., O'Hara Hines, R. J., and Kovacs, K. M. (1998). Classification of dive profiles: a comparison of statistical clustering techniques and unsupervised artificial neural networks. Journal of Agricultural, Biological, and Environmental Statistics, 3(4):383–404.
Schreer, J. F. and Testa, J. W. (1995). Statistical classification of diving behavior. Marine Mammal Science, 11(1):85–93.
Serban, N. and Wasserman, L. (2005). CATS: clustering after transformation and smoothing.
Journal of the American Statistical Association, 100(471):990–999.
Srivastava, A., Wu, W., Kurtek, S., Klassen, E., and Marron, J. S. (2011). Registration of functional data using Fisher-Rao metric. ArXiv e-prints, 1103.3817.
Tang, R. and Müller, H.-G. (2008). Pairwise curve synchronization for functional data. Biometrika, 95(4):875–889.
Telesca, D. and Inoue, L. Y. T. (2008). Bayesian hierarchical curve registration. Journal of the American Statistical Association, 103(481):328–339.
Tuddenham, R. D. (1954). Physical growth of California boys and girls from birth to eighteen years. University of California Publications in Child Development, 1:183–364.
Walker, C., MacKenzie, M., Donovan, C., Kidney, D., Quick, N., and Hastie, G. (2011a). Classification of animal dive tracks via automatic landmarking, principal components analysis and clustering. Ecosphere, 2(8):1–13.
Walker, C., Mackenzie, M., Donovan, C., and O'Sullivan, M. (2011b). SALSA – a spatially adaptive local smoothing algorithm. Journal of Statistical Computation and Simulation, 81(2):179–191.
Walker, S. (1996). An EM algorithm for nonlinear random effects models. Biometrics, 52(3):934–944.
Wand, M. P. (2003). Smoothing and mixed models. Computational Statistics, 18(2):223–249.
Wang, K. and Gasser, T. (1997). Alignment of curves by dynamic time warping. The Annals of Statistics, 25(3):1251–1276.
Wang, K. and Gasser, T. (1999). Synchronizing sample curves nonparametrically. The Annals of Statistics, 27(2):439–460.
Wei, G. C. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.
Womble, J. N., Horning, M., Lea, M.-A., and Rehberg, M. J. (2013). Diving into the analysis of time–depth recorder and behavioural data records: a workshop summary. Deep Sea Research Part II: Topical Studies in Oceanography, 88:61–64.
Wu, C. J. (1983). On the convergence properties of the EM algorithm.
The Annals of Statistics, pages 95–103.
Yao, F., Müller, H.-G., and Wang, J.-L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470):577–590.
Yassouridis, C., Ernst, D., and Leisch, F. (2018). Generalization, combination and extension of functional clustering algorithms: the R package funcy. Journal of Statistical Software, 85(9):1–25.
Zhang, Y. and Telesca, D. (2014). Joint clustering and registration of functional data. ArXiv e-prints, 1403.7134.

Appendix A
Appendix to Chapter 2

A.1 Proof of Theorem 1

The proof of the theorem requires some topological arguments because of the requirements of measurability of mappings between function spaces and measurability of subsets of function spaces. We provide details here. We make full use of the following argument. Suppose \Omega_i, i = 1, 2, are metric spaces with corresponding Borel sigma-algebras O_i, and H is a function from \Omega_1 to \Omega_2. Let (\Omega, F, P) be a probability space. If X_1 and X_2 are random variables from \Omega to \Omega_1 with P{X_1 \in B} = P{X_2 \in B} for all B \in O_1, then P{H(X_1) \in C} = P{H(X_2) \in C} for all C \in O_2, provided that H^{-1}(C) \in O_1 for all C \in O_2, that is, provided that H is measurable. In our proofs, we show that the function H is measurable by showing that H is continuous with respect to the metrics on \Omega_1 and \Omega_2. In some cases, the function H will only be defined on A, a subset of \Omega_1. In that case, we show that A is in O_1 and H is continuous on A.

We define the following function spaces and distances, which induce topologies.

• C^1[0, 1] is the set of functions from [0, 1] to the reals with continuous first derivatives, with distance

  d_{\infty*}(f_1, f_2) \equiv ||f_1 - f_2||_{\infty*} \equiv \sup_t |f_1(t) - f_2(t)| + \sup_t |f_1'(t) - f_2'(t)|.

• C[0, 1] is the set of continuous functions from [0, 1] to the reals, with distance

  d_\infty(f_1, f_2) \equiv ||f_1 - f_2||_\infty = \sup_t |f_1(t) - f_2(t)|.

For any cross-products of function spaces, we use the usual cross-product metric/topology.
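As a numerical aside (not part of the formal development), the C^1 distance d_{\infty*} can be approximated by taking maxima over a fine grid. The example functions sin(t) and 0 below are arbitrary illustrations:

```python
import numpy as np

def d_inf_star(f, fprime, g, gprime, grid):
    """Grid approximation of the C^1[0,1] distance:
    sup |f - g| + sup |f' - g'|."""
    return (np.max(np.abs(f(grid) - g(grid)))
            + np.max(np.abs(fprime(grid) - gprime(grid))))

grid = np.linspace(0.0, 1.0, 10001)

# Example: f1(t) = sin(t), f2(t) = 0 on [0, 1].
d = d_inf_star(np.sin, np.cos, lambda t: 0.0 * t, lambda t: 0.0 * t, grid)

# sup |sin t| on [0, 1] is sin(1), attained at t = 1;
# sup |cos t| is cos(0) = 1, attained at t = 0.
print(d)  # close to sin(1) + 1
```

Because both suprema here are attained at grid endpoints, the grid approximation recovers the exact distance; in general it is a lower bound that improves as the grid is refined.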
The following lemmas are presented without proof. Throughout, =_d denotes equality in distribution.

Lemma 1.
1. Let A \subseteq C^1[0, 1] consist of all non-constant functions. In the topology induced by d_{\infty*}, A is an open set.
2. Let B \subseteq C^1[0, 1], with B consisting of the h \in C^1[0, 1] with h(0) = 0, h(1) = 1 and h'(t) > 0 for all t. The set B is a Borel set of C^1[0, 1] in the topology induced by d_{\infty*}.
3. Let C \subseteq C^1[0, 1], C = {F \in C^1[0, 1] : F(0) = 0, F(1) = 1, F(t) > F(s) for all t > s}. Then C is a measurable subset of C^1[0, 1] in the topology induced by d_{\infty*}.

Lemma 2. The following mappings are well-defined and continuous, with the topologies on A, B, C and C^1[0, 1] induced by d_{\infty*} and the topology on C[0, 1] induced by the usual L_\infty norm.
1. h \in B \to h^{-1} \in B;
2. H_\circ : C^1[0, 1] \times C^1[0, 1] \to C^1[0, 1] with H_\circ(f, g) = f \circ g;
3. H_1 : C^1[0, 1] \times C^1[0, 1] \to C^1[0, 1] \times C^1[0, 1] with H_1(f, h) = (f \circ h, h);
4. H_{inv} : C \to C[0, 1], with H_{inv}(F) = F^{-1};
5. the total variation mapping H_{TV} : A \to C^1[0, 1]:

   (H_{TV} f)(t) = \int_0^t |f'(s)| ds / \int_0^1 |f'(u)| du.

Lemma 3. The following equalities can be found in Lemma 1 of Chakraborty and Panaretos (2017):

F_i \equiv H_{TV}(f_i) = H_{TV}(\xi_i) \equiv F_{\xi_i},    \tilde F_i \equiv H_{TV}(\tilde f_i) = H_{TV}(\tilde\xi_i) \equiv \tilde F_{\xi_i}

and

\tilde F_i = F_{\xi_i} \circ h_i.    (A.1)

Proof of the Theorem

We use the notation introduced in Lemmas 1, 2 and 3. Suppose that (f_1, h_1) =_d (f_2, h_2). Then continuity of H_\circ implies that f_1 \circ h_1 =_d f_2 \circ h_2.

Now we suppose that \tilde f_1 =_d \tilde f_2 and show that (f_1, h_1) =_d (f_2, h_2). We proceed by showing

(i) (\tilde f_1, \tilde F_1) =_d (\tilde f_2, \tilde F_2),
(ii) \tilde F_1^{-1} =_d \tilde F_2^{-1},
(iii) F_{\xi_1} = F_{\xi_2}, and finally
(iv) (f_1, h_1) =_d (f_2, h_2).

We see that (i) follows from the continuity of the mapping H_{TV} and the measurability of the set A. Item (ii) follows since \tilde F_i \in C and the mapping H_{inv} is continuous on C, which is a measurable set.

To show (iii), we note that \tilde F_i^{-1}(t) is bounded and so its expectation exists.
By (ii),

E[\tilde F_1^{-1}(t)] = E[\tilde F_2^{-1}(t)] for all t.

From equation (A.1), \tilde F_i^{-1} = h_i^{-1} \circ F_{\xi_i}^{-1}, so

E[\tilde F_i^{-1}(t)] = E[h_i^{-1}(F_{\xi_i}^{-1}(t))] = F_{\xi_i}^{-1}(t)

by the fact that F_{\xi_i}^{-1}(t) is nonrandom and the assumption that E[h_i^{-1}(u)] = u for all u. Therefore, F_{\xi_1}^{-1} = F_{\xi_2}^{-1} and (iii) holds.

To show (iv), we write, for A in the sigma-algebra associated with A and B in the sigma-algebra associated with B, by continuity of the mapping H_1,

P{(f_1, h_1) \in A x B} = P{(f_1 \circ h_1, h_1) \in H_1(A x B)} = P{(f_1 \circ h_1, F_{\xi_1} \circ h_1) \in (H_2 \circ H_1)(A x B)}    (A.2)

where H_2 : A x B \to A x C^1[0, 1], mapping (f, h) to (f, F_{\xi_1} \circ h), is continuous since the composition of functions is continuous. But, by (A.1), expression (A.2) is equal to

P{(\tilde f_1, \tilde F_1) \in (H_2 \circ H_1)(A x B)},

which, by (i), equals

P{(\tilde f_2, \tilde F_2) \in (H_2 \circ H_1)(A x B)}.

Similarly,

P{(f_2, h_2) \in A x B} = P{(f_2 \circ h_2, F_{\xi_1} \circ h_2) \in (H_2 \circ H_1)(A x B)}
= P{(f_2 \circ h_2, F_{\xi_2} \circ h_2) \in (H_2 \circ H_1)(A x B)}, by (iii),
= P{(\tilde f_2, \tilde F_2) \in (H_2 \circ H_1)(A x B)}.

Thus (f_1, h_1) =_d (f_2, h_2).

A.2 Derivations of the SAEM algorithms

We adopt the following notation. Let y_i = (y_{i1}, ..., y_{i n_i})^T be the vector of observations for curve i, for i = 1, ..., N, and let y be the collection of all observed data. Let a_i = (a_{i,sh}, a_{i,sc})^T be the amplitude random effects and w_i = (w_{i,2}, ..., w_{i,K_h})^T the warping random effects for curve i, and let a and w be the collections of the corresponding random effects across all curves. We denote the complete data by z_i = (y_i, a_i, w_i) for curve i and by z = (y, a, w) across all curves. Also, we denote the total number of observations by n_tot = \sum_{i=1}^N n_i and the collection of all unknown parameters by \theta.

In this section, we show the details of the SAEM algorithm for the model

y_{ij} = a_{i,sh} + a_{i,sc} \cdot f \circ h(t_{ij}; w_i) + \epsilon_{ij},
h(t; w_i) = \sum_{k=1}^{K_h} \beta_{i,k} B_k^h(t)    (A.3)

for i = 1, ..., N and j = 1, ..., n_i, where

\beta_{i,k} = \sum_{j=1}^k w_{i,j},
w_{i,1} \equiv 0,
w_i ~ Dirichlet(\bar\kappa, \tau),
a_i ~ N(\mu, \Sigma),
\epsilon_{ij} ~ N(0, \sigma^2).    (A.4)
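A warping function of the form (A.3)-(A.4) can be simulated directly; the knots, degree and Dirichlet parameters below are arbitrary illustrative choices. The cumulative-sum construction of the \beta_{i,k}'s, combined with a clamped B-spline basis, makes h strictly increasing with h(0) = 0 and h(1) = 1:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)

# Cubic clamped B-spline basis on [0, 1] with interior knots at 0.25, 0.5,
# 0.75; this gives K_h = 7 basis functions.
degree = 3
knots = np.concatenate([[0.0] * (degree + 1),
                        [0.25, 0.5, 0.75],
                        [1.0] * (degree + 1)])
K_h = len(knots) - degree - 1  # 7

# w_{i,1} = 0 and (w_{i,2}, ..., w_{i,K_h}) ~ Dirichlet, so the cumulative
# sums beta_{i,k} increase from 0 to 1.
kappa_bar = np.full(K_h - 1, 1.0 / (K_h - 1))  # Dirichlet mean (uniform here)
tau = 50.0                                     # concentration parameter
w = rng.dirichlet(tau * kappa_bar)
beta = np.concatenate([[0.0], np.cumsum(w)])   # beta_1 = 0, beta_{K_h} = 1

h = BSpline(knots, beta, degree)

t = np.linspace(0.0, 1.0, 201)
print(float(h(0.0)), float(h(1.0)))   # endpoints pinned at 0 and 1
print(bool(np.all(np.diff(h(t)) > 0)))  # strictly increasing
```

The boundary constraints hold because, for a clamped basis, the spline equals the first coefficient at t = 0 and the last at t = 1, and monotonicity holds because the derivative of a B-spline has coefficients proportional to the differences \beta_{i,k} - \beta_{i,k-1} = w_{i,k} > 0.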
A.2.1 Unknown template

For an unknown template, $f$, modelled as a B-spline,
\[
f(t) = \sum_{k=1}^{K_f} \alpha_k B_k^f(t),
\]
we fix the means of the random effects, $\mu$ and $\bar{\kappa}$, at some constants. The unknown parameters are $\alpha = (\alpha_1, \ldots, \alpha_{K_f})^\top$, $\sigma^2$, $\Sigma$ and $\tau$.

Complete-data log-likelihood

Let $B_i(w_i)$ be the $n_i \times K_f$ evaluation matrix of the basis functions, the $B_k^f$'s, at the warped times $h(t_{i1}; w_i), \ldots, h(t_{in_i}; w_i)$. The complete-data log-likelihood has three components,
\[
\ell^c(\theta; z) = \ell^c_y(\alpha, \sigma^2; z) + \ell^c_a(\Sigma; a) + \ell^c_w(\tau; w),
\]
where each component has a corresponding deviance function:
\begin{align}
-2\ell^c_y(\alpha, \sigma^2; z)
&= \sum_{i=1}^N \left\{ n_i \log(2\pi\sigma^2) + \frac{1}{\sigma^2} \left\|y_i - a_{i,sh}\mathbf{1} - a_{i,sc} B_i(w_i)\alpha\right\|^2 \right\} \nonumber \\
&= n_{tot}\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\Bigg\{ \sum_{i=1}^N \|y_i - a_{i,sh}\mathbf{1}\|^2
- 2\Bigg[\sum_{i=1}^N (y_i - a_{i,sh}\mathbf{1})^\top \big(a_{i,sc}B_i(w_i)\big)\Bigg]\alpha \nonumber \\
&\qquad + \mathrm{tr}\Bigg[\Bigg(\sum_{i=1}^N a_{i,sc}^2\, B_i(w_i)^\top B_i(w_i)\Bigg)\alpha\alpha^\top\Bigg] \Bigg\}, \tag{A.5}
\end{align}
\begin{align}
-2\ell^c_a(\Sigma; a)
&= \sum_{i=1}^N \left\{ 2\log(2\pi) + \log\det\Sigma + (a_i-\mu)^\top \Sigma^{-1}(a_i-\mu) \right\} \nonumber \\
&= 2N\log(2\pi) + N\log\det\Sigma + \mathrm{tr}\Bigg[\Bigg(\sum_{i=1}^N (a_i-\mu)(a_i-\mu)^\top\Bigg)\Sigma^{-1}\Bigg] \tag{A.6}
\end{align}
and
\begin{align}
-2\ell^c_w(\tau; w)
&= \sum_{i=1}^N \left\{ 2\Bigg[\sum_{k=2}^{K_h}\log\Gamma(\tau\bar{\kappa}_k) - \log\Gamma(\tau)\Bigg]
- 2\sum_{k=2}^{K_h}(\tau\bar{\kappa}_k - 1)\log(w_{i,k}) \right\} \nonumber \\
&= 2N\Bigg[\sum_{k=2}^{K_h}\log\Gamma(\tau\bar{\kappa}_k) - \log\Gamma(\tau)\Bigg]
- 2\sum_{k=2}^{K_h}(\tau\bar{\kappa}_k - 1)\Bigg[\sum_{i=1}^N \log(w_{i,k})\Bigg]. \tag{A.7}
\end{align}

Minimum sufficient statistics

The minimum sufficient statistics are
\[
S_{yy} = \sum_{i=1}^N S_{yy,i} \equiv \sum_{i=1}^N \|y_i - a_{i,sh}\mathbf{1}\|^2, \qquad
S_{By} = \sum_{i=1}^N S_{By,i} \equiv \sum_{i=1}^N a_{i,sc}\, B_i(w_i)^\top (y_i - a_{i,sh}\mathbf{1}),
\]
\[
S_{BB} = \sum_{i=1}^N S_{BB,i} \equiv \sum_{i=1}^N a_{i,sc}^2\, B_i(w_i)^\top B_i(w_i)
\]
for (A.5),
\[
S_a = \sum_{i=1}^N S_{a,i} \equiv \sum_{i=1}^N (a_i-\mu)(a_i-\mu)^\top
\]
for (A.6), and
\[
S_{w_k} = \sum_{i=1}^N S_{w_k,i} \equiv \sum_{i=1}^N \log(w_{i,k}) \quad \text{for } k = 2, \ldots, K_h
\]
for (A.7).

Gradients and Hessians

Using the rules that $\partial\,\mathrm{tr}(S\Sigma)/\partial\Sigma = S$, $\partial X^{-1}/\partial\Sigma = -X^{-1}(\partial X/\partial\Sigma)X^{-1}$ and $\partial\log|\det\Sigma|/\partial\Sigma = (\Sigma^{-1})^\top$, the gradients of the complete-data log-likelihood are
\begin{align}
\frac{\partial\ell^c}{\partial\alpha} &= \frac{1}{\sigma^2}\left(S_{By} - S_{BB}\alpha\right) \nonumber \\
\frac{\partial\ell^c}{\partial\sigma^2} &= \frac{1}{2}\left[\frac{1}{\sigma^4}\left(S_{yy} - 2S_{By}^\top\alpha + \alpha^\top S_{BB}\alpha\right) - \frac{n_{tot}}{\sigma^2}\right] \nonumber \\
\frac{\partial\ell^c}{\partial\Sigma} &= \frac{1}{2}\left(\Sigma^{-1}S_a\Sigma^{-1} - N\Sigma^{-1}\right) \nonumber \\
\frac{\partial\ell^c}{\partial\tau} &= \sum_{k=2}^{K_h}\bar{\kappa}_k S_{w_k} - N\Bigg[\sum_{k=2}^{K_h}\bar{\kappa}_k\psi(\tau\bar{\kappa}_k) - \psi(\tau)\Bigg], \tag{A.8}
\end{align}
where $\psi$ is the digamma function.

The Hessian of $\ell^c$ with respect to $\tau$ is
\[
\frac{\partial^2\ell^c}{\partial\tau^2} = -N\Bigg[\sum_{k=2}^{K_h}\bar{\kappa}_k^2\,\psi'(\tau\bar{\kappa}_k) - \psi'(\tau)\Bigg], \tag{A.9}
\]
where $\psi'$ is the trigamma function.
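A one-dimensional Newton-Raphson search for $\tau$ based on the gradient (A.8) and Hessian (A.9), with step-halving to keep $\tau$ positive, can be sketched as follows. The digamma and trigamma functions are approximated here by finite differences of `math.lgamma` to keep the sketch dependency-free, and the warping effects are simulated, so the recovered $\hat\tau$ is only close to (not equal to) the truth.

```python
import math
import random

random.seed(2)

EPS = 1e-5

def digamma(x):
    # finite-difference stand-in for the digamma function psi(x)
    return (math.lgamma(x + EPS) - math.lgamma(x - EPS)) / (2 * EPS)

def trigamma(x):
    # finite-difference stand-in for the trigamma function psi'(x)
    return (math.lgamma(x + EPS) - 2 * math.lgamma(x)
            + math.lgamma(x - EPS)) / EPS**2

def tau_newton(kbar, Sw, N, tau0=1.0, n_iter=50):
    """Maximize the Dirichlet component (A.7) over the precision tau.

    kbar: fixed mean vector (sums to 1); Sw[k] = sum_i log(w_{i,k}).
    Uses gradient (A.8) and Hessian (A.9); halves the step whenever the
    update would leave the admissible region tau > 0.
    """
    tau = tau0
    for _ in range(n_iter):
        grad = (sum(kb * s for kb, s in zip(kbar, Sw))
                - N * (sum(kb * digamma(tau * kb) for kb in kbar)
                       - digamma(tau)))
        hess = -N * (sum(kb**2 * trigamma(tau * kb) for kb in kbar)
                     - trigamma(tau))
        step = grad / hess
        new = tau - step
        while new <= 0:          # step-halving safeguard
            step /= 2
            new = tau - step
        tau = new
    return tau

# simulate warping effects w_i ~ Dirichlet(tau_true * kbar), then recover tau
kbar = [0.2, 0.3, 0.5]
tau_true = 8.0
N = 400

def rdirichlet(alphas):
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

W = [rdirichlet([tau_true * kb for kb in kbar]) for _ in range(N)]
Sw = [sum(math.log(w[k]) for w in W) for k in range(len(kbar))]
tau_hat = tau_newton(kbar, Sw, N)
```

Since the log-likelihood in $\tau$ is smooth and well-behaved here, the iteration settles at a stationary point where the gradient (A.8) vanishes.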
M-step

Denote the conditional expectation $E[S \mid y; \theta^*]$ of a sufficient statistic $S$ by $\tilde{S}$. The auxiliary function of the EM algorithm, $Q(\theta; \theta^*)$, has the same form as the complete-data log-likelihood except with the sufficient statistics replaced by their conditional expectations. In the M-step of the EM algorithm, if $\hat\theta$ is an interior point in the parameter space that maximizes $Q(\theta; \theta^*)$, then it solves the first order conditions
\begin{align*}
\tilde{S}_{By} - \tilde{S}_{BB}\alpha &= 0 \\
\frac{1}{\sigma^4}\left(\tilde{S}_{yy} - 2\tilde{S}_{By}^\top\alpha + \alpha^\top\tilde{S}_{BB}\alpha\right) - \frac{n_{tot}}{\sigma^2} &= 0 \\
\Sigma^{-1}\tilde{S}_a\Sigma^{-1} - N\Sigma^{-1} &= 0 \\
\sum_{k=2}^{K_h}\bar{\kappa}_k\tilde{S}_{w_k} - N\Bigg[\sum_{k=2}^{K_h}\bar{\kappa}_k\psi(\tau\bar{\kappa}_k) - \psi(\tau)\Bigg] &= 0.
\end{align*}
For $\alpha$, $\sigma^2$ and $\Sigma$, closed-form solutions exist:
\begin{align*}
\hat\alpha &= \tilde{S}_{BB}^{-1}\tilde{S}_{By} \\
\hat\sigma^2 &= \frac{1}{n_{tot}}\left(\tilde{S}_{yy} - 2\tilde{S}_{By}^\top\hat\alpha + \hat\alpha^\top\tilde{S}_{BB}\hat\alpha\right) \\
\hat\Sigma &= \frac{1}{N}\tilde{S}_a.
\end{align*}
For $\tau$, we find the solution numerically by the Newton-Raphson method using the gradient in (A.8) and the Hessian in (A.9). Since $\tau$ is constrained to be positive, if the updated value in the Newton-Raphson method is non-positive, we reduce the step size by half successively until the new value is admissible.

A.2.2 Known template

If the base shape, $f$, is known and fixed, the unknown model parameters to be estimated are $\sigma^2$, $\mu$, $\Sigma$, $\bar{\kappa}$ and $\tau$. Since the mean of the Dirichlet distribution, $\bar{\kappa}$, is not fixed, we adopt the usual parameterization in terms of concentration parameters,
\[
\kappa = \tau\bar{\kappa} = (\kappa_2, \ldots, \kappa_{K_h})^\top.
\]
Let $b_i(w_i) = \big(f(h(t_{i1}; w_i)), \ldots, f(h(t_{in_i}; w_i))\big)^\top$ be the base shape evaluated at the warped times, the $h(t_{ij}; w_i)$'s, for curve $i$. The complete-data log-likelihood is
\[
\ell^c(\theta; z) = \ell^c_y(\sigma^2; z) + \ell^c_a(\mu, \Sigma; a) + \ell^c_w(\kappa; w)
\]
with corresponding deviance functions
\begin{align}
-2\ell^c_y(\sigma^2; z)
&= \sum_{i=1}^N \left\{ n_i\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\left\|y_i - a_{i,sh}\mathbf{1} - a_{i,sc}b_i(w_i)\right\|^2 \right\} \nonumber \\
&= n_{tot}\log(2\pi\sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^N \left\|y_i - a_{i,sh}\mathbf{1} - a_{i,sc}b_i(w_i)\right\|^2, \tag{A.10}
\end{align}
\begin{align}
-2\ell^c_a
&= \sum_{i=1}^N \left\{ 2\log(2\pi) + \log\det\Sigma + (a_i-\mu)^\top\Sigma^{-1}(a_i-\mu) \right\} \nonumber \\
&= 2N\log(2\pi) + N\log\det\Sigma + N\mu^\top\Sigma^{-1}\mu + \mathrm{tr}\Bigg[\sum_{i=1}^N \left(a_i a_i^\top\right)\Sigma^{-1}\Bigg] \nonumber \\
&\qquad - \mu^\top\Sigma^{-1}\Bigg(\sum_{i=1}^N a_i\Bigg) - \Bigg(\sum_{i=1}^N a_i\Bigg)^\top\Sigma^{-1}\mu \tag{A.11}
\end{align}
and
\begin{align}
-2\ell^c_w(\kappa; w)
&= \sum_{i=1}^N \left\{ 2\Bigg[\sum_{k=2}^{K_h}\log\Gamma(\kappa_k) - \log\Gamma(\kappa^\top\mathbf{1})\Bigg]
- 2\sum_{k=2}^{K_h}(\kappa_k - 1)\log(w_{i,k}) \right\} \nonumber \\
&= 2N\Bigg[\sum_{k=2}^{K_h}\log\Gamma(\kappa_k) - \log\Gamma(\kappa^\top\mathbf{1})\Bigg]
- 2\sum_{k=2}^{K_h}(\kappa_k - 1)\Bigg[\sum_{i=1}^N\log(w_{i,k})\Bigg]. \tag{A.12}
\end{align}

Minimum sufficient statistics

The minimum sufficient statistics are
\[
S_r = \sum_{i=1}^N S_{r,i} \equiv \sum_{i=1}^N \left\|y_i - a_{i,sh}\mathbf{1} - a_{i,sc}b_i(w_i)\right\|^2
\]
for (A.10),
\[
S_{a1} = \sum_{i=1}^N S_{a1,i} \equiv \sum_{i=1}^N a_i, \qquad
S_{a2} = \sum_{i=1}^N S_{a2,i} \equiv \sum_{i=1}^N a_i a_i^\top
\]
for (A.11), and
\[
S_{w_k} = \sum_{i=1}^N S_{w_k,i} \equiv \sum_{i=1}^N \log(w_{i,k}) \quad \text{for } k = 2, \ldots, K_h
\]
for (A.12).

Gradients and Hessians

Using the general fact that $\mu^\top\Sigma x = \mathrm{tr}(x\mu^\top\Sigma)$, the gradients of the complete-data log-likelihood are
\begin{align}
\frac{\partial\ell^c}{\partial\sigma^2} &= \frac{1}{2}\left[\frac{1}{\sigma^4}S_r - \frac{n_{tot}}{\sigma^2}\right] \nonumber \\
\frac{\partial\ell^c}{\partial\mu} &= \Sigma^{-1}S_{a1} - N\Sigma^{-1}\mu \nonumber \\
\frac{\partial\ell^c}{\partial\Sigma} &= \frac{1}{2}\left[\Sigma^{-1}\left(S_{a2} - S_{a1}\mu^\top - \mu S_{a1}^\top + N\mu\mu^\top\right)\Sigma^{-1} - N\Sigma^{-1}\right] \nonumber \\
\frac{\partial\ell^c}{\partial\kappa_j} &= \sum_{i=1}^N\log(w_{i,j}) - N\left[\psi(\kappa_j) - \psi(\kappa^\top\mathbf{1})\right] \tag{A.13}
\end{align}
for $j = 2, \ldots, K_h$. The Hessian with respect to $\kappa$ has entries
\[
\frac{\partial^2\ell^c}{\partial\kappa_j\partial\kappa_k} =
\begin{cases}
N\left[\psi'(\kappa^\top\mathbf{1}) - \psi'(\kappa_j)\right], & \text{for } j = k \\
N\psi'(\kappa^\top\mathbf{1}), & \text{for } j \neq k.
\end{cases} \tag{A.14}
\]

M-step

For $\sigma^2$, $\mu$ and $\Sigma$, the M-step solves the first order conditions
\begin{align*}
\frac{1}{\sigma^4}S_r - \frac{n_{tot}}{\sigma^2} &= 0, \\
\Sigma^{-1}S_{a1} - N\Sigma^{-1}\mu &= 0, \\
\Sigma^{-1}\left(S_{a2} - S_{a1}\mu^\top - \mu S_{a1}^\top + N\mu\mu^\top\right)\Sigma^{-1} - N\Sigma^{-1} &= 0,
\end{align*}
with closed-form solutions
\begin{align*}
\hat\sigma^2 &= \frac{S_r}{n_{tot}}, \\
\hat\mu &= \frac{S_{a1}}{N}, \\
\hat\Sigma &= \frac{S_{a2}}{N} - \frac{S_{a1}}{N}\hat\mu^\top - \hat\mu\frac{S_{a1}^\top}{N} + \hat\mu\hat\mu^\top
= \frac{S_{a2}}{N} - \hat\mu\hat\mu^\top.
\end{align*}

On the other hand, the Dirichlet concentration parameters can be updated by Newton's method. The update at the $(k+1)$-th iteration of Newton's method is
\[
\kappa^{(k+1)} = \kappa^{(k)} - H^{-1}g,
\]
where $g$ is the gradient vector with elements as in (A.13) and $H$ is the Hessian matrix with entries as in (A.14). As pointed out by Minka (2000), the Hessian can be expressed as
\[
H = N\psi'(\kappa^\top\mathbf{1})\cdot\mathbf{1}\mathbf{1}^\top
- N\,\mathrm{diag}\left[\psi'(\kappa_2), \ldots, \psi'(\kappa_{K_h})\right].
\]
Let $\Lambda = -N\,\mathrm{diag}\left[\psi'(\kappa_2), \ldots, \psi'(\kappa_{K_h})\right]$ and $z = N\psi'(\kappa^\top\mathbf{1})$. By the Woodbury matrix identity,
\[
H^{-1} = \left(\Lambda + z\mathbf{1}\mathbf{1}^\top\right)^{-1}
= \Lambda^{-1} - \frac{\Lambda^{-1}z\mathbf{1}\mathbf{1}^\top\Lambda^{-1}}{1 + z\mathbf{1}^\top\Lambda^{-1}\mathbf{1}}
= \Lambda^{-1} - \frac{\Lambda^{-1}\mathbf{1}\mathbf{1}^\top\Lambda^{-1}}{1/z + \mathbf{1}^\top\Lambda^{-1}\mathbf{1}}.
\]
Hence, the update step can be computed directly as
\[
H^{-1}g = \Lambda^{-1}g - \left(\Lambda^{-1}\mathbf{1}\right)\cdot\frac{\mathbf{1}^\top\Lambda^{-1}g}{1/z + \mathbf{1}^\top\Lambda^{-1}\mathbf{1}}
= \Lambda^{-1}\left(g - z^*\cdot\mathbf{1}\right),
\]
where $z^* = \dfrac{\mathbf{1}^\top\Lambda^{-1}g}{1/z + \mathbf{1}^\top\Lambda^{-1}\mathbf{1}}$, so that the $j$-th element is
\[
\left(H^{-1}g\right)_j = \frac{g_j - z^*}{\lambda_j},
\]
with $\lambda_j$ the $j$-th diagonal entry of $\Lambda$.

A.2.3 Metropolis-Hastings algorithm for simulating warping effects

In this section, we detail the derivation of the Metropolis-Hastings algorithm for simulating warping effects $w_i$ for the general curve registration model (A.3)-(A.4). Let $w_i^{(k)}$ and $a_i^{(k)}$ be the states of the Markov chain for the random warping and amplitude effects of curve $i$ sampled in the $k$-th iteration. In the $(k+1)$-th iteration, we generate a proposal, $w_i^*$, for the warping effect by first mapping $w_i^{(k)}$ from the $(K_h-2)$-simplex $\mathcal{S}^{K_h-2}$ to $v_i^{(k)} = (v_{i,2}^{(k)}, \ldots, v_{i,K_h}^{(k)})^\top$ in the space $\mathcal{P}^{K_h-1} = \{v \in \mathbb{R}^{K_h-1} : v^\top\mathbf{1} = 0\}$ by the centered-log-ratio transform,
\[
v_i^{(k)} = G(w_i^{(k)}) = \left[\log(w_{i,2}) - \overline{\log(w_i)},\ \ldots,\ \log(w_{i,K_h}) - \overline{\log(w_i)}\right]^\top,
\]
where $\overline{\log(w_i)} = \frac{1}{K_h-1}\sum_{j=2}^{K_h}\log(w_{i,j})$. Then, we perturb $v_i^{(k)}$ by a symmetric Gaussian random walk in $\mathcal{P}^{K_h-1}$ to generate a proposal,
\[
v_i^* \sim N\!\left(v_i^{(k)}, V\right)
\]
where
\[
V = \sigma^2_{MH}\left[I - (K_h-1)^{-1}J\right]
\]
is a $(K_h-1)\times(K_h-1)$ covariance matrix with $I$ the identity matrix and $J$ a matrix of ones. The proposal is then transformed back to $\mathcal{S}^{K_h-2}$ by the softmax transform,
\[
w_i^* = G^{-1}(v_i^*) = \frac{1}{Z}\left[\exp(v_{i,2}^*), \ldots, \exp(v_{i,K_h}^*)\right]^\top,
\]
where $Z = \sum_{j'=2}^{K_h}\exp(v_{i,j'}^*)$.

The proposal density is
\[
Q\left(w^* \mid w\right) = T\left(v^* \mid v\right)\left|\frac{\partial v^*}{\partial w^*}\right|,
\]
where $T$ is the random walk transition kernel, which is symmetric with respect to its two arguments. For the Jacobian term, since $v$ and $w$ are completely determined by their first $K_h-2$ elements, we can focus on the truncated vectors $v_- = (v_2, \ldots, v_{K_h-1})^\top \in \mathbb{R}^{K_h-2}$ and $w_- = (w_2, \ldots, w_{K_h-1})^\top$. We observe that
\[
\frac{\partial w_i}{\partial v_j} = \frac{\partial}{\partial v_j}\frac{1}{Z}e^{v_i}
= \frac{1}{Z^2}\left[Ze^{v_i}\mathbf{1}\{i=j\} - e^{v_i}\left(e^{v_j} - e^{v_{K_h}}\right)\right]
= \mathbf{1}\{i=j\}\,w_i - w_i(w_j - w_{K_h}),
\]
and thus the first derivative of the transformation is
\[
\frac{\partial w_-}{\partial v_-} = U - w_-\left(w_- - w_{K_h}\mathbf{1}\right)^\top,
\]
where $U = \mathrm{diag}(w_2, \ldots, w_{K_h-1})$. By the matrix determinant lemma for a rank-1 update, the Jacobian of the softmax transform is
\[
\det\left(\frac{\partial w_-}{\partial v_-}\right)
= \det(U)\cdot\left(1 - (w_- - w_{K_h}\mathbf{1})^\top U^{-1}w_-\right)
= \prod_{l=2}^{K_h-1}w_l\cdot\left[1 - \sum_{l=2}^{K_h-1}(w_l - w_{K_h})\right]
= D\cdot\prod_{l=2}^{K_h}w_l,
\]
where $D = K_h - 1$ is a constant.

The proposal, $w_i^*$, is accepted with probability
\[
\min\left\{1,\ \frac{f_{a,w|y}\big(a_i^{(k)}, w_i^* \mid y_i; \hat\theta\big)\,Q\big(w_i^{(k)} \mid w_i^*\big)}{f_{a,w|y}\big(a_i^{(k)}, w_i^{(k)} \mid y_i; \hat\theta\big)\,Q\big(w_i^* \mid w_i^{(k)}\big)}\right\}
= \min\left\{1,\ \frac{f_{a,w,y}\big(a_i^{(k)}, w_i^*, y_i; \hat\theta\big)\,Q\big(w_i^{(k)} \mid w_i^*\big)}{f_{a,w,y}\big(a_i^{(k)}, w_i^{(k)}, y_i; \hat\theta\big)\,Q\big(w_i^* \mid w_i^{(k)}\big)}\right\},
\]
where $f_{a,w|y}$ is the conditional density of the amplitude and warping effects given the data, $y_i$, and $f_{a,w,y}$ is the full joint density, assuming the model parameters are equal to the estimates, $\hat\theta$. By cancelling the common terms in the joint density, the acceptance probability can be expressed as
\begin{align*}
&\min\left\{1,\ \frac{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^*; \hat\theta\big)\,f_a\big(a_i^{(k)}; \hat\theta\big)\,f_w\big(w_i^*; \hat\theta\big)\,Q\big(w_i^{(k)} \mid w_i^*\big)}{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^{(k)}; \hat\theta\big)\,f_a\big(a_i^{(k)}; \hat\theta\big)\,f_w\big(w_i^{(k)}; \hat\theta\big)\,Q\big(w_i^* \mid w_i^{(k)}\big)}\right\} \\
&\qquad = \min\left\{1,\ \frac{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^*; \hat\theta\big)\,f_w\big(w_i^*; \hat\theta\big)}{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^{(k)}; \hat\theta\big)\,f_w\big(w_i^{(k)}; \hat\theta\big)}\cdot\prod_{j=2}^{K_h}\frac{w_{i,j}^*}{w_{i,j}^{(k)}}\right\},
\end{align*}
where
\[
\frac{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^*; \hat\theta\big)}{f_{y|a,w}\big(y_i \mid a_i^{(k)}, w_i^{(k)}; \hat\theta\big)}
= \exp\left\{\frac{1}{2\hat\sigma^2}\left[\left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^{(k)}\big)\right\|^2
- \left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^*\big)\right\|^2\right]\right\},
\]
with $\hat b_i(w)$ the vector of the estimated or known base shape for curve $i$ evaluated at the warped times $h(t_{i1}; w), \ldots, h(t_{in_i}; w)$; and
\[
\frac{f_w\big(w_i^*; \hat\theta\big)}{f_w\big(w_i^{(k)}; \hat\theta\big)}\cdot\prod_{j=2}^{K_h}\frac{w_{i,j}^*}{w_{i,j}^{(k)}}
= \prod_{j=2}^{K_h}\left(\frac{w_{i,j}^*}{w_{i,j}^{(k)}}\right)^{\hat\tau\hat{\bar\kappa}_j - 1}\cdot\prod_{j=2}^{K_h}\frac{w_{i,j}^*}{w_{i,j}^{(k)}}
= \prod_{j=2}^{K_h}\left(\frac{w_{i,j}^*}{w_{i,j}^{(k)}}\right)^{\hat\tau\hat{\bar\kappa}_j}.
\]
Hence, the acceptance probability can be simplified to
\[
\frac{\exp\left\{\frac{1}{2\hat\sigma^2}\left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^{(k)}\big)\right\|^2\right\}}
{\exp\left\{\frac{1}{2\hat\sigma^2}\left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^*\big)\right\|^2\right\}}
\cdot\prod_{j=2}^{K_h}\left(\frac{w_{i,j}^*}{w_{i,j}^{(k)}}\right)^{\hat\tau\hat{\bar\kappa}_j}. \tag{A.15}
\]

Appendix B

Appendix to Chapter 3

B.1 Derivations of SAEM for mixture of warping model

In this section, we show the details of the SAEM algorithm for the mixture of warping model,
\begin{align*}
y_{ij} &= a_{i,sh} + a_{i,sc}\,f \circ h(t_{ij}; w_i) + \epsilon_{ij} \\
h(t; w_i) &= \sum_{k=1}^{K_h}\beta_{i,k}B_k^h(t)
\end{align*}
for $i = 1, \ldots, N$ and $j = 1, \ldots, n_i$, where
\begin{align*}
\beta_{i,k} &= \sum_{j=1}^k w_{i,j} \\
m_i &\sim \mathrm{Categorical}(\pi_1, \ldots, \pi_M) \\
w_{i,1} &\equiv 0 \\
w_i \mid m_i = m &\sim \mathrm{Dirichlet}(\bar\kappa_m, \tau_m) \\
a_i &\sim N(\mu, \Sigma) \\
\epsilon_{ij} &\sim N(0, \sigma^2),
\end{align*}
where $\bar\kappa_m = (\bar\kappa_{m,2}, \ldots, \bar\kappa_{m,K_h})^\top$, $m = 1, \ldots, M$, are the mean vectors and $\tau_1, \ldots, \tau_M$ are the precision parameters of the $M$ Dirichlet mixing components. Whenever the Dirichlet mean is not specified, we use the alternative parameterization in terms of the usual Dirichlet concentration parameters,
\[
\kappa_m = \tau_m\bar\kappa_m = (\kappa_{m,2}, \ldots, \kappa_{m,K_h})^\top.
\]
In the case of an unknown base shape, $f$ is modelled as a B-spline,
\[
f(t) = \sum_{k=1}^{K_f}\alpha_k B_k^f(t),
\]
with the first mixture component for warping parameterized by mean $\bar\kappa_1$ and precision $\tau_1$ such that $\kappa_1 = \tau_1\bar\kappa_1$. For identifiability, we fix the means of the amplitude effects, $\mu$, and the mean of the first mixture component for warping, $\bar\kappa_1$, at some constants. Otherwise, for a known and fixed template, $f$, both $\mu$ and $\kappa_1$ are to be estimated.

Sufficient statistics

In addition to the notation defined in Appendix A.2, we denote the vector of cluster labels by $m = (m_1, \ldots, m_N)$. On the other hand, $z = (y, a, w)$ retains the same definition from Appendix A.2. For an unknown base shape, the complete-data log-likelihood is
\[
\ell^c(\theta; z) = \ell^c_y(\alpha, \sigma^2; z) + \ell^c_a(\Sigma; a) + \ell^c_w(\tau_1, \kappa_2, \ldots, \kappa_M, \pi_1, \ldots, \pi_M; w, m),
\]
whereas for a known template shape, the complete-data log-likelihood is
\[
\ell^c(\theta; z) = \ell^c_y(\sigma^2; z) + \ell^c_a(\mu, \Sigma; a) + \ell^c_w(\kappa_1, \ldots, \kappa_M, \pi_1, \ldots, \pi_M; w, m).
\]
In either case, the first two components, $\ell^c_y$ and $\ell^c_a$, are identical to those for the registration model discussed in Chapter 2 and Appendix A.2. Hence, the minimum sufficient statistics and the associated E-step and M-step for these two components remain the same.
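Among the shared pieces is the Newton step for the Dirichlet concentration parameters, which never needs the full Hessian: Appendix A.2.2 reduces it to the elementwise form $(H^{-1}g)_j = (g_j - z^*)/\lambda_j$ via the Woodbury identity. A minimal check of that shortcut, using illustrative values of $\kappa$, $g$ and $N$ (not estimates from the thesis data) and a finite-difference stand-in for the trigamma function:

```python
import math

def trigamma(x, eps=1e-5):
    # finite-difference stand-in for psi'(x), built from math.lgamma
    return (math.lgamma(x + eps) - 2 * math.lgamma(x)
            + math.lgamma(x - eps)) / eps**2

def newton_direction(g, kappa, N):
    """Compute H^{-1} g for the Dirichlet Hessian H = Lambda + z 11^T
    without forming H, using (H^{-1} g)_j = (g_j - z*) / lambda_j."""
    lam = [-N * trigamma(k) for k in kappa]       # diagonal of Lambda
    z = N * trigamma(sum(kappa))
    zstar = (sum(gj / lj for gj, lj in zip(g, lam))
             / (1.0 / z + sum(1.0 / lj for lj in lam)))
    return [(gj - zstar) / lj for gj, lj in zip(g, lam)]

# illustrative gradient, concentrations and sample size
kappa = [1.5, 2.0, 0.7, 3.2]
g = [0.3, -0.1, 0.25, -0.4]
N = 50
x = newton_direction(g, kappa, N)
```

The check below rebuilds $H$ explicitly from its (A.14)-style entries and verifies that $Hx = g$, confirming the O(K) elementwise formula agrees with a direct solve.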
The only algorithmic details that need updating are those concerning the mixture of Dirichlet components, $\ell^c_w$, which has likelihood function of the form
\begin{align*}
\ell^c_w &= \sum_{m=1}^M\sum_{i=1}^N\delta_{i,m}\log(\pi_m)
+ \sum_{m=1}^M\sum_{i=1}^N\delta_{i,m}\Bigg\{\sum_{k=2}^{K_h}(\kappa_{m,k}-1)\log(w_{i,k})
- \Bigg[\sum_{k=2}^{K_h}\log\Gamma(\kappa_{m,k}) - \log\Gamma\Bigg(\sum_{k=2}^{K_h}\kappa_{m,k}\Bigg)\Bigg]\Bigg\} \\
&= \sum_{m=1}^M\Bigg(\sum_{i=1}^N\delta_{i,m}\Bigg)\log(\pi_m)
+ \sum_{m=1}^M\Bigg\{\sum_{k=2}^{K_h}(\kappa_{m,k}-1)\Bigg[\sum_{i=1}^N\delta_{i,m}\log(w_{i,k})\Bigg]
- \Bigg(\sum_{i=1}^N\delta_{i,m}\Bigg)\Bigg[\sum_{k=2}^{K_h}\log\Gamma(\kappa_{m,k}) - \log\Gamma\Bigg(\sum_{k=2}^{K_h}\kappa_{m,k}\Bigg)\Bigg]\Bigg\},
\end{align*}
where $\delta_{i,m} = \mathbf{1}\{m_i = m\}$ is a binary indicator of cluster membership for curve $i$ and cluster $m$. The minimum sufficient statistics are
\[
N_m = \sum_{i=1}^N\delta_{i,m}, \quad \text{for } m = 1, \ldots, M; \text{ and}
\]
\[
S_{w_{m,k}} = \sum_{i=1}^N\delta_{i,m}\log(w_{i,k}), \quad \text{for } k = 2, \ldots, K_h \text{ and } m = 1, \ldots, M.
\]

M-step

For the mixing proportions, $\pi = (\pi_1, \ldots, \pi_M)$, the parameters have a sum-to-one constraint. Hence, the likelihood, $\ell^c_w$, can be maximized with respect to $\pi$ by solving the Lagrangian conditions,
\[
\frac{N_m}{\hat\pi_m} - \hat\lambda = 0, \quad \text{for } m = 1, \ldots, M, \qquad \text{and} \qquad \sum_{m=1}^M\hat\pi_m - 1 = 0,
\]
which have solutions $\hat\lambda = \sum_{m=1}^M N_m \equiv N$ and $\hat\pi_m = N_m/N$.

For an individual Dirichlet component, the parameters can be updated in the M-step by Newton's method. In general, if the mean, $\bar\kappa_m$, is not fixed, the gradient and Hessian with respect to the concentration parameters, $\kappa_m$, have the same form as (A.13) and (A.14):
\[
\frac{\partial\ell^c}{\partial\kappa_{m,j}} = S_{w_{m,j}} - N_m\left[\psi(\kappa_{m,j}) - \psi(\kappa_m^\top\mathbf{1})\right]
\]
for $j = 2, \ldots, K_h$, and
\[
\frac{\partial^2\ell^c}{\partial\kappa_{m,j}\partial\kappa_{m,k}} =
\begin{cases}
N_m\left[\psi'(\kappa_m^\top\mathbf{1}) - \psi'(\kappa_{m,j})\right], & \text{for } j = k \\
N_m\psi'(\kappa_m^\top\mathbf{1}), & \text{for } j \neq k.
\end{cases}
\]
On the other hand, for the first component, if the mean, $\bar\kappa_1$, is fixed, the gradient and Hessian with respect to the precision parameter, $\tau_1$, have the same form as (A.8) and (A.9):
\[
\frac{\partial\ell^c}{\partial\tau_1} = \sum_{k=2}^{K_h}\bar\kappa_{1,k}S_{w_{1,k}} - N_1\Bigg[\sum_{k=2}^{K_h}\bar\kappa_{1,k}\psi(\tau_1\bar\kappa_{1,k}) - \psi(\tau_1)\Bigg]
\]
and
\[
\frac{\partial^2\ell^c}{\partial\tau_1^2} = -N_1\Bigg[\sum_{k=2}^{K_h}\bar\kappa_{1,k}^2\,\psi'(\tau_1\bar\kappa_{1,k}) - \psi'(\tau_1)\Bigg].
\]

Metropolis-Hastings-within-Gibbs algorithm

The MCMC sample of the random effects for the stochastic approximation E-step can be generated by a Metropolis-Hastings-within-Gibbs algorithm. Denote the $k$-th MCMC state for curve $i$ by $(m_i^{(k)}, a_i^{(k)}, w_i^{(k)})$. We update the MCMC states in the order of $w_i^{(k+1)}$, $a_i^{(k+1)}$, followed by $m_i^{(k+1)}$.

Firstly, given $m_i^{(k)}$ and $a_i^{(k)}$, the warping effects, $w_i^{(k+1)}$, can be simulated by the Metropolis-Hastings algorithm outlined in Appendix A.2.3. The same proposal distribution is used, with acceptance probability
\[
\frac{\exp\left\{\frac{1}{2\hat\sigma^2}\left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^{(k)}\big)\right\|^2\right\}}
{\exp\left\{\frac{1}{2\hat\sigma^2}\left\|y_i - a_{i,sh}^{(k)}\mathbf{1} - a_{i,sc}^{(k)}\hat b_i\big(w_i^*\big)\right\|^2\right\}}
\cdot\prod_{j=2}^{K_h}\left(\frac{w_{i,j}^*}{w_{i,j}^{(k)}}\right)^{\hat\kappa_{m_i^{(k)},j}},
\]
which is identical to (A.15) except that the estimated Dirichlet parameter, $\hat\kappa_{m_i^{(k)},j}$, is now indexed by the cluster membership, $m_i^{(k)}$.

Secondly, given $w_i^{(k)}$ and $m_i^{(k)}$, the amplitude effects and the observations are conjugate Gaussians. Hence, the posterior distribution of $a_i$, given the other random effects and the data, is normal with variance-covariance matrix and mean taking the forms of (2.14) and (2.15).

Lastly, since the cluster membership is categorical, the new state $m_i^{(k+1)}$ can be sampled with posterior probabilities
\begin{align}
P(m_i = m^* \mid a_i, w_i, y_i)
&\propto f_{y|a,w,m}\big(y_i \mid a_i, w_i, m^*; \hat\theta\big)\,f_a\big(a_i; \hat\theta\big)\,f_{w|m}\big(w_i \mid m^*; \hat\theta\big)\,f_m\big(m^*; \hat\theta\big) \nonumber \\
&= f_{y|a,w}\big(y_i \mid a_i, w_i; \hat\theta\big)\,f_a\big(a_i; \hat\theta\big)\,f_{w|m}\big(w_i \mid m^*; \hat\theta\big)\,\hat\pi_{m^*} \nonumber \\
&\propto f_{w|m}\big(w_i \mid m^*; \hat\theta\big)\,\hat\pi_{m^*} \nonumber \\
&= \frac{\Gamma(\kappa_{m^*}^\top\mathbf{1})}{\prod_{k=2}^{K_h}\Gamma(\kappa_{m^*,k})}\prod_{k=2}^{K_h}w_{i,k}^{\kappa_{m^*,k}-1}\,\hat\pi_{m^*} \tag{B.1}
\end{align}
for $m^* = 1, \ldots, M$.

