Practical Considerations for Dimensionality Reduction:
User Guidance, Costly Distances, and Document Data

by

Stephen Ingram

B.Sc. Computer Science, Georgia Institute of Technology, 2004
M.Sc. Computer Science, University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

September 2013

© Stephen Ingram, 2013

Abstract

In this thesis, we explore ways to make practical extensions to Dimensionality Reduction, or DR, algorithms with the goal of addressing challenging, real-world cases.

The first case we consider is that of how to provide guidance to those users employing DR methods in their data analysis. We specifically target users who are not experts in the mathematical concepts behind DR algorithms. We first identify two levels of guidance: global and local. Global user guidance helps non-experts select and arrange a sequence of analysis algorithms. Local user guidance helps users select appropriate algorithm parameter choices and interpret algorithm output. We then present a software system, DimStiller, that incorporates both types of guidance, validating it on several use-cases.

The second case we consider is that of using DR to analyze datasets consisting of documents. In order to modify DR algorithms to handle document datasets effectively, we first analyze the geometric structure of document datasets. Our analysis describes the ways document datasets differ from other kinds of datasets. We then leverage these geometric properties for speed and quality by incorporating ideas from text querying into DR and other algorithms for data analysis.

We then present the Overview prototype, a proof-of-concept document analysis system. Overview synthesizes both the goals of designing systems for data analysts who are DR novices, and performing DR on document data.

The third case we consider is that of costly distance functions, or when the method used to derive the true proximity between two data points is computationally expensive. Using standard approaches to DR in this important use-case can result in either unnecessarily protracted runtimes or long periods of user monitoring. To address the case of costly distances, we develop an algorithm framework, Glint, which efficiently manages the number of distance function calculations for the Multidimensional Scaling class of DR algorithms. We then show that Glint implementations of Multidimensional Scaling algorithms achieve substantial speed improvements or remove the need for human monitoring.

Preface

Parts of this thesis have appeared in publications and journal submissions. Most of Chapter 3 is based on the following published conference paper:

- Stephen Ingram, Tamara Munzner, Veronika Irvine, Melanie Tory, Steven Bergner, and Torsten Möller. DimStiller: Workflows for dimensional analysis and reduction. In Visual Analytics Science and Technology (VAST), 2010 IEEE Symposium on, pages 3–10, 2010.

A version of Chapter 4 has been submitted for publication to a journal as:

- Stephen Ingram, Tamara Munzner. The geometry of fast document queries: implications for visual analytics algorithms.

Sections of Chapter 5 are under preparation for a future conference submission and have also appeared in the following technical report:
- Stephen Ingram, Tamara Munzner, and Jonathan Stray. Hierarchical clustering and tagging of mostly disconnected data. Technical Report TR-2012-01, University of British Columbia Department of Computer Science, May 2012.

Chapter 6 is based on the following published conference paper:

- Stephen Ingram and Tamara Munzner. Glint: an MDS framework for costly distance functions. In Proceedings of SIGRAD, number 81, pages 29–38, 2012.

The author of this thesis is the first author and main contributor on each of these published papers and submissions. As first author, I performed the majority of writing for each of the above-referenced papers. As main contributor, I was responsible for designing the algorithms and coding the software presented in the different thesis chapters, with the exception of the Overview prototype, the development of which received minor coding contributions from co-author Jonathan Stray.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Dimensionality Reduction
  1.2 Methods and Uses for Dimensionality Reduction
  1.3 Thesis Contributions
2 Related Work
  2.1 Dimensionality Reduction Algorithms
    2.1.1 Orthogonal Projections
    2.1.2 Global Distances
    2.1.3 Manifold Methods
    2.1.4 Probability-based Methods
  2.2 Information Retrieval and the Spatial Analysis of Text
  2.3 Information Retrieval
  2.4 Nearest-Neighbor Search
    2.4.1 General Nearest-Neighbor Search
    2.4.2 Inverted-File-Based Nearest-Neighbor Search
  2.5 Hierarchical Clustering
  2.6 Software Systems
    2.6.1 Systems for High-Dimensional Analysis
    2.6.2 Systems for Spatial Analysis of Text Corpora
3 DimStiller: Workflows for Dimensional Analysis and Reduction
  3.1 Local and Global Guidance
  3.2 Users and Tasks
    3.2.1 Target User Population
    3.2.2 Are My Dimensions Meaningful?
    3.2.3 How Do My Dimensions Relate to Each Other?
    3.2.4 Are My Clusters Real?
  3.3 DimStiller Architecture
    3.3.1 Input Tables and Dimension Types
    3.3.2 Operators
    3.3.3 Expressions
    3.3.4 Workflows
    3.3.5 DimStiller Interface
  3.4 Case Studies
    3.4.1 Sustainability Simulation: Measuring Variability
    3.4.2 Computational Chemistry: Evaluating Clusters
  3.5 DimStiller Deployment
4 The Geometry of Fast Document Queries: Implications for Visual Analysis Algorithms
  4.1 The Mostly Disconnected Property of Term-Vector Datasets
    4.1.1 Term-Vector Datasets and TF-IDF
    4.1.2 Benchmark Datasets
    4.1.3 The Mostly Disconnected Property
    4.1.4 Implication for Efficient Queries
  4.2 Nearest-neighbor Search
    4.2.1 Implication: Nearest-Neighbor Search with an Impact-Ordered Inverted File
    4.2.2 Algorithm: APQ
    4.2.3 APQ Results
  4.3 Distance Matrix Computation and Storage
    4.3.1 Implication: Truncated Distance Matrices
    4.3.2 Algorithm: Fast Cluster Tree Calculation with DM2CT
    4.3.3 Cluster Tree Results
  4.4 Dimensionality Reduction
    4.4.1 Implication: Dimensionality Reduction through Local Attraction and Global Repulsion
    4.4.2 Algorithm: MD-SNE
    4.4.3 MD-SNE Results
5 Overview Prototype
  5.1 Overview Prototype Description
  5.2 Clustering, Tagging, and DR for Sensemaking
    5.2.1 Why Clustering?
    5.2.2 Why Tagging?
    5.2.3 Why Dimensionality Reduction?
    5.2.4 Why All Three?
  5.3 Overview Prototype Results
    5.3.1 Afghan War Logs
    5.3.2 Caracas Cables
  5.4 Overview Deployment
6 Glint: An MDS Framework for Costly Distance Functions
  6.1 Distances In MDS
    6.1.1 Expensive Distance Functions
    6.1.2 Experimental Analysis of Sparse MDS Solutions
  6.2 Glint Algorithm Framework
    6.2.1 Glint Outer Loop
  6.3 Glint Instantiations
    6.3.1 Component M: MDS Algorithm
    6.3.2 Component DS: Densification Strategy
    6.3.3 Component S: Objective Function
    6.3.4 Instantiation Design Summary
  6.4 Results
    6.4.1 Dataset and Distance Function Description
    6.4.2 Benchmark Speed and Quality Comparison
    6.4.3 Convergence
7 Conclusion and Future Work
  7.1 Conclusions
    7.1.1 DimStiller
    7.1.2 Algorithms for the Visual Analysis of MoDisco Data
    7.1.3 Overview
    7.1.4 Glint
  7.2 Future Work
    7.2.1 User Guidance
    7.2.2 Efficient DR with Costly Distances
    7.2.3 Mostly Disconnected Data
  7.3 Lessons Learned
    7.3.1 Making an Algorithmic Connection
    7.3.2 Research Impact and Collaboration
Bibliography

List of Tables

Table 4.1 Benchmark datasets, with size in terms of both points and dimensions.
Table 4.2 Comparison of hierarchical clustering timing and accuracy using the truncated matrix with our approximate DM2CT algorithm vs. using the inverted file with the exact Voorhees method.
Table 5.1 List of manually constructed document tags applied to the Cables dataset.
Table 6.1 Parameters used in Glint instantiations.
Table 6.2 Glint component design summary.
Table 6.3 The cost d of a single distance calculation for the benchmark datasets.
Table 6.4 Comparison of full objective functions, time (in seconds), and speedup between Glint instantiations and original MDS algorithms.

List of Figures

Figure 1.1 Uncovering "hidden dimensions" in a database of images.
Figure 1.2 Visualizing high-dimensional clusters in a text database.
Figure 3.1 Anatomy of a simple DimStiller expression
Figure 3.2 Interactive DimStiller controls
Figure 3.3 DimStiller correlation operator views
Figure 3.4 DimStiller Reduce:PCA control and scatterplot
Figure 3.5 Running the DimStiller Cluster Verify workflow on a computational chemistry dataset.
Figure 4.1 Chapter 4 organization diagram.
Figure 4.2 Plot of MoDisco statistics.
Figure 4.3 Distance Histograms of different dataset types.
Figure 4.4 Sample V and I
Figure 4.5 Anatomy of Priority Queue elements
Figure 4.6 All Pairs Queries algorithm pseudocode.
Figure 4.7 Diagram describing the accumulator and priority queue update
Figure 4.8 Speed vs. Accuracy chart for the APQ 5-nearest-neighbor search algorithm on the warlogs dataset.
Figure 4.9 Adjusted agreement rates AR_k among dimensionality reduction techniques on benchmark datasets.
Figure 4.10 2D layouts of the four MoDisco benchmark datasets.
Figure 5.1 The components of the Overview prototype
Figure 5.2 The Overview Prototype loaded with the Warlogs dataset.
Figure 5.3 The DiscoTree control at different prune levels
Figure 5.4 Cable alleging Iranian drone plans.
Figure 5.5 The Cables dataset, with the full set of categories listed in Table 5.1, from the AP Caracas Bureau Chief.
Figure 5.6 Hierarchical structure in the Disconnected Component Tree and Items Plot
Figure 5.7 Cable with description of alleged arms trafficking.
Figure 5.8 Cable alleging Venezuelan influence in Jamaican politics.
Figure 6.1 Diagram of Glint execution.
Figure 6.2 Comparison of speed and quality for MDS algorithms and their Glint instantiations.
Figure 6.3 Log-scale Glint convergence curves.

Acknowledgments

This dissertation has been made possible with help from many good folks. I would like to thank my PhD supervisor, Tamara Munzner, for being a delightful and patient academic mentor. Tamara was instrumental in helping me to select research topics, in imparting good writing and research habits, and in having reasonable expectations. I want to thank my other committee members, Torsten Möller and Nando de Freitas, for providing many helpful suggestions for greatly improving the original text of this thesis. Thanks to my university examiners, Giuseppe Carenini and Edie Rasmussen, for supplying many useful comments and corrections to the exam draft of the thesis. I'd like to thank my external examiner, Leland Wilkinson, for making the trip to Vancouver to attend my oral examination, as well as for being a wellspring of helpful advice and useful pointers to the literature.

Thanks to Jonathan Stray for offering to collaborate with my colleagues and me on the Overview project, as well as for being an abundant source of research ideas. Thanks to the InfoVis group, namely Matthew Brehmer, Jessica Dawson, Joel Ferstay, and Michael Sedlmair, for patiently reading many paper drafts, always providing their impressions and feedback, and for being good company. Thanks to my friend and mentor, Art Warburton, for encouraging me to finish my PhD and giving me the opportunity to work in a comfortable place while doing so.

Special thanks to my patient partner in life, Kelsey, for keeping me sane in a crazy world, for being a wonderful mother to our children, Eleanor and Henry, and for staying my best friend after all these years. And, of course, I want to thank my Mother and Father, for encouraging my interests and for always accepting me, no matter what.

Chapter 1
Introduction

Data analysis is the act of making interpretive statements from the careful scrutiny of recorded quantities. A data analyst is anyone engaged in data analysis, regardless of context, whether professional, educational, or recreational.

A common technique for performing data analysis is to consider the objects under scrutiny to be positioned in space, even when such objects are entities without any concrete reality. The space in which such objects are placed may be metaphorical, and not the three-dimensional physical space in which we reside. The abstract space has dimensions that are determined by properties of the data deemed important by the data analyst.

For example, when data is organized in a tabular format, its transformation into a set of points is straightforward.
In a table of data, the individual data samples are assigned to rows, and the different numerical measurements per sample are assigned to corresponding columns in each row. A single row of such a table then represents a point of the dataset, while an entire column represents a dimension. Fisher's famous iris dataset [33] illustrates this concept nicely: the different irises being analyzed can be conceived of as points in a "measurement" space.

Another spatial data format is the distance matrix, where rows similarly represent points. In contrast to tabular data, though, columns no longer represent dimensions, but distances to other points. The distance matrix therefore purely catalogues the proximities of points relative to each other, without invoking any dimensional measurements. If the number of data points is large, then these matrices can be very inefficient to store and compute.

The spatial metaphor aids in the visual analysis of the data. There is sometimes a direct correspondence between the spatial arrangement of points and meaningful facts about the data. For example, the data might belong to two classes which correspond to two separate densities of points in space. Furthermore, the human visual system can quickly detect complex spatial properties of the data like linear relationships and clustering. Thus, by positioning objects in a space that we can visually scrutinize, one can rapidly observe spatial patterns that may reveal underlying truths about the data.

When the number of dimensions is small, as is the case for the aforementioned iris dataset, visual scrutiny of the data is straightforward. Even if there are more than two dimensions, but fewer than a dozen, techniques like scatterplot matrices and parallel coordinates plots allow for effective visualization of spatial patterns [7, 60]. But when the number of dimensions exceeds a dozen or so, the number of scatterplots in a matrix or the complexity of the parallel coordinates plot makes visual analysis of all the linked views cumbersome and confusing.

Dimensionality Reduction, or DR, is a suite of data processing techniques designed to reduce the number of dimensions of a dataset while best preserving the spatial properties useful for understanding the data. Though DR has broad applicability to a variety of domains and problems, our focus in this thesis is on the use of DR for visual analysis tasks like data exploration. By reducing the dimensionality of the data to two, many techniques appropriate for low-dimensional data, like scatterplots, can be applied to high-dimensional cases.

1.1 Dimensionality Reduction

A major proportion of this thesis covers systems and algorithms for DR. As its name suggests, DR expresses a dataset in a smaller number of dimensions than originally found in the data. The input for such systems takes the form of a table representing dataset coordinates or a distance matrix.

Input tables of coordinates are represented as a matrix X with n rows and m columns. The matrix entry x_ij stores the ith point's jth coordinate value. Each row represents an instance of a sampled data point with the m measurements stored in the corresponding columns. The magnitudes of the values in the different columns may be distributed at widely varying scales and need to be normalized to a meaningful scale if they are to be combined together.
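To make these two input forms concrete, here is a minimal NumPy sketch (with made-up values, not data from the thesis) that builds a small coordinate table X, standardizes its differently scaled columns, and derives the corresponding Euclidean distance matrix:

```python
import numpy as np

# Hypothetical coordinate table X: n = 4 sampled points, m = 3 measurements,
# with the third column on a much larger scale than the first two.
X = np.array([[5.1, 3.5, 140.0],
              [4.9, 3.0, 138.0],
              [6.3, 3.3, 600.0],
              [5.8, 2.7, 510.0]])

# Normalize each column to zero mean and unit variance so that measurements
# recorded at widely varying scales can be meaningfully combined.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distance matrix D, where D[i, j] = f(i, j).
diff = X_norm[:, None, :] - X_norm[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(D.shape)   # (4, 4): one row of distances per point
```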
Distance matrices D hold all the pairs of distances between points, where the matrix entry d_ij stores the distance between points i and j. Distance matrices are symmetric, with a zero diagonal representing a zero self-distance, and therefore only a lower triangle of the matrix need be stored. The distances stored in the matrix are the output of a non-negative distance function f(i, j), also known as a distance metric, or similarity or kernel function [79], and are interpreted as a measure of similarity between the two data points. While the most commonly encountered distance function is Euclidean, there exist many other possibilities that compute a more appropriate measure of similarity for a given application [42].

Output from dimensionality reduction is always in the form of a matrix of coordinates Y with a user-controlled number of dimensions. For some methods, like principal component analysis [64], metadata is available about how the coordinates are constructed from the high-dimensional input. In all but a few special cases, Y is invariant to both rotation and reflection. This important fact must be kept in mind to avoid ascribing meaning to the global locations of clusters and manifolds in Y; only relative positioning is meaningful.

1.2 Methods and Uses for Dimensionality Reduction

At a high level, the methods of reducing dimensions can be grouped into three simple classes: culling, collecting, and synthesizing. Culling and collecting techniques assume the input is a table of coordinates, while synthesizing algorithms generalize to both coordinate and distance data. Here we describe each of these classes and some of their important practical uses.

Culling dimensions, also called feature selection [46], means removing a subset of dimensions outright from the dataset. Perhaps the most important reason for removing a dimension is that it contributes no useful information to the structure of the dataset other than noise. A trivial example is a hypothetical dataset of two dimensions, one with a bimodal distribution of values, and one with a uniform distribution of values. The modes of the first dimension's distribution permit classification into one of two groups, while the distribution of the second dimension's values only works to obscure this grouping. By culling the second dimension, the useful structure of the first dimension is clearer, both for visualization and for automated classification. A variety of techniques exist for manually detecting and culling unimportant dimensions from a dataset [62], as well as a growing literature on sophisticated automatic-selection methods [77].

Like culling, collecting dimensions is another form of dimension filtering. Instead of removing dimensions, though, we combine them together linearly, as a weighted average. Collecting is appropriate when two dimensions are highly correlated, or express largely identical relationships to the other dimensions in the data. For example, in a database of different car models, the overall weight of the car is highly correlated with the fuel economy of the car. One simple reason for collecting together highly correlated dimensions is to reduce the number of redundant visual comparisons to other dimensions in scatterplots. Collecting dimensions can be performed manually, though it is customary to use automatic techniques like Principal Component Analysis that group together correlated dimensions [64].
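The culling and collecting operations just described can be sketched in a few lines of NumPy. This is an illustration under simple assumptions (a variance threshold for culling, an SVD-based principal component computation for collecting), not the specific procedures used later in the thesis:

```python
import numpy as np

def cull_low_variance(X, threshold=1e-6):
    """Culling: drop dimensions whose variance falls below a noise threshold."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

def collect_with_pca(X, n_components):
    """Collecting: replace groups of correlated dimensions with weighted
    linear combinations, here the leading principal components from the SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T   # scores on the leading components

# Usage on a hypothetical 100 x 8 table with one information-free dimension.
X = np.random.rand(100, 8)
X[:, 3] = 0.5                          # a constant, uninformative column
X_culled, kept = cull_low_variance(X)  # 100 x 7
X_collected = collect_with_pca(X_culled, n_components=4)   # 100 x 4
```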
The third and final group of dimensionality reduction methods synthesizes new dimensions from the input dimensions. Synthesizing methods create new datasets whose dimensions are complex, often nonlinear combinations of the original input dimensions. Synthesizing techniques feature prominently in two broadly defined use cases of visual high-dimensional analysis [98].

The first use case of synthesizing new dimensions from old is that of uncovering "hidden dimensions," also called latent or hidden variables. In this case, one hypothesizes that there is a small set of dimensions that accounts for the observed behavior of a larger set of dimensions. A classic example of such a use case is inferring orientation axes, such as left/right and up/down, from a database of images of an object. Here the measured dimensions are light intensities at pixel locations in a P×P image, and the synthesized "hidden dimensions" are the axes of orientation [21]. As shown in Figure 1.1, image orientation axes are confined to run along smooth, nonlinear manifolds contained within the larger space of all possible P×P images. Manifolds such as these cannot be captured by simple culling or collecting of input dimensions. The meaning and orientation of the hidden dimensions are not given by these dimensionality reduction algorithms. Only by carefully analyzing the changes of the data across the synthesized space can the orientation and meaning of these dimensions be described [76].

Figure 1.1: Uncovering "hidden dimensions" in a database of images. Here, the underlying data points have dimensions describing pixel intensities. By using a dimensionality reduction technique [21], orientation dimensions can be constructed from the proximity relationships in the data. © 2009 by Taylor & Francis. Reprinted by permission.

The second use case of dimensionality reduction by synthesizing dimensions is that of visualizing high-dimensional clusters. A data analyst may not be interested in the way the input dimensions express the data and may instead merely want a way to visually verify whether there is cluster separation in their data. By using dimensionality reduction algorithms that preserve distance relationships, an analyst can adequately visualize separate clusters whose numerous separating boundaries may span a very large set of input dimensions, for example by using multidimensional scaling on a document database [57]. The dimensions spanned by these separating boundaries will be ignored by cull and collect algorithms because they will contain useful variation and often be uncorrelated. Worse, when the dimensions number in the hundreds or thousands, the separating boundaries become impossible to visualize across a set of scatterplot matrices. Figure 1.2 shows an example of dimensionality reduction producing a visualization of distinct clusters of a high-dimensional dataset.

Figure 1.2: Visualizing high-dimensional clusters in a text database described in Chapter 4. The underlying data points are term-vectors whose dimensions describe the presence or absence of a term in a text. By reducing to two dimensions, clusters of similar documents become readily apparent. Cluster colors correspond to cluster assignment by non-negative matrix factorization [101].

1.3 Thesis Contributions

This doctoral thesis is greatly informed by research done in a previous Master's thesis [56]. In that work, I presented a parallel dimensionality reduction algorithm, Glimmer, designed to optimize both speed and generality across datasets. In spite of its intended generality, Glimmer does not address many important dimensionality reduction problems that do not fit neatly into its assumptions.
The components of this thesis were inspired by deep complications that emerged when Glimmer was applied to real-world cases of high-dimensional data analysis. These complications arise at both the system and algorithm level, and our proposed solutions make up the content of our thesis contributions.

In the remainder of this section, we present our thesis contributions as a set of research question-answer pairs. In each answer we describe the complication, explain how our thesis contribution helps answer the research question, and give pointers in the text for more details.

How do we design systems for high-dimensional analysis aimed at DR novices? Supplying users with a powerful set of dimensionality reduction algorithms to answer their research questions only creates new questions like "which algorithm for what task?", "which parameter settings?", and "what am I looking at?". We contribute the design and implementation of a general analysis system called DimStiller, specifically targeted at non-expert, middle-ground users. DimStiller is built around the notion of guidance, where sensible parameter selections are built into the techniques themselves and commonly used compositions of techniques are stored as user-defined workflows. Chapter 3 presents the DimStiller system.

How do we design efficient DR and clustering algorithms for the important special case of large document datasets, which have very high dimensionality? Due to their complex structure, translating text into spatial points results in abstract spaces of very high dimensionality. Existing dimensionality reduction techniques encounter many algorithm-level and qualitative problems when processing such data. We contribute a set of design implications for such datasets that takes careful advantage of the spatial structure of such spaces. The design implications create a bridge between the field of Information Retrieval and high-dimensional analysis. We furthermore contribute a set of algorithms implementing these implications and illustrate and measure their improvement over competing approaches. Chapter 4 discusses the algorithm design implications induced by the geometry of large document datasets.

We also contribute a special-case, high-dimensional analysis system called the Overview prototype, built as part of a collaboration with computational journalists. Overview targets journalists interested in rapidly exploring large, unlabelled document dumps. These journalists are often unfamiliar with concepts such as clustering and DR. The prototype presents a set of linked visualizations of text collections that permit users to annotate different groupings of points and make sense of the larger text collection. Chapter 5 describes the Overview prototype system.

How do we practically handle costly distance functions? The design of dimensionality reduction algorithms often ignores the computational cost of computing distances, focusing instead solely on reducing the cost of computing new dimensions from those distances. But, in some important cases, the functions that produce the distances between points are themselves costly to compute. This special case is handled inefficiently by current dimensionality reduction algorithms. We contribute an algorithm framework, Glint, for performing dimensionality reduction efficiently in these cases. Glint is designed to be algorithm agnostic, and focuses on minimizing the number of distance calculations needed to achieve a stable layout. Chapter 6 presents the Glint framework.
Chapter 2
Related Work

In this section, we describe previous work related to the contributions of this thesis. Dimensionality reduction algorithms appear in every subsequent chapter, so we begin by providing a survey of dimensionality reduction algorithm research, with particular focus on multidimensional scaling algorithms. Because of our focus on analyzing text data in Chapters 4 and 5, we then describe relevant algorithms and techniques from information retrieval, nearest-neighbor search, and hierarchical clustering, with priority given to algorithms designed for processing text data. Finally, as Chapter 3 describes a software system for facilitating the analysis of multidimensional data, we survey relevant software systems targeting this use case.

2.1 Dimensionality Reduction Algorithms

We group our survey of dimensionality reduction algorithms into four different families based on their qualitative objectives: orthogonal projections, global distances, manifold distances, and probability distributions.

2.1.1 Orthogonal Projections

The family of orthogonal projections includes methods such as principal component analysis (PCA) [64], which computes the linear projection of coordinate data onto an orthonormal basis that best preserves the variance of the data. The transformation has numerous useful properties, both by being the linear projection of the data into the low-dimensional space with least-squared error, and also by providing a method for back-projection to the original embedding space of the data.

PCA provides useful metadata for analyzing the orthogonal projection. First, the algorithm outputs a weighted ranking of the orthogonal basis vectors, permitting determination of the most important directions of data dispersion. Second, the algorithm also supplies vectors, also called loadings, that contain weights describing the contribution of each input dimension to each output dimension.

Variants of PCA algorithms exist to address speed and robustness concerns. For example, PCA can be computed very efficiently using approximation algorithms for the singular value decomposition of a matrix [47]. Robust PCA [54] is designed to handle outliers in the data that would distort the proper alignment of the orthogonal axes.

PCA is implemented as a software component in Chapter 3 and is applied to processing text for visualization in Chapter 4.

2.1.2 Global Distances

Global distance methods, such as Multidimensional Scaling (MDS) [14], compute a low-dimensional layout of points with inter-point distances that best match the input high-dimensional distance matrix.

MDS refers to an entire family of algorithms with different objective functions, computational complexities, and qualitative results [14, 36]. The common thread is that they all minimize objectives that are some function of the difference between the Euclidean distances of the lower-dimensional layout coordinates and the magnitude of the original high-dimensional dissimilarities. The most recurrent function in the literature is the Stress function, a sum of squared residuals between high- and low-dimensional distances, which can be written compactly as

\[ \mathrm{Stress}(D, \Delta)^2 = \frac{\|D - \Delta\|_F^2}{\|D\|_F^2} \]

where D is the input distance matrix, Δ is the distance matrix computed from the low-dimensional coordinates, and ||X||_F^2 represents the square of the Frobenius norm of the matrix X, or the sum over i, j of x_ij^2. Clearly, Stress goes to 0 when the distances of the low-dimensional coordinates match the input. Stress may be a multimodal function with many stationary points, making global optimization problematic [44].
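Read as code, the Stress function above amounts to the following NumPy sketch (our illustration, not an implementation from the thesis), which evaluates the squared, normalized Stress of a candidate layout Y against an input distance matrix D:

```python
import numpy as np

def layout_distances(Y):
    """Euclidean distance matrix Delta of the low-dimensional layout Y (n x l)."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress_squared(D, Y):
    """Squared Frobenius norm of the residual distances, normalized by the
    squared Frobenius norm of the input distances; 0 for a perfect layout."""
    Delta = layout_distances(Y)
    return ((D - Delta) ** 2).sum() / (D ** 2).sum()
```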
Here, we discuss four major classes of MDS algorithms in terms of their shortcomings in handling costly distances. We do not describe the family of non-metric MDS methods based on distance rankings [69], as these techniques do not factor into our research.

MDS is used as a software component in both Chapters 3 and 5. In Chapter 4, we describe some shortcomings of using MDS as a dimensionality reduction technique for visualizing text datasets. The distinction between the different classes of MDS algorithms described below is important for understanding the Glint MDS algorithm framework in Chapter 6.

Analytic Algorithms

The original MDS algorithm, now called Classic MDS [113] or Principal Coordinates Analysis [41], computes a one-step global minimum of an objective function called Strain, which is expressed as

\[ \mathrm{Strain}(X) = \|XX^T - B\|^2 \]

where X is the n×l matrix of low-dimensional coordinates and B is the so-called double-centred distance matrix. Double-centring subtracts the row mean from each matrix row, the column mean from each matrix column, and the mean of all the entries in the matrix from each matrix entry. Strain minimizes the discrepancy between low- and high-dimensional inner products, not distances as in the Stress function. A major benefit of using Strain over Stress is that it is a convex function whose minimum can be computed without iterative techniques. The algorithm relies on computing the full SVD of a dense N×N matrix. The SVD of a dense, square matrix requires O(N^3) steps [40] and is therefore too computationally complex to be suitable for large datasets or problems with costly distance functions.

Several scalable Classic MDS approximation algorithms based on the Nyström approximation of the SVD have been presented [85]. For example, both Pivot MDS [15] and Landmark MDS [30] work by having the user select a number of "pivot" or "landmark" points. These particular columns in the distance matrix are then computed and processed by the algorithm to map the remaining points into low-dimensional space.

The main drawback to this strategy is the manual nature of selecting the proper number of landmark points. The Pivot MDS authors suggest a human-in-the-loop strategy where the user iteratively adds landmarks until visually determining the stability of the layout. The Landmark MDS authors propose an iterative strategy based on cross-validation, but do not present any benchmarks for this termination criterion. Chapter 6 shows a method to automate the selection of the number of landmarks and removes the human-in-the-loop from these techniques.
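For concreteness, the Classic MDS computation described at the start of this subsection can be sketched as follows: double-centre the squared distances, eigendecompose the resulting matrix B, and scale the leading eigenvectors into coordinates. This is the textbook procedure shown with a dense eigendecomposition rather than the SVD approximations discussed above; it is not code from the thesis.

```python
import numpy as np

def classic_mds(D, l=2):
    """Classic MDS / Principal Coordinates Analysis on a full N x N distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring operator
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    evals, evecs = np.linalg.eigh(B)           # B is symmetric
    order = np.argsort(evals)[::-1][:l]        # keep the l largest eigenvalues
    scale = np.sqrt(np.maximum(evals[order], 0.0))
    return evecs[:, order] * scale             # n x l coordinate matrix X
```

The dense factorization in the middle is the O(N^3) step that makes this direct approach unsuitable for large N or costly distance functions.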
Force-Directed Algorithms

The Glimmer [56, 57] algorithm and its antecedent by Chalmers [19] are MDS approximation algorithms that minimize Stress by iteratively sampling high-dimensional distances and proportionally nudging the layout points in the direction of the residual distances. The movement of the points is controlled using a dampened force-directed simulation heuristic. The benefits of the force-directed approach include a simple implementation and a rapid convergence to a minimum region of the Stress function in fewer iterations. Force-directed algorithms can also be very scalable; the Glimmer algorithm achieves considerable speed improvement on large datasets by GPU parallelization.

Force-directed algorithms also have drawbacks. Their randomness may induce a visible level of noise in the final layout. Additionally, force-directed methods can converge to a local minimum of the Stress function that may be vastly inferior to the global minimum, though hierarchical force-directed techniques like Glimmer reduce this occurrence.

Force-directed algorithms may also compute more distances than are strictly necessary. The algorithms are designed to compute high-dimensional distances prior to each force simulation time step, regardless of whether enough distance information has already been sampled to achieve a quality layout. This oversampling becomes especially inefficient when distances are costly.

Gradient Algorithms

Other MDS techniques use exact gradient information to calculate layout coordinates. Some of these algorithms use backtracking gradient descent on the Stress function [17], while the SMACOF algorithm [29] minimizes a sequence of quadratic functions that majorize the Stress function. These techniques are costly but the most flexible, permitting weights and missing values, while also converging to a lower-error minimum than randomized techniques.

As shown in their application to graph drawing [37, 66], gradient techniques can harness a sparsely populated distance matrix as input, with good results in less time than using the full distance matrix. However, as with analytic approximation techniques such as Pivot MDS, the precise number of distances to compute in advance to converge to a quality minimum is left up to the practitioner to deduce.

Coordinate-Only Algorithms

When MDS input takes the form of a table of coordinates, the number of input dimensions m is often much smaller than the number of points N. The PLMP [83] and LAMP [63] algorithms build on this assumption to rapidly compute low-Stress layouts for very large datasets by computing mappings for each point derived from a subset of "control" points. The profound acceleration that these algorithms achieve relative to other approaches is hindered when the number of dimensions equals or exceeds the number of points N, forcing the complexity of computing the individual mapping to approach O(N^2). Thus, methods that rely on the relation m ≪ N for speed are less suitable than other approaches when the relation does not hold.

2.1.3 Manifold Methods

Manifold distance dimensionality reduction methods preserve distances across local surfaces formed by the data in high-dimensional space, building a model of manifold connectivity. Recent work on manifold methods began with the Isomap algorithm [110], Local Linear Embedding [92], and Laplacian Eigenmaps [8], which have been followed by many other variants [21, 70, 104, 125].

Manifold methods are computationally intensive and often subject to numerous tuning parameters controlling how the local manifold is inferred from the data. Furthermore, several methods make assumptions about the sampling density and number of manifolds generating the data points under analysis. In the ideal case, most of the methods are targeted at data with smooth, uniformly sampled, nonlinear structures [98]. Such structures can arise from sampling the state space of dynamical systems, as in human motion capture [122]. The target datasets that appear in our research are not generated by uniformly sampled, nonlinear processes and are therefore unlikely to lie upon a smooth, nonlinear manifold. As a result, algorithms from the class of manifold methods do not make an appearance in our thesis.
2.1.4 Probability-based Methods

Probability-based dimensionality reduction methods such as SNE [50], t-SNE [117], NE [134], and BH-SNE [116] try to minimize the discrepancy between high-dimensional and low-dimensional probabilities derived from distances. This discrepancy is measured as the Kullback-Leibler divergence

\[ \sum_{i}^{N} \sum_{j}^{N} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \]

between the two distributions P and Q, representing the distributions for the high-dimensional and low-dimensional cases respectively. Probability methods work by building conditional probability distributions for each point over the other data points based on high-dimensional distances, where probability is interpreted as the likelihood that point i will be chosen as a given point j's nearest neighbor. This approach intuitively assigns points with close distances a high probability value, and points with far distances a low probability value.

The t-SNE probability-based method produces some of the most visually salient cluster visualizations of high-dimensional data. It accomplishes this salience by mapping the high-dimensional Gaussian probability distribution to a heavier-tailed Student-t probability distribution in the low-dimensional space. The mismatch in tails is specifically designed to address the crowding problem [117], where mappings of points within a sphere of radius r in high-dimensional space quickly exhaust the exponentially smaller volume contained within a corresponding sphere with identical radius in low-dimensional space. Using the N-body calculation speedup of SNE originally presented by de Freitas et al. [28], both NE and BH-SNE reduce the O(N^2) iteration complexity of t-SNE to O(N log N) without a quality penalty in certain cases [116, 134].

In Chapter 4, we present a method for improving the speed and efficiency of the NE and BH-SNE algorithms.

2.2 Information Retrieval and the Spatial Analysis of Text

Chapter 4 discusses Information Retrieval and Spatial Analysis methods for processing text data with an emphasis on its high-dimensional structure. Below, we briefly summarize research related to this discussion, including summaries of Information Retrieval, Nearest-Neighbor Search, and Hierarchical Clustering.

2.3 Information Retrieval

The field of information retrieval studies efficient and accurate methods for querying databases, including text databases [75]. Of the several useful models of text data, our work focuses on the vector space model, where documents are mapped to vectors in term space [96]. In particular, the term vectors we study have dimension values assigned by techniques like TF-IDF [95], which assigns each term dimension a value proportional to the frequency of the term in the document and then discounts the value by its frequency of appearance in the database.

Because the number of terms in even a modestly sized database of a few thousand documents often numbers in the tens of thousands, the dimensionality of term vectors is often equal to or greater than the number of documents themselves. In spaces of such extreme dimensionality, effects from the so-called curse of dimensionality [9] become readily apparent. Such effects are known to wreak havoc with spatial search algorithms [55].

Instead, very efficient query algorithms for TF-IDF term vectors eschew spatial metaphors and use a data structure called the inverted file [137], thus named because it represents a sparse transpose of the term-vector matrix. In Chapter 4, we present more details about TF-IDF vectors and discuss the spatial geometry of inverted file algorithms.
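The following toy sketch shows one common TF-IDF weighting and the corresponding inverted file for a three-document corpus. It is a simplified illustration (there are many TF-IDF variants, and the corpus is invented), but it makes the key point concrete: a query only touches the postings lists of its own terms, so documents that share no terms with the query are never visited.

```python
import math
from collections import Counter, defaultdict

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]                    # invented toy corpus

tf = [Counter(d.split()) for d in docs]     # term frequencies per document
df = Counter(t for counts in tf for t in counts)   # document frequencies
N = len(docs)

# One common TF-IDF variant: term frequency, discounted by how many
# documents in the database the term appears in.
tfidf = [{t: c * math.log(N / df[t]) for t, c in counts.items()}
         for counts in tf]

# Inverted file: for each term, a postings list of (document id, weight) pairs.
# This is the sparse transpose of the document-term matrix.
inverted = defaultdict(list)
for doc_id, vec in enumerate(tfidf):
    for term, w in vec.items():
        if w > 0:
            inverted[term].append((doc_id, w))

def query(terms):
    """Accumulate weights from the postings lists of the query terms only."""
    acc = defaultdict(float)
    for t in terms:
        for doc_id, w in inverted.get(t, []):
            acc[doc_id] += w
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(query(["cat", "mat"]))   # only document 0 is ever scored
```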
2.4 Nearest-Neighbor Search

Nearest-neighbor search algorithms find the nearest set of k distinct points to a query point, where k is a parameter to the algorithm. We first discuss algorithms for general nearest-neighbor search on any dataset, and then for document-based nearest-neighbor search using inverted file indices.

2.4.1 General Nearest-Neighbor Search

General nearest-neighbor search algorithms fall into two classes: spatial-partitioning strategies and mapping-based techniques.

Spatial partitioning strategies are those techniques that construct hierarchical structures which partition data space. Examples include balanced box-decomposition trees [4] and vantage point trees [135]. After constructing hierarchical spatial decompositions of the data space with such methods, a spatial search for nearby points becomes similar to a generic search tree traversal.

Mapping-based techniques, by contrast, build functions that map similar data points close together with high probability. Using this technique, the search is restricted to the subspace, or set of bins, to which similar points are mapped. The seminal example of mapping techniques is locality sensitive hashing (LSH) [39], which uses random projections to divide the input space into a set of bins.

2.4.2 Inverted-File-Based Nearest-Neighbor Search

The exact nearest-neighbor problem has been tackled in the Information Retrieval and Database literature by using inverted files as a search data structure. The all-pairs k-nearest-neighbor problem is referred to as a top-k similarity join in the Database literature [12]. Initial work in Information Retrieval reduced the search space of nearest points by partially ordering the inverted file, and then iteratively computing upper bounds on the remaining documents to be processed [80, 105]. Recent work focused on scaling up similarity calculations to massive datasets also uses a bound calculation to compute the exact nearest neighbors, while building the inverted file on the fly [6, 129]. In contrast to these exact techniques with full traversal of possibly disk-resident, partial inverted files, our work in Chapter 4 performs a partial, impact-ordered traversal of a memory-resident inverted file with user-based controls over accuracy.

2.5 Hierarchical Clustering

Cluster Analysis is a rich field with numerous active subfields [32]. One such subfield that intersects this thesis is hierarchical clustering. A hierarchical clustering algorithm creates a binary tree whose leaves represent the input data points (and all points are connected to the tree). Agglomerative cluster trees are built from the bottom up: two nodes are joined when they are the most similar. When the similarity measure is the nearest distance between any pair of child nodes, the algorithm is called single-link clustering [106]. Single-link clustering has several nice properties, like computational efficiency [103], and existing algorithms can be augmented with measures to reduce quality problems like cluster chaining [20, 128]. Special single-link hierarchical clustering algorithms exist for document datasets [81, 119]. As in nearest-neighbor search, these methods utilize inverted file indices for fast processing of high-dimensional data. Chapter 4 details how an improved traversal of the inverted file can greatly improve the speed of computing a hierarchical clustering without a penalty to cluster quality.
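As a generic illustration of single-link agglomerative clustering (using SciPy's general-purpose routines rather than the inverted-file-based document algorithms cited above), the following sketch builds a cluster tree over a small synthetic dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data: two well-separated groups of points in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)),
               rng.normal(1.0, 0.1, (10, 5))])

# Condensed pairwise distance matrix (the flattened lower triangle).
d = pdist(X, metric="euclidean")

# Single-link clustering: repeatedly merge the two clusters whose closest
# members are nearest to each other, producing a binary merge tree.
tree = linkage(d, method="single")

# Cut the tree into two flat clusters to inspect the result.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```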
2.6 Software Systems

This thesis presents two software systems, DimStiller in Chapter 3 and the Overview prototype in Chapter 5, designed for different applications: high-dimensional analysis and spatial analysis of text corpora. In this section, we survey other software systems targeted at addressing these two tasks.

2.6.1 Systems for High-Dimensional Analysis

DimStiller is designed to augment high-dimensional analysis tasks with visual guidance of algorithm parameter choices and the construction of analysis pipelines. Many other software systems are designed to be tools for high-dimensional analysis in some shape or another. In this section, we describe those relevant software systems that provide users with access to the previously described dimensionality reduction and clustering algorithms and produce visualizations of the results.

Programming Environments

We first consider full-fledged programming environments such as MATLAB [112] or R [87]. One strength of these systems is execution speed and breadth of available graphics and analysis packages [126], but at the cost of requiring the user to access functionality through a special programming language. Beyond help files that describe the built-in functions, these systems provide no built-in mechanisms for guidance to non-expert users.

Toolkit Solutions

In specialized toolkits, several algorithms have been packaged together, usually with a GUI front end. For example, the Matlab Toolbox for Dimensionality Reduction (homepage.tudelft.nl/19j49/Matlab Toolbox for Dimensionality Reduction.html, last visited on 2/01/2010) gathers together over 30 techniques for dimensionality reduction under one umbrella. Another example is the HIVE dataflow toolkit for dimensionality reduction [90]. While such tools reduce programmer time by providing easy access to a wide variety of analysis techniques, neither local nor global guidance is provided to the user.

Visual Dimensionality Analysis Environments

XmdvTool [123] supports interactive visual exploration of multivariate data sets through many types of views including scatterplot matrices, with interactive controls that include sophisticated linking and brushing techniques. It includes several approaches to collecting and culling dimensions that are based on hierarchical clustering of the dimensions using a variety of metrics, such as DOSFA [130] and VHDR [131]. Its Value and Relation (VaR) technique [132] does use MDS to create a scatterplot of dimensions, encoding information about the dimensions in an information-dense pixel-oriented glyph at each scatterplot point. However, XmdvTool is not primarily designed to support workflows built around reduction through synthesizing new dimensions. In contrast, the GGobi system [24] is a visualization framework for high-dimensional analysis with dimensionality reduction techniques that create synthetic dimensions as a central focus, supporting interaction between multiple kinds of linked views including scatterplot matrices. It also features sophisticated high-dimensional navigation including projection pursuit and grand tours, and a plugin architecture for easy connection with R [87].
The limitation of both of these frameworks is that while they implicitly provide ways to access and explore many relevant paths through table space in useful and novel ways, they lack an explicit framework for local and global guidance. The architecture of these systems is sufficiently orthogonal to the notions of guidance described in this paper that supplying them with such a framework would require substantial ground-up development.

The rank-by-feature framework of Seo and Shneiderman [100] allows the user to visually inspect and explore dimensional relationships, but only with subsets of the original dimensions, so the huge part of table space that can only be reached via constructing synthetic dimensions cannot be explored. The data exploration environment of Guo [45] has a component-based architecture for finding clusters of the data with unique dimensional relationships.

Johansson and Johansson's [62] system embodies the concept of guiding the user through analysis stages. Users can craft quality metrics from combinations of correlation, clustering, or other measures and cull dimensions according to these measures. However, their system only supports one hardwired global workflow, where the only flexibility is in setting parameter values at each local stage. Another limitation is the inability to construct synthetic dimensions.

2.6.2 Systems for Spatial Analysis of Text Corpora

The Overview prototype, presented in Chapter 5, aids in the analysis of a text corpus with a pair of linked views presenting a clustering and dimensionality reduction of the underlying text corpus. The most relevant previous work is perhaps the Newdle system by Yang et al. [133]. It is also oriented around hierarchical clustering of topics and tag-based analysis. Yang et al. do not incorporate linked dimensionality reduction, and they show only a particular cut through the hierarchy at once, whereas our interface is based on a full hierarchy that allows the user to explore the entire multiscale structure. Chen et al. [23] also take a sampling approach to dimensionality reduction for large document collections to produce layouts that show clear clusters, but do not discuss any sort of interactive browsing or annotation. Oesterling et al. compute the contour tree of a density field, which has some conceptual similarities to the hierarchical clustering [82]. The Overview prototype uses a point-based visual encoding rather than their landscape-based approach, based on guidelines from previous empirical studies [114, 115].

Chapter 3
DimStiller: Workflows for Dimensional Analysis and Reduction

Many practical questions about a high-dimensional dataset require understanding how the dimensions and points relate to each other and to an underlying space: Are my dimensions meaningful? How do my dimensions relate to each other? Are my clusters real?

A combination of known statistical and visualization techniques can help answer the three questions above. For example, the question "how do my dimensions relate to each other?" may be answered using Principal Components Analysis and interpreting the magnitudes of the eigenvalues and eigenvectors of the correlation matrix. Each technique may produce different output with a corresponding specialized interpretation.
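As a small example of the correlation-matrix reading mentioned above, the following sketch (with synthetic data, not a dataset from the thesis) computes the eigenvalues of the correlation matrix; a large leading eigenvalue signals dimensions that move together, while near-zero eigenvalues flag redundant dimensions:

```python
import numpy as np

# Synthetic table: 100 samples of 5 dimensions, with dimension 4 nearly a
# copy of dimension 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.05 * rng.normal(size=100)

R = np.corrcoef(X, rowvar=False)       # 5 x 5 correlation matrix
evals, evecs = np.linalg.eigh(R)       # eigenvalues in ascending order
print(np.round(evals[::-1], 2))        # e.g. one eigenvalue near 2, one near 0
```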
The sheer proliferation of techniques makes navigating this analysis space a daunting prospect for many users, who do not fully understand how and when to use these techniques correctly. For instance, what parameter settings make sense for a particular dataset? When can the output of one technique be legitimately used as the input for another?

In contrast to the profusion of previous work proposing better or faster techniques for specific aspects of dimensional analysis [49, 57], far less attention has been paid to creating systems that guide their users through the larger process of analyzing high-dimensional data iteratively using a combination of techniques. For instance, many engaged in data analysis, who lack a deep knowledge of dimensional reduction, simply project their data to some 2D space and plot these points using a scatterplot. However, if the intrinsic dimensionality of the dataset is larger than two, clusters or orthogonal axes may be projected on top of each other, occluding relevant structures of interest. Although experts confronted with the problematic result of a single undifferentiated blob could conjecture that there is a mismatch between the intrinsic dimensionality of the dataset and the space they chose to project to, less sophisticated users are routinely misled. The reasons for their perplexity include the sheer number of proposed dimensionality reduction techniques [51], the complexity of the mathematics underlying them, the widespread availability of dimensionality reduction tools that create only a single 2D or 3D scatterplot [65], and the lack of clear characterizations in the literature of which techniques are suitable for what kinds of data and tasks. DimStiller was expressly designed to help users avoid this pitfall, with a dimension reduction workflow that includes an estimation of the intrinsic dimensionality of the dataset to guide users in making an informed choice about how many dimensions to reduce to, if at all, and having the default view of a table of more than two dimensions be a scatterplot matrix rather than a single scatterplot.

In this chapter we present the design and implementation of the DimStiller visualization framework. DimStiller gathers together a variety of techniques from dimensional analysis and reduction into a coherent framework that emphasizes the underlying dimensions and the relationships between them. For instance, it guides users through estimating the intrinsic dimensionality of a dataset before carrying out reduction. The analysis technique components of DimStiller are outfitted with interactive controls and linked views, allowing users to see and manipulate intermediate results at each analysis step. Section 3.2 describes the task of dimensionality reduction and analysis. We present the DimStiller architecture in Section 3.3, and two case studies showing how it can be used to analyze complex real-world datasets in Section 3.4.

The second contribution of this chapter to the thesis is the notion and implementation of both local and global guidance for navigating through the space of possible data tables during dimensional analysis and reduction, using the abstractions of operations, expressions, and workflows. Expressions instantiate a chaining together, or composition, of transformation operators on input data tables. Expressions show which operations have been applied to the data as well as the order in which they occur. Workflows are templates for exploration that consist of a specific expression along with the saved parameter values for each operator. Workflows bundle together the sequence of operators of an expression independent of the data on which they operate, permitting DimStiller users to re-use and share common patterns of analysis.
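To illustrate the expression and workflow abstractions (and only to illustrate them: this is our own sketch, not DimStiller's implementation, whose actual architecture is described in Section 3.3), the snippet below chains two operators whose names echo the example expression in Figure 3.1. A workflow stores the operator sequence and its saved parameters; applying it to an input table instantiates an expression.

```python
import numpy as np

class Operator:
    """One step of an expression: a named transformation of a data table."""
    def __init__(self, name, fn, **params):
        self.name, self.fn, self.params = name, fn, params

    def apply(self, table):
        return self.fn(table, **self.params)

def cull_variance(table, threshold=0.0):
    """Drop dimensions whose variance does not exceed the threshold."""
    return table[:, table.var(axis=0) > threshold]

def collect_pearson(table, threshold=0.9):
    """Greedily group dimensions whose absolute Pearson correlation exceeds
    the threshold and replace each group with its average."""
    corr = np.abs(np.corrcoef(table, rowvar=False))
    groups, assigned = [], set()
    for j in range(table.shape[1]):
        if j in assigned:
            continue
        group = [j] + [k for k in range(j + 1, table.shape[1])
                       if k not in assigned and corr[j, k] > threshold]
        assigned.update(group)
        groups.append(group)
    return np.column_stack([table[:, g].mean(axis=1) for g in groups])

# A workflow: the operator chain plus saved parameters, independent of any data.
workflow = [Operator("Cull:Variance", cull_variance, threshold=0.0),
            Operator("Collect:Pearson", collect_pearson, threshold=0.9)]

# Instantiating the workflow on an input table yields an expression.
table = np.random.rand(100, 8)          # stand-in for a loaded 100 x 8 table
for op in workflow:
    table = op.apply(table)
    print(op.name, "->", table.shape)
```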
Workflows are templates for exploration that consist of a specific expression along with the saved parameter values for each operator. Workflows bundle together the sequence of operators of an expression independent of the data on which they operate, permitting DimStiller users to re-use and share common patterns of analysis. While mechanisms such as expressions, operators, pipelines, and macros have been proposed previously in many contexts [48], the novelty in DimStiller is the way in which these are used to walk through a series of operations on data tables, providing guidance during the analysis process. In Section 3.1 we discuss the idea of guidance in further detail, and contrast our approach to previous work in Section 2.

3.1 Local and Global Guidance

Anyone engaged in the dimensional analysis and reduction process must choose from a vast number of possible transformations of the data table at every step, but only a relatively small subset of these transformations will yield meaningful information about the structure of the input dataset. We define providing guidance as structuring the exploration process to help users find this small, meaningful set from the huge space of possible transformations. We first describe this abstract analysis space, and then explain the two kinds of guidance, local and global, that DimStiller provides to support effective navigation in this space.

We model the dimensional analysis and reduction process as traversing table space, the space of all possible data tables. Conceptually, this space is like a graph where nodes are data tables, connected by edges representing a transformation. A path through the space begins at some node representing the input data table. Intermediate nodes are the tables that result from the transformations applied by the user, and the path terminates at the output data table. We thus must consider how to help system users locate, explore, and traverse relevant regions in table space.

At the global level, we guide system users who must find a path that traverses the space from some given start point to a useful end point. Workflows are a mechanism to represent entire paths in this space. They represent a chained pipeline of transformations that can be applied to an input dataset. The built-in workflows are a small set of paths intended to span the space between a few landmarks of potential interest. DimStiller is designed to help its users find new, useful paths through table space, and save them as new workflows for later use.

At the local level, users also need to explore neighborhoods in table space: tuning the parameters of an individual transformation operator corresponds to searching for the most informative data table in the region of table space that is reachable with a single transformation that can be carried out by the chosen operator. DimStiller supports this local exploration through a chain of linked operator controls and views, so that users have immediate visual feedback about the effects of parameter tuning.

3.2 Users and Tasks

We now describe in detail the intended target user population for DimStiller, the questions that DimStiller is designed to help these users answer, and common techniques currently used to answer such questions.
While these are not the only questions a data analyst might be interested in, we argue that they are a good place to start when considering a new dataset, especially one with unclear provenance that is not necessarily well curated.

3.2.1 Target User Population

DimStiller is aimed at bridging the gap between state of the art techniques in visually oriented dimensionality analysis, and the current practices of many users and potential users who do not already have deep knowledge of their data and the mathematics of reduction. Although dimensional analysis is sufficiently complex that we do not target casual users, we argue that this middle ground between utter novices and fully confident experts is a sizeable group that is underserved by the current set of available systems.

For example, a visualization researcher called on to help somebody analyze a dataset may be completely unfamiliar with the dataset characteristics and the tasks of the researchers at the beginning of the analysis process. Furthermore, the person might be a visualization generalist rather than a specialist in the mathematical foundations of high-dimensional techniques in particular. Another example is end users who have expertise in their own domain and the desire to do some dimensional analysis, but not deep knowledge of reduction mathematics. They might be developing algorithms to generate or process the data, and seek to evaluate the quality of their results or fine-tune parameter settings.

DimStiller is particularly aimed at providing major process improvements for data analysts who must deal with messy datasets that may have unclear provenance. By providing both local and global guidance through table space, we aim to support analyses that might otherwise seem too daunting and decrease the chances that non-expert users draw incorrect conclusions, supporting a qualitatively different analysis process than with previous tools. For those data analysts dealing with curated datasets where the meaning of each row and column is already fully understood, DimStiller may simply speed up a previously feasible, but slow, analysis process by automatically supplying a suite of visual results for the analyst to peruse.

3.2.2 Are My Dimensions Meaningful?

Sometimes an input dimension may actually contain little or no useful information at all. Because of this, it is important for a data analyst to be able to characterize the dataset in terms of which dimensions have useful information versus the "meaningless" ones. This understanding is not critical for downstream analysis algorithms in the same analysis session, since the mathematics of dimension reduction will handle creating the correct lower-dimensional projection. However, discovering that a given dimension is culled could reveal problems with the data source, with major upstream consequences in later iterations of the larger analysis loop: the data might be gathered differently, or the algorithms to generate it might be refined.

One such criterion is simply to check the underlying variance of the input dimensions and cull those beneath a small noise threshold. Another possibility is to use an information entropy cutoff.

Figure 3.1: Left: The anatomy of a simple DimStiller expression.
Input data is fed into a pipeline of operators that alter the dimensionality of the data. Each operator may or may not have controls or views. An operator's control is displayed when it is selected in the Expression Tree. An operator's view is shown in a separate window, allowing side by side comparison between multiple views. Changing an operator's parameter in the control may produce a change to its output that is propagated across the expression using events that may travel both upstream and downstream from the operator. Right: The DimStiller interface for this expression, showing the Collect:Pearson and View:SPLOM views. The Cull:Variance expression is selected in the Expression Tree, so its control is visible.

3.2.3 How Do My Dimensions Relate to Each Other?

Much of multivariate statistical analysis is concerned with how the individual dimensions relate to each other. Many popular metrics such as Pearson's correlation coefficient measure pairwise relationships between individual dimensions. More holistic methods such as Principal Components Analysis determine how all the dimensions may actually express a smaller number of dimensions. Other methods uncover more complex nonlinear relationships between the input data, such as multidimensional scaling or many of the manifold-following variants.

3.2.4 Are My Clusters Real?

While the previous questions are related to dimensions, the question of cluster membership relates to the points. Clustering is the assignment of a unique label to specific regions of the input data's feature space and the points that occupy them. Cluster labels can be computed from any of a myriad of clustering algorithms.

Clustering relates to the input dimensionality in a reciprocal way. If the data analyst trusts the dimensional basis in which the data is represented, then point clusters in such a space will be considered real clusters with higher confidence than without such a trust. Likewise, if the data analyst is given a clustering that is trusted to be real, and the space in which the data is projected maintains the clustering's coherence, then this result increases confidence in the current dimensions. Thus, a clustering can inform the quality of a dimensionality reduction, and vice versa.

3.3 DimStiller Architecture

It is clear that an analysis tool that provides users with the ability to load their data into the system, transform their data with different analysis techniques, and scrutinize their data before and after these transformations would help users answer these questions. What is not clear is how such a tool should organize the results of applying the transformations or how to link these transformations together with visualizations to keep users focused on the analysis.

DimStiller organizes dimensionality analysis and reduction as a pipeline of transformations to a data table and linked views of it at different pipeline stages, as shown in Figure 3.1. The DimStiller model is based on an abstraction called an expression which encapsulates a sequence of transformations, called operators, that act upon tables of data where rows are points and columns are dimensions. Operators transform tables by adding, removing, or changing points or dimensions. Operators may have control panels and associated views that provide a visual representation of a table at that pipeline stage. All views are linked, and selections are propagated up and down the pipeline appropriately.
A key aspect of the DimStiller architecture is the ability to instantiate expressions from pre-existing workflows that capture useful analysis patterns.

The DimStiller expression and operator abstractions were partially inspired by the Expression and Operator information visualization design patterns of Heer and Agrawala [48]. However, DimStiller expressions have a simple, linear topology that defers processing to the operators, in contrast to the general tree structure suggested by the Expression pattern. Likewise, DimStiller operators are composable processing units similar to the Operator pattern, but compute general transformations and not necessarily visual mappings.

3.3.1 Input Tables and Dimension Types

DimStiller supports an abstract interface to data via a table model. Conceptually, a table can represent anything from physical entries in a disk file, to a cross-network database, or even evaluations of a simulator. The current implementation only supports simple file-based tables. There are two kinds of dimensions: Data and Attribute. Data dimensions are either Quantitative or Categorical. Internally, both are represented by floating point values, and DimStiller maintains a lookup table to map floating point values to category symbols for Categorical dimensions. Attribute dimensions represent values such as color or selection that are used in views such as scatterplot matrices, but are ignored by purely data-oriented operators such as variance culling or dimension reduction.

3.3.2 Operators

Operators are functions that map an n × m table to an n′ × m′ table. That is, operators may add or delete points or dimensions, or change the value of any existing cell in the table. In the parlance of Section 3.1, they are the edges that connect two nodes in table space. For example, the Cull:Variance operator removes dimensions with low variance from an n × m table. If any of the m table dimensions has variance below the user-controllable threshold, then the application of this operator would result in a new n × m′ table where m′ < m.

Every operator may have an associated control and/or an associated view, although neither is mandatory. Operator controls are GUI elements that permit users to modify operator parameters. Operator controls afford the user command over local search in table space. For example, Figure 3.2 shows the control panel for the above Cull:Variance operator, which has an interactive plot of the variance for each input dimension. The threshold parameter for culling is adjusted by clicking directly on the plot, and this operator does not have a separate view. Operator views are visualizations of the data table at that stage of the pipeline. Some operators are purely view-oriented, and do not transform the data table at all. For example, the view for the View:SPLOM is a scatterplot matrix, and its control panel only affects this visual display. In contrast, the Collect:Pearson operator has both a control panel with a slider to change the threshold, and a separate view with a matrix of colored boxes to show the pairwise correlations encoded with a blue-yellow-red diverging colormap.
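To make the table-to-table contract concrete, the following minimal sketch shows one way such an operator could be expressed in code. The class and method names are hypothetical illustrations for this text rather than DimStiller's actual Java API, and the variance cull shown is only the simplest member of the operator families listed below.

```python
import numpy as np

class Operator:
    """Hypothetical operator contract: a function from an n-by-m table
    to an n'-by-m' table, with user-adjustable parameters."""

    def apply(self, table):
        raise NotImplementedError

class VarianceCull(Operator):
    """Removes dimensions whose variance falls below a threshold,
    mirroring the behaviour described for Cull:Variance."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold     # adjusted interactively via the control

    def apply(self, table):
        keep = table.var(axis=0) > self.threshold
        return table[:, keep]          # same points, fewer dimensions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    table = rng.random((100, 8))
    table[:, 4] = 0.0                  # one zero-variance dimension
    print(VarianceCull(threshold=1e-3).apply(table).shape)   # (100, 7)
```

Chaining several such operators, each consuming the table produced by the previous one, yields the expression structure described in Section 3.3.3.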
The Operator namespace has a two-level structure, where a family has specific instances. The set of Operator families and instances, with notation Family:Instance, is:

- Attrib:Color: Adds attribute color dim for views
- Collect:Pearson: Join highly correlated dims
- Cull:Variance: Identify/remove low-variance dims
- Cull:Name: Identify/remove dim by name string match
- Data:Norm: Normalize input dims
- Input:File: Load comma separated value (CSV) file
- Reduce:PCA: Estimate/reduce dimensionality with PCA
- Reduce:MDS: Estimate/reduce dimensionality with MDS
- View:SPLOM: Plot n-dim table using an n × n scatterplot matrix
- View:Histo: Plot dim distribution with histogram

The current families and operators serve to illustrate the potential of our approach to system architecture; the set of families is not exhaustive, nor is the set of operators within any family. The built-in set of operator families and instances can be extended by implementing new operators.

Cull and Collect Operator Families

Operators in the Cull family compute a specific criterion for each dimension and remove those dimensions that do not satisfy it. The criterion for the Cull:Variance operator is variance, and dimensions that fall beneath a user-specified threshold are culled. It can help users locate and eliminate dimensions whose variability is zero, or is small but non-zero because of noise. The Cull:Name operator allows users to selectively remove dimensions manually, for example to analyze only a subset of the input dimensions.

While the Cull family acts on individual dimensions, operators in the Collection family use pairwise criteria like covariance and Pearson's correlation coefficient. Rather than removing dimensions whose pairwise measures do not satisfy the threshold, these operators replace them with a single representative dimension for the collection, for example the average.

These operators can help users who may need to refine the processes used to generate their input dataset, as we discuss in Section 3.2.2. They are also useful for those whose analysis needs preclude the creation of synthetic dimensions.

Reduce Operator Family

A critical design choice in the DimStiller architecture is that the Reduce operator family includes estimation of the intrinsic dimensionality of the space in addition to actually performing the reduction. The control for the operator, shown in the lower left of Figure 3.4, has a scree plot: a bar chart with the number of dimensions on the horizontal axis, and an estimate of the variability that would not be accounted for if the dataset were reduced to a space of that size on the vertical axis. The user then can make an informed choice when selecting the target dimensionality, by clicking on the plot at the desired threshold. (The definition of "intrinsic dimensionality" that we espouse is the smallest dimensionality of the set of spaces in which the data can be embedded with distortion less than some noise tolerance, rather than zero distortion.) DimStiller supports users in experimentally determining the correct noise threshold, which differs between datasets, by interactive threshold adjustment.

Although scree plots are far from new, most previous toolkits do not explicitly couple them to the use of a reduction algorithm: users are simply expected to provide a number as input, with no guidance. Users who are not experts or are dealing with unfamiliar datasets will often have no idea of what a reasonable number might be. Worse yet, a significant number of reduction technique implementations are hardwired to blindly reduce to two (or three) dimensions, with no hint to the user that this choice might be inappropriate or misleading. Even the relatively sophisticated user who knows to run an estimator is often provided with the black-box output of a single number, rather than the detailed information for each possible number of cumulative dimensions shown in a scree plot [49].
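To illustrate the kind of detailed information a scree plot conveys, the sketch below computes a PCA-style scree from the eigenvalues of a table's covariance matrix and reports, for each candidate target dimensionality, the variance that would be left unaccounted for; the noise-tolerance selection stands in for the user's interactive click on the plot. This is a minimal sketch of the general idea under these assumptions, not the Reduce:PCA implementation, and all function names are illustrative.

```python
import numpy as np

def pca_scree(table):
    """Eigenvalues of the covariance matrix, sorted in decreasing order."""
    centered = table - table.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    return np.clip(eigvals, 0.0, None)

def residual_variance(eigvals):
    """For each candidate dimensionality k, the fraction of total variance
    discarded by keeping only the top-k components."""
    return 1.0 - np.cumsum(eigvals) / eigvals.sum()

def pick_dimensionality(eigvals, noise_tolerance=0.01):
    """Smallest k whose residual variance falls below the noise tolerance,
    standing in for the user's click on the scree plot."""
    return int(np.argmax(residual_variance(eigvals) <= noise_tolerance)) + 1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    latent = rng.random((500, 3))                  # 3 intrinsic dimensions
    table = latent @ rng.random((3, 10)) + 0.01 * rng.standard_normal((500, 10))
    print(pick_dimensionality(pca_scree(table)))   # typically 3
```

A nonlinear estimate could back the same kind of control by substituting per-dimension stress values for the eigenvalues.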
Our design also has the benefit that users can see different estimates of intrinsic dimensionality in a lightweight and fast way with the scree plots, rather than the more heavyweight approach of reducing and then viewing the results in a scatterplot matrix. A related design choice is that the View:SPLOM view for showing a table is a scatterplot matrix rather than a single scatterplot. When the table has only two dimensions this view does of course show the case of only a single scatterplot, but when it has more the user is guided to see all of the information rather than an arbitrary subset.

While designing the architecture, we considered whether to have the estimation step separate from the reduction step. We ultimately decided that they should be coupled together into one module in service of the goal of providing guidance for the non-expert user. Understanding which estimators are appropriate for which reduction algorithms requires significant knowledge of dimensionality reduction: for example, nonlinear reduction methods should not be used in conjunction with linear estimators. Thus, we do not expect the middle-ground user to make that choice, reserving it for the designer of new operators.

The exact measure shown on the vertical axis of the scree plot depends on the operator instance. The Reduce:PCA operator shows the eigenvalues, and the Reduce:MDS shows the stress values of the embedding in each dimension. We use the CPU implementation of Glimmer [57] for both the MDS reduction and estimation.

Attribute and View Operator Families

Attribute operators add attribute dimensions to the output data table of the operator. Attribute dimensions are interpreted by view and attribute operators and ignored by other operator families. The Color attribute operator creates an attribute dimension used for coloring, to which it assigns values based on the values of numeric or categorical dimensions. The assignment of colors is performed either by linearly interpolating between two endpoint colors or by assigning colors to individual dimension values. The default colormap for categorical data, inspired by the work of Stone [108], has 10 bins; colors are repeated if there are more than 10 values.

View operators provide visualizations of their input data. The two built-in View operators are View:Histo for showing the distributions of individual dimensions, and View:SPLOM for showing pairwise relationships between dimensions. Both the SPLOM and the Histogram views provide global linked selection by creating an attribute dimension for selection, and displaying points with a nonzero selection attribute value in a default selection color. SPLOM views also use the color attribute dimension to color points.

3.3.3 Expressions

A DimStiller expression is the instantiation of an ordered list of operators applied to an input table. Figure 3.1 Left illustrates the elements of a sample expression, showing the associated views and controls for each operator. As the expression progresses, the data entries change value and the output table changes shape as it is progressively refined. Figure 3.1 Left also shows the relevant pathways for how information about input, view, and parameter changes moves across the expression. In our informal description of the table space graph of Section 3.1, the nodes are data tables and the edges are the transformation operators. However, in the DimStiller user interface, the more natural representation uses the dual graph, where operators are the nodes, and the edges represent the data flowing between them.
The user is thus encouraged to focus on manipulating and understanding the transformations of the data.

3.3.4 Workflows

Workflows are templates for entire expressions that can be immediately created with a few clicks. DimStiller has a base set of workflows built in, and users can create their own by saving the list of operators in any active expression as a workflow.

A workflow contains a sequence of individual operator steps, and saved parameters associated with each operator. When a user instantiates a workflow, a new expression is produced unique to a given input table. Because many operators may result in time-consuming computations, only the first operator in a workflow computes its output upon workflow instantiation, with subsequent operators greyed out in the user interface. Users choose when to progress to activating the next step, possibly after adjusting parameters at the current step, with a Step Operator button. Heavyweight operators downstream will thus only initiate their computations on data tables that may be much more compact than the input table due to reduction at upstream stages.

The built-in workflows are designed to help users begin to answer the set of questions that we identified in Section 3.2.1, as a proof of concept that this style of guidance can help middle-ground users. We do not claim that they are the only way, or even the best way in all cases, to answer these questions. Workflows provide optional global guidance; they are not mandatory. Power users have the flexibility to build up new expressions directly in DimStiller by choosing individual operators from the currently loaded set.

The set of workflows built into DimStiller is:

- Reduce:PCA: Cull:Variance → Data:Normalize → Collect:Pearson → Reduce:PCA → View:SPLOM
- Reduce:MDS: Cull:Variance → Data:Normalize → Collect:Pearson → Reduce:MDS → View:SPLOM
- Cluster Verify: Attrib:Color → Data:Normalize → Reduce:PCA → View:SPLOM
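In code, a workflow can be thought of as little more than an ordered list of operator steps with their saved parameters, bound to a concrete table only when instantiated and advanced one stage at a time. The sketch below illustrates that idea with hypothetical class names and a minimal stand-in operator; it is not DimStiller's actual implementation.

```python
import numpy as np

class VarianceCull:
    """Minimal stand-in operator so the sketch is self-contained."""
    def __init__(self, threshold=0.0):
        self.threshold = threshold
    def apply(self, table):
        return table[:, table.var(axis=0) > self.threshold]

class Workflow:
    """A reusable template: operator classes plus saved parameter values,
    independent of any particular input table."""
    def __init__(self, steps):
        self.steps = steps                        # list of (class, params)
    def instantiate(self, table):
        return Expression(self.steps, table)

class Expression:
    """A workflow bound to one table. Stages run one at a time, so
    heavyweight operators downstream see already-reduced tables."""
    def __init__(self, steps, table):
        self.pending = [cls(**params) for cls, params in steps]
        self.tables = [table]                     # result after each stage
    def step(self):
        op = self.pending.pop(0)                  # the "Step Operator" button
        self.tables.append(op.apply(self.tables[-1]))
        return self.tables[-1]

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    table = rng.random((100, 8))
    table[:, 4] = 0.0                             # one constant dimension
    expr = Workflow([(VarianceCull, {"threshold": 1e-3})]).instantiate(table)
    print(expr.step().shape)                      # (100, 7)
```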
3.3.5 DimStiller Interface

Figure 3.1 Right shows a screenshot of the DimStiller session containing the expression diagrammed in Figure 3.1 Left. The visual structure of the DimStiller interface, with views and controls for each operator, encourages the user to examine the individual operators, adjust their parameters, and observe the effects on the resulting transformations in the visual representations.

The DimStiller window on the left contains the Workflow Selector at the top, with the Expression Tree underneath and an operator control panel on the bottom. Two view windows are visible on the right, a scatterplot matrix for the View:SPLOM operator and the correlation matrix for the Collect:Pearson operator. The Expression Tree shows that the input file dimstillerwide.csv contained a table with 100 rows of points and 8 columns of dimensions. In this example dataset, Dim 1 and Dim 2 are independently sampled from a uniform distribution between 0 and 1, Dim 3 is a scalar multiple of Dim 1, Dim 4 and Dim 6 are scalar multiples of Dim 2, Dim 5 is set to all zeros, and Dim 7 and Dim 8 are linear combinations of both Dim 1 and Dim 2 with a uniform noise term.

The first E1 operator S1 is Cull:Variance. The user has clicked on the first nonzero dimension in the scree plot, resulting in a threshold value of 0.0007 (rounded to 0.001 for display in the tree). The summary line for S1 in the Expression Tree shows that the output table of S1 has 7 dimensions as opposed to the 8 dimensions that were input to the operator, and the expanded details beneath show that Dim 5 is the one that was culled. The S2 operator collects dimensions that are correlated with the threshold of 0.85, resulting in a table of 4 dimensions whose pairwise correlations are shown with color in the top view. The expanded details show that Dim 1 and Dim 3 are now represented by the new synthetic dimension S2.D1, and the remaining three are now represented by S2.D2. The last operator is View:SPLOM, and the bottom view shows the scatterplot matrix. The user selected some points in one plot, and they are colored red in all of the linked plots.

Workflow Selector

The Workflow Selector displays the available workflows and allows the user to select one and create a new expression from it. Selecting a workflow fills the adjacent list box with the sequence of steps for the user to inspect. If the user chooses to activate that workflow by clicking the Add button, DimStiller applies the workflow steps to the currently selected expression, making those operators visible in the Expression Tree.

Expression Tree

The Expression Tree is a three-level tree widget that lists all open expressions. At the top level, expressions are described by a short text summary where each new operator X is appended on the right of the text string as → [X], where X is a very terse label. When the user drills down to the next level, the individual operators that comprise the expression are listed with a concise yet complete text summary that includes the size of the output table produced by the operator in terms of points and dimensions, as well as any operator parameters that are set to non-default values. The third and final level of detail is only added if an operator modifies the output dimensionality, namely the list of the dimensions modified by that operator and any details that relate the input and output dimensions. For example, in Figure 3.1 the expansion of the Collect:Pearson operator shows two of the synthetic dimensions and the names of the original dimensions collected together.

Operator Control Panel

Each operator may have a control panel that lets the user adjust its parameters. Called operator controls, they afford the user the means to locally search for a meaningful region in table space. When an operator is selected in the Expression Tree, its control populates the operator control panel region at the bottom of the main DimStiller window. Only one operator can be selected at a time. Operator controls visible in this paper include the Cull:Variance control shown in Figure 3.1 and the Reduce:PCA control shown in Figures 3.4 and 3.5.

View Windows

Each operator also may have an associated view. When an expression is loaded or created, the associated views open up as individual windows to support side-by-side comparison across operators within the same expression or even across different expressions. All the views are created using the Processing language, but new operator view plugins could be created using any graphical toolkit that can interface with Java. The built-in operators that have views are View:Histo, View:SPLOM (shown in Figures 3.1 and 3.2), and the box matrix showing pairwise relationships for the Collect:Pearson operator (shown in Figures 3.1 and 3.3).

3.4 Case Studies

We now describe how the DimStiller architecture facilitates the task of dimensionality reduction and analysis through case studies on real-world data.
We use the built-in workflows to construct expressions that inform users about the character and relationships of the dimensions and clusters of the datasets. (The video at http://www.cs.ubc.ca/labs/imager/video/2010/dtil4 2.mov shows the look and feel of interactive sessions with these datasets.)

3.4.1 Sustainability Simulation: Measuring Variability

Our first case study focuses on a sustainability simulation dataset containing a large collection of simulated results of government policy decision scenarios [78]. The 294 dataset dimensions represent the environmental and societal indicators affected by the policy decisions. A first attempt to analyze this data using pre-existing tools fell prey to the "reduce to 2 and plot" pitfall discussed in the introduction to this chapter. The simulation is agglomerated from many subpieces originally designed for varying purposes, rather than being carefully constructed from custom components that would dovetail seamlessly. The simulation designers thus did not have a clear idea of the intrinsic dimensionality of this dataset. They did know what their dimensions were, with meaningful labels for each such as Cost of Living and Air Quality. However, they thought it was possible that some indicators always had the same value across the entire dataset. They were also interested in learning about how the dimensions related to each other: they suspected that many indicators were highly correlated, but did not know the number of equivalence classes or which indicators were in each group. They were also curious whether automatically computed correlation groups would match with their intuitions about indicator relationships.

The first choice to make when using DimStiller is whether to construct our own expression from scratch by individually choosing operators, or to instantiate a workflow from the existing list. Since we are interested in finding the intrinsic dimensionality of the space as well as any correlations, a workflow in the Reduce family seems to be a good match, and we start with Reduce:PCA.

Figure 3.2 shows the control of the Cull:Variance operator that plots the sorted variances of the dimensions, with the log-scale option selected to emphasize small values. We notice that there are indeed many zero-variance dimensions, and click the scree control to remove these 34 dimensions, leaving 260 in the output table. The researchers could now drill down in the Expression Tree to see the names of the potentially problematic culled dimensions. They now know that either the input policy choices used in this run of the simulator did not effectively span the indicator space, or that there are unforeseen interactions between simulator components.

We click the Step Operator button to activate the next workflow step, the Data:Norm operator. Both reduction workflows include a normalization step to guide users who may be unaware of the effects of transforming dimensions with differing scales of variation. The Expression Tree in Figure 3.4 shows that we chose to normalize using Z scores, so the operator subtracts the mean and divides by the standard deviation. We then step to activate the Collect:Pearson operator, which gathers highly correlated dimensions. Even the most stringent possible threshold setting of 1.0 for perfect correlation results in a drastic reduction of the number of dimensions: from 260 to 147.
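To make concrete the kind of computation this collection step performs, the sketch below groups dimensions whose pairwise Pearson correlation meets a threshold and replaces each group with its average, one plausible choice of representative as noted in Section 3.3.2. The greedy grouping strategy and the function names are illustrative assumptions, not the Collect:Pearson implementation.

```python
import numpy as np

def collect_correlated(table, threshold=0.8):
    """Greedily group dimensions whose Pearson correlation with the group
    seed meets the threshold, and replace each group with its mean."""
    corr = np.corrcoef(table, rowvar=False)
    unassigned = list(range(table.shape[1]))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed] + [j for j in unassigned if corr[seed, j] >= threshold]
        unassigned = [j for j in unassigned if j not in group]
        groups.append(group)
    collected = np.column_stack([table[:, g].mean(axis=1) for g in groups])
    return collected, groups

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    base = rng.random((100, 2))
    table = np.column_stack([base[:, 0], base[:, 1],
                             2 * base[:, 0], 3 * base[:, 1]])
    collected, groups = collect_correlated(table, threshold=0.9)
    print(collected.shape, groups)   # (100, 2), e.g. [[0, 2], [1, 3]]
```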
Figure 3.3 Left shows the correlation matrix view, where only a small fraction of the boxes are visible without scrolling. Relaxing the threshold to a more reasonable value of 0.8 results in the view shown in Figure 3.3 Right, where the number of dimensions in the table is reduced to 22. Again, the simulation designers could now drill down in the Expression Tree to see the names of which dimensions were collected together, in order to check whether the automatic computations match their intuitions about the expected behaviour of the simulator.

Figure 3.3: The DimStiller Collect:Pearson operator view shows the correlation matrix with a diverging color scale ranging from blue at the positive end, through yellow for independent pairs, to red for negative correlation. Left: The perfect correlation threshold of 1.0 reduces the table from 260 to 147 dimensions. Right: Relaxing to a more reasonable threshold of 0.8 reduces the number of dimensions to just 22, all visible without scrolling.

Figure 3.2: Interactive DimStiller controls. Left: The Cull:Variance operator control displays a sorted list of the dimension variances. Many have zero variance, indicating potential problems in the choice of input variables to the simulator or the operation of the simulator itself. Log-scaling of the variances emphasizes small values. Right: The scree plot for the nonlinear Reduce:MDS shows an intrinsic dimensionality of 12 dimensions, versus the 16 dimensions found by linear methods shown in Figure 3.4.

Finally, we would like to determine whether the intrinsic dimensionality of this space is even smaller than the 22 dimensions of the table after culling and collecting, and if so reduce to that space. The Reduce:PCA operator constructs a linear projection of the data into a subspace that minimizes distortion. Our motivation for doing the reduction is to observe the distribution of the eigenvalues corresponding to the major axes of the simulation output in a subsequent scree plot. The control view in Figure 3.4 shows the scree plot in the PCA control, and we see that the eigenvalues approach zero between 12 and 18 dimensions. Mousing over dimension 16 shows that it corresponds to a very small noise threshold of 0.001, and we click to select that value. We then click the Step Operator button and activate the View:SPLOM operator, which brings up the scatterplot matrix view.

In order to facilitate viewing the original lattice structure of the data in the input space, we insert an Attrib:Color operator into the expression after the Input:File. The Attrib:Color creates an attribute dimension containing color information derived from a single, user-selected input dimension. We select the initial dimension as the dimension from which colors are derived. The resulting colored SPLOM is shown in Figure 3.4 in the top right view window, labeled E1:[View:SPLOM].

The original analysis of the dataset was done by projecting the data down to two dimensions using multidimensional scaling. We quickly replicate this analysis in DimStiller so that we can compare the results directly. We reload the same data into a new expression E2, adding an Attrib:Color operator and using the Reduce:MDS workflow. The operators in this workflow are the same as in the previous analysis except for the Reduce operator, and we use the same settings as before. We check the scree plot for this case, and find that only 12 dimensions suffice with this nonlinear reduction, as shown in Figure 3.2.
Then, to illustrate the "reduce to 2 and plot" pitfall, we use the Reduce:MDS operator to select 2 as the output dimensionality. The SPLOM view at the bottom right of Figure 3.4 shows only the single scatterplot. The regular, lattice-like structure visible in the SPLOM above is completely hidden, and we see only an undifferentiated blob.

Figure 3.4: DimStiller Reduce:PCA control and scatterplot. The scree plot in the Reduce:PCA control shows an estimate of the intrinsic dimensionality as between 12 and 18 dimensions, and we have interactively selected 16 as the threshold. The top scatterplot matrix shows a result with a visible lattice structure. In contrast, the bottom view shows the pitfall of reducing to just two dimensions using MDS, where an undifferentiated blob gives a misleading impression of no such structure.

This analysis session with DimStiller shows that although the simulator produces hundreds of outputs, dozens have zero variance, most of the remainder are highly correlated, and the data can be represented with only around a dozen dimensions without losing information. Although in theory this full analysis could have been carried out with existing tools like MATLAB, and bits and pieces of it were done over the course of a few years, in practice we did not have a complete picture of this messy real-world dataset until we could analyze it with the DimStiller system.

3.4.2 Computational Chemistry: Evaluating Clusters

We now examine a 30-dimensional computational chemistry dataset. The individual dimensions of this data measure physical properties of chemicals such as molecular weight and the number of bonds they possess. The dataset includes a cluster membership dimension with 236 clusters of the data produced by a commercial clustering package. The goal of the chemists who work with this dataset is to evaluate the quality of the clustering.

The Cluster Verify workflow is appropriate for this goal, so we instantiate a DimStiller expression from it. After loading the data, we use the Color operator control to choose which dimension we use for the categorical colormap. By default, the color operator culls the dimension by which it colors the points; including this cluster membership dimension in the downstream analysis would usually skew the results.

After adjusting color settings, we activate the Data:Norm operator, which normalizes the dimensions to Z scores. We then activate the Reduce:PCA operator and observe its control. The scree plot of the eigenvalues, visible in Figure 3.5, shows an exponential drop-off in magnitudes. This plot strongly suggests that the majority of the variance of the data resides in a lower-dimensional space than the input dimensionality. Standard practice is to select the "knee" of the value drop-off curve as a good candidate for target dimensionality. We select 3 and then activate the View:SPLOM operator.

The resulting scatterplot matrix of the View:SPLOM operator, also visible in Figure 3.5, reveals several interesting cluster structures in the data. In the bottom row of two scatterplots, we observe clear separation of several clusters of points. In contrast to the spatial separation, some color labels appear to span gaps in the scatterplots. A single categorical color scheme of course cannot possibly show over 200 clusters with distinguishable colors, so DimStiller uses a repeating palette.
The Permute Colors button permutesthe order in which colors are assigned to categories. After trying a few differentpermutations of the color scheme we conclude that the phenomenon we saw wasnot just an artifact; several cluster labels do indeed span these observed clusters.This result gives strong, albeit not conclusive, evidence that there may be betterclusterings.Figure 3.5: Running the DimStiller Cluster Verify workflow on a com-putational chemistry dataset. The scree plot in the lower left showsthat most of the variability resides in a low dimensional subspace. Wechoose a threshold of 3 dimensions at the ?knee? in the plot. We see inthe colored SPLOM view that while the clusters are spatially coherent,they do not reflect the spatial separation in this projection, suggestingthat this clustering is not the most appropriate.3.5 DimStiller DeploymentDimStiller was produced as part of a larger, multi-pronged research initiative de-signed to study high-dimensional analysis. In the context this initiative, DimStiller42was our first attempt to architect a software solution to the major issues that webelieved were facing working data analysts: selecting appropriate methods andappropriately tuning their parameters. Another key thread of that larger researchinitiative was a multi-year qualitative field study in which I participated as a co-author [98]. This qualitative study provided an ideal opportunity to deploy Dim-Stiller to working data analysts from a variety of disciplines and informally gaugethe utility of the software to their problems.To deploy the software, we introduced some of the data analysts in the studyto DimStiller and encouraged them to try to incorporate using the software in theirown data analysis tasks. This limited deployment was instructive in several keyways. First, it highlighted use cases of high-dimensional analysis where DimStillerwas not an appropriate tool. Second, it brought to our attention incorrect assump-tions of DR algorithms that needed to be addressed before a tool like DimStillercould even be applied. Finally, the deployment shed light on external factors thatcan greatly affect the success and future development of software systems. In thereminder of this section, we describe these different results of the deployment inmore detail.In one deployment use case, detailed more completely in the qualitative fieldstudy as FISHPOP [98], a user applied DR to a technique more appropriate to a dif-ferent high-dimensional technique called sensitivity analysis. Sensitivity analysisis outside the scope of the goals of both DimStiller and DR. This case representsa mismatch between the user?s understanding the goals of existing DR workflowsand the user?s underlying task of gauging the sensitivity of output dimensions val-ues to small changes in input dimensions.In two other different deployment cases, users uncovered incorrect assumptionsin DR algorithms that needed to be addressed before DimStiller could effectivelyhandle their input. For example, one DimStiller user was a data analyst analyzingdistance-matrices resulting from the co-citation analysis of relevant terms, wheredistances between terms were the result of a costly database calculation. Here, thedata analyst was prohibited from doing a full analysis without performing an im-practical, months-long database calculation. Another user wanted to use DimStillerto perform DR on vectors derived from document collections. 
DimStiller was not equipped to process the extremely high-dimensional datasets resulting from analyzing document data. Both of these issues were less about inherent limitations of DimStiller itself than about DR in general, as any DR algorithm or tool would encounter similar difficulties when applied to the same data.

For these users, important algorithmic obstacles needed to be surmounted before applying the workflows for which DimStiller was designed. By applying DimStiller to perform DR on practical problems, these DimStiller users highlighted the issues that we tackle in the remainder of this thesis. In the case of the co-citation analyst, we developed an adaptive algorithm framework to handle the case of calculating DR in the presence of costly distance functions. For the case of document data, we developed DR algorithms to efficiently leverage their high-dimensional structure. Both of these issues, and the algorithms we developed, are covered in more detail in Chapters 6 and 4 respectively.

Missing from our deployment were any unambiguous success stories, where users reported insights and analysis process improvements from using DimStiller. We believe most users in the deployment did not adopt DimStiller into their analysis workflow, using the software only during the period of our interviews. This lack of adoption derives less from the design of the software itself, with which no user in our deployment had issues, than from two important external factors we observed: difficulty of software integration and our own lack of evangelism.

By software integration, we refer to the ease with which the inputs and outputs of software systems connect with other software tools. For example, one of our collaborators in the qualitative study remarked how she would be more likely to use and recommend DimStiller if it could be easily incorporated into her R workflow [86]. DimStiller was implemented as a standalone Java tool with a single generic input and output format. Thus, considerable software engineering and Java expertise is required to build and incorporate new operators into the tool. In the future, we believe guidance-oriented analysis tools should integrate themselves into software platforms with a substantial base of existing techniques and users. Doing so both mitigates the initial barrier to adoption and maximizes the ease with which new or existing techniques can be folded into the tool.

Another factor affecting the adoption of systems like DimStiller is the effect of software evangelism. Beyond the initial deployment and the publishing of an article describing DimStiller [58], we made no further effort to deploy to other communities or improve the software, having shifted our research focus to addressing the issues that appear later in the thesis, such as handling costly distances or document data. We discuss the effects of software evangelism and other ways to increase the impact of research more fully in Section 7.3.2 of this thesis.

Chapter 4

The Geometry of Fast Document Queries: Implications for Visual Analysis Algorithms

The field of Information Retrieval, or IR, focuses in part on improving the accuracy and speed of the task of querying document databases [75]. Search querying is performed by millions of users of varying skill each day. To do so, a user poses a set of terms to the query engine, which then provides a 1-dimensional list of results back to the user, sorted in terms of predicted relevance to the query.
Search query results computed on databases of billions of text records are often returned with processing time of less than a second, making search querying one of the more popular and scalable computer algorithms.

One reason Information Retrieval algorithms for search querying can perform so efficiently lies with an important property of the underlying data. We informally name this property the Mostly Disconnected property, or MoDisco for short. Put briefly, MoDisco datasets are those datasets with a large number of dimensions, but where each point has a nonzero value in only a subset of these dimensions, and each dimension is set to a nonzero value for only a subset of the points. That is, not only are the points sparse, but so are the dimensions themselves.

The MoDisco property has implications for the efficiency and scalability of query algorithms. High-quality query results can be computed by processing only a small subset of the total dataset without degradation in quality, by organizing the data into data structures such as inverted files and traversing these structures in order of query-term impact [137]. We discuss the MoDisco property in detail and its relationship with query algorithms in Section 4.1.

In contrast to querying, visual analysis often involves a more open-ended exploration. The document exploration task remains the purview of expert-level analysts in fields like intelligence analysis, law, and computational journalism, where it is important to quickly understand large, unlabelled collections of text documents [25, 73]. In this context, exploratory methods tend to encounter scaling difficulties, and the characterization of "large" is often ascribed to datasets well under ten thousand documents [52, 133].

We argue that the MoDisco property also has important implications for document exploration algorithm design in visualization. By interpreting the MoDisco property in terms of distance geometry, we can make useful statements about the construction of distance-based analysis algorithms, like clustering and dimensionality reduction, used to explore these large, complex datasets.

As the first contribution of this chapter, we identify and describe the following three implications for algorithm design in this paper.

- Nearest Neighbor search of MoDisco data can more efficiently be done with search-index based queries rather than generic distance-based approaches.
- Distance matrices of MoDisco data can be calculated and stored efficiently with low approximation error.
- Dimensionality Reduction algorithms for MoDisco data should use methods based on local attraction and global repulsion.

While distance matrices, nearest neighbor search, and dimensionality reduction are useful in a variety of analysis pipelines [58], it is not self-evident how such implications can be directly utilized to formulate new algorithms. As a followup contribution, we also describe three techniques, two new algorithms and a modification of an existing algorithm, for incorporating these implications into the design of algorithms intended for the visual analysis of documents:

- All Pairs Queries: we present a new algorithm for nearest neighbor search of MoDisco data, which we use to construct an efficient Distance Matrix approximation.
- Fast Cluster Tree: we present an algorithm for producing a hierarchical cluster tree from our Distance Matrix approximation in O(N log N) steps.
- MD-SNE: we introduce the MD-SNE algorithm for dimensionality reduction of MoDisco data.
Figure 4.1: Chapter diagram describing the relationship between our algorithmic contributions, the previous work used as algorithm components, and the data structures created and used by these algorithms. Section 4.2 describes the All-Pairs Query, or APQ, algorithm, which uses impact-sorted inverted file queries to process a set of term vectors to yield a set of nearest neighbors for each vector. In Section 4.3, we describe how this set of nearest neighbors can be input into the APQ to Distance Matrix algorithm to produce a Truncated Distance Matrix, a low-space Distance Matrix approximation. We then demonstrate how to use the Truncated Distance Matrix efficiently with the DM2CT algorithm, which uses the Minimum Spanning Tree (MST) algorithm as a component to produce a Cluster Tree. Section 4.4 describes the MoDisco-SNE (MD-SNE) algorithm for dimensionality reduction.

Figure 4.1 summarizes the contributions of this paper with a diagram showing the relationship between our implications, our algorithmic contributions, and previous work used as algorithm components.

After first defining the MoDisco property and how it relates to document term-vector datasets in Section 4.1, the remainder of the paper is organized around the three different algorithm design implications. In each of Sections 4.2, 4.3, and 4.4, we explain the design implication, present an algorithm incorporating the design, and compare the results of the algorithm with competing work.

4.1 The Mostly Disconnected Property of Term-Vector Datasets

Query systems often use the vector space model, storing documents as term vectors [96]. Here, document refers to a string containing the content of a single text file. The document string can then be broken into terms: n-grams of words. Term vectors reside in a space whose dimensions are all of the terms in the entire database, and whose weights indicate the importance of a given term in identifying the underlying document. Using such vectors is not only more efficient than processing the varying-length document content during a query, but it permits us to conceive of the documents as points residing in term space. We can then think about documents spatially, just as we would any other distribution of high-dimensional data points.

In this section we first describe the method by which documents are transformed into term vectors. We then present a set of term-vector datasets derived from real-world document corpora. Next, we indicate how the transformation method induces the mostly disconnected property by measuring properties of our benchmark datasets. Finally, we explain the method by which query algorithms are able to extract performance from this property.

4.1.1 Term-Vector Datasets and TF-IDF

An influential method for determining the weights assigned to term-vector dimensions in the IR literature is the TF-IDF method [95]. TF-IDF, short for Term Frequency times Inverse Document Frequency, is a scheme that assigns a large weight to terms that appear in a document frequently, but then discounts that weight by how often the term appears in other documents. The weight of common words is reduced and that of unique words increased.
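The sketch below shows one common variant of this weighting, together with the unit-length normalization and cosine distance discussed in the remainder of this subsection. The exact TF and IDF formulas differ across systems, so the particular choices here (raw counts and a logarithmic IDF) should be read as illustrative assumptions rather than the weighting used for the benchmark datasets described later.

```python
import numpy as np

def tfidf_vectors(term_counts):
    """term_counts: an N-by-M array of raw term frequencies.
    Returns unit-length TF-IDF vectors (one common variant of the scheme)."""
    n_docs = term_counts.shape[0]
    doc_freq = (term_counts > 0).sum(axis=0)
    idf = np.log(n_docs / np.maximum(doc_freq, 1))   # discount common terms
    weights = term_counts * idf                       # TF times IDF
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    return weights / np.maximum(norms, 1e-12)         # unit-length points

def cosine_distance(a, b):
    """1 minus the dot product of two unit-length term vectors."""
    return 1.0 - float(a @ b)

if __name__ == "__main__":
    counts = np.array([[3, 0, 1, 0],
                       [2, 0, 0, 1],
                       [0, 4, 0, 1]], dtype=float)
    vecs = tfidf_vectors(counts)
    # Documents 0 and 1 share a term; documents 0 and 2 share none.
    print(round(cosine_distance(vecs[0], vecs[1]), 3),
          round(cosine_distance(vecs[0], vecs[2]), 3))
```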
Using the probabilistic model of IR, TF-IDF has already been shown to be an optimal model of document relevance under a set of simplifying assumptions [89]. In contrast to this probabilistic perspective regarding document relevance, we develop a spatial intuition of why TF-IDF is efficient in document queries.

Term vectors are typically normalized to unit length, to control for document size by forcing long documents with large term frequencies to carry the same weight as short documents with small term frequencies. This normalization implies that document points reside on the surface of a hypersphere. The fact that term weights are always nonnegative restricts their distribution further to the positive orthant of this sphere.

The common distance metric used between term vectors is the cosine distance metric [95]. The cosine distance between vectors a and b is equal to 1 − aᵀb, or 1 minus the dot product between the vectors. While other related metrics exist, we focus on the cosine metric due to its simple geometric interpretation as one minus the length of the linear projection of vector a onto b.

4.1.2 Benchmark Datasets

Our explanation of the Mostly Disconnected property requires demonstrating and measuring the sparsity characteristics of a set of varying benchmark datasets. The benchmarks in this paper are run on a set of seven datasets. Four of them are collections of MoDisco term vectors generated using TF-IDF term weights from text databases; three are real-world datasets and one is synthetic. The metacombine dataset is produced from metadata tags harvested from a disparate collection of digital libraries [67]. The warlogs and cables datasets are subsets of documents from larger Wikileaks collections [59]. The conspiracy dataset is harvested from the conspiracy section of a BBS textfile collection (http://www.textfiles.com/, last visited on July 10, 2013). We include the synthetic simplex dataset as an extreme example of a MoDisco dataset, which is a set of 1000 points equally distant from each other in 999-dimensional space. An (N−1)-dimensional simplex dataset is equivalent to an N × N identity matrix I.

Name         Points (N)   Dims (M)   MoDisco?
metacombine  28K          28K        Y
cables       7K           66K        Y
warlogs      3K           4K         Y
conspiracy   1K           33K        Y
simplex      1K           1K         Y
grid         1K           2          N
mnist        2K           784        N
blobs        6K           5          N

Table 4.1: Benchmark datasets, with size in terms of both points and dimensions. The first five are MoDisco; the last three are not.

The term vector datasets in this thesis are constructed with the following pipeline. First the text is split into ordered sets of words. These word sets are then stripped of a designated list of stop words. The remaining words in the set are then used to create bigrams, pairs of subsequent words. Term counts for each document are then constructed by counting the occurrences of each bigram in each document. These term counts are finally re-weighted by TF-IDF and then normalized to unit length. Stray has a more detailed description of the term-vector generation process, including justifications for not using alternative approaches, such as entity recognition [109].

For contrast, we also include three datasets that are not MoDisco, also a mix of real-world and synthetic. As a synthetic example of a linear manifold, or flat surface, we include a regularly sampled 2-D lattice called grid. The real-world mnist dataset contains the combined test sets of digits 1 and 2 in the MNIST handwritten digit dataset (downloaded from http://yann.lecun.com/exdb/mnist/, last visited on July 10, 2013), as an example of a high-dimensional, real-world dataset with two meaningful clusters.
Finally, the synthetic blobs dataset is a collection of 6 clusters of 1000 points randomly generated from 6 separate 5-dimensional Gaussian distributions. Table 4.1 provides the characteristics of these benchmark datasets in terms of the number of points N, the term space dimensionality M, and whether it has the MoDisco property.

4.1.3 The Mostly Disconnected Property

The intuitive definition of the MoDisco property is that each point mostly resides in a small subspace of the full term space, represented by a small number of key terms. Furthermore, these same key terms weigh highly with only a small subset of the other data points. To formally define the MoDisco Property, we first introduce a descriptive statistic measuring the sparsity of both points and dimensions of the dataset with N points and M dimensions. First, we define the thresholded sparsity of a point i to be

P_i(t) = |{ x_ij : x_ij > t, j ≤ M }| / M

and the thresholded sparsity of a dimension j to be

D_j(t) = |{ x_ij : x_ij > t, i ≤ N }| / N

for some threshold 0 < t < 1. The purpose of introducing the threshold t is to remove vector terms from the data with very low weight. Low-weight terms often contribute little to the distance calculation between two points, and can potentially make operationally sparse data appear dense. In our analysis, we select t = 0.1, but it holds for a wide set of values. We can now define two worst-case sparsity measures

maxP(t) = max{ P_i(t) : i ∈ [1, N] }
maxD(t) = max{ D_j(t) : j ∈ [1, M] }

Given these worst-case thresholded sparsity measures, we now say that a dataset is MoDisco when both maxP(t) < B and maxD(t) < B. Here, B ∈ (0, 1) is some bound on the desired sparsity of the data; a smaller B imposes stricter sparsity requirements on the data. These two inequalities are descriptive statements about the worst-case sparsity of any single point and dimension in the data. They implicitly describe how the design implications we present in this paper hold true and are not computed explicitly by any of the presented algorithms. As a side note, it is not meaningful to make statements differentiating the average sparsity of points versus dimensions of a dataset, because these measures are equivalent.

Figure 4.2: Plot of MoDisco statistics (maxP(0.1), maxD(0.1)) showing clear separation between MoDisco datasets (circles) and non-MoDisco datasets (squares). The boundary B defines the region of MoDisco datasets; we present two extreme possibilities, B = 0.2 and B = 0.9, that agree with our datasets. In the MoDisco case, each point has a small number of nonzero dimensions and each dimension is expressed by a small number of points.

Figure 4.2 shows a plot of the points (maxP(0.1), maxD(0.1)) for the suite of 7 benchmark datasets. The five MoDisco term-vector datasets, represented as circles, are distinctly separated from the other 3 non-MoDisco datasets, represented as squares. Empirically, the worst-case sparsity bound B can take values anywhere between 0.2 and 0.9, demarcated by grey lines on Figure 4.2. Given that we observe significant separation between MoDisco and non-MoDisco datasets, we do not prescribe any specific value of B.
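Since these statistics are defined directly on the term-vector matrix, they are straightforward to compute. The sketch below is a near-literal transcription of the definitions above; it uses a dense array for clarity even though real term-vector data would be stored sparsely, and the bound B = 0.2 is just one of the admissible values noted above.

```python
import numpy as np

def modisco_stats(x, t=0.1):
    """Worst-case thresholded sparsity of points and dimensions for an
    N-by-M matrix x, following the definitions of P_i(t) and D_j(t)."""
    above = x > t
    point_sparsity = above.sum(axis=1) / x.shape[1]   # P_i(t), one per point
    dim_sparsity = above.sum(axis=0) / x.shape[0]     # D_j(t), one per dimension
    return point_sparsity.max(), dim_sparsity.max()   # maxP(t), maxD(t)

def is_modisco(x, t=0.1, bound=0.2):
    """True when both worst-case sparsity measures fall below the bound B."""
    max_p, max_d = modisco_stats(x, t)
    return max_p < bound and max_d < bound

if __name__ == "__main__":
    simplex = np.eye(1000)            # the extreme MoDisco case from Table 4.1
    dense = np.random.default_rng(4).random((1000, 5))
    print(modisco_stats(simplex), is_modisco(simplex))   # tiny values, True
    print(modisco_stats(dense), is_modisco(dense))       # near 1.0, False
```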
There is an intuitive explanation of why term-vector schemes like TF-IDF induce the MoDisco property. The TF component defines the subspace in which each vector resides, while the IDF component reduces the subspace weights below t for those dimensions shared by many vectors, effectively reducing the subspace dimensionality and disconnecting most vector subspaces from each other (unless they possess sufficient TF weight to compensate). Without the IDF discounting factor, most of the subspaces in which vectors reside would connect by intersecting across the dimensions representing common terms.

By definition, the MoDisco property implies that the distribution of the set of cosine distances between points is highly skewed toward 1. Two randomly sampled documents from a MoDisco dataset will share few, if any, terms on average. This phenomenon has been observed and noted in the IR literature on query datasets [84]. Figure 4.3 verifies this phenomenon by showing distance histograms for the benchmark datasets: the histogram of the MoDisco warlogs dataset is most similar to the high-dimensional simplex case, rather than to the three non-MoDisco examples, the regular grid and the two cluster datasets.

Figure 4.3: Distance histograms of different dataset types (horizontal axis: inter-point distance from 0 to 1; vertical axis: proportion of distances). a) Histograms of distances (top) and layouts showing dataset geometric structure after dimensionality reduction (bottom) for the non-MoDisco datasets grid (blue), mnist (yellow), and blobs (purple), and for the synthetic extreme MoDisco case of the simplex (green). b) The histogram of the real-world MoDisco dataset warlogs (brown) matches the simplex case the best.

4.1.4 Implication for Efficient Queries

MoDisco data are highly amenable to efficient search queries, due to the mapping of the data into fast, orderable search indices. We now explain how these indices are constructed, and how the MoDisco property affords efficient lookup of relevant documents.

An important search data structure is the Inverted File [137]. It is so named due to the inversion from standard term-vector storage, which maps a document to a list of terms, to a mapping from a single term to a list of the documents that contain that term. Query processing using the inverted file need not compare the query to the entire document corpus, but only to those documents that share the query terms.

Improved traversal of the inverted file can be achieved by performing an impact sort, a simple re-ordering of both the index lists and the order of their traversal. Impact ordering has been shown to compute higher-scoring query results earlier than low-scoring results [3]. The advantage of computing these higher-impact results first is that we can truncate the query computation after computing the top-scoring results.

The cost of impact-ordered traversal is that the query results are approximate. Many efficient query-evaluation systems exist that produce exact results [34]. In this paper, we focus on impact-ordering techniques for the user-controlled flexibility they provide in the speed and quality tradeoff.

Impact-ordered, inverted-file indices exploit the MoDisco property in the following way. The traversal of the documents indexed by the inverted-file terms contained in the query is ordered by the magnitude of their 1-D projections onto the query's subspace dimensions.
This ordering ensures that we first visit the stronglyconnected documents, those with large dimension weights intersecting the querysubspace.564.2 Nearest-neighbor SearchThe computational task of selecting the nearest neighbors of a point applies toa broad class of problems and is the subject of a significant amount of previouswork [72]. For data with the MoDisco property, we suggest that nearest neigh-bors are most efficiently computed using index-based queries, rather than the twoother major categories of nearest-neighbor search techniques that are relevant toour context: spatial partitioning strategies and mapping-based techniques. In thissection we discuss the related work, and then explain how index-based queries canbe expected to out-perform the state of the art. We then present a new algorithm forall-pairs, approximate k-nearest-neighbor search using an impact-ordered invertedfile and show how it achieves high accuracy results in less time than competingapproaches.4.2.1 Implication: Nearest-Neighbor Search with anImpact-Ordered Inverted FileWe suggest that when processing MoDisco data for nearest-neighbor search, queryindices are more efficient than either spatial partitioning or mapping based tech-niques. We provide an intuitive justification for this claim in this section, backedup with an empirical analysis of the benefits of using query indices in Section 4.2.3.Each term vector resides in a low-dimensional subspace of term space, and theinverted file maintains a record of the subspace dimensions for each vector. Whenusing the terms of a given term vector as a query into the impact-ordered index, wefirst encounter those vectors whose projections onto the query?s subspaces dimen-sions are the largest. These vectors are more likely to be in the set of points closestto our query. In this way, the impact-sorted, inverted file already acts like a map-ping function on MoDisco data, but without the need to compute any projectionsor hash functions.Spatial partitioning strategies and mapping based techniques, in contrast, op-erate with no a priori knowledge of the subspace structure of the data. They mustfirst resolve the relevant data subspaces by either building the spatial-partitioningdata structure or computing enough mapping functions to effectively discretize theterm space into relevant regions.57It is instructive to note how using index queries for the general case, when thedata does not have the MoDisco property, will fail to improve over other nearest-neighborhood search methods. In the general case, there is no guarantee that mostpoints will reside in a low-dimensional subspace. In the case of a dense dataset, sin-gle inverted index entries are full, containing references to the entire set of points.When inverted file entries are full, they provide no useful method for reducing thesearch space for neighboring points. Indeed, in the full inverted-index-entry case,the inverted file devolves into an exhaustive search over all the data points. In thecase of sparse rows, but not sparse columns, we can guarantee that most pointsreside in a low-dimensional subspace. 
However, we still cannot guarantee against the existence of a full inverted index entry.

4.2.2 Algorithm: APQ

Our All Pairs Queries (APQ) algorithm constructs an approximate set of k-nearest neighbors without exhaustively computing all the similarities between data points. We have argued that the fastest way to perform nearest-neighbor search on MoDisco data is with index queries, where the query is constructed from the terms of the document itself. Thus, to compute the full set of nearest neighbors, we want to perform an ensemble of queries across all pairs of data points. Using impact-ordered processing, we can then truncate the query computation once we have produced the highest-scoring, or nearest, results.

The APQ algorithm performs an ensemble of queries in an iterative way using data structures similar to the single-query case [137]. The algorithm maintains several lists of fundamental data structures. First, the term vector dataset V. Second, the inverted file I, which we store as a list of arrays of document indices sorted in order of term weight; the expression I[i, j] references the document whose weight for term i is the jth largest among documents possessing that term. Figure 4.4 presents a simple two-document term dataset and its corresponding inverted file. The third list Q holds the set of terms for each point, where each set of terms is organized as a priority queue. Figure 4.5 describes the values stored in each priority queue element: the priority used to position the item in the queue, a term used for indexing a row in the inverted file, a term pos recording the position number of the term in the term vector, and a counter used to index a column in the inverted file. Finally, the accumulators are a list A of hash sets that map document numbers to accumulated terms. The accumulator A[i, j] stores the accumulated terms that make up the inner product between documents i and j up to the current iteration. These sets of accumulators A are the output of the algorithm. We discuss how to interpret A as a distance matrix in Section 4.3.1.

Term vectors V:                  Inverted file I (sorted by term weight):
  Doc 1: dog 0.3, cat 0.9          dog: (0.8, Doc 2), (0.3, Doc 1)
  Doc 2: dog 0.8, cat 0.5          cat: (0.9, Doc 1), (0.5, Doc 2)

Figure 4.4: A sample term vector dataset and its inverted file. The terms in V become the indices of the rows in I, while the IDs of the documents forming the rows in V become the indices for the weights stored for each term in I. The list of weights for each term in I is stored in descending order of document weight.

The primary difference between our method and the single-query, impact-ordered algorithm is the introduction of the set of priority queues Q. In the single-query case, impact ordering is achieved by sorting the set of inverted-file term entries, using the product of each query term weight and the matching inverted-file term weight as the sort key. The priority queue for each document in Q replicates the same sort ordering by making the priority value of each queue entry equal to this sort key. The priority value of the head queue element is then updated to be the product of the queue entry's weight with the next entry for that term in the inverted file, thus repositioning the item in the queue. Using the priority queues interleaves the sort ordering and accumulation calculations in a single inner-loop iteration.
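As an illustration of these structures, the following Python sketch builds the inverted file and the initial priority queues for the two-document example of Figure 4.4, using the standard-library heapq module (priorities are negated because heapq is a min-heap, and indices are 0-based rather than the 1-based indexing of the pseudocode). This is our own illustrative code, not the thesis implementation, which is written in C.

```python
import heapq

# Term vectors V from Figure 4.4: one dict of {term: weight} per document (0-indexed here).
V = [
    {"dog": 0.3, "cat": 0.9},   # Doc 0 (Doc 1 in Figure 4.4)
    {"dog": 0.8, "cat": 0.5},   # Doc 1 (Doc 2 in Figure 4.4)
]

# Inverted file I: term -> list of (weight, doc id), sorted by descending weight.
I = {}
for doc_id, vec in enumerate(V):
    for term, weight in vec.items():
        I.setdefault(term, []).append((weight, doc_id))
for term in I:
    I[term].sort(reverse=True)

# One priority queue per document. Each queue item mirrors Figure 4.5:
# (priority, term, term position, counter), where the initial priority is the
# impact score: document term weight * top weight in that term's inverted list.
Q = []
for doc_id, vec in enumerate(V):
    heap = []
    for pos, (term, weight) in enumerate(vec.items()):
        top_weight = I[term][0][0]
        heapq.heappush(heap, (-(weight * top_weight), term, pos, 0))
    Q.append(heap)

print(I["dog"])   # [(0.8, 1), (0.3, 0)]
print(Q[0][0])    # highest-impact item for Doc 0: term "cat", priority 0.9*0.9 (stored negated)
```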
Figure 4.6 presents the whole APQ algorithm in pseudocode.

The anatomy of a Q element, for example (0.4, dog, 1, 1):
  Priority: the impact score for this element, used as a sort key
  Term: the term used to derive the impact score
  Term Pos: the position of the term in the term vector
  Counter: an increment into the document list for this term in I

Figure 4.5: Anatomy of the priority queue elements maintained for each document. The priority queue Q is used to sequence the computation of terms in the APQ algorithm in impact order.

INPUT
  V : Term Vector Database
  I : Inverted File
  maxiter : Maximum number of iterations to run
OUTPUT
  A : List of Accumulator Sets

procedure APQ(V, I, maxiter)
  for i ← 1 to M do                                ▷ Loop over terms
    I[i].sort()                                    ▷ Sort each term index
  for i ← 1 to N do                                ▷ Loop over documents
    for j ← 1 to length(V[i,·]) do                 ▷ Loop over terms
      t ← V[i, j].term                             ▷ term index
      a ← V[i, j].weight                           ▷ document term weight
      b ← I[t, 1].weight                           ▷ top index term weight
      qitem ← (a·b, t, j, 1)                       ▷ build priority queue item
      Q[i].insert(qitem)                           ▷ insert priority queue item
  while iteration < maxiter do
    for i ← 1 to N do                              ▷ Loop over documents
      ▷ Begin A update
      qitem ← Q[i].head                            ▷ grab the highest-priority item
      t ← qitem[2]                                 ▷ term index
      c ← qitem[4]                                 ▷ index counter
      A[i, I[t, c].docid] += qitem[1]              ▷ accumulate term
      ▷ Begin Q update
      c ← c + 1; qitem[4] ← c                      ▷ increment index counter
      j ← qitem[3]                                 ▷ term position
      qitem[1] ← I[t, c].weight · V[i, j].weight   ▷ update priority
      Q[i].update(qitem)                           ▷ reposition queue item

Figure 4.6: All Pairs Queries algorithm pseudocode.

Figure 4.7: Diagram describing the accumulator and priority queue update. The accumulator A and the priority queue Q associated with each document are updated using the term-vector dataset V and the inverted file I in each iteration of the APQ algorithm. The left half of the figure diagrams the steps that update A: select the head of Q, use its term and counter to select a row and column of I, use the referenced document ID to select an entry of A, and add the priority to that entry. The right half diagrams the steps that update Q: increment the counter, reference the corresponding entries of V and I, multiply their weights, and update the queue item's priority with the product.

The main loop of the APQ algorithm consists of two steps applied to each document. In the first step, we update the document's accumulator list, and in the second, we update its priority queue. Figure 4.7 illustrates these two steps using the example dataset listed in Figure 4.4. The diagram shows the relationship between the data structures in the pseudocode: the term vector database V, the inverted file I, the list of priority queues Q, and the list of accumulator sets A. (Technically, the two-document dataset in the illustration is not MoDisco, but we use this simple example for clarity.)

The update to a document's accumulator list, illustrated in the left box of Figure 4.7, is done in four sub-steps. First, we read the head element from the priority queue. We then use the term and counter members of the queue element to index into the inverted file I and reference the appropriate document-id/term-weight pair. The ID of the document we are currently processing and the ID stored in the referenced pair provide an index into the appropriate entry in the accumulator file A.
We then add the priority value of the queue element to the value stored in the accumulator file entry.

After updating the accumulator for a document, we update the priority value of the head queue element. This update is illustrated in the right box of Figure 4.7. First, we increment the counter field of the queue element. Then, in the next two steps, entries from V and I are referenced using the queue element's indexing fields. The priority value of the queue element is then set to the product of the weight fields from these referenced entries. After the queue item's priority is updated, its position in the queue is updated, possibly shifting its location relative to other queue elements.

Note that, before entering the main loop, we must first construct the priority queue Q[i] of each document, with one queue element per term. The priority values used by the priority queues are the impact-ordering scores. As shown in the right box of Figure 4.7, the impact-ordering score for a queue element is computed by taking the product of the term weight and the corresponding inverted-file entry referenced by the counter of the queue item. At the beginning of the algorithm, the index counters are all 1, so we reference the first entry in the inverted file list for the indexed term.

The algorithm outer loop may terminate according to a variety of possible thresholds. For instance, search engines often simply compute a fixed number of iterations, resulting in a small subset of results for the user to peruse, and allowing the user to compute further iterations only if interested. This case is presented in the pseudocode in Figure 4.6. In practice, we use a termination threshold based on the change in the largest k accumulators.
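The two-step update is compact enough to sketch end to end. The following Python sketch (our own illustration, not the thesis's C implementation) runs a fixed number of APQ iterations over sparse term vectors and returns the accumulators A. It makes two small choices the pseudocode leaves implicit: a document's own postings are skipped, and a queue item is dropped once its inverted-file list is exhausted.

```python
import heapq
from collections import defaultdict

def apq(V, maxiter):
    """All Pairs Queries sketch. V is a list of {term: weight} dicts (unit-length TF-IDF vectors).
    Returns A, a list of dicts mapping other-document ids to accumulated inner-product terms."""
    # Inverted file: term -> [(weight, doc id)], sorted by descending weight.
    I = defaultdict(list)
    for d, vec in enumerate(V):
        for term, w in vec.items():
            I[term].append((w, d))
    for term in I:
        I[term].sort(reverse=True)

    # One priority queue per document; items are (-impact, term, counter).
    Q = []
    for d, vec in enumerate(V):
        heap = [(-(w * I[term][0][0]), term, 0) for term, w in vec.items()]
        heapq.heapify(heap)
        Q.append(heap)

    A = [defaultdict(float) for _ in V]
    for _ in range(maxiter):
        for d, heap in enumerate(Q):
            if not heap:
                continue                         # all postings for this document consumed
            neg_priority, term, c = heapq.heappop(heap)
            other = I[term][c][1]
            if other != d:
                A[d][other] += -neg_priority     # accumulate one term of the inner product
            c += 1                               # advance to the next posting for this term
            if c < len(I[term]):
                new_priority = I[term][c][0] * V[d][term]
                heapq.heappush(heap, (-new_priority, term, c))
    return A

if __name__ == "__main__":
    V = [{"dog": 0.3, "cat": 0.9}, {"dog": 0.8, "cat": 0.5}]   # the Figure 4.4 example, 0-indexed
    print(dict(apq(V, maxiter=4)[0]))   # accumulated inner product of Doc 0 with Doc 1 (0.69)
```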
4.2.3 APQ Results

We compare our APQ algorithm to state-of-the-art representatives from the spatial-partitioning and mapping classes of nearest-neighbor-search algorithms. For spatial partitioning, we select the Vantage-Point Tree (VPT) algorithm3 [135]. For mapping algorithms, we select the Locality-Sensitive Hashing (LSH) algorithm4 [39]. We modified both the VPT and LSH implementations to rapidly store and compute cosine distances of sparse feature vectors.

As a quality measure, we compute the adjusted agreement rate [53] between the true set of nearest neighbors and the nearest neighbors returned by the approximate nearest-neighbor algorithm. Agreement rate measures the average size of the intersection between the two sets. Adjusted agreement rate subtracts the expected number of intersections due to chance. The formula for adjusted agreement rate is
\[
AR_k = \left( \frac{1}{kN} \sum_{i=1}^{N} a_i \right) - \frac{k}{N-1}
\]
where N is the number of points, k is the considered number of neighbors, and a_i is the number of true k-nearest neighbors appearing in the set of the k nearest neighbors in the layout.

For APQ, we sample the AR5 measure across a range of compute times. Direct comparison between APQ and the LSH algorithm is complicated by the presence of two parameters: the number of hash tables, and the number of mapping functions per table. We regularly sample these two parameters at integer locations over a range between 5 and 80 for hash tables, and 5 and 10 for mapping functions, generating a set of nearest neighbors for each sample and recording the compute time and AR5 for each sample. We then compute the efficient frontier of these samples with respect to time and AR5 to produce a comparison curve. Since VPT is a parameter-free algorithm, we simply ran it once, recording the resulting AR5 and compute time.

3 Implementation in C++ provided by van der Maaten.
4 Implementation in C++ as part of the Caltech Large Scale Image Search Toolbox.

Figure 4.8: Speed vs. accuracy chart for the APQ 5-nearest-neighbor search algorithm on the warlogs dataset (horizontal axis: milliseconds, 0 to 7000; vertical axis: AR5, 0 to 1; curves for LSH, APQ, and VPTree).

All result timings in this paper are generated on a 13-inch Macbook Pro running OS X 10.8 with a 2.5 GHz Intel Core i5 processor and 8 GB of memory. Our implementation of APQ and all other implementations in this paper are written in the C programming language.

Figure 4.8 shows the speed-vs-accuracy tradeoff for APQ, LSH, and VPT on the task of selecting the 5 nearest neighbors on the warlogs dataset. Speed is measured in CPU time, and accuracy is measured as the fraction of correctly labelled nearest neighbors in the 5-nearest-neighbor set. The chart indicates that APQ achieves the same search accuracy in less time than the LSH or Vantage Point Tree techniques. We ran the same analysis with other values of k and other MoDisco benchmark datasets; the result was the same relative positioning of the algorithms.

4.3 Distance Matrix Computation and Storage

Distance matrices store the distances between points in a dataset. In addition to being interesting objects of theoretical study, they often serve as input to practical data analysis algorithms like clustering [1] and dimensionality reduction [118]. Computing and scanning the entire distance matrix is costly due to the O(N^2) unique entries stored in the matrix. In this section we discuss how the MoDisco property implies that we do not need to compute and store this entire matrix, leading to significant savings in cost. We then demonstrate how to use approximate distance matrices in practice, introducing an algorithm that uses our distance matrix approximation to produce a hierarchical clustering of the input data in less time than a full inverted-file traversal.

4.3.1 Implication: Truncated Distance Matrices

We can greatly reduce the storage required for a distance matrix of MoDisco data without suffering excessive approximation error. We accomplish such an approximation by storing distances between term vectors that share high-impact terms and assuming the remaining distances are 1. This implication is a direct consequence of the distribution of weights among term vectors in MoDisco data, where we assume that a vector x_i possesses a small number X_i(t) of high-impact weights, and that the number of vectors Y_j(t) that also possess a high-impact weight for these same dimensions is also small.

An efficient strategy for storing and retrieving these distances is to maintain a hash table for each data point, where distance matrix values are keyed by point index. Using such a strategy, any particular distance d_ij can be computed in O(1) by checking whether the key j exists in the ith table. If it does exist, the stored distance value is returned; if the key does not exist, it is safe to return 1. We call this efficient distance matrix data structure a truncated distance matrix.

Finally, the APQ2DMAT method translates the set of accumulator lists A returned by APQ into a truncated distance matrix. First, we initialize the distance matrix by constructing an empty hash set for each document.
Then, we iterate over all accumulator entries in each accumulator list, adding 1 minus the value of the jth entry of the ith list to the jth entry of the ith hash set in the distance matrix. To symmetrize the distance matrix, we also insert the same value into the ith entry of the jth list.

4.3.2 Algorithm: Fast Cluster Tree Calculation with DM2CT

Cluster dendrograms are diagrams that display hierarchical pairwise relationships between clusters in the data. They are visual representations of the cluster tree: a tree whose nodes represent successively higher-density regions of the data, and which is computed using hierarchical clustering algorithms. Single-link clustering is a classical hierarchical technique that is isomorphic to the Minimum Spanning Tree (MST) of the complete graph induced by the distance matrix, where edge weights between points are determined by the distances between the same points.

We now present the DM2CT algorithm for computing a single-link cluster tree from a truncated distance matrix in O(N log N) steps. The first step of DM2CT is to convert the dataset into a graph. To do this we first construct a set of graph vertices, one for each document. Then we map the set of distances stored in the matrix into graph edges, where the distance d_ij is converted into an undirected edge between vertices i and j with weight d_ij. Finally, we perform Kruskal's algorithm in O(E log E) steps to generate the MST of this graph [68]. The order in which edges are added to the MST is the same order in which clusters are joined together in the hierarchy. DM2CT is similar to the Voorhees algorithm for computing exact cluster trees from inverted files [119], which uses Prim's MST algorithm with a full traversal of the inverted file. By using APQ with impact ordering and APQ2DMAT to generate our distances, we avoid full traversal while still maintaining a good approximation of the distance matrix.
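A minimal sketch of this construction, under the assumption that the truncated distance matrix is represented as one dict per document (missing keys mean a distance of 1): convert the stored entries to edges, then run Kruskal's algorithm with a union-find structure; the order of the merges is the single-link cluster hierarchy. The function names are ours, and components that never merge are only connected at the assumed distance of 1.

```python
def dm2ct(truncated_dm):
    """truncated_dm: list of dicts, truncated_dm[i][j] = distance between i and j
    (distances not stored are assumed to be 1). Returns the single-link merge
    sequence as (distance, root_a, root_b) tuples, in the order clusters are joined."""
    n = len(truncated_dm)
    edges = sorted(
        (d, i, j)
        for i, row in enumerate(truncated_dm)
        for j, d in row.items() if i < j        # symmetric entries stored twice; keep one
    )
    parent = list(range(n))                     # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    merges = []
    for d, i, j in edges:                       # Kruskal's algorithm
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            merges.append((d, ri, rj))          # same order as single-link cluster joins
    return merges

if __name__ == "__main__":
    dm = [{1: 0.2, 2: 0.7}, {0: 0.2}, {0: 0.7}, {}]   # document 3 shares no high-impact terms
    print(dm2ct(dm))   # [(0.2, 0, 1), (0.7, 0, 2)]; document 3 joins only at distance 1
```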
4.3.3 Cluster Tree Results

We compare the speed of DM2CT with the method suggested by Voorhees [119], which performs a full traversal of the inverted file structure. We consider the clustering created by this exact method as the ground truth, and compute a Fowlkes-Mallows index score [35] between each clustering pair to check the quality of our approximation in terms of fidelity to this ground truth. Fowlkes-Mallows scores measure how closely two clusterings correspond for a set number of clusters C. A score close to zero is poor, indicating random associations between the two clusterings, while a score close to 1 implies identical cluster assignments. We denote the Fowlkes-Mallows index for the clusters produced by cutting the tree at the 500th split as FM500, and at the 1000th split as FM1000.

Our results in Table 4.2 show an order of magnitude improvement in speed using the approximate truncated distance matrix approach over using the exact inverted-file method. The quality is roughly equivalent; we observe high Fowlkes-Mallows scores near 1.0 between the two clusterings, whereas a random clustering would yield a score of 0.0.

Distance computation:
Benchmark      N     M     DM2CT time (ms)   Voorhees time (ms)   speedup   FM500   FM1000
metacombine    28K   28K   2597              14323                6         0.94    0.96
cables         7K    66K   2155              114300               53        0.96    0.92
warlogs        3K    4K    531               4608                 9         0.92    0.7
conspiracy     1K    33K   715               5839                 8         0.74    N/A

Cluster tree computation:
Benchmark      N     M     DM2CT time (ms)   Voorhees time (ms)   speedup   FM500   FM1000
metacombine    28K   28K   532               175036               329       0.94    0.96
cables         7K    66K   67                104308               1557      0.96    0.92
warlogs        3K    4K    34                5745                 169       0.92    0.7
conspiracy     1K    33K   7                 437                  62        0.74    N/A

Table 4.2: Comparison of hierarchical clustering timing and accuracy using the truncated matrix with our approximate DM2CT algorithm vs. using the inverted file with the exact Voorhees method. We divide the runtimes of the two methods into the time it takes to compute the distances and the time it takes to compute the cluster tree. The fourth and fifth columns give timings in milliseconds. The sixth column gives the speedup achieved using DM2CT over the Voorhees method. The last two columns, FM500 and FM1000, denote the Fowlkes-Mallows indices for the clusters at the 500th and 1000th split in the tree respectively; an index score close to 1.0 indicates high fidelity between the two clusterings. The DM2CT method achieves orders of magnitude speed improvement while maintaining high fidelity with the exact ground-truth clustering.

4.4 Dimensionality Reduction

In this section we first summarize the extensive related work regarding dimensionality reduction. We then discuss the implications of the MoDisco property for performing dimensionality reduction on term-vector data. Finally, we present an algorithm for scalable dimensionality reduction of term-vector data and measure its layout quality relative to competing dimensionality reduction approaches.

4.4.1 Implication: Dimensionality Reduction through Local Attraction and Global Repulsion

The unique geometry of MoDisco data presents several challenges to DR algorithms. The MoDisco property implies that each point mostly resides in a low-dimensional subspace that is orthogonal to most of the subspaces in which the other data points lie. We would naturally like the local region around each point in a low-dimensional layout to reflect the nearer points with intersecting subspaces. This requirement implies a necessary local attraction between a point and its nearest neighbors in term space. We now consider the distant points with orthogonal subspaces. While the cosine distance between two orthogonal points is 1, the semantic meaning of two documents being orthogonal is that they are unrelated. Therefore, we argue that it makes more sense for orthogonal points to simply repel each other than to attempt to fit them precisely to a unit distance apart. Since most of the points in a MoDisco dataset are unrelated, we need a DR method that can apply a global repulsion between all points.

In the remainder of this section, we discuss how the MoDisco property and the need for local attraction and global repulsion negatively affect the first three families discussed in Section 2.1, but can be accommodated by the probability-based family.

Candidate Property: Orthogonal Projections

Orthogonal projections are an ineffective solution for reducing the dimensions of MoDisco data because the majority of the orthogonal variance in MoDisco data is captured by the disconnected relationships in the data. Assuming the data is clustered into G clusters, the average distance between these clusters will be 1.0, creating a G-dimensional simplex. The first G principal components will primarily express these unit-length, inter-cluster distance relationships, often without resolving any of the intra-cluster distance relationships that may reside in a subset of orthogonal dimensions.
Because G is often larger than 2 or 3, PCA by itself is an ineffectivetool for producing low-dimensional visualizations of MoDisco data.Candidate Property: Global DistancesThere is surprising variability to the number of MDS techniques [36], but our ar-guments here hold for any particular MDS algorithm. The problem that MDSmethods have with MoDisco data lies with optimizing the layout to fit the prepon-derance of unit length distances. MDS methods optimize a function called stress,which sums the residuals between the high and low dimensional distance matrices.Buja and Swayne noted the problem of indifferentiation [16] when using MDSfor datasets where distances are clustered around a positive constant, with the ex-treme case being a perfect simplex: the high-dimensional analog of a tetrahedron,where every object is equally distant from every other object. In this case, applyingMDS yields low-dimensional layouts with a characteristic shape approximating auniform ball: in 2D, it is a circular disk with a sharp edge and a low density in thecenter. They caution data analysts to avoid misinterpreting this artifact as mean-70ingful dataset structure. Figure 4.3 shows the similarity of MoDisco data to theirsimplex example; both the simplex and the MoDisco distances are largely skewedtoward a single value. To our knowledge, their observation is the only close con-nection between the concept of MoDisco data and a Visual Analytics use case;unfortunately their thoughtful analysis has not percolated into the visualization lit-erature.Continuing with this line of analysis, we argue that MDS algorithms spendmost of their time on nearly useless computations when applied to MoDisco data.Because no simplex can fit without distortion into a low-dimensional projection,the unit-length distances make up the majority of the error of the objective function.Unfortunately, these disconnected unit-length distances dominate the computationin a way that is inversely related to their importance; the important local distancesto their nearest neighbors are not given enough weight.One strategy to improve MDS maps is to modifying the objective functionusing polynomial re-weighting of the terms inversely proportional to distance, asin Sammon mapping [97]. However, such schemes do not go far enough; they stillretain the need to fit all the distances of unit length in the objective function. Ourresults in Section 4.4.3 show the negative impact that fitting the global distanceshas on layout quality with MoDisco data.Candidate Property: Manifold DistancesAfter observing the deleterious effects of fitting global distances, the notion ofstitching together the local term vectors into a useful layout may seem attrac-tive. However, in practice, manifold techniques struggle with building an effectivemanifold model over MoDisco data. Their connectivity assumptions often do nothold because the sampling of the different subspaces is very sparse, often lead-ing to significant distortions. This phenomenon is well documented by van derMaaten [118], who reports that the performance of manifold methods are consis-tently thwarted by noisy, real-world examples.71Candidate Property: ProbabilitiesProbability-based methods like SNE [50] and t-SNE [117] can accommodate bothlocal attraction and global repulsion. To achieve local attraction, we fit a Gaussianprobability distribution only over the nearest-neighbor set of each input point. 
Toachieve global repulsion, we can simply assign a zero probability to points thatwe consider to be disconnected from each other. Unlike the previously describedDR techniques, probability methods permit significant flexibility in the relativeplacement and handling of zero-probability points. Using a spring-force metaphor,zero-probability points are attached to a point with an infinite-length spring thatrapidly diminishes in spring force as a function of distance. In contrast, distance-based methods such as MDS always assign a fixed-length spring between pointsthat will invariably contract to some finite length, producing an attractive forcebetween two points that have no meaningful relationship. The crucial property ofthis family of methods is the ability to assign zero weight to points in a way that isconsistent with the mathematical framework; developing different approaches thatdo not use probabilities to do so would be interesting future work.In Section 4.4.2 we detail the results of a modified implementation of the BH-SNE algorithm [116]. We choose the BH-SNE algorithm over other probability-based DR methods due to empirical evidence of good cluster separation, as well asits proven O(N logN) iteration cost. Because the NE algorithm [134] is virtuallyidentical to BH-SNE, the results below remain the same if the NE algorithm issubstituted for BH-SNE.4.4.2 Algorithm: MD-SNEOur discussion of Implication 3 notes that the t-SNE algorithm minimizes the di-vergence between the probabilities of the layout and the data [117]. We now de-scribe a scalable, efficient way to construct high-quality probability-based layoutof document datasets. We first discuss the implementation details of the BH-SNEalgorithm and its shortcomings, and then describe our own modification to the al-gorithm to adapt it for MoDisco data.72BH-SNEBH-SNE [116] is a modification of the t-SNE algorithm [117], which is itself amodification of the Stochastic Neighborhood Embedding, or SNE, algorithm [50].SNE fits Gaussian probability distributions for each point over the other points inthe dataset with variances determined by a single user-selected parameter calledperplexity. It then positions points in low-dimensional space in such a way thatminimizes the Kullback-Leibler divergence (a measure of the difference betweenprobability distributions) between the high and low dimensional Gaussian distribu-tions. The t-SNE algorithm improves the cluster separation of the low-dimensionalpoints by using a Student?s t distribution for the low-dimensional probability model.The mismatch between the tails in the Gaussian and Student?s t distributions effec-tively adds more space between dense regions in the layout, reducing the visualcrowding problem. Both SNE and tSNE perform dense operations on the distancematrix and therefore do not scale well for datasets beyond roughly 10,000 points.BH-SNE is a technique for approximating t-SNE efficiently on larger datasets.First, BH-SNE approximates the probability distribution for each point by fittingthe high-dimensional probability distribution to only the k-nearest neighbors ofany given point, and assigning zero probability to the remaining points. In thepublished method, nearest neighbor search is performed by first reducing the di-mensionality with PCA and then using Vantage Point Trees [135] in this reducedspace.t-SNE optimizes its objective function using a momentum-weighted gradientdescent, requiring O(N2) dense-matrix computations. 
For BH-SNE to perform thissame optimization efficiently, it re-expresses the gradient of its objective cost func-tion in two parts: an attractive part and a repulsive part. The attractive part of thegradient is calculated in O(N) time using the set of nearest neighbors, valid sincethe zero-probability points exert no attraction. The repulsive part of the gradient,which involves all pairs of points, is computed in O(N logN) time using a Barnes-Hut quadtree approximation. The Barnes-Hut quadtree discretizes space in such away that the algorithm can approximate repulsive forces with an accuracy depen-dent on the distance [5]. This approximation is valid because, in the case of t-SNE,the magnitude of the repulsive force often decays as a function of distance.73MD-SNEApplying BH-SNE directly to MoDisco data is straightforward. The primary short-coming of the BH-SNE procedure when applied to MoDisco data is its ineffectivenearest-neighbor strategy: using PCA to reduce to 50 dimensions followed by Van-tage Point Tree (VPT) search. While this method succeeds for small MoDiscodatasets, our results in Section 4.4.3 show that as the size of the term spaces in-crease, the PCA-VPT method breaks down. Instead, we can use our APQ algo-rithm to generate a truncated distance matrix of k distances per row, where k is thenumber of nearest neighbors requested by BH-SNE. Then, we simply substitutethe indices and distances of the truncated distance matrix in place of the nearestneighbors set computed using their strategy. The BH-SNE algorithm then pro-ceeds normally. Unlike the PCA-VPT method, the quality of the APQ method isindependent of the size of the input on MoDisco data.4.4.3 MD-SNE ResultsWe compare with representative algorithms from different classes of dimensional-ity reduction. For linear projection methods, we select Principal Component Anal-ysis, with eigenvectors computed using a fast SVD algorithm [47]. For distance-based methods, we select the hierarchical, stochastic Glimmer MDS algorithm [57].Finally, we compare MD-SNE to results produced by BH-SNE.The speeds between the different DR methods are commensurate. The MD-SNE algorithm produced a layout of our largest distance matrix, metacombine,in approximately 5 minutes. Because of its O(N logN) complexity and fixed-iteration optimization, it is not unreasonable to expect MD-SNE to handle million-scale datasets with hour-length processing times.We claim that probability maps are better suited to produce low-dimensional vi-sualizations of MoDisco data over competing techniques, and that our MD-SNE al-gorithm is better suited to MoDisco data than standard BH-SNE. Our argument forusing probability-based methods over other dimensionality reduction techniquesare based on quality, not speed. We validate this claim of quality improvementboth quantitatively and qualitatively.74Neighborhood AgreementWe first present a quantitative validation, where we compare a single quality mea-sure across different layouts of the same dataset. Many quantitative evaluations ofDR algorithms use the final Normalized Stress measure [57, 63, 83] when compar-ing against other algorithms. The Stress measure, however, incorporates all pairs ofdistances in its calculation, while we are specifically attempting to remove discon-nected distances from the equation. Furthermore, metric-oriented objectives likeStress are concerned with measuring precise distances, while Probability-basedtechniques subject these distances to a nonlinear transformation. 
We thus compare the rank-ordering discrepancy between points using different techniques, rather than the precise distance discrepancy.

As with the nearest-neighbor search discussed in Section 4.2.3, we select adjusted agreement rate as a suitable, rank-oriented measure, given that the target task is local exploration. Here, we measure the agreement between the local neighborhood of each point in high-dimensional space and in the layout, for different neighborhood sizes. Instead of measuring ARk for a single k, we vary k from 1 to 100 for each layout method, on each dataset.

The results are displayed in Figure 4.9. Both PCA and MDS report lower agreement rates on each dataset, while MD-SNE and BH-SNE show similar behavior on all benchmarks except for metacombine. For this dataset, BH-SNE shows worse agreement rates for larger neighborhoods. We hypothesize that this result is due to the metacombine dataset having a larger feature space than the other benchmarks; it is an order of magnitude larger than the others. The BH-SNE nearest-neighbor search technique of reducing to 50 dimensions with PCA before using Vantage Point Trees is more likely to fail to adequately separate points in such a large feature space. In contrast, MD-SNE exhibits the same pattern of neighborhood-agreement behavior as on the other MoDisco datasets.

Figure 4.9: Adjusted agreement rates ARk among dimensionality reduction techniques (MD-SNE, BH-SNE, PCA, Glimmer) on our benchmark datasets (conspiracy, N=1K; warlogs, N=3K; cables, N=7K; metacombine, N=28K) over a range of neighborhood sizes k. Higher numbers indicate higher agreement between the layout neighborhood and the true neighborhood. MD-SNE and BH-SNE exhibit roughly equivalent performance on the smaller MoDisco datasets, but BH-SNE breaks down on the benchmark with the larger feature space while MD-SNE maintains its performance.

The values for agreement rate in Figure 4.9 are much lower than those in Figure 4.8 on the same dataset. We suspect this discrepancy arises from the challenge of combining the different point neighborhoods into a single planar region. Despite the difference in magnitude, the ordering of the different neighborhood-agreement curves corresponds to the ordering in visual quality between the different techniques.
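For reference, the adjusted agreement rate used in these plots can be computed directly from the two neighbor sets of each point. The following Python sketch (our own, with a naive brute-force neighbor search that is adequate only for small examples) applies the AR_k formula from Section 4.2.3 to a high-dimensional point set and its 2-D layout.

```python
def k_nearest(points, i, k, dist):
    """Indices of the k nearest neighbors of point i under the given distance function."""
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=lambda j: dist(points[i], points[j]))[:k])

def adjusted_agreement_rate(high_pts, layout_pts, k, high_dist, layout_dist):
    """AR_k = (1/(kN)) * sum_i a_i - k/(N-1), where a_i counts true k-nearest
    neighbors of point i that also appear among its k nearest layout neighbors."""
    n = len(high_pts)
    total = 0
    for i in range(n):
        true_nn = k_nearest(high_pts, i, k, high_dist)
        layout_nn = k_nearest(layout_pts, i, k, layout_dist)
        total += len(true_nn & layout_nn)            # a_i
    return total / (k * n) - k / (n - 1)

if __name__ == "__main__":
    euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    high = [(0, 0, 0), (0, 0, 1), (5, 5, 5), (5, 5, 6)]
    layout = [(0, 0), (0, 1), (9, 9), (9, 10)]       # a layout that preserves both close pairs
    print(adjusted_agreement_rate(high, layout, k=1, high_dist=euclid, layout_dist=euclid))
```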
Visual Quality

We now present a qualitative validation based on a discussion of the result images. Figure 4.10 presents layouts of the four MoDisco datasets for each of the four dimensionality reduction techniques. The PCA layouts, as expected, have a difficult time separating points using linear projections, and clusters tend to be expressed in filament-like structures projected on top of each other. The Glimmer MDS algorithm applied to MoDisco data produces noisy, cloud-like visualizations with large areas of uniform density, and fails to produce much visual separation between dense regions.

On the smaller first three datasets, BH-SNE and MD-SNE both produce acceptable layouts, and we see a tradeoff between cluster separation and local fidelity. On the cables and warlogs datasets, BH-SNE produces excellent cluster separation, but within the local cluster regions MD-SNE layouts show better neighborhood agreement. However, with the larger metacombine result, there is a clear quality improvement for MD-SNE over BH-SNE: the BH-SNE nearest-neighbor strategy fails to achieve the same cluster cohesion as the MD-SNE approach.

Figure 4.10: 2D layouts of the four MoDisco benchmark datasets (columns: conspiracy, warlogs, cables, metacombine) across the four tested dimensionality reduction algorithms (rows: PCA, Glimmer, BH-SNE, MD-SNE). PCA yields too much overplotting and Glimmer fails to provide clear cluster separation. BH-SNE and MD-SNE both produce acceptable layouts for the first three smaller datasets, but BH-SNE breaks down with the large metacombine dataset.

Chapter 5

Overview Prototype

Chapter 3 presented a software system designed with the goal of providing guidance to non-expert users, and Chapter 4 presented a set of algorithm guidelines with the goal of efficiently processing text data for visual analysis. In this chapter, we combine these two goals to present a software system, the Overview Prototype, designed to provide guidance to journalists who are analyzing large text datasets. In the course of designing this tool, we worked in close collaboration with a journalist from the Associated Press who was specifically interested in helping other journalists to understand a large set of documents of mostly unknown content, when indices, summaries, and other knowledge organization aids are not available.

We define the underlying task of our target users as document set exploration: the computer-assisted human construction of a categorization that is instance-specific, that is, tailored to a specific use of a specific document set. We assume that full-text search of the dataset is available to users, but not completely helpful, because they do not know precisely what they are looking for. Our target users are not strictly confined to journalism, and may also arise in fields such as business, law, and intelligence.

To process and categorize our document set, we must encode documents in some way that preserves semantics. The vector space representation of documents described in Chapter 4 yields very high-dimensional spaces with thousands or tens of thousands of dimensions, corresponding to the natural language vocabulary of the document set. Many automatic clustering algorithms that carry out computations in these spaces have been proposed [10] as ways to categorize large document collections. The output of these algorithms is a list of the documents in each cluster, which does not necessarily provide a sense of the overall semantic structure of the document set. Nor is it obvious whether any particular clustering, out of the combinatorially huge number of partitions of objects into disjoint sets, captures the application-specific semantics of interest. For this reason, many sensemaking systems attempt to directly visualize the high-dimensional cluster structure of the documents through low-dimensional layouts created with dimensionality reduction (DR) techniques [127] [23]. Our first attempt at such a system involved a single interactive layout of a document dataset using a modification of the Glimmer algorithm.

However, we found that users of DR for document set visualization, including our collaborator, have the persistent unease that there is often but not always structure in their datasets that is not revealed; that is, that they see false negatives in many cases but true negatives in others. In the case of document sets, structure most often means clusterings of related documents.
But due to the extremelyhigh-dimensional nature of the underlying data, the dimensionally reduced plotcan become crowded, with local densities of documents resulting from chance (afalse positive). In a similar way, true densities of related documents may be ?in-terrupted? by spurious, unrelated documents (a false negative). The tool developedin this chapter, the Overview prototype, is designed to mitigate these two relatedproblems by augmenting and complementing a dimensionally reduced view of thedata with additional information in the form of a hierarchical decomposition of thedata and an infrastructure for manual annotation.The remainder of this chapter is organized as follows: we first describe theOverview prototype application that combines a hierarchical clustering dendro-gram and an MDS view to support text analysts in exploring and annotating docu-ment collections through tagging clusters. We then justify our design by discussinghow the different prototype components work to support our target analysis task.Next, we show the prototype at work in two separate data analysis cases; the firstcase demonstrates a hypothetical use of the tool, and the second case describes theresults of actually giving the tool to a real-world data analyst. Finally, we describethe results of the tool?s public deployment to users outside of the experimental79setting.5.1 Overview Protoype DescriptionThe Overview prototype, shown with labeled components in Figure 5.1, is a proof-of-concept application to address computer-assisted human classification throughtagging items. It is based on interactively combining clustering, tagging, and di-mensionality reduction. In this section we describe the different data views andhow to interact with the prototype.Disconnected Component Tree Tags View Items PlotItem ViewerActive Set ListFigure 5.1: The components of the Overview prototype. The Overview pro-totype consists of five main components, indicated in this figure withblue labels. The Disconnected Component Tree displays aninteractive dendrogram, the Tags View displays and manages user-created tags, the Items Plot displays a scatterplot of the data, theActive Set List displays the tags and clusters contained in thecurrent selection, and the Item Viewer presents the text of the high-lighted item in the selection.In the top half of the application are three different views of the data. Theleft cluster view features an interactive representation of the hierarchical clustering80dendrogram (hereafter called the Disconnected Component Tree, or Dis-coTree), and the right DR view, called the Items Plot, shows a 2D scatterplotof a low-dimensional embedding created with the Glimmer MDS technique [57].The Tags View in the middle allows tags to be created or highlighted. The viewsare all linked to support cross-filtering [124].Below these views is the Active Set List. The active set is the set ofdocuments highlighted by the user and their associated nodes. The active set isconstructed through a selection interaction, such as clicking a node or lassoing aset of items, within a particular view. The Active Set List shows DiscoTreenodes on the left and items on the right, where the labels are the top terms withina document, or of all documents within the cluster. 
At the very bottom is ItemViewer, which displays the underlying text of a specific item selected in the rightitems list of the Active Set List.A user interacts with the prototype by selecting nodes in the DiscoTree, byhighlighting regions of the Items Plot, or by clicking a Tag in the Tags View.This brings all items and nodes that match the selection criteria into the ActiveSet List. The user can then peruse the items that make up the selection in theItem Viewer. Finally, having collected related items into an active set, the usercan apply a tag to those items in the Tags View. Section 5.3 provides more de-tails on how to use the different components of the prototype by walking through ananalysis task on a real-world dataset. As a supplement, our collaborator producedan online tutorial video explaining the different components of the software1.5.2 Clustering, Tagging, and DR for SensemakingWe now justify the design of the Overview prototype by articulating the docu-ment set exploration task more precisely as supporting text analysts in buildingan application-specific hierarchical categorization schema through tagging, start-ing from the scaffolding of a schema automatically created through hierarchicalclustering. Dimensionality reduction fits into this workflow both to enable directvisualization of document set structure, where possible, and as a way to evaluate1http://overview.ap.org/blog/2012/03/video-document-mining-with-the-overview-prototype/ Lastaccessed July 17th, 201381the relationship between the distance metric used in algorithmic clustering and thesemantic content of the dataset.5.2.1 Why Clustering?Clusters of points in high-dimensional data sets are interesting because they oftenhave meaning; that is, cluster structure frequently represents semantics of interestto the user. This statement posits a strong connection between a mathematicalproperty and an abstract, high-level notion of human knowledge.The idea of representing document collections as point sets in high-dimensionalspace began with the work of Luhn in the 1950s [74] and was developed into thevector space model by Salton et al. [96], originally designed for information re-trieval tasks. Section 4.1 contains a thorough description of the vector space model.Information retrieval and dimensionality reduction applications rely on a simi-larity or distance function, defined over every pair of documents, which gives riseto a metric space and associated topology. Spatially compact clusters of documentvectors in cosine distance space were recognized by early information retrieval re-searchers as semantically interesting structures, giving rise to the cluster hypothe-sis [61], a modern version of which is articulated as ?documents in the same clusterbehave similarly with respect to relevance to information needs? [75]. The clusterhypothesis is widely assumed and has been shown to hold in the case of web-scaleinformation retrieval [26].Sensemaking differs from information retrieval in that the user does not knowbeforehand what type of information is sought [94]. However, because the docu-ments within a cluster are conceptually similar, representing a document corpus byits clusters may be a useful form of information reduction. The intuition is that ifthe text analyst has read a few documents in a cluster, they can assume that the restwill contain a similar type of information.5.2.2 Why Tagging?A cluster is, in the end, just a set of documents. 
While there is evidence thatmachine-extracted clusters capture interesting semantics, that does not help theuser to understand what any given cluster means, much less a tree which may82include hundreds of clusters and sub-clusters. Cluster labeling is the crucial nextstep in sensemaking.There have been many more or less sophisticated attempts at automatic clusterlabeling, ranging from displaying the most frequently occurring words to attempt-ing to extract a single key sentence from a text corpus [136]. A related prob-lem is the naming of topics extracted by topic modeling algorithms such as LatentDirichelet Analysis (LDA) [11]. Each topic found by such an algorithm is a proba-bility distribution over words, sometimes visualized with a word cloud as shown bythe TextFlow tool of Cui et al. [27]. Yet a word cloud is not a substitute for a goodtopic name, as researchers working with LDA-based methods tacitly acknowledgewhen they compose short, human-generated labels to refer to the semantics of ex-tracted word distributions.A deeper problem is the suitability of the classification schema implied by thedistance metric. In what sense are two documents really ?similar?? In practicesimilarity depends on the context of the analysis. For example, should the docu-ments be grouped by location, specific event, type of incident, or actors involved?There is no reason to assume that the particular encoding and distance metric usedto generate clusters necessarily partitions the documents in the most semanticallyuseful way, especially given that, for the sensemaking task, the user may not knowbeforehand what ways are going to be interesting.Grimmer and King approach this problem by visualizing the space of all pos-sible clusterings, populated by executing a variety of clustering algorithms on thesame data [43]. They are able to directly explore some of the different semanticallyinteresting categorizations on the same set of documents.In principle, the sensemaking loop should include adjustments to the vector en-coding and distance metrics so as to explore different categorization schemas, butvery little is known about how to automatically make these adjustments. Instead,we consider the automatically-generated clusters as starting points for human clas-sification. We argue for a tagging system that allows the user to summarize andannotate the content of a cluster, by applying a label to some or all of its docu-ments. These tags allow the construction of a manual classification scheme thatfollows or cuts across the cluster structure as desired, and has greater or lesserresolution around particular concepts and subtrees.835.2.3 Why Dimensionality Reduction?Dimensionality reduction methods such as MDS also use distance metrics, just ask-means and other clustering methods do, to map items to a lower dimensionalspace. The promise of MDS and other DR methods is visual confirmation of theexistence and size of clusters. MDS as a visualization technique is often usedto verify a proposed clustering by coloring the points according to the clustersand checking if the spatial proximity relationships in the low dimensional viewmatch up the coloring, or to visually check for cluster structure in unclassified databy noticing if there are any visually separated clusters in the view. 
The visualdisplay of dimensionality reduction results allows the user to cross-check whetherthe relationships between the visible clusters that arise from the distance metricmatch their mental model of semantic content.5.2.4 Why All Three?In this framework, the sensemaking task is the construction of a set of tags whichcapture the concepts of interest ? perhaps newly discovered ? in the documentcollection. The tags, presented in the tool in the Tags View, act as an annotationlayer to get human semantic understanding into the exploration loop. An automati-cally created clustering, which can be navigated and examined in the DiscoTreeView, serves to accelerate the process of constructing meaningful annotations.Finally, dimensionality reduction, as presented in the Items Plot, provides aparallax view of the cluster structures.The Overview prototype then combines these three capabilities as three verydifferent linked views of the document dataset. Using the views, an Overviewuser cycles between examining clusters, writing annotations, and examining localneighborhoods and outliers until the process converges to a satisfactory categoriza-tion of the document dataset.5.3 Overview Prototype ResultsWe show results of using the Overview Prototype with two real-world journalismdatasets from WikiLeaks, Warlogs and Cables. In both, documents were en-coded as a vector using the TF-IDF term weighting scheme [96] applied to all84vocabulary words plus automatically detected common two-word phrases, or bi-grams.5.3.1 Afghan War LogsFigure 5.2: The Overview Prototype loaded with the Warlogs dataset andhaving five nodes of the DiscoTree tagged. The colors reveal thatthe documents they contain fall into local neighborhoods in the MDSItems Plot, but most of this structure cannot be seen from proxim-ity relationships alone because there is no visible separation from theother data points.The Warlogs dataset is the subset of the WikiLeaks Afghan Warlogs datasetfrom July 2009. It contains 3077 points and 4286 dimensions, where a single pointcorresponds to a document from the dataset. These documents are military after-action reports with an extremely terse format and specialized jargon, so they arenot trivial for non-experts to read.Figure 5.2 shows the results. We began with the most compact pruning levelwhere only nodes of 64 or more items are visible in the DiscoTree, revealing sevenmain clusters. Exploring those with the combination of quickly reading labelsin the Active Set List and using the Item Viewer to read the full textof documents led us to quickly confirm five of these as semantically meaningfulcategories and tag them with the following names and colors: found ieds (purple),insurgent engagements (brown), fire missions (pink), missions containing a salturreport (size, activity, location, time, unit, result) following an enemy contact (gold),85and detainee transfers (teal).We can see that these groups do form neighborhoods in the MDS ItemsPlot, but would be difficult to identify without the coloring because the regionsare not visually separated from the rest of the points by regions of low density. Theexception is the well separated detainee transfers.Figure 5.3: The DiscoTree control at different prune levels. Using the prunecontroll at the bottom of the DiscoTree View permits the user tosee the cluster tree at several levels of abstraction. 
Here we show the view of the Cables dataset at prune level 8 (left) and 16 (right).

For clarity, the DiscoTree View is interactively prunable so that only nodes past a particular size are visible, using the logarithmic Show Nodes >= buttons along the bottom. Figure 5.3 shows the DiscoTree View of the Cables dataset at varying prune levels. Each node in the DiscoTree contains many document items, and due to the hierarchical clustering the same item will appear in every node along a path from the singleton leaf node at the bottom of the tree up to the root. When an item is selected through one of the other views, this entire branch is highlighted through an edge color.

The user can use the Tags View to create and assign an arbitrary set of named and colored tags. Clicking on a tag's name selects all items which have been assigned to that tag. While the user interface supports the useful shortcut of tagging all items in a node, in general tags may be arbitrarily distributed across items. Nodes are assigned a tag's color when all items within that node contain that tag. Due to the hierarchical nature of the tree, once a node is colored in, all of its child nodes will also be filled in.

Each item in the MDS Items Plot is a single document, represented by a point. An individual item can have many tags attached to it, and a node has many items and thus many tags as well. We do not attempt to show all tags at once with any sort of glyph, since items and nodes cover small regions of the screen. Instead, the last selection color takes priority over the others.

5.3.2 Caracas Cables

The Cables dataset comes from a different WikiLeaks source, the diplomatic cables. We analyzed the subset consisting of cables sent to or from the US Embassy in Caracas, Venezuela, or containing the word "Caracas". It has 6849 points and 65,777 dimensions.

Figure 5.4: Cable alleging Iranian drone plans. The journalist found a cable alleging Iranian plans to ship unmanned aerial vehicles to Venezuela.

We provided this dataset and the Overview prototype to the Associated Press Caracas bureau chief, who created several dozen tags over a session of a few hours. These tags, listed in Table 5.1, cross-cut the data in different ways including geography, politics, and events; the supplemental video (http://www.cs.ubc.ca/labs/imager/video/2012/modiscotag.mov, last accessed July 17th, 2013) shows the prototype with each of these three tag sets. Figure 5.4 shows a different subset of ten interesting tags. Tags are applied to documents, not nodes, because the automatically generated tree does not necessarily categorize the documents in a way that is meaningful to the user. For example, several different branches contain documents concerning Ecuador in Figure 5.5.

Tag Name                              # Tagged    Tag Name                              # Tagged
Colombia/rebels                       570         RCTV                                  89
Ven Opposition                        424         Student protests                      64
Oil industry                          428         Suriname/Dutch territories            287
Chavez politics                       383         Paraguay                              95
Finance                               382         Human trafficking/child labor         236
extradition                           678         land seizures                         43
vatican                               283         weapons                               79
China                                 64          food supply                           93
Caribbean politics                    923         Russia                                117
Peru                                  1405        Brazil                                180
Portugal                              109         Guyana                                143
Banks                                 59          Jewish community                      108
Elections                             138         extrajudicial killings/human rights   374
Argentina                             72          Afro-Colombians                       94
Cuba                                  66          health care                           39
ALBA countries                        213         education                             39
Nicaragua                             357         nationalizations                      64
Ecuador                               728         Spain                                 86
Airlines                              99          El Salvador                           113
Chile                                 104         Israel                                39
Honduras                              234         Trinidad                              39
Jamaica                               99          water/drought                         44
Colombia politics                     207         Belarus                               39

Table 5.1: List of manually constructed document tags applied to the Cables dataset.
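To make the node-coloring rule described above concrete, here is a small hedged Python sketch, not the prototype's code, that colors a cluster-tree node with a tag only when every document in the node carries that tag; the Node class, tag names, and item sets are illustrative assumptions.

```python
# Sketch: propagate a tag's color to cluster-tree nodes whose items all carry the tag.
# The Node structure and tag assignments are illustrative, not the Overview implementation.
class Node:
    def __init__(self, items, children=()):
        self.items = set(items)        # document ids contained in this node
        self.children = list(children)
        self.color = None

def color_nodes(node, tagged_items, color):
    # A node takes the tag's color only if all of its items are tagged;
    # children of a fully tagged node are then necessarily fully tagged as well.
    if node.items and node.items <= tagged_items:
        node.color = color
    for child in node.children:
        color_nodes(child, tagged_items, color)

# Example: a tiny two-level tree and one tag applied to two documents.
leaf_a = Node({1, 2})
leaf_b = Node({3})
root = Node({1, 2, 3}, [leaf_a, leaf_b])
color_nodes(root, tagged_items={1, 2}, color="purple")
print(leaf_a.color, leaf_b.color, root.color)   # purple None None
```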
Figure 5.6 shows hierarchical cluster structure, where theparent-child relationships between the branch tagged with Finance and its childrenabout banking and oil are also visible as spatial nesting in the MDS view.The journalist found several topics that he had not previously found throughunassisted inspection of the dataset, including arms shipments to Ecuador shownin Figure 5.7. He noted that the application allowed him to quickly spot subject88Figure 5.5: The Cables dataset, with the full set of categories listed in Ta-ble 5.1 from the AP Caracas Bureau Chief.(a) (b) (c) (d)Figure 5.6: Hierarchical structure in the Disconnected ComponentTree and Items Plot. The branch tagged with Finance has childrenconcerning banking and oil. DiscoTree View detail when financetag selected (a) and one child node selected (b); Item View detail forfinance alone (c) and child (d).areas that could be of greater news interest, such as information on Colombianrebels. The tool also helped him find several interesting individual documents, forexample claims that Chavez was giving millions to a particular Jamaican politi-cian?s election campaign shown in Figure 5.8, and the cable alleging Iranian plansto ship unmanned aerial vehicles to Venezuela shown in Figure 5.4.5.4 Overview DeploymentThe Overview prototype was deployed on a website maintained by the AssociatedPress3. The deployment was evangelized by our collaborator, Jonathan Stray, whopromoted the tool through social media and journalism conferences, and built a3http://overview.ap.org Last accessed July 17th, 201389Figure 5.7: Cable with description of alleged arms trafficking. The journalistfound cables with a description of such trafficking in Ecuador.comprehensive tutorial video of the tool on the website4. The prototype was thenconverted to a public, web-based analysis tool5, complete with user accounts forsaving document analyses, and the ability to import data from DocumentCloud, apublic document repository for warehousing primary sources. In its various incar-nations, Overview has been used by working journalists to assist in published newsstories. Here, we briefly present two such use-cases as further validation of ourdesign.The first use-case is that of reporter Jarrel Wade, working for the Tulsa Worldnewspaper [120]. He used Overview to analyze a collection of emails from theTulsa Police Department revealing problems with the in-car computer system pur-chased by the department. In his detailed user experience writeup6, Wade describes4http://overview.ap.org/blog/2012/03/video-document-mining-with-the-overview-prototype/ Lastaccessed July 17th, 20135overviewproject.org Last accessed July 17th, 20136http://overview.ap.org/blog/2012/09/how-i-used-overview-to-report-on-8000-police-department-emails/90Figure 5.8: Cable alleging Venezuelan influence in Jamaican politics. Thejournalist found cables alleging campaign financing of a Jamaican po-litical candidate by Chavez.using the tool as a method not for avoiding reading all the documents, but for or-ganizing his reading of the email conversations into categories. He praises theefficacy of Overview, stating, ?In the end, I?m guessing it would have taken fourreporters splitting up emails into stacks of a few thousand to do the work I did intwo weeks.?The second use-case story we describe occurred during the 2012 US Presiden-tial election campaign. 
Reporter Jack Gillum used Overview in building a story toreveal that vice-presidential candidate Paul Ryan requested and used federal moneyfrom government programs he publicly criticized [38]. Gillum describes his use ofthe tool in a writeup by Jonathan Stray on the Overview website7. In contrast toWade using Overview?s clustering as an organizational tool, Gillum reports usingLast accessed July 17th 20137http://overview.ap.org/blog/2012/11/document-mining-shows-paul-ryan-relying-on-the-the-programs-he-criticizes/Last accessed July 17th, 201391the clustering as a filtering tool to specifically avoid having to read all the doc-uments in his analysis. Gillum knew that large swaths of the documents in thedataset contained no useful information for his story, and so he used the clusteringalgorithm to identify and cull these swaths.These two use-cases provide evidence that our tool design accelerated the targettask of document analysis for non-experts users: working journalists having no ex-perience with high-dimensional analysis. We were encouraged that these analysesled to published articles. Furthermore, we discovered our design is flexible enoughto accommodate different analysis styles of our target users. As future work, weplan to present a detailed account of several different use-cases of the Overviewtool, with the hopes of improving the design of analysis tools for non-expert users.92Chapter 6Glint: An MDS Framework forCostly Distance FunctionsAll MDS algorithms work by minimizing an objective function quantifying thedistortion of the points in the low-dimensional space relative to their original inputconfiguration. Though the different MDS algorithms compute coordinates in awide variety of ways, in each case the computational work can be divided into twoparts: distance calculation, where the inter-point distances are calculated from theinput points, and layout calculation, which reads the computed high-dimensionaldistances and positions the points in the low-dimensional space.The focus of this chapter is Glint, an iterative algorithm framework for au-tomatically minimizing distance calculation in MDS. Structurally, Glint forms anouter loop around a modified MDS algorithm. It starts with an empty distancematrix, densifying the matrix as the outer loop iterates, automatically terminatingwhen the MDS layout is stable. Glint separates the distance calculation portion ofthe MDS algorithm from layout calculations and provides an automated termina-tion procedure.The time cost of individual high-dimensional distance calculations have a pro-found effect on the run time of an MDS algorithm. Even for an efficient metric likethe 10-dimensional Euclidean distance function, the time spent calculating high-dimensional distances occupies almost 80% of the algorithm run time using theGlimmer force-directed MDS algorithm [57]. Many real-world problems where93MDS is used require more costly distance functions than the Euclidean case. Inthese more expensive cases, total distance costs occupy more than 99% of MDSrun time using the same algorithm. Thus, an efficient MDS algorithm should seekto minimize the total work done, minimizing the sum of both the distance andlayout work.Previous work has assumed that individual distance computations are fast tocalculate and thus has not sought to automatically resolve the balance between dis-tance and layout work. 
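One way to see this balance concretely is to instrument the distance function itself. The sketch below is a generic illustration, not tied to any particular MDS implementation: it wraps a costly distance function so that the share of total runtime spent on distance calls can be measured against the layout work.

```python
# Sketch: instrumenting a distance function to measure how much of an MDS run
# is spent on distance calculations versus layout work. Generic illustration only.
import time
import numpy as np

class CountingDistance:
    def __init__(self, dist_fn):
        self.dist_fn = dist_fn
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, a, b):
        start = time.perf_counter()
        value = self.dist_fn(a, b)
        self.seconds += time.perf_counter() - start
        self.calls += 1
        return value

# Example with a deliberately expensive Euclidean distance in many dimensions.
X = np.random.rand(200, 100_000)
dist = CountingDistance(lambda i, j: float(np.linalg.norm(X[i] - X[j])))

total_start = time.perf_counter()
# ... an MDS algorithm would run here, calling dist(i, j) for every needed distance ...
for i, j in zip(range(0, 199), range(1, 200)):   # stand-in for the algorithm's distance requests
    dist(i, j)
total = time.perf_counter() - total_start
print(f"distance share of runtime: {dist.seconds / total:.1%} over {dist.calls} calls")
```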
Current fast MDS algorithms that handle distance matriceseither compute many more distances than necessary [57], or leave the total numberof distances to compute as a tuning parameter and so do not have a fully automaticway to terminate [15, 30]. The Glimmer algorithm is an example of the overcompu-tation shortcoming [57]. It computes an iterative MDS approximation using force-directed heuristics. The Glimmer minimization strategy defines a cheap iterationand then iterates until convergence is detected. Within each iteration, both distancecalculations and layout calculations are done. Glimmer automatically chooses thenumber of distance calculations to make before terminating, but computes morethan are strictly necessary. The Pivot MDS algorithm is an example of the termina-tion shortcoming [15]. It computes a one-step analytic MDS approximation. TheMDS work is cleanly divided between distance calculation up front followed by asingle contiguous layout calculation. Pivot MDS computes all the distances it usesup front, but does not know how many to select.The above examples motivate a synthesis of the benefits of the two algorithms,keeping the automatic termination of algorithms like Glimmer while separating thedistance work from the layout work as in algorithms like Pivot MDS. The goal ofGlint is thus to not only compute far fewer distances than the iterative approxima-tion, but also to remove the tuning parameter from the analytic approximation.To demonstrate the generality and robustness of the Glint approach, we de-vise Glint instantiations for three very different classes of MDS algorithm: force-directed, analytic, and gradient-based. We present the design of the Glint compo-nents for each instantiation, where each is tailored to the requirements of the un-derlying MDS algorithm. We then show that these Glint instantiations drasticallyreduce total run time on datasets with costly distance functions without penalizingthe final layout quality.946.1 Distances In MDSThe distances between the points in a low-dimensional MDS solution are intendedto closely model those in the high-dimensional input dataset. The core premise ofMDS is that the input contains redundant information, allowing for correct outputeven with an incomplete set of distances as input. Glint exploits this redundancyby iteratively constructing a subset of distances that is as small as possible. Thissection describes two issues concerning these distances: the existence and effect ofexpensive distance functions, and how sparse the input distance matrix can be.6.1.1 Expensive Distance FunctionsMinimizing the total number of distances computed is especially important whenthe time spent computing distances dominates the time spent computing the layout.Many real-world applications involve datasets with expensive distance functions.Even the straightforward Euclidean distance metric can be costly if the numberof dimensions is large enough, for example in the millions. In image processing,the Earth Mover?s Distance, or EMD, compares the similarity of color distribu-tions between images and is useful for ranking images for querying and nearest-neighbor-type calculations [93]. Its calculation requires solving a linear program,often a costly operation relative to the layout calculation per point. Computa-tional complexity is not the only reason for distance calculation cost. Distancesbased on database lookups are costly due to the relative speed of disk I/O to mem-ory reads. 
Distances that involve elicitation of human judgement can be the most costly of all, because the time scales of human response are so much longer than of automatic computation. Human-elicited distances are of interest in many domains; in a marketing example, a single distance is derived from the averaged similarity judgements elicited from survey takers comparing two items [71]; in a psychophysics example, distances are derived from just noticeable differences in haptic stimuli [111].

In all of these cases, distance calculations can comprise well over 99.9% of the total time to compute the MDS layout. We will show that using the Glint framework can drastically reduce the time spent computing distances without compromising the final quality of the MDS layout.

6.1.2 Experimental Analysis of Sparse MDS Solutions

Spence and Domoney conducted a series of data experiments to determine if there could be an a priori way to select an optimal subset of distance matrix entries to compute prior to MDS layout [107]. Their experiments investigated the effect of controlling three factors pertaining to layout quality. The first two factors, the amount of noise in the distance measurement and the number of input data points, cannot be manipulated by an MDS algorithm. The last experimental factor they tested, which an algorithm can indeed control in practice, is distance matrix density, or how densely sampled the approximation of the distance matrix is compared to the full version.

The experiments resulted in two key findings that pertain to our work. First, only a fraction of the matrix, ranging from 20% to 60% of the distances on their example data, needed to be computed to accurately approximate the full layout. This finding verifies that the goal of minimizing distance computations is a reasonable one. Second, their results imply that there is no direct way to assess in advance exactly how many distances need to be computed. We thus designed Glint to run online, determining the optimal number of distances to compute on the fly.

6.2 Glint Algorithm Framework

Glint is an algorithm framework: an algorithm with modular components that are themselves algorithms. Figure 6.1 shows a diagram of these three components; each corresponds to a step in the Glint outer loop. Glint starts with an empty distance matrix and a random layout and then loops over the following three main steps to determine a final layout. First, in the Densify Matrix step, it selects a new subset of the distance matrix to compute with the distance matrix densification strategy DS and then updates the matrix with the computed values. In the Lay Out Points step, it updates the layout using the new distance information as input to the MDS layout algorithm M. Finally, in the Check Convergence step, it checks to see if the change in the objective function S is below a threshold ε. If convergence is detected, the last layout is returned, otherwise the loop repeats.

Figure 6.1: Diagram of Glint execution. The diagram shows the three outer-loop stages, Densify Matrix (DS), Lay Out Points (M), and Check Convergence (S), passing D̃_t, layout_t, and S_t forward to D̃_{t+1}, layout_{t+1}, and S_{t+1} on each iteration, with the distance metric d and the threshold ε as inputs.

The MDS layout algorithm M takes as input a low-dimensional point configuration as the starting point and a sparse distance matrix. To qualify for use in Glint, M must possess three characteristics. First, it must be able to compute a layout given a distance matrix. Next, it must be able to handle an incomplete (that is, sparse) distance matrix, given the Glint strategy of gradual densification.
Finally, M must compute a layout from a given starting position rather than starting from scratch each time, so that subsequent outer loop iterations start M from a state closer to the final layout configuration. We discuss M further in Section 6.3.1.

Controlling the density rate and pattern of the distance matrix is the job of the densification strategy DS. Some MDS algorithms, such as Pivot MDS and Landmark MDS, are able to compute layouts with incomplete matrices, but the precise sparsity pattern of the incomplete distance matrix may be constrained. Because matrix sparsity pattern requirements vary from algorithm to algorithm, we must tailor the selection of computed distances DS to the MDS algorithm M. We discuss DS further in Section 6.3.2.

Glint requires a cheap, monotonic objective function S in order to measure layout convergence; it must also be tailored to the MDS algorithm M. It should not invoke a costly full stress function that requires computing all the high- and low-dimensional distances, which would obviate all performance benefits of the system. We discuss S further in Section 6.3.3.

6.2.1 Glint Outer Loop

Algorithm 1: Pseudocode for Glint, with variable definitions.

    function GLINT(ε)
        layout_0 ← RANDOMLAYOUT
        t ← 0
        s_old ← MAXVALUE
        while !converged do
            P_{t+1} ← DS(P_t)
            D̃_{t+1} ← DISTANCE(D̃_t, P_{t+1}, d)
            layout_{t+1} ← M(D̃_{t+1}, layout_t)
            s_new ← S(P_t, D̃_{t+1}, layout_{t+1})
            converged ← |s_old − s_new| / s_old < ε
            s_old ← S(P_{t+1}, D̃_{t+1}, layout_{t+1})
            t ← t + 1
        return layout

    Variable     Description
    t            the current Glint iteration
    ε            termination threshold
    MAXVALUE     maximum possible value for the scalar objective function
    d            the distance metric d(i, j)
    s_old        scalar objective function value on the previous layout
    s_new        scalar objective function value on the current layout
    layout_t     layout coordinates at iteration t
    P_t          the set of computed point pairs at iteration t
    D̃_t          the sparse distance matrix with nonzeros specified by P_t

The Glint algorithm consists of a single threshold-controlled loop, similar to algorithms like gradient descent where the algorithm loops until the change in measured progress becomes very small. Algorithm 1 lists pseudocode for Glint. The algorithm initializes with a random configuration of points layout and then iterates through the main loop. In the main loop, we first call the densification strategy DS. On the first call it constructs the initial sparsity pattern P_t of the distance matrix D̃_t, and on subsequent calls it densifies the pattern by filling in more nonzero entries. Specifically, the sparsity pattern P_t contains the set of nonzero indices of D̃_t at time t. After selecting the precise entries to change, Glint updates the sparse distance matrix D̃_t by invoking the distance function for each pair of points contained in P_t. Next, the MDS algorithm M runs to termination with the starting point layout and the input distance matrix D̃. Glint itself terminates when the change in the objective function S is less than the termination threshold ε.

The objective function S takes three parameters: the sparsity pattern P that specifies the pairs of points over which we compare high-D and low-D distances, the distance matrix D̃ from which we read the distances specified by pairs in P, and the low-dimensional layout coordinates layout from which we compute the low-D distances. The reason for including the pattern P as an input instead of simply summing over the entirety of D̃ is subtle, but important. Glint terminates when the objective function converges; that is, when it stops changing between subsequent iterations.
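A minimal Python transcription of this outer loop is sketched below. It assumes dense NumPy arrays and caller-supplied components ds, mds_step, and sparse_objective, which are placeholders for the per-instantiation choices of DS, M, and S described in Section 6.3; the small first-iteration guard is an implementation convenience, not part of Algorithm 1.

```python
# Sketch of the Glint outer loop (Algorithm 1). The callables ds, mds_step, and
# sparse_objective stand in for the per-instantiation components DS, M, and S.
import numpy as np

def glint(n_points, dist_fn, ds, mds_step, sparse_objective, eps=1e-3, dim=2):
    layout = np.random.rand(n_points, dim)       # random initial configuration
    D = np.zeros((n_points, n_points))           # container for the sparse distance matrix
    pattern = set()                              # P_t: pairs whose distances are known
    s_old = np.finfo(float).max                  # MAXVALUE in Algorithm 1
    while True:
        new_pairs = ds(pattern, n_points)        # Densify Matrix: a set of new (i, j) pairs
        for i, j in new_pairs:
            D[i, j] = D[j, i] = dist_fn(i, j)    # costly distance calls happen only here
        prev_pattern = set(pattern)              # keep P_t for the convergence comparison
        pattern |= new_pairs                     # P_{t+1}
        layout = mds_step(D, pattern, layout)    # Lay Out Points: run M from the current layout
        eval_pattern = prev_pattern if prev_pattern else pattern  # first pass has no old pattern
        s_new = sparse_objective(eval_pattern, D, layout)
        if abs(s_old - s_new) / s_old < eps:     # Check Convergence
            return layout
        s_old = sparse_objective(pattern, D, layout)
```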
Thus, the objective function must compare results at time t+1 to results at time t. However, not only do the points in the layout change between iterations, but the number of terms in the distance function changes, because there are more nonzero entries in D̃_{t+1} than in D̃_t. To properly measure convergence, we need to compare functions with the same number of terms. Including the same sparsity pattern in the objective calculation ensures that we compare objective functions with equivalent terms at each iteration, by specifying which entries of the matrix to use. Thus, in the Algorithm 1 pseudocode, s_new is computed with the sparsity pattern from the previous iteration, P_t, to determine which entries to include in the computation, while using the actual values derived from the current layout at time t+1.

6.3 Glint Instantiations

A Glint instantiation substitutes implementations of three concrete components into the abstract framework of the Glint algorithm. We describe three Glint instantiations, one for each of the three different MDS algorithm families described in Section 2.1.2: force-directed, analytic, and gradient.

Several of the Glint instantiations require choosing input parameters, as discussed in detail below. Table 6.1 summarizes the default value of each parameter and our method for selecting it. It also includes our analysis of the tradeoffs, with the results for setting the parameter too small or too big.

Parameter Name   Instantiation              Default   Selection Method     If Too Small             If Too Big
ε                all                        0.001     benchmark            T: slower, Q: better     T: faster, Q: worse
numIters         gradient-based             100       benchmark            LT: faster, DT: slower   LT: slower, DT: faster
numDists         all                        log N     parameter doubling   T: faster, Q: worse      T: slower, Q: better
numRunsF         force-directed             5         benchmark            LT: faster, Q: worse     LT: slower, Q: better
numRunsA         analytic                   10        benchmark            LT: faster, Q: worse     LT: slower, Q: better
trainSize        force-directed, analytic   3         benchmark            T: faster, Q: worse      T: slower, Q: better

Table 6.1: Parameters used in Glint instantiations, their default values, how they were chosen, and the tradeoffs in setting them too small or too big. T is total time, LT is layout time, DT is distance calculation time, and Q is layout quality.

6.3.1 Component M: MDS Algorithm

The M component takes as input the low-dimensional input coordinates and places them in a new configuration based on the current distance matrix D̃ as output. For the analytic instantiation we substituted the Pivot MDS algorithm [15] for M, and for the gradient implementation we substituted the SMACOF algorithm [29] for M. The Pivot MDS algorithm is used without change, but the other instantiations require algorithm parameter choices or internal modifications which we detail in the following subsections.

Gradient-Based Instantiation

For the gradient-based instantiation, the SMACOF MDS algorithm has two tuning parameters: the inner termination threshold, ε, and the maximum number of inner-loop iterations before termination, numIters. We use the same value for ε as in the main Glint algorithm.

We observed that the gradient of the stress function for very sparse input matrices quickly shrinks in proximity to a minimum. Setting numIters too large results in over-optimizing with incomplete distance information, while setting it too small leads to computing more distances than are necessary. We select 100 as a good balance over all our benchmarks between these two extremes.

Force-Directed Instantiation

In the force-directed instantiation, we substitute a modified version of the Glimmer [56, 57] algorithm for M.
We used the version of Glimmer that supports distance matrix calculations in addition to handling points [56]. To make the Glimmer algorithm suitable as the M component, we must alter the randomized sampling regime used by the algorithm. In Glimmer, sampling is uniform and unconstrained over the entire distance matrix. Glint, however, only feeds a sparse subset of the distance matrix D̃ to M for each outer loop iteration. To compensate, we constrain Glimmer sampling to be uniform over the given nonzeros of the sparse distance matrix D̃.

6.3.2 Component DS: Densification Strategy

The DS component determines which distances to compute at each Glint iteration. For each instantiation, we follow a strategy of adding numDists new distances per point to the matrix D̃. By default, the numDists parameter is initially set to ⌈log₁₀ N⌉.

Setting the numDists parameter to an overly small value would result in an objective function S change that is less than the termination threshold ε and thus an incorrect algorithm termination after the first iteration. A small numDists is analogous to performing gradient descent with too small a gradient step-size. To ensure numDists is large enough, we follow a simple strategy of doubling numDists during the first iteration until we achieve a change in the objective function S greater than ε.

The distribution of new distances across the matrix D̃ varies for each instantiation. We describe these distributions on a per-instantiation basis.

Gradient Instantiation

The gradient instantiation DS is the simplest of the densification strategies. At each iteration, the DS uniformly samples numDists distances per point without replacement.

Force-Directed Instantiation

The force-directed DS is similar to the gradient instantiation, except for a single modification addressing Glimmer point hierarchies. The Glimmer algorithm divides points into a pyramid of levels, with the fewest points contained in the top level and increasingly larger sets of points at lower levels [57]. Sampling uniformly without replacement from the distance matrix would often lead to the case that, at the top level, several points will not have any distances computed between any of the other points in the top level, only distances computed to points in lower levels. To solve this problem, the force-directed DS samples numDists distances without replacement once for the points contained in each level. The sampling for a given level is constrained to be uniform over only the points contained in that level.

Analytic Instantiation

Pivot MDS works by operating on a subset of complete columns of the distance matrix. The uniform sampling of distances per point used by the other instantiations would violate this constraint by allowing zeros within columns. We instead compute numDists new columns of the distance matrix at each iteration. New columns are chosen using the MaxMin strategy described in the Pivot MDS paper [15], starting from a single column chosen uniformly at random.

6.3.3 Component S: Objective Function

Glint objective functions S are fast approximations of the true objective functions F that are far more costly. In each of the Glint instantiations, S fits the following template:

\[ S\langle hi, lo, sel \rangle(P, D, \mathit{layout}) = \frac{\sum_{(i,j) \in sel(P)} \bigl(lo(i,j) - hi(i,j)\bigr)^2}{\sum_{(i,j) \in sel(P)} hi(i,j)^2} \]

Here we use the ⟨·⟩ template notation from the C++ language to indicate parameters to S that do not change at runtime. The template parameters hi(i, j) and lo(i, j) are functions defining the high- and low-dimensional distances between points i and j.
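For concreteness, a small NumPy sketch of the stress-based reading of this template is shown below; it is a simplified placeholder rather than the thesis code, with pattern playing the role of the already-selected pair set sel(P).

```python
# Sketch: sparse, stress-based objective S restricted to a set of point pairs.
# 'pattern' is an iterable of (i, j) index pairs with known high-D distances,
# 'D' holds those distances, and 'layout' holds the low-D coordinates.
import numpy as np

def sparse_stress(pattern, D, layout):
    num, den = 0.0, 0.0
    for i, j in pattern:
        hi = D[i, j]                                 # high-dimensional distance
        lo = np.linalg.norm(layout[i] - layout[j])   # low-dimensional Euclidean distance
        num += (lo - hi) ** 2
        den += hi ** 2
    return num / den if den > 0 else 0.0
```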
The hi function varies from dataset to dataset, while lo is alwaysthe Euclidean distance function operating on the layout input parameter. The seltemplate parameter is an index-selection function that selects a subset from the setof nonzero distance matrix indices P. Intuitively, this function just measures thenormalized sum of distance residuals between the layout points and the data, butonly for a small set of point pairs instead of all pairs of points.Because they are stress-based techniques that minimize distance residuals, theforce-directed and gradient-based instantiations use D?i j for hi(i, j) and the low-dimensional Euclidean distance for lo(i, j). The analytic instantiation is strain-based, minimizing the inner-product residuals. To measure strain, we set hi(i, j)to be the inner product of ith and jth rows of the double-centered matrix C andset lo(i, j) to be the inner product of the ith and jth layout coordinates. The inter-ested reader should refer to the original Pivot MDS paper for more details on thedefinition of double-centering and the efficient computation of C [15].104For the analytic and gradient-based instantiations, the index-selection functionsel selects the entirety of the nonzero matrix indices P. In contrast, the force-directed instantiation selects a subset of P. The precise subset of P is the set ofpoint indices contained in the union of per-point random sample caches used bythe Glimmer algorithm.Each instantiation employs randomized sampling of new distance matrix in-dices after each Glint iteration, as mentioned in Section 6.3.2. In the case of thegradient-based instantiation, this random sampling does not impart enough randomnoise to the observed values of S to induce an unexpected termination. However,in the force-directed and analytic cases, we observed enough noise in the sequenceof S values that early termination was regularly observed. The sequence noise wasobserved to be Gaussian distributed (we confirmed normality with a Shapiro-Wilktest result of p = 0.55 [102]). In this section we describe our strategy for creatinga smooth S from the noisy series of raw objective function values.The obvious first choice for smoothing in Glint would be to convolve the signalwith a noise filter, as is done in the force-directed Glimmer algorithm. However,while that approach works well for Glimmer, it is not adequate for detecting con-vergence reliably within Glint. First, the smoothing filter size used in Glimmer ismuch bigger than the expected number of outer loop iterations. Reducing the filterwindow size would change the frequency response of the filter, allowing noise toleak into the signal. Furthermore, moving averages correspond to an impulse re-sponse filter that is finite. However, the noise in the sparse stress signal is Gaussianwhite noise, with equal power across all frequencies, so it will manifest itself afterfiltering any bandwidth.Rather than attempt to filter our the noise using convolution or averaging, weinstead take a more direct approach and model the observed noise in the sparsestress signal explicitly. 
Stochastic processes where any subset of process samples are normally distributed are known as Gaussian processes and can be accurately modelled by the machinery of Gaussian process regression (GPR) [88].

In order to perform GPR we must select the forms of the two functions that completely determine a Gaussian process, the mean and the covariance function. The mean of the Gaussian process encodes information about the shape of the underlying process, for example whether it is linear or constant. We chose a mean prior of zero, indicating that we have no advance knowledge about the signal. We select the squared exponential function, one of the most commonly chosen covariance functions [88], because it models smooth transitions between adjacent values of S, a behavior that matches our expectations for the convergence curve.

We can improve our smooth estimate of the mean of S by increasing the number of samples computed at each outer loop iteration. In the force-directed case, we compute more samples by restarting M with the same initial layout and a different random seed. Since the analytic case proceeds deterministically, the same technique cannot be used. To compute a set of random samples for the analytic case we were inspired by bootstrap resampling methods [31]. In order to produce a single sample we select numDists columns uniformly at random to leave out of P.

For the parameter designating the number of computed samples per Glint iteration, there is a parameter tradeoff between the fidelity of the estimated mean, which affects the likelihood of observing a false termination, and the speed of the algorithm. We empirically find that computing 5 runs for the force-directed numRunsF parameter and 10 runs for the analytic numRunsA parameter yields good results over all our benchmark datasets.

Using GPR requires initialization of the so-called process hyperparameters of the squared exponential covariance function. These include the length scale, or degree of smoothness, and the noise level. The hyperparameters can be efficiently learned from a small set of observations computed during the first trainSize iterations of the Glint outer loop, by optimizing a likelihood function using conjugate gradients. We empirically find that using 3 iterations for training yields good results over all our benchmark datasets.

6.3.4 Instantiation Design Summary

Table 6.2 summarizes the Glint component design decisions, emphasizing the underlying algorithm features that crosscut the three instantiations. Consideration of these features could guide designers of future instantiations. For example, an algorithm using the entire sparse input distance matrix, like Pivot MDS and SMACOF, can remain unaltered for M. Algorithms with objective functions S that are noisy, such as Pivot MDS and Glimmer, can employ GPR smoothing.

Alg. Class       M                  DS                                     S
force-directed   altered sampling   uniform pointwise for each hierarchy   GPR smoothed, stress-based across sample-cache sets
gradient-based   unchanged          uniform pointwise                      stress-based across P
analytic         unchanged          uniform columnwise                     GPR smoothed, strain-based across P

Table 6.2: Glint component design summary for each MDS algorithm class.
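As an illustration of the GPR smoothing described in Section 6.3.3, the following hedged sketch fits a Gaussian process with a squared-exponential (RBF) kernel plus a white-noise term to a noisy series of objective values. It assumes scikit-learn and synthetic stand-in data; it is not the thesis implementation, which learns the hyperparameters from the first trainSize iterations.

```python
# Sketch: smoothing a noisy series of sparse-objective samples with GPR.
# Assumes scikit-learn; the noisy series here is synthetic stand-in data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

iters = np.arange(1, 31, dtype=float).reshape(-1, 1)
true_curve = 0.5 * np.exp(-iters.ravel() / 6.0)              # idealized convergence curve
noisy_samples = true_curve + 0.02 * np.random.randn(len(iters))

# The squared-exponential (RBF) covariance models smooth transitions between
# adjacent objective values; WhiteKernel models the Gaussian sampling noise.
kernel = 1.0 * RBF(length_scale=5.0) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(iters, noisy_samples)

smoothed, std = gpr.predict(iters, return_std=True)   # std gives the posterior uncertainty
# Convergence would then be checked on 'smoothed' rather than on the raw samples.
```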
6.4 Results

We present the results in terms of a benchmark performance comparison and an assessment of convergence. We first describe the benchmark datasets in detail. We compare the efficiency and quality of Glint instantiations against the standard algorithms in terms of time and stress using these benchmarks. We then discuss convergence issues and demonstrate convergence behavior of each instantiation.

6.4.1 Dataset and Distance Function Description

The molecule dataset contains 661 points representing polymer-based nanocomposites. The distance function is cheap: it is the Euclidean distance metric where the number of dimensions m is 10. We include this dataset as a baseline where the Glint requirements are not met and unmodified algorithms should be employed instead. The 4000 points in the concept dataset are biomedical terms where the distance function to determine their co-occurrence in journal articles requires running database queries. The Flickr dataset contains 1925 images culled from the author's public photo collection, with distances computed using the Earth Mover's Distance (EMD) [93]. The BRDF dataset is an example from the computer graphics literature, where computations involving 100 points representing images use the Euclidean distance function. The number of dimensions m is four million [76]; this function is expensive despite being Euclidean because of the huge number of dimensions. The videogame dataset was created by gathering human judgements in response to a survey of questions about 96 games. While the exact timing information for the judgments was not reported [71], our conservative estimate is that the sum of the response times of the human participants took an average of 10 seconds for each pairwise comparison. Table 6.3 summarizes our benchmark distance functions and costs.

d cost (sec)   Distance Calculation         Benchmark
0.00001        Euclidean, m = 10            molecule
0.001          DB Query                     concept
0.01           Earth Mover, 83 signature    flickr
1.0            Euclidean, m = 4M            brdf
10.0           Human Elicited               videogame

Table 6.3: The cost d of a single distance calculation for the benchmark datasets, in seconds rounded to the nearest power of 10. Here m represents the number of dimensions of the input data in the case of using a Euclidean distance function.

6.4.2 Benchmark Speed and Quality Comparison

We validate Glint by comparing the benchmark performance of our implementations against the previous work in terms of speed and quality. Speed is measured in seconds to termination and quality is measured in terms of the full objective function F using the entire distance matrix D. For the force-directed and gradient-based instantiations, F is the full normalized stress function [13]. For the analytic instantiation, F is the full normalized strain function. We compute F only for performance validation; it is never computed in practice. All recorded values are averaged over 5 runs on an Intel Core 2 QX6700 2.66 GHz CPU with 2 GB of memory.

For the original approach in the force-directed and gradient-based performance comparison, we ran the Glimmer and SMACOF algorithms, respectively, with the same ε as used in Glint. For the original approach used in the analytic performance comparison, we know of no algorithms with termination criteria. Instead, we used a human-in-the-loop Pivot MDS setup, where the first author added numDists pivots at a time with a keystroke, and manually halted the process after visually assessing layout convergence.
The Pivot MDS algorithm is unable to handle incomplete distance matrix columns, so we omit the videogame benchmark, which possesses many missing matrix entries, from the analytic results.

Figure 6.2 and Table 6.4 compare the execution time and final layout quality of Glint to the original approaches.

Benchmark        Glint F    Orig. F    Glint Time    Orig. Time    Speedup
Force-Directed
  molecule       0.03       0.03       14            4             0.2
  concept        0.18       0.18       49            1016          20
  flickr         0.08       0.09       2.4K          98K           40
  brdf           0.03       0.04       3K            304K          115
  videogame      0.45       0.45       23K           482K          20
Analytic
  molecule       0.35       0.42       3             23            9
  concept        0.93       0.94       96            63            0.7
  flickr         0.48       0.59       1.2K          2.9K          2
  brdf           0.078      0.233      40K           6K            0.2
Gradient
  molecule       0.01       0.03       360           700           1.9
  concept        0.18       0.18       0.1K          113K          880
  flickr         0.06       0.04       8K            71M           8.8K
  brdf           0.008      0.005      4K            859K          200
  videogame      0.16       0.13       19K           430K          220

Table 6.4: Comparison of full objective functions, time (in seconds), and speedup between Glint instantiations and original MDS algorithms.

The speedup of the force-directed instantiation ranges from 20 to 115 for the costly target cases, while the original Glimmer algorithm is several times faster for the cheap baseline. The main benefit of the fully automatic analytic Glint instantiation is the elimination of the need for manual monitoring and intervention. The Glint instantiation was faster than Pivot MDS with a manual operator in the loop for molecule and flickr, but slower for concept and brdf. The speedup of the gradient-based Glint instantiation is dramatic: several orders of magnitude in the target cases, and a factor of two in the baseline case of molecule where the distance function is cheap.

The quality values for Glint are roughly the same magnitude and variability for each benchmark in the force-directed case. For the analytic instantiation, the quality values are equal or better than the manual Pivot MDS method. In the gradient case, most of the final quality values, except molecule, are slightly worse than the standard approach using the full distance matrix. The gradient Glint instantiation provides a speed and quality compromise between the extremely costly but accurate full gradient approach, and the fast but approximate force-directed Glint instantiation.

6.4.3 Convergence

We illustrate the convergence behavior of each Glint instantiation in Figure 6.3. Each log-scale plot displays two curves: the blue curve represents the value of the full, slow objective function F of the layout after each Glint iteration, while the orange curve shows the value of the smoothed, fast objective S. For those instantiations that employ GPR smoothing, we also plot the random samples used in the regression as gray dots. Similarly, for those instantiations that employ an iterative layout algorithm M, we plot the values of S after each M iteration. As in the benchmark comparison, F is the full normalized stress function for Glimmer and SMACOF, and F is the full strain function for Pivot MDS.

The magnitude of the change in the cheap objective S approximates that of the change in the costly F function. In the case of Pivot MDS, the smoothed S series is slightly offset from the gray random samples due to the effect of using sparsity patterns from the previous iteration. These benchmarks validate the claim that setting ε to a given termination threshold will terminate Glint when the corresponding change in F falls below the threshold, modulo some sampling noise.
Figure 6.2: Comparison of speed (top) and quality (bottom) for the force-directed, analytic, and gradient-based instantiations on the molecule, concept, flickr, brdf, and videogame benchmarks. In each pair, the top blue bar is the original MDS algorithm, and the bottom orange bar is the Glint instantiation. The black lines indicate 95% standard error bars.

Figure 6.3: Log-scale Glint convergence curves on each instantiation generated using the brdf dataset, plotting stress against force-directed iterations, strain against analytic iterations, and stress against gradient-based iterations. The orange S curve is derived from the noisy grey samples. S is designed to match the convergence behavior of the costly F series in blue.

Chapter 7
Conclusion and Future Work

We conclude the thesis by summarizing the finer points of the different thesis chapters. We then describe a set of future research directions that build upon and expand our work. We draw the thesis to a close with a set of lessons learned from our research.

7.1 Conclusions

Difficulties arise when dimensionality reduction is applied to the important use-cases of non-expert users, document data, and costly distance functions. In this thesis we identified the obstacles associated with each of these cases and explored ways to address the underlying problems. For non-expert users of dimensionality reduction, we identified the need for two kinds of user guidance, local and global, and designed a system, DimStiller, that encapsulates both. In the case of document data, we have identified the mostly-disconnected property of the data and its connection with query algorithms from Information Retrieval. We then presented algorithms for high-dimensional analysis, including dimensionality reduction, that take advantage of the mostly-disconnected property for improved efficiency and accuracy. For the case of costly distance functions, we identified an inefficiency in the design of multidimensional scaling algorithms with respect to this case and presented an algorithm framework, Glint, to minimize total running time without a penalty to output quality.

7.1.1 DimStiller

In Chapter 3, we presented DimStiller, a data analysis system that uses a set of abstractions to structure and navigate dimensional analysis and reduction: data resides in tables, operators modify and visualize tables, expressions chain together operators, and workflows permit pattern re-use. DimStiller uses these mechanisms to provide both local and global guidance through the analysis space of possible data tables. In both of the presented case studies in Section 3.4, we showed how individual operators probe the input dimensions and produce values like variance, correlations, and principal components, and visualizations such as scree plots and scatterplots.
It is the data analyst?s task to turn these quantitative figures into an-swers to qualitative questions about the data. DimStiller builds target users? trustin these answers by providing an intuitive pipeline architecture that visually guidesusers through making algorithm and parameter choices.7.1.2 Algorithms for the Visual Analysis of MoDisco DataChapter 4 showed how the MoDisco property of real-world document datasets isan important consideration in the design of algorithms for data analysis. This prop-erty has not been previously addressed in the visualization literature; we generalizealgorithms and data structures originally designed for information retrieval for vi-sualization applications. MoDisco data has both sparse vector columns and rows;this property has important implications for the design of search query algorithms.We showed how impact-ordered query algorithms implicitly use this property tocompute the higher-scoring queries faster. When data analysis algorithms, suchas dimensionality reduction, target TF-IDF document data, they can leverage theMoDisco property to efficiently compute important data structures for data explo-ration: nearest-neighbor sets, distance matrices, cluster trees, and 2D layout coor-dinates. We then presented three scalable algorithms for computing each of theseimportant data structures, with results that improve over the state-of-the-art.7.1.3 OverviewIn Chapter 5, we presented the Overview prototype application for this task thatcombines a hierarchical clustering view and a traditional MDS view. We vali-113dated the application with two complex, real-world datasets from the journalismdomain: subsets of the diplomatic cables and Afghan war logs from WikiLeaks.The Overview prototype application allowed direct comparison of different itemencoding, clustering, and layout algorithms with respect to each other and a fixedset of human-assigned tags, opening up a line of research into the semantic relation-ships between these different algorithmic stages in the visual exploration pipeline.7.1.4 GlintChapter 6 illustrated how expensive distance calculations change the efficiency ofexisting MDS algorithms like Glimmer and SMACOF. Such algorithms computemore distances than are required for an existing quality of layout, while analyticalgorithms require manually tuning the number of distances to compute as an inputparameter. We solve both these problems with Glint, an algorithm framework withthree components: a distance matrix densification strategy DS, an algorithm M,and an inexpensive objective measure S. Given these components, Glint samplesdistances from the distance matrix in fixed batches, updating the low-dimensionallayout with new information until the layout quality converges. We showed howcareful design of termination criteria can overcome the noise effect of random sam-pling on convergence. We presented and validated Glint instantiations for threeseparate types of previous MDS algorithms: the force-directed Glimmer, the ana-lytic Pivot MDS, and the gradient-based SMACOF.The Glint instantiations presented give essentially equivalent layout quality inall cases. The analytic instantiation was roughly equal in time performance to PivotPDS, with some cases of speedup and some of slowdown; the main contribution ofGlint in this situation is to remove the need for manual monitoring and interven-tion. 
The iterative instantiations showed substantial speedups against Glimmer andSMACOF in all of our target cases with costly distance functions, ranging from 20to 115 with the force-directed Glint instantiation and from 200 to 8800 with thegradient-based Glint instantiation.1147.2 Future WorkWe now present a series of stepping-off points for future research based on thecontents of this thesis.7.2.1 User GuidanceA form of guidance not yet explicitly provided by DimStiller is to help the userfully understand the inner workings of the supported operators. For example, userscould be given visual feedback highlighting relevant regions of scree plots in thereduce operator control panels. While the DimStiller architecture should in theorysupport this kind of targeted exploration, it would require significant future workto design a system that truly sheds light on all black-box algorithms. Anotherdirection of future work will be to add more encoding and interaction techniquespreviously shown to be effective for high-dimensional analysis, for example sortingthe Collect operator matrix view for the rank by feature capability suggested bySeo and Shneiderman [100].7.2.2 Efficient DR with Costly DistancesThe Glint system specifically targeted MDS algorithms. As discussed in sec-tion 2.1, there are other types of dimensionality reduction algorithms, like proba-bility based methods, that also rely on distance information. It would be interestingwork to apply the Glint framework to these other types of DR algorithms.Another interesting avenue of future work is to explore different types of densi-fication strategies. The examples we provide in Glint all use uniform sampling, buta more sophisticated sampling based on proximities of points in low-dimensionalspace might also be employed to good effect.7.2.3 Mostly Disconnected DataThere are several different avenues for possible future work with respect to thethe analysis of mostly-disconnected data. One is to determine if data other thanterm-vector databases can exhibit the MoDisco property. We conjecture that us-ing an IDF-like transformation of high-dimensional vector data in other domainscould induce MoDisco structure for great effect. Another area for future work is115modification of the APQ nearest-neighbor algorithm to include other innovationsfrom query algorithms like impact transformations of the input data and accumu-lator limiting and pruning [2]. It may also be fruitful to devise ways to incorporateefficient use of truncated distance matrices and inverted files into other clusteringalgorithms like k-medoids [91].7.3 Lessons LearnedShifting the discussion to a higher level of abstraction, we end the thesis with adiscussion of some of the overarching lessons learned about conducting interdisci-plinary research in the course of writing this thesis.Many of the projects in this thesis involve bringing together techniques fromdifferent disciplines. For example, Chapter 4 draws connections between Infor-mation Retrieval and Visualization. The Overview prototype combine ideas fromdata mining, visualization, and data journalism. Glint attempts to build an algo-rithm framework that crosscuts algorithm work in Visualization and Statistics. Di-mensionality Reduction itself is a subject of interest in several different researchcommunities: Statistics, Machine Learning, Data Mining, and Visualization. 
Eachof these disciplines functions like a lens, focusing on those aspects of algorithmdesign most beneficial to their own particular objectives.Below, we first relate the experience of drawing connections between the algo-rithmic foci of two different research communities. Then, we discuss what factorsof an interdisciplinary collaboration can aid in the dissemination of research, lead-ing to greater impact.7.3.1 Making an Algorithmic ConnectionRevealing algorithmic connections between High-dimensional data analysis andInformation Retrieval was one of the more challenging and intellectually satisfyingaspects of this thesis. In this section, we recount the somewhat roundabout storyof how these connections were made. It is our hope that this story functions asan informative exercise in revealing how research across disciplines is often a mixof wrong-turns, back-tracking, and (hopefully) breakthroughs. This contrasts withthe standard academic presentation of work as a finished product resulting from a116set of ordered, logical conclusions. We then discuss how a multifaceted researchprocess can actually work to create multiple connections between different fieldsand enrich the underlying algorithms developed along the way.Glimmer and Sparse DocsOne of the more compelling set of results produced from our validation of theGlimmer MDS algorithm was from the docs dataset [57]. This dataset, referredto as metacombine in Chapter 4, was our first encounter with analyzing term-vector data using techniques from high-dimensional analysis. The discrepancybetween the docs result from PivotMDS, a Classical Scaling algorithm, and theresult from Glimmer, a Distance Scaling algorithm, motivated an in-depth analysisof when Distance Scaling algorithms are more appropriate than Classical Scalingalgorithms. Our intuition was that docs was so intrinsically high-dimensional,that its structure was largely simplicial. Therefore, linear projection techniques likeClassical Scaling were inappropriate for producing dimensionally reduced layoutsbecause much of the data variance was orthogonal.Power Transformations and Local Graph LayoutOur success with using Glimmer to produce large document dataset layouts withvisible clustering attracted our collaborator, Jonathan Stray, who was interested inproducing the tool that would become Overview. We began experimenting withusing Glimmer on what would become the warlogs dataset in Chapter 5. Theseexperiments revealed that improved cluster structure could be observed with powertransformations of the distance function, as described by Buja et al [16]. But onetroubling visual artifact of using Glimmer on our data was the persistence of anoverall circular layout, inside which the clusters were being forced together. Fur-ther analysis of the data revealed this circularity to be an artifact of the predomi-nance of unit length distances in the data.Unit distances are a result of the cosine distance between two data points shar-ing no features. Semantically, though, the purpose of a unit distance is actuallyto imply no connection at all, not a connection of a specific length as was be-ing reported by the cosine distance function. Our next strategy was to attempt to117visually encode the absence of a connection between documents. This encodingwas achieved by replacing unit length distances with infinite distances and employa Box-Cox transformed version of the stress function as specified by Chen andBuja [22]. 
But this strategy of replacing distances led to many practical compli-cations: the data often broke into multiple disconnected components which wouldrepel each other in the layout, the energy model optimization was prone to get-ting stuck in inferior local minima, and the Box-Cox force model itself had manysensitive parameters.Contour Trees and DiscoTreesAt this point, we shifted our focus away from finding a better layout engine, and to-ward visually encoding the different densities of points in high-dimensional space.Focusing on densities was directly informed by the observation that removing allunit length ?edges? in the graph constructed from the distance matrix often resultedin meaningful disconnected components. The disconnected components were infact densities of points separated by an appreciable distance in terms-space. If den-sities separated by unit-distance were meaningful, we surmised that many otherdistances could produce meaningful decompositions. We set about answering thequestion of whether there are even more interesting component decompositions ofthe data determined by gradually relaxing the distance threshold.In our first analysis of the problem, we considered the Contour Tree [18] ofthe distance field. If one imposes a distance field over the high-dimensional dataspace, then the value at each point in space is determined by its distance to thenearest data point. Our unit-length decomposition of the data then resulted fromthe disjoint sets of points contained in the iso-contours of the distance field at athreshold of 1. The Contour Tree then represents all possible sets of componentsproduced by the union of all iso-contour values between one and zero. We then setabout computing and visualizing this decomposition.The DiscoTree described in Chapter 5 is the result of an approximate algorithmto calculate this Contour Tree. Our DiscoTree algorithm hinged upon the observa-tion that the maximum distance between points within a contour is bounded by thevalue of the contour [59]. By comparing the Contour Tree generated by this pro-118cess to other hierarchical decompositions of points, we then concluded that thereis an isomorphism between this Contour Tree of a scalar distance field and thehierarchy produced by Single-Link clustering.Single-Link to Impact-Ordered Inverted FilesRealizing that our DiscoTree algorithm efficiently computes a Single-Link hierar-chy, we set about comparing our strategy to previous work computing single-linkhierarchies. This comparison revealed a subset of clustering work using inverted-file index structures when clustering documents [81]. Rather than simply compareagainst this work, we examined current techniques for building and processing in-verted files [137]. It was in this deeper comparison that we were able to drawthe connection between the approximations we made in our DiscoTree algorithmwith approximations made by impact-ordered, inverted-file indices. The connec-tion between inverted files and high-dimensional space then permitted us to connectmany of the previously-visited dots and suggest the algorithm design implicationsin Chapter 4.Making Connections: Breadth-First vs. Depth-First ResearchIn developing new techniques, there is a tension between a breadth-first and depth-first approach to researching a solution to a problem. In the breadth-first approach,a researcher constructs an abstract description of their problem and then makes abroad survey of possible techniques, selecting the best overall fit for the solution athand. 
Making Connections: Breadth-First vs. Depth-First Research

In developing new techniques, there is a tension between a breadth-first and a depth-first approach to researching a solution to a problem. In the breadth-first approach, a researcher constructs an abstract description of their problem and then makes a broad survey of possible techniques, selecting the best overall fit for the solution at hand. In the depth-first approach, the researcher picks a familiar technique, developing and deepening its features until it is appropriate for solving the underlying problem. Both approaches have their merits and risks. Breadth-first connects the problem to a tried-and-true solution to a similar problem, but at the risk of making only a superficial connection between problem and algorithm. Depth-first often leads to novel work, pushing the frontiers of a technique into new venues, but at the risk of being inferior to algorithms from alternative fields.

Our experience shows there may be benefit in combining both breadth and depth. For example, our modifications to Glimmer and to the Contour Tree algorithm were a depth-oriented approach, taking spatial approaches from visualization and computer graphics and deepening them to better analyze high-dimensional term-vector data. Simultaneously, drawing connections across dimensionality reduction, hierarchical clustering, and information retrieval was breadth-oriented, selecting algorithms appropriate for the tasks of layout generation, grouping, and querying respectively. Using these two styles simultaneously allowed us to organize a set of disparate fields together (breadth) and then draw low-level connections between them (depth), ultimately leading to deeper insights and improved results.

7.3.2 Research Impact and Collaboration

The other high-level lesson learned in the course of developing this thesis concerns the benefits of cross-domain collaboration, especially with respect to research impact. Real-world impact, where non-experts derive benefit from the insights or results of research, can be difficult to achieve. A paper by Wagstaff [121], addressing research impact in Machine Learning, suggested that impact is greatly affected by follow-through, loosely defined as both "Publicize results to relevant user community" and "Persuade users to adopt technique." In this section, we discuss our experiences with collaborators in developing software systems, and how a collaborator's own goals can have a significant effect on the impact of one's research. Sedlmair et al. have presented a detailed approach to collaborator selection during the course of visualization design studies [99]. Our discussion here focuses specifically on impact through user adoption, and can be considered a complementary set of suggestions.

DimStiller Collaboration

In Chapter 3, we presented a high-dimensional analysis tool, DimStiller, aimed at providing non-expert data analysts with guidance in using sophisticated tools and workflows. After developing this software, we collaborated with different researchers and elicited their questions and analysis needs in response to the tool. The motivations and goals of these collaborators were focused on solving their domain analysis problems under aggressive time constraints. For them, taking part in an iterative design process might do more harm than good if the tool does not have immediate benefits to their work process. Our collaboration with these domain analysts led to insights in other research projects [98], but did not lead to further development of the tool, or to any popular adoption of the DimStiller software.

Overview Collaboration

Our collaboration with Jonathan Stray on the Overview project contrasts with the DimStiller collaboration. In particular, as a Knight Foundation grant recipient, he was motivated to see our collaboration succeed.
In addition, his profile was aligned with two aspects of follow-through highlighted by Wagstaff: publicizing results, and evangelizing the technique for adoption. That is, he had access to high-profile venues, such as the PBS Idea Lab website (http://www.pbs.org/idealab/2013/04/how-a-computer-can-organize-thousands-of-documents-for-a-reporter110.html, last accessed July 17th, 2013), and he was professionally well-connected enough with our target group to make headway evangelizing the technique to real journalists. As a result, we were able to see the tool put to use by journalists outside of our collaboration. We detailed some of this use in Section 5.4 of the thesis.

Ingredients of High-Impact Collaboration

The Overview project thus presents a study in some of the right ingredients for high-impact research. We summarize these ingredients as:

- Collaborator has a stake in the positive outcome of the project.
- Collaborator has a noted platform for publicizing good results.
- Collaborator is well-connected to high-profile users in the community.

Our collaborations on the DimStiller project possessed none of these ingredients, making user adoption through those specific collaborations unlikely. In our experience, if the ultimate goal of a project is community adoption of a tool, we surmise that project collaborations without these ingredients face an uphill struggle. We do not claim that these factors are either necessary or sufficient for high-impact research; to do so would cynically suggest seeking out only community leaders for any collaboration. But we do mean to imply that, in our experience, the presence of any or all three of these facets can go a long way toward seeing greater adoption rates in a target community.

Bibliography

[1] C. Aggarwal and C. Zhai. A survey of text clustering algorithms. Mining Text Data, pages 77–128, 2012.
[2] V. Anh and A. Moffat. Impact transformation: effective and efficient web retrieval. In Proc. ACM Conf. Information Retrieval (SIGIR), pages 3–10. ACM, 2002.
[3] V. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with effective early termination. In Proc. ACM Conf. Information Retrieval (SIGIR), pages 35–42. ACM, 2001.
[4] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, 1986.
[6] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, pages 131–140, 2007.
[7] R. Becker and W. Cleveland. Brushing scatterplots. Technometrics, 29(2):127–142, 1987.
[8] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14:585–591, 2001.
[9] R. Bellman. Adaptive Control Processes: A Guided Tour, volume 4. Princeton University Press, 1961.
[10] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25–71, 2006.
[11] D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[12] C. Böhm and H. Kriegel. A cost model and index architecture for the similarity join. In Conf. on Data Engineering (ICDE), pages 411–420. IEEE, 2001.
[13] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, 2nd edition, 2005.
[14] I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 2005.
[15] U. Brandes and C. Pich. Eigensolver methods for progressive multidimensional scaling of large data. In Graph Drawing, volume 4372, pages 42–53. Springer, 2007.
[16] A. Buja and D. Swayne. Visualization methodology for multidimensional scaling. Journal of Classification, 19(1):7–43, 2002.
[17] A. Buja, D. Swayne, M. Littman, N. Dean, H. Hofmann, and L. Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, 17(2):444–472, 2008.
[18] H. Carr, J. Snoeyink, and U. Axen. Computing contour trees in all dimensions. Computational Geometry: Theory and Applications, 24(2):75–94, 2003.
[19] M. Chalmers. A linear iteration time layout algorithm for visualising high dimensional data. In Proc. IEEE Visualization, pages 127–132, 1996.
[20] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23:343–351, 2010.
[21] L. Chen and A. Buja. Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis. Journal of the American Statistical Association, 104(485):209–219, 2009.
[22] L. Chen and A. Buja. Stress functions for nonlinear dimension reduction, proximity analysis, and graph drawing. Journal of Machine Learning Research, 14:1145–1173, 2013.
[23] Y. Chen, L. Wang, M. Dong, and J. Hua. Exemplar-based visualization of large document corpus. IEEE Transactions on Visualization and Computer Graphics, 15(6):1161–1168, 2009.
[24] D. Cook and D. Swayne. Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Springer, 2007.
[25] K. A. Cook and J. J. Thomas, editors. Illuminating the path: the research and development agenda for visual analytics. National Visual Analytics Center, 2005.
[26] F. Crestani and S. Wu. Testing the cluster hypothesis in distributed information retrieval. Information Processing & Management, 42:1137–1150, 2006.
[27] W. Cui, S. Liu, L. Tan, C. Shi, Y. Song, Z. Gao, X. Tong, and H. Qu. TextFlow: Towards better understanding of evolving topics in text. Proc. IEEE Symp. Information Visualization (InfoVis), 17(12):2412–2421, 2011.
[28] N. de Freitas, Y. Wang, M. Mahdaviani, and D. Lang. Fast Krylov methods for N-body learning. In Advances in Neural Information Processing Systems, pages 251–258, 2005.
[29] J. de Leeuw and P. Mair. Multidimensional scaling using majorization: SMACOF in R. Journal of Statistical Software, 31(3):1–30, August 2009.
[30] V. de Silva and J. Tenenbaum. Sparse multidimensional scaling using landmark points. Technical report, Stanford, 2004.
[31] B. Efron and R. Tibshirani. An Introduction to the Bootstrap, volume 57. CRC Press, 1993.
[32] B. Everitt, S. Landau, M. Leese, and D. Stahl. Cluster Analysis, 5th Edition. John Wiley & Sons, Ltd, 2011.
[33] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.
[34] M. Fontoura, V. Josifovski, J. Liu, S. Venkatesan, X. Zhu, and J. Zien. Evaluation strategies for top-k queries over memory-resident inverted indexes. Proceedings of the VLDB Endowment, 4(12):1213–1224, 2011.
[35] E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
[36] S. France and J. Carroll. Two-way multidimensional scaling: A review. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41(5):644–661, 2011.
[37] E. Gansner, Y. Koren, and S. North. Graph drawing by stress majorization. In Graph Drawing, pages 239–250, 2004.
[38] J. Gillum. Ryan asked for federal help as he championed cuts. http://bigstory.ap.org/article/ryan-asked-federal-help-he-championed-cuts. Accessed: 2013-06-04.
[39] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. Intl. Conf. on Very Large Data Bases (VLDB), pages 518–529, 1999.
[40] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial & Applied Mathematics, Series B: Numerical Analysis, 2(2):205–224, 1965.
[41] J. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3-4):325–338, 1966.
[42] J. Gower. Measures of similarity, dissimilarity, and distance. Encyclopedia of Statistical Sciences, 5:397–405, 1985.
[43] J. Grimmer and G. King. General purpose computer-assisted clustering and conceptualization. Proc. Natl. Acad. Sciences (PNAS), 2010.
[44] P. Groenen and W. Heiser. The tunneling method for global optimization in multidimensional scaling. Psychometrika, 61(3):529–550, 1996.
[45] D. Guo. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Information Visualization, 2(4):232–246, 2003.
[46] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
[47] N. Halko, P. Martinsson, Y. Shkolnisky, and M. Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580–2594, 2011.
[48] J. Heer and M. Agrawala. Software design patterns for information visualization. Proc. IEEE Symp. Information Visualization (InfoVis), 12(5):853–860, 2006.
[49] M. Hein and J. Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. In Proc. Intl. Conf. Machine Learning (ICML), pages 289–296. ACM, 2005.
[50] G. Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems, 15:833–840, 2002.
[51] R. Holbrey. Dimension reduction algorithms for data mining and visualization. University of Leeds/Edinburgh, 2006.
[52] S. Huang, M. Ward, and E. Rundensteiner. Exploration of dimensionality reduction for text visualization. In Proc. Coordinated and Multiple Views in Exploratory Visualization (CMV), pages 63–74, 2005.
[53] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[54] M. Hubert, P. Rousseeuw, and K. Branden. ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1), 2005.
[55] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
[56] S. Ingram. Multilevel multidimensional scaling on the GPU. Master's thesis, University of British Columbia Department of Computer Science, 2007.
[57] S. Ingram, T. Munzner, and M. Olano. Glimmer: Multilevel MDS on the GPU. IEEE Transactions on Visualization and Computer Graphics, 15(2):249–261, 2009.
[58] S. Ingram, T. Munzner, V. Irvine, M. Tory, S. Bergner, and T. Möller. Dimstiller: Workflows for dimensional analysis and reduction. In Proc. IEEE Symp. Visual Analytics Science and Technology (VAST), pages 3–10, 2010.
[59] S. Ingram, T. Munzner, and J. Stray. Hierarchical clustering and tagging of mostly disconnected data. Technical Report TR-2012-01, University of British Columbia Department of Computer Science, May 2012. URL http://www.cs.ubc.ca/labs/imager/tr/2012/modiscotag/.
[60] A. Inselberg and B. Dimsdale. Parallel coordinates. In Human-Machine Interactive Systems, pages 199–233. Springer, 1991.
[61] N. Jardine and C. van Rijsbergen. The use of hierarchical clustering in information retrieval. Information Storage & Retrieval, 7(5):217–240, 1971.
[62] S. Johansson and J. Johansson. Interactive dimensionality reduction through user-defined combinations of quality metrics. Proc. IEEE Symp. Information Visualization (InfoVis), 15(6):993–1000, 2009.
[63] P. Joia, D. Coimbra, J. Cuminato, F. Paulovich, and L. Nonato. Local affine multidimensional projection. IEEE Transactions on Visualization and Computer Graphics, 17(12):2563–2571, 2011.
[64] I. Jolliffe. Principal component analysis. Wiley Online Library, 2005.
[65] F. Jourdan and G. Melançon. Multiscale hybrid MDS. In Proc. Intl. Conf. on Information Visualization (IV'04), pages 388–393, 2004.
[66] M. Khoury, Y. Hu, S. Krishnan, and C. Scheidegger. Drawing large graphs by low-rank stress majorization. Comp. Graph. Forum, 31(3pt1):975–984, June 2012.
[67] A. Krowne and M. Halbert. An initial evaluation of automated organization for digital library browsing. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 246–255. IEEE, 2005.
[68] J. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. American Mathematical Society (AMS), 7(1):48–50, 1956.
[69] J. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
[70] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, 2005.
[71] J. P. Lewis, M. McGuire, and P. Fox. Mapping the mental space of game genres. In Proc. ACM SIGGRAPH Symp. Video Games, pages 103–108, 2007.
[72] Y. Lifshits. Algorithms for nearest neighbor search. Tutorial, Russian Summer School in Information Retrieval (RuSSIR), 2007.
[73] Y. Liu, S. Barlowe, Y. Feng, J. Yang, and M. Jiang. Evaluating exploratory visualization systems: A user study on how clustering-based visualization systems support information seeking from large document collections. Information Visualization, 12(1):25–43, 2013.
[74] H. Luhn. A statistical approach to the mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309–317, 1957.
[75] C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[76] W. Matusik, H. Pfister, M. Brand, and L. McMillan. A data-driven reflectance model. ACM Trans. Graphics (Proc. SIGGRAPH 2003), 22(3):759–769, 2003.
[77] L. Molina, L. Belanche, and À. Nebot. Feature selection algorithms: A survey and experimental evaluation. In Proceedings 2002 IEEE International Conference on Data Mining, pages 306–313. IEEE, 2002.
[78] T. Munzner, A. Barsky, and M. Williams. Reflections on QuestVis: A visualization system for an environmental sustainability model. Scientific Visualization: Interactions, Features, Metaphors, 2:240–259, 2011.
[79] K. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.
[80] F. Murtagh. A very fast, exact nearest neighbor algorithm for use in information retrieval. Information Technology: Research and Development, 1(4):275–283, 1982.
[81] F. Murtagh. Clustering in massive data sets. In Handbook of Massive Data Sets, pages 501–543. Kluwer Academic Publishers, 1999.
[82] P. Oesterling, G. Scheuermann, S. Teresniak, G. Heyer, S. Koch, T. Ertl, and G. Weber. Two-stage framework for a topology-based projection and visualization of classified document collections. In Proc. IEEE Symp. Visual Analytics Science and Technology (VAST), pages 91–98, 2010.
[83] F. Paulovich, C. Silva, and L. Nonato. Two-phase mapping for projecting massive data sets. IEEE Transactions on Visualization and Computer Graphics, 16(6):1281–1290, 2010.
[84] S. Perry and P. Willett. A review of the use of inverted files for best match searching in information retrieval systems. Journal of Information Science, 6(2-3):59–66, 1983.
[85] J. Platt. FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. In Proc. Intl. Workshop on Artificial Intelligence and Statistics, pages 261–268, 2005.
[86] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/. Accessed: 2013-09-27.
[87] R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria, 2008. URL http://www.R-project.org.
[88] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[89] S. Robertson. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[90] G. Ross and M. Chalmers. A visual workspace for hybrid multidimensional scaling algorithms. In Proc. IEEE Symp. Information Visualization (InfoVis), pages 91–96, 2003.
[91] L. Rousseeuw and L. Kaufman. Clustering by means of medoids. Statistical Data Analysis Based on the L1-norm and Related Methods, pages 405–416, 1987.
[92] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[93] Y. Rubner, C. Tomasi, and L. Guibas. The Earth Mover's Distance as a metric for image retrieval. Intl. Journ. Computer Vision, 40(2):99–121, 2000.
[94] D. Russell, M. Stefik, P. Pirolli, and S. Card. The cost structure of sensemaking. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems, pages 269–276. ACM, 1993.
[95] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[96] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[97] J. Sammon. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, 100(5):401–409, 1969.
[98] M. Sedlmair, M. Brehmer, S. Ingram, and T. Munzner. Dimensionality reduction in the wild: Gaps and guidance. Technical Report TR-2012-03, UBC Computer Science, June 2012.
[99] M. Sedlmair, M. Meyer, and T. Munzner. Design study methodology: Reflections from the trenches and the stacks. IEEE Transactions on Visualization and Computer Graphics, 18(12):2431–2440, 2012.
[100] J. Seo and B. Shneiderman. A rank-by-feature framework for interactive exploration of multidimensional data. Information Visualization, 4(2):99–113, 2005.
[101] F. Shahnaz, M. Berry, V. Pauca, and R. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373–386, 2006.
[102] S. Shapiro and M. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.
[103] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.
[104] V. Silva and J. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. Advances in Neural Information Processing Systems, 15:705–712, 2003.
[105] A. Smeaton and C. Van Rijsbergen. The nearest neighbour problem in information retrieval: an algorithm using upperbounds. ACM Special Interest Group on Information Retrieval (SIGIR) Forum, 16(1):83–87, 1981.
[106] P. Sneath. The application of computers to taxonomy. Journal of General Microbiology, 17(1):201–226, 1957.
[107] I. Spence and D. Domoney. Single subject incomplete designs for nonmetric multidimensional scaling. Psychometrika, 39:469–490, 1974.
[108] M. Stone. Color in information display. IEEE Visualization 2006 Course Notes. http://www.stonesc.com/Vis06, Oct 2006.
[109] J. Stray. How Overview can organize thousands of documents for a reporter. http://overview.ap.org/blog/2013/04/how-overview-can-organize-thousands-of-documents-for-a-reporter/. Accessed: 2013-06-04.
[110] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, Dec 22 2000.
[111] D. Ternes and K. MacLean. Designing large sets of haptic icons with rhythm. In Intl. Conf. Haptics: Perception, Devices, and Scenarios (EuroHaptics), pages 199–208. Springer LNCS 5024, 2008.
[112] The MathWorks Inc. MATLAB. Natick, Massachusetts, 2010.
[113] W. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17:401–419, 1952.
[114] M. Tory, D. Sprague, F. Wu, W. So, and T. Munzner. Spatialization design: Comparing points and landscapes. Proc. IEEE Symp. Information Visualization (InfoVis), 13(6):1262–1269, 2007.
[115] M. Tory, C. Swindells, and R. Dreezer. Comparing dot and landscape spatializations for visual memory differences. Proc. IEEE Symp. Information Visualization (InfoVis), 16(6):1033–1040, 2009.
[116] L. van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013. URL http://arxiv.org/abs/1301.3342.
[117] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[118] L. van der Maaten, E. Postma, and H. Van Den Herik. Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10:1–41, 2009.
[119] E. Voorhees. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing & Management, 22(6):465–476, 1986.
[120] J. Wade. TPD working through flawed mobile system. http://www.tulsaworld.com/article.aspx/TPD_working_through_flawed_mobile_system/20120603_11_a1_cutlin136616. Accessed: 2013-06-04.
[121] K. Wagstaff. Machine learning that matters. Proc. Intl. Conf. Machine Learning (ICML), pages 529–536, 2012.
[122] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
[123] M. Ward. XmdvTool: Integrating multiple methods for visualizing multivariate data. In Proc. IEEE Visualization, pages 326–333, 1994.
[124] C. Weaver. Cross-filtered views for multidimensional visual analysis. IEEE Transactions on Visualization and Computer Graphics, 16(2):192–204, 2010.
[125] K. Weinberger and L. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Proceedings of the National Conference on Artificial Intelligence, volume 21, pages 1683–1686. AAAI Press, 2006.
[126] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009.
[127] J. Wise, J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur, and V. Crow. Visualizing the non-visual: spatial analysis and interaction with information from text documents. In Proc. IEEE Symp. Information Visualization (InfoVis), pages 51–58, 1995.
[128] D. Wishart. Mode analysis: A generalization of nearest neighbor which reduces chaining effects. Numerical Taxonomy, 76:282–311, 1969.
[129] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In Conf. on Data Engineering (ICDE), pages 916–927. IEEE, 2009.
[130] J. Yang, W. Peng, M. Ward, and E. Rundensteiner. Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In Proc. IEEE Symp. Information Visualization (InfoVis), pages 105–112, 2003.
[131] J. Yang, M. Ward, E. Rundensteiner, and S. Huang. Visual hierarchical dimension reduction for exploration of high dimensional datasets. In Proc. Eurographics/IEEE Symp. Visualization (VisSym), pages 19–28, 2003.
[132] J. Yang, A. Patro, S. Huang, N. Mehta, M. Ward, and E. Rundensteiner. Value and relation display for interactive exploration of high dimensional datasets. In Proc. IEEE Symp. Information Visualization (InfoVis), pages 73–80, 2004.
[133] J. Yang, D. Luo, and Y. Liu. Newdle: Interactive visual exploration of large online news collections. IEEE Computer Graphics & Applications, 30(5):32–41, 2010.
[134] Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization of neighbor embedding for visualization. In Proc. Intl. Conf. Machine Learning (ICML), pages 127–135, 2013.
[135] P. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 311–321. Society for Industrial and Applied Mathematics, 1993.
[136] H. Zha. Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering. In Proc. ACM Conf. Information Retrieval (SIGIR), pages 113–120. ACM, 2002.
[137] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2):6, 2006.
