UBC Theses and Dissertations


A multimedia interface for facilitating comparisons of opinions Rizoli, Lucas 2009

A Multimedia Interface for Facilitating Comparisons of Opinions

by Lucas Rizoli

B.Cmp.H, Queen’s University, Kingston, 2006

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia
August 2009
© Lucas Rizoli 2009

Abstract

Written opinion on products and other entities (evaluative text) can be important to consumers and researchers, but expensive and difficult to analyze. Efforts at mining for opinions have been successful at collecting the data, but have not focused on making analyses of such data easy. We have developed Hierarchical Evaluative Histogram Explorer (hehxplore), a multimedia interface designed to facilitate the analysis of opinions on multiple entities, particularly the comparison of opinions across entities. hehxplore integrates an information visualization and an intelligent summarization system that selects notable comparisons in opinion data. Used in combination with opinion mining, we believe hehxplore can reduce the time and effort required to explore and use evaluative data.

hehxplore presents data to users in two useful, complementary modes: graphics and text. The visualization is designed to present the aggregated opinions in the data clearly, unambiguously, and in a manner that is easy to learn. The summarization system applies a set of statistics for comparing opinions across entities in order to highlight those that show strong similarities or dissimilarities between entities.

We conducted a study of our interface with 36 subjects. The results of the study showed that subjects liked the visualization overall and that our summarization system’s selections overlapped with those of subjects more than did the selections of baseline systems. Given the choice, subjects sometimes changed their selections to be more consistent with those of our system.
We also used subjects’ selections and our study’s datasets to train new selection systems using machine learning techniques. Using the data collected in our study, these trained systems were able to match subjects’ selections more closely than our statistics-based system.

We describe the design and implementation of hehxplore, describe relevant work, and detail our studies of hehxplore’s systems and their results. We also consider potential future work to be done on hehxplore.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Problem Overview
  1.2 Our Contributions
  1.3 Organization of Chapters
2 Related Work
  2.1 Mining Opinions from Text
  2.2 Summarization
  2.3 Visualizations
  2.4 Task-Based Design
  2.5 Multimedia Interfaces
3 Task Analysis
  3.1 Existing Task Taxonomies
  3.2 Integrated Task Model
  3.3 Usage Scenarios
4 Describing & Comparing Evaluative Data
  4.1 Descriptive Statistics
  4.2 Statistics for Comparisons
    4.2.1 Discretization of Values
5 Design Rationale
  5.1 Visualization of Opinion Data
    5.1.1 Early Prototypes
    5.1.2 Parallel-Coordinate Tree
    5.1.3 Hierarchical Histograms
  5.2 Summarization of Opinion Comparisons
    5.2.1 Content Selection
    5.2.2 Selection of Comparisons to be Mentioned
    5.2.3 Selection of Comparison Aspects to be Mentioned
6 Evaluation
  6.1 Goals
  6.2 Scenario
  6.3 Data Generation
  6.4 Subjects
  6.5 Materials
  6.6 Procedure
  6.7 Method
  6.8 Baseline Systems
  6.9 Machine Learning System
    6.9.1 Training Data
    6.9.2 Machine Learning System Implementation
  6.10 Results
    6.10.1 System Performance
    6.10.2 Usability of the Visualization
  6.11 Discussion
7 Conclusions and Future Work
  7.1 Future Work
  7.2 Conclusions
Bibliography

List of Tables

6.1 The comparison types, the number of aspects mentioned in support (|S|) and contrast (|C|), and the constraints on the (dis)similarity of the aspects of a comparison.
6.2 The summary cases generated as example data for the study. Each specifies a configuration type for opinions overall, one comparison (C1), another comparison (C2), and all other possible comparisons (C...).
6.3 Mean precision, recall, and F-measure of the expected performance of the naïve and semi-informed alternative selection systems, as well as our statistics-based and machine learning systems.
6.4 Subjects’ responses to statements related to components of usability. Subjects could strongly disagree (sd), disagree (d), agree (a), or strongly agree (sa) with a statement, or remain neutral (n), or not respond (nr). The most frequent response for each component is in boldface.

List of Figures

2.1 A plot of opinions on Star Wars: Episode III - Revenge of the Sith as visualized in OpinionReader by Fujii & Ishikawa [14].
2.2 A textual summary of, and treemap visualization of, opinions on the Apex AD2600 dvd player from Carenini, Ng, & Pauls [7].
2.3 Opinion graphs in Opinion Observer by Liu, Hu, & Cheng [23].
2.4 A parallel-coordinate tree in SurveyVisualizer by Brodbeck & Girardin [4].
2.5 An example of ValueCharts+ by Bautista & Carenini [3].
3.1 A map of our integrated task model.
4.1 Partial view of the information extraction and organization process for two products.
4.2 Three distributions with different controversiality scores: 0.0 on the left, 0.612 in the middle, and 1.0 on the right.
4.3 Visualization of thresholds used to discretize values returned by the system’s various similarity functions into very dissimilar (vd), dissimilar (d), similar (s), or very similar (vs). For example, values of counts() greater than 0.6 and less than 0.7 are dissimilar; values of dists() greater than 0.95 are very similar.
5.1 A mock-up of polarity bars, a modification of Opinion Observer (Figure 2.3).
5.2 A p-node tree [40], a modification of the parallel-coordinate tree in SurveyVisualizer (Figure 2.4). Extracts from the original text corpus are visible as bubbles extending from nodes at the bottom of the window.
5.3 Coloured circles and labels used to represent the number of opinions on a set of features in a p-node tree (detail from Figure 5.2).
5.4 A mock-up of a distribution matrix.
5.5 Example of a chart of the opinions on a camera’s Image feature. The grey dot beneath the bars represents the mean of the opinions. Count (#), mean (Avg.), and controversiality (Contro.) are stated explicitly at the top of the chart.
5.6 Example of a stacked bar chart, showing the relationship between charts of opinions on Appearance, Battery, Flash and their parent feature, Digital Camera. Notice that a) axes’ scales are consistent, b) bar colours relate stacked bars to charts of child features, c) explicit opinions on Digital Camera are the bottom most bars in the chart (light grey), d) Digital Camera contains bars of other child features not shown (e.g. Image, Video).
5.7 Screenshot of our interface displaying generated opinion data on digital cameras from two fictional manufacturers.

Acknowledgements

The author would like to thank Jackie CK Cheung and Ivan Zhao for their contributions. Thanks go to Gabriel Murray, Jan Ulrich, Emtiyaz Khan, and Mark Schmidt for their company and advice; to Sandro, Michele, and Eric, the author’s family, for their support. Most importantly, the author would like to thank Giuseppe Carenini and Raymond Ng for their guidance, patience, and help.

Chapter 1

Introduction

1.1 Problem Overview

There is a lot of written opinion on products, services, and other entities available in online reviews, blogs, and the like. In marketing literature, this is known as electronic word-of-mouth (abbreviated to ewom in [30]); in computational linguistics, as evaluative text. The opinions expressed in such text can be of great use to many people and organizations. Consumers use others’ opinions to help make decisions as to what they should purchase or support [10, 32]; marketers, designers, and manufacturers can use similar information to study consumer opinion [15] and make forecasts. This valuable information is available, but analyzing it can be very difficult, costly, and time-consuming [16].

There have been many efforts to mine opinions from evaluative text automatically [28].
These have made opinion data available, but the means of understanding and analyzing it effectively are lacking. Tools capable of generating textual summaries and graphical representations exist (for example, Opinion Observer [23] and Treemaps [7]), but they are limited either in supporting only the analysis of opinions about a single entity or in providing rather simple graphical interfaces. In contrast, the work presented in this thesis aims to support the comparison of opinions on multiple entities through a multimedia interface. Such comparisons are key in tasks such as competitive analysis [11], or deciding among alternatives to make a purchase.

1.2 Our Contributions

We present Hierarchical Evaluative Histogram Explorer (shortened to hehxplore), a novel multimedia interface for the analysis of opinion data, particularly the comparison of opinions across two entities. We aim to facilitate the analysis and comparison of opinion data, to allow users to leverage large collections of opinion data and come to actionable conclusions. In order to do so, hehxplore employs two methods of presenting opinion data mined from text: a visualization of the data and an intelligent textual summary of notable comparisons in the data. We believe these are two meaningful contributions to the ongoing work on opinion mining and analysis.

The visualization of opinion data represents our first major contribution. Our visual representation was designed by following a task-based approach [41] to facilitate the analysis of opinion data, and to simplify comparisons among such data within and across features, as well as across entities.

The second and key contribution of this thesis is the content selection strategy that selects notable aspects of the opinion data to be presented in a textual summary.
Our strategy is based on a set of statistics that we argue can be used to characterize the (dis)similarity of opinions on a feature across entities (for example, on the Flash feature of two cameras). These statistics include the difference in the number of opinions on features of the entities, the difference of their mean opinions, the divergence of their distributions, and the level of controversy in the opinions. In essence, our content selection strategy says that only features that are very (dis)similar with respect to these statistics should be included in the summary.

These two subsystems, the visualization and the summarizer, work on the same set of opinion data, but present information that we believe is complementary rather than redundant. In employing both, hehxplore is a multimedia interface, providing users with the benefits of both a graphical presentation of the data and a textual summary of notable comparisons in the data.

Our third major contribution is a user study, conducted with 36 subjects, of our visualization and content selection strategy. The usability of the visualization was evaluated by questionnaire, while the content selection strategy was tested by comparing subjects’ and system selections from the same sets of opinion data, and by verifying whether revealing the system selections to the subjects prompted them to revise their initial selections to make them consistent with the system’s. The results of this study indicate that subjects found the visualization accurate, easy to learn and read, and satisfying; that subject and system selections overlap; and that, when shown the system’s selections, subjects often revised their own initial selections to make them more consistent with the system’s.

We then used the data collected in our user study to investigate an alternative content selection strategy, one employing machine learning techniques instead of the (dis)similarity statistics used in our original strategy.
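
For concreteness, the statistics-based strategy referred to above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the opinion scale, the three similarity functions, the thresholds, and the rule "keep a feature if any statistic is labelled very similar or very dissimilar" are all hypothetical stand-ins (the statistics actually used are defined in chapter 4), and controversiality is left out.

```python
from collections import Counter
from math import log2

# Assumed opinion scale: polarity and strength from -3 to +3, with no zero.
SCALE = (-3, -2, -1, 1, 2, 3)

def count_similarity(a, b):
    """Ratio of the smaller number of opinions to the larger (1.0 = equal)."""
    return min(len(a), len(b)) / max(len(a), len(b))

def mean_similarity(a, b):
    """1 minus the difference of the mean opinions, scaled to [0, 1]."""
    span = max(SCALE) - min(SCALE)
    return 1.0 - abs(sum(a) / len(a) - sum(b) / len(b)) / span

def distribution_similarity(a, b):
    """1 minus the Jensen-Shannon divergence (base 2) of the two histograms."""
    hist_a, hist_b = Counter(a), Counter(b)
    divergence = 0.0
    for value in SCALE:
        p, q = hist_a[value] / len(a), hist_b[value] / len(b)
        mid = (p + q) / 2
        if p:
            divergence += 0.5 * p * log2(p / mid)
        if q:
            divergence += 0.5 * q * log2(q / mid)
    return 1.0 - divergence

def discretize(value, thresholds=(0.25, 0.5, 0.75)):
    """Map a similarity in [0, 1] to vd, d, s, or vs (hypothetical thresholds)."""
    very_dissimilar, dissimilar, similar = thresholds
    if value < very_dissimilar:
        return "vd"
    if value < dissimilar:
        return "d"
    if value < similar:
        return "s"
    return "vs"

# Hypothetical names for the statistics compared across two entities.
STATISTICS = {
    "counts": count_similarity,
    "mean": mean_similarity,
    "distribution": distribution_similarity,
}

def select_notable(entity_a, entity_b):
    """Return, for each shared feature, the statistics labelled vd or vs.

    Each entity maps a feature name to a non-empty list of opinion scores.
    Features with no extreme label are left out of the result.
    """
    selected = {}
    for feature in entity_a.keys() & entity_b.keys():
        a, b = entity_a[feature], entity_b[feature]
        labels = {name: discretize(fn(a, b)) for name, fn in STATISTICS.items()}
        extreme = {n: lab for n, lab in labels.items() if lab in ("vd", "vs")}
        if extreme:
            selected[feature] = extreme
    return selected
```

Calling `select_notable` with two dictionaries of per-feature opinion scores returns only the features that have at least one extreme label, together with the statistics responsible. A single threshold triple is used for every statistic to keep the sketch short; the caption of Figure 4.3 indicates that the thesis uses thresholds specific to each similarity function.
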
We found that, using the data from our user study, a machine-learning-based content selection system, tested in a leave-one-out experiment, behaved more like the study subjects than our statistics-based system.

1.3 Organization of Chapters

In chapter 2, we review related work in the field of natural language processing on opinion mining and summarization, in information visualization on visualizing opinion data, and in multimedia interfaces on the integration of text and graphics. In chapter 3, we describe the tasks and the data for which our interface is designed. Next, in chapter 4, we detail the kind of opinion data on which our system operates, as well as the statistics it calculates in order to organize and describe it. In chapter 5 we describe our interface: the visual representation of opinion data and the system that selects comparisons and summarizes them. We report on our user study and discuss the results in chapter 6. We conclude in chapter 7 by identifying the strengths of our method and areas for future work.

Chapter 2

Related Work

2.1 Mining Opinions from Text

There has been a substantial amount of work on opinion mining¹ [28]. Systems for opinion mining have been used to classify entire documents by their overall opinion (e.g. [38] and [29]) as well as to extract opinions on features of a specific entity or topic (e.g. [26] and [21]). The first application is centred primarily on documents, and tries to answer questions like “Is this a positive or negative review?” The second is centred on entities or topics, and so can involve a number of documents, and answers “What features of this product did the reviewer like or dislike?” As our interest is in comparing opinions on multiple entities, we focus primarily on mining centred on entities.

¹ This area of research is known by many names such as opinion mining, review mining, sentiment analysis, and subjectivity analysis.

When used to gather opinions on entities or topics, the problem of extracting opinions can be broken down into the following subproblems²:

1. identifying the features on which opinions are expressed in the text,
2. identifying the opinions on those features,
3. determining the polarity of those opinions (positive, negative, or neutral),
4. and determining the strength of those opinions.

² This list is adapted from [31].

Not every opinion mining system addresses all of the above problems (for example, the system in [26] focuses on the overall reputation of an entity rather than opinions on its features, and so does not identify features, only phrases expressing opinions on the target entity).

The means by which these problems are addressed vary from system to system. A number of measures and statistics (such as point-wise mutual information in [38]), machine learning methods (such as relaxation labelling in opine [31]), and data (such as WordNet [12]) have been used. Work by Hu & Liu on feature and opinion identification [21] and determining the polarity of opinions [19] employs unsupervised methods, small annotated datasets, and WordNet to address these subproblems. While this and similar systems are effective, they often result in unwieldy, repetitive sets of features and opinions expressed on them.

The system for mining opinions we relied on for this thesis is detailed in [8]. It builds on previous work in order to improve on the number and arrangement of features. In this system, features mined using unsupervised methods are brought together with domain-specific, user-supplied feature sets. This is done through a matching algorithm that employs term similarity metrics in order to match mined features to user-supplied features.
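
To make the matching step concrete, the following is a minimal sketch of assigning mined feature terms to user-supplied features. It does not reproduce the algorithm in [8]: the similarity metric (Python's `difflib.SequenceMatcher` ratio, a purely string-based stand-in), the 0.6 threshold, and the function names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def term_similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a real term similarity metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_features(mined, user_defined, threshold=0.6):
    """Group mined terms under the most similar user-supplied feature.

    Terms whose best similarity falls below the (hypothetical) threshold
    are left unmatched and do not appear in the result.
    """
    matches = {}
    for term in mined:
        best = max(user_defined, key=lambda feature: term_similarity(term, feature))
        if term_similarity(term, best) >= threshold:
            matches.setdefault(best, []).append(term)
    return matches
```

For example, given the mined terms "batteries", "battery life", "flashes", and "shipping" and the user-supplied features "Battery", "Flash", and "Image", the first two terms are grouped under "Battery", the third under "Flash", and "shipping" is left unmatched. A metric based on word meaning rather than spelling would be needed to match terms such as "picture" and "Image".
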
This system combines the faster, automatic mining methods with users’ domain knowledge and expectations, resulting in organized datasets without needless redundancy that contain features, opinions, and opinion strengths expressed in the original text.

Though most systems mine for opinions on a single entity, our system allows for opinions on multiple entities to be compared along shared and similar features. The combination of user-defined feature sets is important in hehxplore for this reason. By performing similarity matching between the sets of features mined from corpora of evaluative text on various entities, it is possible to create a shared feature set, and to compare mined data along these shared features.

2.2 Summarization

Having mined the opinion data from the original text, it is often necessary to present it to a human user in an accessible and useful manner. Many opinion mining systems not only extract evaluative data from text, but include summarizers that organize and summarize the data as text. The form of the summaries they produce varies: some are lists of pros and cons (e.g. [21]), others a collection of representative sentences extracted from the original text (e.g. mead* in [5]) or even generated arguments (e.g. sea in [5], [6]).

In order to summarize the opinions, it is necessary to select or prioritize the data that will make up the content of a summary. Methods of content selection vary. Some rank content by the number of features, strength of the opinions, or frequency of the opinion on a feature [21]. Others use more complicated measures of importance ([5]), as well as other characterizations of the data ([6]), to decide on features and opinions to summarize.

Most of the summarization systems are meant to summarize opinions on a single entity. Our summarization system is different, meant to summarize notable similarities and contrasts between opinions on two entities.
As its goals differ from those of existing systems, its content selection strategy differs as well.

2.3 Visualizations

In addition to textual summaries, opinion data is also presented graphically. There are a number of visualizations of mined evaluative data³, though most are not meant as tools for analysis.

³ The affect plots in in-spire [17] are somewhat similar, though they visualize affect and not evaluations.

Figure 2.1: A plot of opinions on Star Wars: Episode III - Revenge of the Sith as visualized in OpinionReader by Fujii & Ishikawa [14].

OpinionReader [14] (Figure 2.1) arranges labels in a scatter plot, where the horizontal axis represents the combined polarity and strength of opinion, and the vertical axis, the frequency of those opinions. The purpose of OpinionReader is to summarize the expressed pros and cons of an entity or topic.

Carenini, Ng, & Pauls [7] (Figure 2.2) summarize opinions on a single entity by visualizing them in treemaps. The data visualized are different from those used by OpinionReader: opinions of varying strength and polarity on a hierarchy of features.

Figure 2.2: A textual summary of, and treemap visualization of, opinions on the Apex AD2600 dvd player from Carenini, Ng, & Pauls [7].

Though our system builds on their work, it differs in its visual representation, which does not require significant interaction in order to observe the distribution of opinions on a feature. hehxplore also differs in that it is meant to facilitate comparison across multiple entities, rather than detail opinions on a single entity.

Opinion Observer [23] (Figure 2.3) presents opinions on multiple entities. It displays the number of positive and negative evaluations as bars extending from a baseline, coding entity by colour. This allows for side-by-side visual comparison of opinions on a single feature across two or more entities.
The data visualized by Opinion Observer is different from ours in that it does not include the strength of opinions, nor does it arrange features into a hierarchy.

Figure 2.3: Opinion graphs in Opinion Observer by Liu, Hu, & Cheng [23].

Opinion Observer also differs from our interface in that it does not attempt to identify and describe notable comparisons, nor does it allow multiple levels of analysis (single feature and the entity overall) to be visualized simultaneously.

Though not a visualization of evaluations extracted from evaluative text, SurveyVisualizer [4] (Figure 2.4) is designed to display similar opinion data. It does so using a number of parallel-coordinate plots arranged in a hierarchy. Data from previous years’ survey results (or, conceivably, of opinions on multiple entities) are represented by different lines, differentiated from each other by colour and thickness. Though the source and type of the data visualized in SurveyVisualizer (surveys) are different from those of our system, we did attempt to adapt its hierarchical parallel-coordinates to our tasks and data [40] (see subsection 5.1.2).

ValueCharts+ [3] (Figure 2.5) displays the calculated evaluations on each entity according to valuation functions on their features. Despite being designed for a different set of tasks (preferential choice), many of the goals of ValueCharts+ are like those of hehxplore: to facilitate the comparison and selection of entities based on evaluations of their features. The visual representation is also similar, a variation on stacked bar charts.

Figure 2.4: A parallel-coordinate tree in SurveyVisualizer by Brodbeck & Girardin [4].

2.4 Task-Based Design

In order to guide and evaluate the design of a system, it is necessary to understand what tasks that system aims to perform or facilitate. The task-based approach to visualization [41] and interface design [22] focuses on the uses to which a tool will be put.
This differs from a data-type approach, which aims to create a representation best suited to the nature of the data (e.g. multidimensional, numeric). A design produced by the data-type approach may not be ideal for the purposes users will put it to, since “the utility of a presentation is linked to the nature of the task” rather than to the data type [9].

The task-based approach has been taken in a number of visualization designs, including automated systems such as boz [9] and improvise [41], and user-controlled systems for preferential choice [3] and meteorology [35]. hehxplore followed a task-based design.

2.5 Multimedia Interfaces

Though visualizations can represent large amounts of data in meaningful ways, it can be beneficial to support them with text [37]. Multimedia interfaces have combined graphics and text in complementary ways, taking advantage of the strengths of one medium and compensating for the weaknesses of the other (e.g. [18]).

Figure 2.5: An example of ValueCharts+ by Bautista & Carenini [3].

hehxplore is an extension of earlier work on multimedia presentations of opinions mined from text [7]. This earlier interface presents the user with both a treemap visualization of opinions mined from a corpus of reviews and a summary of those reviews (Figure 2.2). This summary contains links to the source text from which the summary’s sentences are extracted. hehxplore is similar in that it is also a multimedia interface, a complement of text and data graphics. It differs in that it summarizes feature comparisons across entities, rather than opinions on features of a single entity. Also, the sentences in our summaries are realized using sentence templates, and are not quotes taken from the corpus of text.

Chapter 3

Task Analysis

We have taken a task-based approach [22] to the design of hehxplore. The task-based approach focuses on the tasks a visualization and interface support.
We specify “who’s going to use the system to do what” and design it in order to best support those uses.

In order to identify design requirements for hehxplore, we describe the tasks that it should support. We create a task model by combining relevant task taxonomies and frameworks from previous work in information visualization. This task model is used to guide the design and evaluation of hehxplore.

Our task model integrates high-level task taxonomies that describe abstract interaction and analysis tasks, as well as lower-level tasks involved in analytic activity¹. Our task model attempts to combine tasks relevant to data analysis using information visualization, and to interpret them in the context of our evaluative data.

¹ This approach is similar to that used in the design of ValueCharts+ [3], where a task model was developed by combining concepts from information visualization and decision theory.

We combine the task taxonomies to gain the benefits of each: one describes low-level knowledge building, another general information visualization tasks, and a third high-level knowledge-building tasks. We integrate them in order to resolve any overlap between taxonomies, and to organize our data by the common requirements across tasks. This allows us to use the model as a whole, rather than as a number of lists, and to identify not only tasks that are poorly or not at all supported, but perhaps also the requirements of these tasks that have not been met.

3.1 Existing Task Taxonomies

Amar & Stasko [1] provide a list of common tasks carried out when employing information visualization tools for understanding data.

Retrieve value: “Given a set of specific cases, find attributes of those cases.” This is, essentially, reading a value in a data set.
Filter: “Given some concrete conditions on attribute values, find data cases satisfying those conditions.”
Compute derived value: “Given a set of data cases, compute an aggregate numeric representation of those data cases.” This includes counts, differences, means, etc.
Find extremum: “Find data cases possessing an extreme value of an attribute over its range within the data set.”
Sort: “... rank [data cases] according to some ordinal metric.”
Determine range: “Given a set of data cases and an attribute of interest, find the span of values within the set.”
Characterize distribution: “Given a set of data cases and a quantitative attribute of interest, characterize the distribution of that attribute’s values over the set.”
Find anomalies: “Identify any anomalies... with respect to a given relationship or expectation.”
Cluster: “... find clusters of similar attribute types.”
Correlate: “... determine useful relationships between the values of [two] attributes.”

These are low-level, common tasks which nearly map to specific tool functions (sort, cluster, determine range, etc.). The authors consider these to be “analytic primitives” that can be combined in order to form compound tasks, and that “may provide a checklist for system designers.” We use them as such: tasks that should be supported by our tool in order to facilitate analysis.

Shneiderman’s task taxonomy [33] is well known in the field of information visualization, and considered “a useful starting point for designing advanced graphical user interfaces.” It describes a number of common tasks that tools should support, building on the mantra of “overview first, zoom and filter, then details on demand.” The tasks (overview, zoom, filter, details-on-demand, relate, history, and extract) involve the manipulation of the data representation (overview, zoom, filter), as well as interaction with the application (history, extract).
Though originally these tasks were associated with the types of data they are enacted upon, they are separable from data types and widely applicable as general tasks. The tasks have been used as guiding design principles in a number of successful tools such as Spotfire² and Table Lens³.

² http://spotfire.tibco.com/
³ http://www.inxight.com/products/oem/table_lens/

Shneiderman’s tasks are, mostly, one level removed from Amar & Stasko’s low-level tasks. They are functions of the tool, rather than primitive knowledge-building tasks performed by the user. The tasks allow a user to navigate through the data. Though they are less specific, they are important. We interpret them as mid-level tasks through which many knowledge-building tasks are performed. This interpretation is not entirely consistent, as there are overlaps between the two taxonomies (filter, which is related to sorting, for example).

In addition to low-level tasks, Amar & Stasko also address high-level knowledge-building tasks in [2]. They identify how a tool can go beyond merely representing data to supporting useful analysis. They identify two gaps in visual analysis: the rationale gap “between perceiving a relationship and expressing confidence in the correctness and utility of [it],” and the worldview gap “between what is shown to a user and what actually needs to be shown... for making a decision.” For each gap, they propose three knowledge tasks to be used as goals in design or to be met in an evaluation.

• Rationale gap
  Expose uncertainty in data measures and aggregations, and [show] the possible effect of this uncertainty on outcomes.
  Concretize relationships, “clearly presenting what comprises the representation of a relationship.”
  Formulate cause and effect by clarifying possible sources of causation.
• Worldview gap
  Determine domain parameters by combining knowledge and metadata, showing what’s relevant and important.
  Multivariate explanation: providing support for discovery... of useful correlative models and constraints.
  Create and confirm hypotheses: “support for the formulation and verification of hypotheses.”

Unlike their low-level tasks and Shneiderman’s tasks, which build knowledge, these high-level tasks are intended to solidify understanding and use it to support decisions in the real world.

3.2 Integrated Task Model

We integrate these frameworks into a single task model by creating six groups of tasks (see Figure 3.1). These groups represent more abstract tasks carried out by the user. Some of these groups correspond to the more abstract tasks in the taxonomies cited; others are abstracted from similar and related tasks in those taxonomies.

Read tasks involve the perception or interpretation of information presented by the tool. These tasks require an unambiguous and faithful representation of, and access to, the data.
Overview tasks involve the presentation of aggregate or descriptive information, or trends in data. These require abstraction and summarization of the raw data.
Filter tasks re-arrange or represent data according to certain criteria. These require manipulation of data.
Relate/Compare tasks are carried out on pairs or sets of data. They require facilities for closely representing, abstracting, and manipulating sets of data.
Hypothesize tasks act to reduce uncertainty and test users’ theories about the nature of the data. They require domain knowledge and the means of applying that to the data.
History tasks are carried out on user actions, rather than the data. These require a record of user actions and changes.

Figure 3.1: A map of our integrated task model.

The groups allow us not only to organize tasks, but also to specify more general requirements. Any observed weaknesses within a group would suggest that the requirements of the group have not been met. Of course, the tasks are interdependent.
It seems unlikely that a tool would be able to support Hypothesize tasks without supporting the Read and Filter tasks, for example. We believe nonetheless that such an organization is beneficial. The task model is a set of tasks that can be used as a set of design requirements: a totally successful tool supports every task at all levels, though success can also be seen in how well some tasks are supported. The model can also be used in the evaluation of a tool, as heuristics for evaluating it analytically, or as the basis for user testing.

3.3  Usage Scenarios

We present three hypothetical usage scenarios that demonstrate how the tasks operate in the real world. The scenarios do not represent the full range of activities or contexts in which the task model is relevant. They describe three chains of tasks carried out with a successful tool in three specific contexts.

Shopping for a Digital Camera

Enzo would like to purchase a new digital camera. Photography is his hobby, and so he knows which features of a camera are most important to him: white-balance settings, image quality, and raw file support. He has collected and mined reviews of the three cameras within his price range: the Acme A-1, Camera Corp. 3000, and Serrano 23. Let us now assume that Enzo can use a system for exploring opinions about the three cameras.

Enzo first takes an overview (read) of the opinions of the cameras he has selected so he has a sense of how opinions are distributed in general. He does this by zooming out so he can see only major trends in how opinions on each camera’s features are distributed (characterize distribution). Once he has a general sense for common values, he then finds extremes, the features that are most liked or disliked. Unlike most other features, customer support has many strong negative evaluations for all three cameras.
Enzo is unconcerned about this since he is confident enough in his skills as a photographer that he will not require customer support; he determines that the customer support feature is not relevant (determine domain parameters).

Enzo decides to examine opinions of the specific features he’s interested in. He sorts the data by the evaluations of the white-balance feature. He finds that one camera, the Acme A-1, has the highest rated white-balance feature. He also finds that the differences in opinion between the A-1 and the 3000 and the 23 are quite slight (compute derived value). Indeed, he notices that the range of opinions of that feature is quite limited: most of the cameras have highly rated white-balance.

Enzo decides to look at the raw support next. He sees that the Acme A-1 has a very low rated raw feature. To understand the reasons for this, he looks at the sentences in the reviews of the A-1 related to raw support (details-on-demand). Enzo reads that the Acme’s software for reading raw files is quite difficult to install and does not work in Linux. Enzo is a dedicated Linux user, and decides that he’ll disregard the Acme A-1. He filters the data, extracting only the reviews of the Camera Corp. and Serrano.

He now compares the Camera Corp. and Serrano’s image quality. He sees that the Serrano has a higher-rated image quality. Enzo can see that this is because two of the related features (concretize relationships), colour and sharpness (multivariate explanation), are particularly exceptional according to reviewers. Enzo decides to pursue the Serrano 23. He’ll ask to see it the next time he goes to the electronics store.

Marketing Research

Marcela works for a movie production company and has been asked to propose two ways to advertise Videodrome II: Newer Flesh, an upcoming movie. She has been given the short written reactions of preview audiences.
These were submitted electronically, and include audience members’ gender and what age group they belong to. Marcela hopes to be able to find the features of Videodrome II that appeal to different genders or age groups, so that the film’s ad campaign is likely to be effective. Again, we assume that Marcela can use a system for exploring opinions.

She believes that the film’s violence will appeal to males, but not to females. She decides to test whether this is true (create and confirm hypotheses). She filters the data so that she can separate the opinions of males from those of females. Looking at the data (read), she then compares the evaluations of violence given by males to those given by females. There are no obvious patterns in either of the sets of opinions. Marcela still believes they may be related, and so correlates the two distributions. She finds that female and male opinions of Videodrome II’s violence are strongly correlated. Marcela decides that opinions about the film’s violence are not split by gender.

Marcela decides to look at the data another way, undoing her earlier split (history), then filtering the data again. This time she splits opinions of violence into those of audience members younger than 25 and those of older members. Now the data appear to be quite different: younger audience members rated the violence somewhat positively, while older audiences were very negative. A quick correlation supports this, showing that the two sets are strongly anti-correlated. Marcela decides that audience age is an important factor when deciding how to market Videodrome II (determine domain parameters, create and confirm hypotheses). Perhaps it’d be best to make one trailer that emphasizes the violence to be shown on television channels and websites for youth, and another trailer focusing on other aspects of the film for media with older audiences. But which aspects, Marcela wonders. She extracts only the opinion data from those older than 25.
Marcela would like to find the most important feature to the above 25 group, and so sorts the features in ascending order. The director feature is the first in this order. Most viewers over 25 rate the director highly. To Marcela, this makes sense, since the director is famous for a number of movies made in the early 80’s. She decides that advertising targeted at older audiences should downplay the violence in Videodrome II, and make a big deal of the director. Perhaps they could put the names of his earlier films in the trailer as well, written between little laurels, thinks Marcela. Marcela will suggest this at the next marketing meeting.

Analyzing Survey Results

Dom and Roland have two sets of survey results regarding the quality of their online ticket purchasing website, TicketGouge.ca. The first is from a few months back, and the second was collected last week, shortly after they re-designed their site. They would like to find any differences in opinion related to the change in design. Like Enzo and Marcela, Dom and Roland are assumed to use a system for exploring opinions.

Dom is an artist and arranged the interface of TicketGouge. He looks at the old and new opinions of TicketGouge’s look-and-feel and ease of use (read, compare). While ratings of look-and-feel were generally positive a few months ago (characterize distribution), ratings in the new survey data are split. Indeed, the controversiality of look-and-feel is quite great in the new data (compute derived value). The data concerning ease of use are very similar. Dom believes this shows that the new design is controversial (formulate cause and effect): about half of the respondents like it, the others dislike it.
He suspects some users may dislike his new Flash-based interface, but he can’t be sure: his survey only allowed responses along a Likert scale, so he can’t read any collection of respondents’ written opinions to search for the reasons why they dislike the new design.

Roland is more concerned with the business aspects of TicketGouge. He examines the data concerning prices and service charges. Neither was changed in the time between the old and new surveys. While opinions of the prices have not changed, ratings of service charges have. In the new surveys, respondents were split in their opinion of the service charges. Roland is surprised to see this and can’t understand why. He decides to cluster features in the new survey data so he can find similar ratings. The only other features with similarly split ratings are look-and-feel and ease of use (determine domain parameters).

Roland discusses his findings with Dom. They believe that the similarity between ratings of interface features and ratings of service charges reflects some connection. Since the only thing that changed between the old and new surveys was the design of the interface, it must be the cause of both the difference between old and new opinions, as well as the relationship between the interface and service charges (formulate cause and effect). Dom soon realizes that this must be so. The old interface displayed the service charges in a small typeface at the bottom of the payment page. In the interest of legibility, Dom increased the font size. He also simplified the page layout, so all the costs were clearly listed. Customers that believe TicketGouge’s service charges are too high are now less likely to miss or ignore them in the new interface, and this is what caused them to rate service charges negatively in the new round of surveys. Dom suggests they reevaluate the service charges. Roland suggests they reduce the font size on the billing page.
Chapter 4  Describing & Comparing Evaluative Data

The method of mining opinions from text that we adopt for our work on hehxplore is described in [8]. From a corpus of documents expressing opinions on an entity (e.g. user-submitted reviews from Amazon.com of the Canon G3 digital camera), the system returns a list of the entity’s features on which opinions are expressed (a camera’s Flash, its Appearance, etc.), as well as the opinions themselves. In this way, it is possible to extract sets of features and opinions from a number of corpora, each of which expresses opinions on a different entity (reviews of a Sony camera, of a Nikon camera, etc.).

The polarity (whether an opinion is positive or negative) and strength (the degree of sentiment) of each opinion can also be determined. In that method, three levels of strength are considered, thus the polarity/strength of an opinion can be represented by an integer in {−3, −2, −1, +1, +2, +3}, where +3 is the most positive opinion and −3 the most negative.

The features, and their associated opinions, are then organized according to a common hierarchy. For example, the features disk capacity, storage, and memory-card size can be mapped to a single feature: Memory. Semantically similar features from different entities can also be mapped to a single feature (the Sony’s Steady Shot and the Nikon’s Vibration Reduction can be mapped to Image Stabilization). This hierarchy is user-defined, so it can reduce redundancy as well as reflect a user’s needs or interests.

Feature           Camera A                    Camera B
(entity itself)   [+1, +1, +1, +3, +3]        [−2, −1, +1, +1, +2, +2, +2]
Lens              [+2, +2, +3, +3]            [−1, +1, +1]
Manual Mode       [+1, +1, +2, +2]            [−2]
Zoom              [−2, +1, +1]                [−3, −2, −1]
...               ...                         ...
Flash             [−1, +1, +2, +2, +2]        [−1, −1, −1, +3, +3]
Image             [−1, −1, +2, +2]            [+2, −1, +3]
...               ...                         ...

Figure 4.1: Partial view of the information extraction and organization process for two products.
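The mapping of mined features onto a common hierarchy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name FEATURE_MAP, the function organize, and the sample scores are hypothetical and are not taken from the mining system in [8] or from hehxplore.

```python
from collections import defaultdict

# Hypothetical user-defined mapping from mined feature names to common features.
# The feature names come from the examples in the text; the structure is assumed.
FEATURE_MAP = {
    "disk capacity": "Memory",
    "storage": "Memory",
    "memory-card size": "Memory",
    "Steady Shot": "Image Stabilization",
    "Vibration Reduction": "Image Stabilization",
}


def organize(mined):
    """Group polarity/strength scores under common feature names.

    `mined` maps a mined feature name to a list of integer opinions in
    {-3, -2, -1, +1, +2, +3}. Unmapped features keep their own name.
    """
    organized = defaultdict(list)
    for raw_feature, scores in mined.items():
        common = FEATURE_MAP.get(raw_feature, raw_feature)
        organized[common].extend(scores)
    return {feature: sorted(scores) for feature, scores in organized.items()}


# Illustrative scores only; not drawn from any real corpus.
mined_camera = {"disk capacity": [+2, -1], "storage": [+1], "Flash": [-1, +2]}
organized = organize(mined_camera)
```

Running organize once per entity with the same mapping yields feature names that are shared across entities, which is the property the comparisons in the rest of this chapter rely on.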
The outcome of this process is a collection of sets of opinions on features of a number of entities, as well as a common hierarchy of features across entities (each camera has a Flash feature, a Memory feature, etc.). We assume this is the input to our system.

4.1  Descriptive Statistics

There are a number of statistics that can be used to describe the opinions on a feature. These include the count, mean opinion, and controversiality (all of which are given above each chart in our visualization). Let $ps(f_a)$ be the set of opinions on the feature $f$ of entity $a$. We can find the count of opinions, $\operatorname{count}(f_a) = |ps(f_a)|$, as well as the mean (or average opinion):

$$\operatorname{mean}(f_a) = \frac{1}{|ps(f_a)|} \sum_{ps_k \in ps(f_a)} ps_k$$

It may be important to know whether evaluations of a feature are controversial: how split opinions are between positive and negative. We adopt a measure of controversiality based on information entropy, first proposed in [6]. The controversiality of a feature is a number in $[0, 1]$. It is calculated by aggregating positive and negative evaluations for each feature, and calculating the entropy of the resultant Bernoulli distribution. Firstly, we calculate the importance of feature $f_i$ in total, and for its positive and negative evaluations.

$$\operatorname{imp}(f_i) = \sum_{ps_k \in ps(f_i)} |ps_k|$$

$$\operatorname{imp\_pos}(f_i) = \sum_{ps_k \in ps(f_i) \,\wedge\, ps_k > 0} |ps_k|$$

$$\operatorname{imp\_neg}(f_i) = \sum_{ps_k \in ps(f_i) \,\wedge\, ps_k < 0} |ps_k|$$

Secondly, we find the entropy of the Bernoulli distribution with parameter $\theta_i$, where $\theta_i = \operatorname{imp\_pos}(f_i) / \operatorname{imp}(f_i)$ is the share of the total importance that is positive.

$$H(\theta_i) = -\theta_i \log_2 \theta_i - (1 - \theta_i) \log_2 (1 - \theta_i)$$

We weight this value by the importance of feature $f_i$. To do so, we find the maximum importance: the number of evaluations multiplied by $M$, the maximum possible evaluation strength (in our case, 3).

$$\operatorname{imp\_max}(f_i) = M \times |ps(f_i)|$$

Finally, the controversiality of feature $f_i$ is:

$$\operatorname{contro}(f_i) = \frac{\operatorname{imp}(f_i)}{\operatorname{imp\_max}(f_i)} \times H(\theta_i)$$
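The descriptive statistics above can be sketched in Python as follows. This is a minimal illustration, not the hehxplore implementation: the function names mirror the text, MAX_STRENGTH stands for $M$, and the sketch assumes that $\theta_i$ is the positive share of the total importance (imp_pos divided by imp), so that it is a valid Bernoulli parameter in [0, 1]. It also assumes the usual convention that $0 \log_2 0 = 0$.

```python
import math

MAX_STRENGTH = 3  # M in the text: the maximum possible evaluation strength


def count(ps):
    """Number of opinions on a feature."""
    return len(ps)


def mean(ps):
    """Average polarity/strength of the opinions on a feature."""
    return sum(ps) / len(ps)


def entropy(theta):
    """Entropy (base 2) of a Bernoulli distribution with parameter theta."""
    if theta in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) = 0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)


def contro(ps):
    """Controversiality of a feature, a value in [0, 1]."""
    if not ps:
        return 0.0  # assumption: an unmentioned feature is not controversial
    imp = sum(abs(p) for p in ps)
    imp_pos = sum(abs(p) for p in ps if p > 0)
    theta = imp_pos / imp  # assumed: positive share of total importance
    imp_max = MAX_STRENGTH * len(ps)
    return imp / imp_max * entropy(theta)


# Opinions on the Lens feature of Camera A in Figure 4.1.
lens_a = [+2, +2, +3, +3]
lens_mean = mean(lens_a)
lens_contro = contro(lens_a)
```

Under these assumptions, a feature whose opinions all share one polarity scores 0, and a feature with equally many −3 and +3 opinions scores 1, which matches the two extremes the text describes.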
We can find the controversiality of the evaluations for an entity by taking the weighted sum of the controversiality of all its features. The weight of each feature is the number of evaluations of that feature, less one:

$$w(f_i) = |ps(f_i)| - 1$$

We subtract one to exclude features that have been evaluated only once, since a single evaluation is not indicative of any real controversiality. The controversy of entity $E$ is:

$$\operatorname{contro}(E) = \frac{\sum_i w(f_i) \times \operatorname{contro}(f_i)}{\sum_i w(f_i)}$$

The feature score ($\operatorname{contro}(f_a)$) is a real value in the range $[0, 1]$. A feature controversiality of 0.0 occurs when all opinions on a feature are of the same polarity; controversiality of 1.0 occurs when they are strong and evenly split between positive and negative (see examples in Figure 4.2).

Figure 4.2: Three distributions with different controversiality scores: 0.0 on the left, 0.612 in the middle, and 1.0 on the right.

4.2  Statistics for Comparisons

Just as we define statistics for describing opinions on a feature of a single entity, we define statistics for describing the similarity of opinions on a feature across two entities. We call these aspects of a comparison. Formally, the similarities of aspects are functions of the opinion distributions on feature $f$ for the pair of entities $a$ and $b$: $f_a$ and $f_b$. These functions return values in the range $[0, 1]$, where 1.0 indicates extreme similarity and 0.0 extreme dissimilarity.

As when considering opinions on a single entity, it can be important to know how many opinions are expressed on an entity and its features. We define the similarity of the counts of opinions on $f_a$ and $f_b$ as the ratio of the smaller count to the larger.

$$\operatorname{counts}(f_a, f_b) = \frac{\min(\operatorname{count}(f_a), \operatorname{count}(f_b))}{\max(\operatorname{count}(f_a), \operatorname{count}(f_b))}$$

We define the similarity of the means of opinions as 1 minus the difference between the means proportionate to the maximum possible difference between means.
$$\operatorname{means}(f_a, f_b) = 1 - \frac{|\operatorname{mean}(f_a) - \operatorname{mean}(f_b)|}{2 \times \mathit{max\_strength}}$$

In our system, $\mathit{max\_strength} = 3$, and the greatest possible difference between means is 6 (for the means −3 and +3). Note that the equation above is sensitive only to differences in strength between means, not to those of polarity. For example, it returns the same similarity value for the means −2.6 and −0.6 as for the means −1.0 and +1.0, though the first two means are both negative and the latter two are of different polarity. This is counterintuitive: the latter case should be less similar than the former. We therefore substitute the initial formulation with the following function when the two means are of different polarity (e.g. $\operatorname{mean}(f_a) < 0 < \operatorname{mean}(f_b)$) and sufficiently strong ($|\operatorname{mean}(f_a)| > 0.5$ and $|\operatorname{mean}(f_b)| > 0.5$) in order to capture the dissimilarity of means of different polarity:

$$\operatorname{means}(f_a, f_b) = 1 - \sqrt[k]{\frac{|\operatorname{mean}(f_a) - \operatorname{mean}(f_b)|}{2 \times \mathit{max\_strength}}}$$

with $k > 1$. In our study, $k$ was set to 3, based on observing the effect of different values of $k$ on several test cases during development.

The similarity of the aspect of controversiality is related to the difference between the controversiality scores of the two distributions of opinions.

$$\operatorname{contros}(f_a, f_b) = 1 - |\operatorname{contro}(f_a) - \operatorname{contro}(f_b)|$$

This equation is not sensitive to whether the two sets of opinions are both (un)controversial (e.g. 0.6 and 1.0) or whether one is controversial and the other uncontroversial (e.g. 0.6 and 0.2). As with means, we exaggerate the difference between controversiality scores when they fall on opposite sides of 0.5 (e.g. $\operatorname{contro}(f_a) < 0.5 < \operatorname{contro}(f_b)$) by substituting the equation above with the following:

$$\operatorname{contros}(f_a, f_b) = 1 - \sqrt[k]{|\operatorname{contro}(f_a) - \operatorname{contro}(f_b)|}$$

In addition to the aspects above, we consider differences in the distribution of opinions. To do this, we employ Jensen–Shannon divergence ($D_{JS}$, also known as information radius) [24].
$$\operatorname{dists}(f_a, f_b) = 1 - D_{JS}(f_a \,\|\, f_b) = 1 - \left[ \tfrac{1}{2} D_{KL}(f_a \,\|\, M) + \tfrac{1}{2} D_{KL}(f_b \,\|\, M) \right]$$

where $M = \tfrac{1}{2}(f_a + f_b)$, and $D_{KL}$ is Kullback–Leibler divergence. Jensen–Shannon divergence is the mean information loss of each distribution relative to their mean distribution, and it is commonly used to measure a kind of distance between two distributions. Unlike $D_{KL}$, $D_{JS}$ is bounded and symmetric.

4.2.1  Discretization of Values

Though all the similarity functions above return values between 0.0 and 1.0, this does not mean that they express comparable differences in similarity. For example, two opinion distributions with means of 0.75 are not necessarily as similar as are two distributions with dists of 0.75. To alleviate this problem, we simplified the statistics as follows. Each statistic was discretized into four categories: very dissimilar (vd), dissimilar (d), similar (s), and very similar (vs), but the thresholds that define these categories differ among the statistics (Figure 4.3). These thresholds were arrived at through an iterative evaluation of sample cases made by the authors. These, along with $k$, are parameters of our selection strategy that could be refined to better match human judgements in the future.

Figure 4.3: Visualization of thresholds used to discretize values returned by the system’s various similarity functions into very dissimilar (vd), dissimilar (d), similar (s), or very similar (vs). For example, values of counts() greater than 0.6 and less than 0.7 are dissimilar; values of dists() greater than 0.95 are very similar.

Chapter 5  Design Rationale

5.1  Visualization of Opinion Data

hehxplore employs graphics that represent the data in order to make it accessible to users, as well as to aid in their analysis and comparison of the data. We designed our visualization according to the tasks it should support.
To do so we created a task model by integrating relevant task taxonomies and frameworks from previous work in information visualization. These task taxonomies describe visual, interactive [33], and analytic [2, 1] tasks. These include reading the data accurately, easily characterizing subsets of the data, identifying anomalies, and relating data to support hypotheses. These tasks are general, but common and important to a number of potential users of hehxplore (consumers, market analysts, researchers, etc.).

5.1.1  Early Prototypes

We evaluated a number of early prototype representations of the data to be implemented in hehxplore. Two early prototypes, polarity bars and the parallel-coordinate tree, are modified versions of other visualizations of evaluative data: Opinion Observer [23] and SurveyVisualizer [4]. We also evaluated a third prototype we called a distribution matrix, which we developed into the current hehxplore visualization.

Polarity Bars

The polarity bar representation (Figure 5.1) is an extension of the vertical, bar-based Opinion Observer [23]. Each feature is represented by a bar. The number of evaluations is encoded by the length of a bar. Bars extend from a horizontal axis, upwards for positive evaluations, downwards for negative. The various polarity/strength values are encoded by segments of the bar of different saturations. Weaker evaluations are represented by the less saturated segments of a bar, and are nearer to the horizontal axis; stronger evaluations are more saturated segments, farther from the axis. The mean for each distribution of evaluations is represented by a circular glyph, its vertical position representing its value.

The polarity bar representation allows for bars to be configured in different ways. Bars can be grouped by entity to facilitate within-entity comparison, or combined to facilitate cross-entity feature-by-feature comparison. Opinion Observer provides only the latter configuration.

Figure 5.1: A mock-up of polarity bars, a modification of Opinion Observer (Figure 2.3).

There are a number of problems with polarity bars. While length and direction represent number and polarity clearly, it is somewhat difficult to make careful comparisons of strength with polarity bars. Though easy to interpret, strength segments do not have a consistent position in space. If there are fewer weak opinions on one feature than on another, its stronger opinions are rendered nearer to the axis, even when there are the same number of strong evaluations in both features. This makes comparing quantities across features more difficult, since strength and quantity both vary along the vertical.

Polarity bars also use two vertical scales: one for the number of evaluations, another for the value of the mean. This is confusing and can be easily misread: the mean may be rendered over a bar segment representing a certain polarity/strength value, but represent a value quite different from it.

Lastly, polarity bars do not extend easily to a hierarchy. Representing evaluations coming from a child feature is difficult. Stacking them into a single bar makes it hard to judge their strengths; putting them next to each other does not represent the parent-child relationship as well and cripples the combined arrangement of bars.

5.1.2  Parallel-Coordinate Tree

Together with Ivan Zhao, we created many prototypes based on SurveyVisualizer [4]. While iterating on the design, we discovered a number of drawbacks to parallel-coordinates plots. Among them was that the incomplete nature of the mined opinion data made it difficult to maintain a consistent visual and mental representation of opinions on an entity.

Figure 5.2: A p-node tree [40], a modification of the parallel-coordinate tree in SurveyVisualizer (Figure 2.4). Extracts from the original text corpus are visible as bubbles extending from nodes at the bottom of the window.
For example, while survey respondents typically cover all or most questions, online reviewers do not mention all or most product features: feature coverage and overlap may be quite small. This results in many visual gaps in the parallel-coordinate representation, making it difficult to read the data. There are also known difficulties with parallel-coordinates, such as cluttering and dimension ordering [13]. To avoid these problems, the prototype was modified to use coloured circles rather than thin lines to represent opinion. An interactive prototype, called the p-node tree (Figure 5.2), was created by Zhao and is detailed in [40].

Despite the appeal of SurveyVisualizer and p-node trees, the prototype was unintuitive and not as easy to understand as another prototype of ours, the distribution matrix. The p-node trees had a novel visual representation that required more effort to become familiar with than bar charts, sparse data was hard to read, and getting a sense of distribution and small differences in number was difficult when these were represented as colour intensity and labels instead of length.

Figure 5.3: Coloured circles and labels used to represent the number of opinions on a set of features in a p-node tree (detail from Figure 5.2).

Figure 5.4: A mock-up of a distribution matrix.

Distribution matrix

The distribution matrix (Figure 5.4) represents opinions on each feature in a histogram or bar chart. Categories for the polarity/strength of opinions are represented on the horizontal axis, the number of opinions on the vertical axis. Each bar corresponds to a polarity/strength category, and its height represents the number of opinions in that category. Bar charts are a good visual representation of the data because they are clear and familiar, thereby reducing learning times and potential misunderstanding [25]. The charts for the features are arranged in a grid or matrix.
Comparisons of opinions across features are made easier when many of these charts are arranged neatly and near one another [25, 34]. This follows Tufte’s principle of small multiples, that many repeated representations can enforce “comparisons of changes, of the differences among objects, of the scope of alternatives.” [36] We believed that the distribution matrix representation was approachable, and did best at facilitating the tasks described in our integrated task model. We decided to develop it further, tuning it over many iterations.

5.1.3  Hierarchical Histograms

Figure 5.5: Example of a chart of the opinions on a camera’s Image feature. The grey dot beneath the bars represents the mean of the opinions. Count (#), mean (Avg.), and controversiality (Contro.) are stated explicitly at the top of the chart.

The visualization realized in our interface is based on our distribution matrix prototype. The primary visual component of hehxplore is the bar chart representing opinions on a feature (Figure 5.5). The mean of the opinions on a feature is represented as a single grey dot plotted along the opinion axis, under the bars. Unlike when reading the bars in the chart, to read the mean it is necessary to interpret the opinion axis as numeric, not categorical. The mean can be plotted anywhere between the tick marks representing the most negative and most positive opinions.

It is sometimes useful to have access to the exact values of key descriptive statistics. For this reason we state the count of opinions on a feature, the value of their mean, and the controversiality score above the chart, beneath the feature name.

The mean dot can both support and provide contrast to the information in the bars above it. For example, in Figure 5.5, the mean is near +1, though the actual number of +1 opinions in the data is low. This suggests that opinion on Image may be split (it is, in a J-shaped distribution).
The proximity of the representations of these different, though related, descriptions of the data allows such understanding to be reached more quickly than if they were apart or not available at a glance.

As in our prototype, the charts are arranged in rows and columns (Figure 5.6). The features in our input data are organized in a hierarchy; the charts representing opinions on these features are arranged hierarchically in our interface. Features without children in the hierarchy contain opinions expressed on themselves (e.g. Figure 5.5). Features that have child features have opinions on themselves and also subsume the opinions on their children. We represent this by stacking the bars representing opinions of child features (Figure 5.6). Bars of opinions on child features are distinguished both by colour and by their stacking order: opinions expressed on a parent feature explicitly are the bottom-most set of bars in a chart.

Figure 5.6: Example of a stacked bar chart, showing the relationship between charts of opinions on Appearance, Battery, Flash and their parent feature, Digital Camera. Notice that a) axes’ scales are consistent, b) bar colours relate stacked bars to charts of child features, c) explicit opinions on Digital Camera are the bottom-most bars in the chart (light grey), d) Digital Camera contains bars of other child features not shown (e.g. Image, Video).

5.2  Summarization of Opinion Comparisons

hehxplore is designed to support the comparison of opinions across entities. Visualizing opinions on multiple entities allows a user to examine and compare opinions expressed in large corpora quickly and easily, but it may still be difficult and time-consuming to identify important similarities and contrasts of opinion across entities. Also, the nature of these important comparisons may be difficult to convey clearly and succinctly using graphics.
To address this potential limitation and to further facilitate comparison, we propose a textual summary of notable comparisons of features across two entities. Our summarization system relies on a set of statistics to characterize feature comparisons according to their (dis)similarities. These statistics are adaptations of statistics previously developed for opinions on a single entity, to ones that describe opinions on a pair of entities (see section 4.2).

Figure 5.7: Screenshot of our interface displaying generated opinion data on digital cameras from two fictional manufacturers.

5.2.1  Content Selection

Our content selection strategy always includes an overall comparison of the two entities, which corresponds to comparing the distributions of explicit opinions about the entities combined with opinions about all their features (left-most feature in Figure 5.6). In addition to this, our strategy selects a subset of the feature comparisons (in our study, up to 1/3 of the possible feature comparisons) worth mentioning. Then, for each selected feature comparison, it determines the aspects that are worth mentioning. We now examine these two selection processes in order.

5.2.2  Selection of Comparisons to be Mentioned

Feature comparisons are first filtered by removing any comparison that covers too few opinions (in our study, 3% or less of the total count), as their statistics are unlikely to be very meaningful. Formally, a feature $f$ is considered only if:

$$\operatorname{count}(f_a) + \operatorname{count}(f_b) > \frac{3}{100} \sum_k \left[ \operatorname{count}(k_a) + \operatorname{count}(k_b) \right]$$

After filtering out low-count feature comparisons, the noteworthiness of each feature comparison is assessed by counting the number of very (dis)similar aspects of the comparison. That is,

$$\operatorname{nworthiness}(f_a, f_b) = \sum_{g \in G} \begin{cases} 1 & \text{if } g(f_a, f_b) \in \{vd, vs\} \\ 0 & \text{otherwise} \end{cases}$$

where $G = \{\operatorname{counts}, \operatorname{means}, \operatorname{contros}, \operatorname{dists}\}$.
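The filter and the noteworthiness score above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the hehxplore implementation: it assumes each aspect has already been discretized into one of the labels vd, d, s, vs (section 4.2.1), and that the total in the filter sums the per-feature counts of both entities. The function names, the dictionary layout, and the sample counts are hypothetical.

```python
ASPECTS = ("counts", "means", "contros", "dists")  # the set G in the text


def passes_count_filter(feature, counts_a, counts_b, min_share=0.03):
    """True if the comparison of `feature` covers more than `min_share`
    of all opinions on both entities.

    `counts_a` and `counts_b` map feature names to opinion counts for
    entities a and b (assumed layout).
    """
    total = sum(counts_a.values()) + sum(counts_b.values())
    return counts_a[feature] + counts_b[feature] > min_share * total


def nworthiness(labels):
    """Number of aspects whose discretized similarity is very dissimilar
    (vd) or very similar (vs).

    `labels` maps each aspect name in ASPECTS to 'vd', 'd', 's', or 'vs'.
    """
    return sum(1 for aspect in ASPECTS if labels[aspect] in ("vd", "vs"))


# Illustrative counts only; not taken from the study data.
counts_a = {"Lens": 40, "Zoom": 1, "Flash": 30}
counts_b = {"Lens": 20, "Zoom": 1, "Flash": 8}
kept = [f for f in counts_a if passes_count_filter(f, counts_a, counts_b)]

lens_labels = {"counts": "s", "means": "vs", "contros": "vs", "dists": "d"}
lens_score = nworthiness(lens_labels)
```

In this sample, Zoom accounts for 2 of 100 opinions and is dropped by the filter, while a comparison with two very (dis)similar aspects receives a score of 2 regardless of which aspects they are.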
Notice that this assigns greater noteworthiness to feature comparisons with the greatest number of strong (dis)similarities of any kind. For example, a comparison of Lens that has very similar means and contros has an nworthiness of 2, and it is considered as noteworthy as a comparison of Battery that has very similar counts but very dissimilar contros. Feature comparisons are ranked by nworthiness; comparisons are selected for mention in the summary until either 1/3 of the features in the hierarchy have been selected, or the nworthiness of the next-highest-ranked comparison is 0. When necessary, ties are broken by selecting the feature comparison with the most extreme counts (that is, the comparison $y = \arg\max_x |\mathrm{counts}(f_x) - 0.5|$). This tie-breaking strategy is justified by the assumption that counts is the most critical aspect in a comparison.

5.2.3  Selection of Comparison Aspects to be Mentioned

Whenever a feature comparison is included in the summary, a statement on its counts and a statement on its means are always included, as these two aspects are considered important in any comparison. A statement on contros is included when it is very (dis)similar; likewise for dists. Each feature comparison is summarized in a single sentence. Rhetorically, the statement on counts is presented as the main claim. The statements on the other aspects are presented as contrast or support for the main claim, depending on whether they are consistent, with respect to similarity, with the statement on counts. For example, the overall comparison of Yoyodyne and Cyberdyne cameras (the feature Digital Camera in Figure 5.7) has dissimilar counts, very similar means, and very dissimilar contros. The statement on contros supports the statement on counts (they are both dissimilar); the statement on means contrasts with the statement on counts. Summaries always mention the feature comparison of the top-most feature in the hierarchy of features.
This serves to make an overall statement on the difference of opinions on the entities. The summary then includes statements on the most notable comparisons, if there are any noteworthy comparisons in the data. Our system does not perform any complex realization of the selected content. The sentences presented in the summary are realized using simple sentence templates. That is, the content is mapped directly to surface-level sentences, without any intermediate representations (see examples in Figure 5.7).

Chapter 6  Evaluation

6.1  Goals

The primary goal of our study was to evaluate the quality of our system’s selections of noteworthy comparisons by, firstly, comparing system and user selections, and secondly, finding whether and when users believe the system’s selections are good. Our secondary goal was to ascertain the usability of our visualization and to discover, more generally, how users interpret such opinion data. To achieve these goals, we conducted a study in which we collected what human subjects select as the most noteworthy feature comparisons in a set of opinions, both before and after seeing selections made by our system. Subjects were asked to justify each of their selections by, first, noting whether they had selected a comparison because the opinions were either similar, dissimilar, or notable for another reason, and second, by writing a brief explanation. In addition to this, subjects classified each of our system’s selections as good or poor. Lastly, subjects completed a questionnaire in which they rated the usability of our visualization.

6.2  Scenario

To encourage subjects to pay attention to the data as well as to make the task easy to understand, we developed a fictional scenario within which to present opinion data, our visualization, and selection strategy. Subjects were told that an unspecified camera manufacturer is conducting an analysis of the newest digital cameras released by its competitors.
This company has hired the subject to analyze the opinions on pairs of digital cameras, and to identify interesting differences and similarities. Subjects were also told that they would be asked to double-check another analyst’s work (in truth, the selections made by our summarizer). The entities used in our study all shared a simple, two-level hierarchy of features. These features are Appearance, Battery, Flash, Image, Lens, Software, and Video, all of which are child features of Digital Camera (see footnote 1).

Footnote 1: Initially, we generated data for 8 + 1 features, but we later reduced the number of features to 6 + 1 after our initial pilot studies.

6.3  Data Generation

To our knowledge, there is no available corpus of evaluative text annotated with features in a hierarchy large and varied enough to serve as a basis on which to evaluate our interface. As such, we generated data that mimics the labelled opinion data provided by Hu & Liu [20]. This data was generated with features that are in keeping with the study scenario, and was sufficiently varied to evaluate our interface. We would like to evaluate our interface on the entire space of possible opinion data. This is, however, not practical. Instead, we generated a set of data that we believe represents the space of possible opinion data insofar as it alters the summaries generated by our system. Thus, we identified a number of feature comparison types (Table 6.1) which would be represented differently in a summary. A comparison type is a set of constraints on the aspects of a comparison. That is, a comparison type requires that the aspects (counts, means, contros, and dists; see section 4.2) of a comparison are of a certain category of (dis)similarity (very dissimilar (vd), dissimilar (d), similar (s), or very similar (vs); see subsection 4.2.1). These constraints cause each type to be mentioned with a different configuration of aspects as support or contrast (see subsection 5.2.3).
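The configuration of support and contrast mentioned here follows from the rules of subsection 5.2.3, and can be sketched as below. This is a minimal illustration rather than the thesis implementation: the function names and the dictionary representation are hypothetical, and the category of each aspect is assumed to be already known.

```python
from typing import Dict, List, Tuple

SIMILAR = {"s", "vs"}    # categories read as "similar"
EXTREME = {"vd", "vs"}   # categories read as "very" (dis)similar


def mentioned_aspects(categories: Dict[str, str]) -> List[str]:
    """counts and means are always mentioned; contros and dists only when extreme."""
    mentioned = ["counts", "means"]
    for aspect in ("contros", "dists"):
        if categories[aspect] in EXTREME:
            mentioned.append(aspect)
    return mentioned


def support_and_contrast(categories: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split the mentioned aspects (other than counts) into support and contrast.

    counts is the main claim; another aspect supports it when both are similar
    or both are dissimilar, and contrasts with it otherwise.
    """
    main_is_similar = categories["counts"] in SIMILAR
    support: List[str] = []
    contrast: List[str] = []
    for aspect in mentioned_aspects(categories):
        if aspect == "counts":
            continue
        if (categories[aspect] in SIMILAR) == main_is_similar:
            support.append(aspect)
        else:
            contrast.append(aspect)
    return support, contrast


if __name__ == "__main__":
    # Overall comparison from the Figure 5.7 example. Its dists category is not
    # stated in the text, so "s" is a placeholder assumption.
    overall = {"counts": "d", "means": "vs", "contros": "vd", "dists": "s"}
    support, contrast = support_and_contrast(overall)
    print(len(support), len(contrast), support, contrast)
```

For the Figure 5.7 example, contros ends up as support and means as contrast, in line with the description in subsection 5.2.3. The lengths of the two returned lists correspond to the |S| and |C| columns of Table 6.1.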
Type  |S|  |C|  counts  means   contros  dists
0     0    0    d∨s     d∨s     d∨s      d∨s
1     1    0    vs      s∨vs    d∨s      d∨s
2     0    1    s       vd      d∨s      d∨s
3     1    1    d       vd      vs       d∨s
4     2    0    vd      d∨vd    vs       d∨s
5     0    2    vd      s∨vs    vs       d∨s
      1    2    Does not occur
      2    1    Does not occur
6     3    0    vs      s∨vs    vs       vs
7     0    3    vd      s∨vs    vs       vs
?     ?    ?    vd∨d∨s∨vs (all aspects)
M     ?    ?    At least one vd∨vs

Table 6.1: The comparison types, the number of aspects mentioned in support (|S|) and contrast (|C|), and the constraints on the (dis)similarity of the aspects of a comparison.

For example, the constraints of type 0 are met only when all aspects of a comparison are either similar (s) or dissimilar (d). Such a comparison is not noteworthy, and therefore will not be selected for mention in a summary. Constraints of type 6 require that the counts of a comparison be very similar (vs), its means similar or very similar (s∨vs), and both its contros and dists very similar (vs). A realized sentence of a comparison of type 6 would have the counts as the main claim, and all other statements supporting that claim, as all other aspects are also similar and worthy of mention. Type ? does not constrain any aspect of a comparison to any category of (dis)similarity. Type M requires only that at least one aspect of a comparison be very similar or very dissimilar. Notice that two configuration types between 5 and 6 can be specified, but their constraints are not met in practice (see footnote 2).

The comparison types can be used to generate data that will meet certain constraints and result in comparisons which will be represented in known ways by our summarizer. However, types can be used to control only the feature comparisons represented as sentences. For us to be certain that the data we were generating covered a large space of possible summaries, it was necessary to control which and how many of these types were present in each set of opinions on entities to be summarized. To do so, we identified a set of summary cases (Table 6.2).
Each summary case uses types to constrain the data generated for one, two, and all other feature comparisons, as well as the overall comparison (the comparison of the top-most feature in the hierarchy: Digital Camera, in our study). For example, case 0 uses type ? overall, but specifies that all other comparisons must be of type 0. This results in a set of opinions in which no feature comparisons are noteworthy and so none will be mentioned in the summary, except the required sentence describing the comparison of the overall opinions on the entities. Case 15 specifies that one feature comparison must satisfy the constraints of type 1, another must satisfy type M (that is, have at least one very (dis)similar aspect), and all other comparisons must be type 0. This results in a summary that mentions two notable feature comparisons in addition to the required overall statement.

Generating the data ourselves gave us two kinds of control: the data could be made similar in character to the real data available to us, and it could be constrained to result in specific kinds of summaries. This allowed us to create datasets that were realistic but also covered a large number of possible sentences and summaries, without having to find large, varied existing sets of opinion data that would require a lot of cleaning and human labelling.

Footnote 2: This is because means, contros, and dists are related in such a way that no two of these aspects can be very similar while the third is dissimilar.

Case  Overall  C1  C2  C...
0     ?        0   0   0
1     1        ?   ?   ?
2     2        ?   ?   ?
3     3        ?   ?   ?
4     4        ?   ?   ?
5     5        ?   ?   ?
6     6        ?   ?   ?
7     7        ?   ?   ?
8     ?        1   0   0
9     ?        2   0   0
10    ?        3   0   0
11    ?        4   0   0
12    ?        5   0   0
13    ?        6   0   0
14    ?        7   0   0
15    ?        1   M   0
16    ?        2   M   0
17    ?        3   M   0
18    ?        4   M   0
19    ?        5   M   0
20    ?        6   M   0
21    ?        7   M   0

Table 6.2: The summary cases generated as example data for the study. Each specifies a configuration type for opinions overall, one comparison (C1), another comparison (C2), and all other possible comparisons (C...).

6.4  Subjects

Thirty-six subjects (24 female, 12 male), aged 19–43 (median 23), participated in the study. Subjects were university students or graduates. They were recruited through an online subject pool. Each was paid $10 to participate.

6.5  Materials

All study materials were provided to subjects on paper. Charts were printed in colour on sheets of 8.5 × 11 inch paper, as were quizzes and primers.

6.6  Procedure

Subject sessions were designed to take no longer than an hour. Subjects were first briefed on the various parts of the session. They were then given time to read a six-page primer which introduced the scenario and explained the nature of the data, the charts, and the mean and controversiality statistics. Once finished, they were given a brief quiz that asked them to rank opinions on three different charts according to count, mean, and controversiality. These charts had non-specific feature names and no explicit values for the statistics. Subjects were allowed to refer back to the primer while completing the quiz. The experimenter checked their answers, prompting them to reconsider any that were incorrect. Subjects did not continue until they had the correct answers. After completing the quiz, subjects were given a sheet with charts representing opinions of two digital cameras (similar to the top portion of the window in Figure 5.7) and a response sheet. The first phase of their response was to classify the opinions on the sheets as similar or dissimilar, select up to two notable features, and give reasons for their selections. Once this phase was complete, they were given a second copy of the sheet of charts with the features selected by our system circled. The names of very (dis)similar aspects of the selected comparisons were also listed on the page, along with whether they were similar or dissimilar.
In the second phase, subjects classified each of the system’s selections as good or poor, gave their reasons for classifying them as they did, and were given a second opportunity to select up to two notable features. Subjects were not told that the selections were made by a computer system, only that they were “evaluating another analyst’s selections.” Subjects were given as many sheets to respond to as could be done in approximately forty minutes. Ten minutes were allotted at the end of each session for subjects to complete the questionnaire.

6.7  Method

To evaluate the performance of our content selection strategy against the gold standard supplied by our study subjects, we calculated precision and recall of the system as well as the F-measure. To determine subjects’ perceptions of the usability of our visualization, we included in the final questionnaire a series of statements, and asked subjects to rate how strongly they agreed or disagreed with each statement. A number of these statements relate to Nielsen’s quality components of usability: learnability, efficiency, memorability, error, and overall satisfaction [27]. They are statements such as “It is easy to learn to read the charts” (learnability), and “I am confident that I read the charts correctly” (error).

6.8  Baseline Systems

In order to set a baseline of performance, we found the expected performance of two simple alternative selection systems: a naïve system which randomly selects 0–2 feature comparisons to mention, and a semi-informed system. The semi-informed system is as likely to select 0, 1, or 2 comparisons as were the subjects in our study. Though it is likely to select the same number of comparisons, the comparisons it selects are picked randomly. More formally, we can say that the probabilities of these systems selecting a certain number of comparisons are

\[ \forall x:\ \Pr(\mathit{Size} = x \mid \mathit{System} = \text{naïve}) = \frac{1}{3} \]

\[ \Pr(\mathit{Size} = x \mid \mathit{System} = \text{semi-info.}) = \Pr(\mathit{Size} = x \mid \mathit{System} = \text{subjects}) \]

Since we consider 6 selectable features in our study, for both systems, the probability of selecting feature comparison y is

\[ \Pr(\mathit{Select} = y \mid \mathit{Size} = 0) = 0 \]

\[ \Pr(\mathit{Select} = y \mid \mathit{Size} = 1) = \frac{1}{6} \]

\[ \Pr(\mathit{Select} = y \mid \mathit{Size} = 2) = \frac{1}{6} + \frac{1}{5} \]

By multiplying the probability of a system selecting features that overlap with selections made by subjects, we find the expected performance of each system (see Table 6.3). These are the baseline scores which our system must match or beat in order to be considered successful.

6.9  Machine Learning System

We also trained and analyzed the performance of a content selection strategy with a generic machine learning algorithm at its core. This machine learning system provides further contrast with our statistics-based system.

6.9.1  Training Data

The machine learning system was trained on the data collected in our user study of our original, statistics-based system. That is, in relation to this machine learning system, our study served to collect training data. We used a collection of various data to train the machine learning system. These training data included, firstly, the generated data represented directly in the charts given to subjects:

• The raw numbers of opinions of each polarity/strength on each feature of each entity (excluding the top-most feature, Digital Camera)
• The descriptive statistics for opinions on each feature of each entity, i.e. count, mean, and contro for all f_i (including the top-most feature, Digital Camera)

Also included were the data used by our statistics-based system:

• Values of each aspect of each comparison, i.e. counts, means, contros, and dists for all comparisons (f_a, f_b)
• Total number of extreme (vs or vd) aspects in each comparison

Observations of and discussions with pilot study subjects led us to include other data as well.
Subjects appeared to justify some of their selections by referring to the (dis)similarity of opinions on specific features with those on the top-most feature, Digital Camera. For example, “the opinions on Appearance are distributed much like the opinions on the camera overall.” They would also often mention statistics (such as mean) relative to features within an entity. For example, “in both cameras, the feature with the highest mean is Flash.” To support these explanations, we decided to include the following data that was not given explicitly in the study, nor considered by our statistics-based system:

• Values of each aspect of each feature compared with the top-most feature of the same entity, i.e. aspects for all comparisons (f_a, Digital Camera_a)
• Rank of each feature’s count, mean, and contro within the same entity, e.g. within entity a, Flash has the highest count, Appearance has the next highest count, etc.

Training was carried out with the aim of selecting comparisons similar to those selected by subjects in our study. However, subjects did not always select the same feature comparisons given the same case, so it was necessary to combine their selections into a single score which the machine learning system would attempt to learn. We created a simple selection score to be assigned to each feature within each case; it is given by

\[ \mathrm{selectionScore}(f) = \frac{|\text{subjects that selected } f|}{|\text{subjects}|} \]

Since subjects were allowed to select up to two feature comparisons for each given case, the sum of the selectionScore for all features in a case must lie in the interval [0, 2]. For example, four subjects saw the data for case 0; all of them selected Battery as notable, two selected Appearance, and Flash and Image were each selected once. That means that

\[ \mathrm{selectionScore}(\text{Battery}) = \frac{4}{4} \]

\[ \mathrm{selectionScore}(\text{Appearance}) = \frac{2}{4} \]

\[ \mathrm{selectionScore}(\text{Flash}) = \mathrm{selectionScore}(\text{Image}) = \frac{1}{4} \]

6.9.2  Machine Learning System Implementation

We trained a support vector regression model to predict selection scores for feature comparisons based on the training data described above. We used SMOreg, an implementation of support vector regression available in Weka 3.4.14 [39]. The training data were partitioned into 22 train/test sets, each of which excluded one case from the training dataset to act as the test dataset. This is analogous to 22-fold cross-validation, where the folds were cases. For each fold, selection scores for the feature comparisons in the test set were predicted. To complete the content selection system, the two comparisons with the greatest positive predicted scores were taken to be the selections made by the system.

6.10  Results

Subjects took approximately 10–15 minutes to finish reading and re-reading the primer. Some managed to respond to only a single sheet of charts, while others completed six in the same time. Though subjects were not carefully timed, the experimenter did notice that subjects typically responded faster to later sheets than they did to earlier ones.

6.10.1  System Performance

System            Precision  Recall  F-measure
naïve             0.209      0.168   0.186
semi-info.        0.305      0.305   0.305
stats-based       0.408      0.372   0.379
machine learning  0.526      0.571   0.541

Table 6.3: Mean precision, recall, and F-measure of the expected performance of the naïve and semi-informed alternative selection systems, as well as our statistics-based and machine learning systems.

On average, over 98 sets of selections, selections made by our statistics-based system agree with subject selections better than those we could expect from the baseline systems (Table 6.3), showing a marked improvement over both the naïve and semi-informed systems.
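The agreement scores in Table 6.3 compare a set of system selections with a set of subject selections. A minimal sketch of such a comparison follows. It is an illustration only: the function name is hypothetical, and the treatment of empty selections (scored as 0) is an assumption, since the text does not say how those cases were handled.

```python
from typing import Set, Tuple


def precision_recall_f(system: Set[str], subject: Set[str]) -> Tuple[float, float, float]:
    """Score system selections against one subject's selections (the gold standard).

    precision = |overlap| / |system selections|
    recall    = |overlap| / |subject selections|
    F-measure = harmonic mean of precision and recall
    Assumption: an empty selection on either side yields 0 for the affected score.
    """
    overlap = len(system & subject)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(subject) if subject else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical selections for one sheet of charts.
    system_selected = {"Battery", "Lens"}
    subject_selected = {"Battery"}
    print(precision_recall_f(system_selected, subject_selected))
```

Because each subject selects at most two features, a single disagreement moves these scores in large steps, which is worth keeping in mind when reading the means and standard deviations reported in this section.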
After seeing our statistics-based system’s selections, subjects selected the same features as our system more often than they did before seeing our system’s choices (mean precision = 0.500, sd = 0.419; mean recall = 0.449, sd = 0.390; mean F-measure = 0.454, sd = 0.380). This is a change of approximately 20% in all measures. This change is statistically significant according to two-tailed paired t-tests (precision t(97) = 2.84, p < 0.01; recall t(97) = 3.13, p < 0.01). Subjects, on average, rated 60% of the statistics-based system selections as good (sd = 0.418). This means that subjects were more likely to rate system selections as good than they were to select them in their final selections. Subjects tended to go with their initial selections in the end (mean precision = 0.806, sd = 0.301; mean recall = 0.801, sd = 0.302; mean F-measure = 0.799, sd = 0.298).

Performance of the Machine Learning System

Selections made by the machine learning system trained on data collected in our study matched selections made by subjects better than our statistics-based system (mean precision = 0.526, sd = 0.281; mean recall = 0.571, sd = 0.313; mean F-measure = 0.541, sd = 0.284), showing a 28% increase in precision and 53% in recall. However, when compared to the selections made by subjects after seeing our statistics-based system’s selections, the machine learning system did not improve much (mean precision = 0.485, sd = 0.317; mean recall = 0.526, sd = 0.347; mean F-measure = 0.498, sd = 0.321): it was roughly as precise, with 17% better recall.

6.10.2  Usability of the Visualization

Subjects rated each of the statements related to Nielsen’s quality components (learnability, efficiency, memorability, error, and overall satisfaction), as well as a statement on whether the charts were cluttered. The results are given in Table 6.4.

Charts are...   sd   d    n    a    sa   nr
Learnable       -    2    6    13   15*  -
Efficient       -    3    4    20*  9    -
Memorable       -    3    5    17*  11   -
Read correctly  -    1    6    19*  10   -
Satisfying      1    2    5    22*  5    1
Cluttered       -    18*  12   3    1    2

Table 6.4: Subjects’ responses to statements related to components of usability. Subjects could strongly disagree (sd), disagree (d), agree (a), or strongly agree (sa) with a statement, or remain neutral (n), or not respond (nr). The most frequent response for each component is marked with an asterisk.

6.11  Discussion

Our system for selecting comparisons performs better than the baseline systems. Our statistics-based system’s selected comparisons were more similar to those made by subjects, and the improvement over the naïve and semi-informed baseline systems is marked. However, the scores achieved by our system are not particularly high. This suggests that our system could still be improved, perhaps by tweaking the thresholds for categorizing comparison aspects. Interestingly, subjects often believed that our system made good selections. It is possible that, in many cases in which subjects classified the system’s selections as good, they did believe them to be good, but not as good as those they made themselves. Perhaps subjects are charitable when “double-checking another analyst’s selections,” but remain convinced of their initial selections.

Subjects were not always as certain of their initial selections after seeing those made by our system. In some cases, subjects may have been convinced by the selections made by our statistics-based system and chose to include comparisons selected by the system instead of those they made themselves. That subjects did change their selections after seeing system selections could suggest that the selections made by our system were, in some cases, valuable and different from those that subjects were able to make from using the visualization alone.
Unfortunately, it could be that subjects revised their selections not because the selections presented to them were valuable, but simply because they were uncertain of their initial selections, or because any “second opinion” could have convinced them to revise them. It could be that subjects would have revised their selections regardless of whether they were shown selections made by our system, selections made by the naïve baseline system, or random selections. Consequently, we cannot be certain that subjects revised their selections because they believed our system’s selections were good. Our study did not control for this, and we did not ask subjects explicitly to justify their final selections. The machine learning system selected more of the same comparisons first selected by subjects. This is encouraging, particularly since it is not unreasonable to expect that such a system would improve given a larger set of data on which to train. However, the improvement shown by the machine learning system was not great when compared to subjects’ revised selections. This is likely because the machine learning system was trained to approximate the first selections made by subjects, not their revised selections. It would be illuminating to see whether subjects, shown the selections made by the machine learning system rather than those made by the statistics-based system, would revise their selections to be more consistent with those selections. Further, the dataset used to train the machine learning system was not very large. Though we tried to make the best of the small dataset by performing n-fold cross-validation (the leave-one-out experiment), the system’s performance may not be generalizable. Indeed, the machine learning system was experimental, and tweaks to the training data, algorithm parameters, and even form of machine learning could lead to improved performance. From questionnaire responses, we find that subjects consider our visualization to be usable.
The majority of subjects responded positively to the visualizations in general, though the experimenter’s observations and subjects’ comments, made post-session and written in the questionnaires, suggest there were differences in how quickly and confidently they read them.

Chapter 7  Conclusions and Future Work

7.1  Future Work

Overall, the results of our user study are encouraging. An examination of subjects’ written reasoning for selecting feature comparisons seems likely to produce insight as to why the system’s selections differ from subjects’, and, more generally, how subjects interpreted the data and used it to justify their selections. It may also be important to see the extent to which subjects agreed with each other, as well as to expose some of their reasoning in changing their selections. Indeed, our study found that subjects would sometimes change their selections after seeing our system’s selections, but could not conclude that subjects changed their selections because they thought our system’s selections were valuable. We would also like to evaluate our visualization method more precisely, along with the benefits of interacting with it on a computer rather than in a static presentation. Such evaluations would allow us to explore how well hehxplore supports the dynamic, interactive tasks in our integrated task model. It would also be interesting to perform more exploratory studies of the system to see whether it succeeds in being usable in a realistic context and in provoking insight into the data. Our experiment with machine-learning–based content selection shows promise. It would be interesting to see if the performance it obtained would change with a larger set of data to train on, or as algorithm parameters and types were altered.

7.2  Conclusions

We have detailed Hierarchical Evaluative Histogram Explorer, a multimedia interface for facilitating the comparison of opinions on two entities.
This interface includes two complementary presentations of opinion data: a visualization of the opinions on features of multiple entities, and a textual summary of the most noteworthy comparisons of opinions on features across two entities. We described the motivation for this interface, the design of the visualization, and the methods by which comparisons are ranked, selected, and summarized. The results of our user study show that our visualization is usable, and that our summarization system performs more like humans than do two baseline systems. Our study also found that subjects often would reconsider their own analysis when shown that of our system, changing their conclusions about the data in a way more in line with that of our system. Further studies found that content selection strategies based on machine learning techniques also performed well and show promise. While there is room to improve our interface, it does visualize opinion data in a usable way, and it selects feature comparisons that, in some cases, humans find valuable and did not notice from inspecting the visualization alone.

Bibliography

[1] Robert Amar, James Eagan, and John Stasko. Low-level components of analytic activity in information visualization. In INFOVIS ’05: Proceedings of the 2005 IEEE Symposium on Information Visualization, page 15. IEEE Computer Society, 2005.
[2] Robert Amar and John Stasko. A knowledge task-based framework for design and evaluation of information visualizations. In INFOVIS ’04: Proceedings of the 2004 IEEE Symposium on Information Visualization, pages 143–150. IEEE Computer Society, 2004.
[3] Jeanette Bautista and Giuseppe Carenini. An integrated task-based framework for the design and evaluation of visualizations to support preferential choice. In AVI ’06: Proceedings of the 2006 Working Conference on Advanced Visual Interfaces, pages 217–224. ACM, 2006.
[4] Dominique Brodbeck and Luc Girardin. Visualization of large-scale customer satisfaction surveys using a parallel-coordinate tree. In INFOVIS ’03: Proceedings of the 2003 IEEE Symposium on Information Visualization, pages 197–201. IEEE Computer Society, 2003.
[5] G. Carenini, R. Ng, and A. Pauls. Multi-document summarization of evaluative text. In EACL ’06: Proceedings of the 11th Conference of the European Chapter of the ACL, 2006.
[6] Giuseppe Carenini and Jackie C. K. Cheung. Extractive vs. NLG-based abstractive summarization of evaluative text: The effect of corpus controversiality. In INLG ’08: Proceedings of the 57th International Natural Language Generation Conference. ACL, 2008.
[7] Giuseppe Carenini, Raymond T. Ng, and Adam Pauls. Interactive multimedia summaries of evaluative text. In IUI ’06: Proceedings of the 11th International Conference on Intelligent User Interfaces, pages 124–131. ACM, 2006.
[8] Giuseppe Carenini, Raymond T. Ng, and Ed Zwart. Extracting knowledge from evaluative text. In K-CAP ’05: Proceedings of the 3rd International Conference on Knowledge Capture, pages 11–18. ACM, 2005.
[9] Stephen M. Casner. Task-analytic approach to the automated design of graphic presentations. ACM Transactions on Graphics, 10(2):111–151, 1991.
[10] J.A. Chevalier and Dina Mayzlin. The effect of word of mouth on sales: Online book reviews. Working Paper, 2003.
[11] C. Dellarocas, N. Awad, and X. Zhang. Exploring the value of online reviews to organizations: Implications for revenue forecasting and planning. In Proceedings of the 24th International Conference on Information Systems, 2004.
[12] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. Bradford Books, 1998.
[13] Ying-Huey Fua, Matthew O. Ward, and Elke A. Rundensteiner. Hierarchical parallel coordinates for exploration of large datasets. In VIS ’99: Proceedings of the 1999 Conference on Visualization, pages 43–50. IEEE Computer Society, 1999.
[14] A. Fujii and T. Ishikawa. A system for summarizing and visualizing arguments in subjective documents: Toward supporting decision making. In COLING-ACL ’06: Proceedings of the Workshop on Sentiment and Subjectivity in Text, pages 15–22. ACL, 2006.
[15] Anindya Ghose. The Economic Impact of User-Generated and Firm-Published Online Content: Directions for Advancing the Frontiers in Electronic Commerce Research. Wiley & Sons, 2007. Online chapter.
[16] David Godes and Dina Mayzlin. Using online conversations to study word-of-mouth communication. Marketing Science, 23(4):545–560, 2004.
[17] M.L. Gregory, N. Chinchor, P. Whitney, R. Carter, E. Hetzler, and A. Turner. User-directed sentiment analysis: Visualizing the affective content of documents. In COLING-ACL ’06: Proceedings of the Workshop on Sentiment and Subjectivity in Text, pages 23–30. ACL, 2006.
[18] Catalina Hallett. Multi-modal presentation of medical histories. In IUI ’08: Proceedings of the 13th International Conference on Intelligent User Interfaces, pages 80–89, New York, NY, USA, 2008. ACM.
[19] M. Hu and B. Liu. Mining opinion features in customer reviews. In AAAI ’04: Proceedings of the National Conference on Artificial Intelligence, pages 755–760, 2004.
[20] Minqing Hu and Bing Liu. Feature based summary of customer reviews dataset. http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html, 2004.
[21] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD ’04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM, 2004.
[22] Clayton Lewis and John Rieman. Task-centered User Interface Design. 1994.
[23] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion observer: Analyzing and comparing opinions on the web. In WWW ’05: Proceedings of the 14th International World Wide Web Conference, pages 342–351. ACM, 2005.
[24] C.D. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA, 1999.
[25] Peter McLachlan, Tamara Munzner, Eleftherios Koutsofios, and Stephen North. Liverac: Interactive visual exploration of system management time-series data. In CHI ’08: Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems, pages 1483–1492. ACM, 2008.
[26] Satoshi Morinaga, Kenji Yamanishi, Kenji Tateishi, and Toshikazu Fukushima. Mining product reputations on the web. In KDD ’02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 341–349, New York, NY, USA, 2002. ACM.
[27] Jakob Nielsen. Usability Engineering. Morgan Kaufmann, 1993.
[28] B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.
[29] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In EMNLP ’02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 79–86, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[30] Irene Pollach. Electronic word of mouth: A genre analysis of product reviews on consumer opinion web sites. In HICSS ’06: Proceedings of the 39th Annual Hawaii International Conference on System Sciences, volume 3. IEEE Computer Society, 2006.
[31] Ana-Maria Popescu and Oren Etzioni. Extracting product features and opinions from reviews. In HLT-EMNLP ’05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 339–346. ACL, 2005.
[32] S. Senecal and J. Nantel. The influence of online product recommendations on consumers’ online choices. Journal of Retailing, 80(2):159–169, 2004.
[33] Ben Shneiderman. The eyes have it: a task by data type taxonomy for information visualizations. In VL ’96: Proceedings of the 1996 IEEE Symposium on Visual Languages, pages 336–343. IEEE Computer Society, 1996.
[34] Harri Siirtola. Interaction with the reorderable matrix. In IV ’99: Proceedings of the 1999 IEEE International Conference on Information Visualization, pages 272–277. IEEE Computer Society, 1999.
[35] L.A. Treinish. Task-specific visualization design. IEEE Computer Graphics and Applications, 19(5):72–77, Sep/Oct 1999.
[36] Edward R. Tufte. Envisioning Information. Graphics Press, 1990.
[37] Edward R. Tufte. Visual Explanations: Images and Quantities, Evidence and Narrative. Graphics Press, 1997.
[38] P.D. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424, 2002.
[39] Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2nd edition, 2005.
[40] Ivan Zhao, Giuseppe Carenini, and Lucas Rizoli. Visualizing feature-based customer review summarization system using p-node tree. Undergrad report, 2008.
[41] M.X. Zhou and S.K. Feiner. Visual task characterization for automated visual discourse synthesis. In CHI ’98: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 392–399. ACM, 1998.

