NJM-Vis: Applying and Interpreting Neural Network Joint Models in Natural Language Processing Applications

by David Johnson
BCIS, University of the Fraser Valley, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)
October 2019
© David Johnson, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "NJM-Vis: Applying and Interpreting Neural Network Joint Models in Natural Language Processing Applications", submitted by David Johnson in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science.

Examining Committee:
Giuseppe Carenini, Computer Science (Supervisor)
Gabriel Murray, Computer Science (Supervisor)

Abstract

Neural joint models have been shown to outperform non-joint models on several NLP and Vision tasks and constitute a thriving area of research in AI and ML. Although several researchers have worked on enhancing the interpretability of single-task neural models, in this thesis we present what is, to the best of our knowledge, the first interface to support the interpretation of results produced by joint models, focusing in particular on NLP settings. Our interface is intended to enhance interpretability of these models for both NLP practitioners and domain experts (e.g., linguists).

Lay Summary

A deep neural network is a machine learning algorithm in which layers of artificial neurons are trained to learn features of the neural network input which are informative for the target prediction task. Although this algorithm can have strong predictive power, it is often seen as a black box in which the results of the neural network are often difficult to explain or interpret. Multitasking is a manner of training deep neural networks in which the neural network learns more than one task at the same time, with the intention that learning one task will benefit learning the other task and vice versa. The added complexity of multitask learning means the results from deep neural networks may be even more difficult to interpret. This thesis describes an interface intended to allow users to interpret and explore the results from their deep multitask neural networks, making the black box that lies between the users and their deep neural network output transparent.

Preface

This thesis is an original work of the author, David Johnson, under the supervision of Dr. Giuseppe Carenini and Dr. Gabriel Murray. This work is unpublished at this time.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
   2.0.1 Neural Joint Models
   2.0.2 Interfaces to Interpret Neural Models (Visual Data)
   2.0.3 Interfaces to Interpret Neural Models (Textual Data)
   2.0.4 Saliency Interpretation Method
   2.0.5 Word Cloud Visualization Techniques
3 Datasets
4 Data Model
   4.1 Model Description
   4.2 Model Architecture
   4.3 Results Comparing Single and Joint Models
5 Task Model
6 Design Solution
   6.1 NJM-Vis Design
   6.2 Iterative Design Process
7 Case Study
   7.1 Method
   7.2 Participant Results (7.2.1 Participant 1; 7.2.2 Participant 2; 7.2.3 Participant 3; 7.2.4 Participant 4)
   7.3 Summary of Results
8 Future Work
9 Conclusion
Bibliography
A Supporting Materials
   A.1 Case Study Questionnaires (A.1.1 Participant 1; A.1.2 Participant 2; A.1.3 Participant 3; A.1.4 Participant 4)
   A.2 Demo Dataset

List of Tables

Table 4.1 Single & Joint Task Results for Extractive Summarization
Table 4.2 Single & Joint Task Training Results for Dialogue Act
Table 4.3 Single & Joint Task Results for Bot Detection
Table 4.4 Single & Joint Task Results for Fake News Detection

List of Figures

Figure 1.1 NJM-Vis interface
Figure 2.1 CNNVis [25]
Figure 2.2 Visualization tool for live convnet activations [43]
Figure 2.3 Multifaceted Feature visualization [35]
Figure 2.4 deepViz interface [5]
Figure 2.5 ActiVis interface [20]
Figure 2.6 DeepCompare interface [33]
Figure 2.7 Manifold interface [44]
Figure 2.8 Saliency map for "I hate the movie" [24]
Figure 2.9 SentenTree visualization on Twitter data [18]
Figure 2.10 MultiCloud visualization [19]
Figure 4.1 Neural network architecture for the bot detection & fake news detection tasks
Figure 4.2 Neural network architecture for the extractive summarization & dialog act prediction tasks
Figure 6.1 NJM-Vis interface
Figure 6.2 Score panel
Figure 6.3 Vis panel
Figure 6.4 Sentence browser panel
Figure 6.5 Early interface mock-up
Figure 6.6 Prototype interface
Figure 6.7 A diagram of possible aligned bar chart designs drawn during the iterative design process
Figure 6.8 Another possible stacked bar chart design drawn during the iterative design process
Figure 6.9 Interface design created prior to the current design
Figure 8.1 A mock-up to potentially expand the number of user selections at once
Figure 8.2 A mock-up to address case study feedback
Figure A.1 A screenshot of the visualization produced by the toy dataset used for demo purposes during the case study
Acknowledgments

I would like to extend my gratitude to Dr. Giuseppe Carenini and Dr. Gabriel Murray for supporting me, guiding me, and teaching me every step of the way throughout this work.

To my family and to my friends I can say with certainty that I wouldn't have ever succeeded in reaching this point without you. My thanks and my love to you for being there for me to rely on, and for consistently bringing joy into my life.

Chapter 1: Introduction

Deep learning approaches have recently shown great potential on a large number of key prediction problems. While this progress has been achieved by mainly focusing on one specific task at a time, it is clear that more powerful solutions can be developed by building joint models, where dependencies between multiple tasks can be effectively exploited [27]. These joint models are already outperforming non-joint models on several important tasks and have become a thriving area of research in AI and Machine Learning, especially when applied to Natural Language Processing (NLP) and Computer Vision. For instance, in Computer Vision, jointly learning histogram of oriented gradient features, deformation handling, and occlusion handling can improve a pedestrian detection system over one that learns each task individually [36]. In NLP, named-entity recognition can use part-of-speech tags as features, so improving the accuracy of a part-of-speech tagger can improve the results of a named-entity recognition model, and vice versa [7]. Similarly, discourse parsing can often be combined with other NLP tasks, in which improvements in learning discourse parsing can improve learning in a joint task such as sentiment analysis [34].

Although a strength of deep learning is in learning representations of data, the complexity of the representation makes explanation for deep neural networks notoriously difficult. Deep neural networks often learn multiple representations of the input, and we do not know which part of the input they capture [15]. This is arguably even more challenging for joint models, which tend to be much more complex given that the models have shared layers. It may not be clear how much one task contributes to the learning of shared layers over another task, and therefore not clear how much impact one task had on the output of another task.

Figure 1.1: NJM-Vis interface. The left of the interface shows the score panel, displaying individual model performance. The middle of the interface is the vis panel, displaying a word graph visualization indicating words which are relevant to model prediction. The right side of the interface is the sentence browser panel. After clicking words that appear in the vis panel, the sentence browser panel is populated with sentences from the dataset which contain the selected word.

Several researchers have worked on enhancing the transparency/interpretability of single-task neural models, typically by visualizing feature optimization.
In Computer Vision this can be accomplished by continually synthesizing images which cause higher and higher neuron activations, eventually finishing with an image synthesized to maximally activate neurons. Though these preferred input images rarely look like natural images, they can be used to determine what a neuron layer has learned to detect [43][11][16]. In addition to feature optimization methods, there are also attribution methods such as Layerwise Relevance Propagation (LRP) and saliency attribution. These methods try to attribute a neuron's relevance to the neural model's output, and are more appropriate for domains other than Computer Vision.

In this thesis we present what is, to the best of our knowledge, the first visual interface to support the interpretation of results produced by Neural Joint Models (NJM-Vis). In particular, we focus on supporting the understanding of the benefits that one task is bringing to the other in NLP settings by relying on the LRP attribution method. NJM-Vis, shown in Figure 1.1, comprises two views in a multiform overview/detail design [30], in which one view shows an overview of the results as confusion matrices, while the other allows the user to explore details of the results through an adaptation of a sentence visualization tool [18].

As running examples, we use two NLP joint models: Summ-DiaAct and Bot-FakeNews. In Summ-DiaAct, an extractive summarization task [31][32] is jointly performed with a dialog act detection task [9] on conversational data. The goal of extractive summarization is to classify each sentence as important (extract-worthy) or not, while the goal of dialog act detection is to predict the speaker intention associated with an utterance, e.g. question, answer, inform. In Bot-FakeNews, a bot detection task [22] is jointly trained with a fake news detection task [40]. For bot detection, the goal is simply to classify a tweet as coming from either a bot Twitter account or a real user account. The goal of the fake news detection task is to classify a tweet as either verifiable fact or rumour. In this thesis, we have implemented a joint neural model for both Summ-DiaAct and Bot-FakeNews and used the results produced by such models as inputs to our interface.

Users of visualization tools for deep learning can be categorized into three overlapping groups [17]: model developers, model users, and non-experts. Our system is designed to assist an overlap of model developers and model users. Model developers understand deep learning thoroughly and use systems (e.g., Tensorboard [1], Deep Eyes [38], and Blocks [4]) to interpret the underlying neural model, to debug or improve it. In contrast, model users may have less or no experience implementing deep learning solutions, but employ neural networks as a means of developing domain-specific applications. Systems built for these users include ActiVis [20] and LSTMVis [39].
In this work we will specify the terms "model developers" or "model users" when speaking of only one group or the other, and use the term "users" to refer generally to users that could be either model developers or model users.

Our main goal is to enhance the ability of users to interpret the benefits of a joint task model compared to a single task model by allowing them to inspect the predictions of the joint task model; to assess how the joint task models differ from the single task models; and, more importantly, to evaluate the reasons why these predictions are different.

To assess the strengths and weaknesses of our initial prototype, we have run a formative evaluation as a case study with four model user participants. In these studies, we have used our two joint models: Summ-DiaAct and Bot-FakeNews.

Chapter 2: Related Work

2.0.1 Neural Joint Models

Neural joint models come in two alternative forms: multi-tasking and pre-training. Pre-training completes the training of one task and then uses the learned weights to initialize the weights for a second task. This has been shown by [12] to result in better generalization and better performance than the typical manner of random weight initialization. Another style of joint model is multi-tasking [7], where the training process proceeds by feeding training examples from alternating tasks, allowing the neural model to jointly learn multiple tasks. Multi-tasking has been successfully applied in multiple areas, such as NLP [26] and computer vision [21]. In this work, we use multi-tasking, as it tends to outperform pre-training (e.g., [34] in NLP, joining discourse parsing and sentiment analysis).

2.0.2 Interfaces to Interpret Neural Models (Visual Data)

Much of the previous work creating interfaces for visualizing deep neural networks (DNNs) has been done on computer vision tasks using Convolutional Neural Networks (CNNs). [43] describes an interface allowing visualization of plotted convolutional layer activation values, as seen in Figure 2.2. Similarly, [25] (Fig. 2.1) presents an interface for visualizing CNNs by converting a CNN to a directed acyclic graph and clustering neurons in each layer of the network before adding an edge-bundling visualization to show an overview of the whole network. Both of these do support interpretation of the underlying neural models, but they are both focused on CNNs and vision tasks, unlike our goal of using feed-forward DNNs on textual tasks. In [35] (Fig. 2.3) the authors state that a limit of current explanation techniques is that they assume each neuron detects only one type of feature, but neurons must be multifaceted, i.e. they fire in response to different types of features. Previous activation maximization techniques constructed images without regard for multiple facets of a neuron. The authors explicitly uncover multiple facets of each neuron by separately producing a synthetic visualization of each of the types of images that activate a neuron. The authors accomplish this by calculating the derivative of the target neuron activation with respect to each pixel (this describes how to change pixel color to increase activation of that neuron). Another work for visualizing CNNs is deepViz [5] (Fig. 2.4). In deepViz, the interface allows exploration of convolutional neural networks as the network changes over training time steps. Users select an image, which is then passed through a particular network filter.
Visualization is done using decaf [10], which is used to retrieve and visualize the activation values of a particular layer, selected by the user. With an included slider to allow the user to move through the training time steps, the interface visualizes how the selected layer's filter changes over time. Like the other discussed interfaces, deepViz focuses only on visual image data, not text, and does not have any functionality in place to account for joint neural models.

Figure 2.1: CNNVis, a system for visualizing CNNs for understanding, diagnosing non-convergence problems, and refining CNNs. The visualization shows a CNN as a directed acyclic graph, with an aggregation of neurons and layers. [25]

2.0.3 Interfaces to Interpret Neural Models (Textual Data)

Some recent work has explored visual interfaces for understanding neural NLP (e.g., [24]) (Fig. 2.8), but they are limited to single-task models, while our goal is supporting the interpretation of joint models.

In [33] (Fig. 2.6), the authors present a visual analytic system for comparing the results of two deep neural models. Their interface allows users to compare neuron activations by layer, as well as the number of data instances which fall into each activation bin from values -1 to 1. In addition, the interface shows users details of data instances, such as a juxtaposition showing which of the two neural models correctly classified the data instance. The interface shows a treemap of the overall performance of both models. Although this interface compares two neural models as we do in our work, they do not compare joint neural models. Additionally, the intention behind our work is to explain the results of joint deep neural networks, whereas the intention behind DeepCompare is to compare the performance of models, not to explain the results. Another similar interface is ActiVis [20] (Fig. 2.5). In ActiVis the authors present an interface intended to be effective for industry-scale data. The interface allows users to see neuron activations over a dataset by means of viewing classes, a 2D projection of the neuron activations by class, and a visual representation of the data instances by class, which allows the user to explore the data instances and compare data instance neuron activations. Though this interface does allow neural models trained on textual data, it focuses only on single-task networks. Manifold [44] (Fig. 2.7) is another interface for interpretation of neural models, although Manifold also incorporates machine learning models beyond just neural models (linear regression, support vector machines, etc.). The interface allows comparison between positive and negative subsets (true positive, false negative, etc.) of predictions between models. The interface additionally provides a way to view the number of features appearing in each of the user's selected instances, allowing the user to see the feature distribution. There are similarities to our work, such as the comparison between positive and negative subsets of predictions between models, though Manifold is not intended to directly compare joint task models as our interface does, so Manifold does not directly display to the user data instances which were fixed or broken by the process of joint training.

2.0.4 Saliency Interpretation Method

In [24] (Fig. 2.8) salience is used to measure the amount a neural unit contributes to meaning compositionality (building sentence meaning from the meaning of words or phrases) using first-order derivatives. In this way the authors are able to show explanations for the difference in performance on sentiment analysis tasks between a recurrent neural network (RNN), a long short-term memory network (LSTM), and a bi-directional LSTM. Although saliency methods contribute to understanding of neural models, it was shown in [2] that they are less effective than LRP methods, as they are unable to establish when words are inhibiting a prediction decision, as LRP is capable of doing. In this thesis, we rely on LRP to explain both single task and joint task predictions.
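For concreteness, first-derivative saliency of the kind used in [24] can be sketched as follows. This is an illustrative example only (not code from [24] or from our implementation), assuming a hypothetical model that maps a sentence's word embeddings to class scores:

import tensorflow as tf

def word_saliency(model, embeddings, class_index):
    # embeddings: [1, num_words, embed_dim] tensor for one sentence
    with tf.GradientTape() as tape:
        tape.watch(embeddings)
        scores = model(embeddings)        # forward pass on the embedded sentence
        target = scores[0, class_index]   # score of the class being explained
    grads = tape.gradient(target, embeddings)  # first-order derivatives
    # One saliency score per word: norm of the gradient over embedding dimensions.
    return tf.norm(grads[0], axis=-1).numpy()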
2.0.5 Word Cloud Visualization Techniques

In our work we use a word cloud style visualization adapted from [18]. In [18] (Fig. 2.9) the authors present SentenTree, a technique for visualizing social media text. The visualization uses a word cloud to present words which are common in social media datasets. The SentenTree technique builds on typical word clouds by including links between words in the word cloud, indicating that words co-occur in sentences together. Our work uses the underlying SentenTree algorithm, but builds on top of the original SentenTree by changing SentenTree to allow word size to indicate the score of a contribution measure, such as LRP. Additionally, we split the SentenTree visualization to allow two sets of SentenTree visualizations, supporting our users in comparing two selections of subsets of data at the same time (for instance, our users can compare true positives from the joint task and true positives of the single task at the same time). Another work using word clouds as a visualization technique is MultiCloud [19]. MultiCloud is a visualization technique that expands word clouds to allow visualization of multiple documents within a single word cloud. MultiCloud has fixed points along the border of the visualization layout, and distributes words in locations which indicate to which document or documents the word belongs. For example, if the fixed point for document 1 is in the top left corner of the layout, and a word in the word cloud comes from document 1, it is pulled closer to the top left corner of the word cloud. Although this visualization technique is intended to show which documents the words in the word cloud come from, it is possible this technique could be applied to our work, in which we could instead have a single word cloud and use the fixed points on the border to indicate which neural model (joint or single) the word comes from. Exploring this alternative is left as future work.
Figure 2.2: Visualization tool for live convnet activations from [43]. For clarity, the original caption is slightly revised as: "The bottom shows a screenshot from the software. Webcam input is shown, along with the whole layer of conv5 activations. The selected channel pane shows an enlarged version of the 13x13 conv5_151 channel activations. Below it, the deconv starting at the selected channel is shown. On the right, three selections of nine images are shown: synthetic images produced using regularized gradient ascent methods, the top 9 image patches from the training set (the images from the training set that caused the highest activations for the selected channel), and the deconv of those top 9 images. All areas highlighted with a green star relate to the particular selected channel, here conv5_151; when the selection changes, these panels update. The top depicts enlarged numerical optimization results for this and other channels. conv5_2 is a channel that responds most strongly to dog faces, but it also responds to flowers on the blanket on the bottom and half way up the right side of the image (as seen in the inset red highlight). conv5_151 detects different types of faces. The top nine images are all of human faces, but here we see it responds also to the cat's face. Finally, conv5_111 activates strongly for the cat's face, the optimized images show cat-like fur and ears, and the top nine images (not shown here) are also all of cats."

Figure 2.3: Multifaceted Feature visualization. Original caption revised as: "The top images show 8 types of images that activate the 'grocery store' class neuron. The bottom images show training set images that activate the same neuron, as well as resemble the synthetic images in the top panel" [35].

Figure 2.4: deepViz interface. The middle of the interface shows the bitmap representations of filter banks in the selected CNN layer. Above are nodes displaying the network overview. The user is able to click one of the nodes to view the selected layer's filters. The left menu allows users to search for images labeled with a certain class ("plane" in this figure). The side view is a histogram of the model's predicted classes for the selected image [5].

Figure 2.5: ActiVis interface [20]. For clarity, the original caption is slightly revised as: "ActiVis integrates several coordinated views to support exploration of complex deep neural network models, at both instance- and subset-level. 1. The user Susan starts exploring the model architecture, through its computation graph overview (at A). Selecting a data node (in yellow) displays its neuron activations (at B). 2. The neuron activation matrix view shows the activations for instances and instance subsets; the projected view displays the 2-D projection of instance activations. 3. From the instance selection panel (at C), she explores individual instances and their classification results. 4. Adding instances to the matrix view enables comparison of activation patterns across instances, subsets, and classes, revealing causes for misclassification."

Figure 2.6: DeepCompare [33] interface showing a comparison between CNN and LSTM deep neural models. A) A neuron weight detail panel showing weights of one layer color-coded (green high, red low) for each model. B) Neuron Activation Distribution panel showing the number of data instances binned into an activation scale from -1 to 1. C) Test results panel showing data instances and a glyph displaying whether each model predicted the test instance correctly or incorrectly. D) Test Result Summary panel showing a treemap of the model performance on the entire test dataset, color-coded to show positive and negative data instances (purple and yellow respectively).

Figure 2.7: Manifold interface. Original caption: "Manifold consists of two interactive dialogs: a model comparison overview (1) that provides a visual comparison between model pairs using a small multiple design, and a local feature interpreter view (2) that reveals a feature-wise comparison between user-defined subsets (c) and provides a similarity measure (b) of feature distributions. The user can sort based on multiple metrics (a) to identify the most discriminative features among different subsets, i.e., sort based on the selected subset in red (2) or a specific class such as C1 (3)." [44].
Figure 2.8: Original caption slightly revised as: "Saliency map for 'I hate the movie'. Each row corresponds to saliency scores for the correspondent word representation. The x-axis corresponds to the word embedding dimensions (each word is set to 60 dimensions)" [24].

Figure 2.9: SentenTree visualization on Twitter data, in which words are nodes and nodes connected by an edge are words which co-occur in tweets together. The size of words indicates a word's frequency in a dataset [18].

Figure 2.10: MultiCloud visualization. The visualization contains small dots around the word cloud which indicate "anchor points". "Anchor points" are those which indicate documents. The anchor points pull words in the visualization towards the point if the word is in the document. Anchor points, and words which belong to the document represented by the anchor point, are colour-coded. Words which are in multiple documents are grey in colour. The right image shows a word cloud built containing only words which are from individual documents, while the left image shows a word cloud built from words belonging to both multiple documents (words in grey) and words in individual documents (words in colour) [19].

Chapter 3: Datasets

In this thesis, we develop two NLP joint models: Summ-DiaAct and Bot-FakeNews. In Summ-DiaAct, an extractive summarization task [31][32] is jointly performed with a dialog act detection task [9] on conversational data. In Bot-FakeNews, a bot detection task [22] is jointly trained with a fake news detection task [40]. For the Summ-DiaAct joint model, we use the Augmented Multi-party Interaction (AMI) corpus as our dataset [6]. AMI is a multi-modal dataset created from 100 hours of meeting recordings. In creating this dataset, participants played the various roles in a design team in which their goal was to take a project from kick-off to finish throughout the course of a day. The AMI dataset has annotations for multiple tasks such as dialog act, topic segmentation, abstractive and extractive summarization, named entities, etc. We use the annotations for the dialog act (15 types) and extractive summarization (binary) tasks, as these two were shown to benefit from joint training [37]. The extractive summarization labels represent whether a sentence is extract worthy or not.
The dialog act annotations used are as follows:

• Backchannel: Someone listening to the speaker says something in the background without stopping the speaker.
• Stall: Speaker starts speaking before they are ready and uses "filled pauses" such as "uh", "um".
• Fragment: Speaker started saying something but stopped before they got far enough to finish their intention.
• Inform: Speaker spoke with the intention to give information.
• Elicit-Inform: Speaker requested that someone else give information.
• Suggest: Speaker gives a suggestion to the listeners.
• Offer: Speaker expresses an intention relating to their own actions.
• Elicit-Offer-Or-Suggestion: Speaker requests that someone else make an offer or a suggestion.
• Assess: Speaker expresses an evaluation of something that is being discussed by the group.
• Comment-About-Understanding: Speaker indicates that they did or did not understand what a previous speaker said.
• Elicit-Assessment: Speaker attempts to elicit an assessment about what was said or done previously.
• Elicit-Comment-About-Understanding: Speaker elicits from listeners whether what has been said has been understood.
• Be-Positive: Social acts intended to make the individual or group happier.
• Be-Negative: Social acts expressing negative feelings to the individual or group.
• Other: Any acts conveyed by the speaker which do not fit into the other act categories.

For the Bot-FakeNews joint models, we used two separate datasets, both of which consist of tweets from Twitter. For bot detection, we used a dataset from [8], which contains both tweets from genuine accounts, as well as tweets from accounts identified as bots in [41]. Dataset [8] included and built on work from [41], in which they identified "spambots" by crawling Twitter profiles through the Twitter API and determining which accounts had posted at least one malicious URL. A URL was identified as "malicious" via two methods: Google Safe Browsing (GSB) and a URL honeypot. GSB is a blacklist for identifying malicious or phishing URLs. The honeypot was developed by [41] as well, in order to visit the URLs using a browser inside a virtual machine and then to detect creation/modification of sensitive data. After identifying an account as potentially belonging to a bot, [41] manually went through each of the accounts and decided whether their tweets were useful and meaningful. Genuine real user accounts were identified from a random sample of Twitter users which [8] randomly contacted and asked a simple question in natural language. The replies to the question were manually verified, and accounts which properly answered the question were verified as human. For the fake news task, we used a dataset from [45], in which the authors enlisted a team of journalists to identify when a newsworthy event was occurring, at which point the authors collected tweets associated with the event.
The journalists then went through the collected tweets and identified them as either factual or rumour. The authors proceeded in this manner over five different newsworthy events:

• Ferguson unrest: citizens of Ferguson in Missouri, USA protest the fatal shooting of an 18-year-old African American, Michael Brown, by a white police officer on August 9, 2014.
• Ottawa shooting: shootings on Parliament Hill in Ottawa, Canada result in the death of a Canadian soldier on October 22, 2014.
• Sydney siege: a gunman held hostage ten customers and eight employees of a Lindt chocolate cafe located at Martin Place in Sydney, Australia on December 15, 2014.
• Charlie Hebdo shooting: two brothers force their way into the offices of the French satirical weekly newspaper Charlie Hebdo in Paris, killing 11 people and wounding 11 more, on January 7, 2015.
• Germanwings plane crash: a passenger plane from Barcelona to Dusseldorf crashes in the French Alps on March 24, 2015, killing all passengers and crew. The plane was found to have been deliberately crashed by the co-pilot.

Chapter 4: Data Model

By following the standard information visualization methodology [29], we base the design of NJM-Vis on abstracted data and task models. In this chapter, we present the data model, which describes the information about sentences and words that we need to compute and store. The task model, which outlines key analysis tasks to support the interpretation of joint models and their comparison with single models, will be presented in the following chapter.

4.1 Model Description

The data model for the four classification tasks comprises tables containing information associated with sentences and with words. For each sentence in the datasets, we need to store all its words and their corresponding embeddings. Additionally, for each sentence and for each task we need the prediction of the joint model, the prediction of the single model, and the gold-standard label. Moving to words, for each word we need a measure of its contributions to each possible prediction for the sentence containing that word (across models and tasks). So, for instance, for the word 'remote' in the sentence 'we do not include a remote' in the AMI corpus, we would need a measure of its contribution to the prediction of that sentence being summary-worthy and of its dialog act type, in the single and in the joint models.
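The following is a minimal sketch of these per-sentence and per-word records as Python data structures; the field names are illustrative rather than the implementation's actual schema:

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class WordRecord:
    token: str
    embedding: List[float]          # e.g., a 100-dimensional word vector
    # Contribution (e.g., LRP relevance) of this word to each possible prediction,
    # keyed by (model, task, label), e.g. ("joint", "summarization", "extract-worthy").
    contribution: Dict[Tuple[str, str, str], float]

@dataclass
class SentenceRecord:
    words: List[WordRecord]
    # For each task: the joint model's prediction, the single model's prediction,
    # and the gold-standard label.
    joint_prediction: Dict[str, str]
    single_prediction: Dict[str, str]
    gold_label: Dict[str, str]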
With respect to computing how much a word contributes to a neural prediction, there are multiple possible methods. Our goal is to explain the prediction for a classification problem such that, given an input vector x, we would like to know how the features of x (the words in our tasks) contribute toward our classification prediction, and in what way they contribute to our prediction.

Predictions of DNNs can be explained by decomposing the output of the network on the input variables. NJM-Vis uses this form of explanation through a method known as Layerwise Relevance Propagation (LRP) [3] to explain the output of the model. LRP propagates the relevance of the output backward through the network, distributing the relevance layer by layer in proportion to how much each neuron in the layer contributed to the output, until reaching the input layer, where the relevance is finally distributed among the input neurons (the words in our tasks) in proportion to how much each contributed to the output, giving us how relevant each part of the input was to the output of the network. Using this relevance, we can determine whether a particular part of the input contributed for or against a prediction (and whether the contribution was weak or strong). Let the neurons of the network be

a_k = g\left(\sum_j a_j w_{jk} + b\right)    (4.1)

where a_k is the neuron activation, g is an activation function which is positive and monotonically increasing, a_j are the activations from the previous layer, w_{jk} are the weights of the neuron, and b is the bias parameter. As shown in [28], a rule that works to propagate relevance is

R_j = \sum_k \left( \frac{a_j w_{jk}^{+}}{\sum_j a_j w_{jk}^{+}} \hat{R}_k + \frac{a_j w_{jk}^{-}}{\sum_j a_j w_{jk}^{-}} \check{R}_k \right)    (4.2)

where \hat{R}_k = \alpha R_k and \check{R}_k = -\beta R_k, with \alpha and \beta chosen subject to the constraints \alpha - \beta = 1 and \beta \geq 0.

Although in this thesis we have used LRP as a means of determining each word's relevance to the predicted output, our interface does not require that LRP be used. Another possible method would be an attention mechanism, which could similarly give an importance score to each input word. For the purposes of this thesis we chose to use LRP over an attention mechanism since, as shown in [42], introducing an attention mechanism adds additional complexity to neural networks, which can require longer training time and more labeled data (which are rather limited for our prediction tasks).
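For one fully connected layer, the propagation rule in Equation 4.2 can be sketched in a few lines of NumPy. This is illustrative only; our actual LRP implementation is adapted from [23], and the variable names here are assumptions:

import numpy as np

def lrp_alpha_beta(a, w, relevance_out, alpha=2.0, beta=1.0, eps=1e-9):
    # a: inputs to the layer [j]; w: weights [j, k];
    # relevance_out: relevance R_k already assigned to the layer's outputs [k].
    w_pos = np.maximum(w, 0.0)   # w+ : positive weights only
    w_neg = np.minimum(w, 0.0)   # w- : negative weights only
    z_pos = a @ w_pos            # sum_j a_j w+_jk (denominators in Eq. 4.2)
    z_neg = a @ w_neg            # sum_j a_j w-_jk
    # Split each output neuron's relevance into an excitatory share
    # (R_hat = alpha * R_k) and an inhibitory share (R_check = -beta * R_k).
    r_hat = alpha * relevance_out / (z_pos + eps)
    r_check = -beta * relevance_out / (z_neg - eps)
    # Redistribute to the inputs in proportion to a_j w_jk, as in Equation 4.2.
    return a * (w_pos @ r_hat + w_neg @ r_check)

Applying this rule layer by layer, from the output back to the input, yields a relevance value per input dimension, which can then be aggregated over each word's embedding dimensions to give a per-word contribution.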
4.2 Model Architecture

Both networks were developed using Tensorflow [1] in Python, with an LRP implementation adapted from [23]. The network for Summ-DiaAct is shown in Figure 4.2. Both the extractive summarization and dialog act prediction networks use 2500 input dimensions, since the input to each network is a sentence containing 25 words total, with 100 word embedding dimensions per word. Sentences with more than 25 words are trimmed at 25 words, and sentences with fewer than 25 words are padded with zeroes until they reach 2500 dimensions. The intermediate layers for both tasks use ReLU activation functions, as this tended to result in better performance in empirical testing. The output layer for the extractive summarization task applies a sigmoid activation function since the task is binary classification, while the dialogue act task applies a softmax activation function since the task is multi-class classification.

The weights are initialized with Xavier initialization [14], since activations chosen from a random normal distribution tended to cause neuron saturation within our DNN.

The network for Bot-FakeNews is shown and described in Figure 4.1. Both the bot detection and fake news detection networks use 2000 input dimensions, since the input to each network is a tweet containing 20 words total, with 100 word embedding dimensions per word. Using 2000 input dimensions performed better than 2500 input dimensions for these tasks in empirical testing. Tweets with more than 20 words are trimmed at 20 words, and tweets with fewer than 20 words are padded with zeroes until they reach 2000 dimensions. The intermediate layers use ReLU activation functions, while both output layers use a sigmoid activation function.

Figure 4.1: Neural network architecture for the bot detection & fake news detection tasks. Bot detection is in blue, fake news detection is in orange, and the overlap is the shared layer which is jointly learned during joint training. The numbers in the graphics indicate the dimensions of each layer. For instance, the input for both networks is set to 2000 dimensions.

Figure 4.2: Neural network architecture for the extractive summarization & dialog act prediction tasks. Extractive summarization is in blue, dialog act prediction is in orange. The numbers in the graphics indicate the dimensions of each layer. For instance, the input for both networks is set to 2500 dimensions.
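As an illustration of the shared-layer design in Figure 4.1, the following sketch builds a joint model with a 2000-dimensional input, one shared hidden layer, and a sigmoid output head for each task. It is a sketch under assumptions, not our exact code: the hidden-layer sizes are placeholders (the real dimensions are those shown in the figure), and the alternating-task training procedure described in Section 2.0.1 is not shown.

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(2000,), name="tweet_embeddings")   # 20 words x 100 dims
shared = layers.Dense(256, activation="relu",
                      kernel_initializer="glorot_uniform",      # Xavier initialization [14]
                      name="shared_layer")(inputs)
bot_hidden = layers.Dense(64, activation="relu", name="bot_hidden")(shared)
fake_hidden = layers.Dense(64, activation="relu", name="fake_hidden")(shared)
bot_out = layers.Dense(1, activation="sigmoid", name="bot_detection")(bot_hidden)
fake_out = layers.Dense(1, activation="sigmoid", name="fake_news_detection")(fake_hidden)

model = Model(inputs, [bot_out, fake_out])
model.compile(optimizer="adam",
              loss={"bot_detection": "binary_crossentropy",
                    "fake_news_detection": "binary_crossentropy"})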
4.3 Results Comparing Single and Joint Models

As shown in Table 4.1, the results for extractive summarization improve when trained in the Summ-DiaAct joint model. In contrast, Table 4.2 indicates that for the dialogue act task the improvement is negligible.

Table 4.1: Single & Joint Task Results for Extractive Summarization
Model    F-score   Precision   Recall
Single   .424      .604        .327
Joint    .463      .632        .366

When looking at the Bot Detection and Fake News detection tasks, the Bot-FakeNews joint model outperforms the single model for both tasks, as shown in Tables 4.3 and 4.4.

Table 4.2: Single & Joint Task Training Results for Dialogue Act
Micro Average (Single)   Micro Average (Joint)
.711                     .714

Table 4.3: Single & Joint Task Results for Bot Detection
Model    F-score   Precision   Recall
Single   .75       .88         .66
Joint    .96       .97         .93

Table 4.4: Single & Joint Task Results for Fake News Detection
Model    F-score   Precision   Recall
Single   .76       .86         .70
Joint    .81       .90         .72

Chapter 5: Task Model

The high-level goal of NJM-Vis is to support model developers and model users [17] in interpreting the benefits of joint task neural model predictions, when compared to single task models. Given that this is a comparison task, we referred to existing literature on visualizing comparison [13]. As in [13], comparison tasks can be grouped abstractly into the following actions: Identify, Measure, Dissect, Connect, Contextualize, and Communicate.

In addition to referring to the aforementioned existing works, we also went through an informal iterative collection of user requirements from NLP experts (including the authors). The following tasks are intended to be supported by the interface.

• (T1) Measure predictions of the two models quantitatively
Example: Measure Precision, Recall, and F-score for both models, or how many predictions are "fixed"/"broken" by the models.
Our elicited user requirements determined that both model users and model developers want to determine which model performs better quantitatively, since this allows a user to determine whether the joint task training actually improves predictive performance over the single task training by a measurable amount.

• (T2) Identify key words in subsets of predictions
Example: Identify that "dollars" is often appearing in true positive predictions, therefore we could say that "dollars" is a key word. Alternatively, a word may be considered a key word if it has a high contribution measure in a subset of predictions. Lastly, a word may be a key word if it often co-occurs with other words in subsets of predictions.
It was determined from our user requirements that the ability to identify words which are important to prediction subsets could be a valuable first step in understanding the predictive difference between joint and single task models. Identifying key words allows the user to gain an overview of potentially important differences between the models, which the user can then begin to explore more in-depth (such as in T3 and T5).

• (T3) Dissect linguistic similarities/differences between single and joint predictions
Example: Dissect that key words appearing in true positive predictions are often pronouns.
Once the user/developer has identified which words are key words, they may want to move to more in-depth analysis of those words in an attempt to discover linguistic similarities/differences and, through that analysis, perhaps gain an improved understanding of the compared tasks. For example, perhaps a user finds that by jointly training extractive summarization with dialog act prediction, their extractive summarization performance improves, and through the dissection of linguistic properties of key words between the models they discover that words which are pronouns often show up in true positives for the joint task but not for the single task trained model. This would indicate that pronouns may have predictive power for both tasks, and that by jointly training the two tasks, the network is better able to learn a representation which accounts for the importance of pronouns. The user then is able to gain knowledge about the tasks themselves, i.e. that pronouns may be important to predicting whether a sentence is extract worthy, but only when the sentence is expressing particular dialog acts.

• (T4) Identify possible errors in results
Example: Identify that "ve" has a high frequency, which may be an error left over from pre-processing words like "i've", "should've", etc.
In our elicitation of requirements from model developers it was determined that developers, since they often manually build model architectures, want to be able to easily identify possible errors in the pre-processing and training phase. Though this is also useful for model users, it may be more difficult for model users to understand the technical details of errors than the more experienced model developers, and therefore the means of identifying errors to model users may need to be more intuitive than measures like gradient values, etc.

• (T5) Identify key relationships between predicted class labels
Example: Identify that a particular predicted class label for input task 1 is often appearing in subsets of predictions for input task 2.
Identifying key relationships between predicted class labels is an important user requirement for understanding the difference between the two compared models. Consider a user with the tasks extractive summarization and dialog act prediction, in which the user finds that sentences with the dialog act "suggest" label are appearing more often in the jointly trained extractive summarization model than in the single trained model. The user could then infer that perhaps spoken suggestions have predictive power for whether a sentence should be included in an abstract, and the act of joint training helped the extractive summarization network learn a representation which accounts for this linguistic property. Similar to T3, the user then gains understanding about both the tasks themselves, as well as the predictive differences between the joint and single task models.
• (T6) Contextualize predictions at the granularity of sentences
Example: Contextualize that the key word "schedule" in true positives often appears in sentences with the modal verb "must", potentially indicating that "must" may also have predictive power when appearing with "schedule" for extract-worthy sentences.
It was discovered through our user requirement elicitation that strictly showing a key word may not always be enough context to understand the word's importance to predictions. It was determined that an important user task is to be able to view a key word in its full sentence, allowing the user to analyze the complete context of the input to the network. This context could lead the user to a deeper understanding of the linguistic properties which caused the model prediction.

Chapter 6: Design Solution

6.1 NJM-Vis Design

NJM-Vis is faceted into multiple views of coordinated visualizations. As seen in Figure 6.1, the left side of the interface, the score panel (Fig. 6.2), supports T1 by summarizing and comparing the predictions of the joint (blue) and single (orange) neural models. At the top of the score panel, Precision, Recall, and F-scores are shown in a table format for both the joint and single task versions of the model. The rows of the table list the joint and single task, in which joint is in blue and single is in orange. The bottom of the score panel shows a confusion matrix with aligned bar charts. True positive, false positive, false negative, and true negative subsets of dataset examples comprise the aligned bar charts. The aligned bar charts also include green bars for examples which were fixed by the joint training process (i.e., from false negative to true positive and from false positive to true negative). Similarly, examples which were broken by the joint training process (i.e., from true positive to false negative and from true negative to false positive) are shown in red. If the user clicks a bar, the bar will be highlighted with a purple outline, as seen by the purple highlight on the joint task true positive bar in Fig. 6.2.
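The fixed and broken subsets shown in the score panel can be derived directly from the gold labels and the two models' predictions. The following is an illustrative sketch for a binary task (not the interface's actual code):

from collections import Counter

def score_panel_counts(gold, single_pred, joint_pred):
    counts = Counter()
    for g, s, j in zip(gold, single_pred, joint_pred):
        for model, p in (("single", s), ("joint", j)):
            outcome = ("t" if p == g else "f") + ("p" if p == 1 else "n")
            counts[(model, outcome)] += 1      # e.g. ("joint", "tp")
        if s != g and j == g:
            counts["fixed"] += 1               # e.g. FN -> TP or FP -> TN
        elif s == g and j != g:
            counts["broken"] += 1              # e.g. TP -> FN or TN -> FP
    return counts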
The middle view allows comparison of two selections by juxtaposition, placing the first selected subset along the top of the view and the second selected subset along the bottom of the view, allowing a user to view and compare selections such as true positive from the joint task and true positive from the single task. This facilitates direct comparison between selected subsets.

Figure 6.1: NJM-Vis interface. The left of the interface shows the score panel, displaying individual model performance. The middle of the interface is the vis panel, displaying a word graph visualization indicating words which are relevant to model prediction. The right side of the interface is the sentence browser panel. After clicking words that appear in the vis panel, the sentence browser panel is populated with sentences from the dataset which contain the selected word.

Clicking any of the subsets in the bar chart view, such as true positive for the single task or false negative for the joint task, brings up a word cloud style visualization in the Vis Panel (Fig. 6.3), adapted from [18]. This visualization is structured as a node-link graph diagram in which nodes are words and links represent words that co-occur in a sentence, as shown in Figure 6.1. The visualization also supports the following tasks:

• T2: Since words which appear in the visualization are words which have a high frequency in the selected subset, the visualization intrinsically identifies words which could be key words. Additionally, the visualization uses the size of words to encode a measure of how strongly a word contributed to the selected subset of predictions, in which a larger word indicates it contributed more strongly to a prediction than a smaller word. This is also an indication that a word may be a key word. For instance, in Fig. 6.3 the word "new" appears in the middle word cloud in the joint task true positive subset, and is larger than other words in the subset. This indicates to the user that the word "new" is a key word for this subset. We use LRP as our measure of a word's contribution to prediction (a sketch of how such scores can be mapped to word sizes follows this list).

• T3: The visualization panel allows the user to make two selections and compare them directly, allowing the comparison of subsets of data instances for both the single and joint task at the same time. With the ability to have this juxtaposed comparison, the user can view and compare linguistic similarities or differences. The user can easily see if, for example, pronouns often appear in the joint task true positive predictions, but not in the single task true positive predictions.

• T4: Because the visualization is built on words which often appear, it is possible for the user to see cases in which a commonly appearing error is occurring in a subset of their data. For example, in Fig. 6.1, in the joint task true positive subset, in the rightmost word cloud centering on "would", we see the word "ve", which appears to be an error in the data pre-processing in which the intended word was likely "i've" or perhaps "would've".
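As a concrete illustration of the size encoding described in T2, the following sketch aggregates per-word relevance over a selected subset and maps the totals to font sizes. It assumes each sentence in the subset already carries per-word LRP scores; summing only positive relevance and using a linear font-size mapping are illustrative choices, not necessarily the exact ones implemented in NJM-Vis.

```python
from collections import defaultdict

def word_font_sizes(subset, min_px=12, max_px=48):
    """Map aggregate per-word relevance in a selected subset to font sizes.

    `subset` is a list of (tokens, relevances) pairs, one per sentence,
    where relevances[i] is the LRP score of tokens[i].
    """
    totals = defaultdict(float)
    for tokens, relevances in subset:
        for token, relevance in zip(tokens, relevances):
            totals[token] += max(relevance, 0.0)  # keep only positive evidence

    if not totals:
        return {}
    lo, hi = min(totals.values()), max(totals.values())
    span = (hi - lo) or 1.0
    return {token: min_px + (value - lo) / span * (max_px - min_px)
            for token, value in totals.items()}
```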
Clicking any of the words in the middle view visualization brings up a scrollable list of sentences in the Sentence Panel on the right side of the view (Fig. 6.4), all of which are sentences from the user's dataset containing the word which was clicked on by the user from the selected subset in the middle view. The top right sentence view appears when a user clicks a word in a node tree in the top half of the middle view, and the bottom right sentence view appears when a user clicks a word in a node tree in the bottom half of the middle view. This allows users to directly compare sentences containing selected words between two selected subsets. This sentence panel supports the following tasks:

• T5: By including all of the secondary task class labels in the sentence panel, the user is able to see whether certain class labels for the user's secondary task appear often in their selected primary task subset. For example, in Fig. 6.4, in the bottom panel it appears that many of the data instances have the secondary class label (in this case, dialogue act class label) of "STL" (shorthand for "stall"). This indicates to the user that perhaps the class label of "stall" has important predictive power for the selected subset of primary task predictions (a sketch of the underlying word-to-sentence lookup follows this list).

• T6: Since the sentence panel allows the user to scroll through the full sentences of the key words appearing in the vis panel, it allows the user to directly compare full sentences of their selected subsets. This allows the user to directly compare sentences for the joint and single task systems. For example, users may choose to select true positive for the joint task and true positive for the single task and compare similarities and differences between the selections.
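The lookup behind the sentence panel can be summarized in a few lines: given a clicked word and the sentences belonging to the selected subset, return every sentence containing that word along with its secondary-task label. The data layout here (token lists paired with a label string) is an assumption made for illustration, not the exact representation used in the implementation.

```python
def sentence_browser(clicked_word, subset_sentences):
    """Return sentences from the selected subset that contain the clicked word,
    each paired with its secondary-task label (e.g., a dialogue act tag).

    `subset_sentences` is a list of (tokens, secondary_label) pairs.
    """
    hits = []
    for tokens, secondary_label in subset_sentences:
        if clicked_word in tokens:
            # The panel bolds the clicked word; asterisks stand in for bold here.
            rendered = " ".join(f"*{t}*" if t == clicked_word else t for t in tokens)
            # The secondary-task label is appended in uppercase at the end.
            hits.append(f"{rendered}  {secondary_label.upper()}")
    return hits
```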
6.2 Iterative Design Process

The design of the interface took place through an iterative design process in which we refined the design of the interface as we refined the clarity of the user tasks and data model. An early mock-up version of the interface can be seen in Fig. 6.5. The interface included tables showing model prediction subsets such as true positive, false positive, etc., as well as precision, recall, and F-scores for both the single and joint task models. The table also included a "sentence browser" with color-coded words which were attributed relevance contribution values. The darker the color-coded words, the stronger the contribution score towards the prediction (i.e., darker blue contributes more strongly to the prediction than lighter blue). This was removed in the final version, since the lighter colors tended to be harder to view, and since the relevance contribution measure was already being captured by the size of the words in the word graph. In a panel below were the word cloud visualizations. Though this early mock-up version did include some of the concepts which were included in the final version, such as using a word cloud visualization and showing tables with the number of true positives, false positives, etc., this version of the interface did not fully support all of the user tasks which we ended up deriving. For example, T3 is not supported by this interface, as the word cloud visualization only supports viewing one subset selection at a time.

A prototype interface was implemented as seen in Fig. 6.6, with the design based on the mock-up version in Fig. 6.5. The prototype version was similar in concept to the mock-up, including the tables counting the number of positive and negative subsets of predictions, as well as the word cloud visualization underneath. To better suit user comparison, the tables showing positive and negative subsets of predictions were enhanced with aligned bar-charts [30]. The bar charts went through several mock-ups, as seen in Fig. 6.7 and Fig. 6.8.

During the process of adding the new aligned bar-chart design, it was also decided that adding a sentence browser would help satisfy our user task model by supporting what would become our tasks T5 and T6. With both the sentence browser and the aligned bar-chart design implemented, the interface design was as seen in Fig. 6.9.

As our user task model became more refined, we realized that the interface design in Fig. 6.9 would not fully support a key task in our user task model, task T3, and that to support a task focused on comparing both models we needed to amend the vis panel to allow for a juxtaposed visualization view in which the user is able to view two selections at the same time. After completing the implementation of the juxtaposed vis panel, we arrived at the current design of the interface, which supported our entire task model.

After running our case studies, we received feedback from our participants on how we may better be able to support our task model. The feedback is discussed in Chapter 7, and potential interface additions and mock-ups are presented in Chapter 8.

Figure 6.2: Score panel. The top of the panel shows Precision, Recall, and F-score for both the joint and single task. The bottom of the score panel shows a confusion matrix with aligned bar charts. Each bar chart represents the number of data instances which are categorized into each positive/negative subset (true positive, false positive, etc.). The bar charts also include fixed and broken subsets (i.e., from false negative in the single task to true positive in the joint task, and from false positive in the single task to true negative in the joint task). As seen in the true positive subset, when a user clicks a subset the bar is highlighted in purple.

Figure 6.3: Vis panel. The vis panel contains two selections for direct comparison between user subset selections. The top shows Joint Task True Positives in blue, while the bottom shows Single Task True Positives in gold.

Figure 6.4: Sentence browser panel. The panels allow two user selections for direct comparisons. The sentences are those which contain the selected word, which is bolded in each sentence. At the end of each sentence is the name of the secondary task label written in bold uppercase text.

Figure 6.5: Early interface mock-up. The top of the interface contains confusion matrices showing positive/negative subsets (true positive, false positive, etc.). The confusion matrix also includes the number of fixed/broken data instances in green and brown. Below the confusion matrices is a table containing the scores of precision, recall, and F-score. Below is a sentence browser where each sentence is color-coded with contribution score, in which the darker the blue, the stronger the contribution score toward prediction, and the darker the red, the stronger the contribution score against prediction. Below is the word-graph visualization which we continue to use in the current interface.

Figure 6.6: Prototype interface. This is the first workable prototype interface. The interface contains confusion matrices at the top with the number of data instances classified into each positive/negative subset, as well as the word-graph visualization below.

Figure 6.7: A diagram of possible aligned bar chart designs drawn during the iterative design process.

Figure 6.8: Another possible stacked bar chart design drawn during the iterative design process.

Figure 6.9: Interface design created prior to the current design. This version of the interface contained only single user selections, and did not allow direct comparisons between multiple user selections.

Chapter 7
Case Study

To assess the efficacy of the design, we ran a case study with a set of participants. The case study is intended to be part of an iterative design process in which future versions of the interface could be influenced directly by the case study feedback, at which point further case studies could be run, allowing additional feedback, and so on.

7.1 Method

The case study involved four participants. One participant was a postdoctoral researcher, while the others were graduate students. All of the participants were from a Computer Science background. The participants were split into two groups, with two participants assigned to the Summ-DiaAct joint tasks and the other two assigned to Bot-FakeNews.

After an initial explanation of the purpose and intention behind the interface, participants were walked through a short training session on a toy dataset of predictions. An example of the appearance of the interface during the demo can be seen in Appendix A.2.
Participants were asked to answer simple questions (e.g., identify one high frequency word appearing in the joint model true positive subset but not appearing in the single model true positive subset) by using the interface on the toy dataset. Once the users were able to correctly answer all the simple questions, indicating that they understood the basic encodings and functions of the interface, they moved on to using the interface to explore their assigned dataset of predictions.

Participants were told to explore the predictions however they saw fit. They were told to write down any general insights that they gained from using the interface, as well as any insights gained about specifically why the joint task outperformed the single task.

7.2 Participant Results

In this section we provide the results for each of the four participants from the case study. The results include observations made about the participants' use of the interface during the case study, as well as participants' feedback given through post-study questionnaires. The full participant post-study questionnaire results can be seen in the Appendix.

7.2.1 Participant 1

Participant 1 was assigned to the Summ-DiaAct joint tasks. The participant began the task by comparing each subset directly between single and joint tasks, such as comparing true positives for both the joint and single tasks, and then false positives for both the joint and single tasks. Throughout this process, the participant made notes on paper (an indication that our interface should include notepad functionality) about which words were large. Participant 1 focused primarily on the vis panel, not using the sentence browser panel at all until reminded of the functionality during the study.

During the study, Participant 1 commented that they wanted some sort of indication, such as colour, of which words were common between selections. The participant also commented that it would be useful to show which words are important in only the selected subset, i.e., show if a word is frequent in only the selected subset, since many words show up in multiple subsets. Additionally, the participant wanted the interface to indicate words which have similar linguistic characteristics to the ones that appear in the vis; for example, if a key word is a pronoun, then the participant wanted an option to see other pronouns. Lastly, the participant commented that they would prefer if the sentence browser panel showed all the sentences from the dataset and used some kind of indication, like highlighting, to indicate which sentences belong to the user selection.

In the post-study questionnaire the participant commented that they liked the concept of being able to compare the two models directly, and they found the vis design of using size to indicate importance useful. The participant commented that the size difference should be more notable so that there is a wider difference between small and large words.

Using the interface the participant was able to make some inferences about the two tasks, for instance that modal verbs were often big, indicating that they contributed strongly to predictions, which the participant commented "...seemed intuitive given that the dataset was a dialog dataset".

Finally, the participant concluded that they would use this interface or a similar interface for their multi-task problems in the future if development was continued and the interface was further enhanced with their desired features as described above.

7.2.2 Participant 2

Participant 2 was assigned to the Bot-FakeNews dataset.
The participant began by exploring much of the interface and compared many combinations of subsets against each other, both single versus joint, as well as single versus single and joint versus joint. They clicked around all aspects of the interface often, including clicking on many words to see the sentences in which they appeared. The participant quickly recorded insights about the datasets after only a few minutes of use. By the 10 minute mark of the study the participant had already recorded multiple insights about the datasets. The participant spent nearly the entire allotted time (35 of 40 allotted minutes for exploration) using the interface and continuously writing new insights on the datasets every few minutes.

The participant was able to come up with multiple insights about the dataset, such as the fact that many of the tweets in the dataset were about the stock market, trading, and money, and these were mostly predicted as true positives (i.e., that they are tweets from bots). Additionally, they commented that many of the true positives are tweets that mention blogs and other posts. They also determined that many of the tweets predicted to be from people did not cover a single overarching topic.

Some observations about the word graph visualizations made by the participant were that "The joint task true positives had deeper word cloud trees than the single task true positives" and that "The single task false negatives had a lot of word overlap with the single task true positives, but this was not the case for the joint task".

The visual aspect of the interface, including the colours and clear organization of the score panel, was commented on by the participant as being "very useful" and "...provided a fast and easy way to organize the results". Additionally the participant commented that being able to compare selections in juxtaposition was extremely useful.

The participant suggested that the visualization of the trees could be improved by making the difference in the size scaling more noticeable. The participant also mentioned that perhaps the colour scheme could be changed for which colours indicate which model, as they found that the yellow colour of the arc connections between words in the word graphs conflicted with the gold colour used to indicate model type.

The participant concluded by saying they would use this or similar interfaces in the future for multi-task problems since the interface "...is very good, easy to use, and convenient interface to see and do several things at once".

7.2.3 Participant 3

Participant 3 was assigned to the Summ-DiaAct tasks. The participant began the study by looking strictly at the bar charts first, spending time studying the charts before moving on to clicking the charts to view the visualizations. When looking at the visualizations, the participant began by looking at True Positives and comparing both joint and single, then moving to False Positives and comparing joint and single, and continued in that fashion until having looked at all the subsets. After fully exploring the whole interface carefully, the participant moved to recording all of their insights at once.

While using the interface, the participant noted that through using the score panel and viewing the positive/negative confusion matrix they discovered that the joint task outperformed the single task in both the true positive and true negative categories, so the joint task outperformed the single task in each category.
The participant felt they could look at word groups that appear in the true positive but not the false negative for the joint task, and not in the true positive for the single task (presumably to find words which were important in only the true positive subset, indicating that the neural model learned through joint training that the word can be informative in predictions), though they felt that might take a lot of cognitive load to accomplish.

In the post-study questionnaire the participant indicated that they liked the bar charts, and commented that they especially liked the fixed and broken columns. They commented that it took them a while to get used to the word graphs, but they appreciated that size was used to distinguish importance. They also liked that they were able to compare two sets of word groupings together to try to perform inference, though they felt the current design did place a "fair bit" of cognitive load on the user, given that the design requires the user to click between subsets often to compare two at a time, and that they sometimes wanted to remember what the previous pairs of visualizations looked like while also viewing their current pair of visualizations.

As suggestions for improvements to the interface, the participant mentioned that the size differential between small and large words could be more pronounced. They also mentioned that connecting words based on whether they are in the same sentence may not be as useful as connecting them if they are in a more immediate context (in the case of long sentences).

The participant said that they thought they would want to use the interface in the future for other multi-task problems. They felt that it may have taken them longer than it should have to get used to the word graphs, but they felt that the graphs did eventually give them more insight into the model performance beyond what they would have got from simply looking at a confusion matrix. As a last note, the participant commented that they felt the interface might have use as a "sanity check" tool to determine "...whether the dataset and annotations are any good".

7.2.4 Participant 4

Participant 4 was assigned the Bot-FakeNews dataset. They began the study by looking at one selection at a time, clicking on many of the words and looking at sentences. They came up with insights within the first minute of using the interface. After continuing to use the interface the participant came up with multiple insights over the next few minutes (see the Participant 4 questionnaire in Appendix A). As the participant continued to use the interface, they clicked around often and used many aspects of the interface, quickly switching their attention between aspects of the interface. As the study continued, the participant made insights throughout their allotted time.

The participant mentioned that because of the interface they were able to see that currency words such as "forex" were important for correctly classifying whether a tweet was from a bot. They also stated that because there were significantly more false negatives for the single task than for the joint task, the single task model has trouble identifying what words are more "human".

During the post-study questionnaire the participant stated that they liked that the interface showed influential words that contributed to each task for the positive and negative subsets.
They also said they appreciated how the word graph denoted the importance of a word by its size, and they found that this design choice made the visualization easy to break down and understand. They felt that "most importantly" they liked that they could compare the results of the joint and single task in a way that is more than "just a number (i.e. accuracy)". They felt the word visualisations made it easier to understand what the deep models were looking at when making classification decisions.

The participant did find that they wanted more ways to know if a word was important in multiple subsets, or just the selected subset that they were viewing. They also made smaller interface improvement suggestions, such as a way to move the word cloud visualizations around the vis panel manually.

Finally, the participant commented that they would "absolutely" use this or similar interfaces for multi-task problems in the future. They felt that "The visualization made it easier to peer into the model's 'black box' and gain a better understanding of what is actually going on in the network".

7.3 Summary of Results

All participants gained some insights into why the joint task outperformed the single task. For instance, participant 4 commented that "There are significantly more false negatives for the single task than the joint task. This indicates to me that the single task has trouble identifying what words are more 'human'. For example the single task indicated that 'offline' highly contributed to a tweet being classified as posted by a bot, in contrast the joint task does not have 'offline' included at all." Three of the four participants were also able to gain insights about their datasets in general. For instance, participant 2 stated "Many of the tweets are about the stock market, trading, auctions, and money, and these were mostly predicted as true positives (from bots)".

In the post-study questionnaire all participants commented that they liked the concept of being able to compare two models, as well as the encodings and overall organization of the interface. Participants additionally liked being able to see details on demand by clicking on words, as well as the ability to see visually which words are influential on predictions. Participant 4 commented that "Most importantly I liked that I could compare results of the joint task vs the single task in a way that is more than just a number (i.e. accuracy, etc.). The words visualized made it easier to understand what the deep models were looking at when making classification decisions."

All four participants suggested that to improve the visualization the interface could include an indication for when a word is important and appears in only one subset as opposed to multiple subsets. Three of the four participants also commented that the size difference between words should be more pronounced when indicating word contribution to prediction. Two participants mentioned that they would like to see more than two subset selections at the same time.

Finally, all four participants said that they would use this or similar interfaces for multi-task problems in the future.

In the following chapter we present a mock-up of what the interface could look like after accounting for the feedback received from the case study participants.

Chapter 8
Future Work

Although the current interface only supports training two tasks jointly, there may be benefit to training more than two tasks.
We would like the interface to allow for neural models jointly trained on more than two tasks, which would involve further development on multiple aspects of the interface. Fig. 8.1 contains a mock-up design of what the vis panel might look like if the ability to account for more than two tasks were incorporated. The vis panel is changed to allow for more than two user selections to be made at the same time by splitting the panel into as many evenly distributed sections as there are tasks. It was noted during the case studies that three of the four participants wondered why the vis panel only allowed two selections at once instead of more, so this change could also satisfy some of the case study feedback. Further work would need to be done to decide how many tasks this design could support in total. The sentence browser panel could potentially allow for more than two tasks by removing the two-panel layout and instead having a one-panel layout with tabs at the top, allowing the user to switch between multiple open selections. It may be that users want to visually see both open selections at the same time, and for that purpose the tabs could be popped out of the interface and moved freely around the screen. Although this interface design change would allow more than two joint tasks, there may be scalability issues with the design as the number of tasks continues to increase.

We may also continue developing the interface to address one piece of feedback received from all four study participants, which was that the word size scale needs to be changed to make a more noticeable size difference between what is a small word and what is a large word. A mock-up of the interface in Fig. 8.2 shows what this change could look like.

Furthermore, future development of the interface should address feedback received from all four study participants that the interface could be improved with an indicator that a word is important in only one subset as opposed to multiple subsets. As seen in Fig. 8.2, an addition to the interface which could satisfy this requirement is the yellow highlighted background of words which appear in only one subset, indicating clearly to the user that the word is important in only the selected subset.

Since the development of NJM-Vis is guided by an iterative design process, future work would also include running more case studies to evaluate the proposed interface design changes, and, in the longer term, running formal user studies, possibly comparing alternative versions of the interface.

We also plan to explore further joint neural network architectures. In our work, we used joint neural network architectures which shared only one layer. It may be that different architectures, such as ones that share more than one layer, or networks which are deeper or wider, produce more accurate predictions than the two architectures that we use. It is possible that with better performing neural network architectures our relevance contribution measure may produce stronger signals, and therefore our visualization may better explain predictions to users (e.g., there may be a larger visual difference between words which contribute strongly to predictions versus those that contribute weakly, if the network is more confident about its predictions).
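For readers unfamiliar with this style of hard parameter sharing, the sketch below shows the general shape of a joint model with a single shared layer and two task-specific heads in TensorFlow/Keras. The layer types, sizes, and the number of dialogue act classes are illustrative assumptions, not the exact architectures used in this thesis.

```python
import tensorflow as tf

def build_joint_model(vocab_size=20000, embed_dim=128, shared_units=256, n_dialog_acts=15):
    """A joint model sharing one layer between two task heads (illustrative only)."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32", name="tokens")
    x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(tokens)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)

    # The single layer whose weights are updated by both task losses.
    shared = tf.keras.layers.Dense(shared_units, activation="relu", name="shared_layer")(x)

    # Task-specific heads, e.g. extractive summarization and dialogue act prediction.
    extract = tf.keras.layers.Dense(1, activation="sigmoid", name="extract")(shared)
    dialog_act = tf.keras.layers.Dense(n_dialog_acts, activation="softmax", name="dialog_act")(shared)

    model = tf.keras.Model(tokens, [extract, dialog_act])
    model.compile(optimizer="adam",
                  loss={"extract": "binary_crossentropy",
                        "dialog_act": "sparse_categorical_crossentropy"})
    return model
```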
Finally, we chose to use LRP as our means of calculating a relevance measure for the input. However, it may be the case that there are better means of calculating relevance for the network input. Though we did explore using attention mechanisms instead of LRP, we ultimately decided on LRP for reasons discussed in Chapter 4. Conceivably, further research may improve attention mechanisms, giving them a possible edge over LRP, or other means of computing relevance contributions may be developed which give better results than LRP.
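For reference, the LRP-epsilon rule that redistributes relevance backward through a single dense layer can be written in a few lines. This is a generic sketch following the formulation in the LRP literature cited in Chapter 4 [3, 23, 28], not the specific toolbox implementation used in this work, and it covers only one layer type.

```python
import numpy as np

def lrp_epsilon_dense(a, w, b, relevance_out, eps=1e-6):
    """One LRP-epsilon backward step through a dense layer.

    a: input activations, shape (d_in,)
    w: weight matrix, shape (d_in, d_out)
    b: bias vector, shape (d_out,)
    relevance_out: relevance assigned to the layer's outputs, shape (d_out,)
    Returns the relevance redistributed onto the layer's inputs, shape (d_in,).
    """
    z = a @ w + b                               # forward pre-activations
    z = z + eps * np.where(z >= 0, 1.0, -1.0)   # epsilon stabilizer
    s = relevance_out / z                       # per-output scaling factors
    return a * (w @ s)                          # relevance for each input unit
```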
Figure 8.1: A mock-up to potentially expand the number of user selections at once. This shows a way of potentially splitting the vis panel into sections by dynamically decreasing the number of graph visualizations in such a way as to fit the largest number of graphs into each selection window. Although this could work for four selections, it could have scalability issues as the number of selections grows, since it will be increasingly difficult to dynamically adjust the graphs to fit into the smaller and smaller selection window sizes. Notice also that the background of some words is highlighted, which indicates they have high frequency in only their selected subset, as explained in Figure 8.2.

Figure 8.2: A mock-up to address case study feedback. In the vis panel, the size scale has been increased to make it easier to differentiate between high contributing words (large font) and low contributing words (small font). Additionally, words which have a high frequency in only their one selected subset have a highlighted background, as seen with the word "points".

Chapter 9
Conclusion

The contribution of this thesis is a novel interface for exploring and interpreting joint task neural network models. Neural joint models have been shown to outperform non-joint models on several NLP and Vision tasks and constitute a thriving area of research in AI and ML. However, to the best of our knowledge, there is no previous work to support their interpretability. NJM-Vis fills this gap by supporting the interpretation of NLP joint task neural model predictions, when compared to single task models. For the purposes of analysis with our interface, we designed two joint task neural networks using TensorFlow in Python. We used two different datasets as the input to our joint task neural networks. In both of our joint neural networks the joint task network improved on the score from the single task network. We chose to use Layer-wise Relevance Propagation as a means of explaining the relevance of the input to our neural network predictions. The generated neural network outputs and prediction scores, and our relevance scores for the network input, were all used as input to the interface.

As a means of determining how our interface could satisfy user goals, we developed a user task model. The task model included the following tasks: measure predictions of two models quantitatively, identify key words in subsets of predictions, dissect linguistic similarities/differences between single and joint predictions, identify possible errors in results, identify key relationships between predicted class labels, and contextualize predictions at the granularity of sentences.

The design of our visual interface is built to support our user task model. The design combines tables and aligned bar-charts with sentence browsers and word cloud style visualizations. Our interface breaks down neural network predictions into subsets of positive and negative predictions (true positive, false negative, etc.). The word cloud visualization is used with the relevance scores to allow users to quickly determine which words contributed to their predictions.

To assess the efficacy of our design, we ran a case study with four participants. The participants had an opportunity to use the interface to explore two different datasets. The first dataset was comprised of data from the AMI corpus and contained labeled data for two NLP tasks: extractive summarization and dialog act prediction. The second dataset was comprised of labeled Twitter data for two tasks: fake news detection and bot detection. All four participants stated that they would use our interface for exploring and interpreting results from their joint tasks, providing preliminary evidence for the usefulness of our prototype.

The case studies also provided useful feedback which will inform further versions of our prototype. Users noted that the interface could have an indication of which words were important in only one subset, i.e., that the interface could indicate if a word had a high relevance score or high frequency of occurrence in only one subset. Users also recommended minor design changes for general usability: increasing the word size scale so that high relevance words appear larger, and changing the color choices of the interface so that the gold color of the arcs between words looks clearly different from the golden color of the font in the interface. We consider this interface to be an early exploration of explaining joint neural models to users, and we believe that the user case study indicates that our interface and the concepts underlying its design are effective. With the iterative process of refining the design with user feedback, we could continue to improve our contribution to the emerging field of explainable deep learning.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] L. Arras, G. Montavon, K.-R. Müller, and W. Samek. Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206, 2017.
[3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140, 2015.
[4] A. Bilal, A. Jourabloo, M. Ye, X. Liu, and L. Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2017.
[5] D. Bruckner, J. Rosen, and E. Randall Sparks. deepViz: Visualizing convolutional neural networks for image classification. 2013.
[6] J. Carletta. Unleashing the killer corpus: Experiences in creating the multi-everything AMI meeting corpus. Language Resources and Evaluation, 41(2):181–190, 2007.
[7] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.
[8] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi. The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 963–972. International World Wide Web Conferences Steering Committee, 2017.
[9] A. Dielmann and S. Renals. Recognition of dialogue acts in multiparty meetings using a switching DBN. IEEE Transactions on Audio, Speech, and Language Processing, 16(7):1303–1314, 2008.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
[11] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
[12] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[13] M. Gleicher. Considerations for visualizing comparison. IEEE Transactions on Visualization and Computer Graphics, 24(1):413–423, 2017.
[14] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[15] Y. Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
[16] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[17] F. M. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 2018.
[18] M. Hu, K. Wongsuphasawat, and J. Stasko. Visualizing social media content with SentenTree. IEEE Transactions on Visualization and Computer Graphics, 23(1):621–630, 2017.
[19] M. John, E. Marbach, S. Lohmann, F. Heimerl, and T. Ertl. MultiCloud: Interactive word cloud visualization for multiple texts.
[20] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. P. Chau. ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2017.
[21] T. Kaneko, K. Hiramatsu, and K. Kashino. Adaptive visual feedback generation for facial expression improvement with multi-task deep neural networks. In Proceedings of the 2016 ACM on Multimedia Conference, pages 327–331. ACM, 2016.
[22] S. Kudugunta and E. Ferrara. Deep neural networks for bot detection. arXiv preprint arXiv:1802.04289, 2018.
[23] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek. The LRP toolbox for artificial neural networks. The Journal of Machine Learning Research, 17(1):3938–3942, 2016.
[24] J. Li, X. Chen, E. Hovy, and D. Jurafsky. Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066, 2015.
[25] M. Liu, J. Shi, Z. Li, C. Li, J. Zhu, and S. Liu. Towards better analysis of deep convolutional neural networks. IEEE Transactions on Visualization and Computer Graphics, 23(1):91–100, 2017.
[26] P. Liu, X. Qiu, and X. Huang. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101, 2016.
[27] B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[28] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 2017.
[29] T. Munzner. A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6):921–928, 2009.
[30] T. Munzner. Visualization Analysis and Design. AK Peters/CRC Press, 2014.
[31] G. Murray and G. Carenini. Summarizing spoken and written conversations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 773–782. Association for Computational Linguistics, 2008.
[32] G. Murray, S. Renals, and J. Carletta. Extractive summarization of meeting recordings. In Proceedings of Interspeech 2005, 2005.
[33] S. Murugesan, S. Malik, F. Du, E. Koh, and T. M. Lai. DeepCompare: Visual and interactive comparison of deep learning model performance. IEEE Computer Graphics and Applications, 2019.
[34] B. Nejat, G. Carenini, and R. Ng. Exploring joint neural model for sentence level discourse parsing and sentiment analysis. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 289–298, 2017.
[35] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016.
[36] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2056–2063, 2013.
[37] T. Oya and G. Carenini. Extractive summarization and dialogue act modeling on email threads: An integrated probabilistic approach. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 133–140, 2014.
[38] N. Pezzotti, T. Höllt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, and A. Vilanova. DeepEyes: Progressive visual analytics for designing deep neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):98–108, 2017.
[39] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2017.
[40] W. Y. Wang. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648, 2017.
[41] C. Yang, R. Harkreader, and G. Gu. Empirical evaluation and new design for fighting evolving Twitter spammers. IEEE Transactions on Information Forensics and Security, 8(8):1280–1293, 2013.
[42] Y. Yang, V. Tresp, M. Wunderle, and P. A. Fasching. Explaining therapy predictions with layer-wise relevance propagation in neural networks. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pages 152–162. IEEE, 2018.
[43] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[44] J. Zhang, Y. Wang, P. Molino, L. Li, and D. S. Ebert. Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 25(1):364–373, 2018.
[45] A. Zubiaga, M. Liakata, and R. Procter. Learning reporting dynamics during breaking news for rumour detection in social media. arXiv preprint arXiv:1610.07363, 2016.

Appendix A
Supporting Materials

A.1 Case Study Questionnaires

A.1.1 Participant 1

Please use the interface to explore the data however you'd like.
If you gain any insights about the dataset from using the interface, please write them below.

Could informative

Did you gain any insights on why the jointly trained model has outperformed the single trained model on this dataset? If so, please write them below.

POST-STUDY QUESTIONNAIRE

Please answer the following questions.

What did you like about the interface? What did you find to be useful?
I liked the idea of comparing two different models. Being able to look at two models at the same time was useful. Different sizes of the words that indicate predictive power were useful.

What did you dislike about the interface? How can the visualization be improved?
- It was difficult to know what size was considered big and what size was considered small.
- Similar sizes made it difficult for me to understand what's going on in the cell.
- I gave some feedback during the study.
- Not only frequencies, but I also want to know the words that are frequent in that subset, but not common in other subsets.
- I wanted to see a full text and which sentences were selected for a summary or not.
- I wanted to see more subsets at the same time so that I don't need to remember my observations. It was not easy to identify differences between two selections.

Do you have any further comments on what you learned about the data from using the interface?
- Light verbs and modal verbs (e.g., would, could) were big, which was surprising to me. Dave said it was a dialog dataset, so now it makes sense.
- For the false positive subset the most frequent words were very different for the two models. I'm not sure why, but I wanted to explore it more because the words were very similar for true positive and false negative.

Would you use this or similar interfaces in the future for your multi-task problems?
Yes, if it is enhanced a little more.

A.1.2 Participant 2

Please use the interface to explore the data however you'd like. If you gain any insights about the dataset from using the interface, please write them below.

Many of the tweets are about the stock market, trading, auctions, and money, and these were mostly predicted as true positives (from bots).
The tweets predicted to be from people (true negatives) had no real overarching topic, though they seem to be more personal, such as "I know" or the name "Aidan".
The true positives focus on tweets that mention blogs and other posts.
The false positives about fitbits seem like they're from ads. The false positives seem like casual comments/replies.

Did you gain any insights on why the jointly trained model has outperformed the single trained model on this dataset? If so, please write them below.

Some of the false negatives that the joint model fixed to true positive were still present in the single task true positives, e.g., gold, silver, etc.
The joint task true negatives captured more words than the single task true negatives.
The joint task true positives have deeper trees than the single task true positives.
The single task connected words that were not supposed to be connected (e.g., the single task false negatives have one large tree vs. the joint task true positives having similar words in two trees).
The single task false negatives have a lot of overlap with its true positives, but this is not the case for the joint task.

POST-STUDY QUESTIONNAIRE

Please answer the following questions.

What did you like about the interface? What did you find to be useful?
I like the visual aspect, the colours, and the clear organization of the left-hand menu.
The summary on the left-hand side was very useful, and provided a fast and easy way to organize the results and know the difference in the results between the two models.
I also like how I can click on any word and see all its appearances on the same page. The side-by-side comparison of two things was extremely useful as well.

What did you dislike about the interface? How can the visualization be improved?
The trees can be improved; for example, the colour and text size can be made more obvious. It was sometimes not immediately obvious that one word was larger than another. Perhaps having a different colour for the trees than the ones used to indicate the model would be better. Currently, the single task model and the connections between words share the same yellow colour.
A bug already mentioned is that the words may overlap if there are too many of them to fit.

Do you have any further comments on what you learned about the data from using the interface?
It was obvious what kind of tweets were classified as fake (from bots) but not obvious what kind of tweets are real.

Would you use this or similar interfaces in the future for your multi-task problems?
Yes, having everything fit on one page (the summary on the left-hand side, the visualization in the center, and the details on the right-hand side) is a very good, easy to use, and convenient interface to see and do several things at once.

A.1.3 Participant 3

Please use the interface to explore the data however you'd like. If you gain any insights about the dataset from using the interface, please write them below.

Expect to try and understand *why* the joint fixed what it did and *why* it broke what it did
- e.g., look at why joint fixed false negatives to true positives by looking at single-task false negatives and fixed
- looking at word groupings that appeared in both - implication: single-task considered it important for the negative label, but joint fixed it because of that word grouping - e.g., remote control
- looking at word groupings that appear in fixed but not in single-false-negative - this would indicate that the single-task model did not consider it significant enough to warrant a positive
- presumably that would also be the case for fixing false positives as well as broken

I think I was overthinking things - can look at why the joint fixed things by looking only at fixed, for example.
The joint seemed to use remote-control to both fix false negatives and break true negatives (meaningful words that appeared in both sets).

Did you gain any insights on why the jointly trained model has outperformed the single trained model on this dataset? If so, please write them below.

Bar charts are better for the true pos/neg and worse for the false pos/neg, so it outperforms in every category.
Might be able to look at word groupings that appear in the true positive but not false negative for the joint and not in the true positive for the single, but that would require a fair bit of cognitive load - might be useful to perform such an inference automatically and visualize that.

POST-STUDY QUESTIONNAIRE

Please answer the following questions.

What did you like about the interface?
What did you find to be useful?
- The bar charts, especially the fixed and broken columns
- It took me a while to get used to the word graphs (I kept thinking of the pairings as phrases instead of just words in the same sentence, despite Dave telling me otherwise), but the size differential is useful
- Being able to compare two sets of word groupings to try and perform further inference (though that is a fair bit of cognitive load)

What did you dislike about the interface? How can the visualization be improved?
- Multiple copies of words in the same graph might make it harder to make connections
- The size differential could be more pronounced
- Connecting words based on whether they're in the same sentence may not be as useful as connecting them if they're in a more immediate context (in the case of long sentences)
- Have the displayed sentences include the removed stopwords for readability
- Being able to click on a sentence and see it in the broader context might be useful
- Having the label for the other task shouldn't appear to be part of the sentence; a separate column for that may be more useful

Do you have any further comments on what you learned about the data from using the interface?
- Many of the important word groupings seem to be from words that don't seem that important (yeah, oh, one, etc.)
- Since the TP/TN/FP/FN scores were so similar for the two models, the fixed and broken columns seemed to be the most informative

Would you use this or similar interfaces in the future for your multi-task problems?
I think so. I feel that it took me longer than it should have to get used to the word graphs, and the graphs seem to include many words that I wouldn't consider important, but that in itself is a useful insight that I wouldn't have had from just looking at a confusion matrix. I'd like to be able to see how it handles a better dataset. The tool may have a use as a sanity check to see if the dataset and/or annotations are any good.

A.1.4 Participant 4

Please use the interface to explore the data however you'd like. If you gain any insights about the dataset from using the interface, please write them below.

The word "fiddling" appears to be some kind of code for some bots' tweets.
There are more influential words that contribute to true positives (bots) than to true negatives (humans).
"forex" is important for false negatives joint and single, as well as true positive joint and single (i.e., the vis helped me see that "forex" is an important word for classifying a bot).

Did you gain any insights on why the jointly trained model has outperformed the single trained model on this dataset? If so, please write them below.

There are significantly more false negatives for the single task than the joint task.
This indicates to me that the single task has trouble identifying what words are more "human". For example, the single task indicated that "offline" highly contributed to a tweet being classified as posted by a bot; in contrast, the joint task does not have "offline" included at all.

POST-STUDY QUESTIONNAIRE

Please answer the following questions.

What did you like about the interface? What did you find to be useful?
I liked that the interface showed influential words that contributed to each task for true/false positives and true/false negatives. I also appreciated how the word graph denoted the importance of a word from its size. This made looking at the visualization easy to break down and understand.
Most importantly I liked that I could compare the results of the joint task vs the single task in a way that is more than just a number (i.e. accuracy, etc.). The words visualized made it easier to understand what the deep models were looking at when making classification decisions.

What did you dislike about the interface? How can the visualization be improved?
- A way to know if something was posted by the same user for a given word (i.e., how many of the "fiddling" sentences were posted by the same user)
- A way to move the graphs around (i.e., maybe I want one on top of the other for some reason)
- A way to know if the words visualised are important across the different categories (i.e., "forex" is important for false negatives joint and single, as well as true positive joint and single)
- Indicate the importance cut-off for a word to be visualized

Do you have any further comments on what you learned about the data from using the interface?

Would you use this or similar interfaces in the future for your multi-task problems?
Absolutely. The visualization made it easier to peer into the model's "black box" and gain a better understanding of what is actually going on in the network.

A.2 Demo Dataset

Fig. A.1 shows the demo version of the interface used during the training portion of the case study. The demo was done on a very simple toy dataset.

Figure A.1: A screenshot of the visualization produced by the toy dataset used for demo purposes during the training portion of the case study.

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0383321/manifest

Comment

Related Items