Variant view : visualizing sequence variants in their gene context Ferstay, Joel A. 2013


Variant View
Visualizing Sequence Variants in their Gene Context

by

Joel A. Ferstay

B.Sc., The University of British Columbia, 2006
B.Sc., The University of British Columbia, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
August 2013
© Joel A. Ferstay 2013

Abstract

Scientists use DNA sequence differences between an individual's genome and a standard reference genome to study the genetic basis of disease. Such differences are called sequence variants, and determining their impact in the cell is difficult because it requires reasoning about both the type and location of the variant across several levels of biological context. In this design study, we worked with four analysts to design a visualization tool supporting variant impact assessment for three different tasks. We contribute data and task abstractions for the problem of variant impact assessment, and the carefully justified design and implementation of the Variant View tool. Variant View features an information-dense visual encoding that provides maximal information at the overview level, in contrast to the extensive navigation required by currently prevalent genome browsers. We provide initial evidence that the tool simplified and accelerated workflows for these three tasks through three case studies. Finally, we reflect on the lessons learned in creating and refining data and task abstractions that allow for concise overviews of sprawling information spaces that can reduce or remove the need for the memory-intensive use of navigation.

Preface

This thesis is based on material contained in the following paper:

Joel A. Ferstay, Cydney B. Nielsen, and Tamara Munzner. Variant View: Visualizing Sequence Variants in their Gene Context. To appear in Transactions on Visualization and Computer Graphics (Proc. InfoVis 2013).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
    1.1 Approach and Contributions
    1.2 Thesis Organization
2 Variant Analysis Pipeline
    2.1 Next Generation Sequencing
    2.2 Sequence Variant Data Generation
        2.2.1 Quality Control
        2.2.2 Alignment/Mapping
        2.2.3 Filtering Variant Candidates
    2.3 Prioritization of Candidate Variants
3 Visualization Design Process
    3.1 Nine-Stage Design
4 Data and Tasks
    4.1 Data
        4.1.1 The Reference Genome
        4.1.2 Scales and Coordinate Systems
        4.1.3 Variant Attributes
        4.1.4 Gene Attributes
    4.2 Tasks
        4.2.1 Driving Biological Tasks
        4.2.2 Tasks and Data Questions
5 Related Work
    5.1 Genome Browsers
        5.1.1 Sequence Read Visualization
        5.1.2 Genome Browsers for Variant Analysis
    5.2 Tailored Gene View Solutions for Variant Analysis
6 Design Rationale
    6.1 Core Components: Automation versus Visualization
    6.2 Genomic Coordinates: Strengths and Weaknesses
    6.3 Filtered Scope
    6.4 Transcript and Protein Region Encoding
    6.5 Variant and Variant Attribute Encoding
    6.6 Use of Color
    6.7 Design Comparison
    6.8 Implementation
7 Case Studies
    7.1 Case Study 1: Discover
        7.1.1 Hypothesis Confirmation
        7.1.2 Hypothesis Generation
    7.2 Case Study 3: Compare
    7.3 Case Study 4: Debug Pipeline
8 Discussion, Future Work, and Conclusions
    8.1 Specialize First, Generalize Later
    8.2 Visualization Design Considerations
    8.3 Design Progression
    8.4 Design Study Methodology Pitfalls Analysis
        8.4.1 Pitfalls in the Winnow Stage
        8.4.2 Pitfalls in the Cast Stage
        8.4.3 Pitfalls in the Discover Stage
        8.4.4 Pitfalls in the Design Stage
    8.5 Scalability Limitations
    8.6 Future Work
    8.7 Conclusions
Bibliography

List of Tables

4.1 Data abstraction table.
4.2 Data question table.

List of Figures

1.1 A genetic sequence variant.
1.2 Variant View tool screen capture.
2.1 The variant data generation and analysis pipeline.
3.1 The Nine-Stage Design process.
4.1 Biological context in which variants occur.
5.1 The Ensembl genome browser
5.2 The IGV [43] sequence read visualization tool.
5.3 The Ensembl [5] variation image.
5.4 The cBio [4] tool for analysing genetic variants.
5.5 The MuSiC [9] tool.
6.1 Variant View tool.
6.2 A detail view of the top of an Ensembl [5] variation image.
6.3 Variant visual encoding.
6.4 Comparison of the same variant data.
7.1 Confirming AML genes.
7.2 Discovering AML genes.
7.3 Comparison of patient data.
7.4 Debugging the bioinformatics pipeline.
8.1 The first two data sketch prototypes.
8.2 An interactive search supporting prototype.
8.3 A second interactive search supporting prototype.
8.4 An intermediate prototype for the Variant View design
8.5 A sixth data sketch prototype with icons for variant type.
8.6 A prototype showing protein symbols and amino acid classes.
8.7 Design Study Methodology: 32 pitfalls
8.8 A hierarchical Aggregation Example.

Acknowledgements

I would first like to thank my thesis supervisors, Dr. Tamara Munzner and Dr. Cydney B. Nielsen. Working with Tamara was a tremendously valuable experience. Tamara's knowledge of the information visualization literature and vast experience in effectively presenting complicated topics in a lucid fashion helped make my work much more accessible and understandable to both a visualization audience and a genome sciences audience. Cydney was a fantastic co-supervisor. Cydney not only helped to connect me with potential collaborators at the Genome Sciences Centre, where I performed a great deal of this project, but also provided immeasurable support and understanding in the face of potential projects that could have become sprawling, large team efforts. Cydney gave me both guidance and the confidence to step up to these projects and neatly carve out a well-scoped project that was not only interesting, but could also be completed within the confines of a master's program.

I would also like to thank Dr. Karon MacLean, the second reader for this thesis. I have reaped the benefit of taking several of Karon's courses at the University of British Columbia, and have always found her courses to be stimulating and immediately practical. I feel Karon's expertise and thoughtful comments on this thesis have made it much more accessible to a broader audience, even to those outside the domain of information visualization and the domain of genome science.

I would like to thank the members of the InfoVis group, Matthew Brehmer, Jessica Dawson, Stephen Ingram, and Michael Sedlmair, for their insightful and very constructive feedback on my work.
I would also like to thank them for the opportunity to comment and provide feedback on their work, which was such a pleasure.

I would further like to thank Rod Docking, Dr. Gerben Duns, Dr. Linda Chang, Simon Chan, and Dr. Aly Karsan, our collaborators at the British Columbia Cancer Agency and Genome Sciences Centre, for their patience, time, and interest in this project.

I would also like to thank Jessica Dawson, Anna Flagg, Shathel Haddad, Juliette Link and Louise Oram for being excellent support during this process. It was always a pleasant diversion to ask about your projects, be inspired by your creativity, and then return, energized, to my own work. I would also like to thank Meghan Allen for being an inspiring teacher; I am hopeful her teaching skills rubbed off on me: they are probably the reason for my ability to achieve a teaching assistant award.

I would like to thank the Vancouver Institute for Visual Analytics (VIVA), MITACS, and AeroInfo/Boeing for their generous funding support.

Finally, I cannot begin to describe all the support my family has given to me. My parents Ralph and Cheryl Ferstay, my brother Daniel Ferstay, and Emanuel (Manny) Cabral: their love and support make each day a new one.

Dedication

To my parents, Ralph and Cheryl, who have given me so much. To my grandparents, Alfonse and Ivy, who I miss.

Chapter 1
Introduction

The human genome project produced a reference genome for the human species [15], consisting of about 3 billion chemical constituents called nucleotides. Each person's genome is slightly different; the rate of variation between the nucleotide sequences for individuals is less than roughly one percent [23]. Differences between an individual person's genome and the reference genome are called sequence variants; Figure 1.1 shows an example sequence variant. Changes at the DNA sequence level can cause a variety of genetic diseases such as cancer.
Scientists are interested in finding sequence variants that are predictive of different disease states, and they do so by comparing the genome sequences of individuals diagnosed with a disease to the reference genome, which is generally assumed to be healthy and disease-free. This problem is non-trivial because very few variants are harmful, and teasing these apart from the much larger set of harmless variants requires both automated detection and human inspection. Human reasoning about the biological impact of variants is particularly challenging because it requires considering multiple attributes at a variant position across several levels of biological context.

Currently, variant analysts attack the problem with workflows that have high cognitive load because of the need to mentally integrate across many databases and spreadsheets. The dominant visualization tools for exploring sequence data in general are genome browsers [5, 11, 16, 43, 47]; using them typically requires extensive navigation with very high time costs. A few systems have been proposed for variant analysis, but they either share the fundamental problems of genome browsers [5] or fall short of presenting the full spectrum of biological context needed by the analysts [4, 9].

Figure 1.1: A genetic sequence variant. A person's genome is compared to a reference genome for the human species to identify where and how their genome varies in nucleotide content. In the above example, the person has a variant G nucleotide at position 6.

In this design study, we worked with four variant analysts over a six-month period to design and refine Variant View, a tool to accelerate and improve variant analysis, shown in Figure 1.2.
We identified three variant analysis tasks: finding candidate genes that may be implicated in specific types of cancer, comparing data about an individual patient to a data set of variants known to be harmful, and debugging the bioinformatics pipeline before the data is used for any further analysis.

1.1 Approach and Contributions

One contribution of this thesis is a data and task abstraction for the problem domain of variant analysis: our task analysis links concrete, domain-specific questions to this data abstraction. Another contribution is a discussion that reflects on the strengths and weaknesses of genomic coordinates as a data abstraction, a question that has broad implications for the design of biological visualizations. A third contribution is the validated design and implementation of Variant View. We carefully justify our choices for visual encoding and interaction techniques with respect to the data and task abstractions. With careful filtering, we created an information-dense overview for multiple, non-contiguous features at multiple scales showing all necessary information simultaneously without the need to navigate. We validate the effectiveness of the tool with three case studies of its use after several months of deployment. Our final contribution is a discussion of the lessons learned in this design study: the design strategy of "specialize first, generalize later", and six design considerations organized into the themes of "what to show" and "how to show it".

[Figure 1.2 screen capture: a scrollable list of genes with RefSeq transcript IDs, with Gene Search and Sort By Gene (Alpha, Cluster Score, Variant Count) controls; Variants tracks (Mutation Type, Reference A.A.s, Variant A.A.s); a Transcript track; Protein tracks (A.A. Chain, Signals, Domains, Regions, Topo. Domains, Transmem., Active Sites, NP Binding, Metal Bind., Bindings, Mod. Residue, Carbohyd., Disuf.); and a Variant Data table with columns Patient ID, Chr. Coord., Ref Base, Var Base, dbSNP129, dbSNP135, dbSNP137, COSMIC, A.A. Chng., Gene, RefSeq ID.]

Figure 1.2: Variant View tool screen capture. Sequence variants and their attributes shown in Variant View with respect to biological context annotations at multiple scales. This gene, whose name is anonymized, was not previously known to be implicated in leukemia; analysts identified it as a candidate gene through variant analysis using the tool.

1.2 Thesis Organization

Chapter 2 describes the variant analysis pipeline from start to finish at a high level to give context to the problem addressed by Variant View.

Chapter 3 describes the methodology used in this design study for finding collaborators and a promising visualization problem. We apply the design study methodology of Sedlmair et al. [38] to create a visualization solution for our collaborators. Chapter 3 also describes the process for translating this problem into a refined set of data and task descriptions that guide design and implementation of a visualization tool.

Chapter 4 describes the data types and attributes used by analysts and the driving biological tasks for the problem of variant analysis; Chapter 4 is central to defining requirements for the system.

Chapter 5 presents related work. The aim of this chapter is to describe current visualization approaches to the problem of variant analysis, the data attributes important to the task that they reveal, their strengths, and why they are insufficient solutions for the current analysis task.

Chapter 6 outlines a principled approach to the design of a display and visual encoding for the problem of variant analysis.
In this chapter we leverage existing knowledge of human perceptual and cognitive strengths and weaknesses to design an information-dense visual encoding for variant analysis. The choice of what data or attributes to include and emphasize or de-emphasize in this visual encoding is prioritized by the discussion in Chapter 4.

Chapter 7 presents three case studies that provide initial evidence for our claim that the Variant View tool helped analysts perform their work faster and see patterns in the data that they may not have seen without it.

Finally, Chapter 8 reflects on the lessons learned during this design study, and possible design criteria and guidance for future studies.

Chapter 2
Variant Analysis Pipeline

This chapter describes the variant analysis pipeline from start to finish at a high level to give context to the problem addressed by Variant View. The contents of this chapter summarize information mainly drawn from a review of current variant research methods by Altmann et al. [1], and a current survey of variant research methods by Pabinger et al. [33].

Variant analysis requires knowledge of how variant sequence data is produced. This knowledge is important because obtaining variant data involves a multi-stage pipeline of operations, and artifacts in the form of spurious variants can be introduced at several stages of the pipeline. The data production pipeline involves producing variant data of sufficiently high quality based on various metrics. After data production, analysis takes place. This analysis usually requires knowledge of biology to mentally integrate several pieces of information to determine whether a variant is harmful or not [1, 33]. Figure 2.1 depicts this process from end to end.
This chapter outlines the complete process at a very high level to motivate problems in this domain, some of which are solved by tools discussed in the Related Work of Chapter 5, and expose a new problem, whose solution is the focus of this thesis.

2.1 Next Generation Sequencing

Analysts interested in sequence variant analysis must first acquire the data. The first step is obtaining a whole genome sequence from an individual. Although ideally one would like to acquire a long, continuous, error-free sequence of nucleotides representing the entire genome, current DNA sequencing technology is incapable of reading the entire genome sequence continuously from one end to another. The current technologies used to perform this are a number of platforms collectively called next generation sequencing (NGS). A platform is the sequencing technology and the associated software for translating raw chemical signals identifying nucleotides into text files of nucleotide sequence reads. NGS allows for sequencing of the whole genome of single individuals in a single laboratory within two weeks and at a low cost compared to the earlier Sanger sequencing method [1, 33, 37]. NGS involves extracting DNA from a cell population. The pipeline of today's most widely applied NGS platforms entails the fragmentation of the DNA to be sequenced into smaller segments called sequencing reads [1, 33]. The output of an NGS experiment is a collection of millions of these short reads. Each NGS platform introduces sequencing errors that are characteristic for its sequencing pipeline. Compared to traditional Sanger sequencing, these high-throughput sequencing approaches produce many more sequences, but of much shorter length and inferior quality; the shorter length of reads and inferior quality have an impact on how the resulting readouts are processed in a downstream analysis [1, 33, 37].

[Figure 2.1 diagram, stages in order: Next Generation Sequencing; Quality Control; Alignment/Mapping; Filtering Variant Candidates; Prioritization of Candidates; Validation. Annotation: Variant View supports prioritization of candidate variants.]

Figure 2.1: The variant data generation and analysis pipeline. This pipeline begins with Next Generation Sequencing (NGS), where nucleotide identities are reported. A quality control step removes nucleotides after the NGS process that are unlikely to be correct based on various score metrics. The alignment and mapping step aligns nucleotide sequence data to the reference genome to produce a list of sequence variants. The number of variants at this stage can be large, so the filtering variant candidates step involves application of automated algorithms to remove variants that are unlikely to be of interest. Prioritization of variants takes place after filtering; this step requires an analyst with biological knowledge to inspect the list of variants to determine which variants are predictive of disease or interesting for further research. Variant View is designed to support analysts at the prioritization of variants step. The last step is Validation; after analysts are certain that a particular list of variants warrants further research, a technique such as Sanger sequencing is used to validate or confirm that these variants are true.

The technology of NGS is subject to ongoing development, and the current generation of sequencing technologies is about to be replaced by more modern approaches aimed at eliminating some of the current technical problems which result in lower quality data. However, even if new NGS technology produces more error-free data, this data will still need to be analyzed.

2.2 Sequence Variant Data Generation

Obtaining genetic variant data requires a sequence of steps following generation of the NGS data.
We discuss those steps that are most likely to introduce errors into the final sequence data due to uncertainty in their process. The steps discussed are Quality Control, Alignment/Mapping, and Filtering Variant Candidates.

2.2.1 Quality Control

Most sequencing platforms provide the DNA sequence read data directly in a flat file format. Checking the quality of the generated sequence data is the first step in the pipeline that deals with the actual sequence data. NGS platforms produce quality scores for each individual nucleotide in a read, and some of them automatically remove data if it does not meet the platform manufacturer's factory-specified threshold [1, 33].

2.2.2 Alignment/Mapping

For almost all applications, sequence reads are aligned to a reference sequence; in this case, the human genome. This requirement for aligning several million short reads, which contain small deviations and sequencing errors, to a reference sequence or a database of sequences has brought forth a number of efficient algorithms, some of which use hashing to accelerate the alignment step. Both the choice of alignment tool and its parameter settings significantly affect the outcome and may cause errors to appear in downstream processing. One such parameter setting is closeness of match between sequence reads and the reference. For instance, if only perfect matches between sequence reads and a reference are allowed, the downstream analysis will not find any differences between the reference and the sequenced genome, and no variants will be found.
On the other hand, allowing many mismatches between the reference and sequence reads may allow for many wrong alignments and result in a high number of false positive variants in the downstream analysis [1, 33].

Once the reads have been aligned to the reference genome, many algorithms allow storage of the result in the sequence alignment/map (SAM) format [1, 33]. The SAM format stores information about each aligned read. At this point, visual inspection of a whole genome sequencing experiment is usually not realistic. However, one can isolate alignments within a target region and visualize only that specific region in a genome browser such as the Integrative Genomics Viewer (IGV) [43].

2.2.3 Filtering Variant Candidates

Filtering is an essential step to reduce the number of false positive variant calls: a call is a determination, made by the sequencing technology, of the presence and nucleotide identity of the DNA sequence [1, 33]. Tools designed to remove these false positive variants and minimize variant calling artifacts include GATK, SAMtools, and VCFtools [8, 19, 21]. Most variant calling tools have the option to generate the data in the VCF format [8].

After this round of filtering, and only once the reads are aligned, variants can also be filtered out if they do not have sufficient read support: for instance, if variants are supported by only a single read or very few reads [1, 33].

Variants can also be filtered out based on the effect they might have on the cell. To facilitate this process, the VCF format specifies for each variant basic information such as the chromosomal position, the reference nucleotide, and the variant nucleotide. Information on the quality of the variant call as well as the amount of sequence data available for the call is also stored. The variant calling process on whole genome data can generate more than a million variants. To cope with this size, tools for automated variant annotation have been developed.
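
The read-support and call-quality filters described above can be illustrated with a short Python sketch. It is only a rough illustration, not the pipeline used by the analysts: it assumes tab-separated VCF data lines whose INFO column carries a DP (read depth) entry, uses total depth as a simple stand-in for read support, and the thresholds, the example records, and the names Variant, parse_vcf_line, and filter_variants are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Minimal per-variant record; field names are illustrative."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    depth: int


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed columns of one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt, qual, _flt, info = line.rstrip("\n").split("\t")[:8]
    # INFO holds semicolon-separated KEY=VALUE pairs; flags without '=' are skipped.
    info_fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return Variant(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alt=alt,
        qual=0.0 if qual == "." else float(qual),  # '.' marks a missing value
        depth=int(info_fields.get("DP", 0)),       # treat missing depth as zero
    )


def filter_variants(lines: Iterable[str],
                    min_qual: float = 30.0,
                    min_depth: int = 2) -> List[Variant]:
    """Keep variants whose call quality and read depth meet both thresholds.

    Both default thresholds are assumptions chosen for illustration.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        variant = parse_vcf_line(line)
        if variant.qual >= min_qual and variant.depth >= min_depth:
            kept.append(variant)
    return kept


# Made-up example records, not data from the thesis.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tT\t50.0\t.\tDP=12",
    "1\t200\t.\tA\tC\t45.0\t.\tDP=1",   # only one read covers this position
    "2\t300\t.\tC\tG\t10.0\t.\tDP=20",  # low call quality
]

if __name__ == "__main__":
    for variant in filter_variants(EXAMPLE_VCF):
        print(variant)
```

With these made-up records, only the first variant passes both thresholds. In practice such filtering would normally be done with established tools such as those cited above rather than hand-written code.
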
Effect predictor algorithms such as snpEff [6] can enrich the variant data in the VCF file with information relating to the gene in which the variant occurs, and the other possible effects the variant could have on the cell downstream. Variants that are predicted to have very low impact on the cell may be removed from the data at this point [1, 33].

A widely used approach to substantially reduce the candidate list is to exclude known variants which are present in public variant databases, published studies or in-house databases, as it is assumed that common variants represent harmless variations. The entire filtering process helps reduce a variant data set from millions of variants to thousands [1, 33].

Working with NGS systems is an interdisciplinary effort. While the generation of the data is mainly laboratory-centered, the initial processing of the short read data falls into the domain of bioinformatics and is mostly automated. The interpretation of the results, however, requires close interaction between biology and bioinformatics in order to derive insights from the data; there is a need for better tools at this interface [33].

2.3 Prioritization of Candidate Variants

With the use of whole-genome sequencing, the challenge is narrowing down the list of candidate variants and interpreting remaining variants within a biological context. Prioritizing the resulting filtered variants is task specific; it typically requires detailed knowledge about the domain and visual inspection of the variant context. After an analyst is confident that the variants in their list are mostly true positives and biologically interesting, they verify the list of variants using Sanger sequencing [1, 33, 37]. This list of variants can vary in size depending on the task; for our project, analysts worked with lists of between 2,000 and 10,000 variants.
They might spend up to 15 minutes analyzing each variant.

Our visualization solution, Variant View, fits into this pipeline at the prioritization of candidate variants stage where analysts must use their knowledge of disease biology and the nuanced information from the effect predictor algorithms to determine which variants to report for validation methods such as Sanger sequencing. Figure 2.1 shows where Variant View is situated in the variant analysis pipeline. As we will discuss in detail in Chapter 5, there is a lack of visualization tools to guide this prioritization process. The next chapter will demonstrate the design process we adopted to create our visualization solution.

Chapter 3
Visualization Design Process

Our design process followed the collaborative nine-stage design study methodology framework of Sedlmair et al. [38] because it is the culmination of three advanced visualization researchers' experience in co-authoring over twenty design studies, and their literature survey of many more. Sedlmair et al.'s [38] design study methodology framework targets the design of visualization tools for complex data sets. We applied this existing design framework to our design process in this project.

3.1 Nine-Stage Design

The nine stages are the precondition phase of learn, winnow, cast; the core phase of discover, design, implement, deploy; and the analysis phase of reflect, write. This process is depicted in Figure 3.1. In this study, the three visualization researchers were new, moderately experienced, and very experienced; given this combination of expertise, we did not allocate time for an explicit learning phase. We did indeed have an extensive winnowing stage of roughly five months, in which we considered several other biological problems of potential collaborators at the Michael Smith Genome Sciences Centre (GSC) but decided against pursuing them. We discuss this process in more detail in Section 8.4.
We ultimately selected the problem of variant analysis as a rich problem domain with interesting visualization research questions after a series of meetings with two front-line analysts (A1 and A2) who are research biologists. We made connections with these two postdocs through a gatekeeper (G1) who is engaged in both basic and clinical research at the GSC.

Figure 3.1: The Nine-Stage Design process advocated by Sedlmair et al. [38]. Figure courtesy of Tamara Munzner.

The core phase of the design study lasted roughly six months. During this time we met with analysts regularly, for around an hour a week, and their feedback and ideas actively shaped the tool capabilities. The discover stage began with several semi-structured interviews with analysts A1 and A2 to understand their current workflow and identify tasks that visualization might address. Their tasks are described in Section 4.2: their main problem is the Discover Genes task. The design and implementation stages were tightly interwoven, with a series of 8 prototypes of increasing complexity created over five months. We decided that data sketches [20] were more appropriate than paper prototyping due to the complexity of the data, so even the earliest prototypes did load and show real data. The first two prototypes were static tests of visual encoding possibilities, where we received feedback by demonstrating them to the analysts.
The deploy stage began in the third month with the third prototype, which supported interactive search; from then on, A1 and A2 used the prototypes in their analysis process, with each new prototype replacing the previous one. Five more prototypes of increasing sophistication were deployed over the next two months, and A1 and A2 have been using the final version for two more months. A1 and A2 used the tool whenever they needed to assess variant impact, and reported that it helped them to see patterns in their data that were difficult to imagine using previous tools such as their spreadsheet software. We comment more on A1 and A2's experiences with the tool in Chapter 7.

When this prototype was demonstrated to gatekeeper G1, he became enthusiastic about using it for other biological problems. He connected us with two more analysts, A3 and A4, who are bioinformaticians. Their feedback identified the driving problems described as the Compare Patient Task and the Debug Pipeline Task in Section 4.2. Based on analyst feedback, we adapted the base design to handle these additional tasks with two more rounds of prototyping over one month. These analysts were intrigued by this prototype and are considering how it might be incorporated into future workflows. Deployment for the Debug task with A3 and A4 might be possible in the near future, since they have direct control over their own workflow. However, deployment for the Compare task is a more complex problem since that workflow is still being developed and gatekeeper approval is required for clinical use. A staged development process, as with LiveRAC [22], would be one way to approach the problem; we leave it as future work.

The analysis phase of the design study overlapped with the core phase, and extended for another month beyond it. As usual, the writing stage triggered a return to the discover stage to further refine the data and task abstractions, which in turn led to a few improvements in design.
Writing also triggered a return to the reflect stage, as we considered what lessons we learned that might be of interest to visualization practitioners who have no connection to this particular domain.

Chapter 4
Data and Tasks

Before we present the visual encoding and interaction design choices of Variant View, we need to explain the underlying data and task abstractions [26, 29]. We begin with the data abstraction, where we explain the characteristics of the domain data and how we abstract it in terms of scale and type, and discuss the computation of derived data. We then explain the tasks in more detail with respect to the data involved, and then consider what abstraction in domain-independent language is interesting. The data and tasks were extracted through continued feedback from our analyst collaborators as part of our design process.

4.1 Data

This section outlines the variant data attributes and the data capturing important levels of biological context that are necessary for variant analysis.

4.1.1 The Reference Genome

A variant is a difference between an individual person's genome and the reference genome, a standardized coordinate system derived as a consensus from a small number of people. Because the reference genome is assembled from several donor genomes, comparison of a given genome to the reference will expose millions of genetic variants. Many of these variants are just due to harmless genetic differences between this sample genome and the reference. The reference genome is an imperfect abstraction, and is actively being augmented by sampling [42] and storing [12, 39] the larger scope of human variability. The genome of an individual is called a sample; it typically contains thousands to millions of variants. Our collaborating analysts working with cancer genomes applied several rounds of custom filtering to identify on the order of hundreds of variants of interest per individual. When summed across roughly a hundred samples, their data sets contain between 2,000 and 10,000 variants.

[Figure 4.1 diagram labels: sample genomes; reference genome (~3 billion nucleotides); gene (~20,000 genes, ~10,000 nucleotides); exon; transcript (~1000 nucleotides (nt)); translation; protein (~300 amino acids (AA)); legend: harmless variant, harmful variant.]

Figure 4.1: Biological context in which variants occur. "X" symbols indicate sequence variants between a sample genome from an individual person and the reference genome. Variants propagate from the genome level to the transcript and protein levels and can be harmful (red) or harmless (blue) depending on whether they disrupt important biological signals or not. The gene level is specified by genomic coordinates, the exon-containing transcript level is specified by transcript coordinates, and the protein level is specified by protein coordinates.

The starting point of variant analysis is of course the variants themselves, but they need to be interpreted within a larger biological context of additional information about the genome and its structure. Figure 4.1 shows a diagram of the relevant biological context.

4.1.2 Scales and Coordinate Systems

As with many complex datasets, there is known and relevant structure at multiple scales in genome sequence data (Table 4.1). At the top level is the entire genome, which is roughly 3 billion nucleotides (nt) in length. The reference genome establishes genomic coordinates, a linear coordinate system that specifies location within the sequence as an nt index. The standard way to provide information about known biological context is as annotations that pertain to a range between two locations.

The next relevant level of structure below the genome itself is genes; they are roughly 10,000 nt in length, and there are approximately 20,000 genes in the human genome. Below the gene scale, the next level is exons, the parts of the gene sequence that encode proteins.
They are roughly 100 nt in length and there are on average 10 of them per gene.

Eliminating the regions of the genome that are not exons leads to a second coordinate system, transcript coordinates. Exon ordering is preserved between genomic and transcript coordinates, and these regions are simply stitched together to produce transcripts that are on average 1000 nt long.

Most genes in the human genome produce multiple different transcripts. This diversity results from a biological process of different subsets of exons being assembled into alternative transcripts under different conditions. For example, one transcript may include all of a gene's exons while another may include all but one exon.

Finally, each triple of nucleotides in the transcript is translated into one amino acid to create a sequence that is a third as long, for a third coordinate system of protein coordinates indexed by amino acid (AA).

The lowest level of relevant structure is protein regions, which range from one to hundreds of AA in length and are specified in protein coordinates. These regions have known functional properties, such as facilitating chemical reactions within a cell or anchoring the protein to particular structures. There are around 20 types of protein regions; each has a list of ranges in protein coordinates specifying known regions of that type. Proteins do not typically have annotations for all possible region types, but rather only have annotations for a few region types. Variants that cause amino acid changes within these regions are considered more likely to disrupt protein shape or function.

4.1.3 Variant Attributes

Each variant can have a position in all three coordinate systems: genome, transcript, and protein. Each variant has several categorical attributes, also summarized in Table 4.1 in terms of coordinate system and the number of categories for each.
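The relationship among the genomic, transcript, and protein coordinate systems can be illustrated with a minimal Python sketch. The exon ranges, the 0-based half-open indexing, and the assumption that the protein-coding sequence starts at the first transcript nucleotide on the forward strand are illustrative simplifications; none of them is taken from the Variant View implementation.

```python
from typing import List, Optional, Tuple


def genomic_to_transcript(pos: int, exons: List[Tuple[int, int]]) -> Optional[int]:
    """Map a 0-based genomic position to a 0-based transcript position.

    `exons` holds half-open (start, end) genomic ranges in ascending order.
    Returns None when the position falls in a region between exons.
    """
    offset = 0  # transcript nucleotides contributed by earlier exons
    for start, end in exons:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start
    return None


def transcript_to_protein(t_pos: int) -> int:
    """Map a transcript position to an amino acid index (one AA per nt triple)."""
    return t_pos // 3


# Hypothetical gene with three exons, in genomic coordinates.
exons = [(100, 200), (500, 650), (900, 1000)]
t_pos = genomic_to_transcript(510, exons)   # inside the second exon
aa_pos = transcript_to_protein(t_pos)
```

Because the inter-exon regions are dropped, a genomic position that lies between exons has no transcript or protein coordinate in this sketch.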
There are 4 possible nucleotide types for a variant, represented by the well-known letters A, C, G, or T, and there are 20 possible AA types, classified into 4 different chemical categories. A variant that changes the AA to one of a different class is typically more disruptive than one where the new AA is still in the same class. There are 7 possible variant types, for example, a nt insertion or a nt deletion. Another attribute of interest is whether a variant is recorded in either of the two major databases that categorize certain variants as known to be harmful or known to be harmless. These databases are imperfectly curated and cover multiple cancer subtypes, so this information is considered supplemental rather than definitive. Each variant also has an associated list of sample identifiers; the same variant may occur in only one or multiple samples.

4.1.4 Gene Attributes

Our data abstraction also includes two derived attributes, called varcount and hotspot, that we calculate for every gene (Table 4.1). These attributes were not previously used by the analysts, but capture patterns that we determined were of interest based on our task analysis. The varcount metric is simply the count of how many variants occur within a gene normalized by the gene's length in nt, and thus it ranges from 0 to 1. This measure is useful for identifying highly mutated genes. The hotspot metric is a more complex metric that goes beyond counts to capture the co-location of variants within a gene. We group together neighboring variants if their distance is smaller than 20 nt in transcript coordinates; this threshold corresponds to the inflection point in the distribution of inter-neighbor distances for all genes. The hotspot metric is then computed as the maximum group size for a gene, and thus it ranges from 0 to the maximum value for the data set. This metric is useful for identifying genes with large clusters of variants.
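The two derived gene attributes can be sketched in a few lines of Python. This is one plausible reading of the definitions above, not the thesis code: in particular, the grouping is assumed to chain consecutive variants in sorted order whenever the gap between neighbors is below the threshold, and the function and parameter names are hypothetical.

```python
from typing import List


def varcount(variant_positions: List[int], gene_length_nt: int) -> float:
    """Number of variants in a gene, normalized by the gene's length in nt."""
    return len(variant_positions) / gene_length_nt


def hotspot(transcript_positions: List[int], threshold_nt: int = 20) -> int:
    """Size of the largest group of neighboring variants.

    Consecutive variants (in transcript coordinates) belong to the same
    group when their distance is smaller than `threshold_nt`.
    """
    if not transcript_positions:
        return 0
    ordered = sorted(transcript_positions)
    best = current = 1
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev < threshold_nt:
            current += 1   # still inside the same group
        else:
            current = 1    # gap too large: start a new group
        best = max(best, current)
    return best


# Hypothetical variant positions within one gene's transcript.
positions = [5, 12, 30, 200, 210, 400]
largest_group = hotspot(positions)
density = varcount(positions, gene_length_nt=1000)
```

In this example the first three positions chain into one group of three variants, which is larger than the pair at 200 and 210, so the hotspot value is 3.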
Annotations           CS       Count            Length
Gene                  G        20K per genome   10K nt
Exon                  G/T      10 per gene      100 nt
Functional Region     P        10 per gene      1-300 AA
(20 region types)

Variant Attributes    CS       # Categories
Variant Position      G/T/P    -
Nucleotide Type       G/T      4
Variant Type          G/T/P    7
Amino Acid Type       P        20
Amino Acid Class      P        4
Database Status       -        4
Sample IDs            -        (list)

Derived Attributes    Range
VarCount Metric       [0.0, 1.0]
Hotspot Metric        [0, max]

Table 4.1: Data abstraction table. Annotations may be in one or several coordinate systems (CS): Genome (G), Transcript (T), or Protein (P). Their counts are given, and their average lengths show their relative scales. Variant attributes are also associated with a coordinate system. For categorical attributes, the number of categories is shown. Each sample has a unique identifier, and two derived attributes are computed at the gene level.

4.2 Tasks

All of the analysts were in a group at the GSC focused on the specific cancer type of acute myeloid leukemia (AML). In this section we first characterize the problems and tasks our analysts face and then distill a set of questions they ask about their data to perform these tasks.

4.2.1 Driving Biological Tasks

Discover Genes: The Discover Genes task is to find new genes that are candidates for involvement in the disease of acute myeloid leukemia through variant analysis. The scope of this task is limited to hypothesis generation; the identified candidate genes would then be investigated further to confirm those hypotheses with other tools. This task takes place within the context of an extensive pipeline of data processing and analysis. The input at this stage is a dataset of around 3,000 variants that has already been pre-filtered by data quality metrics.
Each variant is associated with several attributes including the gene within which it occurs; typical datasets have around 50 variants per gene, and include samples from around 100 individuals.

The analysts loaded this list into a spreadsheet, sorted by gene name, and then went through line by line to make judgements about the impact of each variant by reading its attributes. They also used web-based tools to determine whether the variant appears within any of a large number of protein regions. This latter task required an arduous process of querying a protein database website, selecting a protein from a list of possible proteins, inspecting the resulting web page of protein details, and mentally intersecting the variant's location within the genome with the interval of the protein region boundaries; this process can take about 15 minutes for each variant. They also manually compared the variant against two different databases of known variants [12, 39], to understand whether or not it had already been characterized as being harmful or harmless; consulting these databases requires about 5-10 minutes per variant.

Compare Patient: The Compare Patient task is to compare variant data for a particular individual patient with a database of variants that are known to be harmful for acute myeloid leukemia, in hopes of generating a diagnosis and treatment plan by noting variants similar to a disease population group [41]. The known-AML database contains around 10,000 variants with at most 200 variants per gene; the patient dataset typically has around 1000 variants, with at most 10 variants per gene.

The challenge is that similarity is loosely understood rather than fully characterized. A specific variant in a patient clearly corresponds to a known one if they have exactly the same position and attributes; the question of whether nearby variants should be considered matches is more fuzzy.
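The distinction between an exact correspondence and a merely nearby variant can be made concrete with a small Python sketch. It is not the analysts' algorithm: the `Variant` fields, the category names, and the `window` size are hypothetical choices made only to illustrate the comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    position: int        # position within the gene (coordinate system assumed shared)
    variant_type: str
    aa_change: str
    nt_change: str


def compare_to_known(patient: Variant, known: List[Variant], window: int = 5) -> str:
    """Return 'exact', 'nearby', or 'none' for one patient variant.

    'exact'  : a known variant has the same position and attributes.
    'nearby' : a known variant lies within `window` positions (hypothetical
               threshold), including same position with different attributes.
    """
    if patient in known:  # dataclass equality compares all fields
        return "exact"
    if any(abs(patient.position - k.position) <= window for k in known):
        return "nearby"
    return "none"


# Hypothetical known-variant list for one gene.
known = [Variant(835, "substitution", "D>Y", "G>T")]
result = compare_to_known(Variant(836, "substitution", "I>M", "A>G"), known)
```

The 'nearby' outcome is exactly the fuzzy case described above: the sketch can flag it, but deciding whether it counts as a match still requires expert judgement.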
Currently, A3 and A4 are in the process of developing algorithms that classify variants into three categories: positive matches, unclear, and unlikely to match. Their preliminary algorithms generate reports that are being used experimentally by clinicians as part of a workflow that is still under development. Although they are not clinicians themselves, A3 and A4 work closely with them, understand the pain points of the current prototype workflow, and have access to real patient data. They conjecture that visualization support might allow the clinicians to better interpret the border cases between matching and non-matching where the algorithm may fall short, and possibly also to better use the matching variants for a treatment plan.

Debug Pipeline: The Debug Pipeline task is to ensure that the bioinformatics pipeline used to generate variant datasets from raw data is working correctly, before relying on the output in downstream tasks such as Discover Genes or Compare Patient. There are several places that errors might occur in the multi-stage pipeline: spurious variants may be generated due to noise in the next-generation sequencing stage or incorrect thresholding in the data quality filtering stage after that, and incorrect attributes for variants might be generated by the variant effect prediction stage. The goal of finding biologically implausible results requires knowledge of both biology and the variant data production pipeline. Once a pattern is known to reliably predict false positive data, it can be incorporated later into automated filtering algorithms. Although the bioinformaticians already had debugged their pipeline extensively, visualization support has often uncovered errors of a kind difficult to detect with other methods.

4.2.2 Tasks and Data Questions

Table 4.2 contains the full list of concrete questions about the data that we identified for the three target tasks of Discover Genes, Compare Patient, and Debug Pipeline.

The Discover Genes task involves Q1 through Q9. Q1 through Q4 are direct questions about variant attributes. The only unimportant attribute is the list of sample IDs; these unique identifiers are occasionally used to look up further information but are not directly of interest themselves. Q5 is about proximity between variants themselves. Q6 and Q7 also pertain to proximity, but specifically whether a variant falls within given annotation ranges. Q1 through Q7 are all at a gene-level scale; that is, they only pertain to variants within a single gene. Q8 and Q9 are at a larger scale: they characterize genes with respect to each other in terms of patterns of variants within them. These two questions are at genome-level scale; they pertain to selecting which genes to inspect in more detail. The Compare Patient Task involves Q10 and Q11: these questions also pertain to the positions of variants and their attributes. Finally, the Debug Pipeline Task involves Q12 and Q13: Q12 is purely about position, and Q13 is a direct question about variant attributes.

Identifying the questions analysts ask about their data can provide guidance for what information is required to solve their tasks, and what information is irrelevant. In the design rationale discussion of Chapter 6, we use these concrete questions to motivate and justify the design decisions we make to construct our visualization solution.

Discover Genes Task: Gene-Level
Q1   What is the variant type?
Q2   Is there a change in AA chemical class? From what to what?
Q3   Is there a change in AA? From what to what?
Q4   Is the variant in any of the known databases? Is it a harmless or harmful one? Which one(s) is it?
Q5   Are there many variants in close proximity to each other? Where?
Q6   Is the variant close to an exon boundary?
Q7   Which types of functional regions are known for this gene and does the variant fall within any range of any of them? If so, which types? Which ranges?

Discover Genes Task: Genome-Level
Q8   Are there genes with many variants?
Q9   Are there genes with variants in close proximity to each other?

Compare Patient Task
Q10  Does a patient variant occur at exactly the same position as a known variant? If so, do the attributes match exactly (variant type, AA change, nt change)?
Q11  Does a patient variant have a known variant nearby it? If so, are the attributes the same? Or very similar?

Debug Pipeline Task
Q12  Is there an unusual or biologically implausible distribution of variants?
Q13  Is there an unusual or biologically implausible combination of attributes at a variant position?

Table 4.2: Data question table. Concrete questions asked by analysts to infer variant impact, for each of the three identified tasks.

Chapter 5
Related Work

There are many tools available for visualizing sequence variant data. These tools vary in their approach, ranging from some that target flexibility by displaying a large number of attributes, some of them irrelevant to the current design study's tasks, to others that are more targeted to particular tasks and offer limited flexibility. Generally speaking, genome browsers [5, 11, 16, 43, 47] are the most flexible tools, and even allow visualization of raw sequence read alignment to a reference genome [3, 10, 14, 43]; one genome browser in particular, Ensembl [5], allows for a specialized display for variant analysis in addition to typical genome browser capabilities. Other representations expose variant attributes more explicitly, such as cBio [4] and MuSiC [9], but only do so at the gene level.

5.1 Genome Browsers

Genome browsers are the dominant paradigm in sequence visualization today [5, 11, 16, 43, 47].
At their core is the data abstraction of genomic coordinates: the genome is considered as a single, long, linear sequence of nucleotides, and nucleotide position within the string acts as an index. The visual encoding is that horizontal spatial position reflects genomic coordinates, with interactive navigation through panning and zooming to adjust the view to show any single region of interest. Multiple rows are stacked vertically into tracks; each of these separate tracks can show any kind of data that can be indexed with respect to genomic coordinates. An enormous amount of genomic information is indexed this way in public and private biological databases, as annotations that refer to some range in genomic coordinates.

When zoomed all the way in, the user sees features at the level of individual nucleotides, including their actual values as C, G, A, or T in the base track. When zoomed all the way out, the entire genome is shown.

Even when zoomed out only to the gene level, individual nucleotides cannot be resolved, and the many irrelevant regions present cause regions of potential interest to be so highly compressed that useful information is not visible. Figure 5.1 shows a screen capture of the Ensembl genome browser zoomed in to the gene level. Variant data is added as a track. Tracks can stack vertically. Genome coordinates are shown at the top. Variants are encoded as thin, vertical lines, and color encodes variant type. Stacked below the variants are other data tracks such as transcript information and protein domains. The variant data are so compressed that they just appear as thin, non-salient vertical lines.
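A short back-of-the-envelope calculation in Python shows why a purely genomic-coordinate layout compresses features so strongly. The sequence lengths are the approximate scales given in Chapter 4; the 1000-pixel display width is an assumption chosen only for illustration.

```python
DISPLAY_WIDTH_PX = 1000  # hypothetical horizontal resolution of the view

GENOME_NT = 3_000_000_000  # approximate genome length
GENE_NT = 10_000           # approximate gene length
EXON_NT = 100              # approximate exon length


def nt_per_pixel(view_span_nt: int, width_px: int = DISPLAY_WIDTH_PX) -> float:
    """How many nucleotides share a single pixel column at this zoom level."""
    return view_span_nt / width_px


def feature_width_px(feature_nt: int, view_span_nt: int,
                     width_px: int = DISPLAY_WIDTH_PX) -> float:
    """Horizontal pixels available to a feature when the view spans `view_span_nt`."""
    return feature_nt * width_px / view_span_nt


genome_density = nt_per_pixel(GENOME_NT)          # whole genome on screen
exon_px = feature_width_px(EXON_NT, GENE_NT)      # one exon in a gene-level view
variant_px = feature_width_px(1, GENE_NT)         # one nucleotide in a gene-level view
```

Under these assumptions a single-nucleotide variant in a gene-level view receives a tenth of a pixel, which is consistent with variants appearing only as thin lines.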
The strengths and weaknesses of genomic coordinates are discussed further in Section 6.2 as part of our design rationale.

Figure 5.1: The Ensembl genome browser. Variant data is added as a track. Tracks can stack vertically. In this screen capture, the genome coordinates are shown at the top. Variant data is shown within the black box we have drawn to emphasize their presence. The thin vertical lines are variants, and color encodes their variant type. Stacked below the variants are other data tracks such as transcript information and protein domains.

5.1.1 Sequence Read Visualization Tools for Variant Analysis

The output of alignment technologies, used to help identify sequence variants and described in Section 2.2.2, is a collection of read alignments of nucleotides A, C, G, and T to a reference genome. If many short reads aligned to a similar region of the genome have the same nucleotide in a particular position, there is greater evidence that the nucleotide is a true positive, and not simply due to some error in the data generation pipeline. Tools such as Artemis [3], Bambino [10], Magic Viewer [14], and IGV [43] support viewing of these aligned reads by displaying them stacked and/or end-to-end. Figure 5.2 depicts the IGV tool, arguably the most popular of these.

The short, horizontal, grey tracks in this figure are sequence reads. The reference genome sequence of nucleotides is depicted at the bottom. The vertical column that is highlighted in blue shows all the read evidence for a particular position. Here, all the reads are showing a C, meaning that there is a lot of evidence that the individual's genome had a C in this position. The analyst would therefore be likely to trust this C nucleotide assignment based on the stacked read evidence. Generally, these kinds of read visualization tools are good for detailed inspection of read evidence for a single variant.
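The idea of stacked read evidence at one position can be sketched in Python as a simple tally. This is an illustration only: reads are modeled as hypothetical (start, sequence) pairs aligned without gaps, which is far simpler than the alignment formats that tools such as IGV actually read.

```python
from collections import Counter
from typing import List, Tuple


def base_support(reads: List[Tuple[int, str]], position: int) -> Counter:
    """Count which nucleotide each overlapping read shows at `position`.

    Each read is (start, sequence), where `start` is the 0-based genomic
    coordinate of its first nucleotide; gapless alignment is assumed.
    """
    counts: Counter = Counter()
    for start, seq in reads:
        idx = position - start
        if 0 <= idx < len(seq):   # read covers the queried position
            counts[seq[idx]] += 1
    return counts


# Three hypothetical overlapping reads; all of them cover position 4.
reads = [(0, "ACGTC"), (2, "GTCAA"), (4, "CAAGT")]
support = base_support(reads, 4)
coverage = sum(support.values())  # number of reads supporting the position
```

When every overlapping read shows the same nucleotide, as in this example, the tally corresponds to the uniformly colored column an analyst would see in a read viewer.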
However, many other attributes that variant analysts are interested in, such as how the sequence variant will impact its gene's protein product, are not available; read alignment visualization tools are not tailored for the problem of variant impact analysis as described by the data and tasks we have defined in Chapter 4.

Figure 5.2: The IGV [43] sequence read visualization tool. Two tracks are shown. The grey-filled horizontal strips represent DNA sequence read fragments. Along the horizontal x-axis is a genomic coordinate system. Each grey read fragment is aligned to a reference genome at the bottom. Read fragments from the same experiment are in the same horizontal track. The grey histograms above each track show read coverage, the number of reads supporting a particular nucleotide, per nucleotide.

5.1.2 Genome Browsers for Variant Analysis

The Ensembl [5] genome browser, in addition to being a fully-fledged genome browser tool, now includes support for visualizing variants and their attributes in the form of a so-called variation image [5]; the variation image is shown in Figure 5.3. The variation image encodes variant type and some variant attributes, in addition to partially collapsing the inter-exon regions to give more screen space to variants within exons. This view shows only a single gene in a display, and it does not provide any guidance on what gene to inspect. One benefit of the single-gene approach is that its scalability problems are less extreme than those of a general-purpose genome browser in terms of panning and zooming to regions of interest in the entire genome. However, the Ensembl variation image's track-based view, shown in Figure 5.3, typically requires vertical scrolling, particularly to see variants across multiple alternative transcripts: each possible transcript and its associated protein regions form a unit, and around ten of these units are stacked vertically to span a great deal of screen space. The representation also requires user interaction to expose some attribute information such as known database type, and does not encode AA class. Variants are difficult to resolve since they are encoded as thin vertical lines. Their type is encoded by color, which is difficult to resolve because the variant lines are so thin. Also, because inter-exon regions are only partially condensed, exon regions are still small, and multiple variant lines in close proximity can overlap and occlude each other, making it difficult to resolve variant type, position, and recurrence. Finally, at this time, the Ensembl genome browser's variation image does not allow analysts to upload their own data into the system to be displayed. They can only visualize variant data from existing, curated datasets.

5.2 Tailored Gene View Solutions for Variant Analysis

In contrast to genome browser approaches, there are two recent tools, cBio [4] and MuSiC [9], that are more tailored to the display of variant attribute and multi-scale annotation information; cBio is shown in Figure 5.4 and MuSiC is shown in Figure 5.5.

These tools are a useful first step in showing important feature information at the overview level. Both show variant position with respect to annotation boundaries. However, several visual encoding decisions lead to difficulties in using them to assess variant impact. For example, cBio encodes the repetition of multiple variants as variant bar height, which is only minimally salient. MuSiC encodes repetition of multiple variants at a single location with vertical stacking if they are identical, and triangular bloom-like layouts if they are collocated but of different type.
Both MuSiC and cBio are missing much of the detailed information of amino acid class and known database type. In both cBio and MuSiC, protein regions are likely to overlap, leading to occlusion and difficulties in resolving what regions are affected by variants. Neither tool shows where variants occur in relation to the gene transcript; thus it is difficult to know whether variants occur in and around exon boundaries. A major barrier to cBio use is that users cannot import their own data. Although MuSiC is technically available as open source, it too has barriers to use: the undocumented code to generate plots is embedded within a larger system codebase and would be nontrivial to adapt for standalone use.

Figure 5.3: The Ensembl [5] variation image's track-based view typically requires vertical scrolling, particularly to see variants across multiple alternative transcripts since each possible transcript and associated protein regions are stacked. The full display is labelled (A); this view extends across nearly seven pages when printed out directly from the browser. Variants are encoded as thin, vertical, colored lines. The region labelled (B) shows a magnified cropping of the display that includes a transcript scope similar to Variant View. The region labelled (C) shows protein regions which are also included in Variant View.

Figure 5.4: The cBio [4] tool for analysing genetic variants. The tool represents variants as a red circle on a thin vertical line. The color of the circle does not encode any information. Information about protein coordinate and amino acid change is given in text above the variant. The entire protein length is encoded by a grey rectangle, and protein domains are encoded as colored boxes on top of the protein length. Variant count at a single position is encoded by the height of the variant circle and line. Additionally, an amino acid coordinate scale is given at the bottom of the screen.

Figure 5.5: The MuSiC [9] tool for analysing genetic variants. Variants are encoded by a thin vertical line with a circle on top. The color of the circle encodes the variant type. The protein length is encoded by a horizontal rectangular box, with protein regions along this length encoded as colored boxes on top. An amino acid coordinate scale is given along the length of the protein. Amino acid coordinates and identities are given in text above each variant.

Chapter 6
Design Rationale

We now discuss the design decisions for Variant View.

6.1 Core Components: Automation versus Visualization

Figure 6.1 shows Variant View, with its core interface components labeled. The overall design arose from considering the specific tasks outlined in Section 4.2 and identifying three common themes. First, analysts need to integrate diverse data types from distinct sources, such as patient variant data in user-specified input files or protein annotations from public databases. Manually integrating these data together one gene at a time as described in Section 4.2.1 is very time-consuming. We therefore decided to automate this process by building data integration into Variant View so that all relevant data is available from within a single unified interface. Second, analysts need to prioritize genes based on these integrated data, but the previous workflow only provided alphabetical sorting by gene name. We designed two derived metrics, varcount and hotspot, and equipped Variant View with a reorderable list of genes that can be sorted by either of these metrics or alphabetically (Figure 6.1, label B).
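The reorderable gene list can be sketched roughly as follows. This is a hedged illustration rather than the Variant View implementation: the function name gene_order and the (gene, protein position) input layout are hypothetical, varcount is assumed here to be the number of variants observed in a gene, and the hotspot values are treated as precomputed scores supplied by the caller. All positions and scores are toy values.

```python
from collections import Counter


def gene_order(variants, hotspot_scores, key="varcount"):
    """Return gene names ordered for a sortable gene list.

    variants: iterable of (gene, protein_position) pairs (hypothetical layout).
    hotspot_scores: dict mapping gene -> precomputed hotspot score (assumed).
    key: "alpha", "varcount", or "hotspot".
    """
    # Assumption: varcount is simply the number of variants seen per gene.
    varcount = Counter(gene for gene, _ in variants)
    genes = sorted(varcount)  # alphabetical order, also used to break ties
    if key == "alpha":
        return genes
    if key == "varcount":
        # sorted() is stable, so genes with equal counts stay alphabetical.
        return sorted(genes, key=lambda g: -varcount[g])
    if key == "hotspot":
        return sorted(genes, key=lambda g: -hotspot_scores.get(g, 0))
    raise ValueError(f"unknown sort key: {key}")


# Toy input; the positions and scores are made up for illustration.
variants = [("FLT3", 835), ("FLT3", 835), ("FLT3", 600),
            ("DNMT3A", 882), ("DNMT3A", 771), ("IDH2", 140)]
hotspot_scores = {"DNMT3A": 5.0, "FLT3": 2.0, "IDH2": 3.5}

for key in ("alpha", "varcount", "hotspot"):
    print(key, gene_order(variants, hotspot_scores, key))
```

Keeping the alphabetical order as the tie-breaker means that switching between sort modes never shuffles genes that score equally.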
This component of Variant View also supports direct searching by gene name. Finally, analysts need to make judgements about the biological significance of a gene's variants. Unlike the other two general tasks described above, this one requires human inspection and we therefore designed a concise visual interface to support this type of reasoning. We strove to encode as many attributes into the primary overview (Figure 6.1, label A) as possible; to avoid clutter, we show attributes that were deemed by our analysts to be more peripheral to the analysis in the supporting table view (Figure 6.1, label C). Variant View features bidirectionally linked views [46] such that selections in any one of the views are reflected in the others; the video included in the supplementary materials shows the look and feel of the interaction at more length. The video is available at: http://www.cs.ubc.ca/labs/imager/video/2013/variantview/variantview_video.mov

Figure 6.1: The Variant View tool, annotated to indicate its three main views. The primary view (A) is the central overview for performing variant impact assessment; the reorderable gene list view (B) can sort genes alphabetically or by derived measures of variant importance; the secondary Variant Data table (C) contains peripheral information.

6.2 Genomic Coordinates: Strengths and Weaknesses

Although many genome browsers provide access to many hundreds of public data tracks, an analyst typically focuses on fewer than a dozen tracks at once; only a few tracks are relevant to any specific task. The data abstraction of genomic-coordinate tracks provides an extremely flexible architecture, allowing new data types to be easily incorporated into genome browsers. The popularity of genome browsers implies that many tasks in this domain are well served by this style of pan-and-zoom navigation.
Users can easily navigate to a known range and explore local neighborhoods around it at that same scale. They can also easily synthesize information about correlation between phenomena in the same range across multiple tracks. The fixed coordinate system allows users to easily preserve and maintain orientation in terms of where some feature of interest lies with respect to larger-scale structures in the genome.

However, genome browsers are more difficult to use for tasks that require understanding features that fall into non-contiguous regions because the interaction costs become high. Extensive panning and zooming adds both time cost and cognitive load for the user, who must remember regions of interest and their context because they cannot be seen side by side [18, 35]. Genome browsers are particularly difficult to use when features of interest have distributions that are sparse or bursty across some range. The problem with sparse distributions in a fixed coordinate system is that the features are small relative to the scale of the range in which they fall, so they are difficult or impossible to see when the user has zoomed out far enough to see the full range. Similarly, distributions with bursts of features very close to each other can be difficult to understand from high zoom levels because they lie on top of each other, so that a burst is hard to distinguish from a single occurrence. Genome browsers are also difficult to use when features of interest fall at multiple scales, so they cannot be easily seen at any single zoom level. Moreover, if an analyst does not already have hypotheses about what regions in a dataset are interesting, it could be difficult to find such areas through unguided exploration.
Abstractly speaking, the problem is a lack of information scent [13, 34] in the overview; that is, at high zoom levels there is no visual indication of what areas might be fruitful to explore next, forcing users to undertake exhaustive search.

Collapsed coordinate systems can be used instead of genomic coordinates to emphasize regions of interest. They are a much less common representation than genomic coordinates. As discussed in Chapter 5, tools like the Ensembl variation image [5] use partially collapsed inter-exon regions to slightly emphasize exons as regions of interest. Overall, collapsed coordinates risk not being able to show data that fall outside of the selected regions, and they also distort the scale, which may be important for some tasks.

The data abstraction of genome coordinates is sufficiently powerful and pervasive that it has widespread use, but variant analysis is one of many biological subdomains where it falls short [30, 31]. In our design we abandoned genomic coordinates completely, in favor of the collapsed coordinate systems of transcript and protein coordinates as a way to filter the scope of what is shown.

6.3 Filtered Scope

A central design decision was to aggressively filter out all information unnecessary for variant analysis tasks in order to create an easily-comprehensible overview showing everything important simultaneously. All questions except for Q8 and Q9 require seeing only a single gene at a time. The analysts ignore all variants that occur outside of gene boundaries both because their functional consequences are much more difficult to assess and because they are deemed less likely to be harmful.

Even Q8 and Q9 do not require visually encoding the location of the genes in genomic coordinates. Thus, there is no overview of the entire genome in the main view; only a single gene is shown at once.
The gene to inspect is selected from a reorderable list of gene names in a secondary view that can be sorted according to the derived attributes of hotspot and varcount, to satisfy Q8 and Q9, and reduce the gene search space.

Moreover, considering the information summarized in Tables 4.1 and 4.2 in combination shows there is no need to use genomic coordinates at all; transcript and protein coordinates suffice. That is, our task analysis also shows that there is no need to show the non-exon parts of the gene that do not contribute to the transcript, so we filter them out completely. Again, the analysts ignore all variants that occur outside of exon region boundaries, deeming them unlikely to be harmful.

We also realized that there is no need to show low-level nucleotide or protein type information at non-variant positions. Thus, we only show the boundaries of annotation ranges, without attempting to show their internal structure.

In a traditional genome browser, each sample would be shown separately with its own horizontal band. We have instead chosen to show all of the variants together in the context of a single coordinate system, combining information across all samples. Again, this decision was motivated by our task and data analysis: no question requires direct comparison between multiple samples. The only questions that require reasoning about an individual sample are Q10 and Q11.
We once again handle the problem with aggressive filtering: in that case, we only show two more variants for each one in the individual sample, its neighbors to the left and right.

We relegate secondary information to an auxiliary spreadsheet-format table linked to the main view: it contains details about identifiers in known databases (for Q4c, which database is it?), the genomic coordinate value, identifiers for samples and the transcript, and other attribute information in textual format that is also visually encoded in the main view.

These decisions lead to a view dramatically different from what is shown in a traditional genome browser: it is information-dense but visually clear, showing all important information simultaneously without the cognitive load of navigation.

Figure 6.2 shows a detail view of the top of an Ensembl [5] variation image, annotated to show the distinction between genomic coordinates and exon regions. A screen capture of the Ensembl variation image is shown in Figure 5.3. Genome coordinates are encoded as the alternating black and white bar at the top. This view includes both the exon regions and regions between them. The exon regions are much smaller than the non-exon regions, so in genomic coordinates that give equal weight to every nucleotide position the exons are so squished that their internal structure is difficult to distinguish. In contrast, the track beneath shows an alternate view where exons are expanded horizontally and the non-exon regions are compressed. Our design takes this idea a step further by eliminating the non-exon regions completely.

Figure 6.2: A detail view of the top of an Ensembl [5] variation image annotated to show the distinction between genomic coordinates and exon regions.
A screen capture of the Ensembl variation image is shown in Figure 5.3. The exon regions are much smaller than the non-exon regions, so in genomic coordinates that give equal weight to every nucleotide position the exons are so squished that their internal structure is difficult to distinguish. In contrast, the track beneath shows an alternate view where exons are expanded horizontally and the non-exon regions are compressed.

6.4 Transcript and Protein Region Encoding

With the scope of the view reduced to a focus on the exon-containing transcript and protein regions, the next decision was how to encode them. Using horizontal spatial position for the coordinates was the obvious choice, given the strong precedent of horizontal encoding of coordinate data in genome browsers. The transcript and protein regions are bars on separate vertical rows, as shown in Figures 6.1 and 6.3, with the transcript bar on top showing exon boundaries within it (Q6) and one row for each protein region type below it. They are aligned to have the same spatial extent so that horizontal positions correspond across these rows, supporting reasoning simultaneously across these levels. To make this alignment possible, the protein coordinate system is at a higher zoom level than the transcript coordinate system. For instance, if 900 transcript coordinates are captured by the orange transcript bar, 300 protein coordinates will be stretched to match up with the transcript. This representation makes sense because 3 transcript coordinates are equal to 1 protein coordinate, as described in Section 4.1.2. For instance, hypothetical transcript coordinates 4, 5, and 6 will all map to protein coordinate 2. Encoding each type of functional region as its own bar on a separate vertical row is an important choice to prevent occlusion while accommodating Q7; in previous systems these intervals all fall into the same spatial region, leading to overlap and visual clutter.
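The three-to-one relationship between transcript and protein coordinates described above can be written out as a small sketch. The function names and the bar width are hypothetical, and the code assumes 1-based coordinates in which transcript coordinate 1 is the first base of the first codon, which is consistent with the example of coordinates 4, 5, and 6 mapping to protein coordinate 2.

```python
def transcript_to_protein(t):
    """Map a 1-based transcript coordinate to its 1-based protein coordinate."""
    if t < 1:
        raise ValueError("transcript coordinates are assumed to be 1-based")
    # Three consecutive transcript coordinates share one protein coordinate.
    return (t - 1) // 3 + 1


def x_position(coord, n_coords, bar_width_px):
    """Left pixel edge of a 1-based coordinate on a bar spanning n_coords."""
    return (coord - 1) * bar_width_px / n_coords


# Transcript coordinates 4, 5, and 6 from the text above.
protein_coords = [transcript_to_protein(t) for t in (4, 5, 6)]
print(protein_coords)

# 900 transcript coordinates and 300 protein coordinates drawn on bars of
# equal width (the width in pixels is a made-up value).
width = 900.0
codon_start_x = x_position(4, 900, width)
amino_acid_x = x_position(2, 300, width)
print(codon_start_x, amino_acid_x)
```

Because both bars share one pixel extent, the first base of a codon and its amino acid start at the same horizontal position, which is what allows a single vertical variant line to be read against every row.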
The transcript bar is the top, orange bar depicted in Figure 6.1 label A. The protein region bars are the bars stacked below the transcript bar in Figure 6.1 label A.

Only the protein region types that appear as annotations for a selected gene are shown in the main view. Every gene has the AA Chain type, but some have no other region types at all; many have only a few region types. Text labels with more detailed information about protein regions appear on mouseover; this interaction is shown in the video included in the supplementary materials.

6.5 Variant and Variant Attribute Encoding

The goal for encoding variants and variant attributes was to show all of them at once; that is, to allow variant impact to be assessed with an information-dense overview that does not require interaction. Figure 6.3 summarizes the visual encoding choices.

Variants are encoded as vertical lines that traverse the entire transcript and protein regions. The lines have high visual salience, emphasizing the relationship of variants across the transcript and protein regions in one view.

The attributes for a variant are encoded at the top of its line, stacked vertically. Although the horizontal screen space is occupied by bars encoding the transcript and protein regions, there is considerable vertical screen space available. This scheme also allows attributes to be clearly associated with the variant without occluding the transcript and protein regions, leading to a primary display without visual clutter. The top of the stack has a two-part icon to show database status, with a small hollow circle on top if the variant appears in the known-harmless database and a small filled-circle icon just below it to show that it is listed in a known-cancer database (Q4). A single variant could be in both of these databases simultaneously, so a different vertical region is allocated to each.
Having a variant appear in both the known-harmless database and known-harmful database is possible because these databases are imperfectly curated. Just below, variant type is encoded with an icon (Q1); we use a set of 7 evocative icons culled from the biological literature. Resolving the variant type was difficult in previous tools because either it was not shown at all, or it was encoded with a very small mark such as a small, circular mark or thin line [4, 5, 9]. Moreover, the small size of these marks precludes the effective use of color coding [48] to show any additional attribute.

Below the variant type is the amino acid type for both the reference genome and the sample at that position (Q3). The 20 amino acid types are shown using single letters, following biological convention (Q2); we note that color coding is precluded since there are 20 choices. Changes in type are thus shown implicitly by having different symbols next to each other in the stack.

A small grey arrow appears at the very top of the stack to distinguish the variants for a particular patient as needed for the clinical patient-focused task (Q10, Q11); the grey arrow marks are shown in Figure 7.3. We chose to use an additional mark to highlight rather than changing color to ensure that the color coding choices discussed below remain clearly visible.

We wanted variant hotspots to be highly salient (Q5). Our layout emphasizes recurrence of variants across samples by repeating the variant unit as many times as it recurs. The large region of encoded pixels created by this repetition results in a highly visually salient triangular visual footprint, as shown in Figure 6.4(d) and Figure 7.1(a); Figure 7.1(c) shows an example of a gene with no hotspots and no highly visually salient triangular visual footprint: the variants are more uniformly distributed.
In contrast, previous work has shown recurrence in a way that is far more subtle, through position coding of a small object across a small position range, so it is easy to miss [4, 5, 9].

6.6 Use of Color

In the vertical stack of variants in the top part of the main view, we reserve the use of color for emphasizing changes of type implicitly. Amino acid chemical class is encoded with one of 4 different colors, red, green, light blue, and blue, so that a change of class is apparent as a change of color. These changes have a high impact, and so are encoded with high salience. The regions are relatively small, so we use high-saturation colors; we do take care to ensure that the text protein symbols in the foreground have sufficient luminance contrast to be visible. We chose colors to be highly distinguishable while still colorblind-safe through varying saturation and brightness.

In the bottom Transcript/Protein section, bars are colored if variants strike through them; otherwise they are shown in desaturated grey. The always-visible bars that stretch across most of the view each have their own color for memorability and visual salience: the Transcript bar is orange, and the AA Chain protein region bar is green. All of the other bars are blue if they are struck by a variant (Q7).

We do reuse colors between the top and bottom parts of the view. While it is possible that the similarities between colors in the two parts of the view could be misinterpreted as implying a connection between data that is not in fact related, such as the AA Chain with the Special AA class, shown in the legends on the left side of Figure 6.3, we made a considered tradeoff.
Our main goal was highly distinguishable colors in both places, with a secondary goal of a reasonably unified palette; the spatial separation between the parts of the view makes the misinterpretation less likely.

6.7 Design Comparison

We compare existing visual representations for variant analysis to our visual encoding to motivate the strengths of our design, showing the same variant data for a more direct and fair comparison. Figure 6.4 shows a comparison of variant data for the known gene DNMT3A between the encoding schemes of cBio [4], MuSiC [9], the Ensembl variation image [5], and Variant View. Because of the usage barriers described in Chapter 5, the images from previous work are mockups created through close reading of the associated papers and personal communication with the authors. Our discussion focuses on the intellectual design considerations of each representation, not on the underlying implementation of the system or tool that generates them.

Both cBio and MuSiC encode variants as small colored circles on top of vertical lines that indicate their position on the protein coordinate, as shown in Figures 6.4(a) and (b). While MuSiC uses circle color to represent a limited number of variant types, neither representation shows the variant attributes of known database information or chemical class change, and only MuSiC shows AA change consistently. In both cases, the variant context of the transcript is absent and protein regions are represented as colored blocks all on the same vertical row, so there is a risk of occlusion. The high color saturation for these protein regions also tends to make them the centre of focus rather than the variants themselves. Figure 6.4(d) shows the Variant View encoding.
We argue that it is both more information-dense and more visually salient than these previous tools.

The Ensembl mockup in Figure 6.4(c) focuses on only the transcript and protein regions, with annotations to show where the many other alternative transcript and protein regions would take a great deal of additional vertical screen space. To avoid clutter, Variant View instead shows these alternative transcripts in separate views. In the Ensembl variation image, variants are encoded as thin lines, with variant type encoded as color, which makes type difficult to resolve; in addition, occlusion of amino acid encodings can occur if the variants are close together. Variant lines can also overlap if they are in the same location, making it difficult to determine how many variants are present, and their variant type. Furthermore, the design relies on interaction in the form of scrolling and clicking to expose more information from the representation, instead of encoding it densely at the overview level. While we do not attempt to make an exact estimate of the speedup, we argue that analysts would need significantly less time to extract the important information from Variant View, where it is all shown immediately, than from the sprawling Ensembl variation image.

6.8 Implementation

Variant View was implemented using a combination of HTML, CSS, JavaScript, and the JavaScript Data-Driven Documents (D3) library [2]. We chose to deliver the tool as a web application to maximize accessibility and appeal for biologists, who find software installation a significant barrier to entry. The two versions of the prototype, Discover versus Compare, are accessible through different URLs. Variant View is available as open source at http://www.cs.ubc.ca/labs/imager/tr/2013/VariantView.

In addition to the user-specified input file of variant data, Variant View accesses the UniProt database [44] for protein information and the RefSeq database [36] for exon information.

Figure 6.3: Variant visual encoding. Three variants that all occur at the same position are encoded in this diagram. Each variant has the identity of its type indicated by an icon, which is pointed to by the Variant Type label and arrow figure annotations. The reference amino acid is encoded by one of the twenty possible protein symbols. For instance, the reference amino acid pointed to by the Reference AA label and arrow figure annotations is an 'S', and represents the amino acid at this position in the reference genome. Below the reference amino acid is the variant amino acid. The variant amino acid is the identity of the amino acid that a particular sample, or individual, has. In this diagram, the symbol pointed to by the Variant AA label and arrow figure annotations is a 'T'. Additionally, the amino acid chemical class is indicated by a colored circle. For instance, the reference 'S' belongs to the 'Special' amino acid class. A class color legend is shown in the figure. Additionally, whether the variant is reported in a known-harmless or known-harmful database is encoded by an empty circle, and/or a filled circle, respectively, at the top of the variant encoding. The transcript and protein regions the variants intersect are encoded as horizontal bars at the bottom of the display. Color and space are used to redundantly encode the difference between the important functional transcript and protein regions.
[Figure 6.4: four panels, (a) to (d), showing the same DNMT3A variant data along an amino acid scale. Panel (c) carries the annotations "Around 10 transcripts can be shown here." and "Transcript and protein regions repeated for each alternative transcript around 10 times below this point."]

Figure 6.4: Comparison of the same variant data between different visual encoding schemes. (a) cBio [4] mockup. (b) MuSiC [9] mockup. (c) Ensembl variation image [5] mockup. (d) Variant View screenshot.

Chapter 7

Case Studies

We now present three case studies which provide initial evidence that Variant View is useful to domain experts in several ways.
First, it integrates diverse data types previously distributed across input files and external databases. Second, it provides summary metrics that are valuable for sorting genes and identifying candidates for further exploration. Third, it displays rich information about variant type and distribution across a gene. This information is not available in any other visualization tool and is valuable for interpreting the biological impact of variants, which requires human inspection.

We chose case studies as a validation technique over other methods, such as head-to-head comparisons with previous work via benchmarks that show an improvement in task completion time. Case studies showcase results that target users found through their use of the tool, and walk the reader through why the result images are effective through a qualitative discussion. The case study is a common method for presenting the results of a design study [28]. Munzner's nested model [29] holds that the case study validation approach is strongest when there is an explicit discussion pointing out the desirable properties in the result images of the visualization tool's use; in the discussion of results that follows, we explicitly describe how Variant View exposes meaningful combinations of attributes to the target user to help them complete some of their tasks as defined by Table 4.2.

7.1 Case Study 1: Discover

Variant View consolidates transcript, protein and variant position and attributes into a single summary view, in contrast to the previous complex workflow described in Section 4.2.1.
The analysts used Variant View first for hypothesis confirmation, to see if the tool could expose known types and distributions of variants in genes implicated in AML, and then for hypothesis generation, to discover new variants that play a role in AML.

7.1.1 Hypothesis Confirmation

Upon sorting by the hotspot metric (Q9), the first three genes in the list were DNMT3A, IDH2, and FLT3. All of these have been reported in the literature as being affected by AML variants, and this finding provides evidence that the tool can help confirm positive controls of the disease.

Once promising candidate genes were identified by simple sorting on summary metrics, our analysts then used the rich information available in the Variant View visualization to examine the variants' biological contexts. Figure 7.1(a) and (b) show the gene-level view of FLT3 and IDH2, respectively. The analysts found that the visual encoding in the main window was highly effective at emphasizing the hotspots at the gene level with visually salient bloom-like structures (Q5). They also noted how easily they could relate protein region information to variant position (Q7). In particular, Figure 7.1(a) reveals variant intersections with many different protein regions, which would be considerably more difficult to interpret in tabular format. In addition, Variant View exposes the diversity of variant types within a given hotspot. For example, the clusters in Figure 7.1(a) contain many different types of variants, whereas the cluster in Figure 7.1(b) is comparatively uniform in variant type. Our analysts were interested in such differences. These details are not captured by simple summary measures, like our hotspot metric, but rather require visual inspection and human interpretation. Overall, Variant View provided a notable acceleration of our analysts' previous manual workflow: they could see immediately what would have taken them at least 15 minutes to find.
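The observation that a single summary score cannot separate a uniform hotspot from a diverse one can be made concrete with a small sketch. The Python below is not part of Variant View, which is written in JavaScript, and it does not reproduce the hotspot metric used in this thesis. As a stand-in it assumes the score is simply the number of variants at a gene's busiest protein position. The rows are illustrative, reusing amino acid changes of the kind listed in the Figure 7.1 tables, and the names `summarize` and `GENE_X` are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative rows: (gene, protein position, amino acid change label).
# Counts are made up; GENE_X is a hypothetical gene with no hotspot.
variants = [
    ("IDH2", 140, "R140Q"), ("IDH2", 140, "R140Q"), ("IDH2", 140, "R140Q"),
    ("IDH2", 140, "R140W"), ("IDH2", 172, "R172K"),
    ("FLT3", 835, "D835Y"), ("FLT3", 835, "D835H"), ("FLT3", 835, "D835V"),
    ("FLT3", 835, "D835E"), ("FLT3", 491, "V491L"),
    ("GENE_X", 1108, "T1108I"), ("GENE_X", 1908, "R1908C"),
]


def summarize(rows):
    """Per gene: variant count, busiest position, and the distinct changes there."""
    by_gene = defaultdict(list)
    for gene, pos, change in rows:
        by_gene[gene].append((pos, change))

    summary = {}
    for gene, items in by_gene.items():
        per_position = Counter(pos for pos, _ in items)
        top_pos, top_count = per_position.most_common(1)[0]
        summary[gene] = {
            "count": len(items),
            "top_pos": top_pos,
            # Stand-in score (an assumption, not the thesis metric).
            "top_count": top_count,
            "changes_at_top": sorted({c for p, c in items if p == top_pos}),
        }
    return summary


summary = summarize(variants)

# Sort genes by the stand-in score, breaking ties alphabetically.
ranked = sorted(summary, key=lambda g: (-summary[g]["top_count"], g))

for gene in ranked:
    info = summary[gene]
    print(gene, info["top_count"], info["changes_at_top"])
```

In this toy data the two hotspot genes receive the same stand-in score, yet one hotspot holds four distinct changes and the other only two. That difference is the kind of detail the text says is left to visual inspection.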
[Figure 7.1: three Variant View screenshots, panels (a), (b), and (c), each showing the sorted gene list, the variant encoding with transcript and protein region tracks, and the variant data table.]

Figure 7.1: Confirming AML genes. Variant View allowed analysts to quickly confirm known results: known AML genes could be found near the top of the sorted lists, and the per-gene views clearly and immediately showed tell-tale structure. (a) FLT3. (b) IDH2. (c) Example gene without interesting structure near the list bottom.

7.1.2 Hypothesis Generation

In addition to retrieving and inspecting known variants in important AML genes, our analysts successfully used Variant View to discover interesting candidate genes. For example, Figure 1.2 shows one of these genes, and two more are shown in Figure 7.2(a) and (b). The gene names have been sanitized since the analysts' research is still ongoing, and sample IDs in all examples have been sanitized to protect patient privacy. Figure 1.2 shows a concentration of variants that would be difficult to reveal in a spreadsheet or list interface. Just as with the hypothesis confirmation examples, Figure 1.2 and Figure 7.2(a) and (b) reveal either uniform or diverse variant types within their hotspots in a way that is not communicated by the hotspot metric alone.
Interpretation of the biological importance of this variant diversity requires human judgement, as does the significance of intersected protein regions. A1 remarked on the limitations of the previous workflow compared to using Variant View for Q5, Q6, and Q7:

"It was really difficult to try and imagine the distribution of the variants along both the transcript and the protein - furthermore, the number of look ups required to determine whether the variants intersected important protein domains would have made searching all of them really difficult - getting extra detail about the protein regions would add an additional layer of workload."

7.2 Case Study 3: Compare

Analysts A3 and A4 used Variant View for the Compare Patient task, as described in Section 4.2.1. Figure 7.3 shows the immediate neighbors on each side of each variant, with the patient's own data indicated by the grey arrows at the top of the stack. It is immediately apparent that the leftmost and middle variants are exact matches with the known-AML variants on their left sides; revealing the patient variants' neighbors is also demonstrated in the supplementary video. It is also immediately apparent that the rightmost variant does not have a match in the database: its neighbor is relatively far away and has very different attributes. The analysts remarked on how quickly the tool allowed them to draw these conclusions.
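The neighbor lookup behind this comparison can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the tool's implementation: it assumes each variant is reduced to a (protein position, amino acid change) pair, that the known-variant database fits in a sorted list, and that the left neighbor may sit at the same position as the patient variant. The function name `neighbors` and all data values are hypothetical.

```python
from bisect import bisect_right


def neighbors(patient, known):
    """Return (left, right, exact) for one patient variant.

    left  : nearest known variant at or before the patient's position, or None
    right : nearest known variant after the patient's position, or None
    exact : True if the database holds the same position and the same change
    """
    ordered = sorted(known)
    positions = [pos for pos, _ in ordered]
    i = bisect_right(positions, patient[0])
    left = ordered[i - 1] if i > 0 else None
    right = ordered[i] if i < len(ordered) else None
    return left, right, patient in ordered


# Hypothetical known-variant database and patient variants.
known_db = [(100, "A100V"), (250, "R250H"), (900, "G900S")]

match = neighbors((250, "R250H"), known_db)   # same position, same change
no_match = neighbors((600, "K600E"), known_db)  # nearest neighbors are far away

print(match)
print(no_match)
```

A distance threshold on `left` and `right` could flag patient variants whose nearest known neighbor is far away, which corresponds to the rightmost variant described above. Judging whether the attributes differ meaningfully still falls to the analyst.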
[Figure 7.2: two Variant View screenshots, panels (a) and (b), each showing the sorted gene list, the variant encoding with transcript and protein region tracks, and the variant data table.]

Figure 7.2: Discovering AML genes. Variant View has proved useful for analysts in the discovery process of identifying new candidate genes. (Gene names sanitized as their follow-up research is still ongoing.) (a) The clear hotspots indicated a gene of interest. (b) The fact that the variants strike a range in a known function region type was the most informative aspect of this layout.

7.3 Case Study 4: Debug Pipeline

The Debug Pipeline task, as discussed in Section 4.2.1, emerged later in our interactions with analysts; like the Compare Patient task, it was suggested after presentations of the tool designed for the Discover Genes context.

Analyst A3 found spurious data from what he thought was a fully debugged pipeline when using Variant View. Figure 7.4 shows the surprising visual pattern for a gene (name sanitized). He quickly concluded that the sheer number of repeated identical variants that he saw was highly unlikely to reflect true dataset structure of the same variant occurring in so many different individuals. After solving this particular pipeline problem, A3 remarked:
"The tool exposed artifacts in the data that slid past at least two rounds of quality metric filtering, I was very surprised to see that there could be anything wrong with the data at this point - this type of problem would not have been caught by our previous, automated methods."

[Figure 7.3: Variant View screenshot in Compare mode, with the comparison modes "Show Patient Data Only" and "Show Patient + Neighborhood" and a variant details table.]

Figure 7.3: Comparison of patient data to a known-AML variant database. The immediate neighbors for each variant are shown.

[Figure 7.4: Variant View screenshot showing a long run of repeated variants at a single chromosome coordinate, with the variant data table and the sorted gene list.]

Figure 7.4: Debugging the bioinformatics pipeline.

Chapter 8

Discussion, Future Work, and Conclusions

In this section we discuss the design strategy of "specialize first, generalize later" as a way to tackle biological data visualization challenges. We also reflect more generally on the visualization design issues. We discuss our design progression through a description of the 8 prototypes we constructed. We discuss our design process with reference to some of the 32 pitfalls proposed by the authors of the design study methodology by Sedlmair et al.
[38]. We then discuss limitations of the work with an emphasis on the design's scalability, and summarize the conclusions of this work.

8.1 Specialize First, Generalize Later

The domain of biology has been a frequent target of design studies in visualization [7, 17, 24, 25, 27, 32]. We conjecture that this domain is a rich source of problems exactly because of its difficulty: there is an enormous amount of data to contend with, and figuring out what matters is nontrivial. In the language of the four-level nested model of visualization design [29], developing the appropriate data abstraction is a major part of the problem.

By abandoning whole genome coordinates and committing to transcript and protein coordinates, we created a specialized tool that targets key tasks in variant analysis, but does not offer the generality of a genome browser. We made this decision knowingly, and throughout the design phase we purposefully strove to optimize the display to the target tasks and did not require ourselves to produce a very general solution. This philosophy to specialize first has emerged from examination of many design studies [32, 38]; it seems to be well suited for domains where the amount of up-front detail is enormous and it can be difficult to judge which design elements will generalize.

Generalization follows naturally from this initial specialization. We have found that opportunities for generality naturally emerge when analysts try out working prototypes on their own data, even though they are not obvious at the outset. For example, our original design targeted the Discover Genes task, but it later became apparent that Variant View could support the other two tasks with only minor adjustments. Additional applications and adaptations continue to emerge as we expose more analysts to the tool.
For example, another group is interested in using Variant View to visualize variants in non-exonic regions, which are excluded in the current tool by our choice of coordinate system. A more general alternative to committing to transcript and protein coordinates would be to enable an analyst to define coordinates of interest: for example, non-exonic regions. Overall, this approach ensures that the decisions of what to generalize are guided by real-world use cases.

8.2 Visualization Design Considerations

We now reflect on our design choices by framing them in an abstract way that is not tied to the vocabulary of the domain problem. These choices can be organized into: What to show and How to show it.

What to show: A major abstract choice in this study was to identify scales of interest within the data. As discussed previously, the final choice of scales may break with convention, but should best serve the analysis task. Closely coupled choices were to identify what data can be filtered out as being irrelevant and to determine what additional data to derive. There is a tendency to display all information within the provided input file, but more often than not, much of that material is not useful to the target tasks and valuable derived metrics are missing.

How to show it: At several points in the design phase, we explored options for how to highlight a change in data value. The choice required deciding when comparison can be accommodated implicitly by visually encoding values through side-by-side marks versus by explicitly computing a value difference that is visually encoded directly. Although the side-by-side approach may introduce more visual clutter than a single difference value, it preserves the underlying data and may be the better choice for some tasks. A related choice concerned deciding what to visually encode directly versus what to support through interaction.
Attempting to encode all pertinent data attributes can lead to visual clutter, but requiring extensive interaction can be taxing to the user. Similarly, navigation within a view can be very time consuming and we carefully considered when to reduce navigation drastically or eliminate it completely. Taken together, our approach regarding how to show the data was to create a multi-scale non-contiguous overview that showed all information without the need to navigate.

8.3 Design Progression

The design and implementation stages for this design study were tightly interwoven, with a series of 8 prototypes of increasing complexity created over five months. We decided that data sketches [20] were more appropriate than paper prototyping due to the complexity of the data, so even the earliest prototypes did load and show real data. The first two prototypes were static tests of visual encoding possibilities, where we received feedback by demonstrating them to the analysts; these early prototypes are shown in Figure 8.1(a) and (b).

The deploy stage began in the third month with the third prototype, which supported interactive search, shown in Figure 8.2, and a fourth prototype with different interactive filtering capabilities, shown in Figure 8.3.

Figure 8.4 shows a final design for a series of prototypes that allowed for interactive filtering. The three histograms at the top of the screen can be brushed with the mouse cursor to specify a range of quality values. Each histogram is a scented widget [49]; in addition to allowing the user to specify ranges, the scented widget shows the distribution of the data in order to provide information scent. Below these scented widgets are check boxes that filter the data on variant type. We abandoned this design because our collaborators had already pre-filtered their data, so they did not need to specify these ranges.

Figure 8.1: The first two data sketch prototypes. (a) shows an early data sketch that shows full genomic coordinates for the gene representation. The exon-containing gene representation is next to the "Gene:" title; exon regions are represented as skinny, vertical, dark grey bars, and are comparatively small next to the long, thin horizontal line of the inter-exon regions. Red vertical lines represent variants. (b) shows a representation that removes inter-exon regions to emphasize the exon regions that were comparatively small in (a). The transcript is colored orange, and protein regions are represented by bars in separate horizontal tracks below the transcript, each distinguished by color and spatial position. Variants are encoded by blue vertical strikes with blue circles on top.

Figure 8.2: An interactive search supporting prototype. This prototype retained the transcript and protein representations of Figure 8.1(b), but also allows the user to search for genes of interest interactively. The prototype also has the capability of interactive filtering on various variant quality and type attributes.

Figure 8.3: A second interactive search supporting prototype. This prototype retained the transcript and protein representations of the prototype shown in Figure 8.1(b), but also allows the user to search for genes of interest interactively.
The prototype also has the capability of interactive filtering on various variant quality and type attributes; the prototype differs from the prototype in Figure 8.2 slightly in terms of layout, and mostly in terms of the attributes it allows the user to interactively filter data by.

[Figure 8.4 screenshot: gene MUC4; brushable quality histograms for Exome Forward Reads, Exome Reverse Reads, and (Forward + Reverse)/(Total); variant type checkboxes for Non-Synonymous, Frame Shift, Codon Deletion, Codon Insertion, and Truncating/Nonsense; a Variant Recurrence at Transcript Positions histogram above the transcript and protein annotation tracks.]

Figure 8.4: An intermediate prototype for the Variant View design. The histogram interfaces at the top are brushable controllers that help define a threshold for filtering on a quality metric. The checkboxes allow the analyst to filter variants based on the variant types described in Section 4.1.3. Variants appear as vertical, blue lines that intersect an orange transcript stacked above multiple, colored protein regions below. Variant recurrence at a position is encoded by a large histogram stacked above, and aligned with, the transcript and protein regions.

One of the major problems with the first five prototypes is that although we show the necessary levels of biological context described in Section 4.1.2 that our analysts are interested in, we do not show the important variant attributes described in Section 4.1.3. Variants are encoded simply with vertical lines, and the only aspect analysts could resolve about each line is the variant's position along the transcript and protein regions.
From our data and task abstraction described in Section 4, we decided to expose all variant attributes necessary for variant impact analysis within the gene view. The first prototype to show variant type information, one of the most important attributes for variant impact assessment, is shown in Figure 8.5. In this prototype, variant type is encoded with icons. We decided to remove the orange transcript since most of our discussion in interviews centred around whether variants intersected protein regions.

[Figure 8.5 screenshot: gene FLT3 (RefSeq ID NM_004119); protein annotation tracks with variant type icons; a Variant Data table with columns Patient ID, De Novo, G. Coord., dbSNP129, dbSNP135, COSMIC, RNA-seq, Exome-seq, Gen-seq, Effect, and RefSeq ID.]

Figure 8.5: A sixth data sketch prototype with icons for variant type. This prototype retained the protein representations of the initial prototypes, but not the transcript representation. Additionally, the prototype includes icons representing the attribute of variant type for each variant.

Our analysts found the icon representation for variant type very useful. So, we used our data abstraction to decide which attributes to encode for each variant in the next prototype. Amino acid identity and class identity at each variant position were the next most important attributes to encode. One aspect of previous prototypes our analysts did miss, however, was the orange transcript representation, so we made sure to include it in the next prototype. In the seventh prototype, shown in Figure 8.6, amino acid change is shown using one of twenty protein symbols, and amino acid class is encoded with one of four colors in a circle surrounding the protein symbol.

[Figure 8.6 screenshot: gene FLT3 (RefSeq ID NM_004119); transcript track; variant rows for Mutation Type, Variant A.A., and Reference A.A.; protein annotation tracks; the same Variant Data table as in Figure 8.5.]

Figure 8.6: A prototype showing protein symbols and amino acid classes. This prototype additionally encodes amino acid change and amino acid class type as a letter symbol and circle color, respectively. The orange transcript representation is once again included.

Our analysts were happy with the representation at this stage in the design, but wanted more assistance navigating through the long list of genes that variants may occur in. Up to this point, the prototypes required that analysts know which gene they wanted to inspect a priori; they then had to type the gene name into the search box to display it. Based on interviews, we found that we could use derived measures of variant impact to sort the gene list ranked by these metrics. These metrics are described in Section 4.1.4. We created a list interface for the gene names, and allowed the user to sort these gene names by the derived metrics in the next prototype, which is the final prototype, shown in Figure 6.1. Finally, we made changes to the color palette of this prototype to ensure the amino acid class colors were distinguishable for people who are red-green color-blind.

8.4 Design Study Methodology Pitfalls Analysis

We discuss the current study in the context of selected pitfalls (PF) observed in the work of Sedlmair et al. [38]: the Design Study Methodology (DSM), shown in Figure 3.1. Figure 8.7 is an itemized list of the 32 pitfalls and is taken from [38]. We discuss the stages of the DSM where we feel our design study encountered turbulence due to pitfalls, or why and how our design process succeeded in avoiding certain pitfalls.

8.4.1 Pitfalls in the Winnow Stage

In the winnow stage, the goal is to identify and select the most promising collaborators to design a visualization solution for.
At this stage, premature commitment to collaborators who either have no time to discuss their problem, or cannot give access to data, can sink a promising design study. In our study, we turned down three potential collaborations due to their unavailability before committing to a final group of analysts. We feel we avoided PF-3, "premature commitment; collaboration with the wrong people," largely due to knowledge of PF-4, "no real data available (yet)," and PF-5, "insufficient time available from potential collaborators." These pitfalls, along with the rest of the 32 pitfalls, are shown in Figure 8.7.

Figure 8.7: Design Study Methodology: 32 pitfalls. Figure taken from [38] courtesy of Tamara Munzner.

Some problems, after thorough characterization in terms of the data and tasks required to solve them, may be completely automated and not require a visualization solution at all. We encountered this problem with one of our first potential collaborators. Their problem involved detecting topological features in graphs. The problem seemed promising for a visualization approach; however, as the problem characterization, data, and task abstraction stages progressed, it became clear that a number of off-the-shelf methods, including a simple prioritized list interface, could solve their problem. Mistaking a problem that can be solved entirely by automated methods for an interesting visualization problem is characterized by the DSM's PF-6: "no need for visualization: problem can be automated."

We feel we mostly avoided PF-7, "researcher expertise does not match the domain problem": two of the three visualization researchers had ties to the domain of biology in the form of university training, so fundamentals for understanding the problem domain were present, and the interest level was high.
When it came to the more specific domain of sequence variant analysis, however, this pitfall was not completely avoided due to some specifics of the domain: for instance, questions such as "what artifacts are present in sequence data?" and "what attributes are of interest to sequence variant analysts?" This gap in knowledge was largely compensated for by the availability of our analysts and their willingness to meet to clarify or explain their workflow and domain. This availability was probably a result of having collaborators with a high level of interest in the visualization solution. In this collaboration, we feel we did not succumb to PF-11, "no rapport with collaborators," and a good rapport with our collaborators probably facilitated their continued support and feedback.

PF-8, "no need for research: engineering vs. research project," was a very apparent pitfall at multiple stages of this design study; carefully scoping the project to a specific visualization design problem helped avoid this pitfall. PF-8 captures situations wherein a problem can be solved with a system composed of known, not necessarily novel approaches; visualization research contributions from this process may either be minor or nonexistent. The computationally intensive filtering stages required to narrow down a list of sequence variants to a manageable size for analysis, mentioned in Chapter 2, are a difficult engineering problem; their solution would probably not contribute much to the field of visualization. Observing PF-9, "no need for change: existing tools are good enough," encouraged a survey of the field that exposed existing tools such as MedSavant [45] that could satisfy this filtering problem for our collaborators.

A final pitfall in the winnowing stage is PF-10: "no real/important/recurring task."
This pitfall aims to steer visualization design study practitioners away from investing time and effort in design solutions for minor and/or infrequent problems in an analyst's workflow. An example of this pitfall occurred during our winnowing process: we were considering a collaborator who had immediate access to a wealth of promising microRNA data. We were enthusiastic at the prospect of designing a visualization tool for this data, but after additional talks with the potential collaborator, we found that there were no front-line analysts available to work with the data; furthermore, the potential collaborator was busy just trying to generate more raw data, and did not have time to interpret and analyze it at length. Because we had data but no tasks to design a solution for, we parted ways.

The central goal of the winnow stage is to identify the most promising collaborations [38]. In our study, we found that the winnow stage initiated evaluation of whether or not a certain collaboration would result in an effective visualization tool. The concept of a winnow stage also helped us to start thinking about project scope and planning: identifying collaborators able to articulate data and task descriptions made it easier to design a visualization tool within a reasonable scope of time. Definition of pitfalls in the DSM helped us to identify more concrete scenarios that should be avoided. However, some pitfalls defined within the winnow stage of the DSM were more difficult to identify than others. We found PF-4, PF-5, PF-7, and PF-11, shown in Figure 8.7, the easiest to identify early in the winnow stage without progressing to more advanced stages of the design process. We found PF-6, PF-8, PF-9, and PF-10 much more difficult to reliably identify without a deeper knowledge of the analysts' data and tasks. In the DSM, data and task abstraction does not occur until the design phase.
Based on our experience with this methodology, we feel that spending time abstracting data and tasks can help determine whether a problem can be automated (PF-6), whether a problem requires an engineering effort but no interesting research contributions are possible (PF-8), whether existing tools are good enough for the current problem (PF-9), and whether the task is important enough or occurs frequently enough that investing in a visualization solution is worth the time (PF-10).

8.4.2 Pitfalls in the Cast Stage

Characterizing the roles collaborators play during the design study process is part of the cast stage. This stage is important because it can help determine who the visualization solution is being designed for and other considerations such as permission to use potentially sensitive or private data.

The first of the pitfalls associated with this stage is PF-12: "not identifying front-line analyst and gate keeper from start." A visualization solution should be developed specifically for the front-line analysts' data and tasks. Knowledge of this pitfall helped us to identify the front-line analysts whose problem included performing tasks on the data described in Section 4.1. This pitfall may seem like an obvious one to avoid, but often there are other collaborators who can be mistaken for front-line analysts based on their vast knowledge of the problem domain, or their familiarity with the tasks of the actual front-line analysts. In our design study, a fellow computational tool builder was an example of this cast member: they had an expert level of knowledge regarding the variant data generation phase, which was one of their principal responsibilities, and through exposure to, and interaction with, front-line analysts, they came to understand some of the data and tasks those analysts were performing.
In the end, they themselves would not be using the tool for variant analysis, so we ensured we tailored the tool for the collaborators performing direct analysis on the variant data first. This decision helped us avoid PF-14: "mistaking fellow tool builders for real end users." It was also important to recognize that one collaborator could take on many roles. We found that the fellow tool builder mentioned was also a potent connector: they were able to connect us with the front-line analysts, the principal investigator for their lab, and another fellow tool builder that generated variant data. The principal investigator of the lab was able to give us access and permission to use certain data, and have time with the analysts. In the language of the DSM, the principal investigator was a gate keeper, and becoming acquainted with them early helped us avoid PF-12: "not identifying front-line analyst and gate keeper from start." Neglecting to meet with and/or get permission from the gate keeper during the course of the design study can lead to problems with access to data, problems with access to analysts, and problems related to the responsible release of information in publications.

One observation we made during the course of this design study that supplements the description of cast and the operational consequences of casting certain collaborators in the DSM came after we were happy with a near-final design for the system: after checking in with the front-line analysts, we went back to our fellow tool builder collaborators to get their input. What we found is that they could use the tool for spotting artifacts in the variant data that could be a result of the variant data generation pipeline described in Chapter 2. This idea is somewhat captured by PF-13, "assuming every project will have the same role distribution,"
because the fellow tool builder had in some sense become a front-line analyst for a different task: the Data Debug task.

8.4.3 Pitfalls in the Discover Stage

The discover stage of the DSM is related to requirements analysis in software engineering. To adequately characterize the problem front-line analysts are dealing with, it is necessary to extract information from them. This process is usually iterative.

During this information extraction stage, one pitfall is captured by PF-16: "just talking and fly-on-the-wall." Just talking to users is necessary but often not sufficient to extract information about their data and tasks: often what a target user says they do in retrospect is only an incomplete match with their actual activities. Another common observation technique is fly-on-the-wall, wherein the researcher silently and unobtrusively observes the target user completing their tasks in their habitual environment. In our study, we began with a semi-structured interview, and then constructed prototypes between each interview to present at the subsequent one to garner feedback; this process is described in Chapter 3. We found that constructing prototypes that could load real data helped us to avoid mistaking the structure and scale of the true datasets. This process of creating data sketches [20] rather than paper prototypes is valuable even at the very early stages of the design process. This process also invited more feedback from the collaborators, who could test out the interface and comment on, among other things, which data attributes they wished they could see but could not. The demo sessions helped guide the design away from solutions we had first thought were quite final: for instance, one design, shown in Figure 8.4, depicts the variants as blue lines across a transcript and protein annotations.
During a demo session of this tool, analyst A1 mentioned: "This representation looks good, but I know there's supposed to be a stop mutation in this gene. Which one is it? Why can't I see it?" A stop mutation is one of the variant types described in the Data Abstraction of Section 4.1. Exposure to these prototypes helped refine and prioritize the data our analysts were really interested in. Furthermore, the demo sessions could fit into less than one hour time slots during the week, which could be more convenient and time efficient than other invasive methods of watching and interrupting the analysts to ask questions as they worked. Analyst time can be precious: in our study, analysts met with us in addition to their myriad day-to-day responsibilities. It was important to keep interaction with them brief and meaningful. Moreover, seeing iterative prototypes appeared to engage their interest as they could see their feedback manifest itself either in the choice of visual encodings or interaction techniques available in each successive design.

Another pitfall we encountered during the problem characterization stage was PF-17: "experts focusing on visualization design vs. domain problem." This pitfall captures instances wherein target analysts focus on communicating their problems in the form of either new or previous design solutions instead of communicating their domain problems directly. For instance, a potential collaborator explained that their domain problem could be solved with a particular graph visual encoding of their data, without specifying the problem they were trying to solve with their data.
Echoing previous solutions to this pitfall, we also found that focusing on the analysis problem directly, and providing the target analyst with prototype design alternatives, helps tease them apart from design solutions they are fond of or assume will be effective.

A final pitfall encountered during the discover stage is PF-18: "learning their problems/language: too little/too much." This pitfall attempts to guide design study practitioners towards a balance of enough domain problem/language knowledge required to understand the data and tasks involved, without becoming so immersed in the problem domain that domain-specific details creep into the data and task abstraction stages. Having some biological knowledge a priori, we were confident about our understanding of the biological transcript and the attributes and annotations analysts might like to see. However, it was not obvious to us that our final collaborators did not want to see variants that occurred outside of exon regions. This instance was an example of having too little knowledge of the problem domain. Later in the study, during initial writing stages, the data and task abstraction included too much domain information: for instance, we included just "variant mutation types," and neglected to specify that these were "categorical attributes," and that there were a total of about seven of them. This underdeveloped data abstraction made it difficult to motivate why we used the visual encoding of an icon. There are preferable ways of encoding categorical attributes: for instance, hue or icon. Furthermore, knowing the number of categories or the range of values an attribute can exhibit can also motivate their encoding. For instance, if an attribute can only exhibit one of a small, finite number of categories, encoding category with hue is a sound design decision because fewer than one dozen colors are distinguishable when showing categorical data [48].

8.4.4 Pitfalls in the Design Stage

Once the domain problem has been thoroughly characterized, the design study researcher can begin designing a visualization solution. According to the DSM, the design stage includes generation of data abstractions, visual encodings, and interaction mechanisms [38]. A first pitfall in this stage is PF-19: "abstraction: too little." In the previous section we fell into a pitfall that led to not enough abstraction: this problem came from PF-18, and letting too much domain-specific language creep into the data abstraction. Before this specification was resolved, it limited our ability to think clearly about the data we needed to present to the analysts, and successfully connect the theory of visual encoding selection to attribute type: for instance, some visual encodings are more effective for sequential data (can encode with intensity) than categorical (can encode with hue).

Since our design study process incorporated a wealth of prototypes and sketches for the design of Variant View, we feel that we mostly avoided the DSM's PF-20: "premature design commitment: consideration space too small." However, the tool used for the Confirm and Discover tasks is slightly different from the tool used for the Compare task. The Compare task tool presents known variant data to the left and right of a patient variant dataset, and is discussed in Section 7.2; the tool is shown in Figure 7.3. The design cycle for the Compare task tool was much shorter, and we feel that given more time to present prototype design alternatives we could help this design mature and avoid PF-20.

8.5 Scalability Limitations

Variant View supports the display of up to 52 variants per gene on a 1280 by 800 pixel display (primary view: 675 pixels wide; each variant encoding: 13 pixels wide). Above 52 variants, the display scrolls horizontally to show additional variants.
This scale choice is appropriate for our target analysts' datasets, which undergo a previous filtering step in their workflow. We initially experimented with supporting the filtering stage within Variant View itself. Existing tools, such as cBio [4] and MuSiC [9], have similar display limits, which we estimate at 80 and 60 variants, respectively, onscreen without overlaps. The Ensembl variation image [5] and other genome browsers can display hundreds of variants because they are encoded as thin vertical lines, but overlaps and occlusion can lead to difficulties in determining whether there is one or many variants at a single position.

8.6 Future Work

A major limitation of the current tool is the number of variants that can be displayed in the primary view at once. If the number of variants per gene increased above 52, or became much larger, there are two families of methods for reducing the amount of information shown: item reduction methods and attribute reduction methods.

Item reduction methods are pervasive in the HCI and visualization literature; their goal is to reduce the number of items that need to be shown onscreen. Two major methods of item reduction are filtering and aggregation. Navigation, such as panning or zooming in to see fewer items, is suggested to be a special case of filtering. The limitations and time costs associated with zooming are discussed in Section 6.2. Filtering and aggregation are therefore fruitful avenues of exploration.

One possible area of future work would be to directly add visualization support for the variant data filtering stage mentioned in Chapter 2. Addition of filtering was attempted at one stage of the design process, and is shown in Figure 8.4. The three histograms at the top of the screen can be brushed with the mouse cursor to specify a range of quality values.
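The effect of brushing one of these histograms can be sketched as a range filter over a single quality attribute, alongside the binned counts that such a histogram would draw. The following Python is only an illustration: the attribute name `forward_reads`, the bin width, and the example records are hypothetical and do not reflect how the prototype was actually implemented.

```python
def histogram(values, bin_width):
    """Count values per fixed-width bin; returns {bin_start: count}."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


def filter_by_range(variants, attribute, low, high):
    """Keep variants whose attribute lies in the inclusive range [low, high]."""
    return [v for v in variants if low <= v[attribute] <= high]


# Hypothetical variant records with one quality attribute each.
variants = [
    {"id": "v1", "forward_reads": 105},
    {"id": "v2", "forward_reads": 420},
    {"id": "v3", "forward_reads": 1581},
]

# Distribution that a histogram widget would display.
read_counts = histogram([v["forward_reads"] for v in variants], bin_width=500)

# Items that remain after brushing the range 200 to 1600 reads.
kept = filter_by_range(variants, "forward_reads", 200, 1600)
```

The inclusive range is a design choice for the sketch; a real brush could equally use half-open intervals aligned to the histogram bins.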
The histogram is a scented widget [49]; in addition to allowing the user to specify ranges, the scented widget shows the distribution of the data in order to provide information scent. Below these scented widgets are check boxes that filter the data on variant type. We abandoned this design because our collaborators had already pre-filtered their data, so they did not need to specify these ranges. The approach also fell short because the task requires a considerable engineering effort to support manipulation of very large (10,000 item) data sets in a time frame that would be acceptable to the user. After attempting this approach we felt that it fell outside the scope of designing a solution for a visualization research problem. Furthermore, we found that tools such as MedSavant [45] already target interactive filtering and manipulation of large variant data sets, but lack the capability to depict the variant attributes described in Section 4.1 within the context of a gene transcript and protein annotations. One future direction could be to integrate Variant View with MedSavant and allow its back end to support filtering large variant datasets based on various quality metrics. We have considered MedSavant over other tools because we have made initial contact with its developers, who have expressed interest in a collaboration. This collaboration did not take place during the course of this design study due to the constraints of time and the logistics of coordinating with off-site collaborators.

In contrast to the filtering method of item reduction, the aggregation method involves summarizing information about data items with marks that represent several underlying items instead of only one item. Aggregation can be used to construct overviews of data.
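As a minimal sketch of this kind of aggregation, assuming hypothetical variant records given as (transcript position, variant type) pairs and an arbitrary bin width, the following Python groups variants into position bins and summarizes each bin with a single record, much as one aggregate mark would stand in for several underlying items. None of the names or values below come from Variant View itself.

```python
from collections import Counter, defaultdict


def aggregate_variants(variants, bin_width):
    """Summarize (position, variant_type) pairs per fixed-width position bin.

    Each summary record stands in for all variants that fall in its bin and
    reports how many there are and how they split across variant types.
    """
    bins = defaultdict(list)
    for position, variant_type in variants:
        bins[position // bin_width].append(variant_type)

    summary = []
    for index in sorted(bins):
        summary.append({
            "start": index * bin_width,
            "end": (index + 1) * bin_width - 1,
            "count": len(bins[index]),
            "types": dict(Counter(bins[index])),
        })
    return summary


# Hypothetical example: four variants along a transcript, binned by 50 positions.
example = [
    (10, "missense"),
    (12, "missense"),
    (15, "nonsense"),
    (130, "frameshift"),
]
marks = aggregate_variants(example, bin_width=50)
```

Empty bins are omitted in this sketch; whether to draw them as blank marks so that spatial position is preserved would be a separate design decision.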
Since our analysts would prefer to see as much variant information as possible simultaneously, aggregation might be a more appropriate method than filtering, because we can use our data and task abstractions to prioritize a list of attributes for creating marks that adequately represent several underlying variant items instead of just a single one. The challenge of aggregation is to avoid eliminating interesting signals in the dataset when summarizing the data. Determining a priority for attributes could help alleviate this problem. In our design study, variant type was the most important attribute for determining variant impact. After variant type, analysts considered other factors such as variant recurrence. A successful aggregate mark should therefore incorporate variant type. Since we deemed variant recurrence the second most important attribute, it should also be reflected in this mark. Given the current design of Variant View, the scarcest resource for displaying information is horizontal screen space. If there are many variants, we will run out of room in the horizontal, or x-axis, screen dimension. One tool that encodes information horizontally, but applies hierarchical aggregation to summarize information in the vertical direction, is by Stolte et al. [40].

The tool is shown in Figure 8.8, and encodes time series data for debugging superscalar processor performance problems. A horizontal timeline view is encoded as a four-tiered strip chart; there are four levels of aggregation. Starting from the bottom strip up, an analyst can view data at increasing levels of detail. This aggregation approach might be suitable for variant visualization since the bottom strip can preserve spatial information of where along a gene the variant occurs, while the upper strips could encode more attributes of the variant at regions specified on the bottom strip.

Figure 8.8: A hierarchical aggregation example. Multiple levels of aggregation are used to provide multiple levels of overview in this timeline view for debugging superscalar processor performance. From [40], Figure 1. Figure courtesy of Christopher Stolte.

Another possible avenue for future work is the refinement of the Variant View Compare task tool, which is shown in Figure 7.3.
This design was much more preliminary than the design created for the Confirm and Discover tasks, because those tasks were part of the original problem characterization phase. There are a few open considerations for this tool, such as whether it is correct to show just the variants immediately to the right and to the left of the patient's variant in this tool's neighborhood view. It could be that a patient's variant is dismissed as not harmful because we show only the variants to the left and right of it. This problem could be solved through further discussions with the analysts, who might give us a more realistic threshold for the number of neighbouring variants to show around the patient variant. It could also be solved with a tuneable control, in the form of a slider, that allows the neighbourhood threshold to be resized interactively. This project is an interesting avenue, but future work in this area will require more gatekeeper approval considering the sensitivity of the patient data in this domain.

8.7 Conclusions

In this design study we designed, implemented and deployed a tool for genetic variant impact assessment. One contribution of this thesis is a data and task abstraction for the problem domain of variant analysis: our task analysis links concrete, domain-specific questions to this data abstraction, as described in Chapter 4. Another contribution is a discussion in Section 6.2 that reflects on the strengths and weaknesses of genomic coordinates as a data abstraction, a question that has broad implications for the design of biological visualizations. A third contribution is the validated design and implementation of Variant View described in Chapter 6 and Chapter 7. We carefully justify our choices for visual encoding and interaction techniques with respect to the data and task abstractions. We validate the effectiveness of the tool with three case studies of its use after several months of deployment in Chapter 7.
Our final contribution is a discussion in Chapter 8 of the lessons learned in this design study: the design strategy of "specialize first, generalize later", and six design considerations organized into the themes of "what to show" and "how to show it". Variant View was originally designed for the specific variant analysis task of Discover Genes in collaboration with two analysts, but we were able to adapt the design with minimal changes to two additional tasks for other analysts. The combination of thorough data abstraction and task analysis led us to select and prioritize data in this domain in terms of what should be emphasized, de-emphasized, or completely discarded. Our goal was an information-dense overview showing multiple, non-contiguous features at multiple scales. We succeeded in designing a main view that did not require any navigation, and limited our use of interactivity to simple techniques of sorting secondary views and bidirectional linking between views. In contrast, previous tools in this domain rely on interaction techniques that are costly in terms of both speed of execution and mental workload, or present an incomplete view of the dataset, so that some user questions could not be answered.

Bibliography

[1] A. Altmann, P. Weber, D. Bader, M. Preuss, E. B. Binder, and B. Müller-Myhsok. A beginner's guide to SNP calling from high-throughput DNA-sequencing data. Human Genetics, 131(10):1541–1554, 2012.

[2] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2011), 2011.

[3] T. Carver, S. R. Harris, M. Berriman, J. Parkhill, and J. A. McQuillan. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics, 28(4):464–469, 2012.

[4] E. Cerami, J. Gao, U. Dogrusoz, B. E. Gross, S. O. Sumer, B. A. Aksoy, A. Jacobsen, C. J. Byrne, M. L. Heuer, E. Larsson, Y. Antipin, B. Reva, A. P. Goldberg, C. Sander, and N. Schultz. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5):401–404, 2012.

[5] Y. Chen, F. Cunningham, D. Rios, W. M. McLaren, J. Smith, B. Pritchard, G. M. Spudich, S. Brent, E. Kulesha, P. Marin-Garcia, D. Smedley, E. Birney, and P. Flicek. Ensembl variation resources. BMC Genomics, 11(11), 2010.

[6] P. Cingolani, A. Platts, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu, and D. M. Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2):80–92, 2012.

[7] P. Craig and J. Kennedy. Coordinated graph and scatter-plot views for the visual exploration of microarray time-series data. In Proc. IEEE Symp. Information Visualization (InfoVis), pages 173–180, 2003.

[8] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, and R. Durbin. The variant call format and VCFtools. Bioinformatics, 27:2156–2158, 2011.

[9] N. D. Dees, Q. Zhang, C. Kandoth, M. C. Wendl, W. Schierding, D. C. Koboldt, T. B. Mooney, M. B. Callaway, D. Dooling, E. R. Mardis, R. K. Wilson, and L. Ding. MuSiC: Identifying mutational significance in cancer genomes. Genome Res., 22:1589–1598, 2012.

[10] M. N. Edmonson, J. Zhang, C. Yan, R. P. Finney, D. M. Meerzaman, and K. H. Buetow. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics, 27(6):865–866, 2011.

[11] M. Fiume, V. Williams, A. Brook, and M. Brudno. Savant: genome browser for high-throughput sequencing data. Bioinformatics, 26(16):1938–44, 2010.

[12] S. A. Forbes, N. Bindal, S. Bamford, C. Cole, C. Y. Kok, D. Beare, M. Jia, R. Shepherd, K. Leung, A. Menzies, J. W. Teague, P. J. Campbell, M. R. Stratton, and P. A. Futreal. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res., 39(Database issue):945–950, Jan 2011.

[13] G. W. Furnas. Effective view navigation. In Proc. ACM Conf. Human Factors in Computing Systems (CHI), pages 367–374, 1997.

[14] H. Hou, F. Zhao, L. Zhou, E. Zhu, H. Teng, X. Li, Q. Bao, J. Wu, and Z. Sun. MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation. Nucleic Acids Res., 38(Web Server issue):732–736, 2012.

[15] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, Feb 2001.

[16] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Res., 12(6), 2002.

[17] R. Kincaid, A. Ben-Dor, and Z. Yakhini. Exploratory visualization of array-based comparative genomic hybridization. Information Visualization, 4(3):176–190, 2005.

[18] H. Lam. A framework of interaction costs in information visualization. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2008), 15(6):1149–1156, 2008.

[19] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25:2078–2079, 2009.

[20] D. Lloyd and J. Dykes. Human-centered approaches in geovisualization design: investigating multiple methods through a long-term case study. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2011), 17(12):498–507, 2011.

[21] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. A. DePristo. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20:1297–1303, 2010.

[22] P. McLachlan, T. Munzner, E. Koutsofios, and S. North. LiveRAC: interactive visual exploration of system management time-series data. In Proc. ACM Conf. on Human Factors in Computing Systems (CHI), pages 1483–1492, 2008.

[23] M. L. Metzker. Emerging technologies in DNA sequencing. Genome Res., 14:1767–1776, 2005.

[24] M. Meyer, T. Munzner, A. dePace, and H. Pfister. MulteeSum: A tool for exploring space-time expression data. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2010), 16(6):908–917, 2010.

[25] M. Meyer, T. Munzner, and H. Pfister. MizBee: A multiscale synteny browser. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2009), 15(6), 2009.

[26] M. Meyer, M. Sedlmair, and T. Munzner. The four-level nested model revisited: blocks and guidelines. In Proc. VisWeek Workshop on BEyond time and errors: novel evaLuation methods for Information Visualization (BELIV), 2012.

[27] M. Meyer, B. Wong, M. Styczynski, T. Munzner, and H. Pfister. Pathline: A tool for comparative functional genomics. Computer Graphics Forum (Proc. EuroVis 2010), 29(3):1043–1052, 2010.

[28] T. Munzner. Process and pitfalls in writing information visualization research papers. Information Visualization: Lecture Notes in Computer Science, 4950:134–153, 2008.

[29] T. Munzner. A nested model for visualization design and validation. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2009), 15(6):921–928, 2009.

[30] C. Nielsen and B. Wong. Points of view: Managing deep data in genome browsers. Nature Methods, 9(6):521–521, 2012.

[31] C. Nielsen and B. Wong. Points of view: Representing the genome. Nature Methods, 9(5):423–423, May 2012.

[32] C. B. Nielsen, S. D. Jackman, I. Birol, and S. J. M. Jones. ABySS-Explorer: Visualizing genome sequence assemblies. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2009), 15(6):881–888, 2009.

[33] S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski. A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in Bioinformatics, 131(10):1–26, 2013.

[34] P. Pirolli and S. K. Card. Information foraging. Psychological Review, 106(4):643–675, 1999.

[35] M. Plumlee and C. Ware. Zooming versus multiple window interfaces: Cognitive costs of visual comparisons. Proc. ACM Trans. on Computer-Human Interaction (ToCHI), 13(2):179–209, 2006.

[36] K. D. Pruitt, T. Tatusova, and D. R. Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35(Database issue):61–65, Jan 2007.

[37] F. Sanger and A. R. Coulson. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology, 94(3):441–446, 1975.

[38] M. Sedlmair, M. Meyer, and T. Munzner. Design study methodology: Reflections from the trenches and the stacks. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2012), 18(12):2431–2440, 2012.

[39] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29(1):308–311, Jan 2001.

[40] C. Stolte, R. Bosch, P. Hanrahan, and M. Rosenblum. Visualizing application behavior on superscalar processors. In Proc. IEEE Symposium on Information Visualization (InfoVis), pages 10–17, 1999.

[41] J. A. Tennessen, A. W. Bigham, T. D. O'Connor, W. Fu, E. E. Kenny, S. Gravel, S. McGee, R. Do, X. Liu, G. Jun, H. Min Kang, D. Jordan, S. M. Leal, S. Gabriel, S. J. Rieder, G. Abecasis, D. Altshuler, D. A. Nickerson, E. Boerwinkle, S. Sunyaev, C. D. Bustamante, M. J. Bamshad, and H. M. Akey. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337(64):64–68, 2012.

[42] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, 2010.

[43] H. Thorvaldsdóttir, J. T. Robinson, and J. P. Mesirov. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2), 2013.

[44] UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40(Database issue):71–75, Jan 2012.

[45] Univ. Toronto Biolab. MedSavant variant search engine. http://medsavant.com, 2012.

[46] B. Wang, Q. Michelle, A. Woodruff, and A. Kuchinsky. Guidelines for using multiple views in information visualization. In Proc. ACM Advanced Visual Interfaces (AVI), pages 110–119, 2000.

[47] J. Wang, L. Kong, G. Gao, and J. Luo. A brief introduction to web-based genome browsers. Briefings in Bioinformatics, 14(2), 2013.

[48] C. Ware. Information Visualization: Perception for Design. Morgan Kaufmann Publishers, second edition, 2004.

[49] W. Willett, J. Heer, and M. Agrawala. Scented widgets: improving navigation cues with embedded visualizations. IEEE Trans. on Visualization and Computer Graphics (Proc. InfoVis 2007), (13):1129–1136, 2007.
