Out with Outliers: Making Sense of Multi-Threaded Application Performance at Scale with NonSequitur

by

Augustine Wong

BASc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

April 2024

© Augustine Wong, 2024

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Out with Outliers: Making Sense of Multi-Threaded Application Performance at Scale with NonSequitur

submitted by Augustine Wong in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Alexandra Fedorova, Professor, Electrical and Computer Engineering, UBC (Supervisor)
Ivan Beschastnikh, Associate Professor, Computer Science, UBC (Supervisory Committee Member)
Julia Rubin, Associate Professor, Electrical and Computer Engineering, UBC (Supervisory Committee Member)

Abstract

Modern multi-threaded systems are highly complex, which makes their behavior difficult to understand. Developers frequently capture behavior in the form of program traces and then manually inspect these traces. Existing tools, however, fail to scale to traces larger than a million events.

In this thesis, we present an approach for compressing multi-threaded traces so that developers can visually explore these traces at scale. Our approach compresses traces containing millions of events down to a few hundred events. We use this approach to design and implement a tool called NonSequitur.

We evaluate NonSequitur with 42 participants on traces from RocksDB and WiredTiger, two complex database back-ends. We demonstrate that with NonSequitur, participants performed some performance analysis tasks on large execution traces up to 11 times more accurately than with other tools. Additionally, for some performance analysis tasks, participants spent on average three times longer with other tools than with NonSequitur.

Lay Summary

Understanding and fixing performance issues in modern software systems is difficult. Typically, software developers use tools that produce execution logs showing how the software is behaving; developers can then read these logs to diagnose the performance issues. However, the execution logs generated by these tools can contain millions of lines, which makes manually reading the logs impractical. Furthermore, existing tools designed to help developers inspect execution logs are unable to handle large ones.

We present NonSequitur, a tool designed to help developers perform performance analysis tasks on large execution logs. NonSequitur uses RegTime, an algorithm we designed, to summarize the information in large execution traces. We design and evaluate NonSequitur, and show that it can help developers diagnose performance issues captured in large execution logs more quickly and accurately than with other tools.

Preface

The work presented in this thesis was conducted by the author under the supervision of Professor Alexandra Fedorova. I was responsible for implementing NonSequitur, running the user study to evaluate its effectiveness, and analyzing the user study results. Professor Ivan Beschastnikh and Paul Bucci guided me in creating the user study methodology and analyzing the results.
Professor Tamara Munzner also provided guidance with NonSequitur's design.

A version of Chapter 2 has been published [Fedorova A, Mustard C, Beschastnikh I, Rubin J, Wong A, Miucin S, Ye L. Performance comprehension at WiredTiger. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Oct 2018 (pp. 83-94)]. The user study reported in Chapters 6 and 7 was covered by UBC Ethics Certificate number H20-03087.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 The Nature of Performance Comprehension
  2.1 Research Questions
  2.2 WiredTiger
  2.3 Data Collection
  2.4 Data Analysis
  2.5 Threats to Validity
  2.6 Key Findings on Performance Comprehension
3 Related Works
4 NonSequitur Design
  4.1 Trace Compression
  4.2 The RegTime Algorithm
  4.3 RegTime Visual Encoding of RegTime Expressions
  4.4 Visualizing Compressed Execution Traces
5 RegTime Algorithm Evaluation
6 NonSequitur Evaluation Methodology
  6.1 Multithreaded Traces
  6.2 Variable Selection
  6.3 Tools Used for Comparison
  6.4 Experiment Treatments
  6.5 Experiment Tasks
  6.6 Experimental Procedure
  6.7 Threats to Validity
7 NonSequitur Evaluation Results
  7.1 Performance with T1 Tasks
  7.2 Performance with T2 Tasks
  7.3 Performance with T3 Tasks
  7.4 Impact of Performance Debugging Experience
  7.5 Instances When Participants Performed More Poorly with NonSequitur
  7.6 Limitations of the NonSequitur Design
  7.7 Summary of Findings
8 Future Work
  8.1 Short Term Goals
  8.2 Long Term Goals
9 Conclusion
Bibliography
A Supporting Materials

List of Tables

Table 2.1  WiredTiger case study code examples.
Table 3.1  Summary of prior work.
Table 5.1  Compression ratios achieved with RegTime.
Table 6.1  Summary of the execution traces used in our user study.
Table 6.2  Performance analysis tasks used in the study.
Table 7.1  P-values of McNemar's test and the Wilcoxon signed-rank test.

List of Figures

Figure 4.1  NonSequitur workflow.
Figure 4.2  The key steps of the RegTime algorithm.
Figure 4.3  The function sequence from Figure 4.2(1) and the RegTime Visual Encoding of the corresponding RegTime expression.
Figure 4.4  The NonSequitur trace visualization tool.
Figure 4.5  Linked highlighting in NonSequitur.
Figure 6.1  The Chrome TraceViewer tool.
Figure 6.2  The OpTrack tool.
Figure 6.3  The design of our user study for evaluating NonSequitur.
Figure 7.1  The difference scores from the NonSequitur evaluation.
Figure 7.2  The number of minutes the participants spent on each performance analysis task.
Figure 7.3  An example of when NonSequitur is not able to assist software developers with T3 tasks.
Figure 7.4  The NonSequitur visualization of threads 2, 9, and 21 of the normal WiredTiger trace.
Figure 7.5  The NonSequitur visualization of threads 1, 9, 13, and 15 of the normal WiredTiger trace.
Figure 7.6  The boundaries of the RegTime expressions in threads of the normal WiredTiger trace did not always align with the time intervals during which thread 21 called __wt_cond_wait_signal at a callstack depth of 0.

Acknowledgments

This research was funded by Huawei and by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant.

First, I would like to thank Alexandra Fedorova for supervising me. Her guidance and words of encouragement sustained me as I continued with my research. I would like to thank Ivan Beschastnikh and Paul Bucci for helping us design the methodology of the user study we conducted to evaluate NonSequitur. I am also grateful to Tamara Munzner for her assistance with the initial design of NonSequitur.

Finally, I would like to thank my wife for her love and support. There were times during my research when I was unable to spend as much quality time with her as she wanted. Yet, she was always patient and encouraging, even when my body and mind were not always present with her.
After I obtain my degree, I look forward to repaying her love by rededicating myself to her and to our beautiful daughter.

Chapter 1: Introduction

Debugging performance issues in multi-threaded applications is a frequent and challenging task for software developers. Unlike regular software bugs, performance issues do not cause the system to crash or exhibit obviously unusual behavior. Instead, performance issues cause abnormalities in a system's non-functional properties, such as latency, energy consumption, and heat dissipation [22]. Due to the complexity of modern multi-threaded applications, performance issues can take weeks to resolve and involve frequent exchanges between users and developers [22]. Unresolved software performance issues cost companies trillions of dollars [22].

To understand how software developers diagnose performance issues, we conducted a case study that examined the nature of performance comprehension, defined as the tools, information, and processes developers use to investigate performance issues. Our case study used a JIRA database of tickets for WiredTiger, a multi-threaded open-source storage engine, focusing specifically on performance issues [15]. Our study concluded that WiredTiger developers spent most of their performance debugging time investigating "outlier events." An example of an outlier event is a piece of code that runs infrequently, but each occurrence of it lasts a long time [20]. Another example is when the completion rate of a function or operation drops below average [15]. We define an outlier event as a function call invocation, or a gap between function calls, that lasts an unusually long time.

Our study found that developers investigating outlier events analyzed log files capturing thread behavior over time. The developers focused on finding correlations between outlier events and thread activity. Furthermore, the developers manually read log files, which is not practical when the log files are large.

Generally, when developers try to understand the performance of their applications, they can use profilers to collect performance data during program execution. Popular profilers include OProfile, Perf, HPCToolkit, and Intel VTune [36]. However, our study found that profilers are not suitable tools for investigating performance issues caused by outlier events. Because profilers use event sampling to gather performance data, they do not sample at a high enough rate to catch outlier events. As one WiredTiger developer told us during our case study, "I would only expect Perf to be of limited use here. It is useful in giving an indication of where time is spent over the entire run, which is good for general performance tuning but I don't know how to use the results to find what is causing outlier operations" [15].

Resolving outlier events is important for performance debugging because they represent the worst-case system behavior [15]. To find outlier events, developers typically analyze execution traces (i.e., time-stamped operation or function logs), which are generated at runtime and capture each event that a thread executed over time. A typical execution trace of a multi-threaded application captures when a particular thread entered a function and when it exited the function. Therefore, execution traces capture information about outlier events, as well as information about what threads were doing during the time when these outlier events appeared.
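For illustration, an execution trace in this style might contain records like the following (a hypothetical excerpt with invented timestamps in nanoseconds and WiredTiger-style function names; real formats vary by tracing tool):

    TID  Event  Function                 Time (ns)
    4    enter  __wt_btcur_search        1,000
    4    exit   __wt_btcur_search        1,350
    4    enter  __wt_cond_wait_signal    1,400
    4    exit   __wt_cond_wait_signal    900,001,400

In this excerpt, the second call is an outlier: the thread spent roughly 0.9 seconds inside __wt_cond_wait_signal, while the neighbouring call completed in a few hundred nanoseconds.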
However, execution traces that capture outlier events are very large and can contain millions of recorded function invocations across multiple threads. One option developers have for analyzing large execution traces is to employ algorithms that extract important information from the traces. Such algorithms can remove redundant information from execution traces [18], detect "phases of execution" [7, 16, 29, 33], and find interesting patterns of behavior [8, 24]. However, these algorithms are not designed to help software developers find and investigate the causes of outlier events. Furthermore, to the best of our knowledge, most of these algorithms have not been implemented in trace analysis tools that software developers can use. Another option is to use trace visualization tools. However, a survey from 2014 found that the majority of trace visualization tools are unable to visualize very large execution traces containing millions of function invocations [19]. Consequently, even visualization tools that are designed to assist developers with performance debugging [21, 31] cannot be used to perform performance analysis tasks on large execution traces. Conversely, trace visualization tools that can handle large execution traces were not designed to assist developers with analyzing performance bugs caused by outlier events [13, 14].

We present RegTime, an algorithm that compresses execution traces by removing information that is not important for performance debugging. The design of RegTime was informed by the results of our case study on performance comprehension [15]. RegTime takes as input a large execution trace and produces as output a compressed version of the trace that retains enough information for developers to debug performance. We also describe NonSequitur, an interactive visualization tool that lets developers study traces compressed with RegTime.

In summary, we make the following contributions:

• We describe our case study that examined the nature of performance comprehension.

• We present the RegTime algorithm to assist developers with analyzing large execution traces to debug performance.

• We introduce NonSequitur, a tool that visualizes the compressed execution traces generated by RegTime. Consequently, NonSequitur is able to visualize large execution traces within the width of a computer screen. (NonSequitur is open source: https://github.com/auggywonger/nonsequitur_vis)

• We evaluate NonSequitur via a user study with 42 software developers on a set of real-world performance analysis tasks.

Chapter 2: The Nature of Performance Comprehension

Our goal in designing RegTime and NonSequitur was to help software developers debug the performance of software systems. To design these tools to meet the performance debugging needs of software developers, we sought to understand how developers diagnose performance bugs. Although there have been empirical studies that characterized performance debugging, these studies focused on comparing the processes of fixing performance and non-performance bugs [27, 37] or on characterizing performance bugs [23, 30]. The findings of these previous studies were not useful in helping us design RegTime or NonSequitur because they did not analyze what developers do to diagnose those bugs.
Consequently, we conducted our own empirical case study to understand how developers perform performance debugging.

In our case study, we explored how developers study and diagnose performance issues in WiredTiger (http://www.wiredtiger.com/), an open-source high-performance storage engine written in C. Since 2014, WiredTiger has been part of MongoDB (https://www.mongodb.com/), a popular NoSQL database. WiredTiger developers collaborate in a decentralized manner and use Jira [35] for tracking, thoroughly discussing, and resolving all performance issues. We analyzed 44 performance-related tickets in WiredTiger's Jira repository. We focused on the information that developers do and do not collect, the tools they use and do not use, and the processes they follow. We also considered the relationship between information and tools: do information needs drive tool usage, or do developers prefer and use certain tools first, and then sift through the resulting information?

2.1 Research Questions

For our case study, we extracted performance-related tickets from WiredTiger's Jira repository and analyzed them to answer three research questions:

RQ1: What information do developers collect during performance comprehension?
RQ2: What tools do developers use during performance comprehension?
RQ3: What processes do developers follow during performance comprehension?

2.2 WiredTiger

WiredTiger is a multi-threaded open-source storage engine, designed for high performance and scalability. At the time we conducted this case study, it had a team of around 24 developers, with each of the six core developers contributing over 200 commits. The most active developer contributed over 9,000 commits, and the six most active developers contributed over 13,000 commits in total. Three of these developers had 30+ years of experience, one had around 15 years of experience, and the remaining two had around 5 years of experience.

At the time of our case study, WiredTiger consisted of over 84,000 lines of C code (calculated using cloc on .c, .cc, .h, and .i files in the src and api directories of the WiredTiger repository). It runs on POSIX-compatible systems, including Linux and FreeBSD, as well as Windows. WiredTiger clients include several popular web service providers, such as Amazon Web Services. In 2014, WiredTiger was acquired by MongoDB [26], the provider of one of the most popular open-source document-oriented (NoSQL) databases. WiredTiger achieves a 7-10x performance increase over MongoDB's previous storage engine [25].

Why study WiredTiger. We used WiredTiger to study performance comprehension because the main goal of WiredTiger is to deliver a high-performance storage engine [25]. WiredTiger is carefully designed to efficiently use all available CPU cores and large amounts of RAM, and it employs advanced synchronization techniques to enable more efficient parallelism. Developers thus spend a significant amount of effort understanding and resolving performance-related issues. Moreover, WiredTiger presented a unique opportunity for our study because its open development makes it possible to investigate the performance comprehension practices of a successful performance-oriented software team.

2.3 Data Collection

Developers communicated primarily through Jira tickets for issue tracking, saying that they used Jira "as a notebook for debugging sessions."
Other, less significant communication channels were GitHub [34] pull requests for code reviews and merges (over 19,000 commits and over 2,900 pull requests) and mailing lists. The project convention is to maintain all conversations about issues in Jira rather than through other means, such as email, and the WiredTiger team did not use IRC or Slack. Jira is therefore a sufficiently complete source of performance debugging activity, which motivated us to perform our case study with this data source.

To collect performance-related tickets for our analysis, we navigated to the MongoDB Jira issue tracker repository (https://jira.mongodb.org) and selected the WiredTiger project. We searched for tickets satisfying the following criteria:

• Tickets of any standard type, i.e., bug, documentation, improvement, new feature, task, technical debt, or workload.

• Tickets in any state, i.e., open, in progress, in code review, resolved, closed, waiting for user input, needs reviewer, or needs merge.

• Tickets containing the keyword "performance" in either the title or the body of the ticket.

We sampled the repository twice: first in August 2017 and then in February 2018. Both times, we analyzed the tickets one by one using open coding, a method described in more detail in Section 2.4. We stopped when we reached conceptual saturation, that is, when no new information was obtained from analyzing further data. As a result of this process, we analyzed 44 tickets updated between April 2016 and December 2017 (34 in the first round and 8 in the second).

The analyzed tickets contained between 1 and 39 comments made by developers (median 6.5, mean 8.7), with one to seven distinct developers contributing to each ticket (median 3, mean 3). Initial issue descriptions were between 13 and 394 words (median 86.5, mean 111.6). Developer comments were between 3 and 915 words (median 60.5, mean 93.3). In total, 16 developers contributed to the majority of tickets, with each of the top four developers contributing to more than 10 issues.

2.4 Data Analysis

We used open coding, a technique from grounded theory for deriving theoretical constructs from qualitative analysis [10, 17]. We independently read each selected Jira ticket and coded the ideas contained in the data. We focused on the information, tools, and processes that developers use to diagnose performance issues. We did not code the description of the problem or its resolution. Table 2.1 shows examples of the information codes, tool codes, and process codes we used in our case study.

Two members of the team that conducted the study were responsible for performing the coding. On a weekly basis, we met to discuss quotes identified by these two coders, review their grounding in Jira tickets, merge and refine codes and concepts as necessary, identify emerging categories, and refine coding rules and processes. Following Strauss's version of grounded theory, in which data sampling and data analysis are interleaved steps that are repeated until one can describe the researched phenomenon, we continued coding the tickets until all authors agreed that new data did not change our understanding of the performance diagnosis process. At this stage, we had coded the 36 tickets originally sampled in August 2017.
To further validate our results and ensure a stable coding process and set of concepts, we sampled the repository one more time in February 2018 and coded another eight tickets. The complete set of codes is available online (https://goo.gl/DdwU7u).

Table 2.1: WiredTiger case study code examples.

Information codes:
▷ Info-functions-or-activity-over-time-by-threads: Developer obtains the functions or groups of functions (i.e., activity) that were executed over time by different threads.
▷ Info-threads-blocked-delayed-due-to-certain-functions-or-activity: Developer finds when and for how long threads are blocked or delayed when executing certain functions or performing certain activity.
▷ Info-latency-spikes: Developer observes unusually long latencies seen in the program.

Tool codes:
▷ Tool-show-statistics-over-time: Used for two in-house tools that visualized statistics, collected via an application log, as time series.
▷ Tool-pmp (Lib, App): Pmp stands for "poor man's profiler" [3]. This is a script that periodically collects gdb callstacks across all threads and pre-processes them with awk.

Process codes:
▷ process-info-functions-or-activity-over-time-by-threads-tool-pmp: Developer sought to learn about the activity performed over time by threads and used pmp to obtain this information.
▷ process-tool-pmp-info-functions-or-activity-over-time-by-threads: Developer used pmp but without knowing in advance what information might end up being useful in performance debugging.

As a result of this process, we analyzed 44 tickets. We marked between 1 and 106 quotes per ticket (median 14, mean 25.5), identifying 36 codes related to information needs, 22 codes related to tools, and 26 codes related to process (9 for inter-relationships between information and tools, 7 for reproducibility, and 12 for experiments). The four most senior developers were responsible for comments corresponding to 71.6% of the codes.

2.5 Threats to Validity

Internal validity. We might have misinterpreted developers' intentions or misidentified concepts, introducing researcher bias into our analysis. We attempted to mitigate this threat by performing independent coding and cross-validating each code identified in our analysis with at least two members of our team. We also discussed and validated the results during weekly meetings with all team members who were involved in the study.

Moreover, Alexandra Fedorova, who worked as a WiredTiger consultant, confirmed that our understanding of the debugging traces in the tickets we studied is representative of performance debugging sessions. In addition to Alexandra Fedorova, our team had other members who developed database software for other companies, so we were well positioned to understand the technical details in the tickets. We thus believe our analysis is reliable.

In organizations where developers communicate through a variety of different channels, especially in person, bug reports or issue trackers provide incomplete views of a bug's history [9]. However, for WiredTiger, the Jira database was the main communication channel; we are thus confident that Jira was a sufficiently complete source of performance-related debugging information for our study.

External validity. As in other software engineering case studies, our research might not generalize beyond our subject of study: the WiredTiger system and the sample of tickets we consider. It is possible that the practices used by other companies and teams differ from those used at WiredTiger.
Our ticket sample includes patterns and practices followed by 16 experienced software developers, giving us confidence that our study captures at least some of the best practices for performance diagnosis.

2.6 Key Findings on Performance Comprehension

From our case study, we learned the following about the nature of performance comprehension:

▷ WiredTiger developers spent most of their performance debugging time chasing outlier events. These events represent worst-case system behaviour and are thus of critical importance for performance-oriented groups such as WiredTiger.

▷ When looking for root causes of latency spikes, developers considered correlated behavior between threads and events that happen over time. However, developers are not properly supported by existing performance debugging and comprehension tools. We discuss existing performance debugging and comprehension tools in more detail in Chapter 3.

▷ Profilers are not useful for investigating outlier events. Our case study uncovered that WiredTiger developers spent most of their effort on understanding how outlier events relate to other events in the program. Off-the-shelf performance tools, namely profilers, were not helpful because developers relied on a per-thread, time-series view of the execution, which profilers do not provide. Although the sample data collected by profilers could, in theory, be displayed across time and threads, the sampling resolution of profilers is not fine enough to capture outlier events because these events do not occur frequently. Instead, the information that developers needed to investigate outlier events came from log files that showed precisely when and for how long each thread called a function.

▷ Log analysis is often manual and exploratory. For the most part, developers manually sifted through raw log files. Manual log analysis works as long as the logs are not too big or the developer knows what to search for. Unfortunately, as our analysis revealed, developers seldom know what to look for. Instead of starting with a concrete hypothesis and looking for a specific piece of information, developers usually just "throw" a tool at the problem to see what information turns up.

Chapter 3: Related Works

From the key findings of our WiredTiger case study, we formulated three dominant performance analysis tasks:

1. Understanding what threads are doing over time.
2. Identifying when outlier events occur in the execution.
3. Correlating the occurrence of outlier events with the activities of threads.

Moreover, because developers often analyze traces in an exploratory fashion, they must be able to perform these performance comprehension activities on large execution traces.

Performance comprehension relies on timing information. To understand what threads are doing over time, software developers need to know when events occurring in an execution begin and end. To identify when outlier events occur in the execution, developers need to know when events occur and how long they last. To correlate the occurrence of outlier events with the activities of threads, developers must be able to find the time interval during which an outlier occurred and observe what threads were doing during this interval.

The activities that developers perform when diagnosing performance bugs differ from the activities of program comprehension.
Pacione et al. [28] defined nine principal program comprehension activities:

1. Investigating the functionality of (a part of) the system.
2. Adding to or changing the system's functionality.
3. Investigating the internal structure of an artefact.
4. Investigating dependencies between artefacts.
5. Investigating runtime interactions in the system.
6. Investigating how much an artefact is used.
7. Investigating patterns in the system's execution.
8. Assessing the quality of the system's design.
9. Understanding the domain of the system.

Many existing tools are designed for program comprehension, despite the fact that performance bugs are generally harder to resolve than non-performance bugs [27, 37]. Table 3.1 summarizes the features of the existing tools.

Table 3.1: Summary of prior work

Tool                        | Large traces | Considers timing | Finds outliers
Compress. Framework [18]    | ✓            | ✗                | ✗
Phase Finder [29]           | ✓            | ✗                | ✗
Sabalan [8]                 | ✓            | ✗                | ✗
Zinsight [14]               | ✓            | ✓                | ✗
TraceViz [13]               | ✓            | ✓                | ✗
SyncTrace [21]              | ✗            | ✓                | ✓
ExtraVis [12]               | ✗            | ✓                | ✗
TraceDiff [32]              | ✗            | ✓                | ✗
NonSequitur (this thesis)   | ✓            | ✓                | ✓

Prior work has developed algorithms to help developers analyze execution traces. These algorithms summarize the activities captured within execution traces by removing repeated calls due to loops [18] or by dividing the content of a trace into fragments that correspond to the execution phases of a program [29]. These algorithms can help developers understand the functionality of the system. However, they do not consider the timing information captured in the execution traces, treating unusually long function calls the same as those of expected duration. Therefore, these algorithms can neither locate outlier events nor show what activities occurred during the time when the outlier events appeared. Other algorithms were designed to find patterns of behavior captured in execution traces. Sabalan is a trace visualization tool that leverages an algorithm inspired by bioinformatics to display execution behavior that appears repeatedly in multiple traces [8]. The algorithm supporting Sabalan treats execution traces like DNA sequences, where recurring patterns serve biological functions. Sabalan was demonstrated to be effective at helping developers with program comprehension tasks such as feature location or understanding how a system works. But because Sabalan's algorithm treats execution traces like DNA sequences, it also ignores any timing information provided by the traces. Therefore, Sabalan would be of limited use for performance comprehension.

In contrast, many trace visualization tools do display the behavior of execution traces over time. However, these tools are not suited for analyzing large execution traces to diagnose performance bugs caused by outlier events. Trace visualization tools typically fall into one of three categories:

1. The tool can visualize large execution traces, but does not support performance comprehension tasks.
2. The tool can support performance comprehension tasks, but cannot visualize large execution traces.
3. The tool neither supports performance comprehension tasks nor is capable of visualizing large execution traces.

IBM's Zinsight is a trace visualization tool designed to visualize traces from the IBM System z, the brand name for IBM's mainframe computers [14]. TraceViz is a visualization framework designed based on observations made in real situations at STMicroelectronics, a semiconductor company [13].
Both of these tools provide a timeline showing what is happening in an execution over time. However, the main purpose of these tools is to help developers identify patterns of behavior in execution traces. Zinsight has a Sequence Context view that shows the paths of execution that lead to specific events. TraceViz's timeline view of the execution was designed to show periodic temporal behavioral patterns. Therefore, these tools are not suitable for investigating performance bugs caused by outlier events.

SyncTrace visualizes thread behavior over time in sufficient detail that it could theoretically be used for performance comprehension tasks [21]. SyncTrace has a panel residing at the top for visualizing the timeline of a thread. Users can interact with this panel to see correlations in behavior between the thread displayed in the panel and other threads. But SyncTrace was only demonstrated to be capable of visualizing threads executing hundreds of thousands of events. Large execution traces, by contrast, can capture millions of events per thread.

ExtraVis is a trace visualization tool designed explicitly to support program comprehension tasks [12]. Although it displays a vertical timeline of events, the purpose of this timeline is to help developers identify the major phases of an execution. Furthermore, ExtraVis was not shown to be capable of visualizing large execution traces. TraceDiff is another example of a trace visualization tool which neither supports performance comprehension tasks nor is capable of visualizing large execution traces. The purpose of TraceDiff is to visually compare two execution traces [32]. TraceDiff is only capable of visualizing execution traces containing hundreds of thousands of function calls.

Chapter 4: NonSequitur Design

In this chapter we present the design of NonSequitur, a trace visualization tool that helps developers perform the following performance analysis tasks on large execution traces:

1. Understanding what threads are doing over time: Developers can use NonSequitur to examine the timeline of each thread and understand approximately when the threads executed a particular function. For example, developers can use NonSequitur to determine whether a thread executed a particular function throughout its lifetime, or just at the beginning or end of its lifetime.

2. Identifying when outlier events occur in the execution: Developers can use NonSequitur to determine which threads invoked unusually long function calls or experienced unusually long periods of time between two consecutive function calls. Developers can also use NonSequitur to determine when during the execution the outlier events occurred.

3. Correlating the occurrence of outlier events with the activities of threads: Developers can use NonSequitur to see what function calls each thread made during the time interval when outlier events occurred.

[Figure 4.1: NonSequitur workflow. A multi-threaded application is traced (function call capture, e.g., with XRay) to produce raw traces; the RegTime algorithm applies lossy compression to produce encoded traces, which are then loaded and visualized with NonSequitur.]

Figure 4.1 shows the NonSequitur workflow, which is comprised of three stages:
1. Ingest execution traces showing when threads entered and exited functions. From our case study on performance comprehension (Chapter 2), we learned that developers want to know when threads entered and exited functions so that they can investigate outlier events. Therefore, NonSequitur takes as its input a multi-threaded execution trace that captures when threads entered and exited functions. The traces that NonSequitur ingests can be produced by different tracing tools, such as XRay (https://llvm.org/docs/XRay.html), which produces such traces automatically by instrumenting the application at compile time. Most traces we study in the evaluation were produced using XRay, but other tools or manual instrumentation could be used as well. Regardless of what tool produces the execution trace, NonSequitur expects each record in the trace to contain the following information:

   TID: The ID of the thread entering or exiting the function.
   Event Type: A binary value indicating whether the thread entered or exited a function.
   Function Name: The name of the function that the thread is entering or exiting.
   Time: The time when the thread entered or exited the function.

   By ingesting an execution trace, NonSequitur can determine the start and end time of each function call, which thread made that call, and the call duration (a sketch of this reconstruction appears after this list). This information is necessary if NonSequitur is to help developers understand what threads are doing over time, find when outlier events occurred, and identify correlations between the occurrence of outliers and the activities of threads. Because we want developers to be able to use NonSequitur to analyze large execution traces, we designed NonSequitur to handle multi-threaded traces that contain millions of function entry and exit records per thread.

2. Trace compression with the RegTime algorithm. To support a large multi-threaded execution trace, NonSequitur must compress the trace into a representation that can be practically visualized. At the same time, it must compress the trace in a way that retains the information software developers require to debug performance. To meet these two requirements, NonSequitur uses RegTime, an algorithm that compresses execution traces by removing information that is not important for performance debugging. We present RegTime in Sections 4.1 and 4.2.

3. Visualizing the compressed execution traces. NonSequitur visualizes the compressed execution traces generated by RegTime. Developers can interact with a NonSequitur visualization to perform the performance analysis tasks. We present the NonSequitur visualization in Sections 4.3 and 4.4.
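To illustrate stage 1 (the sketch promised above), per-call records can be reconstructed from entry and exit events using one stack per thread. This is our own minimal Python sketch, not NonSequitur's actual ingestion code; the field names follow the record format listed above:

    from collections import defaultdict

    def reconstruct_calls(events):
        # Turn (tid, event_type, function, time) entry/exit records into
        # completed calls with start, end, duration, and stack depth.
        stacks = defaultdict(list)   # tid -> stack of (function, entry time)
        calls = []
        for tid, event_type, function, time in events:
            if event_type == "enter":
                stacks[tid].append((function, time))
            else:  # "exit": pop this thread's innermost open call
                fn, start = stacks[tid].pop()
                assert fn == function, "unbalanced entry/exit records"
                calls.append({"tid": tid, "function": function,
                              "start": start, "end": time,
                              "duration": time - start,
                              "depth": len(stacks[tid])})
        return calls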
4.1 Trace Compression

We focus on compressing traces containing time-stamped records of function calls, though traces with other kinds of log records could potentially be treated with the same algorithms. We are not the first to propose a technique for compressing execution traces to make them more understandable. Hamou-Lhadj et al. [18], for instance, created a framework for lossless trace compression. Lossless compression techniques achieve compression such that the exact original dataset can be recreated from its compressed version. Lossy compression techniques produce better compression ratios by discarding information, but the original dataset cannot be recreated. Consequently, there is a trade-off between achieving better compression ratios and retaining information from the original dataset.

To compress traces into a representation that can be visualized in a small amount of screen real estate, RegTime must reduce a trace with millions of events down to a few thousand. Given a compression ratio n_initial / n_after, where n_initial is the initial number of function records in the trace and n_after is the number of function records after compression, RegTime must achieve a compression ratio of at least 1,000. However, the lossless compression proposed by Hamou-Lhadj et al., the best technique known to us, demonstrated a maximum compression ratio of around 41. Therefore, we chose to design RegTime as a lossy compression algorithm.

4.2 The RegTime Algorithm

We use Figure 4.2 and Algorithms 1 and 2 to explain how RegTime works. We use the term RegTime expression to refer to a function sequence that has been compressed by RegTime.

[Figure 4.2: The key steps of the RegTime algorithm.]

Figure 4.2 visualizes the key steps of the algorithm on an example. Figure 4.2(1) shows a compressible function sequence, which is a function sequence executed by a thread that we wish to compress with RegTime. The compressible function sequence in Figure 4.2(1) is visualized as a sequence of callstacks. We construct each callstack using a stack data structure, based on the thread's function entries and exits. Whenever the thread enters a function, we push this function onto the stack. For instance, Figure 4.2(1) displays the callstack P,X, signifying that the thread entered function X right after function P. Conversely, every time the thread exits a function, we pop the top function from the stack. Continuing with the example, Figure 4.2(1) shows callstack A following callstack P,X, demonstrating that after exiting functions X and P, the thread entered function A. Figure 4.2(2) shows that the compressible function sequence has four unique callstacks, which we label S0, S1, S2, and S3.

RegTime creates a RegTime expression by compressing the callstacks that make up a compressible function sequence. To create the compressed callstacks (compressed_seq in Algorithm 2), RegTime first merges identical callstacks. In Figure 4.2(3), RegTime creates four groups of merged callstacks: S0', S1', S2', and S3'. For example, RegTime formed S2' by merging the three instances of S2. Next, RegTime arranges the groups according to the order in which the first occurrence of each unique callstack appears. Figure 4.2(4) shows the first occurrence of each unique callstack, and Figure 4.2(5) lists the final ordering.
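The merge-and-arrange step of Figure 4.2(3)-(5), which Algorithm 2 below performs on lines 23-25, can be sketched as follows. This is our own illustration under stated assumptions: it folds the merge and sort into one function, and it assumes each call record carries a 'callstack' tuple and a 'duration':

    def merge_callstacks(compressed_seq):
        # Merge identical callstacks, accumulating their durations, and
        # order the merged groups by each callstack's first occurrence.
        merged = {}        # callstack -> accumulated duration
        first_seen = {}    # callstack -> index of first occurrence
        for i, call in enumerate(compressed_seq):
            stack = call["callstack"]          # e.g., ("P", "X")
            merged[stack] = merged.get(stack, 0) + call["duration"]
            first_seen.setdefault(stack, i)
        # The groups come out in the order their callstacks first appear,
        # matching the final ordering in Figure 4.2(5).
        return sorted(merged.items(), key=lambda kv: first_seen[kv[0]])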
We now present the full algorithm in detail. RegTime iterates through the timeline of each thread captured in a multi-threaded execution, deciding which function sequences invoked by a thread can be compressed into a RegTime expression (Algorithm 1, lines 5-6). There may be cases where RegTime decides that none of the function sequences executed by a thread need to be compressed (Algorithm 1, line 9). Alternatively, RegTime can compress some function sequences executed by a thread while leaving other sequences uncompressed (Algorithm 1, lines 12-15). For a given thread, RegTime iterates through the function call records corresponding to that thread to find the start of a compressible function sequence (Algorithm 2, lines 6-13). To simplify our explanation of RegTime, we treat each function call record as including the start and end time of the function call, the stack depth of the call, and the call duration.

Algorithm 1 CompressExecution
 1: procedure COMPRESSEXECUTION(T)
 2:   RegTimes = [][]   ▷ RegTimes[t] holds entries [expr, s, e]: a RegTime expr spanning positions s to e of thread t's execution
 3:   for tid in T do   ▷ Iterate through all threads in trace T
 4:     fncall_idx ← 0
 5:     while fncall_idx < T[tid].length do   ▷ Iterate through the thread's timeline
 6:       (expr, start_pos, end_pos) ← FINDRTEXPR(fncall_idx, T[tid])
 7:       if expr is nil then
 8:         ▷ No RegTime expr found starting at fncall_idx
 9:         RegTimes[tid].append([nil, fncall_idx, T[tid].length - 1])
10:         break
11:       end if
12:       if start_pos > fncall_idx then
13:         ▷ The found RegTime expr interval does not start at fncall_idx
14:         RegTimes[tid].append([nil, fncall_idx, start_pos - 1])
15:       end if
16:       ▷ Record the found RegTime expr
17:       RegTimes[tid].append([expr, start_pos, end_pos])
18:       fncall_idx ← end_pos + 1
19:     end while
20:   end for
21:   return RegTimes
22: end procedure

RegTime considers a function sequence to be compressible if it satisfies three conditions:

▷ Condition 1: The function sequence is comprised of function calls whose durations are less than CallDurationThresh.

▷ Condition 2: The time between two consecutive function calls is less than CallGapThresh.

▷ Condition 3: The time interval occupied by the function sequence does not take up more than a TotalTimeFractionThresh of a thread's total execution time.

Algorithm 2 FindRTExpr Function
 1: procedure FINDRTEXPR(fncall_index, TTrace)
 2:   ▷ Stay in the Skip state until we find a compressible fn call
 3:   state ← Skip
 4:   compressed_seq = []
 5:   ▷ Iterate through the thread's fn calls
 6:   for i = fncall_index to TTrace.length do
 7:     ▷ Whether the call at i has an outlier duration
 8:     is_long_fn ← TTrace[i].duration > CallDurationThresh
 9:     if state is Skip and not is_long_fn then
10:       state ← Compress   ▷ Start compressing
11:       expr_start_pos ← i
12:       expr.starttime ← TTrace[expr_start_pos].starttime
13:       compressed_seq.append(TTrace[i])
14:     else if state is Compress then
15:       expr_end_pos ← i - 1
16:       expr.endtime ← TTrace[expr_end_pos].endtime
17:       idle_period ← (TTrace[i].starttime - TTrace[i-1].endtime)
18:       is_long_idle ← idle_period > CallGapThresh
19:       is_long_regt ← expr longer than TotalTimeFractionThresh of thread exec time
20:       if is_long_fn or is_long_idle or is_long_regt then
21:         ▷ Reached the end of a compressible sequence.
22:         ▷ Compress the sequence so far and return.
23:         merged_callstacks = merge_callstacks(compressed_seq)
24:         sorted_merged_callstacks = sort(merged_callstacks)
25:         expr.compressed_callstacks = sorted_merged_callstacks
26:         ▷ Return the RegTime expr
27:         return expr, expr_start_pos, expr_end_pos
28:       end if
29:       ▷ Add TTrace[i] and continue compressing
30:       compressed_seq.append(TTrace[i])
31:     end if
32:   end for
33:   return nil, nil, nil   ▷ No RegTime expr found
34: end procedure

Condition 1 ensures that RegTime compresses those portions of a thread's behavior that have a high density of function calls; these regions are ideal for compression. Condition 1 also ensures that RegTime does not lose outlier calls with long durations. Additionally, because an outlier event can be an unexpectedly long period of time between two consecutive calls, Condition 2 ensures that these periods are not lost.

Finally, Condition 3 guarantees that RegTime will use multiple RegTime expressions to represent the behavior of a thread, even if all the calls by that thread satisfy Conditions 1 and 2. Each RegTime expression is visualized separately on a timeline by NonSequitur (as explained next). If a thread with a consistent and repetitive sequence of function calls were compressed into only a single RegTime expression, then the visualization would include only a single visual element. We reasoned that software developers would better understand how a thread's behavior changes over time if we showed it with multiple visual elements. Hence, we use multiple RegTime expressions.
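The boundary tests that end a compressible sequence (Algorithm 2, lines 8 and 17-19) can be expressed as a single predicate. The sketch below is illustrative only; it assumes the thresholds have already been converted to absolute time values and that i >= 1:

    def at_sequence_boundary(calls, i, expr_start_time,
                             call_duration_thresh, call_gap_thresh,
                             max_expr_span):
        # Condition 1: the call at i is itself an outlier-length call.
        is_long_fn = calls[i]["duration"] > call_duration_thresh
        # Condition 2: a long idle gap precedes the call at i.
        idle_period = calls[i]["start"] - calls[i - 1]["end"]
        is_long_idle = idle_period > call_gap_thresh
        # Condition 3: the expression already spans too much of the thread's time.
        is_long_regt = (calls[i - 1]["end"] - expr_start_time) > max_expr_span
        return is_long_fn or is_long_idle or is_long_regt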
Once the algorithm has found the beginning of a compressible function sequence, it continues to iterate through the function call records until it determines that it has reached the end of the compressible sequence. Then, RegTime generates a RegTime expression to represent the compressible function sequence (Algorithm 2, lines 21-27).

A RegTime expression has three attributes that allow it to represent a function sequence: the start and end times of the function sequence, and the compressed callstacks: a sequence of sorted groups of callstacks captured during the function sequence's period (Algorithm 2, line 25). Figure 4.2 illustrates how RegTime generates these compressed callstacks.

4.3 RegTime Visual Encoding of RegTime Expressions

We use the term RegTime Visual Encoding to refer to the visualization technique NonSequitur uses to display RegTime expressions. Figure 4.3 compares the function sequence from Figure 4.2(1) to the RegTime Visual Encoding of the RegTime expression for this sequence from Figure 4.2(5). NonSequitur uses a black box to denote the start and end times of the expression. It uses stacked rectangular glyphs to encode the compressed callstacks in the RegTime expression.

[Figure 4.3: The function sequence from Figure 4.2(1) and the RegTime Visual Encoding of the corresponding RegTime expression.]

If we consider the RegTime Visual Encoding in Figure 4.3(2), we can discern the following information about the function sequence it represents:

✓ Caller-callee relationships. RegTime preserves the caller-callee relationships between the functions. For example, the Visual Encoding shows that function X is always called by function P.

✓ Order and functions. The function sequence consists of the functions A, P, X, and Y. RegTime preserves all the functions in the trace, unless their cumulative duration over the entire trace is negligible. RegTime does not preserve the total order of the callstacks, because it sorts them by their first occurrence within the encoded period. E.g., callstack A is shown first in the encoded sequence, callstack P,Y second, and callstack P,X third. If the program also happened to invoke P,Y after P,X (in addition to invoking it after A), the visual encoding would not change. Every callstack is visualized only once in the encoding period regardless of how many times it appeared, so the total order of the callstacks is not preserved, only the order of first occurrence.

✓ Encoded interval. RegTime preserves the start and end time of the encoded function sequence. In this example, the function sequence started at 0 ns and ended at 7 ns.

✓ Cumulative duration of functions. RegTime uses the cumulative duration of all callstacks in the encoded sequence to determine the relative width of the corresponding glyphs. For example, given that the cumulative duration of function A in the sequence is 3 ns, function A's glyph takes up about 3/7 of the encoding's width.
✓ Loose causal relationship. Since the order in which the groups of merged callstacks appear in a RegTime expression corresponds to the order in which the first occurrences of the callstacks appear in the function sequence, we have some sense of the causal relationships between functions executed by the thread. For example, the RegTime Visual Encoding shows that function P appears after function A. Strictly speaking, this indicates only that an occurrence of function P appeared after an occurrence of function A at least once in the function sequence. However, because the behavior of threads is highly repetitive [11], it is likely that function P always appeared after function A. Since a function's name typically captures the function's purpose [6], users can further ascertain causal relationships from the function names.

However, due to the lossy compression, we are unable to discern the following from a RegTime Visual Encoding:

✗ Individual function calls. Because invocations of the same function are merged together, we cannot tell when individual function calls were executed or what their durations were.

✗ Precise causal relationships. Two merged functions appearing next to each other within a RegTime Visual Encoding does not always mean there is a causal relationship between these functions. Figure 4.3(2) shows function X appearing after function Y, even though Figure 4.3(1) shows that there is no causal relationship between these functions.

4.4 Visualizing Compressed Execution Traces

We developed NonSequitur to visualize the encodings generated by RegTime. The tool is open source (https://github.com/auggywonger/nonsequitur_vis). After compressing the traces with RegTime, NonSequitur constructs and outputs an .html file that contains the visualization. This file can be opened with any browser.

Figure 4.4 showcases a NonSequitur visualization of two threads from RocksDB, a storage engine developed by Facebook (presently Meta). NonSequitur consists of two parts: the task panel and the main panel.

[Figure 4.4: The NonSequitur trace visualization tool.]

The main panel shows what the threads are doing over time. As discussed in Section 4.2, the summary of an execution trace produced by RegTime can contain both compressed (encoded) and uncompressed segments. NonSequitur visualizes the activities within the uncompressed segments as callstacks over time, while it uses RegTime Visual Encodings to visualize the compressed segments. The main panel is divided into rows, one for each thread. The X axis represents time, while the Y axis represents callstack depth.

Figure 4.4 shows the activity of two threads, Thread 4 and Thread 11. The original traces of these threads captured approximately 2.5 million and two hundred thousand function calls, respectively. However, RegTime summarized the behavior of these threads using a few hundred function calls. The summaries produced by RegTime are short enough that the horizontal space occupied by NonSequitur visualizations does not exceed 1,300 pixels. Thus, software developers using NonSequitur do not need to scroll horizontally within the main panel to see different parts of the visualization.

The task panel sits at the top of NonSequitur. Using the task panel, users can select one or more thread timelines to show, while hiding the others (Figure 4.4(1)). We also let users search the execution for function names (Figure 4.4(2)). After users enter the name of a function, NonSequitur displays only those threads which have called that function.

NonSequitur relies on colour to distinguish which function calls are made by the threads. Below the search bar is a legend showing users the mapping between colours and functions (Figure 4.4(3)). Figure 4.4(3) shows that many of the functions in the legend have been assigned the colour grey. The reason is that only a limited number of colours should be used to represent categories when visualizing data. NonSequitur assigns distinctive colours to only a subset of functions, while the other functions are shown in grey. To decide which functions should be represented with distinctive colours, NonSequitur computes a prominence score for each function: the number of times the function appears in the execution trace summary produced by RegTime, multiplied by the number of threads that invoked the function. Users can click on a function in the legend; occurrences of that function are then highlighted in the main panel.
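As an aside, the prominence score described above could be computed as in the following sketch (ours, not NonSequitur's actual code; the cap of eight distinctively coloured functions is an assumption for illustration):

    from collections import Counter

    def prominent_functions(summary_calls, num_colours=8):
        # Score = (occurrences in the RegTime summary) x
        #         (number of distinct threads that invoked the function).
        occurrences = Counter(c["function"] for c in summary_calls)
        threads = {}
        for c in summary_calls:
            threads.setdefault(c["function"], set()).add(c["tid"])
        scores = {fn: occurrences[fn] * len(threads[fn]) for fn in occurrences}
        # Functions outside the top num_colours are drawn in grey.
        return sorted(scores, key=scores.get, reverse=True)[:num_colours]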
To improve readability in NonSequitur visualizations, each rectangular glyph is designed to have a minimum width in pixels. This design choice ensures that even the smallest glyphs are visible to users. However, a consequence of this approach is that the X axis in NonSequitur visualizations is nonlinear. For example, on a linear X axis, some function calls of extremely short duration would occupy less than one pixel of horizontal space, whereas enforcing a minimum width allows these function calls to occupy a visible amount of horizontal space.

A nonlinear X axis can make it challenging for users to discern correlations between activities across different threads. With a nonlinear X axis, function calls in multiple threads may appear to have occurred simultaneously when, in fact, they happened at different times.

To mitigate this issue, NonSequitur uses linked highlighting to help users find correlations in activities across threads. If a user hovers over a function call in the visualization that is not within a RegTime Visual Encoding, NonSequitur highlights in green the activities of all the threads during the time this function call was invoked.

[Figure 4.5: Linked highlighting in NonSequitur.]

For example, Figure 4.5(1) shows the green highlighting that appears in the main panel when we hover over an unusually long invocation of the function rocksdb::InstrumentedCondVar::Wait(). The green highlighting indicates that during the time interval when thread 4 made this unusually long invocation, thread 11 experienced a period of inactivity. Similarly, if a user hovers over a function call belonging to the RegTime Visual Encoding in Figure 4.5(2), NonSequitur highlights in green the activities of all the threads during the time that the function sequence represented by the RegTime Visual Encoding was invoked.

Note that hovering over a function call in the main panel also reveals a tooltip, which provides more details about that function call. If a user hovers over a function call outside of a RegTime Visual Encoding, the tooltip shows the name of the function and the duration of the function call. If a user hovers over a function call within a RegTime Visual Encoding, the tooltip shows the name of the function and the cumulative duration of that function's calls within the function sequence represented by the RegTime Visual Encoding.

Using this functionality, we can intuit from Figure 4.5(1) that the unusually long invocation of rocksdb::InstrumentedCondVar::Wait() in thread 4 had something to do with a very long period of inactivity in thread 11. We cannot infer a causal relationship one way or the other, nor can we know whether the correlation was spurious, but we can begin forming hypotheses leading to the diagnosis of the performance bug.
Chapter 5

RegTime Algorithm Evaluation

Table 5.1 shows the compression ratios achieved by RegTime when compressing the traces of threads in WiredTiger (MongoDB's storage engine), RocksDB, and the Chrome browser. The WiredTiger and RocksDB traces were collected using XRay, a function call tracing system developed by Google [5]. XRay is supported by LLVM, which automatically instruments the application given a compilation flag. We obtained the Chrome traces using Windows Process Monitor [4].

For this experiment, we configured RegTime with the following threshold values. We empirically found that these values tend to produce the most compact yet useful visualizations (a sketch at the end of this chapter illustrates how they might be applied).

▷ CallDurationThresh: A function call is considered short if its duration is less than 1% of the thread's total execution time.

▷ CallGapThresh: The time between two consecutive function calls is considered short if it is less than 0.1% of the thread's total execution time.

▷ TotalTimeFractionThresh: The time interval occupied by a compressible function sequence cannot exceed 13% of a thread's total execution time.

Table 5.1 shows that RegTime compression ratios have a high variance, which strongly depends on the shape of the call stacks in the trace. However, in all cases RegTime produced a compressed trace that can be easily inspected using NonSequitur. As we will see in the evaluation, this is critical because existing tools are frequently unusable at scale.

Table 5.1: Compression ratios achieved with RegTime. n_before is the number of events in the trace before compression; n_after is the number of events in the trace after compression. The compression ratios of the trace sizes in MB were approximately the same.

System     | Thread | # unique events | n_before  | n_after | Comp. Ratio
RocksDB    | 0      | 74              | 2,466,295 | 323     | 7,635
RocksDB    | 1      | 316             | 2,762,774 | 2,485   | 1,111
WiredTiger | 0      | 115             | 7,784,936 | 546     | 14,258
WiredTiger | 1      | 81              | 6,935,791 | 209     | 33,185
Chrome     | 0      | 34              | 274,361   | 96      | 2,857
Chrome     | 1      | 8               | 46,464    | 56      | 829

Achieving the high compression ratios in Table 5.1 enables user-friendly visualization of very large traces, but also requires dropping certain information. This is an inherent trade-off made by our design, which means that RegTime and NonSequitur are suitable for certain performance-analysis tasks, but not for others.

For example, NonSequitur would not be useful for within-function optimizations or other tasks where the precise order of executed functions is crucial information. On the other hand, RegTime, with its goal of diagnosing the root causes of outlier events, helps developers: a) identify when outlier events occur, and b) observe what threads are doing prior to and during these events.
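To make the three thresholds concrete, the following sketch applies them to one thread's calls when deciding whether a candidate sequence is compressible. The predicate and its parameters mirror the thresholds above, but the code is our own illustration, not RegTime's actual implementation:

```python
# Threshold values used in the experiment, expressed as fractions of a
# thread's total execution time (our own illustration of the rules).
CALL_DURATION_THRESH = 0.01        # a call is "short" below 1%
CALL_GAP_THRESH = 0.001            # a gap is "short" below 0.1%
TOTAL_TIME_FRACTION_THRESH = 0.13  # a compressible sequence spans <= 13%

def is_compressible(calls, total_time):
    """Check whether a candidate function sequence could be compressed.

    `calls` is a non-empty list of (start, end) tuples for consecutive
    calls by one thread; `total_time` is the thread's total execution time.
    """
    if not calls:
        return False
    for start, end in calls:
        if (end - start) / total_time >= CALL_DURATION_THRESH:
            return False  # contains a long (outlier) call
    for (_, prev_end), (next_start, _) in zip(calls, calls[1:]):
        if (next_start - prev_end) / total_time >= CALL_GAP_THRESH:
            return False  # contains a long gap between calls
    span = calls[-1][1] - calls[0][0]
    return span / total_time <= TOTAL_TIME_FRACTION_THRESH
```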
Next, we evaluate the hypothesis that RegTime is effective at these tasks despite discarding information from the trace.

Chapter 6

NonSequitur Evaluation Methodology

We empirically evaluate how effective NonSequitur is at helping software developers diagnose performance issues by answering the following research questions:

RQ1: Does NonSequitur help analyze large execution traces to understand what threads are doing over time?

RQ2: Does NonSequitur help analyze large execution traces to find when unusually long function latencies occur?

RQ3: Does NonSequitur help analyze large execution traces to learn what one thread was doing during the time when another thread was blocked or delayed?

RQ4: For small traces that can be visualized with other tools, does NonSequitur lead to false conclusions in questions regarding thread activity over time as compared to other tools that do not drop information from the trace?

To address these questions, we conducted a user study with 42 participants. The rest of this chapter presents our methodology and results.

6.1 Multithreaded Traces

Each participant was asked to carry out performance analysis tasks on the three execution traces shown in Table 6.1. The first two traces, which we call the normal WiredTiger (WT) trace and the large WiredTiger trace, were taken from WiredTiger, MongoDB's storage engine. The third trace, which we call the large RocksDB (RDB) trace, was taken from RocksDB. We derived these traces from two software systems where we knew developers were struggling to diagnose performance bugs: WiredTiger from our case study examining the nature of performance comprehension [15], and RocksDB from our own interactions with its developers.

Table 6.1: Summary of the Execution Traces Used in Our User Study

Trace Name | Description | Trace Size | Total Exec. Time
WT normal | A trace of WiredTiger executing queries over a log-structured merge-tree (LSM tree). | 24 threads making 0.8M calls to 20 different functions | 0.5 minutes
WT large | A large trace of WiredTiger running an update-heavy benchmark test. | 28 threads making 50M calls to 320 different functions | 5 minutes (includes the benchmark test setup)
RDB large | A trace of RocksDB running the db_bench benchmark test with parallel writes enabled. | 32 threads making 39M calls to 747 different functions | 20 minutes (includes the benchmark test setup)

Although the normal WiredTiger and large WiredTiger traces came from the same system, they capture WiredTiger performing two different tasks. Additionally, the number of function invocations captured in the large WiredTiger trace is 60 times larger than the number of function invocations in the normal trace. Therefore, participants would not be able to apply their knowledge of the normal WiredTiger trace to analyze the large one.

We recruited 42 participants for our user study.
The participants included computer engineering students and professional software developers:

▷ We classified 7 of the participants as "WiredTiger Developers." These participants are either former or current WiredTiger developers.

▷ We classified 20 of the participants as "Experienced Performance Debuggers." These participants indicated they had three or more years of experience debugging the performance of software systems.

▷ We classified 10 of the participants as "Performance Debuggers." These participants stated they had 1-2 years of experience debugging the performance of software systems.

▷ We classified 5 of the participants as "Inexperienced Performance Debuggers." These participants stated they had no experience debugging the performance of software systems.

Because none of the participants indicated that they had experience developing RocksDB, we did not categorize any participants as RocksDB developers.

6.2 Variable Selection

Our study involved one independent variable: the type of trace visualization tool participants used for performance analysis. Participants were required to use either NonSequitur or other tools, which we refer to as "other," described in the next section.

The dependent variables were (1) how accurately the participants were able to perform the performance analysis tasks, and (2) how long participants took to perform the tasks.

6.3 Tools Used for Comparison

We used two tools in the other category: Chrome TraceViewer [1] and OpTrack [2]. These were the only publicly available tools known to us that, like NonSequitur, display per-thread callstacks over time. Unlike NonSequitur, they do not drop any information from the trace, so they were ideal for answering RQ4.

Chrome TraceViewer, which is part of the Chrome Developer Tools suite, is representative of a typical trace visualization tool. It visualizes multi-threaded execution traces as rows of icicle plots, with each icicle plot depicting the callstacks that occurred over time within one thread (see Figure 6.1). Chrome TraceViewer offers users a number of navigation features, including zooming into the visualization, highlighting functions of interest with prominent colours, and scrolling horizontally across the X axis.

Figure 6.1: Chrome TraceViewer Tool.

Figure 6.2: OpTrack Tool.

Since Chrome TraceViewer visualizes all events in a trace and uses the width of a single screen to do this, it is unable to visualize execution traces containing millions of events: the UI crashes when presented with very large traces. Therefore, for the large traces we used OpTrack, which is part of the WiredTiger distribution [2]. Similarly to Chrome TraceViewer, OpTrack visualizes multi-threaded execution call stacks over time (see Figure 6.2). To handle large traces, OpTrack divides traces into 240 equal time intervals and shows one interval at a time; users select the interval to view from a navigation panel. Consequently, OpTrack cannot show the entire execution all at once. Other than the navigation panel, OpTrack does not have any notable navigation features such as zooming or highlighting functions of interest.
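For reference, OpTrack's interval scheme amounts to an equal split of the trace's time range. The helper below is our own sketch of that scheme, not OpTrack's code:

```python
NUM_INTERVALS = 240  # OpTrack shows one of 240 equal time intervals at a time

def interval_bounds(trace_start, trace_end, index):
    """Return the (start, end) time of interval `index` in [0, 239]."""
    width = (trace_end - trace_start) / NUM_INTERVALS
    return trace_start + index * width, trace_start + (index + 1) * width

# A 5-minute (300 s) trace yields 1.25 s intervals:
print(interval_bounds(0.0, 300.0, 0))    # (0.0, 1.25)
print(interval_bounds(0.0, 300.0, 239))  # (298.75, 300.0)
```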
6.4 Experiment Treatments

Our study followed a within-subjects design. Each participant was required to analyze each trace twice: once with NonSequitur (the treatment condition), and once with another tool (the control condition).

We exposed participants to either the treatment condition first for all traces or the control condition first for all traces. This means that a participant would either always analyze a trace with NonSequitur first or always analyze a trace with NonSequitur last.

6.5 Experiment Tasks

To quantify how well the participants performed the performance analysis tasks, they were required to answer questions that assessed their understanding of the trace. Questions with multiple correct answers were worth between two and three points, while all other questions were worth one point.

Asking exactly the same question under both the treatment and the control condition would create undesirable learning effects. That is, a participant could learn the answer to a question using the treatment tool and then would trivially know the answer with the control tool without having to actually apply it. To that end, we presented users with very similar, but non-identical, questions for each task. Table 6.2 lists the tasks and example questions.

Table 6.2: Performance analysis tasks used in the study

Task | Task Description | Example Question
T1 | Understand what a thread is doing over time. | Which thread executes __evict_lru_walk throughout its lifetime but only executes __evict_page at the beginning and at the end of its lifetime?
T2 | Find when unusually long latencies or long periods between function calls occur. | Which threads contain invocations of rocksdb::WriteThread::LinkOne exceeding 500 seconds? When do these invocations occur in the execution?
T3 | Understand what one thread was doing during the time when another thread was blocked or delayed. | During the time intervals when thread 24 is not executing any function, which thread spends most of its lifetime in __wt_cond_wait_signal?

Figure 6.3: The design of our user study for evaluating NonSequitur.

6.6 Experimental Procedure

Figure 6.3 shows the procedure for the user study. The procedure consisted of three phases:

▷ Prestudy: We asked potential participants to complete a pre-questionnaire to gather demographic information about them (see Figure 6.3(1)). Students who indicated that they had completed at least five computer engineering courses or possessed at least one school term of software development work experience were allowed to participate in the study, as were non-student respondents who indicated that they were working as software developers.

▷ Training: All participants watched tutorial videos explaining how to use NonSequitur, Chrome TraceViewer, and OpTrack (see Figure 6.3(2)). After that, the participants completed quizzes testing their knowledge of the tools. After completing the quizzes, the participants were given the correct answers, so that they could become comfortable with the tools. Participants were also given opportunities to re-watch the tutorial videos. On average, it took them approximately one hour to complete the training.

▷ Tasks: During this phase, the participants were exposed to the experiment treatments. The order in which the traces were presented to the participants was consistent: starting with the normal WiredTiger trace (see Figure 6.3(3)), followed by the large WiredTiger trace (see Figure 6.3(4)), and ending with the large RocksDB trace (see Figure 6.3(5)).
As stated earlier, we exposed participants to either the treatment condition first or the control condition first for all traces.

We gave participants access to the training material throughout this phase and allowed them to take as much time as they liked to answer questions, only requiring them to finish within six hours, a generous time limit. We used the Qualtrics survey platform to automatically present the tools to participants, display the questions, record the participants' responses, and time how long it took them to answer the questions.

Participants took between one and three hours to finish the tasks with both NonSequitur and the other tools. Because of the large number of participants involved in our study and the length of the study, we did not observe the participants in person. Instead, we used Zoom to record the participants' screens, so we could later analyze how they used the tools. Participants were permitted to provide us with feedback on their experience with the user study if they wished.

6.7 Threats to Validity

The external threats to the validity of our user study concern the representativeness of the performance analysis tasks, participants, and traces selected for the experiment. To mitigate the threat of task representativeness, we designed our tasks based on a case study of how developers debug performance [15]. We mitigated the threat of participant representativeness by recruiting only professional computer engineers, or graduate students with at least five computer engineering courses or at least one school term of software development work experience. We addressed the threat of trace representativeness by using traces taken from WiredTiger and RocksDB, two software systems where we knew developers tended to spend a lot of time fixing performance issues.

One internal threat to our user study is bias in exposing participants to the treatment condition first or the control condition first for each execution trace. To mitigate this threat, we relied on the Qualtrics survey platform to randomly assign participants to either always be exposed to the treatment condition first or to the control condition first. We also relied on the Qualtrics survey platform to measure how long it took participants to complete the performance analysis tasks, thus mitigating the threat of bias in measuring time.

Another internal threat is potential bias in evaluating how well participants performed the performance analysis tasks. To address this, we created a rubric prior to conducting our user study, and we strictly used this rubric when grading the participant responses.

The tools we selected for the control condition also pose an internal threat. We provided the participants with Chrome TraceViewer for the small WiredTiger trace because it is representative of a typical trace visualization tool. We provided the participants with OpTrack to analyze the larger traces because it was designed to visualize large execution traces. We mitigated the threat that participants perform worse over time due to fatigue by letting participants complete the performance analysis tasks at their own pace.

A final internal threat is that participants used NonSequitur three times within the same session, whereas they used Chrome TraceViewer once and OpTrack twice. There is a potential carryover effect from participants growing accustomed to NonSequitur over the duration of the experiment and performing better with this tool during later trace analyses.
We believe the threat of this carryover effect is not a significant concern because we provided all participants with training on how to use Chrome TraceViewer and OpTrack. Furthermore, participants had access to these training materials while performing the performance analysis tasks.

Chapter 7

NonSequitur Evaluation Results

Figure 7.1 displays a scatter plot of difference scores. Note that we asked participants to perform two T3 tasks for the normal WiredTiger trace (referred to as First T3 and Second T3 in Figure 7.1).

Each difference score shown in Figure 7.1 was calculated by taking the score a participant obtained after completing a performance analysis task on an execution trace using NonSequitur and subtracting the score the same participant obtained using a different tool. Thus, a difference score greater than zero indicates the participant performed a performance analysis task on an execution trace more accurately with NonSequitur than with the other tool. Conversely, a difference score less than zero means the participant performed better using the other tool. A difference score of zero means the participant achieved the same score with both tools. A red line in Figure 7.1 indicates the horizontal position of dots representing difference scores of zero.

Figure 7.1: The difference scores from the NonSequitur evaluation.

Figure 7.2 depicts a scatter plot showing the number of minutes participants spent on each task. 'N' on the X axis stands for NonSequitur, while 'C' and 'O' stand for Chrome TraceViewer and OpTrack, respectively.

Figure 7.2: The number of minutes the participants spent on each performance analysis task.

We applied McNemar's test and the Wilcoxon signed-rank test to assess the statistical significance of the difference scores. McNemar's test, a non-parametric test for paired binary outcome data, was used to evaluate the difference scores from questions that were worth one point each. For questions worth multiple points, we used the Wilcoxon signed-rank test, another non-parametric test, which compares paired data to determine whether their population median ranks differ.

Statistical tests are methods for making decisions about hypotheses based on data from experiments or observations. These tests start with a null hypothesis that assumes no effect or no difference between groups, and an alternative hypothesis that suggests there is an effect or a difference. For McNemar's test, the null hypothesis is that there is no difference in the likelihood of a participant completing a performance analysis task correctly with NonSequitur versus other tools, with the alternative hypothesis asserting that these probabilities are different. For the Wilcoxon signed-rank test, the null hypothesis is that there is no median difference between the scores participants achieved using NonSequitur and those achieved with other tools for the same task. The implication of the null hypothesis being true is that NonSequitur does not increase the accuracy of task performance compared to other tools. The alternative hypothesis suggests that there is a median difference, indicating that the tool used by the participants influences how accurately they perform the performance analysis tasks.
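For illustration, the sketch below shows how such paired tests can be computed with standard Python libraries. The score arrays and counts are hypothetical, not data from our study:

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired scores for one multi-point task (max 3 points):
# each participant is scored once with NonSequitur and once with the
# control tool; the Wilcoxon signed-rank test compares the pairs.
nonsequitur = np.array([3, 2, 3, 3, 1, 3, 2, 3])
control     = np.array([1, 0, 0, 2, 0, 2, 1, 0])
print(wilcoxon(nonsequitur, control).pvalue)

# Hypothetical counts for a one-point task: a 2x2 table of
# (correct/incorrect with NonSequitur) x (correct/incorrect with control).
table = np.array([[18, 14],   # correct with NonSequitur
                  [ 2,  8]])  # incorrect with NonSequitur
print(mcnemar(table, exact=True).pvalue)
```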
From these statistical tests we obtain p-values, which quantify the evidence against the null hypothesis. A low p-value indicates that the results we observed in our user study are unlikely under the null hypothesis, suggesting that we can consider the alternative hypothesis. A high p-value, meanwhile, means there is not enough statistical evidence to reject the null hypothesis. Typically, p-values below 0.05 are considered low. Table 7.1 shows the p-value from each statistical test we performed in our user study.

Table 7.1: P-values of McNemar's test and the Wilcoxon signed-rank test.

Trace | Task | Max score | Stat test | P-Value
WT normal | T1 | 1 | McNemar | 0.182
WT normal | T2 | 3 | Wilcoxon | 0.883
WT normal | First T3 | 3 | Wilcoxon | 1
WT normal | Second T3 | 3 | Wilcoxon | 0.236
WT large | T1 | 1 | McNemar | 1.192×10⁻⁷
WT large | T2 | 3 | Wilcoxon | 9.537×10⁻⁷
WT large | T3 | 1 | McNemar | 0.248
RDB large | T1 | 1 | McNemar | 1.192×10⁻⁷
RDB large | T2 | 3 | Wilcoxon | 6.110×10⁻⁶
RDB large | T3 | 3 | Wilcoxon | 6.162×10⁻⁸

Both Figure 7.1 and Table 7.1 show the following:

▷ When conducting performance analysis tasks on the normal WiredTiger trace, which was a rather small trace that could be easily visualized with other tools, participants performed equally well with both NonSequitur and Chrome TraceViewer. The p-values from the statistical tests on the difference scores obtained for this trace were greater than 0.05, indicating no statistically significant evidence of a performance difference between using NonSequitur and Chrome TraceViewer. However, there were still several instances when participants performed the performance analysis tasks less accurately with NonSequitur. We discuss the reasons why in Section 7.5.

▷ When conducting performance analysis tasks T1 and T2 on the large WiredTiger trace, participants performed better with NonSequitur than with OpTrack. The p-values from the statistical tests on the T1 and T2 difference scores obtained for this trace were less than 0.05, signifying statistically significant evidence of a performance difference favoring NonSequitur. However, most of the difference scores for T3 on this trace were zero, meaning that participants generally achieved the same score with both NonSequitur and OpTrack in this instance.

▷ When conducting performance analysis tasks on the large RocksDB trace, participants performed better with NonSequitur than with OpTrack. All the p-values from the statistical tests on the difference scores obtained for this trace were less than 0.05, indicating statistically significant evidence of a performance difference between using NonSequitur and using OpTrack.

These data suggest the following conclusions: (1) Since users did equally well on the small trace, which can be effectively visualized with Chrome TraceViewer, we can conclude that the fact that RegTime drops information from the trace does not impede users from producing correct answers (RQ4). After all, Chrome TraceViewer does not drop any information from the trace, and our users did no worse with NonSequitur. (2) For large traces, NonSequitur is generally more effective at providing the needed information than OpTrack, the tool that splits the trace across many smaller visualizations (RQ1-RQ3).
At the same time, some tasks, such as T3 for WT large, may be completed equally well with a simpler tool.

To further substantiate these conclusions, we dig deeper into how the participants used the tools.

7.1 Performance with T1 Tasks

To perform task T1, participants needed to identify which threads were executing functions of interest and then observe when these threads executed those functions. From analyzing the Zoom video recordings, we saw that participants liked to use the search bar in NonSequitur to determine which threads were executing these functions. The participants also clicked on the functions in the legend to highlight them. Because RegTime summarized the behavior of threads with multiple RegTime expressions, participants could identify, at a high level, when threads executed functions. For example, participants could see whether a thread executed a function at the beginning, middle, or end of its lifetime.

Although participants were able to perform the T1 task for the normal WiredTiger trace using Chrome TraceViewer, we saw from our review of the Zoom video recordings that participants needed to make repeated use of Chrome TraceViewer's navigation features to complete T1. For example, participants would attempt to highlight functions of interest, only to find that calls to those functions were too small to be easily spotted despite being displayed in prominent colours. Participants then needed to zoom into the visualization to determine which threads executed these functions. Once zoomed in, they had to scroll horizontally along the timeline of the visualization to see when the threads executed the functions over time. The participants' need to constantly use the navigation features of Chrome TraceViewer explains why we see in Figure 7.2 that participants tended to spend more time using Chrome TraceViewer than NonSequitur.

On average, participants were three times more accurate performing the T1 task on the large WiredTiger trace with NonSequitur than with OpTrack. Participants were also on average 11 times more accurate performing the T1 task on the large RocksDB trace with NonSequitur than with OpTrack. Participants struggled to use OpTrack to perform T1 tasks on the large execution traces because OpTrack does not provide users with any navigation features beyond selecting which time intervals of the execution to display. Thus, participants found it difficult to use OpTrack to identify which threads called functions of interest or what threads were doing over time. Participants' strategies when using OpTrack were limited to a) randomly or methodically selecting time intervals to understand what threads were doing over time and b) studying the visualization of each time interval to determine whether a thread invoked a function of interest. Consequently, we see from Figure 7.2 that participants tended to spend more time using OpTrack than NonSequitur. The participants spent on average twice as much time using OpTrack to perform the T1 task on the large WiredTiger trace. The average amount of time the participants spent using OpTrack to perform the T1 task on the large RocksDB trace was 1.7 times longer than the average time spent using NonSequitur.

7.2 Performance with T2 Tasks

We observed from the Zoom video recordings that, when using NonSequitur, participants used the search bar and legend together to determine which threads executed outlier events.
Because RegTime leaves outlier events uncompressed, participants were also able to identify when threads executed these outlier events. On average, participants were 2.6 times more accurate performing the T2 task on the large WiredTiger trace with NonSequitur than with OpTrack. Participants were also on average twice as accurate performing the T2 task on the large RocksDB trace with NonSequitur than with OpTrack.

Figure 7.2 shows that participants generally spent more time performing T2 tasks with both Chrome TraceViewer and OpTrack than with NonSequitur. With Chrome TraceViewer, participants needed to spend time scrolling to find the threads that executed the outlier events. Meanwhile, participants using OpTrack needed to spend a lot of effort studying multiple time intervals to find outlier events. The participants spent on average twice as much time using OpTrack to perform the T2 task on the large WiredTiger trace. The average amount of time the participants spent using OpTrack to perform the T2 task on the large RocksDB trace was three times longer than the average time spent using NonSequitur.

7.3 Performance with T3 Tasks

When using NonSequitur, participants made extensive use of the linked highlighting to explore what threads were doing when the outlier events occurred. When the boundaries of the RegTime expressions lined up with the start and end of outlier events, as depicted in Figure 4.5 for the large RocksDB trace, the participants were on average able to perform the T3 task four times more accurately with NonSequitur than with OpTrack.

Figure 7.3: An example of when NonSequitur is not able to assist software developers with T3 tasks.

However, participants struggled to perform the T3 task if the RegTime Visual Encoding boundaries did not line up. For instance, Figure 7.3 shows what happened when participants attempted to use linked highlighting to perform the T3 task for the large WiredTiger trace. Participants were able to identify the outlier event, which was an unusually long invocation of the function __wt_cond_wait_signal in thread 18. However, they were unable to understand what thread 20 was doing when the unusually long invocation of __wt_cond_wait_signal occurred, because the RegTime Visual Encoding boundaries in thread 20 did not line up with the start and end of this outlier event.

Interestingly, the average time participants spent performing the T3 task with OpTrack was less than with NonSequitur. From reviewing the Zoom video recordings, we believe many participants simply gave up using OpTrack to perform this task. One participant told us after completing the study that "using OpTrack for this will be very time consuming and tedious."

Figure 7.4: The NonSequitur visualization of threads 2, 9, and 21 of the normal WiredTiger trace.

Figure 7.5: The NonSequitur visualization of threads 1, 9, 13, and 15 of the normal WiredTiger trace.

7.4 Impact of Performance Debugging Experience

We see from Figure 7.2 that the participants who were willing to spend the most time performing the performance analysis tasks tended to be WiredTiger developers and experienced performance debuggers.
We speculate that these participants were used to spending a lot of time debugging performance and were therefore more patient when completing the performance analysis tasks during our study.

7.5 Instances When Participants Performed More Poorly with NonSequitur

There were a few instances when participants performed the performance analysis tasks on the normal WiredTiger trace less accurately with NonSequitur than with Chrome TraceViewer. We examined these instances in closer detail to learn whether there are limitations in the design of NonSequitur that affected the participants' ability to perform the performance analysis tasks accurately. Below are the details of our examination:

Figure 7.6: The boundaries of the RegTime expressions in threads of the normal WiredTiger trace did not always align with the time intervals during which thread 21 called __wt_cond_wait_signal at a callstack depth of 0.

▷ T1 task: To evaluate whether a participant was able to perform the T1 task on the normal WiredTiger trace accurately, we asked participants the following question: which thread executes __evict_lru_walk throughout its lifetime but only executes __evict_page at the beginning and at the end of its lifetime? Thread 2 was the correct answer. However, one participant incorrectly identified thread 9, while another participant incorrectly identified thread 21. Figure 7.4 shows how threads 2, 9, and 21 appear in the NonSequitur visualization. In Figure 7.4, the invocations of __evict_lru_walk are colored grey and the invocations of __evict_page are colored red. Although thread 9 and thread 21 are not the correct answers, both of these threads did execute __evict_lru_walk and __evict_page. Furthermore, we reviewed the Zoom video recordings of the two participants performing this task and observed that they were able to use NonSequitur's legend and search bar to find all the threads that executed __evict_lru_walk and __evict_page, including thread 2. Combined with the fact that over 90% of the participants performed the T1 task accurately, we concluded that these two participants misinterpreted the question but could have correctly identified thread 2 had they read the question properly.

▷ T2 task: To evaluate whether a participant was able to perform the T2 task on the normal WiredTiger trace accurately, we asked participants the following question: which threads contain invocations of __wt_cond_wait_signal which last 10 seconds, and when do these invocations occur in the execution? The correct answer was that thread 9 and thread 15 make invocations of __wt_cond_wait_signal which last 10 seconds, and that these invocations occur throughout the execution (see Figure 7.5). Three of the six participants who received a difference score of less than zero for this task were able to correctly identify threads 9 and 15. However, these three participants stated that the invocations of __wt_cond_wait_signal lasting 10 seconds occurred at the beginning of the execution instead of throughout the execution.
From reviewing the Zoom video recordings of these participants, we concluded that they had difficulty distinguishing between the "beginning of the execution" and "throughout the execution." For example, one of the participants toggled back and forth between choosing the "beginning of the execution" and "throughout the execution" as the correct answer. Had we clarified within the question what the distinction between the "beginning of the execution" and "throughout the execution" was, we believe that these participants would have answered the question correctly.

Of the three other participants who did not correctly identify threads 9 and 15 as the only threads that made invocations of __wt_cond_wait_signal lasting 10 seconds:

– The first participant identified threads 1, 9, 13, and 15. Figure 7.5 shows the NonSequitur visualization of threads 1 and 13. Both of these threads exclusively made calls to __wt_cond_wait_signal. We concluded that this participant misinterpreted the question as asking which threads spent a total of 10 seconds in __wt_cond_wait_signal. Instead, we intended for participants to identify which threads made at least one invocation of __wt_cond_wait_signal that lasted 10 seconds.

– The second participant identified thread 9, but not thread 15. We reviewed the Zoom video recording of this participant performing this task and saw that the participant clicked on the function __wt_cond_wait_signal in the legend to highlight all of its invocations. Although NonSequitur highlighted the invocations of __wt_cond_wait_signal in thread 15, the participant did not examine thread 15 at all. Instead, the participant examined thread 9 almost exclusively. We believe that had the participant taken the time to examine thread 15, they would have discovered that it also made invocations of __wt_cond_wait_signal lasting 10 seconds.

– The third participant identified thread 15, but not thread 9. The Zoom video recording of this participant showed that they were able to use NonSequitur to discover that thread 9 and thread 15 made invocations of __wt_cond_wait_signal lasting 10 seconds. However, this participant ended up selecting only thread 15 as the correct answer, leading us to conclude that they entered the incorrect answer by mistake.

▷ First T3 task: To evaluate whether a participant was able to perform the first T3 task on the normal WiredTiger trace accurately, we asked participants the following question: during the time intervals when thread 24 is not executing any function, which thread spends most of its time in __wt_cond_wait_signal? This question requires participants to determine what threads were doing during the time intervals when thread 24 was idle. From analyzing the Zoom video recordings of participants who achieved a difference score of less than zero on this task, we noticed a specific behavior: these participants hovered their cursors over the time intervals on thread 24's timeline when thread 24 was inactive. They seemed to hope or expect that NonSequitur would highlight in green the activities of other threads during these time intervals. However, NonSequitur's functionality is limited to highlighting intervals in green only when users hover over function calls within the visualization, not inactive periods.
We discuss in Section 7.6 how we can modify NonSequitur to overcome this limitation.

▷ Second T3 task: To evaluate whether a participant was able to perform the second T3 task on the normal WiredTiger trace accurately, we asked participants the following question: during the time intervals when thread 21 calls __wt_cond_wait_signal at a callstack depth of 0, what leaf functions does thread 17 call? We also added a footnote to this question explaining that a leaf function is a function which is invoked at the top of the callstack. The reason why the participants who received a difference score of less than zero struggled to perform this task with NonSequitur is that the boundaries of the RegTime expressions in the threads did not always align with the time intervals during which thread 21 called __wt_cond_wait_signal at a callstack depth of 0 (see Figure 7.6). We discuss in Section 7.6 how we can modify NonSequitur to align the boundaries of RegTime expressions with function invocations existing outside of RegTime expressions.

7.6 Limitations of the NonSequitur Design

In this section we list the limitations of the design of NonSequitur and propose solutions to some of them:

▷ Alignment of RegTime expression boundaries: As shown in Figures 7.3 and 7.6, participants in our user study struggled to identify what threads were doing during the occurrence of an outlier event if the boundaries of a RegTime expression were not aligned with the start or end of the outlier event. Currently, RegTime creates RegTime expressions to represent the behaviour of a thread without considering the start and end of outlier events that may appear in other threads. To fix this limitation, we can process an execution trace before it is compressed by RegTime to identify the start and end of outlier events. We can then modify RegTime to consider the start and end of outlier events when creating RegTime expressions (see the sketch at the end of this section).

▷ Hovering the cursor over periods of inactivity on a thread's timeline: Currently, placing the cursor over time intervals in a thread's timeline when the thread was inactive does not cause NonSequitur to highlight in green the activities of other threads during those time intervals. As noted in Section 7.5, this limitation caused a few participants to struggle to perform the first T3 task with NonSequitur accurately. We can address this limitation by modifying NonSequitur to detect when a user is hovering over a time interval in a thread's timeline when the thread was inactive and to highlight the activities of other threads that occurred within that time interval.

▷ Loss of precise ordering of function calls: To compress execution traces, RegTime removes from the traces the precise ordering of function calls made by threads. There is no way to avoid this limitation without reducing NonSequitur's ability to visualize large multi-threaded execution traces. However, despite this limitation, our evaluation of NonSequitur demonstrated that participants were still able to use NonSequitur to understand what threads were doing over time.

▷ Inability to zoom down to individual function calls: NonSequitur currently does not support zooming into the visualization to identify individual function calls. Although we could modify NonSequitur to support a zoom feature, such a modification would likely require major changes to the current NonSequitur workflow. Investigating the merits of implementing this feature is a direction for future work.
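Returning to the first limitation above, a minimal sketch of the proposed preprocessing pass follows. It is our own illustration of the fix, not an implemented feature; `duration_thresh` echoes CallDurationThresh from Chapter 5:

```python
def outlier_cut_points(events, total_time, duration_thresh=0.01):
    """Collect the boundaries of outlier calls across all threads.

    `events` is a list of (thread_id, start, end) tuples. A call counts
    as an outlier if it exceeds `duration_thresh` of the total execution
    time. The returned timestamps could then be treated as boundaries
    that no RegTime expression may straddle, so that expression
    boundaries align with outlier events in other threads.
    """
    cuts = set()
    for _, start, end in events:
        if (end - start) / total_time >= duration_thresh:
            cuts.add(start)
            cuts.add(end)
    return sorted(cuts)
```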
7.7 Summary of Findings

We conclude that RegTime and NonSequitur are well suited for understanding what threads were doing over time (RQ1); detecting outlier events (RQ2); and understanding what threads were doing during the time when these events occurred (RQ3). For these performance analysis tasks, the fact that RegTime drops information from traces did not prevent users from completing the tasks successfully. Even for small traces, the users answered questions correctly and more quickly than with conventional tools (RQ4). For larger traces, participants were able to accomplish the tasks successfully more often with NonSequitur than with other tools, and took less time doing so.

However, we noted several limitations of NonSequitur that prevented some users from correctly performing the performance analysis tasks. In future work, we plan on addressing these limitations.

Chapter 8

Future Work

We propose several short-term and long-term goals for the continued development and improvement of NonSequitur.

8.1 Short Term Goals

1. Address limitations of the current NonSequitur design: In Section 7.6, we proposed several modifications to NonSequitur to address some of its limitations. We plan on implementing these modifications and evaluating whether they successfully address NonSequitur's shortcomings.

2. Explore ways of making the X axis linear while maintaining readability: As stated in Section 4.4, each rectangular glyph representing an outlier event or a function within a RegTime expression is designed to have a minimum width in pixels. We made this design choice to ensure that even the smallest glyphs are visible to users, but this approach also causes the X axis to become nonlinear. Therefore, we plan on investigating alternative ways of maintaining the readability of NonSequitur's visualizations without causing the X axis to become nonlinear.

8.2 Long Term Goals

1. Use NonSequitur to debug real-world performance issues: We plan on continuing to evaluate the effectiveness of NonSequitur by using it to debug real-world performance issues encountered in different types of applications.

2. Collaborate with the WiredTiger development team: Several WiredTiger developers are already aware of NonSequitur from having participated in our user study. We would like to explore getting the WiredTiger development team to use NonSequitur and provide us with feedback on how the tool can be improved.

3. Widespread adoption of NonSequitur: The ultimate goal is to have NonSequitur become widely adopted within the software industry. We would like to see NonSequitur become as popular as FlameGraphs amongst developers who are interested in understanding the performance of their multi-threaded applications. As with FlameGraphs, we envision continuing to make NonSequitur open source so that the software development community can customize it to suit their needs.

Chapter 9

Conclusion

Debugging performance issues in multi-threaded applications requires tools that scale to accommodate the large amounts of runtime information generated by such applications.

We described a trace compression algorithm called RegTime to assist software developers with investigating performance bugs in multi-threaded applications caused by outlier events.
This technique can achieve compression ratios exceeding 1,000, allowing developers to analyze execution traces consisting of millions of events.

We designed NonSequitur, a trace visualization tool that allows developers to analyze the compressed execution traces generated by RegTime. We carried out a structured user study with 42 participants. Our results demonstrate that with NonSequitur, participants performed some performance analysis tasks on large execution traces up to 11 times more accurately as compared with other tools. Additionally, for some performance analysis tasks, the participants spent on average three times longer with other tools than with NonSequitur.

Bibliography

[1] Chrome TraceViewer. https://github.com/catapult-project/catapult/blob/main/tracing/README.md. [Accessed 04-02-2024].

[2] OpTrack in WiredTiger. http://source.wiredtiger.com/10.0.0/tool-optrack.html. [Accessed 04-02-2024].

[3] Poor Man's Profiler. https://poormansprofiler.org.

[4] Windows Process Monitor. https://learn.microsoft.com/en-us/sysinternals/downloads/procmon. [Accessed 04-02-2024].

[5] XRay. https://research.google/pubs/xray-a-function-call-tracing-system/. [Accessed 04-02-2024].

[6] N. J. Abid, B. Sharif, N. Dragan, H. Alrasheed, and J. I. Maletic. Developer reading behavior while summarizing Java methods: Size and context matters. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 384–395. IEEE, 2019.

[7] L. Alawneh, A. Hamou-Lhadj, and J. Hassine. Segmenting large traces of inter-process communication with a focus on high performance computing systems. Journal of Systems and Software, 120:1–16, 2016.

[8] S. Alimadadi, A. Mesbah, and K. Pattabiraman. Inferring hierarchical motifs from execution traces. In Proceedings of the 40th International Conference on Software Engineering, pages 776–787, 2018.

[9] J. Aranda and G. Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. In Proceedings of the International Conference on Software Engineering (ICSE), 2009.

[10] J. Corbin and A. Strauss. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. Sage Publications, Inc., 3rd edition, 2008.

[11] B. Cornelissen and L. Moonen. On large execution traces and trace abstraction techniques. Software Engineering Research Group, Delft, 2008.

[12] B. Cornelissen, A. Zaidman, D. Holten, L. Moonen, A. Van Deursen, and J. J. Van Wijk. Execution trace analysis through massive sequence and circular bundle views. Journal of Systems and Software, 81(12):2252–2268, 2008.

[13] R. Dautriche, R. Blanch, A. Termier, and M. Santana. TraceViz: A visualization framework for interactive analysis of execution traces. In Actes de la 28ième conférence francophone sur l'Interaction Homme-Machine, pages 115–125, 2016.

[14] W. De Pauw and S. Heisig. Zinsight: A visual and analytic environment for exploring large event traces. In Proceedings of the 5th International Symposium on Software Visualization, pages 143–152, 2010.

[15] A. Fedorova, C. Mustard, I. Beschastnikh, J. Rubin, A. Wong, S. Miucin, and L. Ye. Performance comprehension at WiredTiger. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 83–94, 2018.

[16] Y. Feng, K. Dreef, J. A. Jones, and A. van Deursen.
Hierarchical abstraction of execution traces for program comprehension. In Proceedings of the 26th Conference on Program Comprehension, pages 86–96, 2018.

[17] B. Glaser and A. Strauss. The Discovery of Grounded Theory: Strategies for Qualitative Research. Observations (Chicago, Ill.). Aldine de Gruyter, 1967.

[18] A. Hamou-Lhadj and T. C. Lethbridge. Compression techniques to simplify the analysis of large execution traces. In Proceedings 10th International Workshop on Program Comprehension, pages 159–168. IEEE, 2002.

[19] K. E. Isaacs, A. Giménez, I. Jusufi, T. Gamblin, A. Bhatele, M. Schulz, B. Hamann, and P.-T. Bremer. State of the art of performance visualization. EuroVis (STARs), 2014.

[20] M. Jovic, A. Adamoli, and M. Hauswirth. Catch me if you can: Performance bug detection in the wild. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, pages 155–170, 2011.

[21] B. Karran, J. Trümper, and J. Döllner. SyncTrace: Visual thread-interplay analysis. In 2013 First IEEE Working Conference on Software Visualization (VISSOFT), pages 1–10. IEEE, 2013.

[22] R. Krishna, M. S. Iqbal, M. A. Javidian, B. Ray, and P. Jamshidi. CADET: Debugging and fixing misconfigurations using counterfactual reasoning. arXiv preprint arXiv:2010.06061, 2020.

[23] Y. Liu, C. Xu, and S.-C. Cheung. Characterizing and detecting performance bugs for smartphone applications. In Proceedings of the 36th International Conference on Software Engineering, pages 1013–1024, 2014.

[24] D. Lo, S.-C. Khoo, and C. Liu. Mining temporal rules from program execution traces. Proc. of Int. Work. on Program Comprehension through Dynamic Analysis, 2007.

[25] M. Cahill. A Technical Introduction to WiredTiger. https://www.slideshare.net/mongodb/mongo-db-wiredtigerwebinar.

[26] MongoDB. MongoDB Acquires WiredTiger Inc. https://www.mongodb.com/press/wired-tiger.

[27] A. Nistor, T. Jiang, and L. Tan. Discovering, reporting, and fixing performance bugs. In 2013 10th Working Conference on Mining Software Repositories (MSR), pages 237–246. IEEE, 2013.

[28] M. J. Pacione, M. Roper, and M. Wood. A novel software visualisation model to support software comprehension. In 11th Working Conference on Reverse Engineering, pages 70–79. IEEE, 2004.

[29] H. Pirzadeh, A. Agarwal, and A. Hamou-Lhadj. An approach for detecting execution phases of a system for the purpose of program comprehension. In 2010 Eighth ACIS International Conference on Software Engineering Research, Management and Applications, pages 207–214. IEEE, 2010.

[30] A. B. Sánchez, P. Delgado-Pérez, I. Medina-Bulo, and S. Segura. TANDEM: A taxonomy and a dataset of real-world performance bugs. IEEE Access, 8:107214–107228, 2020.

[31] J. Trümper, J. Bohnet, and J. Döllner. Understanding complex multithreaded software systems by using trace visualization. In Proceedings of the 5th International Symposium on Software Visualization, pages 133–142, 2010.

[32] J. Trümper, J. Döllner, and A. Telea. Multiscale visual comparison of execution traces. In 2013 21st International Conference on Program Comprehension (ICPC), pages 53–62. IEEE, 2013.

[33] N. Walkinshaw, S. Afshan, and P. McMinn. Using compression algorithms to support the comprehension of program traces. In Proceedings of the Eighth International Workshop on Dynamic Analysis, pages 8–13, 2010.

[34] WiredTiger.
WiredTiger GitHub. https://github.com/wiredtiger/wiredtiger.

[35] WiredTiger. WiredTiger Jira. https://jira.mongodb.org/projects/WT/issues.

[36] H. Xu, Q. Wang, S. Song, L. K. John, and X. Liu. Can we trust profiling results? Understanding and fixing the inaccuracy in modern profilers. In Proceedings of the ACM International Conference on Supercomputing, pages 284–295, 2019.

[37] S. Zaman, B. Adams, and A. E. Hassan. A qualitative study on performance bugs. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pages 199–208. IEEE, 2012.

Appendix A

Supporting Materials

Below are the questions we used to evaluate how accurately our user study participants performed the performance analysis tasks. Because our participants analyzed the same execution traces twice, we needed to prevent them from being able to trivially complete a performance analysis task with one tool because they had previously completed the same task with another tool. Our solution was to randomize the thread IDs between the trace visualizations in the treatment and control conditions. For example, thread 1 in the NonSequitur visualization of the normal WiredTiger trace could be equivalent to thread 5 in the Chrome TraceViewer visualization of the same trace. This approach allowed us to either a) ask very similar, but non-identical, questions for a task, or b) ask the exact same question for a task but have different correct answers depending on the visualization tool used.

Evaluating the accuracy of the T1 task with the normal WiredTiger trace:

With NonSequitur
Q: Which thread executes __evict_lru_walk throughout its lifetime but only executes __evict_page at the beginning and at the end of its lifetime?
A: Thread 2

With Chrome TraceViewer
Q: Which thread executes __evict_lru_walk throughout its lifetime but only executes __evict_page at the beginning and at the end of its lifetime?
A: Thread 9

Evaluating the accuracy of the T2 task with the normal WiredTiger trace:

With NonSequitur
Q: Which threads contain invocations of __wt_cond_wait_signal which last 10 seconds? When do these invocations occur in the execution?
A: Thread 9 and Thread 15 make invocations of __wt_cond_wait_signal which last 10 seconds. These invocations occur throughout the execution.

With Chrome TraceViewer
Q: Which threads contain invocations of __wt_cond_wait_signal which last 10 seconds? When do these invocations occur in the execution?
A: Thread 13 and Thread 15 make invocations of __wt_cond_wait_signal which last 10 seconds. These invocations occur throughout the execution.

Evaluating the accuracy of the first T3 task with the normal WiredTiger trace:

With NonSequitur
Q: During the time intervals when thread 24 is not executing any function, which thread spends most of its time in __wt_cond_wait_signal?
A: Thread 2

With Chrome TraceViewer
Q: During the time intervals when thread 1 is not executing any function, which thread spends most of its time in __wt_cond_wait_signal?
A: Thread 9

Evaluating the accuracy of the second T3 task with the normal WiredTiger trace:

With NonSequitur
Q: During the time intervals when thread 21 calls __wt_cond_wait_signal at a callstack depth of 0, what leaf functions does thread 17 call? A leaf function is a function which is invoked at the top of the callstack.
A: __curfile_search and __wt_cursor_set_keyv

With Chrome TraceViewer
Q: During the time intervals when thread 10 calls __wt_cond_wait_signal at a callstack depth of 0, what leaf functions does thread 18 call?
A leaf function is a function which is invoked at the top of the callstack.
A: __curfile_search and __wt_cursor_set_keyv

Evaluating the accuracy of the T1 task with the large WiredTiger trace:

With NonSequitur
Q: There is a thread which performs the activities below in the following order:
▷ The thread makes an invocation to __wt_cond_wait_signal.
▷ The thread then repeatedly invokes __evict_lru_pages.
Which thread matches the description above?
A: Thread 13

With OpTrack
Q: There is a thread which performs the activities below in the following order:
▷ The thread makes an invocation to __wt_cond_wait_signal.
▷ The thread then repeatedly invokes __evict_lru_pages.
Which thread matches the description above?
A: Thread 17

Evaluating the accuracy of the T2 task with the large WiredTiger trace:

With NonSequitur
Q: Which threads contain invocations of __wt_cond_wait_signal which exceed 10 seconds? When do these invocations occur in the execution?
A: Thread 18 and Thread 23 make invocations of __wt_cond_wait_signal which exceed 10 seconds. These invocations occur at the beginning of these threads' lifetimes.

With OpTrack
Q: Which threads contain invocations of __wt_cond_wait_signal which exceed 10 seconds? When do these invocations occur in the execution?
A: Thread 20 and Thread 21 make invocations of __wt_cond_wait_signal which exceed 10 seconds. These invocations occur at the beginning of these threads' lifetimes.

Evaluating the accuracy of the T3 task with the large WiredTiger trace:

With NonSequitur
Q: Thread 24 calls some function at the top of its callstack throughout its lifetime but calls this function more during the time interval when thread 18 makes unusually long invocations (lasting at least 10 seconds) to __wt_cond_wait_signal. Name the function called at the top of thread 24's callstack.
A: __wt_block_read_off

With OpTrack
Q: Thread 25 calls some function at the top of its callstack throughout its lifetime but calls this function more during the time interval when thread 21 makes unusually long invocations (lasting at least 10 seconds) to __wt_cond_wait_signal. Name the function called at the top of thread 25's callstack.
A: __wt_block_read_off

Evaluating the accuracy of the T1 task with the large RocksDB trace:

With NonSequitur
Q: Which thread calls rocksdb::ThreadPoolImpl::Impl::BGThread and then is idle for a significant portion of time before repeatedly calling rocksdb::DBImpl::BGWorkFlush?
A: Thread 22

With OpTrack
Q: Which thread calls rocksdb::ThreadPoolImpl::Impl::BGThread and then is idle for a significant portion of time before repeatedly calling rocksdb::DBImpl::BGWorkFlush?
A: Thread 5

Evaluating the accuracy of the T2 task with the large RocksDB trace:

With NonSequitur
Q: Which of the threads below contain invocations of rocksdb::WriteThread::LinkOne which exceed 500 seconds? When do these invocations occur in the execution?
A: Thread 1 and Thread 13 make invocations of rocksdb::WriteThread::LinkOne which exceed 500 seconds. These invocations occur at the beginning of the execution.

With OpTrack
Q: Which of the threads below contain invocations of rocksdb::WriteThread::LinkOne which exceed 500 seconds? When do these invocations occur in the execution?
A: Thread 23 and Thread 31 make invocations of rocksdb::WriteThread::LinkOne which exceed 500 seconds. These invocations occur at the beginning of the execution.

Evaluating the accuracy of the T3 task with the large RocksDB trace:

With NonSequitur
Q: Thread 11 calls rocksdb::DBImpl::BGWorkFlush but is then idle for a significant period of time before calling rocksdb::DBImpl::BGWorkFlush again.
Let us refer to the time interval when thread 11 is idle as T_IDLE. For thread 4, thread 18, and thread 22, name the leaf function that dominates the thread's execution time during T_IDLE. A leaf function is a function which is invoked at the top of the callstack.
A:
▷ Thread 4 - rocksdb::InstrumentedCondVar::Wait
▷ Thread 18 - rocksdb::WriteThread::LinkOne
▷ Thread 22 - rocksdb::ThreadPoolImpl::Impl::BGThread

With OpTrack
Q: Thread 4 calls rocksdb::DBImpl::BGWorkFlush but is then idle for a significant period of time before calling rocksdb::DBImpl::BGWorkFlush again. Let us refer to the time interval when thread 4 is idle as T_IDLE. For thread 5, thread 23, and thread 32, name the leaf function that dominates the thread's execution time during T_IDLE. A leaf function is a function which is invoked at the top of the callstack.
A:
▷ Thread 5 - rocksdb::ThreadPoolImpl::Impl::BGThread
▷ Thread 23 - rocksdb::WriteThread::LinkOne
▷ Thread 32 - rocksdb::InstrumentedCondVar::Wait