Composable and Reusable Whole-System Offline Dynamic Analysis

by

Geoffrey Lefebvre

B.A.Sc., L'Université de Sherbrooke, 1997
M.A.Sc., The University of British Columbia, 2002

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

January 2012

© Geoffrey Lefebvre 2012

Abstract

This thesis describes the design, implementation, and evaluation of a dynamic program analysis framework called Tralfamadore. Three key aspects differentiate Tralfamadore from traditional dynamic analysis frameworks. First, analyses execute offline based on detailed CPU traces. This approach enables multiple analyses on the same execution, and analyses not foreseen at the time of execution. Using detailed traces also decouples analysis from instrumentation, simplifying the design of analyses and of the framework. Second, Tralfamadore supports the analysis of operating system execution. Third, the architecture of the analysis engine promotes the construction of analyses that are composable and reusable. New analyses can be written by mostly reusing existing components. Although these aspects have been investigated in isolation in the past, Tralfamadore is the first dynamic analysis framework to address all three at once.

Preface

This thesis describes and evaluates a dynamic analysis framework called Tralfamadore. An earlier version of this framework was presented in part in a paper that appeared at the EuroSys 2009 conference [50] and in an unpublished manuscript titled Execution Mining [49]. I was the primary author on both papers.

The majority of the work described in this thesis is mine. In particular, I proposed the idea of working with traces and structuring a dynamic analysis framework as a dataflow engine, allowing analyses to be constructed in a modular fashion. The description and evaluation of Tralfamadore in this thesis is much more thorough and expanded than in the previous papers. The thesis also describes a new version of Tralfamadore, and for these reasons much of the writing is new. Some sections of the thesis draw from the aforementioned papers [50, 49]. Some of the writing in Sections 4.3, 4.4, 4.5 and 4.7 is taken from the Execution Mining manuscript. Sections 5.2 and 5.3 are taken from the EuroSys paper but Subsections 5.3.3 and 5.3.4 are new. Sections 5.4.1, 5.5, and 6.6 are taken from the Execution Mining manuscript.

I did most of the design, implementation, and evaluation of Tralfamadore presented in this thesis, including the trace collection facility (Chapter 3), the analysis engine (Chapter 4), and the heap slicing analysis (Section 5.4). The performance evaluation is new, although I reused some of the benchmarks from the Execution Mining manuscript. I also had collaborators from my committee and the NSS lab. Most notably, Brendan Cully created a web frontend, which was used to produce the screenshots in Section 5.3. He also added support for logging network traffic, implemented the original version of the function index, and built the stack depth analysis described in Section 5.5. Christopher Head implemented the memory index described in Section 4.8. Norm Hutchinson and Dutch Meyer implemented some of the optimizations in the trace collection facility such as the double buffering and the dedicated writer thread. The current implementation of the trace parser is based on code originally written by Norm Hutchinson.
List of Publications

Tralfamadore: Unifying Source Code and Execution Experience (short paper). Appeared in the proceedings of EuroSys 2009. Published by the ACM.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Problem Statement
  1.2 Tralfamadore
  1.3 Thesis Statement
  1.4 Thesis Organization
2 Overview of Tralfamadore
  2.1 Motivation
    2.1.1 Offline Analysis
    2.1.2 Dataflow Model
    2.1.3 Interactive Results
    2.1.4 Kernel Support
  2.2 Overall Framework Architecture
    2.2.1 Capture and Analysis Phases
    2.2.2 Indexing Phase
    2.2.3 Recording Phase
  2.3 Implementation Details
  2.4 Deterministic VM Record-Replay
  2.5 Advantages of Working with Traces
  2.6 Conclusion
3 Trace Collection
  3.1 Introduction
  3.2 Requirements
    3.2.1 Comprehensiveness
    3.2.2 Transparency
    3.2.3 Completeness
  3.3 Execution Capture in Tralfamadore
    3.3.1 Overview of QEMU
    3.3.2 Trace Capture with QEMU
    3.3.3 The Execution Trace
    3.3.4 Virtual to Physical Mappings
    3.3.5 The Instruction Log
    3.3.6 Trace Sample
    3.3.7 Selective Trace Capture
    3.3.8 Limitations
  3.4 Evaluation
  3.5 Conclusion
4 The Analysis Engine
  4.1 Introduction
  4.2 Motivation
  4.3 The Tralfamadore Analysis Engine
  4.4 Annotations
  4.5 Operators
  4.6 Caching, Indexing and Striding
    4.6.1 Caching
    4.6.2 Indexing
    4.6.3 Striding
  4.7 Operator Library
    4.7.1 The Trace Parser Operator
    4.7.2 The Context Operator
    4.7.3 Invocations
    4.7.4 Allocations
    4.7.5 Other Operators
  4.8 Examining State
  4.9 Optimizations
    4.9.1 C Parser
    4.9.2 Integer Boxing
    4.9.3 Polymorphic Code
  4.10 Evaluation
    4.10.1 Scanning Analyses
    4.10.2 Indexed Analyses
  4.11 Conclusion
5 Applications
  5.1 Introduction
  5.2 Understanding Execution
  5.3 Control Flow Annotations
    5.3.1 Control Flow Indirection
    5.3.2 Path Analysis
    5.3.3 Control Flow Extraction
    5.3.4 Control Flow Extraction Performance
  5.4 Heap Slicing
    5.4.1 An Example: Linux Networking
    5.4.2 Heap Slicing in Detail
    5.4.3 Heap Slicing Performance
  5.5 Retroactive Assertions
  5.6 Conclusion
6 Related Work
  6.1 Introduction
  6.2 Dynamic Binary Analysis Frameworks
  6.3 Deterministic Record-Replay
    6.3.1 Hardware Support for Multi-Processor Replay
    6.3.2 Other Approaches to Record-Replay
  6.4 Collecting Analysis Specific Traces
  6.5 Offline and Whole-System Analyses
  6.6 Querying Execution
  6.7 Execution Indexing and Simplification
  6.8 Lightweight Kernel Instrumentation
  6.9 Conclusion
7 Conclusion
Bibliography

List of Tables

3.1 Mean execution time and standard deviation for each configuration. The third column provides the CPU utilization for Native and the overhead compared to Native for the other configurations.
3.2 Trace size, native CPU time, and trace generation rate.
3.3 Tracing cost per instruction (bytes).
4.1 Some of the current operators available.
6.1 A comparison of dynamic binary analysis frameworks.

List of Figures

2.1 Gradual reconstruction of execution semantics.
2.2 Operator configuration.
2.3 Tralfamadore architecture.
3.1 QEMU micro-operation modified to log a register update.
3.2 Lookup by translation ID.
3.3 Trace sample.
4.1 Dataflow engine.
4.2 Pipeline example — allocation indexer.
4.3 Tree example — extracting argument values.
4.4 Operator interface signature.
4.5 Simple breakpoint operator.
4.6 Missing annotation due to trace skipping.
4.7 Using a function index to inspect a register.
4.8 Striding.
4.9 AT&T assembly vs DynamoRIO.
4.10 Fabricated return statement.
4.11 Relative overhead of the null analysis.
4.12 Relative overhead of the memory trace analysis.
4.13 Normalized overhead to get argument values (wget).
4.14 Normalized overhead to get argument values (kernel).
4.15 Normalized overhead to get argument values (postmark).
5.1 The six callers to mutex_lock and the absence of lock contention.
5.2 Following an indirect call in sysenter_entry.
5.3 Processing 3 packet types in eth_type_trans.
5.4 Control flow extraction.
5.5 Relative overhead and time to extract control flow (wget).
5.6 Relative overhead and time to extract control flow (kernel).
5.7 Relative overhead and time to extract control flow (postmark).
5.8 Packet delivery via type-specific function pointer from the protocol table.
5.9 Top/bottom half network processing.
5.10 ARP packet lifetime through the network stack.
5.11 TCP packet receive.
5.12 Heap slicing configuration.
5.13 HeapTrace operator.
5.14 Relative overhead and time to extract heap slices (wget).
5.15 Relative overhead and time to extract heap slices (kernel).
5.16 Relative overhead and time to extract heap slices (postmark).
5.17 Fraction of trace examined (wget).
5.18 Fraction of trace examined (kernel).
5.19 Fraction of trace examined (postmark).
5.20 Ownership violation example.
5.21 Distribution of stack sizes.
5.22 Maximum stack depth seen (kernel compile).

Acknowledgments

First, I would like to thank my supervisors, Andrew Warfield and Mike Feeley, for their patience, support, and valuable insights. In particular, I thank Mike for showing me the importance of clear and concise writing, and Andrew, for always putting things in perspective and showing me the importance of a positive attitude and enthusiasm for my work. Andrew's efficiency and dedication to getting things done is something I can only hope to one day match. I would also like to thank Norm Hutchinson for being on my supervisory committee. He was very generous with his time and a source of sound advice. The members of my examining committee, Bill Aiello, Sathish Gopalakrishnan, and David Lie provided insightful comments, greatly improving the quality of this thesis.

Second, I would like to thank the members of the Networks, Systems, and Security lab, especially Brendan Cully and Dutch Meyer for being excellent lab mates, collaborators, and friends. Brendan did a lot of work on the Tralfamadore project and I am heavily indebted to him. I have learned a lot from working with them both and I hope that we will get to collaborate again in the future. I would also like to thank Jim Chow and Rolf Neugebauer for being great mentors during my internships at VMware and Intel Research. It has been a pleasure to work under their guidance.

I would also like to thank my family for being supportive throughout my entire journey as a graduate student, and my friends for taking my mind off work through climbing, ski trips, and the odd concert. Last, but not least, I would like to thank my partner, Karyn Moffatt, who has stuck with me all along. She often did more than her share so that I could focus on my work, and she has surely proofread more drafts of this thesis than anyone else.

Dedication

To my parents.

Chapter 1
Introduction

Debugging and understanding large and complex programs have always been and remain challenging tasks even for the most experienced programmer. This is especially true for operating systems; the kernel of a modern operating system such as Linux or Windows consists of millions of lines of C code. This code is typically written prioritizing efficiency over readability and is often poorly documented.

Operating systems code executes directly on the hardware, and in a highly concurrent and asynchronous manner. Much of this complexity is directly exposed to the developer who must explicitly deal with interrupts and the intricacies of virtual memory. To make things worse, the manifestation of a software error often results in a catastrophic failure, taking the entire system down and leaving little information to explain the cause of the failure.

To assist in the task of understanding and debugging code, developers use many forms of dynamic program analysis ranging from crude techniques such as adding printf() statements, or running the program in a debugger, to using a binary instrumentation framework such as Pin [52] or Valgrind [63]. From a methodology point of view, the process of instrumenting a program can be viewed as formulating a question about the runtime behaviour of the program. Executing the instrumented program generates information which may help in answering this question. This information may lead to refinement of the initial query, or an entirely new hypothesis.
This process is therefore cyclical in nature.

1.1 Problem Statement

The traditional approach to understanding execution just outlined has many problems. Tools such as debuggers and dynamic binary instrumentation frameworks are online tools. They are "stuck in time",1 meaning that they can only examine a program as it runs. All analyses must either be anticipated prior to running the program, or be performed on different executions of the program. Due to non-determinism or changes in the state of the system, the same analysis may produce different results from one run of the program to the next. This limitation is problematic because, as mentioned above, the task of understanding and debugging a program is cyclical in nature. A related issue is the problem known as the "observer" effect, where adding instrumentation may affect the runtime behaviour of the program. Further, the costs to set up and execute the program, and to recompile it in the case of instrumentation added statically, make the online approach highly inefficient.

1 As opposed to Kurt Vonnegut's Tralfamadorians [83], who can see in four dimensions and are "un-stuck in time."

Doing dynamic program analysis on an executing binary inherently introduces a semantic gap [19]. On one side, the CPU executes a stream of instructions which update the state of the processor and memory, and on the other side, developers expect to be able to reason about a running program at the level of source code. This semantic gap makes developing new analyses challenging. An analysis writer needs to understand many low-level aspects of execution, such as how arguments are passed to functions, or how memory is allocated from the heap. This gap widens with operating system code, which does not provide standardized and stable programming interfaces such as POSIX or the C standard library.

The effort to write a new analysis can be substantial and can require a fair amount of domain expertise. Unfortunately, existing dynamic analysis frameworks force analysis writers to constantly reinvent the wheel; low-level aspects of execution must be re-learned for every new analysis. A good example of this problem is Valgrind [63], which includes two different race detectors, and to which engineers at Google recently added a third one.2 Although all three detectors differ in the algorithm used to identify races, they each reimplement instrumentation to extract the same information (lock acquisition/release, accesses to heap memory). The required instrumentation is orthogonal to the analysis used and could be shared across implementations if Valgrind made it easy to do so.

2 http://code.google.com/p/data-race-test/wiki/ThreadSanitizer

1.2 Tralfamadore

This thesis describes the design and implementation of a new dynamic analysis framework called Tralfamadore, which aims to address the problems described in the previous section. This framework is based on recording and analyzing detailed execution traces. Tralfamadore analyses execute offline, driven solely by information contained in or derived from previously recorded traces. This approach makes the execution of analyses deterministic, guaranteeing consistency across runs, and allowing analyses not foreseen at the time of the original execution. Tralfamadore also enables the design of analyses that can freely navigate forward and backward in time. This kind of analysis is very difficult to implement in a traditional online manner.
The design of the Tralfamadore analysis engine is centered around a dataflow model, similar to a streaming database: execution traces are treated as streams, and simple modules, called operators, enhance them by inserting new records that correspond to higher-level events such as function calls or heap allocations. This design provides a gradual reconstruction of execution semantics, from a low-level CPU trace to the level of source code. It also makes analysis reusability and composability first-class idioms, and explores the idea that dynamic analyses should be built by combining simple reusable components.

Another important aspect of Tralfamadore is that it is designed to analyze the execution of operating system kernel code. Most existing dynamic analysis frameworks are unable to analyze kernel execution. Although recent research efforts such as PinOS [16] or Aftersight [23] have sought to address this issue, none of these tools are publicly available. Looking at kernel code brings additional semantic challenges, such as demultiplexing control flow, not present when analyzing user-level code.

1.3 Thesis Statement

This thesis demonstrates that capturing and analyzing the execution of an operating system kernel using detailed traces is both practical and beneficial, and that structuring an analysis framework as a dataflow engine is a suitable approach to bridge the semantic gap created by using CPU-level traces.

The thesis statement can be dissected into three claims: first, that it is practical to capture the execution of an operating system kernel using detailed traces. Second, that the traces contain sufficient information to allow any analysis supported by an online binary analysis framework. Third, that structuring the core of a dynamic analysis framework as a dataflow engine provides a gradual reconstruction of execution semantics and allows users to write analyses that are composable and reusable.

This thesis makes the following contributions:

1. Tralfamadore, the first whole-system offline dynamic analysis framework with a programmable interface.
2. The design of a dynamic analysis framework structured as a dataflow engine and the demonstration of its benefits.
3. A technique to demultiplex the control flow of kernel execution. This demultiplexing allows the reconstruction of control flow at the source code level.
4. A new analysis called heap slicing, designed to help developers understand complex data-driven subsystems such as a networking stack or a file system.

1.4 Thesis Organization

The remainder of this thesis is organized as follows: Chapter 2 describes the overall architecture of Tralfamadore. It presents and motivates the important design decisions, including the decision to use detailed traces to drive analyses. The chapter also gives an overview of the important components of Tralfamadore and their role in the overall workflow. Chapter 3 describes the traces consumed by Tralfamadore analyses and how these traces are recorded. It demonstrates that the traces recorded with Tralfamadore are complete, meaning that they contain sufficient information to support the same analyses as an online dynamic binary instrumentation framework. The chapter also evaluates the cost of collecting detailed traces, and demonstrates that, given today's large and cheap disks, it is feasible to capture the execution of an operating system kernel for common test workloads.
Chapter 4 describes the Tralfamadore analysis engine. The chapter motivates the use of a dataflow model to structure the analysis engine, and describes the second contribution of the thesis, which is the design of the analysis engine and how it provides a gradual reconstruction of execution semantics from CPU-level traces to the level of source code. This gradual reconstruction promotes the design of analyses that are composable and reusable. The chapter also describes some of the core operators, including the operator designed to reconstruct the control flow of kernel execution, which is the third contribution of this thesis. Chapter 5 describes some of the analyses that have been built using Tralfamadore, including heap slicing, the fourth contribution of this thesis. Chapter 6 provides a survey of deterministic record-replay techniques and compares Tralfamadore against existing dynamic analysis frameworks, validating the first contribution of this thesis. Finally, Chapter 7 concludes by providing a summary of the thesis and its contributions.

Chapter 2
Overview of Tralfamadore

Tralfamadore is an offline dynamic analysis framework based on detailed CPU-level traces. Contrary to online and replay-based approaches, which execute or re-execute the code being analyzed, Tralfamadore analyses are entirely driven from previously generated traces; running analyses involves no re-execution at all. This chapter describes the key design decisions that led to this approach and shaped the overall architecture of the framework. The chapter also provides an overview of the key components and how they each fit into the overall framework. Finally, the chapter motivates the use of detailed execution traces, and provides an overview of deterministic record-replay, a technique that could be used to generate traces with substantially lower overhead.

2.1 Motivation

The dynamic analysis framework described in this thesis seeks to address problems with traditional dynamic binary instrumentation (DBI) frameworks. These problems, which are described in Section 1.1, can be summarized as follows:

• Traditional DBIs execute analyses online. Instrumentation and analyses are inlined and execute along with the program or system being analyzed. This approach makes it hard to get consistent results across runs.
• Most DBIs are unable to analyze operating system code. This is unfortunate because kernel code is some of the most challenging code to understand and debug.
• Existing frameworks force analysis writers to constantly re-learn low-level aspects of execution.

Although some of these issues have been investigated in isolation (mostly the first and second problem), the framework described in this thesis is the first to address all three at once. Addressing these problems led to four important design decisions:

• Execute analyses offline.
• Structure the design of the analysis engine based on a dataflow model.
• Provide interactive results.
• Support the analysis of operating system kernel execution.

2.1.1 Offline Analysis

In his novel Slaughterhouse-Five [83], Kurt Vonnegut describes a planet called Tralfamadore, whose inhabitants exist at all points in time simultaneously, meaning that they have access to the past, present, and future at once. Tralfamadore, the framework,3 seeks to reproduce this ability by decoupling analysis from execution.

3 Everywhere else in this thesis, the name Tralfamadore refers solely to the analysis framework and not the planet, unless explicitly stated.
As opposed to online dynamic analysis frameworks, such as Pin [52] or Valgrind [63], that analyze a program as it executes, Tralfamadore is an offline framework. The execution of the system being analyzed is first recorded and stored persistently, and all analyses are performed afterward on this static data set. To summarize, Tralfamadore analyses executions as opposed to analyzing programs as they execute.

Running dynamic analyses offline has multiple advantages, such as enabling new analyses not foreseen at the time of recording. Analyses execute off a static data set instead of being tethered to the target program's execution, making their execution deterministic. This approach allows for repeated analysis over the same execution, providing consistency between each run, an important asset for developers using dynamic analysis in the cyclical process of understanding or debugging complex pieces of code such as an operating system kernel. Another advantage is that analyses are no longer "stuck in time" and can freely navigate execution by arbitrarily going back in time or jumping ahead. Analyses are also easier to debug; their deterministic behaviour simplifies the task of validating whether an error in the analysis has been fixed. An offline approach also helps alleviate the "observer" effect. Although recording may affect the behaviour of the system, once an interesting behaviour has been captured, this behaviour is guaranteed to manifest for all analyses regardless of their overhead.

Doing offline dynamic analysis also has drawbacks. First, the power of the analyses is bounded by the information contained in the trace. Second, an offline approach can only be used for post-mortem analysis. As an example, it cannot be used to provide lifeguards [20, 73] that can stop buffer overflows and intrusions.

2.1.2 Dataflow Model

As mentioned in Section 1.1, doing dynamic analysis at the binary level introduces a semantic gap. Writing new analyses requires a considerable amount of effort and expertise due to the need to reconstruct execution semantics from CPU-level traces. An important motivation behind the design of Tralfamadore is that once effort has been invested in understanding a low-level aspect of execution, it should be easy to capture this effort in a reusable component, and to compose new analyses from such components.

The design of the Tralfamadore analysis engine is based on a dataflow model similar to streaming databases and network packet processing engines such as Click [42] and Bro [67]. In such an architecture, data is processed by single-purpose components, which are called operators in Tralfamadore.4 These operators are connected as a Directed Acyclic Graph (DAG), usually a pipeline or a tree, where each operator consumes a stream of records from one or more upstream operators and produces a stream of records as well. Data traverses the graph of operators, flowing from the leaves to the root.

4 Streaming databases such as Aurora [18] also use the term operator, Click [42] refers to these components as elements, and traditional DBMSs use the term iterator.

Tralfamadore analyses execute by parsing the trace into a stream of records called annotations, corresponding to the execution of machine-level instructions and the occurrence of events such as interrupts or page faults.
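To make the dataflow idea concrete, the following OCaml sketch models operators as stream transformers. It is illustrative only: the annotation constructors, record fields, and the push-style interface are assumptions made for this example and are not the actual Tralfamadore types.

(* A toy model of the dataflow engine: every operator receives a way to
   emit annotations downstream and returns the function used to push
   annotations into it.  Names and fields are illustrative only. *)

type annotation =
  | BBlk   of { pc : int64; n_insns : int }      (* raw basic block    *)
  | Invoke of { target : int64 }                 (* function call      *)
  | Return of { ret_to : int64 }                 (* function return    *)
  | Alloc  of { addr : int64; size : int }       (* heap allocation    *)

type operator = (annotation -> unit) -> (annotation -> unit)

(* Pass raw annotations through unchanged, and insert an Invoke record
   whenever a basic block starts at a known function entry point. *)
let invoke_op (entry_points : (int64, string) Hashtbl.t) : operator =
  fun emit ann ->
    emit ann;
    match ann with
    | BBlk { pc; _ } when Hashtbl.mem entry_points pc ->
        emit (Invoke { target = pc })
    | _ -> ()

(* Chain operators onto a sink: pipeline [a; b] sink pushes each record
   through a, then b, then the sink. *)
let pipeline (ops : operator list) (sink : annotation -> unit) :
    annotation -> unit =
  List.fold_right (fun op downstream -> op downstream) ops sink

A downstream operator that cares about function calls can then match on Invoke annotations without knowing how they were produced, which is exactly the reuse property the design aims for.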
The stream is passed through an analysis-specific configuration of operators that augment the stream with progressively more semantically rich annotations. These high-level annotations allow complex analyses to be expressed in compact and readable code.

Figure 2.1 illustrates how operators gradually reconstruct execution semantics when streaming a CPU-level trace through the pipeline configuration of operators shown in Figure 2.2. The left hand side of Figure 2.1 shows an excerpt of raw trace containing basic block (BBlk) annotations. These annotations are generated by the Trace Parser operator. The stream passes through the Invoke operator, which augments the stream with Invoke and Return annotations corresponding to function calls and return statements. A third operator (Alloc) uses the Invoke annotations to identify calls to kmalloc(), and further augments the stream with heap allocation annotations. The Alloc annotation shown on the right hand side indicates the allocation of a block of 1024 bytes at address 0xdfcc5800. Subsequent operators interested in heap allocations do not need to be concerned with identifying calls to kmalloc(), but can simply use the Alloc annotation directly without knowing how it was generated.

Figure 2.1: Gradual reconstruction of execution semantics. (The original figure shows an excerpt of raw BBlk trace records around a call to __kmalloc, with the Invoke, Return, and Alloc annotations inserted alongside by the operators.)

Figure 2.2: Operator configuration. (A trace streamed through the Trace Parser, Invoke, and Alloc operators into a sink, with markers for the annotations inserted by the Invoke and Alloc operators.)

In addition to gradually bridging the semantic gap, the dataflow architecture encourages the design of analyses that are composable and reusable. Composable means that analyses can be easily structured as a combination of simple single-purpose operators. This design also makes it easy to reuse these operators across multiple analyses, saving developers from the burden of constantly having to re-understand low-level aspects of execution.

2.1.3 Interactive Results

Another goal of Tralfamadore is to provide interactive results so that developers can get answers as quickly as possible. Tralfamadore provides mechanisms to generate trace indices that subsequent analyses can use to read only the relevant portions of a trace. Many analyses end up reading a small fraction (< 1%) of the trace. The dataflow model is also well suited to produce results in an online manner.
Analyses can return partial results as soon as they are available. This is advantageous for analyses that take a long time to complete because they need to read a substantial fraction of the trace. All the analyses described in this thesis can return partial results within seconds, although some of them take much longer to complete.

2.1.4 Kernel Support

As mentioned in the introduction, operating system code is some of the most complex and challenging code to understand and debug. As members of a systems research lab, my colleagues and I spend numerous hours instrumenting, compiling, rebooting and running operating system code, such as the Linux kernel or the Xen hypervisor [11], to understand or debug strange runtime behaviours. Most of the work on dynamic program analysis has focused on user-space code, and the tools available to kernel developers, such as printk() and debuggers, are pretty crude. Tralfamadore is the first offline dynamic analysis framework that supports the analysis of kernel code and provides a programmable interface.

2.2 Overall Framework Architecture

The Tralfamadore analysis framework consists of multiple components, the two most important ones being the trace capture facility and the analysis engine. Figure 2.3 provides an overview of the framework and shows the main components of the system.

2.2.1 Capture and Analysis Phases

Tralfamadore operates in two distinct phases: trace capture and trace analysis. In the capture phase, the target system executes in a virtual machine environment that has been modified to record execution traces. The trace capture implementation is based on QEMU [12], an open-source full-system x86 emulator. Using virtualization provides the ability to capture the execution of an entire virtual machine (OS and applications) without modification to the guest system being recorded. The traces are very detailed, providing the ability to reconstruct the state (memory and registers) of the system at the granularity of every instruction.

Figure 2.3: Tralfamadore architecture. (The figure shows the recording, capture, and analysis stages: a modified QEMU writes a trace of CPU instructions and events; traces are persisted in a repository alongside source, debug info, and indices; the analysis engine passes operators over the trace stream, may write indices back to disk to speed future analysis, and streams results to interactive tools.)

Once a trace has been produced, it may be shared among a number of users and analysis engines to answer a wide variety of questions about the target system's behaviour, such as what are the distinct control-flow paths through a function, or is some function's argument ever null? Users execute analyses by creating analysis engine instances configured with a specific analysis and targeted at a selected trace. Because traces are simply plain files, an analysis engine instance is just a normal unprivileged process reading a trace file. There is no need to instantiate a virtual machine containing the system being analyzed. Tralfamadore analyses are composed from a progression of operators designed to enhance a CPU-level trace with new annotations corresponding to higher-level events. This resulting stream is consumed by a sink operator which analyzes the stream to produce the desired analysis results.
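As an illustration of what such a sink might look like, the OCaml fragment below tallies heap allocations by size from a stream of annotations. The annotation type and the aggregation it performs are invented for this sketch; they are not taken from the Tralfamadore source.

(* Sketch of a sink operator: consume the enriched annotation stream and
   tally heap allocations by size.  The annotation type is a stand-in. *)

type annotation =
  | Alloc of { addr : int64; size : int }
  | Other                                  (* anything else; ignored here *)

let alloc_histogram () =
  let counts : (int, int) Hashtbl.t = Hashtbl.create 64 in
  let sink = function
    | Alloc { size; _ } ->
        let n = try Hashtbl.find counts size with Not_found -> 0 in
        Hashtbl.replace counts size (n + 1)
    | Other -> ()
  in
  let report () =
    Hashtbl.iter
      (fun size n -> Printf.printf "%8d bytes: %d allocations\n" size n)
      counts
  in
  (sink, report)

In this sketch, report can be invoked at any point during the scan, which is one simple way to surface partial results while an analysis is still running.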
The sink operator can produce the analysis results in an online manner, pushing them in a streaming fashion to a presentation layer and providing users with partial results as soon as they are available.

2.2.2 Indexing Phase

Performing a scan of an entire trace can be very slow. Because the traces are often very large (> 100 GB), and consist of very small variable-length records (∼50-100 bytes), they are expensive to parse. To reduce this overhead, Tralfamadore analyses can use indices to examine only the relevant portions of the trace. As an example, indices are used to efficiently extract the value of a function argument by revisiting only the points in time where the function is called. They are also used to identify the slices of the trace during which a particular thread was running. Indexers are ordinary Tralfamadore analyses where the sink operator writes the generated indices to disk based on the annotations received from upstream operators. Indices are usually generated in an indexing phase that occurs after the trace capture phase but prior to executing analyses that use them. Section 4.6 in Chapter 4 provides more details on how indices are generated and used.

2.2.3 Recording Phase

The traces generated with Tralfamadore are very detailed, resulting in a lot of information being generated and written to disk. For this reason, capturing execution with Tralfamadore can cause a 50-100× slowdown, and affect the behaviour of the system being recorded.

Previous work has shown that deterministic virtual machine (VM) record-replay can capture the complete execution of a single-processor VM with overhead often below 5% and replay this execution deterministically to generate a detailed trace [87]. Tralfamadore could leverage this technique to prepend a recording phase and decouple trace generation from recording execution. The distortion that tracing overhead may have on execution is then greatly reduced because of the low overhead of recording and the faithfulness of deterministic replay. The system will re-execute exactly as it did when recorded, regardless of the overhead imposed from generating traces.

Our lab has begun to implement a recording phase for Tralfamadore. The lab now has a working prototype of deterministic record-replay for QEMU which can generate traces during replay. Other students have also worked on record-replay implementations for Xen [11] and KVM.5 The final goal is to record execution in Xen or KVM, and replay the recordings in QEMU, as done by Aftersight [23], to generate traces.

5 http://www.linux-kvm.org/

2.3 Implementation Details

The two principal components of Tralfamadore are the trace capture facility and the analysis engine itself. The trace capture facility was implemented by modifying QEMU [12], a popular open-source x86 emulator. The modifications are based on version 0.9.1 of QEMU. Approximately 2200 lines of C code were modified or added to QEMU to support trace generation. The current implementation of the Tralfamadore analysis engine consists of approximately 5000 lines of C code and 12000 lines of OCaml code. The C language is used to implement performance-critical code such as the trace parser and the instruction decoder. All of the operators and analyses are written in OCaml.

The implementation of Tralfamadore described in this thesis is in fact the third implementation. Two earlier implementations were also built as part of this thesis research.
The first implementation was an early prototype that took advantage of the branch trace store (BTS) feature that has been available on Intel processors since the Pentium 4 [2], allowing the generation of a continuous log of all taken branches. This information alone is insufficient to perform most dynamic analyses. The basic task of reconstructing the control flow of kernel execution requires identifying threads that are being context switched in and out, which is difficult to do unambiguously with only control-flow information. Also, additional mechanisms are needed to record the instructions being executed and to handle self-modifying code. It is also worth mentioning that although BTS is a hardware feature, it still imposes a 20-30× slowdown on the system, before adding the extensions required to log instructions.

The second implementation used the same trace capture facility as the current implementation, but the analysis engine was solely implemented in OCaml and was much slower. The current implementation of Tralfamadore is a complete rewrite. Performance-critical components have been reimplemented in C, and the use of expensive OCaml language features such as polymorphic variants and generic functions has been minimized. Many analyses execute between one and two orders of magnitude faster in the current implementation of Tralfamadore. Sections 4.7.1 and 4.9 describe some of the interesting performance optimizations present in the current implementation.

2.4 Deterministic VM Record-Replay

This section presents an overview of deterministic virtual machine (VM) record-replay. Although this technique is complementary to the work described in this thesis, it is relevant and worth describing in more detail.

Deterministic record and replay has been implemented at many different levels of abstraction. It has been implemented at the virtual machine level [14, 28, 87], process level [79, 34, 74, 43], JVM level [22], and at the interpreter level to record the execution of scripts [56]. The idea behind record and replay is to log the non-determinism, such as external inputs or the thread schedule, and reinject it at exactly the same time during replay, providing a deterministic re-execution.

Deterministic VM record-replay, the technique most relevant to this thesis, records execution below the operating system at the virtualized hardware level. The advantage of recording at such a low level is that the recordings are comprehensive; the execution of an entire virtual machine, including the operating system and all the applications, is recorded. The fidelity of replay is extremely high. During replay, the guest virtual processor re-executes the same sequence of instructions and the guest register and memory state are identical after every instruction.

To record execution at the hardware level, the information that needs to be logged is low-level, architecture specific, and falls into two categories: instructions that are non-deterministic, and asynchronous events such as interrupts and direct memory access (DMA). Non-deterministic instructions include instructions that produce results that are time or processor specific, such as rdtsc and cpuid on x86, and instructions that read from hardware ports or memory-mapped IO.
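As a toy illustration of this logging-and-reinjection idea (not code from any of the systems cited above), the OCaml sketch below records the result of a single non-deterministic operation, a timestamp-counter read, and feeds the logged value back during replay.

(* Toy model of deterministic record-replay for one source of
   non-determinism (a timestamp-counter read).  During recording the live
   value is logged; during replay the logged value is returned instead,
   so the replayed execution sees exactly the same results. *)

type mode =
  | Record of int64 Queue.t    (* values appended as they occur     *)
  | Replay of int64 Queue.t    (* values consumed in the same order *)

let emulate_rdtsc (mode : mode) ~(read_hw_tsc : unit -> int64) : int64 =
  match mode with
  | Record log ->
      let v = read_hw_tsc () in
      Queue.add v log;           (* remember the non-deterministic result *)
      v
  | Replay log ->
      Queue.pop log              (* reinject the recorded result *)

(* A recording followed by a faithful replay of three reads. *)
let () =
  let log = Queue.create () in
  let counter = ref 0L in
  let read_hw_tsc () = counter := Int64.add !counter 7L; !counter in
  let recorded = [ emulate_rdtsc (Record log) ~read_hw_tsc;
                   emulate_rdtsc (Record log) ~read_hw_tsc;
                   emulate_rdtsc (Record log) ~read_hw_tsc ] in
  let replayed = [ emulate_rdtsc (Replay log) ~read_hw_tsc;
                   emulate_rdtsc (Replay log) ~read_hw_tsc;
                   emulate_rdtsc (Replay log) ~read_hw_tsc ] in
  assert (recorded = replayed)

Real systems apply the same pattern to every non-deterministic input and, as described next, must also timestamp asynchronous events so that they land at exactly the same instruction during replay.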
Unlike non-deterministic instructions, which are part of the instruction stream being executed and will be reached naturally as the system executes, asynchronous events such as interrupts or DMA must be precisely timestamped so as to be reinjected at the exact same point in the instruction stream during replay. Even an off-by-one instruction error could result in divergence, causing replay to fail. Precise timestamps can be obtained with negligible overhead by using hardware performance counters to measure the number of branches or instructions executed. The ReVirt paper [28] provides more in-depth details on how to use performance counters to obtain precise timestamps on x86.

As mentioned in the previous section, recording execution at the VM level has very low overhead for a uni-processor virtual machine. As reported in [87] and [23], VMware's implementation of record-replay has overhead as low as 0.7% and an average overhead of 5% for SPEC benchmarks. VM record-replay also generates small logs. The ReVirt paper [28] reports a log generation rate of less than 1.5 GB per day of recording for the SPECweb99 benchmark, and for a kernel compile over NFS benchmark. These benchmarks are network intensive and represent worst-case scenarios because all incoming network traffic is recorded. Benchmarks with little network traffic generated much smaller logs (< 100 MB per day).

Unfortunately, recording the execution of multi-processor systems is substantially more expensive, especially on systems that exhibit a high level of shared-memory communication. The outcome of memory races needs to be recorded, which is difficult to do efficiently without hardware support. Recording execution on multi-processor systems is an active area of research in both the systems community and the architecture community. Chapter 6 gives an overview of the work that has been done in this area.

2.5 Advantages of Working with Traces

This section motivates the design decision of using a trace-based approach where analyses run completely offline without any re-execution. This approach differs from other offline approaches based on replay, where the instrumentation is inlined (as with online dynamic binary instrumentation (DBI) frameworks such as Pin [52] and Valgrind [63]) while the program re-executes [23, 24, 13, 66]. The two main advantages of a trace-based approach are that it decouples instrumentation from analysis, and that it simplifies the design and implementation of the framework.

Using traces decouples instrumentation from analysis. Analysis writers do not have to worry about how to instrument the system they are analyzing. Although frameworks such as Pin [52] and Nirvana [13] have sought to make the task of instrumenting easier, it still remains a challenging task for many analyses [63]. With Tralfamadore, the burden of adding instrumentation to collect traces needs to be paid only once, and this task can be performed by a domain expert.

The execution of inlined instrumentation and analysis along with the code being analyzed can also be problematic. With most DBIs [52, 16, 63, 15], the analysis executes in the same address space as the program being analyzed, potentially resulting in address space clashes. Analysis writers must use clever tricks [62] to minimize their memory footprint and reduce the chance of conflicts. This problem is not limited to online tools; replay-based variants of online DBIs such as PinPlay [66], or replaying in Valgrind [24], also suffer from this issue.
For the same reason, using libraries is substantially more complicated and prone to conflicts.

Another advantage of using traces is that it simplifies the design of the framework. Tralfamadore analyses are regular unprivileged processes, and the traces they read are simply regular files. Analyses are free to use any libraries, and can be implemented in any language. All analyses, and most of Tralfamadore itself, are implemented in OCaml, a garbage-collected, statically typed language with functional, imperative and object-oriented language features. Many analyses make use of third-party libraries to produce graphs in Dot format, or to write JSON-formatted output. These libraries can be linked directly into an analysis process.

One of the reasons Tralfamadore is based on traces instead of replay is very pragmatic: when this project started, there were no working and publicly available implementations of VM record-replay. Although the technique is well understood in the research community (and at VMware), getting an implementation of record-replay to work is challenging,6 and would have involved a large time investment to reimplement work that had already been published [23].

6 Personal conversations with Jim Chow and George Dunlap.

Nevertheless, using traces provides benefits by trading overhead for simplicity and ease of use, and as demonstrated in the next chapter, this overhead is acceptable when compared with DBI frameworks such as Valgrind, and given the low price and capacity of modern disks.

The main contributions of this thesis are about reconstructing execution semantics from CPU-level recordings, and many of the contributions and insights from this thesis would also apply to an implementation of the Tralfamadore analysis engine based on replay. Many of the operators and analyses that have been built, and that are described in Chapters 4 and 5, would work in an identical manner.

2.6 Conclusion

This chapter provided an overview of the Tralfamadore dynamic analysis framework. It presented the key design decisions that shaped its overall architecture. The reader should take away that Tralfamadore supports the dynamic analysis of operating system code, that analyses execute offline based on detailed CPU traces, and that the Tralfamadore analysis engine is structured as a dataflow engine to provide a gradual reconstruction of execution semantics.

This chapter also provided a big-picture view of the framework and its different phases of operation. This overview should help the reader situate the core components of Tralfamadore in the overall workflow. These core components are the trace capture facility, described in the next chapter, and the analysis engine, described in Chapter 4.

Chapter 3
Trace Collection

3.1 Introduction

This chapter describes the design and implementation of the Tralfamadore trace capture facility. As described in Chapter 2, this facility is used to record the execution of entire systems (OS and applications) by capturing detailed CPU-level traces. These traces are stored to disk, where they are used to drive offline dynamic analyses. These analyses are executed by the Tralfamadore analysis engine described in the next chapter. Tralfamadore seeks to record the execution of unmodified operating system code, and to give analyses the same ability to observe execution as a dynamic binary instrumentation framework such as Pin [52] or Valgrind [63]. Section 3.2 establishes the requirements of a tracing facility that achieves these goals.
These requirements influence the environment used to record execution and the information that needs to be recorded in a trace. Section 3.3 describes the current implementation of the Tralfamadore trace capture facility and the trace format, and argues that they meet the requirements. Section 3.4 evaluates the performance of the trace capture facility. Although the focus has been on providing complete traces and the current implementation is not designed with performance as a first goal, the evaluation demonstrates that the overheads to capture traces of kernel execution for common benchmark workloads are reasonable.

3.2 Requirements

An important goal of Tralfamadore is to be able to run arbitrary dynamic analyses on unmodified operating system code. This goal establishes three requirements for the Tralfamadore tracing facility and the traces it generates. The tracing facility should provide the ability to record execution comprehensively, it should capture traces transparently, and the recorded traces should be complete. Although the terms comprehensiveness and completeness are closely related, they are used to express two distinct properties of the traces being recorded. Comprehensiveness refers to the scope at which execution is recorded, i.e., which parts (processes, device drivers, etc.) of the system are recorded. Completeness, on the other hand, refers to the level and amount of information being recorded, i.e., how much detail to capture about every instruction executed.

The next three sections describe these three requirements in more detail. The first two requirements dictate how traces should be recorded, and the third requirement dictates what should be recorded. Section 3.3 describes the Tralfamadore trace capture facility and the trace format, and how they meet these requirements.

3.2.1 Comprehensiveness

The Tralfamadore trace capture facility should be able to capture execution comprehensively, meaning that it should be able to record the execution of an entire system, including the execution of its operating system kernel and of all its applications. Traces should be captured at the hardware level, underneath the operating system, to ensure complete visibility of the system's execution.

3.2.2 Transparency

Another goal of Tralfamadore is to be able to capture the execution of unmodified systems. This approach requires collecting traces transparently, without explicitly changing the binaries. Further, being transparent requires being able to handle dynamically generated and self-modifying code without the need to be explicitly notified by the system being recorded.

3.2.3 Completeness

Completeness defines the information that needs to be captured in a trace. A complete trace should contain sufficient details to represent the captured execution without ambiguity at a given level of abstraction. Tralfamadore operates at the same level of abstraction as dynamic binary instrumentation (DBI) frameworks, which embed analysis and instrumentation code directly within the program being analyzed. This code executes inlined with the target program and has access to the register state and the content of memory on every instruction boundary. Inlined instrumentation also has access to the instruction stream being executed or an equivalent intermediate representation (IR) [63]. Tralfamadore operates differently than these tools by decoupling instrumentation from analysis.
The execution of the target system is recorded by collecting a detailed trace, and analyses execute offline on the recorded execution. A challenge with this approach is to capture sufficient information to provide the level of completeness offered by online DBI frameworks. To provide the same power, the trace must explicitly include, or provide sufficient information to extrapolate, the following:

1. The state of registers and memory when tracing starts.
2. Updates to registers and memory for every instruction executed.
3. Implicit updates to registers and memory by the CPU or devices.
4. Mappings of virtual to physical pages.
5. A log of all instructions executed.

The first two items provide the register and memory state before the first instruction, and the ability to update this state after every instruction. The processor or devices may also implicitly update the register and memory state. As an example, when an interrupt or exception is triggered, the x86 processor automatically saves the value of the program counter on the stack and loads new segment selectors. These are visible updates to the state of the system and must be included in the trace. The third item is also a good example of some of the challenges brought by capturing operating system execution. User-level applications are shielded by the operating system from the complexity of handling interrupts, and this kind of state change is invisible from a user-level perspective. The current implementation of the trace capture facility does not capture all implicit updates. Direct memory access (DMA) and implicit updates to page tables by the memory management unit (MMU) are currently not recorded. Section 3.3.8 provides more information on these limitations and how they are currently handled.

The trace must also include the virtual to physical mappings for analyses to be able to properly rebuild and examine the content of memory. Writes to memory are virtually addressed but update physical memory. Because a physical page can be mapped multiple times at different virtual addresses, a single write can update multiple virtually addressed locations. This information needs to be properly propagated to correctly rebuild the content of memory. Using the virtual to physical translations, analyses can reconstruct the state of physical memory and then access its content through those same translations.

The log of executed instructions (e.g. mov (%eax), %edx) could be reconstructed from the content of memory, so it is possible to provide the same power as a traditional online approach without including this information in the trace. Existing Tralfamadore analyses use the log of instructions extensively, and providing this information explicitly, without having to reconstruct the memory state, simplifies the design and improves the performance of analyses substantially.

3.3 Execution Capture in Tralfamadore

This section describes the implementation of the trace capture facility and the format of the traces. Tralfamadore relies on system virtualization and software emulation to meet the first two requirements (comprehensiveness and transparency) established in the previous section.

The traces are captured by executing the target system inside a virtual machine running on a whole-system emulator called QEMU [12]. By using a whole-system emulator, Tralfamadore can capture the execution of an entire system, including all of the applications and the operating system kernel.
Using an x86 software emulator also provides transparency because it is possible to modify the behaviour of each instruction to log its side effects. Modifying the emulator effectively adds a detailed tracing facility to the hardware, an approach that has been proposed by the processor architecture community [20].

3.3.1 Overview of QEMU

QEMU is a freely available open source emulator that supports whole-system emulation on x86. QEMU is able to run an unmodified operating system and all of its applications inside a virtual machine. QEMU is faster than traditional emulators, such as BOCHS [46], because it uses binary translation to emulate guest instructions. QEMU emulates a guest processor by translating each instruction into a sequence of RISC-like micro-operations, one basic block at a time. The micro-operations are fed into a code generator which produces native code that executes directly on the host. To improve performance, QEMU also implements many optimizations found in other dynamic binary translation frameworks [52, 15, 78], such as translation caching and basic block chaining.

QEMU emulates virtual memory using a software MMU. The guest physical memory is represented as a contiguous array in the host address space, and each memory access goes through a software translation procedure, walking the guest page table to determine the virtual to physical page mapping. A guest physical address is translated into a host virtual address by adding to it the base address of the array representing physical memory. To improve performance, QEMU provides a software translation look-aside buffer (TLB) to cache translations. This TLB behaves similarly to the x86 hardware implementation and is invalidated by the guest in the same way.

3.3.2 Trace Capture with QEMU

As described in Section 3.2.3, the traces must contain five items to meet the completeness requirement. First, the Tralfamadore trace capture facility can be configured to take a virtual machine snapshot when tracing starts, capturing the complete register and memory state, and providing item 1. In addition to the snapshot, a Tralfamadore trace consists of three components: the execution trace, the virtual to physical mappings, and the instruction log. The largest component by far is the execution trace, which contains the side-effects of every instruction, interrupt, and exception. It contains sufficient information to provide items 2 and 3, minus the caveats previously mentioned for item 3 and described in more detail in Section 3.3.8. The virtual to physical mappings and the instruction log provide items 4 and 5 respectively.

3.3.3 The Execution Trace

To record the execution trace, straightforward additions were made to the QEMU micro-operations that modify emulated guest state, so that they log their updates as they execute. Figure 3.1 shows a QEMU micro-operation modified to log a register update. This specific micro-operation decrements the content of the ecx register by one. The second statement logs the updated value of ecx. All updates are written to a shared memory buffer which is flushed to disk periodically by a dedicated thread. Double buffering is used to avoid stalling execution as much as possible.

void OPPROTO op_decl_ECX(void)
{
    ECX = (uint32_t)(ECX - 1);
    tralf_log_register(R_ECX, ECX);
}

Figure 3.1: QEMU micro-operation modified to log a register update.

The execution trace consists of a sequence of variable-length records. Currently the trace format defines four types of records: BBlk, Event, PState, and TSS.
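Conceptually, these four record kinds can be pictured as a small variant type. The sketch below is purely illustrative; the field names are assumptions and do not correspond to the actual binary, variable-length layout described in the rest of this section.

```ocaml
(* A minimal, illustrative model of the four execution-trace record kinds.
   Field names are assumptions for exposition; the real format is binary
   and variable length. *)
type side_effect =
  | Reg_update of string * int32          (* register name, new value *)
  | Mem_store of int32 * int32            (* virtual address, value stored *)
  | Mem_load of int32 * int32             (* virtual address, value loaded *)

type record =
  | BBlk of { pc : int32;                 (* first instruction of the block *)
              n_instrs : int;             (* instructions actually executed *)
              translation_id : int;       (* key into the instruction log *)
              effects : side_effect list }
  | Event of { vector : int; error_code : int; pc : int32;
               effects : side_effect list }
  | PState of { registers : (string * int32) list }  (* register checkpoint *)
  | TSS of { esp0 : int32 }               (* kernel stack base, among other fields *)
```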
BBlk records are by far the most frequent records in a trace and represent the execution of a QEMU translation, which is at most (and almost always) one basic block of guest instructions. A BBlk record contains the virtual address of the first instruction of the basic block and the number of instructions actually executed, which can be less than the number of instructions in the translation, since the translation may have only partially executed due to a fault or interrupt. The record also contains the side effect of each instruction, and a unique translation ID which is used to look up the instructions in the instruction log.

Event records represent the occurrence of an interrupt or an exception. The record contains the interrupt vector, error code, and the value of the program counter saved on the stack by the processor. The record also contains updates to the state of the system automatically done by the processor, such as saving execution state on the stack and loading new segment selectors and program counter.

PState records are used to capture register state checkpoints and appear at the beginning of the trace or after a gap in the recording, such as on a transfer from user to kernel space when only recording kernel execution as described in Section 3.3.7.

Finally, TSS records are very infrequent and usually only appear once at the beginning of the trace. This record describes the content of the Task State Segment (TSS), an in-memory data structure pointed to by the TSS register. Section 4.7.2 in the next chapter describes the context operator in detail and explains how this operator uses the TSS record to detect the occurrence of context switches.

3.3.4 Virtual to Physical Mappings

Tralfamadore logs all insertions into QEMU's software TLB to capture the mappings between virtual and physical addresses. Logging TLB insertions is sufficient and necessary to properly track the mappings between virtual and physical pages, which is needed to properly reconstruct the memory state of the guest. This approach is sufficient because every memory access first goes through the TLB. A hit means that the current mapping is being used to do a virtual to physical translation. A TLB miss results in a page table walk and a new mapping being inserted into the TLB and logged, regardless of whether the miss was due to a capacity miss or an explicit guest invalidation. As long as the TLB is flushed at the start of recording, logging TLB insertions guarantees that the mappings necessary to rebuild the memory state by "applying" recorded loads and stores will be present in the trace. The correct mapping at time t for a page is the most recent TLB entry for that page that is older than t (TLB insertions always happen before their corresponding memory access).

It is necessary to use the mappings from the TLB because guest page tables and the TLB can get out of sync.7 Using the guest page table to determine a physical address could result in using a different mapping during analyses than the one used during the original execution.

Supporting the querying of memory at arbitrary points in time requires logging TLB invalidations as well. If a TLB invalidation is present between the last TLB insertion and the lookup at time t, the guest page table active at time t must be walked to get the mapping that would have been in effect at that time during the original execution, had the page been accessed.
Such a query can therefore fail if no mapping for the virtual address exists in the page table, just as the same memory access would have generated a page fault in the guest if it had occurred during execution. TLB invalidations in the trace can be identified by looking for updates to the cr3 control register or the execution of an explicit TLB invalidation instruction (invlpg).

7 The PaX (http://pax.grsecurity.net) NX emulation relies on this behaviour.

3.3.5 The Instruction Log

The third trace component is a log of every instruction executed. This information is not embedded in the execution trace but recorded in a separate instruction log. Every time QEMU translates a new sequence of guest instructions, the input to the translator is recorded in the instruction log. The intuition behind this approach is the same as using a code cache. Most programs exhibit some form of locality and end up re-executing the same code many times. The execution of a translation might use different inputs and produce different output every time, but the sequence of instructions executed remains the same. We have observed over two orders of magnitude difference in growth rate between the execution trace and the instruction log.

Logging the instructions in a separate log requires a mapping between the instruction log and the execution trace. This mapping must handle the fact that QEMU uses a physically indexed translation cache to reuse translations corresponding to the same guest code but mapped at more than one virtual address or in more than one address space (such as kernel code and shared libraries). In addition, virtual to physical mappings are dynamic, and changes must be tracked to properly map the program counter value, which is a virtual address, to its corresponding physical address. A mapping between the instruction log and the execution trace must also handle changes to physical memory such as self-modifying code and physical memory updates (such as DMA of code from disk).

The need to track virtual to physical mappings can be eliminated by using the physical address directly to identify guest code, emitting the physical address of the translation in the execution trace in addition to its virtual address. The second requirement imposes issuing some form of version number in both the instruction log and the execution trace whenever physical memory containing code changes. Mapping the execution of a translation to the original guest code then requires a two-level lookup. The first lookup determines the version of physical memory at the time of execution, and the second lookup fetches the guest code from that version of memory based on the physical address.

To avoid this two-level lookup and to eliminate the need to track virtual to physical mappings, Tralfamadore uses a unique monotonically increasing ID to label each translation. This ID is used to label the original guest code logged in the instruction log and is issued in the execution trace whenever the translation is executed.8 Mapping the execution of a translation in the trace to the corresponding guest code is a simple lookup using the ID as a key. Because the ID space is not sparse, this table can be implemented as an array, providing O(1) lookups. A naive implementation of this mapping table would contain many repeated definitions of the same translation due to capacity misses in the QEMU code cache. This redundancy can be easily eliminated by adding one level of indirection, keeping all unique definitions in a content-indexed table. The lookup table maps translation IDs to pointers into this definition table. The lookup into the content-indexed table only happens at startup, when both tables are built. The startup time could be decreased by saving the table to disk instead of rebuilding it every time.

8 Chronicle [4] uses a similar approach to map basic blocks' execution with their corresponding Valgrind translation.
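To make the two-table scheme concrete, the following sketch shows one possible shape for the mapping, assuming, purely for illustration, that a translation definition is just its recorded guest code bytes.

```ocaml
(* Illustrative sketch of the translation-ID mapping, built at startup.
   "definition" here stands for the guest code recorded in the instruction
   log; the concrete representation is an assumption for this example. *)
type definition = string            (* raw bytes of the translated guest code *)

type instruction_log = {
  unique_defs : (definition, definition) Hashtbl.t;  (* content-indexed table *)
  by_id : definition array;                          (* translation ID -> definition *)
}

(* Deduplicate repeated definitions while building the ID-indexed array. *)
let build (raw_defs : definition array) : instruction_log =
  let unique_defs = Hashtbl.create 4096 in
  let intern d =
    match Hashtbl.find_opt unique_defs d with
    | Some shared -> shared              (* reuse the existing definition *)
    | None -> Hashtbl.add unique_defs d d; d
  in
  { unique_defs; by_id = Array.map intern raw_defs }

(* O(1) lookup from a BBlk record's translation ID to its guest code. *)
let lookup log translation_id = log.by_id.(translation_id)
```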
Figure 3.2 illustrates how BBlk records from the execution trace are paired with their instructions, using the translation ID to perform a lookup in the instruction log.

[Figure 3.2 shows a BBlk record from the execution trace being matched, via its translation ID, through the lookup table and the content-indexed table to the corresponding sequence of guest instructions in the instruction log, producing a paired BBlk.]

Figure 3.2: Lookup by translation ID.

3.3.6 Trace Sample

146240:Event(32, 0, c0122fc2): LD32[c0615100]=60859c LD32[c0615104]=c0108e00
    LD32[c1204060]=ffff LD32[c1204064]=cf9a00 ST32[c0625fb4]=246
    ST32[c0625fb0]=60 ST32[c0625fac]=c0122fc2 R[esp]=c0625fac
    S[cs]=0060,00000000,ffffffff,00cf9a00 BR=c010859c FL=00000046
146497:BBlk(2):
  c010859c: push $0xffffffff %esp -> %esp (%esp) : ST32[c0625fa8]=ffffffff R[esp]=c0625fa8
  c010859e: jmp 0x0000083e : BR=c0108ddc
146526:BBlk(13):
  c0108ddc: cld : FL=00000046
  c0108ddd: push %fs %esp -> %esp (%esp) : ST32[c0625fa4]=d8 R[esp]=c0625fa4
  c0108ddf: push %es %esp -> %esp (%esp) : ST32[c0625fa0]=7b R[esp]=c0625fa0
  c0108de0: push %ds %esp -> %esp (%esp) : ST32[c0625f9c]=7b R[esp]=c0625f9c
  c0108de1: push %eax %esp -> %esp (%esp) : ST32[c0625f98]=0 R[esp]=c0625f98
  c0108de2: push %ebp %esp -> %esp (%esp) : ST32[c0625f94]=0 R[esp]=c0625f94
  c0108de3: push %edi %esp -> %esp (%esp) : ST32[c0625f90]=c0692280 R[esp]=c0625f90
  c0108de4: push %esi %esp -> %esp (%esp) : ST32[c0625f8c]=c068c004 R[esp]=c0625f8c
  c0108de5: push %edx %esp -> %esp (%esp) : ST32[c0625f88]=c0624000 R[esp]=c0625f88
  c0108de6: push %ecx %esp -> %esp (%esp) : ST32[c0625f84]=c0106f40 R[esp]=c0625f84
  c0108de7: push %ebx %esp -> %esp (%esp) : ST32[c0625f80]=0 R[esp]=c0625f80
  c0108de8: mov $0x0000007b -> %edx : R[edx]=0000007b
  c0108ded: mov %edx -> %ds : LD32[c1204078]=ffff LD32[c120407c]=cff300 S[ds]=007b,00000000,ffffffff,00cff300

Figure 3.3: Trace sample.

Figure 3.3 shows a small trace excerpt from a debug dump. The sample contains three records. For each record, the first number is the timestamp of the record and corresponds to the record's byte offset in the execution trace file. The first record is an event describing the occurrence of an external interrupt. We can determine that it is an external interrupt by looking at the vector number, which is 32 in this case. The next lines contain the updates to the register and memory state automatically done by the processor, such as saving state on the stack, loading a new code segment selector, updates to the flags register, and branching to the first instruction of the interrupt handler. The other two records represent the execution of QEMU translations.
The number of instructions executed is provided on the first line, and each subsequent line corresponds to the execution of a single instruction. The translation ID has already been used to look up the instruction definition from the instruction log. The address of each instruction is in the first column and is calculated from the virtual address provided in the BBlk record and the length of each instruction. The mnemonic plus the symbolic input and output of each instruction is in the second column. The third column contains the side effects of each instruction.

The second BBlk record is an example of a translation that is shorter than a basic block. Because the last mov instruction loads a new segment selector, which affects how linear addresses are calculated and requires runtime checks to determine if the segment descriptor is valid, QEMU stops the translation after this instruction to regain control at runtime to validate and load the segment descriptor.

3.3.7 Selective Trace Capture

The Tralfamadore trace capture facility provides the option to selectively record kernel execution without recording user space execution. Tracing is turned on whenever a user-level process enters the kernel via a system call or because of an interrupt or exception, and tracing is turned off when execution returns back to user space. This mode of operation results in substantially smaller traces since kernel execution usually accounts for a small fraction of a workload's execution. We currently use Tralfamadore almost exclusively this way since the focus of this thesis is on capturing and analyzing kernel execution.

This approach also results in missing information: it is no longer possible to reconstruct the complete memory state of the system since the side-effects of user-level execution are absent from the trace. We use two techniques to compensate for this missing information. First, we capture the complete register state on every entry into the kernel, ensuring that the register state can always be reconstructed at all points of kernel execution. Second, the values returned by memory-load instructions are logged as well, so any update to memory performed by a user-level process and used as an input by the kernel is captured in the trace. To summarize, any side-effect of user-level execution that is observed by kernel execution is captured in the trace.

3.3.8 Limitations

The trace capture facility is based on an emulator optimized for performance. Some of these optimizations affect the precision of the traces. A common optimization in many emulators is to update the x86 arithmetic flags lazily, just before they are used by an instruction. The intuition is that most ALU instructions update these flags but very few instructions actually read them. Most flag updates end up being overwritten by the next instruction without ever being read. Tralfamadore can only log updates to the flags register (eflags) when the flag values are calculated.

This limitation prevents reconstructing the eflags register state on arbitrary instruction boundaries, but fortunately this has only minor consequences. First, the eflags register state can always be reconstructed on basic block boundaries because QEMU does not attempt to optimize flag usage across translations and forces a flag update on the last flag-updating instruction of any translation.
Second, we can always reconstruct a register state that allows re-execution forward from any instruction, since any update to the arithmetic flags that is actually used (such as the last flag update before a conditional jump or a conditional move instruction) appears in the log.

As mentioned in Section 3.2.3, updates to memory caused by direct memory access (DMA) are currently not captured in the execution trace. IO devices, such as network cards, use DMA to write directly into memory without involving the processor. Memory updates performed by the MMU are also not captured in the execution trace. Whenever a page is accessed or dirtied, the MMU may walk the page table in memory and set the accessed and/or dirty bit of the corresponding page table entries. By capturing the value returned by memory-load instructions, Tralfamadore is able to get around these limitations by implicitly capturing external updates to memory that are read during execution.

Because explicit logging of DMA input can be useful for capturing incoming network packets, support for logging network input has been added as an option. Generic support for capturing arbitrary DMA and MMU updates is left for future work but should be a straightforward addition, since this issue is simply a limitation of the current Tralfamadore trace capture prototype. QEMU properly emulates device and MMU behaviour, and it should be relatively easy to extend the logging facility to explicitly log memory updates performed by DMA and the MMU.

3.4 Evaluation

Capturing detailed CPU traces is expensive both in terms of execution slowdown and the space required to store the traces. This section demonstrates that it is practical to capture detailed kernel traces for common workloads and that the overhead could be addressed by using deterministic record-replay as proposed in Section 2.2.3.

To demonstrate this claim, we measured the overhead of three example workloads. Wget retrieves the entire Gimp user manual from the Gimp web site.9 It is a network-intensive benchmark with low CPU utilization. Overall, 2110 files are retrieved, for a total of 35 MB of data. Kernel is a minimal kernel build from sources. This benchmark is CPU intensive with nearly 100% CPU utilization but with only 10% of the time spent in the kernel. The last benchmark, Postmark [40], is an IO-intensive workload that simulates a mail server by performing random IO on many small files. Its CPU utilization is low (5-10%) but over 90% of the CPU time is spent in the kernel.

9 http://docs.gimp.org/2.6/en/

We ran each benchmark in four different configurations: Native, Qemu, TQemu-NoLog, and TQemu-Log. The Native configuration is a 32-bit Ubuntu 8.04 system running kernel 2.6.24 on a dual-socket 2.5GHz quad-core Intel Xeon system. The system has 16GB of RAM and a 1 TB 7200rpm SATA2 hard drive. To make this configuration closer to the Qemu configurations, we used the taskset command10 to force the benchmarks to execute on a single core. The Qemu configuration corresponds to a Linux guest running on an unmodified version of QEMU 0.9.1. The guest is a 32-bit Ubuntu 8.04 system running kernel 2.6.24 with 1 GB of RAM. The TQemu-NoLog and TQemu-Log configurations use the same guest as the Qemu configuration but running the Tralfamadore version of QEMU, modified to record traces as described in Section 3.3. The TQemu-NoLog configuration uses the modified QEMU with the trace logging feature turned off.

10 http://linux.die.net/man/1/taskset
The TQemu-Log configuration captures traces when the guest is running in kernel mode. When tracing, time dilation [35] is used to slow down the guest's notion of time to more closely match the reduced performance of the system. Time dilation diminishes the rate of timer interrupts, reducing the overhead incurred to handle them. In both TQemu configurations, the tracing instrumentation is always present and executed for both user and kernel space; it simply does not generate traces when disabled. The presence of the tracing instrumentation allows us to calculate how much trace would be generated without actually recording any trace. This feature is used to estimate the size of whole-system (kernel and user space) traces. All QEMU configurations run on the host described in the Native configuration. The traces are written to a RAID-0 disk array capable of writing at 250MBps.

The results for each benchmark for all configurations are presented in Table 3.1. Each benchmark was executed five times and the wall clock time to execute the benchmark was measured. For each configuration, the first two columns contain the mean (M) and standard deviation (SD) in seconds. The third column contains the CPU utilization for the Native configuration, and the overhead (OH) in terms of slowdown factor compared to Native for the other three configurations.

The slowdown on execution varies widely depending on the CPU utilization, ranging from a 1.3 to a 47 times slowdown. These results demonstrate that the overheads imposed by using QEMU and tracing are substantial for workloads with high CPU usage, but are acceptable for mostly IO workloads. In all cases, the overheads are not prohibitive and are similar to the overheads imposed by tools such as Valgrind [63] and PinOS [16].

            Native                 Qemu                  TQemu-NoLog            TQemu-Log
            M      SD     CPU      M       SD     OH     M       SD     OH      M       SD      OH
            (sec)  (sec)  (%)      (sec)   (sec)  (×)    (sec)   (sec)  (×)     (sec)   (sec)   (×)
Wget        149.2  2.3    0.5      153.8   2.4    1.0    167.6   2.5    1.1     198.0   1.6     1.3
Kernel      73.3   0.3    99.7     1256.7  0.7    17.1   3017.7  2.8    41.2    3448.5  16.6    47.0
Postmark    169.9  8.4    8.2      191.9   1.4    1.1    439.4   1.4    2.6     1963.2  170.7   11.6

Table 3.1: Mean execution time and standard deviation for each configuration. The third column provides the CPU utilization for Native and the overhead compared to Native for the other configurations.

Table 3.2 provides the trace size when tracing kernel execution (Kernel) and an estimate of the trace size if whole-system (Kernel + User) executions were captured. The third column provides the trace generation rate in gigabytes per second when tracing kernel execution. This number is obtained by dividing the Kernel trace size (column 1) by the TQemu-Log wall clock time (Table 3.1). The generation rate depends on the CPU utilization and varies from one benchmark to another. The last two columns of Table 3.2 present the trace generation rate per second of CPU utilization. These numbers are obtained by dividing the trace size by the native CPU time (either kernel, or kernel and user), and they represent the rate at which data would need to be written to disk to sustain a benchmark executing at native speed with 100% CPU utilization. These numbers can be used to estimate an upper bound on the tracing overhead, assuming that tracing remains IO bound as it currently is. Assuming the same disk array as the one used in this thesis (∼250MBps), this overhead is roughly between 80 and 100×.
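As a worked example of this estimate, taking the kernel-only Wget rate from Table 3.2 and the disk array's sustained write bandwidth gives

    21.56 GBps / 0.25 GBps ≈ 86×

which falls within the stated range of roughly 80 to 100×.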
            Size (GB)  Size (GB)  Rate (GBps)  CPU Time (sec)  CPU Time (sec)  Rate (GBps)  Rate (GBps)
            (Kernel)   (K + U)    (Kernel)     (Kernel)        (K + U)         (Kernel)     (K + U)
Wget        12.8       18         0.06         0.592           0.744           21.56        24.1
Kernel      115.9      1659       0.03         6.6             73.0            17.67        22.7
Postmark    280.0      303        0.14         12.9            13.9            21.69        21.9

Table 3.2: Trace size, native CPU time, and trace generation rate.

Although the traces are large, the numbers in the first column indicate that recording the kernel execution of common workloads is practical given the price and capacity of today's disks. A one-terabyte drive currently costs around $65 CDN, and the cost of storage is constantly getting cheaper. Further, the overheads both in terms of slowdown and storage can be addressed by using virtual machine deterministic record-replay. As mentioned in Section 2.2.3, it is possible to record the execution of a virtual machine with overhead often below 5%, and the replay logs are small enough to easily record hours of execution.

One of the reasons traces are so large is the high cost per instruction of the current Tralfamadore trace format. Table 3.3 provides the cost per instruction in bytes for each benchmark. The numbers are obtained by dividing the trace size by the number of instructions in the trace. The current trace format is designed with simplicity and ease of parsing in mind. More compact traces could be achieved by eliminating redundant information. As an example, an add $4, %eax instruction (which adds four to the current value of eax) currently results in the new value of eax being recorded in the trace, although it could be trivially extrapolated from its previous value.

            Wget   Kernel   Postmark
            12.3   14.9     12.2

Table 3.3: Tracing cost per instruction (bytes).

As a comparison of what can be achieved in terms of reducing trace size, the Chronicle omniscient debugger [1, 4] uses traces generated with Valgrind to provide time-travelling debugging without any re-execution. The design of the Chronicle trace format goes to great effort to eliminate redundant information and achieves a much lower cost of 6.5 bits per instruction. This space efficiency comes at a cost in terms of recording overhead; the Chronicle recorder is CPU bound and imposes a slowdown of around 300×. This overhead is substantially higher than the overhead imposed by the Tralfamadore trace collection facility.

3.5 Conclusion

This chapter described the trace capture facility used to record the execution of unmodified systems, including the operating system kernel. This facility is based on a whole-system x86 emulator modified to collect detailed CPU traces. This chapter demonstrated two claims: first, that the traces collected with Tralfamadore are complete, meaning that they can be used to perform analyses with the same power as traditional online binary analyses. Second, this chapter evaluated the overhead of recording traces with Tralfamadore. The measured performance shows that although the overheads are high, they are acceptable in the context of dynamic analysis and the current availability of large, affordable disks. Further, as proposed in Chapter 2 and Section 3.4, it is possible to decouple capturing execution from trace generation using deterministic virtual machine record-replay.

The traces recorded with Tralfamadore are low-level and opaque, and create a semantic gap between their content and the information expected by a developer.
The next chapter describes the Tralfamadore analysis engine, which is designed to gradually reconstruct the execution semantics from these low-level traces up to the level of source code.

Chapter 4

The Analysis Engine

4.1 Introduction

This chapter describes the design and the implementation of the Tralfamadore analysis engine. The analysis engine consumes the traces generated by the Tralfamadore capture facility described in the previous chapter and is used to implement the analyses described in the next chapter.

Two of the four contributions of this thesis are described in this chapter: the design of a dynamic analysis framework structured as a dataflow engine, and the context operator which demultiplexes the control flow of kernel execution (Section 4.7.2).

Section 4.2 motivates the architecture of the analysis engine by describing its benefits, and Sections 4.3 to 4.6 describe the core components of the analysis engine. Tralfamadore analyses are built by composing simple components called operators. Section 4.7 describes some of the important operators that have been built. Finally, the chapter describes optimizations that have been implemented to improve the performance of analyses and presents a generic performance evaluation of simple analyses. Chapter 5 describes more complex analyses that have been built and evaluates their performance.

4.2 Motivation

Performing dynamic program analysis at the processor level, as if the hardware had been modified to trace execution, creates a semantic gap between the information recorded (CPU-level traces) and what an analysis writer may want (source-level semantics). This problem is exacerbated in the case of kernel-level analysis, since execution is now recorded below the operating system and many abstractions that are available when doing user-level analysis, such as the notion of threads, no longer exist. These abstractions must be rebuilt from the ground up, and doing so is challenging and requires a substantial amount of domain expertise.

[Figure 4.1 depicts the trace flowing from the Trace Parser through a chain of operators to a sink, with each operator inserting its own annotations into the stream as it passes.]

Figure 4.1: Dataflow engine.

As mentioned in Chapter 2, Tralfamadore proposes to address these issues by structuring the analysis engine as a dataflow engine. Analyses built with Tralfamadore are composed of single-purpose components called operators. Operators are responsible for recognizing specific patterns in the trace corresponding to the occurrence of events such as function calls, interrupts, or memory accesses, and for producing annotations describing these events. The recorded trace is treated as a stream and flows through a pipeline of operators, where each operator observes the trace as it flows and inserts new annotations into the stream corresponding to semantically higher-level events. Operators close to the original trace have to deal with low-level aspects of execution, but downstream operators gradually observe execution at a higher level, looking for patterns based on previously added annotations instead of the low-level trace.

A dataflow architecture as just described provides two important benefits from the point of view of the analysis writer:

• The gradual reconstruction of execution semantics provides a modular approach to structuring dynamic analyses. High-level aspects of analyses can be written based on higher-level annotations instead of dealing directly with a CPU-level trace.
• The dataflow architecture promotes composability and reusability. By making it easy to reuse existing operators, low-level aspects of execution need to be understood once, and this task can be performed by a domain expert. Subsequent analyses are shielded from the low-level complexity and can simply leverage the annotations provided by existing operators.

4.3 The Tralfamadore Analysis Engine

A Tralfamadore analysis consists of operators that are connected in a pipeline or tree configuration, where each operator reads from one or more input streams and produces an output stream. Operators generate their output stream by adding, removing, modifying, or simply passing stream elements called annotations. At the root of the graph, a sink operator consumes the amalgamated stream, and the leaves of the graph consist of parsers that read either the trace or indices.

Two configurations are typically used to compose operators into an analysis: pipeline and tree. In a pipeline configuration, the upstream operator is a trace parsing operator and the other downstream operators are connected in a chain. This type of configuration is used to read the entire trace, and is often used to populate annotation caches to be used as indices by subsequent analyses. Figure 4.2 shows the operator configuration used to build an allocation index, which provides the points in the trace where heap objects are allocated and released. The allocation index is used by many analyses, including heap slicing, an analysis used to visualize the execution of data-driven software components such as network protocol stacks and file systems. Heap slicing will be described in more detail in the next chapter.

[Figure 4.2 depicts a pipeline: a Trace Parser reading the trace, followed by Invoke and Alloc operators and an Alloc Indexer that produces the index.]

Figure 4.2: Pipeline example — allocation indexer.

Analyses that only read a subset of the trace are configured as a tree of operators. These analyses use additional inputs to provide, in addition to the trace, streams of cached annotations that are used as indices to quickly find relevant sections of trace. Figure 4.3 shows an operator configuration for an analysis that extracts the value of a function argument for all calls to a specific function. This analysis uses a Function Index stream produced from cached annotations. This index provides a stream of positions at which to start reading the trace to extract the value of the argument. Once the argument value for a given function call is extracted, the analysis can skip ahead to the next position provided by the index.

[Figure 4.3 depicts a tree: a Trace Parser reading the trace and a Function Index reading cached annotations both feed a Merge operator, whose output flows through Invoke and Argument operators to a sink.]

Figure 4.3: Tree example — extracting argument values.

The next sections describe in more detail the primitive abstractions used to construct analyses: annotations, operators, streams, and caching/indexing.

4.4 Annotations

Annotations are the basic unit of information in the system. They are typed and parametrized and are created by operators. At the lowest level, the trace parser operator defines annotations corresponding to the execution of a basic block and the occurrence of an event such as an interrupt. The trace records used to build these primitive annotations were described in more detail in Section 3.3.2.

Annotations have two components: a timestamp and a set of attributes. Timestamps are opaque values in the sense that, for all operators except the trace parser, they have no concrete meaning apart from providing a logical ordering between different annotations. A timestamp is defined by two fields: a byte offset in the trace file and a precedence field to provide an ordering between annotations with the same byte offset. The precedence field is used when merging multiple input streams and ensures that annotations are correctly ordered.
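A minimal sketch of what an annotation and its timestamp ordering might look like is shown below. The field and module names are assumptions for exposition and are not the framework's actual definitions; the real operator-facing types appear in Figure 4.4.

```ocaml
(* Illustrative sketch of an annotation and its timestamp ordering.
   Field names are assumptions; the actual Annot module differs. *)
module Timestamp = struct
  type t = { offset : int64;      (* byte offset in the execution trace *)
             precedence : int }   (* orders annotations at the same offset *)

  (* Order by trace offset first, then by precedence when offsets tie,
     as needed when merging multiple input streams. *)
  let compare a b =
    match Int64.compare a.offset b.offset with
    | 0 -> compare a.precedence b.precedence
    | c -> c
end

type 'attr annot = { ts : Timestamp.t; attr : 'attr }
```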
4.5 Operators

Operators are the main functional abstraction in Tralfamadore. They represent a single analysis function that operates on one or more input streams and produces an output stream of annotations. The common role of an operator is to identify a specific pattern in its input stream corresponding to an event and to insert new annotations characterizing the event into its output stream. Operators can also act as filters, removing annotations from the stream or refining existing annotations by replacing them with annotations containing additional higher-level information.

Operators are connected by streams, which carry annotations. A stream, then, is an interface across which operators may send and receive sequences of annotations. As previously mentioned, the flow of annotations in the system is driven by a "pull" from the terminal stream operator in the graph rather than a "push" from the trace itself. This allows operators to elect to work with specific, small regions of the trace. The programmatic idiom for streams in our system is that of an infinite lazy list, whose contents are materialized on demand.

module Op = struct
  type 'a t = {
    next: unit -> 'a Annot.t;
    seek: ts_t -> unit;
    close: unit -> unit
  }
  ...
end

Figure 4.4: Operator interface signature.

The implementation of operators in Tralfamadore bears significant similarity to iterators in databases, or those available in programming languages such as Java and C++. Operators are instantiated and export a simple interface to iterate across a sequence of records. We added a seek function to the interface to simplify the design of operators that stride across a trace to only read relevant portions of it.

Figure 4.4 shows the type signature of a Tralfamadore operator. All operators except the merge operator (described in Section 4.6.3) implement this interface. The module Op defines a type named t. OCaml modules provide namespace management, and this type is globally known as Op.t. This type is a record (which is similar to a struct in C) and is polymorphic (i.e., generic). In OCaml, polymorphic types are parametrized by one or more type variables, which are written with an apostrophe in front of them ('a in this example). The record has three fields named next, seek, and close. The type of each field is defined on the right side of the colon, and all three fields are functions. The next function takes no argument (unit is similar to void in C) and returns an annotation, which is also a polymorphic type. This function is the main driver in Tralfamadore and is the function used to pull annotations from the upstream operator. The seek function takes a timestamp as an argument and returns nothing. This function is used to position an operator at a specific point in the trace. The ts_t type is a typedef for the integer type used to represent trace timestamps.

The type of an operator is solely defined by its interface, which is defined by the type of annotations it produces ('a Annot.t). The type of an operator is not defined by its name or its internal state. This approach provides a form of encapsulation, allowing an operator to be easily replaced with another operator with the same type signature but a different internal implementation.
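As a usage illustration, a terminal sink can drive an entire pipeline simply by calling next repeatedly until the trace is exhausted. The sketch below is hypothetical: the End_of_stream exception and the count_annotations function are assumptions, not part of the actual interface.

```ocaml
(* Hypothetical sink that drives a pipeline by pulling annotations until the
   upstream operator signals the end of the trace. The exception name is an
   assumption; the real framework's end-of-stream convention may differ. *)
exception End_of_stream

let count_annotations (up : 'a Op.t) : int =
  let rec loop n =
    match (try Some (up.Op.next ()) with End_of_stream -> None) with
    | Some _annot -> loop (n + 1)   (* inspect or dispatch the annotation here *)
    | None -> up.Op.close (); n
  in
  loop 0
```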
This form of encapsulation makes it possible to have multiple implementations of the same operator, each designed to understand different systems (e.g., Linux or Windows) or different versions of the same system. Although an operator's internal implementation might need to change to adapt to system or software changes, as long as the produced annotations remain unchanged, consumers of its annotations do not need to be modified. The Tralfamadore caching and indexing mechanism described in the next section takes advantage of this encapsulation by replacing operators that produce annotations from the trace with operators that produce the same annotations from a cache. Operators are instantiated using partially evaluated functions to create closures that encapsulate the operator's internal state, hiding it from the operator's type signature.

Figure 4.5 provides an example of a simple breakpoint operator which looks for locations in the trace where a specific instruction is executed. Every time the operator hits the target instruction, it produces an annotation which is inserted in the stream prior to the basic block containing the target instruction. The core of this operator is the pull_from_upstream function, which reads an annotation from the upstream operator and performs an action based on the type of the annotation and its attribute values. Annotations are implemented using OCaml variants, which are similar to tagged unions in C. `BBlk is the tag and bb is a reference to the tag-specific structure. Tralfamadore uses pattern matching, a language feature common to many functional languages, to concisely express the constraints used to match on annotations. Patterns are separated by a | and the action is the expression or statements following the ->.

(* Module defining the breakpoint annotation *)
module Breakpoint = struct
  type t = { pc: guest32_t }
end

(* Module providing an implementation of a breakpoint operator *)
module BreakpointOp = struct
  type attr = [ `Breakpoint of Breakpoint.t ]
  type annot = attr Annot.t
  type 'a state = {
    up: 'a Op.t;            (* The upstream operator *)
    buf: 'a AnnotBuf.t;     (* Annotation buffer *)
    target_pc: guest32_t }  (* the target instruction *)
    constraint 'a = [> `BBlk of BBlk.t ]

  let pull_from_upstream st =
    (* Get an annotation from the upstream operator *)
    let a = st.up.Op.next () in
    match a.Annot.attr with
    | `BBlk bb ->
        let start_pc = bb.BBlk.pc in
        let end_pc = BBlk.instr_pc bb (BBlk.last_instr bb) in
        (* If the target instruction falls within the basic block *)
        if st.target_pc >= start_pc && st.target_pc <= end_pc then (
          (* Create breakpoint annotation *)
          let bp_attr = `Breakpoint { Breakpoint.pc = st.target_pc } in
          (* This annotation should appear before the basic block annotation *)
          let bp = Annot.gen_before a bp_attr in
          (* We save the current annotation and return the new one *)
          AnnotBuf.push st.buf a;
          bp )
        else a
    | _ -> a

  (* next takes an extra unit argument to allow partial eval *)
  let next st () =
    if AnnotBuf.is_empty st.buf then pull_from_upstream st
    else (* Drain queued annotations first *)
      AnnotBuf.pull st.buf

  ...

  let create up target_pc =
    let st = { up = up;
               buf = AnnotBuf.create 1;
               target_pc = target_pc } in
    (*
     * Instantiate the operator using partially evaluated functions
     * to encapsulate the state.
     *)
    { Op.next = next st; Op.seek = seek st; Op.close = close st }
end

Figure 4.5: Simple breakpoint operator.
In this example, if the pattern matches a basic block annotation, then the addresses of the first and last instructions of the basic block are extracted and compared with the target instruction. If the target instruction is within the basic block, a `Breakpoint annotation is generated and inserted before the basic block.

4.6 Caching, Indexing and Striding

Many analyses only need to look at small portions of the trace to extract the information they need. A good example is extracting the value of an argument for all calls to a specific function. It is sufficient to revisit the trace at all call sites and inspect the register or memory state to extract the argument values. Analyses could execute much faster if they had the ability to read only the relevant portions of a trace. The rest of this section describes the techniques available to minimize the amount of trace read while still ensuring correct results.

4.6.1 Caching

One of the benefits of an offline approach is that analyses produce consistent results across runs. Because they execute off a static data set, analyses execute deterministically. This property should also apply to analyses that only read a subset of the trace. They should produce the same results as if they had read the entire trace. Skipping trace sections is an optimization and should not affect the results of an analysis.

To produce the same results, an analysis must read the relevant portions of the trace, and receive the same annotations that are relevant to the analysis from upstream operators. This second requirement is transitive, since most operators produce new annotations based on existing annotations received from upstream operators. Because many operators are stateful, achieving this goal is more complicated than simply running a configuration of operators from the desired starting point.

Stateful operators can produce different annotations depending on where they start reading the trace. As shown in Figure 4.6, an operator that tracks heap allocations will identify the size of the allocation when the allocator function is called but will get the pointer to the object when the function returns. This operator will be unable to produce an allocation annotation if an analysis is configured to start reading the trace in the middle of the allocation (after the allocator function is invoked but before it returns). This behaviour is problematic because the allocation annotation would normally be inserted in the stream when the allocator returns. The stream produced will be different for the same region of trace than if the analysis had started reading the trace earlier.

[Figure 4.6 illustrates this on a timeline: a heap allocation operator reads the allocation size from the argument when __kmalloc is called, reads the address of the allocated object when __kmalloc returns, and only then produces an annotation containing both the size and the address. If the analysis starts reading the trace in the middle of the allocation, no annotation is produced.]

Figure 4.6: Missing annotation due to trace skipping.

Another example is the Context operator, described in Section 4.7.2, which reconstructs control flow and labels individual flows of execution (system calls, interrupt handlers, etc.) with a unique monotonically increasing ID. This operator can only produce the same annotations if the trace is read from the beginning.
Tralfamadore addresses this issue by allowing operators to create persistent caches of their annotations. Cached annotations can be reused in subsequent analyses instead of being resynthesized from the trace. The way in which an operator caches its annotations is specific to each operator, but there is a common requirement that an operator using a cache produces the same annotations it would produce if reading the whole trace. An annotation cache should include sufficient context to be able to reproduce the original annotations without parsing any earlier trace. The simplest implementation of an annotation cache is to simply dump all the annotations to disk. Once annotations are cached, analyses can be configured to use variants of operators that produce annotations from a cache instead of generating them from the trace.

4.6.2 Indexing

Annotations correspond to the occurrence of semantically richer events. They have attributes that characterize certain aspects of execution, such as the function called, or the address and size of heap objects. They provide a rich body of information and are well suited to serve as the basis for indices. Tralfamadore analyses can use cached annotations as indices to map annotations or annotation attributes to interesting positions in the trace. These positions can be used as starting points to examine a trace, or used to identify interesting slices of a trace, such as the range of a trace where a specific thread was running. Operators that support this mechanism export a query interface that lets analyses instantiate an operator that will produce the subset of annotations matching the query. This is conceptually similar to a select operation in a relational database, where a subset of the rows (those that match the selection criteria) are returned. Section 4.6.3 describes in more detail how cached annotations are used to read only relevant portions of the trace.

Many analyses revisiting a trace at specific points will want to examine the state of the system at those points (e.g., to extract the value of a function argument). Because the trace only contains updates to registers, an analysis wishing to inspect the register state at a specific point in time must start reading the trace prior to this time to observe the last updates to the relevant registers. Tralfamadore makes it possible to produce caches of indexed positions, such as a function index, that store timestamps prior to the annotations they are indexing (in this case, Invoke annotations), to guarantee the ability to observe the relevant register updates. For the function index, the recorded position guarantees the ability to observe the last updates to the registers used to pass arguments, the stack pointer register (esp), and the frame pointer register (ebp) if used by the function being indexed. The method used to index earlier positions in the trace is generic and could be used with other caches as well, although it is currently only used to generate a function index.

[Figure 4.7 illustrates this on a timeline: the function index records a trace position slightly before the Invoke(__kmalloc, 0xf755bea4) annotation, early enough to observe the last register update (R[eax] = 4) before the target function's entry point.]

Figure 4.7: Using a function index to inspect a register.

Indexers are regular Tralfamadore analyses and execute like other analyses, except that the sink operator in the configuration is a specially configured operator that writes its annotations to disk.
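To make the shape of such a configuration concrete, the sketch below composes a pipeline by folding a list of stage constructors over a parser, using only the Op.t interface from Figure 4.4. The function and parameter names are assumptions for illustration, not the framework's actual API.

```ocaml
(* Hypothetical helper for wiring a pipeline-style analysis or indexer.
   The stage constructors are passed in as functions so the sketch stays
   independent of the framework's actual module names. *)
let build_pipeline ~(parse_trace : string -> 'a Op.t)
                   ~(stages : ('a Op.t -> 'a Op.t) list)
                   ~(trace_file : string) : 'a Op.t =
  (* Each stage wraps the operator upstream of it, with the parser at the leaf. *)
  List.fold_left (fun upstream stage -> stage upstream)
    (parse_trace trace_file) stages
```

An allocation indexer in the style of Figure 4.2 would pass the Invoke and Alloc stage constructors and hand the resulting operator to a sink that writes its Alloc and Release annotations to disk.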
4.6.3 Striding

The programming interface exposed in Tralfamadore to take advantage of indices is centered around the idea of striding over the trace. The goal of the striding API is to be able to quickly navigate to a point of interest and start scanning the trace from that point forward. Once the desired information is gathered, the API provides ways to efficiently jump to the next point of interest. This pattern is repeated for every point of interest, effectively striding over the trace. A schematic of the striding model is drawn in Figure 4.8.

[Figure 4.8 depicts an index stream and a trace stream feeding a Merge operator, with the analysis alternating between scanning short regions of the trace and skipping ahead to the next indexed position.]

Figure 4.8: Striding.

A striding model provides two advantages. The first is that the points where scanning should stop do not have to be known in advance. The length of the region of interest can be determined by the analysis as it runs. The second advantage is that analyses can be written to execute in a single pass instead of doing one pass per region of interest. Overlapping regions of interest can easily be merged by the analysis software, avoiding repeated scans of overlapping regions of trace. Chapter 5 provides detailed examples of analyses that take advantage of the striding model to avoid repeatedly scanning the trace.

The striding programming model is exposed in Tralfamadore via a special Merge operator. This operator, as its name indicates, merges two streams into a single stream ordered by timestamps. Additionally, the merge operator exports an extended interface which enables seeking to the next annotation in one of the input streams and automatically resyncing the other input stream to the same timestamp. The common use case for this extended interface is to merge an index stream and a trace stream and be able to resync both streams when seeking to the next indexed point of interest.

Trace Parser
    Parses the raw trace file.
    Produces: BBlk(pc, instructions, side effects), Event(vector, error code, pc, side effects)

Context
    Identifies independent contexts of execution (threads, interrupts).
    Depends: BBlk(side effect = store[TSS.esp0] | instr.mnemonic = sysexit|iret), Event*
    Produces: Context(activation ID, stack address, Entry|Exit|Switch)

Invoke
    Identifies function call and return points.
    Depends: BBlk(pc = function address | instr.mnemonic = ret)
    Produces: Invoke(pc, stack pointer), Return(stack pointer)

Argument
    Extracts the value of a specific function argument.
    Depends: BBlk*, Invoke(pc = function address)
    Produces: Arg(position, value)

Allocation
    Tracks allocations of objects on the heap.
    Depends: Context*, Invoke*
    Produces: Alloc(ptr, size, pc), Release(ptr)

MemTrace
    Tracks memory accesses.
    Depends: BBlk(side effect = load[*]|store[*])
    Produces: MemTrace(pc, address, access sz, R|W)

HeapTrace
    Tracks accesses to objects allocated on the heap.
    Depends: MemTrace*, Allocation*
    Produces: HeapTrace(pc, ptr, offset, access sz, R|W)

Locks
    Tracks the use of lock-based synchronization primitives.
    Depends: Invoke*, BBlk(side effect = load[*]|store[*])
    Produces: Lock(lock), Unlock(lock)

Breakpoint
    Identifies the occurrence of a breakpoint.
    Depends: BBlk*
    Produces: Breakpoint(pc)

Table 4.1: Some of the current operators available.

4.7 Operator Library

Whole-system analysis faces unique challenges in mapping instruction-level traces to source-level semantics. The approach described in this thesis is to compose directed graphs of operators that refine low-level trace annotations into progressively higher-level semantics.
The focus is on studying the Linux kernel on x86 hardware, and the implementations of many of the operators described in this thesis are specific to Linux and/or the x86 architecture. On the other hand, the annotations these operators produce are generic and could apply to a much wider range of platforms. Other members of the lab have also begun work on user-level analysis, and preliminary Windows support has previously been implemented. Extending the system in these directions is largely a matter of developing new operators.

Table 4.1 briefly describes some of the operators currently implemented. Lower-level operators such as the Context and Invoke operators have a detailed understanding of the hardware and how it is used by the operating system, while higher-level operators understand how the operating system behaves, for instance with regard to its heap management or synchronization primitives. The remainder of this section describes four of the more fundamental operators in additional detail.

4.7.1 The Trace Parser Operator

This operator is the base of all Tralfamadore analyses. It reads the trace from disk and produces annotations similar to the records contained in the execution trace described in Section 3.3.3. One notable difference is that the Trace Parser operator does the lookup in the instruction log described in Section 3.3.5 internally. The BBlk annotations it produces have already been paired with their respective instructions. This operator is partially implemented in C for performance reasons. The original implementation was entirely written in OCaml and was substantially slower.

To disassemble the content of the instruction log, the Trace Parser operator uses the DynamoRIO [15] disassembler because it provides additional information not provided by disassemblers using the AT&T or Intel syntax. The DynamoRIO disassembler makes all input and output operands explicit, which is very useful. Operators and analyses can track all instruction operands without having to know the specificities of each instruction. An obvious example is the push instruction, as shown in Figure 4.9, which pushes the content of a register onto the stack. The push instruction writes to memory and modifies the value of the stack pointer register (esp), but a traditional disassembler would not provide this information.

AT&T:       push %eax
DynamoRIO:  push %eax %esp -> %esp (%esp)

Figure 4.9: AT&T assembly vs DynamoRIO.

Parsing the trace is very expensive. Even with the low-level parsing implemented in C, the trace parser ends up creating a large number of objects representing BBlk annotations, each one dynamically allocated on the heap. A high-level language such as OCaml makes this problem worse because of the runtime representation of structured types and the frequent use of integer boxing, which results in a larger number of object allocations and stresses the garbage collector. Fortunately, many analyses end up examining only a very small fraction of the annotations directly produced from the trace, and the parser has been optimized for this case by lazily parsing the trace. The updates to registers and memory that are provided by BBlk annotations are left unparsed; they only contain a pointer to the raw block of data read from disk. This raw data only gets parsed when a downstream operator inspects the side-effects of an instruction.

4.7.2 The Context Operator

Many program analyses operate on individual flows of execution.
The simplest such analysis is extracting control flow graphs or paths. A user-level tracing framework such as Nirvana [13] can provide built-in support for identifying individual flows of execution using the primitives provided by the operating system. In the case of Tralfamadore, because the traces are captured using a full-system emulator at a layer below the operating system, no such abstraction or primitive exists. The executions of all threads appear to be arbitrarily interleaved within a single trace. To support analyses that need flow information, such as race detection, Tralfamadore must provide the ability to reconstruct and label individual flows of execution.

Reconstructing flows at such a low level is challenging, due to the asynchronous, interrupt-driven environment in which they operate: a flow of execution may be suspended and resumed for a number of reasons, including explicit context switches, interrupts (which may be nested), and exceptions. Further, although the notion of a thread exists at the operating system level, it does not represent a good abstraction for tracking individual flows of execution. Interrupt handlers execute in the context of the currently running thread but are conceptually different flows of execution. Identifiers such as current,11 which maps to the currently executing task, cannot be used to distinguish between threaded (system call, kernel thread) and interrupt handler execution.

11 current is a Linux kernel macro used to get a pointer to the currently running task from a CPU-local variable.

To reconstruct individual flows, Tralfamadore needs to be able to identify when control transitions from one flow to another. Understanding these transitions requires an understanding of both the physical architecture and how a given OS uses that architecture in its implementation. Specifically, the Context operator needs to track three things: the start and end of interrupt and fault handlers, system calls, and the occurrence of context switches from one thread to another.

The execution of interrupt and exception handlers (including system calls that use software interrupts to enter the kernel) can be identified by Event annotations in the trace. At these points, the current flow of execution is suspended and control jumps to an event handler. An iret instruction marks the end of handler execution and the resumption of the original execution context. System calls may also be invoked through fast entry instructions that do not generate events; their boundaries are identified by looking for basic block annotations that end with sysenter/sysexit instructions.

Additionally, to pair a thread resumed by the operating system scheduler with its previously suspended execution, the Context operator needs a mechanism to identify each thread in the system. One simple way to do this would be to track the invocation of the operating system stack switching routine, such as switch_to() on Linux, using its argument to identify threads. We can also use a more generic approach (that works for both Linux and Windows), which is to track updates to the esp0 field of the Task State Segment (TSS). This field indicates the base address of the kernel stack for the task about to run. This address uniquely identifies a thread over its lifetime and thus serves as a suitable context identifier. To avoid blindly scanning every trace record to look for writes to the esp0 field, we currently use a combination of both techniques.
We use the occurrence of calls to switch_to() to detect that we are near an update to the TSS, and scan the trace from that point until the esp0 field is updated. This is simply a performance optimization and we can fall back to the generic method if needed.

The Context operator tags each flow by emitting an annotation whenever a flow of execution is suspended or resumed (whether due to a software context switch or a hardware interrupt). The first attribute of the annotation is an activation ID, which is an opaque monotonically increasing integer that uniquely identifies a flow (such as the invocation of a system call or interrupt handler) over its lifetime. The second attribute is the base address of the kernel stack, which is primarily used to pair suspended activations as described above. The stack attribute can also be used to group activations, such as all system calls executed by the same thread. The third attribute identifies the reason behind the context event, such as an entry into the kernel or a switch from one thread to another.

To deal with nested interrupts, the operator maintains a stack of live activations for each thread it has seen. When an interrupt occurs, the operator emits an annotation with a new activation ID and pushes that ID on the stack. When the handler terminates, the operator pops its stack and emits an annotation with the ID of the resumed activation. Similarly, on a software context switch, the operator emits an annotation indicating the stack address and the activation ID of the resumed thread.

While the operator is written by a domain expert, analyses using annotations from the Context operator require no specific knowledge of Linux or the x86 architecture. They may simply use the annotations to identify the regions of trace that comprise a single activation or thread of execution.
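As an illustration of the bookkeeping just described, the sketch below maintains a per-thread stack of live activations and emits context annotations on interrupt entry and exit. The record fields and helper names are hypothetical simplifications, not the actual Tralfamadore annotation types.

    type context_annot = {
      activation : int;     (* opaque, monotonically increasing activation ID *)
      stack_base : int64;   (* base address of the kernel stack (TSS esp0)    *)
      reason     : string;  (* e.g. "interrupt-entry", "interrupt-exit"       *)
    }

    let next_id = ref 0
    let fresh_id () = incr next_id; !next_id

    (* One stack of live activation IDs per kernel stack (i.e. per thread). *)
    let live : (int64, int list) Hashtbl.t = Hashtbl.create 64

    let on_interrupt_entry stack_base =
      let id = fresh_id () in
      let stk = match Hashtbl.find_opt live stack_base with Some s -> s | None -> [] in
      Hashtbl.replace live stack_base (id :: stk);
      { activation = id; stack_base; reason = "interrupt-entry" }

    let on_interrupt_exit stack_base =
      match Hashtbl.find_opt live stack_base with
      | Some (_handler :: resumed :: rest) ->
          (* Pop the handler; the activation underneath is the one resumed. *)
          Hashtbl.replace live stack_base (resumed :: rest);
          Some { activation = resumed; stack_base; reason = "interrupt-exit" }
      | _ -> None  (* e.g. the interrupted activation began before the trace *)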
Because of the low-level aspect of the trace, the execution of the page fault handler is visible in the trace and must be explicitly dealt with. This is a good example of an issue that does not exists with user-level analyses because interrupts and faults are handled transparently by the operating system. The Invoke operator solves the first and last problems by using the sym- bol table to identify function entry points, and then using the execution of a function’s first instruction to identify function calls. The tail call prob- lem is solved by using the value of the stack pointer when a function has 56 been called and just before it returns as a key to pair matching calls and returns. When a tail call optimized function returns, just as a single return instruction unwinds multiple calls, a single key will match multiple calls as well. The Invoke operator produces two annotations: Invoke and Return. The Invoke annotation includes the address of the function being called and the value of the stack pointer register on entry. The Return annotation includes the value of the stack pointer register just before the return address on the stack is consumed by a ret instruction. As described in the previous paragraph, consumers of the annotations produced by the Invoke operator can use the value of the stack pointer contained in both annotations as a key to pair function calls and returns. Limitations The current implementation of the Invoke operator is unable to track inlined functions, since at the binary level they do not exist as functions and are not explicitly called. This limitation could potentially be addressed by using debugging information but this support has not been implemented yet. Another limitation of the current implementation is the inability to deal with interprocedural control flow that does not follow a call flow pattern such as setjmp/longjmp and exception handling. A third limitation is the inability to deal with fabricated return statements such as the one shown in Figure 4.10. These limitations do not affect our ability to analyze the Linux kernel because it does not use these mechanisms. Analyzing user space code, where exceptions and dynamically generated code are common, would require enhancing the operator to address these limitations. mov (%esp), %edx add $4, %esp jmp *%edx Figure 4.10: Fabricated return statement. 57 4.7.4 Allocations The Alloc operator tracks the allocation of objects on the heap and issues an- notations whenever an object is allocated or freed. The Linux kernel provides multiple ways to allocate memory. Some are obvious such as kmalloc() which work similarly to its user-level equivalent malloc(), others such as kmem cache alloc take a pointer to a memory pool as an argument instead of an explicit size. It is also possible to bypass these allocators and allocate memory directly from the page allocator using alloc pages(). Many of these allocating functions also have variants that are used for debugging or on NUMA architecture. Implementing the Alloc operator required a large amount of domain- specific knowledge, such as how to map memory pool pointers to allocation sizes, but these details are abstracted by the Alloc and Release annotations which identify points in the trace where heap objects are allocated and released, and provide information about each allocation such as the object’s address and size. 
Another difficulty with tracking allocations is that with most allocators the parameters used to determine the size — either the size itself or a pointer to a memory pool — are passed on invocation, but the pointer to the object is only available on return. The allocation operator takes advantage of the pairing provided by the Invoke and Return annotations to match the object with its size. 4.7.5 Other Operators The current library operators also contains other operators which are used in the analyses described in the next chapter. There is a memory trace operator which extracts every memory read and write operation and produces an annotation for each of them. This operator is used in many analyses such as the heap slicing analysis presented in Section 5.4. There is also a Lock operator which tracks the use of synchronization primitives that provide mutual exclusion. A summary of the functionality of many of the operators currently available with Tralfamadore is described in Table 4.1. 58 4.8 Examining State Many operators need to inspect the state of registers or memory at specific points during execution. As an example, the Invoke operator tracks the content of the stack pointer (esp) to produce Invoke and Return annotations than can easily be paired using the value of the stack pointer. The state of a register can be inspected by having the operator track the state of that register, monitoring the stream for annotations that update the specific register. The value of the register at a specific point in time corresponds to the last update to that register prior to the inspection. Tralfamadore’s register state tracking is optimized for the common case that registers are updated often but are seldom inspected, and the cur- rent implementation takes advantage of the trace parser’s lazy evaluation described in Section 4.7.1 to track the content of registers cheaply. BBlk annotations produced by the Trace Parser operator contain the instructions executed and their updates on the state of the system. As described in Section 4.7.1, the updates are parsed lazily by being kept in a raw unparsed form until explicitly needed. The instructions can be used to cheaply determine if a register is updated without parsing the raw updates (e.g., mov (%esp), %eax updates %eax). Determining if the instructions in a BBlk annotations update a register can be done in a pre-processing phase when the Instruction Log is first read from disk. The steps to update the state of a register consist of checking a flag to determine if the annotation updates the register, and updating a pointer to the annotation if the test succeeds. When an operator needs to inspect the content of a register, the last BBlk annotation to update the register is parsed to read the current register value. The state of memory can be inspected in two ways. Because the trace contains memory reads, any memory value that is used directly by the traced execution can be obtained by scanning the trace forward. The value of a function argument passed on the stack can be obtained by scanning the trace forward until the argument is loaded from memory and used. For cases where the desired values are not available by scanning for memory 59 read or write operations, Tralfamadore provides a memory index that can be queried to obtain the content of memory at a specific address and point in time. Using the memory index to inspect the content of memory at a spe- cific address and point in time is a three step process. 
First, the virtual address is translated into a physical address using a table of TLB updates and invalidation, and the guest page tables, if required. This translation procedure is described in more detail in Section 3.3.4. Second, the memory index is consulted to determine the timestamp of the most recent update to the requested physical memory location that is prior to the requested point in time. The memory index is implemented as a binary tree where the keys are time ranges and the values are bit vectors of the bytes that have been updated in that time range. The key of the root node corresponds to the entire trace, and the time range of a leaf node is a single epoch representing a region of trace containing a predefined number of memory updates. The bit vectors are very sparse and are run-length encoded to save space. Third, the trace is revisited at the timestamp returned by the memory index to read the value of the requested memory update. Because the time- stamps returned by the index correspond to epochs each containing multiple memory updates, it is possible that within an epoch, the requested memory update will occur after the requested time. In this case, the second and third step are repeated to determine and inspect the previous epoch that updates the requested physical location. 4.9 Optimizations As mentioned in Section 2.3, the Tralfamadore analysis engine described in this thesis is a complete rewrite of the previous implementation. The previous version of the analysis engine had numerous performance problems, resulting in many analyses executing very slowly. This section describes some of the optimizations that were implemented to address the performance issues with the previous implementation. These optimizations have not been benchmarked in isolation but overall, analyses executing under the current 60 version of Tralfamadore perform one to two orders of magnitude faster than with the previous implementation. The first important optimization is the lazy parsing of trace records. This optimization is part of the design of the Trace Parser operator previously described in Section 4.7.1. The remaining optimizations are described below. 4.9.1 C Parser Efficiently parsing binary formatted files is difficult to do in a high-level language such as OCaml. A similar challenge exists with parsing network packets [53]. One difficulty comes from the inability to directly dereference and inspect the values contained in a memory buffer. Instead, the content of the buffer must first be copied into an object of the appropriate type (such as an OCaml integer) and then inspected. This transfer can also require the creation of temporary substrings which require a heap allocation and a memory copy. The previous version of the Tralfamadore analysis engine was entirely implemented in OCaml and profiling runs indicated that a large fraction of time was spent parsing the trace. The current implementation sidesteps this problem by using a much faster trace parser entirely written in C. The current C implementation can parse at speed exceeding 200 MBps compared to 15 MBps for the previous OCaml implementation. 4.9.2 Integer Boxing The types provided by OCaml to represent 32-bit and 64-bit integers are boxed. A boxed integer is represented at runtime by a pointer to a heap allocated object containing the integer value. OCaml also provides unboxed integers which do not require a heap allocation, but can only represent 31-bit integers. 
Unfortunately, most of the values contained in a trace are 32-bit register updates and 32-bit addresses, and cannot be represented with un- boxed integers. The previous implementation of Tralfamadore used boxed integers to represent these values, adding a level of indirection, and increas- ing the stress on the garbage collector. The current version of the analysis engine can be configured to execute on 64-bit hosts where unboxed integers 61 are 63-bit wide. Most entries in the trace can now be represented without using boxed integers. 4.9.3 Polymorphic Code The OCaml language supports polymorphic code which can be used to ma- nipulate objects of arbitrary types. A good example are built-in data struc- tures such as lists and arrays that can store any types of objects. To simplify the design of generic code, OCaml provides support for polymorphic com- parison operators (=, <, ≥, etc.). These operators are very useful but result in substantially slower code, especially when comparing integers. By default, the OCaml compiler will generate polymorphic code, and much effort was invested in profiling and adding type signatures to force the compiler to spe- cialize generic code and avoid polymorphic comparison operators. In many cases, the specialization replaced calls to generic comparison functions such as caml equal() with a single cmp instruction. 4.10 Evaluation This section presents a general performance evaluation of the trace anal- ysis engine using simple generic analyses. More complex analyses will be described and evaluated in Chapter 5. The goal is to quantify the perfor- mance of the Tralfamadore analysis engine compared to native execution and analyses executing under a replay-based environment such as the one provided by Aftersight [23]. The evaluation is based on the three bench- marks described in Section 3.4: Wget, Kernel, and Postmark. Wget is a network intensive workload, Kernel is a CPU intensive workload, and Postmark is a disk intensive workload. The Tralfamadore analyses evaluated in this section are compared against three data points: native execution, an estimation of replay under QEMU, and an estimation of replay under a heavily instrumented version of QEMU. The latter two points are extrapolated from the Qemu and TQemu-NoLog numbers from Table 3.1, and the last point is meant to estimate the cost of 62 replaying in a version of QEMU modified to support a heavyweight analy- sis that instruments most instructions and memory accesses. The next two sections present the performance of Tralfamadore in terms of relative over- head compared to these three data points. To avoid disk caching effects, the buffer cache is flushed at the start of every experiment described in the next two sections and in Chapter 5. Because deterministic replay executes off a log, idle time such as waiting for network traffic can be squashed during replay. To simulate this effect, the CPU time (which excludes all idle time) is used to estimate the replay time. In an actual virtual machine replay-based environment, not all idle time can be compressed during replay since some IO operations such as disk accesses still occur. Therefore, the estimation of replay used in this thesis is optimistic; it underestimates the replay time by eliminating all idle time. This evaluation compares the overhead between Tralfamadore and an opti- mistic estimate of replay, and is therefore pessimistic. The relative overhead of Tralfamadore compared to replay is most likely lower than presented in this thesis. 
As mentioned in Section 3.4, time dilation [35] is used to slow down the execution of the guest running inside QEMU. Time dilation unfortunately renders timing measurements done within the guest inaccurate. To avoid this problem, the replay time is estimated as the wall clock time of the benchmark under the QEMU variants (Qemu or TQemu-NoLog), which is measured externally and is accurate, minus the native idle time (which is also accurate). Equation 4.1 shows the formula used to estimate the replay time.

    QEMU_replay = QEMU_wall − (Native_wall − Native_cpu)        (4.1)

4.10.1 Scanning Analyses

The following presents the performance evaluation of Tralfamadore analyses that scan the trace, reading it in its entirety from start to finish. The overhead of two analyses is presented: the null analysis, and a memory trace analysis. The goal of the null analysis is to serve as a baseline to evaluate the minimum slowdown of an analysis that must scan the entire trace. The memtrace analysis extracts all accesses to memory from the trace, and represents a good example of a heavier analysis where the content of most trace records is examined in detail. Figures 4.11 and 4.12 present the performance in terms of relative slowdown of these two analyses for the three benchmarks described in Section 3.4.

[Figure 4.11: Relative overhead of the null analysis. Bar chart of slowdown factors for the wget, kernel, and postmark benchmarks, relative to native execution, estimated QEMU replay, and estimated QEMU replay with instrumentation.]

The results indicate that the overhead relative to native execution, in terms of time needed to complete an analysis, is comparable to heavyweight online analysis tools such as Valgrind, and is acceptable in the context of dynamic analysis.

The results also indicate that for mostly user-level CPU intensive workloads such as kernel compilation, the performance of Tralfamadore is comparable to replaying execution in QEMU as done in Aftersight [23]. Although the relative overhead compared to replay is very high for the Postmark benchmark, it is important to understand that this benchmark is very disk intensive, and therefore the replay time is most likely underestimated. As mentioned at the beginning of this section, most of the idle time spent waiting for disk IO would be present in an actual replay environment.

[Figure 4.12: Relative overhead of the memory trace analysis. Bar chart of slowdown factors for the wget, kernel, and postmark benchmarks, relative to native execution, estimated QEMU replay, and estimated QEMU replay with instrumentation.]

4.10.2 Indexed Analyses

Figures 4.13, 4.14, and 4.15 show the relative overhead of extracting the value of a function argument for all calls to that function. This analysis is a good example of an analysis that uses indices to only read relevant sections of trace. To evaluate the performance of argument value extraction, the functions that appeared in a trace are classified into a set of bins based on their calling frequency. The bin labelled N contains all functions that were called between N and 1.5N times in a trace (e.g., functions in the bin labelled 100 were called between 100 and 150 times). Five functions were randomly selected from each bin, and for each selected function, the time to extract the argument values for all calls was measured. The results show the minimum, maximum, and mean time for each bin in terms of relative slowdown.
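For concreteness, the small helper below shows how the replay estimate of Equation 4.1 and the slowdown factors plotted in these figures can be derived. The numbers in the usage example are made up for illustration only.

    (* Times are in seconds; the replay estimate follows Equation 4.1. *)
    let estimated_replay_time ~qemu_wall ~native_wall ~native_cpu =
      qemu_wall -. (native_wall -. native_cpu)

    let slowdown ~analysis_time ~baseline_time = analysis_time /. baseline_time

    let () =
      (* Hypothetical numbers, for illustration only. *)
      let replay = estimated_replay_time ~qemu_wall:300.0 ~native_wall:120.0 ~native_cpu:40.0 in
      Printf.printf "estimated replay: %.1f s, slowdown vs. replay: %.2fx\n"
        replay (slowdown ~analysis_time:500.0 ~baseline_time:replay)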
For small numbers of calls (less than 1000), the performance is comparable to or can exceed that of native execution or replay. For large numbers of calls (1000000), disk seeks degrade performance substantially; re-running the benchmarks on a warm buffer cache improves the performance by one to two orders of magnitude. The Tralfamadore analysis engine currently issues disk requests one at a time. The performance could be improved by batching IO requests to provide the operating system disk scheduler with more outstanding requests.

[Figure 4.13: Normalized overhead to get argument values (wget). Slowdown factor by number of calls (1 to 1000000), relative to native execution, estimated QEMU replay, and estimated QEMU replay with instrumentation.]

[Figure 4.14: Normalized overhead to get argument values (kernel). Slowdown factor by number of calls (1 to 1000000), relative to native execution, estimated QEMU replay, and estimated QEMU replay with instrumentation.]

[Figure 4.15: Normalized overhead to get argument values (postmark). Slowdown factor by number of calls (1 to 1000000), relative to native execution, estimated QEMU replay, and estimated QEMU replay with instrumentation.]

4.11 Conclusion

This chapter presented the motivations, design, implementation, and a generic performance evaluation of the Tralfamadore analysis engine. The dataflow architecture described in this chapter aims to address two problems: first, that doing dynamic analysis on CPU-level recordings of execution introduces a semantic gap; second, that existing frameworks force analysis writers to constantly relearn low-level aspects of execution. These two issues are especially important when doing analyses of operating system execution. The design of the Tralfamadore analysis engine addresses these two problems with a modular approach to structuring dynamic analyses that provides a gradual reconstruction of execution semantics and makes analyses composable and reusable.

Chapter 5 Applications

5.1 Introduction

This chapter describes some of the analyses and tools that have been built using the Tralfamadore framework. The goal of this chapter is to demonstrate how the analysis engine described in the previous chapter can be used to construct analyses that are composed from simple reusable components called operators. All of the analyses described in this chapter are built from existing operators. Reusing existing operators provides many benefits. Many low-level aspects and intricacies of the platform being analyzed, such as how function arguments are passed from caller to callee, how interrupts are handled on x86, or how Linux switches from one thread to another, are abstracted by the annotations provided by existing operators. As a result, a new analysis requires less code to be written, and this code tends to be higher-level and to focus on the analysis itself instead of on the details of the platform being analyzed.

The first two analyses described in this chapter are program understanding tools targeted at new and existing Linux kernel developers. The Linux kernel is very large and complex. Developers unfamiliar with the code of the kernel can spend hours understanding it. These two tools help the developer by extracting dynamic information that shows how the code behaves at runtime. The goal is to help a developer answer the “what does this piece of code do?” question.
The first application provides control-flow paths for every call to a specific function. This information helps a developer quickly become familiar with the blocks of code that are frequently used, and with which combinations of code blocks are common. The second application is based on a new analysis called heap slicing, which is the fourth contribution of this thesis, and is aimed at helping developers understand complex data-driven subsystems such as a network protocol stack or a file system. The focus of this thesis is on building dynamic analyses for the Linux kernel, but many of the analyses described in this chapter could be easily adapted to user space programs, or to other operating systems.

Section 5.5 describes analyses that perform retroactive assertions. Like regular assertions, retroactive assertions validate that an invariant is never violated at runtime. But unlike regular assertions, they do so posthumously, allowing the validation of assertions that had not been foreseen at the time of execution. We describe two such assertions: an ownership violation checker and a stack limit checker.

5.2 Understanding Execution

Reasonably simple applications of static analysis are often used to assist developers in navigating large bases of source code. Editors have “tag” facilities, and often also allow developers to use a search facility that is tied to a language-specific parser in order to find things like the declaration of a specific variable or the definition of some function. Dynamic techniques have also been applied to help developers understand how source behaves under execution by exposing details such as profiling and coverage information. Because Tralfamadore captures complete executions, significantly more powerful tools can be provided for understanding program execution.

The next two sections describe a set of tools to extract information about a program's runtime behavior and present it to users interactively. These tools use Tralfamadore as a backend to extract and summarize trace information, and present the results using a GUI-based frontend. The current frontend implementation is a web-based source code browser and is built as an extension to the Mercurial revision control source code browser. The web frontend is practical and can be used to easily demonstrate Tralfamadore's capabilities from a web browser, but it would be relatively easy to build additional frontends such as an Eclipse plugin.

Section 5.3 describes a tool to enhance a standard source code view with dynamic control flow annotations, providing hints on how a function is used at runtime. Section 5.4 presents a new analysis called heap slicing that provides a rich summarization of how heap allocated objects are used across their lifetime.

5.3 Control Flow Annotations

A major challenge to understanding the way that an individual function or data structure is used is identifying the code that uses it. Static analysis, or even simpler techniques,12 are useful, but hardly sufficient. First, following indirection is challenging using static analysis. Second, static tools do not provide any insight into the relative frequency of access, making it difficult for developers to start with the “common case”.

12 e.g., grep

The control flow tool annotates a function in source with the different control flow paths that travel through it, and allows developers to choose between focusing on the common case, outliers, or unexecuted code, depending on their interest.
Figure 5.1: The six callers to mutex_lock and the absence of lock contention.

Figure 5.1 shows the Linux mutex_lock function annotated with trace data. Locking semantics aside, this function represents the common idiom of an accessor function guarding a specific variable, in this case a mutex. The annotated code immediately provides the developer with three useful pieces of information. First, while mutex_lock is called over 8000 times in the trace, there is never any lock contention. The __mutex_fastpath_lock statement shown on line 92 is an inline assembly macro which tries to acquire the lock and calls __mutex_lock_slowpath if the lock is contended. If there were lock contention, a box highlighting a call to __mutex_lock_slowpath would be shown. Second, the frequency of each path is displayed at the top of the function. Because the calling context is taken into account when labelling individual paths, even with a single path through mutex_lock, six paths are displayed due to the six different callers. Finally, the box at the bottom shows where control returns to for each path; this allows the developer to immediately focus on pipe_read as the most frequent caller.

5.3.1 Control Flow Indirection

In the example above, we revealed the users of a function. In many cases, the inverse operation can also be very useful. When calls are performed indirectly, e.g., through function pointers, static tools are easily stymied. While dynamic analysis does not necessarily present a complete inventory of control flow targets for a given site, it does allow detailed insight into real invocations.

Figure 5.2: Following an indirect call in sysenter_entry.

Figure 5.2 shows one of Linux's system call entry points, sysenter_entry. On line 340, the system call number in register eax is used as an offset into a jump table. Tralfamadore annotates this statement with a list of the jump targets that are taken in the trace. As with the accessor example above, it clearly presents a ranked list of system calls, providing the developer with an intuitive sense of the common uses of the underlying code.

Because it uses traces, it can display useful information that is not available from profile-based tools. For example, it is easy to see that of the 4 system calls invoked in this slice of trace data, one (sys_clone) always calls syscall_exit_work after it returns, one (sys_open) never calls it, and two (sys_read, sys_write) sometimes do but usually don't.

5.3.2 Path Analysis

Figure 5.3: Processing 3 packet types in eth_type_trans.

In large functions, the set of possible paths through the code can quickly become obscured by a cascade of complex conditional jumps; this number is typically far less than 2^conditionals due to inter-branch dependencies. By presenting the set of actual paths taken, Tralfamadore makes it much easier to understand these dependencies. As a simple example, consider the eth_type_trans function in Figure 5.3. We can clearly see that there are three distinct control flows corresponding to the type of the Ethernet packets being processed (multicast, broadcast, or normal). The distribution of these packet types during the trace is also presented, allowing a developer to better understand the actual workload being inspected. For instance, in this example 5,988 packets are multicast or broadcast versus only 97 normal packets, implying that the host is engaged in relatively little active network communication during the trace period.
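A minimal sketch of the kind of summarization the frontend performs on the extracted traces is shown below: per-invocation paths, represented here simply as lists of basic-block addresses, are collapsed into distinct paths with frequency counts, most frequent first. The representation is illustrative; the actual tool also takes calling context and return targets into account.

    (* Collapse per-invocation basic-block traces into ranked distinct paths. *)
    let summarize (invocations : int64 list list) : (int64 list * int) list =
      let counts = Hashtbl.create 16 in
      List.iter
        (fun path ->
           let n = match Hashtbl.find_opt counts path with Some n -> n | None -> 0 in
           Hashtbl.replace counts path (n + 1))
        invocations;
      Hashtbl.fold (fun path n acc -> (path, n) :: acc) counts []
      |> List.sort (fun (_, a) (_, b) -> compare b a)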
5.3.3 Control Flow Extraction

We now describe the Tralfamadore analysis used to extract the control flow traces for a given function. The original Tralfamadore paper [50] presents a preliminary implementation of this analysis that extracted all paths for all functions in a single post-processing phase. The version described here is an updated version that builds and summarizes the paths for a selected function in an online manner.13

13 The paths shown in Figures 5.1, 5.2, and 5.3 were extracted using the original version. The updated analysis produces the same information.

The control flow extraction analysis is broken into two stages, each implemented as a Tralfamadore operator. The CFlowTrace operator filters basic blocks executed by the target function and labels them with a unique ID corresponding to a specific call instance. Basic blocks with the same ID are part of the same function invocation trace. Because function calls can be nested and context switches can occur at any point, invocation traces will be arbitrarily multiplexed. The second operator is a simple sink operator that demultiplexes, and forwards the control flow traces to a frontend where they are summarized and presented.

The CFlowTrace operator uses the Invoke operator to detect when the target function is being called and when it returns. These two annotations serve as the start and end for each trace. The operator also uses the Invoke operator to detect when the target function calls another function and when the callee returns, to respectively suspend and resume tracing of the target function. The operator uses annotations produced by the Context operator to detect when a trace is being suspended due to a context switch or an interrupt. It also uses the function index to get trace positions prior to the target invocations to have the necessary register state. The operator uses the striding interface provided by the merge operator to skip sections of trace containing no invocations of the target function. Figure 5.4 shows the operator configuration for extracting control flow traces.

[Figure 5.4: Control flow extraction. Diagram of the operator configuration: the trace, the Trace Parser, the function and context indices, and two Merge operators feed the Invoke and CFlowTrace operators, which produce a control flow summary.]

Many of the low-level aspects of the x86 architecture and Linux are abstracted away by the annotations provided by the Invoke and Context operators. The CFlowTrace operator consists of less than 250 lines of OCaml code, including comments and code to gather runtime statistics, and is agnostic to Linux or the x86 architecture.
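The core call/return bookkeeping can be pictured with the following simplified sketch, which labels the basic blocks of a chosen target function with a call-instance ID while suspending during nested calls. It is illustrative only: the event type stands in for the Invoke, Return, and BBlk annotations, and context-switch handling via Context annotations is omitted.

    type event =
      | Invoke of int64   (* address of the function being entered *)
      | Return
      | BBlk of int64     (* address of a basic block *)

    type labelled = { call_id : int; bblk : int64 }

    let label_target (target : int64) (events : event list) : labelled list =
      let next_id = ref 0 in
      let depth = ref 0 in      (* 0 = outside the target function *)
      let current = ref 0 in
      List.filter_map
        (fun ev ->
           match ev with
           | Invoke f when f = target && !depth = 0 ->
               incr next_id; current := !next_id; depth := 1; None
           | Invoke _ when !depth > 0 -> incr depth; None   (* nested call: suspend *)
           | Return when !depth > 0 -> decr depth; None
           | BBlk addr when !depth = 1 -> Some { call_id = !current; bblk = addr }
           | _ -> None)
        events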
5.3.4 Control Flow Extraction Performance

Figures 5.5, 5.6, and 5.7 show the performance of the control flow extraction analysis for the Wget, Kernel, and Postmark benchmarks. The Wget benchmark retrieves the content of a web site using the wget command. It is a network IO intensive benchmark. The Kernel benchmark is a compilation of the Linux kernel. It is CPU intensive with utilization close to 100%. Postmark is a disk IO intensive benchmark that simulates a mail server. The benchmarks were described previously in Section 3.4. For each benchmark, 25 functions were randomly selected, and a control flow trace was extracted for each function invocation. Each bar represents a specific run on one of the selected functions, indexed by the number of calls present in the trace. The names of the functions and their respective call frequencies are shown in the table above each graph. The graph's y-axis on the left provides the relative overhead compared to native execution, and the y-axis on the right provides the wallclock time required to complete a run.

For up to a few thousand calls per run, the overhead is often less than native, meaning that it is faster to use Tralfamadore to obtain the results than by executing the program, excluding the overhead of online instrumentation. For most functions, the analysis completes quickly. The paths for all 25 selected functions in the Wget benchmark, 22 of the 25 functions in the Kernel benchmark, and 20 of the 25 functions in the Postmark benchmark can be extracted in under ten minutes. Although extracting the paths for certain functions can take much longer (up to two hours in the worst case), partial results are returned almost immediately. As soon as a call instance returns, its control-flow path can be streamed to the presentation frontend. Similarly to extracting function argument values (Section 4.10), the total amount of time spent on disk seeks becomes substantial for large numbers of function calls and hurts performance.

[Figure 5.5: Relative overhead and time to extract control flow (wget). The accompanying table lists the 25 selected functions and their call frequencies, ranging from eth_header (1 call) to serial_in (3,894,371 calls); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis) for each function, indexed by number of calls.]

[Figure 5.6: Relative overhead and time to extract control flow (kernel). The accompanying table lists the 25 selected functions and their call frequencies, ranging from inet_twsk_alloc (1 call) to spin_lock (8,766,876 calls); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis) for each function, indexed by number of calls.]
[Figure 5.7: Relative overhead and time to extract control flow (postmark). The accompanying table lists the 25 selected functions and their call frequencies, ranging from do_wait (1 call) to spin_lock (23,351,794 calls); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis) for each function, indexed by number of calls.]

5.4 Heap Slicing

This section describes a dynamic analysis called heap slicing designed to help a developer understand data-driven subsystems such as the networking stack in the Linux kernel. Heap slicing extracts the set of statements that have touched a set of heap objects between their allocation and release. It produces a representation of where and how these objects are used throughout their lifetime. The goals of heap slicing are different from those of program slicing [84, 6], a program analysis technique which extracts the program statements that have affected the value of a variable at a specific point in a program. This set of statements, called a program slice, helps developers focus on the relevant portion of a program when trying to understand how and why a variable holds a certain value. The goal of heap slicing is more holistic, as it is to provide a view of how an entire data-driven subsystem operates at runtime. Section 5.4.1 motivates heap slicing with an example, and Section 5.4.2 describes its implementation in more detail.

5.4.1 An Example: Linux Networking

The Linux network stack is an illustrative example of the challenges faced by developers in understanding unfamiliar code. This code base has a number of features that contribute to the difficulty of understanding it.

    int deliver_skb(struct sk_buff *skb, struct packet_type *pt,
                    struct net_device *dev)
    {
        atomic_inc(&skb->users);
        return pt->func(skb, skb->dev, pt, dev);
    }

Figure 5.8: Packet delivery via type-specific function pointer from the protocol table.

    int netif_rx(struct sk_buff *skb)
    {
        struct softnet_data *q;

        q = &__get_cpu_var(softnet_data);
        __get_cpu_var(netdev_rx_stat).total++;
        __skb_queue_tail(&q->input_pkt_queue, skb);
        return NET_RX_SUCCESS;
    }

    int process_backlog(..., int quota)
    {
        struct softnet_data *q;

        q = &__get_cpu_var(softnet_data);
        do {
            struct sk_buff *skb;
            skb = __skb_dequeue(&q->input_pkt_queue);
            netif_receive_skb(skb);
        } while (++work < quota);
        return work;
    }

Figure 5.9: Top/bottom half network processing.

First, it is very large. According to SLOCCount,14 there are 342,536 lines of code in the net subdirectory of the Linux kernel (version 2.6.24). Restricting our attention to the most relevant subcomponents (ipv4, netfilter, core, sched, and ethernet) reduces the count to 113,110 lines. Second, it spans all the layers of kernel abstractions from system calls through device drivers. Third, it uses dynamic control flow techniques that are challenging for static tools to understand, such as function pointers.
Figure 5.8 shows the packet delivery function, which dispatches the packet through a type-specific function pointer. Fourth, the functionality of the network stack is split according to the standard Linux top/bottom driver architecture: the “top half” executes in interrupt context, and must complete as quickly as possible. It simply receives a packet from a device driver and queues it. Actual packet processing (the “bottom half”) is deferred until interrupts have been serviced. Figure 5.9 shows the code used by network card interrupt handlers to queue received packets and the processing function that runs asynchronously to dequeue packets and push them up the protocol stack. This causes standard call-flow analysis to be unable to capture the end-to-end processing of packets from arrival to final delivery.

Static analysis tools that perform call flow analysis cannot deal with this level of indirection, but even if they could, constructing a static call graph only tells part of the story. A large portion of system code is there to handle errors and rare corner cases. Static analysis cannot guide a developer to concentrate on the important or most commonly used functionality of the network stack, nor can it identify unused or outlying code paths.

The network stack is inherently data-driven and has a single data structure (the packet) at its core. Developers who want to understand the stack typically want to begin with understanding how a specific subset of packets flows through it. What if a developer could point to a function in the source code and ask, “Where are packets that are processed by this function coming from and where are they going to?” The answer to this question for the function arp_rcv is illustrated in Figure 5.10, which shows a graph of the functions that access any packet which passes through arp_rcv, from its allocation to its eventual release.

[Figure 5.10: ARP packet lifetime through the network stack. A graph of the function chains traversed by ARP packets, from __alloc_skb, ei_receive, eth_type_trans, and netif_rx in interrupt context, through process_backlog, netif_receive_skb, arp_rcv, and arp_process, to their release in __kfree_skb, with counts showing how many packets took each path.]

14 http://www.dwheeler.com/sloccount/
Function names in the first, shaded text box are executed in device interrupt context, while subsequent processing occurs in a soft irq. Numbers to the left of the figure indicate the number of packets that take the indicated path through the code. Looking at the figure, the developer can quickly learn that packets are allocated by the function __alloc_skb, are operated on at the ARP level primarily by the function arp_process, and are eventually freed by kfree_skb.

Having understood this simple flow for ARP packets, the developer might then ask about packets processed by another function, perhaps for a more complex protocol such as TCP. The flow of packets going through the function tcp_v4_do_rcv is shown in Figure 5.11. The graph is much larger, representing the increased complexity of TCP processing relative to ARP.

[Figure 5.11: TCP packet receive. A considerably larger graph of the paths taken by received packets through ip_rcv, tcp_v4_rcv, tcp_v4_do_rcv, tcp_rcv_established, tcp_recvmsg, and related functions, from their allocation in __alloc_skb to their release in __kfree_skb, annotated with packet counts for each path.]

5.4.2 Heap Slicing in Detail

The figures in this section were generated by extracting heap slices from a trace of a virtual machine running the Apache web server on Linux.

Heap slicing starts with a single program statement using a pointer to a heap object. The current implementation is limited to function arguments, but it could easily be generalized to arbitrary statements. The Tralfamadore implementation of heap slicing relies on the fact that analyses are consistent across runs, and performs the analysis in two passes over a trace. Although heap slicing could be implemented in an online manner using a single pass, such an implementation would be very inefficient because it would have to track all heap objects, and then discard objects that were never touched by the statement of interest. The Tralfamadore implementation only needs to track accesses to the set of objects that will actually be part of the slice, reducing the overhead of the analysis.

Heap slicing starts with a function and one of its arguments as inputs. The only requirement is that the argument be a pointer. In the first pass, the value of the pointer is extracted at all points in time where the function is called. This pass uses the function index to only visit the call sites, skipping most of the trace.
The (timestamp, pointer) tuples extracted in the first pass are cross-referenced with a heap allocation index to determine the allocation and release time for each pointer/heap object. In the second pass, for every object, the analysis scans the trace from allocation to release to extract all accesses to the object throughout its lifetime. The result is an access trace for each object. The results are forwarded to the frontend where they are summarized in a tree in which each access trace is inserted from the root down using longest prefix matching. The root of the tree corresponds to the allocation function and the leaf nodes are deallocation functions. A given analysis can produce more than one graph if the objects were allocated by more than one function.

The two pass approach means that, from an execution point of view, the analysis effectively goes back in time from the point where the selected function is called to the point where the object is allocated. A naive implementation of the analysis will repeatedly scan sections of the trace with overlapping live objects. As will be shown in Section 5.4.3, overlapping objects are common, and a naive approach will result in the entire trace being scanned multiple times.

The current implementation uses the Merge operator described in Section 4.6.3 to extract access traces for every object in a single scan, reading relevant portions of the execution trace once, and striding over sections of trace without live objects. Figure 5.13 shows the core of the HeapTrace operator, which extracts the object access traces. This example shows how an operator reuses existing operators to compose a new analysis. The operator writer need not be concerned with specific knowledge of the Linux kernel heap allocation functions or of which x86 instructions touch memory. This knowledge is abstracted away by the Allocation and MemTrace operators. Figure 5.12 shows the operator configuration used to extract heap slices. The sink operator simply forwards the traces obtained from the HeapTrace operator to the frontend.

[Figure 5.12: Heap slicing configuration. Diagram of the operator configuration: the trace, the Trace Parser, the trace and allocation indices, and a Merge operator feed the MemTrace and HeapTrace operators, whose output goes to the heap slicing analysis.]

    module HeapTraceOp = struct

      (* This operator acts as a filter, removing all annotations not
         part of an object heap trace. *)
      let rec next st () =
        (* Pull the next annotation from the merge operator *)
        let a = st.m.M2Op.next () in
        match a.Annot.attr with
        | `Alloc alloc -> (
            st.live_obj <- add_to_live_obj st.live_obj alloc;
            a )
        | `Release rel -> (
            st.live_obj <- del_from_live_obj st.live_obj rel;
            if live_obj_is_empty st.live_obj then (
              (* If there is no live obj, seek to the next allocation. *)
              try st.m.M2Op.seek_1 () with EOF -> () );
            a )
        | `MemTrace mt -> (
            (* If this memtrace annotation corresponds to an access to a
               live heap object, produce a heap trace annotation. *)
            try
              (* Find the allocation matching the access *)
              let alloc = find_in_live_obj st.live_obj mt.MemTrace.addr in
              let off = Guest32.sub mt.MemTrace.addr alloc.Alloc.base in
              `HeapTrace { base = alloc.Alloc.base;
                           off = off;
                           sz = mt.MemTrace.sz;
                           access = mt.MemTrace.access }
            with Not_found -> next st () )
        | _ -> next st ()
      ...
    end

Figure 5.13: HeapTrace operator.
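Figure 5.13 shows the second pass over the trace. For completeness, a rough sketch of the first pass is given below, under simplifying assumptions: call_sites stands in for a lookup in the function index combined with argument-value extraction, and lookup_object stands in for a query against the allocation index. The record types and helpers are hypothetical, not the real Tralfamadore interfaces.

    type call_site = { ts : int64; arg_ptr : int64 }
    type lifetime  = { alloc_ts : int64; release_ts : int64; base : int64; size : int }

    let first_pass
        ~(call_sites : unit -> call_site list)                  (* via the function index *)
        ~(lookup_object : int64 -> int64 -> lifetime option)    (* via the allocation index *)
        : lifetime list =
      (* Objects allocated or released outside the trace have no index entry
         and are simply dropped, as noted in Section 5.4.3. *)
      call_sites ()
      |> List.filter_map (fun c -> lookup_object c.ts c.arg_ptr)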
5.4.3 Heap Slicing Performance

Figures 5.14, 5.15, and 5.16 show the performance of heap slicing. Each figure shows a table of the functions used as target statements and the number of heap objects sliced for each statement. For each benchmark, 25 functions that take a heap pointer as an argument were randomly selected, and for all instances of each function, an access trace for the corresponding heap object is extracted. All benchmarks report fewer than 25 slices because for certain functions, no entries in the allocation index were present, the heap objects having been allocated and/or released outside the trace time frame. The bar graph below each table shows the time to extract all the objects' access traces for a given target statement. The y-axis on the left is the relative overhead compared to running the benchmark natively, and the axis on the right is the wall clock time required to complete the run.

The figures show that the performance of extracting a specific heap slice depends on both the number of objects to scan and their lifetimes. For example, in the kernel benchmark it takes over an order of magnitude less time to retrieve the 2186 object traces for the ext3_get_block function than to retrieve the 657 objects for the free_pipe_info function. The reason is that these 657 objects are longer lived than the ext3_get_block objects, and a larger amount of trace needs to be scanned. Figure 5.18 shows the amount of trace examined to extract all the objects' access traces for the same functions. Extracting the slice for free_pipe_info requires scanning the entire trace, while less than 2% of the trace needs to be examined for ext3_get_block's objects.

Figures 5.17, 5.18, and 5.19 demonstrate the benefits of extracting every access trace in a single pass over the trace as described in Section 5.4.2. This is achieved by using the Merge operator, which allows writing analyses that stride over the trace, only reading the relevant parts. The graphs show the fraction of trace examined for both the striding and a naive approach, for the runs shown in Figures 5.14, 5.15, and 5.16. One of the benefits of the striding approach is that the worst case is bounded. The analysis will read at most the entire trace once. On the other hand, the amount of trace read with a naive approach can degenerate when a large number of overlapping objects are present, as shown with the results for the Postmark benchmark. In all cases, the striding approach will always examine at most the same amount of trace as a naive approach.

Although the overhead can be quite high for extracting access traces for a large number of long-lived objects, it is in the same range as heavyweight dynamic analyses such as Memcheck [75]. It is also important to note that partial results can be produced very quickly. As soon as an object is deallocated, its access trace can be forwarded to the frontend and inserted in the graph.

[Figure 5.14: Relative overhead and time to extract heap slices (wget). The accompanying table lists the target functions and the number of heap objects sliced for each, from proc_reg_release (1 object) to inotify_inode_queue_event (43,010 objects); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis).]
[Figure 5.15: Relative overhead and time to extract heap slices (kernel). The accompanying table lists the target functions and the number of heap objects sliced for each, from icmp_rcv (1 object) to security_inode_permission (4,528 objects); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis).]

[Figure 5.16: Relative overhead and time to extract heap slices (postmark). The accompanying table lists the target functions and the number of heap objects sliced for each, from pipe_write_release (1 object) to mark_inode_dirty (8,251 objects); the bars show the slowdown factor relative to native execution (left axis) and the wall clock time in seconds (right axis).]

[Figure 5.17: Fraction of trace examined (wget), with and without striding, on a logarithmic scale, for the runs of Figure 5.14.]

[Figure 5.18: Fraction of trace examined (kernel), with and without striding, on a logarithmic scale, for the runs of Figure 5.15.]

[Figure 5.19: Fraction of trace examined (postmark), with and without striding, on a logarithmic scale, for the runs of Figure 5.16.]

5.5 Retroactive Assertions

The use of assertions is common practice when attempting to debug, understand, or otherwise validate assumptions about the way that a system is behaving. Unfortunately, they typically must be made a priori, either by compiling assert statements into source or inserting dynamic probes prior to running a system. Recent work (VAssert15) partly solves this problem by allowing assertions on applications running in a virtual machine to be validated during replay, but while they allow the evaluations of assertions to be deferred, the assertions themselves are no more powerful since they must be written prior to compiling the program similarly to regular assertion statements.

15 http://www.vmware.com/pdf/ws65_vassert_programming.pdf

Using Tralfamadore it is possible to write retroactive assertions that would be very difficult to express in the traditional style for two reasons. First, unlike traditional assertions (and ones written with VAssert), Tralfamadore assertions can be expressed at the binary level, providing the ability to validate properties such as stack height that are difficult to express at the source level. Second, Tralfamadore assertions are not limited to validating a property at a specific point in a program.
They can easily express properties that need to be validated at a much finer grain (e.g. on every instruction), effectively holding over an entire execution. The next two subsections describe analyses that exemplify these two properties.

Ownership Violation

For simplicity and scalability, applications aim to use as little locking as possible. In Linux, for example, many functions assume that their objects are never accessed in interrupt context. They may use runtime assertions such as BUG_ON(in_interrupt()) to catch some violations, but it is difficult to be assured that all possible accesses have been guarded.

Tralfamadore makes it straightforward to validate whether such an assumption holds throughout the duration of a trace, and can additionally produce a detailed report about any violations. Here we present a simple ownership assertion that detects whenever an object is accessed by contexts other than the one that allocated it, which may be checked against all instances of a given object type within the trace. The algorithm is a refinement of the heap slicing algorithm presented in Section 5.4 and is implemented by reusing the HeapTrace operator (and other operators) along with a single new operator that implements the following algorithm: each object’s lifetime is looked up from the allocation index, and the moment of allocation is then cross-referenced with the context index to identify the allocation context. The trace can then be scanned linearly from the time of allocation to the point where the object is freed. Whenever the object is touched, the accessor context is recorded. If the set of accessor contexts contains multiple elements, then an ownership violation has occurred. If a violation is detected, a subsequent analysis can retrieve the necessary information to produce a graph showing the exact sequence of accesses that produced the violation, overlaid with the control flow of both the owner and the violator.

Figure 5.20 shows a portion of the report produced by a test case we created, in which an owner thread creates an object, and a violator running as a timer interrupt handler accesses it.

Figure 5.20: Ownership violation example.

[Histogram: stack usage in bytes (0–3000) on the x-axis; frequency on a log-scale y-axis (1 to 10M).]

Figure 5.21: Distribution of stack sizes.

Stack Usage

Stack space in the kernel is at a premium: each thread of each process needs its own stack, and the address space is small. For this reason, Linux includes a compilation option to use 4K stacks instead of the default 8K. Unfortunately, stack usage is a highly dynamic property, depending on a combination of control and data (for example, a recursive function’s stack usage can vary wildly depending on input). As a result, while the developers believe that 4K should suffice, and that it would allow many more threads, the default remains at 8K. The kernel developers are washing their hands of the problem, leaving it to the end user to decide whether to take the chance of stack corruption in exchange for better scalability.

By recording the target workload with Tralfamadore, the actual amount of stack used can be easily measured. We constructed a simple operator that tracks all updates to the stack pointer. Running it on a kernel compile workload similar to the one described in Section 3.4 produces the usage distribution shown in Figure 5.21.
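As a rough illustration of the stack-depth measurement just described, the following sketch builds the distribution of kernel stack usage from a stream of trace records. It is not the actual operator: the record fields (in_kernel, esp) and the stack_top_of helper are assumptions made for the example.

```python
from collections import Counter

def stack_usage_histogram(records, stack_top_of):
    """Track every update to the stack pointer and histogram kernel stack
    depth, in the spirit of the analysis behind Figure 5.21.

    `records` is an iterable of trace records with assumed `in_kernel` and
    `esp` fields; `stack_top_of(record)` is an assumed helper returning the
    base address of the current thread's kernel stack."""
    histogram = Counter()
    prev_esp = None
    for record in records:
        if not record.in_kernel:
            prev_esp = None              # tracking restarts on kernel entry
            continue
        if record.esp != prev_esp:       # count only actual stack pointer updates
            depth = stack_top_of(record) - record.esp
            histogram[depth] += 1
            prev_esp = record.esp
    return histogram
```

The maximum observed depth and the number of times it was reached fall out of the histogram directly (max(histogram) and histogram[max(histogram)]), which is how figures such as the maximum reported below can be obtained.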
The analysis also reveals the exact moment at which the stack reaches its largest point (in this case, the maximum size of 2960 bytes was reached 14 times). A query for the call stack at that point returns the stack shown in Figure 5.22, showing that the cause of the high stack usage was a network interrupt being handled while the kernel was deep in the processing of a sys_write system call. Further inspection revealed that all 14 occurrences were at this location, due to the interrupt handler repeatedly processing a backlog of packets. This easily reveals the motivation for the decision of the Linux developers to enable a separate interrupt stack when the 4K stack option is enabled.

sysenter_past_esp/sys_write/vfs_write/do_sync_write/
ext3_file_write/generic_file_aio_write/
__generic_file_aio_write_nolock/generic_file_buffered_write/
ext3_write_begin/block_write_begin/__block_prepare_write/
ext3_get_block/ext3_get_blocks_handle/ext3_new_blocks/
__ext3_journal_get_write_access/journal_get_write_access/
do_get_write_access/__get_free_pages/__alloc_pages/
get_page_from_freelist/common_interrupt/do_IRQ/irq_exit/
do_softirq/__do_softirq/net_rx_action/process_backlog/
netif_receive_skb/ip_rcv/ip_rcv_finish/ip_local_deliver/
ip_local_deliver_finish/tcp_v4_rcv/tcp_v4_do_rcv/
tcp_rcv_established/__tcp_push_pending_frames/tcp_transmit_skb/
ip_queue_xmit/ip_output/ip_finish_output/dev_queue_xmit/
__qdisc_run/pfifo_fast_dequeue

Figure 5.22: Maximum stack depth seen (kernel compile).

5.6 Conclusion

This chapter described some of the applications that have been built using Tralfamadore. These analyses are designed to help developers understand complex pieces of code such as the Linux kernel, or to assert hypotheses they may have about the runtime behaviour of a program. This chapter demonstrated that building these analyses is relatively simple because they are built by reusing existing operators. This reusability is advantageous because these operators abstract away many low-level aspects of execution which are challenging and time-consuming to understand.

The focus of this thesis is on building dynamic analyses for the Linux kernel, but because many of the aspects that are specific to Linux or the x86 architecture have been abstracted away by existing operators, many of the analyses described in this chapter could easily be ported to user space or to other operating systems.

Chapter 6

Related Work

6.1 Introduction

This chapter presents a survey of the existing work on dynamic analysis related to Tralfamadore. One of the goals of this chapter is to compare Tralfamadore with existing dynamic binary analysis frameworks to demonstrate that Tralfamadore is the first framework to address all of the problems first described in Section 2.1.

This chapter also examines some of the work on deterministic record-replay, describing some of the recent proposals to support recording execution on multi-processor systems. This body of work is complementary to Tralfamadore in that the techniques described could also be used to decouple capturing execution from trace generation [87].

The rest of the chapter surveys trace collection techniques, dynamic analyses that could be implemented using Tralfamadore, trace query languages, execution indexing techniques, and lightweight OS instrumentation frameworks. Much of this work is complementary to Tralfamadore, but some of these techniques or approaches could be used to improve future versions of Tralfamadore.
6.2 Dynamic Binary Analysis Frameworks

The goal of this section is to validate the claim that Tralfamadore is the first offline dynamic analysis framework that supports the analysis of operating system code and provides a programmable interface specifically designed for dynamic analysis. As described in Section 3.2.3, the traces recorded with Tralfamadore are complete, meaning that analyses based on these traces have access to the register and memory state on every instruction boundary. This section compares Tralfamadore with other frameworks that provide access to the same level of information. This includes traditional dynamic binary instrumentation (DBI) frameworks such as Pin [52], and replay-based frameworks such as Aftersight [23].

Table 6.1 lists the frameworks that meet the completeness requirements and compares them based on the following properties, derived from the problems described in Section 2.1 and the design goals of the trace capture facility (Chapter 3) and of the analysis engine (Chapter 4).

• Does the framework support offline analyses? An offline approach makes it possible to execute multiple analyses on a single execution and enables analyses that can time-travel, arbitrarily going backward and forward in time.

• Does the framework support the analysis of kernel execution?

• Is the framework transparent, or does it require explicit modifications to the system or program being analyzed?

• Does the framework provide a programmable interface designed for dynamic analysis?

• Does the framework provide mechanisms to build analyses that are composable and reusable?

Composability and reusability of analyses is supported by none of the frameworks in Table 6.1 except SimOS [71], which partially supports this property. To avoid needless repetition, this last property will only be discussed when comparing Tralfamadore and SimOS.

• fully supported   ◦ partially supported   - not supported

Framework              Offline  Kernel  Transparent  Programmable Interface  Composable & Reusable
Tralfamadore           •        •       •            •                       •
Aftersight             •        •       •            -                       -
SimOS                  -        •       •            •                       ◦
ATOM                   -        -       -            •                       -
Valgrind               -        -       ◦            •                       -
DynamoRIO              -        -       ◦            •                       -
Pin                    -        -       ◦            •                       -
PinOS                  -        •       ◦            •                       -
JIFL                   -        •       -            •                       -
Nirvana                •        -       ◦            •                       -
PinSEL                 •        -       ◦            •                       -
PinPlay                •        -       ◦            •                       -
BitBlaze               -        •       •            •                       -
Time-travelling VMs    •        •       •            ◦                       -
Omniscient debugging   •        -       ◦            ◦                       -

Table 6.1: A comparison of dynamic binary analysis frameworks.

Aftersight [23] proposes to decouple dynamic analysis from execution by using deterministic record-replay to efficiently record the execution of a virtual machine and replay this execution in a whole-system emulator (QEMU). Like Tralfamadore, Aftersight supports whole-system or kernel-level offline analyses, but it does not provide an instrumentation API, forcing users to directly embed their instrumentation and analysis into QEMU’s micro-operations.

SimOS [71] explores the idea of using a machine simulator to study the execution of entire systems. Similar to Tralfamadore, SimOS proposes the use of annotations to capture events at a higher semantic level. As with Tralfamadore, annotations can be recursive (annotations can be built from other annotations). SimOS can operate in two modes: emulation mode, which uses dynamic binary translation, and accurate mode, which executes the system under a cycle-accurate simulator. Analyses performed in emulation mode execute in an online manner, and do not provide the benefits of offline analysis.
Analyses performed in accurate mode may be repeatable if the simulation is deterministic, but the overheads in this mode are much higher than with Tralfamadore, ranging between 180 and 6400× for a uniprocessor simulation [72]. Although analyses built with SimOS are composable, SimOS does not address the issue of reusability. Also, SimOS does not support indices to efficiently navigate execution, and therefore does not provide mechanisms such as annotation caching to deal with stateful analyses, as described in Section 4.6.

Traditional binary instrumentation frameworks are fundamentally different from Tralfamadore. These tools are online tools and are “stuck in time”: they are limited to analyzing a program as it executes instead of analyzing an execution. ATOM [80] is one of the first binary instrumentation frameworks. It uses static binary rewriting to insert instrumentation that executes inlined with the target program. This approach is not transparent because the program being analyzed needs to be explicitly modified.

Pin [52], Valgrind [63], and DynamoRIO [15] are process-level dynamic binary instrumentation (DBI) frameworks. They use dynamic binary translation, a technique pioneered with the Shade simulator [26], to add instrumentation to a program as it executes. Because both the analysis and the target program execute in the same address space, analysis writers are forced to use clever techniques to avoid clashing with the target program. This technique does not require explicit modifications to the program being analyzed, but it is not completely transparent because it may cause memory conflicts between the program being analyzed and the analysis.

ATOM, Pin, Valgrind, and DynamoRIO are unable to instrument and analyze the execution of operating system code. PinOS [16] addresses this issue by extending Pin with the ability to do whole-system analysis. Because it uses virtualization, most of the instrumentation framework sits outside of the system being analyzed, but the instrumentation code and data still need to be embedded within the system, potentially causing memory clashes as with Pin and Valgrind. JIFL [64] and ongoing work at the University of Toronto [31] propose to embed a DBI within the Linux kernel. This approach allows adding arbitrary dynamic instrumentation to the Linux kernel but requires the operating system to be explicitly modified to add support for a just-in-time compiler.

Nirvana [13] is a user-level offline dynamic analysis framework based on a variant of record-replay. Instead of explicitly logging all the external inputs and non-deterministic events, Nirvana captures this information implicitly by logging the complete register state after kernel-to-user transitions and non-deterministic instructions such as rdtsc, and by logging all values read from memory by the processor while executing. Logging every memory load is expensive, and to reduce the size of the log, Nirvana uses a cache to keep track of the last value read for any given address. A value loaded from memory is recorded on a cache miss or if the cached value is different from the value just read. Nirvana also updates the cache on memory writes, so that only the first load after an external modification to memory needs to be recorded. External modifications can be caused by the operating system, DMA, or writes from another processor.
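The load-value caching scheme just described can be summarized with a short sketch. This is not Nirvana’s implementation; the shape of the event stream is an assumption made purely for illustration.

```python
def log_nondeterministic_loads(events):
    """Return the subset of memory loads a deterministic replayer could not
    reconstruct on its own.

    `events` is an assumed stream of ("load" | "store", address, value)
    tuples in execution order. A load is logged on a cache miss, or when the
    value read differs from the last value seen at that address, i.e. memory
    was changed externally by the OS, DMA, or another processor."""
    cache = {}   # last value seen at each address
    log = []
    for kind, addr, value in events:
        if kind == "store":
            # The processor's own writes are predictable and never logged,
            # but they refresh the cache so that only the first load after
            # an external modification is recorded.
            cache[addr] = value
            continue
        if addr not in cache or cache[addr] != value:
            log.append((addr, value))
        cache[addr] = value
    return log
```

During replay, the logged values are injected at the corresponding loads, while every other load is reproduced by re-executing the deterministic code itself.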
Nirvana provides an instrumentation API similar to the one provided by ATOM, but unlike Tralfamadore, it does not support capturing and analyzing the execution of operating system code. Nirvana can record and analyze unmodified programs, but it is not completely transparent. Similar to tools like Pin and Valgrind, the Nirvana execution recorder is embedded in the address space of the program being recorded, which can cause memory conflicts.

PinSEL [60] extends Pin to support replay-based offline analysis by using an approach similar to Nirvana [13] to automatically detect changes to the register and memory state. PinPlay [66] extends PinSEL with a software-simulated cache coherency protocol to deterministically replay the execution of a process on a multi-processor machine. PinSEL and PinPlay provide the same level of transparency as Pin.

BitBlaze [77] is an online whole-system binary analysis framework based on QEMU. Like Tralfamadore and Aftersight, BitBlaze analyzes execution at the processor level, which provides a transparent approach to dynamic analysis and support for the analysis of operating system code. Because BitBlaze is an online framework, it is also “stuck in time” and can only analyze programs as they execute.

Time-travelling virtual machines [41] use deterministic replay to allow a debugger attached to a Linux kernel to step execution backwards as well as forwards. VMware Workstation now includes a record and replay facility [87] that can be used for replay debugging [3] of recorded software. Time-travelling virtual machines support the offline analysis of operating systems, but the interface provided to analyze execution is limited to the context of debugging (e.g. breakpoints, watchpoints, etc.).

Omniscient debugging [1, 21, 51, 69] is an approach to debugging where the execution of a program is stored and indexed using techniques borrowed from program slicing [84], allowing debugging to work “backwards in time.” All existing implementations of omniscient debugging are limited to user-space programs and provide an interface limited to debugging.

6.3 Deterministic Record-Replay

As discussed in Section 2.4, it is possible to use deterministic virtual machine record-replay to capture a lightweight event log of a running system, which can be reconstituted into a full trace during a later replay phase [87]. Tralfamadore could use this complementary technique to reduce the overhead on execution incurred by the trace collection facility. This section gives an overview of the work on record-replay both at the process and system level, focusing on research efforts aimed at supporting record-replay on multi-processor systems.

Jockey [74], Flashback [79], DejaVu [22], and more recently R2 [34] support single-processor process-level recording and replay. These process-level record and replay tools require users to manually write stubs around, or add annotations to, system or library calls to capture their updates to the state of the target process. As mentioned in the previous section, Nirvana [13] and PinSEL [60] avoid the need for stubs by logging memory reads performed by the processor to implicitly record the changes to register and memory state. RecPlay [70] and Instant Replay [47] are process-level record-replay tools that log the order in which locks are acquired and can be used to replay multi-processor executions of race-free programs.

The approach used by Nirvana and PinSEL can also be used to record the execution of multi-processor systems, with one caveat.
The re-execution of the log for a given processor will result in that processor re-executing the same sequence of instructions and producing the same output. Because the ordering of writes by different processors is not recorded, re-execution does not guarantee the ability to reconstruct an identical memory state. Still, useful analyses targeted at multi-processor systems have been built using Nirvana [61]. As mentioned in the previous section, PinPlay extends PinSEL with a simulated cache coherency protocol to support reconstructing an exact memory state, but suffers from much higher recording overhead (50 to 150×) in doing so.

SMP-ReVirt [29] is a multi-processor version of ReVirt [28] that uses hardware page protection to log the ordering of reads and writes to shared memory. The performance of this approach is highly sensitive to the degree of shared-memory communication and is prone to false sharing due to the page-size granularity provided by the MMU. Scribe [43] records the execution of processes running on multiple processors by using a technique similar to SMP-ReVirt [29], but only on user-level pages.

6.3.1 Hardware Support for Multi-Processor Replay

Deterministic multi-processor record and replay is an active area of research in the computer architecture community. Flight Data Recorder (FDR) [86] logs the ordering of data races by interposing on a directory-based cache coherency protocol. BugNet records memory loads in a manner similar to Nirvana [13] and PinSEL [60] by tracking cache misses to detect memory changes by other threads or the operating system. BugNet uses a similar approach to FDR to record the ordering of data races between threads. Strata [59] extends BugNet to support snoop-based cache coherency protocols. Lee et al. [48] propose to combine BugNet’s load-based approach with an offline search algorithm using a constraint solver to determine the ordering of memory races. This approach does not require any modifications to the cache coherence protocol. Only processor-local modifications are needed, such as logging cache misses, substantially reducing the hardware modifications needed compared to other hardware-based replay approaches.

DeLorean [57] proposes a different approach: instead of tracking individual shared memory accesses, programs execute in atomic chunks of instructions. Similar to transactions, if the execution of a chunk does not conflict with the execution of chunks on other processors, then the chunk commits; otherwise it is rolled back. The advantage of this approach is that it is sufficient to record the total order in which chunks commit, and that it is amenable to weaker memory consistency models. Previous hardware-based recorders assumed a sequential consistency model. Rerun [37] proposes a similar model, but instead of rollbacks, the number of instructions executed without conflict is recorded.

6.3.2 Other Approaches to Record-Replay

To address the overhead of recording multi-processor execution in software, recent work has explored the idea of using weaker forms of recording. PRES [65] and ODR [7] do not record the order of memory races and use search techniques to find a suitable ordering that reproduces a bug [65] or produces the same output [7]. Neither of these tools guarantees deterministic re-execution as described in Section 2.4. The sequence of instructions executed and the state of the system may differ from one re-execution to another.
6.4 Collecting Analysis-Specific Traces

Using execution traces is an old idea and much work has been done on how to collect traces. This section briefly surveys existing techniques and systems that only capture a specific aspect of execution, such as memory access traces or the control flow of a program as it executes. These traces are incomplete in the sense that they do not contain sufficient information to reconstruct the complete state of the program or system being analysed.

Ball and Larus [10] propose algorithms to reduce the amount of instrumentation needed to profile and trace the control flow of programs. Whole Program Paths [45] proposes an efficient representation of the complete control flow of a program by capturing all acyclic paths executed and then expressing these paths as a context-free grammar. Traceback [9] uses static binary rewriting to record control-flow traces. Traceback does not require source code and is applicable to programs written in any language.

ATUM [5] is a technique to capture whole-system memory address traces by modifying the microcode of instructions that access memory. The authors describe an implementation for the VAX 8200 processor, but the technique should work on any processor whose microcode can be modified. A follow-up paper [76] describes an updated facility to collect traces on multi-processor systems. In both cases, the approach requires a microprogrammed processor. Perl and Sites [68] describe a software-only approach using binary patching. Code is patched with trap instructions that invoke trace-collecting routines to record branch outcomes and address information. The paper describes an implementation specific to the Alpha architecture using PAL code, but the approach should be portable to most architectures.

Abstract execution [44] is a general technique to reduce the overhead of recording traces by breaking the process into a two-step procedure. Given a set of events to collect from the execution of a given program, abstract execution generates two versions of the program: the first is the original program instrumented to collect a log; the second is a simplified version of the program whose purpose is to consume the log and generate the desired trace of events. There is a tension between how much information is captured in the log and how much execution needs to be done by the second version to generate the desired trace. Abstract execution can be used to collect expensive traces, such as memory traces, more efficiently. Interestingly, deterministic record-replay and Tralfamadore’s approach to capturing traces can be seen as two extreme forms of abstract execution. In the case of replay, the first version of the program only records the input and non-deterministic events, and the second version is designed to deterministically re-execute the program using the log. With Tralfamadore, the second version of the program does not re-execute at all but simply reads the trace generated by the first version.

6.5 Offline and Whole-System Analyses

This section describes research on specific analyses, or frameworks designed for specific analyses. Some of these analyses execute offline or analyze the execution of an entire system. This work is related to Tralfamadore in the sense that many of these analyses could be built with Tralfamadore, or are similar to analyses that have been built with Tralfamadore.
Bouncer [27] and SAGE [32] use Nirvana [13] to perform symbolic execution offline, to automatically generate input filters that block malicious input [27] or to synthesize new test cases for systematic testing [32]. Truscan [61] also uses Nirvana to detect data races offline and automatically classify them as harmful or not, by re-executing both outcomes of a race and examining the results. LiteRace [54] is an example of an offline race detector: the race detection analysis executes from a previously recorded trace containing all calls to synchronization primitives and a sample of memory accesses. Because the traces recorded with Tralfamadore are complete, any race detection algorithm can be implemented as a Tralfamadore analysis.

TaintBochs [25] uses a mix of online taint tracking and offline trace-based analysis to study the lifetime of sensitive information such as passwords. TaintBochs uses a modified version of the Bochs emulator [46] to record traces that contain all writes to memory and all updates to a specific subset of registers. One could use Tralfamadore to also implement the taint propagation offline, using a recorded execution to experiment with different taint propagation policies. Dataflow tomography [58] proposes the use of whole-system taint tracking as a tool to understand complex computer systems. This project is complementary to Tralfamadore, meaning that Tralfamadore could be used to implement the various taint policies described in the paper. Using Tralfamadore would have provided the authors with the benefits of deterministic analysis, in addition to using a high-level language instead of dealing directly with QEMU’s dynamic binary translation engine.

IntroVirt [39] proposes to use virtual machine deterministic replay to retroactively detect the presence of an intrusion. IntroVirt is the original source of inspiration for the idea of retroactive assertions described in Section 5.5. While IntroVirt checks are expressed at the source level, Tralfamadore retroactive assertions are written as binary analyses and can validate assertions that are difficult to express at the source level, such as detecting a stack overflow.

6.6 Querying Execution

Several projects have explored the notion of querying program execution, either by providing an explicit query interface to the end user or by using machine learning to discover execution patterns. Two recent projects from the programming languages community, relational queries over program traces [33] and the program query language PQL [55], have proposed query interfaces to allow developers to search for interesting aspects of program behaviour. The scalable omniscient debugging project [69] also explored query interfaces to assist in navigation. A suitable query interface would be an excellent tool for interacting with traced execution. These query languages are designed around specific language runtimes and are not designed to query operating system execution. They operate at a higher level and rely on runtime information provided by the operating system or the language runtime. Tralfamadore analyzes execution at the hardware level, below the operating system, and needs to explicitly reconstruct this information.

Projects such as Daikon [30] and the work on mining specifications [8] have used machine learning techniques to extract likely invariants or to infer correct behaviour from a recorded execution trace.
Tralfamadore is designed as a platform that can support both manual and automatic investigation of runtime application properties, and could be used to apply these techniques to extract likely invariants from operating system execution.

6.7 Execution Indexing and Simplification

Control-flow [85] and memory [81] indexing techniques have been designed to unambiguously label points in a program execution matching a certain control-flow pattern or memory state. Hoffman et al. [36] propose to structure traces according to semantic views, which are sections of an execution trace corresponding to the execution of a specific thread, method, or object. These indexing techniques are complementary to the work described in this thesis, in that they could be used to build richer indices for traces recorded with Tralfamadore.

Whole Execution Traces (WET) [88] is a generic trace format with the same completeness properties as traces collected with Tralfamadore. WET traces embed control and data flow information in the trace to provide efficient backward navigation. This information could be added to the Tralfamadore trace format to simplify going backward to get register and memory definitions.

Execution Reduction [82], Tinertia [38], and Execution Fast Forwarding [89] are trace simplification techniques that analyze an execution trace containing events such as system calls, context switches, and shared memory accesses, to generate a reduced version of the event log that, if executed, would still exhibit a specific property of the original execution, such as reproducing a concurrency bug [82, 38] or reaching a specific point in the execution [89].

6.8 Lightweight Kernel Instrumentation

DTrace [17], SystemTap (http://sourceware.org/systemtap/), and LTTng (http://lttng.org/) provide lightweight dynamic instrumentation of operating system kernels using software probes. Although this approach has zero overhead when no instrumentation is present, the technique is only suitable for lightweight analyses. For example, probe-based instrumentation cannot be used to build tools such as race detectors. The focus of these tools is different from that of Tralfamadore: their goal is to support very lightweight online analyses executing on production systems. Tralfamadore is not designed with the goal of minimizing overhead and aims to support arbitrary offline analyses.

6.9 Conclusion

This chapter compared Tralfamadore with existing dynamic binary analysis frameworks in order to demonstrate a key contribution of this thesis: Tralfamadore is the first offline dynamic analysis framework to support the analysis of operating system execution and provide a programming interface suitable for dynamic analysis. This chapter also surveyed some of the work on deterministic record-replay. This technique is complementary but highly relevant to Tralfamadore because it can be used to decouple execution recording and trace generation, reducing the overhead caused by collecting detailed CPU traces.

Chapter 7

Conclusion

This thesis described the design and implementation of Tralfamadore, a dynamic program analysis framework that addresses three issues with existing approaches. First, traditional dynamic binary instrumentation (DBI) frameworks are online tools. As mentioned in Section 1.1, these tools are “stuck in time,” meaning that they can only analyze a program as it executes. Any program behaviour one wishes to examine must be anticipated by choosing an analysis prior to running the program.
Second, most DBIs lack support for analyzing the execution of an operating system kernel. While developers of user-level programs are provided with a wide selection of tools such as race and leak detectors, kernel developers are left with very crude techniques such as manually inserting printk statements. This is unfortunate because kernel code is some of the most complex and challenging code to understand and debug. Finally, existing frameworks force analysis writers to constantly re-learn low-level aspects of execution.

Although these three issues have previously been investigated in isolation, some only partially, Tralfamadore is the first dynamic analysis framework to address all three at once. An important distinction between Tralfamadore and most existing dynamic analysis approaches is that Tralfamadore analyses execute completely offline, based on detailed CPU traces.

Chapter 2 provided an overview of the Tralfamadore framework and introduced its two core components: the trace capture facility and the analysis engine. The trace capture facility records a system’s execution by capturing an instruction-level trace. Users can then instantiate one or more analyses to execute over the recorded trace. Each analysis executes as an instance of the analysis engine configured for that specific analysis. Chapter 2 also argued for the benefits of using traces. A key benefit of this approach is that analyses are decoupled from instrumentation. Analysis writers need not be concerned with the intricacies of instrumenting the program or system they wish to analyze. Analyses based on traces are simple user-level programs that can be written in any language and use arbitrary libraries.

Chapter 3 described the trace capture facility. The design of the trace capture facility and the trace format is motivated by three requirements. First, the recorded trace should be comprehensive, meaning the facility should be able to capture the execution of an entire system, including its operating system. Second, the trace capture facility should be able to capture execution transparently; it should support recording the execution of unmodified systems. Finally, the trace should be complete: it should contain sufficient information to be able to reconstruct the register and memory state on every instruction boundary. This level of information provides Tralfamadore analyses with the same power as analyses that execute inlined with the program or system being analyzed. Chapter 3 also presented an evaluation of the overhead of recording and storing traces. This evaluation demonstrated that although the traces are very large, it is practical to record the execution of an operating system kernel given the current capacity and price of storage. The overhead on execution is substantial but similar to the overhead imposed by heavyweight instrumentation frameworks such as Valgrind [63].

To capture execution comprehensively and transparently, the trace capture facility effectively records execution at the processor level, below the operating system. Analyzing execution at this level inherently introduces a semantic gap between the information contained in the trace and the level of information expected by a developer. Chapter 4 described the design and implementation of the Tralfamadore analysis engine, which addresses this problem.

The Tralfamadore analysis engine is structured as a dataflow engine, similar to the architecture of streaming databases and modular software routers such as Click [42].
Analyses are built by composing simple reusable components called operators into a tree-like graph. The trace is treated as a stream and flows through the configuration of operators. Each operator inspects the stream, looking for specific execution patterns, and inserts annotations into the stream that correspond to higher-level events such as function calls, context switches, or heap allocations. Upstream operators are inherently low-level, dealing directly with the trace; downstream operators increasingly rely on previously inserted annotations and are gradually shielded from low-level aspects of execution. This gradual reconstruction of execution semantics encourages the design of analyses that are composable and reusable. The architecture of the analysis engine makes it easy to reuse existing operators. Low-level aspects of execution need to be understood once, and this task can be left to a domain expert.

Chapter 5 described some of the analyses that have been built with Tralfamadore. It described a new analysis called heap slicing, designed to help developers understand data-driven software components such as network protocol stacks or file systems. The chapter also demonstrated that Tralfamadore analyses can be built mostly from existing operators, often requiring a single new operator consisting of a few hundred lines of code. This code tends to be relatively high-level, since many of the low-level aspects of the system being analyzed have been abstracted away by existing operators.

Finally, Chapter 6 compared Tralfamadore with existing dynamic analysis frameworks, validating that Tralfamadore is the first framework to comprehensively address the issues described at the start of this chapter. It also surveyed existing work on deterministic record-replay, a technique that Tralfamadore could leverage to decouple trace generation from execution recording, greatly reducing the overhead currently incurred when capturing traces.

To summarize, this thesis makes four contributions. It presents the first offline dynamic analysis framework that supports the analysis of kernel code and provides a programmable interface tailored for dynamic analysis. It proposes the design of a dynamic analysis engine structured as a dataflow engine, and demonstrates the benefits of this approach. It also describes the Context operator, designed to reconstruct and label individual flows of execution from a CPU-level trace. Finally, it describes heap slicing, a new analysis aimed at helping developers understand complex data-driven subsystems.

Bibliography

[1] C/C++ trace-based debugger based on chronicle and eclipse. http://code.google.com/p/chronomancer/.

[2] Intel 64 and ia-32 architectures software developer’s manual volume 3b: System programming guide. http://www.intel.com/products/processor/manuals/.

[3] Replay debugging on linux. http://www.vmware.com/pdf/ws7_replay_linux_technote.pdf.

[4] Valgrind-based complete, indexed recording of process execution. http://code.google.com/p/chronicle-recorder/.

[5] A. Agarwal, R. L. Sites, and M. Horowitz. Atum: a new technique for capturing address traces using microcode. In Proceedings of the 13th annual international symposium on Computer architecture, ISCA ’86, pages 119–127, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press.

[6] H. Agrawal and J. R. Horgan. Dynamic program slicing. In PLDI ’90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, 1990.

[7] G.
Altekar and I. Stoica. Odr: output-deterministic replay for multicore debugging. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP ’09, pages 193–206, New York, NY, USA, 2009. ACM. [8] G. Ammons, R. Bod́ık, and J. R. Larus. Mining specifications. In Symposium on Principles of programming languages, 2002. 116 [9] A. Ayers, R. Schooler, C. Metcalf, A. Agarwal, J. Rhee, and E. Witchel. Traceback: first fault diagnosis by reconstruction of distributed control flow. In Proceedings of the 2005 ACM SIGPLAN conference on Pro- gramming language design and implementation, PLDI ’05, pages 201– 212, New York, NY, USA, 2005. ACM. [10] T. Ball and J. R. Larus. Optimally profiling and tracing programs. ACM Trans. Program. Lang. Syst., 16:1319–1360, July 1994. [11] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualiza- tion. In Proceedings of the nineteenth ACM symposium on Operating systems principles, SOSP ’03, pages 164–177, New York, NY, USA, 2003. ACM. [12] F. Bellard. QEMU, a fast and portable dynamic translator. In ATC’05: USENIX Annual Technical Conference, 2005. [13] S. Bhansali, W.-K. Chen, S. de Jong, A. Edwards, R. Murray, M. Drinić, D. Mihočka, and J. Chau. Framework for instruction-level tracing and analysis of program executions. In VEE ’06: Proceedings of the 2nd international conference on Virtual execution environments, 2006. [14] T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst., 14:80–107, February 1996. [15] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In CGO ’03: Proceedings of the inter- national symposium on Code generation and optimization, pages 265– 275, Washington, DC, USA, 2003. IEEE Computer Society. [16] P. P. Bungale and C.-K. Luk. Pinos: a programmable framework for whole-system dynamic instrumentation. In Virtual execution environ- ments, 2007. [17] B. M. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instru- mentation of production systems. In Proceedings of the annual confer- 117 ence on USENIX Annual Technical Conference, ATEC ’04, pages 2–2, Berkeley, CA, USA, 2004. USENIX Association. [18] D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seid- man, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In Very Large Data Bases, 2002. [19] P. M. Chen and B. D. Noble. When virtual is better than real. In Pro- ceedings of the Eighth Workshop on Hot Topics in Operating Systems, pages 133–, Washington, DC, USA, 2001. IEEE Computer Society. [20] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. B. Gibbons, T. C. Mowry, V. Ramachandran, O. Ruwase, M. Ryan, and E. Vlachos. Flex- ible hardware acceleration for instruction-grain program monitoring. In International Symposium on Computer Architecture, 2008. [21] J.-D. Choi, B. P. Miller, and R. H. B. Netzer. Techniques for debugging parallel programs with flowback analysis. ACM Trans. Program. Lang. Syst., 13:491–530, October 1991. [22] J.-D. Choi and H. Srinivasan. Deterministic replay of java multi- threaded applications. In Proceedings of the SIGMETRICS symposium on Parallel and distributed tools, SPDT ’98, pages 48–59, New York, NY, USA, 1998. ACM. [23] J. Chow, T. Garfinkel, and P. M. Chen. Decoupling dynamic program analysis from execution in virtual environments. In ATC’08: USENIX 2008 Annual Technical Conference, 2008. [24] J. Chow, D. 
Lucchetti, T. Garfinkel, G. Lefebvre, R. Gardner, J. Ma- son, S. Small, and P. M. Chen. Multi-stage replay with crosscut. In VEE ’10: Proceedings of the 6th ACM SIGPLAN/SIGOPS interna- tional conference on Virtual execution environments, pages 13–24, New York, NY, USA, 2010. ACM. 118 [25] J. Chow, B. Pfaff, T. Garfinkel, K. Christopher, and M. Rosenblum. Understanding data lifetime via whole system simulation. In Proceed- ings of the 13th conference on USENIX Security Symposium - Volume 13, SSYM’04, pages 22–22, Berkeley, CA, USA, 2004. USENIX Associ- ation. [26] B. Cmelik and D. Keppel. Shade: a fast instruction-set simulator for execution profiling. In Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems, SIG- METRICS ’94, pages 128–137, New York, NY, USA, 1994. ACM. [27] M. Costa, M. Castro, L. Zhou, L. Zhang, and M. Peinado. Bouncer: securing software by blocking bad input. In SOSP ’07: Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 117–130, New York, NY, USA, 2007. ACM. [28] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen. Revirt: enabling intrusion analysis through virtual-machine logging and replay. In OSDI ’02: Proceedings of the 5th symposium on Operating systems design and implementation, 2002. [29] G. W. Dunlap, D. G. Lucchetti, M. Fetterman, and P. M. Chen. Ex- ecution replay on multiprocessor virtual machines. In International Conference on Virtual Execution Environments, 2008. [30] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao. The Daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1–3):35–45, Dec. 2007. [31] P. Feiner, A. D. Brown, and A. Goel. A design for comprehensive kernel instrumentation. In Proceedings of the Sixth international conference on Hot topics in system dependability, HotDep’10, pages 1–16, Berkeley, CA, USA, 2010. USENIX Association. [32] P. Godefroid, M. Y. Levin, and D. Molnar. Automated whitebox fuzz testing. In In NDSS, 2008. 119 [33] S. Goldsmith, R. O’Callahan, and A. Aiken. Relational queries over program traces. In Object-Oriented Programming, Systems, Languages, and Applications, 2005. [34] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: An application-level kernel for record and replay. In Operating Systems Design and Implementation, 2008. [35] D. Gupta, K. Yocum, M. McNett, A. C. Snoeren, A. Vahdat, and G. M. Voelker. To infinity and beyond: time-warped network emulation. In Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3, NSDI’06, pages 7–7, Berkeley, CA, USA, 2006. USENIX Association. [36] K. J. Hoffman, P. Eugster, and S. Jagannathan. Semantics-aware trace analysis. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’09, pages 453–464, New York, NY, USA, 2009. ACM. [37] D. R. Hower and M. D. Hill. Rerun: Exploiting episodes for lightweight memory race recording. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pages 265–276, Wash- ington, DC, USA, 2008. IEEE Computer Society. [38] N. Jalbert and K. Sen. A trace simplification technique for effective debugging of concurrent programs. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE ’10, pages 57–66, New York, NY, USA, 2010. ACM. [39] A. Joshi, S. T. King, G. W. 
Dunlap, and P. M. Chen. Detecting past and present intrusions through vulnerability-specific predicates. In Proceed- ings of the twentieth ACM symposium on Operating systems principles, SOSP ’05, pages 91–104, New York, NY, USA, 2005. ACM. [40] K. Kanoun, Y. Crouzet, A. Kalakech, A.-E. Rugina, and P. Rumeau. Benchmarking the dependability of windows and linux using post- mark&#8482; workloads. In ISSRE ’05: Proceedings of the 16th IEEE 120 International Symposium on Software Reliability Engineering, pages 11–20, Washington, DC, USA, 2005. IEEE Computer Society. [41] S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating systems with time-traveling virtual machines. In ATC ’05: Proceedings of the annual conference on USENIX Annual Technical Conference, 2005. [42] E. Kohler, R. Morris, B. Chen, J. Jannotti, and F. M. Kaashoek. The click modular router. ACM Transactions on Computer Systems, 2000. [43] O. Laadan, N. Viennot, and J. Nieh. Transparent, lightweight applica- tion execution replay on commodity multiprocessor operating systems. In Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS ’10, pages 155–166, New York, NY, USA, 2010. ACM. [44] J. R. Larus. Abstract execution: a technique for efficiently tracing programs. Softw. Pract. Exper., 20:1241–1258, November 1990. [45] J. R. Larus. Whole program paths. In Proceedings of the ACM SIG- PLAN 1999 conference on Programming language design and imple- mentation, PLDI ’99, pages 259–269, New York, NY, USA, 1999. ACM. [46] K. P. Lawton. Bochs: A portable pc emulator for unix/x. Linux J., page 7. [47] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Trans. Comput., 36:471–482, April 1987. [48] D. Lee, M. Said, S. Narayanasamy, Z. Yang, and C. Pereira. Offline symbolic analysis for multi-processor execution replay. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchi- tecture, MICRO 42, pages 564–575, New York, NY, USA, 2009. ACM. [49] G. Lefebvre, B. Cully, , D. Meyer, M. J. Feeley, N. C. Hutchinson, and A. Warfield. Execution mining. Unpublished manuscript, 14 pages, 2010. 121 [50] G. Lefebvre, B. Cully, M. J. Feeley, N. C. Hutchinson, and A. Warfield. Tralfamadore: unifying source code and execution experience. In Pro- ceedings of the 4th ACM European conference on Computer systems, EuroSys ’09, pages 199–204, New York, NY, USA, 2009. ACM. [51] B. Lewis. Debugging backwards in time. In Workshop on Automated Debugging, 2003. [52] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wal- lace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI ’05: Proceed- ings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, 2005. [53] A. Madhavapeddy, A. Ho, T. Deegan, D. Scott, and R. Sohan. Melange: creating a ”functional” internet. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 101–114, New York, NY, USA, 2007. ACM. [54] D. Marino, M. Musuvathi, and S. Narayanasamy. Literace: effective sampling for lightweight data-race detection. In PLDI ’09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language de- sign and implementation, pages 134–143, New York, NY, USA, 2009. ACM. [55] M. Martin, B. Livshits, and M. S. Lam. Finding application errors and security flaws using pql: a program query language. 
In Object-Oriented Programming, Systems, Languages, and Applications, 2005. [56] J. Mickens, J. Elson, and J. Howell. Mugshot: deterministic capture and replay for javascript applications. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, NSDI’10, pages 11–11, Berkeley, CA, USA, 2010. USENIX Association. [57] P. Montesinos, L. Ceze, and J. Torrellas. Delorean: Recording and deterministically replaying shared-memory multiprocessor execution 122 ef?ciently. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pages 289–300, Washington, DC, USA, 2008. IEEE Computer Society. [58] S. Mysore, B. Mazloom, B. Agrawal, and T. Sherwood. Understanding and visualizing full systems with data flow tomography. In Architectural Support for Programming Languages and Operating Systems, 2008. [59] S. Narayanasamy, C. Pereira, and B. Calder. Recording shared mem- ory dependencies using strata. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming lan- guages and operating systems, 2006. [60] S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Au- tomatic logging of operating system effects to guide application-level architecture simulation. In Proceedings of the joint international con- ference on Measurement and modeling of computer systems, SIGMET- RICS ’06/Performance ’06, pages 216–227, New York, NY, USA, 2006. ACM. [61] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data racesallusing replay analysis. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN con- ference on Programming language design and implementation, pages 22–31, New York, NY, USA, 2007. ACM. [62] N. Nethercote and J. Seward. How to shadow every byte of memory used by a program. In VEE ’07: Proceedings of the 3rd international conference on Virtual execution environments, pages 65–74, New York, NY, USA, 2007. ACM. [63] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and im- plementation, 2007. 123 [64] M. Olszewski, K. Mierle, A. Czajkowski, and A. D. Brown. Jit in- strumentation: a novel approach to dynamically instrument operating systems. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 3–16, New York, NY, USA, 2007. ACM. [65] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu. Pres: probabilistic replay with execution sketching on multiprocessors. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP ’09, pages 177–192, New York, NY, USA, 2009. ACM. [66] H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. Pinplay: a framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, CGO ’10, pages 2–11, New York, NY, USA, 2010. ACM. [67] V. Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks, 31(23–24), 1999. [68] S. E. Perl and R. L. Sites. Studies of windows nt performance using dynamic execution traces. In Proceedings of the second USENIX sym- posium on Operating systems design and implementation, OSDI ’96, pages 169–183, New York, NY, USA, 1996. ACM. [69] G. Pothier, E. Tanter, and J. Piquer. Scalable omniscient debugging. 
In Object-Oriented Programming, Systems, Languages, and Applications, 2007. [70] M. Ronsse and K. De Bosschere. Recplay: a fully integrated practical record/replay system. ACM Trans. Comput. Syst., 17:133–152, May 1999. [71] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the simos machine simulator to study complex computer systems. ACM Trans. Model. Comput. Simul., 7:78–103, January 1997. 124 [72] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta. Complete com- puter system simulation: The simos approach. IEEE Parallel Distrib. Technol., 3:34–43, December 1995. [73] O. Ruwase, S. Chen, P. B. Gibbons, and T. C. Mowry. Decoupled life- guards: enabling path optimizations for dynamic correctness checking tools. In Proceedings of the 2010 ACM SIGPLAN conference on Pro- gramming language design and implementation, PLDI ’10, pages 25–35, New York, NY, USA, 2010. ACM. [74] Y. Saito. Jockey: a user-space library for record-replay debugging. In Automated Analysis-Driven Debugging, 2005. [75] J. Seward and N. Nethercote. Using valgrind to detect undefined value errors with bit-precision. In ATEC ’05: Proceedings of the annual con- ference on USENIX Annual Technical Conference, pages 2–2, Berkeley, CA, USA, 2005. USENIX Association. [76] R. L. Sites and A. Agarwal. Multiprocessor cache analysis using atum. In Proceedings of the 15th Annual International Symposium on Com- puter architecture, ISCA ’88, pages 186–195, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press. [77] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang, N. James, P. Poosankam, and P. Saxena. Bitblaze: A new approach to computer security via binary analysis. In Proceedings of the 4th International Conference on Information Systems Security, ICISS ’08, pages 1–25, Berlin, Heidelberg, 2008. Springer-Verlag. [78] S. Sridhar, J. S. Shapiro, E. Northup, and P. P. Bungale. Hdtrans: an open source, low-level dynamic instrumentation system. In VEE ’06: Proceedings of the 2nd international conference on Virtual execution environments, 2006. [79] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou. Flashback: a lightweight extension for rollback and deterministic replay for software debugging. In USENIX Annual Technical Conference, 2004. 125 [80] A. Srivastava and A. Eustace. Atom: a system for building customized program analysis tools. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, PLDI ’94, pages 196–205, New York, NY, USA, 1994. ACM. [81] W. N. Sumner and X. Zhang. Memory indexing: canonicalizing ad- dresses across executions. In Proceedings of the eighteenth ACM SIG- SOFT international symposium on Foundations of software engineer- ing, FSE ’10, pages 217–226, New York, NY, USA, 2010. ACM. [82] S. Tallam, C. Tian, R. Gupta, and X. Zhang. Enabling tracing of long- running multithreaded programs via dynamic execution reduction. In Proceedings of the 2007 international symposium on Software testing and analysis, ISSTA ’07, pages 207–218, New York, NY, USA, 2007. ACM. [83] K. Vonnegut. Slaughterhouse-five, or, the children’s crusade : a duty- dance with death. Dell, 1968. [84] M. Weiser. Program slicing. In ICSE ’81: Proceedings of the 5th inter- national conference on Software engineering, 1981. [85] B. Xin, W. N. Sumner, and X. Zhang. Efficient program execution indexing. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’08, pages 238–248, New York, NY, USA, 2008. ACM. [86] M. 
Xu, R. Bodik, and M. D. Hill. A “flight data recorder” for en- abling full-system multiprocessor deterministic replay. In International Symposium on Computer Architecture, 2003. [87] M. Xu, V. Malyugin, J. Sheldon, G. Venkitachalam, and B. Weissman. Retrace: Collecting execution trace with virtual machine determinis- tic replay. In Third Annual Workshop on Modeling, Benchmarking and Simulation, held in conjunction with the 34th Annual International Symposium on Computer Architecture, 2007. 126 [88] X. Zhang and R. Gupta. Whole execution traces and their applications. ACM Trans. Archit. Code Optim., 2:301–334, September 2005. [89] X. Zhang, S. Tallam, and R. Gupta. Dynamic slicing long running programs through execution fast forwarding. In Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering, SIGSOFT ’06/FSE-14, pages 81–91, New York, NY, USA, 2006. ACM. 127
