Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Binary shuffling : defeating memory disclosure attacks through re-randomization Williams-King, David 2014

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2014_september_williamsking_david.pdf [ 530.77kB ]
JSON: 24-1.0167580.json
JSON-LD: 24-1.0167580-ld.json
RDF/XML (Pretty): 24-1.0167580-rdf.xml
RDF/JSON: 24-1.0167580-rdf.json
Turtle: 24-1.0167580-turtle.txt
N-Triples: 24-1.0167580-rdf-ntriples.txt
Original Record: 24-1.0167580-source.json
Full Text

Full Text

Binary Shuffling: Defeating Memory Disclosure Attacksthrough Re-RandomizationbyDavid Williams-KingB.Sc. Honours Computer Science, University of Calgary, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)July 2014© David Williams-King, 2014AbstractSoftware that is in use and under development today still contains as many bugsas ever. These bugs are often exploitable by attackers using advanced techniquessuch as Return-Oriented Programming (ROP), where pieces of legitimate code arestitched together to form a malicious exploit. One class of defenses against theseattacks is Address-Space Layout Randomization (ASLR), which randomly selectsthe base addresses of legitimate code. However, it has recently been shown thatthis randomization can be unravelled with memory disclosure attacks, which di-vulge the contents of memory at a given address. In this work, we strengthen coderandomization against memory disclosure attacks, in order to make it a viable de-fense in the face of Return-Oriented Programming. We propose a technique calledbinary shuffling, which dynamically re-randomizes the position of code blocks atruntime. While a memory disclosure may reveal the contents of a memory address(thus unravelling the randomization), this information is only valid for a very shorttime. Our system, called Shuffler, operates on program binaries without accessto source code, and can re-randomize the position of all code in a program in aslittle as ten milliseconds. We show that this is fast enough to defeat any attemptat Return-Oriented Programming, even when armed with a memory disclosure at-tack. Shuffler adds only 10 to 21% overhead on average, making it a viable defenseagainst these types of attack.iiPrefaceThis thesis evolved out of a class project in Bill Aiello’s course CPSC 538W, “On-line Privacy”, and from the author’s prior experience with malware and compilertoolchains. The class project was an early single-address-space relocation-basedprototype that nevertheless could dynamically migrate functions (in the absence offunction pointers), and was conceived and written in entirety by the author.After this, the project proceeded with the author writing several prototypesusing different technical mechanisms. The author has written all of the code. BillAiello and Mihir Nanavati advised the project and helped keep it going in a usefuldirection, with Patrick Colp and Kent Williams-King providing valuable feedbackas well.As for publications, Bill Aiello and Patrick Colp helped prepare a submissionto Usenix Security 2014, which was rejected, and also worked towards a submis-sion to ACM Computer and Communications Security (CCS) 2014, which we ulti-mately decided not to submit. We aim to resubmit this work to another conferencein the future.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Executable Space Protection . . . . . . . . . . . . . . . . . . . . 42.2 Return-Oriented Programming . . . . . . . . . . . . . . . . . . . 52.3 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1.1 Location of Shuffling Mechanism: In a Separate Process . 93.1.2 Burning a Core . . . . . . . . . . . . . . . . . . . . . . . 93.1.3 What Shuffler Can Shuffle . . . . . . . . . . . . . . . . . 103.1.4 Heartbeat Operation . . . . . . . . . . . . . . . . . . . . 11iv4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124.1 The Userspace Shuffler Process . . . . . . . . . . . . . . . . . . . 134.1.1 Modifying the Target’s Memory . . . . . . . . . . . . . . 134.1.2 When Do Writes Appear? Or, Cross-Modifying Code . . . 154.1.3 Bootstrapping Shuffler (and Attach/Detach) . . . . . . . . 184.2 Code Discovery and Relocation . . . . . . . . . . . . . . . . . . 194.2.1 How the Build Chain Uses Relocations . . . . . . . . . . 194.2.2 Code Discovery and Disassembly . . . . . . . . . . . . . 204.2.3 Relocating Code at Runtime . . . . . . . . . . . . . . . . 214.3 Trampolines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.3.1 How to Update the Stack . . . . . . . . . . . . . . . . . . 224.3.2 Trampoline Implementation . . . . . . . . . . . . . . . . 244.3.3 Memory Protection . . . . . . . . . . . . . . . . . . . . . 254.4 Completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.4.1 Handling Stale Addresses . . . . . . . . . . . . . . . . . 264.4.2 Handling Shared Libraries . . . . . . . . . . . . . . . . . 284.4.3 Handling Threads and Forks . . . . . . . . . . . . . . . . 284.5 Overall Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 305.1 Shuffler Microbenchmarks . . . . . . . . . . . . . . . . . . . . . 305.1.1 Memory Access Overhead . . . . . . . . . . . . . . . . . 305.1.2 Context Switch Overhead [on Ivy Bridge] . . . . . . . . . 315.1.3 Code Injection Overhead [on Ivy Bridge] . . . . . . . . . 315.1.4 Cache Trampoline Optimization [on Ivy Bridge] . . . . . 325.1.5 Jump Table Optimization [on Ivy Bridge] . . . . . . . . . 325.2 Macrobenchmarks on SPEC CPU2006 . . . . . . . . . . . . . . . 335.3 Running Nginx [on Ivy Bridge] . . . . . . . . . . . . . . . . . . . 356 Security Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 366.1 Memory Disclosure Attacks . . . . . . . . . . . . . . . . . . . . 366.1.1 Analysis of a Memory Disclosure Attack . . . . . . . . . 366.1.2 Why the Attacker Has Insufficient Time . . . . . . . . . . 37v6.2 Other Types of Attacks . . . . . . . . . . . . . . . . . . . . . . . 387 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427.1 Return-Oriented Programming . . . . . . . . . . . . . . . . . . . 427.2 Static Randomization . . . . . . . . . . . . . . . . . . . . . . . . 427.3 Just-In-Time Code Reuse Attacks . . . . . . . . . . . . . . . . . 437.4 Dynamic Randomization . . . . . . . . . . . . . . . . . . . . . . 447.5 Control-Flow Integrity . . . . . . . . . . . . . . . . . . . . . . . 448 Future Work and Discussion . . . . . . . . . . . . . . . . . . . . . . 468.1 Shuffler Completeness . . . . . . . . . . . . . . . . . . . . . . . 468.2 Code Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . 478.3 JIT Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478.4 Hot Patching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488.5 Shuffling an Entire System . . . . . . . . . . . . . . . . . . . . . 488.6 Shuffling Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 489 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51A Injected Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55viList of TablesTable 5.1 Performance of reading 10GB of memory between processes. . 30Table 5.2 Function pointer overhead on a million calls. . . . . . . . . . . 32Table 5.3 Performance improvement for jump table optimizations . . . . 33Table 5.4 SPEC CPU2006 average overhead . . . . . . . . . . . . . . . 35viiList of FiguresFigure 2.1 Example ROP attack (executes a shell on 64-bit Linux). . . . . 6Figure 3.1 Shuffler running on a separate CPU core. . . . . . . . . . . . 10Figure 4.1 High-level architecture of Shuffler. . . . . . . . . . . . . . . . 12Figure 4.2 Remapping private memory as shared memory . . . . . . . . 14Figure 4.3 Intel’s specification of cross-modifying code. . . . . . . . . . 16Figure 4.4 Idiom for retrieving EIP (32-bit x86 code). . . . . . . . . . . 21Figure 4.5 Transforming a call instruction with a trampoline. . . . . . . . 23Figure 4.6 Adding a shadow stack to form bookkeeping trampolines. . . 24Figure 4.7 Trampoline assembly code (AT&T syntax). . . . . . . . . . . 24Figure 4.8 Overall architecture of Shuffler. . . . . . . . . . . . . . . . . 29Figure 5.1 SPEC CPU2006 . . . . . . . . . . . . . . . . . . . . . . . . 34Figure A.1 Trampoline code with bookkeeping. . . . . . . . . . . . . . . 56viiiAcknowledgmentsMy supervisors, Bill Aiello and Andy Warfield, helped me throughout my master’sprogram to learn to see and think about research problems at a higher level, andfor this I am very grateful. Special thanks are due to Bill, who advised me onthe Shuffler project right from its beginnings as a small class project, who putup with its many iterations, and who always had useful advice and kind wordsand time for lunch meetings. My labmate and closest lab-friend, Mihir Nanavati,helped advise the project and suggested useful directions, and was always willingto talk at length about anything I wanted; but, when Mihir masterfully disappearedto another continent right before each of our paper submission deadlines, PatrickColp stepped in and helped out with the writing. Finally, my brother Kent, NSS-member-to-be, provided a great deal of advice in our many discussions about theproject, informed by his encyclopaedic knowledge of Intel systems. This projectwould not have been possible without Bill, Mihir, Patrick, and Kent, and everyoneelse in NSS who made sure I had a good time and kept myself sane. On that front,thanks to my parents and all my friends at Green College as well, for making mytime in Vancouver the best it could possibly be.Thank you, all!ixChapter 1IntroductionNo one has yet figured out how to write code without bugs. In fact, today’s pro-duction software might contain as many as 50 defects per 1000 lines of code [21].Many of these errors are exploitable by an attacker, particularly memory errorssuch as buffer overflows, use-after-free, and memory corruption. Even programswritten in languages with managed memory or isolated in virtual machines are notimmune, since bugs may be present in the language runtime [22] or virtual machineimplementation [9].Memory errors may allow an attacker to inject new data and code, examine thestate of the running program, even hijack its control flow. There are many possibletypes of attacks, but we are concerned in particular with code reuse attacks, whichleverage existing code to form an exploit. The most famous type of code reuse at-tack is Return-Oriented Programming [30], where control flow is directed throughsmall gadgets which together are Turing-complete and can form any exploit.One type of defense against memory errors is Address Space Layout Ran-domization (ASLR), which randomizes the base addresses of stack, heap, and li-braries in a program. This prevents an attacker from hard-coding addresses andhaving a reproducible exploit. Unfortunately, one memory access vulnerability(which reports an address to the attacker) can allow the rest of a program’s lay-out to be deduced, since the entire code (or stack) segment is shifted uniformly.More recent techniques which perform fine-grained ASLR have been proposed,permuting and randomly placing functions, basic blocks, or even individual in-1structions [2, 16, 37].Another class of defenses against memory error attacks and against Return-Oriented Programming (ROP) in particular is Control-Flow Integrity (CFI) [1, 39].CFI computes a control flow graph which contains all valid jumps and calls, andaims to prevent execution from straying outside the graph. Done properly, CFIshould prevent any ROP attack, because the jumps used by ROP are not possiblein a normal program execution.Unfortunately, recent works have shown that both types of defense are quitevulnerable. Fine-grained randomization is performed only once when a programis loaded, and the memory layout remains static thereafter. So it is possible tomount a memory disclosure attack and gradually unravel the randomization untilenough code is found to mount an attack [31]. This attack works in theory againstany fine-grained randomization technique. Conversely, while the original CFI stillremains a strong defense, it was prohibitively expensive; recent CFI techniquestake shortcuts that approximate the control-flow graph in order to achieve betterperformance. For example, many CFI systems allow a return statement to jump toanywhere immediately following an indirect call statement. It has been shown thatalmost any approximation which does not enforce true CFI still provides enoughgadgets for an exploit [10].In this thesis, we resurrect fine-grained ASLR as a viable defense against codereuse attacks. We harden ASLR against memory disclosure attacks by observ-ing that any memory disclosure can be mitigated through re-randomization, whichmay invalidate any previously discovered features of the address space. Our sys-tem, Shuffler, dynamically re-randomizes (or shuffles) code layout at runtime. Theshuffling is performed at high frequency, as often as once per 10 milliseconds,which is fast enough to provide a strong defense against repeated memory disclo-sure attacks such as Snow et al’s attack [31].Shuffler operates on commodity off-the-shelf binaries, without requiring sourcecode or debugging information. Shuffler does not require any modifications to theoperating system, but instead operates entirely in userspace. It may be embed-ded into the system loader, or track a process and all its children; Shuffler caneven attach onto running processes and then detach later (providing, perhaps, stop-gap defense against a zero-day vulnerability). The shuffling rate can be adjusted2dynamically according to observed system activity such as network overhead orsuspicious process crashes.We measured the performance of Shuffler on SPEC CPU 2006, and found anoverhead of 10-21% on the applicable test cases. This seems appropriately smallenough for our technique to be viable. Shuffler has also been tested on the webserver nginx, which is perhaps a more realistic example.3Chapter 2BackgroundWhen a running program is attacked, the attacker’s aim is usually to cause theprogram to run code it was not designed to run. Early stack-smashing attackswould simply exploit buffer overruns to write new code provided by the attackerdirectly onto the stack, and then overwrite a return address to point at the newcode. Modern defenses such as executable space protection prevent new code frombeing written into memory, which leads to more sophisticated code reuse attackslike Return-Oriented Programming (ROP). We describe some background in detailbelow, leading to a description of ROP.2.1 Executable Space ProtectionExecutable space protection refers to marking pages in a process’s memory regionas non-executable, to prevent an attacker from running code in those memory re-gions. Historically, regions like the stack would have RWX permissions, allowingcode injected through a buffer overrun to be executed immediately. This largelystems from x86 paging table permissions, which originally did not allow pagesto be made readable without also being executable, until the introduction of theNo-eXecute (NX) bit.11Read and write permissions have always been manipulable: bit 1 of each page table entry con-trols write protection, and bit 2 can be used to make a page completely inaccessible [18]. But theexecute-disable (NX) bit is bit 63 and was only introduced in the first machines to use more than32 bits for physical memory addressing (Pentium Pro and newer, anything with Physical AddressExtension or PAE).4Now, most modern operating systems use the hardware NX bit, if present, toenforce executable space protection.2 Some Linux-based systems even emulatethe NX bit in software [24, 27]. This effectively prevents code-injection attacksfrom succeeding, although executables may still explicitly request RWX stacks ormemory regions despite the dangers of doing so, e.g., for self-modifying code orthrough misconfiguration of the build toolchain [14, 19].2.2 Return-Oriented ProgrammingReturn-Oriented Programming (ROP) [30] is a class of attack that works even inthe presence of executable space protection. The basic idea is to reuse fragmentsof code already present within a legitimate executable’s code segment, combiningthem into a malicious exploit (without writing any new code). The fragments ofcode (called gadgets) are normally very short and end in return statements (hencethe name Return-Oriented Programming).An attacker deploying a ROP attack begins by exploiting a memory error vul-nerability, e.g., a buffer overrun that allows the current return address on the stackto be overwritten, and several words of memory to be written beyond that. Whenthe current function goes to return, it instead jumps to an address of the attacker’schoosing, which is the first gadget to be executed. Each gadget runs a small amountof code and then executes a return statement; each return reads its target addressand then pops it off the stack. In this way, the attacker can direct execution througha whole series of gadgets simply by writing their return addresses in sequence ontothe stack.The attacker can also place data onto the stack, and load it into registers by exe-cuting a gadget which pops data off the stack—for example, pop %rax ; ret,which is a very common instruction sequence in function epilogues. After findinga number of gadgets in a program or library’s code, the attacker can characterizethem by what side effects they have, what registers are modified, and what func-tions or traps are called. It is relatively straightforward to find a Turing-completeset of such gadgets, so that any desired exploit can be compiled down into a se-2Windows calls this Data Execution Prevention (DEP), OpenBSD calls it Write-XOR-Execute(WˆX), and other systems like Linux just call it NX bit support.5quence of gadgets (using a constraint or SAT solver) [12, 28]. Even basic librariesthat are always present, like or kernel32.dll, can be shown to con-tain a Turing-complete set of gadgets [10, 33].As an example, see Figure 2.1. Here the stack has been set up with the ad-dresses of five gadgets and some data. The first two gadgets simply clear a register,then use a return statement to pop the next return address off the stack. Gadgetsg3 and g4 load data values from the stack, or pointers into the stack. And gadgetg5 initiates a system call. The attack shown will execute system call 0x3b (execve)with the three arguments /bin/sh, 0, and 0 — in other words, it opens a shell, theholy grail for an attacker. A real attack would likely not have these gadgets avail-able and may have to construct equivalent behaviour from many more gadgets, butthis complete compromise of the target program would likely still be possible.Figure 2.1: Example ROP attack (executes a shell on 64-bit Linux).Finally, note that x86 and x86 64 employ dense, variable-length instructionsets. In other words, common or older instructions might have short one- or two-6byte encodings, while more recent and more complex instructions might be muchlonger (up to a maximum of 15 bytes [23]). And furthermore, the instruction set isdense, meaning that most sequences of bytes represent valid instruction sequences.Thus, an attacker can search for code beginning at any address, even in the middleof the original instructions, further increasing the chances of finding useful gadgets.Thus, Return-Oriented Programming (and variations like Jump-Oriented Pro-gramming [3]) is a very powerful type of code reuse attack which can be car-ried out against modern systems. The defense most commonly proposed for ROPis Control-Flow Integrity, though recent CFI systems are too lax with their poli-cies [10]. We claim to show that (dynamic) randomization-based techniques alsoshow promise as ROP defenses.2.3 Threat ModelIn this work, we assume a user who is running a modern operating system withexecution space protection, to handle traditional code-injection attacks. The oper-ating system and system administrator are trusted. Shuffler provides protection toone or more userspace programs run by the user. We assume these target programsare well formed (e.g., compiler-generated code, which is organized into blocks thatnever issue call instructions into the middle of other blocks), and that Shuffler isable to discover the code in the target (e.g., through the use of the symbol table,discussed in Section 4.2.2).We assume the attacker does not have direct access to the target system, butrather is situated remotely on the network, or perhaps may craft a self-containedexploit (such as a malicious Word document or PDF file). As in Snow et al [31], weassume an adversary with advance knowledge of a) a memory access vulnerabilitythat can be invoked repeatedly, b) one valid code address, and c) a vulnerabilitywhich allows control-flow hijacking (in order to initiate execution of an exploit).Under these conditions, Snow et al have shown that any static randomization tech-nique can be defeated; but their analysis does not extend to dynamic randomization,our proposed method of defense.7Chapter 3DesignOur aim in this work is to resurrect fine-grained ASLR so that it becomes a usefuldefense against Return-Oriented Programming. Ideally, we could completely iso-late each block of code within a target program, preventing an exploit from trans-ferring control-flow between gadgets. This would defeat any code-reuse attack thatneeds code from multiple blocks. In reality, however, all code blocks within a pro-cess exist in the same address space and can always reach each other—assumingthat the attacker knows where they are.Thus, our approach is to dynamically move (or shuffle) code blocks from oneaddress to another. This approximates the isolation ideal, and prevents an attackerfrom reliably jumping between code blocks. With continuous, high-frequencyshuffling, discovering the location of a code block is of limited use to an attacker,because the block will not live at that address for long. Constructing a successfulROP attack in this environment is difficult.3.1 ArchitectureThe goals of our system, Shuffler, are presented below.1. Security: We should provide strong randomization guarantees without in-troducing additional attack vectors to the system.2. Deployability: Shuffler should require minimal system modifications, andshould provide protection for existing unmodified application binaries.83. Performance: Programs run under Shuffler should approach native perfor-mance (in terms of wall clock time).The architecture of our system, Shuffler, is presented below.3.1.1 Location of Shuffling Mechanism: In a Separate ProcessOur aim is to move the code around inside a target process while it is running. Butit is dangerous to embed the shuffling mechanism inside the same process, whereit may be vulnerable to memory disclosure attacks on the target. Since Shufflermust keep track of the shuffled locations of every code block at all times, its datastructures are a prime target for memory disclosure attacks (because they providean easy way to unravel the randomization).This suggests two other designs: either store the shuffling data structures insidethe kernel, or push the data structures into a separate user space process. Mov-ing the shuffling into the kernel does provide the easiest access to page tables foreasy page-granularity shuffling, direct page read/writes, and other mechanisms thatwould make shuffling easy to implement (and as a precedent, existing ASLR mech-anisms are kernel-level). However, a userspace solution has the advantage of easydeployability to existing systems running stock kernels, and limits the potentialdamage an attacker can cause by attempting to attack Shuffler itself. We chose todesign Shuffler as a userspace process. It runs with no additional special privileges,executing in userspace under the same user ID as the target process.3.1.2 Burning a CoreThe Shuffler process runs on a dedicated CPU core, “burning” the core to pro-vide the security afforded by shuffling with minimal impact on other cores’ per-formance. As shown in Figure 3.1, the shuffling core runs in parallel to the realapplication, so that a single-threaded shuffled application now needs two cores.1Since burning a core allows the target application to run at nearly native perfor-mance, this seems a reasonable trade-off for security; modern CPUs have many1Actually, only a small portion of the shuffling CPU’s time is used, perhaps one fifth in our tests,so that one shuffling core would suffice to handle multiple application cores.9cores, and forthcoming models have even more, though it is uncertain whether ap-plications can scale up to that many cores. Note that dedicating cores to specializedtasks to ensure good performance is not without precedent: for example, on main-frames, IBM provides extra cores solely for offloading Java Virtual Machines [17]so that Java performance will be closer to native performance.Figure 3.1: Shuffler running on a separate CPU core.Unfortunately, from the hardware’s perspective, since shuffling is happeningfrom a different core, one processor is modifying the code of another processor.This is called cross-modifying code and it is very slow: as is the case for self-modifying code, whenever code is modified at runtime, the processor must flushits pipelines and clear instruction caches. However, any case where code blocks arepermuted must involve code modification, because embedded jumps are normallyrelative and must be updated. So this high cost cannot be avoided entirely. Wemitigate it by generating new code asynchronously, a technique which is madepossible by having Shuffler run on a separate core.3.1.3 What Shuffler Can ShuffleChoosing to use a separate userspace process for Shuffler is excellent for deploy-ability. Shuffler may be installed by a non-root user without needing a customkernel, or even a reboot. We further extend this deployability by allowing Shufflerto begin shuffling in two ways:1. Using a custom loader, which may replace the system loader to catch all10newly run processes.2. Attaching onto any running process (and optionally detaching later).In principle, we want Shuffler to be able to handle anything that a userspaceprocess might do: load shared libraries, create threads, call fork/exec, etc. Sup-porting all of these features is a only matter of implementation; no major designchanges would need to be made. We implemented as many as possible within ouravailable time frame.3.1.4 Heartbeat OperationShuffler operates in a loop, shuffling every function in turn of a target program. Forsecurity, Shuffler attempts to guarantee that each function will be moved within agiven time period (e.g., 100 milliseconds). The loop runs periodically at this rate,and enters an error condition if at any point Shuffler cannot move all the functionswithin the time allotted.The time period may be adjusted by the user, because higher shuffling ratesprovide greater security (in the sense that any code address remains valid for ashorter period of time), while lower shuffling rates provide better performance.This rate could even be adjusted as Shuffler is running, in response to any increasesin suspicious network traffic or program crashes.11Chapter 4ImplementationFigure 4.1: High-level architecture of Shuffler.Figure 4.1 describes Shuffler’s operation at a high level. In brief, Shufflerruns as a separate userspace process, and attaches onto the target process with theptrace API. Shuffler is able to inject code into the target and modify its memorydirectly (described in Section 4.1.1). Special trampolines are introduced for func-tion calls (Section 4.3) to make shuffling easier. The blocks of code that make up12trampolines and the target’s functions are tracked within Shuffler, and randomlyshuffled in the target inside sandbox regions of memory (see Section 4.2). Shuffleris implemented for 64-bit Linux on the x86 64 architecture.4.1 The Userspace Shuffler ProcessAs mentioned in Section 3.1.1, we chose to run Shuffler in a separate userspaceprocess. However, it requires a fair amount of engineering effort to achieve all thetasks that Shuffler needs to do from userspace. The main challenges are 1) gettingread-write access to the target’s memory, and 2) ensuring that the target runs newcode after it is updated (instead of stale cached code). The following sectionsdescribe how this is possible to implement from userspace.4.1.1 Modifying the Target’s MemoryThe simplest way to manipulate another process’s state is through the ptraceprocess introspection API (designed for debuggers). As a first step, the Shufflerprocess attaches onto the target process with ptrace. This grants Shuffler theability to read and modify saved registers, change the instruction pointer, interceptsignals, and modify the target’s memory.1However, using ptrace to access the target’s memory is very slow, since theAPI requires that a separate system call be made to access each word of memory.2One of the main tasks that Shuffler performs is to copy functions in order to shufflethem, so we need a faster mechanism.Our main technique is to use shared memory maps. These allow portions ofa process’s address space to be mapped to the same physical pages as anotherprocess’s address space (possibly at different virtual addresses). Once set up, thisallows Shuffler to modify the target’s memory directly with ordinary writes. Thereare two instances where Shuffler uses shared memory maps:1. Sandboxes are large empty regions within which code can be shuffled ran-domly. To create them, Shuffler injects code into the target which opens1Shuffler can only attach onto executables which are not setuid/setgid [34]. This is a securitylimitation of the ptrace API.2There have been attempts to improve this: the PTRACE MULTI patch [11] to Linux providesone solution, but it is not in the mainline kernel.13Figure 4.2: How private mmap’ed regions are remapped as shared memory.a temporary backing file, resizes it to the desired size, and mmaps it intothe target’s memory.3 Shuffler can then map the temporary file into its ownaddress space.2. Shuffler also needs to map the existing code segments as shared memorymaps, in order to shuffle (and erase) the code within. However, such regionshave already been mapped as private memory maps (by the loader and sharedlibrary runtime). So, as shown in Figure 4.2, Shuffler a) injects code intothe target which b) creates a temporary mapping and copies the data to thismapping, c) unmaps the original private region and maps it again as a sharedregion, then d) copies the data back in and unmaps the temporary region;finally, Shuffler can e) map the shared region into its own address space. Theremapping operation is performed while the target is stopped to prevent anyrace conditions.Shared memory maps allow Shuffler to access remote addresses with reads andwrites, the same way as local addresses. This leads to a clean internal architecture.3This is what POSIX shared memory objects were designed for, but they are a libc-level abstraction and difficult to use from injected code. The Linux implementation (seesysdeps/unix/sysv/linux/shm open.c) just uses mmap and ftruncate anyway.14Other Methods of Accessing MemoryOther techniques for accessing shared memory are possible. For example, there isalready a mechanism similar in spirit to our shared memory maps built into Linux:the /proc/pid/mem file, which provides direct access to a target’s memory. Onemight reasonably ask why we did not use this, since it is recommended by [11] andeasier to use (it is the mechanism employed by GDB). However, while this file canbe opened and read with syscalls like pread, it cannot be mmaped, which preventsthe clean architecture allowed by shared memory maps. Using /proc/pid/memis also significantly less efficient, as shown in our evaluation (Section 5.1.1).One other option is to use the new pair of system calls introduced in Linux3.2, namely process vm readv() and process vm writev(), which areespecially designed to transfer data between process address spaces (and are es-sentially a realization of the PTRACE MULTI proposal [11]). However, invokingthese system calls requires a kernel context switch instead of simple writes; whilethe calls allow operations to be batched together by accepting a vector of requests,this creates additional overhead and complexity. These new system calls are alsoless efficient than shared memory maps (see Section 5.1.1).Finally, shared memory maps allow a function to be moved with a single copy,while any of these other methods require two copies (because one endpoint of eachtransfer must be in the calling process). This is why we use shared memory maps,despite the additional complexity in setting them up.44.1.2 When Do Writes Appear? Or, Cross-Modifying CodeShuffler creates shared memory maps in order to write code to shared pages, andhave that code appear in the target process. This process, where one processorgenerates code for another processor, is called cross-modifying code (a general-ization of self-modifying code, where one processor generates its own code). Un-fortunately, there are a number of caveats with cross-modifying code. First, it isvery slow, because CPUs are not designed to have code modified without warning(caches need to be invalidated, pipelines cleared, etc). Second, there are timingissues because writes may be cached on one processor and not appear right away4However, we do use process vm readv/writev to bootstrap shared memory maps!15on another; and third, the target processor may run old versions of the code out ofstale instruction caches or through speculative execution, unless it is given propernotification of the code modification.The Specification of Cross-Modifying CodeModern CPUs will reorder instructions, speculatively execute upcoming instruc-tions and branches, prefetch memory for the instruction cache, and so on. Theinstruction cache and data caches are separate. This is why according to Section8.1.3 of the Intel manuals [18] “Handling Self- and Cross-Modifying Code”, cross-modifying code should be written as shown in Figure 4.3./* Action of Modifying Processor */Memory_Flag = 0; /* Set Memory_Flag to value other than 1 */Store modified code (as data) into code segment;Memory_Flag = 1;/* Action of Executing Processor */WHILE (Memory_Flag != 1)Wait for code to update;ELIHW;Execute serializing instruction; /* For example, CPUID */Begin executing modified code;Figure 4.3: Intel’s specification of cross-modifying code.In other words, the writer clears a flag before writing the modified code, andthen sets the flag afterwards.5 The executing processor spins until it sees the flagset, which means that all previous writes have become visible; then it issues cpuidto cancel speculative execution of old code; and only then can it jump to the newcode.In the case of Shuffler, this could translate to an implementation as follows.Shuffler writes out new copies of functions, issues a store memory fence, and thenincrements a “generation counter” which is on a shared memory page. Then Shuf-5Although the manual doesn’t mention this, to be fully correct, there should probably be a storememory fence sfence before setting the flag, to make sure writes are not re-ordered.16fler injects code to spin until the correct value of the generation counter is seen, atwhich point the target can continue executing and Shuffler may begin erasing oldfunctions. However, this is not our current implementation; see below.Making Use of RWX PagesOne possible shortcut to implementing cross-modifying code is to use memorypages that are marked write-through; thus, the write-back cache is never used, andwrites are never re-ordered. If writes are “atomic” (this includes 64-bit alignedquadword writes on 64-bit processors, described below), then they will be effec-tively placed straight into main memory, synchronously.Such pages could be set up from the operating system (or emulated by alwaysflushing changes immediately after writing to a page6). However, we noticed thatpages marked RWX and mapped as shared pages into two different address spacesact as write-through pages (in Linux). Writes are propagated immediately. Soprevious versions of Shuffler used this method (and the current version can stillfall back to it). However, this means that a huge number of pages in the targetprocess must be marked with RWX permissions. This opens up the door to oldercode injection attacks; even if we are preventing code reuse attacks, it is not muchuse if simple buffer overrun attacks work once more.Current ImplementationBecause of the security risk of having RWX pages, Shuffler now uses R-X pages(sandboxes) and RW- pages (stack) in the target process, all of which are mappedas RW- pages in the shuffling process. In this setup, writes may be delayed in thewrite-back cache indefinitely.7 Implementing proper support for this environmentwould involve writing a generation counter whenever new functions are shuffled,and reading the counter from the target’s processor to see whether the writes havepropagated or not (as outlined above).However, we did not fully implement this. Instead, we just continue writingnew code into shared memory maps, and trust that it will eventually propagate to6In Linux, by calling flush dcache page().7In fact we observed the target processor seeming to enter an infinite loop, running old code, onmultiple occasions. (Using ptrace to read the code also revealed old data.)17the target. After a memory fence, Shuffler’s writes should appear eventually. Wefound that the delay inherent in waiting for the next shuffling cycle before erasingold code is usually enough to have the writes propagate.This is not a robust solution, and the system will crash at higher shuffling rates(compared with the RWX solution, which never does). Clearly we should imple-ment a system that actually checks if the writes have appeared before erasing oldcode. But our current implementation works well enough to gather timing infor-mation and get an idea of whether shuffling is feasible or not.Making Atomic UpdatesOne other potential issue is that the target processor may see half-updated codewhich, when decoded, yields an invalid (or undesirable) instruction stream. We getaround this by making a new copy of code in advance, and then updating the livecode path with quadword aligned writes to addresses or call instructions. On mod-ern processors, quadword-aligned writes performed by a single store instructionare atomic and do not require special synchronization.4.1.3 Bootstrapping Shuffler (and Attach/Detach)Shuffler has two methods of shuffling programs when it first starts: 1) taking con-trol of a program from its very beginning with a custom loader which forks into thetarget and Shuffler, and 2) using ptrace to attach onto a running program at anypoint. Shuffler can also detach from a program that is being shuffled. This meansthat Shuffler can be deployed only when a target program really needs defending(e.g., a new zero-day attack is discovered), attaching onto a running program, anddetaching when the threat has passed. See Section 8.4 for further discussion of thisidea.Technically speaking, the main challenge with both of these methods is gettingcode injected into the target (a task which Shuffler must perform fairly frequently).Shuffler reads the instruction pointer to find a page in the target which is executable;it then uses ptrace to inject code which calls mmap and then traps. When thetarget is allowed to run, it maps in an executable “staging area” as a shared memoryregion. Shuffler restores the original code for the overwritten page and can in18future write directly to the staging area, warping the target’s instruction pointerto the beginning of the region. Code injection is required to call open, mmap,mprotect, etc from the target’s context. Finally, writing trampolines and new copiesof functions is also injection of a sort, but it is easier because Shuffler does not haveto coordinate taking control back after running the code once.4.2 Code Discovery and RelocationCode is not designed to be moved to different addresses as it is running. However, itseems like this must be possible, since the compiler build chain regularly moves (orrelocates) code to different addresses during the compilation and loading process.4.2.1 How the Build Chain Uses RelocationsConceptually, when the compiler generates basic blocks it keeps the relocation in-formation necessary to move that block to any address. Subsets of these relocationsare stored in ELF files and kept around for other stages of the build/load process.In particular,• The compiler generates object files (which are relocatable ELF files). Objectfiles contain code which starts at address 0, and relocations sufficient to shiftthe code to a different base address.• The linker takes object files and decides on an ordering for the code within,relocating everything to non-overlapping addresses. This creates executablesor libraries (which are also ELF files).• The loader copies code into memory at runtime, and if the code is position-independent, then the loader relocates it to another base address using therelocations in the ELF.ELF executables have nearly all their relocations resolved, and position-independentcode (used for shared libraries) is located at address 0 so that the loader can decideon the base address.So as code is built, different aspects of the code are resolved (basic block order-ing, function ordering, base address, etc) and relocations that become unnecessary19are discarded. However, if we could keep all these relocations around, then shuf-fling a program essentially means re-linking it, at runtime. There are compiler flags(like GCC’s -q or --emit-relocs) which preserve relocations, and an earlierShuffler prototype made use of this information.8 But relocations are only availableif one has access to the source code and can compile with relocations preserved,and we want Shuffler to work on binaries. So we need to find code, disassemble it,and try to reverse-engineer the relocations in order to move the code.4.2.2 Code Discovery and DisassemblyIn x86-compatible binaries, code and data are interleaved. Distinguishing betweenthem is a non-trivial task, although certainly possible: the heuristic-based recursivedisassembler from Zhang and Sekar [39] is 100% accurate at disassembling theirchosen set of test cases, and Wartell et al [36] have performed a broader study withtheir disassembler. However, we want Shuffler to be applicable to a wide range ofbinaries without requiring complex heuristics. Thus, Shuffler assumes that a fullyaccurate disassembly may not be available.The simplest discovery technique, which is used by the current Shuffler imple-mentation, is to obtain the target’s list of functions from its symbol table. Shuf-fler then uses a disassembly library on each function to find the jump and RIP-relative instructions that need to be fixed when the function is moved (relocated).We benchmarked five popular open-source disassembly libraries and chose diS-torm3 [8], which had the best performance (and furthermore, has complete 64-bitsupport). If some code is not discovered through the symbol table, that code simplywill not be shuffled. Note that shared libraries have their own symbol tables whichare searched separately.Not all executables have a symbol table, however. In fact “stripped” executa-bles are the default in most Linux distributions. To handle these, the best optionis to use heuristics to disassemble the code segment, looking for embedded datato skip over. We decided not to implement this, because it is a tedious process,8Unfortunately, this option is rarely used and we found bugs at the linker level[6], and in the compiler (GCC 4.8): gedit text region intersect()’s calls tofind nearest subregion() have no relocations since the latter is a static function. Further-more, it is difficult to modify most build systems to emit full relocations (e.g., Firefox, Chromium).The option -ffunction-sections is a workaround for the gedit case, but is fairly heavyweight.20and there is extensive prior work in the area (including [39]). So at the moment,Shuffler will only move functions that are mentioned in a symbol table.4.2.3 Relocating Code at RuntimeMigrating a program’s code while the program is executing is a challenging task,because the code will likely be incorrect when it is moved to a new location. Forexample, embedded call instructions need to be adjusted (because they add a con-stant to the current instruction pointer to find the target). In fact, all embeddedinstructions with RIP-relative arguments need to be updated. Fortunately, diStormmakes it easy to find and fix all such arguments.But after the code itself is updated, there are still external references to the oldversion of the code (e.g., direct calls to the old code, function pointers). Shufflerbuilds an “adjustment list” for each function of memory locations that refer to thefunction and need updating when it moves. This includes jump tables, incomingcalls, etc. The list may be incomplete because Shuffler may only have a partialdisassembly available. Furthermore, some types of references (such as functionpointers) may become propagated throughout memory at runtime, making it al-most impossible to fix all references. Shuffler fixes all the references it knowsabout, and handles all other references at runtime when the old addresses are used;see Section 4.4.1 for details on handling stale addresses. To make this easier, weare careful to ensure that stale addresses only ever refer to original locations offunctions, and not their shuffled variants.Unfortunately, a function is free to refer to its current address by saving the RIPregister. This is quite common in 32-bit x86 code, where RIP-relative addressingis not present; instead, GCC makes use of the idiom in Figure 4.4 to load eip intoa nextnext: pop %ebxFigure 4.4: Idiom for retrieving EIP (32-bit x86 code).This is dangerous for code movement, since an unknown set of registers maycontain EIP-relative addresses; worse still, these may be pushed onto the stack.21We do not know whether this practice is common in 64-bit GCC-generated code,but to be safe, we avoid moving whichever function is currently executing. Weassume that a function may perform RIP-relative computations, but does not needthis state once it calls another function (an assumption which seems to be borneout in practice).4.3 Trampolines4.3.1 How to Update the StackWe want to be able to move functions at arbitrary time intervals throughout thetarget program’s execution. We have described how to statically relocate functionsand find references to them. But the biggest barrier to moving functions are the run-time artefacts of their execution: in particular, return addresses get pushed onto thestack, and these return addresses refer to the new copies of the code. When weattempt to move these functions, the return addresses will be stale and likely causea program crash if actually used.The obvious way to fix this is to unwind the stack, find all the return addresses,and update them to new locations. However, in x86 the stack frame for a functionmay be optimized out, making it very difficult to unwind the stack with 100% ac-curacy. Even libraries such as libunwind [20] must sometimes resort to heuristics.Given access to source code, compiler flags like GCC’s -fno-omit-frame-pointerand -fasynchronous-unwind-tables or information from custom LLVMcompiler passes (as in [15]) can guarantee accurate stack unwinding. But Shufflertargets binaries.The approach taken by Shuffler is to hoist each call instruction into a trampo-line. As shown in Figure 4.5, each existing call instruction is transformed into ajump to the trampoline, which calls the target; on the return path, control flowpasses through the trampoline before returning to the original caller. In otherwords, this ensures that return addresses on the stack (for shuffled functions) al-ways refer to trampolines, allowing the caller and callee functions to be movedwithout having to update the return address.Now the issue is that trampolines themselves are never moved. Shuffler extends22(a) Original call instruction(b) Trampoline call instructionFigure 4.5: Transforming a call instruction with a trampoline.the basic trampolines to remember locations of return addresses using a separateshadow stack (see Figure 4.6). Then, trampolines can be moved as well, with theadditional step of walking the shadow stack and updating any references to oldtrampolines.It would be possible to rewrite functions directly with this extra shadow stackinstrumentation, and avoid the use of trampolines. However, this involves insert-ing a lot of code around each call site, which may negatively impact code caching,alignment, and other parameters set up by compiler optimizations. With trampo-lines, the majority of calls turn into jumps of the same size, without impactingcode size at all. Trampolines also add one more layer of indirection to the ac-tual addresses of functions; the attacker cannot, for example, read the stack andimmediately find addresses of functions.23Figure 4.6: Adding a shadow stack to form bookkeeping trampolines.4.3.2 Trampoline ImplementationAs shown in Figure 4.7, basic trampolines, without the shadow stack “bookkeep-ing”, are very simple. We use RIP-relative call and jmp instructions to refer to64-bit constants located immediately following them in memory. The constants arealigned on 64-bit boundaries and can be updated with atomic 64-bit writes9. Theydo not have to be atomically updated together since both the old and new code arevalid at the moment of *0x6(%rip)jmp *0x8(%rip).quad 0x2000 ; dest function (for call).quad 0x1000 ; source address (for jmp)Figure 4.7: Trampoline assembly code (AT&T syntax).The bookkeeping adds some complexity to the trampolines, increasing the size9On modern 64-bit CPUs, 64-bit writes aligned to 64-bit boundaries are atomic—see Section8.1.1 of [18].24of a trampoline to 64 bytes (including the two embedded addresses at the end).We use a simple stack for bookkeeping at a fixed point in memory, the address ofwhich is embedded into trampolines when they are generated. See Figure A.1 inAppendix A for the full code for bookkeeping trampolines.Finally, functions which contain calls must be updated to use trampolines.Most call instructions are 32-bit relative, which can be replaced with direct 32-bit jumps to trampolines (since both instructions are 5 bytes long and we try toplace trampolines within 32 bits of their functions). However, inevitably Shufflersometimes needs to patch in more bytes of code than there is room for: this hap-pens if 64-bit jumps are required, and for indirect calls (which are usually less than5 bytes long).We handle this by trying to rewrite the source function, expanding its code tomake space for the new jumps. Internal jumps within the function and all RIP-relative instructions must be fixed up, and potentially some internal jumps mayneed to be upgraded to wider variants (if their target is now too far away to reach).Our implementation attempts to perform rewriting once at code discovery time; allfuture shuffled copies of a function will be based on the rewritten code. But ourimplementation gives up on rewriting if opcode substitution is necessary, fallingback instead on a 1-byte int3 trap (when this is hit, Shuffler warps the instructionpointer to the trampoline). The traps are slower but always work. The full rewritingprocess could easily be fully implemented, and there are also existing libraries likeDynamoRIO [4] that do full rewriting, but we found that opcode substitution isvery rarely required.4.3.3 Memory ProtectionTrampolines provide a fair amount of information to an attacker: they contain theaddresses of both source and destination functions. However, this information mustbe placed somewhere in the target’s address space for the target to be able to run atnear-native speed without Shuffler’s intervention. So there is not much that can bedone about this, with current hardware and operating system support.In an ideal world, hardware would support pages with write and execute per-missions, but no read permission (e.g., -WX). This is currently impossible to achieve25except without delving into Intel’s Virtual Machine Extensions [18] (i.e., the hy-pervisor level). Having such pages would make implementing a secure shufflingmechanism much easier: simply write addresses or jumps straight into this region,and jump to the (now disguised) function location. We can approximate this mech-anism with current systems, however. Shuffler implements an optional protectionlayer on top of the shuffling mechanisms, wherein all code pages have no executepermission by default. Whenever the instruction pointer reaches a protected page,execute permission is enabled on that page; if the program tries to read the page, itmust be an attack.Ideally memory access checks would occur every time a page is used, but thatis far too slow, and we are forced to “cache” the execute permission on a page.This provides some level of protection—an attacker can no longer simply readall existing pages looking for code—but frequently-executed pages almost alwayshave the execute bit enabled, so it is possible to sidestep this mechanism. Still,perhaps in the future, -WX pages will be accessible from userspace, and this schemewill become relevant.4.4 CompletenessThe number of constructs that are available to userspace programs is large, and inthis section we discuss how Shuffler handles some of these cases.4.4.1 Handling Stale AddressesAt some point, the target may jump to a code address which is stale. We discusshow to deal with addresses that are generated dynamically and may refer to shuffledcopies of functions (e.g., return addresses on the stack) above in Section 4.3. Theonly other case is unanticipated references to the original function location. Wemust be somewhat careful here and not just trap and warp such references to thecurrent function location; this works, but defeats the purpose of shuffling, as anattacker can jump to any piece of code simply by jumping to its original location.So we handle stale addresses as special cases, according to how they arise.Here are the cases we came across, and how Shuffler handles them:• Indirect calls Indirect calls (such as jmp *%rax) are replaced with jumps26to trampolines. However, indirect call instructions have very short encod-ings, and jump-to-trampoline instructions (which are at least five bytes long)normally will not fit in their place. Thus, we try to use function-rewriting tomake space for the new instruction, or fall back on a (1-byte) trap instruction.• Function pointers Function pointers occur in the data section as literaladdresses or in code as constants, and it is impossible to tell that they arecode pointers until they are called. Shuffler traps on initial function pointerinvocations, warping the instruction pointer to the current function location,and inserts a trampoline to speed up subsequent calls. We do not attempt tomodify function pointer values when a function is moved, as the pointer maybe propagated throughout the heap and stack.• Jump tables Jump tables contain absolute code addresses and are embed-ded into the data section, with no clear delineations (although relocations,if present, will indicate jump tables [29]). Currently, Shuffler uses a simpleheuristic10 to detect jump tables and replace the table entries with trampolineaddresses. For more complete and advanced heuristics, see Cifuentes [5].• Setjmp/longjmp These functions allow the stack state of a program to besaved and restored, enabling non-local gotos across function calls. GCC doesnot inline setjmp/longjmp even at high optimization levels, so any longjmpwill restore within a stack frame for the setjmp function. We could simplyavoid shuffling the setjmp function, so that any longjmp will work correctly,and additionally fix bookkeeping trampoline information (see below). How-ever, this special case is not yet implemented.• C++ exceptions Exception handling involves unwinding the stack when anexception is thrown. Discarding stack frames causes issues with bookkeep-ing trampolines, but there is a simple fix (which we did not implement):simply check for all bookkeeping entries which do not follow in descending10We look for JMP instructions with a scale factor of 8 and a constant offset which points insidethe code segment, to a table of addresses which also point inside the code segment. We assume thefirst word which does not look like an address demarcates the end of the jump table. This is veryaccurate at picking out GCC-generated jump tables, even at high optimization levels.27order from the previous entries, or which refer to memory beyond the currentstack pointer. Such bookkeeping entries are out-of-date and can be ignored.There are other unusual cases which we do not handle11, but we have listedabove the majority of cases that appeared in our testing.4.4.2 Handling Shared LibrariesShuffler finds shared libraries through the same mechanism as GDB12, by walkingthe dynamic loader’s list of libraries. New shared libraries are noticed by insertinga breakpoint at dl debug state, which is a hook called by dlopen after everyshared library load or unload. Each shared library is its own ELF file, and is shuf-fled independently. Each process also gets its own private re-randomized copy of alibrary; any attempt to share the memory used by libraries across processes wouldincrease the attack surface against randomization, since multiple processes couldexamine the code at once.4.4.3 Handling Threads and ForksSupport for threads and forks is not fully implemented in Shuffler; refer to thediscussion in Section Overall OperationOnce Shuffler can write code into a target process, redirect function calls throughtrampolines so that functions can be moved, and save return addresses on a shadowstack so that trampolines can be moved, the main remaining task is to actuallymove all this code.As shown in Figure 4.8, Shuffler consists of a main thread and one or moreworker threads. The main thread catches signals in one or more target processesunder its control, can move targets’ instruction pointers and inject code, and is re-sponsible for detecting new shared libraries and forks. The worker threads eraseold shuffled functions, and create new shuffled copies asynchronously; the main11For example, the unportable GCC extension called labels as values, which allows pointers togoto labels to be taken with the unary && operator.12See enable break() in gdb/solib-svr4.c.28Figure 4.8: Overall architecture of Shuffler.thread periodically stops the target processes and marks most old functions as be-ing ready for erasure (with the single exception of whichever function/trampolineis currently executing, as described in Section 4.2.3). The target process may spendnearly all of its time actually executing, and will transparently migrate executionto the new copies of functions (as the worker threads atomically update old tram-polines, or trampoline addresses on the stack).29Chapter 5Performance EvaluationWe performed most of the experiments in this section on a system running Debianjessie with the Linux 3.13 kernel, with a quad-core 3.20 GHz Intel Core i5-4570(Haswell) processor and 8 GB of RAM. Some experiments (clearly marked) wererun on our “alternative” system, an ASUS Zenbook UX32VD with a dual-core1.90 GHz Intel Core i7-3517U (Ivy Bridge) processor and 6 GB of RAM.All results are reported as the average of three runs.5.1 Shuffler Microbenchmarks5.1.1 Memory Access OverheadTo compare the different methods of accessing another process’s memory, wewrote test programs to transfer 10GB in 1MB chunks. The timing results are shownin Table 5.1 (as usual, the average of three runs).Type Runtime (sec) Runtime (normalized)Local memory access 0.607s 1.000xShared memory map 0.604s 0.995xprocess vm readv 0.630s 1.038x/proc/pid/mem 0.947s 1.561xVanilla ptrace 19.916s 32.811xTable 5.1: Performance of reading 10GB of memory between processes.30Shared memory maps provide identical performance to local memory accesses(though of course it may take a short while for the writes to be committed). Boththe new system call process vm readv and reading /proc/pid/mem withpread provide worse performance. In both cases, the overhead of just the systemcalls is negligible: invoking the same number of system calls, but only transferring1 byte, takes 0.003 seconds. On the other hand, using ptrace directly requires ahuge number of system calls and slows down unacceptably. Finally, we were un-able to benchmark the proposed PTRACE MULTI patch [11], but its performancewould likely be comparable to process vm readv, because their interfaces aresimilar.Thus, although shared memory maps are complicated to set up, their additionalflexibility and superior performance provide a compelling reason to use them (asShuffler currently does).5.1.2 Context Switch Overhead [on Ivy Bridge]To measure the overhead of a context switch between the target program and Shuf-fler (and back again), our test program dereferences a function pointer to an emptyfunction repeatedly, causing a million switches to Shuffler and back on the indi-rect jump trampoline. This results in an average time of 6.9 µs per context switch.This is slightly higher than the expected context switch cost of modern processorsbecause the shuffler makes a number of ptrace system calls in handling the tram-poline. As a point of comparison, a function call on the same system takes only0.002 µs.5.1.3 Code Injection Overhead [on Ivy Bridge]As a simple code injection example, injecting an mprotect involves four con-text switches and the actual call to mprotect. This is due to the target programtrapping to Shuffler, which injects the mprotect and switches back to the targetprogram. The target program executes the mprotect call and traps back to Shuf-fler, which then switches back to the target program. The context switch after themprotect is a callback mechanism that Shuffler uses to perform any additionaloperations.31Since the cost of a context switch between the target program to Shuffler andback takes 6.9 µs, the expected cost of an mprotect call plus the additional con-text switch back and forth should be at least twice that of just context switching.However, the cost is only 9.1 µs. The reduction in expected cost is likely due to theeffects of caching.5.1.4 Cache Trampoline Optimization [on Ivy Bridge]Description RuntimeBaseline 0.003sShuffling without optimizations 13.887sWith func ptr caching 6.946sWith func ptr caching and rewriting 0.222sTable 5.2: Function pointer overhead on a million calls.To measure the performance of function pointers, we created a synthetic bench-mark that dereferences a function pointer to an empty function a million times. Thesimple step of introducing trampolines (without any optimization) causes a hugeperformance penalty. This is due to every function pointer call turning into twoseparate traps, one at the indirect call site (because the call *%rax instructionis so small it needs a trap) and the other at the original function location.The first optimization is to turn on function pointer caching, which writes atrampoline at the original function location to remove one of the traps on the criticalpath, cutting the runtime in half. The second optimization is to rewrite the sourcefunction so that the indirect call can be replaced with a jump, removing the othertrap from the critical path. The combination of these optimizations results in goodperformance. Though the baseline is still much faster at 0.003s, that measurementdoes not include the constant startup-time overhead required to shuffle a process.5.1.5 Jump Table Optimization [on Ivy Bridge]Figure 5.3 shows the effect of jump table optimizations on the sjeng SPECCPU2006 benchmark with a 10 ms shuffle rate. Introducing jump table optimiza-tions realizes a 171x performance improvement. This is because originally the32Description RuntimeBaseline 3.61sShuffling without optimizations 939.00sShuffling with optimizations 5.48sTable 5.3: Performance improvement for jump table optimizations on sjeng(10ms shuffle rate).jump table refers to the original function location, and invoking it requires a con-text switch. With jump table optimizations, this context switch is replaced by atrampoline.5.2 Macrobenchmarks on SPEC CPU2006We ran Shuffler on a representative selection of SPEC CPU2006 benchmarks. Fig-ure 5.1 shows the overhead of shuffling compared with native performance, withshuffling rates varying between every 100, 50, 20, and 10 milliseconds. Perfor-mance generally decreases at higher rates of shuffling.The performance characteristics are clearest in the libquantum case: here, anyshuffling at all has an overhead of roughly 10% (because of the constant overheadintroduced by trampolines). Then, performance decreases as the shuffling rate in-creases, because the target program spends more time stopped and its CPU’s cacheshave to contend with more cross-modifications from the shuffling processor.Some cases, especially milc and hmmer, have large codebases. Thus, morework needs to be done in each shuffling period, which either causes performanceoverhead (as in milc) or even causes writes to be delayed too long for shufflingto operate correctly (as in hmmer at high shuffling rates). The scheme presentedin Section 4.1.2 would need to be properly implemented for cases like hmmer towork. The other C benchmarks in SPEC CPU fail for what is probably the samereason, with other explanations in some cases (e.g., perl uses threads, which wouldneed to be supported as described in Section 8.1).33bzip2 hmmer lbm libquantum milcRelative Performance0.80.911. shuffle 100ms shuffle 50ms shuffle 20ms shuffle 10msFigure 5.1: Performance of shuffler mechanisms and periodic shuffling on SPEC CPU2006 Benchmarks.34Shuffle Rate Average Overhead100ms 10.3%50ms 11.3%20ms 13.4%10ms 21.1%Table 5.4: Average overhead of SPEC CPU2006 benchmarks with differentshuffling rates.Collecting the average runtimes into Table 5.4, we see an average overhead of10-21% on the SPEC benchmarks that successfully run.5.3 Running Nginx [on Ivy Bridge]We ran the web server nginx (with optimizations and threads disabled) under Shuf-fler. We created a 90-byte index.html page and fetched it 1000 times sequentially.Without Shuffler, this took 7.136 seconds, and with shuffling every 100ms, it took8.525 seconds (an overhead of 19%). This is not bad considering the complexityof a web server.35Chapter 6Security Evaluation6.1 Memory Disclosure AttacksThe primary motivation for binary shuffling is to mitigate sophisticated memorydisclosure attacks, such as the one presented by Snow et al [31]. As discussed inSection 2.3, we assume a remote attacker who knows one valid code address, andwho may repeatedly invoke a memory access vulnerability to report back memorypages, tracing the control flow of the program so as to always request valid pages.The attacker can then compile a customized ROP attack by searching for gadgetswithin the retrieved code, and specifying control flow between these gadgets.For this type of attack to succeed, the gadgets that are used by the exploitmust be located at the same place in memory when the exploit is executed. Inother words, if one or more gadgets are shuffled away before the exploit can becompiled and executed, the attack will have been foiled. We wish to show thatshuffling occurs at a fast enough rate that such attacks will always be foiled withhigh probability.6.1.1 Analysis of a Memory Disclosure AttackFor our analysis, we define the following variables:• let TS be the time between shuffles;• let TD be the time taken to discover and retrieve code pages; and36• let TE be the time required to execute the customized ROP attack.However, in a departure from Snow, we assume a stronger adversary who hasaccess to an unshuffled binary of the target program and can analyze it staticallyin advance of the attack. Thus, the abstract control flow graph between possiblegadgets is known, although it does not contain actual memory addresses. Thememory discovery process is used to fill in the abstract control flow graph withactual addresses to create exploit code. We further assume the attacker may havemany such potential exploits available, requiring different gadgets. Assuming thatcrafting exploits has been done in advance, the time to complete an attack is givenby TD +TE .We assume that all functions are shuffled atomically. This means that if evena single gadget remains to be executed after a shuffle, the gadget attack code willnot run to completion. We assume that if the gadget attack code does not run tocompletion, then it will not succeed. Under these assumptions, an attack that fillsthe abstract gadget control graph directly with memory locations read during thediscovery process will not succeed if a shuffle occurs before the attack completesboth the memory discovery phase and the gadget execution phase. That is, theattack fails if TD +TE > TS.Note that TS is a tunable parameter of the Shuffler and can be set by the userbased on current estimates of TD + TE of known memory disclosure exploits inthe wild. Prior work on just-in-time gadget compilation suggests that TD may bequite high (discussed below). However, if we can demonstrate that even the mostpessimistic TD is greater than TS, then this is sufficient and TE can be ignored (webelieve TE to be quite low in practice).6.1.2 Why the Attacker Has Insufficient TimeAn attacker needs several distinct gadgets to craft an exploit. Polychronakis et al[26] find the number of distinct gadgets to be between 8 and 21 in an analysis oftwelve real-world ROP exploits. As we are considering the pessimal case, we as-sume an attacker only needs to find 8 gadgets. In an analysis of common WindowsDLLs, Snow et al [31] were able to discover a total of 413 gadgets across 307pages, yielding an average density of 1.35 gadgets per page. Therefore, in order to37obtain 8 gadgets, we estimate that an attacker will need a minimum of 6 pages.The fastest demonstrated remote exploit (a JavaScript page loaded in a webbrowser which contains their exploit framework) [31] harvests pages at a rate of84 pages/second.1 Therefore, in the most pessimistic case, if the first 6 pagesdiscovered happen to have all 8 gadgets that the attacker needs, the time requiredis 71 ms. As long at TS is less than this, the attack will fail.In reality, TD is likely to be quite a bit larger than in this pessimal analysis. Inthe fastest attack described by Snow et al, it takes 2.3 seconds to traverse pages,collect gadgets, resolve APIs, and compile an attack. Examining their graph whichshows a breakdown of time spent on each aspect of the attack, it takes approxi-mately 500 ms for the gadget discovery phase. Therefore, a TS of 500 ms is likelysufficient to defend against most attacks, while a TS of 50 ms is likely sufficient todefend against even the most pessimal attacks.Finally, all of this discussion has been assuming the attacker can directly in-teract with the target program. More typically, the user is concerned with remoteexploits. In this case, even after the memory vulnerability disclosure has been usedto read data from the target process, that data must be communicated back to the at-tacker. The attacker will need to overcome the intervening network latency, whichis likely on the order of a millisecond (at minimum). And unless the disclosurevulnerability can be run in batches, an attacker may need to make hundreds of callsfor I/O across the network just to read a handful of code pages (since each readwould normally retrieve only a small amount of data).6.2 Other Types of AttacksAdversaries are not limited to the type of attack given by Snow et al. We discussother possible ways to subvert dynamic shuffling.Brute forceAs with any randomization technique, the attacker can simply guess at the memorylocations of the gadgets required for an attack. We assume that at least one shuf-fle has been performed (the target may be vulnerable before this, but Shuffler can1We assume that their hardware, which is not described, is equivalent to our own.38easily perform a shuffle when the target is first started). Then each function will beplaced at a random byte-aligned address somewhere in the shuffling sandbox (suchthat it does not overlap with any other function). Currently we use 256MB sand-boxes, which provides 28 possible bits of entropy. If at most half of the sandboxis used (code sizes are usually much smaller than this), then there are at least 27bits of entropy in choosing any new function location. So correctly guessing thelocation of a particular block of code will only work one in 227 times. So pinningdown the location of (say) eight unique gadgets would only succeed one time in227·8 = 2216. This appears very unlikely to be a successful attack vector.Guessing the Random SequenceAn attacker may find greater utility in trying to discover the random seed used byShuffler. If Shuffler’s operation were completely deterministic with respect to therandom number generator’s seed, then a well-provisioned attacker could simplyrun (or simulate) Shuffler on the target program with each of the possible seeds.In fact, Shuffler’s random number generator only has 232 possible seeds, so anattacker could also randomly guess the seed, providing much better odds for theattacker than guessing at gadget locations.The correct solution here is to use a random number generator with a seed widerthan 32 bits. We did not implement this; however, it is a fairly simple change. Theperformance of Shuffler does not depend much on the speed of the random numbergenerator, because random numbers are generated asynchronously in the workerthreads (which have more time than they need).Using the Shadow Stack to Find Valid AddressesThe shadow stack used by bookkeeping trampolines contains a list of pointers toreturn addresses, which are themselves inside trampolines. So this gives an attackeraccess to additional valid code locations (albeit doubly indirected). Much the sameinformation is already available by reading the program’s real stack, but with theshadow stack, no stack unwinding is necessary. However, we are assuming that theattacker knows some valid code locations and simply cannot read them fast enoughto execute an attack. So this additional information is not particularly helpful.39Using Cached Calls to Find Valid AddressesFor performance reasons, when Shuffler sees the target invoking a function pointerand raising a trap at the original location of the function, a cache trampoline isplaced at the original location which simply jumps to the current location of thereal function. This represents a much more significant advantage for the attacker:it becomes possible to read the original location of a function, and find the currentlocation of that particular function. What’s more, if the cache trampoline is notpresent, the attacker will simply read a series of trap instructions instead of crashingthe target. So an attacker can collect a set of gadgets that are necessary for anattack, continuously re-read the original locations of the functions containing thosegadgets, and spring the exploit only when all functions have cache trampolines.However, this attack only works for functions which are referenced by a func-tion pointer. It also requires an attacker to read at least sixteen pages (originalfunction locations and then current locations) to find eight gadgets; the current lo-cations are dependent on the values of the first reads and cannot all be batchedtogether. So our analysis of a memory disclosure attack given above still holds,although this essentially gives the attacker their best case scenario for number ofpages to read. It is also possible to disable cache trampolines, and simply take theperformance hit on function pointer calls.In future we would like to have a stronger solution to this problem. The bestoption seems to be to start randomizing function pointers; i.e., create cache trampo-lines but at random locations, and occasionally invalidate them. Function pointerscan be updated by using the Intel Last Branch Record to find the source instruc-tion, and changing the register or memory location which contained the functionpointer. This will obviously miss many occurrences of pointers, but should providegood performance for tight loops.Jumping to Original Function LocationsShuffler traps all jumps that reach within an original function’s code (once thefunction has been shuffled away). This could be a legitimate jump, usually fromdereferencing a function pointer (or undetected jump table), in which case the in-struction pointer will be warped to the current function location. The jump could40also be the result of an attacker trying to jump into the middle of a function. At themoment, Shuffler cannot differentiate and assumes that the jump is legitimate. Itis likely that a simple heuristic, like allowing jumps to the start of the function butnot the middle, will suffice. In any case, some policy must be implemented beforereal-world Shuffler deployment. This is discussed further in Section 8.6.Trampoline Re-Reading AttacksOne technique an attacker might try is to continuously re-read trampolines. Thisis particularly effective if trampolines are not shuffled; the trampolines themselvesmay not contain any gadgets, but a simple re-read of the trampoline will reveal thecurrent location of its source and destination functions. Once a trampoline is dis-covered, the attacker can always read that location (leaking information betweenshuffling periods). Furthermore, the attacker can read a trampoline to discoverwhether functions have been shuffled or not, providing a potential way to synchro-nize an attack within shuffling periods.Needless to say, this is a fairly powerful attack which is outside of our originalthreat analysis. We avoid the issue by shuffling trampolines just as much as wemove functions; when bookkeeping trampolines are used, re-reading a trampolineis no more effective than re-reading a function.Compromising the Shuffler ProcessShuffler itself is a natural point of attack given its privileged position with respectto target code. However, there is no opportunity for the attacker to exploit Shufflerthrough the target program. While Shuffler does read from memory locations inthe target code, this is treated only as data and is stored in no-execute pages ofmemory within Shuffler. The behaviour of Shuffler is not based on the contentsof those reads during runtime. The behaviour is based entirely on the analysisof the symbol tables and dynamic disassembly when Shuffler launches the targetprogram. In particular, Shuffler knows exactly how large each function that it readsshould be and only copies that amount of memory. It is possible that a maliciousexecutable could be crafted that causes Shuffler to misbehave, but we are assumingthe attacker has no control over the executable which is given to Shuffler.41Chapter 7Related Work7.1 Return-Oriented ProgrammingReturn-Oriented Programming (ROP) began as return-into-libc attacks [30], andwas generalized to finding gadgets in any existing code base. Recently, ROP com-pilers can automatically find gadgets and create a desired exploit using the avail-able gadgets; ROP compilers can even harden attacks for the presence of ASLRand other defenses [28].7.2 Static RandomizationAddress Space Layout Randomization (ASLR) [25, 38] randomizes the locationsof base addresses of the program (e.g., stack, heap, and libraries). Thus, an at-tacker no longer knows where in memory specific code resides and therefore canno longer reliably overwrite the return address with a valid location. However,since only base addresses are randomized, the relative position of code within eachmodule remains constant. This opens the door for a memory disclosure attack. Ifan attack can reveal even a single address within a module, it can determine thelocation of all the code in that module [32, 35].Building upon the basic concept of ASLR, fine-grained ASLR takes random-ization one step further: instead of just randomizing the base address of a codemodule, the code within the module itself is randomized. Bhatkar et al [2] random-42ize at the granularity of functions, while also randomizing the location of stack-resident variables and static data. However, their approach is a source-to-sourcetransformation. Binary stirring [37], on the other hand, applies randomization tobinaries, at the granularity of basic blocks. Instruction Location Randomization(ILR) [16], as its name suggests, provides even finer grained randomization byplacing each instruction at an arbitrary address. In each of these systems, however,a static randomization is applied at load time. That is, a program is rerandomizedeach time it is run, but this randomization exists for the lifetime of the program.In the face of multiple memory disclosure attacks in conjunction with just-in-time exploit generation [31], these static randomizations are insufficient. A singlememory disclosure vulnerability can be exercised multiple times in order to revealmultiple addresses. Once an attacker has gathered enough information, an exploitcan be compiled and executed.7.3 Just-In-Time Code Reuse AttacksThe attack outlined by Snow et al [31] is able to subvert fine-grained code random-ization and construct a successful exploit. The attack requires a memory disclosurevulnerability that can be invoked repeatedly. Conceptually, this vulnerability couldbe invoked to read the entire code segment of the target program, and then constructa ROP attack based on the code that is present (no matter what randomizing permu-tation has been used). However, in practice attempting to read addresses arbitrarilywill cause page faults on unallocated pages, terminating the target program.Thus, the Snow attack is predicated on knowing one valid code address. Thememory access vulnerability can be invoked to read the whole page containingthat address (since memory protection is done at page granularity). The code isdisassembled, and any call or jmp instructions point to other pages which alsocontain code and may be disassembled (the next page is another potential can-didate, if control flow falls off the end of the previous page). By repeating thisprocess, the attacker can often recursively disassemble a significant portion of thetarget’s code.After discovering valid code pages and retrieving their contents, the attackercan search these pages for ROP gadgets. If sufficient gadgets are found, a ROP43compiler can be used to target an exploit to the existing gadgets. Finally, the at-tacker can use a separate vulnerability (such as a stack smashing bug) in the targetprogram to inject the custom-compiled ROP attack into the target, and successfullyrun an exploit.7.4 Dynamic RandomizationIn order to defend against multiple memory disclosure attacks, the layout of a pro-gram needs to be re-randomized as it executes [31]. This dynamic randomizationcan only succeed in defending against these attacks if the re-randomization occursquickly enough.Stabilizer [7] improves the consistency of performance measurement techniquesby removing the side-effects of static randomizations on cache and branch predictorperformance. Stabilizer is implemented as a compile-time transformation, necessi-tating source code. However, rather than continuously re-randomizing everything,they instead mark functions for re-randomization every 500 ms. The first timea function is executed after being marked, it is re-randomized. Thus, rarely usedfunctions can exist at fixed locations for long periods of time. Further, a copy of thecode remains at its original location, easily exploitable by an attacker. Stabilizer,however, was not designed for security purposes.Giuffrida et al [15] propose a fine-grained re-randomization strategy for se-curing operating system components. Implemented as an LLVM link-time trans-formation, it also requires access to source code. At every re-randomization step,a new process is linked and launched and live state is migrated to the new pro-cess. The overhead of this appeoach is quite high, 42.7% average overhead for aone-second re-randomization period.7.5 Control-Flow IntegrityControl-Flow Integrity (CFI) is a defense against ROP, where every indirect jumpis validated before it is taken, so as to prevent control-flow hijacking. The originalCFI implementation [1] provides strong correctness properties but has high over-head. Recent research has focussed on enforcing less strict (or “coarse-grained”)CFI, trading some security for a much faster implementation. CFI techniques have44been proposed that work just with debug information [13] and even on arbitrarybinaries [39].Unfortunately, a recent work by Davi et al [10] showed that most of these looserversions of CFI provide insufficient defense. The authors found a set of specificgadgets (located after call instructions, to look like valid return locations) such thatrecent CFI mechanisms would allow control-flow transfer and a return-orientedattack. These gadgets are Turing-complete.Works such as XFI [13] still provide security against ROP attacks, but furtherwork will need to be done to find a solution which works on arbitrary binaries, ora solution which has reasonable performance overhead.45Chapter 8Future Work and Discussion8.1 Shuffler CompletenessAlthough Shuffler handles many ways that a userspace target program might createstale address references, discussed in more detail in Section 4.4.1, there are othercases (like C++ exceptions) which we do not handle. These cases are likely toinvolve most implementation work for special-casing, but our overall design shouldremain unchanged. To truly handle unmodified program binaries, Shuffler shouldsupport as many of these cases as possible.In addition, Shuffler does not handle several legitimate features of userspaceprocesses, including threads and forks.• Threads While not fully implemented, handling threads is straightforward.Whenever an old copy of a function is erased, Shuffler would simply checkwhether any thread is still executing the function, rather than just the mainthread. Additionally, bookkeeping trampolines would need to store theirsaved return addresses in a thread-safe way instead of just appending to astack. This could be implemented using a hash table, or by allocating thread-local storage for the bookkeeping stacks.11On x86 64, GNU libc dedicates fs to point to thread-local storage, and allocatesTLS SLOTINFO SURPLUS (62) extra TLS entries just in case. We could easily use one entryto point to a bookkeeping data structure (different offsets into a memory map). If the original ELFhas no TLS program header then the fs register is available for our own use.46• Fork/exec Intercepting fork and exec calls is easy with ptrace. But forkcalls are difficult to handle because of the shared memory map technique weuse to access memory in the target. These shared memory maps are pre-served across forks, and the parent and child end up sharing their code andstack—essentially turning each fork call into a vfork call. Thus, the firstaction after a fork needs to be injecting the same code that was initially usedto set up shared memory maps, to unmap and remap the memory regionsbacked by new files. We have not fully implemented this, but it is the samemechanism as originally used by the loader. Finally, exec calls can be han-dled by simply throwing away the existing structures and starting over, as ifattaching onto a new process.We would, of course, like to have had the time to fully implement these fea-tures, and hope to do so in the future.8.2 Code DisassemblyAlthough Shuffler is targeted at common off-the-shelf binaries, our current imple-mentation relies on having a symbol table to discover code in the target. We donot consider this a serious limitation, because there has been a great deal of priorwork in recursive disassembly [29]. It is very hard to do fully accurate disassembly,but fortunately Shuffler does not need a complete disassembly; if some functionsare not found, they will simply not be shuffled. We would like to incorporate oneof these existing disassembly projects into Shuffler so that the latter will work onstripped binaries.8.3 JIT CodeOne major advantage of dynamic randomization is the ease of supporting dynamically-generated code (simply add the new code to the list of shufflable units). Thus, onepotential future direction for this project is to support JIT code from language run-times. Shuffler could simply remove execute permission on suspected code pages(e.g., all unknown pages allocated by the program), wait for a jump into them, andthen assume a function begins at that point and disassemble from there. This is47almost the same process of dynamic code discovery discussed in Section 4.2.2. Wehave not implemented this technique, but we believe it speaks to the flexibility ofrandomization techniques that they can be easily extended to JIT code.8.4 Hot PatchingThe current Shuffler can dynamically attach onto and detach from running pro-cesses. This is very good for defending against zero-day vulnerabilities, because alayer of protection can be added to a running service which is known to be vulner-able, without bringing that service down. Because Shuffler is already dynamicallycopying in new versions of functions, it would be straightforward to extend Shuf-fler to incorporate hot patches that fix a vulnerability before detaching. Thus, a usercould attach, wait for a patch, apply the patch, and detach from the target processwithout ever stopping it. This means Shuffler is poised to be an excellent zero-day defense. However, there is currently a small performance degradation over theoriginal when Shuffler detaches, so this would ideally be fixed first.8.5 Shuffling an Entire SystemGiven Shuffler’s ability to hook into a system loader, it would be quite possibleto attempt to shuffle every userspace program running on a system. The way thatforks would be handled allows one Shuffler process to handle multiple target pro-cesses, if they are spawned from the same source (e.g., a shell). And Shuffler caneasily create additional worker threads to handle the new processes. In other words,potentially a small number of Shuffler processes (running on a small number of ex-cess CPU cores) could shuffle an entire system at once. It would be interesting tosee whether this works.8.6 Shuffling PolicyShuffler takes a very straightforward approach to shuffling, simply moving everyfunction sequentially within a fixed time period. It might make sense to find func-tions that are higher-risk, perhaps because they contain obvious gadgets or systemcalls, and shuffle these at a higher rate. As mentioned in Section 3.1.4, perhaps the48shuffling rate should be customizable based on resource load and other features ofthe supporting system. Shuffler could be extended to have a shuffling policy thataffects the rate at which different functions are shuffled, and when to dynamicallyadjust these rates. A policy would also be useful to decide whether a stray jump isthe result of an attack or not.49Chapter 9ConclusionThe two main classes of defenses against ROP, namely control-flow integrity andrandomization, have both recently been shown to have weaknesses [10, 31]. Weaim to resurrect randomization techniques as a viable defense against ROP. Wedo this by introducing shuffling, a high-frequency dynamic re-randomization tech-nique.Our system, Shuffler, implements binary shuffling (i.e., shuffling for programbinaries without access to source code). Shuffler runs as a separate userspace pro-cess, for easy deployability. It can run with a custom program loader, or attach ontorunning programs (optionally even detaching at a later point), shuffling all the func-tions in a program as quickly as every ten milliseconds. In the future, a platformlike Shuffler could provide on-the-fly defense against zero-day vulnerabilities, andoperate on dynamically generated JIT code from language runtimes.We benchmarked Shuffler on the benchmark suite SPEC CPU 2006, and foundan average overhead of 10-21% depending on the shuffling rate. This overhead issmall enough that shuffling appears to be a viable defense against memory disclo-sure attacks and ROP.50Bibliography[1] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti. Control-flow integrity. InProceedings of the 12th ACM Conference on Computer andCommunications Security, pages 340–353, 2005. → pages 2, 44[2] S. Bhatkar, R. Sekar, and D. C. DuVarney. Efficient techniques forcomprehensive protection from memory error exploits. In Proceedings ofthe 14th USENIX Security Symposium, pages 271–286, 2005. → pages 2, 42[3] T. Bletsch, X. Jiang, V. W. Freeh, and Z. Liang. Jump-orientedprogramming: a new class of code-reuse attack. In Proceedings of the 6thACM Symposium on Information, Computer and Communications Security,pages 30–40. ACM, 2011. → pages 7[4] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptivedynamic optimization. In CGO, 2003. → pages 25[5] C. Cifuentes and M. Van Emmerik. Recovery of jump table case statementsfrom binary code. In Program Comprehension, 1999. Proceedings. SeventhInternational Workshop on, pages 192–199. IEEE, 1999. → pages 27[6] C. Coutant. Bug 15758 - gold segfault when using -q option. bug.cgi?id=15758, 2013. → pages 20[7] C. Curtsinger and E. D. Berger. Stabilizer: Statistically sound performanceevaluation. SIGARCH Comput. Archit. News, 41(1):219–228, Mar. 2013. →pages 44[8] G. Dabah. distorm3., 2003–2012. → pages 20[9] M. Daley. Full Disclosure: Information on recently-fixed Oracle VMVirtualBox vulnerabilities. →pages 151[10] L. Davi, D. Lehmann, A.-R. Sadeghi, and F. Monrose. Stitching the gadgets:On the ineffectiveness of coarse-grained control-flow integrity protection. In23rd USENIX Security Symposium, Aug. 2014. TECHNICAL REPORTTUD-CS-2014-0097. → pages 2, 6, 7, 45, 50[11] R. Davoli. [PATCH 1/2] ptrace multi: speedup for virtual machines (anddebuggers) running on ptrace.!msg/linux.kernel/3yoMhJ5Q-uE/Ymec2PD56dUJ, 2008. → pages 13, 15,31[12] T. Dullien, T. Kornau, and R.-P. Weinmann. A framework for automatedarchitecture-independent gadget search. In WOOT, 2010. → pages 6[13] U. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula. Xfi:Software guards for system address spaces. In Proceedings of the 7thSymposium on Operating Systems Design and Implementation, pages 75–88,2006. → pages 45[14] M. Frysinger et al. Hardened/GNU stack quickstart - Gentoo Wiki. stack quickstart. → pages 5[15] C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum. Enhanced operating systemsecurity through efficient and fine-grained address space randomization. InPresented as part of the 21st USENIX Security Symposium (USENIXSecurity 12), pages 475–490, 2012. → pages 22, 44[16] J. Hiser, A. Nguyen-Tuong, M. Co, M. Hall, and J. W. Davidson. Ilr:Where’d my gadgets go? In Security and Privacy (SP), 2012 IEEESymposium on, pages 571–585. IEEE, 2012. → pages 2, 43[17] IBM. Ibm system z application assist processor (zaap). → pages 10[18] Intel. Intel 64 and ia-32 architectures software developer’s manual volume3a: System programming guide, part 1, Mar 2010. → pages 4, 16, 24, 26[19] R. McGrath. Re: Executable memory: some apps that work on RH9 don’ton FC1. → pages5[20] D. Mosberger. The libunwind project., 2014. → pages 2252[21] V. Murdico. Tester’s world: Bugs per lines of code. → pages1[22] Oracle. Oracle security alert cve-2013-0422. → pages1[23] J. Pauling. x86 64 asm - maximum bytes for an instruction? - stackoverflow. → pages7[24] PaX Team. Homepage of PaX. → pages 5[25] PaX Team. PaX address space layout randomization (aslr)., 2003. → pages 42[26] M. Polychronakis and A. D. Keromytis. Rop payload detection usingspeculative code execution. In Malicious and Unwanted Software(MALWARE), 2011 6th International Conference on, pages 58–65. IEEE,2011. → pages 37[27] Red Hat. Exec shield. →pages 5[28] E. J. Schwartz, T. Avgerinos, and D. Brumley. Q: Exploit hardening madeeasy. In Proceedings of the 20th USENIX Conference on Security, pages25–25, 2011. → pages 6, 42[29] B. Schwarz, S. Debray, and G. Andrews. Disassembly of executable coderevisited. In Reverse Engineering, 2002. Proceedings. Ninth WorkingConference on, pages 45–54. IEEE, 2002. → pages 27, 47[30] H. Shacham. The geometry of innocent flesh on the bone: Return-into-libcwithout function calls (on the x86). In Proceedings of CCS 2007, pages552–61, Oct 2007. → pages 1, 5, 42[31] K. Z. Snow, F. Monrose, L. Davi, A. Dmitrienko, C. Liebchen, and A.-R.Sadeghi. Just-in-time code reuse: On the effectiveness of fine-grainedaddress space layout randomization. In IEEE Symposium on Security andPrivacy, 2013. → pages 2, 7, 36, 37, 38, 43, 44, 5053[32] A. Sotirov and M. Dowd. Bypassing browser memory protections.,2008. → pages 42[33] M. Tran, M. Etheridge, T. Bletsch, X. Jiang, V. Freeh, and P. Ning. On theexpressiveness of return-into-libc attacks. In Recent Advances in IntrusionDetection, pages 121–141. Springer, 2011. → pages 6[34] user3232765. Why ptrace doesn’t attach to process after setuid? → pages13[35] P. Vreugdenhil. Pwn2own 2010 windows 7 internet explorer 8 exploit., 2010. → pages42[36] R. Wartell, Y. Zhou, K. W. Hamlen, M. Kantarcioglu, and B. Thuraisingham.Differentiating code from data in x86 binaries. In Machine Learning andKnowledge Discovery in Databases, pages 522–536. Springer, 2011. →pages 20[37] R. Wartell, V. Mohan, K. W. Hamlen, and Z. Lin. Binary stirring:Self-randomizing instruction addresses of legacy x86 binary code. InProceedings of the 2012 ACM Conference on Computer andCommunications Security, pages 157–168, 2012. → pages 2, 43[38] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Transparent runtime randomization forsecurity. In Reliable Distributed Systems, 2003. Proceedings. 22ndInternational Symposium on, pages 260–269. IEEE, 2003. → pages 42[39] M. Zhang and R. Sekar. Control flow integrity for COTS binaries. InUSENIX Security Symposium, 2013. → pages 2, 20, 21, 4554Appendix AInjected CodeFigure A.1 shows the code for our full bookkeeping trampolines (refer back to Fig-ure 4.6 for a higher-level description of bookkeeping). The address 0x7fffffff(a 31-bit constant) is replaced with the address of the shadow stack region, and theaddresses of the source and destination functions are embedded at the end of thetrampoline. Every time the trampoline is called, the current stack pointer %rspis saved on the shadow stack, and on return, popped from the shadow stack. Theshadow stack itself is a flat memory region, the first 8 bytes of which are a pointerto the current top of the stack.The two NOPs at the beginning serve to align the final embedded addresses to8-byte boundaries, so that they can be updated atomically with normal writes (asdescribed in Section 4.1.2). The incoming jump can safely point beyond the NOPs.Note that when the instruction pointer is set with ptrace, we sometimes observedit warping backwards two bytes, perhaps due to canceling speculative execution orsome other hardware feature; thus, although trampolines are never warped to, wetry to have two leading NOPs (which are normally skipped over) at the beginningof any injected code segment.Finally, the whole trampoline is exactly 64 bytes, so that is the memory sizeallocated for each trampoline within the code sandboxes.55100: 90 nop101: 90 nop102: 50 push %rax103: 48 8b 04 25 ff ff ff mov 0x7fffffff, %rax10a: 7f10b: 48 89 20 mov %rsp, (%rax)10e: 48 83 c0 08 add $0x8, %rax112: 48 89 04 25 ff ff ff mov %rax, 0x7fffffff119: 7f11a: 58 pop %rax11b: ff 15 0f 00 00 00 callq *0xf(%rip) # 130121: 48 83 2c 25 ff ff ff subq $0x8, 0x7fffffff128: 7f 0812a: ff 25 08 00 00 00 jmpq *0x8(%rip) # 138130: 00 00 00 00 00 00 00 # address of dest function137: 00138: 00 00 00 00 00 00 00 # address of call source13f: 00Figure A.1: Trampoline code with bookkeeping.56


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items