Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Inter-process communication in disaggregated datacenters Carbonari, Amanda 2018

You are currently on our download blacklist and unable to view media. You will be unbanned within an hour.
To un-ban yourself please visit the following link and solve the reCAPTCHA, we will then redirect you back here.

Item Metadata


24-ubc_2018_may_carbonari_amanda.pdf [ 585.55kB ]
JSON: 24-1.0365936.json
JSON-LD: 24-1.0365936-ld.json
RDF/XML (Pretty): 24-1.0365936-rdf.xml
RDF/JSON: 24-1.0365936-rdf.json
Turtle: 24-1.0365936-turtle.txt
N-Triples: 24-1.0365936-rdf-ntriples.txt
Original Record: 24-1.0365936-source.json
Full Text

Full Text

Inter-process Communication in DisaggregatedDatacentersbyAmanda CarbonariBSc. Computer Science, Colorado State University, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)April 2018c© Amanda Carbonari, 2018AbstractDisaggregation is a promising new datacenter (DC) architecture which aims to mit-igate mounting DC costs. Disaggregated datacenters (DDCS) disaggregate tradi-tional server components into distinct resources. Disaggregation also poses aninteresting paradigm shift. Namely, a DDC possesses traits akin to a distributedsystem, as resources no longer fate- share: a CPU can fail independently of an-other CPU. It is not unreasonable to assume that these disaggregated resources willstill be presented to a user as a single machine. This requirement has implicationsfor disaggregated system design. For example, what happens if a CPU fails duringa remote cross-processor procedure call?This is not a new question, as distributed systems, multi-processor systems, andhigh performance computing (HPC) systems, have grappled with this challenge.We look at how this challenge translates to a disaggregated context, in particular,focusing on the remote procedure call (RPC) abstraction. We design a disaggre-gated system, Bifro¨st, to ensure exactly-once semantics for procedure calls underfailure scenarios and provide strict memory consistency. We analyze the overheadof Bifro¨st compared to an equivalent RPC implementation in Thrift. Although, wefind that Bifro¨st has a higher overhead than Thrift, its results are still promising,showing that we can achieve greater functionality than Thrift with a slightly higheroverhead.iiLay SummaryDatacenters are costly to operate due to cooling costs, machine upgrades, etc. Dis-aggregation, a trend of separating resources into individual entities, attempts tomitigate these costs. Once the resources are separated, the datacenter no longerprovides the same single machine architecture users typically work with.To provide this single machine abstraction, the disaggregated datacenter mustprovide some guarantees about the resources. In particular, what will happen if thecompute resource of the machine fails? We focus on providing memory consis-tency and communication guarantees even under failure for disaggregated systems.iiiPrefaceThe work presented in this thesis was conducted by the author in collaboration withFabian Ruffy under the supervision of Dr. Ivan Beschastnikh. None of the text ofthe dissertation is taken directly from previous published or collaborative articles.The system design in Chapter 5, inter-process communication system in Chap-ter 6, and the experiments in Chapter 7 were primarily implemented and performedby me. Design choices for Chapter 4 and Chapter 5.3 were influenced by feedbackand input from Ivan Beschastnikh and Fabian Ruffy. The implementation of theswitch and distributed shared memory system in Chapter 6 was primarily done byFabian Ruffy.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Failure rates in large-scale systems . . . . . . . . . . . . . . . . . 42.2 Strawman argument . . . . . . . . . . . . . . . . . . . . . . . . . 52.3 Existing architectures . . . . . . . . . . . . . . . . . . . . . . . . 52.4 Programmable switches for resource management . . . . . . . . . 63 Background and assumptions . . . . . . . . . . . . . . . . . . . . . . 73.1 Disaggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . 73.2 Programmable switches . . . . . . . . . . . . . . . . . . . . . . . 8v3.3 Remote procedure calls . . . . . . . . . . . . . . . . . . . . . . . 83.4 Distributed shared memory . . . . . . . . . . . . . . . . . . . . . 94 Bifro¨st Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114.1 Memory semantics . . . . . . . . . . . . . . . . . . . . . . . . . 114.2 Call semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Bifro¨st design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155.1 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155.2 Network protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 165.3 Gatekeeper: switch control-plane . . . . . . . . . . . . . . . . . . 175.4 Bifro¨st daemons . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226.1 Thrift modifications . . . . . . . . . . . . . . . . . . . . . . . . . 226.2 DSM system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236.3 P4 data-plane program . . . . . . . . . . . . . . . . . . . . . . . 236.4 Gatekeeper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267.2 How realistic is our test environment? . . . . . . . . . . . . . . . 287.3 What is the base overhead of Thrift and Bifro¨st? . . . . . . . . . . 297.4 What is the impact of UDP vs TCP? . . . . . . . . . . . . . . . . 307.5 What are the overheads on a simple workload? . . . . . . . . . . 317.6 What is the latency breakdown in Thrift and Bifro¨st? . . . . . . . 338 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378.2.1 System improvements . . . . . . . . . . . . . . . . . . . 378.2.2 Evaluation improvements . . . . . . . . . . . . . . . . . 38vi9 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409.1 Network and system co-design. . . . . . . . . . . . . . . . . . . . 409.2 Context-based RPC. . . . . . . . . . . . . . . . . . . . . . . . . . 4110 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43viiList of TablesTable 5.1 Bifro¨st API. . . . . . . . . . . . . . . . . . . . . . . . . . . . 16viiiList of FiguresFigure 1.1 Pass by reference semantics for DDC RPC compared to tradi-tional RPC. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Figure 3.1 Protocol independent switch architecture. . . . . . . . . . . . 8Figure 4.1 Bifro¨st memory semantics during a RPC. The orange regionrepresents A’s over the pointer. Blue represents B’s controlover a pointer. Green represents a leased pointer, thereforeread-only access. . . . . . . . . . . . . . . . . . . . . . . . . 12Figure 4.2 Call semantics with possible failure points. . . . . . . . . . . 13Figure 5.1 Overview of Bifro¨st architecture. . . . . . . . . . . . . . . . . 16Figure 5.2 Bifro¨st packet header. . . . . . . . . . . . . . . . . . . . . . . 17Figure 5.3 Bifro¨st RPC fault recovery scheme. . . . . . . . . . . . . . . . 18Figure 6.1 Bifro¨st P4 pipeline. . . . . . . . . . . . . . . . . . . . . . . . 24Figure 7.1 Latency comparison using ping6 on OpenV Switch (OVS)Mininet and real OVS cluster, averaged over 1000 pings. . . . 27Figure 7.2 Latency comparison using netperf with TCP and UDP onOVS Mininet and real OVS cluster. Averaged over 1000 pingsand with varying payload size. . . . . . . . . . . . . . . . . . 28Figure 7.3 Baseline latency in µs for OVS Mininet compared to P4 Mininetwith ranging payload size. Measurements were taken withnetperf. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29ixFigure 7.4 Thrift and Bifro¨st running on OVS Mininet compared to thenetperf round trip latency measurement for TCP and UDP. 30Figure 7.5 Thrift RPC ping test using TCP and UDP as underlying trans-ports, averaged over 100 iterations. . . . . . . . . . . . . . . . 31Figure 7.6 Increment array RPC test on OVS Mininet with Thrift, Bifro¨st,TCP netperf baseline and UDP netperf baseline. Aver-aged over 100 iterations and varying data size. . . . . . . . . 32Figure 7.7 Add array RPC test on OVS Mininet with Thrift, Bifro¨st, TCPbaseline and UDP baseline. The baselines are NetPerf laten-cies. Averaged over 100 iterations and varying data size. . . . 33Figure 7.8 Breakdown of where time is spent for each small workload anda data size of 16384 bytes, averaged over 100 iterations. . . . 34xGlossaryRPC remote procedure callDC datacenterDDC disaggregated datacenterHPC high performance computingTOR top of rackIPC inter-process communicationPISA protocol independent switch architectureDSM distributed shared memoryAPI application programming interfaceOVS OpenV SwitchSDN software defined networkingRDMA remote memory accessRTT round trip timeICMP Internet control message protocolMSS maximum segment sizeMTU maximum transmission unitxiAcknowledgmentsThis research is supported by an NSERC discovery grant. I would like to thank mysupervisor, Ivan Beschastnikh for shepherding me through my Masters and helpingto grow and improve as a researcher. I’d also like to thank Fabian Ruffy for all thework he put into the project, as this thesis would not have been possible withouthis contributions. Finally, I’d like to thank my finace´, parents, and NSS lab-mateswho supported me throughout my Masters.xiiChapter 1IntroductionDisaggregation is a rising trend that attempts to mitigate mounting datacenter op-erational costs [21]. Disaggregated datacenters (DDCS) separate the traditional re-sources of a server-centric architecture into individual components. A blade is aserver which consists of one specific resource type (i.e., CPU, memory, SSD, etc.),each individual resource is connected over a commodity interconnect (e.g., Ether-net). This architecture provides many benefits to both users and operators, chiefamong them are modularity and density [16].But, disaggregation also poses an interesting paradigm shift. Namely, a DDCpossesses traits akin to a distributed system as resources no longer fate-share: aCPU can fail independently of another CPU. Yet, it is reasonable to assume thatdisaggregated resources will be compiled and presented to the user as a singlemachine to support legacy applications [10, 16].Datacenters currently support a variety of applications and workloads. Ide-ally, these applications and workloads will continue to have the same behavior onDDCS. Yet, a legacy datacenter application, such as Hadoop, cannot reason aboutthe memory of a node failing or one processor in a node failing. Therefore, a DDCmust provide strict guarantees to legacy applications when components fail.But, how can the underlying system abstract away CPU failure during a remotecross-processor procedure call? This particular question has been explored in otherareas of research, such as high performance computing (HPC), multi-processor sys-tems, and distributed systems.1Traditional RPC DDC RPCCPUMemCPUMemSwitchCPU CPUGlobal address spaceargsresultcall/retSwitch call/ret argsresultFigure 1.1: Pass by reference semantics for DDC RPC compared to traditionalRPC.In particular, remote procedure calls (RPCS) have been used in both distributedsystems and multi-processor systems for inter-process communication (IPC). Un-fortunately, current RPC implementations are limited due to failures and inabilityto use pass by reference arguments. There are multiple practical challenges inattempting to achieve local procedure call semantics (exactly-once) under failureconditions and reason about memory addresses on remote nodes [8].We focus on the challenges of building reliable RPC mechanism for disaggrega-tion. Disaggregation allows for optimizations previously unavailable in distributedsystems, primarily, a global shared memory space. Since DDC resources are com-piled to represent a single machine, each processor in a DDC “machine” has uni-form access to a bank of global memory. We leverage this global address space tosupport pass by reference arguments (Figure 1.1).Rack-scale DDCS also differ from distributed systems and multi-processor sys-tems because most datacenter racks today incorporate a top of rack (TOR) switch.We find that the TOR switch presents a natural interposition point to observe cross-processor procedure calls and memory accesses. This allows for a control-planeprogram that monitors all procedure calls and pointer arguments. This programthen interfaces with daemons running on the resources to coordinate and enforcefate-sharing and failure recovery.We prototype our design, Bifro¨st, in an emulated disaggregated rack setting.We evaluate the overhead of our system relative to an equivalent Thrift RPC imple-2mentation [3].In summary, our work makes the following two contributions.• We define an IPC semantics for a disaggregated rack (Section 4).• We present Bifro¨st, a system co-designed with the network, that ensuresexactly-once semantics of procedure calls even under failure and enforcesstrict memory consistency (Section 5).We describe our prototype of Bifro¨st in Section 6 and evaluate it, exploring theperformance implications relative to current RPC implementations (Section 7). Weconclude with the limitations of our work and future directions (Section 8).3Chapter 2MotivationBifro¨st is motivated by the recent trend in datacenter architecture and programmableswitches. We argue that, to reap the full benefits of disaggregation, we must takea holistic approach to designing systems on disaggregated racks. In particular, wemust consider the role of the network as a critical piece in the design. In this work,we focus on the ability of the network to provide fault tolerance and fate-sharingenforcement at line rate, ensuring exactly-once semantics for procedure calls, aswell as enforce strict memory consistency semantics.2.1 Failure rates in large-scale systemsAs systems begin to scale beyond traditional servers into large rack-scale machines,we must consider the implications of failures. HPC has been studying failure ratesin supercomputers for over a decade. They have found that failures not only occurat a rate which requires mitigation, but also at a high enough rate to have detri-mental effects on performance [15, 36, 38, 40, 43]. Although disaggregation isnot at the same scale as supercomputers, it is progressing in that direction. Disag-gregation also has the same requirement as supercomputers: a conglomeration ofresources must be presented as a single entity to the programmer. Therefore, fail-ures must not only be addressed in disaggregation, but they must also be mitigatedat the system level due to performance implications.42.2 Strawman argumentA tempting strawman argument is to enforce traditional fate-sharing semantics.Essentially making DDCS fail like traditional servers. This will lead to inefficientuse of resources under failure [10]. Disaggregation allows for new fate-sharingmodels and possibly new fault- tolerance techniques, these must be discovered andimplemented on a per application basis. In our particular scope, we look beyondfate-sharing between caller and callees, although our system still provides it. Wedesign a way to recover from caller and callee failure instead of failing functionalprocesses.2.3 Existing architecturesDisaggregation presents a different context than many of the existing similar ar-chitectures that handle inter-process communication failures (distributed systems,multi-processor systems, and HPC).Disaggregation differs from distributed systems because distributed systemsmaintain a “node” view, where a collection of resources (CPU, memory, storage)is lost when a “node” fails [20]. Disaggregation must contend with individual re-sources failing [10]. Distributed systems also deal with more points of failure thana disaggregated rack. For example, distributed systems must handle network fail-ures such as partitioning and packet loss [20]. Due to these differences, distributedsystem solutions do not take advantage of the disaggregated resources and maketrade-offs to cover failures not present in disaggregation.Multi-processor systems differ because they were built on a smaller scale withstatic configurations [7, 11]. Disaggregation must contend with failures at therack-scale, but can also replace components with free resources in the rack. Multi-processor systems were also built using intra-connects between resources, not com-modity, packet-switched networks. Therefore, disaggregation is a less constrainedenvironment than multi-processor systems, and must expand up on their solutionsto match the elasticity and scale of the environment.Although HPC systems match if not exceed disaggregation in scale, they relyheavily on specialized hardware [14]. It is reasonable to assume that disaggregatedracks will still be built from commodity hardware to mitigate costs. Therefore5they cannot rely on specialized hardware for solutions. HPC systems often requirespecific programming paradigms from their users. This is not possible in a DDCbecause datacenters today serve a variety of workloads, from hosting web serversto large-scale distributed data processing. Programmers cannot be constrained inthe workloads they can run in the datacenter.Each of these areas differ from disaggregation in a variety of ways, but we canlearn from these solutions and port them to the disaggregated context. We attemptto do so while focusing on IPC.2.4 Programmable switches for resource managementThe TOR switch provides a natural interposition layer for a disaggregated rack. Itobserves all traffic, from control flow to memory accesses. This provides a uniqueadvantage of monitoring the state of the system based on the observed networktraffic. This may seem in violation of the end-to-end argument, but we are onlyproposing to move the management functionality that would perform better at theswitch [37]. Taking a decentralized approach to fate-sharing and memory pro-tection requires coordination of all resources in the system, thus incurring an over-head of communication. Moving that functionality to a passive centralized solutionwhich does not require extra hardware should improve system performance.6Chapter 3Background and assumptions3.1 DisaggregationThere are two types of disaggregation: full and partial. In full disaggregation eachresource is completely independent and attached to the network. For example, aCPU will not have any on board or directly connected RAM. Partial disaggregationis where a CPU will have a small amount of RAM directly attached to it, this RAMacts as an extra cache level and RAM has a small CPU attached to it which actsas a memory controller [16]. Research has been trending towards disaggregatinghardware components, with partial disaggregation being the first step [16, 28, 29].Although disaggregation aims to be deployed at the datacenter scale, there arepractical limitations to this, such as interconnect speed and distance between com-ponents. Recent research and prototypes tend to focus on disaggregation on a rack-scale for these reasons [1, 10, 16]. Based on the current directions in disaggregationresearch, we focus our solution on a partially disaggregated rack.Disaggregation is not only a trend in datacenter architecture, it is very quicklybecoming a reality. It has been shown that legacy applications can perform ina partially disaggregated rack-scale environment if bandwidth is greater than 40Gbps and latency is less than 5 µs [16]. Therefore, we can not only prototypedisaggregated racks [1, 6] but also run legacy applications on them with commodityinterconnects.7ParserMatch-Action Stage 1Match-Action Stage 2Match-ActionStage NEgress pipeline…Control-plane (CPU)Data-plane (ASIC)Network ManagementNetwork FunctionsToR SwitchFigure 3.1: Protocol independent switch architecture.3.2 Programmable switchesProgrammable switches provide flexibility to network operators by allowing fastprototyping of new protocols. There are two primary architectures for these switches,we focus on building for protocol independent switch architecture (PISA). PISAfollows a pipeline match action architecture in the data-plane of the switch (Fig-ure 3.1). Packets get parsed at ingress by a custom parsing program. They then flowthrough the each stage’s match-action table until they reach the egress queues. If apacket requires special processing, it gets trapped to the control-plane of the switch,which has a control program running on a CPU. This has become the predominantarchitecture for Barefoot switches [4]. There are two fundamental limitations forprogramming these switches: 1) the maximum number of pipeline stages is fixedand 2) memory is explicitly tied to a stage and cannot be used by another stage [13].Neither of these limitations effect our work, as we do not require much memoryper stage and we do not require many stages of computation.3.3 Remote procedure callsRPCS aim to abstract away the complexities of distributed communication by per-forming server lookup, marshaling/unmarshaling of data, and network setup forthe developer. Although this is an attractive and simplifying abstraction, there are8many challenges with implementing it. We address two in this work: precise failuresemantics and pass by reference arguments.Precise failure semantics. In the ideal case, RPCS would transparently pro-vide local procedure call semantics (exactly-once). This is largely debated as beingimpractical, therefore the semantics are relaxed to last one [30]. Last-one mean theprocedure is called continuously, only the last call will successfully complete [30].The key to transparent fail-over with last-once semantics lies in orphaned calleediscovery and extermination [30]. This requires the control flow state of the pro-gram to be maintained and used upon failure.Modern RPC implementations do not provide any fault recovery and leave it upto the application using RPCS to handle any errors [2, 3, 5]. They tend to focus onproviding fast marshaling and unmarshaling of complex data types [3, 5].Pass by reference arguments. The simple solution, dereferencing the pointerand sending its data in the RPC, is only viable for the pointers to values. Nestedpointers (i.e., a struct with a pointer to a pointer) require special consideration [30].Often, the overhead to implementing such solutions outweighs the benefit of al-lowing arbitrary pointer arguments. Consequently, most RPC implementations fo-cus on providing multiple language interfaces instead of pass-by-reference argu-ments [3, 5].3.4 Distributed shared memoryDistributed shared memory (DSM) systems provide a global address space for ap-plications. DSMS make many design decisions regarding granularity of memoryaccess (i.e., pages or objects), coherence semantics, scalability, and heterogene-ity [33]. The different coherence semantics range from release consistency to strictconsistency [19, 24, 31, 34]. For example, Grappa provides global linearizableguarantees for their memory model, a strict consistency model [31]. They guar-antee that every modification to the data occurs in a serialized manner and everyread returns the most recently written value. Whereas, TreadMarks provides re-lease consistency, which allows processors to delay exposing changes to shareddata until a synchronization access occurs (acquire or release) [19]. This allowedTreadMarks to improve upon their performance because they did not require heavy9synchronization overhead on all processors and implement a lazy version of releaseconsistency.10Chapter 4Bifro¨st Semantics4.1 Memory semanticsBifro¨st provides strict memory consistency semantics, where the most recentlywritten value will be read by the processor. Concurrent writes by the same nodewill be serialized and processed in order at the memory server. This is enforcedthrough a notion of control and lease. When a CPU allocates a region of memory,it automatically assumes control over that region. When a pointer to a region ofmemory is passed as an argument, the memory is implicitly leased to the receivingCPU.Figure 4.1 displays the memory semantics during an RPC. The arguments(ptr1) are first alloc’d by A (caller). This can be done well before the call (i.e.,read all data into memory) or be done on a per call basis. ptr1 are then passedin an RPC, leasing them to the callee, the caller maintains control over the memorybut can only read it while the callee maintains the lease (green section). Controltransfer occurs only on returning a result to the caller. First B (callee) allocatesmemory for ptr2 and has control over the memory (blue region). When B returnsptr2, it transfers this control to A, converting the access from blue to orange. Thisallows the caller to decide which memory should be freed (ptr1 or ptr2).11A(Caller) ptr 1B(Callee)ptr 2alloc(size)write(payload)RPC: foo(ptr1)alloc(size)read(ptr1)write(ptr2)RPC: return(ptr2)read(ptr2)free(ptr1)Figure 4.1: Bifro¨st memory semantics during a RPC. The orange region rep-resents A’s over the pointer. Blue represents B’s control over a pointer.Green represents a leased pointer, therefore read-only access.4.2 Call semanticsUnder all conditions, Bifro¨st provides exactly-once semantics. We define exactly-once semantics as both the caller and the callee observe the call executing exactlyonce. This is a harder requirement than current RPC systems, as they allow forretransmissions, whereas we consider those to be more than one call (giving last-one semantics). To illustrate this, we focus on a use case where the caller and calleeuse pass-by-reference for both argument and result.Normal operations. The caller makes a RPC with the desired method andarguments, which is routed to the appropriate callee (Figure 1.1(DDC RPC) dottedgreen arrow). The callee then reads in the global address argument (Figure 1.1(DDC12Caller CalleeRPC requestRPC replyf3f4f2f1Figure 4.2: Call semantics with possible failure points.RPC) blue arrow), if any, and performs the procedure. Once the callee is finished,it writes back any relevant output (Figure 1.1(DDC RPC) red arrow) and returns theresult address. The caller then reads the data and continues. Here, Bifro¨st maintainsexactly-once semantics, but the challenge arises under failure conditions.Caller failure. If the caller fails at point f1 in Figure 4.2, then the caller andcallee consider the RPC executed, but there is no caller to return to. If the caller failsat point f2 in Figure 4.2, both the caller and the callee consider the RPC executedonce. But, when the caller is restarted, it would have lost this state. There are twooptions: fate-share and recovery. In fate-sharing the callee is forced to fail if thecaller fails, and both are restarted. This provides exactly-once semantics as boththe caller and callee do not know of the previous attempted execution after reboot.Exactly-once semantics can be achieved using RPC-based checkpointing. Thecaller considers the RPC initiated when it successfully sends the request. At thispoint, it waits for the callee’s response. A checkpoint is taken shortly before theRPC has been initiated. When the caller fails, it is restarted with this checkpoint,thus preserving the state immediately after the RPC has been sent. Both the callerand the callee both have a record of the RPC executing only once.Callee failure. If the callee fails at point f3 in Figure 4.2, then the callee is not13alive to field the caller’s request. This can be fixed by retransmitting the requeston the caller side until the callee is back, but this violates the definition of exactly-once. If the callee fails at f4 in Figure 4.2, then both the caller and callee considerthe RPC initiated but not executed. The first failure can be addressed by in-networkretransmission of the packet once the callee has been rebooted. Essentially, thecaller does not retransmit the message, from its perspective the RPC was only calledonce, but the underlying network handles retransmission of the packet to the callee.The failure case at f4 can be handled using the same RPC checkpointingscheme described above. Here, the callee is checkpointed immediately after theRPC is received. Therefore it can restart right at the point where it considers theRPC initiated but not executed. This poses problems for non-idempotent operationson global memory. To handle non-idempotent operations, the global memory usedby the callee will fate-share with the callee. Thus requiring the callee, upon reboot,to reload the arguments and any other state from the global memory snapshot.14Chapter 5Bifro¨st designFigure 5.1 shows an overview of the Bifro¨st architecture. Each of the computeresources that comprise a single machine are connected via the network to theglobal address space. The global address space is provided by a distributed sharedmemory system run on top of the memory resources.Application processes call RPCS using the Bifro¨st RPC library. This RPC libraryuses the Bifro¨st application programming interface (API) to manage global mem-ory. The semantics described in Section 4 are enforced in the ToR switch by thecontrol-plane program Gatekeeper.5.1 APIBifro¨st presents two APIS, one for accessing global memory and the other for RPCS(Table 5.1). The memory API provides four basic primitives for operating on globalmemory: alloc, read, write, and free. All of these operations are performed onglobal address pointers. The RPC API only provides two calls, one for the callerand one for the callee. The RPC library handles all marshaling and unmarshaling ofthe arguments and results and handles the translation of a method call to a remotecall.15GatekeeperParser Match-action tablesData-planeControl-planeToR SwitchProcessBdaemonCompute ResourceProcessBdaemonCompute ResourceMemory bankBdaemonMemory ResourceMemory bankBdaemonMemory ResourceRPC LibRPC LibFigure 5.1: Overview of Bifro¨st architecture.API CallMemoryptr← alloc(int size)int← read(char* buf, int len, ptr addr)int← write(char* buf, int len, ptr addr)free(ptr addr)RPC caller result←<method name>(args, ...)RPC callee export(<method name>)Table 5.1: Bifro¨st API.5.2 Network protocolFigure 5.2 shows the Bifro¨st packet header. It contains a Bifro¨st identifier that theparser on the switch data-plane uses to determine if the packet should go throughthe Bifro¨st parsing tree. The call type has four possible values: CALL (1), RE-PLY (2), EXCEPTION (3), and ONEWAY (4). This is drawn from Thrift RPCimplementation. Next is the length of the method name and then the method name160 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30bit44bytesBifro¨st identifier Call typeMethod name lengthMethod nameRPC identifierGlobal memory address pointerPayloadFigure 5.2: Bifro¨st packet header.itself. We fix the method name to be 16 bytes to avoid variable length parsing inthe switch. The RPC identifier is a unique number which denotes a particular RPCoperation for this caller.The global memory address pointer is a 16 byte address which points to the passby reference arguments. When multiple arguments are pass by reference, the RPClibrary allocates memory for each argument, then writes the data in flattened formto one global address. This global address is then sent in the RPC. This removes theneed to parse a variable list of pointer arguments to enforce the memory semanticsdiscussed in Section 4.1. The rest of the payload is not parsed by the switch.5.3 Gatekeeper: switch control-planeGatekeeper, run on the control-plane of the switch, tracks and manages all Bifro¨sttraffic. Gatekeeper performs three main tasks: resource management, memoryprotection, and fate-sharing and fault tolerance for RPCS. It enforces every decisionas a match-action rule in the data-plane of the switch.Resource management. To present the user with a single machine, the con-trolling entity must be able to determine available resources. When a resourcebecomes alive, either from reboot or initial plug in, it automatically sends an initi-ation message. This message contains resource type and capacity or specification.17A(Caller)B(Callee)Switch GatekeeperRPC: foo(args)A: commit a1B: commit b1RPC: return(result)B: commit b2RPC: foo(args)A: commit a1RPC: foo(args)RPC: return(result) RPC: return(result)B: commit b2A: commit a2a1f1a2f2f3b1f4b2Figure 5.3: Bifro¨st RPC fault recovery scheme.The Gatekeeper maintains a list of these resources and their status (free, in-use, orfailed). When a program has been compiled, procedures are then assigned to com-pute resources. The compute resources send a similar initialization message to theswitch. This message is trapped to the control-plane where Gatekeeper generates amatch-action rule to automatically forwarded any Bifro¨st packet to that particularcompute resource for processing. This alleviates the need to directly connect RPCcallers and callees, it provides the RPC registration service in the switch data-plane.Memory protection. The switch data-plane drops all packets for a pointer bydefault. It only allows access if a table rule is generated to allow access. Based onthe memory semantics described earlier (Section 4.1), when the data-plane parses18a Bifro¨st RPC packet with a global address pointer, it traps to the Gatekeeper in thecontrol-plane. Gatekeeper generates a match-action rule to allow the destinationIP address (callee) to read that pointer. It removes the rule which allows the sourceIP address (caller) to write to that pointer. These rules match based on a pointer,operation, and source IP address from a memory access packet.The access control table update is represented by the first dashed line and thechange from orange access to green access in Figure 4.1. When the RPC com-pletes, the data-plane parses the response RPC packet, if the reply packet containsa global address, it traps to Gatekeeper which removes the rule allowing the sourceIP address (callee) to access the pointer. Gatekeeper also must update the table toallow the destination IP address (caller) to perform any memory operation on thereturned pointer and the arguments previously sent.When the data-plane parses any memory access packet, it looks at the opera-tion, the source address, and the pointer address. If the operation is allocation, ittraps to Gatekeeper which generates a table rule to allow all memory access pack-ets from the requesting address. If the operation is a free, Gatekeeper removes anymatch-action rule regarding that pointer. It does not update the tables for a read orwrite operation.Fate-sharing and fault tolerance. The switch data-plane traps to the control-plane when it encounters an RPC packet, here Gatekeeper maintains a graph ofactive RPCS. This graph represents the control and data flow of the program. Itallows Gatekeeper to not only track nested RPCS but also pointers that are passedin multiple RPCS. When a resource fails, Gatekeeper determines the failure domainof that RPC and if it should be recovered. These decisions can be made on a callby call basis, it is also possible to make them programmable by the developer [10].Once the failure domain is computed, the control-plane removes rules in the match-action tables which forwarded packets to the blacklisted IP addresses. By default,these packets will be dropped.We deploy checkpoint and rollback recovery for both caller and callee failures.We base our checkpoints and committal of checkpoints on RPC calls (Figure 5.3).This means we have a guarantee of RPC state at each particular checkpoint. Gate-keeper keeps track of the most recent checkpoint committed by a node.There are four points in which a checkpoint is created and committed: RPC19initiation (a1), RPC initiation received (b1), RPC reply (b2), RPC reply received(a2). When A (caller) creates the checkpoint a1, the checkpoint is committed toGatekeeper when the RPC sent to B (callee). This ensures that the checkpoint a1is only committed when the RPC is actually sent across the network. At this point,we guarantee that A views the RPC as “called”.When the RPC is received by B, B immediately checkpoints its state (b1) andcommits that to Gatekeeper. Once b1 is committed, B can proceed with computingthe procedure. This ensures that B will restart at the beginning of computation,ensuring the RPC is received once, but never executed more than once (based onthe view of the callee). When B has completed the procedure, it checkpoints itsstate again (b2) and commits it with the return of the RPC. Thus representing that Bconsiders the RPC “completed”. When A has received the RPC reply, it immediatelycheckpoints its state (a2) and commits it before continuing computation. Once a2is committed, A considers the RPC “completed”.Each checkpoint represents the RPC status to the particular node making thecheckpoint. This aids in recovery as we can determine whether or not the callerthinks it called the procedure, the callee received the request, the callee completedthe request, and if the caller received the result. Maintaining exactly-once seman-tics is still not trivial, especially in the cases where the caller and callee check-points do not reflect the same status, in particular, f4 in Figure 5.3. To solve this,we have Gatekeeper maintain a most recent RPC for every caller/callee pair. Whenf4 occurs, Gatekeeper will see that A considers the RPC called, but B failed be-fore receiving that call. Therefore, Gatekeeper will restart B from the most recentcheckpoint and then replay the last RPC from the pair to synchronize their view ofthe RPC. A does not know of the replayed packet and still considers the RPC to becalled once.5.4 Bifro¨st daemonsBifro¨st requires coordination on the resource side to achieve transparency to theapplication, execute checkpointing or snapshotting, and perform memory clean-up. The Bifro¨st daemons change roles depending on which type of resource theyrun on.20Compute daemons keep track of the number of times a pointer has been leasedand initiate the synchronous checkpoint before an RPC is sent across the networkand before the RPC return is passed up to the process. When performing the check-pointing, the compute daemon appends the commit information on the end of theBifro¨st packet. This extra information is removed by the compute daemon on thecallee node. Memory daemons handle memory API requests, perform memorysnapshotting on a specified region, and service memory clean-up requests.21Chapter 6ImplementationTo prototype our design, we modify an existing RPC framework (Thrift), and built abasic DSM system. The switch data-plane program is written in P4 and Gatekeeperis written in C++. We simulate the network topology in Mininet with a P4 switch.6.1 Thrift modificationsThrift is an RPC library originally developed at Facebook, but open-sourced asan Apache project [3]. Thrift provides flexibility with different abstraction lay-ers: thrift file, TClient/TProcessor, TProtocol, TTransport (buffered,framed, etc.), and TSocket. The user specifies the thrift file which is then com-piled and generates the TClient/TProcessor for both the client and the serverrespectively. TProtocol defines the marshaling and unmarshaling for everyThrift data type. When a data type is marshaled it is written to the transport. Forcomplex data types, such as lists Thrift marshals each item in the list. TTransportdefines wrapper functions to perform network operations. TSocket performs theactual network I/O functions.We start with the Thrift c glib library, using the binary protocol (sends thedata as raw bytes) and the buffered transport (buffers the data before calling socketsend or receive). This means, when Thrift marshals a data type, it “writes” to thebuffered transport. If the write buffer is full, the transport sends the message overTCP, if the buffer isn’t full, it writes the message to the buffer. The same is true for22reads. This can be inefficient when the buffer size is much smaller than the databeing sent.Bifro¨st requires UDP to perform the rerouting and packet drops required toenforce our desired semantics. The entire Thrift design is based on the connectionabstraction of TCP. This required us to modify the TTransport layer. We createda buffered transport for UDP which uses a UDP socket. The UDP socket performswhole reads of messages, which are then buffered at the UDP buffered transportlayer. We also created a connectionless server using the UDP socket. Instead oflistening and accepting connections, the server receives an RPC, processes it, andreplies. This removes the need for any handshake between the client and server aswell as reduces the number of open sockets. It currently does not handle multipleconnections, nor does it queue outstanding requests. We plan on addressing that infuture work.6.2 DSM systemOur basic DSM system exposes a key value store interface, where the global addressis the key and the data is the value. We chose to embed the global address andoperation type in an IPv6 address to aid in load balancing, memory migration, and,it is addressable from any requesting machine. The first four bytes of the IPv6address are zero, the fifth byte is the DSM system prefix. The sixth byte is the DSMserver machine ID. The seventh byte is the operation (allocate, read, write, free).The eighth byte is the arguments (if any). Finally, the last eight bytes are the 64 bitpointer. Embedding the global address in IPv6 is not required for our system. Theglobal address can be passed in an application level header or payload and parsedat the server side.6.3 P4 data-plane programP4 is a highly reconfigurable, protocol and target (i.e., switch) independent switchdata-plane language [9]. P4 programs for the PISA we described in Section 3. AP4 program has the following components: ingress parser, match-action tables,actions, and header definitions.The packets flow along the pipeline by first being parsed based on the parsing23Egress queuescallreplyDSM ACLfwd_IPv6fwd_IPv4IPv4_lpmBifröstDSM fwdIPv6_lpmIPv4IPv6DSM protocolParserFigure 6.1: Bifro¨st P4 pipeline.rules (Figure 6.1). Then our P4 program checks to determine if the packet is aBifro¨st packet based on the parsed header fields. This allows it to co-exist withother network functions on the switch. Once it determines it is a Bifro¨st packet,it applies the Bifro¨st match-action rules. When the switch data-plane encounters aBifro¨st call and reply or a DSM allocate, it traps to the control-plane CPU, whereGatekeeper runs, by generating a digest.6.4 GatekeeperGatekeeper is written in C++ and runs on the switch control-plane CPU. WhenGatekeeper receives a trap from the switch data-plane it determines which typeevent has occurred: RPC initiation, RPC return, RPC failure, Mem alloc,Mem free. In the case of RPC initiation, Gatekeeper parses the packet todetermine if any global address is used. It then generates match-action rules to en-force memory protection updates on the global addresses involved in the RPC call.Finally, it creates an RPC entry in its control flow graph. The control flow graph ismaintained in a node-centric data structure. Each node represents a caller or calleewith a directed edge between the two. Each directed edge contains the call methodname and pointer argument. When a node is added to the graph, it is also added toa list of current RPCS. This list contains node pointers into the graph.When an RPC returns (RPC return, Gatekeeper must perform clean up oper-ations on its graph. First, it updates the match-action rules for the global addressesinvolved in the RPC. Then it removes the callee and caller pair from the graph(assuming the caller does not have other outstanding RPCS).When Gatekeeper receives an RPC failure, it walks the graph to see what24caller or callees the failed processor was associated with. Gatekeeper then buildsthe failure domain based off of those associations. It will initiate memory clean upfor the pointers controlled by the failed processor IP address. Simultaneously, itwill get a new processor and initiate it from the checkpoint. Any requests going tothe failed processor are queued at Gatekeeper. Once the new processor is initial-ized, the held requests are forwarded to the new processor. Any entry in the tablesthat forwarded to the failed processor are re-written to point to the new processor.For Mem alloc and Mem free, Gatekeeper updates the access control listfor the memory pointers. It adds an entry for Mem alloc and removes an entryfor Mem free, assuming the requesting IP address is the one that controls thatmemory pointer. If the requesting IP address does not control the pointer, thepacket is dropped.25Chapter 7EvaluationWe focus our evaluation on the overhead of our system compared to the overheadof Thrift. We attempt to answer five questions regarding our system:• How realistic is our test environment?• What is the base overhead of Thrift and Bifro¨st?• What is the impact of our Thrift modifications?• What is the cost of Thrift and Bifro¨st on a simple workload?• What is the latency breakdown in Thrift and Bifro¨st?7.1 MethodologyWe perform our tests using a network simulation environment called Mininet [23].Mininet simulates a customized network topology, allowing the user to rapidlyprototype and test switch and application code. It is easily customizable, we use itwith a P4 switch to prototype our P4 program and Gatekeeper. We also use it withan OpenV Switch (OVS) for our performance testing.Our topology consists of six machines: one RPC client (c1), two RPC servers(s1, s2), and three DSM servers (m1, m2, m3). s1 and s2 run two different RPCservices. s1 handles ping requests, s2 handles echo and array operation requests.m1, m2, and m3 run the memory daemon to service memory access requests.26050100150200250OVS Mininet OVS HardwareLatency (us)Figure 7.1: Latency comparison using ping6 on OVS Mininet and real OVScluster, averaged over 1000 pings.We use two measurement programs for our baseline measurements: ping6and netperf. ping6 uses Internet control message protocol (ICMP) to requestand receive an echo between two nodes and measures the round trip time (RTT) inmilliseconds. netperf performs network testing between two hosts, measuringa variety of metrics such as bandwidth and RTT. We use two specific netperftests: TCP RR and UDP RR. TCP RR stands for TCP request/receive. It performsas many requests with a specified payload in ten seconds. It then outputs the 50thpercentile, 90th percentile, 99th percentile, mean, and standard deviation latencymeasurements in µs. UDP RR performs the same test, except over an UDP con-nection.270204060801001201401601801 2 4 8 16 32 64 1282565121024204840968192Latency (us)Data size (bytes)Mininet-TCPMininet-UDPHardware-TCPHardware-UDPFigure 7.2: Latency comparison using netperf with TCP and UDP on OVSMininet and real OVS cluster. Averaged over 1000 pings and with vary-ing payload size.7.2 How realistic is our test environment?We look at the cost of pings (using ping6 and netperf) in Mininet comparedto two servers with 10Gb NICs connected with an OVS. Figure 7.1 shows that theping latency of Mininet is lower than hardware. This is expected as Mininet runson a single host and does not require traversing the physical NIC. Mininet has anaverage RTT of 75 µs whereas hardware has an average RTT of 231 µs. Whenlooking at protocol specific numbers, Mininet outperforms the servers by a factor4x for TCP and 3x for UDP on small data sizes. With large data sizes, Mininetstarts to outperform the real servers by a factor of 10x for TCP and 7x for UDP, asexpected (Figure 7.2).Based on these measurements, Mininet shows an optimistic performance com-pared to real hardware. We, therefore, base our performance evaluation on relative28050100150200250300Latency (us)Payload size (bytes)OVS-TCPOVS-UDPP4-TCPP4-UDPFigure 7.3: Baseline latency in µs for OVS Mininet compared to P4 Mininetwith ranging payload size. Measurements were taken with netperf.overhead.We also ran netperf on the OVS and P4 versions of Mininet with varyingpayload sizes (Figure 7.3). The P4 switch in Mininet is an average 20x slower forTCP and 16x for UDP. We believe this is due to the P4 parsing overhead and thatthe P4 switch implementation is not optimized. We plan to investigate this furtherin future work. But, because of the large performance overhead, we use the OVSMininet for the rest of our tests.7.3 What is the base overhead of Thrift and Bifro¨st?To determine the base overhead of our system, we run two microbenchmarks: pingtest and echo test. The ping test calls a “ping” RPC, which the server just returnsACK. This is the simplest RPC, with no parameters or return value. The total UD-P/TCP payload size is 29 bytes, including the Thrift header. To provide a baseline,29010203040506070netperf-TCP Thrift netperf-UDP BifrostLatency (us)Figure 7.4: Thrift and Bifro¨st running on OVS Mininet compared to thenetperf round trip latency measurement for TCP and UDP.we also ran netperf with a 29 byte request payload and a 1 byte reply payload.We ran the ping test 100 times and took the average latency. Thrift has a 3.09xoverhead compared to TCP on the OVS Mininet and Bifro¨st has a 3.78x overheadcompared to UDP on the OVS Mininet (Figure 7.4). Although the Bifro¨st overheadis slightly higher than the Thrift overhead (only 0.69 difference), we believe thisdifference is negligible.7.4 What is the impact of UDP vs TCP?To determine how the difference of protocol (UDP vs. TCP) effects our perfor-mance measurements of Bifro¨st, we compare the performance of regular Thrift(over TCP) with our Thrift UDP implementation, but no Bifro¨st management orDSM system. We ran a the ping test on the OVS switch for 100 iterations and av-eraged the RTT in µs. Thrift over TCP had an average latency of 37 µs, whereas300102030405060TCP UDPLatency (us)Thrift Transport ProtocolFigure 7.5: Thrift RPC ping test using TCP and UDP as underlying transports,averaged over 100 iterations.Thrift over UDP had an average latency of 54 µs (Figure 7.5). The fact that UDPincreases the latency by ∼45% is interesting, as UDP should have better perfor-mance than TCP. We plan to investigate why Thrift UDP performs worse in futurework.7.5 What are the overheads on a simple workload?To get a preliminary idea of what the performance of Bifro¨st would be runningreal-world workloads, we tested two simple workloads: increment array and addarrays. Increment array calls an RPC with a byte array and a byte as parametersand expects a byte array of the same length as a return value. The server receivesthis request and increments the passed array with the passed value. Add arrayssends two byte arrays and expects a byte array of the same length as a return value.The server performs the element-wise addition of the two arrays.3105000100001500020000250003000035000400004500050000Array size (bytes)Latency (us)ThriftBifrostTCPUDPFigure 7.6: Increment array RPC test on OVS Mininet with Thrift, Bifro¨st,TCP netperf baseline and UDP netperf baseline. Averaged over100 iterations and varying data size.We ran the test on varying payload sizes (up to 2 MB) and for 100 iterations.Figure 7.6 shows the average latency of Bifro¨st, Thrift, UDP baseline, and TCPbaseline on the OVS Mininet. Bifro¨st’s performance increases exponentially withthe data size. Thrift’s performance also increases, but does so slightly more er-ratically. Compared to their baselines, Bifro¨st has, on average, a 28x overheadcompared to UDP and Thrift has a 21x average overhead compared to TCP. This isdue to the extra reads and writes Bifro¨st must perform to access global memory.Figure 7.7 shows the comparison between Bifro¨st, Thrift, UDP, and TCP overthe OVS Mininet running the add arrays workload. Bifro¨st performs worse thanThrift on average. Bifro¨st has an overhead of 41x compared to UDP and Thrift hasan overhead of 22x.We found that Thrift had very odd behavior once the packet is larger than 512bytes, then subsides when the packet is larger than 9000 bytes. It seems that the3201000020000300004000050000600007000080000Array size (bytes)Latency (us)ThriftBifrostTCPUDPFigure 7.7: Add array RPC test on OVS Mininet with Thrift, Bifro¨st, TCPbaseline and UDP baseline. The baselines are NetPerf latencies. Aver-aged over 100 iterations and varying data size.TCP segment size gets stuck at 512 bytes, even though the packet is larger, whichcauses malformed packets. The negotiated maximum segment size (MSS) is 8940,which is our maximum transmission unit (MTU), 9000 bytes, minus the TCP andIP headers. We found the same behavior when running on the P4 Mininet. There isalso a secondary spike which occurs at 131072 bytes. We plan to investigate bothspikes in future work.7.6 What is the latency breakdown in Thrift and Bifro¨st?To discover where most time is spent, we also gather breakdown measurementsfor both Thrift and Bifro¨st over 100 iterations of each RPC and an array size of16384 bytes. This will help pinpoint possible bottlenecks in both systems. Wemeasured the costs into pre-processing, marshaling, network, unmarshaling, server330102030405060708090100Time spent (%)Workloadclient marshallingclient to server netserver unmarshallingserver compserver marshallingserver to client netclient unmarshallingotherFigure 7.8: Breakdown of where time is spent for each small workload and adata size of 16384 bytes, averaged over 100 iterations.computation, and post-processing. Pre- and post-processing occur on the clientside, this is where the arrays are populated, written or read from remote memoryor copied. Marshaling and unmarshaling occurs on both the client and server side.Figures 7.8 shows a breakdown of our results. Bifro¨st spent the most time dur-ing the pre- and post-processing stages (labeled as other in Figure 7.8). Outside ofthat, Bifro¨st spent the most time in server computation. Both these are expected asthe memory access calls occur during pre-processing, post-processing, and servercomputation in Bifro¨st.Thrift spent most of the time in the processing stages for add array. WhereasThrift spent most of the increment array latency sending from the server to theclient. We found these results interesting, as we expected the majority of Thriftlatency to be in the network. This is only true in the increment array test, where80% of the time is the server responding to the client. It is interesting that these34times are not reflective of their payload size either, as increment array is returningless data than it sent. Therefore, it would make sense that the client to servernetwork time would be the largest in Thrift, not the server to the client. We plan toinvestigate this with the latency spike we see in TCP (Figure 7.6 and Figure 7.7).35Chapter 8Discussion8.1 LimitationsThere are several limitations to our current design based on the assumptions andtrade-offs we made. Our design currently assumes rack-scale disaggregation. Thisis currently the most plausible form of disaggregation, but limits the scalability ofour solution. Since we rely on the switch to have a global view of the resources, ourdesign does not directly translate to a scale larger than a rack. It is not impossibleto scale our design beyond a rack. It requires more complexity and coordination,as now Gatekeeper must act as a distributed system which makes decisions aboutfailures and memory across racks. This can introduce many complexities, such asrack failures, switch failures, and distributed consensus.Another limitation of our system is that we do not handle any network failures.In particular, we consider the network to be lossless, which is not true for somecommodity networks. To address this, we would need to add retransmissions andtimeouts to our protocol, which in turn, will add a performance overhead.We also limit our design by having fixed size method names and only oneglobal pointer per RPC. We did this to simplify the parsing on the data-plane.There is a trade-off between variable length header entries and performance of theP4 parsing script. We elected to choose performance of the script over supportingvariable-length method names and pointer lists. We could modify the P4 programto handle variable-length parsing, but, do not find it necessary at this point.368.2 Future workWe split our future work into two categories: system improvement and evalua-tion improvement. System improvement describes optimizations or functionalitywe wish to add to our system. Evaluation improvement describes any questionsregarding our current evaluation we wish to address or more tests we wish to run.8.2.1 System improvementsSome of our performance overhead is due to the unmarshaling on the server side.We plan to mitigate this by creating a custom Thrift type for our shared pointerscheme. These types will be used just like C pointers but the RPC library willhandle the marshaling and unmarshaling into the IPv6 pointer mechanism our DSMuses. This will be advantageous as the shared pointers are currently stored as Thriftbyte arrays. This means Thrift will marshal and unmarshal them byte by byte,allowing for variable byte arrays. Since our pointers are fixed size, we can senda fixed size byte array instead of marshaling in pieces. This will also aid in thetransparency of the developer, as they will just be using a different pointer type inC, but all the access semantics remain the same.As stated in Section 6, our server does not handle multiple requests fromclients. We plan to address this by creating a multi-threaded server which main-tains a thread pool for servicing requests. As requests come in, it logs the thread IDof the assigned thread then passes the packet. If all threads are busy, it will put thepacket in a buffer that is FIFO. Once a thread is completed (i.e., sends the REPLY),it will assign it a packet from the buffer.We’d also like to implement our fault tolerance design and fate-sharing en-forcement. This requires implementing a snapshotting mechanism on the mem-ory daemon and a checkpointing mechanism on the compute daemon. Both ofthese mechanisms must have low execution time and be synchronous, to ensure nooutgoing network packets occur while the snapshot or checkpoint is taken. Thisatomicity is difficult to achieve, but it will provide a basis for building new faulttolerance techniques.In this work, we only considered CPU failures, but we must also design andevaluate memory fault tolerance techniques. We hope to show that performing37most of the coordination and computation at the switch will allow for new faulttolerance techniques.Once the fault tolerance mechanism and fate-sharing enforcement is imple-mented for both memory and CPU, we would like to experiment with differentpossible fate-sharing models and fault tolerance techniques as described in [10].In particular, we believe the tainted fate-sharing model would be advantageous forour DSM system. There might also be advantages to having a specific fate-sharingmodel for nested RPCS or different fate-sharing models and fault tolerance depend-ing on if the processor was the caller or callee.8.2.2 Evaluation improvementsOur current evaluation is limited in only showing the performance characteristicsof Thrift vs. Bifro¨st and evaluating the end to end latency of each operation. Wehope to expand upon our evaluation in multiple ways, first addressing interestingquestions raised from our current numbers and secondly, measuring the effective-ness of other aspects of our system.Our next step would be to measure the memory protection in the switch, look-ing at the overhead for the P4 program in the data-plane and Gatekeeper in thecontrol-plane. We also plan to test the memory protection, testing to see if the se-mantics we describe align with the implementation. For example, the applicationwill attempt to access a memory pointer that it does not control. We then plan tocompare our memory protection scheme to another DSM with the same semantics(strict consistency). This will allow us to compare our performance overhead forstrict consistency to their implementation.Once the fault tolerance technique is implemented we will also evaluate it forcorrectness and performance overhead. We plan to do so in a similar manner tomemory protection. First, we will profile it to determine where it spends the mosttime and to determine the overhead of the P4 program. Then we will test thecorrectness by injecting faults while an application is running. Finally, we willcompare our implementation to an application which provides fault tolerance forRPCS.Next, when the pointers are integrated into Thrift, we hope to modify a multi-38threaded application to use Bifro¨st. There we can do a full macrobenchmark of theperformance of that application on Bifro¨st compared to a single machine, showingrelative performance overheads. Then we will test for when failures occur, char-acterizing the failure the application has and measuring the time it takes Bifro¨st torecover. Finally, we hope to run these tests on real hardware instead of a simulator.This will require us to obtain a programmable switch, at least 40 Gbps Ethernet,and at least 40 Gb NICs.39Chapter 9Related work9.1 Network and system co-design.Combining system design with network elements is not a new idea [10, 17, 25, 26].With the recent advancements in networking, such as software defined network-ing (SDN) and programmable switches, there has been a stronger call for networkintegrated systems.Programmable switches allow for more flexibility in parsing packets, allow-ing for rapid prototyping of new protocols. There is also an added advantage ofmoving some compute to the switch itself [17, 26, 27]. Recently, NetCache pro-vided fast key-value store caching layer in the switch at line rate [17]. NetCachephysically stored the key-value cache in data- plane memory [17]. We do not takethis approach as we only perform memory access control updates in the switchdata-plane. All other computation is done in the control-plane.One intuitive way of using SDN to solve failures is to re-route the traffic. Pre-vious work extended SDN controllers to reroute traffic when links or switchesfail [22, 25, 32]. Albatross, not only re-routes around partitions, but it enforcesthem by killing the partitioned node [25]. We enforce fate-sharing in a similar wayto Albatross by dropping all packets going to and from the failure domain. We ex-pand upon this principle to provide access control for shared memory in the switchdata-plane.409.2 Context-based RPC.There has been a large amount of work customizing or improving RPC for particularenvironments [18, 39, 41, 42]. Similar to our work, Stuedi, et. al. integrated RPCwith remote memory access (RDMA) for datacenter environments [41]. Kalia, improve upon that work by using two-sided datagram RDMA [18] and Su develop a new RDMA paradigm to achieve even better performance under RPC.Our work goes a step further by leveraging functionality in the network to achievestronger guarantees instead of focusing on improving performance.RPC has also been studied in the context of multi-processor operating sys-tems [7, 11, 12, 35]. Paradigm [12], Hive [11], and Sprite [35] all utilize RPCas a mode of IPC between different kernels or processors. Hive, in particular, fo-cusing on fault mitigation for the processor shared memory. This is orthogonalto our work, as they use RPC to communicate but do not consider RPC failure orrecovery, they only focus on fault containment for memory [11]. More recently,Barrelfish, provides a multi-kernel abstraction for multi-processor systems, withthe inter-kernel communication being handled by a user-level RPC library. Theirimplementation differs in two ways: 1) RPCS are not set over a network, but sharedmemory and 2) they do not consider partial failures [7]. In a disaggregated system,performing a network RPC is identical to writing out to shared memory, thereforewe focus on a network based RPC implementation. We also consider RPC failurecases and recovery strategies.41Chapter 10ConclusionWe have designed and prototyped a system porting RPCS to a disaggregated con-text with network managed memory protection and exactly-once semantics. Bifro¨stachieves promising performance results, performing slightly worse than Thrift with-out optimizations. Our results show that the majority of time spent in Bifro¨st is dueto DSM accesses, whereas Thrift’s latency is due to the reply from the server to theclient. In our future work, we plan to address Bifro¨st’s performance challenges bymaking several improvements to our system and evaluation.We believe that, to reap the full benefits of disaggregation, we must take aholistic approach to designing systems on disaggregated racks. In particular, wemust consider the role of the network as a critical piece in the design. This work isone of the initial steps in realizing the benefits of disaggregation by co-designingwith the network.42Bibliography[1] Intel, Facebook Collaborate on Future Data Center Rack Techologies, 2013. →page 7[2] rpclib - modern msgpack-rpc for C++, 2016. → page 9[3] Thrift, 2017. → pages 3, 9, 22[4] Barefoot Technology, 2018.→ page 8[5] gRPC, 2018. → page 9[6] K. Asanovic´. FireBox: A Hardware Building Block for 2020Warehouse-Scale Computing. FAST ’14. → page 7[7] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter,T. Roscoe, A. Schu¨pbach, and A. Singhania. The Multikernel: A New OSArchitecture for Scalable Multicore Systems. In Proceedings of the ACMSIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09,2009. → pages 5, 41[8] A. D. Birrell and B. J. Nelson. Implementing Remote Procedure Calls. ACMTrans. Comput. Syst., 1984. → page 2[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford,C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4:Programming Protocol-independent Packet Processors. SIGCOMM Comput.Commun. Rev., 2014. → page 23[10] A. Carbonari and I. Beschasnikh. Tolerating faults in disaggregateddatacenters. In Proceedings of the 16th ACM Workshop on Hot Topics inNetworks, HotNets-XVI, 2017. → pages 1, 5, 7, 19, 38, 4043[11] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta.Hive: Fault Containment for Shared-memory Multiprocessors. InProceedings of the Fifteenth ACM Symposium on Operating SystemsPrinciples, SOSP ’95, 1995. → pages 5, 41[12] D. R. Cheriton, H. A. Goosen, and P. D. Boyle. Paradigm: a highly scalableshared-memory multicomputer architecture. Computer, 1991. → page 41[13] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger,G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, A. Orda, andT. Edsall. dRMT: Disaggregated Programmable Switching. In Proceedingsof the Conference of the ACM Special Interest Group on DataCommunication, SIGCOMM ’17. → page 8[14] C. Constantinescu. Teraflops supercomputer: architecture and validation ofthe fault tolerance mechanisms. IEEE Transactions on Computers, 2000. →page 5[15] A. Gainaru, F. Cappello, M. Snir, and W. Kramer. Fault prediction under themicroscope: A closer look into HPC systems. In International Conferencefor High Performance Computing, Networking, Storage and Analysis (SC),2012, SC ’12. → page 4[16] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal,S. Ratnasamy, and S. Shenker. Network Requirements for ResourceDisaggregation. In Proceedings of the 12th USENIX Conference onOperating Systems Design and Implementation, OSDI ’16, 2016. → pages1, 7[17] X. Jin, X. Li, H. Zhang, R. Soule´, J. Lee, N. Foster, C. Kim, and I. Stoica.NetCache: Balancing Key-Value Stores with Fast In-Network Caching. InProceedings of the 26th Symposium on Operating Systems Principles, SOSP’17. → page 40[18] A. Kalia, M. Kaminsky, and D. G. Andersen. FaSST: Fast, Scalable andSimple Distributed Transactions with Two-sided (RDMA) Datagram RPCs.In Proceedings of the 12th USENIX Conference on Operating SystemsDesign and Implementation, OSDI’16. → page 41[19] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks:Distributed Shared Memory on Standard Workstations and OperatingSystems. In Proceedings of the USENIX Winter 1994 Technical Conferenceon USENIX Winter 1994 Technical Conference, 1994. → page 944[20] G. Kola, T. Kosar, and M. Livny. Faults in Large Distributed Systems andWhat We Can Do About Them. In Euro-Par 2005 Parallel Processing. →page 5[21] S. Krishnapura, S. Achuthan, V. Lal, and T. Tang. Disaggregated ServersDrive Data Center Efficiency and Innovation. Technical report, Intel Corp.,2017. → page 1[22] M. Kuz´niar, P. Peresˇı´ni, N. Vasic´, M. Canini, and D. Kostic´. AutomaticFailure Recovery for Software-Defined Networks. In Proceedings of theSecond ACM SIGCOMM Workshop on Hot Topics in Software DefinedNetworking, HotSDN ’13, 2013. → page 40[23] B. Lantz, B. Heller, and N. McKeown. A Network in a Laptop: RapidPrototyping for Software-defined Networks. In Proceedings of the 9th ACMSIGCOMM Workshop on Hot Topics in Networks, HotNets-IX, 2010. →page 26[24] C. Lee, S. J. Park, A. Kejriwal, S. Matsushita, and J. Ousterhout.Implementing Linearizability at Large Scale and Low Latency. InProceedings of the 25th Symposium on Operating Systems Principles, SOSP’15. → page 9[25] J. B. Leners, T. Gupta, M. K. Aguilera, and M. Walfish. Taming Uncertaintyin Distributed Systems with Help from the Network. In Proceedings of theTenth European Conference on Computer Systems, EuroSys ’15, 2015. →page 40[26] J. Li, E. Michael, and D. R. K. Ports. Eris: Coordination-Free ConsistentTransactions Using In-Network Concurrency Control. In Proceedings of the26th Symposium on Operating Systems Principles, SOSP ’17, . → page 40[27] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. K. Ports. Just SayNo to Paxos Overhead: Replacing Consensus with Network Ordering. InProceedings of the 12th USENIX Conference on Operating Systems Designand Implementation, OSDI’16, . → page 40[28] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F.Wenisch. Disaggregated Memory for Expansion and Sharing in BladeServers. In Proceedings of the 36th Annual International Symposium onComputer Architecture, ISCA ’09, 2009. → page 745[29] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, andT. F. Wenisch. System-level Implications of Disaggregated Memory. InProceedings of the 2012 IEEE 18th International Symposium onHigh-Performance Computer Architecture, HPCA ’12, 2012. → page 7[30] B. J. Nelson. Remote Procedure Call. PhD thesis, Carnegie MellonUniversity, 1981. → page 9[31] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin.Latency-Tolerant Software Distributed Shared Memory. In Proceedings ofthe 2015 USENIX Conference on Usenix Annual Technical Conference,USENIX ATC ’15, 2015. → page 9[32] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri,S. Radhakrishnan, V. Subramanya, A. Vahdat, and R. N. Mysore. PortLand:a scalable fault-tolerant layer 2 data center network fabric. In Proceedings ofthe ACM SIGCOMM 2009 Conference on Data Communication,SIGCOMM ’09, 2009. → page 40[33] B. Nitzberg and V. Lo. Distributed shared memory: a survey of issues andalgorithms. Computer, 1991. → page 9[34] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich,D. Mazie`res, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, S. M.Rumble, E. Stratmann, and R. Stutsman. The Case for RAMClouds:Scalable High-performance Storage Entirely in DRAM. SIGOPS Oper. Syst.Rev. → page 9[35] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B.Welch. The Sprite Network Operating System. Computer, 1988. → page 41[36] R. K. Sahoo, M. S. Squillante, A. Sivasubramaniam, and Y. Zhang. Failuredata analysis of a large-scale heterogeneous server environment. InInternational Conference on Dependable Systems and Networks, 2004, DSN’04. → page 4[37] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in systemdesign. ACM Transactions on Computer Systems, 1984. → page 6[38] B. Schroeder and G. A. Gibson. A Large-scale Study of Failures inHigh-performance Computing Systems. In Proceedings of the InternationalConference on Dependable Systems and Networks, DSN ’06. → page 446[39] M. D. Schroeder and M. Burrows. Performance of the Firefly RPC. ACMTrans. Comput. Syst., 1990. → page 41[40] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, andS. Gurumurthi. Feng Shui of supercomputer memory positional effects inDRAM and SRAM faults. In International Conference for HighPerformance Computing, Networking, Storage and Analysis (SC), 2013, SC’13. → page 4[41] P. Stuedi, A. Trivedi, B. Metzler, and J. Pfefferle. DaRPC: Data Center RPC.In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14. →page 41[42] M. Su, M. Zhang, K. Chen, Z. Guo, and Y. Wu. RFP: When RPC is FasterThan Server-Bypass with RDMA. In Proceedings of the Twelfth EuropeanConference on Computer Systems, EuroSys ’17. → page 41[43] Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo.Performance Implications of Failures in Large-scale Cluster Scheduling. InProceedings of the 10th International Conference on Job SchedulingStrategies for Parallel Processing, JSSPP ’04. → page 447


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items