UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Schedule data, not code Best, Micah J 2020

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


24-ubc_2020_november_best_micah.pdf [ 8.26MB ]
JSON: 24-1.0394747.json
JSON-LD: 24-1.0394747-ld.json
RDF/XML (Pretty): 24-1.0394747-rdf.xml
RDF/JSON: 24-1.0394747-rdf.json
Turtle: 24-1.0394747-turtle.txt
N-Triples: 24-1.0394747-rdf-ntriples.txt
Original Record: 24-1.0394747-source.json
Full Text

Full Text

Schedule Data, Not CodebyMicah J BestBSc (Honours), University of Victoria, 2004MSc, Simon Fraser University, 2007A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)Octorber 2020© Micah J Best, 2020The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:Schedule Data, Not Codesubmitted by Micah J Best in partial fulfillment of the requirements for the degreeof Doctor of Philosophy in Computer Science.Examining Committee:Alexandra Fedorova, Associate Professor, Electrical and Computer Engineering,UBCSupervisorArvind Gupta, Professor, Computer Science, University of TorontoSupervisory Committee MemberRonald Garcia, Associate Professor, Computer Science, UBCUniversity ExaminerChen Feng, Assistant Professor, School of Engineering, UBC OkanaganUniversity ExaminerAdditional Supervisory Committee Members:Ivan Beschastnikh, Associate Professor, Computer Science, UBCSupervisory Committee MemberSathish Gopalakrishnan, Associate Professor, Electrical and Computer Engineer-ing, UBCSupervisory Committee MemberiiAbstractParallel programming is hard and programmers still struggle to write code forshared memory multicore architectures that is both free of concurrency errors andefficient. Tools have advanced, but for tasks that are not embarrassingly parallel,or suitable for a limited model such as map/reduce, there is little help. We aim toaddress some major aspects of this still underserved area.We construct a model for parallelism, Data not Code (DnC), by starting withthe observation that a majority of performance and problems in parallel program-ming are rooted in the manipulation of data, and that a better approach is to sched-ule data, not code. Data items don’t exist in a vacuum but are instead organizedinto collections, so we focus on concurrent access to these collections from bothtask and data parallel operations. These concepts are already embraced by manyprogramming models and languages, such as map/reduce, GraphLab and SQL. Weseek to bring the excellent principles embodied in these models, such as declarativedata-centric syntax and the myriad of optimizations that it enables, to conventionalprogramming languages, like C++, making them available in a larger variety ofcontexts.To make this possible, we define new language constructs and augment proventechniques from databases for accessing arbitrary parts of a collection in a familiarand expressive manner. These not only provide the programmer with constructsthat are easy to use and reason about, but simultaneously allow us to better extractand analyze programmer intentions to automatically produce code with complexruntime optimizations.We present Cadmium, a proof of concept DnC language to demonstrate theeffectiveness of our model. We implement a variety of programs and show that,iiiwithout explicit parallel programming, they scale well on multicore architectures.We show performance competitive with, and often superior to, fine-grained locks,the most widely used method of preventing error-inducing data access in paralleloperations.ivLay SummaryMost modern computing devices, from desktops to cellphones, have CPUs (CentralProcessing Units: the part responsible for the actual computations) with multiplecores. Sparing technical details, this is as if the device has multiple CPUs and canperform truly simultaneous computations allowing them, theoretically, to do morework in less time.This creates problems for software developers. These independent cores can-not tell what the others are doing and share the same memory. If two or moreCPUs attempt to change the same memory at the same time, corruption of data canoccur. A CPU can signal the others of their activity, but that can slow down thecomputation and be difficult to get correct.We borrow ideas from Databases and other fields to propose a change to howprograms are written and develop a better method for organizing these signals formore efficient coordination.vPrefaceThis work grew out of a collaborative project under Dr. Alexandra Fedorova. I pro-duced several papers on these ideas with Craig Mustard, Shane Mottishaw, MarkRoth and with the assistance of the others listed as coauthors below.• Micah J Best, Nicholas Vining, Daniel Jacobsen and Alexandra Fedorova,Collection-focused Parallelism, Fifth USENIX Workshop on Hot Topics onParallelism (HotPar 13), 2013I did the vast majority of design, implementation and writing for this pa-per. Nicholas Vining and Daniel Jacobsen contributed suggestions, feedbackand aid with some of the code. Alexandra Fedorova supervised this process.• Mark Roth, Micah J Best, Craig Mustard and Alexandra Fedorova, Decon-structing the Overhead in Parallel Applications, IEEE International Sym-posium on Workload Characterization, 2012Mark Roth was the primary driver of the design of the work in this pa-per. Both Craig Mustard and I contributed refinements, wrote code and per-formed tests. The writing was a collaborative effort. Alexandra Fedorovasupervised this process.• Micah J Best, Shane Mottishaw, Craig Mustard, Mark Roth, Alexandra Fe-dorova, Andrew Brownsword, Synchronization via Scheduling: TechniquesFor Efficiently Managing Shared State, 32nd ACM SIGPLAN Conferenceon Programming Language Design and Implementation (PLDI’11), 2011I contributed the original ideas that formed the basis of this paper. ShaneMottishaw, Craig Mustard and Mark Roth provided numerous refinementsviand improvements. The coding, testing and writing was a collaborative effortbetween the four of us. Andrew Brownsword provided significant advice,feedback and refinement. Alexandra Fedorova supervised this process.• Micah J Best, Shane Mottishaw, Craig Mustard, Mark Roth, Parsiad Az-imzadeh, Alexandra Fedorova, Andrew Brownsword, Schedule Data NotCode, Third USENIX Workshop on Hot Topics on Parallelism (HotPar 11),2011I contributed the original ideas that formed the basis of this paper. ShaneMottishaw, Craig Mustard, Mark Roth and Parsiad Azimzadeh provided nu-merous refinements and improvements. The coding, testing and writing wasa collaborative effort between the five of us. Andrew Brownsword pro-vided significant advice, feedback and refinement to all stages of this project.Alexandra Fedorova supervised this process.• Micah J Best, Shane Mottishaw, Craig Mustard. Mark Roth, Alexandra Fe-dorova and Andrew Brownsword, Synchronization via Scheduling: Man-aging Shared State in Video Games, Second USENIX Workshop on HotTopics on Parallelism (HotPar 10), 2010This was the workshop paper that formed the basis for the PLDI paperabove. The contributions here are identical to those listed there.• Micah J Best, Alexandra Fedorova, Ryan Dickie, Andrea Tagliasacchi, AlexCouture-Beil, Craig Mustard, Shane Mottishaw, Aron Brown, Zhi Feng Huang,Xiaoyuan Xu, Nasser Ghazali and Andrew Brownsword, Searching for Con-current Design Patterns in Video Games: Practical lessons in achievingparallelism in a video game engine, in Proceedings of the 15th Interna-tional European Conference on Parallel and Distributed Computing (Euro-Par 2009), 2009.This was a collaborative effort between myself, Ryan Dickie, AndreaTagliasacchi, Alex Couture-Beil, Craig Mustard, Shane Mottishaw, AronBrown, Zhi Feng Huang, Xiaoyuan Xu and Nasser Ghazali. Each of uscontributed to the design, coding, testing and writing. Andrew Brownswordviiprovided significant advice, feedback and refinement to all stages of thisproject. Alexandra Fedorova supervised this process.Shane Mottishaw extended some of these ideas, primarily the paper publishedin PLDI, in his 2011 MSc thesis Synchronization Via Scheduling: Techniques ForEfficiently Managing Concurrent Shared Memory Accesses. The work presentedhere takes a different direction from that work.This work is, primarily, an attempt to refine and extend the ideas and techniquesfrom these publications. Chapter 5, in particular, is an extension of this work andcites the above, as appropriate.I implemented essentially all the code discussed in this document. The over-whelming majority of the code taken from these previous projects has been rewrit-ten.In Section 7.2.4 I describe a piece of software that I collaborated on during aseries of MITACS internships at Gaslamp Games with Nicholas Vining and DanielJacobsen. No code from this period was used for this work as it is property of thecompany, nor were any of the techniques or algorithms in this Section implementedin the software during my employment. I used my experiences to inspire the toolsthat would have aided me at the time of my employment. No endorsement fromGaslamp Games is implied.viiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiList of Source Code Fragments . . . . . . . . . . . . . . . . . . . . . . . xvAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Playing Chess in the Dark . . . . . . . . . . . . . . . . . . . . . . 11.2 Beyond Peeling Potatoes . . . . . . . . . . . . . . . . . . . . . . 31.3 A Cautionary Tale . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Why So Serial? . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.5 An Intentional Solution . . . . . . . . . . . . . . . . . . . . . . . 101.6 What We Did . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.7 What We Didn’t Do . . . . . . . . . . . . . . . . . . . . . . . . . 131.8 Fantastic Contributions and Where to Find Them . . . . . . . . . 14ix2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2 Software Transactional Memory (STM) . . . . . . . . . . . . . . 172.3 Computations on Collections . . . . . . . . . . . . . . . . . . . . 192.4 Parallel Query Processing . . . . . . . . . . . . . . . . . . . . . . 192.5 Parallel Languages . . . . . . . . . . . . . . . . . . . . . . . . . 202.6 Actor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.7 Deadlock Detection . . . . . . . . . . . . . . . . . . . . . . . . . 223 Expressing Intention . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.1 Cadmium Language Overview . . . . . . . . . . . . . . . . . . . 233.1.1 Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.1.2 Managers . . . . . . . . . . . . . . . . . . . . . . . . . . 273.1.3 Collections . . . . . . . . . . . . . . . . . . . . . . . . . 283.1.4 Message Sending and Program Flow . . . . . . . . . . . . 283.1.5 Accumulators . . . . . . . . . . . . . . . . . . . . . . . . 303.2 Collection Usage, Collection Contracts and Queries . . . . . . . . 313.2.1 Collection Contracts . . . . . . . . . . . . . . . . . . . . 313.2.2 Declarative Syntax for Collection Contracts . . . . . . . . 323.2.3 Memberships and Subcollections . . . . . . . . . . . . . 363.2.4 Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.2.5 Delegate Blocks . . . . . . . . . . . . . . . . . . . . . . 383.2.6 Isolation Semantics . . . . . . . . . . . . . . . . . . . . . 383.3 Further Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.3.1 INSERT . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.3.2 DELETE . . . . . . . . . . . . . . . . . . . . . . . . . . 413.3.3 UPDATE . . . . . . . . . . . . . . . . . . . . . . . . . . 423.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434 Extracting Intentions: Compiling Cadmium and Static Analysis . . 444.1 Parallel Opportunities . . . . . . . . . . . . . . . . . . . . . . . . 444.1.1 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . 454.1.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . 46x4.2 Evaluating Shared Accesses . . . . . . . . . . . . . . . . . . . . 484.2.1 Determining Access Domains . . . . . . . . . . . . . . . 494.2.2 Inferring the Domain of Contracts . . . . . . . . . . . . . 494.2.3 Analyzing Program Flow . . . . . . . . . . . . . . . . . . 504.2.4 Acting On Results . . . . . . . . . . . . . . . . . . . . . 534.3 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . 544.4 Further Static Analysis . . . . . . . . . . . . . . . . . . . . . . . 574.5 Programmer Directed Optimizations . . . . . . . . . . . . . . . . 575 Enforcing/Executing Intentions: Scheduling And Runtime Algorithms 585.1 Scheduling Implementation . . . . . . . . . . . . . . . . . . . . . 585.2 Synchronization via Scheduling . . . . . . . . . . . . . . . . . . 615.3 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625.4 Deriving Signatures from Queries . . . . . . . . . . . . . . . . . 655.5 Collection Reservation . . . . . . . . . . . . . . . . . . . . . . . 685.6 Signature Hoisting . . . . . . . . . . . . . . . . . . . . . . . . . 696 Compiler Implementation Details . . . . . . . . . . . . . . . . . . . 726.1 Code Analysis Reporting . . . . . . . . . . . . . . . . . . . . . . 726.2 Integrating with C++ . . . . . . . . . . . . . . . . . . . . . . . . 737 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757.1 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 767.1.2 Operation Overview . . . . . . . . . . . . . . . . . . . . 767.1.3 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.1.4 Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . 787.1.5 Algorithmic Modifications . . . . . . . . . . . . . . . . . 787.1.6 Testing Methodology . . . . . . . . . . . . . . . . . . . . 807.1.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 817.1.8 Sparse Signature Results . . . . . . . . . . . . . . . . . . 857.1.9 SIMD Comparsion . . . . . . . . . . . . . . . . . . . . . 857.1.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 867.2 Comparison with other approaches . . . . . . . . . . . . . . . . . 86xi7.2.1 PageRank(Green-Marl) . . . . . . . . . . . . . . . . . . . 887.2.2 Delaunay Mesh Refinement (Galois) . . . . . . . . . . . . 967.2.3 canneal(PARSEC) . . . . . . . . . . . . . . . . . . . . . 1047.2.4 Mini-CE (Video Game) . . . . . . . . . . . . . . . . . . . 1158 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 135Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138xiiList of FiguresFigure 3.1 A subtree substructure query on a tree . . . . . . . . . . . . . 33Figure 4.1 Deadlock detection algorithm . . . . . . . . . . . . . . . . . 55Figure 5.1 Mapping queries to signatures . . . . . . . . . . . . . . . . . 61Figure 5.2 Partition numbering for linked lists . . . . . . . . . . . . . . . 67Figure 5.3 Partition numbering (n = 16) of quadtree (4 children) . . . . . 67Figure 6.1 Example Code Analysis Report . . . . . . . . . . . . . . . . 73Figure 7.1 Signature error rates – weakswapper versus brazen . . . . . . 84Figure 7.2 PageRank on 10,000 vertices Cadmium versus Green-Marl . . 92Figure 7.3 PageRank on 10,000 vertices Cadmium versus Green-Marl -Batch Factor Variation . . . . . . . . . . . . . . . . . . . . . 93Figure 7.4 PageRank (Google web graph) Activity Graph for batch factor10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94Figure 7.5 PageRank (Google web graph) Activity Graph for batch factor2500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94Figure 7.6 PageRank (Google web graph) Cadmium versus Green-Marl . 95Figure 7.7 Mesh Refinement on 10,000 vertices Cadmium versus Galois . 99Figure 7.8 Mesh Refinement on 10,000 Vertices – Signature Size Variation 101Figure 7.9 Mesh Refinement on 10,000 Vertices – Batch Size VariationNormalized . . . . . . . . . . . . . . . . . . . . . . . . . . . 102Figure 7.10 canneal on 100k elements Cadmium versus PARSEC . . . . . 107Figure 7.11 canneal on 200k elements Cadmium versus PARSEC . . . . . 108xiiiFigure 7.12 canneal on 400k elements Cadmium versus PARSEC . . . . . 108Figure 7.13 canneal on 100k elements – Signature Width Variation . . . . 109Figure 7.14 canneal on 100k elements – Batch Variation . . . . . . . . . . 110Figure 7.15 canneal on 100k elements – No Work Stealing . . . . . . . . 111Figure 7.16 canneal on 100k elements – With Work Stealing . . . . . . . . 111Figure 7.17 canneal on 100k elements – Batch Variation . . . . . . . . . . 114Figure 7.18 Clockwork Empires Beta Screenshot . . . . . . . . . . . . . . 116Figure 7.19 mini-ce with 3d rendering . . . . . . . . . . . . . . . . . . . 120Figure 7.20 mini-ce with 2d rendering . . . . . . . . . . . . . . . . . . . 121Figure 7.21 mini-ce – Initial Results . . . . . . . . . . . . . . . . . . . . 126Figure 7.22 mini-ce – 750 agents – Activity Graph . . . . . . . . . . . . . 127Figure 7.23 mini-ce – Population Variation . . . . . . . . . . . . . . . . . 128Figure 7.24 mini-ce – Batch Size Variations . . . . . . . . . . . . . . . . 128Figure 7.25 mini-ce – Signature Width Variations . . . . . . . . . . . . . 129Figure 7.26 mini-ce – SvS versus fine grained locks . . . . . . . . . . . . 130xivList of Source Code Fragments3.1 Example Cadmium Entity . . . . . . . . . . . . . . . . . . . . . . 243.2 Cadmium contract: Isolate and Release phases . . . . . . . . . . . 334.1 Deadlock Potential Example . . . . . . . . . . . . . . . . . . . . 557.1 PageRank expressed in Green-Marl . . . . . . . . . . . . . . . . 907.2 PageRank expressed in Cadmium . . . . . . . . . . . . . . . . . . 907.3 Mesh Refinement Fragment 1 in Cadmium . . . . . . . . . . . . . 987.4 Mesh Refinement Fragment 2 in Cadmium . . . . . . . . . . . . . 997.5 Snippet of Galois C++ Code . . . . . . . . . . . . . . . . . . . . 1047.6 canneal Cadmium Fragment . . . . . . . . . . . . . . . . . . . . 1067.7 Grid Storage Definition for Agents . . . . . . . . . . . . . . . . . 1217.8 simAgentGrid Query . . . . . . . . . . . . . . . . . . . . . . . . 1217.9 Job Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227.10 Rendering from the Quadtree . . . . . . . . . . . . . . . . . . . . 124xvAcknowledgmentsI would like to thank my supervisors, for showing me more patience than I de-served, as my progress deadlocked and crashed repeatedly. I couldn’t have askedfor better mentors and better role models.xviDedicationPour Maude, qui a ajouté toute une nouvelle dimension à mon univers.Tout ceci n’aurait pas été possible sans toi. Pour notre vie toute en-tière. Je t’aime. Beaucoup. Beaucoup. Beaucoup.xviiChapter 1IntroductionParallel programming is hard.1.1 Playing Chess in the DarkProgramming in general is hard, requiring intelligence, attention to detail, intuition,experience and a certain amount of creativity. In this way, it’s much like chess. Inwhich case, parallel programing is like playing 3D chess . . . in the dark . . . with anopponent that occasionally swaps your pieces without warning.Writing bug free serial code is difficult enough as it is. Consider the case ofthe implementation of the LZO algorithm [15], which was not only written by avery talented programer, but also ported and rewritten by several other talentedprogrammers. It was used in Linux and even went to Mars. However, there was abug in the original code and this bug was propagated though each of the derivativeversions. It took 20 years before somebody discovered it.In Star Trek canon, the purely logical and brilliant Spock was known to losegames of 3D chess against the folksy Dr. McCoy with his unpredictable, seeminglyillogical, moves. Similarly, even talented programmers can lose against the erraticand invisible moves of parallel execution. The programmer must face a horde ofextra problems including data races, deadlocks, and ulcer-inducing heisenbugs thatscuttle into dark corners whenever they spot you coming with a debugger.1Furthermore, correctness is necessary, but not sufficient1. Parallelism is forperformance. No algorithm requires parallel execution. The metric of success ishow much faster is this than the serial version? and in this case our devious op-ponent still has more tricks. One can easily introduce suboptimal cache utilizationpatterns or one of many other unfortunate circumstances from a long list of pitfalls.The author of this work started serious parallel programming working for avideo game studio, producing a title for the 3-cored Xbox 360. He worked forweeks to refactor some fairly deep engine code to free up an extra core for hisparallel implementation of one of game’s subsystems. With great anticipation, heran his first tests – only to find it was slower with the additional core. It was thenthat he learned about pros and cons (mostly the latter) of hyperthreading and, moregenerally, how deep the rabbit hole goes. It’s still remarkable, and probably asign of psychological damage, that he didn’t run screaming from the building andthat you’re not reading his thesis on Chaucer, Shakespeare and Other Things ThatExisted Long Before Computers.There are several different recognized ‘flavours’ of parallel execution. Frominstruction level parallelism, with SIMD operations, to massive ‘webscale’ dis-tributed services. In this work we’re concerned with multicore parallelism – mul-tiple physical processors running on a single machine with a shared address spacememory.We are now, by most accounts, in the second decade of the ‘multicore revolu-tion’. The increase in clock rates effectively stalled in the mid-2000s, but Moore’sLaw2 [66] has continued to hold, giving chip designers more and more transistorsto play with. They used this surplus to put the contents of two or more processorsonto the same chip. To paraphrase a tired internet meme: yo dawg, we heard youliked processors, so we put a processor in your processor. We then had CPUs thatcould finally do what operating systems had been faking for years: run more thanone thing at once3. Concurrency had become true parallelism.1For completeness, we note there is an interesting line of inquiry into algorithms that permitsome degree of incorrectness for the sake of performance. However, this is limited to a small classof problems and so we will assume correctness as a requirement.2We used to joke in the lab that the three most important laws that governed our work were:Moore’s, Amdahl’s and Murphy’s.3Yes, technically multi-processor systems pre-date even personal computers, but this was the first2Today, even cheap smartphones have multiple cores and researchers are discov-ering that many big data operations are more efficient on multicore architecturesthan networked clusters [60], so multicore parallelism is applicable to a large per-centage of developers.If we can be allowed a little reflected arrogance, the community of Systemsresearchers, both academic and corporate, are the ‘arms suppliers’ for the pro-grammers in the trenches4. The limits of how far programming can go are shapedby the tools that we create. So, in this light: How are we doing?. Well, one needonly launch Task Manager, Activity Monitor or top to see the answer: Not as wellas we could be. Most apps on your device will probably never utilize more thanone core, even though by now we all probably have two or more5. Certainly, a fewapplications will happily gobble all your computational resources, but they tendto be ‘the big ones’ written by companies, like Google and Adobe, with massiveengineering resources.1.2 Beyond Peeling PotatoesFor completeness, we’re going to need to define ‘embarrassingly parallel’ for thefollowing discussion and the rest of this document. In the seminal Art of Multipro-cessor Programming [39], the authors state:Some computational problems are “embarrassingly parallel”: they caneasily be divided into components that can be executed concurrently.When explaining our work to laypeople6 we often use the metaphor of ‘peelingpotatoes’ to describe embarrassingly parallel tasks. If it takes me an hour to peela bucket of potatoes, then it would take two of us half an hour. We would proceedto talk about the fact that we could keep adding people until we had more peoplethan potatoes and the biggest headache we have is distributing the potatoes in anmajor entry into consumer level machines.4In a reverse No True Scotsman, lets consider every programmer who advances the state of theart an honorary Systems researcher5This is being written on an Apple laptop with 6 cores and at the time of writing no process hasused more than 100% CPU all afternoon.6We were tempted to say muggles, but referencing both Star Trek and Harry Potter in one sectionmight be too geeky even for a Computer Science PhD thesis.3efficient matter. Though that last part can be surprisingly non-trivial, it’s certainlyeasier to deal with than having your kitchen help occasionally stabbing each otherbecause you forgot to secure a mutex somewhere in your code.So, what do we see when we look at the support given to the modern program-mer? Broadly, the answer falls into three major categories. The first is ways toeasily dispatch embarrassingly parallel work, with enumerable variations on theparallel_for with packages such as OpenMP [24]. The second category issystems that use a dataflow approach, where data is transformed into a series ofpipelined stages. This has become increasingly popular with the recent explosionof machine learning. However, this isn’t much different than the first category, onlyadding some implicit dependency handling between stages. Furthermore, it’s not ageneral purpose approach. TensorFlow [9] is a technical marvel, but you wouldn’twant to write a word processor with it.What if you had an algorithm whose updates were not easily described byindependent operations on a sequence of data, where two updates may touch thesame data at the same time (which we will refer to as a state conflict)? What ifyou had more than one potential operation on the same data at the same time?The proffered solutions make up the third category of our broad taxonomy and it’sa category that’s mostly made up of locks. Sometimes these include very fancylocks: reentrant mutexes, object based monitors and even more exotic constructs.Furthermore, what’s not a lock is generally a method for ‘forking’ and ‘joining’threads of execution with constructs such as promises, futures, etc. Essentiallythis sends the unstated message: If you’re embarrassingly parallel, we have youcovered. Otherwise, here’s a bunch of pieces to build your own system.At this point, we need another definition: complex application, though in thiscase it’s our own, more informal term. It may be easier to define it in comparisonto a simple application. A simple application is one that has a single purpose,generally taking a set of inputs and producing a set of outputs. Many Unix CLIapplications are like this, such as wc which takes a stream of text and returnsthe number of words. These can also include applications that stay resident inmemory and ‘serve’ independent requests where the requests are processed in asimple matter. A complex application is then one that is not simple and generallyhas some, but not necessarily all, of the following attributes:4• deals with a collection of data that, for even reasonable performance, mustbe modified ‘in place’ in memory (rather than immediately transformed to anew representation)• deals with multiple collections• contains processes that employ multiple algorithms• potentially initiates multiple processes per execution on the same input data• takes user input in an interactive mannerExamples of these include: web browsers, word processor, image editors, videogames, etc.We make this definition digression, at this point, to highlight the class of ap-plications that often contain non-embarrassingly parallel operations and presentdifficulties in ‘rolling your own’ scheme for safe parallel execution.1.3 A Cautionary TaleLet us give an illustrative story and consider a fictitious Lorem Ipsum LLC thatcreates real-time data visualization software. The software receives updates to thedata from the Internet and allows the user to add a set of different components andsub-components for visualization (call these widgets), any of which may containuser supplied scripts to filter, organize and interpret the data. Initially, the decisionis made that since the widgets organization is hierarchical, then they obviouslyshould be stored in tree structure7.Initially, Pat is charged with implementing the actual rendering to the screen.Since the number of different items can be larger than fits on one screen, theysimply find the node of the tree that is the common parent of every visible item andwalk the tree, drawing each to the screen in a post-order transversal.At the same time Sanjay is writing the network code, receiving partial updatesfrom the network and modifying the data that feeds the widgets.7For those readers familiar with the web stack, this is purposefully a simplified version of theDOM tree.5Early user testing reveals that the more times the display is updated per second,the more responsive it feels and consequentially leads to higher user satisfaction inthe target audience. So it is decided that instead of interleaving the display updateand the network update, they could be run on different threads to decrease the timebetween updates.As Lorem Ipsum is made up of UI/UX coders and data analysis experts, nobodyis really an expert on parallel programming. Pat looks up the documentation to forka second thread, putting their code on one thread and initiating Sanjay’s code onthe other. That really wasn’t that bad, thinks Pat for the approximately 30 secondsbefore the application segfaults when the networking code is updating a node thatis currently being rendered. With a sigh, Pat digs into the memory of their thirdyear Operating Systems class and reads the documentation on mutexes, wrappingeach access to the tree in lock/release pair. After wasting a weekend tracking downa method with an early exit that didn’t release the mutex properly, a more frazzledPat sits back in satisfaction and watches as the program runs for more than 10minutes without a crash. The sense of satisfaction is soon replaced with morefrazzle when timings reveal that the update is actually running less frequently. Boththe rendering code and the network code are pretty much always using some partof the tree at any given time and so the two sub-systems are effectively interleaved.With the additional overhead of lock acquisition, the total time is longer.Pat decides that each individual node in the tree needs a lock, and adds thesefine-grained locks, locking each node before rendering and releasing it afterwards.Now the update rate actually does increase and the crashes are still absent, but Patbegins to notice strange rendering bugs have appeared with elements going outsideof their proscribed boundaries. With a few more late nights, Pat realizes that somewidgets have a size based on their child widgets and when data is updated duringrendering, the sizes can change between the time that the layout is constructed andthe widget is rendered. Pat comes to the conclusion that they need to lock the entiresubtree being rendered, doing a walk first to lock each node, then rendering andfinally another walk to unlock. Finally, after several days away from the work thatutilizes their actual expertise, Pat has an implementation that works. The renderingerrors are gone and the application is updating more frequently than it was serially.Perhaps it’s not quite the total victory they hoped for, as the lock acquisition time6is a still added to the total rendering time, but it is an improvement.However, while Pat is settling back into designing the perfect rounded rect-angle for a new widget, Kamiko merges her branch with the first generation ofscripting functionality in it. She’s surprised and disheartened to find that it crashesalmost immediately post-merge. So after a few frantic teleconferences, she getsahold of Pat who tells her about their ad-hoc locking scheme and promises to writesomething about it on the internal wiki, when they have time. Kamiko goes back,spending an unplanned day wrapping her references to the tree in accordance withthe scheme. This works fine until beta-user reports come in that the applicationlocks up intermittently. Kamiko and Pat put their heads together and realize thatKamiko’s scripts can access more than one lock at once. In just the right circum-stance a script can lock a node just before the rendering locking can get to it. Itthen requires a node that the rendering has already locked and each will then waitfor the other forever. This is the classic deadlock situation and Kamiko and Patstart to get the classic deadlock induced ulcers.Finally, they agree to adopt an ‘all or nothing’ approach with their locks.Kamiko’s code takes less locks, so they try that first. However, with many dif-ferent branches it becomes really complicated and so, with deadlines slipping theyfinally decide to have the tree code back off if it can’t acquire a lock.In the end they have an application that runs in parallel, without error, butoccasionally it stutters a little when the renderer and the scripting system ‘fight’.However, it is faster and they need to ship and see their spouses at some point. Theyleave it as is and get back to their delayed ‘real’ work, with the nagging feeling thatit could have been better.What can we do to help the Lorem Ipsums of the world with their complexapplications? This document is our answer to that question. Unfortunately, wehaven’t quite derived a solveMyParallelProblems() function, but we be-lieve we’ve moved them closer than the Ikea-furniture-assembly-nightmare of try-ing to keep lock discipline across a large, complex application.71.4 Why So Serial?Now that we’ve illustrated the problem, we can start to build towards our solu-tion. We start by analyzing what went wrong in our little fable. Each of our threeheroes needed to access different subsets of the same tree, where they needed theentire subset to complete their computations. The reader may have noticed howmany times we have, already, used terms like set or collection. Collections arefundamental to Computer Science. We would be hamstrung without the abilityto generalize a process across an arbitrarily sized collection. A datum is a socialbeast, it travels in packs8.If we consider the collection as a single entity, such as when Pat tried the globallock and found that many items were needlessly locked, we miss the many caseswhere disjoint parts of the collection could be access simultaneously. Conversely,we can fail to see the tree for the nodes if we consider the elements as individualitems as with the subtle errors produced with fine-grained locks.Furthermore, collections aren’t, for the most part, bags of values. Collectionshave structure. When Pat employed fine-grained locks, they weren’t able to exploitthe beautifully recursive structure of the data.Notice as well that the problems we described didn’t come from inside eachof the three components (rendering, network and scripting)9. The problems camefrom the interaction between disconnected operations. This is why we underscorethe pervasiveness of complex applications, where a parallelization strategy thatworks for one subsystem may not apply to others sharing the same collection. Fi-nally, to bring this to the most abstract level, we note that these problems all stemfrom specific combinations of data being accessed simultaneously.At this point, we will need another definitional digression. We have noticedthat one of the lesser discussed problems in STEM is that the English language, likethe IPv4 address space, simply doesn’t have enough distinct elements for modernpurposes and many words have become overloaded10. This is definitely the case8Consider how rarely you hear the singular form, as opposed to the collective data.9Though, had the trio not given up they might have discovered deadlocks trying to parallelizetheir operations – especially the scripting system, which is essentially the same problem only inminiature.10The author has actually spent more than one entire evening arguing with his fiancée, a statisti-cian, over the meaning of variable.8with the term task. We will use this term throughout this document in its broadersense. A task is the basic unit that is scheduled. It consists of an entry point tothe code and zero or more parameters for the invocation. We will assume that itis designed to terminate11 and will run serially, though it may generate more tasksusing the same mechanism that was used to generate the initiating task. A programcan then be defined as a scheduler and a set of potentially executed tasks. In thisway we could view a classically serial program as a single ‘mono-task’ with an‘empty’ scheduler or a program using the fork/join threaded model as a scheduler,plus one task for the main thread and a task for each thread spawned.With that terminology established, we can now describe the crux of the problemas a step in real-time scheduling problem. Say we have p processors and let D ={d1, . . . ,dn} be all the data items in the system. Let T = {t1, . . . , tm} be the tasksthat have been submitted to the scheduler for execution at the current time. Forevery ti there is an Ri ⊆ D which is the data items to be accessed by that task. Inorder to guarantee a safe execution, we want to derive a subset of tasks T ′ ⊆ Twhere |T ′| ≤ p such that for every pair t j, tk ∈ T ′, j 6= k, R j ∩Rk = /0. This T ′ giveus a ‘coschedule’, a group of tasks to be run simultaneously12. Note that it followsthat there must exist a T ′ that is non-empty.The primary observation we want to make from this is that the tasks themselvesaren’t important for the safe coschedule, except in their transitive relationship tothe data. We can essentially reduce this to determining a mutually disjoint subsetof R1, . . . ,Rm. In a perfect world we could take a program and derive a dependencygraph of subsets of D and whenever a processor is free, select a subset that bothhad its dependencies satisfied and was disjoint in terms of data items with what wascurrently running. This would execute without state conflicts. This is the luxurythat a system like TensforFlow possesses, where a DAG13 is explicitly constructedand most operations don’t modify existing data in favor of producing results asnew data. As mentioned previously, the constraints necessary for that model aretoo limiting for general purpose applications.11Though this constraint could be relaxed for event driven ‘daemon’ style applications, but we arenot considering such applications.12Note this can also be used a verb; we can talk about coscheduling two or more tasks.13Directed Acyclic Graph.9So, in developing a general system for safe and efficient parallel execution whatwe have to work with is a scheduler and a set of potential tasks instead of the anidealized graph of data subsets. The question: from that starting point, how closecan we get to that ideal? In other words, how do we schedule data, not code.1.5 An Intentional SolutionOur quest to schedule data presents two major problems:• How do we associate a subset of data with a given task?• Given two sets of data, how do we test for intersection between sets of datain the microseconds we have to make a scheduling decision?In both cases the fact that data is naturally grouped into structured collections,a property that we will exploit without mercy, does allow us to shrink the problemsize drastically. For example, we know that if two collections A and B are disjointand if tasks t1 and t2 access only A and B respectively, then intersection testing isconstant.To make things more difficult, the exact data used by each task is going tochange based on runtime inputs. To compensate for the lack of exact knowledge,we exploit the fact that our program will still run successfully if our intersectiontesting allows false positives. If we decline to execute one of two tasks that whenexecuted turn out to be disjoint, we have only sacrificed some performance. So,we relax our goal to finding the smallest possible superset of the data touched.Again, we can leverage the collection based partitioning of data: if a task is doingan operation on a collection, it will at most touch the entire collection, but thatoperation will never include items outside the collection.While these facts help, and will be integral to our proposed solution, they stilldon’t give us everything we need. We could use static analysis on each task14and try to determine the used data superset. However, this is notoriously difficult.Consider a program written in C that walks a tree. The programmer sits down,thinks about trees and writes a node struct and connects them with pointers.By the time the code is submitted to the compiler, all that semantic information is14We are assuming that the set of possible tasks is easily harvestable from the code as written.10gone and the compiler only sees a series of pointer operations. For the informationwe need, we go to the source: the programmer’s intentions. Unfortunately, neuralprogramming interfaces are still nowhere close to production, so we must rely oncreating better tools.The language is the closest point of contact with the programmer in terms of thetoolchain. We elect to design a set of language constructs that better capture pro-grammer intention. This is where our observations about collections really comeinto play. The use of a global variable is obvious in most every language, but tellingwhich parts of a shared collection are accessed is much harder to determine. Bymaking collections first-class citizens of the language and mediating all access toshared data with stateless, declarative queries that reflect the structure of the targetcollection, we are able to dramatically increase the power of our analysis.With this approach, Pat could have written a line of code that effectively said Iwant to use the subtree of my widget tree rooted at node X. With this we would beable to build a runtime mechanism that could differentiate the desired part of thecollection from the part that goes unused. Pat must also tell us I’m done with thisat some point. We will discuss this in Chapter 3.This approach has two other virtues. Firstly, because these constructs replaceexisting code, we are not suggesting burdening the programmer with an extra layerof annotations on top of already complex code. This also makes our lives easieras we don’t need to check that these hypothetical annotations match the code aswritten because it is the code. When Pat would say I want to use the subtree rootedat X , this is not an annotation surrounding the actual data access. This directivecompiles to code that accesses only that subtree. Secondly, as these constructs arestateless and declarative, they give us the what and the when of the data accessed,but are separated from the how. This gives us plenty of latitude in our schedulerto apply scheduling optimizations across the entire program without violating theintent of the programmer.Processing queries on structured data is, of course, not new. Database systemshave been doing it for decades. We note that a lot of applications, especially thosewritten for the web, are written as a symbiosis between program code and database.One could construe this part of our proposal as generalizing databases for multiplecollection types and moving them into the program itself, much like the switch to11multicore moved multiple distinct processors into the same physical chip.Of course, this would all be for naught if we couldn’t exploit this informationfor efficiency. This means we need a general mechanism for scheduling which isable to encode the parts of the collection used, as dictated by the programmer’squeries, and do the intersection testing as close as possible to instantaneously.We have derived such a method, using bit strings, which we call signatures,and leveraging hardware supported atomic operations. These signatures act analo-gously to a lingua franca, a means of common exchange between operations. Byderiving a signature for every query at runtime, we reduce the problem of schedul-ing disjoint data groupings into one of comparing signatures.These two systems, query formulation and signature comparison act in har-mony to allow us to approach the ideal of scheduling data, not code and so werefer to this model as Data not Code (DnC).1.6 What We DidA model by itself is not particularly useful. In Systems, especially, new ideas needto be shown to have practical applications or, at least in the case of new research,demonstrated potential.We have proposed a system that is a symbiosis between different levels of thesoftware process: software construction, program compilation/analysis and run-time execution. These facets are often studied and refined in isolation – holdingthe other layers ‘fixed’.In the classic and well respected The Pragmatic Programmer [44] the authorsdetail an approach they refer to as a tracer bullet for exploratory development ofcomplex software. The idea is roughly orthogonal to the common software ap-proach of developing a single system and making ‘mock-ups’ or ‘stubs’, that havethe interface, but not the functionality, of the systems that the subject will interactwith. A tracer bullet on the other hand is a process of developing a narrow pathwaythrough the system from invocation to output. For example, a user interface maybe developed with only one or two fields, which are checked by another level ofcode, processed by a third and finally transmitted over the network with a protocolthat only has one message and one response. Is is then stored in a database that12has only a table for those values. All other parts of the software that are ‘adjacent’to this trace are ‘stubbed’. The results of this exercise serve as proof of concept, aroad map and opportunity to expose unforsaken difficulties.In this spirit we produced Cadmium, a prototype language, compiler and run-time library based around the DnC model. We developed a number of applicationsin this language in order to demonstrate and justify the process from end to end.Certainly we left out the details that weren’t necessary. Our standard library hasabout six functions, for example.Our contributions are not just in the algorithms and techniques, but in the ex-perience we relate and the insight we gained building and attempting to use thissystem. We will discuss our successes and failures in Chapter 7. We certainlydid have some failures: optimizations that didn’t work out and language constructsthat turned out to be far more unwieldy in use than ‘on paper’. We hope that thesewill be of benefit to future researchers as a complement to our successes and asfertile ground to grow even better solutions. Even if we never write another line ofCadmium code, we want to use software that can maximally exploit the beautifulhardware that we’ve paid way too much for15.1.7 What We Didn’t DoOur primary goal for this project was to demonstrate the viability of the DnC modeland we hope to convince you that we have succeeded. As a byproduct, we haveproduced a number of novel solutions to different problems. However, it was un-feasible to solve all the problems. Technically, there are an infinite number ofthings that we didn’t do16, but there have been certain things that we have beenasked about multiple occasions during the course of this project’s development.We will list the major ones here to properly temper expectations:• deterministic execution• distributed computations• interacting with the GPU15The reader can probably deduce our brand preference, but we won’t name any names.16At least it’s probably countably infinite.13Many of these form large chunks of our future work, though sometimes theyare quite low on our list.Futhermore, the primary thrust of our experiment was the synthesis of manydifferent facets that are normally considered separately. Due to the fact that ourresult is not a theoretic design, but a realized system we had to develop solutionsto achieve the results we want. While we most certainly have novel aspects to ourwork, there is a lot we just had to adapt for our purposes. We don’t claim to haveadvanced the state of the art in static analysis, for example. In fact our projectcould have benefited from even more advanced static analysis. However, the waywe assemble the pieces is novel and a contribution. The purpose was to provethat our model is worth further inquiry and worthy of efforts towards better staticanalysis and other techniques.1.8 Fantastic Contributions and Where to Find ThemOur contributions can be organized into three categories:Model We introduce the Data not Code (DnC) model which describes the generalmechanism and constructs we are employing.Language constructs We detail a number of new and adapted programming con-structs that can be used to realize the DnC model.Runtime algorithms We introduce a number of new runtime algorithms that facil-itate the efficient execution of DnC conforming programs.After we have discussed the various works that either inspired us or tried tosolve similar problems (Ch 2), we organize our discussion following the processof translating programmer intentions into correctly executing and performant soft-ware. We have grouped these discussions into three phases:Express Intentions by using a collection of programming constructs while avoid-ing requirements unrelated to expressing program function (Ch 3).Extract Intentions using static analysis and language extensions to derive usefulinformation and enforce the extracted intentions (Ch 4).Enforce/Execute Intentions using information from the previous phases to con-14struct an executable and inform its runtime elements for efficient and safe execution(Ch 5).Following this, we give an overview of the compiler implementation (Ch 6) andan in-depth exploration of the results in executing our test applications, presentingboth statistics and commentary (Ch 7).Finally, we conclude with a discussion of future work (Ch 8).15Chapter 2Related WorkNo project is born ex nihilo, without inspiration from existing works. As Cadmiumis an attempt to realize a model across a number of ‘layers’ of systems research, ithas a wide range of influences. Similarly, as these are important problems, manyothers have sought some solutions. Some of these works dovetail with ours throughconvergent evolution, while others take different routes.2.1 SignaturesThe signature mechanism (§5.3) involves partitioning the collection into N parti-tions and using atomic operations on a bit string of length N to mediate access.This is, in essence, a generalization of multiple exclusive locks.The quest for the ideal methods to control access to some computational objecthas been going on since Computer Science was in its infancy. Contributions havecome from luminaries such as Dijkstra [28], Lamport [53] and others [25, 32, 62].One of the most notable features of our signature scheme is its ability to arbi-trarily change the granularity of the locks on a given data collection by changinga single parameter. There have been other mechanisms proposed that have simi-lar virtues, such as DomLock [49] and others [56, 79]. Though we seek the sameproperty, we take a very different approach. Instead of an organization of sepa-rate locks, we collapse the entire set into the single bit-string. This gives us the16definite advantage of being able to set multiple ‘locks’ simultaneously1. Our tech-nique does admit ‘false positives’ and puts us on the performance/accuracy tradeoffspectrum. Furthermore, in all our work we vary the granularity on a collection bycollection basis. There is no technical reason that multiple collections couldn’tshare the same ‘signature space’, perhaps using one signature for the program thatis much larger than we would allocate for a single collection. Again, this wouldbe a tradeoff. By allowing greater accuracy in individual operations, we would beintroducing the possibility of operations on different collections interfering witheach other. It is one of many interesting avenues of inquiry that branch off fromour work.Others have used a similar scheme to our signature mechanism, except in im-plemented hardware such as Swarm [47] and Notary [84].2.2 Software Transactional Memory (STM)Synchronization via Scheduling (§5.2), the execution model we are proposing, de-termines the maximum set of data items affected by a particular task and will onlyco-schedule tasks with disjoint sets.On the surface this is a fairly novel approach with few antecedents in the litera-ture. However, it has been pointed out to us [63], that our technique can be seen asa variant of Software Transactional Memory (STM) [78]. In transactional memorythe programmer denotes a section of code as a ‘transaction’ and during executionif an executing thread makes memory access that conflicts with another thread,one of the threads is ‘rolled back’ and all the changes it has made are undone.These transactional regions are semantically similar to our Collection Contracts(§3.2.1). Futhermore, it can be interpreted that we are aggregating all possibleconflict checks into the one at task admission time in a pessimistic manner, tradingprecision for the elimination of the overhead of recording and potentially rollingback memory values.In this light, we are part of a family of research, dedicated to finding methodsto make STM more efficient and practical. This family includes many proposals1Technically this is virtually simultaneously in cases where the modification crosses the boundarydefined by the maximum size of an atomic operation.17for extension and modification of the STM process, which constitutes a rich andvaried body of work. Other pertinent highlights include:• improving transactional composition [34]• handling dynamically sized collections [40]• dealing with nested transactions [12, 14] 2• harmonizing transactions with Object-Oriented Programming [82]• using bit-string structures similar to our signatures [75]• constructing programming languages with transactions as a core construct[22]• attempting to improve performance by varying granularity as discussed above [57](§2.1)Our work differs from all of the above in two ways. The first is that our tech-nique is, as mentioned above, more pessimistic than ‘proper’ STM. We deny ex-ecution unless there is no chance of a conflict. This does put a greater restrictionon potential parallelism, but buys us the ability to actually execute elided withoutthe need for rollback bookkeeping. Secondly, for the most part a transaction inSTM is an annotation in a program, leaving it up to the programmer to determinewhere best to put them. In our system the transactions (the spans of the CollectionContracts) are defined as an integral part of the language and to use the languageis to provide the necessary information3. Furthermore, the definition of one of our‘transactional regions’ is significantly semantically richer and inexplicably tied inwith the parallel expression of the work. From the programmers description ofwhat they want to do from the collections point of view, we cannot only protect thedata, but schedule it better. In this way we schedule data, not code.2The related concept of nested Collection Contracts required quite a bit of work to address (§4.3)3This is a vote in favour of peer review. In earlier versions we used more ‘annotation’ styleconstructs and reviewers balked in no uncertain terms. This forced us to come up with a methodto provide the programmer with a ‘win-win’ situation, where the programmer gave us the necessaryinformation while only writing statements that contributed to their actual purpose for the code. Oursolution turned out to be useful for so much more than just harvesting these constraints.182.3 Computations on CollectionsTo talk about ‘computations on collections’ (searching, deriving new data, modify-ing values and modifying the structure of a group of related data) is to effectivelydiscuss the entire foundation of Computer Science. Even if we limited ourselvesto discussing processing collections in parallel, the discussion could still fill manyvolumes. If one were to type the name of their favorite data structure4 into GoogleScholar, one would likely find pages of papers detailing parallel computations onthat structure. This includes every data structure mentioned in this work, fromlists [30, 35, 80] to trees [58, 64, 70]. So we will simply reference a number ofsystems/packages designed for computations on data structures that are close toour work.Concurrent Collections [21] considers the primacy of collections as we do,but uses a more streaming/dataflow approach and does not target general purposeapplications.Many of the popular systems address only ‘linear’ collections (arrays, lists,tables), but there are several successful projects that address large scale processingof graphs, such as Green-Marl [43] and GraphChi [52].Galois [51], has goals similar to our own and is built for handling irregularaccess5 to collections, including graphs, but supports only data parallelism anduses a rollback mechanism similar to STM systems discussed above.2.4 Parallel Query ProcessingQuery processing is the complement to computations on collections. It involves aset of well defined components that can be assembled to describe questions abouta collection of data and a set of methods to, as efficiently as possible, produceanswers. Of course, wherever efficiency is sought, potential parallelism will beinvestigated.In terms of query formulation, there is likely no more successful standard thanSQL(initially SEQUEL) [23], descendents of which are in use on many millionsof systems today. It is likely that, unless you are reading this by torchlight in some4Everybody has a favorite data structure, right?5I.e. not embarrassingly parallel.19post-apocalyptic future, you have probably already done several things today thathave caused the initiation of an SQL query on your behalf. While SQL itself hasno notion of concurrency, it’s one of the primary methods of mediation between a‘programmer’ and a database, which have been subject of intense study as regardsconcurrency [27, 54, 65, 68, 81]. Once again, entire books have been written onthe subject.In relation to our endeavors, the most prominent system for query processingis LINQ [61], which essentially embeds SQL in the C# language. It is one ofthe cleanest and easiest ways to evaluate your structured program data in parallel.Though it is limited, as in many cases, to embarrassingly parallel operations andprovides no protection from race conditions and other parallel pitfalls. While thiswas not the inspiration for building a language around query processing, it did con-vince us to use the SQL-like syntax that we employ. Its popularity and very positivereception convinced us that a system such as ours may also be well received.Database is still almost synonymous with the table-based relational database,but there are many based around different collection structure such as Neo4j [7],which is graph-based. It should be noted that one cannot use SQL to query these‘alternate’ database organization schemes. Though we base our query constructson SQL with familiar commands, such as SELECT, we have attempted to make thefirst steps to generalizing it towards handling arbitrary collection types. This willbe discussed in more detail in Section Parallel LanguagesIn terms of influence, no work has had a greater impact on our project than theJade language [72], which annotates code to specify that a span uses only certainvariables for ensuring safe parallel execution. Most of the language level details inthis work are an answer to a colleague’s question: What can your system do thatyou can’t achieve with Jade’s withonly clause?6. Our answer to this questionrevolves primarily around operations on collections, which the reader will noticeis a primary foundation of our model. While Jade is focused generally aroundvariables, we focus on subcollections and provide significantly richer methods of6Thank you, Craig Mustard. You ask the best questions.20describing and interacting with them. Our more functional queries can be seenas generalization of the withonly clause, allowing the user to not only say thefollowing code uses the variable foo, but the following code uses only the even in-dices of this collection. Furthermore, Jade is built on a C-like language whereas westarted with our queries and built the whole Cadmium language around them. This,like so many things in Systems, is a tradeoff. Basing a new language on a robustand well accepted language like C not only increased its chances of acceptance bythe community, but also meant that there were a lot of problems that were orthog-onal to parallelism that were already solved. In our case, we spent a great deal oftime7 ‘reinventing the wheel’ to make a coherent syntax. On the other hand, wedidn’t have to ‘fight’ with language constructs that were designed with only serialexecution in mind.As well, every parallel language exists in relation to Cilk [19], one of the firstand most successful implicitly parallel languages. To drastically over-simplify,Cilk works by potentially dispatching function calls in parallel – essentially allow-ing the transparent transformation from serial procedural code to parallel code. Incontrast, we wanted a system that had no serial ‘backbone’ and only employed clas-sical procedural programming ‘locally’ inside our Entities (Objects/actors, (§3.1.1)).Another strong influence was the robust and effective Erlang language [13]which is not only implicitly parallel, but designed for fault tolerance. The Erlangmemory model, where memory is isolated from executing code and changes areinstigated by message passing ,is very similar to our own. We take it a step furtherby generalizing messages to semantically rich, internally stateless queries (§3.2).Similarly, we were influenced by the Barrelfish manycore Operating System [77],which uses messages as a semantic construct, but will take advantage of sharedmemory when possible to reduce overhead by potentially executing a message ini-tiation in a manner much closer to a classical function call – a technique that weuse widely in the code that Cadmium produces.There has also been a great deal of output in terms of parallel languages fromthe function programming community, as the side effect-less nature of functionalprogramming makes it a natural fit for safe concurrency. These languages range7Far more than anticipated, a warning to researchers beginning a similar endeavor.21from Multilisp [33] in 1980s, through to the increasingly popular Elixir [55], whichruns on the Erlang virtual machine, and includes parallel variants of existing func-tional languages such as Concurrent Haskell [48]. We note the successes comingfrom this community and let it serve as inspiration – attempting to mediate flexibleand performant, but side-effect inducing, procedural code with internally stateless,descriptive queries.There are a number of languages that are not explicitly parallel, but containdeeply ingrained concurrent elements. These include the Go language [3] with itspopular ‘Go Channels’ and Clojure [2], which contains an STM implementation‘out of the box’. Similarly, as with Concurrent Haskell, many have built parallelversions of existing procedural languages, such as Deterministic Parallel Java [20].Though we bemoaned the lack of robust support for ‘complex systems’ in ourintroduction, this is not to say that there has been zero effort in this direction.Recently, the Unity video game engine has incorporated DOTS [8] which works toautomatically parallelize the C# code that the engine uses by using code analysisand compiler optimizations. As with our system, it will detect potential conflictsimplied by the code, but unlike Cadmium it will not employ a method such as SvS,but simply alert the user and forbid the operation.2.6 Actor ModelWe draw inspiration from the Actor Model [41] where the program is composed ofactors, which can have encapsulated state, but only modify other actors by messagepassing. This model has been applied by others to address different aspects ofconcurrency, such as task parallelism [45]. We use this as the basis for our Entities(§3.1.1) and message passing (§3.1.4).2.7 Deadlock DetectionWork has been done utilizing static analysis techniques to detect deadlock [67, 83],including the generalized GoodLock algorithm [10, 37], which broadly resemblesthe one we derived (§4.3) before discovering it in the literature.22Chapter 3Expressing IntentionIn this Chapter we will detail the first third of DnC principles: allowing the ex-pression of programmer intentions. We start with a brief overview of the Cadmiumlanguage to give the rest context. Following that, we discuss the integral querysyntax and the semantics of how query results are handled. Together, the initiationdetails and scope of query handling form a collection contract, which is integral tohow DnC-conforming programs are written. These concepts will inform the finalpart of this chapter, where we discuss isolation semantics.3.1 Cadmium Language OverviewAs we stated previously (Ch 1), our goal is to explore, elucidate and justify the DnCmodel. This required that we realize the concepts in functional code. The result ofthis was Cadmium, an implicitly parallel language designed for the purpose.We had many choices to make to ‘fill in the gaps’ and many of our choiceswere arbitrary, being orthogonal to the requirements of our model. An exhaustivedescription would only serve to obscure instead of inform. We will briefly outlinethe basics of program organization and general syntax, while hopefully avoidingtoo much superfluous detail. The role of entities and queries is core to illustratingour model, while the fact that all of our integer types are implicitly 64-bit is not.Cadmium is an implicitly parallel, statically/strongly typed language, very roughlybased on the Actor Model and heavily inspired by our experience with many pro-23cedural languages such as C++.3.1.1 EntitiesThe central unit of organization in Cadmium programs is the entity. Analogous toa class in C++, Java, etc, an entity describes an aggregation of some data and/orprocedural logic. As with objects, the entity block is a ‘blueprint’ that can beused to define multiple instances.1 // define a type of message as a named tuple structure2 message NewValueMessage <number@int shouldAdd@bool>34 // define an entity5 entity Example {6 // local member, inferred to be of integer type7 primitiveMember: 238 // local collection, list of booleans9 collectionMember@CDList[bool]101112 // define a method that receives an int13 addValue(value@int) {14 memberInt += value15 }1617 // receiver for the named message18 NewValueMessage → {19 if $.shouldAdd {20 // invoke the method with the value in the message21 addValue($.number)22 }23 }24 }Listing 3.1: Example Cadmium EntityA basic entity is shown in Figure 3.1, which contains an example of its pri-mary facets.For the sake of completeness, for those readers unfamiliar with C-style lan-guages, the // denotes a line comment, where text is ignored by the compiler until24the next newline character (the block comment style /*. . .*/ is also supported).When we refer to a block, we are indicating the code delimited by the pair of { and}.Starting at line 7, local members are defined. The colon is used to denote aninitially assigned value and since Cadmium makes heavy use of type inference, thetype is unnecessary. Inside code bodies, the colon syntax is used to define a newvariable, similar to the let keyword in other languages, such as Rust and Swift.The member below shows a case where the type cannot be inferred as the mem-ber has an implicit default. When defining a variable the @〈typename〉is used todenote the type. The language supports the expected primitive types: integers,booleans, floats, etc. In this case, the member is another entity type. Unsurpris-ingly to those who have read our introduction, the language supports a numberof collection types, as built-in data types (much more on this below) and collec-tions are a specialized form of entity (also below). In the case of this example,the member is a CDList, which is a list in the same sense as in C++ or Python: adynamically sized and indexable series of values1.The code starting at line 13 shows a method definition, which should be famil-iar to any programmer of an object-oriented language. In this case, a return typeis omitted – equivalent of a C/C++ ‘void’ method. If a return type was needed itwould use the @〈typename〉suffix to indicate this.Finally, the signature and block beginning at 18 demonstrates a receiver, codedesignated to, as the name suggests, execute when a given message is received2.Cadmium makes heavy use of tuples and messages are one such use. The messageNewValueMessage is defined on line 2, which uses the standard tuple definitionsyntax. Tuple definitions can be named (the message keyword is, at present, syn-1All collections in Cadmium have the CD prefix. This is probably due to the fact that the authorwas investigating Objective-C (amongst other languages) during development and liked NS prefix.This was, in retrospect, probably unnecessary. Though it does make it easier to easily differentiateuser defined types from built in types at a glance.2Cadmium takes advantage of the full unicode character set, so→ is a valid operator in Cadmium,though for every non-ascii command there is an equivalent combination of ascii characters (->in thiscase ). We realized that one of the great unexpected challenges of the new language designer is thelack of matched delimiters in ascii. We found a lot of use for the more exotic delimiters, such as theFrench Guillemets to make our code aesthetically pleasing, even if it required us to add macros toour IDE.25tactic sugar to name a tuple type) and have named elements, such as number andshouldAdd in the example. These names are optional and tuple members can beaccessed by their index as well. Cadmium employs a form of structural typingwhen dealing with tuples, in that any tuples with the same member types can beused interchangeably. Names are only used in cases where disambiguation is re-quired (this is most common in the case of empty messages (<>), which are oftenused to invoke behavior that doesn’t require parameters).A reader may question the difference between a method and a receiver. Whyhave two very similar ways of invoking code with a set of parameters? Thereare a few reasons for this, but to fully answer this we first need to talk about theCadmium memory model.As we have stated above, we want to take advantage of the benefits of sharedmemory, but rigorously mediate all accesses. All data and code in the programis owned by some entity. An entity has unfettered access to the data it owns, butmust request data through the message passing interface (which we generalize tothe query interface, discussed below). So the primary difference between a receiverand method is a matter of access scope. An entity can receive a message no mat-ter the relationship between sender and receiver, but an entity can only invoke amethod on an entity that it owns. This differentiation provides a distinct semanticseparation between local interface and global interface.Additionally, during design we concocted other processes that were unique toreceivers such as establishing a channel, which in our model was an ephemeralcollection that was shared between entities. However, as our work progressed wediscovered that we could express our final goal of a viable prototype without thesechannels and to simplify things, we omit them from the discussion.The syntax for method invocation of a entity’s own methods is shown on line21, which is the same as normal function call. For owned entities, Cadmium usesthe dot syntax〈entity〉.〈method name〉(〈parameters. . .〉)common to most Object-Oriented languages such as Java and Python3.The above is the first of several expression schemas we will give in this chapter.3Or to partially mollify the OO language purists: languages with quasi-object-oriented facilities,such as Python.26We will use the following format for these:• non-italicized text denotes keywords or operators (the . in the above forexample)•〈description〉gives a description of a required element (such as the place-holder for an entity above)•[description]gives a description of an optional element•(alternative1∣∣alternative2∣∣. . .) gives a list of potential elements, or sets ofelements, separated by the∣∣symbol, such that the statement requires exactlyone of these alternatives to be present3.1.2 ManagersGiven our restrictions on where data lives that we detailed above the question nat-urally arises: How do entities ‘find’ each other if arbitrary data isn’t allowed inglobal scope?. The answer is managers. Managers are a specialized type of entitythat implicitly implements the singleton pattern [31]. That is to say, when a man-ager is defined it is automatically instantiated on program start and there will beexactly one instance present throughout execution.The name was chosen specifically to denote that it is responsible for controllingone aspect of the program. For example, from our introductory story, Pat mayhave constructed a Rendering manger, Sanjay a Network manager and Kamiko aScripting manager. We will see below how the managers make up a critical part ofprogram flow.Another example from the Cadmium Standard Library is console output. EveryCadmium program has a manager, STDOUT that ‘manages’ the currently connectedconsole output stream. To display a line of text the programmer would writeSTDOUT ← "some informative text"using the message sending syntax (§3.1.4) as the STDOUT manager receives a mes-sage containing a single string for display. The single primitive is interpreted asbeing equivalent to a tuple of length one.27The basic syntax for a manager is essentially identical to a basic entity, butreplaces the entity keyword with manager in the definition. The manager key-word can also be used with an existing entity definition to create a named managerinstance of that entity. This second form allows us to create manager collections,which provides a segue to finally discuss collections.3.1.3 CollectionsFrom a certain point of view, Cadmium entities are trivial collections, serving asaggregate collections of their members. Their members can be queried (discussedbelow) and modified via messages. The key difference is that ‘official’ collectionscan have dynamically sized data.In the development of our prototype, it was decided that, to avoid overextend-ing finite development resources, we would only have collections that were definedin the language, though obviously we forsee that a completely realized Cadmium(or its descendents) would have user specified collections. The current set of col-lections are implemented via a special application of our C++ interface (§6.2).The prototype, as presented, supports multi-dimensional arrays, dynamic lists, sets,graphs, digraphs and trees.While the actual definition of new collections is outside the scope of what isrequired for the demonstration of our model, interacting with them is central to ourwork and we go into detail below(§3.2).3.1.4 Message Sending and Program FlowMessage sending uses the sendto operator (←) with the syntax〈targets . . .〉← 〈message tuple〉where targets is a list of entities.Note that this implies a synchronous or blocking message, meaning that thecalling code will be suspended until the delivery is complete. The runtime enginehas been built with support for asynchronous messages, but they turned out not tobe required to express our proof of concept.The reader may wonder how our applications execute in parallel without asyn-chronous messages. There are three facets of the system that we use to achieve28parallelism.The first is parallel query execution and query delegate blocks, which we willdiscuss below (§3.2).The second method involves the use of broadcast messages. A message canbe sent to every member of a collection (or view (§3.2.4) ). The delivery will bepotentially in parallel, depending on the results of the static analysis (Ch 4).Both the first and second methods are ways of expressing data parallelism –applying some homogeneous logic to a single collection (though the broadcast de-livery may ‘cascade’ to items outside the target collection). Task parallelism, het-erogenous logic potentially applied to potentially multiple collection, is expressedeither through specifying multiple targets for the message or through the interplaybetween managers and a special kind of broadcast messages: system messages.System messages are messages that are sent from the scheduler to every managerthat has subscribed to them. A manager is considered subscribed to a system mes-sage if it has a receiver defined for that message.The scheduler uses a message provenance tracking system to track any cascadeof messages. In brief, when the receiver of a message A sends a message B, thiscreates a tracked dependency between B and A. We do this at runtime to account forthe asynchronous messages supported by our scheduler. In this way the scheduleris aware of when all transitive consequences of a message have been executed,which we refer to as the message being complete. This allows us to have anothervariation on message receivers: completion receipts. A completion receipt will bedelivered to an entity with a receiver of the form~〈message name〉→ {[statements . . .]}which is identical to the general receiver syntax with the addition of the leadingtilde. The scheduler will send a receipt when a sending of the named message iscomplete.We consider the above, like much of this work, to be a core for a fully realizedsystem of flow control for parallel programs. It is sufficient to realize our prototypeand justify our proposals, but also serves to be an avenue for further inquiry4. Inthe prototype, completion receipts only apply to system messages and managers.4Also, completion receipts are a partial fulfillment of our life-long desire to define a meaningful‘comefrom’ to complement the goto statement.29Currently Cadmium defines 3 system messages: Initialize, Execute andCleanUp. These will be delivered in that order, each after the previous systemmessage (and their corresponding completion receipts) is complete.So the combination of system messages and managers allows us to do awaywith a single entry-point (‘main’ statement). Each manager that subscribes to theInitialize system message will begin execution, potentially in parallel, uponprogram initiation.As a final note, a system message can be reinitialized by sending an instanceof that message to the program. Every Cadmium program defines a manager-like global construct with the same name as the program (this construct providesvarious other ‘program level’ functionality such as retrieving command line argu-ments). When program representation receives a system message it will cause thatmessage to be rebroadcast when the current broadcast is complete, as we want toretain the property that only one system message is active at once. We find this tobe crude, but effective enough to satisfy the needs of our prototype and consider ityet another prime candidate for further exploration.3.1.5 AccumulatorsIf there are two concurrent modifications to the same value, we are at risk of a stateconflict. As the reader will be aware by this point, the central thrust of this work isto ensure that this circumstance never occurs by not coscheduling two tasks wherethis would be possible. However, there are a non-trivial number of circumstanceswhere the final result is independent of the order of modifications and the result isnever consulted during the process. As a simple example, consider a process thatmaintains a count by incrementing an integer. Because addition is commutative,no matter the order that these increments are applied the result is the same.It is not feasible to detect these circumstances by static analysis. It is, again, aproperty that must come from the programmer’s head. To fully exploit these orderindependent, write-only operations, we introduce accumulator types, which can bemodified in a parallel context5 It will not cause the static analysis to consider thisa write, but will consider a read to be a serial execution inducing operation.5We generally implement these accumulators by storing results ‘thread locally’ and combining atthe end. Similar techniques exist in the literature [85].30For primitives, we support an accumulatorint as an integer type that supportsaddition and subtraction as with our example above. For collections, the built-insupport includes accumulator sets and accumulator lists.3.2 Collection Usage, Collection Contracts and Queries3.2.1 Collection ContractsAs we discussed in the introduction, at the highest level we want the programmerto tell the scheduler: I want to use this part of the collection and correspondinglyI am done with that data. We want to be able to do this without compelling theprogrammer to add a bunch of extra markup or put the onus on them to figure outthe optimal placement of these statements.Specifics will follow, but these statements will be used to bound a CollectionContract. We then guarantee exclusive access to the subcollection described forthe interval between the two statements. Under DnC all data-accesses must be partof a Collection Contract, and thus mediated by the scheduler6.Every model of parallelism defines the demarcation of critical sections of code7.In the standard threading model, one uses mutexes and semaphores; monitors usemethod bodies and Software Transactional Memory [78] uses specific transactiondelimiters. Similarly, DnC uses Collection Contracts to mark critical sections andthey serve as the primary unit of scheduling.Collection contracts, unlike other ways of demarcating critical sections, areinherently tied to the data being protected within. For example, STM allows spec-ifying where the transaction begins and ends, but those demarcations do not tell uswhat data the transaction protects. Using locks has the same limitation. As a re-sult, any static analysis method would struggle extracting programmer intention atcompile time. Jade’s withonly clause [72] is perhaps the most similar to our col-lection contracts, but since Jade was an extension to C, statically verifying that theenclosed code respects the declared intentions would be a quite difficult task. We6Except, as described above, for local variables and, in Cadmium, entity member variables(§3.1.1), which are accessed in a traditional way to avoid burdening the programmer at the costof more static analysis.7In some cases, such as map/reduce [26], this is to some degree implicit.31hope to address this shortcoming by using declarative syntax to express intendeddata accesses.In the work that preceded and inspired this, Synchronization via Scheduling(SvS) [16], we considered the data being used as essentially a ‘cloud of pointers’without reference to how they were organized. When we began to look at how tomake our algorithms better, we realized that it wasn’t a matter of needing moreclever algorithms8, it was a matter of needing more information. By incorporatingtypes and collections into our model, it made for much better estimates, eliminatedruntime bookkeeping and allowed finer grained decomposition of the program andits access patterns.A Collection Contract has three phases:Isolate Determine what data is needed, i.e the bounds of subcollections needed forthe operation.Modify/Derive Transform isolated data and/or produce new data.Release Conclude the operation (this is often implicit).3.2.2 Declarative Syntax for Collection ContractsAs we outlined above (Ch 2) SQL is perhaps the most widely adopted and time-tested language for declaratively expressing data accesses. The separation of whatthe programmer wants to do with the data from how they wish to do it, allowsdatabase systems to implement a variety of query optimizations underneath thecovers. Expecting that parallel programmers will write their code using SQL isa rather bold assumption, but we capitalize on the fact that C# LINQ [5], essen-tially as subset of SQL, has gained wide adoption as an extension to an imperativeprogramming language, due to its expressiveness.Cadmium builds on the success of LINQ by adopting a very similar syntax;the difference is that unlike C#, and more similarly to Dryad [46], we capitalize ondeclarative aspects to deliver transparent parallelism.The following is an example of the Cadmium query the programmer wouldwrite to express the boundaries of a collection contract:8Though we do have these now as well, if we can be allowed a little self-congratulations.321 |SELECT foo FROM bar| → view12 release view1Listing 3.2: Cadmium contract: Isolate and Release phasesThe contract is the span between the initiation of a query and when the viewis released. Queries comprise the Isolate step of a Collection Contract. The Mod-ify/Derive phase can be expressed almost identically to a standard imperative pro-gram. The query results are represented by a view, borrowing terminology fromthe database community.SQL, and by extension LINQ, are designed for relational data. Though LINQqueries can be issued on any collection that implements the IEnmerable inter-face, the examination of this interface reveals that it assumes tabular data and wouldnot effectively support data structures, like graphs. To address this shortcoming weextend the LINQ-like syntax to support the concepts of collection structure andcontent.A collection’s content is the data itself. Its structure is the organization of thecontent and the description of how each data item relates to the others, such asthe ordering of a list or the hierarchy of a tree. This second aspect includes dataconcerned with these relationships, such as next pointers or vertex labels.To effectively support queries concerning both content and structure, we pro-vide three categories of queries:Figure 3.1: A subtree substructure query on a tree33Content Queries based on data attributes, such as finding all elements with an xcoordinate above 10;Structural Queries based on the structure of the collection, such as selecting arrayelements 3 through 5 or the subtree rooted at node v.Hybrid Queries a combination of the two, such as finding all elements in a subtreethat have a dirty flag set to true.Instead of attempting to create a grammar that incorporates every possible col-lection, we define atoms: parameterizable, collection-specific logic. These atomscan be seen in the SELECT9 query which has the syntax:SELECT([cardinality]〈bound variables〉∣∣{ 〈substructure atoms . . . 〉 })[FROM〈source collection〉 [[WHERE〈filter expression〉 ][VISITBY〈visit policy〉 ][ORDERBY〈order policy〉 ][AS〈output names〉 ]where the various elements are as follows:cardinality the number of results returned, respecting visit order policy (see be-low). |SELECT FIRST foo FROM myTree| yields the first item found by a depth-first search (the default tree visit policy). By default the ALL cardinality is assumed(? in SQL).bound variables the placeholders variables for specifying a node. |SELECT aFROM arr WHERE a.val == 5| describes all elements with a val of 5. SQL usessimilar concept to describe columns, whereas variables refer to collection elements,and the dot notation is used for properties.substructure atoms collection specific parameterized identifiers for indicating aquery on a particular type of substructure of the collection. Essentially, while everycollection type has at least one basic, indivisible element type (nodes or edgesfor graphs, elements for list, etc) there are emergent structure types that involve9Note that we do not implement common SQL mainstays such as COUNT, AVG, etc. These canbe accomplished using other features. Adding them and others would not be difficult and expandingatom functionality is part of our future work.34multiple of these basic elements and in some way are dictated by the structure ofthe collection. Examples are easy to see in recursively defined collections such astrees and lists where the composite pieces of the structure are also structures of thesame type (i.e. subtrees for trees and sublists for lists, which are also trees and listsin their own right10). These substructures are important enough that any work withcollections that aims to encompass their general use needs to incorporate them intotheir ‘vocabulary’. The substructure atoms are collection specific and are currentlydefined by the Cadmium Standard Library.As an example, the following query, |SELECT range(n,n+2) FROM arr|, wherearr is an array, allows us to isolate the three contiguous elements. These canalso contain bound variables. Figure 3.1 shows the results of |SELECT FIRSTsubtree(s) FROM myTree WHERE s.val > 5|, making use of the WHERE clauseas discussed below.source collection the collection the query is performed on. This can include sub-collections(§3.2.3), views (§3.2.4) and pseudo-collections which lazily-evaluatedcollections such as 1. . .100, which denotes the integers between 0 and 99, inclusive(we will see this used in §7.2.3).filter expression a boolean valued, side effect free expression, making referenceto the names defined following the SELECT statement. Used to filter candidatesfor the result. We have seen this already in the example |SELECT a FROM arrWHERE a.val == 5| which will produce a set of results of exactly the elements ofarr where the val member field is 5.visit policy identifier for specifying the order the collection elements are evaluatedin, including for example: reverse for arrays and depthfirst for trees and graphs.As with substructures, these are collection specific and defined in the CadmiumStandard Library. Each collection defines a default visiting order. For example,the CDList collection type has randomaccess as the default visiting policy whichmeans that it will process the elements in no specified order.order policy the ordering for the results, which may differ from the visit order. Apredicate, such as a.val < b.val can be provided. Importantly, the results can10With, perhaps a little definitional slight-of-hand for the null element: empty tree or empty list.35be marked as unordered, indicating that the ordering is unimportant. The com-bination of visit policy and order policy are mostly applicable to delegate blocks(§3.2.5) and will determine if parallelizing this query is possible and if so, whatmethods are applicable.output names gives names to the items of the output if a bound variable has notbeen used. This is primarily useful for delegate blocks which are to be discussedbelow (§3.2.5).A query itself is a special type of a message so it uses the bidirectional messagepassing syntax11:〈target collection〉← |〈query〉|[→(〈view name〉 ∣∣{ 〈processing code 〉 }. ]In this case the FROM is omitted in query. The version with the FROM clause omitsthe sendto operator and is syntactic sugar to ensure that queries on constructs suchas pseudo-collections have a clearer form.The output is optional as there are ‘queries’, such as INSERT and DELETE thatchange the state of the collection without needing to ‘return’ anything to the in-voker. They will be discussed below (§3.3).The first alternative for the output is a view, discussed in §3.2.4 and the secondis a delegate block, discussed in § Memberships and SubcollectionsUnder the DnC model, every entity12, collection or primitive is a global manager(§3.1.2) or is in exactly one collection.However, there are times when a programmer will wish to demarcate someelements of a collection in a specific way. Cadmium allows the definition of sub-collections which are collections that contain references to elements of a particularcollection13. A subcollection functions identically to a collection, though it mayhave different structure14. Collections may be composed, such as the quadtree11This is currently the only allowable use of the bidirectional message syntax, but we expectedthat to change very shortly into our future work.12Aggregate object, potentially with behaviour attached (§3.1.1)13For those wondering, yes this does create additional overhead, but the compiler is being designedto emit only the nessesary code when required.14Currently only sets and lists are supported as subcollection types in Cadmium, but work proceedson expanding this.36(§7.2.4), which is a tree of resizable arrays.A subcollection must be explicitly marked with its origin collection. If wewanted a list that contained the nodes of a tree, arb, the type would be writtenCDList[*arb]. When any function-like construct receives a subcollection as a pa-rameter, that parameter must also indicate the origin collection. The consequenceof this is that it is possible to know at compile time which collection is being ac-cessed in any query. This solves a large number of potential aliasing issues.This does require the programmer to explicitly create a hierarchy of data intheir code. Though in our experience, most complex classes have a single intendeduse and most programs are built with collections for each of these classes. It shouldbe noted that the one collection rule doesn’t prohibit moving items between col-lections that contain the same type. Even when the aggregate data items are usedonly singularly, the worst the programmer needs to do is define a manager set andreference them from there. Interestingly, this enforced organization eliminates afew memory errors. Given that the implementation of the collection is correct, it’simpossible to have an ‘orphaned’ allocation, as it must exist in some collection,somewhere.3.2.4 ViewsAs shown in Listing 3.2, the result of a query is stored in a view. As in mostdatabase implementations, the creation of a view does not imply that the data iscopied. Views are the key construct used to enforce transparent parallel access tocollections using mechanisms described in Ch 4 and Ch 5.The view is released implicitly when it goes out of scope; the programmer neednot worry about ‘dangling views’. Furthermore, a programmer can elect to end thecontract explicitly with release. As this keyword implies, the termination of aview corresponds to the Release phase of a collection contract.While a view cannot be stored directly, it can be retained by transforming it intoa subcollection by release arrView → arrSub. In which case, arrSub wouldneed to be queried in order for the programmer to obtain access to its contents.This process is often used to make references to a specific element to be re-tained for future reference. These function as a bit of a hybrid between smart37pointers and optional types. The latter because they may be ‘empty’. We calledthis individual element subcollection creation single item stored view15. It wascommon enough that the compiler emits specific support for it, without all theextra baggage of actually being a subcollection.3.2.5 Delegate BlocksIn many cases, it is desirable to directly process the elements resulting from a query.This is similar to the ‘apply’ operation common to many languages and systems.This is shown in the following example:1 i: 723 |SELECT range(i,i+2) FROM arr ORDERBY unorderd AS v| → {4 if v == 5 {5 STDOUT ← "We found a five in range"6 }7 }This is one way we can express Data Parallelism in Cadmium. The policy weuse for all operations is that if results are ordered, they will be processed in thatorder so as to give the programmer more control and fewer unpleasant surprises. Ifunordered, we can take advantage of that and often execute in parallel. Similarly,if the collection is non-linear, such as trees, the delivery can still be performed inparallel, where ‘parents’ are evaluated before children. Note that if the results of aquery are non-disjoint, no two overlapping elements will be processed simultane-ously.3.2.6 Isolation SemanticsBy default, the view is the subcollection – by which we mean that once the querycan be satisfied, the invoking code has direct access to the memory of the elementsdescribed. As a consequence, we can directly reap the performance benefits ofa shared memory system with no unnecessary copying. This is one of severalways that our model differs from Software Transactional Memory. There are no15This really needs a snappier name or at least cute short form. SIS View? Solo View? Anotherline item for our future work.38rollbacks or failed operations in our model16.From the point of view of the contract invokers, it is as if they are the onlyentity accessing that collection. No changes can be made to the isolated data fromoutside the contract until it is complete. Furthermore, no contract that overlapsa contract in its Modify/Derive stage will be satisfied, allowed to proceed, untilthe first contract is complete. Effectively, changes will not be visible outside thecontract until its completion.This set of policies gives very strong isolation guarantees, and the set of non-nested contracts are serializable. However, the DnC model, and thus Cadmium,does allow for nested and overlapping contracts as in1 |SELECT foo FROM bar| → view12 |SELECT foo FROM baz| → view23 release view24 release view1This is two contracts in terms of isolation. The invoker has unfettered accessto the results of the bar query upon satisfaction of the first SELECT statement; how-ever, other entities may modify baz until the satisfaction of the second select state-ment. The system does not guarantee the repeated reads property between sequen-tial contracts, only within them. Thus, a second query to baz after release view2,but before release view1 may show changes made under other contracts17. Thismanner of strong guarantee and nested ‘transactions’ leaves open the potential fordeadlock. We detail how we prevent this in Section 4.3.There is an interesting sticking point between wanting to give strong isolationguarantees, giving the programmer tools to reason about their code in a more se-rial manner and augmenting collections. For example, consider a task holding acollection contract that is the result of a content query and another task adds newelements to that collection. If the new element fits the criteria for inclusion in thecontract, the collection contract is out of sync with the current state of the col-lection. We decided in the end that our isolation guarantees would not cover thiscircumstance and collection increasing would not be covered under the ’terms’ of16Non-concurrency related exceptions are part of our future work.17This is one of the few cases where programmers may need to be aware that they are program-ming in parallel. It was felt that this compromise between potential performance and surprising theprogrammer was justified.39a collection contract. That is to say, that if one were to secure a view to all itemsWHERE x.val > 5 and before it was released another operation requested to add anew element with a val of 7, that request would be granted, but the first operationsview would not be augmented.We are deeply considering adding a facility to specify the level of permis-siveness with respect to this behavior, likely enforced by something similar to thecollection reservation mechanism (§5.5).3.3 Further QueriesWhile SELECT is the cornerstone of our query system, it is not the only type ofquery message supported in Cadmium. These other queries will be easier to definenow that we have passed the previous discussion.There is a bit of nomenclature nuance here. By the dictionary definition, aquery is a request for information and not, for example, a request for action. So,INSERT described below, which adds a new element to the collection is not a requestfor information, but falls under the umbrella of what we’ve been calling our querysystem. However, as SQL has set the precedent of calling such things queries, wewill continue to do so.3.3.1 INSERTAs the name implies, INSERT is the mechanism by which items are added into acollection. The INSERT, query message uses the following syntax:INSERT(NEW[〈initiation message〉 ]∣∣FROM 〈source variable 〉 )[INTO〈target collection〉 ][BY〈insertion policy〉 ]where the various elements are as follows:initiation message the mechanism to create new entities inside the collection, arough approximation of the emplace series of method calls in the C++ STL.While not pictured in Figure 3.1, a receiver can have an optional NEW keyword thatmarks it as the message that can be delivered on instantiation, making it similar infunction to the C++/Java/etc constructor. So the initiation message must match themessage type specified in one of these NEW receivers.40source variable used to insert an already existing element into the collection.target collection the collection to be inserted into.insertion policy analogous to the VISITBY policy in SELECT (in fact, BY is simply analias for VISITBY), describing what part of the collection to insert into, with refer-ence to its structure. For example, if we wish to add an element to the end of a list: |INSERT 451 INTO myList BY append|. Note that the CDList also has an appendmethod, so if the list was local, the programmer could write myList.append(451).Currently, if a local item is inserted into a collection, it is done by copying, sothe original item is retained. Advanced future work includes more robust languagefeatures and optimizations for moving items between collections.3.3.2 DELETEIf you can put something in, you need to be able to take it out again. DELETE isthe complement to INSERT, and it removes items from a collection (and from theprogram in general). The query message uses the following syntax:DELETE([cardinality]〈bound variables〉∣∣{ 〈subcollection/view 〉 })[FROM〈target collection〉 [[WHERE〈filter expression〉 ][VISITBY〈visit policy〉 ][ORDERBY〈order policy〉 ]In general, this follows the same form as the SELECT statement. The only ex-ception to this is the second alternative following the DELETE keyword. In this casethe programmer can submit a subcollection or view taken from that collection andthis operation will remove all elements common to both. We use this, for example,in mesh refinement 7.2.2 when we take a subcollection (a contiguous subgraph)that we built in an earlier step and delete all vertices and edges in the originalgraph, so we rebuild it with better properties.Note that, if the target in question is a subcollection, which the reader mayrecall is essentially a collection of references to another collection, with potentiallydifferent structure; then the ‘reference’ is removed from the subcollection, but theoriginal collection remains unchanged. However, deleting from a view will removethe item in the original graph, because the view is the collection. This highlights41one of the distinctions between these two constructs.As stated above, this deletes the data item entirely, but it would not be muchmore than trivially difficult for it to ‘return’ the deleted item giving it an ‘extraction’aspect, as well.3.3.3 UPDATEThere is a very common pattern in software engineering. It has many names, butwe like to call it double buffering. This is the case where for a given set of data,we build a new version of the data, generally for a ‘next iteration’ with referenceto the old. When the new one is complete, the old one is discarded. This typesof procedure is a desirable target for parallelization with applications of processeslike the Stencil Pattern [59]. For this we provide the UPDATE query message, whichwe use to great effect in the PageRank evaluation (§7.2.1).The UPDATE query message is very similar in structure to SELECT, for hopefullyobvious reasons, with the following syntax:UPDATE([cardinality]〈bound variables〉∣∣{ 〈substructure atoms . . . 〉 })[IN〈source collection〉 [[WHERE〈filter expression〉 ][VISITBY〈visit policy〉 ][AS〈output names〉 ]where the elements are defined identically to SELECT. ORDERBY is omitted, at leastfor our initial version, as delegate blocks for UPDATE could be accomplished withan equivalent SELECT query.The primary difference lies in the generated view, which is a specialized updateview. An update view largely behaves like a standard view, except it has two prede-fined ‘members’: current and next. current has the contents that currently existin the collection, and next is the space constructed for the derived values. next isalso given the same structure as current. When the update view is released, thenext values will be ‘committed’ in place of the previous values which will thenbecome visible outside of the contract.423.4 ConclusionIn this chapter we have given a brief outline of the Cadmium language, highlightingthe features most pertinent to demonstrating our model. We constructed these as-pects specifically to allow the programmer a wider range of built-in facilities, suchas visit policies, that allow for expressive and concise operations on collections.We take every opportunity to channel the programmer into expressing parelleliz-able code, to sharing their intentions.However, this is only the first step. After the programmer has been kind enoughto share their thoughts with us, we need to extract them appropriately. Our con-structs expose numerous opportunities for parallel execution, but not all of themare safe, especially in combination. In the next chapter we will detail the compila-tion process and the static analysis we use to take advantage of our purpose-builtlanguage elements.43Chapter 4Extracting Intentions: CompilingCadmium and Static AnalysisIn the previous chapter, we described the fundamental aspects of the Cadmiumlanguage necessary to embody the DnC model. We focused on capturing more ofthe programmer’s intentions with rich, composable constructs that describe oper-ations on parts of collections. Importantly, these constructs focused on the whatand when of their operations, without compelling the programmer to tell us how(beyond specifying high level constraints such as visit order). However, this wasonly the start of battle. The programmer is, in a sense, ceding control to us inmatters of parallelism and it becomes our responsibility to ensure their programsexecute safely and efficiently. We have two opportunities to make the best use ofthe intentions that have been expressed to us: during compilation time (the subjectof this chapter) and during runtime (the subject of the next).4.1 Parallel OpportunitiesWe specifically designed the constructs in the previous chapter to allow paral-lelism whenever possible. These possibilities can be broken down into two dis-tinct catagories: data parallelism and task parallelism. As stated previously, thedifference between them lies roughly in the logic employed. Data parallelism isapplying one process to a set of data, where task parallelism is the simultaneous44execution of different processes that may or may not act on the same data items.There is a third major form of parallelism (assuming, as we are, limiting ourdiscourse to multicore shared memory architectures): instruction-level parallelism.This is where some hardware instruction produces more than one effect simultane-ously, such as the processor supported SIMD (Single Instruction Multiple Data).We do not address this directly here, but given our transpiling process(Ch 6) wemay still reap the benefits of a more mature compiler that may emit such instruc-tions.We will briefly describe the possible opportunities we have to emit paralleldispatching code when compiling a Cadmium program.4.1.1 Task ParallelismManagers/System Message As detailed above (§3.1.4), when a system messageis received, the corresponding receiver body of the manager is ready to run. As weknow the identities of receiving managers at compile time, we simply emit a seriesof statements to create a task for each recipient body and submit it to the scheduler.This way, each body will be executed in parallel, to the limit of available CPUresources.Heterogenous Message Targets Recall (§3.1.4) that the sendto statement mayhave multiple recipients. In our current implementation where we only use syn-chronous messages, we normally transform the message directly into a functioncall, instead of incurring the overhead of enqueuing the message and spawning anew task for it. In the case of multiple recipients, we transform this into a series oftask creation/submission statements.The potential snag with this is that we don’t want to continue executing thebody of the sender until the corresponding receivers have completed. Given thatreceivers can in turn be senders, we must assure that any cascading effects are takeninto account. Recall, that we have mentioned previously the message provenancesystem which tracks the cascading effects of receivers sending further messages.We will discuss the composition and implementation of tasks more in Chapter 5;but for now, we will only need to know the fact that Cadmium runtime tasks canbe suspended and resumed arbitrarily. This functionality is not available to the45application programmer, but we can certainly emit code to that effect during com-pilation. Every invocation of a receiver is associated with a message provenanceobject, which is the message that spawned it. When these objects are created, theycontain references to the message provenance of the task that spawned them. Ineffect this creates a tree of provenance leading back to the original system mes-sage/scheduler code that initiated the cascade. Each provenance object contains anatomic counter of its child objects. When a new child is created, it is incrementedand when a receiver is finished code is run that signals it to decrease. When thecounter is changed to zero, if there is a suspended task associated with it, the initialsender in the case we’re discussing, that task will be resumed immediately.In the case of multiple receivers, the initiating receiver creates the new tasks,seeded with the message data and invocation code, and suspends its host taskinto the reserved space in the associated provenance object. It uses architecture-appropriate memory barriers and a check to handle the corner case where all mes-sages have completed before the suspension takes place.This is nearly identical to the way the scheduler is invoked when a systemmessage is complete; but instead of resuming a task, it initiates code to broadcastthe next system message (which may be the broadcast of completion receipts) afterchecking to see if any message requesting a re-broadcast of the current systemmessage has been received. If there are no further system messages to send, theprogram is complete and it initiates the shutdown process.4.1.2 Data ParallelismQuery Execution A query itself may be executed in parallel. This is especially truefor content queries that yield a view (we will discuss delegate blocks immediatelybelow). Currently all Cadmium collections are easily partitionable, for reasonsthat will be become obvious when discussing signatures (§2.1). When the queryitself involves examining the collection, or parts thereof, these partitions can besubmitted, in batches, to the scheduler, using a multipart directive (§5.1).Invoking this behavior is contingent on the visit policy of the query1. If the1It guaranteed that we know these policies at compile time. Visit policies and order policies arenot first class citizens in the language and so the programer cannot, for example, assign a differentpolicy to a variable based on some input and invoke the query with that variable.46visit policy has the neutral property, such as the default set visit policy, then this isalways possible as no ordering has been imposed. If the visit policy has the linearproperty, such as forward and reverse policies of arrays and list, then as long aswe can determine which partitions fall into the range set by the cardinality specifierwe can use parallel dispatch. There is no reason we could not extend this to non-linear data structures, such as trees. For the purposes of this prototype these visitpolices were sufficient for demonstration purposes.If any ordering besides unordered is requested we have one of two situations.If the collection is already in that order (forward) or the desired ordering is im-plicitly determinable from the existing order (reverse), we then can dispatch thequery in parallel, again using batches of partitions, and ‘stitch together’ the resultsin the requested order. If the ordering is arbitrary, i.e. defined by an expression,we can merge sort in parallel by submitting sequences of batched partitions. Asthe data needs to be already ordered when it is examined by the caller, orderingdoesn’t prevent parallel expression in the way it does for delegate blocks.Delegate Block Invocation Delegate blocks (§3.2.5) provide a method of applyinga block of code directly to the results of a query. In our very first experimentswriting Cadmium code, we discovered that we would initiate a query, receive aview and immediately want to iterate through it; so we created this functionality tofurther exploit this emergent pattern2.Effectively, we have the same situation as satisfying content queries above,with the restriction that the programmer has told us to do something to each itemas we encounter it. So, we add the constraint that unless the query has the orderpolicy unordered, we will execute the delegate blocks serially; though we willreport this to programmer to make sure this is what they intended (§6.1). Thisis one place where a deeper application of static analysis could be quite useful.If the block doesn’t depend on any side effect from a previous instance of theblock, there is no reason to force serial execution of the block. This can be a bit2As we created this, we noticed the growth in the use of lambdas and closures in the generalprogramming world, where a parameterized block of code is first class citizen in the language. Theseare often used similarly to our delegates, which are not first class citizens. We would like to rectifythis, though we see this more in line with how C++ handles generics, by building new instances oftemplated code to retain the knowledge of the exact query for our static analysis.47subtle. Consider the case when the programmer has a list of items in a specificorder and has a delegate block that formats them and displays them on the screen.Technically, no data in the program has been transformed, so we may infer that thisis safe to parallelize. However, printing the data in a different order may or maynot be wrong. We must give the programmer a choice. After much discussion, weelected to take the position that, unless the data structure is implicitly unordered, asin the set, delegate execution would be serial unless specified otherwise. Though,we do note that from a semantic point of view, we aren’t asking the programmerif they want parallel execution or not. We’re asking them about a property of theprocess – if they intend an order or not.Update Views The creation or ‘committing’ of an update view can involve dupli-cating a great amount of data, which can be submitted, as with query execution, asbatched partitions for parallel execution.Many of these opportunities are always safe, such as executing a content queryin parallel. However, There are cases where there this is in uncertain. These includewhen there is concurrent access to collections and whether or not to parallelize adelegate block to an unordered query. We can’t know if these will be safe fromexamining the invocation alone. We’re going to need information from the entireprogram to make a determination.4.2 Evaluating Shared AccessesIn the previous section we identified all potential sites for emitting parallel dispatchcode. However, if a section of code can be executed in parallel and still yieldcorrect results, that does not necessarily mean that it can execute in parallel withoutstate conflicts. In the next chapter we will show our runtime technique (§5.2) formediating collection access; however, if we blindly augmented every collectionaccess with this protection, we would potentially incur a great deal of unnecessaryoverhead. If we can tell at compile time that protection is always unneeded we canavoid adding it. To do this, we need to evaluate each opportunity for the variousaccesses to shared data that are made in the region in question. By comparing theaccess domain, the set of data potentially accessed in each region, we can makeinformed decisions about the code we emit. We may be forced to serialize the48execution or apply the SvS runtime protection method (Ch 5). Furthermore, wecan identify many cases where we can guarantee that the access domain of a regionis disjoint with all others that could be potentially coscheduled, even if that data isaccessed by other regions that could never be coscheduled.4.2.1 Determining Access DomainsThe first step in evaluating the safety of the regions of a program is to identifytheir access domains, which include all possible accesses that may be made. Theregions in a Cadmium program are defined as a generalization of collection con-tracts. In the previous chapter, we defined collection contracts (§3.2.1) as being thespan between the start of a query and when the view is released. We expand thisdefinition to include one more case. We have stated before that the query systemis built upon the basis of the message passing system. An invocation of a receiverbody is considered to be an implicit contract on the entity members used. This is tosay that the scheduler will ensure that it is safe for the body to be executed, basedon the members it accesses, and the message will not be delivered until that can beguaranteed. The contract is initiated by the message sender, begins upon receipt ofthe message and ends when the body terminates.4.2.2 Inferring the Domain of ContractsDetermining any potential data item accessed directly in a particular contract isrelatively simple. Determining which statements belong to a particular contract canbe done with a linear walk through the AST, starting at the query statement (or atthe initiation of the receiver in the case of implicit contracts). With a little attentionto release statements that are enclosed in conditional blocks3, we can determinethe set of statements that are part of the contract. From this we simply analyzeeach expression for variable references and record if they’re a read or a write. Atthe end we have a list of data items accessed anywhere during the contract. Localand member variables are identifiable by consulting the parsed entity. Every otherdata item must belong to a global collection and the result of the one collection rule3As a part of enforcing the contract, if there is a release statement in a conditional block; we emitan extra check after every subsequent access to the resulting view that raises an error condition, ifthe view has been released.49is that we can trace back through any views and subcollections to determine whichone.This accounts for all the direct accesses during a contract, but we still need totrack all indirect accesses. In this case, indirect accesses are those that are presentin code outside the contract that is invoked by a statement in the contract. There arethree, nearly identical, cases where this may happen: function calls, method callsand message sending4. As we have no polymorphism, or other forms of runtimeselection of executed code, we can identify the exact body of code that is called.After that, it’s simply a case of repeating the same AST analysis on the identifiedcode body. This is, of course, recursive and breaking cycles is simply a matter ofmarking ‘seen’ bodies and limiting recursion based on that bookkeeping. As up therecursive stack, we match parameters to the values given at the call site and oncethe process completes, we have a complete description of the access domain. Weretain the information gathered searching for the indirect accesses for reuse, whenother contracts invoke those bodies.This does mean that we assume that the entire source code is available at com-pile time, unlike, for example, C++, which breaks the code into one or more ‘com-pilation units’ and builds the object code for that unit and ‘links’ them together atthe end. We acknowledge that this wouldn’t scale well to truly large codebases andpart of future work includes deriving a scheme to bundle the results of this analysiswith the compilation of some well defined subset of the code. Likely this wouldn’tbe the completely independent compilation that C++ achieves, as different interre-lated ‘units’ would still have to be compiled in reference to each other. At the veryleast, we could retain a great deal of this information between compilations or aspart of reusable libraries.4.2.3 Analyzing Program FlowIn order to ascertain the potential for concurrent accesses to a particular shared dataitem, we need to determine the constraints on when in an execution an access couldoccur. As a very coarse example, if collection X is always accessed once at the very4This is why we, with regret, elected to only deal with synchronous messages for the purposesof our prototype. This limits the potential of the language, but simplifies things enough that we cancomplete our proof of concept to motivate research that would include removing this limitation.50start of the program and always accessed immediately before termination, we canconclude that there is no possibility of concurrent access. Similarly, consider aprogram with only two accesses to a collection Y. The first access is in a contractin receiver A and the second in a contract in receiver B and B is only invoked bya message sent from receiver A after the contract. We can conclude that theseaccesses will never be simultaneous. However, if we have a collection Z that isaccessed in contracts in receiver bodies C and D, where some manager sends amessage invoking both bodies, we cannot guarantee mutual exclusion.In order to generalize this kind of analysis we are going to examine and namethe various partitions across the time domain of a program execution. Recall thatany action in a Cadmium program begins with a system message to a manager andthat any subsequent system messages and completion receipts will not be sent untilthe current message is complete. The consequence of this is that we can divideany execution of the program into a series of ‘epochs’ that begin with some systemmessage broadcast. We will call these epochs phases. Each phase can be namedby the system message that initiates it and we refer to the initiating message as thephase’s phase message. Given that the program can affect which system messageis sent at runtime5, we cannot predict the ordering of the phases in any particularexecution, but at compile time we can determine the total set of phases possible.Once the set of phases applicable to a program is determined, we can refine theproblem to discovering potential concurrent accesses during a particular phase. Forthis we need to examine every possible sequence of invocations that result from thereceiver bodies in the managers that subscribe to that particular phase message.We know that, very deliberately, there are no accesses to shared data outside ofa contract. Furthermore, with the analysis above, we have the access domains ofeach contract.To find the potential concurrent accesses to shared collections, we need to un-derstand the dependency relationships between the contracts that can be invoked inthe phase and thus understand potential conflicts of their access domains.For this, we construct a directed graph where each node represents a set ofdata items accessed. We initially seed the graph with one node each for the access5Currently the programmer can determine if the current message should be rebroadcast, but morecontrol will be added as we expand the program flow controls.51domains of the receiver bodies of the subscribed managers. We trace the code frominvocation to invocation and build the graph as follows, tracking for cycles, so thatwe never visit a contract more than once per invocation:• If one contract follows another in a single body with no multiple target mes-sages in between, we combine their access domains into a single node• If two contracts overlap in a single body, we combine them into one node• If a single message is sent inside a contract, we ignore it as we already haveits effects reflected in the current contract’s access domain• If inside a contract a message is sent to multiple receivers, we split the currentnode v, duplicating its contents, into a vpre and vpost such that any in edgesthat pointed to v now point to vpre and any out edges that originated fromv now originate from vpost . If S1, . . . ,Sn are the subgraphs generated fromapplying these rules to the appropriate receiver bodies of the target entities,we add an edge from vpre to the node of indegree 0 (there should be onlyone) in each Sn. Conversely, for every node of outdegree 0 in each Sn, weadd an edge from that node to vpost . At this point vpost is now the currentnode.• If a message is sent to a single target outside of a contract, we generate asubgraph S that is the result of applying these rules to that receiver body.If S is a single node, we combine it with the current one. Otherwise, wegenerate the subgraphs S1, . . . ,Sn as in the ‘in contract’ multi-target messagecase. If there are no further contracts or message invocations in the currentbody, we connect the current node to the nodes of indegree 0 in the generatedsubgraphs. Otherwise, we split and add edges as in the above case.By memoizing partial results, we can dramatically speed up this process.Note that the resulting graph will not be connected unless there was only asingle manager subscribed to the phase message in question. This is because thereis not a mechanism to act as a ‘sink’ to join any of the chains originating at eachmanager’s receiver body. We have already mentioned that adding that sink is part52of moving this system beyond the prototype phase. That would complicate thisanalysis somewhat, involving more complex searching for cut points, such as therecursive process that we detail immediately below.We then examine each global collection that was referenced during this phase.If it is present in the nodes of more than one connected component of the graphwhere at least one is a write, then we have a potential state conflict. If not, weneed to keep analyzing. For each connected component in which that collectionis referenced, we trace from the single indegree 0 vertex until we find a node ofoutdegree greater than 1, call it b. If no such node is found, then no state conflictcan be detected from that subgraph. If found, we look for a node j such that thereexists a directed path b, . . . , j and j is a cutpoint of the graph and there does notexist a b′ that is also a cutpoint of the graph and there is a shorter path from b toj. We consider the subgraph of all vertices on a path from b to j, not includingthose two vertices. If no such j exists, we consider the subgraph that is comprisedof all vertices v, v 6= b such that a path exists from b to v. In either case, we have anon-connected graph to consider and we repeat the above process recursively untileither a state conflict is discovered or we exhaust all possibilities.By repeating this method6 for each phase, we finally derive a list of phases withpotential state conflicts for each collection in the system.4.2.4 Acting On ResultsAfter completing this analysis, we have considerable information to make deci-sions about whether or not we can safely exploit the opportunities we identifiedabove (§4.1). We identified two areas we needed information for: global collectionaccess and delegate block parallel dispatch.We have, from the previous section, a list of phases and potential concurrentaccesses for each collection. Surrounding each collection access for collectionswhere this list is non-empty, we check which phase we’re in (the scheduler main-tains such information) and if we’re not in one of the indicated phases the access is6The actual implementation of this does not directly embody this exact procedure, as severalsteps, such as finding paths, are implicit, reuse information from other procedures, or are spreadover different passes in the compiler. We have presented it in its theoretic form to avoid going intouninformative details about the construction and transformations of our AST.53allowed to proceed unprotected. In the other cases, we’re going to need to ensuremutual exclusion between any two accesses. However, as we have pointed out,probably to the point of exhaustion, two accesses of a collection may very wellbe disjoint if they’re not accessing the whole collection. For this we’re going toneed something finer grained than a single mutex around the collection. For thiswe use a technique we pioneered called Synchronization via Scheduling, which isthe topic of the entire next chapter.For unordered delegate blocks, we need to consult the access domain of theassociated block. If it contains a write to a member of the entity or a local definedoutside the contract, then we emit only serial code (essentially a standard for loop)for this delegate.If programmers finds this serialization undesirable, they have three options:1. Re-factor the code, though often this is not possible, or practical.2. Query the member variable, in which case the normal rules of collectioncontracts apply7.3. Replace the member with an accumulator type which is designed to be ac-cessed in parallel and useful for counters and other commutative operations(§3.1.5).This may prompt a question: how is the programmer going to know if some-thing has been declared non-parellelizable? Given that this decision is based ona great deal of analysis, this is a good question. Our answer to this is the codeanalysis report, detailed below (§6.1).4.3 Deadlock AvoidanceIt’s one of those hard facts of life that so often the cause of problems is the solu-tions to other problems. In order to prevent state conflicts and the potential errorscaused by values being changed mid-use, we introduced collection contracts andtheir strong isolation semantics. In order to make them useful for writing complexsystems, we made them flexible and composable. In doing this, we opened the door7Member collections must always be queried.54Figure 4.1: Deadlock detection algorithmto another beast from the concurrency abyss: deadlock. To return to our metaphorfrom the introduction, if we aren’t careful how we restrain the hands of our evil 3dchess opponent, he’ll be unable to move at all. The game will stop, forever, withno winner and no loser.Deadlock may occur between two or more sets of overlapping contracts exe-cuted in parallel. Consider the following Cadmium code:1 someMessage → {2 . . .3 |SELECT { element: 23 } FROM foo| → view14 |SELECT { element: 17 } FROM bar| → view25 . . .6 release view27 . . .8 release view19 . . .10 }1112 someOtherMessage → {13 . . .14 |SELECT { element: 17 } FROM bar| → view315 |SELECT { element: 23 } FROM foo| → view416 . . .17 release view318 . . .19 release view420 . . .21 }Listing 4.1: Deadlock Potential ExampleThis is the setup for the classic deadlock condition if these two receivers arepermitted to execute in parallel. If the first receiver is granted the view on element5517 from foo and before it is granted the view to element 23 of bar, the secondreceiver is granted the view on that element. This creates a cyclic dependency andgiven strong guarantees of our collection contracts, the program will wait endlessly.This example is contrived, of course, but the indexes requested may be determinedat runtime, as opposed to the explicit number in the example. This would still beproblematic, even if all requests were to collection foo.During static analysis, we can test for deadlock potential. The augmentedGoodLock algorithm [11] has a number of similarities to our algorithm, thoughtheir model has complications that are eliminated in DnC.Our algorithm uses much of the same information that we gathered while deter-mining the access domain and phase graphs, as described above. We build a modelof the program in terms of contracts, treating receiver bodies and method calls asimplicit contracts on members accessed. Recall that collection in each query canbe resolved unambiguously, even if it involves a subcollection (§3.2.3).From this, we derive a set of ordered pairs, (a,b), of collections, where anIsolate step on b is reachable during the modify/derive of a. Using reachabilityanalysis, we create subsets consisting of the pairs that may be run concurrently.We effectively transform the set of pairs into a directed graph and check for cycleswhich would indicate a potential deadlock. Figure 4.1 shows this process, wherethe blocks with |X| show the maximum lifespan of the contract on collection X.If a cycle is detected, we do one of following8:1. Ensure that satisfying one of the contracts in the cycle is mutually exclusivewith the other contracts, by adding additional scheduling logic in its isolatephase.2. Augment a contract’s isolate phase to reserve(§5.5) the subject of the con-tained contract, and thus prevent the cyclic dependency. When executioncompletes the inner contract’s isolate phase (i.e. the inner query is success-ful), the reservation is released and execution continues.3. When a receiver body forms a cycle with instances of itself, mark it so thescheduler will serialize all message delivery to that receiver.8Some cycles are benign due to details about their components. We ignore these cycles.564.4 Further Static AnalysisThere are several circumstances when multiple instances of a block of code maybe executed concurrently. In addition to delegate blocks (§3.2.5), receivers maybe marked as sinks, acting as consumers in a producer/consumer relationship. TheSTDOUT string receiver we discussed earlier (§3.1.2), is an example of one suchreceiver.In both cases, if an entity member is accessed, or in the case of view delegatesa local variable outside the delegate is modified, incorrect results may occur. Weanalyze each body of code looking for such assignments and accesses. If found, weensure that the code body executes independent of any other instances, and issuesa compilation warning. Similarly, if two or more receiver bodies access the samemember variable, and one of those accesses is a write, we ensure mutual exclusionbetween these bodies and, again, issue a warning9.In the same vein, we examine the combination of each query on a particu-lar collection and only compile a program with the particular behavior needed toguarantee safety. This includes the reservation mechanism (§5.5), or as mentionedabove, we only apply our locking constructs in those phases where a state conflictis possible.4.5 Programmer Directed OptimizationsIn order to assure safety in all cases, we must make conservative assumptions.However, the programmer may have additional insight. Nearly every statement cancontain an annotation, which uses the syntax {?〈key: value, . . .〉}, where theset of possible keys is defined by the particular statement. This includes markinga receiver as a sink and adjusting the runtime parameters(§5.3). Annotations canbe used to silence the static analysis warnings discussed earlier. An annotation isnever required for correct execution, where an equivalent of debug mode wouldturn all annotations off.9Any method call is considered in this analysis.57Chapter 5Enforcing/Executing Intentions:Scheduling And RuntimeAlgorithmsIn the previous chapters, we showed how we added language features to allow forbetter translation of programmer intention to code and then interpreted the infor-mation we were able to harvest. What remains is runtime behavior: how do weorganize and schedule tasks and, more importantly, how do we efficiently mediatebetween two or more different queries accessing the same collection?5.1 Scheduling ImplementationInitially, we spawn a number of threads equal to the number of available cores1.The worker threads contain very little logic and are designed to be modular, exe-cuting directives, arbitrary pieces of code produced by the compiler. Each directivedescribes an operation from processing an array slice to delivering a set of broad-cast messages. Whenever possible, such as the case of system messages (§3.1.4),these directives are precomputed.It is common in our system for a number of related tasks to be created for asingle parallel operation. If we had a standard atomic queue (commonly called1Addressing sharing the machine with the other programs is part of our future work.58a work queue) for storage and distribution of directives, and if an operation had147 tasks associated with it, we would have to perform 147 enqueue and dequeueoperations in the execution of that operation.To reduce the potential of heavy contention during busy periods, our global taskdistribution structure, called the directive store, does model a queue with relaxedsemantics, but has an extra facet to accommodate these related tasks.Every one of these multipart directives has an atomic counter associated withit as part of the public interface of the directive store. Instead of dequeuing thedirective immediately, the idle worker in search of a task decrements the counter.The system is designed such that the value of this counter at the time of decre-ment, which atomic subtraction yields automatically2, will give the worker all theinformation it needs to find which ‘part’ of the directive it is now assigned to ex-ecute. As a very simplified example, consider a parallel operation with 100 tasksoperating on each element of an array with size 100. If a worker decremented theassociated counter from 47, then it will perform the directive on the element at in-dex 463. If the counter is decremented to zero the directive is dequeued before theexecution begins. Note that, from the point of view of the directive store the differ-ence between multipart directives and ‘single’ directives is a only descriptive. Alldirectives use the counter, but a multipart directive is one where the initial value isgreater than 1.The directive store itself is designed to balance between needing to accom-modate an unbounded number of directives and an aversion to memory allocation(which is quite slow in the terms of the time scale the scheduler operates on). Thedirectives are stored in uniformly sized arrays, called cells, which are in turn orga-nized as a linked list.As with many queues the head and tail are numerical indices, which strictlyincrease with enqueue and dequeue operations. When the tail reaches the endof a cell, that cell is removed from the linked list and ‘recycled’, either to internalstorage in the directive store, or deallocated if that internal storage already containsa predetermined number of cells. When the head reaches the end of a cell, a newcell is produced either from the internal storage or, in extremis, through memory2At least on x86.3Yes, we did have multiple off-by-one errors while building this.59allocation and linked into the end of the cell list. The numbering of cell elementsis kept between cells, that is to say that if the tail was 80 at the end of one cell, itwill be 81 at the start of the next cell4.In essence, we are trying to emulate having one infinitely long array, by movingno longer needed regions ‘in front’ of the head. It would be like having a train thatmoved the chunk of track behind it to the area closer to the destination so virtuallyinfinite trips could be done with a very small amount of actual track.In an ideal situation, there are never any more directives at once than can fit intoa single cell. This means that only two cells would ever be needed, avoiding mem-ory allocation and as the cells are designed to be large, cell swapping (a relativelylow cost operation anyway) is kept to a minimum. In the less than ideal situationwhere there is sudden burst of directive creation5, the structure will grow to accom-modate this to the limits of physical memory. There may be a bit of a performance‘hiccup’ during the allocation, but the program will continue operation.All application profiled in this work use a cell size of 8,192 with one additionalcell pre-allocated at scheduler initialization. While we have not done extensivetesting to determine if these are the most effective parameters, they easily handledthe demands of our applications.A directive contains the work we expect one worker to perform, all else beingequal. For a data parallel operation, we create a number of dispatches equal tothe number of cores. If not all threads can participate, the first one finished willpick up the slack by claiming a remaining directive. This reduces the potential forload balancing, but dramatically relieves the pressure on the central queue. We willshow that this gets good results in practice (Ch 7).When a worker thread begins to execute a directive, it does so inside a corou-tine6. If the operation is unable to proceed, the thread yields the coroutine, en-queues it, and moves to the next work. This ‘next work’ is not necessarily anotherdirective. It may be part of the same directive, such as delivering the next broadcast4Given that these indices are 64-bit unsigned integers, overflow is not a ‘real world’ concern5This also applies to the case where for some reason a really old directive can’t be completed andstays on the queue for a great length of time, causing the tail pointer to stay stuck at its location. Thiscase does not occur in our current implementation, but it was considered during design.6We found that boost::coroutine gives the best performance out of the alternatives wetried.60message. These coroutines are pooled to avoid allocation costs and are written tobe thread agnostic, so they can be migrated for work stealing and other purposes.We found that, in practice, yielding a coroutine costs about 500-2000 cycles. Whilethis is nonzero, it is small enough to do very fine-grained operations.While we initially tried to avoid the extra overhead and complexity of a workstealing scheduler, it turned out that a simple algorithm along the lines of the stan-dard ‘steal half’ [38] scheduler made a significant difference. We will discuss theevolution of our work stealing more in the test that inspired its inclusion (§7.2.3).Figure 5.1: Mapping queries to signatures5.2 Synchronization via SchedulingThe primary methodology we employ is a reformulation and rethinking of the con-cept of Synchronization via Scheduling (SvS) [16]. This was a technique thatwe developed and published previously. This entire work is an extension and re-finement of these previously published ideas and a response to the feedback wereceived on that work.In its original form, SvS had the high-level description: Know the maximumamount of data every task will touch and only co-schedule disjoint tasks. Usinga combination of static analysis and runtime bookkeeping, we derived a systemthat would follow pointers across the program data. It would only execute tasks inparallel whose pointers set was unreachable from others. As well, it employed aprimarily streaming model. However, this could drastically overestimate the maxi-mum impact of a task. We used our equivalent of receiver bodies as critical sectiondemarcation and used primarily data streaming techniques. These patterns are allexpressible in Cadmium, along with the more expressive and flexible constructs wedetail here.The genesis of this work was an attempt to correct the shortcomings of the SvS61model and to push it further. The change from considering pointers to the consid-eration of types and collections made for much better estimates, as programmerintention was considered, and much runtime bookkeeping is eliminated. Introduc-ing Collection Contracts also allowed for much finer grained decomposition. Thesetwo factors make DnC much more than an incremental improvement on SvS.The challenge in scheduling this way involves solving two related problems:determining the set of items in a subcollection, and performing set intersectiontests fast enough to be useful on the micro or nanosecond scale. We next introducea representation that will be critical in solving both problems.5.3 SignaturesA signature mapping is a partitioning of a potentially infinite, space into n orderedpartitions for some fixed n, which we call the width. A signature of some subspaceof this space is an n length bitstring, where bit i is 1, if partition i of the space hasa non-empty intersection with the subspace and 0 otherwise.For our purposes, the space generally under consideration is the maximal col-lection space, the collection that is a superset of all other collections of that type.For linked lists, this would be an infinite list, and for general trees this would be theinfinite tree, where every node has infinite children. This will be illustrated below(§5.4).Signatures give us a lossy representation of the subspace, as it doesn’t reflecthow many items are in the intersection — only that it is non-empty. The largern gets, the more accurately it reflects the distribution of items across the subcol-lection. The next section will illustrate concrete examples and show how we usesignatures in a scheduling heuristic.Scheduling with SignaturesSuppose we are executing a data-parallel query on an array, which processes the ar-ray in chunks of 3 consecutive elements using the query |SELECT range(i,i+2)FROM arr|. For simplicity, we use an array size of 8, signature width of 47, and7In practice, we use much larger widths. Our experiments show that for most cases a width of512-2048 is effective; and by default we use a width of 1024, which may be adjusted during static62a signature mapping that assigns indices {0,1} to partition 0, {2,3} to partition 1,and so on. As arrays have fixed size, the space to partition is finite.When the query is invoked, the caller constructs a use signature based on thequery and the attributes of arr, such as the width it uses. The range substructureatom refers to cells 2, 3, and 4 in the array; and so the use signature generatedis 0110, as shown in Figure 5.1. At this point, the scheduling code is invoked.The scheduler retrieves arr’s collection signature, the record of what parts of itare currently under contract. We will assume that arr is currently unused, so itscollection signature is 0000. The scheduler compares the two and finds that thesecond and third partitions of arr are available, so the query is satisfiable. Thecollection signature is updated to 0110. No changes to scheduling are necessary,so control is passed back to the invoking code, and the Modify/Derive begins.Now, say that another worker thread begins with i = 1. This requests elements1-3, and thus the use signature 1100. Control passes to the scheduler where 1100is compared to 0110. The scheduler sees that the second partition is already in use,so this query cannot be safely satisfied. The calling code is suspended by yieldingand enqueuing its coroutine, and its owning thread can move on to other work. Asthis occurs, the first contract concludes, releasing the view, and arr’s collectionsignature is updated to 0000. The second contract is revisited, and the intersectiontest is now successful. The contract can be satisfied, the coroutine is resumed andits Modify/Derive commences.Heuristics, False Positives and PerformanceThis is an ideal point to talk about the cost of the information lost when a signa-ture is constructed. As mutual exclusion is guaranteed at partition granularity, theintersection test could report that two subcollections overlap, when in reality theydon’t: a false positive. For example, if the second contract above had specifiedrange(5,7), its signature would have been 0011, which would have been evalu-ated as a conflict, even though ranges 2-4 and 5-7 are disjoint.The cost of a false negative, the failure to report a real overlap, is both the po-tential violation of the terms of the contract, and subsequent exposure to executionanalysis or explicitly with an annotation.63error. Because of this, signatures are specifically designed to prevent them. Thecost of a false positive is that two contracts are serialized unnecessarily. False pos-itives are inevitable, but can be made quite rare with a wide enough signature. Thecumulative negative effects of a small number of false positives are minimal, whilethe computational efficiency of the technique allows for fine grained schedulingwithout significant overhead.Virtues of SignaturesSignatures have several beneficial properties that make them ideal for our purposes:Quick Intersection Testing The intersection of two signatures can be computedsimply by checking if their bitwise and is non-zero. This tends to be a very simpleand cache friendly process that is optimized in the Cadmium runtime with SIMDoperations.Composability The union of signatures can be computed quickly with a bitwiseor. This is especially useful in batched operations, as the atomic operations ofmanipulating the collection signature are the most expensive part of the process.Applying a use signature to a collection signature is the equivalent to taking anarbitrary set of locks with one compound operation. Effectively, this allows us tovary lock granularity without changing the program logic8. We will demonstrate acomparison with individual fine grained locks in Section 7.2.4.Other Signature AttributesIn addition to width, we consider two other attributes of signatures:Padding During heavy concurrent use of a collection, the atomic operations in-volved in signature update can generate much cache invalidation traffic. Separatingthe signature into smaller units, each on their own cache line, reduces a lot of thefalse sharing from updating bits near each other in memory9.8We do not currently support changing the width of a collection signature during execution;however, the components necessary are present.9We have considered ‘striping’ high use collection signatures with low use collection signaturesto save memory; but this is dependent on harvesting more information to tell us which collectionsare high use (Ch 8).64Sparsity Often times we can determine the maximum number of bits a use signa-ture involves at compile time, as in the swap operation in Canneal (§7.2.3) where,at most, two bits are used. In this case, we can simply store and apply by the indicesof these bits.5.4 Deriving Signatures from QueriesIn this section we go deeper into how we derive signatures mappings suitable forpopulating from different queries. There are two classes of methods correspondingto the two aspects of collections: content and structure.Content MethodThe content method is the default approach when others are unavailable. To employthis method, one first selects some numerical attribute of the content, and thensimply partition by the residue classes of that value mod n. One could use anobject id, or even memory location10.Under this mapping, a use signature is derived by taking the union of the sig-natures of the items specified in the query, or taking advantage of sparse signatureswhere applicable. It is possible to optimize this by pre-computing signatures forvarious substructure types while the collection is being populated. This is analo-gous to adding an index to a database table. For example, if the program queriesfor the neighborhood of a graph vertex, as in the Page Rank (§7.2.1) or DelaunayMesh example (§7.2.2), we can compute the signature of each vertices neighbor-hood as edges are added. Signatures are loosely related to Bloom filters [18], andthese precomputed signatures have the same issue of non-decomposability. That isto say, if an edge is added, the current neighborhood signature can be augmentedby adding, at most, one bit; but if an edge is removed, a bit cannot be ‘subtracted’and the entire signature will need to be recomputed for maximum accuracy11.This scheme is well-suited to unordered sets or hashes, as we simply store the10Care must be taken when using memory locations as they are often allocated at regular intervals,and so some partitions may be considerably more represented than others. Discarding a few lowerorder bits can correct for this.11If the signature is not recomputed, it will simply overestimate the neighborhood, which as we’venoted has no effect on correctness and has only a minor one on performance.65items in a number of buckets equal to the width of the signature (or some factorthereof). This works well for selecting single items, which is desirable as these datastructures are often used as key/value stores. Subcollections will have a varietyof structures; but when two or more are being accessed concurrently, we need thefastest way to compare them. Given that we have the items from the subcollectionquery, we can compute the use signature of the type collection without referenceto the type collection itself.Structural MethodsOne of the chief contributions of this work is the consideration of collection struc-ture in parallelism. For each structure there are a number of substructural queriesthat are commonly used, such as subtrees, vertex neighborhoods, or array sub-ranges. If we can derive a signature without having to iterate through each elementin the substructure, we can drastically improve scheduling performance. Whilespace constraints prevent us from giving a complete list of common data structuresand substructures, we present a variety to demonstrate the concept and give moreinsight into our experimental applications (Ch 7).Arrays The signature mapping is a generalization of that in §5.3, where each par-tition is a range of size sn , and where s is the size of the array. If n 6 | s, the finalpartition also contains the remainder. If n > s, each element is given its own parti-tion, and any remaining partitions are empty. Range based signatures are computedby determining the partition of the first and last element through simple division,and populating 1s between them, inclusive.Resizable Arrays A resizable array, corresponding to the C++ vector, is treated ina similar manner to fix sized arrays, as long as the size stays constant. If the arraygrows past the established size, the new elements will be considered to be part ofthe final partition. Once the array is not currently under contract, it will updatethe size used for future signatures12. Ranges are computed identically to fixed sizearrays based on the established size.Lists For linked lists, we maintain a number of sublists not greater than the signa-12For more details on how the data structure can determine that it is currently unused see §5.5.66Figure 5.2: Partition numbering for linked listsFigure 5.3: Partition numbering (n = 16) of quadtree (4 children)ture width . Each sublist corresponds to a partition in the signature mapping, seeFigure 5.2. We keep these sublists balanced as follows: when a partition sublistis accessed, we check without locking (as approximate answers are fine) to see ifthe current partition is significantly larger or smaller than its neighbours. If it is,and the current collection signature allows us access to the sublists, we distributethe nodes. If we are unable to distribute the nodes, correctness of the procedure ismaintained and the rebalancing is only postponed.Trees We first consider trees with a bounded number of children, such as binarytrees or quadtrees(§7.2.4). Let the maximum number of children be c, and, for thesake of simplicity, assume that c | n. Cases where c6 | n are handled similarly to thearray partitioning above.67The root is assigned to the first partition. For the child i of the root, withenumeration beginning at 0, we assign it to partition i∗ c. For the children of childi, we assign them to the range of partitions between i ∗ c and (i+ 1) ∗ c− 1 at aninterval of nc2 . This pattern, illustrated in Figure 5.3, continues until all childrenof a node are in single partition, similar to the buckets previously discussed. Thesignature for a given position in the tree can be determined by a simple tree walk,which almost all queries on the tree perform anyway.The partitioning ensures that all descendants of a node are in a predictablerange; and so subtree signatures can be derived by finding the node, which the vastmajority of queries need to do anyway, and filling that range with 1s.Trees without a bounded number of children are handled similarly, but whenconsidering the enumeration of children we use i mod t in place of i for some fixedt. Without an annotation, we use t = 16. The consequence of this is that child 0 andchild 16 will use the same range. Once again, this can lead to securing access tomore than is requested, but will have no effect on safety and, except in pathologicalcases, little effect on performance.5.5 Collection ReservationWe discussed the need to secure exclusive access to an entire collection (§4.3). Onecould apply a use signature of all 1s, but this would involve several more atomicoperations than the average. It also would be prone to a considerable number offailures, if the collection was in any kind of demand. As this need is common, weadd a mechanism for reserving a collection for exclusive use.We add another set of bits, by default 32, to the signature as a reservation block.This functions primarily as a counter, similar to a reader/writer lock. Before theuse signature is applied, the query invoker increments the block, and decrementsit when complete. Reservation is requested by atomically setting the highest orderbit to 1, and no new contracts will be allowed until it is removed. The reserver issuspended, and a reference is placed in the collection. When the reservation block,excepting high order bit, becomes zero, the waiting coroutine will be resumed withexclusive access to the collection.This mechanism has uses beyond deadlock prevention. For content queries, the68entire collection has to be iterated over. We first reserve the entire collection, andthen test the members, write the signature for the values described to the collectionsignature, and, finally, remove the reservation. Once again, we employ a form oftwo-phase locking. In other cases, where reservation is required, we can exploitsome of the properties of structure based signature mapping. If we are iteratingacross the collection, say for a list, we can reserve the collection, write 1s intoevery position of the collection signature without atomic operations, and, as wecomplete processing of each partition, write a 0 to allow other contracts before ouroperation is complete. If we can execute the original contract in parallel, the sameapplies, except each worker thread is assigned a range and frees the finished parts,as demonstrated above.5.6 Signature HoistingAs we mentioned above (§5.3), one of the virtues of signatures is their composabil-ity and we exploit this wherever possible. The primary cost of our signature basedsystem is the application of a use signature to a collection signature, which employsone or more atomic operations. If we can reduce the number of such applications,we can reduce both the total overhead and potential contention. For example, con-sider 10 tasks that access the same collection, each with their own use signature.At the time of the tasks’ dispatch, we could hypothetically combine the use signa-tures into one single aggregate use signature. We can then add this signature intothe target collection’s signature and, when successful, execute all 10 tasks seriallywithout further need for any synchronization. The aggregation is no more work,in terms of types and number of instructions, than adding each of them into thecollection signature. However, since the aggregation needs no protection we havereduced the number of atomic operations by 90%, which is definitely significant,as not only have we avoided a number of potential cache misses that come hand-in-hand with atomic operations, but under heavy contention the signature systemcan tend to flood the bus with atomic traffic.This aggregation is not an absolute win, the aggregate signature is often ‘big-ger’ (in terms of the number of 1s) and so has a higher chance of collision withother potential use signatures from other tasks, where one of the component use69signatures may have been able to ‘fit’. Furthermore, every time we batch tasks, wedo reduce the scheduling overhead, but we also reduce the granularity of schedulingand so may have trouble balancing the load between the different cores13. How-ever, in our experience, we find that the balance is overwhelming in favor of doingat least some batching/aggregation.Given the potential rewards we want to find every opportunity for this type ofoptimization. However, the problem is often that the logic that produces the usesignature, generally a query, is ‘inside’ a task and not implicit in the task creationitself. Our solution for this is a compile time technique we call signature hoist-ing. When emitting the code for a particular operation we move, or ‘hoist’, theuse signature population code from inside the body of the task and into the pointwhere batches are constructed and tasks are dispatched, adding aggregation codeas appropriate.There are limitations to when we can employ signature hoisting as we needto be able to populate the signatures before any of the bodies in the combinedoperator are executed. We can, however, hoist signatures for an operation thatcontains a series of code bodies, such as those generated from a delegate block,and the following criteria are met:• the ‘outer’ operation can be executed in parallel• the inner bodies contain a collection contract (generally a query)• the parameters for the collection contract are fixed at the time of the initiationof the inner body (generally based on constants or parameters to the body)These conditions occur twice in our experiments. Firstly, we specifically en-gineered our canneal implementation (§7.2.3) to trigger this optimization and wewill discuss its effectiveness in detail in our discussion of that evaluation. Sec-ondly, it occurred naturally in our video game example (§7.2.4) when the agentsare updated. This updating is implemented by SELECTing from the Set of all agentsand then sending them an Update message. Recall that sending a message createsan implicit contract for the duration of the receiver body and the ‘parameters’ for13This balancing act between overhead and granularity in terms of batches of tasks is a majortheme in our evaluations (§7.2).70this contract are the agents themselves. During compilation the signature of theindividual agent entity, with respect to the Set, is hoisted as the outermost SELECTis batched.71Chapter 6Compiler Implementation DetailsThe Cadmium compiler is written in C++17, and is currently approximately 80klocs. It uses the excellent ANTLR [71] parser generator and various Boost [76]libraries with few other dependencies. Cadmium began life as a C++ library be-fore we realized the necessity for static analysis, and a form of interface that C++wasn’t suited for. The compiler acts as a translator between cadmium and C++,the results of which are combined with the Cadmium standard library and runtime,and subsequently fed to clang [1] to produce the final executable.Once again, we note that the compiler is currently in the prototype stage. Thereis still a great deal of work to make it ‘production ready’: filling out the standardlibrary; adding more query atoms, in addition to all the other issues we will discusas future work (Ch 8). It does, however, compile every example in Chapter 7without needing to resort to ‘behind the scenes tinkering’.6.1 Code Analysis ReportingThe author has been often frustrated by lack of feedback given by most compilers.While they will happily inform you of errors and are improving every year withsome quite useful warnings, they are still frustratingly silent on when they make adecision on your behalf. Did you actually inline that function like I asked? It’stelling that Microsoft felt the need to add an extra __forceinline directive.The author recalls an afternoon trying to get gcc’s auto vectorizer to engage with a72Figure 6.1: Example Code Analysis Reportlot of grumbling words not suitable for a family friendly thesis.With Cadmium we took steps to fill this gap. Every time a program is compiled,the output includes a series of notes detailing the choices made, especially in thecase of forced serialization or forced reservation of entire collections. A snippet ofwhich, cleaned up for clarity, is shown in Figure 6.1.As we moved from developing Cadmium to implementing our test applications,we found this feedback, along with the automatically generated activity graphs thatwe will see in Chapter 7, to be invaluable even though we, hopefully, understoodthe system inside out.We could forsee a time when this was integrated with the Language ServerProtocol and informative tool tips or coloured ‘squiggles’ would appear in the IDEto allow the programmer to see how their intention is interpreted in real time.6.2 Integrating with C++Due to its compilation process, Cadmium can easily be incorporated with existingC/C++. By defining external modules, a collection of user defined state and Cfunctions. These modules can include any C/C++ headers and the correspondingsource or library can be included in the project. We use this to load the data incanneal(§7.2.3), as the file system component has not yet been finalized. Onlyone function from a particular module can be executed at any given time, unlessthe function is marked with the threadsafe annotation. Cadmium objects areautomatically reserved when passed to an external function.73Cadmium primitive types are automatically translated into C++ types, and thecompiler can generate the header necessary to interface with a Cadmium entityor collection. Effort has been made to make the generated code humanly readable.The contents of an external are exempt from the static analysis, and the programmeris free to ‘shoot themselves in the foot’, but effort has been made to avoid invisibleside effects. Currently, there is no support for messages passing from C++, butmany of the prerequisites are present.As we will see in the next chapter, the presence of the C++ interface dramati-cally extends what we can do with just the prototype Cadmium. Our video gameexample (§7.2.4) displays graphics, responds to user input and interfaces with me-dia stored in the file system. This enabled us to get away with Cadmium’s paltrystandard library and allowed us to demonstrate that if Cadmium were fully realizedit could be used alongside existing libraries (using some, but not overly burden-some, care).74Chapter 7EvaluationIn the previous sections we have detailed our model, algorithms and tools in addi-tion to our prototype implementation of these ideas. In this section we will evaluatethe effectiveness of our model. This evaluation will attempt to incorporate the var-ious axes of the project. Obviously, as the end goal of parallelism is to increaseperformance, we will examine numerical timing data. Additionally, we will ex-plore and discuss the effectiveness of the programming experience implied by ourmodel: what went right and any shortcomings discovered that may be rectified byfuture work.A full user-study is beyond the scope of this project and, as we discoveredexploring this idea, the tools would need to be markedly more mature in order toeffectively measure the ability of programmers not familiar with the core project tograsp the principles of the model and exploit its virtues. However, to gain insight,the model was defined before the following experiments were implemented in orderto gain some kind of insight into using this model and helping to define the pathtowards a software artifact that is appropriate for initial human testing.757.1 Signatures7.1.1 IntroductionIn the course of our work, we have identified several factors that affect the perfor-mance of signature operations. The performance of signatures – especially duringadding to a shared signature (called a usesig) – is critical to the performance of ourconcurrent algorithms, especially when considering microtasks (those processingkernels that take a small enough amount of time to compute that they can be easilyoverwhelmed by the overhead of scheduling them). So reducing a few cycles, afew cache misses or the amount of bus traffic has a cascading effect on our finalresult.With this in mind we divided several potential optimizations for consideration.These fall into three classes:Padding distributing the signature data over more cache lines than necessary sothat concurrent signature operations are less likely to be modifying the samecache lineSparsity storing only the values that are present (a bit that is 1, also referred toas set) in the signature in order to avoid taking the time to deal with 0values that do not affect the outcome of the operation – especially useful forsignatures that are known to have a small number of set bitsAlgorithmic modifications in the method used to test or update the contents ofthe signature – generally a trade-off such as checking a value non-atomicallybefore performing a more expensive atomic operation7.1.2 Operation OverviewWe focus on four operations that comprise the bulk of our use of signatures:Add combining two or more signatures into a composite signature that has bits setfor each bits set in the source signatures. This is a non-atomic operation –generally done before accessing a usesig in order to, for example, combine76the signatures of a batch of data for one bulk access to the shared collection.We preserve the property that adding order is monotonically increasing – thatis to say that if bit b2 has a higher position than some bit b1 then b2 must notbe modified before b1 – though modifying them simultaneously is permitted.Intersection determining if two signatures have any bits set in common – essentialto performing our set intersection heuristics. The intersection is of primaryuse in the protected operation below.Protected Try Add the main focus of our optimizations – protected try and addis the most important interaction with any usesig. Acting as a concurrentcombination of the two operations above: given two signatures, the operationattempts to add the second to the first if they don’t intersect and returns true ifthe addition was successful and false otherwise. This operation must be whatwe will call bitwise-atomic, where the operation need not be atomic withrespect to the entire signature, but only to the set of bits to be modified. Forexample, one thread can be setting bit 23, while another thread is unsettingbit 172 simultaneously, but for that first thread bit 23 must be 0 immediatelyprior to the setting. Recall that, for signatures, a false positive is permittedthough obviously to be avoided when possible, but as long as an optimizationincreases false positives also decreases total runtime – then the optimizationis acceptable.Protected Remove the complementary operation to Add/Protected Try Add, Re-move takes two signatures and unsets the bits in the first that are set in thesecond. The protected version has the same bitwise-atomic guarantee as Pro-tected Try Add. Note that this operation assumes that the every bit set in thesecond signature is present in the first – i.e. the second has been previouslyadded, results otherwise are undefined. We we will use this assumption forseveral operations.7.1.3 PaddingAs noted above, padding a signature involves adding extra ‘dead space’ into thesignature storage in order to increase the number of cache lines it occupies. We77refer to the portion of the cache line that is used to store bits as a signature’s celland the number of bits in a signature’s cell, rather obviously, as its cell size. Thegenerally employed pattern is that for two signatures of the same size (number oftotal bits) that are used for an operation – the one that is shared (i.e. a usesig)will be padded, while the one that is computed in an elided environment will bestored non-padded. We have no cases where we have operations on two sharedsignatures, so we don’t consider that case. As a consequence of this pattern, alloperations need to support combinations of padded and non-padded versions.As a matter of nomenclature, we still discuss the cell size of a non-paddedsignature as being the largest ’chunk’ of data that is considered at one time by thealgorithm acting on it. This will be seen below when considering sparse signatures.7.1.4 SparsityFor the sparse case we only store the cells that have at least one bit set – with amaximum cell size of the largest type that can be used by CPU supported atomicoperations (currently 64-bits). Our implementation uses a fixed array that couldhold the number of cells whose sizes adds up to the total maximum size of thesignature. For a sparse signature of size 512 with a cell size of 64 will have 8 slotsin the array. Similarly, an array of indices is stored. For example, if in the 512/64-bit signature from above, bit 129 is set, then the first cell of the storage will havethe second bit set and the first index value will be 2, corresponding to the third 64-bit portion of the full signature. Future work includes testing to see if interleavingthe storage and the indices would be more efficient.Given how they are used, sparse signatures are designed only for the non-shared side of a Protected Try Add operation and, consequently, do not supportbeing the target of any protected operation, only the source.7.1.5 Algorithmic ModificationsWe have derived the following optimizations for the internal workings of the signa-ture operations – identified by short names for ease of discussion (and abbreviationsfor the following graphs).For the purposes of discussion, we will refer to a component as the data type78accessed in any step of the algorithm, which is generally the cell size, if the cellfits into a native machine type (no more than 64-bits), or 64-bits otherwise.Hasty (ha) By default, while doing an intersection test, the various operationslogically and the corresponding components from each signature and if theresult is non-zero, stops and returns false. In the Hasty variant, the inter-section operation forgoes the check after each pair, instead computing thecumulative and of the and of each corresponding pair and then checkingfor non-zero. The idea is that omission of the comparisons will save morethan time than the cost of the potential extra ands.WeakSwapper (ws) C++ exposes two different Compare and Swap operations onits atomic datatypes: strong and weak. The emitted assembly instructionsdiffer by platform/compiler, but in essence the strong version only returnsa negative when the comparison part of the operation is actually negative;while the weak version allows spurious failures, in that the operation mayreturn a negative value even if the comparison would allow for a successfulswap. By default, signatures use the strong version and this variant uses theweak version, which, incidentally, is much closer in function to the CASoperation supported on our target family of Intel CPUs.Brazen (br) Based on the old phrase ’Better to seek forgiveness than ask permis-sion’, the Brazen variant uses atomic or to add each component to the targetsignature directly instead of using the conditional Compare and Swap. Theatomic or operation always returns the directly previous state of the modi-fied value and we take advantage of this fact by checking this result to see ifthe previous value would have intersected with the value we added. If not,we proceed to the next component. In the case where the addition wouldhave failed, we then use atomic xor to remove the bits that we added (notincluding the ones already there) and return false. The rationale behind thisis that atomic or is almost always cheaper to use than the atomic Compareand Swap, as it locks the bus for less time, and the second atomic operationis only used in the case of failure.Frugal (fr) In the frugal variant the algorithm simply checks for a zero component79before reading or modifying memory, as a zero value never affects a test, addor remove operation.Cautious (ca) For the cautious variant we perform a non-atomic intersection testbefore every Protected Try Add operation. The non-atomic test is signifi-cantly cheaper to execute and this would be a massive savings during timeswhen there are lots of failures (a failure being any time when Protected TryAdd returns false). Of course this is another heuristic. It is possible that thenon-atomic nature of the test gets stale values and so the Add would havesucceeded or the intersection test yields false when the Add will fail. In theformer case we get a false positive and as detailed above this is acceptableif it increases the overall performance. In the latter case the Add is stillperformed atomically and so the add would fail anyway.Note that with the exception of brazen and weakswapper, none of the aboveare mutually exclusive, and so as we test for performance it behoves us to try thevalid combinations of these optimizations, in addition to testing them in isolation.7.1.6 Testing MethodologyWe detailed, above, the primary signature operations and have devised a test thatuses them, mimicking their actual use in practice. For a given number of cores (aswe we bind the worker threads to cores, we use cores and threads interchangeably),we generate a usesig (using a size of 1024 for these tests as this is our most commonsize - though we note that the performance characteristics of other sizes are notdramatically different). For each core we generate a list of signatures (1000 percores in the results below) and each performs the following:• do some amount of ’busy work’ which exercises the cache• attempt a Protected Try Add on the usesig with the next signature in the list• if failure, we advance the list and repeat• if success, we do more work, Protected Remove the signature from the us-esig, advance the list and repeat80We repeat this process a number of times (1000 in the results below) recordthe number of failures, take the average and record the failure rate. By varying theamount of work, we can induce different levels of contention on the usesig.In order to separate the time taken for the ‘work’ and the signature operationoverhead, which is the quantity we are really interested in, we have a version of theabove test that uses fake signatures that are guaranteed to succeed or fail in prede-termined manner and can determine this in the smallest amount of time possible(one non-atomic boolean comparison). Once we have the failure rate from the realsignature tests, we setup another series (1000 repetitions of 1000 signatures percore) of these fake signatures with their success/failure set to the same ratio. In thisway, we can get a decent approximation of the time in the test that is not overhead.Finally, we have built an unoptimized ’reference’ signature type by wrappinga C++ std::bitset in a spin lock. This gives us a baseline for performance to com-pare against – allowing us to detect the case where every implementation is highlysuboptimal and picking the best one would just be picking the best of a bad batch.Incidentally, it was also highly useful in debugging the optimized versions and giv-ing us a chance to check their correctness – as bugs in this kind of data structureare sometimes rather hard to detect.7.1.7 ResultsUsing the methodology above we test each algorithmic variant, using differentpadding levels (where 512-bits is no padding) and core counts. We use two typesof signature generators: random and hotspot. For random we set a certain percent-age of bits. We use 15% in this case, as this gives a lot of potential collisions. Hotspots add a given number of bits (25 in this case) clustered in area (using a normaldistribution centered around a random mean in the signature). This gives fewercollisions, but still has a decent number of non-zero bits so there is still work to do.This mimics a large number of cases we see in applying signatures in practice.We present the timing results adjusted by the ‘fake signature’ time, to isolateonly the amount of time used for signature computation. For easier comparisonthese values are then normalized to the reference signature time, that is the timetaken for the spin-lock/std::bitset version. Though for the record a single Protected81Try Add takes, on average, between 0.3 and 2 microseconds on our test machines.Note that while we describe these as random, we manipulate the random num-ber seed to generate the same set of signatures for every core count – regardless ofalgorithmic variation or padding.From these comprehensive results, we can draw the following conclusions:• The time taken is highly sensitive to the level of contention and the numberof cores involved• From the fact that the relative ‘curve’ for each variant is similar, we can inferthat the effects of padding is independent of the algorithmic effects• As contention rises the benefit of atomic operation optimization over thesimple single lock grows• No padding configuration is superior, but overall the best configuration isthose in the middle of the range, with neither 0 padding nor the maximumpadding. We omitted the data from testing with smaller cell sizes, but theyare almost always worse than the 64-bit case.8283• Of all the optimizations, frugal has the most dramatic positive effect• When combined with the above two variants, the cautious variant can becombined for sometimes slightly better results. This is not particularly useful• The second greatest gains are achieved with a different addition method (ei-ther weakswapper or brazen) and the combination of the two is either theclear winner or non-worse than others, depending on the conditions. How-ever, it should be recalled that weak swapping allows spurious failures, sowe need to look at the data and the relative failure rates across all tests andother optimizations.Hotspot High Hotspot Low Random High Random Lowweakswapper 24% 8% 72% 42%brazen 15% 6% 69% 42%Figure 7.1: Signature error rates – weakswapper versus brazenFrom Figure 7.1 we can see that the slight increase in comparison perfor-84mance allows for a non-trivially greater number of false positives and so thebrazen may turn out to be the best strategy to use overall.• We notice that the deviation in time, denoted by the error bars in the graphs,is very consistent for every test not involving 12 cores. As 12 is the totalnumber of cores in the system; this may be due to NUMA effects or interfer-ence from the small number of operating system processing tasks that werenecessary to keep active while the tests were running7.1.8 Sparse Signature ResultsIn the above results, all of the signatures were ’full’ to not completely overwhelmwith details. We now turn our attention to the sparse signature organization.For this test we use the same testing methodology as above except that in-stead of hotspot or random, we set 3 random bits in sparse signatures (the usesigis non-sparse as it needs protected operations). This mimics a large number of ourpractical cases, such as when we went to swap a small number of items in a collec-tion. This happens often and makes specific optimizations for low set bit contentsworthwhile.For the comparison, we average overall optimizations and padding values fromthe previous tests. We compare against sparse signatures with a cell size of 64 (themaximum) and 8 (the minimum).We can conclude two things: sparse signature optimization is effective, butusing smaller cells is not a useful savings.7.1.9 SIMD ComparsionWe conclude with a mystery. The signatures are optimized with SIMD operationswhere applicable, using SSE2, as this is the highest level of SIMD support guaran-teed to be on our testing machines. In order to test the efficacy of this optimizationwe ran 1000 trials of the Add and Test Intersection operations with random signa-tures with SIMD operations enabled and disabled (all signatures are non-padded,as these operations will be used in elided circumstances).The results are as follows:85We had expected the SIMD version to beat the non-SIMD version soundly.However, this was not the case. This is not only surprising, but does not matchthe preliminary results from other machines/compilers. Our future work includesinvestigating this.7.1.10 ConclusionWith this attempt at exhaustive testing, we are confident that we have the keys tominimize the overhead of signature operations as a part of our complete concur-rency algorithms. There are, of course, a number of avenues of inquiry remain-ing, but we suspect we are far along the diminishing returns curve for this topicwithout fundamentally rethinking the idea of signatures and their correspondingoperations.7.2 Comparison with other approachesOur main focus was to attempt to tame the trials of parallelizing complex applica-tions. To this end, we constructed the complex video game example featured below(§7.2.4). However, as that application resembled the ones that inspired the model8687initially, there is the danger of ‘over-fitting’. That is to say, we could have uncon-sciously, despite best intentions, designed a system that was ideal for producingthat exact application.We have hand picked 3 examples to compare against, both in terms of raw per-formance and in programming style. We have selected these examples, all widelyrespected by the community, because they represent a variety in terms of program-ming paradigm, collection use and critical section protection. While none of theseis a ‘complex application’ their smaller scope is ideal for isolating the performancedifference in adding cores to their execution. Furthermore, as Cadmium uses anovel programming model, porting is non-trivial and the smaller sizes allowed usto do this kind of comparison within the scope of a project of this size.7.2.1 PageRank(Green-Marl)DescriptionThe famous PageRank algorithm [69] was a substantial part of the genesis ofGoogle’s path to being the world leader in web searching. Essentially, it is basedon the idea that a page’s importance can be heuristically evaluated by the impor-tance of the pages that point to it. Essentially ‘importance’ is treated as a transitiveproperty, with each site that points to a page contributing a fraction of their ownimportance to the pages they link to.The algorithm obviously lends itself to being modeled as a directed graph (di-graph). The algorithm is processed in a number of iterations allowing the ‘value’of each page to propagate to its immediate neighbors and then to its neighbor’sneighbors in the next iteration and so on. This proceeds for either a set number ofiterations or until a certain stopping criteria has been reached1.While generally computed with adjacency matrices, it can also be computedusing a more dynamic digraph collection, with the value attached to each vertexbeing it’s ‘value’ and each directed edge represents a link from source to target.This avoids the cases where linear algebra based approach has no solution.1For the sake of easier comparison, we have eliminated the early stopping condition and simplycompute a certain number of iterations88ChallengesIn terms of computational dependencies, the value of each vertex at iteration idepends on the values of its neighbors at iteration i− 1. So the value of a vertexmust be retained until all of its neighbors have been updated. This leads, naturally,to having two copies of the graph: one for a completed iteration and one for thenext in a well established pattern, sometimes called double-buffering or a Stencilpattern variation [59]. Only in the case where the existing graph strains memoryresources would it be worth performing a more fine-grained strategy where, forexample, each core/thread works on an expanding subgraph and only keeps thevery minimum of multiple vertex values at once2.Green-MarlGreen-Marl [42] is DSL3 designed by researchers at the Oracle corporation specif-ically for implementing graph algorithms. Aimed at reducing the code overheadand chance for error in such applications, the Green-Marl DSL is, like the Cad-mium prototype, compiled to C/C++ code, which uses OpenMP [24] ‘under thehood’. We selected this as a comparison as it compares the ease of using a tooldesigned for the specific, versus the general purpose tool we are developing.ImplementationThis is an ideal place for us to demonstrate and test the UPDATE view facility ofCadmium queries (§3.3.3), which implements the double buffer pattern intrinsi-cally.The performance numbers below are gratifying. The most important number isthe length of time it took us to re-implement this in Cadmium: 28 minutes4.2While we did not perform an experiment to justify this statement, we think that the reader willagree that, in general, a simple partitioning schedule with no locks is superior to a complex orderingof vertices possibly with locks for the vertices on the boundaries between the vertices assigned todifferent cores.3Domain Specific Language4Though of course it then exposed a number of bugs in the compiler which took weeks to fix,which is part of our reason why we felt that a user study would give false data.891 Procedure pagerank(G: Graph, e,d: Double, max: Int; pg_rank:Node_Prop<Double>)2 {3 Double diff;4 Int cnt = 0;5 Double N = G.NumNodes();6 G.pg_rank = 1 / N;7 Do {8 diff = 0.0;9 Foreach (t: G.Nodes) {10 Double val = (1-d) / N + d*11 Sum(w: t.InNbrs) { w.pg_rank / w.OutDegree() };1213 diff += | val - t.pg_rank |;14 t.pg_rank <= val @ t;15 }16 cnt++;17 } While ((diff > e) && (cnt < max));18 }Listing 7.1: PageRank expressed in Green-Marl1 program pagerank {2 ARGV: {3 inputFile@string4 iterations@int5 damping@float6 }7 }89 manager networkGraph@CDDigraph[ float ]( labels: int )1011 manager Ranker {12 Initialize → {13 data: readCDCN(pagerank.inputFile)1415 |INSERT FROM data INTO networkGraph|1617 // setup initial values18 initial: 1.0 / toFloat(#networkGraph)19 |SELECT ALL v FROM networkGraph| → {9020 v.value = initial21 }22 }2324 Execute → {25 // go through the requested iterations26 |1 . . . pagerank.iterations AS i| → {27 numVertices: #networkGraph28 // update the graph29 |UPDATE networkGraph| → G3031 |ALL v IN G ORDERBY unordered| → {32 nTotal: 0.033 v.next.value = 0.0;34 |ALL n IN v.current.inNeighbors| → {35 nTotal += n.value/toFloat( #n.outNeighbors )36 }3738 v.next.value = ( 1.0 - pagerank.damping ) /39 toFloat( numVertices ) + pagerank.damping *40 nTotal41 }42 release G43 }44 }4546 CleanUp → {47 // write out the results48 |SELECT FROM networkGraph| → nG49 resultData: nG.toCDCN()50 writeCDCN( "data/results.cdcn" resultData )51 }52 }Listing 7.2: PageRank expressed in CadmiumWhile there is no established (or likely possible) metric for code simplicity,we invite the reader to compare the Green-Marl implementation (Figure 7.1) to itsCadmium port (Figure 7.2), where the Execute message receiver corresponds tothe function body in the Green-Marl code.91Figure 7.2: PageRank on 10,000 vertices Cadmium versus Green-MarlIn the Cadmium code the Update View, G, is created with the query |UPDATEnetworkGraph| . At point, the ‘shadow’ copy of the G is instantiated and thereservation mechanism is employed to ensure that access to the graph is exclusive.Note the use of v.current and v.current, as G is an Update View of the graph,then automatically v is also a Vertex View of the vertex that’s fed into the querydelegate block. The current and next ‘members’ are implicit in the Update Viewcorresponding to the the existing graph and its shadow copy, respectively. Thecurrent version is read-only for the duration of the contract.At the point of release G, the shadow copy and the previous copy are unified,which unavoidably requires a huge amount of copying.PerformanceTo evaluate the relative performance of the two systems, we use two graphs. Thefirst is a randomly generated graph on 10,000 vertices. The second is a partial graphof the Internet released by Google, as part of programming contest [4], to reflectthe origins of the PageRank algorithm. This latter graph has 875,713 vertices and5,105,039 edges.The 10,000 vertex graph results are presented in Figure 7.2. We can see that92Figure 7.3: PageRank on 10,000 vertices Cadmium versus Green-Marl -Batch Factor Variationthat the performance of the two systems is roughly equal. Cadmium, as we can see,performs better at lower core counts, able to exploit more parallelism and as corecount increase, both converge on approximately the same performance. The lessthan smooth curve for Cadmium is due to the variability induced by the particularsetting of the parameter that determines how work is divided up for execution. Thisgives us a nice segue to talking about batch factor .The term batches refers to the partitioning of a set of work where the cardinalityis known at the commencement of a parallel operation. Each batch is a ‘unit ofwork’ for the worker threads. Batch factor, congruently, is the number of partitionsthat are used for the operation. Essentially, changing the batch factor is the classictradeoff of scheduling overhead versus scheduling flexibility. As with signaturedetails, batch factor can be adjusted by an annotation to the query in Cadmiumcode (§4.5). A comparison between different batch factors is shown in Figure 7.3.This experiment is rather sensitive to changes in batch factor due to the fact thatthe partitioning is on the vertices of the graph, but as can be seen from inspectingthe code, the amount of work for each vertex is dependent on the number of itsneighbors; so the total amount of work for a batch can vary wildly. In this set93Figure 7.4: PageRank (Google web graph) Activity Graph for batch factor 10Figure 7.5: PageRank (Google web graph) Activity Graph for batch factor250094Figure 7.6: PageRank (Google web graph) Cadmium versus Green-Marlof test data, the graph is generated randomly and the edges are roughly equallydistributed, which is not the case for the Google web graph example, which givesus another convenient segue.Figure 7.6 shows the comparison with Green-Marl on the Google web graph,showing the variation between various batch factors. It is notable that the increasedamount of work erases the performance difference between the two systems as thedifference are less pronounced at that scale. However, when the batch factor isdramatically increased the performance takes a significant jump.Observe the difference between the activity graphs in Figure 7.4 and Figure 7.5showing the differences between batch factor of 10, which is 1 batch per core ontest machine, and a batch factor of 2500. Recall that an activity graph shows thecore utilization over time, with each vertical column representing one core and eachrounded rectangle representing some task being executed at the time span indicatedby the y-axis. Both show the tail end of the first iteration and the initiation of thesecond. Observe that, in the 10 batch factor instance, the cores are completelyutilized until batches start to complete, with one longer batch pushing back theiteration completion time noticeably. Note that the single isolated chunk in thebetween the iterations is initiating message body finalizing and beginning the new95contract. In contrast, the end of the iteration with batch factor 2,500 is barelyragged, but there are continuous interruptions of work to switch batches. Though,we will note that the gaps may appear larger due to the way SVG renders. Closeranalysis shows that while the time to schedule a new batch differs based on variousconditions it is approximately 1 µs on average. While this gap is miniscule in termsof the total runtime, a high batch factor can make a difference in the aggregate.InsightsThis experiment demonstrates the expressive power of Cadmium, comparing wellwith a specific purpose DSL and the performance characteristics also compare fa-vorably. The variation due to the batch factor underscore one of our primary fo-cuses for future work: automatically determining the best parameters for the var-ious algorithms (this will be a theme for the remainder of the section) such as thework we contributed to in the literature [73].7.2.2 Delaunay Mesh Refinement (Galois)DescriptionA triangulated mesh is, as the name suggests, a set of vertices and edges embeddedin a 2d or 3d space where every closed face is triangle. These objects are used oftenin several application catagories from computer graphics to physics simulations.These meshes can cause problems if they contain so-called ‘skinny triangles’, thosewith one interior angle significantly smaller than the others. A Delaunay Mesh is amesh such that for each triangle, no point of the mesh falls within the ‘circumcircle’of any triangle in the mesh. If one were to take any triangle in the mesh andexamine the circle that passes through each of its three points, that circle would notstrictly contain any point in the mesh.One popular way of deriving a Delaunay Mesh from an arbitrary mesh is Ru-pert’s Algorithm [74] The full details of which are beyond the scope of this doc-ument, but in essence it examines each triangle in the mesh looking for anglesless than a given amount. When a ‘bad triangle’ is discovered, it is removed fromthe graph and replaced with a new point, which is then joined with all the ver-96tices that border this new hole (or cavity) created by the removal to maintain thetriangulation property. However, the newly created triangles may also violate theDelaunay property and thus this ‘cavity’ may need to be expanded further beforetriangulation. This process of cavity growth continues until it replaces all ‘bad’(non-Delaunay) triangles transitively connected to the first.ChallengesDuring the execution of the algorithm, the determination, processing, removal andreplacement of the triangles in a cavity is a ‘local operation’. That is to say thatthe rest of the graph is untouched and unreferenced outside the cavity. Given that agraph that is a subject for such processing will generally contain many ‘bad’ trian-gles there is no reason that their resulting cavities cannot be processed in parallel.In other words, for any two disjoint cavities, there is no dependency implied fortheir processing. However, the fact that the bounds of any cavity cannot be knownbefore examination means that this algorithm is not ‘embarrassingly parallel’. Onecannot simply examine all triangles simultaneously and fix the bad ones, as twotriangles may end up being in the same cavity. So while there is no implicit depen-dency between cavities, there are dependencies between individual triangles thatare not known until runtime.GaloisGalois [50] is a project out of the University of Texas designed for ‘irregular par-allelization’ (operations on structures that are non-embarrassingly parallel) and iscurrently maintained by Intelligent Systems. It is a C++ library with a large enoughscope to nearly be a DSL in its own right. It uses a symbolic rollback system sim-ilar to those in Software Transactional Memory [78] and contains a number ofcollection types, graphs, lists, etc, all highly optimized for their parallel processingmodel.We chose this system as a point of comparison as its goals and scope are oneof the closest to Cadmium. As we will discuss below, what is presented to theprogrammer is very different, being augmentation to C++ instead of a higher-levellanguage approach. Additionally, they do not tackle the problem of heterogenous97algorithmic access5. However, they do address issues of determinism and dis-tributed processing, two aspects that we decided from the beginning were out ofscope for the initial version of Cadmium.ImplementationUnlike with PageRank(Section 7.2.1), we did not fully re-implement the DelaunayMesh Refinement algorithm for this evaluation. For an effective comparison, wetook advantage of the Cadmium C++ interface and created a hybrid using Cadmiumstructures in place of the Galois ones.In the implementation, a graph structure is used, but the vertices are the trian-gles of the input graph and edges describe edges shared between triangles. Theprocessing of the graph is essentially divided into two sequential steps:1. Iterate through the triangles in the graph, find each bad triangle and deter-mine (but not modify) its resulting cavity2. Iterate through the cavities, remove them and replace with the appropriatere-triangulationWe found that these operations mapped nicely onto different parts of the Cad-mium programming model. For the first step we used an accumulator list (§3.1.5)of sub-graphs. Rendered in pseudo-code the operation would be:1 CDGraph G[triangle] // a graph that contains triangles2 CDAccumulatorList[*G] cavities// a list of subgraphs of G3 . . .4 |SELECT ALL v IN G ORDERBY unordered| → {5 if v is bad {6 create a new subgraph c of G7 for each new triangle t of the cavity determined:8 |INSERT v INTO c|9 |INSERT c INTO cavities|10 }11 }Listing 7.3: Mesh Refinement Fragment 1 in Cadmium5Again, the condition where two or more algorithms may be applied to the same structure, perhapsby programmers who are not even aware of one another.98Figure 7.7: Mesh Refinement on 10,000 vertices Cadmium versus GaloisIn actual implementation, the contents of the delegate block that processes theresults are replaced with a single function call with a C++ body that contains themodified Galois code.The second step, in Cadmium pseudo-code, follows suit:1 |SELECT ALL cavity IN cavities ORDERBY unordered| → {2 |cavity| → c // secure the stored view3 remove the triangles in c with DELETE4 INSERT new vertices and edges INTO c5 release c6 }Listing 7.4: Mesh Refinement Fragment 2 in CadmiumAgain, the contract between the securing of the stored view (|cavity| throughrelease c) in our implementation is replaced by a call into C++ interface withmodified Galois code. Note that the cavity contains not only the vertices to beremoved by the ‘border’ defined by their neighbours In this case we can insert thenew components into the graph using c as a proxy. In this way we avoid the needto serialize the graph augmentation due to unbounded structural changes.99PerformanceFor this comparison we used the 10,000 vertex graph provided by the Galois instal-lation6. The results, with an optimized signature size and batch factor, are shownin Figure 7.7.The reader’s first question is probably some variation on why is the cadmiumapplication so much faster, even using only one core?. In analysis we concludedthat the answer to this was: memory allocation. Galois allocates memory con-stantly for its roll-back mechanism and is generally tuned for massive parallelism,including distributed computation. As in other places we pre-populate our buffers,pools, free-lists etc. On the other hand, the Cadmium graph implementation uses athread-local free-list mechanism for allocating new nodes. Given that Cadmium isa prototype and we’re working on good solutions to a certain set of already difficultproblems, we elected to leave the problem of parallel memory management to fu-ture work and ‘punt’ by making sure that the system had a healthy amount of mem-ory resources devoted to it. If we eliminate the preallocated free-list in the Cad-mium version, the overhead and contention of memory management overwhelmsthe rest of the computation and the system achieves virtually no performance in-crease with additional cores. In this case, we had to hand tune the pre-allocationsize to meet the needs of the system. This becomes another parameter to tune auto-matically, along with aspects such as signature and batch size, as part of our futurework.As the efficiency of the data structure is mostly, though not completely, or-thogonal to the question of parallel performance, we present the rest of our resultsnormalized to the 1-core value of each application for a fairer comparison.Again we present variations on the batch factor and the signature size achievedby changing the annotations of the collection and queries.Figure 7.8 shows the effects of varying the signature size and padding. Wecan see that 512 bits with no padding clearly outpaces the other. The modifica-tions to the graph are relatively long operations compared to those in some otherapplications, such as the upcoming canneal experiment (Section 7.2.3). Thus thechance of many simulations modifying signatures simultaneously is relatively low,6They also provide a 5 milion vertex graph, but we couldn’t execute the Galois application on itwithout a segfault on four different installations.100Figure 7.8: Mesh Refinement on 10,000 Vertices – Signature Size Variationso padding is generally a net loss, adding more accesses without contention to al-leviate. However, as the cavities can grow to encompass 40 or more triangles (andthus vertices in our representational graph), a use-signature for an operation mayhave 40 or more bits active and too small of a signature for the collection will causetoo much unnecessary synchronization. 512 bits with no padding is revealed to bethe ‘sweet spot’ with the best trade-offs between signature computation time, pos-sible atomic collision and partitioning granularity. The fact that is exactly the sizeof a single cache-line on our test machine is almost certainly no coincidence.Figure 7.9 shows the variation of batch factor on the final computation. We cansee, for the most part, the process is largely unaffected by changes in batch size.Though as the batch factor increases beyond a certain threshold, the aggregateoverhead of the batch dispatch mechanism begins to have a detrimental effect onthe runtime.InsightsThis was a very instructive example for us. We consider it another positive indica-tion that we were able to express the patterns of another programming model withour existing constructs without having to violate the spirit of our programming101Figure 7.9: Mesh Refinement on 10,000 Vertices – Batch Size Variation Nor-malizedmodel. The mechanism of stored queries naturally encompassed the Galois sub-graph/cavity structure. However, realizing that our system was given a handicapof the memory preallocation, gave us a lot of respect for the engineering work thatwent into the Galois system. Currently, Cadmium applications use a frankly deca-dent amount of memory and have a correspondingly long setup time (all applica-tions are measured from the point they begin the main computation and setup, dataloading and similar times have been omitted). The costs and contentions of mem-ory management have long been a problem in parallelization that isn’t as talkedabout as juggling critical sections. We regret not having more of a solution to offeron this front, but we had to resign ourselves to only solving some of the problems.One of the most interesting aspects was the experience of attempting to codethis algorithm ‘from scratch’ before duplicating the Galois solution. Attempting102to put ourselves in the shoes of the uninitiated parallel programmer showed how‘unnatural’ the final solution was versus the algorithm as originally published. Theprimary divergence was the partitioning of the algorithm into two distinct halves,as detailed above. It would be very reasonable for a programmer to ask why amI making the cavity just to store it and fix it later? Why shouldn’t I just processit now? The answer is, of course, that this decomposition is what allows for theparallel execution algorithms to be effective. Determining the entire set of mod-ifications previous to their processing allows Galois to derive their schedule andensures that Cadmium’s static analysis does not automatically serialize the entireoperation to avoid potential deadlock.We were gratified to see that this decomposition was cleanly achievable withexisting constructs, but still wondered if there was more that could be offered to thenon-expert in this case. We did stop to consider the path that a non-expert wouldtake to achieve this kind of result, in our system or others. This underscored for usthe need for detailed and readable compiler feedback (§6.1). Unless we give theprogrammer the power to affect every aspect of the system with its correspondingcomplexity, as discussed below, it is incumbent on the compiler to inform the pro-grammer what decisions it has made on the programmers behalf and why. We couldeasily envision the programmer sitting down, coding a similar algorithm withoutthe partition and seeing the feedback essentially amounting toQuery |SELECT...| executed serially, due to unboundedmodifications in the statement |DELETE ... |and working to move things around until a similar construction to the partitionedsolution is achieved.Finally, we consider the complexity of producing a solution and trade-offs ofcontrol versus complexity. C++ is widely considered to be programming in ‘expertmode’ and though we’re quite a fan of this level of control, we must agree. TheGalois solution offers fine grained control over every aspect of the system, but atthe cost of requiring that these aspects must be addressed, adding another layerof complexity to the already difficult and sometimes inscrutable task of creating aprogram in C++. As an example, while compiling this document we took a quicklook back on the Galois source and came across the following lines from the class103that fills in cavities after their removal:1 typedef std::vector<GNode,2 galois::PerIterAllocTy::rebind<GNode>::other>3 NodesTy;4 typedef std::vector<EdgeTuple,5 galois::PerIterAllocTy::rebind<6 EdgeTuple>::other>7 EdgesTy;Listing 7.5: Snippet of Galois C++ CodeAgain, there is no objective metric for complexity and expressiveness, so welet the reader judge for themselves. We bring this up not to disparage the Galoisproject, something that we have quite a bit of respect for, but to acknowledge thetradeoff between complexity and control. A Cadmium application offers nowherenear that kind of control – especially over the movement of memory, but this sec-tion should show that we’re on our way to achieve our goal of allowing non-expertsto create programs that are always safe and often performant.7.2.3 canneal(PARSEC)DescriptionCanneal, as the name suggests, performs simulated annealing. Very briefly, sim-ulated annealing is a classic heuristic optimization technique. Many other tech-niques in the same class suffer from often arriving at a sub-optimal answer bygetting stuck in a ‘local optimum’. Consider the the family of ‘hill climbing’ al-gorithms that essentially sample a set of near points, choose the the direction ofthe ‘best’ (defined by some objective function) point, sample that point’s neighbor-hood, choose the best direction and repeat until a point is found that is superior toall of its neighbors. The problem is that this point may be only the best point in itsimmediate area and a much better point may exist somewhere distant in the searchspace. Metaphorically, it may find the top of the hill, but there may be a moun-tain in the distance. Simulated annealing attempts to solve this problem by takingvariable sized jumps away from the sampled points, exploring apparently ‘worse’areas of the space in order to get away from local optima. The term annealing is104borrowed from metallurgy, where a piece of metal is made stronger by heating andshaping and continuing to shape as the metal cools. Correspondingly, simulatedannealing has a concept of temperature, which is gradually reduced as the pro-cess proceeds. At high temperature levels the algorithm will make big jumps andincreasingly smaller ones as the temperature lessens.In the case of canneal, simulated annealing is applied to the layout of circuitsas wires are required to connect different areas of the chip. Different placementsmake for different costs and the combinatorial explosion of different possibilitiesmakes a deterministic solution often infeasible.While this may, at first glance, appear to be a graph problem, canneal boils itdown to operations on a list, fixed to the size of the number of elements to be con-nected. The application randomly considers pairs of elements in the list, computesthe change in total cost to swap their locations and, depending on the temperature,swaps them or leaves the structure unchanged.ChallengesThe primary challenge in canneal centers around the list collection. The appli-cation is constantly modifying the list by swapping its elements. The amount ofcomputation required to determine to swap, or not, is minimal and so there is heavycontention on this primary data structure. Any extra overhead in atomically pro-tecting the list elements adds up quickly in terms of final runtime.We specifically choose canneal for our evaluations for one simple reason: it’sone of the worst cases for the Cadmium scheduler and SvS. The almost trivial sizeof the tasks combined with constant contention on the list go against the grain ofthe general purpose system of signatures designed to elegantly handle arbitrarysubsets of larger collections. Furthermore, it is easy to express a solution to thisproblem without needing to ‘dispatch’ work, except in the coarsest sense. The im-plementation, described below, that we compare against, dispatches a number ofthreads that only need to know how many swaps they are to perform and a simplesynchronization barrier is used to coordinate the necessary pause between temper-ature steps. So any overhead problems in the general purpose, work dispatching,batch making Cadmium scheduler is going to show up starkly.105We have touted SvS and its related techniques as being generally competitivewith hand-coded locks. In this case, the authors of canneal use a very clever atomicpointer swap. In their implementation, the central list stores pointers to the ele-ments in the graph, to avoid even the use of locks. It’s cases such as this that weuse to demonstrate that competitiveness.PARSECThe PARSEC benchmark suite [17] is venerable artifact in the community, wellknown and well regarded as a collection of applications that display a variety ofworkloads addressed with a variety of parallelization techniques. The cannealapplication included in PARSEC was written by Daniel Schwartz-Narbonne andChristian Bienia.ImplementationAs with Mesh Refinement (§7.2.2), we leverage the Cadmium C++ interface inorder to incorporate as much of the original code as possible as part of our com-parison. Reduced to pseudo-code the main loop is very simple:1 |1 . . . canneal.tempSteps| → {2 |1 . . . canneal.numSwaps ORDERBY unordered| → {3 index1 = random( elementListSize )4 index2 = random( elementListSize )5 |SELECT { element( index1 ) element( index2 ) }6 FROM elements| → i1, i278 if shouldSwap( i1 i2 ) {9 temp: i110 i1 = i211 i2 = temp12 }13 release i1, i214 }15 }Listing 7.6: canneal Cadmium FragmentNote that the clause element() is the standard way of indicating that the106Figure 7.10: canneal on 100k elements Cadmium versus PARSECsubstructure requested is a single element at a given index from a list/array. Thesimilarity between the names of element clause and elements collection iscompletely coincidental7.PerformanceWe compared performance on three different network sizes (number of elements)provided with the PARSEC distribution: 100k, 200k and 400k. The essential com-parison can be seen in Figure 7.10, Figure 7.11 and Figure 7.12, respectively. Thereare two immediate observations from these graphs. Firstly, especially at lower corecounts (and thus lower contention) Cadmium is nearly identical with the hand op-timized PARSEC implementation. Secondly, as the workload increases, the differ-ences become even less.As with previous evaluations, we compare Cadmium against itself with variouspermutations of batch size and signature width. Figure 7.138 shows the differences7Technically we could have omitted the element() altogether and just queried for {index1,index2} as the index accessed single element is the default subcollection of a list/array. Similarly,the release at the end of the braced scope is unnecessary as they would be automatically released atthat point. We have elected to make our code examples as explicit as possible.8Note that in terms of both batch factor and signature width, the 200k and 400k versions are107Figure 7.11: canneal on 200k elements Cadmium versus PARSECFigure 7.12: canneal on 400k elements Cadmium versus PARSEC108Figure 7.13: canneal on 100k elements – Signature Width Variationin performance characteristics with various configurations of signature width andpadding. Interestingly, the application is relatively stable under width variation,which came as a surprise to us. What didn’t come as surprise was that the bestresults came from the heavily padded (64 bits per cache-line) signature with arelatively low width (256 bits total). This makes sense as each query generates ause-signature of 2 active bits maximum, so the chance of 10 cores colliding often,even with only 256 partitions is low; so a coarser granularity is effective. However,the contention is high, so ‘spreading out’ the signature reduces atomic fighting ona single cache-line.We see more variation when we look at the effect of batch factor (Figure 7.14).It is of particular interest that the best performance (batch factor of 20) and theworst (batch factor of 10) are both relatively low. This is a case of critical mass.Given that the time it takes to compute one potential swap is generally consistent,very little load balancing needs to be performed. However, a batch factor of 10(only one batch per core) gives no room for any load balancing. A batch factor of20 gives a little leeway for some load balancing, but contributes next to no overheadin terms of managing batches. In this case, that little amount of flexibility makesnearly identical to the 100k, so we have omitted them.109Figure 7.14: canneal on 100k elements – Batch Variationsignificant difference.This gives us an excellent opportunity to talk about the scheduler’s work steal-ing capabilities. We were really loath to add this capability, preferring instead to‘work deal’ and pre-distribute the work as much as possible. We didn’t want toadd yet another protected structure to the runtime engine. Given that we were al-ready making considerable use of the hardware supported atomic operations forboth directive management and signature updating, we were afraid of completelyflooding the bus with atomic instructions. Consequently, work stealing was thelast addition to the core runtime engine and, to our surprise, worked rather wellwithout adding any truly detectable amount to our scheduling overhead. As wedetailed above, (§5.1), when the worker finds no directive to work on, it steals in around robin fashion from the other workers’ deferred queues. We found this to beconsiderably helpful in the case of canneal. Recall that a task (which in this caseis executing a batch of potential swaps) is placed on the deferred queue, if a use-signature protected-add fails. The coroutine yields and its host context is placedon the queue. In the case of canneal with a huge number of queries and thus ahuge number of use-signature applications a number of failures is inevitable. Asthe scheduler prioritizes new work over deferred (a heuristic that we have found110Figure 7.15: canneal on 100k elements – No Work StealingFigure 7.16: canneal on 100k elements – With Work Stealing111to be effective in practice) failures tend to build up in the deferred queues. Thiswould be fine except that there weren’t quite enough failures to generally give aneven distribution across the cores.Without work stealing, we would end up with a situation such as the one pic-tured in Figure 7.15, which shows the end of a temperature step (iteration) of thecanneal process9. Recall that a single batch, if it runs from initiation to completionwill form a round rectangle. A shape with a truncated bottom (and dashed line)represents a task/batch that has been suspended and, correspondingly, a task/batchwith a truncated top represents a task that has been taken from the deferred queueafter its use signature has been successfully applied. We can see that at this stageof processing, each worker has no more new work to perform and is clearing outits deferred queue. The uneven size of these queues causes several of the workersto go idle before all the work of the iteration is complete. Figure 7.16 shows thesame experiment with work stealing enabled and it’s clearly shown that the amountof idleness is reduced to nearly zero.Over the course of this project we attempted many experiments to come up witha robust policy on the priority of new work, reviving tasks from the deferred queueand work stealing (including what order to attempt deferred revival and whetheror not to make multiple attempts to revive a task before re-deferral). We foundthat while we could hand tune a policy that would give a performance boost for aparticular application, we found it to be both extremely fragile as minor changesin the application would cause it be degenerate instead of beneficial. Secondly,we couldn’t work out a way to present these options to the programmer in an an-notation, as with signature width and batch factor. This manner of schedulingparameter remains a prime point in our future work for both programmer optionsand automatic parameter tuning, as we’ve mentioned previously.InsightsOverall, we were incredibly pleased with the results of this experiment. Comparingour general, ‘heavy duty’ technique to hand-coded lock-free code nicely validatesour assertion of competitiveness.9Here we are showing a batch factor of 320 to make the effect clearer112This is one case where we made a change to the Cadmium programming modelfor one of our tests. As mentioned above, the PARSEC implementation uses an ar-ray of pointers. After some deliberation we added a new primitive to the language:the raw (or opaque) pointer type to facilitate better interactions with the C++ in-terface. This type is ‘inert’ in the language. There are no operators on it, not evenequality. It only serves as a mechanism to receive a value from the C++ interfaceto later pass back to another part of the C++ interface. We still are not convincedto keep this type in the future, however it is not without precedent. Embedded lan-guages, such as Lua [6], and others that interact heavily with C/C++ have a similarconcept. We consider the C++ interface to be not exactly a ‘necessary evil’, butperhaps a ‘critical unpleasantness’. We set out with the belief that in order to makeparallel programming more accessible, the models would have to change, whichwould require a great deal of rewriting of software10. This rewriting would, dueto real-world constraints, require an intermediate ‘hybrid’ step, so we consider theCadmium interoperability not just a happy side effect of our compiling process, buta critical virtue of the system, one that would need to be maintained even once thebootstrapping process is complete and Cadmium is compiled directly.We also added the multiple object release clause during this process, but thatshould have been there in the first place and isn’t, at all, a departure from ourmodel.For the most part we found the expression of the process to fit naturally intoour model, with one exception. In order to enable signature hoisting (§5.6), weneeded to re-architect the program in an awkward way. With the current staticanalysis, hoisting is only enabled when all parameters to the query are known atthe invocation point of the containing block. In order to trigger this effect, weneeded to generate all the random numbers for a temperature step at the start ofthe step, stored in a list and then dispatched as a parallel query on that list. Thisinduced the compiler to hoist the signatures. This we mark as a failure, giventhat both inducing this effect was awkward and that our reporting system, whichwe sung the praises of in dealing with Mesh Refinement (Section 7.2.2), wouldn’thave given the appropriate feedback. We take the lessons learned to inform our10This was underscored by our experience in rewriting just these applications.113Figure 7.17: canneal on 100k elements – Batch Variationfuture work.Finally, with the above, we found that signature hoisting didn’t give the ex-pected performance benefits, as shown in Figure 7.17. This came as a surprise tous as in preliminary testing on different hardware11 required at least 10 items to abatch, with hoisting, to achieve the quality of results presented above. From thisrather negative experience, we take away the obvious lesson that it’s important tonot ‘over-fit’ experiments on a particular piece of hardware. More importantly, thatrobust system would need to be hardware aware to some extent, or at least capableof automatic tuning after the application is installed on a particular machine.11While we present the results on one specific system, over the course of this project we have ex-perimented on several machines including several ‘pro-grade’ laptops and older ‘server class’ hard-ware. While we discovered bugs after these experiments that could have affected the results, theywere, in general, consistent with the ones presented here1147.2.4 Mini-CE (Video Game)DescriptionAs the concluding, and hopefully climactic, part of our evaluation, we present our‘complex application’, written from the ground up in Cadmium. As we discussedin the introduction (Section 1), a complex application has no strict definition, but asa rule-of-thumb it can be described as an application that has a number of separatedifferent parts, call them units, subsystems, modules, etc, that work in symbiosis.For example, consider the modern web-browser. The browser is responsible forretrieving content from the Internet, parsing it, laying it out for current display,executing live scripted content, responding to user input, managing cached infor-mation, and so on. Conceptually, at the ‘intersection’ of these subsystems, is oneor more collections, which are read and/or modified by these systems. In the caseof a browser, the central collection is the DOM12 tree. This collection describesthe elements (or Objects) in the page (or Document) being presented. The rendererreads from it while the scripting system modifies it, etc. Each of the subsystems ina complex application, as can be clearly seen in this example, generally employsa variety of algorithms and access patterns and they have no strict dependencieson each other. This is why we maintain that handles heterogenous algorithmicaccess to collections is a critical part of facilitates the parallelization of complexapplications. Even if the algorithms themselves are all embarrassingly parallel, thepotential concurrent access by multiple algorithms can create a whole new kind ofcomplexity. It is support for these applications that we see lacking in the currenttools and techniques beyond throw a bunch of locks around the structures and hopethat each subsystem has enough work to keep the cores busy while the others arewaiting.We designed Cadmium from the ground up for this class of problems. Howeverthis does pose a bit of a challenge in testing it, especially in the prototype stage. Acomplex application is, unsurprisingly, complex and porting say, an existing webbrowser, to Cadmium would be an undertaking as large as the project itself. Aswe have stated, none of the above applications qualifies as complex in our loose12Document Object Model115Figure 7.18: Clockwork Empires Beta Screenshotdefinition. Each one of them begins with an input, applies one algorithm and givesan output. So we were compelled to write a skeletal, minimal version of an existingapplication, a ‘version in miniature’ as opposed to a port.We chose the video game domain as a target for several reasons.1. Video games are some of the most complex, performance hungry applica-tions being written today, rivalling small operating systems in the number offacets. Additionally, they are also important ones as the industry has grownto billions of dollars per year and continues to grow.2. The author has a great deal of experience in the domain, having worked onseveral titles, including the one chosen for ‘miniaturization’. I want to makemy own games was the author’s primary motivation to pursue Computer Sci-ence and eventually to this document.3. Games are more fun than web browsers13.As eluded to above, this example is based heavily on an existing game, Clock-work Empires by Gaslamp Games (Figure 7.18). The original code is not publicand even if it were, as previously mentioned, porting would be a considerable un-dertaking as the original took several person-years to create. However, the author13Proof left to the reader.116leveraged his experience to extract some of the most important patterns and imple-ment them from the ground up.As mentioned in our descriptive explanation of ‘complex system’ the problemareas generally occur at the ‘intersection’ of different subsystems – generally at thepoint of shared collection access. In this case we chose to focus on the intersectionof three systems, AI, Simulation and Rendering, which will be detailed shortly.The game itself fits into the subgenre of ‘colony builder’, which is in turnrelated to the genre of ‘city builder’, which the reader may be familiar with fromsuch best-sellers as the venerable SimCity franchise.In this case, the player takes on the task of running a settlement in a ‘newland’, with only indirect control over the citizens who will inhabit it. They issuecommands to build certain types of structures and create tasks (called jobs) for thecitizens to perform, though in the tested version there is no human input to facilitatereproducibility.These citizens do have a certain amount of autonomy, with a certain amountof personality simulated for each one of them, based on a few different attributesassigned to them by the game. To complicate matters, the local area, while richin the necessary resources, is also plagued by hostile creatures, which for thesepurposes we will refer to under the umbrella term ‘monsters’.Our simplified version has the following properties:• All agents are either citizens or monsters• The map is generated randomly, with unpassable hills and bodies of waterthat the agents must circumnavigate in order to reach their various destina-tions• The map is covered with resources of various types, distributed randomly• The citizens have a home base in the center of the map, where they beginand no monster is spawned14• Each citizen needs a job in order to perform any action• Jobs comes in two varieties: gathering and patrolling14A term of art describing the process of bringing a new object into the game world.117• Gathering jobs involve a citizen going to a specified area of the map, search-ing for a specified resource and bringing it back to the home base• Patrolling jobs involve the citizen going to a specified area and continuallymoving, looking for monsters• The job a citizen selects is based on their attributes. A citizen with high‘courage’ and high ’health’ will be biased towards patrolling jobs. A citizenwith a low ‘ambition’ will be biased away from jobs that are a long distanceaway from the home base• Monsters have a job analogue in the form of a simple state machine. If theyhave had no contact with a citizen they simply wander and if they spot acitizen they will chase them and attempt to injure them if they can physicallyreach them• A monster in pursuit has random chance at every ‘step’ of abandoning thechase and returning to wandering• A citizen, on encountering a monster, will flee to home base if they are gath-ering or attempt to attack the monster if they are patrolling• If a citizen or monster is attacked enough that their health attribute drops tozero, they are dead and are removed from the game world• When an object is removed from the world (resource gathered, agent killed,etc) a new one is randomly spawned after some time has elapsed so thesimulation maintains a fairly steady state• Similarly, when a task is completed a new one will be generated, though notnecessarily for the same location or typeThe actual game is far more complex, with buildings adding to the map, agentsneeding downtime or to follow their passions, etc. However, in this way we coverthe ‘core loop’ of the game and are able to study the patterns of interactions be-tween data structures and algorithms needed to implement these rules.118While there are many subsystems involved in a game, there is another coarserdivision often used to discuss these topics, which will be useful here: presentationversus simulation15.Consider a video game version of tic-tac-toe. The simulation would trackwhich board segments have an ‘X’ or an ‘O’, whose turn it is, be responsiblefor computing the next move of the AI player and determine when a game endcondition is achieved. The presentation (or simply rendering) would keep track ofthe actual pixel coordinates of the lines that make up the board, the font used forplayers’ marks and would be responsible for actually producing the pixels.The timeline of execution is broken up into ‘frames’, a term borrowed fromtraditional cell animation, marking the point where the presentation updates thecontents of the display. A higher ‘frame rate’ (generally measured in Frame PerSecond (FPS)) is considered to be more pleasing aesthetically and can make adifference in interactions for games which require reflexes.The simulation is generally also divided into ‘frames’, to echo the nomencla-ture from the presentation, where the state of the objects it’s responsible for is‘updated’ for current conditions (such as in the above, i.e. checking to see if eitherplayer has won).It is not unusual for the presentation and simulation to run at different cadences,such as 60 FPS for the rendering and 10 FPS for the simulation. For simplicitywe will consider them to be equal and one ‘frame’ is the time it takes for bothpresentation and simulation to perform all of their state updating and hardwareinteraction responsibilities.ChallengesThe primary challenge in expressing this system in parallel is the complete lackof embarrassingly parallel algorithms, especially when considered in context ofthe other subsystems. An agent, when being updated for the current frame maychange its own data, cause another agent’s data to change, affect the list of jobs orchange the position of the model that the renderer uses to represent it on screen.Furthermore, much of the rendering is serial (even if in most modern games it15There are many different names for each of these in the literature and general discourse. Theirdefinitions are not any more robust than ‘complex system’119Figure 7.19: mini-ce with 3d renderingis parallelized on the GPU, the preparation of data and transferring often has largeserial segments).We built two versions of the renderer: one using 3d graphics (Figure 7.19)and the OpenGL framework and one that used an older ‘frame buffer’(Figure 7.20)technique, as the latter can be executed on ‘server class’ hardware, which oftendoes not have the required GPU. The numbers presented are using the more generalpurpose rendering, as we didn’t find a notable difference between the performanceprofiles of either of them.ImplementationTo organize the various aspects of the system, we used several different collectiontypes. The simulation portion of the Agents were implemented as Cadmium enti-ties: objects with member state and the ability to send and receive messages. Theagents were stored in a manager (§3.1.2) Set.Tracking the physical locations of the agents was an ideal place to use the sub-collection with different structure pattern (§3.2.3), which the reader may recall iswhere we define a collection to be composed of members from another collection,though this ‘child’ collection has a different structure to the ‘parent’ collection.120Figure 7.20: mini-ce with 2d renderingIn this case, the parent collection is agents, the Set containing Agents, and it isdefined:1 alias CDGrid[ *agents ]( 1024 1024 ) SimAgentGrid2 manager simAgentGrid @ SimAgentGridListing 7.7: Grid Storage Definition for Agentsto the be a Grid(2d array) of size 1024 by 1024 whose member cells are drawnfrom our original agents collection16. This allows our Agents to easily query todetermine the other Agents in a given area. For example:1 |SELECT { block: <blockUL blockLR> }2 FROM simAgentGrid3 AS simAgentGridContents|Listing 7.8: simAgentGrid QueryThis selects a rectangular area of the grid as well-defined substructure (a ‘block’)with the coordinates given demarcating its extents (upper left and lower right),16Creating an alias for the type to then use as the type for a manager declaration is not required,but it was a convention to increase readability that we developed as we gained experience writingCadmium code.121which could be, for example +/-5 of the agents position. This gives the invokingthread of execution a ‘snap shot’ view of that portion of the structure, protectedfrom outside modification, that it can inspect and modify at its leisure for the lifes-pan of the contract.It is important to note, from a semantic point of view, the contents of these cellsare potentially empty, single item stored views of the agents collection. Having aview on the simAgentGrid does not automatically give the programmer a viewon its contents, only the ability to see the cardinality of the stored views (i.e arethey empty or, effectively, ‘null’) and to allow the securing of the query. Thoughwe ended up needing both this behavior and the securing of all the items in thesub collection view immediately (i.e. that a view of the subcollection is a viewof the parent collection) and so are deeply considering how these two differentapproaches should be incorporated as the language matures.An astute reader will note that this implies that only one agent can occupy asingle cell at one time. This is not only true, but a behavior that was desired.Similarly, we defined another grid to hold resources. Though as resources areinterchangeable (one unit of wood is one unit of wood) we didn’t need make a‘parent’ collection for these entities and simply defined the grid contents be a tuplewith type and quantity of the resource, adding an option type for ’none’.The jobs for citizen agents could be defined as a tuple. Originally, we had acomplex data structure to hold these jobs, but for everything we tried, we ended upwith the static analysis needing to reserve the entire collection, a fact that we’ll gointo in depth below. In the end we used a simple List. The job selection code lookslike the following:1 bestJob@Job2 bestScore: -134 |availableJobs| → possibleJobs56 |possibleJobs AS job| → {7 // if nothing else, we’ll take the first one8 if bestScore < 0 {9 bestJob = job10 bestScore = computeJobScore(job)11 }12212 else {13 currentJobScore: computeJobScore(job)14 if currentJobScore > bestScore {15 bestJob = job16 bestScore = currentJobScore17 }18 }19 }2021 release possibleJobsListing 7.9: Job SelectionThe rendering was going to be a different challenge. Firstly, this involvedtalking to the hardware in a non-trivial way. Up to this point, adding facilities tothe Cadmium standard library for tasks, such as console output and file readinghad sufficed. However, in this case we were compelled to write serious code thatrequired the C++ interface. We wrote the various versions of renderer using thepopular SDL (Simple Direct Layer) C library. There are already a considerablenumber of bindings for using SDL in various languages, but the authors had notthought to provide Cadmium bindings.The first, and most important, choice was to consider what was going to ‘live’in Cadmium and what was going to stay completely on the C++ side. For thesimple 2d renderer (Figure 7.20) we elected to create an interface that allowedthe calling code to specifically refer to the sprites17 by number (this was beforewe added raw pointers for canneal (Section 7.2.3)). The Cadmium code couldprovide screen coordinates and the numerical ‘handle’ to the renderer. When everyimage had been placed, it would instruct the renderer to update the screen. Thistechnique is called double buffering, nearly ubiquitous in the domain where thenext visual frame is constructed in a separate chunk of memory and then in onestep is moved to the screen. Often times this is synced with the refresh of themonitor (v-sync), but we disabled this for the experiments below, as it created agreat deal of ‘noise’ in the timing that obscured what was actually happening withthe processing. We knew that we didn’t need to worry about the memory objects17Another term of art. A sprite is a rectangular raster image, with potential transparency, that canbe moved arbitrarily around the screen.123owned by the C++/SDL code. As was detailed earlier a call to an external modulewill be assumed to conflict with another call to the same module without a flag inits annotation. In this case, the code we wrote was determined to have no chanceof conflicts and so the compiler omitted the mutex it would have emitted if not.This meant that we could represent anything renderable with a simple tuplecomposed mostly of coordinates and handles, without the need for the heavy dutyEntity constructions we needed for Agents. Note that it is common practice to haveseparate representations of a game object’s simulation presence and its presentationpresence. Generally, the simulation representation will in some way ‘own’ thepresentation representation and our implementation was no exception.We chose to model objects in real valued space by storing them in a quadtree [29].A quadtree is a type of kd-tree used commonly in these applications and is some-times referred to as a spatial subdivision structure. Essentially, the tree representsfiner and finer divisions of a certain space. The root represents the entire ‘space’ inconsideration. Every node has zero or four children and each child represents onequadrant of the space represented by its parent18. When an object is inserted intothe tree, it is assigned to vertex representing the smallest area that strictly containsit. In effect, asking what’s in this area is equivalent to doing a tree walk. Thismakes it ideal for spatial queries and thus for our purposes.As with the majority of games, the entire ‘world’ isn’t rendered at one time.Generally, the player is focused on one area. This means that the renderer mustselect only those objects that are currently visible given the ‘camera’ position, thescope of which is defined by box called viewPort in the main rendering loop:1 | SELECT ALL toRender2 FROM renderSpace3 WHERE toRender.layer == 04 VISITBY inArea( viewPort ) | → {5 ULx: (toRender.position.UL.x - viewOffset.x) *6 (1.0/(viewScaleFactor))7 ULy: (toRender.position.UL.y - viewOffset.y) *8 (1.0/(viewScaleFactor))9 LRx: (toRender.position.LR.x - viewOffset.x) *10 (1.0/(viewScaleFactor))18The tree generally has an externally defined maximum depth.12411 LRy: (toRender.position.LR.y - viewOffset.y) *12 (1.0/(viewScaleFactor))1314 RenderCanvas::drawImage( toInt(ULx),15 toInt(ULy),16 toInt(LRx),17 toInt(LRy),18 toRender.handle );19 }2021 | SELECT ALL toRender22 FROM renderSpace23 WHERE toRender.layer == 124 VISITBY inArea( viewPort ) | → {25 ULx: (toRender.position.UL.x - viewOffset.x) *26 (1.0/(viewScaleFactor)‘’)27 ULy: (toRender.position.UL.y - viewOffset.y) *28 (1.0/(viewScaleFactor))29 LRx: (toRender.position.LR.x - viewOffset.x) *30 (1.0/(viewScaleFactor))31 LRy: (toRender.position.LR.y - viewOffset.y) *32 (1.0/(viewScaleFactor))3334 RenderCanvas::drawImage( toInt(ULx),35 toInt(ULy),36 toInt(LRx),37 toInt(LRy),38 toRender.handle );39 }4041 RenderCanvas::draw()Listing 7.10: Rendering from the QuadtreeNote that layer refers to the ‘stacking’ of the images. Layer 0 is background,such as grass, mountains, water, etc and layer 1 is the agents, resources, etc.In this way, the renderer has a view, and thus uninterrupted access to all thereal-spaced representations that it needs to process as a view. The key thing is thatthe complementary space is left available for modification. An agent will changeits rendering representation more often than not during its update. However, most125Figure 7.21: mini-ce – Initial Resultsof them are not being rendered in a given frame and so while the relatively longserial process of pushing pixel data around is happening, the agents who are notbeing rendered can adjust their spatial positions, giving us a potentially massivereduction in total frame time. This is achieved without the need for an agent toknow if it’s being rendered or not. Essentially, SvS ‘sorts’ the agents into beingrendered/not being rendered and defers the rendered ones till after the drawingis complete. Most of the code the user requires to achieve this is listed above.This also gives us yet another convenient segue into talking about the performancecharacteristics.PerformanceOur initial results, with tuned signature width and batch size parameters, are pre-sented in Figure 7.21. The measurement is the average time to complete one frame.We ran 10 repetitions of 3000 frames and discarded the first 1000 of each run to letthings ‘warm up’ and achieve a relatively stable state. As a fulfillment of our basicdesign goals, the single core frame time is under 16 ms, which achieves the current‘gold standard’ of 60FPS.We say initial results as the reader may notice that additional cores after the 4th126Figure 7.22: mini-ce – 750 agents – Activity Graphgives little benefit. This is not due to any deficiencies of the Cadmium scheduler,but large unavoidably serial task that contains the quadtree walking and frame ren-dering. This can be seen in the activity graph in Figure 7.22, where the renderertask is the large rounded rectangle on the far left (core 0).This is where video games as an experimental subject demonstrates one of theaspects that makes it a fascinating field for systems research. One can assume that,in the majority of cases, games always want more. If the current performance isbetter than that required for responsiveness, there is always something more thatdesigners will want to add and players will want to experience: more particles,better anti-aliasing, better collision detection. The list is endless. So, we couldincrease our workload, simulate more agents, without violating the parameters ofthe domain.Our original test simulated 500 agents, 400 citizens and 100 monsters, and wecontinued to increase the number of agents till we surpassed the threshold of 60FPSfor the single core case. This turned out to be 3000 agents total, which gives us thepleasing performance curve in Figure 7.23. Note that while the rendering task doestake longer for a greater number of agents, this growth is nowhere near linear. Thecomponent of renderer that moves the constructed frame from off-screen buffer toscreen is constant, and while there are more agents to render, not all of them willbe visible.127Figure 7.23: mini-ce – Population VariationFigure 7.24: mini-ce – Batch Size Variations128Figure 7.25: mini-ce – Signature Width VariationsAs with all previous experiments, we have evaluated the the effects of varyingbatch size and signature width.For the batch size, Figure 7.24, we show the variations of the agent updatedispatch, as it is, by far, the most sensitive operation in the application. As withprevious experiments, we see that a batch factor of 10 (equal to the maximumnumber of cores) performance suffers, as there is less flexibility in scheduling.Diminishing returns set in around a batch size of 100, where the gains in flexibilityare met by the costs in increased overhead.Unlike previous experiments, we have more than one collection and conse-quently more than one signature size parameter. Figure 7.25 shows the variation ofthe signature width and sparsity properties for the quadtree19 and Grid collections.We omit the width of the Set, for clarity, as it had the least discernable effect andwe omit the List due to the fact that, as we detailed above, static analysis alwaysemployed the reservation mechanism, which bypasses the signature mechanismaltogether.19Note the signature width of the quadtree is limited to perfect squares. This is due to a simplifi-cation in our implementation that is not implicit in the technique. When we observed the small sizeof the effect between 256 and 1024 we did not feel that the extra effort to remove this limitations waswarranted.129Figure 7.26: mini-ce – SvS versus fine grained locksIn the cases with fewer cores, low width for the quadtree and grid seem toperform the best. As the core count rises, a larger width for the grid performsbetter, but the quadtree still functions well with a low width. This makes sense, asthe quadtree accesses are brief in terms of their ‘critical section’ – simply addingor removing an item from the list attached to a particular node and so speed of‘locking’ is paramount. On the other hand, accesses to the grids tend to be longer,with the agent evaluating the contents of a block and making choices, so the costof a false positive will be higher. At maximum core count, both cases benefit frombeing on an end, albeit opposite ends, of the speed/accuracy tradeoff spectrum.Finally, as this was a new application, we don’t have a previous application forcomparison. Instead, we took the compiler generated C++ code and replaced thesignature interaction code with fine grained locks20. The results of these efforts areshown in Figure 7.26. As we anticipated, the SvS version outperforms the clas-sic locking version. The difference between the two curves shows the additionaloverhead of taking the many locks required to secure the required subcollectionsto satisfy each query. While the SvS version outperforms the locking version at20We also produced a version with coarse ‘global’ locks, but the incremental performance gainsfor each additional cores was so miniscule that it was not considered to be a viable example.130all times, the gap between the two curves does decrease as the core count rises.This partially reflects the difference in the way the two techniques handle highercontention, but notably includes the effects of false positives that the SvS versioncan suffer from, while the locking version does not.InsightsAt the start of this section, we talked about the dangers of ‘over fitting’ a generalsolution to a specific example. This was something we tried to avoid in the designof this system. It turns out that we need not have worried so much. Even thoughthe patterns of this software inspired a lot of the design of Cadmium, we still en-countered a number of times where the model was awkward or insufficient: a lotof Oh, yeah. That’s not going to work like I thought it was going to. At least wecould console ourselves on the long nights of restructuring and rethinking that wewere being honest about ‘eating our own dogfood’, as the saying goes.One easily isolated example of this is pathfinding. That is the process an agentinitiates to determine the route to get from point a to point b, while avoiding im-passable obstacles. Pathfinding in mini-ce was done, as is standard practice, withan implementation of the A? algorithm [36]. We attempted to parallelize this al-gorithm, but found two things. Firstly, any performant implementation seemedto need to reuse memory buffers and so the language would need the concept ofthread local storage or similar and as we have stated, we have tried to see how farwe could get without directly bringing concepts like ‘threads’ into the language(outside of annotations). Secondly, the algorithm, as written, is definitely sequen-tial as it evaluates the ‘best’ point at every step. This would need to be reconsidered.We realized we would need the complement to an accumulator to hold the priorityqueue – a ‘source’ to the accumulator’s ‘sink’. This idea excites us a great deal, butdue to scope limitations we’re forced to leave it to future work. In the end, we useda serial implementation that the author had already written in C++, showing, yetagain, the virtues of the C++ interface. However, because it was serial21, it madefor a dramatically unbalanced workload, which only got worse as the number ofcores increased, as the serial sections remained the same length. To alleviate this,21At least for each instantiation of the algorithm, we used the information available from CadmiumContext (§5.1) C++ representation to implement Thread Local Storage131we constructed a heuristic version that found the closest point to goal that could befound by examining no more than an arbitrary number of locations. This worked totame the extreme variance of the update times, but would not have been necessaryif we were able to better express some variation on A?.We won’t be satisfied with the next iteration of Cadmium unless it can expressan algorithm with the following properties:• a central collection (such as the priority queue in A?) that can safely receiveand dispatch new items – as with the chanel interface, message passing gen-eralization• a way to elegantly describe a stopping condition (such as finding the des-tination in A?) akin to a parallel_while as a complement to the wellestablished parallel_for.A deeper example was in ways we didn’t anticipate the Entity semantics inter-acting with the code analysis. The agent update is performed by sending a broad-cast message to the set of agents, which essentially gives the receiving Entity a viewof its own members. So, if two agents are involved in an interaction that modifiedboth Entities, the code analysis would trigger, correctly, that deadlock was possi-ble. For these more complex interactions we were compelled to perform a morecomplicated version of the separation inducing restructuring we had to perform formesh refinement (Section 7.2.2). In that case, we first needed to do a read-onlysearch of the graph to ascertain what we had to change, and a second modificationpass to fix everything we found in the first step. In this case, we had to defer thesecomplex interactions till after the main update loop and store the subjects and anyparameters. Again, no serial programmer would have ever written code like that.The code analysis reports are there to guide the programmer away from what isn’tgoing to work, but the semantics of the language should also guide the program-mer towards a good solution. Unlike with pathfinding, we don’t have a clear setof criteria to know when we’ve solved this problem. However, we do notice thatthis problem, mesh refinement and to a certain extent the list searching problemdiscussed below all involve the need to do a read-only search, followed by a set ofmodifications, indicated by the results in the first step.132In general, we found the Entity semantics to be a double-edged sword. Theywere easy to visualize and reason about, made for well organized code and werevery comfortable after nearly two decades of Object-Oriented programming. Theywere also the cause of a great deal of the awkwardness in implementation22. Inthe end we conclude that Entities are a good start for a natural parallel model, butfurther work is needed to push this further for a truly effective composition of codeand data.As eluded to several times in this section, we ran into problems allowing con-current access to the List that contained the jobs. The problem was not the codeanalysis forcing the inclusion of the reservation mechanism, but the exhaustivesearch. The citizen needs to see all the potential options before choosing one, andthat eventual selection will not be known until the last job is examined. It may havebeen possible to modify the algorithm so that it uses the two part decompositiondetailed above and to check if the job selected still exists, though given that thisList is unordered and unindexed, we had no mechanism to retrieve a specific jobfrom the list. The pattern of needing to reserve the entire collection in order to iso-late a subset does occur again and again. The other structures in this example all,by complete chance, have an implicit method going directly to part of the collec-tion in question. Obviously, grid coordinates clearly indicate the substructure (andthe corresponding SvS partitions). Only slightly more indirectly, the Set, which isunsurprisingly implemented as a hash table, is able to take the subject of the queryand immediately reduce the area of concern to a subcollection that at worst con-tains the data in question and still allows concurrent access. The quadtree is themost subtle in this respect. Given that the tree has a regular structure and that eachnode corresponds to specific spatial coordinates for any given rectangular area, asubtree can be determined that contains said rectangle. Otherwise, the query wouldhave been forced to do a walk to find the appropriate subtree and so many of thegains of the system would have been lost.Another potential solution ‘from a different angle’ would have been to use pre-computed query signatures. The idea is to take the contents of a WHERE clause and22This was exacerbated by the fact that we had neglected to include any concept of ‘this’(i.e. alanguage level symbol that represents the current Entity that is executing). Though, this was anoversight and not a hole in the model and will be rectified in the next version.133as the collection is populated, build the corresponding signature. For example ifwe had a query on the job list that was WHERE job.danger > 10, every timea job was added to the List that had the appropriate property, it would be addedto a signature stored with the collection. When that particular query was initiated,this signature could be employed as a use-signature in place of doing a full collec-tion search. It would be analogous to adding an index to a database table, tradingspace and some update time for faster, and in our case more concurrently acces-sible, queries. As a side effect, the composability of signatures transfers directlyto composability of clauses. We considered this design from the beginning, as itfollows directly from the intersection of queries and signatures. However, thereis a considerable number of details to address, mostly to do with maintaining thestored signatures as the collection is modified. This complexity put the techniqueout of scope for this project, as it would be a good candidate for a project itself.After reading the preceding paragraphs, one might be led to be believe that weconsider this experiment to be a failure. This is not the case, by any means. Itcertainly was a little disheartening to see our shiny new ideas get battered whenexposed to some approximation of real world conditions, however, overall this wasa success. We were able to, with a few blemishes, express a number of symbioticpatterns in the our candidate model. We point the reader to the elegance of ex-pressing sharing the render space between the renderer and the agents. Finally, theperformance curves show numerically that we have definitely achieved the goal ofnotable parallel performance.The point of this exercise was to test our model against real world conditions,both in terms of performance and expressability. We had a number of successfulresults, which validated our direction. We also had a number of failures, which wehave spent a lot of text dissecting. This is, however, the point of research. To trynew things and learn from them. Given that we have shown that our foundationis sound, in the end each of our failures lead to directions for future work, whichgives us one final segue to a discussion of the future directions suggested by thisproject.134Chapter 8Conclusion and Future WorkBy this point, we hope we’ve made a strong case for the viability of our model.By doing a ‘lateral’ design that explored and exploited the inextricable link be-tween the way software encodes the intentions of the programmer and the minutiaof runtime behavior, we tried to show a rough sketch of what an entire ecosystemwould look like. We hope that we’ve shown that by considering the problem of par-allelism wholistically, even complex systems can be tamed and programmers canuse all these cores that the hardware designers have given us, while still focusingon the subjects of their own expertise.We had lofty goals and no single project is going to solve all the problems in thedomain. What this represents is a start. Certainly, we developed some interestingand powerful algorithms and techniques, but we also gained a lot experience tryingthings in a different way. We hope that our failures are also valuable, leavingthreads to weave into something better.Throughout this work, we have used the phrase future work many times. Invideo games there is the concept of a ‘vertical slice’, where the studio builds onesegment of the game to show off what the final product will look like. This couldbe one segment of a level or a certain encounter. Scenery is modeled, charactersare animated and music is composed. The idea is to present the experience of thefinished product, even if only for a few minutes. If one were to move their avataroutside this prepared area it would probably crash or at least be incomprehensible.Generally, this slice is used to show a publisher what it would look like fully re-135alized to secure the funding to make it a reality. Once this funding is granted thelong road of fleshing out the rest of the game begins, working many long hours tobring the rest in line with that vision.Our wish list for future Cadmium (and DnC as a whole) development is longerthan an overly-optimistic child’s letter to Santa. We’ve listed many of these through-out this text, but they fall into a few major categories:Collections As we mentioned our prototype only implements parts of the interfacedescribed, mostly focusing on fully fleshing out those facets needed for our tests.As well, it will be necessary to flesh out our user provided collections. There area lot of interesting problems in specifying how a collection responds to variousqueries. Futhermore, there is still a great deal of untapped potential in differentsignature schemes. The partitioning of collection space has a noticeable effect onthe accuracy of the heuristic which in turn has an effect on performance.Queries We gained a lot by adding a richer set of constructs to describe the dataneeded for an operation, but this could go so much further. More ways of describ-ing operations, more ways of inserting complex behavior, queries that feed into oneanother like an embedded map/reduce, and so much more.Messages One key concept we had to abandon to keep the scope of this sane wasthe idea of message channels, essentially marrying the best of ‘dataflow’ systemswith ours. Another was the idea of asynchronous messages. Our runtime supportsit, but our static analysis does not. As well, we only scratched the surface of bi-directional messages with the query/view process.Program Flow Our phase breakdown was enough for us to express complex appli-cations like our video game example, but left some ‘negative space’ that’s achingto be filled. Allowing an entity to receive something like a completion receiptwhen some other operation has completed would dramatically increase the expres-siveness of the system. There is a real use for a properly realized ‘comefrom’.Furthermore, there is much more that can be done with system messages and inter-actions with the scheduler.Reporting and Debugging We were pleasantly surprised at how amazingly helpfuljust our Code Analysis Reports and autogenerated Activity Graphs were in devel-136oping our application. There really is no such thing as too much insight into what’sgoing on in a complex application1.Feedback-Directed Optimization Another avenue that we originally intended toexplore, but were forced to abandon to keep this within scope, was the idea thatwe could use the same ‘hooks’ that we used to generate the activity graphs togenerate program traces that could be, in turn, fed back into the compiler to helptune parameters like signature size and batch factor. The fact that our languagehas a greater semantic richness means that any tool built to optimize it has a legup in figuring out how pieces relate to each other and our algorithms expose alot of ‘knobs’ to turn for exploration. The reader may have noticed that in ourevaluations (Ch 7) we had a lot of different combinations to try to find the bestresults. This is a time-consuming process and while not necessary for correctness(or even decent performance) our goal was always to allow the programmer tothink about parallelism as little as possible and still get results.Going further afield, there are even more ambitious directions, such as thethings we always get asked about in the introduction (§1.7). There’s also morelow level optimizations, such as increased optimization for cache utilization byleveraging our augmented knowledge of the data being scheduled.So, if you’ll excuse us, we have work to do21Said by someone who uses Compiler Explorer (https://godbolt.org) for fun.2Perhaps after taking a break somewhere where nobody says the word ‘parallel’.137Bibliography[1] clang: a c language family frontend for llvm. URL http://clang.llvm.org. →page 72[2] Clojure. URL https://clojure.org. → page 22[3] The go programming language. URL https://golang.org. → page 22[4] Snap: Network datasets: Google web graph. URLhttps://snap.stanford.edu/data/web-Google.html. → page 92[5] Getting started with linq in c#. URLhttps://msdn.microsoft.com/en-us/library/bb397933.aspx. → page 32[6] The programming language lua. URL http://www.lua.org/. → page 113[7] Neo4j. URL https://neo4j.com/. → page 20[8] Dots. URL https://unity.com/dots. → page 22[9] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur,J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah,M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden,M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scalemachine learning on heterogeneous systems, 2015. URLhttp://tensorflow.org/. → page 4[10] R. Agarwal, L. Wang, and S. Stoller. Detecting potential deadlocks withstatic analysis and run-time monitoring. pages 191–207, 11 2005.doi:10.1007/11678779_14. → page 22138[11] R. Agarwal, L. Wang, and S. D. Stoller. Detecting Potential Deadlocks withStatic Analysis and Run-Time Monitoring, pages 191–207. Springer BerlinHeidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-32605-2.doi:10.1007/11678779_14. URL http://dx.doi.org/10.1007/11678779_14.→ page 56[12] K. Agrawal, J. T. Fineman, and J. Sukha. Nested parallelism in transactionalmemory. In Proceedings of the 13th ACM SIGPLAN Symposium onPrinciples and practice of parallel programming, pages 163–174, 2008. →page 18[13] J. Armstrong. A history of erlang. In HOPL III: Proceedings of the thirdACM SIGPLAN conference on History of programming languages, pages6–1–6–26, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-766-X.doi:http://doi.acm.org/10.1145/1238844.1238850. URLhttp://portal.acm.org/citation.cfm?id=1238844.1238850. → page 21[14] W. Baek, N. Bronson, C. Kozyrakis, and K. Olukotun. Implementing andevaluating nested parallel transactions in software transactional memory. InProceedings of the twenty-second annual ACM symposium on Parallelism inalgorithms and architectures, pages 253–262, 2010. → page 18[15] D. A. Bailey. Raising lazarus - the 20 year old bug that went to mars. URLhttp://blog.securitymouse.com/2014/06/. → page 1[16] M. J. Best, S. Mottishaw, C. Mustard, M. Roth, A. Fedorova, andA. Brownsword. Synchronization via scheduling: techniques for efficientlymanaging shared state. In Proceedings of the 32nd ACM SIGPLANConference on Programming Language Design and Implementation, PLDI2011, San Jose, CA, USA, June 4-8, 2011, pages 640–652, 2011.doi:10.1145/1993498.1993573. URLhttp://doi.acm.org/10.1145/1993498.1993573. → pages 32, 61[17] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, PrincetonUniversity, January 2011. → page 106[18] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors.Commun. ACM, 13(7):422–426, 1970. ISSN 0001-0782.doi:http://doi.acm.org/10.1145/362686.362692. → page 65[19] R. D. Blumofe et al. Cilk: An efficient multithreaded runtime system. In J.of Parallel and Dist. Comp., pages 207–216, 1995.139doi:http://citeseerx.ist.psu.edu/viewdoc/summary?doi= →page 21[20] R. L. Bocchino. Deterministic Parallel Java, pages 566–573. Springer US,Boston, MA, 2011. ISBN 978-0-387-09766-4.doi:10.1007/978-0-387-09766-4_119. URLhttps://doi.org/10.1007/978-0-387-09766-4_119. → page 22[21] Z. Budimlic, M. Burke, V. Cave, K. Knobe, G. Lowney, R. Newton,J. Palsberg, D. Peixotto, V. Sarkar, F. Schlimbach, and S. Tasirlar.Concurrent collections. Sci. Program., 18(3-4):203–217, Aug. 2010. ISSN1058-9244. doi:10.1155/2010/521797. URLhttp://dx.doi.org/10.1155/2010/521797. → page 19[22] B. D. Carlstrom, A. McDonald, H. Chafi, J. Chung, C. C. Minh,C. Kozyrakis, and K. Olukotun. The atomos transactional programminglanguage. SIGPLAN Not., 41(6):1–13, June 2006. ISSN 0362-1340.doi:10.1145/1133255.1133983. URLhttps://doi.org/10.1145/1133255.1133983. → page 18[23] D. D. Chamberlin and R. F. Boyce. Sequel: A structured english querylanguage. In Proceedings of the 1974 ACM SIGFIDET (now SIGMOD)workshop on Data description, access and control, pages 249–264, 1974. →page 19[24] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon.Parallel programming in OpenMP. Morgan Kaufmann Publishers Inc., SanFrancisco, CA, USA, 2001. ISBN 1-55860-671-8. → pages 4, 89[25] P. J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with“readers” and “writers”. Commun. ACM, 14(10):667–668, Oct. 1971. ISSN0001-0782. doi:10.1145/362759.362813. URLhttps://doi.org/10.1145/362759.362813. → page 16[26] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on largeclusters. Commun. ACM, 51(1):107–113, 2008. ISSN 0001-0782.doi:http://doi.acm.org/10.1145/1327452.1327492. → page 31[27] D. J. Dewitt and J. Gray. Parallel database systems: the future of highperformance database systems. COMMUNICATIONS OF THE ACM, 35:85–98, 1992. → page 20140[28] E. W. Dijkstra. Solution of a problem in concurrent programming control.Commun. ACM, 8(9):569, Sept. 1965. ISSN 0001-0782.doi:10.1145/365559.365617. URL https://doi.org/10.1145/365559.365617.→ page 16[29] R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval oncomposite keys. Acta Inf., 4(1):1–9, Mar. 1974. ISSN 0001-5903.doi:10.1007/BF00288933. URL https://doi.org/10.1007/BF00288933. →page 124[30] J. Gabarró, C. Martínez, and X. Messeguer. A design of a parallel dictionaryusing skip lists. Theoretical Computer Science, 158(1):1 – 33, 1996. ISSN0304-3975. doi:https://doi.org/10.1016/0304-3975(94)00288-6. URLhttp://www.sciencedirect.com/science/article/pii/0304397594002886. →page 19[31] E. Gamma, R. Helm, R. Johnson, and J. M. Vlissides. Design Patterns:Elements of Reusable Object-Oriented Software. Addison-WesleyProfessional, 1 edition, 1994. ISBN 0201633612. → page 27[32] A. N. Habermann. Synchronization of communicating processes. Commun.ACM, 15(3):171–176, Mar. 1972. ISSN 0001-0782.doi:10.1145/361268.361277. URL https://doi.org/10.1145/361268.361277.→ page 16[33] R. H. Halstead. Multilisp: A language for concurrent symbolic computation.ACM Trans. Program. Lang. Syst., 7(4):501–538, Oct. 1985. ISSN0164-0925. doi:10.1145/4472.4478. URLhttps://doi.org/10.1145/4472.4478. → page 22[34] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memorytransactions. In Proceedings of the Tenth ACM SIGPLAN Symposium onPrinciples and Practice of Parallel Programming, PPoPP ’05, page 48–60,New York, NY, USA, 2005. Association for Computing Machinery. ISBN1595930809. doi:10.1145/1065944.1065952. URLhttps://doi.org/10.1145/1065944.1065952. → page 18[35] T. L. Harris. A pragmatic implementation of non-blocking linked-lists. InProceedings of the 15th International Conference on DistributedComputing, DISC ’01, page 300–314, Berlin, Heidelberg, 2001.Springer-Verlag. ISBN 3540426051. → page 19141[36] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristicdetermination of minimum cost paths. IEEE Transactions on SystemsScience and Cybernetics, 4(2):100–107, 1968. → page 131[37] K. Havelund. Using runtime analysis to guide model checking of javaprograms. In International SPIN Workshop on Model Checking of Software,pages 245–264. Springer, 2000. → page 22[38] D. Hendler and N. Shavit. Non-blocking steal-half work queues. InProceedings of the Twenty-First Annual Symposium on Principles ofDistributed Computing, PODC ’02, page 280–289, New York, NY, USA,2002. Association for Computing Machinery. ISBN 1581134851.doi:10.1145/571825.571876. URL https://doi.org/10.1145/571825.571876.→ page 61[39] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. MorganKaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN0123705916. → page 3[40] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer. Softwaretransactional memory for dynamic-sized data structures. In Proceedings ofthe Twenty-Second Annual Symposium on Principles of DistributedComputing, PODC ’03, page 92–101, New York, NY, USA, 2003.Association for Computing Machinery. ISBN 1581137087.doi:10.1145/872035.872048. URL https://doi.org/10.1145/872035.872048.→ page 18[41] C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalismfor artificial intelligence. In Proceedings of the 3rd International JointConference on Artificial Intelligence, IJCAI’73, pages 235–245, SanFrancisco, CA, USA, 1973. Morgan Kaufmann Publishers Inc. URLhttp://dl.acm.org/citation.cfm?id=1624775.1624804. → page 22[42] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-marl: A dsl for easyand efficient graph analysis. SIGPLAN Not., 47(4):349–362, Mar. 2012.ISSN 0362-1340. doi:10.1145/2248487.2151013. URLhttps://doi.org/10.1145/2248487.2151013. → page 89[43] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-marl: A dsl for easyand efficient graph analysis. In Proceedings of the Seventeenth InternationalConference on Architectural Support for Programming Languages andOperating Systems, ASPLOS XVII, pages 349–362, New York, NY, USA,1422012. ACM. ISBN 978-1-4503-0759-8. doi:10.1145/2150976.2151013.URL http://doi.acm.org/10.1145/2150976.2151013. → page 19[44] A. Hunt and D. Thomas. The Pragmatic programmer : from journeyman tomaster. Addison-Wesley, Boston [etc.], 2000. ISBN 020161622X9780201616224. → page 12[45] S. M. Imam and V. Sarkar. Integrating task parallelism with actors.SIGPLAN Not., 47(10):753–772, Oct. 2012. ISSN 0362-1340.doi:10.1145/2398857.2384671. URLhttps://doi.org/10.1145/2398857.2384671. → page 22[46] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributeddata-parallel programs from sequential building blocks. In EuroSys ’07:Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference onComputer Systems 2007, pages 59–72, New York, NY, USA, 2007. ACM.ISBN 978-1-59593-636-3.doi:http://doi.acm.org/10.1145/1272996.1273005. → page 32[47] M. C. Jeffrey, S. Subramanian, C. Yan, J. S. Emer, and D. Sanchez.Unlocking ordered parallelism with the swarm architecture. IEEE Micro, 36(3):105–117, 2016. → page 17[48] S. P. Jones, A. Gordon, and S. Finne. Concurrent haskell. In POPL,volume 96, pages 295–308. Citeseer, 1996. → page 22[49] S. Kalikar and R. Nasre. Domlock: A new multi-granularity lockingtechnique for hierarchies. SIGPLAN Not., 51(8), Feb. 2016. ISSN0362-1340. doi:10.1145/3016078.2851164. URLhttps://doi.org/10.1145/3016078.2851164. → page 16[50] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P.Chew. Optimistic parallelism requires abstractions. Commun. ACM, 52(9):89–97, 2009. ISSN 0001-0782.doi:http://doi.acm.org/10.1145/1562164.1562188. → page 97[51] M. Kulkarni et al. Scheduling strategies for optimistic parallel execution ofirregular programs. In SPAA ’08, pages 217–228. ACM, 2008. ISBN978-1-59593-973-9. doi:http://doi.acm.org/10.1145/1378533.1378575. →page 19[52] A. Kyrola, G. Blelloch, and C. Guestrin. Graphchi: Large-scale graphcomputation on just a pc. In Proceedings of the 10th USENIX Conference on143Operating Systems Design and Implementation, OSDI’12, pages 31–46,Berkeley, CA, USA, 2012. USENIX Association. ISBN 978-1-931971-96-6.URL http://dl.acm.org/citation.cfm?id=2387880.2387884. → page 19[53] L. Lamport. A fast mutual exclusion algorithm. ACM Trans. Comput. Syst.,5(1):1–11, Jan. 1987. ISSN 0734-2071. doi:10.1145/7351.7352. URLhttps://doi.org/10.1145/7351.7352. → page 16[54] P. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling.High-performance concurrency control mechanisms for main-memorydatabases. CoRR, abs/1201.0228, 2012. URLhttp://arxiv.org/abs/1201.0228. → page 20[55] S. S. Laurent and J. D. Eisenberg. Introducing Elixir: Getting Started inFunctional Programming. O’Reilly Media, Inc., 2nd edition, 2017. ISBN1491956771. → page 22[56] Z. Li, L. Li, H. Cui, and H. Wan. An adaptive-granularity locking algorithmand its application in collaborative authoring system. In 2007 11thInternational Conference on Computer Supported Cooperative Work inDesign, pages 168–173, 2007. → page 16[57] S. S. Mannarswamy and R. Govindarajan. Variable granularity accesstracking scheme for improving the performance of software transactionalmemory. In 2011 IEEE International Parallel Distributed ProcessingSymposium, pages 455–466, 2011. → page 18[58] K. Matsuzaki, Z. Hu, and M. Takeichi. Parallel skeletons for manipulatinggeneral trees. Parallel Computing, 32(7-8):590–603, 2006. → page 19[59] M. McCool, J. Reinders, and A. Robison. Structured Parallel Programming:Patterns for Efficient Computation. Morgan Kaufmann Publishers Inc., SanFrancisco, CA, USA, 1st edition, 2012. ISBN 9780123914439. → pages42, 89[60] F. McSherry, M. Isard, and D. G. Murray. Scalability! but at what cost? In15th Workshop on Hot Topics in Operating Systems (HotOS XV), KartauseIttingen, Switzerland, 2015. USENIX Association. URL https://www.usenix.org/conference/hotos15/workshop-program/presentation/mcsherry. → page3[61] E. Meijer, B. Beckman, and G. Bierman. Linq: reconciling object, relationsand xml in the .net framework. In international conference on Management144of data (SIGMOD), pages 706–706. ACM, 2006. ISBN 1-59593-434-0.URL http://doi.acm.org/10.1145/1142473.1142552. → page 20[62] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalablesynchronization on shared-memory multiprocessors. ACM Trans. Comput.Syst., 9(1):21–65, Feb. 1991. ISSN 0734-2071.doi:10.1145/103727.103729. URL https://doi.org/10.1145/103727.103729.→ page 16[63] M. Moir. personal communication. → page 17[64] A. Moitra and S. S. Iyengar. A maximally parallel balancing algorithm forobtaining complete balanced binary trees. IEEE transactions on computers,100(6):563–565, 1985. → page 19[65] L. Molesky and K. Ramamritham. Efficient locking for shared memorydatabase systems. Technical report, 1994. → page 20[66] G. Moore. Cramming more components onto integrated circuits.Proceedings of the IEEE, 86(1):82–85, 1998.doi:10.1109/JPROC.1998.658762. URLhttp://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=658762. → page2[67] M. Naik, C. Park, K. Sen, and D. Gay. Effective static deadlock detection.In 2009 IEEE 31st International Conference on Software Engineering, pages386–396, 2009. → page 22[68] T. Neumann, T. Mühlbauer, and A. Kemper. Fast serializable multi-versionconcurrency control for main-memory database systems. In Proceedings ofthe 2015 ACM SIGMOD International Conference on Management of Data,SIGMOD ’15, page 677–689, New York, NY, USA, 2015. Association forComputing Machinery. ISBN 9781450327589.doi:10.1145/2723372.2749436. URLhttps://doi.org/10.1145/2723372.2749436. → page 20[69] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citationranking: Bringing order to the web. In Proceedings of the 7th InternationalWorld Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.URL citeseer.nj.nec.com/page98pagerank.html. → page 88[70] H. Park and K. Park. Parallel algorithms for red–black trees. TheoreticalComputer Science, 262(1-2):415–435, 2001. → page 19145[71] T. Parr and K. Fisher. Ll(*): the foundation of the ANTLR parser generator.In Proceedings of the 32nd ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation, PLDI 2011, San Jose, CA, USA, June4-8, 2011, pages 425–436, 2011. doi:10.1145/1993498.1993548. URLhttp://doi.acm.org/10.1145/1993498.1993548. → page 72[72] M. C. Rinard et al. The design, implementation, and evaluation of jade.ACM Trans. Program. Lang. Syst., 20(3):483–545, 1998. ISSN 0164-0925.doi:http://doi.acm.org/10.1145/291889.291893. → pages 20, 31[73] M. Roth, M. J. Best, C. Mustard, and A. Fedorova. Deconstructing theoverhead in parallel applications. In Proceedings of the 2012 IEEEInternational Symposium on Workload Characterization, IISWC 2012, LaJolla, CA, USA, November 4-6, 2012, pages 59–68. IEEE Computer Society,2012. doi:10.1109/IISWC.2012.6402901. URLhttps://doi.org/10.1109/IISWC.2012.6402901. → page 96[74] J. Ruppert. A delaunay refinement algorithm for quality 2-dimensional meshgeneration. In Selected Papers from the Fourth Annual ACM SIAMSymposium on Discrete Algorithms, SODA ’93, page 548–585, USA, 1995.Academic Press, Inc. → page 96[75] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam. Implementingsignatures for transactional memory. In MICRO ’07: Proceedings of the 40thAnnual IEEE/ACM International Symposium on Microarchitecture, pages123–133, Washington, DC, USA, 2007. IEEE Computer Society. ISBN0-7695-3047-8. doi:http://dx.doi.org/10.1109/MICRO.2007.20. → page 18[76] B. Schling. The Boost C++ Libraries. XML Press, 2011. ISBN0982219199, 9780982219195. → page 72[77] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, andR. Isaacs. Embracing diversity in the barrelfish manycore operating system.In In Proceedings of the Workshop on Managed Many-Core Systems, 2008.→ page 21[78] N. Shavit et al. Software transactional memory. In PODC ’95, pages204–213. ACM, 1995. ISBN 0-89791-710-3.doi:http://doi.acm.org/10.1145/224964.224987. → pages 17, 31, 97[79] Suh-Yin Lee and Ruey-Long Liou. A multi-granularity locking model forconcurrency control in object-oriented database systems. IEEE Transactionson Knowledge and Data Engineering, 8(1):144–156, 1996. → page 16146[80] S. Timnat, A. Braginsky, A. Kogan, and E. Petrank. Wait-free linked-lists.In R. Baldoni, P. Flocchini, and R. Binoy, editors, Principles of DistributedSystems, pages 330–344, Berlin, Heidelberg, 2012. Springer BerlinHeidelberg. ISBN 978-3-642-35476-2. → page 19[81] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy transactionsin multicore in-memory databases. In Proceedings of the Twenty-FourthACM Symposium on Operating Systems Principles, SOSP ’13, page 18–32,New York, NY, USA, 2013. Association for Computing Machinery. ISBN9781450323888. doi:10.1145/2517349.2522713. URLhttps://doi.org/10.1145/2517349.2522713. → page 20[82] A. Welc, S. Jagannathan, and A. L. Hosking. Transactional monitors forconcurrent objects. In European Conference on Object-OrientedProgramming, pages 518–541. Springer, 2004. → page 18[83] A. Williams, W. Thies, and M. D. Ernst. Static deadlock detection for javalibraries. In European conference on object-oriented programming, pages602–629. Springer, 2005. → page 22[84] L. Yen, S. C. Draper, and M. D. Hill. Notary: Hardware techniques toenhance signatures. In 2008 41st IEEE/ACM International Symposium onMicroarchitecture, pages 234–245, 2008. → page 17[85] H. Yu and L. Rauchwerger. Adaptive reduction parallelization techniques.In ACM International Conference on Supercomputing 25th AnniversaryVolume, page 311–322, New York, NY, USA, 2000. Association forComputing Machinery. ISBN 9781450328401.doi:10.1145/2591635.2667180. URLhttps://doi.org/10.1145/2591635.2667180. → page 30147


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items