Optimizing Modern Code Review ThroughRecommendation AlgorithmsbyGiovanni VivianiB. in Informatics, Universita` della Svizzera Italiana, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)August 2016© Giovanni Viviani, 2016AbstractSoftware developers have many tools at their disposal that use a variety of sophis-ticated technology, such as static analysis and model checking, to help find defectsbefore software is released. Despite the availability of such tools, software de-velopment still relies largely on human inspection of code to find defects. Manysoftware development projects use code reviews as a means to ensure this humaninspection occurs before a commit is merged into the system. Known as moderncode review, this approach is based on tools, such as Gerrit, that help developerstrack commits for which review is needed and that help perform reviews asyn-chronously. As part of this approach, developers are often presented with a listof open code reviews requiring attention. Existing code review tools simply orderthis list of open reviews based on the last update time of the review; it is left to adeveloper to find a suitable review on which to work from a long list of reviews.In this thesis, we present an investigation of four algorithms that recommend anordering of the list of open reviews based on properties of the reviews. We use asimulation study over a dataset of six projects from the Eclipse Foundation to showthat an algorithm based on ordering reviews from least lines of code modified inthe changes to be reviewed to most lines of code modified out performs other al-gorithms. This algorithm shows promise for eliminating stagnation of reviews andoptimizing the average duration reviews are open.iiPrefaceNo part of this thesis has been published. This thesis is an original intellectualproduct of the author, G. Viviani.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 42.1 Code Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.1 Lightweight Code Review . . . . . . . . . . . . . . . . . 52.2 Code Review Tools . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Gerrit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Code Review Completion Time . . . . . . . . . . . . . . . . . . . 82.4 Code Review Recommendation . . . . . . . . . . . . . . . . . . . 83 Recommending Code Review Ordering . . . . . . . . . . . . . . . . 103.1 Lines of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10iv3.2 Edit Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Recommendation Algorithms . . . . . . . . . . . . . . . . . . . . 134 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 144.1 Eclipse Foundation Data . . . . . . . . . . . . . . . . . . . . . . 154.2 Actual Duration . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3 Actual Effort . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.4 Effort per Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.5 Simulation and Estimated Duration . . . . . . . . . . . . . . . . . 185 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205.1 JGit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215.2 EGit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.3 Algorithm Choice . . . . . . . . . . . . . . . . . . . . . . . . . . 296 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306.1 Internal Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . 306.2 Construct Validity . . . . . . . . . . . . . . . . . . . . . . . . . . 317 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . 327.1 Additional Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 327.2 Better Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 327.3 Personalized Recommendations . . . . . . . . . . . . . . . . . . 337.4 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 338 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39A.1 EGit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40A.2 Linuxtools . . . . . . . . . . . . . . . . . . . . . . . . . . . 47A.3 JGit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54A.4 Sirius . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61A.5 Osee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68vA.6 Tracecompass . . . . . . . . . . . . . . . . . . . . . . . . . . 75viList of TablesTable 4.1 The Eclipse Foundation dataset . . . . . . . . . . . . . . . . . 15Table 4.2 Duration and effort in hours for the Eclipse dataset . . . . . . . 18Table 5.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . 21Table 5.2 Best estimation algoritm for each project . . . . . . . . . . . . 29viiList of FiguresFigure 2.1 Basic model for Fagan Inspection . . . . . . . . . . . . . . . 5Figure 2.2 A sample of the main page of the Gerrit web interface . . . . 7Figure 3.1 Output example of the Gumtree tool. Yellow represents anupdate, green an addition and blue a move. . . . . . . . . . . 12Figure 4.1 Actual duration (Da) of code reviews for JGit, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 17Figure 5.1 Actual duration (Da) of code reviews for JGit, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 22Figure 5.2 Simulation results for JGit using the RlocMin and ReditMin al-gorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23Figure 5.3 Violin plots of the difference between the two estimates andthe real durations for JGit . . . . . . . . . . . . . . . . . . . 24Figure 5.4 Actual duration (Da) of code reviews for EGit, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 26Figure 5.5 Simulation results for EGit using the RlocMin and ReditMin al-gorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 5.6 Violin plots of the difference between the two estimates andthe real durations for EGit . . . . . . . 
. . . . . . . . . . . . 28Figure A.1 Actual durations (Da) of code reviews for EGit, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 40viiiFigure A.2 Scatter plots with regression lines of the estimated durationsfor EGit computed using the min algorithms . . . . . . . . . 41Figure A.3 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for EGit . . . . . . . 42Figure A.4 Scatter plots with regression lines of the estimated durationsfor EGit computed using the max algorithms . . . . . . . . . 43Figure A.5 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for EGit . . . . . . 44Figure A.6 Violin plots of the difference between the estimates and thereal durations for EGit for the min algorithms . . . . . . . . 45Figure A.7 Violin plots of the difference between the estimates and thereal durations for EGit for the max algorithms . . . . . . . . 46Figure A.8 Actual durations (Da) of code reviews for Linuxtools, sortedby actual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . 47Figure A.9 Scatter plots with regression lines of the estimated durationsfor Linuxtools computed using the min algorithms . . . . 48Figure A.10 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for Linuxtools . . 49Figure A.11 Scatter plots with regression lines of the estimated durationsfor Linuxtools computed using the max algorithms . . . . 50Figure A.12 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for Linuxtools . . 51Figure A.13 Violin plots of the difference between the estimates and thereal durations for Linuxtools for the min algorithms . . . 52Figure A.14 Violin plots of the difference between the estimates and thereal durations for Linuxtools for the max algorithms . . . 53Figure A.15 Actual durations (Da) of code reviews for JGit, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 54Figure A.16 Scatter plots with regression lines of the estimated durationsfor JGit computed using the min algorithms . . . . . . . . . 55Figure A.17 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for JGit . . . . . . . 56ixFigure A.18 Scatter plots with regression lines of the estimated durationsfor JGit computed using the max algorithms . . . . . . . . . 57Figure A.19 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for JGit . . . . . . 58Figure A.20 Violin plots of the difference between the estimates and thereal durations for JGit for the min algorithms . . . . . . . . 59Figure A.21 Violin plots of the difference between the estimates and thereal durations for JGit for the max algorithms . . . . . . . . 60Figure A.22 Actual durations (Da) of code reviews for Sirius, sorted byactual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . 61Figure A.23 Scatter plots with regression lines of the estimated durationsfor Sirius computed using the min algorithms . . . . . . . 62Figure A.24 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for Sirius . . . . . 63Figure A.25 Scatter plots with regression lines of the estimated durationsfor Sirius computed using the max algorithms . . . . . . . 
64Figure A.26 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for Sirius . . . . . 65Figure A.27 Violin plots of the difference between the estimates and thereal durations for Sirius for the min algorithms . . . . . . . 66Figure A.28 Violin plots of the difference between the estimates and thereal durations for Sirius for the max algorithms . . . . . . 67Figure A.29 Actual durations (Da) of code reviews for Osee, sorted by ac-tual effort (Ea) . . . . . . . . . . . . . . . . . . . . . . . . . 68Figure A.30 Scatter plots with regression lines of the estimated durationsfor Osee computed using the min algorithms . . . . . . . . . 69Figure A.31 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for Osee . . . . . . . 70Figure A.32 Scatter plots with regression lines of the estimated durationsfor Osee computed using the max algorithms . . . . . . . . . 71Figure A.33 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for Osee . . . . . . 72xFigure A.34 Violin plots of the difference between the estimates and thereal durations for Osee for the min algorithms . . . . . . . . 73Figure A.35 Violin plots of the difference between the estimates and thereal durations for Osee for the max algorithms . . . . . . . . 74Figure A.36 Actual durations (Da) of code reviews for Tracecompass,sorted by actual effort (Ea) . . . . . . . . . . . . . . . . . . . 75Figure A.37 Scatter plots with regression lines of the estimated durationsfor Tracecompass computed using the min algorithms . . . 76Figure A.38 Scatter plots with regression lines of the difference betweenthe min estimates and the real durations for Tracecompass 77Figure A.39 Scatter plots with regression lines of the estimated durationsfor Tracecompass computed using the max algorithms . . 78Figure A.40 Scatter plots with regression lines of the difference betweenthe max estimates and the real durations for Tracecompass 79Figure A.41 Violin plots of the difference between the estimates and thereal durations for Tracecompass for the min algorithms . . 80Figure A.42 Violin plots of the difference between the estimates and thereal durations for Tracecompass for the max algorithms . . 81xiAcknowledgmentsI would like to start thanking my parents and my sister for believing in me andsupporting me from Europe over the past two years. They have been encouragingme to follow my dreams since I took the decision of studying Computer Sciencefive years ago. Pursuing my studies on the other side of the planet has been a greatsource of stress for all of us, but I am glad to have their support.Special thanks to my supervisor, Gail C. Murphy, for being the best mentor Icould have asked for. Gail has been there to guide me whenever I felt lost and shehas done an amazing job in teaching me what research is. She has also shown anincredible patience in dealing with my stubbornness and my habit of signing up fortoo many extra-curricular activities.I would also thank my second reader, Reid Holmes, for taking the time to readthis thesis and providing valuable feedback. Similarly, I want to acknowledge thewhole Software Practices Lab for the support and for providing a great environmentin which to work.Finally, I want to thank all my friends, both in Canada and Switzerland, forkeeping me sane during the past two years. 
Special mentions go to Jessica Wong and Ambra Garcea for enduring me during ups and downs without too many complaints.

To my family, for providing all the support I have ever needed.

Chapter 1
Introduction

For over forty years, software developers have used humans looking at the code as a means of reducing defects in the code. In the 1970s and 1980s, developers used a highly structured process, known as code inspection, in which groups of people met in person to review code line-by-line [6]. In recent years, advancements in tools and techniques have allowed this process to evolve to a more lightweight approach, defined by Bacchelli and Bird as modern code review [1].

The lightweight approach has been largely made possible by the invention of collaborative tools, such as Gerrit (https://gerrit.googlecode.com, verified 20/7/16), that enable human reviewers to work asynchronously and remotely from each other. These tools enable code reviews to be performed for most, if not all, commits made to a source code repository, both for open and closed source projects. Compared to the older style code inspections, most code reviews have a smaller size and involve fewer people [13].

In projects using a modern code review approach, every commit submitted is required to go through a review process that will decide if the change satisfies the project standards. Developers on a project are typically expected to contribute to open code reviews. On large projects, the number of open code reviews can become quite large, leading to a slowdown in the release of changes. As an example, an analysis of six projects from the Eclipse ecosystem shows that there can be hundreds of code reviews open at a time; the JGit project currently has more than 150 open code reviews (verified 2/7/2016).

Despite developers spending time daily working on code reviews [12], code reviews go stagnant and remain stagnant for long periods of time. My analysis of the code review process for the Eclipse ecosystem showed that some code reviews can end up being left open for entire months without any update: in the JGit project, some code reviews are open for two years before being merged into the main branch.

In this thesis, I hypothesize that the stagnation of reviews is due, in part, to a lack of support by the tools, such as Gerrit, to help developers choose which code reviews to work on. Gerrit, and other similar tools, provide many features to aid code reviews. For example, the tools provide easy access to a list of files changed as part of a commit and user interfaces to visualize the differences between the files in the commit and the current state of those files. However, these tools offer nearly no support for choosing the code review to analyze; instead, they simply display the reviews ordered by update timestamp.

I investigate whether an algorithm that suggests to developers the order in which to work on code reviews can help avoid stagnation of reviews and improve the overall review process. I propose four algorithms for ordering open code reviews. I report on the effectiveness of each algorithm by using a simulation to show the effect that the algorithm would have on the order in which code reviews are resolved and the average duration of resolution if the reviews were worked on in the suggested order.
I found that an algorithm based on changes between the commit and existing code, when ordered from least changes (lines of code) to most changes, results in a lower average duration of open reviews and less stagnation.

This thesis makes the following contributions.

• It provides evidence of the problem of code review stagnation, identified by Rigby and Bird [13].
• It introduces four algorithms to order open code reviews to avoid stagnation and to reduce overall code review resolution time.
• It shows, through a simulation study, the effectiveness of each algorithm, finding that an algorithm based on a syntactic analysis of the change reported in lines of code is the most likely approach to solve the code review stagnation problem.

I begin by reviewing background and related work (Chapter 2). I then present my approach of recommending a code review ordering (Chapter 3). Chapter 4 presents the evaluation I ran to validate my approach, followed by the results in Chapter 5. In Chapter 6, I address the threats to my results, and I discuss possible future developments in Chapter 7. I finally summarize the work and draw some conclusions in Chapter 8.

Chapter 2
Background and Related Work

In this chapter, I describe the origin of code reviews, starting from the code inspection process up to the concept of lightweight code review. I also describe code review tools, with particular attention to Gerrit, the tool around which this research is built. I also cover research closely related to this thesis: the time taken to complete code reviews and tools that use recommendation to improve aspects of the code review process. Finally, I provide some information on code review systems.

2.1 Code Inspection

The idea of code review can be attributed to Michael Fagan. Fagan defined an inspection process for code with two goals [5]:

• find and fix all product defects, and
• find and fix all development process defects that lead to product defects.

Figure 2.1 outlines the Fagan Inspection process. The Fagan Inspection is composed of six steps:

1. Planning: Materials meet the entry criteria; arrange the availability of the participants; arrange the place and time of the meeting.
2. Overview: Educate the group on the materials; assign the roles to the participants.
3. Preparation: Prepare for the review by learning the materials.
4. Meeting: Find the defects.
5. Rework: Rework the defects.
6. Follow-up: Assure that all defects have been fixed; assure that no other defect has been introduced.

Figure 2.1: Basic model for Fagan Inspection

In a follow-up paper, Fagan describes the advancements that have been made on the original concept of code inspection, identifying three main aspects, Defect Detection, Defect Prevention and Process Management, and actions that can be taken to improve them [7].

Since the introduction of code inspection, researchers have studied how and why this process works and how to improve it. For further information, Kollanus and Koskinen provide a survey covering the research on code inspection between 1980 and 2008 [11].

2.1.1 Lightweight Code Review

In more recent years, the code inspection process has become more lightweight compared to the code inspection created by Fagan. This lightweight process has been named Modern Code Review by Bacchelli and Bird [1], to differentiate the old process of inspection described by Fagan from the process that has recently been evolving.
This process is being used by many companies, and it is characterized by being informal and tool-based.

Through its basis in tools, modern code review has the advantage of being asynchronous, whereas previous software inspection processes were synchronous. This allows teams spread around the globe to continue their work without the problem of scheduling meetings.

2.2 Code Review Tools

Many code review tools have been developed to help developers with the code review process. Pre-commit and post-commit are the two different approaches that can be taken for code reviews: the major classification of the tools is based on whether they use a pre-commit workflow, like Gerrit (https://www.gerritcodereview.com/), or a post-commit workflow, like Upsource (https://www.jetbrains.com/upsource/). In a pre-commit workflow, changes are reviewed before they are applied to the code, while in post-commit they are first applied and then reviewed.

Pre-commit allows for checking the quality of the changes before they are applied: developers can verify that the code of the change satisfies the project standards and that the change does not introduce bugs into the code. The downside of this approach is that it lengthens the release period, since the change must first be approved. This means that developers cannot build on the changes while they wait for approval, possibly slowing down development.

Post-commit, on the other hand, immediately applies the changes, and then the code is reviewed. This approach allows developers to continue working on new features while waiting for the review to be completed, allowing for a faster release cycle. The downside of this approach is that bugs are more likely to be introduced into the code and that there is no guarantee that the review will ever take place.

In this thesis, I focus on the pre-commit approach, particularly around the Gerrit tool.

2.2.1 Gerrit

Gerrit is a popular tool for supporting modern code review and the tool on which I focus in this thesis. To use Gerrit, developers can hook Gerrit up to the Git distributed version control system hosting the source code for their system. After connecting to the web interface of Gerrit, developers are presented with a list of open code reviews for the project. The developer can then submit a new code review or select an existing code review on which to work. The developers' interactions with Gerrit are asynchronous from each other, providing several benefits:

• there is no need for meetings in which multiple developers convene to review code, as was true for older software inspection approaches,
• developers are in charge of their own context switching and can work on code reviews when it fits into their work patterns, and
• developers in multiple timezones or with different work patterns can collaborate on reviews.

Figure 2.2: A sample of the main page of the Gerrit web interface

Figure 2.2 shows a code review in Gerrit. When a review is created, Gerrit shows the same information available from Git about the commit to be reviewed. In addition, Gerrit enables a record of an activity stream of work on the review in the form of comments about the review and a list of patches applied to the original commit. In many projects, Gerrit can be integrated with a continuous integration service such that, for each new patch, integration is run and comments are added by the continuous integration tool to the review. Gerrit uses a pre-commit workflow for reviews. A Gerrit code review works in three stages:

• Verified: A change enters this stage as soon as it is verified that it can be merged without breaking the build. This is often done by a continuous integration system, but can also be done manually.
• Approve: Any developer can comment on a code review; however, only a specific set of developers, specified by the owner of the Gerrit server, is able to approve a change.
• Merged: A change that has been successfully merged into the repository will move to this final stage. It is still possible to comment on the review.
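To make concrete how a tool (or a recommender sitting on top of one) obtains the list of open reviews that Gerrit displays, the sketch below queries Gerrit's REST changes endpoint. This is an illustration only: the endpoint and the ChangeInfo fields used (_number, subject, insertions, deletions) follow Gerrit's documented REST API, but the example server URL and project name are assumptions, and the exact fields returned depend on the server version.

```python
import json
import urllib.request
from urllib.parse import quote

# Sketch: list open changes for a project via Gerrit's REST /changes/ endpoint.
GERRIT = "https://git.eclipse.org/r"   # assumed example server
PROJECT = "jgit/jgit"                  # assumed example project name

def open_changes(project):
    url = f"{GERRIT}/changes/?q=status:open+project:{quote(project, safe='')}"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    # Gerrit prefixes its JSON responses with the line ")]}'" to prevent XSSI.
    return json.loads(body.split("\n", 1)[1])

for change in open_changes(PROJECT)[:5]:
    print(change["_number"], change["subject"],
          change.get("insertions"), change.get("deletions"))
```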
2.3 Code Review Completion Time

My initial investigation of the Eclipse dataset indicated that code review stagnation is a phenomenon present in several projects. Similar findings were made by Rigby et al. [15], who report that if a code review is not closed immediately, it will not be reviewed and will become stagnant. Even in other cases, some code reviews can take a long period of time before they are resolved. This observation was also made by Rigby and Bird, who observed that 50% of the reviews they studied have a duration of almost a month [13].

Jiang et al. noticed that the integration time of a code review was particularly dependent on the experience of the developer who created it [10]. Similarly, Bosu and Carver observed that code reviews submitted by newcomers to a project tend to receive feedback more slowly, another instance where code reviews may be in the queue for longer than desired [4].

2.4 Code Review Recommendation

Rigby and Storey investigated OSS projects, interviewing developers to find out how they were selecting code reviews to be reviewed, finding that developers often tend to review code they are familiar with or that they have edited in the past [14]. Subsequent work has considered the use of recommenders to speed code review resolution by automatically selecting reviewers. Balachandran proposes Review Bot, a tool that recommends reviewers based on the line change history of the code requiring a review, reporting an accuracy in finding the correct reviewer of 60-92% [2]. Thongtanunam et al. propose a similar approach, in which the expertise is computed from the similarity between the paths of the files changed in the code review and the paths of files changed in code reviews reviewed by the reviewer in the past [16, 17]. They empirically show on historical data from a selection of open source projects that the recommender can accurately recommend the correct reviewer for 79% of the reviews within the top ten recommendations generated. Zanjani et al. presented a new approach, chRev, based on the expertise of the reviewers [19]. In this approach, the expertise is calculated from a combination of the previous comments made by the reviewer on reviews of the code being reviewed, the number of workdays spent on those comments, and the period of time since the last comment. Baysal and colleagues take a different approach to recommendation by showing a developer only reviews related to issues on which they have worked [3].

My recommendation approach aims to improve the process of code reviews from a different angle. Instead of assigning the reviewer to the code review, I aim to assign the code review to the reviewer.
My approach also differs in being agnostic to the developer asking for the recommendation: as a result, my recommender is not sensitive to the ebb and flow of developers joining and leaving an open source project or a company involved in a closed source project.

Chapter 3
Recommending Code Review Ordering

I propose the use of a recommender to help reduce stagnation of code reviews and reduce the duration of time that code reviews remain open. The recommender uses information from open code reviews to rank the code reviews and suggest which review should be worked on next. The goal of the recommender is to optimize the overall handling of reviews. A recommender for code review ordering could be easily integrated into code review tools, such as Gerrit, that present a list of open reviews to developers, without requiring any change in the front end. In addition to an overall reduction in the time reviews are open, software developers could also benefit from reducing or eliminating the time necessary to choose a review on which to work. The algorithms I investigate in this thesis are based on metrics that are accessible from the code review as soon as the review is created. I consider two metrics to embed in algorithms: 1) lines of code modified by the code change causing the review to occur and 2) edit actions representing syntax changes to files involved in the change.

3.1 Lines of Code

A simple, but still often used, metric when considering code is a count of lines of code. A benefit of using the number of lines of code in a code change to represent the complexity of the change is the simplicity of computing the metric. The idea behind using lines of code for ordering code reviews is that the more lines of code that have been changed, the harder it is likely to be for a reviewer to understand the scope of the change.

For my investigation of recommenders, I use the lines of code as reported by Gerrit for a change. Gerrit, like Git, uses information from running diff on the files associated with the change. Gerrit records the number of lines inserted and deleted in each file mentioned in the change. When using lines of code in a code review ordering recommender, I sum the number of lines added and the number of lines removed for each file in the code review.

3.2 Edit Actions

The weakness of lines of code as a proxy metric for complexity is that a single-line change can be more complex to understand than a simple change spanning many lines. Another way to assess the complexity of a change in code is through edit actions, as explored by Falleri et al. [8]. Edit actions are based on a syntactic understanding of the change, and are computed from the AST (Abstract Syntax Tree) of the code. Edit actions can be of four possible types:

• an addition, indicating that nodes have been added to the syntax tree,
• a deletion, indicating that nodes have been removed from the syntax tree,
• an update, indicating that a syntax node (and its children) has been modified but remains of the same type, and
• a move, indicating that a syntax node (and its children) has been moved to another parent in the syntax tree, without having been modified.

The edit script of a change is defined as the sequence of edit actions necessary to move from one version of the file to the next one. Computing the minimum edit script for a file in a given change, defined as the edit script that requires the least number of edit actions, is an NP-hard problem. To approximate the edit script for each file in a change described in a review, I use the Gumtree tool (https://github.com/GumTreeDiff/gumtree/wiki).
Gumtree uses a heuristic algorithm to calculate the minimum edit script from the AST of the code. Figure 3.1 shows an example of the minimum edit script generated by Gumtree for a simple piece of Java code that has been changed. In this example, yellow represents an update, green an addition and blue a move.

Figure 3.1: Output example of the Gumtree tool. Yellow represents an update, green an addition and blue a move.

Edit actions allow us to separate movements and changes in files from additions and deletions, as compared to using lines of code. However, the use of Gumtree is limited. First of all, Gumtree requires a parser to generate the ASTs. Currently, Gumtree supports only four languages: Java, C, JavaScript and Ruby. Gumtree uses a heuristic algorithm, computing only an approximation of the solution. There are some extreme cases in which Gumtree returns erroneous results or is not able to compute the results at all. These cases seem to be caused by specific conditions that appear relatively rarely.

3.3 Recommendation Algorithms

I decided to initially opt for a greedy approach: I use the above-mentioned metrics to define four recommendation algorithms:

1. lines of code, ordering the code reviews from least lines of code changed to most lines of code changed (RlocMin),
2. lines of code, ordering the code reviews from most lines of code changed to least changed (RlocMax),
3. edit actions, ordering the code reviews from least edit actions in the review to most edit actions (ReditMin), and
4. edit actions, ordering the code reviews from most edit actions in the review to least edit actions (ReditMax).

I investigate the ordering of code reviews from both least to most of each metric and vice versa, as there is no basis to determine whether tackling the likely hardest review first is better than tackling it last.
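To make the two metrics and the four orderings concrete, the following Python sketch shows one possible implementation over a list of open reviews. It is only an illustration: the review dictionaries, their field names, and the way edit scripts are obtained are assumptions, not the actual interfaces of Gerrit or Gumtree.

```python
# Illustrative sketch (not the thesis implementation): computing the two
# metrics and the four orderings over a list of open reviews.

def loc_metric(review):
    # Sum lines inserted and deleted over every file in the change,
    # as reported by the review tool (field names are assumed).
    return sum(f["inserted"] + f["deleted"] for f in review["files"])

def edit_action_metric(review):
    # Count edit actions (additions, deletions, updates, moves) in the
    # edit scripts previously computed for the change, e.g. with Gumtree.
    return sum(len(script) for script in review["edit_scripts"])

ORDERINGS = {
    "locMin":  lambda reviews: sorted(reviews, key=loc_metric),
    "locMax":  lambda reviews: sorted(reviews, key=loc_metric, reverse=True),
    "editMin": lambda reviews: sorted(reviews, key=edit_action_metric),
    "editMax": lambda reviews: sorted(reviews, key=edit_action_metric, reverse=True),
}

def recommend(open_reviews, algorithm="locMin"):
    """Return the open reviews in the order a developer should tackle them."""
    return ORDERINGS[algorithm](open_reviews)
```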
Chapter 4
Simulation

The best way to evaluate the different algorithms for recommending the order in which to proceed with code reviews would be to build the algorithms into a standard code review tool, put the algorithms into use for an extended period of time and compare against the history of duration times for code reviews of the project. Such an evaluation is very costly in terms of people's time and effort.

Instead of starting with this costly approach, I chose to compare the effectiveness of the algorithms using a simulation approach. Simulation approaches are often used to evaluate recommendation systems for software development [18]. With a simulation approach, I apply an algorithm to order code reviews for historical project data and compute estimates of the completion time of the reviews if such an ordering were used. A simulation allows us to initially reproduce the original process and observe the performance of the various algorithms.

I start my description of the simulation by explaining the project data used. I then explain how I compute the actual duration (Da) and the actual effort (Ea) for each code review in a project's history. Using these values, I can define the simulation: the goal of the simulation is to estimate a duration (De) for each code review if the algorithm's ordering were respected. I defer a discussion of the threats to validity until after I present the results (Chapter 6).

4.1 Eclipse Foundation Data

The data I use for the simulation study comes from the Gerrit repository of the Eclipse Foundation (https://eclipse.org/org/foundation/, verified 3/7/16). I sorted Eclipse projects by the number of successfully merged code reviews and chose the top six for study. I took this approach to ensure enough data to see trends in the ordering algorithms simulated. Table 4.1 shows the projects used in the simulation study. The data reported for code reviews is from 2010-08-19 to 2016-03-06.

Table 4.1: The Eclipse Foundation dataset
Project                    # Code Reviews   # Lines Of Code
egit                       4,636            16,563
org.eclipse.linuxtools     4,441            239,176
jgit                       4,255            168,864
org.eclipse.sirius         2,504            382,459
org.eclipse.osee           2,451            696,706
org.eclipse.tracecompass   1,894            196,344

Before beginning the simulation, I had to clean the dataset to remove code reviews with confusing metadata. These reviews were easily identifiable because their creation date happened after the merge date. All of the faulty entries belong to the first 5,000 reviews and have the same creation date (2012-2-10), suggesting the data problem might be related to the introduction of Gerrit for Eclipse Foundation projects.

All of the projects in the simulation are written in Java and represent a variety of project sizes, from tens of thousands of lines of code to half a million lines of code (Table 4.1).

4.2 Actual Duration

Actual duration (Da) is the time from the creation of a code review to its completion as obtained from the historical data of Gerrit. The actual duration of a review is computed as the difference between the time the code review was opened and the time the code review was merged, computed using the timestamps saved in the code review. I report Da in hours.

Duration is an important factor in code reviews: a code review that has been left open for too long is at risk of becoming stagnant. In some cases, an open code review might delay the project, particularly if it affects critical parts of the code.

4.3 Actual Effort

To run the simulation, I need to determine how much effort is needed to resolve a given code review and how much effort is available from project personnel on a particular day to work on code reviews. I define the effort of a given review (Ea) as the sum of the number of messages, the number of patches and the number of developers involved in that code review.

Effort is an important factor in my evaluation because it allows us to quantify the amount of work that was put into the code review. A code review that involved more reviewers and went through multiple iterations required more work than a code review with a single version and only one reviewer. In using this definition, I am dependent on the externalized activity that happened on a code review before it was merged. I discuss the impact of this choice on the validity of the results in Chapter 6.

To give a sense of the data on which I run simulations, Figure 4.1 shows the actual duration (Da) (y-axis) in relation to the actual effort (Ea) (x-axis) of each code review in the JGit project. An analysis of the plot indicates that Da is less than a day for half of the code reviews; these code reviews are all plotted on the bottom left of the graph as they typically also have a small Ea value. The overall average Da is higher because of the code reviews plotted on the right side of the graph. In Figure 4.1, I have also plotted regression curves for degrees 1, 2 and 3. I will use these regression lines later in the thesis to compare to the results of the recommendation algorithms.

Figure 4.1: Actual duration (Da) of code reviews for JGit, sorted by actual effort (Ea)
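As a small illustration of the two quantities just defined, the sketch below computes Da and Ea for a single review. The dictionary fields (creation and merge timestamps, lists of messages, patch sets and developers) are assumed stand-ins for whatever the mined Gerrit data actually records.

```python
from datetime import datetime

# Illustrative only: field names are assumptions about the mined review data.
def actual_duration_hours(review):
    """Da: hours between the creation and the merge of the review."""
    created = datetime.fromisoformat(review["created"])
    merged = datetime.fromisoformat(review["merged"])
    return (merged - created).total_seconds() / 3600.0

def actual_effort(review):
    """Ea: number of messages + number of patches + number of developers."""
    return (len(review["messages"])
            + len(review["patch_sets"])
            + len(review["developers"]))

example = {
    "created": "2015-03-01T10:00:00",
    "merged": "2015-03-03T10:00:00",
    "messages": ["looks good", "please fix the test"],
    "patch_sets": ["ps1", "ps2"],
    "developers": ["author", "reviewer"],
}
print(actual_duration_hours(example))  # 48.0 hours
print(actual_effort(example))          # Ea = 2 + 2 + 2 = 6
```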
4.4 Effort per Hour

For the simulation, I also need to determine, for each project, the amount of effort developers on the project spend on average over a unit of time. I use effort per hour (Eh) as the average amount of effort spent per hour. To calculate this value, I divide the sum of effort over every code review in the project by the number of hours between the creation of the first code review and the last registered activity on any code review.

One might note the large differences in values between Da and both Ea and Eh for most projects. These differences are the result of how the values are calculated. While Da is based on the values of actual duration from the code reviews, Ea and Eh are derived from the time difference, in hours, between the creation of the first code review and the last registered activity. For this reason, Da is affected by the co-existence of multiple open code reviews at the same time, while the interval of time calculated is not.

Table 4.2 shows the value of these three metrics for each project in the simulation. The first two columns show the values for the average actual duration (Da) and the average actual effort (Ea). The average actual duration varies greatly between projects and does not show any correlation with the project size (as reported in Table 4.1). Average effort (Ea) is in a much smaller range, between 6 and 9, hinting that the effort per code review in terms of externalized actions is roughly the same between projects.

Table 4.2: Duration and effort in hours for the Eclipse dataset
Project Name               Avg. Da   Avg. Ea   Eh
egit                       268.27    6.46      0.34
org.eclipse.linuxtools     138.58    7.41      0.81
jgit                       358.32    6.99      0.41
org.eclipse.sirius         190.07    8.13      0.79
org.eclipse.osee           76.37     8.06      0.23
org.eclipse.tracecompass   303.08    9.68      1.33

The third column of Table 4.2 shows the computed Eh for each project in the dataset studied. All of these values are quite low compared to their respective Ea, meaning that it will be unlikely for a code review to be closed immediately even if it is the only open review.

4.5 Simulation and Estimated Duration

The simulation proceeds for a code review ordering algorithm by iterating over the data, in order of the project history, and estimating the amount of time that each code review would have taken if it was worked on in the order suggested.

Specifically, the simulation keeps a list of open code reviews, ordered by increasing creation date, and has a clock, initialized at the creation date of the first code review. The clock ticks hour by hour and runs until the date of the last activity registered on Gerrit. At every iteration, the following actions are executed in order (a sketch of this loop is shown below):

• The clock is moved forward by an hour.
• The list of open code reviews is updated by adding the code reviews opened in the last hour.
• The list of open code reviews is sorted following the current code review ordering algorithm being analyzed.
• The amount of effort available for the hour is assigned to the code reviews, starting from the top one. Each time a code review is assigned an effort at least as large as its Ea, the review is considered closed. If there are no code reviews left open, the amount of effort remaining is carried forward.

I repeated the process for every project with every code review ordering algorithm.
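The following Python sketch restates the hour-by-hour loop described above. It is a simplified reading of the procedure, not the code published with the thesis; the review fields ("id", "created" as an hour index, "effort" holding Ea) and the effort-per-hour value are assumed to come from the data preparation of Sections 4.2 to 4.4.

```python
# Simplified sketch of the hour-by-hour simulation loop (not the thesis code).
# Assumed review fields: "id", "created" (hour index when opened), "effort" (Ea).

def simulate(reviews, effort_per_hour, order_key, total_hours):
    pending = sorted(reviews, key=lambda r: r["created"])
    open_reviews = []
    estimated_duration = {}       # review id -> De, in hours
    carried_effort = 0.0          # effort left over when nothing is open

    for hour in range(total_hours):
        # Add the reviews opened during the last hour.
        while pending and pending[0]["created"] <= hour:
            r = pending.pop(0)
            open_reviews.append({**r, "remaining": r["effort"]})

        # Re-rank the open reviews with the ordering algorithm under study.
        open_reviews.sort(key=order_key)

        # Spend this hour's effort from the top of the list downwards.
        budget = effort_per_hour + carried_effort
        carried_effort = 0.0
        while open_reviews and budget > 0:
            top = open_reviews[0]
            spent = min(budget, top["remaining"])
            top["remaining"] -= spent
            budget -= spent
            if top["remaining"] <= 0:
                estimated_duration[top["id"]] = hour - top["created"]
                open_reviews.pop(0)
        if not open_reviews:
            carried_effort = budget   # carry unused effort forward

    return estimated_duration
```

Passing an ascending lines-of-code metric as `order_key` corresponds to RlocMin; sorting with the same key in reverse corresponds to RlocMax.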
The simulation enables the computation of an estimated duration (De) for each review. The estimated duration represents the estimated amount of time that it would have taken to close a code review if the recommended order of code reviews were followed.

The code of the simulation is available at http://www.cs.ubc.ca/~vivianig/scribe.

Chapter 5
Results

I simulated each of the recommendation algorithms on the dataset. In Table 5.1, I show the results of the simulation. Each project in the table has two rows, one for the algorithm version using the LOC (lines of code) metric, the other for the version using the edit actions. The columns are used to show the results for each ordering of the recommendation. For instance, RlocMin is represented by LOC in the row and Min in the column. For each project, I report the average estimated duration (De) computed by a particular algorithm, the difference between the average actual duration and the average estimated duration (Da - De) and the standard deviation of the estimated duration.

Table 5.1: Simulation results
Project Name              Metric    Avg De (Hrs)      Avg Da-De (Hrs)     Std. Dev. De
                                    Min      Max      Min       Max       Min      Max
egit                      LOC       47       5248     221       -4979     412      6141
                          Actions   107      9810     161       -9540     861      8593
org.eclipse.linuxtools    LOC       8        1060     131       -921      175      1884
                          Actions   197      10427    -57       -10287    1276     8459
jgit                      LOC       23       2091     335       -1731     282      3356
                          Actions   15       374      343       -16       146      1067
org.eclipse.sirius        LOC       11       425      179       -235      101      720
                          Actions   13       1549     177       -1357     183      2356
org.eclipse.osee          LOC       61       9929     16        7961      545      7966
                          Actions   156      14610    -78       -14533    1071     7967
org.eclipse.tracecompass  LOC       42       5422     261       -5117     341      3413
                          Actions   50       5734     253       -5429     351      3439

The results show that the RlocMin algorithm, based on ordering the reviews from the least lines of code changed to the most changed and shown in the Min column, always produces an estimated duration (De) less than the actual duration, a desirable result. This result can be seen in the values of the difference between the actual and estimated duration (Avg Da - De), which are always positive (and large). With two exceptions, the ReditMin algorithm also performs better than what occurred in reality. The two exceptions are org.eclipse.linuxtools and org.eclipse.osee.

The algorithms based on ordering from most to least, whether line or edit action based, perform worse than reality for all but one project. For org.eclipse.osee, the algorithm based on lines of code (RlocMax) performs better than reality. The standard deviation values are fairly large compared to the respective De values.

These results suggest that either RlocMin or ReditMin may be useful to subject to human evaluation in the context of actual use. A decision on which algorithm to first subject to human evaluation requires a deeper investigation of the trends indicated in Table 5.1. I ran a detailed analysis on two projects, JGit and EGit, focusing only on the algorithms sorting in ascending order. The complete set of results is available in Appendix A.

5.1 JGit

Figure 5.1 presents the distribution of actual duration of code reviews compared to their effort for the JGit project. The duration is on the y-axis, in hours, while the effort is on the x-axis. The figure shows most code reviews have a low value for both effort and duration, as the data points are clustered near the bottom left. There is no recognizable correlation between duration and effort.
The three coloured lines represent the regression curves for degrees 1, 2 and 3.

Figure 5.1: Actual duration (Da) of code reviews for JGit, sorted by actual effort (Ea)

Figures 5.2(a) and 5.2(b) show the result of the simulations for RlocMin and ReditMin respectively by plotting estimated duration (De) (y-axis) against actual effort (Ea) (x-axis) sorted by effort. As before, regression lines are shown with degree 1 (red), degree 2 (green) and degree 3 (cyan). These figures show how, compared to the actual values shown in Figure 5.1, the estimated durations of the code reviews have much smaller values for both RlocMin and ReditMin. The figures also show how the number of outliers, representing stagnant reviews, has diminished considerably.

Most of the cases where code reviews had large values also disappeared, leading to a more compact graph. Looking at the regression curves, we see how the results obtained using ReditMin are less scattered, since even the regression of degree one is able to fit them with roughly the same error as the higher degrees. In Figure 5.2(a), a single code review (the dot that appears in the top left corner) received a large estimated duration when estimating using RlocMin, even though it has a fairly low effort value. This outlier is the result of my algorithm being nothing more than a greedy algorithm: because I am considering the ordering of reviews by least estimated effort to largest estimated effort, it can happen that certain code reviews with a low estimated effort are delayed. I describe some possible solutions for this problem in Chapter 7.

Figure 5.2: Simulation results for JGit using the RlocMin and ReditMin algorithms. (a) Estimated duration (De) computed using RlocMin v. effort (Ea) for JGit; (b) estimated duration (De) computed using ReditMin v. effort (Ea) for JGit.

Figure 5.3 shows the difference between the actual duration and the estimate from the RlocMin and ReditMin algorithms, using a violin plot. In the same way as the previous plots, the y-axis represents the duration in hours and the x-axis represents the effort. These plots are useful to understand the degree of improvement in my estimates. In Figure 5.3(a), most of the differences are positive, indicating that the estimates produced by the algorithms are better than the actual durations in the majority of cases. There are few outliers, some of them positive and one of them negative. The negative outlier is the same code review described in the previous paragraph. The positive outliers indicate code reviews that either got resolved immediately, in the case of small effort values, or were introduced in a situation where all other code reviews had already been resolved, in the case of a large effort value. Figure 5.3(b) presents a more scattered situation. There are no differences with a large negative value, but there are more differences with negative values. The two plots are similar, suggesting that the two algorithms return a similar ordering in many cases.

Figure 5.3: Violin plots of the difference between the two estimates and the real durations for JGit. (a) Difference between actual duration (Da) and estimated duration (De) computed using RlocMin; (b) difference between actual duration (Da) and estimated duration (De) computed using ReditMin.

5.2 EGit

Figure 5.4 shows the distribution of actual duration of code reviews compared to their effort for the EGit project. The duration is on the y-axis, while the effort is on the x-axis.
Similarly to the JGit project in Figure 5.1, most code reviews have a low value for both effort and duration, as can be seen by the dots being predominantly located in the bottom left corner. Compared to Figure 5.1 for the JGit project, this plot appears to be less scattered. As before, the three coloured lines represent the regression curves for degrees 1, 2 and 3.

Figures 5.5(a) and 5.5(b) show the result of the simulations for RlocMin and ReditMin respectively by plotting estimated duration (De) (y-axis) against actual effort (Ea) (x-axis) sorted by effort. As before, regression lines are shown with degree 1 (red), degree 2 (green) and degree 3 (cyan). These results differ from the simulation on JGit: in JGit, the simulation using ReditMin performed better; for EGit, the RlocMin algorithm performs better. In EGit, the simulation using ReditMin, shown in Figure 5.5(b), is more scattered and presents many outliers. On the other hand, the simulation using RlocMin, shown in Figure 5.5(a), is more compact and presents few outliers. The regression curves echo this observation: in Figure 5.5(a), the curves of first and second degree are nearly the same, and the curve of third degree is close; in Figure 5.5(b), on the other hand, the curves differ a lot.

Figure 5.5: Simulation results for EGit using the RlocMin and ReditMin algorithms. (a) Estimated duration (De) computed using RlocMin v. effort (Ea) for EGit; (b) estimated duration (De) computed using ReditMin v. effort (Ea) for EGit.

Figure 5.6 shows the difference between the actual duration and the estimate from the RlocMin and ReditMin algorithms, using a violin plot. In the same way as the previous plots, the y-axis represents the duration in hours and the x-axis the effort. These plots are useful to understand the degree of improvement in my estimates. The two graphs are fairly similar and mostly positive, indicating that both algorithms perform well. Nonetheless, we can notice how Figure 5.6(b) is more compact around the x-axis, while Figure 5.6(a) is less scattered but tends upwards. This indicates that RlocMin performs better than ReditMin.

Figure 5.6: Violin plots of the difference between the two estimates and the real durations for EGit. (a) Difference between actual duration (Da) and estimated duration (De) computed using RlocMin; (b) difference between actual duration (Da) and estimated duration (De) computed using ReditMin.

5.3 Algorithm Choice

After analyzing the simulation in depth for two of my six projects, I conclude that RlocMin performs better. Table 5.2 shows the best algorithm for each project.

Table 5.2: Best estimation algorithm for each project
Project                    Best Result
egit                       locMin
org.eclipse.linuxtools     locMin
jgit                       editMin
org.eclipse.sirius         locMin
org.eclipse.osee           locMin
org.eclipse.tracecompass   locMin

Interestingly, in only one project out of six, the JGit project, does ReditMin perform better than RlocMin. In all other cases, RlocMin performs better. Based on these results, and considering the higher computational cost of computing edit actions compared to the cost of computing lines of code, I argue that RlocMin should be subject to further human evaluation in real use.

Chapter 6
Threats to Validity

Every study must make choices that affect the validity of the results. I first discuss choices that affect the results I report.
I then discuss choices that affect whether my results may hold when applied to order code reviews for other projects.

6.1 Internal Validity

In the simulation I used to evaluate the ordering algorithms, I assume that effort is spent on code reviews for the entire 24 hours of each day. This assumption does not reflect reality, in which people work a limited number of hours a day. I believe my simulation is still valid because my calculation of the effort available per hour is averaged over the day. Thus, even though the effort is spread over 24 hours a day by the simulation, the total amount of effort is the same. For this reason, I argue that the estimate used by the simulation is valid.

The effort computed per review (Ea) may not accurately represent the amount of work actually required to complete the review. My computation of Ea is able to measure only the amount of activity recorded in the review, such as in the online discussion; it is unable to measure the amount of work put into such actions as developing patches. My intent with my estimate of actual effort (Ea) is to capture the essence of work on code reviews.

Due to inconsistent metadata, as described in Chapter 4, I cleaned the dataset, removing a number of code reviews from my dataset and the simulation. Even with the cleaning, the projects all contain a large number of code reviews, thus I believe the effect on the overall results is minimal.

A final threat comes from imprecision in the calculation of the metrics used in the algorithms. As described in Chapter 5, both lines of code and edit actions suffer from possible false results. In these cases, a single outlier can have a ripple effect over the entire simulation, inflating (or deflating) the estimated durations. I argue that the outliers are far outnumbered by the general trend: when the effect of an outlier is applied, it affects the estimated duration for all the code reviews, but only by a small amount.

6.2 Construct Validity

My recommendation approach aims to avoid stagnation in modern code review and to reduce overall code review resolution time. In order to evaluate it, I built the simulation to reproduce the original process and observe the effect of the algorithms. This approach might not be suited to claiming an improvement in the modern code review process, but it allowed me to quickly understand which algorithms would perform the best before moving to a more structured user study.

I simulated the algorithms on projects from one ecosystem, Eclipse. The ecosystem may represent a limited number of software development processes that affect the data and results. By simulating over six projects from the ecosystem, I believe I have captured a variety of teams with variations in their work practices. Also, since the simulation is relative to a given project, and not absolute, I believe there is more likelihood that similar results would be seen on projects from other ecosystems.

Chapter 7
Discussion and Future Work

I have made many assumptions in my investigation of recommenders to optimize code review processing. I discuss metrics other than those I studied that might be used for a code review ordering recommender, more personalized approaches for recommending which code review to work on, and future work needed in evaluating recommenders to improve code review handling.

7.1 Additional Metrics

The algorithms for code review ordering I investigated were based on metrics available from the code reviews themselves. There are more metrics in this category that could be investigated. For example, instead of just using lines of code, a metric based on the number of files or packages changed might provide better results. Or, the path similarities of files modified in a change might be used: the more similar the paths, the less complex the code change may be. Other characteristics of the code reviews that are open and available to be ordered could also be considered, such as the length of time a review is open or the amount of activity that has occurred to date on the review.

7.2 Better Algorithms

The greedy approach returns promising results, but can suffer from particular cases. I described an example in Chapter 5, in which a code review was delayed for a long time, even though it should have been dealt with sooner. I argue that an algorithm that takes into account both the greedy order and the time that a code review has been waiting may be beneficial for creating a recommender.
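One possible reading of that suggestion is a score that blends the greedy metric with the time a review has been waiting, so that a large review cannot be postponed indefinitely. The sketch below only illustrates the idea; the specific weighting, normalisation and field names are assumptions of mine, not something evaluated in this thesis.

```python
# Illustrative sketch: rank open reviews by a blend of the greedy metric
# (e.g., lines of code) and how long the review has been waiting.
# The normalisation constants and the 0.5 weight are arbitrary assumptions.

def blended_score(review, now_hours, weight=0.5):
    loc = review["lines_changed"]             # greedy metric (smaller is better)
    waiting = now_hours - review["created"]   # hours the review has been open
    loc_term = min(loc / 1000.0, 1.0)         # large changes approach 1
    age_term = min(waiting / (24 * 30), 1.0)  # a month of waiting approaches 1
    # Lower scores are recommended first; waiting time pushes the score down.
    return weight * loc_term - (1 - weight) * age_term

def order_reviews(open_reviews, now_hours):
    return sorted(open_reviews, key=lambda r: blended_score(r, now_hours))
```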
7.3 Personalized Recommendations

The recommendation algorithms I have considered in this thesis all provide one ordering for all developers on a project. More complex recommendation algorithms could be investigated that produce a personalized recommendation for each developer available to work on a code review. For instance, the recommender could use knowledge of what code a developer knows (e.g., [9]) to rank code reviews where the developer has knowledge of the code higher in the list. A more complex recommendation algorithm could also take into account the time a developer has available to perform a code review: if the developer indicates only a short time is available, simpler code reviews might be prioritized over others.

7.4 Human Evaluation

My current evaluation of algorithms is based on a computer simulation. Although this approach can be useful for comparing algorithms, it cannot take into account all of the factors that may affect the resolution of code reviews in practice. A human evaluation of algorithms that perform well in simulation is needed. A human evaluation of RlocMin or ReditMin could be performed as an A/B (or split) test. Since both of these algorithms order the list of open code reviews, the algorithms could be put into practice for different weeks of a year and the actual duration of reviews compared to the actual duration prior to the use of an ordered code review list. Care would need to be taken with the number of reviews subjected to each treatment, and with the complexity of the reviews as measured using some of the metrics I have introduced, to ensure similar samples.

Chapter 8
Summary

Software developers expend significant human effort on code reviews. Yet, despite this effort, code reviews remain open for long periods of time on projects and a number of code reviews go stagnant.

As a means for reducing stagnation and the duration of time required to resolve reviews, I introduce the idea of a code review ordering recommender. An advantage of this approach is that it could easily be put into use in a project, as the recommender can be integrated into existing code review tools, such as Gerrit.

I introduced four algorithms that might be used to recommend an order for open code reviews. I performed simulation studies on these algorithms on a dataset of six projects from the Eclipse Foundation.
7.3 Personalized Recommendations

The recommendation algorithms I have considered in this thesis all provide one ordering for all developers on a project. More complex recommendation algorithms could be investigated that produce a personalized recommendation for each developer available to work on a code review. For instance, the recommender could use knowledge of what code a developer knows (e.g., [9]) to rank code reviews for which the developer knows the code higher in the list. A more complex recommendation algorithm could also take into account the time a developer has available to perform a code review: if the developer indicates only a short time is available, simpler code reviews might be prioritized over others. The sketch below illustrates how these two signals might be combined.
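A minimal Python sketch of such a personalized ordering, assuming a per-file familiarity score in [0, 1] (for example, a degree-of-knowledge value as in [9]) and a rough per-review time estimate; the field names and the scoring rule are assumptions made for illustration only.

    def personalized_order(open_reviews, knowledge, available_minutes):
        # Order open reviews for a single developer.
        #   open_reviews:      list of dicts with hypothetical keys 'id', 'files',
        #                      and 'estimated_minutes' (a rough review-time
        #                      estimate, for example derived from lines of code).
        #   knowledge:         dict mapping a file path to this developer's
        #                      familiarity with it, in [0, 1].
        #   available_minutes: how long the developer says they have right now.
        def score(review):
            files = review["files"]
            familiarity = sum(knowledge.get(f, 0.0) for f in files) / max(len(files), 1)
            fits_budget = review["estimated_minutes"] <= available_minutes
            # Reviews that fit the time budget come first; within each group,
            # higher familiarity wins, and smaller reviews break ties.
            return (not fits_budget, -familiarity, review["estimated_minutes"])
        return sorted(open_reviews, key=score)

The tuple key places reviews that fit the developer's stated time budget ahead of those that do not, then prefers code the developer already knows, and finally falls back to the smaller review.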
7.4 Human Evaluation

My current evaluation of the algorithms is based on a computer simulation. Although this approach can be useful for comparing algorithms, it cannot take into account all of the factors that may affect the resolution of code reviews in practice. A human evaluation of the algorithms that perform well in simulation is needed. A human evaluation of RlocMin or ReditMin could be performed as an A/B (or split) test. Since both of these algorithms order the list of open code reviews, the algorithms could be put into practice for different weeks of a year and the actual duration of reviews compared to the actual duration prior to the use of an ordered code review list. Care would need to be taken with the number of reviews subjected to each treatment, and with the complexity of the reviews as measured using some of the metrics I have introduced, to ensure the samples are similar.

Chapter 8

Summary

Software developers expend significant human effort on code reviews. Yet, despite this effort, code reviews remain open for long periods of time on projects and a number of code reviews go stagnant.

As a means of reducing stagnation and the time required to resolve reviews, I introduce the idea of a code review ordering recommender. An advantage of this approach is that it could easily be put into use on a project, as the recommender can be integrated into existing code review tools, such as Gerrit.

I introduced four algorithms that might be used to recommend an order for open code reviews. I performed simulation studies of these algorithms on a dataset of six projects from the Eclipse Foundation. I found that the algorithm that ranks the open code reviews from the least to the most lines of code involved in the change to be reviewed (RlocMin), and the algorithm that ranks the open code reviews from the fewest to the most edit actions based on the syntax of the change (ReditMin), performed the best of the four algorithms. A more detailed analysis indicates that the effectiveness of these two algorithms varies from project to project. Based on the results of the simulation, the algorithm based on lines of code (RlocMin) may be the most stable and the most suitable to subject, as a next step, to evaluation in use on a real project.

Bibliography

[1] A. Bacchelli and C. Bird. Expectations, Outcomes, and Challenges of Modern Code Review. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 712-721, Piscataway, NJ, USA, 2013. IEEE Press. URL http://dl.acm.org/citation.cfm?id=2486788.2486882.

[2] V. Balachandran. Reducing Human Effort and Improving Quality in Peer Code Reviews Using Automatic Static Analysis and Reviewer Recommendation. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 931-940, Piscataway, NJ, USA, 2013. IEEE Press. URL http://dl.acm.org/citation.cfm?id=2486788.2486915.

[3] O. Baysal, R. Holmes, and M. W. Godfrey. No Issue Left Behind: Reducing Information Overload in Issue Tracking. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 666-677, New York, NY, USA, 2014. ACM. doi:10.1145/2635868.2635887.

[4] A. Bosu and J. C. Carver. Impact of Developer Reputation on Code Review Outcomes in OSS Projects: An Empirical Investigation. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '14, pages 33:1-33:10, New York, NY, USA, 2014. ACM. doi:10.1145/2652524.2652544.

[5] M. Fagan. A History of Software Inspections, pages 562-573. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002. doi:10.1007/978-3-642-59412-0_34.

[6] M. E. Fagan. Design and Code Inspections to Reduce Errors in Program Development. IBM Systems Journal, 15(3):182-211, September 1976. doi:10.1147/sj.153.0182.

[7] M. E. Fagan. Advances in Software Inspections. IEEE Transactions on Software Engineering, SE-12(7):744-751, July 1986. doi:10.1109/TSE.1986.6312976.

[8] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and Accurate Source Code Differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE '14, pages 313-324, New York, NY, USA, 2014. ACM. doi:10.1145/2642937.2642982.

[9] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill. A Degree-of-knowledge Model to Capture Source Code Familiarity. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 385-394, New York, NY, USA, 2010. ACM. doi:10.1145/1806799.1806856.

[10] Y. Jiang, B. Adams, and D. M. German. Will My Patch Make It? And How Fast?: Case Study on the Linux Kernel. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, pages 101-110, Piscataway, NJ, USA, 2013. IEEE Press. URL http://dl.acm.org/citation.cfm?id=2487085.2487111.

[11] S. Kollanus and J. Koskinen. Survey of Software Inspection Research. The Open Software Engineering Journal, 3(1):15-34, 2009.

[12] A. N. Meyer, T. Fritz, G. C. Murphy, and T. Zimmermann. Software Developers' Perceptions of Productivity. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 19-29, New York, NY, USA, 2014. ACM. doi:10.1145/2635868.2635892.

[13] P. C. Rigby and C. Bird. Convergent Contemporary Software Peer Review Practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 202-212, New York, NY, USA, 2013. ACM. doi:10.1145/2491411.2491444.

[14] P. C. Rigby and M.-A. Storey. Understanding Broadcast Based Peer Review on Open Source Software Projects. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 541-550, New York, NY, USA, 2011. ACM. doi:10.1145/1985793.1985867.

[15] P. C. Rigby, D. M. German, and M.-A. Storey. Open Source Software Peer Review Practices: A Case Study of the Apache Server. In Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pages 541-550, New York, NY, USA, 2008. ACM. doi:10.1145/1368088.1368162.

[16] P. Thongtanunam, R. G. Kula, A. E. C. Cruz, N. Yoshida, and H. Iida. Improving Code Review Effectiveness Through Reviewer Recommendations. In Proceedings of the 7th International Workshop on Cooperative and Human Aspects of Software Engineering, CHASE 2014, pages 119-122, New York, NY, USA, 2014. ACM. doi:10.1145/2593702.2593705.

[17] P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K. Matsumoto. Who Should Review My Code? A File Location-based Code-reviewer Recommendation Approach for Modern Code Review. In Proceedings of the 22nd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 141-150, March 2015. doi:10.1109/SANER.2015.7081824.

[18] R. J. Walker and R. Holmes. Simulation, pages 301-327. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. doi:10.1007/978-3-642-45135-5_12.

[19] M. Zanjani, H. Kagdi, and C. Bird. Automatically Recommending Peer Reviewers in Modern Code Review. IEEE Transactions on Software Engineering, PP(99):1-1, June 2015. doi:10.1109/TSE.2015.2500238.

Appendix A

Simulation Results

In this appendix I provide all of the plots for each of the six projects used in the simulation described in Chapter 4.
For each project, I provide the following plots:

• a scatter plot of the distribution of actual duration of code reviews,
• a scatter plot for each of the estimated durations for the two Min algorithms,
• a scatter plot for the difference between each Min estimation and the actual duration,
• a violin plot for the difference between each Min estimation and the actual duration,
• a scatter plot for each of the estimated durations for the two Max algorithms,
• a scatter plot for the difference between each Max estimation and the actual duration,
• a violin plot for the difference between each Max estimation and the actual duration.

In each two-panel figure, panel (a) shows the lines-of-code variant (RlocMin or RlocMax) and panel (b) the edit-actions variant (ReditMin or ReditMax); estimated durations and differences are plotted against the actual effort (Ea). A sketch of how such a plot can be produced follows.
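For readers who wish to reproduce this style of plot from their own simulation output, a minimal Python sketch is given below; it assumes a pandas DataFrame with hypothetical columns Ea, Da, and De_locmin, and it is not the code that generated the figures in this appendix.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    def plot_min_difference(df, project):
        # Left: scatter of (Da - De) against Ea with a fitted regression line.
        # Right: violin plot of the same differences.
        diff = (df["Da"] - df["De_locmin"]).dropna()
        effort = df.loc[diff.index, "Ea"]
        fig, (ax_scatter, ax_violin) = plt.subplots(1, 2, figsize=(10, 4))
        ax_scatter.scatter(effort, diff, s=5)
        slope, intercept = np.polyfit(effort, diff, 1)
        xs = np.linspace(effort.min(), effort.max(), 100)
        ax_scatter.plot(xs, slope * xs + intercept)
        ax_scatter.set_xlabel("Actual effort (Ea)")
        ax_scatter.set_ylabel("Da - De (RlocMin)")
        ax_violin.violinplot(diff)
        ax_violin.set_ylabel("Da - De (RlocMin)")
        fig.suptitle(project)
        fig.tight_layout()
        plt.show()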
A.1 EGit

Figure A.1: Actual durations (Da) of code reviews for EGit, sorted by actual effort (Ea)
Figure A.2: Scatter plots with regression lines of the estimated durations for EGit computed using the min algorithms
Figure A.3: Scatter plots with regression lines of the difference between the min estimates and the real durations for EGit
Figure A.4: Scatter plots with regression lines of the estimated durations for EGit computed using the max algorithms
Figure A.5: Scatter plots with regression lines of the difference between the max estimates and the real durations for EGit
Figure A.6: Violin plots of the difference between the estimates and the real durations for EGit for the min algorithms
Figure A.7: Violin plots of the difference between the estimates and the real durations for EGit for the max algorithms

A.2 Linuxtools

Figure A.8: Actual durations (Da) of code reviews for Linuxtools, sorted by actual effort (Ea)
Figure A.9: Scatter plots with regression lines of the estimated durations for Linuxtools computed using the min algorithms
Figure A.10: Scatter plots with regression lines of the difference between the min estimates and the real durations for Linuxtools
Figure A.11: Scatter plots with regression lines of the estimated durations for Linuxtools computed using the max algorithms
Figure A.12: Scatter plots with regression lines of the difference between the max estimates and the real durations for Linuxtools
Figure A.13: Violin plots of the difference between the estimates and the real durations for Linuxtools for the min algorithms
Figure A.14: Violin plots of the difference between the estimates and the real durations for Linuxtools for the max algorithms

A.3 JGit

Figure A.15: Actual durations (Da) of code reviews for JGit, sorted by actual effort (Ea)
Figure A.16: Scatter plots with regression lines of the estimated durations for JGit computed using the min algorithms
Figure A.17: Scatter plots with regression lines of the difference between the min estimates and the real durations for JGit
Figure A.18: Scatter plots with regression lines of the estimated durations for JGit computed using the max algorithms
Figure A.19: Scatter plots with regression lines of the difference between the max estimates and the real durations for JGit
Figure A.20: Violin plots of the difference between the estimates and the real durations for JGit for the min algorithms
Figure A.21: Violin plots of the difference between the estimates and the real durations for JGit for the max algorithms
A.4 Sirius

Figure A.22: Actual durations (Da) of code reviews for Sirius, sorted by actual effort (Ea)
Figure A.23: Scatter plots with regression lines of the estimated durations for Sirius computed using the min algorithms
Figure A.24: Scatter plots with regression lines of the difference between the min estimates and the real durations for Sirius
Figure A.25: Scatter plots with regression lines of the estimated durations for Sirius computed using the max algorithms
Figure A.26: Scatter plots with regression lines of the difference between the max estimates and the real durations for Sirius
Figure A.27: Violin plots of the difference between the estimates and the real durations for Sirius for the min algorithms
Figure A.28: Violin plots of the difference between the estimates and the real durations for Sirius for the max algorithms

A.5 Osee

Figure A.29: Actual durations (Da) of code reviews for Osee, sorted by actual effort (Ea)
Figure A.30: Scatter plots with regression lines of the estimated durations for Osee computed using the min algorithms
Figure A.31: Scatter plots with regression lines of the difference between the min estimates and the real durations for Osee
Figure A.32: Scatter plots with regression lines of the estimated durations for Osee computed using the max algorithms
Figure A.33: Scatter plots with regression lines of the difference between the max estimates and the real durations for Osee
Figure A.34: Violin plots of the difference between the estimates and the real durations for Osee for the min algorithms
Figure A.35: Violin plots of the difference between the estimates and the real durations for Osee for the max algorithms

A.6 Tracecompass

Figure A.36: Actual durations (Da) of code reviews for Tracecompass, sorted by actual effort (Ea)
Figure A.37: Scatter plots with regression lines of the estimated durations for Tracecompass computed using the min algorithms
Figure A.38: Scatter plots with regression lines of the difference between the min estimates and the real durations for Tracecompass
Figure A.39: Scatter plots with regression lines of the estimated durations for Tracecompass computed using the max algorithms
Figure A.40: Scatter plots with regression lines of the difference between the max estimates and the real durations for Tracecompass
Figure A.41: Violin plots of the difference between the estimates and the real durations for Tracecompass for the min algorithms
Figure A.42: Violin plots of the difference between the estimates and the real durations for Tracecompass for the max algorithms